# Play store apps dataset

## Why is this dataset interesting?

The Google Play Store is Google's app marketplace. Most people use it when they want to install new apps on their Android phones.

Like any market, apps in the Play Store are subject to **supply** and **demand**. Certain kinds of apps get downloaded a lot while others don't. Certain kinds of apps get paid for while others don't. Some categories have a great deal of competition while others have very little.

A dataset like this can help you spot opportunities.

## Ideas for questions this data can help you answer

* What categories of applications get a lot of downloads per day?
* What categories of applications don't get many downloads per day?
* In what app categories are there market leaders (one app that clearly is getting downloaded more than the others)?
* How many downloads per day might you expect if you took the time to build an app?
* What can the data tell you about monetization approaches?








## Some data - 62694 apps

In [11]:
import pandas as pd
df = pd.read_csv('google-play-store-11-2018.csv')
df.describe()

,reviews,ratings,min_installs,score,ratings_per_day,price,rating_one_star,rating_two_star,rating_three_star,rating_four_star,rating_five_star
count,62683.0,62683.0,62694.0,62683.0,62694.0,62694.0,62694.0,62694.0,62694.0,62694.0,62694.0
mean,15298.43,49363.28,2035663.0,4.221624,38.620506,0.414998,3078.124,1211.618,3094.328,7227.599,34742.95
std,226150.5,769025.5,23868720.0,0.815517,430.42277,3.793236,60502.31,21934.03,52302.51,111706.6,536804.2
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16.0,41.0,1000.0,4.100497,0.0,0.0,2.0,1.0,2.0,4.0,28.0
50%,144.0,398.0,50000.0,4.403101,0.0,0.0,24.0,9.0,23.0,47.0,264.0
75%,1500.0,4488.0,500000.0,4.637007,6.0,0.0,307.0,109.0,284.0,609.0,2933.0
max,22053770.0,81284860.0,1000000000.0,5.0,40526.0,369.99,9658715.0,3368101.0,7164984.0,12223420.0,52952660.0


In [12]:
df.head(5)

,app_id,title,reviews,ratings,min_installs,score,offers_iap,ad_supported,released,ratings_per_day,genre,genre_id,price,rating_one_star,rating_two_star,rating_three_star,rating_four_star,rating_five_star
0,com.prettyteengames.royal.princess.wedding.mak...,Royal Princess Wedding Makeover and Dress Up,375.0,1023.0,100000,4.179863,True,True,2017-12-20,3,Casual,GAME_CASUAL,0.0,115,31,98,90,689
1,com.MayGreenStudio.dressup,Momo's Dressup,13492.0,25974.0,1000000,4.711096,False,True,2017-03-07,42,Casual,GAME_CASUAL,0.0,673,213,806,2561,21721
2,air.theflash.f2game.PrettyGirl23,Princess Pretty Girl,1974.0,4610.0,500000,4.295445,False,True,2015-01-18,3,Casual,GAME_CASUAL,0.0,382,206,287,528,3207
3,air.com.dressupone.animeschooluniforms,Anime School Uniforms,2586.0,6081.0,500000,4.209505,False,True,2013-08-20,3,Casual,GAME_CASUAL,0.0,628,193,524,668,4068
4,air.theflash.f2game.PrettyGirl7,Wedding Pretty girl,1409.0,3728.0,500000,4.195011,False,True,2014-09-01,2,Casual,GAME_CASUAL,0.0,358,185,300,414,2471


In [16]:
df.columns

Index(['app_id', 'title', 'reviews', 'ratings', 'min_installs', 'score',
       'offers_iap', 'ad_supported', 'released', 'ratings_per_day', 'genre',
       'genre_id', 'price', 'rating_one_star', 'rating_two_star',
       'rating_three_star', 'rating_four_star', 'rating_five_star'],
      dtype='object')

In [60]:
len(df.index)

62694
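
The `count` row of `df.describe()` reports 62683 for `reviews`, `ratings`, and `score`, while `len(df.index)` returns 62694, which suggests that 11 rows are missing values in those columns. A minimal sketch of how that could be checked, using a small made-up frame in place of the real CSV (the values below are invented; only the column names come from the dataset):

```python
import pandas as pd

# Made-up rows standing in for the real dataset; one row lacks reviews/ratings.
toy = pd.DataFrame({
    "reviews": [375.0, None, 1974.0],
    "ratings": [1023.0, None, 4610.0],
    "min_installs": [100000, 5000, 500000],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
print(missing[missing > 0])
```

Running the same two lines on `df` would list which columns account for the gap between the two counts.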

In [17]:
# Q1: What categories of applications get a lot of downloads per day?
# The dataset has no downloads-per-day column, so ratings_per_day is used as a proxy.

In [15]:
x = df.groupby('genre_id')['ratings_per_day'].sum().sort_values(ascending=False)
print(x.head(10))  # Display the top 10 categories

genre_id
GAME_ACTION        241042
TOOLS              216170
GAME_CASUAL        183407
COMMUNICATION      147756
GAME_ARCADE        142482
SOCIAL             119592
GAME_STRATEGY      117991
GAME_SIMULATION    114926
GAME_SPORTS        106174
GAME_RACING        101867
Name: ratings_per_day, dtype: int64
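
A category total can be large simply because the category contains many apps. The sketch below, which uses a small made-up frame rather than the real CSV (only the column names match the dataset), puts the total next to the app count and the per-app median so the two effects can be told apart:

```python
import pandas as pd

# Made-up rows: TOOLS has many modest apps, GAME_ACTION has few busier ones.
toy = pd.DataFrame({
    "genre_id": ["TOOLS"] * 5 + ["GAME_ACTION"] * 2,
    "ratings_per_day": [10, 10, 10, 10, 10, 20, 25],
})

# Total, app count, and median ratings per day for each category.
summary = (
    toy.groupby("genre_id")["ratings_per_day"]
    .agg(total="sum", apps="count", median="median")
    .sort_values("total", ascending=False)
)
print(summary)
```

In this toy data TOOLS has the larger total, but the typical GAME_ACTION app receives more ratings per day. Whether the real dataset behaves the same way would need to be checked on `df`.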


In [46]:
# Q2: What categories of applications don't get many downloads per day?

In [22]:
x = df.groupby('genre_id')['ratings_per_day'].sum().sort_values(ascending=True)
x.head(1)

genre_id
EVENTS    98
Name: ratings_per_day, dtype: int64

In [23]:
# Q3: In what app categories are there market leaders (one app that clearly is getting downloaded more than the others)?

In [59]:
# For each genre, select the row with the largest min_installs (the first row wins ties).
res = df.loc[df.groupby('genre_id')['min_installs'].idxmax()].reset_index(drop=True)
# Display the most-installed app in each genre
res[['title', 'min_installs', 'genre_id']]


,title,min_installs,genre_id
0,Sketch - Draw & Paint,100000000,ART_AND_DESIGN
1,"Android Auto - Google Maps, Media & Messaging",10000000,AUTO_AND_VEHICLES
2,Best Hairstyles step by step,5000000,BEAUTY
3,Google Play Books,1000000000,BOOKS_AND_REFERENCE
4,"OfficeSuite - Office, Word, Docs, Sheets Slide...",100000000,BUSINESS
5,네이버 웹툰 - Naver Webtoon,10000000,COMICS
6,Gmail,1000000000,COMMUNICATION
7,Zoosk Dating App: Meet Singles,10000000,DATING
8,Duolingo: Learn Languages Free,100000000,EDUCATION
9,Google Play Games,1000000000,ENTERTAINMENT
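
Listing the most-installed app per category does not by itself show whether that app is clearly ahead of the rest. The `min_installs` values also look like bucketed lower bounds, so several apps can tie. One possible way to quantify a lead is the top app's share of its category's total `ratings_per_day`. This is a sketch under that assumption, on made-up rows; `leader_share` is a name introduced here for illustration:

```python
import pandas as pd

# Made-up rows: COMICS has one dominant app, DATING is evenly contested.
toy = pd.DataFrame({
    "genre_id": ["COMICS", "COMICS", "COMICS", "DATING", "DATING", "DATING"],
    "title": ["A", "B", "C", "D", "E", "F"],
    "ratings_per_day": [80, 15, 5, 40, 35, 25],
})

# Sort so the busiest app in each category comes first.
ranked = toy.sort_values(["genre_id", "ratings_per_day"], ascending=[True, False])

# The first row per category is that category's top app.
top = ranked.groupby("genre_id").head(1).set_index("genre_id")

# Share of the category's total ratings per day captured by the top app.
category_total = toy.groupby("genre_id")["ratings_per_day"].sum()
top["leader_share"] = top["ratings_per_day"] / category_total

print(top[["title", "leader_share"]].sort_values("leader_share", ascending=False))
```

A share close to 1 points to a single dominant app, while a share close to 1 divided by the number of apps points to an evenly split category.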


In [67]:
# Q4: How many downloads per day might you expect if you took the time to build an app?

In [58]:
downloads_by_genre = df.groupby('genre')['ratings_per_day'].mean().sort_values(ascending=False).reset_index()
print(downloads_by_genre)
# groupby(...).mean() returns a Series, so sort_values needs no `by` argument

                      genre  ratings_per_day
0                  Strategy       258.185996
1                    Action       247.476386
2                    Racing       188.293900
3                    Arcade       158.313333
4                    Social       143.052632
5                    Casual       138.944697
6                      Word       136.277778
7              Role Playing       127.422414
8             Communication       120.518760
9                 Adventure       102.709552
10  Video Players & Editors        83.897849
11                   Sports        79.962211
12                   Puzzle        72.210134
13                   Trivia        64.151111
14               Simulation        63.355017
15              Photography        55.830139
16                   Casino        49.992718
17                  Weather        48.910569
18                    Music        46.621005
19                    Board        39.562914
20                    Tools        34.911176
21        
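
The means above are pulled upward by a few very active apps: in the `df.describe()` output, the overall mean of `ratings_per_day` is about 38.6 while the median is 0. For a question about what a new app might expect, the median and upper quartile per genre may be more informative. A minimal sketch on made-up rows (the numbers are invented; only the column names match the dataset):

```python
import pandas as pd

# Made-up rows: one outlier in Strategy dominates that genre's mean.
toy = pd.DataFrame({
    "genre": ["Strategy"] * 5 + ["Tools"] * 5,
    "ratings_per_day": [0, 0, 2, 8, 990, 0, 1, 1, 3, 20],
})

# Mean, median, and 75th percentile of ratings per day for each genre.
expected = (
    toy.groupby("genre")["ratings_per_day"]
    .agg(mean="mean", median="median", p75=lambda s: s.quantile(0.75))
    .sort_values("median", ascending=False)
)
print(expected)
```

Here the Strategy mean is 200 even though the typical Strategy app in the toy data gets 2 ratings per day.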

In [71]:
# Q5: What can the data tell you about monetization approaches?

In [56]:
mo_approach = df.groupby(['offers_iap', 'ad_supported', 'price'])['ratings_per_day'].mean().reset_index()
# Sorting by highest engagement
mo_approach = mo_approach.sort_values(by='ratings_per_day', ascending=False)
# Display results
print(mo_approach)

     offers_iap  ad_supported  price  ratings_per_day
322        True         False   0.00       143.469817
379        True          True   0.00       121.083207
389        True          True   7.99        92.000000
347        True         False   6.99        86.000000
385        True          True   3.49        33.000000
..          ...           ...    ...              ...
20        False         False   1.27         0.000000
73        False         False   2.30         0.000000
22        False         False   1.31         0.000000
23        False         False   1.34         0.000000
9         False         False   1.10         0.000000

[393 rows x 4 columns]


In [57]:
df[['offers_iap', 'ad_supported', 'price', 'ratings_per_day']]


,offers_iap,ad_supported,price,ratings_per_day
0,True,True,0.0,3
1,False,True,0.0,42
2,False,True,0.0,3
3,False,True,0.0,3
4,False,True,0.0,2
...,...,...,...,...
62689,True,True,0.0,15
62690,False,True,0.0,7
62691,True,True,0.0,11
62692,False,True,0.0,6


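This analysis groups apps by monetization setup (whether they offer in-app purchases, whether they show ads, and what price they charge) and computes the mean ratings_per_day for each combination, using rating activity as a rough proxy for user engagement. In the output above, the two highest averages belong to free apps with in-app purchases: about 143 ratings per day without ads and about 121 with ads. Because the grouping uses the exact price, it produces 393 combinations, many of which likely contain only a few apps, so the rows for paid apps should be read with caution.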
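
Grouping on the exact price yields one group per price point (393 rows in the output above). One alternative is to collapse price into free versus paid and combine it with the two flags into a single label. The sketch below does this on made-up rows; `monetization_label` and the `model` column are names introduced here for illustration, not part of the dataset:

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "offers_iap": [True, True, False, False, True, False],
    "ad_supported": [True, False, True, False, False, False],
    "price": [0.0, 0.0, 0.0, 0.0, 6.99, 1.27],
    "ratings_per_day": [12, 30, 4, 1, 9, 0],
})

def monetization_label(row):
    """Build a label such as 'free + iap + ads' from the three columns."""
    parts = ["paid" if row["price"] > 0 else "free"]
    if row["offers_iap"]:
        parts.append("iap")
    if row["ad_supported"]:
        parts.append("ads")
    return " + ".join(parts)

toy["model"] = toy.apply(monetization_label, axis=1)

# App count and mean ratings per day for each monetization model.
by_model = (
    toy.groupby("model")["ratings_per_day"]
    .agg(apps="count", mean="mean")
    .sort_values("mean", ascending=False)
)
print(by_model)
```

Keeping the app count next to the mean makes it easier to see which averages rest on very few apps.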