# Introduction

In this notebook, we analyse the Android app market by comparing thousands of apps in the Google Play Store.

# About the Dataset of Google Play Store Apps & Reviews

**Data Source:** <br>
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. The original files are listed [here](https://www.kaggle.com/lava18/google-play-store-apps).

# Import Statements

In [None]:
import pandas as pd
import plotly.express as px


# Notebook Presentation

In [None]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [None]:
df_apps = pd.read_csv('apps.csv')

In [None]:
df_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social,"July 28, 2017",4.1 and up
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education,"April 15, 2016",3.0 and up
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization,"July 11, 2018",4.2 and up
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business,"August 6, 2018",4.1 and up


What else should I know about the data?



So we can see that 12 different features were originally scraped from the Google Play Store.



* The data is only a sample of the Android apps on the Google Play Store, of which there are millions.
* I’ll assume that the sample is representative of the Google Play Store as a whole. This is not necessarily the case because, during the web scraping process, the sample was served up based on the geographical location and user behaviour of the person who scraped it, in our case Lavanya Gupta.
* The data was compiled around 2017/2018. The pricing data reflects the price in US dollars at the time of scraping (developers can offer promotions and change their app’s pricing).
* I’ve converted each app’s size to a floating-point number in MBs. If the size was missing, it has been replaced by the average size for that category.
* The installs are not the exact number of installs. If an app has 245,239 installs, Google will simply report an order of magnitude like 100,000+. I’ve removed the '+', and we’ll treat the value in that column as the exact number of installs for simplicity.
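
The size imputation mentioned in the list above (missing sizes replaced by the category average) could be done along the following lines. This is a minimal sketch on made-up rows, not the preprocessing that was actually applied to `apps.csv`; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; None marks a size that could not be scraped.
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'GAME', 'TOOLS', 'TOOLS'],
    'Size_MBs': [20.0, None, 40.0, 5.0, None],
})

# Mean size per category, broadcast back to every row of that category.
category_mean = toy.groupby('Category')['Size_MBs'].transform('mean')

# Fill only the missing sizes with the mean of the app's own category.
toy['Size_MBs'] = toy['Size_MBs'].fillna(category_mean)
print(toy)
```

Using `transform` keeps the result aligned with the original rows, so `fillna` can match each missing value to its own category's mean.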

# Data Cleaning

How many rows and columns does `df_apps` have? What are the column names? Draw 5 random rows to see what the data looks like.

In [None]:
print(f'There are {df_apps.shape[0]} rows.')
print(f'There are {df_apps.shape[1]} columns.\n')
print(f'The column names are: {df_apps.columns}')

There are 10841 rows.
There are 12 columns.

The column names are: Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')


We can already see some data issues that we need to fix. The `Rating` and `Type` columns contain NaN (not-a-number) values, and the `Price` column contains dollar signs that will cause problems.

In [None]:
df_apps.sample(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Last_Updated,Android_Ver
8280,Megatramp - a Story of Success!,FAMILY,4.3,322976,19.0,5000000,Free,0,Teen,Simulation,"June 13, 2018",4.2 and up
8482,Cartoon Wars: Blade,GAME,4.4,165656,38.0,5000000,Free,0,Teen,Arcade,"February 7, 2018",2.3 and up
2733,Easy DIY CD Craft Ideas,ART_AND_DESIGN,,7,5.6,5000,Free,0,Everyone,Art & Design,"May 30, 2018",2.3 and up
4440,Breweries (CZ/SK),LIFESTYLE,4.4,1740,4.4,50000,Free,0,Teen,Lifestyle,"July 27, 2017",4.1 and up
10757,Twitter,NEWS_AND_MAGAZINES,4.3,11667403,6.3,500000000,Free,0,Mature 17+,News & Magazines,"August 6, 2018",Varies with device


### Drop Unused Columns

Remove the columns called `Last_Updated` and `Android_Ver` from the DataFrame. We will not be using these columns.

In [None]:
df_apps.drop(columns=['Last_Updated','Android_Ver'], inplace=True)

In [None]:
df_apps.isna().sum()

Unnamed: 0,0
App,0
Category,0
Rating,1474
Reviews,0
Size_MBs,0
Installs,0
Type,1
Price,0
Content_Rating,0
Genres,0


### Find and Remove NaN values in Ratings

How many rows have a NaN (not-a-number) value in the `Rating` column?

In [None]:
nan_rows = df_apps[df_apps.Rating.isna()]
nan_rows.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
0,Ak Parti Yardım Toplama,SOCIAL,,0,8.7,0,Paid,$13.99,Teen,Social
1,Ain Arabic Kids Alif Ba ta,FAMILY,,0,33.0,0,Paid,$2.99,Everyone,Education
2,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,,0,5.5,0,Paid,$1.49,Everyone,Personalization
3,Command & Conquer: Rivals,FAMILY,,0,19.0,0,,0,Everyone 10+,Strategy
4,CX Network,BUSINESS,,0,10.0,0,Free,0,Everyone,Business


We see that NaN values in ratings are associated with no reviews (and no installs). That makes sense.

In [None]:
print(f'Number of rows which have NaN values in the Ratings column are: {nan_rows.shape[0]}')

Number of rows which have NaN values in the Ratings column are: 1474


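Create a DataFrame called `df_apps_clean` that does not include these rows. Note that `dropna()` removes every row with a NaN in any column, so it also takes care of the single missing `Type` value.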

In [None]:
df_apps_clean = df_apps.dropna()
df_apps_clean.isna().sum()

Unnamed: 0,0
App,0
Category,0
Rating,0
Reviews,0
Size_MBs,0
Installs,0
Type,0
Price,0
Content_Rating,0
Genres,0


In [None]:
df_apps_clean.shape

(9367, 10)

### Find and Remove Duplicates

Are there any duplicates in the data?
How many entries can you find for the "Instagram" app?


In [None]:
duplicated_rows = df_apps_clean[df_apps_clean.duplicated()]
duplicated_rows.head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
946,420 BZ Budeze Delivery,MEDICAL,5.0,2,11.0,100,Free,0,Mature 17+,Medical
1133,MouseMingle,DATING,2.7,3,3.9,100,Free,0,Mature 17+,Dating
1196,"Cardiac diagnosis (heart rate, arrhythmia)",MEDICAL,4.4,8,6.5,100,Paid,$12.99,Everyone,Medical
1231,Sway Medical,MEDICAL,5.0,3,22.0,100,Free,0,Everyone,Medical
1247,Chat Kids - Chat Room For Kids,DATING,4.7,6,4.9,100,Free,0,Mature 17+,Dating


In [None]:
duplicated_rows.shape

(476, 10)

In [None]:
df_apps_clean[df_apps_clean.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social


In [None]:
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App','Type','Price'])
df_apps_clean[df_apps_clean.App=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social


In [None]:
df_apps_clean.shape

(8199, 10)
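
To see why the `subset` argument matters here, consider this small sketch. The rows are made up, loosely modelled on the Instagram entries above, and the frame `toy` is hypothetical. A full-row comparison only catches exact copies, while comparing on `App`, `Type` and `Price` also catches entries that differ only in their review count.

```python
import pandas as pd

# Hypothetical rows: same app scraped several times with changing review counts.
toy = pd.DataFrame({
    'App': ['Instagram', 'Instagram', 'Instagram', 'Other'],
    'Type': ['Free', 'Free', 'Free', 'Free'],
    'Price': ['0', '0', '0', '0'],
    'Reviews': [66577313, 66577446, 66577313, 10],
})

# Full-row comparison: only the exact copy (third row) is flagged.
exact_copies = toy.duplicated().sum()

# Comparing on App, Type and Price keeps the first row per app only.
deduped = toy.drop_duplicates(subset=['App', 'Type', 'Price'])

print(exact_copies, len(deduped))
```

By default `drop_duplicates` keeps the first occurrence, which is why the Instagram row with index 10806 survives in the notebook.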

# Find Highest Rated Apps

Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

In [None]:
df_apps_clean.sort_values('Rating', ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
21,KBA-EZ Health Guide,MEDICAL,5.0,4,25.0,1,Free,0,Everyone,Medical
1639,DC-014,PHOTOGRAPHY,5.0,3,16.0,500,Free,0,Everyone,Photography
1675,WPBS-DT,FAMILY,5.0,3,6.3,500,Free,0,Everyone,Entertainment
1677,CG FM,FAMILY,5.0,2,6.6,500,Free,0,Everyone,Entertainment
958,Food-Aw - Order Food Online in Aruba,FOOD_AND_DRINK,5.0,1,24.0,100,Free,0,Everyone,Food & Drink


Only apps with very few reviews (and a low number of installs) have perfect 5-star ratings, most likely left by friends and family.
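
One way to work around this, sketched here on made-up rows, is to require a minimum number of reviews before ranking by rating. The cut-off `MIN_REVIEWS` and the frame `toy` are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical apps: two with a handful of reviews, two with many.
toy = pd.DataFrame({
    'App': ['Tiny A', 'Tiny B', 'Big C', 'Big D'],
    'Rating': [5.0, 5.0, 4.6, 4.4],
    'Reviews': [3, 1, 120000, 98000],
})

MIN_REVIEWS = 1000  # arbitrary cut-off chosen for illustration

# Keep only apps with enough reviews for the rating to be meaningful.
well_reviewed = toy[toy.Reviews >= MIN_REVIEWS]
top_rated = well_reviewed.sort_values('Rating', ascending=False)
print(top_rated)
```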

# Find 5 Largest Apps in terms of Size (MBs)

What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, could there be a limit in place, or can developers make apps as large as they please?

In [None]:
df_apps_clean.sort_values('Size_MBs', ascending=False).head()

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
7928,Stickman Legends: Shadow Wars,GAME,4.4,38419,100.0,1000000,Paid,$0.99,Everyone 10+,Action
9944,Gangster Town: Vice District,FAMILY,4.3,65146,100.0,10000000,Free,0,Mature 17+,Simulation
7927,The Walking Dead: Our World,GAME,4.0,22435,100.0,1000000,Free,0,Teen,Action
8719,Draft Simulator for FUT 18,SPORTS,4.6,162933,100.0,5000000,Free,0,Everyone,Sports
8718,Mini Golf King - Multiplayer Game,GAME,4.5,531458,100.0,5000000,Free,0,Everyone,Sports


Here we can see that the sizes top out at exactly 100 MB, which suggests an upper bound. A quick Google search would also have revealed that this limit is imposed by the Google Play Store itself. It’s interesting to see that a number of apps hit that limit exactly.

# Find the 5 Apps with the Most Reviews

Which apps have the highest number of reviews? Are there any paid apps among the top 50?

In [None]:
df_apps_clean.sort_values('Reviews', ascending=False).head(50)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
10805,Facebook,SOCIAL,4.1,78158306,5.3,1000000000,Free,0,Teen,Social
10811,Facebook,SOCIAL,4.1,78128208,5.3,1000000000,Free,0,Teen,Social
10789,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication
10785,WhatsApp Messenger,COMMUNICATION,4.4,69119316,3.5,1000000000,Free,0,Everyone,Communication
10797,WhatsApp Messenger,COMMUNICATION,4.4,69109672,3.5,1000000000,Free,0,Everyone,Communication
10808,Instagram,SOCIAL,4.5,66577446,5.3,1000000000,Free,0,Teen,Social
10809,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10806,Instagram,SOCIAL,4.5,66577313,5.3,1000000000,Free,0,Teen,Social
10810,Instagram,SOCIAL,4.5,66509917,5.3,1000000000,Free,0,Teen,Social
10790,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56646578,3.5,1000000000,Free,0,Everyone,Communication


If we look at the number of reviews, we can find the most popular apps on the Google Play Store. These include the usual suspects: Facebook, WhatsApp, Instagram, etc. What’s also notable is that the list of the top 50 most reviewed apps does not include a single paid app!

# Data Visualisation with Plotly

All Android apps have a content rating like “Everyone” or “Teen” or “Mature 17+”. Let’s take a look at the distribution of the content ratings in our dataset.

In [None]:
ratings = df_apps_clean.Content_Rating.value_counts()
ratings

Unnamed: 0_level_0,count
Content_Rating,Unnamed: 1_level_1
Everyone,6621
Teen,912
Mature 17+,357
Everyone 10+,305
Adults only 18+,3
Unrated,1


In [None]:
# `names` sets the slice labels; `labels` is only for renaming columns, so it is not needed here
fig = px.pie(names=ratings.index,
             values=ratings.values,
             title="Content Rating")
fig.update_traces(textposition='outside', textinfo='percent+label')

fig.show()

In [None]:
# `hole` turns the pie chart into a donut chart
fig = px.pie(names=ratings.index, values=ratings.values, title="Content Rating", hole=0.6)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')
fig.show()

# Numeric Type Conversions for the Installations & Price Data

How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?


In [None]:
print(df_apps_clean.info(),"\n")

<class 'pandas.core.frame.DataFrame'>
Index: 8199 entries, 21 to 10835
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             8199 non-null   object 
 1   Category        8199 non-null   object 
 2   Rating          8199 non-null   float64
 3   Reviews         8199 non-null   int64  
 4   Size_MBs        8199 non-null   float64
 5   Installs        8199 non-null   object 
 6   Type            8199 non-null   object 
 7   Price           8199 non-null   object 
 8   Content_Rating  8199 non-null   object 
 9   Genres          8199 non-null   object 
dtypes: float64(2), int64(1), object(7)
memory usage: 704.6+ KB
None 



This shows that the `Installs` column has a non-numeric data type: its dtype is `object`.

In [None]:
df_apps_clean.Installs = df_apps_clean.Installs.astype(str).str.replace("," , "")
df_apps_clean.Installs = pd.to_numeric(df_apps_clean.Installs)
df_apps_clean[['App','Installs']].groupby('Installs').count()

Unnamed: 0_level_0,App
Installs,Unnamed: 1_level_1
1,3
5,9
10,69
50,56
100,303
500,199
1000,698
5000,425
10000,988
50000,457
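
Once the column is numeric, the two questions above (how many apps reached the 1,000,000,000+ bucket and how many have a single install) can be answered with boolean masks. The following is a small sketch on made-up values; the Series `installs` is hypothetical and only mimics the text format of the raw column.

```python
import pandas as pd

# Hypothetical install counts stored as text, as in the raw column.
installs = pd.Series(['1', '1,000', '1,000,000,000', '50,000', '1'])

# Strip the thousands separators, then convert to integers.
installs = pd.to_numeric(installs.str.replace(',', '', regex=False))

# Summing a boolean mask counts the rows where the condition holds.
# The largest bucket is reported as 1,000,000,000+, hence ">=".
billion_plus = (installs >= 1_000_000_000).sum()
single_install = (installs == 1).sum()
print(billion_plus, single_install)
```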


# Find the Most Expensive Apps, Filter out the Junk, and Calculate a (ballpark) Sales Revenue Estimate

1. Investigate the top 20 most expensive apps in the dataset.
2. Filter out the junk.
3. What are the top 10 highest grossing paid apps according to this estimate?
4. Out of the top 10 highest grossing paid apps, how many are games?



Looking at the data type of the `Price` column, we see that it is also of type `object`.

In [None]:
# '$' is a regex metacharacter, so match it literally with regex=False
df_apps_clean.Price = df_apps_clean.Price.astype(str).str.replace('$', "", regex=False)
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price)
df_apps_clean.sort_values('Price', ascending=False).head(20)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
3946,I'm Rich - Trump Edition,LIFESTYLE,3.6,275,7.3,10000,Paid,400.0,Everyone,Lifestyle
2461,I AM RICH PRO PLUS,FINANCE,4.0,36,41.0,1000,Paid,399.99,Everyone,Finance
4606,I Am Rich Premium,FINANCE,4.1,1867,4.7,50000,Paid,399.99,Everyone,Finance
3145,I am rich(premium),FINANCE,3.5,472,0.94,5000,Paid,399.99,Everyone,Finance
3554,💎 I'm rich,LIFESTYLE,3.8,718,26.0,10000,Paid,399.99,Everyone,Lifestyle
5765,I am rich,LIFESTYLE,3.8,3547,1.8,100000,Paid,399.99,Everyone,Lifestyle
1946,I am rich (Most expensive app),FINANCE,4.1,129,2.7,1000,Paid,399.99,Teen,Finance
2775,I Am Rich Pro,FAMILY,4.4,201,2.7,5000,Paid,399.99,Everyone,Entertainment
3221,I am Rich Plus,FAMILY,4.0,856,8.7,10000,Paid,399.99,Everyone,Entertainment
3114,I am Rich,FINANCE,4.3,180,3.8,5000,Paid,399.99,Everyone,Finance


What’s going on here? There are apparently 15 I am Rich apps in the Google Play Store. They all cost $300 or more, which is the main point of the app. The story goes that in 2008, Armin Heinrich released the very first I am Rich app in the iOS App Store for $999.99. The app does absolutely nothing. It just displays a picture of a gemstone and can be used to prove to your friends how rich you are. Armin actually made a total of 7 sales before the app was hastily removed by Apple. Nonetheless, it inspired a bunch of copycats on the Google Play Store, but if you search today, you’ll find that all of these apps have disappeared as well. The high installation numbers were likely gamed by making the app available for free at some point to get reviews and appear more legitimate.

Therefore, to avoid a misleading picture of the most expensive 'real' apps, we remove the rows with a price of $250 or more.

In [None]:
df_apps_clean = df_apps_clean[df_apps_clean.Price<250]
df_apps_clean.sort_values('Price', ascending=False).head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres
2281,Vargo Anesthesia Mega App,MEDICAL,4.6,92,32.0,1000,Paid,79.99,Everyone,Medical
1407,LTC AS Legal,MEDICAL,4.0,6,1.3,100,Paid,39.99,Everyone,Medical
2629,I am Rich Person,LIFESTYLE,4.2,134,1.8,1000,Paid,37.99,Everyone,Lifestyle
2481,A Manual of Acupuncture,MEDICAL,3.5,214,68.0,1000,Paid,33.99,Everyone,Medical
2463,PTA Content Master,MEDICAL,4.2,64,41.0,1000,Paid,29.99,Everyone,Medical


### Highest Grossing Paid Apps (ballpark estimate)

In [None]:
df_apps_clean["Revenue_Estimate"] =  df_apps_clean.Price*df_apps_clean.Installs
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]

Unnamed: 0,App,Category,Rating,Reviews,Size_MBs,Installs,Type,Price,Content_Rating,Genres,Revenue_Estimate
9220,Minecraft,FAMILY,4.5,2376564,19.0,10000000,Paid,6.99,Everyone 10+,Arcade;Action & Adventure,69900000.0
8825,Hitman Sniper,GAME,4.6,408292,29.0,10000000,Paid,0.99,Mature 17+,Action,9900000.0
7151,Grand Theft Auto: San Andreas,GAME,4.4,348962,26.0,1000000,Paid,6.99,Mature 17+,Action,6990000.0
7477,Facetune - For Free,PHOTOGRAPHY,4.4,49553,48.0,1000000,Paid,5.99,Everyone,Photography,5990000.0
7977,Sleep as Android Unlock,LIFESTYLE,4.5,23966,0.85,1000000,Paid,5.99,Everyone,Lifestyle,5990000.0
6594,DraStic DS Emulator,GAME,4.6,87766,12.0,1000000,Paid,4.99,Everyone,Action,4990000.0
6082,Weather Live,WEATHER,4.5,76593,4.75,500000,Paid,5.99,Everyone,Weather,2995000.0
7954,Bloons TD 5,FAMILY,4.6,190086,94.0,1000000,Paid,2.99,Everyone,Strategy,2990000.0
7633,Five Nights at Freddy's,GAME,4.6,100805,50.0,1000000,Paid,2.99,Teen,Action,2990000.0
6746,Card Wars - Adventure Time,FAMILY,4.3,129603,23.0,1000000,Paid,2.99,Everyone 10+,Card;Action & Adventure,2990000.0


The top spot of the highest-grossing paid app goes to … Minecraft at close to $70 million. It’s quite interesting that Minecraft (along with Bloons and Card Wars) is actually listed in the Family category rather than in the Game category. If we include these titles, we see that 7 out of the top 10 highest-grossing apps are games. The Google Play Store seems to be quite flexible with its category labels.
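
Because the install counts are bucket floors rather than exact figures, the revenue estimate is better read as a range than as a single number. The sketch below is one way to express that, under two assumptions that are not confirmed by the dataset: that the buckets run 1, 5, 10, 50, 100, ... (as the install counts seen earlier suggest) and that the true count lies below the next bucket. The helper `next_bucket` and the frame `toy` are hypothetical.

```python
import pandas as pd

def next_bucket(installs: int) -> int:
    """Return the next install bucket, assuming buckets run 1, 5, 10, 50, 100, ..."""
    leading_digit = int(str(installs)[0])
    # A bucket starting with 1 is followed by 5x; one starting with 5 by 2x.
    return installs * 5 if leading_digit == 1 else installs * 2

# Hypothetical paid apps.
toy = pd.DataFrame({
    'App': ['Paid A', 'Paid B'],
    'Price': [6.99, 0.99],
    'Installs': [10_000_000, 500_000],
})

# Lower bound: the reported bucket. Upper bound: the next bucket up.
toy['Revenue_Low'] = toy.Price * toy.Installs
toy['Revenue_High'] = toy.Price * toy.Installs.apply(next_bucket)
print(toy)
```

Even the range ignores price changes and free promotions, so it remains a ballpark figure.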

# Analysing App Categories

If you were to release an app, would you choose to go after a competitive category with many other apps? Or would you target a popular category with a high number of downloads? Or perhaps you could target a category that is popular but where the downloads are spread out among many different apps. That way, even if your app is more difficult to discover among all the other apps, it has a better chance of getting installed, right? Let’s analyse this with bar charts and scatter plots and figure out which categories are dominating the market.

In [None]:
df_apps_clean.Category.nunique()

33

In [None]:
top10_category = df_apps_clean.Category.value_counts()[:10]
top10_category

Unnamed: 0_level_0,count
Category,Unnamed: 1_level_1
FAMILY,1606
GAME,910
TOOLS,719
PRODUCTIVITY,301
PERSONALIZATION,298
LIFESTYLE,297
FINANCE,296
MEDICAL,292
PHOTOGRAPHY,263
BUSINESS,262


### Highest Competition (Number of Apps)

In [None]:
v_bar = px.bar(x=top10_category.index, y=top10_category.values)
v_bar.update_layout(xaxis_title='Category',
                    yaxis_title='Number of Apps',
                    title='Top Categories')
v_bar.show()

Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.

But what if we look at it from a different perspective? What matters is not just the total number of apps in the category but how often apps are downloaded in that category. This will give us an idea of how popular a category is.

### Most Popular Categories (Highest Downloads)

In [None]:
category_installs = df_apps_clean.groupby('Category').agg({'Installs':'sum'})
category_installs.sort_values('Installs', ascending=True, inplace=True)
category_installs

Unnamed: 0_level_0,Installs
Category,Unnamed: 1_level_1
EVENTS,15949410
BEAUTY,26916200
PARENTING,31116110
MEDICAL,39162676
COMICS,44931100
LIBRARIES_AND_DEMO,52083000
AUTO_AND_VEHICLES,53129800
HOUSE_AND_HOME,97082000
ART_AND_DESIGN,114233100
DATING,140912410


In [None]:
h_bar = px.bar(x=category_installs.Installs, y=category_installs.index, orientation='h', title='Category Popularity')
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category')
h_bar.show()

Now we see that Games and Tools are actually the most popular categories. If we plot the popularity of a category next to the number of apps in that category we can get an idea of how concentrated a category is. Do few apps have most of the downloads or are the downloads spread out over many apps?

### Category Concentration - Downloads vs. Competition


In [None]:
cat_number = df_apps_clean.groupby('Category').agg({'App':pd.Series.count})
cat_merged_df = pd.merge(cat_number, category_installs, on='Category', how='inner')
print(f'The dimensions of the DataFrame are: {cat_merged_df.shape}')
cat_merged_df.sort_values('Installs', ascending=False)


The dimensions of the DataFrame are: (33, 2)


Unnamed: 0_level_0,App,Installs
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
GAME,910,13858762717
COMMUNICATION,257,11039241530
TOOLS,719,8099724500
PRODUCTIVITY,301,5788070180
SOCIAL,203,5487841475
PHOTOGRAPHY,263,4649143130
FAMILY,1606,4437554490
VIDEO_PLAYERS,148,3916897200
TRAVEL_AND_LOCAL,187,2894859300
NEWS_AND_MAGAZINES,204,2369110650


In [None]:
scatter = px.scatter(cat_merged_df, # data
                    x='App', # column name
                    y='Installs',
                    title='Category Concentration',
                    size='App',
                    hover_name=cat_merged_df.index,
                    color='Installs')

scatter.update_layout(xaxis_title="Number of Apps (Lower=More Concentrated)",
                      yaxis_title="Installs",
                      yaxis=dict(type='log'))

scatter.show()

What we see is that categories like Family, Tools, and Game have many different apps sharing a high number of downloads. But for categories like Video Players and Entertainment, the downloads are concentrated in very few apps.
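
A crude numeric companion to the scatter plot is the average number of installs per app in each category. The sketch below uses three rows copied from the merged table above; the column name `Installs_per_App` is an assumption introduced here, and an average says nothing about how unevenly downloads are spread within a category.

```python
import pandas as pd

# Figures copied from the merged table above for three categories.
toy = pd.DataFrame(
    {'App': [1606, 148, 257],
     'Installs': [4437554490, 3916897200, 11039241530]},
    index=pd.Index(['FAMILY', 'VIDEO_PLAYERS', 'COMMUNICATION'], name='Category'),
)

# Average installs per app: higher means the downloads are shared by fewer apps.
toy['Installs_per_App'] = toy.Installs / toy.App
ranked = toy.sort_values('Installs_per_App', ascending=False)
print(ranked)
```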

# Extracting Nested Data from a Column

1. How many different types of genres are there?
2. Can an app belong to more than one genre?
3. Check what happens when you use `.value_counts()` on a column with nested values. See if you can work around this problem by using the `.str.split()` method and the DataFrame's [.stack() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html).


In [None]:
print(f"Number of Genres: {len(df_apps_clean.Genres.unique())}\n")

df_apps_clean.Genres.value_counts().sort_values(ascending=True)[:5]

Number of Genres: 114



Unnamed: 0_level_0,count
Genres,Unnamed: 1_level_1
Lifestyle;Pretend Play,1
Strategy;Education,1
Adventure;Education,1
Role Playing;Brain Games,1
Tools;Education,1


If we look at the number of unique values in the Genres column we get 114. But this is not accurate if we have nested data like we do here.
We somehow need to separate the genre names to get a clear picture.


In [None]:
stack = df_apps_clean.Genres.str.split(';', expand=True).stack()
print(f"We now have a single column with shape: {stack.shape}\n")

num_genres = stack.value_counts()
print(f'Number of genres: {len(num_genres)}')

21     0                    Medical
28     0                     Arcade
47     0                     Arcade
82     0                     Arcade
99     0                    Medical
                     ...           
10824  0               Productivity
10828  0    Video Players & Editors
10829  0    Video Players & Editors
10831  0           News & Magazines
10835  0                     Arcade
Length: 8564, dtype: object 

We now have a single column with shape: (8564,)

Number of genres: 53


This shows us we actually have 53 different genres.
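
To make the split-and-stack step easier to follow, here is the same idea on three made-up values, next to an alternative based on `Series.explode()`. The Series `genres` is hypothetical; the notebook itself uses the `stack()` approach shown earlier.

```python
import pandas as pd

# Hypothetical genre strings; the first one is nested.
genres = pd.Series(['Arcade;Action & Adventure', 'Tools', 'Arcade'])

# expand=True gives one column per nested value; stack() folds the columns
# into a single long Series.
stacked = genres.str.split(';', expand=True).stack()

# Alternative: split into lists, then give each list element its own row.
exploded = genres.str.split(';').explode()

# value_counts() ignores missing values, so both routes give the same counts.
print(stacked.value_counts())
print(exploded.value_counts())
```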

# Competition in Genres

In [None]:
num_genres.head()

Unnamed: 0,count
Tools,719
Education,587
Entertainment,498
Action,304
Productivity,301


In [None]:
bar = px.bar(x=num_genres.index[:15],
             y=num_genres.values[:15],
             title='Top Genres',
             hover_name=num_genres.index[:15],
             color=num_genres.values[:15],
             color_continuous_scale='Agsunset')

bar.update_layout(xaxis_title='Genre',
                  yaxis_title='Number of Apps',
                  coloraxis_showscale=False)

bar.show()

# Free vs. Paid Apps per Category

Now that we’ve looked at the total number of apps per category and the total number of apps per genre, let’s see what the split is between free and paid apps.

In [None]:
df_apps_clean.Type.value_counts()

Unnamed: 0_level_0,count
Type,Unnamed: 1_level_1
Free,7595
Paid,589


We see that the majority of apps are free on the Google Play Store. But perhaps some categories have more paid apps than others. Let’s investigate.

In [None]:
df_free_vs_paid = df_apps_clean.groupby(['Category','Type'], as_index=False).agg({'App':pd.Series.count})
df_free_vs_paid

Unnamed: 0,Category,Type,App
0,ART_AND_DESIGN,Free,58
1,ART_AND_DESIGN,Paid,3
2,AUTO_AND_VEHICLES,Free,72
3,AUTO_AND_VEHICLES,Paid,1
4,BEAUTY,Free,42
...,...,...,...
56,TRAVEL_AND_LOCAL,Paid,8
57,VIDEO_PLAYERS,Free,144
58,VIDEO_PLAYERS,Paid,4
59,WEATHER,Free,65


In [None]:
df_free_vs_paid[df_free_vs_paid.Category == 'FAMILY']

Unnamed: 0,Category,Type,App
19,FAMILY,Free,1456
20,FAMILY,Paid,150


Unsurprisingly, the biggest categories have the most paid apps. However, there might be some patterns if we put the numbers on a graph!

In [None]:
g_bar = px.bar(df_free_vs_paid,
              x='Category',
              y='App',
              title='Free vs Paid Apps by Category',
              color='Type',
              barmode='group',
              hover_name='Category',
              labels={'App':'Number of Apps'}
              )

g_bar.update_layout(xaxis_title='Category',
                  yaxis_title='Number of Apps',
                  xaxis={'categoryorder':'total descending'},
                  yaxis=dict(type='log'))

g_bar.show()

What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.
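
The phrase "relatively more paid apps" can be made concrete by computing the paid share per category. This sketch uses the Family and Video Players counts from the table above in the same layout as `df_free_vs_paid`; the column name `Paid_Share` is an assumption introduced here.

```python
import pandas as pd

# Counts for two categories taken from the table above.
toy = pd.DataFrame({
    'Category': ['FAMILY', 'FAMILY', 'VIDEO_PLAYERS', 'VIDEO_PLAYERS'],
    'Type': ['Free', 'Paid', 'Free', 'Paid'],
    'App': [1456, 150, 144, 4],
})

# One row per category, one column per Type; a missing Type counts as zero apps.
counts = toy.pivot(index='Category', columns='Type', values='App').fillna(0)

# Fraction of each category's apps that are paid.
counts['Paid_Share'] = counts['Paid'] / (counts['Free'] + counts['Paid'])
print(counts.sort_values('Paid_Share', ascending=False))
```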

# Lost Downloads for Paid Apps

And how many downloads are you potentially giving up because your app is paid? How does the median number of installations compare? Is the difference large or small?


In [None]:
box = px.box(df_apps_clean,
             x='Type',
             y='Installs',
             color='Type',
             notched=True,
             points='all',
             title='How Many Downloads are Paid Apps Giving Up?',
             log_y=True)

box.show()

From the hover text in the chart, we see that the median number of downloads for free apps is 500,000, while the median number of downloads for paid apps is around 5,000! That is lower by a factor of about 100.
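
The medians can also be computed directly instead of being read off the hover text. A minimal sketch on made-up install counts follows; the frame `toy` is hypothetical, and on the real data the same `groupby` would be applied to `df_apps_clean`.

```python
import pandas as pd

# Hypothetical install counts for three free and three paid apps.
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Free', 'Paid', 'Paid', 'Paid'],
    'Installs': [10_000, 500_000, 10_000_000, 100, 5_000, 100_000],
})

# Median installs per app type, and how many times larger the free median is.
median_installs = toy.groupby('Type')['Installs'].median()
ratio = median_installs['Free'] / median_installs['Paid']
print(median_installs)
print(f'Free apps have {ratio:.0f}x the median installs of paid apps.')
```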

But does this mean we should give up on selling a paid app? Let’s see how much revenue we would estimate per category.



# Revenue by App Category

Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?


In [None]:
df_paid_apps = df_apps_clean[df_apps_clean.Type=='Paid']

median_earning = df_paid_apps[df_paid_apps.Category == "TOOLS"]['Revenue_Estimate'].median()
print(f"The median app earns {median_earning} in the Tools category.")

The median app earns 5990.0 in the Tools category.


In [None]:
box = px.box(df_paid_apps,
             x='Category',
             y='Revenue_Estimate',
             title='How Much Can Paid Apps Earn?',
             log_y=True)

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Ballpark Revenue',
                  xaxis={'categoryorder':'min ascending'}
                  )

box.show()

If an Android app costs $30,000 to develop, then the average app in very few categories would cover that development cost. The median paid photography app earned about $20,000. Many apps’ revenues were even lower, meaning they would need other sources of revenue, like advertising or in-app purchases, to make up for their development costs. However, certain app categories seem to contain a large number of outliers that have much higher (estimated) revenue, for example Medical, Personalisation, Tools, Game, and Family.
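
To put a number on "would cover that development cost", one could compute the share of paid apps per category whose revenue estimate reaches the assumed $30,000. The rows below are made up, and the column name `Recouped` is an assumption introduced for this sketch.

```python
import pandas as pd

DEV_COST = 30_000  # development cost assumed in the text

# Hypothetical revenue estimates for paid apps in two categories.
toy = pd.DataFrame({
    'Category': ['PHOTOGRAPHY'] * 3 + ['TOOLS'] * 3,
    'Revenue_Estimate': [2_990.0, 19_900.0, 5_990_000.0, 990.0, 5_990.0, 49_900.0],
})

# True where the estimate covers the assumed cost.
toy['Recouped'] = toy.Revenue_Estimate >= DEV_COST

# The mean of a boolean column is the share of True values.
share_recouped = toy.groupby('Category')['Recouped'].mean()
print(share_recouped)
```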

So, if we were to list a paid app, how should we price it?


# How Much Can You Charge? Examine Paid App Pricing Strategies by Category

**Challenge**: What is the median price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using `{'categoryorder':'max descending'}` to sort the categories.

In [None]:
median_price = df_paid_apps['Price'].median()
print(f"The median price for a paid app is ${median_price}.")

The median price for a paid app is $2.99.


In [None]:
box = px.box(df_paid_apps,
             x='Category',
             y='Price',
             title='Price per Category',
             )

box.update_layout(xaxis_title='Category',
                  yaxis_title='Paid App Price',
                  xaxis={'categoryorder':'max descending'},
                  yaxis=dict(type='log')
                  )

box.show()

The median price for a paid Android app is $2.99.

However, some categories have higher median prices than others. This time we see that Medical has the most expensive apps as well as a median price of $5.49.

In contrast, Personalisation apps are quite cheap, with a median price of $1.49.

Other categories with higher median prices are Business ($4.99) and Dating ($6.99). It seems that customers who shop in these categories are not so concerned about paying a bit extra for their apps.
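
The per-category figures can be tabulated rather than read from the box plot. The sketch below summarises price by category on made-up rows and includes the number of apps, since a median based on only a few apps deserves caution. The frame `toy` and the column name `n_apps` are hypothetical.

```python
import pandas as pd

# Hypothetical paid apps and their prices.
toy = pd.DataFrame({
    'Category': ['MEDICAL', 'MEDICAL', 'MEDICAL',
                 'PERSONALIZATION', 'PERSONALIZATION', 'DATING'],
    'Price': [2.99, 5.49, 79.99, 0.99, 1.49, 6.99],
})

# Median and maximum price per category, plus the number of apps behind them.
price_summary = (
    toy.groupby('Category')['Price']
       .agg(median='median', max='max', n_apps='count')
       .sort_values('median', ascending=False)
)
print(price_summary)
```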

