## Profitable App Profiles for the App Store and Google Play Markets


Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. 

We’re working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build. At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. 

Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. 

To save time and resources, we will use publicly available data sets for our case study.

In [327]:
import pandas as pd
import numpy as np
import matplotlib

### The Google Play data set ###
android = pd.read_csv('googleplaystore.csv')
print(android.shape)
android.head(1)


(10841, 13)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up


In [328]:
### The App Store data set ###
ios = pd.read_csv('AppleStore.csv', index_col=0)
print(ios.shape)
ios.head(1)

(7197, 16)


Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1


## Cleaning the data

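Exploring the data set further, we can see that some apps have an odd rating, such as 19. This is clearly wrong because the maximum rating for an app is 5, so we’ll delete these rows. Note that the comparison Rating <= 5 is also false for missing ratings, so rows without a rating are removed as well.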

In [329]:
android = android[android.Rating <= 5]
android.shape

(9366, 13)

In [330]:
ios = ios[ios.user_rating <= 5]
ios.shape

(7197, 16)
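
The effect of the rating filter on missing values can be illustrated on a tiny frame. This is only a sketch: the `toy` data below is made up, and the "lenient" variant is an alternative the notebook does not use.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one valid rating, one missing, one out of range.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.5, np.nan, 19.0],
})

# NaN <= 5 evaluates to False, so the row with the missing rating is dropped too.
kept_strict = toy[toy["Rating"] <= 5]

# Hypothetical alternative: drop only out-of-range ratings and keep missing ones.
kept_lenient = toy[toy["Rating"].isna() | (toy["Rating"] <= 5)]

print(kept_strict["App"].tolist())   # ['A']
print(kept_lenient["App"].tolist())  # ['A', 'B']
```

Which variant is appropriate depends on whether apps without a rating should stay in the analysis; the notebook uses the strict form.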

## Removing Duplicate Entries

Some apps have more than one entry. We don’t want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we can probably find a better way.

If we examine the rows for the Instagram app below, the main difference is in the fourth column, Reviews, which holds the number of reviews.

In [331]:
android[android.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won’t remove rows randomly; instead, we’ll keep the rows with the highest number of reviews, on the assumption that the higher the number of reviews, the more reliable the ratings.

In [332]:
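def keep_max_only(df,sort_by_col,dedup_by_col):
    # Sort the frame passed in (not the global android) so the largest value comes last,
    # then keep only that last row for each app.
    df = df.sort_values(by=sort_by_col,ascending=True).drop_duplicates(subset=dedup_by_col,keep='last')
    return df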

android = keep_max_only(android,sort_by_col='Reviews',dedup_by_col='App')

android[android.App == 'Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
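
One thing worth checking with this approach is the type of the column being sorted: if review counts are stored as text, sorting is lexicographic and '9' ranks above '10'. The sketch below converts the column to numbers first. The function name `keep_most_reviewed` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def keep_most_reviewed(df, name_col="App", reviews_col="Reviews"):
    """Keep one row per app: the one with the highest review count."""
    out = df.copy()
    # Text values would sort lexicographically, so convert to numbers first;
    # values that cannot be parsed become NaN.
    out[reviews_col] = pd.to_numeric(out[reviews_col], errors="coerce")
    # Ascending sort with NaN first puts the largest count last within each app.
    out = out.sort_values(by=reviews_col, ascending=True, na_position="first")
    return out.drop_duplicates(subset=name_col, keep="last")

# Toy data (made up): review counts deliberately stored as strings.
toy = pd.DataFrame({
    "App": ["Chat", "Chat", "Notes"],
    "Reviews": ["9", "10", "3"],
})
deduped = keep_most_reviewed(toy)
print(deduped)
```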


## Removing Non-English Apps

If you explore the data sets enough, you’ll notice the names of some of the apps suggest they are not directed toward an English-speaking audience.

In [333]:
ios.loc[6739:6739]

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
6739,1072278593,スピードOnline トランプゲーム,151349248,USD,0.0,0,0,0.0,0.0,1.1.21,4+,Games,40,5,1,1


We’re not interested in keeping these kinds of apps, so we’ll remove them.

In [334]:
def check_eng(text):
    # Count the characters that fall outside the ASCII range.
    num_non_eng = 0
    for char in text:
        if not char.isascii():
            num_non_eng += 1
    # Names with more than three non-ASCII characters are treated as non-English.
    return num_non_eng <= 3

def keep_eng_only(df,column_name):
    # Keep only the rows whose name passes the check, without modifying the input frame.
    return df[df[column_name].apply(check_eng)].copy()

In [335]:
ios = keep_eng_only(ios,'track_name')
ios.shape

(6183, 16)

In [336]:
android = keep_eng_only(android,'App')
android.shape

(8166, 13)
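
To see how the three-character tolerance behaves, the heuristic can be tried on a few names. This is a sketch: `looks_english` is a hypothetical stand-alone version of the check above, and apart from the Japanese title shown earlier, the sample names are illustrative.

```python
def looks_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

samples = [
    "Instagram",
    "Docs To Go™ Free Office Suite",   # one non-ASCII symbol
    "Instachat 😜",                     # one emoji
    "スピードOnline トランプゲーム",        # mostly non-ASCII
]
results = {name: looks_english(name) for name in samples}
for name, ok in results.items():
    print(ok, name)
```

The tolerance keeps English names that contain an emoji or a trademark sign, at the cost of letting through short non-English names with three or fewer non-ASCII characters.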

### Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we’ll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.


In [337]:
ios_final = ios[ios.price == 0]
ios_final.shape

(3222, 16)

In [338]:
android_final = android[android.Price == '0']
android_final.shape

(7564, 13)
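
The two filters differ because the App Store price is numeric while the Google Play price is text. A more uniform option is to convert the text prices to numbers before comparing. The sketch below assumes paid prices are stored as strings such as '$4.99'; the toy data is made up.

```python
import pandas as pd

# Toy data (made up): prices stored as text, as in the Google Play Price column.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Price": ["0", "$4.99", "0"],
})

# Strip the currency sign literally (regex=False) and convert to a number;
# anything that cannot be parsed becomes NaN and is therefore not counted as free.
price_num = pd.to_numeric(toy["Price"].str.replace("$", "", regex=False), errors="coerce")

free = toy[price_num == 0].copy()
print(free["App"].tolist())  # ['A', 'C']
```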

### Most Common Apps by Genre



As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps. To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both the App Store and Google Play, we need to find app profiles that are successful in both markets. For instance, a profile that might work well for both markets is a productivity app that makes use of gamification.

We start by examining the frequency table for the prime_genre column of the App Store data set.

In [339]:
ios_final.prime_genre.value_counts(normalize=True).head().mul(100).round(1).astype(str) + '%'

Games                58.2%
Entertainment         7.9%
Photo & Video         5.0%
Education             3.7%
Social Networking     3.3%
Name: prime_genre, dtype: object

We can see that among the free English apps, more than half (58.2%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer.

However, the fact that fun apps are the most numerous doesn’t imply that they also have the greatest number of users: demand might not match supply.

Let’s continue by examining the Category column of the Google Play data set.

In [340]:
android_final.Category.value_counts(normalize=True).head().mul(100).round(1).astype(str) + '%'

FAMILY          19.7%
GAME            10.8%
TOOLS            8.7%
FINANCE          3.8%
PRODUCTIVITY     3.7%
Name: Category, dtype: object

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 20% of the apps) consists mostly of games for kids.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we’d like to get an idea about the kinds of apps that have the most users.

### Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. 

For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we’ll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:
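
A minimal sketch of that calculation follows. The helper name `mean_ratings_by_genre` is hypothetical, and the toy frame is made up so that the block runs on its own; only the column names prime_genre and rating_count_tot come from the App Store data set.

```python
import pandas as pd

def mean_ratings_by_genre(df, genre_col="prime_genre", count_col="rating_count_tot"):
    """Average number of user ratings per genre, largest first."""
    return df.groupby(genre_col)[count_col].mean().sort_values(ascending=False)

# Toy data (made up for illustration, not taken from AppleStore.csv).
toy = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Navigation", "Music"],
    "rating_count_tot": [100, 300, 1000, 50],
})
avg = mean_ratings_by_genre(toy)
print(avg)
```

On the notebook's data the same helper would be called as mean_ratings_by_genre(ios_final).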

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: most values are open-ended (100+, 1,000+, 5,000+, etc.).

For instance, we don’t know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don’t need very precise data for our purposes: we only want to get an idea of which app genres attract the most users. We’re going to leave the numbers as they are, which means that we’ll treat an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on.

In [341]:
# Strip the separators and the trailing "+" as literal text (regex=False), then convert to float.
# assign() returns a new DataFrame, so nothing is written into a slice of android.
android_final = android_final.assign(
    Installs=(android_final['Installs'].str.replace(".","",regex=False)
                .str.replace(",","",regex=False)
                .str.replace("+","",regex=False)
                .astype('float')))

android_final.head()


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
8497,DK Browser,COMMUNICATION,4.0,1,2.4M,10.0,Free,0,Everyone,Communication,"April 25, 2017",1.0,4.2 and up
9178,i am EB,PHOTOGRAPHY,5.0,1,5.4M,10.0,Free,0,Teen,Photography,"February 1, 2017",1.0,4.1 and up
7122,CB Fit,HEALTH_AND_FITNESS,5.0,1,7.8M,10.0,Free,0,Everyone,Health & Fitness,"July 9, 2018",4.2.2,4.1 and up
8869,DT CLOTHINGS,SHOPPING,5.0,1,7.9M,10.0,Free,0,Everyone,Shopping,"July 25, 2018",1.0.1,4.1 and up
10776,Monster Ride Pro,GAME,5.0,1,24M,10.0,Free,0,Everyone,Racing,"March 5, 2018",2.0,2.3 and up
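
The lower-bound reading of the install buckets can be checked on a few raw values. This is a sketch on made-up strings in the same format as the Installs column.

```python
import pandas as pd

# Sample raw values in the format used by the Installs column.
raw = pd.Series(["10+", "1,000+", "1,000,000,000+"])

# Remove the thousands separators and the trailing "+" literally, then convert.
installs = (raw.str.replace(",", "", regex=False)
               .str.replace("+", "", regex=False)
               .astype(float))

print(installs.tolist())  # [10.0, 1000.0, 1000000000.0]
```

Each bucket is mapped to its lower bound, so averages computed from these numbers should be read as conservative estimates.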


Next, we compute the average number of installs for each category.

In [342]:
android_final.groupby(by='Category').agg({'Installs':'mean'}).sort_values(by='Installs',ascending=False).applymap("{0:,.0f}".format).head()

Category,Installs
COMMUNICATION,47166160
SOCIAL,27302664
VIDEO_PLAYERS,27115353
PRODUCTIVITY,20537622
PHOTOGRAPHY,18738970


On average, communication apps have the most installs: 47,166,160. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs.

In [346]:
android_final[android_final.Category == 'COMMUNICATION'].sort_values(by='Installs',ascending=False).head(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
451,Gmail,COMMUNICATION,4.3,4604483,Varies with device,1000000000.0,Free,0,Everyone,Communication,"August 2, 2018",Varies with device,Varies with device
411,Google Chrome: Fast & Secure,COMMUNICATION,4.3,9643041,Varies with device,1000000000.0,Free,0,Everyone,Communication,"August 1, 2018",Varies with device,Varies with device
464,Hangouts,COMMUNICATION,4.0,3419513,Varies with device,1000000000.0,Free,0,Everyone,Communication,"July 21, 2018",Varies with device,Varies with device
391,Skype - free IM & video calls,COMMUNICATION,4.1,10484169,Varies with device,1000000000.0,Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device
381,WhatsApp Messenger,COMMUNICATION,4.4,69119316,Varies with device,1000000000.0,Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device
382,Messenger – Text and Video Chat for Free,COMMUNICATION,4.0,56646578,Varies with device,1000000000.0,Free,0,Everyone,Communication,"August 1, 2018",Varies with device,Varies with device
371,Google Duo - High Quality Video Calls,COMMUNICATION,4.6,2083237,Varies with device,500000000.0,Free,0,Everyone,Communication,"July 31, 2018",37.1.206017801.DR37_RC14,4.4 and up
4676,Viber Messenger,COMMUNICATION,4.3,11335481,Varies with device,500000000.0,Free,0,Everyone,Communication,"July 18, 2018",Varies with device,Varies with device
420,UC Browser - Fast Download Private & Secure,COMMUNICATION,4.5,17714850,40M,500000000.0,Free,0,Teen,Communication,"August 2, 2018",12.8.5.1121,4.0 and up
383,imo free video calls and chat,COMMUNICATION,4.3,4785988,11M,500000000.0,Free,0,Everyone,Communication,"June 8, 2018",9.8.000000010501,4.0 and up


If we removed all the communication apps that have 100 million or more installs, the average would be roughly ten times smaller:

In [349]:

android_final_clean = android_final[android_final.Installs < 100000000]
android_final_clean[android_final_clean.Category == "COMMUNICATION"].groupby(by='Category').agg({'Installs':'mean'}).sort_values(by='Installs',ascending=False).applymap("{0:,.0f}".format)

Category,Installs
COMMUNICATION,4525998
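
Another way to look at the same skew, without choosing a cutoff, is to compare the mean with the median for each category. This is a swapped-in technique that the notebook does not use, and the numbers below are made up for illustration.

```python
import pandas as pd

# Toy data (made up): one giant app dominates the COMMUNICATION category.
toy = pd.DataFrame({
    "Category": ["COMMUNICATION"] * 4 + ["TOOLS"] * 3,
    "Installs": [1_000_000_000, 100_000, 50_000, 10_000,
                 1_000_000, 500_000, 100_000],
})

# The mean is pulled up by the single giant app; the median is not.
summary = toy.groupby("Category")["Installs"].agg(["mean", "median"])
print(summary)
```

In this toy example the mean ranks COMMUNICATION far above TOOLS, while the median ranks it below, which is the kind of reversal a few very large apps can cause.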
