## Profitable App Profiles for the App Store and Google Play Markets 

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
apple = pd.read_csv("AppleStore.csv")
google = pd.read_csv("googleplaystore.csv")

## Sneak Peek at the Data

We'll be working with two data sets in this project.

- A data set containing data about approximately ten thousand Android apps from Google Play — the data was collected in August 2018
    - https://www.kaggle.com/lava18/google-play-store-apps
- A data set containing data about approximately seven thousand iOS apps from the App Store — the data was collected in July 2017
    - https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps


In [3]:
apple.head(2)

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1


In [4]:
apple.describe()

Unnamed: 0,id,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
count,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0,7197.0
mean,863131000.0,199134500.0,1.726218,12892.91,460.373906,3.526956,3.253578,37.361817,3.7071,5.434903,0.993053
std,271236800.0,359206900.0,5.833006,75739.41,3920.455183,1.517948,1.809363,3.737715,1.986005,7.919593,0.083066
min,281656500.0,589824.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0
25%,600093700.0,46922750.0,0.0,28.0,1.0,3.5,2.5,37.0,3.0,1.0,1.0
50%,978148200.0,97153020.0,0.0,300.0,23.0,4.0,4.0,37.0,5.0,1.0,1.0
75%,1082310000.0,181924900.0,1.99,2793.0,140.0,4.5,4.5,38.0,5.0,8.0,1.0
max,1188376000.0,4025970000.0,299.99,2974676.0,177050.0,5.0,5.0,47.0,5.0,75.0,1.0


In [5]:
google.head(2)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up


In [6]:
google.describe(include='all')

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
count,10841,10841,9367.0,10841.0,10841,10841,10840,10841.0,10840,10841,10841,10833,10838
unique,9660,34,,6002.0,462,22,3,93.0,6,120,1378,2832,33
top,ROBLOX,FAMILY,,0.0,Varies with device,"1,000,000+",Free,0.0,Everyone,Tools,"August 3, 2018",Varies with device,4.1 and up
freq,9,1972,,596.0,1695,1579,10039,10040.0,8714,842,326,1459,2451
mean,,,4.193338,,,,,,,,,,
std,,,0.537431,,,,,,,,,,
min,,,1.0,,,,,,,,,,
25%,,,4.0,,,,,,,,,,
50%,,,4.3,,,,,,,,,,
75%,,,4.5,,,,,,,,,,


## Data Cleaning 
### Part 1: Modifying an incorrect data entry
The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [7]:
google.loc[10471:10473,:]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10471,Xposed Wi-Fi-Pwd,PERSONALIZATION,3.5,1042,404k,"100,000+",Free,0,Everyone,Personalization,"August 5, 2014",3.0.0,4.0.3 and up
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,
10473,osmino Wi-Fi: free WiFi,TOOLS,4.2,134203,4.1M,"10,000,000+",Free,0,Everyone,Tools,"August 7, 2018",6.06.14,4.4 and up


It appears that row 10472 is missing its *Category* value, which shifted every subsequent value one column to the left. Instead of deleting this row, we'll shift those values back so that they line up with the header row.

In [8]:
google.iloc[10472,1:] = google.iloc[10472,1:].shift(1)

Now the values in row 10472 line up with the header, with *Category* left as missing.

In [9]:
google.loc[10472,:]

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              NaN
Rating                                                1.9
Reviews                                                19
Size                                                 3.0M
Installs                                           1,000+
Type                                                 Free
Price                                                   0
Content Rating                                   Everyone
Genres                                                NaN
Last Updated                            February 11, 2018
Current Ver                                        1.0.19
Android Ver                                    4.0 and up
Name: 10472, dtype: object
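
To make the repair easier to follow, here is a minimal sketch of the same `shift(1)` idea on a toy row. The values and the shortened column list are made up for illustration; only the technique mirrors the cell above.

```python
import pandas as pd

# Toy row with made-up values: everything after "App" sits one column too far left.
row = pd.Series(
    ["Photo Frame", "1.9", 19.0, "3.0M", float("nan")],
    index=["App", "Category", "Rating", "Reviews", "Size"],
)

# Shift every value after "App" one position to the right.
# The vacated first slot ("Category") becomes NaN and the last value drops off.
row.iloc[1:] = row.iloc[1:].shift(1)

print(row)
```

Shifted values keep whatever type they had in their old column, so a numeric column may still need an explicit conversion (for example with `pd.to_numeric`) afterwards.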

### Part 2 : Removing duplicate entries

We notice that there are some duplicate entries for the same App in the GooglePlay dataset. For example:

In [10]:
google[google["App"]== "Instagram"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


Let's count the number of duplicate entries to confirm. 

In [11]:
app_entry = {}
for name in google.App:
    if name in app_entry:
        app_entry[name] += 1
    else:
        app_entry[name] = 1        

In [12]:
duplicate = dict((k, v) for k, v in app_entry.items() if v > 1)
single =  dict((k, v) for k, v in app_entry.items() if v == 1)
print(len(duplicate))
print(len(single))

798
8862
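
As a side note, the same tally can be sketched with `collections.Counter` from the standard library. The list of names below is a made-up stand-in for the `google["App"]` column, so the counts it prints are not the ones reported above.

```python
from collections import Counter

# Hypothetical app names standing in for the google["App"] column.
names = ["Instagram", "Slack", "Instagram", "Box", "Slack", "Instagram"]

# Counter builds the name -> number of occurrences mapping in one step.
counts = Counter(names)

duplicate = {name: n for name, n in counts.items() if n > 1}
single = {name: n for name, n in counts.items() if n == 1}

print(len(duplicate))  # number of names that occur more than once
print(len(single))     # number of names that occur exactly once
```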


We don't want to count any app more than once when we analyze the data, so we need to remove the duplicate entries. Examining the rows we printed for Instagram, we see that the main difference between the multiple entries lies in the fourth column, *Reviews*, which holds the number of reviews. Assuming that a higher number of reviews corresponds to a more recent entry, we'll keep only the entry with the highest review count for each duplicated app.

In [13]:
google["Reviews"] = pd.to_numeric(google["Reviews"], errors="coerce")
google_sorted = google.sort_values(by=["App", "Reviews"])
google = google_sorted.drop_duplicates(subset="App", keep="last")
# # Create a dictionary to store the highest number of reviews
# app_review = {}
# for index, row in google.iterrows():
#     name = row["App"]
#     review = row["Reviews"]
#     if name in app_review:
#         if review > app_review[name]:
#             app_review[name] = review
#     else:
#         app_review[name] = review
# print(len(app_review))
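
The sort-then-drop approach can be checked on a toy frame. This is only a sketch: the column names mirror the data set, and the small frame itself is an assumption built for illustration.

```python
import pandas as pd

# Toy frame: three entries for one app with different review counts, plus one unique app.
toy = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Box"],
    "Reviews": ["66577313", "66577446", "66509917", "159"],
})

# Review counts arrive as strings, so convert them before sorting.
toy["Reviews"] = pd.to_numeric(toy["Reviews"], errors="coerce")

# After sorting by name and review count, the last row of each app has the most reviews.
deduped = (
    toy.sort_values(by=["App", "Reviews"])
       .drop_duplicates(subset="App", keep="last")
)

print(deduped)
```

One caveat worth keeping in mind: `sort_values` places missing values last by default, so a duplicate whose review count could not be parsed would be the one kept. Passing `na_position="first"` to `sort_values` is one way to avoid that.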

### Part 3: Removing non-English Apps

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters, otherwise it doesn't.

In [14]:
def check_eng(string):
    num_special = 0
    for char in string:
        if ord(char) > 127:
            num_special += 1
    return num_special <= 3

Let's test if our check_eng function works as expected.

In [15]:
print(check_eng("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(check_eng("Instachat 😜"))
print(check_eng('Docs To Go™ Free Office Suite'))

False
True
True


Now we'll use our check_eng function to flag non-English apps. We apply it to the app names in the Google Play data set and store the result in a new *IfEnglish* column.

In [21]:
google["IfEnglish"] = google["App"].apply(check_eng)

Let's manually examine the rows flagged as non-English to make sure our check_eng function did the right thing.

In [20]:
google[google["IfEnglish"]==0]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,IfEnglish
5266,AJ렌터카 법인 카셰어링,MAPS_AND_NAVIGATION,,0.0,27M,10+,Free,0,Everyone,Maps & Navigation,"July 30, 2018",1.0.0.0,4.3 and up,False
5346,Al Quran Free - القرآن (Islam),BOOKS_AND_REFERENCE,4.7,1777.0,23M,"50,000+",Free,0,Everyone,Books & Reference,"February 15, 2015",1.1,2.2 and up,False
5841,Ay Yıldız Duvar Kağıtları,PERSONALIZATION,,3.0,6.5M,100+,Free,0,Everyone,Personalization,"December 10, 2017",1.0.0,4.3 and up,False
9971,AÖF Ev İdaresi 1. Sınıf,FAMILY,,2.0,11M,"1,000+",Free,0,Everyone,Education,"July 15, 2018",3.0,4.1 and up,False
6417,BL 女性向け恋愛ゲーム◆ごくメン,FAMILY,4.2,1901.0,8.2M,"100,000+",Free,0,Mature 17+,Simulation,"July 7, 2016",1.3.0,2.3.3 and up,False
6406,BL 女性向け恋愛ゲーム◆俺プリクロス,FAMILY,4.2,3379.0,62M,"100,000+",Free,0,Mature 17+,Simulation,"March 23, 2017",1.6.3,2.3.3 and up,False
6629,BQ-መጽሐፍ ቅዱሳዊ ጥያቄዎች,GAME,4.7,191.0,7.2M,"5,000+",Free,0,Everyone,Trivia,"July 31, 2018",4.1.2,4.1 and up,False
6729,BS Calendar / Patro / पात्रो,PRODUCTIVITY,4.2,218.0,Varies with device,"50,000+",Free,0,Everyone,Productivity,"July 15, 2018",Varies with device,Varies with device,False
7396,Bonjour 2017 Abidjan CI ❤❤❤❤❤,FAMILY,,235.0,3.3M,"10,000+",Free,0,Everyone,Entertainment,"February 16, 2017",1.0.2.0,2.0 and up,False
7463,CK 初一 十五,LIFESTYLE,4.0,294.0,153k,"10,000+",Free,0,Everyone,Lifestyle,"July 3, 2013",1.0.12,2.1 and up,False
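
The cell above only adds a flag column. A minimal sketch of how that flag could be used to actually drop rows is shown below; the toy frame and the name `english_only` are assumptions for illustration. The same steps could be applied to the App Store data by using its `track_name` column.

```python
import pandas as pd

def check_eng(string):
    # Same rule as above: tolerate at most three characters outside the ASCII range.
    return sum(ord(char) > 127 for char in string) <= 3

# Toy frame reusing the three names tested earlier.
toy = pd.DataFrame({
    "App": [
        "Instachat 😜",
        "爱奇艺PPS -《欢乐颂2》电视剧热播",
        "Docs To Go™ Free Office Suite",
    ]
})

toy["IfEnglish"] = toy["App"].apply(check_eng)

# Keep only the rows flagged as English, then drop the helper column.
english_only = toy[toy["IfEnglish"]].drop(columns="IfEnglish")

print(english_only)
```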


### Part 4 : Isolating the free apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis.

In [22]:
google = google[google["Price"]=="0"]
apple = apple[apple["price"]==0]
print(google.shape)
print(apple.shape)

(8906, 14)
(4056, 16)


## Gaining Insights from Data
## Q1: what are the most common genres for each market? 

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets.

As a first step, let's find out what genres are most common for each market.


### App Store

In [19]:
apple.prime_genre.value_counts(ascending=False,normalize=True).head(10)*100

Games                55.645957
Entertainment         8.234714
Photo & Video         4.117357
Social Networking     3.525641
Education             3.254438
Shopping              2.983235
Utilities             2.687377
Lifestyle             2.317554
Finance               2.071006
Sports                1.947732
Name: prime_genre, dtype: float64

Among the free English apps, more than half (55.65%) are games. Entertainment apps account for more than 8%, followed by photo and video apps at just over 4%.

### GooglePlay Market

There are two relevant variables we can use in the GooglePlay dataset: *Genres* and *Category*. Since the *Genres* column is much more granular (it has more categories), we'll focus on *Category* for a bigger-picture view. Let's make a normalized frequency table for *Category*.


In [22]:
google.Category.value_counts(ascending=False,normalize=True).head(15)*100

FAMILY                19.011791
GAME                   9.691185
TOOLS                  8.433464
BUSINESS               4.581696
LIFESTYLE              3.930376
PRODUCTIVITY           3.885458
FINANCE                3.683324
MEDICAL                3.514879
SPORTS                 3.380124
PERSONALIZATION        3.312746
COMMUNICATION          3.234138
HEALTH_AND_FITNESS     3.065693
PHOTOGRAPHY            2.942167
NEWS_AND_MAGAZINES     2.829871
SOCIAL                 2.650197
Name: Category, dtype: float64

The landscape seems significantly different on Google Play: no single category dominates the market, and we see a higher share of apps designed for practical purposes. However, if we look closer, the family category (which accounts for about 19% of the apps) consists mostly of games for kids.

The most common genres may not be the most popular genres. In the next section, we'll explore what kinds of apps have the most users.

## Q2 : what are the most popular genres for each market?
The metric we'll use to measure popularity is the average number of installs for each app genre. For the Google Play data set, we can find this information in the *Installs* column, but the App Store data set has no install counts. As a proxy, we'll take the total number of user ratings, which we can find in the *rating_count_tot* column.

### App Store


In [31]:
# Select the column before aggregating so non-numeric columns are never averaged
apple_avg_rating_by_genre = apple.groupby("prime_genre")["rating_count_tot"].mean().sort_values(ascending=False)
apple_avg_rating_by_genre

prime_genre
Reference            67447.900000
Music                56482.029851
Social Networking    53078.195804
Weather              47220.935484
Photo & Video        27249.892216
Navigation           25972.050000
Travel               20216.017857
Food & Drink         20179.093023
Sports               20128.974684
Health & Fitness     19952.315789
Productivity         19053.887097
Games                18924.688968
Shopping             18746.677686
News                 15892.724138
Utilities            14010.100917
Finance              13522.261905
Entertainment        10822.961078
Lifestyle             8978.308511
Book                  8498.333333
Business              6367.800000
Education             6266.333333
Catalogs              1779.555556
Medical                459.750000
Name: rating_count_tot, dtype: float64

We see that the most popular genres are significantly different from the most common genres. The genre with the highest average number of user ratings is Reference, followed by Music, Social Networking and Weather. Our popularity metric for the most common genre, Games, is only about a quarter of that for the Reference genre.

However, the average number of user ratings only gives us a very high-level overview of the popularity of genres. Let's look closer.

In [41]:
reference = apple.loc[apple["prime_genre"] == "Reference", ["track_name","rating_count_tot"]]
reference.head(10)

Unnamed: 0,track_name,rating_count_tot
6,Bible,985920
90,Dictionary.com Dictionary & Thesaurus,200047
335,Dictionary.com Dictionary & Thesaurus for iPad,54175
551,Google Translate,26786
715,"Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q...",18418
738,New Furniture Mods - Pocket Wiki & Game Tools ...,17588
757,Merriam-Webster Dictionary,16849
913,Night Sky,12122
1106,City Maps for Minecraft PE - The Best Maps for...,8535
1451,LUCKY BLOCK MOD ™ for Minecraft PC Edition - T...,4693


In [24]:
social = apple.loc[apple["prime_genre"] == "Social Networking", ["track_name","rating_count_tot"]]
social.head(10)

Unnamed: 0,track_name,rating_count_tot
0,Facebook,2974676
5,Pinterest,1061624
43,Skype for iPhone,373519
48,Messenger,351466
51,Tumblr,334293
63,WhatsApp Messenger,287589
72,Kik,260965
111,"ooVoo – Free Video Call, Text and Voice",177501
117,TextNow - Unlimited Text + Calls,164963
120,Viber Messenger – Text & Call,164249


For Social Networking apps, the average number of ratings is heavily influenced by a few giants such as Facebook, Pinterest, and Skype. The same applies to Reference apps, which have 67,448 user ratings on average according to the table above, but it's actually the Bible and Dictionary.com that pull the average up. We see the same pattern in Navigation and Music apps.

One way to address the skewness of rating counts is to remove the extremely popular apps from each genre and then rework the averages. A better way may be to keep the outliers in the data but make sure to highlight these high-influence apps when presenting our findings to the business development team.
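
To illustrate how strongly a couple of apps can pull the average, here is a small sketch using the ten Reference rating counts printed above. It covers only that subset of the genre, and the 100,000 cutoff is an arbitrary assumption chosen for the example.

```python
from statistics import mean, median

# Rating counts of the ten Reference apps listed above (a subset of the genre).
ratings = [985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535, 4693]

overall_mean = mean(ratings)
overall_median = median(ratings)

# Hypothetical cutoff: treat apps with more than 100,000 ratings as outliers.
cutoff = 100_000
trimmed = [count for count in ratings if count <= cutoff]
trimmed_mean = mean(trimmed)

print(f"mean with all ten apps:  {overall_mean:,.1f}")
print(f"median of all ten apps:  {overall_median:,.1f}")
print(f"mean without outliers:   {trimmed_mean:,.1f}")
```

Reporting the median next to the mean is another way to keep the outliers in the data while still showing what a typical app in the genre looks like.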

### GooglePlay Market

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers are not precise: most values are open-ended buckets (100+, 1,000+, 5,000+, etc.):

In [25]:
google["Installs"].value_counts().head(10)

1,000,000+     1397
100,000+       1031
10,000,000+     935
10,000+         913
1,000+          751
100+            616
5,000,000+      607
500,000+        493
50,000+         429
5,000+          403
Name: Installs, dtype: int64

Though the precision of these numbers is not ideal, we'll work with what is available to get a high-level sense of the popularity of each genre. To compute averages, we need to convert each install number to a float, which means removing the commas and the plus characters first.

In [29]:
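# Remove the thousands separators and the trailing plus sign as literal characters, then convert to float
google["Installs"] = google["Installs"].str.replace(",", "", regex=False).str.replace("+", "", regex=False)
google["Installs"] = google["Installs"].astype(float)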

In [32]:
# Select the column before aggregating so non-numeric columns are never averaged
google_avg_install_by_genre = google.groupby("Category")["Installs"].mean().sort_values(ascending=False)
google_avg_install_by_genre

Category
COMMUNICATION          3.832263e+07
VIDEO_PLAYERS          2.457395e+07
SOCIAL                 2.325365e+07
PHOTOGRAPHY            1.777202e+07
PRODUCTIVITY           1.673896e+07
GAME                   1.555843e+07
TRAVEL_AND_LOCAL       1.398408e+07
ENTERTAINMENT          1.171976e+07
TOOLS                  1.078701e+07
NEWS_AND_MAGAZINES     9.401636e+06
BOOKS_AND_REFERENCE    8.587352e+06
SHOPPING               7.001693e+06
PERSONALIZATION        5.183851e+06
WEATHER                5.074486e+06
HEALTH_AND_FITNESS     4.188822e+06
MAPS_AND_NAVIGATION    3.993340e+06
FAMILY                 3.671820e+06
SPORTS                 3.638640e+06
ART_AND_DESIGN         1.952105e+06
FOOD_AND_DRINK         1.924898e+06
EDUCATION              1.833495e+06
BUSINESS               1.708216e+06
LIFESTYLE              1.436127e+06
FINANCE                1.387692e+06
HOUSE_AND_HOME         1.331541e+06
DATING                 8.540288e+05
COMICS                 8.032348e+05
AUTO_AND_VEHICLES  

On average, communication apps have the most installs. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [48]:
google.loc[google["Category"] == "COMMUNICATION",["App","Installs"]].sort_values(by="Installs",ascending=False)

Unnamed: 0,App,Installs
451,Gmail,1.000000e+09
4234,Skype - free IM & video calls,1.000000e+09
464,Hangouts,1.000000e+09
381,WhatsApp Messenger,1.000000e+09
382,Messenger – Text and Video Chat for Free,1.000000e+09
411,Google Chrome: Fast & Secure,1.000000e+09
474,LINE: Free Calls & Messages,5.000000e+08
4676,Viber Messenger,5.000000e+08
420,UC Browser - Fast Download Private & Secure,5.000000e+08
4039,Google Duo - High Quality Video Calls,5.000000e+08


We see the same pattern for the video players category, which is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

As with our analysis of the App Store apps, the main concern is that these app genres

- appear more popular than they really are;
- are already highly saturated with a few giants dominating the market. 

So we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well. We found that this genre has some potential to work well on both Google Play and the App Store.

Let's take a look at some of the apps from this genre and their number of installs:

In [49]:
google.loc[google["Category"] == "BOOKS_AND_REFERENCE",["App","Installs"]].sort_values(by="Installs",ascending=False)

Unnamed: 0,App,Installs
152,Google Play Books,1.000000e+09
3941,Bible,1.000000e+08
4715,Wattpad 📖 Free Books,1.000000e+08
5651,Audiobooks from Audible,1.000000e+08
4083,Amazon Kindle,1.000000e+08
5319,Al-Quran (Free),1.000000e+07
5345,Quran for Android,1.000000e+07
9625,JW Library,1.000000e+07
179,Moon+ Reader,1.000000e+07
144,Cool Reader,1.000000e+07


## Conclusion
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that 

- the most popular genres, such as Social Networking (Communication), are already highly saturated and dominated by a few giants, thus presenting high barriers to entry;
- taking a popular collection of books and turning it into an app could be profitable for both the Google Play and the App Store markets. In order to distinguish our products from the already existing libraries, we need to add some special features, such as audio versions of the books in our collection, annotation and sharing tools, etc.