# Profitable App Profiles for the App Store and Google Play Markets

This project investigates apps that are already available on Google Play and the App Store and analyzes what makes them profitable.
My goal is to understand which kinds of apps are profitable and how user engagement with an app can be maximized.

I worked with the two data sets below:

[A data set](https://www.kaggle.com/lava18/google-play-store-apps/home#googleplaystore.csv) containing data about approximately ten thousand Android apps from Google Play.

[A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store.

# Introduction

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Read the AppleStore data
ios = pd.read_csv('AppleStore.csv')

# Read the Googleplaystore data
android = pd.read_csv('googleplaystore.csv')

In [2]:
print(ios.shape)
ios.head()

(7197, 17)


Unnamed: 0.1,Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,1,281656475,PAC-MAN Premium,100788224,USD,3.99,21292,26,4.0,4.5,6.3.5,4+,Games,38,5,10,1
1,2,281796108,Evernote - stay organized,158578688,USD,0.0,161065,26,4.0,3.5,8.2.2,4+,Productivity,37,5,23,1
2,3,281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032,USD,0.0,188583,2822,3.5,4.5,5.0.0,4+,Weather,37,5,3,1
3,4,282614216,"eBay: Best App to Buy, Sell, Save! Online Shop...",128512000,USD,0.0,262241,649,4.0,4.5,5.10.0,12+,Shopping,37,5,9,1
4,5,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1


In [3]:
print(android.shape)
android.head()

(10841, 13)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


I see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

There are 7197 iOS apps in the App Store data set, and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Not all column names are self-explanatory in this case, but details about each column can be found in the data set documentation.

# Data Cleaning

## Deleting Wrong Data

The Google Play data set has a dedicated discussion section, and one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [4]:
print('incorrect row:\n\n', android.iloc[10472], '\n\n\ncorrect row:\n', )  # incorrect row
android.iloc[0]     # correct row

incorrect row:

 App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    0
Price                                            Everyone
Content Rating                                        NaN
Genres                                  February 11, 2018
Last Updated                                       1.0.19
Current Ver                                    4.0 and up
Android Ver                                           NaN
Name: 10472, dtype: object 


correct row:



App               Photo Editor & Candy Camera & Grid & ScrapBook
Category                                          ART_AND_DESIGN
Rating                                                       4.1
Reviews                                                      159
Size                                                         19M
Installs                                                 10,000+
Type                                                        Free
Price                                                          0
Content Rating                                          Everyone
Genres                                              Art & Design
Last Updated                                     January 7, 2018
Current Ver                                                1.0.0
Android Ver                                         4.0.3 and up
Name: 0, dtype: object

Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and its rating is 19. This is clearly wrong because the maximum rating for a Google Play app is 5. Comparing it with the correct row shows that the Category value is missing, so every later value is shifted one column to the left. I'll therefore delete this row.

In [5]:
print(len(android))
android.drop([10472], axis=0, inplace=True)
print(len(android))

10841
10840
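
The check that exposed row 10472 can also be run over the whole column rather than one row at a time. The following is a minimal sketch on a toy frame (the app names and ratings are copied from the output above, the frame itself is made up for illustration); it flags any non-missing rating outside the 0 to 5 range.

```python
import pandas as pd

# Toy frame: the third row mimics the shifted record with a rating of 19.
ratings = pd.DataFrame({
    "App": [
        "Photo Editor & Candy Camera & Grid & ScrapBook",
        "Coloring book moana",
        "Life Made WI-Fi Touchscreen Photo Frame",
    ],
    "Rating": [4.1, 3.9, 19.0],
})

# Flag ratings that are present but fall outside [0, 5].
# Missing ratings (NaN) are left alone; they are not necessarily errors.
bad = ratings["Rating"].notna() & ~ratings["Rating"].between(0, 5)
out_of_range = ratings[bad]
print(out_of_range)

# Drop the flagged rows by index label.
cleaned = ratings.drop(index=out_of_range.index)
print(len(ratings), len(cleaned))
```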


## Removing Duplicate Entries

Next, I count how many apps appear more than once.

In [6]:
android_duplicated = android['App'].value_counts()[android['App'].value_counts()>1]
print('Number of unique apps: ', len(android) - android_duplicated.sum() + len(android_duplicated))        
print('Number of duplicated apps: ', android_duplicated.sum() - len(android_duplicated))  
print('\n')
print('Examples of duplicate apps:')
print(android_duplicated.head().index)


Number of unique apps:  9659
Number of duplicated apps:  1181


Examples of duplicate apps:
Index(['ROBLOX', 'CBS Sports App - Scores, News, Stats & Watch Live',
       'Duolingo: Learn Languages Free', 'ESPN', '8 Ball Pool'],
      dtype='object')


Next, I look for a pattern that tells me which of the duplicate rows to keep.

In [7]:
# Let's look at Instagram app
android[android['App']=='Instagram']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2545,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
2611,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
3909,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


Duplicate rows should not be deleted at random. Looking at the rows for a duplicated app ('Instagram' in this case), I observe that only the number of reviews changes. A higher review count suggests a more recent record, so I use this information for row deletion and keep the row with the highest number of reviews.

In [8]:
# Sort the data according to reviews and drop duplicates with low reviews
android_clean = android.copy().sort_values('Reviews', ascending=False)
android_clean.drop_duplicates(subset='App', keep='first', inplace=True)
len(android_clean)

9659
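
One caveat about the sort above: row 10472 held '3.0M' in the Reviews column, which suggests pandas read the whole column as strings. If that is the case, `sort_values` orders the values lexicographically rather than numerically, and "first" is not always the row with the most reviews. The sketch below shows the difference on a toy frame and one possible fix with `pd.to_numeric`; the value "9999" is made up for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram"],
    # Strings, as they would be if the column was read with dtype object.
    "Reviews": ["66577313", "66577446", "9999"],
})

# Text sort: "9999" comes first because the character "9" sorts after "6".
by_text = reviews.sort_values("Reviews", ascending=False).drop_duplicates("App")
print(by_text["Reviews"].iloc[0])

# Numeric sort: convert first, then the largest review count comes first.
reviews["Reviews"] = pd.to_numeric(reviews["Reviews"])
by_number = reviews.sort_values("Reviews", ascending=False).drop_duplicates("App")
print(by_number["Reviews"].iloc[0])
```

The number of rows kept is the same either way; only the choice of which duplicate survives can differ.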

In [9]:
ios_app_frequency = ios['track_name'].value_counts()
for app in ios_app_frequency[ios_app_frequency>1].index:
    print(ios.loc[ios['track_name']==app, 'ver'])
    
# Delete the app with older version
ios.drop([7128, 5603], inplace=True)

3319    2.0.0
5603     0.81
Name: ver, dtype: object
7092      1.4
7128    1.0.1
Name: ver, dtype: object
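
The two rows dropped above were chosen by reading the printed versions. As a sketch, the same choice can be made programmatically by comparing version strings as tuples of integers. This assumes purely numeric, dot-separated versions, which holds for the four values printed above but not necessarily for the whole `ver` column; `version_key` is a hypothetical helper, not part of the original notebook.

```python
def version_key(version):
    """Turn '1.0.1' into (1, 0, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

# Row index -> version, copied from the output above, one dict per duplicated app.
duplicate_groups = [
    {3319: "2.0.0", 5603: "0.81"},
    {7092: "1.4", 7128: "1.0.1"},
]

# In each group, the row with the smallest version key is the older one.
to_drop = [
    min(group, key=lambda idx: version_key(group[idx]))
    for group in duplicate_groups
]
print(to_drop)
```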


## Removing Non-English Apps

Next, I detect apps with non-English names and remove them, because the target audience is English-speaking.

The function below detects non-English names.
A name that contains a few symbols or emoji is not necessarily non-English, so rejecting every name with a non-ASCII character would discard useful data. That's why I add a criterion: a name counts as non-English only if it contains more than three non-ASCII characters.

In [10]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127: # ASCII characters have code points from 0 to 127
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

Now I remove the non-English apps from the Android and iOS data sets.

In [11]:
# For Android
android_english = android_clean[android_clean['App'].apply(is_english)].copy()
print(len(android_english))

# For iOS
ios_english = ios[ios['track_name'].apply(is_english)].copy()
print(len(ios_english))

9614
6181


## Isolating the Free Apps

After removing the non-English apps, I now remove the paid apps, since the analysis targets free apps only.

In [12]:
# For Android
android_final = android_english[android_english['Price']=='0'].copy()
print('# of android apps to be analyzed: ', len(android_final))

# For iOS
ios_final = ios_english[ios_english['price']==0].copy()    
print('# of ios apps to be analyzed: ', len(ios_final))

# of android apps to be analyzed:  8862
# of ios apps to be analyzed:  3220
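
The two filters above look different because the price columns hold different types: the Google Play 'Price' is compared with the string '0', while the App Store 'price' is compared with the number 0. If numeric prices are needed later, the strings can be converted. The sketch below assumes paid prices are written like '$4.99'; that format is an assumption, since the rows shown in this notebook only contain '0'.

```python
import pandas as pd

# Toy values; the "$" prefix on paid apps is an assumed format.
prices = pd.Series(["0", "$4.99", "0", "$0.99"])

# Strip the currency symbol literally (regex=False), then convert to numbers.
as_float = pd.to_numeric(prices.str.replace("$", "", regex=False))
free_mask = as_float == 0

print(as_float.tolist())
print(free_mask.tolist())
```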


So far, I have:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Isolated the free apps

As I mentioned in the introduction, the aim is to determine the kinds of apps that are likely to attract more users because the revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1) Build a minimal Android version of the app, and add it to Google Play.
2) If the app has a good response from users, develop it further.
3) If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because the end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in each market.

# Most Popular Apps by Genre on the App Store


In [13]:
# The frequency table for the prime_genre column of the App Store
ios_final['prime_genre'].value_counts(normalize=True, ascending=False)*100

Games                58.136646
Entertainment         7.888199
Photo & Video         4.968944
Education             3.664596
Social Networking     3.291925
Shopping              2.608696
Utilities             2.515528
Sports                2.142857
Music                 2.049689
Health & Fitness      2.018634
Productivity          1.739130
Lifestyle             1.583851
News                  1.335404
Travel                1.242236
Finance               1.118012
Weather               0.869565
Food & Drink          0.807453
Reference             0.559006
Business              0.527950
Book                  0.434783
Navigation            0.186335
Medical               0.186335
Catalogs              0.124224
Name: prime_genre, dtype: float64

I observe that among the free English apps, more than half (58.14%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in the data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: demand might not match supply.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).
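
Since the same percentage table is computed for several columns, it could be wrapped in a small helper. This is a sketch only; `freq_table` is a hypothetical name, and the toy frame stands in for `ios_final` or `android_final`.

```python
import pandas as pd

def freq_table(df, column):
    """Share of rows per value of `column`, in percent, largest first."""
    return df[column].value_counts(normalize=True) * 100

# Toy data: three games and one education app.
toy = pd.DataFrame({"prime_genre": ["Games", "Games", "Games", "Education"]})
table = freq_table(toy, "prime_genre")
print(table)
```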

In [14]:
# The frequency table for the category column of the Google Play
android_final['Category'].value_counts(normalize=True, ascending=False)*100

FAMILY                 18.946062
GAME                    9.693072
TOOLS                   8.451817
BUSINESS                4.592643
LIFESTYLE               3.904311
PRODUCTIVITY            3.893026
FINANCE                 3.701196
MEDICAL                 3.520650
SPORTS                  3.396524
PERSONALIZATION         3.317536
COMMUNICATION           3.238547
HEALTH_AND_FITNESS      3.080569
PHOTOGRAPHY             2.945159
NEWS_AND_MAGAZINES      2.798465
SOCIAL                  2.663056
TRAVEL_AND_LOCAL        2.335816
SHOPPING                2.245543
BOOKS_AND_REFERENCE     2.143986
DATING                  1.861882
VIDEO_PLAYERS           1.794177
MAPS_AND_NAVIGATION     1.399233
FOOD_AND_DRINK          1.241255
EDUCATION               1.173550
ENTERTAINMENT           0.947867
LIBRARIES_AND_DEMO      0.936583
AUTO_AND_VEHICLES       0.925299
HOUSE_AND_HOME          0.823742
WEATHER                 0.801174
EVENTS                  0.710900
PARENTING               0.654480
ART_AND_DE

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids. The second-largest category is games, at around 9.7%.

Even so, practical apps seem to have a better representation on Google Play compared to the App Store. This picture is also confirmed by the frequency table for the Genres column:

In [15]:
# The frequency table for the genre column of the Google Play 
android_final['Genres'].value_counts(normalize=True, ascending=False)*100

Tools                                  8.440533
Entertainment                          6.070864
Education                              5.348680
Business                               4.592643
Productivity                           3.893026
Lifestyle                              3.893026
Finance                                3.701196
Medical                                3.520650
Sports                                 3.464229
Personalization                        3.317536
Communication                          3.238547
Action                                 3.103137
Health & Fitness                       3.080569
Photography                            2.945159
News & Magazines                       2.798465
Social                                 2.663056
Travel & Local                         2.324532
Shopping                               2.245543
Books & Reference                      2.143986
Simulation                             2.042428
Dating                                 1

The difference between the Genres and the Category columns is not crystal clear, but one thing I can notice is that the Genres column is much more granular (it has more categories). I'm only looking for the bigger picture at the moment, so I'll only work with the Category column moving forward.

Up to this point, I found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now I'd like to get an idea of the kinds of apps that have the most users.

In [16]:
# Calculate average number of ratings per prime genre
ios_final.groupby('prime_genre')['rating_count_tot'].mean().sort_values(ascending=False)

prime_genre
Navigation           86090.333333
Reference            74942.111111
Social Networking    71548.349057
Music                57326.530303
Weather              52279.892857
Book                 39758.500000
Food & Drink         33333.923077
Finance              31467.944444
Photo & Video        28441.543750
Travel               28243.800000
Shopping             26919.690476
Health & Fitness     23298.015385
Sports               23008.898551
Games                22812.924679
News                 21248.023256
Productivity         21028.410714
Utilities            18684.456790
Lifestyle            16485.764706
Entertainment        14029.830709
Business              7491.117647
Education             7003.983051
Catalogs              4004.000000
Medical                612.000000
Name: rating_count_tot, dtype: float64

In [17]:
# Rating count for navigation genre
ios_final[ios_final['prime_genre']=='Navigation'][['track_name', 'rating_count_tot']].sort_values(by='rating_count_tot', ascending=False)

Unnamed: 0,track_name,rating_count_tot
174,"Waze - GPS Navigation, Maps & Real-time Traffic",345046
1693,Google Maps - Navigation & Transit,154911
200,Geocaching®,12811
1203,CoPilot GPS – Car Navigation & Offline Maps,3582
280,ImmobilienScout24: Real Estate Search in Germany,187
959,Railway Route Search,5


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

The aim is to find popular genres, but navigation, social networking, or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold.
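
One way to quantify this skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the six navigation rating counts from the table above; the shortened app names are only labels.

```python
import pandas as pd

# Rating counts copied from the navigation table above.
navigation = pd.Series(
    {
        "Waze": 345046,
        "Google Maps": 154911,
        "Geocaching": 12811,
        "CoPilot GPS": 3582,
        "ImmobilienScout24": 187,
        "Railway Route Search": 5,
    },
    name="rating_count_tot",
)

mean_ratings = navigation.mean()      # pulled up by Waze and Google Maps
median_ratings = navigation.median()  # middle of the six values

print(round(mean_ratings, 2), median_ratings)
```

The mean matches the 86,090 reported for the genre, while the median is about ten times smaller, which supports the reading that two apps drive the average.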

In [18]:
# Rating count for reference genre
ios_final[ios_final['prime_genre']=='Reference'][['track_name', 'rating_count_tot']].sort_values(by='rating_count_tot', ascending=False)

Unnamed: 0,track_name,rating_count_tot
4,Bible,985920
116,Dictionary.com Dictionary & Thesaurus,200047
375,Dictionary.com Dictionary & Thesaurus for iPad,54175
681,Google Translate,26786
503,"Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q...",18418
6462,New Furniture Mods - Pocket Wiki & Game Tools ...,17588
587,Merriam-Webster Dictionary,16849
1023,Night Sky,12122
6567,City Maps for Minecraft PE - The Best Maps for...,8535
6506,LUCKY BLOCK MOD ™ for Minecraft PC Edition - T...,4693


Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com that skew the average number of ratings upward.

However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea described above, but the other genres don't seem too interesting to us:

- Weather apps: people generally don't spend much time in-app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.
- Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.
- Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

# Most Popular Apps by Genre on Google Play

For the Google Play market, there is actual data about the number of installs, so I should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough: most values are open-ended (100+, 1,000+, 5,000+, etc.).

In [19]:
android_final['Installs'].value_counts()

1,000,000+        1395
100,000+          1024
10,000,000+        932
10,000+            904
1,000+             744
100+               613
5,000,000+         606
500,000+           494
50,000+            423
5,000+             400
10+                314
500+               288
50,000,000+        203
100,000,000+       188
50+                170
5+                  70
1+                  45
500,000,000+        24
1,000,000,000+      20
0+                   4
0                    1
Name: Installs, dtype: int64

One problem with this data is that it is not precise. For instance, I don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, I don't need very precise data for my purposes: I only want to get an idea of which app genres attract the most users.

I'm going to leave the numbers as they are, which means that I'll treat an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on.

To perform computations, however, I'll need to convert each install number to a float. This means removing the commas and the plus characters first; otherwise the conversion will fail and raise an error.

In [20]:
# Clean the Install column removing '+' and ',' then transform the data into float
android_final['Installs'] = android_final['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False).astype('float')
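
As a quick sanity check, the same cleaning steps can be tried on a handful of values taken from the frequency table above. This is a sketch on a standalone Series, not on `android_final`. Passing `regex=False` makes the replacement literal, which matters for '+' because it is a special character in regular expressions.

```python
import pandas as pd

# A few raw values as they appear in the Installs column.
raw = pd.Series(["10,000+", "1,000,000,000+", "0+", "0"])

installs = (
    raw.str.replace(",", "", regex=False)  # drop thousands separators
       .str.replace("+", "", regex=False)  # drop the open-ended marker
       .astype(float)
)
print(installs.tolist())
```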

In [21]:
# Calculate average number of installs per genre
android_final.groupby('Category')['Installs'].mean().sort_values(ascending=False)

Category
COMMUNICATION          3.845612e+07
VIDEO_PLAYERS          2.472787e+07
SOCIAL                 2.325365e+07
PHOTOGRAPHY            1.780563e+07
PRODUCTIVITY           1.678733e+07
GAME                   1.556097e+07
TRAVEL_AND_LOCAL       1.398408e+07
ENTERTAINMENT          1.171976e+07
TOOLS                  1.068230e+07
NEWS_AND_MAGAZINES     9.549178e+06
BOOKS_AND_REFERENCE    8.767812e+06
SHOPPING               7.036877e+06
PERSONALIZATION        5.201483e+06
WEATHER                5.074486e+06
HEALTH_AND_FITNESS     4.188822e+06
MAPS_AND_NAVIGATION    4.056942e+06
FAMILY                 3.695054e+06
SPORTS                 3.638640e+06
ART_AND_DESIGN         1.986335e+06
FOOD_AND_DRINK         1.924898e+06
EDUCATION              1.820673e+06
BUSINESS               1.712290e+06
LIFESTYLE              1.437816e+06
FINANCE                1.387692e+06
HOUSE_AND_HOME         1.331541e+06
DATING                 8.540288e+05
COMICS                 8.176573e+05
AUTO_AND_VEHICLES  

In [22]:
# Install count for communication category
android_final[android_final['Category']=='COMMUNICATION'][['App', 'Installs']].sort_values(by='Installs', ascending=False).head(20)

Unnamed: 0,App,Installs
382,Messenger – Text and Video Chat for Free,1000000000.0
4234,Skype - free IM & video calls,1000000000.0
464,Hangouts,1000000000.0
411,Google Chrome: Fast & Secure,1000000000.0
451,Gmail,1000000000.0
381,WhatsApp Messenger,1000000000.0
465,imo free video calls and chat,500000000.0
4676,Viber Messenger,500000000.0
4039,Google Duo - High Quality Video Calls,500000000.0
474,LINE: Free Calls & Messages,500000000.0


On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs.

I see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously I found out this part of the market seems a bit saturated, so I'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since I found this genre has some potential to work well on the App Store, and the aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [23]:
# Install count for books_and_references category
android_final[android_final['Category']=='BOOKS_AND_REFERENCE'][['App', 'Installs']].sort_values(by='Installs', ascending=False).head(30)

Unnamed: 0,App,Installs
152,Google Play Books,1000000000.0
5651,Audiobooks from Audible,100000000.0
3941,Bible,100000000.0
4715,Wattpad 📖 Free Books,100000000.0
4083,Amazon Kindle,100000000.0
5323,Al Quran Indonesia,10000000.0
173,HTC Help,10000000.0
8293,Dictionary,10000000.0
144,Cool Reader,10000000.0
179,Moon+ Reader,10000000.0
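
To see how much the largest titles drive the average, one option is to recompute it after excluding apps at or above an install threshold. The sketch below uses only the ten rows shown above, and the 100 million cutoff is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Install counts copied from the ten rows above.
books = pd.Series(
    {
        "Google Play Books": 1_000_000_000,
        "Audiobooks from Audible": 100_000_000,
        "Bible": 100_000_000,
        "Wattpad": 100_000_000,
        "Amazon Kindle": 100_000_000,
        "Al Quran Indonesia": 10_000_000,
        "HTC Help": 10_000_000,
        "Dictionary": 10_000_000,
        "Cool Reader": 10_000_000,
        "Moon+ Reader": 10_000_000,
    },
    dtype=float,
)

threshold = 100_000_000  # assumed cutoff for "giant" apps
mean_all = books.mean()
mean_without_giants = books[books < threshold].mean()

print(mean_all, mean_without_giants)
```

For these ten rows, removing the five largest apps lowers the average from 145 million to 10 million installs.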


This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there would be significant competition.

An app built around a single popular book is a different matter. Still, the market is already full of libraries, so I would need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

# Conclusion

In this project, I analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

I concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so I need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.