# Profitable Apps Profile For App Store and Google Play Markets

The aim of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. As data analysts at a company that builds Android and iOS mobile apps, our job is to enable our team of developers to make data-driven decisions about the kind of apps they build.  
At our company, we only build apps that are free to download and install, and our main source of revenue is in-app ads. This means that our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better.

The goal of this project is to analyze data to help our developers understand which types of apps are likely to attract more users.

# Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. 
To avoid spending resources on collecting new data ourselves, we should first see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our goals:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

## Explore the csv files using Pandas

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_rows = 200         # display at most 200 rows
pd.options.display.min_rows = 200

android = pd.read_csv('/home/mike/Downloads/googleplaystore.csv')
ios = pd.read_csv('/home/mike/Downloads/AppleStore.csv')

In [2]:
android.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [3]:
ios.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [4]:
android.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [5]:
# check the number of unique apps

len(android.App.unique())

9660

In [6]:
# identify the duplicate apps using the 'df.duplicated()' method;
# keep=False marks every occurrence of a duplicated app, not just the repeats

android_duplicate = android[android.duplicated(['App'], keep=False)].sort_values('App')
    
android_duplicate

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1393,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8M,"500,000+",Free,0,Everyone 10+,Health & Fitness,"February 17, 2017",1.9,2.3.3 and up
1407,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8M,"500,000+",Free,0,Everyone 10+,Health & Fitness,"February 17, 2017",1.9,2.3.3 and up
2543,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26M,"1,000,000+",Free,0,Everyone,Medical,"July 27, 2018",7.4.1,5.0 and up
2322,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26M,"1,000,000+",Free,0,Everyone,Medical,"July 27, 2018",7.4.1,5.0 and up
2385,2017 EMRA Antibiotic Guide,MEDICAL,4.4,12,3.8M,"1,000+",Paid,$16.99,Everyone,Medical,"January 27, 2017",1.0.5,4.0.3 and up
2256,2017 EMRA Antibiotic Guide,MEDICAL,4.4,12,3.8M,"1,000+",Paid,$16.99,Everyone,Medical,"January 27, 2017",1.0.5,4.0.3 and up
1337,21-Day Meditation Experience,HEALTH_AND_FITNESS,4.4,11506,15M,"100,000+",Free,0,Everyone,Health & Fitness,"August 2, 2018",3.0.0,4.1 and up
1434,21-Day Meditation Experience,HEALTH_AND_FITNESS,4.4,11506,15M,"100,000+",Free,0,Everyone,Health & Fitness,"August 2, 2018",3.0.0,4.1 and up
3083,365Scores - Live Scores,SPORTS,4.6,666521,25M,"10,000,000+",Free,0,Everyone,Sports,"July 29, 2018",5.5.9,4.1 and up
5415,365Scores - Live Scores,SPORTS,4.6,666246,25M,"10,000,000+",Free,0,Everyone,Sports,"July 29, 2018",5.5.9,4.1 and up


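It appears that:

* There are a total of 9660 unique apps in the Android data set.
* Duplicated rows for the same app share the same values in nearly every column; where they differ, it is mostly in `Reviews`.

For duplicated apps, we will keep the row with the highest number of `Reviews` (a higher count likely reflects more recent data) and drop the others.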
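The rule above can also be expressed in a single chain. The following is a minimal sketch on a toy dataframe with hypothetical app names and review counts, not the notebook's actual data. It assumes `Reviews` arrives as strings, as in the raw CSV, which is why the conversion to integers comes first.

```python
import pandas as pd

# Toy data with hypothetical values; 'Reviews' is stored as strings, as in the raw CSV.
toy = pd.DataFrame({
    'App': ['Alpha', 'Alpha', 'Beta', 'Gamma', 'Gamma'],
    'Reviews': ['120', '95', '40', '7', '1000'],
})

# Convert to integers first: sorting strings would rank '95' above '120'.
toy['Reviews'] = toy['Reviews'].astype(int)

# Sort so the highest review count comes first, keep only the first row
# per app, then restore the original row order.
deduped = (toy.sort_values('Reviews', ascending=False)
              .drop_duplicates('App', keep='first')
              .sort_index())

print(deduped)
```

Sorting on the whole frame and calling `drop_duplicates` once avoids building intermediate "rows to keep" and "rows to discard" dataframes, although the step-by-step version in the cells below makes each stage easier to inspect.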

### Before dropping the duplicated rows, we need to convert the Reviews column to integer type

In [7]:
print(android_duplicate['Reviews'].isnull().sum()) # counts the number of nan values in the column

android_duplicate['Reviews'] = android_duplicate['Reviews'].astype('int')

0


In [8]:
# select the duplicate apps with the biggest number of Reviews

and_dup_max = android_duplicate.sort_values(['App','Reviews'], ascending=False
                                           ).drop_duplicates(['App'],keep='first')

and_dup_max

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
3652,wetter.com - Weather and Radar,WEATHER,4.2,189313,38M,"10,000,000+",Free,0,Everyone,Weather,"August 6, 2018",Varies with device,Varies with device
3202,trivago: Hotels & Travel,TRAVEL_AND_LOCAL,4.2,219848,Varies with device,"50,000,000+",Free,0,Everyone,Travel & Local,"August 2, 2018",Varies with device,Varies with device
3085,"theScore: Live Sports Scores, News, Stats & Vi...",SPORTS,4.4,133833,34M,"10,000,000+",Free,0,Everyone 10+,Sports,"July 25, 2018",6.17.2,4.4 and up
2637,textPlus: Free Text & Calls,SOCIAL,4.1,382121,28M,"10,000,000+",Free,0,Everyone,Social,"July 26, 2018",7.3.1,4.1 and up
565,stranger chat - anonymous chat,DATING,3.5,13204,6.1M,"1,000,000+",Free,0,Mature 17+,Dating,"July 7, 2018",2.4.1,4.1 and up
1921,slither.io,GAME,4.4,5235294,Varies with device,"100,000,000+",Free,0,Everyone,Action,"November 14, 2017",Varies with device,2.3 and up
5664,"realestate.com.au - Buy, Rent & Sell Property",HOUSE_AND_HOME,3.8,14657,Varies with device,"1,000,000+",Free,0,Everyone,House & Home,"July 16, 2018",Varies with device,Varies with device
3334,osmino Wi-Fi: free WiFi,TOOLS,4.2,134203,4.1M,"10,000,000+",Free,0,Everyone,Tools,"August 6, 2018",6.06.14,4.4 and up
2595,"ooVoo Video Calls, Messaging & Stories",SOCIAL,4.3,1157004,34M,"50,000,000+",Free,0,Everyone,Social,"October 16, 2017",4.2.1,4.3 and up
2331,mySugr: the blood sugar tracker made just for you,MEDICAL,4.6,21189,36M,"1,000,000+",Free,0,Everyone,Medical,"August 6, 2018",3.52.1,5.0 and up


In [9]:
# remove the rows we want to keep (largest number of Reviews) from the duplicate set,
# leaving only the duplicate rows that should be discarded

android_not_needed_dups = android_duplicate.drop(and_dup_max.index, axis=0)

android_not_needed_dups

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1407,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8M,"500,000+",Free,0,Everyone 10+,Health & Fitness,"February 17, 2017",1.9,2.3.3 and up
2322,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26M,"1,000,000+",Free,0,Everyone,Medical,"July 27, 2018",7.4.1,5.0 and up
2256,2017 EMRA Antibiotic Guide,MEDICAL,4.4,12,3.8M,"1,000+",Paid,$16.99,Everyone,Medical,"January 27, 2017",1.0.5,4.0.3 and up
1434,21-Day Meditation Experience,HEALTH_AND_FITNESS,4.4,11506,15M,"100,000+",Free,0,Everyone,Health & Fitness,"August 2, 2018",3.0.0,4.1 and up
5415,365Scores - Live Scores,SPORTS,4.6,666246,25M,"10,000,000+",Free,0,Everyone,Sports,"July 29, 2018",5.5.9,4.1 and up
2522,420 BZ Budeze Delivery,MEDICAL,5.0,2,11M,100+,Free,0,Mature 17+,Medical,"June 6, 2018",1.0.1,4.1 and up
3953,8 Ball Pool,SPORTS,4.5,14184910,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1970,8 Ball Pool,GAME,4.5,14201604,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1844,8 Ball Pool,GAME,4.5,14200550,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1755,8 Ball Pool,GAME,4.5,14200344,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up


In [10]:
# drop the unneeded duplicate rows (those with fewer Reviews) from the android dataframe

android_no_dups = android.copy().drop(android_not_needed_dups.index, axis=0)

android_no_dups

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
10,Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880,28M,"1,000,000+",Free,0,Everyone,Art & Design,"October 27, 2017",1.0.4,4.1 and up


After removing the duplicate apps, we have a total of 9660 apps remaining in our dataframe.

Next, we will examine the `Rating` column to see whether it contains any problems.

In [11]:
android_no_dups.Rating.unique()

array([ 4.1,  4.7,  4.5,  4.3,  4.4,  3.8,  4.2,  4.6,  3.2,  4. ,  4.8,
        3.9,  4.9,  3.6,  3.7,  nan,  3.3,  3.4,  3.5,  3.1,  5. ,  3. ,
        2.5,  2.8,  2.7,  1. ,  1.9,  2.9,  2.6,  2.3,  2.2,  1.7,  2. ,
        1.8,  2.4,  1.6,  2.1,  1.4,  1.5,  1.2, 19. ])

One row has a `Rating` greater than 5. This is unusual, since ratings are given on a scale of 1 to 5. We will drop the row containing this value to avoid carrying a data error into our analysis.

In [12]:
android_no_dups[android_no_dups['Rating'] == 19]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


We can see that the row at index `10472` contains erroneous values: the `Category` value is missing, so the remaining values are shifted one column to the left.
We could look the app up on Google Play and fill in the correct values for all columns, or drop the row entirely.
We will drop it.
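A range check can flag rows like this without listing the unique values by eye. The sketch below uses a toy dataframe with hypothetical values, and it assumes that a missing rating (an app nobody has rated yet) is acceptable while anything outside 1 to 5 is not.

```python
import pandas as pd
import numpy as np

# Toy data with hypothetical values, including one missing and one impossible rating.
toy = pd.DataFrame({
    'App': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'Rating': [4.5, np.nan, 19.0, 1.0],
})

# A rating is acceptable if it is missing or lies within [1, 5] (bounds inclusive).
valid = toy['Rating'].isna() | toy['Rating'].between(1, 5)

suspicious = toy[~valid]   # rows to inspect by hand
cleaned = toy[valid]       # rows that pass the check

print(suspicious)
```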

In [13]:
# inplace=True applies the drop directly to the android_no_dups dataframe

android_no_dups.drop([10472],axis =0,inplace=True)


# Isolating English Apps

Since we are building apps for an English-speaking audience, we'll analyze only apps with English names in both the <font color='red'> android </font> and <font color='blue'> ios </font> data sets.

In [14]:
android_no_dups.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [15]:
ios.columns

Index(['id', 'track_name', 'size_bytes', 'currency', 'price',
       'rating_count_tot', 'rating_count_ver', 'user_rating',
       'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
       'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'],
      dtype='object')

In [16]:
# return False if the name contains more than 3 non-ASCII characters (code point > 127), True otherwise

def english_only(name):
    n= 0
    for i in name:
        if ord(i) > 127:
            n+=1
    
    if n > 3:
        return False
    else:
        return True
    
veracity = android_no_dups['App'].apply(lambda x: english_only(x))
print(veracity.value_counts(dropna=False))

True     9614
False      45
Name: App, dtype: int64


In [17]:
# uncomment to time the applied function

# %timeit android_no_dups['App'].apply(lambda x: english_only(x))
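If the row-wise `apply` turns out to be slow, the same rule can be written with pandas string methods. This is a minimal sketch on a toy series of made-up app names. It counts characters outside the ASCII range with a regular expression and applies the same "at most 3" tolerance; whether it is actually faster on this data would need to be timed.

```python
import pandas as pd

# Hypothetical app names, used only for illustration.
names = pd.Series(['Photo Editor', 'Notes™ Pro', '天气预报应用程序', 'Café Finder'])

# Count characters whose code point is above 127 (outside the ASCII range).
non_ascii = names.str.count(r'[^\x00-\x7F]')

# Allow up to 3 such characters so that names with symbols like ™ or é are kept.
is_english = non_ascii <= 3

print(names[is_english])
```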

In [18]:
# english only android app

android_english = android_no_dups[veracity]
android_english

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
10,Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880,28M,"1,000,000+",Free,0,Everyone,Art & Design,"October 27, 2017",1.0.4,4.1 and up


In [19]:
# english only ios names

v_ = ios['track_name'].apply(english_only)
v_

ios[~v_]

ios_english = ios.copy()[v_]
ios_english

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.00,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.00,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.00,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.00,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.00,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
5,429047995,Pinterest,74778624,USD,0.00,1061624,1814,4.5,4.0,6.26,12+,Social Networking,37,5,27,1
6,282935706,Bible,92774400,USD,0.00,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
7,553834731,Candy Crush Saga,222846976,USD,0.00,961794,2453,4.5,4.5,1.101.0,4+,Games,43,5,24,1
8,324684580,Spotify Music,132510720,USD,0.00,878563,8253,4.5,4.5,8.4.3,12+,Music,37,5,18,1
9,343200656,Angry Birds,175966208,USD,0.00,824451,107,4.5,3.0,7.4.0,4+,Games,38,0,10,1


After removing the non-English apps from both **android** and **ios**, we are left with 9614 and 6183 apps respectively.

***

# Isolating the Free Apps

***
Since we are interested in apps that are free to download, we will remove apps that are not free, using the `Type` and `Price` columns for **android_english** and only the `price` column for **ios_english**.

Let's explore the *Price* and *Type* columns in the <font color='red' size=4> android_english </font> dataframe, after first checking the iOS data for duplicate apps.
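Because the Android `Price` column holds strings such as `0` and `$16.99`, one option is to convert it to a number before filtering. The sketch below shows that idea on a toy dataframe with hypothetical rows; the `price_usd` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Toy data with hypothetical values in the same string format as the Android data set.
toy = pd.DataFrame({
    'App': ['Alpha', 'Beta', 'Gamma'],
    'Type': ['Free', 'Paid', 'Free'],
    'Price': ['0', '$16.99', '0'],
})

# Strip the dollar sign (regex=False treats '$' literally) and convert to float.
toy['price_usd'] = toy['Price'].str.replace('$', '', regex=False).astype(float)

# Keep apps that are both labelled free and priced at zero.
free_apps = toy[(toy['Type'] == 'Free') & (toy['price_usd'] == 0)]

print(free_apps)
```

Comparing numbers rather than the string `'0'` also guards against variants such as `'0.0'`, should they appear.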

In [20]:
# check for duplicate apps in ios and drop them

ios_english[ios_english.duplicated(['track_name'],keep=False)]

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
2948,1173990889,Mannequin Challenge,109705216,USD,0.0,668,87,3.0,3.0,1.4,9+,Games,37,4,1,1
4442,952877179,VR Roller Coaster,169523200,USD,0.0,107,102,3.5,3.5,2.0.0,4+,Games,37,5,1,1
4463,1178454060,Mannequin Challenge,59572224,USD,0.0,105,58,4.0,4.5,1.0.1,4+,Games,38,5,1,1
4831,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1


In [21]:
# only two app rows need be dropped

ios_english.drop([4463,4831], axis=0,inplace=True)

In [22]:
print(android_english.Type.value_counts(dropna=False),sep='\n')

android_final = android_english.loc[(android_english.loc[:,'Type'] == 'Free') & (
    android_english.loc[:,'Price']== '0')]

android_final

Free    8863
Paid     750
NaN        1
Name: Type, dtype: int64


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up
10,Text on Photo - Fonteee,ART_AND_DESIGN,4.4,13880,28M,"1,000,000+",Free,0,Everyone,Art & Design,"October 27, 2017",1.0.4,4.1 and up


In [23]:
ios_final = ios_english.loc[ios_english.loc[:,'price']==0.0]

ios_final

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
5,429047995,Pinterest,74778624,USD,0.0,1061624,1814,4.5,4.0,6.26,12+,Social Networking,37,5,27,1
6,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
7,553834731,Candy Crush Saga,222846976,USD,0.0,961794,2453,4.5,4.5,1.101.0,4+,Games,43,5,24,1
8,324684580,Spotify Music,132510720,USD,0.0,878563,8253,4.5,4.5,8.4.3,12+,Music,37,5,18,1
9,343200656,Angry Birds,175966208,USD,0.0,824451,107,4.5,3.0,7.4.0,4+,Games,38,0,10,1


After cleaning both the **android** and **ios** app data, we have *8863* and *3220* apps respectively.

# Most Common App By Genre

## Part One

As we mentioned earlier, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll build frequency tables for `Genres` and `Category` in the **android** apps and for `prime_genre` in the **ios** apps.
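Since the same kind of table is needed for three columns, a small helper keeps the cells short. This is a sketch only: `freq_table` is a hypothetical name, and the toy dataframe stands in for the cleaned data.

```python
import pandas as pd

def freq_table(df, column):
    """Return the percentage share of each value in `column`, largest first."""
    return (df[column].value_counts(normalize=True, dropna=False) * 100).round(2)

# Toy data with hypothetical values.
toy = pd.DataFrame({'Category': ['GAME', 'TOOLS', 'GAME', 'FAMILY', 'GAME']})

shares = freq_table(toy, 'Category')
print(shares)
```

With the real data the helper would be called once per column, for example on `Genres`, `Category` and `prime_genre`.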

In [24]:
%who

and_dup_max	 android	 android_duplicate	 android_english	 android_final	 android_no_dups	 android_not_needed_dups	 english_only	 ios	 
ios_english	 ios_final	 np	 pd	 v_	 veracity	 


In [25]:
print(android_final.columns, sep='\n')

ios_final.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')


Index(['id', 'track_name', 'size_bytes', 'currency', 'price',
       'rating_count_tot', 'rating_count_ver', 'user_rating',
       'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
       'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'],
      dtype='object')

In [26]:
(android_final.Genres.value_counts(normalize=True, ascending=False,
                                  dropna=False)*100).reset_index(
    name='percentage').head(50)

Unnamed: 0,index,percentage
0,Tools,8.450863
1,Entertainment,6.070179
2,Education,5.348076
3,Business,4.592125
4,Lifestyle,3.892587
5,Productivity,3.892587
6,Finance,3.700779
7,Medical,3.531536
8,Sports,3.463838
9,Personalization,3.317161


The `Genres` column shows a very granular distribution of Android apps. Let's compare it with the information in the `Category` column.

In [27]:
(android_final.Category.value_counts(normalize=True, ascending=False,
                                  dropna=False)*100).reset_index(
    name='percentage')

Unnamed: 0,index,percentage
0,FAMILY,18.898793
1,GAME,9.725826
2,TOOLS,8.462146
3,BUSINESS,4.592125
4,LIFESTYLE,3.90387
5,PRODUCTIVITY,3.892587
6,FINANCE,3.700779
7,MEDICAL,3.531536
8,SPORTS,3.396141
9,PERSONALIZATION,3.317161


Android apps by `Category` show a less granular distribution. As we are interested in the bigger picture, we'll segment Android apps by the *Category* column.
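One rough way to quantify "granular" is to count the distinct labels in each column. The sketch below does this on a toy dataframe with hypothetical rows; the real counts would come from running the same call on the cleaned Android data.

```python
import pandas as pd

# Toy data with hypothetical values: one Category can map to several Genres labels.
toy = pd.DataFrame({
    'Category': ['FAMILY', 'FAMILY', 'FAMILY', 'GAME'],
    'Genres': ['Education', 'Education;Creativity', 'Puzzle;Brain Games', 'Action'],
})

# More distinct labels means a more granular classification.
n_labels = toy[['Category', 'Genres']].nunique()
print(n_labels)
```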

In [28]:
(ios_final.prime_genre.value_counts(normalize=True, ascending=False,
                                  dropna=False)*100).reset_index(
    name='percentage')

Unnamed: 0,index,percentage
0,Games,58.136646
1,Entertainment,7.888199
2,Photo & Video,4.968944
3,Education,3.664596
4,Social Networking,3.291925
5,Shopping,2.608696
6,Utilities,2.515528
7,Sports,2.142857
8,Music,2.049689
9,Health & Fitness,2.018634


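Comparing the two markets, the App Store is dominated by apps for fun (games alone make up about 58% of the free English apps), while Google Play shows a more balanced mix that includes many practical apps (tools, business, productivity), although `FAMILY` and `GAME` are still its two largest categories.

Now let's get an idea of the kinds of apps that have the most users.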

## Most Popular Apps by Genre on App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but for the App Store data set it is not available. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the *rating_count_tot* column.

In [29]:
# select the numeric column before averaging so that text columns are ignored
ios_final.groupby('prime_genre')['rating_count_tot'].mean().sort_values(ascending=False)

prime_genre
Navigation           86090.333333
Reference            74942.111111
Social Networking    71548.349057
Music                57326.530303
Weather              52279.892857
Book                 39758.500000
Food & Drink         33333.923077
Finance              31467.944444
Photo & Video        28441.543750
Travel               28243.800000
Shopping             26919.690476
Health & Fitness     23298.015385
Sports               23008.898551
Games                22812.924679
News                 21248.023256
Productivity         21028.410714
Utilities            18684.456790
Lifestyle            16485.764706
Entertainment        14029.830709
Business              7491.117647
Education             7003.983051
Catalogs              4004.000000
Medical                612.000000
Name: rating_count_tot, dtype: float64

On average, navigation apps have the highest number of user ratings. Let's look at the apps in the top genres to gain more insight.

In [30]:
top_genres = ['Navigation', 'Reference', 'Social Networking', 'Music',
              'Weather', 'Book', 'Food & Drink', 'Finance']

top_8_genre = (ios_final[ios_final['prime_genre'].isin(top_genres)]
               .sort_values(['prime_genre', 'rating_count_tot'], ascending=False)
               [['prime_genre', 'track_name', 'rating_count_tot']])

top_8_genre

Unnamed: 0,prime_genre,track_name,rating_count_tot
22,Weather,"The Weather Channel: Forecast, Radar & Alerts",495626
89,Weather,The Weather Channel App for iPad – best local ...,208648
95,Weather,"WeatherBug - Local Weather, Radar, Maps, Alerts",188583
133,Weather,MyRadar NOAA Weather Radar Forecast,150158
138,Weather,AccuWeather - Weather for Life,144214
189,Weather,Yahoo Weather,112603
355,Weather,Weather Underground: Custom Forecast & Local R...,49192
374,Weather,NOAA Weather Radar - Weather Forecast & HD Radar,45696
443,Weather,Weather Live Free - Weather Forecast & Alerts,35702
619,Weather,Storm Radar,22792


In [31]:
top_8_genre[top_8_genre['prime_genre'].isin(
    ['Navigation','Reference','Social Networking'])]

Unnamed: 0,prime_genre,track_name,rating_count_tot
0,Social Networking,Facebook,2974676
5,Social Networking,Pinterest,1061624
43,Social Networking,Skype for iPhone,373519
48,Social Networking,Messenger,351466
51,Social Networking,Tumblr,334293
63,Social Networking,WhatsApp Messenger,287589
72,Social Networking,Kik,260965
111,Social Networking,"ooVoo – Free Video Call, Text and Voice",177501
117,Social Networking,TextNow - Unlimited Text + Calls,164963
120,Social Networking,Viber Messenger – Text & Call,164249


On close examination, the popularity of the Navigation, Social Networking and Reference genres appears to be skewed by a few extremely popular apps. 
For example, Social Networking is skewed by social media giants like Facebook and Pinterest. Competing with them would likely be difficult, which makes building a social networking app inadvisable given that our model is to generate revenue through ads.

Also, in the Navigation genre, two apps, **Waze - GPS Navigation, Maps & Real-time Traffic** and **Google Maps - Navigation & Transit**, massively skew the popularity of the genre; the other apps in this genre have unremarkable numbers of user ratings.
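A quick way to spot this kind of skew is to put the mean next to the median for each genre: when a few giants dominate, the mean sits well above the median. The sketch below uses a toy dataframe with hypothetical rating counts, not the App Store figures.

```python
import pandas as pd

# Toy data with hypothetical values: Navigation has two giants, Weather is evenly spread.
toy = pd.DataFrame({
    'prime_genre': ['Navigation'] * 4 + ['Weather'] * 4,
    'rating_count_tot': [345000, 155000, 12000, 500, 50000, 45000, 40000, 35000],
})

# Mean and median number of ratings per genre.
summary = toy.groupby('prime_genre')['rating_count_tot'].agg(['mean', 'median'])

# A ratio well above 1 suggests that a few large apps are pulling the mean up.
summary['mean_to_median'] = summary['mean'] / summary['median']

print(summary)
```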

In [32]:
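# numeric_only=True skips the text columns, which newer pandas versions refuse to average
top_8_genre[(top_8_genre['prime_genre']=='Reference') & (top_8_genre['rating_count_tot'] < 200500)].mean(numeric_only=True)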

rating_count_tot    21355.176471
dtype: float64

The popularity of the **Reference** genre is also skewed by two apps, *Bible* and *Dictionary.com Dictionary & Thesaurus*. However, if we exclude the Bible, the mean of the `rating_count_tot` column is a better reflection of the genre's popularity.

One option is to turn a popular book into an app. We could also add features such as quizzes about the book (taking advantage of the fact that the majority of apps on the App Store are games) and an audio version of the book. Additionally, we could embed a dictionary so that users don't have to leave our app to look up a word.

## Most Popular Apps by Genre on Google Play

For the Google Play market, information about the number of installs is contained in the `Installs` column. Let's explore this column.

In [33]:
# copy first to avoid SettingWithCopyWarning; regex=False treats ',' and '+' as literal characters
android_final = android_final.copy()
k_ = android_final['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False).astype('int')
android_final['install_clean'] = k_


In [34]:
android_final.install_clean.unique()

array([     10000,    5000000,   50000000,     100000,      50000,
          1000000,   10000000,       5000,     500000, 1000000000,
        100000000,       1000,  500000000,        100,        500,
               50,         10,          1,          5,          0])

In [35]:
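# numeric_only=True restricts the average to numeric columns (Rating, install_clean)
pop_genre_rank = (android_final.groupby(['Category']).mean(numeric_only=True)
                  .sort_values('install_clean', ascending=False).reset_index())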

pop_genre_rank

Unnamed: 0,Category,Rating,install_clean
0,COMMUNICATION,4.126923,38456120.0
1,VIDEO_PLAYERS,4.043448,24727870.0
2,SOCIAL,4.252736,23253650.0
3,PHOTOGRAPHY,4.164516,17840110.0
4,PRODUCTIVITY,4.181915,16787330.0
5,GAME,4.232034,15588020.0
6,TRAVEL_AND_LOCAL,4.068156,13984080.0
7,ENTERTAINMENT,4.118824,11640710.0
8,TOOLS,4.027854,10801390.0
9,NEWS_AND_MAGAZINES,4.104545,9549178.0


Let's explore each of these popular categories to gain insight into the kind of app we might develop.

In [36]:
genre_app = {}

for cat in pop_genre_rank['Category'][:11]:
    genre_app[cat] = android_final[android_final['Category']== cat][['App','install_clean']].sort_values(
    'install_clean', ascending=False)
    
genre_app

{'COMMUNICATION':                                                      App  install_clean
 336                                   WhatsApp Messenger     1000000000
 382             Messenger – Text and Video Chat for Free     1000000000
 411                         Google Chrome: Fast & Secure     1000000000
 451                                                Gmail     1000000000
 464                                             Hangouts     1000000000
 468                        Skype - free IM & video calls     1000000000
 403                          LINE: Free Calls & Messages      500000000
 4676                                     Viber Messenger      500000000
 420          UC Browser - Fast Download Private & Secure      500000000
 465                        imo free video calls and chat      500000000
 4039               Google Duo - High Quality Video Calls      500000000
 4123                        imo beta free calls and text      100000000
 412                       Firefox

In [37]:
# Keep only the apps with at least 100 million installs, tagging each row with its category
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
app_above_100m = pd.concat(
    [v[v['install_clean'] >= 100000000].assign(Category=k) for k, v in genre_app.items()],
    ignore_index=True)

# Count those apps per category, in the same order as the popularity ranking
app_above_100m = (app_above_100m.groupby('Category').size()
                  .reindex(pop_genre_rank['Category'][:11])
                  .reset_index(name='count_above_100m'))

app_above_100m

Unnamed: 0,Category,count_above_100m
0,COMMUNICATION,27
1,VIDEO_PLAYERS,9
2,SOCIAL,13
3,PHOTOGRAPHY,19
4,PRODUCTIVITY,22
5,GAME,59
6,TRAVEL_AND_LOCAL,5
7,ENTERTAINMENT,5
8,TOOLS,29
9,NEWS_AND_MAGAZINES,3
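
The per-category count above can also be obtained without an intermediate dictionary, by summing a boolean mask within each category. The sketch below uses a made-up frame: the app names and install numbers are hypothetical and are not taken from the data set.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Category': ['COMMUNICATION', 'COMMUNICATION', 'SOCIAL', 'SOCIAL', 'TOOLS'],
    'App': ['app_a', 'app_b', 'app_c', 'app_d', 'app_e'],
    'install_clean': [1000000000, 5000, 100000000, 500000000, 10000],
})

THRESHOLD = 100000000  # same 100 million cut-off used in the analysis

# True/False per app, summed within each category = number of apps at or above the cut-off
count_above = (
    (toy['install_clean'] >= THRESHOLD)
    .groupby(toy['Category'])
    .sum()
    .rename('count_above_100m')
    .reset_index()
)

print(count_above)
```

One difference from the filter-then-count approach: a category with no app above the threshold shows up with a count of 0 here instead of dropping out of the result.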


Communication apps have the highest average number of installs in the Google Play Store. However, this average is heavily skewed by a few apps with over 100,000,000 installs.

If we remove the few heavyweight applications such as WhatsApp, Google Chrome, Messenger, Gmail and the like, which heavily influence the install figures for this category, the average number of installs drops roughly tenfold.
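
As a rough check of the "roughly tenfold" claim, the sketch below recomputes the Communication average from the install-count distribution that this notebook reports for the category (installs bucket mapped to number of apps). The helper name `weighted_mean` is an assumption introduced here for illustration.

```python
# Install-count distribution reported for COMMUNICATION: {installs: number of apps}
communication_dist = {
    1000000000: 6, 500000000: 5, 100000000: 16, 50000000: 7, 10000000: 43,
    5000000: 22, 1000000: 40, 500000: 9, 100000: 16, 50000: 10, 10000: 20,
    5000: 16, 1000: 19, 500: 8, 100: 28, 50: 5, 10: 14, 5: 2, 1: 1,
}


def weighted_mean(dist):
    """Average installs per app for an {installs: app_count} mapping (hypothetical helper)."""
    total_installs = sum(installs * count for installs, count in dist.items())
    total_apps = sum(dist.values())
    return total_installs / total_apps


# Same cut-off as the analysis: keep only apps with fewer than 100 million installs
under_100m = {k: v for k, v in communication_dist.items() if k < 100000000}

mean_all = weighted_mean(communication_dist)
mean_under = weighted_mean(under_100m)

print(f"all apps:            {mean_all:,.0f}")
print(f"under 100m installs: {mean_under:,.0f}")
print(f"ratio:               {mean_all / mean_under:.1f}x")
```

The 27 apps at or above the cut-off are fewer than a tenth of the 287 Communication apps, yet they account for most of the category's total installs, which is why the mean moves so much when they are excluded.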

In [38]:
# investigating the number of installs for each app category

genre_install_dist = {}

for a_ in pop_genre_rank['Category'][:11]:
    genre_install_dist[a_] = (android_final[android_final['Category'] == a_]['install_clean']
                              .value_counts().sort_index(ascending=False))
genre_install_dist

{'COMMUNICATION': 1000000000     6
 500000000      5
 100000000     16
 50000000       7
 10000000      43
 5000000       22
 1000000       40
 500000         9
 100000        16
 50000         10
 10000         20
 5000          16
 1000          19
 500            8
 100           28
 50             5
 10            14
 5              2
 1              1
 Name: install_clean, dtype: int64,
 'VIDEO_PLAYERS': 1000000000     2
 500000000      1
 100000000      6
 50000000      10
 10000000      26
 5000000        7
 1000000       33
 500000         4
 100000        12
 50000          6
 10000         15
 5000          14
 1000           8
 500            6
 100            7
 10             2
 Name: install_clean, dtype: int64,
 'SOCIAL': 1000000000     3
 500000000      2
 100000000      8
 50000000       5
 10000000      30
 5000000       19
 1000000       33
 500000        14
 100000        21
 50000         10
 10000         18
 5000          11
 1000          22
 500            6
 1

On closer investigation, we see that for each of the top 11 most popular app categories, popularity is skewed by a few apps with 100 million installs or more. Let's remove these apps and compare the categories again.

In [39]:
genre_less_100m = {}

for ap in pop_genre_rank['Category'][:11]:
    under_100m = android_final[(android_final['Category'] == ap)
                               & (android_final['install_clean'] < 100000000)]
    # mean installs and number of apps, restricted to apps with fewer than 100 million installs
    genre_less_100m[ap] = under_100m['install_clean'].agg(['mean', 'size']).values
genre_less_100m

{'COMMUNICATION': array([3.60348539e+06, 2.60000000e+02]),
 'VIDEO_PLAYERS': array([5.54487813e+06, 1.50000000e+02]),
 'SOCIAL': array([3.08458252e+06, 2.23000000e+02]),
 'PHOTOGRAPHY': array([7.67053229e+06, 2.42000000e+02]),
 'PRODUCTIVITY': array([3.37965732e+06, 3.23000000e+02]),
 'GAME': array([6.27256469e+06, 8.03000000e+02]),
 'TRAVEL_AND_LOCAL': array([2.94407963e+06, 2.02000000e+02]),
 'ENTERTAINMENT': array([6118250,      80]),
 'TOOLS': array([3.19146113e+06, 7.21000000e+02]),
 'NEWS_AND_MAGAZINES': array([1.50284188e+06, 2.45000000e+02]),
 'BOOKS_AND_REFERENCE': array([1.43721222e+06, 1.85000000e+02])}

In [40]:
genre_df = pd.DataFrame.from_dict(genre_less_100m,orient='index', columns=['install_review','size_u_100m'])

genre_df.reset_index(inplace=True)

genre_df

Unnamed: 0,index,install_review,size_u_100m
0,COMMUNICATION,3603485.0,260.0
1,VIDEO_PLAYERS,5544878.0,150.0
2,SOCIAL,3084583.0,223.0
3,PHOTOGRAPHY,7670532.0,242.0
4,PRODUCTIVITY,3379657.0,323.0
5,GAME,6272565.0,803.0
6,TRAVEL_AND_LOCAL,2944080.0,202.0
7,ENTERTAINMENT,6118250.0,80.0
8,TOOLS,3191461.0,721.0
9,NEWS_AND_MAGAZINES,1502842.0,245.0


In [41]:
# comparing the top 11 most popular genres after removing apps with 100,000,000 or more installs

k1_ = pd.concat([pop_genre_rank[:11],genre_df], join='inner', axis=1)
k1_['percentage_reduction'] = ((k1_['install_clean']-k1_['install_review'])/k1_['install_clean'])*100

k1_ = pd.concat([k1_,app_above_100m['count_above_100m']],join = 'inner',axis=1)
k1_.drop('index',axis=1,inplace=True)

k1_ = k1_[['Category','Rating','install_clean','install_review','percentage_reduction'
          ,'size_u_100m','count_above_100m']]

k1_['skewed_percentage'] = (k1_['count_above_100m']/(k1_['size_u_100m'] +k1_['count_above_100m']))*100

k1_

k1_.rename({'install_review':'install_less_100m',
           'percentage_reduction':'install_percentage_drop',
           'size_u_100m':'count_under_100m',
           'skewed_percentage': 'percentage_above_100m'},inplace=True,axis=1)
k1_

Unnamed: 0,Category,Rating,install_clean,install_less_100m,install_percentage_drop,count_under_100m,count_above_100m,percentage_above_100m
0,COMMUNICATION,4.126923,38456120.0,3603485.0,90.629618,260.0,27,9.407666
1,VIDEO_PLAYERS,4.043448,24727870.0,5544878.0,77.576404,150.0,9,5.660377
2,SOCIAL,4.252736,23253650.0,3084583.0,86.735062,223.0,13,5.508475
3,PHOTOGRAPHY,4.164516,17840110.0,7670532.0,57.004009,242.0,19,7.279693
4,PRODUCTIVITY,4.181915,16787330.0,3379657.0,79.867811,323.0,22,6.376812
5,GAME,4.232034,15588020.0,6272565.0,59.760339,803.0,59,6.844548
6,TRAVEL_AND_LOCAL,4.068156,13984080.0,2944080.0,78.946916,202.0,5,2.415459
7,ENTERTAINMENT,4.118824,11640710.0,6118250.0,47.440902,80.0,5,5.882353
8,TOOLS,4.027854,10801390.0,3191461.0,70.45324,721.0,29,3.866667
9,NEWS_AND_MAGAZINES,4.104545,9549178.0,1502842.0,84.262082,245.0,3,1.209677


As with the **Communication** category, **Video_players**, **Social**, **Photography** and the others are dominated by a few apps that give these categories an outsized measure of popularity.
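
To make the two percentage columns in the table above concrete, here is a minimal sketch that recomputes them for two rows. The function names are assumptions introduced for illustration; the input numbers are the rounded values shown in the table, so the results agree with the table only up to rounding.

```python
def install_percentage_drop(mean_all, mean_under_100m):
    """Percentage by which the average installs fall once the 100m+ apps are excluded."""
    return (mean_all - mean_under_100m) / mean_all * 100


def percentage_above_100m(count_above, count_under):
    """Share of a category's apps that have 100m+ installs, in percent."""
    return count_above / (count_under + count_above) * 100


# (mean installs, mean installs under 100m, apps under 100m, apps at 100m+), copied from the table
rows = {
    'COMMUNICATION': (38456120.0, 3603485.0, 260, 27),
    'ENTERTAINMENT': (11640710.0, 6118250.0, 80, 5),
}

for category, (mean_all, mean_under, n_under, n_above) in rows.items():
    drop = install_percentage_drop(mean_all, mean_under)
    share = percentage_above_100m(n_above, n_under)
    print(f"{category}: drop {drop:.2f}%, share of 100m+ apps {share:.2f}%")
```

Reading the two numbers together shows the skew: in Communication, under a tenth of the apps account for about a 90% drop in the average when they are removed.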

For example, the *Video_players* category is dominated by apps such as YouTube, Google Play Movies, VLC and the like. Competing with these apps for market share in the video player space is not the best play.

The *Photography* category is also dominated by a few apps, as is the case with *Video_players*. Moreover, since we generate income through ads, a photography app may not earn much, because users rarely spend long in these apps beyond editing pictures.

The *Productivity* category is dominated by apps that act as cloud storage, calendars, text editors and the like. The ability to generate revenue through in-app ads in this category is limited, since these apps are used only at certain moments of the day.

Taking a closer look at the <font color='Red'> *Books and Reference* </font> category, the total installs here are skewed by five (5) apps: Google Play Books, Wattpad Free Books, Audiobooks from Audible, Bible and Amazon Kindle.
However, if we remove these apps, the remaining installs are fairly evenly distributed among apps ranging from translations of the Quran to e-book readers, dictionaries and translations of other popular books.
This holds potential for exploration.
We could look for a popular book to turn into an app and add exciting features to set it apart from other apps in this category. Including a dictionary together with quizzes and/or a discussion forum would add the elements of fun and productivity that have made other categories popular.

Building an app based on a popular book, with features such as those highlighted above, could be profitable on both the Google Play Store and the App Store.


## Conclusion

In this project, we analyzed apps on the App Store and the Google Play Store with the goal of recommending an app profile that is profitable in both markets.

The analysis revealed that taking a recent popular book and turning it into an app could be profitable for both the App Store and the Google Play Store. But since the market is full of libraries, we have to add exciting features such as quizzes, quotes and/or a discussion forum to give the app a distinguishing edge over the other books in those libraries.

In [42]:
# uncomment this code block to view the bar plots of installs before and after apps with 100m installs and above were removed


# import matplotlib.pyplot as plt
# import seaborn as sns

# %matplotlib inline

# plt.rcParams["figure.figsize"] = [10,5]

# plt.style.use('fivethirtyeight')

# k1_.plot(x='Category',y=['install_clean','install_less_100m'],kind='barh')

# # sns.despine()

In [43]:
# uncomment this code to view the barplot of apps with 100m and above installs vs percentage drop in installs


# k1_.plot(x='Category',y=['install_percentage_drop','percentage_above_100m'],kind='barh',width=0.7)

# plt.legend(bbox_to_anchor=(0.75,1.15),ncol=2)  #bbox_to_anchor is used to manipulate the location of legend
#                                                 # ncol indicates the number of column for the legend