# Most profitable apps for the App Store and Google Play

We work for a company that builds free Android and iOS apps. Since the apps are free, revenue comes from in-app ads, meaning that the more users who see and engage with the ads, the better.

As such, the developers need us to identify which types of apps are likely to attract more users and generate more revenue.



## Opening and Exploring the Data

As of August 2024, there are over 2.2 million apps in the Play Store and over 2 million apps in the App Store (source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)).

Since collecting the data ourselves would be time-consuming and expensive, we looked for datasets available online that would allow us to analyse sample data. We found two on Kaggle: one for the [Play Store](https://www.kaggle.com/lava18/google-play-store-apps) and one for the [App Store](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

We will start by opening and exploring the data:

In [6]:
from csv import reader

# The data for Google Play
opened_file = open('googleplaystore.csv', encoding = 'utf-8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# The data for the App Store
opened_file = open('AppleStore.csv', encoding = 'utf-8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
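
The two loading steps above follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do that, not the approach used in the rest of this notebook: `load_csv` is a hypothetical name, and the `with` statement is used so that the file is closed once it has been read.

```python
from csv import reader

def load_csv(path, encoding='utf-8'):
    """Return (header, rows) for the CSV file at `path`.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    # newline='' is what the csv module documentation recommends for files given to reader()
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    # The first row is the header, everything after it is data
    return rows[0], rows[1:]
```

Under these assumptions, `android_header, android = load_csv('googleplaystore.csv')` would produce the same two variables as the cell above.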


We are creating a function called `explore_data()` to make it easier to explore the datasets. We are also adding the option to show the number of rows and columns for any dataset if we wish to see that information.

The documentation for each of the datasets can be found here: [Android](https://www.kaggle.com/datasets/lava18/google-play-store-apps), [iOS](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)

In [12]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))      

#### Exploring the data from Google Play
The Google Play data has 10841 rows and 13 columns, of which the most useful for our purpose are `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.

In [18]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


#### Exploring the data from the App Store
The App Store data has 7197 rows and 16 columns, of which the ones relevant to us are `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, and `prime_genre`.

In [19]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


## Cleaning up the data

During this process we will be removing duplicate data, data that is not needed, and correcting or removing data that is wrong.

As a reminder, the company we work for only builds free apps, and those apps are designed with an English-speaking market in mind. This means that we have no interest in paid apps or in apps in a language other than English.

##### Deletion of wrong data
The Google Play dataset has a discussion section where an error in row 10472 is mentioned. It turns out that the row in question is missing the `Category` value, which shifts every later value one column to the left. As a result, the `Rating` column shows 19 (when the maximum rating is 5) rather than 1.9 as it should.

In [22]:
print(android[10472]) #incorrect row
print('\n')
print(android_header)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We are going to proceed by deleting this row from the dataset.

In [23]:
print(len(android))
del android[10472]  # only run this once, or correct rows could be deleted
print(len(android))

10841
10840
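
Rows like this one can also be located without relying on the dataset's discussion section, because a row with a missing value is shorter than the header. The following is a minimal sketch of that check; `find_malformed_rows` is a hypothetical helper and the rows are toy data, not values from the real dataset.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only (not taken from the real dataset)
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # the category is missing, so the row is too short
    ['App C', 'GAME', '3.9'],
]

print(find_malformed_rows(toy_rows, toy_header))  # prints [1]
```

Run against `android` and `android_header` before the deletion, a check like this would be expected to flag index 10472, assuming that is the only row with a missing column.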


##### Removal of duplicate entries
By exploring the dataset further, we notice that there are duplicated entries. For example, "Instagram" appears four times.

In [25]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, we found 1181 duplicated entries:

In [29]:
duplicated_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicated_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicated apps is:', len(duplicated_apps))
print('Number of unique apps is:', len(unique_apps))
print('Examples of duplicated apps:', duplicated_apps[:5])

Number of duplicated apps is: 1181
Number of unique apps is: 9659
Examples of duplicated apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
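
Checking `name in unique_apps` against a list gets slower as the list grows. As an alternative sketch (not the method used in this notebook), the same two counts could be obtained with `collections.Counter`. The helper name `count_duplicates` and the toy rows are assumptions made for illustration.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (number of unique names, number of duplicated entries)."""
    name_counts = Counter(row[name_index] for row in dataset)
    n_unique = len(name_counts)
    # Every occurrence after the first one counts as a duplicate
    n_duplicated = sum(count - 1 for count in name_counts.values())
    return n_unique, n_duplicated

# Toy rows for illustration only
toy_rows = [['Instagram'], ['Box'], ['Instagram'], ['Instagram']]
print(count_duplicates(toy_rows))  # prints (2, 2)
```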


We want to remove the duplicate entries to avoid counting the same app several times. By looking at the Instagram entries, we can see that the number of reviews differs between duplicates. The entry with the highest number of reviews is likely the most recent one, so keeping it is a reasonable criterion for removing duplicates: it means we keep the most up-to-date data.

To do that, we are going to create a dictionary that stores the name of each app as a key and its highest number of reviews as the value. For each row, if the app is already in the dictionary and the stored value is smaller than the row's number of reviews, the stored value is replaced with the new one. If the app is not yet in the dictionary, it is added with the row's number of reviews.

In [36]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
        
print('Expected entries:', len(android)-1181) # Amount of unique entries we expect in the dictionary
print('Actual entries:', len(reviews_max))

Expected entries: 9659
Actual entries: 9659


Once we have created the dictionary, we are going to use it to clean the dataset. To do that, we create two empty lists: one will store the cleaned dataset and the other the app names.
We will then loop through the `android` dataset, using the dictionary to look up the highest number of reviews for each app. If the number of reviews in a row matches the value in the dictionary and the app isn't already in the `already_added` list, the whole row is added to the `android_clean` list and the name of the app is added to the `already_added` list. The second condition is needed because duplicates can share the same highest number of reviews.

In [47]:
android_clean =  [] # will store the new cleaned dataset
already_added = [] # will store app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
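
The two passes above (first building `reviews_max`, then filtering) can also be folded into a single pass that stores the whole row instead of only the review count. This is a sketch of that alternative, not the notebook's method. `remove_duplicates` is a hypothetical name, the Instagram review counts are copied from the output shown earlier, and the `Box` row values are made up for illustration.

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest number of reviews."""
    best_rows = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two entries tie on reviews
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],       # made-up values
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

clean = remove_duplicates(toy_rows)
print(len(clean))   # prints 2
print(clean[0][3])  # prints 66577446
```

One difference worth noting: the result is ordered by the first appearance of each app name, whereas `android_clean` is ordered by the position of each kept row, so the two lists can contain the same rows in a different order.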

We are now going to explore the `android_clean` dataset to ensure everything has worked as expected. 

In [46]:
explore_data(android_clean, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


##### Removing Non-English Apps

As we mentioned earlier, the company is only interested in apps that are in English, since its focus is the English-speaking market. Because of this, we are going to remove any app that does not appear to be in English.

One way to remove such apps is to eliminate those whose names contain characters that are not commonly used in English text. Every character in a string has a number associated with it (its Unicode code point): for example, the letter "a" is associated with 97, while the letter "á" is associated with 225. We can check the number associated with each character using `ord()`.

In [50]:
print(ord('a'))
print(ord('á'))

97
225


The characters commonly used in English text fall within the ASCII range, 0 to 127. Knowing this, we can build a function that identifies app names containing characters outside that range.

In [53]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False        
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


Although the function above does identify characters outside the 0 to 127 ASCII range, it also flags some apps that are in English but whose names include characters outside that range. Using it as a filter would therefore lead to a loss of meaningful data.

In [55]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat😜'))

False
False


To prevent that from happening, we will only treat an app as non-English when its title has more than three characters outside the 0 to 127 ASCII range. The function won't be perfect, but it will reduce the likelihood of deleting relevant entries.

In [58]:
def is_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
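
The threshold of three is a judgment call, so it can be convenient to make it a parameter. The sketch below is a variation on the function above rather than a replacement for it. The `max_non_ascii` parameter is an assumption introduced here for illustration.

```python
def is_english(string, max_non_ascii=3):
    """Return True when `string` has at most `max_non_ascii` characters outside ASCII."""
    # Count the characters whose code point falls outside the 0 to 127 range
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Docs To Go™ Free Office Suite'))     # prints True
print(is_english('Instachat😜'))                        # prints True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))         # prints False
print(is_english('Instachat😜', max_non_ascii=0))       # prints False
```

With the default value the results match the function above, while `max_non_ascii=0` reproduces the stricter first version.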


We then loop through the two datasets to create the `android_english` and `ios_english` datasets, which contain only the entries we expect to be in English. We are left with 9614 entries for Android and 6183 for iOS.

In [67]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

##### Removing paid apps from the datasets

To remove the paid apps and obtain our final datasets we will loop through each dataset and look for those apps that have a price equal to 0 in the case of Android and 0.0 in the case of iOS and append them to two new lists.

In [72]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print('Android total final entries:',len(android_final))
print('iOS total final entries:', len(ios_final))

Android total final entries: 8864
iOS total final entries: 3222
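
Comparing price strings only works while free apps are always written exactly as `'0'` or `'0.0'`. A slightly more defensive sketch converts the price to a number first. `is_free` is a hypothetical helper, and the leading dollar sign on paid prices (for example `'$4.99'`) is an assumption about the data format that is not shown in this notebook.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    # Assumption: paid prices may carry a leading dollar sign, e.g. '$4.99'
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0'))      # prints True
print(is_free('0.0'))    # prints True
print(is_free('$4.99'))  # prints False
```

Note that `float()` raises a `ValueError` for empty or non-numeric strings, so a malformed row would surface as an error instead of being silently dropped.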


## Analysing the data
As mentioned at the start, since the revenue of the company is tied to the number of people who use the apps, the goal is to determine which apps are more likely to attract more users.

To minimise risks and overhead, the company:
1. Builds a light version of the app and adds it to Google Play
2. If the app is well received by users then it gets developed further
3. After six months if the app is profitable, the company builds the iOS version of the app and publishes it in the App Store

Since the goal is to publish apps on both platforms, it is important for us to find the types of apps that work well in both stores and not just in one of them.

##### Most common Apps by Genre

To find out what the most common genres are in each of the app stores, we will create frequency tables that look at the `prime_genre` column in the App Store and the `Genres` and `Category` columns in Google Play.

Below we create the function that builds the frequency tables, `freq_table`, and the function that displays the percentages in descending order, `display_table`.


In [86]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
        
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset,index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
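
For comparison, the counting and sorting done by the two functions above can be sketched with `collections.Counter`, whose `most_common()` method already returns values ordered from most to least frequent. `freq_table_sorted` is a hypothetical name and the rows are toy data. This is an alternative sketch, not the implementation used in the rest of the notebook.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows for illustration only
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for genre, percentage in freq_table_sorted(toy_rows, 1):
    print(genre, ':', percentage)
# prints:
# Games : 75.0
# Music : 25.0
```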
        
        

We are now going to use those two functions to examine `prime_genre` for iOS, and `Genres` and `Category` for Android.

In [88]:
display_table(ios_final, -5) #prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


It looks like over half of the iOS apps (58%) are games, followed by entertainment apps, which account for almost 8%, and photo & video apps, which represent 5%.

In terms of number of apps, the App Store seems to be dominated by apps for fun, although that doesn't necessarily mean that those are the apps with the largest number of users.

In [89]:
display_table(android_final, 1) #Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

At first sight, Google Play shows a different type of distribution, with family apps accounting for 19% of the apps, followed by game apps at 10%. However, it seems like most of the apps under the `FAMILY` category are actually games for children. Even counting all `FAMILY` apps as games, games would account for about 29% of the apps rather than 58%, while `TOOLS` accounts for 8% and `BUSINESS` for 5%, signalling that practical apps carry more weight on Android.

Again, this doesn't necessarily mean these categories have the largest amount of users, which is what we want to figure out.

In [90]:
display_table(android_final, -4) #Genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In the case of `Genres`, the data seems to be a lot more granular than what we actually need for our analysis. For example, there is an "Education", an "Educational", and an "Education;Education" genre. As such, moving forward we will stick to `Category` for Android.

##### Most popular apps by Genre

We have already seen that in the App Store the most common apps are "fun" apps, while Google Play is a bit more balanced between "fun" apps and "practical" apps. Now we want to know what kind of apps have the most users.

To find out how many users each genre has on Android, we can use the number of `Installs` of the apps that belong to each `Category`. For the iOS apps we don't have a similar column, but we can use `rating_count_tot` as a proxy, assuming that the apps with the most ratings will also be the most popular among users.

We are going to start by calculating the average number of ratings per app genre on the App Store.

In [101]:
genres_ios = freq_table(ios_final,-5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1

    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings) 


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Navigation seems to be the genre with the highest average number of user ratings. However, that number is heavily influenced by two apps, Waze and Google Maps, which combined have almost half a million user ratings. Something similar is happening in the "Social Networking" and "Music" genres: a few very popular apps belonging to well-known brands are skewing how popular those genres appear.
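
One way to see how much a few large apps pull the average up is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses the rating counts of the six Navigation apps listed later in this notebook. Using the median here is an illustration added to this write-up, not a step of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free English Navigation apps listed later in this notebook
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(mean(navigation_ratings))    # about 86090.33, the Navigation average shown above
print(median(navigation_ratings))  # prints 8196.5
```

The median is roughly ten times smaller than the mean, which is consistent with the genre average being driven by Waze and Google Maps.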

The "Reference" genre is also heavily influenced by the Bible app and a few dictionaries. However, it could be a good opportunity to do something different, with the potential of getting more exposure. It could also be combined with the "Book" genre, for example with an app that provides quizzes, quotes, and fun facts about a book that users can check while reading it.

Weather apps would require plugging into third-party APIs, which might not be free, and finance apps would require a lot of domain knowledge.

In [114]:
for app in ios_final:
    if app[-5] == "Navigation":
        print(app[1], ':', app[5])  # print the number of ratings per app in the Navigation genre

print('\n')
        
for app in ios_final:
    if app[-5] == "Social Networking":
        print(app[1], ':', app[5])
        
print('\n')

for app in ios_final:
    if app[-5] == "Music":
        print(app[1], ':', app[5])

print('\n')

for app in ios_final:
    if app[-5] == "Reference":
        print(app[1], ':', app[5])

print('\n')

for app in ios_final:
    if app[-5] == "Book":
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, 

Now we are going to look at Google Play and find the most popular apps by looking at those with the highest number of installs. The problem is that the dataset reports installs in open-ended cohorts (for example `100,000+`) rather than as exact numbers. In our case that is not a big issue: we just want to know which categories have the most installs, so we don't need precise figures.

In [116]:
display_table(android_final, 5) # Number of installations

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


As we said we are going to leave the cohorts for number of installations as they are, however, to work with computations on the installations data we need to ensure that each `string` is converted into a `float`. To not have issues with that conversion we need to remove commas and the plus character.
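
The conversion described above can be sketched as a small helper. `parse_installs` is a hypothetical name that is not used elsewhere in this notebook. It simply chains the two `replace` calls before converting to `float`.

```python
def parse_installs(installs):
    """Convert an install cohort such as '1,000,000+' into a float."""
    # Remove the plus sign and the thousands separators before converting
    return float(installs.replace('+', '').replace(',', ''))

print(parse_installs('1,000,000+'))  # prints 1000000.0
print(parse_installs('0'))           # prints 0.0
```

Because each cohort is open-ended, the result is a lower bound on the real number of installs and not an exact figure.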

In [117]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
            

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

As with some of the genres in the App Store, some of the categories in the Play Store seem to be skewed by the popularity of a few big apps. For example, in the `COMMUNICATION` category, WhatsApp, Android Messages, Skype, Gmail, etc. seem to affect the results significantly.

In [118]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                     or app[5] == '500,000,000+'
                                     or app[5] == '100,000,000+'):
        print(app[0],':',app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In fact, after removing the apps with 100 million installs or more, the average number of installs per app in this category goes down significantly, from about 38.5 million to about 3.6 million.

In [123]:
under_100m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace('+','')
    n_installs = n_installs.replace(',','')
    n_installs = float(n_installs)
    if app[1] == 'COMMUNICATION' and (n_installs < 100000000):
        under_100m.append(n_installs)

sum(under_100m) / len(under_100m)
        

3603485.3884615386
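
The same comparison can be repeated for any category by turning the cell above into a function with an optional cap. The sketch below assumes the same column positions as `android_final` (category at index 1, installs at index 5). `average_installs` is a hypothetical name and the rows are toy data with placeholder values.

```python
def average_installs(dataset, category, cap=None, category_index=1, installs_index=5):
    """Average installs for `category`, optionally ignoring apps at or above `cap`."""
    values = []
    for row in dataset:
        if row[category_index] != category:
            continue
        n_installs = float(row[installs_index].replace('+', '').replace(',', ''))
        # With a cap, only apps strictly below it are kept
        if cap is None or n_installs < cap:
            values.append(n_installs)
    # Return None when no app matches, to avoid dividing by zero
    return sum(values) / len(values) if values else None

# Toy rows for illustration only; unused columns are left empty
toy_rows = [
    ['Big app', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Mid app', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Small app', 'COMMUNICATION', '', '', '', '1,000+'],
]

print(average_installs(toy_rows, 'COMMUNICATION'))                   # prints 333667000.0
print(average_installs(toy_rows, 'COMMUNICATION', cap=100_000_000))  # prints 500500.0
```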

Something similar is happening with the video players, social, photography, and productivity categories. As a result, these categories might seem more popular and appealing than they really are. It can also mean that these niches are more challenging because they are dominated by a few very strong players.

The games category seems popular but also a bit saturated, especially in the App Store. Since we want apps that work in both app stores, it is better to explore other categories.

Let's take a look at the category `BOOKS_AND_REFERENCE` which would fit our original idea of an app that provides fun facts, references, quizzes, and daily quotes for book readers.



In [124]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0],':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

Although there are still a few apps with a lot of installations, these don't seem to be that many, meaning that there could still be an interesting niche to consider.

In [128]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                              or app[5] == '500,000,000+'
                                              or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+
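
The chained `or` comparisons used in this kind of filter can also be written as a set membership test, which is easier to extend when more cohorts are needed. This is a sketch under the same column assumptions as `android_final`. `apps_in_cohorts` and `HIGH_POPULARITY` are hypothetical names, and the toy rows reuse app names and cohorts printed in this notebook with the other columns left empty.

```python
HIGH_POPULARITY = {'100,000,000+', '500,000,000+', '1,000,000,000+'}

def apps_in_cohorts(dataset, category, cohorts, category_index=1, installs_index=5):
    """Return (app name, installs) pairs for apps of `category` in the given cohorts."""
    return [
        (row[0], row[installs_index])
        for row in dataset
        if row[category_index] == category and row[installs_index] in cohorts
    ]

# Toy rows for illustration only; unused columns are left empty
toy_rows = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Bible', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
    ['Flybook', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
]

print(apps_in_cohorts(toy_rows, 'BOOKS_AND_REFERENCE', HIGH_POPULARITY))
# prints [('Bible', '100,000,000+')]
```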


Let's see what kind of apps are in the middle in terms of popularity:

In [129]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

A lot of the apps seem to be focused on providing software for reading books, as well as dictionaries and libraries. Seeing this, it would be better to stay away from creating an app with the same or a similar function. The data also suggests that building an app around a very popular book or books can be profitable, which means our original idea could be a good fit.

### Conclusion
In this project we analysed data from the App Store and Google Play with the aim of recommending the type of app that would make the most sense to build for the company we are working for.

We believe that a companion app for popular books could be an interesting idea to explore. Such an app would allow users to follow along with their reading, help them learn fun facts about the author and the writing process of the book, and provide daily quotes and fun quizzes that could be used in book clubs.