# Finding a Profitable Niche in Oversaturated App Markets

*For the purposes of this project, I am a data analyst working for a fictional company that builds both Android and iOS apps. This company generates revenue mainly via in-app ads that are free to download and install, so the more users we have that see and engage with our ads*

**What's this project about?**

In the late 2010s, the app markets are oversaturated by now but my company still needs to find a way to generated revenue. 

This project aims to find the kinds of apps that make the most money on both the Apple App Store and Google Play store and the findings in this analysis will use the Python programming language to help the developers in my company make data-driven decisions as to what kind of app to make. 

**What's the goal for this project?**

The goal of this project is to determine what kinds of apps are the most popular but at the same time, find a niche that isn't too saturated either if the app stands a chance at competing and making money. 

# **Opening and exploring the data**


The code in the cell below opens both datasets. You can download the [App Store dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) and the [Google Play dataset](https://www.kaggle.com/lava18/google-play-store-apps/home)

In [98]:
from csv import reader 

###Opens up the Applestore.csv dataset

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
iOS = list(read_file)
iOS_header = iOS[0]
iOS = iOS[1:]

###Opens up the googleplaystore.csv dataset

opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

I explored the data using a function named `explore_data`. This code prints the number of apps(rows) and columns for each dataset.

The dataset is a list of lists. `Start` and `end` are expected to be integers and respectively represent the starting and ending indices of a slice from the dataset. `False` is the default argument for rows_and columns

In [99]:
def explore_data(dataset, start, end, rows_and_columns=False):

    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #Adds a new(empty)line after each row
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

print(iOS_header)
print('\n')
explore_data(iOS, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215

# Deleting Incorrect Data

In this cell, I cleaned the data for the Google Play store. One of the rows(row 10472) has an error- a rating is out of 5 stars and the "Life Made Wi-Fi Touchscreen Photo Frame" lists
a rating of 19 out of 5 of stars, which is mathematically
illogical, so I deleted this app from the dataset. Now the
dataset has 10,480 apps instead of 10,481. 

In [100]:
print(android[10472])
print('\n')
print(android_header)
print('\n')
print(android[0])

print(len(android))
del android[10472]
print(len(android))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
10841
10840


# Removing Duplicate Entries

Another part of data cleaning involves checking for duplicate entries and deleting them. The code below creates two lists: one storing duplicate apps and another for apps with only one entry(unique_apps). I looped through the android dataset and for each iteration, the app name is saved to a variable convienently named name.

If the `name` was already in `unique_apps`, it gets appended into the `duplicate_apps` list. Otherwise it's appended to the `unique_apps` list. 

In [101]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of unique apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of unique apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


The criteria I used as to what duplicate entries to delete wasn't arbitrary- I only kept the most recent entry(the one with the highest amount of user reviews, which makes that row the most recent). 

To do this, I created a dictionary, initializing first with an empty dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest amount of user reviews for that app.

I assigned the name of the app to variable named `name` and converted the number of reviews(as indexed in the 4th column) to `n_reviews`. 

If the `name` already exists as a key in the dictionary and `reviews_max[name]` is less than the number of reviews, the loop updates the number of reviews for that entry in the `reviews_max` dictionary. 

If the `name` is **not in** the dictionary, the loop creates a new entry in the dictionsry where the key is the app name, and the value is the number of reviews.

I didn't use an `else` clause because had I did that, then the number of reviews will be incorrectly updated whenever number of maximum reviews in the app `name` is less than the number of reviews evaluates to false.

In [102]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3]) #3 is the 4th column, reviews
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

I then inspected the length of the dictionary to ensure everything went as expected.

In [103]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


I then used the dictionary created sbove to remove the duplicate rows. First I created 2 empty lists. `android_clean` stores the new cleaned dataset.

The loop below up until the `if` clause is the same as the loop used above. 

But in the `if` clause, if `n_reviews` is equal to the same number of maximum reviews in the app `name` **and** `name` isn't already in the `already_added` list(we need to add this additional condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for some apps), the clause does two things: 

* Append the entire row to the `android_clean` list(which eventually will be a list of lists and store the cleaned dataset).
* Append the `name` to the `already_added` list to keep track of apps that we already added.

In [104]:
android_clean = []
already_added = [] #just stores app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if [n_reviews == reviews_max] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

Then I explored the `android_clean` dataset to ensure I ended up with 9,659 rows.

In [105]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


# Removing Apps Not in English

The language of the apps my company makes is English and we'd only like to analyze apps in English.

There are some apps further down both datasets that may be in characters not in English text(ex. Arabic, Chinese characters), indicating these are not apps directed to an English-speaking audience. 

In [106]:
print(iOS[813][1])
print(iOS[6731][1])
print('\n')
print(android_clean[4412][0]) #App title translated to English
print(android_clean[7940][0]) #App title translated to English

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


Where Am I At?
MBU DX Cluster


One way to remove each app whose name contains a symbol not commonly found in English text(A-Z, 0-9, punctuation marks, other symbols like +, &, etc.). 

Each character we use in a string has a corresponding number associated with it behind the scenes. For instance, 'a' is 97 and 'A' is 65, but '爱' is 29,233. All English characters have a number betwee 0 and 127, as per the ASCII standard. 

The function below, in addtion to the built-in "ord()" function, finds out the encoding number in each character. 

In [107]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


Some English-language apps use symbols or emojis, meaning the following two apps would evaluate to False, despite being apps in English. Let's check the ord numbers of an example symbol and emoji:

In [108]:
print(is_english('Docs To Go™- Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


To minimize apps being removed that are English but happen to have a symbol or emoji, this for loop only removes an app if it has more than 3 characters outside the 0-127 ASCII range. This filter isn't perfect but it's better than the one above. 

In [109]:
def is_english(string):
    non_ascii = 0 #initalizes the for loop
    
    for character in string: 
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


In [110]:
android_english = []
iOS_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in iOS:
    name = app[1]
    if is_english(name):
        iOS_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(iOS_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+'

We are left with 9,614 Android apps and 6,183 iOS apps.

# Isolating the Free Apps

The last step in the data cleaning process is to isolate the free apps from the ones that aren't for our analysis.

In [111]:
iOS_final = []
android_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in iOS_english:
    price = app[4]
    if price == '0.0':
        iOS_final.append(app)

print(len(android_final))
print(len(iOS_final))
    

8862
3222


There are 8,862 free Android apps and 3,222 free iOS apps.

# Most Common Apps by Genre

To minimize risk and overhead, our validation strategy for an app idea is the following: 

1. Build an Android MVP(minimal viable producr) app and add it to Google Play.
2. If it gets a good response from users, develop it further.
3. If after 6 months the app is profitable, build an iOS version and add it to the App store. 

We need to find app profiles that are successful to both markets since our end goal is to add the app onto both Google Play and the App store.

To find what the most common genres for each market are, we need to build frequency tables for a few columns in the datasets.

In [112]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table: 
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        

Let's see what categories are the most popular in the App Store.

In [113]:
display_table(iOS_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


*Disclosure: our dataset only contains free English apps, so any conclusions shouldn't be extended beyond that scope. In other words, just because gaming apps are the most numerous among the free English apps on Google Play doesn't mean that will still hold true for Google Play as a whole!

The most common genre on iOS is games at more than 58%, followed by Entertainment(almost 8%), Photo & Video(almost 5%), Education(almost 4%), and Social Networking(3.29%). . 

Apps meant to be fun number far greater than anything meant to be practical(and not just in the Education category, like with Reference or Food & Drink) when it comes to the App Store. However, just because there are far more fun apps than practical ones doesn't necessarily mean they'll have far more users- it could be the same. 

Let's examine the Google Play store. Note that for this dataset, Category and Genre are closely related.

In [114]:
display_table(android_final, 1) #Category

FAMILY : 18.449559918754233
GAME : 9.873617693522906
TOOLS : 8.440532611148726
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994
PARENTING : 0.65

On Google Play, there are far more practical apps than fun ones. But after investigating further, the "family" category is mainly games for kids. 

But practical apps still outnumber the fun apps on Google Play, as only about 19% of the apps are "Family".

In [115]:
display_table(android_final, -4) #Genre

Tools : 8.429248476641842
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7603249830737984
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

The Genre category has more categories than Category itself.
The frequency tables generated above only tell us the most frequent app genres, not the most users.

Let's see what kind of apps have the most **users**.

# Most Popular Apps by Genre on the App Store

One way to find out what genres have the most users is to calculate the average number of installs per genre. We can find this information in the "Installs" column on Google Play dataset, but for the App Store, this information is missing.

Instead, as a workaround, we'll take the total number of users as a proxy, in the "rating_count_tot" column.

To calculate the average number of user ratings per genre on the App Store

In [116]:
genres_iOS = freq_table(iOS_final, -5)

for genre in genres_iOS:
    total = 0
    len_genre = 0
    for app in iOS_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ":", avg_n_ratings)

Travel : 28243.8
Reference : 74942.11111111111
News : 21248.023255813954
Lifestyle : 16485.764705882353
Business : 7491.117647058823
Photo & Video : 28441.54375
Weather : 52279.892857142855
Entertainment : 14029.830708661417
Games : 22788.6696905016
Food & Drink : 33333.92307692308
Medical : 612.0
Education : 7003.983050847458
Shopping : 26919.690476190477
Social Networking : 71548.34905660378
Health & Fitness : 23298.015384615384
Utilities : 18684.456790123455
Music : 57326.530303030304
Book : 39758.5
Catalogs : 4004.0
Navigation : 86090.33333333333
Finance : 31467.944444444445
Productivity : 21028.410714285714
Sports : 23008.898550724636


Navigation apps have the highest number of user reviews on average, but this figure is influenced heavily by Waze and Google Maps:

In [117]:
for app in iOS_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) #print app name & number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same is true for the Social Networking, Music and Reference categories- all are dominated by only several apps. Yet strangely, Twitter was not in the Social Networking category:

In [118]:
for app in iOS_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

In [119]:
for app in iOS_final:
    if app[-5] == 'Music':
        print(app[1], ':', app[5])

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

In [120]:
for app in iOS_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


I recommend creating an app that is an encyclopedia for the timeline and history of mobile apps that potentially has a quiz format(since fun apps are popular on the App Store) and the Reference category isn't saturated with heavy-hitters(yet).

Sure, one could use Wikipedia, but social media has a rich history despite it's short age of less than 15 years- there's always a YouTube or Facebook scandal that makes the news every week and there are 52 weeks in any given year, hence such an app would have plenty of great content. 

# Most Popular Apps by Genre on Google Play

We actually have data about the number of installations for a given app on Google Play, but the values aren't precise enough. That is, does an app have 100,000 installs, 200,000, 350,000? But we don't need very precise data for our purposes, as we just want to find out what app genre attracts the most users. 

We'll just assume 100,000+ installs means 100,000 installs, 10,000+ installs means 10,000, etc.

In [121]:
display_table(android_final, 5)

1,000,000+ : 15.730083502595352
100,000+ : 11.543669600541637
10,000,000+ : 10.550665763935905
10,000+ : 10.212141728729407
1,000+ : 8.395396073121193
100+ : 6.917174452719477
5,000,000+ : 6.82690137666441
500,000+ : 5.574362446400361
50,000+ : 4.773188896411646
5,000+ : 4.513653802753328
10+ : 3.5432182351613632
500+ : 3.2498307379823967
50,000,000+ : 2.279395170390431
100,000,000+ : 2.1214172872940646
50+ : 1.9183028661701647
5+ : 0.7898894154818324
1+ : 0.5077860528097494
500,000,000+ : 0.2708192281651997
1,000,000,000+ : 0.22568269013766643
0+ : 0.045136538027533285
0 : 0.011284134506883321


However, we'll need to convert each install number from a string to a float if we want to perform computations and that involves removing the commas and plus characters. Otherwise, the conversion will result in an error message.

In [122]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '') #removes plus characters
            n_installs = n_installs.replace(',', '') #removes commas
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ":", avg_n_installs)

ENTERTAINMENT : 21134600.0
TRAVEL_AND_LOCAL : 13984077.710144928
AUTO_AND_VEHICLES : 647317.8170731707
SHOPPING : 7036877.311557789
EVENTS : 253542.22222222222
FOOD_AND_DRINK : 1924897.7363636363
SPORTS : 3638640.1428571427
MEDICAL : 120616.48717948717
COMICS : 817657.2727272727
GAME : 15837565.085714286
BUSINESS : 1712290.1474201474
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
LIBRARIES_AND_DEMO : 638503.734939759
SOCIAL : 23253652.127118643
EDUCATION : 3082017.543859649
FAMILY : 2691618.159021407
BEAUTY : 513151.88679245283
VIDEO_PLAYERS : 24852732.40506329
WEATHER : 5074486.197183099
TOOLS : 10695245.286096256
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
ART_AND_DESIGN : 1905351.6666666667
HEALTH_AND_FITNESS : 4188821.9853479853
BOOKS_AND_REFERENCE : 8767811.894736841
PERSONALIZATION : 5201482.6122448975
PHOTOGRAPHY : 17805627.643678162
MAPS_AND_NAVIGATION : 4056941.7741935486
NEWS_AND_MAGAZINES : 9549178.467741935
HOUSE_AND_HOME : 1313681.9054054054

Communication apps have the most installs, on average, with over 38 million, but like the App Store, this category too is dominated by a few apps and these apps have either a billion installs, 500 million, or 100 million.

In [123]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100.000.000+'):
                print(app[0], ':', app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
WhatsApp Messenger : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
imo free video calls and chat : 500,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+


The average amount of installs would be reduced roughly 10 times if we removed all the communication apps that have over 100 million installs:

In [124]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace('+', '')
    n_installs = n_installs.replace(',', '')
    if(app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386

The video players category is also dominated by a few apps, like YouTube, Google Play Movies & TV, or MX Player. We notice the same pattern with social(ex. Facebook, Google+), photography(ex. Google Photos), and productivity apps(ex. Microsoft Word, Dropbox, Google Calendar). 

In [133]:
for app in android_final:
    if app[1] == 'VIDEO_PLAYERS' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
                print(app[0], ':', app[5])

YouTube : 1,000,000,000+
Motorola FM Radio : 100,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+


In [134]:
for app in android_final:
    if app[1] == 'SOCIAL' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
                print(app[0], ':', app[5])

Facebook : 1,000,000,000+
Instagram : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Snapchat : 500,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
LinkedIn : 100,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


In [132]:
for app in android_final:
    if app[1] == 'PRODUCTIVITY' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
                print(app[0], ':', app[5])

Microsoft Word : 500,000,000+
Adobe Acrobat Reader : 100,000,000+
Google Drive : 1,000,000,000+
Microsoft Outlook : 100,000,000+
Microsoft Excel : 100,000,000+
Microsoft OneDrive : 100,000,000+
Microsoft OneNote : 100,000,000+
Google Keep : 100,000,000+
ES File Explorer File Manager : 100,000,000+
Dropbox : 500,000,000+
Google Calendar : 500,000,000+
Google Docs : 100,000,000+
Evernote – Organizer, Planner for Notes & Memos : 100,000,000+
Microsoft PowerPoint : 100,000,000+
Samsung Notes : 100,000,000+
SwiftKey Keyboard : 100,000,000+
Google Sheets : 100,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100,000,000+
Google Slides : 100,000,000+
ColorNote Notepad Notes : 100,000,000+
Cloud Print : 500,000,000+
CamScanner - Phone PDF Creator : 100,000,000+


The main concern again is that these handful of apps in each of the above categories are more popular than they really are. Also, these categories are dominated by a few giants and that's hard to compete against for a small company not named Facebook, Google+, etc. 

The games category, as we already saw, is oversaturated, but let's see if our conclusion about the Reference category holds water with the Google Play store, in their Books & Reference category

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc.

In [129]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])


Wattpad 📖 Free Books : 100,000,000+
E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Amazon Kindle : 100,000,000+
Cool Reader : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Oxford Dictionary of English : Free : 10,000,000+
Offline: English to Tagalog Dictionary : 500,000+
Spanish English Translator : 10,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
NOOK App for NOOK Devices : 500,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,00

In [131]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
                print(app[0], ':', app[5])

Wattpad 📖 Free Books : 100,000,000+
Amazon Kindle : 100,000,000+
Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Audiobooks from Audible : 100,000,000+


This category is also dominated by a handful apps. Let's see what apps are in the middle of the pack, with installs between a million and 50 million: 

In [135]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                      or app[5] == '5,000,000+'
                                      or app[5] == '10,000,000+'
                                      or app[5] == '50,000,000+'):
                print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
Oxford Dictionary of English : Free : 10,000,000+
Spanish English Translator : 10,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
English Dictionary - Offline : 10,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) :

In this install range, it appears to be dominated by software for reading and processing ebooks, but there are several apps that are dictionaries, so if we started making the app to be a dictionary of Internet slang that gets regularly updated, our app might have a chance to compete in this niche. After all, it's hard to keep track of what lingo is being used these days on social media so such a reference would certainly be helpful. 

# In Conclusion

In this project, I analyzed the datasets from the App Store and Google Play Store to determine what kind of app my fictional app company should make.

I determined such an app to be a dictionary of Internet slang that gets regularly updated. If the Android App succeeds, then the iOS App should add some features, like an encyclopedia for the timeline and history of mobile apps that potentially has a quiz format(since fun apps are popular on the App Store) and the Reference category isn't saturated with heavy-hitters.