# Analyzing Google Play and Apple Store apps for Profitable Profiles 

The aim of this project is to find mobile app profiles that are profitable in both the Google Play and Apple Store markets. The analysis is for a company that builds Android and iOS mobile apps, so that it can make data-driven decisions about what types of apps to build.

Any app the company builds will be free for all users, and all revenue will come from in-app ads. Revenue therefore depends directly on the number of users, so we need to determine what types of apps are likely to attract users in both markets.


# Opening and Exploring Datasets

The most recent data for this analysis, as of September 2018, consists of:

- Apple Store = ~2 million iOS apps
- Google Play = ~2.1 million Android apps

However, to save on costs and time, we are utilizing a condensed free version with a subset of data for each market.

- [Google Play Store dataset](https://www.kaggle.com/lava18/google-play-store-apps) contains approximately ten thousand records of apps currently in their marketplace.  Download directly using [this link](https://www.kaggle.com/lava18/google-play-store-apps/download)
- [Apple Store dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) contains approximately seven thousand records of apps currently in their marketplace.  Download directly using [this link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/download)

Below we will open and explore both datasets.  To start, we open, read and store both datasets into individual variables.


In [1]:
from csv import reader

# Google Play dataset (the with-block closes the file once it has been read)
with open('/Users/AJT/my_datasets/googleplaystore.csv', encoding='utf8') as opened_file:
    android_dataset = list(reader(opened_file))
android_header = android_dataset[0]
android_body = android_dataset[1:]

# Apple Store dataset
with open('/Users/AJT/my_datasets/AppleStore.csv', encoding='utf8') as opened_file:
    ios_dataset = list(reader(opened_file))
ios_header = ios_dataset[0]
ios_body = ios_dataset[1:]
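
Both datasets are loaded with the same steps, so the logic could be factored into one helper. The sketch below is only an illustration: the name `load_csv` and the `encoding='utf8'` default are assumptions rather than part of the original notebook, and the demonstration uses a small temporary file instead of the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Return (header, body) for the CSV file at `path`.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    # newline='' is the setting the csv module recommends for file objects
    with open(path, newline='', encoding=encoding) as opened_file:
        rows = list(csv.reader(opened_file))
    return rows[0], rows[1:]


# Demonstration on a tiny temporary file rather than the real datasets
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf8') as tmp:
    csv.writer(tmp).writerows([['App', 'Price'], ['Demo App', '0']])
    demo_path = tmp.name

demo_header, demo_body = load_csv(demo_path)
os.remove(demo_path)  # clean up the temporary file

print(demo_header)
print(demo_body)
```

With such a helper, each dataset would be loaded with one call, for example `android_header, android_body = load_csv(...)`.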

To make the data easier to read and explore, we create a function that prints rows in a more readable format.  An optional argument also prints the number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new empty line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


Below, we use the `explore_data` function to display the first few rows of each dataset and get an idea of the format.  Before the rows, we print the header, which holds the column names.

As can be seen below, the most relevant Google Play columns for our analysis are `App`, `Category`, `Rating`, `Installs`, `Type`, `Price` and `Genres`.

In [3]:
print(android_header) # google dataset column names/categories
print('\n')
explore_data(android_body, 0, 3, True) # first three data rows (the header is excluded from android_body)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


As can be seen below, the most relevant Apple Store columns for our analysis are `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver` and `prime_genre`.  Not all the columns are self-explanatory, so refer to the "Content" section of the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for details.

In [4]:
print(ios_header)
print('\n')
explore_data(ios_body, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


# Deleting Incorrect Data

The Google Play dataset has a general [discussion board](https://www.kaggle.com/lava18/google-play-store-apps/discussion).  One data issue raised there concerns the row at index 10472 (not counting the header).  You can view the [specific discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).

In [5]:
print(android_header) # show columns
print('\n')
print(android_body[10472]) # incorrect row
print('\n')
print(android_body[0]) # example of properly formatted row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


As we can see, the row for *'Life Made WI-Fi Touchscreen Photo Frame'* has only 12 values instead of 13: the `Category` value is missing, so every later value is shifted one column to the left.  As a result, the `Rating` column shows *19*, although Google Play ratings have a maximum of *5*, and the `Installs` column shows *'Free'*, which belongs under the `Type` column.  Therefore, we deleted this bad row.

In [6]:
print(len(android_body))
del android_body[10472]
print(len(android_body))

10841
10840
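
The row printed above has 12 values while the header has 13, so rows like it can also be located programmatically by comparing lengths. The following is a minimal sketch on made-up sample rows; `find_malformed_rows` is a hypothetical helper name, not something defined in the notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's.

    Hypothetical helper used for illustration only.
    """
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample data: the second row is missing its Category value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],
    ['Another App', 'GAME', '3.9'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

A length check only catches rows with missing or extra fields; a row with the right number of fields but wrong values would still need a separate check.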


# Removing Duplicate Entries

## Part One

After checking for rows with bad data, we looked for duplicate rows by comparing the app names.  As an example, we can see below that the app "Instagram" has four records in the Google Play data.

In [7]:
for app in android_body:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Below, the code checks for duplicates.  The total number of duplicate entries in Google Play is 1,181.  As an example, we display the first 10 duplicate entries found.

In [8]:
unique_android_apps = []
duplicate_android_apps = []

for app in android_body:
    name = app[0]
    
    if name in unique_android_apps:
        duplicate_android_apps.append(name)
    
    else:
        unique_android_apps.append(name)

print('Number of duplicate apps: ', '\n', len(duplicate_android_apps))
print('\n')
print('Examples of first 10 apps: ', '\n', duplicate_android_apps[:10])


Number of duplicate apps:  
 1181


Examples of first 10 apps:  
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
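
The same count can be obtained with `collections.Counter`, which avoids rescanning a list for every row. This is a sketch on made-up sample rows, and `count_duplicates` is a hypothetical name. It returns the apps that occur more than once and the number of surplus rows, which corresponds to the duplicate count reported above.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return ({name: occurrences} for repeated names, number of surplus rows).

    Hypothetical helper used for illustration only.
    """
    counts = Counter(row[name_index] for row in rows)
    duplicates = {name: n for name, n in counts.items() if n > 1}
    # Each repeated app contributes (occurrences - 1) surplus rows
    surplus_rows = sum(n - 1 for n in duplicates.values())
    return duplicates, surplus_rows


# Made-up sample data: only the app name column is present
sample_rows = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits'], ['Slack']]

duplicates, surplus_rows = count_duplicates(sample_rows)
print(duplicates)
print(surplus_rows)
```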


Because duplicate apps exist in the dataset, we need to remove them according to a specific criterion.

Looking at the duplicates, the records are identical except for the number of reviews (see the Instagram records above).

Therefore, one option is to keep the record with the highest number of reviews, which is presumably the most recently collected one.  We keep only that record, and if several records for the same app share the highest number, we keep just one of them.


## Part Two

The code below lets us keep only one record per app: the one with the highest number of reviews.

First, we fill the `reviews_max` dictionary and compare the expected number of records with the actual number.

As we can see, the expected length matches the actual length.  The dictionary stores only one maximum per app name, however, so the dataset itself may still contain several rows for an app that share the same maximum number of reviews.

In [9]:
reviews_max = {}

for app in android_body:
    name = app[0]
    n_reviews = float(app[3])
    
    # update the stored maximum when this row has more reviews than
    # the highest count seen so far for the same app name
    
    if name in reviews_max and reviews_max[name] < n_reviews: 
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(android_body) - 1181)
print('Actual length:', len(reviews_max))


Expected length: 9659
Actual length: 9659


Next, we remove duplicate apps that might have another entry with the same number of reviews.

Below, we recheck the entire Google dataset and compare against the `reviews_max` dictionary, which is now filled.  

As we can see, the number of rows is still 9659.  Included are example rows of data.

In [10]:
android_clean = []
already_added = []

for app in android_body: # second pass over the Google Play data
    name = app[0]
    n_reviews = float(app[3]) 
    
    # keep a row only if it holds the app's maximum review count in "reviews_max"
    # and the app has not been added yet (guards against ties on the maximum)
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
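
The two passes above could also be folded into a single pass that stores the best row per app directly. The sketch below runs on a few sample rows (the Instagram review counts come from the output above; the Box row is made up), and `keep_most_reviewed` is a hypothetical name. One difference from the notebook's approach: the result is ordered by each app's first appearance rather than by the position of the row that is kept.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count.

    Hypothetical helper used for illustration only.
    """
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Sample rows trimmed to four columns; the Box values are made up
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```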


# Removing Non-English Apps

## Part One

Now, we are going to remove apps whose names are not in English, because the analysis targets apps designed for an English-speaking audience.

Below, we have written a function that uses the built-in `ord()` function to check whether any character in a string has a code point greater than 127.  The characters commonly used in English text (the ASCII set) fall in the range 0-127.

If any character in the name is above 127, we treat the name as non-English, as the examples below show.

In [11]:
def english_check(string):
    for character in string:
        if ord(character) > 127:
            return False
        
    return True

        
print(english_check('Instagram'))
print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check('Instachat 😜'))
    

True
False
False
False


## Part Two

However, this approach has a problem with emojis and symbols such as ™, whose code points are above the ASCII limit of 127.

For example, `Instachat 😜` and `Docs To Go™ Free Office Suite` were rejected (the function returned `False`), even though both are English app names.

Both apps target an English-speaking audience but happen to contain special characters or emojis above the ASCII range.

To reduce this data loss, we allow up to three characters outside the ASCII range of 0-127 in a name.

The assumption is that a name with only a few such characters still targets an English-speaking audience.

In [12]:
def english_check(string):
    counter = 0
    for character in string:    
        if ord(character) > 127:
            counter += 1
    
    if counter > 3:
        return False
    else:
        return True

        
print(english_check('Instagram'))
print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check('Instachat 😜'))

True
False
True
True


As shown above, the new version of `english_check` accepts names that contain up to three emojis or symbols.

For example, 'Instachat 😜' and 'Docs To Go™ Free Office Suite' are now accepted (the function returns `True`).
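
The limit of three is a judgment call, so it can be useful to make it a parameter and see how different limits behave. This is a minimal sketch; the name `is_probably_english` and the `max_non_ascii` parameter are assumptions introduced for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above 127.

    Hypothetical variant of english_check with a configurable limit.
    """
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

default_results = [is_probably_english(name) for name in names]
strict_results = [is_probably_english(name, max_non_ascii=0) for name in names]
print(default_results)
print(strict_results)
```

With the default limit the results match the second version of `english_check`; with a limit of zero they match the first, stricter version.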

In [13]:
#append name of English name apps to separate list for both datasets
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if english_check(name):
        android_english.append(app)

for app in ios_body:
    name = app[1]
    if english_check(name):
        ios_english.append(app)


explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

With the Google Play duplicates removed, we ran both datasets through the `english_check` function we wrote above.  

We now have 9,614 Android apps and 6,183 iOS apps remaining.

# Isolating the Free Apps
Since the company only builds free apps whose revenue comes from in-app ads, we restrict the analysis to free apps.  

Below, we create new lists that narrow both datasets down to apps with a price of zero.

In [25]:
android_final = []
ios_final = []

for app in android_english: #index 7
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english: #index 4
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
    
print(len(android_final))
print(len(ios_final))
#print(ios_final)
    

8864
3222


We are left with 8,864 Android apps and 3,222 iOS apps.  This will be our dataset for analysis.
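
The comparison above relies on the exact strings `'0'` and `'0.0'`, which differ between the two datasets. A numeric comparison is a little more robust. The sketch below is an illustration only: `is_free` is a hypothetical name, and the handling of a leading `$` is an assumption about how paid prices might be formatted.

```python
def is_free(price_text):
    """Return True if the price string represents zero.

    Hypothetical helper; stripping a leading '$' is an assumption about
    how paid prices may be formatted.
    """
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as "not free"
        return False


checks = [is_free('0'), is_free('0.0'), is_free('$4.99'), is_free('Everyone')]
print(checks)
```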

# Analysis - Most Common Apps by Genre %

## Part One
To recap, we narrowed the datasets by removing incorrect data, duplicates, non-English apps, and non-free apps.  This leaves us with clean data to analyze.

The plan for validating an app idea is to:
1. Build a minimal Android app.
2. If the response is positive, develop the Android app further.
3. If it's profitable after six months, build an iOS version and publish it on the Apple Store.

Because the end goal is to have the app in both stores, we need to find types of apps that are successful in both Google Play and the Apple Store.

## Part Two

We will build two functions:
1. `freq_table` builds a frequency table as a dictionary that maps each value in a column to its percentage of the total.
2. `display_table` calls `freq_table`, turns each key/value pair into a (value, key) tuple, and uses `sorted()` to print the entries in descending order of percentage.


In [37]:
# freq_table counts how often each value occurs in the given column
# i.e. each key is a column value (e.g. a genre) and each value is its share in percent
def freq_table(dataset, index):
    table = {} 
    total = 0
    for row in dataset:
        total += 1
        value = row[index] # i.e. ratings in index # assigned to value
        if value in table: # i.e. checking ratings in table dictionary keys
            table[value] += 1 # table[value] is the key > if key exists, add 1 to the value
        else:
            table[value] = 1 # table[value] is the key > if key DOESN'T exists, add the key w/ value of 1
    
    # print(table)
    # total = sum(table.values())
    table_percentages = {}
    for key in table: # looping over dictionary (table) is default over the dictionary keys
        # print(table[key])
        percentage = (table[key] / total) * 100 # finds the value using key in table and divides by total
        table_percentages[key] = percentage # assigns the value in percentage to the key
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key) # reversed order value and then key
        # print(key_val_as_tuple)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True) # sorted() DESC if reverse = True
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) # printed in reverse index 1 and then 0
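
For comparison, the same percentage table can be sketched with `collections.Counter` and a sort key, which avoids swapping keys and values into tuples. The names `freq_table_counter` and `sorted_percentages` are hypothetical, and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Map each value found at `index` to its percentage of all rows."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs in descending order of percentage."""
    table = freq_table_counter(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data: app name and genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

result = sorted_percentages(sample, 1)
for value, percentage in result:
    print(value, ':', percentage)
```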


## Part Three

Examine the frequency table for the `prime_genre` column of the Apple Store dataset.

In [36]:
display_table(ios_final,-5) # prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Sorted by share of `prime_genre`, the free English iOS apps are dominated by Games, which account for more than 58% of the total, followed by Entertainment (about 8%) and Photo & Video (about 5%).  Practical genres such as Education, Shopping, Utilities, and Productivity each account for only a small percentage.  However, a large share of apps in a genre doesn't necessarily mean that the genre has the most users.

In [28]:
display_table(android_final,1) # Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Above, we can see that the distribution of categories in Google Play is very different from the Apple Store.  `FAMILY` leads with about 19%, `GAME` comes second with under 10%, and practical categories such as `TOOLS` and `BUSINESS` are better represented.

In [29]:
display_table(android_final,-4) # Genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Google Play has a `Genres` column in addition to the `Category` column we just looked at.  `Genres` is more granular, splitting the apps into many more groups.  

Overall, the Apple Store leans toward entertainment genres, while Google Play leans toward practical ones.

# Analysis - Most Popular Genre by Number of Ratings - Apple Store

We now want to estimate the number of users per genre to get more insight.  For Google Play, the `Installs` column gives a good approximation.  The Apple Store dataset has no such column, so we use the total number of user ratings, the `rating_count_tot` column, as a proxy.

Below, we compute the average number of ratings per genre for the Apple Store.  The average number of installs for Google Play follows in the next section.  

In [30]:
ios_genres = freq_table(ios_final,-5)

for genre in ios_genres: # for each genre in the freq_table (one)
    running_total = 0
    len_genre = 0
    for app in ios_final: # loop through ios_final (to many) > grab each app rating count
        genre_app = app[-5]
        if genre_app == genre:
            rating_count = float(app[5])
            running_total += rating_count
            len_genre += 1
    avg_num_ratings = running_total / len_genre
    print(genre, ':', avg_num_ratings)


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
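
The nested loop above rescans the whole dataset once per genre. The same averages can be sketched in a single pass by accumulating totals and counts per group. `average_by_group` is a hypothetical name, and the sample rows are trimmed to three columns, with rating counts taken from the outputs in this notebook.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: average of the numeric column} in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Sample rows: name, rating count, genre
sample_rows = [
    ['Waze', '345046', 'Navigation'],
    ['Google Maps', '154911', 'Navigation'],
    ['Bible', '985920', 'Reference'],
]

averages = average_by_group(sample_rows, group_index=2, value_index=1)
print(averages)
```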


On average, Navigation apps have the highest number of user ratings.  However, a closer look shows that this figure is heavily skewed by just two apps, Waze and Google Maps.

In [38]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The same applies to Social Networking and Music apps, which are heavily skewed toward a few giants such as Facebook and Pandora.

Navigation, Social Networking or Music apps might seem more popular than they really are. The averages are skewed due to the few which are well over 100,000 ratings, while other apps are around 10,000 or less. 

One idea is to remove the heavily skewed apps and recalculate the average, but we may do this for a different analysis.
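
A lighter alternative to removing apps by hand is to compare the mean with the median, which is much less sensitive to a few very large values. The sketch below uses the six Navigation rating counts printed above; `summarize` is a hypothetical helper name.

```python
from statistics import mean, median


def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {'mean': mean(values), 'median': median(values)}


# Rating counts of the six Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

summary = summarize(navigation_ratings)
print(summary['mean'])
print(summary['median'])
```

The mean is about 86,090 while the median is 8,196.5, which illustrates how strongly the two largest apps pull the average upward.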

Reference apps have 74,942 user ratings on average, but the Bible and Dictionary.com apps skew that average upward:

In [39]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The Reference genre does show promise: apart from the Bible and Dictionary.com apps, the rating counts are low, so a new app could stand out more easily.  One idea is to build variations on the Bible or dictionary concept (for example an audio version, a quiz version, or verses/words of the day).  Incorporating elements of the dominant, entertainment-related genres (games, video, social networking) may differentiate an app in the low-average Reference genre and give it a better chance of succeeding.  This would combine the somewhat oversaturated entertainment market with the less saturated market for practical apps.

Other genres that seem popular include Weather, Book, Food & Drink, and Finance. The Book genre seems to overlap a bit with the app idea described above, but the other genres don't seem too interesting to us:

Weather apps: people generally don't spend much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app therefore requires actual cooking and a delivery service, which is outside the scope of our company.

Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

# Analysis - Most Popular Genre by Number of Installs - Google Play

For Google Play, we can use the install counts directly, from the `Installs` column of the Google dataset.

In [41]:
display_table(android_final, 5) # the Installs columns

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The issue is that these figures are not precise, given the wide ranges.  For example, 100,000+ can be anywhere between 100,000 and the next interval of 500,000.

For our purposes, we will treat each value as its lower bound (100,000+ becomes 100,000) and convert it to a float after removing the commas and the trailing plus sign.
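
The conversion just described can be isolated in a small helper so that it is written only once. This is a sketch; `parse_installs` is a hypothetical name that does not appear in the notebook.

```python
def parse_installs(installs_text):
    """Convert an install bucket such as '100,000+' to its lower bound as a float.

    Hypothetical helper used for illustration only.
    """
    return float(installs_text.replace(',', '').replace('+', ''))


parsed = [parse_installs(text) for text in ['100,000+', '1,000,000,000+', '0+', '0']]
print(parsed)
```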

In [42]:
android_category = freq_table(android_final, 1) 

for category in android_category: # loop through category frequency (one)
    total = 0 # running total of each category/genre
    len_category = 0 # starts at zero and adds 1 for each genre in nested loop finds against outer genre
    
    for app in android_final: # loop through each category (many)
        category_app = app[1]
        # print(type(app[5]))
        if category_app == category: 
            n_installs = app[5] # save num of installs
            n_installs = n_installs.replace('+','') # remove string characters and convert to float
            n_installs = n_installs.replace(',','')
            total += float(n_installs) # add converted n_installs variable to total
            len_category += 1 # increment length of category by 1
    
    avg_installs = total / len_category
    print(category, ':', avg_installs)
            

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, Communication apps have the most installs. However, as with the iOS Navigation, Music, and Social Networking genres, this average is skewed by a handful of apps with over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) and a few others with over 100 million or 500 million installs:

In [43]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

To drill down further, we remove the outlier Communication apps with 100 million or more installs and recompute the average:

In [44]:
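under_100_m = []

for app in android_final:
    # Strip ',' and '+' so the install string can be converted to a number
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    # Keep only Communication apps with fewer than 100 million installs
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))

# Average installs after excluding the outliers
sum(under_100_m) / len(under_100_m)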

3603485.3884615386
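
The cleaning and filtering steps in the cell above could be wrapped in a small reusable helper. This is a minimal sketch rather than part of the original notebook: the names `parse_installs`, `average_installs`, and `sample_apps` are hypothetical, and it assumes, as in the cells above, that the category sits at index 1 and the install string at index 5.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))


def average_installs(dataset, category, cap=None):
    """Average installs for one category, optionally keeping only apps below `cap`."""
    installs = []
    for app in dataset:
        if app[1] != category:
            continue
        n_installs = parse_installs(app[5])
        if cap is None or n_installs < cap:
            installs.append(n_installs)
    # Guard against an empty selection to avoid dividing by zero
    return sum(installs) / len(installs) if installs else 0.0


# Hypothetical rows for illustration only; columns other than 0, 1 and 5 are left blank
sample_apps = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '5,000,000+'],
    ['App C', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App D', 'TOOLS', '', '', '', '10,000+'],
]

print(average_installs(sample_apps, 'COMMUNICATION'))
print(average_installs(sample_apps, 'COMMUNICATION', cap=100000000))
```

On the real data, a call such as `average_installs(android_final, 'COMMUNICATION', cap=100000000)` should correspond to the calculation in the cell above, since it applies the same cleaning and the same threshold.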

We see the same pattern in the Video Players category, the runner-up with an average of 24,727,872 installs. That market is dominated by apps like YouTube, Google Play Movies & TV, and MX Player. The pattern repeats for Social apps (where we have giants like Facebook, Instagram, Google+, etc.), Photography apps (Google Photos and other popular photo editors), and Productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).
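
One way to check for this kind of skew without listing apps by hand is to compare the mean with the median for a category: a mean far above the median suggests that a few very large apps are pulling the average up. The sketch below is an illustration under assumptions, not a step from the original analysis; `skew_report` and the sample rows are hypothetical, and the column indices follow the cells above.

```python
from statistics import mean, median


def skew_report(dataset, category):
    """Return (mean, median) of installs for one category.

    Assumes the category contains at least one app.
    """
    installs = [
        float(app[5].replace(',', '').replace('+', ''))
        for app in dataset
        if app[1] == category
    ]
    return mean(installs), median(installs)


# Hypothetical rows for illustration only
sample_apps = [
    ['Giant video app', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['Small player 1', 'VIDEO_PLAYERS', '', '', '', '1,000,000+'],
    ['Small player 2', 'VIDEO_PLAYERS', '', '', '', '500,000+'],
]

category_mean, category_median = skew_report(sample_apps, 'VIDEO_PLAYERS')
print('mean:', category_mean)
print('median:', category_median)
```

In this made-up sample, the single giant app lifts the mean far above the median, which is the same effect described for the real categories.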

A few things stand out:
1. Categories skewed by a few large players may not be as popular as their averages suggest.
2. Because those few apps dominate, these categories may be hard to compete in.

The Game genre is popular as well, but it may be oversaturated, so we will look further for another app genre to recommend.

The Books and Reference genre looks fairly popular as well (right below entertainment/fun apps). It shows a fairly high number of installs and rating count across Google Play and the Apple Store. This may be of interest, since the goal is to find something that is profitable in both markets.

In [47]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E


Drilling further into the Google Play Books and Reference genre, we see that a small number of extremely popular apps still skew the average:

In [49]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+



However, aside from the 5 apps with 100,000,000 or more installs, there are plenty of apps that are sizeable without being dominant. Below is a list of Books and Reference apps with between 1,000,000 and 100,000,000 installs:

In [50]:

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
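
The filter above matches install tiers as strings. It could also be expressed as a numeric range check, which avoids having to enumerate every tier. This is a minimal sketch, assuming the same column layout; `in_install_range` is a hypothetical helper, and the sample rows are trimmed to the name, category, and install columns, with values taken from the outputs above.

```python
def in_install_range(raw, low, high):
    """True if the install string falls in the half-open range [low, high)."""
    n_installs = int(raw.replace(',', '').replace('+', ''))
    return low <= n_installs < high


# Trimmed rows for illustration; only indexes 0, 1 and 5 are filled in
sample_apps = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Litnet - E-books', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
]

mid_sized = [
    app[0]
    for app in sample_apps
    if app[1] == 'BOOKS_AND_REFERENCE'
    and in_install_range(app[5], 1000000, 100000000)
]
print(mid_sized)
```

The half-open range keeps apps at the 1,000,000+ tier and excludes those at 100,000,000+, matching the tiers selected in the cell above.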

Based on this analysis, the Books and Reference niche appears to be crowded with e-book readers, translators, and dictionaries. It would probably be best to avoid competing directly with them.

The Quran has many app variations, which may indicate demand.
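
A rough way to gauge how crowded a sub-niche is would be to count how many app names mention a given keyword. The sketch below is illustrative only: `count_by_keyword` and the keyword list are assumptions, the sample rows are trimmed to the columns used (names and installs taken from the outputs above), and simple substring matching will miss spelling variants such as "Koran".

```python
def count_by_keyword(dataset, category, keywords):
    """Count apps in a category whose name contains each keyword (case-insensitive)."""
    counts = {keyword: 0 for keyword in keywords}
    for app in dataset:
        if app[1] != category:
            continue
        name = app[0].lower()
        for keyword in keywords:
            if keyword.lower() in name:
                counts[keyword] += 1
    return counts


# Trimmed rows for illustration; only indexes 0, 1 and 5 are filled in
sample_apps = [
    ['Al-Quran (Free)', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Al Quran Indonesia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Koran Read &MP3 30 Juz Offline', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Dictionary - WordWeb', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['English-Myanmar Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Moon+ Reader', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
]

keyword_counts = count_by_keyword(
    sample_apps, 'BOOKS_AND_REFERENCE', ['quran', 'dictionary', 'reader']
)
print(keyword_counts)
```

A high count for a keyword can be read either as demand or as competition, so it is best treated as a starting point for a closer look at the individual apps.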

# Conclusions

We analyzed the data to identify which genres could be profitable for mobile apps on both the App Store and Google Play.

The conclusion is that taking a popular book (perhaps a more recent one) and turning it into an app could be profitable in both the Google Play and App Store markets. Both markets are already full of libraries, so an app built around a single book or dictionary, for instance the Quran, would need to differentiate itself with features such as daily quotes, quizzes, and discussion forums.