# Profitable App Profiles for the App Store and Google Play Markets

In this project we give advice to a company that builds Android and iOS mobile apps. The apps are free, and the only source of revenue is in-app ads. Therefore, the more users who use and interact with these apps, the better.

Our goal is to help the developers understand which apps are more likely to attract more users.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

* A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from <a href="https://dq-content.s3.amazonaws.com/350/googleplaystore.csv">this link</a>.

* A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from <a href="https://dq-content.s3.amazonaws.com/350/AppleStore.csv">this link</a>.

In [1]:
from csv import reader

### The Google Play data set ###
# 'with' closes the file automatically once the rows are read
with open('googleplaystore.csv', encoding="utf8") as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding="utf8") as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

To make it easier to explore the two data sets, we'll first write a function named `explore_data()` that we can use repeatedly to explore rows in a more readable way. We'll also add an option for our function to show the number of rows and columns for any data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


---
The data contains 10841 rows and 13 columns. Of these columns, the most interesting for our purpose seem to be 'App', 'Rating', 'Reviews', 'Installs', 'Type', and 'Price'. We are only interested in apps with <b>Type = Free (and therefore Price = 0)</b>.

Now we can look at the iOS data:

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


---
The iOS data contains 7197 rows and 16 columns. The most relevant for us are 'track_name', 'price', 'rating_count_tot', 'user_rating', and 'rating_count_ver'. We are interested in the free apps: <b>price = 0.0</b>

## Deleting Wrong Data

The Google Play data set has a dedicated discussion section, and one of the discussions describes an error in row 10472. Instead of looking only at this row, I decided to print all rows that have a rating higher than 5 (the maximum). 'Rating' is at index 2 in the android data set.

In [4]:
for row_n, row in enumerate(android):
    if float(row[2]) > 5:
        print(row_n) # we print also the row number
        print(row)

10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


---
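Indeed, the only row with an invalid rating is row 10472. It has 12 values instead of 13 because the 'Category' field is missing, which shifts every later value one column to the left. We are therefore going to delete it.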

In [5]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840
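
The faulty row above was found through its impossible rating, but the underlying problem is a missing field. A complementary check, sketched here on hypothetical sample rows rather than the real data set, is to flag every row whose number of fields differs from the header. The helper name `find_malformed_rows` is an assumption made for illustration.

```python
# Hypothetical sample: a three-column header and two rows,
# the second of which is missing its 'Category' value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Broken Row', '1.9'],  # one field short, so values shift left
]

def find_malformed_rows(dataset, expected_len):
    """Return (index, row) pairs whose field count differs from expected_len."""
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected_len]

malformed = find_malformed_rows(rows, len(header))
print(malformed)
```

On the real data this would be called as `find_malformed_rows(android, len(android_header))`. It only catches rows with a wrong field count, so it complements the rating check rather than replacing it.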


## Removing Duplicate Entries

### Part One
If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [6]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 cases where an app occurs more than once:

In [7]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of unique apps:', len(unique_apps))
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of unique apps: 9659
Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


---
We are not going to delete the duplicate rows randomly. Instead, we will keep the entry with the highest number of reviews, because a higher review count suggests a more recent entry and therefore more reliable ratings.

To do that, we will:

* Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
* Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

### Part Two
We start by building the `reviews_max` dictionary described above. While looping over the data, we also print the Instagram entries again for reference:

In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The length of `reviews_max` should now equal 9659, the number of unique apps calculated above:

In [9]:
print('Expected length:', len(unique_apps))
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


Based on the `reviews_max` dictionary, we are now going to remove the duplicate entries and store the result in a new list called `android_clean`.

We also need an `already_added` list to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for some apps.

In [10]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block

In [11]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


---
We have 9659 rows, as expected.
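
The two-step approach above (first a dictionary of maximum review counts, then a filtering pass) can also be folded into a single pass. The following is a minimal sketch under stated assumptions, not the method used in this notebook: the function name `keep_most_reviewed` and the sample rows (including their review counts for Box) are hypothetical.

```python
# Hypothetical rows in the form [name, category, rating, reviews]
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

def keep_most_reviewed(dataset, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' means the first row wins when review counts tie,
        # which plays the role of the already_added list.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

deduped = keep_most_reviewed(sample)
for row in deduped:
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas `android_clean` is ordered by the position of the kept row, so the two lists can contain the same rows in a different order.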

---

The iOS data set has no duplicate entries when we check the `id` column (index 0), as we can see below. We therefore do not need to perform this cleaning step for this data set.

In [12]:
duplicate_apps = []
unique_apps = []

for app in ios:
    app_id = app[0]  # index 0 is the 'id' column, not the app name
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)

print('Number of unique apps:', len(unique_apps))
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of unique apps: 7197
Number of duplicate apps: 0


Examples of duplicate apps: []


## Removing Non-English Apps

### Part One
If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:

In [13]:
print(ios[813][1])
print(ios[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.

We build this function below, using the built-in `ord()` function to find the number that corresponds to each character.

In [14]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


The function seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.

In [15]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


### Part Two
To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [16]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
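
The cutoff of three non-ASCII characters is a judgment call, so it can help to make it a parameter. The sketch below restates the same check with an adjustable threshold; the names `count_non_ascii`, `is_probably_english`, and `max_non_ascii` are assumptions introduced for illustration.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_probably_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# A threshold of 0 reproduces the strict check from Part One
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))
```

With the default threshold this behaves like `is_english()` above. It remains a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass.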


In [17]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1] # Attention: for ios the name is in the second position!
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

## Isolating the Free Apps

As stated in the introduction, we are only interested in the free apps, since our only source of revenue is in-app ads.

---
For the android data set we have the following information available:
* `app[6]`: 'Type' = 'Free' (for free apps). 
* `app[7]`: 'Price' = 0 (for the free apps). 

Now let's check how many apps fall into these categories.

In [18]:
count_free_apps = 0
for app in android_english:
    typ = app[6]
    if typ == 'Free':
        count_free_apps += 1
print('Number of free apps based on Type:', count_free_apps)

count_free_apps = 0
for app in android_english:
    price = app[7]
    if price == '0':
        count_free_apps += 1
print('Number of free apps based on Price:',count_free_apps)

Number of free apps based on Type: 8863
Number of free apps based on Price: 8864


Interestingly, the two counts differ by one, so there is one app whose 'Type' and 'Price' disagree. Let's isolate it:

In [19]:
wrong_price_label = []
for app in android_english:
    typ = app[6]
    price = app[7]
    if (typ == 'Free' and price != '0') or (typ != 'Free' and price == '0'):
        wrong_price_label.append(app)

print(wrong_price_label)

[['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']]


This app has 'Type' equal to NaN, probably because of a data entry error. In this case we rely on the price (`app[7]`) instead.

For iOS we only have the price column, so we do not need to perform this check. Finally, we isolate the free apps:

In [20]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = float(app[4])
    if price == 0:
        ios_final.append(app)
        
print('Number of free android apps: ', len(android_final))
print('Number of free ios apps: ', len(ios_final))

Number of free android apps:  8864
Number of free ios apps:  3222
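
The two data sets store prices differently: the Android cell compares against the string `'0'`, while the iOS cell converts `'0.0'` to a float. A single helper could cover both. This is a sketch under the assumption that paid Android prices carry a leading dollar sign (for example `'$4.99'`), which is not shown in the output above; `parse_price` and `is_free` are hypothetical names.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float.

    Assumes any currency marker is a leading '$'.
    """
    return float(raw.strip().lstrip('$'))

def is_free(raw):
    """True when the parsed price is exactly zero."""
    return parse_price(raw) == 0.0

for price in ['0', '0.0', '$4.99']:
    print(price, '->', is_free(price))
```

With such a helper, both loops above could share the same condition, `is_free(app[7])` for Android and `is_free(app[4])` for iOS.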


## Data Analysis (Genres)
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

Our validation strategy for an app idea comprises three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


Because the end goal is to publish on both stores, we need app profiles that succeed in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of <b>gamification</b>.
Let's begin the analysis by getting a sense of which genres are most common in each market. For this, we'll need to build frequency tables for a few columns in our data sets:

* <b>android</b>: columns `Genres` and `Category`
* <b>ios</b>: column `prime_genre`

We are going to create a function that builds a frequency table (absolute counts) for a data set, and a second function that converts those counts to percentages. The first function receives the data set and the index of the column we are interested in.

In addition, we create a function that displays the table as a sorted list of tuples.

In [21]:
def freq_table(dataset, index):
    frequency_dict = {}
    for row in dataset:
        column = row[index]
        if column in frequency_dict:
            frequency_dict[column] += 1
        else:
            frequency_dict[column] = 1
    return frequency_dict

# receives a frequency table and returns it as percentages
def freq_table_percent(frequency_dict):
    frequency_dict_percent = {}
    num_apps = sum(frequency_dict.values())
    for key in frequency_dict:
        frequency_dict_percent[key] = (frequency_dict[key] / num_apps) * 100
    return frequency_dict_percent


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table2 = freq_table_percent(table)
    
    table_display = []
    
    for (key1,v1), (key2,v2) in zip(table.items(), table2.items()):   
        key_val_as_tuple = (v1, key1, v2)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0], ',', '{0:.1f}%'.format(entry[2]))
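
For comparison, the standard library's `collections.Counter` can produce the same kind of table in a few lines. This is an alternative sketch, not the implementation used in this notebook; `freq_table_sorted` and the sample rows are hypothetical.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (value, count, percentage) tuples, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, n, n / total * 100) for value, n in counts.most_common()]

# Hypothetical rows in the form [name, category]
sample = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'FAMILY']]

table = freq_table_sorted(sample, 1)
for value, n, pct in table:
    print(value, ':', n, ',', '{0:.1f}%'.format(pct))
```

Entries with equal counts may come out in a different order than in `display_table()`, because `most_common()` keeps ties in first-seen order while `display_table()` sorts the tuples in reverse.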

In [59]:
# android: 'Category' column (index 1)
display_table(android_final, 1)

FAMILY : 1676 , 18.9%
GAME : 862 , 9.7%
TOOLS : 750 , 8.5%
BUSINESS : 407 , 4.6%
LIFESTYLE : 346 , 3.9%
PRODUCTIVITY : 345 , 3.9%
FINANCE : 328 , 3.7%
MEDICAL : 313 , 3.5%
SPORTS : 301 , 3.4%
PERSONALIZATION : 294 , 3.3%
COMMUNICATION : 287 , 3.2%
HEALTH_AND_FITNESS : 273 , 3.1%
PHOTOGRAPHY : 261 , 2.9%
NEWS_AND_MAGAZINES : 248 , 2.8%
SOCIAL : 236 , 2.7%
TRAVEL_AND_LOCAL : 207 , 2.3%
SHOPPING : 199 , 2.2%
BOOKS_AND_REFERENCE : 190 , 2.1%
DATING : 165 , 1.9%
VIDEO_PLAYERS : 159 , 1.8%
MAPS_AND_NAVIGATION : 124 , 1.4%
FOOD_AND_DRINK : 110 , 1.2%
EDUCATION : 103 , 1.2%
ENTERTAINMENT : 85 , 1.0%
LIBRARIES_AND_DEMO : 83 , 0.9%
AUTO_AND_VEHICLES : 82 , 0.9%
HOUSE_AND_HOME : 73 , 0.8%
WEATHER : 71 , 0.8%
EVENTS : 63 , 0.7%
PARENTING : 58 , 0.7%
ART_AND_DESIGN : 57 , 0.6%
COMICS : 55 , 0.6%
BEAUTY : 53 , 0.6%


In [60]:
# android: 'Genres' column (index 9)
display_table(android_final, 9)

Tools : 749 , 8.4%
Entertainment : 538 , 6.1%
Education : 474 , 5.3%
Business : 407 , 4.6%
Productivity : 345 , 3.9%
Lifestyle : 345 , 3.9%
Finance : 328 , 3.7%
Medical : 313 , 3.5%
Sports : 307 , 3.5%
Personalization : 294 , 3.3%
Communication : 287 , 3.2%
Action : 275 , 3.1%
Health & Fitness : 273 , 3.1%
Photography : 261 , 2.9%
News & Magazines : 248 , 2.8%
Social : 236 , 2.7%
Travel & Local : 206 , 2.3%
Shopping : 199 , 2.2%
Books & Reference : 190 , 2.1%
Simulation : 181 , 2.0%
Dating : 165 , 1.9%
Arcade : 164 , 1.9%
Video Players & Editors : 157 , 1.8%
Casual : 156 , 1.8%
Maps & Navigation : 124 , 1.4%
Food & Drink : 110 , 1.2%
Puzzle : 100 , 1.1%
Racing : 88 , 1.0%
Role Playing : 83 , 0.9%
Libraries & Demo : 83 , 0.9%
Auto & Vehicles : 82 , 0.9%
Strategy : 81 , 0.9%
House & Home : 73 , 0.8%
Weather : 71 , 0.8%
Events : 63 , 0.7%
Adventure : 60 , 0.7%
Comics : 54 , 0.6%
Beauty : 53 , 0.6%
Art & Design : 53 , 0.6%
Parenting : 44 , 0.5%
Card : 40 , 0.5%
Casino : 38 , 0.4%
Trivia : 

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.
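
One visible reason for the extra granularity is that a `Genres` value can combine several genres separated by a semicolon, as in 'Art & Design;Pretend Play' for the app 'Coloring book moana' shown earlier. The analysis continues with `Category`, so the following is only a sketch of how such combined values could be counted per individual genre; `genre_counts` is a hypothetical helper and the rows are trimmed to two columns.

```python
from collections import Counter

def genre_counts(dataset, index):
    """Count each individual genre, splitting combined values on ';'."""
    counts = Counter()
    for row in dataset:
        for genre in row[index].split(';'):
            counts[genre.strip()] += 1
    return counts

# Rows trimmed to [name, genres]; values taken from the output shown earlier
sample = [
    ['Coloring book moana', 'Art & Design;Pretend Play'],
    ['Sketch - Draw & Paint', 'Art & Design'],
]

counts = genre_counts(sample, 1)
print(counts)
```

Counting this way means an app with two genres contributes to two rows of the table, so the counts no longer sum to the number of apps.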

In [23]:
# ios: column 'prime_genre' (11)
display_table(ios_final, 11)

Games : 1874 , 58.2%
Entertainment : 254 , 7.9%
Photo & Video : 160 , 5.0%
Education : 118 , 3.7%
Social Networking : 106 , 3.3%
Shopping : 84 , 2.6%
Utilities : 81 , 2.5%
Sports : 69 , 2.1%
Music : 66 , 2.0%
Health & Fitness : 65 , 2.0%
Productivity : 56 , 1.7%
Lifestyle : 51 , 1.6%
News : 43 , 1.3%
Travel : 40 , 1.2%
Finance : 36 , 1.1%
Weather : 28 , 0.9%
Food & Drink : 26 , 0.8%
Reference : 18 , 0.6%
Business : 17 , 0.5%
Book : 14 , 0.4%
Navigation : 6 , 0.2%
Medical : 6 , 0.2%
Catalogs : 4 , 0.1%


For iOS, as we see above, the most frequent genre is Games, which accounts for more than half of the apps (58.2%), followed by Entertainment (7.9%) and Photo & Video (5.0%). We can say that the large majority of <b>free English</b> apps (the ones we are considering here) are used for fun and make use of <b>gamification</b>. This does not mean, however, that these apps have the highest number of users.

For the Android apps the picture is different: fewer than 10% (9.7%) of the apps fall in the GAME category. Note that on Android a single Category spans several Genres. The Android apps are much more evenly distributed, as no single Category dominates (the largest, FAMILY, accounts for 18.9%). It is also important to highlight that these frequency tables show which genres have the most apps, not which genres have the most users.

## Data Analysis (Most Popular)
Now we are going to analyze which genres are the most popular (have the most users). One way to do that is to calculate the average number of installs for each app genre. For the Android (Google Play) data set, we can find this information in the `Installs` column, but it is missing from the iOS (App Store) data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.


* <b>android</b>: column `Installs` [5]
* <b>ios</b>: column `rating_count_tot` [5]

In [46]:
frequency_dict_ios = freq_table(ios_final, 11)
genre_avg_users_tuple = []
for genre in frequency_dict_ios:
    total = 0
    len_genre = 0
    for row in ios_final:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    avg_number_users = total / len_genre
    genre_avg_users_tuple.append((avg_number_users, genre))

tuple_sorted = sorted(genre_avg_users_tuple, reverse = True)
for entry in tuple_sorted:
    print(entry[1], ':', '{0:.0f}'.format(entry[0]))

Navigation : 86090
Reference : 74942
Social Networking : 71548
Music : 57327
Weather : 52280
Book : 39758
Food & Drink : 33334
Finance : 31468
Photo & Video : 28442
Travel : 28244
Shopping : 26920
Health & Fitness : 23298
Sports : 23009
Games : 22789
News : 21248
Productivity : 21028
Utilities : 18684
Lifestyle : 16486
Entertainment : 14030
Business : 7491
Education : 7004
Catalogs : 4004
Medical : 612


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together.
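
The per-genre averages above were computed with a nested loop that rescans the whole data set once per genre. A single-pass variant, sketched here with hypothetical sample values and a hypothetical helper name `average_by_group`, accumulates a running total and count for each group instead.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of the value column} using one pass over the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows in the form [genre, rating_count]
sample = [
    ['Navigation', '300'],
    ['Navigation', '100'],
    ['Games', '50'],
]

averages = average_by_group(sample, 0, 1)
print(averages)
```

On the real data this would be called as `average_by_group(ios_final, 11, 5)`.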

For the Android apps we take a similar approach using the Installs column. However, by analyzing the data, we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [49]:
display_table(android_final, 5)

1,000,000+ : 1394 , 15.7%
100,000+ : 1024 , 11.6%
10,000,000+ : 935 , 10.5%
10,000+ : 904 , 10.2%
1,000+ : 744 , 8.4%
100+ : 613 , 6.9%
5,000,000+ : 605 , 6.8%
500,000+ : 493 , 5.6%
50,000+ : 423 , 4.8%
5,000+ : 400 , 4.5%
10+ : 314 , 3.5%
500+ : 288 , 3.2%
50,000,000+ : 204 , 2.3%
100,000,000+ : 189 , 2.1%
50+ : 170 , 1.9%
5+ : 70 , 0.8%
1+ : 45 , 0.5%
500,000,000+ : 24 , 0.3%
1,000,000,000+ : 20 , 0.2%
0+ : 4 , 0.0%
0 : 1 , 0.0%


To tackle this problem, we will treat an app with 100,000+ installs as having exactly 100,000 installs and perform the following steps:
* replace ',' with '' (an empty string)
* replace '+' with '' (an empty string)
* convert the value to float
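
The steps above can be sketched as a small helper. The name `parse_installs` is an assumption made for illustration; the notebook performs the same replacements inline.

```python
def parse_installs(raw):
    """Turn an install label such as '100,000+' into a float (100000.0).

    The '+' is dropped, so the result is a lower bound on the real count.
    """
    return float(raw.replace(',', '').replace('+', ''))

for label in ['100,000+', '1,000,000,000+', '0']:
    print(label, '->', parse_installs(label))
```

Because every value is rounded down to the start of its bucket, averages computed from these numbers underestimate the true install counts.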

In [57]:
frequency_dict_android = freq_table(android_final, 1)
category_avg_users_tuple = []
for category in frequency_dict_android:
    total = 0
    len_category = 0
    for row in android_final:
        category_app = row[1]
        #print(category)
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace(',','')
            n_installs = n_installs.replace('+','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_number_users = total / len_category
    category_avg_users_tuple.append((avg_number_users, category))

tuple_sorted = sorted(category_avg_users_tuple, reverse = True)
for entry in tuple_sorted:
    print(entry[1], ':', '{0:.0f}'.format(entry[0]))

COMMUNICATION : 38456119
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
PHOTOGRAPHY : 17840110
PRODUCTIVITY : 16787331
GAME : 15588016
TRAVEL_AND_LOCAL : 13984078
ENTERTAINMENT : 11640706
TOOLS : 10801391
NEWS_AND_MAGAZINES : 9549178
BOOKS_AND_REFERENCE : 8767812
SHOPPING : 7036877
PERSONALIZATION : 5201483
WEATHER : 5074486
HEALTH_AND_FITNESS : 4188822
MAPS_AND_NAVIGATION : 4056942
FAMILY : 3695642
SPORTS : 3638640
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924898
EDUCATION : 1833495
BUSINESS : 1712290
LIFESTYLE : 1437816
FINANCE : 1387692
HOUSE_AND_HOME : 1331541
DATING : 854029
COMICS : 817657
AUTO_AND_VEHICLES : 647318
LIBRARIES_AND_DEMO : 638504
PARENTING : 542604
BEAUTY : 513152
EVENTS : 253542
MEDICAL : 120551


On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs.




In [62]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess


If we removed all the communication apps that have 100 million installs or more, the average would be reduced roughly ten times:

In [63]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386

We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.
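
One way to see how much a few giants distort the picture is to compare the mean with the median, which is far less sensitive to extreme values. The numbers below are hypothetical and chosen only to illustrate the effect; they do not come from the data set.

```python
from statistics import mean, median

# Hypothetical install counts for one category: one giant and four small apps
installs = [1_000_000_000, 500_000, 100_000, 50_000, 10_000]

avg = mean(installs)
mid = median(installs)

print('Mean:  ', avg)
print('Median:', mid)
print('Mean is', round(avg / mid), 'times the median')
```

A large gap between the two values suggests that the average mostly reflects a handful of dominant apps rather than the typical app in the category.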

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,812. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [64]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The books and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [65]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+



However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [67]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
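
The cell above selects the middle tier by comparing against each install label as a string. The same selection could be expressed as a numeric range check, sketched here with a hypothetical helper `in_install_range` and three apps whose install labels appear in the outputs above.

```python
def parse_installs(raw):
    """Turn an install label such as '10,000,000+' into a float."""
    return float(raw.replace(',', '').replace('+', ''))

def in_install_range(raw, low, high):
    """True when low <= installs < high."""
    return low <= parse_installs(raw) < high

# (name, installs) pairs taken from the outputs shown earlier
sample = [
    ('Wikipedia', '10,000,000+'),
    ('Bible', '100,000,000+'),
    ('Litnet - E-books', '100,000+'),
]

mid_popularity = [name for name, installs in sample
                  if in_install_range(installs, 1_000_000, 100_000_000)]
print(mid_popularity)
```

A range check avoids listing every label by hand and keeps working if a bucket that was not anticipated appears in the data.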

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the book Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.