# Profitable App Profiles for the App Store and Google Play Markets

This project is going to investigate how the number of users impacts the revenue generated by an app.

The goal of this project is to gain insights from App Store and Google Play Store data to help better understand what types of apps attract more users.

## Opening and Exploring the Data Sets

The data used to gain insights into the relationship between engagement and revenue comes from two sources. The first is a [data set](https://www.kaggle.com/lava18/google-play-store-apps) comprising approximately 10,000 Android apps from the Google Play Store, collected in August 2018. The second is a [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) comprising just over 7,000 iOS apps from the App Store, collected in July 2017.

In [1]:
from csv import reader

openfile_apple = open('AppleStore.csv', encoding='utf8')
openfile_gplay = open('googleplaystore.csv', encoding='utf8')
readfile_apple = reader(openfile_apple)
readfile_gplay = reader(openfile_gplay)

gplay_data = list(readfile_gplay)
gplay_header = gplay_data[0]
gplay_data = gplay_data[1:]

apple_data = list(readfile_apple)
apple_header = apple_data[0]
apple_data = apple_data[1:]
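
The cell above leaves both file handles open. A minimal sketch of an alternative, assuming a hypothetical helper named `load_csv` (not part of the original notebook) that closes the file automatically and returns the header and the remaining rows separately:

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Hypothetical helper: return (header, rows) for a CSV file."""
    # newline='' is what the csv module documentation recommends when reading
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Assumed usage with the same file names as above:
# gplay_header, gplay_data = load_csv('googleplaystore.csv')
# apple_header, apple_data = load_csv('AppleStore.csv')
```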

To aid in the data analysis, a function `explore_data()` will be repeatedly used to visualise rows of the data sets more clearly.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

To begin with, the first 4 rows of each dataset, as well as the number of rows and columns, will be explored.

In [3]:
# Print header row
print(str(gplay_header) + '\n')
# Explore first 4 rows
explore_data(gplay_data, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
# Print header row
print(str(apple_header) + '\n')
# Explore first 4 rows
explore_data(apple_data, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


## Data Cleaning
### Deleting Incorrect Data

As can be read in the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play data, there is a wrong entry in row 10472 of the data. This data will be investigated below.

In [5]:
# Print gplay header row
print('Column names:\n' + str(gplay_header) + '\n')
# Print incorrect row
print('Incorrect data:\n' + str(gplay_data[10472]) + '\n')
# Print a correct row
print('Correct data:\n' + str(gplay_data[10473]))

Column names:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Incorrect data:
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

Correct data:
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


As can be seen above, some of the values in `gplay_data[10472]` do not line up with the column names in the header row. For example, it appears to have a rating of 19 when the maximum on the Google Play Store is 5. Upon further inspection, the row contains only 12 entries instead of 13: the `Category` value is missing, so every later value is shifted one column to the left. Therefore, this row will be deleted.
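
Rows like this one can also be found without relying on the discussion forum, by comparing each row's length with the header's. A minimal sketch on toy data, where `find_malformed_rows` is a hypothetical helper and the toy rows only stand in for `gplay_header` and `gplay_data`:

```python
def find_malformed_rows(header, rows):
    """Hypothetical helper: indices of rows whose length differs from the header."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Toy data standing in for gplay_header / gplay_data
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # 'Category' is missing
]
print(find_malformed_rows(toy_header, toy_rows))  # [1]
```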

In [6]:
del gplay_data[10472]

**POSSIBLY EXPLORE APPLE DATA DISCUSSION FOR INCORRECT DATA**

### Deleting Duplicate Data Entries

Further reading of the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play data reveals reports of duplicate entries. According to the posts, there are multiple entries for the same app, where the only differentiating factor is the review count. This will be investigated below using a popular app, `Twitter`.

In [7]:
print(str(gplay_header) + '\n')

for column in gplay_data:
    app_name = column[0]
    if app_name == 'Twitter':
        print(column)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'August 6, 2018', 'Varies with device', 'Varies with device']
['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972', 'Varies with device', '500,000,000+', 'Free', '0', 'Mature 17+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']


As can be seen above, the Google Play data contains 3 entries for the Twitter app. Two of them are identical, and the third differs only in its `Last Updated` date and number of reviews. Below, it will be determined how many duplicate entries exist in the Google Play data.

In [8]:
duplicate_apps = []
unique_apps = []

for column in gplay_data:
    app_name = column[0]
    
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of duplicate apps: ' + str(len(duplicate_apps)) + '\n')
print('Examples: ' + str(duplicate_apps[:10]))

Number of duplicate apps: 1181

Examples: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Therefore, there are 1181 duplicate entries that need to be deleted from the dataset. This can be done either by using the `Last Updated` column and keeping only the most recent entry, or by using the `Reviews` column (number of reviews) and keeping only the entry with the most reviews (as this will be the most recent).

As can be seen with the Twitter entries, two share the same date and review count, while the third has an earlier date and fewer reviews. Since a higher review count indicates more recent data, and a number is simpler to compare than a date string, duplicate entries will be deleted based on the number of reviews.
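
As an aside on the duplicate count: checking `app_name in unique_apps` scans a list on every iteration, which gets slower as the list grows. A sketch of the same count using `collections.Counter` on toy rows, where `count_duplicates` is a hypothetical helper rather than part of the notebook:

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Hypothetical helper: number of rows that repeat an earlier app name."""
    counts = Counter(row[name_index] for row in rows)
    # An app that appears n times contributes n - 1 duplicate entries
    return sum(n - 1 for n in counts.values())

# Toy rows standing in for gplay_data
toy_rows = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(toy_rows))  # 2
```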

To delete the duplicate entries, a dictionary will be created that contains each Google Play app with no duplicates based on the maximum number of reviews. This will be achieved by:

- Looping through every entry in the data set
- If the dictionary does not contain the datapoint, a new entry will be added to the dictionary
- If the dictionary already contains the datapoint, it will be replaced by the new datapoint ONLY if it has more reviews than the old one

In [9]:
reviews_max = {}

for column in gplay_data:
    app_name = column[0]
    n_reviews = float(column[3])
    
    if app_name in reviews_max and reviews_max[app_name] < n_reviews:
        reviews_max[app_name] = n_reviews
        
    elif app_name not in reviews_max:
        reviews_max[app_name] = n_reviews

To check that the dictionary `reviews_max` was created correctly, its length can be compared against the expected value. It was determined earlier that the Google Play data contains 1181 duplicate entries. Therefore, the expected number of unique entries is the length of the data minus 1181.

In [10]:
print('Expected: ' + str(len(gplay_data) - 1181))
print('Actual: ' + str(len(reviews_max)))

Expected: 9659
Actual: 9659


Therefore, `reviews_max` lines up with what was expected.

The duplicates will be removed from the data using the `reviews_max` as below.

In [11]:
gplay_clean = []
already_added = []

for column in gplay_data:
    app_name = column[0]
    n_reviews = float(column[3])
    
    if (reviews_max[app_name] == n_reviews) and (app_name not in already_added):
        gplay_clean.append(column)
        already_added.append(app_name)
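
The two loops above (building `reviews_max`, then filtering) could also be collapsed into a single pass by storing the best row per app directly. A minimal sketch on toy rows, assuming a hypothetical helper `dedupe_by_reviews`. Note that, unlike the cell above, the result is ordered by the first appearance of each app name, which may differ from the order of `gplay_clean`:

```python
def dedupe_by_reviews(rows, name_index=0, reviews_index=3):
    """Hypothetical helper: keep one row per app, the one with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        # Strict '>' keeps the first row when review counts tie,
        # which plays the same role as `already_added` above
        if name not in best or float(row[reviews_index]) > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows standing in for gplay_data (only the first four columns)
toy_rows = [
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403'],
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11667403'],
    ['Twitter', 'NEWS_AND_MAGAZINES', '4.3', '11657972'],
    ['Box', 'BUSINESS', '4.2', '159'],
]
print(len(dedupe_by_reviews(toy_rows)))  # 2
```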

To confirm that `gplay_clean` has been created correctly, it can be explored and its length checked against the expected value of `9659`:

In [12]:
print(str(gplay_header) + '\n')
explore_data(gplay_clean, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


### Removing Non-English Apps

This project is only concerned with apps aimed at an English-speaking audience. Therefore, non-English apps should be deleted from the dataset. The function below identifies them by:

- Looping through each character in a given string
- If there are more than 3 characters outside the ASCII range (code point above 127), the function returns `False` (non-English)
- Otherwise, the function returns `True` (English)

Some examples in the next cell show how this performs. The function is not perfect, as some apps could still be mislabelled, but it works well enough for this project.

In [13]:
def is_english(input_str):
    non_eng_count = 0
    
    for char in input_str:
        
        if ord(char) > 127:
            non_eng_count += 1
            
            if non_eng_count > 3:
                return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
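
To see why the threshold is set at 3 rather than 0, it can help to count the non-ASCII characters in each of the names tested above. A small sketch, where `count_non_ascii` is a hypothetical helper that is not used elsewhere in the notebook:

```python
def count_non_ascii(text):
    """Hypothetical helper: number of characters with a code point above 127."""
    return sum(1 for char in text if ord(char) > 127)

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in names:
    print(name, ':', count_non_ascii(name))
```

With a threshold of 0, the names containing `™` or an emoji would be discarded even though they belong to English apps.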


In [14]:
gplay_english = []
apple_english = []

for column in gplay_clean:
    app_name = column[0]
    if is_english(app_name):
        gplay_english.append(column)
        
for column in apple_data:
    app_name = column[1]
    if is_english(app_name):
        apple_english.append(column)
        
explore_data(gplay_english, 0, 3, True)
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

As can be seen above, after removing non-English apps from both datasets, there are 9614 Google Play Store apps and 6183 iOS apps.

### Isolating the Free Apps

As the project is only considering free apps on each app store, all paid apps need to be removed from the datasets. The cell below removes all non-free apps.

The remaining data is then explored. After all of the data cleaning is complete, there are 8864 entries left for Google Play apps and 3222 entries left for iOS apps.

In [15]:
gplay_free = []
apple_free = []

for column in gplay_english:
    price = column[7]
    if price == '0':
        gplay_free.append(column)
        
for column in apple_english:
    price = column[4]
    if price == '0.0':
        apple_free.append(column)
        
print(str(gplay_header) + '\n')
explore_data(gplay_free, 0, 3, True)

print(str(apple_header) + '\n')
explore_data(apple_free, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'F

## Data Analysis: Most Common Apps by Genre

To begin to understand which types of apps have the most engagement on each app store, it must first be determined which types of apps have the largest share of each market.

The functions below will compute this information.

- The `freq_table` function generates a frequency table for a given column in a dataset and returns a dictionary that shows how often each value occurs as a percentage of the total number of entries.
- The `display_table` function calls `freq_table` for a given dataset and column index, then prints the resulting percentages sorted in descending order.

In [16]:
def freq_table(dataset, index):
    
    freq_dict = {}
    count = 0
    
    for entry in dataset:
        count += 1
        value = entry[index]
        if value in freq_dict:
            freq_dict[value] += 1
        else:
            freq_dict[value] = 1
            
    table_percent = {}
    for key in freq_dict:
        table_percent[key] = (freq_dict[key] / count) * 100
        
    return table_percent

In [17]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
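
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. The function name `freq_table_counter` is hypothetical, and the toy rows only stand in for a real dataset such as `apple_free`:

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Hypothetical variant of freq_table built on collections.Counter."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Toy rows: (name, genre)
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(toy_rows, 1)

# Print the percentages, largest first
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```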

Firstly, the `prime_genre` column in the App Store data is analysed using the `display_table` function.

In [18]:
print("Apple Apps: prime_genre\n")
display_table(apple_free, 11)

Apple Apps: prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As can be seen above, with regards to free English apps on the App Store, the Games genre dominates with over 58% of the apps in the data. Entertainment comes second at around 8%, followed by Photo & Video at 5%, Education at 3.7% and Social Networking at 3.3%.

From this, it can be said that for the App Store (only with regards to free, English apps), the large majority of apps are for entertainment and fun. It must be said, however, that this metric only takes into account the number of apps per genre and not the number of users. Therefore, it indicates which types of apps are most commonly published, but it does not reveal the actual number of users per genre.

Next, the `Genres` and `Category` columns will be analysed using the `display_table` function. Both are being analysed as they appear, at face value, to represent very similar statistics.

In [19]:
print('Google Play Apps: Category\n')
display_table(gplay_free, 1)

Google Play Apps: Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.65433212

In [20]:
print('Google Play Apps: Genres\n')
display_table(gplay_free, 9)

Google Play Apps: Genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto 

Looking at the `Category` (index 1) and `Genres` (index 9) columns in the Google Play data, the `Category` column, with values such as `FAMILY` and `GAME`, appears to show a more 'high level' picture of the types of apps on the store, whereas the `Genres` column, with values such as `Action` and `Simulation`, appears to be too specific for the scope of this project. Therefore, the `Category` column will be focussed on for the Google Play data.

With regards to the free, English apps in the Google Play data, the `FAMILY` category tops the list with an 18.9% share, with `GAME` in second with a 9.7% share. Beyond this, however, there are many utility based categories such as `TOOLS` (8.5%), `BUSINESS` (4.6%) and `LIFESTYLE` (3.9%). This shows that the types of apps available are much more varied and evenly spread than on the App Store.

As said above, it is hard to gauge engagement by analysing the data from this perspective, as it does not show the number of users per category. However, it does provide useful insight into which categories are generally most common.

## Most Popular Apps by Genre on the App Store

The cell below prints the average number of ratings per app for each genre:

In [21]:
genres_apple = freq_table(apple_free, 11)
avg_ratings_table = []

for genre in genres_apple:
    total = 0
    len_genre = 0
    
    for entry in apple_free:
        genre_app = entry[11]
        
        if genre == genre_app:
            total += float(entry[5])
            len_genre += 1
    
    avg_no_ratings = total / len_genre
    avg_ratings_table.append((avg_no_ratings, genre))

avg_ratings_table = sorted(avg_ratings_table, reverse=True)

for entry in avg_ratings_table:
    print(entry[1], ':', entry[0])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


As can be seen above, `Navigation` tops the list with over 10,000 more average ratings than `Reference` in second place. `Social Networking` comes third with around 71,500, the `Music` and `Weather` genres both have over 50,000 average ratings, and the `Book` genre comes in 6th place with just under 40,000.

The function `genre_and_ratings` created below takes a genre as input and prints all the apps of said genre with the corresponding number of ratings. Each of the top 5 genres will be further investigated using this.
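
The nested loop used to build this table scans the whole dataset once per genre. A sketch of a single-pass alternative on toy rows, assuming a hypothetical helper `average_by_group`; with the real data the call would presumably be `average_by_group(apple_free, 11, 5)`:

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Hypothetical helper: mean of a numeric column per group, in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: (name, genre, number of ratings); the numbers are made up
toy_rows = [
    ['App A', 'Navigation', '300'],
    ['App B', 'Navigation', '100'],
    ['App C', 'Reference', '50'],
]
print(average_by_group(toy_rows, 1, 2))
```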

In [22]:
def genre_and_ratings(genre):
    for entry in apple_free:
    
        if entry[11] == genre:
            print(entry[1], ':', entry[5])

In [23]:
genre_and_ratings('Navigation')

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


After further investigating the `Navigation` genre, it can be concluded that although it has the largest average number of ratings, the figure is heavily skewed by the top 2 apps, Waze and Google Maps. These two apps alone have almost 500000 ratings.

Taking this into account, the `Navigation` genre does not have as many ratings as may appear at face value. If Waze and Google Maps were removed, the average number of ratings for the genre would dramatically reduce.
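
The size of this effect can be checked with a few lines of arithmetic, using the rating counts printed for the `Navigation` genre above:

```python
# Rating counts copied from the Navigation output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

avg_all = sum(navigation_ratings) / len(navigation_ratings)
rest = navigation_ratings[2:]  # drop Waze and Google Maps
avg_rest = sum(rest) / len(rest)

print(round(avg_all, 2))   # 86090.33
print(round(avg_rest, 2))  # 4146.25
```

Without the two largest apps, the average for the remaining four apps falls from roughly 86,000 to roughly 4,100 ratings.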

In [24]:
genre_and_ratings('Reference')

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Likewise with the `Reference` genre, the `Bible` app has almost 1 million ratings and skews the genre's data. However, after discounting the top app, the remaining top apps in the genre have a relatively even spread of ratings, with the next 7 apps each having over 10000 ratings.

As found earlier, the App Store is very saturated with more entertainment based apps, therefore it may be a good time to try to create a popular app in the reference genre.

In [25]:
genre_and_ratings('Social Networking')

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Looking at the `Social Networking` apps and their ratings above, it can be seen that the Apple dataset contains many free apps in this genre. As with the `Navigation` genre, the top 2 apps, `Facebook` and `Pinterest`, have a disproportionately large number of ratings, with `Facebook` having almost 3 million and `Pinterest` having over 1 million.

However, discounting these 2 apps at the top of the list, the `Social Networking` genre contains many apps with a lot of ratings (far more than any other genre). Therefore, it can either be concluded that the genre is over saturated with good apps, or that it is easier to create a popular app in the social networking genre. 

In [26]:
genre_and_ratings('Music')

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

The `Music` genre, as with the `Social Networking` genre, has many apps in the dataset. The most rated apps tend to be music streaming services, companion apps to music devices, or music based games.

Although there are many apps with a large number of ratings, a lot of them do very similar things. Therefore, the genre is saturated and would be less favourable to target.

In [27]:
genre_and_ratings('Weather')

The Weather Channel: Forecast, Radar & Alerts : 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648
WeatherBug - Local Weather, Radar, Maps, Alerts : 188583
MyRadar NOAA Weather Radar Forecast : 150158
AccuWeather - Weather for Life : 144214
Yahoo Weather : 112603
Weather Underground: Custom Forecast & Local Radar : 49192
NOAA Weather Radar - Weather Forecast & HD Radar : 45696
Weather Live Free - Weather Forecast & Alerts : 35702
Storm Radar : 22792
QuakeFeed Earthquake Map, Alerts, and News : 6081
Moji Weather - Free Weather Forecast : 2333
Hurricane by American Red Cross : 1158
Forecast Bar : 375
Hurricane Tracker WESH 2 Orlando, Central Florida : 203
FEMA : 128
iWeather - World weather forecast : 80
Weather - Radar - Storm with Morecast App : 78
Yurekuru Call : 53
Weather & Radar : 37
WRAL Weather Alert : 25
Météo-France : 24
JaxReady : 22
Freddy the Frogcaster's Weather Station : 14
Almanac Long-Range Weather Forecast : 12
TodayAir

The `Weather` genre again has many apps with a large number of ratings. However, users tend not to spend much time in weather apps, so the genre would likely be less profitable in terms of in-app ads when compared to other genres.

From the genres explored above, it can be concluded that either the `Reference` or `Social Networking` genres would be the most profitable. However, the edge must be given to `Reference` as it is far less saturated and not dominated by a few, extremely popular apps.

Next, a similar analysis will be performed on the Google Play data.

## Most Popular Apps by Genre on Google Play

Unlike the Apple data, the Google Play data has information about the number of installs for each app. The downside to this data, however, is that the figures are not exact: each app is instead put into a group based on its number of installs, such as 100,000+, 1,000,000+, etc. The cell below removes the '+' and ',' characters and converts each number of installs into a float. A table is then created that lists the average number of installs for each category.

This isn't as accurate as might be hoped however it is close enough to provide reliable insight.
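
The string conversion described above can be sketched in isolation as follows. The helper name `parse_installs` is hypothetical, and the value it returns is only the lower bound of the install group:

```python
def parse_installs(installs):
    """Hypothetical helper: turn a string such as '1,000,000+' into a float."""
    return float(installs.replace('+', '').replace(',', ''))

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('10,000+'))     # 10000.0
```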

In [28]:
categories_gplay = freq_table(gplay_free, 1)
avg_ratings_table_gplay = []  # holds (average installs, category) tuples

for category in categories_gplay:
    total = 0
    len_category = 0
    
    for entry in gplay_free:
        category_app = entry[1]
        
        if category_app == category:
            # Strip '+' and ',' so that e.g. '1,000,000+' becomes 1000000.0
            n_installs = entry[5]
            n_installs = n_installs.replace('+','').replace(',','')
            total += float(n_installs)
            len_category += 1
            
    avg_installs = total / len_category
    avg_ratings_table_gplay.append((avg_installs, category))
    
# Sort by average installs, largest first
avg_ratings_table_gplay = sorted(avg_ratings_table_gplay, reverse=True)
    
for entry in avg_ratings_table_gplay:
    print(entry[1], ':', entry[0])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

As can be seen above, `COMMUNICATION` tops the list with over 38 million average installs. `VIDEO_PLAYERS` comes second with just under 25 million, followed by `SOCIAL` with about 23 million, `PHOTOGRAPHY` with about 18 million and `PRODUCTIVITY` with nearly 17 million. There is a relatively even spread among the top categories, with all of the top 9 having over 10 million installs on average.

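The function `category_and_installs` below takes a category (the `genre` argument) and, optionally, an install bracket as input, and prints every app in that category with its corresponding number of installs. If an install bracket is given, only the apps in that bracket are printed. This function will be used to further explore each of the top categories.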

In [29]:
def category_and_installs(genre, no_installs=None):
    # Print every app in the given category; if no_installs is given,
    # print only the apps whose install bracket matches it exactly.
    for entry in gplay_free:
        if entry[1] == genre and (no_installs is None or entry[5] == no_installs):
            print(entry[0], ':', entry[5])

In [30]:
category_and_installs('COMMUNICATION')

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

As can be seen above, the `COMMUNICATION` category contains a large number of entries. A quick inspection shows that quite a few apps have far more installs than the 38 million average.
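Rather than scanning the printed list by eye, the apps in a category could be tallied per install bracket. The following is a minimal sketch using `collections.Counter`; the function name `installs_bracket_counts` and the toy rows are hypothetical, and the column positions (category at index 1, installs at index 5) are assumed to match the Google Play header shown earlier.

```python
from collections import Counter


def installs_bracket_counts(rows, category, category_index=1, installs_index=5):
    """Count how many apps of one category fall into each install bracket."""
    return Counter(
        row[installs_index] for row in rows if row[category_index] == category
    )


# Toy rows in the Google Play column layout (hypothetical apps and values)
toy_rows = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['App C', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['App D', 'SOCIAL', '', '', '', '100,000,000+'],
]

counts = installs_bracket_counts(toy_rows, 'COMMUNICATION')
for bracket, count in counts.most_common():
    print(bracket, ':', count)
```

Applied to `gplay_free`, a tally like this would show at a glance how many apps sit in each bracket for a given category.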

In [31]:
category_and_installs('COMMUNICATION', '1,000,000,000+')
print('\n')
category_and_installs('COMMUNICATION', '100,000,000+')

WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+


imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
Firefox Browser fast & private : 100,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
WeChat : 100,000,000+
Yahoo Mail – Stay Organized : 100,000,000+
BBM - Free Calls & Messages : 100,000,000+


Further inspection of the `COMMUNICATION` apps with the most installs shows that 6 apps are in the 1,000,000,000+ bracket and a further 16 apps are in the 100,000,000+ bracket.

It can therefore be concluded that this category has the highest average number of installs because it contains a small number of hugely popular messaging, email and browser apps. It would likely be difficult to create a profitable app in this category, as a new app would have to compete with these established giants.

Next, the `VIDEO_PLAYERS` category will be analysed.

In [32]:
category_and_installs('VIDEO_PLAYERS', '1,000,000,000+')
print('\n')
category_and_installs('VIDEO_PLAYERS', '100,000,000+')

YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+


Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+


This mirrors what was found with the `COMMUNICATION` category, although not to the same degree. The category is dominated by YouTube and Google Play Movies & TV, which both have over 1 billion installs, far higher than the average.

Next, the `SOCIAL` category will be analysed:

In [33]:
category_and_installs('SOCIAL', '1,000,000,000+')
print('\n')
category_and_installs('SOCIAL', '100,000,000+')

Facebook : 1,000,000,000+
Google+ : 1,000,000,000+
Instagram : 1,000,000,000+


Tumblr : 100,000,000+
Pinterest : 100,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


With regard to the `SOCIAL` category, the same conclusion can be drawn as with the Apple App Store: a minority of hugely popular apps heavily skews the average number of installs.

Next, the more productivity-based categories will be analysed. These include `PHOTOGRAPHY`, `PRODUCTIVITY`, `NEWS_AND_MAGAZINES` and `BOOKS_AND_REFERENCE`.

In [34]:
category_and_installs('PHOTOGRAPHY', '1,000,000,000+')
print('\n')
category_and_installs('PHOTOGRAPHY', '100,000,000+')

Google Photos : 1,000,000,000+


B612 - Beauty & Filter Camera : 100,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
Sweet Selfie - selfie camera, beauty cam, photo edit : 100,000,000+
Retrica : 100,000,000+
Photo Editor Pro : 100,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor : 100,000,000+
Photo Collage Editor : 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100,000,000+
Candy Camera - selfie, beauty camera, photo editor : 100,000,000+
YouCam Perfect - Selfie Photo Editor : 100,000,000+
Camera360: Selfie Photo Editor with Funny Sticker : 100,000,000+
S Photo Editor - Collage Maker , Photo Collage : 100,000,000+
AR effect : 100,000,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100,000,000+
LINE Camera - Photo editor : 100,000,000+
Photo Editor Collage Maker Pro : 100,000,000+


In [35]:
category_and_installs('PRODUCTIVITY', '1,000,000,000+')
print('\n')
category_and_installs('PRODUCTIVITY', '100,000,000+')

Google Drive : 1,000,000,000+


Microsoft Outlook : 100,000,000+
Microsoft OneDrive : 100,000,000+
Microsoft OneNote : 100,000,000+
Google Keep : 100,000,000+
ES File Explorer File Manager : 100,000,000+
Google Docs : 100,000,000+
Microsoft PowerPoint : 100,000,000+
Samsung Notes : 100,000,000+
SwiftKey Keyboard : 100,000,000+
Adobe Acrobat Reader : 100,000,000+
Google Sheets : 100,000,000+
Microsoft Excel : 100,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100,000,000+
Google Slides : 100,000,000+
ColorNote Notepad Notes : 100,000,000+
Evernote – Organizer, Planner for Notes & Memos : 100,000,000+
CamScanner - Phone PDF Creator : 100,000,000+


In [37]:
category_and_installs('NEWS_AND_MAGAZINES', '1,000,000,000+')
print('\n')
category_and_installs('NEWS_AND_MAGAZINES', '100,000,000+')

Google News : 1,000,000,000+




In [36]:
category_and_installs('BOOKS_AND_REFERENCE', '1,000,000,000+')
print('\n')
category_and_installs('BOOKS_AND_REFERENCE', '100,000,000+')

Google Play Books : 1,000,000,000+


Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


After analysing these categories, it can be concluded that the reading-based categories suffer least from huge apps influencing the average number of installs. This is especially evident in the `NEWS_AND_MAGAZINES` category, which has only one such app (Google News) in the install brackets inspected.
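The comparison between categories could also be expressed as a single number: the share of a category's installs that comes from its largest apps. This is only a sketch of one possible measure. The function name `giant_share`, the 100 million threshold, and the toy categories are all assumptions made for illustration.

```python
def giant_share(rows, category, threshold=100_000_000,
                category_index=1, installs_index=5):
    """Return the fraction of a category's installs from apps at or above threshold."""
    total = 0.0
    from_giants = 0.0
    for row in rows:
        if row[category_index] != category:
            continue
        installs = float(row[installs_index].replace('+', '').replace(',', ''))
        total += installs
        if installs >= threshold:
            from_giants += installs
    # Guard against an empty category to avoid dividing by zero
    return from_giants / total if total else 0.0


# Toy rows in the Google Play column layout (hypothetical categories and values)
toy_rows = [
    ['App A', 'CATEGORY_X', '', '', '', '100,000,000+'],
    ['App B', 'CATEGORY_X', '', '', '', '50,000,000+'],
    ['App C', 'CATEGORY_X', '', '', '', '50,000,000+'],
    ['App D', 'CATEGORY_Y', '', '', '', '1,000,000+'],
    ['App E', 'CATEGORY_Y', '', '', '', '1,000,000+'],
]

print(giant_share(toy_rows, 'CATEGORY_X'))  # 0.5
print(giant_share(toy_rows, 'CATEGORY_Y'))  # 0.0
```

A lower share would suggest that a category's average is less dependent on a handful of dominant apps.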

## Conclusion

Therefore, the best app profile recommendation for the Google Play Store, along with the App Store, would be to create an app in the books, reference or news and magazines categories, as there is likely a gap in the market for such an app to become popular.

Apps such as these would also be good targets in terms of profitability, since users tend to spend longer periods of time in them. Compared with other categories/genres, this would be expected to increase revenue from in-app ads.