# Profitable App Profiles for the App Store and Google Play Markets

### A guided project using dataquest.io

**Alastair Wilkins - 4 March 2019**

---

The aim of this project is to understand what makes an app profitable in the online market and enable developers to make data-driven decisions when building apps.

This analysis will focus on free apps that use advertising as their main revenue stream. The number of users downloading and engaging with an app will therefore influence the revenue it earns. The goal of this project is to analyse data to help developers understand what kinds of apps are likely to attract more users.

## Exploring the data

As of September 2018, there were approximately 2.1 million apps on the Google Play Store and approximately 2 million apps on the Apple App Store.

Due to time and funding restrictions, data on all 4 million apps is not available in a format that can be analysed, so sample data has been used instead.

The following two data sets have been used:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data on approximately ten thousand Android apps from the Google Play Store (collected in August 2018).
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data on approximately seven thousand iOS apps from the Apple App Store (collected in July 2017).

Firstly, let's open the data.

In [1]:
from csv import reader

# The encoding is set explicitly because some app names contain non-ASCII characters.
# Google Play data
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    play_apps = list(reader(opened_file))
play_header = play_apps[0]
play_apps = play_apps[1:]

# Apple App Store data
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios_apps = list(reader(opened_file))
ios_header = ios_apps[0]
ios_apps = ios_apps[1:]

A function has been created to allow for reusability when inspecting data points.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's explore the headers and two sample rows (the second and third apps) from each data set.

In [3]:
print('App Store:\n')
print(ios_header)
print('\n')
explore_data(ios_apps, 1, 3, True)
print('\n')

print('Play Store:\n')
print(play_header)
print('\n')
explore_data(play_apps, 1, 3, True)

App Store:

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


Play Store:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '8751

We can immediately see that from the App Store set, there are 7,197 rows and 16 columns, compared to 10,841 rows and 13 columns in the Play Store set.

The columns from the App Store that are likely to be of use in our analysis are:
`track_name, currency, price, rating_count_tot, rating_count_ver, prime_genre`.

Some of the column names for this data set are not immediately self-explanatory, but details of what each column represents can be found in the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) for the data set.

The columns from the Play Store that are likely to be of use in our analysis are: `App, Category, Rating, Reviews, Installs, Type, Price, Genres`.

## Cleaning the Data

Before we can begin analysing, the data must be as accurate as possible and reflect our target audience of English-speaking, **free** app users.
Non-English apps will need to be removed, along with paid apps and any duplicated or inaccurate data.

The data set for the Google Play data has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), where we can see one of the discussions describes an error for a certain row.

The offending row is number `10473`. This is row `10472` in our data set, as the header row has been removed.

In [4]:
print(play_apps[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


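The third column in our Play Store set represents the rating of an app. In this row, the value in that position is `19`, which must be an error, as ratings on the Play Store cannot exceed `5`. The row holds only 12 values against 13 column names: the `Category` value appears to be missing, which shifts every later value one column to the left.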

Before it is removed, let's check to ensure there aren't any other apps falling foul of this rating rule:

In [5]:
for app in play_apps:
    if float(app[2]) > 5:
        print(app)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The above code has confirmed the app we knew about is the only instance where this rule has been broken.
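
A row with a missing value is also shorter than the header, so a length check would catch this kind of error without relying on the rating. The following is a minimal sketch on toy rows; `find_malformed_rows` and the sample data are illustrative assumptions and are not part of the original notebook.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical): the second row is missing its category value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '19'],
    ['App C', 'GAME', '3.9'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)
```

Applied to the real data, the call would presumably be `find_malformed_rows(play_apps, play_header)`.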

Let's delete this from our data set.

In [6]:
del play_apps[10472]
# Run this cell only once: running it again would delete a different, valid row.

The [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play data set also indicates that it contains many duplicate entries.

The data must be cleansed of these duplicates.
For example, Facebook has multiple entries.

In [7]:
for app in play_apps:
    if app[0] == "Facebook":
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


A criterion is needed to decide which of the duplicates to keep.
The fourth column in the data set holds the number of reviews left on the app. The entry with the highest number of reviews is most likely the most recent, so that is the row that will be kept.

Let's work out how many duplicates we need to deal with.

In [8]:
duplicates = []
unique = []

for app in play_apps:
    name = app[0]
    if name in unique:
        duplicates.append(name)
    else:
        unique.append(name)
        
print('Quantity of duplicate apps: ' + str(len(duplicates)))
print('\n')
print('10 duplicates from end of list: ', duplicates[-10:])

Quantity of duplicate apps: 1181


10 duplicates from end of list:  ['Garena Free Fire', 'osmino Wi-Fi: free WiFi', 'Fun Kid Racing - Motocross', 'Podcast App: Free & Offline Podcasts by Player FM', 'Motorola FM Radio', 'FarmersOnly Dating', 'Firefox Focus: The privacy browser', 'FP Notebook', 'Slickdeals: Coupons & Shopping', 'AAFP']


The duplicates must now be removed from the data set.
The data set contains 10,840 apps, so after removing the 1,181 duplicates we expect 9,659 apps to remain (10,840 - 1,181 = 9,659). This is confirmed below.

The code below runs the following algorithm:

- Create an empty dictionary `reviews_max` and two empty lists, `play_apps_clean` and `play_apps_existing`
- Iterate through all apps in the play store data set
    - Assign the name of the app to a variable
    - Assign the number of reviews to a variable
    - Check whether the app exists in our `reviews_max` dictionary. We will be using this dictionary to store the highest value of reviews for each app.
    - If the app exists in our dictionary, check whether the current iteration value of reviews is greater than what is in our dictionary.
    - If both conditions above are met, the dictionary is overwritten with the current iteration value, as it must be higher.
    - If the current iteration app is not in the dictionary, we will simply add it, with a value of the current iteration reviews.
    
Next, we must create a new data set that strips out the duplicates.

- Using the two empty lists we created above, `play_apps_clean` and `play_apps_existing` we can:
    - Loop through all apps in the play store data set
    - Assign name and number of reviews to variables
    - Check whether the current iteration's number of reviews matches the highest value stored in the dictionary for that app.
    - Also check we haven't already dealt with this app
    - If both conditions are met, we will add this app to our clean data set.
    - Any future iterations that have the same name will be ignored.
    
Lastly, the number of apps in the cleaned list is checked against the expected figure.

In [9]:
reviews_max = {}
play_apps_clean = []
play_apps_existing = []

for app in play_apps:
    name = app[0]
    reviews = float(app[3])
    
    if (name in reviews_max) and (reviews_max[name] < reviews):
        reviews_max[name] = reviews #Overwrite with higher value
    elif name not in reviews_max:
        reviews_max[name] = reviews
        
for app in play_apps:
    name = app[0]
    reviews = float(app[3])
    
    if (reviews == reviews_max[name]) and (name not in play_apps_existing):
        play_apps_clean.append(app)
        play_apps_existing.append(name)
        
print('Number of unique apps: ' + str(len(play_apps_clean)))

Number of unique apps: 9659
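
For comparison, the same "keep the row with the most reviews" rule can be sketched in a single pass by storing the best row per name in a dictionary. This is an alternative to the two-pass approach above, not the method used in this notebook; `keep_most_reviewed` and the toy rows are illustrative assumptions. Note that the order of the resulting rows may differ from `play_apps_clean`.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        # Strict '>' means the earliest row wins when review counts are tied.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Toy rows (hypothetical, trimmed to four columns): App, Category, Rating, Reviews
toy_rows = [
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Other App', 'TOOLS', '4.0', '10'],
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
    ['Other App', 'TOOLS', '4.0', '12'],
]

deduplicated = keep_most_reviewed(toy_rows)
for row in deduplicated:
    print(row)
```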


## Removing non-English apps

Now that the duplicates have been dealt with, we need to cater for our audience: English-speaking users.

Exploring the data reveals apps whose names suggest they are aimed at a non-English-speaking audience.

In [10]:
print('- ' + ios_apps[813][1])
print('- ' + ios_apps[6731][1])
print('\n')
print('- ' + play_apps_clean[4412][0])
print('- ' + play_apps_clean[7940][0])

- 爱奇艺PPS -《欢乐颂2》电视剧热播
- 【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


- 中国語 AQリスニング
- لعبة تقدر تربح DZ


These apps do not suit our audience, therefore we will remove them from the data set.

One approach is to remove every app whose name contains symbols not commonly found in English text.
That is, a name made up of anything other than letters of the English alphabet, the digits 0-9 and standard symbols (e.g. + - /) marks the app for removal.

Each character in a string has an associated number, which we can obtain with the `ord()` function that is [built into Python](https://docs.python.org/3/library/functions.html#ord). The `ord()` function returns the Unicode code point of the character.

In [11]:
print(ord('a'))
print(ord('A'))
print(ord('z'))
print(ord('8'))
print(ord('='))
print(ord('龙'))

97
65
122
56
61
40857


According to ASCII, the values for characters that are commonly used in English text are in the range 0 to 127.
Based on this rule, we can start to filter out our data to only contain relevant apps.

In [12]:
def checkEnglishChars(string):
    for char in string:
        if ord(char) > 127:
            return False
        
    return True

# Testing data:
print(checkEnglishChars('Instagram')) # Should return true
print(checkEnglishChars('爱奇艺PPS -《欢乐颂2》电视剧热播')) # Should return false
print(checkEnglishChars('Xero')) # Should return true
print(checkEnglishChars('Instachat 😜')) # Should return true

True
False
True
False


The test seems to have worked for every entry except the last.
Emoji characters fall outside our 0-127 range, so the function treats the name as non-English.

As emoji are becoming increasingly common in app names, excluding these apps from the data set would reduce the accuracy of our findings.

Extra rules must be put in place to minimise errors like this.

Rather than exclude an app as soon as one such character is found, we will only exclude it if its name contains 4 or more characters outside the ASCII range.
The filter is still not perfect, but it should be more effective than the version above.

In [13]:
def checkEnglishChars(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1
        
    if count >= 4:
        return False
    
    return True

# Testing the new rule of 4 or more characters.
print(checkEnglishChars('Docs To Go™ Free Office Suite')) # Should return true
print(checkEnglishChars('Instachat 😜')) # Should return true
print(checkEnglishChars('爱奇艺PPS -《欢乐颂2》电视剧热播')) # Should return false

True
True
False
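
To make the filter's imperfection concrete, here is a small sketch of two edge cases under the same threshold: an English name with four emoji is rejected, while a short name with three non-ASCII characters passes. The helper names and sample strings are hypothetical and only restate the rule used above.

```python
def count_non_ascii(name):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for char in name if ord(char) > 127)


def passes_threshold(name, max_non_ascii=3):
    """Same rule as above: reject names with 4 or more non-ASCII characters."""
    return count_non_ascii(name) <= max_non_ascii


# Hypothetical names chosen to probe the edges of the rule.
samples = ['Instachat 😜', 'Fun 😜😜😜😜', '中国語 AQ']
results = [passes_threshold(name) for name in samples]

for name, result in zip(samples, results):
    print(name, '->', result)
```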


Now we will create new lists of apps for both Google Play Store and Apple App Store that only include English-audience apps.

In [14]:
english_play_apps = []
english_ios_apps = []

for app in play_apps_clean:
    name = app[0]
    if checkEnglishChars(name):
        english_play_apps.append(app)
        # Only append the ones that pass the English test.
        # Do nothing with apps that fail.
        
for app in ios_apps:
    name = app[1]
    if checkEnglishChars(name):
        english_ios_apps.append(app)

explore_data(english_play_apps, 10, 12, True) # Pick out some random apps to showcase
print('\n')
explore_data(english_ios_apps, 400, 402, True)

['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


['Tattoo Name On My Photo Editor', 'ART_AND_DESIGN', '4.2', '44829', '20M', '10,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'April 2, 2018', '3.8', '4.1 and up']


Number of rows: 9614
Number of columns: 13


['700970012', 'Panda Pop', '263766016', 'USD', '0.0', '41214', '88', '4.5', '4.5', '5.5.101', '4+', 'Games', '40', '5', '1', '1']


['520777858', 'The Sandbox - Building & Crafting a Pixel World!', '171482112', 'USD', '0.0', '41108', '258', '4.5', '4.5', '2.0', '4+', 'Games', '38', '5', '45', '1']


Number of rows: 6183
Number of columns: 16


We have now narrowed down our data sets even further, to 9,614 rows (Play Store) and 6,183 rows (App Store).

The final stage of the process will be to extract only the **free** apps within the remaining lists.

In [24]:
play_store_final = []
ios_final = []

# Play store - Price = 7
for app in english_play_apps:
    if app[7] == "0":
        play_store_final.append(app)
        
# App store - Price = 4
for app in english_ios_apps:
    if app[4] == "0.0":
        ios_final.append(app)

explore_data(play_store_final, 0, 2, True)
print('\n')
explore_data(ios_final, 0, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 3222
Number of columns: 16


We are now down to the final data sets, with 8,864 apps left from Play Store and 3,222 left from App Store.
This totals circa 12k apps, which should be enough for our data analysis.
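
The exact string comparisons used above (`"0"` and `"0.0"`) work for these two files. A slightly more defensive sketch converts the price to a number first; it assumes that paid prices may carry a leading `$`, and `is_free` is a hypothetical helper rather than part of the original notebook.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' equals zero."""
    cleaned = price_text.strip().lstrip('$')  # assumption: prices may start with '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # values that cannot be parsed are treated as not free


for text in ['0', '0.0', '$0.00', '$4.99', 'Everyone']:
    print(text, '->', is_free(text))
```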

## Analysis

As mentioned in the introduction, the aim is to determine which apps are likely to attract more users. Free apps that rely on advertising for revenue need as wide a reach as possible, since a larger audience means more clicks on and engagement with the adverts.

Many successful apps are published on both the iOS App Store and the Google Play Store, so that a wide range of users can download them. We will therefore look for app profiles that are successful in both markets.

To begin with, we'll use some basic frequency tables to get an idea of genre spread in the data sets.

In [34]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

def freq_table(dataset, index):
    table = {}
    table_percentages = {}
    count = 0
    
    for app in dataset:
        count += 1
        
        if app[index] not in table:
            table[app[index]] = 1
        else:
            table[app[index]] += 1
            
    for key in table:
        percentage = round((table[key] / count) * 100, 2)
        table_percentages[key] = percentage
        
    return table_percentages  
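
As an aside, the same percentage table can be sketched with `collections.Counter` from the standard library, which also returns the entries sorted by frequency. `freq_table_counter` and the toy rows are illustrative assumptions; the rest of the notebook keeps using `freq_table` and `display_table` as defined above.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: round(n / total * 100, 2) for key, n in counts.most_common()}


# Toy rows (hypothetical): name, genre
toy_apps = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
toy_table = freq_table_counter(toy_apps, 1)
print(toy_table)
```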

## Most Common Apps by Genre
Let's begin by analysing the `prime_genre` column from the App Store data set:

In [35]:
display_table(ios_final, 11)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


From these results, over half of the apps in our data set (58.16%) are games.
The next largest genre is Entertainment, followed by Photo & Video, Education and Social Networking.

The results show that the free English apps in our App Store sample are dominated by apps designed for fun (e.g. gaming, entertainment and social networking), whereas apps designed for practical purposes (e.g. navigation, finance, news and productivity) are rarer.

It is important to note, however, that although fun apps are the most numerous, this does not necessarily mean they have the most users. There may be more supply than demand.

Next we'll take a look at the `Genres` and `Category` columns of the Play Store data set, starting with `Genres`:

In [36]:
display_table(play_store_final, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

From these results, we can see that, unlike in the iOS App Store, the most common genres in the Play Store are practical ones: for example Tools at 8.45% of the set, Education at 5.35% and Business at 4.59%, followed by others such as Productivity, Lifestyle and Finance.

It is important to recognise, however, that an app in the Play Store can be tagged with more than one genre (e.g. `Casual;Pretend Play`), which creates many small groups to analyse and makes this a more granular set of data.

Instead, it may be better to look at the `Category` column from the Play Store data:

In [37]:
display_table(play_store_final, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


The `Category` column gives a clearer overall picture, as each app has exactly one category assigned.

From this data, we see that, as in the iOS App Store, fun apps make up the largest groups, for example Family at 18.9% and Game at 9.7%.

However, the Play Store contains a noticeably larger share of practical apps than the App Store.
Tools, Business, Lifestyle and Productivity are all among the top categories in the data set.

At this stage, we have found that the App Store is dominated by apps for fun purposes, whereas the Play Store seems to have more of a balance between apps for fun and apps with practical uses.

The next stage will be to assess which kinds of apps yield the most users.

## Most Popular Apps by Genre - App Store (iOS)

Unfortunately, our App Store data set does not contain anything that directly records the number of installs an app has had.

As a rough proxy, we will use the total number of user ratings instead, taken from the `rating_count_tot` column.

In [51]:
genres_table = freq_table(ios_final, 11)

for genre in genres_table:
    total = 0
    len_genre = 0
    
    for app in ios_final:
        app_genre = app[11]
        if app_genre == genre:
            ratings = float(app[5])
            total += ratings
            len_genre += 1
        
    avg_n_ratings = total / len_genre  # average number of ratings per app in this genre
    
    print(genre, ': ', avg_n_ratings)

Photo & Video :  28441.54375
Navigation :  86090.33333333333
Food & Drink :  33333.92307692308
Reference :  74942.11111111111
Productivity :  21028.410714285714
Sports :  23008.898550724636
Travel :  28243.8
Entertainment :  14029.830708661417
News :  21248.023255813954
Shopping :  26919.690476190477
Catalogs :  4004.0
Business :  7491.117647058823
Finance :  31467.944444444445
Music :  57326.530303030304
Games :  22788.6696905016
Social Networking :  71548.34905660378
Book :  39758.5
Medical :  612.0
Health & Fitness :  23298.015384615384
Lifestyle :  16485.764705882353
Utilities :  18684.456790123455
Weather :  52279.892857142855
Education :  7003.983050847458


On average, navigation apps have the highest number of user ratings, but this is largely driven by Waze and Google Maps, which together have nearly half a million ratings:

In [57]:
for app in ios_final:
    if app[11] == "Navigation":
        print(app[1], ': ', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic :  345046
Google Maps - Navigation & Transit :  154911
Geocaching® :  12811
CoPilot GPS – Car Navigation & Offline Maps :  3582
ImmobilienScout24: Real Estate Search in Germany :  187
Railway Route Search :  5


The same applies to the Social Networking genre, where giants such as Facebook account for a large share of the ratings, and to Music, where Spotify and Shazam heavily influence the numbers.

We could omit these large players from the data set to get a more realistic idea of app popularity by genre, but this report will not go into that depth.
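
For readers who want to try this, here is a minimal sketch of how such an exclusion might look, using the six Navigation apps printed above. The function name, the trimmed three-column rows and the cut-off of 100,000 ratings are all assumptions made for illustration; the original analysis does not apply any cut-off.

```python
from collections import defaultdict


def average_ratings_by_genre(apps, genre_idx, ratings_idx, max_ratings=None):
    """Average number of ratings per genre, optionally skipping very large apps."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for app in apps:
        n_ratings = float(app[ratings_idx])
        if max_ratings is not None and n_ratings > max_ratings:
            continue  # leave out apps above the (assumed) cut-off
        totals[app[genre_idx]] += n_ratings
        counts[app[genre_idx]] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Rating counts copied from the Navigation output above: name, genre, rating count
navigation_apps = [
    ['Waze - GPS Navigation, Maps & Real-time Traffic', 'Navigation', '345046'],
    ['Google Maps - Navigation & Transit', 'Navigation', '154911'],
    ['Geocaching®', 'Navigation', '12811'],
    ['CoPilot GPS – Car Navigation & Offline Maps', 'Navigation', '3582'],
    ['ImmobilienScout24: Real Estate Search in Germany', 'Navigation', '187'],
    ['Railway Route Search', 'Navigation', '5'],
]

all_apps_avg = average_ratings_by_genre(navigation_apps, 1, 2)
trimmed_avg = average_ratings_by_genre(navigation_apps, 1, 2, max_ratings=100000)
print(all_apps_avg['Navigation'])
print(trimmed_avg['Navigation'])
```

With the two largest apps left out, the Navigation average falls from roughly 86,000 to roughly 4,100 ratings per app, which shows how strongly the headline figure depends on them.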

An ideal approach to an app based on this data would be something that is fun and engaging, but also incorporates something useful. For example we could create a music app that lets you read about a song, look up any words you don't understand and share this data with your friends on social networks, almost like a dictionary for music. (For example, the [Genius](https://genius.com/) music service fits this description).

Lastly, let's take a look at the Play Store data:

## Most Popular Apps by Genre - Play Store (Android)

Luckily, the Play Store data does include the number of installs per app.
There is one caveat, however: the figures are grouped into open-ended intervals (e.g. 100,000+), so the exact number of installs is not known.

We will take the numbers as they are, e.g. treating 100,000+ as 100,000 installs.

To perform calculations on this data, we need to remove the commas and '+' signs so that Python can convert each string to a number.

In [60]:
category_table = freq_table(play_store_final, 1)

for category in category_table:
    total = 0
    len_category = 0
    
    for app in play_store_final:
        if app[1] == category:
            installs = app[5]
            installs = installs.replace(',', '')
            installs = installs.replace('+', '')
            total += float(installs)
            len_category += 1
            
    avg_installs = total / len_category
    print(category, ': ', avg_installs)

ENTERTAINMENT :  11640705.88235294
HOUSE_AND_HOME :  1331540.5616438356
LIFESTYLE :  1437816.2687861272
HEALTH_AND_FITNESS :  4188821.9853479853
GAME :  15588015.603248259
EDUCATION :  1833495.145631068
PARENTING :  542603.6206896552
DATING :  854028.8303030303
SHOPPING :  7036877.311557789
COMICS :  817657.2727272727
MEDICAL :  120550.61980830671
COMMUNICATION :  38456119.167247385
MAPS_AND_NAVIGATION :  4056941.7741935486
AUTO_AND_VEHICLES :  647317.8170731707
VIDEO_PLAYERS :  24727872.452830188
LIBRARIES_AND_DEMO :  638503.734939759
PRODUCTIVITY :  16787331.344927534
NEWS_AND_MAGAZINES :  9549178.467741935
FOOD_AND_DRINK :  1924897.7363636363
ART_AND_DESIGN :  1986335.0877192982
BEAUTY :  513151.88679245283
BUSINESS :  1712290.1474201474
EVENTS :  253542.22222222222
FAMILY :  3695641.8198090694
SPORTS :  3638640.1428571427
TOOLS :  10801391.298666667
FINANCE :  1387692.475609756
SOCIAL :  23253652.127118643
BOOKS_AND_REFERENCE :  8767811.894736841
WEATHER :  5074486.197183099
PHOTOG

From this analysis, communication apps have, on average, the highest number of installs.
However, this is heavily influenced by a few very large apps such as WhatsApp, Skype and Facebook Messenger.
Let's take a quick look at the communication apps in the 100m+, 500m+ and 1bn+ install brackets:

In [62]:
for app in play_store_final:
    if app[1] == "COMMUNICATION" and (app[5] == "1,000,000,000+"
                                      or app[5] == "500,000,000+"
                                      or app[5] == '100,000,000+'):
        print(app[0], ': ', app[5])

WhatsApp Messenger :  1,000,000,000+
imo beta free calls and text :  100,000,000+
Android Messages :  100,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
Messenger – Text and Video Chat for Free :  1,000,000,000+
imo free video calls and chat :  500,000,000+
Skype - free IM & video calls :  1,000,000,000+
Who :  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji :  100,000,000+
LINE: Free Calls & Messages :  500,000,000+
Google Chrome: Fast & Secure :  1,000,000,000+
Firefox Browser fast & private :  100,000,000+
UC Browser - Fast Download Private & Secure :  500,000,000+
Gmail :  1,000,000,000+
Hangouts :  1,000,000,000+
Messenger Lite: Free Calls & Messages :  100,000,000+
Kik :  100,000,000+
KakaoTalk: Free Calls & Text :  100,000,000+
Opera Mini - fast web browser :  100,000,000+
Opera Browser: Fast and Secure :  100,000,000+
Telegram :  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer :  100,000,000+
UC Browser Mini -Tiny Fast Private & Secure :  

The main concern when applying these results to app development is that these giants will be extremely difficult to compete with, so this category should not be relied on heavily when deciding what to build.
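
One way to see how much a few giants distort an average is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses made-up install strings in the same format as the `Installs` column; `parse_installs` is a hypothetical helper, and the median is not used anywhere in the original analysis.

```python
from statistics import mean, median


def parse_installs(text):
    """Convert a string such as '1,000,000+' to an integer lower bound."""
    return int(text.replace(',', '').replace('+', ''))


# Hypothetical category: two giants and three small apps.
install_strings = ['1,000,000,000+', '500,000,000+', '100,000+', '50,000+', '10,000+']
install_values = [parse_installs(text) for text in install_strings]

print('mean:  ', mean(install_values))
print('median:', median(install_values))
```

In this toy category the mean is about 300 million installs while the median is 100,000, so the typical app is far smaller than the average suggests.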

## Conclusion

Throughout this project we analysed data sets from the App Store and the Play Store to try to understand which genre would be the best choice for a profitable app.

The data suggests that a book-type app with social and practical elements would be a good fit for the market, for example an app that lets readers digest pages or chapters and share their analysis with friends.