# Analysing app data from the App Store and Google Play to extract the most profitable insights.

Our aim is to find out what types of apps are likely to attract the most users. We are a team of data analysts working at a company that builds Android and iOS mobile apps, which we publish on Google Play and the App Store.

The apps we build are free to download and install, so our main source of revenue is in-app ads. The more users an app has, the more people see and engage with its ads, and the more revenue that app generates. The goal of this project is to analyse data and provide insights that help our developers understand what types of apps attract the most users and are therefore likely to be most profitable.

# Opening and exploring the datasets

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

As collecting data for over 4 million apps would require a significant investment of time and money, we will instead analyse a sample of the data. To avoid spending resources on data collection, we will use two existing datasets that are suitable for our goals:

* [A dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. This data can be directly downloaded from [here.](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
* [A dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. This data can be directly downloaded from [here.](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

To start, we will open and explore the datasets. The same steps apply to both files, so they are described together. In the code below we will:

* Import `reader()` from the `csv` module.
* Open the `AppleStore.csv` and `googleplaystore.csv` files using the `open()` function, and assign the outputs to the variables `opened_file1` and `opened_file2` respectively.
* Read in the opened files using the `reader()` function, and assign the outputs to the variables `read_file1` and `read_file2` respectively.
* Convert each read-in file to a list of lists using `list()`, and save the outputs to the variables `apple_data` and `gps_data` respectively.
* Assign the first row of each dataset, which contains the column headers, to the variables `apple_header` and `gps_header` respectively.
* Assign the remaining rows, without the header row, back to the variables `apple_data` and `gps_data`.

In [1]:
from csv import reader

# Dataset from the Google Play Store
# (explicit encoding, since app names contain non-ASCII characters)
opened_file2 = open('googleplaystore.csv', encoding='utf8')
read_file2 = reader(opened_file2)
gps_data = list(read_file2)
gps_header = gps_data[0]
gps_data = gps_data[1:]

# Dataset from the App Store
# (explicit encoding, since app names contain non-ASCII characters)
opened_file1 = open('AppleStore.csv', encoding='utf8')
read_file1 = reader(opened_file1)
apple_data = list(read_file1)
apple_header = apple_data[0]
apple_data = apple_data[1:]
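
The loading steps above are the same for both files, so they could also be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that is not used elsewhere in this project, and the demo writes a tiny temporary file so that it runs without the real datasets. The `with` statement closes the file automatically once the rows have been read.

```python
import csv
import os
import tempfile

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline="" is the setting the csv module documentation recommends for reader input
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Self-contained demonstration using a small temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        f.write("App,Category\nSketch,ART_AND_DESIGN\nInstagram,SOCIAL\n")
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(len(demo_rows))
```

With the real files, the equivalent call would be `load_csv('AppleStore.csv')`, assuming the file sits in the working directory.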

To make exploring the data easier, we will define a function named `explore_data()`. In the code below we will:

* Define the `explore_data` function to take four parameters:
    * `dataset` - the data opened previously, stored as a list of lists
    * `start` and `end` - integers that mark the start and end of the slice
    * `rows_and_columns` - a boolean that defaults to `False`
* Slice the dataset using `dataset[start:end]` and assign the output to the variable `dataset_slice`.
* Loop through `dataset_slice`, printing each row with `print(row)` and adding blank space after it with `print('\n')`.
* Print the number of rows and columns if `rows_and_columns` is `True`.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(apple_header)
print('\n')
explore_data(apple_data, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


In the output above we can see that the App Store dataset contains 7197 iOS apps and 16 columns. The columns that will be most useful for our analysis are `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, `cont_rating` and `prime_genre`. The column names are not particularly self-explanatory, so see the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for their descriptions.

Now let's take a look at the Google Play dataset.

In [4]:
print(gps_header)
print('\n')
explore_data(gps_data, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In the output above we can see that the Google Play dataset contains 10841 Android apps and 13 columns. The columns that will be most useful for our analysis are `App`, `Category`, `Rating`, `Installs`, `Type`, `Price`, `Content Rating` and `Genres`.

# Deleting Wrong Data

In the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play dataset, we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101) describes an error for row 10472. Below we will print the row and compare it to a correct one and the header row.

In [5]:
print(gps_header) 
print('\n')
print(gps_data[10472])
print('\n')
print(gps_data[11])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


We can see that row 10472 has incorrect values: `Category` is `1.9` and `Rating` is `19`. The value `1.9` is numerical, so it cannot be an app category and most likely belongs in the `Rating` column. The value `19` is too high for the `Rating` column, as ratings on the Google Play Store are capped at 5. It appears that this row is missing its `Category` value, which has shifted every following value one position to the left.

Due to this error we will need to remove this row from the dataset, which we will do in the code below.

In [6]:
print(len(gps_data)) # Number of rows
del gps_data[10472]
print('\n')
print(gps_data[10472]) # confirming data has been deleted
print('\n')
print(len(gps_data)) # number of rows after deletion

10841


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


10840
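
The faulty row was located through a forum report, but the same kind of error can be found without knowing the index in advance: the row printed earlier holds 12 values, while the header has 13 columns. The sketch below shows a generic length check on toy data. The function name `find_malformed_rows` and the toy rows are assumptions made for illustration and are not part of the datasets.

```python
def find_malformed_rows(header, rows):
    """Return (index, row_length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]

# Toy data: the second row is missing its Category value
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Broken Frame', '1.9'],
    ['Instagram', 'SOCIAL', '4.5'],
]

bad_rows = find_malformed_rows(toy_header, toy_rows)
print(bad_rows)

# Delete from the highest index down so earlier deletions do not shift later indices
for i, _ in sorted(bad_rows, reverse=True):
    del toy_rows[i]
print(len(toy_rows))
```

Running the same check on `gps_data` before the deletion should flag row 10472, provided that no other row has a missing or extra value.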


Checking the [discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) of the App Store dataset, we can see that no wrong data has been reported, so we leave that dataset as it is.

# Duplicate Data

## Identifying the duplicate data

When we explore the Google Play dataset, we can see that some apps have more than one entry.

For example, Instagram appears 4 times in the dataset:

In [7]:
for app in gps_data:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Below we will create two lists: one for storing the names of duplicate apps and another for storing the names of unique apps.

We will loop through the Google Play dataset and do the following:
* save the app name to the variable `name`
* if `name` is already in the `unique_apps` list, append `name` to the `duplicate_apps` list
* otherwise, append `name` to the `unique_apps` list
* after the loop, print the length of the `duplicate_apps` list to find the total number of duplicate entries within the dataset, as well as some examples.

In [8]:
duplicate_apps = []
unique_apps = []

for app in gps_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])    

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
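
Note that the figure above counts extra entries rather than distinct apps: `Box` and `Google My Business` each appear twice in the examples list. As a cross-check, the same count can be obtained in a single pass with `collections.Counter` from the standard library. The sketch below uses a toy list of names (an assumption for illustration), not the real dataset.

```python
from collections import Counter

# Toy list of app names standing in for the first column of the dataset
toy_names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']

name_counts = Counter(toy_names)

# Every occurrence beyond the first is a duplicate entry
duplicate_entries = sum(count - 1 for count in name_counts.values())

# Distinct apps that have at least one duplicate entry
apps_with_duplicates = [name for name, count in name_counts.items() if count > 1]

print(duplicate_entries)
print(apps_with_duplicates)
```

For the real data, `Counter(app[0] for app in gps_data)` would play the role of `name_counts`.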


It is important that we do not keep duplicate data, as it can skew the results of our analysis, so we need to remove duplicate entries until only one remains per app.

Above we found 1181 duplicate entries. We could remove duplicates at random, but we can do better by using information within the data.

In the Instagram example earlier, the rows differ only in the fourth value of each row, which corresponds to the number of reviews given by users for the app. We can infer that the row with the highest number of reviews is the most recent one. Using this, we can remove all duplicate entries for a given app except the one with the highest number of reviews.

## Removing the duplicate data





In [9]:
print('Expected length:', len(gps_data) - 1181)

Expected length: 9659


As seen above, once the duplicates are removed we should be left with 9659 rows of unique data.

To remove the duplicate data we will need to:
* create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app
* use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [10]:
reviews_max = {}

for app in gps_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Actual length:', len(reviews_max))

Actual length: 9659


Above we have built the dictionary `reviews_max`, which maps each unique app name to the highest number of reviews recorded for that app. The `Actual length` of `9659` shows that the dictionary has the length we predicted.

In the code below we are going to use the dictionary `reviews_max` to remove the duplicate rows from `gps_data`:
* we will start by creating two empty lists: `gps_clean` (to store the new cleaned data set with no duplicate data) and `already_added` (to store the app names)
* we will loop through the dataset and for each iteration we will:
    * isolate the name of the app and number of reviews to the variables `name` and `n_reviews` respectively
    * add the current row of data `app` to the list `gps_clean` and add the app name `name` to the list `already_added` if:
        * the number of reviews `n_reviews` of the current row is equal to the number of reviews stored for that app in the dictionary `reviews_max`, and
        * the app name `name` is not already in the list `already_added`. This makes sure that duplicate rows with the same number of reviews are not all added. The Instagram data earlier shows that two rows can share the same number of reviews; if such a tie occurred at the highest review count, the condition `if n_reviews == reviews_max[name]` on its own would add both rows to the list instead of only one.

In [11]:
gps_clean = []
already_added = []

for app in gps_data:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        gps_clean.append(app)
        already_added.append(name)
        
explore_data(gps_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13


Using the `explore_data` function we defined earlier, we can confirm that the number of rows is now 9659, showing that we have successfully cleaned the data of duplicates.
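
The two-step approach above (first the dictionary of maximum review counts, then the filtered list) can also be collapsed into a single pass by storing the best row seen so far for each name. The sketch below is one possible alternative, not the method used in this project. The function name `keep_most_reviewed` and the shortened toy rows are assumptions for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the row seen first is kept
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Toy rows: name, category, rating, reviews
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Sketch', 'ART_AND_DESIGN', '4.5', '215644'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

toy_clean = keep_most_reviewed(toy_rows)
for row in toy_clean:
    print(row)
```

One difference to be aware of: the output follows the order in which each name first appears, so the row order can differ from `gps_clean` even though the same rows are kept. Dictionary lookups also avoid scanning the `already_added` list on every iteration.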

We do not need to clean the `apple_data` dataset at this point, as it contains no duplicate entries:
* In the code below we adapt the code used previously for identifying duplicate data in the `gps_data` dataset. Note that the check uses the first column of `apple_data`, which is the app `id`, not `track_name`.
* We can see that no `id` appears more than once in this dataset.

In [12]:
duplicate_apps2 = []
unique_apps2 = []

for app in apple_data:
    app_id = app[0] # index 0 is the `id` column (`track_name` is index 1)
    if app_id in unique_apps2:
        duplicate_apps2.append(app_id)
    else:
        unique_apps2.append(app_id)
        
print('Number of duplicate apps:', len(duplicate_apps2))

Number of duplicate apps: 0


# Removing non-English apps

The apps we develop at our company are in English, so we would like to analyse only apps that have been designed for an English-speaking audience.

Below are a couple of examples from each dataset showing that they contain non-English apps.

In [13]:
print(apple_data[813][1])
print(apple_data[6731][1])
print('\n')
print(gps_clean[4412][0])
print(gps_clean[7940][0])


爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We are not interested in keeping apps like the ones shown above, so we will remove them. One way to do this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

All of these characters belong to the ASCII standard, where each character has a corresponding number in the range 0 to 127. Using this range we can build a function that detects whether a character belongs to the set of common English characters or not.

Using the built-in function `ord()` we can find the corresponding number for each character of a string. In the code below we check whether each character's number is greater than 127. If it is, `False` is returned, indicating that the app name is most likely non-English and not relevant for our analysis.

In [14]:
def is_english(a_string):
    for char in a_string:
        if ord(char) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


We can see that the function works; however, it wrongly rejects English app names that contain emojis or characters like `™`, as these fall outside the ASCII range and have numbers greater than 127. See the example below:

In [15]:
print(ord('™'))
print(ord('😜'))

8482
128540


Using the `is_english` function as it stands would cause us to lose useful data, as many English apps would be incorrectly labeled as non-English. To minimise this data loss, we will edit the `is_english` function so that it returns `False` only if the name of an app has more than 3 characters with numbers greater than 127. This means that names with up to 3 emojis or other special characters will still be labeled as English, and the function will return `True` for them.

In [16]:
def is_english(a_string):
    special_char = 0
    for char in a_string:
        if ord(char) > 127:
            special_char += 1
    if special_char > 3:
        return False
    return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
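
The threshold of 3 is a heuristic, and it helps to see where it stops working. The sketch below makes the threshold a parameter and prints the non-ASCII count next to the decision for a few names. `count_non_ascii`, `is_probably_english` and the last sample name are hypothetical and only serve as illustration.

```python
def count_non_ascii(a_string):
    """Count the characters whose code point lies outside the ASCII range (0 to 127)."""
    return sum(1 for char in a_string if ord(char) > 127)

def is_probably_english(a_string, max_non_ascii=3):
    """Heuristic: accept names with at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(a_string) <= max_non_ascii

samples = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Fun 😜😜😜😜',  # made-up English name with four emojis
]

for name in samples:
    print(name, '|', count_non_ascii(name), '|', is_probably_english(name))
```

The last sample is rejected even though it is English, so the filter is not perfect. A non-English name written entirely in ASCII letters would also pass. For this analysis that trade-off seems acceptable, but the threshold can be tuned through `max_non_ascii` if needed.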


Using the `is_english` function we check the name in each row of data. If the name satisfies our condition (3 or fewer characters with numbers greater than 127), we append the corresponding row `app` to the list `gps_english` for the Google Play data or to the list `apple_english` for the App Store data.

In [17]:
gps_english = []
apple_english = []

for app in gps_clean:
    name = app[0]
    if is_english(name):
        gps_english.append(app)

for app in apple_data:
    name = app[1]
    if is_english(name):
        apple_english.append(app)
        
explore_data(gps_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


After removing the non-English apps, we are left with 9614 apps in the Google Play dataset and 6183 in the App Store dataset.

# Isolating the free apps

As mentioned in the introduction, our company only builds apps that are free to download and install, and our main source of income is in-app ads. Currently our `gps_english` and `apple_english` datasets contain both free and non-free apps. In the code below we will loop through each dataset to isolate the free apps into two new lists: `gps_free` for Google Play apps and `apple_free` for App Store apps.

In [18]:
gps_free = []
apple_free = []

for app in gps_english:
    app_type = app[6] # `Type` column
    if app_type == 'Free':
        gps_free.append(app)
        
for app in apple_english:
    price = float(app[4])
    if price == 0:
        apple_free.append(app)
        
explore_data(gps_free, 0, 3, True)
print('\n')
explore_data(apple_free, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


After exploring the data in the new lists we now have 8863 apps left for the Google Play Store dataset and 3222 apps left for the App Store dataset.
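
The Google Play filter above relies on the `Type` column, while the App Store filter relies on the price. Since the Google Play data has both a `Type` and a `Price` column, a quick consistency check between the two could catch rows where they disagree. The sketch below runs on toy rows. It assumes that paid prices are stored as text with a leading `$` (the rows shown in this notebook only contain `'0'`), and `parse_price` and the toy column positions are hypothetical.

```python
def parse_price(price_text):
    """Convert a price string such as '0' or '$4.99' to a float (assumed format)."""
    return float(price_text.replace('$', ''))

# Toy rows: name, type, price
TYPE_INDEX, PRICE_INDEX = 1, 2
toy_rows = [
    ['Sketch', 'Free', '0'],
    ['Paid Tool', 'Paid', '$4.99'],
    ['Odd Entry', 'Free', '$1.99'],  # deliberately inconsistent
]

# A row is inconsistent when "type is Free" and "price is zero" disagree
inconsistent = [
    row for row in toy_rows
    if (row[TYPE_INDEX] == 'Free') != (parse_price(row[PRICE_INDEX]) == 0)
]

print(inconsistent)
```

On the real data the indices would be 6 for `Type` and 7 for `Price`. An empty result would suggest that filtering on either column gives the same set of free apps.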

# Most Common Apps by Genre

As covered in the introduction, our aim is to identify the types of apps that are likely to attract the most users, as our revenue depends on the number of users who see our in-app ads.

To minimize risks and overhead, our validation strategy for an app idea has three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

As we wish to add the app onto both the Google Play Store and the App Store, we need to find app profiles that would be successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

To start this analysis we need to determine the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our datasets.

Inspecting the header row of each dataset in the code below, the most useful columns are `Category` and `Genres` for the Google Play Store and `prime_genre` for the App Store.


In [19]:
print(gps_header)
print('\n')
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In the code below we build two functions to help create frequency tables for the `Category`, `Genres` and `prime_genre` columns:
* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

In [20]:
def freq_table(dataset, index):
    data_table = {}
    total = len(dataset)
    for row in dataset:
        column = row[index]
        if column in data_table:
            data_table[column] += 1
        else:
            data_table[column] = 1
    
    for data in data_table:
        data_table[data] = (data_table[data] / total) * 100
    
    return data_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
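
For comparison, the same two steps can be written more compactly with `collections.Counter`, sorting directly on the percentage instead of building a list of tuples. This is only a sketch of an alternative, with rounding added for readability. The names `freq_table_pct` and `display_table_rounded` and the toy rows are assumptions, and the rest of this notebook keeps using `freq_table` and `display_table`.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

def display_table_rounded(dataset, index, digits=2):
    """Print the frequency table in descending order of percentage."""
    table = freq_table_pct(dataset, index)
    for value, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(value, ':', round(pct, digits))

# Toy rows: name, genre
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
toy_table = freq_table_pct(toy_rows, 1)
display_table_rounded(toy_rows, 1)
```

One behavioural difference: when two values have the same percentage, `display_table` orders them by name in reverse, while this version keeps them in order of first appearance.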

## App Store Data

First we analyse the frequency table for the `prime_genre` column of the App Store dataset:

In [21]:
display_table(apple_free, -5) # prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among the free English apps, `58.16%` fall under the genre `Games`. The second most common genre is `Entertainment` at `7.88%`, followed by `Photo & Video` at `4.97%`, `Education` at `3.66%` and `Social Networking` at `3.29%`.

From this data it seems that most of the free English apps on the App Store are designed for entertainment (games, photo and video, social networking, sports, music) rather than for practical purposes (education, shopping, utilities, productivity, lifestyle).

At this stage we cannot define an app profile suited to the App Store market. Although entertainment apps, specifically those in the `Games` genre, make up more than half of the free English apps on the store, this does not show which genre is the most popular or has the most users. It only shows which genres are most on offer, which does not necessarily reflect user demand. We will need further analysis to draw firmer conclusions.

## Google Play Store Data

Below we analyse the frequency tables for the `Category` and `Genres` columns of the Google Play dataset. Both columns seem to serve a similar function within the dataset.

In [22]:
display_table(gps_free, 1) # Category

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Among the free English apps, `18.90%` fall under the category `FAMILY`, followed by `9.73%` under `GAME` and `8.46%` under `TOOLS`.

The `FAMILY` category is not specific enough, so we will investigate it further. This [article](https://androidcommunity.com/google-play-now-shows-family-category-20150610/), written when the family category was launched on the Google Play Store in 2015, indicates that it mostly consists of games and entertainment apps aimed at children.

Even if we class `FAMILY` apps as entertainment (alongside `GAME`, `SPORTS` and `SOCIAL`), the Google Play Store has a markedly higher share of free English apps designed for practical purposes (`TOOLS`, `BUSINESS`, `LIFESTYLE`, `PRODUCTIVITY`, `FINANCE`) than the App Store.

In [23]:
display_table(gps_free, -4) # Genres

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

The `Genres` column seems to contain similar data to the `Category` column, but with more granular categorisation. However, not all of it is clear: there is a genre `Education` as well as `Educational;Education` and `Education;Education`. For the purposes of this project it makes more sense to use the `Category` column, as it is more straightforward and lets us pull the data we need more easily.

From analysing the most common genres among free English apps, we can see that the App Store is dominated by apps designed for entertainment (mostly games), while the Google Play Store features more of a mix between entertainment and practical apps.

# Most Popular Apps by Genre on the App Store

To find out which genres are the most popular, we need a column in the data that we can use to calculate an average "popularity" value per genre. The App Store data set has no obvious candidate for this, so as a workaround we will use the `rating_count_tot` column. The idea is that the higher the number of user ratings for an app, the more popular it should be.

In the code below we will calculate the average number of ratings per genre for the App Store data set and display this information in a descending order:

In [24]:
apple_freq_genres = freq_table(apple_free, -5)

table_display = []
for genre in apple_freq_genres:
    total = 0
    len_genre = 0
    for row in apple_free:
        genre_app = row[-5]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    average_ratings = total / len_genre
    table_pair = (average_ratings, genre)
    table_display.append(table_pair)
    
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


From the output above we can see that the genres with the highest average number of user ratings are `Navigation`, `Reference` and `Social Networking`.
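The nested loop above scans the whole data set once per genre. A single-pass alternative is sketched below, assuming a simplified `(genre, rating_count)` row layout rather than the real `apple_free` rows. The figures are the `Navigation` and `Book` rating counts as they appear in the app listings printed in this notebook; the names `rows`, `totals`, `counts` and `averages` are illustrative only.

```python
from collections import defaultdict

# Simplified (genre, rating_count) rows; figures copied from the
# Navigation and Book listings printed in this notebook.
rows = [
    ('Navigation', 345046), ('Navigation', 154911), ('Navigation', 12811),
    ('Navigation', 3582), ('Navigation', 187), ('Navigation', 5),
    ('Book', 252076), ('Book', 105274), ('Book', 84062), ('Book', 65450),
    ('Book', 47829), ('Book', 879), ('Book', 451), ('Book', 392),
    ('Book', 197), ('Book', 9), ('Book', 0), ('Book', 0), ('Book', 0),
    ('Book', 0),
]

totals = defaultdict(int)   # genre -> sum of rating counts
counts = defaultdict(int)   # genre -> number of apps

for genre, ratings in rows:  # one pass over the data
    totals[genre] += ratings
    counts[genre] += 1

averages = {genre: totals[genre] / counts[genre] for genre in totals}

for genre, average in sorted(averages.items(), key=lambda pair: pair[1], reverse=True):
    print(genre, ':', average)
```

Both averages agree with the table above, which suggests the two approaches are interchangeable for this data.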

Let's have a look at these genres in more detail:
* Note - we have also defined some functions below for changing the font of the output text, which we will use going forward.

In [25]:
def go_bold(a_string):
    return ('\033[1m' + a_string + '\033[0m')

def go_bold_red(a_string):
    return ('\033[1;31m' + a_string + '\033[0m')

def bold_underline(a_string):
    return ('\033[1;4m' + a_string + '\033[0m')

print(go_bold('Navigation'))
for app in apple_free:
    if app[-5] == "Navigation":
        print(app[1], ':', app[5])
        
print('\n')
print(go_bold('Reference'))

for app in apple_free:
    if app[-5] == "Reference":
        print(app[1], ':', app[5])
        
print('\n')
print(go_bold("Social Networking"))

for app in apple_free:
    if app[-5] == "Social Networking":
        print(app[1], ':', app[5])

[1mNavigation[0m
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


[1mReference[0m
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft

For the `Navigation`, `Reference` and `Social Networking` genres, it appears that the average number of user ratings is being skewed by a few very popular apps. These markets would be much harder for our company to break into, as they are dominated by apps with a significant audience and share of the market on the store.
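A quick way to see this kind of skew is to compare the mean with the median. The sketch below uses the six `Navigation` rating counts printed above; the variable names are illustrative and not part of the original notebook.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)      # pulled upwards by Waze and Google Maps
nav_median = median(navigation_ratings)  # midpoint of the two middle values

print('Mean   :', round(nav_mean, 2))
print('Median :', nav_median)
```

The mean is roughly ten times the median here, which is what we would expect when a couple of apps hold most of the ratings.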

To verify this and investigate the top three genres further, we will define a function called `genre_rating` that takes `genre` as its parameter. This function will:
* Take in a string that corresponds to one of the genre names in the `prime_genre` column (index `-5`) of the App Store data set.
* Create two variables for summing user ratings: one for all apps in the genre (`total`) and one for just the apps with over 100,000 ratings (`big_total`). Two more variables count the number of apps in the genre (`number_apps`) and the number of apps with over 100,000 user ratings (`number_big_apps`).
* Use `app` to iterate over the `apple_free` data set:
    * An `if` statement within this loop counts the big apps in the given `genre` and totals their user ratings.
        * A nested `if` prints the app name and its number of user ratings while `number_big_apps` is less than `3`. This limits the output to the first two such apps, which are the two most popular if, as the listings above suggest, the rows are in descending order of ratings.
    * A second `if` statement within this loop counts all the apps in the given `genre` and totals all their user ratings.
        * A nested `if` prints the first two apps of the genre when none of its apps have over 100,000 user ratings.
* `print` the total number of apps in the genre, the number of apps with over 100,000 user ratings, the share of the genre's user ratings held by those apps (as a percentage), and finally those apps as a percentage of the total number of apps in the genre.

In [26]:
def genre_rating(genre): # takes a string as input that needs to match a genre 
    print('\n')
    print(bold_underline(genre))
    total = 0
    big_total = 0
    number_apps = 0
    number_big_apps = 0
    for app in apple_free:
        if app[-5] == str(genre) and int(app[5]) > 100000:
            number_big_apps += 1
            if number_big_apps < 3:
                print(app[1], ':', app[5])
            big_total += int(app[5])
        if app[-5] == str(genre):
            total += int(app[5])
            number_apps += 1
            if number_big_apps == 0 and number_apps < 3:
                print(app[1], ':', app[5])
    print(go_bold('Total apps in genre'), ':', number_apps)
    print(go_bold('Total 100000+ apps in genre'), ':', number_big_apps)
    print(go_bold('Percentage of 100000+ apps of total user ratings'), ':', 
           (round(big_total / total, 4) * 100), '%')
    print(go_bold('Percentage of 100000+ apps of total apps'), ':', 
           (round(number_big_apps / number_apps, 4) * 100), '%')
            
genre_rating('Navigation')
genre_rating('Reference')
genre_rating('Social Networking')



[1;4mNavigation[0m
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
[1mTotal apps in genre[0m : 6
[1mTotal 100000+ apps in genre[0m : 2
[1mPercentage of 100000+ apps of total user ratings[0m : 96.78999999999999 %
[1mPercentage of 100000+ apps of total apps[0m : 33.33 %


[1;4mReference[0m
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
[1mTotal apps in genre[0m : 18
[1mTotal 100000+ apps in genre[0m : 2
[1mPercentage of 100000+ apps of total user ratings[0m : 87.92 %
[1mPercentage of 100000+ apps of total apps[0m : 11.110000000000001 %


[1;4mSocial Networking[0m
Facebook : 2974676
Pinterest : 1061624
[1mTotal apps in genre[0m : 106
[1mTotal 100000+ apps in genre[0m : 11
[1mPercentage of 100000+ apps of total user ratings[0m : 82.59 %
[1mPercentage of 100000+ apps of total apps[0m : 10.38 %


Above we have isolated the apps that have over 100,000 user ratings (used as a proxy for the number of users and so the app's popularity). It appears our initial thoughts were correct: a few apps in the three genres with the highest averages are skewing the results. We also wanted to see what share of each genre's total user ratings is held by these apps with big audiences.

What we can see is that apps with 100,000+ user ratings in the `Navigation` genre account for `96.79%` of the total user ratings for that genre while making up only `33.33%` of the `Navigation` apps. For `Reference` this is `87.92%` of ratings from `11.11%` of apps, and for `Social Networking` it is `82.59%` from `10.38%`.
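The `Navigation` percentages can be reproduced by hand from the six rating counts listed earlier. The sketch below is a standalone check rather than a replacement for `genre_rating`; the names `THRESHOLD`, `big`, `ratings_share` and `apps_share` are illustrative. It multiplies by 100 before rounding, which avoids floating-point tails such as `96.78999999999999` in the printed output.

```python
# Rating counts of the six Navigation apps listed earlier.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]
THRESHOLD = 100000  # same cut-off as genre_rating

big = [ratings for ratings in navigation_ratings if ratings > THRESHOLD]

# Multiply by 100 first, then round to two decimals.
ratings_share = round(sum(big) / sum(navigation_ratings) * 100, 2)
apps_share = round(len(big) / len(navigation_ratings) * 100, 2)

print('Share of user ratings held by 100000+ apps :', ratings_share, '%')
print('Share of apps that are 100000+ apps        :', apps_share, '%')
```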

While the most popular genres on the App Store appear to be dominated by a few large apps, it is also worth investigating the rest with the `genre_rating` function we have created. That is what we will do below:

In [27]:
genre_rating('Music')
genre_rating('Weather')
genre_rating('Book')
genre_rating('Food & Drink')
genre_rating('Finance')
genre_rating('Photo & Video')
genre_rating('Travel')
genre_rating('Shopping')
genre_rating('Health & Fitness')
genre_rating('Sports')
genre_rating('Games')
genre_rating('News')
genre_rating('Productivity')
genre_rating('Utilities')
genre_rating('Lifestyle')
genre_rating('Entertainment')
genre_rating('Business')
genre_rating('Education')
genre_rating('Catalogs')
genre_rating('Medical')



[1;4mMusic[0m
Pandora - Music & Radio : 1126879
Spotify Music : 878563
[1mTotal apps in genre[0m : 66
[1mTotal 100000+ apps in genre[0m : 9
[1mPercentage of 100000+ apps of total user ratings[0m : 87.35000000000001 %
[1mPercentage of 100000+ apps of total apps[0m : 13.639999999999999 %


[1;4mWeather[0m
The Weather Channel: Forecast, Radar & Alerts : 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648
[1mTotal apps in genre[0m : 28
[1mTotal 100000+ apps in genre[0m : 6
[1mPercentage of 100000+ apps of total user ratings[0m : 88.8 %
[1mPercentage of 100000+ apps of total apps[0m : 21.43 %


[1;4mBook[0m
Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
[1mTotal apps in genre[0m : 14
[1mTotal 100000+ apps in genre[0m : 2
[1mPercentage of 100000+ apps of total user ratings[0m : 64.2 %
[1mPercentage of 100000+ apps of total apps[0m : 14.29 %


[1;

It seems that in most of the genres a few apps skew the average number of ratings, the same trend we saw earlier in the three genres with the highest average user ratings. You can see this in the `Games` genre with the apps `Clash of Clans` and `Temple Run`, the `Music` genre with `Pandora` and `Spotify`, and the `Navigation` genre with `Waze` and `Google Maps`.

There are some genres that have no apps with over 100,000 user ratings. However, these genres contain very few apps, and their most popular apps have far fewer user ratings than the popular apps in other genres. This can be seen in the genre `Medical` with the apps `Snorelab` and `Blink Health`, which have `1341` and `11981` respectively, `Catalogs` with `CPlus` and `DRAGONS MODS`, which have `13345` and `2027` respectively, and `Business` with `Indeed` and `Flashlight`, which have `38681` and `24744` respectively. So while there would be less competition if we were to design apps for these genres, their popularity suggests the potential for profitability is far lower, and so we will avoid these genres.

We want to develop for genres that have a high potential for profitability but are not saturated with apps, as that would make it hard to break into the market due to high levels of competition. Earlier in the project we analysed the percentage of the total apps that each genre had on the App Store, with `Games` making up `58.16%` and `Entertainment` making up `7.88%`. We should avoid developing our app for these genres. However, as they are popular, we could incorporate features from them into our app idea while keeping our app's primary focus elsewhere.

As mentioned previously, most of the genres contain a few popular apps that skew their data, so unless we want to develop an app for a genre with a very small market and therefore low profitability, we need to make a compromise. Looking at the above data, we want to select genres whose total user ratings are more evenly spread over their apps. Less domination by a few popular apps gives us a better chance of breaking into that market. We also want genres that are not oversaturated with apps.

Genres that look promising in this regard are:
* `Book`
* `Shopping`
* `Sports`
* `Productivity`
* `Utilities`
* `Education`

To get an idea of the market share that each of the above genres has on the App Store for free English apps, we will create the function `total_genre_rating`. It calculates the total number of user ratings for a genre (and so the genre's popularity and its share of the store market), and the percentage that this genre total represents of the App Store's combined user ratings. We have done this in the code below and shown the results for each genre:

In [28]:
def total_genre_rating(genre): # takes a string as input that needs to match a genre 
    print('\n')
    print(bold_underline(genre))
    total_genre = 0
    percent_all_data = 0
    for app in apple_free:
        percent_all_data += int(app[5])
        if app[-5] == str(genre):
            total_genre += int(app[5])
    print(go_bold('Total user ratings for genre'), ':', total_genre)
    print(go_bold('Percentage of genre ratings of the total store ratings'), 
          ':', (round(total_genre / percent_all_data, 4) * 100), '%')

print(go_bold_red("***Genres being considered for app development***"))

total_genre_rating('Shopping')
total_genre_rating('Sports')
total_genre_rating('Utilities')
total_genre_rating('Productivity')
total_genre_rating('Education')
total_genre_rating('Book')

print('\n')
print(go_bold_red("***Other Genres***"))

total_genre_rating('Navigation')
total_genre_rating('Reference')
total_genre_rating('Social Networking')
total_genre_rating('Music')
total_genre_rating('Weather')
total_genre_rating('Food & Drink')
total_genre_rating('Finance')
total_genre_rating('Photo & Video')
total_genre_rating('Travel')
total_genre_rating('Health & Fitness')
total_genre_rating('Games')
total_genre_rating('News')
total_genre_rating('Lifestyle')
total_genre_rating('Entertainment')
total_genre_rating('Business')
total_genre_rating('Catalogs')
total_genre_rating('Medical')

***Genres being considered for app development***


Shopping
Total user ratings for genre : 2261254
Percentage of genre ratings of the total store ratings : 2.83 %


Sports
Total user ratings for genre : 1587614
Percentage of genre ratings of the total store ratings : 1.9800000000000002 %


Utilities
Total user ratings for genre : 1513441
Percentage of genre ratings of the total store ratings : 1.8900000000000001 %


Productivity
Total user ratings for genre : 1177591
Percentage of genre ratings of the total store ratings : 1.47 %


Education
Total user ratings for genre : 826470
Percentage of genre ratings of the total store ratings : 1.03 %


Book
Total user ratings for genre : 556619
Percentage of genre ratings of the total store ratings : 0.7000000000000001 %


***Other Genres***


Navigation
[1m

Of the six genres, we can see that `Shopping` holds the highest market share with `2.83%` of the total user ratings, and `Book` holds the lowest at `0.7%`. It is interesting to note that although the `Book` genre has the lowest market share, it has the highest average number of user ratings per app of the six genres at `39758.5`. In the output below you can see that this is caused by a few apps that hold the majority of the total user ratings for the `Book` genre.
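To put a number on how concentrated the `Book` genre is, the sketch below computes the cumulative share of ratings held by the top apps. It uses the 14 `Book` rating counts from the `list_app_ratings` output in this section; the names `book_ratings`, `running` and `cumulative_share` are illustrative.

```python
# Rating counts of the 14 free English Book apps, as listed in this section.
book_ratings = [252076, 105274, 84062, 65450, 47829, 879, 451,
                392, 197, 9, 0, 0, 0, 0]

total = sum(book_ratings)
running = 0
cumulative_share = []  # cumulative_share[k-1] = % of ratings held by the top k apps

for ratings in sorted(book_ratings, reverse=True):
    running += ratings
    cumulative_share.append(round(running / total * 100, 2))

for rank in (2, 5):
    print('Top', rank, 'apps hold', cumulative_share[rank - 1], '% of Book ratings')
```

The top two apps hold about 64% of the ratings, in line with the earlier `genre_rating` output, and the top five hold nearly all of them.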

* `Utilities` - Includes various calculators, flashlights, web browsers, etc. Many of the apps that fit into this category, such as alarm clocks and QR readers, would not be used for long stretches of time. Ideally we want to design an app that can retain a user for longer periods, so there is a higher chance of that user engaging with our ads. Designing an internet browser also requires a lot of time and monetary investment, and so would be out of the scope of our company.
* `Book` - Of all the genres we are looking at for an app profile, this one has the fewest apps but the highest average number of ratings. This is due to the genre having a few apps with a high number of user ratings. Trying to compete with the dominating apps such as `Kindle` and `Audible` would be very difficult, as these apps are backed by large companies with great financial resources. Their purpose, one being a library of books and the other offering the same in an audio format, would be almost impossible to supplant as there isn't much else to innovate on.
    * For this genre we would need to get more creative if we wanted to stand out. The `Hooked` and `Color Therapy Adult Coloring Book for Adults` apps each take a unique approach, and it is apparent how they have managed to stand out in this genre. Like the `Hooked` app, we can focus on fan fiction or user-created content, and include networking opportunities for the users, incorporating aspects of apps from the `Social Networking` genre. We can let users upload their own source material. We can then have ads displayed on the user interface outside and inside texts.
* `Education` - Here we see a mix of game-style education apps such as `Elevate`, apps designed for educational institutions such as `Canvas`, and learning apps such as `Duolingo`. This genre has potential for high user retention, as we can implement learning paths alongside a course structure. However, this would best be aimed at a younger audience, as focusing on specialist subjects or older audiences could require educational specialists to produce the material for the app and so could prove costly.
    * Having a learning path with multiple lessons and modules would give us the opportunity to display ads within the app. If we decide to include gaming elements, we could add weekly challenges that explore the topics and subjects we cover. This could help with user retention, as it provides fresh material for users to come back to.
* `Productivity` - This genre appears to focus on 'office' style apps such as `Gmail` and `Microsoft Word`. While a lot of these types of apps would tend to be used more frequently, potentially multiple times a day, it appears that this market is dominated by a few companies with lots of popular apps.
    * However, when looking at what types of apps dominate the App Store, it is clear that apps designed for "fun" take this spot over "work" or "office" apps. You can see this with the `Games` genre taking up `53.39%` of all the user ratings on the App Store. Designing an app that fits the `Productivity` genre may give us a chance to stand out on the App Store as a whole, as it is oversaturated with "fun" apps.
    * As the genre is dominated by office apps, we can design an app that employs 'gamification' to help it stand out amongst the other apps. We could include elements like a task manager and planner in which users earn points for unlocks or achievements by meeting set goals and completing tasks.
* `Sports` - Consists of many apps belonging to large sports news networks and fantasy sports leagues. Developing an app for reporting sports news would potentially require hiring extra staff as writers, reporters or entertainers. It would be a big undertaking that could prove very costly.
    * We could look at developing a fantasy sports league app, which would not require the expensive overheads that a news network could entail. We can easily embed a social networking function and allow users to set up leagues with their friends, encouraging them to participate with others online and to get people they know to install the app.
    * `Social Networking` is a popular genre and accounts for `9.48%` of the user ratings of free English apps on the App Store, so taking some ideas from what makes that genre popular and applying them to our more niche market could help us stand out.
    * We could include quizzes and trivia games about various sport-related topics. We can also include a sports reference guide that would allow users to search current and past information about players, teams, tournaments, etc., depending on the sport. Including these aspects, which you would most likely find in the `Games` and `Reference` genres, can keep users retained and engaged for longer, as they will not have to use external apps to look up information on sports teams or to play sports trivia games.
* `Shopping` - Apps here are often online stores, which would require purchasing and delivering inventory and so would be outside the scope of our company. While apps like `eBay` are free to use and sign up for, the primary revenue comes from eBay taking a cut of the profits from sellers and auctioneers. This ultimately goes against the aims we have set out for this project.

Below we have created a function called `list_app_ratings` that lists the apps and their user ratings for the six genres above. This supports the analysis and allows a comparison between the genres.

In [29]:
def list_app_ratings(a_string):
    print('\n')
    print(go_bold(a_string))
    for app in apple_free:
        if app[-5] == a_string:
            print(app[1], ':', app[5])
    
list_app_ratings('Book')
list_app_ratings('Education')
list_app_ratings('Shopping')
list_app_ratings('Utilities')
list_app_ratings('Productivity')
list_app_ratings('Sports')



[1mBook[0m
Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0
謎解き : 0
謎解き2016 : 0


[1mEducation[0m
Duolingo - Learn Spanish, French and more : 162701
Guess My Age  Math Magic : 123190
Lumosity - Brain Training : 96534
Elevate - Brain Training and Games : 58092
Fit Brains Trainer : 46363
ClassDojo : 35440
Memrise: learn languages : 20383
Peak - Brain Training : 20322
Canvas by Instructure : 19981
ABCmouse.com - Early Learning Academy : 18749
Quizlet: Study Flashcards, Languages & Vocabulary : 16683
Photomath - Camera Calculator : 16523
iTunes U : 15801
Blackboard Mo



# Most Popular Apps by Genre on Google Play

For the Google Play market we actually have data on the number of installs for an app, so we should be able to get a clearer picture of the popularity of a genre and its apps. However, the install numbers are not precise: the values given are open-ended (100+, 1,000+, 5,000+, etc.).

In [30]:
display_table(gps_free, 5) # the Installs columns

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


The problem with this data is that if an app is listed as having 1,000+ installs, we do not know whether it has 1,000 installs, 1,250, or 1,500. The same applies to all the other install categories. However, we don't need very precise data for our purposes; we only want to find out which app genres attract the most users.

We will leave the numbers as they are, so that an app with 100,000+ installs will have 100,000 installs, an app with 10,000+ installs will have 10,000 installs, and so on.

To perform computations and further analysis we will need to convert each install number from a string to a float. To do this we need to remove the commas and the plus characters, or the conversion will fail with an error. To help make the output more readable, we have defined a function called `sort_table_desc`, which sorts a table of `(value, name)` tuples into descending order and prints it.
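The cleaning step described above can be sketched in isolation. The helper name `parse_installs` is hypothetical and not part of the original notebook, which performs the same replacements inline. The sample labels are taken from the `Installs` frequency table shown earlier.

```python
def parse_installs(label):
    """Convert an install label such as '1,000,000+' to the integer 1000000."""
    cleaned = label.replace(',', '').replace('+', '')  # strip commas and the plus sign
    return int(cleaned)

# Labels taken from the Installs frequency table above.
for label in ['0+', '1,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```

Without the two `replace` calls, `int('1,000+')` raises a `ValueError`, which is the error the text above refers to.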

In [31]:
def sort_table_desc(freq_table): # defining a function to sort the output into descending order
    freq_table_sorted = sorted(freq_table, reverse = True)
    for entry in freq_table_sorted:
        print(entry[1], ':', '{:,}'.format(entry[0]))
    print('\n')

gps_categories = freq_table(gps_free, 1) # using an earlier defined function for generating a frequency table

gps_categories_table = [] # creating an empty list to use for ordering the output
for category in gps_categories:
    total = 0
    len_category = 0
    for app in gps_free:
        category_app = app[1]
        if category_app == category:
            num_installs = app[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',', '')
            num_installs = float(num_installs)
            total += num_installs
            len_category += 1
    avg_num_installs = total / len_category
    key_cat_as_tuple = (avg_num_installs, category)
    gps_categories_table.append(key_cat_as_tuple)

sort_table_desc(gps_categories_table)

for app in gps_free:
    num_installs = app[5]
    num_installs = num_installs.replace('+', '')
    num_installs = num_installs.replace(',', '')
    num_installs = int(num_installs)
    app[5] = str(num_installs)

COMMUNICATION : 38,456,119.167247385
VIDEO_PLAYERS : 24,727,872.452830188
SOCIAL : 23,253,652.127118643
PHOTOGRAPHY : 17,840,110.40229885
PRODUCTIVITY : 16,787,331.344927534
GAME : 15,588,015.603248259
TRAVEL_AND_LOCAL : 13,984,077.710144928
ENTERTAINMENT : 11,640,705.88235294
TOOLS : 10,801,391.298666667
NEWS_AND_MAGAZINES : 9,549,178.467741935
BOOKS_AND_REFERENCE : 8,767,811.894736841
SHOPPING : 7,036,877.311557789
PERSONALIZATION : 5,201,482.6122448975
WEATHER : 5,074,486.197183099
HEALTH_AND_FITNESS : 4,188,821.9853479853
MAPS_AND_NAVIGATION : 4,056,941.7741935486
FAMILY : 3,697,848.1731343283
SPORTS : 3,638,640.1428571427
ART_AND_DESIGN : 1,986,335.0877192982
FOOD_AND_DRINK : 1,924,897.7363636363
EDUCATION : 1,833,495.145631068
BUSINESS : 1,712,290.1474201474
LIFESTYLE : 1,437,816.2687861272
FINANCE : 1,387,692.475609756
HOUSE_AND_HOME : 1,331,540.5616438356
DATING : 854,028.8303030303
COMICS : 817,657.2727272727
AUTO_AND_VEHICLES : 647,317.8170731707
LIBRARIES_AND_DEMO : 638,503.

From the data above we can see that the top 3 genres for average number of installs per app are `Communication` at `38,456,119`, `Video_players` at `24,727,872`, and `Social` at `23,253,652`. Below we will explore the genre `Communication` further. 

In [32]:
app_installs_table = []
for app in gps_free:
    if app[1] == 'COMMUNICATION':
        key_cat_as_tuple = (int(app[5]), app[0])
        app_installs_table.append(key_cat_as_tuple)

sort_table_desc(app_installs_table)

WhatsApp Messenger : 1,000,000,000
Skype - free IM & video calls : 1,000,000,000
Messenger – Text and Video Chat for Free : 1,000,000,000
Hangouts : 1,000,000,000
Google Chrome: Fast & Secure : 1,000,000,000
Gmail : 1,000,000,000
imo free video calls and chat : 500,000,000
Viber Messenger : 500,000,000
UC Browser - Fast Download Private & Secure : 500,000,000
LINE: Free Calls & Messages : 500,000,000
Google Duo - High Quality Video Calls : 500,000,000
imo beta free calls and text : 100,000,000
Yahoo Mail – Stay Organized : 100,000,000
Who : 100,000,000
WeChat : 100,000,000
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000
Telegram : 100,000,000
Opera Mini - fast web browser : 100,000,000
Opera Browser: Fast and Secure : 100,000,000
Messenger Lite: Free Calls & Messages : 100,000,000
Kik : 100,000,000
KakaoTalk: Free Calls & Text : 100,000,000
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000
Firefox Browser 

The data from the App Store showed us that a genre's popularity was being skewed by a few apps with a high number of user ratings. There seems to be a similar trend here with the data from the Google Play Store.

We can see that there are a few apps with over a billion installs (`WhatsApp`, `Skype`, `Messenger`, `Hangouts`, `Google Chrome`, and `Gmail`) and even more with over 500 million and 100 million installs.

From analysing this genre below, we can see that there are 27 apps that have 100 million installs or more. When these apps are removed, the average number of installs per app drops to `3,603,485`, which is roughly a ten-fold decrease.

In [33]:
under_100m_apps = []
over_100m_apps = [] 
total = 0
total_under_100m = 0
for app in gps_free:
    num_installs = int(app[5])
    if app[1] == 'COMMUNICATION' and num_installs < 100000000:
        under_100m_apps.append(num_installs)
        total += 1
        total_under_100m += 1
    elif app[1] == 'COMMUNICATION' and num_installs >= 100000000:
        over_100m_apps.append(num_installs)
        total += 1
print(go_bold_red('COMMUNICATION'))        
print(go_bold('Number of apps'), ':', total)
print(go_bold('Number of apps under 100 million'), ':', total_under_100m)
print(go_bold('Average installs for apps under 100 million'), ':', 
              ('{:,}'.format(round(sum(under_100m_apps) / len(under_100m_apps)))))
print(go_bold('Average installs for apps at and over 100 million'), ':', 
              ('{:,}'.format(round(sum(over_100m_apps) / len(over_100m_apps)))))

[1;31mCOMMUNICATION[0m
[1mNumber of apps[0m : 287
[1mNumber of apps under 100 million[0m : 260
[1mAverage installs for apps under 100 million[0m : 3,603,485
[1mAverage installs for apps at and over 100 million[0m : 374,074,074
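As a rough sanity check on the `COMMUNICATION` figures printed above, the two group averages can be recombined into the overall category average reported earlier. The variable names below are illustrative. Because the printed group averages are rounded, the result only matches to within a few installs.

```python
# Figures copied from the COMMUNICATION output above.
n_under, avg_under = 260, 3603485     # apps under 100 million installs
n_over, avg_over = 27, 374074074      # apps at or over 100 million installs
reported_overall = 38456119.167247385  # category average from the earlier table

# Weighted mean of the two groups.
recombined = (n_under * avg_under + n_over * avg_over) / (n_under + n_over)

print('Recombined average :', '{:,}'.format(round(recombined)))
print('Fold decrease      :', round(reported_overall / avg_under, 2))
```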


To see if this pattern continues amongst the other genres in the top five, we have created the function `average_installs`. This function will carry out the same analysis that we did for the `Communication` genre.

Below we can see that this pattern does in fact hold up amongst the other popular genres. When we remove the apps that have 100 million installs or more, the average drops dramatically. Of the top five genres, `Photography` has the least disparity, with a roughly 2.3 times decrease in the average number of installs per app. However, this is still a significant decrease.

In [34]:
def average_installs(a_string):  # Accepts only a string from the 'Category' column of gps_free
    under_100m_apps = []
    over_100m_apps = []
    # Split the genre's apps into two groups around the 100 million installs mark
    for app in gps_free:
        if app[1] != a_string:
            continue
        num_installs = int(app[5])
        if num_installs < 100000000:
            under_100m_apps.append(num_installs)
        else:
            over_100m_apps.append(num_installs)
    print(go_bold_red(a_string))
    print(go_bold('Number of apps'), ':', len(under_100m_apps) + len(over_100m_apps))
    print(go_bold('Number of apps under 100 million'), ':', len(under_100m_apps))
    # Only print an average when the group is non-empty, to avoid dividing by zero
    if under_100m_apps:
        print(go_bold('Average installs for apps under 100 million'), ':',
              '{:,}'.format(round(sum(under_100m_apps) / len(under_100m_apps))))
    if over_100m_apps:
        print(go_bold('Average installs for apps at and over 100 million'), ':',
              '{:,}'.format(round(sum(over_100m_apps) / len(over_100m_apps))))
    print('\n')
    
average_installs('VIDEO_PLAYERS')
average_installs('SOCIAL')
average_installs('PHOTOGRAPHY')
average_installs('PRODUCTIVITY')

VIDEO_PLAYERS
Number of apps : 159
Number of apps under 100 million : 150
Average installs for apps under 100 million : 5,544,878
Average installs for apps at and over 100 million : 344,444,444


SOCIAL
Number of apps : 236
Number of apps under 100 million : 223
Average installs for apps under 100 million : 3,084,583
Average installs for apps at and over 100 million : 369,230,769


PHOTOGRAPHY
Number of apps : 261
Number of apps under 100 million : 242
Average installs for apps under 100 million : 7,670,532
Average installs for apps at and over 100 million : 147,368,421


PRODUCTIVITY
Number of apps : 345
Number of apps under 100 million : 323
Average installs for apps under 100 million : 3,379,657
Average installs for apps at and over 100 million : 213,636,364




Looking at each of these other genres in more detail below, we can see that each market is dominated by a few apps with a very high number of installs. For the `VIDEO_PLAYERS` genre we can see giants such as `YouTube`, `Google Play Movies & TV` and `MX Player`. For `SOCIAL`, the likes of `Instagram`, `Google+` and `Facebook` dominate the genre. For `PHOTOGRAPHY` we have `Google Photos` at a billion installs and various photo editor apps at 100 million. For `PRODUCTIVITY`, a lot of the top apps are owned by Google and Microsoft, so we would be competing not just against popular apps but also against companies with near-monopolies in that market.

The concern here is that these apps with high numbers of installs are distorting the popularity of their genres, and if we were to develop an app for these genres we would be up against these giant companies.
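To put a rough number on this distortion, the sketch below rebuilds each genre's install totals from the counts and averages printed above and reports what share of installs belongs to the apps at or over 100 million. The names `printed` and `concentration` are illustrative and not part of the project code, and because the printed averages are rounded the results are approximate.

```python
# Figures copied from the printed output above:
# genre -> (total apps, apps under 100M, average under 100M, average at/over 100M)
printed = {
    'VIDEO_PLAYERS': (159, 150, 5_544_878, 344_444_444),
    'SOCIAL': (236, 223, 3_084_583, 369_230_769),
    'PHOTOGRAPHY': (261, 242, 7_670_532, 147_368_421),
    'PRODUCTIVITY': (345, 323, 3_379_657, 213_636_364),
}


def concentration(total, n_under, avg_under, avg_over):
    """Return (share of installs held by 100M+ apps, overall avg / under-100M avg)."""
    n_over = total - n_under
    installs_under = n_under * avg_under  # approximate: the averages were rounded
    installs_over = n_over * avg_over
    all_installs = installs_under + installs_over
    share_over = installs_over / all_installs
    drop_factor = (all_installs / total) / avg_under
    return share_over, drop_factor


summary = {genre: concentration(*row) for genre, row in printed.items()}
for genre, (share_over, drop_factor) in summary.items():
    print('{}: {:.0%} of installs from 100M+ apps, average falls {:.1f}x'.format(
        genre, share_over, drop_factor))
```

A high share combined with a large drop factor suggests that a genre's popularity rests on a handful of apps rather than on the genre as a whole.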

In [35]:
def installs_per_app(a_string):
    app_installs_table = []
    for app in gps_free:
        if app[1] == a_string:
            key_cat_as_tuple = (int(app[5]), app[0])
            app_installs_table.append(key_cat_as_tuple)
    return app_installs_table

print(go_bold_red('VIDEO_PLAYERS')) 
sort_table_desc(installs_per_app('VIDEO_PLAYERS'))
print(go_bold_red('SOCIAL')) 
sort_table_desc(installs_per_app('SOCIAL'))
print(go_bold_red('PHOTOGRAPHY')) 
sort_table_desc(installs_per_app('PHOTOGRAPHY'))
print(go_bold_red('PRODUCTIVITY')) 
sort_table_desc(installs_per_app('PRODUCTIVITY'))

VIDEO_PLAYERS
YouTube : 1,000,000,000
Google Play Movies & TV : 1,000,000,000
MX Player : 500,000,000
VivaVideo - Video Editor & Photo Movie : 100,000,000
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000
VLC for Android : 100,000,000
Motorola Gallery : 100,000,000
Motorola FM Radio : 100,000,000
Dubsmash : 100,000,000
Vote for : 50,000,000
Vigo Video : 50,000,000
VMate : 50,000,000
Samsung Video Library : 50,000,000
Ringdroid : 50,000,000
MiniMovie - Free Video and Slideshow Editor : 50,000,000
LIKE – Magic Video Maker & Community : 50,000,000
KineMaster – Pro Video Editor : 50,000,000
HD Video Downloader : 2018 Best video mate : 50,000,000
DU Recorder – Screen Recorder, Video Editor, Live : 50,000,000
video player for android : 10,000,000
iMediaShare – Photos & Music : 10,000,000
YouTube Studio : 10,000,000
Video Player All Format : 10,000,000
Video Downloader - for Instagram Repost App : 10,000,000
Video Downloader : 10,000,000
Ustream : 10,000,000
Quik – F

We have discussed possibilities for app designs that we could look into for development that incorporate aspects from the most popular genres but target genres where there is potential for growth and profitability. Directly targeting the markets of the most popular genres would most likely prove fruitless, as we would be up against much larger and more numerous competitors, which in some genres have almost monopolistic control of the market.

As discussed earlier in the project, one of our aims is to develop an app that can be profitable across both the Google Play Store and the App Store. When conducting our analysis for the App Store, we found the most promising genres were:
* Book
* Education
* Productivity
* Sports

Having established the best genres to target for the App Store, we will use those insights to look for genres for potential app profiles on the Google Play Store. The ones that look most promising at a glance for developing a cross-store app are:
* COMICS
* EDUCATION
* FAMILY
* SPORTS
* BOOKS_AND_REFERENCE
* PRODUCTIVITY
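
One way to compare candidate genres side by side, rather than reading separate printouts, is to have the helper return its figures instead of printing them. The sketch below is a minimal illustration under stated assumptions: `sample_rows` is made-up stand-in data shaped like `gps_free` (name at index 0, category at index 1, cleaned installs at index 5), and `install_stats` is a hypothetical helper, not a function from this project.

```python
# Hypothetical stand-in rows shaped like gps_free:
# index 0 = app name, index 1 = category, index 5 = installs (already cleaned).
sample_rows = [
    ['Comic Reader', 'COMICS', '', '', '', '1000000'],
    ['Manga Hub', 'COMICS', '', '', '', '500000'],
    ['Score Tracker', 'SPORTS', '', '', '', '100000000'],
    ['Local League', 'SPORTS', '', '', '', '50000'],
    ['Flash Cards', 'EDUCATION', '', '', '', '5000000'],
]

THRESHOLD = 100000000


def install_stats(rows, category, threshold=THRESHOLD):
    """Return counts and averages for one category instead of printing them."""
    installs = [int(row[5]) for row in rows if row[1] == category]
    under = [n for n in installs if n < threshold]
    return {
        'apps': len(installs),
        'apps_under': len(under),
        'avg_all': sum(installs) / len(installs) if installs else None,
        'avg_under': sum(under) / len(under) if under else None,
    }


candidates = ['COMICS', 'EDUCATION', 'SPORTS']
stats = {category: install_stats(sample_rows, category) for category in candidates}
# Rank by average installs once the 100M+ apps are set aside.
ranked = sorted(candidates, key=lambda c: stats[c]['avg_under'] or 0, reverse=True)
for category in ranked:
    print(category, stats[category])
```

Returning a dictionary keeps the figures available for sorting or further comparison, which printed output does not allow.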

We will investigate these specific genres and discuss them below. As we have already covered ideas for potential app profiles using data from the App Store, we will refer back to those ideas where they carry over.


In [36]:
print(go_bold_red('COMICS')) 
sort_table_desc(installs_per_app('COMICS'))
print(go_bold_red('EDUCATION')) 
sort_table_desc(installs_per_app('EDUCATION'))
print(go_bold_red('FAMILY')) 
sort_table_desc(installs_per_app('FAMILY'))
print(go_bold_red('SPORTS')) 
sort_table_desc(installs_per_app('SPORTS'))
print(go_bold_red('BOOKS_AND_REFERENCE')) 
sort_table_desc(installs_per_app('BOOKS_AND_REFERENCE'))
print(go_bold_red('PRODUCTIVITY')) 
sort_table_desc(installs_per_app('PRODUCTIVITY'))

COMICS
LINE WEBTOON - Free Comics : 10,000,000
comico Popular Original Cartoon Updated Everyday Comico : 5,000,000
Perfect Viewer : 5,000,000
Narrator's Voice : 5,000,000
Comics : 5,000,000
漫咖 Comics - Manga,Novel and Stories : 1,000,000
pixiv comic - everyone's manga app : 1,000,000
WebComics : 1,000,000
Tapas – Comics, Novels, and Stories : 1,000,000
Memes Button : 1,000,000
Marvel Unlimited : 1,000,000
Manga Zero - Japanese cartoon and comic reader : 1,000,000
Manga Rock - Best Manga Reader : 1,000,000
Lezhin Comics - Daily Releases : 1,000,000
GANMA! - All original stories free of charge for all original comics : 1,000,000
DC Comics : 1,000,000
Röhrich Werner Soundboard : 500,000
Manga Master - Best manga & comic reader : 500,000
Manga Books : 500,000
Izneo, Read Manga, Comics & BD : 500,000
Buff Thun - Daily Free Webtoon / Comics / Web Fiction / Mini Game : 500,000
TappyToon Comics & Webtoons : 100,000
Q Avatar (Avatar Maker) : 100,000
Laftel - Watching and Announcing S

In [37]:
average_installs('COMICS')
average_installs('EDUCATION')
average_installs('FAMILY')
average_installs('SPORTS')
average_installs('BOOKS_AND_REFERENCE')
average_installs('PRODUCTIVITY')

COMICS
Number of apps : 55
Number of apps under 100 million : 55
Average installs for apps under 100 million : 817,657


EDUCATION
Number of apps : 103
Number of apps under 100 million : 103
Average installs for apps under 100 million : 1,833,495


FAMILY
Number of apps : 1675
Number of apps under 100 million : 1661
Average installs for apps under 100 million : 2,344,308
Average installs for apps at and over 100 million : 164,285,714


SPORTS
Number of apps : 301
Number of apps under 100 million : 299
Average installs for apps under 100 million : 2,994,083
Average installs for apps at and over 100 million : 100,000,000


BOOKS_AND_REFERENCE
Number of apps : 190
Number of apps under 100 million : 185
Average installs for apps under 100 million : 1,437,212
Average installs for apps at and over 10

The `COMICS` genre has quite a low average number of installs per app at `817,657`. While there might be some opportunity here due to less competition from bigger developers, the potential for profitability would be lower than in the other genres listed above. However, the app idea we suggested for the `Book` genre on the App Store could be replicated for the `COMICS` genre, with the app focusing instead on user-created comics and graphic novels. The same idea applies, with users networking with each other and sharing their created works. This type of app would also fit under the `Book` genre on the App Store.

The `EDUCATION` genre's apps all have fewer than 50 million installs. This means that, compared to other genres, we would be competing against less dominant apps. With an average of `1,833,495` installs per app, we could see better profitability than with the `LIBRARIES_AND_DEMO` and `COMICS` genres. As was the case with the App Store, we see the same types of apps here: game-style education apps, utility apps designed for educational institutions or environments, and strictly learning-based apps on topics such as programming, foreign languages and maths. As with the App Store, we can draw the same conclusion for this genre: we should look to develop a gamified educational app aimed at a younger audience or those with a lower overall level of education. This can help us stand out amongst the apps in this genre, as it appears to be filled with either 'brain training/cognitive games' or course-and-module structured learning apps. There is not really a blending of the two in this genre.

As the `Games` genre is oversaturated on the App Store, we decided not to investigate its equivalent on the Google Play Store either. For the same reason, we will not be looking to develop an app for the `FAMILY` genre, as it appears to consist mainly of game apps aimed at children. It also suffers from the same problem as the `GAME` genre: it contains lots of apps (`1675` in this case) that do not get much traction, so it would be harder for us to compete within this genre.

The `SPORTS` genre has far more apps than its App Store counterpart, and it also includes sports-focused gaming apps such as `PES CLUB MANAGER` and `EA SPORTS UFC`. While the App Store's genre had `69` apps, the Google Play Store's contains `301`, suggesting the potential for more competition on this store. There are only 2 apps within this genre that have 100 million or more installs, and when these are removed the average installs per app drops from `3,638,640` to `2,994,083`. Compared to other genres on the Google Play Store this is a good result, as we will have fewer big competitors to deal with. The App Store has fantasy league sports apps, and this is also the case on the Google Play Store; as suggested in our analysis of this genre for the App Store, we will need to innovate and add extra features to stand out amongst them. As noted before, because the Google Play Store categorises sports gaming apps under this genre, we will be competing with them for this market. However, by focusing the gaming aspects on party activities such as trivia and quizzes, our app can offer something different from what the core gaming apps offer in this genre.

Looking at the `BOOKS_AND_REFERENCE` genre, we can already see that there are considerably more apps (190) than in the `Book` (14 apps) and `Reference` (18 apps) genres combined on the App Store. This could mean that developing an app for this genre may prove more profitable on the App Store, by virtue of there being less competition, than on the Google Play Store. On the Google Play Store there also appear to be more dictionary apps and far more apps dedicated to a single book; quite a few are built around the Quran, for example. There are only 5 apps in this genre that have 100 million or more installs; however, there is a significant decrease in average installs per app when these apps are removed, from `8,767,811` to `1,437,212`. Here too we are seeing a few apps skew the popularity of the genre. As on the App Store, the most popular apps in `BOOKS_AND_REFERENCE` are libraries. As discussed previously, an app focusing on user-created content with social/forum functions can help us stand out in this genre.

The `PRODUCTIVITY` genre has the second-highest number of apps of the 6 genres we selected above for potential app profiles, at `345`. This, like most of the other genres we have looked at for further analysis, is considerably more than the `56` apps in the `Productivity` genre on the App Store. On the Google Play Store this genre is also much more popular, being the 5th most popular, whereas on the App Store it sits in the bottom half in terms of popularity. There are `22` apps in this genre on the Google Play Store with 100 million or more installs, which certainly skew its popularity: when these are removed, the average installs drops from `16,787,331` to `3,379,657`. However, not all of the apps that are popular on the Google Play Store are as popular on the App Store. This could be because user ratings on the App Store do not work well as a general proxy for popularity; perhaps users of `Productivity` apps are less likely to rate them than users of apps in other genres. In designing an app we would also face more competition on the Google Play Store, due to the higher number of apps in this genre compared with the App Store. We could face a similar issue in terms of profitability to the one discussed for the `BOOKS_AND_REFERENCE` genre. Much as on the App Store, this genre is dominated by 'office' apps, and developing a gamified office app would help it stand out.
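
The overall averages quoted above for `SPORTS` and `PRODUCTIVITY` can be cross-checked as a weighted average of the two groups, using only the counts and averages printed by `average_installs`. This is a small sketch; `combined_average` is an illustrative helper name, and since the printed averages are rounded the result may differ from the quoted figure by an install or so.

```python
def combined_average(n_under, avg_under, n_over, avg_over):
    """Weighted average of two groups, given each group's size and mean."""
    return (n_under * avg_under + n_over * avg_over) / (n_under + n_over)


# Counts and rounded averages as printed by average_installs earlier.
sports = combined_average(299, 2_994_083, 2, 100_000_000)
productivity = combined_average(323, 3_379_657, 22, 213_636_364)

print('SPORTS overall average: about {:,.0f}'.format(sports))
print('PRODUCTIVITY overall average: about {:,.0f}'.format(productivity))
```

The two `SPORTS` apps at 100 million installs lift the genre's average by roughly a fifth, while the 22 large `PRODUCTIVITY` apps lift theirs roughly fivefold, which is why the latter genre looks far more popular than its typical app.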

# Conclusion

We set out on this project to find a profitable app profile that can be used across both the Google Play Store and the App Store. As our main source of revenue would come from in-app advertisements, we looked at the data for free English apps.

What we found is that there are multiple avenues we can take when designing a profitable app. Genres that seem to have potential across both stores are productivity, books and reference (`Book` on the App Store), sports, and education. We wanted to develop apps for more practical purposes rather than entertainment or 'fun' due to the oversaturation of genres such as `GAME` and `Games`. This is reflected in the genres we decided to come up with app recommendations for.

From the genres we investigated further for potential app profiles, we noticed that the majority of their Google Play Store equivalents had more apps. We have noted that this could mean that any app we develop may end up performing better on the App Store due to less competition. However, as noted in our validation strategy, we will be building the app for the Google Play Store first, so we should be able to circumvent this issue.