# My Project

# Introduction

In this project I am acting as a data analyst for a company which builds apps for Android and iOS mobile users to be available on the Google Play and Apple Store.

The company's apps are free to download and play; the primary source of revenue is advertisements within the apps.  Therefore, revenue is determined by the number of users of the apps.

I aim to investigate two datasets taken from the above mentioned app stores to determine which types of apps are likely to attract the most users.

## Explore the Data Set

As mentioned in the above section, we are examining two data sets extracted from the iOS App Store and Android Google Play Store.

As of 2022 there are approximately 3.79 million apps in the App Store and the Google Play Store has approximately 3.30 million apps.

Source: __[Statista iOS](https://www.statista.com/statistics/268251/number-of-apps-in-the-itunes-app-store-since-2008/)__
__[Statista Android](https://www.statista.com/statistics/289418/number-of-available-apps-in-the-google-play-store-quarter/)__

I do not have the resources to collect and analyse data for around 7 million apps, so instead I will be using sample data sets for both stores.

Both datasets are publicly accessible and free to use.

**Android Google Play Store Data**
The Android data can be downloaded via __[this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)__.

The data set was collected in August 2018 and contains data on approximately 10,000 apps.

**iOS App Store Data**
The iOS data can be downloaded via __[this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)__.

This dataset was collected in July 2017 and contains data on approximately 7,000 apps available on the iOS Store.

It is important to note that both data sets are a few years old and may not accurately capture current trends.  They will still give an indication of which apps are likely to be the most popular and which app genres are not overly saturated.

### Create Function Explore Data

In the section below I create a function "explore_data" which will allow us to examine our two datasets.

The breakdown of the function is as follows:

 - Take as input the following parameters:
    - Dataset, for which we will be using our two list of lists (App Store data and Google Play data)
    - Start and End, two integers which act as indices to slice the dataset
    - Rows_and_columns, a boolean set to default False
 - Slices the data set
 - Iterates through each row in the slice and prints them, with an empty line after each row
 - If rows_and_columns is True the function will then print the number of rows and columns in the data set
 
This will present the data in a readable way and give us a clearer understanding of our dataset by showing a sample slice, the column headings and the number of rows and columns in the dataset.

In [1]:
# create explore_data function
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
# Open and read the Apple Store and Google Play files
# Using `with` closes each file once it has been read into a list of lists

from csv import reader

with open('C:/Users/alice/Documents/Data Quest/Data Quest Python Project/AppleStore.csv', encoding='utf8') as opened_file:
    apple_store = list(reader(opened_file))

with open('C:/Users/alice/Documents/Data Quest/Data Quest Python Project/googleplaystore.csv', encoding='utf8') as opened_file:
    google_play_store = list(reader(opened_file))

### Explore Data

In [2]:
# Use function explore_data on Apple Store Data
explore_data(apple_store, 0, 10, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061

The printed row and column counts show that the Apple Store dataset has 7,198 rows (including the header row) and 16 columns.

Apple Store column headers are `id`, `track_name`, `size_bytes`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, `user_rating`, `user_rating_ver`, `ver`, `cont_rating`, `prime_genre`, `sup_devices.num`, `ipadSc_urls.num`, `lang.num`, `vpp_lic`.

The Apple Store data can be found at the link below, which describes the definitions of the headings in more detail: __[Apple_Store_Definitions](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)__.

The column headings are broken down as follows:

"id" : App ID

"track_name": App Name

"size_bytes": Size (in Bytes)

"currency": Currency Type

"price": Price amount

"rating_count_tot": User Rating counts (for all versions)

"rating_count_ver": User Rating counts (for current version)

"user_rating" : Average User Rating value (for all versions)

"user_rating_ver": Average User Rating value (for current version)

"ver" : Latest version code

"cont_rating": Content Rating

"prime_genre": Primary Genre

"sup_devices.num": Number of supporting devices

"ipadSc_urls.num": Number of screenshots showed for display

"lang.num": Number of supported languages

"vpp_lic": Vpp Device Based Licensing Enabled

Looking through these, the columns which appear to be most useful are `track_name`, `price`, `user_rating` and `prime_genre`.

In [3]:
# Use function explore data on Google Play Store Data
explore_data(google_play_store, 0, 10, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

The above output shows us that the Google Play Store dataset holds 10,841 rows and 13 columns.

The 13 column headings are `App`, `Category`, `Rating`, `Reviews`, `Size`, `Installs`, `Type`, `Price`, `Content Rating`, `Genres`, `Last Updated`, `Current Ver` and `Android Ver`.

These column headings are more intuitive than the iOS data headings, however the difference between `Category` and `Genres` is unclear, as these seem synonymous.  A detailed look at the dataset can be found __[here](https://www.kaggle.com/datasets/lava18/google-play-store-apps)__.

Looking through these, the columns which seem the most useful are `App`, `Category`, `Price`, `Installs`, `Rating` and `Genres`.

# Data Cleaning

In the section below I will identify inaccurate data and correct or remove it, and I will identify duplicate data and remove the duplicates.

As the company only builds apps which are free to download and install and are designed for an English speaking audience, we will remove non-English apps and remove apps which are not free.

### Identifying and Removing Missing Data

This __[link](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)__ informs us that entry 10472 for the Google Play Store data set has a missing entry.  Because our list still includes the header row, that entry sits at index 10473.  In the code below I will print the row at that index to see if this is correct.

In [4]:
print(google_play_store[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


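Here we can see that the row holds only 12 values instead of 13.  The `Category` value is missing, so every later value has shifted one column to the left and the 'Rating' position now holds '19', which is not a valid rating.  We must remove rows with errors and will therefore use the del statement to remove the row.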

In [5]:
del(google_play_store[10473])

Now that the row with the missing entry has been removed, I will examine both datasets to determine if there are any other rows with missing values.

To do this I create function `identify_missing` which:
 - Takes the dataset as input
 - Loops over the rows in the dataset
 - If the length of the row does not match the length of the dataset heading the function will print the index and length of the row.

In [6]:
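def identify_missing(dataset):
    header_length = len(dataset[0])
    # start=1 keeps the printed index aligned with the full dataset (header at index 0);
    # enumerate also reports the correct index when identical rows appear more than once
    for index, row in enumerate(dataset[1:], start=1):
        if len(row) != header_length:
            print(index)
            print(len(row))

identify_missing(google_play_store)
identify_missing(apple_store)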

As the function yielded no output, I have determined there are no more rows missing data.

## Identifying and Removing Duplicate Values

The Google Play Store data set has duplicate values, while the Apple Store data set does not.  In the code below I will illustrate this.

The code first initialises two empty lists called `duplicate_apps` and `unique_apps`.  It then loops through the dataset and determines if the app name exists in the `unique_apps` list.  If it does, it will be added to the `duplicate_apps` list.  If not, it will be added to the `unique_apps` list.  The `duplicate_apps` list will therefore be a list of all duplicated app names.

The code then prints the number of duplicated apps in the list and slices the list to print the first 15 app names.

In [7]:
# Identify duplicate apps Google Play Store
duplicate_apps = []
unique_apps = []
for app in google_play_store:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('\033[1mNumber of duplicate apps Google Play:\033[0m', len(duplicate_apps))
print('\n')
print('\033[1mExamples of duplicate apps Google Play:\033[0m', duplicate_apps[:15])

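# Identify duplicate apps Apple Store
# Note: index 0 holds the `id` column in the App Store data, so duplicates are checked by app ID rather than by app name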
duplicate_apps = []
unique_apps = []
for app in apple_store:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('\n')
print('\033[1mNumber of duplicate apps Apple Store:\033[0m', len(duplicate_apps))
print('\n')
print('\033[1mExamples of duplicate apps Apple Store:\033[0m', duplicate_apps[:15])

Number of duplicate apps Google Play: 1181


Examples of duplicate apps Google Play: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Number of duplicate apps Apple Store: 0


Examples of duplicate apps Apple Store: []


The Google Play Store data has 1181 duplicate values, while the Apple Store data has 0.  We now need to remove the duplicate values from the Google Play Store data, as we do not want these to distort the data set.
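The same count can be sketched with `collections.Counter` from the standard library, which avoids scanning a list for every row.  This is only an illustration: the function name `count_duplicates` and the toy rows are hypothetical and are not part of the original analysis.

```python
from collections import Counter

# Hypothetical toy rows standing in for the Google Play data (app name at index 0).
sample_rows = [
    ['Box', 'BUSINESS'],
    ['Slack', 'BUSINESS'],
    ['Box', 'BUSINESS'],
    ['Box', 'BUSINESS'],
]

def count_duplicates(rows, name_index=0):
    """Return how many rows repeat a name that has already been seen."""
    name_counts = Counter(row[name_index] for row in rows)
    # Every occurrence after the first one counts as a duplicate.
    return sum(count - 1 for count in name_counts.values())

print(count_duplicates(sample_rows))
```

As in the loop above, an app that appears three times contributes two duplicates.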

The app 'Instagram', for example, has several duplicate entries, as shown below:

In [8]:
for app in google_play_store:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Here we can see that the 'Reviews' value differs between some of the rows.  The most recent data will have the highest number of reviews, therefore this is the row we want to keep.

Now we must create a new data set, in which we have removed the duplicate values.
The first step of this is to create a dictionary in which each key is a unique app name and each value is the highest number of reviews.

To create the below dictionary 'reviews_max' we used a for loop to loop through the Google Play Store data and extract for each app the name of the app and the number of reviews.  We converted the number of reviews into a float so that the values are compared as numbers rather than as strings.  We then used an if clause to determine if the app already exists in the 'reviews_max' dictionary and if the number of reviews in the current row is higher than the stored value.  If these criteria were met, the value stored for that app was updated.

We then used an elif clause (still inside the for loop) to determine if the app was not yet in the 'reviews_max' dictionary.  Because it is an elif, this check is only evaluated when the first condition is false.  This adds any new apps to the dictionary with the app name as the key and the number of reviews as the value.

In the identifying step we determined that the dataset has 1181 duplicate values, therefore the length of the new dictionary should be equal to the length of the Google Play Store data minus 1181.
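To illustrate why the review counts are converted to numbers before they are compared, here is a minimal sketch using two made-up values.  The variable names are hypothetical.

```python
# Values read from a CSV file are strings, and strings are compared character by character.
as_text = '9' < '10'
# Converting to float compares the values as numbers instead.
as_number = float('9') < float('10')

print(as_text)
print(as_number)
```

The first comparison is False because the character '9' sorts after the character '1', while the second is True.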

In [9]:
reviews_max = {}
for app in google_play_store[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('\033[1mActual length:\033[0m', len(reviews_max))
print('\033[1mExpected length:\033[0m', len(google_play_store[1:]) - 1181)
    

Actual length: 9659
Expected length: 9659


Now we have created a dictionary containing the highest number of reviews for each app, we must use this to create a new list without duplicate apps.

To do this we created two empty lists called 'android_clean' and 'already_added'.  We created a for loop which runs through each app in the Play Store data and isolates the app name and number of reviews.

If the number of reviews for the app matches the number of reviews in the 'reviews_max' dictionary and the name is not in the already added list we added the row to the 'android_clean' list and the app name to the 'already_added' list.

This leaves us with the 'android_clean' list, a new list of the Google Play Store data with duplicate values removed.
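The steps above can be sketched as a reusable function on toy rows.  This is an illustration under stated assumptions rather than the code used in this project: the function name `remove_duplicates` is hypothetical, the rows are shortened to four columns, the 'Slack' review count is made up, and a set replaces the 'already_added' list because set membership checks do not need to scan every stored name.

```python
def remove_duplicates(rows, reviews_max, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row holding the highest review count."""
    clean_rows = []
    seen_names = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # The second check matters when two rows share the same maximum review count.
        if reviews_max[name] == n_reviews and name not in seen_names:
            clean_rows.append(row)
            seen_names.add(name)
    return clean_rows

# Hypothetical, shortened rows: name, category, rating, reviews.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]
sample_reviews_max = {'Instagram': 66577446.0, 'Slack': 51507.0}

deduplicated = remove_duplicates(sample_rows, sample_reviews_max)
print(len(deduplicated))
```

Without the name check, both 'Instagram' rows with 66577446 reviews would be kept.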

In [10]:
android_clean = []
already_added = []

for app in google_play_store[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print('\033[1mActual length:\033[0m', len(android_clean))
print('\033[1mExpected length:\033[0m', len(google_play_store[1:]) - 1181)


Actual length: 9659
Expected length: 9659


### Identify and Remove Non English Apps

Now we will identify and remove apps which are not English.  We are not interested in these apps as the company develops apps for English speaking audiences.

To do this we will identify and remove apps containing non English characters.

In the section below we create the function 'Identify_English_Character' which returns the boolean True if all characters in a string are English or False if any character is not English.

Each string character is associated with a corresponding number.  We can use the built in ord() function to get the corresponding number, as shown below.

In [11]:
print(ord('a'))
print(ord('A'))
print(ord('3'))
print(ord('*'))
print(ord('爱'))
print(ord('é'))

97
65
51
42
29233
233


We know that according to the ASCII (American Standard Code for Information Interchange) system, the corresponding number range for English characters is 0 to 127, so we can use the ord function to identify strings with characters outside of this range.

The corresponding number for 'é' is 233, which falls outside of this range, so app names containing accented characters (such as acute accents or umlauts) will be flagged even though they use the Latin alphabet.  This also means that non-English language apps containing only standard English characters will still be included.

The below 'Identify_English_Character' function takes a string as input and iterates through each character in a for loop.  If any character has a corresponding number greater than 127 (a non-English character) the function returns False; otherwise it returns True.  Therefore, the function will return True for app names made up only of English characters.

We demonstrate the function by using it on the below app names:
 - 'Instagram'
 - '爱奇艺PPS -《欢乐颂2》电视剧热播'
 - 'Docs To Go™ Free Office Suite'
 - 'Instachat 😜'

In [12]:
def Identify_English_Character(string):
    for character in string:
        # Any character outside the ASCII range (0 to 127) marks the name as non-English
        if ord(character) > 127:
            return False

    return True
print(Identify_English_Character('Instagram'))
print(Identify_English_Character('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(Identify_English_Character('Docs To Go™ Free Office Suite'))
print(Identify_English_Character('Instachat 😜'))

True
False
False
False


In the examples above although the two final examples are English apps which we would want to include, they contain characters which fall outside the ASCII range as their corresponding numbers are above 127.  Please see below.

In [13]:
print(ord('™'))
print(ord('😜'))

8482
128540


In order to minimise data loss from excluding apps with such characters, we will only remove apps containing more than 3 non-English characters.  This is not perfect: non-English apps with 3 or fewer such characters will still be included, and English apps with more than 3 characters outside of the ASCII range will be excluded.  However, this will minimise the data loss and is functional for the purposes of this study.

We will now rewrite the function below and test it using the same app names.

In [14]:
def Identify_English_Character(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    if non_ascii > 3:
        return False
    else:
        return True
print(Identify_English_Character('Instagram'))
print(Identify_English_Character('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(Identify_English_Character('Docs To Go™ Free Office Suite'))
print(Identify_English_Character('Instachat 😜'))

True
False
True
True
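To see why the threshold produces these results, the sketch below counts the characters outside the ASCII range in each of the four names.  The helper name `count_non_ascii` is hypothetical and is not used elsewhere in this project.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range (0 to 127)."""
    return sum(1 for character in text if ord(character) > 127)

sample_names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

for name in sample_names:
    print(name, ':', count_non_ascii(name))
```

Only the second name has more than 3 such characters, so it is the only one rejected.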


We can see that the 'Identify_English_Character' function now works in the way we want it to.  In the code below we will apply this to our two datasets by looping through these and appending English apps to two empty lists which will be the updated lists for our data.

We will then use the 'explore_data' function we created previously to print a slice of the data and return the length of the dataset and number of columns.

In [15]:
android_English = []
iOS_English = []
for app in android_clean:
    name = app[0]
    if Identify_English_Character(name):
        android_English.append(app)
for app in apple_store[1:]:
    name = app[1]
    if Identify_English_Character(name):
        iOS_English.append(app)

explore_data(android_English, 0, 3, True)
print('\n')
explore_data(iOS_English, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

### Remove non-Free Apps

Previously we mentioned that we are only interested in free apps.  Therefore we need to exclude apps with a price that does not equal 0.

In the section below we will use an if statement within a for loop to loop through the data extracted above to isolate free apps in both data sets.  I will then count the number of apps in the new datasets.

In [16]:
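Android_Data = []
for app in android_English:
    app_type = app[6]  # index 6 is the `Type` column
    if 'Free' in app_type:
        Android_Data.append(app)
print(Android_Data[0:5])

iOS_Data = []
for app in iOS_English:
    price = app[4]  # index 4 is the `price` column
    if price == '0.0':  # exact match, so a price such as '10.0' is not treated as free
        iOS_Data.append(app)
print(iOS_Data[0:5])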
print('\n')
print('\033[1mLength Android: \033[0m',len(Android_Data))
print('\n')
print('\033[1mLength iOS: \033[0m',len(iOS_Data))




[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '

We are left with 8863 Google Play Store apps and 3222 Apple Store apps.
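One detail worth checking when filtering on price strings is the difference between a substring test and an exact match.  The sketch below uses hypothetical price values that are not taken from the dataset; whether the App Store data actually contains such a price is not checked here.

```python
# Hypothetical price strings in the same format as the `price` column.
sample_prices = ['0.0', '10.0', '0.99']

# A substring test also matches '10.0', because it contains the text '0.0'.
substring_matches = [price for price in sample_prices if '0.0' in price]
# An exact comparison only matches the free price.
exact_matches = [price for price in sample_prices if price == '0.0']

print(substring_matches)
print(exact_matches)
```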

## Summary

To summarise the section above, we first removed an incomplete entry and checked that no other row's length differed from the number of headings.  We then removed duplicate entries from our data by keeping, for each app, the entry with the highest number of reviews.  Then, we removed non-English apps from our data sets using the ord function to identify non-English characters.  Finally we removed non-free apps.  This leaves us with the 'Android_Data' and 'iOS_Data' data sets.

# Analysis

In order to minimise potential risks and keep overheads low, our validation strategy for an app idea has three steps:
 - Build and launch a basic Android version of the app to the Google Play Store
 - If the app is popular we will develop it further
 - If the app is profitable after 6 months we will build and launch an iOS version to the App Store

As we intend to make an app available on both Google Play and the App Store, we must identify apps which are successful on both.

To do this we will begin by determining the most common genres on both stores.

### Most Common Genres

Now that we have cleaned our data we are left with two datasets which will be the focus of the analysis, containing only apps which are free to download and are in English.  These datasets no longer include the header row, so every row contains data on an app:
 - `Android_Data`
 - `iOS_Data`
 
 Firstly, we will take a look at the original headings below.

In [17]:
print('\033[1mAndroid_Data Column Headings: \033[0m', google_play_store[0])
print('\n')
print('\033[1miOS_Data Column Headings: \033[0m', apple_store[0])

Android_Data Column Headings:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


iOS_Data Column Headings:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


As you may remember, the [Explore Data](#Explore-Data) section includes a breakdown of the Apple Store data headings.

To identify the most popular app type the following headings will be important:
- Apple Store Data
    - `prime_genre`

- Google Play Store Data
    - `Category`
    - `Genres`

### Genre Frequency

The easiest way to determine the proportion of genres within the datasets is using a frequency table.

In the section below we create the `freq_table` function which:
- Generates a frequency table of genres displaying percentages
- Takes as input the dataset and an index
- Initialises an empty frequency table `genre_counting`
- Calculates the length of the dataset
- Assigns the value 0 to a variable named `total`
- Loops through the dataset and for each row:
    - Adds 1 to the variable `total`
    - Assigns the value at the given index to a variable named `genre`.  For the `iOS_Data`, index 11 gives the `prime_genre`.  For the `Android_Data`, index 1 gives the `Category` and index 9 gives the `Genres`
    - If the genre already exists in the `genre_counting` table, increases the value at that key by 1
    - If the genre does not exist in the `genre_counting` table, creates a key named after the current genre and assigns the value 1 to it
- Outside of the for loop, initialises a new dictionary called `genre_counting_percentages`
- Loops through the `genre_counting` dictionary to calculate the percentage of apps which fall under each genre
- Assigns the percentage value to each genre key in `genre_counting_percentages`
- Returns `genre_counting_percentages`

We then create the function `display_table` which:
- Takes as input a dataset and index
- Uses the `freq_table` function with the dataset and index as input and stores the returned `genre_counting_percentages` as `table`
- Initialises an empty list `table_display`
- Loops through `table` to turn each key and value into a tuple and appends the tuple to `table_display`
- Uses the sorted function to order the percentages in descending order as `table_sorted`
- Loops through `table_sorted` and prints each entry

The second function makes use of the sorted function to sort in reverse order.  Applied directly to a dictionary, sorted only looks at the keys, whereas we want to sort by the values to display the most common genres first.  As a workaround we transformed the dictionary into a list of tuples, where each tuple contains a dictionary value followed by its key, so that sorting the tuples orders them by value.

In [18]:
# Create function freq_table
def freq_table(dataset, index):
    total_number_of_apps = len(dataset)
    genre_counting = {}
    total = 0
    for row in dataset:
        total += 1
        genre = row[index]
        if genre in genre_counting:
            genre_counting[genre] += 1
        else:
            genre_counting[genre] = 1
            
    genre_counting_percentages = {}
    for key in genre_counting:
        percentage = (genre_counting[key] / total) * 100
        genre_counting_percentages[key] = percentage
        
    return genre_counting_percentages

# Create function display_table
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

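print('\033[1miOS_Data prime_genre Percentages\033[0m')
# display_table prints its rows and returns None, so wrapping it in print() also prints "None" at the end
print(display_table(iOS_Data, 11))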

iOS_Data prime_genre Percentages
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665
None
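As an alternative to the tuple workaround, the built-in sorted function accepts a `key` argument that selects what to sort on.  The sketch below is an illustration only: the dictionary holds three percentages rounded from the table above, and the variable names are hypothetical.

```python
# Three percentages rounded from the App Store table, used here as sample data.
genre_percentages = {'Entertainment': 7.9, 'Games': 58.2, 'Education': 3.7}

# dict.items() yields (key, value) pairs; the key function picks the value to sort on.
sorted_genres = sorted(genre_percentages.items(), key=lambda item: item[1], reverse=True)

for genre, percentage in sorted_genres:
    print(genre, ':', percentage)
```

This keeps each pair in (genre, percentage) order, so no index swapping is needed when printing.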


In [19]:
print('\033[1mAndroid_Data Category Percentages\033[0m')
# display_table returns None, so the outer print() adds "None" after the table
print(display_table(Android_Data, 1))

Android_Data Category Percentages
FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTI

In [20]:
print('\033[1mAndroid_Data Genres Percentages\033[0m')
(display_table(Android_Data, 9))

Android_Data Genres Percentages
Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.936477490691639


### What does this tell us?
***App Store***
The most common genre is Games at 58%, over half of the apps on the App Store.  Entertainment comes in second at just under 8%, followed by Photo & Video at almost 5%.

The most common apps on the App Store are for entertainment purposes (Games, Entertainment, Photo & Video), while there are far fewer practical apps (Weather, News, Finance, Reference).  Although entertainment apps are more frequent, this does not show us whether they have a larger number of users.

***Google Play Store***
Unlike the App Store, the Google Play store seems to have more apps for practical purposes than entertainment.

The most popular category is Family (19%), although it is unclear what this category covers.  This is followed by Game at almost 10%, Tools (8%) and Business (almost 5%).

This conclusion is supported by the genres frequency table, which shows a greater percentage dedicated to tools (8%) than entertainment (6%).

Interestingly, the Google Play Store (unlike the App Store) does not have one largely dominant category and seems to be more balanced.

The conclusions we draw here are based on an analysis of free English apps, so these cannot be extrapolated to paid or non-English apps.


### Average Number of Installs App Store

The conclusion above identifies which genres have the most apps in them; however, we are interested in which apps will be the most popular.  To do this we need to determine which genres have the highest average number of installs.

For the Android data set we can use the `Installs` column, which will give us the number of installs for each app.

The iOS data does not have this information, so we will need to estimate it.  We will use the `rating_count_tot` column as an approximate value, on the assumption that the more ratings an app has, the more users it is likely to have.

To begin we will calculate the average number of user ratings per `prime_genre` in the iOS dataset.

In the below section we will isolate the apps for each genre, calculate the total user ratings of apps in that genre and divide this by the number of apps in the genre to give us the average number of ratings for each genre.

In [21]:
# Create frequency table of iOS_Data
genre_iOS = freq_table(iOS_Data, -5)

# Use nested loop to calculate the average number of ratings for each genre
ratings_table = []
for genre in genre_iOS:
    total = 0
    len_genre = 0
    for app in iOS_Data:
        genre_app = app[-5]
        if genre_app == genre:
            number_ratings = float(app[5])
            total += number_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    table_pair = (avg_n_ratings, genre)
    ratings_table.append(table_pair)
 
# Use sorted function to display average number of ratings per genre in descending order
table_sorted = sorted(ratings_table, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])
    

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


The above table shows us that the app genre with the highest average number of ratings is `Navigation` (86,090), followed by `Reference` (74,942) and then `Social Networking` (71,548).  This would imply that an app in one of these fields would be the most popular in the App Store.
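As an aside, the same per-genre average can be computed in a single pass over the data rather than with a nested loop. The sketch below is only an illustration: `average_by_genre` and `sample_rows` are hypothetical names and data, not part of the notebook, and the column positions are passed in as parameters.

```python
from collections import defaultdict

def average_by_genre(rows, genre_index, value_index):
    """Return {genre: mean value} using a single pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        genre = row[genre_index]
        totals[genre] += float(row[value_index])
        counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}

# Hypothetical miniature dataset: [id, name, rating count, genre]
sample_rows = [
    ['1', 'App A', '100', 'Games'],
    ['2', 'App B', '300', 'Games'],
    ['3', 'App C', '50', 'Weather'],
]

averages = average_by_genre(sample_rows, genre_index=3, value_index=2)
# Display genres in descending order of their average
for genre, avg in sorted(averages.items(), key=lambda pair: pair[1], reverse=True):
    print(genre, ':', avg)
```

With the sample rows this prints `Games : 200.0` followed by `Weather : 50.0`.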

Let us explore the first of these categories, `Navigation`.

In [22]:
# Print all Navigation apps
for app in iOS_Data:
    genre = app[-5]
    if genre == 'Navigation':
        print(app[1], ' : ', app[5])
        

Waze - GPS Navigation, Maps & Real-time Traffic  :  345046
Google Maps - Navigation & Transit  :  154911
Geocaching®  :  12811
CoPilot GPS – Car Navigation & Offline Maps  :  3582
ImmobilienScout24: Real Estate Search in Germany  :  187
Railway Route Search  :  5


We can see here that the average number of user ratings is skewed by a small number of apps which have a very large number of user ratings, in this case `Waze` and `Google Maps`.

We want to identify an app genre which will be successful, which will be difficult to achieve in a genre which is monopolized by a small number of highly successful apps.
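One way to see how strongly the two largest apps pull the average up is to compare the mean with the median. This is a small sketch using the six `Navigation` rating counts printed above; using the median here is my own suggestion and not something the analysis relies on elsewhere.

```python
from statistics import mean, median

# Rating counts for the six Navigation apps listed in the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean   :', round(mean(navigation_ratings), 2))
print('Median :', median(navigation_ratings))
```

The mean is 86,090.33, matching the table, while the median is only 8,196.5, so a typical `Navigation` app has far fewer ratings than the average suggests.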

Studying the next two popular genres shows a similar phenomenon.


In [23]:
# Print all Reference apps
for app in iOS_Data:
    genre = app[-5]
    if genre == 'Reference':
        print(app[1], ' : ', app[5])

Bible  :  985920
Dictionary.com Dictionary & Thesaurus  :  200047
Dictionary.com Dictionary & Thesaurus for iPad  :  54175
Google Translate  :  26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran  :  18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition  :  17588
Merriam-Webster Dictionary  :  16849
Night Sky  :  12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE)  :  8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools  :  4693
GUNS MODS for Minecraft PC Edition - Mods Tools  :  1497
Guides for Pokémon GO - Pokemon GO News and Cheats  :  826
WWDC  :  762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free  :  718
VPN Express  :  14
Real Bike Traffic Rider Virtual Reality Glasses  :  8
教えて!goo  :  0
Jishokun-Japanese English Dictionary & Translator  :  0


In [24]:
# Print all Social Networking apps
for app in iOS_Data:
    genre = app[-5]
    if genre == 'Social Networking':
        print(app[1], ' : ', app[5])

Facebook  :  2974676
Pinterest  :  1061624
Skype for iPhone  :  373519
Messenger  :  351466
Tumblr  :  334293
WhatsApp Messenger  :  287589
Kik  :  260965
ooVoo – Free Video Call, Text and Voice  :  177501
TextNow - Unlimited Text + Calls  :  164963
Viber Messenger – Text & Call  :  164249
Followers - Social Analytics For Instagram  :  112778
MeetMe - Chat and Meet New People  :  97072
We Heart It - Fashion, wallpapers, quotes, tattoos  :  90414
InsTrack for Instagram - Analytics Plus More  :  85535
Tango - Free Video Call, Voice and Chat  :  75412
LinkedIn  :  71856
Match™ - #1 Dating App.  :  60659
Skype for iPad  :  60163
POF - Best Dating App for Conversations  :  52642
Timehop  :  49510
Find My Family, Friends & iPhone - Life360 Locator  :  43877
Whisper - Share, Express, Meet  :  39819
Hangouts  :  36404
LINE PLAY - Your Avatar World  :  34677
WeChat  :  34584
Badoo - Meet New People, Chat, Socialize.  :  34428
Followers + for Instagram - Follower Analytics  :  28633
GroupMe  :  

In [25]:
# Print all Book apps
for app in iOS_Data:
    genre = app[-5]
    if genre == 'Book':
        print(app[1], ' : ', app[5])



Kindle – Read eBooks, Magazines & Textbooks  :  252076
Audible – audio books, original series & podcasts  :  105274
Color Therapy Adult Coloring Book for Adults  :  84062
OverDrive – Library eBooks and Audiobooks  :  65450
HOOKED - Chat Stories  :  47829
BookShout: Read eBooks & Track Your Reading Goals  :  879
Dr. Seuss Treasury — 50 best kids books  :  451
Green Riding Hood  :  392
Weirdwood Manor  :  197
MangaZERO - comic reader  :  9
ikouhoushi  :  0
MangaTiara - love comic reader  :  0
謎解き  :  0
謎解き2016  :  0


As shown above, the `Book` genre has a fairly high average number of user ratings and is not overly saturated.  Based on this I would recommend developing a `Book` app.

The apps `Kindle` and `Audible` are hugely popular, so an e-reader app would struggle to compete with these.

There are a small number of apps developed from popular children's books, such as `Dr. Seuss Treasury` and `Green Riding Hood` so this could be one avenue to explore.

### Average Number of Installs Google Play Store

In this section we will look at the `Android_Data` to ascertain the most downloaded categories in the Google Play Store.

To do this we will be using the `Category` column at index 1 to define our categories and the `Installs` column to determine the average number of installs.  As in the [Average Number of Installs App Store](#Average-Number-of-Installs-App-Store) section above, we will create a frequency table, calculate the average number of installs for each category and then display the results in a table.

This presents us with a small number of problems:
 - The values at `Installs` are loosely defined, for example 100,000,000+ and 100,000+.  We do not know where in the range these values fall.
     - To counter this we will assume that an app with 100,000,000+ installs has 100,000,000 installs.
 - The values at `Installs` are currently strings, as they contain a plus sign and commas.
     - We will use the `replace` method to remove these from the install values and the built-in `float` function to convert them from string to float values.
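The cleaning step described in the list above can be sketched as a small helper. The name `clean_installs` is hypothetical (the notebook performs the same steps inline), and the result is the lower bound implied by the `+`, in line with the assumption stated above.

```python
def clean_installs(raw):
    """Convert an Installs string such as '100,000+' to a float lower bound."""
    return float(raw.replace('+', '').replace(',', ''))

print(clean_installs('100,000,000+'))  # 100000000.0
print(clean_installs('100,000+'))      # 100000.0
```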

In [26]:
# Create Android_table using freq_table function
Android_table = freq_table(Android_Data,1)
genre_list = []
for genre in Android_table:
    total = 0
    len_genre = 0
    for item in Android_Data:
        genre_app = item[1]
        if genre_app == genre:
            str_installs = item[5]
            plus_removed = str_installs.replace('+', '')
            comma_removed = plus_removed.replace(',', '')
            num_installs = float(comma_removed)
            total += num_installs
            len_genre += 1
    avg_installs = (total / len_genre)
    
    # Initiate list of average installs and genre and append this to genre_list
    l1 = [avg_installs, genre]
    genre_list.append(l1)

# Sort genre_list into descending order using sorted function
table_sorted = sorted(genre_list, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

We can see here that the categories in the Google Play Store with the highest average number of installs are `COMMUNICATION` (38,456,119), `VIDEO_PLAYERS` (24,727,872), `SOCIAL` (23,253,652) and `PHOTOGRAPHY` (17,840,110).  At first glance, an app in one of these genres would be popular.

As we explored with the iOS data analysis, a small number of popular apps within a genre can distort the data and indicate a saturated market.

We can see this below with the communication genre, where apps such as `WhatsApp Messenger` (1,000,000,000+) are skewing the results.

In [27]:
# Display the three most popular COMMUNICATION apps
communication_table = []
for genre in Android_Data:
    category = genre[1]
    name = genre[0]
    if category == 'COMMUNICATION':
        str_installs = genre[5]
        plus_removed = str_installs.replace('+', '')
        comma_removed = plus_removed.replace(',', '')
        num_installs = float(comma_removed)
        l1 = [num_installs, name]
        communication_table.append(l1)
table_sorted = sorted(communication_table, reverse = True)
new_table = table_sorted[:3]
for entry in new_table:
    print(entry[1], ':', entry[0])

WhatsApp Messenger : 1000000000.0
Skype - free IM & video calls : 1000000000.0
Messenger – Text and Video Chat for Free : 1000000000.0


If we look at this more closely we can see the distribution.

I want to explore the distribution of number of installs amongst five popular categories:
 - `COMMUNICATION`
 - `VIDEO_PLAYERS`
 - `SOCIAL`
 - `PHOTOGRAPHY`
 - `BOOKS_AND_REFERENCE`

We are studying the `BOOKS_AND_REFERENCE` genre as we identified this as a viable option when looking at the App Store Data and this has a high average number of installs in the Play Store Data (8,767,811).
 
To explore these I will create a function `genre_frequency_table` which takes one input, a numeric code identifying the category, and generates a frequency table showing the distribution of installs.

Each of the five categories is represented by a number from 1 to 5 (`COMMUNICATION` = 1, `VIDEO_PLAYERS` = 2, `SOCIAL` = 3, `PHOTOGRAPHY` = 4, `BOOKS_AND_REFERENCE` = 5), and each row in `Android_Data` is matched to the number for its category.

If the row's number is equal to the input, the function isolates the row, uses the `replace` method and `float` function to convert the installs to a float and appends this to the `genre_table`.

The function then takes the `genre_table` and uses `if` and `elif` statements to build a frequency table displaying the distribution of installs.

The frequency table uses ten intervals of 10,000,000 installs, plus a final interval for apps with over 100,000,000 installs.
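The interval lookup described above can also be written without a long `if`/`elif` chain. The following is a minimal sketch, assuming each interval includes its upper edge (so exactly 10,000,000 installs falls in the first interval); `EDGES`, `interval_index` and `install_distribution` are hypothetical names, and the example values are made up.

```python
from bisect import bisect_left

# Upper edges of the first ten intervals: 10M, 20M, ..., 100M
EDGES = [step * 10_000_000 for step in range(1, 11)]

def interval_index(installs):
    """Return 0-9 for the ten 10M-wide intervals, or 10 for 'over 100M'."""
    return bisect_left(EDGES, installs)

def install_distribution(install_counts):
    """Count how many values fall into each of the eleven intervals."""
    counts = [0] * (len(EDGES) + 1)
    for installs in install_counts:
        counts[interval_index(installs)] += 1
    return counts

# Hypothetical install counts, chosen to land in different intervals
example = [5_000, 10_000_000, 50_000_000, 100_000_000, 1_000_000_000]
print(install_distribution(example))
```

For the example values this prints `[2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]`: two apps in the first interval, one in the 40M to 50M interval, one in the 90M to 100M interval and one over 100M.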

In [28]:
# Create function genre_frequency_table
def genre_frequency_table(variable):
    genre_table = []
    # Create genre_frequency table and define key values
    genre_frequency = {'0 - 10,000,000+':0, '10,000,000+ - 20,000,000+':0, '20,000,000+ - 30,000,000+':0, '30,000,000+ - 40,000,000+':0, '40,000,000+ - 50,000,000+':0, '50,000,000+ - 60,000,000+':0, '60,000,000+ - 70,000,000+':0, '70,000,000+ - 80,000,000+':0, '80,000,000+ - 90,000,000+':0, '90,000,000+ - 100,000,000+':0, 'over 100,000,000+':0}
    # Numeric codes representing the five categories
    category_codes = {'COMMUNICATION':1, 'VIDEO_PLAYERS':2, 'SOCIAL':3, 'PHOTOGRAPHY':4, 'BOOKS_AND_REFERENCE':5}
    # Loop through Android_Data once and isolate the rows whose category code matches the input
    for item in Android_Data:
        if category_codes.get(item[1]) == variable:
            str_installs = item[5]
            plus_removed = str_installs.replace('+', '')
            comma_removed = plus_removed.replace(',', '')
            num_installs = float(comma_removed)
            l1 = [item[0], num_installs]
            genre_table.append(l1)
    # Use genre_table to assign values to keys within genre_frequency
    for item in genre_table:
        total_installs = float(item[1])
        if total_installs <= 10000000:
            genre_frequency['0 - 10,000,000+'] += 1
        elif total_installs <= 20000000:
            genre_frequency['10,000,000+ - 20,000,000+'] += 1
        elif total_installs <= 30000000:
            genre_frequency['20,000,000+ - 30,000,000+'] += 1
        elif total_installs <= 40000000:
            genre_frequency['30,000,000+ - 40,000,000+'] += 1
        elif total_installs <= 50000000:
            genre_frequency['40,000,000+ - 50,000,000+'] += 1
        elif total_installs <= 60000000:
            genre_frequency['50,000,000+ - 60,000,000+'] += 1
        elif total_installs <= 70000000:
            genre_frequency['60,000,000+ - 70,000,000+'] += 1
        elif total_installs <= 80000000:
            genre_frequency['70,000,000+ - 80,000,000+'] += 1
        elif total_installs <= 90000000:
            genre_frequency['80,000,000+ - 90,000,000+'] += 1
        elif total_installs <= 100000000:
            genre_frequency['90,000,000+ - 100,000,000+'] += 1
        else:
            genre_frequency['over 100,000,000+'] += 1
    print(genre_frequency)

# Use genre_frequency_table function to generate frequency table showing distribution of user installs for Communication Apps
print('\033[1m Communication Apps Distribution \033[0m')
genre_frequency_table(1)
print('\n')
# Use genre_frequency_table function to generate frequency table showing distribution of user installs for Video Player Apps
print('\033[1m Video Player Apps Distribution \033[0m')
genre_frequency_table(2)
print('\n')
# Use genre_frequency_table function to generate frequency table showing distribution of user installs for Social Apps
print('\033[1m Social Apps Distribution \033[0m')
genre_frequency_table(3)
print('\n')
# Use genre_frequency_table function to generate frequency table showing distribution of user installs for Photography Apps
print('\033[1m Photography Apps Distribution \033[0m')
genre_frequency_table(4)
print('\n')
# Use genre_frequency_table function to generate frequency table showing distribution of user installs for Books and Reference Apps
print('\033[1m Books and Reference Apps Distribution \033[0m')
genre_frequency_table(5)

Communication Apps Distribution
{'0 - 10,000,000+': 253, '10,000,000+ - 20,000,000+': 0, '20,000,000+ - 30,000,000+': 0, '30,000,000+ - 40,000,000+': 0, '40,000,000+ - 50,000,000+': 7, '50,000,000+ - 60,000,000+': 0, '60,000,000+ - 70,000,000+': 0, '70,000,000+ - 80,000,000+': 0, '80,000,000+ - 90,000,000+': 0, '90,000,000+ - 100,000,000+': 16, 'over 100,000,000+': 11}


Video Player Apps Distribution
{'0 - 10,000,000+': 140, '10,000,000+ - 20,000,000+': 0, '20,000,000+ - 30,000,000+': 0, '30,000,000+ - 40,000,000+': 0, '40,000,000+ - 50,000,000+': 10, '50,000,000+ - 60,000,000+': 0, '60,000,000+ - 70,000,000+': 0, '70,000,000+ - 80,000,000+': 0, '80,000,000+ - 90,000,000+': 0, '90,000,000+ - 100,000,000+': 6, 'over 100,000,000+': 3}


Social Apps Distribution
{'0 - 10,000,000+': 218, '10,000,000+ - 20,000,000+': 0, '20,000,000+ - 30,000,000+': 0, '30,000,000+ - 40,000,000+': 0, '40,000,000+ - 50,000,000+': 5, '50,000,000+ - 60,000,000+': 0, '6

We can see clearly here that apps with over 100 million installs are skewing the data in all five categories.  To remove this issue we will explore the average number of installs per category without apps with over 100 million installs.

To do this, we first create an empty list `under_100m_avg`.  We then loop through `Android_Data` and use the `replace` method and `float` function as earlier to convert the number of installs to a float.

If the app genre is `COMMUNICATION` and the app has fewer than 100 million installs we append the value to the `under_100m_avg` list.  We then divide the sum of `under_100m_avg` by its length to determine the average installs for `COMMUNICATION`.

We then repeat this for the other four categories.
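Because the same steps are repeated for each category, they could also be wrapped in one function so that the cut-off is written only once. This is a sketch under assumptions: `average_installs_under` and `sample_rows` are hypothetical, and the default column positions (category at index 1, installs at index 5) mirror the indexes used in this notebook.

```python
def average_installs_under(rows, category, cap, category_index=1, installs_index=5):
    """Mean installs for apps in `category` with strictly fewer than `cap` installs."""
    values = []
    for row in rows:
        if row[category_index] != category:
            continue
        installs = float(row[installs_index].replace('+', '').replace(',', ''))
        if installs < cap:
            values.append(installs)
    # Return 0.0 for an empty selection instead of dividing by zero
    return sum(values) / len(values) if values else 0.0

# Hypothetical rows shaped like Android_Data (name at 0, Category at 1, Installs at 5)
sample_rows = [
    ['App A', 'SOCIAL', '', '', '', '1,000+'],
    ['App B', 'SOCIAL', '', '', '', '3,000+'],
    ['App C', 'SOCIAL', '', '', '', '1,000,000,000+'],
    ['App D', 'PHOTOGRAPHY', '', '', '', '500+'],
]

for category in ['SOCIAL', 'PHOTOGRAPHY']:
    print(category, ':', average_installs_under(sample_rows, category, cap=100_000_000))
```

Passing the cap as a single argument means every category is filtered with the same threshold.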

In [29]:
under_100m_avg = []
for genre in Android_Data:
    str_installs = genre[5]
    plus_removed = str_installs.replace('+', '')
    comma_removed = plus_removed.replace(',', '')
    num_installs = float(comma_removed)
    if (genre[1] == 'COMMUNICATION') and (num_installs < 100000000):
        under_100m_avg.append(num_installs)
avg_installs = sum(under_100m_avg) / len(under_100m_avg)
print('\033[1m Communication Apps Under 100 Million Average Installs \033[0m')
print(round(avg_installs,2))

print('\n')

under_100m_avg = []
for genre in Android_Data:
    str_installs = genre[5]
    plus_removed = str_installs.replace('+', '')
    comma_removed = plus_removed.replace(',', '')
    num_installs = float(comma_removed)
    if (genre[1] == 'VIDEO_PLAYERS') and (num_installs < 100000000):
        under_100m_avg.append(num_installs)
avg_installs = sum(under_100m_avg) / len(under_100m_avg)
print('\033[1m Video Player Apps Under 100 Million Average Installs \033[0m')
print(round(avg_installs,2))

print('\n')

under_100m_avg = []
for genre in Android_Data:
    str_installs = genre[5]
    plus_removed = str_installs.replace('+', '')
    comma_removed = plus_removed.replace(',', '')
    num_installs = float(comma_removed)
    if (genre[1] == 'SOCIAL') and (num_installs < 100000000):
        under_100m_avg.append(num_installs)
avg_installs = sum(under_100m_avg) / len(under_100m_avg)
print('\033[1m Social Apps Under 100 Million Average Installs \033[0m')
print(round(avg_installs,2))

print('\n')

under_100m_avg = []
for genre in Android_Data:
    str_installs = genre[5]
    plus_removed = str_installs.replace('+', '')
    comma_removed = plus_removed.replace(',', '')
    num_installs = float(comma_removed)
    if (genre[1] == 'PHOTOGRAPHY') and (num_installs < 100000000):
        under_100m_avg.append(num_installs)
avg_installs = sum(under_100m_avg) / len(under_100m_avg)
print('\033[1m Photography Apps Under 100 Million Average Installs \033[0m')
print(round(avg_installs,2))

print('\n')

under_100m_avg = []
for genre in Android_Data:
    str_installs = genre[5]
    plus_removed = str_installs.replace('+', '')
    comma_removed = plus_removed.replace(',', '')
    num_installs = float(comma_removed)
    if (genre[1] == 'BOOKS_AND_REFERENCE') and (num_installs < 10000000):
        under_100m_avg.append(num_installs)
avg_installs = sum(under_100m_avg) / len(under_100m_avg)
print('\033[1m Books and Reference Apps Under 100 Million Average Installs \033[0m')
print(round(avg_installs,2))

Communication Apps Under 100 Million Average Installs
3603485.39


Video Player Apps Under 100 Million Average Installs
5544878.13


Social Apps Under 100 Million Average Installs
3084582.52


Photography Apps Under 100 Million Average Installs
7670532.29


Books and Reference Apps Under 100 Million Average Installs
457134.1


### Comparison

If we compare the average installs for the whole `Android_Data` dataset for our five genres with the average installs for apps with under 100 million installs, the difference is striking:

**Communication Apps**
 - 38,456,119 original average installs vs 3,603,485 average installs for apps with under 100 million installs

**Video Player Apps**
 - 24,727,872 original average installs vs 5,544,878 average installs for apps with under 100 million installs

**Social Apps**
 - 23,253,652 original average installs vs 3,084,582 average installs for apps with under 100 million installs

**Photography**
 - 17,840,110 original average installs vs 7,670,532 average installs for apps with under 100 million installs

**Books and Reference**
 - 8,767,811 original average installs vs 457,134 average installs.  Note that the code above compares this category against `10000000` (10 million) rather than `100000000`, so this figure covers apps with under 10 million installs and is not directly comparable with the other four.

The above indicates that both the `PHOTOGRAPHY` and `BOOKS_AND_REFERENCE` genres could be popular, as both ranked lower in original average installs among the popular categories above and so appear less dominated by the very largest apps.  The `PHOTOGRAPHY` genre has the highest average number of installs once the largest apps are excluded, so if we were intending to develop an app purely for the Google Play Store a photography app would be advisable.  However, the equivalent genre in the App Store, `Photo & Video`, had a lower average number of user ratings (28,441) than the `Book` genre (39,758).

In the code below we examine the apps in the photo and video genre of the `iOS_Data`.

In [30]:
for app in iOS_Data:
    genre = app[-5]
    if genre == 'Photo & Video':
        print(app[1], ' : ', app[5])

Instagram  :  2161558
Snapchat  :  323905
YouTube - Watch Videos, Music, and Live Streams  :  278166
Pic Collage - Picture Editor & Photo Collage Maker  :  123433
Funimate video editor: add cool effects to videos  :  123268
musical.ly - your video social network  :  105429
Photo Collage Maker & Photo Editor - Live Collage  :  93781
Vine Camera  :  90355
Google Photos - unlimited photo and video storage  :  88742
Flipagram  :  79905
Mixgram - Picture Collage Maker - Pic Photo Editor  :  54282
Shutterfly: Prints, Photo Books, Cards Made Easy  :  51427
Pic Jointer – Photo Collage, Camera Effects Editor  :  51330
Color Pop Effects - Photo Editor & Picture Editing  :  45320
Photo Grid - photo collage maker & photo editor  :  40531
iSwap Faces LITE  :  39722
MOLDIV - Photo Editor, Collage & Beauty Camera  :  39501
Photo Editor by Aviary  :  39501
Photo Lab: Picture Editor, effects & fun face app  :  34585
Rookie Cam - Photo Editor & Filter Camera  :  33921
FotoRus -Camera & Photo Editor & Pi

As we can see in the above output, the `Photo & Video` genre in the App Store is dominated by a large number of popular apps which skew the data.  In the [Average Number of Installs App Store](#Average-Number-of-Installs-App-Store) section we showed that in the `iOS_Data` `Book` genre, Kindle and Audible are two highly popular apps, but the genre as a whole is not dominated by already popular apps.

We will now take a look at the apps which make up the `BOOKS_AND_REFERENCE` genre within the Google Play Store.

We want to initially create a list of `Installs` and `App` name.  We then use `sorted` to show the most installed apps first.

As we want to sort by most installed apps we need `Installs` to be a float.
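To illustrate why the conversion matters, the short sketch below sorts three made-up `Installs` strings first as text and then by their numeric value. The helper `to_number` is a hypothetical name for the same `replace` and `float` steps used throughout this notebook.

```python
raw_installs = ['5,000,000+', '10,000,000+', '1,000,000,000+']

# Sorting the raw strings compares them character by character,
# so '5,000,000+' ranks above '1,000,000,000+'
print(sorted(raw_installs, reverse=True))

def to_number(raw):
    """Strip '+' and ',' and convert the remainder to a float."""
    return float(raw.replace('+', '').replace(',', ''))

# Sorting on the numeric value gives the intended order
print(sorted(raw_installs, key=to_number, reverse=True))
```

The first call returns `['5,000,000+', '10,000,000+', '1,000,000,000+']`, while the second returns `['1,000,000,000+', '10,000,000+', '5,000,000+']`.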

In [31]:
# Populate list books_and_reference with BOOKS_AND_REFERENCE app names and number of installs
books_and_reference = []
l1 = []
for genre in Android_Data:
    category = genre[1]
    name = genre[0]
    installs = genre[5]
    if category == 'BOOKS_AND_REFERENCE':
        plus_removed = installs.replace('+', '')
        comma_removed = plus_removed.replace(',', '')
        num_installs = float(comma_removed)
        l1 = [round(num_installs), name]
        books_and_reference.append(l1)

# Sort books_and_reference table by number of installs
table_sorted = sorted(books_and_reference, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

# Count the number of Dictionary apps in the BOOKS_AND_REFERENCE category
print('\033[1m Total Number of Dictionary Apps \033[0m')
count = 0
for entry in table_sorted:
    if "Dictionary" in entry[1]:
        count += 1
print(count)


Google Play Books : 1000000000
Wattpad 📖 Free Books : 100000000
Bible : 100000000
Audiobooks from Audible : 100000000
Amazon Kindle : 100000000
Wikipedia : 10000000
Spanish English Translator : 10000000
Quran for Android : 10000000
Oxford Dictionary of English : Free : 10000000
NOOK: Read eBooks & Magazines : 10000000
Moon+ Reader : 10000000
JW Library : 10000000
HTC Help : 10000000
FBReader: Favorite Book Reader : 10000000
English Hindi Dictionary : 10000000
English Dictionary - Offline : 10000000
Dictionary.com: Find Definitions for English Words : 10000000
Dictionary - Merriam-Webster : 10000000
Dictionary : 10000000
Cool Reader : 10000000
Aldiko Book Reader : 10000000
Al-Quran (Free) : 10000000
Al'Quran Bahasa Indonesia : 10000000
Al Quran Indonesia : 10000000
Read books online : 5000000
English to Hindi Dictionary : 5000000
Ebook Reader : 5000000
Dictionary - WordWeb : 5000000
Bible KJV : 5000000
Ancestry : 5000000
AlReader -any text book reader : 5000000
Al Quran : EAlim - Transl

Looking at the above output, we can see that the `BOOKS_AND_REFERENCE` genre has a large number of popular apps dedicated to reading ebooks, such as `Google Play Books` and `Wattpad`.  There are also several different versions of religious texts, including `Bible` and `Quran for Android`.  The reference section of this genre seems to be saturated too, with at least 18 apps which contain the word "Dictionary".  One possible avenue to explore could be an interactive version of a popular children's book.

Below, we now explore the age categories.  First we create a list of age categories to identify which best suits a children's app and then we create a list of all apps which fit that category.
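Collecting the distinct age categories in the order they first appear can be sketched with a dictionary, since Python dictionaries preserve insertion order. The name `unique_in_order` and the sample ratings are hypothetical and only illustrate the idea.

```python
def unique_in_order(values):
    """Return the distinct values in first-seen order."""
    return list(dict.fromkeys(values))

# Made-up content ratings, with repeats
ratings = ['Everyone', 'Teen', 'Everyone', 'Mature 17+', 'Teen']
print(unique_in_order(ratings))
```

For the sample list this prints `['Everyone', 'Teen', 'Mature 17+']`.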

In [32]:
# Collect each distinct age category found in BOOKS_AND_REFERENCE
age_categories = []
for app in Android_Data:
    if (app[1] == 'BOOKS_AND_REFERENCE') and (app[8] not in age_categories):
        age_categories.append(app[8])
print('\033[1mAge Category List \033[0m')
print(age_categories)

print('\n')

print('\033[1mApps Suitable for Everyone \033[0m')
all_age_groups = []
for app in Android_Data:
    if (app[1] == 'BOOKS_AND_REFERENCE') and (app[8] == 'Everyone'):
        l1 = [app[0], app[5]]
        all_age_groups.append(l1)
for line in all_age_groups:
    print(line[0], ':', line[1])

print('\n')

print('\033[1mNumber of Apps Suitable for Everyone \033[0m')

print(len(all_age_groups))



Age Category List
['Everyone', 'Everyone 10+', 'Teen', 'Mature 17+']


Apps Suitable for Everyone
E-Book Read - Read Book for free : 50,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
English translation from Bengali : 100,000+
Pdf Book Download - Read Pdf Book : 100,000+
Free Book Reader : 100,000+
eBoox n

## Interpretation and Recommendation

Above I identified four age categories within the Play Store:
 - `Everyone`
 - `Everyone 10+`
 - `Teen`
 - `Mature 17+`

Although there is no age group aimed specifically at children, the `Everyone` category would be the most suitable.  Looking through the app list I can see that there are a number of apps based around a book, such as `D.H. Lawrence Poems FREE`. Creating an app version of an interactive children's book would be advisable: book apps are popular, yet the market is not overly saturated.  Making the app interactive is a distinctive angle which will help the book to stand out within this category.

# Summary

In this project I utilised the methods of data science to:
 - Clarify the project goal
 - Collect relevant data
 - Clean the data
 - Analyse the data

Using these methods I am able to recommend developing an interactive children's book app, which has the potential to be profitable in both the Play Store and App Store.  The interactive element could be personalisation, in which the child can insert themselves into the story using their name and picture, and the app could take the form of a 'Choose Your Own Adventure' book.

The results of this project demonstrate that the methods of data science can be used to facilitate and advise on decisions in a business setting.  This is only one example; data science is an incredibly versatile field.