# Profitable App Profiles for the Apple App Store and Google Play Markets
## About this Project
**Scenario**: I am working as a data analyst for a company that builds English-language Android and iOS mobile apps and makes them available on Google Play and the Apple App Store. The apps they build are free to download and install, thus their main source of revenue consists of in-app ads. This means their revenue for an app is largely influenced by the number of users who use the app — the more users that see and engage with the ads, the better. 
## Goal
The goal for this project is to analyze data to help the company understand what type of apps are likely to attract more users.
## Validation Strategy
To minimize risks and overhead, the validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app and add it to Google Play.
2. Develop the app further if it has a good response from users.
3. Build an iOS version of the app and add it to the Apple App Store if the Android app is profitable after six months.

Because the end goal is to add the app to both Google Play and the Apple App Store, the analysis needs to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.
## Overview of Analysis
1. Convert the data to an analyzable format.
2. Explore the data to see what is in it.
3. Clean up the data, if necessary.
4. Determine how many apps are in each genre/category.
5. Determine the relative number of users in each genre/category.
6. Determine the most desirable genres/categories, i.e. which genres/categories are most likely to attract the greatest number of users.
7. Determine the most desirable genres/categories in Google Play that are also desirable in the Apple App Store.


## Datasets
1. Data on approximately 10,000 Android apps from Google Play. The data was collected in August 2018. Information on this data, including a link to download the data, is here: https://www.kaggle.com/lava18/google-play-store-apps
2. Data on approximately 7,000 iOS apps from the Apple App Store. The data was collected in July 2017. Information on this data, including a link to download the data, is here: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

## Recommendations
The recommendations consider not only the average number of users per app in a genre but also the number of apps in that genre. If a genre has relatively few apps, an app in that genre is more likely to stand out. The recommendations therefore favor genres that have relatively few apps but a relatively high average number of users per app.

The recommendations all concern music-related genres. For Google Play the recommended genres are **Music**, **Music & Video**, and **Music & Audio**. For the Apple App Store the recommended genre is **Music**. 

The recommendations are discussed at the end of this notebook.

## Open the datasets
The following code opens the Google Play and Apple App Store datasets and converts them to a form that our code can process.

In [45]:
# Open the Apple App Store and Google Play files, read the csv, and convert the data headers and the data to lists.
from csv import reader

# The with blocks close each file as soon as its rows have been read.
with open("data_sets/googleplaystore/googleplaystore.csv", encoding="utf8") as open_google_play_csv_file:
    google_play_all = list(reader(open_google_play_csv_file))

with open("data_sets/applestore/AppleStore.csv", encoding="utf8") as open_app_store_csv_file:
    app_store_all = list(reader(open_app_store_csv_file))

# The first row of each file is the header; the remaining rows are the data.
google_play_data_header = google_play_all[0]
google_play_data = google_play_all[1:]
app_store_data_header = app_store_all[0]
app_store_data = app_store_all[1:]

The following is a utility function for showing selected rows in a dataset and, optionally, the total number of rows and columns.

In [46]:
def explore_data(dataset, start, end, rows_and_columns=False, show_which_rows=True):
    """Print the rows dataset[start:end] to stdout, optionally followed by the dataset's row and column counts.
    """
    if show_which_rows:
        print("Showing rows " + str(start + 1) + " to " + str(end))
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## A Peek at the Data

### Google Play Store dataset header and first 10 rows of data

In [47]:
# Show a set of Google Play Store data.
print("***HEADER***")
print(google_play_data_header)
print('\n')
print("***DATA***")
explore_data(google_play_data, 0, 10, rows_and_columns=True)

***HEADER***
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


***DATA***
Showing rows 1 to 10
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', 

### Apple App Store dataset header and first 10 rows of data

In [48]:
# Show a set of Apple App Store data.
print("***HEADER***")
print(app_store_data_header)
print('\n')
print("***DATA***")
explore_data(app_store_data, 0, 10, rows_and_columns=True)

***HEADER***
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


***DATA***
Showing rows 1 to 10
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', 

## Data Cleaning ##
### Error in Google Play data ###
In the discussion of this data at https://www.kaggle.com/lava18/google-play-store-apps/discussion it is noted that the row with index 10472 has an error: the entry for the 'Category' column is missing and, because of this, the following columns in that row are shifted one column to the left.

Since it is bad data we remove that row from the dataset.
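
A row like this can also be found without knowing its index in advance, by comparing each row's length with the header's length. The following is a minimal sketch on made-up data; `find_malformed_rows` and the toy rows are hypothetical and are not part of the notebook's own code.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose column count differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical): the second row is missing its 'Category' value,
# so its remaining values are shifted one column to the left.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad_indices = find_malformed_rows(toy_header, toy_rows)
print(bad_indices)

# Delete from the highest index down so earlier deletions do not shift later indices.
for index in sorted(bad_indices, reverse=True):
    del toy_rows[index]
```

A length check only catches rows with a missing or extra field; a row with the right number of columns but wrong values would need a different test.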

In [49]:
# Show the row with an error and delete it from the Google Play Store data list.
print("Row 10472: ")
print(google_play_data[10472])
print("Length of header:", len(google_play_data_header))
print("Length of row 10472:", len(google_play_data[10472]))

print("\nDeleted row:\n", google_play_data.pop(10472))

Row 10472: 
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Length of header: 13
Length of row 10472: 12

Deleted row:
 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


### Duplicate rows in Google Play data ###
The discussion of this data at https://www.kaggle.com/lava18/google-play-store-apps/discussion also indicates there are duplicate rows, i.e. rows that have the same app name. 

The following finds the number of apps *without* duplicate names and the number of apps *with* duplicate names. It also shows the first 10 apps that are duplicates and the associated rows for each of the 10 duplicated apps.
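
As a compact illustration of the same idea, the sketch below counts app names with `collections.Counter` and splits them into names that occur once and names that occur more than once. The rows are made up for the example, and this is an alternative sketch rather than the approach used in the cell that follows.

```python
from collections import Counter

# Toy rows (hypothetical); the app name is in the first column.
toy_rows = [
    ['App A', 'BUSINESS', '4.2', '159872'],
    ['App B', 'PRODUCTIVITY', '4.0', '120'],
    ['App A', 'BUSINESS', '4.2', '159872'],
    ['App A', 'BUSINESS', '4.2', '159870'],
]

# Count how many rows share each app name.
name_counts = Counter(row[0] for row in toy_rows)

unique_names = [name for name, count in name_counts.items() if count == 1]
duplicate_names = [name for name, count in name_counts.items() if count > 1]

print("Unique app names:", unique_names)
print("Duplicated app names:", duplicate_names)
```

Unlike the dictionaries built in the notebook, this version does not record row indices, so it only answers how many names are duplicated, not where the duplicates are.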

In [50]:
# Find the duplicate apps, show how many have duplicate names, and show the first 10 duplicate apps.

def find_unique_and_duplicate_apps(data=google_play_data):
    """Return two dictionaries. The first dictionary has the rows with unique app names and is of the form 
    {unique_app_name : row_index, ...}. The second dictionary has the rows with duplicate app names and is of the form
    {duplicate_app_name : [row_index1, row_index2, ...], ...}.
    
    Keyword arguments:
    - data - dataset as a list, without the header
    """
    unique_apps = {}  # dictionary with items of the form app_name : row_index
    duplicate_apps = {}  # dictionary with items of the form app_name : [row_index, ...]

    row_index = 0

    for row in data:
        app_name = row[0]  # the app name is in the first column

        if app_name in unique_apps:  # if there's a match to an app that doesn't already have a match
            first_index = unique_apps[app_name]
            del unique_apps[app_name]

            # move both the row that was previously thought to be unique and the currently considered app to the duplicates
            duplicate_apps[app_name] = [first_index, row_index]      

        elif app_name in duplicate_apps:  # if there's a match to an app that already has matches
            duplicate_apps[app_name].append(row_index)

        else:  # first time this app name was found
            unique_apps[app_name] = row_index

        row_index += 1
        
    return unique_apps, duplicate_apps
    
unique_apps, duplicate_apps = find_unique_and_duplicate_apps()

print("Number of unique app names: ", len(unique_apps))
print("Number of duplicate app names: ", len(duplicate_apps))
print("Total number of apps: ", len(unique_apps) + len(duplicate_apps))
print('\n')

# show the first 10 duplicate apps and their associated rows
for count, (app_name, row_indices) in enumerate(duplicate_apps.items()):
    print("Duplicate app: ", app_name)
    print("Num duplicates: ", len(row_indices))
        
    for row_index in row_indices:
        print(google_play_data[row_index])
        
    print('\n')
    
    if count == 9:
        break
    

Number of unique app names:  8861
Number of duplicate app names:  798
Total number of apps:  9659


Duplicate app:  Quick PDF Scanner + OCR FREE
Num duplicates:  3
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


Duplicate app:  Box
Num duplicates:  3
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies

Of the apps with duplicated rows keep the row with the maximum number of reviews (number of reviews is in the 4th column). Note that there may be more than one row with the same maximum number of reviews for a duplicated app: if so, all will be removed except one.
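
Because the csv reader returns every field as a string, it matters how the review counts are compared when picking the maximum. The sketch below uses made-up rows to show the difference between comparing the counts as strings and comparing them as integers.

```python
# Toy duplicate rows (hypothetical); the review count is in column 3 and is a string.
toy_duplicates = [
    ['App A', 'TOOLS', '4.2', '80805'],
    ['App A', 'TOOLS', '4.2', '9'],
    ['App A', 'TOOLS', '4.2', '80804'],
]

# Comparing as strings is lexicographic: '9' sorts after '80805' because '9' > '8'.
by_string = max(toy_duplicates, key=lambda row: row[3])

# Converting to int compares the actual review counts.
by_number = max(toy_duplicates, key=lambda row: int(row[3]))

print(by_string[3])  # '9'
print(by_number[3])  # '80805'
```

When several rows share the same maximum, `max` returns the first one it encounters, which matches the intent of keeping a single row per app.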

In [51]:
# Keep only one app row from each duplicated app. The row kept has the highest number of reviews of all duplicate rows for an app.

def remove_duplicates(duplicate_app_indices=duplicate_apps.values(), column=3):
    """Of the apps with duplicated rows keep a single row with the maximum number of reviews and remove the others.
    Note that the column with the number of reviews is column 3.
    
    Keyword arguments:
    - duplicate_app_indices - an iterable of lists of the form [row_index1, row_index2, ...], one list per duplicated app
    - column - the column from the data used to compare the duplicates
    """
    apps_with_duplicates_removed = []
    
    for row_indices in duplicate_app_indices:
        row_data = []
        for row in row_indices:
            row_data.append(google_play_data[row])

        # The csv reader returns strings, so convert to int to compare review counts numerically
        # rather than lexicographically (as strings, '9' would rank above '80805').
        apps_with_duplicates_removed.append(max(row_data, key=lambda x: int(x[column])))
        
    return apps_with_duplicates_removed
        
apps_with_duplicates_removed = remove_duplicates()

# Confirm the number of apps kept is the same as the number of duplicate apps found previously
assert len(apps_with_duplicates_removed) == len(duplicate_apps)

Put all the cleaned data together.

In [52]:
# Add on the data for apps that already had unique names.

for row_index in unique_apps.values():
    apps_with_duplicates_removed.append(google_play_data[row_index])
    
print("Number of records after removing duplicates: ", len(apps_with_duplicates_removed))

assert len(apps_with_duplicates_removed) == (len(unique_apps) + len(duplicate_apps))

Number of records after removing duplicates:  9659


### Remove apps with non-English names from the data. ###
An app is considered to be non-English if the app name has more than 3 non-English characters.\* 

The following removes rows that have non-English names from the Google Play and Apple App Store datasets.

\*[*Note*: I think this is a terrible way to determine if an app is not in English: there are many non-English languages that use the same character set. But this is what the Dataquest lesson asked to do.]

In [53]:
# If an app name has more than 3 non-English characters - i.e. ASCII value > 127 - then remove it from the dataset.

def is_english(the_string, max_num_chars=3):
    """ English character ascii values are generally 127 or lower. Check a string to see if it has more than the maximum
    number of non-English characters for the string to still be English.
    
    Keyword arguments:
    - the_string - the string to check
    - max_num_chars - the maximum number of non-English characters the string can have and still be considered English
    """
    gt_127_count = 0
    for character in the_string:
        if ord(character) > 127:
            gt_127_count += 1
            
        if gt_127_count > max_num_chars:
            return False
        
    return True

cleaned_android_apps = []

for app in apps_with_duplicates_removed:
    if is_english(app[0]):  # app name is in the 1st column
        cleaned_android_apps.append(app)
        
print("Number of Google Play records after removing non-English apps: ", len(cleaned_android_apps))
print("Number of non-English Google Play apps removed: ", len(apps_with_duplicates_removed) - len(cleaned_android_apps))

cleaned_apple_apps = []

for app in app_store_data:
    if is_english(app[2]):  # app name is in the 3rd column
        cleaned_apple_apps.append(app)
        
print("Number of Apple App Store records after removing non-English apps: ", len(cleaned_apple_apps))
print("Number of non-English Apple App Store apps removed: ", len(app_store_data) - len(cleaned_apple_apps))

Number of Google Play records after removing non-English apps:  9614
Number of non-English Google Play apps removed:  45
Number of Apple App Store records after removing non-English apps:  6183
Number of non-English Apple App Store apps removed:  1014


### Separate out free apps since paid apps are not part of the analysis. ###
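
The price fields are strings. The rows printed earlier show plain values such as '0' and '3.99'; the sketch below additionally assumes that some prices may carry a currency symbol (for example '$4.99'), which is an assumption because no such value appears in the rows shown above. `parse_price` is a hypothetical helper.

```python
import re


def parse_price(price_text):
    """Strip everything except digits and the decimal point, then convert to float."""
    return float(re.sub(r"[^0-9.]", "", price_text))


# '$4.99' is an assumed format; '0' and '3.99' appear in the data shown earlier.
for text in ['0', '3.99', '$4.99']:
    print(text, "free" if parse_price(text) == 0 else "paid")
```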

In [54]:
def is_free(record, column=7):
    """ Look in the appropriate column in the record for the app price. If the price is '0' then the app is free.
    
    Keyword arguments:
    - record - a record in the form of a list
    - column - column to look for the price (default=7 is correct for Google Play data)
    """
    import re
    
    # Strip anything that is not a digit or a decimal point (e.g. a currency symbol) before converting.
    return float(re.sub(r"[^0-9.]", "", record[column])) == 0

free_cleaned_android_apps = []

for app in cleaned_android_apps:
    if is_free(app):
        free_cleaned_android_apps.append(app)
        
print("Number of free Google Play apps: ", len(free_cleaned_android_apps))
print("Number of paid Google Play apps: ", len(cleaned_android_apps) - len(free_cleaned_android_apps))

free_cleaned_apple_apps = []

for app in cleaned_apple_apps:
    if is_free(app, column=5):  # price is in column 5
        free_cleaned_apple_apps.append(app)

print("Number of free Apple App Store apps: ", len(free_cleaned_apple_apps))
print("Number of paid Apple App Store apps: ", len(cleaned_apple_apps) - len(free_cleaned_apple_apps))

Number of free Google Play apps:  8862
Number of paid Google Play apps:  752
Number of free Apple App Store apps:  3222
Number of paid Apple App Store apps:  2961


## Which genres/categories are most frequent? ##

To get a sense of the most common genres in each market, build frequency tables using the following columns:

**Google Play store data**
* `Category`: Category the app belongs to (2nd column)
* `Genres`: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres. (10th column)

**Apple App Store data**
* `prime_genre`: Primary Genre (13th column)
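
A frequency table over a column whose values can hold several semicolon-separated genres can be sketched with `collections.Counter`. The toy values below are loosely modeled on the `Genres` values shown earlier, but the list itself is made up; this is an alternative sketch, not the notebook's own `freq_table` function.

```python
from collections import Counter

# Toy 'Genres' values (hypothetical list); one value can hold several genres.
toy_genres = ['Art & Design', 'Art & Design;Pretend Play', 'Tools', 'Tools']

# Split each value on ';' and count every genre label separately.
counts = Counter(genre for value in toy_genres for genre in value.split(';'))
total = sum(counts.values())

# most_common() yields (genre, count) pairs in descending order of count.
for genre, count in counts.most_common():
    print(genre, count, "{:.1%}".format(count / total))
```

Note that `total` is the number of genre labels, not the number of apps: an app with two genres contributes two labels, so the percentages are shares of labels.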

In [55]:
%pip install pandas;
%pip install itables;


Note: you may need to restart the kernel to use updated packages.


In [56]:
from pandas import DataFrame
from itables import init_notebook_mode, show, options as opt
opt.classes = ["compact"]
init_notebook_mode(all_interactive=True)

def freq_table(dataset, column):
    """Output a dictionary with column values as keys and frequency of the column value as the value, sorted by frequency in
    descending order.
    Note: sorting a dictionary works with Python 3.7+ only.
    
    Keyword arguments:
    - dataset - dataset in the form of a list of lists
    - column - the column to use from the dataset
    """
    freq_dict = {}
    
    for row in dataset:
        column_values = row[column].split(';')  # can be multiple values delimited by the semicolon
        
        for column_value in column_values:        
            freq_dict[column_value] = freq_dict.get(column_value, 0) + 1

    return {k: v for k, v in sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)}

def convert_to_pandas_dataframe(field_names, dataset):
    """Take a dataset (with no header) with records as a list of values and a field name list and output a pandas dataframe. 
    The dataframe is built from a dictionary with the names in the header as keys and the values in a list, ordered by record order.
    A dataset that looks like this:
    
    field_names = ['huey', 'louie', 'gooey']
    dataset = [[1, 2, 3], [4, 5, 6]]
    
    is first collected into this dictionary, which is then passed to DataFrame:
    
    {'huey' : [1, 4], 'louie' : [2, 5], 'gooey' : [3, 6]}
    
    Keyword arguments:
    - field_names - an ordered list of names for each column in the dataset
    - dataset - a list of records where the records are a list of values, ordered as in field_names
    """
    df = {}
    
    for record in dataset:
        for field_name, value in zip(field_names, record):
#             print("field_name value:", field_name, value)
            if field_name in df:
                df[field_name].append(value)
            else:
                df[field_name] = [value]
                
    return DataFrame(df)


In [57]:
print("\nGoogle Play store: Category")
sorted_freq_table = freq_table(free_cleaned_android_apps, 1)

# get the total number of values to calculate percentages
total = 0
for value in sorted_freq_table.values():
    total = total + value

table_list = []

for entry in sorted_freq_table.items():
    table_list.append([entry[0], entry[1], "{:.1%}".format(float(entry[1])/total)])

headers=["Category", "Num apps", "% of total"]
df = convert_to_pandas_dataframe(headers, table_list)

show(df, paging=False, order=[[1, "desc"]], columnDefs=[{"className":"dt-left",  "targets": 0}])
    
print("\nGoogle Play store: Genres")
sorted_freq_table = freq_table(free_cleaned_android_apps, 9)

# get the total number of values to calculate percentages
total = 0
for value in sorted_freq_table.values():
    total = total + value

table_list = []

for entry in sorted_freq_table.items():
    table_list.append([entry[0], entry[1], "{:.1%}".format(float(entry[1])/total)])

headers=["Genre", "Num apps", "% of total"]
df = convert_to_pandas_dataframe(headers, table_list)

show(df, paging=False, order=[[1, "desc"]], columnDefs=[{"className":"dt-left",  "targets": 0}])


Google Play store: Category


Category,Num apps,% of total



Google Play store: Genres


Genre,Num apps,% of total


In [58]:
print("\nApple App Store: prime_genre")
sorted_freq_table = freq_table(free_cleaned_apple_apps, 12)

# get the total number of values to calculate percentages
total = 0
for value in sorted_freq_table.values():
    total = total + value

table_list = []

for entry in sorted_freq_table.items():
    table_list.append([entry[0], entry[1], "{:.1%}".format(float(entry[1])/total)])

headers=["prime_genre", "Num apps", "% of total"]
df = convert_to_pandas_dataframe(headers, table_list)

show(df, paging=False, order=[[1, "desc"]], columnDefs=[{"className":"dt-left",  "targets": 0}])


Apple App Store: prime_genre


prime_genre,Num apps,% of total


### Comments on Frequency Tables ###
#### Google Play store data ####
##### `Category` field #####
- The most common Category is *FAMILY*, with almost twice as many apps as the next most common category, *GAME*.
- *TOOLS* is not too far behind *GAME* in frequency.
- The next two most common Categories are *BUSINESS* and *LIFESTYLE*.

##### `Genres` field #####
- The most common Genre is *Tools*, but it does not dominate the `Genres` field the way *Games* dominates the Apple App Store data or *FAMILY* dominates the `Category` field.
- The next most common values in `Genres` are *Education*, *Entertainment*, *Business*, and *Lifestyle*.

Since the top two in `Genres`, *Tools* and *Education*, do not match the top two in `Category`, *FAMILY* and *GAME*, it is not clear how the `Genres` and `Category` fields relate to each other. Regardless, compared to the Apple data, there seems to be a much lower number of entertainment-oriented apps relative to the rest of the apps.

#### Apple App Store data ####
- The most common genre, by far, is *Games* with more free English-language apps in that genre than all the other genres put together.
- The next most common genre is *Entertainment*.
- The top two genres are 'leisure' activities.
- The next three most common genres are *Photo & Video*, *Education*, and *Social Networking*, all with at least 100 free English-language apps.
- Most of the apps seem to be designed for entertainment purposes (e.g. Games, Entertainment, Photo & Video, Social Networking, etc.) rather than practical purposes (e.g. Education, Shopping, Utilities, Health & Fitness, etc.).

It is also not clear how well these fields in the Apple and Google data reflect how often apps are actually used, because the frequency tables do not take into account the number of users behind each value. They only represent the number of apps in each genre/category.

## Which app genres have the most users? ##

### Google Play data ###
Unlike the Apple App Store data, the Google Play data records the number of installations of an app in the `Installs` field. Unfortunately, the number of installs is not given precisely. Instead it is given as open-ended buckets such as *100+*, *5,000+*, *1,000,000+*, and *10,000,000+*. A value means at least that many installs but fewer than the next bucket: since the data also contains *5,000,000+*, the value *1,000,000+* could mean anything from 1,000,000 to 4,999,999. In the calculations below the plus sign and commas are removed and the remaining number is used, so the value *1,000,000+* is translated to the number 1,000,000, for instance.
#### 'Opportunity' score
Ideally, the app should be in a genre that has a high average number of installs per app but also one in which the app is likely to stand out from the other apps. Standing out might be difficult in genres that have a lot of apps; it is easier in genres that contain few apps while still having a high average number of installations per app. To capture this 'opportunity' within a genre, a score is calculated that simply divides the average number of installations by the number of apps:

                     average number of installs per app 
    opportunity  =   ---------------------------------- 
                               number of apps           
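
As a worked example of the formula, the sketch below computes the score for two made-up genres. The genre names and install totals are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def opportunity(total_installs, num_apps):
    """Average installs per app divided by the number of apps in the genre."""
    average_installs = total_installs / num_apps
    return average_installs / num_apps


# Hypothetical genres: A is large and crowded, B is small with a higher average.
toy_genres = {
    'Genre A': {'total_installs': 1_000_000_000, 'num_apps': 1000},
    'Genre B': {'total_installs': 100_000_000, 'num_apps': 50},
}

scores = {name: opportunity(**values) for name, values in toy_genres.items()}

for name, score in scores.items():
    print(name, "{:,.0f}".format(score))
```

Genre A averages 1,000,000 installs per app and scores 1,000, while Genre B averages 2,000,000 installs per app and scores 40,000. Because the number of apps is divided out twice (the score equals total installs divided by the square of the number of apps), the score strongly favors genres with few apps.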

In [59]:
# Calculate the total number of installs for each Category and Genre.
total_number_installs_per_category = {}
total_number_installs_per_genre = {}
total_number_apps_per_category = {}
total_number_apps_per_genre = {}

for row in free_cleaned_android_apps:    
    installs = int(row[5].replace('+', '').replace(',', ''))
    
    category = row[1]
    total_number_installs_per_category[category] = total_number_installs_per_category.get(category, 0) + installs
    total_number_apps_per_category[category] = total_number_apps_per_category.get(category, 0) + 1
    
    genre_values = row[9].split(';')  # can be multiple values delimited by the semicolon
        
    for genre in genre_values:        
        total_number_installs_per_genre[genre] = total_number_installs_per_genre.get(genre, 0) + installs
        total_number_apps_per_genre[genre] = total_number_apps_per_genre.get(genre, 0) + 1
    
sorted_total_number_installs_per_category = {k: v for k, v in sorted(total_number_installs_per_category.items(), 
                                             key=lambda item: item[1], reverse=True)}
sorted_total_number_installs_per_genre = {k: v for k, v in sorted(total_number_installs_per_genre.items(), 
                                          key=lambda item: item[1], reverse=True)}

print("\nInstallations per Category:" )

# get the total number of values to calculate percentages
total = 0
for value in sorted_total_number_installs_per_category.values():
    total = total + value
    
table_list = []

for entry in sorted_total_number_installs_per_category.items():
    percent_num_installs = float(entry[1])/total
    num_apps = total_number_apps_per_category[entry[0]]
    avg_num_installs = int(entry[1]/num_apps)
    avg_num_installs_per_num_apps = int(avg_num_installs/num_apps)
    
    table_list.append([entry[0], 
                       "{:,}".format(entry[1]), 
                       "{:.1%}".format(percent_num_installs),
                       "{:,}".format(num_apps),
                       "{:,}".format(avg_num_installs),
                       "{:,}".format(avg_num_installs_per_num_apps)
                      ])

headers=["Category", "Total installs count", "% of installs", "Num apps", "Avg # of installs/app", "Opportunity"]
df = convert_to_pandas_dataframe(headers, table_list)

show(df, paging=False, order=[[1, "desc"]], columnDefs=[{"className":"dt-left",  "targets": 0}])
    
print("\nInstallations per Genre:" )

# get the total number of values to calculate percentages
total = 0
for value in sorted_total_number_installs_per_genre.values():
    total = total + value

table_list = []

for entry in sorted_total_number_installs_per_genre.items():
    percent_num_installs = float(entry[1])/total
    num_apps = total_number_apps_per_genre[entry[0]]
    avg_num_installs = int(entry[1]/num_apps)
    avg_num_installs_per_num_apps = int(avg_num_installs/num_apps)
    
    table_list.append([entry[0], 
                       "{:,}".format(entry[1]), 
                       "{:.1%}".format(percent_num_installs),
                       "{:,}".format(num_apps),
                       "{:,}".format(avg_num_installs),
                       "{:,}".format(avg_num_installs_per_num_apps)
                      ])
    
headers=["Genre", "Total installs count", "% of installs", "Num apps", "Avg # of installs/app", "Opportunity"]
df = convert_to_pandas_dataframe(headers, table_list)

show(df, paging=False, order=[[1, "desc"]], columnDefs=[{"className":"dt-left",  "targets": 0}])


Installations per Category:


Category,Total installs count,% of installs,Num apps,Avg # of installs/app,Opportunity



Installations per Genre:


Genre,Total installs count,% of installs,Num apps,Avg # of installs/app,Opportunity


Looking at the Genre names it looks as if each Genre might be a subset of a Category. Also, it looks as if some individual Genres mirror a Category in a one-to-one fashion.
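
One way to check for such one-to-one pairs is to build the mapping in both directions and keep the categories whose single genre points back only to that category. The sketch below does this on a handful of made-up (category, genres) pairs; it is a simplified illustration, not the graph-based code used in the following cell.

```python
from collections import defaultdict

# Toy (Category, Genres) pairs; the values are hypothetical.
toy_rows = [
    ('WEATHER', 'Weather'),
    ('WEATHER', 'Weather'),
    ('FAMILY', 'Education;Pretend Play'),
    ('EDUCATION', 'Education'),
    ('GAME', 'Action'),
    ('GAME', 'Arcade'),
]

genres_by_category = defaultdict(set)
categories_by_genre = defaultdict(set)

for category, genres in toy_rows:
    for genre in genres.split(';'):  # a value can hold several genres
        genres_by_category[category].add(genre)
        categories_by_genre[genre].add(category)

# A pair is one-to-one when the category has exactly one genre
# and that genre appears under exactly that one category.
one_to_one = sorted(
    (category, next(iter(genres)))
    for category, genres in genres_by_category.items()
    if len(genres) == 1 and categories_by_genre[next(iter(genres))] == {category}
)

print(one_to_one)
```

In this toy data only WEATHER and Weather mirror each other; EDUCATION has a single genre, but that genre also appears under FAMILY, so the pair is not one-to-one.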

In [60]:
category_genre_dict = {}
genre_category_dict = {}

for row in free_cleaned_android_apps:
    category = row[1]
    
    genre_values = row[9].split(';')  # can be multiple values delimited by the semicolon
    genre_values.sort()
        
    for genre in genre_values:
        if category not in category_genre_dict:
            category_genre_dict[category] = {genre}
        else:
            category_genre_dict[category].add(genre)
            
        if genre not in genre_category_dict:
            genre_category_dict[genre] = {category}
        else:
            genre_category_dict[genre].add(category)
            
# Uncomment to print both mappings:
# for key, value in category_genre_dict.items():
#     print(key, " : ", sorted(value))
# print('\n')
# for key, value in genre_category_dict.items():
#     print(key, " : ", sorted(value))

sorted_category_genre_dict = dict(sorted(category_genre_dict.items(), key=lambda x: len(x[1]), reverse=True))
sorted_genre_category_dict = dict(sorted(genre_category_dict.items(), key=lambda x: len(x[1]), reverse=True))
    
import json

def convert_dicts_to_json_for_ipycytoscape(sorted_category_genre_dict, sorted_genre_category_dict):
    """ Convert dict of the form {key : [value1, value2, ...]} to json appropriate for loading into ipycytoscape.
    """
    intermediate_dict = {}
    intermediate_dict['nodes'] = []
    intermediate_dict['edges'] = []
    
    for key in sorted_category_genre_dict:
        node = {}
        node['data'] = {'id': key, 'name': key, 'classes': 'category'}
        intermediate_dict['nodes'].append(node)
        
    for key in sorted_genre_category_dict:
        node = {}
        node['data'] = {'id': key, 'name': key, 'classes': 'genre'}
        intermediate_dict['nodes'].append(node)
        
    for key, values in sorted_category_genre_dict.items():
        for value in values:
            source_target = {'source': key, 'target': value}
            edge = {'data': source_target}
            intermediate_dict['edges'].append(edge)
            
    for key, values in sorted_genre_category_dict.items():
        for value in values:
            source_target = {'source': key, 'target': value}
            edge = {'data': source_target}
            intermediate_dict['edges'].append(edge)
            
    return intermediate_dict


def get_nodes_with_only_one_to_one_edges(all_edges):
    """ Return the set of nodes that are connected one-to-one: a source node that has only one target, where that
    target, used as a source, has the original source as its only target. Ex. only one edge with source 'A' and only
    one edge with source 'a' where the target of 'A' is 'a' and the target of 'a' is 'A'.
    
    Keyword inputs:
    - all_edges - a list of edges of the form (ex.) {'data': {'source': 'FAMILY', 'target': 'Trivia'}}
    """
    edges_for_each_node = {}  # this is of the form (ex.) {'FAMILY': ['Trivia', 'Communication'], ...}
    
    for edge in all_edges:
        current_edge = edge['data']
    
        if current_edge['source'] not in edges_for_each_node:
            edges_for_each_node[current_edge['source']] = set([current_edge['target']])
        else:
            edges_for_each_node[current_edge['source']].add(current_edge['target'])
            
    one_to_one_nodes = set()
    
    for source, targets in edges_for_each_node.items():  
        if len(targets) == 1:  # if there is only one target for the current source node
            for current_target in targets:  # there should only be one target if we get this far (it's a set so have to iterate)
                if current_target in edges_for_each_node:  # if the target node is also a source node
                    targets_of_target = edges_for_each_node[current_target]  # use the target node as a source and get its target nodes
                    if len(targets_of_target) == 1:  # if there is only one target of the target node when used as a source
                        for current_target_of_target in targets_of_target:  # there should only be one target if we get this far 
                            if current_target_of_target == source:  # if that target is the same as the original source node
                                one_to_one_nodes.add(current_target)
                                one_to_one_nodes.add(source)
                        
    return one_to_one_nodes

def remove_nodes(nodes_to_remove, original_dataset):
    """ Remove nodes and their corresponding edges from a dataset, returning a deep copy of the dataset without the
    removed nodes.
    
    Keyword inputs:
    - nodes_to_remove - a collection (list or set) of node ids to be removed
    - original_dataset - dictionary of the form
        {'nodes': [{'data': {'id': 'FAMILY', 'name': 'FAMILY'...}}, ...],
         'edges': [{'data': {'source': 'Board', 'target': 'FAMILY'}}, ...]}
    """
    import copy
    new_dataset = copy.deepcopy(original_dataset)
    
    for node in original_dataset['nodes']:
        if node['data']['id'] in nodes_to_remove:
            new_dataset['nodes'].remove(node)
            
    for edge in original_dataset['edges']:
        if edge['data']['source'] in nodes_to_remove or edge['data']['target'] in nodes_to_remove:
            new_dataset['edges'].remove(edge)
            
    return new_dataset        

dict_data = convert_dicts_to_json_for_ipycytoscape(sorted_category_genre_dict, sorted_genre_category_dict)

# print("Num nodes before: ", len(dict_data['nodes']))
# print("Num edges before: ", len(dict_data['edges']))

one_to_one_nodes = get_nodes_with_only_one_to_one_edges(dict_data['edges'])

# print("Num nodes to remove:", len(one_to_one_nodes))

dict_data_without_one_to_ones = remove_nodes(one_to_one_nodes, dict_data)
# print("dict_data\n", dict_data_without_one_to_ones)

# print("Num nodes after: ", len(dict_data_without_one_to_ones['nodes']))
# print("Num edges after: ", len(dict_data_without_one_to_ones['edges']))

import re
pattern = re.compile("[A-Z_]+")
key_function = lambda v: v.upper().replace("&", "AND").replace(" ", "_") if re.fullmatch(pattern, v) else v.upper().replace("&", "AND").replace(" ", "_") + "1"
one_to_one_nodes_list = list(sorted(one_to_one_nodes, key=key_function))

num_pairs = int(len(one_to_one_nodes_list)/2)
print("{:d} pairs of nodes with a one to one connection: \n".format(num_pairs))
print('{:20}   {:>18}'.format('Category', 'Genre'))
print('{:_<20}   {:_>18}'.format('', ''))
i = 0
while i < len(one_to_one_nodes_list) - 1:
    print('{:20} : {:>18}'.format(one_to_one_nodes_list[i], one_to_one_nodes_list[i + 1]))
#     print(one_to_one_nodes_list[i] + " : " + one_to_one_nodes_list[i + 1])
    i += 2

18 pairs of nodes with a one to one connection: 

Category                            Genre
____________________   __________________
AUTO_AND_VEHICLES    :    Auto & Vehicles
BEAUTY               :             Beauty
BUSINESS             :           Business
DATING               :             Dating
EVENTS               :             Events
FINANCE              :            Finance
FOOD_AND_DRINK       :       Food & Drink
HOUSE_AND_HOME       :       House & Home
LIBRARIES_AND_DEMO   :   Libraries & Demo
MAPS_AND_NAVIGATION  :  Maps & Navigation
MEDICAL              :            Medical
NEWS_AND_MAGAZINES   :   News & Magazines
PERSONALIZATION      :    Personalization
PHOTOGRAPHY          :        Photography
PRODUCTIVITY         :       Productivity
SHOPPING             :           Shopping
SOCIAL               :             Social
WEATHER              :            Weather


In [61]:
%pip install ipycytoscape;

Note: you may need to restart the kernel to use updated packages.


In [62]:
import ipycytoscape

cytoscapeobj = ipycytoscape.CytoscapeWidget()
cytoscapeobj.graph.add_graph_from_json(dict_data_without_one_to_ones)
cytoscapeobj.set_layout(name='klay', nodeDimensionsIncludeLabels=True)
cytoscapeobj.set_style([{
                        'selector': 'node',
                        'css': {
                            'content': 'data(name)',
                            'text-valign': 'center',
                            'color': 'white',
                            'text-outline-width': 2,                            
                        }
                        },
                        {
                        'selector': 'node[classes="genre"]',
                        'css': {
                            'text-outline-color': '#c66b3d',
                            'background-color': '#c66b3d'
                        }
                        },
                        {
                        'selector': 'node[classes="category"]',
                        'css': {
                            'background-color': '#26495c',
                            'text-outline-color': '#26495c'
                        }
                        },
#                         {
#                         'selector': 'node:parent',
#                         'css': {
#                             'background-opacity': 0.333
#                             }
#                         },
                        {
                            'selector': 'edge',
                            'style': {
                                'width': 4,
                                'line-color': '#c4b491',
                                #'target-arrow-shape': 'triangle',
                                #'target-arrow-color': '#9dbaea',
                                'curve-style': 'bezier'
                            }
                        }])    

The graph below shows the relationship between the Category data (blue nodes) and the Genre data (orange nodes) for the nodes that are not related one-to-one, unlike the pairs in the table above. Note that the Genre data is more fine-grained than the Category data. Note the overlap between the Genre nodes related to the Category 'FAMILY' and those related to the Category 'GAME'. Note also that the Genre nodes related to 'FAMILY' and 'GAME' are also related to many of the other Category nodes.

Most important for this analysis, every Genre node is related to at least one Category node, so it seems reasonable to ignore Category and use Genre. Because Genre has more nodes, it is the more fine-grained categorization of the data.
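
As a minimal sketch of the one-to-one idea behind the table above, the toy mapping below keeps only the category/genre pairs that point exclusively at each other. The data and the helper name `one_to_one_pairs` are hypothetical and are not the notebook's own dictionaries or functions.

```python
def one_to_one_pairs(category_to_genres, genre_to_categories):
    """Return sorted (category, genre) pairs that map only to each other."""
    pairs = []
    for category, genres in category_to_genres.items():
        if len(genres) != 1:
            continue  # the category has several genres, so it is not one-to-one
        (genre,) = genres  # unpack the single element of the set
        if genre_to_categories.get(genre) == {category}:
            pairs.append((category, genre))
    return sorted(pairs)


# Hypothetical toy data, for illustration only
category_to_genres = {
    "WEATHER": {"Weather"},
    "FAMILY": {"Puzzle", "Education"},
    "GAME": {"Puzzle"},
}
genre_to_categories = {
    "Weather": {"WEATHER"},
    "Puzzle": {"FAMILY", "GAME"},
    "Education": {"FAMILY"},
}

pairs = one_to_one_pairs(category_to_genres, genre_to_categories)
print(pairs)
```

In this toy data only 'WEATHER' and 'Weather' qualify, because 'Puzzle' is shared by two categories.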

In [63]:
cytoscapeobj

CytoscapeWidget(cytoscape_layout={'name': 'klay', 'nodeDimensionsIncludeLabels': True}, cytoscape_style=[{'sel…

#### Discussion of Google Play data ####
Since the Genre data is more fine-grained I will concentrate on it. 

Here are the top 5 genres by average number of installs per app:

| Genre                   | Avg # installs/app |
|-------------------------|--------------------|
| Communication           | 38.3 million       |
| Video Players & Editors | 24.6 million       |
| Social                  | 23.2 million       |
| Arcade                  | 21.5 million       |
| Photography             | 17.8 million       |

And the top 5 genres (with at least 20 apps in the genre) by 'Opportunity':

| Genre                   | Opportunity |
|-------------------------|-------------|
| Music                   | 450,301     |
| Word                    | 395,411     |
| Music & Video           | 160,677     |
| Video Players & Editors | 153,778     |
| Racing                  | 141,770     |

Note that only *Video Players & Editors* is in the top 5 in both average number of installs and 'Opportunity'.
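
The ranking with a minimum-app threshold can be sketched as follows. This assumes 'Opportunity' is the average number of installs per app divided by the number of apps in the genre, mirroring how the Apple App Store cell later in this notebook derives its 'Opportunity' column. The function name, genre names, and numbers are hypothetical.

```python
def top_genres_by_opportunity(genre_stats, min_apps, top_n):
    """Rank genres by opportunity, skipping genres with too few apps.

    genre_stats maps genre -> (average installs per app, number of apps).
    Opportunity is assumed to be average installs divided by number of apps.
    """
    scored = [
        (genre, avg_installs / num_apps)
        for genre, (avg_installs, num_apps) in genre_stats.items()
        if num_apps >= min_apps
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical toy data, for illustration only
toy_stats = {
    "Genre A": (1_000_000, 20),   # opportunity 50,000
    "Genre B": (5_000_000, 500),  # opportunity 10,000
    "Genre C": (500_000, 1),      # excluded: fewer than 20 apps
}

ranked = top_genres_by_opportunity(toy_stats, min_apps=20, top_n=2)
print(ranked)
```

The threshold keeps a genre with a single, highly installed app from topping the list on the strength of one data point.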

### Apple App Store data ###
Unfortunately, given the available data, there is no direct way to figure out the frequency of users for each genre. The total number of ratings for each app is available, however, and that can be used as a close proxy for the actual number of users. A field called `rating_count_tot` is in the 7th column. Calculate the total number of ratings for each genre.

Note that the number of ratings in a genre may be indicative more of the number of apps in a genre rather than how much each app in that category tends to be installed or used. To get an idea of how much each app in a category tends to be installed or used, assuming that rating count for each app is indicative of how much that app is used, the data is 'normalized' to show the average number of ratings for an app in each genre.
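
A minimal sketch of this normalization on toy data is shown here. The rows are hypothetical, and the divisor is the number of apps in the genre itself, which is one reading of "average number of ratings for an app in each genre".

```python
from collections import defaultdict

# Hypothetical rows of (genre, total rating count), for illustration only
toy_rows = [
    ("Games", 1000),
    ("Games", 3000),
    ("Weather", 500),
]

ratings_per_genre = defaultdict(int)
apps_per_genre = defaultdict(int)

for genre, rating_count in toy_rows:
    ratings_per_genre[genre] += rating_count
    apps_per_genre[genre] += 1

# Divide each genre's rating total by the number of apps in that genre
average_ratings_per_app = {
    genre: ratings_per_genre[genre] / apps_per_genre[genre]
    for genre in ratings_per_genre
}
print(average_ratings_per_app)
```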

In [64]:
# Calculate the total number of ratings and apps for each genre.
total_number_ratings_per_genre = {}
total_number_apps_per_genre = {}

for row in free_cleaned_apple_apps:
    genre = row[12]
    rating_count = int(row[6])
    
    total_number_ratings_per_genre[genre] = total_number_ratings_per_genre.get(genre, 0) + rating_count
    total_number_apps_per_genre[genre] = total_number_apps_per_genre.get(genre, 0) + 1

sorted_total_number_ratings_per_genre = dict(sorted(total_number_ratings_per_genre.items(),
                                                    key=lambda item: item[1], reverse=True))

# sorted_total_number_apps_per_genre = {k: v for k, 
#                                       v in sorted(total_number_apps_per_genre.items(), 
#                                       key=lambda item: item[1], reverse=True)}

print("\nTotal number of ratings and apps per genre and average number of ratings per app:" )

# get the total number of ratings to calculate percentages
total_num_ratings = sum(sorted_total_number_ratings_per_genre.values())

table_list = []

for entry in sorted_total_number_ratings_per_genre.items():
    percent_num_ratings = float(entry[1])/total_num_ratings
    num_apps = total_number_apps_per_genre[entry[0]]
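    # NOTE: the divisor here is the number of apps in the whole dataset, not the number of apps in this genre
    average_num_ratings = int(entry[1]/len(free_cleaned_apple_apps))
    # 'Opportunity': the value above divided by the number of apps in this genre
    average_num_ratings_per_num_apps = int(average_num_ratings/num_apps)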
    
    table_list.append([entry[0], 
                       "{:,}".format(entry[1]), 
                       "{:.1%}".format(percent_num_ratings), 
                       "{:,}".format(num_apps),
                       "{:,}".format(average_num_ratings),
                       "{:,}".format(average_num_ratings_per_num_apps)
                      ])
    
    # print("{} : {:,} ({:.1%}), {:,} ({:.1%})".format(entry[0], entry[1], percent_num_ratings, num_apps, percent_num_apps))
    
headers=["prime_genre", "Total rating count", "% of ratings", "Num apps", "Avg # of ratings/app", "Opportunity"]
df = convert_to_pandas_dataframe(headers, table_list)

show(df, paging=False, order=[[1, "desc"]], columnDefs=[{"className":"dt-left",  "targets": 0}])


Total number of ratings and apps per genre and average number of ratings per app:


prime_genre,Total rating count,% of ratings,Num apps,Avg # of ratings/app,Opportunity


#### Discussion of Apple App Store data ####

The top 5 in average number of ratings:

| Genre             | Avg # ratings/app |
|-------------------|-------------------|
| Games             | 13,254            |
| Social Networking | 2,353             |
| Photo & Video     | 1,412             |
| Music             | 1,174             |
| Entertainment     | 1,106             |

Note that the *Games* genre dominates, accounting for over 50% of the total rating count. Also note that the top 5 genres are all entertainment-oriented and together account for almost 78% of the total rating count.

The top 5 in 'Opportunity' with at least 10 apps in the Genre:

| Genre             | Opportunity |
|-------------------|-------------|
| Reference         | 23          |
| Social Networking | 22          |
| Music             | 17          |
| Weather           | 16          |
| Book              | 12          |

Note that the 'Opportunity' for the *Games* Genre, at 7, is relatively low.

Note that *Social Networking* and *Music* are in the top 5 in both average number of ratings and 'Opportunity'.
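
The overlap between the two top-5 lists can be checked with a short sketch. The genre names are copied from the two tables above; the variable names are illustrative.

```python
# Genres copied from the two Apple App Store tables above
top5_avg_ratings = ["Games", "Social Networking", "Photo & Video", "Music", "Entertainment"]
top5_opportunity = ["Reference", "Social Networking", "Music", "Weather", "Book"]

# Keep the order of the first list while testing membership in the second
opportunity_set = set(top5_opportunity)
in_both = [genre for genre in top5_avg_ratings if genre in opportunity_set]
print(in_both)  # ['Social Networking', 'Music']
```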

### Comparison of Google Play and Apple App Store data ###
*Video Players & Editors* is the stand-out genre in the Google Play data. The seemingly closest corresponding genre in the Apple App Store data, *Photo & Video*, has the third highest average number of ratings, but its Opportunity score, at 8, is rather middling.

Since there are fewer genres in the Apple App Store data than in the Google Play store data it may be useful to take a look at the top Apple App Store genres and see how the corresponding Google Play store genres place in the Google Play store data.

#### Apple App Store genres ####
##### *Reference* #####
The *Reference* genre has the highest 'Opportunity' score in the Apple App Store data. Unfortunately, the corresponding Google Play store genre *Books & Reference* has a middling Opportunity score and is not among the top ten genres by average number of installs.

##### *Social Networking* #####
The *Social Networking* genre has the second highest 'Opportunity' score in the Apple App Store data and the second highest average number of ratings (the proxy for number of users). Unfortunately, the corresponding Google Play store genre *Social*, while having the third highest average number of installs, has a middling Opportunity score.

##### *Music* #####
The *Music* genre has the third highest 'Opportunity' score in the Apple App Store data and the fourth highest average number of ratings (the proxy for number of users). There are three potential corresponding genres in the Google Play data: *Music & Audio*, *Music*, and *Music & Video*. All three have very high Opportunity scores. *Music & Audio* has the highest Opportunity score at 500,000 but was not included in the discussion of the Google Play data since there is only one app in that genre. *Music* has the second highest Opportunity score (450,301) and *Music & Video* has the fourth highest Opportunity score (160,677).

## Discussion of recommendations ##
Music-related genres are recommended mainly because they appear to offer a lot of opportunity: there are relatively few apps in these genres, and the average number of users per app, while not always very high, is reasonably high. There are genres with a much higher average number of users per app, but their Opportunity score is low because they contain many apps, so there is too much risk that a new app in one of those genres would go unnoticed.