# Profitable App Profiles for the App Store and Google Play Markets
In this project, we will analyze data collected from app usage to help our company's developers understand what types of apps are likely to attract more users. Our company only builds apps that are free to download; consequently, our main source of revenue is in-app ads, which is directly influenced by the number of users who use the app and view and engage with ads.

The goal of this project is to provide actionable insight that will attract more users to our apps and ultimately boost advertisement revenue.

## Opening and Exploring App Data
There are about 4 million apps across the Apple and Android app stores, and extracting and cleaning data on all of them would require significant resources. In this project, we will analyze existing datasets of smaller samples to save time and money.
- For the Apple App Store, we will use a dataset containing data of ~7,000 iOS apps, collected in July 2017. The dataset can be found [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
- For the Google Play Store, we will use a dataset containing data of ~10,000 Android apps, collected in August 2018. The dataset can be found [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

We'll start our analysis by opening and reading each dataset and converting it into a list of lists.

In [2]:
from csv import reader

# helper functions 
def open_dataset(filepath):
    # utf8 is assumed here because app names contain non-ASCII characters
    with open(filepath, encoding='utf8') as of:
        rf = reader(of)
        return list(rf)

def explore_data(ds, start, end, rowCol=False):
    dsSlice = ds[start:end]
    for row in dsSlice:
        print(row,'\n')
    if rowCol:
        print('Number of rows:', len(ds))
        print('Number of columns:', len(ds[0]))

# open datasets with open_dataset helper function
appStore = open_dataset('AppleStore.csv')
playStore = open_dataset('googleplaystore.csv')

# separate headers from datasets
appStore_header = appStore[0]
appStore = appStore[1:]
playStore_header = playStore[0]
playStore = playStore[1:]

Let's explore the Play Store and App Store datasets by previewing three entries from each (the rows at indexes 1 through 3):

In [3]:
# explore and print datasets with explore_data helper function
print('App Store Data Preview:\n')
explore_data(appStore, 1, 4, True)
print('\nPlay Store Data Preview:\n')
explore_data(playStore, 1, 4, True)

App Store Data Preview:

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] 

['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'] 

Number of rows: 7197
Number of columns: 16

Play Store Data Preview:

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & De

Our App Store dataset contains 7,197 apps (one per row); our Play Store dataset contains 10,841 apps.

To make sense of the elements in each list, let's print each dataset's column titles. The column titles were the first row of each list of lists, which we stored separately in appStore_header and playStore_header.

For a more elaborate description of the column names, the datasets' documentation can be found below:

- [App Store Dataset Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
- [Play Store Dataset Documentation](https://www.kaggle.com/lava18/google-play-store-apps)



In [4]:
# print column titles
print('App Store Column Titles:\n', appStore_header)
print('\n')
print('Play Store Column Titles:\n', playStore_header)

App Store Column Titles:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Play Store Column Titles:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Now that we have the column titles, let's identify which attributes (columns) would be useful in our analysis and work with those. In the context of providing insight into boosting app downloads and advertisement engagement/revenue, the relevant attributes for the App Store dataset are: 

    'track_name', 'price', 'rating_count_tot', 'user_rating', 'cont_rating', 'prime_genre'
    
Relevant attributes for the Play Store dataset are:

    'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres'

## Data Cleaning
Before we can begin analyzing our data, we have to validate or "clean" it. The preparatory data cleaning process includes:

- Detecting inaccurate data, then correcting or omitting it
- Detecting duplicate data, then removing the duplicates

At our company, we develop **free** apps for an **English-speaking** audience. Therefore, we need to remove:

- Non-English apps, which can be identified by non-English characters in their names, such as 爱奇艺PPS 《欢乐颂2》电视剧热播.
- Paid apps

Let's begin the data cleaning process by detecting and deleting inaccurate or unfit data (non-English and paid apps).

The Google Play dataset we are using has a "wrong entry" in row 10472, as identified in the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the source website. Let's print the problem row and compare it with the column titles and a correct row to diagnose the problem.

In [5]:
# diagnose problem row
print('Column Titles:\n', playStore_header, '\n')
print('Valid entry:\n', playStore[0], '\n')
print('Problem entry:\n', playStore[10472], '\n')

Column Titles:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Valid entry:
 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

Problem entry:
 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 



Comparing the problem entry and a valid entry, we can see that the problem entry is missing the `'Category'` attribute, causing the list to be a different size and ultimately shifting the shape of the dataset. 

We can make this clearer by printing the item at index 1 (which corresponds to the `'Category'` attribute) and printing the length of the two entries.

In [6]:
# compare the items at index 1 and the lengths of the entries
print('Valid entry element at index 1:', playStore[0][1])
print('Problem entry element at index 1:', playStore[10472][1],'\n')

print('Number of columns:', len(playStore_header))
print('Valid entry length:', len(playStore[0]))
print('Problem entry length:', len(playStore[10472]))

Valid entry element at index 1: ART_AND_DESIGN
Problem entry element at index 1: 1.9 

Number of columns: 13
Valid entry length: 13
Problem entry length: 12


We can see that the valid entry's category at index 1 is "ART_AND_DESIGN", while the problem entry's value at index 1 is "1.9", the value meant for the next column, `'Rating'`. Furthermore, the problem entry has only 12 items when there are supposed to be 13, one per column.
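
The length comparison above can be generalized into a check over a whole dataset. The following is a minimal sketch, not part of the original notebook: `find_malformed_rows` is a hypothetical helper, and the sample rows are made up for illustration.

```python
# Hypothetical helper: report every row whose length differs from the header's.
def find_malformed_rows(dataset, header):
    """Return (index, row_length) pairs for rows that do not match the header length."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny illustrative dataset (not the real data): the second row lacks a category.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

print(find_malformed_rows(sample_rows, sample_header))
```

Called as `find_malformed_rows(playStore, playStore_header)` before the deletion, a check like this should flag index 10472, since that row has 12 items instead of 13.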

Let's remove this entry as we can't perform analysis on it without the necessary category attribute.

In [7]:
# delete the problem entry
# only run this code once as it will modify the original dataset!
del playStore[10472] 

Let's print the item at index 10472 to verify that the problem entry has been removed.

In [8]:
# print the item at index 10472
print(playStore[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Good. Let's repeat this process with the App Store dataset by searching the [discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) for reports of problem entries.

There don't seem to be any, according to the discussion section. Sweet!

## Removing Duplicate Entries
After removing wonky data, let's move on to removing duplicate entries. If you explore the Play Store dataset enough, you'll encounter duplicates. For example, Instagram appears four times in the Play Store dataset:

In [9]:
# print Instagram row items
print('Row items with name "Instagram" in Play Store dataset:')
for row in playStore:
    name = row[0]
    if name == 'Instagram':
        print(row)

Row items with name "Instagram" in Play Store dataset:
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1,181 duplicate apps:

In [10]:
# count duplicate apps
unique_apps = []
duplicate_apps = []

for row in playStore:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps),'\n')
print('Some duplicate apps:', duplicate_apps[:10])

Number of duplicate apps: 1181 

Some duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Now that we know how to find the duplicates, our problem is deciding which entry to keep and which entries to delete. We must develop a criterion, in the context of our goal, for choosing which entry to keep.

If you take another look at the four 'Instagram' entries we printed above, you'll notice that they do not all have the same number of reviews:

In [11]:
# print reviews attribute of 'Instagram' row items
print("'Reviews' attribute of row items with name 'Instagram':")
for row in playStore:
    name = row[0]
    if name == 'Instagram':
        print('Reviews: ', row[3])

'Reviews' attribute of row items with name 'Instagram':
Reviews:  66577313
Reviews:  66577446
Reviews:  66577313
Reviews:  66509917


In the context of our analysis, we want the most recent data possible so that we can extrapolate our findings to our company's apps. The entry with the highest number of reviews is likely the most recent 'Instagram' entry, so it is the one we should keep. In the output above, the second item has the most reviews at 66,577,446.

In [12]:
# store the highest review count seen for each unique app name
reviews_max = {}

for row in playStore:
    name = row[0]
    n_reviews = float(row[3])
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

Let's inspect the dictionary and verify that the number of unique items is correct by comparing it to the length of the unique_apps list we used earlier.

In [13]:
# verify the dictionary is the right size
print('Length of reviews_max dictionary:', len(reviews_max))
print('Length of unique_apps list:', len(unique_apps))

Length of reviews_max dictionary: 9659
Length of unique_apps list: 9659


Using the dictionary above, we can now remove duplicate entries and create a "clean" dataset for the Play Store apps. We'll do this by creating a new list, playStore_clean, for the new data and a second list, already_added, to store the names of apps we have already kept. Each entry in the playStore dataset is checked against the dictionary and the already_added list, so that only one row per app, the one with the highest number of reviews, is appended to playStore_clean.

In [14]:
playStore_clean = []
already_added = []

for row in playStore:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        playStore_clean.append(row)
        already_added.append(name)
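
The two-step approach above can also be collapsed into a single pass. The sketch below is an illustrative alternative rather than the notebook's method: `dedupe_by_max_reviews` is a hypothetical helper, and the 'Box' row uses a made-up review count. It stores the best row per app name in a dictionary, which avoids scanning a list of names for every row.

```python
# Hypothetical single-pass alternative: keep, per app name, the row with the most reviews.
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened illustrative rows: [name, category, rating, reviews].
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in dedupe_by_max_reviews(sample):
    print(row)
```

One difference to keep in mind: the result here is ordered by the first appearance of each app name, whereas playStore_clean is ordered by the position of each kept row, so the two lists may contain the same rows in a different order.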

Let's explore the clean dataset we created using the explore_data helper function:

In [15]:
print('Preview of playStore_clean dataset:\n')
explore_data(playStore_clean, 0, 3, True)

Preview of playStore_clean dataset:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows: 9659
Number of columns: 13


We have now successfully created a clean Play Store dataset, playStore_clean, that includes only unique, properly formatted rows. Fortunately, the App Store dataset doesn't have any duplicates and its items are formatted properly.

Next, we'll begin removing apps that do not target an English-speaking audience, namely apps with non-English characters in the app title, as our company develops apps in English. Here is an example of a non-English app title from each dataset:

In [16]:
print('Play Store foreign app at index 4412:\n', playStore_clean[4412][0],'\n')
print('App Store foreign app at index 813:\n', appStore[813][1])

Play Store foreign app at index 4412:
 中国語 AQリスニング 

App Store foreign app at index 813:
 爱奇艺PPS -《欢乐颂2》电视剧热播


A simple way to identify non-English characters in app names is to check the Unicode code point of each character. The ord() function returns the Unicode code point of a single character. Let's test the function.

In [17]:
print('"a" unicode:', ord('a'))
print('"A" unicode:', ord('A'))
print('"爱" unicode:', ord('爱'))
print('"5" unicode:', ord('5'))
print('"+" unicode:', ord('+'))

"a" unicode: 97
"A" unicode: 65
"爱" unicode: 29233
"5" unicode: 53
"+" unicode: 43


Notice that the character '爱' has a code point of 29,233, while the Latin letters, digits, and symbols tested above all fall below 128. Characters commonly used in English text are all within the ASCII range of 0-127, so any character for which ord() returns a number outside this range can be treated as a non-English character.

Let's write a function that checks whether a string has any non-English characters, then test it on some app names:

In [18]:
def is_english(string):
    retVal = True
    for char in string:
        if ord(char) > 127: 
            retVal = False
    return retVal

sample_app_names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go™ Free Office Suite', 'Instachat 😜']

for app in sample_app_names:
    if is_english(app):
        temp = "'{0}' is English".format(app)
        print(temp)
    else:
        temp = "'{0}' is not English".format(app)
        print(temp)

'Instagram' is English
'爱奇艺PPS -《欢乐颂2》电视剧热播' is not English
'Docs To Go™ Free Office Suite' is not English
'Instachat 😜' is not English


Another problem arises: English app titles with special characters such as '™' and '😜' fall outside the ASCII range, and we don't want to omit these as they are still English apps. It is difficult to account for every special non-ASCII character, so we'll change the criteria a little.

Let's rewrite the function so that an app is considered non-English only if its name has **3** or more characters outside the ASCII range of 0-127.

In [19]:
def is_english(string):
    foreignCount = 0
    for char in string:
        if ord(char) > 127: 
            foreignCount += 1
    return foreignCount < 3

for app in sample_app_names:
    if is_english(app):
        temp = "'{0}' is English".format(app)
        print(temp)
    else:
        temp = "'{0}' is not English".format(app)
        print(temp)

'Instagram' is English
'爱奇艺PPS -《欢乐颂2》电视剧热播' is not English
'Docs To Go™ Free Office Suite' is English
'Instachat 😜' is English


Nice. Now all but the app name with mostly foreign characters are considered English apps.
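
The counting logic can also be written more compactly, with the tolerance exposed as a parameter. This is a sketch under assumptions: `count_non_ascii`, `is_probably_english`, and the `max_non_ascii` parameter are hypothetical names, not part of the notebook. With the default of 2, the check is equivalent to requiring fewer than 3 non-ASCII characters.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for char in text if ord(char) > 127)

def is_probably_english(text, max_non_ascii=2):
    """Treat a name as English when it has at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

# Names taken from the examples used earlier in this section.
for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜', '中国語 AQリスニング']:
    print(name, '|', count_non_ascii(name), '|', is_probably_english(name))
```

Printing the count next to the verdict makes it easier to see how close a borderline name is to the threshold. The rest of the notebook continues to use is_english as defined above.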

Let's apply the is_english function to our playStore_clean and appStore datasets.

In [20]:
eng_playStore_apps = []
eng_appStore_apps = []

for row in playStore_clean:
    name = row[0]
    if is_english(name):
        eng_playStore_apps.append(row)

for row in appStore:
    name = row[1]
    if is_english(name):
        eng_appStore_apps.append(row)

print('eng_playStore_apps preview:\n')
explore_data(eng_playStore_apps, 0, 3, True)
print('\neng_appStore_apps preview:\n')
explore_data(eng_appStore_apps, 0, 3, True)

eng_playStore_apps preview:

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows: 9597
Number of columns: 13

eng_appStore_apps preview:

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', 

Number of rows in each dataset before and after filtering out non-English apps:

In [21]:
print('playStore_clean dataset length:', len(playStore_clean))
print('eng_playStore_apps dataset length:', len(eng_playStore_apps))
print('Non-english apps omitted:', len(playStore_clean) - len(eng_playStore_apps),'\n')
print('appStore dataset length:', len(appStore))
print('eng_appStore_apps dataset length:', len(eng_appStore_apps))
print('Non-english apps omitted:', len(appStore) - len(eng_appStore_apps),'\n')

playStore_clean dataset length: 9659
eng_playStore_apps dataset length: 9597
Non-english apps omitted: 62 

appStore dataset length: 7197
eng_appStore_apps dataset length: 6155
Non-english apps omitted: 1042 



So far we've:
- Removed wrong entries (entries missing value(s) for attributes)
- Removed duplicate entries, keeping one entry per app based on a relevant criterion
- Filtered out non-English apps

Our last step in the data cleaning process will be filtering out paid apps and isolating the free ones. We'll do this in a similar fashion. Note that prices are represented as strings by default! Let's refresh our memory with the column titles for each dataset and sample entries:

In [22]:
print('Play Store Column titles:\n', playStore_header)
print('Sample eng_playStore_apps entry:\n', eng_playStore_apps[3],'\n')
print('Appstore Column titles:\n', appStore_header)
print('Sample eng_appStore_apps entry:\n', eng_appStore_apps[0])

Play Store Column titles:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Sample eng_playStore_apps entry:
 ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'] 

Appstore Column titles:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Sample eng_appStore_apps entry:
 ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Notice that the Play Store attribute indicating whether an app is free or paid is 'Type' at index 6; the corresponding value in the sample entry is 'Free'. In the App Store dataset, the attribute for price is 'price' at index 4; the corresponding value in the sample entry is '0.0'. (Here is the documentation for the [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and [Play Store](https://www.kaggle.com/lava18/google-play-store-apps) datasets for a refresher on the column descriptions.)

We can identify free Play Store apps by the string **'Free'** at index **6**. Likewise, we can identify free App Store apps by a price of **'0.0'** at index **4**, which we convert to a float and compare with 0. Let's loop through both datasets and isolate the rows of free apps into separate lists.

In [23]:
free_playStore = []
free_appStore = []

for row in eng_playStore_apps:
    price = row[6]
    if price == 'Free':
        free_playStore.append(row)
        
for row in eng_appStore_apps:
    price = float(row[4])
    if price == 0:
        free_appStore.append(row)

print('free_playStore preview:')
explore_data(free_playStore, 0, 3, True)
print('\n')
print('free_appStore preview:')
explore_data(free_appStore, 0, 3, True)

free_playStore preview:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows: 8847
Number of columns: 13


free_appStore preview:
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2

Now we have finally finished the process of **data cleaning** where we:

1. Removed inaccurate data
2. Removed duplicate entries
3. Removed non-English apps
4. Filtered non-free apps
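
As a compact restatement of the last step, the free-app filter can be expressed with small predicate functions and list comprehensions. This is only a sketch: `is_free_play` and `is_free_apple` are hypothetical helpers, and the rows are shortened, made-up examples (including the 'Paid' and '$4.99' values) rather than entries from the datasets.

```python
# Column positions taken from the headers printed earlier.
PLAY_TYPE_INDEX = 6    # 'Type' in the Play Store header
APPLE_PRICE_INDEX = 4  # 'price' in the App Store header

def is_free_play(row):
    """A Play Store row is free when its 'Type' value is the string 'Free'."""
    return row[PLAY_TYPE_INDEX] == 'Free'

def is_free_apple(row):
    """An App Store row is free when its price converts to 0."""
    return float(row[APPLE_PRICE_INDEX]) == 0.0

# Shortened illustrative rows (padded with empty strings), not real entries.
play_rows = [
    ['Free App', 'TOOLS', '', '', '', '', 'Free', '0'],
    ['Paid App', 'TOOLS', '', '', '', '', 'Paid', '$4.99'],
]
apple_rows = [
    ['1', 'Free App', '', 'USD', '0.0'],
    ['2', 'Paid App', '', 'USD', '2.99'],
]

free_play = [row for row in play_rows if is_free_play(row)]
free_apple = [row for row in apple_rows if is_free_apple(row)]
print(len(free_play), len(free_apple))
```

Keeping the column indexes in named constants makes it harder to mix up the two datasets' layouts, which differ in both column order and value format.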

## Data Analysis

The goal of this analysis is to determine the characteristics of apps that are most likely to attract more users and ultimately generate more revenue through in-app ads. To minimize risks and overhead, the validation strategy for an app idea has three steps:

1. Build and release a minimal Android version of the app on the Google Play Store
2. If the app is well-received by users, develop it further
3. If the app is profitable after six months, implement an iOS version of the app and release it on the App Store

Our end goal is to develop successful apps that will be available on both the Google Play Store and the App Store. Therefore, it will be valuable to analyze the most common genres for each market to maximize our potential user base.

Let's begin the analysis by building frequency tables for columns pertaining to genre for each dataset. First, we'll take a look at the column titles and identify relevant attributes.

In [24]:
print('Play Store column titles:\n', playStore_header,'\n')
print('Appstore column titles:\n', appStore_header)

Play Store column titles:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Appstore column titles:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Columns that will help us in the context of most common genres for the Play Store include:
- 'Category' at index 1
- 'Genres' at index 9

For the App Store:
- 'prime_genre' at index 11

Let's build two helper functions to analyze the frequency tables:

- The first function will generate percentages for each frequency table
- The second function will display percentages in descending order

Dictionaries cannot be sorted in place by value. To work around this, we'll use the sorted() function on a list of tuples representing each key-value pair in a given frequency table.

In [25]:
# helper function to return a frequency table (counts, or percentages if pct=True)
def freq_table(dataset, index, pct=False):
    table = {}
    # create frequency table
    for row in dataset:
        field = row[index]
        if field not in table:
            table[field] = 1
        else:
            table[field] += 1
    
    # convert each count into a percentage of the dataset size
    if pct:
        for key, val in table.items():
            table[key] = (val/len(dataset))*100
    
    return table

# function to print a percentage frequency table in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index, True)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
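
The tuple-sorting idea used in display_table can be seen in isolation below, next to an alternative that passes a key function to sorted(). This is an illustrative sketch with made-up percentages, not output from the datasets.

```python
# Made-up percentages for illustration only.
table = {'Games': 58.3, 'Education': 3.7, 'Entertainment': 7.8}

# Approach used in display_table: build (value, key) tuples so sorting compares values first.
as_tuples = sorted([(value, key) for key, value in table.items()], reverse=True)

# Alternative: sort the (key, value) pairs directly by their value.
by_value = sorted(table.items(), key=lambda item: item[1], reverse=True)

print(as_tuples)
print(by_value)
```

The two approaches differ only in how ties are handled: the tuple approach falls back to comparing the keys, while the key-function approach keeps tied entries in their original order.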

Let's display the frequency tables in percentages of the columns prime_genre (index 11), Category (index 1), and Genres (index 9).

We'll start with the App Store prime_genre frequency table:

In [26]:
print("Descending percentage frequency table for 'prime_genre' column in free_appStore:")
display_table(free_appStore, 11)

Descending percentage frequency table for 'prime_genre' column in free_appStore:
Games : 58.25788323446769
Entertainment : 7.836403371838902
Photo & Video : 4.995316890415236
Education : 3.6840462066812365
Social Networking : 3.3093974399000934
Shopping : 2.5913206369029034
Utilities : 2.466437714642523
Sports : 2.1542304089915705
Music : 2.0605682172962845
Health & Fitness : 2.0293474867311896
Productivity : 1.7483609116453322
Lifestyle : 1.5610365282547611
News : 1.3424914142990947
Travel : 1.248829222603809
Finance : 1.0927255697783327
Weather : 0.8741804558226661
Food & Drink : 0.8117389946924758
Reference : 0.5307524196066188
Business : 0.5307524196066188
Book : 0.3746487667811427
Navigation : 0.18732438339057134
Medical : 0.18732438339057134
Catalogs : 0.1248829222603809


Let's now analyze these frequency tables. Be careful to not expand insights of a particular scope onto a wider scope. For example, if you find gaming apps the most common among free, English apps on Google Play, that does not mean that gaming apps are the most common on Google Play as a whole.

The top 3 occurring genres in the 'prime_genre' column of the App Store dataset are:

1. Games
2. Entertainment
3. Photo and Video

Just eyeing the frequency table, we see that games make up a whopping 58% of the free, English apps on the App Store. Most of the top-occurring genres in this scope are entertainment-oriented: Games (#1), Entertainment (#2), Photo & Video (#3), Social Networking (#5), and Shopping (#6). The market offers far more entertainment apps than practical-use or productivity apps, possibly because many people use their phones for entertainment more than as a tool.

A frequency table of genres alone is likely not enough to recommend an app profile. This table only tells us *how many* apps there are per genre and what share of the market each genre holds; it does not reflect how popular or well-received those apps are.

However, it is vital insight for our validation strategy in developing new apps. This data suggests that the market for gaming apps may be saturated, although the large supply may also reflect strong demand; the frequency table alone cannot tell us which.

Let's take a look at the 'Category' and 'Genres' frequency tables of the Play Store dataset:

In [27]:
print("Descending percentage frequency table for 'Category' column in free_playStore:")
display_table(free_playStore, 1)
print('\n')
print("Descending percentage frequency table for 'Genres' column in free_playStore:")
display_table(free_playStore, 9)

Descending percentage frequency table for 'Category' column in free_playStore:
FAMILY : 18.932971628800725
GAME : 9.698202780603594
TOOLS : 8.45484344975698
BUSINESS : 4.600429524132474
PRODUCTIVITY : 3.8996269922007465
LIFESTYLE : 3.888323725556686
FINANCE : 3.7074714592517237
MEDICAL : 3.537922459590822
SPORTS : 3.39097999321804
PERSONALIZATION : 3.3231603933536795
COMMUNICATION : 3.2327342602011986
HEALTH_AND_FITNESS : 3.0857917938284163
PHOTOGRAPHY : 2.950152594099695
NEWS_AND_MAGAZINES : 2.803210127726913
SOCIAL : 2.6675709279981916
TRAVEL_AND_LOCAL : 2.3397761953204474
SHOPPING : 2.2493500621679665
BOOKS_AND_REFERENCE : 2.136317395727365
DATING : 1.8650389962699219
VIDEO_PLAYERS : 1.797219396405561
MAPS_AND_NAVIGATION : 1.3903017972193965
FOOD_AND_DRINK : 1.2433593308466147
EDUCATION : 1.1642364643381937
ENTERTAINMENT : 0.9607776647451114
LIBRARIES_AND_DEMO : 0.938171131456991
AUTO_AND_VEHICLES : 0.9268678648129309
HOUSE_AND_HOME : 0.8025319317282694
WEATHER : 0.7912286650842093
EV

Surprisingly, the 'Category' and 'Genres' frequency tables paint a different picture for the Play Store than 'prime_genre' does for the App Store. The top five entries in the 'Category' frequency table (index 1) are:

1. Family (18.9%)
2. Game (9.7%)
3. Tools (8.5%)
4. Business (4.6%)
5. Productivity (3.9%)

The top five entries in the 'Genres' frequency table (index 9) are:

1. Tools (8.4%)
2. Entertainment (6.1%)
3. Education (5.4%)
4. Business (4.6%)
5. Productivity (3.9%)

Practical-use and productivity apps are more abundant than entertainment apps among free, English apps on the Play Store. The 'Family' and 'Game' categories in the 'Category' frequency table are a partial exception (28.6% combined), but they do not dominate the market as much as games alone do on the App Store (58%).

One possible explanation is that Android is more open than iOS, which makes app development more accessible. Additionally, Android devices are sometimes used as inexpensive general-purpose computers in electronics projects. Unlike iOS, which runs only on a small line of Apple products, Android is an operating system used by numerous brands of devices.

However, this data is not enough to curate a profile for profitable apps just yet. Like the 'prime_genre' frequency table for the App Store, these frequency tables lack insight into apps' popularity and user feedback, which are both essential to the validation process.

The 'Genres' column (index 9) is more granular because it includes multi-genre values such as 'Art & Design;Pretend Play', which would make the analysis more difficult and confusing. Let's use the 'Category' column (index 1) to perform analysis on the Play Store and the 'prime_genre' column for the App Store.

We have now seen that the App Store is dominated by entertainment apps, while the Play Store has a more even balance of productivity and entertainment apps. Now that we have a picture of the marketplace, we want to figure out which genres are the most popular. We'll do this by analyzing the 'Installs' column of the Play Store and the 'rating_count_tot' column of the App Store.
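
One practical detail for the Play Store side: the 'Installs' values seen in the previews are strings such as '500,000+', so they need to be converted before any arithmetic. A minimal sketch, assuming every value follows that pattern (digits, commas, and an optional trailing plus sign); `parse_installs` is a hypothetical helper.

```python
def parse_installs(installs):
    """Convert a string such as '500,000+' into the float 500000.0."""
    cleaned = installs.replace(',', '').replace('+', '')
    return float(cleaned)

# Values of this form appear in the dataset previews above.
for raw in ['10,000+', '500,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Because the values are open-ended buckets, the converted numbers are best read as lower bounds rather than exact install counts.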

Let's start with the Appstore.

In [28]:
# frequency table of genres; its keys are the distinct 'prime_genre' values
prime_genre = freq_table(free_appStore, 11)

print('Average user ratings per genre in Appstore:')

for genre in prime_genre:
    total = 0      # sum of rating counts (rating_count_tot) for this genre
    len_genre = 0  # number of apps in this genre
    for row in free_appStore:
        genre_app = row[11]
        if genre == genre_app:
            total += float(row[5])
            len_genre += 1
    # average number of user ratings per app in this genre
    avg_rat = total / len_genre
    print('{0}, ({1} apps): {2}'.format(genre, len_genre, avg_rat))

Average user ratings per genre in Appstore:
Social Networking, (106 apps): 71548.34905660378
Photo & Video, (160 apps): 28441.54375
Games, (1866 apps): 22886.36709539121
Music, (66 apps): 57326.530303030304
Reference, (17 apps): 79350.4705882353
Health & Fitness, (65 apps): 23298.015384615384
Weather, (28 apps): 52279.892857142855
Utilities, (79 apps): 19156.493670886077
Travel, (40 apps): 28243.8
Shopping, (83 apps): 27230.734939759037
News, (43 apps): 21248.023255813954
Navigation, (6 apps): 86090.33333333333
Lifestyle, (50 apps): 16815.48
Entertainment, (251 apps): 14195.358565737051
Food & Drink, (26 apps): 33333.92307692308
Sports, (69 apps): 23008.898550724636
Book, (12 apps): 46384.916666666664
Finance, (35 apps): 32367.02857142857
Education, (118 apps): 7003.983050847458
Productivity, (56 apps): 21028.410714285714
Business, (17 apps): 7491.117647058823
Catalogs, (4 apps): 4004.0
Medical, (6 apps): 612.0


The top 4 genres by average number of user ratings per app are Navigation, Reference, Social Networking, and Music. At a glance, it seems our company could maximize its userbase by developing a Navigation or Reference app, as these genres have the highest average number of ratings per app. However, they contain a mere 6 and 17 apps, respectively, so a few very popular apps may be skewing the averages. Let's take a look at a few of the top apps per genre by their number of ratings.
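Because a handful of very popular apps can pull a genre's average upward, it may help to look at the median alongside the mean. The following is a minimal sketch rather than part of the original analysis: `genre_mean_median` is a hypothetical helper, the sample rows are invented for illustration, and the default column indices (11 for `prime_genre`, 5 for `rating_count_tot`) are assumed to match `free_appStore`.

```python
from collections import defaultdict
from statistics import mean, median

def genre_mean_median(rows, genre_index=11, count_index=5):
    """Group rating counts by genre in one pass, then return
    {genre: (number of apps, mean, median)}."""
    counts = defaultdict(list)
    for row in rows:
        counts[row[genre_index]].append(float(row[count_index]))
    return {g: (len(v), mean(v), median(v)) for g, v in counts.items()}

# Invented rows (genre, rating count), for illustration only.
sample = [
    ['Navigation', '300000'],
    ['Navigation', '1000'],
    ['Navigation', '500'],
    ['Weather', '900'],
    ['Weather', '1100'],
]
summary = genre_mean_median(sample, genre_index=0, count_index=1)
for genre, (n, avg, med) in summary.items():
    print('{0} ({1} apps): mean {2:,.1f}, median {3:,.1f}'.format(genre, n, avg, med))
```

In the invented sample, one large app lifts the Navigation mean far above its median, while the two Weather apps have the same mean and median. A large gap between the two figures is a hint that the genre's average is driven by a few apps.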

In [29]:
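def select(ds, ds_header, name_index, sel_index, sel_val, value_index, pct=False, showShape=True, ex=(), ex_index=0):
    """Map app name -> value for rows whose column sel_index equals sel_val.

    Rows whose column ex_index holds a value listed in ex are skipped.
    With pct=True, each value becomes its percentage of the selection total.
    """
    ft = {}
    for row in ds:
        if row[sel_index] == sel_val and row[ex_index] not in ex:
            ft[row[name_index]] = float(row[value_index])
    totalVal = sum(ft.values())
    if pct and totalVal:  # skip the conversion when the total is zero
        for key, val in ft.items():
            ft[key] = round(val/totalVal * 100, 2)
    if showShape:
        # report the number of selected apps and the column total
        temp = 'Items: {0}, {1} value total: {2:,}'.format(len(ft), ds_header[value_index], totalVal)
        print(temp)
    return ft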
            
print("Navigation Apps:\n", select(free_appStore, appStore_header, 1, 11, 'Navigation', 5, True),'\n')
print("Reference Apps:\n", select(free_appStore, appStore_header, 1, 11, 'Reference', 5, True),'\n')
print("Social Networking Apps:\n", select(free_appStore, appStore_header, 1, 11, 'Social Networking', 5, True),'\n')
print("Music Apps:\n", select(free_appStore, appStore_header, 1, 11, 'Music', 5, True))

Items: 6, rating_count_tot value total: 516,542.0
Navigation Apps:
 {'Waze - GPS Navigation, Maps & Real-time Traffic': 66.8, 'Google Maps - Navigation & Transit': 29.99, 'Geocaching®': 2.48, 'CoPilot GPS – Car Navigation & Offline Maps': 0.69, 'ImmobilienScout24: Real Estate Search in Germany': 0.04, 'Railway Route Search': 0.0} 

Items: 17, rating_count_tot value total: 1,348,958.0
Reference Apps:
 {'Bible': 73.09, 'Dictionary.com Dictionary & Thesaurus': 14.83, 'Dictionary.com Dictionary & Thesaurus for iPad': 4.02, 'Google Translate': 1.99, 'Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran': 1.37, 'New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition': 1.3, 'Merriam-Webster Dictionary': 1.25, 'Night Sky': 0.9, 'City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE)': 0.63, 'LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools': 0.35, 'GUNS MODS for Minecraft PC Edition - Mods Tools': 0.11, 'Guides for P

Printed above are the apps in each genre, along with each app's percentage of the genre's total number of reviews. To come up with an app profile recommendation, we must consider the expected userbase and the nature of the app being developed.

In the Navigation market, 96.79% of the 500k+ reviews in the genre belong to just Waze and Google Maps, with few reviews for the rest. It would be difficult to compete against such popular and reputable apps when users only need one go-to navigation app. Navigation apps also require real-time updates and traffic and geographic data, and there is little merit in developing another one when reputable apps like Apple Maps and Google Maps suffice. A new navigation app would need unique functionality to give users a reason to download it.
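The concentration figure quoted for Navigation can be reproduced from the percentage table printed earlier. This is a small sketch: `top_k_share` is a hypothetical helper, not part of the notebook, and the dictionary simply copies the Navigation percentages from the output above.

```python
def top_k_share(pct_table, k):
    """Return the combined percentage held by the k largest entries."""
    largest = sorted(pct_table.values(), reverse=True)[:k]
    return sum(largest)

# Percentages copied from the Navigation output above.
navigation_pct = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 66.8,
    'Google Maps - Navigation & Transit': 29.99,
    'Geocaching®': 2.48,
    'CoPilot GPS – Car Navigation & Offline Maps': 0.69,
    'ImmobilienScout24: Real Estate Search in Germany': 0.04,
    'Railway Route Search': 0.0,
}
print('Top 2 share: {0:.2f}%'.format(top_k_share(navigation_pct, 2)))  # Top 2 share: 96.79%
```

The same helper could be applied to the dictionaries returned by `select(..., pct=True)` for the other genres to compare how concentrated each market is.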

In the Social Networking market, 71% of the 7.6 million reviews are distributed a bit more evenly across 8 popular apps, while the remaining 2.2 million reviews belong to smaller, more niche apps. Many Social Networking apps connect to pre-existing popular platforms with large user databases, such as Facebook, Instagram, and PlayStation, and these apps make up a large share of the reviews/userbase. However, some potential exists in niche analytics and utility apps such as "Followers + for Instagram - Follower Analytics" and "Quick Reposter - Repost, Regram and Reshare Photos", as successful apps of each type received around 7k-75k reviews. Popular dating apps also receive as many as 70k reviews each, so there is potential in developing a platform targeted at an underrepresented community.

In the Music market, 74.8% of the 3.8 million reviews belong to 5 of the most popular streaming apps, leaving ~1 million reviews among smaller music apps. There is some potential in music utility apps (such as video-to-audio converters, playlist makers, and ringtone apps), apps targeting niche communities (such as NBA podcasts), and entertainment music apps like karaoke. It would be difficult to compete against popular pre-existing streaming apps like Spotify, Pandora, and SoundCloud, as there is little merit in using an app with a smaller userbase.

In the Reference market, the Bible app alone accounts for 73% of the 1.3 million reviews, leaving 350k reviews among other apps. There is potential in wiki/guide apps dedicated to niche topics/communities such as Minecraft and Pokemon GO, which receive from thousands to tens of thousands of reviews per app.
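The "reviews left over" figures in the paragraphs above follow from simple arithmetic: the genre total multiplied by the share not held by the dominant apps. The sketch below uses the rounded totals and percentages quoted in the prose, so its results are approximations; `remaining_reviews` is a hypothetical helper introduced only for illustration.

```python
def remaining_reviews(total_reviews, dominant_pct):
    """Reviews left for the other apps once the dominant apps' share is removed."""
    return total_reviews * (100 - dominant_pct) / 100

# Rounded totals and dominant-app shares as quoted in the prose (approximate).
markets = {
    'Social Networking': (7.6e6, 71),
    'Music': (3.8e6, 74.8),
    'Reference': (1.3e6, 73),
}
for genre, (total, pct) in markets.items():
    left = remaining_reviews(total, pct)
    print('{0}: ~{1:,.0f} reviews outside the dominant apps'.format(genre, left))
```

With these rounded inputs the results land near 2.2 million, 1 million, and 350k reviews, in line with the figures cited for each market.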

Let's find the average number of reviews in each genre, excluding the biggest apps of each genre, for a better estimate of our userbase if we were to make a Social Networking, Music, or Reference app (it's impractical to develop in Navigation, as the market is heavily dominated and this type of app is expensive and difficult to develop).

In [31]:
appStore_SN_ex = select(free_appStore, appStore_header, 1, 11, 'Social Networking', 5, ex_index=1, ex=['Facebook','Pinterest','Skype for iPhone','Messenger','Tumblr','WhatsApp Messenger','Kik','ooVoo – Free Video Call, Text and Voice'], showShape=False)
appStore_R_ex = select(free_appStore, appStore_header, 1, 11, 'Reference', 5, ex_index=1, ex=['Bible', 'Dictionary.com Dictionary & Thesaurus'], showShape=False)
appStore_M_ex = select(free_appStore, appStore_header, 1, 11, 'Music', 5, ex_index=1, ex=['Pandora - Music & Radio','Spotify Music','Shazam - Discover music, artists, videos & lyrics','iHeartRadio – Free Music & Radio Stations', 'SoundCloud - Music & Audio'], showShape=False)
print('Reviews per Social Networking app (no major apps):\n', appStore_SN_ex,'\n')
print('Reviews per Reference app (no major apps):\n', appStore_R_ex,'\n')
print('Reviews per Music app (no major apps):\n', appStore_M_ex)
print('\n')

# average number of reviews per remaining (non-dominant) app in each genre
for label, table in [('Social Networking', appStore_SN_ex), ('Reference', appStore_R_ex), ('Music', appStore_M_ex)]:
    average = sum(table.values()) / len(table)
    print('Average amount of reviews per non-dominant {0} app:'.format(label), round(average, 2))

Reviews per Social Networking app (no major apps):
 {'TextNow - Unlimited Text + Calls': 164963.0, 'Viber Messenger – Text & Call': 164249.0, 'Followers - Social Analytics For Instagram': 112778.0, 'MeetMe - Chat and Meet New People': 97072.0, 'We Heart It - Fashion, wallpapers, quotes, tattoos': 90414.0, 'InsTrack for Instagram - Analytics Plus More': 85535.0, 'Tango - Free Video Call, Voice and Chat': 75412.0, 'LinkedIn': 71856.0, 'Match™ - #1 Dating App.': 60659.0, 'Skype for iPad': 60163.0, 'POF - Best Dating App for Conversations': 52642.0, 'Timehop': 49510.0, 'Find My Family, Friends & iPhone - Life360 Locator': 43877.0, 'Whisper - Share, Express, Meet': 39819.0, 'Hangouts': 36404.0, 'LINE PLAY - Your Avatar World': 34677.0, 'WeChat': 34584.0, 'Badoo - Meet New People, Chat, Socialize.': 34428.0, 'Followers + for Instagram - Follower Analytics': 28633.0, 'GroupMe': 28260.0, 'Marco Polo Video Walkie Talkie': 27662.0, 'Miitomo': 23965.0, 'SimSimi': 23530.0, 'Grindr - Gay and same s

Taking the above into consideration, a potential App Profile Recommendation for the Appstore is a Social Networking app, such as a dating app that provides a platform for an underrepresented but trending community.

The average number of reviews per non-dominant, Free, English Social Networking app in the Appstore is 17,985. If our Social Networking app is successful, we can expect a userbase of at least a comparable size, since every rating comes from a user.
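The dominant apps above were excluded by hand-picked name. As an alternative, one could drop the top few apps by rating count automatically. The sketch below is an assumption about how that might look, not the method used in this notebook: `average_excluding_top` is a hypothetical helper, the choice of `n_top` remains a judgment call, and the sample names and counts are invented.

```python
def average_excluding_top(counts, n_top):
    """Average the values in `counts` after dropping the n_top largest."""
    remaining = sorted(counts.values(), reverse=True)[n_top:]
    if not remaining:
        raise ValueError('n_top removes every app; nothing left to average')
    return sum(remaining) / len(remaining)

# Invented app names and rating counts, for illustration only.
sample_counts = {
    'App A': 900000.0,
    'App B': 400000.0,
    'App C': 30000.0,
    'App D': 20000.0,
    'App E': 10000.0,
}
print(round(average_excluding_top(sample_counts, 2), 2))  # 20000.0
```

The function accepts the same name-to-count dictionaries that `select` returns when called without `pct=True`, so it could be applied to a whole genre without listing app names.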

Let's group the Play Store's apps by genre and analyze its n_installs column to identify the most installed genre of apps.