# Understanding Application Profitability - Profiles for Users by Users 
---

Let's pretend that we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store. We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. 

Our __goal__ for this project is to __analyze data to help our developers understand what type of apps are likely to attract more users__.

## About The Datasets
---
The datasets contain a __lot__ of useful information about iOS apps from the App Store (collected July 2017) and Android apps from Google Play (collected Aug. 2018). For over 7,000 iOS apps and over 10,000 Android apps, they record each app's name, user rating, price, and more.

The App Store dataset can be downloaded for free [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/download) and the Google Play dataset [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv), while further description of each of their columns/variables can be found in their documentation: [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps), [Google Play](https://www.kaggle.com/lava18/google-play-store-apps). Here's a quick look:


| COLUMN       | DESCRIPTION    |
| :---------:  | :---------:    |
| "id"         | App ID         |
| "track_name" | App Name       |
| "size_bytes" | Size (in Bytes)|



In [1]:
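from csv import reader

# Open the file in a context manager so it is closed once it has been read
with open('AppleStore.csv', encoding='utf8') as f:
    app_store = list(reader(f))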

# First few rows
print(app_store[0])
print("\n")
print(app_store[1:5])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']]


Let's try exploring the datasets using a function that prints rows in a readable way. The function below takes as input:
- a __dataset__ in the form of a list of lists
- a __start__ and an __end__, which are integers giving the row indices used to slice the dataset (the row at index __end__ is not included)
- a boolean __rows_and_columns__ indicating whether to print the number of rows and columns (defaults to _False_)

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

After we supply the necessary arguments, the function slices the dataset to contain only the rows defined by __start__ and __end__, and iterates over each row. On each iteration the row is printed, followed by _"\n"_, the newline character, which leaves blank lines between rows and makes them more readable.

Let's see it in action below!

In [3]:
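# Load both datasets; the context managers close the files after reading
with open('AppleStore.csv', encoding='utf8') as f:
    app_store = list(reader(f))
with open('googleplaystore.csv', encoding='utf8') as f:
    goog_play = list(reader(f))

# Using the explore_data function we created
# We exclude the header row to get the correct number of rows
explore_data(app_store[1:], 0, 5, True)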

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
explore_data(goog_play[1:], 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


As we can see, the App Store dataset contains more columns/variables compared to the Google Play dataset. The exact number of rows for the former is __7,197__, while for the latter it's __10,841__. 

Now let's check and identify which columns could help us reach our goal and aid in our succeeding analyses.

In [5]:
print(app_store[0])
print("\n")
print(goog_play[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Data Cleaning
---
To have accurate analyses, we must first clean the data: identify cells that could distort the results, and thus the conclusions, of our analyses, and take the appropriate action:
- Detect inaccurate data, then remove or correct it
- Detect duplicate data, then remove it

Recall that we only build apps that are __free__ to download and install. Our apps are also primarily aimed at an __English-speaking__ audience. This means that we will:
- Remove apps that are not free
- Remove apps with non-English app names

### I. Duplicates, Inaccurate Data 

Let us begin with the Google Play dataset. In the dedicated discussion section on [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps/discussion), a [thread](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) identifies an app that is missing an entry under the _rating_ column. If this is the case, one way to find the row index of the app is to compare each row's length with the length of the header row and look for rows whose length differs.

In [6]:
# Length of header row
header_length = len(goog_play[0]) # 13 columns

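# Iterating over each row, tracking its index in goog_play
# (list.index() would return the first matching row, which is wrong for duplicated rows)
for index, row in enumerate(goog_play[1:], start=1):
    row_length = len(row)
    if row_length != header_length:
        print(goog_play[0])
        print("\n")
        print(row)
        print("\n")
        print("Row with index", index, "length is", row_length, "instead of", header_length, "!")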

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Row with index 10473 length is 12 instead of 13 !


As we can see, what really occurred is that the app was missing an entry for the _category_ column and all the other entries shifted to the left, column-wise. Let's look at the surrounding rows to check if this is indeed the case.

In [7]:
print(goog_play[10472])
print("\n")
print(goog_play[10473])
print("\n")
print(goog_play[10474])

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


For now, let's remove this row using the __del__ statement, making sure we only run this once.

In [8]:
del goog_play[10473]
print(goog_play[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
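Running the `del` statement a second time would silently remove a valid row. One way to make the step safe to re-run is to filter on the same length check used earlier instead of deleting by index. The sketch below uses a small made-up dataset and a hypothetical helper name, `drop_malformed_rows`; it is an illustration, not part of the original analysis.

```python
def drop_malformed_rows(dataset):
    """Return a copy of dataset without rows whose length differs from the header's.

    Hypothetical helper for illustration; the first row is assumed to be the header.
    """
    header_length = len(dataset[0])
    return [row for row in dataset if len(row) == header_length]


# Toy data: the second data row is missing its category value
toy = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.2'],
    ['Shifted App', '1.9'],
]

cleaned = drop_malformed_rows(toy)
cleaned_again = drop_malformed_rows(cleaned)  # re-running changes nothing

print(len(toy), len(cleaned), len(cleaned_again))
```

Because the filter builds a new list, applying it twice gives the same result as applying it once.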


We can also check for duplicate entries and get the total number of duplicate cases. For now, we'll use app names as the basis for duplicates.
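Before running this on the full dataset, the counting idea can be sketched on a toy list. The version below is an alternative that uses `collections.Counter` from the standard library to tally names in one pass; the app names in the list are made up for illustration.

```python
from collections import Counter

# Toy app names standing in for the first column of the dataset
names = ['Box', 'Instagram', 'Box', 'Slack', 'Instagram', 'Instagram']

name_counts = Counter(names)

# Each occurrence beyond the first counts as one duplicate case
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = sorted(name for name, count in name_counts.items() if count > 1)

print("# of Duplicates:", n_duplicates)
print("Duplicated names:", duplicated_names)
```

Counting "occurrences beyond the first" matches the definition of a duplicate case used in this notebook, where every repeated appearance of a name is counted.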

In [9]:
dups_apps = []
uniq_apps = []

for row in goog_play[1:]:
    name = row[0] # Recall that the first column describes the name of the app
    if name in uniq_apps:
        dups_apps.append(name)
    else:
        uniq_apps.append(name)

print("# of Duplicates:", len(dups_apps))
print("\n")
print("Some Duplicate Apps:", dups_apps[1:5])

# of Duplicates: 1181


Some Duplicate Apps: ['Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Thus, there are __1,181__ cases where an app name is seen more than once. While the names are the same, we can check if the values in the other columns for these apps differ and use those differences to choose which row to keep. Let's take _Instagram_ for example:

In [10]:
for row in goog_play[1:]:
    name = row[0]
    if name == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


While this may not hold for every duplicate, the _Instagram_ rows differ in the _Reviews_ column, which is the number of reviews for that app. It is reasonable to assume that the row with the highest number of reviews was collected most recently, so that is the row we keep.

To remove the duplicates using this criterion, we do the following:
- Create a dictionary where each key corresponds to a unique app name, while the dictionary value is the highest number of reviews for that app.
- Create a new dataset using this dictionary

In [11]:
reviews_max = {}

for row in goog_play[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print("# of Reviews for Instagram:", reviews_max["Instagram"])
print("Length of dictionary:", len(reviews_max))        

# of Reviews for Instagram: 66577446.0
Length of dictionary: 9659


As we can see, the dictionary has fewer entries than the dataset has rows because each app name appears only once: 10,840 rows minus the 1,181 duplicate cases leaves 9,659 unique names. The number of reviews stored for _Instagram_ is the maximum among its duplicates (66577313, __66577446__, 66577313, 66509917).

Let's now create a cleaned dataset using this dictionary. For this, we create a list __android_clean__ that we populate with rows from the original dataset, removing duplicates based on the dictionary built with the criterion above. We also create another list __already_added__, which handles duplicate app names that share the same number of reviews (see the app _Box_ for example). Without this list and the accompanying logic in the code, we would still be including duplicate apps.

In [12]:
android_clean = []
already_added = []

for row in goog_play[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print("Length of Cleaned Dataset:", len(android_clean))
print("\n")
explore_data(android_clean, 0, 4, True)


Length of Cleaned Dataset: 9659


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13


The length of the resulting dataset is now the same as the length of the dictionary. 
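The two steps above (build the dictionary of maximum review counts, then filter the rows) can also be combined into a single pass by storing the best row for each name. The sketch below is an alternative under that assumption; `keep_most_reviewed` is a hypothetical helper, and the review count used for _Box_ is made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows: [App, Category, Rating, Reviews]
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

deduped = keep_most_reviewed(toy_rows)
for row in deduped:
    print(row)
```

One difference from the two-step version: the output order here follows the first appearance of each app name rather than the position of the row that was kept.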

Looking at the dedicated discussion section for the App Store dataset on [Kaggle](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion), there seems to be a [thread](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409) describing possible duplicate rows. According to user Garrett Mayock:
> It looks like two apps appear twice in the data. 'Mannequin Challenge' and 'VR Roller Coaster' are the two non-unique values in the 'track_name' column.

> Both apps appear to have a change in 'id', 'size', and the ratings, an increase in 'ver' and 'rating_count_tot', and a change in 'sup_devices.num' and 'ipadSc_urls.num'. 'Mannequin Challenge' also appears to have a higher content rating (9+ instead of 4+).

Further discussion in the thread, however, reveals that the aforementioned apps do not have duplicates. That is, aside from the same names, further research and logical reasoning allow us to conclude that these apps are different from each other (see [this](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409#535304) and [this](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409#564686) posts).

### II. Non-English Apps

To check for non-English apps, we can look at each app name as a string or a series of individual characters, and check if those characters belong to a group of commonly used characters in English such as the English alphabet, numbers 0 to 9, punctuation marks, and others. 

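We first write a function that iterates over each character in a string (in our case, the app names) and checks it with the built-in __ord()__ function: characters commonly used in English text fall in the ASCII range, with code points from 0 to 127. Because some English app names contain a few symbols outside this range, such as ™ or emoji, the function returns __False__ only if more than three characters fall outside the range, and __True__ otherwise.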

In [13]:
def english(string):
    trigger = 0
    for character in string:
        if ord(character) > 127:
            trigger += 1
        if trigger > 3:
            return False
    return True

print(english('Instagram')) # True
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播')) # False
print(english('Docs To Go™ Free Office Suite')) # True
print(english('Docs To Go™™™ Free Office Suite')) # True
print(english('Docs To Go™™™™ Free Office Suite')) # False
print(english('Instachat 😜')) # True

True
False
True
True
False
True
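To see what the function is counting, it can help to print the code points of a few characters from the test strings above. The sketch below does that and adds a hypothetical helper, `count_non_ascii`, that reports how many characters in a name fall outside the 0 to 127 range.

```python
# Code points for a few characters that appear in the test strings above
for character in ['a', 'Z', '9', '&', '™', '爱', '😜']:
    print(repr(character), ord(character), ord(character) <= 127)


def count_non_ascii(string):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for character in string if ord(character) > 127)


print(count_non_ascii('Instagram'))
print(count_non_ascii('Docs To Go™ Free Office Suite'))
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

A name with a single ™ stays under the threshold of three, while the Chinese title exceeds it by a wide margin, which is why the two are classified differently.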


We now loop through the datasets and append only rows with app names identified as English in a separate list.

In [14]:
semifinal_googplay = [goog_play[0]]
semifinal_appstore = [app_store[0]]

for row in android_clean:
    if english(row[0]):
        semifinal_googplay.append(row)
        
for row in app_store[1:]:
    if english(row[1]):
        semifinal_appstore.append(row)
        
# Exploring the resulting datasets
explore_data(semifinal_googplay, 0, 4, True)
print("\n")
explore_data(semifinal_appstore, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9615
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215'

### III. Free Apps Only

Finally, we extract only the __free__ apps in both datasets. For the Google Play dataset, we'll refer to the _Type_ column, which reads 'Free' for free apps, and for the App Store dataset, we'll look at the _price_ column.

In [15]:
final_googplay = [goog_play[0]]
final_appstore = [app_store[0]]

for row in semifinal_googplay[1:]:
    if row[6] == 'Free':
        final_googplay.append(row)
        
for row in semifinal_appstore[1:]:
    if row[4] == "0.0":
        final_appstore.append(row)
        
print("FINAL CLEANED GOOGLE PLAY DATASET")
explore_data(final_googplay, 0, 4, True)
print("\n")
print("FINAL CLEANED APP STORE DATASET")
explore_data(final_appstore, 0, 4, True)


FINAL CLEANED GOOGLE PLAY DATASET
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


FINAL CLEANED APP STORE DATASET
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devi
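The App Store filter compares the price with the exact string "0.0", which would miss a free app recorded as "0" or "0.00". A slightly more defensive sketch converts the price to a number first. The helper name `is_free` is hypothetical, and the assumption that paid prices may carry a leading dollar sign is not verified in this notebook.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0', or '$4.99'; other formats are not handled.
    """
    return float(price_text.replace('$', '')) == 0.0


for price in ['0', '0.0', '0.00', '$4.99', '2.99']:
    print(price, '->', is_free(price))
```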

After removing inaccurate data, duplicate app cases, non-English apps, and non-free apps, our final datasets contain __8,864__ rows for Android and __3,223__ rows for iOS. Both counts include the header row, which leaves 8,863 Android apps and 3,222 iOS apps.

We are now ready to proceed with the analysis.

## Analysis of Google Play and App Store Apps
---
Recall that our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users, since our revenue is largely driven by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We want to find app profiles that are successful on both the App Store and Google Play, since our end goal is to have the app on both markets.

### I. Most Frequent Genres

We begin by building frequency tables to see which genres occur most frequently. Specifically, we build a table for the __prime_genre__ column of the App Store dataset, as well as for the __Category__ and __Genres__ columns of the Google Play dataset.

In [16]:
print(app_store[0])
print("\n")
print(goog_play[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We first create a function for generating frequency tables. Afterwards, we make use of another function that displays the data in a sorted fashion.

In [17]:
def freq_table(dataset, index):
    freqtable = {}
    total_number = 0
    
    for row in dataset:
        total_number += 1
        col = row[index]
        if col in freqtable:
            freqtable[col] += 1
        else:
            freqtable[col] = 1
    
    for key in freqtable:
        freqtable[key] = round((freqtable[key]/total_number)*100, 2)
    
    return freqtable   
        
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
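For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library, which handles both the counting and the sorting. The function name `freq_table_counter` and the toy rows are made up for illustration.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2)) for value, count in counts.most_common()]


# Toy rows: [name, genre]
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

for genre, percentage in freq_table_counter(toy_rows, 1):
    print(genre, ':', percentage)
```

One behavioral difference: when two values have the same count, `most_common()` lists them in the order they were first seen, whereas `display_table` breaks ties by sorting the keys in reverse order.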

##### App Store: prime_genre

Let's take a look at the App Store dataset first since we're concerned with only one column.

In [42]:
display_table(final_appstore[1:], 11) # prime_genre column

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Among the free English apps in the App Store, more than half are __Games__ at 58.16%. A large difference is seen between this genre and the next in the sorted list which is __Entertainment__ at 7.88%. Looking at the top five genres, most are focused on recreation and pleasure (Games, Entertainment, Photo & Video, and Social Networking) with only __Education__ as the genre meant for productivity making it into the top five.

It is also worth noting that the genres in the upper half of the list tend to offer a lot of customization and focus on individual experiences. There may, however, also be an element of sharing and community that drives app creation in these genres. Keep in mind that a large number of apps in a genre does not imply a large number of users.

Let's move on to the Google Play dataset.

##### Google Play: Category

In [19]:
display_table(final_googplay[1:], 1) # Category column

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


Among the free English apps on Google Play, the most frequently occurring _category_ is __FAMILY (18.9%)__, followed by __GAME (9.73%)__ and __TOOLS (8.46%)__. The distribution is noticeably different from that of the App Store: a much larger share of apps is made for productivity and practical purposes. The leading category of family-oriented apps may include games as well.

##### Google Play: genres

In [20]:
display_table(final_googplay[1:], 9) # genres column

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;B

Among the free English apps on Google Play, the most frequently occurring _genre_ is __Tools (8.45%)__, followed by __Entertainment (6.07%)__ and __Education (5.35%)__. The __Genres__ column has many more distinct values than the __Category__ column, and how the two columns differ is not clear.
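One visible difference is that a `Genres` value can hold several labels separated by semicolons, as in 'Art & Design;Pretend Play' from the rows printed earlier, which helps explain why this column has more distinct values. A minimal sketch of splitting such values follows; whether the first label always corresponds to the app's `Category` is an assumption that is not checked here.

```python
# Values copied from rows shown earlier in this notebook
genres_values = ['Art & Design', 'Art & Design;Pretend Play', 'Art & Design;Creativity']

for value in genres_values:
    parts = value.split(';')
    primary, extras = parts[0], parts[1:]
    print(primary, '|', extras)

primary_labels = [value.split(';')[0] for value in genres_values]
```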

Overall, the apps in our App Store dataset seem to most frequently belong in genres meant for entertainment while those in the Google Play dataset offer a more balanced distribution between apps for productivity and apps for recreation.

### II. Most Popular (by Install Count)

For this part of the analysis, we look at each genre by its number of users. Specifically, we calculate the average number of installs for each app genre. The Google Play dataset provides this in the __Installs__ column, while for the App Store dataset we use __rating_count_tot__ as a proxy, since there is no column for the number of installs per app.
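The averaging step can be sketched on toy data first. The version below is one possible approach that accumulates totals and counts per genre in a single pass with `collections.defaultdict`; the helper name `average_by_group` and the numbers in the toy rows are made up for illustration.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Average a numeric column per group in a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}


# Toy rows: [name, genre, rating_count]; the numbers are made up
toy_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]

averages = average_by_group(toy_rows, 1, 2)
print(averages)
```

A single pass avoids rescanning the whole dataset once per genre, which matters little at this size but scales better.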

##### App Store

In [21]:
genre_appstore = freq_table(final_appstore[1:], 11) # prime_genre column; rating_count_tot is at index 5

for genre in genre_appstore:
    total = 0
    len_genre = 0
    for row in final_appstore[1:]:
        genre_app = row[11]
        if genre_app == genre:
            num_ratings = float(row[5])
            total += num_ratings
            len_genre += 1
    
    ave = round(total/len_genre, 2)
    print(genre, ":", ave)

Photo & Video : 28441.54
Navigation : 86090.33
Education : 7003.98
Entertainment : 14029.83
Book : 39758.5
Food & Drink : 33333.92
Utilities : 18684.46
Shopping : 26919.69
Business : 7491.12
Weather : 52279.89
Health & Fitness : 23298.02
Medical : 612.0
Finance : 31467.94
Music : 57326.53
News : 21248.02
Reference : 74942.11
Lifestyle : 16485.76
Travel : 28243.8
Productivity : 21028.41
Social Networking : 71548.35
Catalogs : 4004.0
Games : 22788.67
Sports : 23008.9


Using our proxy variable, the genre with the most users is _Navigation_, with 86,090 ratings per app on average, followed by _Reference_ and _Social Networking_. Note that, except for Social Networking, the top genres by app frequency are no longer among the top genres by average ratings count. This supports the idea that app frequency does not offer a complete picture of app profitability. However, because we used the average, a genre's figure may be skewed by a few very popular apps and so may not reflect how popular a typical app in that genre is. Such is the case for _Navigation_, as seen below, where Waze and Google Maps dominate. We also look at apps under _Reference_ and _Social Networking_.

In [33]:
tot = float(0)
for row in final_appstore[1:]:
    if row[11] == "Navigation":
        tot = tot + float(row[5])

for row in final_appstore[1:]:
    if row[11] == "Navigation":
        print(row[1], ":", row[5], ",", round(float(row[5])/tot, 2)*100, "%")

Waze - GPS Navigation, Maps & Real-time Traffic : 345046 , 67.0 %
Google Maps - Navigation & Transit : 154911 , 30.0 %
Geocaching® : 12811 , 2.0 %
CoPilot GPS – Car Navigation & Offline Maps : 3582 , 1.0 %
ImmobilienScout24: Real Estate Search in Germany : 187 , 0.0 %
Railway Route Search : 5 , 0.0 %


Since Waze and Google Maps account for about __97%__ of the ratings count, the _Navigation_ genre may seem more popular than it really is. Furthermore, these two apps serve much the same purpose, so it may be hard to introduce a new app that handles navigation with integrated traffic information as effectively as they already do.
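One way to quantify this skew is to compare the mean with the median, which is less sensitive to a few very large values. The sketch below uses the six ratings counts printed above; the median is offered here as a common robust alternative, not something the original analysis computes.

```python
from statistics import mean, median

# Ratings counts for the six Navigation apps printed above
navigation_counts = [345046, 154911, 12811, 3582, 187, 5]

navigation_mean = round(mean(navigation_counts), 2)
navigation_median = median(navigation_counts)

print("Mean:", navigation_mean)
print("Median:", navigation_median)
```

The mean reproduces the 86,090.33 figure from the genre table, while the median of 8,196.5 is roughly ten times smaller, which shows how strongly the two largest apps pull the average up.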

In [34]:
tot = float(0)
for row in final_appstore[1:]:
    if row[11] == "Reference":
        tot = tot + float(row[5])

for row in final_appstore[1:]:
    if row[11] == "Reference":
        print(row[1], ":", row[5], ",", round(float(row[5])/tot, 2)*100, "%")

Bible : 985920 , 73.0 %
Dictionary.com Dictionary & Thesaurus : 200047 , 15.0 %
Dictionary.com Dictionary & Thesaurus for iPad : 54175 , 4.0 %
Google Translate : 26786 , 2.0 %
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418 , 1.0 %
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588 , 1.0 %
Merriam-Webster Dictionary : 16849 , 1.0 %
Night Sky : 12122 , 1.0 %
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535 , 1.0 %
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693 , 0.0 %
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497 , 0.0 %
Guides for Pokémon GO - Pokemon GO News and Cheats : 826 , 0.0 %
WWDC : 762 , 0.0 %
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718 , 0.0 %
VPN Express : 14 , 0.0 %
Real Bike Traffic Rider Virtual Reality Glasses : 8 , 0.0 %
教えて!goo : 0 , 0.0 %
Jishokun-Japanese English Dictionary &

This set of _Reference_ apps, on the other hand, shows more variation in what each app offers, and is overall interesting to work with. For example, the top few apps range from the Bible app to dictionary apps and Google Translate.

One idea is a Bible reading app that also lets users look up the definition of any word they highlight while reading. This might be especially helpful if the etymologies of the words are also provided. Lastly, the app could offer constantly improving translations through a feedback feature where users rate translations.

In [40]:
tot = float(0)
for row in final_appstore[1:]:
    if row[11] == "Social Networking":
        tot = tot + float(row[5])

for row in final_appstore[1:]:
    if row[11] == "Social Networking":
        print(row[1], ":", row[5], ",", round(float(row[5])/tot, 2)*100, "%")

Facebook : 2974676 , 39.0 %
Pinterest : 1061624 , 14.000000000000002 %
Skype for iPhone : 373519 , 5.0 %
Messenger : 351466 , 5.0 %
Tumblr : 334293 , 4.0 %
WhatsApp Messenger : 287589 , 4.0 %
Kik : 260965 , 3.0 %
ooVoo – Free Video Call, Text and Voice : 177501 , 2.0 %
TextNow - Unlimited Text + Calls : 164963 , 2.0 %
Viber Messenger – Text & Call : 164249 , 2.0 %
Followers - Social Analytics For Instagram : 112778 , 1.0 %
MeetMe - Chat and Meet New People : 97072 , 1.0 %
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414 , 1.0 %
InsTrack for Instagram - Analytics Plus More : 85535 , 1.0 %
Tango - Free Video Call, Voice and Chat : 75412 , 1.0 %
LinkedIn : 71856 , 1.0 %
Match™ - #1 Dating App. : 60659 , 1.0 %
Skype for iPad : 60163 , 1.0 %
POF - Best Dating App for Conversations : 52642 , 1.0 %
Timehop : 49510 , 1.0 %
Find My Family, Friends & iPhone - Life360 Locator : 43877 , 1.0 %
Whisper - Share, Express, Meet : 39819 , 1.0 %
Hangouts : 36404 , 0.0 %
LINE PLAY - Your Avatar 
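
The `14.000000000000002 %` printed for Pinterest in the output above is a floating-point artifact: the share is rounded to two decimals first and only then multiplied by 100. A minimal sketch of one alternative, scaling to a percentage before rounding, is shown here. The helper name `share_percent` is hypothetical and not part of the notebook.

```python
def share_percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    # Scale to a percentage first, then round once at the very end.
    return round(count / total * 100, 1)


# Rounding first and scaling afterwards reproduces the artifact seen above.
print(round(0.14, 2) * 100)

# Scaling first and rounding last gives a clean value.
print(share_percent(14, 100))
```

Rounding to one decimal place here would also show finer differences between apps than the whole-number percentages in the listing.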

While a plethora of apps is available under _Social Networking_, their sheer number suggests the genre is already very saturated, so differentiating a new app there would be a daunting task.

An additional step we could take is to remove the apps with outlying rating counts and recalculate the averages. We leave this step for a future analysis.
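
As a rough illustration of what that future step might look like, the sketch below recomputes an average rating count after dropping apps above a cutoff. The sample is just the first five rows of the _Social Networking_ listing above, and the cutoff of 1,000,000 ratings is an arbitrary assumption chosen only for demonstration.

```python
# (name, rating_count_tot) pairs copied from the Social Networking listing above.
sample = [
    ("Facebook", 2974676),
    ("Pinterest", 1061624),
    ("Skype for iPhone", 373519),
    ("Messenger", 351466),
    ("Tumblr", 334293),
]

CUTOFF = 1_000_000  # hypothetical threshold, not taken from the notebook


def average_count(rows, cutoff=None):
    """Average the rating counts, optionally skipping rows above the cutoff."""
    counts = [n for _, n in rows if cutoff is None or n <= cutoff]
    return sum(counts) / len(counts) if counts else 0.0


print("All apps:", round(average_count(sample)))
print("Without outliers:", round(average_count(sample, CUTOFF)))
```

A fixed cutoff is the simplest option; a percentile-based rule would be another reasonable choice for the full dataset.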

##### Google Play

For the Google Play dataset, we can look directly at the __Installs__ column to gauge app popularity. However, as we see below, the values are not precise: they are open-ended ranges. For the purpose of our analysis, we will take the values as is. That is, an app that has "1,000,000+" installs will be regarded as having 1,000,000 installs, and so on. We first need to clean the column by removing the commas and plus symbols and converting the strings to floats.

In [43]:
display_table(final_googplay[1:], 5) # Installs column

1,000,000+ : 15.73
100,000+ : 11.55
10,000,000+ : 10.55
10,000+ : 10.2
1,000+ : 8.39
100+ : 6.92
5,000,000+ : 6.83
500,000+ : 5.56
50,000+ : 4.77
5,000+ : 4.51
10+ : 3.54
500+ : 3.25
50,000,000+ : 2.3
100,000,000+ : 2.13
50+ : 1.92
5+ : 0.79
1+ : 0.51
500,000,000+ : 0.27
1,000,000,000+ : 0.23
0+ : 0.05
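
The cleaning described above (dropping commas and plus signs, then converting to a float) can be sketched as a small helper. The function name `parse_installs` is hypothetical; the notebook itself performs the same replacements inline.

```python
def parse_installs(raw):
    """Convert an Installs string such as "1,000,000+" to a float."""
    return float(raw.replace(",", "").replace("+", ""))


# Try the helper on a few of the tiers listed above.
for value in ("1,000,000+", "500+", "0+"):
    print(value, "->", parse_installs(value))
```

Wrapping the conversion in a function keeps the later loops shorter and makes the assumption (each "N+" counts as exactly N installs) explicit in one place.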


We now look again at the average number of installs per category using the following loops.

In [48]:
genre_googplay = freq_table(final_googplay[1:], 1) # Category

for genre in genre_googplay:
    total = 0
    len_category = 0
    for row in final_googplay[1:]:
        category_app = row[1]
        if category_app == genre:
            installs = row[5] # Installs
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = float(installs)
            total += installs
            len_category += 1
            
    ave = round(total / len_category)
    print(genre, ":", ave)

BUSINESS : 1712290
EVENTS : 253542
WEATHER : 5074486
NEWS_AND_MAGAZINES : 9549178
DATING : 854029
VIDEO_PLAYERS : 24727872
SPORTS : 3638640
SHOPPING : 7036877
BEAUTY : 513152
AUTO_AND_VEHICLES : 647318
TRAVEL_AND_LOCAL : 13984078
BOOKS_AND_REFERENCE : 8767812
PERSONALIZATION : 5201483
MEDICAL : 120551
HOUSE_AND_HOME : 1331541
SOCIAL : 23253652
EDUCATION : 1833495
HEALTH_AND_FITNESS : 4188822
PHOTOGRAPHY : 17840110
LIFESTYLE : 1437816
COMICS : 817657
ART_AND_DESIGN : 1986335
LIBRARIES_AND_DEMO : 638504
FINANCE : 1387692
GAME : 15588016
FAMILY : 3697848
COMMUNICATION : 38456119
TOOLS : 10801391
PARENTING : 542604
ENTERTAINMENT : 11640706
MAPS_AND_NAVIGATION : 4056942
PRODUCTIVITY : 16787331
FOOD_AND_DRINK : 1924898
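
Because the averages print in dictionary order, the ranking is hard to read off. Below is a small sketch of sorting them in descending order. The dictionary `averages` is a hypothetical stand-in that holds five of the values printed above.

```python
# Five category averages copied from the output above.
averages = {
    "BUSINESS": 1712290,
    "VIDEO_PLAYERS": 24727872,
    "SOCIAL": 23253652,
    "PHOTOGRAPHY": 17840110,
    "COMMUNICATION": 38456119,
}

# Sort (category, average) pairs by the average, largest first.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for category, average in ranked:
    print(f"{category} : {average:,}")
```

The `:,` format specifier also adds thousands separators, which makes the large numbers easier to compare.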


Using the __Installs__ column, the top category, though not readily apparent from the unsorted list, is _COMMUNICATION_ with an average of __38,456,119__ installs, followed by _VIDEO PLAYERS_ at __24,727,872__ and _SOCIAL_ at __23,253,652__ installs. Let's look at the top category first.

Below are some of the _COMMUNICATION_ apps with the most installs. It's worth noting that several apps have over a billion installs, which may again skew the averages we computed earlier.

In [60]:
for row in final_googplay[1:]:
    if row[1] == "COMMUNICATION" and (row[5] == "1,000,000,000+"):
        print(row[0], ":", row[5])

print("\n")     

for row in final_googplay[1:]:
    if row[1] == "COMMUNICATION" and (row[5] == "500,000,000+"):
        print(row[0], ":", row[5])

print("\n")     

for row in final_googplay[1:]:
    if row[1] == "COMMUNICATION" and (row[5] == "100,000,000+"):
        print(row[0], ":", row[5])        

WhatsApp Messenger : 1,000,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
Skype - free IM & video calls : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+


Google Duo - High Quality Video Calls : 500,000,000+
imo free video calls and chat : 500,000,000+
LINE: Free Calls & Messages : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Viber Messenger : 500,000,000+


imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
Firefox Browser fast & private : 100,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private

In fact, this skew is also present in the other two top categories, as we can see below.

_VIDEO PLAYERS_ is dominated by YouTube and Google Play Movies & TV, industry giants that would be difficult to compete against.

_SOCIAL_ is similarly difficult to enter, with social media giants such as Facebook and Instagram firmly established at the top.

In [66]:
print("========================")
print("VIDEO PLAYERS")
print("========================")
for row in final_googplay[1:]:
    if row[1] == "VIDEO_PLAYERS" and (row[5] == "1,000,000,000+"):
        print(row[0], ":", row[5])

print("\n")     

for row in final_googplay[1:]:
    if row[1] == "VIDEO_PLAYERS" and (row[5] == "500,000,000+"):
        print(row[0], ":", row[5])

print("\n")     

for row in final_googplay[1:]:
    if row[1] == "VIDEO_PLAYERS" and (row[5] == "100,000,000+"):
        print(row[0], ":", row[5])  

print("\n")     
print("========================")
print("SOCIAL")
print("========================")
for row in final_googplay[1:]:
    if row[1] == "SOCIAL" and (row[5] == "1,000,000,000+"):
        print(row[0], ":", row[5])

print("\n")     

for row in final_googplay[1:]:
    if row[1] == "SOCIAL" and (row[5] == "500,000,000+"):
        print(row[0], ":", row[5])

print("\n")     

for row in final_googplay[1:]:
    if row[1] == "SOCIAL" and (row[5] == "100,000,000+"):
        print(row[0], ":", row[5])  

========================
VIDEO PLAYERS
========================
YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+


MX Player : 500,000,000+


Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+


========================
SOCIAL
========================
Facebook : 1,000,000,000+
Google+ : 1,000,000,000+
Instagram : 1,000,000,000+


Facebook Lite : 500,000,000+
Snapchat : 500,000,000+


Tumblr : 100,000,000+
Pinterest : 100,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
LinkedIn : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


Thus, we recompute the averages after removing apps in the three highest install tiers (100,000,000+ and above) to check for up-and-coming categories that may be worth entering, and try to make an app profile recommendation for those.

In [61]:
for genre in genre_googplay:
    total = 0
    len_category = 0
    for row in final_googplay[1:]:
        category_app = row[1]
        if category_app == genre and row[5] not in ("1,000,000,000+", "500,000,000+", "100,000,000+"):
            installs = row[5] # Installs
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = float(installs)
            total += installs
            len_category += 1
            
    ave = round(total / len_category)
    print(genre, ":", ave)

BUSINESS : 1226919
EVENTS : 253542
WEATHER : 5074486
NEWS_AND_MAGAZINES : 1502842
DATING : 854029
VIDEO_PLAYERS : 5544878
SPORTS : 2994083
SHOPPING : 4640921
BEAUTY : 513152
AUTO_AND_VEHICLES : 647318
TRAVEL_AND_LOCAL : 2944080
BOOKS_AND_REFERENCE : 1437212
PERSONALIZATION : 2549776
MEDICAL : 120551
HOUSE_AND_HOME : 1331541
SOCIAL : 3084583
EDUCATION : 1833495
HEALTH_AND_FITNESS : 2005714
PHOTOGRAPHY : 7670532
LIFESTYLE : 1152129
COMICS : 817657
ART_AND_DESIGN : 1986335
LIBRARIES_AND_DEMO : 638504
FINANCE : 1086126
GAME : 6272565
FAMILY : 2344308
COMMUNICATION : 3603485
TOOLS : 3191461
PARENTING : 542604
ENTERTAINMENT : 6118250
MAPS_AND_NAVIGATION : 2484105
PRODUCTIVITY : 3379657
FOOD_AND_DRINK : 1924898


After this step, the categories whose averages dropped the most are those with a high concentration of dominant apps. Looking at the recomputed averages, the top category is now _PHOTOGRAPHY_ with an average install count of __7,670,532__, followed by _GAME_ and _ENTERTAINMENT_.
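
One way to quantify which categories were hit hardest is the relative drop between the two sets of averages. The sketch below uses six pairs of values copied from the two outputs above. The names `averages` and `relative_drop` are hypothetical and not part of the notebook.

```python
# category: (average with all apps, average without the 100,000,000+ tiers)
averages = {
    "COMMUNICATION": (38456119, 3603485),
    "VIDEO_PLAYERS": (24727872, 5544878),
    "SOCIAL": (23253652, 3084583),
    "PHOTOGRAPHY": (17840110, 7670532),
    "GAME": (15588016, 6272565),
    "ENTERTAINMENT": (11640706, 6118250),
}


def relative_drop(before, after):
    """Return the percentage decrease from before to after."""
    return (before - after) / before * 100


drops = {cat: relative_drop(b, a) for cat, (b, a) in averages.items()}

# Print the categories from the largest drop to the smallest.
for category, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category} : {drop:.1f}% lower")
```

Categories with a smaller relative drop depend less on a few giant apps, which is part of what makes them interesting candidates.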

In [68]:
for row in final_googplay[1:]:
    if row[1] == "PHOTOGRAPHY" and row[5] == "50,000,000+":
        print(row[0], ":", row[5])  

Motorola Camera : 50,000,000+
InstaBeauty -Makeup Selfie Cam : 50,000,000+
Selfie Camera - Photo Editor & Filter & Sticker : 50,000,000+
ASUS Gallery : 50,000,000+
Square InPic - Photo Editor & Collage Maker : 50,000,000+
VSCO : 50,000,000+
PhotoWonder: Pro Beauty Photo Editor Collage Maker : 50,000,000+
Photo Effects Pro : 50,000,000+
Photo Editor Selfie Camera Filter & Mirror Image : 50,000,000+
Pic Collage - Photo Editor : 50,000,000+
Photo Editor by Aviary : 50,000,000+
Video Editor Music,Cut,No Crop : 50,000,000+
Pixlr – Free Photo Editor : 50,000,000+
Adobe Photoshop Express:Photo Editor Collage Maker : 50,000,000+
InstaSize Photo Filters & Collage Editor : 50,000,000+
Snapseed : 50,000,000+
Keepsafe Photo Vault: Hide Private Photos & Videos : 50,000,000+
MakeupPlus - Your Own Virtual Makeup Artist : 50,000,000+
SNOW - AR Camera : 50,000,000+
Boomerang from Instagram : 50,000,000+
Photo Lab Picture Editor: face effects, art frames : 50,000,000+
MomentCam Cartoons & Stickers : 50,

Above are _PHOTOGRAPHY_ apps in the 50,000,000+ install tier, that is, with at least fifty million but fewer than a hundred million installs. These can be considered budding apps in a genre that, while saturated, benefits from a lot of variation. Intuitively, a typical user has likely tried more than one camera app, more than one filter app, and more than one photo/video editing app before settling on a preferred one.

Thus, it might be worth developing a photography app that combines these varied features while keeping quality in check. Aside from taking pictures and videos directly, whether selfies, panoramas, or boomerangs, the app would also let users apply filters, with dedicated high-quality presets available for free. Granular control over picture characteristics such as contrast, brightness, hue, and saturation would also be desirable since mobile photography is more popular than ever. Making collages from the photos could be a further step toward an all-in-one experience, but the execution should be creative and innovative so that people actually use the feature.

More intricate ideas include cropping photos not only to standard social media aspect ratios but also to the specific dimensions used by individual apps (Twitter and Instagram come to mind). Finally, the photos should be easy to share with the user's social circle. This could be done with a dedicated feed that the user can curate, or by partnering with popular social media apps to simplify publishing through them.

# Conclusions
---

For this project, we performed data cleaning and analysis on two readily available datasets on mobile apps in order to create profitable app profiles that our company can develop. We looked at app popularity and frequency to inform our decisions, and came up with two profiles:
* For the App Store, a dedicated Bible reading app whose main priority is making the text easier to interpret and understand by offering definitions and translations
* For Google Play, a photography app that combines a variety of features so that users can focus on using it rather than looking for other apps to do the job for them. Granular controls over photo parameters and intuitive sharing features are prioritized as mobile photography booms among both professionals and enthusiasts

For future analysis, we recommend also looking at app revenue, in-app purchases, and subscriptions for specific types of apps. Additional steps that account for skewed averages should also be considered.
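
One such step could be to report the median alongside the mean, since the median is far less sensitive to a handful of very large values. The sketch below is only an illustration: the list `tiers` holds five install tiers picked from the Installs table earlier in this analysis, not a real sample of any category.

```python
from statistics import mean, median

# Illustrative install tiers (assumed sample, not an actual category).
tiers = ["1,000,000,000+", "100,000,000+", "1,000,000+", "100,000+", "10,000+"]

# Same cleaning as in the analysis: drop "," and "+" and convert to float.
installs = [float(t.replace(",", "").replace("+", "")) for t in tiers]

print("mean:", mean(installs))
print("median:", median(installs))
```

With one billion-install app in the list, the mean lands far above what most of the values look like, while the median stays at the middle tier.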