# Profitable App Profiles for the App Store and Google Play Markets

The context for this project is that we work as data analysts for a fictional company that builds Android and iOS mobile apps, which are made available on Google Play and in the App Store.

Let us assume that our company only builds apps that are free to download and install, and that our main source of revenue is in-app advertisements. Hence, the number of users of our apps directly influences the revenue we earn for a given app - that is, the more users who view and engage with our ads, the better.

Ultimately, our goal is to analyse app data to help our developers better understand the types of apps that are likely to attract users.

## Opening and Exploring the Data

In order to achieve the aforementioned objective of this project, we will need to collate and analyse data about mobile apps available on both the App Store and Google Play. 

As of September 2018, there were approximately 2 million iPhone apps available on the App Store, and a similar quantity of 2.1 million Android apps on Google Play. It would not be feasible to collect data for over 4 million apps for our analysis, as that would require an enormous amount of time and computational resources. Instead, we will analyse only a sample of that data.

Rather than collecting new data ourselves, it is far more efficient to find relevant existing data. We have in fact found two datasets that are suitable for our objectives:
- [A dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing approximately 10,000 Android apps from Google Play. This data was collected in August 2018, and can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- [A dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing approximately 7,000 iOS apps from the App Store. This data was collected in July 2017, and can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

We will begin by opening these two datasets, which have been downloaded from Kaggle as `.csv` files.


In [1]:
import csv

# Apple Store apps
opened_file_ios = open("AppleStore.csv")
read_file_ios = csv.reader(opened_file_ios)
ios = list(read_file_ios)
ios_header = ios[0]
ios_data = ios[1:]

# Google Play Store apps
opened_file_android = open("googleplaystore.csv")
read_file_android = csv.reader(opened_file_android)
android = list(read_file_android)
android_header = android[0]
android_data = android[1:]
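
The cell above leaves both files open and relies on the platform's default text encoding. As a minimal sketch, assuming both files are UTF-8 encoded, the same loading step could be wrapped in a small helper. The name `load_csv` is hypothetical and not part of the original notebook.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`.

    Hypothetical helper; the UTF-8 default is an assumption about the files.
    """
    # The csv module documentation recommends newline="" for files passed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    # The file is closed automatically when the with-block ends.
    return rows[0], rows[1:]

# Usage (assumes the two files sit in the working directory):
# ios_header, ios_data = load_csv("AppleStore.csv")
# android_header, android_data = load_csv("googleplaystore.csv")
```

Using a `with` block closes each file as soon as its rows have been read, which the original cell does not do.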

The next step is to explore both datasets. We have written our own function to perform the following data exploration tasks as shown below:
- Prints the first few rows of the dataset
- Determines the number of rows and columns of the dataset (assumes that the dataset parameter does not contain a header row)

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n") # adds an empty line after each row for readability
        
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))

explore_data(ios_data, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [3]:
explore_data(android_data, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
print(ios_header)
print("\n")
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We observe that there are 7,197 apps and 16 attributes in the iOS App Store dataset. The columns in this dataset that may be useful for our analysis include `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, `user_rating`, and `prime_genre`.

There are 10,841 apps and 13 attributes in the Google Play Store dataset. For this dataset, the columns that might potentially be useful are `App`, `Category`, `Rating`, `Reviews`, `Type`, `Price`, and `Genres`.

## Deleting Erroneous Data

Before formally commencing our analysis, it is critical that we ensure our data is accurate, because otherwise our analysis will be incorrect. Hence, we need to:
- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.

Our fictional company only builds apps that are free to download and install, and ones that are targeted specifically to an English-speaking audience. This means that we will need to:
- Remove non-English apps
- Remove apps that are not free

The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion) on Kaggle, and [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error for a specific row. 

Let's now print the row at that index to verify this claim. We do not know whether the user who reported this error ignored the header row or not, so the actual index of the erroneous row may differ slightly in our case.

In [5]:
print(android_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


It appears that the user did in fact exclude the header row. Since we also removed the header from our dataset, index 10,472 points to the same erroneous row in our case.

Observing this row, the `Category` value is missing, so the second column contains the `Rating` value (`1.9`) instead, and every subsequent value is shifted one column to the left. As a result, the row has only 12 values rather than 13.

This is clearly an error and hence, we will remove this observation from the dataset. Let's also count the number of observations in the dataset before and after deleting this row to ensure that it has been removed.

In [6]:
print(len(android_data))
del android_data[10472]

10841


In [7]:
print(len(android_data))

10840
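
The erroneous row was located through a Kaggle discussion, but a row with a missing value can also be detected by comparing its length with the header's. The sketch below illustrates that check on made-up toy data. The function name `find_malformed_rows` is hypothetical, and the check only catches rows with a wrong number of fields, not other kinds of errors.

```python
def find_malformed_rows(data, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(data) if len(row) != len(header)]

# Toy data for illustration only (not the real dataset).
header = ["App", "Category", "Rating"]
data = [
    ["Sketch - Draw & Paint", "ART_AND_DESIGN", "4.5"],
    ["Broken row", "1.9"],  # Category missing, so only two fields
    ["Instagram", "SOCIAL", "4.5"],
]

print(find_malformed_rows(data, header))  # prints [1]
```

In the notebook, the equivalent call would be `find_malformed_rows(android_data, android_header)`.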


## Removing Duplicate Entries

### Part One

If you explore the Google Play dataset for long enough or have a look at the discussions section, you will realise that some apps have duplicate entries. For instance, Instagram has four entries.

In [8]:
for app in android_data:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Altogether, there are 1,181 instances of an app appearing more than once in the dataset:

In [9]:
duplicate_apps = []
unique_apps = []

for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print("Number of duplicate apps:", len(duplicate_apps))
print("\n")
print("Examples of duplicate apps:", duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Upon examining the rows we printed for the Instagram app, the only difference occurs in the fourth column of each row, which represents the number of reviews for the app. The different numbers suggest that the data was collected at different times.

We can leverage this fact to build a criterion for removing duplicates. Specifically, the higher the review count, the more recent we expect the data to be. As opposed to removing duplicates at random, we will only retain the row with the highest number of reviews for any given app.
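
The duplicate count above checks membership in a list, which becomes slow for large datasets. As an alternative sketch, `collections.Counter` can tally the app names in one pass. The function name `count_duplicates` is hypothetical, and the toy rows are made up for illustration.

```python
from collections import Counter

def count_duplicates(rows, name_idx=0):
    """Count the entries that repeat an app name already seen."""
    counts = Counter(row[name_idx] for row in rows)
    # An app that appears n times contributes n - 1 duplicate entries.
    return sum(n - 1 for n in counts.values())

# Toy rows for illustration only.
toy_rows = [["Instagram"], ["Box"], ["Instagram"], ["Instagram"], ["Slack"]]
print(count_duplicates(toy_rows))  # prints 2
```

This counts duplicates the same way as the loop above: every occurrence of a name after its first counts once.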

### Part Two

To remove the duplicates, we will do the following:
- Create a dictionary, where each dictionary key is a unique app name and the corresponding value is the highest number of reviews for a particular app
- Utilise the data stored in the dictionary to create a new dataset, which will only contain one entry per app (the one with the highest number of reviews).

In [10]:
reviews_max = {}

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [11]:
print("Expected length:", len(android_data) - 1181)
print("Actual length:", len(reviews_max))

Expected length: 9659
Actual length: 9659


By inspecting the length of the dictionary, we observe that there are 9,659 entries remaining in the Google Play dataset upon removing the duplicates. This is exactly what we expected given the 1,181 duplicates we found previously.

In [12]:
android_clean = [] 
already_added = []

for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Once again, we can confirm that the new `android_clean` dataset does in fact contain 9,659 rows as we expected.
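
The two-step approach above can also be expressed as a single pass that stores the best row seen so far for each app. This is only a sketch of an alternative, and the name `keep_most_reviewed` is hypothetical. Unlike `android_clean`, the result is ordered by the first appearance of each app name rather than by the position of the retained row.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # The strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened rows (first four columns only) taken from the outputs above.
sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Sketch - Draw & Paint", "ART_AND_DESIGN", "4.5", "215644"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]
deduplicated = keep_most_reviewed(sample)
print(len(deduplicated))   # prints 2
print(deduplicated[0][3])  # prints 66577446
```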

## Removing Non-English Apps

### Part 1

Recall that we only use English for the apps that we develop at our fictional company, and hence we would like to analyse only the apps that were designed for an English-speaking audience. However, both our datasets contain apps with names in other languages, which indicates that they are not designed for an English-speaking audience.

Because such apps are not relevant to our analysis, we need to devise a strategy for removing them. The approach that we will take is to remove each app whose name contains at least one character that is not commonly used in English text. Note that English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

The numbers corresponding to the characters we commonly use in an English text are in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Given this numeric range, we will build a function that detects whether or not a character belongs to the set of common English characters. 

If an app name contains a character with a value greater than 127, then it probably means that the app has a non-English name. Our app names are stored as strings. In Python, strings are both indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.

In [14]:
def is_english_app(name):
    for char in name:
        if ord(char) > 127:
            return False
    return True

is_english_app("Instagram")

True

In [15]:
is_english_app("爱奇艺PPS -《欢乐颂2》电视剧热播")

False

In [16]:
is_english_app("Docs To Go™ Free Office Suite")

False

In [17]:
is_english_app("Instachat 😜")

False

Upon testing the function we wrote for detecting non-English app names, we found that it has its limitations: it could not correctly identify certain apps with English names such as `Docs To Go™ Free Office Suite` and `Instachat 😜`. This is because emojis and characters like `™` fall outside the ASCII range, with corresponding numeric values exceeding 127.

In [18]:
print(ord("™"))
print(ord("😜"))

8482
128540


If we continue to use the above function, we will lose useful data, as many English apps will be incorrectly labelled as non-English. To minimise this data loss, we will only remove apps whose names contain more than three characters that fall outside the ASCII range (i.e. 0 to 127). This modified filter function is still not perfect, but should be fairly effective.

In [19]:
def is_english_app(name):
    n_non_ascii = 0
    for char in name:
        if ord(char) > 127:
            n_non_ascii += 1
    if n_non_ascii > 3:
        return False
    return True

is_english_app("Docs To Go™ Free Office Suite")

True

In [20]:
is_english_app("Instachat 😜")

True

In [21]:
is_english_app("爱奇艺PPS -《欢乐颂2》电视剧热播")

False
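
The same threshold rule can be written more compactly with a generator expression. This is only a sketch of an equivalent formulation. The name `is_english_app_compact` and the `max_non_ascii` parameter are hypothetical additions that make the threshold of three explicit.

```python
def is_english_app_compact(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    n_non_ascii = sum(1 for char in name if ord(char) > 127)
    return n_non_ascii <= max_non_ascii

print(is_english_app_compact("Docs To Go™ Free Office Suite"))  # prints True
print(is_english_app_compact("Instachat 😜"))                   # prints True
print(is_english_app_compact("爱奇艺PPS -《欢乐颂2》电视剧热播"))      # prints False
```

Exposing the threshold as a parameter makes it easy to test how sensitive the row counts are to the choice of three.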

We will now use our function `is_english_app` to filter out non-English apps from both datasets by looping through them. If an app name is identified as English, we will append the entire row to a separate list.

In [22]:
ios_eng = []
for app in ios_data:
    name = app[1]
    if is_english_app(name):
        ios_eng.append(app)
        
android_eng = []
for app in android_clean:
    name = app[0]
    if is_english_app(name):
        android_eng.append(app)

explore_data(ios_eng, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


In [23]:
explore_data(android_eng, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


After removing duplicate entries from the Google Play dataset and non-English apps from both datasets, 6,183 and 9,614 observations remain in the iOS App Store and Google Play Store datasets, respectively.

## Isolating the Free Apps

As previously mentioned, we are only interested in apps that are free to download and install. Currently, our datasets contain both free and paid apps and hence, we will need to isolate the data to only consist of free apps for our analysis.

This will be the final step in our data cleaning process before we start analysing the data.

In [24]:
ios_free = []
for app in ios_eng:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)
    
android_free = []
for app in android_eng:
    price = app[7]
    if price == "0":
        android_free.append(app)
        
print(len(ios_free))
print(len(android_free))

3222
8864


We observe that a total of 3,222 free English apps remain in the iOS App Store dataset, and 8,864 in the Google Play Store dataset.
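
The cell above treats the two price columns differently: the iOS price is converted to a float, while the Android price is compared against the string `"0"`. A small helper could normalise both. This is a sketch under the assumption that paid Android prices carry a leading dollar sign (for example `"$4.99"`), which is not shown in the outputs above. The name `parse_price` is hypothetical.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float.

    The '$' prefix is an assumption about how paid Android apps are recorded.
    """
    return float(text.replace("$", "").strip())

print(parse_price("0"))    # Android free app, prints 0.0
print(parse_price("0.0"))  # iOS free app, prints 0.0
```

With such a helper, both loops could use the same condition, `parse_price(price) == 0.0`.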

## Most Common Apps by Genre 

### Part One

To minimise risks and overhead, our validation strategy for an app idea comprises three steps:
1. Build a minimal Android version of that app, and add it to the Google Play store.
2. If the app is well received by users, it will be developed further.
3. If the app is still profitable after six months, an iOS version of the app will be built and added to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

We will begin our analysis by looking at the most common app genres for each market. For this, we will need to build frequency tables for the relevant columns in our datasets.

In [25]:
print(ios_header)
print(android_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Based on the column names in the header row of each dataset, we have identified the `prime_genre` column of the App Store dataset, and the `Category` and `Genres` columns of the Google Play dataset as being the most relevant for generating our frequency tables.

### Part Two

Let us now build two functions that we can use to analyse the frequency tables:
- A function to generate frequency tables that display percentages
- Another function we can use to display the percentages in descending order

In [26]:
def freq_table(dataset, index):
    table = {}
    total_count = 0
    for app in dataset:
        total_count += 1
        key = app[index]
        if key not in table:
            table[key] = 1
        else:
            table[key] += 1
    percentage_table = {}
    for key in table:
        percentage_table[key] = (table[key] / total_count) * 100
    return percentage_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ":", entry[0])
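
For comparison, the two functions above can be combined using `collections.Counter`, whose `most_common()` method already returns entries from most to least frequent. This is a sketch of an alternative, not the approach used in the rest of this project. The name `freq_table_pct` is hypothetical, and the toy rows are made up.

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Return (value, percentage) pairs for column `index`, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(key, count / total * 100) for key, count in counts.most_common()]

# Toy rows for illustration only: three games and one music app.
toy = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
for genre, pct in freq_table_pct(toy, 1):
    print(genre, ":", pct)
# prints:
# Games : 75.0
# Music : 25.0
```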

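### Part Three

We will begin by examining the frequency table for the `prime_genre` column of the App Store dataset.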



In [27]:
display_table(ios_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can observe that over half of the free English apps are games (58.2%). Entertainment apps are the second most common at almost 8%, followed by photo and video apps at nearly 5%. Only 3.66% of apps serve an educational purpose, and social networking apps come next, accounting for 3.29% of the apps in the App Store dataset.

From this, we get the impression that the App Store is dominated by apps that are designed for fun (e.g. games, entertainment, photo and video, social networking, sports, music), whereas apps that serve practical purposes (e.g. education, shopping, utilities, productivity, lifestyle) are far less common. However, the fact that fun apps are more numerous does not necessarily mean that they also have the largest number of users, as demand may not match supply.

Now, we will perform the same analysis on the `Genres` and `Category` columns of the Google Play dataset.

In [28]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The landscape on Google Play appears to differ considerably from the iOS App Store: fewer apps are designed for fun, and a sizeable number are designed for practical purposes. However, the largest category, Family (nearly 19% of apps), likely consists largely of games for children, so fun apps may be better represented than the category names suggest.

Even taking this into account, the Google Play store has a much better representation of practical apps, based on the `Category` column alone.

In [29]:
display_table(android_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

From the frequency table for the `Genres` column of the Google Play dataset, we can draw the same insight as with the `Category` column: Android apps are geared more towards practical use than fun when compared to iOS.

It is not clear why the Google Play dataset includes both the `Genres` and `Category` columns, as they carry largely the same information. However, we can see that the `Genres` column is more granular, as shown by the larger number of unique genre keys in this column. Because we only care about the bigger picture in our analysis, the `Genres` column is not particularly useful for us, and we will focus purely on the `Category` column from this point onwards.
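
The granularity of the two columns can be compared directly by counting the distinct values in each. The sketch below does this on three shortened rows taken from the earlier outputs. The name `n_unique` is hypothetical.

```python
def n_unique(rows, index):
    """Return the number of distinct values found in column `index`."""
    return len({row[index] for row in rows})

# (Category, Genres) pairs from the first three Google Play rows printed earlier.
sample = [
    ["ART_AND_DESIGN", "Art & Design"],
    ["ART_AND_DESIGN", "Art & Design;Pretend Play"],
    ["ART_AND_DESIGN", "Art & Design"],
]
print(n_unique(sample, 0))  # prints 1
print(n_unique(sample, 1))  # prints 2
```

In the notebook, the corresponding calls would be `n_unique(android_free, 1)` for `Category` and `n_unique(android_free, 9)` for `Genres`.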

## Most Popular Apps by Genre on the App Store

We would now like to determine the types of apps with the most users. One way to find out which genres are the most popular, in terms of user count, is to compute the average number of installs for apps of a particular genre. 

For the Google Play dataset, we can find this information in the `Installs` column. However, it is missing from the App Store dataset. As a workaround, we will use the total number of user ratings, found in the `rating_count_tot` column, as a proxy for the number of users.

We will begin by calculating the average number of user ratings per app genre on the App Store. To achieve this, we need to:
- Isolate apps of each genre
- Sum the user ratings for apps of each genre
- Divide the sum of user ratings by the number of apps belonging to each genre
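
The three steps above can be sketched as a single pass over the data that keeps a running total and a running count per genre. This is an illustrative alternative to looping over the dataset once per genre. The name `average_by_genre` is hypothetical, and the toy rows are made up.

```python
from collections import defaultdict

def average_by_genre(rows, genre_idx, count_idx):
    """Return {genre: average of column `count_idx`} computed in one pass."""
    totals = defaultdict(float)  # sum of rating counts per genre
    sizes = defaultdict(int)     # number of apps per genre
    for row in rows:
        genre = row[genre_idx]
        totals[genre] += float(row[count_idx])
        sizes[genre] += 1
    return {genre: totals[genre] / sizes[genre] for genre in totals}

# Toy rows for illustration only: [genre, rating count].
toy = [["Games", "100"], ["Games", "300"], ["Music", "50"]]
print(average_by_genre(toy, 0, 1))  # prints {'Games': 200.0, 'Music': 50.0}
```

For the App Store data, the corresponding call would be `average_by_genre(ios_free, 11, 5)`.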

In order to get all the unique app genres in the App Store dataset, let's start by generating a frequency table for the `prime_genre` column.

In [30]:
ios_genre_freq = freq_table(ios_free, 11)
print(ios_genre_freq)

{'Social Networking': 3.2898820608317814, 'Photo & Video': 4.9658597144630665, 'Games': 58.16263190564867, 'Music': 2.0484171322160147, 'Reference': 0.5586592178770949, 'Health & Fitness': 2.0173805090006205, 'Weather': 0.8690254500310366, 'Utilities': 2.5139664804469275, 'Travel': 1.2414649286157666, 'Shopping': 2.60707635009311, 'News': 1.3345747982619491, 'Navigation': 0.186219739292365, 'Lifestyle': 1.5828677839851024, 'Entertainment': 7.883302296710118, 'Food & Drink': 0.8069522036002483, 'Sports': 2.1415270018621975, 'Book': 0.4345127250155183, 'Finance': 1.1173184357541899, 'Education': 3.662321539416512, 'Productivity': 1.7380509000620732, 'Business': 0.5276225946617008, 'Catalogs': 0.12414649286157665, 'Medical': 0.186219739292365}


Now we will loop over the unique genres of the App Store dataset and compute the average number of user ratings for each.

In [31]:
for genre in ios_genre_freq:
    total = 0
    len_genre = 0

    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    
    avg_ratings_count = total / len_genre
    print("{}: {}".format(genre, avg_ratings_count))

Social Networking: 71548.34905660378
Photo & Video: 28441.54375
Games: 22788.6696905016
Music: 57326.530303030304
Reference: 74942.11111111111
Health & Fitness: 23298.015384615384
Weather: 52279.892857142855
Utilities: 18684.456790123455
Travel: 28243.8
Shopping: 26919.690476190477
News: 21248.023255813954
Navigation: 86090.33333333333
Lifestyle: 16485.764705882353
Entertainment: 14029.830708661417
Food & Drink: 33333.92307692308
Sports: 23008.898550724636
Book: 39758.5
Finance: 31467.944444444445
Education: 7003.983050847458
Productivity: 21028.410714285714
Business: 7491.117647058823
Catalogs: 4004.0
Medical: 612.0


We see that, on average, navigation apps have the highest number of user reviews. However, this figure is likely to be heavily driven by a select few apps that have significantly more reviews than the others.

We will loop through apps within the `Navigation` genre and inspect the number of reviews for each to validate whether this is the actual case.

In [32]:
def review_count(dataset, genre, name_ind, genre_ind, n_reviews_ind):
    for app in dataset:
        if app[genre_ind] == genre:
            print("{}: {}".format(app[name_ind], app[n_reviews_ind]))
    
review_count(ios_free, "Navigation", 1, 11, 5)

Waze - GPS Navigation, Maps & Real-time Traffic: 345046
Google Maps - Navigation & Transit: 154911
Geocaching®: 12811
CoPilot GPS – Car Navigation & Offline Maps: 3582
ImmobilienScout24: Real Estate Search in Germany: 187
Railway Route Search: 5


As expected, the average number of user reviews for Navigation apps is driven primarily by Waze and Google Maps, which together account for approximately 500,000 of the roughly 517,000 reviews in this genre.
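
One way to see how strongly a few apps pull the average upwards is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses the six Navigation review counts printed above. The median is shown only as an illustration and is not used elsewhere in this analysis.

```python
from statistics import mean, median

# Review counts of the six free Navigation apps listed above.
navigation_counts = [345046, 154911, 12811, 3582, 187, 5]

print(mean(navigation_counts))    # about 86,090, matching the genre average above
print(median(navigation_counts))  # prints 8196.5
```

The median is roughly a tenth of the mean, which supports the view that the genre average reflects two very large apps rather than a typical navigation app.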

Now let's inspect the review counts for apps in the next three most popular genres by average number of user reviews (i.e. Reference, Social Networking, and Music):

In [33]:
review_count(ios_free, "Reference", 1, 11, 5)

Bible: 985920
Dictionary.com Dictionary & Thesaurus: 200047
Dictionary.com Dictionary & Thesaurus for iPad: 54175
Google Translate: 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran: 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition: 17588
Merriam-Webster Dictionary: 16849
Night Sky: 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE): 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools: 4693
GUNS MODS for Minecraft PC Edition - Mods Tools: 1497
Guides for Pokémon GO - Pokemon GO News and Cheats: 826
WWDC: 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free: 718
VPN Express: 14
Real Bike Traffic Rider Virtual Reality Glasses: 8
教えて!goo: 0
Jishokun-Japanese English Dictionary & Translator: 0


In [34]:
review_count(ios_free, "Social Networking", 1, 11, 5)

Facebook: 2974676
Pinterest: 1061624
Skype for iPhone: 373519
Messenger: 351466
Tumblr: 334293
WhatsApp Messenger: 287589
Kik: 260965
ooVoo – Free Video Call, Text and Voice: 177501
TextNow - Unlimited Text + Calls: 164963
Viber Messenger – Text & Call: 164249
Followers - Social Analytics For Instagram: 112778
MeetMe - Chat and Meet New People: 97072
We Heart It - Fashion, wallpapers, quotes, tattoos: 90414
InsTrack for Instagram - Analytics Plus More: 85535
Tango - Free Video Call, Voice and Chat: 75412
LinkedIn: 71856
Match™ - #1 Dating App.: 60659
Skype for iPad: 60163
POF - Best Dating App for Conversations: 52642
Timehop: 49510
Find My Family, Friends & iPhone - Life360 Locator: 43877
Whisper - Share, Express, Meet: 39819
Hangouts: 36404
LINE PLAY - Your Avatar World: 34677
WeChat: 34584
Badoo - Meet New People, Chat, Socialize.: 34428
Followers + for Instagram - Follower Analytics: 28633
GroupMe: 28260
Marco Polo Video Walkie Talkie: 27662
Miitomo: 23965
SimSimi: 23530
Grindr - G

In [35]:
review_count(ios_free, "Music", 1, 11, 5)

Pandora - Music & Radio: 1126879
Spotify Music: 878563
Shazam - Discover music, artists, videos & lyrics: 402925
iHeartRadio – Free Music & Radio Stations: 293228
SoundCloud - Music & Audio: 135744
Magic Piano by Smule: 131695
Smule Sing!: 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music: 110420
Amazon Music: 106235
SoundHound Song Search & Music Player: 82602
Sonos Controller: 48905
Bandsintown Concerts: 30845
Karaoke - Sing Karaoke, Unlimited Songs!: 28606
My Mixtapez Music: 26286
Sing Karaoke Songs Unlimited with StarMaker: 26227
Ringtones for iPhone & Ringtone Maker: 25403
Musi - Unlimited Music For YouTube: 25193
AutoRap by Smule: 18202
Spinrilla - Mixtapes For Free: 15053
Napster - Top Music & Radio: 14268
edjing Mix:DJ turntable to remix and scratch music: 13580
Free Music - MP3 Streamer & Playlist Manager Pro: 13443
Free Piano app by Yokee: 13016
Google Play Music: 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes: 9975
TIDAL: 7398
YouTube Music: 7109
Nicki Minaj: The

We see the same trend for the other most popular app genres: one or two apps within a genre dominate the number of user reviews. The review count for the reference genre, for instance, is heavily influenced by the Bible and Dictionary.com apps. Similarly, social networking apps are dominated by Facebook and Pinterest, and music apps by Pandora and Spotify.

This makes certain genres of apps seem more popular than they actually are as these dominant apps significantly skew the average review count for each genre. Removing these overwhelmingly popular apps for each genre will make the analysis more accurate and provide us with a better picture of app popularity, but we will not concern ourselves with doing so just yet.
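To get a rough sense of how strongly a couple of giants can pull an average, the sketch below compares the mean and the median of a handful of review counts copied from the Music output above. It uses only nine apps from the genre, so the numbers are illustrative rather than the true genre statistics, and the variable names are our own.

```python
from statistics import mean, median

# Review counts copied from the Music output above (a subset, for illustration only).
music_reviews = {
    "Pandora - Music & Radio": 1126879,
    "Spotify Music": 878563,
    "Shazam - Discover music, artists, videos & lyrics": 402925,
    "Amazon Music": 106235,
    "Sonos Controller": 48905,
    "Napster - Top Music & Radio": 14268,
    "Google Play Music": 10118,
    "TIDAL": 7398,
    "YouTube Music": 7109,
}

counts = sorted(music_reviews.values(), reverse=True)

mean_all = mean(counts)                  # pulled upwards by Pandora and Spotify
median_all = median(counts)              # robust to a few very large values
mean_without_top_two = mean(counts[2:])  # mean after dropping the two largest apps

print("Mean (all apps):", round(mean_all))
print("Median (all apps):", median_all)
print("Mean (without top two):", round(mean_without_top_two))
```

In this subset the mean sits far above the median, which is the skew described above; dropping the two largest apps brings the mean much closer to what a typical app in the list receives.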

Despite this pattern, the reference genre does show some promise in terms of the innovations that can be made when developing new reference apps. For example, we could take a popular book and develop an app version of it that offers features beyond the raw text of the physical book, such as an audio version and daily quotes. Furthermore, embedding a dictionary within the book app can make it more convenient to use, as users would no longer need to exit the app and refer to an external source.

This idea fits well with the fact that the iOS App Store is dominated by apps made for fun. As the market is saturated with entertainment apps, a practical app is more likely to stand out among the enormous number of apps available on the App Store, which would benefit our fictional app development business.

Now, let's perform the same analysis to determine the most popular genres of apps on Google Play and come up with an ideal app profile for the Google Play store.

## Most Popular Apps by Genre on Google Play

For the Google Play market, we have actual data about the number of installs for each app, so we should be able to get a clearer and more accurate picture of genre popularity.

Let's have a look by creating a frequency table for the `Installs` column of the Google Play dataset.

In [36]:
display_table(android_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Although we have information about the number of app installs, the values in this column lack precision, so we do not know exactly how many installs any of the apps have (except the ones with no installs). However, this is not a major problem for our analysis: we only want to get an idea of which apps are likely to attract the most users, and we do not need perfect precision to achieve that goal.

Instead, we are going to leave the numbers in the `Installs` column as they are but disregard the plus symbol (+). This means that we will consider an app with 100,000+ installs to have exactly 100,000 installs, for instance. However, we will need to convert each value in the `Installs` column from a string to a float in order to perform calculations on it. Because of the commas and plus symbols in each value, attempting to convert them directly will result in an error, which means we first need to clean the data by removing these characters.
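As a minimal sketch of this cleaning step, the snippet below shows that a direct conversion fails and that removing the two characters first makes it work. The helper name `parse_installs` is our own and is not used elsewhere in this notebook.

```python
def parse_installs(raw):
    """Convert an install bucket such as '100,000+' to a float (hypothetical helper)."""
    cleaned = raw.replace("+", "").replace(",", "")
    return float(cleaned)


# Direct conversion fails because of the comma and plus characters.
try:
    float("100,000+")
    direct_conversion_ok = True
except ValueError:
    direct_conversion_ok = False

print("Direct conversion works:", direct_conversion_ok)
print(parse_installs("100,000+"))
print(parse_installs("0"))
```

The cell that follows applies the same two `replace` calls inline while looping over the dataset.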

In [37]:
android_cat_freq = freq_table(android_free, 1)
# Average number of installs per category (column 1 = category, column 5 = installs)
for category in android_cat_freq.keys():
    total = 0
    len_category = 0
    for row in android_free:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = total / len_category
    print(category, ":", avg_installs)    

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

In contrast to apps in the iOS App Store, apps made for practical purposes appear to dominate the Android market based on the number of installs. In particular, communication apps have the most installs on average, at over 38 million. Once again, however, this number is heavily skewed by a small number of incredibly popular apps with 100 million installs or more.

Let's now investigate this further by determining which apps specifically have been installed over 100 million times and filter them out of the dataset to see how the average changes in that case.

In [40]:
for app in android_free:
    if app[1] == "COMMUNICATION" and (app[5] == "1,000,000,000+"
                                      or app[5] == "500,000,000+"
                                      or app[5] == "100,000,000+"):
        print(app[0], ":", app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In [45]:
under_100_m_installs = []

for app in android_free:
    n_installs = app[5]
    n_installs = n_installs.replace("+", "")
    n_installs = n_installs.replace(",", "")
    n_installs = float(n_installs)
    if (app[1] == "COMMUNICATION") and (n_installs < 100000000):
        under_100_m_installs.append(n_installs)
        
sum(under_100_m_installs) / len(under_100_m_installs)

3603485.3884615386

As expected, removing the communication apps that have 100 million or more installs reduces the average for this category by a factor of more than 10.
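The size of that drop can be checked directly from the two averages printed above. The sketch below does that arithmetic and also wraps the filtering logic in a small reusable function; `average_installs` and the toy rows are hypothetical and only mirror the column layout used in the cells above (name in column 0, category in column 1, installs in column 5).

```python
# Averages copied from the outputs above.
avg_all_communication = 38456119.167247385
avg_under_100m = 3603485.3884615386

reduction_factor = avg_all_communication / avg_under_100m
print(f"The average drops by a factor of about {reduction_factor:.1f}")


def average_installs(dataset, category, max_installs=None):
    """Average installs for a category, optionally keeping only apps below max_installs.

    Hypothetical helper; assumes the category is in column 1 and the
    install bucket (e.g. '100,000+') is in column 5.
    """
    values = []
    for row in dataset:
        if row[1] != category:
            continue
        n_installs = float(row[5].replace("+", "").replace(",", ""))
        if max_installs is None or n_installs < max_installs:
            values.append(n_installs)
    return sum(values) / len(values) if values else None


# Made-up rows for illustration only (not taken from the dataset).
toy_rows = [
    ["Giant messenger", "COMMUNICATION", "", "", "", "1,000,000,000+"],
    ["Small chat app", "COMMUNICATION", "", "", "", "1,000,000+"],
    ["Tiny dialer", "COMMUNICATION", "", "", "", "100,000+"],
    ["Video app", "VIDEO_PLAYERS", "", "", "", "10,000,000+"],
]

toy_avg_all = average_installs(toy_rows, "COMMUNICATION")
toy_avg_capped = average_installs(toy_rows, "COMMUNICATION", max_installs=100_000_000)
print(toy_avg_all, toy_avg_capped)
```

With the real dataset, calling the function as `average_installs(android_free, "VIDEO_PLAYERS", max_installs=100_000_000)` would be one way to repeat the comparison for another category.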

The same trend holds for several other categories that are dominated by a few ubiquitous apps. For example, the video players category, which is the second most popular, has an average of almost 25 million installs but is dominated by apps such as YouTube, Google Play Movies & TV, and MX Player. Further down the list, we have social apps dominated by the likes of Facebook and Instagram, photography apps by Google Photos and several other photo editors, and productivity apps by the likes of MS Word, Dropbox, and Google Calendar.

Once again, the fact that these categories are dominated by a few extremely popular apps should raise concerns for us, as the average number of installs makes these categories seem far more popular than they actually are. Furthermore, these few giants are hard to compete against in their respective categories.

Whilst the games genre seems promising at first glance because it is popular without being overly dominated by a few apps, our analysis of iOS apps showed that the market is already saturated with apps made for fun.

Exploring further, the books and reference category also appears to be fairly popular, with an average of roughly 8.8 million installs. We similarly found that this genre shows potential on the App Store. Because our aim is to recommend an app profile that is profitable for both the App Store and Google Play markets, this genre is worth a closer look.

Let's view the number of installs for the apps in the books and reference category:

In [47]:
for app in android_free:
    if app[1] == "BOOKS_AND_REFERENCE":
        print(app[0], ":", app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This particular niche appears to be dominated by apps for reading and processing e-books, along with dictionaries and collections of books. Because these apps are already so popular, building similar ones is unlikely to be profitable given the competition.

We notice that there are several apps further down the list built around religious texts such as the Bible and the Quran. This suggests that building an app around a specific book of interest might be profitable for both the App Store and Google Play.

However, we do need to add special features to these eBook apps in order to make them stand out. This may include features like an audio version of the book, daily quotes, and knowledge quizzes on these books. This profile is essentially the same as our recommendation for the App Store.

## Conclusions

Throughout this project, we analysed mobile app data across both the App Store and Google Play markets with the aim of recommending an app profile that would be profitable across both stores.

From our analysis, we reached the conclusion that converting a popular book (especially a recent one) into an electronic format in an app may potentially be profitable for both markets. However, this likely only holds true if developers add novel features to these apps to make them stand out.