# Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll try to analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first try to see if we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our goals:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Let's open both data sets:

In [1]:
from csv import reader

# Google Play Store Data #
with open("googleplaystore.csv", encoding="utf8") as file:
    read_file = reader(file)
    android_data = list(read_file)
    android_header = android_data[0]
    android = android_data[1:]

# Apple App Store Data #
with open("AppleStore.csv", encoding="utf8") as file:
    read_file = reader(file)
    ios_data = list(read_file)
    ios_header = ios_data[0]
    ios = ios_data[1:]

To make the data sets easier to explore, we created a function named explore_data() that we can use repeatedly to print rows in a readable way. Below, we call it only to report the number of rows and columns, then display the first three rows of the Google Play Store Dataset as a table:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

explore_data(android, 0, 0, True)

import pandas as pd
data = android[0:3]
pd.DataFrame(data, columns=android_header, index=[f"Row: {x + 1}" for x in range(len(data))])

Number of rows: 10841
Number of columns: 13


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Row: 1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
Row: 2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
Row: 3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


The Google Play Store Dataset contains 10,841 rows of data in 13 columns. At a glance, the columns which could help us with our analysis are:

- App
- Category
- Rating
- Reviews
- Installs
- Type
- Price
- Genres

Here are the first three rows of the Apple iOS Store Dataset:

In [3]:
explore_data(ios, 0, 0, True)

import pandas as pd
data = ios[0:3]
pd.DataFrame(data, columns=ios_header, index=[f"Row: {x + 1}" for x in range(len(data))])

Number of rows: 7197
Number of columns: 16


Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
Row: 1,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
Row: 2,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
Row: 3,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1


The Apple iOS Store Dataset contains 7,197 rows of data in 16 columns. Not all the columns are self-explanatory; refer to the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for more details about each column. At a glance, the columns which could help us with our analysis are:

- track_name
- currency
- price
- rating_count_tot
- rating_count_ver
- prime_genre

## Deleting Wrong Data 

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in row 10,472. Let's print that row and compare it with a correct row:

In [4]:
import pandas as pd
data = [android[0], android[10472]]
pd.DataFrame(data, columns=android_header, index=["Correct Data","Incorrect Data"])

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Correct Data,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
Incorrect Data,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


As we can see, the category of the incorrect row is wrong. The "Category" entry is missing, so every subsequent value has been shifted left by one column. To ensure this row doesn't interfere with our analysis, we will remove it.

In [5]:
print(len(android))
del android[10472]
print(len(android))

10841
10840


We will write code to check if there are any other similar instances of incorrect data:

In [6]:
def data_checker(header, data):
    for row in data:
        if len(row) != len(header):
            return "Error Detected"
    return "No Error Detected"

print("Android Data:", data_checker(android_header, android))
print("iOS Data:",data_checker(ios_header, ios))

Android Data: No Error Detected
iOS Data: No Error Detected
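
If an error were detected, it would help to know which rows are affected. The following is a minimal sketch of a variant that reports the index and length of every row whose length differs from the header; the function name `find_malformed_rows` and the toy data are assumptions made for illustration, not part of the original analysis.

```python
def find_malformed_rows(header, rows):
    """Return (index, length) pairs for rows whose length differs from the header."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]

# Toy data (hypothetical): the second row is missing its "Category" value,
# so it is one element shorter than the header.
toy_header = ["App", "Category", "Rating"]
toy_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],
    ["App C", "GAME", "3.8"],
]

print(find_malformed_rows(toy_header, toy_rows))  # [(1, 2)]
```

Returning the indices rather than a single message makes it possible to inspect or delete each offending row individually.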


## Removing Duplicate Entries

If we explore the Google Play data set long enough, we'll notice that some apps have duplicate entries. For instance, Instagram has four entries:

In [7]:
data = []

for app in android:
    name = app[0]
    if name == "Instagram":
        data.append(app)
        
import pandas as pd
pd.DataFrame(data, columns=android_header, index=[f"Instance: {x + 1}" for x in range(len(data))])

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
Instance: 1,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
Instance: 2,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
Instance: 3,Instagram,SOCIAL,4.5,66577313,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
Instance: 4,Instagram,SOCIAL,4.5,66509917,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device


We will write code to find the number of duplicate apps as well as example cases where the apps appear more than once:

In [8]:
duplicate_apps = [] # Only contains apps which are duplicated
unique_apps = [] # The "set" of apps

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Number of duplicate apps:", len(duplicate_apps))
print('\n')
print("Example of duplicate apps:", duplicate_apps[:18])

Number of duplicate apps: 1181


Example of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects']


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If we examine the rows we printed for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show the data was collected at different times. We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

To remove the duplicates, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
- Use the information stored in the dictionary and create a new data set, which will have only one entry per app, and for each app, we'll only select the entry with the highest number of reviews.

In [9]:
max_reviews = {}

for app in android:
    name = app[0]
    reviews = float(app[3]) # Need to convert data to a number since data is originally a string
    if name not in max_reviews:
        max_reviews[name] = reviews
    elif reviews > max_reviews[name]:
        max_reviews[name] = reviews

We will check that the length of the "max_reviews" dictionary is equal to the expected number of rows of our de-duplicated dataset:

In [10]:
print("Length of Dictionary:", len(max_reviews))
print("Expected number of rows of de-duplicated data:", len(android)-len(duplicate_apps))

Length of Dictionary: 9659
Expected number of rows of de-duplicated data: 9659


Now that we have checked that the length of the "max_reviews" dictionary and the expected number of rows of our de-duplicated dataset are both equal to 9,659, we will use "max_reviews" to remove the duplicates. Here is a summary of the methodology:

1. We create two lists: 
    - The "android_clean" list will contain one row per unique app: the row with the highest number of reviews
    - The "already_added" list will keep track of the apps which have already been added to android_clean
2. Loop through the data and check whether the number of reviews is equal to the max number of reviews, and whether the app has not already been added to the list
    - The second condition ensures that an app with several rows sharing the same maximum number of reviews is added only once
3. Check android_clean to make sure that it has 9,659 rows of data as expected
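
As an aside, the same criterion (keep the row with the most reviews, and the first such row on ties) can be sketched in a single pass by storing the whole row in a dictionary. This is only an illustrative alternative under the assumption that the app name sits at index 0 and the review count at index 3, as in the Google Play data; the helper name `keep_most_reviewed` is hypothetical, and the "Box" row uses made-up numbers.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict '>' means a later row with an equal count does not replace the first
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows: the Instagram review counts come from the table above,
# the "Box" row is a hypothetical example.
toy_rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
]

deduped = keep_most_reviewed(toy_rows)
print(len(deduped))   # 2
print(deduped[0][3])  # 66577446
```

A dictionary lookup is constant time on average, whereas checking membership in a list such as "already_added" scans the list on every iteration.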

In [11]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == max_reviews[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(f'"android_clean" has {len(android_clean)} rows of data')

"android_clean" has 9659 rows of data


## Removing Non-English Apps

If we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience. Here are some examples:

In [12]:
print("Examples of iOS Non-English Apps:\n")
print(ios[813][1])
print(ios[6731][1])
print('\n')
print("Examples of Android Non-English Apps:\n")
print(android_clean[4412][0])
print(android_clean[7940][0])

Examples of iOS Non-English Apps:

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


Examples of Android Non-English Apps:

中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Behind the scenes, each character we use in a string has a corresponding number associated with it. The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system. 
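
To make this mapping concrete, here is a small sketch that prints the number associated with a few characters and labels each one as inside or outside the ASCII range. The sample characters are chosen purely for illustration.

```python
samples = ["A", "z", "7", "+", "™", "😜"]
labels = []

for ch in samples:
    code_point = ord(ch)  # the number associated with the character
    label = "ASCII" if code_point <= 127 else "non-ASCII"
    labels.append(label)
    print(repr(ch), code_point, label)
```

The letters, the digit, and the plus sign fall within 0 to 127, while the trademark sign and the emoji do not.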

Using the [built-in function ord()](https://docs.python.org/3/library/functions.html#ord), we will write code to check whether every character in the string falls within 0 to 127.

In [13]:
def is_english(string):
    for c in string:
        if ord(c) > 127:
            return False
    return True

print("'Instagram' is English: ", is_english('Instagram'))
print("'爱奇艺PPS -《欢乐颂2》电视剧热播' is English: ", is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print("'Docs To Go™ Free Office Suite' is English: ", is_english('Docs To Go™ Free Office Suite'))
print("'Instachat 😜' is English: ", is_english('Instachat 😜'))

'Instagram' is English:  True
'爱奇艺PPS -《欢乐颂2》电视剧热播' is English:  False
'Docs To Go™ Free Office Suite' is English:  False
'Instachat 😜' is English:  False


To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [14]:
def is_english(string):
    non_english = 0
    for c in string:
        if ord(c) > 127:
            non_english += 1
    
    if non_english > 3:
        return False
    else:
        return True

print("'爱奇艺PPS -《欢乐颂2》电视剧热播' is English: ", is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print("'Docs To Go™ Free Office Suite' is English: ", is_english('Docs To Go™ Free Office Suite'))
print("'Instachat 😜' is English: ", is_english('Instachat 😜'))

'爱奇艺PPS -《欢乐颂2》电视剧热播' is English:  False
'Docs To Go™ Free Office Suite' is English:  True
'Instachat 😜' is English:  True
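
The threshold of three is a judgment call, so it can be convenient to make it a parameter. The sketch below is one possible way to write that; the names `count_non_ascii`, `is_probably_english`, and `max_non_ascii` are hypothetical and not part of the original notebook.

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_probably_english(text, max_non_ascii=3):
    """Treat a name as English if it has at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))       # True
print(is_probably_english('Instachat 😜'))                         # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))        # False
print(is_probably_english('Instachat 😜', max_non_ascii=0))        # False
```

With the default of three it behaves like the function above, and setting `max_non_ascii=0` recovers the stricter first version.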


The function is still not perfect, and a few non-English apps might get past our filter, but this seems good enough at this point in our analysis. We will now use is_english() to filter out non-English apps from our datasets:

In [15]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

print('Number of rows for "android_english":', len(android_english))
print('Number of rows for "ios_english":', len(ios_english))

Number of rows for "android_english": 9614
Number of rows for "ios_english": 6183


We are now left with 9,614 Android apps and 6,183 iOS apps.

## Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis. Below we will isolate the free apps:

In [16]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0': # Free apps have a price of '0' (string) in the Google Play dataset
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0': # Free apps have a price of '0.0' (string) in the App Store dataset
        ios_final.append(app)

print('Number of rows for "android_final":', len(android_final))
print('Number of rows for "ios_final":', len(ios_final))

Number of rows for "android_final": 8864
Number of rows for "ios_final": 3222


We're left with 8,864 Android apps and 3,222 iOS apps, which should be enough for our analysis.
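
The two comparisons above rely on the exact string each data set uses for a free app ('0' and '0.0'). A slightly more tolerant sketch converts the price to a number first. It assumes that paid prices are plain decimals, optionally prefixed with a dollar sign (for example '$4.99'); that format and the helper name `is_free` are assumptions, and any other text would raise a ValueError.

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$4.99' represents zero."""
    cleaned = price_text.strip().lstrip("$")  # drop whitespace and a leading dollar sign
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```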

## Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

**Because our end goal is to add the app on both Google Play and the App Store**, we need to find app profiles that are successful on both markets.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the 'prime_genre' column of the App Store data set, and the 'Genres' and 'Category' columns of the Google Play data set.

To do this, we will create a function to generate a frequency table (expressed as percentages), then pass that table into another function which sorts the table in descending order:

In [30]:
def freq_table(dataset, index):
    data_length = len(dataset)
    ft = {}
    for row in dataset:
        if row[index] in ft.keys():
            ft[row[index]] += 1
        else:
            ft[row[index]] = 1
      
    for row in ft.keys():
        ft[row] /= data_length # Converts frequency table into proportion
        ft[row] *= 100         # Converts proportions into percentage

    return ft

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)

    data = []
    for entry in table_sorted:
        data.append([entry[1].title(), f"{round(entry[0],2)}%"])
    return data

display_table(ios_final, -5)

[['Games', '58.16%'],
 ['Entertainment', '7.88%'],
 ['Photo & Video', '4.97%'],
 ['Education', '3.66%'],
 ['Social Networking', '3.29%'],
 ['Shopping', '2.61%'],
 ['Utilities', '2.51%'],
 ['Sports', '2.14%'],
 ['Music', '2.05%'],
 ['Health & Fitness', '2.02%'],
 ['Productivity', '1.74%'],
 ['Lifestyle', '1.58%'],
 ['News', '1.33%'],
 ['Travel', '1.24%'],
 ['Finance', '1.12%'],
 ['Weather', '0.87%'],
 ['Food & Drink', '0.81%'],
 ['Reference', '0.56%'],
 ['Business', '0.53%'],
 ['Book', '0.43%'],
 ['Navigation', '0.19%'],
 ['Medical', '0.19%'],
 ['Catalogs', '0.12%']]
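
For comparison, the same kind of sorted percentage table can be sketched with `collections.Counter` from the standard library, whose `most_common()` method already returns entries in descending order of count. The function name `freq_table_pct` and the toy rows are hypothetical.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Toy rows (hypothetical): app name and genre
toy = [["a", "Games"], ["b", "Games"], ["c", "Education"], ["d", "Games"]]
print(freq_table_pct(toy, 1))  # [('Games', 75.0), ('Education', 25.0)]
```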

We start by examining the 'prime_genre' column of the AppleStore data:

In [18]:
data = display_table(ios_final, -5)

import pandas as pd
pd.DataFrame(data, columns=["Genre", "Proportion of Apps"], index=[f"Rank {x + 1}:" for x in range(len(data))])

Unnamed: 0,Genre,Proportion of Apps
Rank 1:,Games,58.16%
Rank 2:,Entertainment,7.88%
Rank 3:,Photo & Video,4.97%
Rank 4:,Education,3.66%
Rank 5:,Social Networking,3.29%
Rank 6:,Shopping,2.61%
Rank 7:,Utilities,2.51%
Rank 8:,Sports,2.14%
Rank 9:,Music,2.05%
Rank 10:,Health & Fitness,2.02%


We can see that:

- Games has the largest share of all genres, making up 58.16% of the data, followed by Entertainment at 7.88%
- Games is more than 7 times as common as the second most common genre
- A small minority of genres makes up a large majority of the dataset

The general impression of the free iOS apps is that most of them are designed for entertainment purposes. A large number of apps in a particular genre does not imply a large number of users, because not every app has the same number of downloads (i.e. the *demand* may not match the *supply*). So we cannot recommend an app profile for the App Store based on the frequency table alone.

Let's examine the 'Category' and 'Genres' columns of the Google Play dataset:

In [19]:
data = display_table(android_final, 1) # Categories

import pandas as pd
pd.DataFrame(data, columns=["Category", "Proportion of Apps"], index=[f"Rank {x + 1}:" for x in range(len(data))])

Unnamed: 0,Category,Proportion of Apps
Rank 1:,Family,18.91%
Rank 2:,Game,9.72%
Rank 3:,Tools,8.46%
Rank 4:,Business,4.59%
Rank 5:,Lifestyle,3.9%
Rank 6:,Productivity,3.89%
Rank 7:,Finance,3.7%
Rank 8:,Medical,3.53%
Rank 9:,Sports,3.4%
Rank 10:,Personalization,3.32%


We can see that:

- Family has the largest share at 18.91%, followed by Game at 9.72%
- Unlike the App Store data, the apps are distributed much more evenly across the categories
- Most noticeably, the gap between the most common category and the second most common is much smaller than in the App Store data

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). Overall, practical apps seem to have a better representation here. This can be seen in the frequency table for the Genres column:

In [20]:
data = display_table(android_final, -4) # Genres

import pandas as pd
pd.DataFrame(data, columns=["Genres", "Proportion of Apps"], index=[f"Rank {x + 1}:" for x in range(len(data))])

Unnamed: 0,Genres,Proportion of Apps
Rank 1:,Tools,8.45%
Rank 2:,Entertainment,6.07%
Rank 3:,Education,5.35%
Rank 4:,Business,4.59%
Rank 5:,Productivity,3.89%
...,...,...
Rank 110:,Books & Reference;Education,0.01%
Rank 111:,Art & Design;Pretend Play,0.01%
Rank 112:,Art & Design;Action & Adventure,0.01%
Rank 113:,Arcade;Pretend Play,0.01%


The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more *granular* (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kinds of apps that have the most users.

## Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store (iOS):

In [21]:
from scipy.stats import skew

genre_ios = freq_table(ios_final, -5)
genre_avg_n_ratings = []

for genre in genre_ios:
    ratings = []
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            ratings.append(user_ratings)
            len_genre += 1
    avg_n_rating = total / len_genre
    skewness = skew(ratings)
    genre_avg_n_ratings.append((round(avg_n_rating,2), genre, len_genre, skewness))

import pandas as pd
data = sorted(genre_avg_n_ratings, reverse=True)
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Average Number of Ratings", "Genre", "Number of Apps", "Skewness"], index = i )
table.reindex(columns=["Genre", "Average Number of Ratings", "Number of Apps", "Skewness"])

Unnamed: 0,Genre,Average Number of Ratings,Number of Apps,Skewness
Rank 1,Navigation,86090.33,6,1.220429
Rank 2,Reference,74942.11,18,3.645717
Rank 3,Social Networking,71548.35,106,8.153762
Rank 4,Music,57326.53,66,4.664408
Rank 5,Weather,52279.89,28,2.840356
Rank 6,Book,39758.5,14,2.06805
Rank 7,Food & Drink,33333.92,26,2.598927
Rank 8,Finance,31467.94,36,2.437536
Rank 9,Photo & Video,28441.54,160,11.628795
Rank 10,Travel,28243.8,40,4.126132
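
The per-genre average can also be sketched in a single pass over the data by accumulating a running total and a count for each genre, instead of looping over every app once per genre. The helper name `average_by_group` and the toy rows are hypothetical; only the idea of averaging a numeric column by genre comes from the analysis above.

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Average a numeric column for each distinct value of a grouping column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (hypothetical): app name, genre, number of ratings
toy = [
    ["x", "Navigation", "100"],
    ["y", "Navigation", "300"],
    ["z", "Weather", "50"],
]
print(average_by_group(toy, 1, 2))  # {'Navigation': 200.0, 'Weather': 50.0}
```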


The average number of ratings gives us an idea of which genres attract the most user ratings, but it doesn't give us a complete picture, because it is influenced by a few apps which have a very high number of ratings. As an example, we will inspect the apps in the Navigation genre, which has the highest average number of ratings:

In [22]:
ios_reviews = []

for app in ios_final:
    name = app[1]
    reviews = int(app[5])
    genre = app[-5]
    if genre == "Navigation":
        ios_reviews.append((reviews, name))

import pandas as pd
data = sorted(ios_reviews, reverse=True)
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Total number of ratings", "App"], index = i )
table.reindex(columns=["App", "Total number of ratings"])

Unnamed: 0,App,Total number of ratings
Rank 1,"Waze - GPS Navigation, Maps & Real-time Traffic",345046
Rank 2,Google Maps - Navigation & Transit,154911
Rank 3,Geocaching®,12811
Rank 4,CoPilot GPS – Car Navigation & Offline Maps,3582
Rank 5,ImmobilienScout24: Real Estate Search in Germany,187
Rank 6,Railway Route Search,5


As we can see, Waze and Google Maps account for the large majority of the ratings in the Navigation genre. We can measure this effect using the "skewness" of the data: a highly positive skewness means that a few apps with very large rating counts pull the average up. We notice that most of the other genres follow this pattern.
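
To show what the skewness figure measures, here is a sketch that computes it by hand as the third central moment divided by the second central moment raised to the power 1.5. This is assumed to match the default (biased) definition used by scipy.stats.skew; the function name `sample_skewness` is hypothetical. The Navigation rating counts are taken from the table above.

```python
def sample_skewness(values):
    """Biased sample skewness: m3 / m2**1.5, using central moments m2 and m3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

navigation = [345046, 154911, 12811, 3582, 187, 5]
print(round(sample_skewness(navigation), 2))  # about 1.22, in line with the table
print(sample_skewness([1, 2, 3, 4, 5]))       # 0.0 for a symmetric sample
```

The two large values sit far above the mean, so their cubed deviations dominate the third moment and push the skewness above zero.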

The term "Reference" seems a bit ambiguous. Since this genre is ranked second, it is important to understand what types of apps it contains, so let's inspect them:

In [23]:
ios_reviews = []

for app in ios_final:
    name = app[1]
    reviews = int(app[5])
    genre = app[-5]
    if genre == "Reference":
        ios_reviews.append((reviews, name))

import pandas as pd
data = sorted(ios_reviews, reverse=True)
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Total number of ratings", "App"], index = i )
table.reindex(columns=["App", "Total number of ratings"])

Unnamed: 0,App,Total number of ratings
Rank 1,Bible,985920
Rank 2,Dictionary.com Dictionary & Thesaurus,200047
Rank 3,Dictionary.com Dictionary & Thesaurus for iPad,54175
Rank 4,Google Translate,26786
Rank 5,"Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q...",18418
Rank 6,New Furniture Mods - Pocket Wiki & Game Tools ...,17588
Rank 7,Merriam-Webster Dictionary,16849
Rank 8,Night Sky,12122
Rank 9,City Maps for Minecraft PE - The Best Maps for...,8535
Rank 10,LUCKY BLOCK MOD ™ for Minecraft PC Edition - T...,4693


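We notice that these apps offer some sort of guidance for the app user: religious texts and prayer times, dictionaries, translation, and guides for games such as Minecraft.

Before we make a recommendation, we will first analyze the Google Play data in a similar fashion.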

## Most Popular Apps by Category on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [24]:
data = display_table(android_final, 5)

category = []
proportion = []

for row in data:
    category.append(row[0])
    proportion.append(row[1])

import pandas as pd
pd.DataFrame(proportion, columns=["Proportion of Apps"], index=category)

Unnamed: 0,Proportion of Apps
"1,000,000+",15.73%
"100,000+",11.55%
"10,000,000+",10.55%
"10,000+",10.2%
"1,000+",8.39%
100+,6.92%
"5,000,000+",6.83%
"500,000+",5.56%
"50,000+",4.77%
"5,000+",4.51%


One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
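
The conversion itself can be sketched as a small helper. The name `parse_installs` is hypothetical; it simply applies the two replacements described above before converting to float.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(text.replace(",", "").replace("+", ""))

print(parse_installs("1,000,000+"))  # 1000000.0
print(parse_installs("100+"))        # 100.0
```

Calling float() directly on the raw string "1,000,000+" would raise a ValueError, which is why the commas and the plus sign are removed first.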

In [25]:
from scipy.stats import skew

category_android = freq_table(android_final, 1)
category_avg_n_installs = []

for category in category_android:
    installs = []
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            user_installs = app[5]
            user_installs = user_installs.replace(",", "")
            user_installs = user_installs.replace("+", "")
            total += float(user_installs)
            installs.append(float(user_installs))
            len_category += 1
    avg_n_installs = total / len_category
    skewness = skew(installs)
    category_avg_n_installs.append((round(avg_n_installs,2), category.title(), len_category, skewness))

import pandas as pd
data = sorted(category_avg_n_installs, reverse=True)
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Average Number of Installs", "Category", "Number of Apps", "Skewness"], index = i )
table.reindex(columns=["Category", "Average Number of Installs", "Number of Apps", "Skewness"])

Unnamed: 0,Category,Average Number of Installs,Number of Apps,Skewness
Rank 1,Communication,38456119.17,287,5.312031
Rank 2,Video_Players,24727872.45,159,7.382816
Rank 3,Social,23253652.13,236,7.167487
Rank 4,Photography,17840110.4,261,12.406054
Rank 5,Productivity,16787331.34,345,8.582844
Rank 6,Game,15588015.6,862,10.830049
Rank 7,Travel_And_Local,13984077.71,207,9.742867
Rank 8,Entertainment,11640705.88,85,2.910589
Rank 9,Tools,10801391.3,750,11.150166
Rank 10,News_And_Magazines,9549178.47,248,10.53696


The category with the highest average number of installs is "Communication", at 38,456,119. However, we notice that the data is more positively skewed than the App Store data. This is due to a few apps which have 50 million or more installs. As a relatively new player in the app market, we don't want to compete against those giants, since we would have to spend a lot on marketing. So for our purposes, we will exclude apps with 50 million or more installs from our analysis:

In [26]:
from scipy.stats import skew

category_android = freq_table(android_final, 1)
category_avg_n_installs = []

for category in category_android:
    installs = []
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            user_installs = app[5]
            user_installs = user_installs.replace(",", "")
            user_installs = user_installs.replace("+", "")
            if float(user_installs) < 50000000:
                total += float(user_installs)
                installs.append(float(user_installs))
                len_category += 1
    avg_n_installs = total / len_category
    skewness = skew(installs)
    category_avg_n_installs.append((round(avg_n_installs,2), category.title(), len_category, skewness))

import pandas as pd
data = sorted(category_avg_n_installs, reverse=True)
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Average Number of Installs", "Category", "Number of Apps", "Skewness"], index = i )
table.reindex(columns=["Category", "Average Number of Installs", "Number of Apps", "Skewness"])

Unnamed: 0,Category,Average Number of Installs,Number of Apps,Skewness
Rank 1,Entertainment,3808684.21,76,0.714231
Rank 2,Photography,3437585.52,220,0.769468
Rank 3,Game,3244832.82,751,0.866434
Rank 4,Shopping,2942987.09,187,1.019174
Rank 5,Weather,2392365.97,67,1.457366
Rank 6,Video_Players,2369512.29,140,1.386723
Rank 7,Communication,2319787.36,253,1.369098
Rank 8,Social,2008540.83,218,1.616115
Rank 9,Travel_And_Local,1993454.98,198,1.638109
Rank 10,Food_And_Drink,1924897.74,110,1.763887


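Excluding apps with 50 million or more installs has significantly decreased the skewness within each category. For example, the skewness of Communication fell from 5.31 to 1.37, and that of Photography from 12.41 to 0.77.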
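
To make the effect of the cap concrete, here is a minimal sketch that computes skewness by hand on a small list. The install counts are made up for illustration and are not taken from the data set, and `sample_skewness` is a hypothetical helper. It uses the biased Fisher-Pearson formula (third central moment divided by the second central moment raised to the power 1.5), which is what `scipy.stats.skew` computes with its default settings.

```python
def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up install counts for one hypothetical category (not real data)
installs = [1_000, 10_000, 100_000, 1_000_000, 5_000_000, 10_000_000, 1_000_000_000]

# Same cap as in the notebook: keep apps below 50 million installs
capped = [v for v in installs if v < 50_000_000]

skew_all = sample_skewness(installs)
skew_capped = sample_skewness(capped)
print(f"all apps: {skew_all:.2f}, capped: {skew_capped:.2f}")
```

A single app with a billion installs pulls the mean far to the right of most values, so dropping it lowers the skewness even though the remaining values are still right-skewed.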

Let's inspect a few apps in the most popular categories:

In [27]:
android_installs = []

for app in android_final:
    name = app[0]
    user_installs = app[5]
    user_installs = user_installs.replace(",", "")
    user_installs = user_installs.replace("+", "")
    user_installs = int(user_installs)
    category = app[1].lower()
    if category == "entertainment" and user_installs < 50000000:
        android_installs.append((user_installs, name))

import pandas as pd
data = sorted(android_installs, reverse=True)[:18]
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Total number of Installations", "App"], index = i )
table.reindex(columns=["App", "Total number of Installations"])

Unnamed: 0,App,Total number of Installations
Rank 1,ivi - movies and TV shows in HD,10000000
Rank 2,WWE,10000000
Rank 3,Vudu Movies & TV,10000000
Rank 4,Viki: Asian TV Dramas & Movies,10000000
Rank 5,Tubi TV - Free Movies & TV,10000000
Rank 6,SketchBook - draw and paint,10000000
Rank 7,STARZ,10000000
Rank 8,Redbox,10000000
Rank 9,"Movies by Flixster, with Rotten Tomatoes",10000000
Rank 10,Motorola Spotlight Player™,10000000


In [28]:
android_installs = []

for app in android_final:
    name = app[0]
    user_installs = app[5]
    user_installs = user_installs.replace(",", "")
    user_installs = user_installs.replace("+", "")
    user_installs = int(user_installs)
    category = app[1].lower()
    if category == "photography" and user_installs < 50000000:
        android_installs.append((user_installs, name))

import pandas as pd
data = sorted(android_installs, reverse=True)[:18]
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Total number of Installations", "App"], index = i )
table.reindex(columns=["App", "Total number of Installations"])

Unnamed: 0,App,Total number of Installations
Rank 1,Wondershare PowerCam,10000000
Rank 2,"Sweet Snap - live filter, Selfie photo edit",10000000
Rank 3,"Sweet Camera - Selfie Filters, Beauty Camera",10000000
Rank 4,Scoompa Video - Slideshow Maker and Video Editor,10000000
Rank 5,RetroSelfie - Selfie Editor,10000000
Rank 6,Retro Camera,10000000
Rank 7,QuickPic - Photo Gallery with Google Drive Sup...,10000000
Rank 8,Pixlr-o-matic,10000000
Rank 9,PixelLab - Text on pictures,10000000
Rank 10,PhotoScan by Google Photos,10000000


The Google Play store categories describe the types of apps in a more straightforward manner than the App Store genres. This allows us to better infer the types of apps that will be in a certain category.
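
The inspection cells above repeat the same cleaning and filtering steps for each category. As a sketch only, those steps could be collected into two helpers. The names `parse_installs` and `top_apps` are hypothetical, and `sample_rows` is a made-up stand-in for `android_final` that fills in only the columns the notebook reads (index 0 for the name, 1 for the category, 5 for the installs string).

```python
def parse_installs(raw):
    """Turn an installs string such as '10,000,000+' into an int."""
    return int(raw.replace(",", "").replace("+", ""))


def top_apps(rows, category, cap=None, limit=10):
    """Return (installs, name) pairs for one category, largest first.

    Apps with `cap` or more installs are dropped when `cap` is given.
    """
    matches = []
    for row in rows:
        if row[1].lower() != category.lower():
            continue
        installs = parse_installs(row[5])
        if cap is not None and installs >= cap:
            continue
        matches.append((installs, row[0]))
    return sorted(matches, reverse=True)[:limit]


# Made-up rows: only indexes 0, 1 and 5 carry meaningful values
sample_rows = [
    ["App A", "PHOTOGRAPHY", "", "", "", "10,000,000+"],
    ["App B", "PHOTOGRAPHY", "", "", "", "100,000,000+"],
    ["App C", "SHOPPING", "", "", "", "50,000,000+"],
    ["App D", "PHOTOGRAPHY", "", "", "", "5,000+"],
]

result = top_apps(sample_rows, "photography", cap=50_000_000)
print(result)
```

Leaving `cap` as `None` keeps every app in the category, which matches a cell that lists a category without the 50 million filter.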

Now that we know the current demand and offerings in the Google Play and App Store markets, we can move on to recommending the types of apps to build.

## App Recommendations

It's important that we keep a few things in mind when deciding on the type of app to build:

- The type of app should be in high demand
- The type of app should be in low supply
- The type of app should be viable in both the Google Play and App Store markets
- The app should span multiple categories / genres
- We want users to remain engaged within the app for an extended period of time so we can show more ads
- We don't want to create an app that is similar to an existing app with a large user base

We notice that in both markets, games seem to be in high supply, especially in the App Store market. However, that supply doesn't seem to be matched by demand. This means that games are quite saturated in both app markets. As a result, we recommend not focusing on making a game app.
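
One rough way to compare supply and demand is to rank categories once by number of apps and once by average installs, then look at the gap. The sketch below does this for five rows copied from the capped Google Play table earlier in this section. It assumes that the number of apps is a reasonable proxy for supply and average installs for demand, it covers Google Play only, and `rank_by` is a hypothetical helper.

```python
# (category, average installs, number of apps) from the capped Google Play table
rows = [
    ("Entertainment", 3808684.21, 76),
    ("Photography", 3437585.52, 220),
    ("Game", 3244832.82, 751),
    ("Shopping", 2942987.09, 187),
    ("Weather", 2392365.97, 67),
]


def rank_by(rows, key_index):
    """Map each category name to its rank (1 = largest) on one column."""
    ordered = sorted(rows, key=lambda r: r[key_index], reverse=True)
    return {r[0]: position for position, r in enumerate(ordered, start=1)}


demand_rank = rank_by(rows, 1)  # assumption: average installs stand in for demand
supply_rank = rank_by(rows, 2)  # assumption: number of apps stands in for supply

for name, _, _ in rows:
    print(f"{name}: demand rank {demand_rank[name]}, supply rank {supply_rank[name]}")
```

A category whose supply rank is well ahead of its demand rank is a candidate for saturation. The same comparison would need to be repeated on the App Store data before drawing a conclusion for both markets.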

If we move down the Google Play category rankings, we find that Shopping overlaps with the App Store rankings. One type of app we could make is an app that suggests places for the user to shop and then guides them there. Let's inspect the types of Shopping apps that are currently on the Google Play store:

In [29]:
android_installs = []

for app in android_final:
    name = app[0]
    user_installs = app[5]
    user_installs = user_installs.replace(",", "")
    user_installs = user_installs.replace("+", "")
    user_installs = int(user_installs)
    category = app[1].lower()
    if category == "shopping":
        android_installs.append((user_installs, name))

import pandas as pd
data = sorted(android_installs, reverse=True)[:18]
i = [f"Rank {x}" for x in range(1,len(data)+1)]
table = pd.DataFrame(data, columns=["Total number of Installations", "App"], index = i )
table.reindex(columns=["App", "Total number of Installations"])

Unnamed: 0,App,Total number of Installations
Rank 1,eBay: Buy & Sell this Summer - Discover Deals ...,100000000
Rank 2,Wish - Shopping Made Fun,100000000
Rank 3,Flipkart Online Shopping App,100000000
Rank 4,Amazon Shopping,100000000
Rank 5,"AliExpress - Smarter Shopping, Better Living",100000000
Rank 6,"letgo: Buy & Sell Used Stuff, Cars & Real Estate",50000000
Rank 7,The birth,50000000
Rank 8,OLX - Buy and Sell,50000000
Rank 9,Myntra Online Shopping App,50000000
Rank 10,Mercado Libre: Find your favorite brands,50000000


This category includes apps that help users find shops and brands, online shopping apps, and much more.

On top of the base functionality of suggesting shops and guiding users to them:

- We can have sponsored suggestions, which are another advertising channel for our app.
- We should also include functionality that tracks current sales and prioritises those recommendations, which would entice users to use our app.
- We can let users give suggestions to others inside a forum in return for points, which can be exchanged for coupons at our sponsors.
- We can let users order from a shop in advance so that they can pick up their order when they arrive at the store.


## Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We have concluded that a shopping app would be viable for our purposes, as it has high demand with low supply in both markets. Since there are already many apps available, we need to include extra features to differentiate our app from the competition. The suggested features are just a starting point for app development. To create an enticing app, we need to distinguish ourselves from the competition by offering useful features that users want.