# Profitable App Profiles for the App Store and Google Play Markets


### Context
We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

### Aim
Our aim for this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. 

At our company, we only build apps that are **free to download and install**, and our main source of revenue consists of **in-app ads**. This means that our revenue for any given app is mostly influenced by the **number of users that use our app**. 

Over the course of this project we will analyze data to help our developers understand what kinds of apps are likely to attract more users.

### Skills / Libraries / Tools

Note that we use only the Python standard library in this project, applying the following to perform some **practical data analysis**:

- Some Python basics (arithmetical operations, variables, common data types, etc.)
- Lists and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions
- Jupyter Notebook

## Opening and exploring the datasets

As of [August 2019](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/) there were 1.96M iOS apps available on the Apple App Store and 2.46M Android apps on Google Play.

<img src="statista.png" alt="Drawing" style="width: 600px;"/>

We are reluctant to invest the significant amounts of time and money it would take to collect new data on such a large volume of apps, so we will try instead to analyse a sample of the data. 

Luckily, we have found two existing data sets on [Kaggle](https://www.kaggle.com/) which seem useful for our purposes. 

- A [data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately 10,000 Android apps from Google Play, collected in August 2018.
- A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately 7,000 iOS apps from the App Store, collected in July 2017.

Let's start with opening both data sets.

In [88]:
from csv import reader

# Create a list of lists containing the iOS app data
opened_file_apple = open('AppleStore.csv')
read_file_apple = reader(opened_file_apple)
apple_data_all = list(read_file_apple)
apple_data_header = apple_data_all[0]
apple_data = apple_data_all[1:]

# Create a list of lists containing the Android app data
opened_file_google = open('googleplaystore.csv')
read_file_google = reader(opened_file_google)
google_data_all = list(read_file_google)
google_data_header = google_data_all[0]
google_data = google_data_all[1:]
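
The cell above leaves both file handles open and relies on the platform's default text encoding. As a minimal sketch of a more defensive loader, the helper below closes the file automatically and sets the encoding explicitly. The name `load_csv` and the choice of `encoding='utf8'` are assumptions, not part of the original notebook; an explicit encoding seems sensible because some app names contain non-ASCII characters.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper, not from the notebook."""
    # newline='' is the setting the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as opened_file:
        all_rows = list(reader(opened_file))
    # First row is the header, the rest are data rows
    return all_rows[0], all_rows[1:]
```

Under these assumptions, `apple_data_header, apple_data = load_csv('AppleStore.csv')` would produce the same two lists as the cell above.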

To help in our exploratory data analysis ([EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis)) we create a function `explore_data` which allows us to repeatedly explore rows in a readable way.

#### Apple App Store data
Let's explore the Apple App Store data set first.

In [13]:
# Create function to print data set slices
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Pass in Apple app data
print(apple_data_header)
print('\n')
explore_data(apple_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


We can see the `apple_data` dataset has 7197 rows (without header) and 16 columns.

At a glance, some useful columns could be: 

| Column Name | Description |
| --- | --- |
| `track_name` | App name |
| `price` | Price amount |
| `rating_count_tot` | User Rating counts (for all versions) |
| `rating_count_ver` | User Rating counts (for current version) |
| `user_rating` | Average User Rating value (for all versions) |
| `user_rating_ver` | Average User Rating value (for current version) |
| `prime_genre` | Primary genre |

Further details about the dataset columns can be found in the documentation [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

#### Google Play data
Now let's do the same for the Google Play data set.

In [14]:
# Pass in Google app data
print(google_data_header)
print('\n')
explore_data(google_data, 0, 1, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The `google_data` dataset has 10841 rows (without header) and 13 columns.

Useful columns could be: 

| Column Name | Description |
| --- | --- |
| `App` | App name |
| `Category` | Category the app belongs to |
| `Rating` | Overall user rating of the app (as of when scraped) |
| `Reviews` | Number of user reviews for the app (as of when scraped) |
| `Installs` | Number of user downloads/installs for the app (as of when scraped) |
| `Type` | Paid or free |
| `Price` | Price |
| `Genres` | An app can belong to multiple genres |

Details about the dataset columns can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).


## Deleting Incorrect Data

#### Google Play data
The `google_data` data set contains a [known](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) error at index `10472`. The `Category` value is missing and there is a column shift for the columns that follow.

To confirm this we print out the header, a correct row, and the row in question.

In [15]:
# compare rows
print(google_data_header)    #header
print('\n')
print(google_data[0])    #correct row
print('\n')
print('Correct row length: ', len(google_data[0])) #correct row length
print('\n')
print(google_data[10472])    #incorrect row
print('\n')
print('Incorrect row length: ', len(google_data[10472])) #incorrect row length


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Correct row length:  13


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Incorrect row length:  12


This can also be checked by looking for all rows that do not have a length of 13.

We can see that our list below returns the same app name as row 10472.

In [16]:
# Instantiate empty list
app_name = []

# Iterate to find rows with missing value
for each_row in google_data:
    length = len(each_row)
    if length != 13:
        name = each_row[0]
        app_name.append(name)
        
print(app_name)

['Life Made WI-Fi Touchscreen Photo Frame']


Let's delete the row with the error and check.

In [17]:
print(google_data[10472])
print('Length before: ', len(google_data))
print('\n')

# Delete the row with the bad data
del google_data[10472]    #only run once

# Check
print(google_data[10472])
print('Length after: ', len(google_data))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Length before:  10841


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
Length after:  10840
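
Because `del google_data[10472]` removes whichever row currently sits at that index, running the cell a second time would silently delete a valid app. A rerun-safe alternative is sketched here on toy rows (the sample data and the name `well_formed` are illustrative assumptions): keep only the rows whose length matches the header.

```python
header = ['App', 'Category', 'Rating']
rows = [
    ['Good App', 'TOOLS', '4.2'],
    ['Shifted App', '1.9'],            # a column is missing, so the row is too short
    ['Another App', 'SOCIAL', '4.5'],
]

# Building a new list instead of deleting in place gives the same result on every run
well_formed = [row for row in rows if len(row) == len(header)]

print(len(rows), '->', len(well_formed))
for row in well_formed:
    print(row)
```

The trade-off is that this drops every malformed row rather than one known row, so it is worth printing what was removed before relying on it.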


## Removing Duplicate Entries

#### Google Play data
We notice that our `google_data` data set has rows with duplicate app names, so we check the entire data set for duplicates and count how many there are.

In [90]:
# Instantiate lists
google_unique_apps = []
google_duplicate_apps = []

# Iterate over dataset and append rows to appropriate list
for each_row in google_data:
    name = each_row[0]
    if name in google_unique_apps:
        google_duplicate_apps.append(name)
    else:
        google_unique_apps.append(name)

# Print to confirm
print('Duplicates Android apps:', len(google_duplicate_apps))
print('Unique Android apps:', len(google_unique_apps))
print('\n')
print(google_duplicate_apps[0:5])
print('\n')
print(google_data_header)
for each_row in google_data:
    name = each_row[0]
    if name == 'Instagram':
        print(each_row)


Duplicates Android apps: 1181
Unique Android apps: 9660


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with devic

Looking at the above rows for the Instagram app we can see that the main difference happens in the fourth field of each row, the `Reviews` field, which gives the number of reviews. 

We can assume the higher the number of reviews, the more recent the data should be. Therefore, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

Let's create a dictionary storing app names as keys and the highest number of reviews for each app as the corresponding values. 

In [43]:
# Check how many rows should be left in our dataset after we remove the duplicates
print('Expected length:', len(google_data) - 1181)

# Instantiate our dictionary
reviews_max = {}

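# Iterate over data set and record the highest review count seen for each app
for each_row in google_data:
    name = each_row[0]
    n_reviews = float(each_row[3])
    
    if name in reviews_max and (n_reviews > reviews_max[name]):
        reviews_max[name] = n_reviews
    
    # Only add the app if it has not been seen yet, so a lower count never overwrites a higher one
    elif name not in reviews_max:
        reviews_max[name] = n_reviews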

# Check        
len(reviews_max)

Expected length: 9659


9659

Now let's use the dictionary to create a new data set without the duplicates, i.e. one containing only the entry with the highest number of reviews for each app.

To do this we initialise 2 lists:
- `google_clean` which will become our new list of lists
- `already_added` to check for cases where we have genuinely duplicate rows


In [20]:
# Instantiate lists
google_clean = []
already_added = []

# Iterate, checking for genuine duplicates and discarding if so
for each_row in google_data:
    name = each_row[0]
    n_reviews = float(each_row[3])
    
    if (reviews_max[name] == n_reviews) and name not in already_added:
        google_clean.append(each_row)
        already_added.append(name)
    
len(google_clean)

9659
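
The two-step de-duplication logic can be checked on a handful of rows. In this sketch the toy rows hold only `[app name, number of reviews]`, so the review count sits at index 1 rather than index 3. The Instagram counts are taken from the rows printed earlier, while `Example App` is a made-up entry added to show a genuine duplicate.

```python
rows = [
    ['Instagram', '66577313'],
    ['Instagram', '66577446'],
    ['Instagram', '66577313'],
    ['Instagram', '66509917'],
    ['Example App', '100'],   # hypothetical app with two identical rows
    ['Example App', '100'],
]

# Step 1: highest review count per app
reviews_max = {}
for name, reviews in rows:
    n_reviews = int(reviews)
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

# Step 2: keep the first row that matches the highest count for each app
clean = []
already_added = set()
for row in rows:
    name, n_reviews = row[0], int(row[1])
    if n_reviews == reviews_max[name] and name not in already_added:
        clean.append(row)
        already_added.add(name)

print(clean)
```

A `set` is used for `already_added` here only because membership tests on a set are faster than on a list; the result is the same.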

## Remove non-English Characters

During our EDA we have come across apps in both data sets which contain non-English characters.

In [21]:
print(apple_data[813][1])
print(apple_data[6731][1])
print('\n')
print(google_clean[4412][0])
print(google_clean[7940][0])
print(ord('欢'))

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ
27426


Non-English apps are outside the scope of this project, so we decide to use the built-in `ord()` function to identify and remove them.

We create a function `in_english()` which checks the `ord()` value of each character in a string. Standard English characters all fall in the ASCII range 0 to 127, so a value greater than 127 indicates a non-English character. To avoid discarding English names that contain the odd emoji or symbol such as '™', the function treats a name as non-English only when it contains three or more such characters.


#### App Store Data
Let's build our function and use it on our `apple_data` data set first.

In [22]:
# Function to check ord value and determine if string is in English 
def in_english(string):
    ord_count = 0
    for each_character in string:
        ord_n = ord(each_character)
        
        if ord_n > 127:
            ord_count += 1
    
    # allows for English strings containing, e.g. emojis, '™' etc.
    if ord_count >= 3:            
        return False
    else:
        return True

# Instantiate lists to capture English/non-English app data
apple_data_english = []
apple_data_nonenglish = []


# Iterate over data set and append rows to appropriate list
for each_row in apple_data:
    app_name = each_row[1]
    
    if in_english(app_name):
        apple_data_english.append(each_row)
    else:
        apple_data_nonenglish.append(each_row)
                
# Check        
print(len(apple_data_english) + len(apple_data_nonenglish))
print(len(apple_data) == (len(apple_data_english) + len(apple_data_nonenglish)))


7197
True
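
To see how the three-character threshold behaves, here is a small self-contained sketch. It re-implements the check compactly so the block runs on its own; the parameter name `max_non_ascii` and the two sample names containing '™' and an emoji are illustrative assumptions, while the last sample is one of the App Store names printed earlier.

```python
def in_english(string, max_non_ascii=2):
    """Mirror of the notebook's check: False once 3 or more characters fall outside ASCII."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',   # one non-ASCII character
    'Instachat 😜',                     # one non-ASCII character
    '爱奇艺PPS -《欢乐颂2》电视剧热播',    # many non-ASCII characters
]
for name in samples:
    print(in_english(name), ':', name)
```

The heuristic is imperfect by design: an English name with three or more symbols would be rejected, and a short non-English name could slip through.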


#### Google Play data

Now let's do the same for our `google_clean` data set.

In [23]:
# Instantiate lists to capture English/non-English app data
google_clean_english = []
google_clean_nonenglish = []

# Iterate over data set and append rows to appropriate list
for each_row in google_clean:
    app_name = each_row[0]
    
    if in_english(app_name):
        google_clean_english.append(each_row)    
    else:
        google_clean_nonenglish.append(each_row)

print(len(google_clean_english) + len(google_clean_nonenglish))
print(len(google_clean) == (len(google_clean_english) + len(google_clean_nonenglish)))

9659
True


## Isolate the free apps

As our company only builds apps that are free to download and install, we zero in on these for our analysis.

#### App Store data
First let's check the price field in our `apple_data_english` data set and use the values found there to further subdivide our data set into free and not-free.

In [45]:
# Instantiate lists to capture free / not-free data
apple_data_english_free = []
apple_data_english_notfree = []

# Iterate over data set and append rows to appropriate lists
for each_row in apple_data_english:
    price = float(each_row[4])
    if price == 0.0:
        apple_data_english_free.append(each_row)
    else:
        apple_data_english_notfree.append(each_row)

ios_final = apple_data_english_free

# Check
print(len(ios_final))
print(len(apple_data_english_notfree))
print(len(apple_data_english) == (len(ios_final) + len(apple_data_english_notfree)))


3203
2952
True


#### Google Play data

Repeat the process for our `google_clean_english` data set, this time using the `Type` column, which records whether an app is free or paid.

In [47]:
# Instantiate lists
google_clean_english_free = []
google_clean_english_nonfree = []

# Iterate over data set and append row to appropriate list
for each_row in google_clean_english:
    app_type = each_row[6]    # 'Type' column
    if app_type == 'Free':
        google_clean_english_free.append(each_row)
    else:
        google_clean_english_nonfree.append(each_row)

android_final = google_clean_english_free

# Check
print(len(android_final))
print(len(google_clean_english_nonfree))
print(len(google_clean_english) == (len(android_final) + len(google_clean_english_nonfree)))


8847
750
True


## Most Common Apps by Genre

As we mentioned in the introduction, because our revenue is highly influenced by the number of people using our apps, our aim is to determine the kinds of apps that are likely to attract the most users.

As we want to minimize our risk and overhead in launching a new app, we adopt the following validation strategy:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both the App Store and Google Play, it is in our interest to find app profiles that are successful in both markets.

We decide the fields that best describe the app profiles are:
- `prime_genre` column of the App Store dataset
- `Genres` and `Category` columns of the Google Play dataset


In [75]:
print('iOS genre data:')
print(apple_data_header[11])

for row in ios_final[:5]:
    print(row[11])
    
print('\n')
print('Android genre data:')
print(google_data_header[1], ',', google_data_header[9])

for row in android_final[:5]:
    print(row[1], ',', row[9])

iOS genre data:
prime_genre
Social Networking
Photo & Video
Games
Games
Music


Android genre data:
Category , Genres
ART_AND_DESIGN , Art & Design
ART_AND_DESIGN , Art & Design
ART_AND_DESIGN , Art & Design
ART_AND_DESIGN , Art & Design;Creativity
ART_AND_DESIGN , Art & Design


We need to get a sense of the most popular app genres for each market. 

First we create a function we can use on both our `ios_final` and `android_final` data sets, to build frequency tables capturing genre frequencies.

In [80]:
# Function to create frequency tables showing percentages
def freq_table(dataset, index):
    dictionary = {}
    for each_row in dataset:
        key = each_row[index]
        
        if key in dictionary:
            dictionary[key] += 1
        else:
            dictionary[key] = 1
    
    dictionary_percent = {}
    for key in dictionary:
        percent = round(dictionary[key] / len(dataset) * 100, 2)
        dictionary_percent[key] = percent
    
    return dictionary_percent

# Check
print(freq_table(ios_final, 11))    #prime_genre index
print('\n')
print(freq_table(android_final, 1))    #Category index
# print(freq_table(android_final, 9))    #Genres index

{'Social Networking': 3.31, 'Photo & Video': 5.0, 'Games': 58.26, 'Music': 2.06, 'Reference': 0.53, 'Health & Fitness': 2.03, 'Weather': 0.87, 'Utilities': 2.47, 'Travel': 1.25, 'Shopping': 2.59, 'News': 1.34, 'Navigation': 0.19, 'Lifestyle': 1.56, 'Entertainment': 7.84, 'Food & Drink': 0.81, 'Sports': 2.15, 'Book': 0.37, 'Finance': 1.09, 'Education': 3.68, 'Productivity': 1.75, 'Business': 0.53, 'Catalogs': 0.12, 'Medical': 0.19}


{'ART_AND_DESIGN': 0.64, 'AUTO_AND_VEHICLES': 0.93, 'BEAUTY': 0.6, 'BOOKS_AND_REFERENCE': 2.14, 'BUSINESS': 4.6, 'COMICS': 0.61, 'COMMUNICATION': 3.23, 'DATING': 1.87, 'EDUCATION': 1.16, 'ENTERTAINMENT': 0.96, 'EVENTS': 0.71, 'FINANCE': 3.71, 'FOOD_AND_DRINK': 1.24, 'HEALTH_AND_FITNESS': 3.09, 'HOUSE_AND_HOME': 0.8, 'LIBRARIES_AND_DEMO': 0.94, 'LIFESTYLE': 3.89, 'GAME': 9.7, 'FAMILY': 18.93, 'MEDICAL': 3.54, 'SOCIAL': 2.67, 'SHOPPING': 2.25, 'PHOTOGRAPHY': 2.95, 'SPORTS': 3.39, 'TRAVEL_AND_LOCAL': 2.34, 'TOOLS': 8.45, 'PERSONALIZATION': 3.32, 'PRODUCTIVITY'

Now that we have a way of making our frequency tables, we want to be able to easily understand the contents. 

To this end we create a function `display_table` which will call our `freq_table` function to create a frequency table, then sort the table based on the frequencies, finally displaying the results in descending order.

In [78]:
# Function to display the percentages in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    # Create a tuple with the key/value swapped around, so sorting uses the percentage
    for key in table:
        key_val_as_tuple = (table[key], key)    
        table_display.append(key_val_as_tuple) 
    
    # Sort on the values
    table_sorted = sorted(table_display, reverse = True)    
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])    #print so that it is "index : value"
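
As an alternative sketch, the standard library's `collections.Counter` can do the counting and the sorting in one go. The function name `display_table_counter` and the toy rows are assumptions for illustration; this is not the notebook's implementation.

```python
from collections import Counter


def display_table_counter(dataset, index):
    """Hypothetical Counter-based variant of display_table; returns the pairs it prints."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs from most to least frequent
    table = [(value, round(count / total * 100, 2))
             for value, count in counts.most_common()]
    for value, percent in table:
        print(value, ':', percent)
    return table


toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
result = display_table_counter(toy_rows, 1)
```

One difference to be aware of: when two values are tied, `most_common()` keeps them in the order they were first seen, whereas sorting `(percentage, key)` tuples in reverse orders ties by key.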
    

#### App Store data

Now that we have our function let's start by analysing the genre frequencies of our `ios_final` data set.

In [85]:
display_table(ios_final, -5)    # prime_genre index

Games : 58.26
Entertainment : 7.84
Photo & Video : 5.0
Education : 3.68
Social Networking : 3.31
Shopping : 2.59
Utilities : 2.47
Sports : 2.15
Music : 2.06
Health & Fitness : 2.03
Productivity : 1.75
Lifestyle : 1.56
News : 1.34
Travel : 1.25
Finance : 1.09
Weather : 0.87
Food & Drink : 0.81
Reference : 0.53
Business : 0.53
Book : 0.37
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


We can see that over half (58.26%) of the free English apps available from the App Store are `Games` apps, making them the most common genre available.

`Entertainment` apps were the next most common (7.84%), followed by `Photo & Video` (5%), `Education` (3.68%) and `Social Networking` (3.31%).

The general impression is that the majority of the free English apps available from the App Store are those that cater for **fun** (games, entertainment, photo and video, social networking etc.), rather than those that are used for **practical purposes** (weather, finance, news, business etc.). 

**However**, the availability of a large number of apps for a particular genre does not necessarily correspond to a large number of users for those apps. Perhaps supply exceeds demand.

#### Google Play data
Now let's examine the `Category` and `Genres` columns from our `android_final` data set.

In [29]:
display_table(android_final, 1)    # Category index

FAMILY : 18.93
GAME : 9.7
TOOLS : 8.45
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.89
FINANCE : 3.71
MEDICAL : 3.54
SPORTS : 3.39
PERSONALIZATION : 3.32
COMMUNICATION : 3.23
HEALTH_AND_FITNESS : 3.09
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.67
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.87
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.39
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.8
WEATHER : 0.79
EVENTS : 0.71
PARENTING : 0.66
ART_AND_DESIGN : 0.64
COMICS : 0.61
BEAUTY : 0.6


We can see that among free English apps available from Google Play, the most common categories are `FAMILY` (18.93%), `GAME` (9.7%), `TOOLS` (8.45%), `BUSINESS` (4.6%), `PRODUCTIVITY` (3.9%) and `LIFESTYLE` (3.89%).

Apps that serve a more practical purpose have a much higher representation here than at the App Store. 

While it initially seems that there are proportionally many more practical categories than fun categories, and more apps available in those practical categories than in fun ones, on closer inspection the category with the most apps, `FAMILY` (18.93%), actually appears to consist of game apps for children. Even so, the availability of fun and practical free English apps is more balanced here than at the App Store.

This trend of a representation of both practical and fun apps seems to hold up when we examine the `Genres` table.

In [30]:
display_table(android_final, 9)    # Genres index

Tools : 8.44
Entertainment : 6.08
Education : 5.36
Business : 4.6
Productivity : 3.9
Lifestyle : 3.88
Finance : 3.71
Medical : 3.54
Sports : 3.46
Personalization : 3.32
Communication : 3.23
Action : 3.1
Health & Fitness : 3.09
Photography : 2.95
News & Magazines : 2.8
Social : 2.67
Travel & Local : 2.33
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.05
Dating : 1.87
Arcade : 1.84
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.39
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.8
Weather : 0.79
Events : 0.71
Adventure : 0.67
Comics : 0.6
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Trivia : 0.42
Casino : 0.42
Educational;Education : 0.4
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;Bra

While we are not sure what the distinction is between `Category` and `Genres` in the Google Play dataset, `Genres` appears to be much more granular, with more categories than `Category`.

As we are currently more interested in the big picture, we will work with the `Category` data only going forward.


## Most Popular Apps by Genre on the App Store

Let's take a look at the 'most popular' iOS apps.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. 

As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [86]:
# Get iOS genres frequency table
ios_unique_genres = freq_table(ios_final, -5)

# Get user ratings
for genre in ios_unique_genres:
    # Stores the sum of user ratings
    total = 0    
    
    # Stores the number of unique apps in a genre
    len_genre = 0    
    
    for row in ios_final:
        genre_app = row[-5]
        if genre == genre_app:
            user_ratings = float(row[5])
            total += user_ratings
            len_genre += 1
            
    avg_user_rating = round(total / len_genre)
    print(genre, ':', avg_user_rating)


Social Networking : 71548
Photo & Video : 28442
Games : 22886
Music : 57327
Reference : 79350
Health & Fitness : 23298
Weather : 52280
Utilities : 19156
Travel : 28244
Shopping : 27231
News : 21248
Navigation : 86090
Lifestyle : 16815
Entertainment : 14195
Food & Drink : 33334
Sports : 23009
Book : 46385
Finance : 32367
Education : 7004
Productivity : 21028
Business : 7491
Catalogs : 4004
Medical : 612
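
The loop above rescans the whole data set once per genre. A single-pass version is sketched below; the helper name `average_by_group` and the toy rows (with made-up rating counts) are assumptions for illustration only.

```python
def average_by_group(dataset, group_index, value_index):
    """Hypothetical helper: rounded mean of a numeric column per group, in one pass."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: round(totals[group] / counts[group]) for group in totals}


# Toy rows: [genre, rating count]; the numbers are invented
toy_rows = [['Games', '100'], ['Games', '300'], ['Music', '50']]
averages = average_by_group(toy_rows, 0, 1)
print(averages)
```

With the notebook's data, a call such as `average_by_group(ios_final, -5, 5)` would be expected to reproduce the per-genre averages printed above.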


On average, `Navigation` apps have the highest number of user ratings. Let's look further.

We can see that a small number of apps (Waze and Google Maps) account for the overwhelming share of user ratings in this genre.

In [32]:
for row in ios_final:
    if row[-5] == 'Navigation':
        print(row[1], ':', row[5])
        

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
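
To quantify how strongly the two largest apps pull up the average, the sketch below compares the mean with the median for the six Navigation rating counts listed above. Using the `statistics` module and dropping the top two apps are illustrative choices, not steps taken in the original analysis.

```python
from statistics import mean, median

# rating_count_tot values for the six Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)

# Mean after dropping the two largest apps (Waze and Google Maps)
remaining = sorted(navigation_ratings)[:-2]
mean_without_top_two = mean(remaining)

print('Mean:', round(nav_mean))
print('Median:', nav_median)
print('Mean without top two:', mean_without_top_two)
```

The mean rounds to 86090, matching the Navigation figure above, while the median is 8196.5 and the mean without the top two apps is 4146.25, which suggests the typical Navigation app is far less popular than the genre average implies.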


The same pattern applies to `Social Networking`, with Facebook and Pinterest accounting for the overwhelming majority of the category's user reviews.

In [33]:
for row in ios_final:
    if row[-5] == 'Social Networking':
        print(row[1], ':', row[5])


Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Let's check out what the `Reference` category tells us. 

It too is skewed by a couple of big hitters, the Bible and Dictionary.com.

In [34]:
for row in ios_final:
    if row[-5] == 'Reference':
        print(row[1], ':', row[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
Jishokun-Japanese English Dictionary & Translator : 0


We appear to have the following options:

- Look to create an app that will compete with titans such as `Facebook`, `Google Maps`, `Bible` etc. However, this would require significant investment to research and implement, and does not really work as a strategy for our purposes.

- Go back to the data, strip these apps that are skewing the results to get a more reasonable and accurate representation of what is popular outside of these top tier apps. 

Looking at the data in its current form however, the following thoughts occur:

- We may want to consider the **saturation** of apps in a particular category. Take `Social Networking`, for example, which contains over 100 individual apps. We can see from the user ratings for Facebook, Pinterest, WhatsApp Messenger etc. that if a Social Networking app takes off, it has the potential to be incredibly popular. A new messaging app, for example, might therefore seem like a good idea. However, how would we make it stand out from the noise of 100+ existing apps, all clamouring for market share?

- **`Food & Drink`** is a genre with upper-mid range popularity, and has relatively few unique apps. Perhaps an app would have a greater chance of success here. Apart from existing food delivery service apps, there is one high ranking app which allows you to make reservations. We could go one step further by building an app that returns all current specials and deals at restaurants in a particular location, and then allows you to book your table. This has the added benefit of not requiring any specialist knowledge. Furthermore, as the App Store has such a high concentration of apps that are 'fun', this more practical (yet still entertainment-related) app might have a chance at standing out.

- **`Book`** is a genre with some potential also. There are relatively few apps to compete against, but a good level of popularity. We could take a popular book and turn it into an app with special features in addition to the raw version of the book, e.g. daily quotes from the book, an audio version, quizzes about the book, an online forum for readers to discuss etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app. As with the `Food & Drink` idea, this more practical app might stand out amongst the fun apps which dominate the App Store.


In [35]:
for row in ios_final:
    if row[-5] == 'Book':
        print(row[1], ':', row[5])

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0


## Most Popular Apps by Genre on Google Play

Let's now look at the 'most popular' Android apps.

We use the column `Installs` to calculate the popularity of genres. 

`Installs` values are open-ended strings (e.g. `100+`, `100,000+`) that contain `+` and `,` characters, so we need to strip these characters and convert the values to numbers before we can use them in our calculations.
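As a minimal sketch of that cleaning step, the conversion could be wrapped in a small helper. The name `parse_installs` is hypothetical and does not appear elsewhere in this notebook; the cells below perform the same replacements inline.

```python
def parse_installs(raw):
    """Convert an open-ended install count such as '100,000+' to a float."""
    # Strip the thousands separators and the trailing '+' before converting
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


print(parse_installs('100+'))
print(parse_installs('100,000+'))
```

Note that the result is a lower bound: an app listed as `100,000+` is treated as having exactly 100,000 installs.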


In [87]:
# Create a genre frequency table for Google Play apps
android_unique_genres = freq_table(android_final, 1)

cat_installs_list = []
cat_installs_dict = {}

for category in android_unique_genres:
    # Stores the sum of installs specific to each genre
    total = 0    
    
    # Stores the number of apps specific to each genre  
    len_category = 0      
    
    # Sum the installs and count the apps for this genre
    for row in android_final:
        category_app = row[1]
        
        if category == category_app:
            # Remove all instances of '+' and ',' from the value
            installs = row[5].replace('+', '')   
            installs_1 = float(installs.replace(',', ''))   
            total += installs_1
            len_category += 1
            
    # Average number of installs per genre
    avg_installs = round(total / len_category)
    cat_installs_dict[category] = avg_installs    

# loop over the dictionary. For every dictionary entry create a tuple (installs, category) and append it to a list
for index in cat_installs_dict:
    cat_installs_tuple = ((cat_installs_dict[index]), index)
    cat_installs_list.append(cat_installs_tuple)

#sort the list in reverse order
cat_installs_sorted = sorted(cat_installs_list, reverse = True) 

for element in cat_installs_sorted:
    print(element[1], ':', element[0])


COMMUNICATION : 38590581
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
PHOTOGRAPHY : 17840110
PRODUCTIVITY : 16787331
GAME : 15544015
TRAVEL_AND_LOCAL : 13984078
ENTERTAINMENT : 11640706
TOOLS : 10830252
NEWS_AND_MAGAZINES : 9549178
BOOKS_AND_REFERENCE : 8814200
SHOPPING : 7036877
PERSONALIZATION : 5201483
WEATHER : 5145550
HEALTH_AND_FITNESS : 4188822
MAPS_AND_NAVIGATION : 4049275
FAMILY : 3697848
SPORTS : 3650602
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924898
EDUCATION : 1833495
BUSINESS : 1712290
LIFESTYLE : 1446158
FINANCE : 1387692
HOUSE_AND_HOME : 1360598
DATING : 854029
COMICS : 832614
AUTO_AND_VEHICLES : 647318
LIBRARIES_AND_DEMO : 638504
PARENTING : 542604
BEAUTY : 513152
EVENTS : 253542
MEDICAL : 120551


From the above we can see that the most popular genres among free English apps on Google Play are `COMMUNICATION`, `VIDEO_PLAYERS` and `SOCIAL`.

However, if we delve deeper into the popularity of the apps, we can see again the same pattern of a small number of apps dominating in popularity.

For `COMMUNICATION`, if we exclude the apps with 100M+ installs, the average number of installs **decreases more than tenfold**, from 38,590,581 to 3,617,398.

In [37]:
under_100_m = []

# remove outliers
for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
round(sum(under_100_m) / len(under_100_m))

3617398

The same pattern holds for the `VIDEO_PLAYERS` category (YouTube, Google Play Movies & TV, MX Player), for `SOCIAL` apps (Facebook, Instagram, Google+, etc.), for `PHOTOGRAPHY` apps (Google Photos and other popular photo editors) and for `PRODUCTIVITY` apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

We cannot take the popularity data at face value with these few big players skewing the results. Furthermore, were we to build an app based on the above most 'popular' categories, these giants would be hard to compete with.
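To illustrate how a single giant can distort an average, here is a small sketch comparing the mean with the median, which is far less sensitive to outliers. The install counts in `sample_installs` are made up for illustration and are not taken from the data set; `statistics` is part of the Python standard library.

```python
from statistics import mean, median

# Hypothetical install counts: four typical apps plus one giant
sample_installs = [10_000, 50_000, 100_000, 500_000, 1_000_000_000]

sample_mean = mean(sample_installs)
sample_median = median(sample_installs)

print('mean  :', round(sample_mean))
print('median:', sample_median)
```

In this made-up sample the mean is roughly 200M, larger than four of the five values, while the median stays at 100,000. A median per category could therefore be a useful complement to the averages above.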

Let's explore the genres that we found had potential in the App Store:

**`FOOD_AND_DRINK`**  
With an average of only 1.9M installs, this genre is not as popular on Google Play as its App Store counterpart. As we are looking for an app that shows potential for being profitable in both markets, we rule out this genre.


**`BOOKS_AND_REFERENCE`**  
The books and reference genre looks fairly popular here as well as on the App Store, with an average of 8.8M installs. Let's take a look at some of the apps from this genre and their number of installs:

In [38]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
...

The `BOOKS_AND_REFERENCE` genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. 

It seems there's still a small number of extremely popular apps that skew the average:

In [39]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Outside of these, there are still a lot of reasonably popular apps. Let's look at those which sit in the range of 1M - 100M installs. 

There is a dominance of e-reader apps, along with reference books.

In [40]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
...
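The cells above select install brackets by comparing strings one value at a time. As an alternative sketch, the same kind of selection could be expressed as a numeric range check. The function `installs_in_range` and the rows in `sample_rows` are hypothetical; the rows only mimic the column layout assumed for `android_final` (index 0 for the app name, index 1 for the category, index 5 for the installs).

```python
def installs_in_range(apps, category, low, high):
    """Return (name, installs) pairs for apps in `category`
    whose install count satisfies low <= installs < high."""
    matches = []
    for app in apps:
        if app[1] != category:
            continue
        # Convert e.g. '5,000,000+' to 5000000.0 before comparing
        n_installs = float(app[5].replace(',', '').replace('+', ''))
        if low <= n_installs < high:
            matches.append((app[0], app[5]))
    return matches


# Hypothetical rows: index 0 = name, index 1 = category, index 5 = installs
sample_rows = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '5,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
    ['Giant C', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Chat D', 'COMMUNICATION', '', '', '', '10,000,000+'],
]

mid_range = installs_in_range(sample_rows, 'BOOKS_AND_REFERENCE',
                              1_000_000, 100_000_000)
for name, installs in mid_range:
    print(name, ':', installs)
```

With a numeric check, a new install bracket appearing in the data would be picked up without having to add another string comparison.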

We also notice there are a number of apps built around the Quran, which suggests our idea of building an app around a particular book might be a good one. 

As mentioned above, there are many existing library-style apps, so we would need to differentiate ours with special features in addition to the text, such as daily quotes, quizzes, audio, a built-in dictionary, and an online forum for readers to discuss the book.


# Conclusion

In this project, we analysed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that would be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. 

The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.