# Understanding the Profitability of Free iOS and Android Apps

## Introduction

A fictional company produces iOS and Android apps for English-speaking audiences. These apps are free for customers to download and install, so revenue is generated through in-app advertisements; apps with more users therefore generate more revenue. The purpose of this study is to understand which types of apps are likely to attract more users, to guide future development efforts.

### Data Source

[Apple Store data](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

[Google Play Store data](https://www.kaggle.com/lava18/google-play-store-apps)

### Functions for Importing and Exploring Data

In [1]:
# reader from the csv module is required to read CSV files
from csv import reader

# This function produces a list of rows read in from a CSV file
# If header is True, it returns [header_row, data_rows]; otherwise it returns all rows
def openCSV(file, header=True):
    # explicit encoding, since some app names contain non-ASCII characters
    with open(file, encoding='utf8') as f:
        readf = list(reader(f))
    if header:
        return [readf[0], readf[1:]]
    return readf

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n') # adds a new (empty) line after the row and column counts

### Read in source data and initial exploration

Datasets produced:

`iosHeader`: column header information for the iOS data

`iosData`: main body of raw import of the iOS data

`androidHeader`: column header information for the Android data

`androidData`: main body of raw import of the Android data

In [2]:
# read in iOS and Android data
[iosHeader, iosData] = openCSV('./AppleStore.csv')
[androidHeader, androidData] = openCSV('./googleplaystore.csv')

print(iosHeader, '\n')
explore_data(iosData,0,3,True)
print(androidHeader, '\n')
explore_data(androidData,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 

Based on a brief exploration of each dataset, perhaps the following columns would be of use for this assessment:

| iOS Column Name   | Android Column Name   | Description           |
| :---              | :---                  | :---                  |
| 'user_rating'     | 'Rating'              | Overall app rating    |
| *N/A*             | 'Installs'            | Number of downloads   |
| 'cont_rating'     | 'Content Rating'      | Age restriction       |
| 'prime_genre'     | 'Genres'              | Genre                 |

## Data Cleaning

### Erroneous entry in the Android Data

The discussion board for the Android dataset highlights an error in row 10472 ([see specific discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)). Printing this row alongside the column headers and its neighbouring rows shows that the 'Category' entry is missing, so every later value sits one column to the left of where it belongs. To prevent this causing errors in subsequent data processing, the row is deleted.

In [3]:
print(androidHeader, '\n') # print column names from header
explore_data(androidData,10471,10474) # print rows 10471-10473 inclusive

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




In [4]:
del androidData[10472] # delete the erroneous row
# print small selection of rows again to confirm deletion
print(androidHeader, '\n')
explore_data(androidData,10471,10473)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
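
Rather than relying on a known row index, a malformed row like this one can also be found by comparing each row's length with the header's. The following is a minimal sketch on a hypothetical three-column miniature of the Android data; `find_malformed_rows` is an illustrative helper, not part of the original notebook.

```python
# Hypothetical miniature of the Android data: a header plus three rows,
# the second of which is missing its 'Category' entry.
header = ['App', 'Category', 'Rating']
rows = [
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
]

def find_malformed_rows(header, rows):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

bad_indices = find_malformed_rows(header, rows)
print(bad_indices)  # [1]

# Delete from the highest index down so earlier deletions do not shift later ones.
for i in sorted(bad_indices, reverse=True):
    del rows[i]
print(len(rows))  # 2
```

A length check only catches rows with missing or extra entries; a row with the right number of columns but wrong values would need a separate test.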




### Duplicate Entries

This section scans through the two datasets to identify and remove duplicate entries. For example, Netflix appears 5 times in the Android dataset. There are a total of 1181 duplicate entries across all Android apps in the dataset, while the iOS dataset has only 2. Duplicates are not removed at random: the entry with the largest number of reviews is kept, on the basis that it holds the most recent data.

In [5]:
## just some exploratory code to find apps with quite a few repetitions
# appNamesAndroid = {}
# for row in androidData:
#     name = row[0] # app name is in first column in android data
#     if name in appNamesAndroid:
#         appNamesAndroid[name] += 1
#     else:
#         appNamesAndroid[name] = 1

# for item in appNamesAndroid.items():
#     if item[1] > 3:
#         print(item[0])

# define const string for Netflix (used a few times)
netflix = 'Netflix'
# print all the duplicate entries of Netflix
for row in androidData:
    if row[0] == netflix:
        print(row)

dupNamesIOS = []
uniqueNamesIOS = []
for row in iosData:
    name = row[1] # app name is in the second column of ios data
    if name in uniqueNamesIOS:
        dupNamesIOS.append(name)
    else:
        uniqueNamesIOS.append(name)

print('\nNumber of duplicate IOS entries: ', len(dupNamesIOS))
print('Number of unique IOS entries: ', len(uniqueNamesIOS))

dupNamesAndroid = []
uniqueNamesAndroid = []
for row in androidData:
    name = row[0] # app name is in first column of android data
    if name in uniqueNamesAndroid:
        dupNamesAndroid.append(name)
    else:
        uniqueNamesAndroid.append(name)

print('\nNumber of duplicate Android entries: ', len(dupNamesAndroid))
print('Number of unique Android entries: ', len(uniqueNamesAndroid))

['Netflix', 'ENTERTAINMENT', '4.4', '5456208', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Netflix', 'ENTERTAINMENT', '4.4', '5456208', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Netflix', 'ENTERTAINMENT', '4.4', '5456599', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Netflix', 'ENTERTAINMENT', '4.4', '5456708', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Netflix', 'FAMILY', '4.4', '5453997', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 31, 2018', 'Varies with device', 'Varies with device']

Number of duplicate IOS entries:  2
Number of unique IOS entries:  7195

Number of dup
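
The counting above tests membership in a list for every row, which becomes slow as the dataset grows. As an alternative sketch, `collections.Counter` produces the same unique and duplicate counts in a single pass. The `names` list here is a hypothetical sample; in the notebook the names would come from `row[0]` of `androidData` or `row[1]` of `iosData`.

```python
from collections import Counter

# Hypothetical sample of app names.
names = ['Netflix', 'Instagram', 'Netflix', 'Slack', 'Netflix', 'Slack']

counts = Counter(names)

# Every occurrence beyond the first counts as a duplicate entry.
n_unique = len(counts)
n_duplicates = sum(c - 1 for c in counts.values())
repeated = {name: c for name, c in counts.items() if c > 1}

print(n_unique)      # 3
print(n_duplicates)  # 3
print(repeated)      # {'Netflix': 3, 'Slack': 2}
```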

#### Filtering Out Duplicates

The duplicates are filtered out by keeping the single entry per app that has the highest number of reviews.

Step 1: loop through the android dataset to create a dictionary for each app to map each unique name with the highest number of reviews that app has in the dataset. The number of unique apps is known from the previous section, so the length of this dictionary is used as a check; it should match 9659 unique app names. Having inspected the duplicate entries for Netflix, this item is also specifically called out to check that it has the expected max number of reviews (5456708).

Step 2: loop through the android dataset to produce a clean set of data (i.e. no duplicates) using the dictionary from Step 1. The code here adds a row to the clean list only if the app has not already been entered and the number of reviews matches the number from the dictionary. It is possible that there are duplicate entries with the same max number of reviews, so the code additionally ensures that these duplicates do not sneak in. Again the length of the clean set of data (`android_clean`) is checked to see if it matches the known number of unique app names (9659). Also, any rows for Netflix can be pulled out for a spot check to ensure there is only one instance of Netflix and that the number of reviews matches the expected max (5456708).

Datasets produced:

`android_clean`: androidData with duplicate entries removed

`ios_clean`: iosData with duplicate entries removed

In [6]:
## Step 1
# create dictionary mapping unique app names with respective max number of reviews, where duplicates exist
reviews_max = {}
for row in androidData:
    name = row[0] # app name is in first column of android data
    n_reviews = float(row[3]) # number of reviews is in the fourth column of android data
    if name not in reviews_max:
        reviews_max[name] = n_reviews # if the app name does not exist, add it
    elif n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews # if a duplicate is found with more reviews, update the entry
# a couple of checks
print('Check length of reviews_max is as expected (9659): ', len(reviews_max))
print('Check if Netflix entry is as expected (5456708): ', reviews_max[netflix], '\n')

## Step 2
# create a clean list with only one row per unique app name
# ensures the entry added is the one with the max number of reviews from reviews_max
android_clean = []
already_added = []
for row in androidData:
    name = row[0]
    n_reviews = float(row[3])
    if name not in already_added and reviews_max[name] == n_reviews:
        android_clean.append(row)
        already_added.append(name) # prevents duplicates with the same name AND number of reviews
# a few checks follow
print('Check length of android_clean is as expected (i.e. 9659): ', len(android_clean), '\n')

print(androidHeader, '\n')
explore_data(android_clean, 0, 3)

for row in android_clean:
    name = row[0]
    if name == netflix:
        print(row)

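## Step 2b - remove two duplicates from iosData
# Rows 4831 (VR Roller Coaster) and 4463 (Mannequin Challenge) to be removed
# copy the list so that deleting rows does not also modify the raw iosData
ios_clean = iosData.copy()
del ios_clean[4831] # delete the higher index first so the lower index is unaffected
del ios_clean[4463]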
# quick check
for row in ios_clean:
    name = row[1]
    if name == 'Mannequin Challenge' or name == 'VR Roller Coaster':
        print(name, ': ', row[5])

Check length of reviews_max is as expected (9659):  9659
Check if Netflix entry is as expected (5456708):  5456708.0 

Check length of android_clean is as expected (i.e. 9659):  9659 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Netflix', 'ENTERTAINMENT', '4.4', '5456708', 'Varies with device', '100,000,000+', 'Free', '0', 'Tee
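
The two steps can also be folded into a single pass by storing, for each app name, the row with the most reviews seen so far. This is a minimal sketch on hypothetical rows shaped like the Android data (name at index 0, review count at index 3); `keep_most_reviewed` is an illustrative name, not a function from the notebook.

```python
# Hypothetical rows: name at index 0, review count (as a string) at index 3.
rows = [
    ['Netflix', 'ENTERTAINMENT', '4.4', '5456208'],
    ['Netflix', 'ENTERTAINMENT', '4.4', '5456708'],
    ['Netflix', 'FAMILY', '4.4', '5453997'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
]

def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        # Strict '>' means a tie keeps the earlier row, so exact duplicates collapse to one.
        if name not in best or int(row[reviews_idx]) > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

clean = keep_most_reviewed(rows)
for row in clean:
    print(row[0], row[3])
```

One difference from the two-step version: the result is ordered by the first appearance of each app name, not by the position of the row that is kept.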

### Filtering out Non-English App Entries

This company produces apps for an English-speaking audience, so data relating to apps that are not marketed to English-speaking audiences are not relevant to the scope of this investigation. The following code is designed to identify app entries that are most likely marketed at an English-speaking audience, by interrogating the app names. Common English characters are in the ASCII number range 0-127; if there are any characters in an app name outside this range, then it is possible that the app is non-English. However, it is not uncommon for an English app name to contain the odd character outside the common ASCII range (e.g. ™ or an emoji). To keep this step simple, the code arbitrarily determines that an app is non-English if the name contains more than 3 characters outside the 0-127 ASCII range.

Datasets produced:

`android_clean_english`: android_clean with non-English app entries filtered out

`ios_clean_english`: ios_clean with non-English app entries filtered out

In [7]:
# function to check if a string contains characters not found in common English
# common English characters here means ASCII values between 0 - 127
# some English apps do contain non common characters
# assume greater than 3 non common characters means it can be neglected
def check_english(phrase):
    count = 0
    for char in phrase:
        if ord(char) > 127:
            count += 1
            if count > 3:
                return False
    return True
# test function
test_names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go™ Free Office Suite', 'Instachat 😜']
for name in test_names:
    print(check_english(name))

# create ios and android lists without non-english apps, as determined by the above function
android_clean_english = []
for row in android_clean:
    name = row[0]
    if check_english(name):
        android_clean_english.append(row)

ios_clean_english = []
for row in ios_clean:
    name = row[1]
    if check_english(name):
        ios_clean_english.append(row)

True
False
True
True
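
To see why each test name passes or fails, it can help to print the number of non-ASCII characters alongside the verdict. The sketch below uses a hypothetical helper, `count_non_ascii`, and applies the same threshold of 3 as `check_english`.

```python
def count_non_ascii(phrase):
    """Count the characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for char in phrase if ord(char) > 127)

test_names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播', 'Docs To Go™ Free Office Suite', 'Instachat 😜']
for name in test_names:
    n = count_non_ascii(name)
    # a name is treated as English when it has at most 3 non-ASCII characters
    print(n, n <= 3)
```

The ™ symbol and the emoji each count as a single character, so those two names stay well under the threshold, while the second name exceeds it many times over.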


### Filtering Out Paid-for Apps

The last step of data cleaning is the removal of paid-for apps in the datasets, bearing in mind that the scope of this investigation is to understand the profitability of free apps.

In [8]:
# remove paid-for apps
# price is index 4 in ios data; type ('Free' or 'Paid') is index 6 in android data
# expected length of free app ios data is roughly 50% of 6181
# expected length of free app android data is roughly 90% of 9614

android_clean_eng_free = []
for row in android_clean_english:
    free_or_paid = row[6]
    if free_or_paid == 'Free':
        android_clean_eng_free.append(row)
print(len(android_clean_eng_free))

ios_clean_eng_free = []
for row in ios_clean_english:
    price = float(row[4])
    if price == 0:
        ios_clean_eng_free.append(row)
print(len(ios_clean_eng_free))

8863
3220
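
The Android filter relies on the 'Type' column while the iOS filter relies on the price, so a cross-check between 'Type' and 'Price' on the Android side may be worthwhile. The sketch below runs on hypothetical rows and assumes that paid prices are written with a leading dollar sign (e.g. '$4.99'); `parse_price` is an illustrative helper.

```python
def parse_price(price):
    """Convert a price string such as '0' or '$4.99' to a float (format assumed)."""
    return float(price.replace('$', ''))

# Hypothetical rows shaped like the Android data: 'Type' at index 6, 'Price' at index 7.
rows = [
    ['App A', 'TOOLS', '4.1', '159', '19M', '10,000+', 'Free', '0'],
    ['App B', 'GAME', '4.5', '2000', '25M', '100,000+', 'Paid', '$4.99'],
    ['App C', 'FAMILY', '4.0', '12', '3M', '1,000+', 'Free', '$0.99'],  # deliberately inconsistent
]

# Flag rows where 'Type' says free but the price is non-zero, or vice versa.
inconsistent = [row[0] for row in rows
                if (row[6] == 'Free') != (parse_price(row[7]) == 0)]
print(inconsistent)  # ['App C']
```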


## Data Analysis

To minimise risks and overheads, the high level strategy for validating a new app idea is as follows:
1. Develop a minimal Android version and add it to the Play Store
2. Develop the app further if it has a good response from users
3. If after 6 months the app is profitable, an iOS version is built and added to the App Store

Since the end goal is to produce an app in both markets, it is pertinent to get some understanding of apps that are successful in both markets. Frequency tables are generated to see what the most common app genres are.

In [9]:
# genres are found in column indices 1 or 9 (android) and 11 (ios)
# produces percentage frequency table
def freq_table(dataset, index):
    table = {}
    for row in dataset:
        col = row[index]
        if col in table:
            table[col] += 1
        else:
            table[col] = 1
    for key in table:
        table[key] = table[key] / len(dataset) * 100 # convert to percentage
    return table

# displays a percentage frequency table, with percentages rounded to 1dp
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (round(table[key], 1), key) # note table value is rounded to 1dp
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('Android Category frequency table:-')
display_table(android_clean_eng_free, 1)

print('\nAndroid Genres frequency table:-')
display_table(android_clean_eng_free, 9)

print('\niOS prime_genre frequency table:-')
display_table(ios_clean_eng_free, 11)

Android Category frequency table:-
FAMILY : 18.9
GAME : 9.7
TOOLS : 8.5
BUSINESS : 4.6
PRODUCTIVITY : 3.9
LIFESTYLE : 3.9
FINANCE : 3.7
MEDICAL : 3.5
SPORTS : 3.4
PERSONALIZATION : 3.3
COMMUNICATION : 3.2
HEALTH_AND_FITNESS : 3.1
PHOTOGRAPHY : 2.9
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.7
TRAVEL_AND_LOCAL : 2.3
SHOPPING : 2.2
BOOKS_AND_REFERENCE : 2.1
DATING : 1.9
VIDEO_PLAYERS : 1.8
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.2
EDUCATION : 1.2
ENTERTAINMENT : 1.0
LIBRARIES_AND_DEMO : 0.9
AUTO_AND_VEHICLES : 0.9
WEATHER : 0.8
HOUSE_AND_HOME : 0.8
PARENTING : 0.7
EVENTS : 0.7
COMICS : 0.6
BEAUTY : 0.6
ART_AND_DESIGN : 0.6

Android Genres frequency table:-
Tools : 8.5
Entertainment : 6.1
Education : 5.3
Business : 4.6
Productivity : 3.9
Lifestyle : 3.9
Finance : 3.7
Sports : 3.5
Medical : 3.5
Personalization : 3.3
Communication : 3.2
Health & Fitness : 3.1
Action : 3.1
Photography : 2.9
News & Magazines : 2.8
Social : 2.7
Travel & Local : 2.3
Shopping : 2.2
Books & Reference : 2.1
Simulatio

### iOS Genres
Considering the frequency table for iOS 'prime_genre', games make up the majority of free English apps in the sample dataset at 58.1%; entertainment apps follow at just 7.9%. Without data on the number of downloads, though, this only shows that most of the apps on offer are games; it does not by itself show that gaming apps command the largest number of users.

Education (3.7%), shopping (2.6%), utilities (2.5%), lifestyle (1.6%), and productivity (1.7%) make up around 12% of free English apps. Entertainment (7.9%), photo & video (5.0%), social networking (3.3%), sports (2.1%), and music (2.0%) make up around 20%. This suggests that a free English app marketed for entertainment purposes would likely be popular, and would certainly have a lot of competition.
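
The two group totals quoted above can be checked with a short calculation. The percentages are taken directly from the text; the group names `practical` and `fun` are labels introduced here for illustration only.

```python
# Percentages of free English iOS apps per genre, as quoted in the text.
practical = {'Education': 3.7, 'Shopping': 2.6, 'Utilities': 2.5,
             'Lifestyle': 1.6, 'Productivity': 1.7}
fun = {'Entertainment': 7.9, 'Photo & Video': 5.0, 'Social Networking': 3.3,
       'Sports': 2.1, 'Music': 2.0}

# Round to 1dp to avoid floating-point noise in the printed totals.
practical_total = round(sum(practical.values()), 1)
fun_total = round(sum(fun.values()), 1)

print(practical_total)  # 12.1
print(fun_total)        # 20.3
```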

### Android Genres
The frequency table for 'Genres' contains a lot of subcategorisation that shows some detail within top level genres. In order to compare with the iOS data, the 'Category' frequency table will be focussed on here. The top category is family (18.9%), followed by games (9.7%). Gaming apps don't appear to have the same level of dominance in the Android market that they do in the iOS market. It is worth noting that family apps may include games for children. Nevertheless, the landscape here is more balanced between entertainment genres and practical genres.

## Determining the Types of Apps with the Most Users

### iOS Apps
The code below generates the average total number of ratings for apps in each genre (as a proxy for the number of app downloads). The aim is to compare this with the genre frequency table to suggest an app profile for the App Store.

In [10]:
# produce a second dictionary with average total ratings (proxy for downloads) per genre
ios_genre_ft = freq_table(ios_clean_eng_free, 11) # prime_genre is column index 11
ios_genre_avg_users = {}
for genre in ios_genre_ft:
    total = 0
    len_genre = 0
    for row in ios_clean_eng_free:
        app_genre = row[11]
        ratings_tot = float(row[5])
        if app_genre == genre:
            total += ratings_tot
            len_genre += 1
    ios_genre_avg_users[genre] = total / len_genre

# print the averages
def display_sorted_dict(dictionary, rev=True):
    table = []
    for key in dictionary:
        dict_item = (dictionary[key], key)
        table.append(dict_item)
    table = sorted(table, reverse=rev)
    for item in table:
        print(item[1], ' : ', item[0])

display_sorted_dict(ios_genre_avg_users)

Navigation  :  86090.33333333333
Reference  :  74942.11111111111
Social Networking  :  71548.34905660378
Music  :  57326.530303030304
Weather  :  52279.892857142855
Book  :  39758.5
Food & Drink  :  33333.92307692308
Finance  :  31467.944444444445
Photo & Video  :  28441.54375
Travel  :  28243.8
Shopping  :  26919.690476190477
Health & Fitness  :  23298.015384615384
Sports  :  23008.898550724636
Games  :  22812.92467948718
News  :  21248.023255813954
Productivity  :  21028.410714285714
Utilities  :  18684.456790123455
Lifestyle  :  16485.764705882353
Entertainment  :  14029.830708661417
Business  :  7491.117647058823
Education  :  7003.983050847458
Catalogs  :  4004.0
Medical  :  612.0


Despite Games being the genre most populated with apps, its average number of ratings is only about a quarter of that of Navigation, the top genre by average number of ratings. The frequency of apps within a genre can be considered an indication of the competition a new app would face more than of the popularity of the genre. The average number of ratings (as a proxy for number of downloads) is perhaps a more appropriate indication of the popularity of a genre in the free English app market. In this space Navigation and Reference have the greatest average total ratings, but make up only 0.2% and 0.6% respectively of the sample dataset. A new app in one of these genres could have much more potential than one in the crowded gaming space. However, before making a firm recommendation, it would be worth checking that the average total ratings presented here aren't skewed by the relatively small number of free English apps in these genres.
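
One way to carry out that check is to report the number of apps and the median alongside the mean for each genre, since a large gap between mean and median points to a few dominant apps. The sketch below uses made-up rating counts purely for illustration; in the notebook the pairs would be built from `row[11]` and `row[5]` of `ios_clean_eng_free`.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (genre, total ratings) pairs, for illustration only.
apps = [
    ('Navigation', 300000), ('Navigation', 9000), ('Navigation', 3),
    ('Games', 2000000), ('Games', 30000), ('Games', 1000), ('Games', 100),
]

# Group the rating counts by genre.
by_genre = defaultdict(list)
for genre, n_ratings in apps:
    by_genre[genre].append(n_ratings)

# For each genre: (number of apps, mean ratings, median ratings).
summary = {g: (len(v), mean(v), median(v)) for g, v in by_genre.items()}

# Print genres from highest to lowest mean.
for genre, (n, avg, med) in sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f'{genre}: n={n}, mean={avg:.1f}, median={med}')
```

In this toy data the mean of each genre is far above its median, which is the pattern that would indicate a genre average driven by one or two very popular apps.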

### Android Apps

In [13]:
android_category_ft = freq_table(android_clean_eng_free, 1)
android_category_avg_dls = {}
for cat in android_category_ft:
    total = 0
    len_cat = 0
    for row in android_clean_eng_free:
        app_genre = row[1]
        dls = row[5]
        dls = dls.replace('+', '')
        dls = dls.replace(',', '')
        dls = float(dls)
        if app_genre == cat:
            total += dls
            len_cat += 1
    android_category_avg_dls[cat] = total / len_cat

#print(android_category_avg_dls)

display_sorted_dict(android_category_avg_dls)

print('\n')
display_table(android_clean_eng_free, 1)

COMMUNICATION  :  38456119.167247385
VIDEO_PLAYERS  :  24727872.452830188
SOCIAL  :  23253652.127118643
PHOTOGRAPHY  :  17840110.40229885
PRODUCTIVITY  :  16787331.344927534
GAME  :  15588015.603248259
TRAVEL_AND_LOCAL  :  13984077.710144928
ENTERTAINMENT  :  11640705.88235294
TOOLS  :  10801391.298666667
NEWS_AND_MAGAZINES  :  9549178.467741935
BOOKS_AND_REFERENCE  :  8767811.894736841
SHOPPING  :  7036877.311557789
PERSONALIZATION  :  5201482.6122448975
WEATHER  :  5074486.197183099
HEALTH_AND_FITNESS  :  4188821.9853479853
MAPS_AND_NAVIGATION  :  4056941.7741935486
FAMILY  :  3697848.1731343283
SPORTS  :  3638640.1428571427
ART_AND_DESIGN  :  1986335.0877192982
FOOD_AND_DRINK  :  1924897.7363636363
EDUCATION  :  1833495.145631068
BUSINESS  :  1712290.1474201474
LIFESTYLE  :  1437816.2687861272
FINANCE  :  1387692.475609756
HOUSE_AND_HOME  :  1331540.5616438356
DATING  :  854028.8303030303
COMICS  :  817657.2727272727
AUTO_AND_VEHICLES  :  647317.8170731707
LIBRARIES_AND_DEMO  :  638

For the Android dataset, Books & Reference and Maps & Navigation seem to have reasonably high download numbers, but not particularly high compared to the top categories. The high download rates in these categories could be skewed by a handful of apps that dominate a given category, so some deeper investigation within categories to remove such apps from the datasets may prove to be a useful perspective when the aim is to enter the market with a new free English app.
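
The deeper investigation suggested above could start by recomputing a category average with the largest apps excluded. This is a minimal sketch on hypothetical rows; the helper names and the cap of 100 million installs are arbitrary assumptions, not values from the original analysis.

```python
def parse_installs(installs):
    """Convert an install bucket such as '10,000,000+' to an integer lower bound."""
    return int(installs.replace('+', '').replace(',', ''))

# Hypothetical (name, category, installs) rows.
rows = [
    ('Giant Messenger', 'COMMUNICATION', '1,000,000,000+'),
    ('Small Chat', 'COMMUNICATION', '100,000+'),
    ('Tiny Dialer', 'COMMUNICATION', '50,000+'),
]

def average_installs(rows, category, cap=None):
    """Average installs for a category, optionally ignoring apps at or above cap."""
    values = [parse_installs(r[2]) for r in rows if r[1] == category]
    if cap is not None:
        values = [v for v in values if v < cap]
    return sum(values) / len(values) if values else 0.0

print(average_installs(rows, 'COMMUNICATION'))                   # dominated by one app
print(average_installs(rows, 'COMMUNICATION', cap=100_000_000))  # 75000.0
```

Comparing the two averages for each category gives a rough sense of how much of the headline figure comes from a handful of very large apps.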

Books & Reference and Maps & Navigation account for a small proportion of free English Android apps (2.1% and 1.4% respectively), just as they account for a small proportion of the iOS dataset, so a new app in one of these categories could see moderate to good success in the Android market, and then see good success in the iOS market.