# Profitable App Profiles for the App Store and Google Play Markets

My aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. I am analysing data for a company that builds Android and iOS mobile apps, and my job is to enable our team of developers to make data-driven decisions about the kind of apps they build.

This company only builds apps that are free to download and install, and its main source of revenue is in-app ads. This means that the revenue for any given app is mostly influenced by the number of people who use it. My goal for this project is to analyse the data to help developers understand which types of apps are likely to attract more users.

# Opening and exploring data

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

# Read each CSV file into a list of rows (each row is a list of strings).
# The `with` blocks close the files once they have been read.
with open('C:/Users/amaar/Desktop/DataQuest/Projects/Project1/AppleStore.csv', encoding='utf8') as open_file_apple:
    data_apple = list(reader(open_file_apple))
with open('C:/Users/amaar/Desktop/DataQuest/Projects/Project1/googleplaystore.csv', encoding='utf8') as open_file_google:
    data_google = list(reader(open_file_google))

ios_header = data_apple[0]
ios = data_apple[1:]

android_header = data_google[0]
android = data_google[1:]

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


There are 7197 iOS apps in the data set. The columns of interest are: 'track_name', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver' and 'prime_genre'. Not all column names are self-explanatory; details can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


There are 10841 Android apps in the data set. The columns of interest are: 'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price' and 'Genres'.
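
The later cells refer to columns by numeric position. As a small sketch, the position of a column can be looked up by name with `list.index`. The helper name `column_index` is hypothetical, and the header lists are copied from the printed output above.

```python
# Header rows copied from the printed output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']


def column_index(header, name):
    """Return the position of a column name in a header row (hypothetical helper)."""
    return header.index(name)


print(column_index(ios_header, 'prime_genre'))  # 11
print(column_index(android_header, 'Genres'))   # 9
```

Looking the index up by name avoids counting columns by hand and raises a `ValueError` if the name is misspelled.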

# Deleting incorrect data

One of the threads in the Google Play data set's discussion section highlights an error in row 10472. I will print this row and compare it with the header and with another row.

In [5]:
print(android_header)
print('\n')
print(android[0])
print('\n')
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The rating in row 10472 (the 'Life Made WI-Fi Touchscreen Photo Frame' app) is 19, which is clearly incorrect as the maximum rating is 5. This is due to a missing category value, which shifts every later value one column to the left (see the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101)).

This row will be removed.

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840


With this row removed, the error reported in the discussion section is no longer in the data set.
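
The faulty row printed above has 12 values against 13 header columns, so one way to look for similar rows is to compare each row's length with the header's. The sketch below uses a hypothetical helper, `find_malformed_rows`, and made-up sample rows; it only catches rows with a missing or extra value, not other kinds of bad data.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's length."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up example: the second row is missing its category value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

print(find_malformed_rows(sample_header, sample_rows))
```

In the notebook, the equivalent call would be `find_malformed_rows(android_header, android)`.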

# Removing Duplicate Entries

Looking through the Google Play data, I found duplicate entries. For example, there are four entries for Instagram.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


There are 1,181 instances of duplicate apps in total.

In [8]:
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('No. of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])
print('\n')
print('No. of unique apps: ', len(unique_apps))

No. of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


No. of unique apps:  9659


As we do not want to count duplicate entries when we analyse the data, we must remove them so that only one entry per app is kept.

The only difference between the Instagram rows is the number of reviews. This matters because a higher number of reviews suggests that the row was collected more recently (even though the 'Last Updated' column contains the same date). Therefore, instead of removing duplicate rows at random, we will keep only the row with the highest number of reviews for each app.

To carry out the removal of duplicates, I will:
- create a dictionary where each app's name is a key and the value is the highest number of reviews of that app
- create a new data set using the dictionary, keeping only one entry per app: the one with the highest number of reviews

In [9]:
reviews_max = {}
for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [10]:
print('expected length of data set: ', len(android) - 1181)
print('Actual length of data set', len(reviews_max))

expected length of data set:  9659
Actual length of data set 9659


In [11]:
android_clean = []
already_added = []

for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(apps)
        already_added.append(name)

To confirm that there are only 9659 rows, we will explore the new cleaned data set.

In [12]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Just as calculated above, we have 9659 rows.

No duplicate entries were found in the App Store data.
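
The notebook does not show how the App Store data was checked for duplicates. One possible check, sketched here with a hypothetical helper `duplicated_values`, counts how often each value appears in a column that should be unique. Using the 'id' column (index 0) for this is an assumption.

```python
from collections import Counter


def duplicated_values(rows, index):
    """Return the values that appear more than once in the given column."""
    counts = Counter(row[index] for row in rows)
    return [value for value, n in counts.items() if n > 1]


# Made-up rows in the form [id, track_name]; the id '1' appears twice.
sample = [['1', 'App A'], ['2', 'App B'], ['1', 'App A']]
print(duplicated_values(sample, 0))  # ['1']
```

In the notebook, `duplicated_values(ios, 0)` returning an empty list would be consistent with the statement above.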

# Removing Non-English Apps

When looking at the data for both the App Store and Google Play, I found apps with non-English names. As the apps we develop are aimed at English speakers, these apps must be removed.

In [13]:
def english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

To test the above function, we will run it on four test cases.

In [14]:
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

True
False
False
False


The function above rejected the names containing the emoji and the ™ symbol, as these characters fall outside the ASCII range of 0 to 127. To avoid wrongly discarding such apps, we only remove an app if its name has four or more characters outside the ASCII range.

In [15]:
def english(string):
    non_english = 0
    
    for character in string:
        if ord(character) > 127:
            non_english +=1
            
    if non_english > 3:
        return False
    else:
        return True

In [16]:
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

True
False
True
True


Although the function is not perfect and may still let some non-English apps past the filter, it is sufficient for this stage of our analysis.
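
To illustrate where the rule falls short, the sketch below re-implements it under a hypothetical name, `is_english`, with the threshold exposed as a parameter (an illustrative addition, not part of the notebook). The two app names are made up.

```python
def is_english(string, max_non_ascii=3):
    """Same rule as the english function above: allow up to three non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii


# A short non-English name slips through: only 3 non-ASCII characters.
print(is_english('日本語'))               # True

# An English-looking name with four accented letters is rejected.
print(is_english('Naïve Café Déjà Vu'))  # False
```

Both kinds of error are possible, so the threshold is a trade-off and not a precise language test.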

Below we will use the english function to filter out non-English apps from both data sets. We will then explore the data to determine the number of English-language apps in each store.

In [17]:
ios_english = []
android_english = []


for app in ios:
    name = app[1]
    if english(name):
        ios_english.append(app)
        
for app in android_clean:
    name = app[0]
    if english(name):
        android_english.append(app)
        
explore_data(ios_english, 0, 3, True)
print('\n')
explore_data(android_english, 0, 3, True)


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Vari

After filtering, there are 6183 iOS apps and 9614 Android apps left.

# Removing Non-Free Apps

In [18]:
ios_free = []
android_free = []

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
        
        
for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
        
explore_data(ios_free, 0, 3, True)
print('\n')
explore_data(android_free, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Vari

There are 3222 iOS apps and 8864 Android apps left, which should be a large enough sample for our analysis.
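
The filter above compares price strings directly, so it relies on free apps being stored exactly as '0.0' and '0'. A slightly more defensive variant, sketched below, converts the price to a number first. The helper names `parse_price` and `is_free` are hypothetical, and the '$4.99' format for paid apps is an assumption about the data that is not shown in the output above.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(raw.strip().lstrip('$'))


def is_free(raw):
    """Return True when the parsed price is zero."""
    return parse_price(raw) == 0.0


for raw in ['0', '0.0', '$4.99']:
    print(raw, is_free(raw))
```

With this approach the same test works for both stores, whichever way a zero price happens to be written.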

In [19]:
ios_final = ios_free
android_final = android_free

The cleaned iOS and Android data sets are assigned to new names to mark them as final.

# Most Common App Genre

As stated in the introduction, our apps earn revenue from ads, so revenue is highly influenced by the number of people using them.

So far we have cleaned the data to:
 - Remove inaccurate data
 - Remove duplicated entries
 - Remove non-english apps
 - Remove all non-free apps

To minimise risk in the development of the app, we use a validation strategy with three steps:

 1) Build a minimal Android version of the app, and add it to Google Play.

 2) If the app has a good response from users, we develop it further.

 3) If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
 
The end goal is to add the app to both the App Store and Google Play. Therefore, we need to find app profiles that are successful in both markets. For instance, a profile that might work well in both markets is a productivity app that makes use of gamification.

We will begin the analysis by determining the most common genres for the two markets. We will build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns for the Google Play data set. 


Below I will create two functions:
 - One function will generate the frequency table that shows percentages
 - The second function displays the percentages in descending order

In [20]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percent = {}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percent[key] = percentage
    
    return table_percent

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
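
As an alternative sketch, the same kind of percentage table can be built with `collections.Counter` from the standard library. The function name `freq_table_counter` is hypothetical and the sample rows are made up. Ties may come out in a different order than with `display_table`, because `most_common` keeps equal counts in first-seen order.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Made-up rows in the form [name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, percentage in freq_table_counter(sample, 1):
    print(genre, ':', percentage)
```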


The prime_genre frequency table for the App Store is examined below.

In [22]:
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


More than half (58.16%) of the free English apps on the App Store are games. Approximately 8 percent of the apps are for entertainment purposes, whilst close to 5 percent are for photos and videos. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, having a larger number of apps of a specific genre does not imply that they have the greatest number of users: the demand might not match the supply.
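
One way to look at demand, as opposed to supply, is to average a popularity measure within each genre. The sketch below uses a hypothetical helper, `average_by_group`, on made-up rows. Treating 'rating_count_tot' as a proxy for the number of users is an assumption, since the App Store data has no install counts.

```python
def average_by_group(rows, group_index, value_index):
    """Average a numeric column within each group of another column."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows in the form [genre, rating_count_tot]
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(sample, 0, 1))  # {'Games': 200.0, 'Music': 50.0}
```

In the notebook, the equivalent call would be `average_by_group(ios_final, 11, 5)`, using the positions of 'prime_genre' and 'rating_count_tot' in the iOS header.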

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).

In [24]:
display_table(android_final, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The Google Play store seems to have a significantly different distribution of app types. Unlike the App Store