# Profitable App Profiles In The App Store And Google Play Markets

## Introduction

Our aim in this project is to help our developers understand what types of apps attract the most users and to identify lucrative markets.

The data sources are:
Google Play (Android) dataset: https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

Apple App Store (iOS) dataset: https://dq-content.s3.amazonaws.com/350/AppleStore.csv

First, let's import the data we need.

In [1]:
from csv import reader

# The Google Play Store dataset
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]

# The App Store dataset
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]
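
As a variation on the loading cell, the two blocks could be folded into one helper that also closes the file for us. This is only a sketch: `load_dataset` is a hypothetical helper name, the `utf8` encoding is an assumption about how the CSV files are stored, and the demonstration uses a tiny made-up temporary file rather than the real data.

```python
from csv import reader
import os
import tempfile


def load_dataset(path, encoding='utf8'):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]


# Demonstration on a tiny temporary file (made-up rows, not the real data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category\nExample App,TOOLS\n')
    tmp_path = tmp.name

demo_header, demo_rows = load_dataset(tmp_path)
os.remove(tmp_path)
print(demo_header)
print(demo_rows)
```

With the real files this would be called as `load_dataset('googleplaystore.csv')` and `load_dataset('AppleStore.csv')`.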

Let's preview the columns in both stores.

In [140]:
print(android_header, '\n \n ', ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 
 
  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Here we create a function that prints any slice of a dataset and, optionally, its number of rows and columns.

In [12]:
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print each row in the requested slice, separated by a blank line
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    # Optionally report the size of the whole dataset
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        

Let's preview some rows in both datasets.
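
A preview call might look like the following. This sketch repeats `explore_data` so that it runs on its own, and it uses a made-up three-row `sample` list as a stand-in for `android` or `ios`.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print each row in the slice, then optionally the dataset's shape
    for row in dataset[start:end]:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


# Made-up rows standing in for the real datasets.
sample = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', 'GAME', '4.5'],
    ['App C', 'FAMILY', '3.9'],
]

# Print the first two rows, then the overall shape.
explore_data(sample, 0, 2, rows_and_columns=True)
```

On the real data the equivalent calls would be `explore_data(android, 0, 3, True)` and `explore_data(ios, 0, 3, True)`.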

## Data Cleaning

A reviewer of this dataset describes a row with an error: its values are said to be shifted. Let's search for this specific row.

In [14]:
for row in android:
    headerlength = len(android_header)
    rowlength = len(row)
    if rowlength != headerlength:
        print(row)
        print(android.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472
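
The same search can be written with `enumerate`, which hands us each row's position directly instead of looking it up again with `list.index` (which rescans the list and returns the first equal row). A minimal sketch on made-up rows; `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up data: the second row is missing its 'Category' value.
demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(demo_rows, demo_header)
print(malformed)
```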


Let's compare this to the column names and another row.

In [19]:
print(android_header[:5])
print(android[10472][:5])
print(android[10473][:5])

['App', 'Category', 'Rating', 'Reviews', 'Size']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M']


So row 10472 has its values shifted: where there should be a 'Category' there is in fact the rating. As we have 10,000+ rows, we can simply delete this anomaly.

In [20]:
del android[10472]

Check that it worked:

In [21]:
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Perfect. Now let's investigate whether there are any duplicates in the Android dataset.

In [24]:
unique_apps = []
duplicate_apps = []
for row in android:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print(duplicate_apps)

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite', 'TurboScan: scan documents and receipts in PDF', 'Tiny Scanner Pro: PDF Doc Scan', 'Docs To Go™ Free Office Suite', 'OfficeSuite : Free Office + PDF Editor',

This was more than expected. Just how many duplicates are there?

In [26]:
print(len(duplicate_apps))

1181
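
For a list this size, checking `name in unique_apps` on every row is slow because each check scans the list. A possible alternative is `collections.Counter`, which tallies the names in one pass. The sketch below uses a made-up list of names and counts every appearance beyond the first as one duplicate, which is how `duplicate_apps` is filled above.

```python
from collections import Counter

# Made-up app names standing in for the 'App' column.
demo_names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Slack']

name_counts = Counter(demo_names)

# Every appearance beyond the first counts as one duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
n_unique = len(name_counts)

print('Duplicates:', n_duplicates)
print('Unique names:', n_unique)
```

The two numbers always add up to the number of rows, which is a handy sanity check.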


Let's now see an example of this.

In [32]:
print(android_header[:8])
for row in android:
    name = row[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(row[:8])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0']


Clearly we want to remove the duplicates and keep one entry for each app. But which one? The rows differ only in the number of reviews, so a sensible criterion is to keep the entry with the highest number of reviews, which is likely to be the most recent.

In [33]:
reviews_max = {}
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

    

9659


Having worked out the number of unique apps, let's now create a dataset containing one row per app.

In [35]:
android_clean = []
already_added = []
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
# Check: the number of rows should match the number of unique apps.
print(len(android_clean))

9659
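
The two loops above could also be merged into a single pass that stores the best row per name in a dictionary. This is a sketch on made-up rows, and `keep_most_reviewed` is a hypothetical helper name. One difference to be aware of: the result is ordered by each name's first appearance, whereas `android_clean` is ordered by where the most-reviewed row sits.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows laid out as App, Category, Rating, Reviews.
demo_rows = [
    ['Scanner', 'BUSINESS', '4.2', '80805'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Scanner', 'BUSINESS', '4.2', '80804'],
    ['Box', 'BUSINESS', '4.2', '159873'],
]

demo_clean = keep_most_reviewed(demo_rows)
print(demo_clean)
```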


## Removing Non-English Apps From The Dataset

Our developers want to develop an app in English, but many entries in the dataset are not in English. For instance:

In [37]:
print(android_clean[4412][0])

中国語 AQリスニング


Fortunately these can be detected from the characters used. In Python, the built-in `ord` function returns the code point of a character, and the characters commonly used in English text (ASCII) fall in the range 0 to 127. Anything above 127 is unlikely to be part of an English name. We need a function that decides whether a name is written in English characters.

However, some English names contain an extra character (like a trademark sign or a smiley). So in our function a name is treated as non-English only if it contains 3 or more characters above 127; names with at most two such characters are kept.

In [38]:
def is_english(name):
    non_english_char = 0
    for char in name:
        if ord(char) > 127:
            non_english_char += 1
    if (non_english_char >= 3):
        return False    
    else:
        return True

Let's run a check on some entries to see if the function works.

In [59]:
a_list = ['Instagram','爱奇艺PPS -《欢乐颂2》电视剧热播','Docs To Go™ Free Office Suite',
          'Instachat 😜']
for item in a_list:
    print(is_english(item))

True
False
True
True
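
To see why these four names come out as they do, it can help to print the code points involved and count how many characters in each name fall outside the ASCII range. A small sketch; `count_non_ascii` is a hypothetical helper, not part of the analysis above.

```python
def count_non_ascii(name):
    """Count the characters whose code point is above 127."""
    return sum(1 for char in name if ord(char) > 127)


# Code points of a few sample characters.
for char in ['a', 'Z', '™', '😜', '爱']:
    print(repr(char), ord(char))

# Number of non-ASCII characters in each test name.
demo_names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播',
              'Docs To Go™ Free Office Suite', 'Instachat 😜']
demo_counts = {name: count_non_ascii(name) for name in demo_names}
for name, count in demo_counts.items():
    print(count, name)
```

Only the second name reaches the threshold of 3, which matches the `False` returned by `is_english`.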


Excellent. Now let's create new datasets for Android and iOS that include only these English apps.

In [60]:
android_english = []
for row in android_clean:
    name = row[0]
    if is_english(name):
        android_english.append(row)

print(android_english[:5])        

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


In [61]:
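ios_english = []
for row in ios:
    # Caution: index 0 is the numeric 'id' column, so this check passes for
    # every row; the app name ('track_name') is at index 1.
    name = row[0]
    if is_english(name):
        ios_english.append(row)

print(ios_english[:5])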

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']]


In [62]:
print(len(android_english))
print(len(ios_english))

9597
7197


## Selecting The Free Apps

We have filtered for English-only apps. Our developers want to produce apps using a 'free' model: free to download, with revenue coming from advertisements. Now let's select the free apps in both markets.

In [65]:
free_apps_android = []
for row in android_english:
    app_type = row[6]  # index 6 is the 'Type' column, not 'Price'
    if app_type == 'Free':
        free_apps_android.append(row)
print(free_apps_android[:3])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


In [67]:
free_apps_ios = []
for row in ios_english:
    price = row[4]
    if price == '0.0':
        free_apps_ios.append(row)
print(free_apps_ios[:3])

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]


In [69]:
print(len(free_apps_android))
print(len(free_apps_ios))

8847
4056


A very high proportion of the Android apps are free: 8847 of 9597, or about 92%. On iOS the figure is 4056 of 7197, or about 56%, so the Android dataset holds more than twice as many free apps.

Which columns might be useful in determining the best genre to target?

In [78]:
print(free_apps_ios[3][11])
print(free_apps_android[3][1])
print(free_apps_android[0][9])

Games
ART_AND_DESIGN
Art & Design


Now let's build a frequency table to summarise these columns.

In [79]:
def freq_table(dataset, index):
    frequency = {}
    total = 0
    for row in dataset:
        total += 1
        a_type = row[index]
        if a_type in frequency:
            frequency[a_type] += 1
        else:
            frequency[a_type] = 1
    table_percent = {}
    for key in frequency:
        percent = ((frequency[key] / total) * 100)
        table_percent[key] = percent
    return table_percent

Let's preview what this function does.

In [80]:
freq_table(free_apps_android, 1)

{'ART_AND_DESIGN': 0.6442861987114276,
 'AUTO_AND_VEHICLES': 0.9268678648129309,
 'BEAUTY': 0.5990731321351871,
 'BOOKS_AND_REFERENCE': 2.136317395727365,
 'BUSINESS': 4.600429524132474,
 'COMICS': 0.6103763987792472,
 'COMMUNICATION': 3.2327342602011986,
 'DATING': 1.8650389962699219,
 'EDUCATION': 1.1642364643381937,
 'ENTERTAINMENT': 0.9607776647451114,
 'EVENTS': 0.7121057985757884,
 'FINANCE': 3.7074714592517237,
 'FOOD_AND_DRINK': 1.2433593308466147,
 'HEALTH_AND_FITNESS': 3.0857917938284163,
 'HOUSE_AND_HOME': 0.8025319317282694,
 'LIBRARIES_AND_DEMO': 0.938171131456991,
 'LIFESTYLE': 3.888323725556686,
 'GAME': 9.698202780603594,
 'FAMILY': 18.932971628800725,
 'MEDICAL': 3.537922459590822,
 'SOCIAL': 2.6675709279981916,
 'SHOPPING': 2.2493500621679665,
 'PHOTOGRAPHY': 2.950152594099695,
 'SPORTS': 3.39097999321804,
 'TRAVEL_AND_LOCAL': 2.3397761953204474,
 'TOOLS': 8.45484344975698,
 'PERSONALIZATION': 3.3231603933536795,
 'PRODUCTIVITY': 3.8996269922007465,
 'PARENTING': 0.65
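
For comparison, the same percentages can be computed with `collections.Counter` from the standard library. This is a sketch on made-up rows; `freq_table_counter` is a hypothetical name chosen so that it does not clash with `freq_table`.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


# Made-up rows laid out as App, Category.
demo_rows = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
    ['App D', 'FAMILY'],
]

demo_table = freq_table_counter(demo_rows, 1)
print(demo_table)
```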

Next let's create a function that displays this frequency table sorted from high to low.

In [95]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0],4))
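
Swapping each pair into a `(value, key)` tuple is one way to sort a dictionary by value. Another option is to pass a `key` function to `sorted`, which leaves the pairs in `(key, value)` order. A sketch on a made-up table; `sorted_by_value` is a hypothetical name. Note that ties are handled differently: here equal values keep their original order, while the tuple approach breaks ties by name.

```python
def sorted_by_value(table):
    """Return (key, value) pairs ordered from highest to lowest value."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up percentages.
demo_table = {'TOOLS': 25.0, 'GAME': 50.0, 'FAMILY': 12.5, 'BEAUTY': 12.5}

demo_sorted = sorted_by_value(demo_table)
for name, percent in demo_sorted:
    print(name, ':', round(percent, 4))
```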

In [96]:
display_table(free_apps_android, 1)

FAMILY : 18.933
GAME : 9.6982
TOOLS : 8.4548
BUSINESS : 4.6004
PRODUCTIVITY : 3.8996
LIFESTYLE : 3.8883
FINANCE : 3.7075
MEDICAL : 3.5379
SPORTS : 3.391
PERSONALIZATION : 3.3232
COMMUNICATION : 3.2327
HEALTH_AND_FITNESS : 3.0858
PHOTOGRAPHY : 2.9502
NEWS_AND_MAGAZINES : 2.8032
SOCIAL : 2.6676
TRAVEL_AND_LOCAL : 2.3398
SHOPPING : 2.2494
BOOKS_AND_REFERENCE : 2.1363
DATING : 1.865
VIDEO_PLAYERS : 1.7972
MAPS_AND_NAVIGATION : 1.3903
FOOD_AND_DRINK : 1.2434
EDUCATION : 1.1642
ENTERTAINMENT : 0.9608
LIBRARIES_AND_DEMO : 0.9382
AUTO_AND_VEHICLES : 0.9269
HOUSE_AND_HOME : 0.8025
WEATHER : 0.7912
EVENTS : 0.7121
PARENTING : 0.6556
ART_AND_DESIGN : 0.6443
COMICS : 0.6104
BEAUTY : 0.5991


In [97]:
display_table(free_apps_android, 9)

Tools : 8.4435
Entertainment : 6.0812
Education : 5.3577
Business : 4.6004
Productivity : 3.8996
Lifestyle : 3.877
Finance : 3.7075
Medical : 3.5379
Sports : 3.4588
Personalization : 3.3232
Communication : 3.2327
Action : 3.0971
Health & Fitness : 3.0858
Photography : 2.9502
News & Magazines : 2.8032
Social : 2.6676
Travel & Local : 2.3285
Shopping : 2.2494
Books & Reference : 2.1363
Simulation : 2.0459
Dating : 1.865
Arcade : 1.8424
Video Players & Editors : 1.7746
Casual : 1.7633
Maps & Navigation : 1.3903
Food & Drink : 1.2434
Puzzle : 1.1303
Racing : 0.9947
Role Playing : 0.9382
Libraries & Demo : 0.9382
Auto & Vehicles : 0.9269
Strategy : 0.9043
House & Home : 0.8025
Weather : 0.7912
Events : 0.7121
Adventure : 0.6669
Comics : 0.5991
Beauty : 0.5991
Art & Design : 0.5991
Parenting : 0.4973
Card : 0.4521
Trivia : 0.4182
Casino : 0.4182
Educational;Education : 0.3956
Board : 0.3843
Educational : 0.373
Education;Education : 0.3391
Word : 0.26
Casual;Pretend Play : 0.2374
Music : 0.20

In [98]:
display_table(free_apps_ios, -5)

Games : 55.646
Entertainment : 8.2347
Photo & Video : 4.1174
Social Networking : 3.5256
Education : 3.2544
Shopping : 2.9832
Utilities : 2.6874
Lifestyle : 2.3176
Finance : 2.071
Sports : 1.9477
Health & Fitness : 1.8738
Music : 1.6519
Book : 1.6272
Productivity : 1.5286
News : 1.43
Travel : 1.3807
Food & Drink : 1.0602
Weather : 0.7643
Reference : 0.4931
Navigation : 0.4931
Business : 0.4931
Catalogs : 0.2219
Medical : 0.1972


Games are highly ranked in both datasets. 

Let's explore the iOS market further and look at the average number of user ratings per genre, using the genres from the frequency table we created.

In [115]:
prime_genre = freq_table(free_apps_ios, -5)
prime_dictionary = {}
for genre in prime_genre:
    total = 0
    len_genre = 0
    for app in free_apps_ios:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])  # 'rating_count_tot' column
            total += n_ratings
            len_genre += 1
    average = round(total / len_genre)
    prime_dictionary[genre] = average

sorted(prime_dictionary.items(), key=lambda x: x[1] ,reverse= True)

[('Reference', 67448),
 ('Music', 56482),
 ('Social Networking', 53078),
 ('Weather', 47221),
 ('Photo & Video', 27250),
 ('Navigation', 25972),
 ('Travel', 20216),
 ('Food & Drink', 20179),
 ('Sports', 20129),
 ('Health & Fitness', 19952),
 ('Productivity', 19054),
 ('Games', 18925),
 ('Shopping', 18747),
 ('News', 15893),
 ('Utilities', 14010),
 ('Finance', 13522),
 ('Entertainment', 10823),
 ('Lifestyle', 8978),
 ('Book', 8498),
 ('Business', 6368),
 ('Education', 6266),
 ('Catalogs', 1780),
 ('Medical', 460)]
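
The nested loop above scans the whole dataset once per genre. A possible single-pass version accumulates a running total and a count for each genre as it goes. This is a sketch on made-up rows; `average_by_group` is a hypothetical helper name.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows laid out as genre, number of ratings.
demo_rows = [['Games', '100'], ['Games', '300'], ['Music', '50']]

demo_averages = average_by_group(demo_rows, 0, 1)
print(demo_averages)
```

On the real data the call would be `average_by_group(free_apps_ios, -5, 5)`.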

In [121]:

categories_android = freq_table(free_apps_android, 1)
android_dictionary = {}
for category in categories_android:
    total = 0
    len_category = 0
    for app in free_apps_android:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = round(total / len_category)
    android_dictionary[category] = avg_n_installs

sorted(android_dictionary.items(), key=lambda x: x[1] ,reverse= True)

[('COMMUNICATION', 38590581),
 ('VIDEO_PLAYERS', 24727872),
 ('SOCIAL', 23253652),
 ('PHOTOGRAPHY', 17840110),
 ('PRODUCTIVITY', 16787331),
 ('GAME', 15544015),
 ('TRAVEL_AND_LOCAL', 13984078),
 ('ENTERTAINMENT', 11640706),
 ('TOOLS', 10830252),
 ('NEWS_AND_MAGAZINES', 9549178),
 ('BOOKS_AND_REFERENCE', 8814200),
 ('SHOPPING', 7036877),
 ('PERSONALIZATION', 5201483),
 ('WEATHER', 5145550),
 ('HEALTH_AND_FITNESS', 4188822),
 ('MAPS_AND_NAVIGATION', 4049275),
 ('FAMILY', 3697848),
 ('SPORTS', 3650602),
 ('ART_AND_DESIGN', 1986335),
 ('FOOD_AND_DRINK', 1924898),
 ('EDUCATION', 1833495),
 ('BUSINESS', 1712290),
 ('LIFESTYLE', 1446158),
 ('FINANCE', 1387692),
 ('HOUSE_AND_HOME', 1360598),
 ('DATING', 854029),
 ('COMICS', 832614),
 ('AUTO_AND_VEHICLES', 647318),
 ('LIBRARIES_AND_DEMO', 638504),
 ('PARENTING', 542604),
 ('BEAUTY', 513152),
 ('EVENTS', 253542),
 ('MEDICAL', 120551)]
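
The install counts are stored as text such as `'10,000,000+'`, which is why the commas and plus signs are stripped before converting. That step can be isolated in a small function, sketched below under the hypothetical name `parse_installs`. Keep in mind that these values are open-ended buckets, so the averages above are rough estimates rather than exact install counts.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))


demo_values = ['10,000,000+', '1,000+', '0']
demo_parsed = [parse_installs(value) for value in demo_values]
print(demo_parsed)
```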

## Conclusion

The Android market is far larger in terms of free applications. By average number of installs, the biggest Android categories are COMMUNICATION, VIDEO_PLAYERS and SOCIAL. On iOS, Social Networking also ranks near the top by average number of user ratings.

However, each of these categories holds only a small share of the free Android apps (roughly 2-3%), which suggests that a few applications dominate them. That would make them difficult markets for a new application.
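
One way to probe whether a few applications dominate a category is to compare the mean number of installs with the median: a mean far above the median points to a handful of very large apps pulling the average up. The sketch below uses made-up install counts, not values from the dataset.

```python
from statistics import mean, median

# Made-up install counts for one category: a single giant app plus
# several small ones.
demo_installs = [1_000_000_000, 500_000, 100_000, 50_000, 10_000]

demo_mean = mean(demo_installs)
demo_median = median(demo_installs)

print('Mean installs:  ', demo_mean)
print('Median installs:', demo_median)
```

In this made-up example the mean is about two thousand times the median, so the average says little about a typical app in the category.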

The FAMILY category holds the largest share of free Android apps (about 19%), with an average of about 3.7 million installs per app. The GAME category combines a high share of apps (about 9.7%) with a high average number of installs (about 15.5 million).

A game targeting the family market would be a suitable suggestion for a new app.