# Project 1: Profitable App Profiles for the App Store and Google Play Markets

The aim of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions about the kinds of apps they build.

At the company, we only build apps that are free to download and install, and our main source of revenue is in-app ads. This means that our revenue for any given app is mostly influenced by the number of people who use it. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [3]:
from csv import reader

# open the App Store data set (utf8 because app names contain non-ASCII characters)
with open('AppleStore.csv', encoding='utf8') as open_file:
    applst = list(reader(open_file))

# open the Google Play data set
with open('googleplaystore.csv', encoding='utf8') as open_file:
    ggplay = list(reader(open_file))


In [5]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
#print the first few rows of each data set
print(applst[0]) #apple store header
print('\n')
print(ggplay[0]) #google play header

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [6]:
# Find the number of rows and columns of each data set
explore_data(applst[1:], 0, 1, True) #explore the first non-header row in apple store dataset
print('\n')
explore_data(ggplay[1:], 0, 1, True) #explore the first non-header row in google play dataset

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


AppleStore.csv has 7197 apps and 16 characteristics for each app. The characteristics that we can analyze further are 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'cont_rating', and 'prime_genre'.

googleplaystore.csv has 10841 apps and 13 characteristics for each app. The characteristics that we can analyze further are 'Category', 'Rating', 'Reviews', 'Type', 'Price', 'Content Rating', and 'Genres'.

For an explanation of the App Store column names, see the [data set documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).
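
The rest of this notebook addresses columns by position (for example index 11 for 'prime_genre'). As a minimal sketch, assuming the two header rows printed earlier, a lookup built from the header can make those positions explicit. The names `column_index`, `ios_col`, and `android_col` are hypothetical and are not used elsewhere in the notebook.

```python
# Hypothetical helper: map each column name to its position in a row.
def column_index(header):
    return {name: position for position, name in enumerate(header)}

# Header rows copied from the output printed earlier in the notebook.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

ios_col = column_index(ios_header)
android_col = column_index(android_header)

print(ios_col['prime_genre'])   # 11
print(android_col['Installs'])  # 5
```

With such a lookup, `row[ios_col['prime_genre']]` reads the same value as `row[11]` but documents which column is meant.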

In [11]:
# print the malformed row in the Google Play data set
print(ggplay[10473]) # the 'Category' value is missing, so the remaining values are shifted one column to the left
print(ggplay[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [12]:
# remove the row
del ggplay[10473]

In [13]:
print(ggplay[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [15]:
# Check duplicate entries in google play dataset
unique_names = []
duplicate_names = []

for row in ggplay[1:]:
    app_name = row[0]
    if app_name in unique_names:
        duplicate_names.append(app_name)
    else:
        unique_names.append(app_name)
        
print('Examples of duplicate apps:', duplicate_names[:10])
print('\n')
print('Number of duplicate apps:', len(duplicate_names))

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Number of duplicate apps: 1181


In [17]:
# select a duplicated app to check whether it has multiple entries
for row in ggplay[1:]:
    app_name = row[0]
    if app_name == 'Slack':
        print(row)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


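Note that the first and second entries are identical, while the third has a slightly higher review count (51510 versus 51507), which suggests the entries were collected at different times. Rather than removing duplicates at random, I will keep only the entry with the highest number of reviews for each app, on the assumption that it is the most recent one, and remove the rest.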

In [21]:
# remove the duplicates in google play dataset
reviews_max = {}
for row in ggplay[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Length of the dictionary:', len(reviews_max))

Length of the dictionary: 9659


In [39]:
android_clean = []
already_added = []
for row in ggplay[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print('Length of the clean dataset:', len(android_clean))
print(android_clean[:2])
        

Length of the clean dataset: 9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
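
The two passes above (first `reviews_max`, then `android_clean`) could also be condensed into a single pass. The sketch below is one possible alternative, not the approach used in this notebook: `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened copies of the Slack entries plus one made-up row. It assumes Python 3.7+, where dictionaries preserve insertion order.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy data: shortened Slack rows from above plus one made-up app.
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Example App', 'TOOLS', '4.0', '12'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

On the toy data this keeps the Slack row with 51510 reviews and the single made-up row. Rows come back in order of each app's first appearance, which can differ from the order produced by the two-pass version.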


In [37]:
# Dealing with non-english apps
def english_or_not(a_string):
    
    for character in a_string:
        if ord(character) > 127:
            return False
    return True

print(english_or_not('Instagram'))
print(english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_or_not('Docs To Go™ Free Office Suite'))
print(english_or_not('Instachat 😜'))

True
False
False
False


In [31]:
print(ord('™'))
print(ord('😜'))

8482
128540


In [38]:
# define a function that treats a name as non-English only if it contains more than 3 non-ASCII characters
def is_english(a_string):
    number = 0
    for character in a_string:
        if ord(character) > 127:
            number += 1
            if number > 3:
                return False
    return True
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
        

True
True
False


In [45]:
# remove only non-English apps (keeps English apps whose names contain up to 3 non-ASCII characters)
english_apps_applst = []
for row in applst[1:]:
    name = row[1]
    if is_english(name):
        english_apps_applst.append(row)
print("Number of English Apps in Apple Store:", len(english_apps_applst))
print('\n')

english_apps_ggplay = []
for row in android_clean:
    name = row[0]
    if is_english(name):
        english_apps_ggplay.append(row)
print("Number of English Apps in Google Play:", len(english_apps_ggplay))


Number of English Apps in Apple Store: 6183


Number of English Apps in Google Play: 9614


## Isolate only the free apps for our analysis

In [48]:
free_apps_applst = []
for row in english_apps_applst:
    price = row[4]
    if price == '0.0':
        free_apps_applst.append(row)
print("Number of Free English Apps in Apple Store:", len(free_apps_applst))
        
free_apps_ggplay = []
for row in english_apps_ggplay:
    price = row[7]
    if price == '0':
        free_apps_ggplay.append(row)      
print("Number of Free English Apps in Google Play:", len(free_apps_ggplay))

Number of Free English Apps in Apple Store: 3222
Number of Free English Apps in Google Play: 8864


## The most popular app type

In [54]:
def freq_table(dataset, index):
    # count how many times each value appears in the given column
    frequency_table = {}
    total = 0

    for row in dataset:
        total += 1
        interested_data = row[index]
        if interested_data in frequency_table:
            frequency_table[interested_data] += 1
        else:
            frequency_table[interested_data] = 1

    # convert the counts to percentages of the total number of rows
    percentage_table = {}
    for element in frequency_table:
        percentage = (frequency_table[element] / total) * 100
        percentage_table[element] = percentage
    return percentage_table


In [55]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
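
The two functions above build and sort the percentage table by hand. As a hedged alternative sketch, the standard library's `collections.Counter` can do the counting and the sorting. The function name `percentage_table` and the toy rows are assumptions made for illustration only.

```python
from collections import Counter

def percentage_table(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows: column 1 plays the role of the genre column.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, percentage in percentage_table(toy_rows, 1):
    print(value, ':', percentage)
# Games : 75.0
# Music : 25.0
```

Unlike `display_table`, this version returns the sorted pairs instead of printing them, so the caller decides how to display the result.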

In [60]:
# display_table prints its rows and returns None, so call it directly instead of wrapping it in print()
display_table(free_apps_applst, 11)
print('\n')
display_table(free_apps_ggplay, 9)
print('\n')
display_table(free_apps_ggplay, 1)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.316

## What genres are the most popular
### Calculate the average number of user ratings (App Store) or installs (Google Play) for each app genre

In [81]:
# generate a frequency table for the prime_genre column, then average the number of user ratings (rating_count_tot) per genre
genre_freq = freq_table(free_apps_applst, 11)
genre_avg_rating = {}
for genre in genre_freq:
    total = 0
    len_genre = 0
    for app in free_apps_applst:
        genre_app = app[11]
        if genre_app == genre:
            ratings_n = float(app[5])
            total = total + ratings_n
            len_genre += 1
    avg_rating = total / len_genre
    genre_avg_rating[genre] = avg_rating
    #print(genre, ':', avg_rating)
print(genre_avg_rating)  
    

{'News': 21248.023255813954, 'Reference': 74942.11111111111, 'Business': 7491.117647058823, 'Health & Fitness': 23298.015384615384, 'Weather': 52279.892857142855, 'Photo & Video': 28441.54375, 'Lifestyle': 16485.764705882353, 'Social Networking': 71548.34905660378, 'Shopping': 26919.690476190477, 'Productivity': 21028.410714285714, 'Medical': 612.0, 'Education': 7003.983050847458, 'Finance': 31467.944444444445, 'Navigation': 86090.33333333333, 'Food & Drink': 33333.92307692308, 'Games': 22788.6696905016, 'Sports': 23008.898550724636, 'Book': 39758.5, 'Entertainment': 14029.830708661417, 'Catalogs': 4004.0, 'Music': 57326.530303030304, 'Travel': 28243.8, 'Utilities': 18684.456790123455}


In [85]:
# sort the genres by their average number of user ratings
table_display = []
for key in genre_avg_rating:
    key_val_as_tuple = (genre_avg_rating[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])
        


Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


In [89]:
for app in free_apps_applst:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [90]:
for app in free_apps_applst:
    if app[11] == 'News':
        print(app[1], ':', app[5]) # print name and number of ratings

Twitter : 354058
Fox News : 132703
CNN: Breaking US & World News, Live Video : 112886
Reddit Official App: All That's Trending and Viral : 67560
USA TODAY : 61724
ABC News - US & World News + Live Video : 48407
NBC News : 32881
HuffPost - News, Politics & Entertainment : 29107
The Washington Post Classic : 18572
WIRED Magazine : 12074
CBS News - Watch Free Live Breaking News : 11691
The Guardian : 8176
AOL: News, Email, Weather & Video : 5233
SmartNews - Trending News & Stories : 4645
MSNBC : 3692
LotteryHUB : 2417
theSkimm : 1765
Quartz • News in a whole new way : 1267
Lotto Results - Mega Millions Powerball Lottery : 794
TopBuzz: Best Viral Videos, GIFs, TV & News : 692
Ticket Scanner for Powerball & MegaMillions Pool : 581
FOCUS Online - Aktuelle Nachrichten : 373
SPIEGEL ONLINE - Nachrichten : 299
n-tv Nachrichten : 273
CNN Politics : 254
Tagesschau : 233
Fresco — Be a part of the news : 219
News Break - Local & World Breaking News & Radio : 173
OPM Alert : 172
franceinfo - l'actua

In [88]:
# do the same for the number of installs of Google Play apps
frequency_category = freq_table(free_apps_ggplay, 1)
print(frequency_category)

{'ENTERTAINMENT': 0.9589350180505415, 'COMMUNICATION': 3.2378158844765346, 'EVENTS': 0.7107400722021661, 'SOCIAL': 2.6624548736462095, 'FOOD_AND_DRINK': 1.2409747292418771, 'PERSONALIZATION': 3.3167870036101084, 'GAME': 9.724729241877256, 'SHOPPING': 2.2450361010830324, 'BOOKS_AND_REFERENCE': 2.1435018050541514, 'SPORTS': 3.395758122743682, 'DATING': 1.861462093862816, 'WEATHER': 0.8009927797833934, 'PRODUCTIVITY': 3.892148014440433, 'COMICS': 0.6204873646209386, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'NEWS_AND_MAGAZINES': 2.7978339350180503, 'MAPS_AND_NAVIGATION': 1.3989169675090252, 'PARENTING': 0.6543321299638989, 'VIDEO_PLAYERS': 1.7937725631768955, 'LIBRARIES_AND_DEMO': 0.9363718411552346, 'FAMILY': 18.907942238267147, 'ART_AND_DESIGN': 0.6430505415162455, 'AUTO_AND_VEHICLES': 0.9250902527075812, 'BUSINESS': 4.591606498194946, 'FINANCE': 3.7003610108303246, 'BEAUTY': 0.5979241877256317, 'PHOTOGRAPHY': 2.944494584837545, 'LIFESTYLE': 3.9034296028880866, 'MEDICAL': 3.531137184115

In [97]:
avg_install_category = {}
for category in frequency_category:
    total = 0
    len_category = 0
    for app in free_apps_ggplay:
        category_app = app[1]
        if category_app == category:
            n_install = app[5]
            n_install = n_install.replace('+', '')
            n_install = n_install.replace(',', '')
            n_install = float(n_install)
            total = total + n_install
            len_category += 1
    avg_catg_install = total / len_category
    avg_install_category[category] = avg_catg_install
    #print(category, ':', avg_catg_install)
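
The 'Installs' column holds open-ended buckets such as '10,000+' rather than exact counts, so the averages computed above are lower bounds. The sketch below separates the string cleaning into a helper and computes the per-category averages in one pass over the rows. `parse_installs` and `average_installs_by_category` are hypothetical names, and the toy rows are made up for illustration.

```python
from collections import defaultdict

def parse_installs(text):
    """Convert an install bucket such as '10,000+' to its integer lower bound."""
    return int(text.replace('+', '').replace(',', ''))

def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return {category: average installs}, visiting each row only once."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

# Made-up rows with the category at index 1 and installs at index 5.
toy_rows = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'GAME', '', '', '', '500+'],
]
print(parse_installs('1,000,000,000+'))        # 1000000000
print(average_installs_by_category(toy_rows))  # {'TOOLS': 2000.0, 'GAME': 500.0}
```

The single pass avoids rescanning the whole data set once per category, which the nested loops above do.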

In [98]:
table_display = []
for key in avg_install_category:
    key_val_as_tuple = (avg_install_category[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

In [99]:
for app in free_apps_ggplay:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess