Analyzing Mobile Data

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

In [35]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [36]:
from csv import reader

# Load each CSV into a list of rows; the first row of each list is the header.
with open("AppleStore.csv", encoding="utf8") as opened_file:
    app_data = list(reader(opened_file))

with open("googleplaystore.csv", encoding="utf8") as opened_file_2:
    app_data_2 = list(reader(opened_file_2))

In [37]:
explore_data(app_data, 1, 4)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




In [38]:
explore_data(app_data,1,2,True)
explore_data(app_data_2,1,2,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [39]:
print(app_data[0])
print(app_data_2[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [40]:
print(app_data[0][5], app_data[0][11])
print(app_data_2[0][5], app_data_2[0][9])

rating_count_tot prime_genre
Installs Genres


In [41]:
print(app_data_2[10473][1])

1.9


In [42]:
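# Row 10473 holds '1.9' in the Category column, which indicates a malformed row, so we remove it.
# Run this cell only once; running it again would delete a valid row.
del app_data_2[10473]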

When we examine the Google Play data, we find that some apps appear in more than one row.

In [43]:
for app in app_data_2:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [44]:
duplicate_apps = []
unique_apps = []

for app in app_data_2:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print("Number of duplicate apps:", len(duplicate_apps))
print("\n")
print("Examples of duplicate apps:", duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We won't remove the duplicate rows at random. When an app appears more than once, we keep the row with the highest number of reviews, because a higher review count suggests that the row was collected more recently.
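As a small illustration of this rule, the four Instagram rows printed earlier differ only in their review counts, so keeping the maximum selects a single row. The sketch below uses shortened copies of those rows (name and review count only); instagram_rows and best are illustrative names, not part of the notebook.

```python
# Shortened copies of the four Instagram rows shown earlier: (name, reviews).
instagram_rows = [
    ("Instagram", "66577313"),
    ("Instagram", "66577446"),
    ("Instagram", "66577313"),
    ("Instagram", "66509917"),
]

# Keep the row whose review count is largest.
best = max(instagram_rows, key=lambda row: int(row[1]))
print(best)
```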

In [45]:
reviews_max = {}

for app in app_data_2[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

len(reviews_max)

9659

In [46]:
android_clean = []
already_added = []

for app in app_data_2[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
len(android_clean)

9659

We have now removed the duplicate entries from the Android data, which took two steps. First, we found the highest review count for each app and stored it in a dictionary keyed by app name. Second, we used those maximums to decide which rows to add to a new list, android_clean. In the second step we also tracked the names already added, because several rows can share both the same name and the same maximum review count, and only one of them should be kept.
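The two steps can also be folded into a single pass. The following is a minimal sketch, assuming (as in the Google Play data) that the app name is in column 0 and the review count in column 3. The helper dedupe_by_reviews and the sample rows are hypothetical and are not part of the notebook. One difference to be aware of: the result is ordered by each app's first appearance, not by the position of the row that was kept.

```python
def dedupe_by_reviews(rows, name_idx=0, reviews_idx=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row encountered when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: name, category, rating, reviews.
sample = [
    ["A", "CAT", "4.0", "10"],
    ["B", "CAT", "4.1", "5"],
    ["A", "CAT", "4.0", "12"],
    ["A", "CAT", "4.0", "12"],
]
deduped = dedupe_by_reviews(sample)
print(deduped)
```

Storing whole rows in the dictionary avoids the separate already_added list, whose membership check becomes slower as the list grows.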

In [47]:
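def eng_char_check(word):
    # Count the characters outside the ASCII range (code points above 127).
    count = 0
    for character in word:
        if ord(character) > 127:
            count += 1
    # Allow up to three non-ASCII characters so that names containing
    # an emoji or a symbol such as ™ are still treated as English.
    return count <= 3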
    
print(eng_char_check('Instagram'))
print(eng_char_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_char_check('Docs To Go™ Free Office Suite'))
print(eng_char_check('Instachat 😜'))

True
False
True
True


In [48]:
android_english = []
IOS_english = []

for row in android_clean:
    name = row[0]
    if eng_char_check(name):
        android_english.append(row)

for row in app_data[1:]:
    name = row[1]
    if eng_char_check(name):
        IOS_english.append(row)

In [49]:
print(len(android_english))
print(len(IOS_english))

9614
6183


In [50]:
android_free = []
IOS_free = []

for row in android_english:
    price = row[7]
    if price == "0":
        android_free.append(row)

for row in IOS_english:
    price = row[4]
    if price == "0.0":
        IOS_free.append(row)

print(len(android_free))
print(len(IOS_free))

8864
3222


Our business plan is to launch Android and iOS apps that attract as many users as possible, since our revenue depends largely on the number of people using our apps. We have a three-step rollout plan:

1. Build a minimal Android version of the app, and add it to the Google Play Store.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.


In [51]:
print(android_free[0][5], android_free[0][12])
print(IOS_free[0][5], IOS_free[0][9])

10,000+ 4.0.3 and up
2974676 95.0


In [52]:
def freq_table(dataset, index):
    freq = {}
    total = 0
    for row in dataset:
        total += 1
        key = row[index]
        if key in freq:
            freq[key] += 1
        else:
            freq[key] = 1
    freq_percentages = {}
    for key in freq:
        percentage = (freq[key]/total) * 100
        freq_percentages[key] = percentage
    return freq_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


In [53]:
# display_table(IOS_free, -5)      # prime_genre column
# display_table(android_free, 1)   # Category column
display_table(android_free, -4)    # Genres column

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [54]:
genres = {}

for row in IOS_free:
    genre = row[11]
    if genre in genres:
        genres[genre] += 1
    else:
        genres[genre] = 1
    
for value in genres:
    total = 0
    len_genre = 0
    for row in IOS_free:
        genre_app = row[11]
        if genre_app == value:
            total += float(row[5])
            len_genre += 1
    average_num_rating = total/len_genre
    print(value,":",average_num_rating)




Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


In [67]:
Categories = {}

for row in android_free:
    category = row[1]
    if category in Categories:
        Categories[category] += 1
    else:
        Categories[category] = 1

for value in Categories:
    total = 0
    len_category = 0
    for row in android_free:
        category_app = row[1]
        if category_app == value:
            n_installs = row[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    # The Installs column is being averaged here, so name the result accordingly.
    average_n_installs = total/len_category
    print(value,":",average_n_installs)


ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_