
# Profitable App Profiles for the App Store and Google Play Markets


## Using data collected from about 7,200 App Store apps and about 10,800 Google Play apps, we try to help developers understand what types of apps attract more users.


Links to the datasets, in case you want to explore or reuse them:

- [App Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps?select=appleStore_description.csv)
- [Google Play](https://www.kaggle.com/datasets/lava18/google-play-store-apps)

This is my first data analysis project, thanks to DataQuest.
I used simple methods to clean the data and present the results, and I picked up a few techniques along the way, such as building a _frequency table_.

Opening and Reading our Datasets

In [3]:
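from csv import reader

# Set the encoding explicitly because some app names contain non-ASCII characters;
# the with-blocks close both files once they have been read.
with open('AppleStore.csv', encoding='utf8', newline='') as open_appst:
    apps_data = list(reader(open_appst))
with open('googleplaystore.csv', encoding='utf8', newline='') as open_gplayst:
    gplay_data = list(reader(open_gplayst))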

The `explore_data()` function prints a slice of a dataset and, optionally, the number of rows and columns it contains.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
explore_data(apps_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


In [6]:
explore_data(gplay_data, 10470,10474, True)

['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10842
Number of columns: 13


In [7]:
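# Row 10473 has no 'Category' value, so its remaining fields are shifted one column left.
# Run this cell only once: running it again would delete a valid row.
del gplay_data[10473]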
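
The row removed above has fewer fields than the header, which is why its values look shifted. One way to spot such rows is to compare the length of every row with the length of the header. The sketch below does this on a small made-up sample; `find_malformed_rows` and the sample rows are hypothetical names used only for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    header_len = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != header_len]


# Made-up sample: the last row is missing its 'Category' field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Photo Frame', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print('Malformed row at index', index, ':', row)
```

Running the same kind of check on `gplay_data` before deleting anything would be one way to confirm which index is affected.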

Collected data often contains duplicate entries, so we check whether any app appears more than once in the Google Play data.

In [8]:
duplicate_apps = []
unique_apps = []
for app in gplay_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicated apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicated apps:', duplicate_apps[:20])

Number of duplicated apps: 1181


Examples of duplicated apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


There are several ways to handle duplicates. Here I keep, for each app, the row with the highest number of reviews, on the assumption that a higher review count means the row was collected more recently.

In [9]:
print('Expected num of apps after cleaning:', len(gplay_data[1:]) - 1181)
reviews_max = {}
for row in gplay_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print('Our dictionary with maximum num of reviews contains', len(reviews_max), 'apps')

Expected num of apps after cleaning: 9659
Our dictionary with maximum num of reviews contains 9659 apps


Next we build two lists: `android_clean` holds the full row for each app, and `already_added` tracks the names we have already kept, so that an app whose duplicates share the same maximum review count is added only once. Afterwards we print the length of both lists and the first few rows of the cleaned data as a check.

In [10]:
android_clean = []
already_added = []
for row in gplay_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
print('Length of clean data:', len(android_clean))
print('Length of list with each apps\' name:', len(already_added))
print('\n')
print(android_clean[:3])

Length of clean data: 9659
Length of list with each apps' name: 9659


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
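
The two passes above can also be folded into one by keeping, for each name, the best row seen so far. This is only a sketch of an alternative, not the method used in this notebook. `dedupe_by_reviews` and the sample rows (including their review counts) are made up for illustration, and the sketch assumes the name is in column 0 and the review count in column 3, as in the Google Play data.

```python
def dedupe_by_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earlier row when several rows share the maximum
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up rows in the shape [name, category, rating, reviews]
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159518'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

deduped = dedupe_by_reviews(sample_rows)
print('Rows kept:', len(deduped))
for row in deduped:
    print(row)
```

The rows come back in the order in which each name first appears, which may differ from the order produced by the two-list approach.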


In [11]:
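def isEnglish(app_name):
    """Return False if the name has more than three non-ASCII characters."""
    nonengcount = 0
    for ch in app_name:
        # Characters with a code point above 127 fall outside the ASCII range
        if ord(ch) > 127:
            nonengcount += 1
        # Allowing up to three of them keeps names that contain emoji or symbols such as ™
        if nonengcount > 3:
            return False
    return True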

In [12]:
print(isEnglish('Instagram'))
print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish('Docs To Go™ Free Office Suite'))
print(isEnglish('Instachat 😜'))

True
False
True
True


In [13]:
apple_clean = []
android_cleaner = []
for row in apps_data[1:]:
    name = row[1]
    if isEnglish(name):
        apple_clean.append(row)
for row in android_clean:
    name = row[0]
    if isEnglish(name):
        android_cleaner.append(row)
print('Number of apps before and after removing non-English apps')
print('Apple:', len(apps_data[1:]), 'apps before,', len(apple_clean), 'apps after')
print('Android:', len(android_clean), 'apps before,', len(android_cleaner), 'apps after')

Number of apps before and after removing non-English apps
Apple: 7197 apps before, 6183 apps after
Android: 9659 apps before, 9614 apps after


In [14]:
gplay_data[:2]

[['App',
  'Category',
  'Rating',
  'Reviews',
  'Size',
  'Installs',
  'Type',
  'Price',
  'Content Rating',
  'Genres',
  'Last Updated',
  'Current Ver',
  'Android Ver'],
 ['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up']]

In [15]:
apps_data[:3]

[['id',
  'track_name',
  'size_bytes',
  'currency',
  'price',
  'rating_count_tot',
  'rating_count_ver',
  'user_rating',
  'user_rating_ver',
  'ver',
  'cont_rating',
  'prime_genre',
  'sup_devices.num',
  'ipadSc_urls.num',
  'lang.num',
  'vpp_lic'],
 ['284882215',
  'Facebook',
  '389879808',
  'USD',
  '0.0',
  '2974676',
  '212',
  '3.5',
  '3.5',
  '95.0',
  '4+',
  'Social Networking',
  '37',
  '1',
  '29',
  '1'],
 ['389801252',
  'Instagram',
  '113954816',
  'USD',
  '0.0',
  '2161558',
  '1289',
  '4.5',
  '4.0',
  '10.23',
  '12+',
  'Photo & Video',
  '37',
  '0',
  '29',
  '1']]

Next we remove paid apps and keep only the free ones.
During cleaning, I found that the app _Command & Conquer: Rivals_ has a price of 0 but a `Type` of NaN instead of _Free_, so the Android filter checks the `Price` column rather than `Type`.

In [16]:
android_free = []
for row in android_cleaner:
    # Filter on Price (row[7]) rather than Type (row[6]),
    # because one free app has Type 'NaN' instead of 'Free'
    if row[7] == '0':
        android_free.append(row)
print(len(android_cleaner))
print(len(android_free))

9614
8864


In [17]:
apple_free = []
for row in apple_clean:
    if float(row[4]) == 0.0:
        apple_free.append(row)
print(len(apple_clean))
print(len(apple_free))

6183
3222


To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Why this validation strategy? From the counts above, about 92% of the Google Play apps are free, compared with only about half of the apps on the App Store. Our guess is that it is harder to publish a free app on the App Store, so it makes sense to prove that an app works on Google Play before bringing it to Apple devices.
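
The shares of free apps follow from the counts printed by the two filtering cells (8864 free apps out of 9614 on Google Play, and 3222 out of 6183 on the App Store). A small sketch of the arithmetic, where `share_free` is a helper name introduced here for illustration:

```python
def share_free(n_free, n_total):
    """Percentage of free apps, rounded to one decimal place."""
    return round(n_free / n_total * 100, 1)


# Counts taken from the outputs of the two filtering cells
google_play_share = share_free(8864, 9614)
app_store_share = share_free(3222, 6183)

print('Google Play:', google_play_share, '% free')
print('App Store:', app_store_share, '% free')
```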

In [18]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        

def freq_table(dataset, index):
    freq_dict = {}
    leng = len(dataset)
    for row in dataset:
        val = row[index]
        if val in freq_dict:
            freq_dict[val] += 1
        else:
            freq_dict[val] = 1
    for key in freq_dict:
        freq_dict[key] /= leng
        freq_dict[key] *= 100
    return freq_dict
display_table(android_free, 1)
display_table(android_free, 9)
display_table(apple_free, 11)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

1. From the `prime_genre` column of the App Store data, the most common genre is Games, which accounts for 58% of the free apps, more than all other genres combined. The next most common genre is Entertainment. This suggests that the free part of the App Store is dominated by apps made for entertainment rather than practical use.

2. On the other hand, the Google Play data contains more practical apps. From the `Category` column, Family, Tools, Business, Lifestyle and Productivity are among the most common categories (with Game in second place). The `Genres` column gives almost the same picture; the only difference is Entertainment in place of Games.
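
The percentages discussed above come from a frequency table: count how often each value appears in a column, then divide by the number of rows. As a compact illustration, the same idea can be written with `collections.Counter` from the standard library. The function name `freq_table_pct` and the sample rows are made up for this sketch.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows in the shape [name, category]
sample = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
    ['App D', 'FAMILY'],
]

table = freq_table_pct(sample, 1)
# Print the most common values first
for value, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', pct)
```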

In [19]:
apple_genres = freq_table(apple_free, 11)
genre_rat = []
for genre in apple_genres:
    total = 0
    len_genre = 0
    for row in apple_free:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    genre_rat.append([total / len_genre, genre])
genre_rat = sorted(genre_rat, reverse=True)
print(genre_rat)

[[86090.33333333333, 'Navigation'], [74942.11111111111, 'Reference'], [71548.34905660378, 'Social Networking'], [57326.530303030304, 'Music'], [52279.892857142855, 'Weather'], [39758.5, 'Book'], [33333.92307692308, 'Food & Drink'], [31467.944444444445, 'Finance'], [28441.54375, 'Photo & Video'], [28243.8, 'Travel'], [26919.690476190477, 'Shopping'], [23298.015384615384, 'Health & Fitness'], [23008.898550724636, 'Sports'], [22788.6696905016, 'Games'], [21248.023255813954, 'News'], [21028.410714285714, 'Productivity'], [18684.456790123455, 'Utilities'], [16485.764705882353, 'Lifestyle'], [14029.830708661417, 'Entertainment'], [7491.117647058823, 'Business'], [7003.983050847458, 'Education'], [4004.0, 'Catalogs'], [612.0, 'Medical']]


The 'Navigation' genre has the highest average number of user ratings per app (`rating_count_tot`), which we use here as a stand-in for installs. We think this is because people rely on these apps to find their way around cities and other places, so they install them widely.
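
The cell above rescans the whole data set once per genre. The same averages can be gathered in a single pass by keeping a running total and a count for each genre. A sketch of that idea, where `average_by_group` and the sample rows are hypothetical:

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Average of a numeric column for each distinct value of a group column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows in the shape [name, genre, rating count]
sample = [
    ['Maps X', 'Navigation', '1000'],
    ['Maps Y', 'Navigation', '3000'],
    ['Game Z', 'Games', '500'],
]

averages = average_by_group(sample, 1, 2)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, avg)
```

For the App Store data in this notebook, the equivalent call would be `average_by_group(apple_free, 11, 5)`.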

In [23]:
android_genres = freq_table(android_free, 1)
category_ins = []
for category in android_genres:
    total = 0
    len_category = 0
    for row in android_free:
        category_app = row[1]
        if category_app == category:
            inst = row[5]
            inst = inst.replace("+", "")
            inst = inst.replace(",", "")
            total += float(inst)
            len_category += 1
    #print(category, category_app, len_category)
    category_ins.append([total / len_category, category])
category_ins = sorted(category_ins, reverse=True)
print(category_ins)

[[38456119.167247385, 'COMMUNICATION'], [24727872.452830188, 'VIDEO_PLAYERS'], [23253652.127118643, 'SOCIAL'], [17840110.40229885, 'PHOTOGRAPHY'], [16787331.344927534, 'PRODUCTIVITY'], [15588015.603248259, 'GAME'], [13984077.710144928, 'TRAVEL_AND_LOCAL'], [11640705.88235294, 'ENTERTAINMENT'], [10801391.298666667, 'TOOLS'], [9549178.467741935, 'NEWS_AND_MAGAZINES'], [8767811.894736841, 'BOOKS_AND_REFERENCE'], [7036877.311557789, 'SHOPPING'], [5201482.6122448975, 'PERSONALIZATION'], [5074486.197183099, 'WEATHER'], [4188821.9853479853, 'HEALTH_AND_FITNESS'], [4056941.7741935486, 'MAPS_AND_NAVIGATION'], [3695641.8198090694, 'FAMILY'], [3638640.1428571427, 'SPORTS'], [1986335.0877192982, 'ART_AND_DESIGN'], [1924897.7363636363, 'FOOD_AND_DRINK'], [1833495.145631068, 'EDUCATION'], [1712290.1474201474, 'BUSINESS'], [1437816.2687861272, 'LIFESTYLE'], [1387692.475609756, 'FINANCE'], [1331540.5616438356, 'HOUSE_AND_HOME'], [854028.8303030303, 'DATING'], [817657.2727272727, 'COMICS'], [647317.817
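
The `Installs` values are open-ended buckets such as `'10,000+'`, so the cell above strips the `+` and `,` characters before converting each value to a number. An app listed with 10,000+ installs is therefore counted as exactly 10,000, which means the averages are lower bounds rather than exact figures. The cleaning step on its own, with `parse_installs` as a hypothetical helper name:

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float lower bound."""
    return float(text.replace('+', '').replace(',', ''))


for raw in ['10,000+', '1,000,000,000+', '0']:
    print(raw, '->', parse_installs(raw))
```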

In [25]:
for row in android_free:
    if row[1] == 'COMMUNICATION':
        print(row[0], row[5])
print(android_genres['COMMUNICATION'])

WhatsApp Messenger 1,000,000,000+
Messenger for SMS 10,000,000+
My Tele2 5,000,000+
imo beta free calls and text 100,000,000+
Contacts 50,000,000+
Call Free – Free Call 5,000,000+
Web Browser & Explorer 5,000,000+
Browser 4G 10,000,000+
MegaFon Dashboard 10,000,000+
ZenUI Dialer & Contacts 10,000,000+
Cricket Visual Voicemail 10,000,000+
TracFone My Account 1,000,000+
Xperia Link™ 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard 10,000,000+
Skype Lite - Free Video Call & Chat 5,000,000+
My magenta 1,000,000+
Android Messages 100,000,000+
Google Duo - High Quality Video Calls 500,000,000+
Seznam.cz 1,000,000+
Antillean Gold Telegram (original version) 100,000+
AT&T Visual Voicemail 10,000,000+
GMX Mail 10,000,000+
Omlet Chat 10,000,000+
My Vodacom SA 5,000,000+
Microsoft Edge 5,000,000+
Messenger – Text and Video Chat for Free 1,000,000,000+
imo free video calls and chat 500,000,000+
Calls & Text by Mo+ 5,000,000+
free video calls and chat 50,000,000+
Skype - free IM & video