# User App Analysis

Our company only builds apps that are free to download and install, and our main source of revenue is in-app ads. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

Two existing data sets are available for Android and iOS apps.  
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately 10,000 Android apps from Google Play.  The data was collected in August 2018.
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately 7,000 iOS apps from the App Store.  The data was collected in July 2017.

Useful columns from App Store data:
- track_name
- price (only interested in free apps)
- prime_genre
- rating_count_tot (how many ratings?)

Useful columns from Play Store data:
- App
- Category
- Installs
- Price (only the free apps)
- Genres (is this the same as Category?)

In [1]:
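from csv import reader

# Open with an explicit encoding so non-ASCII app names are read consistently
opened_file_apple = open('AppleStore.csv', encoding='utf8')
opened_file_google = open('googleplaystore.csv', encoding='utf8')
read_file_apple = reader(opened_file_apple)
read_file_google = reader(opened_file_google)
apple_apps_data = list(read_file_apple)
google_apps_data = list(read_file_google)
# The rows are now in memory, so the files can be closed
opened_file_apple.close()
opened_file_google.close()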

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(apple_apps_data, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
explore_data(google_apps_data, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
print(google_apps_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
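# This row has only 12 values (its 'Category' is missing), so its columns are shifted.
# Run this cell only once, or a valid row will be deleted.
del google_apps_data[10473]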
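
The malformed row above can also be located programmatically instead of by index. The following is a minimal sketch on made-up toy rows; `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and it assumes the first row of the data set is the header.

```python
# Toy stand-in for the Play Store data: a header plus three rows.
# The last row is missing its 'Category' value, like the row printed above.
sample = [
    ['App', 'Category', 'Rating'],
    ['Example App A', 'BUSINESS', '4.4'],
    ['Example App B', 'TOOLS', '4.0'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

bad_rows = find_malformed_rows(sample)
print(bad_rows)  # [3]
```

Checking the row length against the header catches shifted columns like this one, but it would not catch a row that has the right number of values in the wrong places.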

To check for duplicate entries, we loop through the Play Store data set, build lists of unique and duplicate app names, and count the duplicates.

In [7]:
duplicate_apps = []
unique_apps = []
for row in google_apps_data[1:]:
    appname = row[0]
    if appname in unique_apps:
        duplicate_apps.append(appname)
    else:
        unique_apps.append(appname)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


Here are the entries for one duplicated app, `'Slack'`:

In [8]:
for app in google_apps_data[1:]:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


We will not delete duplicates at random. Among the Slack entries, the only value that differs is the number of reviews (the "Reviews" column). A higher review count suggests a more recent record, so for each app we keep the entry with the most reviews.

We will loop through the data and create a dictionary that holds each app name and the highest number of reviews recorded for that app. Since 1,181 of the 10,840 remaining data rows are duplicates, we expect 9,659 entries.
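
The expected count can be checked with a little arithmetic. This sketch only uses numbers already printed above; the variable names are introduced here for illustration.

```python
total_rows = 10842        # reported by explore_data, header included
header_rows = 1
malformed_rows = 1        # the row deleted in cell 6
duplicate_entries = 1181  # counted in cell 7

expected_unique = total_rows - header_rows - malformed_rows - duplicate_entries
print(expected_unique)  # 9659
```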

In [9]:
reviews_max = {}
for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [10]:
print(len(reviews_max))

9659


To remove the duplicates, we create a new list and loop through every row of the Play Store data set. A row is added to `android_clean` only if its review count equals the maximum recorded for that app in `reviews_max` and the app's name is not yet in `already_added`. After adding a row we append its name to `already_added`, so that when several entries share the same maximum review count only the first is kept.

Check that the length of the cleaned data set is as expected at 9,659 rows.

In [11]:
android_clean = []
already_added = []
for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

In [12]:
print(len(android_clean))

9659
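
As an aside, the two passes above (build `reviews_max`, then filter) could probably be folded into one. The sketch below is an alternative, not the method used in this notebook: `keep_most_reviewed` is a hypothetical helper, and the toy rows are shortened to four columns, with the three Slack rows taken from the output above and one made-up app.

```python
# Toy rows in the Play Store column order: name at index 0, reviews at index 3.
rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Example App', 'TOOLS', '4.0', '12'],  # made-up row
]

def keep_most_reviewed(dataset):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[0]
        n_reviews = float(row[3])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][3]):
            best[name] = row
    return list(best.values())

deduped = keep_most_reviewed(rows)
print(len(deduped))   # 2
print(deduped[0][3])  # 51510
```

One difference to keep in mind: the result is ordered by each app's first appearance, whereas `android_clean` is ordered by the position of the kept row, so the two lists can hold the same rows in a different order.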


Some apps have names that are not written with English characters. We want to filter these out, as our company's goal is to make English-language apps. We create a function that takes a string and returns `False` if it contains more than 3 non-ASCII characters. Allowing up to 3 keeps English names that include a symbol such as ™ or an emoji.

In [13]:
def is_english(string):
    notallowed = 0
    for character in string:
        if ord(character) > 127:
            notallowed += 1
            if notallowed > 3:
                return False
    return True

In [14]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


The filter is not perfect, but it lets us remove obviously non-English app names automatically without sacrificing too much of the data. We will use the `is_english` function to build a new list for each data set (`android_english` and `ios_english`).
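
To make the imperfection concrete, here is a small sketch of the boundary cases. It repeats the same rule under the same name so that it runs on its own; the test strings are chosen for illustration and are not rows from the data sets.

```python
def is_english(string):
    """Same rule as above: allow at most 3 characters outside the ASCII range."""
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            if non_ascii > 3:
                return False
    return True

# False positive: a short non-English name stays under the threshold
print(is_english('爱奇艺'))              # True
# False negative: an English name with four emoji goes over it
print(is_english('Instachat 😜😜😜😜'))  # False
```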

In [15]:
android_english = []
ios_english = []

for row in android_clean:
    name = row[0]
    if is_english(name):
        android_english.append(row)

for row in apple_apps_data[1:]:
    name = row[1]
    if is_english(name):
        ios_english.append(row)
        
print('English Android Apps:', len(android_english))
print('\n')
print('English iOS Apps:', len(ios_english))

English Android Apps: 9614


English iOS Apps: 6183


Next, we need to isolate the free apps for each platform so that we can narrow down the field for our analysis. For each data set, we loop through the data and create a new list containing only the free apps.

In [16]:
android_free = []
ios_free = []

for row in android_english:
    price = row[7]
    if price == '0':
        android_free.append(row)

for row in ios_english:
    price = float(row[4])
    if price == 0:
        ios_free.append(row)

In [17]:
print(android_free[0:4])
print('\n')
print(ios_free[0:4])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & 

In [18]:
print('Free, English-only Android Apps:', len(android_free))
print('\n')
print('Free, English-only iOS Apps:', len(ios_free))

Free, English-only Android Apps: 8864


Free, English-only iOS Apps: 3222


The aim of our analysis is to determine what kinds of apps are likely to attract more users, since our ad revenue depends on the number of people using our apps.

To minimize risks and overhead, the strategy for an app idea is as follows:
1. Build a minimal Android app and add it to Google Play.
2. If the app gets a good response from users, develop it further.
3. If the app is profitable after 6 months, build an iOS version and add it to the App Store.

The ultimate goal is to add apps on both Google Play *and* the App Store, so we need to find the types of apps that will be successful in *both* markets.  To do this, we will create frequency tables for the `prime_genre` column in the App Store data and the `Genres` and `Category` columns of the Play Store data.

In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
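
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. This is an alternative under the assumption that rows are lists indexed by column position, as above; `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Toy rows with the genre at index 0
sample = [['Games'], ['Games'], ['Games'], ['Weather']]
table = freq_table_counter(sample, 0)
# Sort by percentage, highest first, as display_table does
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', pct)
```

Sorting with a `key` function avoids building the intermediate list of `(percentage, name)` tuples.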

In [20]:
display_table(ios_free, 11) 

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [21]:
display_table(android_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [22]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

For the iOS apps, Games dominate with more than half of the app share (58%), and Entertainment comes second (~8%). There are significantly more apps geared toward fun than toward productivity. However, a large number of game apps does not mean that games have the most users: other genres could have many users concentrated in a small number of apps.

For the Play Store apps, the top category is Family (19%), followed by Game (10%). The Genres column is subdivided much more finely, with Tools and Entertainment taking the top spots (8% and 6%, respectively). As the Category column is more general, we'll work with that from now on.

In order to get a better idea of the more popular types of apps, it would be helpful to examine the number of users or installs.

For the Google Play data set, we can find information about the number of installs in the Installs column. Unfortunately, the App Store data set does not have this information. However, we can use the number of user ratings (`rating_count_tot`) as a proxy.

Below we calculate the average number of user ratings per app in each App Store genre:

In [23]:
genres_ios = freq_table(ios_free, 11)
for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            num_ratings = float(app[5])
            total += num_ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, ':', avg_ratings)

Photo & Video : 28441.54375
Productivity : 21028.410714285714
Book : 39758.5
News : 21248.023255813954
Medical : 612.0
Music : 57326.530303030304
Games : 22788.6696905016
Sports : 23008.898550724636
Finance : 31467.944444444445
Utilities : 18684.456790123455
Food & Drink : 33333.92307692308
Health & Fitness : 23298.015384615384
Reference : 74942.11111111111
Navigation : 86090.33333333333
Social Networking : 71548.34905660378
Catalogs : 4004.0
Lifestyle : 16485.764705882353
Shopping : 26919.690476190477
Entertainment : 14029.830708661417
Education : 7003.983050847458
Business : 7491.117647058823
Weather : 52279.892857142855
Travel : 28243.8
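
The nested loop above rescans `ios_free` once per genre. A single pass that accumulates totals and counts per genre should give the same averages. The following is a minimal sketch with made-up numbers; `average_by_group` is a hypothetical helper that is not part of the notebook.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} using a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: [genre, rating count]; the numbers are made up for illustration
sample = [['Weather', '100'], ['Weather', '300'], ['Medical', '50']]
averages = average_by_group(sample, 0, 1)
# Print the genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

On the real data the call would presumably be `average_by_group(ios_free, 11, 5)`, and sorting the result makes the top genres easier to read than the unsorted output above.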


The genres with the highest average number of ratings are Navigation (86,090), Reference (74,942), Social Networking (71,548), Music (57,327), and Weather (52,280). Based on these results, a practical app in one of these genres appears to fit our revenue model best.

The Installs column in the Play Store data set contains open-ended values (100+, 500+, etc.) instead of exact numbers. We don't need precise figures to find which categories attract the most users, so we treat each value as its lower bound. To compute averages, however, we need to remove the non-numeric characters (`+` and `,`) and convert each string to a float.
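
The string cleanup can be sketched as a small helper. `parse_installs` is a hypothetical name introduced here; the sample values are taken from the Installs column shown earlier.

```python
def parse_installs(value):
    """Convert a string such as '10,000+' to a float (the lower bound)."""
    return float(value.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))     # 10000.0
print(parse_installs('5,000,000+'))  # 5000000.0
```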

In [24]:
android_freq = freq_table(android_free, 1)

In [25]:
for category in android_freq:
    total = 0
    len_category = 0
    for row in android_free:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)

FOOD_AND_DRINK : 1924897.7363636363
BOOKS_AND_REFERENCE : 8767811.894736841
EDUCATION : 1833495.145631068
MEDICAL : 120550.61980830671
BUSINESS : 1712290.1474201474
EVENTS : 253542.22222222222
ART_AND_DESIGN : 1986335.0877192982
PRODUCTIVITY : 16787331.344927534
MAPS_AND_NAVIGATION : 4056941.7741935486
TOOLS : 10801391.298666667
VIDEO_PLAYERS : 24727872.452830188
LIFESTYLE : 1437816.2687861272
SOCIAL : 23253652.127118643
LIBRARIES_AND_DEMO : 638503.734939759
TRAVEL_AND_LOCAL : 13984077.710144928
FAMILY : 3695641.8198090694
GAME : 15588015.603248259
SPORTS : 3638640.1428571427
NEWS_AND_MAGAZINES : 9549178.467741935
COMMUNICATION : 38456119.167247385
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
DATING : 854028.8303030303
HOUSE_AND_HOME : 1331540.5616438356
PHOTOGRAPHY : 17840110.40229885
FINANCE : 1387692.475609756
COMICS : 817657.2727272727
BEAUTY : 513151.88679245283
AUTO_AND_VEHICLES : 647317.8170731707
PERSONALIZATION : 5201482.6122448975
SHOPPING : 7036877.311557789
HEA

For the Play Store data set, the categories with the highest average number of installs are as follows:
- Communication (~38,000,000)
- Video players (~25,000,000)
- Social (~23,000,000)
- Photography (~18,000,000)
- Productivity (~17,000,000)

Based on this information, along with that from the App Store, our recommendation is to develop an app in the social media category (Social on Google Play, Social Networking on the App Store), which ranks third on both lists.

The solution notebook includes a lot of additional analysis: it breaks down the most popular apps within each category to see whether they account for the high averages, which can be a reason to look at other categories instead. I would never have thought to do that.
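
A rough idea of what such a check could look like is sketched below. This is a guess at the general approach, not a reproduction of the solution notebook: `average_installs`, the `cap` parameter, and the install counts are all made up for illustration.

```python
def average_installs(install_counts, cap=None):
    """Mean of install counts, optionally ignoring apps at or above `cap`."""
    kept = [n for n in install_counts if cap is None or n < cap]
    return sum(kept) / len(kept) if kept else 0.0

# Made-up install counts for one category: one giant app and three small ones
toy_category = [1_000_000_000, 100_000, 50_000, 10_000]

print(average_installs(toy_category))                   # 250040000.0
print(average_installs(toy_category, cap=100_000_000))  # roughly 53,333
```

A large gap between the two averages would suggest that a few dominant apps drive the category's figure, so the category may be less attractive than its raw average implies.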