# Profitable App Profiles for the App Store and Google Play Markets

This project is requested by a company that builds Android and iOS mobile apps that are free to download and install. The business strategy of the company is solely to make money by in-app ads which means that revenue is mostly influenced by the number of users who use our app.

The goal of this project is to analyze data to help our developers understand what type of apps are likely to attract more users. 

## Opening and Exploring the Data

First, we will use a function to output and observe the opened data.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
from csv import reader

with open('AppleStore.csv') as f:
    csv_reader = reader(f)
    a_lines = list(csv_reader)
    
explore_data(a_lines, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


For Apple, the columns of interest would be 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. For context, the documentation regarding the columns are found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
with open('googleplaystore.csv') as f:
    csv_reader = reader(f)
    g_lines = list(csv_reader)
    
explore_data(g_lines, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


For Google, the columns of interest would be 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'. For context, the documentation regarding the columns are found [here](https://www.kaggle.com/lava18/google-play-store-apps).

## Cleaning Data
From here, we look to clean any anomalies in the datasets. 

### Removing Innacurate data:

Regarding innacurate data, one issue of which is noted in the Google Play Store's [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).

In [5]:
header_length = len(g_lines[0])
for row in g_lines:
    if len(row) != header_length:
        print(row)
        print(g_lines.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


In [6]:
del g_lines[10473] # Result of above length check

### Removing Duplicates:

Another thing to check for is duplicate data. This can be checked for primarily using the name. An example would be 'Instagram' which you can see an example of below.

In [7]:
for row in g_lines:
    name = row[0]
    if row[0] == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Since this is confirmed above, we will see how extensive the problem is. Below the count of duplicates will be shown:

In [8]:
apps = {}
duplicate_count = 0

for app in g_lines:
    name = app[0]
    if name in apps:
        if apps[name] < app[3]:
            apps[name] = app[3]
        duplicate_count += 1
    else:
        apps[name] = app[3]
        
print('The number of duplicates in the Google Play Store dataset is ' + str(duplicate_count) + '.')

The number of duplicates in the Google Play Store dataset is 1181.


The above duplicate analysis was performed on the Apple Store and it did not result in any duplicates there. To deal with the duplicates in the Google Play Store, we will use the number of reviews as a proxy to when the row was entered. The highest number of reviews will indicate the most recent entry.

In [9]:
cleaned_apps = []

for app in g_lines: # To select non-duplicates
    name = app[0]
    if apps[name] == -1: # App is used
        continue
    elif apps[name] == app[3]: # Match
        cleaned_apps.append(app)
        apps[name] = -1

explore_data(cleaned_apps, 1, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9660
Number of columns: 13


This is the expected number of rows (including the header row).

### Removing Non-English Apps:

Aside from duplicates, we should look at removing non-english apps since the apps developed at the company are for an English-speaking audience. First, we will develop a function for it.

In [10]:
def is_english(string):
    non_ascii = 0
    for letter in string:
        if ord(letter) > 127:
            non_ascii += 1
    if non_ascii > 3: # In case of emoji or symbol
        return False
    else:
        return True

#To Test
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


Now we will clean our data sets.

In [11]:
google_apps = []
apple_apps = []

for app in cleaned_apps: #Cleaned Google Apps
    if is_english(app[0]):
        google_apps.append(app)

for app in a_lines:
    if is_english(app[1]):
        apple_apps.append(app)
        
explore_data(google_apps,1,2,True)
explore_data(apple_apps,1,2,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9615
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 6184
Number of columns: 16


### Isolating Free Apps:

Another aspect to look at is the price of the app. Since the company only builds apps that are free to download and install, the data set will have to be cleaned to remove non-free apps.

In [12]:
free_google_apps = []
free_apple_apps = []

for app in google_apps: #Cleaned Google Apps
    if app[7] == '0':
        free_google_apps.append(app)

for app in apple_apps:
    if app[4] == '0.0':
        free_apple_apps.append(app)
        
explore_data(free_google_apps,0,1,True)
explore_data(free_apple_apps,0,1,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 8862
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 3222
Number of columns: 16


## Analyzing the Data

Since the end goal is to add an app on both the Google Play and App Store, we should find an app profile that is successful on both markets.

The validation strategy will be the following:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Android will be first due to the larger audience and potential for profit since the revenue is highly influenced by the number of people using the app.

### Generating a Frequency Table:

First, we will look at the genre of the application to develop with use of a frequency table. This is the *prime_genre* for the App Store and *Genres* and *Category* for the Google Play store.

In [17]:
def freq_table(dataset, index):
    frequencies = {}
    total = len(dataset)
    
    for row in dataset:
        value = row[index]
        if value in frequencies:
            frequencies[value] += 1
        else:
            frequencies[value] = 1
    
    table_percentages = {}
    for key in frequencies:
        percentage = (frequencies[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages 

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [19]:
display_table(free_apple_apps, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The most common Apple Store application are Games at 58% and Entertainment at 8% as runner-up.

The general impression is that entertainment of some sort are more numerous in the Apple Store.

Just because there are many more games than the other genres does not mean that there are a higher number of users per app.

In [22]:
display_table(free_google_apps, 1)

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
ART_AND_DESIGN : 0.

The most common Google Play category is FAMILY at 19% with GAME at 10% as a runner up.

The category is not as dominated as with Apple Store's genre.

In [21]:
display_table(free_google_apps, 9)

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

When looking at the genre table for Google Play, you can note that practical purposes are more dominant (however not as much as entertainment in the Apple Store).

So far, the genres for both stores are not equivalent and an app profile cannot be recommended using the number of applications for a genre.

### Finding the Genre with the Most Users per App:

Since the number of apps per genre is not helpful for capturing the number of users for that genre, we will look at the average number of users pe app for each genre.

In [42]:
def avg_users(dataset, genre_index, users_index, google=False):
    user_count = {}
    app_frequency = {}
    
    for row in dataset:
        genre = row[genre_index]
        if google:
            users = int(row[users_index].replace(',','').replace('+',''))
        else:
            users = int(row[users_index])
        if genre in user_count:
            user_count[genre] += users
            app_frequency[genre] += 1
        else:
            user_count[genre] = users
            app_frequency[genre] = 1
    
    average_users = {}
    for key in user_count:
        percentage = (user_count[key] / app_frequency[key]) * 100
        average_users[key] = percentage 
    
    return average_users 

def display_table_users(dataset, genre_index, users_index, google=False):
    table = avg_users(dataset, genre_index, users_index, google)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [43]:
display_table_users(free_apple_apps, 11, 5)

Navigation : 8609033.333333332
Reference : 7494211.111111111
Social Networking : 7154834.905660378
Music : 5732653.03030303
Weather : 5227989.285714285
Book : 3975850.0
Food & Drink : 3333392.307692308
Finance : 3146794.4444444445
Photo & Video : 2844154.375
Travel : 2824380.0
Shopping : 2691969.0476190476
Health & Fitness : 2329801.5384615385
Sports : 2300889.8550724634
Games : 2278866.96905016
News : 2124802.3255813955
Productivity : 2102841.0714285714
Utilities : 1868445.6790123456
Lifestyle : 1648576.4705882354
Entertainment : 1402983.0708661417
Business : 749111.7647058823
Education : 700398.3050847457
Catalogs : 400400.0
Medical : 61200.0


In [44]:
display_table_users(free_google_apps, 1, 5, google=True)

COMMUNICATION : 3845611916.7247386
VIDEO_PLAYERS : 2472787245.2830186
SOCIAL : 2325365212.7118645
PHOTOGRAPHY : 1780562764.3678162
PRODUCTIVITY : 1678733134.4927535
GAME : 1556096559.9534342
TRAVEL_AND_LOCAL : 1398407771.0144928
ENTERTAINMENT : 1164070588.235294
TOOLS : 1068230103.3377837
NEWS_AND_MAGAZINES : 954917846.7741934
BOOKS_AND_REFERENCE : 876781189.4736841
SHOPPING : 703687731.1557789
PERSONALIZATION : 520148261.22448975
WEATHER : 507448619.7183099
HEALTH_AND_FITNESS : 418882198.5347985
MAPS_AND_NAVIGATION : 405694177.41935486
FAMILY : 369427633.4922527
SPORTS : 363864014.28571427
ART_AND_DESIGN : 198633508.77192983
FOOD_AND_DRINK : 192489773.63636363
EDUCATION : 182067307.6923077
BUSINESS : 171229014.74201474
LIFESTYLE : 143781626.87861273
FINANCE : 138769247.5609756
HOUSE_AND_HOME : 133154056.16438356
DATING : 85402883.03030303
COMICS : 81765727.27272727
AUTO_AND_VEHICLES : 64731781.70731707
LIBRARIES_AND_DEMO : 63850373.4939759
PARENTING : 54260362.06896552
BEAUTY : 513151

In [46]:
display_table_users(free_google_apps, 1, 5, google=True)

COMMUNICATION : 3845611916.7247386
VIDEO_PLAYERS : 2472787245.2830186
SOCIAL : 2325365212.7118645
PHOTOGRAPHY : 1780562764.3678162
PRODUCTIVITY : 1678733134.4927535
GAME : 1556096559.9534342
TRAVEL_AND_LOCAL : 1398407771.0144928
ENTERTAINMENT : 1164070588.235294
TOOLS : 1068230103.3377837
NEWS_AND_MAGAZINES : 954917846.7741934
BOOKS_AND_REFERENCE : 876781189.4736841
SHOPPING : 703687731.1557789
PERSONALIZATION : 520148261.22448975
WEATHER : 507448619.7183099
HEALTH_AND_FITNESS : 418882198.5347985
MAPS_AND_NAVIGATION : 405694177.41935486
FAMILY : 369427633.4922527
SPORTS : 363864014.28571427
ART_AND_DESIGN : 198633508.77192983
FOOD_AND_DRINK : 192489773.63636363
EDUCATION : 182067307.6923077
BUSINESS : 171229014.74201474
LIFESTYLE : 143781626.87861273
FINANCE : 138769247.5609756
HOUSE_AND_HOME : 133154056.16438356
DATING : 85402883.03030303
COMICS : 81765727.27272727
AUTO_AND_VEHICLES : 64731781.70731707
LIBRARIES_AND_DEMO : 63850373.4939759
PARENTING : 54260362.06896552
BEAUTY : 513151

It seems that social applications have the highest number of users on each store. However, in context there may be skewed data due to massive downloads for the most popular social apps.

I would recomm