# Profitable App Profiles
---
Analyze the profitability of app profiles from the Google Play Store and the Apple App Store, using the datasets found [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) and [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

Define a function `explore_data` to display a slice of a dataset in a readable way:

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Load the datasets:

In [2]:
from csv import reader

# Read each CSV into a list of rows; the first row is the header.
# The `with` blocks close the files once they have been read.
with open('datasets/googleplaystore.csv', encoding='utf8') as opened_file:
    playstore_datas = list(reader(opened_file))

with open('datasets/AppleStore.csv', encoding='utf8') as opened_file:
    appstore_datas = list(reader(opened_file))

print('Apple App Store Data Summary:')
explore_data(appstore_datas, 0, 3, True)
print('\n')
print('Google Play Store Data Summary:')
explore_data(playstore_datas, 0, 3, True)

Apple App Store Data Summary:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


Google Play Store Data Summary:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_

### Remove Erroneous Data
From [this forum discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) we know that the Google Play Store data has an erroneous row at index `10472` (not counting the header). Because `playstore_datas` still includes the header row, that row sits at index `10473`:

In [3]:
print(playstore_datas[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


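The row has only 12 values against the header's 13: the Category value is missing, so every later value is shifted one column to the left (the rating `1.9` sits in the Category position). We delete the row to avoid errors later. After running the cell once, comment out the `del` line so that re-running it doesn't delete another row: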

In [4]:
del playstore_datas[10473]
print(playstore_datas[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [5]:
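# Look for other rows with an empty string at index 8,
# the position where the malformed row held its empty value.
empty_value_rows = []
for element in playstore_datas[1:]:
    value = element[8]
    if value == '':
        empty_value_rows.append(element)
print(empty_value_rows)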


[]


The result is an empty list, so the deleted row was the only one with an empty value at index 8. No further rows need to be removed here.
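
A more general way to catch rows like the one deleted above is to compare each row's length with the header's. The sketch below uses a small made-up sample rather than the real file; `find_malformed_rows` and the sample rows are illustrative names, not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    header_length = len(dataset[0])
    malformed = []
    for index, row in enumerate(dataset):
        if len(row) != header_length:
            malformed.append((index, row))
    return malformed


# Hypothetical three-column sample: the last row is missing its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Applied to `playstore_datas`, a check like this would be expected to report the shifted row without relying on the forum discussion, since that row has 12 values instead of 13.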

### Remove Duplicates
Again, from the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) we know that some apps appear in more than one row. For example, Instagram:

In [6]:
for app in playstore_datas[1:]:
    if(app[0] == 'Instagram'):
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In fact there are many other duplicate rows:

In [7]:
duplicate_apps = []
unique_apps = []
for app in playstore_datas:
    app_name = app[0]
    if(app_name in unique_apps):
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print(len(duplicate_apps))
print(len(unique_apps))

1181
9660


However, the fourth column, the number of reviews, differs between the duplicate rows, which suggests the rows were collected at different times. We therefore keep only the row with the highest number of reviews for each app, on the assumption that it is the most recent observation, and delete the others. (The counts above include the header row, so there are 9,659 unique app names.) To do this we:
- Create a dictionary called `reviews_max` that maps each app name to its highest number of reviews
- Create an empty list called `playstore_datas_clean`
- Fill the list with unique rows from `playstore_datas` by:
    - looping through the original list
    - comparing the current `n_reviews` with the value stored under the app name in `reviews_max`
    - appending the row if the two match, and recording the name in an `already_added` list so it is not added twice

In [8]:
reviews_max = {}
for app in playstore_datas[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if(name in reviews_max and reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    if(name not in reviews_max):
        reviews_max[name] = n_reviews       

In [9]:
len(reviews_max)

9659

In [10]:
playstore_datas_clean = []
already_added = []

for app in playstore_datas[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        playstore_datas_clean.append(app)
        already_added.append(name) # make sure this is inside the if block

In [11]:
len(playstore_datas_clean)

9659

This gives us the cleaned dataset, `playstore_datas_clean`.
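
As an alternative sketch, the same de-duplication can be done in a single pass by storing the best row per app name in a dictionary, which avoids scanning a growing `already_added` list on every iteration. `keep_most_reviewed` and the sample rows are hypothetical, and the result may be ordered differently from `playstore_datas_clean`: rows come out in order of each name's first appearance.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Hypothetical rows shaped like the Play Store data (reviews at index 3).
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Other App', 'TOOLS', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

deduplicated = keep_most_reviewed(sample)
print(len(deduplicated))
```

The header row would need to be skipped before calling this on the real list, for example with `keep_most_reviewed(playstore_datas[1:])`.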

### Remove Non-English Apps
Because we only want to analyze apps aimed at an English-speaking audience, we have to remove entries such as these:

In [12]:
print(appstore_datas[814][1])
print(appstore_datas[6732][1])
print('\n')
print(playstore_datas_clean[4412][0])
print(playstore_datas_clean[7940][0])


爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


Below we define a simple function called `is_english` that flags strings containing non-ASCII characters (code points above 127). To keep it simple, we use a threshold of 3: an app name counts as English if it contains at most 3 such characters, so names with an emoji or a trademark sign are not discarded.

In [13]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

#check the function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
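
To see why a threshold is needed at all, it helps to count the non-ASCII characters in a few names. The helper `count_non_ascii` below is a hypothetical addition for illustration; it applies the same `ord(character) > 127` test as `is_english`.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', count_non_ascii(name))

# Individual code points: a plain letter versus the trademark sign.
print(ord('a'), ord('™'))
```

Each of the last two names contains a single non-ASCII character, so a rule that rejected any non-ASCII character would wrongly discard them.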


In [14]:
playstore_datas_clean_en = []
appstore_datas_clean_en = []

for app in playstore_datas_clean:
    name = app[0]
    if is_english(name):
        playstore_datas_clean_en.append(app)

for app in appstore_datas[1:]:
    name = app[1]
    if is_english(name):
        appstore_datas_clean_en.append(app)

In [15]:
print(len(playstore_datas_clean_en))
print(len(appstore_datas_clean_en))

9614
6183


### Isolate Free Apps
The last step of our data cleaning is isolating the free apps into separate lists:


In [16]:
playstore_datas_free = []
appstore_datas_free = []

for app in playstore_datas_clean_en:
    price = app[7]
    if price == '0':
        playstore_datas_free.append(app)

for app in appstore_datas_clean_en:
    price = float(app[4])
    if price == 0:
        appstore_datas_free.append(app)
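
The two stores encode a zero price differently in the rows shown earlier: `'0'` in the Play Store data and `'0.0'` in the App Store data. A minimal sketch of one shared check follows. It assumes that paid Play Store prices carry a leading `$` (for example `'$4.99'`), which the outputs above do not show, and `is_free` is a hypothetical helper rather than part of the notebook.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes an optional leading '$', e.g. '$4.99' (an assumption about
    how paid Play Store prices are written).
    """
    return float(price_text.strip().lstrip('$')) == 0.0


for price in ['0', '0.0', '$4.99']:
    print(repr(price), is_free(price))
```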

In [17]:
explore_data(playstore_datas_free, 0, 5, True)
print('\n')
explore_data(appstore_datas_free, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', 

## The Analysis
---
Suppose a company wants to develop an app that will be published on both the Google Play Store and the Apple App Store. We need to find app profiles that are successful in both markets. To minimize risks and overhead, the validation strategy for an app idea comprises three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

In [18]:
def freq_table(dataset, index):
    # Count how often each value appears in the given column
    table = {}
    for element in dataset:
        value = element[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    # Convert the counts to percentages of the total number of rows
    for key in table:
        table[key] /= len(dataset)
        table[key] *= 100
    return table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#### App Store Freq Table for Prime Genre
---

In [23]:
display_table(appstore_datas_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the frequency table above we can see that free English apps in the App Store are dominated by the Games genre (58.16%), followed by Entertainment (7.88%) and Photo & Video (4.97%). The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: demand might not match supply.
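
To put a rough number on that impression, the percentages from the table can be summed per group. The grouping below simply follows the genres named in the paragraph above and leaves out everything covered by "etc.", so it is an illustration rather than a definitive classification.

```python
# Percentages copied from the App Store frequency table above.
fun_genres = {
    'Games': 58.16263190564867,
    'Entertainment': 7.883302296710118,
    'Photo & Video': 4.9658597144630665,
    'Social Networking': 3.2898820608317814,
    'Sports': 2.1415270018621975,
    'Music': 2.0484171322160147,
}
practical_genres = {
    'Education': 3.662321539416512,
    'Shopping': 2.60707635009311,
    'Utilities': 2.5139664804469275,
    'Productivity': 1.7380509000620732,
    'Lifestyle': 1.5828677839851024,
}

fun_share = sum(fun_genres.values())
practical_share = sum(practical_genres.values())

print(f'Fun genres: {fun_share:.2f}%')
print(f'Practical genres: {practical_share:.2f}%')
```

With this grouping, the named fun genres cover roughly 78% of the free English apps, against roughly 12% for the named practical genres.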

#### Play Store Freq Table for Category and Genres
---


In [21]:
display_table(playstore_datas_free, 1)


FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [22]:
display_table(playstore_datas_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids. Even so, practical apps seem to be better represented on Google Play than on the App Store. This picture is also confirmed by the frequency table for the Genres column.
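
One way to look inside a single category such as `FAMILY` is to tally the Genres values of its rows. The sketch below uses made-up three-column rows, not the real data, and `genres_in_category` is a hypothetical helper; on `playstore_datas_free` the default indices 1 (Category) and 9 (Genres) would apply.

```python
from collections import Counter


def genres_in_category(dataset, category, category_index=1, genre_index=9):
    """Tally the genre values of rows that belong to `category`."""
    return Counter(
        row[genre_index] for row in dataset if row[category_index] == category
    )


# Hypothetical three-column rows (name, category, genre), not real data.
sample = [
    ['App one', 'FAMILY', 'Casual'],
    ['App two', 'FAMILY', 'Education'],
    ['App three', 'FAMILY', 'Casual'],
    ['App four', 'TOOLS', 'Tools'],
]

tally = genres_in_category(sample, 'FAMILY', category_index=1, genre_index=2)
for genre, count in tally.most_common():
    print(genre, ':', count)
```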

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we have found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of which kinds of apps have the most users.

### Find Out Most Popular Apps by Genre 
---