### Analysis of profitable app profiles in Google Play and App Store

In this project we'll try to find app profiles that are profitable in the Google Play and App Store markets. We take the role of data analysts at a company that develops smartphone apps, and we want to give the development team insights about which kinds of apps to build.


#### Preliminary data exploration

First of all, we want to open the data and explore it, in order to ...

Thus, we open the CSV files and store the header and the data rows of each one separately. Then we can print the headers to identify the columns that are relevant for our task, and print the first rows of each dataset to get to know it better.

In [60]:
import csv

In [84]:
# Read each CSV fully into a list of rows; `with` closes the file afterwards.
with open('googleplaystore.csv', encoding='utf8') as opened_googleplay_file:
    googleplay_csv = list(csv.reader(opened_googleplay_file))

with open('AppleStore.csv', encoding='utf8') as opened_appstore_file:
    appstore_csv = list(csv.reader(opened_appstore_file))

# The first row is the header; the remaining rows are the data.
googleplay_header, googleplay_dataset = googleplay_csv[0], googleplay_csv[1:]
appstore_header, appstore_dataset = appstore_csv[0], appstore_csv[1:]

explore_dataset(googleplay_dataset, 0, 5)
explore_dataset(appstore_dataset, 0, 5)

print(googleplay_header)
print(appstore_header)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


['284882215', 'Facebook', '389879808', 'U

In [61]:
def explore_dataset(dataset, start, end):
    dataset_slice = dataset[start:end]    
    
    for row in dataset_slice:
        print(row)
    
    print('\n')
    print('Number of rows:', len(dataset))
    print('Number of columns:', len(dataset[0]))
    print('\n')
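
The header/data split described above can also be wrapped in a small helper. The following is only a sketch: `read_csv_rows` is a hypothetical name that does not appear elsewhere in this notebook, and the toy CSV is made-up data standing in for the real files.

```python
import csv
import io

def read_csv_rows(file_obj):
    """Return (header, rows) from an open CSV file-like object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Toy in-memory CSV standing in for a real file (illustrative data only).
sample = io.StringIO("App,Category\nFoo,TOOLS\nBar,GAME\n")
header, dataset = read_csv_rows(sample)
print(header)
print(len(dataset))

# With a real file, a `with` block closes the handle automatically:
# with open('googleplaystore.csv', encoding='utf8', newline='') as f:
#     googleplay_header, googleplay_dataset = read_csv_rows(f)
```

Keeping the header apart from the rows means that positions such as `dataset[0]` always refer to an app, never to the column names.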

#### Data Cleaning

Before actually analysing the data, we need to get rid of incorrect rows and of rows that are not useful for us.

As stated [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), the row at index 10472 of the Google Play data (header excluded) is not correct. Therefore we'll remove it.

Moreover, there is duplicate data in our dataset (some apps appear more than once). To keep the most recent entry for each app, we take the "reviews" column into account: we keep the row with the largest value for "reviews", since the number of reviews only grows over time.

In [85]:
del googleplay_dataset[10472]

In [86]:
print(len(googleplay_dataset))

unique_apps_dict = dict()

for row in googleplay_dataset:
    if row[0] in unique_apps_dict:
        # Compare review counts as integers; as strings, '967' > '10000'.
        current_reviews = int(unique_apps_dict[row[0]][3])
        new_version_reviews = int(row[3])
        if new_version_reviews < current_reviews:
            continue
    unique_apps_dict[row[0]] = row
    
googleplay_dataset = list(unique_apps_dict.values())
        
print(len(googleplay_dataset))

10840
9659
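
To see why the review counts should be compared as numbers, here is a minimal sketch of the same deduplication idea on made-up rows. `keep_most_reviewed` and the toy data are hypothetical and only illustrate the technique; CSV fields are read as strings, and string comparison is lexicographic.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        # Convert to int before comparing: as strings, '967' > '10000'
        # because the character '9' sorts after '1'.
        if name not in best or int(row[reviews_index]) > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows: 'App A' appears twice with different review counts.
toy_rows = [
    ['App A', 'TOOLS', '4.1', '967'],
    ['App A', 'TOOLS', '4.1', '10000'],
    ['App B', 'GAME', '4.5', '12'],
]
deduplicated = keep_most_reviewed(toy_rows)
print(deduplicated)
```

Unlike the loop above, this sketch keeps the first row when two entries have the same review count; for counting unique apps the difference does not matter.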


Another step in the data cleaning process is to filter out apps that are not in English. To detect English titles, we define the following function. It checks the title character by character and counts the characters outside the ASCII range (code point above 127). It returns `False` if there are more than 3 such characters; the tolerance keeps English titles that contain a few emoji or symbols such as ™.

In [87]:
def is_english_title(title):
    non_english_characters = 0
    for char in title:
        if ord(char) > 127:
            non_english_characters += 1
    return non_english_characters <= 3

In [88]:
googleplay_english_dataset = []
appstore_english_dataset = []

for row in googleplay_dataset:
    if is_english_title(row[0]):
        googleplay_english_dataset.append(row)

for row in appstore_dataset:
    if is_english_title(row[1]):
        appstore_english_dataset.append(row)
        
print('Google play dataset original size: {}'.format(len(googleplay_dataset)))
print('Google play dataset new size: {}'.format(len(googleplay_english_dataset)))
print('App store dataset original size: {}'.format(len(appstore_dataset)))
print('App store dataset new size: {}'.format(len(appstore_english_dataset)))

Google play dataset original size: 9659
Google play dataset new size: 9614
App store dataset original size: 7197
App store dataset new size: 6183


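The next step is to keep only the free apps. We are interested only in those, so we discard every app whose price is different from 0. Note that the two datasets store the price as text in different ways: '0' in Google Play and '0.0' in the App Store.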

In [89]:
googleplay_free_apps = []
appstore_free_apps = []

for row in googleplay_english_dataset:
    if row[7] == '0':
        googleplay_free_apps.append(row)
        
for row in appstore_english_dataset:
    if row[4] == '0.0':
        appstore_free_apps.append(row)

print('Google play only free apps dataset size: {}'.format(len(googleplay_free_apps)))
print('App store only free apps dataset size: {}'.format(len(appstore_free_apps)))

Google play only free apps dataset size: 8862
App store only free apps dataset size: 3222
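
Because the two stores write a zero price differently, the comparison could also be done numerically. The following is a sketch only: `is_free` is a hypothetical helper, and the `'$'` prefix on paid prices is an assumption about the data format rather than something shown in this notebook.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    # Assumption: paid prices may carry a leading '$', which float() cannot parse.
    return float(price.strip().lstrip('$')) == 0.0

for raw_price in ['0', '0.0', '$4.99', '2.99']:
    print(raw_price, '->', is_free(raw_price))
```

With a helper like this, the same condition could be applied to both datasets instead of comparing against two different string literals.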


#### Data Analysis
Once our data is clean, we want to get some insight from it. Our company wants to know the characteristics of the most popular apps, in order to develop similar ones. A first step is a function that builds a frequency table (as percentages) for any column of the dataset, plus a helper that prints the table sorted in descending order.


In [68]:
def frequency_table(dataset, column_index):
    frequency_dict = dict()
    for row in dataset:
        if row[column_index] not in frequency_dict:
            frequency_dict[row[column_index]] = 0
        frequency_dict[row[column_index]] += 1
    
    percentages_dict = dict()
    for key in frequency_dict:
        percentages_dict[key] = 100 * frequency_dict[key] / len(dataset)
    
    return percentages_dict

def display_table(dataset, index):
    table = frequency_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
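
As an alternative sketch, the same percentages can be obtained with `collections.Counter` from the standard library. `frequency_percentages` is a hypothetical name, and the toy rows are made up for illustration.

```python
from collections import Counter

def frequency_percentages(dataset, column_index):
    """Map each value in a column to its share of rows, most frequent first."""
    counts = Counter(row[column_index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by descending count.
    return {key: 100 * count / total for key, count in counts.most_common()}

# Made-up rows: [app name, genre].
toy_dataset = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
toy_table = frequency_percentages(toy_dataset, 1)
for genre, percentage in toy_table.items():
    print(genre, ':', percentage)
```

Since dictionaries preserve insertion order in Python 3.7 and later, the result is already sorted and needs no separate display step.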

In [69]:
display_table(appstore_free_apps, 11)
print()
display_table(googleplay_free_apps, 1)
print()
display_table(googleplay_free_apps, 9)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665

FAMILY : 18.968630106070865
GAME : 9.68178740690589
TOOLS : 8.451816745655607
BUSINESS : 4.592642744301512
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.893026404874746
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475963
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COM

##### Most popular app genres (App Store data)

Now that we can obtain frequency tables for the different columns in our dataset, we want to know which app genres are the most popular. We define the following function, which accumulates, for each genre, the total number of ratings (or installs) and the number of apps; dividing the first by the second gives the average per app. The App Store dataset has no information about the number of installs, so we use the number of ratings as a proxy.

In [97]:
def get_genre_stats_dict(dataset, column_index, ratings_installs_column_index):
    genre_stats_dict = dict()
    for row in dataset:
        if row[column_index] not in genre_stats_dict:
            genre_stats_dict[row[column_index]] = dict()
            genre_stats_dict[row[column_index]]['ratings/installs'] = 0
            genre_stats_dict[row[column_index]]['app_count'] = 0
        genre_stats_dict[row[column_index]]['ratings/installs'] += int(row[ratings_installs_column_index])
        genre_stats_dict[row[column_index]]['app_count'] += 1
    return genre_stats_dict


In [98]:
genre_ratings_and_count = get_genre_stats_dict(appstore_free_apps, 11, 5)
genre_stats = dict()
for item in genre_ratings_and_count:
    genre_stats[item] = genre_ratings_and_count[item]['ratings/installs'] / genre_ratings_and_count[item]['app_count']
genre_stats = {k: v for k, v in sorted(genre_stats.items(), key=lambda item: item[1], reverse = True)}

Finally, we display the results. Measured by the average number of ratings per app, the most popular genres are Navigation, Reference and Social Networking. However, these genres tend to be dominated by a few very popular apps that inflate the average, so they are not a good option for our "company" to develop an app. As an alternative, Food & Drink, Photo & Video or Travel apps could be a good market niche.

In [99]:
for item in genre_stats:
    print(item + ' -> ' + str(genre_stats[item]))

Navigation -> 86090.33333333333
Reference -> 74942.11111111111
Social Networking -> 71548.34905660378
Music -> 57326.530303030304
Weather -> 52279.892857142855
Book -> 39758.5
Food & Drink -> 33333.92307692308
Finance -> 31467.944444444445
Photo & Video -> 28441.54375
Travel -> 28243.8
Shopping -> 26919.690476190477
Health & Fitness -> 23298.015384615384
Sports -> 23008.898550724636
Games -> 22788.6696905016
News -> 21248.023255813954
Productivity -> 21028.410714285714
Utilities -> 18684.456790123455
Lifestyle -> 16485.764705882353
Entertainment -> 14029.830708661417
Business -> 7491.117647058823
Education -> 7003.983050847458
Catalogs -> 4004.0
Medical -> 612.0
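
One way to check whether a genre average is inflated by a few giants is to compare the mean with the median. This is a sketch on made-up numbers, not on the real dataset: the rating counts below are hypothetical.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: one giant app and a few small ones.
toy_rating_counts = [300000, 12000, 5000, 200, 30]

genre_mean = mean(toy_rating_counts)
genre_median = median(toy_rating_counts)

print('mean:', genre_mean)
print('median:', genre_median)
```

When the mean is many times larger than the median, as in this toy case, a typical app in the genre is far less popular than the average suggests.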


##### Most popular app genres (Google Play data)

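Now we repeat the process with the Google Play dataset, using the number of installs directly. The install counts are stored as text such as '10,000+', so we first strip every non-digit character and convert the result to an integer. We find a similar situation: the genre with the highest average is "Communication", but the average is heavily skewed by a few very large apps (WhatsApp, Messenger, Hangouts, ...), while most other apps in the genre are far less successful.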

In [90]:
import re

for row in googleplay_free_apps:
    row[5] = int(re.sub('[^0-9]', '', row[5]))
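
The conversion above modifies the rows in place, so running the cell a second time would pass integers to `re.sub`, which only accepts strings. A small helper can make the step safe to repeat. This is a sketch, and `parse_installs` is a hypothetical name.

```python
import re

def parse_installs(value):
    """Convert strings such as '10,000+' to int; leave integers untouched."""
    if isinstance(value, int):
        return value
    # Drop everything that is not a digit (commas, the trailing '+').
    return int(re.sub(r'[^0-9]', '', value))

for raw_installs in ['10,000+', '5,000,000+', 10000]:
    print(repr(raw_installs), '->', parse_installs(raw_installs))
```

Note that the values are lower bounds: '10,000+' means at least 10,000 installs, so the averages computed from them are approximate.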

In [100]:
genre_ratings_and_count = get_genre_stats_dict(googleplay_free_apps, 9, 5)
genre_stats = dict()
for item in genre_ratings_and_count:
    genre_stats[item] = genre_ratings_and_count[item]['ratings/installs'] / genre_ratings_and_count[item]['app_count']
genre_stats = {k: v for k, v in sorted(genre_stats.items(), key=lambda item: item[1], reverse = True)}

In [104]:
for item in genre_stats:
    print(item + ' -> ' + str(genre_stats[item]))

Communication -> 38456119.167247385
Adventure;Action & Adventure -> 35333333.333333336
Video Players & Editors -> 24947335.796178345
Social -> 23253652.127118643
Arcade -> 22888365.48780488
Casual -> 19630958.51612903
Puzzle;Action & Adventure -> 18366666.666666668
Photography -> 17805627.643678162
Educational;Action & Adventure -> 17016666.666666668
Productivity -> 16787331.344927534
Racing -> 15910645.681818182
Travel & Local -> 14051476.145631067
Casual;Action & Adventure -> 12916666.666666666
Action -> 12603588.872727273
Strategy -> 11199902.530864198
Tools -> 10683213.20053476
Lifestyle;Pretend Play -> 10000000.0
Adventure;Education -> 10000000.0
Casual;Music & Video -> 10000000.0
Tools;Education -> 10000000.0
Card;Action & Adventure -> 10000000.0
Role Playing;Brain Games -> 10000000.0
News & Magazines -> 9549178.467741935
Music -> 9445583.333333334
Educational;Pretend Play -> 9375000.0
Word -> 9094458.695652174
Puzzle;Brain Games -> 9013125.0
Racing;Action & Adventure -> 8816666.

In [108]:
for row in googleplay_free_apps:
    if row[9] == 'Communication' and row[5] > 100000000:
        print(row[0])
        

Messenger – Text and Video Chat for Free
WhatsApp Messenger
Google Chrome: Fast & Secure
Gmail
Hangouts
Viber Messenger
imo free video calls and chat
Google Duo - High Quality Video Calls
UC Browser - Fast Download Private & Secure
Skype - free IM & video calls
LINE: Free Calls & Messages
