# Profitable App Profiles for the App Store and Google Play Markets

Today, there are countless phone apps in the Google Play Store and the Apple App Store. They vary in genre, rating, price, and so forth.

In this project we will try to understand what types of apps are likely to attract more users. We focus on apps that are free and in English.

## Opening and Exploring the Data

In this section we will open and explore two data sets, one about Google Play Store apps and one about Apple App Store apps. The data sets were taken from Kaggle, and you can find the Play Store data set [here](https://www.kaggle.com/lava18/google-play-store-apps) and the App Store data set [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [17]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The function above (`explore_data`) receives a data set, a starting row, and an ending row, and prints the rows in that range. The optional `rows_and_columns` argument (`False` by default) controls whether the total number of rows and columns is printed as well.

In [22]:
from csv import reader

## Apple Store Data Set ##
open_file = open('/Catharine/DataSets/AppleStore.csv', encoding='utf8')
read_file = reader(open_file)
data_set_apple = list(read_file)
open_file.close()  # the rows are already in memory
apple_header = data_set_apple[0]
apple_data = data_set_apple[1:]

## Google Play Store Data Set ##
open_file = open('/Catharine/DataSets/googleplaystore.csv', encoding='utf8')
read_file = reader(open_file)
data_set_google = list(read_file)
open_file.close()  # the rows are already in memory
android_header = data_set_google[0]
android_data = data_set_google[1:]

The code above opens both files, Apple and Google, and extracts their data. For each file, we separate the header row from the rest of the data set.
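
Both files are loaded with the same steps, so the pattern can be wrapped in one helper. The sketch below is one possible way to do that. `load_csv` is a hypothetical name that does not appear in the notebook, and the `with` statement is used so that the file is closed once it has been read.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`.

    `load_csv` is a hypothetical helper, not part of the original notebook.
    """
    # newline='' is what the csv module documentation recommends for
    # file objects handed to csv.reader.
    with open(path, encoding=encoding, newline='') as csv_file:
        rows = list(reader(csv_file))
    return rows[0], rows[1:]
```

With the paths used above, a call would look like `apple_header, apple_data = load_csv('/Catharine/DataSets/AppleStore.csv')`.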

In [28]:
print('First few rows of the apple store file:')
print(apple_header)
print('\n')
explore_data(apple_data,0 ,3, True)
print('\n')

First few rows of the apple store file:
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17




The App Store file has information on 7197 apps. Some columns that look useful for our analysis are `track_name`, `currency`, `price`, `user_rating`, and `prime_genre`.
More information on the columns listed can be found in the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

In [27]:
print('First few rows of the google play store file:')
print(android_header)
print('\n')
explore_data(data_set_google, 0, 3, True)

First few rows of the google play store file:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


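The Google Play Store file, on the other hand, has 10,841 apps. The cell above passes `data_set_google`, which still contains the header row, so the header is printed again as the first row and the reported count of 10842 includes it. Some useful columns are `App`, `Category`, `Rating`, `Price`, and `Genres`.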

## Data Cleaning

Before actually analyzing the data, it has to be cleaned by removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis. 

### Deleting incorrect data
In this section we will start by **deleting incorrect data**. As mentioned in the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), there is an error in row 10472 of the Google Play data set.

In [33]:
print(android_header)
print('\n')
print(android_data[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


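As seen above, the row for "*Life Made WI-Fi Touchscreen Photo Frame*" has only 12 values instead of 13: the `Category` value is missing, so every later value is shifted one column to the left. As a result, the `Rating` column reads 19, which is outside the established 0 to 5 range. Therefore, it is necessary to delete this row. The code below shows how to do that.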

In [35]:
print(len(android_data))
del android_data[10472]  # don't run this more than once
print(len(android_data))

10841
10840
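
The row printed above has 12 values while the header has 13, so rows like it can also be found by comparing lengths instead of relying on a known index. The following is a minimal sketch of that check. `find_malformed_rows` and the toy data are illustrative and not part of the notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]


# Toy data: the second row is missing its 'Category' value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(toy_header, toy_rows))  # prints [1]
```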


### Removing Duplicate Entries

The next step in cleaning the data is **removing duplicate entries**. To do that, we first need to find out whether there are duplicate entries and which apps they belong to. The code below runs through the data set and identifies which apps have the same name. The names of the duplicate apps are appended to the list named `duplicates_android`.

In [37]:
duplicates_android = []
unique_android = []

for row in android_data:
    name = row[0]
    if name in unique_android:
        duplicates_android.append(name)
    else:
        unique_android.append(name)

print('Number of duplicate apps: ', len(duplicates_android))

Number of duplicate apps:  1181
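
The check `name in unique_android` scans a list, which gets slower as the list grows. A minimal sketch of the same count that uses a set for the membership test is shown below. `find_duplicate_names` and the toy rows are illustrative and not part of the notebook.

```python
def find_duplicate_names(rows, name_index=0):
    """Return the name of every row whose name was already seen earlier."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


toy_rows = [['Instagram', '10'], ['Maps', '5'], ['Instagram', '12'], ['Instagram', '11']]
print(find_duplicate_names(toy_rows))  # prints ['Instagram', 'Instagram']
```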


As the output above shows, there are 1181 duplicate entries. We now have to remove them, but which ones should be deleted?

Take the Instagram app as an example: it has four entries, and the number of reviews differs between them. Assuming that a higher number of reviews means more recent data, we will keep only the entry with the highest number of reviews.

In [40]:
## Finding the duplicated entries with the highest number of reviews ##

reviews_max = {}

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))
        

9659


The dictionary `reviews_max` maps each app name to the highest number of reviews found among its entries. Its length, 9659, is the number of unique apps. We will use this dictionary to decide which rows of `android_data` to copy into a new data set (`android_clean`).

In the code below, we loop through the Google Play data set and keep a row only if its number of reviews matches the maximum recorded for that app *and* the app has not already been added. The second condition handles duplicate entries that share the same maximum number of reviews.

In [42]:
android_clean = []
already_added = []

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
        
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
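
The two passes above can also be collapsed into a single one. The sketch below is one way to do that, not the notebook's method. `keep_most_reviewed` is a hypothetical name, and the output order follows the first appearance of each app name rather than the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strictly greater, so the earliest row wins when counts are tied.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows laid out like the Google Play data (reviews at index 3).
toy_rows = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'GAME', '4.2', '5'],
    ['App A', 'TOOLS', '4.1', '12'],
    ['App A', 'TOOLS', '4.1', '12'],
]
for row in keep_most_reviewed(toy_rows):
    print(row)
```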


### Modifying the data

The final step in cleaning the data is modifying it to fit the purpose of our analysis. As mentioned in the introduction, we are interested only in apps that are free and in English. We will have to filter our data sets so that they contain only the apps with those characteristics.

In [58]:
def is_English(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

print(is_English('Instachat 😜'))
print(is_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_English('Docs To Go™ Free Office Suite'))


True
False
True


To identify whether the name of an app is in English, we created the function `is_English`, which receives a string, loops through its characters, and counts those whose code point (as returned by `ord`) falls outside the ASCII range of 0 to 127.

However, if a single such character were enough to reject a name, app names with special characters like ™ or emojis would be discarded. To avoid that, the function rejects a name only when it contains more than 3 non-ASCII characters. It isn't a perfect solution, but it minimizes the number of English-language apps lost.
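
The same rule can be written more compactly with the threshold exposed as a parameter. This is only a sketch: the lowercase name `is_english` and the `max_non_ascii` parameter are assumptions introduced here and are not used elsewhere in the notebook.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_english('Instachat 😜'))                       # True
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False
print(is_english('Docs To Go™ Free Office Suite'))      # True
```

Exposing the threshold makes it easy to see how many names are kept or rejected at other limits, for example `is_english(name, max_non_ascii=0)` for a strict ASCII-only check.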

Now that we have a way to recognize English-language app names, we can use it to filter both data sets. The code below does just that:


In [61]:
android_english = []
apple_english = []

for row in android_clean:
    name = row[0]
    if is_English(name):
        android_english.append(row)

for row in apple_data:
    name = row[2] 
    if is_English(name):
        apple_english.append(row)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188

As the output above shows, we are down to 9614 apps from the Google Play Store and 6183 from the App Store. However, these entries still include non-free apps. The next step is to isolate the free apps.

In [70]:
final_android = []
final_apple = []

for row in android_english:
    price = row[7]
    if price == '0':
        final_android.append(row)
        
for row in apple_english:
    price = float(row[5])
    if price == 0.0:
        final_apple.append(row)
        
explore_data(final_android, 0, 3, True)
print('\n')
explore_data(final_apple, 0, 3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Sh
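
The two loops above test the price differently: a string comparison for Google Play and a float comparison for the App Store. A small helper can apply one rule to both. The sketch below assumes that a paid price may carry a leading `$`, which is an assumption about the data rather than something shown in this notebook. `is_free` is a hypothetical name.

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$4.99' is zero."""
    # Assumption: a paid price may be written with a leading '$'.
    return float(price_text.strip().lstrip('$')) == 0.0


print(is_free('0'))      # True
print(is_free('3.99'))   # False
print(is_free('$4.99'))  # False
```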

## Data Analysis 

Now that we have a clean data set, we can begin the analysis. The goal of this project is to understand what type of apps are likely to attract more users, which can be different in the Apple store versus the Google Play Store. 

This analysis will show which types of apps are more successful, and therefore more likely to be profitable, in both markets.

For this, we'll build a frequency table for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

### Most Common Apps by Genre

For this analysis we must create two functions: one that will create frequency tables that show percentages, and another that will display the percentages in a descending order. 


In [76]:
## function for creating a Frequency Table ##
def freq_table(dataset, index):
    freq_dictionary = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in freq_dictionary:
            freq_dictionary[value] += 1
        else:
            freq_dictionary[value] = 1
            
    freq_percentages = {}
    for element in freq_dictionary:
        percentage = (freq_dictionary[element]/total) *100
        freq_percentages[element] = percentage
    
    return freq_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
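
As an alternative to the hand-written counting loop, the standard library's `collections.Counter` can build the same kind of table. The sketch below is not the notebook's implementation. `freq_percentages` is a hypothetical name, and values with equal counts are returned in the order they were first seen rather than sorted by name.

```python
from collections import Counter


def freq_percentages(rows, index):
    """Return (value, percentage) pairs, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
for value, percentage in freq_percentages(toy_rows, 1):
    print(value, ':', percentage)
# Games : 75.0
# Weather : 25.0
```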


Now that the functions are created, we can analyze the data shown.

### Most Common Genres in the Apple Store

The list below shows the most common genres in the Apple Store, considering English-language and free apps only. 


The most common genre is *Games* with 58.2% and the runner-up is *Entertainment* with 7.9%. The genres after that are *Photo & Video* with 5.0%, *Education* with 3.7%, *Social Networking* with 3.3%, and *Shopping* with 2.6%.

The general impression is that the most common apps are related to fun (Games and Entertainment).

Even though most apps belong to those genres, that does not necessarily mean those genres have the most users.

In [77]:
display_table(final_apple, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


### Most common genres in the Google Play Store

The list below shows the most common genres and categories in the Google Play Store.

The most common genre is *Tools* with 8.4% followed by *Entertainment* with 6.1%. If we analyze by category, the most common is *Family* with 18.9% and *Game* with 9.7%. 

In comparison with the App Store, the Google Play Store has a lower percentage of apps designed for fun and a higher percentage of tool and utility apps.

These frequency tables reveal the most frequent app genres. However, they do not tell us which genres have the most users.

In [78]:
display_table(final_android, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The list below shows the most common categories in the Google Play Store.

In [80]:
display_table(final_android, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

### Most Popular Apps by Genres on the App Store

The previous section shows the most common genres in the App Store and the Google Play Store, but it doesn't take into account actual usage, that is, the number of users of the apps.

In this section we will analyze the users per genre in both app stores. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing from the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [93]:
genres_apple = freq_table(final_apple, -5)

for genre in genres_apple:
    total = 0 ## quantity of ratings
    len_genre = 0 ## quantity of apps in the genre
    for app in final_apple:
        app_genre = app[-5]
        if app_genre == genre:
            total += float(app[6])
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)


Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


### Most Popular Apps by Genres on the Google Play Store

Unlike the App Store data set, the Google Play data set reports the number of users directly, in the `Installs` column, but only as open-ended ranges such as *100,000+* or *500,000+*. Because the exact number of users is not known, we will use each value as it appears (treating *100,000+* as 100,000 installs). To do that, we must convert the string to a float by removing the commas and plus signs.
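
The string conversion can be sketched on its own before it is used inside the loop. `parse_installs` is a hypothetical helper name that does not appear in the notebook.

```python
def parse_installs(text):
    """Convert an install range such as '1,000+' to the float 1000.0."""
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('10,000+'))      # 10000.0
print(parse_installs('50,000,000+'))  # 50000000.0
```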



In [100]:
category_google = freq_table(final_android, 1)

for category in category_google:
    total = 0 ## total number of installs in the category
    len_category = 0 ## quantity of apps in the category
    for app in final_android:
        app_category = app[1]
        if app_category == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_