# Most Profitable App Categories on the Apple App Store and Google Play

**Project Goal: Increase in-app ad revenue on free-to-download apps.**

The goal of this project is to find the most profitable types of free apps on the Apple App Store and Google Play, so that we can make data-driven decisions about what kind of apps to build. Our main source of revenue is in-app ads in free-to-download apps, so the revenue for any given app depends mostly on its number of users. At the end of this analysis we will know what kinds of apps are most likely to attract the highest number of users.


## Opening and Exploring the Data

We use the following data sets for our analysis. The first contains approximately seven thousand iOS apps from the Apple App Store. The second contains approximately ten thousand Android apps from Google Play. There are more than 2 million apps available on each marketplace, but analysing so much data would require significant time and resources, so we use the following data sets which are smaller, but nevertheless contain sufficient data for our analysis. 

* [Apple App Store Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)
* [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps/home)

First we start by opening each data set, and then we continue by exploring the data. 

In [1]:
from csv import reader

# Importing Apple App Store Data
open_file = open('AppleStore.csv')
read_file = reader(open_file)
apple_apps = list(read_file)
apple_header = apple_apps[0]
apple_apps = apple_apps[1:]

# Importing Google Play Store Data
open_file = open('googleplaystore.csv')
read_file = reader(open_file)
google_apps = list(read_file)
google_header = google_apps[0]
google_apps = google_apps[1:]
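
The cell above leaves both files open and relies on the platform's default text encoding. The sketch below shows one possible reusable loader. `load_csv` and `load_csv_path` are hypothetical helper names, and UTF-8 is an assumption about how the two CSV files are encoded.

```python
import csv
import io


def load_csv(file_obj):
    """Return (header, rows) from an already-open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


def load_csv_path(path):
    """Open a CSV file, read it fully, and close it again."""
    # encoding='utf8' is an assumption; newline='' is what the csv docs recommend
    with open(path, encoding='utf8', newline='') as f:
        return load_csv(f)


# Small in-memory demo so the sketch runs without the real files
sample = io.StringIO('App,Category\nBox,BUSINESS\nSlack,BUSINESS\n')
demo_header, demo_rows = load_csv(sample)
print(demo_header)
print(demo_rows)
```

With the real files this would be called as `apple_header, apple_apps = load_csv_path('AppleStore.csv')`.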

In [2]:
# Function for Exploring the Data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# Apple App Store Data Sample
print(apple_header)
print('\n')
explore_data(apple_apps, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


In [4]:
# Google Play Store Data Sample
print(google_header)
print('\n')
explore_data(google_apps, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Cleaning the Data
### Deleting Incorrect Data

In the discussion section of the Google Play data set, we learn that row 10472 has an error.

Printing this row shows only 12 values instead of 13: the Category value is missing, so every later value has shifted one column to the left. As a result the Rating column holds 19, which is impossible because ratings only go up to 5. We delete the row and confirm that the number of apps in the data set drops by one.

In [5]:
# On the discussion forums we learn there is one row with a missing value.
print(google_apps[10472])
print(len(google_apps))
del(google_apps[10472]) #don't run more than once
print(len(google_apps))


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10841
10840
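
The faulty row was found through the discussion forum, but the same kind of error can also be detected directly, because a row with a missing value has fewer columns than the header. The following is a minimal sketch; `find_malformed_rows` is a hypothetical helper and the demo rows are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


demo_header = ['App', 'Category', 'Rating']
demo_rows = [
    ['Box', 'BUSINESS', '4.2'],
    ['Broken Frame', '19'],          # category missing, values shifted left
    ['Slack', 'BUSINESS', '4.4'],
]

bad = find_malformed_rows(demo_header, demo_rows)
print(bad)
```

On the real data the call would be `find_malformed_rows(google_header, google_apps)`, run before any row is deleted.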


### Removing Duplicate Entries
The Google Play data set contains duplicate entries. Before removing them, we need to decide which copy of each app to keep. We would like to keep the most recent data, and the 'Reviews' column is a reasonable proxy for recency: the more reviews an entry has, the more recently it was likely collected. For each app we therefore keep the entry with the highest number of reviews.

In [6]:
# finding the duplicate app entries
duplicate_apps = []
unique_apps = []

for app in google_apps:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of Duplicate Apps: ', len(duplicate_apps))
print('\n')
print('Examples of Duplicate Apps: ', duplicate_apps[:10])

Number of Duplicate Apps:  1181


Examples of Duplicate Apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [7]:
# deleting the duplicate Google app entries
print('Expected number of entries after deletion: ', len(google_apps) - len(duplicate_apps))

# create a dictionary mapping each app name to its highest number of reviews
reviews_max = {}

for app in google_apps:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews 
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Actual number of entries after deletion: ', len(reviews_max))

Expected number of entries after deletion:  9659
Actual number of entries after deletion:  9659


Now we will use the reviews_max dictionary to remove the duplicates.

In [8]:
google_clean = []
already_added = []

for app in google_apps:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(app)
        already_added.append(name)

Now that we have removed the duplicate entries, we are going to use 'explore_data' to make sure everything looks as it should. 

In [9]:
explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have 9,659 rows as expected, so it looks like the duplicates have successfully been removed. 
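
The two-step approach above (build `reviews_max`, then filter with an `already_added` list) can also be written as a single pass that stores the best row per name in a dictionary. This is only a sketch: `dedupe_by_max_reviews` is a hypothetical name, the demo rows use made-up review counts, and the result is ordered by each app's first appearance, which can differ from the row order produced by the cells above.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up rows: name, category, rating, reviews
demo = [
    ['Box', 'BUSINESS', '4.2', '159043'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

deduped = dedupe_by_max_reviews(demo)
for row in deduped:
    print(row)
```

Dictionary lookups avoid scanning a growing list for every row, which may matter on larger data sets.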

### Removing Non-English Apps

There are non-English apps in our data sets, so we remove them because they are not aimed at our target audience.

We use the following function to determine if an app name contains only characters used in English or not. 

In [11]:
def in_english(string):
    
    for char in string:
        if ord(char) > 127:
            return False
    
    return True

print(in_english('Instachat'))
print(in_english('电视剧热播'))
print(in_english('Business™'))
print(in_english('Emoji🤯'))
# '™' and emoji can appear in English app names, but the function above flags them as non-English:
print(ord('™'))
print(ord('🤯'))

True
False
False
False
8482
129327


However, we see that some characters outside the ASCII range (code points above 127) can still appear in English app names. To avoid discarding more entries than necessary, we accept names with up to 3 such characters.

In [12]:
def in_english(string):
    not_ascii = 0
    
    for char in string:
        if ord(char) > 127:
            not_ascii += 1
            
    if not_ascii > 3:
        return False
    else:
        return True
    
print(in_english('Business™'))
print(in_english('Emoji🤯'))
print(in_english('Insta 电视剧热'))

True
True
False
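
The same heuristic can be written more compactly with the threshold exposed as a parameter. This is a sketch only: `looks_english` and `max_non_ascii` are hypothetical names, and the check remains a heuristic, since a name in another language that happens to use only ASCII characters would still pass.

```python
def looks_english(string, max_non_ascii=3):
    """Return True when at most max_non_ascii characters fall outside ASCII."""
    non_ascii = sum(1 for char in string if ord(char) > 127)
    return non_ascii <= max_non_ascii


results = [
    looks_english('Business™'),                   # 1 non-ASCII character
    looks_english('Emoji🤯'),                     # 1 non-ASCII character
    looks_english('Insta 电视剧热'),               # 4 non-ASCII characters
    looks_english('Business™', max_non_ascii=0),  # strict mode
]
print(results)
```

Setting `max_non_ascii=0` reproduces the behaviour of the first, stricter version of the function.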


Below, we use our in_english() function to filter out the non-English apps from both of our data sets.

In [13]:
# filtering non-English apps out of our Apple dataset
apple_apps_eng = []
for app in apple_apps:
    if in_english(app[2]):
        apple_apps_eng.append(app)
explore_data(apple_apps_eng, 0, 3, True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 6183
Number of columns: 17


In [14]:
# filtering non-English apps out of our Google dataset
google_clean_eng = []
for app in google_clean:
    if in_english(app[0]):
        google_clean_eng.append(app)
explore_data(google_clean_eng, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


### Isolating the Free Apps
Here we filter out all of the apps that are not free. For our purposes we only need to analyse the free apps.

In [15]:
# filtering out the apps that are not free out of the Apple dataset
apple_free = []

for app in apple_apps_eng:
    price = app[5]
    if price == '0':
        apple_free.append(app)
        
print(len(apple_free))

3222


In [16]:
# filtering out the apps that are not free out of the Google dataset
google_free = []

for app in google_clean_eng:
    price = app[7]
    if price == '0':
        google_free.append(app)
        
print(len(google_free))

8864
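
Both filters above compare the price to the exact string `'0'`, which works for the rows shown here. If a free app's price were ever stored differently (for example `'0.0'` or with a currency symbol, which is an assumption and not something the samples above show), a numeric comparison would be more tolerant. A minimal sketch, with `is_free` as a hypothetical helper:

```python
def is_free(price):
    """Return True when a price string represents zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that cannot be parsed as a number is treated as not free
        return False


checks = [
    is_free('0'),
    is_free('0.0'),
    is_free('$0.00'),
    is_free('3.99'),
    is_free('$4.99'),
    is_free('Everyone'),
]
print(checks)
```

The Apple filter would then read `if is_free(app[5])` and the Google filter `if is_free(app[7])`.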


## Most Common Apps by Category
Our aim is to determine the types of apps that are likely to attract more users. A good strategy for developing a new app might be to first build a minimal Android version of the app. If the app gets a good response from users, we can develop it further. Finally, if the app is profitable after six months, we can build an iOS version. Because we want to add our apps to both marketplaces, we need to find app profiles that are likely to be successful in both markets.

We begin by building frequency tables for the Genre and Category columns of our data sets to get a sense of the most common app types for each market. 

In [17]:
# this function will help us sort our frequency table percentages
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [18]:
# this is our frequency table function that will show percentages
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentage = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentage[key] = percentage
        
    return table_percentage
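
The two functions above can also be sketched with `collections.Counter` from the standard library. `freq_table_pct` and `sorted_table` are hypothetical names chosen so they do not shadow the originals, and the demo rows are made up. One difference to note: this version sorts by percentage only, while `display_table` sorts `(percentage, key)` tuples, so ties may appear in a different order.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


def sorted_table(dataset, index):
    """Return (value, percentage) pairs, largest percentage first."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


demo = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
result = sorted_table(demo, 1)
for value, pct in result:
    print(value, ':', pct)
```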

Here we can examine the *prime_genre* of the Apple App Store data set.

In [19]:
display_table(apple_free, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Here we can examine the *Genres* column of the Google Play data set.

In [21]:
display_table(google_free, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Here we can examine the *Category* of the Google Play data set.

In [22]:
display_table(google_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

## Most Popular Apps by Category on the Apple App Store
The Apple App Store data set has no install counts, so we use the total number of ratings (*rating_count_tot*) as a proxy for popularity and calculate the average number of ratings per app in each genre.

In [23]:
genre_apple = freq_table(apple_free, -5)

for genre in genre_apple:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, " : ", avg_n_ratings)

Productivity  :  21028.410714285714
Weather  :  52279.892857142855
Shopping  :  26919.690476190477
Reference  :  74942.11111111111
Finance  :  31467.944444444445
Music  :  57326.530303030304
Utilities  :  18684.456790123455
Travel  :  28243.8
Social Networking  :  71548.34905660378
Sports  :  23008.898550724636
Health & Fitness  :  23298.015384615384
Games  :  22788.6696905016
Food & Drink  :  33333.92307692308
News  :  21248.023255813954
Book  :  39758.5
Photo & Video  :  28441.54375
Entertainment  :  14029.830708661417
Business  :  7491.117647058823
Lifestyle  :  16485.764705882353
Education  :  7003.983050847458
Navigation  :  86090.33333333333
Medical  :  612.0
Catalogs  :  4004.0


Keep in mind that a few giants such as Google Maps, Facebook, Instagram, Spotify, and YouTube have so many ratings that they skew the averages of their categories. Looking beyond Navigation, Social Networking, and Music, the Reference, Weather, and Book categories still have a relatively high average number of ratings. Food & Drink also looks promising, but many of those apps come from established businesses. People don't spend much time in Weather apps, so that category is probably a poor fit for in-app ads. Reference looks promising, and Games could be a good choice given how much time people spend in-app.
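
One way to see how much a few very large apps pull an average up is to compare the mean with the median for each genre. The sketch below does this in a single pass over the rows; `ratings_summary` is a hypothetical helper and the demo rows are made-up numbers, not values from the data set.

```python
from statistics import median


def ratings_summary(rows, genre_idx, ratings_idx):
    """Return {genre: (mean, median)} of the rating counts in each genre."""
    by_genre = {}
    for row in rows:
        by_genre.setdefault(row[genre_idx], []).append(float(row[ratings_idx]))
    return {
        genre: (sum(values) / len(values), median(values))
        for genre, values in by_genre.items()
    }


# Made-up rows: name, genre, number of ratings
demo = [
    ['Giant Maps', 'Navigation', '900000'],
    ['Small Nav A', 'Navigation', '1000'],
    ['Small Nav B', 'Navigation', '2000'],
    ['Ref A', 'Reference', '40000'],
    ['Ref B', 'Reference', '60000'],
]

summary = ratings_summary(demo, 1, 2)
for genre, (mean_value, median_value) in summary.items():
    print(genre, ':', mean_value, median_value)
```

On the real data the call would be `ratings_summary(apple_free, -5, 6)`. A genre whose mean is far above its median is likely dominated by a handful of apps.
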
Let's look at the Google Play market.

## Most Popular Apps by Category on Google Play
We first examine how the *Installs* column is distributed across all free apps.

In [24]:
display_table(google_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


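The install counts are open-ended buckets such as 100,000+, so they don't give us perfect precision: an app listed as 100,000+ could have anywhere from 100,000 up to the next bucket. We treat each bucket as its lower bound, which is still enough to tell us which app categories attract the most users.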
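
Turning the bucket labels into numbers and averaging them per category can be sketched as follows. `parse_installs` and `avg_installs_by_category` are hypothetical helpers, and the demo rows are made up. The parsed value is a lower bound on installs, not an exact count.

```python
def parse_installs(installs):
    """Convert a label such as '1,000,000+' to the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))


def avg_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Return {category: average lower-bound installs} in a single pass."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_idx]
        totals[category] = totals.get(category, 0) + parse_installs(row[installs_idx])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


parsed = [parse_installs(s) for s in ['0', '0+', '10+', '1,000,000+', '1,000,000,000+']]
print(parsed)

# Made-up rows: name, category, installs
demo = [
    ['A', 'TOOLS', '10,000+'],
    ['B', 'TOOLS', '1,000+'],
    ['C', 'WEATHER', '500+'],
]
averages = avg_installs_by_category(demo, category_idx=1, installs_idx=2)
print(averages)
```

With the default column indexes the call on the real data would be `avg_installs_by_category(google_free)`.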

In [26]:
category_google = freq_table(google_free, 1)

for category in category_google:
    total = 0
    len_category = 0
    for app in google_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, " : ", avg_n_installs)

ART_AND_DESIGN  :  1986335.0877192982
AUTO_AND_VEHICLES  :  647317.8170731707
BEAUTY  :  513151.88679245283
BOOKS_AND_REFERENCE  :  8767811.894736841
BUSINESS  :  1712290.1474201474
COMICS  :  817657.2727272727
COMMUNICATION  :  38456119.167247385
DATING  :  854028.8303030303
EDUCATION  :  1833495.145631068
ENTERTAINMENT  :  11640705.88235294
EVENTS  :  253542.22222222222
FINANCE  :  1387692.475609756
FOOD_AND_DRINK  :  1924897.7363636363
HEALTH_AND_FITNESS  :  4188821.9853479853
HOUSE_AND_HOME  :  1331540.5616438356
LIBRARIES_AND_DEMO  :  638503.734939759
LIFESTYLE  :  1437816.2687861272
GAME  :  15588015.603248259
FAMILY  :  3695641.8198090694
MEDICAL  :  120550.61980830671
SOCIAL  :  23253652.127118643
SHOPPING  :  7036877.311557789
PHOTOGRAPHY  :  17840110.40229885
SPORTS  :  3638640.1428571427
TRAVEL_AND_LOCAL  :  13984077.710144928
TOOLS  :  10801391.298666667
PERSONALIZATION  :  5201482.6122448975
PRODUCTIVITY  :  16787331.344927534
PARENTING  :  542603.6206896552
WEATHER  :  50

## Conclusion
What kind of app should we build next?