# App revenue in the App Store and Google Play Store

---------
*Author: Anna Pot <br/>*

In this project we analyse data on mobile apps available on Google Play and the App Store to understand which types of apps are likely to attract more users.

In [1]:
from csv import reader

## The Apple dataset ##
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

## The Google Play dataset ##
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
google = list(read_file)
google_header = google[0]
google = google[1:]
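
Both datasets are loaded with the same steps, so the pattern can be folded into one helper. The sketch below is an illustration and not part of the original notebook: load_dataset is a hypothetical name, and the utf8 encoding is an assumption about the CSV files.

```python
from csv import reader

def load_dataset(path, encoding='utf8'):
    """Return (header, rows) for a CSV file.

    Hypothetical helper; the default encoding is an assumption.
    """
    # The with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Possible usage in this notebook:
# ios_header, ios = load_dataset('AppleStore.csv')
# google_header, google = load_dataset('googleplaystore.csv')
```

Compared with the cell above, the only behavioural difference is that each file is closed as soon as it has been read.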

**Function**
The function explore_data() below slices the dataset and loops through the slice, printing each row followed by an empty line. If rows_and_columns is True, it also prints the number of rows and columns of the full dataset.

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n') #new, empty line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print('\n')
explore_data(google, 0, 3, True)
    



['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We've explored the Google Play (Android) dataset, which consists of 13 columns and 10841 rows. The columns that would be useful for our analysis are App, Category, Rating, Reviews, Content Rating and Genres.<br/>
We will do the same for the iOS dataset below.

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


The iOS dataset consists of 17 columns and fewer rows: 7197. The columns that might be useful for our analysis are id, track_name, rating_count_tot, rating_count_ver, user_rating and prime_genre.<br/>
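
Because the rows are plain lists, every column has to be addressed by its position. A small lookup from column name to index can make those positions easier to check. The sketch below uses the header printed above; ios_col is a hypothetical name that does not appear elsewhere in this notebook.

```python
# Header row copied from the output above.
ios_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# Map each column name to its position (hypothetical helper dictionary).
ios_col = {name: index for index, name in enumerate(ios_header)}

print(ios_col['track_name'])        # 2
print(ios_col['price'])             # 5
print(ios_col['rating_count_tot'])  # 6
print(ios_col['prime_genre'])       # 12
```

These are the indices used for the iOS dataset in the rest of the analysis.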

## Delete wrong data
According to the discussion section of the Google dataset, there is an error in one of the entries, in row 10472. We will remove this row from our dataset.

In [4]:
print(len(google)) #row has missing value for rating
del google[10472] # run this only once!
print(len(google))


10841
10840
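
Deleting by a fixed index is only safe if the cell runs exactly once. A possible alternative is to look for faulty rows by a property of the row itself. The sketch below assumes that a faulty row has a different number of fields than the header, which the source does not confirm for row 10472; a row that merely contains a blank value would not be caught. The function name and the toy data are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose field count differs from the header."""
    expected = len(header)
    return [index for index, row in enumerate(rows) if len(row) != expected]

# Toy data (made up): the second row is one field short.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

bad = find_malformed_rows(toy_header, toy_rows)
print(bad)  # [1]

# Delete from the end so that earlier indices stay valid.
for index in reversed(bad):
    del toy_rows[index]
print(len(toy_rows))  # 2
```

Running this a second time finds nothing left to delete, so it cannot remove a valid row by accident.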


**Explore the Google dataset**:<br/>
The Google dataset has duplicate entries:   

In [5]:
unique_apps = []
duplicate_apps = []

for app in google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Number of unique apps: ", len(unique_apps))
print('\n')
print("Number of duplicate apps: ", len(duplicate_apps))
print("Examples of duplicate apps: ", duplicate_apps[:10])

Number of unique apps:  9659


Number of duplicate apps:  1181
Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
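
The same counts can also be obtained with collections.Counter from the standard library, which avoids searching a list for every app. This is a sketch on made-up toy rows, not the code that produced the numbers above.

```python
from collections import Counter

# Toy rows (made up); the app name is in column 0, as in the Google dataset.
toy_google = [['Box'], ['Slack'], ['Box'], ['Zenefits'], ['Box'], ['Slack']]

name_counts = Counter(row[0] for row in toy_google)

n_unique = len(name_counts)
# Every occurrence beyond the first one counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print(n_unique)          # 3
print(n_duplicates)      # 3
print(duplicated_names)  # ['Box', 'Slack']
```

Applied to the real dataset, Counter(app[0] for app in google) should give the same totals as the loop above.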


## Remove duplicates
As you can see, the dataset lists many apps more than once: 1181 entries are duplicates. We want to remove these duplicates, but not randomly. If we examine the duplicates in more depth, we see that some entries have more user reviews than others. Let's show this with an example:

In [6]:
for app in google:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The Instagram app has four entries. The only field that differs between them is the number of reviews (the fourth item in each list, index 3). For each app we therefore first record the highest number of reviews in a dictionary:

In [7]:
reviews_max = {}

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print("Expected length: ", len(google)-1181) # minus duplicate apps
print("Actual length: ", len(reviews_max))

Expected length:  9659
Actual length:  9659


In [8]:
google_clean = []
already_added = []

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(app)
        already_added.append(name)

print("expected length: ", len(reviews_max))
print("actual length: ", len(google_clean))

expected length:  9659
actual length:  9659
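
The two cells above can also be combined into a single pass that stores the best row per app directly. The sketch below is an alternative under stated assumptions, not the notebook's method: keep_most_reviewed is a hypothetical name, and the Box row in the toy data is made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # made-up values
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

cleaned = keep_most_reviewed(toy)
for row in cleaned:
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas google_clean above is ordered by the position of the retained row.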


We've removed the duplicate entries from the Google Play dataset with the code above. The expected length (the full Google dataset minus the duplicates) matches the actual length. We removed duplicates on the basis of the number of reviews: for each app we retained the entry with the highest number of reviews.<br/>
<br/>
Let's quickly explore the dataset:

In [9]:
explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The code below checks whether an app name is likely to be English. It counts the characters that fall outside the ASCII range (code points above 127). If a name contains more than three such characters, the function returns False and the app will be removed; otherwise it returns True. Allowing up to three non-ASCII characters keeps English apps whose names include an emoji or a symbol such as ™.

In [10]:
def is_it_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_it_english('Instachat 😜'))
print(is_it_english('Instachat'))
print(is_it_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
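
The threshold of three is a judgement call, so it can help to make it a parameter and see how the result changes. The variant below is a sketch; is_probably_english and max_non_ascii are hypothetical names.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instachat 😜'))                   # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))    # False
print(is_probably_english('Instachat 😜', max_non_ascii=0))  # False
```

With a threshold of zero, the emoji alone is enough to reject the name, which shows why a small tolerance is useful.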


Let's apply the function to both of our datasets and remove the non-English apps.

In [11]:
google_english = []
ios_english = []
    
for app in google_clean:
    name = app[0]
    if is_it_english(name):
        google_english.append(app)

for app in ios:
    name = app[2]
    if is_it_english(name):
        ios_english.append(app)
            
explore_data(google_english, 0, 3, True)
print("\n")
explore_data(ios_english, 0, 3, True)
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188

As a last step in the data cleaning process we will remove the paid apps, because our analysis focuses on apps that are free to install.

In [12]:
google_free = []
ios_free = []

for app in google_english:
    price = app[7]
    if price == '0':
        google_free.append(app)
        
for app in ios_english:
    price = app[5]
    if price == '0':
        ios_free.append(app)

explore_data(google_free, 0, 3, True)
print("\n")
explore_data(ios_free, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Sh

# Exploring the dataset
## Most common genres for Google Play
We're going to build frequency tables for the Category and Genres columns of the Google Play dataset to see which app genres are most common in the Google Play store.
<br/>
To do this, we create two functions:
* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

In [13]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_perc = {}
    for key in table:
        percentage = (table[key]/ total) * 100
        table_perc[key] = percentage
    
    return table_perc

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(google_free, 1) 
print("\n")
display_table(google_free, 9) 

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Above we have written two functions to create and display frequency tables, and applied them to the Google Play dataset (index 1 corresponds to the Category column and index 9 to the Genres column).
<br/>
Let's do the same for the iOS dataset:

In [14]:
display_table(ios_free, 12) 

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665
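
As a cross-check, a sorted percentage table can also be built with collections.Counter, whose most_common() method returns the values from most to least frequent. This is a sketch on toy data; freq_table_sorted is a hypothetical name.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows (made up): the genre is in column 1.
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]

table = freq_table_sorted(toy, 1)
for genre, percentage in table:
    print(f'{genre} : {percentage:.2f}')
# Games : 75.00
# Weather : 25.00
```

Values with equal counts may appear in a different order than in display_table(), which breaks ties by sorting on the name.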


## Let's analyse the frequency tables for both app stores!
### The App Store data
The App Store (iOS) features games as the most common genre, with 58% of the apps labelled as games. Entertainment is the runner-up with about 8%, and Photo & Video is also relatively well represented at about 5%. Then there are a number of genres that each represent around 1 or 2% and relate more to practical use, such as travel, productivity, fitness and finance. The bulk of the apps seems to be designed for entertainment purposes. 
A recommendation for an app profile for the App Store market would probably be based on the visibility of the entertainment category. The question is, however, whether a large number of apps in a particular genre also reflects a large user base.

### The Google Play Store data
The most common categories in the Google Play store are a bit more diverse and include Family (18.9%), Game (9.7%), Tools (8.5%) and Business (4.6%). The percentages are much more spread out than in the App Store, with games representing only 9.7% of all apps. Also, the Genres column shows much more detail, with more precise definitions of the app types (educational games, for example). 
Investigating the app categories in a bit more detail, we see that the Family category (the largest one) mostly lists games for kids, so entertainment still makes up a sizeable share of the apps available in the Google Play store. Nonetheless, the share of practical apps (category Tools) is much bigger in the Google Play store than in the App Store.

## Most popular apps by genre
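To find out which genres are the most popular (i.e. have the most users) we would ideally calculate the average number of installs for each app genre. The App Store dataset has no installs column, so as a proxy we use the total number of user ratings (rating_count_tot, index 6) and calculate its average per genre.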

In [18]:
ios_genres = freq_table(ios_free, 12)

for genre in ios_genres:
    total = 0
    len_genre = 0
    
    for app in ios_free:
        genre_app = app[12]
        
        if genre_app == genre:
            user_ratings = float(app[6])
            total += user_ratings
            len_genre += 1

    average_ratings = total / len_genre
    print(genre, ":", average_ratings)

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0
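
The nested loop above scans the whole dataset once per genre. The averages can also be collected in a single pass by keeping a running total and count per genre. The sketch below shows the idea on toy data; average_by_genre is a hypothetical name.

```python
from collections import defaultdict

def average_by_genre(dataset, genre_index, value_index):
    """Return {genre: mean of the numeric column} using one pass over the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for app in dataset:
        genre = app[genre_index]
        totals[genre] += float(app[value_index])
        counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}

# Toy rows (made up): genre in column 0, rating count in column 1.
toy = [['Weather', '100'], ['Games', '10'], ['Weather', '300']]

averages = average_by_genre(toy, 0, 1)
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', average)
# Weather : 200.0
# Games : 10.0
```

For this notebook the call would be average_by_genre(ios_free, 12, 6). Sorting the result makes the ranking easier to read than the unordered output above.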


The highest average numbers of user ratings are found in Navigation (about 86,000 per app), Reference (about 75,000) and Social Networking (about 72,000). Entertainment and Lifestyle average fewer ratings per app (about 14,000 and 16,500), but these averages are spread over many more apps. Based on the number of ratings, and looking back at the number of apps present in the App Store (the frequency table), I would recommend an app profile related to entertainment for the App Store. Let's check whether this recommendation holds up when we look at the individual apps within a genre: a few apps may account for most of the user ratings, so that a small genre appears to be used by a large number of people.

In [20]:
for app in ios_free:
    if app[12] == "Navigation":
        print(app[2], ":", app[6]) # this prints the name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911
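
To put a number on this, we can compute each app's share of all ratings in the genre. The sketch below uses the names and rating counts printed above.

```python
# Names and rating counts copied from the output above.
navigation = [
    ('Waze - GPS Navigation, Maps & Real-time Traffic', 345046),
    ('Geocaching®', 12811),
    ('ImmobilienScout24: Real Estate Search in Germany', 187),
    ('Railway Route Search', 5),
    ('CoPilot GPS – Car Navigation & Offline Maps', 3582),
    ('Google Maps - Navigation & Transit', 154911),
]

total = sum(count for _, count in navigation)

# Print the apps from the largest to the smallest share of ratings.
for name, count in sorted(navigation, key=lambda item: item[1], reverse=True):
    print(f'{name} : {count / total:.1%}')
```

Waze accounts for roughly two thirds of all ratings in the genre and Google Maps for about 30%, so the other four apps together contribute only a few percent.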


Indeed, the Navigation genre has a high average number of user ratings, but it is heavily dominated by one app: Waze. Together with Google Maps, it is responsible for the bulk of the user ratings. Let's look at the Lifestyle category:

In [21]:
for app in ios_free:
    if app[12] == "Lifestyle":
        print(app[2], ":", app[6])

Zillow Real Estate - Homes for Sale & for Rent : 342969
Alipay - Makes Life Easy : 1926
Mega Millions & Powerball - lottery games in the US with winning number results, lotto jackpots and prize payouts : 1255
Zoopla Property Search -UK Homes for Sale and Rent : 210
Celebtwin: Celebrity Looks Like Lite : 1111
Autohome-Find new＆Used Cars For Sale : 194
IKEA Catalog : 8939
myChevrolet : 1083
Text Free: Free Texting + Calling + MMS : 100477
Countdown‼ (Event Reminders and Timer) : 60490
PINK Nation : 49816
Perfect365 - Custom makeup designs and beauty tips : 19540
happn — Dating app — Find and meet your crush : 20546
cute icon&wallpaper dressup - CocoPPa : 12508
Tapage by My Little App : 41
Tinder : 143040
OnCamera : 111
Monogram - Wallpaper & Backgrounds Maker HD DIY with Glitter Themes : 7427
Tile - Find & track your lost phone, wallet, keys : 5684
SafeTrek - Personal Safety : 2227
Yoshirt - Design Your Own Custom Tshirt, Tote Bag, Socks and More : 1849
Bumble – Find a Date, Meet Friends

This category is much more diverse. The ratings are still skewed towards a few apps (Zillow, Tinder and Text Free each have more than 100,000 ratings), but they are spread more evenly than in the Navigation genre.
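
One simple way to see how skewed a genre is, without reading every row, is to compare the mean with the median: a mean far above the median signals that a few apps dominate. The sketch below uses the Navigation rating counts printed earlier; the same comparison could be made for any other genre.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps, copied from the earlier output.
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

navigation_mean = mean(navigation_ratings)
navigation_median = median(navigation_ratings)

print(round(navigation_mean, 2))  # 86090.33
print(navigation_median)          # 8196.5
```

Here the mean is more than ten times the median, which matches the observation that Waze and Google Maps account for most of the ratings.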