# Profitable App Profiles for the App Store and Google Play Markets

Our organisation only builds apps that are free to download and install, and our main source of revenue is in-app ads. This means our revenue for any given app depends mostly on the number of users who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
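from csv import reader

# An explicit encoding avoids platform-dependent decoding errors on non-ASCII app names;
# the with-block closes the file once it has been read.
with open('/Users/dannyfowler/Data Science/My Datasets/google-play-store-apps/googleplaystore.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    google_data = list(read_file)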

In [3]:
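from csv import reader

# Same pattern as for the Google Play file: explicit encoding, file closed after reading.
with open('/Users/dannyfowler/Data Science/My Datasets/app-store-apple-data-set-10k-apps/AppleStore.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    apple_data = list(read_file)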

# Data Exploration

## Google Play Data
### Columns in Google Play dataset

App - Application name

Category - Category the app belongs to

Rating - Overall user rating of the app (as when scraped)

Reviews - Number of user reviews for the app (as when scraped)

Size - Size of the app (as when scraped)

Installs - Number of user downloads/installs for the app (as when scraped)

Type - Paid or Free

Price - Price of the app (as when scraped)

Content Rating - Age group the app is targeted at - Children / Mature 21+ / Adult

Genres - An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres.

Last Updated - Date when the app was last updated on Play Store (as when scraped)

Current Ver - Current version of the app available on Play Store (as when scraped)

Android Ver - Min required Android version (as when scraped)

In [4]:
explore_data(google_data,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
print(google_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
print(len(google_data))

10842


In [7]:
del google_data[10473]

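This row was deleted because it has an invalid number of columns: the Category value is missing, so the row has 12 fields instead of 13 and every value after the app name is shifted one column to the left.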
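
One way to locate rows like this without knowing the index in advance is to compare each row's length with the header's. The sketch below is only an illustration: the helper name `find_malformed_rows` and the miniature `sample_data` list are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row_length) pairs for rows whose length differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected column count
    malformed = []
    for index, row in enumerate(dataset):
        if len(row) != expected:
            malformed.append((index, len(row)))
    return malformed


# Hypothetical miniature dataset: the last row is missing its 'Category' value.
sample_data = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

print(find_malformed_rows(sample_data))
```

Run against `google_data` before the deletion, the result would include index 10473, since that row has 12 fields against a 13-column header.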

## Apple Data
### Columns in Apple dataset

"id" : App ID

"track_name": App Name

"size_bytes": Size (in Bytes)

"currency": Currency Type

"price": Price amount

"rating_count_tot": User Rating counts (for all version)

"rating_count_ver": User Rating counts (for current version)

"user_rating" : Average User Rating value (for all version)

"user_rating_ver": Average User Rating value (for current version)

"ver" : Latest version code

"cont_rating": Content Rating

"prime_genre": Primary Genre

"sup_devices.num": Number of supporting devices

"ipadSc_urls.num": Number of screenshots showed for display

"lang.num": Number of supported languages

"vpp_lic": Vpp Device Based Licensing Enabled

In [8]:
explore_data(apple_data,0,3,True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 7198
Number of columns: 17


# Data Cleansing

## Duplicates

### Google

In [9]:
google_app_names = []
google_app_names_dupes = []

for row in google_data[1:]:
    name = row[0]
    if name in google_app_names:
        google_app_names_dupes.append(name)
    else:
        google_app_names.append(name)
        
print('There are ' + str(len(google_app_names_dupes)) + ' duplicate apps in this dataset')
print('Some dupes are:')
print(google_app_names_dupes[0:5])

There are 1181 duplicate apps in this dataset
Some dupes are:
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Because the data contains duplicate entries, only one row will be kept for each app: the one with the highest number of reviews. The other rows will be removed.

In [10]:
reviews_max = {}

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max:
        if reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    else:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


The code above finds the maximum number of reviews for each app and stores it in a dictionary. This gives us a reference point: we can loop through the Google Play data again and add a row to the clean dataset only when its review count equals that maximum. The `already_added` list ensures that an app is added only once when several of its rows share the maximum.

In [11]:
android_clean = []
already_added = []

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
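
The membership check on the app name matters when duplicate rows tie on the maximum review count: without it, every tied row would be kept. The sketch below replays the same two-pass logic on a few hypothetical rows (`toy_rows` is invented for illustration). It swaps the list for a set, which is an assumption of this sketch rather than what the notebook does; the logic is otherwise the same.

```python
# Hypothetical rows in the form [name, category, rating, reviews].
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],    # exact duplicate: ties on the maximum
    ['Notes', 'PRODUCTIVITY', '4.0', '10'],
    ['Notes', 'PRODUCTIVITY', '4.1', '25'],  # record with more reviews
]

# Pass 1: highest review count seen for each app name.
toy_reviews_max = {}
for row in toy_rows:
    name, n_reviews = row[0], float(row[3])
    if name not in toy_reviews_max or toy_reviews_max[name] < n_reviews:
        toy_reviews_max[name] = n_reviews

# Pass 2: keep the first row that matches the maximum for each name.
toy_clean = []
seen = set()  # a set instead of a list; same logic, faster membership tests
for row in toy_rows:
    name, n_reviews = row[0], float(row[3])
    if n_reviews == toy_reviews_max[name] and name not in seen:
        toy_clean.append(row)
        seen.add(name)

for row in toy_clean:
    print(row)
```

Only one `Box` row survives even though both tie on the maximum, and the `Notes` row with 25 reviews is kept over the one with 10.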

In [12]:
explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


### Apple

In [13]:
apple_app_names = []
apple_app_names_dupes = []

for row in apple_data[1:]:
    name = row[2]
    if name in apple_app_names:
        apple_app_names_dupes.append(name)
    else:
        apple_app_names.append(name)
        
print('There are ' + str(len(apple_app_names_dupes)) + ' duplicate apps in this dataset')
print('Some dupes are:')
print(apple_app_names_dupes[0:5])

There are 2 duplicate apps in this dataset
Some dupes are:
['VR Roller Coaster', 'Mannequin Challenge']


Although two app names appear twice in this dataset, these rows will be left in, following the conclusion of the discussion found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409).

## Non-English Apps

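The function below returns `False` for any string that contains three or more characters outside the ASCII range (0-127). Up to two such characters are tolerated, which keeps names that contain an occasional symbol or emoji: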

In [14]:
def englishCharactersOnly(string):
    n_non_english = 0
    for char in string:
        if ord(char) > 127:
            n_non_english += 1
        if n_non_english >= 3:
            return False
    return True

### Google

In [15]:
google_eng = []

for row in android_clean:
    name = row[0]
    if (englishCharactersOnly(name)):
        google_eng.append(row)

In [16]:
explore_data(google_eng,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9597
Number of columns: 13


### Apple

In [17]:
apple_eng = []

for row in apple_data[1:]:
    name = row[2]
    if (englishCharactersOnly(name)):
        apple_eng.append(row)

In [18]:
explore_data(apple_eng,0,3,True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 6155
Number of columns: 17


## Free Apps Only

In [19]:
google_eng_free = []

for row in google_eng:
    typeApp = row[6]
    if typeApp == 'Free':
        google_eng_free.append(row)

In [20]:
apple_eng_free = []

for row in apple_eng:
    price = float(row[5])
    if price == 0:
        apple_eng_free.append(row)

We are looking for an app profile that appeals to users on both the Google Play Store and the App Store, because ad revenue is primarily linked to the number of users of the app. Covering both markets gives us more opportunity to increase revenue.

### Useful columns for analysis:

*Google (google_eng_free):*
* Name [0]
* Genre [9]
* Installs [5]
* Category [1]

*Apple (apple_eng_free):*
* Name [2]
* Primary genre [12]
* Rating Count for all versions [6]
* Rating Count for current version [7]

### Frequency Table

In [21]:
def freq_table(dataset, index):
    freq_table = {}
    for row in dataset:
        ref = row[index]
        if ref in freq_table:
            freq_table[ref] += 1
        else:
            freq_table[ref] = 1
    return freq_table

In [22]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
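
Raw counts can be hard to compare between two datasets of different sizes, so a percentage version of the frequency table can be useful. The following is a minimal sketch under that assumption; `freq_table_pct` and `toy_apps` are hypothetical names that do not appear elsewhere in this notebook.

```python
def freq_table_pct(dataset, index):
    """Frequency table for one column, expressed as percentages of all rows."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1

    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical rows in the form [name, category].
toy_apps = [['A', 'GAME'], ['B', 'GAME'], ['C', 'TOOLS'], ['D', 'GAME']]

table = freq_table_pct(toy_apps, 1)
for value in sorted(table, key=table.get, reverse=True):
    print(value, ':', round(table[value], 2))
```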

### Google

**Category**

In [23]:
display_table(google_eng_free,1)

FAMILY : 1675
GAME : 858
TOOLS : 748
BUSINESS : 407
PRODUCTIVITY : 345
LIFESTYLE : 344
FINANCE : 328
MEDICAL : 313
SPORTS : 300
PERSONALIZATION : 294
COMMUNICATION : 286
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 189
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 123
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 71
WEATHER : 70
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 54
BEAUTY : 53


The most common category we see here is the Family category, followed by Game apps. 

**Genres**

In [24]:
display_table(google_eng_free,9)

Tools : 747
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 343
Finance : 328
Medical : 313
Sports : 306
Personalization : 294
Communication : 286
Action : 274
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 189
Simulation : 181
Dating : 165
Arcade : 163
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 123
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 71
Weather : 70
Events : 63
Adventure : 59
Comics : 53
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Trivia : 37
Casino : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

### Apple

**Primary genre**

In [25]:
display_table(apple_eng_free,12)

Games : 1866
Entertainment : 251
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 83
Utilities : 79
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 50
News : 43
Travel : 40
Finance : 35
Weather : 28
Food & Drink : 26
Reference : 17
Business : 17
Book : 12
Navigation : 6
Medical : 6
Catalogs : 4


We see that apps with the 'Games' primary genre are the most frequent among the free English apps on the App Store. With 1866 apps, this genre is more than 7 times as common as the runner-up, 'Entertainment', which has 251.

## Finding the most popular genres

In [26]:
ft_apple_genres = freq_table(apple_eng_free,12)

In [27]:
apple_genre_avg_users = {}

for genre in ft_apple_genres:
    total = 0
    length = 0
    
    for row in apple_eng_free:
        if row[12] == genre:
            total += int(row[6])
            length += 1
    print(genre)
    print(total / length)
    print('\n')
    apple_genre_avg_users[genre] = total / length


Productivity
21028.410714285714


Weather
52279.892857142855


Shopping
27230.734939759037


Reference
79350.4705882353


Finance
32367.02857142857


Music
57326.530303030304


Utilities
19156.493670886077


Travel
28243.8


Social Networking
71548.34905660378


Sports
23008.898550724636


Health & Fitness
23298.015384615384


Games
22886.36709539121


Food & Drink
33333.92307692308


News
21248.023255813954


Book
46384.916666666664


Photo & Video
28441.54375


Entertainment
14195.358565737051


Business
7491.117647058823


Lifestyle
16815.48


Education
7003.983050847458


Navigation
86090.33333333333


Medical
612.0


Catalogs
4004.0




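By average number of ratings, used here as a proxy for the number of users, the leading genres are Navigation (about 86,090), Reference (about 79,350) and Social Networking (about 71,548). The recommendation would be to look into building an app in one of these genres, leaning in favour of Navigation as it has the largest average user base. Note that the Navigation average is based on only 6 apps, so it may be skewed by a few very popular ones.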
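
The per-genre averages are printed in dictionary order, which makes them hard to compare by eye. Sorting the dictionary by value gives a ranking directly. The sketch below uses a hand-copied subset of the averages printed above, rounded to whole numbers; the variable name `avg_ratings_subset` is hypothetical, and in the notebook the same `sorted` call could be applied to `apple_genre_avg_users`.

```python
# A subset of the averages printed above, rounded to the nearest whole number.
avg_ratings_subset = {
    'Weather': 52280,
    'Reference': 79350,
    'Music': 57327,
    'Social Networking': 71548,
    'Games': 22886,
    'Navigation': 86090,
}

# Sort (genre, average) pairs by the average, largest first.
ranked = sorted(avg_ratings_subset.items(), key=lambda item: item[1], reverse=True)

for genre, avg in ranked:
    print(genre, ':', avg)
```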

In [28]:
ft_google_categories = freq_table(google_eng_free,1)

In [44]:
google_genre_avg_users = {}

for genre in ft_google_categories:
    total = 0
    length = 0
    
    for row in google_eng_free:
        installs = (row[5].replace('+','').replace(',',''))
         
        if row[1] == genre:
            total += int(installs)
            length += 1
    print(genre)
    print(total / length)
    print('\n')
    google_genre_avg_users[genre] = total / length

ART_AND_DESIGN
1986335.0877192982


AUTO_AND_VEHICLES
647317.8170731707


BEAUTY
513151.88679245283


BOOKS_AND_REFERENCE
8814199.78835979


BUSINESS
1712290.1474201474


COMICS
832613.8888888889


COMMUNICATION
38590581.08741259


DATING
854028.8303030303


EDUCATION
1833495.145631068


ENTERTAINMENT
11640705.88235294


EVENTS
253542.22222222222


FINANCE
1387692.475609756


FOOD_AND_DRINK
1924897.7363636363


HEALTH_AND_FITNESS
4188821.9853479853


HOUSE_AND_HOME
1360598.042253521


LIBRARIES_AND_DEMO
638503.734939759


LIFESTYLE
1446158.2238372094


GAME
15544014.51048951


FAMILY
3697848.1731343283


MEDICAL
120550.61980830671


SOCIAL
23253652.127118643


SHOPPING
7036877.311557789


PHOTOGRAPHY
17840110.40229885


SPORTS
3650602.276666667


TRAVEL_AND_LOCAL
13984077.710144928


TOOLS
10830251.970588235


PERSONALIZATION
5201482.6122448975


PRODUCTIVITY
16787331.344927534


PARENTING
542603.6206896552


WEATHER
5145550.285714285


VIDEO_PLAYERS
24727872.452830188


NEWS_AND_MAGAZ

For a revenue-generating free app, a game appears able to command a large install base. However, there is a lot of competition in this area, which would make it challenging. Another option would be a travel app: this area also has a large install base and is slightly less congested on both app stores.
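
To weigh install base against competition, the two sets of figures can be placed side by side. The sketch below copies a handful of Google Play values from the outputs above (average installs rounded to whole numbers). Treating the number of apps in a category as a rough proxy for competition is an assumption of this sketch, and `category_stats` is a hypothetical name.

```python
# (category, number of free English apps, average installs), copied from the outputs above.
category_stats = [
    ('GAME', 858, 15544015),
    ('TRAVEL_AND_LOCAL', 207, 13984078),
    ('COMMUNICATION', 286, 38590581),
    ('SOCIAL', 236, 23253652),
    ('PRODUCTIVITY', 345, 16787331),
]

# Order by average installs, largest first, and show the app count alongside.
by_installs = sorted(category_stats, key=lambda stats: stats[2], reverse=True)

for category, n_apps, avg_installs in by_installs:
    print(f'{category:<18} apps: {n_apps:>4}  avg installs: {avg_installs:>12,}')
```

Averages of this kind can be dominated by a few apps with very large install counts, so they are best read as a starting point rather than a ranking of opportunity.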

# Next Steps

* Analyze the frequency table for the Genres column of the Google Play data set, and see whether you can find useful patterns.
* Assume we could also make revenue via in-app purchases and subscriptions, and try to find out which genres seem to be liked the most by users; app ratings could be examined here.
* Refine your project using our data science project style guide.

[link to dataquest page](https://app.dataquest.io/m/350/guided-project%3A-profitable-app-profiles-for-the-app-store-and-google-play-markets/14/next-steps)