By Charles Fleury, guided by a project from app.dataquest.io

#App Store and Google Play Store applications profiling

##Goal and fictional context: 
We are part of a mobile development company and have been asked to find out what kind of application would be profitable on both the iOS and Android markets, based on collected data. Two guidelines direct our research and analysis: the app we'll build will be free and generate revenue from in-app ads, and it will be an English-language app.

##Opening and exploring data

We already have two datasets that suit our need:

A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play.

A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. 

In [114]:
from csv import reader

# Read each CSV fully into a list of rows; `with` closes the file afterwards
with open('AppleStore.csv', encoding='utf8') as opened_file_A:
    applestore_data = list(reader(opened_file_A))

with open('googleplaystore.csv', encoding='utf8') as opened_file_B:
    googleplay_data = list(reader(opened_file_B))

We create a function that shows us a slice of our dataset

In [115]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's look at the first few rows of each dataset to see what they contain:

In [116]:
explore_data(applestore_data, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


In [117]:
explore_data(googleplay_data, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


We see that both datasets have a header, so real data starts on row 1.
Interesting columns from our datasets are price, number of reviews, installs, name and ratings.

##Data cleaning

###Part 1: Error

From a [discussion on the Google Play dataset](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), we learn that row 10472 (actually 10473 with the header) has an error: it has 12 columns instead of 13, because the 'Category' value is missing. We'll delete it.

In [118]:
print(googleplay_data[0])
print(googleplay_data[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [119]:
del googleplay_data[10473]

In [120]:
print(googleplay_data[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
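
The faulty row above was found through the Kaggle discussion, but the same kind of error can also be detected programmatically. The following is a minimal sketch, assuming that a malformed row is one whose length differs from the header's; `find_malformed_rows` and the tiny `sample` list are hypothetical names and data used only for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # the header defines the expected column count
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample standing in for googleplay_data
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],  # the 'Category' value is missing
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Run against the full dataset, a check like this would flag any row with a missing or extra value, not only the one reported in the discussion.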


###Part 2: Removing duplicate apps

The [discussion on Google Play Apps](https://www.kaggle.com/lava18/google-play-store-apps/discussion) also reveals there are duplicate apps in the dataset. We'll create functions that identify them and eventually keep only the most pertinent instance of each app.

Here's an example showing a duplicate application:

In [121]:
for app in googleplay_data:
    name = app[0]
    if name == 'Facebook':
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


In [122]:
unique_apps = []
duplicate_apps = []

for app in googleplay_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print(duplicate_apps[:5])
print('Number of duplicate apps:', len(duplicate_apps))

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
Number of duplicate apps: 1181


We can see we actually have 1181 duplicate apps whose names are stored in our duplicate_apps list.
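
As an alternative to the two lists used above, the duplicates can be counted with `collections.Counter` from the standard library. This is only a sketch: `count_duplicates` and the shortened `sample` rows are hypothetical and are not part of the original project.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return {name: number_of_extra_copies} for names that appear more than once."""
    counts = Counter(row[name_index] for row in dataset)
    return {name: n - 1 for name, n in counts.items() if n > 1}


# Shortened, made-up rows: [name, number of reviews]
sample = [
    ['Facebook', '78158306'],
    ['Facebook', '78128208'],
    ['Box', '159'],
    ['Facebook', '1'],
]

dupes = count_duplicates(sample)
print(dupes)
print('Number of duplicate entries:', sum(dupes.values()))
```

Counting with a dictionary-like structure avoids the repeated `name in unique_apps` list lookups, which become slow as the list grows.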

To select which instance of the app we will work on, we will base our decision on the number of reviews. We saw above with the duplicate apps of Facebook that only the number of reviews varied from one app to its duplicate. We will work only with the instance of the app that has the higher number of reviews.

We start by creating a dictionary that stores each app name and its highest number of reviews.

In [123]:
reviews_max = {}

for app in googleplay_data[1:]: #excluding the header
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print('Expected length:', len(googleplay_data) - 1181 - 1) #removing duplicates and header
print('Real length of dictionary:', len(reviews_max))
        

Expected length: 9659
Real length of dictionary: 9659


We will now use reviews_max to remove the duplicate apps from our data. We will append an app from the original dataset to our clean data only if it has the highest number of reviews for its name and has not been added yet.

In [124]:
android_clean = []
already_added = []

for app in googleplay_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
    
print('Expected length: 9659 rows')
print('Real length of cleaned data:', len(android_clean), 'rows')
    

Expected length: 9659 rows
Real length of cleaned data: 9659 rows


We have successfully removed the duplicate android apps.
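
The two steps above (building reviews_max, then filtering) can also be merged into a single pass. The sketch below assumes, as in the Google Play data, that the name is in column 0 and the number of reviews in column 3; `keep_most_reviewed` and the sample rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest number of reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Store the row the first time a name is seen, or when it beats the stored one
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows without a header: [name, category, rating, reviews]
rows = [
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
]

for row in keep_most_reviewed(rows):
    print(row)
```

When two duplicates have the same number of reviews, the first one encountered is kept, which matches the behaviour of the already_added check above.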

###Part 3: Keeping only English apps

Since we only want to work with English apps, we also want to remove any app whose name contains non-English characters. The characters commonly used in English text all have a code in the ASCII range, that is 127 or below, so we will treat any character with a code above 127 as non-English.

We will now create a function that checks if a string contains only English characters.

In [125]:
def is_english(s):
    for char in s:
        if ord(char) > 127:
            return False
    return True
    

Let's check our function on a couple of strings:

In [126]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™'))
print(is_english('Instaboom 😜'))


True
False
False
False


The function works, but we might want to keep English apps whose names contain characters like emojis or trademark symbols, since they can be useful to our analysis.

We will therefore modify the function so that it only filters out app names with more than 3 characters that fall outside the ASCII range.

In [127]:
def is_english(s):
    non_english_count = 0
    for char in s:
        if ord(char) > 127:
            non_english_count += 1
    if non_english_count > 3:
        return False
    else:
        return True

Let's check our new function on a couple of strings:

In [128]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™'))
print(is_english('Instaboom 😜'))

True
False
True
True


We will now be able to keep more useful data from our set.
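
The threshold of 3 is a judgment call, so it can be convenient to expose it as a parameter. The variant below is a sketch of the same idea; the `max_non_ascii` parameter name is an assumption introduced here for illustration.

```python
def is_english(s, max_non_ascii=3):
    """Return True if s has at most max_non_ascii characters outside the ASCII range."""
    non_ascii = sum(1 for char in s if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_english('Docs To Go™'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instaboom 😜', max_non_ascii=0))  # strict mode: no non-ASCII allowed
```

This remains a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass the filter.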

Let's filter the non-English apps out of our Apple and Android datasets and create new lists containing only English apps.

Remember that our latest cleaned android data is named android_clean. As for the apple data, we'll use the initial applestore_data list.

In [129]:
def filter_en(rough_data, clean_data, name_column):
    for app in rough_data:
        name = app[name_column]
        if is_english(name):
            clean_data.append(app)
        
    

android_clean_en = []
ios_clean_en = []

filter_en(android_clean, android_clean_en, 0)
filter_en(applestore_data[1:], ios_clean_en, 1)  # skip the header row so it is not counted as an app

print('Number of english apps from Google Play Store:', len(android_clean_en))
print('Number of english apps from Apple Store:', len(ios_clean_en))

Number of english apps from Google Play Store: 9614
Number of english apps from Apple Store: 6183


###Part 4: Keeping only free apps

The last step from the cleaning process is to keep only free apps, since our fictional startup's source of revenue is in-app ads. In the rows of our Android data, the price is located at [7], in the Apple data it's located at [4].

In [130]:
print(googleplay_data[0][7])
print(applestore_data[0][4])


Price
price


In [131]:
android_free = []
ios_free = []

# We did not create a function here because the string format
# of the price differs from one dataset to the other.
for app in android_clean_en:
    price = app[7]
    if price == '0':
        android_free.append(app)

for app in ios_clean_en:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
        
print('Number of free english apps from Google Play Store:', len(android_free))
print('Number of free english apps from Apple Store:', len(ios_free))   

Number of free english apps from Google Play Store: 8864
Number of free english apps from Apple Store: 3222


That leaves 8864 Android apps and 3222 iOS apps for our analysis.
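
The two loops above compare against '0' and '0.0' because the price strings are formatted differently in each dataset. A single helper could cover both by converting the price to a number first. The sketch below assumes that paid prices may carry a leading '$' sign; this is an assumption about the data, and `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$0' represents zero."""
    cleaned = price_text.strip().lstrip('$')  # assumption: prices may start with '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a header cell): treat as not free
        return False


for text in ['0', '0.0', '$4.99', 'Price']:
    print(repr(text), is_free(text))
```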

Let's give our final, clean datasets more convenient names.

In [132]:
android_data = android_free
ios_data = ios_free

##Analysis: Genre profiling

As previously mentioned, our fictional startup wants to determine which kinds of apps are likely to be profitable, judged by the number of users who would be exposed to in-app ads.

To minimize the workload and risk for the startup, the process of building an app would be as follows:

1. Build a minimal app and add it to Google Play.
2. If the app is successful, we develop it further.
3. If it is still profitable after 6 months, we develop the iOS version and add it to the App Store.

The end goal is to have our app run on both Android and iOS, so we will analyze each market.

### Part 1: Analysing popular genres with number of apps
Let's start by taking a look at the main genres in each app market.

These are the names of the columns we'll use to generate our analysis of app profiles, taken from the original header of the dataset:

In [133]:
print('Interesting columns from Google Play data:')
print(googleplay_data[0][9])
print(googleplay_data[0][1], '\n')

print('Interesting column from Apple Store data:')
print(applestore_data[0][11])

Interesting columns from Google Play data:
Genres
Category 

Interesting column from Apple Store data:
prime_genre


The 'display_table()' function was given to us to help sort a frequency table. It converts the dictionary returned by 'freq_table()' into a list of (frequency, value) tuples and sorts it in descending order before printing it:

In [134]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

We'll now create the 'freq_table()' function, which takes a dataset and the index of a column and returns the frequency table for that column, expressed as proportions of the total number of rows.

In [135]:
def freq_table(dataset, index):
    f_table = {}
    total_items = len(dataset)
    for row in dataset:
        val = row[index]
        if val in f_table:
            f_table[val] += 1
        else:
            f_table[val] = 1
    # converting counts into proportions of the total (between 0 and 1, not percentages)
    for attribute in f_table:
        f_table[attribute] = f_table[attribute] / total_items
    return f_table
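
For comparison, the same frequency table can be built with `collections.Counter`. The sketch below returns percentages (0 to 100) rather than proportions, which is a deliberate difference from 'freq_table()'; `freq_table_percent` and the sample rows are hypothetical.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows: [name, genre]
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Games'],
    ['App D', 'Music'],
]

table = freq_table_percent(sample, 1)
# Sort by percentage, highest first, before printing
for genre, percent in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percent)
```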
            

Let's run our 'display_table()' function with the interesting columns from our datasets.

In [136]:
print('Genres frequency table from Android:')
display_table(android_data, 9)
print('\nCategory frequency table from Android:')
display_table(android_data, 1)
print('\nprime_genre frequency table from iOS:')
display_table(ios_data, 11)

Genres frequency table from Android:
Tools : 0.08449909747292418
Entertainment : 0.06069494584837545
Education : 0.05347472924187725
Business : 0.04591606498194946
Productivity : 0.03892148014440433
Lifestyle : 0.03892148014440433
Finance : 0.03700361010830325
Medical : 0.03531137184115524
Sports : 0.03463447653429603
Personalization : 0.03316787003610108
Communication : 0.032378158844765345
Action : 0.03102436823104693
Health & Fitness : 0.030798736462093863
Photography : 0.02944494584837545
News & Magazines : 0.027978339350180504
Social : 0.026624548736462094
Travel & Local : 0.023240072202166066
Shopping : 0.022450361010830325
Books & Reference : 0.021435018050541516
Simulation : 0.020419675090252706
Dating : 0.01861462093862816
Arcade : 0.018501805054151624
Video Players & Editors : 0.017712093862815883
Casual : 0.01759927797833935
Maps & Navigation : 0.013989169675090252
Food & Drink : 0.012409747292418772
Puzzle : 0.01128158844765343
Racing : 0.009927797833935019
Role Playing : 0

Analysis of Genres and Category in the Android free English apps market:

In the Android market, the most common Genres are Tools and then Entertainment, but each represents less than 10% of the apps. As for Category, Family comes first with a little less than 19% of the apps, while Game and Tools take second and third place with between 8% and 10% each. These results suggest that the Android market is fairly balanced between apps serving practical and entertainment purposes.

Analysis of prime genre in the iOS free English apps market:

The prime genre results for the iOS market show that Games dominate: they represent more than 58% of all free English apps. Entertainment comes second. The iOS market is clearly populated mostly by apps built for entertainment.

###Part 2: Analysing popular genres by number of users and ratings

To get a better idea of which app genres are popular, we'll look at the number of users in the Android market (Installs). For the iOS market, we'll look at the number of ratings (rating_count_tot) as a proxy, because the number of users isn't present in this dataset. Let's look at the latter first.

####iOS

In [137]:
#Calculating average number of user ratings by genre in iOS market
genre_table = freq_table(ios_data, 11)

print('Average number of user ratings by genre in iOS market:')
for genre in genre_table:
    total_ratings = 0
    len_genre = 0
    
    for app in ios_data:
        genre_app = app[11]
        if genre_app == genre:
            ratings = float(app[5])
            total_ratings += ratings
            len_genre += 1
    
    avg_ratings = total_ratings / len_genre
    print(genre + ':', avg_ratings)
            

Average number of user ratings by genre in iOS market:
Education: 7003.983050847458
Travel: 28243.8
News: 21248.023255813954
Utilities: 18684.456790123455
Food & Drink: 33333.92307692308
Entertainment: 14029.830708661417
Productivity: 21028.410714285714
Shopping: 26919.690476190477
Reference: 74942.11111111111
Lifestyle: 16485.764705882353
Sports: 23008.898550724636
Medical: 612.0
Games: 22788.6696905016
Navigation: 86090.33333333333
Weather: 52279.892857142855
Catalogs: 4004.0
Social Networking: 71548.34905660378
Book: 39758.5
Photo & Video: 28441.54375
Business: 7491.117647058823
Health & Fitness: 23298.015384615384
Finance: 31467.944444444445
Music: 57326.530303030304


Analysis of iOS average number of ratings by genre: 

Navigation, Reference, Social Networking and Music all generate a lot of ratings for their average app. Education, Business, Catalogs and Medical stand out as having fewer than 10,000 ratings on average. All other genres have a decent number of ratings, and a good app in any of them could be viable by this metric.

####iOS App Profile Recommendation: 
Based on our study of the iOS market for free English apps, we would recommend developing a Food & Drink app, which could take the form of a digital recipe book that includes ratings, reviews and reactions from the community. This genre represents less than 1% of the market, but its apps generate more than 30,000 ratings on average, which suggests a good response from users. If we can add the very popular interactive side of Social Networking apps to it, we believe such an app could be a success on the App Store.

####Android

The install numbers from the Android dataset aren't precise, but we'll treat an app in the 100,000+ range as having 100,000 installs, as this gives us a rough idea of the number of installs for apps of a given category. We will use the Category column, as the categories are more clearly defined than the genres.

In [138]:
#imprecise numbers of installs
display_table(android_data, 5)

1,000,000+ : 0.1572653429602888
100,000+ : 0.11552346570397112
10,000,000+ : 0.10548285198555957
10,000+ : 0.10198555956678701
1,000+ : 0.08393501805054152
100+ : 0.06915613718411552
5,000,000+ : 0.06825361010830325
500,000+ : 0.05561823104693141
50,000+ : 0.047721119133574005
5,000+ : 0.04512635379061372
10+ : 0.035424187725631766
500+ : 0.032490974729241874
50,000,000+ : 0.023014440433212997
100,000,000+ : 0.021322202166064983
50+ : 0.01917870036101083
5+ : 0.0078971119133574
1+ : 0.0050767148014440435
500,000,000+ : 0.002707581227436823
1,000,000,000+ : 0.002256317689530686
0+ : 0.0004512635379061372
0 : 0.0001128158844765343
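
To average these values, each install bucket has to be converted from a string such as '1,000,000+' into a number. A minimal sketch of that conversion is shown below; `parse_installs` is a hypothetical helper, and it treats each bucket as its lower bound, as described above.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to its lower bound as an int."""
    return int(text.replace('+', '').replace(',', ''))


for bucket in ['1,000,000+', '500+', '0+', '0']:
    print(bucket, '->', parse_installs(bucket))
```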


In [140]:
#Calculating average number of installs by category in Android market
category_table = freq_table(android_data, 1)

print('Average number of installs by category in Android market:')
for category in category_table:
    total_installs = 0
    len_category = 0
    
    for app in android_data:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total_installs += installs
            len_category += 1
    
    avg_installs = total_installs / len_category
    print(category + ':', avg_installs)

Average number of installs by category in Android market:
BEAUTY: 513151.88679245283
HEALTH_AND_FITNESS: 4188821.9853479853
COMMUNICATION: 38456119.167247385
MAPS_AND_NAVIGATION: 4056941.7741935486
NEWS_AND_MAGAZINES: 9549178.467741935
SPORTS: 3638640.1428571427
SOCIAL: 23253652.127118643
PHOTOGRAPHY: 17840110.40229885
FINANCE: 1387692.475609756
BOOKS_AND_REFERENCE: 8767811.894736841
LIFESTYLE: 1437816.2687861272
FAMILY: 3695641.8198090694
GAME: 15588015.603248259
ART_AND_DESIGN: 1986335.0877192982
COMICS: 817657.2727272727
HOUSE_AND_HOME: 1331540.5616438356
VIDEO_PLAYERS: 24727872.452830188
SHOPPING: 7036877.311557789
TOOLS: 10801391.298666667
MEDICAL: 120550.61980830671
EDUCATION: 1833495.145631068
AUTO_AND_VEHICLES: 647317.8170731707
PERSONALIZATION: 5201482.6122448975
FOOD_AND_DRINK: 1924897.7363636363
TRAVEL_AND_LOCAL: 13984077.710144928
EVENTS: 253542.22222222222
LIBRARIES_AND_DEMO: 638503.734939759
PARENTING: 542603.6206896552
ENTERTAINMENT: 11640705.88235294
DATING: 854028.8303

Let's see the categories that generate an average of over 10M installs per app:

In [141]:
#Keeping only the categories with an average of more than 10M installs per app
category_table = freq_table(android_data, 1)

print('Average number of installs by category in Android market (10M+):')
for category in category_table:
    total_installs = 0
    len_category = 0
    
    for app in android_data:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total_installs += installs
            len_category += 1
    
    avg_installs = total_installs / len_category
    if avg_installs > 10000000:
        print(category + ':', avg_installs)

Average number of installs by category in Android market (10M+):
COMMUNICATION: 38456119.167247385
SOCIAL: 23253652.127118643
PHOTOGRAPHY: 17840110.40229885
GAME: 15588015.603248259
VIDEO_PLAYERS: 24727872.452830188
TOOLS: 10801391.298666667
TRAVEL_AND_LOCAL: 13984077.710144928
ENTERTAINMENT: 11640705.88235294
PRODUCTIVITY: 16787331.344927534
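
The filtered list above is printed in dictionary order, which makes the ranking hard to read. Once the averages are stored in a dictionary, they can be filtered and sorted in one step. The sketch below uses three of the averages printed earlier as sample input; `top_groups` is a hypothetical helper.

```python
def top_groups(averages, threshold):
    """Return (group, average) pairs above the threshold, highest average first."""
    kept = [(group, avg) for group, avg in averages.items() if avg > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# A few averages copied from the output above
averages = {
    'TOOLS': 10801391.298666667,
    'MEDICAL': 120550.61980830671,
    'COMMUNICATION': 38456119.167247385,
}

for category, avg in top_groups(averages, 10_000_000):
    print(category + ':', avg)
```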


Analysis of Android average number of installs by category: 

The popular categories above offer great potential for a future app. Since no category is overly saturated in the Google Play Store, we could go in many directions here. We know that communication and social apps are dominated by giants of the industry. The travel and local category is interesting because of the worldwide reach it can have and the possible partnerships with foreign advertisers.

###Final app recommendation

After studying the free English apps of the two major markets, we spotted a good opportunity for a Food & Drink app on iOS and for a Travel app on Android. We would also like to take advantage of the interest in social interaction within applications. Because we want to build a profitable app for both markets, here's what we recommend for a good chance of success:

A worldwide foodie app that connects users with nearby food markets and restaurants and provides them with reviews and insights from the community. 'Cook your meal' and 'Go out for a meal' could be the two sections of this app, and advertising revenue would come from the restaurants, cafes, markets, pubs and bars that want to be featured in our app.