# Profitable App Profiles for the App Store and Google Play Markets

+ This project analyzes data for a company that builds Android and iOS mobile apps and makes them available on Google Play and the App Store.
+ Aim of the project: to analyze data that helps our developers understand what types of apps are likely to attract more users.


    


### Opening and exploring the Data

In [1]:
from csv import reader

# read the App Store data and convert it to a list of rows
open_file_app_store = open('AppleStore.csv', encoding='utf8')
read_app_store = reader(open_file_app_store)
ios  = list(read_app_store)
ios_header = ios[0]
ios = ios[1:]

# read the Google Play data and convert it to a list of rows
open_file_play_store = open('googleplaystore.csv', encoding='utf8')
read_play_store = reader(open_file_play_store)
android = list(read_play_store)
android_header = android[0]
android = android[1:]
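The cell above leaves both file handles open. A minimal sketch of the same loading step with a context manager follows; `load_csv` is a hypothetical helper name, and the UTF-8 encoding is an assumption about how the two files were saved.

```python
import csv

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path`."""
    # The `with` block closes the file even if reading fails.
    # newline="" is what the csv module documentation recommends
    # for files passed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Usage with the file names from the cell above:
# ios_header, ios = load_csv('AppleStore.csv')
# android_header, android = load_csv('googleplaystore.csv')
```

Returning the header and the data rows together avoids the two-step `ios[0]` / `ios[1:]` slicing.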



In [2]:
#writing the function explore_data()

def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start : end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds an empty line after each line
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        

#exploring data from both android and app store


In [3]:
# exploring iOS apps data using the explore_data function
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
# exploring Android apps data using the explore_data function
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


#### Column names for iOS apps
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

Full description of [column names](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/downloads/app-store-apple-data-set-10k-apps.zip/7) for iOS apps

#### Column names for Android apps

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

Full description of [column names](https://www.kaggle.com/lava18/google-play-store-apps/downloads/google-play-store-apps.zip/6) for Android apps

### Data Cleaning
To ensure accurate data, it is important to: 
1. Detect inaccurate data, and correct or remove it.
2. Detect duplicate data, and remove the duplicates.


In [5]:
# row 10472 has no 'Category' value, so its remaining values are shifted one column to the left
print(android[10472])



['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del android[10472]
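The row printed above has 12 values against 13 column names, which is how the problem shows up. A minimal sketch of finding such rows by length instead of by a hard-coded index follows; `find_malformed_rows` and the sample data are hypothetical and not taken from the real file.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]

# Tiny illustrative data, not taken from the real file.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Box', 'BUSINESS', '4.2'],
    ['Broken row', '1.9'],          # one value missing
    ['Slack', 'BUSINESS', '4.4'],
]

bad = find_malformed_rows(sample_rows, sample_header)
print(bad)

# Delete from the end so that earlier indexes stay valid.
for i, _ in reversed(bad):
    del sample_rows[i]
```

A length check is also safe to re-run, whereas running `del android[10472]` a second time would silently delete a valid row.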

* The data from the Google Play store contains numerous duplicate entries.
* The cell below counts the duplicates; the results are summarized here first.

* The number of duplicate apps: 1181

* The number of unique apps:  9659

Examples of duplicate apps

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']

In [7]:
unique_apps = []
duplicate_apps = []

for apps in android:
    name = apps[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('The number of duplicate apps:', len(duplicate_apps))
print('\n')
print('The number of unique apps: ' , len (unique_apps))
print('\n')
print(duplicate_apps[:15])

The number of duplicate apps: 1181


The number of unique apps:  9659


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
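Checking `name in unique_apps` scans a list on every iteration, which gets slow as the list grows. As a sketch of an alternative, the same two counts can be derived with `collections.Counter`; `count_duplicates` and the sample rows are hypothetical.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of repeated occurrences)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first one of a name is a duplicate.
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates

# Illustrative rows holding only an app name.
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(sample))  # (3, 2)
```

This counts duplicates the same way the loop above does: an app that appears three times contributes two duplicates.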


#### Criteria for removing duplicate data

The duplicate rows have to be removed. Our **criterion** is to keep, for each app, the row with the **highest number of reviews**, because it is likely to be the most recent and most reliable data.

In [8]:
# storing the highest review count found for each app name

reviews_max = {}

for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))


9659


In [9]:
# removing duplicate apps by comparing each row's review count
# with the maximum stored for that app in reviews_max

android_clean = []
already_added = []

for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(apps)
        already_added.append(name)

print(len(android_clean))
        


9659
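The two cells above can also be combined into a single pass that stores the best row for each name directly. This is only a sketch: `dedupe_by_max_reviews` is a hypothetical helper, and its output is ordered by the first appearance of each name, which may differ from the order of `android_clean`.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Illustrative rows: name, category, rating, reviews.
sample = [
    ['Box', 'BUSINESS', '4.2', '10'],
    ['Slack', 'BUSINESS', '4.4', '5'],
    ['Box', 'BUSINESS', '4.2', '12'],
    ['Box', 'BUSINESS', '4.2', '12'],
]
print(dedupe_by_max_reviews(sample))
```

Using a dictionary keyed by name also replaces the `name not in already_added` list scan with a constant-time lookup.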


* Let us explore the cleaned data using the `explore_data` function to check that it's correct.

In [10]:
explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


* After removing the duplicates, we have **9659** unique Android apps.

#### We need to remove all non-english apps from the dataset

* First, we write a function that flags app names that are probably not in English.
* In the next cell, we'll remove the rows corresponding to the non-English apps.

In [11]:
# the function to separate the non-English apps from the English apps

def in_english(string):
    non_english_character = 0
    for character in string:
        if ord(character) > 127:
            non_english_character += 1
        
    if non_english_character > 3:
        return False
    else:
        return True

print(in_english('Instagram'))
print('\n')
print(in_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))


True


False
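The function tolerates up to three characters outside the ASCII range, presumably so that English names containing an emoji or a trademark sign are not discarded. The sketch below makes that threshold a parameter to show its effect; `mostly_ascii` is a hypothetical name and the app names are illustrative, not checked against the data sets.

```python
def mostly_ascii(string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(mostly_ascii('Docs To Go™ Free Office Suite'))   # True: one non-ASCII character
print(mostly_ascii('Instachat 😜'))                     # True: one non-ASCII character
print(mostly_ascii('Instachat 😜', max_non_ascii=0))    # False: no tolerance at all
```

The check is a heuristic: a name written in a non-English language that uses only ASCII letters would still pass.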


#### Cleaning the data set for android apps by:

* Using the `in_english` function to filter out non-English apps from the Google Play data set.
* Looping through the de-duplicated Android data set (**android_clean**). If an app name is identified as English, append the whole row to a separate list (**android_eng_apps**).


In [12]:
# appending the English apps to a new list called android_eng_apps
android_eng_apps = []
for apps in android_clean:
    name = apps[0]
    if in_english(name):
        android_eng_apps.append(apps)
        
explore_data(android_eng_apps,0,3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


#### Cleaning the data set for iOS apps by:

* Using the `in_english` function to filter out non-English apps from the App Store data set.
* Looping through the **ios** apps data set. If an app name is identified as English, append the whole row to a separate list (**ios_eng_apps**).


In [13]:
# appending the English apps to a new list called ios_eng_apps
ios_eng_apps = []
for apps in ios:
    name = apps[1]
    if in_english(name):
        ios_eng_apps.append(apps)
        
explore_data(ios_eng_apps,0,3, True)


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


#### Isolating the Free apps from the Non-free in both data sets

* So far in the data cleaning process, we:

    1. Removed inaccurate data
    2. Removed duplicate app entries
    3. Removed non-English apps
    
    
* We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. 
* Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [14]:
#looping through android data to isolate free apps
free_android_apps = []
for apps in android_eng_apps:
    price = apps[7]
    if price == '0':
        free_android_apps.append(apps)
    
print('Number of free android apps: ' + str(len(free_android_apps)))

# looping through ios data to isolate free apps
free_ios_apps = []
for apps in ios_eng_apps:
    price = apps[4]
    if price == '0.0':
        free_ios_apps.append(apps)
        
        
print('Number of free ios apps: ' + str(len(free_ios_apps)))
        

Number of free android apps: 8864
Number of free ios apps: 3222
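The two comparisons above depend on the exact strings `'0'` and `'0.0'`. A numeric check is a little more robust; the sketch below assumes that non-free prices may carry a leading `$` (for example `'$4.99'`), and `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True when a price string represents zero."""
    # Assumption: the only non-numeric decoration is an optional leading '$'.
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With this helper, the same condition works for both data sets, for example `if is_free(apps[7])` for Android rows and `if is_free(apps[4])` for iOS rows.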


#### Validation strategy for an app idea

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

    1. Build a minimal Android version of the app, and add it to Google Play.
    2. If the app has a good response from users, we then develop it further.
    3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.
    
    
Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

In [19]:
# Building the following functions for the specific tasks.

# freq_table() generates a frequency table of percentages for one column;
# display_table() prints that table sorted in descending order.

def freq_table(dataset, index):
    table = {}
    total = 0
    for app in dataset:
        col = app[index]
        total += 1
        if col in table:
            table[col] += 1
        else:
            table[col] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])   
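For comparison, here is a shorter sketch of the same percentage table built with `collections.Counter`. `freq_table_counter` is a hypothetical name and the sample rows are illustrative.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

# Illustrative rows: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)

# Sort by percentage, highest first, as display_table() does.
for key, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(key, ':', pct)
```

Sorting with a `key` function avoids building the intermediate list of `(percentage, name)` tuples.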


### Part 3

* We start by examining the frequency table for the genre column of the App Store data set.


We can see that among the free English apps, more than half (58.16%) are games. **Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.**

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not match the supply.

Let's continue by examining the Genres and Category columns of the Google Play data set (two columns which seem to be related).



In [20]:
print('Table for ios prime genre')
display_table(free_ios_apps, 11)  
print('\n')

Table for ios prime genre
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665




The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

Even so, practical apps seem to be better represented on Google Play than on the App Store. The frequency table for the Genres column, shown after the Category table below, confirms this picture.



In [21]:
print('Table for android category')
display_table(free_android_apps, 1) 
print('\n')

Table for android category
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321


The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

In [22]:
print('Table for android genre')
display_table(free_android_apps, -4)

Table for android genre
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Ve

### Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but the App Store data set has no such column. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [23]:
ios_genre = freq_table(free_ios_apps, 11)


for genre in ios_genre:
    total = 0
    len_genre = 0
    for apps in free_ios_apps:
        genre_app = apps[-5]
        if genre_app == genre:
            n_ratings = float(apps[5])
            total += n_ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, ':', avg_ratings)
   

Reference : 74942.11111111111
Business : 7491.117647058823
News : 21248.023255813954
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Games : 22788.6696905016
Utilities : 18684.456790123455
Medical : 612.0
Navigation : 86090.33333333333
Music : 57326.530303030304
Travel : 28243.8
Weather : 52279.892857142855
Shopping : 26919.690476190477
Book : 39758.5
Education : 7003.983050847458
Social Networking : 71548.34905660378
Lifestyle : 16485.764705882353
Finance : 31467.944444444445
Entertainment : 14029.830708661417
Catalogs : 4004.0
Productivity : 21028.410714285714
Photo & Video : 28441.54375
Health & Fitness : 23298.015384615384
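The nested loops above re-scan the whole data set once per genre. A single-pass sketch with `collections.defaultdict` is shown below; `average_by_group` and the sample rows are hypothetical, and the result is sorted so that the largest averages come first.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Illustrative rows: genre, number of ratings.
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
averages = average_by_group(sample, 0, 1)

for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

For the App Store rows, the equivalent call would be `average_by_group(free_ios_apps, 11, 5)`.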


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings:

In [24]:
for apps in free_ios_apps:
    if apps[-5] == 'Navigation':
        print(apps[1], ':', apps[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


### Most Popular Apps by Genre on Google Play

In [37]:
android_category = freq_table(free_android_apps,1)

for category in android_category:
    total = 0
    len_category = 0
    for app in free_android_apps:
        category_app = app[1]
        if category == category_app:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '').replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = total / len_category
    print(category, ':' ,avg_installs )
        

HEALTH_AND_FITNESS : 4188821.9853479853
PRODUCTIVITY : 16787331.344927534
PERSONALIZATION : 5201482.6122448975
EVENTS : 253542.22222222222
FOOD_AND_DRINK : 1924897.7363636363
ENTERTAINMENT : 11640705.88235294
MAPS_AND_NAVIGATION : 4056941.7741935486
HOUSE_AND_HOME : 1331540.5616438356
FAMILY : 3695641.8198090694
EDUCATION : 1833495.145631068
TOOLS : 10801391.298666667
DATING : 854028.8303030303
BEAUTY : 513151.88679245283
SPORTS : 3638640.1428571427
WEATHER : 5074486.197183099
MEDICAL : 120550.61980830671
FINANCE : 1387692.475609756
SOCIAL : 23253652.127118643
LIFESTYLE : 1437816.2687861272
SHOPPING : 7036877.311557789
GAME : 15588015.603248259
COMICS : 817657.2727272727
TRAVEL_AND_LOCAL : 13984077.710144928
PARENTING : 542603.6206896552
COMMUNICATION : 38456119.167247385
PHOTOGRAPHY : 17840110.40229885
NEWS_AND_MAGAZINES : 9549178.467741935
BUSINESS : 1712290.1474201474
LIBRARIES_AND_DEMO : 638503.734939759
VIDEO_PLAYERS : 24727872.452830188
ART_AND_DESIGN : 1986335.0877192982
AUTO_AN
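The Installs values are open-ended buckets such as `'10,000+'`, so the averages above are built from lower bounds rather than exact counts. The sketch below isolates the string conversion in a helper and returns `None` for a category with no rows instead of dividing by zero; `parse_installs`, `average_installs`, and the sample rows are hypothetical.

```python
def parse_installs(text):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(text.replace('+', '').replace(',', ''))

def average_installs(rows, category, category_index=1, installs_index=5):
    """Return the mean install count for one category, or None if it is empty."""
    values = [parse_installs(row[installs_index])
              for row in rows if row[category_index] == category]
    if not values:
        return None
    return sum(values) / len(values)

# Illustrative rows; only columns 1 (category) and 5 (installs) are used.
sample = [
    ['A', 'TOOLS', '', '', '', '10,000+'],
    ['B', 'TOOLS', '', '', '', '30,000+'],
    ['C', 'GAME', '', '', '', '1,000,000+'],
]
print(average_installs(sample, 'TOOLS'))
print(average_installs(sample, 'WEATHER'))
```

Keeping the parsing in one place makes it easier to reuse when inspecting individual categories later.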

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and by others in the 100 million and 500 million install brackets:

In [39]:
for app in free_android_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess
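To see how much a handful of giant apps can move an average, one option is to compare the mean with the median and with the mean of the apps below some cap. The sketch below uses made-up install counts; `installs_summary` and the 100 million cap are assumptions chosen for illustration, not part of the analysis above.

```python
from statistics import mean, median

def installs_summary(install_counts, cap=100_000_000):
    """Compare the plain mean with two statistics less sensitive to outliers."""
    below_cap = [n for n in install_counts if n < cap]
    return {
        'mean': mean(install_counts),
        'median': median(install_counts),
        # None when every app is at or above the cap.
        'mean_below_cap': mean(below_cap) if below_cap else None,
    }

# Made-up install counts: two giants and three small apps.
sample = [1_000_000_000, 500_000_000, 100_000, 50_000, 10_000]
print(installs_summary(sample))
```

A large gap between the mean and the other two figures suggests that a category's average says little about what a typical app in it can expect.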

## Conclusion

Our data analysis suggests the following conclusion:

Taking a very popular book (perhaps a recent one) and turning it into an app could be profitable in both the Google Play and the App Store markets. Both markets are already full of libraries, so we need to add some special features beyond the raw text of the book. These might include daily quotes from the book, an audio version, quizzes on the book, and a forum where people can discuss it.