# Improving App Profitability via Data Analysis

-----------------------------------------------------------------------------
This project was done as part of the [Python for Data Science: Fundamentals][1] class on DataQuest, though I have explored beyond what the project suggests.

* **Concepts learned:** Python syntax and fundamental data structures, with exposure to Jupyter Notebooks, GitHub and Kaggle

* **Main challenges:** Cleaning up two large datasets with different structures. Analyzing skewed or small datasets.

-----------------------------------------------------------------------------

The aim of this project is to give developers more insight into what kinds of apps return better profits, based on user data. The data under study covers free, English-only apps in the IOS and Android markets.

[1]:https://www.dataquest.io/course/python-for-data-science-fundamentals/

### 1. Data Review

The very first thing we are going to do is to open the datasets we obtained from the links below. We will look at the content of the data and how it is structured. 
- [Android](https://www.kaggle.com/lava18/google-play-store-apps/home)

- [IOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)


In [1]:
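from csv import reader

def open_csv(csv_file):
    # Read the whole CSV file into a list of rows (each row is a list of strings);
    # the 'with' block closes the file once reading is done
    with open(csv_file, encoding='utf8') as data_file:
        return list(reader(data_file))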

In [2]:
ios_data = open_csv('AppleStore.csv')
android_data = open_csv('googleplaystore.csv')

In [3]:
def print_header(string):
    print('#################################################')
    print('#',string)
    print('#################################################\n')

In [4]:
def explore_data(dataset, source, start, end, rows_and_columns=False):
    string = source+' data samples and dimensions'
    print_header(string)
    
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row,'\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]),'\n')

In [5]:
explore_data(android_data,'Android', 0, 2, True)
explore_data(ios_data,'IOS', 0, 2, True)

#################################################
# Android data samples and dimensions
#################################################

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

Number of rows: 10842
Number of columns: 13 

#################################################
# IOS data samples and dimensions
#################################################

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social N

### 2. Data Wrangling

In this section we will modify the datasets such that they are accurate and have the right content for our project's scope.

#### 2.1. Missing Columns

The [discussions][1] for the Android data indicate that the dataset has some issues.

First of all, we know that not all entries have the same number of columns. Let us check both datasets for rows with missing columns and remove those entries.

[1]: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion


In [6]:
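def check_n_cols(data_list, source):
    # Keep only the rows whose column count matches the header row.
    # A new list is built instead of deleting from data_list while looping over it,
    # which would skip the row that follows each deleted entry.
    expected_n_col = len(data_list[0])
    print('Initial number of entries in', source, ':', len(data_list))
    clean_list = []
    for line_no, each_row in enumerate(data_list, start=1):
        if len(each_row) != expected_n_col:
            print('At line: ', line_no, ' number of columns:', len(each_row), ', expected number of columns:', expected_n_col)
        else:
            clean_list.append(each_row)
    print('Final number of entries in', source, ':', len(clean_list), '\n')
    return clean_list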

In [7]:
android_data = check_n_cols(android_data,'Android')
ios_data = check_n_cols(ios_data,'IOS')

Initial number of entries in Android : 10842
At line:  10474  number of columns: 12 , expected number of columns: 13
Final number of entries in Android : 10841 

Initial number of entries in IOS : 7198
Final number of entries in IOS : 7198 



#### 2.2. Duplicate Entries

Secondly, from the discussions mentioned above we see that there are duplicate entries in the Android dataset. The checks below show that some of the entries are identical to each other, whereas others have a different number of reviews for the same app.

In [8]:
def check_identicals(data_list, source):
  
    tuple_list = []                                # Changing the nested lists into nested tuples      
    for each_row in data_list[1:]:                 # so that we can use the set function on them
        tuple_list.append(tuple(each_row))         # lists are not hashable
                            
    data_set = set(tuple_list)                     # set function takes only the unique entries               
    
    if (len(data_set) != (len(data_list)-1)):
        print(source,'dataset has',(len(data_list)-1-len(data_set)),'identical entries \n')
    else:
        print (source,'dataset has NO identical entries')

In [9]:
check_identicals(android_data,'Android')
check_identicals(ios_data,'IOS')

Android dataset has 483 identical entries 

IOS dataset has NO identical entries


In [10]:
def check_duplicates(data_list, source,name_index):
  
    unique_list =[]
    duplicates_list = []
    
    for each_row in data_list[1:]:  
        app_name = each_row[name_index]
        if(app_name not in unique_list):
            unique_list.append(app_name)
        else:
            duplicates_list.append(app_name)
 
    print(source,'dataset has',len(duplicates_list),'duplicate entries \n')
    return duplicates_list

In [11]:
def print_samples(data_list, duplicates_list,source, name_index):
    header = 'Sample '+source+' Data:'
    print_header(header)
    print (data_list[0],'\n')
    for each_row in data_list:
        if (each_row[name_index] == duplicates_list[0]):
            print(each_row)
    print('\n')     

In [12]:
android_duplicates = check_duplicates(android_data,'Android',0)
print_samples(android_data,android_duplicates,'Android',0)
        
ios_duplicates = check_duplicates(ios_data,'IOS',1)
print_samples(ios_data,ios_duplicates,'IOS',1)


Android dataset has 1181 duplicate entries 

#################################################
# Sample Android Data:
#################################################

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


IOS dataset has 2 duplicate entries 

#################################################
# Sample

**To clean up the duplicate entries,** we need to select a representative entry for each app. Here, we assume that the entry with the highest number of reviews for a given app is the latest one, so it is used as the representative entry for that app.
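
As a small illustration of this rule, the sketch below keeps, for each app name, the row with the most reviews. The rows and the helper name `keep_most_reviewed` are made up for the example and are not part of the notebook's own cells.

```python
# Toy rows: [app name, number of reviews]; the values are invented for illustration
sample_rows = [
    ['App A', '80805'],
    ['App A', '80804'],
    ['App B', '12'],
    ['App A', '80805'],
]

def keep_most_reviewed(rows, name_index=0, reviews_index=1):
    """Return one row per app name, preferring the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        # Replace the stored row only when the new one has strictly more reviews
        if name not in best or int(row[reviews_index]) > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

representatives = keep_most_reviewed(sample_rows)
print(representatives)
```

When two rows tie on the review count, this sketch keeps the first one it encounters.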

In [13]:
def clean_duplicates(data_list,source,name_index,rating_cnt_index):

    unique_dic  = {}
    
    for each_row in data_list[1:]:
        app_name = each_row[name_index]
              
        if (app_name not in unique_dic):
            unique_dic[app_name] = each_row          
        else:
            n_reviews = int(each_row[rating_cnt_index])
            prev_n_reviews = int(unique_dic[app_name][rating_cnt_index])
            if(n_reviews > prev_n_reviews):
                unique_dic[app_name] = each_row          
    
    initial_cnt = len(data_list)
    data_list = [data_list[0]]+list(unique_dic.values())

    n_duplicate = initial_cnt - len(data_list)
    print('Out of',initial_cnt,source,'entries ',n_duplicate, 'have content for the same app name')
    print(source,'data has now',len(data_list),'entries\n')
    return data_list

In [14]:
android_data = clean_duplicates(android_data,'Android',0,3)
ios_data = clean_duplicates(ios_data,'IOS',1,5)

Out of 10841 Android entries  1181 have content for the same app name
Android data has now 9660 entries

Out of 7198 IOS entries  2 have content for the same app name
IOS data has now 7196 entries



#### 2.3. Remove Non-English Content

As the third step, we will remove the apps that are not in English, since our scope is limited to the American market.

To detect whether an app name is in English, we will look at the code points (as returned by `ord()`) of the characters in the app's name. The characters commonly used in English text fall in the ASCII range, with codes from 0 to 127. So a name made up only of such characters is at least written in the Latin alphabet and is probably (though not certainly) in English.

There are a couple of corner cases where this approach fails, though. Superscripts and emojis that we encountered in our datasets have code points larger than 127, and they don't follow an easy-to-detect pattern. Hence, we allow a certain number of characters with code points above 127 in a name that we still treat as English. As you can expect, this algorithm is not bullet-proof.
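
To make the idea concrete, here is a minimal sketch that counts the characters outside the ASCII range in a name. The helper name `count_non_ascii` is hypothetical and only serves this illustration.

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for char in text if ord(char) > 127)

print(ord('a'))    # 97, inside the ASCII range
print(ord('™'))    # 8482, outside the ASCII range

# One trademark sign in an otherwise ASCII name
print(count_non_ascii('Docs To Go™ Free Office Suite'))
# A name written mostly in Chinese characters gives a much higher count
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

A name passes the check when this count stays at or below the chosen tolerance.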

In [15]:
def is_english(string, tolerance,verbose = False):
    english = True
    counter = 0
    for each_char in (string):
        if (ord(each_char) > 127) :     
            counter += 1
            if counter > tolerance:
                english = False
    if verbose : print(string,' is English?',english) 
    return english

Let us test the above function on following test cases:

In [16]:
a = 'Docs To Go™ Free Office Suite'
b = 'Instachat 😜'
c = '爱奇艺PPS -《欢乐颂2》电视剧热播'

for test_case in [a,b,c]:
    is_english (test_case,2,True)

Docs To Go™ Free Office Suite  is English? True
Instachat 😜  is English? True
爱奇艺PPS -《欢乐颂2》电视剧热播  is English? False


In [17]:
def pick_eng_apps (data_list, source, name_index, tolerance, verbose = False):
    index = 1
    ne_app_names = []
    initial_cnt = len(data_list)
    
    for each_row in data_list[1:]:
        app_name = each_row[name_index]
        
        if not(is_english(app_name,tolerance)):
            ne_app_names.append(app_name)
            del data_list[index]
        else:
            index += 1

    n_neng = initial_cnt-len(data_list)
    print('Out of',initial_cnt,source,'entries ',n_neng, 'have non-English content for a tolerance of',tolerance,'characters')
    print(source,'data has now',len(data_list),'entries\n')
    
    if verbose: print(ne_app_names,'\n')
    return data_list

**You can play with the following code piece** to see which non-English app names we catch for a given tolerance level (set to 3 characters in the cell below).

In [18]:
tolerance = 3 
verbose = False    #change to 'True' to see the list of non-English app names
android_data = pick_eng_apps(android_data,'Android',0,tolerance,verbose)
ios_data = pick_eng_apps(ios_data,'IOS',1,tolerance,verbose)

Out of 9660 Android entries  45 have non-English content for a tolerance of 3 characters
Android data has now 9615 entries

Out of 7196 IOS entries  1014 have non-English content for a tolerance of 3 characters
IOS data has now 6182 entries



#### 2.4. Pick Free Apps

As the final step, we will remove the paid apps from our datasets. To do that, we will look at the price of each app and remove the entries whose price is not zero.
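
A numeric comparison is one alternative to matching exact price strings. The sketch below assumes prices look like '0', '0.0' or '$4.99'; that format, and the helper name `is_free`, are assumptions made for illustration.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99'; other formats are not handled.
    """
    return float(price_text.replace('$', '')) == 0.0

for price in ['0', '0.0', '$4.99']:
    print(price, '->', is_free(price))
```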

In [19]:
def pick_free_apps (data_list,source,price_index):
    index = 1 
    initial_cnt = len(data_list)
    for each_row in data_list[1:]:
        price = each_row[price_index]
        if ((price != '0.0') and (price != '0')) :
            del data_list[index]
        else:
            index += 1
    
    print('Out of',initial_cnt,source,'entries ',initial_cnt-len(data_list), 'are for paid apps')
    print(source,'data has now',len(data_list),'entries\n')
    return data_list

In [20]:
android_data = pick_free_apps(android_data,'Android',7)
ios_data = pick_free_apps(ios_data,'IOS',4)

Out of 9615 Android entries  750 are for paid apps
Android data has now 8865 entries

Out of 6182 IOS entries  2961 are for paid apps
IOS data has now 3221 entries



### 3. Data Analysis

In this section we will look at the cleaned datasets and try to deduce patterns for the popularity, and hence the profitability, of an app genre.

#### 3.1. Ranking with respect to Number of Apps in Store 

Let us first look at which genres have the most apps in each store. To do that, we will create frequency tables and sort them in descending order.
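
For reference, the same kind of table can be sketched with `collections.Counter` from the standard library. The genre labels below are invented, and `percentage_table` is a hypothetical helper rather than one of the notebook's functions.

```python
from collections import Counter

# Made-up genre labels standing in for one column of the cleaned dataset
genres = ['Games', 'Games', 'Music', 'Games', 'Weather']

def percentage_table(values):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(values)
    total = len(values)
    return [(value, 100 * count / total) for value, count in counts.most_common()]

for genre, share in percentage_table(genres):
    print(genre, ':', round(share, 1))
```

Computing each percentage from the final count, rather than adding `100/total` once per row, avoids accumulating small floating-point errors.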

In [21]:
def freq_table (data_list, index):
    
    total = len(data_list)-1
    f_table = {}
    
    for each_row in data_list[1:]:
        key = each_row [index]
        
        if key not in f_table:
            f_table[key] = 100/total
        else:
            f_table[key] += 100/total
    
    return f_table

In [22]:
def display_table(data_list, index):
    table = freq_table(data_list, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [23]:
display_table(ios_data,11)      # For IOS prime_genres

Games : 58.13664596273457
Entertainment : 7.888198757764019
Photo & Video : 4.968944099378889
Education : 3.66459627329192
Social Networking : 3.2919254658385046
Shopping : 2.6086956521739095
Utilities : 2.5155279503105556
Sports : 2.14285714285714
Music : 2.0496894409937862
Health & Fitness : 2.0186335403726683
Productivity : 1.7391304347826066
Lifestyle : 1.5838509316770168
News : 1.3354037267080732
Travel : 1.2422360248447193
Finance : 1.1180124223602474
Weather : 0.8695652173913038
Food & Drink : 0.8074534161490678
Reference : 0.5590062111801242
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


**Android data** has two columns related to the genres of the apps. Let us look at the data in detail.  

In [24]:
display_table(android_data,1)    #For Android 'Category' column

FAMILY : 18.907942238266926
GAME : 9.724729241877363
TOOLS : 8.46119133574016
BUSINESS : 4.591606498194979
LIFESTYLE : 3.90342960288811
PRODUCTIVITY : 3.8921480144404565
FINANCE : 3.7003610108303455
MEDICAL : 3.5311371841155417
SPORTS : 3.3957581227436986
PERSONALIZATION : 3.3167870036101235
COMMUNICATION : 3.2378158844765483
HEALTH_AND_FITNESS : 3.079873646209398
PHOTOGRAPHY : 2.944494584837555
NEWS_AND_MAGAZINES : 2.7978339350180583
SOCIAL : 2.6624548736462152
TRAVEL_AND_LOCAL : 2.335288808664261
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.14350180505415
DATING : 1.861462093862813
VIDEO_PLAYERS : 1.7937725631768928
MAPS_AND_NAVIGATION : 1.398916967509025
FOOD_AND_DRINK : 1.2409747292418778
EDUCATION : 1.1620036101083042
ENTERTAINMENT : 0.9589350180505433
LIBRARIES_AND_DEMO : 0.9363718411552363
AUTO_AND_VEHICLES : 0.9250902527075828
HOUSE_AND_HOME : 0.8235559566787015
WEATHER : 0.8009927797833946
EVENTS : 0.7107400722021667
PARENTING : 0.6543321299638993
ART_AND_DESIGN : 0.6

In [25]:
display_table(android_data,9)    #For Android 'Genres' column

Tools : 8.449909747292507
Entertainment : 6.069494584837599
Education : 5.34747292418777
Business : 4.591606498194979
Productivity : 3.8921480144404565
Lifestyle : 3.8921480144404565
Finance : 3.7003610108303455
Medical : 3.5311371841155417
Sports : 3.46344765342962
Personalization : 3.3167870036101235
Communication : 3.2378158844765483
Action : 3.1024368231047053
Health & Fitness : 3.079873646209398
Photography : 2.944494584837555
News & Magazines : 2.7978339350180583
Social : 2.6624548736462152
Travel & Local : 2.3240072202166075
Shopping : 2.2450361010830324
Books & Reference : 2.14350180505415
Simulation : 2.041967509025268
Dating : 1.861462093862813
Arcade : 1.8501805054151597
Video Players & Editors : 1.771209386281586
Casual : 1.7599277978339327
Maps & Navigation : 1.398916967509025
Food & Drink : 1.2409747292418778
Puzzle : 1.1281588447653441
Racing : 0.9927797833935037
Role Playing : 0.9363718411552363
Libraries & Demo : 0.9363718411552363
Auto & Vehicles : 0.9250902527075828


The 'Genres' column is more granular than the 'Category' column, but it is not very useful for us because it lists the sub-categories of the Games and Education genres separately. Hence we'll use the 'Category' column for further analysis.

**Observation:** The above data shows which types of apps are available in the IOS and Android markets. The IOS market is dominated by the games genre, with entertainment apps as the runner-up, whereas Android apps are spread across both entertainment and practical purposes.

#### 3.2. Ranking with respect to Average Number of Users
Although we have a feel for the types of apps out there, the data above doesn't necessarily reflect the popularity of these genres. So we will look at the average number of ratings for the IOS market and the average number of installs for the Android market to get a feel for the number of users.

As the number of installs in the Android data has a format like '1,000,000,000+', we need to remove any '+' and ',' characters before turning the values into numbers. Since this format only gives a lower bound on the installs, the numbers we use are approximate.
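
The conversion itself is short. Here is a minimal sketch, with `parse_installs` as a hypothetical helper name:

```python
def parse_installs(text):
    """Convert an install label such as '1,000,000+' into an integer lower bound."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000,000+'))   # 1000000000
print(parse_installs('10,000+'))          # 10000
```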

In [26]:
def avg_user_cnt(data_list,genre_index,n_user_index):

    genres = freq_table(data_list,genre_index)
    avg_n_users = []

    for each_genre in genres:
        total_n_users = 0
        len_genre = 0
        for each_row in data_list[1:]:
            genre = each_row[genre_index]
            if genre == each_genre :
                n_users = each_row[n_user_index]
                if('+'in n_users):
                    n_users = n_users.replace('+','')
                if(','in n_users):
                    n_users = n_users.replace(',','')
               
                total_n_users += float(n_users)
                len_genre += 1
        avg_n_users.append([(total_n_users/len_genre),each_genre])        

    sorted_avg_n_users = sorted(avg_n_users, reverse = True)
    for each_entry in sorted_avg_n_users:
        print (format(each_entry[0],'.3e'),each_entry[1])
    return sorted_avg_n_users

In [27]:
ios_avg_n_users = avg_user_cnt(ios_data,11,5)


8.609e+04 Navigation
7.494e+04 Reference
7.155e+04 Social Networking
5.733e+04 Music
5.228e+04 Weather
3.976e+04 Book
3.333e+04 Food & Drink
3.147e+04 Finance
2.844e+04 Photo & Video
2.824e+04 Travel
2.692e+04 Shopping
2.330e+04 Health & Fitness
2.301e+04 Sports
2.281e+04 Games
2.125e+04 News
2.103e+04 Productivity
1.868e+04 Utilities
1.649e+04 Lifestyle
1.403e+04 Entertainment
7.491e+03 Business
7.004e+03 Education
4.004e+03 Catalogs
6.120e+02 Medical


In [28]:
android_avg_n_users = avg_user_cnt(android_data,1,5)        # based on Android 'Category' column

3.846e+07 COMMUNICATION
2.473e+07 VIDEO_PLAYERS
2.325e+07 SOCIAL
1.784e+07 PHOTOGRAPHY
1.679e+07 PRODUCTIVITY
1.559e+07 GAME
1.398e+07 TRAVEL_AND_LOCAL
1.164e+07 ENTERTAINMENT
1.080e+07 TOOLS
9.549e+06 NEWS_AND_MAGAZINES
8.768e+06 BOOKS_AND_REFERENCE
7.037e+06 SHOPPING
5.201e+06 PERSONALIZATION
5.074e+06 WEATHER
4.189e+06 HEALTH_AND_FITNESS
4.057e+06 MAPS_AND_NAVIGATION
3.696e+06 FAMILY
3.639e+06 SPORTS
1.986e+06 ART_AND_DESIGN
1.925e+06 FOOD_AND_DRINK
1.833e+06 EDUCATION
1.712e+06 BUSINESS
1.438e+06 LIFESTYLE
1.388e+06 FINANCE
1.332e+06 HOUSE_AND_HOME
8.540e+05 DATING
8.177e+05 COMICS
6.473e+05 AUTO_AND_VEHICLES
6.385e+05 LIBRARIES_AND_DEMO
5.426e+05 PARENTING
5.132e+05 BEAUTY
2.535e+05 EVENTS
1.206e+05 MEDICAL


**Observation:** When we rank the genres by their average number of users, the picture changes. Games are no longer at the top of the list; instead, genres like social networking and communication rank higher.

#### 3.3. Top Contributors per Genre

Let us look at the top contributors per genre to see what kinds of apps are popular within each genre. Let us also look at their histograms to get a feel for the distribution and to see whether the data is skewed.
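
To show why skew matters for an average, the sketch below compares the mean and the median of a small, invented list of user counts and assigns each count to a power-of-ten bin. The numbers and the helper name `power_of_ten_bin` are assumptions for illustration only.

```python
from statistics import mean, median

# Invented user counts: a couple of very large apps among much smaller ones
user_counts = [5, 200, 3000, 10000, 150000, 350000]

def power_of_ten_bin(n_users):
    """Return the largest power of ten not exceeding a count of at least 1."""
    return 10 ** (len(str(int(n_users))) - 1)

print('mean  :', round(mean(user_counts)))
print('median:', median(user_counts))
for count in user_counts:
    print(count, 'falls in bin', power_of_ten_bin(count))
```

When the mean is far above the median, a few large apps are pulling the genre average up, so the average alone says little about a typical app in that genre.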

In [29]:
def top_apps(data_list,ranking_list, top_n, name_index, genre_index, n_users_index):

    for each_genre in ranking_list:
        genre = each_genre[1]
        genre_apps = []
        bin_list = [10**x for x in range(0,10)]
        genre_histo = [[x,0] for x in bin_list]
        max_users = 1e20
        
        for each_row in data_list[1:]:
            if (each_row[genre_index] == genre):
                n_users = each_row[n_users_index]
                n_users = n_users.replace('+','')
                n_users = float(n_users.replace(',',''))
                app_name = each_row[name_index]
                genre_apps.append([n_users,app_name])
        
                for bin_index in range(0,len(bin_list)):
                    current_bin = bin_list[bin_index]
                    if(bin_index < (len(bin_list)-1)):
                        next_bin = bin_list[bin_index+1] 
                    else:
                        next_bin = max_users

                    if(current_bin <= n_users) and (n_users < next_bin):
                            genre_histo[bin_index][1] += 1
                        
                    
        # The sorted list is named ranked_apps so it does not shadow the top_apps function
        ranked_apps = sorted(genre_apps, reverse = True)
        top = min(top_n, len(ranked_apps))
            
        print('\n',genre,'-Top',top)
        for rank in range(0,top):
            print (format(ranked_apps[rank][0],'.2e'),ranked_apps[rank][1])
               
        print('\n',genre,'-Histo')
        for each_bin in genre_histo:
            print(format(each_bin[0],'.2e'),each_bin[1])


In [30]:
top_apps(ios_data,ios_avg_n_users,10,1,11,5)


 Navigation -Top 6
3.45e+05 Waze - GPS Navigation, Maps & Real-time Traffic
1.55e+05 Google Maps - Navigation & Transit
1.28e+04 Geocaching®
3.58e+03 CoPilot GPS – Car Navigation & Offline Maps
1.87e+02 ImmobilienScout24: Real Estate Search in Germany
5.00e+00 Railway Route Search

 Navigation -Histo
1.00e+00 1
1.00e+01 0
1.00e+02 1
1.00e+03 1
1.00e+04 1
1.00e+05 2
1.00e+06 0
1.00e+07 0
1.00e+08 0
1.00e+09 0

 Reference -Top 10
9.86e+05 Bible
2.00e+05 Dictionary.com Dictionary & Thesaurus
5.42e+04 Dictionary.com Dictionary & Thesaurus for iPad
2.68e+04 Google Translate
1.84e+04 Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran
1.76e+04 New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition
1.68e+04 Merriam-Webster Dictionary
1.21e+04 Night Sky
8.54e+03 City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE)
4.69e+03 LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools

 Reference -Histo
1.00e+00 1
1.00e+01 1
1

In [31]:
top_apps(android_data,android_avg_n_users,10,0,1,5)


 COMMUNICATION -Top 10
1.00e+09 WhatsApp Messenger
1.00e+09 Skype - free IM & video calls
1.00e+09 Messenger – Text and Video Chat for Free
1.00e+09 Hangouts
1.00e+09 Google Chrome: Fast & Secure
1.00e+09 Gmail
5.00e+08 imo free video calls and chat
5.00e+08 Viber Messenger
5.00e+08 UC Browser - Fast Download Private & Secure
5.00e+08 LINE: Free Calls & Messages

 COMMUNICATION -Histo
1.00e+00 3
1.00e+01 19
1.00e+02 36
1.00e+03 35
1.00e+04 30
1.00e+05 25
1.00e+06 62
1.00e+07 50
1.00e+08 21
1.00e+09 6

 VIDEO_PLAYERS -Top 10
1.00e+09 YouTube
1.00e+09 Google Play Movies & TV
5.00e+08 MX Player
1.00e+08 VivaVideo - Video Editor & Photo Movie
1.00e+08 VideoShow-Video Editor, Video Maker, Beauty Camera
1.00e+08 VLC for Android
1.00e+08 Motorola Gallery
1.00e+08 Motorola FM Radio
1.00e+08 Dubsmash
5.00e+07 Vote for

 VIDEO_PLAYERS -Histo
1.00e+00 0
1.00e+01 2
1.00e+02 13
1.00e+03 22
1.00e+04 21
1.00e+05 16
1.00e+06 40
1.00e+07 36
1.00e+08 7
1.00e+09 2

 SOCIAL -Top 10
1.00e+09 Instagram
1.0

### 4. Recommendations

#### 4.1. IOS Market
Unfortunately, for the genres with the highest average number of users, the IOS sample sizes are too small to support a recommendation. The IOS dataset was already small compared to the Android one, and it also contained a significant number of non-English and paid apps. Since 58% of the final dataset belongs to the games genre, the data for the remaining genres is not very reliable.

Among the remaining genres in the top 10, social networking and music have relatively large sample sizes. If I had to pick one, I would recommend the music genre, as it is easier to challenge established apps there. For an app to succeed in the social networking genre, it needs to attract a significant portion of each user's personal network, which is harder than just appealing to the user individually.

#### 4.2. Android Market
For the Android set we see a reasonable amount of data to look through. From the highest ranking genres, video players look the most reasonable and attractive one to get into as the average number of users is significantly high whereas the number of big players in the genre is rather low. Given the brand names in this dataset seem to have ~1e9,1e8 number of users, I would call any app with more than 1e8 users a big player.