## Profitable App Profiles for the App Store and Google Play Markets

In-app advertisements are a major source of revenue for many companies that develop free mobile apps. The revenue generated by any given app is strongly influenced by the size of its user base: the more users who see and engage with the ads, the better.

As such, the goal of this project is to analyze data to help identify the types of apps that would likely attract a huge user-base. We will carry out our analysis based on the [Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps), as well as the [iOS App Store data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) made available at the respective Kaggle webpages.

In [1]:
from csv import reader
import statistics

def open_dataset(path, has_header=True):
    # Read every row into memory; the with-block closes the file afterwards
    with open(path, encoding='utf8') as opened_file:
        data = list(reader(opened_file))
    if has_header:
        return data[1:], data[0]  # (rows, header)
    return data
    
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    if rows_and_columns:
        print('Row-count:', len(dataset))
        print('Column-count:', len(dataset[0]))

def clean_malformed_rows(dataset_without_header, header, data_set_name = ''):
    print('Cleaning '+ data_set_name +', original row-count:' + str(len(dataset_without_header)))
    headerlength = len(header)
    result = []
    for index, record in enumerate(dataset_without_header):
        rowlength = len(record)
        if rowlength != headerlength:
            # Report the malformed row and leave it out of the result
            print(record)
            errorMsg = 'Index ' + str(index) + ': Expected ' + str(headerlength) + ' elements, found ' + str(rowlength) + '.'
            print(errorMsg)
        else:
            result.append(record)
    print('Cleaning completed, final row-count: '+str(len(result))+'\n')
    return result

In [2]:
apple_apps_data, apple_data_headers = open_dataset('AppleStore.csv')
google_apps_data, google_data_headers = open_dataset('googleplaystore.csv')

#print(apple_data_headers)
#explore_data(apple_apps_data, 0, 3, True)
print('Cleaning malformed rows in Apple data set:')
apple_cleaned_rows_apps_data = clean_malformed_rows(apple_apps_data, apple_data_headers)

#print(google_data_headers)
#explore_data(google_apps_data, 0, 3, True)
print('Cleaning malformed rows in Google data set:')
google_cleaned_rows_apps_data = clean_malformed_rows(google_apps_data, google_data_headers)

Cleaning malformed rows in Apple data set:
Cleaning , original row-count:7197
Cleaning completed, final row-count: 7197

Cleaning malformed rows in Google data set:
Cleaning , original row-count:10841
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Index 10472: Expected 13 elements, found 12.
Cleaning completed, final row-count: 10840



The clean_malformed_rows calls have removed the malformed rows from both data sets: the App Store data set had none, while the Google Play data set had one (index 10472, which has 12 elements instead of 13).

Exploration also revealed that the Google Play data set contains repeated app names. Before proceeding further, it is worth taking note of the columns available in both data sets.

Column descriptions for the Google Play data set are as follows:

| Column      | Description |
| ----------- | ----------- |
|App | Application name |
|Category | Category the app belongs to |
|Rating | Overall user rating of the app (as when scraped) |
|Reviews | Number of user reviews for the app (as when scraped) |
|Size | Size of the app (as when scraped) |
|Installs | Number of user downloads/installs for the app (as when scraped) |
|Type | Paid or Free |
|Price | Price of the app (as when scraped) |
|Content Rating | Age group the app is targeted at - Children / Mature 21+ / Adult |
|Genres | An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres. |
|Last Updated | Date when the app was last updated on Play Store (as when scraped) |
|Current Ver | Current version of the app available on Play Store (as when scraped) |
|Android Ver | Min required Android version (as when scraped) |

Column descriptions for the App Store data set are as follows:

| Column      | Description |
| ----------- | ----------- |
| id | App ID |
| track_name | App Name |
| size_bytes | Size in bytes |
| currency | Currency Type |
| price | Price amount |
| rating_count_tot | User Rating counts (for all versions) |
| rating_count_ver | User rating counts (for current version) |
| user_rating | Average User Rating value (for all versions) |
| user_rating_ver | Average User Rating value (for current version) |
| ver | Latest version code |
| cont_rating | Content Rating |
| prime_genre | Primary Genre |
| sup_devices.num | Number of supported devices |
| ipadSc_urls.num | Number of screenshots shown for display |
| lang.num | Number of supported languages |
| vpp_lic | Vpp Device Licensing Enabled |

A preliminary inspection function is then built to take a closer look at the app with the greatest duplicate count in each data set.

In [8]:
def build_app_dict(dataset_without_header, app_name_index):
    app_dict = {}
    for app in dataset_without_header:
        # Group every record under its app name
        app_dict.setdefault(app[app_name_index], []).append(app)
    return app_dict

apple_app_dict = build_app_dict(apple_cleaned_rows_apps_data, 1)

google_app_dict = build_app_dict(google_cleaned_rows_apps_data, 0)

def inspect_most_duplicated_app(app_dict):
    target = 0
    app_count_list = []
    for k,v in app_dict.items():
        app_count_list.append([k, len(v)])
        if len(v) > target:
            target = len(v)
    print('Highest duplicate count: '+str(target))
    app_target = [i[0] for i in app_count_list if i[1] == target][0]
    for item in app_dict[app_target]:
        print(item)

print('Inspecting Apple app with highest duplicate count:')
inspect_most_duplicated_app(apple_app_dict)
print('\n')
print('Inspecting Google app with highest duplicate count:')
inspect_most_duplicated_app(google_app_dict)
    

Inspecting Apple app with highest duplicate count:
Highest duplicate count: 2
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


Inspecting Google app with highest duplicate count:
Highest duplicate count: 9
['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '

With the help of the column descriptions, it can be observed that the rating/review-related metrics differ across the duplicated records for a given app ('rating_count_tot' in the App Store data set, and 'Reviews' in the Google Play data set). Assuming that ratings and reviews cannot be deleted once published, the record holding the highest count of the associated metric is most likely the most recent one, and is the one to keep for the next stage of data analysis.

We then create another function to inspect all apps with duplicate records, keeping only the record with the highest count of the respective metric and discarding the other duplicates.
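The selection rule described above can also be sketched as a single pass over the rows with a dictionary. This is only an illustration on made-up rows (the app names and counts are hypothetical, and `keep_highest` is not part of the notebook), assuming the metric column holds integer-like strings:

```python
# Hypothetical rows: [app name, review count as a string]
sample_rows = [
    ['App A', '100'],
    ['App A', '120'],
    ['App B', '7'],
    ['App A', '110'],
]

def keep_highest(rows, name_index, metric_index):
    """Keep, for each app name, the row with the largest metric value."""
    best = {}
    for row in rows:
        name = row[name_index]
        # Replace the stored row only if this one has a strictly higher count,
        # so the first row wins when counts are tied
        if name not in best or int(row[metric_index]) > int(best[name][metric_index]):
            best[name] = row
    return list(best.values())

deduplicated = keep_highest(sample_rows, 0, 1)
print(deduplicated)
```

Like the notebook's approach, this keeps the first record encountered when several records share the highest count.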

In [19]:
def clean_duplicates(app_dict, target_index):
    result = []
    for k,record_list in app_dict.items():
        if len(record_list) == 1:
            result.append(record_list[0]) # immediately take the only element in the list
        else:
            print('Processing ' + str(len(record_list)) + ' records found for '+k)
            target_count = 0
            for item in record_list:
                if int(item[target_index]) > target_count:
                    target_count = int(item[target_index])
            print('Targeted metric value: '+str(target_count))
            target_record_list = [record for record in record_list if int(record[target_index]) == target_count] # filter first
            target_record = target_record_list[0] #take 1st record after filtering
            result.append(target_record)
    print('\n')
    return result

apple_unique_cleaned_rows_apps_data = clean_duplicates(apple_app_dict, 5) # rating_count_tot at index 5

google_unique_cleaned_rows_apps_data = clean_duplicates(google_app_dict, 3) # Reviews at index 3

print('\nApp store original record count:' + str(len(apple_apps_data)) + ', after duplicate removal: '+str(len(apple_unique_cleaned_rows_apps_data))+'\n')
print('Google Play original record count:' + str(len(google_apps_data)) + ', after duplicate removal: '+str(len(google_unique_cleaned_rows_apps_data))+'\n')

Processing 2 records found for VR Roller Coaster
Targeted metric value: 107
Processing 2 records found for Mannequin Challenge
Targeted metric value: 668


Processing 4 records found for Angry Birds Rio
Targeted metric value: 2610680
Processing 4 records found for Cymera Camera- Photo Editor, Filter,Collage,Layout
Targeted metric value: 2418165
Processing 3 records found for Booking.com Travel Deals
Targeted metric value: 1830388
Processing 2 records found for Anthem Anywhere
Targeted metric value: 2657
Processing 2 records found for PBS KIDS Video
Targeted metric value: 36214
Processing 5 records found for Viber Messenger
Targeted metric value: 11335481
Processing 2 records found for Phogy, 3D Camera
Targeted metric value: 35725
Processing 3 records found for mySugr: the blood sugar tracker made just for you
Targeted metric value: 21189
Processing 7 records found for 8 Ball Pool
Targeted metric value: 14201891
Processing 2 records found for Endomondo - Running & Walking
Targeted metri

At this point, both data sets contain unique, well-formed records. We now attempt to detect non-English apps by their app names. The first step is to collect the characters that fall outside the ASCII range, i.e. those with an ordinal value above 127.
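As a small illustration of the idea, the sketch below lists the non-ASCII characters in a single string using `ord`. The helper name and the sample app names are hypothetical and chosen only to show the three typical cases:

```python
def non_ascii_characters(text):
    """Return the characters in text whose code point is above 127."""
    return [character for character in text if ord(character) > 127]

# Hypothetical app names used purely for illustration
print(non_ascii_characters('Instagram'))     # plain ASCII name
print(non_ascii_characters('Docs To Go™'))   # English name with a trademark symbol
print(non_ascii_characters('爱奇艺'))          # name written entirely in non-ASCII characters
```

The second case shows why a whitelist is needed: an English name can still contain a symbol outside the ASCII range.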

In [147]:
def get_non_ascii_characters(dataset_without_headers, app_name_index):
    non_ascii_list = []
    for app in dataset_without_headers:
        for character in app[app_name_index]:
            if ord(character) >= 128 and character not in non_ascii_list:
                non_ascii_list.append(character)
    return sorted(non_ascii_list)

apple_non_ascii_list = get_non_ascii_characters(apple_unique_cleaned_rows_apps_data, 1) # track_name at index 1
#print(apple_non_ascii_list)
google_non_ascii_list = get_non_ascii_characters(google_unique_cleaned_rows_apps_data, 0) # App at index 0
#print(google_non_ascii_list)

combined_non_ascii_list = [character for character in google_non_ascii_list]
for character in apple_non_ascii_list :
    if character not in combined_non_ascii_list:
        combined_non_ascii_list.append(character)

# The following print statement would result in a huge output that is mostly used for inspecting the non-ASCII characters
#print(sorted(combined_non_ascii_list))


Next, we build a discretionary whitelist of characters that should not be treated as non-English, e.g. the Registered, Service Mark and Trademark symbols, as well as the emoji that appear in some of the app names.

In [148]:
def print_and_build_character_whitelist(character_list):
    result = []
    for character in character_list:
        if len(character) > 1:
            print('Bad character: ' +character)
        else:
            #print(character + ": "+str(ord(character)))
            result.append(ord(character))
    return result

# The following list is built after inspecting the print-out of the combined_non_ascii_list in the previous cell
character_whitelist = print_and_build_character_whitelist([
    '】','【','♥','、','»','〜','✨','⏰','Σ','✔','★','↔','Σ','★','―','►','🗓','～','·','▻','▫','∘','Ⅸ','●','Ⓞ',
    'С','！','－','：','∞','é','＆','•','—','’','–','®', '°', '²','⁴','℠', '™','🌏', '🌸', '🍀', '🎈', '🎨', '🏆', 
    '🏠', '🐈', '🐕', '🐬', '🐶', '👍', '💎', '💘', '💞', '💣', '📏', '📖', '🔔', '🔥', '🔫', '😂', '😄', '😍', 
    '😘', '😜', '🚀'])

# helper for manual inspection of an app name
def test_print(text, whitelist):
    # print each non-ASCII, non-whitelisted character with its position and code point
    for index, character in enumerate(text):
        if ord(character) >= 128 and ord(character) not in whitelist:
            print(character + '('+str(index)+'): '+ str(ord(character)))

print('Size of whitelist: '+str(len(character_whitelist)))

Size of whitelist: 69


Finally, we filter the data sets to exclude apps whose names contain any non-ASCII character that is not on the whitelist.

In [143]:
def clean_non_english_apps(dataset_without_headers, app_name_index, whitelist):
    result = []
    non_english_list = []
    for app in dataset_without_headers: #iterate through apps in data-set
        is_containing_non_english_character = False
        for character in app[app_name_index]: #iterate through characters in app name
            if ord(character) >= 128 and ord(character) not in whitelist:
                is_containing_non_english_character = True
                break
        if not is_containing_non_english_character:
            result.append(app)
        else:
            #print('Removing '+app[app_name_index])
            non_english_list.append(app)
    return result, non_english_list

apple_unique_cleaned_rows_english_apps_data, apple_non_english = clean_non_english_apps(apple_unique_cleaned_rows_apps_data, 1, character_whitelist) # track_name at index 1

google_unique_cleaned_rows_english_apps_data, google_non_english = clean_non_english_apps(google_unique_cleaned_rows_apps_data, 0, character_whitelist) # App at index 0

print('\nApp Store before non-English removal:' + str(len(apple_unique_cleaned_rows_apps_data)) + 
      ', after non-English removal: '+str(len(apple_unique_cleaned_rows_english_apps_data)) + 
     ', non-English list size: '+str(len(apple_non_english)))
print('Google Play before non-English removal:' + str(len(google_unique_cleaned_rows_apps_data)) + 
      ', after non-English removal: '+str(len(google_unique_cleaned_rows_english_apps_data)) + 
     ', non-English list size: '+str(len(google_non_english)))


App Store before non-English removal:7195, after non-English removal: 6109, non-English list size: 1086
Google Play before non-English removal:9659, after non-English removal: 9480, non-English list size: 179


We are left with 6109 records in the App Store data set and 9480 records in the Google Play data set. These records are complete, unique and judged to be English-language apps.

The last filter keeps only free apps, since our main goal is to analyse mobile apps that are free to download and use, and that rely on in-app advertisements to generate revenue.
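The price check behind this filter can be sketched on individual strings. The helper name `is_free` is hypothetical, and the '$4.99' format is an assumption about how paid Google Play prices are written; the '0' and '0.0' forms appear in the records printed earlier:

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$4.99' is zero."""
    # Strip the currency symbol (if any) before converting to a number
    return float(price_text.replace('$', '')) == 0.0

for price in ['0', '0.0', '$4.99']:
    print(price, '->', is_free(price))
```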

In [146]:
def get_free_apps(dataset_without_headers, price_index):
    result = []
    for app in dataset_without_headers:
        if float(app[price_index].replace('$','')) == 0.0:
            result.append(app)
    return result

apple_final_apps_data = get_free_apps(apple_unique_cleaned_rows_english_apps_data, 4) # price at index 4
google_final_apps_data = get_free_apps(google_unique_cleaned_rows_english_apps_data, 7) # price at index 7

print('\nApp Store before paid app removal: ' + str(len(apple_unique_cleaned_rows_english_apps_data)) +
     ', after paid app removal: '+str(len(apple_final_apps_data)))
print('\nGoogle Play before paid app removal: ' + str(len(google_unique_cleaned_rows_english_apps_data)) +
     ', after paid app removal: '+str(len(google_final_apps_data)))


App Store before paid app removal: 6109, after paid app removal: 3168

Google Play before paid app removal: 9480, after paid app removal: 8742


At this point, we have completed the cleaning and filtering of the data, arriving at 3168 apps in the App Store and 8742 apps in Google Play to be used for analysis.



## Most Common Genre In App Stores

First, we explore the most common genres of apps published in the respective stores. To do that, we create helper functions that iterate through the data sets and count the number of occurrences of each genre.
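For comparison, the same kind of count can be sketched with `collections.Counter` from the standard library. This is an alternative to the hand-written helpers in the next cell, not what the notebook uses, and the sample rows are hypothetical:

```python
from collections import Counter

# Hypothetical rows: [app name, genre]
sample_apps = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
]

# Count how many rows fall under each genre (column index 1)
genre_counts = Counter(app[1] for app in sample_apps)

# most_common() returns (genre, count) pairs, highest count first
for genre, count in genre_counts.most_common():
    print(genre, ':', count)
```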

In [159]:
def freq_table(dataset_without_headers, target_index):
    result = {}
    for app in dataset_without_headers:
        if app[target_index] not in result.keys():
            result[app[target_index]] = 1
        else:
            result[app[target_index]] += 1
    return result

def display_table(data_dict):
    table_display = []
    for key in data_dict:
        key_val_as_tuple = (data_dict[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [161]:
print('Top genres in App Store:\n')
apple_genre_dict = freq_table(apple_final_apps_data, apple_data_headers.index('prime_genre'))
display_table(apple_genre_dict)

Top genres in App Store:

Games : 1851
Entertainment : 252
Photo & Video : 159
Education : 117
Social Networking : 104
Shopping : 81
Utilities : 76
Sports : 68
Music : 65
Health & Fitness : 63
Productivity : 54
Lifestyle : 49
News : 43
Travel : 37
Finance : 35
Weather : 28
Food & Drink : 26
Reference : 17
Business : 15
Book : 12
Navigation : 6
Medical : 6
Catalogs : 4


For the App Store, games are by far the most common apps (1851 of 3168, or about 58%), followed at a distance by lifestyle apps (Entertainment, Sports, Music), tools (Photo & Video, Shopping, Utilities), learning (Education) and community-building (Social Networking). Overall, apps aimed at leisurely pursuits are the most common.
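Absolute counts are easier to compare as shares of the total. The sketch below converts a frequency table into percentages; `to_percentages` is a hypothetical helper, the top three counts are copied from the App Store output above, and the remaining genres are lumped into an illustrative 'Other' bucket:

```python
def to_percentages(frequency_table):
    """Convert absolute counts into percentages of the total."""
    total = sum(frequency_table.values())
    return {key: count / total * 100 for key, count in frequency_table.items()}

# Top three counts from the App Store output; 'Other' is the remainder
# of the 3168 free English apps (3168 - 1851 - 252 - 159 = 906)
top_counts = {'Games': 1851, 'Entertainment': 252, 'Photo & Video': 159, 'Other': 906}

shares = to_percentages(top_counts)
for genre, share in shares.items():
    print(genre, ':', round(share, 1))
```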

Next, we look at the breakdown for the Google Play data set:

In [162]:
print('Top genres in Google Play:\n')
google_genre_dict = freq_table(google_final_apps_data, google_data_headers.index('Genres'))
display_table(google_genre_dict)

print('\nTop categories in Google Play:\n')
google_category_dict = freq_table(google_final_apps_data, google_data_headers.index('Category'))
display_table(google_category_dict)

Top genres in Google Play:

Tools : 741
Entertainment : 530
Education : 467
Business : 406
Productivity : 344
Lifestyle : 338
Finance : 321
Medical : 312
Personalization : 293
Sports : 285
Communication : 283
Action : 274
Health & Fitness : 270
Photography : 261
News & Magazines : 241
Social : 233
Travel & Local : 204
Shopping : 198
Books & Reference : 187
Simulation : 181
Dating : 164
Arcade : 161
Video Players & Editors : 157
Casual : 154
Maps & Navigation : 118
Food & Drink : 107
Puzzle : 99
Racing : 88
Role Playing : 83
Strategy : 81
Libraries & Demo : 81
Auto & Vehicles : 81
Weather : 71
House & Home : 71
Events : 63
Adventure : 59
Beauty : 53
Art & Design : 53
Comics : 50
Parenting : 44
Card : 40
Casino : 37
Trivia : 35
Educational;Education : 35
Board : 34
Educational : 32
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 17
Puzzle;Brain Games : 15
Racing;Action & Adventure : 14
Entertainment;Music & Video : 14
Casual;Brain Games : 12
Casual;Action & Adventure 

Genres in Google Play seem to be very granular, so for a higher-level perspective, we inspect the Category data. Family seems to top the list, and a visit to the Google Play website reveals that Family largely caters to games meant for kids.

Hence, it can be concluded that games are the most common type of app in Google Play, followed by tools (tools, business, productivity, finance, medical, personalization, etc.), lifestyle and communication.

This analysis highlights a similarity between the two app stores: games are the most commonly published apps, followed by tools and lifestyle apps.

## Most Downloads In App Stores

Another way of analysing the data sets is to look at the number of downloads clocked by the published apps. This gives another perspective on app popularity among the user base of each platform: a type of app that draws a larger user base has a greater chance of securing more ad impressions.

The iOS App Store data set has no install count, so we use the rating_count_tot column as a proxy for the number of app downloads. This is based on the assumption that a larger user base generally leads to more ratings being submitted.

For the Google Play data set, we can use the Installs column. However, that column is expressed in an approximate format, e.g. '1,000,000+', which cannot be cast to an integer directly. Hence, the helper function includes a few more lines of code to strip the non-numeric characters so that the value can be parsed as a number.
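The cleaning step can be sketched on its own as follows. `parse_installs` is a hypothetical helper that mirrors the character-stripping idea; it assumes commas and a trailing plus sign are the only non-digit characters present:

```python
def parse_installs(installs_text):
    """Turn an approximate install string such as '1,000,000+' into an int."""
    return int(installs_text.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000+'))
print(parse_installs('1,000+'))
print(parse_installs('107'))  # plain digit strings, as in rating_count_tot, pass through unchanged
```

Because plain digit strings are left untouched, the same helper works for both the Installs column and the App Store rating counts.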

In [169]:
def build_download_count(dataset_without_header, genre_index, count_index):
    genre_dict = {}
    result = {}
    non_number_set = (',','+') # characters to strip from the approximate expression in the Google data set
    for app in dataset_without_header:
        count_value = int(''.join([c for c in app[count_index] if c not in non_number_set]))
        if app[genre_index] not in genre_dict.keys():
            genre_dict[app[genre_index]] = [count_value]
        else:
            genre_dict[app[genre_index]].append(count_value)
    # iterate through the first dictionary to find average number of downloads per genre
    for genre,count_list in genre_dict.items():
        result[genre] = float(sum(count_list)) / len(count_list)
    return result

In [170]:
apple_download_dict = build_download_count(
    apple_final_apps_data, 
    apple_data_headers.index('prime_genre'),
    apple_data_headers.index('rating_count_tot'))

display_table(apple_download_dict)

Navigation : 86090.33333333333
Reference : 79350.4705882353
Social Networking : 72916.54807692308
Music : 58205.03076923077
Weather : 52279.892857142855
Book : 46384.916666666664
Food & Drink : 33333.92307692308
Finance : 32367.02857142857
Travel : 30524.297297297297
Photo & Video : 28619.415094339623
Shopping : 27898.802469135804
Health & Fitness : 24037.634920634922
Sports : 23101.926470588234
Games : 23007.38141545111
Productivity : 21799.14814814815
News : 21248.023255813954
Utilities : 19423.0
Lifestyle : 15924.163265306122
Entertainment : 14139.42857142857
Business : 6839.6
Education : 6010.940170940171
Catalogs : 4004.0
Medical : 612.0


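Preliminary analysis suggests that a navigation app offers the best chance of capturing a huge user base on the iOS App Store, as Navigation tops the table at ~86K ratings per app on average (our proxy for installations). Reference apps are a close 2nd at 79K, while Social Networking comes in 3rd at around 73K. Note, however, that Navigation contains only 6 apps and Reference only 17, so these averages rest on small samples.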

In [177]:
google_download_dict = build_download_count(
    google_final_apps_data, 
    google_data_headers.index('Category'),
    google_data_headers.index('Installs'))

display_table(google_download_dict)

COMMUNICATION : 38996132.830388695
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23552836.060085837
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16833224.75
GAME : 15749886.682297772
TRAVEL_AND_LOCAL : 14120454.07804878
ENTERTAINMENT : 11640705.88235294
TOOLS : 10917152.700808626
NEWS_AND_MAGAZINES : 9823830.746887967
BOOKS_AND_REFERENCE : 8908439.893048128
SHOPPING : 7072366.590909091
PERSONALIZATION : 5219231.699658703
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4234992.192592593
MAPS_AND_NAVIGATION : 4211061.694915255
SPORTS : 3869044.505376344
FAMILY : 3742179.6003627568
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1978678.981308411
EDUCATION : 1841666.6666666667
BUSINESS : 1716495.2955665025
LIFESTYLE : 1464520.1150442478
FINANCE : 1383030.0062305296
HOUSE_AND_HOME : 1367638.8873239437
DATING : 859205.8353658536
COMICS : 850218.6274509804
LIBRARIES_AND_DEMO : 654244.5679012346
AUTO_AND_VEHICLES : 654074.8271604938
PARENTING : 542603.6206896552
BEAUTY : 513151.88679245

Based on the above results, Communication appears to be the go-to genre, with an average of more than 38M installations per app. However, closer inspection reveals several apps from tech giants in this genre, such as WhatsApp (over 1B downloads) and LINE (over 500M downloads). These exceedingly popular apps skew the mean, so it would be unwise to include them in our analysis.

One possible solution is to exclude these outliers. If we assume that the installation counts for apps within a given genre follow a normal distribution, then according to this [Wiki article on normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), about 68% of values drawn from a normal distribution are within 1 standard deviation of the mean; about 95% of the values fall within 2 standard deviations; and about 99.7% are within 3 standard deviations.

We then remove the outliers by eliminating any app whose count lies above (Mean + 2*SD) or below (Mean - 2*SD), before re-examining the statistics of the trimmed list of apps.
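To make the trimming rule concrete, here is a minimal sketch on made-up install counts. `trim_two_sd` is a hypothetical helper, and the data (nine ordinary apps and one giant) is invented purely to show the effect of a single outlier on the mean:

```python
import statistics

def trim_two_sd(values):
    """Drop values more than 2 population standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [value for value in values if abs(value - mean) <= 2 * spread]

# Hypothetical install counts: nine ordinary apps and one giant
installs = [1000] * 9 + [1000000]
trimmed = trim_two_sd(installs)

print('Mean before trimming:', statistics.mean(installs))
print('Mean after trimming:', statistics.mean(trimmed))
```

Here the mean before trimming is 100,900 and the population standard deviation is 299,700, so the giant (899,100 above the mean) lies outside the 2-SD band while the ordinary apps (99,900 below the mean) stay inside it.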

To test this alternative, we build another helper function to focus on the apps in COMMUNICATION genre for the Google Play data set:

In [203]:
def analyse_download_count(dataset_without_header, app_name_index, genre_index, count_index, target_genre):
    genre_dict = {}
    result = {}
    non_number_set = (',','+') # characters to strip from the approximate expression in the Google data set
    for app in dataset_without_header:
        count_value = int(''.join([c for c in app[count_index] if c not in non_number_set]))
        if app[genre_index] == target_genre:
            # collect [app name, count] pairs for the target genre only
            genre_dict.setdefault(target_genre, []).append([app[app_name_index], count_value])
    # defaults so the return statement still works if the target genre is absent
    final_list, exclusion_list = [], []
    # iterate through the dictionary to find the trimmed average number of downloads for the genre
    for genre,count_list in genre_dict.items():
        print('Processing ['+genre+'] genre:')
        avg = float(sum([item[1] for item in count_list])) / len(count_list)
        pop_standard_dev = statistics.pstdev([item[1] for item in count_list])
        print('Mean: ' + str(avg) + ', Population standard deviation:' + str(pop_standard_dev))
        final_list = []
        exclusion_list = []
        for item in count_list:
            if abs(item[1] - avg) > (2 * pop_standard_dev):
                #print(item[0] +' ('+str(item[1])+') excluded as >2 standard deviations away from mean')
                exclusion_list.append(item)
            else:
                #print(item[0] +' ('+str(item[1])+') included as within 2 standard deviations away from mean')
                final_list.append(item)
        print('Size of original app list:'+str(len(count_list))+', Size of trimmed list:'+str(len(final_list))+'\n')
        result[genre] = sum([item[1] for item in final_list]) / len(final_list)
        
    return result, final_list, exclusion_list

In [213]:
google_download_dict_COMMUNICATION, good_list, outliers = analyse_download_count(google_final_apps_data,
                                                         google_data_headers.index('App'),
                                                         google_data_headers.index('Category'),
                                                         google_data_headers.index('Installs'),
                                                         'COMMUNICATION')

display_table(google_download_dict_COMMUNICATION)
print('\nOutliers (>2 std devs) in COMMUNICATION genre:')
for app in outliers:
    print(app[0]+', '+str(app[1]))

Processing [COMMUNICATION] genre:
Mean: 38996132.830388695, Population standard deviation:157296436.36876735
Size of original app list:283, Size of trimmed list:272

COMMUNICATION : 9323182.31985294

Outliers (>2 std devs) in COMMUNICATION genre:
Viber Messenger, 500000000
Messenger – Text and Video Chat for Free, 1000000000
imo free video calls and chat, 500000000
Hangouts, 1000000000
Google Chrome: Fast & Secure, 1000000000
Gmail, 1000000000
LINE: Free Calls & Messages, 500000000
Google Duo - High Quality Video Calls, 500000000
WhatsApp Messenger, 1000000000
Skype - free IM & video calls, 1000000000
UC Browser - Fast Download Private & Secure, 500000000


As shown above, removing just 11 outliers from the original list of 283 apps in the COMMUNICATION genre reduces the mean number of downloads from about 39.0M to just 9.3M. A handful of outliers can clearly distort the outlook for a genre.
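The trimmed figure can be cross-checked with a little arithmetic, using only the numbers printed in the COMMUNICATION output above. The variable names are illustrative; the sketch rebuilds the genre's total installs from the mean, subtracts the 11 excluded apps, and divides by the remaining 272:

```python
# Figures copied from the COMMUNICATION output above
original_mean = 38996132.830388695
population_sd = 157296436.36876735
app_count = 283

# Install counts of the 11 excluded apps (five at 500M, six at 1B)
outlier_installs = [500_000_000] * 5 + [1_000_000_000] * 6

# Anything above this value is more than 2 standard deviations from the mean
upper_cutoff = original_mean + 2 * population_sd

# Rebuild the total, remove the outliers, and average over what is left
total_installs = original_mean * app_count
trimmed_mean = (total_installs - sum(outlier_installs)) / (app_count - len(outlier_installs))

print('Upper cutoff:', round(upper_cutoff))
print('Trimmed mean:', round(trimmed_mean, 2))
```

The upper cutoff works out to roughly 354M installs, which is why every app listed at 500M or 1B is excluded.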

This time, we process all genres using this trimmed-mean mechanism:

In [209]:
def build_2sd_download_count(dataset_without_header, app_name_index, genre_index, count_index):
    genre_dict = {}
    result = {}
    non_number_set = (',','+') # characters to strip from the approximate expression in the Google data set
    for app in dataset_without_header:
        count_value = int(''.join([c for c in app[count_index] if c not in non_number_set]))
        if app[genre_index] not in genre_dict.keys():
            genre_dict[app[genre_index]] = [[app[app_name_index], count_value]]
        else:
            genre_dict[app[genre_index]].append([app[app_name_index], count_value])
    # iterate through the first dictionary to find average number of downloads per genre
    inclusion_dict = {}
    exclusion_dict = {}
    for genre,count_list in genre_dict.items():
        print('Processing ['+genre+'] genre:')
        avg = float(sum([item[1] for item in count_list])) / len(count_list)
        pop_standard_dev = statistics.pstdev([item[1] for item in count_list])
        
        final_list = []
        exclusion_list = []
        for item in count_list:
            if abs(item[1] - avg) > (2 * pop_standard_dev):
                #print(item[0] +' ('+str(item[1])+') excluded as >2 standard deviations away from mean')
                exclusion_list.append(item)
            else:
                #print(item[0] +' ('+str(item[1])+') included as within 2 standard deviations away from mean')
                final_list.append(item)
        print('Size of original app list:'+str(len(count_list))+', Size of trimmed list:'+str(len(final_list)))
        result[genre] = sum([item[1] for item in final_list]) / len(final_list)
        print('Original Avg: ' + str(avg) + ', Pstdev:' + str(pop_standard_dev)+', Trimmed Avg:'+str(result[genre]))
        inclusion_dict[genre] = final_list
        exclusion_dict[genre] = exclusion_list
        
    return result, inclusion_dict, exclusion_dict

In [211]:
google_download_2sd_dict, inclusion_dict, exclusion_dic = build_2sd_download_count(google_final_apps_data,
                                                         google_data_headers.index('App'),
                                                         google_data_headers.index('Category'),
                                                         google_data_headers.index('Installs'))
print('\n\n')
display_table(google_download_2sd_dict)

Processing [VIDEO_PLAYERS] genre:
Size of original app list:159, Size of trimmed list:156
Original Avg: 24727872.452830188, Pstdev:118708611.49289133, Trimmed Avg:9177767.435897436
Processing [LIFESTYLE] genre:
Size of original app list:339, Size of trimmed list:337
Original Avg: 1464520.1150442478, Pstdev:6499340.809699266, Trimmed Avg:1028107.7715133531
Processing [ENTERTAINMENT] genre:
Size of original app list:85, Size of trimmed list:80
Original Avg: 11640705.88235294, Pstdev:24444494.088985275, Trimmed Avg:6118250.0
Processing [COMMUNICATION] genre:
Size of original app list:283, Size of trimmed list:272
Original Avg: 38996132.830388695, Pstdev:157296436.36876735, Trimmed Avg:9323182.31985294
Processing [PHOTOGRAPHY] genre:
Size of original app list:261, Size of trimmed list:260
Original Avg: 17840110.40229885, Pstdev:66618742.73719847, Trimmed Avg:14062572.365384616
Processing [BOOKS_AND_REFERENCE] genre:
Size of original app list:187, Size of trimmed list:186
Original Avg: 8908

Based on the revised set of results, PHOTOGRAPHY tops the list with over 14M average downloads, followed by GAME at over 12M, then COMMUNICATION at over 9M. It would be prudent to suggest building a decent photography app that appeals to the camera-eager generation!

While games come in a close second, a game app would face more than 3 times the competition (852 game apps in the original data set) compared to a photography app (261 photography apps in the original data set). In other words, it might be about 3 times harder to differentiate one's game app in the current Google Play market than a photography app.
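
The competition ratio can be checked directly from the two app counts quoted above. This is only a quick arithmetic sketch; the variable names are illustrative, and the raw app count is a rough stand-in for how crowded a category is.

```python
# App counts per category before trimming, as quoted in the text above.
game_apps = 852
photography_apps = 261

ratio = game_apps / photography_apps
print('Game apps per photography app:', round(ratio, 2))
```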

In [216]:
apple_download_2sd_dict, apple_inclusion_dict, apple_exclusion_dic = build_2sd_download_count(apple_final_apps_data,
                                                         apple_data_headers.index('track_name'),
                                                         apple_data_headers.index('prime_genre'),
                                                         apple_data_headers.index('rating_count_tot'))
print('\n\n')
display_table(apple_download_2sd_dict)

Processing [Games] genre:
Size of original app list:1851, Size of trimmed list:1807
Original Avg: 23007.38141545111, Pstdev:95366.8249476728, Trimmed Avg:11838.421693414499
Processing [Travel] genre:
Size of original app list:37, Size of trimmed list:35
Original Avg: 30524.297297297297, Pstdev:81778.48709309765, Trimmed Avg:13123.685714285713
Processing [Photo & Video] genre:
Size of original app list:159, Size of trimmed list:158
Original Avg: 28619.415094339623, Pstdev:174213.9422495501, Trimmed Avg:15119.803797468354
Processing [Reference] genre:
Size of original app list:17, Size of trimmed list:16
Original Avg: 79350.4705882353, Pstdev:231344.8278406113, Trimmed Avg:22689.875
Processing [Medical] genre:
Size of original app list:6, Size of trimmed list:6
Original Avg: 612.0, Pstdev:606.3384643359955, Trimmed Avg:612.0
Processing [Productivity] genre:
Size of original app list:54, Size of trimmed list:50
Original Avg: 21799.14814814815, Pstdev:35502.81431171515, Trimmed Avg:12864.0

Considering the iOS App Store alone, a weather app would seem like a good choice, as Weather tops the list with a trimmed average of close to 36K user ratings per app (the `rating_count_tot` column, which stands in for installations here). Again, we would want to avoid the Social Networking space, which is already dominated by the tech giants mentioned earlier. Navigation could be a good choice too, with users constantly seeking a useful map app to help them navigate urban areas.


After applying the same trimmed-mean analysis to the iOS App Store, we now have 2 top-10 tables to consider:

| iOS Genre | Trimmed Avg. Rating Count |
| --------- | ------------------------- |
| Weather | 35859.666666666664 |
| Social Networking | 34774.71568627451 |
| Navigation | 34299.2 |
| Music | 28220.396825396827 |
| Book | 27685.727272727272 |
| Reference | 22689.875 |
| Finance | 20201.090909090908 |
| Shopping | 19997.25316455696 |
| Photo & Video | 15119.803797468354 |
| News | 13323.97619047619 |

| Google Play Category | Trimmed Avg. Installs |
| -------------------- | --------------------- |
| PHOTOGRAPHY | 14062572.365384616 |
| GAME | 12305015.731132075 |
| COMMUNICATION | 9323182.31985294 |
| VIDEO_PLAYERS | 9177767.435897436 |
| PRODUCTIVITY | 8231944.879056048 |
| SOCIAL | 6525485.97368421 |
| TOOLS | 6250716.445652174 |
| ENTERTAINMENT | 6118250.0 |
| TRAVEL_AND_LOCAL | 4407355.1034482755 |
| BOOKS_AND_REFERENCE | 3579990.64516129 |



## Final Analysis & Conclusion
If we were to consider only the app types that appear in both tables, they would be:
- Social Network
- Photography
- Book/Reference
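
Because the two stores name their genres differently, finding this overlap needs a manual mapping onto shared labels. The sketch below shows one way to do it; the two mapping dictionaries are hand-written assumptions, not part of either data set, and a different judgment call (for example, treating Navigation and TRAVEL_AND_LOCAL as the same genre) would change the result.

```python
# Top-10 names from the two tables above, each mapped by hand to a shared label.
# The shared labels are an assumption made for this comparison.
ios_to_common = {
    'Weather': 'Weather',
    'Social Networking': 'Social Network',
    'Navigation': 'Navigation',
    'Music': 'Music',
    'Book': 'Book/Reference',
    'Reference': 'Book/Reference',
    'Finance': 'Finance',
    'Shopping': 'Shopping',
    'Photo & Video': 'Photography',
    'News': 'News',
}
google_to_common = {
    'PHOTOGRAPHY': 'Photography',
    'GAME': 'Games',
    'COMMUNICATION': 'Communication',
    'VIDEO_PLAYERS': 'Video Players',
    'PRODUCTIVITY': 'Productivity',
    'SOCIAL': 'Social Network',
    'TOOLS': 'Tools',
    'ENTERTAINMENT': 'Entertainment',
    'TRAVEL_AND_LOCAL': 'Travel',
    'BOOKS_AND_REFERENCE': 'Book/Reference',
}

# Genres present in both top-10 lists once the names are aligned.
shared = sorted(set(ios_to_common.values()) & set(google_to_common.values()))
print(shared)
```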

As discussed earlier, it would be prudent to avoid genres in which several tech giants dominate the market with their immense popularity. As such, a Social Network app would hardly gain traction in the face of household names like Facebook and Google. Photography seems like a good choice, though you may be competing with the likes of Instagram and Snapchat.

Book/Reference is a surprising entry on both platforms. With a sizable user-base, few exceedingly successful outliers to compete with, and a market space that is likely unsaturated (as mentioned earlier, games, tools and lifestyle are the most populated genres), it might be good to consider some sort of e-reader app that would allow users to read books or look up references on just about any subject while on the go.