# Finding Room for a New App #

## Introduction

* Our goal for this project is to analyze app market data to help our developers understand which types of apps are likely to attract more users.
* The app will be free to download; revenue will be generated by advertisement views.
* Two app sources are considered: Google Play for Android users and the Apple App Store for iOS users.

## Examining the data sets
* The code below opens both data sets and displays the header and first two rows of each. Each output is followed by a table explaining the column headings, which is particularly useful for the iOS columns.
* For iOS
    * There are about 7,000 apps and 16 data columns.
    * The original data set is available here: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps
* For Android
    * There are about 10,000 rows and 13 data columns.
    * The original data set is available here: https://www.kaggle.com/lava18/google-play-store-apps
* Lastly, the header is removed from both data sets, making them a little easier to use.

In [None]:
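def opener(file):
    # Read a CSV file and return its rows as a list of lists.
    from csv import reader
    # utf8 is an assumption about the files; app names contain non-ASCII characters.
    # The with-block closes the file once the rows have been read.
    with open(file, encoding='utf8') as opened_file:
        return list(reader(opened_file))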

ios_data = opener('AppleStore.csv')
android_data = opener('googleplaystore.csv')

In [None]:
# function to isolate given number of rows in data set
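def slicer(data_file, start, stop):
    rows = data_file[start:stop]  # avoid shadowing the built-in name slice
    for position, row in enumerate(rows):
        print(row)
        # Compare by position, not by value, so a duplicate of the final row is not mistaken for it.
        if position == len(rows) - 1:
            print('That was the last row in the slice')
        else:
            print('\n')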
            
# function to count the number of total rows (including any header if present) and number of columns.
def row_column_counter(data_file):
    print('There are ' + str(len(data_file)) + ' rows.')
    print('There are ' + str(len(data_file[0])) + ' columns.')

In [None]:
slicer(ios_data, 0, 3)
row_column_counter(ios_data)


|Heading         |Definition                         |Heading        |Definition                             |
|:--             |:--                                |:--            |:--                                    |
|id              |App ID                             |user_rating_ver|Avg User Rating (current version)      |
|track_name      |App Name                           |ver            |Latest Version Code                    |
|size_bytes      |Size (in Bytes)                    |cont_rating    |Content Rating                         |
|currency        |Currency Type                      |prime_genre    |Primary Genre                          |
|price           |Price                              |sup_devices.num|Number of Supporting Devices           |
|rating_count_tot|User Rating Count (all versions)   |ipadSc_urls.num|Number of Screenshots Shown for Display|
|rating_count_ver|User Rating Count (current version)|lang.num       |Number of Supported Languages          |
|user_rating     |Avg User Rating (all versions)     |vpp_lic        |VPP Device-Based Licensing Available   |

In [None]:
slicer(android_data, 0, 3)
row_column_counter(android_data)

|Heading |Definition         |Heading       |Definition                         |
|:--     |:--                |:--           |:--                                |
|App     |Application Name   |Price         |Price                              |
|Category|Category           |Content Rating|Target Age Group                   |
|Rating  |User Rating        |Genres        |Genres                             |
|Reviews |User Rating Count  |Last Updated  |Last Update (when scraped)         |
|Size    |Size (in Megabytes)|Current Ver   |Current Version                    |
|Installs|Number of Downloads|Android Ver   |Minimum Required Version of Android|
|Type    |Paid or Free       |              |                                   |


In [None]:
print(len(android_data))
print(android_data[0])

android_data = android_data[1:]
ios_data = ios_data[1:]

print(len(android_data))
print(android_data[0])

## Removing a Corrupted Row ##

Below is an example of a corrupted row: the entry for the category column is missing, so the row has one fewer entry than the others. The row can simply be deleted, but deleting by index carries a risk, described below. Here is a link to the discussion of the error: https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101

A quick look shows that the only row with this problem is the one already identified. Filling in the missing value or using ```del``` are both workable options, but if the block is run more than once, ```del``` will erase whatever valid row has moved into that index. It is often better practice to leave the original data set unchanged. In that case it is possible to create a copy and fill it with the 'good' data.

The code blocks below will:
* Print the length of the data set.
* Print the length of an (initially empty) copy of the data set.
* Use a function to add every row that has the expected number of columns (the same as the header and the first row of the original data set) to the copy.
* Print the lengths of the two data sets again.
* Replace the original Android data with the cleaned copy.
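
The difference between deleting by index and filtering into a copy can be sketched on a tiny, made-up data set. The rows and the `keep_well_formed` helper below are hypothetical and only illustrate the idea; the point is that a filter that returns a new list can be run any number of times without losing valid rows.

```python
# Hypothetical miniature data set: the second row is missing a column.
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # malformed: only two entries
    ['App C', 'GAMES', '4.5'],
]
expected_columns = 3


def keep_well_formed(data_set, n_columns):
    """Return a new list holding only the rows with n_columns entries."""
    return [row for row in data_set if len(row) == n_columns]


clean_rows = keep_well_formed(rows, expected_columns)
# Running the filter again on its own output changes nothing.
clean_again = keep_well_formed(clean_rows, expected_columns)

print(len(rows), len(clean_rows), len(clean_again))
```

The original `rows` list is left untouched, so the malformed row remains available for inspection.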

In [None]:
# Row 10472 google data (header exclusive)

print(len(android_data[10472]))
print(android_data[10472])

print(len(android_data[10473]))
print(android_data[10473])

# Curious to see if there are any other rows missing entries.
short_rows = []
for row in android_data:
    if len(row) != 13:
        short_rows.append(row)

print(len(short_rows))

In [None]:
data_copy = []
print('The android data set length is ' + str(len(android_data)))
print('The android data set copy length is ' + str(len(data_copy)))

def refiner(data_set):
    for row in data_set:
        if len(row) == len(data_set[0]):
            data_copy.append(row)

refiner(android_data)
print('The android data set length is ' + str(len(android_data)))
print('The android data set copy length is ' + str(len(data_copy)))

In [None]:
android_data = data_copy
print(len(android_data))

## Removing duplicate entries ##
Many data sets contain duplicate data which needs to be consolidated or removed. One criterion for choosing which entry to retain is to keep the one with the highest number of reviews. In the Android data set there are three entries for 'Slack'. Only one of them has a different value in the fourth position, the number of reviews (51510 vs 51507).

The code below loops through all the data and builds a dictionary with the app name as the key and the number of reviews as the corresponding value. If the code finds an entry where the name already exists in the dictionary, it will keep whichever entry has the most reviews.

For the example, we want to keep the third entry for 'Slack', where the User Rating Count = 51510. The code also shows that the ```reviews_max``` dictionary is the same length as the ```each_app_at_least_once``` list above.

Then a list is built containing one complete entry for each application, using the highest number of reviews associated with that app. The code also ensures that only one listing is kept when duplicate entries share the same number of reviews and that number is also the highest. A separate block compiles the app names into two lists: one holds each name once, and the other holds the repeated names. It prints the first ten entries of the duplicate list as an example, and the number of times the app 'Slack' appears across both lists.
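
As a compact illustration of the "keep the entry with the most reviews" rule, the sketch below does the same job in a single pass by storing the best row seen so far for each name. The sample rows and the `keep_most_reviewed` helper are hypothetical, and the review counts are assumed to be whole numbers stored as text.

```python
# Hypothetical rows in the form [name, category, rating, reviews].
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Notes', 'TOOLS', '4.0', '120'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]


def keep_most_reviewed(data_set, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in data_set:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        # The strict comparison keeps the first row seen when counts are tied.
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


deduped = keep_most_reviewed(sample)
for row in deduped:
    print(row)
```

Because ties are resolved in favour of the first row seen, no separate list of already-added names is needed in this variant.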

The iOS data set uses an ID number for each application. It is straightforward to use this and see if there are any duplicate entries.

In [None]:
for row in android_data:
    app_name = row[0]
    if app_name == 'Slack':
        print(row)

In [None]:
duplicate_apps = []
each_app_at_least_once = []

for row in android_data:
    name = row[0]
    if name in each_app_at_least_once:
        duplicate_apps.append(name)
    else:
        each_app_at_least_once.append(name)
        
print('Number of individual apps:', len(each_app_at_least_once))
print('Total number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])
print('\n')

slack_count = 0

for app_name in each_app_at_least_once:
    if app_name == 'Slack':
        slack_count += 1
for app_name in duplicate_apps:
    if app_name == 'Slack':
        slack_count += 1

print('The number of times "Slack" appears in either data set:' , slack_count)
    

In [None]:
reviews_max = {}

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    
print('The number of reviews for "Slack":', reviews_max['Slack'])
print('The number of unique entries:', len(reviews_max))

In [None]:
android_clean = []
already_added = []
print('The length of the android data set:', len(android_data))

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

android_data = android_clean
print('The length of the android_clean data set:', len(android_clean))
print('The length of the android data set:', len(android_data))

print('\n')
for row in android_clean:
    if row[0] == 'Slack':
        print(row)

In [None]:
print('The total number of entries in the iOS data set:', len(ios_data))

ios_id_nums = []
for row in ios_data:
    if row[0] not in ios_id_nums:
        ios_id_nums.append(row[0])
    else:
        print('This ID number already exists:', row[0])
        
print('The total number of unique ID numbers in the iOS data set:', len(ios_id_nums))

## Focusing on apps intended for an English-speaking audience ##
In the code blocks below a ```common_english_character``` function is written and called on a sample data set. Then it is used on the Android and iOS data sets and checked against the results.
Any title with three or more characters outside the common English (ASCII) character set will be removed from the data set. This allows for some special characters in titles, but limits the likelihood that the application is intended for a non-English speaking audience.
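
The rule can be sketched as a small predicate with the cutoff exposed as a parameter. The `is_probably_english` name and the `max_non_ascii` parameter are hypothetical; the default of 2 is chosen to match the `c_count < 3` test used in the code cells.

```python
def is_probably_english(title, max_non_ascii=2):
    """Return True if title has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in title if ord(character) > 127)
    return non_ascii <= max_non_ascii


titles = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for title in titles:
    print(title, is_probably_english(title))
```

A title with a single trademark sign or emoji passes, while a title written mostly in non-Latin characters does not. This is a heuristic on the title only; it says nothing certain about the app's actual audience.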

In [None]:
string1 = 'Instagram'
string2 = '爱奇艺PPS -《欢乐颂2》电视剧热播'
string3 = 'Docs To Go™ Free Office Suite'
string4 = 'Instachat 😜'

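def common_english_character(string):
    # Count characters outside the ASCII range (code points above 127).
    c_count = 0
    for character in string:
        if ord(character) > 127:
            c_count += 1
        if c_count == 3:
            return False
    # Fewer than three non-ASCII characters: treat the title as English.
    return True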

print(common_english_character(string1))
print(common_english_character(string2))
print(common_english_character(string3))
print(common_english_character(string4))

In [None]:
dataset = [ ['Instagram'], ['爱奇艺PPS -《欢乐颂2》电视剧热播'], ['Docs To Go™ Free Office Suite'], ['Instachat 😜']]
cec_dataset = []
cec_dataset_2 = []
non_english_dataset = []

def common_english_character(dataset1, dataset2, dataset3, title_index):
    for row in dataset1:
        c_count = 0
        app_name = row[title_index]
        for character in app_name:
            if ord(character) > 127:
                c_count += 1
        if c_count < 3:
            dataset2.append(row)
        else:
            dataset3.append(row)
            
common_english_character(dataset, cec_dataset, non_english_dataset, 0)
print(dataset)
print(cec_dataset)
print(non_english_dataset)

common_english_character(cec_dataset, cec_dataset_2, non_english_dataset, 0)
print(cec_dataset_2)

In [None]:
cec_android = []
non_english_android = []


common_english_character(android_data, cec_android, non_english_android, 0)


print(len(android_data))
print(len(cec_android))
print(len(non_english_android))
print(non_english_android[:5])

# tester = non_english_android[1]
# print(tester)
# cec_android.append(tester) ### running the kernel without this commented will show the changes in the totals in the second block

In [None]:
cec_android_the_second = []

common_english_character(cec_android, cec_android_the_second, non_english_android, 0)


print(len(android_data))
print(len(cec_android))
print(len(cec_android_the_second))
print(len(non_english_android))
print(non_english_android[:5])

In [None]:
cec_ios = []
non_english_ios = []

common_english_character(ios_data, cec_ios, non_english_ios, 1)

print(len(ios_data))
print(len(cec_ios))
print(len(non_english_ios))

In [None]:
android_data = cec_android
ios_data = cec_ios

print(len(android_data))
print(len(ios_data))

## Focusing on free applications ##

Both data sets can easily be cleaned of any applications with a price other than 0.
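
The code cells below compare the price text against an exact string (```'0'``` for Android, ```'0.0'``` for iOS). A slightly more tolerant check, sketched here under the assumption that paid prices may carry a leading dollar sign, converts the text to a number first. The `is_free` helper is hypothetical and not part of the original analysis.

```python
def is_free(price_text):
    """Return True if the price text represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Text that is not a number (for example a shifted column) is not treated as free.
        return False


for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```

With this helper the same test could be applied to both data sets, whatever the exact formatting of a zero price.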

In [None]:
free_android = []
charge_android = []

print(len(android_data))
print(len(free_android))
print(len(charge_android))

for row in android_data:
    if row[7] == '0':
        free_android.append(row)
    else:
        charge_android.append(row)
    
print(len(android_data))
print(len(free_android))
print(len(charge_android))

android_data = free_android

In [None]:
free_ios = []
charge_ios = []

print(len(ios_data))
print(len(free_ios))
print(len(charge_ios))

for row in ios_data:
    if row[4] == '0.0':
        free_ios.append(row)
    else:
        charge_ios.append(row)

print(len(ios_data))
print(len(free_ios))
print(len(charge_ios))

ios_data = free_ios

for row in free_ios:
    if row[4] != '0.0':
        print(row)
        
print(len(ios_data))

## Determining Popularity of Existing Applications  ##
## Part 1 ##
A client wants to build the application for the Android environment initially, and then build it for iOS once it has been shown to be successful. A minimal version for Android will be created first, followed by a refined version based on user response. After six months of profitability an iOS version will be built.

A frequency table will show the most common genres for each environment.

For the App Store genre distribution we can see that the Games category represents the majority of the free app universe for iOS at around 58%. Entertainment is a distant second with less than 8%, while productivity and social media apps represent most of the rest of the field.

The Google Play Store uses both a ```Genres``` and a ```Category``` descriptor. The difference between the two is not entirely clear, except that the genre seems to be much more granular. The Android results show a large number of free apps in the FAMILY category. Expanding the data to look at apps within the FAMILY category shows that most of them are games.

A quick comparison of both environments shows the iOS system offering mostly games, 58%. With Android though, even if we assume that most of the apps in the FAMILY category are games, combining that with the actual GAMES category only results in a little less than 29%. That suggests that Android has a more diverse set of applications. Within both environments many of the applications are games, but it is uncertain if they are the most popular among users.
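
The percentages quoted above come from a frequency table. A minimal sketch of the same computation using `collections.Counter` from the standard library is shown below; the `genre_percentages` helper and the four sample rows are hypothetical.

```python
from collections import Counter


def genre_percentages(data_set, index):
    """Return (value, percentage) pairs for column index, most common first."""
    counts = Counter(row[index] for row in data_set)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Hypothetical rows in the form [name, genre].
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]
table = genre_percentages(sample, 1)
for genre, percentage in table:
    print(genre, ':', percentage)
```

Three of the four sample rows are games, so that genre accounts for 75% of the sample and music for the remaining 25%.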

In [None]:
# Interesting to see how sometimes the category and genre is identical
# and sometimes it shows a lot of sub-division.

# row_count = 0
# every_fifth = []

# for row in android_data:
#     row_count += 1
#     if row_count % 5 == 0:
#         every_fifth.append(row)
        
# for row in every_fifth:
#     print(row[1], row[9])

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        key = row[index]
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
#     return table
            
    table_percentages = {}
    for value in table:
        percentage = (table[value] / total) * 100
        table_percentages[value] = percentage
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [None]:
display_table(ios_data, 11)  # prints the table; the function returns nothing to store

In [None]:
display_table(android_data, 1)  # prints the table; the function returns nothing to store

In [None]:
display_table(android_data, 9)  # prints the table; the function returns nothing to store

In [None]:
android_family = []
for row in android_data:
    if row[1] == "FAMILY":
        android_family.append(row)
        
android_family

## Determining Popularity of Existing Applications  ##
## Part 2 ##

In addition to seeing which category or genre of application is most common, the number of downloads in each category can be approximated. This is available under the ```Installs``` column in the Google Play data, but is absent for iOS. For iOS it can be approximated using the total number of ratings as a proxy.
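
The per-genre average can also be computed in a single pass over the data by keeping a running total and a running count for each genre. The sketch below assumes nothing about the real files; the `average_by_group` helper and the sample numbers are hypothetical.

```python
from collections import defaultdict


def average_by_group(data_set, group_index, value_index):
    """Return {group: mean of the value column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in data_set:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows in the form [name, genre, rating_count].
sample = [
    ['App A', 'Navigation', '300'],
    ['App B', 'Navigation', '100'],
    ['App C', 'Music', '50'],
]
averages = average_by_group(sample, 1, 2)
print(sorted(averages.items(), key=lambda pair: pair[1], reverse=True))
```

For the notebook's iOS data the call would presumably be `average_by_group(ios_data, 11, 5)`, matching the column indexes used in the cells below.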

## iOS ##
When the top five App Store genres are examined more closely a few things can be considered.

For the ```Navigation``` genre, Waze and Google Maps account for more than 95% of downloads (as approximated by rating counts). This doesn't indicate much room for other apps. The number of downloads in the ```Reference``` genre is also heavily influenced by a Bible translation and two dictionary apps. On the other hand, in the ```Social Networking``` genre there are a few apps that are easily recognizable (like Facebook and Pinterest), but there are many more with a smaller share of the market. This could be a space that allows for some competition among less popular providers. The same is true of ```Music``` apps. One avenue to look at is building an app that targets a specific group, like a social media connector for people with the same hobby or sports interests, or a music app affiliated with a particular label or group. In the ```Weather``` category there are a few more very popular applications, but the less popular ones represent only a small number of downloads.
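
The "share of a genre" figures discussed above are each app's rating count divided by the genre's total. A minimal sketch follows; the `rating_shares` helper, the app names and the counts are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rating_shares(data_set, genre, name_index, ratings_index, genre_index):
    """Return (app name, percent of the genre's ratings) pairs, largest share first."""
    in_genre = [
        (row[name_index], float(row[ratings_index]))
        for row in data_set
        if row[genre_index] == genre
    ]
    total = sum(count for _, count in in_genre)
    if total == 0:
        return []  # no apps, or no ratings at all, in this genre
    shares = [(name, count / total * 100) for name, count in in_genre]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows in the form [name, rating_count, genre].
sample = [
    ['App A', '250', 'Navigation'],
    ['App B', '750', 'Navigation'],
    ['App C', '400', 'Music'],
]
shares = rating_shares(sample, 'Navigation', 0, 1, 2)
print(shares)
```

Working from a list of pairs instead of a dictionary keyed by name means two apps that happen to share a name are both kept.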

In [None]:
ios_prime_genre_freq_table = freq_table(ios_data, 11)

rating_dict = {}

for prime_genre in ios_prime_genre_freq_table:
    total = 0
    len_genre = 0
    for row in ios_data:
        genre_app = row[11]
        if genre_app == prime_genre:
            rating_count_tot = float(row[5])
            total += rating_count_tot
            len_genre += 1
    avg_number_ratings = total/len_genre
    rating_dict[prime_genre] = avg_number_ratings
    
rating_dict_sorted = sorted(rating_dict.items(), key=lambda x: x[1], reverse=True)
print(rating_dict_sorted)

In [None]:
nav_apps = []
nav_percentages = {}

def genre_looker(ios_prime_genre, app_list, app_dict):
    total_apps_in_genre = 0
    for row in ios_data:
        app = []
        if row[11] == ios_prime_genre:
            app.append(row[1])
            app.append(row[5])
            app_list.append(app)
    for row in app_list:
        total_apps_in_genre += float(row[1])
        app_dict[row[0]] = row[1]
    for row in app_dict:
        app_percentage = (float(app_dict[row])/total_apps_in_genre) * 100
        app_dict[row] = app_percentage
        
genre_looker('Navigation', nav_apps, nav_percentages)

nav_percentages_sorted = sorted(nav_percentages.items(), key=lambda x: x[1], reverse=True)
print(nav_percentages_sorted)

In [None]:
reference_apps = []
reference_percentages = {}

# genre_looker is already defined in the Navigation cell above and is reused here.
        
genre_looker('Reference', reference_apps, reference_percentages)

reference_percentages_sorted = sorted(reference_percentages.items(), key=lambda x: x[1], reverse=True)
print(reference_percentages_sorted)

In [None]:
social_networking_apps = []
social_networking_percentages = {}

# genre_looker is already defined in the Navigation cell above and is reused here.
        
genre_looker('Social Networking', social_networking_apps, social_networking_percentages)

social_networking_percentages_sorted = sorted(social_networking_percentages.items(), key=lambda x: x[1], reverse=True)
print(social_networking_percentages_sorted)

In [None]:

music_apps = []
music_percentages = {}

# genre_looker is already defined in the Navigation cell above and is reused here.
        
genre_looker('Music', music_apps, music_percentages)
print(music_percentages)

In [None]:
weather_apps = []
weather_percentages = {}

# genre_looker is already defined in the Navigation cell above and is reused here.
        
genre_looker('Weather', weather_apps, weather_percentages)
print(weather_percentages)

## Android ##
The first thing that is apparent when looking at the Android data is the vast difference in scale between the two environments, although rating counts understate downloads, so the two figures are not directly comparable. No iOS genre averaged over 100,000 ratings per app, which is less than the average number of installs in the least downloaded Android category. The top nine Android categories all average over 10,000,000 installs per app. Within the top five categories, the code below counts the apps in the middle of the range, between 500,000 and 5,000,000 installs. This shows a surprisingly larger set of applications in the photography and productivity categories than in the three more popular groups.
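
The two steps used repeatedly in the cells below, turning an install label such as ```'1,000,000+'``` into a number and counting the apps of one category that fall inside a range, can be sketched as two small helpers. Both `parse_installs` and `count_in_range` are hypothetical names, and the sample rows are made up.

```python
def parse_installs(text):
    """Convert an install label such as '1,000,000+' to an integer."""
    return int(text.replace('+', '').replace(',', ''))


def count_in_range(data_set, category, low, high, category_index, installs_index):
    """Count rows in category whose installs are strictly between low and high."""
    return sum(
        1
        for row in data_set
        if row[category_index] == category
        and low < parse_installs(row[installs_index]) < high
    )


# Hypothetical rows in the form [name, category, installs].
sample = [
    ['App A', 'SOCIAL', '1,000,000+'],
    ['App B', 'SOCIAL', '500,000+'],
    ['App C', 'SOCIAL', '10,000,000+'],
    ['App D', 'TOOLS', '1,000,000+'],
]
in_range = count_in_range(sample, 'SOCIAL', 500000, 5000000, 1, 2)
print(in_range)
```

Both bounds are exclusive, as in the cells below, so an app labelled ```'500,000+'``` is not counted. For the notebook's Android data the indexes would presumably be 1 and 5.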

In [None]:
android_category = freq_table(android_data, 1)
android_installs = {}

for row in android_category:
    total = 0
    len_category = 0
    for app in android_data:
        app_category = app[1]
        if app_category == row:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_n_installs = int(total/len_category)
    android_installs[row] = avg_n_installs

android_installs = sorted(android_installs.items(), key=lambda x: x[1], reverse=True)
print(android_installs)


In [None]:
android_communication = {}

for row in android_data:
    if row[1] == 'COMMUNICATION':
        n_installs = row[5]
        n_installs = n_installs.replace('+', '')
        n_installs = n_installs.replace(',', '')
        n_installs = int(n_installs)
        android_communication[row[0]] = n_installs

android_communication = sorted(android_communication.items(), key=lambda x: x[1], reverse=True)
# print(android_communication)

apps_in_range = 0

for row in android_communication:
    if 5000000 > row[1] > 500000:
        apps_in_range += 1
        
print(apps_in_range)

In [None]:
android_video_players = {}

for row in android_data:
    if row[1] == 'VIDEO_PLAYERS':
        n_installs = row[5]
        n_installs = n_installs.replace('+', '')
        n_installs = n_installs.replace(',', '')
        n_installs = int(n_installs)
        android_video_players[row[0]] = n_installs

android_video_players = sorted(android_video_players.items(), key=lambda x: x[1], reverse=True)
# print(android_video_players)

apps_in_range = 0

for row in android_video_players:
    if 5000000 > row[1] > 500000:
        apps_in_range += 1
        
print(apps_in_range)

In [None]:
android_social = {}

for row in android_data:
    if row[1] == 'SOCIAL':
        n_installs = row[5]
        n_installs = n_installs.replace('+', '')
        n_installs = n_installs.replace(',', '')
        n_installs = int(n_installs)
        android_social[row[0]] = n_installs

android_social = sorted(android_social.items(), key=lambda x: x[1], reverse=True)
# print(android_social)

apps_in_range = 0

for row in android_social:
    if 5000000 > row[1] > 500000:
        apps_in_range += 1
        
print(apps_in_range)

In [None]:
android_photography = {}

for row in android_data:
    if row[1] == 'PHOTOGRAPHY':
        n_installs = row[5]
        n_installs = n_installs.replace('+', '')
        n_installs = n_installs.replace(',', '')
        n_installs = int(n_installs)
        android_photography[row[0]] = n_installs

android_photography = sorted(android_photography.items(), key=lambda x: x[1], reverse=True)
# print(android_photography)

apps_in_range = 0

for row in android_photography:
    if 5000000 > row[1] > 500000:
        apps_in_range += 1
        
print(apps_in_range)

In [None]:
android_productivity = {}

for row in android_data:
    if row[1] == 'PRODUCTIVITY':
        n_installs = row[5]
        n_installs = n_installs.replace('+', '')
        n_installs = n_installs.replace(',', '')
        n_installs = int(n_installs)
        android_productivity[row[0]] = n_installs

android_productivity = sorted(android_productivity.items(), key=lambda x: x[1], reverse=True)
# print(android_productivity)

apps_in_range = 0

for row in android_productivity:
    if 5000000 > row[1] > 500000:
        apps_in_range += 1
        
print(apps_in_range)

# Conclusion #

Two notable conclusions can be reached from this quick analysis.
* The most popular app genres are dominated by a few very popular applications. For example, the iOS ```Navigation``` category consists almost entirely of two apps. This is also true for Android, but may be less important. The total number of Android downloads per genre (tens of millions with Android as opposed to tens of thousands with iOS) still leaves ample room for development. Additionally, all the groupings in the Google arena appear to be less skewed toward a few dominant apps.
* In both environments, micro-targeting or developing for a smaller niche group could still result in a product that reaches a lot of customers. An application associated with a unique style of music (e.g. cumbia, samba, or regional folk music) or a specific hobby (knitting, kite-boarding) could be globally popular or maintain a steady user base.