# What Makes a "Killer App"?
![Image](https://i.imgur.com/wxc4mct.png)

The aim of this project is to build a free app for our customers that will allow us to attract the most possible users, hence maximizing our revenue. To do this, we will take a look at existing apps on both the Apple App Store and the Google Play Store, analyze, interpret and synthesize the data and use the results to influence how we go about building our app.

Before we get further though, let's present a bit of background on the data. As of September 2018 (the time when this data was taken), there were 2 million iOS apps on the Apple App Store and 2.1 million Android Apps on the Google Play Store.

Because analyzing data for more than 4 million apps would cost a significant amount of money, we'd be better served by a much smaller but still relevant sample that comes at no further cost to us.

Doing some research, I was able to find some relevant datasets at the popular dataset repository site known as [Kaggle](https://www.kaggle.com/).

* The [Google Dataset](https://www.kaggle.com/lava18/google-play-store-apps) and the [direct spreadsheet download](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
* The [Apple Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and the [direct spreadsheet download](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)


## Exploring the Data
In the next cell, we are going to write code that opens up both of our datasets so we can start exploring.

In [1]:
"""Open the file, import 'reader' function from the csv module, read it in and
transform it into a list"""

opened_file1 = open('AppleStore.csv') # Apple Store App data
opened_file2 = open('googleplaystore.csv') # Google Play Store App data
from csv import reader
read_file1 = reader(opened_file1)
read_file2 = reader(opened_file2)

# Convert to lists of lists
apple_alldata = list(read_file1) # Apple App Store list of lists w/ header row
google_alldata = list(read_file2) # Google Play Store list of lists w/ header row

# Our App Store dataset, divided

apple_dataset = apple_alldata[1:] # all of the apps
apple_header = apple_alldata[0] # header row

# Our Google dataset, divided

google_dataset = google_alldata[1:]
google_header = google_alldata[0]

# making a function that prints the Google Play store column names
def printgoogle_header():
    print(google_header)
    print('\n')

# making a function that prints the Apple App Store column names
def printapple_header():
    print(apple_header)
    print('\n')

# Create a function that allows us to see the index number and length of any set of rows that we specify
def explore_data(dataset, start, end, rows_and_columns=False, index_and_length=True):
    dataset_slice = dataset[start:end]
    index_num = start # Here we set 'index_num' to the same value as the 'start' parameter/variable
    
    ### This if block of code automatically prints out the corresponding header row of the dataset that we use
    if dataset == google_dataset:
        printgoogle_header()
        
    elif dataset == apple_dataset:
        printapple_header()
    
    else:
        pass
    
    for row in dataset_slice:
        print(row)
        if index_and_length:
            print('Index number: ' + str(index_num))
            print('The length of this row is ' + str(len(row)))
            print('\n') # adds a new (empty) line after each row
            index_num += 1
        else:
            print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
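
As a side note, the file-opening steps above can also be wrapped in a small helper that closes each file when it is done. This is only a sketch: `load_csv` is a hypothetical name that does not appear in the notebook, and the `utf8` encoding is an assumption about how the two CSV files are stored.

```python
import csv

def load_csv(path):
    """Return (header, rows) for a CSV file.

    Hypothetical helper, not part of the original notebook.
    Assumes the file is UTF-8 encoded.
    """
    # newline='' is what the csv module documentation recommends for reader input
    with open(path, encoding='utf8', newline='') as csv_file:
        all_rows = list(csv.reader(csv_file))
    # First row is the header, everything after it is app data
    return all_rows[0], all_rows[1:]
```

With this helper, `apple_header, apple_dataset = load_csv('AppleStore.csv')` would stand in for the separate open, read and slice steps.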

For the App Store and the Google Play store, we need to identify the important columns that will facilitate our analysis. First, we'll print out the header for our Google dataset along with its first 4 rows of app data.

In [2]:
explore_data(google_dataset, 0, 4, True, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
Index number: 0
The length of this row is 13


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
Index number: 1
The length of this row is 13


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
Index number: 2
The length of this row is 13


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with d

As we've seen from our output, the most important column names for the Google Play store appear to be: 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres'. Next we'll look at the App Store.

In [3]:
explore_data(apple_dataset, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
Index number: 0
The length of this row is 16


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
Index number: 1
The length of this row is 16


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Index number: 2
The length of this row is 16


Number of rows: 7197
Number of columns: 16


Above we do the same thing with the App Store apps as we did with the Google Play apps and we notice that column names for the header are quite different. However, we still manage to identify the most useful columns for our analysis.

From the App Store some of these important columns include: 'prime_genre', 'rating_count_tot', 'user_rating', 'cont_rating'.
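
Since the rest of the analysis refers to these columns by position (for example `-5` for `prime_genre`), a name-to-position lookup can make the indices easier to double-check. The sketch below copies the header and the first row from the output above; `apple_col` is a hypothetical name introduced only for illustration.

```python
# Header row and first data row copied from the App Store output above
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
first_row = ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676',
             '212', '3.5', '3.5', '95.0', '4+', 'Social Networking',
             '37', '1', '29', '1']

# Hypothetical lookup table: column name -> position within a row
apple_col = {name: position for position, name in enumerate(apple_header)}

print(apple_col['prime_genre'])             # 11, the same column as index -5
print(first_row[apple_col['prime_genre']])  # Social Networking
```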

Just in case you'd like to take another look at the datasets:

* [Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps)
* [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

## Searching for and Deleting Bad Data

When we're working with a dataset taken from a public site like Kaggle, it is important to check the discussion forums to confirm that the data is intact. One advantage of a site like Kaggle is that, in most cases, we are not the first users to work with a dataset. For the Google Play dataset, I came across one such discussion in which users reported that a specific row is missing its 'Category' value.

According to the users, the index number of the row is 10472. So we'll use our `explore_data()` function to see if it finds anything off.

In [4]:
# Searching for a suspicious piece of data
explore_data(google_dataset, 10470, 10481, True )

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']
Index number: 10470
The length of this row is 13


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
Index number: 10471
The length of this row is 13


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Index number: 10472
The length of this row is 12


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
Index number: 10473
The length of this row is 13


['Sat-Fi Voice', 'COMMUNICATI

In the output above, the row at index 10472 has a length of 12 instead of the expected 13. Its `Category` value is missing, so every later value sits one column to the left of where it belongs (the rating `'1.9'` appears in the `Category` position), and its `Genres` value is just an empty string `''`.
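
The same length check can be run over a whole dataset instead of one slice, which is handy when no forum post points at the bad row. This is a minimal sketch on a trimmed three-column sample; `find_malformed_rows` is a hypothetical helper, not something defined in the notebook.

```python
def find_malformed_rows(rows, header):
    """Return (index, row_length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(index, len(row)) for index, row in enumerate(rows)
            if len(row) != expected]

# Trimmed sample: only the first three columns of each row
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # 'Category' is missing
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
]

print(find_malformed_rows(sample_rows, sample_header))  # [(1, 2)]
```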

Now that we know which row is the culprit, we are going to use the `del` statement to get rid of it in the next cell!


In [5]:
### Guarded so that re-running this cell cannot delete a valid row ###
if len(google_dataset[10472]) != len(google_header):
    del google_dataset[10472]
explore_data(google_dataset, 10470, 10481, True )

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']
Index number: 10470
The length of this row is 13


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
Index number: 10471
The length of this row is 13


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
Index number: 10472
The length of this row is 13


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']
Index number: 10473
The length of this row is 13


['Wi-Fi Visualizer', 'TOOL

In [6]:
for app in google_dataset:
    app_name = app[0]
    if app_name == 'Coloring book moana':
        print(app)

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['Coloring book moana', 'FAMILY', '3.9', '974', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


In this example, we use the 'Coloring book moana' app to illustrate the duplicate issue: as the output shows, it appears twice in this dataset, with different categories and review counts. In the following lines, we're going to find out exactly how many duplicates we're working with by looping over the Google dataset and using the `in` operator to separate names we have already seen from new ones.

In [7]:
duplicate_apps = []
unique_apps = []

for app in google_dataset:
    app_name = app[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])
      

Number of duplicate apps:  1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


The loop above shows that this Google dataset contains 1181 duplicate entries. Our next objective is to remove these duplicates before we work with the data. Going back to the 'Coloring book moana' example a few cells above, we're not going to delete duplicates at random until only one is left; instead, we will keep the most recent entry. The number of reviews is our indicator: the entry with the highest number of reviews should be the most recent one. Alternatively, we could also check the 'Last Updated', 'Current Ver' and 'Android Ver' columns.
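
As a cross-check on the duplicate count, the same numbers can be derived with `collections.Counter` from the standard library, which avoids scanning a list for every app name. A small sketch on sample names; `count_duplicates` is a hypothetical helper.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Return (number of duplicate entries, number of unique names)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first one counts as a duplicate entry
    n_duplicates = sum(counts.values()) - n_unique
    return n_duplicates, n_unique

# One-column sample rows holding only the app name
sample = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
print(count_duplicates(sample))  # (2, 3)
```

The second number is what the expected length should be once the duplicates are removed.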

In [8]:
print('Expected length: ', len(google_dataset) - len(duplicate_apps))

Expected length:  9659


In [9]:
reviews_max = {}

for app in google_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
   
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))    
print('\n')


9659




In [10]:
android_clean = []
already_added = []

for app in google_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
    print(name)
    print('Key value: ', reviews_max[name])
    print('The app\'s amount of reviews: ', n_reviews)
    print(app in android_clean)
    if name in already_added:
        print('This app has already been added to the new list')
    else:
        print('This app has not been added yet')
    print('\n')
    
      
print(len(android_clean))

Photo Editor & Candy Camera & Grid & ScrapBook
Key value:  159.0
The app's amount of reviews:  159.0
True
This app has already been added to the new list


Coloring book moana
Key value:  974.0
The app's amount of reviews:  967.0
False
This app has not been added yet


U Launcher Lite – FREE Live Cool Themes, Hide Apps
Key value:  87510.0
The app's amount of reviews:  87510.0
True
This app has already been added to the new list


Sketch - Draw & Paint
Key value:  215644.0
The app's amount of reviews:  215644.0
True
This app has already been added to the new list


Pixel Draw - Number Art Coloring Book
Key value:  967.0
The app's amount of reviews:  967.0
True
This app has already been added to the new list


Paper flowers instructions
Key value:  167.0
The app's amount of reviews:  167.0
True
This app has already been added to the new list


Smoke Effect Photo Maker - Smoke Editor
Key value:  178.0
The app's amount of reviews:  178.0
True
This app has already been added to the new list

In the first of the two cells above, we created an empty dictionary called `reviews_max` and looped through the Google Play dataset. The dictionary pairs each unique app name (the key) with the highest number of reviews recorded for that app (the value). To build it, we set the condition `if name in reviews_max and reviews_max[name] < n_reviews:`, under which the value stored for that key is updated to the higher review count in `n_reviews`. Otherwise, if `name` is not yet in `reviews_max`, we create a new key-value entry. These two conditionals populate the dictionary with one entry per app name. To confirm that we were on the right track, we checked the dictionary's length with `len(reviews_max)`: it matched our expected length of 9659 rows.

In the second cell, we initialized two empty lists, `android_clean` and `already_added`. We looped through `google_dataset` again, isolating the app name and the number of reviews, and checked whether `n_reviews == reviews_max[name]` (in other words, whether the current row's number of reviews equals the value stored in `reviews_max` under the current row's name). The second part of this `if` statement, `and (name not in already_added)`, is necessary because some duplicate apps have multiple entries that share the same highest number of reviews, so those duplicates would still get through without this condition.
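
The two passes described above can also be folded into a single pass that stores the whole row instead of just the review count. This is a sketch of an alternative, not the notebook's method: `keep_most_reviewed` is a hypothetical name, the sample rows are trimmed to their first four columns, and the result is ordered by each name's first appearance, which may differ from the order of `android_clean`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Rows from the outputs above, trimmed to App, Category, Rating, Reviews
rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'],
    ['Coloring book moana', 'FAMILY', '3.9', '974'],
]

deduped = keep_most_reviewed(rows)
for row in deduped:
    print(row)
```

Only the 'FAMILY' entry of 'Coloring book moana' survives, because it has 974 reviews against 967.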


In [11]:
print(apple_dataset[813][1])
print(apple_dataset[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


In [12]:
def isolate_english(string):
    
    false_counter = 0
    
    for character in string:
        code = ord(character)
        
        if code > 127:
            false_counter += 1
        
        
    if false_counter > 3:
        return False
            
    return True

android_english = []
apple_english = []

for apps in android_clean:
    name = apps[0]
    
    if isolate_english(name):
        android_english.append(apps)


for apps in apple_dataset:
    name = apps[1]
    
    if isolate_english(name):
        apple_english.append(apps)

printgoogle_header()        
explore_data(android_english, 0, 3, True, True)
printapple_header()
explore_data(apple_english, 0, 3, True, True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
Index number: 0
The length of this row is 13


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
Index number: 1
The length of this row is 13


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Index number: 2
The length of this row is 13


Number of rows: 9614
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating'

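After also keeping only the apps whose names contain at most three non-ASCII characters (our rough test for an English name), the cleaned data contains: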

* App Store: 6183 rows remaining
* Google Play: 9614 rows remaining
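
The English-name filter used above boils down to counting characters whose code point is above 127. A compact sketch of the same rule follows; `is_probably_english` and its `max_non_ascii` parameter are hypothetical names, with the default of 3 taken from the cell above.

```python
def is_probably_english(name, max_non_ascii=3):
    """True when the name has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

# App names taken from the outputs above
print(is_probably_english('Instagram'))                                           # True
print(is_probably_english('U Launcher Lite – FREE Live Cool Themes, Hide Apps'))  # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))                          # False
```

Allowing a few non-ASCII characters keeps names that contain a dash, an emoji or a trademark sign, at the cost of letting through some short non-English names.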

## Separating the Free Apps

As we mentioned in the first paragraph, we're focusing only on free apps, so we're going to keep only the apps whose price is `'0.0'` (App Store) or `'0'` (Google Play).

In [13]:
android_final = []
apple_final = []

for app in apple_english:
    price = app[4]
    if price == '0.0':
        apple_final.append(app)
    

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
print(len(android_final))
print(len(apple_final))
    


8864
3222


Our final datasets contain 8864 apps for the Google Play Store and 3222 apps for the App Store.
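
Because the two stores write a free price differently (`'0.0'` versus `'0'`), one option is to compare prices as numbers rather than as strings. The helper below is a hypothetical sketch; it assumes that paid prices may carry a leading `$` sign, which is an assumption about the data rather than something shown in the outputs above.

```python
def is_free(price_text):
    """True when a price string such as '0', '0.0' or '$0.00' parses to zero."""
    # Assumption: a paid price may look like '$4.99', so drop a leading '$'
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as "not free"
        return False

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```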

## Most Common Apps by Genre

We want to find an app profile that fits both the App Store and Google Play in order to maximize our revenue and profit. This is our validation strategy: 

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

We'll build our frequency tables based on `Genres` and `Category` for the Google Play store and for the App Store we'll be using the `prime_genre` column. 

In [14]:
def freq_table(dataset, index):

    frequency_table = {}
    total = 0

    for rows in dataset:
        total += 1
        column = rows[index]
        if column in frequency_table:
            frequency_table[column] += 1
        else:
            frequency_table[column] = 1
            
    freq_percentages = {}    
    
    for key in frequency_table:
        percentage = (frequency_table[key] / total) * 100
        freq_percentages[key] = percentage
    
    return freq_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(apple_final, -5) # Apple App Store- Prime Genres
print('\n')
display_table(android_final, -4) # Google Play Store- Genres
print('\n')
display_table(android_final, 1) #Google Play Store- Category
        

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.31678700

For our Apple dataset, the most common genre is Games, at about 58 percent of the free English apps. The next most common is Entertainment, at about 8 percent. Education is only around 3.7 percent, which is much lower than I expected. Keep in mind that these figures show how many apps each genre offers, not how many users those apps have.

For our Android dataset, the most common apps appear to be tools, family apps and other practical apps rather than games.
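
For reference, the frequency tables above can be produced in fewer lines with `collections.Counter`, whose `most_common()` method already returns entries sorted from most to least frequent. A sketch on made-up sample rows; `freq_percentages` is a hypothetical helper, not the notebook's `freq_table`.

```python
from collections import Counter

def freq_percentages(rows, index):
    """Return (value, percentage) pairs, most frequent value first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up sample rows: (app name, genre)
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Education']]

for genre, percentage in freq_percentages(sample, 1):
    print(genre, ':', percentage)
# Games : 75.0
# Education : 25.0
```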


## Most Popular Apps on the App Store (By Genre)

Next we're going to calculate the average number of user ratings per app for each genre, so that we can get an idea of how popular each genre is.

In [15]:
genres_os = freq_table(apple_final, -5)

for genre in genres_os:
    
    total = 0
    len_genre = 0
    
    for row in apple_final:
        
        genre_app = row[-5]
        if genre == genre_app:
            ratings = float(row[5])
            total += ratings
            len_genre += 1

    avg_number = total / len_genre 
    print(genre, ':', avg_number)
            

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


For our App Store results, the genre with the highest average number of ratings is Navigation. But that average is probably misleading because it is driven largely by Waze and Google Maps. The same applies to 'Social Networking' apps, where the number of ratings is skewed towards a few big apps like Facebook, Pinterest, Skype, etc.

In [16]:
for app in apple_final:
    genre = app[-5]
    if genre == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


As we expected, the skew here is very heavy, and I don't see an opportunity for us to slip in and make our mark here.
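
One way to put a number on this skew is to compare the mean with the median, since the median is far less sensitive to a couple of giant apps. The sketch below uses the six Navigation rating counts printed above and the standard library's `statistics` module.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('mean  :', mean(navigation_ratings))    # about 86090, pulled up by Waze and Google Maps
print('median:', median(navigation_ratings))  # 8196.5
```

The mean is roughly ten times the median, which is what we would expect when two apps account for most of the ratings.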

Let's try another category that is a bit more niche, like the "Reference" category.

In [17]:
for app in apple_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Even with a bit of top-heaviness, this genre is still something that we can entertain finding a niche in. We can do something like take a popular book and soup it up a bit (I'll explain this in more detail later). And we can also make a guide for a popular game like "Minecraft" (in the output above), Fortnite or perhaps even other games!

We know that since our app is going to be free, these types of app ideas can provide great ad revenue as well as in-app purchase potential for monetization.

The book genre is also something that seems to overlap a bit with the reference genre and meshes well with our app ideas as well.

The other popular app genres aren't quite as inviting because they either need a lot more infrastructure (Food & Drink) or users don't spend enough time in them (Weather), so monetizing them will be difficult.


## Most Popular Apps on the Play Store (By Number of Installs)

For the Google Play Store apps, we actually have info about the number of installs, unlike our Apple Data.

In [18]:
display_table(android_final, 5) # the Installs columns

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


Next, we would like to perform computations on these values, and to do that we have to remove the '+' and ',' characters. So in our next cell, we will generate a frequency table for our Play Store dataset, strip those characters and compute the average number of installs for each category.

In [19]:
genres_android = freq_table(android_final, 1)

for category in genres_android:
    total = 0
    len_category = 0
    
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs )
            
      

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

It seems that communication apps have the most installs on average, at around 38,456,119. But is that information actually useful to us? Let's try something in the next cell.

In [20]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

Once again, the problem is that a handful of huge apps make up the bulk of these inflated install numbers. If we were to remove these apps, the average for the communication category would drop sharply.
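
To see how much the giants matter, the category average can be recomputed with an upper bound on installs. The sketch below is illustrative only: `parse_installs` and `average_installs` are hypothetical helpers, and the four sample rows are trimmed to (name, category, installs) using apps that appear in the outputs above.

```python
def parse_installs(text):
    """Turn an install label such as '1,000,000+' into the integer 1000000."""
    return int(text.replace('+', '').replace(',', ''))

def average_installs(rows, category, max_installs=None,
                     category_index=1, installs_index=5):
    """Average installs for one category, optionally ignoring apps at or above max_installs."""
    values = []
    for row in rows:
        if row[category_index] != category:
            continue
        n_installs = parse_installs(row[installs_index])
        if max_installs is None or n_installs < max_installs:
            values.append(n_installs)
    # Return 0.0 instead of dividing by zero when nothing matches
    return sum(values) / len(values) if values else 0.0

# Trimmed sample rows: (name, category, installs)
sample = [
    ['WhatsApp Messenger', 'COMMUNICATION', '1,000,000,000+'],
    ['Kik', 'COMMUNICATION', '100,000,000+'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '10,000+'],
    ['Sat-Fi Voice', 'COMMUNICATION', '1,000+'],
]

print(average_installs(sample, 'COMMUNICATION', installs_index=2))
# 275002750.0
print(average_installs(sample, 'COMMUNICATION', max_installs=100_000_000, installs_index=2))
# 5500.0
```

On the real data, the default `installs_index=5` matches the position of the `Installs` column in `android_final`.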

We should try another category, like books and reference. Perhaps this time we won't run into a niche dominated by disproportionately huge apps.

In [21]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

From these results, we can see that the books and reference category has a variety of apps such as:
* eBooks
* Dictionaries
* Language Guides
* Video Game Guides
* Religious Texts
* Libraries/Compilations of readings
* Programming Texts
* And more...

However, there are still a few dominant apps in this genre that we have to account for. In the next cell, we are going to see how many of those dominant apps we have to deal with exactly.

In [22]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


There are only a few super dominant apps, so maybe we still have a chance to get some traction within this genre. Let's try to get ideas from the apps that are in the 1,000,000 to the 100,000,000 range!

In [23]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

Within this range of installs, there seem to be a lot of libraries, ebook readers, dictionaries and religious texts. So as long as we avoid these specific kinds of apps, we can probably carve out a niche for ourselves in the books and reference category.

Although building an app around the Bible or the Quran is a ship that has already sailed (due to oversaturation), the concept itself is still something that we can apply to our own app. As mentioned with the App Store "Reference" genre, we can use this concept to build an app around a popular book such as "Harry Potter", "Game of Thrones" or "The Witcher", and add enhanced features like audio, pop-up facts, additional lore, community interaction and more. And going by "Stats Royale for Clash Royale", which has over 1,000,000 installs, we can also build an app that serves as a companion guide or stat tracker for a popular game.


### Recap/Conclusion

For this project, our purpose was to analyze two separate datasets, one from the Apple App Store and one from the Google Play Store, with the idea of designing our own free app profile that could be popular in both markets.

After cleaning up the data, generating frequency tables and analyzing the cleaned data, we concluded that the best app profile we could recommend is one in the books and reference space. We could use one of these approaches:
1. Build an app for a popular book that provides a lot of extra features like trivia, audio, community interaction, etc.
2. Build an app that serves as a companion guide or stat tracker for a popular game

If we do this, I am confident that we can avoid clashing with the giants of the genre and carve out our own niche with a unique base of customers.