# The Next Big App

This project aims to help app developers make **data-driven decisions** about the **kind of apps** they build.
- This will help developers build apps that can grow effectively and sustain themselves on the revenue they generate.
- Targeted at apps that are **free** to download and install, which represent the majority of apps on these markets.
    - For these apps, the main revenue source is **in-app ads**. 
    - Thus, revenue for any given app is driven largely by **number of users**.

Goal: 
- To help developers understand the types of apps that are **likely to attract and engage users**.

### Table of Contents

- **Data Extraction**
    - Data Sources
- **Data Cleaning**
    - Removing Incorrect Data
    - Eliminating Duplicates
    - Removing Non-English Apps
    - Removing Paid Apps

- **Data Analysis**
    - Strategy
    - Most Common Apps
    - Most Popular Apps

- **Conclusion**
    - App Profile
    
    
&nbsp;

## Extracting Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on the Google Play Store. 
- Analyzing the entire app population will be too expensive and time-consuming, so instead I will use a sample dataset.
- I will be using the following datasets:
    - [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) with 7k+ rows collected in Jul 2017
    - [Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps/home) with 10k+ rows collected in Aug 2018

In [1]:
from csv import reader

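### For the App Store dataset ###
# encoding is stated explicitly because many app names contain non-ASCII characters;
# the with-block closes the file once it has been read
with open('AppleStore.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]


### For the Google Play Store dataset ###
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    android = list(read_file)
android_header = android[0]
android = android[1:]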

To explore the data, I will write a reusable function called `explore_data()` that prints a slice of rows in a more human-readable way and, optionally, shows the number of rows and columns in the data.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


In [4]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


- As `explore_data()` shows, there are: 
    - **7197 rows** and **17 columns** in the **App Store** data;
        - Some relevant columns are ***'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'prime_genre'***. 
        - For more App Store column info, see [Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

        &nbsp;
    - **10841 rows** and **13 columns** in the **Google Play Store** data. 
        - Some relevant columns are ***'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', 'Genres'***.
        - For more Google Play Store column info, see [Documentation](https://www.kaggle.com/lava18/google-play-store-apps/home)
        

## Data Cleaning

### Removing Incorrect Data
As outlined in one of the Google Play [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), there is an error in row 10472. 
- To identify the issue, let's have a look at it compared to another row.

In [5]:
# show the headers to match data
print(android_header)

# show the row with the wrong data
print(android[10472])

# show the row with the correct data
print(android[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


- Clearly, the data doesn't match. 
    - Row 10472 has only 12 values instead of 13: the **'Category'** value is missing, so every later value has shifted one column to the left.
    - As a result, it has '1.9' for **'Category'**, which is obviously wrong (even if categories were assigned to numbers, a float doesn't make sense). For comparison, the other row has 'ART_AND_DESIGN'.
    - It also has '19' for **'Rating'**, which is wrong because apps are only rated on a scale of 5. 
- As such, I have to delete this row.
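
A shifted row like this can also be found programmatically rather than from a discussion thread. The sketch below is only an illustration on made-up rows: the helper `find_malformed_rows()` and the sample data are hypothetical and not part of the original analysis. It flags any row whose length differs from the header's, which is the property that gives row 10472 away.

```python
# Hypothetical miniature of the Google Play data: one header and three rows,
# the last of which is missing its 'Category' value.
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Coloring book', 'ART_AND_DESIGN', '3.9'],
    ['Photo Frame', '1.9'],  # 'Category' missing, so 'Rating' slid left
]

def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]

malformed = find_malformed_rows(rows, header)
print(malformed)
```

A length check only catches rows with missing or extra values; a row with the right number of values but wrong content would still need a separate check.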

In [6]:
print(len(android))
# Only delete once. After deletion, a new row replaces index 10472.
del android[10472] 

print(len(android))

10841
10840


### Eliminating Duplicates
#### Duplicates in `android` apps

Unfortunately, I have some duplicates in the data. 
- Fortunately, there is a way to see how many duplicates I have in a dataset. 
    - First, I will initialize 2 lists, `duplicate_apps` and `unique_apps`.
    - For each iteration of the loop:
        - Assign the app name to the variable `name`
        - If the app name is already in the `unique_apps` list, add the app name to the `duplicate_apps` list.
        - If the app name is not in the `unique_apps` list, this means it is the first time it is showing up, so add it to the `unique_apps` list.
    

In [7]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('The number of apps with duplicates: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])

The number of apps with duplicates:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


- Now I know that I have 1181 cases where an app occurs more than once.


Let's have a closer look at an example.
- How about the popular app *'Instagram'*?

In [8]:
for app in android:
    name = app[0] # assigning the first element of the app list to the name variable
    if name =='Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


- As expected, *'Instagram'* has more than 1 entry. In fact, it has 4.
    - I don't want duplicate entries to skew the data, so I only want 1 entry per app.


- Interestingly, the only difference between the 4 entries is in `app[3]`, which indicates the number of reviews the app has received. 
    - This suggests that the data for the same app was ***collected at different times***. 
    - Based on this, it makes sense to keep the entry with the **highest number of reviews**. 
        - It is the **latest** *(reviews can only increase or stay the same as time passes)* and consequently, the most **robust**. 

Voilà! I have a criterion.
- Now in order to build this criterion, I have to create a dictionary. 
    - Each key of the dictionary is a unique app name
    - The corresponding value is the highest number of reviews for that app.
- With the dictionary, I can then create a new cleaned dataset with only 1 entry per app. 


In [9]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


- For each iteration:
    - If the app name is already in the `reviews_max` dictionary, that means this is **NOT the first time** the app is showing up.
        - From the ***2nd instance onwards***, if the number of reviews from the ***previous case is lower than the number of reviews from the current case***, then assign `n_reviews` of the current case to the `reviews_max` dictionary.
        
        &nbsp;
    - Otherwise, if the app is not already there, that means this is **the first time** the app is showing up.
        - In this case, just add the `n_reviews` of the current case to the `reviews_max` dictionary.

Earlier, I found that there are 1181 duplicate entries in my data (rows whose app name had already appeared).
- I can use that to check if the size of my dictionary is accurate.

In [10]:
print('Expected: ',len(android)-1181)
print('Actual: ',len(reviews_max))

Expected:  9659
Actual:  9659


With the `reviews_max` dictionary, I can use it to remove the duplicates in our data.
- First, I will initialize 2 lists, `android_clean` and `already_added`.
- For every iteration of the loop through `android`:
    - Isolate app `name` and `n_reviews`
    - If the **`n_reviews` value of the app matches the `reviews_max`** dictionary,
    - **And** if the app name is **not already in the `already_added` list**. 
        - *(This guards against duplicates that share the same `n_reviews`. For instance, the app named 'Box' has 3 entries with the same `n_reviews`, which means all 3 would match the value stored in the `reviews_max` dictionary. This condition keeps only the first of them.)*
    - THEN:
        - Add the current row (app) to the `android_clean` list and the app name `name` to the `already_added` list.

&nbsp;
- In other words,
    - If the `n_reviews` is lower than `reviews_max`, it is **removed**.
    - If the `n_reviews` matches `reviews_max`, but it is already in the `already_added` list, it is **removed**.
    - If the `n_reviews` **matches** `reviews_max`, and it is **not in** `already_added`, then it is **kept**.
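
To see the three outcomes on something small, here is a minimal sketch of the same two-pass idea on made-up rows. The sample rows and the `sample_` variable names are hypothetical (the review count for 'Box' is invented), and a set is used for the names already added, which is my own substitution for the list described above.

```python
# Hypothetical rows in the same layout as the Google Play data:
# index 0 is the app name, index 3 is the review count.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

# Pass 1: highest review count seen for each app name.
sample_reviews_max = {}
for app in sample:
    name, n_reviews = app[0], float(app[3])
    if name not in sample_reviews_max or sample_reviews_max[name] < n_reviews:
        sample_reviews_max[name] = n_reviews

# Pass 2: keep the first row that matches that maximum.
sample_clean = []
sample_added = set()  # a set keeps the membership test fast
for app in sample:
    name, n_reviews = app[0], float(app[3])
    if n_reviews == sample_reviews_max[name] and name not in sample_added:
        sample_clean.append(app)
        sample_added.add(name)

for app in sample_clean:
    print(app)
```

The first 'Instagram' row is dropped because its review count is below the maximum, and the second 'Box' row is dropped because the name has already been added.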

In [11]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


#### Repeat the process for duplicates in `ios` apps

In [12]:
duplicate_apps = []
unique_apps = []

for app in ios:
    name = app[1]  # the app 'id'; app[0] is only the row index, which can never repeat
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('The number of apps with duplicates: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])

The number of apps with duplicates:  0


Examples of duplicate apps:  []


- Since there are no duplicates found for `ios` apps, I don't have to remove anything from the `ios` dataset.

### Removing Non-English Apps

In [13]:
print(ios[922][2])
print(ios[6734][2])

print(android_clean[4412][0])
print(android_clean[7940][0])

QQ游戏大厅HD
エレメンタル ファンタジー - 高精細３ＤアクションＲＰＧ
中国語 AQリスニング
لعبة تقدر تربح DZ


In order to remove these apps, I need to establish a criterion for English. 
- English text is written almost entirely with ASCII characters, each of which has a corresponding number in the range **0-127**, so I can **check if an app name contains non-ASCII** characters.
- It is useful to note that the *built-in* `ord()` function returns the number that corresponds to a character.
    - Therefore, if `ord()` returns a number **greater than 127**, we know it is a **non-ASCII** character.
    
    &nbsp;
    - Having said that, there are instances where characters are out of the ASCII range, but are commonly found in English app names, such as '™', '—', '😜', etc.
    - I have to identify a more specific criterion to remove non-English apps ***without excessive removal of eligible English apps***. 
    - New criterion:
        - For every non-ASCII character, add 1 to `non_ascii`.
        - If an app name has **more than 3 non-ASCII** characters (`non_ascii` > 3), it is removed.
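
To make the `ord()` check concrete, the short sketch below prints the number `ord()` returns for two characters and counts the non-ASCII characters in a few of the names used in this section. The helper `count_non_ascii()` is a hypothetical name introduced only for this illustration.

```python
def count_non_ascii(text):
    """Count the characters whose ord() value falls outside the ASCII range (0-127)."""
    return sum(1 for char in text if ord(char) > 127)

print(ord('a'))   # 97, inside the ASCII range
print(ord('™'))   # 8482, outside the ASCII range

names = ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜', 'QQ游戏大厅HD']
for name in names:
    print(name, '->', count_non_ascii(name))
```

With a threshold of 3, the first three names are kept, while the last one has 4 non-ASCII characters and is removed.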

In [14]:
def is_english(string):
    non_ascii = 0
    
    for char in string:
        if ord(char)>127:
            non_ascii += 1
            
    # English if the name has at most 3 non-ASCII characters
    return non_ascii <= 3

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english(ios[922][2]))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
        

True
False
False
True
True


With this removal criterion in place, we will now remove the non-English apps in both datasets.

In [15]:
ios_eng = []
android_eng = []

for app in ios:
    name = app[2]
    if is_english(name):
        ios_eng.append(app)

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_eng.append(app)

explore_data(ios_eng, 0, 3, True)
print('\n')
explore_data(android_eng, 0, 3, True)



['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 6183
Number of columns: 17


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', '

### Removing Paid Apps

Since I am focusing only on free apps, I have to remove the paid apps from my data.

In [16]:
ios_final = []
android_final = []

for app in ios_eng:
    price = app[5]
    if price == '0':
        ios_final.append(app)

for app in android_eng:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
print(len(ios_final))
print(len(android_final))

3222
8864


**After all the data cleaning, we are left with 3222 iOS apps, and 8864 Android apps.**

## Analyzing Data

### Strategy

In line with my goal to identify the type of apps that are likely to attract and engage users, I have to come up with an efficient strategy to validate whether an app idea will achieve this goal.

1. Build a ***beta version*** of an app for Android, and add it to the Google Play Store.
2. If the app beta has a **good response**, decide to develop it further.
3. If the app is **profitable after 6 months**, develop an iOS version of the app and add it to the App Store.

### Most Common Apps by Genre

Since the end goal is to develop a profitable free app on both the App Store and Google Play Store markets, I have to look at the profiles of apps that are *successful on both markets*. 
- Before I look at **popularity**, I want to look at what is most **commonly found** on the markets. 
    - The first method of **app profiling (classification)** I want to look at is the **genre**, so for each market I will sieve out the **most commonly found genres**.
    - To do this, I will have to build a ***frequency table*** for the `prime_genre` column of the App Store data, and the `Genres` and `Category` columns of the Google Play data. 

I will first create 2 functions that can later be used for both markets:
- With one function to generate frequency tables that show percentages
- Another function to display percentages in descending order
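
For comparison, the same two steps can be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the approach used in this notebook; `freq_table_counter()` and the sample rows are hypothetical names and data.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Hypothetical rows; only the genre column (index 1) matters here.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
table = freq_table_counter(sample, 1)

# Display in descending order of percentage.
for genre, percent in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percent)
```

Sorting on the percentage with a `key` function avoids having to build (percentage, genre) tuples just for ordering.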

In [17]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    #for each row in the dataset:
        #add 1 to total
        #read the value in the chosen column
        #if the value is already in table:
            #add 1 to its existing count
        #or else if the value is not there yet:
            #table[value] = 1 (the first one)
            
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value]+=1
        else:
            table[value]=1
        
        
    #create table of percentages as dictionary
    #for each key (value) in the table:
        #the percent is derived by taking the (count of that value / total rows) * 100
        #assign the percent to the corresponding key in the table of percentages
    
    table_percent = {}
    for key in table:
        percent = (table[key]/total)*100
        table_percent[key] = percent
        
    return table_percent



    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    #for each genre in the freq table
        #convert dict to tuple, with (%_genre, and the genre)
        #add that tuple to the table_display
    
    for key in table:
        key_val_tuple = (table[key], key)
        table_display.append(key_val_tuple)
    
    
    #sort the table_display by %_genre in descending
    #for every row in table_sorted:
        #print the (genre : %_genre)
    
    table_sorted = sorted(table_display, reverse = True)
    for r in table_sorted:
        print(r[1], ':', r[0])
        

#### For the App Store data
- Here is the frequency table for `prime_genre` 

In [18]:
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Results:
- Amongst the free English apps, more than half are **Games** (58.16%), with **Entertainment** (7.88%) a very distant second and **Photo & Video** (~5%) third. 
- **Education** apps (3.66%), and **Social Networking** apps (3.29%) represent the other significant app genres.  

Based on these results, it appears that:
- The App Store consists **mostly** of apps made for ***Fun*** (Games, Entertainment, Photo & Video, Social Networking), 
- Whereas apps made for ***Practical Purposes*** (Education, Utilities, Productivity, Finance) are **rarer**.

However, even though ***Fun*** apps are the most commonly found on the App Store, it *doesn't mean they have the most users*. 
- Therefore, it is ***not a measure of success*** for this analysis, only an indication of the ***market landscape***.


#### Now for the Google Play data 
There are 2 relevant columns, `Category`, and `Genres`. 

- Let's first have a look at frequency table for the `Category` column of the Google Play data. 

In [19]:
display_table(android_final, 1) # Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

- And secondly, the `Genres` data.

In [20]:
display_table(android_final, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Results (`Category`):
- I found that **Family** (18.9%) leads the charts, followed by **Game** (9.72%) and **Tools** (8.46%).
- Other significant categories include **Business** (4.59%), **Lifestyle** (3.9%), **Productivity** (3.89%), **Finance** (3.7%), and **Medical** (3.53%).

Results (`Genres`):
- **Tools** (8.45%) came in first, followed by **Entertainment** (6.07%) in second, and **Education** (5.35%) in third.
- Other significant genres include **Business** (4.59%), **Productivity** (3.89%), **Lifestyle** (3.89%), **Finance** (3.7%), and **Medical** (3.53%).
- `Genres` is more granular, whereas I am looking at the bigger picture, so I will only look at `Category` moving forward.

Based on these results, it appears that:
- The Google Play Store has more apps made for ***Practical Purposes*** (Tools, Business, Productivity)
- Whereas apps made for ***Fun*** (Games, Entertainment, Social Networking) are less common.
    - Having said that, **Game** came in 2nd for `Category` and **Entertainment** was 2nd for `Genres`, so the **disparity** between ***Practical*** apps and ***Fun*** apps is **not as wide** as in the App Store results.
    - Additionally, the **Family** category was found to be made up of mostly games for kids, so adding it to **Game** (18.9% + 9.72%) means roughly 29% of `Category` can be considered **Games**.


#### Between the two datasets

- There remains a larger representation of ***Practical*** apps on the Google Play Store than the App Store.
- This also means that there is a much smaller difference between ***Fun*** and ***Practical*** apps in the Google Play Store compared to the App Store. 

Again, this is **not a measure of success** (number of users), only an indication of the **landscape** for each market (number of apps).

### Most Popular Apps by Genre

Having examined the markets, I am now going to learn more about the apps with the most users.

&nbsp;
There are a number of ways to evaluate app popularity.
- The simplest metric for popularity would be the number of downloads. Unfortunately, while the Google Play data has an `Installs` column, our App Store data does not contain any information on the number of downloads.


- Fret not, for there is always another way. 
    - One useful metric is the **number of ratings** for the app. 
    - This allows us to look at *how many users actually used the app considerably* enough to give a rating (as opposed to looking solely at number of downloads).
    - To get a better sense of each genre, I will look at the **total** number of ratings in each genre as well as the **average** number of ratings per app in each genre.

#### App Store

- For the App Store, we will use the `rating_count_tot` column as our measure.


In [21]:
# populate the genres_ios with the freq_table function
genres_ios = freq_table(ios_final, -5)

# for each genre 
    # reset the total ratings and the app count for that genre

for genre in genres_ios:
    total = 0
    len_genre = 0
    
    
    # for each app in ios_final:
        # read the genre of the app
        # if the app genre matches the current genre
            # convert the number of ratings to a float
            # add it to the genre's total ratings
            # add 1 to the genre's app count
            
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
                  
    # average ratings per app = (total ratings / number of apps in the genre)
    avg_n_ratings = (total/len_genre)
    print(genre,':', avg_n_ratings, ',', total)
    

Productivity : 21028.410714285714 , 1177591.0
Weather : 52279.892857142855 , 1463837.0
Shopping : 26919.690476190477 , 2261254.0
Reference : 74942.11111111111 , 1348958.0
Finance : 31467.944444444445 , 1132846.0
Music : 57326.530303030304 , 3783551.0
Utilities : 18684.456790123455 , 1513441.0
Travel : 28243.8 , 1129752.0
Social Networking : 71548.34905660378 , 7584125.0
Sports : 23008.898550724636 , 1587614.0
Health & Fitness : 23298.015384615384 , 1514371.0
Games : 22788.6696905016 , 42705967.0
Food & Drink : 33333.92307692308 , 866682.0
News : 21248.023255813954 , 913665.0
Book : 39758.5 , 556619.0
Photo & Video : 28441.54375 , 4550647.0
Entertainment : 14029.830708661417 , 3563577.0
Business : 7491.117647058823 , 127349.0
Lifestyle : 16485.764705882353 , 840774.0
Education : 7003.983050847458 , 826470.0
Navigation : 86090.33333333333 , 516542.0
Medical : 612.0 , 3672.0
Catalogs : 4004.0 , 16016.0


- At the aggregate level, **Games** leads the charts with about 42.7MM ratings, followed by **Social Networking** with 7.6MM.
    - It is noteworthy that some genres are saturated with a lot of apps, which would naturally lead to more ratings in total. 
    - Thus, to investigate, I am going to look at the average number of ratings per app for each genre.

- **Navigation** (86090) has the highest number of ratings per app, followed by **Reference** (74942) and **Social Networking** (71548).
    - In order to dive deeper into a genre, I will create a function `genre_share` to investigate any specified genre in greater detail, including the ***total number of ratings***, and ***percentage share of that total*** each app has. 
    - This can help us identify if there are any particular apps that are **dominating (and thereby skewing)** the genre.
    - To keep information concise, I will exclude apps representing less than 4% of their genre's total ratings.

In [22]:
def genre_share(dataset, spec_genre):   
    genre_total = 0
    
    # first pass: add up the ratings of every app in the specified genre
    for app in dataset:
        genre = app[-5]
        if genre == spec_genre: 
            genre_total += float(app[6])
    print('Number of ratings for the', spec_genre, 'genre:',genre_total,'\n')

    # second pass: print each app that holds more than 4% of the genre's ratings
    for app in dataset:
        genre = app[-5]
        if genre == spec_genre: 
            genre_ratings = float(app[6])
            percent_count = (genre_ratings/genre_total)*100
            if percent_count > 4:
                print(app[2], ' : ', genre_ratings, '\n', 'Percent of total ratings in Genre: ', percent_count, '%', '\n', sep='')


- Now let's first take a look at our leader, **Games**.
    - Although I established earlier that this genre is highly saturated, I am still interested in the popularity.

In [23]:
genre_share(ios_final,'Games')

Number of ratings for the Games genre: 42705967.0 

Temple Run : 1724546.0
Percent of total ratings in Genre: 4.03818510888654%

Clash of Clans : 2130805.0
Percent of total ratings in Genre: 4.989478402397491%



- Although there are about 42.7MM ratings for **Games**, the average is only about 22.8K ratings per app. 
    - This indicates that the majority of apps in this genre have a ***low number of ratings***.
    - In other words, this appears to be an ***app volume effect*** rather than the ***true popularity*** of the genre.
- This is also why I am looking at average ratings per app on top of total ratings.
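
The gap between a genre's total and what a typical app receives can be illustrated with made-up numbers. The rating counts below are hypothetical, and the median is not a statistic used in this analysis; it is included here only as an assumption about what might complement the average when a few apps dominate.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: two hits and many small apps.
ratings = [2_000_000, 1_500_000, 900, 400, 250, 120, 80, 30, 10, 0]

print('total :', sum(ratings))
print('mean  :', mean(ratings))
print('median:', median(ratings))
```

In this toy genre the total and the mean are driven almost entirely by the two largest apps, while the median stays in the low hundreds.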

Now, I will look at the leader of the average ratings, **Navigation**.

In [24]:
genre_share(ios_final,'Navigation')

Number of ratings for the Navigation genre: 516542.0 

Waze - GPS Navigation, Maps & Real-time Traffic : 345046.0
Percent of total ratings in Genre: 66.79921477827553%

Google Maps - Navigation & Transit : 154911.0
Percent of total ratings in Genre: 29.990010492854406%



- I can see that the **Navigation** genre is heavily skewed by the dominance of giant apps, such as Waze and Google Maps, which together contribute a **whopping 97%** (roughly 66.8% + 30.0%) of the genre's ratings.  

Let's have a look at a genre that showed up near the top of both the aggregate ratings and the average ratings, **Social Networking**.

In [25]:
genre_share(ios_final,'Social Networking')

Number of ratings for the Social Networking genre: 7584125.0 

Facebook : 2974676.0
Percent of total ratings in Genre: 39.222402056928125%

Skype for iPhone : 373519.0
Percent of total ratings in Genre: 4.925011125212203%

Tumblr : 334293.0
Percent of total ratings in Genre: 4.407799185799285%

Pinterest : 1061624.0
Percent of total ratings in Genre: 13.997976035468826%

Messenger : 351466.0
Percent of total ratings in Genre: 4.634232689993902%



- The **Social Networking** genre is also heavily influenced by dominant apps, such as Facebook (39%), albeit not to the same extent.
- Apart from Facebook, and maybe Pinterest (14%), there isn't actually a huge skew.

&nbsp;
Now, let's have a look at **Reference**:

In [26]:
genre_share(ios_final,'Reference')

Number of ratings for the Reference genre: 1348958.0 

Bible : 985920.0
Percent of total ratings in Genre: 73.08752385174334%

Dictionary.com Dictionary & Thesaurus : 200047.0
Percent of total ratings in Genre: 14.82974266063139%

Dictionary.com Dictionary & Thesaurus for iPad : 54175.0
Percent of total ratings in Genre: 4.016062768447943%



- Similar to **Navigation**, the **Reference** genre has big players like Bible (73%) and Dictionary.com (15%). 

Since my aim is to identify the most popular genres, it is necessary that the popularity is representative of the genre. 
- In the case of **Navigation**, the popularity has been skewed upwards by a select few apps with hundreds of thousands of ratings, while other apps in the genre may be struggling to get past 10,000 ratings. 
- If I wanted to investigate this further, I could remove the extremely popular apps with a condition for these genres, and get a better picture of the genre averages. 
    - Out of curiosity, let's just have a quick look:
    

In [27]:
under_300_k = []

for app in ios_final:
    genre_ratings = app[6]
    if (app[-5] == 'Navigation') and (float(genre_ratings) < 300000):
        under_300_k.append(float(genre_ratings))
        
print(sum(under_300_k))
print(sum(under_300_k) / len(under_300_k))

171496.0
34299.2


- By removing the giant app Waze from **Navigation**, the total ratings dropped from 516K to 171K, and the average ratings for the genre dropped from 86K to 34K. 
    - This 34K is a more representative average of the genre's popularity as a whole.
    - This is consistent with Waze alone representing about 67% of the genre's ratings; Google Maps (155K ratings) falls under the 300K cutoff and is still included.
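
The same filter is repeated for several genres with different cutoffs, so it can be wrapped in a small helper. The sketch below is one possible way to do that, not the notebook's own code: the function name `ratings_below_cap`, the `make_row` helper, and the toy rows are all hypothetical. It assumes the same column positions used above (ratings at index `6`, genre at index `-5`).

```python
def ratings_below_cap(dataset, genre, cap, genre_index=-5, ratings_index=6):
    """Return (total, average) of rating counts for apps in `genre` below `cap`."""
    kept = []
    for app in dataset:
        n_ratings = float(app[ratings_index])
        if app[genre_index] == genre and n_ratings < cap:
            kept.append(n_ratings)
    if not kept:  # avoid dividing by zero when nothing matches
        return 0.0, 0.0
    return sum(kept), sum(kept) / len(kept)


def make_row(name, n_ratings, genre, width=17):
    """Build a hypothetical row with name, ratings, and genre in the assumed columns."""
    row = [''] * width
    row[2] = name
    row[6] = str(n_ratings)
    row[-5] = genre
    return row


# Hypothetical data: one giant app, two small apps, and one app from another genre.
toy = [
    make_row('Giant Maps', 400000, 'Navigation'),
    make_row('Small Maps', 30000, 'Navigation'),
    make_row('Tiny Maps', 10000, 'Navigation'),
    make_row('Some Game', 500, 'Games'),
]

total, average = ratings_below_cap(toy, 'Navigation', 300000)
print(total, average)
```

With the real data, a call such as `ratings_below_cap(ios_final, 'Navigation', 300000)` would be expected to reproduce the two numbers printed by the cell above, assuming `ios_final` has the column layout described.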
    
&nbsp;
- Now let's have a look at **Social Networking** without giants:

In [28]:
under_2_m = []

for app in ios_final:
    genre_ratings = app[6]
    if (app[-5] == 'Social Networking') and (float(genre_ratings) < 2000000):
        under_2_m.append(float(genre_ratings))
        
print(sum(under_2_m))
print(sum(under_2_m) / len(under_2_m))

4609449.0
43899.514285714286


- In the same vein, by removing the giant app Facebook from **Social Networking**, the total has dropped from 7.6MM to 4.6MM, and the average ratings from 72K to 44K.

&nbsp;
- Finally, here is **Reference** without giants:

In [29]:
under_200_k = []

for app in ios_final:
    genre_ratings = app[6]
    if (app[-5] == 'Reference') and (float(genre_ratings) < 200000):
        under_200_k.append(float(genre_ratings))
        
print(sum(under_200_k))
print(sum(under_200_k) / len(under_200_k))

162991.0
10186.9375


- After removing Bible and Dictionary.com (the only two apps at or above the 200K cutoff), the total dropped from 1.3MM to 163K, and the average from 75K to 10K. A significant dip.

While giant apps normally mean a very difficult market to enter, it is important to have market context.
- In this case, the giant app dominating **Reference** is actually the Bible app, a digitized version of a religious book. 
    - In other words, the app is used solely for religious purposes, and its popularity could be ***driven by the popularity of the book, rather than that of the app***.
    - This is crucial because I am not trying to replicate a book. My aim is to help developers build a new app that adds its own value to any material.
    - Additionally, in this relatively unsaturated genre, the addition of this type of app represents diving into a market that hasn't been fully tapped. 

On the other hand, **Social Networking** has also proven to be a popular genre.
- However, in this case, the giant app Facebook is an actual market leader. 
- Additionally, the genre is not exactly untapped, representing 3.29% of the apps on the App Store. 
    - This means it will be harder to enter, but not nearly as difficult as the **Games** genre, which makes up more than half of all the apps on the App Store.
- Due to its popularity both on an aggregate level and an average level (per app), it is a strong profile to include.

Based on these findings, both **Social Networking** and **Reference** are useful genres to consider as an app profile.
- It may even be a good idea to incorporate both genres.

- For example, the new app could take a document and break it down into segments, with added **content features** and **social features** to the experience, such as:
    - Social
        - Groups for shared interests (books, topics) with discussion panels
        - Ability to share doc segments to Facebook, Twitter, LinkedIn, and other social networking apps
    - Content
        - Summaries 
            - Users may submit their own summaries 
            - Rated by other users
        - Visual representations/Cartoons/Comics
        - Video/Audio versions
        - Easy APA/MLA citations for future referencing
        - In-app Dictionary
            - Iterative (incorporate top answers from discussions)

###### App Profile

This app profile is no longer just a **Reference** or a **Social Networking** app, it is really a combination of ***learning features*** for reference material (book, article, document) as well as ***topic/interest-specific discussions*** between users. 
- On top of purely social networking apps, this platform has a social element driven by ***shared interests*** about some form of reference material.
- Above and beyond the existing reference apps, which are mostly ***digitized versions of books***, this app adds **value** for the user. 
    - In fact, other reference material could be passed through this app, reducing the need for separate single-book apps and facilitating the ***growth*** of this new app.


#### Google Play Store
- Now, for the Google Play Store, we will use `reviews` as our measure. 
- Similarly, I will want to look at both the totals as well as the averages.

In [30]:
# populate the cats_android with the freq_table function
cats_android = freq_table(android_final, 1)

# for each category 
    # define total ratings and rating count for each category

for cats in cats_android:
    total = 0
    len_cats = 0
    
    
    # for each app in android:
        # assign the category of the app
        # if the app category matches the category in the freq_table
            # assign the number of ratings to a float variable
            # accumulate total ratings
            # accumulate rating count
            
    for app in android_final:
        cats_app = app[1]
        if cats_app == cats:
            n_ratings = float(app[3])
            total += n_ratings
            len_cats += 1
                  
    # get avg by taking (total ratings / rating count)
    avg_n_ratings = (total/len_cats)
    print(cats,':', avg_n_ratings, ',', total)
    

ART_AND_DESIGN : 24699.42105263158 , 1407867.0
AUTO_AND_VEHICLES : 14140.280487804877 , 1159503.0
BEAUTY : 7476.226415094339 , 396240.0
BOOKS_AND_REFERENCE : 87995.06842105264 , 16719063.0
BUSINESS : 24239.727272727272 , 9865569.0
COMICS : 42585.61818181818 , 2342209.0
COMMUNICATION : 995608.4634146341 , 285739629.0
DATING : 21953.272727272728 , 3622290.0
EDUCATION : 56293.09708737864 , 5798189.0
ENTERTAINMENT : 301752.24705882353 , 25648941.0
EVENTS : 2555.84126984127 , 161018.0
FINANCE : 38535.8993902439 , 12639775.0
FOOD_AND_DRINK : 57478.79090909091 , 6322667.0
HEALTH_AND_FITNESS : 78094.9706959707 , 21319927.0
HOUSE_AND_HOME : 26435.465753424658 , 1929789.0
LIBRARIES_AND_DEMO : 10925.807228915663 , 906842.0
LIFESTYLE : 33921.82369942196 , 11736951.0
GAME : 683523.8445475638 , 589197554.0
FAMILY : 113142.99821002387 , 189627665.0
MEDICAL : 3730.1533546325877 , 1167538.0
SOCIAL : 965830.9872881356 , 227936113.0
SHOPPING : 223887.34673366835 , 44553582.0
PHOTOGRAPHY : 404081.37547892

At the aggregate level, **Game** tops the charts with 589MM ratings, followed by **Communication** with 285MM ratings, and then **Social** with 227MM ratings. 

At the average level, **Communication** has the highest number of ratings (995608), followed by **Social** (965830), and then by **Game** (683523).
- Similar to the iOS data, in order to dive deeper into a category, I will create a function `cats_share()` to investigate any specified category in greater detail, including the total number of ratings and the percentage share of that total each app has.
- This can help us identify if there are any particular apps that are dominating (and thereby skewing) the genre.
- Similar to the App Store data, I will exclude apps that represent less than 4% of their category's total ratings.

In [31]:
def cats_share(dataset, spec_cats):
    # First pass: accumulate the total number of ratings for the specified category
    cats_total = 0
    for app in dataset:
        if app[1] == spec_cats:
            cats_total += float(app[3])
    print('Number of ratings for the', spec_cats, 'category:', cats_total, '\n')

    # Second pass: print each app that holds more than 4% of the category total
    for app in dataset:
        if app[1] == spec_cats:
            cats_ratings = float(app[3])
            percent_count = (cats_ratings / cats_total) * 100
            if percent_count > 4:
                print(app[0], ' : ', cats_ratings, '\n', 'Percent of total Category ratings: ', percent_count, '%', '\n', sep='')


Although the **Game** category is the aggregate leader, it is only the 3rd highest in average ratings, so I will come back to it later. 
- I will start with the category that has the highest number of ratings on average, **Communication**:

In [32]:
cats_share(android_final,'COMMUNICATION')

Number of ratings for the COMMUNICATION category: 285739629.0 

WhatsApp Messenger : 69119316.0
Percent of total Category ratings: 24.189614944869966%

Messenger – Text and Video Chat for Free : 56646578.0
Percent of total Category ratings: 19.824543833225178%

UC Browser - Fast Download Private & Secure : 17714850.0
Percent of total Category ratings: 6.199647581960009%

BBM - Free Calls & Messages : 12843436.0
Percent of total Category ratings: 4.494803904151496%



As you can see, WhatsApp (24%) and Messenger (20%) are dominant forces in the **Communication** category, albeit not as dominant as the giants in the iOS data. 
- They can skew the data, but not to the same extent as the giants in the iOS **Navigation** genre.
    - Again, out of curiosity, let's have a look at the category without the giants:


In [33]:
under_50_m = []

for app in android_final:
    cats_ratings = app[3]
    if (app[1] == 'COMMUNICATION') and (float(cats_ratings) < 50000000):
        under_50_m.append(float(cats_ratings))
        
print(sum(under_50_m))
print(sum(under_50_m) / len(under_50_m))

159973735.0
561311.350877193


- Evidently, for **Communication** the total dropped from 285MM to 159MM, and the average dropped from 995K to 561K. 
    - This is consistent with WhatsApp and Messenger representing 44% of the category's ratings. 
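
The consistency check is simple arithmetic: the share removed is the drop in the total divided by the original total. The sketch below uses the totals printed in the cells above; the helper name `removed_share` is hypothetical.

```python
def removed_share(total_before, total_after):
    """Percent of the original total that was removed by the filter."""
    return (total_before - total_after) / total_before * 100


# Totals taken from the printed outputs for COMMUNICATION and SOCIAL.
communication = removed_share(285739629.0, 159973735.0)
social = removed_share(227936113.0, 83200361.0)

print(round(communication, 1))  # 44.0
print(round(social, 1))         # 63.5
```

Both values line up with the sum of the individual shares printed by `cats_share()` for the two largest apps in each category.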

In [34]:
cats_share(android_final, 'SOCIAL')

Number of ratings for the SOCIAL category: 227936113.0 

Facebook : 78158306.0
Percent of total Category ratings: 34.289566919130536%

Instagram : 66577446.0
Percent of total Category ratings: 29.208818700878698%

Snapchat : 17015352.0
Percent of total Category ratings: 7.464965413356856%



Similarly, for the **Social** category, Facebook (34%) and Instagram (29%) make up 63% of the category's ratings, which can definitely skew the data. 

In [35]:
under_50_m = []

for app in android_final:
    cats_ratings = app[3]
    if (app[1] == 'SOCIAL') and (float(cats_ratings) < 50000000):
        under_50_m.append(float(cats_ratings))
        
print(sum(under_50_m))
print(sum(under_50_m) / len(under_50_m))

83200361.0
355557.0982905983


- The total number of ratings for **Social** dropped from 227MM to 83MM, and the average went from 965,830 to 355,557 ratings. 
    - This is also consistent with Facebook and Instagram representing 63% of the category.
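
The cutoffs used so far were picked by hand for each category. As a rough alternative, the sketch below measures concentration directly as the share of ratings held by the `k` most-rated apps. The function name `top_k_share` and the toy rows are hypothetical; the column positions (category at index `1`, reviews at index `3`) are assumed to match `android_final`.

```python
def top_k_share(dataset, category, k=2, cat_index=1, ratings_index=3):
    """Percent of a category's ratings held by its k most-rated apps."""
    ratings = sorted(
        (float(app[ratings_index]) for app in dataset if app[cat_index] == category),
        reverse=True,
    )
    total = sum(ratings)
    if total == 0:  # empty or unrated category
        return 0.0
    return 100 * sum(ratings[:k]) / total


# Hypothetical rows laid out as [name, category, rating, reviews].
toy = [
    ['App A', 'SOCIAL', '4.1', '600'],
    ['App B', 'SOCIAL', '4.0', '300'],
    ['App C', 'SOCIAL', '3.9', '100'],
    ['App D', 'GAME', '4.5', '50'],
]

print(top_k_share(toy, 'SOCIAL', k=2))  # 90.0
```

Running this over every category would give a single number per category to compare, instead of reading the skew off each printout separately.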

- Now let's have a look at the **Game** category:

In [36]:
cats_share(android_final, 'GAME')

Number of ratings for the GAME category: 589197554.0 

Subway Surfers : 27725352.0
Percent of total Category ratings: 4.705612202863965%

Clash of Clans : 44893888.0
Percent of total Category ratings: 7.619496668854128%



Now this is interesting. **Game** is the first category so far without a clearly dominant app. Only two apps clear the 4% threshold, and together they hold about 12% of the category's ratings:
- Clash of Clans (7.6%)
- Subway Surfers (4.7%)

Based on this finding, I will retain the average number of ratings for **Game** at 683K.

&nbsp;
If I remove the giant apps from the other categories (and assume the remaining categories also have giant apps of their own), **Game** would top the popularity charts by both total number of ratings and average ratings.

While all of these are interesting categories, the **Communication** and **Social** categories already have giant apps dominating the genre, which makes it pretty difficult for a new app to compete. 
- Additionally, as I found earlier, the **Game** category is highly saturated by the sheer number of apps.
    - Even though it had no major skews, its saturated market makes it a less-than-ideal environment for a new app to grow.

Following on from my findings in the iOS data, I am keen to explore how the **Books and Reference** category fares in the Android data. If it is favorable, then that will support my reference app profile for both markets. 
- On the surface, it has significantly lower ratings than the top genres, with 87995 ratings on average. 

In [37]:
cats_share(android_final, 'BOOKS_AND_REFERENCE')

Number of ratings for the BOOKS_AND_REFERENCE category: 16719063.0 

Google Play Books : 1433233.0
Percent of total Category ratings: 8.572448109083625%

Bible : 2440695.0
Percent of total Category ratings: 14.598276231150034%

Amazon Kindle : 814151.0
Percent of total Category ratings: 4.869597058160497%

Wattpad 📖 Free Books : 2915189.0
Percent of total Category ratings: 17.436318052034373%

Dictionary.com: Find Definitions for English Words : 899010.0
Percent of total Category ratings: 5.377155406376541%

JW Library : 922752.0
Percent of total Category ratings: 5.519160972119072%



Similar to **Game**, the **Books and Reference** category is also not dominated by a giant app, and thus does not have a large skew in the data. This is good.
- The majority of the apps are libraries and dictionaries.
- The closest to a giant app is the Bible app, which represents 14.6% of the category ratings.
    - This is similar to the App Store data. Likewise, it is important to recognize the religious context of its popularity - the popularity may stem from the material, not necessarily the app.
- According to the number of ratings, however, this does not seem to be amongst the more popular categories in the Google Play Store. 

## Conclusion

For this project, my goal was to use the existing App Store and Google Play Store data to derive an app profile that can be successful in both markets. 

&nbsp;

According to the App Store data, I developed an app profile that was a combination of both **Reference** and **Social Networking** genres. 
- However, the Google Play data shows **Social** to be more significantly dominated by giant apps (which means it is more difficult to compete).
- In addition, while **Reference** was popular in the App Store (top 3 by average ratings), its equivalent **Books and Reference** in the Google Play Store does not fare nearly as well. 
- Due to these factors, it is possible that in the ***short-term***, the new app will fare *better in the App Store* market than the Google Play Store market.

&nbsp;
On the flip side, the fact that the app profile is a combination of both gives it a **unique value**. 
- It will not be just another purely **Social** app (for conversations, media, and keeping in contact). It has a ***targeted purpose*** based on reference material, and a ***targeted audience*** based on the shared interest. 
- It will not be just another **Books and Reference** app (a digitized version of a book). It will provide a ***social edge*** through discussion with individuals who share the interest, ***reliability*** through user ratings on user summaries, and an improved way to ***learn*** and refer to material.

&nbsp;
From a growth perspective:
- The **Books and Reference** category is relatively untapped - it has great ***potential*** to grow as a genre. 
- Finally, since it is based on shared interest, *graph theory* and *homophily* suggest that ***network cascades*** and information ***diffusion*** are far more likely - which means that **growth** is more likely to happen, and happen at a faster pace.

