# Understanding Profitable App Profiles from the Apple and Google Play Store

The objective of this project is to analyse Android and iOS app data to help developers understand what type of apps are likely to attract more users to make data-driven decisions with respect to the kind of apps they build.

Let's assume that the app developers are interested in building apps that are free to download; therefore, the main source of income is from in-app ads. This means that the more people that download and use the app, the more revenue. The developers are only interested in English-language apps.

This project implements the following topics in Python for data analysis:

1. Python Programming Fundamentals
2. Variables and Data Types
3. Lists and and For Loops
4. Conditional Statements
5. Dictionaries and Frequency Tables
6. Functions

The `googeplaystore.csv` dataset contains Android apps from the Google Play Store. The dataset was sourced from [Dataquest](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

The `AppleStore.csv` dataset contains iOS apps from the Apple Store. The dataset was sourced from [Dataquest](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

This project is part of the guided projects from Dataquest to learn Data Analysis with Python.

# Open and Explore the Two Data Sets

In [2]:
from csv import reader

### Google Play data set ###
opened_file = open('C:\\Users\\jorge\\DataScience\\Files_DataQuest\\googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### Apple Store data set ###
opened_file = open('C:\\Users\\jorge\\DataScience\\Files_DataQuest\\AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

Define a `explore_data function` to .

1. Prints the android header defined in first cell
2. Prints a space
3. Runs the explore_data function for the first three rows, adding a space for every row runned. Prints the length of rows and columns.

In [10]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns', len (dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns 13


In [11]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns 16


# Delete Wrong Data

Because the developers are interested in English-language apps and that are free, we need to: 

1. Remove non-English apps.
2. Remove apps that aren't free.

We also need to make sure that we:

3. Detect innacurate data, and correcto or remove it.
4. Detect duplicate data, and remove the duplicates.

#### The Google Play data set has a [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), where it outlines an error for row 10472. 

Let's print this row and compare it against the header and another row that is correct.

In [42]:
print(android_header)
print('\n')
explore_data(android, 10471, 10473, True) # it will print row 10471 and row 10472

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns 13


#### Alternatively, we can use the following code:

In [43]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The row 10472 corresponds to the app `Life Made WI-Fi Touchscreen Photo Frame`, and we can see that the rating is 19. **This is clearly off because the maximum rating for a Google Play app is 5.** Additionally, the category is 1.9.

Therefore, row 10472 is deleted.

In [44]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android)) # We will verify that the row has been deleted

10841
10840


# Removing Duplicate Entries

### Part One

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance:

 - The application Instagram has four entries. The main difference happens on the fourth position of each row, which corresponds to the **number of reviews.**

The different numbers show the data was **collected at different times.**

In [45]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




We can use this information to build a criterion for removing the duplicates.


Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

Because **the higher the number of reviews, the more recent the data should be.**

#### In total, there are 1,181 cases where an app occurs more than once:

In [13]:
# Created two lists
duplicate_apps = []
unique_apps = []

for app in android: # looped through the android data set and for each iteration:
    name = app[0]   # saved the app name to a variable called name
    if name in unique_apps:
        duplicate_apps.append(name) # If name was already in the unique_apps list, we appended name to the duplicate_apps list.
    else:
        unique_apps.append(name)    # If name wasn't already in the unique_apps list, we appended name to the unique_apps list.
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

* Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
* Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

### Part Two

#### Build the dictionary

On the previous code cell, there are 1,182 duplicates. 

After removing the duplicates we should expect 9,659 rows.

In [14]:
print('Expected lenght:', len(android) - 1181)

Expected lenght: 9660


In [None]:
reviews_max = {}

for app in android:
    name = app[0] # app name colum has index 0 in android list
    n_reviews = float(app[3]) # reviews column has index 3 in android list
    
    if name in reviews_max and reviews_max[name] < n_reviews: # essentially if the newly added name has less reviews than the one it just found... do not include it.
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [50]:
print('Expected lenght:', len(android) - 1181)
print('Actual lenght:', len(reviews_max))

Expected lenght: 9659
Actual lenght: 9659


In [None]:
android_clean = []
already_added = []

for app in android:
    name = app[0] # app name colum has index 0 in android list
    n_reviews = float(app[3]) # reviews column has index 3 in android list
    
    if (reviews_max[name] == n_reviews) and (name not in already_added): # The number of reviews of the current app matches the number of reviews of that app as per the reviews_max dictionary created earlier, and the name of the app is not already added in the already_added list.
        android_clean.append(app) # add the current row to the android_clean list.
        already_added.append(name) # add the app name variable to the already_added list.
        
print(already_added[:5])

Let's make sure the data set has the expected info.  The data set should have 9,659 rows.

In [52]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns 13


## Removing Non-English Apps

### Part One

We can do this by:

 - Removing each app with a name containing a symbol that is not commonly used in English text.

 - Behind the scenes, each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65.

 - The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127. 

- Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters (ASCII range). Our app names, however, are stored as strings and not integers. However, in Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.

The corresponding number of each character can be obtained using the ord() built-in function.

For example:

In [53]:
string = 'abc'
print(string[0])
print(string[1])
print(string[2])

print('\n')

for character in string:
    print(character)

a
b
c


a
b
c


In [54]:
def english_apps(string):
    for character in string:
        if ord(character) > 127:
            return False
        
            return True # if we use else, it will return True always.... you know.
        
print(english_apps('Instagram'))
print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('\n')
print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('爱Instachat 😜'))

None
False


False
False


In [55]:
def english_apps(string):
    for character in string:
        if ord(character) > 127:
            return False
        else:
            return True # if we use else, it will return True always.... you know.
        
print(english_apps('Instagram'))
print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('\n')
print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('Instachat 😜'))

True
False


True
True


However, as seen in the cell above, the function does not work for app names that use emojis or other symbols that fall outside of the ASCII range. 

In [56]:

print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('Instachat 😜'))

print('\n')

print(ord('™'))
print(ord('😜'))

True
True


8482
128540


### Part Two

To minimize the impact of data loss, we can remove an app if its name has more than three non-ASCII characters. 

This means all English apps with **up to three emoji or other special characters** will still be labeled as English.

The function is not perfect and we could still have some data loss.

In [57]:
def english_apps(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        
        if non_ascii > 3:
            return False
        else:
            return True 
        
print(english_apps('Instagram'))
print(english_apps('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('\n')
print(english_apps('Docs To Go™ Free Office Suite'))
print(english_apps('爱Instachat 😜'))

True
True


True
True


Let's use the above function to filter-out non-English apps from both data sets (android and ios lists). 

In this case, the logic is that if an app name is identified as English, append the whole row to a separate list.

In [58]:
android_english = []
ios_english = []

for app in android_clean: # android list with non-repeating apps
    name = app[0] # app name colum has index 0 in this list
    if english_apps(name):
        android_english.append(app)
        
for app in ios: # ios list with non-repeating apps
    name = app[0] # app name colum has index 0 in this list
    if english_apps(name):
        ios_english.append(app)
    

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True) 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', '

## Isolating Free Apps


In [59]:
print(android_header)
print('\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [60]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]

    if price == '0':
        android_final.append(app)
       
for app in ios_english:
    price = app [4]
    
    if price =='0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))


8905
4056


## Most Common Apps by Genre
So far, we have:

1. Removed inaccurate data
2. Removed duplicate app entries
3. Removed non-English apps
4. Isolated the free apps

To minimize risk, the developers will follow this strategy:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

### Part One

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we will use frequency tables.

In [61]:
# Generate frequency tables that show percentages

def freq_table(dataset, index):

    table = {}
    total = 0

    for row in dataset:
        total += 1
        value = row[index]
        
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages

Below, the `sorted( )` built-in function takes an iterable data type (list, dictionary and tuple) and returns a list of the elements of that iterable data sorted in ascending or descending order. 

The reverse parameter controls wether the order is ascending or descending.

reverse = True --> descending

For example:

In [62]:
a_list = [50, 20, 100]
print(sorted(a_list))
print(sorted(a_list, reverse = True))

[20, 50, 100]
[100, 50, 20]


The `sorted( )` function doesn't work well with dictionaries because it only considers and returns the dictionary keys.

For example:

In [63]:
freq_table_example = {'Genre_1': 50, 'Genre_3': 20, 'Genre_2': 100}
sorted(freq_table_example)

['Genre_1', 'Genre_2', 'Genre_3']

However, the `sorted()` function works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value. 

To ensure the sorting works right, the dictionary value comes first, and the dictionary key comes second.

For example:

In [64]:
freq_table_example = {'Genre_1': 50, 'Genre_3': 20, 'Genre_2': 100}
freq_table_tuple_example = [(50, 'Genre_1'), (20, 'Genre_3'), (100, 'Genre_2')]
sorted(freq_table_tuple_example)

[(20, 'Genre_3'), (50, 'Genre_1'), (100, 'Genre_2')]

In [75]:
# Generate a function to display the percentages in a descending order because
# dictionarys don't have order

def display_table(dataset, index):
    table = freq_table(dataset, index) # freq_table is a function defined in the above cell.
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key) 
        table_display.append(key_val_as_tuple) 
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Part Two

Examine the frequency tables. Specifically the prime_genre column in the ios data set.

In [66]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [67]:
display_table(ios_final, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


Examine the frequency tables. Specifically the Genres and Category [1] columns in the android data set. The difference among the two is not clear but we just need the bigger picture so the Category column is more suited for our analysis.

In [68]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [69]:
display_table(android_final, -4) # Genre

Tools : 8.422234699606962
Entertainment : 6.086468276249298
Education : 5.390230207748456
Business : 4.581695676586187
Lifestyle : 3.9191465468837734
Productivity : 3.885457608085345
Finance : 3.6833239752947784
Medical : 3.5148792813026386
Sports : 3.4475014037057834
Personalization : 3.312745648512072
Communication : 3.2341381246490735
Action : 3.0881527231892196
Health & Fitness : 3.065693430656934
Photography : 2.9421673217293653
News & Magazines : 2.829870859067939
Social : 2.6501965188096577
Travel & Local : 2.313307130825379
Shopping : 2.2459292532285233
Books & Reference : 2.1785513756316677
Simulation : 2.0662549129702414
Dating : 1.8528916339135317
Arcade : 1.8416619876473892
Video Players & Editors : 1.7742841100505335
Casual : 1.7518248175182483
Maps & Navigation : 1.4149354295339696
Food & Drink : 1.235261089275688
Puzzle : 1.1229646266142617
Racing : 0.9882088714205502
Role Playing : 0.9320606400898372
Libraries & Demo : 0.9320606400898372
Strategy : 0.9208309938236946
Au

In [70]:
display_table(android_final, 1) # Category

FAMILY : 18.97810218978102
GAME : 9.70241437394722
TOOLS : 8.433464345873105
BUSINESS : 4.581695676586187
LIFESTYLE : 3.9303761931499155
PRODUCTIVITY : 3.885457608085345
FINANCE : 3.6833239752947784
MEDICAL : 3.5148792813026386
SPORTS : 3.3801235261089273
PERSONALIZATION : 3.312745648512072
COMMUNICATION : 3.2341381246490735
HEALTH_AND_FITNESS : 3.065693430656934
PHOTOGRAPHY : 2.9421673217293653
NEWS_AND_MAGAZINES : 2.829870859067939
SOCIAL : 2.6501965188096577
TRAVEL_AND_LOCAL : 2.3245367770915215
SHOPPING : 2.2459292532285233
BOOKS_AND_REFERENCE : 2.1785513756316677
DATING : 1.8528916339135317
VIDEO_PLAYERS : 1.7967434025828188
MAPS_AND_NAVIGATION : 1.4149354295339696
FOOD_AND_DRINK : 1.235261089275688
EDUCATION : 1.167883211678832
ENTERTAINMENT : 0.9545199326221224
LIBRARIES_AND_DEMO : 0.9320606400898372
AUTO_AND_VEHICLES : 0.9208309938236946
HOUSE_AND_HOME : 0.8197641774284109
WEATHER : 0.7973048848961257
EVENTS : 0.7074677147669848
PARENTING : 0.6513194834362718
ART_AND_DESIGN : 0

# Most Popular Apps by Genre on the App Store

If we want to find out what genres are the most popular we can calculate this using the average number of installs for each app genre.

For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. 

As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column in the data set.

Below we calculate the average number of user ratings per app genre on the App Store:

In [79]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


To calculate the average number of user ratings for each genre, we'll use a for loop inside of another for loop. This is called a **nested loop**.

For example:

In [80]:
some_strings = ['FIRST', 'SECOND']
some_integers = [1,2,3,4,5]

for string in some_strings: 
    print(string)
    
    for integer in some_integers: # notice the indentation of this loop within the previous loop.
        print(integer)

FIRST
1
2
3
4
5
SECOND
1
2
3
4
5


In [76]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0 # this variable will store the sum of user ratings specific to each genre.
    len_genre = 0 # this variable will store the number of apps specific to each genre.
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Entertainment : 10822.961077844311
Travel : 20216.01785714286
Social Networking : 53078.195804195806
Sports : 20128.974683544304
Catalogs : 1779.5555555555557
Food & Drink : 20179.093023255813
Utilities : 14010.100917431193
Education : 6266.333333333333
Shopping : 18746.677685950413
News : 15892.724137931034
Lifestyle : 8978.308510638299
Productivity : 19053.887096774193
Photo & Video : 27249.892215568863
Music : 56482.02985074627
Weather : 47220.93548387097
Health & Fitness : 19952.315789473683
Reference : 67447.9
Business : 6367.8
Games : 18924.68896765618
Finance : 13522.261904761905
Book : 8498.333333333334
Navigation : 25972.05
Medical : 459.75


From above, navigations apps have the higest number of user reviews, but this result is heavily influenced by Waze and Google Maps, which have close to half a million user reviews altogether.

In [72]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
高德地图（精准专业的手机地图） : 1040
百度地图-智能的手机导航，公交地铁出行必备 : 1014
百度地图HD : 771
ImmobilienScout24: Real Estate Search in Germany : 187
ナビタイムの乗り換え案内 - 遅延情報やバス時刻表を案内するアプリ : 48
高德地图HD : 26
Railway Route Search : 5
NAVIRO(ナビロー) - カーナビ/バイクナビ/徒歩ナビが使える高性能ナビアプリ : 0
ホラースポット-ghost spot-意味が分かると怖いマップ : 0
MapFan(マップファン) – 渋滞情報/オービス/オフライン対応の本格カーナビ : 0
JR東日本アプリ : 0
えほう - 最強の恵方コンパス : 0
バーチャル恵方巻【節分・恵方コンパス・方位】 : 0
恵方コンパス. : 0
ナビタイム ドライブサポーター - NAVITIMEのカーナビアプリ : 0
自転車ナビ by NAVITIME(ナビタイム) - 自転車のナビができるアプリ : 0


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then rework the averages, but we'll leave this level of detail for later.

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com which skew up the average rating:

In [77]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
彩库宝典-【官方版】 : 0
Jishokun-Japanese English Dictionary & Translator : 0
無料で音楽や写真・カメラの裏技アプリ for iPhone7 : 0


However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seem to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

Weather apps — people generally don't spend too much time in-app, and the chances of making profit from in-app adds are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

Food and drink — examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

Finance apps — these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

# Most Popular Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity.

There is only one problem, the install number are in range; therefore it's not precise. 

For example:

In [78]:
display_table(android_final, 5) # the Installs column

1,000,000+ : 15.687815833801237
100,000+ : 11.577765300393038
10,000,000+ : 10.499719258843346
10,000+ : 10.252667040988209
1,000+ : 8.422234699606962
100+ : 6.917462099943853
5,000,000+ : 6.816395283548568
500,000+ : 5.53621560920831
50,000+ : 4.817518248175182
5,000+ : 4.525547445255475
10+ : 3.537338573834924
500+ : 3.2341381246490735
50,000,000+ : 2.2908478382930935
100,000,000+ : 2.1224031443009546
50+ : 1.9090398652442448
5+ : 0.7860752386299831
1+ : 0.5165637282425604
500,000,000+ : 0.26951151038742277
1,000,000,000+ : 0.22459292532285235
0+ : 0.044918585064570464
0 : 0.011229646266142616


To perform computations, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. 

We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

To remove characters from strings, we can use str.replace(old, new). This is a type of function called method.

str.replace() takes in two parameters, old and new, and replaces all occurrences of old within a string with new. 

For example:

In [82]:
n_installs = '100,000+'
print(n_installs.replace('+', 'plus'))
print(n_installs.replace('1', 'one'))
print(n_installs.replace( '&', 'ampersand')) # no change because these strings are not present in our variable.
print(n_installs.replace('+', ''))

100,000plus
one00,000+
100,000+
100,000


If you want to reassign the changes to the variable. You have to do the following:

In [87]:
n_installs = '100,000+'
n_installs = n_installs.replace('+', 'plus')

print(n_installs)

100,000plus


In [89]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0 # store the sum of installs specific to each genre.
    len_category = 0 # store the number of apps specific to each genre.
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

MAPS_AND_NAVIGATION : 3993339.603174603
TRAVEL_AND_LOCAL : 13984077.710144928
FINANCE : 1387692.475609756
MEDICAL : 120550.61980830671
ART_AND_DESIGN : 1952105.1724137932
BOOKS_AND_REFERENCE : 8587351.855670104
BEAUTY : 513151.88679245283
SPORTS : 3638640.1428571427
FAMILY : 3668870.823076923
HOUSE_AND_HOME : 1331540.5616438356
EVENTS : 253542.22222222222
PARENTING : 542603.6206896552
DATING : 854028.8303030303
LIFESTYLE : 1436126.94
BUSINESS : 1708215.906862745
PRODUCTIVITY : 16738957.554913295
HEALTH_AND_FITNESS : 4188821.9853479853
VIDEO_PLAYERS : 24573948.25
SHOPPING : 7001693.425
FOOD_AND_DRINK : 1924897.7363636363
NEWS_AND_MAGAZINES : 9401635.952380951
COMICS : 803234.8214285715
ENTERTAINMENT : 11640705.88235294
GAME : 15551995.891203703
EDUCATION : 1825480.7692307692
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
COMMUNICATION : 38322625.697916664
PERSONALIZATION : 5183850.806779661
PHOTOGRAPHY : 17772018.759541985
TOOLS : 10787009.952063914
WEATHER 

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [91]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have over 100 million installs, the average would be reduced roughly ten times:

In [93]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3589717.245210728


We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like Youtube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [95]:

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E


The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [96]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads)

In [97]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the book Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

# Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.
