# Profitable App Profiles for the App Store and Google Play Markets

Imagine that I am working as a data analyst for a company that only builds mobile apps that are free to download and install. Our main source of revenue consists of in-app ads, which means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

The purpose of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. This project replicates challenges that a data analyst regularly encounters in their work.

---

## Opening and Exploring the Data

As of June 2019, there were approximately 2.2 million apps available for download on the App Store, and approximately 2.8 million apps available for download on the Google Play Store.

Collecting data for ~5 million apps is extremely time-consuming and costly, so for the purposes of this project we will analyze sample data instead. Here are the two data sets that we will use for our analyses. You can access and download them through the links below:
- Google Play data set containing approximately ten thousand Android apps: [Google Play Store Dataset](https://www.kaggle.com/lava18/google-play-store-apps)
- App Store data set containing approximately seven thousand iOS apps: [App Store Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)



In [1]:
from csv import reader 

#opening and reading data from the Google Play store, storing the header and body in a_header and a_body
#encoding='utf8' is passed explicitly so that non-ASCII app names are read the same way on every platform
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
a_header = android[0]
a_body = android[1:]

#opening and reading data from the iOS App Store, storing the header and body in ios_header and ios_body
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios_body = ios[1:]

In [2]:
#exploring the first 3 rows of each dataset
def explore_data(dataset, start, end):
    data = dataset[start:end]
    for element in data:
        print('\n', element)

explore_data(a_body, 0, 3)
print('\n')
explore_data(ios_body, 0 , 3)


 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

 ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']

 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']



 ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

 ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

 ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


The first three rows are outputs from the Google Play dataset. The following three rows are outputs from the App Store dataset. To see how many rows and columns are in each dataset, we can define a function below that provides this result:

In [3]:
#defining a function that prints the rows and columns of a given dataset
def data_info(name, dataset):
    columns = len(dataset[0])
    rows = len(dataset)
    print("\nNumber of rows in ", name, "data set: ", rows)
    print("Number of columns in ", name, "data set: ", columns)

data_info("Android", a_body)
data_info("iOS", ios_body)


Number of rows in  Android data set:  10841
Number of columns in  Android data set:  13

Number of rows in  iOS data set:  7197
Number of columns in  iOS data set:  16


This output shows that the Android data set has about 50.6% more rows than the iOS data set, while the iOS data set has 3 more columns per entry. Before making any kind of analysis, however, we need to check that the data is clean.
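The 50.6% figure follows directly from the two row counts printed above. A quick check, with the counts typed in by hand rather than read from the files:

```python
# Row counts taken from the output above
android_rows = 10841
ios_rows = 7197

# Relative difference, expressed as a percentage of the iOS row count
pct_larger = (android_rows - ios_rows) / ios_rows * 100
print(round(pct_larger, 1))
```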

## Cleaning the Data

The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. By printing out this row, we can see that the error is attributed to a null value within the row, as depicted by '' in the ninth column. If we didn't know that there was an error in row 10472, however, we could find missing entries within our data manually:

In [4]:
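#manually figuring out where our data contains null values
def check_null(dataset):
    error_list = {}
    #the last column is left out of the check; slicing (rather than indexing)
    #also stays safe for rows that are shorter than the first row
    n_checked = len(dataset[0]) - 1

    for row_index, app in enumerate(dataset):
        for column in app[:n_checked]:
            if column == '':
                error_list[app[0]] = row_index #app name -> row index

    print("")
    return error_list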

null_values = check_null(a_body)
print(null_values)
print("\n")

#printing rows with errors and verifying that we see null values
print(a_body[10472])
print("")
print(a_body[1553])


{'Life Made WI-Fi Touchscreen Photo Frame': 10472, 'Market Update Helper': 1553}


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

['Market Update Helper', 'LIBRARIES_AND_DEMO', '4.1', '20145', '11k', '1,000,000+', 'Free', '0', 'Everyone', 'Libraries & Demo', 'February 12, 2013', '', '1.5 and up']


Manually, we have found rows which contain a null value ( '' ) within their columns. This includes row #10472 and a new, previously unnoticed erroneous row: #1553. Row #10472 also contains another error: it has a listed rating of 19. This is clearly incorrect because the maximum rating for a Google Play app is 5.
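Row #10472 is also one value shorter than the header (12 values against 13 column names), so comparing row lengths is another way to spot it. Below is a minimal sketch of that check. The helper name `find_mismatched_rows` and the three-column sample are made up for illustration and are not part of the data sets.

```python
def find_mismatched_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Made-up sample: the second row is missing its 'Category' value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

mismatched = find_mismatched_rows(sample_header, sample_rows)
print(mismatched)
```

On the real data the call would presumably look like `find_mismatched_rows(a_header, a_body)`.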

We can remove these entries from our dataset by using the del keyword:

In [5]:
#cleaning the data by removing the rows with incomplete data
del a_body[10472]
del a_body[1553]

#do not run this cell again: each run would delete two more rows

## Removing Duplicate Entries

If we look closely at the Google Play data set, we'll eventually find that some apps in this play store have more than one entry. In order to make this data cleaner, we need to remove these duplicate entries.

In [6]:
#for example, let's see how many duplicate entries there are of 'Instagram'
for app in a_body:
    if app[0] == 'Instagram':
        print(app, "\n")

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 



In [7]:
#let's start by sorting the apps into a duplicate and unique list and finding how many apps are in each one
unique_apps = []
duplicate_apps = []

for element in a_body:
    name = element[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print("Number of unique apps: ", len(unique_apps))
print("Number of duplicate apps: ", len(duplicate_apps), "\n")
print("Examples of duplicate apps: ", duplicate_apps[:15])

#we can see that in the range of duplicate apps, 'Box' and 'Google My Business' have already printed more than once

Number of unique apps:  9658
Number of duplicate apps:  1181 

Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. 

Instead of removing duplicate rows randomly, it would be better to analyze the differences between these duplicate entries and determine a criterion for keeping the most relevant and reliable ones.

If you look at the duplicate entries above for the Instagram app, the main difference between the entries corresponds to the number of reviews. Ideally, we should keep the entry with the greatest number of reviews because the more reviews an app gets, the more reliable the rating is. 

In [8]:
#we can begin this approach by creating a dictionary
reviews_max = {}

#adding each app name and its maximum number of reviews to the reviews_max dictionary
for element in a_body:
    name = element[0]
    n_reviews = float(element[3]) #index 3 = number of reviews

    #unique_apps holds every app name, so we test membership in reviews_max itself
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

reviews_max

{'Yaoi Novels - Shounen ai Book&fiction': 304.0,
 'Wonder5 Masters R': 1655.0,
 'New 2018 Keyboard': 298321.0,
 'Air conditioner remote control': 29854.0,
 'EO RAIPUR': 1.0,
 'Tennis Champion 3D - Online Sports Game': 170973.0,
 'Learn R Programming Full': 11.0,
 'KBA-EZ Health Guide': 4.0,
 'Pet Lovers Dating': 0.0,
 'Sync for reddit': 62740.0,
 'Solo Locker (DIY Locker)': 474439.0,
 'BS Tractor': 3.0,
 'EJ.by': 10.0,
 'Comunidad BH': 23.0,
 'Math Solver': 2250.0,
 'codeSpark Academy & The Foos': 4522.0,
 'EMI, FD, RD - Bank Calculator': 42.0,
 'BlueDV AMBE': 0.0,
 'PPS Online': 37.0,
 'BG Products': 4.0,
 'Florida Lottery Results': 763.0,
 'Keep My Notes - Notepad & Memo': 122424.0,
 'Build.com - Shop Home Improvement & Expert Advice': 118.0,
 'Cyprus Police': 226.0,
 'BuzzFeed: News, Tasty, Quizzes': 131028.0,
 'German Listening': 18298.0,
 'Ay Yıldız Analog Saat': 37.0,
 'EGW Writings': 24278.0,
 'Allrecipes Dinner Spinner': 61881.0,
 'iCluster - The DX-Cluster database': 0.0,
 'IF

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [9]:
#if both of these numbers are equal, we know we have correctly filled our dictionary with non-duplicate apps
print("Number of expected entries in dictionary: ", len(a_body) - len(duplicate_apps))
print("Number of actual entries in dictionary: ", len(reviews_max))

Number of expected entries in dictionary:  9658
Number of actual entries in dictionary:  9658


Our outputs match! This means we have correctly filled our dictionary with non-duplicate apps.

Now, it's time to utilize this dictionary to clean our data and remove duplicate apps.

In [10]:
#although we aren't technically deleting duplicate apps, we are filling a replica dataset with non-duplicate apps
android_clean = []
already_added = []

for element in a_body:
    name = element[0]
    rating = float(element[3])
    
    if (name not in already_added) and (reviews_max[name] == rating):
        android_clean.append(element)
        already_added.append(name)

data_info("'android_clean'", android_clean)


Number of rows in  'android_clean' data set:  9658
Number of columns in  'android_clean' data set:  13


Since our new list has the same number of rows as our dictionary has entries, we can confirm that we have correctly filled it with non-duplicate data. Now, it is time for the next step in the data-cleaning process.
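The same result can also be reached in a single pass by keeping, for each app name, the row with the most reviews. The sketch below uses a hypothetical helper, `keep_most_reviewed`, and a small sample in which the 'Box' row is made up. It assumes the name sits at index 0 and the review count at index 3, as in the Google Play data. The result is ordered by the first appearance of each name, which may differ from the order produced by the loop above.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # made-up values
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduped = keep_most_reviewed(sample)
for row in deduped:
    print(row)
```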


## Removing Non-English Apps

Many apps in our dataset are not directed toward English audiences and they contain foreign names and texts. Since we are not interested in analyzing non-English apps, we need to remove them from our dataset. In order to successfully clean our data from non-English apps, we first need to identify the apps that do not contain English text.

We can use ASCII values to determine whether or not an app name contains non-English text. All characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.
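To make the 0 to 127 range concrete, the built-in `ord()` returns the code point of a single character. The short sketch below compares a few characters against the ASCII limit:

```python
# Code points for two ASCII letters, the trademark sign, and an emoji
codes = {ch: ord(ch) for ch in ['A', 'z', '™', '😜']}

for ch, code in codes.items():
    # A character is ASCII when its code point is at most 127
    print(repr(ch), code, code <= 127)
```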

In [11]:
#defining a function that detects whether a string contains english characters or not
def is_english(string):
    for char in string:
        if ord(char) > 127:
            return False
    return True

#sample app names
print(is_english("Instagram"))
print(is_english("\n謎解き＆ブロックパズル"))
print(is_english("Go™"))
print(is_english("Instachat 😜"))

True
False
False
False


Although our function works as written, it also flags English app names that contain characters such as the TM symbol ( ™ ) or emojis ( 😜 ), because these characters fall outside the ASCII range. Using this function to remove non-English apps is therefore not the best idea, because it may remove apps that are relevant to our analysis.

In order to improve this function, we can try removing apps that contain three or more non-ASCII characters:

In [12]:
#improved is_english function
def is_english(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1  
    if (count > 2):
        return False
    return True

#sample app names
print(is_english("Instagram"))
print(is_english("\n謎解き＆ブロックパズル"))
print(is_english("Go™"))
print(is_english("Instachat 😜"))

True
False
True
True


By adding a limit to the number of non-ASCII characters allowed, we can filter out non-English apps more effectively. A few non-English apps might still get past our filter, but this seems good enough for our analysis; we shouldn't spend too much time on optimization at this point.
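If we later wanted to experiment with a different limit, the threshold could be turned into a parameter. This is only a sketch: the function name `is_probably_english` and the parameter `max_non_ascii` are hypothetical and are not used elsewhere in this project.

```python
def is_probably_english(name, max_non_ascii=2):
    """Return True if name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

names = ['Instagram', '謎解き＆ブロックパズル', 'Go™', 'Instachat 😜']
for name in names:
    print(name, ':', is_probably_english(name))
```

With the default limit of 2 this behaves like the improved `is_english()` above, while `max_non_ascii=0` reproduces the strict first version.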

Below, we use this function to filter out non-English apps from both our android and iOS datasets:

In [13]:
android_english = []
ios_english = []

#prints the quantity of rows in each dataset prior to filtering
print("Number of rows in Android data set before filtering: ", len(android_clean))
print("Number of rows in iOS data set before filtering: ", len(ios_body))
print('\n----------------------------------------------')

#filtering the android dataset
for element in android_clean:
    name = element[0]
    if (is_english(name)):
        android_english.append(element)

#filtering the iOS dataset
for element in ios_body:
    name = element[1] #the name of an app is at index 1 for IOS dataset
    if (is_english(name)):
        ios_english.append(element)

data_info('Android', android_english)
data_info('iOS', ios_english)

Number of rows in Android data set before filtering:  9658
Number of rows in iOS data set before filtering:  7197

----------------------------------------------

Number of rows in  Android data set:  9596
Number of columns in  Android data set:  13

Number of rows in  iOS data set:  6155
Number of columns in  iOS data set:  16


As we can see from this output, we have filtered out 62 non-English app entries from our Android dataset (9,658 - 9,596) and 1,042 from our iOS dataset (7,197 - 6,155).

## Isolating Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [14]:
print(a_header,'\n')
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


From this output, we can see that:
- the price column for Android lies at index 7
- the price column for iOS lies at index 4

We can use this information to filter out priced apps:

In [15]:
android_free = []
ios_free = []

#filtering out priced Android apps
for element in android_english:
    price_type = element[6] #adding type constraint for more accurate data
    price = element[7]
    
    if (price_type == 'Free') and (price == '0'): 
        android_free.append(element)

#filtering out priced iOS apps
for element in ios_english:
    price = element[4]
    
    if (price == '0.0'):
        ios_free.append(element)
        
print(len(android_free))
print(len(ios_free))

8846
3203


After filtering out the priced apps, our dataset for Android now consists of 8,846 free apps and our dataset for iOS now consists of 3,203 free apps; this should be enough for our analysis.

## Most Common Apps By Genre

### Part One

Creating the functions to use for our analysis:

In [16]:
import pprint #used to print each element of the sorted dictionary on a new line for a clearer output

def freq_table(dataset, index):
    counts = {}
    percentages = {}
    total = 0

    for element in dataset:
        column = element[index]
        if column in counts:
            counts[column] += 1
        else:
            counts[column] = 1
        total += 1

    for value in counts:
        percentages[value] = round(counts[value]/total * 100, 4)

    return counts, percentages


#type 0 to print a frequency table, type 1 to print a frequency percentage table
def get_table(dataset, index, table_type, a_z):

    table = freq_table(dataset, index)[table_type]

    #a_z = True if you wish to print the desired table in alphabetical order
    if (a_z):
        ordered = dict(sorted(table.items()))

    #else, it will print from highest to lowest value
    else:
        ordered = dict(sorted(table.items(), reverse = True, key = lambda t: t[1]))

    #plain dicts keep insertion order; sort_dicts=False (Python 3.8+) stops pprint from re-sorting the keys
    pprint.pprint(ordered, sort_dicts = False)

The get_table() function gives us a lot of flexibility in our analysis: it can display the frequency table of any column in any dataset, by count or by percentage, sorted either alphabetically by key or in descending order of value.
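For comparison, the standard library's `collections.Counter` builds the same kind of count table, and its `most_common()` method returns the entries from the highest to the lowest count. The sketch below is an alternative rather than the approach used in this project. `percentage_table` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def percentage_table(rows, index):
    """Map each value in the given column to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count
    return {value: round(count / total * 100, 4)
            for value, count in counts.most_common()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
shares = percentage_table(sample, 1)
print(shares)
```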

### Part Two

Examining our outputs:

In [30]:
print("Market Share of Free, English iOS Apps by Category: \n")
get_table(ios_free, -5, 1, False) #category for ios = index 11 (or -5)

print("\n\nMarket Share of Free, English Android Apps by Category: \n")
get_table(android_free, 1, 1, False) #category for android = index 1

Market Share of Free, English iOS Apps by Category: 

{'Games': 58.2579,
 'Entertainment': 7.8364,
 'Photo & Video': 4.9953,
 'Education': 3.684,
 'Social Networking': 3.3094,
 'Shopping': 2.5913,
 'Utilities': 2.4664,
 'Sports': 2.1542,
 'Music': 2.0606,
 'Health & Fitness': 2.0293,
 'Productivity': 1.7484,
 'Lifestyle': 1.561,
 'News': 1.3425,
 'Travel': 1.2488,
 'Finance': 1.0927,
 'Weather': 0.8742,
 'Food & Drink': 0.8117,
 'Reference': 0.5308,
 'Business': 0.5308,
 'Book': 0.3746,
 'Medical': 0.1873,
 'Navigation': 0.1873,
 'Catalogs': 0.1249}


Market Share of Free, English Android Apps by Category: 

{'FAMILY': 19.2516,
 'GAME': 9.4845,
 'TOOLS': 8.4558,
 'BUSINESS': 4.5896,
 'PRODUCTIVITY': 3.9001,
 'LIFESTYLE': 3.8888,
 'FINANCE': 3.7079,
 'MEDICAL': 3.5496,
 'SPORTS': 3.414,
 'PERSONALIZATION': 3.3235,
 'COMMUNICATION': 3.2444,
 'HEALTH_AND_FITNESS': 3.0748,
 'PHOTOGRAPHY': 2.9505,
 'NEWS_AND_MAGAZINES': 2.8035,
 'SOCIAL': 2.6679,
 'TRAVEL_AND_LOCAL': 2.34,
 'SHOPPING': 2.24

If we analyze the market make-up of the App Store and Google Play Store by app category, we can see some vast differences. The 'Games' category makes up over half of the free, English apps within the App Store (58.26%), while the same category in the Google Play Store is far less common (9.48%).

It seems as though the majority of the apps on the App Store are designed for fun (games, entertainment, photo and video, etc.), while the apps on the Google Play Store have a more balanced selection of practical purposes (family, tools, business, productivity).

One discrepancy to notice is that the App Store does not have a 'Family' category. This category has the largest share of apps on the Google Play Store (19.25%). It is possible that kid-friendly games on the Google Play Store have been labeled as 'Family' rather than 'Games'. This could help explain the vast disparity between the number of apps in the 'Games' category in the App Store and the number in the same category in the Google Play Store.

Another factor to take into account is that just because the "fun" apps are the majority within the App Store, this does not mean they are the most popular. The number of apps offered does not directly translate into the demand for these types of apps.

In [18]:
print("\n\nMarket Share of Free, English Android Apps by Genre: \n")
get_table(android_free, 9, 1, False) #Genres for android = index 9



Market Share of Free, English Android Apps by Genre: 

{'Tools': 8.4445,
 'Entertainment': 6.0818,
 'Education': 5.3584,
 'Business': 4.5896,
 'Productivity': 3.9001,
 'Lifestyle': 3.8775,
 'Finance': 3.7079,
 'Medical': 3.5496,
 'Sports': 3.4592,
 'Personalization': 3.3235,
 'Communication': 3.2444,
 'Action': 3.0974,
 'Health & Fitness': 3.0748,
 'Photography': 2.9505,
 'News & Magazines': 2.8035,
 'Social': 2.6679,
 'Travel & Local': 2.3287,
 'Shopping': 2.2496,
 'Books & Reference': 2.1366,
 'Simulation': 2.0461,
 'Dating': 1.8652,
 'Arcade': 1.8539,
 'Video Players & Editors': 1.7861,
 'Casual': 1.7522,
 'Maps & Navigation': 1.3905,
 'Food & Drink': 1.2435,
 'Puzzle': 1.1305,
 'Racing': 0.9948,
 'Role Playing': 0.9383,
 'Libraries & Demo': 0.927,
 'Auto & Vehicles': 0.927,
 'Strategy': 0.9044,
 'House & Home': 0.8026,
 'Weather': 0.7913,
 'Events': 0.7122,
 'Adventure': 0.667,
 'Beauty': 0.5991,
 'Art & Design': 0.5991,
 'Comics': 0.5991,
 'Parenting': 0.4974,
 'Card': 0.4409,
 

The difference between the 'Category' of android apps and the 'Genre' of android apps is not entirely clear, but the majority of apps by 'Genre' also appear to consist of practical uses (tools, business, productivity, lifestyle).

The 'Genre' column has numerous distinct values, many of which are very specific. Since we are only looking for the bigger picture at the moment, it would be best to stick to the 'Category' column for our analysis.

Now, we'd like to get an idea about the kind of apps that have the most users.

## Most Popular Apps By Genre on the App Store

### Part One

One way to find out what genres are the most popular is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing.

As an alternative, we can use the total number of user ratings to find which apps are most popular in the App Store.

In [19]:
ios_genres = freq_table(ios_free, -5) #returns a tuple of dictionaries
ios_genres = ios_genres[0] #sets ios_genres equal to the frequency (count) dictionary

#for each genre, a nested loop adds up the number of user ratings and the number of apps
#note: the inner loop runs over ios_english, so paid apps are included in these averages;
#looping over ios_free instead would restrict them to free apps and change the numbers below
for name in ios_genres:
    genre_reviews = {}
    num_reviews = 0
    total = 0

    for app in ios_english:
        genre = app[-5] #index -5 = genre
        rating_count = float(app[5]) #index 5 = total number of user ratings
        if (name == genre):
            num_reviews += rating_count
            total += 1

    avg_count = num_reviews/total
    genre_reviews[name] = round(avg_count, 2)
    print(genre_reviews)

{'Entertainment': 8920.81}
{'Shopping': 26938.96}
{'News': 17283.54}
{'Finance': 23840.06}
{'Music': 29047.11}
{'Catalogs': 3465.0}
{'Food & Drink': 19934.39}
{'Lifestyle': 9021.5}
{'Business': 5149.32}
{'Utilities': 8002.3}
{'Weather': 23145.25}
{'Health & Fitness': 10868.02}
{'Education': 2478.21}
{'Games': 15641.67}
{'Social Networking': 60253.85}
{'Medical': 648.95}
{'Photo & Video': 14688.72}
{'Sports': 15350.91}
{'Book': 10750.11}
{'Navigation': 19370.82}
{'Reference': 28096.22}
{'Productivity': 8508.09}
{'Travel': 19351.44}


From a quick glance through this output, we can see that the 'Social Networking' genre generated the highest average number of user reviews with an average of over 60,000 reviews.

Upon further inspection, we can look at the apps with the most user ratings. The first five rows of our free iOS data are printed below (the rows appear to be listed from the most to the fewest ratings), and they show which apps dominate:

In [20]:
#printing the first 5 rows of ios_free; there is no genre filter, so apps from any genre can appear
for app in ios_free[:5]:
    name = app[1]
    user_rating = float(app[5]) #index 5 = total number of user ratings
    print(name, ':', user_rating)

Facebook : 2974676.0
Instagram : 2161558.0
Clash of Clans : 2130805.0
Temple Run : 1724546.0
Pandora - Music & Radio : 1126879.0


With dominant apps like Facebook, which has nearly three million user reviews, the average number of 'Social Networking' app reviews is skewed. The same applies to music apps, where big players like Spotify and Soundcloud heavily influence the average.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by a very small number of apps that have hundreds of thousands of user ratings, while other apps may struggle to reach a few thousand.

To further illustrate this, Reference apps have 28,096.22 user ratings on average, but it's actually the Bible and Dictionary.com that pull this average up:

In [31]:
ratings = []

for app in ios_free:
    name = app[1]
    genre = app[-5]
    user_rating = float(app[5])
    
    if  (genre == 'Reference'):
        ratings.append(user_rating)
        
        #Printing all of the free, iOS apps with a 'Reference' genre and printing their quantity of user ratings
        print(name, ':', user_rating)        

Bible : 985920.0
Dictionary.com Dictionary & Thesaurus : 200047.0
Dictionary.com Dictionary & Thesaurus for iPad : 54175.0
Google Translate : 26786.0
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418.0
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588.0
Merriam-Webster Dictionary : 16849.0
Night Sky : 12122.0
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535.0
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693.0
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497.0
Guides for Pokémon GO - Pokemon GO News and Cheats : 826.0
WWDC : 762.0
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718.0
VPN Express : 14.0
Real Bike Traffic Rider Virtual Reality Glasses : 8.0
Jishokun-Japanese English Dictionary & Translator : 0.0


Since the average number of user reviews is skewed for most genres, it would be better to look at the median value instead; in many genres, one or two very popular apps pull the average far above what a typical app receives.
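The effect is easy to see on the free 'Reference' apps printed above. The sketch below copies their rating counts by hand and compares the mean and the median, with and without the single largest app. Note that this mean covers only the 17 free apps listed, so it need not match the per-genre average shown earlier, which was computed by looping over ios_english.

```python
import statistics

# Rating counts of the free 'Reference' apps, copied from the output above
reference_ratings = [985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122,
                     8535, 4693, 1497, 826, 762, 718, 14, 8, 0]

mean_all = statistics.mean(reference_ratings)
median_all = statistics.median(reference_ratings)

# Dropping the single largest app shows how far one outlier moves each statistic
without_top = sorted(reference_ratings)[:-1]
mean_rest = statistics.mean(without_top)
median_rest = statistics.median(without_top)

print('all apps    - mean:', round(mean_all, 2), 'median:', median_all)
print('without top - mean:', round(mean_rest, 2), 'median:', median_rest)
```

Removing one app cuts the mean to less than a third of its value, while the median moves far less.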

Let's take a look at the median number of reviews for each genre:

In [22]:
import statistics
def get_median_ios(genre_name):
    ratings = []

    for app in ios_free:
        genre = app[-5]
        user_rating = float(app[5]) #index 5 = total number of user ratings

        if (genre == genre_name):
            ratings.append(user_rating)

    #computing the median once, after all the ratings of the genre have been collected
    return statistics.median(ratings)

for genre in ios_genres:
    print(genre, ': ', get_median_ios(str(genre)))

Entertainment :  1205.0
Shopping :  6408.0
News :  373.0
Finance :  2207.0
Music :  3850.0
Catalogs :  1229.0
Food & Drink :  1490.5
Lifestyle :  1183.0
Business :  1150.0
Utilities :  1341.0
Weather :  289.0
Health & Fitness :  2459.0
Education :  606.5
Games :  913.5
Social Networking :  4199.0
Medical :  566.5
Photo & Video :  2206.0
Sports :  1628.0
Book :  665.0
Navigation :  8196.5
Reference :  8535.0
Productivity :  8737.5
Travel :  798.5


This is a much better representation of the most popular apps by genre within the iOS App Store. Productivity, Reference, and Navigation have the highest median numbers of user ratings, so these genres could be very promising for a new app coming to market.

If we research our potential competition and create a plan to develop a clear competitive advantage, we could have a good opportunity to create a popular iOS app within these genres. The more popular the app becomes, the more revenue we will have coming in (due to in-app advertisements). 

## Most Popular Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [23]:
get_table(android_free, 5, 1, False) #installs column for android = index 5

{'1,000,000+': 15.7698,
 '100,000+': 11.5532,
 '10,000,000+': 10.5245,
 '10,000+': 10.208,
 '1,000+': 8.3993,
 '100+': 6.9297,
 '5,000,000+': 6.8279,
 '500,000+': 5.5618,
 '50,000+': 4.7705,
 '5,000+': 4.4879,
 '10+': 3.5383,
 '500+': 3.2444,
 '50,000,000+': 2.2835,
 '100,000,000+': 2.1366,
 '50+': 1.9218,
 '5+': 0.7913,
 '1+': 0.5087,
 '500,000,000+': 0.2713,
 '1,000,000,000+': 0.2261,
 '0+': 0.0452}


This data is not very precise. For example, we don't know whether an app with 100,000+ installs has 150,000 installs or 350,000. However, we don't need very precise data for our purposes — we only want to get an idea which app genres attract the most users, and we don't need perfect precision with respect to the number of users. Thus, we are going to leave these numbers as they are.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. In the loop below, we will make this conversion and we will also calculate the average number of installs by genre:
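The conversion itself can be isolated in a small helper. `parse_installs` is a hypothetical name used only for this sketch; the loop that follows performs the same two replacements inline.

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to a float."""
    return float(installs.replace(',', '').replace('+', ''))

for value in ['0+', '5,000+', '1,000,000,000+']:
    print(value, '->', parse_installs(value))
```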

In [24]:
categories_dict = freq_table(android_free, 1)
categories_dict = categories_dict[1]

avg_max = []

for category in categories_dict:
    total_installs = 0
    num_of_categories = 0
    
    for app in android_free:
        app_category = app[1]
        
        if (app_category == category):            
            installs = app[5]
            installs = installs.replace(',', '')
            installs = installs.replace('+', '')
            total_installs += float(installs)
            num_of_categories += 1
            
    average_installs = total_installs / num_of_categories
    avg_max.append(average_installs)
    print(category, ': ', average_installs)

print('\nGreatest average of installs: ', max(avg_max))

EDUCATION :  1768500.0
PHOTOGRAPHY :  17840110.40229885
SHOPPING :  7036877.311557789
SOCIAL :  23253652.127118643
FAMILY :  5183203.576042279
ENTERTAINMENT :  9146923.076923076
AUTO_AND_VEHICLES :  647317.8170731707
ART_AND_DESIGN :  1986335.0877192982
PRODUCTIVITY :  16772838.591304347
LIFESTYLE :  1446158.2238372094
TOOLS :  10830251.970588235
LIBRARIES_AND_DEMO :  634095.243902439
DATING :  854028.8303030303
NEWS_AND_MAGAZINES :  9549178.467741935
MEDICAL :  123064.7898089172
FOOD_AND_DRINK :  1924897.7363636363
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
HEALTH_AND_FITNESS :  4167457.3602941176
TRAVEL_AND_LOCAL :  13984077.710144928
SPORTS :  4288677.758278145
WEATHER :  5145550.285714285
PERSONALIZATION :  5201482.6122448975
VIDEO_PLAYERS :  24790074.17721519
MAPS_AND_NAVIGATION :  4049274.6341463416
HOUSE_AND_HOME :  1360598.042253521
COMICS :  832613.8888888889
PARENTING :  542603.6206896552
BUSINESS :  1704192.3399014778
COMMUNICATION :  38459603.452961676
GAME :
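
The string cleanup used in the loop above could also be factored into a small helper. The sketch below is one possible way to do it; the name `parse_installs` is hypothetical and does not appear in the original notebook.

```python
def parse_installs(raw):
    """Convert an install bucket such as '1,000,000+' to a float.

    `parse_installs` is a hypothetical helper name, not part of the notebook.
    """
    # Strip the thousands separators and the trailing plus sign before converting
    cleaned = raw.replace(',', '').replace('+', '')
    return float(cleaned)


print(parse_installs('100,000+'))        # 100000.0
print(parse_installs('1,000,000,000+'))  # 1000000000.0
print(parse_installs('0+'))              # 0.0
```

Keeping the conversion in one place means the later cells that repeat the two `replace` calls could reuse it instead.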

On average, communication apps have the most installs: 38,459,603. This number is heavily skewed upward by a few apps with over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) and a few others with over 100 million or 500 million installs:

In [25]:
for app in android_free:
    name = app[0]
    category = app[1]
    installs = app[5]
    
    if (category == 'COMMUNICATION' and (installs == '100,000,000+' or 
                                         installs == '500,000,000+' or 
                                         installs == '1,000,000,000+')):
        
        print(name, ': ', installs)

Messenger – Text and Video Chat for Free :  1,000,000,000+
Gmail :  1,000,000,000+
imo beta free calls and text :  100,000,000+
imo free video calls and chat :  500,000,000+
Android Messages :  100,000,000+
Google Duo - High Quality Video Calls :  500,000,000+
UC Browser - Fast Download Private & Secure :  500,000,000+
Skype - free IM & video calls :  1,000,000,000+
Who :  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji :  100,000,000+
WhatsApp Messenger :  1,000,000,000+
Google Chrome: Fast & Secure :  1,000,000,000+
Firefox Browser fast & private :  100,000,000+
Messenger Lite: Free Calls & Messages :  100,000,000+
LINE: Free Calls & Messages :  500,000,000+
Hangouts :  1,000,000,000+
Kik :  100,000,000+
KakaoTalk: Free Calls & Text :  100,000,000+
Opera Mini - fast web browser :  100,000,000+
Opera Browser: Fast and Secure :  100,000,000+
Telegram :  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer :  100,000,000+
UC Browser Mini -Tiny Fast Private & Secure :  

If we removed all the communication apps that have 100 million installs or more, the average would drop roughly tenfold:

In [26]:
under_hundred_million = []
average_installs = 0

for app in android_free:
    category = app[1]
    installs = app[5].replace('+', '')
    installs = installs.replace(',', '')
    installs = float(installs)
        
    if (category == 'COMMUNICATION') and (installs < 100000000.0):
        under_hundred_million.append(installs)

average_installs = sum(under_hundred_million)/len(under_hundred_million)
print('\nAverage Google Play Store app installs for the \'Communication\' category: ', round(average_installs, 2))


Average Google Play Store app installs for the 'Communication' category:  3607331.5
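
The same comparison could be repeated for other categories with a small function. The following is a minimal sketch, assuming rows that keep the name at index 0, the category at index 1, and the install bucket at index 5, as in the cells above. The function name `average_installs_for`, its `cap` parameter, and the `toy_apps` rows are all made up for illustration and are not taken from the data set.

```python
def average_installs_for(apps, category, cap=None):
    """Average installs for one category, optionally skipping apps at or above `cap`.

    Hypothetical helper; not part of the original notebook.
    """
    values = []
    for app in apps:
        if app[1] != category:
            continue
        installs = float(app[5].replace(',', '').replace('+', ''))
        if cap is None or installs < cap:
            values.append(installs)
    # Guard against a category that ends up with no apps
    return sum(values) / len(values) if values else 0.0


# Made-up rows that only mimic the column layout of android_free
toy_apps = [
    ['Chat Giant', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '500,000+'],
    ['Reader', 'BOOKS_AND_REFERENCE', '', '', '', '10,000+'],
]

print(average_installs_for(toy_apps, 'COMMUNICATION'))
print(average_installs_for(toy_apps, 'COMMUNICATION', cap=100000000))  # 750000.0
```

With the real data, calling it with `android_free` and a category name should allow the capped and uncapped averages to be compared side by side.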


This pattern also holds true for the video players category, which is dominated by apps like YouTube, Google Play Movies & TV, and MX Player. The pattern is repeated for social apps (Facebook, Instagram, Google+, etc.) and productivity apps (Microsoft Word, Dropbox, Google Calendar, etc.).

These giants may cause the genres to seem more popular than they truly are. Additionally, it would be difficult to compete with them since they are already dominating the market. For these reasons, we should shift our focus to alternative genres.
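
One way to see how much a single giant can inflate an average is to compare the mean with the median. This is an alternative view and not the approach used in this notebook; the install counts below are made up purely for illustration.

```python
from statistics import mean, median

# Made-up install counts for one category: several modest apps and one giant
installs = [10000.0, 50000.0, 100000.0, 500000.0, 1000000000.0]

print(mean(installs))    # pulled far upward by the single giant
print(median(installs))  # the middle value; it ignores how large the giant is
```

Here the median stays at the middle value of the list no matter how many installs the largest app has, which is why it is often less misleading for heavily skewed data.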

The books and reference genre looks fairly popular, with an average number of approximately 8,814,200 installs. Since we found this genre to have potential in the App Store, it could be a good idea to explore this genre in more depth for the Google Play Store:

In [27]:
for app in android_free:
    name = app[0]
    category = app[1]
    installs = app[5]
    
    if (category == 'BOOKS_AND_REFERENCE'):
        print(name, ': ', installs)

E-Book Read - Read Book for free :  50,000+
Download free book with green book :  100,000+
Wikipedia :  10,000,000+
Cool Reader :  10,000,000+
Free Panda Radio Music :  100,000+
Book store :  1,000,000+
FBReader: Favorite Book Reader :  10,000,000+
English Grammar Complete Handbook :  500,000+
Free Books - Spirit Fanfiction and Stories :  1,000,000+
Google Play Books :  1,000,000,000+
AlReader -any text book reader :  5,000,000+
Offline English Dictionary :  100,000+
Offline: English to Tagalog Dictionary :  500,000+
FamilySearch Tree :  1,000,000+
Cloud of Books :  1,000,000+
Recipes of Prophetic Medicine for free :  500,000+
ReadEra – free ebook reader :  1,000,000+
Anonymous caller detection :  10,000+
Ebook Reader :  5,000,000+
Litnet - E-books :  100,000+
Read books online :  5,000,000+
English to Urdu Dictionary :  500,000+
eBoox: book reader fb2 epub zip :  1,000,000+
English Persian Dictionary :  500,000+
Flybook :  500,000+
All Maths Formulas :  1,000,000+
Ancestry :  5,000,00

This genre includes a variety of apps: software for reading ebooks, dictionaries, tutorials, etc. However, a small number of extremely popular apps such as Google Play Books, Bible, and Amazon Kindle still skew the average:

In [28]:
for app in android_free:
    name = app[0]
    category = app[1]
    installs = app[5]
    
    if (category == 'BOOKS_AND_REFERENCE' and (installs == '100,000,000+' or
                                               installs == '500,000,000+' or
                                               installs == '1,000,000,000+')):
        
        print(name, ': ', installs)

Google Play Books :  1,000,000,000+
Bible :  100,000,000+
Amazon Kindle :  100,000,000+
Wattpad 📖 Free Books :  100,000,000+
Audiobooks from Audible :  100,000,000+


There are only a few very popular apps, however, so this market still shows some potential. Let's look at the apps of average popularity to get an idea of the kinds of apps we could realistically develop:

In [29]:
for app in android_free:
    name = app[0]
    category = app[1]
    installs = app[5].replace('+', '')
    installs = installs.replace(',', '')
    installs = float(installs)
        
    if (category == 'BOOKS_AND_REFERENCE') and (1000000 <= installs <= 100000000.0): #average popularity: 1M to 100M
        print(name, ': ', installs)

Wikipedia :  10000000.0
Cool Reader :  10000000.0
Book store :  1000000.0
FBReader: Favorite Book Reader :  10000000.0
Free Books - Spirit Fanfiction and Stories :  1000000.0
AlReader -any text book reader :  5000000.0
FamilySearch Tree :  1000000.0
Cloud of Books :  1000000.0
ReadEra – free ebook reader :  1000000.0
Ebook Reader :  5000000.0
Read books online :  5000000.0
eBoox: book reader fb2 epub zip :  1000000.0
All Maths Formulas :  1000000.0
Ancestry :  5000000.0
HTC Help :  10000000.0
Moon+ Reader :  10000000.0
English-Myanmar Dictionary :  1000000.0
Golden Dictionary (EN-AR) :  1000000.0
All Language Translator Free :  1000000.0
Bible :  100000000.0
Amazon Kindle :  100000000.0
Aldiko Book Reader :  10000000.0
Wattpad 📖 Free Books :  100000000.0
Dictionary - WordWeb :  5000000.0
50000 Free eBooks & Free AudioBooks :  5000000.0
Al-Quran (Free) :  10000000.0
Al Quran Indonesia :  10000000.0
Al'Quran Bahasa Indonesia :  10000000.0
Al Quran Al karim :  1000000.0
Al Quran : EAlim -
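
To get a feel for how apps in a category spread across the install buckets, the buckets can be counted directly. The sketch below is one way to do that with `collections.Counter`; the helper name `bucket_counts` and the `toy_apps` rows are hypothetical and only mimic the column layout used above (category at index 1, installs at index 5).

```python
from collections import Counter


def bucket_counts(apps, category):
    """Count how many apps of `category` fall in each install bucket.

    Hypothetical helper; not part of the original notebook.
    """
    return Counter(app[5] for app in apps if app[1] == category)


# Made-up rows that only mimic the column layout of android_free
toy_apps = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Chat', 'COMMUNICATION', '', '', '', '100,000,000+'],
]

# most_common() lists the buckets from most to least frequent
for bucket, count in bucket_counts(toy_apps, 'BOOKS_AND_REFERENCE').most_common():
    print(bucket, ': ', count)
```

Run against `android_free`, a count like this would show how many apps sit in each bucket of the 1M to 100M range compared with the rest of the category.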

This niche seems to be dominated by software for processing and reading ebooks, along with libraries and dictionaries, so it's probably not the best idea to build similar apps. We would face significant competition from the ones already published.

Notably, there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. Reference apps can also be very popular; the "My Little Pony AR Guide" and "Stats Royale for Clash Royale" apps are good examples of this. A common theme among these examples is that the apps have benefited from popular topics (the Quran, My Little Pony, Clash Royale).

Overall, it seems that taking a popular book or reference and turning it into an app could be profitable for both the Google Play and the App Store markets.

To distinguish ourselves from the competition, we would need to add some special features to our app. These could include daily quotes from the book or reference, an audio version, quizzes, a discussion forum, etc.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular topic and turning it into a book or reference app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features to our app. These might include daily quotes from the book or reference, an audio version, quizzes, a discussion forum, or any other feature that gives us an edge over our competitors.