# Analysis of the app store to identify market interests. 
(Google Play + iOS App Store)

This project analyzes the app stores of the two most popular mobile operating systems (iOS and Android). With the intention of creating an app, it serves as basic research into the market to see which apps attract the most users.

The focus is on apps that are free to download and install, with in-app ads as the main source of revenue. Because ad revenue grows with usage, apps that see good traffic on average are of key interest.

This research uses dated datasets, each containing a sample of one store's catalog. In September 2018, there were approximately 2 million iOS apps available in the App Store and 2.1 million on the Google Play market. Collecting data for over 4 million apps would require a significant amount of time and money, so we will analyze a sample instead. There are 2 existing datasets that are suitable for our purpose:

- **[Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv)** scraped from the iTunes Search API that contains more than 7,000 Apple iOS mobile application details.

- **[Dataset](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)** that contains data of about 10,000 Android apps from the Google Play market.

Let us begin by opening the two datasets and continue with our exploration/research:

In [1]:
from csv import reader

In [2]:
## Google Play Dataset
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
# Assigning the first row to headers variable
android_header = android[0] 
# Assign the data rows, excluding the headers, back to android variable
android = android[1:]
del android[10472] # known problematic entry

In [3]:
## Apple Store Dataset
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
# Assigning the first row to headers variable
ios_header = ios[0] 
# Assign the data rows, excluding the headers, back to ios variable
ios = ios[1:]
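
The two loading cells follow the same steps, so they could be folded into one helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the cells above, and it assumes both files are UTF-8 encoded with the header in the first row.

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for a CSV file whose first row is the header.

    Hypothetical helper, not part of the original notebook.
    """
    # The 'with' block closes the file even if reading fails.
    # newline='' is what the csv module recommends for files passed to reader().
    with open(path, encoding='utf8', newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Usage with the file names from the cells above:
# android_header, android = load_csv('googleplaystore.csv')
# ios_header, ios = load_csv('AppleStore.csv')
```

The known problematic Android row would still need to be deleted afterwards, exactly as in the cell above.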

**To be able to explore our datasets, it would be easier to create a function that can be called upon to look at either slices or the entire dataset.**

In [4]:
def explore(dataset, start=0, end=5, rows_cols=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row,'\n')
    
    # The rows_cols parameter prints the number of rows and columns in the dataset when set to True.
    if rows_cols:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Using our defined function to explore the android dataset:

In [5]:
print(android_header,'\n')
explore(android, rows_cols=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

Using our defined function to explore the ios dataset:

In [6]:
print(ios_header,'\n')
explore(ios,rows_cols=True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'] 

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'] 

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'] 

['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'] 

['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37'

### Exploring header columns to identify ones that will be relevant to this project. 

A breakdown of the Google Play dataset reveals 10840 apps (10841 rows before the known problematic entry was deleted) and 13 columns.

('App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver')

Since our focus is on user engagement for apps that are free to download and install, the columns useful for our analysis are:
- **'App'**,
- **'Category'**, 
- **'Reviews'**, 
- **'Installs'**, 
- **'Price'**, and 
- **'Genres'.**

For the App Store dataset, we have 7197 apps and 17 columns (16 named fields plus an unnamed index column), of which the ones relevant to this project are:

- **'track_name'**,
- **'price'**,
- **'rating_count_tot'**,
- **'rating_count_ver'** and
- **'prime_genre'**

Before proceeding further, it is best to clean our data, which involves removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

### Finding duplicate entries and deciding a criterion for removal of such entries:

In [7]:
android_duplicate_apps = []
android_unique_apps = []

for app in android:
    name = app[0]
    if name in android_unique_apps:
        android_duplicate_apps.append(name)
    else:
        android_unique_apps.append(name)

print("Number of unique apps:", len(android_unique_apps))        
print('Number of duplicate apps:', len(android_duplicate_apps))
print('\n')
print('Examples of duplicate apps:', android_duplicate_apps[:15])

Number of unique apps: 9659
Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [8]:
ios_unique_apps = []
ios_duplicate_apps = []

for app in ios:
    name = app[2]
    if name in ios_unique_apps:
        ios_duplicate_apps.append(name)
    else:
        ios_unique_apps.append(name)

print("Number of unique apps:", len(ios_unique_apps))        
print("Number of duplicate apps:", len(ios_duplicate_apps))
print('\n')
print('Examples of duplicate apps:', ios_duplicate_apps[:2])

Number of unique apps: 7195
Number of duplicate apps: 2


Examples of duplicate apps: ['VR Roller Coaster', 'Mannequin Challenge']
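
The cells above test membership against a list, which rescans the list for every app. As a hedged alternative, the sketch below tracks seen names in a set instead. `split_duplicates` and the sample rows are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
def split_duplicates(rows, name_index):
    """Return (unique_names, duplicate_names) in first-seen order."""
    seen = set()  # set membership checks are constant time on average
    unique_names, duplicate_names = [], []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicate_names.append(name)
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, duplicate_names

# Made-up rows in the shape [name, reviews]
sample = [['Box', '10'], ['Slack', '7'], ['Box', '12'], ['Box', '12']]
unique, dupes = split_duplicates(sample, 0)
print(unique)  # ['Box', 'Slack']
print(dupes)   # ['Box', 'Box']
```

Called as `split_duplicates(android, 0)` or `split_duplicates(ios, 2)`, it should yield the same counts as the two cells above.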


These duplicate entries need to be dealt with before proceeding further with our analysis. One option would be to remove duplicates at random, but it is better to define a criterion that keeps the most informative row for each app.

Going through different duplicate entries, it is evident that the number of reviews can be used to differentiate between them. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows with the highest number of reviews. To achieve this, we will:

- Create a dictionary where each key is a unique app name, and the value is its highest review count.

- Use this dictionary to build a new dataset with no duplicates.

Dictionary creation for android dataset

In [9]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))    

Expected length: 9659
Actual length: 9659


Dictionary creation for ios dataset

In [10]:
most_reviewed = {}
for app in ios:
    name = app[2]
    reviews = float(app[6])
    
    if name in most_reviewed and most_reviewed[name] < reviews:
        most_reviewed[name] = reviews
        
    elif name not in most_reviewed:
        most_reviewed[name] = reviews
        
print('Expected length:', len(ios) - 2)
print('Actual length:', len(most_reviewed))

Expected length: 7195
Actual length: 7195


We can now use the dictionaries reviews_max and most_reviewed, for Android and iOS respectively, to keep only the entries with the highest number of reviews.

In the code cells below:

We start by initializing two empty lists, one that holds the cleaned rows and another that tracks the names already added, so we can check for duplicates.

We then loop through each dataset, and for every iteration:
- We isolate the name of the app and the number of reviews.
- We check that the number of reviews of the current app matches the maximum recorded for that app in the dictionary, and that the name of the app is not already in the list of added names.
- If both conditions hold, we add the current row (app) to the cleaned list and the app name (name) to the list of added names.

The name check matters when several duplicate rows share the same highest review count; without it, all of them would be kept.

In [11]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))

9659


In [12]:
ios_clean = []
entered_name = []

for app in ios:
    name = app[2]
    reviews = float(app[6])
    
    if reviews == most_reviewed[name] and name not in entered_name:
        ios_clean.append(app)
        entered_name.append(name)
print(len(ios_clean))

7195
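
The two passes above (build a dictionary of maximum counts, then filter) can also be condensed into a single pass that stores the best row per name. This is a sketch under assumptions, not the notebook's method: `keep_most_reviewed` and the sample rows are hypothetical, and the output order follows each name's first appearance, which may differ from the order produced by the cells above.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means a later row with an equal count does not replace
        # the earlier one, mirroring the already-added check in the cells above.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows in the shape [name, reviews, note]
sample = [
    ['Box', '10', 'old'],
    ['Slack', '7', 'only'],
    ['Box', '12', 'newer'],
    ['Box', '12', 'tie'],
]
clean = keep_most_reviewed(sample, 0, 1)
print(clean)  # [['Box', '12', 'newer'], ['Slack', '7', 'only']]
```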


### Exploring the cleaned dataset using the function we defined above:

In [13]:
explore(android_clean)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'] 

['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'] 



In [14]:
explore(ios_clean)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'] 

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'] 

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'] 

['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'] 

['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'] 



## Filtering out non-English apps:

Since our app will be developed in English, it is best to analyze apps that are geared towards an English-speaking audience. To remove apps that do not fit this purpose, we define a function that goes through each character of the app name and looks up its code point with ord(). For the characters commonly used in English text, this falls in the ASCII range 0-127. Some English app names do contain a few characters outside this range (for example, the en dash in 'U Launcher Lite – FREE Live Cool Themes, Hide Apps'), so we only remove apps whose names have more than 3 characters outside it.

In [15]:
def check_name(app_name):
    non_eng_char_count = 0
    
    for char in app_name:
        if ord(char) > 127:
            non_eng_char_count += 1
    
    if non_eng_char_count > 3:
        return False
    else:
        return True
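
To see how the threshold behaves, the sketch below applies the same rule to a few illustrative names. The helper names (`count_non_ascii`, `looks_english`) and the sample names are made up for this example and are not taken from the datasets.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for char in text if ord(char) > 127)

def looks_english(app_name, max_non_ascii=3):
    """Same rule as check_name: allow at most 3 non-ASCII characters."""
    return count_non_ascii(app_name) <= max_non_ascii

# Illustrative names only
samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in samples:
    print(name, '->', count_non_ascii(name), looks_english(name))
```

The first three names contain at most one non-ASCII character and pass, while the last one exceeds the limit and is rejected. The rule is a heuristic: an English name with four or more emoji would still be dropped.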

**Conducting a check, using the function defined above and creating a new list that contains apps that meet our criteria.**

In [16]:
android_eng = []
ios_eng = []

for app in android_clean:
    name = app[0]
    if check_name(name):
        android_eng.append(app)

for app in ios_clean:
    name = app[2]
    if check_name(name):
        ios_eng.append(app)
        
print(f'Number of english apps on the Google Play store: {len(android_eng)}')  
print(f'Number of english apps on the iOS store: {len(ios_eng)}')

Number of english apps on the Google Play store: 9614
Number of english apps on the iOS store: 6181


### Filtering the dataset to only pay attention to free apps.

As mentioned previously, our focus is on apps that are free to download and install. Since the datasets contain both free and paid apps, we now filter the remaining apps to keep only the free ones.

In [17]:
android_free = []
ios_free = []

for app in android_eng:
    price = app[7]
    if price == '0':
        android_free.append(app)
print(f'Number of free apps: {len(android_free)}')

for app in ios_eng:
    price = app[5]
    if price == '0':
        ios_free.append(app)
print(f'Number of free apps: {len(ios_free)}')

Number of free apps: 8864
Number of free apps: 3220


**After filtering the datasets to meet our selected criteria, we are left with 8864 Android apps and only 3220 iOS apps. This is still a good sample size to carry forward with our analysis.**
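
The filter above compares the price string with '0' exactly. A slightly more tolerant check converts the price to a number first. The sketch below is an assumption-laden illustration: `is_free` is a hypothetical helper, and the '$' prefix and '0.0' spelling are formats it guards against, not formats confirmed by the outputs shown here.

```python
def is_free(price):
    """Return True when a price string represents zero.

    Assumes the string is numeric apart from an optional leading '$';
    anything else raises ValueError rather than being silently kept.
    """
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0

for price in ['0', '0.0', '$0', '3.99', '$4.99']:
    print(repr(price), is_free(price))
```

With this helper, the two loops could use `if is_free(app[7])` and `if is_free(app[5])` in place of the string comparison.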

### Finding out the most common genre amongst the free apps

#### Our goal is to determine the kinds of apps that are likely to attract more users, since the number of people using our apps affects our revenue. To minimize risks and overhead, our validation strategy would be as follows:

1. **Build a minimal version of the app, and add it to Google Play.**
2. **Based on user response, decide to develop it further or scrap the project.**
3. **If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.**

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets.

Let's begin by analysing the most common genres in each store by building frequency tables. 
Two functions are defined below: 

1. ftable: takes a dataset and a column index as arguments and returns a frequency table (a dictionary) that maps each value in that column to the percentage of rows that have it.
2. display_table: takes the same 2 arguments, converts the dictionary returned by ftable into (percentage, value) tuples, and prints them sorted in descending order of percentage.

Function to generate frequency tables that show percentages.

In [18]:
def ftable(dataset, index):
    freq_table = {}
    count_pct = {}
    app_count = 0
    
    for row in dataset:
        app_count += 1
        col_val = row[index]
        
        if col_val in freq_table:
            freq_table[col_val] += 1
        else:
            freq_table[col_val] = 1
    
    for key in freq_table:
        pct = (freq_table[key] / app_count) * 100
        count_pct[key] = pct
        
    return count_pct

Function to display percentages in descending order.

In [19]:
def display_table(dataset, index):
    table = ftable(dataset, index)
    table_display = []
    
    for key in table:
        key_val = (table[key], key)
        table_display.append(key_val)
        
    table_sort = sorted(table_display, reverse=True)
    for entry in table_sort:
        print(entry[1], ':', entry[0])
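
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. The function names `freq_table_pct` and `print_sorted` and the sample rows are hypothetical; this is an alternative illustration, not a replacement for the two functions above.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Map each value in the given column to its percentage of all rows."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

def print_sorted(table):
    """Print the table in descending order of percentage."""
    for value, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(f'{value} : {pct:.2f}')

# Made-up rows in the shape [name, category]
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
table = freq_table_pct(sample, 1)
print_sorted(table)
```

Sorting on the percentage alone avoids building (percentage, value) tuples by hand, and rounding to two decimals keeps the printed table easier to scan.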

Using the functions above to generate the frequency tables for app genres from both datasets.

In [20]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

Based on the above frequency table, family apps seem to dominate the Google Play landscape, at close to 19% of free apps. Games account for only about 10%, and most of the apps are built for practical purposes, for example Tools, Business, Lifestyle, and Family.

In [21]:
display_table(ios_free, -5)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


The above table shows that the majority of free apps on the App Store are games, which make up 58.14% of the total. The next largest categories are 'Entertainment' and 'Photo & Video', but they are far behind, with only 7.89% and 4.97% of the free apps, respectively. Compared to the Google Play dataset, the App Store is dominated by 'for-fun' apps, a very different landscape from its counterpart's.

## Most Popular Apps by Genre on the App Store

To find out the more popular genres, we can average the number of installs for each. For the Google Play dataset we can derive this information from the 'Installs' column but for the App Store data set this information is missing.

In this instance we will take the total number of user ratings as a proxy, which can be found under the 'rating_count_tot' column.

In [22]:
ios_genre = ftable(ios_free, -5)

for genre in ios_genre:
    total = 0
    len_genre = 0
    
    for app in ios_free:
        app_genre = app[-5]
        
        if app_genre == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
    
    avg_nrat = total / len_genre
    print(genre, ':', avg_nrat)

Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22812.92467948718
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0
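
The cell above rescans every app once per genre. As a hedged sketch, the same averages can be computed in a single pass by accumulating a total and a count per group. `average_by_group` and the sample rows are hypothetical and are shown only to illustrate the technique.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows in the shape [name, genre, rating_count]
sample = [
    ['A', 'Navigation', '10'],
    ['B', 'Navigation', '30'],
    ['C', 'Games', '5'],
]
averages = average_by_group(sample, 1, 2)
print(averages)  # {'Navigation': 20.0, 'Games': 5.0}
```

For the iOS data, the equivalent call would be `average_by_group(ios_free, -5, 6)`.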


Going by the above information, Navigation apps have the highest average number of user ratings, followed by 'Reference' and 'Social Networking' apps. We take a closer look at these categories below:

In [23]:
for app in ios_free:
    if app[-5] =='Navigation':
        print(app[2], ':', app[6])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


Going through each category, we can see that, on average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.
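
To make the skew concrete, the sketch below compares the mean and the median of the six Navigation rating counts printed above, then recomputes the mean without Waze and Google Maps. The variable names are chosen for this example only.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed above
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

print(mean(navigation_ratings))    # about 86090.33, matching the genre average
print(median(navigation_ratings))  # 8196.5

# Drop the two largest apps (Waze and Google Maps) and average the rest
without_giants = sorted(navigation_ratings)[:-2]
print(mean(without_giants))        # 4146.25
```

The median is roughly a tenth of the mean, which is a quick sign that a couple of very large apps drive the average.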

In [24]:
for app in ios_free:
    if app[-5] =='Social Networking':
        print(app[2], ':', app[6])

Facebook : 2974676
LinkedIn : 71856
Skype for iPhone : 373519
Tumblr : 334293
Match™ - #1 Dating App. : 60659
WhatsApp Messenger : 287589
TextNow - Unlimited Text + Calls : 164963
Grindr - Gay and same sex guys chat, meet and date : 23201
imo video calls and chat : 18841
Ameba : 269
Weibo : 7265
Badoo - Meet New People, Chat, Socialize. : 34428
Kik : 260965
Qzone : 1649
Fake-A-Location Free ™ : 354
Tango - Free Video Call, Voice and Chat : 75412
MeetMe - Chat and Meet New People : 97072
SimSimi : 23530
Viber Messenger – Text & Call : 164249
Find My Family, Friends & iPhone - Life360 Locator : 43877
Weibo HD : 16772
POF - Best Dating App for Conversations : 52642
GroupMe : 28260
Lobi : 36
WeChat : 34584
ooVoo – Free Video Call, Text and Voice : 177501
Pinterest : 1061624
知乎 : 397
Qzone HD : 458
Skype for iPad : 60163
LINE : 11437
QQ : 9109
LOVOO - Dating Chat : 1985
QQ HD : 5058
Messenger : 351466
eHarmony™ Dating App - Meet Singles : 11124
YouNow: Live Stream Video Chat : 12079
Cougar 

**The same pattern can be seen for social networking apps, where the average is heavily influenced by a handful of apps such as Facebook, Pinterest, and Skype. Music apps show the same effect: a few big apps, like Pandora, Spotify, and Shazam, heavily skew the average.**

In [25]:
for app in ios_free:
    if app[-5] =='Reference':
        print(app[2], ':', app[6])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8


'Reference' apps have an extremely high average number of user ratings, but even this figure is skewed by 'Bible' and 'Dictionary.com'. However, the remaining apps in this category are also quite popular, with several racking up more than 10,000 ratings. This category shows promise and is worth targeting for our new app.

*Our aim is to find popular genres, but navigation, social networking, and music apps may seem more popular than they really are. In each case, a handful of apps skew the average number of ratings, while most apps barely cross the 10,000-ratings threshold. We could dig deeper by removing these very popular apps and recomputing the averages.*

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. We ignore these for the following reasons:

**Weather apps: people generally don't spend too much time in-app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.**

**Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.**

**Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.**

## Most Popular Apps by Genre on Google Play

For this dataset, we can start directly with an analysis of the 'Installs' column.

In [26]:
display_table(android_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The install counts are open-ended ranges (for example, '100,000+'), so the data is not precise. For the purposes of this project, this is not a concern: we leave the numbers as they are and treat each range as its lower bound, so '100,000+' counts as 100,000 installs. To compute averages, we still need to clean this field by removing characters such as ',' and '+' and converting each value from a string to a float.
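
The cleaning step described above can be sketched as a small helper. `parse_installs` is a hypothetical name used for illustration; it assumes the only non-digit characters in the field are ',' and '+'.

```python
def parse_installs(text):
    """Convert an install range such as '1,000,000+' to its lower bound as a float."""
    return float(text.replace(',', '').replace('+', ''))

for raw in ['1,000,000+', '500+', '0+', '0']:
    print(raw, '->', parse_installs(raw))
```

Keeping the conversion in one function means the same rule is applied wherever install counts are averaged or compared.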

In [32]:
android_cat = ftable(android_free, 1)

for cat in android_cat:
    total = 0
    len_category = 0
    for app in android_free:
        app_cat = app[1]
        if app_cat == cat:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
            
    avg_n_installs = total / len_category
    print(cat, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

**Sifting through this information, communication apps have the most installs on average. However, this metric seems to be skewed by a handful of apps such as WhatsApp, Facebook Messenger, and Skype.**

In [33]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

**If we remove all 'Communication' apps with 100 million or more installs, the average drops by roughly a factor of ten, from about 38.5 million to about 3.6 million.**

In [34]:
thresh_100m = []

for app in android_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        thresh_100m.append(float(n_installs))
        
sum(thresh_100m) / len(thresh_100m)

3603485.3884615386

We can apply the same methodology to other categories at the top of the charts. The same pattern can be seen in the video players, social, photography, and productivity categories.
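
One way to repeat the check for other categories is to wrap the threshold logic in a function. The sketch below is illustrative only: `average_installs_below`, `parse_installs`, and `make_row` are hypothetical names, the sample rows are made up, and the default column positions (1 for category, 5 for installs) are taken from the Google Play header shown earlier.

```python
def parse_installs(text):
    """Convert an install range such as '1,000+' to its lower bound as a float."""
    return float(text.replace(',', '').replace('+', ''))

def average_installs_below(rows, category, cap, category_index=1, installs_index=5):
    """Average installs for one category, ignoring apps at or above the cap.

    Returns None when no app in the category falls below the cap.
    """
    kept = []
    for row in rows:
        if row[category_index] != category:
            continue
        installs = parse_installs(row[installs_index])
        if installs < cap:
            kept.append(installs)
    return sum(kept) / len(kept) if kept else None

def make_row(category, installs):
    """Build a made-up row with the category at index 1 and installs at index 5."""
    return ['app', category, '', '', '', installs]

sample = [
    make_row('COMMUNICATION', '1,000,000,000+'),
    make_row('COMMUNICATION', '1,000+'),
    make_row('COMMUNICATION', '5,000+'),
    make_row('SOCIAL', '10,000+'),
]
print(average_installs_below(sample, 'COMMUNICATION', 100_000_000))  # 3000.0
```

Calling it with `android_free` and each top category in turn would show how much of every average comes from the largest apps.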

The primary concern here is to avoid categories that seem more popular than they actually are because a handful of apps skew the data in their favour. This also tells us whether a category is dominated by giants, who can be hard to compete against.

Besides the categories mentioned above, the games genre looks saturated to the brim. We can look into the 'Books and Reference' category to see if our finding from the App Store dataset translates well over to the Google Play data.

In [35]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0],':',app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

This category contains a wide variety of apps, ranging from software for processing and reading ebooks to library collections, dictionaries, and tutorials on different subjects. However, this category is also not exempt from a few apps amassing most of the installs.

In [36]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


This niche seems to be dominated either by software for processing and reading ebooks or collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice that quite a few apps are built around the Quran, which suggests that building an app around a popular book can be profitable. Taking a popular book (perhaps a more recent one) and turning it into an app could be profitable in both the Google Play and App Store markets.

For this, we would need to add some special features beyond the raw text of the book, such as daily quotes, an audio version, or a trivia section.

# Conclusions

Going through the datasets for both digital stores, we can see that a gap exists in the 'Reference' category of mobile apps, more specifically around apps built on popular books. The gap was identified in both digital stores and going by research this app idea shows the most promise.

An app containing just the raw text of the book may not be enough, so packaging in special features would help. These could include daily quotes from the book, an audio version, a trivia section, and so on.