# Profitable Apps for the App Store & Google Play

In this project, I act as a data analyst for a company that builds Android and iOS mobile apps. The company only builds apps that are free to download & install, and its main source of revenue is in-app ads. The revenue from any given app is therefore closely tied to its number of users.

The goal of this project is to analyze data to help company developers understand what types of apps are likely to attract more users, and thus, are more profitable.

In [2]:
from csv import reader

# App Store dataset
# The with-block closes the file once it has been read; the explicit
# encoding avoids platform-dependent decoding of non-English app names.
with open('AppleStore.csv', encoding='utf8') as open_apple:
    apple_data = list(reader(open_apple))
apple_header, apple_data = apple_data[0], apple_data[1:]

# Google Play dataset
with open('googleplaystore.csv', encoding='utf8') as open_google:
    google_data = list(reader(open_google))
google_header, google_data = google_data[0], google_data[1:]

To make the datasets easier to explore, I use the function `explore_data()`, defined below. It prints a slice of rows in a readable way and, optionally, the number of rows & columns in a given dataset.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
print(apple_header)
print('\n')
explore_data(apple_data,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


As we can see above, the App Store dataset has a total of 7197 rows (apps), and 16 columns present. Looking ahead, the columns `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'` may be of interest for further analysis. 

For details regarding each column, see [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

Now for the Google Play dataset:

In [5]:
print(google_header)
print('\n')
explore_data(google_data,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play dataset consists of 10841 rows (apps), with 13 columns present. At a glance, the columns `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'` will likely be of use for further analysis.

## Data Cleaning

### Detecting & Deleting Inaccurate Data

The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion?sort=recent-comments), where [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for row 10472.

Below, I've printed row 10472, as well as the header for this dataset, and another row for comparison.

In [6]:
#wrong entry for row index 10472 vs. row index 0
print(google_data[10472])
print('\n')
print(google_header)
print('\n')
print(google_data[0])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The error is immediately noticeable in the `'Category'` column. Instead of a categorical entry (such as the 'ART_AND_DESIGN' entry in our comparison row), it has a listed numerical entry of '1.9'. 

In fact, the true `'Category'` entry for row 10472 is missing (as mentioned [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015)), causing all further entries for that row to be inappropriately shifted one column over.

Thus, this particular app will be removed from our data.

In [7]:
print(len(google_data))
del google_data[10472]
print(len(google_data))

10841
10840
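
The shifted row was found through the Kaggle discussion, but the same kind of error can also be looked for directly. The row printed above has 12 values against 13 column names, so comparing each row's length with the header's length would flag it. The following is a minimal sketch on a tiny made-up sample; the helper name `find_malformed_rows` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose column count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Tiny hypothetical sample (not the real dataset): the second row is missing
# its 'Category' value, so it has one column fewer than the header.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Photo Frame', '1.9'],
    ['Coloring Book', 'ART_AND_DESIGN', '3.9'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)  # [1]

# Delete from the end so earlier indices stay valid while removing.
for i in reversed(malformed):
    del sample_rows[i]
print(len(sample_rows))  # 2
```

A length check only catches rows with a missing or extra value; a row with the right number of columns but wrong contents would still need a separate check.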


### Detecting & Removing Duplicate Data

If you explore the Google Play dataset long enough, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

In [8]:
for app in google_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1181 cases where an app occurs more than once:

In [9]:
duplicates = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicates.append(name)
    else:
        unique_apps.append(name)
            
print('Number of duplicate apps: ',len(duplicates))
print('Examples of duplicate apps: ',*duplicates[:6],sep='\n')

Number of duplicate apps:  1181
Examples of duplicate apps: 
Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings
Box


If you examine the rows we printed two cells above for the Instagram app, the main difference happens in the fourth column of each row, which corresponds to the number of reviews. Rather than removing duplicates randomly, I'll only keep the row with the highest number of reviews and remove the other entries for any given app.

This will be done in two steps:
* Create a dictionary where each key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app
* Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, select the entry with the highest number of reviews)

Building the dictionary:

In [10]:
reviews_max = {}
for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

Next, I will confirm that the length of `reviews_max` agrees with the length of the Google Play dataset once the number of duplicate rows has been subtracted:

In [11]:
print('Expected Length: ',len(google_data)-len(duplicates))
print('Actual Length: ',len(reviews_max))

Expected Length:  9659
Actual Length:  9659


Using the `reviews_max` dictionary, I will remove duplicate entries from the Google Play dataset. Again, the highest number of reviews is the criterion for which entry to keep.

In the code cell below:
* I start by initializing empty lists `google_clean` and `already_added`.
* I loop through the Google Play dataset (`google_data`), and for each iteration:
    * Assign the app name to a variable `name`, and number of reviews to `n_reviews`.
    * Add the current row (`app`) to `google_clean` and add `name` to `already_added` if:
        * The number of reviews is the same as the maximum reviews of that app, as found in `reviews_max`, and
        * The name of the app is not already in the `already_added` list. This is to account for when the highest number of reviews of a duplicate app is the same for more than one entry.

In [12]:
google_clean = []
already_added = []

for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        google_clean.append(app)
        already_added.append(name)

Below, I will confirm that the length of our new dataset, `google_clean`, agrees with that of `reviews_max`:

In [13]:
explore_data(google_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


As expected, `google_clean` has a total of 9659 rows (apps).
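
As an aside, the two-step approach above could also be collapsed into a single pass by storing the best row itself rather than only its review count. The sketch below is one possible variant, not the method used in this notebook. The helper name `dedupe_by_max_reviews` is hypothetical, the sample rows are trimmed to four columns, and the `Box` row is made up for illustration.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Trimmed, partly hypothetical sample rows: name, category, rating, reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```

One difference to keep in mind: this version orders the result by each app's first appearance, whereas the two-step version orders it by the position of the kept row, so the two outputs can list the same rows in a different order.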

Thankfully, removing duplicates is unnecessary for the App Store dataset (`apple_data`) because the `'id'` column acts as a unique key, as evidenced below:

In [14]:
unique_ids = []
duplicate_ids = []

for app in apple_data:
    ids = app[0]
    if ids in unique_ids:
        duplicate_ids.append(ids)
    else:
        unique_ids.append(ids)
print('Number of Duplicate Entries: ',len(duplicate_ids))

Number of Duplicate Entries:  0
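
The same uniqueness check can be sketched more compactly with a set, since a set drops repeated values. This is only an illustrative alternative; the helper name `has_duplicates` is an assumption, and the sample IDs are the three shown in the first rows of the App Store dataset.

```python
def has_duplicates(values):
    """Return True if any value occurs more than once."""
    values = list(values)
    # A set keeps one copy of each value, so a shorter set means repeats exist.
    return len(set(values)) != len(values)

sample_ids = ['284882215', '389801252', '529479190']
print(has_duplicates(sample_ids))                  # False
print(has_duplicates(sample_ids + ['284882215']))  # True
```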


### Removing Non-English Apps

If you explore the datasets long enough, you'll find that both datasets have apps with names that suggest they are not designed for English-speaking audiences.

For example:

In [15]:
print(apple_data[813][1])
print(apple_data[6731][1])
print('\n')
print(google_clean[4412][0])
print(google_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


I'm not interested in keeping these apps, so I'll remove them. One way to do this is to remove each app with a name containing a symbol that isn't commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

According to the ASCII system, the characters commonly used in English text all correspond to numbers in the range 0 to 127. This can be used to write a function that checks whether a string contains any character outside this range.

In [16]:
def is_eng(a_str):
    for char in a_str:
        char_num = ord(char)
        if char_num > 127:
            return False
    return True

print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


Unfortunately, the above function does not appropriately identify certain English app names with emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. 

Using this function moving forward would mean losing useful data, as many English apps would be incorrectly labeled as non-English.

In [17]:
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))
print(ord('™'))
print(ord('😜'))

False
False
8482
128540


To limit this data loss, I will instead edit the function so that it labels an app name as non-English only if it contains more than three characters outside the 0-127 ASCII range:

In [18]:
def is_eng(a_str):
    non_eng = 0
    for char in a_str:
        char_num = ord(char)
        if char_num > 127:
            non_eng += 1
    if non_eng > 3:
        return False
    else:
        return True
    
print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))

True
False
True
True


While the above filter function is not perfect, it should still be quite effective for analysis moving forward.
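
To make the imperfection concrete: a short name written entirely in non-ASCII characters can still pass a "more than three" rule, because it never exceeds the count. The sketch below shows this and tries one possible alternative based on the share of non-ASCII characters. Both helper names and the 0.2 cutoff are assumptions made for illustration; they are not used elsewhere in this analysis.

```python
def count_non_ascii(a_str):
    """Count the characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for char in a_str if ord(char) > 127)

def is_eng_by_share(a_str, max_share=0.2):
    """Hypothetical variant: treat a name as English only if at most
    max_share of its characters are non-ASCII (0.2 is an assumed cutoff)."""
    if not a_str:
        return True
    return count_non_ascii(a_str) / len(a_str) <= max_share

print(count_non_ascii('中国語'))   # 3, so a "more than three" rule would keep it
print(is_eng_by_share('中国語'))   # False
print(is_eng_by_share('Instachat 😜'))                   # True
print(is_eng_by_share('Docs To Go™ Free Office Suite'))  # True
```

A share-based rule has its own trade-off: very short English names with a single symbol or emoji make up a larger share and could be dropped, so the cutoff would need tuning against real app names.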

Below, the `is_eng()` function is used to filter out non-English apps from the Google Play and App Store datasets:

In [19]:
google_eng = []
apple_eng = []

for app in google_clean:
    name = app[0]
    if is_eng(name):
        google_eng.append(app)

for app in apple_data:
    name = app[1]
    if is_eng(name):
        apple_eng.append(app)
        
explore_data(google_eng,0,3,True)
print('\n')
explore_data(apple_eng,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

As a result, I am left with two datasets which comprise 9614 Google Play apps and 6183 App Store apps.

### Isolating Free Apps

As mentioned in the introduction, I am only concerned with apps that are free to download & install, whose main source of revenue consists of in-app ads. Since the above datasets contain both free and non-free apps, I will isolate the free apps below.

In [20]:
google_final = []
apple_final = []

for app in google_eng:
    price = float(app[7].lstrip('$'))
    if price == 0:
        google_final.append(app)
        
for app in apple_eng:
    price = float(app[4].lstrip('$'))
    if price == 0:
        apple_final.append(app)

print('Google Play dataset:')
explore_data(google_final,0,0,True)
print('\n')
print('App Store dataset:')
explore_data(apple_final,0,0,True)

Google Play dataset:
Number of rows: 8864
Number of columns: 13


App Store dataset:
Number of rows: 3222
Number of columns: 16


Finally, I am left with 8864 free Google Android apps and 3222 free Apple iOS apps.

## Analysis

### Most Common Apps by Genre
As mentioned in the introduction, my goal is to determine the kinds of apps that are likely to attract a greater number of users, as this number is greatly tied to an app's revenue.

To minimize risks and overhead, my validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app & add it to Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because the end goal is to add the app to both Google Play & the App Store, I need to find app profiles that are successful in both markets.

I will begin by building frequency tables based on the `'Category'` & `'Genres'` columns in the Google Play dataset, and the `'prime_genre'` column in the App Store dataset.

Below are two functions I will use to build my frequency tables:

In [21]:
def freq_table(dataset,index):
    ft = {}
    total = len(dataset)
    for row in dataset:
        element = row[index]
        if element in ft:
            ft[element] += 1
        else:
            ft[element] = 1
    for key in ft:
        ft[key] *= 100/total
    return ft

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

The `freq_table()` function creates a frequency table such that frequencies are expressed as percentages.

The `display_table()` function is used to display the frequencies from a frequency table in descending order.
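
For comparison, the same kind of percentage table can be sketched with `collections.Counter` from the standard library, which counts values and can return them from most to least common. The function name `freq_table_counter` and the four-row sample are illustrative assumptions; the notebook itself keeps using `freq_table()` and `display_table()`.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, most common value first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Tiny made-up sample: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_counter(sample, 1)
for genre, pct in table:
    print(genre, ':', pct)
# Games : 75.0
# Music : 25.0
```

The two approaches can order tied values differently: `display_table()` sorts `(percentage, key)` tuples in reverse, while `most_common()` keeps tied values in the order they were first seen.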

----
Starting with the frequency table for `'prime_genre'` in the App Store dataset:

In [22]:
#'prime_genre'
print('Most Common iOS Apps by Genre')
display_table(apple_final,11)

Most Common iOS Apps by Genre
Games : 58.16263190564866
Entertainment : 7.883302296710117
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.513966480446927
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002482
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among free English apps in the App Store, roughly 58% are games. This is the most common genre by a wide margin: the next most common genre, entertainment, accounts for only about 8% of these apps. These are followed by photo & video apps (~5%), education (~3.66%), and social networking (~3.29%).

The general impression is that a large majority of free English apps in the App Store are designed for entertainment purposes (games, photo and video, social networking, sports, music, etc.), while there are fewer designed for practicality (education, shopping, utilities, productivity, lifestyle, etc.).

However, it should be noted that the sheer number of apps in a particular genre is not indicative of the number of users for these types of apps. This frequency table alone cannot be used to recommend an app profile for the App Store market.

Moving to the frequency tables for `'Category'` and `'Genres'` for the Google Play dataset:

In [23]:
#'Category'
print('Most Common Android Apps by Category')
display_table(google_final,1)

Most Common Android Apps by Category
FAMILY : 18.90794223826715
GAME : 9.724729241877258
TOOLS : 8.461191335740073
BUSINESS : 4.591606498194946
LIFESTYLE : 3.903429602888087
PRODUCTIVITY : 3.8921480144404335
FINANCE : 3.700361010830325
MEDICAL : 3.531137184115524
SPORTS : 3.3957581227436826
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180508
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.3352888086642603
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.143501805054152
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090254
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505416
LIBRARIES_AND_DEMO : 0.9363718411552348
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833936
EVENTS : 0.7107400722021661
PARENTING 

Unlike the free English apps in the App Store, those on Google Play seem to be more balanced between entertainment & practical-purpose apps. No individual category holds a majority, and the most numerous is the Family category at roughly 19%.

In [24]:
#'Genres'
print('Most Common Android Apps by Genre')
display_table(google_final,9)

Most Common Android Apps by Genre
Tools : 8.44990974729242
Entertainment : 6.069494584837545
Education : 5.347472924187726
Business : 4.591606498194946
Productivity : 3.8921480144404335
Lifestyle : 3.8921480144404335
Finance : 3.700361010830325
Medical : 3.531137184115524
Sports : 3.4634476534296033
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.9444945848375452
News & Magazines : 2.7978339350180508
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.143501805054152
Simulation : 2.041967509025271
Dating : 1.861462093862816
Arcade : 1.8501805054151625
Video Players & Editors : 1.7712093862815885
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090254
Food & Drink : 1.2409747292418774
Puzzle : 1.128158844765343
Racing : 0.9927797833935019
Role Playing : 0.9363718411552348
Libraries & Demo : 0.9363718411552348

The frequency table for `'Genres'` reinforces much of what can be seen in the previous table for `'Category'`. Apps designed for practical purposes are much better represented among free English apps on Google Play than among those in the App Store.

### Most Popular Apps by Genre

Now that I've determined the most common kinds of apps by genre, I want to determine the kinds of apps with the most users.

To do this, I will calculate the average number of installs per app genre. This information can be found in the `'Installs'` column of the Google Play dataset. Unfortunately, this column does not exist in the App Store dataset, so I will use the total number of user ratings (`'rating_count_tot'`) as a proxy instead.

In [25]:
apple_ft = freq_table(apple_final,11)

for genre in apple_ft:
    total = 0
    len_genre = 0
    for row in apple_final:
        genre_app = row[11]
        if genre_app == genre:
            n_ratings = float(row[5])
            total += n_ratings
            len_genre += 1
    avg_ratings = total/len_genre
    print(genre, ':', avg_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Based on the above metric, the app genres with the highest average number of users are the following:
* Navigation
* Reference
* Social Networking
* Music
* Weather

Let's take a closer look at the top genre, Navigation:

In [26]:
print('Navigation ---')
for apps in apple_final:
    if apps[11] == 'Navigation':
        print(apps[1],':',apps[5])

Navigation ---
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Because our popularity metric is based on averages, the numbers can be heavily skewed by a few outliers. This is clearly evident in a genre such as Navigation, where the total number of ratings is completely dominated by Waze and Google Maps. As a result, Navigation tops the list of popular genres despite having only six apps in the dataset.
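
The effect of these outliers can be checked with a quick comparison of the mean and the median for the six Navigation apps printed above. This is a small side calculation using the standard `statistics` module, not a change to the popularity metric used in this analysis.

```python
from statistics import mean, median

# Rating counts for the six Navigation apps printed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

avg_ratings = mean(navigation_ratings)
mid_ratings = median(navigation_ratings)

print(avg_ratings)  # about 86090.33, matching the genre average above
print(mid_ratings)  # 8196.5
```

The median is roughly a tenth of the mean, which suggests that a typical Navigation app in this dataset has far fewer ratings than the genre average implies.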

Ideally, we want to choose a popular app genre with a market that is neither sparse (where a few dominant apps account for most users) nor oversaturated (where many apps compete for the same users). Recalling the Most Common iOS Apps by Genre frequency table, games held a clear majority, indicating an oversaturated market. The genres that followed, however, had much more reasonable numbers: Entertainment, Photo & Video, Education, and Social Networking. Given the overlap between these and the most popular app genres, a free **Social Networking** app might be an app profile with a potential for high revenue in the App Store. Another option might be a **Music** app, which, despite being in the top five for genre popularity, is only "middle of the pack" in terms of how common it is.
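
The reasoning above can be sketched as a simple filter: keep genres whose share of apps falls inside a chosen band, then rank them by average rating count. The figures below are rounded from the two App Store tables earlier in this notebook; the 1% and 10% cutoffs are assumptions picked only to illustrate the idea.

```python
# genre -> (share of free English App Store apps in %, average rating count),
# rounded from the tables above
genres = {
    'Navigation': (0.19, 86090.33),
    'Reference': (0.56, 74942.11),
    'Social Networking': (3.29, 71548.35),
    'Music': (2.05, 57326.53),
    'Weather': (0.87, 52279.89),
    'Games': (58.16, 22788.67),
}

MIN_SHARE, MAX_SHARE = 1.0, 10.0  # assumed cutoffs, not taken from the analysis

# Keep genres inside the share band, ranked by average rating count.
candidates = sorted(
    (name for name, (share, _) in genres.items() if MIN_SHARE <= share <= MAX_SHARE),
    key=lambda name: genres[name][1],
    reverse=True,
)
print(candidates)  # ['Social Networking', 'Music']
```

Different cutoffs would give a different shortlist, so this is a way to make the judgment explicit rather than a replacement for it.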

---
Moving on to the Google Play dataset:

In [27]:
#frequency table for 'Installs' column
display_table(google_final,5)

1,000,000+ : 15.726534296028882
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411553
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.772111913357401
5,000+ : 4.512635379061372
10+ : 3.542418772563177
500+ : 3.2490974729241877
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.9178700361010832
5+ : 0.7897111913357401
1+ : 0.5076714801444043
500,000,000+ : 0.27075812274368233
1,000,000,000+ : 0.2256317689530686
0+ : 0.04512635379061372
0 : 0.01128158844765343


The dataset's `'Installs'` column should provide a clearer idea of the number of app users than the App Store's total number of ratings. However, this column is split into unequal intervals, so the number of installs for a given app is not precise.
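
Because the `'Installs'` values are interval labels stored as strings, they need to be converted before any averaging. A minimal sketch of that conversion, treating each label as its lower bound, is shown here; the helper name `installs_lower_bound` is an assumption introduced for illustration.

```python
def installs_lower_bound(installs):
    """Convert an 'Installs' label such as '1,000,000+' to its numeric lower bound."""
    return int(installs.replace(',', '').replace('+', ''))

# Labels taken from the frequency table above.
for label in ['1,000,000+', '500+', '0+', '0']:
    print(label, '->', installs_lower_bound(label))
# 1,000,000+ -> 1000000
# 500+ -> 500
# 0+ -> 0
# 0 -> 0
```

Since the next label after `100,000+` in the table is `500,000+`, an app labeled `100,000+` could have anywhere from 100,000 to just under 500,000 installs, so averages built on these lower bounds likely understate the true figures.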

Furthermore, it should be noted that the App Store genres and the Google Play categories do not use the same labels. In fact, Google Play has 33 categories, while the App Store only has 23 genres. This means that apps found in a particular App Store genre might end up being split into multiple Google Play categories.

In [40]:
google_cat_ft = freq_table(google_final,1)

for category in google_cat_ft:
    total = 0
    len_category = 0
    for app in google_final:
        category_app = app[1]
        if category_app == category:
            installs = float(app[5].replace('+','').replace(',',''))
            #removed '+' and ',' characters for evaluation
            total += installs
            len_category += 1
    avg_installs = total/len_category
    print(category,':',avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The top five most installed apps on Google Play by Category are as follows:
* Communication
* Video Players
* Social
* Photography
* Productivity

Recalling the most common Android apps, each of these categories except Video Players (about 1.8%) comprises between 2% and 4% of free English Google Play apps. Thus, none of these markets appears oversaturated or overly sparse. This makes **Communication**, **Social**, **Photography**, and **Productivity** apps promising choices for potential sources of revenue on Google Play.

However, due to the choice and number of categories on Google Play, I suspect there might be some overlap. Let's look closer at the Communication and Social categories, alongside the App Store's Social Networking genre:

In [34]:
print('Communication (Google Play) ---')
rows = 10
for apps in google_final:
    if apps[1] == 'COMMUNICATION' and rows > 0:
        print(apps[0],':',apps[5])
        rows -= 1
        
print('\n')
print('Social (Google Play) ---')
rows = 10
for apps in google_final:
    if apps[1] == 'SOCIAL' and rows > 0:
        print(apps[0],':',apps[5])
        rows -= 1
        
print('\n')
print('Social Networking (App Store) ---')
rows = 10
for apps in apple_final:
    if apps[11] == 'Social Networking' and rows > 0:
        print(apps[1],':',apps[5])
        rows -= 1

Communication (Google Play) ---
WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+


Social (Google Play) ---
Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Social network all in one 2018 : 100,000+
Pinterest : 100,000,000+
TextNow - free text + calls : 10,000,000+
Google+ : 1,000,000,000+
The Messenger App : 1,000,000+
Messenger Pro : 1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+


Social Networking (App Store) ---
Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger –

According to Google Play, apps such as WhatsApp and Messenger are considered Communication apps, while the Facebook and Pinterest apps are considered Social apps. However, all four of these apps are placed in the Social Networking genre in the App Store.

Combined, the Communication and Social categories make up only about 6% of apps in the Google Play dataset, yet they top the list of popular apps with a combined average of roughly 32 million installs per app. This means that an app comprising both **Social** & **Communication** aspects is not only extremely popular, but exists in a market that is neither oversaturated nor sparse.
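
The combined figures can be reproduced from the tables above by recovering each category's app count from its percentage share and then taking a weighted average. This is a side calculation under the assumption that the shares and averages printed earlier are exact; the helper name `app_count` is introduced only for this sketch.

```python
total_apps = 8864  # free English Google Play apps, from the earlier cell

# (share of apps in %, average installs), copied from the tables above
communication = (3.2378158844765346, 38456119.167247385)
social = (2.6624548736462095, 23253652.127118643)

def app_count(share_pct):
    """Recover the number of apps in a category from its percentage share."""
    return round(share_pct / 100 * total_apps)

n_comm, n_social = app_count(communication[0]), app_count(social[0])
combined_share = communication[0] + social[0]

# Weight each category's average by its number of apps.
combined_avg = (n_comm * communication[1] + n_social * social[1]) / (n_comm + n_social)

print(n_comm, n_social)          # 287 236
print(round(combined_share, 1))  # 5.9
print(round(combined_avg))       # roughly 31.6 million
```

A plain average of the two category averages would overweight Social, which has fewer apps, so the weighted form is the one that matches pooling the two categories.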

## Conclusion

Overall, the aim of this project was to determine the types of apps that would be successful in both the free English iOS and Android markets. While a number of app profiles were found to be potentially successful in each individual market, there was a clear crossover in the potential for success of a free English **Social Networking** app.

Neither the App Store's Social Networking genre, nor Google Play's Communication or Social categories seem to be oversaturated with the number of apps available. Nor were they sparse, leaving plenty of room for potential competition to corner a piece of each market's immense popularity. 

Therefore, I believe a new **Social Networking** app would have the greatest amount of potential for revenue.