# Profitable App Profiles for the App Store and Google Play Markets

Welcome!

This is my first Dataquest Guided Project in my Data Scientist journey.

The goal of this project is to analyze data sets from the Google Play Store and iOS App Store in order to gain valuable insights for future app developments. 

We are taking the role of analysts in an app development company that wants to design apps that succeed in both markets. First, though, we need to understand what the two markets have in common, how we could exploit those commonalities, and what we should avoid.

The original data sets of the Google Play Store and iOS App Store consist of over 10k and 7k apps, respectively. This is the analysis.

In [1]:
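from csv import reader

# Google Play data set (the file contains non-ASCII app names, so read it as UTF-8)
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]  # first row holds the column names
android = android[1:]        # remaining rows hold the apps

# App Store data set
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]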

Above, we opened the original files and read them into lists of rows; we also separated the header row from the data rows to make the analysis simpler.
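A slightly more defensive variant of the loading step is sketched below. It wraps the logic in a hypothetical helper, `load_dataset`, which is not part of the original notebook, and it assumes the CSV files are UTF-8 encoded. The demo reads from an in-memory string so that it runs without the data files.

```python
from csv import reader
import io

def load_dataset(file_obj):
    """Return (header, rows) from an open CSV file object (hypothetical helper)."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# With the real files this could look like (UTF-8 encoding is an assumption):
# with open('googleplaystore.csv', encoding='utf8') as f:
#     android_header, android = load_dataset(f)

# Self-contained demo on an in-memory CSV
sample = io.StringIO("App,Category\nInstagram,SOCIAL\n")
sample_header, sample_rows = load_dataset(sample)
print(sample_header)
print(sample_rows)
```

Using a `with` block for the real files closes each file as soon as its rows have been read.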

Now, to get a sense of what we are dealing with, we'll create a function that takes four inputs and prints relevant information about a specific data set.

Below:

- We iterate for every row in the specified data slice that the user selects, and:
  - print each row
  - print a space in between them
- If the user specifies it, we:
  - print the number of rows in the data set
  - print the number of columns in the data set
- Print the header

We do it for both data sets.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


For a detailed description of each column in the App Store data set, see: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home

# Data cleaning

**Missing entries**

According to the Kaggle forums (Kaggle is where we obtained the data sets), entry 10472 is missing its category cell. It's best to delete that row to avoid errors later. Note that the delete command must be run only once: each additional run removes another, valid entry.

In [4]:
print(len(android))
del(android[10472])
print(len(android))

10841
10840
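One way to make this step safe to re-run is to filter by row length instead of deleting by position. The sketch below assumes the faulty row has fewer columns than the header, which is what a missing cell would produce; `drop_malformed_rows` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def drop_malformed_rows(dataset, header):
    """Keep only the rows with as many columns as the header."""
    return [row for row in dataset if len(row) == len(header)]

sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # category cell missing, so the row is short
    ['App C', 'GAME', '4.5'],
]

cleaned = drop_malformed_rows(sample_rows, sample_header)
cleaned_again = drop_malformed_rows(cleaned, sample_header)  # second run changes nothing
print(len(sample_rows), len(cleaned), len(cleaned_again))
```

Because the filter depends on the row's shape rather than its index, running it twice does not remove any additional entries.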


**Removing duplicates**

We already removed the wrong entry. Now we need to make sure that there are no duplicate entries, and, if there are, remove them. The duplicates should probably be removed based on the number of reviews of each entry -- i.e., the entry with more reviews will be kept. This will ensure that the most relevant entry will be analyzed.

**We know that the App Store data set holds no duplicate entries, which makes things easier.**

However, as we can see below, there are duplicate entries in the `android` data set.

In [5]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Below:

- We create two empty lists: one that will hold the names of duplicate apps and one that will hold the names of unique apps
- We iterate over every app in the `android` data set, and:
  - read the app's name from the first column (index 0)
  - if the name already appears in the `unique_apps` list (originally empty), we append it to the `duplicate_apps` list
  - otherwise (i.e., the name appears for the first time), we append it to the `unique_apps` list
  
Then we explore:

- How many apps are repeated, and
- Some examples of them

In [6]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
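The same count can be sketched with `collections.Counter` from the standard library, which avoids the repeated list lookups. The helper name `count_duplicates` and the sample names are illustrative only; as in the cell above, every occurrence after the first is counted as a duplicate.

```python
from collections import Counter

def count_duplicates(names):
    """Count every occurrence of a name beyond its first one."""
    counts = Counter(names)
    return sum(n - 1 for n in counts.values())

sample_names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Slack']
print(count_duplicates(sample_names))
```

With the real data, the input would be the list of names taken from index 0 of each row in `android`.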


**To remove the duplicates, we will:**

1. Create a dictionary that maps each app name to the highest number of reviews found among its entries, so that we keep the most relevant entry of each duplicate.
2. Use the dictionary as a reference to create the corrected version of the Google Play Store data set.

Below:

- We create the empty dictionary
- We iterate over every app in the `android` data set, and:
  - read the app's name from the first column (index 0)
  - convert the number of reviews (index 3) to a float and assign it to a new variable called `n_reviews`
  - if the name already appears in the `reviews_max` dictionary (originally empty) and `n_reviews` is greater than the value stored for that name, then:
    - we update the number of reviews for that entry
  - if the name does not appear in the dictionary, then:
    - we create a new key-value pair
- We then compare the expected length against the actual length to make sure they match

In [7]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [8]:
print('Expected length:', len(android) - len(duplicate_apps))
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


**Now we need to use the `reviews_max` dictionary to create the new, clean, `android` data set.**

In the cells below, we:

- Create two empty lists: the first will hold the cleaned rows and the second will keep track of the app names already added
- Loop through the apps in the `android` list, and:
  - read the name of the app (index 0)
  - read the number of reviews as a float (index 3)
  - **if** `n_reviews` is equal to the number of reviews stored for `name` in the `reviews_max` dictionary **and** `name` is not already in the `already_added` list, then:
    - we append the app (the entire row) to the `android_clean` list, and
    - we append the name to the `already_added` list. This guards against duplicate entries that have an identical number of reviews: if 'App X' appears twice with the same review count, the second occurrence finds the name in `already_added` and is skipped. In short, it's a failsafe filter.

In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [10]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
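For comparison, the two passes above can be folded into a single pass that stores the best row per name. This is only a sketch under the same column assumptions (name at index 0, reviews at index 3); `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened for illustration, and the output order follows each name's first appearance, which may differ from the order produced by the cells above.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Return one row per app name: the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts are tied
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

A dictionary lookup also scales better than searching the `already_added` list, although at this data size the difference is small.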


By inspecting the output above, we see that the numbers match: `android_clean` has 9,659 rows, the same as the length of `reviews_max` and the same as `len(android) - len(duplicate_apps)`.

We're interested in an English-speaking audience, which means we shouldn't include non-English apps in our final data set.

First, we must create a function that evaluates whether the characters of a string belong to the character set commonly used in English text (i.e., whether every character's code point falls within the ASCII range, 0 to 127).

This is our first approach:

- We define the function `is_english`, which expects a string input
- We iterate over every character in the string
  - if the character's code point (given by `ord`) is greater than 127, then
    - we return `False`, meaning the string is NOT English
- If the loop finishes without finding such a character, we return `True`

In [11]:
def is_english(string):
    
    for character in string:
        
        if ord(character) > 127:
            return False
        
    return True

In [12]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print('\n')

print(ord('😜'))
print(ord('™'))

True
False
False
False


128540
8482


The function works as written, but it is too strict: emojis and symbols such as ™ fall outside the ASCII range (0 to 127), so some English app names are rejected. We should fix this to avoid losing useful data incorrectly identified as non-English.

To do this, we'll use a more nuanced criterion: we'll keep apps with up to three characters above the ASCII range. This should minimize data loss. Although this solution is not perfect, it's sufficient for the purposes of this project.

Below:

- We define the `is_english` function once again, this time with a counter
- We initialize a variable named `non_ascii` to 0
- We iterate over every character in the string
  - if the character's code point is greater than 127, then
    - we increase `non_ascii` by 1
- After the loop has gone through the entire string:
  - if `non_ascii` is greater than 3, the function returns `False`
  - otherwise, the function returns `True`

In [13]:
def is_english(string):
    non_ascii = 0

    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
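The counting logic can also be written more compactly with a generator expression. The sketch below uses a different name, `is_probably_english`, and a `max_non_ascii` parameter; both are illustrative choices rather than part of the original notebook, and the default of 3 mirrors the threshold used above.

```python
def is_probably_english(string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

Exposing the threshold as a parameter makes it easy to check how sensitive the filtered data set is to that choice.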


We can observe that the function is more nuanced and precise than before.

Now that we have a reasonable function for English evaluation, we can then use it to isolate the English apps in new data sets:

Below:

- We create two lists that will hold English-speaking apps: one for the Google Play Store and one for the App Store
- We iterate over every app in each data set
  - We create a temporary variable that holds the name of the app (index 0 and 1, respectively)
  - If `is_english` returns `True` for that name, then
    - the app is appended to the corresponding list
- We then explore each data set

In [14]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

# Isolating the Free Apps

Since we're only interested in free apps (our source of in-app ads revenue), we need to isolate them, i.e., exclude the paid apps. Then we'll check the final number of apps in our data sets.

Below:

- We create the two final lists: one for the Google Play Store and one for the App Store
- We iterate over every app in each (English-only) data set
- For each app, we read the price (index 7 in the `android` data set, index 4 in the `ios` data set)
  - If the price is `'0'` (Google Play) or `'0.0'` (App Store), then
    - the app is appended to the corresponding final data set

- We print both lengths

In [15]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print(len(android_final))
print(len(ios_final))

8864
3222
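The two string comparisons above depend on each store's exact formatting of a zero price. A numeric comparison is one alternative, sketched below. The helper `is_free` is hypothetical, and stripping a leading `'$'` is an assumption about how paid prices may be written; it is not something shown in the outputs above.

```python
def is_free(price):
    """Return True when a price string represents zero."""
    # Removing '$' is an assumption about how paid prices may be formatted
    cleaned = price.replace('$', '').strip()
    return float(cleaned) == 0.0

print(is_free('0'))      # Google Play style
print(is_free('0.0'))    # App Store style
print(is_free('$4.99'))  # hypothetical paid price
```

With a helper like this, the same condition could be used in both loops, with only the column index changing.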


# Analysis

As stated in the beginning, the whole point of this project is to gain insights about the most successful apps in the Google Play Store and App Store so that we can use this information for our own app development.

Since we want to be in both the Google Play Store and App Store, it's important to find a common app profile. For that reason, we should look for the most common types of apps in our data sets. In the App Store data set, the `prime_genre` column states the genre of each app. In the `android` data set, the `Category` and `Genres` columns offer this type of information.

To analyze these columns, we'll build frequency tables that show the proportions as percentages. We'll create two functions: one that calculates a frequency table as percentages, and another that uses the first function and displays the results in descending order, as a sorted list of tuples (since a dictionary cannot be sorted in place by its values).

Below:

**First function**

- We define a `freq_table` function that takes two inputs (dataset and index), by:
  - Creating an empty dictionary called `table` 
  - Creating a variable called `total` of value zero
  - Then, we iterate over each row in the data set, and for each iteration, we:
    - increase the `total` variable by 1
    - read the value at the given index position and assign it to a variable called `value`
    - if `value` **is** already a key in the `table` dictionary, then:
      - increase the count stored under that key by 1
    - else:
      - add `value` as a new key with a count of 1
  - Now we need to express the frequencies as percentages. So we:
    - create a new empty dictionary called `table_percentages`, and:
    - iterate for each key in the `table` dictionary
      - the result of `(table[key] / total) * 100` is assigned to a variable called `percentage`
      - finally, the `percentage` variable is assigned to each key in the `table_percentages` dictionary
    - outside the for loop, we return the completed `table_percentages` dictionary
    
**Second function**

- We define a `display_table` function that takes two inputs (dataset and index), by:
  - generating the frequency table with the previous function and assigning the resulting dictionary to a variable called `table`
  - creating an empty list called `table_display`
  - iterating over each key in the `table` dictionary, and:
    - building a tuple in this order: `(table[key], key)`, where the first value is the percentage and the second is the category it belongs to
    - appending each tuple to the `table_display` list
  - sorting the completed list:
    - we pass `table_display` to the built-in `sorted` function with `reverse=True` (since we want descending order) and assign the result to `table_sorted`
  - iterating over every entry in the `table_sorted` list, and:
    - for each entry, calling `print(entry[1], ':', entry[0])`
    
Finally, we use these functions on the data sets and analyze the results.

In [16]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1

    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage

    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
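As a point of comparison, the same idea can be sketched with `collections.Counter` and a `key` function for sorting, which removes the need to swap keys and values into tuples. The names `freq_table_pct` and `sorted_table` are hypothetical, and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Map each value found at `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}

def sorted_table(dataset, index):
    """Return (value, percentage) pairs, largest percentage first."""
    table = freq_table_pct(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for key, pct in sorted_table(sample, 1):
    print(key, ':', pct)
```

One difference in behavior: when two categories have the same percentage, this version keeps them in order of first appearance, whereas sorting `(percentage, key)` tuples in reverse orders ties by key, descending.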

In [17]:
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As we can see, the free English apps in the App Store data set are dominated by the Games genre (58%). Apps with a practical purpose are far less common. However, supply is not necessarily the same as demand.

In [18]:
display_table(android_final, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [19]:
display_table(android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

With regards to the `android` data set:

1. We observe that the `Genres` column is more granular than the `Category` column. We'll use the `Category` column since we are looking for the bigger picture.
2. The Family category, which accounts for 18.9% of the apps, actually consists mostly of games for kids.
3. Practical apps seem to be better represented here, and the distribution looks more balanced overall.

These are the apps on offer, and apps on offer are not necessarily in demand. So far we have only seen which types of apps are most common in each store. To see which are most popular, we need to investigate further. The `android` data set has this information (approximately) in its `Installs` column, but the `ios` data set does not. The closest proxy there is the number of user ratings (`rating_count_tot`), so we can calculate the average number of ratings per app genre. That should give us a reasonable idea of each genre's popularity and make the two data sets roughly comparable.

Below, we:

- Start by creating a frequency table called `genre_ios` (a dictionary) for the genre column of the `ios_final` data set
- Then iterate over each genre in the `genre_ios` frequency table, and:
  - initialize a `total` variable to 0
  - initialize a `len_genre` variable to 0
  - run a nested loop over each `app` in the `ios_final` data set, and:
    - assign the value at index -5 to a variable called `genre_app`
    - if `genre_app == genre`, i.e., the app's genre matches the genre from the frequency table, then:
      - convert the number of ratings (index 5) to a float and assign it to `n_ratings`
      - add `n_ratings` to `total`
      - increase `len_genre` by 1
  - once the nested loop finishes (but still inside the outer loop), with `total` and `len_genre` complete:
    - calculate the average number of ratings for the genre: `avg_n_ratings = total / len_genre`
    - `print(genre, ':', avg_n_ratings)`

In [20]:
genre_ios = freq_table(ios_final, -5)

for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

News : 21248.023255813954
Social Networking : 71548.34905660378
Education : 7003.983050847458
Navigation : 86090.33333333333
Sports : 23008.898550724636
Catalogs : 4004.0
Shopping : 26919.690476190477
Music : 57326.530303030304
Lifestyle : 16485.764705882353
Travel : 28243.8
Business : 7491.117647058823
Weather : 52279.892857142855
Entertainment : 14029.830708661417
Reference : 74942.11111111111
Book : 39758.5
Food & Drink : 33333.92307692308
Health & Fitness : 23298.015384615384
Medical : 612.0
Utilities : 18684.456790123455
Photo & Video : 28441.54375
Finance : 31467.944444444445
Productivity : 21028.410714285714
Games : 22788.6696905016
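The nested loop above scans the whole data set once per genre. A single-pass alternative is sketched below with `collections.defaultdict`; the helper `average_by_group` is hypothetical and the sample rows use made-up numbers.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Average the numeric column `value_index` per value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [
    ['Waze', 'Navigation', '300'],
    ['Maps', 'Navigation', '100'],
    ['Bible', 'Reference', '50'],
]
averages = average_by_group(sample, 1, 2)
print(averages)
```

With the real data, a call such as `average_by_group(ios_final, -5, 5)` would be expected to compute the same per-genre averages as the cell above.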


In [21]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [22]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


By analyzing the `ios` data set, we observe:
 
1. Navigation apps have the highest average, but the figure is driven by Waze and Google Maps, which together account for close to half a million ratings. The average therefore gives a misleading picture of the genre's popularity, and Navigation should probably be left out of the analysis.
2. The genre with the next highest average, Reference, shows some potential for two reasons:
   - Although the Bible app and the dictionaries account for most of the ratings, there is room for creative development
   - Since the App Store is saturated with games, an app in this genre has a better chance of standing out

**Now we can start the analysis of the `android` data set.**

As mentioned above, the `android` data set has information about the number of installs per app, which indicates popularity. However, as the cell below shows, the values are not precise: they are open-ended buckets, so we don't know whether 10,000+ means 10,001 or 49,999.

Although this is not great, we'll leave the numbers as they are. For example, we'll assume that 10,000+ is 10,000. But first we need to make some adjustments.

In [23]:
display_table(android_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The '+' and ',' characters prevent us from converting the values to numbers. We need to remove them (technically, replace them with empty strings).

Below, we:

- Create a frequency table called `categories_android` for the `Category` column in the `android_final` data set
- Iterate over every `category` in the `categories_android` dictionary, and:
  - initialize a variable called `total` to 0
  - initialize a variable called `len_category` to 0
  - run a nested loop over every `app` in `android_final`, and:
    - assign the app's category (index 1) to `category_app`
    - if `category_app` matches the `category` from the `categories_android` dictionary, then:
      - assign the number of installs (index 5) to `n_installs`
      - remove the ',' and '+' characters from `n_installs`
      - add `n_installs`, converted to a float, to `total`
      - increase `len_category` by 1
  - once the nested loop finishes (but still inside the outer loop), for every category:
    - calculate `avg_n_installs = total / len_category`
    - and `print(category, ':', avg_n_installs)`

We'll use this replacement technique later on.
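Since the replacement is needed more than once, it could be wrapped in a small helper. The sketch below uses a hypothetical function name, `parse_installs`, and treats each bucket as its lower bound, following the assumption stated earlier that 10,000+ counts as 10,000.

```python
def parse_installs(text):
    """Convert an installs string such as '10,000+' to a float (lower bound)."""
    return float(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))
print(parse_installs('1,000,000,000+'))
print(parse_installs('0'))
```

`str.replace` returns a new string each time, so the two calls can be chained without modifying the original row.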

In [24]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

GAME : 15588015.603248259
SHOPPING : 7036877.311557789
MEDICAL : 120550.61980830671
COMMUNICATION : 38456119.167247385
BUSINESS : 1712290.1474201474
LIBRARIES_AND_DEMO : 638503.734939759
EVENTS : 253542.22222222222
LIFESTYLE : 1437816.2687861272
FAMILY : 3695641.8198090694
TRAVEL_AND_LOCAL : 13984077.710144928
MAPS_AND_NAVIGATION : 4056941.7741935486
ART_AND_DESIGN : 1986335.0877192982
EDUCATION : 1833495.145631068
HOUSE_AND_HOME : 1331540.5616438356
BOOKS_AND_REFERENCE : 8767811.894736841
PERSONALIZATION : 5201482.6122448975
VIDEO_PLAYERS : 24727872.452830188
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
HEALTH_AND_FITNESS : 4188821.9853479853
FOOD_AND_DRINK : 1924897.7363636363
PARENTING : 542603.6206896552
COMICS : 817657.2727272727
SOCIAL : 23253652.127118643
ENTERTAINMENT : 11640705.88235294
WEATHER : 5074486.197183099
PHOTOGRAPHY : 17840110.40229885
NEWS_AND_MAGAZINES : 9549178.467741935
TOOLS : 10801391.298666667
SPORTS : 3638640.1428571427
FINANCE : 1387692.

In [25]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In [26]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
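
The install-string cleanup is repeated in cells 24 and 26. As a minimal sketch, the same logic could be wrapped in reusable functions. The helper names `parse_installs` and `average_installs`, the `cap` parameter, and the sample rows are hypothetical and not part of the original notebook; the column positions (category at index 1, installs at index 5) follow the cells above.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))


def average_installs(dataset, category, cap=None, category_idx=1, installs_idx=5):
    """Average installs for one category, optionally skipping apps at or above `cap`."""
    values = []
    for app in dataset:
        if app[category_idx] != category:
            continue
        n_installs = parse_installs(app[installs_idx])
        if cap is None or n_installs < cap:
            values.append(n_installs)
    # Guard against an empty selection to avoid a ZeroDivisionError
    return sum(values) / len(values) if values else 0.0


# Hypothetical rows laid out like `android_final`
sample = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App C', 'COMMUNICATION', '', '', '', '500,000+'],
    ['App D', 'TOOLS', '', '', '', '10,000+'],
]

print(average_installs(sample, 'COMMUNICATION'))                   # all apps
print(average_installs(sample, 'COMMUNICATION', cap=100_000_000))  # giants excluded
```

With the real data, `average_installs(android_final, 'COMMUNICATION', cap=100_000_000)` would be expected to reproduce the figure from cell 26, assuming `android_final` has the layout used throughout this notebook.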

In [27]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

In [28]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


In [29]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

By analyzing the `android` data set, we observe:

1. The 'COMMUNICATION' category is saturated. Apps like WhatsApp hold a huge share of users, so it is not a great idea to consider that niche as a possibility. In fact, after removing the most popular apps in that category (those with 100 million or more installs), the average number of installs drops roughly tenfold, from about 38.5 million to about 3.6 million.
2. The 'BOOKS_AND_REFERENCE' category, which is equivalent to the 'reference' category in the `ios` data set, also shows some promise:
   - leaving aside the most popular apps (the Bible and Google Play Books, for instance) and looking at apps with mid-range popularity (in this case, between 1 million and 50 million installs), we find some libraries and reading platforms
   - however, we can also see that there are some Quran apps (as well as the Bible). This could mean that people would be interested in an app dedicated to a specific book, one that might be considered worthy of deep analysis
   - the app could include quotes, quizzes, interesting facts, forums, etc.
   - this app could be a prototype for future similar developments
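
The mid-range selection in cell 29 compares install strings one by one. As a rough sketch of an alternative, the same selection could be written as a numeric range check. The function name `apps_in_install_range` and the sample rows are hypothetical; the bounds of 1 million and 50 million mirror the strings matched in cell 29.

```python
def parse_installs(raw):
    """Convert an install string such as '5,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))


def apps_in_install_range(dataset, category, low, high,
                          name_idx=0, category_idx=1, installs_idx=5):
    """Return (name, installs) pairs for apps in `category` with low <= installs <= high."""
    matches = []
    for app in dataset:
        if app[category_idx] != category:
            continue
        if low <= parse_installs(app[installs_idx]) <= high:
            matches.append((app[name_idx], app[installs_idx]))
    return matches


# Hypothetical rows laid out like `android_final`
sample = [
    ['Reader X', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Store Y', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Dictionary Z', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
    ['Chat W', 'COMMUNICATION', '', '', '', '5,000,000+'],
]

for name, installs in apps_in_install_range(sample, 'BOOKS_AND_REFERENCE',
                                            1_000_000, 50_000_000):
    print(name, ':', installs)
```

A numeric check avoids listing every install bucket by hand, at the cost of parsing each string on every call.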

# Conclusions

In this project, we:

 - cleaned data sets: removed errors and isolated data by interest/category
 - analyzed the most common and most popular app categories
 - investigated further to observe specific market niches
 - proposed a possible app profile that could be profitable for both markets