# Profitable Apps Profile on Apple App Store and Google Play Markets
The aim of this project is to find a mobile app profile that will be profitable in both the Apple App Store and the Google Play markets. In this project we will be analyzing a set of applications available in both markets in order to accomplish this objective. 

For this project, we take the role of a data analyst whose job is to help developers make data-driven decisions about the new app to be built.

Our company only builds apps that are free to download and install and are aimed at an English-speaking audience, with in-app ads as the main source of revenue. This means that the revenue of a new app depends mostly on the number of people who use it. The goal of this project is therefore to analyze data from both markets to help our developers understand which app profile is most likely to attract users.

## Opening and Exploring the Data
As of the first quarter of 2020, Android users were able to choose between 2.56 million apps, making Google Play the app store with biggest number of available apps. Apple's App Store was the second-largest app store with almost 1.85 million available apps for iOS. - [Taken from statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)


![](app_stores_stats-2020.png)

Collecting data for over 4 million apps is time consuming and costly. For this project we will be using sample data sets that are available for free from the following links:
- [Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps) data set which contains over 10,000 apps and can be downloaded from [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- [Apple App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) data set which contains over 7,000 apps and can be downloaded from [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv). 


### Open data sets
In the code below, we will open both data sets.

In [1]:
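from csv import reader

### Google Play Data ###
# Open with an explicit encoding and close the file once it has been read.
with open("googleplaystore.csv", encoding="utf8") as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### iOS App Store Data ###
with open("AppleStore.csv", encoding="utf8") as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]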

### Explore data sets
We will now define a function, `explore_data()`, that prints a slice of a data set in a readable way and that we can reuse throughout the project.

In [2]:
def explore_data(data_set, start=0, end=5, rows_and_columns=False):
    """
        Print a slice of the data set in order to explore the data
        in a readable way.
        Inputs:
            - data_set (list): List of lists containing the data set.
            - start (int): Index position for starting the slice.
            - end (int): Index position for ending the slice.
            - rows_and_columns (bool): True for printing the number of rows and cols.
    """

    dataset_slice = data_set[start:end]
    for row in dataset_slice:
        print("\n")
        print(row)

    if rows_and_columns:
        print("\n")
        print("Number of rows: {}".format(len(data_set)))
        print("Number of columns: {}".format(len(data_set[0])))

print("Google Play Data:\n")
print(android_header)
explore_data(android, rows_and_columns=True)

Google Play Data:

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+'

We can see that the Google Play data set contains a total of 10,841 rows, one per app entry. A full description of each column is available [here](https://www.kaggle.com/lava18/google-play-store-apps).

| Column           | Description                                                                                                                                   |
|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| 'App'            | Application name                                                                                                                              |
| 'Category'       | Category the app belongs to                                                                                                                   |
| 'Rating'         | Overall user rating of the app (as when scraped)                                                                                              |
| 'Reviews'        | Number of user reviews for the app (as when scraped)                                                                                          |
| 'Size'           | Size of the app (as when scraped)                                                                                                             |
| 'Installs'       | Number of user downloads/installs for the app (as when scraped)                                                                               |
| 'Type'           | Paid or Free                                                                                                                                  |
| 'Price'          | Price of the app (as when scraped)                                                                                                            |
| 'Content Rating' | Age group the app is targeted at - Children / Mature 21+ / Adult                                                                              |
| 'Genres'         | An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres. |
| 'Last Updated'   | Date of last update                                                                                                                           |
| 'Current Ver'    | Current App Version                                                                                                                           |
| 'Android Ver'    | Supported Android Version                                                                                                                     |

From the column descriptions above, the following columns will help us achieve the goal of this project:
- App
- Category
- Reviews
- Size
- Type
- Price
- Genres


We will now explore the Apple App Store data set.


In [3]:
print("Apple App Store Data:\n")
print(ios_header)
explore_data(ios, rows_and_columns=True)

Apple App Store Data:

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', 

We have 7197 different apps in the Apple App Store data set, and we can get full descriptions of its columns [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

Columns:

| Column name        | Description                                     |
|--------------------|-------------------------------------------------|
| "id"               | App ID                                          |
| "track_name"       | App Name                                        |
| "size_bytes"       | Size (in bytes)                                 |
| "currency"         | Currency Type                                   |
| "price"            | Price amount                                    |
| "rating_count_tot" | User rating count (for all versions)            |
| "rating_count_ver" | User rating count (for current version)         |
| "user_rating"      | Average User Rating value (for all versions)    |
| "user_rating_ver"  | Average User Rating value (for current version) |
| "ver"              | Latest version code                             |
| "cont_rating"      | Content Rating                                  |
| "prime_genre"      | Primary Genre                                   |
| "sup_devices.num"  | Number of supporting devices                    |
| "ipadSc_urls.num"  | Number of screenshots shown for display         |
| "lang.num"         | Number of supported languages                   |
| "vpp_lic"          | Vpp Device Based Licensing Enabled              |


Columns of interest:
- track_name
- size_bytes
- currency
- price
- rating_count_tot
- rating_count_ver
- prime_genre

## Cleaning Up Data
As with every data set, we need to make sure our data is clean (duplicates and wrong data removed) before we start analysing it and drawing conclusions. When using freely provided data sets such as ours from Kaggle, we can check their discussion sections for comments that point to wrong data or inconsistencies. We can also check the data ourselves for missing or inconsistent values.

### Missing Data
To make sure that every row has as many values as there are columns in the header, we will use the following function:



In [4]:
def check_row_length(dataset, header):
    """
        Compares each row's length against the header length in order to find
        data inconsistencies. Prints the row and index number when found.
        Inputs:
            - dataset (list): List of lists containing the data set.
            - header (list): List of columns from the original data set.
    """
    header_length = len(header)
    print(header)
    # enumerate() gives the true position even when identical rows appear more than once.
    for index, row in enumerate(dataset):
        if len(row) != header_length:
            print("\n")
            print(row)
            print("Index: {}".format(index))

We will now check the Google Play Store data set:

In [5]:
check_row_length(android, android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Index: 10472


After running our check, we can see that the Google Play Store data set contains one row, at index 10472, with a missing value. Comparing the row against the header shows that the `Category` is missing.

We will now run the same function for the iOS App Store data set.

In [6]:
check_row_length(ios, ios_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


No data inconsistencies were found for the iOS App Store.

### Fix the Missing Data
There are two ways to fix the missing value found in the Google Play Store data set:

- Delete the row.
- Look up the app and fill in the proper Category.

Searching the Google Play Store for the app `Life Made WI-Fi Touchscreen Photo Frame` shows that it belongs to the `Lifestyle` category.

In the code below we will insert the missing category into row `10472`.



In [7]:
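# Guarded so that re-running the cell does not insert the category twice.
if len(android[10472]) < len(android_header):
    android[10472].insert(1, 'LIFESTYLE')
print(android_header)
print(android[10472])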

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', 'LIFESTYLE', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We can run our `check_row_length()` function over the Google Play Store data set again to ensure everything is now properly set.

In [8]:
check_row_length(android, android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


### Finding Duplicates
We will now create a function that will check our data sets for duplicates, and then we will proceed to remove them.


In [9]:
def find_duplicates(dataset, name_index):
    """
        Checks for duplicates in the dataset based on the index provided. Prints the list of duplicates.

        Inputs:
            - dataset (list): list of lists which makes up the data set.
            - name_index (int): index position to compare/look for duplicates.
    """
    unique_apps = []
    duplicate_apps = []

    for app in dataset:
        name = app[name_index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)

    print("Number of Duplicate Apps: {}".format(len(duplicate_apps)))
    print("Example of Duplicate Apps:\n{}".format(duplicate_apps[:15]))

# check against Google Play Store Data set.
find_duplicates(android, 0)

Number of Duplicate Apps: 1181
Example of Duplicate Apps:
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [10]:
find_duplicates(ios, 1)

Number of Duplicate Apps: 0
Example of Duplicate Apps:
[]


After running our `find_duplicates()` function, we can see that the Google Play Store data set contains 1,181 duplicate entries, while the iOS App Store data set contains no duplicated `id` values. We will now write code that removes the duplicates.

Before we proceed, let's check the entries for one of these apps.

In [11]:
print(android_header)
for index, value in enumerate(android):
    if value[0] == 'Slack':
        print("{} - {}".format(index, android[index]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
240 - ['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
269 - ['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
294 - ['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


Based on the check we did with the `Slack` app, we can see that the entries differ in their number of reviews. Because of this, instead of deleting duplicates at random, we will keep the entry with the highest number of reviews, which is most likely the most recent one, and delete the rest.

We will start by creating a dictionary that maps each app name to its maximum number of reviews. For each row, the code stores the review count if the name is not yet in the dictionary, or updates it if the count is greater than the one already stored.

In [12]:
max_reviews = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if name in max_reviews and max_reviews[name] < n_reviews:
        max_reviews[name] = n_reviews
    elif name not in max_reviews:
        max_reviews[name] = n_reviews
        

When checking for duplicates, we found a total of 1,181 duplicate entries. If we subtract this number from the total number of rows (`10,841 - 1,181`) we get `9,660` unique apps in the Google Play Store data set. This means that the length of our new dictionary should equal `9,660`, which we check in the code cell below.

In [13]:
# Check that length of new dictionary matches 9660
print(len(max_reviews))

9660
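
The same bookkeeping can be checked on a small scale. The sketch below uses a made-up four-row list (the `Example App` row and its review count are invented for illustration) laid out like the Google Play data, with the name at index 0 and the reviews at index 3. It is only meant to show why the dictionary length equals the number of rows minus the number of duplicates.

```python
# Toy rows in the same layout as the Google Play data (name at 0, reviews at 3).
# "Example App" and its review count are invented for illustration.
rows = [
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Slack", "BUSINESS", "4.4", "51510"],
    ["Example App", "BUSINESS", "4.0", "10"],
]

max_reviews_toy = {}
for row in rows:
    name, n_reviews = row[0], float(row[3])
    # Store the count for a new name, or update it when a larger count is seen.
    if name not in max_reviews_toy or max_reviews_toy[name] < n_reviews:
        max_reviews_toy[name] = n_reviews

n_duplicates = len(rows) - len(max_reviews_toy)
print(len(rows), n_duplicates, len(max_reviews_toy))  # 4 2 2
print(max_reviews_toy["Slack"])  # 51510.0

# The same identity applied to the counts reported above.
print(10841 - 1181)  # 9660
```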


With this dictionary we can now create a clean data set by removing the duplicates. In the code cell below, we build a new list that keeps, for each app, only the entry whose number of reviews matches the maximum stored in the dictionary. We also track the names already added, so that only one entry is kept when several duplicates share the same maximum.

In [14]:
# Create new data set without duplicates
new_android = []
added_app = set()  # names already kept; a set makes the membership test fast

for app in android:
    name = app[0]
    n_reviews = float(app[3])

    if (max_reviews[name] == n_reviews) and (name not in added_app):
        new_android.append(app)
        added_app.add(name)

We can now explore our new data set and check for duplicates.

In [15]:
find_duplicates(new_android, 0)
explore_data(new_android, 0, 5, True)

Number of Duplicate Apps: 0
Example of Duplicate Apps:
[]


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9660
Number of 

We can now say that our new Google Play Store data set is clean from duplicates.

### Removing Non-English Apps

As stated in the project summary, our company only builds apps for an English-speaking audience, so we need to make sure our data sets contain only apps aimed at English speakers. To do this, we will use the built-in `ord()` function, which returns the Unicode code point of a character. The characters commonly used in English text all fall in the ASCII range, 0 to 127. Rejecting every name that has a character outside this range would, however, also drop English apps whose names include special characters or emoji, so in the code below we treat a name as English as long as it has no more than 3 characters outside the ASCII range.

In [16]:
def is_english(string):
    non_ascii = 0

    for char in string:
        if ord(char) > 127:
            non_ascii += 1

    if non_ascii  > 3:
        return False
    else:
        return True

# Test function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('™'))
print(is_english('😜'))

True
False
True
True
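
To see why the four test strings come out this way, it can help to count the non-ASCII characters directly. The sketch below uses a hypothetical helper, `count_non_ascii()`, which is not part of the project code, and prints the count for each test string along with a few code points returned by `ord()`.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (hypothetical helper)."""
    return sum(1 for char in text if ord(char) > 127)

samples = ["Instagram", "爱奇艺PPS -《欢乐颂2》电视剧热播", "™", "😜"]

for text in samples:
    # A name is kept by is_english() when this count is 3 or less.
    print(repr(text), "->", count_non_ascii(text))

# ord() returns the Unicode code point of a single character.
print(ord("a"), ord("™"), ord("😜"))  # 97 8482 128540
```

The trademark sign and the emoji each count as a single non-ASCII character, which is why they stay under the threshold of 3.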


After testing our function, we can now go ahead and clean our data sets.

In [17]:
android_english = []
ios_english = []

for app in new_android:
    name = app[0]
    if is_english(name):
        android_english.append(app)

for app in ios:
    name = app[2]
    if is_english(name):
        ios_english.append(app)

# Explore our data
explore_data(android_english, rows_and_columns=True)
explore_data(ios_english, rows_and_columns=True)



['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9615
Number of columns: 13


['1', '281656475', 'PAC-MAN Premium', '10078

We now have two data sets that are free of duplicates, data inconsistencies and non-English apps.

### Removing Non-Free Apps
As mentioned in the project introduction, our company only creates free apps and our revenue is based on ads. We will now clean our data sets and keep only app profiles of Free apps. In the code below, we will check for the price and keep only the apps that have a price of `0`.

In [18]:
android_clean = []
ios_clean = []

# Isolate free apps.

for app in android_english:
    price = app[7]
    if price == '0' or price == '0.0':
        android_clean.append(app)

for app in ios_english:
    price = app[5]
    if price == '0' or price == '0.0':
        ios_clean.append(app)

explore_data(android_clean, rows_and_columns=True)
explore_data(ios_clean, rows_and_columns=True)



['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8865
Number of columns: 13


['2', '281796108', 'Evernote - stay organize

We're left with __8865__ apps in our Google Play Store data set and __3222__ apps in our iOS App Store data set. We can now start the analysis.
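
The price filter above compares strings, so it only recognises the exact spellings `'0'` and `'0.0'`. A slightly more defensive variant is sketched below. The helper name `is_free()` is hypothetical, and the sketch assumes that paid prices may carry a leading `$` sign, which is not shown in the rows printed in this notebook.

```python
def is_free(price):
    """Return True when a price string represents zero (hypothetical helper)."""
    # Assumption: a price may carry a leading currency symbol such as '$'.
    cleaned = price.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Values that are not numbers are treated as not free.
        return False

print(is_free("0"), is_free("0.0"), is_free("0.00"))  # True True True
print(is_free("$4.99"), is_free("3.99"))              # False False
```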

## Data Analysis
As mentioned in the project introduction, we are trying to find an app profile that will be profitable in both Google Play Store and iOS App Store. Our app will be free and our revenue will be based on ads; meaning it should be popular with users.

Validation Strategy:
- Build a minimal Android version of the app, add it to Google Play Store.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We will start our analysis by finding the most common genres in each market. For this, we need to build frequency tables from our data sets. We will define one function that builds a frequency table and another that displays it sorted, so that we can reuse them in the rest of our analysis.


In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0

    for row in dataset:
        total += 1
        value = row[index]

        if value in table:
            table[value] += 1
        else:
            table[value] = 1

    table_percentages = {}

    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage

    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
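
For comparison, the same percentage table can be built with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the approach used in the rest of this notebook. The function name `freq_table_counter()` and the toy rows are made up for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, most common value first (hypothetical helper)."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs sorted by descending count.
    return {key: count / total * 100 for key, count in counts.most_common()}

# Toy rows: app name at index 0, genre at index 1.
toy = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]

for genre, percentage in freq_table_counter(toy, 1).items():
    print(genre, ':', percentage)
# Games : 75.0
# Music : 25.0
```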

### Most Common Apps by Genre
We will start by examining the iOS App Store apps by genre.

In [20]:
display_table(ios_clean, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that among the free English apps in the iOS App Store, games are by far the most common genre at 58.16%, more than half of the available apps. Entertainment and Photo & Video follow with 7.88% and 4.97% respectively. Social Networking makes up only 3.29% of the apps and Productivity 1.74%.

Even though games make up more than half of the free English iOS apps, this doesn't tell us that they have the most users. From the table above we can see that apps made for fun (games, entertainment, photo and video, social networking, music, sports) dominate the App Store over apps with practical purposes (education, utilities, shopping, productivity, lifestyle).

We will now explore the most common genres in the Google Play Store. In this case we will base our analysis on the `Category` column, since the `Genres` column is much more granular.

In [21]:
display_table(android_clean, 1)

FAMILY : 18.905809362662154
GAME : 9.723632261703328
TOOLS : 8.460236886632826
BUSINESS : 4.591088550479413
LIFESTYLE : 3.914269599548787
PRODUCTIVITY : 3.8917089678511
FINANCE : 3.699943598420756
MEDICAL : 3.5307388606880994
SPORTS : 3.395375070501974
PERSONALIZATION : 3.3164128595600673
COMMUNICATION : 3.2374506486181613
HEALTH_AND_FITNESS : 3.0795262267343486
PHOTOGRAPHY : 2.9441624365482233
NEWS_AND_MAGAZINES : 2.7975183305132543
SOCIAL : 2.662154540327129
TRAVEL_AND_LOCAL : 2.33502538071066
SHOPPING : 2.2447828539199097
BOOKS_AND_REFERENCE : 2.143260011280316
DATING : 1.8612521150592216
VIDEO_PLAYERS : 1.793570219966159
MAPS_AND_NAVIGATION : 1.3987591652566271
FOOD_AND_DRINK : 1.2408347433728144
EDUCATION : 1.161872532430908
ENTERTAINMENT : 0.9588268471517203
LIBRARIES_AND_DEMO : 0.9362662154540328
AUTO_AND_VEHICLES : 0.924985899605189
HOUSE_AND_HOME : 0.8234630569655951
WEATHER : 0.8009024252679076
EVENTS : 0.7106598984771574
PARENTING : 0.6542583192329385
ART_AND_DESIGN : 0.6429

The table above paints a different picture for the Google Play Store. Compared to the iOS App Store, a larger share of the apps serve practical purposes. Family takes first place with 18.91% of the store, followed by Game with 9.72%. Moving down the list, we see Tools, Business, Lifestyle, Productivity and Finance, which are practical apps rather than games or other apps made for fun.

From these two analyses, we can say that the iOS App Store is dominated by apps made for fun, while the Google Play Store is more balanced between apps made for fun and apps made for productivity. With this in mind, we now want to find out which kinds of apps have the most users.

One way to find the most popular genres (those with the most users) is to calculate the average number of installs per genre. In the Google Play Store data set we can use the `Installs` column for this. The iOS App Store data set has no such column, so we will use `rating_count_tot` (the total number of ratings) as a proxy. In the code block below, we calculate the average number of ratings per genre in the App Store data set.

### Most Popular Apps by Genre - iOS App Store

In [22]:
genres_ios = freq_table(ios_clean, -5)
sorted_genres_ios = []

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_clean:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    sorted_genres_ios.append((avg_n_ratings, genre))
    
sorted_genres_ios = sorted(sorted_genres_ios, reverse=True)

for entry in sorted_genres_ios:
    print(entry[1], ':', entry[0])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0
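
The cell above scans the whole data set once per genre. The averages can also be accumulated in a single pass, as in the sketch below. The function name `average_by_group()` and the toy rows are made up for illustration, and the numbers in the comments come from the toy data, not from the App Store data set.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return [(average, group), ...] sorted by descending average (hypothetical helper)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = [(totals[group] / counts[group], group) for group in totals]
    return sorted(averages, reverse=True)

# Toy rows: name, genre, number of ratings (invented values).
toy = [
    ["App A", "Games", "100"],
    ["App B", "Games", "300"],
    ["App C", "Music", "50"],
]

for average, genre in average_by_group(toy, 1, 2):
    print(genre, ':', average)
# Games : 200.0
# Music : 50.0
```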


Based on the data above, we can see that Navigation, Reference, Social Networking, Music and Weather are the top 5 genres by average number of ratings in the iOS App Store. We will now explore these genres to see which apps make them so popular.

In [23]:
def display_app_filter(dataset, name_index, filter_index, value_index, filter_str):
    for app in dataset:
        if app[filter_index] == filter_str:
            print(app[name_index], ':', app[value_index])


print('\nNavigation Apps')
display_app_filter(ios_clean, 2, -5, 6, 'Navigation')

print('\nReference Apps')
display_app_filter(ios_clean, 2, -5, 6, 'Reference')

print('\nSocial Networking Apps')
display_app_filter(ios_clean, 2, -5, 6, 'Social Networking')

print('\nMusic Apps')
display_app_filter(ios_clean, 2, -5, 6, 'Music')


Navigation Apps
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911

Reference Apps
Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - T

From the data displayed above, we can see that the __Navigation__, __Social Networking__ and __Music__ genres should be left out of our app profile recommendation, as they are dominated by a handful of apps such as Facebook, Skype, Waze, Google Maps, Pandora and Spotify, while the other apps in these genres struggle to reach a similar level of popularity.
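
The effect of a few dominant apps can be made concrete with the six Navigation rating counts printed above. The sketch below compares the mean with the median and computes the share of ratings held by the two largest apps. It is an illustrative check, not part of the original analysis.

```python
from statistics import mean, median

# Rating counts for the Navigation genre, copied from the output above.
navigation = {
    "Waze - GPS Navigation, Maps & Real-time Traffic": 345046,
    "Geocaching®": 12811,
    "ImmobilienScout24: Real Estate Search in Germany": 187,
    "Railway Route Search": 5,
    "CoPilot GPS – Car Navigation & Offline Maps": 3582,
    "Google Maps - Navigation & Transit": 154911,
}

counts = list(navigation.values())
top_two = sum(sorted(counts, reverse=True)[:2])
top_two_share = round(top_two / sum(counts) * 100, 1)

print("mean:", mean(counts))      # about 86090.33, matching the genre average
print("median:", median(counts))  # 8196.5
print("share of ratings held by the top two apps (%):", top_two_share)  # 96.8
```

The median is roughly a tenth of the mean, which suggests that the high genre average reflects Waze and Google Maps rather than a typical Navigation app.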

Looking closely at the __Reference__ genre, the most popular apps are the Bible, Dictionary.com's Dictionary & Thesaurus, Google Translate and Muslim Pro, which includes the Quran. This niche seems to be a good candidate for an app profile: people don't usually spend much time in Weather apps, and the other genres are already dominated by a few apps.

We can now explore the __Google Play Store__ data set.

### Most Popular Apps by Genre - Google Play Store

We will first explore a few rows from the data set.

In [24]:
print(android_header)
explore_data(android_clean)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Every

Although the Google Play data set __does__ include the number of installs, the values are open-ended buckets (10,000+, 100,000+) rather than exact counts. We only need a rough comparison between categories, so we will treat each bucket as its lower bound (10,000 for 10,000+, 100,000 for 100,000+). We will start with a frequency table for the number of installs.


In [25]:
display_table(android_clean, 5)

1,000,000+ : 15.724760293288211
100,000+ : 11.551043429216017
10,000,000+ : 10.547095318668923
10,000+ : 10.197405527354766
1,000+ : 8.403835307388608
100+ : 6.91483361534123
5,000,000+ : 6.824591088550479
500,000+ : 5.561195713479977
50,000+ : 4.771573604060913
5,000+ : 4.512126339537507
10+ : 3.542019176536943
500+ : 3.248730964467005
50,000,000+ : 2.3011844331641287
100,000,000+ : 2.131979695431472
50+ : 1.9176536943034406
5+ : 0.7896221094190639
1+ : 0.5076142131979695
500,000,000+ : 0.2707275803722504
1,000,000,000+ : 0.2256063169768754
0+ : 0.04512126339537507
0 : 0.011280315848843767


Next, we will convert the values in the `Installs` column to floats and compute the average number of installs for each category.

In [26]:
categories_android = freq_table(android_clean, 1)
sorted_android = []

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_clean:
        category_app = app[1]
        if category_app == category:
            # Strip ',' and '+' so that '10,000+' becomes 10000.0
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category

    sorted_android.append((avg_n_installs, category))

# Sort by average installs, largest first
sorted_android = sorted(sorted_android, reverse=True)
for entry in sorted_android:
    print(entry[1], ":", entry[0])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1433675.5878962537
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
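The nested loop above rescans the whole data set once per category. As a sketch of an alternative, the same averages can be accumulated in a single pass with dictionaries. The names `parse_installs` and `average_installs_by_category` and the three-row `sample` are illustrative assumptions; in the notebook the function would be called with `android_clean`.

```python
def parse_installs(value):
    """Turn an install label such as '10,000+' into the float 10000.0."""
    return float(value.replace(',', '').replace('+', ''))


def average_installs_by_category(dataset, category_index=1, installs_index=5):
    """Return {category: average installs}, reading each row only once."""
    totals = {}
    counts = {}
    for app in dataset:
        category = app[category_index]
        installs = parse_installs(app[installs_index])
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Hypothetical rows, padded so index 1 is the category and index 5 the installs.
sample = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'WEATHER', '', '', '', '500+'],
]

averages = average_installs_by_category(sample)
for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(category, ':', avg)
```

Both versions should give the same averages; the single-pass form simply avoids looping over every app once for each category.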

From the data above, we can see that the communication category leads with an average of roughly 38,456,119 installs per app, followed by video players and social apps. From the App Store data, we know that a few giants such as Facebook, WhatsApp and YouTube can skew these comparisons dramatically. Let's explore these categories further to see whether the same is true of the Google Play Store.

In [27]:
display_app_filter(android_clean, 0, 1, 5, 'COMMUNICATION')

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

In [28]:
display_app_filter(android_clean, 0, 1, 5, 'VIDEO_PLAYERS')

YouTube : 1,000,000,000+
All Video Downloader 2018 : 1,000,000+
Video Downloader : 10,000,000+
HD Video Player : 1,000,000+
Iqiyi (for tablet) : 1,000,000+
Video Player All Format : 10,000,000+
Motorola Gallery : 100,000,000+
Free TV series : 100,000+
Video Player All Format for Android : 500,000+
VLC for Android : 100,000,000+
Code : 10,000,000+
Vote for : 50,000,000+
XX HD Video downloader-Free Video Downloader : 1,000,000+
OBJECTIVE : 1,000,000+
Music - Mp3 Player : 10,000,000+
HD Movie Video Player : 1,000,000+
YouCut - Video Editor & Video Maker, No Watermark : 5,000,000+
Video Editor,Crop Video,Movie Video,Music,Effects : 1,000,000+
YouTube Studio : 10,000,000+
video player for android : 10,000,000+
Vigo Video : 50,000,000+
Google Play Movies & TV : 1,000,000,000+
HTC Service － DLNA : 10,000,000+
VPlayer : 1,000,000+
MiniMovie - Free Video and Slideshow Editor : 50,000,000+
Samsung Video Library : 50,000,000+
OnePlus Gallery : 1,000,000+
LIKE – Magic Video Maker & Community : 50,

In [29]:
display_app_filter(android_clean, 0, 1, 5, 'SOCIAL')

Facebook : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Social network all in one 2018 : 100,000+
Pinterest : 100,000,000+
TextNow - free text + calls : 10,000,000+
Google+ : 1,000,000,000+
The Messenger App : 1,000,000+
Messenger Pro : 1,000,000+
Free Messages, Video, Chat,Text for Messenger Plus : 1,000,000+
Telegram X : 5,000,000+
The Video Messenger App : 100,000+
Jodel - The Hyperlocal App : 1,000,000+
Hide Something - Photo, Video : 5,000,000+
Love Sticker : 1,000,000+
Web Browser & Fast Explorer : 5,000,000+
LiveMe - Video chat, new friends, and make money : 10,000,000+
VidStatus app - Status Videos & Status Downloader : 5,000,000+
Love Images : 1,000,000+
Web Browser ( Fast & Secure Web Explorer) : 500,000+
SPARK - Live random video chat & meet new people : 5,000,000+
Golden telegram : 50,000+
Facebook Local : 1,000,000+
Meet – Talk to Strangers Using Random Video Chat : 5,000,000+
MobilePatrol Public Safety App : 1,000,000+
💘 WhatsLov: Smileys of love, sti

We can see from the output above that the Google Play data set shows a similar pattern: a few specific apps dominate these categories and skew the averages. The Game category is also very popular, but as noted previously, both markets are saturated with gaming apps. Let's go ahead and explore the `BOOKS_AND_REFERENCE` category, as it also seems quite popular and we identified it as a potential profile for the iOS App Store.

In [30]:
display_app_filter(android_clean, 0, 1, 5, 'BOOKS_AND_REFERENCE')

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We can see that most of the apps are e-book readers, along with popular books such as the Bible and a variety of Quran apps. The category also includes several guides and programming language tutorials. Besides these, we notice some very popular apps that skew the average. Let's explore these.

In [31]:
for app in android_clean:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+
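A quick way to see how strongly a single giant app can pull a category average is to compare the mean with the median. The sketch below uses only the first ten install labels from the `BOOKS_AND_REFERENCE` listing printed earlier, read as lower bounds. The variable names are illustrative, and the figures describe that ten-row excerpt, not the whole category.

```python
from statistics import mean, median

# First ten install labels from the BOOKS_AND_REFERENCE listing above.
labels = ['50,000+', '100,000+', '10,000,000+', '10,000,000+', '100,000+',
          '1,000,000+', '10,000,000+', '500,000+', '1,000,000+', '1,000,000,000+']

installs = [int(label.replace(',', '').replace('+', '')) for label in labels]

# Drop the single largest value (Google Play Books) to see its effect.
without_largest = sorted(installs)[:-1]

print('mean   :', mean(installs))
print('median :', median(installs))
print('mean without largest app :', round(mean(without_largest)))
```

In this excerpt the mean is about 103 million installs while the median is 1 million, and removing the one largest app brings the mean down to under 4 million.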


We can see that there are only a few very popular apps by this measure: e-book libraries and the Bible. With this information, we can say that the __Books__ genre is a potential candidate for our new app profile. Let's explore this category in Google Play further by looking at the apps in the middle of the popularity range.

In [32]:
for app in android_clean:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
Ha
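The chains of string comparisons in the two cells above could also be written as a numeric range check, which avoids listing every bucket by hand. This is only a sketch: `installs_between` is a hypothetical helper, and the `sample` rows are padded stand-ins that reuse a few names and install labels from the outputs above.

```python
def installs_between(dataset, category, low, high,
                     name_index=0, category_index=1, installs_index=5):
    """Return (name, installs label) pairs for apps in `category` whose
    install count, read as a lower bound, lies in [low, high]."""
    matches = []
    for app in dataset:
        if app[category_index] != category:
            continue
        installs = int(app[installs_index].replace(',', '').replace('+', ''))
        if low <= installs <= high:
            matches.append((app[name_index], app[installs_index]))
    return matches


# Stand-in rows: index 0 is the name, 1 the category, 5 the installs label.
sample = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Offline English Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
    ['YouTube', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
]

mid_range = installs_between(sample, 'BOOKS_AND_REFERENCE', 1_000_000, 50_000_000)
for name, installs in mid_range:
    print(name, ':', installs)
```

With the real data, the same call on `android_clean` should select the same mid-range apps as the cell above, and changing the bounds to 100,000,000 and above should select the very popular ones.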

This niche seems to be dominated by e-book readers, dictionaries and various libraries, so it wouldn't be a good idea to build something similar to these. We also noticed several apps built around a single popular book, the __Quran__. This suggests that building an app around one particular book or series can be profitable. Taking a popular or recent book and building an app around it is therefore a potential candidate for our app profile. Since the niche already contains several libraries, we will need to add features that make our app stand out from the others.

## Conclusions
In this project, we analyzed two sample data sets, one from the Google Play Store and one from the Apple App Store. Our objective was to come up with an app profile that will be profitable in both markets, with revenue coming from ads, since the app will be free and made for an English-speaking audience.

We concluded that taking a popular, recent book or series and building an app around it could be profitable in both markets, as long as the app has special features that make it more entertaining and help it stand out from existing libraries. Daily quotes, quizzes, trivia, an audio version, and an integrated forum or comments section where people can discuss the book are some of the features that could make our new app stand out and become popular, and therefore profitable.