# Analysing Mobile App Data (Project 101)

This project demonstrates what I have learnt so far in Python.

## Project Objective
The primary goal of this project is to analyse mobile app data to provide insights for our developers. By understanding which types of apps are likely to attract more users, we can better inform our development strategies.

## Dataset Documentation

For further details on the datasets used in this project, you can refer to the official documentation available on Kaggle:

- **Android Dataset**: You can download the documentation [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps/data).
- **iOS Dataset**: You can download the documentation [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

The source for both datasets is Kaggle.


## Google Play Dataset Documentation

| Column Name     | Description                                                              |
|-----------------|--------------------------------------------------------------------------|
| **App**         | Name of the application.                                                 |
| **Category**    | The category under which the app is listed in the Google Play Store.     |
| **Reviews**     | The number of user reviews for the app.                                  |
| **Installs**    | The number of times the app has been installed.                          |
| **Type**        | Indicates whether the app is free or paid (`Free` or `Paid`).            |
| **Price**       | The cost of the app (as a string), with the currency symbol (e.g., `$4.99`). Free apps have a price of `0`. |
| **Genres**      | A list of genres associated with the app.                                |
| **Content Rating** | The age group for which the app is suitable (e.g., `Everyone`, `Teen`).|
| **Last Updated**| The date the app was last updated on the Play Store.                     |
| **Current Ver** | The current version of the app.                                          |
| **Android Ver** | The minimum required Android version for the app.                        |

## iOS Dataset Documentation

| Column Name         | Description                                                                 |
|---------------------|-----------------------------------------------------------------------------|
| **track_name**       | Name of the application.                                                    |
| **currency**        | The currency in which the app's price is listed (e.g., `USD`).               |
| **price**           | The cost of the app in the listed currency. Free apps have a price of `0.0`. |
| **rating_count_tot**| The total number of user ratings.                                            |
| **rating_count_ver**| The number of user ratings for the current version of the app.               |
| **prime_genre**     | The primary genre of the app.                                                |
| **user_rating**     | The average user rating for all versions of the app.                         |
| **user_rating_ver** | The average user rating for the current version of the app.                  |
| **ver**             | The current version of the app.                                              |
| **cont_rating**     | The content rating, indicating the suitable age group (e.g., `4+`, `9+`).    |
| **sup_devices.num** | The number of devices the app is compatible with.                            |
| **ipadSc_urls.num** | The number of screenshots displayed for iPad devices.                        |
| **lang.num**        | The number of supported languages.                                           |
| **vpp_lic**         | Whether the app is available for volume purchase (`1` for yes, `0` for no).  |


### Data Loading and Initial Exploration

Let's begin by loading the two datasets. Once the data is loaded, we will proceed with exploring the content to understand its structure and key features.



In [3]:
from csv import reader

# Opening the App Store dataset and creating a variable for the header
opened_appstore = open(r'C:\Users\frede\Documents\DATAQUEST COURSE\PROJECT 1\datasets\AppleStore.csv',encoding='utf8')
read_appstore = reader(opened_appstore)
ios = list(read_appstore) # Converting the read object into a list
ios_header = ios[0]
ios = ios[1:]

# Opening the Play Store dataset and creating a variable for the header
opened_playstore = open(r'C:\Users\frede\Documents\DATAQUEST COURSE\PROJECT 1\datasets\googleplaystore.csv', encoding='utf8')
read_playstore = reader(opened_playstore)
android = list(read_playstore) # Converting the read object into a list
android_header = android[0]
android = android[1:]

### Loading the Datasets

To begin our analysis, we first need to load the datasets. We'll be working with two datasets: one from the Apple App Store and the other from the Google Play Store.
### Steps:
1. **Open and Read**: We open each dataset using the `open()` function and read the content using the `csv.reader` class.
2. **Convert to List**: The read object is converted into a list to make it easier to manipulate.
3. **Extract Headers**: We extract the header row from each dataset and store it separately.
4. **Remove Headers from Data**: Finally, we remove the header row from the main dataset, leaving only the data rows for further analysis.
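
The four steps above can also be wrapped in a small helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear elsewhere in this project, and the demo reads a tiny throwaway file instead of the real datasets. Using a `with` block means the file is closed automatically once reading is done.

```python
import os
import tempfile
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper, for illustration."""
    # The with-statement closes the file automatically, even if reading fails.
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    # First row is the header, the rest are data rows.
    return rows[0], rows[1:]

# Demo on a tiny throwaway file rather than the real datasets.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        f.write('App,Category\nFacebook,SOCIAL\nSlack,BUSINESS\n')
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would take the same shape, for example `ios_header, ios = load_csv(path_to_apple_store_csv)`, where the path variable is whatever location the dataset was saved to.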


## Creating a Data Exploration Function

To make it easier to explore the two datasets, we'll first write a function named `explore_data()` that we can use repeatedly to examine rows in a more readable format. Additionally, we’ll add an option for our function to display the number of rows and columns in any dataset.



In [6]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0])) 

print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We observe that the Google Play dataset contains 10,841 apps and 13 columns. At a quick glance, the columns that are likely to be useful for our analysis include:

- **App**
- **Category**
- **Reviews**
- **Installs**
- **Type**
- **Price**
- **Genres**


## Exploring the Data

We have defined a function named `explore_data()` to help us explore the datasets in a more readable way. This function allows us to specify a range of rows to print and, optionally, the number of rows and columns in the dataset.

### Function Details:
- **`dataset`**: The dataset to explore.
- **`start` and `end`**: The range of rows to display.
- **`rows_and_columns` (optional)**: If set to `True`, the function will also display the total number of rows and columns in the dataset.

### Example Usage:
Above, we print the header of the Android dataset, followed by the first three rows. We also display the number of rows and columns in the dataset.


In [9]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


## Overview of iOS Dataset

We have 7,197 iOS apps in this dataset. The columns that seem particularly interesting for our analysis are:

- **track_name**
- **currency**
- **price**
- **rating_count_tot**
- **rating_count_ver**
- **prime_genre**

While not all column names are self-explanatory, further details about each column can be found in the dataset documentation.




In [13]:
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


# Data Cleaning

In the previous step, we opened the two datasets and explored the data. Before beginning our analysis, it's crucial to ensure that the data we analyze is accurate; otherwise, the results of our analysis could be misleading. This means we need to:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.
- Focus on data that aligns with our company's goals, specifically apps that are free to download and designed for an English-speaking audience.

This process of preparing our data for analysis is called **data cleaning**. We will start by detecting and deleting incorrect data.

### Identifying Incorrect Data
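Upon reviewing the datasets, I found that row 10,472 in the Google Play dataset (printed above) has an incorrect entry. Its Category value is missing (and its Genres value is empty), so every later value is shifted one column to the left: the rating `1.9` sits in the Category position, and the row has only 12 values instead of 13.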
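
A row like this can also be found programmatically rather than by eye. The following is a minimal sketch on toy data, assuming that a faulty row can be recognised because its length differs from the header's length; `find_bad_rows` and the toy rows are hypothetical names introduced only for this illustration.

```python
def find_bad_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's length."""
    return [(i, row) for i, row in enumerate(dataset) if len(row) != len(header)]

toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Facebook', 'SOCIAL', '4.1'],
    ['Photo Frame', '1.9'],          # Category is missing, so the row is too short
    ['Slack', 'BUSINESS', '4.4'],
]

bad = find_bad_rows(toy_rows, toy_header)
print(bad)
```

On the real data the equivalent call would be `find_bad_rows(android, android_header)`. This check only catches rows with a wrong number of values; a row with the right length but wrong content would go unnoticed.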


## Deleting wrong data
To remove the row, we use the `del` statement, for instance `del data[index]`. The cell should be run only once: each run deletes whatever row currently sits at that index.


In [15]:
del android[10472]
print(android[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Removing duplicates
* In the previous step we deleted a row with wrong data. Now we are going to get rid of duplicate entries. For instance, Instagram has four entries. In total, there are 1,181 cases where an app occurs more than once.

In [17]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


* Let's count the number of duplicate entries using a loop.
* We create two lists, one for storing unique names and one for the duplicate names.

In [19]:
unique_names = []
duplicate_names = []

# Iterate over each row of the android dataset and take the app name (index 0).
# If the name is already in unique_names, append it to duplicate_names;
# otherwise append it to unique_names.
for app in android:
    name = app[0]
    if name in unique_names:
        duplicate_names.append(name)
    else:
        unique_names.append(name)

print('Number of duplicate app', len(duplicate_names))
print('\n')
print('Example of duplicate apps',  duplicate_names[:11])

Number of duplicate app 1181


Example of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic']


* We are not going to remove duplicate entries randomly. We will keep the entry with the highest number of reviews, since a higher review count suggests that the data is more recent.
* The expected length is the total length of the android list minus 1,181 (the number of duplicate entries).

In [21]:
print('Expected length', len(android) - 1181)

Expected length 9659


* To remove duplicates we will do the following:
1) Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews for that app.
2) Use the information stored in the dictionary to create a new dataset with only one entry per app: for each app, we select only the entry with the highest number of reviews.
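
The idea behind these two steps can be illustrated on a toy list. The sketch below is a single-pass variant, not the two-step approach used in the cells that follow: it stores the whole row in the dictionary instead of only the review count. The function name `keep_most_reviewed` and the toy rows are assumptions made for the example.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Keep the row if the name is new or this entry has strictly more reviews.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

deduped = keep_most_reviewed(toy)
for row in deduped:
    print(row)
```

When two entries tie on the highest review count, this variant keeps the first one it encounters, because the comparison is strict.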

In [23]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    # If the app name is already in reviews_max, only update if n_reviews is greater
    if name in reviews_max and n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews
        
    # If the app name is not in reviews_max, simply add it with the current n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

len(reviews_max)

9659

* Now we will use the dictionary to remove the duplicate rows.
* We create two lists: `android_clean` to store the cleaned dataset, and `already_added` to keep track of the app names we have already kept.
* For each app in `android`, we read the name and convert the number of reviews to a float.
* If the number of reviews equals the maximum stored in `reviews_max` and the name is not yet in `already_added`, we append the row to `android_clean` and the name to `already_added`. The second check is needed because duplicate entries can share the same highest review count, and without it all of them would be kept.
* The dataset should now have 9,659 rows.

In [25]:
print('Number of rows without data cleaning', len(android))
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print('Number of rows after cleaning', len(android_clean))
print('\n')
#using explore function
explore_data(android_clean,0,3,True)
        

Number of rows without data cleaning 10840
Number of rows after cleaning 9659


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Removing Non-English apps
* In the previous step, we removed the duplicate apps from the Google Play dataset.
* Remember that we only use English for the apps we develop, and we'd like to analyse only the apps that are designed for an English-speaking audience.
* If we explore the data long enough, we will find some apps whose names suggest they are not designed for an English-speaking audience.


In [27]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


* We are not interested in keeping these apps, so we'll remove them.
* One way to do this is to remove each app whose name contains a symbol that isn't commonly used in English text.
* English text includes letters from the English alphabet, the digits 0-9, punctuation marks (., !, ?) and other symbols (+, *).
* Each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for 'a' is 97, for 'A' is 65, and character '爱' is 29,233.
* We can get the corresponding number of each character using the <code>ord()</code> built-in function. [Documentation link](https://docs.python.org/3/library/functions.html#ord)
* The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system.
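
As a quick illustration of the points above, the sketch below prints the number returned by `ord()` for a few characters and labels each one as inside or outside the ASCII range. The sample characters are chosen only for the example; the camera emoji is written as an escape sequence so that it is unambiguous.

```python
# The last sample is the camera emoji, written as an escape sequence.
samples = ['a', 'A', '爱', '™', '\U0001F4F7']

code_points = {ch: ord(ch) for ch in samples}
for ch, number in code_points.items():
    label = 'ASCII' if number <= 127 else 'non-ASCII'
    print(repr(ch), number, label)

# chr() is the inverse of ord(): it turns a number back into its character.
print(chr(97))
```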

In [29]:
print(ord('a'))
print(ord('b'))
print(ord('c'))
print(ord('f'))

97
98
99
102


* Based on this number range, we can create a function that detects whether a character belongs to the set of common English characters.
* If the number is less than or equal to 127, the character belongs to the set of common English characters.
* In Python, strings are indexable and iterable, which means we can use indexing to select an individual character and we can also iterate over the string using a for loop.

In [31]:
example = 'abc'
for letter in example:
    print(letter)

print(example[0])
print(example[1])
print(example[2])

a
b
c
a
b
c


* We are going to write a function that checks each character of the app name.

In [33]:
def check_english(string):
    for letter in string:
        if ord(letter) > 127:
            return False
            
    return True

In [34]:
print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instagram📷'))


True
False
False
False


* We wrote a function that detects non-English app names, but it couldn't correctly identify certain English app names such as 'Docs To Go™ Free Office Suite' and 'Instagram📷'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.
* If we used the function as written, we'd lose useful data, since many English apps would be incorrectly labeled as non-English.
* To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English.
* We will change the function accordingly and name the new version <code>is_english()</code>. Below, we use it to filter out the non-English apps from both datasets.



In [37]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instagram📷'))

True
False
True
True
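
For comparison, the same rule can be written more compactly. This is only a sketch of an equivalent formulation: the name `is_english_compact` and the `max_non_ascii` parameter are hypothetical and are not used elsewhere in this project.

```python
def is_english_compact(string, max_non_ascii=3):
    """Return True if `string` has at most `max_non_ascii` non-ASCII characters."""
    # Count the characters whose code point falls outside the ASCII range.
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
]
results = [is_english_compact(name) for name in names]
print(results)
```

Making the threshold a parameter keeps the default behaviour (more than three non-ASCII characters means non-English) while allowing a stricter or looser rule to be tried without rewriting the function. The heuristic remains imperfect either way: a short non-English name with three or fewer non-ASCII characters still passes.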


**Filtering out non-English apps from both datasets**

In [39]:
ios_english = []
android_english = []

for app in ios:
    name = app[1] 
    if is_english(name):
        ios_english.append(app)


for app in android_clean:
    name = app[0] 
    if is_english(name):
        android_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

## Isolating the free apps
**So far we:**
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

We're left with 9,614 Android apps and 6,183 iOS apps.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps, so we'll need to isolate only the free apps for our analysis.

In [41]:
print(android_header)
# index 7 is the price in the Android dataset
print(ios_header)
# index 4 is the price in the iOS dataset

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


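* We'll loop through each dataset and collect the free apps in two new lists.
* The prices are stored as strings, so we compare against the strings `'0'` (Android) and `'0.0'` (iOS). Comparing against the numbers `0` or `0.0` would not raise an error, but it would never match, and both lists would stay empty.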
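
Because the two datasets spell a free price differently (`'0'` and `'0.0'`), another option is to convert the price to a number before comparing. The sketch below assumes every price is either a plain number or a number prefixed with `$`, as described in the dataset documentation; `is_free` is a hypothetical helper, and a malformed price would raise a `ValueError`.

```python
def is_free(price_string):
    """Return True when a price string such as '0', '0.0' or '$4.99' means free."""
    # Strip the dollar sign (used in the Google Play data) before converting.
    return float(price_string.replace('$', '')) == 0.0

# A string never equals a number, so this comparison is simply False.
print('0' == 0)

print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```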

In [43]:
android_free = []
ios_free = []

for app in android_english:
    if app[7] == '0':
        android_free.append(app)

for app in ios_english:
    if app[4] == '0.0':
        ios_free.append(app)


explore_data(android_free,0,3,True)
print('\n')
explore_data(ios_free,0,3,True)
        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

## Most common apps, part one
* So far we removed inaccurate data
* removed duplicate app entries
* removed non-English apps
* isolated the free apps

* Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

* Because our goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. A profile that works well for both markets might be a productivity app that makes use of gamification.
* Let's begin the analysis by getting a sense of the most common genres in each market.
* We will need to build **frequency** tables for a few columns in our datasets.

We will inspect both datasets and identify the columns we could use to generate frequency tables to find out the most common genres in each market.

In [46]:
print('Android dataset','\n', android_header)
# index 1 is Category and index 9 is Genres in the Android dataset (index 3 is Reviews)
print('Ios dataset','\n', ios_header)
# index 11 is prime_genre in the iOS dataset (index 5 is rating_count_tot)

Android dataset 
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
Ios dataset 
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


<code>sorted()</code> does not work well with dictionaries: it sorts and returns only the keys. We can transform the dictionary into a list of tuples, where each tuple contains a dictionary value along with its key. To ensure the sorting works as intended, the value comes first and the key comes second.
* We wrote a function named <code>display_table()</code> that does the following:
  - Takes in two parameters: dataset and index. The dataset will be a list of lists and index will be an integer.
  - Generates a frequency table using <code>freq_table()</code> (defined in the cell after it).
  - Transforms the frequency table into a list of tuples, then sorts it in descending order.
  - Prints the entries of the frequency table.
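
The sorting behaviour described above can be checked on a small dictionary. The sketch below uses made-up rounded percentages and shows the tuple approach next to an alternative that passes a `key` function to `sorted()`; the variable names are chosen only for this example.

```python
table = {'Games': 58.16, 'Entertainment': 7.88, 'Education': 3.66}

# Sorting a dictionary directly only sorts its keys (alphabetically here).
keys_only = sorted(table)
print(keys_only)

# Tuple approach: (value, key) pairs, so the comparison looks at the value first.
as_tuples = sorted([(value, key) for key, value in table.items()], reverse=True)
print(as_tuples)

# Alternative: keep (key, value) pairs and tell sorted() to compare the value.
by_value = sorted(table.items(), key=lambda item: item[1], reverse=True)
print(by_value)
```

Both approaches give the same ordering by value. They can differ when two values are equal: the tuple approach then falls back to comparing the keys, while the `key` version keeps tied entries in their original order.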
 
  

In [48]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [49]:
def freq_table(dataset, index):
    # Initialize an empty dictionary to store the frequency of each unique value
    frequency_table = {}    
    # Initialize a counter to keep track of the total number of rows
    total = 0
    
    # Loop through each row in the dataset
    for row in dataset:
        # Increment the total count by 1 for each row
        total += 1
        name = row[index]        
        if name in frequency_table:
            # If it is, increment its count by 1
            frequency_table[name] += 1
        else:
            # If it's not, add it to the dictionary with a count of 1
            frequency_table[name] = 1
    
    # Convert counts to percentages
    table_percentage = {}
    for key in frequency_table:
        # Calculate the percentage of each value relative to the total count
        percentage = (frequency_table[key] / total) * 100
        table_percentage[key] = percentage
    
    # Return the final frequency table with percentages
    return table_percentage


We use <code>display_table()</code> to display the frequency tables of the columns **prime_genre**, **Genres**, and **Category**.

In [51]:
display_table(ios_free,11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


* What is the most common genre? What is the next most common?
* What other patterns do you see?
* What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
* Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?


### Analysing iOS apps
The most common genre among free English-language apps in the iOS dataset is **Games**, accounting for **58.16%** of the total apps. The next most common genre is **Entertainment**, at **7.88%**. There is a significant gap between the first and second genres.

We can observe a pattern where apps designed for entertainment, such as games, photo and video, and social networking, are more prevalent than apps designed for practical purposes. Based on this frequency table alone, games followed by entertainment apps look like the natural focus, since these genres dominate the App Store in terms of the number of available apps.

However, it's important to note that the high number of entertainment apps does not necessarily indicate that they also have the largest user base. The demand for these apps might not match their supply, so further analysis would be needed to assess user engagement and profitability in these genres.

In [54]:
print('Category', '\n')
display_table(android_free,1)
print('\n')
print('Genre', '\n')

display_table(android_free,9)

Category 

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AN

### Analysing Android apps
In the Android apps dataset, the most common category is Family, accounting for 18.91% of the total apps. This is followed by Game at 9.72% and Tools at 8.46%.

When looking at genres specifically, Tools lead with 8.44%, followed by Entertainment at 6.06% and Education at 5.34%.

Insights:
- Family Category Dominance: The dominance of the Family category suggests a broad range of apps aimed at family-related activities or child-friendly content.
- Tools Consistency: The Tools category appears prominently in both the overall categories and genres, indicating a strong focus on utility apps in the Android market.
- Entertainment and Education: These genres also stand out alongside Tools. Compared with the iOS market, Entertainment has a smaller share (6.06% vs 7.88%), while Education has a larger one (5.34% vs 3.66%).

## Most Popular Apps by Genre on the App Store
* Now, we'd like to determine the kind of apps with the most users.
* One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. 
* For the Google Play dataset, we can find this information in the Installs column, but this information is missing for the App Store dataset. 
* As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

To do that, we'll need to do the following:

1) Isolate the apps of each genre
2) Add up the user ratings for the apps of that genre
3) Divide the sum by the number of apps belonging to that genre (not by the total number of apps)
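
The three steps above can also be done in a single pass over the data by keeping a running total and a count for each genre. The sketch below illustrates this on toy rows; `average_by_group`, its parameters, and the toy numbers are assumptions made for the example, not part of the project's code.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column at value_index}."""
    totals = {}  # group -> running sum of the values
    counts = {}  # group -> number of rows in that group
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    # Divide each sum by the size of its own group, not by the total row count.
    return {group: totals[group] / counts[group] for group in totals}

toy = [
    ['Facebook', 'Social Networking', '300'],
    ['Instagram', 'Social Networking', '100'],
    ['Clash of Clans', 'Games', '50'],
]

averages = average_by_group(toy, group_index=1, value_index=2)
print(averages)
```

For the iOS data the equivalent call would be `average_by_group(ios_free, 11, 5)`. A single pass avoids looping over the whole dataset once per genre, which matters little at this size but scales better.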

In [58]:
genre_ios = freq_table(ios_free,11)

for genre in genre_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', round(avg_n_ratings,2))


Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0


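By average number of user ratings, Navigation (86,090.33), Reference (74,942.11), Social Networking (71,548.35), and Music (57,326.53) are the most popular genres among free iOS apps, with Weather (52,279.89) close behind. Photo & Video (28,441.54) and Games (22,788.67) sit much lower. Because install counts are not available for the App Store, these figures reflect ratings only. I would recommend choosing one of the leading genres.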
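
The nested loop above rescans the whole dataset once per genre and prints the averages in no particular order. As a rough sketch of an alternative, the totals and counts can be accumulated in a single pass and then sorted so the highest averages come first. The helper name `average_by_group` and the sample rows are made up for illustration and are not part of the original notebook.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: average value}, computed in a single pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up rows for illustration only: [name, rating count, genre]
sample_apps = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '1000', 'Navigation'],
    ['App D', '50', 'Education'],
]

averages = average_by_group(sample_apps, group_index=2, value_index=1)

# Sort from the highest average to the lowest so the top genres come first
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for genre, avg in ranked:
    print(genre, ':', round(avg, 2))
```

With the project's data, and assuming the same column positions used above (genre at index 11, rating count at index 5), the equivalent call would be `average_by_group(ios_free, 11, 5)`.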

### Most popular apps in Google Play
- The install numbers are not precise: most values are open-ended (100+, 1,000+, 5,000+, etc.).
- We'll leave the numbers as they are, treating an app with 100,000+ installs as having 100,000 installs, an app with 1,000,000+ installs as having 1,000,000 installs, and so on. To perform computations, however, we need to convert each install number from a string to a float. This means removing the commas and plus signs first; otherwise the conversion raises an error.
- To remove characters from a string we can use the `str.replace(old, new)` method. Like `list.append()` and `list.copy()`, it is a method: a special kind of function that belongs to an object.
- `str.replace(old, new)` takes two required arguments, the substring to look for (`old`) and its replacement (`new`), and returns a new string.
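
As a small illustration of the cleaning step described above, the sketch below chains two `str.replace` calls before converting to a float. The helper name `clean_installs` is hypothetical; the notebook performs the same replacements inline.

```python
def clean_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    # str.replace returns a new string, so the calls can be chained
    without_commas = raw.replace(',', '')
    without_plus = without_commas.replace('+', '')
    return float(without_plus)


print(clean_installs('1,000,000+'))
print(clean_installs('0'))
```

Strings are immutable in Python, so `replace` never changes the original value; the result has to be assigned or returned, as it is here.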

In [61]:
# Show the percentage distribution of the Installs column (index 5)
display_table(android_free, 5)
# Build a frequency table for the Category column (index 1)
android_freq_table = freq_table(android_free, 1)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [62]:
for category in android_freq_table:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_category = total / len_category
    print(category,':', round(avg_category,2))



ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.6
FAMILY : 3695641.82
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.4
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.3
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.2
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77


- On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps with over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) and by several others with 100 million or 500 million or more installs:

In [64]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps with 100 million or more installs, the average would drop roughly tenfold, from about 38.5 million to about 3.6 million:

In [66]:
under_100_m = []

for app in android_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
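
Another common way to see how much a few very large values pull an average up is to compare the mean with the median. This check is not part of the original analysis; the sketch below uses made-up install counts and Python's standard `statistics` module purely to illustrate the effect.

```python
from statistics import mean, median

# Made-up install counts: most apps are small, one is very large
installs = [10_000, 50_000, 100_000, 500_000, 1_000_000_000]

mean_all = mean(installs)
median_all = median(installs)
# Same idea as the cell above: drop the largest app and recompute the mean
mean_without_largest = mean(installs[:-1])

print('mean:', mean_all)
print('median:', median_all)
print('mean without the largest app:', mean_without_largest)
```

When the mean sits far above the median, as it does here, a handful of giants is driving the average, which is the pattern the communication category shows.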

The Books and Reference category, which averages 8,767,811 installs, includes a variety of apps: software for processing and reading ebooks, collections of libraries, dictionaries, tutorials on programming or languages, and so on. Here too, a small number of extremely popular apps seem to skew the average:

In [68]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

In [69]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [71]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])


Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
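
The cell above selects mid-popularity apps by comparing install strings one at a time. A sketch of an equivalent numeric filter is shown below, assuming the same column positions (name at index 0, category at index 1, installs at index 5). The function names are hypothetical, and the sample rows reuse three app names from the output with empty placeholders for the unused columns.

```python
def installs_to_number(raw):
    """Convert an install string such as '10,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))


def apps_in_install_range(dataset, category, low, high):
    """Return (name, installs) pairs for a category with low <= installs < high."""
    matches = []
    for app in dataset:
        if app[1] == category and low <= installs_to_number(app[5]) < high:
            matches.append((app[0], app[5]))
    return matches


# Sample rows; columns 2 to 4 are unused placeholders
sample_android = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Litnet - E-books', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
]

mid_popularity = apps_in_install_range(
    sample_android, 'BOOKS_AND_REFERENCE', 1_000_000, 100_000_000
)
for name, installs in mid_popularity:
    print(name, ':', installs)
```

Comparing numbers rather than strings means the range can be changed by editing two values instead of adding or removing `or` clauses.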

In [72]:
for app in android_free:
    if app[1] == 'COMICS':
        print(app[0], ':', app[5])

Manga Master - Best manga & comic reader : 500,000+
GANMA! - All original stories free of charge for all original comics : 1,000,000+
Röhrich Werner Soundboard : 500,000+
Unicorn Pokez - Color By Number : 50,000+
MangaToon - Comics updated Daily : 50,000+
Manga Net – Best Online Manga Reader : 50,000+
Manga Rock - Best Manga Reader : 1,000,000+
Manga - read Thai translation : 10,000+
The Vietnam Story - Fun Stories : 10,000+
Dragon Ball Wallpaper - Ringtones : 10,000+
Funny Jokes Photos : 10,000+
Truyện Vui Tý Quậy : 10,000+
Comic Es - Shojo manga / love comics free of charge ♪ ♪ : 100,000+
comico Popular Original Cartoon Updated Everyday Comico : 5,000,000+
漫咖 Comics - Manga,Novel and Stories : 1,000,000+
Emmanuella Funny Videos 2018 : 100,000+
Manga Zero - Japanese cartoon and comic reader : 1,000,000+
Marvel Unlimited : 1,000,000+
Tapas – Comics, Novels, and Stories : 1,000,000+
Children's cartoons (Mithu-Mina-Raju) : 100,000+
Narrator's Voice : 5,000,000+
【Ranobbe complete free】 No

The Books and Reference niche is dominated by apps for reading and processing ebooks, along with collections of libraries and dictionaries, so building a similar app is probably not the best idea given how much competition there is.

We also see many apps focused on the Quran, which shows that apps based on popular books can do well. Creating an app around a well-known book, maybe a newer one, could be successful on both Google Play and the App Store.

But since there are already many library apps, we'll need to add something extra to stand out. This could be features like daily quotes, an audio version, quizzes, or a forum for discussions.

## Conclusion
In this project, we analysed data from the App Store and Google Play to identify an app profile that could be profitable in both markets.

We found that creating an app based on a popular book, especially a newer one, could be a good opportunity. However, since there are already many library apps available, it's important to offer additional features beyond just the basic book content. These could include daily quotes, an audio version, quizzes, or a discussion forum related to the book.