# Profitable App Profiles for the App Store and Google Play Markets

<br />

<br />


![header.jpg](header.jpg)


**Introduction to the Project**



In today's mobile development landscape, it is crucial to understand which types of applications attract more users in both Android and Apple ecosystems. With the goal of providing valuable information to developers, this project focuses on data analysis to identify common characteristics among the applications that receive a higher number of downloads and engagement.

1. **What Does the Project Entail?**

    This data analysis combines information from various sources, such as Google Play Store and iTunes App Store, along with usage and performance metrics. Using statistical techniques and machine learning algorithms, patterns and trends are identified to indicate which types of applications perform better in each platform.

2. **Project Objective**

    The primary objective of this project is to provide developers with precise information about the characteristics that resonate best with users on Android and iOS. By understanding better what types of applications perform well in terms of downloads, engagement, and retention, developers can make more informed decisions for the design and optimization of their own apps.


The first is the collection and analysis of data on mobile apps available on Google Play and the App Store. To avoid wasting resources on collecting new data, we should first try to see if we can find relevant existing data at no cost. Fortunately, here are two datasets that seem suitable for our goals:

## <a id='0'>Index</a>

 
### <a href='#1'>1. Exploring data </a>
 
 
- <a href='#11'>1.1 Function to explore datasets</a>


-  <a href='#12'>1.2 Data dictionary:</a>


-  <a href='#13'>1.3 Knowing the size in both data sets.</a>

### <a href='#2'>2. Data cleaning </a>


-  <a href='#21'>2.1. Detect inaccurate data... remove it.</a>


-  <a href='#22'>2.2 Detect duplicate data, and remove the duplicates.</a>


-  <a href='#23'>2.3 What differentiates duplicate applications?</a>


-  <a href='#24'>2.4 Removing Non-English Apps</a>


-  <a href='#25'>2.5 Filter non-English applications from both datasets. </a>


-  <a href='#26'>2.6 Exploring datasets and see how many rows we have left for each dataset.</a>



### <a href='#3'>3. Isolating Free Apps </a>


-  <a href='#31'>3.1 Our **ultimate goal** </a>


-  <a href='#32'>3.2 Profiles of apps successful in both markets	Table</a>



### <a href='#4'>4. Most Popular Apps by Genre on Apple Store </a>

### <a href='#5'>5. Most Popular Apps by Genre on Apple Store </a>

### <a href='#6'>5. Summary. </a>

* * *

The datasets are in two documents of type csv's i.e. **Comma Separated Values file** for read the content, it is necessary to load a function called `reader` of the python `csv` module.

One dataset contains data from approximately 10,000 **Android apps from Google Play**; Data was collected in August 2018. 


Another dataset contains data from approximately 7,000 **iOS apps from the App Store**; Data was collected in July 2017.

In [1]:
!pwd

/home/ion/Formacion/git_repo_klone/albertjimrod/Python


In [2]:
from csv import reader

* * *
<a href='#0'> back to index</a>

## <a id='1'>1. Exploring datasets</a>:

Dataset exploration is crucial because it provides a deep and quick understanding of the dataset. 

It allows you to identify patterns, trends, and anomalies that are not noticed at first. In addition, it helps to select relevant variables, discover relationships between variables, and prepare data for more complex analyses. 

This results in better predictive models, more accurate reporting, and more informed decisions based on the information contained in the datasets.

So, we are going to explore the two sets of data, but first we are going to kwon what type of [character encoding](https://en.wikipedia.org/wiki/Character_encoding) both datasets have.

A command called [file](https://www.man7.org/linux/man-pages/man1/file.1.html), will help us to kwon the type of file regardless of its extension and avoid a error called `UnicodeDecodeError`.

In [3]:
! file -i datasets/AppleStore.csv

datasets/AppleStore.csv: text/csv; charset=utf-8


In [4]:
! file -i datasets/googleplaystore.csv

datasets/googleplaystore.csv: text/csv; charset=utf-8


Our files have the same type of charset https://en.wikipedia.org/wiki/UTF-8.


Quick idea:

- **charset:** is the set of characters you can use.
- **encoding:** is the way these characters are stored into memory.

[source](https://stackoverflow.com/questions/2281646/whats-the-difference-between-encoding-and-charset)

In [5]:
AppleStore = open('datasets/AppleStore.csv', encoding='utf8')
AppleStore = reader(AppleStore)
AppleStore = list(AppleStore)
header_apple = [x for x in AppleStore[0]]
header_apple #Columns on AppleStore

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [6]:
dataset_apple = AppleStore[1:] #whole dataset apple, no headers

In [7]:
GooglePlay = open('datasets/googleplaystore.csv', encoding='utf8')
GooglePlay = reader(GooglePlay)
GooglePlay = list(GooglePlay)
header_google = [x for x in GooglePlay[0]]
header_google 

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [8]:
dataset_google = GooglePlay[1:] #whole dataset google, no header

* * *
<a href='#0'> back to index</a>

### <a id='11'>1.1 Function to explore datasets</a>

Sure, here's the text with the substitution: python The 'explore_data' function is used to explore a dataset in a specific range. 

Here's how it works step by step: 

1. **Take three arguments**:
   - `dataset`: The complete dataset.
   - `start`: The starting position from which to start the visualization.
   - `end`: The final position up to which the data will be displayed.

2. **Create a portion of the dataset**:
    - `dataset_slice = dataset[start:end]`: This selects a subset of rows of the dataset, from row 'start' to row 'end'.

3. **Print each row in the specified range**:
   - `for row in dataset_slice: print(row)`: These are iterated through rows in 'dataset_slice' and printed by console. - 'print('\n')': This adds an empty line after each row to improve readability.

5. **Optionally, display the number of rows and columns**:
   - If `rows_and_columns=True` is passed, prints the total number of rows and the number of columns in the dataset.
   - 'print(`Number of rows:`, len(dataset)): Prints the total number of rows in the entire dataset.
   - 'print(`Number of columns:`, len(dataset))': Prints the number of columns in each row, which is constant if all rows have the same structure.

In [9]:
def explore_data(dataset, start, end, rows_and_columns=False):
        dataset_slice = dataset[start:end]    
        for row in dataset_slice:
            print(row)
            print('\n') # adds a new (empty) line after each row
        if rows_and_columns:
            print('Number of rows:', len(dataset))
            print('Number of columns:', len(dataset[0]))

In [10]:
%%html
<style>
table {float:left}
</style>

<a href='#0'> back to index</a>
* * *
### <a id='12'> 1.2 Data dictionary:</a>

We have the name of the columns and the data that form the set of both data sets:


#### AppleStore.csv

 - `header_apple`

 - `dataset_apple` 

||Name | Description|
|:--|:---|:--|
|1|"id" : |App ID|
|2|"track_name": |App Name|
|3|"size_bytes": |Size (in Bytes)|
|4|"currency": |Currency Type|
|5|"price": |Price amount|
|6|"ratingcounttot": |User Rating counts (for all version)|
|7|"ratingcountver": |User Rating counts (for current version)|
|8|"user_rating" : |Average User Rating value (for all version)|
|9|"userratingver": |Average User Rating value (for current version)|
|10|"ver" : |Latest version code|
|11|"cont_rating": |Content Rating|
|12|"prime_genre": |Primary Genre|
|13|"sup_devices.num": |Number of supporting devices|
|14|"ipadSc_urls.num": |Number of screenshots showed for display|
|15|"lang.num": |Number of supported languages|
|16|"vpp_lic": |Vpp Device Based Licensing Enabled|


#### Googleplaystore.csv

 - `header_google`

 - `dataset_google`


||Name | Description|
|:--|:---|:--|
|1 | App: |Application name|
|2 | Category: |Category the app belongs to|
|3 | Rating: |Overall user rating of the app (as when scraped)|
|4 | Reviews: | Number of user reviews for the app (as when scraped)|
|5 | Size: | Size of the app (as when scraped)|
|6 | Installs: | Number of user downloads/installs for the app (as when scraped)|
|7 | Type: | Paid or Free|
|8 | Price: | Price of the app (as when scraped)|
|9 | Content: | Rating Age group the app is targeted at - Children / Mature 21+ / Adult|
|10 |  Genres: | An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres.|

<a href='#0'> back to index</a>
* * *
### <a id='13'>1.3 Knowing the size in both data sets</a>.

In [11]:
explore_data(dataset_apple, 2, 4, True)

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


In [12]:
explore_data(dataset_google, 2, 4, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


### Resume

|`dataset_apple`|`dataset_google`|
|:----|:----|
|Number of rows: 7197| Number of rows: 10841|
|Number of columns: 16|Number of columns: 13|

<a href='#0'> back to index</a>
* * *
## <a id='2'>2. Data cleaning</a>

Once we have explored the datasets, considering that we create free applications in English, simply to download and install, we must remove the applications that are not free and those that are not in English. The next step we have to do is to perform a data cleansing before the analysis, which involves:

This means that we need to:

- 1. **Detect inaccurate data, and correct or remove it**.

- 2. **Detect duplicate data, and remove the duplicates**.

- 3. **Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播**.

- 4. **Remove apps that aren't free##**.


The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=votes), and we can see that [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.

<a href='#0'> back to index</a>
* * *
### <a id='21'>2.1. Detect inaccurate data... remove it.</a>

If the number of columns in a row varies then some data is missing.

This function `field_number` has a pair of parameters `dataset`, `header` in which we enter the dataset and the header.

The passage of each row compare the number of fields with the number of fields in header if they do not coincide then in that line **some type of data is missing**.

<br>

#### `field_number` feature:

<br>

The `field_number` function checks the number of columns in each row of the dataset against the header length. 

It takes two arguments: 

- the dataset and the header.

<br>

The function iterates through each row, comparing the number of elements in each row to the length of the header. 

If a row has a different number of columns, it prints that row, indicates the index number where the error occurs, and notes the incorrect number of columns.

In [13]:
def field_number(dataset, header):
    index_number = 0
    for row in range(len(dataset)):
        if len(dataset[row]) != len(header):
            print(dataset[row])
            print("\n")
            print("error in index number: ",index_number)
            print("number of columns has an error: ",len(dataset[row]))
        index_number +=1

In [14]:
field_number(dataset_apple, header_apple)

In [15]:
field_number(dataset_google, header_google)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


error in index number:  10472
number of columns has an error:  12


In [16]:
len(dataset_google[10472]) #discussion row

12

In [17]:
dataset_google[10472]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

We have passed the two datasets through the function and there is a row in the `dataset_google` with a different length from the number of columns therefore we are going to remove it from our dataset.

In [18]:
del dataset_google[10472]

Checking if row has been removed properlly.

In [19]:
dataset_google[10472]

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']

#### yes ...It has been removed

<a href='#0'> back to index</a>
* * *
### <a id='22'>2.2 Detect duplicate data, and remove the duplicates.</a>

We have already checked that there is a coherence between the columns and columns of each of the rows of both datasets, so we must go forwards and check the existence of applications with the repeated name. 

So let's check if this is true in both datasets:


<br>

#### `detect_duplicated` feature:

<br>

The `detect_duplicated` function identifies and prints any duplicate apps in the dataset. 

It takes the dataset as an argument and iterates through each row, checking if the app name (the first element of each row) has appeared before. 

If a duplicate is found, it adds the name to a list of duplicates. 

After processing all rows, it prints whether there are any duplicates and, if so, provides examples of five such duplicates.


In [20]:
def detect_duplicated(dataset):
    unique_apps = []
    duplicate_apps = []
    
    for row in dataset:
        name = row[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
            
    if len(duplicate_apps) == 0:
        print("In this dataset there are no duplicate apps")
    else:
        times=len(duplicate_apps)
        print("There are {reapeated} repeated apps".format(reapeated=times,dataset=dataset))
        print("\n")
        print('Examples of duplicate apps:',"\n"+"\n" , duplicate_apps[:5])

In [21]:
detect_duplicated(dataset_apple)

In this dataset there are no duplicate apps


In [22]:
detect_duplicated(dataset_google)

There are 1181 repeated apps


Examples of duplicate apps: 

 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Just only 5 samples of the repeated applications in **dataset_google**

* * * 
<a href='#0'> back to index</a>
### <a id='23'>2.3 What differentiates duplicate applications?</a>

If we take one of this as an example of a repeated application, we see that out of all the fields that make up the row **there is only one that differs**.

Once we have verified that there are many applications (some) repeated more than once, the next step would be to determine which criterion to follow for removing the applications:

In this case, the criterion has been: **the application with the higher number of reviews should have the most recent data**."

In [23]:
for app in dataset_google[1:]:
    if app[0] == 'Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


In [24]:
for app in dataset_google[1:]:
    if app[0] == 'MailChimp - Email, Marketing Automation':
        print(app)

['MailChimp - Email, Marketing Automation', 'BUSINESS', '4.1', '5448', '12M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 25, 2018', '4.9.1', '5.0 and up']
['MailChimp - Email, Marketing Automation', 'BUSINESS', '4.1', '5448', '12M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 25, 2018', '4.9.1', '5.0 and up']


Once we have verified that there are applications (many) repeated more than once, the next step would be to determine **what criteria to follow to remove the applications**:

- Instead of eliminating random duplicates we will only keep those applications that are repeated and have the highest number of reviews, they will be the ones that will prevail in our dataset, the others will be eliminated.

<br>

#### `uniq_apps_top_reviews` feature:

<br>

The `uniq_apps_top_reviews` function creates a dictionary where the keys are unique app names and the values are the maximum number of reviews for each app. 

It iterates through the dataset, extracts the app name and review count, and updates the dictionary to keep only the highest review count for each app.

After processing all rows, it prints the length of the resulting dictionary and returns it.

In [25]:
def uniq_apps_top_reviews(dataset):  # diccionario con el nombre de las aplicaciones con su máximo review
    reviews_max = {} #key=name[0], value=review[3]
    
    for app in dataset: # No tengo que tener en cuenta la fila de las columnas
        name = app[0]
        n_reviews = float(app[3]) # porque si lo hiciera tendría un error al convertir a float 
                                  # pero esto me deja el dataset sin nombres de columnas! Debo hacer lo mismo en el de Apple.  
        
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
            
        elif name is not reviews_max:
            reviews_max[name] = n_reviews
    
    return reviews_max

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [26]:
print('Expected length:', len(dataset_google) - 1181)
print('Actual length:', len(uniq_apps_top_reviews(dataset_google)))

Expected length: 9659
Actual length: 9659


The next step we have to do is a function that takes a dataset and a dictionary as inputs. It filters the dataset to include only those entries where the number of reviews matches the maximum number of reviews for that application, based on the dictionary.

In [27]:
reviews_max = uniq_apps_top_reviews(dataset_google)

<br>

#### `rev_apps` feature:

<br>

The `rev_apps` function filters the dataset to include only those apps that have the maximum number of reviews as recorded in a given dictionary (`reviews_max`). 

It takes two arguments: 

- the `dataset` and a dictionary (`dict`). 

For each app in the dataset, it checks if the app's name exists in the dictionary and if the review count matches the value in the dictionary. 

If both conditions are met and the app has not been added to the result list yet, it appends the app to `android_clean` and marks the app as added by appending its name to `already_added`. 

Finally, it returns the filtered list of apps.

In [28]:
def rev_apps(dataset, dict):
    data_cleaned = []
    already_added = []
    for app in dataset:
        name = app[0]
        n_reviews = float(app[3])
        
        if ((dict[name] == n_reviews) and (name not in already_added)):
            data_cleaned.append(app)
            already_added.append(name)
    
    return data_cleaned

In [29]:
android_clean = rev_apps(dataset_google,reviews_max) # cleaning all dataset

`android_clean` is the our **new dataset**

In [30]:
explore_data(android_clean, 1, 2, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


In [31]:
explore_data(android_clean, 1, 2, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


We update the status of the datasets
|`dataset_apple`|`android_clean`|
|:----|:----|
|Number of rows: 7197| Number of rows: 9659|
|Number of columns: 16|Number of columns: 13|
|________________________________________________________________________________________________|_______________________________________________________________|

<a href='#0'> back to index</a>
* * *
### <a id='24'>2.4 Removing Non-English Apps</a>

1. **General Objective**: Our company develops applications in English and we are interested in analyzing only those applications aimed at an English-speaking audience.

2. **Identified Problem**: Upon examining the data, we have found that some application names suggest they are not intended for an English-speaking audience. This is often due to the presence of symbols or characters that are not commonly used in English text.

3. **Proposed Strategy**: To address this problem, we will remove all applications whose names contain symbols that are not typically used in English. These symbols include letters from other alphabets, punctuation marks (., !, ?, ;), and other non-alphabetic characters such as (+, *, /).

4. **Detailed Method**:
   - Each character (letter, number, dot, comma, etc.) has a corresponding number that can be obtained using the built-in function `ord()`.
   - For example, the character "a" corresponds to 97, "A" corresponds to 65, and any other non-alphabetic character might have a different value (for instance, "爱" is 29.233).
   - We will use the `ord()` function to get the numerical value of each character in a string.
   - If a character has a number that falls outside the common range for English characters (i.e., not within the range 65-90 for uppercase and 97-122 for lowercase), we will remove that application.

In summary, our strategy is:
- Use `ord()` to get the numerical value of each character in the names of the applications.
- Compare this value with the common range of values for English characters (65-90 for uppercase and 97-122 for lowercase).
- Remove any applications whose names contain at least one character outside this range.

This strategy will ensure that we only analyze applications with names clearly in English.

We use English for the applications we develop in our company, and we would like to analyze only applications aimed at an English-speaking audience. 


However, if we explore the data enough, we will find that **both datasets** have applications with names that suggest they are not aimed at an English-speaking audience.


<br>

#### `english_speak` feature:

<br>

The `english_speak` function checks whether a given string consists only of English characters. 

It iterates through each character in the string, converts it to its ASCII value using `ord(character)`, and returns `True` if all characters have an ASCII value less than or equal to 127 (indicating they are English). 

If any character has an ASCII value greater than 127, it returns `False`.

In [32]:
def english_speak(string):
    for character in string:
        valor_character = ord(character)
        if valor_character > 127:
            return False #not English 
        else:
            return True # is English

In [33]:
english_speak('Instagram')

True

In [34]:
english_speak('爱奇艺PPS')

False

In [35]:
english_speak('Docs To Go™ Free Office Suite')#Is English!

True

In [36]:
english_speak('Instachat 😜😜😜😜') #Is English!

True

If we are going to use the function we have created, we will lose useful data, we saw that the function could not correctly identify certain names of English applications such as **'Docs To Go™ Free Office Suite'** and **'Instachat 😜'** since many English applications will be incorrectly labeled as non-English. 


To **minimize the impact of data loss**,it is necessary to have a basic criterion that helps in the screening so **we will only delete an application if its name is longer than three characters with the corresponding numbers outside the ASCII range**. 

This means that all English apps with up to three emoji or other special characters will still be labeled as English. 

In [37]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

In [38]:
english_speak('Docs To Go™ Free Office Suite')

True

In [39]:
english_speak('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [40]:
english_speak('Instachat 😜 😜')

True

In [41]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in dataset_apple:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

At this point it is important to remember what the names of the datasets are, thus avoiding failures.

|Dataset names|lenght|
|:----|:----|
|android_english|9614|
|ios_english|6183|

<a href='#0'> back to index</a>
* * *

### <a id='25'>2.5 Filter non-English applications from both datasets.</a> 

If an app's name is identified as English, the we add the entire row to a separate list.

<br>

#### `cleaning_dataset` feature:

<br>


The `cleaning_dataset` function cleans the dataset by filtering out rows where the app name is not composed of English characters. 

It takes the dataset as an argument and iterates through each row, starting from the second row (index 1). 

For each row, it extracts the app name and uses the `english_speak` function to check if the name consists only of English characters. 

If the name is valid, the row is appended to `list_clean`. 

Otherwise, it skips to the next row. Finally, it returns the filtered list of rows.

In [42]:
def cleaning_dataset(dataset):
    list_clean = []
    for row in dataset[1:]:
        name = row[1]
        if english_speak(name):
            list_clean.append(row)
        else:
            pass
    return list_clean

In [43]:
# Android

clean_android = cleaning_dataset(android_english)

In [44]:
# AppleStore

clean_ios = cleaning_dataset(ios_english)

<a href='#0'> back to index</a>
* * *
### <a id='26'>2.6 Exploring datasets and see how many rows we have left for each dataset.</a>

In each cleaning process we have put different names to our dataset variables so it is important to take them into account

In [45]:
explore_data(clean_android, 2, 4, True)

['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9613
Number of columns: 13


In [46]:
explore_data(clean_ios, 2, 4, True)

['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 6145
Number of columns: 16


**Now we are working with this:**


|Dataset names|lenght|
|:----|:----|
|`clean_android`|9658|
|`clean_ios`|6154|

<a href='#0'> back to index</a>
* * *
## <a id='3'>3. Isolating Free Apps</a>

Our datasets contain **free and non-free applications**, to isolate only free apps from our datasets and check how many apps we have left for analysis, we need to determine **which number** of column indicates whether an app is free or paid. Then, we will see the length of each dataset to know the exact number of free apps.

Checking the length of each dataset to see how many apps have left.

In [47]:
android_final = []
ios_final = []

for app in clean_android:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in clean_ios:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8863
3189


<br>

#### `free_apple_clean` feature:

<br>

The `free_apple_clean` function filters the dataset to include only those rows where the price (either in the 4th column or the 7th column) is '0.0' or '0'. It checks each row and appends it to `free_apple_clean` if the price meets the criteria. If the price is neither '0.0' nor '0', it skips that row. Finally, it returns the filtered list of rows.

To make it work for both cases (4th column or 7th column), you can modify the function to accept an additional parameter specifying the index of the price column.

**Updating names and lengths**


|Dataset names|lenght|
|:----|:----|
|`clean_ios`|3189|
|`clean_android`|8863|

<a href='#0'> back to index</a>
* * *

### <a id='31'>3.1 Our ultimate goal</a>

As we mentioned in the introduction, our goal is to determine the types of apps that are likely to attract more users because the number of people who use our apps affects our revenue. 

To minimize risks and overhead, our validation strategy for an app idea has three steps: 

- Create a minimum Android version of the app and add it to Google Play.
- If the app has a good response from users, we develop it further.
    - If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


<br>

Because our ultimate goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For example, a profile that works well for both markets could be a productivity app that makes use of gamification.

Our **ultimate goal** is to add the app on both `Google Play` and the `App Store`.

So we need to find profiles of apps that are successful **in both markets**. 

Let's start the analysis by having an idea of **what are the most common genres for each market**. To do this, we will have to build frequency tables for some columns of our datasets.

The columns that can give us the information we need are:

`free_apple_clean[11]` ---> `prime_genre`

`free_android_clean[9]` ---> `Genres`

<br>

#### `freq_table` feature:

<br>

The `freq_table` function creates a frequency table and percentage distribution for the values in a specified column of a dataset. 

It takes two parameters: `dataset`, which is the list of rows, and `column`, which specifies which column to analyze. 

The function counts how many times each value appears in that column and calculates the percentage of total occurrences for each value. 

Finally, it returns a dictionary where each key is a unique value from the specified column, and the corresponding value is its frequency percentage.

In [48]:
def freq_table(dataset, column):
    table = {}
    total = 0
    
    for row in dataset:
        total +=1 # counter give us length.
        value = row[column]
        if value in table:
            table[value] +=1
        else:
            table[value] = 1
        
    table_percentages = {}
    
    for key in table: #table[column value]<--as key] value <--as +1 values added 
        percentage = (table[key] / total ) * 100
        table_percentages[key]=percentage
            
    return table_percentages # {column:times repeated}


- The value of the dictionary we place in `key_val_as_tuple`, tuple along with the key of this


<br>

#### `display_table` feature:

<br>


The `display_table` function takes a dataset and a column index as input, creates a frequency table using the `freq_table` function, sorts the frequency percentages in descending order, and then displays each key-value pair in a formatted manner.

- It first calls the `freq_table` function to get a dictionary where each key is a unique value from the specified column, and the corresponding value is its frequency percentage.
- Then, it creates a list of tuples, where each tuple contains the frequency percentage and the corresponding key.
- The list of tuples is sorted in descending order based on the frequency percentages.
- Finally, it prints each key-value pair in a formatted string.


In [49]:
def display_table(dataset, column):
    table = freq_table(dataset, column) #return diccionary
    table_display = []
    list_table_sorted = []
    for key in table:
        key_val_as_tuple = (table[key], key)   #times repeated & key same variable
        table_display.append(key_val_as_tuple) #append tuple into list
        
    list_table_sorted.append(sorted(table_display, reverse = True))
    table_sorted = sorted(table_display, reverse = True) #sorting list
    for entry in table_sorted: #print every cell on list
        print(entry[1], ':', round(entry[0],2),'%')
   # return list_table_sorted

In [50]:
display_table(clean_ios,11) #apple prime_genere column

Games : 54.87 %
Entertainment : 7.27 %
Education : 6.67 %
Photo & Video : 5.55 %
Utilities : 3.47 %
Productivity : 2.73 %
Health & Fitness : 2.69 %
Music : 2.23 %
Social Networking : 2.0 %
Sports : 1.69 %
Lifestyle : 1.59 %
Shopping : 1.33 %
Weather : 1.11 %
Travel : 0.93 %
News : 0.93 %
Business : 0.86 %
Book : 0.86 %
Reference : 0.83 %
Finance : 0.78 %
Food & Drink : 0.72 %
Navigation : 0.46 %
Medical : 0.34 %
Catalogs : 0.08 %


In the set of free English apps, more than half (58.16%) are games. 

Entertainment apps account for approximately 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, while social networking apps make up 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are more rare. 

Although fun apps are the most numerous, this does not necessarily imply that they also have the greatest number of users—demand might not be the same as supply.

Let's continue by examining the `Genres` and `Category` columns of the Google Play data set (two columns which seem to be related).

In [51]:
display_table(clean_android,1)  # Android Category 

FAMILY : 19.62 %
GAME : 9.62 %
TOOLS : 8.61 %
BUSINESS : 4.35 %
MEDICAL : 4.12 %
PERSONALIZATION : 3.9 %
PRODUCTIVITY : 3.88 %
LIFESTYLE : 3.79 %
FINANCE : 3.59 %
SPORTS : 3.4 %
COMMUNICATION : 3.28 %
HEALTH_AND_FITNESS : 2.99 %
PHOTOGRAPHY : 2.91 %
NEWS_AND_MAGAZINES : 2.6 %
SOCIAL : 2.49 %
TRAVEL_AND_LOCAL : 2.28 %
BOOKS_AND_REFERENCE : 2.27 %
SHOPPING : 2.09 %
DATING : 1.77 %
VIDEO_PLAYERS : 1.69 %
MAPS_AND_NAVIGATION : 1.34 %
FOOD_AND_DRINK : 1.17 %
EDUCATION : 1.07 %
LIBRARIES_AND_DEMO : 0.87 %
AUTO_AND_VEHICLES : 0.87 %
ENTERTAINMENT : 0.83 %
WEATHER : 0.82 %
HOUSE_AND_HOME : 0.76 %
EVENTS : 0.67 %
PARENTING : 0.62 %
ART_AND_DESIGN : 0.61 %
COMICS : 0.57 %
BEAUTY : 0.55 %


<a href='#0'> back to index</a>
* * *
### <a id='32'>3.2 Profiles of apps successful in both markets</a>

In the Google Play landscape, the landscape looks quite different: there are fewer apps focused on entertainment, and many more focus on practical purposes (family, tools, business, lifestyle, productivity, etc.). 

However, exploring this in more detail, we realize that the family category (which accounts for almost 19% of apps) is mostly broken down into children's games.

In [52]:
display_table(android_final, -4)

Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.58 %
Productivity : 3.89 %
Lifestyle : 3.89 %
Finance : 3.7 %
Medical : 3.54 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.25 %
Action : 3.1 %
Health & Fitness : 3.07 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.25 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.86 %
Video Players & Editors : 1.78 %
Casual : 1.75 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.93 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.59 %
Parenting : 0.5 %
Card : 0.44 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music : 0.2 %


### We would like to get an idea about the type of apps with more users.

| Dataset name       | Genres            | percentage   | Category    | Category Percentage |
|:------------------|:-----------------|-------------:|------------:|--------------------:|
| free_apple_clean  | Games            | 54%          |             |                     |
|                   | **Entertainment**    | 7.27%        |             |                     |
|                   | Photo & Video    | 5%           |             |                     |
|                   | Education        | 4%           |             |                     |
|                   | Social Networking| 3%           |             |                     |
| free_android_clean| Tools            | 8.5%           | FAMILY      | 19%                 |
|                   | **Entertainment**    | 6%           | GAME        | 9.5%                |
|                   | Education        | 5%           | BUSINESS    | 4.5%                |
|                   | Business         | 5%           | TOOLS       | 8.4%                |
|                   | Productivity     | 4%           | LIFESTYLE   | 4.0%                |


To determine the most popular genres in terms of user engagement, one approach is to calculate the average number of installs for each app genre. 

For the Google Play data set, this information is available in the Installs column. However, for the App Store data set, the necessary information is missing. As a workaround, we will use the total number of user ratings (found in the `rating_count_tot` column) as a proxy to estimate popularity.

One way to find out which genres are the most popular (have the most users) is to calculate the **average number of installs for each app genre**.

<br>

To gain insights into the types of applications that users are utilizing from both the Google Play and iOS datasets, we will analyze the categories based on predefined classifications.

- **For the App Store (iOS) dataset**, we will take the total number of user ratings as a proxy, which we can find in the `'rating_count'` column. This provides an indication of the popularity and usage of each app category.

- **For the Google Play dataset**, we can find this information in the **installs** column. By examining which categories have the highest number of installs, we can determine the most popular types of apps among Android users.


By analyzing these categories, we can identify which types of apps are most favored by users. For example, if the "Games" category has the highest number of installs in the Google Play dataset, it suggests that games are particularly popular among Android users. Similarly, if the "Games" category has a high `'rating_count'` in the App Store dataset, it indicates that games are a significant and favored category for iOS users.

By comparing the distribution of app categories in both datasets, we can gain a comprehensive understanding of the types of applications that are most utilized across different platforms. This analysis helps us identify trends and preferences, allowing us to make informed conclusions about user behavior and app popularity on each platform.

<br>

<a href='#0'> back to index</a>
* * *

### <a id='4'>4. Most Popular Apps by Genre on Apple Store</a>

Number of user ratings per app genre on the App Store:

In [53]:
genres_ios = freq_table(clean_ios, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Photo & Video : 28441.54375
Games : 22689.146012931036
Music : 57326.530303030304
Social Networking : 44744.12621359223
Reference : 79350.4705882353
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 29721.605263157893
Shopping : 27898.802469135804
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16815.48
Entertainment : 14139.42857142857
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 46384.916666666664
Finance : 32367.02857142857
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


On average, navigation apps receive the highest number of user reviews, though this figure is significantly influenced by Waze and Google Maps, which together have nearly half a million reviews.

In [54]:
for app in clean_ios:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
MotionX GPS : 14970
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
Gaia GPS Classic : 2429
Plane Finder - Flight Tracker : 1438
iMaps+ for Google Maps ™ and Street View ™ : Transit and Offline Contacts : 1225
NAVIGON Europe : 927
Localscope - Find places and people around you : 868
Ski Tracks : 829
TRANSPORT MODS for MINECRAFT Pc EDITION : 754
Pocket Earth PRO Offline Maps & Travel Guides : 748
Ship Finder : 624
Boating USA : 342
Maps 3D PRO - GPS for Bike, Hike, Ski & Outdoor : 280
Cachly - Simple and powerful Geocaching for iPhone : 263
ImmobilienScout24: Real Estate Search in Germany : 187
The JMU Bus App : 35
Avertinoo : 32
iStellar : 30
mySTATE - State College : 26
Road watcher: dash camera, car video recorder. : 10
Streets – Street View Browser : 10
Railway Route Search : 5
parkOmator – for Apple Watch meter expiration timer, notifications & GPS navigator t

The same pattern applies to social networking apps and music apps, where a few dominant players like `Facebook`, `Pinterest`, `Skype`, `Pandora`, `Spotify`, and `Shazam` heavily influence the average number of user reviews. 

Our goal is to identify popular genres accurately, but the average ratings for navigation, social networking, or music apps appear higher than they should due to a few very popular apps with hundreds of thousands of reviews, while other apps may have fewer than 10,000 reviews. 

To get a clearer picture, we could remove these extremely popular apps and recalculate the averages, but we'll address this level of detail later.

In [55]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
Jishokun-Japanese English Dictionary & Translator : 0


This niche seems to show some potential. One approach could be to take another popular book and turn it into an app by adding various features such as daily quotes, an audio version of the book, quizzes, and even a built-in dictionary. This would allow users to look up words directly within the app without needing to exit.

This idea aligns well with the fact that the App Store is dominated by for-fun apps, suggesting that the market might be somewhat saturated with such apps. Therefore, a practical app could have more of a chance to stand out among the numerous options available on the App Store.

Other popular genres include weather, book, food and drink, or finance. While the book genre aligns well with our app idea, other genres like weather, food and drink, and finance do not seem as promising:

- Weather apps generally don't keep users engaged for long, and making a profit through in-app ads is unlikely. Additionally, reliable live weather data may require using non-free APIs.
- Food and drink apps, such as Starbucks or McDonald's, require actual cooking and delivery services, which are outside our company's scope.
- Finance apps involve banking, bill payment, and money transfer, requiring domain knowledge that we don't want to obtain by hiring a finance expert.

Now let's analyze the Google Play market.

In [56]:
freq_table(clean_ios, 11)

{'Photo & Video': 5.549227013832384,
 'Games': 54.873881204231076,
 'Music': 2.229454841334418,
 'Social Networking': 2.001627339300244,
 'Reference': 0.8299430431244915,
 'Health & Fitness': 2.6851098454027666,
 'Weather': 1.1065907241659887,
 'Utilities': 3.466232709519935,
 'Travel': 0.9275834011391376,
 'Shopping': 1.3344182262001627,
 'News': 0.9275834011391376,
 'Navigation': 0.4556550040683482,
 'Lifestyle': 1.594792514239219,
 'Entertainment': 7.274206672091131,
 'Food & Drink': 0.7160292921074044,
 'Sports': 1.6924328722538649,
 'Book': 0.8624898291293734,
 'Finance': 0.7811228641171685,
 'Education': 6.672091131000814,
 'Productivity': 2.7339300244100895,
 'Business': 0.8624898291293734,
 'Catalogs': 0.08136696501220504,
 'Medical': 0.3417412530512612}

### - sorting Genres on App Store 

<a href='#0'> back to index</a>
* * *

## <a id='5'>5. Most Popular Apps by Genre on Android</a>

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

**Our aim** is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. 

**The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold**. 


We could get a **better picture by removing these extremely popular apps** for each genre and then rework the averages, but we'll leave this level of detail for later.

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com which skew up the average rating:

In [57]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 2021626.7857142857
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1704192.3399014778
COMICS : 817657.2727272727
COMMUNICATION : 38326063.197916664
DATING : 854028.8303030303
EDUCATION : 1768500.0
ENTERTAINMENT : 9146923.076923076
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4167457.3602941176
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 12914435.883748516
FAMILY : 5180161.789906103
MEDICAL : 123064.7898089172
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 4274688.722772277
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16772838.591304347
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24790074.17721519
NEWS_AND_MAGAZINES : 

On average, communication apps have the most installs, with an average of 38,456,119. However, this figure is significantly skewed by a few dominant apps that have over one billion installs (such as WhatsApp, Facebook Messenger, Skype), and a handful of others with over 100 million and 500 million installs. This suggests that while communication apps are popular in general, the majority of their user base is concentrated among just a few highly successful applications. 

Given this information, it might be more meaningful to focus on the remaining communication apps that have fewer installs to identify potential areas for new or improved apps that could better meet specific user needs and stand out from the highly competitive landscape dominated by giants like WhatsApp and Facebook Messenger.

In [58]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
Gmail : 1,000,000,000+
imo beta free calls and text : 100,000,000+
imo free video calls and chat : 500,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
WhatsApp Messenger : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Hangouts : 1,000,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have over 100 million installs, the average would be reduced roughly ten times:

In [59]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3593510.3486590036

We see the same pattern in the video players category, which ranks second with 24,727,872 installs. This market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The same pattern repeats for social apps (with giants like Facebook, Instagram, and Google+), photography apps (such as Google Photos and popular photo editors), and productivity apps (including Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

The main concern is that these app genres might seem more popular than they actually are. Moreover, these niches are heavily dominated by a few giants that are difficult to compete against.

While the game genre appears popular, we previously found it to be somewhat saturated, so we would like to consider a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. This is interesting to explore in more depth since we believe this genre has potential for being profitable on both the App Store and Google Play. 

Let's take a look at some of the apps from this genre and their number of installs.

En el panorama de Google Play, la apariencia del paisaje parece bastante diferente: hay menos aplicaciones enfocadas en el entretenimiento, y muchas más se centran en propósitos prácticos (familia, herramientas, negocios, estilo de vida, productividad, etc.). Sin embargo, al explorar esto con más detalle, nos damos cuenta de que la categoría de familia (que representa casi el 19% de las aplicaciones) se desglosa principalmente en juegos para niños.

In [60]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The book and reference genre includes a variety of apps, such as software for processing and reading eBooks, various collections of libraries, dictionaries, tutorials on programming or languages, and more. However, it seems that there are still a few extremely popular apps that skew the average number of installs.


For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):



In [61]:

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps in this market, so it still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1 million and 100 million downloads).

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

This niche is dominated by software for processing and reading eBooks, as well as various collections of libraries and dictionaries. Given this competition, it might not be ideal to develop similar apps because there will be significant competition from existing popular titles.

Additionally, we observe that many apps are built around the book Quran, indicating that creating an app around a popular or recent book can be profitable. It appears that turning a popular book (perhaps a more recent one) into an app could be beneficial for both the Google Play and App Store markets.

However, since the market is already saturated with libraries, we need to add some unique features to differentiate our app. These could include daily quotes from the book, an audio version of the book, quizzes based on the book, or a forum where users can discuss the book.

<a href='#0'> back to index</a>
* * *

### <a id='6'>6. Conclusions</a>

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets. 

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. 

The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.