# Data analysis to help developers understand what type of apps are likely to attract more users.

<br>

**Introduction to the Project**



In today's mobile development landscape, it is crucial to understand which types of applications attract more users in both Android and Apple ecosystems. With the goal of providing valuable information to developers, this project focuses on data analysis to identify common characteristics among the applications that receive a higher number of downloads and engagement.

1. **What Does the Project Entail?**

    This data analysis combines information from various sources, such as the Google Play Store and the iTunes App Store, along with usage and performance metrics. Statistical techniques and machine learning algorithms are then used to identify patterns and trends that indicate which types of applications perform better on each platform.

2. **Project Objective**

    The primary objective of this project is to provide developers with precise information about the characteristics that resonate best with users on Android and iOS. By better understanding which types of applications perform well in terms of downloads, engagement, and retention, developers can make more informed decisions about the design and optimization of their own apps.

The first step is to collect and analyze data about the mobile apps available on Google Play and the App Store.

To avoid spending resources on collecting new data, we should first check whether relevant existing data is available at no cost. Fortunately, there are two datasets that seem suitable for our goals:

<br>

One dataset contains data on approximately 10,000 **Android apps from Google Play**; the data was collected in August 2018.

The other dataset contains data on approximately 7,000 **iOS apps from the App Store**; the data was collected in July 2017.

We will open and explore these datasets with the `explore_data()` function.

Parameters:

- `dataset`: A list of lists representing the dataset.

- `start`: An integer giving the starting index of the slice to display.

- `end`: An integer giving the end index (exclusive) of the slice to display.

- `rows_and_columns` (optional): A boolean, `False` by default. If `True`, the function also prints the number of rows and columns.

* * *

The datasets are stored in two CSV (**Comma-Separated Values**) files. To read their content, we import the `reader` function from Python's `csv` module.
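As a minimal sketch of what `reader` does, the snippet below parses a small in-memory CSV instead of the real files. The sample rows are made up for illustration only.

```python
from csv import reader
from io import StringIO

# Hypothetical two-line CSV standing in for a real file on disk
sample = StringIO('App,Category\n"Notes, Lists & Tasks",PRODUCTIVITY\n')

rows = list(reader(sample))  # each row becomes a list of strings

header, first_app = rows[0], rows[1]
print(header)     # column names
print(first_app)  # the quoted comma stays inside the first field
```

Every value comes back as a string, which is why later steps convert fields such as the review count explicitly.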

In [1]:
! ls -ls

total 336
268 -rw-rw-r-- 1 ion ion 272175 ene 23 18:02  00_For_Loops_Conditionals.ipynb
 52 -rw-rw-r-- 1 ion ion  50707 ene 24 19:51 'Analyzing Mobile App Data.ipynb'
  4 drwxrwxr-x 2 ion ion   4096 ene 23 18:07  datasets
  4 -rw-rw-r-- 1 ion ion    159 ene 23 18:00  enlace_DQ
  4 -rw-rw-r-- 1 ion ion    538 ene 23 18:01  README.md
  4 -rw-rw-r-- 1 ion ion    618 ene 23 16:02  Untitled.ipynb


In [2]:
! ls datasets

 AppleStore.csv  'data dictionary.txt'	 googleplaystore.csv


In [3]:
from csv import reader

## 1. Exploring datasets:

We are going to explore the two datasets, but first we need to know which [character encoding](https://en.wikipedia.org/wiki/Character_encoding) each file uses.

The [file](https://www.man7.org/linux/man-pages/man1/file.1.html) command reports the type of a file regardless of its extension. Knowing the encoding in advance helps us avoid a `UnicodeDecodeError` when reading the files.


```
FILE(1)                                         BSD General Commands Manual                                         FILE(1)

NAME
     file — determine file type

SYNOPSIS
     file [-bcdEhiklLNnprsSvzZ0] [--apple] [--exclude-quiet] [--extension] [--mime-encoding] [--mime-type] [-e testname]
          [-F separator] [-f namefile] [-m magicfiles] [-P name=value] file ...
     file -C [-m magicfiles]
     file [--help]

DESCRIPTION
     This manual page documents version 5.41 of the file command.

     file tests each argument in an attempt to classify it.  There are three sets of tests, performed in this order:
     filesystem tests, magic tests, and language tests.  The first test that succeeds causes the file type to be printed.

     The type printed will usually contain one of the words text (the file contains only printing characters and a few com‐
     mon control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of
     compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is
     usually “binary” or non-printable).  Exceptions are well-known file formats (core files, tar archives) that are known
     to contain binary data.  When modifying magic files or the program itself, make sure to preserve these keywords.
     Users depend on knowing that all the readable files in a directory have the word “text” printed.  Don't do as Berkeley
     did and change “shell commands text” to “shell script”.

```

In [4]:
! file -i datasets/AppleStore.csv

datasets/AppleStore.csv: text/csv; charset=utf-8


In [5]:
! file -i datasets/googleplaystore.csv

datasets/googleplaystore.csv: text/csv; charset=utf-8


Both files use the same charset, [UTF-8](https://en.wikipedia.org/wiki/UTF-8).


Quick idea:

- **charset:** is the set of characters you can use.
- **encoding:** is the way these characters are stored in memory.

[source](https://stackoverflow.com/questions/2281646/whats-the-difference-between-encoding-and-charset)
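To make the distinction concrete, here is a small sketch showing the same characters stored under two encodings, and what happens when bytes are decoded with the wrong codec. The sample string is arbitrary.

```python
text = "Ñandú 爱"

utf8_bytes = text.encode("utf-8")    # same characters, stored as UTF-8 bytes
utf16_bytes = text.encode("utf-16")  # same characters, different byte layout

print(len(text), len(utf8_bytes), len(utf16_bytes))

# Decoding with the wrong codec is what triggers UnicodeDecodeError
try:
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decode_failed)
```

This is why it is worth confirming the encoding before opening the files.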

In [6]:
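# Both files were reported as UTF-8, so we state the encoding explicitly
opened_ios = open('datasets/AppleStore.csv', encoding='utf-8')
opened_android = open('datasets/googleplaystore.csv', encoding='utf-8')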

readed_ios = reader(opened_ios)
readed_android = reader(opened_android)

# The datasets we are going to work with
data_ios = list(readed_ios) 
data_android = list(readed_android)

In [7]:
def explore_data(dataset, start, end, rows_and_columns = False):
    offset = 1
    cnt = 1
    columns = len(dataset[0])
    
    if rows_and_columns == True:
        print("\n")

        for row in dataset[offset+start:end+offset]:
            print("row:",cnt, row)
            print("\n")
            cnt +=1
            
        print("columns ",columns , dataset[0])
        print("\n")
        print('Number of rows:', len(dataset))
        
    else:
        for row in dataset[offset+start:end+offset]:
            print(row)
            print("\n")

In [8]:
explore_data(data_ios, 0, 10, True)



row: 1 ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


row: 2 ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


row: 3 ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


row: 4 ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


row: 5 ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


row: 6 ['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


row: 7 ['282935706', 'Bible', '92774400', 'USD', '0.0', '985920', '5320', '4.5', '5.0', '7.5.1', '

In [9]:
explore_data(data_android, 0, 10, True)



row: 1 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


row: 2 ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


row: 3 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


row: 4 ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


row: 5 ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


row: 6 ['Paper flowers instructions', 'ART

At first glance, everything looks fine.

### Size of both datasets

|`data_ios`|`data_android`|
|:----|:----|
|Number of rows: 7198| Number of rows: 10842|
|Number of columns: 16|Number of columns: 13|

These are the columns we have in the datasets, along with the number of rows.


### iOS data dictionary (`data_ios`)

|Column|Description|
|:--|:--|
|**id**| App ID|
|**track_name**| App Name|
|**size_bytes**| Size (in Bytes)|
|**currency**| Currency Type|
|**price**| Price amount|
|**rating_count_tot**| User Rating counts (for all version)|
|**rating_count_ver**| User Rating counts (for current version)|
|**user_rating**| Average User Rating value (for all version)|
|**user_rating_ver**| Average User Rating value (for current version) |
|**ver**| Latest version code|
|**cont_rating**| Content Rating|
|**prime_genre**| Primary Genre|
|**sup_devices.num**| Number of supporting devices|
|**ipadSc_urls.num**| Number of screenshots showed for display|
|**lang.num**| Number of supported languages|
|**vpp_lic**| Vpp Device Based Licensing Enabled|

### Android data dictionary (`data_android`)

|Column|Description|
|:--|:--|
|**App**| Application name|
|**Category**| Category the app belongs to|
|**Rating**| Overall user rating of the app (as when scraped)|
|**Reviews**| Number of user reviews for the app (as when scraped)|
|**Size**| Size of the app (as when scraped)|
|**Installs**| Number of user downloads/installs for the app (as when scraped)|
|**Type**| Paid or Free|
|**Price**| Price of the app (as when scraped)|
|**Content Rating**| Age group the app is targeted at - Children / Mature 21+ / Adult|
|**Genres**| An app can belong to multiple genres (apart from its main category). For eg, a musical family game will belong to Music, Game, Family genres.|
|**Last Updated**| Date when the app was last updated on Play Store (as when scraped)|
|**Current Ver**| Current version of the app available on Play Store (as when scraped)|
|**Android Ver**| Min required Android version (as when scraped)|

The columns described in the two data dictionaries above are the ones that might help us understand what kind of apps are likely to attract the most users.

## 2. Data cleaning

**We have a number of requirements:**

- Detect inaccurate data and correct or delete it.
- Detect and remove duplicates.

Since we build free applications in English that are simple to download and install, we must **remove the applications that are not free and those that are not in English**.

**Therefore, before the analysis, we perform a data cleaning that involves:**

1. **Detecting inaccurate data, and correcting or removing it.**

2. **Detecting duplicate data, and removing the duplicates.**

3. **Removing non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.**

4. **Removing apps that aren't free.**

<br>

**Detect inaccurate data... remove it.**

The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=votes), and we can see that [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.

If the number of fields in a row differs from the number of fields in the header, some data is missing from that row.

The function `field_number` takes a single parameter, `dataset`, and uses the first row as the header.

It walks through every row and compares its number of fields with the number of fields in the header; if they do not match, **some piece of data is missing** in that row.

In [10]:
def field_number(dataset):
    # Compare the field count of every row with the field count of the header
    header = len(dataset[0])
    for index_number, row in enumerate(dataset):
        if len(row) != header:
            print(row)
            print("\n")
            print("error in index number: ", index_number)
            print("number of columns has an error: ", len(row))

In [11]:
field_number(data_ios)

The `data_ios` dataset has no problems with its columns.

In [12]:
field_number(data_android)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


error in index number:  10473
number of columns has an error:  12


In the Android dataset, however, the function reports a problem:

In row **10473** the number of columns does not match the header, so we must delete that row.

In [13]:
data_android[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [14]:
del data_android[10473]
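Note that `del data_android[10473]` removes whatever row sits at that index, so running the cell a second time would delete a valid app. A minimal sketch of a guarded version follows; `drop_malformed_row` is a hypothetical helper and the toy data is made up.

```python
def drop_malformed_row(dataset, index):
    """Delete dataset[index] only if its field count differs from the header's."""
    expected = len(dataset[0])
    if index < len(dataset) and len(dataset[index]) != expected:
        del dataset[index]
        return True
    return False

# Toy data: the last row is missing one field
toy = [["App", "Category", "Rating"],
       ["A", "TOOLS", "4.1"],
       ["B", "3.9"]]

print(drop_malformed_row(toy, 2))  # first call removes the short row
print(drop_malformed_row(toy, 2))  # second call finds nothing to remove
print(len(toy))
```

With a check like this, re-running the cell leaves the dataset unchanged.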

### Detect duplicate data, and remove the duplicates

We have already checked that every row has the same number of columns as its header in both datasets, so the next step is to check whether any application name is repeated. Note that the function below reads column `0`, which is the app name in the Android dataset but the app `id` in the iOS dataset.

The `for` loop runs through each row and extracts the name of the application:

- It checks whether the name has already been seen.

- If not, the name is added to `unique_apps`; otherwise it is added to `duplicate_apps`.

In [15]:
def detect_duplicated(dataset):
    duplicate_apps = []
    unique_apps = []
    
    for app in dataset:
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    if len(duplicate_apps) == 0:
        print("In this dataset there are no duplicate apps")
    else:
        times = len(duplicate_apps)
        print(f"There are {times} repeated apps")
        print("\n")
        print('Examples of duplicate apps:',"\n"+"\n" , duplicate_apps[:25])

In [16]:
detect_duplicated(data_ios)

In this dataset there are no duplicate apps


In [17]:
detect_duplicated(data_android)

There are 1181 repeated apps


Examples of duplicate apps: 

 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents']


The list above is a sample of the `1181` repeated entries in the dataset. Repeated does not necessarily mean that an app appears exactly twice: some apps appear three or even four times.

<br>

Printing the rows of some of the applications reported as repeated in the previous cell gives an idea of what the repeated entries look like.

### 2.3 What differentiates duplicate applications?

If we take one of these repeated applications as an example, we see that the rows are identical except, at most, for **one field: the number of reviews**.

In [18]:
for app in data_android[1:]:
    if app[0] == 'Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


In [19]:
for app in data_android[1:]:
    if app[0] == 'MailChimp - Email, Marketing Automation':
        print(app)

['MailChimp - Email, Marketing Automation', 'BUSINESS', '4.1', '5448', '12M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 25, 2018', '4.9.1', '5.0 and up']
['MailChimp - Email, Marketing Automation', 'BUSINESS', '4.1', '5448', '12M', '500,000+', 'Free', '0', 'Everyone', 'Business', 'July 25, 2018', '4.9.1', '5.0 and up']


Once we have verified that many applications are repeated, sometimes more than once, the next step is to decide which criterion to follow to remove them.

In this case, the criterion is:

- **The higher the number of reviews, the more recent the data should be!**

<br>

Therefore, instead of removing duplicates at random, we keep only the entry with the highest number of reviews for each repeated application and discard the others.

We start by identifying each application by its name and its number of reviews in the Android dataset.
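The criterion can be sketched on a toy dataset before applying it to the real one. This is a single-pass variant, not the notebook's own functions; `keep_top_reviews` is a hypothetical name and the rows are made up.

```python
def keep_top_reviews(rows, name_idx=0, reviews_idx=3):
    """Return one row per app name: the one with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows laid out like the Android dataset: name, category, rating, reviews
toy = [
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Box", "BUSINESS", "4.2", "159871"],
    ["Slack", "BUSINESS", "4.4", "51510"],
]

for row in keep_top_reviews(toy):
    print(row)
```

When two entries tie on the review count, this sketch keeps the first one it meets.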

In [20]:
android_header = [f"{x} = {i}" for i, x in enumerate(data_android[0])]
android_header

['App = 0',
 'Category = 1',
 'Rating = 2',
 'Reviews = 3',
 'Size = 4',
 'Installs = 5',
 'Type = 6',
 'Price = 7',
 'Content Rating = 8',
 'Genres = 9',
 'Last Updated = 10',
 'Current Ver = 11',
 'Android Ver = 12']

In [21]:
ios_header = [f"{x} = {i}" for i, x in enumerate(data_ios[0])]
ios_header

['id = 0',
 'track_name = 1',
 'size_bytes = 2',
 'currency = 3',
 'price = 4',
 'rating_count_tot = 5',
 'rating_count_ver = 6',
 'user_rating = 7',
 'user_rating_ver = 8',
 'ver = 9',
 'cont_rating = 10',
 'prime_genre = 11',
 'sup_devices.num = 12',
 'ipadSc_urls.num = 13',
 'lang.num = 14',
 'vpp_lic = 15']

### Careful with the header row

When the Android dataset goes through this step, the header row is lost.

In [22]:
def uniq_apps_top_reviews(dataset):  # dictionary mapping each app name to its highest review count
    reviews_max = {}  # key = name (column 0), value = reviews (column 3)

    for app in dataset[1:]:  # skip the header row: 'Reviews' cannot be converted to float
        name = app[0]
        n_reviews = float(app[3])
        # note: the cleaned dataset ends up without the header row;
        # the Apple dataset needs the same treatment

        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

        elif name not in reviews_max:
            reviews_max[name] = n_reviews

    print('length of the dictionary:', len(reviews_max))
    return reviews_max

We have created a dictionary that maps the `name` of each application to its maximum number of reviews (`reviews_max`). Next, we remove the entries we do not need.

In [24]:
! pwd

/home/ion/Formacion/git_repo_klone/albertjimrod/Python


In [23]:
salida = uniq_apps_top_reviews(data_android)

In [None]:
def rev_apps(dataset, reviews_max):
    android_clean = []  # note: the header row is skipped, so the result has no header
    already_added = []  # names already kept, to handle ties in the review count
    for app in dataset[1:]:
        name = app[0]
        n_reviews = float(app[3])

        if reviews_max[name] == n_reviews and name not in already_added:
            android_clean.append(app)
            already_added.append(name)

    return android_clean

In [None]:
reviews_max = uniq_apps_top_reviews(data_android)
reviews_max

In [None]:
dict_names_maxreviews = uniq_apps_top_reviews(data_android) # diccionario con el nombre de las aplicaciones con su máximo review

In [None]:
android_clean = rev_apps(data_android,dict_names_maxreviews) # cleaning all dataset

Once we have cleaned the dataset, we will check the absence of repeated applications in both datasets.

In [None]:
header_android_clean = [f"{x} = {i}" for i, x in enumerate(data_android[0])]  # android_clean keeps the same columns but has no header row
header_android_clean

In [None]:
detect_duplicated(android_clean)

Let's check whether any repeated applications remain in the `data_ios` dataset.

In [None]:
detect_duplicated(data_ios)

Updated status of the datasets:

|`data_ios`|`android_clean`|
|:----|:----|
|Number of rows: 7198| Number of rows: 9659|
|Number of columns: 16|Number of columns: 13|

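### Where the 1183 comes from

The row count drops from 10842 in `data_android` to 9659 in `android_clean`, a difference of 1183. That figure is made up of the 1181 repeated entries, the malformed row deleted earlier, and the header row, which `rev_apps` does not copy. We now have an Android dataset with **9659** different apps in several languages, so the next step is to filter those apps by language.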

### Removing Non-English Apps

1. **General Objective**: Our company develops applications in English and we are interested in analyzing only those applications aimed at an English-speaking audience.

2. **Identified Problem**: Upon examining the data, we have found that some application names suggest they are not intended for an English-speaking audience. This is often due to the presence of symbols or characters that are not commonly used in English text.

3. **Proposed Strategy**: To address this problem, we will remove the applications whose names contain characters that are not typically used in English text. English letters, digits, punctuation marks (., !, ?, ;) and common symbols (+, *, /) all belong to the ASCII set; letters from other alphabets, emoji and similar symbols do not.

4. **Detailed Method**:
   - Each character (letter, number, dot, comma, etc.) has a corresponding number that can be obtained using the built-in function `ord()`.
   - For example, the character "a" corresponds to 97, "A" corresponds to 65, and characters from other writing systems have much larger values (for instance, "爱" is 29233).

   - We will use the `ord()` function to get the numerical value of each character in a string.
   - If a character has a number greater than 127, it falls outside the ASCII range (0-127) and we count it as non-English. Within that range, uppercase letters occupy 65-90 and lowercase letters 97-122.

In summary, our strategy is:
- Use `ord()` to get the numerical value of each character in the names of the applications.
- Compare this value with the ASCII range (0-127), which covers the characters commonly used in English text.
- Remove any application whose name contains more than two characters outside this range, so that a name with a single symbol such as ™ or an emoji is not discarded.

This strategy is not perfect, but it should leave us with applications whose names are mostly written in English.
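To make the `ord()` values concrete, here is a small sketch that prints the code point of a few sample characters and whether each one falls inside the ASCII range (0 to 127). The sample list is arbitrary.

```python
samples = ["A", "a", "z", "+", "!", "™", "爱", "😜"]

# Map each character to its code point, as returned by ord()
code_points = {character: ord(character) for character in samples}

for character, value in code_points.items():
    kind = "ASCII" if value <= 127 else "non-ASCII"
    print(character, value, kind)
```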

### Code Explanation

The function `english_speak(text)` checks if a given text is primarily composed of English characters. It does this by:
1. **Checking Each Character**: For each character in the input text.
2. **Determining Ordinal Value**: Using the `ord()` function to get the ordinal value of each character.
3. **Marking Non-English Characters**: If the ordinal value is greater than 127, it marks that character as non-English.
4. **Returning Result**:
   - Returns `False` if there are more than two non-English characters.
   - Returns `True` otherwise.

### Examples

#### Example 1: "hello"
- Each character is checked: `'h'`, `'e'`, `'l'`, `'l'`, `'o'`.
- All characters have ordinal values between 33 and 126, so it returns `True`.

#### Example 2: "¡hola"
- Each character is checked: `'¡'`, `'h'`, `'o'`, `'l'`, `'a'`.
- `'¡'` has an ordinal value of 161 (greater than 127), so it is marked as non-English.
- Only one character is marked, so it returns `True`.

#### Example 3: "我爱你"
- Each character is checked: `'我'`, `'爱'`, `'你'`.
- All three characters have ordinal values greater than 127, so they are marked as non-English.
- Three characters are marked, which is more than two, so it returns `False`.

### Summary

- The function checks if there are more than two non-English characters in the text.
- If there are more than two, it returns `False`.
- Otherwise, it returns `True`.

This way, the function ensures that only texts with primarily English characters are considered.

In [None]:
def english_speak(text):
    non_validchar = [] #aux list
    for character in text:
        valor_character = ord(character)
        if valor_character > 127:
            non_validchar.append(valor_character)
    if len(non_validchar) >= 3:
        return False
    else:
        return True

Recall which index holds the app name in each dataset: `0` in the Android dataset and `1` in the iOS dataset.

In [None]:
english_speak('Docs To Go™ Free Office Suite')

In [None]:
english_speak('Instachat 😜 😜')

In [None]:
english_speak('爱奇艺PPS -《欢乐颂2》电视剧热播')

In [None]:
english_speak('Instachat 😜 😜 😜')
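The four calls above can be checked with a compact, self-contained variant of the function. `is_english` is a hypothetical name; its logic is meant to mirror `english_speak`, which rejects a name once it has three or more characters above 127.

```python
def is_english(text, max_non_ascii=2):
    """True when at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in text if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Instachat 😜 😜 😜',
]

for name in names:
    print(is_english(name), name)
```

With this threshold, a name with one trademark sign or two emoji is kept, while a name with three emoji is dropped.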

### Filter non-English applications from both datasets

In [None]:
header_clean_android = [f"{x} = {i}" for i, x in enumerate(data_android[0])]
header_clean_android

In [None]:
header_clean_ios = [f"{x} = {i}" for i, x in enumerate(data_ios[0])]
header_clean_ios

In [None]:
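def cleaning_dataset(dataset, name_index=1, has_header=True):
    # name_index: column holding the app name (0 for Android, 1 for iOS)
    # has_header: set to False when the first row is already an app
    rows = dataset[1:] if has_header else dataset
    list_clean = []
    for row in rows:
        name = row[name_index]
        if english_speak(name):
            list_clean.append(row)
    return list_clean

In [None]:
# Google Play: the name is in column 0 and rev_apps() already dropped the header

clean_android = cleaning_dataset(android_clean, name_index=0, has_header=False)

In [None]:
# AppleStore

clean_ios = cleaning_dataset(data_ios)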

In [None]:
# The cleaned lists carry no header row, so the column names come from the original datasets
header_clean_android = [f"{x} = {i}" for i, x in enumerate(data_android[0])]
header_clean_android

In [None]:
header_clean_ios = [f"{x} = {i}" for i, x in enumerate(data_ios[0])]
header_clean_ios

### 2.6 Exploring the datasets to see how many rows remain in each one


In [None]:
explore_data(clean_ios, 0, 5, True)

In [None]:
explore_data(clean_android, 0, 5, True)

Updated status of the datasets:

|`clean_ios`|`clean_android`|
|:----|:----|
|Number of rows: 9659| Number of rows: 6156|
|Number of columns: 16|Number of columns: 13|

## 3. Isolating Free Apps

Our datasets contain both **free and non-free applications**, so we need to isolate the free ones for our analysis.

Afterwards, we check the length of each dataset to see how many apps are left.

In [None]:
# The cleaned lists carry no header row, so the column names come from the original datasets
header_clean_android = [f"{x} = {i}" for i, x in enumerate(data_android[0])]
header_clean_android

In [None]:
header_clean_ios = [f"{x} = {i}" for i, x in enumerate(data_ios[0])]
header_clean_ios


In [None]:
def free_apps(dataset, number_col):  # the price column depends on the dataset (7 for Android, 4 for iOS)
    free_clean = []

    for row in dataset:
        prix = row[number_col]
        if prix == '0.0' or prix == '0':  # prices are strings, not numbers
            free_clean.append(row)
    return free_clean

In [None]:
print(len(free_apps(clean_android, 7)))
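As a quick illustration of why the price column index matters, here is a sketch on toy rows. The rows and the `is_free` helper are hypothetical; only the price positions (7 for Android, 4 for iOS) and the free-price strings (`'0'` and `'0.0'`) come from the datasets shown earlier.

```python
def is_free(price):
    """Prices are stored as strings: '0' on Google Play, '0.0' on the App Store."""
    return price in ("0", "0.0")

# Toy rows: only the position of the price column matters here
android_rows = [["App A"] + [""] * 6 + ["0"],
                ["App B"] + [""] * 6 + ["$4.99"]]
ios_rows = [["1", "App C", "100", "USD", "0.0"],
            ["2", "App D", "100", "USD", "2.99"]]

free_android = [row for row in android_rows if is_free(row[7])]
free_ios = [row for row in ios_rows if is_free(row[4])]

print(len(free_android), len(free_ios))
```

Comparing strings avoids having to strip the currency symbol from paid Android prices before filtering.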