<a href="https://colab.research.google.com/github/anderson-ferreira-83/Data_Science_Repo_anderson83/blob/main/1_Alura_Voz/Week_1_data_cleaning/p1_Cleaning_for_git.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1 - Data Cleaning

In [1]:
import os # imports the os module for operating system related functionalities
import sys # imports the sys module for system-specific parameters and functions

In [None]:
!pip install --upgrade gdown


In [13]:
str_data_telco_cust_churn_file='1DPIC3QOFiKuYpBnnjOfUcLCfmoqOBRIZ'
# Recent gdown releases accept the file ID directly; the `--id` option is deprecated
!gdown $str_data_telco_cust_churn_file


Downloading...
From: https://drive.google.com/uc?id=1DPIC3QOFiKuYpBnnjOfUcLCfmoqOBRIZ
To: /content/Telco-Customer-Churn.json
100% 3.81M/3.81M [00:00<00:00, 136MB/s]


Welcome to the Data Science Challenge Notebook - Week 1!

In this notebook, we will clean and process the data obtained from the API of Alura Voz, a telecommunications company.

<p align = 'center'>
<img src = 'https://i.imgur.com/8LTNXxF.jpg'>
</p>

### Importing the Data from the API

The first step in processing the data is to import the necessary libraries. We will use the `pandas` and `numpy` libraries. The documentation for both can be accessed below:

 - [Pandas documentation](https://pandas.pydata.org/docs/)
 - [Numpy documentation](https://numpy.org/doc/stable/)

In [4]:
import pandas as pd # imports the pandas library for data manipulation and analysis
import numpy as np # imports the numpy library for numerical computing

To read a JSON file, use the `pd.read_json()` function, passing the file path as an argument.

*This procedure is demonstrated in the lesson [Loading data](https://cursos.alura.com.br/course/python-pandas-tecnicas-avancadas/task/91739) from the course [Python Pandas: técnicas avançadas](https://cursos.alura.com.br/course/python-pandas-tecnicas-avancadas)*

In [None]:
data = pd.read_json("Telco-Customer-Churn.json") # reads data from a JSON file into a pandas DataFrame
data.head() # displays the first few rows of the DataFrame

### Exploring the content of each column

The columns **customer, phone, internet,** and **account** hold several key-value pairs in each cell, which makes them hard to analyze just by looking at the table. Let's inspect the first element of each of these columns to understand them better.

In [None]:
data.customer[0] # accesses and displays the first element of the 'customer' column

In [None]:
data.phone[0] # accesses and displays the first element of the 'phone' column

In [None]:
data.internet[0] # accesses and displays the first element of the 'internet' column

In [None]:
data.account[0] # accesses and displays the first element of the 'account' column

We noticed that the elements in the columns **customer, phone, internet,** and **account** are dictionaries and contain a lot of condensed information. As they are currently organized, it is very difficult to perform any analysis, so it will be necessary to transform each piece of information into a new column in the DataFrame.
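Inspecting only the first element does not guarantee that every row has the same shape. As a minimal sketch on a hypothetical two-row DataFrame (the values and the `toy` name are illustrative, not taken from the dataset), the check below confirms that every cell in a column is a dictionary and lists all the keys that appear in it:

```python
import pandas as pd

# Hypothetical rows shaped like the raw data: one flat column, one column of dictionaries
toy = pd.DataFrame([
    {"customerID": "0001-A", "customer": {"gender": "Female", "tenure": 9}},
    {"customerID": "0002-B", "customer": {"gender": "Male", "tenure": 4}},
])

# True only if every cell in the column is a dictionary
all_dicts = bool(toy["customer"].map(lambda cell: isinstance(cell, dict)).all())

# Every key that appears in any cell of the column
keys = sorted({key for cell in toy["customer"] for key in cell})

print(all_dicts)  # True
print(keys)       # ['gender', 'tenure']
```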

### Normalizing the data in each column

To transform the data into new columns, we will use the `pd.json_normalize()` function. It maps each dictionary key to a new column, and the values stored under that key become the rows of that column.

We need to perform this procedure for each of the columns **customer, phone, internet,** and **account,** storing the result in variables to be merged later.

*This procedure is demonstrated in the lesson [Transforming JSON Data to a Table](https://cursos.alura.com.br/course/python-pandas-tecnicas-avancadas/task/91745) from the course [Python Pandas: técnicas avançadas](https://cursos.alura.com.br/course/python-pandas-tecnicas-avancadas)*
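To make the behavior concrete before applying it to the real columns, here is a small sketch on hypothetical records (the values are illustrative). It shows that nested dictionaries are flattened into columns whose names join the keys with a dot, the default separator:

```python
import pandas as pd

# Hypothetical records with one nested dictionary under "Charges"
records = [
    {"Contract": "One year", "Charges": {"Monthly": 65.6, "Total": "593.3"}},
    {"Contract": "Month-to-month", "Charges": {"Monthly": 59.9, "Total": "542.4"}},
]

flat = pd.json_normalize(records)

# Nested keys become "Charges.Monthly" and "Charges.Total"
print(sorted(flat.columns))
print(flat.shape)  # (2, 3)
```

This is why the flattened charge columns are referred to later as `Charges.Monthly` and `Charges.Total`.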

In [None]:
customer_data = pd.json_normalize(data.customer) # normalizes the 'customer' column and stores it in a new DataFrame
customer_data # displays the new DataFrame

In [None]:
phone_data = pd.json_normalize(data.phone) # normalizes the 'phone' column and stores it in a new DataFrame
phone_data # displays the new DataFrame

In [None]:
internet_data = pd.json_normalize(data.internet) # normalizes the 'internet' column and stores it in a new DataFrame
internet_data # displays the new DataFrame

In [None]:
account_data = pd.json_normalize(data.account) # normalizes the 'account' column; nested keys are joined with '.', e.g. 'Charges.Monthly'
account_data # displays the new DataFrame

### Combining all normalizations

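To combine the normalized DataFrames into a single one, we use the `pd.concat()` function with `axis=1`, which places them side by side.

The function below normalizes each column of JSON objects and concatenates the results with the first two columns of the original DataFrame.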

*This procedure is demonstrated in the lesson [Stacking DataFrames](https://cursos.alura.com.br/course/python-pandas-tecnicas-avancadas/task/91755) from the course [Python Pandas: técnicas avançadas](https://cursos.alura.com.br/course/python-pandas-tecnicas-avancadas)*
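One detail worth keeping in mind is that `pd.concat()` with `axis=1` aligns rows by index label, not by position. The sketch below uses hypothetical toy DataFrames (names and values are illustrative) to show both the aligned case and what happens when the indexes differ:

```python
import pandas as pd

left = pd.DataFrame({"customerID": ["0001-A", "0002-B"], "Churn": ["No", "Yes"]})
right = pd.DataFrame({"tenure": [9, 4]})

# Same index (0, 1) on both sides: rows line up one to one
combined = pd.concat([left, right], axis=1)
print(combined.shape)  # (2, 3)

# Different index (1, 2): the result is the union of labels, with gaps filled by NaN
right_shifted = pd.DataFrame({"tenure": [9, 4]}, index=[1, 2])
misaligned = pd.concat([left, right_shifted], axis=1)
print(misaligned.shape)  # (3, 3)
```

In this notebook the indexes match because every normalized DataFrame is built from the same rows in the same order.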

In [None]:
# Function to normalize the JSON columns and combine them into a single DataFrame
def normalize_json(dataframe):
    id_columns = list(dataframe.columns[:2])    # first two columns are kept as they are
    json_columns = list(dataframe.columns[2:])  # remaining columns hold dictionaries
    # Normalize each JSON column into its own DataFrame
    normalized = [pd.json_normalize(dataframe[column]) for column in json_columns]
    # Place the untouched columns and the normalized DataFrames side by side
    return pd.concat([dataframe[id_columns], *normalized], axis=1)

In [None]:
# Applying the normalize_json function to the data
data = normalize_json(data)
data

Using the `info()` method, we can view all the columns that were generated from the concatenation of the DataFrames.

In [None]:
# Displaying information about the DataFrame
data.info()

Let's use the `value_counts()` method on each of the columns to identify possible categories with incorrect or inconsistent names.

In [None]:
# Looping through each column and printing the value counts
for col in data.columns:
    print('---')
    print(data[col].value_counts())

The Churn variable contains an unnamed category (an empty string), which represents missing data. Rows without a Churn value carry no useful information for the analysis, so we should remove them from the dataset.

In [None]:
# Printing the value counts for the 'Churn' column
data['Churn'].value_counts()
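Reading the `value_counts()` output of every column by eye can miss blank categories. As a complementary sketch, not part of the original workflow, the snippet below counts empty or whitespace-only entries per column on a hypothetical toy DataFrame (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn": ["No", "", "Yes", "No"],
    "Charges.Total": ["593.3", " ", "280.85", "1237.85"],
    "tenure": [9, 0, 4, 13],
})

# Cast to text, strip surrounding whitespace, and count what is left empty
blank_counts = {
    col: int(toy[col].astype(str).str.strip().eq("").sum())
    for col in toy.columns
}

print(blank_counts)  # {'Churn': 1, 'Charges.Total': 1, 'tenure': 0}
```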

To remove the rows with an empty category, we select only the rows where the Churn column is not empty (''). We store the result back in the variable `data` and reset the index.

*This procedure is demonstrated in the lesson [Selection frequencies](https://cursos.alura.com.br/course/introducao-python-pandas/task/40991) from the course [Python Pandas: Handling and Analyzing Data](https://cursos.alura.com.br/course/introducao-python-pandas)*

In [None]:
# Keeping only the rows where 'Churn' is not empty, then rebuilding the index
data = data[data['Churn'] != '']
data = data.reset_index(drop=True)

After running the code, we can confirm that the Churn variable no longer has an empty category.

In [None]:
# Printing the value counts for the 'Churn' column
data['Churn'].value_counts()
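Resetting the index after filtering matters because a boolean filter keeps the original row labels, which leaves gaps. A minimal sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({"Churn": ["No", "", "Yes", ""]})

# Filtering keeps the original labels of the surviving rows
kept = toy[toy["Churn"] != ""]
print(kept.index.tolist())  # [0, 2]

# reset_index(drop=True) renumbers the rows and discards the old labels
kept = kept.reset_index(drop=True)
print(kept.index.tolist())  # [0, 1]
```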

Another column with blank values (' ') is **Charges.Total**. This column is related to **Charges.Monthly** and **tenure**.

The **tenure** column represents the number of months the customer has subscribed to the service. The **Charges.Monthly** column represents the customer's monthly expenses, and **Charges.Total** is the total amount spent, which corresponds to **Charges.Monthly** multiplied by **tenure**.

Let's select all rows where tenure = 0, that is, customers who subscribed to the service for 0 months, and show the results for the columns **Charges.Total** and **Charges.Monthly**.

In [None]:
data.query('tenure == 0')[['Charges.Total', 'Charges.Monthly', 'tenure']]

We observed that when tenure = 0, the data in **Charges.Total** is empty (' ').

Now let's select the data where **Charges.Total** = ' ', showing the results for **Charges.Monthly** and tenure.

In [None]:
# Selecting all rows where the "Charges.Total" column is blank.
data[data['Charges.Total'] == ' '][['Charges.Total', 'Charges.Monthly', 'tenure']]

Every row with a blank **Charges.Total** belongs to a customer who has not yet completed one month of service (tenure = 0). We fill these values with the amount in **Charges.Monthly**, since for these customers it represents the total.

In [None]:
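# Index labels of the rows where "Charges.Total" is blank
idx = data[data['Charges.Total'] == ' '].index
# Copying the monthly charge into the blank totals
data.loc[idx, "Charges.Total"] = data.loc[idx, "Charges.Monthly"]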

In [None]:
data.query('tenure == 0')[['Charges.Total', 'Charges.Monthly', 'tenure']]

Next, let's convert the **Charges.Total** column to float, since it was stored as an object (text) column.

In [None]:
data['Charges.Total'] = data['Charges.Total'].astype('float64')
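As an alternative sketch, not the approach used in this notebook, the same result can be reached by converting first with `pd.to_numeric(..., errors="coerce")`, which turns unparseable text such as `' '` into `NaN`, and then filling the gaps from the monthly charges. The Series below are hypothetical toy values:

```python
import pandas as pd

totals = pd.Series(["593.3", " ", "280.85"])  # text column with one blank entry
monthly = pd.Series([65.6, 56.05, 73.9])      # matching monthly charges

# Blank or otherwise invalid strings become NaN instead of raising an error
as_numbers = pd.to_numeric(totals, errors="coerce")
print(int(as_numbers.isna().sum()))  # 1

# Fill each missing total with the monthly charge from the same row
filled = as_numbers.fillna(monthly)
print(filled.dtype)  # float64
```

One trade-off of this variant is that `errors="coerce"` silently converts any malformed value to `NaN`, so it is worth checking how many values were coerced before filling them.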

Finally, let's store the processed data in the file `Telco-Customer-Churn-clean.json` using the `to_json()` method.

pandas can also write other file formats, for example CSV with the `to_csv()` method.

In [None]:
data.to_json('Telco-Customer-Churn-clean.json')
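For the CSV option, a minimal sketch on a hypothetical toy DataFrame is shown below. It writes to a temporary directory, so the file name `example-clean.csv` is illustrative only, and reads the file back to confirm that the columns survive the round trip:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "Charges.Total": [593.3, 542.4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-clean.csv")
    # index=False avoids writing the row labels as an extra unnamed column
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.columns.tolist())  # ['customerID', 'Charges.Total']
```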