
## Part 3 - Model Analysis

In [None]:
#
import os
import sys

In [None]:
#
import pandas as pd

Having processed and analyzed the data, we can now build classification models that will be useful for Alura Voz. Among the models we can create are SVC, Decision Tree and Random Forest.

### Importing libraries

For this application we will use `pandas`, `seaborn`, `sklearn`, `imblearn` and `sys`. To learn more about the sklearn and imblearn libraries, see the documentation:
- [Scikit Learn](https://scikit-learn.org/stable/); and
- [Imbalanced Learn](https://imbalanced-learn.org/stable/).

In [None]:
#
str_utils = '1DJEF0jli6eQixbcz-ARBX7X5d9ojoP4J'

In [None]:
#
!gdown --id $str_utils

In [None]:
#
from utils import plot_countplot,plot_matrix_confusion, compare_models_metrics

In [None]:
#
import pandas as pd
import seaborn as sns
#
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#
sns.set_theme(style="darkgrid")
SEED = 42

### Applying encoder

The dataset is read with the pandas `read_json` function.

In [None]:
#
str_data_telco_cust_chrun_clean_file = '1--WyhFs-oY4xjPXyHd5Qatpio-2Z4zFa'

In [None]:
#
!gdown --id $str_data_telco_cust_chrun_clean_file

In [None]:
#
data = pd.read_json("Telco-Customer-Churn-clean.json")
data.head()

Next, we remove the columns that do not help the analysis we want to perform, using the `drop()` method from the pandas library.

Two columns will be removed:

* `customerID` column: its value is unique for each row and carries no information relevant to the analysis, so we can remove it; and
* `Charges.Total` column: its value is approximately `Charges.Monthly` multiplied by `tenure`, so it largely duplicates information already present in those two columns.
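One way to check this kind of redundancy is to compare the product of the two columns with the total. The sketch below uses a small made-up frame with the same column names; the values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy rows reusing the dataset's column names (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 60],
    "Charges.Monthly": [30.0, 55.0, 80.0, 100.0],
    "Charges.Total": [30.0, 650.0, 1900.0, 6050.0],
})

# Total implied by the other two columns.
approx_total = toy["tenure"] * toy["Charges.Monthly"]

# A correlation close to 1 suggests the column adds little new information.
corr = approx_total.corr(toy["Charges.Total"])
print(round(corr, 3))
```

The same two lines can be run on `data` before the `drop()` call to see how close the relationship is in the real dataset.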

In [None]:
#
data.drop(['customerID', 'Charges.Total'], axis=1, inplace=True)

Let's print the unique values of each categorical column that has more than two classes, to see which treatments are needed and where to apply them.

In [None]:
#
for i in data.select_dtypes(include=['object']).columns:
    if len(data[i].unique()) > 2:
        print(f"{i}: {data[i].unique()}")

Some columns contain the classes `No phone service` and `No internet service`, which are equivalent to the class `No`: in both cases the service is absent. We will treat these classes as `No` to avoid duplicate information. Since only two results remain, `Yes` and `No`, we will replace them with the binary values 1 and 0. The same mapping also converts `Male` and `Female` to 0 and 1.

In addition, the columns `PaymentMethod`, `Contract` and `InternetService` have more than two categories, so we will encode them separately.

In [None]:
#
cols = ['PaymentMethod', 'Contract', 'InternetService']
#
data2 = data.drop(cols, axis=1)
#
data2.columns

In [None]:
#
dictionary = {'No internet service':0,
              'No phone service': 0,
              'No': 0,
              'Yes': 1,
              'Male':0,
              'Female':1}

In [None]:
#
data2 = data2.replace(dictionary)
#
data2.head()

There are several ways to create encoding, two of which are Label Encoding and One-Hot Encoding.

#### Types of encoding

* `Label Encoding` - Replaces each class with an integer, one per class (scikit-learn's `LabelEncoder` uses 0 to n-1, where n is the number of classes). The integers suggest an order between classes, which may not exist in the data.

* `One-Hot Encoding` - Transforms the variable into n binary columns, where n is the number of classes. All classes are treated equally: for each row, the column of the class that occurs has the value 1 and all the other created columns have the value 0.
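The difference between the two encodings can be seen on a toy column. This is a minimal sketch: the values imitate the `InternetService` classes but are typed by hand. Note that scikit-learn documents `LabelEncoder` as a tool for target labels; it is used here only to illustrate the idea.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hand-typed values imitating the 'InternetService' column.
values = np.array(["DSL", "Fiber optic", "No", "DSL"])

# Label encoding: one integer per class (classes are sorted alphabetically).
label_codes = LabelEncoder().fit_transform(values)
print(label_codes)  # [0 1 2 0]

# One-hot encoding: one binary column per class, a single 1 per row.
one_hot = OneHotEncoder(dtype=int).fit_transform(values.reshape(-1, 1)).toarray()
print(one_hot)
```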

In our case, we will choose the method that transforms variables into binary columns. To learn more about this method, see the documentation.

- [OneHotEncoder documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

It is also possible to prepare this form of encoding with pandas [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html). If you want to know more about this method and the first one, we recommend reading the article [Pandas Get Dummies (One-Hot Encoding) – pd.get_dummies()](https://amiradata.com/pandas-get-dummies/).

Feel free to test both ways.
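As a starting point for the pandas route, here is a minimal sketch with `get_dummies` on a toy `Contract` column (hand-typed values, not the real data). With the default settings, the generated column names follow the `column_class` pattern.

```python
import pandas as pd

# Toy frame with a single categorical column (values typed by hand).
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# One binary column per class; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

Unlike `OneHotEncoder`, `get_dummies` does not store the fitted categories, so it suits one-off transformations better than pipelines that must later encode new data the same way.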

In [None]:
#
ohe = OneHotEncoder(dtype=int)
ohe

In [None]:
#
cols_ohe = ohe.fit_transform(data[cols]).toarray()
cols_ohe

In [None]:
#
import sklearn
print(sklearn.__version__)


In [None]:
# get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
ohe.get_feature_names_out(cols)

In [None]:
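# Reuse data2's index so the one-hot columns align row by row in the concat.
ohe_df = pd.DataFrame(cols_ohe, columns=ohe.get_feature_names_out(cols), index=data2.index)
data3 = pd.concat([data2, ohe_df], axis=1)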

In [None]:
data3

Now we have data with only numeric values.

## Data balancing

In [None]:
plot_countplot(data_db=data3,
               x='Churn',
               titulo="Distribution of the Churn variable before balancing",
               label_x='Churn')

We can see from the graph above that the **target** of the dataset (column `'Churn'`) is [unbalanced](https://www.alura.com.br/artigos/lidando-com-desbalanceamento-dados?utm_source=gnarus&utm_medium=timeline). A model trained on the variable as it is may favor the majority class, which harms both learning and results.

To avoid problems in the model's learning, we will perform the balancing using the [`SMOTE`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) method from the imblearn library.
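To give an intuition for how SMOTE creates new minority-class samples, the sketch below interpolates between a point and its nearest neighbour. It is a deliberately simplified illustration, not the imblearn implementation: the real method picks at random among several nearest neighbours, and the helper name `synthetic_sample` and the toy points are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in two dimensions (made-up values).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def synthetic_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances from the base point to every minority point.
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf  # exclude the base point itself
    neighbour = points[np.argmin(dists)]
    # Random position on the segment between base and neighbour.
    lam = rng.random()
    return base + lam * (neighbour - base)

new_point = synthetic_sample(minority, rng)
print(new_point)
```

Because each synthetic point lies on a segment between two existing samples, the new data stays inside the region already covered by the minority class.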

In [None]:
X = data3.drop(['Churn'], axis = 1)
y = data3['Churn']

In [None]:
sm = SMOTE(random_state=SEED)
X_res, y_res = sm.fit_resample(X, y)

In [None]:
data4 = pd.concat([pd.DataFrame(X_res), pd.DataFrame(y_res)], axis=1)

In [None]:
plot_countplot(data_db=data4,
               x='Churn',
               titulo="Distribution of the Churn variable after balancing",
               label_x='Churn')

Now the **target** column of `data4` has the same number of samples in each class, that is, it is balanced. Therefore, we will use `data4` to build the classification models.

In [None]:
data4.to_json('Telco-Customer-Churn-balancing.json')

## Creating the models

To start training, split the data into **training** and **test** sets.
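The sketch below shows the split on toy arrays. With no `test_size` given, `train_test_split` keeps 25% of the rows for testing. The `stratify` argument is an optional addition that this notebook's own call does not use: it keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features, balanced binary target (made-up values).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# Default test_size is 0.25; stratify keeps the class balance in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te))  # 15 5
```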

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res , y_res, random_state=SEED)

### 1. SVC

The first model to be built is the **SVC** classifier. To build it, we use the [`SVC` class from the sklearn library](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

To learn more about this method, you can access the video [Nonlinear estimators and support vector machines from the Machine Learning course: introduction to classification with SKLearn](https://cursos.alura.com.br/course/machine-learning-introducao-a-classificacao-com-sklearn/task/46782).

In [None]:
svc = SVC(random_state=SEED)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)

After training the model, we need to know how well it performs. To do this, we collect its predictions on a data set unknown to the model, the test set.

The model's predictions for each item in the test set are compared with the true labels to measure how well it did. The evaluation consists of analyzing several metrics that describe the success of the model. The metrics we will evaluate are [**Accuracy**](https://cursos.alura.com.br/course/machine-learning-credit-scoring/task/92910), [**Precision, Recall and F1 Score**](https://cursos.alura.com.br/course/machine-learning-credit-scoring/task/92914) and the [**Confusion Matrix**](https://cursos.alura.com.br/course/machine-learning-credit-scoring/task/92912).
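These metrics can all be derived from the four counts of a binary confusion matrix. The sketch below uses made-up counts (not results from this notebook) to show the standard formulas.

```python
# Made-up confusion-matrix counts for a binary classifier.
tn, fp, fn, tp = 50, 10, 5, 35

# Share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Share of predicted positives that are truly positive.
precision = tp / (tp + fp)
# Share of true positives that the model found.
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

For churn prediction, recall is often the metric to watch, since a false negative is a customer who leaves without having been flagged.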

We obtain these metrics using the `plot_matrix_confusion()` function to analyze the final result of the model.

In [None]:
labels = ["True Neg","False Pos","False Neg","True Pos"]
categories = ["No Churn", "Churn"]
plot_matrix_confusion(y_test,
                      y_pred_svc,
                      group_names=labels,
                      categories=categories,
                      figsize=(8, 6),
                      title="Confusion matrix for the SVC classifier")

### 2. Decision Tree

The second model to be built is the **Decision Tree** classifier. To build it, we use the [`DecisionTreeClassifier` class from the sklearn library](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

To learn more about this method, you can watch the video [Decision Trees: delving deeper into Machine Learning models](https://cursos.alura.com.br/course/arvores-decisao-aprofundando-modelos-machine-learning).

After training the model, we test it and plot the confusion matrix and other metrics using the `plot_matrix_confusion()` function to analyze the final result of the model.

In [None]:
dtree = DecisionTreeClassifier(max_depth=5, random_state = SEED)
dtree.fit(X_train, y_train)
y_pred_dt = dtree.predict(X_test)

In [None]:
labels = ["True Neg","False Pos","False Neg","True Pos"]
categories = ["No Churn", "Churn"]
plot_matrix_confusion(y_test,
                      y_pred_dt,
                      group_names=labels,
                      categories=categories,
                      figsize=(8, 6),
                      title="Confusion matrix for the Decision Tree classifier")

### 3. Random Forest

The third model to be built is the **Random Forest** classifier. To build it, we use the [`RandomForestClassifier` class from the sklearn library](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

To learn more about this method, you can watch the video [Decision Trees: delving deeper into Machine Learning models](https://cursos.alura.com.br/course/arvores-decisao-aprofundando-modelos-machine-learning).

After training the model, we test it and plot the confusion matrix and other metrics using the `plot_matrix_confusion()` function to analyze the final result of the model.

In [None]:
rforest = RandomForestClassifier(max_depth = 5, random_state=SEED)
rforest.fit(X_train, y_train)
y_pred_rf = rforest.predict(X_test)

In [None]:
labels = ["True Neg","False Pos","False Neg","True Pos"]
categories = ["No Churn", "Churn"]
plot_matrix_confusion(y_test,
                      y_pred_rf,
                      group_names=labels,
                      categories=categories,
                      figsize=(8, 6),
                      title="Confusion matrix for the random forest classifier")

### Comparing the models

After training and testing the **SVC**, **Decision Tree** and **Random Forest** models, we can compare the results obtained to find the best model.

To do this, we collect the classification metrics of the three models and group them in a comparison table.
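A comparison table of this kind can be sketched with `recall_score` and pandas. The code below is a hypothetical stand-in, not the implementation of `compare_models_metrics` from `utils`; the model names, labels and predictions are all made up for illustration.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Made-up true labels for a toy train and test set.
y_train_toy = [1, 0, 1, 1, 0, 0]
y_test_toy = [1, 0, 1, 0]

# Made-up predictions from two hypothetical models.
preds = {
    "model_a": {"train": [1, 0, 1, 1, 0, 0], "test": [1, 0, 0, 0]},
    "model_b": {"train": [1, 0, 0, 1, 0, 0], "test": [1, 0, 1, 0]},
}

# One row per model, one column per data split.
rows = {
    name: {
        "Recall (train)": recall_score(y_train_toy, p["train"]),
        "Recall (test)": recall_score(y_test_toy, p["test"]),
    }
    for name, p in preds.items()
}
table = pd.DataFrame.from_dict(rows, orient="index").round(2)
print(table)
```

A large gap between the train and test columns, as for `model_a` here, is a common sign of overfitting.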

In [None]:
models = ['svc', 'decision tree', 'random forest']
y_pred_train = [svc.predict(X_train), dtree.predict(X_train), rforest.predict(X_train)]
y_pred_test = [y_pred_svc, y_pred_dt, y_pred_rf]

In [None]:
table_models = compare_models_metrics('Recall', models, y_train, y_pred_train, y_test, y_pred_test).round(2)
table_models