<a href="https://colab.research.google.com/github/daniellorieri/Data_Science_E_Analise_de_Dados/blob/main/Churn_Modelo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Objective:
 Determine which model most accurately predicts whether a customer will cancel the service. This study takes the Flai project as its starting point; here, for study purposes, I standardize the data with the Z-score and apply oversampling and undersampling techniques to check which gives the best accuracy and to determine the best model for making predictions.

###Data source
 - Data obtained from Kaggle.com

In [1]:
#Importing the libraries
import numpy as np
import seaborn as sns
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('/content/drive/My Drive/dataset/Churn_Modelling.csv')

In [242]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [243]:
df2 = df

##Available variables:
- CustomerId: customer identifier;
- Surname: customer's surname;
- CreditScore: credit score; 0 indicates a high risk of default and 1000 indicates customers with a low risk of default;
- Geography: country where the service is offered;
- Gender: customer's sex;
- Age: customer's age;
- Tenure: an indicator of job stability, where 0 means little stability and 10 means high stability;
- Balance: checking account balance;
- NumOfProducts: number of banking products acquired;
- HasCrCard: whether the customer has a credit card (Yes = 1, No = 0);
- IsActiveMember: whether the customer has an active account (Active = 1);
- EstimatedSalary: estimated salary;
- Exited: whether the customer left the bank (Churn = 1).

##Preprocessing

- Drop the variables that will not be used;
- Identify missing data;
- Separate the categorical, numerical, and response variables;
- Process the categorical variables;
- Process the numerical variables.

###Dropping variables that will not be used

 - Some variables carry no relevant information for understanding churn, such as CustomerId, RowNumber, and Surname.

In [3]:
df1 = df.drop(columns = ['RowNumber','CustomerId', 'Surname'])

In [4]:
df1

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


###Identifying missing values

In [5]:
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [6]:
#Shows the number of missing values and their respective percentage.
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().mean() * 100).reindex(total.index)
missing_data = pd.concat([total, percent], axis=1, sort=False, keys=['total', 'percent'])
missing_data[missing_data['percent'] != 0]

Unnamed: 0,total,percent


- The variables have no missing data.

###Separating the categorical variables from the numerical variables.

In [7]:
#Creating the independent and dependent variables
y = df1['Exited']
X = df1.drop('Exited',axis = 1)
X

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,Female,42,2,0.00,1,1,1,101348.88
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58
2,502,France,Female,42,8,159660.80,3,1,0,113931.57
3,699,France,Female,39,1,0.00,2,0,0,93826.63
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10
...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5,0.00,2,1,0,96270.64
9996,516,France,Male,35,10,57369.61,1,1,1,101699.77
9997,709,France,Female,36,7,0.00,1,0,1,42085.58
9998,772,Germany,Male,42,3,75075.31,2,1,0,92888.52


In [8]:
#Continuous variables
cont=['CreditScore','Balance','Age','EstimatedSalary','Tenure']
cont


['CreditScore', 'Balance', 'Age', 'EstimatedSalary', 'Tenure']

In [9]:
#Categorical variables
cat = list(set(X)-set(cont))
cat


['Gender', 'NumOfProducts', 'HasCrCard', 'Geography', 'IsActiveMember']

In [10]:
#df1 = df.drop(columns = ['RowNumber','CustomerId', 'Surname'])
#df1.head()

def remove_features(lista_features):
    for i in lista_features:
        df.drop(i, axis=1, inplace=True)
remove_features(['RowNumber','CustomerId','Surname'])
df.head(n=20)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
5,645,Spain,Male,44,8,113755.78,2,1,0,149756.71,1
6,822,France,Male,50,7,0.0,2,1,1,10062.8,0
7,376,Germany,Female,29,4,115046.74,4,1,0,119346.88,1
8,501,France,Male,44,4,142051.07,2,0,1,74940.5,0
9,684,France,Male,27,2,134603.88,1,1,1,71725.73,0


###Converting the categorical variables to numerical ones.
 - Some ML algorithms cannot handle categorical variables, so these must be converted to numbers.
 - For categorical variables with more than two classes we use the get_dummies() method.
 - For categorical variables with up to two classes we use LabelEncoder.
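The two encodings described above can be sketched on a small toy frame. The values below are illustrative and are not taken from the dataset, and `toy` and `toy_encoded` are names introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same column names as the churn data (values are illustrative).
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
})

# Two classes: map each label to an integer.
# LabelEncoder sorts the classes, so Female -> 0 and Male -> 1.
toy['Gender'] = LabelEncoder().fit_transform(toy['Gender'])

# More than two classes: one indicator column per class. drop_first removes the
# first class (France), which is implied when all remaining indicators are 0.
toy_encoded = pd.get_dummies(toy, columns=['Geography'], drop_first=True, dtype=int)
print(toy_encoded)
```

Dropping the first class avoids carrying a redundant column, since its value can always be deduced from the others.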

In [16]:
#Applying get_dummies to the variables with more than 2 classes.
X_final = pd.get_dummies(data= X, columns=['Geography','NumOfProducts'], drop_first=True)
X_final.head()


Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,NumOfProducts_2,NumOfProducts_3,NumOfProducts_4
0,619,0,42,2,0.0,1,1,101348.88,0,0,0,0,0
1,608,0,41,1,83807.86,0,1,112542.58,0,1,0,0,0
2,502,0,42,8,159660.8,1,0,113931.57,0,0,0,1,0
3,699,0,39,1,0.0,0,0,93826.63,0,0,1,0,0
4,850,0,43,2,125510.82,1,1,79084.1,0,1,0,0,0


In [20]:
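#Converting the Gender variable to 0 and 1.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['Gender'] = le.fit_transform(X['Gender'])
#Keep X_final (built in the previous cell) consistent when the cells are run top to bottom.
X_final['Gender'] = X['Gender']
X.head(10)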

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,0,42,2,0.0,1,1,1,101348.88
1,608,Spain,0,41,1,83807.86,1,0,1,112542.58
2,502,France,0,42,8,159660.8,3,1,0,113931.57
3,699,France,0,39,1,0.0,2,0,0,93826.63
4,850,Spain,0,43,2,125510.82,1,1,1,79084.1
5,645,Spain,1,44,8,113755.78,2,1,0,149756.71
6,822,France,1,50,7,0.0,2,1,1,10062.8
7,376,Germany,0,29,4,115046.74,4,1,0,119346.88
8,501,France,1,44,4,142051.07,2,0,1,74940.5
9,684,France,1,27,2,134603.88,1,1,1,71725.73


In [None]:
#Applying get_dummies to the variables with more than 2 classes.
#df1 = pd.get_dummies(data = X, columns=['Geography','NumOfProducts'])
#df1.head()

###Numerical variables
 - Because some variables are on different scales, we standardize them so that all values are on the same scale.
 

###Splitting the dataset into test and training sets

In [21]:
from sklearn.model_selection import train_test_split

X_treinamento, X_teste, y_treinamento, y_teste = train_test_split(X_final, y, test_size = 0.25, random_state = 1) #X_final is the dataset converted to numeric

####Applying z-score standardization
 - A technique that puts the values on the same scale, so that each variable has mean 0 and standard deviation 1.
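As a minimal sketch of the z-score, the example below standardizes a toy column and checks the result against the formula z = (x - mean) / std. The numbers are illustrative only. It also shows one common convention, which is an assumption here and differs from the cells that follow: learning the mean and standard deviation from the training rows and reusing them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only (loosely shaped like a Balance column).
train = np.array([[0.0], [50000.0], [100000.0], [150000.0]])
test = np.array([[75000.0]])

scaler = StandardScaler()
train_z = scaler.fit_transform(train)  # learns mean and std from the training rows
test_z = scaler.transform(test)        # reuses the training mean and std

# The same computation by hand: z = (x - mean) / std
manual_z = (train - train.mean(axis=0)) / train.std(axis=0)

print(train_z.ravel())
print(test_z.ravel())
```

With this convention a test value equal to the training mean maps to 0, whatever the other test rows look like.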

In [37]:
from sklearn.preprocessing import StandardScaler

z_score_treinamento = StandardScaler()
z_score_teste = StandardScaler()



In [23]:
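#Each split is standardized with its own scaler here, so the test set uses its own mean and standard deviation.
X_treinamento_padrao = z_score_treinamento.fit_transform(X_treinamento)
X_teste_padrao = z_score_teste.fit_transform(X_teste)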

In [24]:
X_treinamento_padrao, X_teste_padrao

(array([[ 0.71997988,  0.91776859,  1.25618088, ..., -0.92048392,
         -0.16763484, -0.07941149],
        [-0.75327193, -1.08959928, -0.65952721, ...,  1.08638509,
         -0.16763484, -0.07941149],
        [ 0.57574543, -1.08959928,  0.39411224, ...,  1.08638509,
         -0.16763484, -0.07941149],
        ...,
        [ 0.22546179, -1.08959928,  0.58568305, ..., -0.92048392,
         -0.16763484, -0.07941149],
        [ 0.13273964, -1.08959928,  0.01097062, ...,  1.08638509,
         -0.16763484, -0.07941149],
        [ 1.16298567,  0.91776859,  0.29832684, ..., -0.92048392,
         -0.16763484, -0.07941149]]),
 array([[-1.06697781,  0.89652206,  0.74989348, ...,  1.08347268,
         -0.15814629, -0.07229925],
        [ 0.29599597,  0.89652206, -0.47339308, ..., -0.92295821,
         -0.15814629, -0.07229925],
        [-1.26618167, -1.11542152,  0.27939865, ...,  1.08347268,
         -0.15814629, -0.07229925],
        ...,
        [ 0.09679211,  0.89652206,  0.74989348, ..., -

In [265]:
X_treinamento_padrao_dataframe = pd.DataFrame(X_treinamento_padrao)
X_teste_padrao_dataframe = pd.DataFrame(X_teste_padrao)

In [266]:
X_treinamento_padrao_dataframe

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.719980,0.917769,1.256181,0.683089,-1.228601,0.642621,0.983339,-1.480501,-0.582072,-0.573245,-0.920484,-0.167635,-0.079411
1,-0.753272,-1.089599,-0.659527,-1.397668,-1.228601,0.642621,-1.016944,-1.565230,-0.582072,-0.573245,1.086385,-0.167635,-0.079411
2,0.575745,-1.089599,0.394112,-1.397668,-1.228601,0.642621,-1.016944,-1.182108,-0.582072,1.744456,1.086385,-0.167635,-0.079411
3,1.791436,0.917769,0.585683,-0.704082,0.844319,-1.556128,-1.016944,-0.692551,-0.582072,-0.573245,-0.920484,-0.167635,-0.079411
4,-1.577469,0.917769,-0.659527,1.029882,-1.228601,0.642621,-1.016944,-0.567289,-0.582072,-0.573245,1.086385,-0.167635,-0.079411
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,-0.299964,0.917769,0.777254,0.683089,0.493838,0.642621,0.983339,-0.577104,1.718001,-0.573245,-0.920484,-0.167635,-0.079411
7496,0.349091,-1.089599,2.309820,-0.704082,0.074833,0.642621,-1.016944,-0.527712,1.718001,-0.573245,-0.920484,-0.167635,-0.079411
7497,0.225462,-1.089599,0.585683,1.376675,-1.228601,0.642621,0.983339,-0.138963,-0.582072,-0.573245,-0.920484,-0.167635,-0.079411
7498,0.132740,-1.089599,0.010971,1.029882,-1.228601,0.642621,0.983339,0.019792,-0.582072,-0.573245,1.086385,-0.167635,-0.079411


##Random Forest

In [161]:
#Creating an instance with 1000 decision trees and training the model on the training data
modelo = RandomForestClassifier(n_estimators = 1000, random_state = 42)
modelo.fit(X_treinamento_padrao, y_treinamento);

In [162]:
#Prediction
previsoes = modelo.predict(X_teste_padrao)
print(previsoes)
print(y_teste)

[0 0 0 ... 0 0 0]
9953    0
3850    0
4962    0
3886    0
5437    0
       ..
6955    0
557     1
2455    1
3920    0
6405    0
Name: Exited, Length: 2500, dtype: int64


In [163]:
#Accuracy
from sklearn.metrics import accuracy_score
previsoes = modelo.predict(X_teste_padrao)
accuracy_score(y_teste, previsoes)

0.8672

- The accuracy is approximately 87% on the test set.
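To put an accuracy figure in context on imbalanced data, it helps to compare it with a trivial baseline. The sketch below uses the class counts that `np.unique(y, return_counts=True)` reports for the full dataset in this notebook (7963 and 2037) and a hypothetical model that always predicts class 0.

```python
# Class counts reported by np.unique(y, return_counts=True) in this notebook.
n_stay, n_churn = 7963, 2037
n_total = n_stay + n_churn

# A hypothetical "model" that always predicts class 0 (customer stays).
baseline_accuracy = n_stay / n_total  # every non-churner counts as a hit
baseline_churn_recall = 0 / n_churn   # no churner is ever identified

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall on churn: {baseline_churn_recall:.1f}")
```

A model is only useful here to the extent that it beats this baseline, which is why per-class precision and recall are worth reading alongside accuracy.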

In [164]:
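from sklearn.metrics import classification_report
previsoes = modelo.predict(X_teste_padrao)
#Note: classification_report expects (y_true, y_pred). With the arguments in this order,
#the precision and recall columns below are exchanged and support counts the predictions.
print(classification_report(previsoes, y_teste))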


              precision    recall  f1-score   support

           0       0.97      0.88      0.92      2186
           1       0.48      0.80      0.60       314

    accuracy                           0.87      2500
   macro avg       0.73      0.84      0.76      2500
weighted avg       0.91      0.87      0.88      2500
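The scikit-learn metric functions take the true labels first and the predictions second. The small sketch below, with made-up labels, shows how precision and recall trade places when the two arguments are swapped.

```python
from sklearn.metrics import precision_score, recall_score

# Tiny illustrative labels: 1 = churn, 0 = stayed.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Documented order: true labels first, predictions second.
precision = precision_score(y_true_toy, y_pred_toy)  # 2 of 3 predicted churners are real
recall = recall_score(y_true_toy, y_pred_toy)        # 2 of 4 real churners are found

# Swapping the arguments exchanges the two metrics.
swapped_precision = precision_score(y_pred_toy, y_true_toy)
swapped_recall = recall_score(y_pred_toy, y_true_toy)

print(precision, recall)
print(swapped_precision, swapped_recall)
```

Accuracy is unaffected by the argument order, but the per-class columns of a report are.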



In [165]:
#Analyzing the importance of each variable
modelo.feature_importances_

feature_importances = pd.DataFrame(modelo.feature_importances_, index = X_treinamento.columns , columns=['importance']).sort_values('importance',ascending=False)
feature_importances

Unnamed: 0,importance
Age,0.232038
EstimatedSalary,0.146217
CreditScore,0.143458
Balance,0.142694
Tenure,0.084539
NumOfProducts_2,0.067757
NumOfProducts_3,0.045093
IsActiveMember,0.041292
Geography_Germany,0.028357
Gender,0.020241


###Oversampling
 - Check whether accuracy can be improved with this technique, which consists of generating synthetic samples of the minority class (here with SMOTE) until the classes are balanced.
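The core idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed at a random position on the segment between a minority-class point and a nearby minority-class neighbour. This is a simplified illustration and not imblearn's implementation. It uses only the single nearest neighbour, and the `synthesize` helper and the points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative minority-class points in a 2-feature space.
minority = np.array([[1.0, 1.0], [1.5, 1.2], [3.0, 3.5], [1.2, 0.8]])

def synthesize(points, index, rng):
    """Create one synthetic sample from points[index] and its nearest neighbour."""
    base = points[index]
    distances = np.linalg.norm(points - base, axis=1)
    distances[index] = np.inf  # exclude the point itself
    neighbour = points[np.argmin(distances)]
    gap = rng.random()         # uniform in [0, 1)
    return base + gap * (neighbour - base)

new_point = synthesize(minority, 0, rng)
print(new_point)
```

Because the new point is interpolated rather than copied, the minority class grows without exact duplicates.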

In [127]:
from imblearn.over_sampling import SMOTE


In [192]:
#Oversampling the minority class in the training data
smote = SMOTE(sampling_strategy='minority')
X_over, y_over = smote.fit_resample(X_treinamento_padrao, y_treinamento)

In [193]:
#Checking that the minority class was balanced and now has 5983 records.
np.unique(y_over, return_counts=True)

(array([0, 1]), array([5983, 5983]))

In [194]:
#Splitting the data into training and test sets, using stratified sampling so that both keep the same class proportions.
X_treinamento_over, X_teste_over, y_treinamento_over, y_teste_over = train_test_split(X_over,y_over,
                                                                                          test_size=0.25, stratify = y_over)


In [195]:
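modelo_over = RandomForestClassifier(n_estimators = 1000, random_state = 42)
#Note: the model is fit on the original training data, not on X_treinamento_over / y_treinamento_over.
modelo_over.fit(X_treinamento_padrao, y_treinamento)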

RandomForestClassifier(n_estimators=1000, random_state=42)

In [196]:
previsoes_over = modelo_over.predict(X_teste_over)


In [197]:
accuracy_score(previsoes_over,y_teste_over)

0.8890374331550802

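- With oversampling, the measured accuracy is slightly higher than before, at approximately 89%. Note, however, that `modelo_over` was fit on `X_treinamento_padrao` rather than on the oversampled split, and `X_teste_over` is drawn from the resampled training data, so this figure is not directly comparable to the 87% measured on `X_teste_padrao`.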

###Undersampling

In [219]:
from imblearn.under_sampling import TomekLinks

In [220]:
tl = TomekLinks(sampling_strategy='majority')
X_under, y_under = tl.fit_resample(X_treinamento_padrao, y_treinamento)

In [221]:
np.unique(y, return_counts=True)

(array([0, 1]), array([7963, 2037]))

In [222]:
#This count shows that the majority class shrank relative to the original dataset
np.unique(y_under, return_counts=True)

(array([0, 1]), array([5600, 1517]))
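A Tomek link is a pair of points from different classes that are each other's nearest neighbour. Undersampling with Tomek links removes the majority-class member of each such pair. The sketch below illustrates the idea on toy one-dimensional data. It is not imblearn's implementation, and every name in it is introduced only for this example.

```python
import numpy as np

# Illustrative 1-D data: class 0 is the majority, class 1 the minority.
X_toy = np.array([[0.0], [1.0], [2.0], [2.9], [3.0], [5.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1])

# Index of each point's nearest neighbour (excluding itself).
dist = np.abs(X_toy - X_toy.T)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1).tolist()

# A Tomek link is a pair of mutual nearest neighbours with different labels.
links = [(i, j) for i, j in enumerate(nearest)
         if nearest[j] == i and y_toy[i] != y_toy[j] and i < j]

# Undersampling the majority class drops the class-0 member of each link.
drop = {i if y_toy[i] == 0 else j for i, j in links}
keep = [k for k in range(len(y_toy)) if k not in drop]
X_under_toy, y_under_toy = X_toy[keep], y_toy[keep]

print(links)
print(y_under_toy)
```

Only majority points that sit right next to a minority point are removed, so the class counts shrink modestly rather than becoming balanced.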

In [223]:
#Splitting into training and test sets
X_treinamento_under, X_teste_under, y_treinamento_under, y_teste_under = train_test_split(X_under,y_under,
                                                                                          test_size=0.25, stratify = y_under)

In [224]:
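modelo_under = RandomForestClassifier(n_estimators = 1000, random_state = 42)
#Note: the model is fit on the original training data, not on X_treinamento_under / y_treinamento_under.
modelo_under.fit(X_treinamento_padrao, y_treinamento)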


RandomForestClassifier(n_estimators=1000, random_state=42)

In [225]:
previsoes_under = modelo_under.predict(X_teste_under)


In [226]:
accuracy_score(previsoes_under, y_teste_under)

1.0

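- With undersampling, the measured accuracy is 100%.
- Taken at face value this would make it the most accurate model for predicting whether a customer will cancel the service, but the result should be read with caution: the model was fit on all of `X_treinamento_padrao`, and `X_teste_under` is a subset of those same rows, so the score reflects data seen during training rather than performance on new customers.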
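One common way to compare resampling strategies is to split first, resample only the training rows, and score on test rows that were never resampled or seen during fitting. The sketch below illustrates that workflow under stated assumptions: the data is synthetic (`make_classification`), plain random oversampling stands in for SMOTE to keep the example dependency-free, and all names are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (about 80% class 0).
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

# 1) Split first, so the test rows are never touched by resampling.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

# 2) Resample the training rows only (random oversampling of the minority class).
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_tr == 1)
majority_idx = np.flatnonzero(y_tr == 0)
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx))
train_idx = np.concatenate([majority_idx, minority_idx, extra])

# 3) Fit on the resampled training rows, score on the untouched test rows.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr[train_idx], y_tr[train_idx])
held_out_accuracy = accuracy_score(y_te, clf.predict(X_te))
train_accuracy = accuracy_score(y_tr, clf.predict(X_tr))

print(f"accuracy on rows seen in training: {train_accuracy:.3f}")
print(f"accuracy on held-out rows: {held_out_accuracy:.3f}")
```

Printing both scores side by side shows how much of a result comes from rows the forest has already seen.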

###Comparing the two classification models, Under and Over

In [227]:
print(classification_report(y_teste_under, previsoes_under))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1401
           1       1.00      1.00      1.00       379

    accuracy                           1.00      1780
   macro avg       1.00      1.00      1.00      1780
weighted avg       1.00      1.00      1.00      1780



In [228]:
#Note: the arguments are in (y_pred, y_true) order, so precision and recall are exchanged below.
print(classification_report(previsoes_over, y_teste_over))


              precision    recall  f1-score   support

           0       1.00      0.82      0.90      1828
           1       0.78      1.00      0.88      1164

    accuracy                           0.89      2992
   macro avg       0.89      0.91      0.89      2992
weighted avg       0.91      0.89      0.89      2992

