## A.M Project - Classification



## 1. Introduction

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report


churn_transformed = pd.read_csv('data/churn_transformed.csv', encoding='UTF-8')
churn_transformed.head()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Surname,Geography,Gender,CreditScore_norm,EstimatedSalary_norm,Balance_norm,Credit_per_age,isOutlier
0,4988.5,15634602.0,619.0,42.0,2.0,0.0,1.0,1.0,1.0,101348.88,1.0,Hargrave,France,Female,0.538,0.506541,0.0,14.738095,1
1,2.0,15647311.0,608.0,41.0,1.0,83807.86,1.0,0.0,1.0,112542.58,0.0,Hill,Spain,Female,0.516,0.562537,0.334031,14.829268,1
2,3.0,15619304.0,502.0,42.0,8.0,159660.8,3.0,1.0,0.0,100272.165,1.0,Onio,France,Female,0.304,0.501155,0.636357,11.952381,1
3,4988.5,15701354.0,699.0,39.0,1.0,0.0,2.0,1.0,0.0,93826.63,0.0,Boni,France,Male,0.698,0.468912,0.0,17.923077,1
4,5.0,15737888.0,850.0,43.0,2.0,125510.82,1.0,1.0,1.0,100272.165,0.0,Mitchell,Spain,Female,1.0,0.501155,0.500246,19.767442,1


In [2]:
churn_transformed.columns

Index(['RowNumber', 'CustomerId', 'CreditScore', 'Age', 'Tenure', 'Balance',
       'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary',
       'Exited', 'Surname', 'Geography', 'Gender', 'CreditScore_norm',
       'EstimatedSalary_norm', 'Balance_norm', 'Credit_per_age', 'isOutlier'],
      dtype='object')

## 2. Choosing the target column

One of the activities proposed in this project is choosing the column that will serve as the target. The target variable should be the one that best represents the outcome or behavior the model must learn to predict. Since the goal is to predict churn, we choose the `Exited` column.
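Because `Exited` is a binary target, it can help to look at how its two classes are distributed before modeling. The snippet below is a minimal sketch of that check; the DataFrame `toy` and its values are hypothetical stand-ins for `churn_transformed`, not the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for churn_transformed (illustrative values only)
toy = pd.DataFrame({"Exited": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Share of each class in the target column
class_share = toy["Exited"].value_counts(normalize=True)
print(class_share)
```

A strongly uneven split suggests that accuracy alone may be a misleading metric for this target.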

### Handling categorical variables

For the categorical columns, we apply one-hot encoding with `OneHotEncoder`.

In [3]:
from sklearn.preprocessing import OneHotEncoder

churn_transformed.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

# Identify the categorical columns
categorical_columns = ['Geography', 'Gender']

# Apply one-hot encoding to the categorical columns
# (drop='first' removes the first category of each column to avoid redundant columns)
encoder = OneHotEncoder(drop='first')
encoded_categories = encoder.fit_transform(churn_transformed[categorical_columns])

# Build a DataFrame from the encoded columns, aligned with the original index
encoded_df = pd.DataFrame(encoded_categories.toarray(), columns=encoder.get_feature_names_out(categorical_columns))
encoded_df.index = churn_transformed.index

# Concatenate the encoded DataFrame with the original one (without the categorical columns)
churn_transformed = pd.concat([churn_transformed.drop(columns=categorical_columns), encoded_df], axis=1)

# Show the first rows of the transformed DataFrame
churn_transformed.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,CreditScore_norm,EstimatedSalary_norm,Balance_norm,Credit_per_age,isOutlier,Geography_Germany,Geography_Spain,Gender_Male
0,619.0,42.0,2.0,0.0,1.0,1.0,1.0,101348.88,1.0,0.538,0.506541,0.0,14.738095,1,0.0,0.0,0.0
1,608.0,41.0,1.0,83807.86,1.0,0.0,1.0,112542.58,0.0,0.516,0.562537,0.334031,14.829268,1,0.0,1.0,0.0
2,502.0,42.0,8.0,159660.8,3.0,1.0,0.0,100272.165,1.0,0.304,0.501155,0.636357,11.952381,1,0.0,0.0,0.0
3,699.0,39.0,1.0,0.0,2.0,1.0,0.0,93826.63,0.0,0.698,0.468912,0.0,17.923077,1,0.0,0.0,1.0
4,850.0,43.0,2.0,125510.82,1.0,1.0,1.0,100272.165,0.0,1.0,0.501155,0.500246,19.767442,1,0.0,1.0,0.0


## 3. Splitting into training, validation, and test sets

In [4]:
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target (y)
X = churn_transformed.drop(columns=['Exited'])
y = churn_transformed['Exited']


# Split the data into training (60%), validation (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


print("Training set size:", X_train.shape, y_train.shape)
print("Validation set size:", X_val.shape, y_val.shape)
print("Test set size:", X_test.shape, y_test.shape)

Training set size: (5986, 16) (5986,)
Validation set size: (1996, 16) (1996,)
Test set size: (1996, 16) (1996,)
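The split above is purely random. When the target classes are uneven, one option is to pass `stratify` so that each subset keeps the same class proportions. This is a sketch of that variation, not what the notebook ran; `X_toy` and `y_toy` are hypothetical arrays with an 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 negatives, 20 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# 60% train, then the remaining 40% split in half into validation and test,
# stratifying on the target at each step
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Positive-class share in each subset
print(y_tr.mean(), y_va.mean(), y_te.mean())
```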


## 4. Initializing the models

In [5]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np

# Initialize the models
modelos = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

# Evaluate each model with 5-fold cross-validation
for nome, modelo in modelos.items():
    print(f'Evaluating model: {nome}')

    # Cross-validated accuracy on the training set
    cv_scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='accuracy')
    print(f'{nome} - Mean CV accuracy: {np.mean(cv_scores):.4f} (Standard deviation: {np.std(cv_scores):.4f})')

    # Out-of-fold predictions on the test set: cross_val_predict refits clones of
    # the model on folds of X_test, so the metrics below come from 5-fold
    # cross-validation within the test set, not from a model fitted on X_train
    y_pred = cross_val_predict(modelo, X_test, y_test, cv=5)
    acuracia = accuracy_score(y_test, y_pred)
    print(f'{nome} - Test set accuracy: {acuracia:.4f}')
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('-' * 60)


Evaluating model: Logistic Regression

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic Regression - Mean CV accuracy: 0.8316 (Standard deviation: 0.0033)
Logistic Regression - Test set accuracy: 0.8342
              precision    recall  f1-score   support

         0.0       0.84      0.99      0.91      1646
         1.0       0.64      0.12      0.21       350

    accuracy                           0.83      1996
   macro avg       0.74      0.55      0.56      1996
weighted avg       0.81      0.83      0.78      1996

Confusion Matrix:
[[1622   24]
 [ 307   43]]
------------------------------------------------------------
Evaluating model: Random Forest
Random Forest - Mean CV accuracy: 0.8540 (Standard deviation: 0.0024)
Random Forest - Test set accuracy: 0.8467
              precision    recall  f1-score   support

         0.0       0.86      0.97      0.91      1646
         1.0       0.66      0.25      0.37       350

    accuracy                           0.85      1996
   macro avg       0.76      0.61      0.64      1996
weighted avg       0.83      0.85      0.82      1996

Confusion Matrix:
[...]

K-Nearest Neighbors - Mean CV accuracy: 0.7934 (Standard deviation: 0.0083)
K-Nearest Neighbors - Test set accuracy: 0.7926
              precision    recall  f1-score   support

         0.0       0.82      0.95      0.88      1646
         1.0       0.14      0.03      0.05       350

    accuracy                           0.79      1996
   macro avg       0.48      0.49      0.47      1996
weighted avg       0.70      0.79      0.74      1996

Confusion Matrix:
[[1570   76]
 [ 338   12]]
------------------------------------------------------------
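Two things stand out in the run above: the logistic regression solver warns that it hit the iteration limit and suggests scaling the data, and the test metrics come from `cross_val_predict` on the test set rather than from a model fitted on the training set. One possible way to address both is sketched below, as a suggestion rather than what the notebook ran. It uses synthetic data from `make_classification` as a stand-in for the churn features, and the names `X_demo`, `X_fit`, `X_hold`, and `clf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is part of the pipeline, so its statistics are learned
# from the training rows only
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
clf.fit(X_fit, y_fit)

# Evaluate the fitted pipeline on rows it has never seen
holdout_pred = clf.predict(X_hold)
holdout_accuracy = accuracy_score(y_hold, holdout_pred)
print(f"Hold-out accuracy: {holdout_accuracy:.4f}")
```

In the notebook's setting, the same pattern would fit on `X_train` and score on `X_val`, keeping `X_test` for a final check.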


## 5. Adding MLflow to the training

In [6]:
import mlflow
import mlflow.sklearn

mlflow.set_experiment("Churn Model Tracking")

2024/07/11 18:42:29 INFO mlflow.tracking.fluent: Experiment with name 'Churn Model Tracking' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///c:/Users/vitor/Desktop/6%20periodo/introdu%C3%A7%C3%A3o%20ciencia%20de%20dados/IF679---Analise-Bank-Churn/mlruns/669967341799188521', creation_time=1720734149092, experiment_id='669967341799188521', last_update_time=1720734149092, lifecycle_stage='active', name='Churn Model Tracking', tags={}>

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score, cross_val_predict
import numpy as np

# Initialize the models
modelos = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Support Vector Machine': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

# Train and evaluate each model, logging each run to MLflow
for nome, modelo in modelos.items():
    print(f'Evaluating model: {nome}')

    with mlflow.start_run(run_name=nome):
        # Log the model hyperparameters, if available
        if hasattr(modelo, 'get_params'):
            params = modelo.get_params()
            mlflow.log_params(params)

        # Cross-validated accuracy on the training set
        cv_scores = cross_val_score(modelo, X_train, y_train, cv=5, scoring='accuracy')
        mean_cv_score = np.mean(cv_scores)
        std_cv_score = np.std(cv_scores)

        mlflow.log_metric('mean_cv_accuracy', mean_cv_score)
        mlflow.log_metric('std_cv_accuracy', std_cv_score)

        print(f'{nome} - Mean CV accuracy: {mean_cv_score:.4f} (Standard deviation: {std_cv_score:.4f})')

        # Fit the model on the training set (this fitted model is the one logged below)
        modelo.fit(X_train, y_train)

        # Out-of-fold predictions on the test set: cross_val_predict refits clones
        # of the model on folds of X_test, so the metrics below do not come from
        # the model fitted on X_train above
        y_pred = cross_val_predict(modelo, X_test, y_test, cv=5)
        acuracia = accuracy_score(y_test, y_pred)
        mlflow.log_metric('test_accuracy', acuracia)

        print(f'{nome} - Test set accuracy: {acuracia:.4f}')
        print(classification_report(y_test, y_pred))

        # Log weighted-average precision, recall, and F1
        mlflow.log_metrics({
            'precision': precision_score(y_test, y_pred, average='weighted'),
            'recall': recall_score(y_test, y_pred, average='weighted'),
            'f1_score': f1_score(y_test, y_pred, average='weighted')
        })

        print('Confusion Matrix:')
        print(confusion_matrix(y_test, y_pred))
        print('-' * 60)

        # Log the fitted model
        mlflow.sklearn.log_model(modelo, nome)


Evaluating model: Logistic Regression

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic Regression - Mean CV accuracy: 0.8316 (Standard deviation: 0.0033)
Logistic Regression - Test set accuracy: 0.8342
              precision    recall  f1-score   support

         0.0       0.84      0.99      0.91      1646
         1.0       0.64      0.12      0.21       350

    accuracy                           0.83      1996
   macro avg       0.74      0.55      0.56      1996
weighted avg       0.81      0.83      0.78      1996

Confusion Matrix:
[[1622   24]
 [ 307   43]]
------------------------------------------------------------
Evaluating model: Random Forest
Random Forest - Mean CV accuracy: 0.8540 (Standard deviation: 0.0024)
Random Forest - Test set accuracy: 0.8467
              precision    recall  f1-score   support

         0.0       0.86      0.97      0.91      1646
         1.0       0.66      0.25      0.37       350

    accuracy                           0.85      1996
   macro avg       0.76      0.61      0.64      1996
weighted avg       0.83      0.85      0.82      1996

Confusion Matrix:
[...]

Evaluating model: K-Nearest Neighbors
K-Nearest Neighbors - Mean CV accuracy: 0.7934 (Standard deviation: 0.0083)
K-Nearest Neighbors - Test set accuracy: 0.7926
              precision    recall  f1-score   support

         0.0       0.82      0.95      0.88      1646
         1.0       0.14      0.03      0.05       350

    accuracy                           0.79      1996
   macro avg       0.48      0.49      0.47      1996
weighted avg       0.70      0.79      0.74      1996

Confusion Matrix:
[[1570   76]
 [ 338   12]]
------------------------------------------------------------
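The weighted averages logged to MLflow are dominated by the majority class, so they say little about how well churners (class `1.0`) are detected. As a worked example, the per-class figures can be recomputed by hand from the logistic regression confusion matrix printed above. The variable names below are illustrative.

```python
import numpy as np

# Logistic regression confusion matrix from the output above
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[1622, 24],
               [307, 43]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all customers classified correctly
precision_churn = tp / (tp + fp)      # of the predicted churners, how many churned
recall_churn = tp / (tp + fn)         # of the actual churners, how many were found
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"accuracy={accuracy:.4f} precision={precision_churn:.2f} "
      f"recall={recall_churn:.2f} f1={f1_churn:.2f}")
```

These values match the report (accuracy 0.8342; precision 0.64, recall 0.12, and F1 0.21 for class `1.0`): only 43 of the 350 churners are identified, even though overall accuracy is above 83%.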


## 6. Hyperparameters
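Grid search fits every combination in a parameter grid with cross-validation and keeps the combination with the best mean score. The sketch below shows the mechanics on synthetic data. Wrapping the estimator in a `Pipeline` with `StandardScaler` and selecting by `scoring='f1'` are assumptions made for this illustration, not choices taken from the notebook, and the names `X_demo`, `pipe`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Scaling happens inside each CV fold because it is a pipeline step
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__C": [0.1, 1, 10]}

# Select by F1 of the positive class instead of accuracy
search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 4))
```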

In [9]:
from sklearn.model_selection import GridSearchCV

# Models and their hyperparameter grids for GridSearchCV
modelos = {
    'Logistic Regression': {
        'modelo': LogisticRegression(random_state=42),
        'params': {
            'max_iter': [100, 500, 1000],
            'C': [0.1, 1, 10]
        }
    },
    'Random Forest': {
        'modelo': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [10, 50, 100],
            'max_depth': [None, 10, 20, 30]
        }
    },
    'Support Vector Machine': {
        'modelo': SVC(random_state=42),
        'params': {
            'C': [0.1, 1, 10],
            'kernel': ['rbf', 'linear']
        }
    },
    'K-Nearest Neighbors': {
        'modelo': KNeighborsClassifier(),
        'params': {
            'n_neighbors': [3, 5, 10],
            'metric': ['euclidean', 'manhattan']
        }
    }
}

# Tune and evaluate each model with GridSearchCV
for nome, config in modelos.items():
    print(f'Evaluating model: {nome}')

    # Configure the grid search (5-fold CV, accuracy as the selection metric)
    grid_search = GridSearchCV(estimator=config['modelo'], param_grid=config['params'], cv=5, scoring='accuracy', verbose=1)

    # Run the grid search on the training set
    grid_search.fit(X_train, y_train)

    # Best parameters and best CV score
    print(f'Best parameters for {nome}: {grid_search.best_params_}')
    print(f'Best CV accuracy for {nome}: {grid_search.best_score_:.4f}')

    # Predictions on the test set with the best model, refit on the full training set
    y_pred = grid_search.predict(X_test)
    acuracia = accuracy_score(y_test, y_pred)
    print(f'{nome} - Test set accuracy: {acuracia:.4f}')
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('-' * 60)

Evaluating model: Logistic Regression
Fitting 5 folds for each of 9 candidates, totalling 45 fits

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best parameters for Logistic Regression: {'C': 10, 'max_iter': 500}
Best CV accuracy for Logistic Regression: 0.8328
Logistic Regression - Test set accuracy: 0.8342
              precision    recall  f1-score   support

         0.0       0.84      0.99      0.91      1646
         1.0       0.65      0.12      0.20       350

    accuracy                           0.83      1996
   macro avg       0.75      0.55      0.55      1996
weighted avg       0.81      0.83      0.78      1996

Confusion Matrix:
[[1624   22]
 [ 309   41]]
------------------------------------------------------------
Evaluating model: Random Forest
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best parameters for Random Forest: {'max_depth': 10, 'n_estimators': 100}
Best CV accuracy for Random Forest: 0.8582
Random Forest - Test set accuracy: 0.8507
              precision    recall  f1-score   support

         0.0       0.87      0.97