## Exercise 1
### Review the documentation for RandomForestClassifier, LogisticRegression, KNeighborsClassifier, and GradientBoostingClassifier, change or add a parameter in each model, and compare the results with the baseline run in this notebook.

### Import libs

In [6]:
%pip install pandas
%pip install scikit-learn
%pip install mlflow

# Data manipulation and timing
import pandas as pd
import time

# Machine learning libraries
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# MLflow for experiment tracking
import mlflow
import mlflow.sklearn

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")



### Constants

In [7]:
DATA_PATH = "../data"

### Load data

In [8]:
x_train = pd.read_csv(f"{DATA_PATH}/x_train.csv")
x_test = pd.read_csv(f"{DATA_PATH}/x_test.csv")
y_train = pd.read_csv(f"{DATA_PATH}/y_train.csv")
y_test = pd.read_csv(f"{DATA_PATH}/y_test.csv")

#### Utils

In [9]:
def avalia_modelo(models):
    """Train, evaluate, and log each model to MLflow; return the best row."""
    results = []

    for name, model_value in models.items():
        model = model_value["model"]
        params = model_value["params"]

        # Time only the training step
        inicio = time.perf_counter()
        model.fit(x_train, y_train)
        fim = time.perf_counter()

        # Predictions on the test set
        y_pred = model.predict(x_test)

        # Metrics
        acuracia = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average="weighted")
        tempo_treino = fim - inicio

        # Log the run to MLflow
        with mlflow.start_run(run_name=name):
            mlflow.log_param("Modelo", name)

            for param_key, param_value in params.items():
                mlflow.log_param(param_key, param_value)

            mlflow.log_metric("Acurácia", acuracia)
            mlflow.log_metric("F1-Score", f1)
            mlflow.log_metric("Tempo de Treinamento", tempo_treino)
            mlflow.sklearn.log_model(model, "modelo")

        # Store the results for the comparison table
        results.append({
            "Modelo": name,
            "Acurácia": acuracia,
            "F1-Score": f1,
            "Tempo de Treinamento (s)": tempo_treino,
            "model": model,
            "params": params
        })
        print(f"Modelo {name} treinado e registrado no MLflow.")

    df_results = pd.DataFrame(results)
    return check_best_df(df_results)

def check_best_df(df: pd.DataFrame):
    """Rank by accuracy (highest first), then by training time (shortest first)."""
    # Note: this sorts the DataFrame passed in, in place
    df.sort_values(by=["Acurácia", "Tempo de Treinamento (s)"], ascending=[False, True], inplace=True)
    print("Resultado da comparação:")
    print(df)
    df_best = df.iloc[0]
    print(f"Melhor Modelo: {df_best['Modelo']}")
    return df_best
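
The ranking rule used by `check_best_df` can be looked at in isolation on a toy table. The sketch below is only an illustration: the scores are made up, and `rank_models` is a hypothetical helper that returns a sorted copy instead of sorting in place. It shows that a tie on accuracy is broken by the shorter training time, and that F1-Score plays no part in the ranking.

```python
import pandas as pd

def rank_models(df: pd.DataFrame) -> pd.DataFrame:
    """Sort by accuracy (descending), then by training time (ascending)."""
    return df.sort_values(
        by=["Acurácia", "Tempo de Treinamento (s)"],
        ascending=[False, True],
    )

# Made-up values, used only to illustrate the tie-break.
toy = pd.DataFrame({
    "Modelo": ["A", "B", "C"],
    "Acurácia": [0.82, 0.82, 0.80],
    "Tempo de Treinamento (s)": [0.40, 0.05, 0.01],
})

ranked = rank_models(toy)
best = ranked.iloc[0]
print(best["Modelo"])  # B: same accuracy as A, but faster to train
```

Because training times of a few hundredths of a second vary from run to run, the winner among tied models may change between executions.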

#### Random Forest

In [10]:
randon_forest_baseline_key = "Random Forest"

randon_forest_models = {
    "Random Forest": {
        "model": RandomForestClassifier(random_state=42),
        "params": {
            "random_state": 42
        }
    },
    "Random Forest with 500 estimators": {
        "model": RandomForestClassifier(random_state=42, n_estimators=500),
        "params": {
            "random_state": 42,
            "n_estimators": 500
        }
    },
    "Random Forest with 10 max depth": {
        "model": RandomForestClassifier(random_state=42, max_depth=10),
        "params": {
            "random_state": 42,
            "max_depth": 10
        }
    },
    "Random Forest with 500 estimators and 10 max depth": {
        "model": RandomForestClassifier(random_state=42, n_estimators=500, max_depth=10),
        "params": {
            "random_state": 42,
            "n_estimators": 500,
            "max_depth": 10
        }
    },
}

best_radon_forest = avalia_modelo(randon_forest_models)



Modelo Random Forest treinado e registrado no MLflow.
Modelo Random Forest with 500 estimators treinado e registrado no MLflow.
Modelo Random Forest with 10 max depth treinado e registrado no MLflow.
Modelo Random Forest with 500 estimators and 10 max depth treinado e registrado no MLflow.
Resultado da comparação:
                                              Modelo  Acurácia  F1-Score  \
0                                      Random Forest  0.821229  0.818174   
1                  Random Forest with 500 estimators  0.810056  0.806810   
2                    Random Forest with 10 max depth  0.787709  0.780354   
3  Random Forest with 500 estimators and 10 max d...  0.776536  0.765034   

   Tempo de Treinamento (s)  \
0                  0.442809   
1                  1.881878   
2                  0.251360   
3                  0.934243   

                                               model  \
0  (DecisionTreeClassifier(max_features='sqrt', r...   
1  (DecisionTreeClassifier(max_features='sqrt', r...   
2  (DecisionTreeClassifier(max_depth=10, max_feat...   
3  (DecisionTreeClassifier(max_depth=10, max_feat...   

                                              params  
0          
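
The comparison above rests on a single train/test split, so small accuracy differences between configurations may not hold on other splits. One way to check is k-fold cross-validation. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `x_train` / `y_train`, and the `candidates` dictionary is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train (assumption: not the notebook's data).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "baseline": RandomForestClassifier(random_state=42),
    "max_depth=10": RandomForestClassifier(random_state=42, max_depth=10),
}

cv_summary = {}
for name, model in candidates.items():
    # Five accuracy scores, one per held-out fold
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean scores differ by less than their spread across folds, the two configurations are hard to tell apart on this data.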

#### Logistic Regression

In [11]:
logistic_regression_baseline_key = "Logistic Regression"

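# Note: intercept_scaling only takes effect with solver="liblinear" and
# fit_intercept=True, so it does not change the default (lbfgs) fit used here.
logistic_regression_models = {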
    "Logistic Regression": {
        "model": LogisticRegression(max_iter=1000, random_state=42),
        "params": {
            "max_iter": 1000,
            "random_state": 42,
        }
    },
    "Logistic Regression with 10 intercept scaling": {
        "model": LogisticRegression(max_iter=1000, random_state=42, intercept_scaling=10),
        "params": {
            "max_iter": 1000,
            "random_state": 42,
            "intercept_scaling": 10
        }
    },
    "Logistic Regression with False fit intercept": {
        "model": LogisticRegression(max_iter=1000, random_state=42, fit_intercept=False),
        "params": {
            "max_iter": 1000,
            "random_state": 42,
            "fit_intercept": False
        }
    },
    "Logistic Regression with 10 intercept scaling and False fit intercept": {
        "model": LogisticRegression(max_iter=1000, random_state=42, intercept_scaling=10, fit_intercept=False),
        "params": {
            "max_iter": 1000,
            "random_state": 42,
            "intercept_scaling": 10,
            "fit_intercept": False
        }
    },
}

best_logistic_regression = avalia_modelo(logistic_regression_models)



Modelo Logistic Regression treinado e registrado no MLflow.
Modelo Logistic Regression with 10 intercept scaling treinado e registrado no MLflow.
Modelo Logistic Regression with False fit intercept treinado e registrado no MLflow.
Modelo Logistic Regression with 10 intercept scaling and False fit intercept treinado e registrado no MLflow.
Resultado da comparação:
                                              Modelo  Acurácia  F1-Score  \
2       Logistic Regression with False fit intercept  0.821229  0.819935   
3  Logistic Regression with 10 intercept scaling ...  0.821229  0.819935   
1      Logistic Regression with 10 intercept scaling  0.821229  0.819935   
0                                Logistic Regression  0.821229  0.819935   

   Tempo de Treinamento (s)  \
2                  0.057311   
3                  0.058075   
1                  0.058624   
0                  0.079705   

                                               model  \
2  LogisticRegression(fit_intercept=False, max_it...   
3  LogisticRegression(fit_intercept=False, interc...   
1  LogisticRegression(intercept_scaling=10, max_i...   
0  LogisticRegression(max_iter=1000, random_state...   

                                              p
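
All four configurations reach the same accuracy and F1-Score. For `intercept_scaling` this is expected: the scikit-learn documentation states that the parameter is only used with `solver="liblinear"` and `fit_intercept=True`, while the default solver is `lbfgs`. The sketch below illustrates this on synthetic data (an assumption, not the notebook's dataset); `fitted_params` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data (assumption: a stand-in for the notebook's dataset).
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=42
)

def fitted_params(solver: str, intercept_scaling: float) -> np.ndarray:
    """Fit a model and return its coefficients and intercept as one vector."""
    model = LogisticRegression(
        solver=solver, intercept_scaling=intercept_scaling, max_iter=1000
    )
    model.fit(X, y)
    return np.concatenate([model.coef_.ravel(), model.intercept_])

# With lbfgs the parameter is ignored, so both fits coincide.
lbfgs_same = np.allclose(fitted_params("lbfgs", 1), fitted_params("lbfgs", 10))

# With liblinear the intercept is regularized differently, so the fits can differ.
liblinear_gap = np.abs(
    fitted_params("liblinear", 1) - fitted_params("liblinear", 10)
).max()

print("lbfgs unaffected by intercept_scaling:", lbfgs_same)
print("largest liblinear parameter change:", liblinear_gap)
```

To see `intercept_scaling` change the results in this exercise, the models would need `solver="liblinear"`; parameters such as `C` affect the default solver directly.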

#### K Neighbors Classifier

In [12]:
k_neighbors_classifier_baseline_key = "K Neighbors Classifier"

# Note: leaf_size tunes the speed and memory use of the tree-based neighbor
# search (BallTree / KDTree); it is not expected to change the predictions.
k_neighbors_classifier_models = {
    "K Neighbors Classifier": {
        "model": KNeighborsClassifier(),
        "params": {}
    },
    "K Neighbors Classifier with 10 in neighbors": {
        "model": KNeighborsClassifier(n_neighbors=10),
        "params": {
            "n_neighbors": 10
        }
    },
    "K Neighbors Classifier with 50 leaf size": {
        "model": KNeighborsClassifier(leaf_size=50),
        "params": {
            "leaf_size": 50
        }
    },
    "K Neighbors Classifier with 10 in neighbors and  50 leaf size": {
        "model": KNeighborsClassifier(n_neighbors=10, leaf_size=50),
        "params": {
            "n_neighbors": 10,
            "leaf_size": 50
        }
    },
}

best_k_neighbors_classifier = avalia_modelo(k_neighbors_classifier_models)



Modelo K Neighbors Classifier treinado e registrado no MLflow.
Modelo K Neighbors Classifier with 10 in neighbors treinado e registrado no MLflow.
Modelo K Neighbors Classifier with 50 leaf size treinado e registrado no MLflow.
Modelo K Neighbors Classifier with 10 in neighbors and  50 leaf size treinado e registrado no MLflow.
Resultado da comparação:
                                              Modelo  Acurácia  F1-Score  \
1        K Neighbors Classifier with 10 in neighbors  0.810056  0.805268   
3  K Neighbors Classifier with 10 in neighbors an...  0.810056  0.805268   
2           K Neighbors Classifier with 50 leaf size  0.804469  0.803325   
0                             K Neighbors Classifier  0.804469  0.803325   

   Tempo de Treinamento (s)  \
1                  0.029608   
3                  0.031187   
2                  0.029382   
0                  0.032176   

                                               model  \
1               KNeighborsClassifier(n_neighbors=10)   
3  KNeighborsClassifier(leaf_size=50, n_neighbors...   
2                 KNeighborsClassifier(leaf_size=50)   
0                             KNeighborsClassifier()   

                                 params  
1            
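
Writing each `"params"` dictionary by hand means it can drift away from the arguments actually passed to the estimator. A possible alternative is to derive the logged parameters from the estimator itself. The sketch below assumes a hypothetical helper, `changed_params`, which compares `get_params()` against a default-constructed instance of the same class. It only works for estimators whose constructor arguments all have defaults, which holds for the four classifiers in this notebook.

```python
from sklearn.neighbors import KNeighborsClassifier

def changed_params(model) -> dict:
    """Return only the constructor arguments that differ from the class defaults."""
    defaults = type(model)().get_params(deep=False)
    current = model.get_params(deep=False)
    return {key: value for key, value in current.items() if value != defaults[key]}

model = KNeighborsClassifier(n_neighbors=10, leaf_size=50)
logged = changed_params(model)
print(logged)
```

The result could then be used in place of the hand-written dictionary, for example `{"model": model, "params": changed_params(model)}`.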

#### Gradient Boosting

In [13]:
gradient_boosting_classifier_baseline_key = "Gradient Boosting"

gradient_boosting_classifier_models = {
    "Gradient Boosting": {
        "model": GradientBoostingClassifier(random_state=42),
        "params": {
            "random_state": 42
        }
    },
    "Gradient Boosting with 200 n estimators": {
        "model": GradientBoostingClassifier(random_state=42, n_estimators=200),
        "params": {
            "random_state": 42,
            "n_estimators": 200
        }
    },
    "Gradient Boosting with 10 max depth": {
        "model": GradientBoostingClassifier(random_state=42, max_depth=10),
        "params": {
            "random_state": 42,
            "max_depth": 10
        }
    },
    "Gradient Boosting with 200 n estimators and 10 max depth": {
        "model": GradientBoostingClassifier(random_state=42, n_estimators=200, max_depth=10),
        "params": {
            "random_state": 42,
            "n_estimators": 200,
            "max_depth": 10
        }
    },
}

best_gradient_boosting_classifier = avalia_modelo(gradient_boosting_classifier_models)



Modelo Gradient Boosting treinado e registrado no MLflow.
Modelo Gradient Boosting with 200 n estimators treinado e registrado no MLflow.
Modelo Gradient Boosting with 10 max depth treinado e registrado no MLflow.
Modelo Gradient Boosting with 200 n estimators and 10 max depth treinado e registrado no MLflow.
Resultado da comparação:
                                              Modelo  Acurácia  F1-Score  \
2                Gradient Boosting with 10 max depth  0.821229  0.818174   
3  Gradient Boosting with 200 n estimators and 10...  0.821229  0.817477   
1            Gradient Boosting with 200 n estimators  0.810056  0.807491   
0                                  Gradient Boosting  0.804469  0.800754   

   Tempo de Treinamento (s)  \
2                  3.065399   
3                  6.352979   
1                  1.946008   
0                  1.214474   

                                               model  \
2  ([DecisionTreeRegressor(criterion='friedman_ms...   
3  ([DecisionTreeRegressor(criterion='friedman_ms...   
1  ([DecisionTreeRegressor(criterion='friedman_ms...   
0  ([DecisionTreeRegressor(criterion='friedman_ms...   

                                              params  
2    
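
Gradient boosting adds trees one stage at a time, so the effect of `n_estimators` can be inspected from a single fit instead of training one model per value. The sketch below is a minimal illustration using `staged_predict`, which yields the predictions after each boosting stage. The data is synthetic (an assumption, not the notebook's dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's train/test files (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(random_state=42, n_estimators=200)
model.fit(X_tr, y_tr)

# staged_predict yields the test predictions after 1, 2, ..., 200 stages.
staged_accuracy = [
    accuracy_score(y_te, y_pred) for y_pred in model.staged_predict(X_te)
]

for n in (50, 100, 200):
    print(f"{n} estimators: accuracy = {staged_accuracy[n - 1]:.3f}")
```

This gives a quick view of whether accuracy is still improving, flat, or dropping as more trees are added.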

#### Best of the best

In [14]:
best_results = pd.concat([best_radon_forest.to_frame().T,
                         best_logistic_regression.to_frame().T,
                         best_k_neighbors_classifier.to_frame().T,
                         best_gradient_boosting_classifier.to_frame().T],
                         ignore_index=True,
                         sort=False)

best_results.head()
best_result = check_best_df(best_results)

Resultado da comparação:
                                         Modelo  Acurácia  F1-Score  \
1  Logistic Regression with False fit intercept  0.821229  0.819935   
0                                 Random Forest  0.821229  0.818174   
3           Gradient Boosting with 10 max depth  0.821229  0.818174   
2   K Neighbors Classifier with 10 in neighbors  0.810056  0.805268   

  Tempo de Treinamento (s)                                              model  \
1                 0.057311  LogisticRegression(fit_intercept=False, max_it...   
0                 0.442809  (DecisionTreeClassifier(max_features='sqrt', r...   
3                 3.065399  ([DecisionTreeRegressor(criterion='friedman_ms...   
2                 0.029608               KNeighborsClassifier(n_neighbors=10)   

                                              params  
1  {'max_iter': 1000, 'random_state': 42, 'fit_in...  
0                               {'random_state': 42}  
3              {'random_state': 42, 'max_depth':
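
Three of the four finalists tie at an accuracy of 0.821229, and the winner is decided by training time alone. To judge how much weight the remaining gap deserves, the sketch below estimates the sampling noise of one accuracy value. It assumes a test set of 179 rows; this size is inferred from the printed accuracies (147/179 ≈ 0.821229) and is not stated in the notebook. `accuracy_std_error` is a hypothetical helper.

```python
import math

N_TEST = 179  # assumption inferred from the printed accuracies, not stated in the notebook

def accuracy_std_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

best_acc = 0.821229  # Random Forest, Logistic Regression, Gradient Boosting
knn_acc = 0.810056   # K Neighbors Classifier with n_neighbors=10

se = accuracy_std_error(best_acc, N_TEST)
gap_in_rows = round((best_acc - knn_acc) * N_TEST)

print(f"standard error: {se:.4f}")
print(f"gap between best and KNN: {gap_in_rows} test rows")
```

Under that assumption the gap to the K Neighbors model is two test rows, well inside one standard error of about 0.029, so the ranking between the families should be read with caution.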

#### Saving the best model with its hyperparameters

In [15]:
with mlflow.start_run(run_name="Melhor Modelo com hiperparâmetros"):
    mlflow.log_param("Modelo", best_result["Modelo"])
    # Also log the hyperparameters of the winning configuration
    for param_key, param_value in best_result["params"].items():
        mlflow.log_param(param_key, param_value)
    mlflow.log_metric("Acurácia", best_result["Acurácia"])
    mlflow.log_metric("F1-Score", best_result["F1-Score"])
    mlflow.log_metric("Tempo de Treinamento", best_result["Tempo de Treinamento (s)"])
    mlflow.sklearn.log_model(best_result["model"], "modelo")

best_model_name = best_result["Modelo"]
print(f"Melhor modelo ({best_model_name}) armazenado com sucesso no MLflow.")



Melhor modelo (Logistic Regression with False fit intercept) armazenado com sucesso no MLflow.
