## Experimenting with models and configurations for the regression task

##### Students:
- Gabriel Fonseca (2111066)
- Yasmim Santos (2116925)
- Alejandro Elias (2111189)
- Pedro Lucas (2111131)

Dataset: Exame Nacional do Ensino Médio (Enem, Brazil's National High School Exam): https://basedosdados.org/dataset/3e9c8804-c31c-4f48-9a45-d67f1c21a859

### Importing the dependencies:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler

### Reading and viewing the data:

In [2]:
df_enem = pd.read_csv(
    "../data/out/enem-dados-tratados.csv",
    dtype={
        "id_inscricao": np.int64,
        "ensino": int,
        "nota_ciencias_natureza": float,
        "nota_ciencias_humanas": float,
        "nota_linguagens_codigos": float,
        "nota_matematica": float,
        "nota_redacao": float,
        "q_formacao_pai": str,
        "q_formacao_mae": str,
        "q_renda_familia": str,
    },
)

df_enem

|  | id_inscricao | ensino | nota_ciencias_natureza | nota_ciencias_humanas | nota_linguagens_codigos | nota_matematica | nota_redacao | q_formacao_pai | q_formacao_mae | q_renda_familia | ano |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 150001892848 | 3 | 366.8 | 436.9 | 374.2 | 331.4 | 380.0 | B | A | C | 2015 |
| 1 | 150002421428 | 1 | 512.0 | 636.9 | 552.0 | 549.2 | 760.0 | A | A | C | 2015 |
| 2 | 150004396764 | 1 | 470.8 | 519.3 | 465.2 | 350.8 | 580.0 | B | A | B | 2015 |
| 3 | 150001657786 | 1 | 492.6 | 641.2 | 553.2 | 649.5 | 840.0 | A | A | A | 2015 |
| 4 | 150005415838 | 1 | 473.3 | 533.4 | 443.3 | 447.4 | 400.0 | A | A | A | 2015 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 357268 | 210054596750 | 1 | 450.6 | 403.1 | 443.3 | 479.8 | 0.0 | E | E | B | 2022 |
| 357269 | 210056286560 | 1 | 416.5 | 427.3 | 484.6 | 376.2 | 0.0 | D | D | A | 2022 |
| 357270 | 210057495281 | 1 | 462.1 | 421.7 | 432.1 | 530.9 | 0.0 | C | D | B | 2022 |
| 357271 | 210056812211 | 1 | 519.1 | 570.4 | 537.3 | 388.7 | 0.0 | D | H | B | 2022 |


### Preparing the data for use in the test:

In [3]:
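mm_scaler = MinMaxScaler()

# Keep only rows with a non-zero score in both subjects; .copy() avoids
# pandas' SettingWithCopyWarning when the scaled columns are added below.
df_enem = df_enem[df_enem["nota_ciencias_natureza"] != 0.0]
df_enem = df_enem[df_enem["nota_ciencias_humanas"] != 0.0].copy()

# Note: fit_transform refits mm_scaler on every call, so after these two lines
# the scaler only holds the min/max of nota_ciencias_humanas.
df_enem["nota_ciencias_natureza_scl"] = mm_scaler.fit_transform(df_enem[["nota_ciencias_natureza"]])
df_enem["nota_ciencias_humanas_scl"] = mm_scaler.fit_transform(df_enem[["nota_ciencias_humanas"]])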

df_enem

|  | id_inscricao | ensino | nota_ciencias_natureza | nota_ciencias_humanas | nota_linguagens_codigos | nota_matematica | nota_redacao | q_formacao_pai | q_formacao_mae | q_renda_familia | ano | nota_ciencias_natureza_scl | nota_ciencias_humanas_scl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 150001892848 | 3 | 366.8 | 436.9 | 374.2 | 331.4 | 380.0 | B | A | C | 2015 | 0.111326 | 0.236413 |
| 1 | 150002421428 | 1 | 512.0 | 636.9 | 552.0 | 549.2 | 760.0 | A | A | C | 2015 | 0.367095 | 0.595157 |
| 2 | 150004396764 | 1 | 470.8 | 519.3 | 465.2 | 350.8 | 580.0 | B | A | B | 2015 | 0.294522 | 0.384215 |
| 3 | 150001657786 | 1 | 492.6 | 641.2 | 553.2 | 649.5 | 840.0 | A | A | A | 2015 | 0.332922 | 0.602870 |
| 4 | 150005415838 | 1 | 473.3 | 533.4 | 443.3 | 447.4 | 400.0 | A | A | A | 2015 | 0.298925 | 0.409507 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 357268 | 210054596750 | 1 | 450.6 | 403.1 | 443.3 | 479.8 | 0.0 | E | E | B | 2022 | 0.258940 | 0.175785 |
| 357269 | 210056286560 | 1 | 416.5 | 427.3 | 484.6 | 376.2 | 0.0 | D | D | A | 2022 | 0.198873 | 0.219193 |
| 357270 | 210057495281 | 1 | 462.1 | 421.7 | 432.1 | 530.9 | 0.0 | C | D | B | 2022 | 0.279197 | 0.209148 |
| 357271 | 210056812211 | 1 | 519.1 | 570.4 | 537.3 | 388.7 | 0.0 | D | H | B | 2022 | 0.379602 | 0.475874 |
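
The preparation cell fits the scaler on the full dataset before the train/test split, so the minimum and maximum of the rows that later become the test set influence the scaled values. A minimal sketch of one alternative, assuming synthetic scores in place of the Enem columns: split first, then fit one `MinMaxScaler` per column on the training rows only. The `scale_after_split` helper and the generated data are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def scale_after_split(x, y, train_size=0.8, random_state=5487):
    """Split first, then fit each MinMaxScaler on the training portion only."""
    x_train, x_test, y_train, y_test = train_test_split(
        x.reshape(-1, 1),
        y.reshape(-1, 1),
        train_size=train_size,
        random_state=random_state,
    )
    x_scaler = MinMaxScaler().fit(x_train)  # learns min/max from training rows
    y_scaler = MinMaxScaler().fit(y_train)
    return (
        x_scaler.transform(x_train),
        x_scaler.transform(x_test),  # test rows reuse the training min/max
        y_scaler.transform(y_train).ravel(),
        y_scaler.transform(y_test).ravel(),
    )


# Synthetic stand-ins for the two score columns (assumption, not Enem data).
rng = np.random.default_rng(0)
natureza = rng.uniform(300, 800, size=1000)
humanas = 0.7 * natureza + rng.normal(0, 50, size=1000)

X_train, X_test, Y_train, Y_test = scale_after_split(natureza, humanas)
print(X_train.shape, X_test.shape)
```

With this approach, scaled test values can fall slightly outside the [0, 1] range, which is expected because the test rows were not seen when the scaler was fitted.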


In [4]:
features = [
    "nota_ciencias_natureza_scl",
]

X = np.array(df_enem[features])
Y = np.array(df_enem["nota_ciencias_humanas_scl"])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=0.8, random_state=5487
)

X_train: np.ndarray = X_train
Y_train: np.ndarray = Y_train
X_test: np.ndarray = X_test
Y_test: np.ndarray = Y_test

pd.DataFrame(X_train).head()

|  | 0 |
|---|---|
| 0 | 0.28818 |
| 1 | 0.258411 |
| 2 | 0.315484 |
| 3 | 0.112912 |
| 4 | 0.275145 |
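
Because the setup uses a single feature, an ordinary least-squares fit has a simple interpretation: when scored on the same data it was fitted on, its R² equals the squared Pearson correlation between the feature and the target. A small sketch on synthetic data (the values below are generated for illustration and are not Enem data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic single-feature data (assumption, for illustration only).
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=500)
y = 0.6 * x + rng.normal(0, 0.1, size=500)

# Fit a one-feature linear model with intercept and score it in-sample.
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2_in_sample = r2_score(y, model.predict(x.reshape(-1, 1)))

# Pearson correlation between the feature and the target.
pearson_r = np.corrcoef(x, y)[0, 1]

print(f"R^2 = {r2_in_sample:.4f}, r^2 = {pearson_r ** 2:.4f}")
```

The identity holds only in-sample; cross-validated or held-out R² values will generally differ somewhat from the squared correlation.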


### Running the test:
##### Selected models:
1. **Linear Regression (`LinearRegression`)**
2. **Ridge Regression (`Ridge`)**
3. **Lasso Regression (`Lasso`)**
4. **Elastic Net (`ElasticNet`)**
5. **Support Vector Regression (`SVR`)** (commented out in the grid below, so it was not evaluated in this run)
6. **Decision Tree Regression (`DecisionTreeRegressor`)**
7. **Random Forest Regression (`RandomForestRegressor`)**
8. **Gradient Boosting Regression (`GradientBoostingRegressor`)**
9. **K-Nearest Neighbors Regression (`KNeighborsRegressor`)**
10. **Bayesian Ridge Regression (`BayesianRidge`)**
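
The cost of a grid search grows with the number of parameter combinations, since each candidate is fitted once per cross-validation fold. A rough sketch of how to count those fits with `ParameterGrid`, using three of the grids from this notebook's grid-search cell and its `cv=5` setting (the `example_grids` and `fit_counts` names are introduced here only for illustration):

```python
from sklearn.model_selection import ParameterGrid

cv_folds = 5  # same number of folds as the GridSearchCV call in this notebook

# A subset of the parameter grids used in the grid-search cell.
example_grids = {
    "Ridge": {"alpha": [0.1, 1, 10, 100, 1000]},
    "ElasticNet": {
        "alpha": [0.001, 0.01, 0.1, 1, 10],
        "l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9],
    },
    "GradientBoostingRegressor": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 1, 2],
        "max_depth": [3, 5, 7, 10],
    },
}

# Cross-validation fits = number of candidate combinations * number of folds.
fit_counts = {
    name: len(ParameterGrid(grid)) * cv_folds
    for name, grid in example_grids.items()
}

for name, n_fits in fit_counts.items():
    print(f"{name}: {n_fits} cross-validation fits")
```

These counts exclude the one extra fit that `GridSearchCV` performs on the whole training set when `refit=True` (its default).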

In [5]:
from sklearn.linear_model import (
    LinearRegression,
    Ridge,
    Lasso,
    ElasticNet,
    BayesianRidge,
)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

models_params = {
    LinearRegression(): {},
    Ridge(): {"alpha": [0.1, 1, 10, 100, 1000]},
    Lasso(): {"alpha": [0.001, 0.01, 0.1, 1, 10]},
    ElasticNet(): {
        "alpha": [0.001, 0.01, 0.1, 1, 10],
        "l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9],
    },
    BayesianRidge(): {
        "alpha_1": [1e-6, 1e-5, 1e-4, 1e-3],
        "lambda_1": [1e-6, 1e-5, 1e-4, 1e-3],
    },
    # SVR(): {
    #     "kernel": ["linear", "poly", "rbf", "sigmoid"],
    #     "C": [0.1, 1, 10, 100],
    #     "gamma": ["scale", "auto"],
    # },
    DecisionTreeRegressor(): {
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 10, 20],
        "min_samples_leaf": [1, 5, 10],
    },
    RandomForestRegressor(): {
        "n_estimators": [50, 100, 200],
        "max_features": ["sqrt", "log2"],
        "max_depth": [None, 10, 20, 30],
    },
    GradientBoostingRegressor(): {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 1, 2],
        "max_depth": [3, 5, 7, 10],
    },
    KNeighborsRegressor(): {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    },
}

results = []

models_and_params = models_params.items()
n_models = len(models_and_params)
current_model_iter = 1

for model, params in models_and_params:
    model_name = model.__class__.__name__
    print(f"Avaliando o modelo {model_name}... [{current_model_iter}/{n_models}]")
    try:
        grid_search = GridSearchCV(estimator=model, param_grid=params, cv=5)
        grid_search.fit(X_train, Y_train)
        test_score = grid_search.score(X_test, Y_test)
        results.append(
            {
                "Modelo": model_name,
                "Melhor pontuação": grid_search.best_score_,
                "Melhores parâmetros": grid_search.best_params_,
                "Resultado avaliação": test_score,
            }
        )
    except Exception as e:
        # Report the failure reason instead of silently discarding it.
        print(f"Erro no modelo {model_name}: {e}")
    current_model_iter += 1

results_df = pd.DataFrame(results)
results_df.to_csv('../data/out/resultado_gs_regressao.csv', index=False)
results_df

Avaliando o modelo LinearRegression... [1/9]
Avaliando o modelo Ridge... [2/9]
Avaliando o modelo Lasso... [3/9]
Avaliando o modelo ElasticNet... [4/9]
Avaliando o modelo BayesianRidge... [5/9]
Avaliando o modelo DecisionTreeRegressor... [6/9]
Avaliando o modelo RandomForestRegressor... [7/9]
Avaliando o modelo GradientBoostingRegressor... [8/9]
Avaliando o modelo KNeighborsRegressor... [9/9]


|  | Modelo | Melhor pontuação | Melhores parâmetros | Resultado avaliação |
|---|---|---|---|---|
| 0 | LinearRegression | 0.490225 | {} | 0.493215 |
| 1 | Ridge | 0.490225 | {'alpha': 0.1} | 0.493215 |
| 2 | Lasso | 0.488488 | {'alpha': 0.001} | 0.491437 |
| 3 | ElasticNet | 0.489094 | {'alpha': 0.001, 'l1_ratio': 0.1} | 0.492052 |
| 4 | BayesianRidge | 0.490225 | {'alpha_1': 0.001, 'lambda_1': 0.001} | 0.493215 |
| 5 | DecisionTreeRegressor | 0.522759 | {'max_depth': 10, 'min_samples_leaf': 10, 'min... | 0.524162 |
| 6 | RandomForestRegressor | 0.524201 | {'max_depth': 10, 'max_features': 'sqrt', 'n_e... | 0.525405 |
| 7 | GradientBoostingRegressor | 0.52529 | {'learning_rate': 0.1, 'max_depth': 3, 'n_esti... | 0.526133 |
| 8 | KNeighborsRegressor | 0.47328 | {'algorithm': 'brute', 'n_neighbors': 9, 'weig... | 0.476273 |
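
Both score columns above are R² values: with no `scoring` argument, `GridSearchCV` falls back to the estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination. `best_score_` is the mean cross-validated R² of the best candidate, and `grid_search.score(X_test, Y_test)` is the R² of the refitted best estimator on the held-out split. A small sketch on synthetic data (not Enem data) showing that the latter matches `r2_score`:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic single-feature regression problem (assumption, for illustration).
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(400, 1))
y = 0.5 * X[:, 0] + rng.normal(0, 0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=5487
)

# No `scoring` argument: the search uses Ridge.score, i.e. R^2.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

score_from_search = search.score(X_test, y_test)
score_from_metric = r2_score(y_test, search.predict(X_test))

print(score_from_search, score_from_metric)
```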
