## Experimenting with models and configurations for the classification task

##### Students:
- Gabriel Fonseca (2111066)
- Yasmim Santos (2116925)
- Alejandro Elias (2111189)
- Pedro Lucas (2111131)

Chosen dataset: Exame Nacional do Ensino Médio (Enem), https://basedosdados.org/dataset/3e9c8804-c31c-4f48-9a45-d67f1c21a859

### Importing the dependencies:

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler

### Reading and viewing the data:

In [2]:
df_enem = pd.read_csv(
    "../data/out/enem-dados-tratados.csv",
    dtype={
        "id_inscricao": np.int64,
        "ensino": int,
        "nota_ciencias_natureza": float,
        "nota_ciencias_humanas": float,
        "nota_linguagens_codigos": float,
        "nota_matematica": float,
        "nota_redacao": float,
        "q_formacao_pai": str,
        "q_formacao_mae": str,
        "q_renda_familia": str,
    },
)

df_enem

Unnamed: 0,id_inscricao,ensino,nota_ciencias_natureza,nota_ciencias_humanas,nota_linguagens_codigos,nota_matematica,nota_redacao,q_formacao_pai,q_formacao_mae,q_renda_familia,ano
0,150001892848,3,366.8,436.9,374.2,331.4,380.0,B,A,C,2015
1,150002421428,1,512.0,636.9,552.0,549.2,760.0,A,A,C,2015
2,150004396764,1,470.8,519.3,465.2,350.8,580.0,B,A,B,2015
3,150001657786,1,492.6,641.2,553.2,649.5,840.0,A,A,A,2015
4,150005415838,1,473.3,533.4,443.3,447.4,400.0,A,A,A,2015
...,...,...,...,...,...,...,...,...,...,...,...
357268,210054596750,1,450.6,403.1,443.3,479.8,0.0,E,E,B,2022
357269,210056286560,1,416.5,427.3,484.6,376.2,0.0,D,D,A,2022
357270,210057495281,1,462.1,421.7,432.1,530.9,0.0,C,D,B,2022
357271,210056812211,1,519.1,570.4,537.3,388.7,0.0,D,H,B,2022


### Preparing the data for use in the test:

In [3]:
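mm_scaler = MinMaxScaler()

# Drop rows whose natural-sciences or humanities score is zero.
# .copy() guards against pandas' SettingWithCopyWarning on the assignments below.
df_enem = df_enem[
    (df_enem["nota_ciencias_natureza"] != 0.0)
    & (df_enem["nota_ciencias_humanas"] != 0.0)
].copy()

# Mean of the four objective-test scores.
df_enem["nota_objetiva"] = (
    df_enem["nota_ciencias_natureza"]
    + df_enem["nota_ciencias_humanas"]
    + df_enem["nota_linguagens_codigos"]
    + df_enem["nota_matematica"]
) / 4

# Scale the mean score to [0, 1].
# Note: the scaler is fitted on the full dataset, before the train/test split.
df_enem["nota_objetiva_scl"] = mm_scaler.fit_transform(df_enem[["nota_objetiva"]])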

map_grupo_renda = {
    "A": "nenhuma_renda",
    "B": "muito_baixa_renda",
    "C": "muito_baixa_renda",
    "D": "muito_baixa_renda",
    "E": "muito_baixa_renda",
    "F": "baixa_renda",
    "G": "baixa_renda",
    "H": "baixa_renda",
    "I": "baixa_renda",
    "J": "media_renda",
    "K": "media_renda",
    "L": "media_renda",
    "M": "media_renda",
    "N": "alta_renda",
    "O": "alta_renda",
    "P": "alta_renda",
    "Q": "alta_renda"
}

df_enem["q_renda_familia_classe"] = df_enem["q_renda_familia"].map(map_grupo_renda)
df_enem

Unnamed: 0,id_inscricao,ensino,nota_ciencias_natureza,nota_ciencias_humanas,nota_linguagens_codigos,nota_matematica,nota_redacao,q_formacao_pai,q_formacao_mae,q_renda_familia,ano,nota_objetiva,nota_objetiva_scl,q_renda_familia_classe
0,150001892848,3,366.8,436.9,374.2,331.4,380.0,B,A,C,2015,377.325,0.293561,muito_baixa_renda
1,150002421428,1,512.0,636.9,552.0,549.2,760.0,A,A,C,2015,562.525,0.579518,muito_baixa_renda
2,150004396764,1,470.8,519.3,465.2,350.8,580.0,B,A,B,2015,451.525,0.408129,muito_baixa_renda
3,150001657786,1,492.6,641.2,553.2,649.5,840.0,A,A,A,2015,584.125,0.612870,nenhuma_renda
4,150005415838,1,473.3,533.4,443.3,447.4,400.0,A,A,A,2015,474.350,0.443372,nenhuma_renda
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
357268,210054596750,1,450.6,403.1,443.3,479.8,0.0,E,E,B,2022,444.200,0.396819,muito_baixa_renda
357269,210056286560,1,416.5,427.3,484.6,376.2,0.0,D,D,A,2022,426.150,0.368949,nenhuma_renda
357270,210057495281,1,462.1,421.7,432.1,530.9,0.0,C,D,B,2022,461.700,0.423840,muito_baixa_renda
357271,210056812211,1,519.1,570.4,537.3,388.7,0.0,D,H,B,2022,503.875,0.488960,muito_baixa_renda
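
`Series.map` returns `NaN` for any value missing from the dictionary, so an income code outside `A` to `Q` would silently produce a missing class label. A minimal sketch of a sanity check follows, using a small hypothetical sample (`sample`, with an invented code `"Z"`) rather than the real data:

```python
import pandas as pd

# Hypothetical sample of income codes; "Z" stands in for an unexpected value.
sample = pd.DataFrame({"q_renda_familia": ["A", "C", "F", "J", "Q", "Z"]})

# Same grouping as in the preparation cell, built from letter ranges.
map_grupo_renda = {"A": "nenhuma_renda"}
map_grupo_renda.update({k: "muito_baixa_renda" for k in "BCDE"})
map_grupo_renda.update({k: "baixa_renda" for k in "FGHI"})
map_grupo_renda.update({k: "media_renda" for k in "JKLM"})
map_grupo_renda.update({k: "alta_renda" for k in "NOPQ"})

sample["q_renda_familia_classe"] = sample["q_renda_familia"].map(map_grupo_renda)

# Codes that did not match any dictionary key end up as NaN.
unmapped = sample.loc[sample["q_renda_familia_classe"].isna(), "q_renda_familia"]

# Class distribution, including missing labels, to spot imbalance early.
class_counts = sample["q_renda_familia_classe"].value_counts(dropna=False)

print(sorted(unmapped.unique()))
print(class_counts)
```

Running the same two checks on `df_enem` before building `Y` would show whether any rows carry a missing label and how unevenly the five classes are populated.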


In [4]:
features = [
    "nota_objetiva_scl",
]

X = np.array(df_enem[features])
Y = np.array(df_enem["q_renda_familia_classe"])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=0.8, random_state=5487
)

X_train: np.ndarray = X_train
Y_train: np.ndarray = Y_train
X_test: np.ndarray = X_test
Y_test: np.ndarray = Y_test

pd.DataFrame(X_train).head()

Unnamed: 0,0
0,0.403883
1,0.423686
2,0.445881
3,0.297614
4,0.454605
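
In the preparation cell, `MinMaxScaler` is fitted on the whole dataset before `train_test_split`, so the test rows contribute to the scaling bounds. One possible alternative is to place the scaler inside a `Pipeline`, so that it is fitted only on the training folds. The sketch below uses synthetic data; the arrays `X_raw` and `y` are hypothetical stand-ins, and `stratify` is an addition the notebook does not use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(5487)
# Hypothetical stand-ins for the unscaled mean score and a binary label.
X_raw = rng.normal(500.0, 80.0, size=(400, 1))
y = (X_raw[:, 0] + rng.normal(0.0, 60.0, size=400) > 500.0).astype(int)

# stratify keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, train_size=0.8, random_state=5487, stratify=y
)

pipeline = Pipeline(
    [
        ("scaler", MinMaxScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Step parameters are addressed as "<step name>__<parameter>".
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

With this layout the scaling bounds of the refitted model come from `X_train` alone, and each cross-validation fold rescales using only its own training part.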


### Running the test:
##### Selected models:
1. **Logistic Regression (`LogisticRegression`)**
2. **K-Nearest Neighbors (`KNeighborsClassifier`)**
3. **Support Vector Classifier (`SVC`)** (commented out in the parameter grid below, so it is not evaluated)
4. **Decision Tree Classifier (`DecisionTreeClassifier`)**
5. **Random Forest Classifier (`RandomForestClassifier`)**
6. **Gradient Boosting Classifier (`GradientBoostingClassifier`)**
7. **AdaBoost Classifier (`AdaBoostClassifier`)**
8. **Naive Bayes (`GaussianNB`, `MultinomialNB`, `BernoulliNB`)**
9. **Perceptron (`Perceptron`)**
10. **SGD Linear Classifier (`SGDClassifier`)**

In [5]:
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
)
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

models_params = {
    LogisticRegression(): {
        "C": [0.1, 1, 10, 100, 1000],
        "penalty": ["l1", "l2", "elasticnet", "none"],
        "solver": ["liblinear", "saga"],
        "max_iter": [1000, 2000, 3000],
    },
    KNeighborsClassifier(): {
        "n_neighbors": [3, 5, 10, 20],
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    },
    # SVC(): {
    #     "kernel": ["linear", "poly", "rbf", "sigmoid"],
    #     "C": [0.1, 1, 10, 100],
    #     "gamma": ["scale", "auto"],
    # },
    DecisionTreeClassifier(): {
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 10, 20],
        "min_samples_leaf": [1, 5, 10],
    },
    RandomForestClassifier(): {
        "n_estimators": [50, 100, 200],
        "max_features": ["sqrt", "log2"],
        "max_depth": [None, 10, 20, 30],
    },
    GradientBoostingClassifier(): {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 0.2],
        "max_depth": [3, 5, 10],
    },
    AdaBoostClassifier(): {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.1, 1.0],
    },
    GaussianNB(): {},
    MultinomialNB(): {"alpha": [0.01, 0.1, 1, 10]},
    BernoulliNB(): {"alpha": [0.01, 0.1, 1, 10]},
    Perceptron(): {
        "penalty": ["l1", "l2", "elasticnet"],
        "alpha": [0.0001, 0.001, 0.01, 0.1],
        "max_iter": [1000, 2000, 3000],
    },
    SGDClassifier(): {
        "loss": ["hinge", "log", "modified_huber"],
        "penalty": ["l1", "l2", "elasticnet"],
        "alpha": [0.0001, 0.001, 0.01, 0.1],
        "learning_rate": ["constant", "optimal", "invscaling", "adaptive"],
    },
}

results = []

n_models = len(models_params)

# The Portuguese labels are kept so they match the recorded output below:
# "Avaliando o modelo" = "Evaluating model", "Melhor pontuação" = "Best score",
# "Melhores parâmetros" = "Best parameters", "Resultado avaliação" = "Test score".
for current_model_iter, (model, params) in enumerate(models_params.items(), start=1):
    model_name = model.__class__.__name__
    print(f"Avaliando o modelo {model_name}... [{current_model_iter}/{n_models}]")
    try:
        # 5-fold cross-validated search over this model's parameter grid.
        grid_search = GridSearchCV(estimator=model, param_grid=params, cv=5)
        grid_search.fit(X_train, Y_train)
        # Score of the refitted best estimator on the held-out test split.
        test_score = grid_search.score(X_test, Y_test)
        results.append(
            {
                "Modelo": model_name,
                "Melhor pontuação": grid_search.best_score_,
                "Melhores parâmetros": grid_search.best_params_,
                "Resultado avaliação": test_score,
            }
        )
    except Exception as e:
        # Report the cause instead of silently discarding it.
        print(f"Erro no modelo {model_name}: {e}")

results_df = pd.DataFrame(results)
results_df.to_csv("../data/out/resultado_gs_classificacao.csv", index=False)
results_df

Avaliando o modelo LogisticRegression... [1/11]


300 fits failed out of a total of 600.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
75 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\fonsecovizk\Projetos\enem-microdados-ml\.env\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\fonsecovizk\Projetos\enem-microdados-ml\.env\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\fonsecovizk\Projetos\enem-microdados-ml\.env\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self

Avaliando o modelo KNeighborsClassifier... [2/11]
Avaliando o modelo DecisionTreeClassifier... [3/11]
Avaliando o modelo RandomForestClassifier... [4/11]
Avaliando o modelo GradientBoostingClassifier... [5/11]
Avaliando o modelo AdaBoostClassifier... [6/11]




Avaliando o modelo GaussianNB... [7/11]
Avaliando o modelo MultinomialNB... [8/11]
Avaliando o modelo BernoulliNB... [9/11]
Avaliando o modelo Perceptron... [10/11]
Avaliando o modelo SGDClassifier... [11/11]


600 fits failed out of a total of 720.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
360 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\fonsecovizk\Projetos\enem-microdados-ml\.env\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\fonsecovizk\Projetos\enem-microdados-ml\.env\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\fonsecovizk\Projetos\enem-microdados-ml\.env\Lib\site-packages\sklearn\linear_model\_stochastic_gradient.py", line 915, in fit
    self._more_valid
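
The same counting exercise accounts for the `SGDClassifier` failures. It rests on two assumptions that the notebook does not state: the installed release no longer accepts `"log"` as a `loss` (the name `"log_loss"` is used instead), and every `learning_rate` schedule other than `"optimal"` requires `eta0 > 0`, while the grid leaves `eta0` at a default of zero. The `eta0` values in `candidate_grid` are hypothetical.

```python
from itertools import product

# The SGDClassifier grid from the notebook.
grid = {
    "loss": ["hinge", "log", "modified_huber"],
    "penalty": ["l1", "l2", "elasticnet"],
    "alpha": [0.0001, 0.001, 0.01, 0.1],
    "learning_rate": ["constant", "optimal", "invscaling", "adaptive"],
}
cv = 5


def n_fits(param_grid, cv):
    # GridSearchCV accepts one dict or a list of dicts; count fits for either.
    grids = param_grid if isinstance(param_grid, list) else [param_grid]
    return cv * sum(len(list(product(*g.values()))) for g in grids)


combos = [dict(zip(grid, values)) for values in product(*grid.values())]

# Assumed failure causes (see the text above).
invalid_loss_fits = cv * sum(c["loss"] == "log" for c in combos)
missing_eta0_fits = cv * sum(
    c["loss"] != "log" and c["learning_rate"] != "optimal" for c in combos
)

# One possible corrected grid: eta0 is supplied wherever a schedule needs it.
shared = {
    "loss": ["hinge", "log_loss", "modified_huber"],
    "penalty": ["l1", "l2", "elasticnet"],
    "alpha": [0.0001, 0.001, 0.01, 0.1],
}
candidate_grid = [
    {**shared, "learning_rate": ["optimal"]},
    {
        **shared,
        "learning_rate": ["constant", "invscaling", "adaptive"],
        "eta0": [0.01, 0.1],  # hypothetical values
    },
]

print(n_fits(grid, cv), invalid_loss_fits, missing_eta0_fits)
print(n_fits(candidate_grid, cv))
```

The first line prints 720, 240 and 360: the 360 matches the error group shown in the traceback, and 240 + 360 gives the 600 failed fits reported.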

Unnamed: 0,Modelo,Melhor pontuação,Melhores parâmetros,Resultado avaliação
0,LogisticRegression,0.748239,"{'C': 100, 'max_iter': 1000, 'penalty': 'l2', ...",0.749965
1,KNeighborsClassifier,0.743807,"{'algorithm': 'ball_tree', 'n_neighbors': 20, ...",0.746675
2,DecisionTreeClassifier,0.746623,"{'max_depth': 10, 'min_samples_leaf': 5, 'min_...",0.749431
3,RandomForestClassifier,0.747589,"{'max_depth': 10, 'max_features': 'sqrt', 'n_e...",0.750077
4,GradientBoostingClassifier,0.748148,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",0.750021
5,AdaBoostClassifier,0.748179,"{'learning_rate': 0.1, 'n_estimators': 50}",0.749613
6,GaussianNB,0.744988,{},0.74735
7,MultinomialNB,0.735886,{'alpha': 0.01},0.737931
8,BernoulliNB,0.735886,{'alpha': 0.01},0.737931
9,Perceptron,0.73966,"{'alpha': 0.0001, 'max_iter': 1000, 'penalty':...",0.562038
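
Most of the test scores above sit close to 0.75 even though the models differ widely and use a single feature. One way to judge whether that beats always predicting the most frequent class is a `DummyClassifier` baseline. The notebook does not show the class distribution, so the imbalanced labels below are only an assumption for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical data: one uninformative feature and a 75% / 25% class split.
X = rng.random((1000, 1))
y = np.array(["muito_baixa_renda"] * 750 + ["baixa_renda"] * 250)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=5487, stratify=y
)

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

print(baseline_score)
```

If the same baseline fitted on the notebook's `X_train` and `Y_train` also scores near 0.75, the tuned models are adding little beyond the majority class, and a metric such as balanced accuracy or macro F1 may be more informative than plain accuracy.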
