<a href="https://colab.research.google.com/github/farieu/data-analysis/blob/OutrosModelos/ModelosAdicionais(ADABoost).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Using the PyCaret library to compare models

#### Installing the library

In [None]:
!pip install pycaret

#### Import, setup, and model evaluation

In [None]:
import pandas as pd
from pycaret.regression import *

First, the setup is tested on the dataset that has already been cleaned but not imputed, passing the column we want to predict as the `target` parameter.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/BackEnd/GoodReads_cleanedwoImput.csv')
df.shape

(84054, 9)

In [None]:
rg_setup = setup(df, target='rating')


Unnamed: 0,Description,Value
0,Session id,8592
1,Target,rating
2,Target type,Regression
3,Original data shape,"(84054, 9)"
4,Transformed data shape,"(84054, 9)"
5,Transformed train set shape,"(58837, 9)"
6,Transformed test set shape,"(25217, 9)"
7,Numeric features,3
8,Categorical features,5
9,Preprocess,True
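
The train and test shapes in the summary above are consistent with a 70/30 split of the 84,054 rows. The sketch below only reproduces that arithmetic; the 0.7 fraction is an assumption inferred from the reported shapes, not something the notebook sets explicitly.

```python
import math

n_rows = 84054        # rows reported by df.shape
train_fraction = 0.7  # assumed split ratio, inferred from the summary table

# The training set takes the floored share; the test set gets the remainder.
n_train = math.floor(n_rows * train_fraction)
n_test = n_rows - n_train

print(n_train, n_test)  # 58837 25217
```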


In [None]:
compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
ada,AdaBoost Regressor,0.2657,0.1327,0.3641,0.0439,0.0902,0.0692,5.94
huber,Huber Regressor,0.2671,0.1357,0.3682,0.0217,0.0908,0.0698,2.919
ridge,Ridge Regression,0.2704,0.1359,0.3685,0.0203,0.091,0.0704,2.371
gbr,Gradient Boosting Regressor,0.2724,0.137,0.37,0.0126,0.0911,0.0709,9.709
lar,Least Angle Regression,0.2718,0.137,0.37,0.0123,0.0913,0.0707,2.36
lr,Linear Regression,0.2719,0.1371,0.3701,0.012,0.0913,0.0708,3.048
br,Bayesian Ridge,0.2719,0.1371,0.3701,0.012,0.0913,0.0708,2.535
llar,Lasso Least Angle Regression,0.271,0.1374,0.3706,0.0092,0.0912,0.0705,2.616
lasso,Lasso Regression,0.271,0.1374,0.3706,0.0092,0.0912,0.0705,2.379
en,Elastic Net,0.2709,0.1375,0.3707,0.0086,0.0912,0.0705,2.537



The comparison shows that a good model for this dataset is the AdaBoost Regressor, which ranks first among the models listed on every metric (MAE 0.2657, R2 0.0439). It is also available in scikit-learn as `AdaBoostRegressor`.

I will apply the same model, but replace the random_state of 8592 (the PyCaret session id) with 42, which is the default I am adopting in all the other models.
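
Fixing the seed is what makes runs comparable across notebooks. As a minimal sketch, assuming small synthetic data (the arrays `X_demo` and `y_demo` and the helper `fit_predict` are invented for illustration and are not the GoodReads data), fitting `AdaBoostRegressor` twice with the same `random_state` should yield identical predictions:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data, purely illustrative (not the GoodReads dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)


def fit_predict(seed):
    """Fit AdaBoostRegressor with the given seed and return its predictions."""
    model = AdaBoostRegressor(n_estimators=20, random_state=seed)
    return model.fit(X_demo, y_demo).predict(X_demo)


# Two independent fits with the same seed are compared element by element.
same_seed_matches = np.array_equal(fit_predict(42), fit_predict(42))
print(same_seed_matches)
```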

# AdaBoost with Pipeline

In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import Binarizer, LabelEncoder, MultiLabelBinarizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
df = pd.read_csv('/content/drive/MyDrive/BackEnd/GoodReads_100k_books.csv')

The preprocessing pipeline below defines how missing data is handled and how categorical data is transformed.

In [3]:
class DataFramePreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, to_drop=None, to_binarize=None, to_mtbinarize=None, drop_nan=True, drop_duplicates=True):
        self.to_drop = to_drop if to_drop else []
        self.to_binarize = to_binarize if to_binarize else []
        self.to_mtbinarize = to_mtbinarize if to_mtbinarize else []
        self.drop_nan = drop_nan
        self.drop_duplicates = drop_duplicates
        self.label_encoder = LabelEncoder()
        self.mlb = MultiLabelBinarizer()

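    def fit(self, X, y=None):
        # Nothing to learn up front: the encoders are fitted inside transform.
        return self

    def transform(self, dataframe):
        # Work on a copy so the caller's DataFrame is never modified in place.
        dataframe = dataframe.copy()

        if self.drop_duplicates:
            dataframe = dataframe.drop_duplicates()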

        for column in self.to_drop:
            if column in dataframe.columns:
                dataframe = dataframe.drop(column, axis=1)

        if self.drop_nan:
            dataframe = dataframe.dropna()

        for column in self.to_binarize:
            if column in dataframe.columns:
                dataframe[column] = self.label_encoder.fit_transform(dataframe[column])

        for column in self.to_mtbinarize:
            if column in dataframe.columns:
                dataframe[column] = dataframe[column].apply(lambda x: x.split(',') if isinstance(x, str) else [])
                column_encoded = self.mlb.fit_transform(dataframe[column])
                column_df = pd.DataFrame(column_encoded, columns=self.mlb.classes_, index=dataframe.index)
                dataframe = pd.concat([dataframe.drop(columns=[column]), column_df], axis=1)

        return dataframe


cleaning_pipeline = Pipeline([
    ('cleaner', DataFramePreprocessor(
        to_drop=['isbn', 'isbn13', 'link', 'img'],
        to_binarize=['author', 'bookformat', 'title'],
        to_mtbinarize=['genre']
    ))
])
df_cleaned = cleaning_pipeline.fit_transform(df)
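
To make the two encodings used by `DataFramePreprocessor` concrete, here is a small sketch on toy rows. The rows and the names `toy`, `format_codes`, `genre_lists`, and `genre_matrix` are invented for illustration; the encoder calls mirror the ones in the class above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "bookformat": ["Paperback", "Hardcover", "Paperback"],
    "genre": ["Fantasy,Fiction", "Fiction", None],
})

# LabelEncoder maps each distinct string to an integer, in sorted order:
# Hardcover -> 0, Paperback -> 1.
format_codes = LabelEncoder().fit_transform(toy["bookformat"])

# Split comma-separated genres into lists; non-string values become [].
genre_lists = toy["genre"].apply(lambda x: x.split(",") if isinstance(x, str) else [])

# MultiLabelBinarizer creates one 0/1 column per distinct genre.
mlb = MultiLabelBinarizer()
genre_matrix = pd.DataFrame(
    mlb.fit_transform(genre_lists), columns=mlb.classes_, index=toy.index
)

print(list(format_codes))  # [1, 0, 1]
print(genre_matrix)
```

Label encoding gives high-cardinality columns such as `author` and `title` an arbitrary integer order, which tree-based learners can tolerate better than linear models; this is a general property of the encoding rather than something the notebook evaluates.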

In [4]:
X = df_cleaned.drop(columns=['rating', 'desc'])
y = df_cleaned['rating']

In [5]:
binarizer = Binarizer(threshold=4.0)
y_binary = binarizer.fit_transform(y.values.reshape(-1, 1)).ravel()
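
As a quick check of what `Binarizer(threshold=4.0)` does to the target, the sketch below applies it to a few made-up ratings (the array `ratings` is illustrative). Values strictly greater than the threshold become 1, so a rating of exactly 4.0 is labeled 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up ratings, chosen to sit on both sides of the threshold.
ratings = np.array([3.5, 4.0, 4.01, 4.8])

# Binarizer expects a 2D array, hence the reshape; ravel() flattens it back.
labels = Binarizer(threshold=4.0).fit_transform(ratings.reshape(-1, 1)).ravel()

print(labels)  # [0. 0. 1. 1.]
```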

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

Before training the model, I will search for the best hyperparameters (beyond the ones PyCaret suggested as ideal) using GridSearchCV.

In [9]:
ada_regressor = AdaBoostRegressor(random_state=42)

# Defining the hyperparameter grid for GridSearchCV
# (each value listed once, so no combination is evaluated twice)
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1.0]
}

grid_search = GridSearchCV(estimator=ada_regressor, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

grid_search.fit(X_train, y_train)

bestregressor = grid_search.best_estimator_
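
To get a feel for the cost of this search, the sketch below enumerates the candidate combinations with plain Python, assuming three distinct `n_estimators` values (50, 100, 200) and the four learning rates used above. The names `candidates`, `n_candidates`, and `n_cv_fits` are illustrative.

```python
from itertools import product

# Grid with each distinct value listed once (an assumption for this sketch).
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.5, 1.0],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_cv_fits = n_candidates * cv_folds

print(n_candidates, n_cv_fits)  # 12 60
```

With `scoring='neg_mean_squared_error'`, GridSearchCV maximizes the negated MSE, so the candidate with the lowest cross-validated MSE is selected and then refitted on the full training set.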

Once the grid search has found the best parameters, the model is used to predict positive and negative ratings. AdaBoostRegressor outputs a continuous numeric prediction, so the results are assessed with regression metrics:



*   **MAE:** mean absolute error between the actual and predicted values;
*   **MSE:** mean of the squared errors between the actual and predicted values;
*   **R2:** proportion of the variability in the data that is explained by the model.
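
The three metrics can be computed by hand on a tiny example. The values in `y_true` and `y_hat` below are made up for illustration and are not results from this notebook.

```python
# Made-up binary targets and continuous predictions.
y_true = [1.0, 0.0, 1.0, 0.0]
y_hat = [0.8, 0.4, 0.6, 0.2]
n = len(y_true)

# MAE: average size of the errors, ignoring their sign.
mae_demo = sum(abs(t - p) for t, p in zip(y_true, y_hat)) / n

# MSE: average of the squared errors, which penalizes large misses more.
mse_demo = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2_demo = 1 - ss_res / ss_tot

print(round(mae_demo, 4), round(mse_demo, 4), round(r2_demo, 4))  # 0.3 0.1 0.6
```

An R2 close to 0 means the model does little better than always predicting the mean of the target.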



In [10]:
y_pred = bestregressor.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Best hyperparameters found:", grid_search.best_params_)
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared Score (R2): {r2}")

Best hyperparameters found: {'learning_rate': 0.01, 'n_estimators': 50}
Mean Absolute Error (MAE): 0.45045668378158765
Mean Squared Error (MSE): 0.2256053984707266
R-squared Score (R2): 0.044873882529111286
