<a href="https://colab.research.google.com/github/farieu/data-analysis/blob/OutrosModelos/ModelosAdicionais(ADABoost).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Using the PyCaret library to analyze models

#### Installing the library

In [None]:
!pip install pycaret

#### Imports, setup, and model evaluation

In [None]:
import pandas as pd
from pycaret.regression import setup, compare_models

First, setup is tested on the already-cleaned dataset without imputation, passing the column we want to predict as the target parameter.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/BackEnd/GoodReads_cleanedwoImput.csv')
df.shape

(84054, 9)

In [None]:
rg_setup = setup(df, target='rating')



| | Description | Value |
|---|---|---|
| 0 | Session id | 8592 |
| 1 | Target | rating |
| 2 | Target type | Regression |
| 3 | Original data shape | (84054, 9) |
| 4 | Transformed data shape | (84054, 9) |
| 5 | Transformed train set shape | (58837, 9) |
| 6 | Transformed test set shape | (25217, 9) |
| 7 | Numeric features | 3 |
| 8 | Categorical features | 5 |
| 9 | Preprocess | True |


In [None]:
compare_models()

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| ada | AdaBoost Regressor | 0.2657 | 0.1327 | 0.3641 | 0.0439 | 0.0902 | 0.0692 | 5.94 |
| huber | Huber Regressor | 0.2671 | 0.1357 | 0.3682 | 0.0217 | 0.0908 | 0.0698 | 2.919 |
| ridge | Ridge Regression | 0.2704 | 0.1359 | 0.3685 | 0.0203 | 0.091 | 0.0704 | 2.371 |
| gbr | Gradient Boosting Regressor | 0.2724 | 0.137 | 0.37 | 0.0126 | 0.0911 | 0.0709 | 9.709 |
| lar | Least Angle Regression | 0.2718 | 0.137 | 0.37 | 0.0123 | 0.0913 | 0.0707 | 2.36 |
| lr | Linear Regression | 0.2719 | 0.1371 | 0.3701 | 0.012 | 0.0913 | 0.0708 | 3.048 |
| br | Bayesian Ridge | 0.2719 | 0.1371 | 0.3701 | 0.012 | 0.0913 | 0.0708 | 2.535 |
| llar | Lasso Least Angle Regression | 0.271 | 0.1374 | 0.3706 | 0.0092 | 0.0912 | 0.0705 | 2.616 |
| lasso | Lasso Regression | 0.271 | 0.1374 | 0.3706 | 0.0092 | 0.0912 | 0.0705 | 2.379 |
| en | Elastic Net | 0.2709 | 0.1375 | 0.3707 | 0.0086 | 0.0912 | 0.0705 | 2.537 |




The evaluation ranks AdaBoostRegressor, which is also available in scikit-learn, as the best model for this dataset, although its R2 of 0.0439 means it explains only about 4% of the variance in `rating`.

I will apply the same model, but change the seed from 8592 (the PyCaret session id) to random_state=42, which is the standard I am adopting in all the other models.

### AdaBoostRegressor

#### Importing libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, LabelEncoder, MultiLabelBinarizer, FunctionTransformer, OneHotEncoder, Binarizer
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df_pipe = pd.read_csv('/content/drive/MyDrive/BackEnd/GoodReads_100k_books.csv')

The preprocessing pipeline below defines how missing data is handled and how the categorical features are transformed.

In [None]:
numeric_features = ['pages', 'totalratings', 'reviews']
categorical_features = ['author', 'bookformat', 'genre', 'title']

# Pipeline for the numeric features
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Pipeline for the categorical features
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
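To illustrate what `handle_unknown='ignore'` does, here is a minimal sketch on invented toy data (the `train` and `new` frames are hypothetical, not taken from the GoodReads file). A category that was not seen during `fit` is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration only
train = pd.DataFrame({"bookformat": ["Paperback", "Hardcover", "Paperback"]})
new = pd.DataFrame({"bookformat": ["Audiobook"]})  # category not seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# OneHotEncoder returns a sparse matrix by default, so convert it to inspect it
encoded_train = encoder.transform(train).toarray()
encoded_new = encoder.transform(new).toarray()

print(encoder.categories_)  # learned categories, sorted alphabetically
print(encoded_new)          # [[0. 0.]] : the unseen category becomes all zeros
```

With high-cardinality columns such as `title` or `author`, most test-set values may be unseen and so carry no signal after encoding.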

Combining the two transformers into a single preprocessor.

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

Next, the full pipeline is defined.

In [None]:
pipeline = Pipeline(steps=[
    ('dropna', FunctionTransformer(lambda df: df.dropna(subset=['title', 'desc', 'genre', 'bookformat']), validate=False)),
    ('drop_columns', FunctionTransformer(lambda df: df.drop(columns=['desc', 'isbn', 'isbn13', 'img', 'link']), validate=False)),
    ('preprocessor', preprocessor)
])

In [None]:
X = df_pipe.drop(columns=['rating'])
y = df_pipe['rating']

binarizer = Binarizer(threshold=4.0)
y_binary = binarizer.fit_transform(y.values.reshape(-1, 1)).ravel()
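As a quick check of how the threshold behaves, the sketch below applies `Binarizer(threshold=4.0)` to a few made-up ratings. Values strictly greater than the threshold map to 1, and values equal to or below it map to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up ratings, for illustration only
ratings = np.array([3.5, 4.0, 4.01, 4.7])

binarizer = Binarizer(threshold=4.0)
# Binarizer expects a 2D array, hence the reshape; ravel() flattens the result
labels = binarizer.fit_transform(ratings.reshape(-1, 1)).ravel()

print(labels)  # [0. 0. 1. 1.]
```

Note that a rating of exactly 4.0 falls in the 0 class.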

In [None]:
# Build a temporary DataFrame with X and y_binary
temp_df = pd.concat([X, pd.DataFrame(y_binary, columns=['rating_binary'])], axis=1)
df_preprocessed = pipeline.fit_transform(temp_df)

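The preprocessed matrix is then split: X_processed takes every column except the last, and y_processed takes the last column, which is intended to be the rating. Note, however, that `ColumnTransformer` uses `remainder='drop'` by default, so `rating_binary` is not listed in any transformer and is not part of the output; the last column is a one-hot indicator produced from `title`.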

In [None]:
# Split the preprocessed matrix into features and target
X_processed = df_preprocessed[:, :-1]
y_processed = df_preprocessed[:, -1]
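Whether the last column really holds the target depends on the `remainder` setting of `ColumnTransformer`. The sketch below uses an invented three-row frame and a hypothetical helper `build` to compare the default `remainder='drop'` with `remainder='passthrough'`; `sparse_threshold=0` is set only to force dense output so the columns are easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame invented for illustration only
toy = pd.DataFrame({
    "pages": [100, 250, 400],
    "genre": ["Fantasy", "History", "Poetry"],
    "rating_binary": [1.0, 0.0, 1.0],
})

def build(remainder):
    """Hypothetical helper: same transformers, different remainder policy."""
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["pages"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
        ],
        remainder=remainder,
        sparse_threshold=0,  # always return a dense array
    )

dropped = build("drop").fit_transform(toy)
kept = build("passthrough").fit_transform(toy)

print(dropped.shape)    # (3, 4): 1 scaled column + 3 one-hot columns
print(dropped[:, -1])   # [0. 0. 1.]: the one-hot column for "Poetry"
print(kept.shape)       # (3, 5): rating_binary is appended last
print(kept[:, -1])      # [1. 0. 1.]: the actual target values
```

With the default setting, slicing `[:, -1]` returns an encoded feature rather than the target.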

Splitting the data into training and test sets with the same proportions (80% training, 20% test) and random_state=42.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)
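An alternative ordering, sketched here under assumptions rather than as the notebook's method, keeps the target out of the transformed matrix entirely: split first, then let a single `Pipeline` fit the preprocessing on the training rows only. The data is synthetic (random values with column names borrowed from the dataset), and the continuous `rating` is used as the target, which is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data for illustration only; not the GoodReads file
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "pages": rng.integers(50, 900, size=n),
    "totalratings": rng.integers(0, 5000, size=n),
    "bookformat": rng.choice(["Paperback", "Hardcover", "ebook"], size=n),
    "rating": rng.uniform(3.0, 5.0, size=n).round(2),
})

# The target never enters the ColumnTransformer
X = toy.drop(columns=["rating"])
y = toy["rating"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["pages", "totalratings"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["bookformat"]),
])

# Preprocessing statistics are learned from the training rows only
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", AdaBoostRegressor(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R2: {r2_score(y_test, y_pred):.4f}")
```

Because the features here are random noise, the scores only show that the pipeline runs end to end; they say nothing about the real dataset.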

#### Instantiating AdaBoost and evaluating the model

In [None]:
adaboost_regressor = AdaBoostRegressor(n_estimators=100, random_state=42)

# Train the model on the training data, converting y_train to a dense array
adaboost_regressor.fit(X_train, y_train.toarray().ravel())

y_pred = adaboost_regressor.predict(X_test)

mse = mean_squared_error(y_test.toarray(), y_pred)
r2 = r2_score(y_test.toarray(), y_pred)
print(f"Mean Squared Error (MSE) of the model: {mse:.2f}")
print(f"Coefficient of Determination (R2) of the model: {r2:.2f}")

Mean Squared Error (MSE) of the model: 0.00
Coefficient of Determination (R2) of the model: 1.00


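These metrics are computed on the test set (`X_test`, `y_test`), so an MSE of 0.00 and an R2 of 1.00 are more likely a sign of a flaw in how the target was built than of real predictive power or of ordinary overfitting. Because `ColumnTransformer` drops any column not listed in its transformers by default, the last column of `df_preprocessed` is a one-hot indicator rather than the rating, which makes the target almost constant and trivially predictable. The result should therefore not be read as evidence that AdaBoost generalizes well on this dataset.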
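One way a test-set score of exactly MSE 0.00 and R2 1.00 can arise is a target that is constant on the test split. The sketch below, on made-up arrays, shows that scikit-learn reports a perfect R2 when both the true values and the predictions are all zeros.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up arrays: a target that is constant (all zeros) and a model
# that predicts the same constant
y_true = np.zeros(5)
y_pred = np.zeros(5)

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(mse, r2)  # 0.0 1.0
```

A quick sanity check before trusting such a score is to inspect the distribution of the target, for example with `np.unique(y_true, return_counts=True)`.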


---



Therefore, the Logistic Regression model remains the most reliable model.