# Challenge 2 - Regression: Model Training

## Loading the features (in different formats) and the target

In [288]:
import pickle
file_path = '/content/drive/MyDrive/curso_machine-learning-python/datasets/proccessed_insurance_data.pkl'
# The objects were pickled one after another, so they are loaded back in the same order.
# The with-block closes the file even if one of the loads fails.
with open(file_path, 'rb') as processed_insurance_data:
  target = pickle.load(processed_insurance_data)
  all_features = pickle.load(processed_insurance_data)
  scaled_all_features = pickle.load(processed_insurance_data)
  high_corr_features = pickle.load(processed_insurance_data)
  scaled_high_corr_features = pickle.load(processed_insurance_data)
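
The load order matters because each `pickle.load` call reads the next object from the same stream. A minimal sketch of that behaviour, using a temporary file and toy objects (the names `objects_to_save` and `tmp_path` and the toy values are illustrative assumptions, not the preprocessing code that produced the real file):

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real arrays (hypothetical values).
objects_to_save = [[1.0, 2.0], {"age": 19}, "scaled"]

# Write the objects one after another into a single file.
fd, tmp_path = tempfile.mkstemp(suffix=".pkl")
with os.fdopen(fd, "wb") as handle:
    for obj in objects_to_save:
        pickle.dump(obj, handle)

# Each pickle.load call returns the next object, in the order it was dumped.
with open(tmp_path, "rb") as handle:
    loaded = [pickle.load(handle) for _ in objects_to_save]

os.remove(tmp_path)
print(loaded)
```

Loading in a different order would still succeed, but the variables would silently hold the wrong objects.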

In [289]:
import numpy as np
import pandas as pd

## Checking two of the loaded datasets

In [290]:
target_df = pd.DataFrame(target)
target_df.shape

(1338, 1)

In [291]:
all_features_df = pd.DataFrame(all_features)
all_features_df.shape

(1338, 11)

## Splitting the data into training and test sets

In [292]:
def divide_dataset_treino_teste(source_features, source_target):
  from sklearn.model_selection import train_test_split
  features_treino, features_teste, target_treino, target_teste = train_test_split(
    source_features, source_target, test_size = 0.3, random_state = 0
  )
  return features_treino, features_teste, target_treino, target_teste

In [293]:
features_treino, features_teste, target_treino, target_teste = divide_dataset_treino_teste(all_features, target)
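
With 1338 rows and `test_size = 0.3`, the split does not divide evenly. The sketch below reproduces the resulting sizes, assuming the fractional test share is rounded up and the remainder goes to training, which is how I understand `train_test_split` to behave; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)   # test share is rounded up
    n_train = n_samples - n_test                # the remainder goes to training
    return n_train, n_test

n_train, n_test = split_sizes(1338, 0.3)
print(n_train, n_test)  # 936 402
```

The 936 training rows agree with the row count LightGBM reports in its training log later in this notebook.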

## Model training

### Multiple Linear Regression

In [294]:
from sklearn.linear_model import LinearRegression

In [295]:
lin_reg_model = LinearRegression()

In [296]:
lin_reg_model.fit(features_treino, target_treino)

In [297]:
score_treino = lin_reg_model.score(features_treino, target_treino)
score_treino

0.7309569871174701

In [298]:
score_teste = lin_reg_model.score(features_teste, target_teste)
score_teste

0.7909160991789905
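
For scikit-learn regressors, `score` returns the coefficient of determination R². As a sketch of what that number measures, the hypothetical helper `r2_score_manual` below computes it from the residual and total sums of squares on toy values (not the insurance data):

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(r2_score_manual(y, y))                      # perfect fit -> 1.0
print(r2_score_manual(y, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean -> 0.0
print(r2_score_manual(y, [4.0, 3.0, 2.0, 1.0]))   # worse than the mean -> -3.0
```

The last case shows why a score can be negative: the model does worse than a constant prediction of the target mean.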

#### Results

**Model:** Multiple Linear Regression

---

**Feature set:** all_features

**Train score:** 0.73

**Test score:** 0.79

---

**Feature set:** scaled_all_features

**Train score:** 0.73

**Test score:** 0.79

---

**Feature set:** high_corr_features

**Train score:** 0.70

**Test score:** 0.76

---

**Feature set:** scaled_high_corr_features

**Train score:** 0.70

**Test score:** 0.76

The features have too many dimensions, so it is not possible to plot the regression line.

#### Performance Metrics

In [299]:
from sklearn.metrics import root_mean_squared_error

In [300]:
predicted_target = lin_reg_model.predict(features_teste)
rmse = root_mean_squared_error(target_teste, predicted_target)
print(f"RMSE using all_features: {rmse}")

RMSE using all_features: 5774.2963057808665
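
RMSE is the square root of the mean squared difference between true and predicted values, so it is expressed in the same units as the target. A minimal sketch with a hypothetical helper `rmse_manual` and made-up values:

```python
import math

def rmse_manual(y_true, y_pred):
    """Root mean squared error of paired true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 10, 10 and 20 give a mean squared error of 200.
value = rmse_manual([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
print(value)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.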


#### Cross-Validation

In [301]:
from sklearn.model_selection import KFold, cross_val_score

In [302]:
k_fold = KFold(n_splits = 10, shuffle = True, random_state = 0)

In [303]:
modelo = LinearRegression()

In [304]:
resultado = cross_val_score(modelo, all_features, target, cv = k_fold)

In [305]:
print(f"Mean R²: {resultado.mean()}")
print(f"R² lower bound (mean - std): {resultado.mean() - resultado.std()}")
print(f"R² upper bound (mean + std): {resultado.mean() + resultado.std()}")

Mean R²: 0.7356373982088413
R² lower bound (mean - std): 0.680117796174853
R² upper bound (mean + std): 0.7911570002428296
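
With `n_splits = 10` and 1338 rows, the folds cannot all be the same size. The sketch below computes the fold sizes, assuming the documented `KFold` rule that the first `n_samples % n_splits` folds receive one extra sample; `kfold_test_sizes` is a hypothetical helper.

```python
def kfold_test_sizes(n_samples, n_splits):
    """Sizes of the held-out fold in each of the n_splits rounds."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]

test_sizes = kfold_test_sizes(1338, 10)
train_sizes = [1338 - size for size in test_sizes]
print(test_sizes)   # [134, 134, 134, 134, 134, 134, 134, 134, 133, 133]
print(train_sizes)  # [1204, 1204, 1204, 1204, 1204, 1204, 1204, 1204, 1205, 1205]
```

Each of the ten R² values is therefore computed on roughly 134 held-out rows, and the reported bounds are the mean of those values minus and plus one standard deviation.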


#### Conclusion

The Multiple Linear Regression model reached an R² of 0.79 on the test set, meaning it explains about 79% of the variance in the target. For one of the simplest models, that looks like a good result. We will try other models next.

In [306]:
resultados_finais_df = pd.DataFrame()
linear_regression_performance = pd.DataFrame({
  'model': ['multiple_linear_regression'],
  'train_score': [score_treino],
  'test_score': [score_teste],
  'rmse': [rmse],
  'mean_R^2': [resultado.mean()],
  'R^2_std_dev': [resultado.std()]
})
resultados_finais_df = pd.concat([resultados_finais_df, linear_regression_performance], ignore_index = False)
resultados_finais_df

| | model | train_score | test_score | rmse | mean_R^2 | R^2_std_dev |
|---|---|---|---|---|---|---|
| 0 | multiple_linear_regression | 0.730957 | 0.790916 | 5774.296306 | 0.735637 | 0.05552 |
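
Because `ignore_index = False` keeps each frame's own labels, every row appended this way is labelled 0 in the accumulated table. A small sketch with made-up rows (`model_a` and `model_b` are hypothetical) shows the difference `ignore_index` makes:

```python
import pandas as pd

first = pd.DataFrame({"model": ["model_a"], "rmse": [1.0]})
second = pd.DataFrame({"model": ["model_b"], "rmse": [2.0]})

# Keeps the original labels, so both rows are labelled 0.
kept = pd.concat([first, second], ignore_index=False)
# Discards the original labels and renumbers the rows from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))        # [0, 0]
print(list(renumbered.index))  # [0, 1]
```

Duplicate labels are harmless for display, but `ignore_index=True` is the safer choice if rows are later selected by label.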


### Support Vector Regressor

In [307]:
from sklearn.svm import SVR

In [308]:
# default kernel is rbf
sup_vec_regressor = SVR(kernel = 'rbf')

In [309]:
sup_vec_regressor.fit(features_treino, target_treino)

In [310]:
sup_vec_regressor.score(features_treino, target_treino)

-0.09426391963281566

In [311]:
sup_vec_regressor.score(features_teste, target_teste)

-0.08859789219262182

What does the Support Vector Regressor actually need? Standardized scales for the variables, both dependent and independent.

In [312]:
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
scaled_features_treino = feature_scaler.fit_transform(features_treino)

In [313]:
target_scaler = StandardScaler()
scaled_target_treino = target_scaler.fit_transform(target_treino.reshape(-1, 1))

In [314]:
SVR_wscaled = SVR(kernel = 'rbf')
# ravel() flattens the (n, 1) scaled target to 1-D, the shape SVR.fit expects for y
SVR_wscaled.fit(scaled_features_treino, scaled_target_treino.ravel())


In [315]:
score_treino = SVR_wscaled.score(scaled_features_treino, scaled_target_treino)
score_treino

0.8374931487203512

In [316]:
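# Caveat: fit_transform refits the scaler on the test data. The usual practice is
# feature_scaler.transform(features_teste), which reuses the training-set statistics;
# that may change the scores reported below.
scaled_features_teste = feature_scaler.fit_transform(features_teste)

In [317]:
# The same caveat applies to the target scaler.
scaled_target_teste = target_scaler.fit_transform(target_teste.reshape(-1,1))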

In [318]:
score_teste = SVR_wscaled.score(scaled_features_teste, scaled_target_teste)
score_teste

0.8752604597810801
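
Standardization learns a mean and a standard deviation and then applies them. The sketch below does this by hand with NumPy on toy values, learning the statistics from the training rows only and reusing them on the test rows. It is a simplified illustration of what `StandardScaler` does (which, as far as I know, also uses the population standard deviation), not a replacement for it.

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.0], [6.0]])

# "fit": learn mean and standard deviation from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# "inverse_transform": map scaled values back to the original units.
test_restored = test_scaled * std + mean

print(train_scaled.mean(), train_scaled.std())
print(test_restored.ravel())
```

The scaled test set does not need to have zero mean or unit variance; what matters is that it is measured on the same scale the model saw during training.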

The coefficient of determination $R^2$ improved with the SVR.

#### Results

**Model:** Support Vector Regressor

---

**Feature set:** scaled_all_features and scaled_target

**Train score:** 0.84

**Test score:** 0.88

#### Performance Metrics

In [319]:
from sklearn.metrics import root_mean_squared_error

In [320]:
predicted_target = SVR_wscaled.predict(scaled_features_teste)

In [321]:
target_teste_inversed = target_scaler.inverse_transform(scaled_target_teste)

In [322]:
predicted_target_inversed = target_scaler.inverse_transform(predicted_target.reshape(-1, 1))

In [323]:
rmse = root_mean_squared_error(target_teste_inversed, predicted_target_inversed)
rmse

4460.061407282226

#### Cross-Validation

In [324]:
from sklearn.model_selection import KFold, cross_val_score

In [325]:
X_scaler = StandardScaler()

In [326]:
y_scaler = StandardScaler()

In [327]:
X_scaled = X_scaler.fit_transform(all_features)

In [328]:
y_scaled = y_scaler.fit_transform(target.reshape(-1, 1))

In [329]:
modelo = SVR(kernel = 'rbf')

In [330]:
k_fold = KFold(n_splits = 10, shuffle = True, random_state = 0)

In [331]:
resultado = cross_val_score(modelo, X_scaled, y_scaled.ravel(), cv = k_fold)

In [332]:
print(f"Mean R²: {resultado.mean()}")
print(f"R² lower bound (mean - std): {resultado.mean() - resultado.std()}")
print(f"R² upper bound (mean + std): {resultado.mean() + resultado.std()}")

Mean R²: 0.8356454051983156
R² lower bound (mean - std): 0.7880730379941618
R² upper bound (mean + std): 0.8832177724024695
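
In the cells above, the scalers are fitted on the full dataset before cross-validation, so each fold's held-out rows influence the scaling statistics. One possible alternative, sketched below on synthetic stand-in data (`X_demo`, `y_demo`, `svr_pipeline` and `k_fold_demo` are hypothetical names, not part of this notebook), wraps the scaling and the SVR in a single estimator so that the scalers are refitted inside each fold:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical), not the insurance dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = 5000.0 + 3.0 * X_demo[:, 1] + 0.5 * X_demo[:, 2] + rng.normal(scale=10.0, size=200)

# Features are scaled inside the pipeline; the target is scaled by the wrapper,
# which also maps predictions back to the original units.
svr_pipeline = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
)

k_fold_demo = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(svr_pipeline, X_demo, y_demo, cv=k_fold_demo)
print(scores.mean(), scores.std())
```

With this arrangement the manual `inverse_transform` step is no longer needed, since predictions come back in the target's original units.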


#### Conclusion

The regressor based on support vector machines outperformed linear regression on both the R² coefficient and the RMSE error metric.
Between the two, SVR should be chosen. Its drawback is that it requires several scaling transformations before it can be used, which may be why people often choose other models.

In [333]:
model_performance = pd.DataFrame({
  'model': ['support_vector_regressor'],
  'train_score': [score_treino],
  'test_score': [score_teste],
  'rmse': [rmse],
  'mean_R^2': [resultado.mean()],
  'R^2_std_dev': [resultado.std()]
})
resultados_finais_df = pd.concat([resultados_finais_df, model_performance], ignore_index = False)
resultados_finais_df

| | model | train_score | test_score | rmse | mean_R^2 | R^2_std_dev |
|---|---|---|---|---|---|---|
| 0 | multiple_linear_regression | 0.730957 | 0.790916 | 5774.296306 | 0.735637 | 0.05552 |
| 0 | support_vector_regressor | 0.837493 | 0.87526 | 4460.061407 | 0.835645 | 0.047572 |


### Decision Tree Regressor

In [334]:
from sklearn.tree import DecisionTreeRegressor

In [335]:
tree = DecisionTreeRegressor(max_depth = 4, criterion='poisson', random_state = 0)

In [336]:
tree.fit(features_treino, target_treino)

In [337]:
score_treino = tree.score(features_treino, target_treino)
score_treino

0.8531650151210386

In [338]:
score_teste = tree.score(features_teste, target_teste)
score_teste

0.8854782318253775

squared_error:
- train: 0.852 ≈ 0.85
- test: 0.884 ≈ 0.88

poisson:
- train: 0.853 ≈ 0.85
- test: 0.885 ≈ 0.89

#### Results

Results were identical with all_features and scaled_all_features.

Train R²: 0.85

Test R²: 0.89

#### Performance Metrics

In [339]:
from sklearn.metrics import root_mean_squared_error

In [340]:
predicted_target_teste = tree.predict(features_teste)
rmse = root_mean_squared_error(target_teste, predicted_target_teste)
rmse

4273.490975094296

#### Cross-Validation

In [341]:
from sklearn.model_selection import KFold, cross_val_score

In [342]:
k_fold = KFold(n_splits = 10, shuffle = True, random_state = 0)

In [343]:
modelo = DecisionTreeRegressor(max_depth = 4, criterion='poisson', random_state = 0)

In [344]:
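# Note: cv is not passed here, so cross_val_score falls back to its default 5-fold
# split (no shuffling) rather than the 10-fold k_fold defined above.
resultado = cross_val_score(modelo, all_features, target)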

In [345]:
print(f"Mean R²: {resultado.mean()}")
print(f"R² lower bound (mean - std): {resultado.mean() - resultado.std()}")
print(f"R² upper bound (mean + std): {resultado.mean() + resultado.std()}")

Mean R²: 0.8510140323464425
R² lower bound (mean - std): 0.8141620277296231
R² upper bound (mean + std): 0.8878660369632618


#### Conclusion

The decision-tree regressor outperformed the Support Vector Regressor. The difference was small, but it achieved a better R² as well as a lower RMSE. It also has the advantage of requiring less pre- and post-processing.

In [346]:
model_performance = pd.DataFrame({
  'model': ['decision_tree_regressor'],
  'train_score': [score_treino],
  'test_score': [score_teste],
  'rmse': [rmse],
  'mean_R^2': [resultado.mean()],
  'R^2_std_dev': [resultado.std()]
})
resultados_finais_df = pd.concat([resultados_finais_df, model_performance], ignore_index = False)
resultados_finais_df

| | model | train_score | test_score | rmse | mean_R^2 | R^2_std_dev |
|---|---|---|---|---|---|---|
| 0 | multiple_linear_regression | 0.730957 | 0.790916 | 5774.296306 | 0.735637 | 0.05552 |
| 0 | support_vector_regressor | 0.837493 | 0.87526 | 4460.061407 | 0.835645 | 0.047572 |
| 0 | decision_tree_regressor | 0.853165 | 0.885478 | 4273.490975 | 0.851014 | 0.036852 |


### Random Forest

In [347]:
from sklearn.ensemble import RandomForestRegressor

In [348]:
random_forest = RandomForestRegressor(n_estimators = 100, criterion = 'friedman_mse', max_depth = 6, random_state = 0)

In [349]:
random_forest.fit(features_treino, target_treino)

In [350]:
score_treino = random_forest.score(features_treino, target_treino)
score_treino

0.9047018875232218

In [351]:
score_teste = random_forest.score(features_teste, target_teste)
score_teste

0.8798194991833832

#### Results

Results were identical with all_features and scaled_all_features.

Random Forest

- train score: 0.90

- test score: 0.88

#### Performance Metrics

In [352]:
from sklearn.metrics import root_mean_squared_error

In [353]:
predicted_target_teste = random_forest.predict(features_teste)
rmse = root_mean_squared_error(target_teste, predicted_target_teste)
rmse

4377.798554276961

In [354]:
# Note: a model can sometimes have a good score and still a poor RMSE. I changed
# n_estimators from 250 to 100 to reduce this error. It seemed to be tending toward
# overfitting.

#### Cross-Validation

In [355]:
from sklearn.model_selection import KFold, cross_val_score

In [356]:
k_fold = KFold(n_splits = 10, shuffle = True, random_state = 0)

In [357]:
modelo = RandomForestRegressor(n_estimators = 100, criterion = 'friedman_mse', max_depth = 6, random_state = 0)

In [358]:
resultado = cross_val_score(modelo, all_features, target, cv = k_fold)

In [359]:
print(f"Mean R²: {resultado.mean()}")
print(f"R² lower bound (mean - std): {resultado.mean() - resultado.std()}")
print(f"R² upper bound (mean + std): {resultado.mean() + resultado.std()}")

Mean R²: 0.8502056088233102
R² lower bound (mean - std): 0.807442464238317
R² upper bound (mean + std): 0.8929687534083034


#### Conclusion

Random Forest performed very similarly to the Decision Tree, and both were better than the SVR.
Because of its simplicity, I would choose the Decision Tree so far.
In the next sections, we will look at three of the most powerful algorithms: XGBoost, LightGBM, and CatBoost.


In [360]:
model_performance = pd.DataFrame({
  'model': ['random_forest'],
  'train_score': [score_treino],
  'test_score': [score_teste],
  'rmse': [rmse],
  'mean_R^2': [resultado.mean()],
  'R^2_std_dev': [resultado.std()]
})
resultados_finais_df = pd.concat([resultados_finais_df, model_performance], ignore_index = False)
resultados_finais_df

| | model | train_score | test_score | rmse | mean_R^2 | R^2_std_dev |
|---|---|---|---|---|---|---|
| 0 | multiple_linear_regression | 0.730957 | 0.790916 | 5774.296306 | 0.735637 | 0.05552 |
| 0 | support_vector_regressor | 0.837493 | 0.87526 | 4460.061407 | 0.835645 | 0.047572 |
| 0 | decision_tree_regressor | 0.853165 | 0.885478 | 4273.490975 | 0.851014 | 0.036852 |
| 0 | random_forest | 0.904702 | 0.879819 | 4377.798554 | 0.850206 | 0.042763 |


### XGBoost

In [361]:
from xgboost import XGBRegressor

In [362]:
xgboost = XGBRegressor(n_estimators = 75, max_depth = 4, learning_rate = 0.05, objective='reg:squarederror')

In [363]:
xgboost.fit(features_treino, target_treino)

In [364]:
score_treino = xgboost.score(features_treino, target_treino)
score_treino

0.8813457947769354

In [365]:
score_teste = xgboost.score(features_teste, target_teste)
score_teste

0.8901612137439361
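
To make the roles of `n_estimators` and `learning_rate` concrete, here is a deliberately simplified sketch of gradient boosting for squared error using one-split trees (stumps) on a single feature. It is plain boosting on toy data, not XGBoost's actual algorithm, which adds regularization and other refinements; `fit_stump` and `boost` are hypothetical helpers.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single-split regressor on a 1-D feature (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:   # skip the largest value so the right side is non-empty
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        sse = np.sum((residuals - np.where(left, left_value, right_value)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def boost(x, y, n_estimators, learning_rate):
    """Return the training MSE after each boosting round."""
    prediction = np.full_like(y, y.mean())   # start from the mean of the target
    mse_history = [np.mean((y - prediction) ** 2)]
    for _ in range(n_estimators):
        # Each new stump is fitted to the current residuals.
        threshold, left_value, right_value = fit_stump(x, y - prediction)
        update = np.where(x <= threshold, left_value, right_value)
        prediction = prediction + learning_rate * update   # shrunken step
        mse_history.append(np.mean((y - prediction) ** 2))
    return mse_history

x_demo = np.linspace(0.0, 1.0, 50)
y_demo = np.sin(2 * np.pi * x_demo)
history = boost(x_demo, y_demo, n_estimators=75, learning_rate=0.05)
print(history[0], history[-1])
```

A smaller `learning_rate` makes each round contribute less, so more rounds are needed to reach the same training error; that trade-off is what the two parameters control together.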

#### Results

Results were identical with all_features and scaled_all_features.

XGBoost score:

- train: 0.88

- test: 0.89

#### Performance Metrics

In [366]:
from sklearn.metrics import root_mean_squared_error

In [367]:
predicted_target_testes = xgboost.predict(features_teste)

In [368]:
rmse = root_mean_squared_error(target_teste, predicted_target_testes)
rmse

4185.203996526568

#### Cross-Validation

In [369]:
from sklearn.model_selection import KFold, cross_val_score

In [370]:
k_fold = KFold(n_splits = 10, shuffle = True, random_state = 0)

In [371]:
modelo = XGBRegressor(n_estimators = 75, max_depth = 4, learning_rate = 0.05, objective='reg:squarederror')

In [372]:
cv_score = cross_val_score(modelo, all_features, target, cv = k_fold)

In [373]:
print(f"Mean R²: {cv_score.mean()}")
print(f"R² lower bound (mean - std): {cv_score.mean() - cv_score.std()}")
print(f"R² upper bound (mean + std): {cv_score.mean() + cv_score.std()}")

Mean R²: 0.8577180490554157
R² lower bound (mean - std): 0.8180795485226553
R² upper bound (mean + std): 0.8973565495881761


#### Conclusion

XGBoost produced slightly better R² results than the best models so far.
The same holds for the error: XGBoost has the lowest RMSE so far, 4185.20.

In [374]:
model_performance = pd.DataFrame({
  'model': ['xgboost'],
  'train_score': [score_treino],
  'test_score': [score_teste],
  'rmse': [rmse],
  'mean_R^2': [cv_score.mean()],
  'R^2_std_dev': [cv_score.std()]
})
resultados_finais_df = pd.concat([resultados_finais_df, model_performance], ignore_index = False)
resultados_finais_df

| | model | train_score | test_score | rmse | mean_R^2 | R^2_std_dev |
|---|---|---|---|---|---|---|
| 0 | multiple_linear_regression | 0.730957 | 0.790916 | 5774.296306 | 0.735637 | 0.05552 |
| 0 | support_vector_regressor | 0.837493 | 0.87526 | 4460.061407 | 0.835645 | 0.047572 |
| 0 | decision_tree_regressor | 0.853165 | 0.885478 | 4273.490975 | 0.851014 | 0.036852 |
| 0 | random_forest | 0.904702 | 0.879819 | 4377.798554 | 0.850206 | 0.042763 |
| 0 | xgboost | 0.881346 | 0.890161 | 4185.203997 | 0.857718 | 0.039639 |


### LightGBM

In [375]:
import lightgbm as lgb

In [376]:
lgbm = lgb.LGBMRegressor(num_leaves = 15, max_depth = 3, learning_rate = 0.1, n_estimators = 60)

In [377]:
lgbm.fit(features_treino, target_treino)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000161 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 325
[LightGBM] [Info] Number of data points in the train set: 936, number of used features: 11
[LightGBM] [Info] Start training from score 13232.916456




In [378]:
score_treino = lgbm.score(features_treino, target_treino)
score_treino



0.8736670088324157

In [379]:
score_teste = lgbm.score(features_teste, target_teste)
score_teste



0.8937278307747821

#### Results

LightGBM performed better with scaled_all_features than with all_features.
The result was:

- train score: 0.873 ≈ 0.87

- test score: 0.897 ≈ 0.90

#### Performance Metrics

In [380]:
from sklearn.metrics import root_mean_squared_error

In [381]:
predicted_target_teste = lgbm.predict(features_teste)



In [382]:
# Arguments follow the (y_true, y_pred) order; RMSE is symmetric, so the value is unchanged.
rmse = root_mean_squared_error(target_teste, predicted_target_teste)
rmse

4116.693574020556

#### Cross-Validation

In [383]:
from sklearn.model_selection import KFold, cross_val_score

In [384]:
kfold = KFold(n_splits = 10, shuffle = True, random_state = 0)

In [385]:
model = lgb.LGBMRegressor(num_leaves = 15, max_depth = 3, learning_rate = 0.1, n_estimators = 60)

In [386]:
cross_val_result = cross_val_score(model, scaled_all_features, target, cv = kfold);



In [387]:
print(f"Mean R²: {cross_val_result.mean()}")
print(f"R² lower bound (mean - std): {cross_val_result.mean() - cross_val_result.std()}")
print(f"R² upper bound (mean + std): {cross_val_result.mean() + cross_val_result.std()}")

Mean R²: 0.8574484096097168
R² lower bound (mean - std): 0.8180650507145069
R² upper bound (mean + std): 0.8968317685049267


#### Conclusion

LightGBM and XGBoost were practically tied.
Even so, on the performance metric (RMSE), LightGBM was slightly better using scaled_all_features.

In [388]:
model_performance = pd.DataFrame({
  'model': ['lightgbm'],
  'train_score': [score_treino],
  'test_score': [score_teste],
  'rmse': [rmse],
  'mean_R^2': [cross_val_result.mean()],
  'R^2_std_dev': [cross_val_result.std()]
})
resultados_finais_df = pd.concat([resultados_finais_df, model_performance], ignore_index = False)
resultados_finais_df

| | model | train_score | test_score | rmse | mean_R^2 | R^2_std_dev |
|---|---|---|---|---|---|---|
| 0 | multiple_linear_regression | 0.730957 | 0.790916 | 5774.296306 | 0.735637 | 0.05552 |
| 0 | support_vector_regressor | 0.837493 | 0.87526 | 4460.061407 | 0.835645 | 0.047572 |
| 0 | decision_tree_regressor | 0.853165 | 0.885478 | 4273.490975 | 0.851014 | 0.036852 |
| 0 | random_forest | 0.904702 | 0.879819 | 4377.798554 | 0.850206 | 0.042763 |
| 0 | xgboost | 0.881346 | 0.890161 | 4185.203997 | 0.857718 | 0.039639 |
| 0 | lightgbm | 0.873667 | 0.893728 | 4116.693574 | 0.857448 | 0.039383 |


### CatBoost

In [389]:
!pip install catboost



In [390]:
from catboost import CatBoostRegressor

In [391]:
catboost = CatBoostRegressor(iterations = 75, learning_rate = 0.10, depth = 7, random_state = 0)

In [392]:
catboost.fit(features_treino, target_treino);

0:	learn: 11096.7213541	total: 1.61ms	remaining: 119ms
1:	learn: 10370.8239612	total: 2.32ms	remaining: 84.7ms
2:	learn: 9755.7383836	total: 3.47ms	remaining: 83.2ms
3:	learn: 9187.2664872	total: 4.73ms	remaining: 84ms
4:	learn: 8656.3861320	total: 5.8ms	remaining: 81.1ms
5:	learn: 8171.8658700	total: 6.13ms	remaining: 70.5ms
6:	learn: 7735.7643300	total: 7.21ms	remaining: 70ms
7:	learn: 7327.5305135	total: 8.41ms	remaining: 70.4ms
8:	learn: 6971.1983763	total: 9.89ms	remaining: 72.5ms
9:	learn: 6669.5714673	total: 10.2ms	remaining: 66.2ms
10:	learn: 6413.4874399	total: 10.4ms	remaining: 60.8ms
11:	learn: 6158.3320446	total: 11.2ms	remaining: 58.6ms
12:	learn: 5952.3027175	total: 12.5ms	remaining: 59.6ms
13:	learn: 5783.4733560	total: 13.8ms	remaining: 60ms
14:	learn: 5637.5084818	total: 14.2ms	remaining: 56.7ms
15:	learn: 5477.2082258	total: 15.4ms	remaining: 56.7ms
16:	learn: 5360.5466676	total: 16.6ms	remaining: 56.8ms
17:	learn: 5235.5888753	total: 17.8ms	remaining: 56.5ms
18:	lear

In [393]:
score_treino = catboost.score(features_treino, target_treino)
score_treino

np.float64(0.8959742114451366)

In [394]:
score_teste = catboost.score(features_teste, target_teste)
score_teste

np.float64(0.8871886423839948)

#### Results

Results were identical with all_features and scaled_all_features.

CatBoost:

- Train score: 0.896 ≈ 0.90

- Test score: 0.887 ≈ 0.89

#### Performance Metrics

In [395]:
predicted_target_teste = catboost.predict(features_teste)

In [396]:
from sklearn.metrics import root_mean_squared_error

In [397]:
rmse = root_mean_squared_error(target_teste, predicted_target_teste)
rmse

4241.458105099225

#### Cross-Validation

In [398]:
from sklearn.model_selection import KFold, cross_val_score

In [399]:
kfold = KFold(n_splits = 10, shuffle = True, random_state = 0)

In [400]:
modelo = CatBoostRegressor(iterations = 75, learning_rate = 0.10, depth = 7, random_state = 0)

In [401]:
resultado = cross_val_score(modelo, all_features, target, cv = kfold)


In [402]:
print(f"Mean R²: {resultado.mean()}")
print(f"R² lower bound (mean - std): {resultado.mean() - resultado.std()}")
print(f"R² upper bound (mean + std): {resultado.mean() + resultado.std()}")

Mean R²: 0.8483613861522021
R² lower bound (mean - std): 0.8063578667102363
R² upper bound (mean + std): 0.8903649055941679


#### Conclusion

XGBoost and LightGBM outperformed CatBoost.

In [403]:
model_performance = pd.DataFrame({
  'model': ['catboost'],
  'train_score': [score_treino],
  'test_score': [score_teste],
  'rmse': [rmse],
  'mean_R^2': [resultado.mean()],
  'R^2_std_dev': [resultado.std()]
})
resultados_finais_df = pd.concat([resultados_finais_df, model_performance], ignore_index = False)
resultados_finais_df

| | model | train_score | test_score | rmse | mean_R^2 | R^2_std_dev |
|---|---|---|---|---|---|---|
| 0 | multiple_linear_regression | 0.730957 | 0.790916 | 5774.296306 | 0.735637 | 0.05552 |
| 0 | support_vector_regressor | 0.837493 | 0.87526 | 4460.061407 | 0.835645 | 0.047572 |
| 0 | decision_tree_regressor | 0.853165 | 0.885478 | 4273.490975 | 0.851014 | 0.036852 |
| 0 | random_forest | 0.904702 | 0.879819 | 4377.798554 | 0.850206 | 0.042763 |
| 0 | xgboost | 0.881346 | 0.890161 | 4185.203997 | 0.857718 | 0.039639 |
| 0 | lightgbm | 0.873667 | 0.893728 | 4116.693574 | 0.857448 | 0.039383 |
| 0 | catboost | 0.895974 | 0.887189 | 4241.458105 | 0.848361 | 0.042004 |
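
As a quick cross-check of the table, the sketch below ranks the models on three of its columns. The values are copied from the table as displayed; the dictionary layout and the names `results`, `best_by_rmse`, `best_by_cv` and `best_by_test` are illustrative assumptions.

```python
# Values copied from the results table above (rounded as displayed).
results = {
    "multiple_linear_regression": {"test_score": 0.790916, "rmse": 5774.296306, "mean_r2": 0.735637},
    "support_vector_regressor": {"test_score": 0.87526, "rmse": 4460.061407, "mean_r2": 0.835645},
    "decision_tree_regressor": {"test_score": 0.885478, "rmse": 4273.490975, "mean_r2": 0.851014},
    "random_forest": {"test_score": 0.879819, "rmse": 4377.798554, "mean_r2": 0.850206},
    "xgboost": {"test_score": 0.890161, "rmse": 4185.203997, "mean_r2": 0.857718},
    "lightgbm": {"test_score": 0.893728, "rmse": 4116.693574, "mean_r2": 0.857448},
    "catboost": {"test_score": 0.887189, "rmse": 4241.458105, "mean_r2": 0.848361},
}

best_by_rmse = min(results, key=lambda name: results[name]["rmse"])        # lower is better
best_by_cv = max(results, key=lambda name: results[name]["mean_r2"])       # higher is better
best_by_test = max(results, key=lambda name: results[name]["test_score"])  # higher is better

print(best_by_rmse, best_by_cv, best_by_test)  # lightgbm xgboost lightgbm
```

LightGBM leads on test score and RMSE while XGBoost leads narrowly on mean cross-validated R², which is consistent with treating the two as practically tied.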


### Final Conclusion

Of all the models trained, XGBoost and LightGBM performed best. Both achieved good scores, but in production terms LightGBM would probably be the better option because it tends to consume fewer computational resources, owing to its histogram-based, leaf-wise tree growth. For that reason, it is the model I would choose for production.