### 415. Modelos Não Lineares
<h1>Árvores de Decisão</h1><h2>Conceito</h2><ul><li>Modelos não lineares para classificação e regressão</li><li>Conjunto de regras fáceis de entender e implementar</li><li>Objetivo é criar um modelo que faz previsões por meio de regras de decisão simples</li></ul><h2>Estrutura</h2><ul><li>Começa com decisões (sim ou não) que levam a outras decisões</li><li>Representa número de possíveis caminhos de decisão</li><li>Para regressão, a variável dependente (y) é contínua</li></ul><h2>Terminologia</h2><ul><li>Nós: ponto com alternativas entre decisões</li><li>Folhas: decisão final</li></ul><h2>Exemplo</h2><ul><li>Pai de família<ul><li>Se salário &lt; R$7.500:<ul><li>Ganha R$1.000 ou R$3.000</li></ul></li><li>Se empregado:<ul><li>Ganha R$5.000 ou R$800</li></ul></li></ul></li></ul><h2>Vantagens</h2><ul><li>Fácil de entender</li><li>Menor necessidade de limpeza de dados</li><li>Não restrito por tipo de dados</li></ul><h2>Desvantagens</h2><ul><li>Risco de overfitting (sobreajuste)</li><li>Não tão adequado para variáveis contínuas</li></ul>


### 416.Ensamble
<h1>Ensemble Learning</h1><h2>Tema central</h2><p>Técnica para agregar previsões de vários modelos a fim de melhorar a acurácia das previsões</p><h2>Tipos</h2><h3>Bagging</h3><ul><li>Vários modelos treinados em paralelo</li><li>Cada modelo é treinado com conjunto aleatório de amostras</li></ul><h3>Boosting</h3><ul><li>Modelos treinados sequencialmente</li><li>Cada modelo aprende com os erros do modelo anterior</li></ul><h2>Árvores de decisão</h2><h3>Random Forest Regressor</h3><ul><li>Usa bagging</li><li>Constrói muitas árvores de decisão</li><li>Produz média das previsões de todas as árvores</li></ul><h3>Adaboost</h3><ul><li>Usa boosting</li><li>Identifica erros e melhora performance a cada iteração</li><li>Ajusta pesos das variáveis</li></ul><h3>Gradient Boosting</h3><ul><li>Também usa boosting</li><li>Usa o resíduo (erro) para aprender e melhorar a cada iteração</li></ul><h2>Objetivo</h2><ul><li>Rodar vários modelos para melhorar performance da regressão</li><li>Bagging: modelos em paralelo, tira média</li><li>Boosting: modelos sequenciais, aprende com erros</li></ul>



### 417. Regressão Não Linear - Prática
<h1>Tema central: Regressão não-linear no Colab com Python</h1><h2>Importação de bibliotecas</h2><ul><li>Pandas</li><li>NumPy</li></ul><h2>Importação do dataset</h2><ul><li>Dados de temperatura da região de Seattle (EUA)</li><li>Variáveis: ano, mês, dia, dia da semana, temperatura de 2 dias antes, temperatura do dia anterior, média de temperatura, temperatura atual (a prever)</li></ul><h2>Pré-processamento dos dados</h2><ul><li>Transformação da variável categórica &quot;dia da semana&quot; em variáveis dummy através do one-hot encoding</li><li>Separação das features (variáveis preditoras) e labels (variável target) em arrays NumPy</li><li>Divisão da base de dados em conjuntos de treino (75%) e teste (25%)</li></ul><h2>Comparação com a média como baseline</h2><ul><li>Erro médio absoluto da previsão pela média: 5,6 graus</li></ul><h2>Modelos testados</h2><ul><li>Random Forest Regressor</li><li>ADA Boost Regressor</li><li>Gradient Boosting Regressor</li></ul><h2>Avaliação dos modelos</h2><ul><li>Métricas: R2, erro absoluto médio, erro quadrático médio</li></ul><h2>Visualização da árvore de decisão</h2><ul><li>Exportação da árvore de decisão em formato PNG</li></ul><h2>Importância das features</h2><ul><li>Identificação das features mais importantes para cada modelo</li></ul><p>Esse é um resumo abrangente do vídeo em formato de mapa mental, cobrindo o tema central e os principais tópicos. Fiz uso de markdown para destacar os títulos e facilitar a visualização.</p>

In [401]:
import pandas as pd
import numpy as np


In [402]:
features=pd.read_excel('temps.xlsx')
features.head()

Unnamed: 0,year,month,day,week,temp_2,temp_1,average,actual
0,2016,1,1,Fri,45,45,45.6,45
1,2016,1,2,Sat,44,45,45.7,44
2,2016,1,3,Sun,45,44,45.8,41
3,2016,1,4,Mon,44,41,45.9,40
4,2016,1,5,Tues,41,40,46.0,44


In [403]:
features.describe()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual
count,348.0,348.0,348.0,348.0,348.0,348.0,348.0
mean,2016.0,6.477011,15.514368,62.652299,62.701149,59.760632,62.543103
std,0.0,3.49838,8.772982,12.165398,12.120542,10.527306,11.794146
min,2016.0,1.0,1.0,35.0,35.0,45.1,35.0
25%,2016.0,3.0,8.0,54.0,54.0,49.975,54.0
50%,2016.0,6.0,15.0,62.5,62.5,58.2,62.5
75%,2016.0,10.0,23.0,71.0,71.0,69.025,71.0
max,2016.0,12.0,31.0,117.0,117.0,77.4,92.0


In [404]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 348 entries, 0 to 347
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   year     348 non-null    int64  
 1   month    348 non-null    int64  
 2   day      348 non-null    int64  
 3   week     348 non-null    object 
 4   temp_2   348 non-null    int64  
 5   temp_1   348 non-null    int64  
 6   average  348 non-null    float64
 7   actual   348 non-null    int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 21.9+ KB


In [405]:
features= pd.get_dummies(features)
features.head()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,week_Fri,week_Mon,week_Sat,week_Sun,week_Thurs,week_Tues,week_Wed
0,2016,1,1,45,45,45.6,45,True,False,False,False,False,False,False
1,2016,1,2,44,45,45.7,44,False,False,True,False,False,False,False
2,2016,1,3,45,44,45.8,41,False,False,False,True,False,False,False
3,2016,1,4,44,41,45.9,40,False,True,False,False,False,False,False
4,2016,1,5,41,40,46.0,44,False,False,False,False,False,True,False


In [406]:
labels = np.array(features['actual'])
# Axis excluir uma linha ao invês da coluna = 1
features= features.drop('actual', axis = 1)

feature_list = list(features.columns)

features = np.array(features)

In [407]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

In [408]:
# Treino com média quanto meu modelo erraria
# Grau de comparatibilidade
# round para trazer duas casas decimais apos a virgula
baseline_preds = test_features[:, feature_list.index('average')]

baseline_error = abs(baseline_preds - test_labels)
print("Baseline error average:", round(np.mean(baseline_error),2))

Baseline error average: 5.06


In [409]:
from sklearn.datasets import make_moons
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

In [410]:
# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(train_features, train_labels);

In [411]:
prediction_rf = rf.predict(test_features)

erro_rf = abs(prediction_rf - test_labels)

r_sq = rf.score(features,labels)
print('R^2',r_sq)
print('MAE', metrics.mean_absolute_error(test_labels, prediction_rf))
print('MAE', metrics.mean_squared_error(test_labels, prediction_rf))

R^2 0.9282342331459054
MAE 3.9450574712643682
MAE 27.700197701149417


In [412]:
# AdaBoost
ada= AdaBoostRegressor(n_estimators=100)
ada.fit(train_features, train_labels)

ada_pred = ada.predict(test_features)

In [413]:
error_ada = abs(ada_pred - test_labels)

r_sq = ada.score(features,labels)
print('R^2',r_sq)
print('MAE', metrics.mean_absolute_error(test_labels, ada_pred))
print('MAE', metrics.mean_squared_error(test_labels, ada_pred))

R^2 0.8786941976018995
MAE 3.584137209798339
MAE 21.78251992779561


In [414]:
# GradientBoostingRegressor
gbr= GradientBoostingRegressor(n_estimators=100)
gbr.fit(train_features,train_labels)
gbr_pred = gbr.predict(test_features)

In [415]:
error_gbr = abs(gbr_pred - test_labels)

print('MAE', metrics.mean_absolute_error(test_labels, gbr_pred))
print('MAE', metrics.mean_squared_error(test_labels, gbr_pred))

MAE 4.075669041934963
MAE 28.499002687339804


In [416]:
# Representação da árvore de decisão
# Árvore simplificada
# max_depth=3 ter 3 camadas do nós 
rf = RandomForestRegressor(max_depth=3)
rf.fit(train_features,train_labels)

tree = rf.estimators_[5]
tree


In [417]:
from sklearn.tree import export_graphviz
import pydot
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')

In [418]:
# RandomForestRegressor
importances = list(rf.feature_importances_)

feature_importance=[(feature,round(importance,2)) for feature, importance in zip(feature_list, importances)]
feature_importance= sorted(feature_importance,key = lambda x:x[1], reverse = True)
[print("Feature: {:20}Importance{}".format(*pair))for pair in feature_importance]

Feature: temp_1              Importance0.77
Feature: average             Importance0.21
Feature: temp_2              Importance0.01
Feature: year                Importance0.0
Feature: month               Importance0.0
Feature: day                 Importance0.0
Feature: week_Fri            Importance0.0
Feature: week_Mon            Importance0.0
Feature: week_Sat            Importance0.0
Feature: week_Sun            Importance0.0
Feature: week_Thurs          Importance0.0
Feature: week_Tues           Importance0.0
Feature: week_Wed            Importance0.0


[None, None, None, None, None, None, None, None, None, None, None, None, None]

In [419]:
# AdaBoostRegressor
importances = list(ada.feature_importances_)

feature_importance=[(feature,round(importance,2)) for feature, importance in zip(feature_list, importances)]
feature_importance= sorted(feature_importance,key = lambda x:x[1], reverse = True)
[print("Feature: {:20}Importance{}".format(*pair))for pair in feature_importance]

Feature: temp_1              Importance0.5
Feature: average             Importance0.27
Feature: temp_2              Importance0.08
Feature: month               Importance0.06
Feature: week_Mon            Importance0.04
Feature: day                 Importance0.03
Feature: week_Fri            Importance0.01
Feature: week_Sun            Importance0.01
Feature: year                Importance0.0
Feature: week_Sat            Importance0.0
Feature: week_Thurs          Importance0.0
Feature: week_Tues           Importance0.0
Feature: week_Wed            Importance0.0


[None, None, None, None, None, None, None, None, None, None, None, None, None]

In [420]:
# GradientBoostingRegressor
importances = list(gbr.feature_importances_)

feature_importance=[(feature,round(importance,2)) for feature, importance in zip(feature_list, importances)]
feature_importance= sorted(feature_importance,key = lambda x:x[1], reverse = True)
[print("Feature: {:20}Importance{}".format(*pair))for pair in feature_importance]

Feature: temp_1              Importance0.63
Feature: average             Importance0.3
Feature: day                 Importance0.02
Feature: month               Importance0.01
Feature: temp_2              Importance0.01
Feature: week_Fri            Importance0.01
Feature: year                Importance0.0
Feature: week_Mon            Importance0.0
Feature: week_Sat            Importance0.0
Feature: week_Sun            Importance0.0
Feature: week_Thurs          Importance0.0
Feature: week_Tues           Importance0.0
Feature: week_Wed            Importance0.0


[None, None, None, None, None, None, None, None, None, None, None, None, None]

### 418. Regressão Não Linear - Exercício
<p>TEMA CENTRAL: Previsão de preços de imóveis</p><h1>Introdução</h1><ul><li>Chegou a hora de colocar a mão na massa no módulo de regressão não linear</li><li>Nova missão: prever o preço de casas</li><li>Vocês receberão um dataset</li></ul><h1>Etapas</h1><ul><li>Fazer todo o tratamento de dados visto até agora neste módulo</li><li>Treinar modelos<ul><li>Random Forest</li><li>Ada Boost</li><li>Gradient Boost</li></ul></li></ul><h1>Resultados</h1><ul><li>Tirar a feature importance para cada algoritmo</li><li>Identificar quais são as features mais importantes em cada modelo</li></ul><h1>Conclusão</h1><ul><li>Bom desafio!</li></ul>


### 419. Regressão Não Linear - Gabarito
<h1>Treinando modelos de Machine Learning em Python</h1><h2>Importando bibliotecas e dados</h2><ul><li>Importamos pandas, numpy e a base de dados Houson</li><li>Verificamos a base de dados<ul><li>Possui variáveis categóricas que precisam de transformação</li><li>Possui valores nulos que precisam ser tratados</li></ul></li></ul><h2>Pré-processamento dos dados</h2><ul><li>Removemos valores nulos com <code>.dropna()</code></li><li>Aplicamos One Hot Encoding para converter variáveis categóricas em dummies</li><li>Separamos os dados em features (X) e labels (y)</li></ul><h2>Divisão dos dados</h2><ul><li>Utilizamos <code>train_test_split</code> para dividir os dados em treino e teste</li></ul><h2>Modelos testados</h2><ul><li>Random Forest Regressor</li><li>AdaBoost Regressor</li><li>Gradient Boosting Regressor</li></ul><p>Para cada modelo:</p><ul><li>Treinamos o modelo</li><li>Fizemos predições no conjunto de teste</li><li>Calculamos métricas:<ul><li>R2 Score</li><li>Mean Absolute Error</li><li>Mean Squared Error</li></ul></li></ul><h2>Resultados</h2><ul><li>Random Forest<ul><li>R2 Score: 0.94</li><li>MAE: 19 mil</li></ul></li><li>AdaBoost<ul><li>Métricas piores que Random Forest</li></ul></li><li><strong>Gradient Boosting</strong><ul><li><strong>R2 Score: 0.99</strong></li><li><strong>MSE menor que Random Forest</strong></li><li>Foi o <strong>melhor modelo</strong> segundo as métricas</li></ul></li></ul><h2>Conclusão</h2><ul><li>O <strong>Gradient Boosting</strong> teve o melhor desempenho</li><li>Recomendação para praticar com outras bases de dados e comparar os modelos</li></ul>

In [421]:
import pandas as pd
import numpy as np

In [422]:
casa = pd.read_excel('house.xlsx')
casa.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,196.0,Gd,...,0,61,0,0,0,0,0,2,2008,208500
1,2,20,80.0,9600,6,8,1976,1976,0.0,TA,...,298,0,0,0,0,0,0,5,2007,181500
2,3,60,68.0,11250,7,5,2001,2002,162.0,Gd,...,0,42,0,0,0,0,0,9,2008,223500
3,4,70,60.0,9550,7,5,1915,1970,0.0,TA,...,0,35,272,0,0,0,0,2,2006,140000
4,5,60,84.0,14260,8,5,2000,2000,350.0,Gd,...,192,84,0,0,0,0,0,12,2008,250000


In [423]:
casa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 40 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   OverallQual    1460 non-null   int64  
 5   OverallCond    1460 non-null   int64  
 6   YearBuilt      1460 non-null   int64  
 7   YearRemodAdd   1460 non-null   int64  
 8   MasVnrArea     1452 non-null   float64
 9   ExterQual      1460 non-null   object 
 10  ExterCond      1460 non-null   object 
 11  BsmtFinSF1     1460 non-null   int64  
 12  BsmtFinSF2     1460 non-null   int64  
 13  BsmtUnfSF      1460 non-null   int64  
 14  TotalBsmtSF    1460 non-null   int64  
 15  1stFlrSF       1460 non-null   int64  
 16  2ndFlrSF       1460 non-null   int64  
 17  LowQualFinSF   1460 non-null   int64  
 18  GrLivAre

In [424]:
# Dropando dados nulos
casa = casa.dropna()

In [425]:
casa = pd.get_dummies(casa)
casa.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,YrSold,SalePrice,ExterQual_Ex,ExterQual_Fa,ExterQual_Gd,ExterQual_TA,ExterCond_Ex,ExterCond_Fa,ExterCond_Gd,ExterCond_TA
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,2008,208500,False,False,True,False,False,False,False,True
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,2007,181500,False,False,False,True,False,False,False,True
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,2008,223500,False,False,True,False,False,False,False,True
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,2006,140000,False,False,False,True,False,False,False,True
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,2008,250000,False,False,True,False,False,False,False,True


In [426]:
casa= casa.astype(int)

In [427]:
casa.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1121 entries, 0 to 1459
Data columns (total 46 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1121 non-null   int64
 1   MSSubClass     1121 non-null   int64
 2   LotFrontage    1121 non-null   int64
 3   LotArea        1121 non-null   int64
 4   OverallQual    1121 non-null   int64
 5   OverallCond    1121 non-null   int64
 6   YearBuilt      1121 non-null   int64
 7   YearRemodAdd   1121 non-null   int64
 8   MasVnrArea     1121 non-null   int64
 9   BsmtFinSF1     1121 non-null   int64
 10  BsmtFinSF2     1121 non-null   int64
 11  BsmtUnfSF      1121 non-null   int64
 12  TotalBsmtSF    1121 non-null   int64
 13  1stFlrSF       1121 non-null   int64
 14  2ndFlrSF       1121 non-null   int64
 15  LowQualFinSF   1121 non-null   int64
 16  GrLivArea      1121 non-null   int64
 17  BsmtFullBath   1121 non-null   int64
 18  BsmtHalfBath   1121 non-null   int64
 19  FullBath   

In [428]:
casa.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,YrSold,SalePrice,ExterQual_Ex,ExterQual_Fa,ExterQual_Gd,ExterQual_TA,ExterCond_Ex,ExterCond_Fa,ExterCond_Gd,ExterCond_TA
0,1,60,65,8450,7,5,2003,2003,196,706,...,2008,208500,0,0,1,0,0,0,0,1
1,2,20,80,9600,6,8,1976,1976,0,978,...,2007,181500,0,0,0,1,0,0,0,1
2,3,60,68,11250,7,5,2001,2002,162,486,...,2008,223500,0,0,1,0,0,0,0,1
3,4,70,60,9550,7,5,1915,1970,0,216,...,2006,140000,0,0,0,1,0,0,0,1
4,5,60,84,14260,8,5,2000,2000,350,655,...,2008,250000,0,0,1,0,0,0,0,1


In [429]:
labels = np.array(casa['SalePrice'])
features = casa.drop('SalePrice', axis = 1)
feature_list = list(casa.columns)
casa=np.array(features)


In [430]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

In [431]:
from sklearn.datasets import make_moons
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor

In [432]:
rf= RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels);

In [434]:
prediction_rf = rf.predict(test_features)
errors_rf = abs(prediction_rf - test_labels)

rf_sq = rf.score(features, labels)

print('R^2', r_sq)
print('MAE',metrics.mean_absolute_error(test_labels,prediction_rf))
print('MSE', metrics.mean_squared_error(test_labels,prediction_rf))


R^2 0.8786941976018995
MAE 19122.88696441281
MSE 937331048.1581936


In [435]:
ada= AdaBoostRegressor(n_estimators=1000, random_state=42)
ada.fit(train_features, train_labels);

In [436]:
prediction_ada = ada.predict(test_features)
errors_ada = abs(prediction_rf - test_labels)

ada_sq = ada.score(features, labels)

print('R^2', ada_sq)
print('MAE',metrics.mean_absolute_error(test_labels,prediction_ada))
print('MSE', metrics.mean_squared_error(test_labels,prediction_ada))

R^2 0.868834224815642
MAE 24702.400413692794
MSE 1409573730.4590282


In [437]:
gbr= GradientBoostingRegressor(n_estimators=1000, random_state=42)
gbr.fit(train_features, train_labels);

In [438]:
prediction_gbr = gbr.predict(test_features)
errors_gbr = abs(prediction_gbr - test_labels)

gbr_sq = gbr.score(features, labels)

print('R^2', gbr_sq)
print('MAE',metrics.mean_absolute_error(test_labels,prediction_gbr))
print('MSE', metrics.mean_squared_error(test_labels,prediction_gbr))

R^2 0.9716797837295239
MAE 17891.3484186379
MSE 768539758.2485199
