<a href="https://colab.research.google.com/github/dutrajunior/python_estudos/blob/main/cross_validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing the basic libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the dataset

In [2]:
dados = pd.read_csv('https://raw.githubusercontent.com/dutrajunior/python_estudos/main/Data_Train.csv')
dados

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302
...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648


This dataset contains flight records, each with its associated price.

The goal is to predict the price of a flight from the available data.

# Looking at the data

In [3]:
dados.describe()

Unnamed: 0,Price
count,10683.0
mean,9087.064121
std,4611.359167
min,1759.0
25%,5277.0
50%,8372.0
75%,12373.0
max,79512.0


In [4]:
dados.dtypes

Airline            object
Date_of_Journey    object
Source             object
Destination        object
Route              object
Dep_Time           object
Arrival_Time       object
Duration           object
Total_Stops        object
Additional_Info    object
Price               int64
dtype: object

Every column except Price is stored as a string (dtype object), so let's break some of them apart to build numeric variables.

# Data Preparation

## Splitting the Date Variable

In [5]:
from datetime import datetime

# Each helper parses a 'dd/mm/YYYY' string and returns one integer component

def string_date_to_day_part(date_str):
    return datetime.strptime(date_str, '%d/%m/%Y').day

def string_date_to_month_part(date_str):
    return datetime.strptime(date_str, '%d/%m/%Y').month

def string_date_to_year_part(date_str):
    return datetime.strptime(date_str, '%d/%m/%Y').year


dados['day'] = dados['Date_of_Journey'].apply(string_date_to_day_part)
dados['month'] = dados['Date_of_Journey'].apply(string_date_to_month_part)
dados['year'] = dados['Date_of_Journey'].apply(string_date_to_year_part)

# The original string column is no longer needed
dados = dados.drop('Date_of_Journey', axis = 1)

dados

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,day,month,year
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,9,4,2019
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,27,4,2019
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,27,4,2019
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,1,3,2019
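The same split can also be done on the whole column at once. The sketch below is only an illustration on a small made-up sample (the `sample` frame is an assumption, not the real dataset); it uses `pd.to_datetime` with an explicit format and the `.dt` accessor instead of three `apply` calls.

```python
import pandas as pd

# Illustrative sample in the same dd/mm/YYYY format as Date_of_Journey
sample = pd.DataFrame({'Date_of_Journey': ['24/03/2019', '1/05/2019', '9/06/2019']})

# Parse the whole column in one call; values that do not match become NaT
parsed = pd.to_datetime(sample['Date_of_Journey'], format='%d/%m/%Y', errors='coerce')

# Pull each component out with the .dt accessor
sample['day'] = parsed.dt.day
sample['month'] = parsed.dt.month
sample['year'] = parsed.dt.year
```

With `errors='coerce'`, a malformed date shows up as a missing value that can be counted, rather than stopping the cell with an exception.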


## Splitting the Time Variable (Departure and Arrival)

In [6]:
def str_to_hour(time_str):
    # Hour part: the first two characters of an 'HH:MM' string
    return pd.to_numeric(time_str[0:2], errors ='coerce')

def str_to_min(time_str):
    # Minute part: the two characters after the colon
    return pd.to_numeric(time_str[3:5], errors ='coerce')

dados['departure_hour'] = dados['Dep_Time'].apply(str_to_hour)
dados['departure_min'] = dados['Dep_Time'].apply(str_to_min)
dados['arrival_hour'] = dados['Arrival_Time'].apply(str_to_hour)
# NOTE: str_to_hour is applied here as well, so arrival_min duplicates arrival_hour
# in the output below; str_to_min was probably intended.
dados['arrival_min'] = dados['Arrival_Time'].apply(str_to_hour)

dados = dados.drop(['Dep_Time', 'Arrival_Time'], axis = 1)

dados

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,day,month,year,departure_hour,departure_min,arrival_hour,arrival_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,24,3,2019,22,20,1,1
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,1,5,2019,5,50,13,13
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,9,6,2019,9,25,4,4
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,12,5,2019,18,5,23,23
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,1,3,2019,16,50,21,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,non-stop,No info,4107,9,4,2019,19,55,22,22
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,non-stop,No info,4145,27,4,2019,20,45,23,23
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,non-stop,No info,7229,27,4,2019,8,20,11,11
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,non-stop,No info,12648,1,3,2019,11,30,14,14
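In the output above, `arrival_min` matches `arrival_hour` in every displayed row, since both columns are filled by `str_to_hour`. As a hedged alternative, the sketch below extracts hour and minute in one pass with a regular expression. It assumes every value starts with `HH:MM`, as in the rows shown, and the `times` series is a made-up sample.

```python
import pandas as pd

# Made-up sample covering both shapes seen in Arrival_Time
times = pd.Series(['01:10 22 Mar', '13:15', '04:25 10 Jun', '23:30'])

# Capture HH and MM from the start of each string; a trailing date is ignored
parts = times.str.extract(r'^(?P<hour>\d{1,2}):(?P<minute>\d{2})').astype(int)

arrival_hour = parts['hour']
arrival_min = parts['minute']
```

Here the minute column comes from its own capture group, so it cannot silently repeat the hour.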


## Converting the Duration Variable to Total Minutes

In [7]:
# Convert the Duration string into the total duration in minutes
def to_min(time_str):
    if 'h' in time_str and 'm' in time_str:  # hours and minutes, e.g. '2h 50m'
        hours, minutes = time_str.split('h ')
        return int(hours) * 60 + int(minutes.split('m')[0])
    elif 'h' in time_str:  # hours only, e.g. '19h'
        return int(time_str.split('h')[0]) * 60
    elif 'm' in time_str:  # minutes only
        return int(time_str.split('m')[0])
    else:
        return np.nan  # missing value for any other format (int('nan') would raise ValueError)


dados['Duration_min'] = dados['Duration'].apply(to_min)

dados = dados.drop('Duration', axis = 1)

dados

Unnamed: 0,Airline,Source,Destination,Route,Total_Stops,Additional_Info,Price,day,month,year,departure_hour,departure_min,arrival_hour,arrival_min,Duration_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,non-stop,No info,3897,24,3,2019,22,20,1,1,170
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,2 stops,No info,7662,1,5,2019,5,50,13,13,445
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,2 stops,No info,13882,9,6,2019,9,25,4,4,1140
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,1 stop,No info,6218,12,5,2019,18,5,23,23,325
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,1 stop,No info,13302,1,3,2019,16,50,21,21,285
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,non-stop,No info,4107,9,4,2019,19,55,22,22,150
10679,Air India,Kolkata,Banglore,CCU → BLR,non-stop,No info,4145,27,4,2019,20,45,23,23,155
10680,Jet Airways,Banglore,Delhi,BLR → DEL,non-stop,No info,7229,27,4,2019,8,20,11,11,180
10681,Vistara,Banglore,New Delhi,BLR → DEL,non-stop,No info,12648,1,3,2019,11,30,14,14,160
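A column-wise version of the same conversion could look like the sketch below. It is an illustration on a made-up `durations` series, not the notebook's method, and it differs in one respect: a string with neither `h` nor `m` yields 0 here instead of a missing value.

```python
import pandas as pd

# Made-up sample with the three shapes handled by to_min
durations = pd.Series(['2h 50m', '19h', '45m', '7h 25m'])

# Digits before 'h' and before 'm'; a missing part becomes NaN, then 0
hours = pd.to_numeric(durations.str.extract(r'(\d+)h')[0]).fillna(0)
minutes = pd.to_numeric(durations.str.extract(r'(\d+)m')[0]).fillna(0)

total_min = (hours * 60 + minutes).astype(int)
```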


## Handling the Number-of-Stops Variable

In [8]:
dados.Total_Stops.unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [9]:
dados.Total_Stops = dados.Total_Stops.fillna('non-stop')  # Treat nan as non-stop

# Convert the Total_Stops string into an integer
def stops_to_int(stops_str):
    if stops_str == 'non-stop':
        return 0
    else:
        return int(stops_str.split()[0])  # '2 stops' -> 2

dados['stops'] = dados['Total_Stops'].apply(stops_to_int)

dados = dados.drop('Total_Stops', axis = 1)

dados

Unnamed: 0,Airline,Source,Destination,Route,Additional_Info,Price,day,month,year,departure_hour,departure_min,arrival_hour,arrival_min,Duration_min,stops
0,IndiGo,Banglore,New Delhi,BLR → DEL,No info,3897,24,3,2019,22,20,1,1,170,0
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,No info,7662,1,5,2019,5,50,13,13,445,2
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,No info,13882,9,6,2019,9,25,4,4,1140,2
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,No info,6218,12,5,2019,18,5,23,23,325,1
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,No info,13302,1,3,2019,16,50,21,21,285,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,No info,4107,9,4,2019,19,55,22,22,150,0
10679,Air India,Kolkata,Banglore,CCU → BLR,No info,4145,27,4,2019,20,45,23,23,155,0
10680,Jet Airways,Banglore,Delhi,BLR → DEL,No info,7229,27,4,2019,8,20,11,11,180,0
10681,Vistara,Banglore,New Delhi,BLR → DEL,No info,12648,1,3,2019,11,30,14,14,160,0
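Filling missing `Total_Stops` with `'non-stop'` is a modeling assumption. One way to keep that choice visible is an explicit mapping, so that missing or unexpected labels can be counted before they are imputed. This is a sketch on a made-up `total_stops` series; the `stops_map` dictionary simply lists the labels returned by `unique()` above.

```python
import pandas as pd

# Made-up sample, including one missing value
total_stops = pd.Series(['non-stop', '2 stops', '1 stop', None, '4 stops'])

# Explicit label -> count mapping; anything not listed maps to NaN
stops_map = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}
stops = total_stops.map(stops_map)

# Inspect how many rows would be imputed, then apply the same fill as the notebook
n_missing = int(stops.isna().sum())
stops_filled = stops.fillna(0).astype(int)
```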


# Applying the Label Encoder to the categorical variables

In [10]:
dados.dtypes

Airline            object
Source             object
Destination        object
Route              object
Additional_Info    object
Price               int64
day                 int64
month               int64
year                int64
departure_hour      int64
departure_min       int64
arrival_hour        int64
arrival_min         int64
Duration_min        int64
stops               int64
dtype: object

In [11]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

colunas = dados.dtypes.reset_index()

categ_cols = colunas[colunas[0] == 'object']['index'].to_list()
categ_cols

['Airline', 'Source', 'Destination', 'Route', 'Additional_Info']

In [12]:
# Label-encode each categorical column (one integer code per distinct value) and drop the original non-numeric column

for i in categ_cols:
    dados[str(i) +'_encoded'] = le.fit_transform(dados[i])
    dados = dados.drop(i,axis = 1)

dados

Unnamed: 0,Price,day,month,year,departure_hour,departure_min,arrival_hour,arrival_min,Duration_min,stops,Airline_encoded,Source_encoded,Destination_encoded,Route_encoded,Additional_Info_encoded
0,3897,24,3,2019,22,20,1,1,170,0,3,0,5,18,8
1,7662,1,5,2019,5,50,13,13,445,2,1,3,0,84,8
2,13882,9,6,2019,9,25,4,4,1140,2,4,2,1,118,8
3,6218,12,5,2019,18,5,23,23,325,1,3,3,0,91,8
4,13302,1,3,2019,16,50,21,21,285,1,3,0,5,29,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,4107,9,4,2019,19,55,22,22,150,0,0,3,0,64,8
10679,4145,27,4,2019,20,45,23,23,155,0,1,3,0,64,8
10680,7229,27,4,2019,8,20,11,11,180,0,4,0,2,18,8
10681,12648,1,3,2019,11,30,14,14,160,0,10,0,5,18,8
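Because the single `le` object is refit on each column in the loop, it only remembers the classes of the last column once the loop ends. If the codes need to be mapped back to their labels later, one option is to keep one encoder per column. The sketch below uses a small made-up frame; the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame with two categorical columns
df = pd.DataFrame({
    'Airline': ['IndiGo', 'Air India', 'IndiGo'],
    'Source': ['Banglore', 'Kolkata', 'Delhi'],
})

# One fitted encoder per column, kept for later decoding
encoders = {}
for col in ['Airline', 'Source']:
    encoders[col] = LabelEncoder()
    df[col + '_encoded'] = encoders[col].fit_transform(df[col])

# Codes follow the sorted order of the labels and can be mapped back
decoded = encoders['Airline'].inverse_transform(df['Airline_encoded'])
```

Note that these integer codes carry an arbitrary (alphabetical) order. Tree-based models usually tolerate that, but it is worth keeping in mind when interpreting the features.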


# Splitting the dataset for modeling

In [13]:
from sklearn.model_selection import train_test_split

x = dados.drop('Price', axis = 1)
y = dados['Price']

x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.3, random_state=42)

# Loading the models

We will run the same process for the three models discussed, to compare them.

In [14]:
!pip install xgboost



In [15]:
!pip install lightgbm



In [16]:
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score

import time

# Same depth/leaf settings for the three models.
# Note: XGBRegressor does not use max_leaf_nodes (see the warning in the output below).
modelo_gr = GradientBoostingRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)
modelo_xb = XGBRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)
modelo_lg = LGBMRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)

modelos = [('Gradient Boosting', modelo_gr), ('XG Boost', modelo_xb), ('LGBM', modelo_lg)]

# Fit each model, timing only the call to fit, then report R2 on train and test
for nome, modelo in modelos:
    print('=========== ' + nome + ' ==============')

    start = time.time()
    modelo.fit(x_train, y_train)
    end = time.time()
    y_pred_train = modelo.predict(x_train)
    y_pred_test = modelo.predict(x_test)

    print('The result on the training set is: ', r2_score(y_train, y_pred_train))
    print('The result on the test set is: ', r2_score(y_test, y_pred_test))
    print('The time the model took to train was: ', str(end - start))


The result on the training set is:  0.9356143441682732
The result on the test set is:  0.8647145637712967
The time the model took to train was:  2.2936851978302


Parameters: { "max_leaf_nodes" } are not used.



The result on the training set is:  0.9960060876315318
The result on the test set is:  0.836261763335961
The time the model took to train was:  14.42462944984436
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002464 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 471
[LightGBM] [Info] Number of data points in the train set: 7478, number of used features: 13
[LightGBM] [Info] Start training from score 9121.404654
The result on the training set is:  0.9190365032065053
The result on the test set is:  0.877102395716472
The time the model took to train was:  0.46788692474365234


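The main differences between the models are visible in these results:

- GradientBoosting is a good baseline to start from (test R² ≈ 0.865)

- XGBoost fits the training set almost perfectly (R² ≈ 0.996) but has the lowest test score of the three (R² ≈ 0.836), which suggests overfitting with these settings

- LGBM trains much faster (about 0.47 s, against 2.3 s and 14.4 s) and also reaches the highest test score (R² ≈ 0.877)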
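The comparison above repeats the same fit, predict, and score steps for every model. A small helper makes the train/test gap and the fit time easy to collect side by side. The sketch below is an illustration only: `evaluate` is a hypothetical helper, the data comes from `make_regression` rather than the flight dataset, and only scikit-learn's `GradientBoostingRegressor` is used so the block runs without extra packages.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, x_train, x_test, y_train, y_test):
    """Fit `model` and return train/test R2 plus the fit time in seconds."""
    start = time.perf_counter()
    model.fit(x_train, y_train)
    fit_seconds = time.perf_counter() - start
    return {
        'train_r2': r2_score(y_train, model.predict(x_train)),
        'test_r2': r2_score(y_test, model.predict(x_test)),
        'fit_seconds': fit_seconds,
    }


# Synthetic stand-in for the flight data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
split = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

result = evaluate(GradientBoostingRegressor(n_estimators=50, random_state=42), *split)

# A large gap between train and test R2 hints at overfitting
gap = result['train_r2'] - result['test_r2']
```

Any estimator with `fit` and `predict` methods could be passed to the same helper, which keeps the three comparisons identical by construction.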

# Performing Feature Selection

## Select KBest

In [17]:
from sklearn.feature_selection import SelectKBest, f_regression

In [18]:
selector = SelectKBest(score_func=f_regression, k = 7)

selector.fit(x,y)

dados_selected = selector.transform(x)

In [19]:
dados_selected

array([[  24,    3,  170, ...,    5,   18,    8],
       [   1,    5,  445, ...,    0,   84,    8],
       [   9,    6, 1140, ...,    1,  118,    8],
       ...,
       [  27,    4,  180, ...,    2,   18,    8],
       [   1,    3,  160, ...,    5,   18,    8],
       [   9,    5,  500, ...,    1,  108,    8]])

In [20]:
cols = selector.get_support(indices=True)
dados_new_best = x.iloc[:,cols]

These are the 7 columns (features) with the highest univariate F-scores against the ticket price, as computed by f_regression.
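To see which columns were kept and how strongly each one scored, the fitted selector exposes `scores_` and `get_support()`. The sketch below shows this on made-up data where only the `signal` column drives the target; the frame and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Made-up data: the target depends on 'signal' only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise_a': rng.normal(size=200),
    'noise_b': rng.normal(size=200),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=200)

selector_demo = SelectKBest(score_func=f_regression, k=1).fit(X_demo, y_demo)

# F-score per column, highest first, and the names of the selected columns
scores = pd.Series(selector_demo.scores_, index=X_demo.columns).sort_values(ascending=False)
kept = X_demo.columns[selector_demo.get_support()].tolist()
```

`f_regression` scores each feature on its own linear relationship with the target, so a feature that only matters through interactions or non-linear effects can score low.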

## Select Percentile

In [22]:
from sklearn.feature_selection import SelectPercentile, f_regression


selector = SelectPercentile(score_func=f_regression,percentile=50)
selector.fit(x,y)
dados_selected = selector.transform(x)


cols = selector.get_support(indices=True)
dados_new_percentile = x.iloc[:,cols]

dados_new_percentile

Unnamed: 0,day,month,Duration_min,stops,Destination_encoded,Route_encoded,Additional_Info_encoded
0,24,3,170,0,5,18,8
1,1,5,445,2,0,84,8
2,9,6,1140,2,1,118,8
3,12,5,325,1,0,91,8
4,1,3,285,1,5,29,8
...,...,...,...,...,...,...,...
10678,9,4,150,0,0,64,8
10679,27,4,155,0,0,64,8
10680,27,4,180,0,2,18,8
10681,1,3,160,0,5,18,8
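At this point `x` appears to have 14 feature columns, so `percentile=50` keeps 7 of them, the same number as `k = 7` in SelectKBest. With the same score function, both selectors would then be expected to pick the same subset. The sketch below checks this on made-up data with 14 columns.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Made-up data with 14 columns, matching the assumed width of x
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 14))
y_demo = X_demo @ rng.normal(size=14) + rng.normal(size=300)

kbest = SelectKBest(score_func=f_regression, k=7).fit(X_demo, y_demo)
pct = SelectPercentile(score_func=f_regression, percentile=50).fit(X_demo, y_demo)

# True when both selectors keep exactly the same columns
same_subset = np.array_equal(kbest.get_support(), pct.get_support())
```

If the two subsets coincide, models trained on either selection will see identical inputs and produce identical scores.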


## Running the Models Again with Feature Selection Applied

### Select KBest

In [23]:
x_train,x_test,y_train,y_test = train_test_split(dados_new_best,y, test_size=0.3, random_state=42)

In [24]:
# Same models and settings as before, now trained on the KBest columns
modelo_gr = GradientBoostingRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)
modelo_xb = XGBRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)
modelo_lg = LGBMRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)

modelos = [('Gradient Boosting', modelo_gr), ('XG Boost', modelo_xb), ('LGBM', modelo_lg)]

# Fit each model, timing only the call to fit, then report R2 on train and test
for nome, modelo in modelos:
    print('=========== ' + nome + ' ==============')

    start = time.time()
    modelo.fit(x_train, y_train)
    end = time.time()
    y_pred_train = modelo.predict(x_train)
    y_pred_test = modelo.predict(x_test)

    print('The result on the training set is: ', r2_score(y_train, y_pred_train))
    print('The result on the test set is: ', r2_score(y_test, y_pred_test))
    print('The time the model took to train was: ', str(end - start))


The result on the training set is:  0.8517798555709819
The result on the test set is:  0.7591368090488312
The time the model took to train was:  0.7077374458312988


Parameters: { "max_leaf_nodes" } are not used.



The result on the training set is:  0.9456746863584202
The result on the test set is:  0.7132698192114942
The time the model took to train was:  0.32862329483032227
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000192 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 371
[LightGBM] [Info] Number of data points in the train set: 7478, number of used features: 7
[LightGBM] [Info] Start training from score 9121.404654
The result on the training set is:  0.8332246361501204
The result on the test set is:  0.7801819500668457
The time the model took to train was:  0.0897974967956543


### Select Percentile

In [25]:
x_train,x_test,y_train,y_test = train_test_split(dados_new_percentile,y, test_size=0.3, random_state=42)


# Same models and settings as before, now trained on the SelectPercentile columns
modelo_gr = GradientBoostingRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)
modelo_xb = XGBRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)
modelo_lg = LGBMRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)

modelos = [('Gradient Boosting', modelo_gr), ('XG Boost', modelo_xb), ('LGBM', modelo_lg)]

# Fit each model, timing only the call to fit, then report R2 on train and test
for nome, modelo in modelos:
    print('=========== ' + nome + ' ==============')

    start = time.time()
    modelo.fit(x_train, y_train)
    end = time.time()
    y_pred_train = modelo.predict(x_train)
    y_pred_test = modelo.predict(x_test)

    print('The result on the training set is: ', r2_score(y_train, y_pred_train))
    print('The result on the test set is: ', r2_score(y_test, y_pred_test))
    print('The time the model took to train was: ', str(end - start))


The result on the training set is:  0.8517798555709819
The result on the test set is:  0.7591368090488312
The time the model took to train was:  0.6934552192687988


Parameters: { "max_leaf_nodes" } are not used.



The result on the training set is:  0.9456746863584202
The result on the test set is:  0.7132698192114942
The time the model took to train was:  0.3282310962677002
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000196 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 371
[LightGBM] [Info] Number of data points in the train set: 7478, number of used features: 7
[LightGBM] [Info] Start training from score 9121.404654
The result on the training set is:  0.8332246361501204
The result on the test set is:  0.7801819500668457
The time the model took to train was:  0.08864355087280273
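In the cells above, the selectors were fit on the full `x` and `y` before the train/test split, so the test rows also influenced which columns were chosen. A common way to avoid that is to put the selector and the model in one `Pipeline` and fit it on the training split only. The sketch below shows the idea on synthetic data from `make_regression`; the step names `select` and `model` are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the flight data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, n_informative=3,
                                 noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

pipe = Pipeline([
    ('select', SelectKBest(score_func=f_regression, k=3)),
    ('model', GradientBoostingRegressor(n_estimators=50, random_state=42)),
])

# Feature scores are computed from the training rows only
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)
n_kept = int(pipe.named_steps['select'].get_support().sum())
```

The same pipeline object can be passed to cross-validation utilities, in which case the selection is redone inside each fold.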


# Cross Validation

Taking Gradient Boosting from the initial run (test R² ≈ 0.865, close to LGBM's 0.877), let's try cross validation with a grid search.

In [26]:
from sklearn.model_selection import train_test_split

x = dados.drop('Price', axis = 1)
y = dados['Price']

x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.3, random_state=42)

modelo_gb = GradientBoostingRegressor(max_depth=10, max_leaf_nodes=20, random_state=42)

print('=========== Gradient Boosting ==============')

# Time only the call to fit
start = time.time()
modelo_gb.fit(x_train, y_train)
end = time.time()
y_pred_train = modelo_gb.predict(x_train)
y_pred_test = modelo_gb.predict(x_test)


print('The result on the training set is: ', r2_score(y_train, y_pred_train))
print('The result on the test set is: ', r2_score(y_test, y_pred_test))
print('The time the model took to train was: ', str(end - start))

The result on the training set is:  0.9356143441682732
The result on the test set is:  0.8647145637712967
The time the model took to train was:  1.2313003540039062


In [27]:
from sklearn.model_selection import GridSearchCV


parametros = {'n_estimators':[10, 50, 100],
              'learning_rate':[0.1 ,1 ,10],
              'max_depth': [10,50 ,500],
              'max_leaf_nodes':[10,50, 100]}

grid = GridSearchCV(estimator = modelo_gb,
                      param_grid = parametros,
                      cv = 5,
                      scoring = 'r2')

grid.fit(x,y)

  (array - array_means[:, np.newaxis]) ** 2, axis=1, weights=weights


In [28]:
grid.best_params_

{'learning_rate': 0.1,
 'max_depth': 50,
 'max_leaf_nodes': 100,
 'n_estimators': 100}

In [29]:
grid.best_score_

0.8932208083229172

In [31]:
melhor_modelo = grid.best_estimator_
melhor_modelo
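`grid.best_score_` is the mean cross-validated R² of the best parameter combination. Since `grid.fit(x,y)` used every row, `x_test` is no longer unseen data for `melhor_modelo`. One hedged alternative is to run the search on the training split and keep the test split for a final check. The sketch below does this on synthetic data with a deliberately small grid; `small_grid` and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the flight data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Small grid so the example runs quickly
small_grid = {'n_estimators': [10, 30], 'learning_rate': [0.1, 0.3]}

search = GridSearchCV(GradientBoostingRegressor(random_state=42),
                      param_grid=small_grid, cv=3, scoring='r2')
search.fit(X_tr, y_tr)  # cross-validation happens inside the training split

cv_r2 = search.best_score_              # mean R2 across the 3 folds
holdout_r2 = search.score(X_te, y_te)   # R2 of the refit best model on unseen rows
```

Comparing `cv_r2` with `holdout_r2` gives a rough check that the tuned parameters generalize beyond the folds used to choose them.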