<a href="https://colab.research.google.com/github/chavamoon/MachineLearningExamples/blob/main/Python/Regression/DecisionTree_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression Lab

We have temperature and humidity data recorded alongside the energy use of electrical and electronic appliances in several homes in a region of Belgium, collected every 10 minutes for four and a half months.

We want to predict the energy, in Wh, that the appliances will consume at a given moment. This target is the `appliances` variable.

The dataset has the following variables:

* date: Timestamp of the reading (year, month, day, hour, minute, and second)
* appliances: Energy used in Wh
* lights: Energy used by the house's lighting in Wh
* T1: Temperature in the kitchen in degrees Celsius
* RH_1: Humidity in the kitchen in percent
* T2: Temperature in the living room in degrees Celsius
* RH_2: Humidity in the living room in percent
* T3: Temperature in the laundry room in degrees Celsius
* RH_3: Humidity in the laundry room in percent
* T4: Temperature in the office in degrees Celsius
* RH_4: Humidity in the office in percent
* T5: Temperature in the bathroom in degrees Celsius
* RH_5: Humidity in the bathroom in percent
* T6: Temperature outside the building (north side) in degrees Celsius
* RH_6: Humidity outside the building (north side) in percent
* T7: Temperature in the ironing room in degrees Celsius
* RH_7: Humidity in the ironing room in percent
* T8: Temperature in the second child's room in degrees Celsius
* RH_8: Humidity in the second child's room in percent
* T9: Temperature in the parents' room in degrees Celsius
* RH_9: Humidity in the parents' room in percent
* To: Outdoor temperature (from the Chievres weather station) in degrees Celsius
* Pressure: Pressure in mm Hg from the Chievres weather station
* RH_out: Outdoor humidity from the Chievres weather station in percent
* Wind speed: Wind speed in m/s from the Chievres weather station
* Visibility: Visibility in km from the Chievres weather station
* Tdewpoint: Dew point temperature from the Chievres weather station in degrees Celsius (°C)
* rv1: Random variable 1, dimensionless
* rv2: Random variable 2, dimensionless

In [49]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.model_selection as ms
import sklearn.tree as tr
from datetime import datetime

In [2]:
np.random.seed(200831)

## EDA

In [4]:
appliances = pd.read_csv('https://raw.githubusercontent.com/chavamoon/MachineLearningExamples/main/Python/energydata_complete.csv')

In [14]:
# Lower-case every column name so later cells can refer to them uniformly (e.g. appliances.date)
appliances.rename(columns={col: col.lower() for col in appliances.columns}, inplace=True)


In [15]:
appliances.dtypes

date            object
appliances       int64
lights           int64
t1             float64
rh_1           float64
t2             float64
rh_2           float64
t3             float64
rh_3           float64
t4             float64
rh_4           float64
t5             float64
rh_5           float64
t6             float64
rh_6           float64
t7             float64
rh_7           float64
t8             float64
rh_8           float64
t9             float64
rh_9           float64
t_out          float64
press_mm_hg    float64
rh_out         float64
windspeed      float64
visibility     float64
tdewpoint      float64
rv1            float64
rv2            float64
dtype: object

In [28]:
appliances.shape

(19735, 32)

In [16]:
appliances.head()

Unnamed: 0,date,appliances,lights,t1,rh_1,t2,rh_2,t3,rh_3,t4,rh_4,t5,rh_5,t6,rh_6,t7,rh_7,t8,rh_8,t9,rh_9,t_out,press_mm_hg,rh_out,windspeed,visibility,tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,55.2,7.026667,84.256667,17.2,41.626667,18.2,48.9,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,55.2,6.833333,84.063333,17.2,41.56,18.2,48.863333,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,55.09,6.56,83.156667,17.2,41.433333,18.2,48.73,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,55.09,6.433333,83.423333,17.133333,41.29,18.1,48.59,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.366667,84.893333,17.2,41.23,18.1,48.59,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [27]:
appliances.describe()

Unnamed: 0,appliances,lights,t1,rh_1,t2,rh_2,t3,rh_3,t4,rh_4,t5,rh_5,t6,rh_6,t7,rh_7,t8,rh_8,t9,rh_9,t_out,press_mm_hg,rh_out,windspeed,visibility,tdewpoint,rv1,rv2,year,month,day
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,19.592106,50.949283,7.910939,54.609083,20.267106,35.3882,22.029107,42.936165,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033,2016.0,3.101647,16.057411
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,1.844623,9.022034,6.090347,31.149806,2.109993,5.114208,1.956162,5.224361,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634,0.0,1.3392,8.450998
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,15.33,29.815,-6.065,1.0,15.39,23.2,16.306667,29.6,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322,2016.0,1.0,1.0
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,18.2775,45.4,3.626667,30.025,18.7,31.5,20.79,39.066667,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889,2016.0,2.0,9.0
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,19.39,49.09,7.3,55.29,20.033333,34.863333,22.1,42.375,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653,2016.0,3.0,16.0
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,20.619643,53.663333,11.256,83.226667,21.6,39.0,23.39,46.536,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769,2016.0,4.0,23.0
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,25.795,96.321667,28.29,99.9,26.0,51.4,27.23,58.78,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653,2016.0,5.0,31.0


In [42]:
# Parse the timestamp once and derive the calendar features from it
timestamps = pd.to_datetime(appliances['date'])
appliances['year'] = timestamps.dt.year
appliances['month'] = timestamps.dt.month
appliances['day'] = timestamps.dt.day
appliances['hour'] = timestamps.dt.hour
appliances.head()

Unnamed: 0,date,appliances,lights,t1,rh_1,t2,rh_2,t3,rh_3,t4,rh_4,t5,rh_5,t6,rh_6,t7,rh_7,t8,rh_8,t9,rh_9,t_out,press_mm_hg,rh_out,windspeed,visibility,tdewpoint,rv1,rv2,year,month,day,hour
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,55.2,7.026667,84.256667,17.2,41.626667,18.2,48.9,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433,2016,1,11,17
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,55.2,6.833333,84.063333,17.2,41.56,18.2,48.863333,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195,2016,1,11,17
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,55.09,6.56,83.156667,17.2,41.433333,18.2,48.73,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668,2016,1,11,17
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,55.09,6.433333,83.423333,17.133333,41.29,18.1,48.59,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389,2016,1,11,17
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.366667,84.893333,17.2,41.23,18.1,48.59,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097,2016,1,11,17


In [43]:
appliances.dtypes

date            object
appliances       int64
lights           int64
t1             float64
rh_1           float64
t2             float64
rh_2           float64
t3             float64
rh_3           float64
t4             float64
rh_4           float64
t5             float64
rh_5           float64
t6             float64
rh_6           float64
t7             float64
rh_7           float64
t8             float64
rh_8           float64
t9             float64
rh_9           float64
t_out          float64
press_mm_hg    float64
rh_out         float64
windspeed      float64
visibility     float64
tdewpoint      float64
rv1            float64
rv2            float64
year             int64
month            int64
day              int64
hour             int64
dtype: object

## Training

In [44]:
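# Chronological split. 'date' is still a string column, so these comparisons are
# lexicographic; that works here because the timestamps are in ISO format.
# Note: the test window starts at '2016-04-28 23:55:00', so the 144 readings of
# 2016-04-28 fall in neither set. Starting it at '2016-04-28 00:00:00' would
# include them (and change the shapes printed below).
training_data = appliances[(appliances.date >= '2016-01-11 00:00:00') & (appliances.date <= '2016-04-27 23:55:00')]
test_data = appliances[(appliances.date >= '2016-04-28 23:55:00') & (appliances.date <= '2016-05-27 23:55:00')]
print('Training shape: ', training_data.shape, 'Test data shape: ', test_data.shape)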

Training shape:  (15450, 33) Test data shape:  (4141, 33)
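
The chronological split above can be sketched on a toy frame to check that no rows are lost between the two sets. This is only an illustration under stated assumptions: `demo`, `cutoff`, `train_demo` and `test_demo` are hypothetical names, the data is synthetic, and a single cutoff timestamp replaces the two date ranges used in the notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one reading every 10 minutes for 3 days.
demo = pd.DataFrame({"date": pd.date_range("2016-04-27", "2016-04-29 23:50", freq="10min")})
demo["appliances"] = range(len(demo))

# Assumed boundary: everything before it trains, everything from it onward tests.
cutoff = pd.Timestamp("2016-04-28")

train_demo = demo[demo["date"] < cutoff]
test_demo = demo[demo["date"] >= cutoff]

# With a single cutoff every row lands in exactly one set, so nothing is dropped.
rows_lost = len(demo) - len(train_demo) - len(test_demo)
print(len(train_demo), len(test_demo), rows_lost)  # 144 288 0
```

Using one cutoff with `<` and `>=` makes a gap between the sets impossible, whereas two independently typed ranges can silently skip rows.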


In [53]:
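# Drop the timestamp, the target, the two random noise variables and 'year'
# (constant at 2016 for every row), keeping the same columns in both sets.
dropped_columns = ['date', 'appliances', 'rv1', 'rv2', 'year']
X_train = training_data.drop(dropped_columns, axis=1)
X_test = test_data.drop(dropped_columns, axis=1)
y_train = training_data.appliances
y_test = test_data.appliances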

In [55]:
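tree = tr.DecisionTreeRegressor()
# TimeSeriesSplit keeps each validation fold later in time than its training fold
cv = ms.TimeSeriesSplit(n_splits=7)
gs = ms.GridSearchCV(
    estimator=tree,
    param_grid={
        'max_depth': [5, 10, 15, 20],
        # 'mse'/'mae' are the names used by the scikit-learn version this notebook ran on;
        # from scikit-learn 1.0 they are called 'squared_error' and 'absolute_error'.
        'criterion': ['mse', 'mae'],
        'min_samples_leaf': [5, 7, 9, 11]
    },
    cv=cv,
    # Scores are negated RMSE values, so higher (closer to zero) is better
    scoring='neg_root_mean_squared_error'
)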

In [56]:
regm = gs.fit(X_train, y_train)
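
To make the cross-validation scheme used by the grid search more concrete, here is a minimal sketch of how `TimeSeriesSplit` with 7 splits partitions time-ordered samples. The 16-row array `X_demo` is toy data chosen only to keep the printout short; it is not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 16 time-ordered toy samples (illustrative only).
X_demo = np.arange(16).reshape(-1, 1)
splitter = TimeSeriesSplit(n_splits=7)

# Each fold trains on an expanding prefix and validates on the block right after it.
folds = list(splitter.split(X_demo))
for train_idx, val_idx in folds:
    print(f"train 0..{train_idx[-1]}  validate {val_idx[0]}..{val_idx[-1]}")
```

Because the training indices always precede the validation indices, no fold is scored on data older than what it was trained on.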

In [60]:
# Best model
best_model = regm.best_estimator_
best_model

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mae', max_depth=5,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=11, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [85]:
regm.best_params_

{'criterion': 'mae', 'max_depth': 5, 'min_samples_leaf': 11}
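
Since the search is scored with `neg_root_mean_squared_error`, the value stored in `best_score_` is a negated RMSE. The sketch below, run on synthetic data with a deliberately small grid (all names and values here are assumptions for illustration), shows how the cross-validated RMSE of the winning configuration could be read back.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [5, 11]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated RMSE across folds; flip the sign to get RMSE.
cv_rmse = -search.best_score_
print(search.best_params_, round(cv_rmse, 3))
```

Comparing this cross-validated RMSE with the test-set RMSE gives a rough sense of how well the validation folds anticipated the held-out period.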

In [66]:
# Features that contribute most to the model
feature_importances = pd.DataFrame.from_dict({
    'features': X_train.columns.values,
    'importances': best_model.feature_importances_
})
feature_importances.sort_values(by='importances', ascending=False).head()

Unnamed: 0,features,importances
27,hour,0.688641
5,t3,0.09789
6,rh_3,0.059032
15,t8,0.056759
20,press_mm_hg,0.043475
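
The importances reported by a fitted tree are normalized, so the values in the table above are shares of a total of 1. A small sketch on synthetic data illustrates this; the column names `a`, `b`, `c` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data where 'a' drives the target much more than 'b' (illustrative only).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "noise"])
y_demo = 3.0 * X_demo["a"] + X_demo["b"] + rng.normal(scale=0.1, size=300)

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

# Label each importance with its feature name and rank them.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
print(importances.sum())
```

Read this way, the notebook's top feature, `hour`, accounts for roughly 69% of the impurity reduction in the selected tree.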


## Predictions

In [74]:
y_prediction = best_model.predict(X_test)
y_prediction[-5:]


array([110., 110., 110., 110., 110.])
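
A likely reason the last five predictions are identical is that a regression tree returns one constant value per leaf, so nearby inputs that reach the same leaf get the same prediction. The following sketch on synthetic data (the feature, the target and `small_tree` are assumptions for illustration) shows that the number of distinct predictions never exceeds the number of leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature problem, loosely imitating an hour-of-day input.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 24, size=(500, 1))
y_demo = 50 + 10 * np.sin(X_demo[:, 0]) + rng.normal(0, 5, size=500)

small_tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=11, random_state=0)
small_tree.fit(X_demo, y_demo)

# Each leaf stores a single value, so predictions are piecewise constant.
preds = small_tree.predict(X_demo)
n_leaves = small_tree.get_n_leaves()
n_distinct = len(np.unique(preds))
print(n_distinct, "distinct predictions from", n_leaves, "leaves")
```

With `max_depth=5`, as selected by the grid search, the model can output at most 32 different values, which limits how closely it can follow consumption spikes.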

## Performance metrics

In [76]:
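# RMSE
import sklearn.metrics as metr

# squared=False returns the square root of the MSE. Recent scikit-learn releases
# replace this flag with metr.root_mean_squared_error(y_test, y_prediction).
metr.mean_squared_error(y_test, y_pred=y_prediction, squared=False)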

113.96956002976589

In [77]:
# MAE
metr.mean_absolute_error(y_test, y_prediction)

65.9188601787008
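
The gap between the two metrics is easier to interpret with the formulas written out. The sketch below computes RMSE and MAE by hand with NumPy on four made-up values (not taken from the dataset), one of which is a large miss.

```python
import numpy as np

# Made-up observations and predictions; the third one is a large under-prediction.
y_true_demo = np.array([60.0, 50.0, 840.0, 100.0])
y_pred_demo = np.array([70.0, 50.0, 60.0, 90.0])

errors = y_true_demo - y_pred_demo

# RMSE squares the errors before averaging, so large misses dominate it.
rmse = np.sqrt(np.mean(errors ** 2))
# MAE averages absolute errors, so every miss counts in proportion to its size.
mae = np.mean(np.abs(errors))
print(rmse, mae)
```

An RMSE well above the MAE, as in the notebook's results, suggests that a small number of large errors account for much of the squared error.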

In [81]:
# Residuals
residuals = pd.DataFrame.from_dict({'real': y_test, 'prediction': y_prediction, 'residual': y_test-y_prediction})

# Worst residuals: sorting the signed residual in descending order lists the largest under-predictions
residuals.sort_values(by='residual', ascending=False).head()


Unnamed: 0,real,prediction,residual
15798,840,60.0,780.0
19582,850,100.0,750.0
15664,780,80.0,700.0
15799,700,60.0,640.0
16647,670,50.0,620.0
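
Sorting by the signed residual only surfaces under-predictions. If over-predictions matter as well, one option is to rank by absolute residual, sketched here on a made-up frame (`resid_demo` and its values are hypothetical).

```python
import pandas as pd

# Made-up real values and predictions (illustrative only).
resid_demo = pd.DataFrame({
    "real": [840, 60, 50, 400],
    "prediction": [60.0, 300.0, 55.0, 380.0],
})
resid_demo["residual"] = resid_demo["real"] - resid_demo["prediction"]
resid_demo["abs_residual"] = resid_demo["residual"].abs()

# Ranking by magnitude puts the -240 over-prediction second, ahead of the +20 miss.
worst = resid_demo.sort_values("abs_residual", ascending=False)
print(worst.head(2))
```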


In [84]:
# Which observation has the worst residual? Look it up by index instead of hard-coding the value 840
test_data.loc[[residuals['residual'].idxmax()]]

Unnamed: 0,date,appliances,lights,t1,rh_1,t2,rh_2,t3,rh_3,t4,rh_4,t5,rh_5,t6,rh_6,t7,rh_7,t8,rh_8,t9,rh_9,t_out,press_mm_hg,rh_out,windspeed,visibility,tdewpoint,rv1,rv2,year,month,day,hour
15798,2016-04-30 10:00:00,840,0,21.39,37.7,20.133333,38.466667,22.7,36.59,19.823333,39.326667,19.2,42.59,12.1,20.596667,19.2,33.7,21.426667,42.863333,19.5,41.5,7.1,758.8,83.0,1.0,40.0,4.3,47.772522,47.772522,2016,4,30,10
