## Embedded method: recursive feature elimination (RFE)
One way to carry out an embedded method is recursive feature elimination (RFE). RFE is a technique for performing backward elimination: a ranking criterion that expresses the importance of each feature is computed, the least important feature is removed, and the model is refitted on the features that remain (Cheng Fan, 2014).
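
The elimination loop can be sketched by hand. The example below is a minimal illustration and not the notebook's own code: the synthetic data, the target of three features (`n_keep`), and the use of the absolute regression coefficient as the ranking criterion are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 8 features, 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
# Standardise so that coefficient magnitudes are comparable across features.
X = StandardScaler().fit_transform(X)

remaining = list(range(X.shape[1]))  # indices of features still in the model
eliminated = []                      # order in which features were dropped
n_keep = 3                           # hypothetical number of features to keep

while len(remaining) > n_keep:
    model = LinearRegression().fit(X[:, remaining], y)
    importance = np.abs(model.coef_)                 # ranking criterion
    weakest = remaining[int(np.argmin(importance))]  # least important feature
    remaining.remove(weakest)
    eliminated.append(weakest)

print("kept:", sorted(remaining))
print("dropped (first to last):", eliminated)
```

Each pass refits the model on the surviving features, so a feature's importance is always judged in the context of the others that are still present.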

scikit-learn provides both recursive feature elimination (RFE) and recursive feature elimination with cross-validation (RFECV). With cross-validation, every candidate feature subset is scored on data that was not used to fit the model, which reduces the risk of overfitting (it does not rule it out). For that reason RFECV is used here.
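
The practical difference between the two classes can be shown in a few lines. This is a sketch on synthetic data (an assumption, not the energy dataset): `RFE` needs the number of features to keep up front, while `RFECV` chooses it by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 10 features, 4 of which carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

# RFE: the number of features to keep is fixed by the user.
rfe = RFE(LinearRegression(), n_features_to_select=4, step=1).fit(X, y)

# RFECV: the number of features is chosen by 5-fold cross-validation.
rfecv = RFECV(LinearRegression(), step=1, cv=5).fit(X, y)

print("RFE kept:  ", rfe.n_features_)
print("RFECV kept:", rfecv.n_features_)
```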

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_selection import RFECV

In [2]:
# Load the dataset.
features = pd.read_csv('energie_features_inflatie.csv')
features

Unnamed: 0,Datum,energie,wind_richting,zon_perc,zon_straling,dag_vocht,max_vocht,min_vocht,vector_wind,wind,...,min_temp_6m,max_temp_6m,zon_uren_6m,duur_neerslag_6m,dag_neerslag_6m,max_neerslag_6m,dag_luchtdruk_6m,max_luchtdruk_6m,min_luchtdruk_6m,CPI_energie_6m
0,2001-01,9267,141.483871,27.967742,241.870968,90.161290,96.580645,79.483871,3.403226,3.693548,...,11.641935,19.548387,3.954839,1.854839,3.151613,1.567742,1012.616129,1015.319355,1009.887097,54.71
1,2001-02,8266,181.642857,32.250000,483.571429,88.000000,97.464286,73.964286,3.050000,3.639286,...,11.593548,22.809677,6.825806,0.709677,1.380645,0.903226,1017.519355,1019.232258,1015.712903,54.71
2,2001-03,8962,142.580645,17.419355,570.774194,85.548387,95.451613,72.354839,3.448387,3.829032,...,12.110000,19.946667,3.846667,1.680000,2.280000,1.186667,1013.006667,1015.930000,1010.070000,54.68
3,2001-04,8156,226.400000,34.766667,1220.166667,78.933333,96.333333,56.133333,3.390000,3.926667,...,7.700000,14.954839,3.161290,3.006452,3.393548,1.358065,1010.248387,1014.483871,1005.770968,55.02
4,2001-05,8304,134.645161,56.967742,2028.741935,71.645161,93.548387,50.322581,3.619355,3.835484,...,5.393333,10.120000,2.030000,3.100000,3.916667,1.366667,1002.100000,1006.016667,997.723333,55.08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,2021-08,8981,225.741935,36.258065,1391.612903,81.419355,96.645161,62.548387,2.677419,2.958065,...,0.750000,7.860714,4.539286,2.035714,1.517857,0.450000,1017.864286,1021.328571,1014.757143,106.13
248,2021-09,9030,164.500000,46.433333,1165.900000,81.300000,97.066667,60.200000,2.160000,2.510000,...,1.754839,10.609677,4.545161,1.509677,1.100000,0.358065,1021.083871,1023.983871,1017.835484,107.18
249,2021-10,9410,200.580645,36.774194,619.709677,85.774194,96.516129,69.225806,3.045161,3.335484,...,1.606667,11.236667,6.826667,1.553333,1.476667,0.533333,1020.933333,1023.460000,1018.123333,107.97
250,2021-11,9678,206.033333,22.033333,276.566667,89.233333,96.766667,77.200000,2.383333,2.723333,...,6.216129,15.583871,6.096774,2.393548,3.358065,1.580645,1010.845161,1013.941935,1007.358065,108.22


The RFECV class (sklearn.feature_selection.RFECV, n.d.) is applied with a linear regression estimator. Its result indicates which features are carried over into the linear regression model.

In [3]:
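# Every column except the date and the target is a candidate feature.
alle_features = features.drop(['Datum', 'energie'], axis=1)
X = alle_features.values
# Target: the 'energie' column.
y = features['energie'].values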

In [4]:
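# Drop one feature per iteration (step=1) and score each feature subset
# with 5-fold cross-validation (cv=5).
LR = LinearRegression()
selector = RFECV(LR, step=1, cv=5)
selector = selector.fit(X, y)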

In [5]:
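# NOTE: support_ follows alle_features.columns; features.columns still holds
# 'Datum' and 'energie', so every label printed below is shifted by two.
features_selector = list(zip(features.columns, selector.support_))
features_selector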

[('Datum', False),
 ('energie', False),
 ('wind_richting', False),
 ('zon_perc', False),
 ('zon_straling', False),
 ('dag_vocht', False),
 ('max_vocht', False),
 ('min_vocht', True),
 ('vector_wind', True),
 ('wind', True),
 ('max_wind', False),
 ('min_wind', True),
 ('max_windstoot', False),
 ('gem_temp', True),
 ('min_temp', False),
 ('max_temp', True),
 ('zon_uren', False),
 ('duur_neerslag', False),
 ('dag_neerslag', False),
 ('max_neerslag', False),
 ('dag_luchtdruk', False),
 ('max_luchtdruk', False),
 ('min_luchtdruk', False),
 ('CPI_energie', False),
 ('wind_richting_1j', False),
 ('zon_perc_1j', False),
 ('zon_straling_1j', False),
 ('dag_vocht_1j', False),
 ('max_vocht_1j', False),
 ('min_vocht_1j', False),
 ('vector_wind_1j', False),
 ('wind_1j', False),
 ('max_wind_1j', False),
 ('min_wind_1j', False),
 ('max_windstoot_1j', False),
 ('gem_temp_1j', False),
 ('min_temp_1j', False),
 ('max_temp_1j', False),
 ('zon_uren_1j', False),
 ('duur_neerslag_1j', False),
 ('dag_neersla
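
The boolean mask in `support_` and the ranks in `ranking_` have one entry per column that was passed to `fit`, so they line up with the predictor columns and not with the full DataFrame. The sketch below shows one way to keep names and mask aligned. The DataFrame `demo` and its column names `f0` to `f5` are hypothetical stand-ins for the real data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the features DataFrame: date, target, 6 predictors.
X_arr, y_arr = make_regression(n_samples=120, n_features=6, n_informative=3,
                               noise=5.0, random_state=1)
demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
demo.insert(0, "energie", y_arr)
months = pd.date_range("2001-01-01", periods=120, freq="MS").strftime("%Y-%m")
demo.insert(0, "Datum", list(months))

predictors = demo.drop(["Datum", "energie"], axis=1)
selector = RFECV(LinearRegression(), step=1, cv=5)
selector.fit(predictors.values, demo["energie"].values)

# Pair the mask and the ranks with the columns that were actually fitted.
selected = predictors.columns[selector.support_].tolist()
ranking = pd.Series(selector.ranking_, index=predictors.columns).sort_values()

print(selected)
print(ranking)
```

Indexing `predictors.columns` with the mask avoids the off-by-two labels that appear when the mask is zipped against a column list that still contains the date and target columns.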

In [6]:
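# NOTE: ranking_ (1 = selected) also follows alle_features.columns, so these
# labels carry the same two-position shift.
features_ranking = list(zip(features.columns, selector.ranking_))
features_ranking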

[('Datum', 87),
 ('energie', 69),
 ('wind_richting', 98),
 ('zon_perc', 47),
 ('zon_straling', 60),
 ('dag_vocht', 46),
 ('max_vocht', 44),
 ('min_vocht', 1),
 ('vector_wind', 1),
 ('wind', 1),
 ('max_wind', 21),
 ('min_wind', 1),
 ('max_windstoot', 8),
 ('gem_temp', 1),
 ('min_temp', 71),
 ('max_temp', 1),
 ('zon_uren', 37),
 ('duur_neerslag', 97),
 ('dag_neerslag', 15),
 ('max_neerslag', 19),
 ('dag_luchtdruk', 14),
 ('max_luchtdruk', 85),
 ('min_luchtdruk', 78),
 ('CPI_energie', 77),
 ('wind_richting_1j', 84),
 ('zon_perc_1j', 92),
 ('zon_straling_1j', 45),
 ('dag_vocht_1j', 72),
 ('max_vocht_1j', 80),
 ('min_vocht_1j', 16),
 ('vector_wind_1j', 35),
 ('wind_1j', 25),
 ('max_wind_1j', 20),
 ('min_wind_1j', 3),
 ('max_windstoot_1j', 4),
 ('gem_temp_1j', 2),
 ('min_temp_1j', 82),
 ('max_temp_1j', 73),
 ('zon_uren_1j', 59),
 ('duur_neerslag_1j', 27),
 ('dag_neerslag_1j', 38),
 ('max_neerslag_1j', 40),
 ('dag_luchtdruk_1j', 39),
 ('max_luchtdruk_1j', 75),
 ('min_luchtdruk_1j', 88),
 ('CP

The selected features are then used in a linear regression model.

In [7]:
# Feature names taken from the entries marked True in features_selector.
X = features[['min_vocht', 'vector_wind', 'wind', 'min_wind', 'gem_temp', 'max_temp', 'min_vocht_1m', 'wind_1m', 'dag_neerslag_3m',
              'max_neerslag_3m', 'dag_luchtdruk_3m']].values
y = features['energie'].values

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [9]:
LR = LinearRegression()
LR.fit(x_train,y_train)
y_prediction =  LR.predict(x_test)
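
In the cells above, RFECV is fitted on all rows before the train/test split, so the held-out rows have already influenced which features were chosen. A possible alternative, sketched here on synthetic data (an assumption), is to split first and place the selector in a `Pipeline`, so that feature selection only ever sees the training rows.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 20 features, 5 of which carry signal.
X, y = make_regression(n_samples=250, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("select", RFECV(LinearRegression(), step=1, cv=5)),  # feature selection
    ("model", LinearRegression()),                        # final regressor
])
pipe.fit(x_train, y_train)            # selection uses training rows only
y_prediction = pipe.predict(x_test)   # same columns are selected for the test set

n_kept = pipe.named_steps["select"].n_features_
print("features kept:", n_kept)
print("r2 score =", r2_score(y_test, y_prediction))
```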

In [10]:
score=r2_score(y_test,y_prediction)
print('r2 score = ',score)
print('RMSE = ',np.sqrt(mean_squared_error(y_test,y_prediction)))

r2 score =  0.5577902693045538
RMSE =  400.64835021507366
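
For reference, both metrics can be computed directly from the residuals: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The values in this sketch are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([9000.0, 8500.0, 8800.0, 9400.0])
y_pred = np.array([9100.0, 8300.0, 8900.0, 9000.0])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))            # root mean squared error
ss_res = np.sum(residuals ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print("RMSE =", rmse)
print("r2 score =", r2)
```

RMSE is expressed in the same unit as the target, so it can be compared directly with the magnitude of the `energie` values.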


## References
Cheng Fan, F. X. (2014, August 15). Development of prediction models for next-day building energy consumption and peak power demand using data mining techniques. Retrieved from Elsevier: https://www.sciencedirect.com/science/article/pii/S0306261914003596 <br>
sklearn.feature_selection.RFECV. (n.d.). Retrieved from Scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html