# __Training an xgboost model__

xgboost is not a classical time series model, but it is frequently used in the data science community for time series forecasting. The model is an ensemble of base learners, which are usually decision trees. Training is based on gradient boosting: the trees are added one after another, and each new tree is fitted to reduce the error left by the trees before it.
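
As a rough illustration of the boosting idea, the sketch below fits depth-1 trees (stumps) one after another to the residuals of a squared-error loss, using plain Python lists. The function names and the toy data are made up for this example. It leaves out almost everything that makes xgboost practical, so it should not be read as a description of the library's implementation.

```python
# Toy gradient boosting for squared error with decision stumps.
# Illustration only; all names and data here are hypothetical.

def fit_stump(x, residuals):
    """Find the threshold on x whose two group means fit the residuals best."""
    best = None
    # The largest x value is excluded so that the right group is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Start from the mean of y, then add one shrunken stump per round."""
    base = sum(y) / len(y)
    prediction = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        prediction = [pi + learning_rate * predict_stump(stump, xi)
                      for xi, pi in zip(x, prediction)]
    return base, stumps, prediction


def mse(actual, forecast):
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]  # made-up step-like signal

base, stumps, fitted = boost(x, y)
print("MSE of the constant baseline:", mse(y, [base] * len(y)))
print("MSE after boosting:          ", mse(y, fitted))
```

The `learning_rate` and `n_estimators` arguments play the same role here as the parameters of the same name in the `XGBRegressor` call further down: a smaller learning rate shrinks each tree's contribution, so more trees are needed to reach the same fit.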

## __Data preparation__

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

import xgboost as xgb

import joblib

import warnings
warnings.simplefilter('ignore')

In [2]:
file_path = '../data/train_time_features.pkl'
df = pd.read_pickle(file_path)

In [3]:
df.isna().any()

t CO2-e / MWh    False
year             False
minute_sin       False
minute_cos       False
hour_sin         False
hour_cos         False
weekday_sin      False
month_sin        False
month_cos        False
lag1              True
lag2              True
lag3              True
lag4              True
lag5              True
lag6              True
lag7              True
lag8              True
lag9              True
lag10             True
lag11             True
lag12             True
horizon0          True
dtype: bool

In [4]:
#the lagging operation introduced NaN values into our dataset, which we remove before training the xgboost model
df.dropna(inplace=True)
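
The NaN values appear because a lagged column has no history to draw on for the first rows. The sketch below reproduces this with plain Python lists, assuming the lag features were built by shifting the series by 1 to N steps (the feature engineering itself is not part of this notebook). The helper names and the values are made up.

```python
import math


def make_lag_rows(values, n_lags):
    """Return one row per time step: [value, lag1, ..., lagN].

    Positions without enough history are filled with NaN.
    """
    rows = []
    for t, value in enumerate(values):
        lags = [values[t - lag] if t - lag >= 0 else math.nan
                for lag in range(1, n_lags + 1)]
        rows.append([value] + lags)
    return rows


def drop_incomplete(rows):
    """Keep only rows without NaN, like DataFrame.dropna() does by default."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


series = [0.9, 1.0, 1.1, 1.0, 0.8, 0.7]  # made-up values
rows = make_lag_rows(series, n_lags=3)
complete = drop_incomplete(rows)

for row in rows:
    print(row)
print("rows before:", len(rows), "rows after dropping NaN:", len(complete))
```

With three lags, the first three rows are incomplete and are dropped. With the twelve lags used here, the same reasoning costs the first twelve rows of the dataset.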

In [5]:
df.isna().any()

t CO2-e / MWh    False
year             False
minute_sin       False
minute_cos       False
hour_sin         False
hour_cos         False
weekday_sin      False
month_sin        False
month_cos        False
lag1             False
lag2             False
lag3             False
lag4             False
lag5             False
lag6             False
lag7             False
lag8             False
lag9             False
lag10            False
lag11            False
lag12            False
horizon0         False
dtype: bool

### __Train / validation split__

In [6]:
def train_test_ts(df, relative_train, maximal_lag, horizon):
    '''
    Time series (ts) split function that creates a train/test set and avoids overlap between the two caused by lag processing
    X_train, y_train, X_test, y_test = ...
    df=must contain the target column f"horizon{horizon}" and the raw series "t CO2-e / MWh"; all other columns are used as features
    relative_train=fraction of the total dataset used for training; must be between 0 and 1
    maximal_lag=the largest lag used in the lag feature engineering
    horizon=forecast horizon; selects the target column f"horizon{horizon}"
    '''
    k = int(df.shape[0] * relative_train)
    data_train = df.iloc[:k,:]
    #to avoid overlap between train and test data, maximal_lag rows are skipped between the two sets
    data_test = df.iloc[k+maximal_lag:,:]
    
    assert data_train.index.max() < data_test.index.min()
    
    #returns in the sequence X_train, y_train, X_test, y_test
    return (data_train.drop(columns=[f"horizon{horizon}","t CO2-e / MWh"]), data_train[f"horizon{horizon}"],
            data_test.drop(columns=[f"horizon{horizon}","t CO2-e / MWh"]), data_test[f"horizon{horizon}"])
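
To see which rows end up where, the sketch below applies the same index arithmetic to row positions only, without any data. The function name `split_with_gap` is made up for this illustration.

```python
def split_with_gap(n_rows, relative_train, maximal_lag):
    """Return the row positions of the train and test set.

    Mirrors the slicing in train_test_ts: rows [0, k) are used for training,
    the next maximal_lag rows are skipped, and the rest is used for testing.
    """
    k = int(n_rows * relative_train)
    train_positions = range(0, k)
    test_positions = range(k + maximal_lag, n_rows)
    return train_positions, test_positions


train_pos, test_pos = split_with_gap(n_rows=100, relative_train=0.8, maximal_lag=12)

# last train row, first test row, number of test rows
print(train_pos[-1], test_pos[0], len(test_pos))  # 79 92 8
```

In this toy case, rows 80 to 91 are discarded, so the largest lag of the first test row (row 92 minus 12) points at row 80, which is not part of the training set.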

### __Model training__

Initially, we will train the model without the lag features. In an exercise, you will train it yourself with the entire feature set, i.e. including the lag features.

In [31]:
X_train, y_train, X_validation, y_validation = train_test_ts(
    df=df.drop(columns=['lag1', 'lag2', 'lag3', 'lag4', 'lag5', 'lag6', 'lag7', 'lag8', 'lag9', 'lag10', 'lag11', 'lag12']),
    relative_train=0.8,
    maximal_lag=12,
    horizon=0)

print(df.columns)

print(X_train.index.max())
print(X_validation.index.min())

assert X_train.index.max() < X_validation.index.min()

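#n_estimators is the number of boosted trees (the parameter is called n_estimators, not num_estimators)
model = xgb.XGBRegressor(max_depth=5,
                         learning_rate=0.1,
                         n_estimators=100,
                         n_jobs=3,
                         reg_alpha=0.05,
                         reg_lambda=0,
                        )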

model.fit(X_train, y_train)

Index(['t CO2-e / MWh', 'year', 'minute_sin', 'minute_cos', 'hour_sin',
       'hour_cos', 'weekday_sin', 'month_sin', 'month_cos', 'lag1', 'lag2',
       'lag3', 'lag4', 'lag5', 'lag6', 'lag7', 'lag8', 'lag9', 'lag10',
       'lag11', 'lag12', 'horizon0'],
      dtype='object')
2016-08-18 15:20:00
2016-08-18 16:25:00


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=5, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=3, nthread=None, num_estimators=100, objective='reg:linear',
             random_state=0, reg_alpha=0.05, reg_lambda=0, scale_pos_weight=1,
             seed=None, silent=None, subsample=1, verbosity=1)

In [7]:
def errors(model, X_train, y_train, X_test, y_test):
    #predict once per set and reuse the result for all metrics
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    #MAE: mean absolute error
    #MAPE: mean absolute percentage error; divides by the true values
    #SMAPE (variant): summed absolute error relative to the summed true and predicted values
    train_mae = (sum(abs(y_train - pred_train))/len(y_train))
    train_mape = (sum(abs((y_train - pred_train)/y_train)))*(100/len(y_train))
    train_smape = sum(abs(y_train - pred_train))/sum(y_train + pred_train)

    test_mae = (sum(abs(y_test - pred_test))/len(y_test))
    test_mape = (sum(abs((y_test - pred_test)/y_test)))*(100/len(y_test))
    test_smape = sum(abs(y_test - pred_test))/sum(y_test + pred_test)

    print(f'train_MAE: {train_mae}')
    print(f'test_MAE: {test_mae}')
    
    print(f'train_MAPE: {train_mape}')
    print(f'test_MAPE: {test_mape}')
    
    print(f'train_SMAPE: {train_smape}')
    print(f'test_SMAPE: {test_smape}')

__Now we have successfully trained the model. However, we have not evaluated the model yet. Let's do that with our last notebook in mind.__

### __Exercise 1:__

Write a function which takes our train data (X_train, y_train), our validation data (X_test, y_test), and our trained model as input and which returns the MAE, MAPE, and SMAPE of the train and test data. Use your function to assess the errors of the train set and of the validation set. What does the MAPE show, and why? How do you interpret the train and validation errors?

### __Your solution 1:__

In [9]:
def errors(model, X_train, y_train, X_test, y_test):
    
    #your code here
    
    print(f'train_MAE: {train_mae}')
    print(f'test_MAE: {test_mae}')
    print(f'train_MAPE: {train_mape}')
    print(f'test_MAPE: {test_mape}')
    print(f'train_SMAPE: {train_smape}')
    print(f'test_SMAPE: {test_smape}')

In [33]:
errors(model, X_train, y_train, X_validation, y_validation)

train_MAE: 0.24546015365607068
test_MAE: 0.2581517514093108
train_MAPE: inf
test_MAPE: inf
train_SMAPE: 0.1742231695142011
test_SMAPE: 0.18861056659382508
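
The behaviour of the three metrics can be explored on a handful of made-up numbers. The sketch below uses plain Python and the same SMAPE variant as the `errors` function above. One difference to keep in mind: pandas and NumPy return `inf` when a non-zero number is divided by zero, whereas plain Python raises `ZeroDivisionError`, so the toy `mape` function handles a zero true value explicitly. All values are hypothetical.

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    total = 0.0
    for a, f in zip(actual, forecast):
        if a == 0:
            # Simplification: any zero true value is treated as an infinite
            # percentage error (the 0/0 case is not distinguished here).
            return math.inf
        total += abs((a - f) / a)
    return 100 * total / len(actual)


def smape_variant(actual, forecast):
    """Summed absolute error divided by the summed true and forecast values."""
    return (sum(abs(a - f) for a, f in zip(actual, forecast))
            / sum(a + f for a, f in zip(actual, forecast)))


actual = [0.0, 0.5, 1.0, 1.5]    # made-up true values, one of them zero
forecast = [0.1, 0.4, 1.1, 1.5]  # made-up forecasts

print("MAE:  ", mae(actual, forecast))
print("MAPE: ", mape(actual, forecast))
print("SMAPE:", smape_variant(actual, forecast))
```

Removing the zero from `actual` and rerunning the cell is a quick way to check how a single value can dominate one metric while leaving the other two almost unchanged.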


### __Exercise 2:__

Illustrate a comparison of the validation set (y_validation) and the forecasted values. Illustrate a period of i) 48 h and of ii) 4 h.

### __Your solution 2:__

### __Exercise 3:__

Perform the xgboost training again using the entire dataframe including the lag features. Save the resulting model. Check the error metrics and visualise the results as above. What do you see? How do you interpret it?

### __Your solution 3:__

### __Exercise 4:__

Take our test set and perform all data processing (cleaning, feature engineering) as we did with our training set. Use the saved model to make predictions on the test set.

### __Your solution 4:__

### __Feature importances__