# __Training an XGBoost model__

XGBoost is not a classical time series model, but it is frequently used in the data science community for time series forecasting.

## __Data preparation__

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

import xgboost as xgb

import joblib

import warnings
warnings.simplefilter('ignore')

In [None]:
file_path = '../data/train_time_features.pkl'
df = pd.read_pickle(file_path)

In [None]:
df.isna().any()

In [None]:
#the lag features introduced NaN values into our dataset, which we drop before training the XGBoost model
df.dropna(inplace=True)

In [None]:
df.isna().any()
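
To see where these missing values come from, here is a minimal sketch on a toy series. It assumes the lag columns were built with `Series.shift`, which this notebook does not show; the column name `value` is purely illustrative.

```python
import pandas as pd

# Toy series; 'value' is a hypothetical column name.
toy = pd.DataFrame({'value': [10.0, 11.0, 12.0, 13.0, 14.0]})

# Assumption: lag features are created with shift(), so lag n leaves n NaN rows at the top.
for lag in (1, 2):
    toy[f'lag{lag}'] = toy['value'].shift(lag)

n_missing = toy.isna().sum()   # NaN count per column
toy_clean = toy.dropna()       # only rows with all lags available remain

print(n_missing)
print(toy_clean)
```

The number of rows lost equals the largest lag, which is why `dropna` removes only the first few rows of the dataset.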

### __Train / validation split__

In [None]:
def train_test_ts(df, relative_train, maximal_lag, horizon):
    '''
    Time series (ts) split: creates a train/test set while accounting for potential overlap between the two caused by lag features.
    Usage: X_train, y_train, X_test, y_test = train_test_ts(...)
    df=must contain the target column f"horizon{horizon}"; apart from "t CO2-e / MWh", which is dropped, all other columns are used as features
    relative_train=fraction of the total dataset used for training; must be between 0 and 1
    maximal_lag=the largest lag used in the lag feature engineering
    horizon=horizon number that selects the target column f"horizon{horizon}"
    '''
    k = int(df.shape[0] * relative_train)
    data_train = df.iloc[:k,:]
    #to avoid overlap between train and test data, maximal_lag rows are skipped between the two sets
    data_test = df.iloc[k+maximal_lag:,:]
    
    assert data_train.index.max() < data_test.index.min()
    
    #returns in the sequence X_train, y_train, X_test, y_test
    return (data_train.drop(columns=[f'horizon{horizon}','t CO2-e / MWh'], axis=1), data_train[f'horizon{horizon}'],
            data_test.drop(columns=[f'horizon{horizon}','t CO2-e / MWh'], axis=1), data_test[f'horizon{horizon}'])
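
The effect of the gap is easiest to see on a toy frame. The sketch below repeats only the index arithmetic of the function above; the 20-row frame, the integer index, and the values `0.8` and `3` are illustrative assumptions, not the settings of the real dataset.

```python
import pandas as pd

# Toy frame with an integer index standing in for the time axis.
toy = pd.DataFrame({'feature': range(20), 'horizon0': range(20)})

relative_train, maximal_lag = 0.8, 3      # illustrative values
k = int(toy.shape[0] * relative_train)    # number of training rows

train = toy.iloc[:k]                      # rows 0 .. k-1
test = toy.iloc[k + maximal_lag:]         # rows k .. k+maximal_lag-1 are skipped

print(train.index.max(), test.index.min())
```

With these values the training set ends at row 15 and the test set starts at row 19, so no lag feature of a test row reaches back into the training rows.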

### __Model training__

Initially, we train the model without the lag features. In an exercise, you will train it yourself on the entire feature set, i.e. including the lag features.

In [None]:
df1 = df.drop(columns=['lag1', 'lag2', 'lag3', 'lag4', 'lag5', 'lag6', 'lag7', 'lag8', 'lag9', 'lag10', 'lag11', 'lag12'])
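
As an alternative to listing every lag column by hand, the columns can be selected by name pattern. This is a sketch on an empty toy frame; it assumes all lag columns follow the `lag<number>` naming used above, and the feature names `hour` and `weekday` are hypothetical.

```python
import pandas as pd

# Empty toy frame; only the column names matter here.
toy = pd.DataFrame(columns=['hour', 'weekday', 'lag1', 'lag2', 'lag12', 'horizon0'])

# Select every column whose name is exactly 'lag' followed by digits.
lag_cols = toy.filter(regex=r'^lag\d+$').columns
toy_no_lags = toy.drop(columns=lag_cols)

print(list(toy_no_lags.columns))
```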

In [None]:
X_train, y_train, X_validation, y_validation = train_test_ts(
    df=df1,
    relative_train=0.8,
    maximal_lag=12,
    horizon=0)

print(df1.columns)

print(X_train.index.max())
print(X_validation.index.min())

assert X_train.index.max() < X_validation.index.min()

model = xgb.XGBRegressor(max_depth=5,
                         learning_rate=0.1,
                         n_estimators=100,
                         n_jobs=3,
                         reg_alpha=0.05,
                         reg_lambda=0,
                        )

model.fit(X_train, y_train)
#joblib.dump(model, '../model_all_features.pkl')

__Now we have successfully trained the model. However, we have not evaluated the model yet. Let's do that with our last notebook in mind.__

### __Exercise 1:__

Write a function that takes our train data (X_train, y_train), our validation data (X_test, y_test), and our trained model as input and returns the MAE, MAPE, and SMAPE of the train and validation data. Use your function to assess the errors on the train and validation sets. What does the MAPE show, and why? How do you interpret the train and validation errors?
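
As a reminder of the three metrics, here is a minimal sketch on toy arrays. The helper names are hypothetical, and the SMAPE shown is only one common convention (with the mean of the absolute values in the denominator); the definition used in the previous notebook may differ.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error, in the units of the target.
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Mean absolute percentage error; divides by y_true, so values near zero inflate it.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape(y_true, y_pred):
    # Symmetric MAPE (one common convention): error relative to the mean magnitude.
    return np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))) * 100

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

print(mae(y_true, y_pred), mape(y_true, y_pred), smape(y_true, y_pred))
```

In your own function, apply these to `model.predict(X_train)` and `model.predict(X_test)` respectively.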

### __Your solution 1:__

In [None]:
def errors(model, X_train, y_train, X_test, y_test):
    
    #your code here
    
    print(f'train_MAE: {train_mae}')
    print(f'test_MAE: {test_mae}')
    print(f'train_MAPE: {train_MAPE}')
    print(f'test_MAPE: {test_MAPE}')
    print(f'train_SMAPE: {train_SMAPE}')
    print(f'test_SMAPE: {test_SMAPE}')

In [None]:
errors(model, X_train, y_train, X_validation, y_validation)

### __Exercise 2:__

Plot the validation set (y_validation) against the forecasted values. Show a period of i) 48 h and of ii) 4 h.

### __Your solution 2:__

### __Exercise 3:__

Repeat the XGBoost training using the entire dataframe, including the lag features. Save the resulting model. Check the error metrics and visualise the results as above. What do you see? How do you interpret it?

### __Your solution 3:__

### __Exercise 4:__

Take our test set and perform all data processing (cleaning, feature engineering) as we did with our training set. Use the saved model to make predictions on the test set.

### __Your solution 4:__

### __Feature importances__

In [None]:
import os

def plot_feature_importances(rf, cols, model_dir):
    #collect importances and feature names in one frame, sorted ascending by importance
    importances = pd.DataFrame({'importances': rf.feature_importances_, 'features': cols})
    importances.sort_values('importances', inplace=True)
    f, a = plt.subplots()
    importances.plot(ax=a, kind='bar', x='features', y='importances')
    plt.gcf().subplots_adjust(bottom=0.3)
    f.savefig(os.path.join(model_dir, 'importances.png'))

In [None]:
#plt.style.use('ggplot')
#plot_feature_importances(model, X_train.columns.to_list(), '../data')