<img src="header.png" align="left"/>

# Exercise: Import of Excel data and time series prediction (10 points)

The goal of this exercise is to show how to work with time series from Excel files and to develop a prediction model for time series. The data set describes the development of an airline's passenger numbers in the distant past. Nevertheless, the data is typical of what is found in SMEs (e.g. sales figures).


The code for this example was adapted from [1]. Further information on classical time series prediction is available here [2]. 

- [1] [https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/)
- [2] [https://www.kdnuggets.com/2020/01/predict-electricity-consumption-time-series-analysis.html](https://www.kdnuggets.com/2020/01/predict-electricity-consumption-time-series-analysis.html)


Citation dataset:
```
Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (1976) Time Series Analysis, Forecasting and Control. Third Edition. Holden-Day. Series G.
```

**NOTE**

Document your results by adding a markdown cell or a Python cell (as a comment) and writing your statements into this cell. For some tasks the result cell is already available.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ditomax/mlexercises/blob/master/05%20Exercise%20Import%20of%20data%20from%20excel%20and%20time%20series%20prediction.ipynb)

# Import of modules

In [None]:
#
# Prepare colab
#
import os

COLAB=False
try:
    %tensorflow_version 2.x
    print("running on google colab")
    COLAB=True
    os.makedirs('data/caltech101',exist_ok=True)    
    os.makedirs('results',exist_ok=True)    
except:
    print("not running on google colab")


#
# Turn off errors and warnings (does not work sometimes)
#
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
simplefilter(action='ignore', category=Warning)
simplefilter(action='ignore', category=RuntimeWarning)



import os
import logging
import openpyxl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

import sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM

#
# GPU Support
#
tflogger = tf.get_logger()
tflogger.setLevel(logging.ERROR)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR )
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
    print('using GPU support')


#
# Sizes of plots
#
plt.rcParams['figure.figsize'] = [16, 9]


#
# Versions
#
print('working on keras version {} on tensorflow {} using sklearn {} numpy {}'.format ( tf.keras.__version__, tf.version.VERSION, sklearn.__version__, np.__version__ ) )

# Constants

In [None]:
excelData = 'data/airline_passengers.xlsx'

# Read the data from Excel file

In [None]:
if COLAB:
    base_url = 'https://raw.githubusercontent.com/ditomax/mlexercises/master/'
    df = pd.read_excel(base_url + excelData, engine = 'openpyxl')
else:
    df = pd.read_excel(excelData, engine = 'openpyxl')

In [None]:
#
# inspect the imported table
#
df.info()
df.head(10)

In [None]:
df.tail()

# Cut out data from pandas dataframes

Tutorial: [https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)

<img src="info.png" align="left"/> 

<div class="alert alert-block alert-info">

## Task
    
Cut out the relevant data from the dataframe using the `iloc` indexer described in the tutorial above, 
so that only the timestamp and the passenger number remain, without NaNs (2 points)

</div>
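As a rough illustration of position-based selection with `iloc`, the sketch below uses a small made-up dataframe. The column names, values and offsets are assumptions and do not match the Excel file. It drops a leading junk row and a trailing NaN row and keeps only two columns.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical layout, not the airline file)
toy = pd.DataFrame({
    "note":  ["header junk", "a", "b", "c", None],
    "month": [None, "2000-01", "2000-02", "2000-03", None],
    "count": [np.nan, 10.0, 20.0, 30.0, np.nan],
})

# iloc[rows, columns] selects by integer position:
# rows 1..3 (the end index 4 is exclusive), columns 1 and 2
toy_cut = toy.iloc[1:4, 1:3]
print(toy_cut)
print(toy_cut.isna().sum().sum())  # number of remaining NaNs
```

The row and column positions for the real file have to be read off the output of `df.head(10)` and `df.tail()`.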

In [None]:
df_cut = ...

In [None]:
df_cut.head()

In [None]:
# rename columns to fixed names
df_cut.columns = ['month','passengers']

In [None]:
df_cut.head()


<div class="alert alert-block alert-info">

## Task
    
Plot the passenger data from the pandas dataframe in a 2D plot (1 point)

</div>

...

# Conventional methods

There are many older methods to work with time series:

- https://machinelearningmastery.com/time-series-trends-in-python/
- https://www.kdnuggets.com/2020/01/predict-electricity-consumption-time-series-analysis.html


The main purpose of these methods is the calculation and use of parameters such as **trend** and **seasonality**. We hope that our models can handle them without us having to explicitly deal with them.
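To make the two terms concrete, here is a minimal sketch on synthetic monthly data (the series, its slope and its amplitude are made up for illustration). It estimates the trend with a centered 12-month rolling mean and the seasonality as the average deviation from that trend per calendar month. This is only one simple classical decomposition and not necessarily the method used in the linked articles.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + yearly sine pattern (assumed values)
months = np.arange(48)
series = pd.Series(100.0 + 2.0 * months + 10.0 * np.sin(2 * np.pi * months / 12))

# Trend: a rolling mean over one full year smooths out the seasonal pattern
trend = series.rolling(window=12, center=True).mean()

# Seasonality: mean deviation from the trend for each calendar month (0..11)
detrended = series - trend
seasonality = detrended.groupby(months % 12).mean()

print(trend.dropna().head())
print(seasonality)
```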

<img src="info.png" align="left"/> 

# Prepare data

In [None]:
x_data = df_cut['passengers'].values
x_data = x_data.astype('float32')
x_data = np.reshape(x_data,(-1,1))
print(x_data)

<div class="alert alert-block alert-info">

## Task
    
Scale the dataset. Create a `MinMaxScaler` from sklearn that scales the data 
to the range 0 to 1 and store it in a Python variable called `scaler`; it is fitted in the next cell (2 points)

</div>
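For orientation, min-max scaling maps each value to (x - min) / (max - min). The sketch below spells this out with plain NumPy on made-up numbers. It is not the sklearn call the task asks for, only the arithmetic that such a scaler performs for the range 0 to 1.

```python
import numpy as np

# Made-up values in the same (n, 1) column layout as x_data
values = np.array([[100.0], [150.0], [300.0], [600.0]], dtype="float32")

v_min, v_max = values.min(axis=0), values.max(axis=0)

# Forward transform: the smallest value becomes 0, the largest becomes 1
scaled = (values - v_min) / (v_max - v_min)

# Inverse transform: recover the original units from the scaled values
restored = scaled * (v_max - v_min) + v_min

print(scaled.ravel())
print(restored.ravel())
```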

In [None]:
scaler = ...
scaler.fit(x_data)

In [None]:
print(scaler.data_max_)

In [None]:
#
# actually scale the data
#
x_data = scaler.transform(x_data)
print(x_data)

In [None]:
#
# split into train and test sets with 80% training data
#
train_size = int(len(x_data) * 0.80)
test_size = len(x_data) - train_size
train, test = x_data[0:train_size,:], x_data[train_size:len(x_data),:]
print(len(train), len(test))

In [None]:
#
# plot scaled training data for check
#
plt.plot(train)

# Create a training data set with sliding windows

Sliding windows are created from a time series by dragging a window over the entire time series and copying the data in the window. More details here [https://towardsdatascience.com/ml-approaches-for-time-series-4d44722e48fe](https://towardsdatascience.com/ml-approaches-for-time-series-4d44722e48fe).
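A tiny worked example may help to picture this. The sketch below drags a window of length 2 over a made-up sequence and pairs each window with the value that directly follows it. The helper name `toy_windows` is hypothetical; it is written independently of the function used later in this notebook and may differ from it in detail.

```python
import numpy as np

def toy_windows(sequence, window_length):
    """Return (windows, next_values) for a 1D sequence."""
    windows, next_values = [], []
    # the last usable start index leaves room for one following value
    for start in range(len(sequence) - window_length):
        windows.append(sequence[start:start + window_length])
        next_values.append(sequence[start + window_length])
    return np.array(windows), np.array(next_values)

sequence = np.array([10, 20, 30, 40, 50])
X, y = toy_windows(sequence, window_length=2)
for window, label in zip(X, y):
    print(window, "->", label)
```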

<img src="info.png" align="left"/> 

In [None]:
# some hyperparameters
epochs = 100
batch_size = 2
window_length = 1

<div class="alert alert-block alert-info">

## Task
    
Create sliding windows with one label with the function `createSlidingWindowsWithLabel`.
Add comments for the details of this function (1 point)

</div>

In [None]:
#
# Create sliding window data sets.
#
def createSlidingWindowsWithLabel(dataset, window_length=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-window_length-1):
        a = dataset[ i:(i+window_length), 0]
        dataX.append(a)
        dataY.append(dataset[i + window_length, 0])
    return np.array(dataX), np.array(dataY)

In [None]:
#
# apply sliding window function with window size (window_length)
#
trainX, trainY = createSlidingWindowsWithLabel(train, window_length)
testX, testY = createSlidingWindowsWithLabel(test, window_length)

In [None]:
# 
# reshape input to be [samples, time steps, features]
#
# one feature per time step; this also works for window_length > 1
trainX = np.reshape(trainX, (trainX.shape[0], window_length, 1))
testX = np.reshape(testX, (testX.shape[0], window_length, 1))
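
To see what the reshape does, consider this small sketch with made-up numbers: three windows of two time steps each get a trailing feature axis of size 1, which is the [samples, time steps, features] layout for a series with a single feature.

```python
import numpy as np

windows = np.array([[10, 20], [20, 30], [30, 40]])  # shape (samples, time steps)

# add a trailing axis for the single feature
windows_3d = np.reshape(windows, (windows.shape[0], windows.shape[1], 1))
print(windows.shape, "->", windows_3d.shape)
```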

# Create an LSTM model

In [None]:
# some hyperparameters
epochs = 100
batch_size = 4

In [None]:
#
# define the LSTM network (it is fitted further below)
#
def createLSTMModel():
    model = Sequential()
    model.add(LSTM(6, input_shape=(window_length,1)))
    model.add(Dense(1,activation='linear'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

In [None]:
lstm_model = createLSTMModel()

In [None]:
lstm_model.summary()

In [None]:
history = lstm_model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, verbose=1, validation_data=(testX, testY), shuffle=False)

In [None]:
# plot history
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

In [None]:
#
# Estimate values for train and test data
#
trainPredict = lstm_model.predict(trainX)
testPredict = lstm_model.predict(testX)

In [None]:
#
# Inverse transformation of estimations (scaler)
#
trainPredicti = scaler.inverse_transform(trainPredict)
trainPredicti = np.reshape(trainPredicti, (-1,))

testPredicti = scaler.inverse_transform(testPredict)
testPredicti = np.reshape(testPredicti, (-1,))


trainYi = scaler.inverse_transform([trainY])
trainYi = np.reshape(trainYi, (-1,))

testYi = scaler.inverse_transform([testY])
testYi = np.reshape(testYi, (-1,))

In [None]:
#
# Plot estimations
#
plt.plot(testYi[0:])
plt.plot(testPredicti[0:], linestyle='dashed')
plt.show()

<div class="alert alert-block alert-info">

## Task
    
Calculate the root mean squared error between the test labels and the prediction (1 point)

Hint: with the 1D arrays created above, labels are in testYi[0:-1] and predictions are in testPredicti[1:]

</div>
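As a reminder of the metric itself, RMSE is the square root of the mean of the squared differences between labels and predictions. The sketch below computes it by hand with NumPy on made-up numbers. It leaves open which slices of the notebook's arrays to compare, since that is part of the task.

```python
import numpy as np

labels = np.array([100.0, 110.0, 120.0, 130.0])      # made-up values
predictions = np.array([98.0, 113.0, 119.0, 134.0])  # made-up values

# squared difference per sample, then mean, then square root
squared_errors = (labels - predictions) ** 2
rmse = float(np.sqrt(squared_errors.mean()))

print("RMSE:", rmse)
```

Because of the square root, the result is in the same unit as the labels, which is why the scores in this notebook can be read as passengers.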

In [None]:
#
# calculate root mean squared error
#

trainScore = ...
print('train loss: {:.3f} RMSE (passengers)'.format(trainScore))

testScore = ...
print('test loss: {:.3f} RMSE (passengers)'.format(testScore))


# Plot complete timeline

In [None]:
# shift train predictions for plotting (first prediction belongs to index window_length)
trainPredictPlot = np.full(len(x_data), np.nan)
trainPredictPlot[window_length:len(trainPredicti)+window_length] = trainPredicti

# shift test predictions for plotting (first test prediction belongs to index train_size + window_length)
test_start = train_size + window_length
testPredictPlot = np.full(len(x_data), np.nan)
testPredictPlot[test_start:test_start+len(testPredicti)] = testPredicti

# plot baseline (converted back to passengers) and predictions
plt.plot(scaler.inverse_transform(x_data), color='grey', linestyle='dashed')
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.axvline(x=window_length, linestyle='dotted')
plt.axvline(x=len(trainPredicti)+window_length, linestyle='dotted')
plt.axvline(x=test_start, linestyle='dotted')
plt.show()

# Conclusion

It seems that the model only learns to use the last feature value as the prediction value.
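
One way to check such a suspicion is to compare the model with a naive persistence forecast that simply repeats the previous value. The sketch below builds that baseline on a made-up series (the numbers are not taken from the airline data). If a trained model's RMSE is close to the persistence RMSE on the same data, it has probably learned little beyond copying its last input.

```python
import numpy as np

series = np.array([112.0, 120.0, 131.0, 127.0, 140.0, 152.0])  # made-up values

# persistence forecast: the prediction for step t is the observed value at t-1
targets = series[1:]
persistence = series[:-1]

rmse_persistence = float(np.sqrt(np.mean((targets - persistence) ** 2)))
print("persistence RMSE:", rmse_persistence)
```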

# Increase the prediction quality

<div class="alert alert-block alert-info">

## Task
    
We want to have a better prediction quality. Conduct several experiments with different values for **window_length**, the **capacity of the model** and the **batch size** of the training. Compare the resulting RMSE values. 
Hopefully, this increases the quality of the prediction. (3 Points)

**Hint**: consider wrapping the code below in a function ...
</div>
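
One possible way to organise such experiments is sketched below. The function `run_grid` and the stand-in `dummy_score` are hypothetical names introduced only for this illustration. In the notebook, the scoring callable would build the windows, train a model and return the test RMSE. The dummy deliberately returns a meaningless number so that the loop can run without training anything.

```python
from itertools import product

def run_grid(score_fn, window_lengths, batch_sizes, model_sizes):
    """Call score_fn for every parameter combination and collect the results."""
    results = []
    for window_length, batch_size, model_size in product(window_lengths, batch_sizes, model_sizes):
        score = score_fn(window_length, batch_size, model_size)
        results.append({"test_score": score, "window_length": window_length,
                        "batch_size": batch_size, "model_size": model_size})
    # best (lowest) score first
    return sorted(results, key=lambda row: row["test_score"])

def dummy_score(window_length, batch_size, model_size):
    # placeholder only: NOT a real RMSE, replace with training + evaluation
    return float(window_length + batch_size + model_size)

results = run_grid(dummy_score, window_lengths=[1, 3, 12], batch_sizes=[4, 12], model_sizes=[6, 20])
for row in results:
    print(row)
```

Each returned row carries the same four fields as the results table below, so the rows can be copied into it directly.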

## Results

Please document your results here:

| Test score | window_length | batch size | model size |
|----------|:---------------:|-----------:|-----------:|
| 54.432   |  1            |  12         | 20          |


In [None]:
#
# Hyperparameters
#
window_length = 1
batch_size = 12
model_size = 20
epochs = 300
dropout = 0.001
learning_rate = 0.0005

In [None]:
#
# apply sliding window function with window size (window_length)
#
trainX, trainY = createSlidingWindowsWithLabel(train, window_length)
testX, testY = createSlidingWindowsWithLabel(test, window_length)
fullX, fullY = createSlidingWindowsWithLabel(x_data, window_length)  # x_data is the full scaled series

In [None]:
# 
# reshape input to be [samples, time steps, features]
#
trainX = np.reshape(trainX, (-1, window_length, 1))
testX = np.reshape(testX, (-1, window_length, 1))
fullX = np.reshape(fullX, (-1, window_length, 1))

In [None]:
#
# define the LSTM network (it is fitted further below)
#
def createLSTMModel2():
    model = Sequential()
    model.add(LSTM(model_size, input_shape= (window_length, 1), stateful=False, dropout=dropout ))
    model.add(Dense(model_size, activation='relu'))
    model.add(Dense(1, activation='linear'))
    opt = tf.keras.optimizers.Adam( learning_rate=learning_rate )
    model.compile(loss='mean_squared_error', optimizer=opt )
    return model

In [None]:
lstm_model2 = createLSTMModel2()

In [None]:
history = lstm_model2.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, verbose=0, validation_data=(testX, testY), shuffle=False)

In [None]:
#
# Estimate values for train and test data
#
trainPredict = lstm_model2.predict(trainX)
testPredict = lstm_model2.predict(testX)

In [None]:
#
# Inverse transformation of estimations (scaler)
#
trainPredicti = scaler.inverse_transform(trainPredict)
testPredicti = scaler.inverse_transform(testPredict)

trainYi = scaler.inverse_transform([trainY])
testYi = scaler.inverse_transform([testY])

In [None]:
#
# calculate root mean squared error
#
trainScore = math.sqrt(mean_squared_error(trainYi[0,0:-1], trainPredicti[1:,0]))
print('train loss: %.3f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testYi[0,0:-1], testPredicti[1:,0]))
print('test loss: %.3f RMSE' % (testScore))