# Time Series Forecasting

inspired by:
- Jason Brownlee's Deep Learning for Time Series Forecasting Book, 
- Udacity Course Intro to TensorFlow for Deep Learning Lesson 8: Time Series Forecasting
- Siraj Raval


(written by Nicolaj C. Stache and Andreas F. Schneider)

## Where do we find time series data in the real world?

There are many examples, just a few listed here:

Weather Forecasts           |  Stock Prices
:-------------------------:|:-------------------------:
<img src="Bilder/weather.JPG" alt="Stock" style="height: 250px;"/>  |  <img src="Bilder/stock.JPG" alt="Stock" style="height: 250px;"/>

Historical Trends           |  Electrical Demand
:-------------------------:|:-------------------------:
<img src="Bilder/moore.JPG" alt="Stock" style="height: 250px;"/>  |  <img src="Bilder/elec_demand.JPG" alt="Stock" style="height: 250px;"/>

CO2 Concentration vs. Temperature         |
:-------------------------:|
<img src="Bilder/co2.JPG" alt="Stock" style="height: 400px;"/> 




[Image Source](https://www.udacity.com/course/intro-to-tensorflow-for-deep-learning--ud187)

## What is a time series exactly?
-  It's an ordered sequence of values usually equally spaced over time like "every year", "every month", "every second", and so on.

- If you observe just one value per time step, then it is a univariate time series 
(e.g. temperature)
<img src="Bilder/weather.JPG" alt="Stock" style="height: 250px;"/>

- If you observe more than one value per time step, it is a multivariate time series (e.g. brainwaves, GPS coordinates or a stereo audio signal)
<img src="Bilder/GPS.jpg" alt="GPS" width="500"/>
<img src="Bilder/stereo.JPG" alt="Stereo" width="300"/>


### Possible Application Examples
- You could **forecast** stock prices to become rich, or the temperature, or how many products you should produce to meet the demand in the future.
- You could **understand the underlying process** that generated the time series e.g. study brainwaves to better understand sleep cycles.
- Time series analysis can also be used to **detect anomalies**, e.g. monitoring the traffic of an e-mail server to spot abnormal activity such as an attack.

##### IN THIS NOTEBOOK WE WILL FOCUS ON FORECASTING

# Common Patterns

### Trend
Values that gradually drift up or down, also known as upward or downward trend.

> **For Example:**
- CO2 concentration in the last decade -> **Upward trend**
- drinking water reserves -> **Downward trend**
- meat consumption worldwide-> **Upward trend**

### Seasonality
Patterns that repeat at predictable intervals, with recurring peaks and troughs. 

> **For Example:**
- Temperature drops every winter in comparison to the summer
- The same happens every night in comparison to the day.
- Rush Hour
- Sales before Christmas
- ...


### Noise
Noise is completely unpredictable. Almost all data coming from sensors contains noise, often with a Gaussian distribution. Since the noise itself cannot be forecast, we usually aim to forecast the underlying value, i.e. the mean around which the noise fluctuates.


Trend         |  Seasonality        |  Noise
:-------------------------:|:-------------------------:|:-------------------------:
<img src="Bilder/trend.JPG" alt="Stock" style="height: 200px;"/>  |  <img src="Bilder/seasonality.JPG" alt="Stock" style="height: 200px;"/> | <img src="Bilder/noise.JPG" alt="Stock" style="height: 200px;"/> 

Trend + Seasonality + Noise |
:-------------------------: |
In many scenarios we will find all three of them together
<img src="Bilder/trend-seasonality-noise.JPG" alt="Stock" style="height: 350px;"/>

### Data partitioning for time series data

<h1>Fixed Partitioning</h1>       |  <h1>Roll-Forward Partitioning</h1> 
:-------------------------:|:-------------------------:
<img src="Bilder/fixed-part.JPG" alt="Stock" style="height: 250px;" align="left"/>|  <img src="Bilder/roll-forward-part.JPG" alt="Stock" style="height: 250px;"/> 

##### Fixed Partitioning

As shown above, we split the data into a training, a validation and a test period. Since the data obviously has seasonality, each period should contain a whole number of seasons, e.g. 1 year instead of 1.5 years; otherwise some months would be represented more than others. We train the model on the training period and evaluate it on the validation period. If we are happy with the performance, we retrain the model on the training and validation periods together. After that we evaluate it on the test period in order to check the performance as close to reality as possible. Lastly, contrary to common machine learning practice, we use the test set to train one last time before using the model in production. This is important because the most recent data points of a data set are normally the most relevant for the future.

##### Roll-Forward Partitioning

As shown above, we move the green window to the right for each training iteration, so with each iteration we look further into the future with less data from the past. We treat the validation period as the "future" and fit the model on the training period to predict it. It is also common to keep all previous training data and simply extend the green window to the right into the validation period.

**Advantage 1:** We get the forecasts and errors for multiple horizons, such as one-step-ahead, two-step-ahead, and so on. In the end, we know how well and how far we can predict into the future. 

**Advantage 2:** We receive more detailed information about when the model predicts well. E.g. the prediction could always be worse in winter than in summer. This can be useful information.
 
**Advantage 3:** When we deploy a model, the inference is most of the time roll-forward anyway. As soon as we get a new datapoint, we take a few old points and the newly arrived datapoint and try to predict the new future.

[Source1: Galit Shmueli](https://youtu.be/3k7qk__5tDY)

[Source2: Udacity](https://www.udacity.com/course/intro-to-tensorflow-for-deep-learning--ud187)
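The roll-forward idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the expanding-window variant (all earlier data is kept for training), and the helper name `roll_forward_splits` and its parameters are hypothetical, not part of any library.

```python
import numpy as np

def roll_forward_splits(n_samples, initial_train_size, horizon):
    """Yield (train_idx, valid_idx) pairs with an expanding training window.

    Hypothetical helper for illustration only.
    """
    train_end = initial_train_size
    while train_end + horizon <= n_samples:
        train_idx = np.arange(0, train_end)                    # everything seen so far
        valid_idx = np.arange(train_end, train_end + horizon)  # the "future" block
        yield train_idx, valid_idx
        train_end += horizon                                   # roll the split point forward

splits = list(roll_forward_splits(n_samples=10, initial_train_size=4, horizon=2))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "validate:", valid_idx)
```

Each pair can be used to fit a model on `train_idx` and score it on `valid_idx`, which gives one error value per position in time.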

# Classical Forecasting Methods

### Naive Forecasting - The simplest approach.

Naive forecasts are the most cost-effective forecasting model, and provide a benchmark against which more sophisticated models can be compared. This forecasting method is only suitable for time series data. Using the naive approach, forecasts are produced that are equal to the last observed value. This method works quite well for economic and financial time series, which often have patterns that are difficult to reliably and accurately predict. If the time series is believed to have seasonality, the seasonal naïve approach may be more appropriate where the forecasts are equal to the value from last season. In time series notation: 

$\hat{y}_{t+1} = y_t$ 

[Source: Wikipedia](https://en.wikipedia.org/wiki/Forecasting#Na%C3%AFve_approach)
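As a small worked example, the naive and the seasonal naive forecast can be compared on a toy series. The values and the season length of 4 are made up for illustration.

```python
import numpy as np

# Toy series with a season length of 4 (values are made up for illustration).
y = np.array([10., 20., 30., 20., 12., 22., 32., 22.])
season_length = 4

# Naive forecast: the prediction for step t+1 is the observation at step t.
naive = y[:-1]                       # forecasts for y[1:]

# Seasonal naive forecast: the prediction is the value one season earlier.
seasonal_naive = y[:-season_length]  # forecasts for y[season_length:]

mae_naive = np.abs(naive - y[1:]).mean()
mae_seasonal = np.abs(seasonal_naive - y[season_length:]).mean()
print("MAE naive:", mae_naive)
print("MAE seasonal naive:", mae_seasonal)  # 2.0 for this toy series
```

On this strongly seasonal toy series the seasonal naive forecast has the lower error, because the plain naive forecast is always one step behind the seasonal swings.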

## Other classic approaches:
- Simple Average (SA)
- Autoregression (AR)
- Moving Average (MA)
- Autoregressive Moving Average (ARMA)
- Autoregressive Integrated Moving Average (ARIMA)
- Seasonal Autoregressive Integrated Moving-Average (SARIMA)
- Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
- Vector Autoregression (VAR)
- Vector Autoregression Moving-Average (VARMA)
- Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
- Simple Exponential Smoothing (SES)
- Holt Winter’s Exponential Smoothing (HWES)

[Source: Siraj Raval](https://youtu.be/d4Sn6ny_5LI)

[Source: MC.ai](https://mc.ai/11-classical-time-series-forecasting-methods-in-python-cheat-sheet/)

# Metrics for evaluation of the model

`errors = forecast - actual`


**Mean Squared Error (MSE)**

Use this, if you want to penalize large errors.
> `mse = np.square(errors).mean()`

**Mean Absolute Error (MAE)** (also called Mean Absolute Deviation (MAD)) 

Use this if the penalty should be proportional to the size of the error.
> `mae = np.abs(errors).mean()`


**Mean Absolute Percentage Error (MAPE)**

Mean ratio between the absolute error and the absolute actual value --> this gives an idea of the size of the errors compared to the values.
> `mape = np.abs(errors / actual).mean()`
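The three metrics can be checked by hand on a tiny example. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up values for illustration.
actual = np.array([100., 200., 400.])
forecast = np.array([110., 190., 440.])
errors = forecast - actual             # [10, -10, 40]

mse = np.square(errors).mean()         # (100 + 100 + 1600) / 3 = 600
mae = np.abs(errors).mean()            # (10 + 10 + 40) / 3 = 20
mape = np.abs(errors / actual).mean()  # (0.10 + 0.05 + 0.10) / 3, about 0.083

print("MSE:", mse, "MAE:", mae, "MAPE:", mape)
```

The single large error of 40 dominates the MSE, while in the MAPE it counts no more than the error of 10 on the smallest value.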


## Example

The next code cell generates a typical time series with trend, seasonality and noise with a fixed time interval.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

keras = tf.keras

def plot_series(time, series, format="-", start=0, end=None, label=None):
    plt.plot(time[start:end], series[start:end], format, label=label)
    plt.xlabel("Time")
    plt.ylabel("Value")
    if label:
        plt.legend(fontsize=14)
    plt.grid(True)
    
def trend(time, slope=0):
    return slope * time

def seasonal_pattern(season_time):
    """Just an arbitrary pattern, you can change it if you wish"""
    return np.where(season_time < 0.4,
                    np.cos(season_time * 2 * np.pi),
                    1 / np.exp(3 * season_time))

def seasonality(time, period, amplitude=1, phase=0):
    """Repeats the same pattern at each period"""
    season_time = ((time + phase) % period) / period
    return amplitude * seasonal_pattern(season_time)

def white_noise(time, noise_level=1, seed=None):
    rnd = np.random.RandomState(seed)
    return rnd.randn(len(time)) * noise_level

time = np.arange(4 * 365 + 1)

slope = 0.05
baseline = 10
amplitude = 40
series = baseline + trend(time, slope) + seasonality(time, period=365, amplitude=amplitude)

noise_level = 5
noise = white_noise(time, noise_level, seed=42)

series += noise

plt.figure(figsize=(10, 6))
plot_series(time, series)
plt.show()

### Naive Forecasting
Now we try the simplest forecasting method to get a baseline.

In [None]:
split_time = 1000
time_train = time[:split_time]
x_train = series[:split_time]
time_valid = time[split_time:]
x_valid = series[split_time:]

naive_forecast = series[split_time - 1:-1]

plt.figure(figsize=(10, 6))
plot_series(time_valid, x_valid, start=0, end=150, label="Series")
plot_series(time_valid, naive_forecast, start=1, end=151, label="Forecast")

Let's see how it performs with the MAE metric:

In [None]:
keras.metrics.mean_absolute_error(x_valid, naive_forecast).numpy()

### Moving Average
A little bit more sophisticated, but still classical without any deep learning.

In [None]:
def moving_average_forecast(series, window_size):
  """Forecasts the mean of the last few values.
     If window_size=1, then this is equivalent to naive forecast.
     This implementation uses a cumulative sum, which is much faster
     than averaging each window in a loop."""
  mov = np.cumsum(series)
  mov[window_size:] = mov[window_size:] - mov[:-window_size]
  return mov[window_size - 1:-1] / window_size

window_size = 30
moving_avg = moving_average_forecast(series, window_size)[split_time - window_size:]

plt.figure(figsize=(10, 6))
plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, moving_avg, label="Moving average {} days".format(window_size))

In [None]:
keras.metrics.mean_absolute_error(x_valid, moving_avg).numpy()

As you can see, the moving average performs much worse than the naive forecast, because it does not anticipate trend or seasonality. Through a method called **differencing** we can remove both. We just subtract the value one season (year) earlier, at time t-365, from the value at time t. 

In [None]:
diff_series = (series[365:] - series[:-365])
diff_time = time[365:]

plt.figure(figsize=(10, 6))
plot_series(diff_time, diff_series, label="Series(t) – Series(t–365)")
plt.show()
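To see why differencing removes both trend and seasonality, consider a small noise-free toy series with a period of 4. The values are made up for illustration.

```python
import numpy as np

period = 4
t = np.arange(12)
# Linear trend plus a pattern that repeats every `period` steps (no noise).
series_toy = 0.5 * t + np.tile([0., 3., 1., -2.], 3)

# Differencing: subtract the value one period earlier.
diff = series_toy[period:] - series_toy[:-period]
print(diff)  # the seasonal pattern cancels, only slope * period = 2.0 remains

# Adding the past values back restores the original series.
restored = diff + series_toy[:-period]
print(np.allclose(restored, series_toy[period:]))
```

The seasonal pattern cancels exactly because it repeats every period, and the linear trend collapses to a constant. With real data the noise does not cancel, which is why the differenced series above still looks noisy.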

In [None]:
window_size = 30
diff_moving_avg = moving_average_forecast(diff_series, window_size)[split_time - 365 - window_size:]

plt.figure(figsize=(10, 6))
plot_series(time_valid, diff_series[split_time - 365:], label="Series(t) – Series(t–365)")
plot_series(time_valid, diff_moving_avg, label="Moving Average of Diff")
plt.show()

Now we need to add back the value at time t-365 in order to get back the original series.

In [None]:
diff_moving_avg_plus_past = series[split_time - 365:-365] + diff_moving_avg

plt.figure(figsize=(10, 6))
plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, diff_moving_avg_plus_past, label="Forecasts")
plt.show()

In [None]:
keras.metrics.mean_absolute_error(x_valid, diff_moving_avg_plus_past).numpy()

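Not bad: this is a little better than naive forecasting. The problem is that adding back the raw past values also adds back their noise. To reduce this noise, we apply a moving average to the past values as well. The slice `series[split_time - 370:-359]` combined with a window of 11 averages the values from t-370 to t-360, i.e. a window centered on t-365.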

In [None]:
diff_moving_avg_plus_smooth_past = moving_average_forecast(series[split_time - 370:-359], 11) + diff_moving_avg

plt.figure(figsize=(10, 6))
plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, diff_moving_avg_plus_smooth_past, label="Forecasts")
plt.show()

In [None]:
keras.metrics.mean_absolute_error(x_valid, diff_moving_avg_plus_smooth_past).numpy()

Great improvement!!

# Switching to Machine Learning Models
So far so good. From now on we will focus on more recent technologies like MLPs, RNNs and CNNs.

## Difference between MLPs, RNNs and CNNs

### MLP
MLP stands for Multilayer Perceptron and can consist of multiple layers. Within the context of this notebook we only use fully connected layers (dense layers). The easiest possible MLP model is one hidden Dense layer and one output Dense layer.

More information about MLP: https://en.wikipedia.org/wiki/Multilayer_perceptron

### RNN
<h1>RNN Architecture</h1>       |  <h1>RNN Layer</h1> 
:-------------------------:|:-------------------------:
<img src="Bilder/rnn-architecture.JPG" alt="Stock" style="height: 250px;" align="left"/>|  <img src="Bilder/rnn.JPG" alt="Stock" style="height: 250px;"/> 

An RNN is a neural network which contains recurrent layers. A recurrent layer is a layer that can sequentially process a sequence of inputs. RNNs can take all sorts of sequences as inputs such as time series data or sentences. 
In the left picture you can see a simple RNN architecture with two recurrent layers and one dense layer, which serves as the output layer. You feed the network with a batch of sequences such as the windows from a time series and it outputs a batch of forecasts. The input shape is 3-dimensional: the first dimension represents the batch size, the second dimension represents the time steps and the third dimension represents the dimensionality of the inputs at each time step (e.g. for univariate time series it would be 1).

A recurrent layer is composed of a single memory cell which is used repeatedly to compute the outputs. A memory cell is basically a small neural network. It can be a simple dense vanilla RNN cell or a complex memory cell such as LSTM or GRU (more information https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45)

It's important to understand that a recurrent layer only contains a single cell. In the right image you can see multiple cells but it's actually the same cell that is reused multiple times by the layer. At each time step the memory cell takes the value of the input sequence at that time step, starting with X0, X1... and then it produces the output for the current time step, starting with Y0, Y1...
The memory cell itself also produces another output at each time step called a state vector, or sometimes a context vector, starting with H0, H1...
As you can see this state vector is fed as an additional input to the memory cell at the next time step. This is actually why this is called a recurrent layer. Part of the output of the memory cell at one time step is fed back to itself at the next time step. The beauty of this architecture is that it can handle sequences of any length using a constant number of parameters.
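The reuse of a single cell can be sketched with NumPy. This is a minimal vanilla RNN cell under simplifying assumptions: random weights, a `tanh` activation, and an output that equals the state at each step (Y_t = H_t). The function name `simple_rnn_forward` is hypothetical.

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b):
    """Run one vanilla RNN cell over a sequence x of shape (time_steps, n_features)."""
    h = np.zeros(W_h.shape[0])            # initial state vector
    outputs = []
    for x_t in x:                         # the same weights are reused at every time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        outputs.append(h)                 # in this simple cell the output is the state
    return np.stack(outputs), h

rng = np.random.default_rng(0)
n_features, n_units = 1, 4
W_x = rng.normal(size=(n_features, n_units))
W_h = rng.normal(size=(n_units, n_units))
b = np.zeros(n_units)

short_out, _ = simple_rnn_forward(rng.normal(size=(5, n_features)), W_x, W_h, b)
long_out, _ = simple_rnn_forward(rng.normal(size=(50, n_features)), W_x, W_h, b)
n_params = W_x.size + W_h.size + b.size   # 4 + 16 + 4 = 24, independent of sequence length
print(short_out.shape, long_out.shape, n_params)
```

Both the 5-step and the 50-step sequence are processed with the same 24 parameters; only the number of loop iterations changes.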


### CNN
Convolutional Neural Networks are very successful in image processing tasks such as image classification. Recently they have been used more and more for handling sequences.
In image processing we use 2D convolutional layers, whereas for sequences we use 1D convolutional layers. They work exactly the same way, except that the window slides over only one dimension, typically the time axis, instead of two dimensions such as the height and width of an image.
In the image below you can see a typical 1D-Convolutional-Layer with a filter (kernel) of size three. You can see how it is computed and activated by a ReLU function in the depicted equation.

An RNN layer has some memory as explained above, whereas a CNN has no memory at all. Each output takes only a small window into consideration (here 3 time steps).
This seems like a serious limitation, but once you stack multiple 1D convolutional layers, the top layers end up indirectly seeing a large part of the input sequence. This means convnets are also able to capture very long-term patterns.
<img src="Bilder/convnet.JPG" alt="Stock" style="height: 350px;" align="left"/>
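A 1D convolution with a kernel of size three followed by ReLU can be written out in plain NumPy. This is a sketch with one filter, made-up weights, no padding and a stride of 1; the function name `conv1d_relu` is hypothetical.

```python
import numpy as np

def conv1d_relu(x, kernel, bias=0.0):
    """Slide the kernel over x (no padding, stride 1) and apply ReLU."""
    k = len(kernel)
    out = np.array([np.dot(x[i:i + k], kernel) + bias
                    for i in range(len(x) - k + 1)])
    return np.maximum(out, 0.0)

x = np.array([1., 2., 3., 4., 5., 6., 7.])
kernel = np.array([1., 0., -1.])   # made-up weights

layer1 = conv1d_relu(x, kernel, bias=3.0)       # each output sees 3 input steps
layer2 = conv1d_relu(layer1, kernel, bias=3.0)  # each output sees 5 input steps
print(layer1)  # 5 values
print(layer2)  # 3 values
```

Each value of `layer2` depends on three values of `layer1`, which together cover five consecutive input steps. This is how stacking layers widens the part of the sequence that one output can see.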

# Imports
We are using TensorFlow 2.x

In [None]:
import numpy as np
from numpy import array
from numpy import hstack
import matplotlib.pyplot as plt


import tensorflow as tf

from tensorflow import keras

from tensorflow.keras import Sequential

from tensorflow.keras.models import Model


from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten

from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import MaxPooling1D

from tensorflow.keras.layers import concatenate
print("Tensorflow Version {}".format(tf.__version__))

### Convenience Functions
In this notebook we will create different kinds of models. The method *plot_graph* will produce a nice looking graph of the model.

Additionally we use the *plot_loss* function to reduce boilerplate code. We only need to hand over the history from the training process and receive a nice looking plot.

In [None]:
def plot_graph(model, pngName):    
    return tf.keras.utils.plot_model(
                model,
                to_file=pngName,
                show_shapes=True,
                show_layer_names=True,
                rankdir='LR',
                expand_nested=True,
                dpi=96
    )
def plot_loss(history):
    plt.plot(history.history['loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Loss'], loc='upper left')
    plt.show()

# Univariate Data Example 

A major part of working with time series data is data preparation. 

Time series data requires preparation before it can be used to train a supervised learning model,
such as an LSTM neural network. For example, a univariate time series is represented as a
vector of observations:

##### [10, 20, 30, 40, 50, 60, 70, 80, 90]

A supervised learning algorithm requires that data is provided as a collection of samples,
where each sample has an input component (X) and an output component (y).

For a univariate time series problem where we are interested in one-step
predictions, the observations at prior time steps, so-called lag observations, are used as input
and the output is the observation at the current time step. For example, the above 9-step
univariate series can be expressed as a supervised learning problem with three time steps for
input and one step as output, as follows:

| X     | y  |
|------ |------|
|[10,20,30]|[40]|
|[20,30,40]|[50]|
|[30,40,50]|[60]|
|[...]|[...]|

### The 3 dimensions of the input
* **Samples**. One sequence is one sample. A batch is comprised of one or more samples.
* **Time Steps**. One time step is one point of observation in the sample. One sample is comprised of multiple time steps.
* **Features**. One feature is one observation at a time step. One time step is comprised of one or more features.
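For a univariate series, the three dimensions can be made explicit with a simple reshape. The windows below are taken from the table above.

```python
import numpy as np

# Three windows of three time steps each: shape (samples, time_steps)
windows = np.array([[10, 20, 30],
                    [20, 30, 40],
                    [30, 40, 50]])

n_features = 1  # univariate: one observation per time step
X_3d = windows.reshape((windows.shape[0], windows.shape[1], n_features))
print(X_3d.shape)  # (3, 3, 1) -> (samples, time_steps, features)
```

The values are unchanged; only a trailing feature axis of length 1 is added.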

### Data Preparation
The following function will split a given univariate sequence into multiple samples where each sample has a specified number of time steps and the output is a single time step.

In [None]:
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

Let's try out what we will get when we use this *split_sequence* function with our time series data.

In [None]:
# define univariate time series
series = array([10, 20, 30, 40, 50, 60, 70, 80, 90])
print("------Before----------")
print("Original Data:\n",series, "\nShape:", series.shape)
print()
print("------After----------")
# Define the number of time steps
n_steps = 3
# Generate  data
X, y = split_sequence(series, n_steps)
print("Transformed Data:")
for i in range(len(X)):
    print(X[i], y[i])
print("Shape:",X.shape, y.shape)

### Model - MLP

In [None]:
model = Sequential()
model.add(Dense(200, activation='relu', input_dim=n_steps))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.summary()
plot_graph(model, 'single-headed-MLP-model.png')

##### Start Training

In [None]:
history = model.fit(X, y, epochs=500, verbose=0)
plot_loss(history)

###### Prediction of the future

In [None]:
x_input = array([70.0, 80.0, 90.0])
x_input = x_input.reshape((1,n_steps))
yhat = model.predict(x_input, verbose=0)
print("Desired:\t", 100.0, "\nPredicted:\t", yhat)

You can add another *dense layer* and see how much this affects the result.

# Multivariate Data Example

Now we have two variables as an Input and still one output variable (Single-Step Forecasting). 

### Data Preparation

**Input X1** - [10, 20, 30, 40,  50,    60,   70,   80,   90]

**Input X2** - [15, 25, 35, 45,  55,    65,   75,   85,   95]

**Output Y** - [25, 45, 65, 85, 105, 125, 145, 165, 185]

**-------------------**

After data preparation, the data should look like this:

| X     | y  |
|------ |------|
|[ [10, 15], [20, 25], [30, 35] ]|[65]|
|[ [20, 25], [30, 35], [40, 45] ]|[85]|
|[ [30, 35], [40, 45], [50, 55] ]|[105]|
|[...]|[...]|

As before, we need a helper function. The updated *split_sequences* function looks as follows:

In [None]:
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        #check if we are beyond the dataset
        if end_ix > len(sequences):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix,0:2], sequences[end_ix-1, -1]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

Let's generate the data and convert them into training samples. 
The output shows *Stacked Data*, which is the input of the *split_sequences* function. *Training Data* is the output of that function.

In [None]:
# Input
in_seq1 = array([10, 20, 30, 40, 50, 60, 70, 80, 90])
in_seq2 = array([15, 25, 35, 45, 55, 65, 75, 85, 95])
# Output
out_seq = array([in_seq1[i]+in_seq2[i] for i in range(len(in_seq1))])
# convert to [rows, columns] structure
in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
out_seq = out_seq.reshape((len(out_seq), 1))
# horizontally stack columns
dataset = hstack((in_seq1, in_seq2, out_seq))
print("Stacked Data:\n\n", dataset)
print("Shape: ",  dataset.shape)
print("-------------")
print("\nTraining Data:\n")
# Number of time steps
n_steps = 3
# convert into input/output
X, y = split_sequences(dataset, n_steps)
# the dataset knows the number of features, e.g. 2
n_features = X.shape[2]
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
    print("-------------")

### Model - Single-Headed CNN
This time we replace the MLP model with a simple CNN model. 

A one-dimensional CNN is a CNN model that has a convolutional hidden layer that operates over a 1D sequence. This is followed by a pooling layer whose job it is to distill the output of the convolutional layer to the most important elements. This is then followed by a dense layer that interprets the features extracted by the convolutional part of the model. A flatten layer is used between the convolutional layers and the dense layer to reduce the feature maps to a single one-dimensional vector. This network is implemented as follows:

In [None]:
model = Sequential()
model.add(Conv1D(64, 2, activation='relu', input_shape=(n_steps, n_features)))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.summary()
plot_graph(model, 'single-headed-model.png')

##### Start Training and plot history

In [None]:
history=model.fit(X, y, epochs=500, verbose=0)
plot_loss(history)

###### Prediction of the future

In [None]:
x_input = array([[80.0, 85.0], [90.0, 95.0], [100.0, 105.0]])
x_input = x_input.reshape((1, n_steps, n_features))
yhat = model.predict(x_input, verbose=0)
print("Desired:\t", 205.0, "\nPredicted:\t", yhat)

### Multi-headed CNN Model
Each input series can be handled by
a separate CNN and the output of each of these submodels can be combined before a prediction
is made for the output sequence.

It may offer more flexibility or better performance depending on the specifics of the problem that is being modeled. For example, it allows you to configure each submodel differently for each input series, such as the number of filter maps and the kernel size. This type of model can be defined using the Tensorflow/Keras Functional API.
Now, n_features is 1 instead of 2 as in the previous example.

In [None]:
n_steps = 3
n_features = 1
X1 = X[:, :, 0].reshape(X.shape[0], X.shape[1], n_features)
X2 = X[:, :, 1].reshape(X.shape[0], X.shape[1], n_features)
# first input model
visible1 = Input(shape=(n_steps, n_features))
cnn1 = Conv1D(64, 2, activation='relu')(visible1)
cnn1 = MaxPooling1D()(cnn1)
cnn1 = Flatten()(cnn1)
# second input model
visible2 = Input(shape=(n_steps, n_features))
cnn2 = Conv1D(64, 2, activation='relu')(visible2)
cnn2 = MaxPooling1D()(cnn2)
cnn2 = Flatten()(cnn2)

# merge input models
merge = concatenate([cnn1, cnn2])
dense = Dense(50, activation='relu')(merge)
output = Dense(1)(dense)

model = Model(inputs=[visible1, visible2], outputs=output)
model.compile(optimizer='adam', loss='mse')

tf.keras.utils.plot_model(model)

##### Start Training

In [None]:
# fit model
model.fit([X1, X2], y, epochs=500, verbose=0)
# demonstrate prediction
x_input = array([[80.0, 85.0], [90.0, 95.0], [100.0, 105.0]])
x1 = x_input[:, 0].reshape((1, n_steps, n_features))
x2 = x_input[:, 1].reshape((1, n_steps, n_features))
yhat = model.predict([x1, x2], verbose=0)
print("Desired:\t", 205.0, "\nPredicted:\t", yhat[0][0])

### Let's try to predict the sine function
We start by creating training data. Between $x=-7$ and $x=7$ we place $71$ points, which gives a spacing of $0.2$ (just for convenience). The resulting red dots trace a sine wave and are the training data available for our NN.

In [None]:
np.set_printoptions(2,suppress=True)
in_array = np.linspace(-7, 7, 71) 
out_array = np.sin(in_array) 
print(in_array)
# red for numpy.sin() 
plt.plot(in_array, out_array, color = 'red', marker = "o") 
plt.title("numpy.sin()") 
plt.xlabel("X") 
plt.ylabel("Y") 
plt.show() 

The function for splitting the data is similar to the previous examples, but now each sample can have several output steps:

In [None]:
def split_sequence(sequence, n_steps_in, n_steps_out):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        # check if we are beyond the sequence
        if out_end_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix:out_end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

### Hyperparameters and Model
We use two stacked LSTM layers with 200 units each. A single LSTM layer would do the job as well. 
In this example we use the last 10 input values and try to predict the next 20 sine samples.

In [None]:
# choose a number of time steps
n_steps_in, n_steps_out = 10,20
# split into samples
X, y = split_sequence(out_array, n_steps_in, n_steps_out)
n_features = 1
X = X.reshape((X.shape[0], X.shape[1], n_features))
print("There are {} training samples available".format(X.shape[0]))
# define model
model = Sequential()
model.add(LSTM(200, activation='relu', return_sequences=True, input_shape=(n_steps_in, n_features)))
model.add(LSTM(200, activation='relu'))
model.add(Dense(n_steps_out))
model.compile(optimizer='adam', loss='mse')

###### Start Training

The live preview shows after each epoch how well the model predicts.

In [None]:
%matplotlib notebook

fig = plt.figure()
ax = fig.add_subplot(111)
plt.ion()

fig.show()
fig.canvas.draw()


for i in range(0,75):
    model.fit(X, y, epochs=1, verbose=0)
    # demonstrate prediction
    x_input = array(out_array[-n_steps_in:])
    x_input = x_input.reshape((1, n_steps_in, n_features))
    yhat = model.predict(x_input, verbose=0)

    ax.clear()
    ax.set(xlim=(-7.5, 12), ylim=(-1.2, 1.2))
    ax.plot(in_array, out_array, color = 'red', marker = "o")
    ax.plot(np.linspace(7.2, 7.2+0.2*n_steps_out, num=n_steps_out, endpoint=False),yhat[0], color = 'blue', marker = "o")
    fig.canvas.draw()
%matplotlib inline

# Fibonacci Sequence Example
In this example we will try to learn the Fibonacci sequence and predict new Fibonacci numbers.

In [None]:
fib = lambda n: n if n < 2 else fib(n-1) + fib(n-2)
np.set_printoptions(precision=0, suppress=True)
out_array = []
print("Fibonacci Numbers")
for i in range(1, 31):
    nr=fib(i)
    out_array = np.append(out_array,nr)
    print(i,nr, end="\n")

In [None]:
def split_sequence(sequence, n_steps_in, n_steps_out):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        # check if we are beyond the sequence
        if out_end_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix:out_end_ix]
        X.append(seq_x)
        y.append(seq_y)
        print("No. ", i,"- Input: ",X[i],"Output: ",y[i])
    return array(X), array(y)


## Hyperparameters
**n_steps_in** -> How many data points the NN will get in order to do a prediction.

**n_steps_out** -> How many data points the NN should predict.

**n_train_amount** -> How many data points are available for training? We produced 30 data points, as you can see in the first cell of this section.

**n_features** -> How many features do we have for the prediction? In this case it is just one: the Fibonacci numbers.

In [None]:
n_steps_in, n_steps_out, n_train_amount, n_features = 2, 1, 30, 1

##### Data preparation - Training data
To always get complete training samples with 2 input values and 1 output value, windows that would run past the end of the data are dropped. With 30 data points this gives 30 - 2 - 1 + 1 = 28 training samples.

In [None]:
X, y = split_sequence(out_array[:n_train_amount], n_steps_in, n_steps_out)
X = X.reshape((X.shape[0], X.shape[1], n_features))
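The number of samples produced by the sliding-window split can be computed up front. The helper `n_windows` below is a hypothetical name used only for this check; it counts the windows that fit completely into the sequence.

```python
def n_windows(n_points, n_steps_in, n_steps_out):
    """Number of complete (input, output) windows in a sequence of n_points values."""
    return max(n_points - n_steps_in - n_steps_out + 1, 0)

print(n_windows(30, 2, 1))    # Fibonacci setup in this section -> 28
print(n_windows(71, 10, 20))  # sine setup with 71 points -> 42
print(n_windows(9, 3, 1))     # 9-step univariate example -> 6
```

This should match `X.shape[0]` after calling the corresponding split function.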

# Let's test 4 different Models and find out which one is best.

Choose model 1, 2, 3 or 4, execute its cell, then scroll down and execute the "Start Training" block. Compare the results and decide which one performs best.

#### 1. Model - LSTM - Plain and simple

In [None]:
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps_in, n_features)))
model.add(Dense(n_steps_out))
model.compile(optimizer='adam', loss='mse')
model.summary()

#### 2. Model - Stacked LSTM  - A little bit more sophisticated

We can stack multiple LSTM Layers on top of each other.
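By default an LSTM layer returns a 2-dimensional output of shape (batch, units), i.e. only the output of the last time step, but the next LSTM layer needs a 3-dimensional input of shape (batch, time steps, features).
This is addressed by setting the Boolean *return_sequences* flag on the first layer: it determines whether the layer returns only the last output of the output sequence or the full sequence.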

Do you want to learn more about the Input and Output Shapes? Read here:

https://medium.com/@shivajbd/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e

In [None]:
model = Sequential()
model.add(LSTM(50, activation='relu', return_sequences=True, input_shape=(n_steps_in, n_features)))
model.add(LSTM(50, activation='relu'))
model.add(Dense(n_steps_out))
model.compile(optimizer='adam', loss='mse')

#### 3. Model - Bidirectional LSTM
A bidirectional LSTM processes the input sequence in both directions, from past to future and from future to past, and combines the two results. Within the input window, each time step can therefore be interpreted with context from both earlier and later values, which often helps. The implementation is as easy as wrapping your LSTM layer with **Bidirectional()**.
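
The idea of reading the sequence in both directions can be sketched in NumPy. Again this is a toy tanh recurrence and not an LSTM. The helper `run_rnn` and all weights are made up for illustration. The two results are concatenated, which is the default merge mode of the Keras `Bidirectional` wrapper.

```python
import numpy as np

def run_rnn(inputs, w_in, w_rec):
    """Toy tanh recurrence; returns the final state for each sample."""
    state = np.zeros((inputs.shape[0], w_rec.shape[0]))
    for t in range(inputs.shape[1]):
        state = np.tanh(inputs[:, t, :] @ w_in + state @ w_rec)
    return state

rng = np.random.default_rng(0)
units = 3
x_demo = rng.normal(size=(4, 2, 1))   # (samples, time steps, features)

# Each direction has its own set of weights
fw_in, fw_rec = rng.normal(size=(1, units)), rng.normal(size=(units, units))
bw_in, bw_rec = rng.normal(size=(1, units)), rng.normal(size=(units, units))

forward_state = run_rnn(x_demo, fw_in, fw_rec)                # past -> future
backward_state = run_rnn(x_demo[:, ::-1, :], bw_in, bw_rec)   # future -> past
merged = np.concatenate([forward_state, backward_state], axis=-1)
print(merged.shape)   # (4, 6)
```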

In [None]:
model = Sequential()
model.add(Bidirectional(LSTM(50, activation='relu', input_shape=(n_steps_in, n_features))))
model.add(Dense(n_steps_out))
model.compile(optimizer='adam', loss='mse')

#### 4. Model - CNN
It is also possible to use a Convolutional Neural Network to predict the future.
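
For an input window of 2 time steps and 1 feature, the shapes that flow through a convolution with kernel size 1, a max pooling with pool size 2 and a flatten step can be traced by hand. The following NumPy sketch uses random, untrained weights and only mimics the arithmetic of these three steps. It is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(4, 2, 1))      # (samples, time steps, features)

# Kernel size 1: every time step is mapped on its own from 1 feature to 64 filters.
kernel = rng.normal(size=(1, 64))        # (features, filters)
bias = np.zeros(64)
conv_out = np.maximum(x_demo @ kernel + bias, 0.0)   # ReLU, shape (4, 2, 64)

# Max pooling with pool size 2 keeps the larger of the two time steps per filter.
pooled = conv_out.max(axis=1, keepdims=True)         # shape (4, 1, 64)

# Flatten everything except the sample axis before the dense layers.
flat = pooled.reshape(pooled.shape[0], -1)           # shape (4, 64)
print(conv_out.shape, pooled.shape, flat.shape)
```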

In [None]:
model = Sequential()
model.add(Conv1D(64, 1, activation='relu', input_shape=(n_steps_in, n_features)))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(200, activation='relu'))
model.add(Dense(n_steps_out))
model.compile(optimizer='adam', loss='mse')

##### Start Training
Choose one of the models above and start training here:

In [None]:
history = model.fit(X, y, epochs=200, verbose=0, batch_size=16)
plot_loss(history)

In [None]:
np.set_printoptions(suppress=True, precision=0)
# Pick a random start index so that input and ground truth stay inside the generated data
random_nr = np.random.randint(1, n_train_amount + 1 - n_steps_out - n_steps_in - 1)
#random_nr=6
print("Random ID ", random_nr)
print("Input:\t\t", out_array[random_nr:random_nr+n_steps_in])
print("Ground Truth:\t", out_array[random_nr+n_steps_in:random_nr+n_steps_in+n_steps_out])
# Bring the input into the (samples, time steps, features) shape the model expects
x_input = array(out_array[random_nr:random_nr+n_steps_in])
x_input = x_input.reshape((1, n_steps_in, n_features))
yhat = model.predict(x_input, verbose=0)
print("Prediction:\t", yhat[0])
# Mean squared error between ground truth and prediction
mse = tf.keras.losses.MSE(out_array[random_nr+n_steps_in:random_nr+n_steps_in+n_steps_out], yhat[0])
print("MSE: {0:.2f}".format(mse.numpy()))

## Further/deeper information about the concepts of Time Series Forecasting

In this notebook we discuss data preparation based on simple time series data. Before we start we need to talk about different concepts regarding time series data. 

**What are the inputs and outputs for a forecast?**
> **Inputs** - Historical data provided to the model in order to make a single forecast.<br/>
  **Outputs** - Prediction or forecast for a future time step beyond the data provided as input.
  
> **Description** - Defining the inputs and outputs of the model forces you to think about what exactly is or may be required to make a forecast. You may not be able to be specific when it comes to input data. For example, you may not know whether one or multiple prior time steps are required to make a forecast.

**What are the endogenous and exogenous variables?**

> **Endogenous** - Input variables that are influenced by other variables in the system and on which the output variable depends.<br/>
  **Exogenous** - Input variables that are not influenced by other variables in the system and on which the output variable depends.
  
> **Description** - Typically, a time series forecasting problem has endogenous variables (e.g. the output is a function of some number of prior time steps) and may or may not have exogenous variables. Often, exogenous variables are ignored given the strong focus on the time series. Explicitly thinking about both variable types may help to identify easily overlooked exogenous data or even engineered features that may improve the model.
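
As a small illustration, the following sketch combines lagged values of the target (endogenous) with an outside driver (exogenous) into one input matrix. The demand and temperature numbers are made up, and the variable names are assumptions for this example only.

```python
import numpy as np

# Made-up illustration data: daily demand (endogenous) and temperature (exogenous)
demand = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])
temperature = np.array([20.0, 22.0, 21.0, 23.0, 25.0, 24.0])

n_lags = 2
rows, targets = [], []
for t in range(n_lags, len(demand)):
    lagged_demand = demand[t - n_lags:t]   # endogenous: prior values of the target
    exogenous = temperature[t]             # exogenous: outside driver at time t
    rows.append(np.append(lagged_demand, exogenous))
    targets.append(demand[t])

X_mixed = np.array(rows)
y_mixed = np.array(targets)
print(X_mixed.shape, y_mixed.shape)   # (4, 3) (4,)
```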

**Are you working on a regression or classification predictive modeling problem?**

> **Regression** - Forecast a numerical quantity. E.g. a price, a count, a volume<br/>
  **Classification** - Classify as one of two or more labels. E.g. car, bike, up, down, hot, cold
  
> **Description** - There is some flexibility between these types. For example, a regression problem can be reframed as classification and a classification problem can be reframed as regression. Some problems, like predicting an ordinal value, can be framed as either classification or regression. It is possible that a reframing of your time series forecasting problem may simplify it.
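
A minimal sketch of such a reframing, using made-up prices: the same series yields a numeric regression target and an up/down classification target.

```python
import numpy as np

prices = np.array([100.0, 101.5, 101.0, 103.2, 102.8])   # made-up values

# Regression target: the next value itself
y_regression = prices[1:]

# Classification target: did the series go up compared with the previous step?
y_classification = np.where(np.diff(prices) > 0, "up", "down")
print(y_regression)       # [101.5 101.  103.2 102.8]
print(y_classification)   # ['up' 'down' 'up' 'down']
```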


**Are the time series variables unstructured or structured?**

> **Unstructured** - No obvious systematic time-dependent pattern in a time series variable.<br/>
  **Structured** - Systematic time-dependent patterns in a time series variable (e.g. trend and/or seasonality).
  
> **Description** - It is useful to plot each variable in a time series and inspect the plot looking for possible patterns. A time series for a single variable may not have any obvious pattern. We can think of a series with no pattern as unstructured, as in there is no discernible time-dependent structure. Alternatively, a time series may have obvious patterns, such as a trend or seasonal cycles. We call such a series structured. We can often simplify the modeling process by identifying and removing the obvious structures from the data, such as an increasing trend or repeating cycle. Some classical methods even allow you to specify parameters to handle these systematic structures directly.
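
One common way to remove such structure is differencing. The sketch below builds a synthetic series from a linear trend plus a cycle of length 4 (both invented for this example) and shows that subtracting the value one full cycle earlier leaves only a constant.

```python
import numpy as np

t = np.arange(24)
trend = 0.5 * t                                   # increasing linear trend
season = np.tile([0.0, 2.0, 0.0, -2.0], 6)        # cycle that repeats every 4 steps
series = trend + season

# First-order differencing turns the linear trend into a constant step of 0.5
trend_diff = np.diff(trend)

# Differencing at the seasonal lag cancels the cycle; the trend contributes
# the constant 0.5 * 4 = 2.0
period = 4
seasonal_diff = series[period:] - series[:-period]
print(seasonal_diff[:5])   # [2. 2. 2. 2. 2.]
```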


**Are you working on a univariate or multivariate time series problem?**

> **Univariate** - One variable measured over time.<br/>
  **Multivariate** - Multiple variables measured over time.<br/>
  
> **Description** - Considering this question with regard to inputs and outputs may add a further distinction. The number of variables may differ between the inputs and outputs, i.e. the data may not be symmetrical. For example, you may have multiple variables as input to the model and only be interested in predicting one of the variables as output. In this case, there is an assumption in the model that the multiple input variables aid and are required in predicting the single output variable.

> **Univariate and Multivariate Inputs** - One or multiple input variables measured over time.<br/>
  **Univariate and Multivariate Outputs** - One or multiple output variables to be predicted.
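
The asymmetric case (multivariate input, univariate output) can be sketched as follows. The data is an arbitrary made-up table with 3 variables per time step, and `n_in` and `target_column` are names chosen for this example.

```python
import numpy as np

# Made-up data: 6 time steps (rows), 3 variables per step (columns)
data = np.arange(18, dtype=float).reshape(6, 3)
n_in = 2
target_column = 0

X_multi, y_single = [], []
for start in range(len(data) - n_in):
    X_multi.append(data[start:start + n_in, :])             # all variables as input
    y_single.append(data[start + n_in, target_column])      # one variable as output

X_multi, y_single = np.array(X_multi), np.array(y_single)
print(X_multi.shape, y_single.shape)   # (4, 2, 3) (4,)
```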


**Do you require a single-step or a multi-step forecast?**

> **Single-step** - Forecast the next time step.<br/>
  **Multi-step** - Forecast more than one future time step.
  
> **Description** - A forecast problem that requires a prediction of the next time step is called a one-step forecast model, whereas a forecast problem that requires a prediction of more than one time step is called a multi-step forecast model. The more time steps to be projected into the future, the more challenging the problem, given the compounding nature of the uncertainty on each forecasted time step.
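
One way to obtain a multi-step forecast from a one-step model is to feed each prediction back in as input (a recursive strategy). The sketch below assumes a hypothetical `one_step_model` that simply adds the last two values. It stands in for a trained network and shows why errors can compound: every later step is built on earlier predictions.

```python
def one_step_model(window):
    """Hypothetical stand-in for a trained one-step model: sums the last two values."""
    return window[-2] + window[-1]

def recursive_forecast(history, n_out):
    """Produce a multi-step forecast by feeding each prediction back as input."""
    window = list(history)
    forecast = []
    for _ in range(n_out):
        next_value = one_step_model(window)
        forecast.append(next_value)
        window.append(next_value)   # any error in next_value carries into later steps
    return forecast

print(recursive_forecast([5, 8], n_out=3))   # [13, 21, 34]
```

The alternative is to let the model output all future steps at once, for example with an output layer of size `n_steps_out`.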


**Do you require a static or a dynamically updated model?**

> **Static** - A forecast model is fit once and used to make predictions.<br/>
  **Dynamic** - A forecast model is fit on newly available data prior to each prediction.
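
The difference can be sketched with a deliberately simple toy model that only estimates the average step between observations. The series is made up and changes its slope halfway through, and `fit_mean_step` is a hypothetical helper for this example.

```python
import numpy as np

series = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # made-up values
n_train = 4

def fit_mean_step(history):
    """Toy 'model': the average step between consecutive observations."""
    return np.mean(np.diff(history))

# Static: fit once on the training part, then reuse the fitted parameter
static_step = fit_mean_step(series[:n_train])
static_preds = [series[t - 1] + static_step for t in range(n_train, len(series))]

# Dynamic: refit on all data available before each prediction
dynamic_preds = []
for t in range(n_train, len(series)):
    step = fit_mean_step(series[:t])
    dynamic_preds.append(series[t - 1] + step)

print(static_preds)
print(dynamic_preds)
```

In this example the dynamic variant adapts to the steeper slope, while the static one keeps using the step size learned from the first four values.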

**Are your observations contiguous or discontiguous?**

> **Contiguous** - Observations are made uniformly over time, e.g. each hour, day, month.<br/>
  **Discontiguous** - Observations are not uniform over time, e.g. because of missing or corrupt values.
  
> **Description** - In the case of non-uniform observations, specific data formatting may be required when fitting some models to make the observations uniform over time, e.g. by interpolation.
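
A minimal sketch of filling gaps by linear interpolation with `np.interp`, using made-up hourly readings where `NaN` marks a missing value:

```python
import numpy as np

hours = np.arange(6)
values = np.array([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])   # NaN = missing reading

known = ~np.isnan(values)
filled = values.copy()
# Estimate each missing value on the straight line between its known neighbours
filled[~known] = np.interp(hours[~known], hours[known], values[known])
print(filled)   # [10. 12. 14. 16. 18. 20.]
```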