# Using LSTM to detect anomalies in time series data

We will train a neural network based on LSTM cells to predict future values in time series data (temperature and pressures on a cryostat). The network should predict well as long as the context matches the training set, and much less well when unexpected behaviours occur. By monitoring the prediction errors, we can tell whether we are in a "normal" context or an "abnormal" one.

## Importing modules

We will use [pandas](https://pandas.pydata.org/) to load the CSV data files, [numpy](https://numpy.org/) to deal with the data arrays and [matplotlib](https://matplotlib.org) for plotting data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

[scikit-learn](https://scikit-learn.org/stable/) will help us rescale the data.

In [None]:
import sklearn.preprocessing as preproc
import math

We use TensorFlow and Keras for machine learning.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import load_model

## Loading the data

The training and test data are available in CSV files. First we download the files :

In [None]:
!rm -f *.csv
!wget ftp://lyoftp.in2p3.fr/baulieu/nn_data/2020-02-23.csv
!wget ftp://lyoftp.in2p3.fr/baulieu/nn_data/event.csv
!wget ftp://lyoftp.in2p3.fr/baulieu/nn_data/event2.csv

Pandas has a function to load a CSV file into a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) object :

In [None]:
data = pd.read_csv("2020-02-23.csv")
print(data)

For each minute we have 4 values :
  * cernox_4k : temperature in Kelvin
  * K3 : pressure in millibar
  * K4 : pressure in millibar
  * K5 : pressure in millibar
  
The pressures are measured at different points on the injection system.
  
We don't need the timestamps, so we can drop the column :

In [None]:
data = data.drop(columns=['Date (UTC)'])
print(data)

Let's have a look at the data! We can plot each column, using the row index as the x-axis.

In [None]:
data.plot(use_index=True,y="cernox_4k",figsize=(20,3))
data.plot(use_index=True,y="K3",figsize=(20,3))
data.plot(use_index=True,y="K4",figsize=(20,3))
data.plot(use_index=True,y="K5",figsize=(20,3))

We can see our data are not in the same ranges : values for cernox_4k are in [4.04,4.09], while K5 is in [910,945]. This is a problem for the training as the goal is to minimize errors : the errors on K5, being numerically much larger, would overshadow those on cernox_4k.

We need to rescale all columns to the same range [0,1].

## Rescaling the data

First, we can extract the values as Numpy arrays :

In [None]:
values = data.values
columns = data.columns[:] #Names of the columns
print(columns)
print(values)

Now, let's rescale our data! We will use the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) object of the scikit-learn module to bring all values into the [0,1] range.

In [None]:
scaler = preproc.MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
print(scaled)
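
For reference, with `feature_range=(0, 1)` the scaler applies `(x - min) / (max - min)` to each column. The sketch below reproduces this with plain NumPy on a small made-up array (the numbers are illustrative and not read from the CSV file). It also shows that data scaled later with the same minimum and maximum can fall outside [0,1] when it leaves the fitted range.

```python
import numpy as np

# Made-up sample: two columns with very different ranges (illustrative values only)
sample = np.array([[4.04, 910.0],
                   [4.06, 930.0],
                   [4.09, 945.0]])

# Per-column minimum and maximum, playing the role of the fitted scaler
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)

def minmax_scale(x):
    """Per-column (x - min) / (max - min), using the min/max of `sample`."""
    return (x - col_min) / (col_max - col_min)

scaled_sample = minmax_scale(sample)
print(scaled_sample.min(axis=0), scaled_sample.max(axis=0))

# A new point outside the fitted range maps outside [0, 1]
new_point = minmax_scale(np.array([[4.20, 900.0]]))
print(new_point)
```

This is why the same `scaler` object must be reused (with `transform`, not `fit_transform`) on any data that is later compared with the training set.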

We can plot the modified data :

In [None]:
#plot each column
plt.figure(figsize=(20,10))
for group in range(len(columns)):
    plt.subplot(len(columns), 1, group+1)
    plt.plot(scaled[:, group])
    print('STDEV of ',columns[group],' : ',np.std(scaled[:, group]))
    print('MEAN of ',columns[group],' : ',np.mean(scaled[:, group]))
    plt.title(columns[group], y=0.8, loc='right')
plt.show()


## Formatting the training data

The idea of the neural network is to guess the values of the 16th minute from the values of the 15 previous minutes. Currently, the format of our data is:

```
cernox_4k(T0) K3(T0) K4(T0) K5(T0)
cernox_4k(T1) K3(T1) K4(T1) K5(T1)
cernox_4k(T2) K3(T2) K4(T2) K5(T2)
cernox_4k(T3) K3(T3) K4(T3) K5(T3)
...
```

First, we want to have 16 minutes of data on one line :
```
cernox_4k(T0) K3(T0) K4(T0) K5(T0) cernox_4k(T1) K3(T1) K4(T1) K5(T1) ... cernox_4k(T15) K3(T15) K4(T15) K5(T15)
cernox_4k(T1) K3(T1) K4(T1) K5(T1) cernox_4k(T2) K3(T2) K4(T2) K5(T2) ... cernox_4k(T16) K3(T16) K4(T16) K5(T16)
...
```

In [None]:
nb_features = 4 # cernox_4k, K3, K4, K5
nb_timesteps = 15 # number of minutes in input
# Allocation of the array for training data
training_data = np.zeros((len(scaled)-nb_timesteps,(nb_timesteps+1)*nb_features))
# Take 16 consecutive lines and reshape them in one, increment starting point by one minute, restart
for i in range(len(scaled)-nb_timesteps):
    training_data[i] = np.reshape(scaled[i:i+nb_timesteps+1],((nb_timesteps+1)*nb_features))
print(training_data.shape)
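
As an alternative to the explicit loop, the same windows can be built in one call. This is only a sketch, assuming NumPy 1.20 or later (which provides `sliding_window_view`); the helper name `make_windows` and the toy array are introduced here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, nb_timesteps):
    """Return one row per window of nb_timesteps + 1 consecutive records, flattened."""
    window = nb_timesteps + 1
    # shape: (nb_windows, nb_features, window); the window axis comes last
    views = sliding_window_view(series, window_shape=window, axis=0)
    # reorder to (nb_windows, window, nb_features), then flatten each window
    return views.transpose(0, 2, 1).reshape(len(views), -1)

# Toy data: 10 records of 4 features, windows of 2 inputs + 1 output
demo = np.arange(40, dtype=float).reshape(10, 4)
windows = make_windows(demo, nb_timesteps=2)
print(windows.shape)
```

On the toy data each row holds 3 consecutive records, in the same order as the loop version would produce.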

Each line is now composed of 16 minutes of consecutive data. We will split the lines into a training set and a testing set. To avoid any bias, we will first shuffle all lines. We will also separate input from output : the first 15x4 columns will be the input, the last 1x4 columns will be the output (guess the 16th minute from the previous 15 minutes).

In [None]:
#Shuffle all lines
rng = np.random.default_rng()
rng.shuffle(training_data)
# split into input and outputs (15 minutes of data in input, the 16th minute in output)
# keeping last 200 lines as test data
train_x, train_y = training_data[:-200, :-nb_features], training_data[:-200, -nb_features:]
test_x, test_y = training_data[-200:, :-nb_features], training_data[-200:, -nb_features:]
print(train_x.shape)
print(train_y.shape)
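
A small sketch of what the shuffle does, on a toy array: `rng.shuffle` permutes whole rows in place, so each window stays intact. The fixed seed and the toy sizes are assumptions added here; a seed is optional but makes the split identical from one run to the next.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed: an assumption, for reproducibility

# Toy array: 6 rows, row i is [i, i, i, i] so that rows are easy to recognise
rows = np.repeat(np.arange(6.0)[:, None], 4, axis=1)
rng.shuffle(rows)  # in place, permutes rows (axis 0) only

nb_test = 2  # hypothetical test-set size for the toy example
train, test = rows[:-nb_test], rows[-nb_test:]
print(train.shape, test.shape)
```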

An LSTM layer needs a 3D input : 

  * Nb records
  * Nb timesteps
  * Nb features

So we need to reshape the inputs to give the number of timesteps.

In [None]:
# reshape input to be 3D [samples, timesteps, features] -> [nb_samples, nb_timesteps minutes, nb_features features]
train_x = train_x.reshape((train_x.shape[0], nb_timesteps, nb_features))
test_x = test_x.reshape((test_x.shape[0], nb_timesteps, nb_features))
print(train_x.shape)
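
To see why a plain `reshape` is enough, here is a toy record with made-up values: because each flat line stores the features of T0, then T1, and so on, the reshape puts timesteps on the second axis and features on the third.

```python
import numpy as np

nb_timesteps, nb_features = 3, 2  # small toy sizes, not the notebook's 15 and 4

# One flat record: [f0(T0), f1(T0), f0(T1), f1(T1), f0(T2), f1(T2)]
flat = np.array([[10., 20., 11., 21., 12., 22.]])
cube = flat.reshape((flat.shape[0], nb_timesteps, nb_features))

print(cube.shape)
print(cube[0, 1])     # both features at T1
print(cube[0, :, 0])  # feature 0 over the 3 timesteps
```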

The input now consists of N records, each made of 15 timesteps of 4 values.

## Building and training the neural network

We will keep it simple with only 3 layers: 
  * Rounding the values (we don't need that precision)
  * LSTM layer
  * Dense layer

Of course, you can test different values for the different parameters...

In [None]:
lstm = keras.Sequential()
lstm.add(keras.layers.Lambda(lambda x : keras.backend.round(x*100), input_shape=(train_x.shape[1], train_x.shape[2])))
lstm.add(keras.layers.LSTM(400))
lstm.add(keras.layers.Dense(nb_features, activation='sigmoid'))
lstm.compile(loss='mse', optimizer='adam')

lstm.summary()
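
As a sanity check on the summary, the number of trainable parameters can be computed by hand. The sketch assumes the standard Keras LSTM layout (4 gates, each with input weights, recurrent weights and a bias) and a Dense layer with bias; the rounding layer has no parameters.

```python
units = 400        # LSTM(400)
nb_features = 4    # cernox_4k, K3, K4, K5

# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (units * (nb_features + units) + units)
# Dense layer: one weight per (unit, output) pair plus one bias per output
dense_params = units * nb_features + nb_features

print(lstm_params, dense_params, lstm_params + dense_params)
```

Almost all of the parameters sit in the LSTM layer, so its size is the first thing to change when experimenting.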

Now let's train the network with our training set. We use a callback to stop the training if the validation loss (val_loss) has not improved for 20 epochs.

For each entry (15 minutes of data), we try to predict the data of the 16th minute.

In [None]:
# fit network
callbacks = [
    # stop when val_loss has not improved for 20 epochs
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=20, verbose=1, mode='min'),
    keras.callbacks.TensorBoard("logs/lstm_pred"),
    # keep only the best model (lowest val_loss) on disk
    keras.callbacks.ModelCheckpoint('lstm.hdf5', monitor='val_loss', save_best_only=True, mode='min'),
]
history = lstm.fit(train_x, train_y, epochs=500, batch_size=240, validation_split=0.1, callbacks=callbacks)
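
The patience mechanism can be illustrated without Keras. The function below is a simplified sketch of the idea, not the library's implementation: it returns the epoch at which training would stop, given a list of validation losses. The name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch where training stops, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2
demo_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch = early_stopping_epoch(demo_losses, patience=3)
print(stop_epoch)
```

With a patience of 20, as in the training cell, the network gets 20 more epochs to beat its best validation loss before the training is interrupted.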

Check the evolution of the loss on the training and validation data :

In [None]:
plt.plot(history.history['loss'][:], label='training')
plt.plot(history.history['val_loss'][:], label='validation')
plt.legend()
plt.show()

Now that we have a trained network, can it guess the correct values for our test data set? We give 15 minutes of data as input and compare the output with the data of the 16th minute :

In [None]:
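# reload the best model saved by ModelCheckpoint
lstm = load_model("lstm.hdf5")
# with loss='mse', evaluate() returns the mean squared error on the scaled values
evaluation = lstm.evaluate(test_x,test_y)
print("Mean squared error on test data :",round(evaluation,6))
# the square root (RMSE) is expressed in the [0,1] scaled range, hence the percentage
print("->",round(math.sqrt(evaluation)*100,1),"% error on average")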
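
To get a feeling for what such a percentage means physically, the error in scaled units can be multiplied by the range of each column. The sketch below uses a hypothetical MSE value and the approximate ranges quoted earlier for cernox_4k and K5. Since the loss is averaged over the four columns, the result is only an order of magnitude per column.

```python
import math

mse = 0.0009           # hypothetical value returned by evaluate()
rmse = math.sqrt(mse)  # error in scaled units, i.e. a fraction of the [0,1] range

# Approximate ranges quoted earlier in the text
ranges = {"cernox_4k": (4.04, 4.09), "K5": (910.0, 945.0)}

# Error converted back to physical units (Kelvin for cernox_4k, millibar for K5)
physical = {name: rmse * (high - low) for name, (low, high) in ranges.items()}
print(round(rmse * 100, 1), "% of the range")
print(physical)
```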

## Using the network to detect anomalies

Our network is able to guess the future values with an error of ~3% in the context of the training data. What happens if we change that context? We will check the evolution of the error using the data from the event.csv file.

### Loading the file

In [None]:
# Load the CSV file
strange_event = pd.read_csv("event.csv")
#Drop the timestamp column
strange_event = strange_event.drop(columns=['Date (UTC)'])
strange_event = strange_event.values
#Rescale the data, just like the training set
strange_event_scaled = scaler.transform(strange_event)
#Plotting the data
plt.plot(strange_event_scaled[:,0],label="cernox_4k")
plt.plot(strange_event_scaled[:,1],label="K3")
plt.plot(strange_event_scaled[:,2],label="K4")
plt.plot(strange_event_scaled[:,3],label="K5")
plt.legend()

In this example, all values suddenly increase after around 140 minutes, a behaviour that is not present in the training data. We will format the data as we did for the training set :

In [None]:
event_data = np.zeros((len(strange_event_scaled)-nb_timesteps,(nb_timesteps+1)*nb_features))
for i in range(len(strange_event_scaled)-nb_timesteps):
    event_data[i] = np.reshape(strange_event_scaled[i:i+nb_timesteps+1],((nb_timesteps+1)*nb_features))
# split into input and outputs (15 minutes of data in input, the 16th minute in output)
event_x, event_y = event_data[:, :-nb_features], event_data[:, -nb_features:]
# reshape input to be 3D [samples, timesteps, features] -> [nb_samples, nb_timesteps minutes, nb_features features]
event_x = event_x.reshape((event_x.shape[0], nb_timesteps, nb_features))

### Using the neural network

We will now feed the network the data from T0 to T14 and try to guess T15. We record the prediction error and start again with the data from T1 to T15, then T2 to T16 and so on...

In [None]:
errors=np.zeros(len(event_x))
for i in range(len(event_x)):
    errors[i]=lstm.evaluate(event_x[i:i+1,:,:],event_y[i:i+1],verbose=0)
    
fig = plt.figure(figsize=(15,9))
plt.title("Evolution of the prediction's errors")
plt.plot(errors)
plt.show()

#Apply a sliding average on the output
N = 10
errors = np.convolve(errors, np.ones((N,))/N, mode='valid')
    
fig = plt.figure(figsize=(15,9))
plt.title("Evolution of the prediction's errors (average on "+str(N)+" minutes)")
plt.plot(errors)
plt.show()
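
Calling `evaluate` once per record is simple but slow. With an MSE loss, the per-record error should be the same (up to floating-point precision) as the mean of the squared differences between one batch of predictions and the targets. The sketch below shows the computation with random stand-ins: `predictions` plays the role of `lstm.predict(event_x)` and `targets` the role of `event_y`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for event_y and lstm.predict(event_x): 50 records of 4 scaled values
targets = rng.random((50, 4))
predictions = targets + rng.normal(0.0, 0.02, size=targets.shape)

# Mean squared error of each record, averaged over the 4 features
per_record_mse = np.mean((predictions - targets) ** 2, axis=1)

# Same sliding average as above: element j averages records j .. j+N-1
N = 10
smoothed = np.convolve(per_record_mse, np.ones(N) / N, mode='valid')
print(per_record_mse.shape, smoothed.shape)
```

Note that `mode='valid'` shortens the curve by N-1 points, which shifts the position of a spike on the x-axis by a few minutes.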

We can see that the prediction error is relatively stable except for a huge spike: it should be easy to set a threshold to raise an alarm!
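
One possible way to set such a threshold, sketched on a synthetic error curve: take a period known to be normal, and raise an alarm when the error exceeds its mean by several standard deviations. The synthetic data, the length of the baseline and the factor 5 are all arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic error curve: low noise with a spike, standing in for `errors`
demo_errors = rng.uniform(0.001, 0.002, size=200)
demo_errors[140:150] = 0.05

baseline = demo_errors[:100]                      # hypothetical "known normal" period
threshold = baseline.mean() + 5 * baseline.std()  # the factor 5 is an arbitrary choice

# Indices of all points above the threshold, and the first of them
alarms = np.flatnonzero(demo_errors > threshold)
first_alarm = int(alarms[0]) if alarms.size else None
print(threshold, first_alarm)
```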

This example is fairly obvious; you can try again with the data from the file event2.csv.