# Timeseries anomaly detection using an Autoencoder

**Author:** [pavithrasv](https://github.com/pavithrasv)<br>
**Date created:** 2020/05/31<br>
**Last modified:** 2020/05/31<br>
**Description:** Detect anomalies in a timeseries using an Autoencoder.

https://github.com/keras-team/keras-io/blob/master/examples/timeseries/ipynb/timeseries_anomaly_detection.ipynb

## Introduction

This script demonstrates how you can use a reconstruction convolutional
autoencoder model to detect anomalies in timeseries data.

## Setup

In [1]:
import numpy as np
import pandas as pd
import keras
from keras import layers
from matplotlib import pyplot as plt



In [2]:
from joblib import delayed

In [4]:
import random

In [5]:
import tensorflow as tf

In [6]:
import dask.dataframe as dd
import os




## Load the data

We will use the [Numenta Anomaly Benchmark (NAB)](
https://www.kaggle.com/boltzmannbrain/nab) dataset. It provides artificial
timeseries data containing labeled anomalous periods of behavior. Data are
ordered, timestamped, single-valued metrics.

We will use the `art_daily_small_noise.csv` file for training and the
`art_daily_jumpsup.csv` file for testing. The simplicity of this dataset
allows us to demonstrate anomaly detection effectively.

### Create sequences

Create sequences combining `TIME_STEPS` contiguous data values from the
training data.
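
`create_sequences` is called in the test-data cell further down, but no definition appears in this notebook. Here is a minimal sketch of such a sliding-window helper, assuming it should behave as the sentence above describes. The toy array and the explicit `time_steps` argument are illustrative choices, not the author's code.

```python
import numpy as np


def create_sequences(values, time_steps):
    """Stack every run of `time_steps` contiguous rows of `values`."""
    windows = [
        values[i : i + time_steps]
        for i in range(len(values) - time_steps + 1)
    ]
    return np.stack(windows)


# Toy input: 10 values with a single feature, window length 3.
toy = np.arange(10).reshape(-1, 1)
toy_sequences = create_sequences(toy, time_steps=3)
print(toy_sequences.shape)  # (8, 3, 1)
```

An input of `n` rows yields `n - time_steps + 1` overlapping windows, so consecutive windows share all but one value.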

In [7]:
%%time
TIME_STEPS = 750
Dims = 3

Folder = '/scratch/750inputs/'

CPU times: user 4 μs, sys: 4 μs, total: 8 μs
Wall time: 12.4 μs


In [19]:
%%time 
file_list = [
    os.path.join(Folder,file)
    for file in os.listdir(Folder) if file.endswith('.csv') and file.startswith('2')
]

CPU times: user 1min 38s, sys: 1min 48s, total: 3min 26s
Wall time: 55min 16s


In [20]:
len(file_list)

112032781

In [21]:
%%time
random.shuffle(file_list)

CPU times: user 50.5 s, sys: 4.67 s, total: 55.2 s
Wall time: 55.3 s


In [22]:
for i in range(30):
    print(file_list[i])

/scratch/750inputs/230524 recording200095214.csv
/scratch/750inputs/230430 recording200363446.csv
/scratch/750inputs/230305 recording300081899.csv
/scratch/750inputs/230105 recording300303895.csv
/scratch/750inputs/221116 recording100041219.csv
/scratch/750inputs/230523 recording400219998.csv
/scratch/750inputs/230517 recording200139617.csv
/scratch/750inputs/221116 recording200387258.csv
/scratch/750inputs/230717 recording200268674.csv
/scratch/750inputs/230113 recording200299053.csv
/scratch/750inputs/221123 recording300047599.csv
/scratch/750inputs/230115 recording200360639.csv
/scratch/750inputs/230704 recording400048620.csv
/scratch/750inputs/230313 recording300162248.csv
/scratch/750inputs/230103 recording300145506.csv
/scratch/750inputs/230504 recording400281842.csv
/scratch/750inputs/230506 recording300485285.csv
/scratch/750inputs/230227 recording300245342.csv
/scratch/750inputs/230317 recording300147906.csv
/scratch/750inputs/221122 recording300407890.csv
/scratch/750inputs/2

In [11]:
%%time
#ddf = tf.data.experimental.make_csv_dataset(file_list[:1000], batch_size=32)
ddf = tf.data.experimental.make_csv_dataset(Folder+"2*.csv", batch_size=32, column_names=['rx','ry','rz','sx','sy','sz'],header=False)

ValueError: Problem inferring types: CSV row 1 has 750 number of fields. Expected: 6.
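
The error says that each CSV row holds 750 values rather than 6 named columns, so a row-per-example reader does not fit this layout. One possible alternative is sketched below, assuming every file stores one whole sample as a comma-separated grid. The 6 x 750 layout, the helper name `iter_batches`, and the synthetic demo files are assumptions made for illustration. The generator loads one batch of files at a time, which avoids holding every file in memory.

```python
import os
import tempfile

import numpy as np


def iter_batches(paths, batch_size):
    """Yield arrays of shape (n, rows, cols), loading one batch of files at a time."""
    for start in range(0, len(paths), batch_size):
        chunk = paths[start : start + batch_size]
        yield np.stack([np.loadtxt(p, delimiter=",") for p in chunk])


# Demo on synthetic files; the 6 x 750 layout is an assumption.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for k in range(5):
    path = os.path.join(demo_dir, f"2_demo_{k}.csv")
    np.savetxt(path, np.random.rand(6, 750), delimiter=",")
    demo_paths.append(path)

batch_shapes = [batch.shape for batch in iter_batches(demo_paths, batch_size=2)]
print(batch_shapes)  # [(2, 6, 750), (2, 6, 750), (1, 6, 750)]
```

A generator like this could be wrapped with `tf.data.Dataset.from_generator` if a `tf.data` pipeline is preferred.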

In [None]:
%%time
#ddf = dd.read_csv(Folder+'*.csv')

In [None]:
%%time
# other options are skipped for convenience
ddf = dd.read_csv(file_list[:1000], names=['rx','ry','rz','sx','sy','sz'], 
                  dtype={'rx':np.float64,'ry':np.float64,'rz':np.float64,'sx':np.float64,'sy':np.float64,'sz':np.float64})

In [None]:
%%time
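from dask import delayed  # dd.from_delayed expects dask's delayed, not joblib's

# read_csv instead of read_fwf: the files appear to be comma-separated without a header.
dfs = [delayed(pd.read_csv)(file, header=None) for file in file_list[:1000]]
ddf = dd.from_delayed(dfs)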

In [None]:
ddf.head()

In [None]:
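files = 100

# Load the first `files` recordings and stack them into a single array.
x_train = []
for i in range(files):
    x_train.append(np.loadtxt(file_list[i], delimiter=","))
x_train = np.stack(x_train)  # shape: (files, rows, columns)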

In [None]:
file = np.loadtxt(file_list[3],delimiter=",")

In [None]:
file.shape

In [None]:
print("Training input shape: ", x_train.shape)

## Build a model

We will build a convolutional reconstruction autoencoder model. The model takes
input of shape `(batch_size, sequence_length, num_features)` and should return
output of the same shape. The input layer below is declared as
`layers.Input(shape=(6, 750))`, which Keras reads as a `sequence_length` of 6
and `num_features` of 750.

In [9]:
model = keras.Sequential(
    [
        layers.Input(shape=(6, 750)),
        layers.Conv1D(
            filters=64,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Dropout(rate=0.2),
        layers.Conv1D(
            filters=32,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Conv1DTranspose(
            filters=32,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Dropout(rate=0.2),
        layers.Conv1DTranspose(
            filters=64,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Conv1DTranspose(filters=1, kernel_size=7, padding="same"),

    ]
)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
model.summary()
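
Whether the output length matches the input length can be checked by hand before training. The sketch below is only arithmetic, not Keras code. It relies on two facts about `padding="same"`: a strided `Conv1D` produces `ceil(n / stride)` steps, and a strided `Conv1DTranspose` produces `n * stride` steps. The helper names are hypothetical.

```python
import math


def same_conv_len(n, stride):
    # Conv1D with padding="same": output length is ceil(n / stride).
    return math.ceil(n / stride)


def same_conv_transpose_len(n, stride):
    # Conv1DTranspose with padding="same": output length is n * stride.
    return n * stride


def autoencoder_out_len(n, strides=(2, 2)):
    # Two strided encoder layers followed by two strided decoder layers.
    for s in strides:
        n = same_conv_len(n, s)
    for s in reversed(strides):
        n = same_conv_transpose_len(n, s)
    return n  # the final stride-1 transpose layer keeps the length


for n in (288, 750, 6):
    print(n, "->", autoencoder_out_len(n))
# 288 -> 288
# 750 -> 752
# 6 -> 8
```

Only lengths divisible by 4 round-trip exactly with these strides. For other lengths, one option is to trim the output with a layer such as `layers.Cropping1D`. The number of output features is set separately, by `filters` in the last layer.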

## Train the model

Please note that we are using `x_train` as both the input and the target
since this is a reconstruction model.

In [10]:
history = model.fit(
    x_train,
    x_train,
    epochs=50,
    batch_size=5,
    validation_split=0.1,  # needed so that "val_loss" exists for EarlyStopping
    callbacks=[
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, mode="min")
    ],
)

Let's plot training and validation loss to see how the training went.

In [None]:
plt.plot(history.history["loss"], label="Training Loss")
plt.plot(history.history["val_loss"], label="Validation Loss")
plt.legend()
plt.show()

## Detecting anomalies

We will detect anomalies by determining how well our model can reconstruct
the input data.


1.   Find MAE loss on training samples.
2.   Find max MAE loss value. This is the worst our model has performed trying
to reconstruct a sample. We will make this the `threshold` for anomaly
detection.
3.   If the reconstruction loss for a sample is greater than this `threshold`
value then we can infer that the model is seeing a pattern that it isn't
familiar with. We will label this sample as an `anomaly`.


In [None]:
# Get train MAE loss.
x_train_pred = model.predict(x_train)
train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1)

plt.hist(train_mae_loss, bins=50)
plt.xlabel("Train MAE loss")
plt.ylabel("No of samples")
plt.show()

# Get reconstruction loss threshold.
threshold = np.max(train_mae_loss)
print("Reconstruction error threshold: ", threshold)
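
With more than one feature per timestep, `np.mean(..., axis=1)` on an array of shape `(samples, timesteps, features)` leaves one loss per feature rather than one per sample. The sketch below uses synthetic data and assumes that a single score per sample is wanted. It averages over both the time and feature axes and then applies the same max-based threshold. All arrays here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 samples, 8 timesteps, 3 features.
x = rng.normal(size=(100, 8, 3))
x_pred = x + rng.normal(scale=0.1, size=x.shape)  # pretend reconstruction

# Average absolute error over the time and feature axes: one score per sample.
per_sample_mae = np.mean(np.abs(x_pred - x), axis=(1, 2))
toy_threshold = per_sample_mae.max()

# A sample reconstructed far worse than any training sample gets flagged.
bad = x[:1] + 5.0
bad_mae = np.mean(np.abs(bad - x[:1]), axis=(1, 2))
print(per_sample_mae.shape, bool(bad_mae[0] > toy_threshold))
```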

### Compare reconstruction

Just for fun, let's see how our model has reconstructed the first sample,
that is, the first sequence in `x_train`.

In [None]:
# Checking how the first sequence is learnt
plt.plot(x_train[0])
plt.plot(x_train_pred[0])
plt.show()

### Prepare test data

In [None]:

df_test_value = (df_daily_jumpsup - training_mean) / training_std
fig, ax = plt.subplots()
df_test_value.plot(legend=False, ax=ax)
plt.show()

# Create sequences from test values.
x_test = create_sequences(df_test_value.values)
print("Test input shape: ", x_test.shape)

# Get test MAE loss.
x_test_pred = model.predict(x_test)
test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)
test_mae_loss = test_mae_loss.reshape((-1))

plt.hist(test_mae_loss, bins=50)
plt.xlabel("Test MAE loss")
plt.ylabel("No of samples")
plt.show()

# Detect all the samples which are anomalies.
anomalies = test_mae_loss > threshold
print("Number of anomaly samples: ", np.sum(anomalies))
print("Indices of anomaly samples: ", np.where(anomalies))
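
The first line of the cell above standardizes the test data with statistics taken from the training data (`training_mean` and `training_std`, which are not defined in this notebook). The small sketch below shows that step on made-up numbers. It uses plain NumPy arrays in place of the DataFrames, and the variable names are illustrative.

```python
import numpy as np

train_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_values = np.array([3.0, 10.0])

# Statistics come from the training data only.
toy_mean = train_values.mean()
toy_std = train_values.std(ddof=1)  # ddof=1 matches the pandas default

# The test data is scaled with the training statistics, never its own.
test_scaled = (test_values - toy_mean) / toy_std
print(test_scaled)
```

Reusing the training statistics keeps the test values on the scale the model was trained on, so a jump in the test signal still shows up as a large standardized value.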

## Plot anomalies

We now know the samples of the data which are anomalies. With this, we will
find the corresponding `timestamps` from the original test data. We will be
using the following method to do that:

Let's say time_steps = 3 and we have 10 training values. Our `x_train` will
look like this:

- 0, 1, 2
- 1, 2, 3
- 2, 3, 4
- 3, 4, 5
- 4, 5, 6
- 5, 6, 7
- 6, 7, 8
- 7, 8, 9

Every data value except the first and the last `time_steps - 1` values appears
in `time_steps` samples. So, if we know that the samples
[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are anomalies, we can say that the data point
5 is an anomaly.
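
The worked example above can be reproduced on toy data. This sketch assumes the same setup (`time_steps = 3`, ten values, and the windows starting at 3, 4 and 5 flagged). The variable names are illustrative.

```python
import numpy as np

time_steps = 3
n_values = 10
n_windows = n_values - time_steps + 1  # 8 windows, starting at 0..7

# Mark the windows starting at 3, 4 and 5 as anomalous.
window_is_anomaly = np.zeros(n_windows, dtype=bool)
window_is_anomaly[[3, 4, 5]] = True

# Data point i lies in the windows starting at i - time_steps + 1 .. i (inclusive).
anomalous_points = [
    i
    for i in range(time_steps - 1, n_values - time_steps + 1)
    if np.all(window_is_anomaly[i - time_steps + 1 : i + 1])
]
print(anomalous_points)  # [5]
```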

In [None]:
# Data point i is an anomaly if samples (i - TIME_STEPS + 1) to i (inclusive) are all anomalies.
anomalous_data_indices = []
for data_idx in range(TIME_STEPS - 1, len(df_test_value) - TIME_STEPS + 1):
    if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx + 1]):
        anomalous_data_indices.append(data_idx)

Let's overlay the anomalies on the original test data plot.

In [None]:
df_subset = df_daily_jumpsup.iloc[anomalous_data_indices]
fig, ax = plt.subplots()
df_daily_jumpsup.plot(legend=False, ax=ax)
df_subset.plot(legend=False, ax=ax, color="r")
plt.show()