## Foundations of Deep Learning 24/25 // Data Science MSc - unimib

### Temporal Convolutional Networks and the Tennessee Eastman Process dataset 

### TCNs for classification of time series anomalies 

 E. Mosca - 925279

#### *Introduction*

In this work, I implemented TCN-based models for time series classification in TensorFlow/Keras, based mainly on the paper by Bai et al. that introduced them (https://arxiv.org/abs/1803.01271).

TCNs adapt the usual convolution operation to time series data: they use one-dimensional kernels and are characterised by dilated causal convolutions. A dilated causal convolution uses a 1D kernel of length k whose taps are spaced d time steps apart (d is the dilation factor), and each output value depends only on the current and earlier time steps, never on future ones. To make this work with kernel sizes greater than 1, we add causal zero padding, d(k-1) zeros, at the beginning of the sequence; as a consequence, the output sequence has the same length as the input.
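
To make the padding and the causality property concrete, here is a minimal single-channel sketch in plain NumPy. It is only an illustration of the operation described above, not the Keras implementation; the function name `dilated_causal_conv1d` is made up for this example.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation=1):
    """Single-channel dilated causal convolution with left zero padding (illustrative)."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    pad = dilation * (k - 1)                      # zeros prepended so the output keeps len(x)
    padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(k):
            # kernel[k-1] multiplies x[t]; kernel[0] multiplies x[t - (k-1)*dilation]
            out[t] += kernel[j] * padded[t + j * dilation]
    return out

x = np.arange(1, 7)                               # [1, 2, 3, 4, 5, 6]
y = dilated_causal_conv1d(x, kernel=[1, 1, 1], dilation=2)
print(y)                                          # y[t] = x[t] + x[t-2] + x[t-4]

# Causality check: changing a future value leaves earlier outputs untouched
x_changed = x.copy()
x_changed[4] = 100
y_changed = dilated_causal_conv1d(x_changed, kernel=[1, 1, 1], dilation=2)
print(np.array_equal(y[:4], y_changed[:4]))
```

The output has the same length as the input, and the outputs before the modified time step are identical in both runs.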

Since we want to cover the complete history of the input sequence, a TCN needs an appropriate receptive field RF, which depends on the kernel size, the dilation values and the number of layers.
- A single "1D dilated causal convolutional layer" consists of multiple convolutions, each one with a higher dilation parameter than the previous, in order to cover a wider range of values without increasing the kernel size, which would add many weights.
- Each convolution has its own receptive field rf = 1 + d(k-1), meaning each time point in its output depends on the current input point and on input points up to d(k-1) steps before it.
- By stacking multiple convolutions and increasing the dilation factor exponentially, the overall RF grows exponentially with the number of convolutions, while the number of weights grows only linearly.
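
The growth described in the list above can be checked with a few lines of Python. This is a sketch under the assumption that the convolutions are applied one after the other, so that each one extends the history by d(k-1) steps; the helper name `receptive_field` is chosen for this example.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of sequentially stacked dilated causal convolutions."""
    # Each convolution extends the visible history by d * (k - 1) time steps
    return 1 + (kernel_size - 1) * sum(dilations)

k = 3
for n in range(1, 9):
    dilations = [2 ** i for i in range(n)]        # 1, 2, 4, ... (powers of 2)
    print(n, dilations[-1], receptive_field(k, dilations))
```

Each added convolution brings the same number of weights, while the receptive field roughly doubles: with k = 3 it goes from 3 for a single convolution to 511 for eight of them.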

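Knowing our sequence length len, we can determine the number n of convolutions to stack in a single layer. With dilations 1, 2, ..., 2^(n-1) the stack has a receptive field of 1 + (k-1)(2^n - 1); requiring this to be at least len gives:

n = ceiling(log2((len-1)/(k-1) + 1))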

After experimenting with TCNs on our data, using dilation factors that are powers of 2, we chose a kernel size of 3. Although the train and test sequences have different lengths, we train with a receptive field that is appropriate for the train set: since the sequence length in the train set is 500, we use 8 convolutions in each dilated causal convolutional layer.
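
As a quick check of the numbers above, the formula can be evaluated directly. The function name `required_convolutions` is an assumption made for this sketch.

```python
import math

def required_convolutions(seq_len, kernel_size):
    """Number of stacked convolutions (dilations 1, 2, 4, ...) needed to cover seq_len."""
    return math.ceil(math.log2((seq_len - 1) / (kernel_size - 1) + 1))

n = required_convolutions(seq_len=500, kernel_size=3)
print(n)                                          # number of convolutions per layer
print([2 ** i for i in range(n)])                 # corresponding dilation rates
```

For a train sequence length of 500 and a kernel size of 3, this gives 8 convolutions with dilation rates from 1 to 128, whose receptive field of 511 covers the whole sequence; 7 convolutions would only reach 255.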

#### *Implementation*

Code for creating a TF model based on TCN for time series classification follows

In [1]:
import pandas as pd
from tensorflow import keras
import tensorflow as tf
import numpy as np

In [2]:
from tensorflow.keras.layers import Input, Conv1D, BatchNormalization, ReLU, SpatialDropout1D, Add, Dense, GlobalMaxPooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import L2

Our TCN architecture: a stack of residual blocks built from dilated causal convolutions

- the final residual block is followed by global max pooling, which is connected to a 20-way softmax for classification
- 2 residual blocks, where the output is added to the input to avoid vanishing gradients (if the dimensions don't match, we apply a 1x1 convolution to the input to match them)
- 2 dilated convolutional layers per residual block
- each dilated convolutional layer is followed by batch normalization, ReLU activation and spatial dropout
- each dilated convolutional layer has 8 convolutions, with dilation rates from 1 to 128
- before batch normalization, we add up the intermediate convolution results obtained with the different dilations, to combine features at different time resolutions
- every convolution has kernel size = 3, causal zero padding, `n_filters` kernels (64 by default in the code below), Xavier initialization and L2 regularization

In [3]:
def tcn_block(inputs, n_filters, kernel_size, dilations, dropout_rate, kernel_initializer="glorot_uniform"):
    """Residual block made of two dilated causal convolutional layers."""
    x = inputs

    # Two dilated convolutional layers per residual block
    for _ in range(2):
        conv_outputs = []

        # Chain the convolutions with increasing dilation, keeping every intermediate output
        for d in dilations:
            x = Conv1D(filters=n_filters, kernel_size=kernel_size, dilation_rate=d,
                       padding='causal', activation=None,
                       kernel_initializer=kernel_initializer, kernel_regularizer=L2())(x)
            conv_outputs.append(x)

        # Sum the features from all dilations, then normalize, activate and apply dropout
        x = Add()(conv_outputs)
        x = BatchNormalization()(x)
        x = ReLU()(x)
        x = SpatialDropout1D(rate=dropout_rate)(x)

    # Match dimensions for the residual connection with a 1x1 convolution if necessary
    if inputs.shape[-1] != x.shape[-1]:
        residual = Conv1D(filters=x.shape[-1], kernel_size=1, padding='same')(inputs)
    else:
        residual = inputs

    return Add()([x, residual])


def build_tcn_model(num_features,
                    num_classes,
                    num_blocks=2,
                    n_filters=64,
                    kernel_size=3,
                    dilations=(1, 2, 4, 8, 16, 32, 64, 128),  # tuple avoids a mutable default argument
                    dropout_rate=0.1,
                    pooling=GlobalMaxPooling1D):
    """Stack of residual TCN blocks followed by global pooling and a softmax classifier."""
    # Sequence length is None so the model accepts any length (the test set length differs from the train set)
    inputs = Input(shape=(None, num_features))
    x = inputs
    for _ in range(num_blocks):
        x = tcn_block(x, n_filters=n_filters, kernel_size=kernel_size,
                      dilations=dilations, dropout_rate=dropout_rate)

    # Pool over the time axis, then classify
    x = pooling()(x)
    outputs = Dense(num_classes, activation='softmax')(x)

    return Model(inputs, outputs)

The final training of the TCN-based model with this architecture, which resulted from many experiments, was done in Colab and is available in the TCNtraining.ipynb file; the trained model is available as TCN00.keras.

In [4]:
model = build_tcn_model(num_features=52,num_classes=20)
model.summary()

Development notes:

- Different TCN architectures were implemented and tested, mainly the one above and a "long" TCN variant with more residual blocks, a single convolution per dilated convolutional layer, and a dilation that increases block by block.
- Compared with the keras-tcn library, they tended to reach similar classification performance during training (in terms of categorical cross-entropy and accuracy).
- Surprisingly, Xavier initialization worked better than He initialization, even though ReLU activations are used.
- For some time, training was stuck at a plateau, likely because of the lack of weight decay: performance was much higher once it was included (exploding gradients may have been the issue).
- After training different models for some time, it became evident that they didn't need to be big and that fewer than 500K trainable parameters were enough. This, coupled with a high learning rate (0.01 with Adam), greatly reduced the time the model takes to converge and the number of epochs actually needed to train it. Despite some initial doubts about this setup, the resulting model performed better than the old one on the test set.
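
As a rough cross-check of the parameter budget mentioned in the last note, the trainable parameters of the architecture built by `build_tcn_model(num_features=52, num_classes=20)` can be counted by hand. This is a back-of-the-envelope sketch that assumes the default arguments shown above and the standard Keras counts (Conv1D and Dense with bias, batch normalization with trainable gamma and beta); the helper names are made up for this example, and `model.summary()` remains the authoritative figure.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus biases of a Conv1D layer."""
    return kernel_size * in_channels * filters + filters

def tcn_block_params(in_channels, filters, kernel_size, n_dilations):
    """Trainable parameters of one residual block (two dilated layers)."""
    total = 0
    channels = in_channels
    for _ in range(2):                            # two dilated convolutional layers
        for _ in range(n_dilations):
            total += conv1d_params(channels, filters, kernel_size)
            channels = filters                    # later convolutions see `filters` channels
        total += 2 * filters                      # batch norm gamma and beta
    if in_channels != filters:                    # 1x1 convolution on the residual path
        total += conv1d_params(in_channels, filters, 1)
    return total

def tcn_model_params(num_features, num_classes, num_blocks=2,
                     filters=64, kernel_size=3, n_dilations=8):
    total = 0
    channels = num_features
    for _ in range(num_blocks):
        total += tcn_block_params(channels, filters, kernel_size, n_dilations)
        channels = filters
    total += channels * num_classes + num_classes # final Dense softmax layer
    return total

print(tcn_model_params(num_features=52, num_classes=20))
```

Under these assumptions the count comes to 398,164 trainable parameters, which is below the 500K mark.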