# Time Series Forecasting: LSTM Approach

## Introduction

Welcome to this notebook, which explores a Deep Learning approach to predicting the number of cyberattacks a country may face in the following month. Long Short-Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN), are well suited to forecasting because their gated architecture retains relevant information across many time steps, which lets them capture complex temporal patterns in sequential data. Here we use LSTMs to model the temporal dependencies in the daily attack counts and to forecast future values.


## Table of Contents

1. Time Series Visualization

2. Dataset Construction

3. Sequencing for LSTM

4. Train models

5. Evaluation

6. Bonus Section


In [None]:
# Required imports
import pandas as pd
from utils import *

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

## Time Series Visualization

Let us read the data and visualize it as a time series.

In [None]:
# Read data
df1 = pd.read_csv('../Data/21_november_to_april.csv')
df2 = pd.read_csv('../Data/22_april_to_november.csv')
df3 = pd.read_csv('../Data/22_november_to_april.csv')
df4 = pd.read_csv('../Data/23_april_to_november.csv')

# Concatenate dataframes
df = pd.concat([df1, df2, df3, df4], axis=0, ignore_index=True)

# Delete dataframes
del df1, df2, df3, df4

In [None]:
# Select some countries to analyze
df_Spain = select_country(df, 'Spain')
df_USA = select_country(df, 'United States')
df_Singapore = select_country(df, 'Singapore')
df_Germany = select_country(df, 'Germany')
df_Japan = select_country(df, 'Japan')

In [None]:
daily_count_Spain = visualize_ts(df_Spain)

In [None]:
daily_count_USA = visualize_ts(df_USA)

In [None]:
daily_count_Singapore = visualize_ts(df_Singapore)

In [None]:
daily_count_Germany = visualize_ts(df_Germany)

In [None]:
daily_count_Japan = visualize_ts(df_Japan)

## Dataset Construction

Let us generate features to forecast the number of cyberattacks a country might encounter in the future. We will begin by incorporating temporal elements such as the month, year, day, and so on. This will establish a baseline dataset for our analysis.
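As a rough illustration of what such temporal features could look like, here is a minimal pandas sketch. The function `add_calendar_features`, the column names, and the toy numbers are hypothetical; this is not the `create_baseline_dataset` helper from utils.py, only an assumption about the kind of output it produces (target first, calendar columns after).

```python
import pandas as pd


def add_calendar_features(daily_counts: pd.Series) -> pd.DataFrame:
    """Build a frame with the target first, followed by calendar features.

    Hypothetical helper for illustration; it assumes the series has a
    DatetimeIndex with one entry per day.
    """
    out = pd.DataFrame({"count": daily_counts})
    idx = out.index
    out["day"] = idx.day
    out["month"] = idx.month
    out["year"] = idx.year
    out["day_of_week"] = idx.dayofweek  # Monday = 0, Sunday = 6
    return out


# Toy series of daily counts (made-up numbers, for illustration only)
dates = pd.date_range("2023-01-01", periods=5, freq="D")
toy_counts = pd.Series([12, 7, 9, 15, 11], index=dates, name="count")

toy_baseline = add_calendar_features(toy_counts)
print(toy_baseline)
```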

In [None]:
# Create baseline dataset. Just temporal information

df_Spain = create_baseline_dataset(daily_count_Spain)
df_USA = create_baseline_dataset(daily_count_USA)
df_Singapore = create_baseline_dataset(daily_count_Singapore)
df_Germany = create_baseline_dataset(daily_count_Germany)
df_Japan = create_baseline_dataset(daily_count_Japan)

# Visualize df_USA
df_USA.head() # The first column is the target variable

Just like in the ML_approach notebook, you can enhance the model by incorporating lagged features and rolling statistics features. We have a function in utils.py that divides the data into training, validation, and test sets while including the specified lagged and rolling statistics features.
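To make the idea of lagged and rolling statistics features concrete, here is a small pandas sketch. The function `add_lag_and_rolling`, its parameters, and the column names are hypothetical, and the actual behaviour of `create_features_lstm` (including how it interprets its arguments, splits the data, and scales it) may differ.

```python
import pandas as pd


def add_lag_and_rolling(df, target="count", n_lags=2, windows=(2, 3)):
    """Add lagged values and rolling statistics of the target column.

    Hypothetical helper for illustration only.
    """
    out = df.copy()

    # Lagged features: the target value 1, 2, ..., n_lags days earlier
    for lag in range(1, n_lags + 1):
        out[f"{target}_lag_{lag}"] = out[target].shift(lag)

    # Rolling statistics computed on past values only: shifting by one day
    # keeps the current target out of its own features
    past = out[target].shift(1)
    for w in windows:
        out[f"{target}_roll_mean_{w}"] = past.rolling(w).mean()
        out[f"{target}_roll_std_{w}"] = past.rolling(w).std()

    # The first rows lack enough history, so they contain NaN and are dropped
    return out.dropna()


# Toy data (made-up numbers)
toy = pd.DataFrame(
    {"count": [10, 12, 9, 14, 11, 13]},
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)
toy_features = add_lag_and_rolling(toy, n_lags=2, windows=(2, 3))
print(toy_features)
```

Shifting before computing the rolling statistics is a design choice that avoids leaking the value being predicted into the inputs.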

In [None]:
# Create train, validation and test sets with lagged and rolling features

train_Spain, valid_Spain, test_Spain, scaler_Spain = create_features_lstm(df_Spain, 5, [2, 3])
train_USA, valid_USA, test_USA, scaler_USA = create_features_lstm(df_USA, 1, [2, 3, 4, 5])
train_Singapore, valid_Singapore, test_Singapore, scaler_Singapore = create_features_lstm(df_Singapore, 4, [2, 3, 4])
train_Germany, valid_Germany, test_Germany, scaler_Germany = create_features_lstm(df_Germany, 5, [2, 3, 4, 5])
train_Japan, valid_Japan, test_Japan, scaler_Japan = create_features_lstm(df_Japan, 3, [2, 3])

# Number of features for each country
num_feat_Spain = train_Spain.shape[1]
num_feat_USA = train_USA.shape[1]
num_feat_Singapore = train_Singapore.shape[1]
num_feat_Germany = train_Germany.shape[1]
num_feat_Japan = train_Japan.shape[1]

## Sequencing for LSTM 

Typically, the input to an LSTM layer is a three-dimensional array (a tensor) whose dimensions are the number of samples, the sequence length, and the number of features: [samples, timesteps, features]. For time series data, each sample is a window of consecutive time steps, and the sequence length determines how many previous time steps the model considers when making a prediction. This tensor is fed into the LSTM layer, enabling the model to learn temporal dependencies and patterns within the sequential data.

In summary, organizing data for an LSTM involves structuring it into sequences, creating 3D tensors that encapsulate the temporal aspects of the data, and appropriately splitting the dataset for training and testing. This format facilitates the LSTM's ability to capture and learn from the sequential patterns within the input data.

To achieve this, we have developed a class called WindowGenerator. This class is responsible for organizing the data into the specified tensor format, making it ready for training with LSTM.
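The windowing idea can be sketched with plain NumPy. The function `make_windows` below is hypothetical and is not the `WindowGenerator` class from utils.py; it only illustrates how a 2D array of shape [time, features] could be sliced into a [samples, timesteps, features] tensor, assuming an input width of 3 and a label located one step after the input window.

```python
import numpy as np


def make_windows(features, labels, input_width=3, shift=1):
    """Slice a [time, features] array into overlapping input windows.

    Returns X with shape [samples, input_width, features] and y with shape
    [samples], where each label lies `shift` steps after the end of its
    input window. Hypothetical helper for illustration only.
    """
    features = np.asarray(features, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.float32)

    X, y = [], []
    last_start = len(features) - input_width - shift
    for start in range(last_start + 1):
        end = start + input_width
        X.append(features[start:end])      # input_width consecutive steps
        y.append(labels[end + shift - 1])  # value to predict
    return np.stack(X), np.array(y)


# Toy data: 10 time steps, 2 features (made-up values)
toy_data = np.arange(20, dtype=np.float32).reshape(10, 2)
X, y = make_windows(toy_data, toy_data[:, 0], input_width=3, shift=1)
print(X.shape, y.shape)  # (7, 3, 2) (7,)
```

With 10 time steps, an input width of 3, and a shift of 1, only 7 complete windows fit, which is why the number of samples is smaller than the number of rows.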

In [None]:
# Generate windows
# Spain
window_Spain = WindowGenerator(
    input_width=3,
    label_width=1,
    shift=1,
    train_df=train_Spain,
    val_df=valid_Spain,
    test_df=test_Spain,
    batch_size=1,
    label_columns=['count'])

# USA
window_USA = WindowGenerator(
    input_width=3,
    label_width=1,
    shift=1,
    train_df=train_USA,
    val_df=valid_USA,
    test_df=test_USA,
    batch_size=4,
    label_columns=['count'])

# Singapore
window_Singapore = WindowGenerator(
    input_width=3,
    label_width=1,
    shift=1,
    train_df=train_Singapore,
    val_df=valid_Singapore,
    test_df=test_Singapore,
    batch_size=1,
    label_columns=['count'])

# Germany
window_Germany = WindowGenerator(
    input_width=3,
    label_width=1,
    shift=1,
    train_df=train_Germany,
    val_df=valid_Germany,
    test_df=test_Germany,
    batch_size=1,
    label_columns=['count'])

# Japan
window_Japan = WindowGenerator(
    input_width=3,
    label_width=1,
    shift=1,
    train_df=train_Japan,
    val_df=valid_Japan,
    test_df=test_Japan,
    batch_size=1,
    label_columns=['count'])

## Train models

Let us create LSTM models for each individual country and proceed to train them.

In [None]:
# Create LSTM model
lstm_Spain = create_lstm(num_feat_Spain)
lstm_USA = create_lstm(num_feat_USA)
lstm_Singapore = create_lstm(num_feat_Singapore)
lstm_Germany = create_lstm(num_feat_Germany)
lstm_Japan = create_lstm(num_feat_Japan)

In [None]:
# Fit models
print('\n\n============= Spain =============\n\n')
history_Spain = compile_and_fit_lstm(lstm_Spain, window_Spain, patience=5, MAX_EPOCHS=50)
print('\n\n============= USA =============\n\n')
history_USA = compile_and_fit_lstm(lstm_USA, window_USA, patience=5, MAX_EPOCHS=50)
print('\n\n============= Singapore =============\n\n')
history_Singapore = compile_and_fit_lstm(lstm_Singapore, window_Singapore, patience=5, MAX_EPOCHS=50)
print('\n\n============= Germany =============\n\n')
history_Germany = compile_and_fit_lstm(lstm_Germany, window_Germany, patience=5, MAX_EPOCHS=50)
print('\n\n============= Japan =============\n\n')
history_Japan = compile_and_fit_lstm(lstm_Japan, window_Japan, patience=5, MAX_EPOCHS=50)
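The `patience` and `MAX_EPOCHS` arguments suggest that training uses early stopping on a validation metric, although this is an assumption about `compile_and_fit_lstm` rather than something stated in this notebook. The following pure-Python sketch, with the hypothetical function `early_stopping_epoch` and made-up loss values, shows how a patience of 5 would decide when to stop.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    all epochs are used. Hypothetical helper for illustration only.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses: the best value is reached at epoch 3
toy_val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.68, 0.69, 0.64]
stop_epoch = early_stopping_epoch(toy_val_losses, patience=5)
print(stop_epoch)  # 8
```

In this toy run the lower loss at epoch 9 is never seen, which illustrates the trade-off: a small patience saves time but can stop before a late improvement.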

## Evaluation

Let us see the performance of our models.
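Because the models are trained on scaled data, predictions have to be mapped back to attack counts before they can be compared with the actual values, which is presumably why the scaler is passed along with each model. The sketch below assumes min-max scaling; the class `ToyMinMaxScaler` and the numbers are hypothetical, and the scaler returned by `create_features_lstm` may be of a different kind.

```python
import numpy as np


class ToyMinMaxScaler:
    """Minimal min-max scaler for a single column (illustration only)."""

    def fit(self, values):
        values = np.asarray(values, dtype=float)
        self.min_ = values.min()
        # Assumes the values are not all identical (non-zero range)
        self.range_ = values.max() - self.min_
        return self

    def transform(self, values):
        return (np.asarray(values, dtype=float) - self.min_) / self.range_

    def inverse_transform(self, scaled):
        return np.asarray(scaled, dtype=float) * self.range_ + self.min_


# Fit on training counts only, so no test information leaks into the scaling
train_counts = np.array([20.0, 50.0, 80.0, 120.0])
toy_scaler = ToyMinMaxScaler().fit(train_counts)

# Made-up model outputs in the scaled space
scaled_prediction = np.array([0.25, 0.6])
restored = toy_scaler.inverse_transform(scaled_prediction)
print(restored)
```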

In [None]:
# Get predictions
actual_train_Spain, pred_train_Spain, actual_test_Spain, pred_test_Spain = get_predictions_lstm(
    lstm_Spain, scaler_Spain, train_Spain, test_Spain, window_Spain)
actual_train_USA, pred_train_USA, actual_test_USA, pred_test_USA = get_predictions_lstm(
    lstm_USA, scaler_USA, train_USA, test_USA, window_USA)
actual_train_Singapore, pred_train_Singapore, actual_test_Singapore, pred_test_Singapore = get_predictions_lstm(
    lstm_Singapore, scaler_Singapore, train_Singapore, test_Singapore, window_Singapore)
actual_train_Germany, pred_train_Germany, actual_test_Germany, pred_test_Germany = get_predictions_lstm(
    lstm_Germany, scaler_Germany, train_Germany, test_Germany, window_Germany)
actual_train_Japan, pred_train_Japan, actual_test_Japan, pred_test_Japan = get_predictions_lstm(
    lstm_Japan, scaler_Japan, train_Japan, test_Japan, window_Japan)

In [None]:
# Plot predictions for Spain
plot_results_LSTM(actual_train_Spain, pred_train_Spain, actual_test_Spain, pred_test_Spain, test_Spain, train_Spain, scaler_Spain)

In [None]:
# Plot predictions for USA
plot_results_LSTM(actual_train_USA, pred_train_USA, actual_test_USA, pred_test_USA, test_USA, train_USA, scaler_USA)

In [None]:
# Plot predictions for Singapore
plot_results_LSTM(actual_train_Singapore, pred_train_Singapore, actual_test_Singapore, pred_test_Singapore, test_Singapore, train_Singapore, scaler_Singapore)

In [None]:
# Plot predictions for Germany
plot_results_LSTM(actual_train_Germany, pred_train_Germany, actual_test_Germany, pred_test_Germany, test_Germany, train_Germany, scaler_Germany)

In [None]:
# Plot predictions for Japan
plot_results_LSTM(actual_train_Japan, pred_train_Japan, actual_test_Japan, pred_test_Japan, test_Japan, train_Japan, scaler_Japan)

Let us plot the metrics results.

In [None]:
# Append predictions and actual values
predictions = [pred_test_Spain.values, pred_test_USA.values, pred_test_Singapore.values, pred_test_Germany.values, pred_test_Japan.values]
actual = [actual_test_Spain.values, actual_test_USA.values, actual_test_Singapore.values, actual_test_Germany.values, actual_test_Japan.values]
countries = ['Spain', 'USA', 'Singapore', 'Germany', 'Japan']

# Display results
display_metrics_table(predictions, actual, countries)
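This notebook does not state which metrics `display_metrics_table` reports. As an assumption, the sketch below computes two common choices for count forecasts, the mean absolute error (MAE) and the root mean squared error (RMSE); the function `forecast_metrics` and the toy values are hypothetical.

```python
import numpy as np


def forecast_metrics(actual, predicted):
    """Compute MAE and RMSE between actual and predicted values.

    Hypothetical helper for illustration only.
    """
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    errors = predicted - actual
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
    }


# Made-up actual and predicted counts
toy_actual = [100, 120, 90, 110]
toy_predicted = [110, 115, 95, 100]
toy_metrics = forecast_metrics(toy_actual, toy_predicted)
print(toy_metrics)
```

Both metrics are expressed in the same unit as the target (attacks per day), and RMSE penalizes large errors more heavily than MAE.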

## Bonus Section

Let us attempt to provide a European perspective. Is it possible to forecast the number of cyberattacks expected in Europe for the upcoming month?

In [None]:
# Let us read European data and visualize it as a time series

df_EU = select_continent(df, 'EU')

daily_count_EU = visualize_ts(df_EU)

In [None]:
# Create the baseline dataset
df_EU = create_baseline_dataset(daily_count_EU)

In [None]:
# Create train, validation and test sets with lagged and rolling features
train_EU, valid_EU, test_EU, scaler_EU = create_features_lstm(df_EU, 3, [2,3,4])

# Number of features
num_feat_EU = train_EU.shape[1]

In [None]:
# Generate windows

window_EU = WindowGenerator(
    input_width=3,
    label_width=1,
    shift=1,
    train_df=train_EU,
    val_df=valid_EU,
    test_df=test_EU,
    batch_size=1,
    label_columns=['count'])

In [None]:
# Create LSTM model
lstm_EU = create_lstm(num_feat_EU)

# Fit model
print('\n\n============= EU =============\n\n')
history_EU = compile_and_fit_lstm(lstm_EU, window_EU, patience=5, MAX_EPOCHS=50)

In [None]:
# Get predictions
actual_train_EU, pred_train_EU, actual_test_EU, pred_test_EU = get_predictions_lstm(
    lstm_EU, scaler_EU, train_EU, test_EU, window_EU)

# Plot predictions for EU
plot_results_LSTM(actual_train_EU, pred_train_EU, actual_test_EU, pred_test_EU, test_EU, train_EU, scaler_EU)