# Time Series Forecasting: ML Approach

## Introduction

This notebook explores a Machine Learning approach to predicting the number of cyberattacks a country may face in the following month. Supervised models can be applied to time series as long as seasonality is encoded in explicit variables, for example the year, the month, or the day of the week. These variables form the feature matrix X of the supervised model, and the target y is the observed value of the time series. Lagged versions of y (its past values) can also be added to X to capture autocorrelation effects.
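
Written as a formula, the setup described above can be summarized roughly as follows. The symbols $f$, $c_t$, $p$ and $\varepsilon_t$ are introduced here only for illustration and do not appear in the notebook's code:

$$
y_t = f\big(c_t,\; y_{t-1},\; y_{t-2},\; \dots,\; y_{t-p}\big) + \varepsilon_t
$$

Here $y_t$ is the number of attacks at time $t$, $c_t$ collects the calendar variables (year, month, day of the week), $p$ is the number of lags included, $f$ is the regression model, and $\varepsilon_t$ is the part of $y_t$ the model does not explain.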


## Table of Contents

1. Time Series Visualization

2. Dataset Construction
    - 2.1 Baseline
    - 2.2 Lagged features
    - 2.3 Rolling statistics features

3. Train models

4. Evaluation

5. Bonus Section

In [None]:
# Required imports
import pandas as pd
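# Import only the helpers this notebook uses instead of a wildcard import
from utils import (
    add_lags, all_sensors_data, create_baseline_dataset,
    create_best_combination_dataset, create_rolling_features,
    evaluate_models, merge_all_sensors, parameters_search,
    select_continent, select_country, visualize_ts,
)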

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

## Time Series Visualization

Let us read the data and visualize it as a time series.

In [None]:
# Read data
df1 = pd.read_csv('../Data/21_november_to_april.csv')
df2 = pd.read_csv('../Data/22_april_to_november.csv')
df3 = pd.read_csv('../Data/22_november_to_april.csv')
df4 = pd.read_csv('../Data/23_april_to_november.csv')

# Concatenate dataframes
df = pd.concat([df1, df2, df3, df4], axis=0, ignore_index=True)

# Free the memory held by the partial dataframes
del df1, df2, df3, df4

In [None]:
# Select some countries to analyze
df_Spain = select_country(df, 'Spain')
df_USA = select_country(df, 'United States')
df_Singapore = select_country(df, 'Singapore')
df_Germany = select_country(df, 'Germany')
df_Japan = select_country(df, 'Japan')

In [None]:
daily_count_Spain = visualize_ts(df_Spain)

In [None]:
daily_count_USA = visualize_ts(df_USA)

In [None]:
daily_count_Singapore = visualize_ts(df_Singapore)

In [None]:
daily_count_Germany = visualize_ts(df_Germany)

In [None]:
daily_count_Japan = visualize_ts(df_Japan)

## Dataset Construction

Let us generate features to forecast the number of cyberattacks a country might face. We begin with calendar features such as the year, month, and day, which give a baseline dataset for the analysis.
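
As a rough illustration of what such a baseline could look like, the sketch below derives calendar columns from a date-indexed daily series. The function `calendar_features`, its column names, and the toy numbers are assumptions made for this example; the notebook's own `create_baseline_dataset` may be implemented differently.

```python
import pandas as pd


def calendar_features(daily_count: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: target column plus calendar columns."""
    out = pd.DataFrame({"count": daily_count})
    idx = pd.DatetimeIndex(out.index)
    out["year"] = idx.year
    out["month"] = idx.month
    out["day"] = idx.day
    out["day_of_week"] = idx.dayofweek  # Monday=0 ... Sunday=6
    return out


# Toy daily series (made-up numbers, not the notebook's data)
dates = pd.date_range("2023-01-01", periods=5, freq="D")
toy = pd.Series([10, 12, 9, 15, 11], index=dates)
baseline = calendar_features(toy)
print(baseline)
```

The target is kept as the first column, matching the layout the notebook mentions for `df_USA`.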

In [None]:
# Create the baseline dataset (calendar features only)
df_Spain = create_baseline_dataset(daily_count_Spain)
df_USA = create_baseline_dataset(daily_count_USA)
df_Singapore = create_baseline_dataset(daily_count_Singapore)
df_Germany = create_baseline_dataset(daily_count_Germany)
df_Japan = create_baseline_dataset(daily_count_Japan)

# Preview df_USA
df_USA.head()  # The first column is the target variable

### Lagged features

A valuable feature for anticipating the number of attacks a country might experience in the future is the historical count of attacks. To forecast the number of attacks at a given time, say $t$, we can use information on the number of cyberattacks at an earlier time $t-i$, where $i\geq 1$.
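
One common way to build such lagged columns in pandas is `Series.shift`. The sketch below is a minimal version under that assumption. The name `add_lag_features` and the column naming are hypothetical and are not the notebook's `add_lags`.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame, n_lags: int, target: str) -> pd.DataFrame:
    """Hypothetical helper: add columns <target>_lag_1 ... <target>_lag_n."""
    out = df.copy()
    for i in range(1, n_lags + 1):
        # shift(i) moves values down by i rows, so row t holds the value at t-i
        out[f"{target}_lag_{i}"] = out[target].shift(i)
    # The first n_lags rows have no complete history, so they are dropped
    return out.dropna()


# Toy data (made-up numbers)
toy = pd.DataFrame(
    {"count": [10, 12, 9, 15, 11]},
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)
lagged = add_lag_features(toy, n_lags=2, target="count")
print(lagged)
```

Dropping the incomplete rows is one possible choice; filling them with a default value is another.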

In [None]:
# Example of lagged dataset
add_lags(df_USA, 3, 'count').head()

### Rolling statistics features

Further useful features can be derived from the lagged values by computing rolling statistics over a window of recent observations, such as the mean, maximum, and minimum.
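
A minimal sketch of rolling features with pandas follows. It assumes the statistics should only see values strictly before time $t$, hence the `shift(1)` before `rolling`. The helper name `add_rolling_features` and the column names are hypothetical, and the notebook's `create_rolling_features` may handle the window differently.

```python
import pandas as pd


def add_rolling_features(df: pd.DataFrame, target: str, windows: list) -> pd.DataFrame:
    """Hypothetical helper: rolling mean/max/min over past values only."""
    out = df.copy()
    # Shift by one so the window used for row t ends at t-1
    past = out[target].shift(1)
    for w in windows:
        roll = past.rolling(window=w)
        out[f"{target}_roll_mean_{w}"] = roll.mean()
        out[f"{target}_roll_max_{w}"] = roll.max()
        out[f"{target}_roll_min_{w}"] = roll.min()
    # Rows without a full window of history are dropped
    return out.dropna()


# Toy data (made-up numbers)
toy = pd.DataFrame(
    {"count": [10, 12, 9, 15, 11]},
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)
rolled = add_rolling_features(toy, "count", windows=[2, 3])
print(rolled)
```

Without the shift, the window for row $t$ would include $y_t$ itself, which would leak the target into the features.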

In [None]:
# Example of rolling dataset
create_rolling_features(df_USA, 'count', windows=[2,3]).head()

## Train models

Let us establish a routine for training various Machine Learning regression algorithms. The objective is to use the data available up to time $t$ to predict the value at time $t+1$. Since the dataset spans two years, we hold out one month for model validation and another month for testing.
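
A chronological split of this kind can be sketched as below. It assumes that "one month" is approximated by 30 daily rows and that the frame is indexed by date. The function `chronological_split` and the toy frame are illustrative and not part of the notebook's `utils` module.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, valid_days: int = 30, test_days: int = 30):
    """Hypothetical helper: oldest rows for training, newest for testing."""
    df = df.sort_index()
    holdout = valid_days + test_days
    train = df.iloc[:-holdout]
    valid = df.iloc[-holdout:-test_days]
    test = df.iloc[-test_days:]
    return train, valid, test


# Toy frame covering roughly two years of daily rows (made-up values)
dates = pd.date_range("2022-01-01", periods=730, freq="D")
toy = pd.DataFrame({"count": range(730)}, index=dates)
train, valid, test = chronological_split(toy)
print(len(train), len(valid), len(test))
```

The split is never shuffled: validation and test rows are always later in time than every training row, which mirrors how the model would be used in practice.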

The training pipeline is as follows:

 - Choose a time series from a specific country.

 - Determine the number of lagged values and the rolling windows used to generate the dataset from the chosen time series. These choices are treated as hyperparameters: several combinations are tested to identify the best configuration.

 - Train multiple regression models, tuning the hyperparameters of each on the validation set. The tuning is performed with [Optuna](https://github.com/optuna/optuna), an automatic hyperparameter optimization framework designed for machine learning.

 - Assess the performance of the best-performing models on the test set.
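
The search over lag and window configurations can be sketched as follows. To keep the example self-contained, a plain exhaustive loop and an ordinary least-squares fit with NumPy stand in for the notebook's regressors and its Optuna tuning. All names (`build_features`, `validation_mae`), the candidate grid, and the synthetic series are assumptions for illustration only.

```python
import itertools

import numpy as np
import pandas as pd


def build_features(series: pd.Series, n_lags: int, windows: list) -> pd.DataFrame:
    """Lag and rolling-mean features computed from past values only."""
    out = pd.DataFrame({"y": series})
    for i in range(1, n_lags + 1):
        out[f"lag_{i}"] = series.shift(i)
    for w in windows:
        out[f"roll_mean_{w}"] = series.shift(1).rolling(w).mean()
    return out.dropna()


def validation_mae(data: pd.DataFrame, valid_days: int) -> float:
    """Fit least squares on the older rows, score on the last valid_days rows."""
    train, valid = data.iloc[:-valid_days], data.iloc[-valid_days:]

    def design(part: pd.DataFrame) -> np.ndarray:
        # Feature matrix with an intercept column prepended
        X = part.drop(columns="y").to_numpy(dtype=float)
        return np.column_stack([np.ones(len(X)), X])

    y_train = train["y"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(design(train), y_train, rcond=None)
    pred = design(valid) @ coef
    return float(np.mean(np.abs(valid["y"].to_numpy(dtype=float) - pred)))


# Synthetic daily series with a weekly pattern (not the notebook's data)
rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-01", periods=200, freq="D")
values = 50 + 10 * np.sin(2 * np.pi * np.arange(200) / 7) + rng.normal(0, 1, 200)
series = pd.Series(values, index=dates)

# Try every (number of lags, windows) combination and keep the best one
candidates = itertools.product([1, 2, 3], [[2], [2, 3]])
scores = {
    (n_lags, tuple(windows)): validation_mae(
        build_features(series, n_lags, windows), valid_days=30
    )
    for n_lags, windows in candidates
}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Every configuration is scored on the same final 30 rows, so the validation errors are comparable. A held-out test set, omitted here for brevity, would then be used once for the selected configuration.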



In [None]:
# Spain
param_Spain = parameters_search(df_Spain, 5, 5)
param_Spain

In [None]:
# USA
param_USA = parameters_search(df_USA, 5, 5)
param_USA

In [None]:
# Singapore
param_Singapore = parameters_search(df_Singapore, 5, 5)
param_Singapore


In [None]:
# Germany
param_Germany = parameters_search(df_Germany, 5, 5)
param_Germany

In [None]:
# Japan
param_Japan = parameters_search(df_Japan, 5, 5)
param_Japan

## Evaluation

Let us use the optimal set of parameters to assess how well our models perform.

In [None]:
# Generate the best combination of features for each country
train_Spain, valid_Spain, test_Spain = create_best_combination_dataset(df_Spain, 5, [2, 3])
train_USA, valid_USA, test_USA = create_best_combination_dataset(df_USA, 1, [2, 3, 4, 5])
train_Singapore, valid_Singapore, test_Singapore = create_best_combination_dataset(df_Singapore, 2, [2, 3, 4])
train_Germany, valid_Germany, test_Germany = create_best_combination_dataset(df_Germany, 5, [2, 3, 4, 5])
train_Japan, valid_Japan, test_Japan = create_best_combination_dataset(df_Japan, 3, [2, 3])

In [None]:
_ = evaluate_models(train_Spain, valid_Spain, test_Spain, plot_figures=True, plot_feature_importance=True, use_PCA=False)

In [None]:
_ = evaluate_models(train_USA, valid_USA, test_USA, plot_figures=True, plot_feature_importance=True, use_PCA=False)

In [None]:
_ = evaluate_models(train_Singapore, valid_Singapore, test_Singapore, plot_figures=True, plot_feature_importance=True, use_PCA=False)

In [None]:
_ = evaluate_models(train_Germany, valid_Germany, test_Germany, plot_figures=True, plot_feature_importance=True, use_PCA=False)

In [None]:
_ = evaluate_models(train_Japan, valid_Japan, test_Japan, plot_figures=True, plot_feature_importance=True, use_PCA=False)

We can also incorporate additional features. At time $t$ we know both the attacks on the country and the status of all 255 sensors, so combining the two sources may improve the model's predictive capability. The evaluations that follow are run with `use_PCA=True` and without feature-importance plots.
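
One possible way to combine the two sources is to join lagged per-sensor columns onto the country frame by date. The sketch below assumes a wide sensor frame with one column per sensor and a shared date index. The function `join_lagged_sensors`, the sensor column names, and the toy values are hypothetical, and the notebook's `merge_all_sensors` may work differently.

```python
import pandas as pd


def join_lagged_sensors(country_df: pd.DataFrame, sensors_df: pd.DataFrame, n_lags: int) -> pd.DataFrame:
    """Hypothetical helper: append lagged sensor columns, aligned on the date index."""
    lagged = [
        sensors_df.shift(i).add_suffix(f"_lag_{i}") for i in range(1, n_lags + 1)
    ]
    merged = country_df.join(pd.concat(lagged, axis=1), how="left")
    # Rows without complete sensor history are dropped
    return merged.dropna()


# Toy data (made-up numbers, two sensors instead of 255)
dates = pd.date_range("2023-01-01", periods=5, freq="D")
country = pd.DataFrame({"count": [10, 12, 9, 15, 11]}, index=dates)
sensors = pd.DataFrame(
    {"sensor_a": [1, 2, 3, 4, 5], "sensor_b": [5, 4, 3, 2, 1]}, index=dates
)
merged = join_lagged_sensors(country, sensors, n_lags=2)
print(merged)
```

With hundreds of sensors and several lags, the number of columns grows quickly, which is one reason a dimensionality-reduction step such as PCA can be useful before fitting the models.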

In [None]:
add_all_sensors_data = True

if add_all_sensors_data:

    all_sensors_df = all_sensors_data(df)
    # For Spain
    train_Spain = merge_all_sensors(all_sensors_df, train_Spain, 3)
    valid_Spain = merge_all_sensors(all_sensors_df, valid_Spain, 3)
    test_Spain = merge_all_sensors(all_sensors_df, test_Spain, 3)

    # For USA
    train_USA = merge_all_sensors(all_sensors_df, train_USA, 3)
    valid_USA = merge_all_sensors(all_sensors_df, valid_USA, 3)
    test_USA = merge_all_sensors(all_sensors_df, test_USA, 3)

    # For Singapore
    train_Singapore = merge_all_sensors(all_sensors_df, train_Singapore, 3)
    valid_Singapore = merge_all_sensors(all_sensors_df, valid_Singapore, 3)
    test_Singapore = merge_all_sensors(all_sensors_df, test_Singapore, 3)

    # For Germany
    train_Germany = merge_all_sensors(all_sensors_df, train_Germany, 3)
    valid_Germany = merge_all_sensors(all_sensors_df, valid_Germany, 3)
    test_Germany = merge_all_sensors(all_sensors_df, test_Germany, 3)

    # For Japan
    train_Japan = merge_all_sensors(all_sensors_df, train_Japan, 3)
    valid_Japan = merge_all_sensors(all_sensors_df, valid_Japan, 3)
    test_Japan = merge_all_sensors(all_sensors_df, test_Japan, 3)

In [None]:
_ = evaluate_models(train_Spain, valid_Spain, test_Spain, plot_figures=True, plot_feature_importance=False, use_PCA=True)

In [None]:
_ = evaluate_models(train_USA, valid_USA, test_USA, plot_figures=True, plot_feature_importance=False, use_PCA=True)

In [None]:
_ = evaluate_models(train_Singapore, valid_Singapore, test_Singapore, plot_figures=True, plot_feature_importance=False, use_PCA=True)

In [None]:
_ = evaluate_models(train_Germany, valid_Germany, test_Germany, plot_figures=True, plot_feature_importance=False, use_PCA=True)

In [None]:
_ = evaluate_models(train_Japan, valid_Japan, test_Japan, plot_figures=True, plot_feature_importance=False, use_PCA=True)

## Bonus Section

Let us now take a European perspective: can we forecast the number of cyberattacks expected in Europe for the upcoming month?

In [None]:
# Let us read European data and visualize it as a time series

df_EU = select_continent(df, 'EU')

daily_count_EU = visualize_ts(df_EU)

In [None]:
# Create the baseline dataset
df_EU = create_baseline_dataset(daily_count_EU)

In [None]:
# Find the best combination of parameters
param_EU = parameters_search(df_EU, 5, 5)
param_EU

In [None]:
# Evaluate the best combination of features
train_EU, valid_EU, test_EU = create_best_combination_dataset(df_EU, 5, [2, 3, 4, 5])
_ = evaluate_models(train_EU, valid_EU, test_EU, plot_figures=True, plot_feature_importance=True, use_PCA=False)