# Time Series Forecasting. Classical approach.

## Introduction

Welcome to this notebook, in which we explore a machine learning approach to predicting the number of cyberattacks a country may face in the following month. Supervised models can be applied to time series as long as seasonality is encoded as explicit variables, for example the year, the month, or the day of the week. These variables serve as the features X of the supervised model, and the target y is the observed value of the time series. Lagged versions of y (its past values) can also be added to X to capture autocorrelation effects.


## Table of Contents

1. Time Series Visualization

2. Dataset Construction
    - 2.1 Baseline
    - 2.2 Lagged features
    - 2.3 Rolling statistics features

3. Train models

4. Evaluation

In [None]:
# Required imports
import pandas as pd
from utils import *

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

## Time Series Visualization

Let us read the data and visualize it as a time series.

In [None]:
# Read data
df1 = pd.read_csv('../Data/21_november_to_april.csv')
df2 = pd.read_csv('../Data/22_april_to_november.csv')
df3 = pd.read_csv('../Data/22_november_to_april.csv')
df4 = pd.read_csv('../Data/23_april_to_november.csv')

# Concatenate dataframes
df = pd.concat([df1, df2, df3, df4], axis=0, ignore_index=True)

# Delete dataframes
del df1, df2, df3, df4

In [None]:
# Select some countries to analyze
df_Spain = select_country(df, 'Spain')
df_USA = select_country(df, 'United States')
df_Singapore = select_country(df, 'Singapore')
df_Germany = select_country(df, 'Germany')
df_Japan = select_country(df, 'Japan')

In [None]:
daily_count_Spain = visualize_ts(df_Spain)

In [None]:
daily_count_USA = visualize_ts(df_USA)

In [None]:
daily_count_Singapore = visualize_ts(df_Singapore)

In [None]:
daily_count_Germany = visualize_ts(df_Germany)

In [None]:
daily_count_Japan = visualize_ts(df_Japan)

## Dataset Construction

Let us generate the features used to forecast the number of cyberattacks a country might face in the future. We begin with calendar features such as the year, the month, and the day. Together with the target, these form the baseline dataset for our analysis.

In [None]:
# Create the baseline dataset (temporal information only)
df_Spain = create_baseline_dataset(daily_count_Spain)
df_USA = create_baseline_dataset(daily_count_USA)
df_Singapore = create_baseline_dataset(daily_count_Singapore)
df_Germany = create_baseline_dataset(daily_count_Germany)
df_Japan = create_baseline_dataset(daily_count_Japan)

# Preview df_USA (the first column is the target variable)
df_USA.head()
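
The helper `create_baseline_dataset` comes from `utils` and its body is not shown in this notebook. As a rough illustration of the idea only, the sketch below builds calendar columns from a daily count series. The function name `baseline_features_sketch`, the column names other than `count`, and the assumption that the input is a pandas Series with a datetime index are all hypothetical and may differ from the real helper.

```python
import pandas as pd


def baseline_features_sketch(daily_count: pd.Series) -> pd.DataFrame:
    """Hypothetical stand-in for create_baseline_dataset.

    Assumes `daily_count` is a Series of daily attack counts
    indexed by date. The target is kept as the first column.
    """
    out = pd.DataFrame({"count": daily_count})
    idx = pd.DatetimeIndex(out.index)
    # Calendar features that let a supervised model pick up seasonality
    out["year"] = idx.year
    out["month"] = idx.month
    out["day"] = idx.day
    out["dayofweek"] = idx.dayofweek  # Monday=0, Sunday=6
    return out


# Toy data, invented purely for illustration
dates = pd.date_range("2023-01-01", periods=5, freq="D")
toy = pd.Series([10, 12, 9, 15, 11], index=dates, name="count")
features = baseline_features_sketch(toy)
print(features)
```

Keeping the target in the first column mirrors the layout noted in the cell above, so later steps can split features and target by position or by name.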

### Lagged features

A valuable feature for anticipating the number of attacks a country might experience in the future is the historical count of attacks. To forecast the number of attacks at a given time, say $t$, we can use information on the number of cyberattacks at an earlier time $t-i$, where $i\geq 1$.

In [None]:
# Example of lagged dataset
add_lags(df_USA, 3, 'count').head()
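
To make the lag construction concrete, here is a minimal sketch of how lag columns for $t-1, \dots, t-n$ can be built with `Series.shift`. It is not the `add_lags` implementation from `utils`; the function name `add_lags_sketch` and the `count_lag_i` column naming are assumptions made for this example.

```python
import pandas as pd


def add_lags_sketch(df: pd.DataFrame, n_lags: int, target: str) -> pd.DataFrame:
    """Hypothetical stand-in for add_lags: one column per lag 1..n_lags."""
    out = df.copy()
    for i in range(1, n_lags + 1):
        # shift(i) moves values down by i rows, so row t holds the value at t - i
        out[f"{target}_lag_{i}"] = out[target].shift(i)
    return out


# Toy data, invented purely for illustration
toy = pd.DataFrame(
    {"count": [10, 12, 9, 15, 11]},
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)
lagged = add_lags_sketch(toy, 3, "count")
print(lagged)
```

The first `n_lags` rows contain missing values because no earlier observations exist for them. Those rows usually need to be dropped or imputed before training.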

### Rolling statistics features

Further useful features can be derived from the lagged values by summarizing them over a window of recent observations, for example with their mean, maximum, or minimum.

In [None]:
# Example of rolling dataset
create_rolling_features(df_USA, 'count', windows=[2,3]).head()
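
The sketch below shows one way such rolling statistics could be computed. It is an illustration under stated assumptions, not the `create_rolling_features` code from `utils`: the function name `rolling_features_sketch`, the column naming, and the choice of mean, maximum, and minimum are hypothetical. The series is shifted by one step before the window is applied, so each row only summarizes values strictly before time $t$ and the target does not leak into its own features.

```python
import pandas as pd


def rolling_features_sketch(df: pd.DataFrame, target: str, windows: list) -> pd.DataFrame:
    """Hypothetical stand-in for create_rolling_features."""
    out = df.copy()
    # Use only past values: row t sees observations up to t - 1
    past = out[target].shift(1)
    for w in windows:
        roll = past.rolling(window=w)
        out[f"{target}_roll_mean_{w}"] = roll.mean()
        out[f"{target}_roll_max_{w}"] = roll.max()
        out[f"{target}_roll_min_{w}"] = roll.min()
    return out


# Toy data, invented purely for illustration
toy = pd.DataFrame(
    {"count": [10, 12, 9, 15, 11]},
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)
rolled = rolling_features_sketch(toy, "count", windows=[2, 3])
print(rolled)
```

With a window of size `w`, the first `w` rows are missing in the corresponding columns, since a full window of past values is not yet available.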

## Train models