# Chapter 12: Time Series Models

## 12.1 Introduction

- so far only static data, the measurements do not change over time
- possible to transform non-static data into static data (to some extent) 
  $\Rightarrow$ seldom optimal


## 12.2 Stationarity

- no change over time
- unconditional joint probability distribution of its parameters does not change over time
  $\Rightarrow$ **strict stationarity**
- more practical definition: mean and variance do not change over time (formally, the autocovariance depends only on the time lag)
  $\Rightarrow$ **weak stationarity**, also called wide-sense stationarity
- most processes encountered in real life are non-stationary, but are modelled as stationary processes
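
A rough way to probe weak stationarity is to compare the mean and variance of different parts of a series. The sketch below is only an informal check, not a statistical test; the helper name `half_stats`, the series lengths, and the seed are assumptions made for illustration. It contrasts white noise (stationary) with a random walk (non-stationary).

```python
import random
import statistics

def half_stats(series):
    """Return ((mean, variance) of first half, (mean, variance) of second half)."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return ((statistics.fmean(first), statistics.pvariance(first)),
            (statistics.fmean(second), statistics.pvariance(second)))

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
white_noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

# A random walk is the running sum of white noise, so it is not stationary.
random_walk = []
total = 0.0
for e in white_noise:
    total += e
    random_walk.append(total)

(wn_m1, wn_v1), (wn_m2, wn_v2) = half_stats(white_noise)
(rw_m1, rw_v1), (rw_m2, rw_v2) = half_stats(random_walk)

print("white noise:", (wn_m1, wn_v1), (wn_m2, wn_v2))
print("random walk:", (rw_m1, rw_v1), (rw_m2, rw_v2))
```

For white noise the two halves should give nearly identical statistics, while for a single random-walk realization they typically differ noticeably.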


## 12.3 Autoregressive Moving Average Models

- short: ARMA analysis
- simple technique of univariate time series analysis
- based on two separate concepts: *autoregression* and *moving average*
- Define a system: 
  * discrete time system 
  * white noise inputs denoted as $\epsilon_i , i = 1,...,n$
  * $i$ denotes the instance of time
  * output of the system denoted as $x_i, i = 1,...,n$
  * all these variables are assumed to be univariate and numerical


### 12.3.1 Autoregressive Process


An autoregressive or AR process is a process in which the current output of the system is a weighted sum of a certain number of previous outputs plus the current white noise input.
The AR process is not necessarily stationary.

AR process of order p, $AR(p)$

\begin{equation}
x_i = \sum_{j=1}^{p} \alpha_j x_{i-j} + \epsilon_i
\end{equation}

with

- $\epsilon_i$: error/residual term at instance $i$
- $x_i$: output of system
- $\alpha_j$: coefficients/parameters of the AR process
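
The recursion can be simulated directly. This is a minimal sketch under a few assumptions that are not part of the notes: the function name `simulate_ar` is hypothetical, the noise is standard Gaussian, and outputs before the first instance are treated as zero.

```python
import random

def simulate_ar(alphas, n, seed=0):
    """Simulate x_i = sum_j alphas[j-1] * x_{i-j} + eps_i for i = 0..n-1.

    Outputs before the start of the series are assumed to be 0.
    """
    rng = random.Random(seed)
    x = []
    for i in range(n):
        eps = rng.gauss(0.0, 1.0)  # white noise input at instance i
        past = sum(a * x[i - j]
                   for j, a in enumerate(alphas, start=1) if i - j >= 0)
        x.append(past + eps)
    return x

# AR(2) with illustrative coefficients alpha_1 = 0.5, alpha_2 = -0.25
series = simulate_ar([0.5, -0.25], n=200)
print(series[:5])
```

With an empty coefficient list the function returns the white noise itself, which is a quick sanity check of the recursion.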

Time lag operator L:

$L x_i = x_{i-1} \quad \forall i$

For the *k*th order lag

$L^k x_i = x_{i-k} \quad \forall i$

Using the lag operator, the AR(p) process can be written as

\begin{equation}
\left(1 - \sum_{j=1}^{p} \alpha_j L^j\right) x_i = \epsilon_i
\end{equation}
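
As a small worked example (an illustration added here, with $p = 2$ chosen arbitrarily), applying the lag operator term by term recovers the recursive form:

\begin{equation}
\left(1 - \alpha_1 L - \alpha_2 L^2\right) x_i = x_i - \alpha_1 x_{i-1} - \alpha_2 x_{i-2} = \epsilon_i
\quad \Rightarrow \quad
x_i = \alpha_1 x_{i-1} + \alpha_2 x_{i-2} + \epsilon_i
\end{equation}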

### 12.3.2 Moving Average Process

- always stationary
- current output is a moving average of a certain number of past states of the default white noise process

Moving average process of order *q*:

\begin{equation}
x_i = \epsilon_i + \beta_1 \epsilon_{i-1} + ... + \beta_q \epsilon_{i-q}
\end{equation}

- $\beta_j$: coefficients/parameters of the MA process
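
The MA(q) equation translates almost literally into code. The sketch below assumes standard Gaussian white noise and uses a hypothetical helper name `simulate_ma`; it draws `q` extra noise values up front so that every output has its full set of past noise terms.

```python
import random

def simulate_ma(betas, n, seed=0):
    """Simulate x_i = eps_i + sum_j betas[j-1] * eps_{i-j} for n outputs."""
    rng = random.Random(seed)
    q = len(betas)
    # q warm-up noise values so that eps_{i-q} exists for the first output
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + q)]
    x = []
    for i in range(q, n + q):
        weighted_past = sum(b * eps[i - j] for j, b in enumerate(betas, start=1))
        x.append(eps[i] + weighted_past)
    return x

# MA(2) with illustrative coefficients beta_1 = 0.6, beta_2 = 0.3
series = simulate_ma([0.6, 0.3], n=200)
print(series[:5])
```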


### 12.3.3 Autoregressive Moving Average Process

\begin{equation}
x_i = \alpha_1 x_{i-1} + ... + \alpha_p x_{i-p} + \epsilon_i + \beta_1 \epsilon_{i-1} + ... + \beta_q \epsilon_{i-q}
\end{equation}
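
Combining the AR and MA parts gives a short simulation routine. As before, this is only a sketch: `simulate_arma` is a hypothetical name, the noise is assumed to be standard Gaussian, and outputs and noise before the first instance are treated as zero.

```python
import random

def simulate_arma(alphas, betas, n, seed=0):
    """Simulate an ARMA(p, q) process with p = len(alphas), q = len(betas)."""
    rng = random.Random(seed)
    x, eps = [], []
    for i in range(n):
        eps.append(rng.gauss(0.0, 1.0))
        # autoregressive part: weighted sum of past outputs
        ar = sum(a * x[i - j]
                 for j, a in enumerate(alphas, start=1) if i - j >= 0)
        # moving average part: weighted sum of past noise terms
        ma = sum(b * eps[i - j]
                 for j, b in enumerate(betas, start=1) if i - j >= 0)
        x.append(ar + eps[i] + ma)
    return x

# ARMA(1, 1) with illustrative coefficients alpha_1 = 0.5, beta_1 = 0.3
series = simulate_arma([0.5], [0.3], n=200)
print(series[:5])
```

Passing an empty `betas` list yields a pure AR process, and an empty `alphas` list yields a pure MA process.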



## 12.4 Autoregressive Integrated Moving Average (ARIMA) Models

Although an ARMA(p, q) process can in general be non-stationary, it cannot explicitly model a non-stationary process well.
$\Rightarrow$ ARIMA adds differencing terms to the equation.
The differencing process explicitly tries to remove trends or seasonalities from the data to make the residual process stationary.

The differencing operation, as the name suggests, computes the deltas between consecutive values of the output as

\begin{equation}
x_d(i) = x(i) - x(i-1)
\end{equation}

This is the first-order difference.

It generalizes to an arbitrary order $r$ as

\begin{equation}
x_d^{(r)}(i) = {(1 - L)}^r x_i
\end{equation}
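
Differencing of order $r$ amounts to applying the first-order difference $r$ times. The helper below (`difference` is a hypothetical name) sketches this and shows how a linear trend disappears after one pass and a quadratic trend after two.

```python
def difference(series, order=1):
    """Apply (1 - L)^order by repeated first-order differencing."""
    out = list(series)
    for _ in range(order):
        # x_d(i) = x(i) - x(i-1); each pass shortens the series by one value
        out = [cur - prev for prev, cur in zip(out, out[1:])]
    return out

linear = [3 * i + 2 for i in range(6)]   # 2, 5, 8, 11, 14, 17
quadratic = [i * i for i in range(6)]    # 0, 1, 4, 9, 16, 25

print(difference(linear))         # [3, 3, 3, 3, 3]
print(difference(quadratic, 2))   # [2, 2, 2, 2]
```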

Full equation of an ARIMA process, ARIMA(p, q, r):

\begin{equation}
\left(1 - \sum_{j=1}^p \alpha_j L^j \right) {(1-L)}^r x_i = \left(1 + \sum_{j=1}^q \beta_j L^j \right) \epsilon_i
\end{equation}

When the value of $r$ is 0, the ARIMA process reduces to an ARMA process.
Similarly, when $r$ and $q$ are 0, the ARIMA process reduces to an AR process, and when $r$ and $p$ are 0, it reduces to an MA process.
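
As a worked example (an illustration with $p = q = r = 1$, chosen only for concreteness), expanding the operator form gives an explicit recursion:

\begin{equation}
\left(1 - \alpha_1 L\right)\left(1 - L\right) x_i = \left(1 - (1 + \alpha_1) L + \alpha_1 L^2\right) x_i = \left(1 + \beta_1 L\right) \epsilon_i
\end{equation}

\begin{equation}
x_i = (1 + \alpha_1)\, x_{i-1} - \alpha_1 x_{i-2} + \epsilon_i + \beta_1 \epsilon_{i-1}
\end{equation}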


## 12.5 Implementing AR, MA, ARMA and ARIMA in Python

- sklearn does not offer tools directly
- other libraries: Darts, Prophet
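
For intuition about what such libraries do internally, the simplest case can be written by hand. The sketch below is not the API of Darts or Prophet; it is a plain least-squares estimate of the single coefficient of an AR(1) process, with the helper name `fit_ar1`, the synthetic data, and the seed all chosen for illustration.

```python
import random

def fit_ar1(series):
    """Least-squares estimate of alpha_1 in x_i = alpha_1 * x_{i-1} + eps_i."""
    num = sum(prev * cur for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den

# Generate synthetic AR(1) data with a known coefficient, then try to recover it.
rng = random.Random(42)
true_alpha = 0.7
x = [0.0]
for _ in range(5000):
    x.append(true_alpha * x[-1] + rng.gauss(0.0, 1.0))

estimate = fit_ar1(x)
print(f"true alpha = {true_alpha}, estimated alpha = {estimate:.3f}")
```

Real libraries generalize this idea to higher orders, MA terms, and differencing, and usually rely on maximum likelihood rather than plain least squares.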


## 12.6 Hidden Markov Models (HMMs)

## 12.7 Conditional Random Fields (CRFs)

CRFs directly try to model the conditional probabilities of the observations based on the assumptions of similar hidden states. The fundamental function for CRF can be stated as

\begin{equation}
\hat{y} = \arg\max_{y} P(y \mid X)
\end{equation}

In order to model the sequential input and states, CRF introduces feature functions. The feature function is defined based on four entities.

1. Input vectors **X**
1. Instance *i* of the data point being predicted
1. Label for the data point at the (*i* - 1)th instance, $l_{i-1}$
1. Label for the data point at the *i*th instance, $l_i$

The function is then given as 

\begin{equation}
f(X, i, l_{i-1}, l_i)
\end{equation}

Using this feature function, the conditional probability is written as 

\begin{equation}
P(y \mid X, \lambda) = \frac{1}{Z(X)} \exp\left(\sum_{i=1}^n \sum_j \lambda_j f_j(X, i, y_{i-1}, y_i) \right)
\end{equation}

where the normalization constant Z(X) sums the same exponential term over all possible label sequences $\hat{y}$:

\begin{equation}
Z(X) = \sum_{\hat{y}} \exp\left(\sum_{i=1}^n \sum_j \lambda_j f_j(X, i, \hat{y}_{i-1}, \hat{y}_i) \right)
\end{equation}
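
For a very short sequence, the conditional probability and the normalization constant can be computed by brute-force enumeration of all label sequences. The sketch below is a toy illustration only: the label set, the two feature functions, and the weights are all hypothetical, and practical CRF implementations use dynamic programming instead of enumeration.

```python
import itertools
import math

LABELS = ("A", "B")  # hypothetical label set

# Hypothetical feature functions f_j(X, i, y_prev, y_cur);
# y_prev is None for the first instance.
def f_emit(X, i, y_prev, y_cur):
    """Fires when a positive input is labelled 'A' or a non-positive one 'B'."""
    return 1.0 if (X[i] > 0) == (y_cur == "A") else 0.0

def f_stay(X, i, y_prev, y_cur):
    """Fires when the label is unchanged between consecutive instances."""
    return 1.0 if y_prev == y_cur else 0.0

FEATURES = (f_emit, f_stay)
LAMBDAS = (2.0, 0.5)  # hypothetical weights lambda_j

def score(X, y):
    """Inner double sum: sum_i sum_j lambda_j * f_j(X, i, y_{i-1}, y_i)."""
    total = 0.0
    for i in range(len(X)):
        y_prev = y[i - 1] if i > 0 else None
        for lam, f in zip(LAMBDAS, FEATURES):
            total += lam * f(X, i, y_prev, y[i])
    return total

def conditional_probabilities(X):
    """Return P(y | X) for every label sequence y by explicit enumeration."""
    sequences = list(itertools.product(LABELS, repeat=len(X)))
    weights = [math.exp(score(X, y)) for y in sequences]
    Z = sum(weights)  # normalization constant Z(X)
    return {y: w / Z for y, w in zip(sequences, weights)}

X = [1.2, 0.4, -0.7]
probs = conditional_probabilities(X)
y_hat = max(probs, key=probs.get)  # arg max over y of P(y | X)
print(y_hat, probs[y_hat])
```

The enumeration grows exponentially with the sequence length, which is why it is only suitable for checking the formulas on tiny inputs.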