<a href="https://colab.research.google.com/github/cagBRT/timeSeries/blob/main/3b_TestingForStationarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

By the end of this notebook the student will be able to: <br>
1. test a time series for stationarity
2. identify white noise

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/timeSeries.git cloned-repo
%cd cloned-repo

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# **Methods for testing for stationarity**<br>
The stationarity of a series can be assessed informally by looking at a plot of the series.<br>
For a rough numerical check, split the series into two or more contiguous parts and compare summary statistics of each part, such as: <br>
- the mean 
- the variance 
- autocorrelation<br>

**If the stats are quite different, then the series is not likely to be stationary.**<br>
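
The split-and-compare check can be sketched in a few lines. This is a minimal illustration on made-up data (a linear trend plus Gaussian noise); `split_stats` is a hypothetical helper name, not a library function.

```python
import numpy as np

def split_stats(values, n_parts=2):
    """Return (mean, variance) for each contiguous chunk of the series."""
    chunks = np.array_split(np.asarray(values, dtype=float), n_parts)
    return [(chunk.mean(), chunk.var()) for chunk in chunks]

# Hypothetical data: an upward linear trend plus Gaussian noise.
rng = np.random.default_rng(0)
trending = 0.05 * np.arange(200) + rng.normal(0.0, 1.0, size=200)

trend_stats = split_stats(trending)
for i, (mean, var) in enumerate(trend_stats, start=1):
    print(f"part {i}: mean={mean:.2f}, variance={var:.2f}")
```

The two halves have clearly different means, which is the kind of gap that suggests the series is not stationary.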


To determine quantitatively whether a given series is stationary, use statistical tests known as 'unit root tests'. <br>
There are several variants; each checks whether the time series possesses a unit root.<br>
The presence of a unit root means the time series is non-stationary. <br>
**The number of unit roots contained in the series corresponds to the number of differencing operations required to make the series stationary**
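
As a rough illustration of the link between unit roots and differencing, the sketch below builds a random walk (one unit root) from simulated noise and differences it once. The data and the helper `lag1_autocorr` are assumptions made for this example only.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

rng = np.random.default_rng(1)
noise = rng.normal(0.0, 1.0, size=1000)

walk = np.cumsum(noise)   # random walk: cumulative sum of the noise
diffed = np.diff(walk)    # one differencing step gives back noise[1:]

print(f"lag-1 autocorrelation, random walk:  {lag1_autocorr(walk):.3f}")
print(f"lag-1 autocorrelation, differenced:  {lag1_autocorr(diffed):.3f}")
```

The random walk is strongly correlated with its own past, while a single difference returns the underlying noise, so one differencing operation is enough here.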

Several tests of this kind are available. <br>
We will look at two:<br>

- Augmented Dickey-Fuller test (ADF test)
- Kwiatkowski-Phillips-Schmidt-Shin test (KPSS test), whose null hypothesis is stationarity rather than a unit root


The most commonly used is the ADF test, where the null hypothesis is:<br>
>the time series possesses a unit root and is non-stationary. <br>

If the p-value of the ADF test is less than the significance level (0.05), you reject the null hypothesis.<br>
**Equivalently, when the test statistic is less than (more negative than) the critical value shown, you reject the null hypothesis and infer that the time series is stationary.**
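
The ADF decision rule can be sketched as a small helper. `interpret_adf` is a hypothetical function written for this notebook, not part of statsmodels, and the numbers passed to it are illustrative rather than output from a real dataset. The ADF test is left-tailed, so evidence against a unit root is a statistic that is more negative than the critical value.

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Summarise an ADF result (hypothetical helper, not a statsmodels API)."""
    # Left-tailed test: collect the levels whose critical value the
    # statistic falls below (is more negative than).
    below = [level for level, crit in critical_values.items() if statistic < crit]
    return {
        "reject_unit_root": p_value < alpha,  # True -> treat series as stationary
        "below_critical_at": below,
    }

# Illustrative critical values and results only.
crit = {"1%": -3.43, "5%": -2.86, "10%": -2.57}
print(interpret_adf(-3.10, 0.03, crit))
print(interpret_adf(-1.20, 0.67, crit))
```

In a real run the arguments would come from the tuple returned by `adfuller`: the statistic, the p-value, and the dictionary of critical values.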



**Get, prepare, and plot the data**

In [None]:
# squeeze=True was removed from read_csv in pandas 2.0; squeeze the result instead.
df = pd.read_csv('timeSeriesExample.csv', header=0, index_col=0, parse_dates=['date']).squeeze('columns')
df.head()

In [None]:
from matplotlib import pyplot
df.plot()
pyplot.show()

**The ADF test**

In [None]:
from statsmodels.tsa.stattools import adfuller, kpss
df = pd.read_csv('timeSeriesExample.csv', parse_dates=['date'])
# ADF test
# autolag='AIC' lets adfuller choose the number of lags
# that yields the lowest AIC.
# This is usually a good default.
result = adfuller(df.value.values, autolag='AIC')
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
print('Critical Values:')
for key, value in result[4].items():
    print(f'   {key}, {value}')

# When the p-value is below the significance level of 0.05
# (equivalently, the ADF statistic is more negative than the 5% critical value),
# reject the null hypothesis and treat the series as stationary.
# Otherwise the null hypothesis of a unit root cannot be rejected.


**Assignment 1**<br>
Test the following series for stationarity using the ADF test<br>
Plot the series to visually confirm your answer

In [None]:
#Assignment 1
#Use the MARUTI.csv dataset, column VWAP

In [None]:
#@title 
df = pd.read_csv('/content/cloned-repo/MARUTI.csv')
# Work on a copy so the column assignment below does not trigger SettingWithCopyWarning.
df_vwap = df[['Date', 'VWAP']].copy()
df_vwap['Date'] = pd.to_datetime(df_vwap['Date'])
df_vwap.set_index('Date', inplace=True)

result = adfuller(df_vwap['VWAP'], autolag='AIC')
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
print('Critical Values:')
for key, value in result[4].items():
    print(f'   {key}, {value}')


In [None]:
#Plot assignmentSeries


In [None]:
#@title 
fig, axes = plt.subplots(figsize=(10,7))
plt.plot(df_vwap);
plt.title('Maruti');

**The KPSS test**

The KPSS test checks for stationarity around a constant level (`regression='c'`) or around a deterministic trend (`regression='ct'`, trend stationarity). Its null hypothesis and p-value interpretation are the opposite of those of the ADF test. <br>
Like the ADF test, the KPSS test is commonly used to analyse the stationarity of a series. However, it has a couple of key differences from the ADF test in function and in practical usage. <br>
**Therefore, it is not safe to use the two tests interchangeably.**

The key difference is that the null hypothesis of the KPSS test is that the series is stationary.

So the p-values of the two tests are interpreted in opposite ways.

**For the KPSS test, if the p-value is below the significance level (say 0.05), the series is non-stationary.**
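
Because the two null hypotheses point in opposite directions, it can help to read the results side by side. The helper below is a hypothetical sketch (`compare_adf_kpss` is not a library function) that takes the two p-values and reports whether the tests agree.

```python
def compare_adf_kpss(adf_p_value, kpss_p_value, alpha=0.05):
    """Combine ADF and KPSS p-values into one verdict (hypothetical helper)."""
    # ADF: null = unit root, so REJECTING the null suggests stationarity.
    adf_stationary = adf_p_value < alpha
    # KPSS: null = stationary, so FAILING to reject suggests stationarity.
    kpss_stationary = kpss_p_value >= alpha

    if adf_stationary and kpss_stationary:
        return "both tests point to stationarity"
    if not adf_stationary and not kpss_stationary:
        return "both tests point to non-stationarity"
    return "tests disagree; inspect the series further"

# Illustrative p-values only.
print(compare_adf_kpss(0.01, 0.10))
print(compare_adf_kpss(0.60, 0.01))
print(compare_adf_kpss(0.01, 0.01))
```

Note that the `kpss` function in statsmodels looks its p-values up in a table, so the reported value is bounded between 0.01 and 0.1.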

In [None]:
# KPSS Test
df = pd.read_csv('timeSeriesExample.csv', parse_dates=['date'])
result = kpss(df.value.values, regression='c')
print('\nKPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[3].items():
    print(f'   {key}, {value}')

**Assignment 2**<br>
Use the KPSS test to determine whether the Maruti data series is stationary.

In [None]:
#@title 
result = kpss(df_vwap['VWAP'], regression='c')
print('\nKPSS Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[3].items():
    print(f'   {key}, {value}')

There is more to these tests and to interpreting their results. <br>
If you intend to use them, check out these links:<br>

[ADF](https://www.machinelearningplus.com/time-series/augmented-dickey-fuller-test/)<br>

[KPSS](https://www.machinelearningplus.com/time-series/kpss-test-for-stationarity/)

# **White Noise**

White noise is an important concept in time series analysis and forecasting.

- **Predictability**: If your time series is white noise, then, by definition, it is random. You cannot reasonably model it and make predictions.
- **Model Diagnostics**: The series of errors from a time series forecast model should ideally be white noise.

Like a stationary series, white noise is not a function of time: its mean and variance do not change over time. The difference is that **white noise is completely random, with a mean of 0 and no correlation between values**.

In white noise there is no pattern whatsoever. If you consider the sound signals in an FM radio as a time series, the blank sound you hear between the channels is white noise.

Mathematically, white noise is a sequence of uncorrelated random numbers with mean zero and constant variance.



In [None]:
from random import gauss
from random import seed
from pandas import Series
from pandas.plotting import autocorrelation_plot
# seed random number generator
seed(1)
# create white noise series
series = [gauss(0.0, 1.0) for i in range(1000)]
series = Series(series)

The mean is nearly 0.0 and the standard deviation is nearly 1.0. <br>
Some deviation from these values is expected because the sample is finite.

In [None]:
print(series.describe())

A line plot of the series shows no visible structure; it appears to be random noise.

In [None]:
# line plot
series.plot()
pyplot.show()

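Create a histogram of the series.<br>
The bell-shaped curve shows that the values follow a Gaussian distribution. On its own this does not establish white noise, which also requires that the values be uncorrelated.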

In [None]:
series.hist()
pyplot.show()

The correlogram does not show any obvious autocorrelation pattern.<br>

In [None]:
autocorrelation_plot(series)
pyplot.show()

Autocorrelation measures the similarity between a time series and a lagged copy of itself, that is, the relationship between a variable's current value and its past values. <br>
It helps reveal repeating patterns, such as trend or seasonality, in the series.
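
To make the definition concrete, here is a minimal sketch of the usual sample autocorrelation estimator written with NumPy. The function name `autocorr` and the alternating test series are assumptions for illustration. For white noise, sample autocorrelations at non-zero lags are expected to fall mostly within roughly ±1.96/√n.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if lag == 0:
        return 1.0
    # Covariance between the series and its lagged copy, scaled by the variance.
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

# Hypothetical series that flips sign at every step: a strong pattern.
alternating = np.array([1.0, -1.0] * 50)

r1 = autocorr(alternating, 1)   # neighbours are opposite -> close to -1
r2 = autocorr(alternating, 2)   # two steps apart are equal -> close to +1
bound = 1.96 / np.sqrt(len(alternating))  # approximate 95% band for white noise

print(f"lag 1: {r1:.2f}, lag 2: {r2:.2f}, white-noise band: +/-{bound:.3f}")
```

Both values lie far outside the band, which is how a correlogram signals that a series has structure and is not white noise.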