# Handling and Processing Time Series Data

## What is a "Time Series"

Time series data are observations recorded in chronological order. Some examples from different fields:

    1) Economics - Time series indicators are used to judge whether a country's economy is improving. Examples: GDP, unemployment, inflation, CPI, poverty incidence
    2) Finance - Day traders who rely on technical analysis are looking at time series of stock prices
    3) Physics - Sine and cosine waves, which are heavily used to describe sound and how sound waves propagate over time
    
    
    
## Why is it important to look at Time series data?

    1) To understand the process behind a specific indicator
    2) To simulate what might happen in the future and put controls in place to avoid (or beat) the forecast
    3) To learn which factors affect the series being analyzed, e.g. whether one series drives the fluctuations of another (demand for ice cream vs. temperature)
    
    
## Some Samples

`DJIA 30 Stock Time Series` - Average stock price of specific companies included in the Dow Jones Index (used to assess market health).

![Stocks](https://miro.medium.com/max/1400/0*bCS3EWiVfLIZqwIW.gif)

`Voice sampled every .2 seconds` - Average frequency and amplitude of a voice sampled at fixed intervals. Time series representations of voice are used to build voice recognition models.


![Voice](https://miro.medium.com/max/700/1*80fnKgU_07EzqXWkJXZrQQ.png)

`Inflation Across Time Among the ASEAN Countries` - Shows the inflation trend of different countries across time, and gives a sense of how bad inflation has been in the Philippines compared with its neighbors.

![Inflation](https://assets.rappler.co/4A0970D12C774D96850FBCDB5D2707AA/img/86CDAB18C8304B6CB5DACBF8CA471CBC/Fig_1_aseaninf.png)




## Difference in handling Time Series Data

Time series data behaves differently from tabular data in which every row is an independent sample (this notebook calls that panel data; strictly speaking it is cross-sectional data, since a true panel also tracks each unit over time). Because of this difference, cleaning a time series calls for its own preprocessing techniques.

Let's discuss the usual preprocessing techniques applied to time series data and how they differ from those for panel data. But first, a word on the dataset we'll be using today: stock price data from Yahoo Finance. Luckily, there's a free Python library, `yfinance`, that lets us pull the data easily.

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

In [None]:
stock_name = 'MSFT'
start_date="2019-01-01"
end_date="2022-01-01"

# Valid intervals: 1m, 2m, 5m, 15m, 30m, 60m, 90m, 1h, 1d, 5d, 1wk, 1mo, 3mo
interval_val = '1d'

data = yf.download(stock_name, start=start_date, end=end_date, interval = interval_val)
data.head()

data = data.reset_index()

In [None]:
# data.to_excel('Data For Students.xlsx', index=False)

In [None]:
data.head()

### Data Dictionary



| Variable  | Definition                                                                   | Key                         |
| --------- | ---------------------------------------------------------------------------- | --------------------------- |
| Date      | Time period covered (depends on the interval selected)                       | Can be days, minutes, hours |
| Open      | Opening stock price (SP) for the period                                      | Stock Price                 |
| High      | Highest SP within the period selected                                        | Stock Price                 |
| Low       | Lowest SP within the period selected                                         | Stock Price                 |
| Close     | Closing SP within the period selected                                        | Stock Price                 |
| Adj Close | Closing SP adjusted for corporate actions such as dividends and stock splits | Stock Price                 |
| Volume    | Number of shares traded during the period                                    | Quantity                    |


### Exploratory Data Analysis

#### How Big is our Data?

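The dataset has 7 columns and one row per trading day in the selected range. With roughly 252 trading days per year, the three-year range above should yield around 750 rows; `data.shape` gives the exact figures.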

In [None]:
data.shape

#### Data Type and Nulls

Make sure that the columns have the proper data types, especially the date column. We can see from the results below that each column has the correct data type.

In [None]:
data.info()

In case the data type of the date column is not correct, we can convert it to datetime using the code below. The `format` can be stated explicitly; if that parameter is left as None, pandas will try to infer the format of the column. Check out the documentation for more info:

https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

In [None]:
data['Date'] = pd.to_datetime(data['Date'])
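When the dates arrive as text, it can help to state the format and decide what happens to values that do not match it. The sketch below uses a made-up series of date strings (not the stock data) to show an explicit `format` together with `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw date strings; the last one is deliberately malformed.
raw_dates = pd.Series(["2021-01-04", "2021-01-05", "not a date"])

# An explicit format avoids ambiguous day/month guessing;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(parsed.dtype)
print("Unparseable rows:", n_unparsed)
```

Counting the `NaT` values afterwards is a quick way to see how many rows need manual attention.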

### Feature Engineering

From the previous discussions, we know that we can extract multiple features from dates. Let's try to get the following from our data:

    Year
    Day
    Weekday Index
    Weekday Name
    Is Weekend / Weekday

In [None]:
data['Year'] = data['Date'].dt.year
data['Day'] = data['Date'].dt.day
data['Weekday Index'] = data['Date'].dt.weekday
data['Weekday Name'] = data['Date'].dt.day_name()
data['IsWeekday'] = ['Weekend' if i > 4 else 'Weekday' for i in data['Weekday Index']]
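Since the same features are extracted again later for the imputed data, one option is to wrap the steps in a small helper. This is only a sketch: `add_date_features` is a hypothetical name, and the toy frame stands in for the stock data.

```python
import numpy as np
import pandas as pd

def add_date_features(df, date_col="Date"):
    """Return a copy of df with calendar features derived from date_col."""
    out = df.copy()
    dates = out[date_col].dt
    out["Year"] = dates.year
    out["Day"] = dates.day
    out["Weekday Index"] = dates.weekday  # Monday = 0 ... Sunday = 6
    out["Weekday Name"] = dates.day_name()
    out["IsWeekday"] = np.where(out["Weekday Index"] > 4, "Weekend", "Weekday")
    return out

# Toy frame: Friday 2021-01-01 through Sunday 2021-01-03.
toy = pd.DataFrame({"Date": pd.date_range("2021-01-01", periods=3, freq="D")})
toy_features = add_date_features(toy)
print(toy_features)
```

Returning a copy keeps the original frame unchanged, so the helper can be applied to both the raw and the imputed data.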

### Is there anything weird with the data?

Looking at the maximum weekday index in the summary below, there seem to be no rows with weekday index 5 or 6, which implies we don't have values for Saturdays and Sundays. Let's confirm this by counting the samples per weekday index.


In [None]:
data.describe()

It seems that we don't have stock prices reported during the weekend. This is expected because stock markets are **closed during weekends!**



In [None]:
# sort_index() orders the counts by weekday index and works across pandas versions
data['Weekday Index'].value_counts().sort_index()

In [None]:
data['Weekday Name'].value_counts().sort_values()

In [None]:
data['IsWeekday'].value_counts().sort_values()
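Weekends are not the only gaps: market holidays are missing too. A minimal sketch, using toy dates in place of `data['Date']`, of how a business-day range can list the weekdays that have no row:

```python
import pandas as pd

# Toy stand-in for data['Date']: one trading week with Wednesday missing.
observed = pd.DatetimeIndex(["2021-01-04", "2021-01-05", "2021-01-07", "2021-01-08"])

# Every Monday-to-Friday date between the first and last observation.
expected = pd.bdate_range(observed.min(), observed.max())

missing_business_days = expected.difference(observed)
print(missing_business_days)
```

On the real data, the dates returned this way would be candidates for exchange holidays.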

### Imputation

One big difference between time series data and panel data is that a time series is usually analyzed as a continuous sequence as time progresses (hence the *series* in the name). There are times when we need to show every time step, EVEN IF the value is zero. In this case, let's say that we want to show weekend values as 0 to indicate that there was no trading activity on those days. We can do it using the code below:

In [None]:
sample_imputation = data.copy()

In [None]:
sample_imputation.index = sample_imputation['Date']

In [None]:
## Code Provided by Nidhi Sharma
## https://stackoverflow.com/questions/47231496/pandas-fill-missing-dates-in-time-series

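def fill_in_missing_dates(df,
                          date_col_name='date',
                          fill_val=np.nan,
                          date_format='%Y-%m-%d'):
    """Reindex df to one row per calendar day, filling new rows with fill_val."""
    # set_index returns a new frame, so the caller's DataFrame is left untouched
    df = df.set_index(date_col_name, drop=True)
    df.index = pd.to_datetime(df.index, format=date_format)

    # Full daily range between the first and last observed dates
    idx = pd.date_range(df.index.min(), df.index.max())
    print('missing_dates are', idx.difference(df.index))

    # Dates absent from the original index get fill_val in every column
    return df.reindex(idx, fill_value=fill_val)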


In [None]:
sample_imputation = fill_in_missing_dates(sample_imputation,
                                          date_col_name = 'Date',
                                          fill_val = 0,
                                          date_format='%Y-%m-%d')
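Filling with 0 is one choice among several. For prices, a common alternative (not the approach used in this notebook) is to carry the last known close forward, so the weekend rows do not look like a crash to zero. The sketch below compares both on made-up prices.

```python
import pandas as pd

# Toy closing prices for Thursday, Friday and the following Monday.
close = pd.Series(
    [100.0, 102.0, 101.0],
    index=pd.to_datetime(["2021-01-07", "2021-01-08", "2021-01-11"]),
)

full_range = pd.date_range(close.index.min(), close.index.max(), freq="D")

zero_filled = close.reindex(full_range, fill_value=0)  # same idea as fill_val=0
forward_filled = close.reindex(full_range).ffill()     # carry Friday's close forward

print(pd.DataFrame({"zero": zero_filled, "ffill": forward_filled}))
```

Which fill is appropriate depends on the question: zeros suit "activity" measures such as volume, while forward fill is usually closer to what a price series means.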

### Getting the features for the imputed values

Here we extract the date features again, since new data points were added.

In [None]:
sample_imputation['Date'] = sample_imputation.index

In [None]:
sample_imputation['Year'] = sample_imputation['Date'].dt.year
sample_imputation['Day'] = sample_imputation['Date'].dt.day
sample_imputation['Weekday Index'] = sample_imputation['Date'].dt.weekday
sample_imputation['Weekday Name'] = sample_imputation['Date'].dt.day_name()
sample_imputation['IsWeekday'] = ['Weekend' if i > 4 else 'Weekday' for i in sample_imputation['Weekday Index']]

In [None]:
sample_imputation['Weekday Name'].value_counts().sort_values()

In [None]:
plt.figure(figsize=(30,8))

# Plotting the Imputed Data (blue)
plt.plot(sample_imputation['Date'], sample_imputation['Close'], marker='o', label='Imputed')

#Plotting the original data (orange)
plt.plot(data['Date'], data['Close'], marker='o', label='Original')

plt.legend(fontsize=25)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Date')
plt.ylabel('Closing Price');

In [None]:
# For Students, What if we want to impute using the mean value of the closing stock price?

## Aggregation and Granularity

Granularity in panel data refers to the level at which samples are grouped. We usually group samples to analyze them at a larger scale. Similarly, we can group time series data according to the level of **temporality** we want to examine. The usual levels of grouping for time are:

`Seconds` -> `Minutes` -> `Hours` -> `Days` -> `Weeks` -> `Months` -> `Years` -> `Decades` -> `Centuries`

Example : 


![Stocks](http://kourentzes.com/forecasting/wp-content/uploads/2014/05/fors1.fig1_.png)



Implementing this in pandas is quite straightforward. As with panel data, we use the **groupby** function to group the time series and specify the following:

    Groups Involved - what are the columns that we'll include to the aggregation (i.e. per stock, per week)
    Granularity - what level of grouping do we need? How high or how low? 
    Type of Aggregation - how do we group the values together (mean / sum)

In [None]:
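# 'W' = weekly buckets ending on Sunday
weekly_aggregate = data.groupby(pd.Grouper(key='Date', freq='W'))['Close'].mean().reset_index()
# 'MS' = month start: the monthly mean is stamped on the first day of the month
month_aggregate = data.groupby(pd.Grouper(key='Date', freq='MS'))['Close'].mean().reset_index()
# 'M' = month end: same mean, stamped on the last day (newer pandas spells this alias 'ME')
month_aggregate_ = data.groupby(pd.Grouper(key='Date', freq='M'))['Close'].mean().reset_index()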
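The same grouping can also be written with `resample`, which works on a datetime index and makes it easy to compute several aggregations at once. A small sketch on synthetic business-day prices (the values are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy daily closing prices on ten consecutive business days.
dates = pd.bdate_range("2021-01-04", periods=10)
toy = pd.DataFrame({"Date": dates, "Close": np.arange(10, dtype=float)})

# Grouper-based weekly mean, as in the cell above.
by_grouper = toy.groupby(pd.Grouper(key="Date", freq="W"))["Close"].mean()

# Equivalent resample call on a DatetimeIndex.
by_resample = toy.set_index("Date")["Close"].resample("W").mean()

# Several aggregations at once for the same weekly buckets.
weekly_summary = toy.set_index("Date")["Close"].resample("W").agg(["min", "max", "mean"])
print(weekly_summary)
```

Both calls produce the same weekly means; which one to use is mostly a matter of whether the dates live in a column or in the index.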

In [None]:
plt.figure(figsize=(30,8))

# Plotting the original data (blue)
plt.plot(data['Date'], data['Close'], marker='o', label='Original')

# Plotting the grouped data at different granularities
plt.plot(weekly_aggregate['Date'], weekly_aggregate['Close'], label='Weekly Agg')
plt.plot(month_aggregate['Date'], month_aggregate['Close'], label='Month start Agg')
plt.plot(month_aggregate_['Date'], month_aggregate_['Close'], label='Month end Agg')

plt.legend(fontsize=25)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Date')
plt.ylabel('Closing Price');

## Aggregation via a Moving Window

Notice that as we move to a coarser granularity, the graph gets "smoother". This is because aggregation averages out the variation observed in the granular data (similar to when we group samples into categories to get descriptive statistics).

One flaw of the usual aggregation method is that it only uses the data inside the specific window being aggregated (i.e. if we aggregate by month, the calculation does not factor in previous months). Time series data points are typically **autocorrelated**, meaning past values are related to present values, and that relationship is something we **SHOULD** account for. Moving (or rolling) averages address this "flaw": each point is averaged together with the values that immediately precede it. Mathematically, the simple moving average over a window of $k$ observations is:

$$\mathrm{SMA}_t = \frac{1}{k}\sum_{i=0}^{k-1} x_{t-i}$$

where $x_t$ is the value of the series at time $t$.

Procedurally, this is what happens under the hood : 

![Moving average info](https://media.geeksforgeeks.org/wp-content/uploads/20211116101813/Capture.PNG)

We can implement it in python using the code block below : 

In [None]:
data['7day MA'] = data['Close'].rolling(window =7).mean()
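To see what `rolling(...).mean()` is doing, the sketch below recomputes it by hand on a short made-up series with a window of 3: each value is the average of the current observation and the two before it. Note that the stock data only has trading days, so a window of 7 there spans seven trading days rather than seven calendar days.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
window = 3

rolling_mean = prices.rolling(window=window).mean()

# Manual version of the same calculation: average the current value
# and the (window - 1) values before it; earlier positions stay empty.
manual = [
    np.nan if t < window - 1 else prices.iloc[t - window + 1 : t + 1].mean()
    for t in range(len(prices))
]

print(pd.DataFrame({"price": prices, "rolling": rolling_mean, "manual": manual}))
```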

### Pros and Cons of using MA

Pros :
    
    Information from the past - each value incorporates the preceding values in the window, so the possible relationship between past and present is reflected

    Smoothness - smooths the series and makes the trend more visible (the larger the window, the smoother the series)

    Can be used for forecasting - one of the most intuitive, classical models
    
Cons :
    
    Loss of data - the first (window - 1) rows have no moving average because there are not enough preceding values

In [None]:
# Smoothness
plt.figure(figsize=(30,8))

#Plotting the original data (orange)
# plt.plot(data['Date'], data['Close'], marker='o', label='Original')

plt.plot(weekly_aggregate['Date'], weekly_aggregate['Close'], label='Weekly Agg')

#Plotting the 7Day MA
plt.plot(data['Date'], data['7day MA'], label='7Day MA')

plt.legend(fontsize=25)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Date')
plt.ylabel('Closing Price');

In [None]:
# Loss of Data
data[['Date','Close', '7day MA']]
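The empty rows at the start come from the default requirement that a full window be available. If losing those rows is a problem, `min_periods` relaxes the requirement, at the cost of averaging over fewer points early on. A sketch on made-up prices:

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0])
window = 7

strict = prices.rolling(window=window).mean()                  # default: needs 7 values
lenient = prices.rolling(window=window, min_periods=1).mean()  # averages whatever exists

print("NaN rows (strict): ", int(strict.isna().sum()))
print("NaN rows (lenient):", int(lenient.isna().sum()))
```

The lenient version keeps every row, but its first few values are noisier because they are based on fewer observations.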

## Detection of Outliers


`Zscore` - as with panel data, you can use z-scores to spot outliers within the dataset. It is one of the most basic anomaly detection techniques for time series: compute the z-score of each value and plot it across time to spot unusual behavior.

In [None]:
# Subsampling the data points to include only January to May 2020 Microsoft stock prices
# Keeping daily granularity; days without trading are dropped

outlier_sample = data[(data['Date'].dt.year == 2020) & (data['Date'].dt.month <= 5)]

outlier_sample = outlier_sample.groupby(pd.Grouper(key='Date', freq='D'))['Close'].mean().reset_index()
outlier_sample = outlier_sample.dropna()

In [None]:
# Subsampled data
plt.figure(figsize=(30,8))

plt.plot(outlier_sample['Date'], outlier_sample['Close'], label='Daily Agg', marker='o')

plt.legend(fontsize=25)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Date')
plt.ylabel('Closing Price');

In [None]:
outlier_sample['zscore'] = stats.zscore(outlier_sample['Close'])

Here we establish a threshold for flagging a data point as an outlier. As with panel data, we need to decide how many standard deviations away from the mean still count as normal. As an example, let's say we deem data points more than 1 standard deviation away from the mean (in either direction) to be anomalous.

In [None]:
mean_value = outlier_sample['zscore'].mean()
standard_dev_val = 1

Plotting the threshold band lets us easily spot unusual stretches in our dataset. Looking at the first quarter of 2020, there seems to be an event that pushed Microsoft's price down for `multiple weeks`. In time series analysis, such stretches are usually points of reflection: we try to make sense of them by looking at factors that might have caused the behavior.

In [None]:
# Plotting Zscores across time
plt.figure(figsize=(30,8))

plt.plot(outlier_sample['Date'], outlier_sample['zscore'], label='Zscores', marker='o')

plt.axhline(outlier_sample['zscore'].mean(), label='zscore Ave', linestyle='--')


plt.axhline(standard_dev_val, label='Upper Bound', linestyle='--', color='red')

plt.axhline(-standard_dev_val, label='Lower Bound', linestyle='--', color='red')

plt.legend(fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Date')
plt.ylabel('Z-score');
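The flagging step itself can be sketched in a few lines. The prices below are made up, and the z-score is computed directly with pandas using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Toy closing prices with one obvious dip (values are made up).
close = pd.Series([100.0, 101.0, 99.0, 100.0, 80.0, 100.0, 101.0, 99.0])

# Population z-score: distance from the mean in units of standard deviation.
zscore = (close - close.mean()) / close.std(ddof=0)

threshold = 1  # same +/- 1 standard deviation band as above
is_outlier = zscore.abs() > threshold

print(close[is_outlier])
```

A threshold of 1 is quite strict; in practice, values of 2 or 3 are often used so that only the most extreme points are flagged.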

## Major Components of Observed Time Series

Time series have components that we can look at to further understand the behavior of our data.

`Trend` - the long-term slope and direction of the time series. Questions to consider: how long is long? And what if there are shifts in trend?

`Cycle` - a recurring pattern that is not tied to a fixed time frame (i.e. it does not repeat every x weeks or days, but booms and busts are still observable)

`Seasonality` - calendar-based effects (seasons, fixed scheduled holidays, tax periods)


`Irregularity` - the term that captures stochasticity and randomness.


### Extracting the Major Components of a Series

Decomposition is a technique that can help us make sense of a time series. The result of decomposition will allow us to know what kind of "preprocessing" technique to apply to the series before we can move forward with more complex analysis and modelling.

The two most common ways to decompose a series are : 

    Additive Model - states that the series can be recreated by adding the estimated components together

    Multiplicative Model - states that the series can be recreated by multiplying the estimated components together
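Written out, with $y_t$ as the observed series and $T_t$, $S_t$ and $R_t$ as the trend, seasonal and irregular components (the symbols are notation chosen here, not taken from a specific library), the two models are:

$$y_t = T_t + S_t + R_t \quad \text{(additive)}$$

$$y_t = T_t \times S_t \times R_t \quad \text{(multiplicative)}$$

Taking logs of a multiplicative series with positive values gives $\log y_t = \log T_t + \log S_t + \log R_t$, which is why a multiplicative series is sometimes analyzed as an additive one on the log scale.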

The code below will allow us to extract the important components of a timeseries

In [None]:
decompose_sample = data.copy()

decompose_sample = decompose_sample.groupby(pd.Grouper(key='Date', freq='MS'))['Close'].mean().reset_index()

decompose_sample = decompose_sample.set_index(pd.to_datetime(decompose_sample['Date'])).dropna()

decompose_sample.sort_index(inplace=True)

In [None]:
decompose_sample['Year'] = decompose_sample['Date'].dt.year
decompose_sample['Month'] = decompose_sample['Date'].dt.month

In [None]:
# Analyze first the original series 
# What is the trend? What months, on average is high?
# What to expect when we extract the components?
plt.figure(figsize=(15, 5))
for each in decompose_sample['Year'].unique():
    yearly = decompose_sample[decompose_sample['Year'] == each]
    plt.plot(yearly['Month'], yearly['Close'], marker='o', label=each)

# Labels and legend only need to be set once, after all lines are drawn
plt.legend()
plt.xlabel('Months')
plt.ylabel('Monthly Closing Ave')
plt.title('Monthly Closing Ave Across Months')

In [None]:
# period=12 makes the yearly cycle explicit for monthly data
result = seasonal_decompose(decompose_sample['Close'], model='additive', period=12)

In [None]:
plt.figure(figsize=(10,3))
plt.plot(result.observed)
plt.title("Observed")
plt.figure(figsize=(10,3))
plt.plot(result.trend)
plt.title("Trend")
plt.figure(figsize=(10,3))
plt.plot(result.seasonal)
plt.title("Seasonal")
plt.figure(figsize=(10,3))
plt.plot(result.resid)
plt.title("Residuals")
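To make the components less of a black box, here is a rough sketch of the classical moving-average approach to additive decomposition, written with pandas and NumPy on a synthetic monthly series. It is meant as an illustration of the idea; the statsmodels implementation may differ in its details.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a repeating 12-month pattern.
idx = pd.date_range("2019-01-01", periods=36, freq="MS")
pattern = np.tile(np.arange(12) - 5.5, 3)  # sums to zero over each year
series = pd.Series(100 + 2.0 * np.arange(36) + pattern, index=idx)

# 1) Trend: centred moving average spanning one full year
#    (half weights on both ends because the period is even).
period = 12
half = period // 2
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = pd.Series(np.nan, index=series.index)
trend.iloc[half:-half] = np.convolve(series.to_numpy(), weights, mode="valid")

# 2) Seasonal: average detrended value per calendar month, centred on zero.
detrended = series - trend
month_effect = detrended.groupby(detrended.index.month).mean()
month_effect = month_effect - month_effect.mean()
seasonal = pd.Series(month_effect.loc[series.index.month].to_numpy(), index=series.index)

# 3) Residual: whatever trend and seasonality do not explain.
resid = series - trend - seasonal
print(pd.DataFrame({"trend": trend, "seasonal": seasonal, "resid": resid}).head(8))
```

Because the synthetic series is built from exactly a trend and a seasonal pattern, the residuals come out at essentially zero. The first and last six months have no trend estimate, which is the same loss of data seen with moving averages.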

### Monthly expectation

By analyzing the extracted seasonality index, we can see the effect of each month on the closing stock price. The graph below gives an idea of whether a certain month has a positive or negative effect on the average closing price.

In [None]:
seasonality_value = result.seasonal.reset_index()

seasonality_value['Month'] = seasonality_value['Date'].dt.month
seasonality_value = seasonality_value[['seasonal', 'Month']].drop_duplicates()

Here we can see the months in which Microsoft's closing price tends to sit below its trend (negative seasonal index) in this sample: January, March, April, May, October, November and December.

In [None]:
plt.figure(figsize=(20,3))

plt.bar(seasonality_value[seasonality_value['seasonal'] > 0]['Month'],
        seasonality_value[seasonality_value['seasonal'] > 0]['seasonal'])


plt.bar(seasonality_value[seasonality_value['seasonal'] <= 0]['Month'],
        seasonality_value[seasonality_value['seasonal'] <= 0]['seasonal'])

## Preprocessing Techniques Depend on Seasonality and Trend

The result of the decomposition dictates what kind of preprocessing technique should be applied to the data. Depending on whether the data shows seasonality and/or trend, we can use the table below to decide how to impute missing values.

| Seasonality | With trend |Trendless |
| ----------- | ----------- |----------- |
| Seasonal      |Use __MEAN__ of the __adjacent__ years (same period)|Use __MEAN__ of __all__ available data |
| Non-Seasonal   |Use __MEAN__ of the __adjacent__ datapoints|Use __MEAN__ of __all__ available data  |


Following this table should generally give more sensible imputations and, in turn, more accurate models.
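Two cells of the table can be sketched with toy numbers. This is an illustrative reading of the table, not a full imputation routine: for a trending, non-seasonal series the gap is filled with the mean of its neighbours, and for a trending, seasonal series it is filled with the mean of the same period in the adjacent years.

```python
import numpy as np
import pandas as pd

# Non-seasonal series with a trend: one missing point in the middle.
trending = pd.Series([10.0, 12.0, np.nan, 16.0, 18.0])
# Linear interpolation of a single gap equals the mean of its two neighbours.
trending_filled = trending.interpolate(method="linear")

# Seasonal series with a trend: the same month observed in three consecutive years.
july = pd.Series(
    [50.0, np.nan, 70.0],
    index=pd.to_datetime(["2019-07-01", "2020-07-01", "2021-07-01"]),
)
# Mean of the same period in the adjacent years (previous and next July).
same_period_mean = (july.shift(1) + july.shift(-1)) / 2
july_filled = july.fillna(same_period_mean)

print(trending_filled.tolist())
print(july_filled.tolist())
```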


## Stationarity and Autocorrelation


Some algorithms require a series to be stationary before it can be modelled. When non-stationary data is used to train classical models (e.g. ARIMA), the residuals (actual minus predicted values) tend to be autocorrelated: errors in the past remain **CORRELATED** with errors in the present and future, which means the model has left structure in the data unexplained. Residuals that are not serially correlated are an assumption that must hold before we can interpret ARIMA and other classical models. There are different preprocessing techniques that you can use to remove this dependence from the series.

A series that is stationary is defined to have the following properties :

    1) Trendless (constant mean)
    2) No predefined seasonality
    3) Constant variance across time
    
    
![Stationarity](https://0megap0int.files.wordpress.com/2021/07/stationary.png)
    
    

### Differencing

You can difference the input series to remedy non-stationarity. Differencing subtracts the previous data point from each data point, i.e. every value is replaced by its change from the previous time period.





In [None]:
data['Differenced First Order'] = data['Close'].diff()
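A quick way to see why differencing helps is to apply it to a synthetic series with a steady upward trend: the differenced series is flat, because only the step-to-step change is left. The sketch below also spells out `diff()` as a subtraction of the shifted series.

```python
import numpy as np
import pandas as pd

# Synthetic series with a steady upward trend of 2 units per step.
trend_series = pd.Series(100 + 2.0 * np.arange(8))

first_diff = trend_series.diff()                    # y_t - y_{t-1}
manual_diff = trend_series - trend_series.shift(1)  # same thing, spelled out

print(first_diff.tolist())
```

The first value is always empty because there is no earlier observation to subtract, which is why the later test calls `dropna()` on the differenced column.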

In [None]:
# Plotting Data
plt.figure(figsize=(30,8))

plt.plot(data['Date'], data['Close'], label='Actual Series', marker='o')
plt.plot(data['Date'], data['Differenced First Order'], label='Differenced Series', marker='o')


plt.legend(fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Date')
plt.ylabel('Closing Price');

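The formal way to check for stationarity is the Augmented Dickey-Fuller (ADF) test, whose hypotheses are :

Null = Series is NON-STATIONARY (has a unit root)

Alternative = Series is STATIONARY

A p-value below the chosen significance level (commonly 0.05) means we reject the null and treat the series as stationary.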

In [None]:
result = adfuller(data['Close'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

In [None]:
result = adfuller(data['Differenced First Order'].dropna())
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
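Since the same printing code is repeated for every test, a small helper can turn the tuple returned by `adfuller` into a verdict. This is a sketch: `interpret_adf` is a hypothetical helper, and the example tuple holds made-up numbers laid out like an `adfuller` result so that the block runs without any data.

```python
def interpret_adf(adf_result, alpha=0.05):
    """Summarise an adfuller() result tuple as a plain-text verdict."""
    statistic, p_value = adf_result[0], adf_result[1]
    critical_values = adf_result[4]
    reject_null = p_value < alpha
    return {
        "statistic": statistic,
        "p_value": p_value,
        "reject_null": reject_null,
        # Failing to reject the null is treated here as "non-stationary".
        "verdict": "stationary" if reject_null else "non-stationary",
        "below_5pct_critical": statistic < critical_values["5%"],
    }

# Made-up tuple in the same layout as adfuller()'s return value
# (statistic, p-value, lags used, observations, critical values, icbest).
illustrative_result = (-1.2, 0.67, 1, 100, {"1%": -3.5, "5%": -2.9, "10%": -2.6}, 0.0)
summary = interpret_adf(illustrative_result)
print(summary["verdict"])
```

In the notebook this could be called as `interpret_adf(adfuller(data['Close']))`. Keep in mind that failing to reject the null is weaker evidence than rejecting it, so the "non-stationary" label is a working assumption rather than a proof.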

### Log transform

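Another technique to tame non-stationarity is taking the logarithmic transform of the series. A log transform mainly stabilizes the variance (it compresses large swings at high price levels); it does not remove a trend by itself, so it is often combined with differencing.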

In [None]:
data['Log Close'] = np.log(data['Close'])
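The log transform pairs naturally with differencing: the difference of log prices is the log return, a common way to work with price series. The sketch below uses a synthetic price that grows by a fixed percentage each step, so the log series is a straight line and its differences are constant.

```python
import numpy as np
import pandas as pd

# Synthetic price that grows 5% per step (exponential trend).
price = pd.Series(100 * 1.05 ** np.arange(10))

log_price = np.log(price)      # exponential growth becomes a straight line
log_return = log_price.diff()  # differencing the log removes that line

print(log_return.dropna().round(4).tolist())
```

On the stock data, the equivalent would be `np.log(data['Close']).diff()`, which could then be passed through the same ADF test.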

In [None]:
# Plotting Data
plt.figure(figsize=(30,8))

# plt.plot(data['Date'], data['Close'], label='Actual Series', marker='o')
plt.plot(data['Date'], data['Log Close'], label='Log Series', marker='o')


plt.legend(fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Date')
plt.ylabel('Log Closing Price');

In [None]:
result = adfuller(data['Log Close'].dropna())
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))