<a href="https://www.coursera.org/learn/applying-data-analytics-business-in-finance"> <img src="./resources/illinois_banner.png" alt="applying-data-analytics-business-in-finance"/> </a>

# Stationary Data and ARIMA Model
*This lab was developed by: <br> Jose Luis Rodriguez, Director of Margolis Market Information Lab, R.C. Evans Innovation Fellow at Gies College of Business
<br> Meiou Wen, MSF Teaching Assistant, Gies College of Business*

In this lab we begin with stationarity, a necessary first step in analyzing time series data. We will learn how to identify whether a time series is stationary and how to transform nonstationary data into stationary data. By the end of the lab you should be able to:

* Develop an understanding of stationarity, identifying two forms of it and their importance in time series analysis
* Understand the stationarity testing procedure: how to conduct a stationarity test, how to read ADF test results, how to make a nonstationary series stationary, and how to determine the order of integration
* Understand the components of the ARIMA parameters and the procedure for building an ARIMA forecasting model
* Identify opportunities to apply ARIMA forecasting models in practice
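
The idea of an order of integration can be illustrated on simulated data before touching the real series. The sketch below is not part of the original lab: it builds a random walk from white noise (the seed, sample size, and variable names are arbitrary choices) and checks that one round of differencing recovers the stationary noise, which is what it means for a series to be integrated of order one.

```r
# Illustrative sketch on simulated data (not one of the lab's data sets)
set.seed(123)                       # arbitrary seed for reproducibility
n <- 500                            # arbitrary sample size

noise <- rnorm(n)                   # white noise: stationary, I(0)
random_walk <- cumsum(noise)        # cumulative sum of noise: nonstationary, I(1)

# The first difference of the random walk gives back the noise (minus its first value)
walk_diff <- diff(random_walk, lag = 1, differences = 1)
all.equal(walk_diff, noise[-1])

# Compare the mean of the first and second halves of each series:
# a random walk wanders, so its two halves usually sit at different levels
half <- seq_len(n / 2)
c(walk_first = mean(random_walk[half]), walk_second = mean(random_walk[-half]))
c(diff_first = mean(walk_diff[half]),   diff_second = mean(walk_diff[-half]))
```

A series that needs one difference to become stationary, like this random walk, is said to be I(1); a series that is already stationary is I(0).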

### Packages and Configurations

* tidyverse: https://www.tidyverse.org/
* lubridate: https://lubridate.tidyverse.org/
* urca: https://cran.r-project.org/web/packages/urca
* forecast: https://cran.r-project.org/web/packages/forecast/
* xts: https://cran.r-project.org/web/packages/xts

In [None]:
# SUPPRESS PACKAGE STARTUP MESSAGES
quietly <- suppressPackageStartupMessages

# DISABLE SCIENTIFIC NOTATION
options(scipen = 9999)

# CHANGE CHARTS DIMENSIONS
options(repr.plot.width=12, repr.plot.height=7)

# LOAD PACKAGES AND SUPPRESS STARTUP MESSAGES
quietly(library(xts))
quietly(library(tidyverse))
quietly(library(lubridate))
quietly(library(forecast))
quietly(library(urca))

### Data Import and Exploration

#### SPY US - SPDR S&P 500 ETF Trust

SPDR S&P 500 ETF Trust is an exchange-traded fund incorporated in the USA. The ETF tracks the S&P 500 Index. The Trust consists of a portfolio representing all 500 stocks in the S&P 500 Index. It holds predominantly large-cap U.S. stocks. This ETF is structured as a Unit Investment Trust and pays dividends on a quarterly basis. The holdings are weighted by market capitalization. 

In [None]:
# READ THE .CSV FILE AS DATA FRAME
spy_df = read_csv("data/SPY.csv")

In [None]:
head(spy_df)

In [None]:
# CREATE AN EMPTY VECTOR volume
volume = c()
# LOOP OVER spy_df$volume AND CONVERT "M" (MILLIONS) AND "k" (THOUSANDS) SUFFIXES TO NUMBERS
for(entry in spy_df$volume){
    if(str_detect(string = entry, "M")){
        v = gsub("M","", entry)
        v = as.numeric(v) * 1000000
        volume = c(volume, v)
    }
    else if(str_detect(string = entry, "k")){
        v = gsub("k","", entry)
        v = as.numeric(v) * 1000
        volume = c(volume, v)
    }
    else{
        v = as.numeric(entry)
        volume = c(volume, v)
    }
}
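
The same conversion can also be written without an explicit loop. The following is a minimal sketch, assuming the volume strings use only the `M` and `k` suffixes handled above; `parse_volume` is a hypothetical helper name that does not appear in the original lab, and the example inputs are made up.

```r
# Hypothetical helper (not in the original lab): convert strings such as
# "1.5M" or "250k" to numbers in one vectorized pass
parse_volume <- function(x) {
    x <- as.character(x)
    # Multiplier implied by the suffix; 1 when there is no suffix
    multiplier <- ifelse(grepl("M", x, fixed = TRUE), 1e6,
                  ifelse(grepl("k", x, fixed = TRUE), 1e3, 1))
    # Strip the suffix and scale the remaining number
    as.numeric(gsub("[Mk]", "", x)) * multiplier
}

# Made-up inputs covering the three cases
parse_volume(c("1.5M", "250k", "900"))
```

Under these assumptions, `parse_volume(spy_df$volume)` would return the same numeric vector that the loop builds, and it avoids growing a vector one element at a time.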

In [None]:
# REPLACE THE volume COLUMN IN spy_df WITH THE NUMERIC VALUES
spy_df$volume = volume

In [None]:
# CHECK THE FIRST FEW ROWS OF spy_df
head(spy_df)

In [None]:
# CREATE A NEW OBJECT, spy_xts, CONVERT date IN spy_df TO POSIXct TIME FORMAT AND ORDER BY TIME
spy_xts = xts(select(spy_df,-c("date")), 
             order.by = as.POSIXct(strptime(spy_df$date,"%m/%d/%y")))

In [None]:
# ASSIGN THE FIRST ITEM IN spy_xts TO spy_start
spy_start = index(spy_xts[1])
spy_start = c(year(spy_start),month(spy_start))

# ASSIGN THE LAST ITEM IN spy_xts TO spy_end
spy_end = index(spy_xts[length(spy_xts[,'close'])])
spy_end = c(year(spy_end), month(spy_end))

# CREATE A TIME-SERIES OBJECT AND ASSIGN TO spy_ts
# (end IS OMITTED: WITH BOTH start AND end, ts() RECYCLES OR TRUNCATES THE DATA TO FILL THE WINDOW)
spy_ts = ts(spy_xts, start = spy_start, frequency = 365)
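
One detail of `ts()` matters when daily data have gaps such as weekends and holidays: if both `start` and `end` are supplied, the number of observations is derived from that window and the `frequency`, and the data are recycled or truncated to fit. The toy example below (made-up values, base R only) is a small sketch of that behavior.

```r
# Illustrative sketch: how ts() treats a window longer or shorter than the data
x <- 1:5                                    # five toy observations

# Window of 12 periods: the five values are recycled to fill it
padded <- ts(x, start = c(2010, 1), end = c(2010, 12), frequency = 12)
length(padded)

# Window of 3 periods: the data are truncated to the first three values
cut_short <- ts(x, start = c(2010, 1), end = c(2010, 3), frequency = 12)
as.numeric(cut_short)

# With only `start`, the series keeps exactly the observations supplied
as_is <- ts(x, start = c(2010, 1), frequency = 12)
length(as_is)
```

For regularly spaced monthly data the window and the data usually agree, so the issue mostly arises with daily series that skip non-trading days.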

In [None]:
# PLOT THE DAILY CLOSING PRICE
autoplot(spy_ts[,"close"], xlab = "Year", ylab = "Close Price",
         main = "Daily Prices - SPDR S&P 500 ETF Trust (SPY)")

#### US Census Bureau - Total Construction Spending (TTLCON/TCS)

The Value of Construction Put in Place Survey (VIP) provides monthly estimates of the total dollar value of construction work done in the U.S. The survey covers construction work done each month on new structures or improvements to existing structures for private and public sectors. Data estimates include the cost of labor and materials, cost of architectural and engineering work, overhead costs, interest and taxes paid during construction, and contractor's profits.  

* https://fred.stlouisfed.org/series/TTLCON
* https://www.census.gov/construction/c30/c30index.html

In [None]:
# READ THE .CSV FILE AS DATA FRAME
tcs_df = read_csv("data/TTLCON.csv")

In [None]:
# CREATE A NEW OBJECT, tcs_xts, CONVERT date IN tcs_df TO POSIXct TIME FORMAT AND ORDER BY TIME
tcs_xts = xts(select(tcs_df,-c("DATE")), 
             order.by = as.POSIXct(strptime(tcs_df$DATE,"%Y-%m-%d")))

In [None]:
# ASSIGN THE FIRST ITEM IN tcs_xts TO tcs_start
tcs_start = index(tcs_xts[1])
tcs_start = c(year(tcs_start),month(tcs_start))

# ASSIGN THE LAST ITEM IN tcs_xts TO tcs_end
tcs_end = index(tcs_xts[length(tcs_xts)])
tcs_end = c(year(tcs_end), month(tcs_end))

# CREATE A TIME-SERIES OBJECT AND ASSIGN TO tcs_ts
tcs_ts = ts(tcs_xts[,"TTLCON"], start = tcs_start, end = tcs_end, frequency = 12)

In [None]:
# PLOT THE MONTHLY TTLCON
autoplot(tcs_ts, xlab = "Year", ylab = "Millions of Dollars",
         main = "US Census Bureau - Total Construction Spending (TTLCON)")

## Stationary and Non-Stationary Time Series Data

### Testing for Stationarity on SPDR S&P 500 ETF (SPY)

In [None]:
# SPY Closing Price
spy_close = spy_ts[,"close"]

# PERFORM A TEST REGRESSION 
spy_adf = ur.df(spy_close, type = "none", selectlags = "AIC")
summary(spy_adf)

In [None]:
# Difference SPY Closing Price
spy_close_diff = diff(spy_close, lag = 1, differences = 1)

# PERFORM A TEST REGRESSION 
spy_adf = ur.df(spy_close_diff, type = "none", selectlags = "AIC")
summary(spy_adf)
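
When reading the ADF output, the key comparison is between the test statistic and the critical values: the unit-root null hypothesis is rejected when the statistic is more negative than the critical value. The sketch below pulls those numbers out of the `teststat` and `cval` slots of a `ur.df` result instead of reading them off the printed summary. It runs on a simulated random walk rather than the lab's data, and `adf_row` is a hypothetical helper name.

```r
# Illustrative sketch on simulated data; assumes the urca package is installed
library(urca)

set.seed(42)                                   # arbitrary seed
walk <- cumsum(rnorm(500))                     # simulated random walk
walk_diff <- diff(walk)                        # its first difference

# Hypothetical helper: collect the tau statistic and critical values in one row
adf_row <- function(series, label) {
    test <- ur.df(series, type = "none", selectlags = "AIC")
    data.frame(series     = label,
               statistic  = test@teststat[1],  # tau statistic
               crit_1pct  = test@cval[1, "1pct"],
               crit_5pct  = test@cval[1, "5pct"],
               crit_10pct = test@cval[1, "10pct"])
}

result <- rbind(adf_row(walk, "level"), adf_row(walk_diff, "difference"))
# Reject the unit-root null when the statistic is below the critical value
result$reject_5pct <- result$statistic < result$crit_5pct
result
```

If the level series fails to reject and the differenced series rejects, one difference was enough, which points to an order of integration of one.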

In [None]:
# PLOT SPY Closing Price AND Difference SPY Closing Price
autoplot(spy_close)
autoplot(spy_close_diff)

### Testing for Stationarity on US Total Construction Spending Data

In [None]:
# Total Construction Spending
tscl_spending = tcs_ts[,"TTLCON"]

# PERFORM A TEST REGRESSION 
tscl_spending_adf = ur.df(tscl_spending, type = "none", selectlags = "AIC")
summary(tscl_spending_adf)

In [None]:
# Difference Total Construction Spending
tscl_spending_diff = diff(tscl_spending, lag = 1, differences = 1)

# PERFORM A TEST REGRESSION 
tscl_spending_adf = ur.df(tscl_spending_diff, type = "none", selectlags = "AIC")
summary(tscl_spending_adf)

In [None]:
# PLOT Total Construction Spending AND Difference Total Construction Spending
autoplot(tscl_spending)
autoplot(tscl_spending_diff)

## ARIMA

### SPY 10-Year Daily Closing Prices

In [None]:
# DAILY CLOSING PRICE FOR 10 YEARS
spy_10yrs = spy_xts["2010-01-01/2020-06-26"][,'close']
# ASSIGN THE FIRST MONTH IN spy_10yrs TO spy_start
spy_start = index(spy_10yrs[1])
spy_start = c(year(spy_start),month(spy_start))
# CREATE A TIME-SERIES OBJECT AND ASSIGN TO spy_10yrs_ts
spy_10yrs_ts = ts(spy_10yrs, start = spy_start, frequency = 365)

In [None]:
# PLOT spy_10yrs_ts
autoplot(spy_10yrs_ts)

In [None]:
# TAKE THE FIRST DIFFERENCE OF spy_10yrs_ts AND OMIT ANY NA VALUES
spy_diff_ts = na.omit(diff(spy_10yrs_ts, lag = 1, differences = 1))

In [None]:
# PLOT SPY CLOSING PRICE DIFFERENCE BY TIME
autoplot(spy_diff_ts,
         xlab = "Years",
         ylab = "SPY Closing Prices Difference",
         main = "SPY 10-Years Closing Prices Difference")

In [None]:
# PERFORM A TEST REGRESSION
spy_adf = ur.df(spy_diff_ts, type = "none", selectlags = "AIC")
summary(spy_adf)

### Autocorrelation and Cross-Correlation Function

In [None]:
# PLOT spy_diff_ts WITH ESTIMATES OF THE AUTOCORRELATION FUNCTION
ggAcf(x = spy_diff_ts, lag.max = 10) + theme_bw()

### Partial Autocorrelation and Cross-Correlation Function

In [None]:
# PLOT spy_diff_ts WITH ESTIMATES OF THE PARTIAL AUTOCORRELATION FUNCTION
ggPacf(x = spy_diff_ts, lag.max = 10) + theme_bw()
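
The ACF and PACF plots are typically read together to suggest candidate AR and MA orders. As a reference point, the sketch below simulates an AR(1) process (the coefficient 0.7, the seed, and the sample size are arbitrary choices, not values from the lab) and prints the numeric autocorrelations using base R's `acf()` and `pacf()` rather than the plotting functions above.

```r
# Illustrative sketch on simulated data (base R only)
set.seed(1)                                          # arbitrary seed
ar1 <- arima.sim(model = list(ar = 0.7), n = 2000)   # AR(1) with coefficient 0.7

# Numeric autocorrelations instead of plots
acf_vals  <- as.numeric(acf(ar1,  lag.max = 5, plot = FALSE)$acf)[-1]  # drop lag 0
pacf_vals <- as.numeric(pacf(ar1, lag.max = 5, plot = FALSE)$acf)

# ACF decays gradually; PACF is large at lag 1 and small afterwards
round(data.frame(lag = 1:5, acf = acf_vals, pacf = pacf_vals), 3)

# Approximate 95% bounds under the white-noise null
bound <- qnorm(0.975) / sqrt(length(ar1))
which(abs(pacf_vals) > bound)                        # lags flagged as significant
```

A PACF that cuts off after lag p with a slowly decaying ACF points toward an AR(p) term; the mirror-image pattern points toward an MA(q) term.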

## Fitting the ARIMA Model to SPY Daily Closing Difference Data

In [None]:
# FIT AN ARIMA MODEL TO spy_diff_ts 
spy_model_01 = Arima(spy_diff_ts, order = c(1, 0, 1))
summary(spy_model_01)

In [None]:
# FIT AN ARIMA MODEL TO spy_diff_ts 
spy_model_02 = Arima(spy_diff_ts, order = c(1, 0, 2))
summary(spy_model_02)

In [None]:
# FIT AN ARIMA MODEL TO spy_diff_ts 
spy_model_03 = Arima(spy_diff_ts, order = c(2, 0, 1))
summary(spy_model_03)

In [None]:
# FIT AN ARIMA MODEL TO spy_diff_ts 
spy_model_04 = Arima(spy_diff_ts, order = c(2, 0, 2))
summary(spy_model_04)
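
One common way to choose among candidate orders like the four above is to compare information criteria, where lower values are preferred. The sketch below does this on a simulated ARMA(1,1) series rather than the lab's data; the seed, coefficients, and the choice of `method = "ML"` are assumptions made for the illustration.

```r
# Illustrative sketch on simulated data; assumes the forecast package is installed
library(forecast)

set.seed(2020)                                        # arbitrary seed
y <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 500)

# Same four (p, d, q) orders as in the cells above
orders <- list(c(1, 0, 1), c(1, 0, 2), c(2, 0, 1), c(2, 0, 2))

# Fit each candidate and record its information criteria
comparison <- do.call(rbind, lapply(orders, function(ord) {
    fit <- Arima(y, order = ord, method = "ML")
    data.frame(order = paste(ord, collapse = ","),
               aic = fit$aic, aicc = fit$aicc, bic = fit$bic)
}))

# Lower values indicate a better trade-off between fit and complexity
comparison[order(comparison$aicc), ]
```

The same `aic`, `aicc`, and `bic` values appear in the `summary()` output of each fitted model, so the table is only a more compact view of the comparison.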

## ARIMA Forecasting Plot and Auto ARIMA Function

In [None]:
# DAILY CLOSING PRICE FOR 10 YEARS
spy_10yrs = spy_xts["2010-01-01/2020-06-26"][,'close']
# CONVERT TO A MONTHLY SERIES
spy_10yrs = to.monthly(spy_10yrs)[,'spy_10yrs.Close']

# ASSIGN THE FIRST MONTH IN spy_10yrs TO spy_start
spy_start = index(spy_10yrs[1])
spy_start = c(year(spy_start),month(spy_start))
# ASSIGN THE LAST MONTH IN spy_10yrs TO spy_end
spy_end = index(spy_10yrs[length(spy_10yrs)])
spy_end = c(year(spy_end),month(spy_end))

# CREATE A TIME-SERIES OBJECT AND ASSIGN TO spy_10yrs_ts
spy_10yrs_ts = ts(spy_10yrs, start = spy_start, end = spy_end, frequency = 12)

In [None]:
# PLOT spy_10yrs_ts
autoplot(spy_10yrs_ts)

In [None]:
# FIT BEST ARIMA MODEL TO spy_10yrs_ts
spy_model = auto.arima(spy_10yrs_ts)

In [None]:
# PERFORM AUTOMATIC TIME SERIES FORECASTING
spy_predict = forecast(spy_model, h = 7)

In [None]:
# SUMMARY OF THE FORECASTING RESULT
summary(spy_predict)

In [None]:
# PLOT THE FORECASTING RESULT
autoplot(spy_predict, include = 20)
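
Besides the summary and the plot, the object returned by `forecast()` holds the point forecasts and prediction intervals as components that can be used directly. The sketch below shows this on a simulated monthly series; the series, seed, and `toy_` names are made up for the illustration and are not part of the lab.

```r
# Illustrative sketch on simulated data; assumes the forecast package is installed
library(forecast)

set.seed(7)                                           # arbitrary seed
# Simulated monthly series with drift, starting January 2010 (made-up values)
toy_ts <- ts(100 + cumsum(rnorm(120, mean = 0.5)), start = c(2010, 1), frequency = 12)

toy_model   <- auto.arima(toy_ts)
toy_predict <- forecast(toy_model, h = 7)

toy_predict$mean                  # point forecasts for the next 7 months
colnames(toy_predict$lower)       # interval levels, 80% and 95% by default

# Point forecasts and 95% interval side by side
data.frame(point = as.numeric(toy_predict$mean),
           lo95  = as.numeric(toy_predict$lower[, "95%"]),
           hi95  = as.numeric(toy_predict$upper[, "95%"]))
```

The intervals widen as the horizon grows, which is worth keeping in mind when reading the forecast plot.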

### US Total Construction Spending Data 

### Autocorrelation and Cross-Correlation Function

In [None]:
# PLOT tscl_spending_diff WITH ESTIMATES OF THE AUTOCORRELATION FUNCTION
ggAcf(x = tscl_spending_diff, lag.max = 10) + theme_bw()

### Partial Autocorrelation and Cross-Correlation Function

In [None]:
# PLOT tscl_spending_diff WITH ESTIMATES OF THE PARTIAL AUTOCORRELATION FUNCTION
ggPacf(x = tscl_spending_diff, lag.max = 10) + theme_bw()

## Fitting the ARIMA Model to US Total Construction Spending Difference Data

In [None]:
# FIT AN ARIMA MODEL TO tscl_spending_diff
tscl_model_01 = Arima(tscl_spending_diff, order = c(1, 0, 1))
summary(tscl_model_01)

In [None]:
# FIT AN ARIMA MODEL TO tscl_spending_diff
tscl_model_02 = Arima(tscl_spending_diff, order = c(1, 0, 2))
summary(tscl_model_02)

In [None]:
# FIT AN ARIMA MODEL TO tscl_spending_diff
tscl_model_03 = Arima(tscl_spending_diff, order = c(2, 0, 1))
summary(tscl_model_03)

In [None]:
# FIT AN ARIMA MODEL TO tscl_spending_diff
tscl_model_04 = Arima(tscl_spending_diff, order = c(2, 0, 2))
summary(tscl_model_04)

## ARIMA Forecasting Plot and Auto ARIMA Function

In [None]:
# MONTHLY TOTAL CONSTRUCTION SPENDING FOR 10 YEARS
tcs_10yrs = tcs_xts["2010-01-01/2020-04-01"][,'TTLCON']

# ASSIGN THE FIRST MONTH IN tcs_10yrs TO tcs_start
tcs_start = index(tcs_10yrs[1])
tcs_start = c(year(tcs_start),month(tcs_start))
# ASSIGN THE LAST MONTH IN tcs_10yrs TO tcs_end
tcs_end = index(tcs_10yrs[length(tcs_10yrs)])
tcs_end = c(year(tcs_end),month(tcs_end))

# CREATE A TIME-SERIES OBJECT AND ASSIGN TO tcs_10yrs_ts
tcs_10yrs_ts = ts(tcs_10yrs, start = tcs_start, end = tcs_end, frequency = 12)

In [None]:
# PLOT tcs_10yrs_ts
autoplot(tcs_10yrs_ts)

In [None]:
# FIT BEST ARIMA MODEL TO tcs_10yrs_ts
tcs_model = auto.arima(tcs_10yrs_ts)

In [None]:
# PERFORM AUTOMATIC TIME SERIES FORECASTING
tcs_predict = forecast(tcs_model, h = 12)

In [None]:
# SUMMARY OF THE FORECASTING RESULTS
summary(tcs_predict)

In [None]:
# PLOT THE FORECASTING RESULTS
autoplot(tcs_predict, include = 50)

## Summary

In this lab we learned how to use analytical methods to analyze time series data and build forecasting models. We analyzed financial data in different forms, learned how to identify whether a time series is stationary, and learned how to transform nonstationary data into stationary data for analysis. Topics covered in this lab include testing for stationarity, the autocorrelation and partial autocorrelation functions, and the ARIMA forecasting model.
