## VOO Stock Prices Prediction with ARIMA

In this notebook, we use the Autoregressive Integrated Moving Average (ARIMA) model to forecast the closing price of the Vanguard S&P 500 ETF (VOO).

We begin by importing the necessary libraries and loading the VOO price dataset. Next, we preprocess the data and perform exploratory data analysis (EDA) to understand its characteristics. We then train an ARIMA model on the historical prices and evaluate its forecasting performance.

<br/>

---


### Introduction: What is ARIMA?

<br/>

ARIMA (**A**uto**R**egressive **I**ntegrated **M**oving **A**verage) is a time series forecasting model that combines autoregressive and moving average components with differencing to handle non-stationary data. The general form of an ARIMA model is `ARIMA(p, d, q)`, where `p`, `d`, and `q` are the parameters of the model.
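For reference, one common way to write this model uses the lag operator $L$, defined by $L y_t = y_{t-1}$. The symbols $\phi_i$, $\theta_j$, $c$ and $\varepsilon_t$ are notation introduced here for illustration and are not used elsewhere in this notebook:

$$
\Bigl(1 - \sum_{i=1}^{p} \phi_i L^i\Bigr)(1 - L)^d \, y_t = c + \Bigl(1 + \sum_{j=1}^{q} \theta_j L^j\Bigr)\varepsilon_t
$$

Here $y_t$ is the observed series, $\phi_i$ are the autoregressive coefficients, $\theta_j$ are the moving average coefficients, $c$ is a constant, and $\varepsilon_t$ is a white-noise error term. Sign conventions for the $\theta_j$ terms vary between textbooks and software packages.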

`p`: The autoregressive (AR) order, i.e. the number of lagged observations included in the model. It captures the dependence of the current value on its previous values.

`d`: The degree of differencing, i.e. the number of times the series is differenced to make it stationary, so that its statistical properties (such as mean and variance) remain constant over time.

`q`: The moving average (MA) order, i.e. the number of lagged forecast errors included in the model. It captures short-term fluctuations in the time series that the AR component does not account for.
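To make `d` and `p` more concrete, the sketch below applies first-order differencing to a toy series and builds the lagged columns that an AR term with `p = 2` would draw on. The values and the names `y`, `y_diff` and `lags` are illustrative assumptions, not VOO data or part of the original notebook.

```python
import pandas as pd

# Toy series with a linear trend (illustrative values, not VOO data)
y = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])

# d = 1: first-order differencing, y_t - y_{t-1}, removes the linear trend
y_diff = y.diff().dropna()

# p = 2: the two lagged observations an AR(2) term would regress on
lags = pd.DataFrame(
    {"y": y, "lag_1": y.shift(1), "lag_2": y.shift(2)}
).dropna()

print(y_diff.tolist())
print(lags)
```

After one round of differencing, the trending toy series becomes constant, which is the kind of stabilisation `d` is meant to achieve. Real price series are rarely this clean.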

<br/>

---

### I. Importing Libraries

Let us begin by importing the basic modules.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# Apply seaborn's default plot theme (set_theme is the preferred name for set)
sb.set_theme()

---

### II. Data Preparation

Next, we load the dataset and do some simple data cleaning.

In [2]:
df = pd.read_csv('../datasets/VOO_full.csv')
display(df)

,Date,Open,High,Low,Close,Adj Close,Volume
0,2021-03-01,354.549988,359.390015,354.500000,358.119995,340.781616,3721100
1,2021-03-02,358.380005,358.630005,355.160004,355.350006,338.145752,5462600
2,2021-03-03,354.700012,355.640015,350.559998,350.660004,333.682800,6317600
3,2021-03-04,350.489990,353.019989,341.920013,346.339996,329.571930,6604500
4,2021-03-05,349.769989,353.730011,342.589996,352.690002,335.614471,8721300
...,...,...,...,...,...,...,...
771,2024-03-22,479.869995,480.320007,478.820007,479.179993,479.179993,5876800
772,2024-03-25,477.730011,478.790009,477.549988,477.940002,477.940002,6081300
773,2024-03-26,479.059998,479.369995,476.429993,476.600006,476.600006,8073500
774,2024-03-27,479.510010,480.869995,477.450012,480.760010,480.760010,4951400


In [3]:
# Check number of rows and columns
print("Data dims: ", df.shape)

Data dims:  (776, 7)


In [4]:
# Check details of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 776 entries, 0 to 775
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       776 non-null    object 
 1   Open       776 non-null    float64
 2   High       776 non-null    float64
 3   Low        776 non-null    float64
 4   Close      776 non-null    float64
 5   Adj Close  776 non-null    float64
 6   Volume     776 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 42.6+ KB


The dataset has 776 rows and 7 columns: 6 numeric variables, plus `Date`, which is stored as `object` (string) and must be converted to a datetime type before time series analysis. No column has missing values.
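The split between numeric and non-numeric columns can also be checked programmatically. The sketch below is a minimal illustration on the first three rows copied from the table above. The names `csv_text` and `sample` are stand-ins introduced here; in the notebook the same calls would be applied to `df`.

```python
import io
import pandas as pd

# First rows copied from the table above; `sample` stands in for `df`
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2021-03-01,354.549988,359.390015,354.500000,358.119995,340.781616,3721100
2021-03-02,358.380005,358.630005,355.160004,355.350006,338.145752,5462600
2021-03-03,354.700012,355.640015,350.559998,350.660004,333.682800,6317600
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Columns pandas parsed as numbers versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Count of missing values per column
missing_per_column = sample.isna().sum()

print("Numeric:", numeric_cols)
print("Non-numeric:", other_cols)
print(missing_per_column)
```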

The key variables are `Open`, `High`, `Low` and `Close` (often abbreviated OHLC), the four standard price points recorded for each trading day.

`Open`: The price at which a security first trades upon the opening of the trading day.  

`High`: The highest price at which a security trades during the trading day. 

`Low`: The lowest price at which a security trades during the trading day. 

`Close`: The final price at which a security trades at the end of the trading day. 
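These definitions imply a simple consistency rule: on any trading day, `High` should be the largest of the four prices and `Low` the smallest. The sketch below checks that rule on three rows copied from the table above. The names `sample`, `is_consistent` and `daily_range` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Three rows copied from the dataset preview; `sample` stands in for `df`
sample = pd.DataFrame(
    {
        "Open": [354.549988, 358.380005, 354.700012],
        "High": [359.390015, 358.630005, 355.640015],
        "Low": [354.500000, 355.160004, 350.559998],
        "Close": [358.119995, 355.350006, 350.660004],
    },
    index=pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
)

# High must be the row-wise maximum and Low the row-wise minimum
prices = sample[["Open", "High", "Low", "Close"]]
is_consistent = (sample["High"] == prices.max(axis=1)) & (
    sample["Low"] == prices.min(axis=1)
)

# Intraday range: difference between the day's high and low
daily_range = sample["High"] - sample["Low"]

print(is_consistent)
print(daily_range)
```

A row that fails this check would usually point to a data entry or export problem and is worth inspecting before modeling.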

<br/>

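In this analysis, we focus only on the `Close` series, which is the response variable. `Date` serves as the time index rather than a conventional predictor: ARIMA forecasts `Close` from its own past values and past forecast errors.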

In [5]:
# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Set the 'Date' column as the index
df.set_index('Date', inplace=True)

# Check the cleaned dataset
df.head()

Date,Open,High,Low,Close,Adj Close,Volume
2021-03-01,354.549988,359.390015,354.5,358.119995,340.781616,3721100
2021-03-02,358.380005,358.630005,355.160004,355.350006,338.145752,5462600
2021-03-03,354.700012,355.640015,350.559998,350.660004,333.6828,6317600
2021-03-04,350.48999,353.019989,341.920013,346.339996,329.57193,6604500
2021-03-05,349.769989,353.730011,342.589996,352.690002,335.614471,8721300
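The conversion and indexing steps above can also be folded into the load itself. This is a minimal sketch, assuming the CSV has the same columns as shown earlier. The in-memory `csv_text` and the name `sample` are illustrative stand-ins for `'../datasets/VOO_full.csv'` and `df`.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file, using rows copied from the preview
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2021-03-01,354.549988,359.390015,354.500000,358.119995,340.781616,3721100
2021-03-02,358.380005,358.630005,355.160004,355.350006,338.145752,5462600
2021-03-03,354.700012,355.640015,350.559998,350.660004,333.682800,6317600
"""

# parse_dates converts 'Date' to datetime; index_col makes it the index
sample = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Date"], index_col="Date"
)

# Time series models expect observations in chronological order
print(type(sample.index).__name__)
print(sample.index.is_monotonic_increasing)
```

Either approach yields a `DatetimeIndex`. Doing it in two explicit steps, as in the cell above, makes the transformation easier to follow in a tutorial setting.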


In [8]:
# Extract the 'Close' column as a separate DataFrame
df_close = df[['Close']]
df_close

Date,Close
2021-03-01,358.119995
2021-03-02,355.350006
2021-03-03,350.660004
2021-03-04,346.339996
2021-03-05,352.690002
...,...
2024-03-22,479.179993
2024-03-25,477.940002
2024-03-26,476.600006
2024-03-27,480.760010
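The double brackets in `df[['Close']]` matter: they return a one-column DataFrame, whereas `df['Close']` returns a Series. The sketch below shows the difference on a few rows copied from the output above. The names `sample`, `close_frame` and `close_series` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Small stand-in for `df`, using values copied from the output above
sample = pd.DataFrame(
    {
        "Close": [358.119995, 355.350006, 350.660004],
        "Volume": [3721100, 5462600, 6317600],
    },
    index=pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
)
sample.index.name = "Date"

close_frame = sample[["Close"]]  # list of column names -> DataFrame
close_series = sample["Close"]   # single column name -> Series

print(type(close_frame).__name__, close_frame.shape)
print(type(close_series).__name__, close_series.shape)
```

Both forms keep the `Date` index. Which one is more convenient depends on the plotting and modeling functions used later.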
