# (2) Random Walk with Drift (RW_Drift) Model

A Random Walk with Drift (RW_Drift) model is equivalent to an AR(0) model on first-differenced data with an added constant (c) that captures the mean of the differenced series. Let $Y_{t}$ represent a (log) level series and $\Delta Y_{t} = Y_{t} - Y_{t-1}$ represent the growth rate transformation of that series. An AR(0) plus constant (c) on growth rate data, $\Delta Y_{t} = c + e_{t}$, can be rewritten in level form as $Y_{t} = c + Y_{t-1} + e_{t}$. Therefore, the AR(0) plus constant (c) model on growth rate data is a RW_Drift model on levels data. Its growth rate forecast is the estimated constant, so the model produces forecasts consistent with a “historical mean change.”
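
As a minimal sketch of this equivalence, the snippet below estimates the drift as the sample mean of the first differences (the least-squares estimate of c in an AR(0) plus constant) and iterates the level equation forward. The function name `rw_drift_forecast` and the input values are illustrative assumptions, not part of the notebook's code or data.

```python
from statistics import fmean

def rw_drift_forecast(levels, h=1):
    """h-step-ahead level forecast from a random walk with drift (illustrative helper)."""
    # First differences: Delta Y_t = Y_t - Y_{t-1}
    diffs = [curr - prev for prev, curr in zip(levels, levels[1:])]
    # Estimated drift c = historical mean change
    c_hat = fmean(diffs)
    # Iterating Y_t = c + Y_{t-1} forward h times adds h * c to the last level
    return levels[-1] + h * c_hat

# Made-up level series, for illustration only
y = [100.0, 102.0, 103.0, 106.0]
one_step = rw_drift_forecast(y, h=1)
three_step = rw_drift_forecast(y, h=3)
print(one_step, three_step)
```

In growth rate terms, the forecast at every horizon is simply the estimated constant, which is why the level forecast grows linearly in $h$.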

The following code re-estimates the RW_Drift model each period using walk-forward cross-validation over the validation set. Model validation is carried out using an 80-20 split. The initial training model is estimated on the first 80% of the in-sample data, and the training window then expands by one period at each iteration. Walk-forward cross-validation is carried out on the remaining 20% of the in-sample set. Each $h$-step-ahead forecast is produced by iterating the fitted linear model forward. In the code below, the term "test" refers to the “validation” set and NOT to an out-of-sample test set.
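
The expanding-window scheme described above can be sketched without any model fitting, simply by listing which observations are used for training and which observation each forecast targets. The helper name `walk_forward_splits` is a hypothetical illustration; the index arithmetic is assumed to mirror the MODEL function defined below.

```python
def walk_forward_splits(n_obs, train_frac=0.8, step_size=1):
    """Return (train_end, target_index) pairs for expanding-window validation."""
    train_size = int(n_obs * train_frac)
    test_size = n_obs - train_size
    splits = []
    for t in range(test_size - step_size + 1):
        # The training set is data[:train_end]; it grows by one period per iteration
        train_end = train_size + t
        # Position of the observation targeted by the step_size-ahead forecast
        target_index = train_end + step_size - 1
        splits.append((train_end, target_index))
    return splits

# Ten observations: the first 8 form the initial training set
splits = walk_forward_splits(n_obs=10, train_frac=0.8, step_size=1)
print(splits)
```

With a longer horizon, fewer forecasts can be evaluated, because the last `step_size - 1` training windows have no observed target inside the sample.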

The first block of code defines a function (MODEL) that takes five arguments. The univariate series to be forecasted is passed through the data argument. The argument p sets the number of autoregressive (AR) lags (AR_Lags) in the ARIMA model, and q sets the number of moving average (MA) lags (MA_Lags). The trend argument determines whether the model is estimated with a constant (c) or without one (n), and is set through the Const variable. The forecast horizon, i.e. the number of steps ahead, is set using the step_size argument through horizons. MODEL returns the number of observations in the initial training set (train_size), the training set predictions (train_pred), the test set predictions (test_pred), the training root mean squared error (train_RMSE), the test set root mean squared error (test_RMSE), and the AIC and BIC values of the initial training model.

In [None]:
# Load Library:
from pandas import read_csv
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Function to Fit Model using Walk-Forward Cross-Validation:
def MODEL(data, p = 0, q = 0, step_size = 1, trend = 'n'):
    # Extracting Data:
    index_values = data.index.values
    data = data.values
    # Initial Training & Test Set Sizes:
    train_size = int(len(data)*0.8)
    test_size = len(data)-train_size
    # Storage for Growth Forecasts:
    test_pred = []
    for t in range(test_size-step_size+1):
        # Walk-Forward Training Set: Expands by One Period per Iteration
        train_set = data[:train_size+t]
        # Walk-Forward Test Set:
        test_set = data[train_size+t:]
        # Tracking Convergence:
        print('Test Set Walk-Forward: Iteration '+str(t+1))
        # Fit Training Model:
        model = SARIMAX(train_set, order = (p,0,q), trend = trend, enforce_stationarity = False, enforce_invertibility = False)
        model_fit = model.fit(method = 'bfgs', maxiter = 10000)
        # Original Training Model: No Data Leakage
        if t == 0:
            train_pred = model_fit.predict().reshape((train_size,1))
            AIC = model_fit.aic
            BIC = model_fit.bic
        # N-Step Ahead Forecast:
        test_yhat = model_fit.forecast(steps = step_size)
        test_pred = np.append(test_pred, test_yhat[step_size-1])
    # Model Evaluation:
    train_RMSE = np.sqrt(mean_squared_error(data[:train_size], train_pred))
    test_RMSE = np.sqrt(mean_squared_error(data[train_size+step_size-1:], test_pred))
    # Convert to DataFrame:
    train_set = pd.DataFrame(data[:train_size], index = index_values[:train_size], columns = ['train_set'])
    train_pred = pd.DataFrame(train_pred, index = index_values[:train_size], columns = ['train_pred'])
    test_set = pd.DataFrame(data[train_size+step_size-1:], index = index_values[train_size+step_size-1:], columns = ['test_set'])
    test_pred = pd.DataFrame(test_pred, index = index_values[train_size+step_size-1:], columns = ['test_pred'])
    return train_size, train_pred, test_pred, train_RMSE, test_RMSE, AIC, BIC
# Setting Seed:
np.random.seed(12345)
# Load in Data & Set Index Frequency: DataFrame 
univariate_data = read_csv('Univariate_Data.csv', header = 0, index_col = 0, parse_dates = True)
univariate_data.index = pd.DatetimeIndex(univariate_data.index.values, freq = "MS")
growth_data = 100.0*np.log(univariate_data[['RHP']]).diff().dropna()
AR_Lags = 0
MA_Lags = 0
Const = 'c'
horizons = 1
# Evaluate Model: Growth Rate
train_size, train_pred, test_pred, train_RMSE, test_RMSE, AIC, BIC = MODEL(growth_data, p = AR_Lags, q = MA_Lags, step_size = horizons, trend = Const)

The second block presents and graphs the stored output from the MODEL function. The MODEL above is fit to housing price data in order to forecast real housing price growth rates at the U.S. national level.

In [None]:
# Evaluate Random Walk Model: Growth Rate
print('-----------------------------')
print('National Housing Price Series')
print('-----------------------------')
print('Data Type: Growth Rates')
print('Model Type: Random Walk with Drift')
print('Train RMSE: %.3f' % (train_RMSE))
print('Test RMSE: %.3f' % (test_RMSE))
print('AIC: %.3f' % (AIC))
print('BIC: %.3f' % (BIC))
# Plot Forecast: Growth Rate
sns.set_theme(style = 'whitegrid')
pyplot.figure(figsize = (12,6))
pyplot.plot(growth_data, label = 'Observed')
pyplot.plot(train_pred['train_pred'], label = 'RW_Drift: Train')
pyplot.plot(test_pred['test_pred'], label = 'RW_Drift: Test')
pyplot.xlabel('Date')
pyplot.ylabel('Growth Rate')
pyplot.title('Real Housing Price Series (National)')
pyplot.legend()
pyplot.show()

The third block of code is used to analyze the training set residuals for stationarity. The residuals are computed, then plotted both as a series and as a histogram of their distribution. Lastly, the autocorrelation function (ACF) is plotted and the Augmented Dickey-Fuller (ADF) unit root test is carried out.

In [None]:
# Load Library:
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf
# Define Residuals:
resids = growth_data[:train_size] - train_pred.values
# Plot Residuals:
sns.set_theme(style = 'whitegrid')
pyplot.figure(figsize = (16,4))
pyplot.subplot(1,2,1)
pyplot.plot(resids)
pyplot.xlabel('Date')
pyplot.title('Residual Series')
pyplot.subplot(1,2,2)
pyplot.hist(resids, bins = 20)
pyplot.title('Residual Distribution')
pyplot.tight_layout()
pyplot.show()
# Plot Autocorrelation Function (ACF):
sns.set_theme(style = 'whitegrid')
fig, ax = pyplot.subplots(figsize=(8,4))
plot_acf(resids, title = 'Residual ACF', lags = 36, ax = ax)
pyplot.show()
# ADF Test: Non-Stationary v. Stationary
ADF_Test = adfuller(resids)
print('----------------------')
print('  ADF Unit-Root Test  ')
print('----------------------')
print('Test Statistic: %.3f' % (ADF_Test[0]))
print('P-Value: %.3f' % (ADF_Test[1]))
print('Critical Values:')
for key, value in ADF_Test[4].items():
    print('%s: %.3f' % (key, value))

The last block of code loads the existing .csv files "National_Train_Growth_One" and "National_Test_Growth_One", which contain the stored forecasts from earlier models. The storage files are then augmented with the predicted values from the current model so that they can later be used to estimate the forecast combinations, produce the final "top performing" model plots, and carry out the final comparison tests for predictive accuracy.

In [None]:
# Load Forecast Tables: 
train_forecasts = read_csv('National_Train_Growth_One.csv', header = 0, index_col = 0, parse_dates = True)
train_forecasts.index = pd.DatetimeIndex(train_forecasts.index.values, freq = "MS")
test_forecasts = read_csv('National_Test_Growth_One.csv', header = 0, index_col = 0, parse_dates = True)
test_forecasts.index = pd.DatetimeIndex(test_forecasts.index.values, freq = "MS")
# Add New Forecast Model:
train_forecasts['RW_Drift'] = train_pred['train_pred']
test_forecasts['RW_Drift'] = test_pred['test_pred']
# Save Forecast:
pd.DataFrame(train_forecasts).to_csv('National_Train_Growth_One.csv')
pd.DataFrame(test_forecasts).to_csv('National_Test_Growth_One.csv')