## Differencing 

Many (perhaps all) of our predictors may be non-stationary, meaning their statistical properties, such as the mean, change over time. Non-stationary inputs can lead to unreliable and unstable forecasts. In this script, we'll test each variable for stationarity with the Augmented Dickey-Fuller (ADF) test and difference the non-stationary ones until they are stationary.
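To make the idea concrete, here is a minimal sketch on synthetic data (not our dataset). It builds a random walk, whose level drifts over time, and shows that first differencing recovers the underlying shocks. The names `steps`, `walk`, and `diffed` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic example only: a random walk is the cumulative sum of random shocks
rng = np.random.default_rng(0)
steps = rng.normal(size=200)
walk = pd.Series(steps.cumsum(), name="random_walk")

# First difference: x_t - x_(t-1). The first value has no predecessor,
# so it becomes NaN and is dropped.
diffed = walk.diff().dropna()

print(f"original length:    {len(walk)}")
print(f"differenced length: {len(diffed)}")
```

Differencing costs one observation per pass, which is why the notebook drops the first row after differencing.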

NOTE: much of the code in this script is based on or taken from this Medium article: 

First, load in our packages + data:

In [None]:
# PACKAGES
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller

# data
data = pd.read_csv("../data/data_wpk.csv")
data = data.drop("gdp", axis = 1)
data.set_index('date', inplace=True)

# filter out NaNs 
data_filt = data.loc["2011-03-31" : "2025-06-30"]

In [19]:
def adfuller_test(series, signif=0.05, name='', verbose=False):
    """Perform ADFuller to test for Stationarity of given series and print report"""
    r = adfuller(series, autolag='AIC')
    output = {'test_statistic':round(r[0], 4), 'pvalue':round(r[1], 4), 'n_lags':round(r[2], 4), 'n_obs':r[3]}
    p_value = output['pvalue'] 
    def adjust(val, length= 6): return str(val).ljust(length)

    # Print Summary
    print(f'    Augmented Dickey-Fuller Test on "{name}"', "\n   ", '-'*47)
    print(f' Null Hypothesis: Data has unit root. Non-Stationary.')
    print(f' Significance Level    = {signif}')
    print(f' Test Statistic        = {output["test_statistic"]}')
    print(f' No. Lags Chosen       = {output["n_lags"]}')

    for key,val in r[4].items():
        print(f' Critical value {adjust(key)} = {round(val, 3)}')

    if p_value <= signif:
        print(f" => P-Value = {p_value}. Rejecting Null Hypothesis.")
        print(f" => Series is Stationary.")
    else:
        print(f" => P-Value = {p_value}. Weak evidence to reject the Null Hypothesis.")
        print(f" => Series is Non-Stationary.")  


# ADF Test on each column
for name, column in data_filt.items():
    adfuller_test(column, name=column.name)
    print('\n')

    Augmented Dickey-Fuller Test on "gdp_yoy" 
    -----------------------------------------------
 Null Hypothesis: Data has unit root. Non-Stationary.
 Significance Level    = 0.05
 Test Statistic        = -3.5399
 No. Lags Chosen       = 4
 Critical value 1%     = -3.56
 Critical value 5%     = -2.918
 Critical value 10%    = -2.597
 => P-Value = 0.007. Rejecting Null Hypothesis.
 => Series is Stationary.


    Augmented Dickey-Fuller Test on "orders" 
    -----------------------------------------------
 Null Hypothesis: Data has unit root. Non-Stationary.
 Significance Level    = 0.05
 Test Statistic        = -2.4818
 No. Lags Chosen       = 1
 Critical value 1%     = -3.553
 Critical value 5%     = -2.915
 Critical value 10%    = -2.595
 => P-Value = 0.12. Weak evidence to reject the Null Hypothesis.
 => Series is Non-Stationary.


    Augmented Dickey-Fuller Test on "employment" 
    -----------------------------------------------
 Null Hypothesis: Data has unit root. Non-Station

Looks like we have some differencing to do.

In [20]:
# stationary vars: gdp_yoy, unemploy_claims, bus_outlook, auto_sales
# non-stationary vars: orders, employment, consumer_sentiment, construction, itrade, wtrade
# work on an explicit copy; assigning into the .loc slice triggers SettingWithCopyWarning
data_differenced = data_filt.copy()
columns_to_diff = ['orders', 'employment', 'consumer_sentiment', 'construction', 'itrade', 'wtrade']
for col in columns_to_diff:
    # first difference: x_t - x_(t-1); the first row becomes NaN
    data_differenced[col] = data_differenced[col].diff()

# drop the first row, which is NaN in every differenced column
data_diff_filt = data_differenced.dropna()


Now that we've differenced, let's check the Dickey-Fuller test again.

In [21]:
# ADF Test on each column
for name, column in data_diff_filt.items():
    adfuller_test(column, name=column.name)
    print('\n')

    Augmented Dickey-Fuller Test on "gdp_yoy" 
    -----------------------------------------------
 Null Hypothesis: Data has unit root. Non-Stationary.
 Significance Level    = 0.05
 Test Statistic        = -3.5092
 No. Lags Chosen       = 4
 Critical value 1%     = -3.563
 Critical value 5%     = -2.919
 Critical value 10%    = -2.597
 => P-Value = 0.0078. Rejecting Null Hypothesis.
 => Series is Stationary.




    Augmented Dickey-Fuller Test on "orders" 
    -----------------------------------------------
 Null Hypothesis: Data has unit root. Non-Stationary.
 Significance Level    = 0.05
 Test Statistic        = -10.8061
 No. Lags Chosen       = 0
 Critical value 1%     = -3.553
 Critical value 5%     = -2.915
 Critical value 10%    = -2.595
 => P-Value = 0.0. Rejecting Null Hypothesis.
 => Series is Stationary.


    Augmented Dickey-Fuller Test on "employment" 
    -----------------------------------------------
 Null Hypothesis: Data has unit root. Non-Stationary.
 Significance Level    = 0.05
 Test Statistic        = -6.9226
 No. Lags Chosen       = 1
 Critical value 1%     = -3.555
 Critical value 5%     = -2.916
 Critical value 10%    = -2.596
 => P-Value = 0.0. Rejecting Null Hypothesis.
 => Series is Stationary.


    Augmented Dickey-Fuller Test on "consumer_sentiment" 
    -----------------------------------------------
 Null Hypothesis: Data has unit root. Non-Stationary.
 Signif

Business outlook (bus_outlook) is still technically over the 5% level, but barely. We're good here. Time to save the data.

In [24]:
# save data
data_diff_filt.reset_index(inplace = True)
data_diff_filt.to_csv("../data/data_differenced_wpk.csv", index = False)