In [1]:
# Import statements (standard)
import math
import time
import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Import statements (custom)
import helper_functions as hf
import granger_causality as gc

In [2]:
#df = pd.read_csv('~/Desktop/Springboard/Cryptocurrency/cleaned_crypto_closing_prices.csv', index_col='time')
df = pd.read_csv('../data/cleaned_crypto_closing_prices.csv', index_col='time')

In [3]:
df.head()

time,BTC_USD,DASH_USD,ETH_USD,LTC_USD,XMR_USD
9/17/17 7:00,3573.556,294.153333,246.1225,49.59,92.036667
9/17/17 8:00,3595.146,296.706667,246.485,49.892,92.756667
9/17/17 9:00,3670.806,302.993333,253.555,51.37,94.123333
9/17/17 10:00,3669.62,301.62,253.1875,50.94,95.44
9/17/17 11:00,3688.812,304.31,255.56,51.226,95.6


## Test for Stationarity before/after differencing:

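Before looking into Granger causality, we need to confirm that each of our time series is stationary. To do so, I'll use the augmented Dickey-Fuller (ADF) test (`adfuller` from `statsmodels`: http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html). The null hypothesis of the ADF test is that the series has a unit root (i.e., is non-stationary), so a small p-value is evidence of stationarity.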

In [4]:
# BEFORE
for crypto_str in df.columns:
    crypto_df = df[crypto_str]
    print('%s:' % crypto_str)

    adf_test_results = hf.adf_stationarity_test(crypto_df, al='AIC')
    print(adf_test_results)
    print('\n')

BTC_USD:
Number of Data Points    1973.000000
Number of Lags             26.000000
Test Statistic              0.597288
p-value                     0.987549
dtype: float64


DASH_USD:
Number of Data Points    1974.000000
Number of Lags             25.000000
Test Statistic              0.442954
p-value                     0.983032
dtype: float64


ETH_USD:
Number of Data Points    1973.000000
Number of Lags             26.000000
Test Statistic             -0.348741
p-value                     0.918312
dtype: float64


LTC_USD:
Number of Data Points    1973.000000
Number of Lags             26.000000
Test Statistic              1.667423
p-value                     0.998046
dtype: float64


XMR_USD:
Number of Data Points    1973.000000
Number of Lags             26.000000
Test Statistic              1.402269
p-value                     0.997124
dtype: float64




In [5]:
diff_df = hf.difference_prices(df)

In [6]:
# AFTER
for crypto_str in diff_df.columns:
    crypto_df = diff_df[crypto_str]
    print('%s:' % crypto_str)

    adf_test_results = hf.adf_stationarity_test(crypto_df, al='AIC')
    print(adf_test_results)
    print('\n')

BTC_USD:
Number of Data Points    1.972000e+03
Number of Lags           2.600000e+01
Test Statistic          -7.394542e+00
p-value                  7.840737e-11
dtype: float64


DASH_USD:
Number of Data Points    1.972000e+03
Number of Lags           2.600000e+01
Test Statistic          -8.092700e+00
p-value                  1.364732e-12
dtype: float64


ETH_USD:
Number of Data Points    1.972000e+03
Number of Lags           2.600000e+01
Test Statistic          -9.243282e+00
p-value                  1.561619e-15
dtype: float64


LTC_USD:
Number of Data Points    1.972000e+03
Number of Lags           2.600000e+01
Test Statistic          -1.096753e+01
p-value                  8.032989e-20
dtype: float64


XMR_USD:
Number of Data Points    1.973000e+03
Number of Lags           2.500000e+01
Test Statistic          -8.688091e+00
p-value                  4.110910e-14
dtype: float64




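Before differencing, every p-value exceeded 0.9, so we could not reject the unit-root null for any series. After taking the first difference, every p-value is far below 0.05, so we reject the null in each case and can reasonably treat the differenced series as stationary.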
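
For reference, first differencing can be sketched with pandas alone. The function `first_difference` below is a hypothetical stand-in for `hf.difference_prices`, whose source is not shown here; dropping the leading `NaN` row is an assumption, though it is consistent with the differenced series having one fewer data point in the output above. The sample values are the first few BTC and ETH prices from `df.head()`.

```python
import pandas as pd


def first_difference(prices):
    """Return x_t - x_{t-1} for every column, dropping the first (NaN) row.

    Hypothetical stand-in for hf.difference_prices (an assumption).
    """
    return prices.diff().dropna()


prices = pd.DataFrame(
    {'BTC_USD': [3573.556, 3595.146, 3670.806, 3669.62],
     'ETH_USD': [246.1225, 246.485, 253.555, 253.1875]},
    index=['9/17/17 7:00', '9/17/17 8:00', '9/17/17 9:00', '9/17/17 10:00'],
)

diffed = first_difference(prices)
print(diffed)
```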

## Look for Granger Causality

With the first difference of each time series shown to be stationary, we can now look for Granger causality. We'll do so using the function `grangercausalitytests` (http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.grangercausalitytests.html), available as part of `statsmodels`.

Based on my reading of the underlying source code, the parameter `max_lags` (which I've set to 48 for now) sets the maximum number of lags *included* in the test. The test is run once for each lag value from 1 up to and including `max_lags`, and each run includes all of the lags up to and including that value.

In [7]:
n_cols = len(diff_df.columns)
for i in range(n_cols):
    for j in range(n_cols):
        # Skip testing a series against itself
        if i == j:
            continue
        print('Series 1: ', diff_df.columns[i])
        print('Series 2: ', diff_df.columns[j])
        gc_results = gc.granger_causality(diff_df, i, j)
        print('*******************************')
        print('\n')

Series 1:  BTC_USD
Series 2:  DASH_USD

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.0164  , p=0.8982  , df_denom=1995, df_num=1
ssr based chi2 test:   chi2=0.0164  , p=0.8981  , df=1
likelihood ratio test: chi2=0.0164  , p=0.8981  , df=1
parameter F test:         F=0.0164  , p=0.8982  , df_denom=1995, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=5.2566  , p=0.0053  , df_denom=1992, df_num=2
ssr based chi2 test:   chi2=10.5397 , p=0.0051  , df=2
likelihood ratio test: chi2=10.5119 , p=0.0052  , df=2
parameter F test:         F=5.2566  , p=0.0053  , df_denom=1992, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=7.4514  , p=0.0001  , df_denom=1989, df_num=3
ssr based chi2 test:   chi2=22.4328 , p=0.0001  , df=3
likelihood ratio test: chi2=22.3076 , p=0.0001  , df=3
parameter F test:         F=7.4514  , p=0.0001  , df_denom=1989, df_num=3

Granger Causality
number of lags (no zero) 4
ssr

Here, the null hypothesis is that we do *not* have Granger causality, so the quickly vanishing p-value for a given combination means that we reject the null: we do, in fact, have Granger causality if we include the `number of lags` displayed above the test results. At large lag counts this is likely spurious (in the limit of infinitely many predictors, all with nonzero weights, we would get a perfect fit), but where do we draw the line?

If we instead want to determine whether we have causality at each *single* lag, would we just construct the lagged series ourselves and feed them into `granger_causality` with `max_lags` set to 1? If so, are we allowed to discretize in this fashion (i.e., include values of the time series at a chosen set of lags instead of at every lag up to and including a given lag)?