# EXERCISE 3:

***Neural Networks and Gaussian process:***
Predict the SP500 with the financial indicators assigned to your team in the google spreadsheet (ep, dp, de, dy, dfy, bm, svar, ntis, infl, tbl; see RLab3 2 GWcausalSP500.R), some lagged series of these indicators, and lags of the target, using a Neural Network and a GP regression with your desired kernel. Predict return, or price, or direction (up or down). Which target works best? Do some feature selection to disregard some variables and select appropriate lags: causality, (distance) correlation, VAR-test, Lasso ... (The script RLab5 GausProc.R can be of help. The dataset is goyal-welch2022Monthly.csv; work within the period 1927/2021.)

In our case, we have been assigned variables dp, de and ep.

# INDEX:

0. [DATA AND LIBRARY IMPORTS](#0.-DATA-AND-LIBRARY-IMPORTS)

1. [PREPROCESSING AND FEATURE ENGINEERING](#1.-PREPROCESSING-AND-FEATURE-ENGINEERING)

2. [LAG CREATION AND SELECTION](#2.-LAG-CREATION-AND-SELECTION)

3. [NEURAL NETWORK](#3.-NEURAL-NETWORK)

4. [GAUSSIAN PROCESS](#4.-GAUSSIAN-PROCESS)

# 0. DATA AND LIBRARY IMPORTS

[Back to Index](#INDEX)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.stattools import grangercausalitytests, adfuller, acf, pacf
from sklearn.preprocessing import StandardScaler, RobustScaler
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

After working through the later steps, we decided to keep the last observation of 1926 (December). It gives the differenced variables a valid value from the very first month of 1927.

In [None]:
snp = pd.read_csv('goyal-welch2022Monthly.csv')

snp['yyyymm'] = snp['yyyymm'].astype(str)
snp['yyyymm'] = pd.to_datetime(snp['yyyymm'], format='%Y%m')
snp = snp.loc[(snp['yyyymm'] >= '1926-12-01') & (snp['yyyymm'] < '2022-01-01')].reset_index(drop=True)
snp['Index'] = snp['Index'].str.replace(',', '').astype(float)

display(snp)

[Back to Index](#INDEX)

# 1. PREPROCESSING AND FEATURE ENGINEERING

[Back to Index](#INDEX)

In [None]:
snp.info()

We can observe that there are no missing values for our variables of interest (Index, D12 and E12).

We now compute the new features we need:

- Dividend Price Ratio (DP)
- Dividend Earnings Ratio (DE)
- Earnings Price Ratio (EP)
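
As a quick sanity check on these definitions: since all three are log ratios, they are linked by the identity DP = DE + EP, because log(D/P) = log(D/E) + log(E/P). The sketch below verifies this on made-up numbers; the `toy` frame and its values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values standing in for Index, D12 and E12 (illustrative only).
toy = pd.DataFrame({
    'Index': [100.0, 105.0, 98.0],
    'D12': [3.0, 3.1, 3.2],
    'E12': [6.0, 6.5, 5.9],
})

dp = np.log(toy['D12']) - np.log(toy['Index'])  # log dividend-price ratio
de = np.log(toy['D12']) - np.log(toy['E12'])    # log dividend-earnings ratio
ep = np.log(toy['E12']) - np.log(toy['Index'])  # log earnings-price ratio

# log(D/P) = log(D/E) + log(E/P), so the three series are linearly dependent.
identity_holds = np.allclose(dp, de + ep)
print(identity_holds)
```

Because of this exact linear dependence, only two of the three ratios carry independent information, which is worth keeping in mind during feature selection.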

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when we add columns below
snp = snp[['yyyymm', 'Index', 'D12', 'E12']].copy()

snp['LogReturns'] = np.log(snp['Index']).diff()
snp['PriceDiv'] = snp['Index'] + snp['D12']
snp['LogReturnsDiv'] = np.log(snp['PriceDiv']).diff()

# We need to fill the NaN values with 0 because the ADF test doesn't tolerate NaN values and we might still need to further differentiate the series.
snp.fillna({'LogReturns': 0}, inplace=True)
snp.fillna({'LogReturnsDiv': 0}, inplace=True)

In [None]:
snp['DP'] = np.log(snp['D12']) - np.log(snp['Index'])
snp['DE'] = np.log(snp['D12']) - np.log(snp['E12'])
snp['EP'] = np.log(snp['E12']) - np.log(snp['Index'])

display(snp.head())

In [None]:
fig, axes = plt.subplots(nrows=9, ncols=1, figsize=(10, 18))

columns_to_plot = ['Index', 'PriceDiv', 'D12', 'E12', 'LogReturns', 'LogReturnsDiv', 'DP', 'DE', 'EP']
for i, col in enumerate(columns_to_plot):
    axes[i].plot(snp['yyyymm'], snp[col], marker='', linestyle='-')
    axes[i].set_title(col)
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.show()

In [None]:
# We test for stationarity in the data

for col in columns_to_plot:
    adf_result = adfuller(snp[col])
    print(f'ADF Statistic for {col}: {adf_result[0]}')
    print(f'p-value for {col}: {adf_result[1]}\n')

Most of the variables we are interested in are stationary, but a couple should be differenced further to make them stationary.

That being said, log returns and log returns + dividends are already clearly stationary, so we will not worry about their stationarity any further.

Additionally, the boxplots further down show that DE and EP are quite skewed, so even though they are already reasonably stationary we will difference them as well to center them a bit more.

In [None]:
snp['DP'] = snp['DP'].diff()
snp['DE'] = snp['DE'].diff()
snp['EP'] = snp['EP'].diff()

display(snp.head())

# We can safely remove the first row now instead of filling it with 0
snp.dropna(axis = 0, inplace = True)

display(snp.head())

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 6))

columns_to_plot = ['DP', 'DE', 'EP']
for i, col in enumerate(columns_to_plot):
    axes[i].plot(snp['yyyymm'], snp[col], marker='', linestyle='-')
    axes[i].set_title(col)
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.show()

# We test again for stationarity in the data

for col in columns_to_plot:
    adf_result = adfuller(snp[col])
    print(f'ADF Statistic for {col}: {adf_result[0]}')
    print(f'p-value for {col}: {adf_result[1]}\n')

The differenced series now appear stationary according to the ADF test, so we may proceed. The outliers need careful attention, though, as they are very specific to the 2008 crisis. We will first standardize the data and afterwards, if there are still outliers, we will treat them or consider robust scaling (i.e. centering on the median and scaling by the interquartile range instead of using the mean and standard deviation).

In [None]:
columns_to_plot = ['LogReturns', 'LogReturnsDiv', 'DP', 'DE', 'EP']

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(15, 10))
for i, col in enumerate(columns_to_plot):
    sns.boxplot(y=col, data=snp, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()

In [None]:
std_scaler = StandardScaler()
robust_scaler = RobustScaler()

snp_std = snp.copy()

for col in columns_to_plot:
    snp_std[col] = robust_scaler.fit_transform(snp_std[[col]])

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(15, 10))
for i, col in enumerate(columns_to_plot):
    # Plot the scaled copy (snp_std), not the original frame
    sns.boxplot(y=col, data=snp_std, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()

As we can observe, the scaled version, even with the robust scaler, spans a wider range than the unscaled data. Because we had already applied a logarithmic transformation, the original outliers were not excessively far from the center, with the exception of the DE and EP ratios.
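
One way to see why the range grows: robust scaling divides by the interquartile range (IQR), and for monthly log returns the IQR is far below 1, so every value, outliers included, gets inflated. The sketch below reproduces the computation by hand on synthetic data; the numbers are made up, and the manual formula is meant to mirror what `RobustScaler` does with its default settings (median centering, 25th to 75th percentile range).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "monthly returns": small spread plus two crisis-like outliers.
x = np.concatenate([rng.normal(0.0, 0.04, size=1000), [-0.35, 0.30]])

# Robust scaling by hand: centre on the median, divide by the IQR.
q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
x_robust = (x - med) / iqr

range_before = x.max() - x.min()
range_after = x_robust.max() - x_robust.min()
print(f'IQR: {iqr:.3f}, range before: {range_before:.2f}, range after: {range_after:.2f}')
```

Since the range after scaling is exactly the original range divided by the IQR, scaling alone cannot shrink the outliers relative to the bulk of the data; it only changes the units.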

Instead, we will "manually" cap the values whose absolute value exceeds 0.3 in the five columns we are considering. As a first attempt we apply a simple form of winsorization with a fixed threshold: values above 0.3 are set to 0.3 and values below -0.3 are set to -0.3.
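
The capping rule can be illustrated on a small hand-made series. The sketch also shows a quantile-based winsorization for comparison; that variant and its 5%/95% cut-offs are an assumption added for illustration and are not used in this notebook.

```python
import pandas as pd

s = pd.Series([-0.45, -0.10, -0.02, 0.00, 0.03, 0.08, 0.35])

# Fixed-threshold capping: anything beyond +/-0.3 is set to +/-0.3.
clipped = s.clip(lower=-0.3, upper=0.3)

# Alternative (assumption, not used here): bounds taken from the data's own quantiles.
lo, hi = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=lo, upper=hi)

print(clipped.tolist())
print(winsorized.round(3).tolist())
```

A fixed threshold is simple and easy to explain, while quantile bounds adapt to each column's scale; which one is preferable depends on how comparable the five columns are.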

In [None]:
for col in columns_to_plot:
    snp[col] = snp[col].clip(upper=0.3, lower=-0.3)

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(15, 10))
for i, col in enumerate(columns_to_plot):
    sns.boxplot(y=col, data=snp, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].tick_params(axis='y', labelsize=10)

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=1, figsize=(10, 10))

columns_to_plot = ['LogReturns', 'LogReturnsDiv', 'DP', 'DE', 'EP']
for i, col in enumerate(columns_to_plot):
    axes[i].plot(snp['yyyymm'], snp[col], marker='', linestyle='-')
    axes[i].set_title(col)
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.show()

[Back to Index](#INDEX)

# 2. LAG CREATION AND SELECTION

[Back to Index](#INDEX)

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))

plot_acf(snp['LogReturns'], lags=100, alpha=0.05, title='ACF for Series', ax=ax1)
ax1.grid(True)

plot_pacf(snp['LogReturns'], lags=100, alpha=0.05, title='PACF for Series', ax=ax2)
ax2.grid(True)

plt.tight_layout()
plt.show()

We can observe that there is no clear pattern in the ACF and PACF plots, so we will start with an arbitrary number of lags and try different options from there.
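
A minimal sketch of how such lagged features could be built with `shift` is shown below. The helper `make_lags`, the toy frame and the choice of 3 lags are all hypothetical and only serve to illustrate the mechanics; the same pattern would apply to the `snp` columns.

```python
import numpy as np
import pandas as pd

def make_lags(df, cols, n_lags):
    """Return a frame with columns '<col>_lag<k>' for k = 1..n_lags (hypothetical helper)."""
    lagged = {
        f'{col}_lag{k}': df[col].shift(k)
        for col in cols
        for k in range(1, n_lags + 1)
    }
    return pd.DataFrame(lagged, index=df.index)

# Toy stand-in for the snp frame (random numbers, illustrative only).
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(10, 2)), columns=['LogReturns', 'DP'])

n_lags = 3  # arbitrary starting point, to be tuned
features = make_lags(toy, ['LogReturns', 'DP'], n_lags)

# The first n_lags rows have incomplete lags, so drop them and align the target.
data = features.join(toy['LogReturns'].rename('target')).dropna()
X, y = data.drop(columns='target'), data['target']
print(X.shape, y.shape)
```

Only past values enter `X`, so the target at time t is predicted from information available up to t-1, which avoids look-ahead when the models are trained later.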

[Back to Index](#INDEX)

# 3. NEURAL NETWORK

[Back to Index](#INDEX)

[Back to Index](#INDEX)

# 4. GAUSSIAN PROCESS

[Back to Index](#INDEX)

[Back to Index](#INDEX)