## Section II: Data Analysis of SP500

According to Wikipedia, "the SP500 is a stock market index that tracks the stocks of 500 large-cap U.S. companies. It represents the stock market's performance by reporting the risks and returns of the biggest companies. Investors use it as the benchmark of the overall market, to which all other investments are compared."

This notebook examines the price history of the SP500, looking for trends, randomness, and summary statistics.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline  



In [None]:
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import lag_plot
from pandas.plotting import autocorrelation_plot
from pandas.plotting import bootstrap_plot
import matplotlib.pyplot as plt

plt.close('all')



### SP500 Preprocessing

The initial pickle file is a multi-indexed DataFrame. Reducing the index to one dimension, dropping columns, and limiting the data range produces the data the models will be trained on.

In [None]:
import pandas as pd
df = pd.read_pickle("sp500.pickle")

print("Df shape is : {}".format(df.shape))
for name in df.columns:
    print(name)

In [None]:
df[['Open', 'High', 'Low', 'Close', 'Volume']].tail(5)
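
The preprocessing steps named in this section (flattening the index, dropping columns, limiting the range) are not shown in the notebook. A minimal sketch on a small synthetic frame might look like the following. The `(Symbol, Date)` MultiIndex, the `Adj Close` column, and the dates are assumptions made for illustration, not the layout of `sp500.pickle`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the raw data. The (Symbol, Date) MultiIndex and the
# "Adj Close" column are hypothetical and used for illustration only.
dates = pd.date_range("2020-01-01", periods=6, freq="D")
index = pd.MultiIndex.from_product([["SP500"], dates], names=["Symbol", "Date"])
raw = pd.DataFrame(
    {
        "Close": np.arange(100.0, 106.0),
        "Adj Close": np.arange(100.0, 106.0),
        "Volume": np.arange(6) * 1000,
    },
    index=index,
)

# 1. Reduce the index to one dimension by selecting a single symbol.
flat = raw.xs("SP500", level="Symbol")

# 2. Drop columns that are not needed.
flat = flat.drop(columns=["Adj Close"])

# 3. Limit the range with a label-based slice (both ends are inclusive).
sample = flat.loc["2020-01-02":"2020-01-05"]
print(sample)
```

Selecting with `xs` removes the selected level, so the result is indexed by date alone and can be sliced by date labels.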

### SP500 Closing Price 

A simple plot of the Close price over time. The price clearly drifts upward in most years.

In [None]:
df[['Close']].plot()

### SP500 Closing Price Percentage Change

Even though the price trends upward, from one day to the next the SP500 closing price changes by only a couple of percent.

In [None]:
print("Prior to Close % change mean: {}".format(df['Target+'].mean()))
print("Prior to Close % change stddev: {}".format(df['Target+'].std()))
df[['Target+']].plot()
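
How the `Target+` column was built is not shown here. Assuming it holds a day-over-day percent change of the close, a series like it could be derived with `pct_change`, as in this sketch on made-up prices:

```python
import pandas as pd

# Made-up closing prices, chosen so the daily changes are +2%, -2%, +2%.
close = pd.Series([100.0, 102.0, 99.96, 101.9592], name="Close")

# pct_change() returns a fraction (0.02 for 2%), so scale to percent.
# The first row has no prior value and is NaN, so it is dropped.
pct = (close.pct_change() * 100).dropna()

print("mean: {}".format(pct.mean()))
print("stddev: {}".format(pct.std()))
```

Unlike the raw price, this series has no upward drift built in: each value depends only on the ratio of consecutive closes.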


## Absolute Close Price - Non Stationary Test

### Adfuller Statistical Test
The Augmented Dickey-Fuller test can be used to test for a unit root in a univariate process in the presence of serial correlation. Source: Statsmodels documentation

If the p-value is above 0.05 and the ADF statistic is above (less negative than) the critical values, the null hypothesis of a unit root cannot be rejected. This indicates that the series is non-stationary, meaning its statistical properties change over time.


In [None]:
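# Augmented Dickey-Fuller test; the null hypothesis is that the series has a unit root
result = adfuller(df['Target'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
# result[4] maps significance levels ('1%', '5%', '10%') to critical values
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))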
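
The decision rule can be made explicit with a small helper: the unit-root null is rejected when the p-value falls below the chosen significance level, and the statistic is then more negative than the matching critical value. The function name `interpret_adf` and the numbers below are hypothetical; they are not results from this notebook.

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Summarise an ADF result as 'stationary' or 'non-stationary'.

    The null hypothesis of the test is that the series has a unit root,
    so rejecting it is evidence of stationarity.
    """
    # Significance levels whose critical value the statistic falls below.
    passed = [level for level, cutoff in critical_values.items()
              if statistic < cutoff]
    verdict = "stationary" if p_value < alpha else "non-stationary"
    return verdict, passed


# Illustrative numbers only.
critical = {"1%": -3.43, "5%": -2.86, "10%": -2.57}
print(interpret_adf(-0.5, 0.89, critical))
print(interpret_adf(-15.2, 0.0001, critical))
```

With the `adfuller` output above, the call would be `interpret_adf(result[0], result[1], result[4])`.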


### Lag Plot
Lag plots are used to check if a data set or time series is random. Random data should not exhibit any structure in the lag plot. Non-random structure implies that the underlying data are not random. Source: Pandas Documentation

Even at a lag of 10 time steps (days), the plot shows a clear structure.

In [None]:
lag_plot(df['Target'], lag=10)
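
A lag plot is a scatter of each value against the value `lag` steps later, so the structure it shows can also be summarised as a single correlation. The sketch below uses synthetic data, not the SP500 series: white noise stands in for a random series and its cumulative sum for a price-like one. The helper `lag_correlation` is a hypothetical name.

```python
import numpy as np


def lag_correlation(values, lag):
    """Pearson correlation between y(t) and y(t + lag), for lag >= 1."""
    values = np.asarray(values, dtype=float)
    return float(np.corrcoef(values[:-lag], values[lag:])[0, 1])


rng = np.random.default_rng(0)
noise = rng.normal(size=2000)      # random series: no structure expected
random_walk = np.cumsum(noise)     # price-like series: strong structure expected

print("noise, lag 10: {:.3f}".format(lag_correlation(noise, 10)))
print("random walk, lag 10: {:.3f}".format(lag_correlation(random_walk, 10)))
```

A correlation near zero corresponds to a shapeless cloud in the lag plot, while a value near one corresponds to points hugging the diagonal.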

### Autocorrelation Plot
Autocorrelation plots are often used for checking randomness in time series. This is done by computing autocorrelations for data values at varying time lags. If time series is random, such autocorrelations should be near zero for any and all time-lag separations. If time series is non-random then one or more of the autocorrelations will be significantly non-zero. Source: Pandas Documentation

Again, the closing price is clearly non-random.

In [None]:
autocorrelation_plot(df['Target'])
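
For reference, the quantity drawn at each lag can be computed by hand. This sketch assumes the usual sample estimator, r(h) = c(h) / c(0), with c(h) the lag-h autocovariance divided by the series length, and an approximate 95% band of 1.96 / sqrt(n) around zero. It runs on synthetic white noise rather than the SP500 data.

```python
import numpy as np


def autocorrelation(values, lag):
    """Sample autocorrelation r(h) = c(h) / c(0) at the given lag."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    c0 = np.sum((values - mean) ** 2) / n
    ch = np.sum((values[: n - lag] - mean) * (values[lag:] - mean)) / n
    return float(ch / c0)


rng = np.random.default_rng(1)
noise = rng.normal(size=1000)

# For a random series, most lags should fall inside the 95% band.
band_95 = 1.96 / np.sqrt(len(noise))
outside = sum(abs(autocorrelation(noise, h)) > band_95 for h in range(1, 51))
print("lags outside the 95% band: {} of 50".format(outside))
```

For a random series only a few of the 50 lags should land outside the band by chance; a trending series such as a price level stays far outside it for many lags.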

## Percent Change in Close Price - Stationary Test

### Density Plot of Prior to Close Percent Change

Breaking down the distribution of percent price changes at the close gives insight into what biases may exist. From the two plots below, the price change is slightly biased toward the positive, as shown by the mean and median. However, the midrange (the midpoint between the smallest and largest values) is below zero, indicating that when the market turns negative, it tends to drop more. The premise drawn from this is that markets fall faster than they rise.

In [None]:
df['Target+'].plot.kde()


In [None]:
bootstrap_plot(df['Target+'], size=50, samples=500, color='green')
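
The three statistics in the bootstrap plot (mean, median, midrange) are simple to compute directly, which makes the sign of each easier to check. The sketch below uses toy numbers, not the SP500 data, and draws its random subsets without replacement, which is an assumption about the resampling procedure. `summary_stats` is a hypothetical helper.

```python
import numpy as np


def summary_stats(values):
    """Mean, median, and midrange (midpoint of min and max) of a sample."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": float(values.mean()),
        "median": float(np.median(values)),
        "midrange": float((values.min() + values.max()) / 2),
    }


# Toy daily percent changes: mostly small gains and one large drop.
changes = np.array([-3.0, 1.0, 1.0, 1.0, 2.0])
print(summary_stats(changes))

# Bootstrap: recompute the statistics on many random subsets of the data.
rng = np.random.default_rng(0)
boot = [summary_stats(rng.choice(changes, size=3, replace=False))
        for _ in range(500)]
midranges = np.array([b["midrange"] for b in boot])
print("share of subsets with negative midrange: {:.2f}".format(
    (midranges < 0).mean()))
```

In the toy sample the mean and median are positive while the midrange is negative, the same pattern described above: a single large drop outweighs the largest gain.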

The ADF statistic is below the 10% critical value and the p-value is effectively zero, so the unit-root null hypothesis is rejected: the series is clearly stationary.

In [None]:
# Augmented Dickey-Fuller test on the percent change series
result = adfuller(df['Target+'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

In [None]:
lag_plot(df['Target+'], lag=10)

In [None]:
autocorrelation_plot(df['Target+'])

## Close Price Net Change - Stationary Test

In [None]:
# Augmented Dickey-Fuller test on the Prior_Close series
result = adfuller(df['Prior_Close'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

In [None]:
lag_plot(df['Prior_Close'], lag=10)

In [None]:
autocorrelation_plot(df['Prior_Close'])