# Vulcan Analytics Presentation - DataHacks 2021
## Predicting the S&P 500 using various economic indexes
##### Arunav Gupta, Brian Huang, Kyle Nero

In [3]:
import plotly as plot
import numpy as np
import pandas as pd
from prophet import Prophet

# Abstract
<p>
    lorem ipsum
    </p>

# Data Cleaning and Pre-Processing

- **Pivoting**: To make the data easier to work with, we pivoted the table so that each time series is a column and each row holds the observations for one date. This let us visualize and access individual series more effectively.

![Alt](pivot.png "Pivot")
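As a small illustration of the pivot described above, the sketch below reshapes a toy long-format table into wide format. The column names match the ones used in this notebook, but the rows and values are made up for the example.

```python
import pandas as pd

# Toy long-format data: one row per (date, series) pair.
# The values are invented purely for illustration.
long_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
    "series_id": ["UNRATE", "SP500", "UNRATE", "SP500"],
    "value": [3.5, 3000.0, 3.6, 3010.0],
})

# Wide format: one row per date, one column per series.
wide_df = long_df.pivot(index="date", columns="series_id", values="value")
print(wide_df)
```

Each series can then be pulled out as a single column, for example `wide_df["SP500"]`.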



In [8]:
# Read the raw observations (long format: one row per series and date)
messy_obs = pd.read_csv("data/observations_train.csv")

# Parse the date column as datetimes
messy_obs["date"] = pd.to_datetime(messy_obs["date"])

- **Normalization**: The sixty-eight time series came in a variety of units (percentages, exchange rates, unemployment rates, billions of dollars), so normalization was necessary to compare them meaningfully.

In [11]:
# Pivot to wide format: one row per date, one column per series
obs_pivot = messy_obs.pivot(values="value", index="date", columns="series_id")

# Standardize each series to zero mean and unit standard deviation (z-score)
normalize = lambda col: (col - col.mean()) / col.std()
normed_obs = obs_pivot.apply(normalize, axis=0)
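A quick sanity check of the z-score normalization, sketched on a toy frame with made-up column names and values: two columns in very different units end up on the same scale. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Two toy columns in very different units (names and values are made up).
toy = pd.DataFrame({
    "rate_pct": [1.0, 2.0, 3.0, 4.0],
    "billions_usd": [100.0, 300.0, 500.0, 700.0],
})

# Same z-score transform as in the cell above.
normalize = lambda col: (col - col.mean()) / col.std()
normed = toy.apply(normalize, axis=0)

# After normalization every column has mean ~0 and standard deviation ~1.
print(normed.mean().round(6))
print(normed.std().round(6))
```

Because both toy columns are evenly spaced, their normalized values coincide, which shows that the original units no longer matter.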

- **Dealing with Null Values**: The data contained an abundance of null values. To handle them, we first back-filled all the data and then forward-filled any nulls that remained. We chose back-filling as the primary method because most series in our set are collected at the end of a time period and reflect the preceding period, not the next one. Forward-filling took care of the few gaps that were left.

In [10]:
normed_obs.head()

date,AAA10Y,ASEANTOT,BAA10Y,BUSAPPWNSAUS,BUSAPPWNSAUSYY,CBUSAPPWNSAUS,CBUSAPPWNSAUSYY,CUUR0000SA0R,DEXCHUS,DEXUSEU,...,PCUOMINOMIN,SFTPAGRM158SFRBSF,SP500,T10YIE,TEDRATE,TLAACBW027NBOG,TLBACBW027NBOG,TSIFRGHT,UNRATE,WLEMUINDXD
2000-01-02,,,,,,,,,,,...,,,,,,,,,,-0.066997
2000-01-03,-1.034676,,-1.29042,,,,,,1.218044,-1.15307,...,,,,,,,,,,-0.101209
2000-01-04,-0.965665,,-1.251178,,,,,,1.218162,-1.06459,...,,,,,0.783933,,,,,1.874472
2000-01-05,-1.05768,,-1.316581,,,,,,1.218044,-1.049651,...,,,,,0.736376,-1.5885,-1.577472,,,-0.383635
2000-01-06,-1.080684,,-1.316581,,,,,,1.217926,-1.055971,...,,,,,0.807712,,,,,-0.409119


In [14]:
# Back-fill first, then forward-fill whatever is still null
normed_obs = normed_obs.bfill().ffill()
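To see why both passes are needed, here is a minimal sketch on a made-up series: back-filling copies each reading backward over the gap that precedes it, but it cannot fill a gap at the very end of the series, so a forward-fill pass picks up the remainder.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an interior gap, and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 2.0, np.nan])

# Back-fill: each null takes the next valid value.
# The trailing null has no later value, so it stays null.
back_filled = s.bfill()

# Forward-fill: the remaining null takes the last valid value.
filled = back_filled.ffill()

print(back_filled.tolist())
print(filled.tolist())
```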

- **Filtering out data**: The S&P 500 data does not start until February 14, 2011, so we dropped all earlier rows to get rid of unnecessary noise.

In [16]:
# Keep only rows on or after the first S&P 500 observation
observations = normed_obs[normed_obs.index >= '2011-02-14']
observations.head()

date,AAA10Y,ASEANTOT,BAA10Y,BUSAPPWNSAUS,BUSAPPWNSAUSYY,CBUSAPPWNSAUS,CBUSAPPWNSAUSYY,CUUR0000SA0R,DEXCHUS,DEXUSEU,...,PCUOMINOMIN,SFTPAGRM158SFRBSF,SP500,T10YIE,TEDRATE,TLAACBW027NBOG,TLBACBW027NBOG,TSIFRGHT,UNRATE,WLEMUINDXD
2011-02-14,-0.068511,1.424496,-0.191647,0.614001,0.263285,-0.14148,0.058255,-0.590029,-0.772034,0.753857,...,1.047525,-0.088386,-1.283833,0.475112,-0.619011,0.263166,0.235511,0.016147,1.637112,-0.360459
2011-02-15,0.069513,1.424496,-0.191647,0.614001,0.263285,-0.14148,0.058255,-0.590029,-0.78161,0.765348,...,1.047525,-0.088386,-1.294698,0.475112,-0.619011,0.263166,0.235511,0.016147,1.637112,-0.433198
2011-02-16,0.069513,1.424496,-0.191647,0.614001,0.263285,-0.14148,0.058255,-0.590029,-0.78161,0.795799,...,1.047525,-0.088386,-1.27375,0.40198,-0.595232,0.263166,0.235511,0.016147,1.637112,-0.206956
2011-02-17,0.138524,1.424496,-0.152405,0.614001,0.263285,-0.14148,0.058255,-0.590029,-0.78563,0.833145,...,1.047525,-0.088386,-1.26339,0.450735,-0.523896,0.295239,0.268779,0.016147,1.637112,-0.516571
2011-02-18,0.184532,1.424496,-0.126244,0.614001,0.263285,-0.14148,0.058255,-0.590029,-0.799699,0.868193,...,1.047525,-0.088386,-1.256886,0.621376,-0.547675,0.295239,0.268779,0.016147,1.637112,-0.54065
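The cutoff date can also be looked up instead of hard-coded. The sketch below is one possible approach, not the one used in this notebook: it uses `Series.first_valid_index()` on a toy frame with made-up values. It only works on data that has not been filled yet, since back-filling removes the leading nulls that mark where the series starts.

```python
import numpy as np
import pandas as pd

# Toy wide frame; the SP500 column is null before 2011-02-14.
# All values are made up for illustration.
dates = pd.date_range("2011-02-10", periods=6, freq="D")
toy = pd.DataFrame(
    {
        "SP500": [np.nan, np.nan, np.nan, np.nan, 1300.0, 1310.0],
        "UNRATE": [9.0] * 6,
    },
    index=dates,
)

# Index label of the first non-null SP500 observation.
start = toy["SP500"].first_valid_index()

# Label-based slicing on a DatetimeIndex includes the start label.
filtered = toy.loc[start:]
print(start)
print(filtered)
```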


# Prophet

# What is Prophet?
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
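The additive model mentioned above is usually written as a sum of components:

$$
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
$$

Here $g(t)$ is the non-periodic trend, $s(t)$ collects the periodic (seasonal) effects, $h(t)$ captures holiday effects, and $\varepsilon_t$ is the error term that the other components do not explain.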

Prophet was developed by Facebook's Core Data Science team and released as open source in 2017.
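Prophet expects its training data as a two-column frame, with the timestamps in a column named `ds` and the values to forecast in a column named `y`. The sketch below shows one way a single series could be pulled out of a wide, date-indexed table in that shape. The helper `to_prophet_frame` is a hypothetical name introduced here, and the toy values are only illustrative.

```python
import pandas as pd

def to_prophet_frame(wide, target):
    """Return a (ds, y) frame for one column of a date-indexed wide table.

    Hypothetical helper for illustration; not part of Prophet.
    """
    out = wide[target].rename("y").reset_index()
    out.columns = ["ds", "y"]
    return out

# Toy wide frame with illustrative normalized values.
toy = pd.DataFrame(
    {"SP500": [-1.28, -1.29, -1.27]},
    index=pd.date_range("2011-02-14", periods=3, freq="D", name="date"),
)

frame = to_prophet_frame(toy, "SP500")
print(frame)
```

A frame in this shape is what `Prophet().fit(...)` takes as input. From there, `make_future_dataframe` and `predict` would typically be used to produce a forecast.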

# Model
lorem ipsum