### Pre-process Your Data

In a data_analysis.ipynb notebook, create a sklearn pipeline (similar to the one shown in class)
that preprocesses your data as described in the paper. If needed, add imputers or scalers to
your pipeline. Split the data into a training set (until Jan 2015) and a test set (anything after that).
Don’t look at the test set until you have trained all your models on the training set and are
ready to give predictions for today’s market.

Re-create Table 1, Table 2 and Table 3 with your data. Are your results the same as in the
paper? Where do they differ? Comment in your notebook.
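
As a rough illustration of the split-then-fit order described above, the following sketch builds a small synthetic table, splits it by date, and fits an imputer plus scaler on the training rows only. The toy values, the choice of median imputation, and the reading of "until Jan 2015" as "through 2015-01-31 inclusive" are all assumptions made for the example, not requirements taken from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged predictor table (hypothetical values).
rng = np.random.RandomState(0)
dates = pd.bdate_range('2014-10-01', '2015-03-31')
toy = pd.DataFrame(rng.normal(size=(len(dates), 3)),
                   index=dates, columns=['BM', 'DEF', 'OIL'])
toy.iloc[::10, 1] = np.nan  # inject gaps so the imputer has something to fill

# Assumption: "until Jan 2015" means through 2015-01-31 inclusive.
split_date = pd.Timestamp('2015-01-31')
train = toy.loc[toy.index <= split_date]
test = toy.loc[toy.index > split_date]

prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # assumed strategy
    ('scale', StandardScaler()),
])

X_train = prep.fit_transform(train)  # medians, means and stds come from training rows only
X_test = prep.transform(test)        # reuse those statistics; do not refit on the test set
```

Calling `fit_transform` on the training set and only `transform` on the test set keeps test-period information out of the fitted statistics, which is the point of not looking at the test set early.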

In [60]:
from __future__ import division, print_function, unicode_literals

import numpy as np
import pandas as pd
import datetime
import os


In [68]:
# construct DataFrame for all raw data

path = r'E:\NYU MathFin\courses\data science in quant finance\short project\data\rawdata\\'
otherVarNames = ['BY','DEF','VRP','IC','BDI','PCR','MA','PCAtech','OIL','SI'] # daily data
priceVarNames = ['BM','PE','CAPE','DP','PCAprice','CPI','SPX'] # monthly data - need to fill this set to daily data at later time, SPX as date benchmark
varNames = [priceVarNames, otherVarNames]

# daily predictors: one CSV per variable, merged column-wise on the Date index
df_all = pd.DataFrame()
for varName in otherVarNames:
    df = pd.read_csv(path + varName + '.csv', index_col=0, parse_dates=[0], usecols=['Date', varName])
    df_all = pd.concat([df_all, df], axis=1)

# monthly predictors plus SPX, whose dates serve as the daily benchmark
df_price = pd.DataFrame()
for varName in priceVarNames:
    df = pd.read_csv(path + varName + '.csv', index_col=0, parse_dates=[0], usecols=['Date', varName])
    df_price = pd.concat([df_price, df], axis=1)

In [69]:
# pre-processing period
preStartDate = '1990-01-01'
preEndDate = '2017-06-30'
df_all = df_all[(df_all.index >= datetime.datetime.strptime(preStartDate, '%Y-%m-%d')) 
                           & (df_all.index <= datetime.datetime.strptime(preEndDate, '%Y-%m-%d'))]
df_price = df_price[(df_price.index >= datetime.datetime.strptime(preStartDate, '%Y-%m-%d')) 
                           & (df_price.index <= datetime.datetime.strptime(preEndDate, '%Y-%m-%d'))]


In [70]:
# fill the monthly data to daily frequency
# method: carry the last available observation forward over the following missing dates
df_price = df_price.sort_index() # make sure the rows are in ascending date order
df_price.iloc[:,0:-1] = df_price.iloc[:,0:-1].ffill() # every column except the last one (SPX)
# drop rows that still contain NaN: dates without an SPX value and dates before the first monthly observation
df_price = df_price.dropna(axis=0,how='any')
#df_price = df_price.drop(['SPX'],axis=1)
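
The forward-fill step can be checked on a tiny example. The sketch below places two month-end observations on a business-day calendar and carries each one forward until the next arrives. The series values and the short date range are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly series observed on two month-end dates.
monthly = pd.Series([1.0, 2.0],
                    index=pd.to_datetime(['1990-01-31', '1990-02-28']),
                    name='PE')

# Business-day calendar standing in for the SPX trading dates.
daily_index = pd.bdate_range('1990-01-30', '1990-03-02')

# Align to the daily calendar, then carry the last observation forward.
filled = monthly.reindex(daily_index).ffill()

# Dates before the first monthly observation have nothing to carry forward.
usable = filled.dropna()
```

Here `filled` is still NaN on 1990-01-30, equals 1.0 from 1990-01-31 through 1990-02-27, and equals 2.0 from 1990-02-28 onward. Forward-filling never uses a value before its observation date, but it does assume the figure was already published on that date.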

In [71]:
df = pd.concat([df_price,df_all],axis=1,join='inner')
df = df.drop(['SPX'],axis=1)

In [72]:
df

Date,BM,PE,CAPE,DP,PCAprice,CPI,BY,DEF,VRP,IC,BDI,PCR,MA,PCAtech,OIL,SI
1990-01-31,0.506073,15.13,17.05,0.0328,-2.778152,127.500,0.999436,0.43,,,1644.0,-0.757631854,1.0,1.853827,0.070086,
1990-02-01,0.506073,14.97,16.51,0.0328,-2.778152,128.000,0.999895,0.40,,,1642.0,-0.757424016,1.0,2.134645,0.074084,
1990-02-02,0.506073,14.97,16.51,0.0328,-2.778152,128.000,1.000559,0.43,,,1638.0,-0.759989545,0.0,2.949392,0.076998,
1990-02-05,0.506073,14.97,16.51,0.0328,-2.778152,128.000,1.000244,0.45,,,1619.0,-0.753331516,0.0,3.513543,0.061805,
1990-02-06,0.506073,14.97,16.51,0.0328,-2.778152,128.000,1.000295,0.48,,,1606.0,-0.756084665,0.0,3.257571,0.065022,
1990-02-07,0.506073,14.97,16.51,0.0328,-2.778152,128.000,0.999666,0.44,,,1588.0,-0.750112406,0.0,3.735134,0.058214,
1990-02-08,0.506073,14.97,16.51,0.0328,-2.778152,128.000,0.999768,0.41,,,1583.0,-0.749725825,0.0,3.735134,0.055053,
1990-02-09,0.506073,14.97,16.51,0.0328,-2.778152,128.000,0.998688,0.29,,,1583.0,-0.746386933,0.0,3.735134,0.048117,
1990-02-12,0.506073,14.97,16.51,0.0328,-2.778152,128.000,1.000565,0.41,,,1579.0,-0.755471641,0.0,3.465168,0.055437,
1990-02-13,0.506073,14.97,16.51,0.0328,-2.778152,128.000,0.999675,0.51,,,1579.0,-0.75458997,0.0,3.184350,0.048853,
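
The rows above show VRP, IC and SI empty at the start of the sample, so it may help to audit missing values before choosing imputers. The sketch below does this on a toy frame: the DEF values are copied from the first rows of the output, while the VRP values are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

idx = pd.bdate_range('1990-01-31', periods=6)
toy = pd.DataFrame({
    'DEF': [0.43, 0.40, 0.43, 0.45, 0.48, 0.44],
    'VRP': [np.nan, np.nan, np.nan, 1.2, 0.9, 1.1],  # hypothetical values
}, index=idx)

# Share of missing observations per column.
missing_share = toy.isna().mean()

# First date on which each column has data.
first_valid = {col: toy[col].first_valid_index() for col in toy.columns}
```

A column whose gaps are all at the start of the sample (a late start date) is a different problem from one with scattered gaps. Mean or median imputation over a long initial gap would invent history, so trimming the sample or dropping the column may be worth considering instead.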
