<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Train-/-Dev-/-Test-Split" data-toc-modified-id="Train-/-Dev-/-Test-Split-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Train / Dev / Test Split</a></span><ul class="toc-item"><li><span><a href="#SPY" data-toc-modified-id="SPY-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>SPY</a></span></li><li><span><a href="#TNX" data-toc-modified-id="TNX-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>TNX</a></span></li><li><span><a href="#VIX" data-toc-modified-id="VIX-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>VIX</a></span></li></ul></li></ul></div>

# Fit the model

We select three indices to sketch the effect of the FOMC's documents on the market: the S&P 500 Index (SPY), the 10-Year Treasury Yield (TNX), and the CBOE Volatility Index (VIX). The data is from [Yahoo Finance](https://finance.yahoo.com/).

In [102]:
import pandas as pd
import datetime
from sklearn.model_selection import train_test_split
import numpy as np
from textblob import TextBlob
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

In [6]:
# For each series: forward-fill gaps, then keep only the daily % change of the adjusted close.
spy_df = pd.read_csv('./data/SPY.csv')
spy_df = spy_df.ffill()  # ffill() returns a new frame, so the result must be assigned
spy_df["SPY"] = spy_df["Adj Close"].pct_change() * 100
spy_df.drop(columns=["Open", 'High', 'Low', 'Close', "Adj Close", 'Volume'], inplace=True)
spy_df.drop(0, inplace=True)  # the first row has no previous day, so its change is NaN
vix_df = pd.read_csv('./data/^VIX.csv')
vix_df = vix_df.ffill()
vix_df["VIX"] = vix_df["Adj Close"].pct_change() * 100
vix_df.drop(columns=["Open", 'High', 'Low', 'Close', "Adj Close", 'Volume'], inplace=True)
vix_df.drop(0, inplace=True)
tnx_df = pd.read_csv('./data/^TNX.csv')
tnx_df = tnx_df.ffill()
tnx_df["TNX"] = tnx_df["Adj Close"].pct_change() * 100
tnx_df.drop(columns=["Open", 'High', 'Low', 'Close', "Adj Close", 'Volume'], inplace=True)
tnx_df.drop(0, inplace=True)
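
The three loading blocks above differ only in the file name and the output column, so they could be folded into one helper. A minimal sketch, assuming the CSVs carry the Yahoo Finance columns used above (`Date`, `Adj Close`); `load_pct_change` is a hypothetical name, and the inline CSV holds illustrative values that stand in for a real file:

```python
import io

import pandas as pd


def load_pct_change(source, name):
    """Return Date plus the daily % change of Adj Close, stored in column `name`."""
    df = pd.read_csv(source)
    df = df.ffill()  # forward-fill gaps; the result has to be assigned
    df[name] = df["Adj Close"].pct_change() * 100
    # Keep only what the merge needs and drop the first row (no previous day).
    return df[["Date", name]].iloc[1:].reset_index(drop=True)


# Illustrative data only, not real quotes. The second row has a missing Adj Close.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2020-01-02,100,101,99,100,100.0,1000\n"
    "2020-01-03,100,102,99,101,,1200\n"
    "2020-01-06,101,103,100,102,110.0,900\n"
)
demo = load_pct_change(sample_csv, "SPY")
print(demo)
```

In the notebook this would be called once per file, for example `load_pct_change('./data/SPY.csv', 'SPY')`.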

In [81]:
mkt_df = pd.merge(spy_df, vix_df, how='left', on='Date')
mkt_df = pd.merge(mkt_df, tnx_df, how='left', on='Date')
mkt_df.rename(columns={'Date':'date'},inplace=True)
mkt_df['date'] = pd.to_datetime(mkt_df['date'])

In [82]:
data_1 = pd.read_feather("./data_1.feather")
data_2 = pd.read_feather("./data_2.feather")
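# Pair each minutes text with the date from data_1 (rows are aligned by index, so both frames must share row order).
data_2 = pd.concat([data_2['minutes'],data_1['date']],axis=1)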

In [83]:
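# Statements are matched to the market move on their own date.
df_statements = pd.merge(data_1, mkt_df, how='left', on='date')
# Shift market dates back 21 days so each date in data_2 pairs with the market move three weeks
# later (presumably around the minutes' release). This overwrites mkt_df['date'] in place.
mkt_df['date'] = mkt_df['date'] - datetime.timedelta(days=21)
df_minutes = pd.merge(data_2, mkt_df, how='left', on='date')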
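
Because both merges are left joins on calendar dates, a document dated on a day with no market row (a weekend or holiday, before or after the 21-day shift) ends up with `NaN` in the market columns. A minimal sketch of how such rows could be counted with the `indicator` option of `pd.merge`; the dates and returns are illustrative, not taken from the data files:

```python
import pandas as pd

# Hypothetical document dates; the second one falls on a Sunday.
docs = pd.DataFrame({"date": pd.to_datetime(["2020-01-29", "2020-03-15"])})
# Hypothetical market rows; there is no row for the Sunday.
mkt = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-29", "2020-03-16"]),
    "SPY": [-0.1, 0.5],
})

# indicator=True adds a "_merge" column saying where each row was found.
checked = pd.merge(docs, mkt, how="left", on="date", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print("documents without a market row:", len(unmatched))
```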

In [84]:
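# Turn the three market columns (SPY, VIX, TNX) into binary labels: 1 if the index rose, 0 otherwise.
for var in df_statements.columns[-3:]:
    df_statements[var] = df_statements[var].apply(lambda x: 1 if x > 0 else 0)
    df_minutes[var] = df_minutes[var].apply(lambda x: 1 if x > 0 else 0)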
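
One detail of the labelling loop: `x > 0` is `False` for `NaN`, so a document whose date found no match in the left merge is labelled 0, as if the market had fallen. A minimal sketch of a variant that keeps missing values missing, using illustrative numbers; `to_direction` is a hypothetical helper, not part of the notebook:

```python
import numpy as np
import pandas as pd


def to_direction(returns):
    """Map returns to 1.0 (up) or 0.0 (flat or down), keeping NaN as missing."""
    labels = (returns > 0).astype("float")
    return labels.where(returns.notna())


returns = pd.Series([0.8, -1.2, np.nan, 0.0])  # illustrative % changes

# The rule used above: the missing value silently becomes 0.
naive = returns.apply(lambda x: 1 if x > 0 else 0)
print(naive.tolist())

# The variant: the missing value stays NaN and can be dropped before training.
labels = to_direction(returns)
print(labels.tolist())
```

Rows that are still `NaN` after labelling could then be removed with `dropna` before the split.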

## Train / Dev / Test Split

Before training any model, we first consider how to split the data. A common scheme, which I took from Andrew Ng's "deeplearning.ai" course, splits the data into three chunks: train, development, and test.

- Train set: The sample of data used for learning
- Development set (Hold-out cross-validation set): The sample of data used to tune the parameters of a classifier, and provide an unbiased evaluation of a model.
- Test set: The sample of data used only to assess the performance of a final model.

A 98/1/1 ratio (98% train, 1% dev, 1% test) makes sense when the dataset is very large: with more than 1.5 million entries, 1% is still more than 15,000 entries, which is more than enough to evaluate a model and refine its parameters. The dataset here is far smaller. The split output below reports 192 documents in total, so the code in this section uses a plain 80/20 train/test split (`test_size=.2`) and no separate dev set.

Another approach is to split the data into train and test sets only and run k-fold cross-validation on the training set, which still gives a held-out estimate for model selection without setting aside a separate dev set. The TextBlob baseline evaluated below is not fitted on the training data, so it is scored directly on the test set.
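
The k-fold alternative mentioned above can be sketched with scikit-learn's `StratifiedKFold`, which keeps the class balance similar in every fold. The labels and features here are synthetic, and the fold count of 5 is an assumption rather than a setting taken from this notebook:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(2000)
y = rng.randint(0, 2, size=150)        # synthetic up/down labels
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2000)
fold_sizes = []
for train_idx, val_idx in skf.split(X, y):
    # A model would be fitted on X[train_idx], y[train_idx]
    # and scored on X[val_idx], y[val_idx] here.
    fold_sizes.append(len(val_idx))

print("validation fold sizes:", fold_sizes)
```

Averaging the per-fold scores gives a steadier estimate than a single small hold-out set, at the cost of fitting the model once per fold.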

### SPY

In [86]:
x = df_minutes['minutes']
y = df_minutes['SPY']

In [89]:
SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2,random_state=SEED)

In [90]:
# Report the size and class balance of each split.
for name, xs, ys in [("Train", x_train, y_train), ("Test", x_test, y_test)]:
    print("{0} set has total {1} entries with {2:.2f}% negative, {3:.2f}% positive".format(
        name, len(xs), (ys == 0).mean() * 100, (ys == 1).mean() * 100))

Train set has total 153 entries with 49.02% negative, 50.98% positive
Test set has total 39 entries with 38.46% negative, 61.54% positive


In [93]:
%%time
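# Score each test document with TextBlob; polarity lies in [-1, 1].
tbresult = [TextBlob(i).sentiment.polarity for i in x_test]
# Predict 1 (up) unless the polarity is strictly negative, so a polarity of exactly 0 counts as positive.
tbpred = [0 if n < 0 else 1 for n in tbresult]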

Wall time: 680 ms
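
The rule in the cell above treats every non-negative polarity, including exactly 0, as a positive prediction. A minimal sketch of that mapping next to a variant with a configurable cut-off; `polarity_to_label` is a hypothetical helper, the polarity values are made up for illustration, and the 0.1 cut-off is an arbitrary assumption rather than a tuned value:

```python
def polarity_to_label(polarities, threshold=0.0):
    """Label 1 when polarity >= threshold, else 0."""
    return [1 if p >= threshold else 0 for p in polarities]


scores = [0.12, 0.0, 0.05, -0.03, 0.21]  # made-up polarity values

default_labels = polarity_to_label(scores)                 # same rule as the cell above
stricter_labels = polarity_to_label(scores, threshold=0.1)
print(default_labels)
print(stricter_labels)
```

If a cut-off is tuned, it should be chosen on the training split only, so that the test score remains an honest estimate.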


In [96]:
conmat = np.array(confusion_matrix(y_test, tbpred, labels=[1, 0]))

confusion = pd.DataFrame(conmat, index=['positive', 'negative'],
                         columns=['predicted_positive', 'predicted_negative'])
print("Accuracy Score: {0:.2f}%".format(accuracy_score(y_test, tbpred) * 100))
print("-" * 80)
print("Confusion Matrix\n")
print(confusion)
print("-" * 80)
print("Classification Report\n")
print(classification_report(y_test, tbpred))

Accuracy Score: 61.54%
--------------------------------------------------------------------------------
Confusion Matrix

          predicted_positive  predicted_negative
positive                  24                   0
negative                  15                   0
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        15
           1       0.62      1.00      0.76        24

    accuracy                           0.62        39
   macro avg       0.31      0.50      0.38        39
weighted avg       0.38      0.62      0.47        39
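
Every test document was predicted positive, so the accuracy above is simply the share of positive labels in the test set. A minimal sketch that reproduces the figure from the confusion-matrix counts and adds balanced accuracy, a metric that is not used in the original notebook:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Rebuild the SPY test labels from the confusion matrix above: 24 up, 15 down.
y_true = np.array([1] * 24 + [0] * 15)
# The baseline predicted "positive" for every document.
y_pred = np.ones_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
print("accuracy: {:.2f}%".format(acc * 100))
print("balanced accuracy: {:.2f}".format(bal_acc))
```

This prints an accuracy of 61.54% and a balanced accuracy of 0.50, which is the score of a classifier that carries no information about the negative class.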





### TNX

In [97]:
x = df_minutes['minutes']
y = df_minutes['TNX']

In [100]:
SEED = 2000
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2,random_state=SEED)

In [101]:
# Report the size and class balance of each split.
for name, xs, ys in [("Train", x_train, y_train), ("Test", x_test, y_test)]:
    print("{0} set has total {1} entries with {2:.2f}% negative, {3:.2f}% positive".format(
        name, len(xs), (ys == 0).mean() * 100, (ys == 1).mean() * 100))

Train set has total 153 entries with 60.13% negative, 39.87% positive
Test set has total 39 entries with 58.97% negative, 41.03% positive


In [103]:
%%time
tbresult = [TextBlob(i).sentiment.polarity for i in x_test]
tbpred = [0 if n < 0 else 1 for n in tbresult]

Wall time: 342 ms


In [104]:
conmat = np.array(confusion_matrix(y_test, tbpred, labels=[1, 0]))

confusion = pd.DataFrame(conmat, index=['positive', 'negative'],
                         columns=['predicted_positive', 'predicted_negative'])
print("Accuracy Score: {0:.2f}%".format(accuracy_score(y_test, tbpred) * 100))
print("-" * 80)
print("Confusion Matrix\n")
print(confusion)
print("-" * 80)
print("Classification Report\n")
print(classification_report(y_test, tbpred))

Accuracy Score: 41.03%
--------------------------------------------------------------------------------
Confusion Matrix

          predicted_positive  predicted_negative
positive                  16                   0
negative                  23                   0
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        23
           1       0.41      1.00      0.58        16

    accuracy                           0.41        39
   macro avg       0.21      0.50      0.29        39
weighted avg       0.17      0.41      0.24        39





### VIX

In [99]:
x = df_minutes['minutes']
y = df_minutes['VIX']
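
The SPY and TNX sections run the same split-and-score steps, and the VIX section would presumably repeat them. A minimal sketch of a helper that wraps those steps; `evaluate_baseline` and the stub scorer are hypothetical, the toy data is invented, and in the notebook the scorer would be `lambda t: TextBlob(t).sentiment.polarity`:

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_baseline(texts, labels, polarity_fn, seed=2000):
    """Split 80/20, label the test texts by polarity sign, return test accuracy."""
    _, x_test, _, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed
    )
    preds = [0 if polarity_fn(t) < 0 else 1 for t in x_test]
    return accuracy_score(y_test, preds)


# Toy data and a stub scorer so the sketch runs without TextBlob installed.
toy = pd.DataFrame({
    "minutes": ["growth strong"] * 5 + ["risks weak"] * 5,
    "target": [1] * 5 + [0] * 5,
})


def stub_polarity(text):
    return 1.0 if "strong" in text else -1.0


acc = evaluate_baseline(toy["minutes"], toy["target"], stub_polarity)
print("toy accuracy:", acc)
```

With the real data the call would take the form `evaluate_baseline(df_minutes['minutes'], df_minutes['VIX'], ...)`, reusing the same seed as the sections above.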
