This notebook models stock data that has been pre-processed with news sentiment data, then predicts the Closing, High, and Low values for the current business day and the following business day. Below is the process for preparing the data and then modeling it to find the best predictions.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd

# Set console formatting for pandas output
pd.set_option('display.max_rows', 10)
pd.set_option('display.width', 1000)
pd.options.mode.chained_assignment = None

# **********************************************************************************************************************
# Modeling / Prepare Data
data = pd.read_csv('https://raw.githubusercontent.com/mwilchek/Stock-Modeling/master/DJ_NEWS_SENTIMENT_DATA.csv')
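# Cycle_Change is True where Max_Sentiment equals the previous row's value
data['Cycle_Change'] = data.Max_Sentiment.eq(data.Max_Sentiment.shift())
dummies = pd.get_dummies(data.Cycle_Change, dtype=int)
dummies.columns = dummies.columns.astype(str)  # name the dummy columns 'False' and 'True'
data = data.join(dummies)  # join returns a new frame, so keep the result
data_tomorrow = data.copy()  # copy so the shifts below do not alter data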

# Shift the sentiment columns down by one row in data_tomorrow so that each row
# pairs a day's prices with the previous day's sentiment
sentiment_columns = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Sadness', 'Surprise',
                     'Trust', 'Negative', 'Positive', 'Max_Sentiment', 'Sentiment_Proportion']
data_tomorrow[sentiment_columns] = data_tomorrow[sentiment_columns].shift(1)

# Delete the first row of data_tomorrow
data_tomorrow.drop(data_tomorrow.head(1).index, inplace=True)

train_data = data[:-1]  # train data
today_record = data.tail(1)  # test data (validate current day and predict from following day)
train_data_tomorrow = data_tomorrow[:-1]  # train data
tomorrow_record = data_tomorrow.tail(1)  # test data (validate current day and predict from following day)

data.head(n=5)
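
The effect of the one-row shift can be checked on a toy frame. This is a minimal sketch: the column names mirror the notebook, but the values and the `toy` and `lagged` names are made up for illustration.

```python
import pandas as pd

# Toy frame; the values are invented and only the column names follow the notebook.
toy = pd.DataFrame({
    "Close": [100.0, 101.5, 99.8, 102.3],
    "Joy": [0.10, 0.40, 0.20, 0.30],
})

lagged = toy.copy()                      # copy so `toy` itself is left untouched
lagged["Joy"] = lagged["Joy"].shift(1)   # row t now holds the sentiment from row t-1
lagged = lagged.iloc[1:]                 # the first row has no previous day, so drop it

print(lagged)
```

Each remaining row pairs a day's closing price with the sentiment recorded one row earlier, which is the layout `data_tomorrow` is meant to have.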

Adding Local Functions for Accuracy Printing

In [None]:
########################################################################################################################
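# Local method to get the percentage error between a prediction and the actual value
def get_change(current, previous):
    if current == previous:
        return 0.0  # identical values mean no error
    try:
        return (abs(current - previous) / abs(previous)) * 100.0
    except ZeroDivisionError:
        return float('inf')  # the percentage error is unbounded when the actual value is 0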
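
As a worked check of the percentage calculation, the sketch below uses a hypothetical stand-alone helper, `percent_error`, which is not part of the notebook, and invented prices.

```python
def percent_error(predicted, actual):
    """Absolute percentage difference between a prediction and the actual value."""
    if actual == 0:
        raise ValueError("percentage error is undefined when the actual value is 0")
    return abs(predicted - actual) / abs(actual) * 100.0


# Invented values: a prediction of 25,250 against an actual close of 25,000.
error = percent_error(25250.0, 25000.0)
print("Accuracy error for prediction: " + str(round(error, 4)) + "%")
```

A difference of 250 points on a base of 25,000 is 250 / 25,000 = 0.01, that is, an error of 1%.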

In this section we create a pipeline of regressor models and tune a number of their parameters to find an accurate model for predicting the closing value. If the results are good, we can repeat the process to predict the High and Low values for the current and next business day.

In [None]:
########################################################################################################################
# MODELING EXPLORATION #################################################################################################
# Testing best model for f(x) = Close ~ Features
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

# Get Feature values
x = data[['Open', 'High', 'Low', 'False', 'True']].values

# Get Target values
y = data['Close'].values

regression_models = {'lr': LinearRegression(n_jobs=-1),
                     'mlp': MLPRegressor(random_state=0),
                     'dt': DecisionTreeRegressor(random_state=0),
                     'rf': RandomForestRegressor(random_state=0, n_jobs=-1),
                     'svr': SVR(max_iter=-1)}

pipe_regrs = {}

# Build one pipeline per model; each pipeline standardizes the features before fitting
for name, model in regression_models.items():
    pipe_regrs[name] = Pipeline([('StandardScaler', StandardScaler()), ('regr', model)])

param_grids = {}

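# Linear Regression Parameter Options (booleans, not strings: 'True' and 'False' are both truthy).
# Note: `normalize` was removed in scikit-learn 1.2, so this grid needs an older release.
param_grid = [{'regr__normalize': [True, False]}]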

# Add Linear Regression Parameters to dictionary grid
param_grids['lr'] = param_grid

# MLP Parameter Options:
# alpha_range lists candidate L2 penalties; it is defined here but not used in the grid below
alpha_range = [10 ** i for i in range(-4, 5)]

param_grid = [{'regr__hidden_layer_sizes': [10, 100, 200]}]

# Add Multi-layer Perceptron Parameters to dictionary grid
param_grids['mlp'] = param_grid

# Decision Tree Regression Parameter Options:
param_grid = [{'regr__criterion': ['mse', 'mae'],
               'regr__min_samples_split': [2, 6, 10],
               'regr__min_samples_leaf': [1, 6, 10],
               'regr__max_features': ['auto', 'sqrt', 'log2']}]

# Add Decision Tree Parameters to dictionary grid
param_grids['dt'] = param_grid

# Random Forest Regression Parameter Options:
param_grid = [{'regr__n_estimators': [10, 100],
               'regr__criterion': ['mse', 'mae'],
               'regr__min_samples_split': [2, 6, 10],
               'regr__min_samples_leaf': [1, 6, 10],
               'regr__max_features': ['auto', 'sqrt', 'log2']}]

# Add Random Forest Parameters to dictionary grid
param_grids['rf'] = param_grid

# Support Vector Machine (SVM) Parameter Options:
param_grid = [{'regr__C': [0.1, 1, 10],
               'regr__gamma': [0.1, 1, 10],
               'regr__kernel': ['linear', 'poly', 'rbf', 'sigmoid']}]

# Add SVM Parameters to dictionary grid
param_grids['svr'] = param_grid

# The list of [best_score_, best_params_, best_estimator_]
best_score_param_estimators = []

# Scoring Param: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
# For each regression
for name in pipe_regrs.keys():
    # GridSearchCV
    gs = GridSearchCV(estimator=pipe_regrs[name],
                      param_grid=param_grids[name],
                      scoring='neg_mean_squared_error',
                      n_jobs=1,
                      cv=None)
    print("Modeling: " + str(pipe_regrs[name]))
    # Fit the pipeline
    gs = gs.fit(x, y)

    # Update best_score_param_estimators
    best_score_param_estimators.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
    print("Modeling Completed - Appending scores...")

Depending on the machine, the grid search over all pipelines can take roughly 5 to 10 minutes to complete.
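
The `regr__` prefix in the parameter grids routes each setting to the pipeline step named `regr`. The sketch below shows the same pattern at a much smaller scale. It is an illustration under assumptions: the data is synthetic, and `Ridge` is swapped in as a single stand-in estimator that is not one of the models tested in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): y is linear in three features plus noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([("StandardScaler", StandardScaler()), ("regr", Ridge())])

# "regr__alpha" means: set the `alpha` parameter of the step named "regr".
grid = {"regr__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid=grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # negated MSE, so values closer to 0 are better
```

Because the scaler sits inside the pipeline, it is refit on the training part of every cross-validation split, so the held-out part never influences the scaling.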

In [None]:
# Sort best_score_param_estimators in descending order of the best_score_
best_score_param_estimators = sorted(best_score_param_estimators, key=lambda x: x[0], reverse=True)

# For each [best_score_, best_params_, best_estimator_]
for best_score_param_estimator in best_score_param_estimators:
    # Print [best_score_, best_params_, estimator type]; best_estimator_ is a pipeline,
    # so we print only the type of its final regressor step
    print([best_score_param_estimator[0], best_score_param_estimator[1],
           type(best_score_param_estimator[2].named_steps['regr'])], end='\n\n')
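
Because the scoring is `neg_mean_squared_error`, every score is zero or negative and the largest value is the best, which is why the list is sorted in descending order. The following sketch uses invented scores (not results from this notebook) to show the ranking and a conversion to root mean squared error, which is in the same units as the price.

```python
import math

# Hypothetical (name, neg_mean_squared_error) pairs; the numbers are invented.
results = [("svr", -250.0), ("lr", -4.0), ("rf", -36.0)]

# Higher (closer to zero) is better, so sort in descending order.
ranked = sorted(results, key=lambda item: item[1], reverse=True)

for name, neg_mse in ranked:
    rmse = math.sqrt(-neg_mse)  # flip the sign, then take the root to get price units
    print(f"{name}: RMSE = {rmse:.2f}")
```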

It appears that the Linear Regression model (`<class 'sklearn.linear_model.base.LinearRegression'>`, with `regr__normalize` set to True) received the best score in the Pipeline. Let us practice predicting the closing value of the stock with a Linear Regression model, following the best result from our GridSearchCV.

In [None]:
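# Declare the best model type from GridSearchCV. `normalize` defaults to False, but feature
# scaling does not change ordinary least squares predictions, so the defaults are used here
lr = LinearRegression(n_jobs=-1)

# Fit the model on the training rows only, so today's record stays unseen
features = ['Open', 'High', 'Low', 'False', 'True']
lr = lr.fit(train_data[features].values, train_data['Close'].values)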

# Predict on Today Close
today_close = today_record[['Open', 'High', 'Low', 'False', 'True']].values
y_pred = lr.predict(today_close)

# Print Results
print("Actual Closing Value: " + str(today_record['Close'].values[0]))
print("Predicted Closing Value: " + str(y_pred[0]))

error = get_change(y_pred[0], today_record['Close'].values[0])
print("Accuracy error for prediction: " + str(round(error, 4)) + "%")

The prediction was close to the actual value. However, based on academic research, we also want to explore an OLS regression fitted with statsmodels to see whether the predictions change noticeably.

In [None]:
# **********************************************************************************************#
# OLS Regression Test
import statsmodels.formula.api as smf

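# Define formula string for the statsmodels formula API. Cycle_Change is a boolean column,
# which the formula interface treats as categorical (the dummy columns are not in dta)
formula = 'Close ~ Open + High + Low + Cycle_Change'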

# Define Training Data
dta = train_data[['Close', 'Open', 'High', 'Low', 'Anger', 'Anticipation',
                  'Disgust', 'Fear', 'Joy', 'Sadness', 'Surprise',
                  'Trust', 'Negative', 'Positive', 'Cycle_Change', 'Sentiment_Proportion']].copy()

# Set the Model
ols_today_close_model = smf.ols(formula=formula, data=dta).fit()

# Print results
print(ols_today_close_model.summary())

As we can see, the model appears to be overfit, given that the adjusted R-squared equals 1. To reduce overfitting, we will revise our model with a regularized fit.

In [None]:
# Refit with elastic-net regularization to reduce over-fitting; alpha is the penalty
# strength and L1_wt is the share of the penalty given to the L1 term
olsUpdate_today_close = smf.ols(formula=formula, data=dta).fit_regularized(alpha=10, L1_wt=.6)
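
For reference, the statsmodels documentation describes the elastic-net objective that `fit_regularized` minimizes as follows, where $n$ is the number of observations, $\mathrm{RSS}$ is the residual sum of squares, and $\beta$ is the coefficient vector (the symbols are notation chosen here, not names from the notebook):

$$
\frac{1}{2n}\,\mathrm{RSS}(\beta) \;+\; \alpha\left(\frac{1 - L1\_wt}{2}\,\lVert\beta\rVert_2^2 \;+\; L1\_wt\,\lVert\beta\rVert_1\right)
$$

With the values used above, $\alpha = 10$ and $L1\_wt = 0.6$, the penalty term works out to $2\lVert\beta\rVert_2^2 + 6\lVert\beta\rVert_1$, so both the squared size and the absolute size of the coefficients are penalized.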

Now we practice predicting on today's record with the revised model.

In [None]:
import matplotlib.pyplot as plt
import statsmodels.api as sm

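print(today_record)

# Predict today's close with the regularized model
olsUpdate_today_close_prediction = olsUpdate_today_close.predict(today_record)
print(olsUpdate_today_close_prediction)

# Partial regression plots need a fitted results object, not a prediction,
# so plot from the fitted OLS model
fig = plt.figure(figsize=(12, 8))
fig = sm.graphics.plot_partregress_grid(ols_today_close_model, fig=fig)
fig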

It would appear that an OLS regression model with a regularized fit is also quite accurate relative to the actual value, based on the above plots. If we can tune the sentiment terms in our formula, along with the alpha and weight values of the regularization, we may be able to build a strong model for predicting stock values in which news sentiment is a significant predictor.

In this section we will create custom local methods that find the most significant sentiment for today's stock values in Close, High, and Low using a RandomForestRegressor model.