# TECHNICAL NOTEBOOK

# Using Machine Learning and Natural Language Processing to predict stock price movements from article headlines

## Research question

Can we predict <b>a company's performance on the stock market from the headlines</b> of news articles?

## Dataset

For this project I retrieved headlines from WSJ online for 6 companies <b>between 2010/01/01 and 2019/11/22</b>, which resulted in <b>7171 observations</b> altogether.

To label the dataset I used daily stock prices, which are available on Yahoo Finance.

The performance was <b>good</b> if the ratio of the closing stock price to the previous day's close was higher than the same ratio for the S&P 500.

The performance was <b>bad</b> if that ratio was lower than the one for the S&P 500.
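
The labeling rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the notebook's actual labeling code: the function names `daily_return` and `label_day` are hypothetical, the prices are made up, and ties are treated as "bad" because the text does not say how they were handled.

```python
def daily_return(prev_close: float, close: float) -> float:
    """Relative change of the closing price versus the previous trading day."""
    return close / prev_close - 1.0


def label_day(stock_prev: float, stock_close: float,
              index_prev: float, index_close: float) -> str:
    """Return 'good' if the stock's daily return beats the index's, else 'bad'."""
    stock_ret = daily_return(stock_prev, stock_close)
    index_ret = daily_return(index_prev, index_close)
    # Assumption: a tie counts as 'bad'; the source does not specify this case.
    return "good" if stock_ret > index_ret else "bad"


# Hypothetical prices, for illustration only.
print(label_day(100.0, 102.0, 3000.0, 3030.0))  # stock +2% vs index +1% -> good
print(label_day(100.0, 100.5, 3000.0, 3030.0))  # stock +0.5% vs index +1% -> bad
```

Comparing returns rather than raw prices keeps the rule independent of the price level of each company and of the index.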

## Findings

*
*


In [108]:
import pandas as pd
import numpy as np

## Importing the dataset

In [109]:
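X_train = pd.read_csv('data/train_x.csv')
# the target files have no header row; header=None replaces header=-1, which recent pandas versions reject
y_train = pd.read_csv('data/train_y.csv', header=None)
X_test = pd.read_csv('data/test_x.csv')
y_test = pd.read_csv('data/test_y.csv', header=None)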

In [110]:
# create our training data list - this is a list of text of headlines/summaries
train_data = X_train['tokens'].to_list()
test_data = X_test['tokens'].to_list()

In [111]:
print('Number of train data: ',len(train_data))
print('Number of test data:  ',len(test_data))

Number of train data:  5736
Number of test data:   1435


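## CountVectorizer

Here I instantiate a CountVectorizer. It counts how often each word n-gram, from single words up to three-word sequences (`ngram_range=(1, 3)`), appears in each headline, and keeps only the 5000 most frequent n-grams of the training data (`max_features=5000`). Stopwords were removed and the text was tokenized during preprocessing.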
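
To make concrete what ends up in each feature column, here is a pure-Python sketch of n-gram counting with a frequency-capped vocabulary. It is not scikit-learn's implementation: the helper names (`word_ngrams`, `top_ngram_vocabulary`, `count_vector`) and the toy headlines are hypothetical, and details such as lowercasing, the default token pattern, and tie-breaking between equally frequent n-grams are ignored.

```python
from collections import Counter


def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def top_ngram_vocabulary(documents, max_features):
    """Keep the max_features n-grams with the highest total count over all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(word_ngrams(doc.split()))
    return [gram for gram, _ in totals.most_common(max_features)]


def count_vector(document, vocabulary):
    """Count how often each vocabulary n-gram occurs in one document."""
    counts = Counter(word_ngrams(document.split()))
    return [counts[gram] for gram in vocabulary]


# Toy headlines, already tokenized and joined by spaces (hypothetical data).
toy_docs = ["boeing delays 787 deliveries", "boeing 787 deliveries resume"]
toy_vocab = top_ngram_vocabulary(toy_docs, max_features=4)
print(toy_vocab)
print([count_vector(doc, toy_vocab) for doc in toy_docs])
```

Each document becomes one row of counts and each kept n-gram one column, which is why the feature matrices in this notebook have exactly 5000 columns.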

In [113]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             max_features=5000,
                             ngram_range=(1, 3))

train_data_features = vectorizer.fit_transform(train_data).toarray()
test_data_features = vectorizer.transform(test_data).toarray()

# check shapes
print('Shape of train_data: ', train_data_features.shape,
      '\nShape of test_data:  ', test_data_features.shape)

Shape of train_data:  (5736, 5000) 
Shape of test_data:   (1435, 5000)


In [114]:
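# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
vocab = vectorizer.get_feature_names_out()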

In [125]:
y_train[0].value_counts(normalize=True)

bad     0.517608
good    0.482392
Name: 0, dtype: float64

In [126]:
y_test[0].value_counts(normalize=True)

bad     0.51777
good    0.48223
Name: 0, dtype: float64
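
The class shares above give a useful reference point: a model that always predicts the most frequent training label. The sketch below computes that majority-class accuracy in plain Python. The label counts are reconstructed from the proportions and set sizes printed in this notebook (2969/2767 for train, 743/692 for test), and the function name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy on the test labels when always predicting the most common train label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)


# Counts reconstructed from the printed proportions (an assumption, not loaded data).
toy_train = ["bad"] * 2969 + ["good"] * 2767   # 5736 training rows
toy_test = ["bad"] * 743 + ["good"] * 692      # 1435 test rows

majority_label, baseline_acc = majority_baseline_accuracy(toy_train, toy_test)
print(majority_label, f"{baseline_acc:.3f}")  # bad 0.518
```

Any test accuracy reported for a trained classifier on this data is best read against this value of about 0.518 rather than against 0.5.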

In [127]:
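# restructure the target variable into 1-D arrays
# (Series.ravel() is deprecated in recent pandas; to_numpy() gives the same result)
y_train = y_train.iloc[:, 0].to_numpy()
y_test = y_test.iloc[:, 0].to_numpy()

# check the shapes of the target variable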
print('Shape of train target: ', y_train.shape,
      '\nShape of test target:  ', y_test.shape)

Shape of train target:  (5736,) 
Shape of test target:   (1435,)


## Logistic Regression

In [128]:
from sklearn.linear_model import LogisticRegression

# Perform baseline logistic regression
# (C=1e9 makes the L2 penalty negligible, so this is effectively unregularized)
logreg_base = LogisticRegression(C=1e9,
                                 solver='lbfgs',
                                 max_iter=1000,
                                 penalty='l2',
                                 random_state=110)

logreg_base.fit(train_data_features, y_train)

print('Train accuracy score:',f'{logreg_base.score(train_data_features, y_train):.3f}')
print('Test accuracy score:',f'{logreg_base.score(test_data_features, y_test):.3f}')



Train accuracy score: 0.994
Test accuracy score: 0.524


In [129]:
from sklearn.model_selection import GridSearchCV

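# gridsearch original parameters (kept for reference, overwritten by the dict below)
# note: the 'lbfgs' solver only supports the 'l2' penalty, so the 'l1' candidates cannot be fitted
params = {'C': np.logspace(-2, 4, 7),
          'penalty': ['l1', 'l2'],
          'solver': ['lbfgs']
          }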

# gridsearch best parameters:
params = {'C': [0.01],
          'penalty': ['l2'],
          'solver': ['lbfgs']
          }

logreg_grid = GridSearchCV(estimator=LogisticRegression(random_state=123),
                           param_grid=params,
                           scoring='accuracy',
                           refit='accuracy',
                           return_train_score = True,
                           cv=5, verbose=2, n_jobs=-1)

#fit the model
logreg_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {logreg_grid.best_params_}')
print(f'Best score: {logreg_grid.best_score_:.3f}')
print(f'Test score: {logreg_grid.best_estimator_.score(test_data_features, y_test):.3f}')

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   17.7s finished


Best parameters: {'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}
Best score: 0.532
Test score: 0.524


In [135]:
logreg_grid.best_estimator_.coef_.shape

(1, 5000)

In [141]:
coef_df = pd.DataFrame({'features': vocab,
                        'coefs': logreg_grid.best_estimator_.coef_[0]})
coef_df = coef_df.sort_values(by = 'coefs')
coef_df.iloc[-100:-50]

| index | features | coefs |
|---|---|---|
| 1574 | entrepreneur | 0.049372 |
| 125 | 777 | 0.049542 |
| 832 | canada | 0.049562 |
| 2309 | hundreds | 0.049853 |
| 1102 | confidence | 0.049872 |
| 751 | brand | 0.050129 |
| 675 | boeing 787 dreamliners | 0.05021 |
| 1530 | electrolux | 0.050747 |
| 2524 | job | 0.050835 |
| 1864 | flights | 0.050962 |
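
One way to read these coefficients is as odds multipliers: in a logistic regression, one additional occurrence of an n-gram multiplies the odds of the positive class by `exp(coef)`. The sketch below applies this to two of the values listed above. It assumes the positive class is "good", which holds if scikit-learn sorted the string labels alphabetically ("bad", "good"); the helper name `odds_ratio` is hypothetical.

```python
import math


def odds_ratio(coef: float, count: int = 1) -> float:
    """Factor by which the odds of the positive class change for `count` occurrences."""
    return math.exp(coef * count)


# Two coefficients copied from the output above.
shown_coefs = {"entrepreneur": 0.049372, "flights": 0.050962}

for term, coef in shown_coefs.items():
    print(f"{term}: odds of 'good' multiplied by {odds_ratio(coef):.3f} per occurrence")
```

With coefficients around 0.05, each occurrence shifts the odds by roughly 5%, so no single n-gram in this range moves a prediction very far on its own.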


## Random Forest

In [131]:
from sklearn.ensemble import RandomForestClassifier

params = {'n_estimators': [10, 100, 200, 500],
          'criterion': ['gini', 'entropy'],
          'max_depth': [2, 4, 6, 8, 10],
          'max_features': [5, 10, 20, 50]}

forest_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=110),
                           param_grid=params,
                           scoring='accuracy',
                           refit='accuracy',
                           return_train_score=True,
                           cv=5, verbose=2, n_jobs=-1)

forest_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {forest_grid.best_params_}')
print(f'Best score: {forest_grid.best_score_:.3f}')
print(f'Test score: {forest_grid.best_estimator_.score(test_data_features, y_test):.3f}')

Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   42.7s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 16.9min
[Parallel(n_jobs=-1)]: Done 800 out of 800 | elapsed: 22.8min finished


Best parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': 50, 'n_estimators': 100}
Best score: 0.526
Test score: 0.530
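
To judge how far the two test scores (0.524 for logistic regression, 0.530 for the random forest) sit from the share of the majority class in the test set (0.51777), a rough check is the binomial standard error of an accuracy measured on 1435 samples. The sketch below is a back-of-the-envelope calculation, not a formal significance test; the function name `accuracy_standard_error` is hypothetical and the normal approximation is an assumption.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


n_test = 1435          # test set size reported in this notebook
baseline = 0.51777     # share of the majority class ('bad') in the test set
se = accuracy_standard_error(baseline, n_test)

# Test scores copied from the grid search outputs in this notebook.
scores = {"logistic regression": 0.524, "random forest": 0.530}
z_scores = {name: (acc - baseline) / se for name, acc in scores.items()}

print(f"standard error: {se:.4f}")
for name, z in z_scores.items():
    print(f"{name}: {z:.2f} standard errors above the majority-class share")
```

Both gaps come out below one standard error, so on this test set neither model is clearly distinguishable from always predicting the majority class.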


In [133]:
feat_imp_forest = pd.DataFrame(zip(vocab, forest_grid.best_estimator_.feature_importances_), 
                        columns=['Feature', 'Importance'])
feat_imp_forest.sort_values(by='Importance', ascending=False, inplace=True)
feat_imp_forest.head(20)

| index | Feature | Importance |
|---|---|---|
| 4867 | warning | 0.008855 |
| 2003 | funds | 0.007882 |
| 4291 | start | 0.006501 |
| 2881 | may | 0.005483 |
| 4739 | unit general electric | 0.005211 |
| 1754 | federal appeals court | 0.004926 |
| 3650 | range | 0.00466 |
| 473 | auto | 0.004453 |
| 1753 | federal appeals | 0.003743 |
| 3093 | neil | 0.003726 |
