# TECHNICAL NOTEBOOK

# Using Machine Learning and Natural Language Processing to predict stock price movements from article headlines

## Research question

Can we predict <b>a company's performance on the stock market from the headlines</b> of news articles about it?

## Dataset

For this project I retrieved headlines about 6 companies from WSJ online, published <b>between 2010/01/01 and 2019/11/22</b>. After preprocessing, <b>6202 observations</b> remained.

To label the dataset I used daily stock prices, which are available on Yahoo Finance.

The performance was <b>good</b> if the stock's closing price, compared to the previous day, did at least 0.5 percentage points better than the S&P 500.

The performance was <b>bad</b> if the stock did at least 0.5 percentage points worse than the S&P 500.

All remaining observations were labeled <b>neutral</b>.
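
The labeling rule can be sketched as a small function. This is a minimal illustration, not the preprocessing code used to build the dataset: the function name `label_performance`, its arguments, and the handling of values exactly at the threshold are assumptions.

```python
def label_performance(stock_prev_close, stock_close,
                      index_prev_close, index_close,
                      threshold=0.5):
    """Label one trading day as 'good', 'bad' or 'neutral'.

    Hypothetical helper: compares the stock's daily percentage change
    with the S&P 500's and applies the 0.5 percentage point threshold.
    """
    stock_change = (stock_close / stock_prev_close - 1) * 100
    index_change = (index_close / index_prev_close - 1) * 100
    excess = stock_change - index_change  # in percentage points

    if excess >= threshold:
        return 'good'
    if excess <= -threshold:
        return 'bad'
    return 'neutral'


# Stock rises 2 % while the index rises 1 %
print(label_performance(100.0, 102.0, 3000.0, 3030.0))
# Stock and index both rise 1 %
print(label_performance(100.0, 101.0, 3000.0, 3030.0))
```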

## Modeling

The following Machine Learning algorithms were used to make predictions:

- Logistic Regression
- K-Nearest Neighbors
- Random Forest
- Gradient Boosting
- AdaBoost Classifier
- XG Boosting
- Support Vector Machine


## Findings

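* The baseline Logistic Regression overfits heavily: train accuracy is 0.992, test accuracy only 0.370.
* The fine-tuned Logistic Regression (C = 0.01, L2 penalty, saga solver) reaches a test accuracy of 0.430, against a share of 0.353 for the most frequent class.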


In [1]:
import pandas as pd
import numpy as np

## Importing the dataset

In [2]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')


In [3]:
# create our training data list - this is a list of text of headlines/summaries
train_data = train['tokens'].to_list()
test_data = test['tokens'].to_list()

In [4]:
print('Number of train data: ',len(train_data))
print('Number of test data:  ',len(test_data))

Number of train data:  4961
Number of test data:   1241


## CountVectorizer

Here I instantiate a CountVectorizer, which counts how often each word and each n-gram of up to three words appears in a headline, keeping the 5000 most frequent features from the training data. Stopwords were already removed and the text tokenized during preprocessing.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             max_features=5000,
                             ngram_range=(1, 3))

train_data_features = vectorizer.fit_transform(train_data).toarray()
test_data_features = vectorizer.transform(test_data).toarray()

# check shapes
print('Shape of train_data: ', train_data_features.shape,
      '\nShape of test_data:  ', test_data_features.shape)

Shape of train_data:  (4961, 5000) 
Shape of test_data:   (1241, 5000)
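
To make the columns of the feature matrix more concrete, the sketch below counts 1- to 3-grams for a single tokenized headline using only the standard library. It is a simplified illustration of the idea behind `ngram_range=(1, 3)`, not scikit-learn's implementation; the helper name `count_ngrams` and the example headline are made up.

```python
from collections import Counter


def count_ngrams(text, ngram_range=(1, 3)):
    """Count word n-grams in a whitespace-tokenized string."""
    tokens = text.lower().split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        # slide a window of n tokens over the headline
        for start in range(len(tokens) - n + 1):
            counts[' '.join(tokens[start:start + n])] += 1
    return counts


counts = count_ngrams('apple shares rise apple shares')
for ngram, count in sorted(counts.items()):
    print(f'{ngram!r}: {count}')
```

Each distinct n-gram corresponds to one column. With `max_features=5000`, only the 5000 n-grams that are most frequent across the training corpus are kept, which is why both matrices above have 5000 columns.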


In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()

In [7]:
#restructure the target variable
y_train = train.label
y_test = test.label

#check the shapes of target variable
print('Shape of train target: ', y_train.shape,
      '\nShape of test target:  ', y_test.shape)

Shape of train target:  (4961,) 
Shape of test target:   (1241,)


In [8]:
y_train.value_counts(normalize=True)

neutral    0.352550
bad        0.331385
good       0.316065
Name: label, dtype: float64

In [9]:
y_test.value_counts(normalize=True)

neutral    0.352941
bad        0.331185
good       0.315874
Name: label, dtype: float64
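
These class shares give a reference point for the accuracy scores reported below: a model that always predicts the most frequent class is right exactly as often as that class occurs. A minimal sketch with a made-up label list (the helper name `majority_baseline` is an assumption, not part of the notebook):

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels for illustration only; in the notebook this would be list(y_test)
toy_labels = ['neutral', 'bad', 'good', 'neutral', 'bad', 'neutral']
label, accuracy = majority_baseline(toy_labels)
print(f'Always predicting {label!r} gives accuracy {accuracy:.3f}')
```

On the test set above the most frequent class is `neutral` with a share of 0.353, so a model has to beat roughly 35 % accuracy to add anything.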

## Logistic Regression

### Baseline

In [10]:
from sklearn.linear_model import LogisticRegression

# Perform baseline logistic regression
logreg_base = LogisticRegression(C = 1e9, 
                                 solver='lbfgs', 
                                 max_iter=1000, 
                                 penalty='l2',
                                 multi_class = 'multinomial',
                                 random_state=110)

logreg_base.fit(train_data_features, y_train)

print('Train accuracy score:',f'{logreg_base.score(train_data_features, y_train):.3f}')
print('Test accuracy score:',f'{logreg_base.score(test_data_features, y_test):.3f}')



Train accuracy score: 0.992
Test accuracy score: 0.370


### Fine-tuning

In [44]:
from sklearn.model_selection import GridSearchCV

# gridsearch original parameters:
params = [
          {'C': np.logspace(-2, 4, 7),
           'penalty': ['l1', 'l2'],
           'solver': ['liblinear'],
           'multi_class': ['ovr']
           },
          {'C': np.logspace(-2, 4, 7),
           'penalty': ['l1', 'l2'],
           'solver': ['saga'],
           'multi_class': ['multinomial']
           }]

# gridsearch best parameters:
# params = {'C': [0.01],
#           'penalty': ['l2'],
#           }

logreg_grid = GridSearchCV(estimator=LogisticRegression(random_state=111),
                           param_grid=params,
                           scoring='accuracy',
                           refit='accuracy',
                           return_train_score = True,
                           cv=5, verbose=2, n_jobs=-1)

#fit the model
logreg_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {logreg_grid.best_params_}')
print(f'''Train accuracy score: 
          {logreg_grid.best_estimator_.score(train_data_features, y_train):.3f}''')
print(f'''Test accuracy score: 
          {logreg_grid.best_estimator_.score(test_data_features, y_test):.3f}''')

Fitting 5 folds for each of 14 candidates, totalling 70 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 14.1min
[Parallel(n_jobs=-1)]: Done  70 out of  70 | elapsed: 33.7min finished


Best parameters: {'C': 0.01, 'multi_class': 'multinomial', 'penalty': 'l2', 'solver': 'saga'}
Best score: 0.414
Test score: 0.430


## Random Forest

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# params = {'n_estimators': [10, 100, 200, 500],
#           'criterion': ['gini', 'entropy'],
#           'max_depth': [2, 4, 6, 8, 10],
#           'max_features': [5, 10, 20, 50]}

params = {'n_estimators': [150, 200, 300, 400],
          'criterion': ['gini'],
          'max_depth': [9, 10, 11, 12],
          'max_features': [50, 100, 500]}

forest_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=110),
                           param_grid=params,
                           scoring='accuracy',
                           refit='accuracy',
                           return_train_score=True,
                           cv=5, verbose=2, n_jobs=-1)

forest_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {forest_grid.best_params_}')
print(f'''Train accuracy score: 
         {forest_grid.best_estimator_.score(train_data_features, y_train):.3f}''')
print(f'''Test accuracy score: 
         {forest_grid.best_estimator_.score(test_data_features, y_test):.3f}''')

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.


KeyboardInterrupt: 
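
The log line `Fitting 5 folds for each of 48 candidates, totalling 240 fits` follows directly from the size of the parameter grid. A short standard-library sketch that reproduces the count for the Random Forest grid (it only enumerates the combinations and does not fit any model; the name `rf_params` is used here to avoid overwriting `params`):

```python
from itertools import product

# Same values as the Random Forest grid above
rf_params = {'n_estimators': [150, 200, 300, 400],
             'criterion': ['gini'],
             'max_depth': [9, 10, 11, 12],
             'max_features': [50, 100, 500]}
n_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(rf_params, values))
              for values in product(*rf_params.values())]
n_fits = len(candidates) * n_folds

print(f'{len(candidates)} candidates x {n_folds} folds = {n_fits} fits')
```

Because every additional value multiplies the number of fits, narrowing the ranges is usually the cheapest way to shorten a long-running search.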

In [None]:
# feat_imp_forest = pd.DataFrame(zip(vocab, forest_grid.best_estimator_.feature_importances_), 
#                         columns=['Feature', 'Importance'])
# feat_imp_forest.sort_values(by='Importance', ascending=False, inplace=True)
# feat_imp_forest.head(20)

## K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

params = {'n_neighbors': range(1, 30, 2)}

# KNeighborsClassifier takes no random_state argument
knn_grid = GridSearchCV(estimator=KNeighborsClassifier(),
                        param_grid=params,
                        scoring='accuracy',
                        refit='accuracy',
                        return_train_score=True,
                        cv=5, verbose=2, n_jobs=5)

knn_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {knn_grid.best_params_}')
print(f'''Train accuracy score: 
          {knn_grid.best_estimator_.score(train_data_features, y_train):.3f}''')
print(f'''Test accuracy score: 
          {knn_grid.best_estimator_.score(test_data_features, y_test):.3f}''')

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

params = {"learning_rate": [0.1, 0.2, 0.4],
          'max_depth': [2, 3, 4, 5, 8, 10],
          'max_features': [5, 10, 50, 100, 200],
          'n_estimators': [10, 50, 100, 200],
          }

gboost_grid = GridSearchCV(estimator=GradientBoostingClassifier(random_state=111),
                       param_grid=params,
                       scoring='accuracy',
                       refit='accuracy',
                       return_train_score=True,
                       cv=5, verbose=2, n_jobs=-1)

gboost_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {gboost_grid.best_params_}')
print(f'''Train accuracy score: 
          {gboost_grid.best_estimator_.score(train_data_features, y_train):.3f}''')
print(f'''Test accuracy score: 
          {gboost_grid.best_estimator_.score(test_data_features, y_test):.3f}''')

## AdaBoost Classifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# max_depth and max_features are not AdaBoostClassifier parameters
# (they belong to the tree base learner), so they are dropped from the grid
params = {'learning_rate': [0.1, 0.2, 0.4],
          'n_estimators': [10, 50, 100, 200],
          }

# algorithm is left at its default: 'SAMME.R' is deprecated in recent
# scikit-learn releases
adaboost_grid = GridSearchCV(estimator=AdaBoostClassifier(random_state=111),
                             param_grid=params,
                             scoring='accuracy',
                             refit='accuracy',
                             return_train_score=True,
                             cv=5, verbose=2, n_jobs=-1)

adaboost_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {adaboost_grid.best_params_}')
print(f'''Train accuracy score: 
          {adaboost_grid.best_estimator_.score(train_data_features, y_train):.3f}''')
print(f'''Test accuracy score: 
          {adaboost_grid.best_estimator_.score(test_data_features, y_test):.3f}''')

## XG Boosting

In [None]:
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# recent XGBoost releases expect integer class labels, so encode the strings
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# max_features is not an XGBoost parameter, so it is left out of the grid
params = {'learning_rate': [0.1, 0.2, 0.4],
          'max_depth': [2, 3, 4, 5, 8, 10],
          'n_estimators': [10, 50, 100, 200],
          }

xgb_grid = GridSearchCV(estimator=xgb.XGBClassifier(random_state=111),
                        param_grid=params,
                        scoring='accuracy',
                        refit='accuracy',
                        return_train_score=True,
                        cv=5, verbose=2, n_jobs=-1)

xgb_grid.fit(train_data_features, y_train_enc)

# printing out the result
print(f'Best parameters: {xgb_grid.best_params_}')
print(f'Train accuracy score: {xgb_grid.best_estimator_.score(train_data_features, y_train_enc):.3f}')
print(f'Test accuracy score: {xgb_grid.best_estimator_.score(test_data_features, y_test_enc):.3f}')

## Support Vector Machine

In [None]:
from sklearn import svm

params = {'kernel': ['linear', 'poly', 'rbf'],
          # higher C penalizes misclassified points more, giving a narrower margin
          'C': [0.1, 1, 10, 1000],
          # degree is only used by the 'poly' kernel
          'degree': [2, 3]}

svm_grid = GridSearchCV(estimator=svm.SVC(),
                        param_grid=params,
                        scoring='accuracy',
                        return_train_score=True,
                        cv=3, verbose=2, n_jobs=5)

svm_grid.fit(train_data_features, y_train)

# printing out the result
print(f'Best parameters: {svm_grid.best_params_}')
print(f'Train accuracy score: {svm_grid.best_estimator_.score(train_data_features, y_train):.3f}')
print(f'Test accuracy score: {svm_grid.best_estimator_.score(test_data_features, y_test):.3f}')
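
The result-printing block is repeated after every grid search, which makes it easy to leave a stale variable name behind when copying a cell. One option is a small helper that takes the fitted search object as an argument. This is only a sketch: `report_grid` is a hypothetical name, and it relies solely on the `best_params_` and `best_estimator_` attributes of a fitted `GridSearchCV`.

```python
def report_grid(grid, X_train, y_train, X_test, y_test):
    """Print and return train/test accuracy of a fitted search's best estimator."""
    best = grid.best_estimator_
    train_acc = best.score(X_train, y_train)
    test_acc = best.score(X_test, y_test)
    print(f'Best parameters: {grid.best_params_}')
    print(f'Train accuracy score: {train_acc:.3f}')
    print(f'Test accuracy score: {test_acc:.3f}')
    return train_acc, test_acc
```

After fitting, a call such as `report_grid(svm_grid, train_data_features, y_train, test_data_features, y_test)` would replace the three print statements of a cell.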