# Determining the Sentiment of Financial News
The news database here is used to train a Naive Bayes (NB) classifier.
For deployment, I'd recommend pulling headlines with the shared NewsAPI code and tagging their sentiment with the trained NB.

## Constructing a Naive Bayes Classifier

* Load dataset
* Vectorize data
* Split data (80/20 train/test, random_state=0 to allow reproducibility)
* Initialize the NB classifier and fit
* Predict and measure accuracy

In [9]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import CountVectorizer

news_pd = pd.read_csv("./news_with_sentiment.csv")
news_pd.head()
news_pd = news_pd[:2000] # 28,000 rows will use more RAM than is available -> Truncation required

cv = CountVectorizer() # Convert text data to a vector as that is required for Naive Bayes
X = cv.fit_transform(news_pd['text']).toarray()
y = news_pd['sentiment'] # y = the variable we are trying to predict, in this case sentiment
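
The `.toarray()` call above converts the sparse count matrix into a dense one, which is a likely reason the full dataset does not fit in RAM. Below is a minimal sketch of one way around that, assuming a swap to `MultinomialNB` (not the classifier used in this notebook), which accepts sparse count matrices directly. The toy headlines and labels are invented for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for news_pd['text'] / news_pd['sentiment']
toy_text = [
    "shares surge after strong earnings",
    "profits jump on record sales",
    "stock plunges after weak guidance",
    "losses widen as sales fall",
]
toy_sentiment = [1, 1, -1, -1]

toy_cv = CountVectorizer()
X_sparse = toy_cv.fit_transform(toy_text)  # stays a scipy sparse matrix, no .toarray()

nb = MultinomialNB()
nb.fit(X_sparse, toy_sentiment)  # MultinomialNB can be fitted on sparse input

# New text must go through the same fitted vectorizer
new_counts = toy_cv.transform(["record earnings and strong sales"])
print(issparse(X_sparse), nb.predict(new_counts))
```

Whether this lifts the truncation limit for the real 28,000-row file depends on the vocabulary size and the machine, so treat it as something to try rather than a guaranteed fix.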

In [10]:
# Split train and test data (80/20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Initialize the Gaussian Naive Bayes Classifier, then fit the data
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [11]:
# Predict sentiment of our test data
y_pred = classifier.predict(X_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)

And now we can view the accuracy:

In [12]:
print(score)

0.685


Roughly 68% accuracy, which is not exactly stellar. Interestingly, reducing the dataset further gives higher accuracy:

In [14]:
news_pd = pd.read_csv("./news_with_sentiment.csv")
news_pd = news_pd[:1000] # 28,000 rows will use more RAM than is available. Truncation required.

cv = CountVectorizer()
X = cv.fit_transform(news_pd['text']).toarray()
y = news_pd['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

score = accuracy_score(y_test, y_pred)

print(score)

0.775


77.5% accuracy on a 1000 row dataset with an 80/20 split.
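
With 1,000 rows and an 80/20 split the test set holds only 200 headlines (400 in the 2,000-row run), so part of the gap between the two scores may come from the particular split. A minimal sketch of checking this with 5-fold cross-validation follows. The data is synthetic (`make_classification`), and `X_demo` / `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the vectorized headlines (X) and sentiment labels (y)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# Score the classifier on five different train/test partitions instead of one
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring="accuracy")

# The spread across folds shows how much a single split can move the number
print(scores.mean(), scores.std())
```

In the notebook itself, passing `X` and `y` in place of the synthetic arrays would give a mean and spread for each dataset size.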

From what I have read, Naive Bayes tends to do comparatively well on smaller datasets, but perhaps we can improve on it.

## Improving on Naive Bayes with a Random Forest

* Load dataset
* Remove stopwords; min_df=7 means a word is ignored if it appears in fewer than 7 documents, and max_df=0.8 means it is also ignored if it appears in more than 80% of documents
* Vectorize data (max_features is the maximum number of words, in vector form, that can influence the sentiment)
* Split data (80/20 train/test, random_state=0 to allow reproducibility)
* Initialize the Random Forest classifier and fit
* Predict and measure accuracy
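
To make the `min_df` / `max_df` filtering concrete, here is a small sketch on an invented four-document corpus. The thresholds are scaled down (`min_df=2` instead of 7) so the effect is visible on so little text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus; "tesla" is in every document, "shares" in two, the rest in one
docs = [
    "tesla shares rise",
    "tesla shares fall",
    "tesla profit grows",
    "tesla recall announced",
]

# min_df=2 drops words found in fewer than 2 documents;
# max_df=0.8 drops words found in more than 80% of the documents
demo_vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
demo_vectorizer.fit(docs)

print(sorted(demo_vectorizer.vocabulary_))  # ['shares']
```

Only `shares` survives: `tesla` is too common (4 of 4 documents) and every other word is too rare (1 document).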

In [15]:
# Read in 20,000 headlines
news_pd = pd.read_csv("./news_with_sentiment.csv")
news_pd = news_pd[:20000] # 28,000 rows will use more RAM than is available. Truncation required.
y = news_pd['sentiment']

In [16]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Remove stopwords and vectorize the dataset
# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(news_pd['text']).toarray()

In [None]:
# 80/20 data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(processed_features, y, test_size=0.2, random_state=0)

# Fit our model with split data, starting with 450 estimators (450 decision trees)
from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators=450, random_state=0)
text_classifier.fit(X_train, y_train)


In [None]:
# Predicting the sentiment of our test data
predictions = text_classifier.predict(X_test)


# Checking our accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

93.57% accuracy
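
A single accuracy figure can hide how each sentiment class fares, especially if one class dominates the dataset. A minimal sketch of a per-class check with a confusion matrix follows. The labels are invented and assume the `1` / `-1` encoding used later in this notebook. In the notebook, `y_test` and `predictions` would take the place of the toy lists.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: four positive and two negative headlines
y_true_demo = [1, 1, 1, 1, -1, -1]
y_pred_demo = [1, 1, 1, 1, 1, -1]

# Rows are the true class, columns the predicted class, in the order given by labels
labels = [-1, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Share of each true class that was predicted correctly (recall per class)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, per_class_recall)))
```

Here overall accuracy is 5/6, yet only half of the negative headlines are recognised, which is the kind of imbalance the matrix makes visible.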

## Hyperparameter Tuning:

* Choose a set of trees we want to test
* Train the model with n trees, store accuracy
* Loop above until complete
* Plot the resulting trees v accuracy

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

estimators = [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 900, 1000, 1250, 1500, 2000]
accuracy = []

for estimator_num in estimators:
    # Fit and predict with the current number of trees
    text_classifier = RandomForestClassifier(n_estimators=estimator_num, random_state=0)
    text_classifier.fit(X_train, y_train)
    predictions = text_classifier.predict(X_test)

    # Store accuracy
    accuracy.append(accuracy_score(y_test, predictions))


# Graph reported accuracy of various sets of estimators
import matplotlib.pyplot as plt

plt.plot(estimators, accuracy)
plt.ylabel('Accuracy')
plt.xlabel('Estimators')
plt.show()

print(estimators)
print(accuracy)

![image info](./download.png)

A strange curve? 

As per: https://en.wikipedia.org/wiki/Talk%3ARandom_forest

"Random Forests does not overfit. The testing performance of Random Forests does not decrease (due to overfitting) as the number of trees increases. Hence after certain number of trees the performance tend to stay in a certain value."


However, the curve also suggests that ~250 estimators/trees is the best setting for this particular train/test split.
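
Because the loop above scores every setting on the test set, the chosen tree count is tuned to that one split. A minimal sketch of an alternative, assuming scikit-learn's `GridSearchCV`, which picks `n_estimators` by cross-validation on the training data only and keeps the test set for a final check. The data is synthetic and the small grid is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for processed_features and y
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Candidate tree counts; the notebook's list of estimators could be used instead
param_grid = {"n_estimators": [10, 25, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation inside the training split
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)       # tree count with the best cross-validated accuracy
print(search.score(X_te, y_te))  # accuracy of the refitted best model on held-out data
```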

Naive Bayes v Random Forest v SVM: https://www.researchgate.net/publication/336225950_Comparison_of_Naive_Bayes_Support_Vector_Machine_Decision_Trees_and_Random_Forest_on_Sentiment_Analysis

## Pull Fresh News: 

In [None]:
import requests
import time
import datetime


articleCount = 0

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}

stocks = ['TSLA', 'AMZN', 'MMM', 'INTC', 'GOOGL', 'FB', 'MSFT', 'AAPL']
list_of_headlines = []
for line in stocks:
    ticker = line

    try:

        #Query for the stock name, for refined news queries.
        resp = requests.get(
            url="https://www.alphavantage.co/query?function=SYMBOL_SEARCH&keywords={}&apikey=ERO5XRBZNWQ9E608".format(
                ticker), headers=headers)
        data = resp.json()
        companyName = data['bestMatches'][0]['2. name']
        print("Company Name: " + companyName)

        #Query for news
        resp = requests.get(
            url='https://newsapi.org/v2/everything?'
'q={}&'
'from=2020-01-05&' # This is the OLDEST date an article can be from, free edition will let you have a month I believe
'sortBy=popularity&' #Filter by popularity (read the newsapi docs)
'apiKey=fe00115ceffe418988616191b03e1c74'.format(
                ticker + " " + companyName), headers=headers) #Add the company name in full after the ticker, for more accurate news queries
        data = resp.json()

        for article in data['articles']:
            articleCount = articleCount + 1
            newsTitle = article['title']
            print(newsTitle)
            list_of_headlines.append(newsTitle)
            
        time.sleep(1)

    except Exception as e:
        print("Error: " + str(e))
        time.sleep(10)
        
# Create the pandas DataFrame and save to csv
df = pd.DataFrame({'headlines':list_of_headlines}) 
df.to_csv('fresh_news_month_tsla.csv', encoding='utf-8', mode='w', index=False)
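
The query string in the cell above is assembled by hand from string pieces, where a missing `&` or an unescaped space is easy to overlook. A minimal sketch of building it from a dictionary instead, using the standard library's `urlencode`. The company name and the API key are placeholders, not real values.

```python
from urllib.parse import urlencode

ticker = "TSLA"
company_name = "Tesla Inc"  # hypothetical value; the notebook reads it from Alpha Vantage

params = {
    "q": ticker + " " + company_name,  # ticker plus full name for a narrower query
    "from": "2020-01-05",              # oldest article date to return
    "sortBy": "popularity",
    "apiKey": "YOUR_NEWSAPI_KEY",      # placeholder, not a real key
}

# urlencode joins the pairs with '&' and escapes spaces and special characters
query_url = "https://newsapi.org/v2/everything?" + urlencode(params)
print(query_url)
```

`requests.get` accepts the same dictionary through its `params` argument, so the URL does not have to be joined manually at all.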

In [None]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Read in fresh news
fresh_news = pd.read_csv('./fresh_news_month_tsla.csv')
fresh_news['headlines'].head(5)

# Vectorize training text data with a max of 40 words being predictors
vectorizer = CountVectorizer(max_features=40, min_df=9)
processed_features = vectorizer.fit_transform(news_pd['text']).toarray()

# Transform the fresh headlines with the SAME fitted vectorizer, so that each
# column means the same word as in the data the classifier is trained on
processed_features_new_data = vectorizer.transform(fresh_news['headlines']).toarray()

X_train, X_test, y_train, y_test = train_test_split(processed_features, y, test_size=0.2, random_state=0)

# Predict on new/fresh news after fitting on training data
text_classifier = RandomForestClassifier(n_estimators=650, random_state=0)
text_classifier.fit(X_train, y_train)

predictions = text_classifier.predict(processed_features_new_data)

# Output our predictions
print(predictions)

for i in range(len(predictions)):
    if predictions[i] == 1:
        print("Positive: " + fresh_news['headlines'][i])
    if predictions[i] == -1:
        print("Negative: " + fresh_news['headlines'][i])
        

Save our model to disk for production deployment to Sparkbot:

In [None]:
from joblib import dump
dump(text_classifier,'sentimentclassified.joblib')
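
The saved classifier only understands features produced by the vectorizer it was trained with, so the vectorizer has to travel with it. A minimal sketch of one way to do that, assuming the two steps are bundled in a scikit-learn `Pipeline` and saved as a single file. The toy data, the temporary path and the file name `sentiment_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy data standing in for news_pd['text'] / news_pd['sentiment']
toy_text = [
    "shares surge on strong earnings",
    "profits jump",
    "stock plunges on weak guidance",
    "losses widen",
]
toy_sentiment = [1, 1, -1, -1]

# Vectorizer and classifier are fitted together and stored as one object
sentiment_pipeline = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
sentiment_pipeline.fit(toy_text, toy_sentiment)

model_path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.joblib")
dump(sentiment_pipeline, model_path)

# On the deployment side: load once, then predict directly on raw headlines
restored = load(model_path)
print(restored.predict(["earnings surge"]))
```

The loaded pipeline takes raw strings, so the deployment code does not need to rebuild or refit a vectorizer.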