<h1>Sentiment Analysis</h1>

<p>In this step, we flesh out a classifier that predicts the behaviour of stock prices from financial news. Using a Naive Bayes model, we compare forecasts based on headlines alone with forecasts based on the display text, that is, the headline together with the first sentence of the article (as shown on Reuters' news feed).</p>

<p>Implementing a machine learning model comprises four key steps:</p>

<ul>
    <li>Define - Define the model that's being used</li>
    <li>Fit - Fit the model on your training dataset (X_train)</li>
    <li>Predict - Make predictions with the model on your test dataset (X_test)</li>
    <li>Evaluate - Measure your model's accuracy (compare predictions and y_test)</li>
</ul>
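<p>As a minimal sketch of these four steps, the cell below runs them end to end on a tiny made-up dataset. The arrays are hypothetical and chosen only so that the two classes are easy to separate; they are not taken from the corpus used in this notebook.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two well-separated clusters, one per class
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.1, 0.1], [1.0, 1.0]])
y_test = np.array([0, 1])

model = GaussianNB()                       # Define
model.fit(X_train, y_train)                # Fit
y_pred = model.predict(X_test)             # Predict
accuracy = accuracy_score(y_test, y_pred)  # Evaluate

print(y_pred, accuracy)
```

<p>The helper functions defined later in this notebook follow the same pattern, split across separate functions.</p>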

<p>To start off, let's read in the pickled data.</p>

In [1]:
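import pandas as pd

# Load the pickled corpus
df = pd.read_pickle("./pickles/corpus.pkl")
# Display text dataset: drop the Headline and Sentence columns
df_text = df.drop(["Headline", "Sentence"], axis=1)
# Headlines dataset: drop the Sentence and Text columns
df_headlines = df.drop(["Sentence", "Text"], axis=1)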

<h2>Naive Bayes</h2>

<p>Let's define a few helper functions that will abstract away some of the modelling details.</p>

<ul>
    <li>Obtaining X and y (features and output)</li>
</ul>

In [18]:
# Setting up the dataset for modelling
from sklearn.feature_extraction.text import CountVectorizer

def get_X_y(df, col):
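    # Build a document-term matrix (DTM): one row per document, one column per token.
    # Note that the vocabulary is learned from every row of col, before any train-test split.
    cv = CountVectorizer()
    # GaussianNB needs a dense array, so convert the sparse DTM
    X = cv.fit_transform(col).toarray()
    # The target is the Inflection column
    y = df.Inflection.values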
    
    return X, y
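
<p>To make the document-term matrix concrete, here is a small sketch of what <code>CountVectorizer</code> produces with its default settings. The two headlines are made up for illustration and do not come from the corpus.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical headlines, for illustration only
headlines = ["Stocks rally on earnings", "Stocks fall on weak earnings"]

cv = CountVectorizer()
X = cv.fit_transform(headlines).toarray()

# Column order follows the learned vocabulary (lowercased, sorted alphabetically)
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(vocab)
print(X)
```

<p>Each row counts how often each vocabulary token occurs in one headline, so the number of columns grows with the size of the vocabulary.</p>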

<ul>
    <li>Define, Fit, and Predict</li>
</ul>

In [19]:
from sklearn.naive_bayes import GaussianNB

def dfp_naive_bayes(X_train, y_train, X_test):
    # Model definition (var_smoothing is raised from scikit-learn's default of 1e-9)
    naive_bayes = GaussianNB(var_smoothing=1e-2)

    # Fitting the model
    naive_bayes.fit(X_train, y_train)

    # Predicting with the model
    y_pred = naive_bayes.predict(X_test)
    
    return y_pred

<ul>
    <li>Evaluate</li>
</ul>

In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score

def evaluate_naive_bayes(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    confusion = confusion_matrix(y_test, y_pred)

    return confusion, accuracy

<h3>Headlines Forecast</h3>

In [21]:
# Train-test split
from sklearn.model_selection import train_test_split

X_headlines, y_headlines = get_X_y(df_headlines, df_headlines.Headline)
X_headlines_train, X_headlines_test, y_headlines_train, y_headlines_test = train_test_split(
    X_headlines, y_headlines, test_size=0.20, random_state=0)

In [22]:
# Obtaining predictions
y_headlines_pred = dfp_naive_bayes(X_headlines_train, y_headlines_train, X_headlines_test)

In [23]:
# Evaluating the model
confusion_headlines, accuracy_headlines = evaluate_naive_bayes(y_headlines_test, y_headlines_pred)

print("Confusion matrix\n")
print(confusion_headlines, "\n")
print("The model's accuracy is", accuracy_headlines)

Confusion matrix

[[12 11]
 [18  9]] 

The model's accuracy is 0.42
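
<p>The accuracy printed above can be recovered from the confusion matrix itself. In scikit-learn's <code>confusion_matrix</code>, rows are true labels and columns are predicted labels, so correct predictions sit on the diagonal. The sketch below re-enters the matrix by hand rather than recomputing it from the data.</p>

```python
import numpy as np

# Headline confusion matrix copied from the output above
confusion = np.array([[12, 11],
                      [18, 9]])

# Accuracy = correct predictions (diagonal) / all predictions
accuracy = np.trace(confusion) / confusion.sum()

print(accuracy)
```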


<h3>Display Text Forecast</h3>

In [24]:
# Train-test split
from sklearn.model_selection import train_test_split

X_text, y_text = get_X_y(df_text, df_text.Text)
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(
    X_text, y_text, test_size=0.20, random_state=0)

In [25]:
# Obtaining predictions
y_text_pred = dfp_naive_bayes(X_text_train, y_text_train, X_text_test)

In [26]:
# Evaluating the model
confusion_text, accuracy_text = evaluate_naive_bayes(y_text_test, y_text_pred)

print("Confusion matrix\n")
print(confusion_text, "\n")
print("The model's accuracy is", accuracy_text)

Confusion matrix

[[ 7 16]
 [17 10]] 

The model's accuracy is 0.34


<h3>Preliminary Results</h3>

<p>As we can see, the accuracy of this Naive Bayes model is poor, below the 50% expected from random guessing on a two-class problem:</p>

<ul>
    <li>42% on headlines-based forecast</li>
    <li>34% on display text-based forecast (counter-intuitive, since it uses more text than the headlines alone)</li>
</ul>
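
<p>One way to put these numbers in context is a majority-class baseline, that is, the accuracy of always predicting the most frequent class in the test set. The sketch below reads the class counts off the rows of the headline confusion matrix reported earlier; the baseline is an addition for comparison and is not part of the original analysis.</p>

```python
import numpy as np

# Headline confusion matrix copied from the results above
confusion_headlines = np.array([[12, 11],
                                [18, 9]])

# Row sums give the number of test samples with each true label
class_counts = confusion_headlines.sum(axis=1)

# Accuracy of always predicting the most frequent class
baseline = class_counts.max() / class_counts.sum()

print(class_counts, baseline)
```

<p>Because both forecasts use the same split settings and the same target, the display text confusion matrix has the same row sums, so the same baseline applies to it.</p>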

<p>Let's see how we can tune our model to get better results.</p>