<h1>Sentiment Analysis</h1>

<p>In this step, we'll flesh out the classifier that predicts the behaviour of stock prices from financial news. We'll compare forecasts based on headlines alone with forecasts based on headlines plus the first sentence of each article (the display text shown on Reuters' news feed), using two models:</p>

<ul>
    <li>Naive Bayes</li>
    <li>Support Vector Machine (SVM)</li>
</ul>

<p>Implementing a machine learning model comprises four key steps:</p>

<ul>
    <li>Define - Define the model that's being used</li>
    <li>Fit - Fit the model on your training dataset (X_train)</li>
    <li>Predict - Make predictions with the model on your test dataset (X_test)</li>
    <li>Evaluate - Measure your model's accuracy (compare predictions and y_test)</li>
</ul>
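<p>The four steps above can be sketched on toy data as follows. This is only an illustration: <code>MajorityClassifier</code> is a hypothetical stand-in model (it always predicts the most common training label), and the data is made up. The scikit-learn models used later expose the same fit and predict methods.</p>

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common training label."""

    def fit(self, X_train, y_train):
        # Remember the label that occurs most often in the training data
        self.label_ = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, X_test):
        # Ignore the features and return the remembered label for every row
        return [self.label_ for _ in X_test]


# Made-up toy data: two features per row, labels in {-1, 1}
X_train = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_train = [1, 1, -1, 1]
X_test = [[1, 0], [0, 1]]
y_test = [1, -1]

model = MajorityClassifier()      # Define
model.fit(X_train, y_train)       # Fit
y_pred = model.predict(X_test)    # Predict

# Evaluate: share of predictions that match the true labels
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(y_pred, accuracy)  # [1, 1] 0.5
```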

<p>To start off, let's read in the pickled data.</p>

In [1]:
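import pandas as pd

df = pd.read_pickle("./pickles/corpus.pkl")
# Display-text dataset: keeps "Text", drops the separate "Headline" and "Sentence" columns
df_text = df.drop(["Headline", "Sentence"], axis=1)
# Headlines-only dataset: keeps "Headline", drops "Sentence" and "Text"
df_headlines = df.drop(["Sentence", "Text"], axis=1)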

<h2>Naive Bayes</h2>

<p>Let's define a few helper functions that will abstract away some of the modelling details.</p>

<ul>
    <li>Obtaining X and y (features and output)</li>
</ul>

In [2]:
# Setting up the dataset for modelling
from sklearn.feature_extraction.text import CountVectorizer

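def get_X_y(df, col):
    # Build a document-term matrix (DTM): one row per document, one column per word.
    # Note: the vocabulary is learned from the whole column, i.e. before the train-test split.
    cv = CountVectorizer()
    # GaussianNB expects a dense array, hence .toarray()
    X = cv.fit_transform(col).toarray()
    # Target labels
    y = df.Inflection.values

    return X, y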
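
<p>To make the document-term matrix concrete, here is a minimal plain-Python sketch of the idea. It is not scikit-learn's implementation: <code>build_dtm</code> is a hypothetical helper, the headlines are made up, and the tokenisation (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) only approximates CountVectorizer's defaults.</p>

```python
import re
from collections import Counter


def build_dtm(docs):
    """Hypothetical helper: return (vocabulary, rows of word counts)."""
    # Lowercase each document and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # One column per distinct token, in alphabetical order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Made-up headlines, for illustration only
headlines = ["Oil prices rise", "Oil stocks fall as prices fall"]
vocabulary, dtm = build_dtm(headlines)

print(vocabulary)  # ['as', 'fall', 'oil', 'prices', 'rise', 'stocks']
print(dtm)         # [[0, 0, 1, 1, 1, 0], [1, 2, 1, 1, 0, 1]]
```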

<ul>
    <li>Define, Fit, and Predict</li>
</ul>

In [3]:
from sklearn.naive_bayes import GaussianNB

def dfp_naive_bayes(X_train, y_train, X_test):
    # Model definition (var_smoothing=1e+2 is the tuned value; scikit-learn's default is 1e-9)
    naive_bayes = GaussianNB(var_smoothing=1e+2)

    # Fitting the model
    naive_bayes.fit(X_train, y_train)

    # Predicting with the model
    y_pred = naive_bayes.predict(X_test)
    
    return y_pred

<ul>
    <li>Evaluate</li>
</ul>

In [4]:
from sklearn.metrics import confusion_matrix, accuracy_score

def evaluate_naive_bayes(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    confusion = confusion_matrix(y_test, y_pred)

    return confusion, accuracy

<h3>Headlines Forecast</h3>

In [5]:
# Train-test split
from sklearn.model_selection import train_test_split

X_headlines, y_headlines = get_X_y(df_headlines, df_headlines.Headline)
X_headlines_train, X_headlines_test, y_headlines_train, y_headlines_test = train_test_split(
    X_headlines, y_headlines, test_size = 0.10, random_state = 0)

In [6]:
# Obtaining predictions
y_headlines_pred = dfp_naive_bayes(X_headlines_train, y_headlines_train, X_headlines_test)

In [7]:
# Evaluating the model
confusion_headlines, accuracy_headlines = evaluate_naive_bayes(y_headlines_test, y_headlines_pred)

print("Confusion matrix\n")
print(confusion_headlines, "\n")
print("The model's accuracy is", accuracy_headlines)

Confusion matrix

[[ 0  9]
 [ 0 16]] 

The model's accuracy is 0.64


<h3>Display Text Forecast</h3>

In [8]:
# Train-test split (train_test_split was already imported in the headlines forecast above)

X_text, y_text = get_X_y(df_text, df_text.Text)
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(
    X_text, y_text, test_size = 0.10, random_state = 0)

In [9]:
# Obtaining predictions
y_text_pred = dfp_naive_bayes(X_text_train, y_text_train, X_text_test)

In [10]:
# Evaluating the model
confusion_text, accuracy_text = evaluate_naive_bayes(y_text_test, y_text_pred)

print("Confusion matrix\n")
print(confusion_text, "\n")
print("The model's accuracy is", accuracy_text)

Confusion matrix

[[ 0  9]
 [ 0 16]] 

The model's accuracy is 0.64


<h3>Preliminary Results</h3>

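<p>With GaussianNB's default settings, the accuracy of the Naive Bayes model was quite poor (the outputs shown above come from the tuned model, which sets var_smoothing=1e+2):</p>

<ul>
    <li>42% on the headlines-based forecast</li>
    <li>34% on the display text-based forecast (counter-intuitive, since the display text contains more information than the headline alone)</li>
</ul>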

<p>Let's see how we can tune our model to get better results.</p>
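
<p>The parameter tuned here is GaussianNB's var_smoothing, which adds a portion of the largest feature variance to every per-class variance. The sketch below illustrates the effect on a single feature. The class statistics and <code>max_feature_var</code> are made-up numbers, and <code>predict</code> is a hypothetical helper rather than scikit-learn code. With heavy smoothing the likelihoods become nearly identical across classes, so the class priors decide the prediction.</p>

```python
import math


def gaussian_log_pdf(x, mean, var):
    # Log-density of a normal distribution with the given mean and variance
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


# Made-up per-class statistics for one word-count feature
stats = {
    -1: {"mean": 0.1, "var": 0.09, "prior": 0.36},
    1: {"mean": 0.6, "var": 0.24, "prior": 0.64},
}
max_feature_var = 0.25  # hypothetical largest feature variance in the training data


def predict(x, var_smoothing):
    # The smoothing term is added to every class variance
    eps = var_smoothing * max_feature_var
    scores = {
        label: math.log(s["prior"]) + gaussian_log_pdf(x, s["mean"], s["var"] + eps)
        for label, s in stats.items()
    }
    return max(scores, key=scores.get)


print(predict(0.0, 1e-9))  # -1: the likelihood dominates
print(predict(0.0, 1e+2))  # 1: the larger prior dominates
```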

<h3>Results</h3>

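<p>Tuning the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html">variance smoothing</a> of the Naive Bayes model (var_smoothing=1e+2) raised the accuracy to 64% on both forecasts. Note, however, that both confusion matrices show every test sample assigned to the same class, so 64% is simply that class's share of the test set (16 of 25). Let's now pit this against an SVM classifier.</p>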
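
<p>The confusion matrices printed above can be read as follows. The sketch assumes scikit-learn's convention (rows are true labels, columns are predicted labels, labels in sorted order) and that the labels are -1 and 1. It only re-derives figures from the printed output.</p>

```python
# Confusion matrix as printed above
# rows: true -1, true 1; columns: predicted -1, predicted 1
confusion = [[0, 9],
             [0, 16]]

total = sum(sum(row) for row in confusion)
# Correct predictions sit on the diagonal
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Number of test samples the model assigned to class -1
predicted_negative = sum(row[0] for row in confusion)

# Accuracy of always guessing the most frequent true class
majority_share = max(sum(row) for row in confusion) / total

print(accuracy, predicted_negative, majority_share)  # 0.64 0 0.64
```

<p>Since no sample is predicted as -1, the tuned model's accuracy matches what a constant "always 1" guess would score on this test set.</p>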

<h2>Support Vector Machine (SVM)</h2>

<p>An SVM finds the hyperplane that separates the classes in an n-dimensional feature space with the largest possible margin. In this case, we have only two classes (-1 and 1, i.e. negative and positive). Functions called "kernels" determine the shape of the decision boundary. We will use a linear kernel, a common choice for high-dimensional, sparse text features such as TF-IDF vectors.</p>
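
<p>With a linear kernel, the trained model boils down to a weight vector and a bias, and the predicted class depends on which side of the hyperplane a sample falls. Here is a minimal sketch of that decision rule. The weights, bias and feature vectors are made up, and <code>linear_decision</code> is a hypothetical helper, not part of scikit-learn.</p>

```python
def linear_decision(weights, bias, features):
    """Hypothetical helper: return (predicted label, signed score)."""
    # Signed score w . x + b; its sign tells which side of the hyperplane x lies on
    score = sum(w * x for w, x in zip(weights, features)) + bias
    label = 1 if score >= 0 else -1
    return label, score


# Made-up parameters for a three-feature model
weights = [0.8, -1.2, 0.3]
bias = -0.1

label_a, score_a = linear_decision(weights, bias, [1.0, 0.0, 1.0])
label_b, score_b = linear_decision(weights, bias, [0.0, 1.0, 0.0])
print(label_a, label_b)  # 1 -1
```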

In [42]:
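# Positional split without shuffling: the first 200 rows train the model, the rest are the test set
train_text = df_headlines[:200]
test_text = df_headlines[200:]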

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create feature vectors
vectorizer = TfidfVectorizer(min_df = 5,
                             max_df = 0.8,
                             sublinear_tf = True,
                             use_idf = True)

train_vectors = vectorizer.fit_transform(train_text["Headline"])
test_vectors = vectorizer.transform(test_text["Headline"])
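
# The sketch below illustrates what the vectorizer settings above mean, in plain Python.
# It assumes scikit-learn's documented behaviour: sublinear_tf replaces tf with 1 + ln(tf),
# the default smoothed idf is ln((1 + n) / (1 + df)) + 1, an integer min_df is an absolute
# document count and a float max_df is a proportion of documents. tfidf_weight and keep_term
# are hypothetical helpers, and the final L2 normalisation of each row is omitted.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """Weight of one term in one document, before row normalisation."""
    if tf == 0:
        return 0.0
    sublinear_tf = 1.0 + math.log(tf)                       # sublinear_tf = True
    smooth_idf = math.log((1 + n_docs) / (1 + df)) + 1.0    # use_idf = True (smoothed)
    return sublinear_tf * smooth_idf


def keep_term(df, n_docs, min_df=5, max_df=0.8):
    """True if a term with document frequency df stays in the vocabulary."""
    return min_df <= df <= max_df * n_docs


n_docs = 200  # size of the training split used above

# A term in every document gets idf 1, so its weight is just the damped term frequency
print(tfidf_weight(1, 200, n_docs))  # 1.0

# Terms in fewer than 5 or in more than 80% of the documents are dropped
print(keep_term(4, n_docs), keep_term(5, n_docs), keep_term(161, n_docs))  # False True False
```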

In [52]:
import time
from sklearn import svm
from sklearn.metrics import classification_report

# Perform classification with SVM, kernel=linear
classifier_linear = svm.SVC(kernel='linear', C=10, tol=1e-1)
t0 = time.time()
classifier_linear.fit(train_vectors, train_text["Inflection"])
t1 = time.time()
prediction_linear = classifier_linear.predict(test_vectors)
t2 = time.time()
time_linear_train = t1-t0
time_linear_predict = t2-t1
# results
print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
report = classification_report(test_text["Inflection"], prediction_linear, output_dict=True)
print('positive: ', report["1"])
print('negative: ', report["-1"])

Training time: 0.005531s; Prediction time: 0.000000s
positive:  {'precision': 0.40540540540540543, 'recall': 0.6818181818181818, 'f1-score': 0.5084745762711864, 'support': 22}
negative:  {'precision': 0.46153846153846156, 'recall': 0.21428571428571427, 'f1-score': 0.2926829268292683, 'support': 28}
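
<p>To compare the SVM with the Naive Bayes results, the overall accuracy can be recovered from the per-class recall and support printed above. The sketch below only re-derives figures from that output. The dictionary is copied by hand from the report rather than produced by rerunning the model.</p>

```python
# Per-class recall and support, copied from the classification report above
report = {
    "1": {"recall": 0.6818181818181818, "support": 22},
    "-1": {"recall": 0.21428571428571427, "support": 28},
}

# recall * support is the number of correctly classified samples in each class
correct = sum(round(c["recall"] * c["support"]) for c in report.values())
total = sum(c["support"] for c in report.values())
svm_accuracy = correct / total

# Accuracy of always guessing the most frequent class in this test set
majority_share = max(c["support"] for c in report.values()) / total

print(correct, total, svm_accuracy, majority_share)  # 21 50 0.42 0.56
```

<p>On this split the linear SVM classifies 21 of 50 headlines correctly (42%), which is below the 56% that a constant "always -1" guess would score on the same test set.</p>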
