# University of Aberdeen

## Applied AI (CS5079)

### Lecture (Day 5) - Investigating Sentiment Prediction

---

In this lecture, we cover tools for pre-processing text data, several supervised and unsupervised models for sentiment prediction, and model causation. This lecture is inspired by Chapter 7 of __Practical Machine Learning with Python__ (Sarkar et al., 2018).

__In this particular notebook, we study supervised models based on SGD classifiers and logistic regression with BOW or TF-IDF features__.

We will use the following packages:

In [1]:
# Usual data representation and manipulation libraries
import pandas as pd
import numpy as np

# ML Models
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import train_test_split

# Evaluation libraries
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Libraries for feature engineering
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Computing BOW and TF-IDF features

Since Codio only offers 500MB of RAM, we restrict our dataset to the first 10,000 reviews (out of 50,000). The restricted dataset is then split into a training set and a test set containing 70% and 30% of the reviews, respectively.

Feel free to try with the full set of reviews on your personal computer.
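To make the difference between the two feature types concrete, here is a minimal sketch on a toy corpus. The three short reviews are hypothetical and are not taken from the movie review dataset; the vectorizer settings mirror the ones used in this notebook (unigrams and bigrams, sublinear term frequency for TF-IDF).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
toy_reviews = ["good movie", "bad movie", "good good plot"]

# BOW: each column counts how often a unigram or bigram occurs in a review
toy_cv = CountVectorizer(binary=False, ngram_range=(1, 2))
toy_bow = toy_cv.fit_transform(toy_reviews)

# TF-IDF: same vocabulary, but counts are reweighted so that terms
# appearing in many reviews contribute less; rows are L2-normalised by default
toy_tv = TfidfVectorizer(use_idf=True, ngram_range=(1, 2), sublinear_tf=True)
toy_tfidf = toy_tv.fit_transform(toy_reviews)

# Vocabulary terms ordered by column index
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocabulary)
print(toy_bow.toarray())
print(np.round(toy_tfidf.toarray(), 2))
```

Both matrices have one row per review and one column per unigram or bigram, which is why the BOW and TF-IDF feature matrices in this notebook end up with identical shapes.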

In [2]:
normalized_movie_reviews = pd.read_csv("Datasets/normalized_movie_reviews.csv")
numberOfReviews=10000
reviews = np.array(normalized_movie_reviews['review'].iloc[:numberOfReviews])
sentiments = np.array(normalized_movie_reviews['sentiment'].iloc[:numberOfReviews])

# split into training (70%) and test (30%) sets; no random_state is set, so the split changes between runs
train_reviews, test_reviews, train_sentiments, test_sentiments = train_test_split(reviews, sentiments, test_size=0.3)

# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(train_reviews)

# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2),sublinear_tf=True)
tv_train_features = tv.fit_transform(train_reviews)

In [3]:
# transform test reviews into features
cv_test_features = cv.transform(test_reviews)
tv_test_features = tv.transform(test_reviews)

In [4]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

BOW model:> Train features shape: (7000, 581132)  Test features shape: (3000, 581132)
TFIDF model:> Train features shape: (7000, 581132)  Test features shape: (3000, 581132)
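With 581,132 columns, these matrices only fit in memory because scikit-learn's vectorizers return SciPy sparse matrices, which store the non-zero entries only. The sketch below illustrates the idea on a small stand-in matrix; `sparse_memory_report` is a hypothetical helper written for this illustration, and the last line estimates the dense size of the BOW training matrix from the shape printed above, assuming 8-byte entries.

```python
import numpy as np
from scipy import sparse

def sparse_memory_report(matrix):
    """Return (density, sparse_bytes, dense_bytes) for a SciPy CSR matrix."""
    n_rows, n_cols = matrix.shape
    # Fraction of entries that are non-zero
    density = matrix.nnz / (n_rows * n_cols)
    # CSR stores three arrays: values, column indices and row pointers
    sparse_bytes = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes
    # Size of the same matrix if every entry were stored explicitly
    dense_bytes = n_rows * n_cols * matrix.dtype.itemsize
    return density, sparse_bytes, dense_bytes

# Small stand-in matrix (hypothetical values, not the notebook's features)
toy = sparse.csr_matrix(np.array([[1, 0, 0, 0],
                                  [0, 0, 2, 0],
                                  [0, 0, 0, 0]], dtype=np.int64))
density, sparse_bytes, dense_bytes = sparse_memory_report(toy)
print("density:", density, "sparse bytes:", sparse_bytes, "dense bytes:", dense_bytes)

# Dense size of a 7000 x 581132 matrix with 8-byte entries, in gigabytes
notebook_dense_gb = 7000 * 581132 * 8 / 1e9
print("dense BOW training matrix would need about", round(notebook_dense_gb, 1), "GB")
```

In the notebook itself, the same helper could be applied to `cv_train_features` to see how sparse the real feature matrix is.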


## Model Training, Prediction and Performance Evaluation



In [5]:
# We define our LR model and a linear SVM (SGDClassifier with hinge loss)
lr = LogisticRegression(penalty='l2', max_iter=1000, C=1)
svm = SGDClassifier(loss='hinge', max_iter=100)
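The `svm` name is justified because `SGDClassifier` with `loss='hinge'` fits a linear support vector machine by stochastic gradient descent. As a minimal sketch of what that loss measures, the snippet below computes the mean hinge loss for a few made-up decision values; the function name `hinge_loss` and the numbers are assumptions for illustration, and the regularisation term that `SGDClassifier` also optimises is left out.

```python
import numpy as np

def hinge_loss(y_true, decision_values):
    """Mean hinge loss for labels encoded as -1 / +1."""
    # A sample incurs no loss once it is on the correct side with margin >= 1
    margins = y_true * decision_values
    return np.maximum(0.0, 1.0 - margins).mean()

# Hypothetical labels and raw classifier scores (w.x + b)
y = np.array([1, -1, 1])
scores = np.array([2.0, -0.5, -1.0])

# First sample: correct with a large margin, no loss
# Second sample: correct but inside the margin, small loss
# Third sample: misclassified, large loss
print(hinge_loss(y, scores))
```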

In [6]:
# Logistic Regression model on BOW features
lr.fit(cv_train_features,train_sentiments)
y_predicted = lr.predict(cv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(test_sentiments, y_predicted)))
print("The model precision score is: {}".format(precision_score(test_sentiments, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(test_sentiments, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(test_sentiments, y_predicted, average="weighted")))

print(classification_report(test_sentiments, y_predicted))

display(pd.DataFrame(confusion_matrix(test_sentiments, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8806666666666667
The model precision score is: 0.8807741589652909
The model recall score is: 0.8806666666666667
The model F1-score is: 0.8806436387571166
              precision    recall  f1-score   support

    negative       0.89      0.87      0.88      1483
    positive       0.88      0.89      0.88      1517

    accuracy                           0.88      3000
   macro avg       0.88      0.88      0.88      3000
weighted avg       0.88      0.88      0.88      3000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 1290 | 193 |
| Act. positive | 165 | 1352 |
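The printed scores can be recovered directly from the confusion matrix. The sketch below does this for the logistic regression model on BOW features, using the four counts reported above; the variable names are chosen for this illustration. It also shows why the weighted recall equals the accuracy: weighting each class's recall by its support simply adds up the correctly classified reviews.

```python
import numpy as np

# Confusion matrix reported above
# rows: actual negative / positive, columns: predicted negative / positive
cm = np.array([[1290, 193],
               [165, 1352]])

support = cm.sum(axis=1)                   # number of actual reviews per class
accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / actually that class

# average="weighted" weights each class score by its support
weighted_precision = np.sum(precision * support) / support.sum()
weighted_recall = np.sum(recall * support) / support.sum()

print("accuracy:", accuracy)
print("weighted precision:", weighted_precision)
print("weighted recall:", weighted_recall)
```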


In [7]:
# Logistic Regression model on TF-IDF features
lr.fit(tv_train_features,train_sentiments)
y_predicted = lr.predict(tv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(test_sentiments, y_predicted)))
print("The model precision score is: {}".format(precision_score(test_sentiments, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(test_sentiments, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(test_sentiments, y_predicted, average="weighted")))

print(classification_report(test_sentiments, y_predicted))

display(pd.DataFrame(confusion_matrix(test_sentiments, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8743333333333333
The model precision score is: 0.874331981981982
The model recall score is: 0.8743333333333333
The model F1-score is: 0.8743317832086531
              precision    recall  f1-score   support

    negative       0.87      0.87      0.87      1483
    positive       0.88      0.88      0.88      1517

    accuracy                           0.87      3000
   macro avg       0.87      0.87      0.87      3000
weighted avg       0.87      0.87      0.87      3000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 1293 | 190 |
| Act. positive | 187 | 1330 |


In [8]:
# SVM model on BOW features
svm.fit(cv_train_features,train_sentiments)
y_predicted = svm.predict(cv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(test_sentiments, y_predicted)))
print("The model precision score is: {}".format(precision_score(test_sentiments, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(test_sentiments, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(test_sentiments, y_predicted, average="weighted")))

print(classification_report(test_sentiments, y_predicted))

display(pd.DataFrame(confusion_matrix(test_sentiments, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8713333333333333
The model precision score is: 0.8718705534123772
The model recall score is: 0.8713333333333333
The model F1-score is: 0.8712526229825935
              precision    recall  f1-score   support

    negative       0.89      0.85      0.87      1483
    positive       0.86      0.89      0.88      1517

    accuracy                           0.87      3000
   macro avg       0.87      0.87      0.87      3000
weighted avg       0.87      0.87      0.87      3000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 1260 | 223 |
| Act. positive | 163 | 1354 |


In [9]:
# SVM model on TF-IDF features
svm.fit(tv_train_features,train_sentiments)
y_predicted = svm.predict(tv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(test_sentiments, y_predicted)))
print("The model precision score is: {}".format(precision_score(test_sentiments, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(test_sentiments, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(test_sentiments, y_predicted, average="weighted")))

print(classification_report(test_sentiments, y_predicted))

display(pd.DataFrame(confusion_matrix(test_sentiments, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8943333333333333
The model precision score is: 0.8944576905807119
The model recall score is: 0.8943333333333333
The model F1-score is: 0.8943118735362295
              precision    recall  f1-score   support

    negative       0.90      0.88      0.89      1483
    positive       0.89      0.91      0.90      1517

    accuracy                           0.89      3000
   macro avg       0.89      0.89      0.89      3000
weighted avg       0.89      0.89      0.89      3000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 1310 | 173 |
| Act. positive | 144 | 1373 |
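The four evaluation cells above repeat the same fit, predict and score steps. One possible way to factor them out is sketched below; `evaluate_model` is a hypothetical helper that is not part of the original notebook, and the tiny corpus is made up so that the example runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(model, train_features, train_labels, test_features, test_labels):
    """Fit the model, predict on the test set and return the weighted scores."""
    model.fit(train_features, train_labels)
    predicted = model.predict(test_features)
    return {
        "accuracy": accuracy_score(test_labels, predicted),
        "precision": precision_score(test_labels, predicted, average="weighted", zero_division=0),
        "recall": recall_score(test_labels, predicted, average="weighted", zero_division=0),
        "f1": f1_score(test_labels, predicted, average="weighted", zero_division=0),
    }

# Tiny made-up corpus standing in for the movie reviews
toy_train = ["good great film", "great fun movie", "bad awful film", "awful boring movie"]
toy_train_labels = ["positive", "positive", "negative", "negative"]
toy_test = ["great film", "boring awful"]
toy_test_labels = ["positive", "negative"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_train_features = toy_vectorizer.fit_transform(toy_train)
toy_test_features = toy_vectorizer.transform(toy_test)

scores = evaluate_model(LogisticRegression(penalty="l2", max_iter=1000, C=1),
                        toy_train_features, toy_train_labels,
                        toy_test_features, toy_test_labels)
for name, value in scores.items():
    print("The model {} score is: {}".format(name, value))
```

In the notebook, the same call could be made with `lr` or `svm` and either the BOW or the TF-IDF feature matrices.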
