# University of Aberdeen

## Applied AI (CS5079)

### Lecture (Day 5) - Investigating Sentiment Prediction

---

In the lecture, we cover tools for pre-processing text data, several supervised/unsupervised models for sentiment prediction and model causation.  This lecture is inspired by Chapter 7 of __Practical Machine Learning with Python__ (2018), Sarkar et al.

__In this particular notebook, we normalize the dataset and investigate multiple unsupervised lexion-based models__.

We will use the following packages:

In [1]:
# Usual data representation and manipulation libraries
import pandas as pd
import numpy as np
from collections import Counter

# NLTK is very useful for natural language applications
import nltk

# This will be used to tokenize sentences
from nltk.tokenize.toktok import ToktokTokenizer

# We use spacy for extracting useful information from English words
import spacy
nlp = spacy.load('en', parse = False, tag=False, entity=False)

# This dictionary will be used to expand contractions (e.g. we'll -> we will)
from contractions import contractions_dict
import re

# Unicodedata will be used to remove accented characters
import unicodedata

# BeautifulSoup will be used to remove html tags
from bs4 import BeautifulSoup

# Lexicon models
from afinn import Afinn
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Evaluation libraries
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Importing the dataset and Pre-Processing

We will predict the sentiment for movie reviews obtained from the Internet Movie Database (IMDb). The dataset contains 50,000 movie reviews that have been labeled with “positive” and “negative” labels based on the review content.

If you are not using Codio, the dataset can be obtained from http://ai.stanford.edu/~amaas/data/sentiment/, courtesy of Stanford University and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. This dataset was also used in their paper: "Learning Word Vectors for Sentiment Analysis proceedings of the 49th Annual Meeting of the Association" for Computational Linguistics (ACL 2011).

In [2]:
#We import the dataset
movie_reviews_raw = pd.read_csv("Datasets/movie_reviews.csv")
movie_reviews_raw.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In the previous sample, you can notice that the text reviews contain html tag. The following code will use `BeautifulSoup` to remove those tags.

In [3]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

It may also be important to remove accented and special characters.

In [4]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def remove_special_characters(text):
    text = re.sub('[^a-zA-z0-9\s]', '', text)
    return text

The cell below will use `contractions` dictionary to find contractions (using regular expressions) and expand them.

In [5]:
def expand_contractions(text, contraction_mapping=contractions_dict):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contractions_dict.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) if contraction_mapping.get(match) else contraction_mapping.get(match.lower())                               
        return first_char+expanded_contraction[1:] if expanded_contraction != None else match
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

Depending on the application, we may also need to lemmatize the words in a sentence, i.e. extract the canonical form of the words.

In [6]:
def lemmatize_text(text):
    text = nlp(text)
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])

We download a list of stopwords (list of words that are usually filtered before processing natural languages). Because we are concerned with sentiment prediction, it is very important to keep the polarity of the sentence, that is why we need to keep words like `no` or `not`.

In [7]:
#nltk.download('stopwords')
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

def remove_stopwords(text, is_lower_case=False):
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


We can now combine all the previous functions into a `normalize_corpus` function to apply the chosen pre-processing techniques.

In [8]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r|\n|\r\n]+', ' ',doc)
        # insert spaces between special characters to isolate them    
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters    
        if special_char_removal:
            doc = remove_special_characters(doc)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
    return normalized_corpus

You can run the following code snippet to convert and save the processed dataset or use the `normalized_movie_reviews.csv` file directly. This step may take some time to be completed.

In [9]:
#movie_reviews_raw['review'] = normalize_corpus(movie_reviews_raw.review)
#movie_reviews_raw.to_csv("normalized_movie_reviews.csv", index = False)

In [10]:
normalized_movie_reviews = pd.read_csv("Datasets/normalized_movie_reviews.csv")
normalized_movie_reviews.head(5)

Unnamed: 0,review,sentiment
0,one reviewer mention watch 1 oz episode hook r...,positive
1,wonderful little production filming technique ...,positive
2,think wonderful way spend time hot summer week...,positive
3,basically family little boy jake think zombie ...,negative
4,petter matteis love time money visually stunni...,positive


## Unsupervised Lexicon-based Models

We split the dataset into a training (resp. test) dataset containing the first 35,000 (resp. last 15,000) instances.

In [11]:
reviews = np.array(normalized_movie_reviews['review'])
sentiments = np.array(normalized_movie_reviews['sentiment'])

# extract data for model evaluation
test_reviews = reviews[35000:]
train_reviews = reviews[:35000]
test_sentiments = sentiments[35000:]
train_sentiments = sentiments[:35000]

sample_review_ids = [7626, 3533, 13010]

### AFINN

In the next cell, we show how AFINN can easily score the polarity of a review.

In [12]:
afn = Afinn(emoticons=True)

for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity:', afn.score(review))
    print('-'*60)

REVIEW: no comment stupid movie act average bad screenplay no sense skip
Actual Sentiment: negative
Predicted Sentiment polarity: -7.0
------------------------------------------------------------
REVIEW: not care people vote movie bad want truth good movie every thing movie really get one
Actual Sentiment: positive
Predicted Sentiment polarity: 3.0
------------------------------------------------------------
REVIEW: bad horror film ever funniest film ever roll one get see film cheap unbeliaveble see really p watch carrot
Actual Sentiment: positive
Predicted Sentiment polarity: -3.0
------------------------------------------------------------


In the next two cells, we evaluate our sentiment prediction model with AFINN on the test dataset.

In [13]:
T = lambda x: "positive" if x else "negative"
y_predicted = [T(x) for x in [afn.score(review)>=0 for review in test_reviews]]

In [14]:
print("The model accuracy score is: {}".format(accuracy_score(test_sentiments, y_predicted)))
print("The model precision score is: {}".format(precision_score(test_sentiments, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(test_sentiments, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(test_sentiments, y_predicted, average="weighted")))

print(classification_report(test_sentiments, y_predicted))

display(pd.DataFrame(confusion_matrix(test_sentiments, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.7084666666666667
The model precision score is: 0.7305592613010629
The model recall score is: 0.7084666666666667
The model F1-score is: 0.7012126290280828
              precision    recall  f1-score   support

    negative       0.80      0.55      0.65      7490
    positive       0.66      0.86      0.75      7510

    accuracy                           0.71     15000
   macro avg       0.73      0.71      0.70     15000
weighted avg       0.73      0.71      0.70     15000



Unnamed: 0,Pred. negative,Pred. positive
Act. negative,4140,3350
Act. positive,1023,6487


## SentiWordNet

Wordnet groupes synonyms into synsets with short definitions and usage examples. In the example below, we print the synsets for the word `extravagant`. You can notice than each synset is associated with a positive, a negative and an objectivity score.

In [15]:
extravant = list(swn.senti_synsets('extravagant', 'a'))
pd.DataFrame.from_dict({ "Synset" : [ s.synset for s in extravant],
"Definition" : [s.synset.definition() for s in extravant], "Positive Polarity" : [s._pos_score for s in extravant], "Negative Polarity" : [s._neg_score for s in extravant], "Objectivity Score" : [s._obj_score for s in extravant]})

Unnamed: 0,Synset,Definition,Positive Polarity,Negative Polarity,Objectivity Score
0,Synset('excessive.s.02'),"unrestrained, especially with regard to feelings",0.125,0.375,0.5
1,Synset('extravagant.s.02'),recklessly wasteful,0.0,0.125,0.875


We can create a simple function to aggregate the scores of the most common synsets for the words in the reviews using the postive and negative scores returned by Wordnet.

In [16]:
def analyze_sentiment_sentiwordnet_lexicon(review):
    # tokenize and POS tag text tokens
    tagged_text = [(token.text, token.tag_) for token in nlp(review)]
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')): #NOUNS
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')): #VERBS
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')): #ADJECTIVES
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')): #ADVERBS
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found        
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
    # aggregate final scores
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score) / token_count, 2)
    return norm_final_score

In [17]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity: '+ str(analyze_sentiment_sentiwordnet_lexicon(review)))
    print('-'*60)

REVIEW: no comment stupid movie act average bad screenplay no sense skip
Actual Sentiment: negative
Predicted Sentiment polarity: -0.1
------------------------------------------------------------
REVIEW: not care people vote movie bad want truth good movie every thing movie really get one
Actual Sentiment: positive
Predicted Sentiment polarity: 0.11
------------------------------------------------------------
REVIEW: bad horror film ever funniest film ever roll one get see film cheap unbeliaveble see really p watch carrot
Actual Sentiment: positive
Predicted Sentiment polarity: 0.03
------------------------------------------------------------


In [18]:
y_predicted = [T(x) for x in [analyze_sentiment_sentiwordnet_lexicon(review)>=0 for review in test_reviews]]

In [19]:
print("The model accuracy score is: {}".format(accuracy_score(test_sentiments, y_predicted)))
print("The model precision score is: {}".format(precision_score(test_sentiments, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(test_sentiments, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(test_sentiments, y_predicted, average="weighted")))

print(classification_report(test_sentiments, y_predicted))

display(pd.DataFrame(confusion_matrix(test_sentiments, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.6842666666666667
The model precision score is: 0.6869568255259471
The model recall score is: 0.6842666666666667
The model F1-score is: 0.683079703654126
              precision    recall  f1-score   support

    negative       0.71      0.62      0.66      7490
    positive       0.66      0.75      0.70      7510

    accuracy                           0.68     15000
   macro avg       0.69      0.68      0.68     15000
weighted avg       0.69      0.68      0.68     15000



Unnamed: 0,Pred. negative,Pred. positive
Act. negative,4668,2822
Act. positive,1914,5596


## VADER Lexicon

VADER (Valence Aware Dictionary for Sentiment Reasoning) is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data.

In the next cell, we can see that VADER returns four sentiment scores `compound`, `neg`, `neu` and `pos`. In the following model, we will only use the `compound` (i.e. the aggregated score).

In [20]:
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores('This movie was actually neither that funny, nor super witty.')

{'compound': -0.6759, 'neg': 0.41, 'neu': 0.59, 'pos': 0.0}

In [21]:
def analyze_sentiment_vader_lexicon(review, threshold=0.1):
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold else 'negative'
    return final_sentiment

In [22]:
for review, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_vader_lexicon(review, threshold=0.4)    
    print('-'*60)

REVIEW: no comment stupid movie act average bad screenplay no sense skip
Actual Sentiment: negative
------------------------------------------------------------
REVIEW: not care people vote movie bad want truth good movie every thing movie really get one
Actual Sentiment: positive
------------------------------------------------------------
REVIEW: bad horror film ever funniest film ever roll one get see film cheap unbeliaveble see really p watch carrot
Actual Sentiment: positive
------------------------------------------------------------


In [23]:
y_predicted = [analyze_sentiment_vader_lexicon(review, threshold=0.4) for review in test_reviews]

print("The model accuracy score is: {}".format(accuracy_score(test_sentiments, y_predicted)))
print("The model precision score is: {}".format(precision_score(test_sentiments, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(test_sentiments, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(test_sentiments, y_predicted, average="weighted")))

print(classification_report(test_sentiments, y_predicted))

display(pd.DataFrame(confusion_matrix(test_sentiments, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.7043333333333334
The model precision score is: 0.7131634140041795
The model recall score is: 0.7043333333333334
The model F1-score is: 0.7011709547819932
              precision    recall  f1-score   support

    negative       0.76      0.60      0.67      7490
    positive       0.67      0.81      0.73      7510

    accuracy                           0.70     15000
   macro avg       0.71      0.70      0.70     15000
weighted avg       0.71      0.70      0.70     15000



Unnamed: 0,Pred. negative,Pred. positive
Act. negative,4506,2984
Act. positive,1451,6059
