#### Part 1: Using the TextBlob Sentiment Analyzer

In [1]:
# import modules

import pandas as pd
import re
from textblob import TextBlob

Import the movie review data as a data frame and ensure that the data is loaded properly.

In [2]:
# read the dataset as a dataframe
df = pd.read_csv("./datasets/labeledTrainData.tsv", sep="\t")

In [3]:
# view the first few rows to check that the dataset loaded correctly
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


How many positive and how many negative reviews are there?

In [4]:
# use value_counts to count the positive and negative sentiments
df["sentiment"].value_counts()

sentiment
1    12500
0    12500
Name: count, dtype: int64

Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [5]:
# this function converts the input text to lower case and then performs sentiment analysis using the TextBlob sentiment analyzer
# it returns 1 (positive) if the TextBlob polarity is greater than or equal to 0, otherwise it returns 0 (negative)
def perform_sentiment_analysis(text):
    testimonial = TextBlob(text.lower())
    return 0 if testimonial.sentiment.polarity < 0 else 1
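
To make the thresholding rule concrete, here is a minimal sketch that applies the same cutoff to a few made-up polarity values. The scores and the helper name `label_from_polarity` are hypothetical and are not TextBlob output. Note that a perfectly neutral score of $0.0$ is labeled positive under this rule.

```python
def label_from_polarity(polarity: float) -> int:
    """Map a polarity score in [-1, 1] to 1 (positive) or 0 (negative)."""
    return 0 if polarity < 0 else 1

# hypothetical polarity scores, for illustration only
example_scores = [-0.6, -0.01, 0.0, 0.25, 0.9]
labels = [label_from_polarity(score) for score in example_scores]
print(labels)  # [0, 0, 1, 1, 1]
```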

In [6]:
# perform sentiment analysis for each review and store in a new column
df["textblob_sentiment"] = df["review"].apply(perform_sentiment_analysis)

In [7]:
# view few rows
df.head()

Unnamed: 0,id,sentiment,review,textblob_sentiment
0,5814_8,1,With all this stuff going down at the moment w...,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,0
3,3630_4,0,It must be assumed that those who praised this...,1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,0


Check the accuracy of this model. Is this model better than random guessing?

In [13]:
# import the accuracy_score function for calculating accuracy
from sklearn.metrics import accuracy_score

In [14]:
# get accuracy score for sentiment generated by TextBlob against the labeled data
accuracy_score(df["sentiment"], df["textblob_sentiment"])

0.68552

The `TextBlob` sentiment analyzer has an accuracy of about $68.6$%. The labeled dataset is balanced, with $50$% positive and $50$% negative reviews, so random guessing would be expected to reach an accuracy of only about $50$%. Since the `TextBlob` accuracy is well above $50$%, we can say that the `TextBlob` model is better than _random guessing_.
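
As a rough check of the $50$% baseline, the sketch below simulates coin-flip predictions on a balanced set of labels of the same size as the dataset. It uses only the standard library. The helper `accuracy` and the seed are choices made for this illustration, not part of the original analysis.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

random.seed(0)  # fixed seed so the run is repeatable

# balanced labels, mirroring the 12500 / 12500 split in the dataset
y_true = [1] * 12500 + [0] * 12500

# coin-flip predictions as a random-guessing baseline
y_random = [random.randint(0, 1) for _ in y_true]

baseline = accuracy(y_true, y_random)
print(f"random-guess accuracy: {baseline:.3f}")
```

With $25000$ labels the simulated accuracy should land close to $0.5$, which is the reference point the `TextBlob` result is compared against.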

For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

In [15]:
# import VADER sentiment analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [17]:
# create the VADER analyzer once so it is not rebuilt for every review
analyzer = SentimentIntensityAnalyzer()

# this function performs VADER sentiment analysis on the lowercased input text and returns 1 (positive)
# if the compound score from the analysis is greater than or equal to 0, otherwise 0 (negative)
def perform_sentiment_analysis_using_vader(text):
    sentiment_dict = analyzer.polarity_scores(text.lower())
    return 0 if sentiment_dict["compound"] < 0 else 1

In [19]:
# perform VADER sentiment analysis for each review using above function and store in a new column
df["sentiment_with_vader"] = df["review"].apply(perform_sentiment_analysis_using_vader)

In [24]:
# view few rows
df.head()

Unnamed: 0,id,sentiment,review,review_lowercase,textblob_sentiment,sentiment_with_vader
0,5814_8,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...,1,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...","\the classic war of the worlds\"" by timothy hi...",1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager (nicholas bell)...,0,0
3,3630_4,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...,1,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,0,1


In [25]:
# get accuracy score for sentiment generated by VADER against the labeled data
accuracy_score(df["sentiment"], df["sentiment_with_vader"])

0.69236

The accuracy of sentiment analysis with the `VADER` analyzer (about $69.2$%) is only about $0.7$ percentage points higher than with the `TextBlob` analyzer (about $68.6$%), so the two prebuilt analyzers perform about the same on this dataset.
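
The size of the gap can be worked out directly from the two accuracy scores reported above. This short sketch converts the difference into percentage points and into a number of reviews; the variable names are chosen for this illustration.

```python
textblob_accuracy = 0.68552  # reported above
vader_accuracy = 0.69236     # reported above
n_reviews = 25000            # rows in the dataset

# difference expressed in percentage points
diff_points = (vader_accuracy - textblob_accuracy) * 100

# how many more reviews VADER labels correctly
extra_correct = round((vader_accuracy - textblob_accuracy) * n_reviews)

print(f"difference: {diff_points:.2f} percentage points")  # difference: 0.68 percentage points
print(f"additional correct reviews: {extra_correct}")      # additional correct reviews: 171
```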

#### Part 2: Prepping Text for a Custom Model

In [20]:
# import modules

import pandas as pd
import unicodedata
import sys
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
# read the dataset into a dataframe
df = pd.read_csv("./datasets/labeledTrainData.tsv", sep="\t")

Convert all text to lowercase letters.

In [22]:
# convert reviews to lower case and store in a new column
df["cleaned_review"] = df["review"].str.lower()

Remove punctuation and special characters from the text.

In [23]:
# build a translation table that maps every punctuation (P*) and symbol (S*) code point to None
punctuation = dict.fromkeys(
    [i for i in range(sys.maxunicode)
     if unicodedata.category(chr(i)).startswith("P") or unicodedata.category(chr(i)).startswith("S")],
    None
)

# this function removes punctuation and symbols from the input text
def remove_punctuation_symbols(text):
    return text.translate(punctuation)

In [24]:
# remove punctuation and symbols from reviews using the above function 
df["cleaned_review"] = df["cleaned_review"].apply(remove_punctuation_symbols)
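
The effect of the translation table is easiest to see on a single string. The sketch below rebuilds the same kind of table and applies it to a made-up sentence (the sample text is hypothetical). Because apostrophes count as punctuation, a contraction such as `it's` becomes `its`.

```python
import sys
import unicodedata

# translation table: every punctuation (P*) or symbol (S*) code point maps to None
punctuation = dict.fromkeys(
    (i for i in range(sys.maxunicode)
     if unicodedata.category(chr(i)).startswith(("P", "S"))),
    None,
)

sample = "hello, world! it's $5 (approx.)"  # made-up example text
cleaned = sample.translate(punctuation)
print(cleaned)  # hello world its 5 approx
```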

In [26]:
# this function removes stopwords from the input text
stop_words = set(stopwords.words("english"))  # a set makes membership checks fast
def remove_stopwords(text):
    tokenized_words = word_tokenize(text)
    return " ".join([word for word in tokenized_words if word not in stop_words])

Remove stop words.

In [27]:
# remove stopwords from reviews using above function
df["cleaned_review"] = df["cleaned_review"].apply(remove_stopwords)
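
The filtering step can be illustrated without NLTK. The sketch below swaps in a tiny hand-written stop-word set and whitespace splitting, so it is a simplified stand-in for `stopwords.words("english")` and `word_tokenize`, not a reproduction of them. The function name `remove_stopwords_simple` is hypothetical. It also shows that a contraction whose apostrophe was already stripped, such as `dont`, no longer matches a stop-word entry like `don't`.

```python
# tiny hand-written stop-word set, for illustration only (not NLTK's list)
stop_words = {"the", "a", "is", "this", "and", "don't"}

def remove_stopwords_simple(text: str) -> str:
    """Drop tokens found in stop_words; tokens are split on whitespace."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords_simple("this is a great film and the cast is superb"))
# great film cast superb
print(remove_stopwords_simple("dont watch this"))
# dont watch
```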

Apply NLTK’s PorterStemmer.

In [28]:
# this function performs stemming on the input text using the NLTK PorterStemmer
def porter_stemmer(text):
    tokenized_words = word_tokenize(text)
    porter = PorterStemmer()
    return " ".join([porter.stem(word) for word in tokenized_words])

In [29]:
# apply stemming on reviews using above function
df["cleaned_review"] = df["cleaned_review"].apply(porter_stemmer)

In [30]:
# view few rows after the above text preparation
df.head()

Unnamed: 0,id,sentiment,review,cleaned_review
0,5814_8,1,With all this stuff going down at the moment w...,stuff go moment mj ive start listen music watc...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",classic war world timothi hine entertain film ...
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,film start manag nichola bell give welcom inve...
3,3630_4,0,It must be assumed that those who praised this...,must assum prais film greatest film opera ever...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,superbl trashi wondrous unpretenti 80 exploit ...


Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [31]:
# Generate bag of words matrix using CountVectorizer
count_vec = CountVectorizer()
bag_of_words = count_vec.fit_transform(df["cleaned_review"])


In [32]:
# dimensions of the generated bag-of-words matrix
bag_of_words.shape

(25000, 92324)

In [33]:
# dimensions of the dataframe
df.shape

(25000, 4)

We see that the number of rows in the bag-of-words matrix is the same as the number of rows in the original dataframe, i.e., $25000$.
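
To show what the rows and columns of a bag-of-words matrix mean, here is a minimal standard-library sketch on three made-up, already-cleaned documents. It illustrates the idea only and is not how `CountVectorizer` is implemented; for example, it keeps every whitespace-separated token, whereas `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter

# toy documents, already lowercased and stemmed (hypothetical examples)
docs = ["great film great cast", "bad film", "great stori bad end"]

# vocabulary: every distinct token, sorted so the column order is stable
vocabulary = sorted({token for doc in docs for token in doc.split()})

# one row per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['bad', 'cast', 'end', 'film', 'great', 'stori']
for row in matrix:
    print(row)
# [0, 1, 0, 1, 2, 0]
# [1, 0, 0, 1, 0, 0]
# [1, 0, 1, 0, 1, 1]
print((len(matrix), len(vocabulary)))  # (3, 6)
```

The shape is (number of documents, vocabulary size), which is why the real matrix has $25000$ rows and one column per distinct stemmed token.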

Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [34]:
# generate term frequency-inverse document frequency (tf-idf) matrix using TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_feature_matrix = tfidf.fit_transform(df["cleaned_review"])

In [35]:
# dimensions of the tf-idf feature matrix
tfidf_feature_matrix.shape

(25000, 92324)

We see that the tf-idf matrix has the same dimensions as the bag-of-words matrix, $(25000, 92324)$, and therefore the same number of rows as the original dataframe, i.e., $25000$.
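
The weighting behind a tf-idf matrix can be sketched by hand on a toy corpus. The code below assumes the smoothed idf formula $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ followed by L2 row normalization, which is my understanding of the `TfidfVectorizer` defaults; treat it as an illustration under that assumption rather than a reproduction of the library. The documents are made up.

```python
import math
from collections import Counter

# toy documents, already lowercased and stemmed (hypothetical examples)
docs = ["great film great cast", "bad film", "great stori bad end"]
vocabulary = sorted({token for doc in docs for token in doc.split()})
n_docs = len(docs)

# document frequency: number of documents that contain each term
doc_freq = {term: sum(term in doc.split() for doc in docs) for term in vocabulary}

# smoothed inverse document frequency (assumed formula, see the note above)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

tfidf_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    row = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in row))
    tfidf_matrix.append([value / norm for value in row])  # L2-normalize each row

print((len(tfidf_matrix), len(vocabulary)))  # (3, 6)
```

The tf-idf matrix keeps the same rows and columns as the word-count matrix and only changes the values in each cell, which is why both matrices share the same shape.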


In [13]:
# count true positives: reviews labeled positive that TextBlob also classified as positive
df[(df["sentiment"] == 1) & (df["textblob_sentiment"] == 1)].shape

(11825, 4)

In [None]:
# count false positives: reviews labeled negative that TextBlob classified as positive
df[(df["sentiment"] == 0) & (df["textblob_sentiment"] == 1)].shape
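
The remaining confusion-matrix cells can be derived from numbers already reported in this notebook. The sketch below assumes that the $11825$ count was computed on the same `textblob_sentiment` predictions that produced the $0.68552$ accuracy; if that assumption does not hold, the derived values do not apply. The variable names are chosen for this illustration.

```python
n_reviews = 25000
n_positive = 12500      # from value_counts() in Part 1
n_negative = 12500      # from value_counts() in Part 1
accuracy = 0.68552      # TextBlob accuracy reported in Part 1
true_positive = 11825   # rows with sentiment == 1 and textblob_sentiment == 1

# correct predictions = true positives + true negatives
n_correct = round(accuracy * n_reviews)

false_negative = n_positive - true_positive
true_negative = n_correct - true_positive
false_positive = n_negative - true_negative

print(f"correct: {n_correct}")                # correct: 17138
print(f"false negatives: {false_negative}")   # false negatives: 675
print(f"true negatives: {true_negative}")     # true negatives: 5313
print(f"false positives: {false_positive}")   # false positives: 7187
```

Under that assumption, most of the errors are negative reviews labeled as positive, which is consistent with a rule that treats every polarity of zero or above as positive.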