Allison Forte

December 13, 2022

DSC 550

Exercise 3.2

# Part 1: Using the TextBlob Sentiment Analyzer

1. Import the movie review data as a data frame and ensure that the data is loaded properly.
2. How many of each positive and negative reviews are there?
3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.
4. Check the accuracy of this model. Is this model better than random guessing?
5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

In [287]:
# 1. Import the movie review data as a data frame and ensure that the data is loaded properly.

import pandas as pd

training_data = pd.read_csv('/Users/allison.forte/Downloads/labeledTrainData.tsv',sep = '\t')

training_data.head(3)  # Print first 3 rows to ensure data is loaded properly.

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...


In [288]:
# 2. How many of each positive and negative reviews are there?

# count then display negative reviews

negs = training_data['review'][training_data['sentiment'] == 0].count()
print('Negative reviews:\n', negs)


# count then display positive reviews

pos = training_data['review'][training_data['sentiment'] == 1].count()
print('\nPositive reviews:\n', pos)

Negative reviews:
 12500

Positive reviews:
 12500


There are 12,500 negative and 12,500 positive reviews in the training data set. 

In [289]:
# pip install textblob

In [290]:
# 3. Use TextBlob to classify each movie review as positive or negative. 
# Assume a polarity score greater than or equal to zero is positive sentiment and less than 0 is negative.

from textblob import TextBlob


tb_pol = []  # Create a list of the actual polarity scores to be added to the dataframe
tb_round = []  # Create a list of rounded polarity scores to be added to the dataframe for easier analysis


for r in training_data['review']:  # Iterate through each review
    analysis = TextBlob(r)
    tb_pol.append(analysis.sentiment.polarity)  # Add the actual polarity score of each review to the list
    
    if analysis.sentiment.polarity >= 0:  # If the review is positive, add '1' to the rounded list
        tb_round.append(1)
    
    else:  # If the review is negative, add '0' to the rounded list
        tb_round.append(0)
    
    
# Add both lists to the dataframe

training_data['tb_pol'] = tb_pol
training_data['tb_round'] = tb_round  # Note, 1 = positive review, 0 = negative review


training_data.head(3)  # Print first 3 rows of dataframe to confirm columns added

Unnamed: 0,id,sentiment,review,tb_pol,tb_round
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0


In [291]:
# 4. Check the accuracy of this model.
# Overall accuracy

correct = 0  # Begin a count of correct classifications


for index, row in training_data.iterrows():  # Iterate through the df
    if row['sentiment'] == row['tb_round']:  # Compare assigned and calculated results
        correct = correct + 1                # Add to correct count when assigned and calculated results match
    
    else:
        continue

        
pct_correct = (correct/(training_data['sentiment'].count()))*100  # Calculate percent correct

print('Overall, TextBlob calculated {}% of the movie reviews accurately.'.format(pct_correct))  # Print


# Accuracy of positive reviews

positive_df = training_data[training_data['sentiment'] == 1]

pos_correct = 0  # Begin a count of correct classifications

for index, row in positive_df.iterrows():    # Iterate through the df
    if row['sentiment'] == row['tb_round']:  # Compare assigned and calculated results
        pos_correct = pos_correct + 1        # Add to correct count when assigned and calculated results match
    
    else:
        continue
      
pct_pos_correct = (pos_correct/pos)*100  # Calculate percent correct

print('TextBlob calculated {:0.2f}% of the positive movie reviews accurately.'.format(pct_pos_correct))  # Print


# Accuracy of negative reviews

negative_df = training_data[training_data['sentiment'] == 0]

neg_correct = 0  # Begin a count of correct classifications

for index, row in negative_df.iterrows():    # Iterate through the df
    if row['sentiment'] == row['tb_round']:  # Compare assigned and calculated results
        neg_correct = neg_correct + 1        # Add to correct count when assigned and calculated results match
    
    else:
        continue
        
pct_neg_correct = (neg_correct/negs)*100  # Calculate percent correct

print('TextBlob calculated {:0.2f}% of the negative movie reviews accurately.'.format(pct_neg_correct))  # Print

Overall, TextBlob calculated 68.524% of the movie reviews accurately.
TextBlob calculated 94.59% of the positive movie reviews accurately.
TextBlob calculated 42.46% of the negative movie reviews accurately.


Is this model better than random guessing?

Overall, this model correctly classified about 68.5% of the reviews. 
Because the data set is balanced (12,500 positive and 12,500 negative reviews), random guessing would be expected to score close to 50%, so TextBlob performs better than chance. 
Reading every review would likely be more accurate still, but with 25,000 reviews that is rarely feasible. Assuming there are too many reviews to assess individually, this model is better than random guessing. 

Looking more closely at the accuracy, TextBlob is far more accurate on positive reviews than on negative ones. The model classified 94.59% of the positive reviews correctly but only 42.46% of the negative reviews, which means it labeled 57.54% of the negative reviews as positive. On negative reviews alone, it did worse than the roughly 50% expected from guessing. 
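
The three loops above repeat the same comparison. As a minimal sketch (the helper name `accuracy_by_class` and the toy labels are illustrative and not part of the assignment), the overall and per-class percentages can be computed in one pass over any pair of label sequences, such as `training_data['sentiment']` and `training_data['tb_round']`:

```python
from collections import Counter


def accuracy_by_class(actual, predicted):
    """Return overall and per-class accuracy (in percent) for two label sequences."""
    totals = Counter()   # Number of reviews seen for each actual label
    correct = Counter()  # Number of correct predictions for each actual label

    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a == p:
            correct[a] += 1

    result = {'overall': 100 * sum(correct.values()) / sum(totals.values())}
    for label in totals:
        result[label] = 100 * correct[label] / totals[label]
    return result


# Toy data (hypothetical): 1 = positive, 0 = negative
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0]

scores = accuracy_by_class(actual, predicted)
print(scores)  # {'overall': 62.5, 1: 75.0, 0: 50.0}
```

Passing a different prediction column to the same helper gives the figures for another analyzer without copying the loops.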

5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER.

In [292]:
# pip install vadersentiment

In [293]:
# 3. Use VADER to classify each movie review as positive or negative. 
# Assume a compound score greater than zero is positive sentiment and less than or equal to 0 is negative.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

v_pol = []  # Create a list of the actual compound scores to be added to the dataframe
v_round = []  # Create a list of rounded compound scores to be added to the dataframe for easier analysis


vanalysis = SentimentIntensityAnalyzer()  # Create the analyzer once and reuse it for every review

for r in training_data['review']:  # Iterate through each review
    result = vanalysis.polarity_scores(r)
    v_pol.append(result['compound'])  # Add the actual compound score of each review to the list
    
    if result['compound'] > 0:  # If the review is positive, add '1' to the rounded list
        v_round.append(1)
    
    else:  # If the review is negative (or the compound score is exactly 0), add '0' to the rounded list
        v_round.append(0)
    
    
# Add both lists to the dataframe

training_data['v_pol'] = v_pol
training_data['v_round'] = v_round  # Note, 1 = positive review, 0 = negative review


training_data.head(3)  # Print first 3 rows of dataframe to confirm columns added

Unnamed: 0,id,sentiment,review,tb_pol,tb_round,v_pol,v_round
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,1,-0.8879,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,1,0.9736,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0,-0.9883,0
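
Note that the two classification cells treat a score of exactly zero differently: the TextBlob cell counts a polarity of 0 as positive, while the VADER cell counts a compound score of 0 as negative. A small sketch, using a hypothetical helper `to_label`, makes that choice explicit so the rule is visible wherever it is applied:

```python
def to_label(score, zero_is_positive=True):
    """Map a sentiment score to 1 (positive) or 0 (negative).

    zero_is_positive controls which class a score of exactly 0 falls into.
    """
    if score > 0:
        return 1
    if score < 0:
        return 0
    return 1 if zero_is_positive else 0


# Two TextBlob polarity scores from the rows shown earlier, plus a neutral 0.0
scores = [0.001277, -0.053941, 0.0]

textblob_rule = [to_label(s, zero_is_positive=True) for s in scores]   # >= 0 is positive
vader_rule = [to_label(s, zero_is_positive=False) for s in scores]     # > 0 is positive

print(textblob_rule)  # [1, 0, 1]
print(vader_rule)     # [1, 0, 0]
```

Only reviews that score exactly zero are affected, so the two rules agree on every other review.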


In [294]:
# 4. Check the accuracy of this model.
# Overall accuracy

vcorrect = 0  # Begin a count of correct classifications


for index, row in training_data.iterrows():  # Iterate through the df
    if row['sentiment'] == row['v_round']:   # Compare assigned and calculated results
        vcorrect = vcorrect + 1              # Add to correct count when assigned and calculated results match
    
    else:
        continue

        
vpct_correct = (vcorrect/(training_data['sentiment'].count()))*100  # Calculate percent correct

print('Overall, Vader calculated {}% of the movie reviews accurately.'.format(vpct_correct))  # Print


# Accuracy of positive reviews

positive_df = training_data[training_data['sentiment'] == 1]

vpos_correct = 0  # Begin a count of correct classifications

for index, row in positive_df.iterrows():    # Iterate through the df
    if row['sentiment'] == row['v_round']:  # Compare assigned and calculated results
        vpos_correct = vpos_correct + 1        # Add to correct count when assigned and calculated results match
    
    else:
        continue
      
vpct_pos_correct = (vpos_correct/pos)*100  # Calculate percent correct

print('Vader calculated {:0.2f}% of the positive movie reviews accurately.'.format(vpct_pos_correct))  # Print


# Accuracy of negative reviews

negative_df = training_data[training_data['sentiment'] == 0]

vneg_correct = 0  # Begin a count of correct classifications

for index, row in negative_df.iterrows():    # Iterate through the df
    if row['sentiment'] == row['v_round']:  # Compare assigned and calculated results
        vneg_correct = vneg_correct + 1        # Add to correct count when assigned and calculated results match
    
    else:
        continue
        
vpct_neg_correct = (vneg_correct/negs)*100  # Calculate percent correct

print('Vader calculated {:0.2f}% of the negative movie reviews accurately.'.format(vpct_neg_correct))  # Print

Overall, Vader calculated 69.42% of the movie reviews accurately.
Vader calculated 85.78% of the positive movie reviews accurately.
Vader calculated 53.06% of the negative movie reviews accurately.


Is this model better than random guessing?

Overall, this model correctly classified 69.42% of the reviews. 
Random guessing on this balanced data set would be expected to score close to 50%. Assuming there are too many reviews to assess individually, this model is better than random guessing. 

Looking more closely at the accuracy, VADER, like TextBlob, is more accurate on positive reviews than on negative reviews, but the gap is smaller than with TextBlob. The model classified 85.78% of the positive reviews and 53.06% of the negative reviews correctly. VADER beat random guessing overall and on each class considered separately, although only narrowly on negative reviews.
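
The 50% figure for random guessing can be checked with a quick simulation. The sketch below assumes a balanced set of 25,000 labels, as counted in Part 1, and guesses each label with a fair coin. The seed and variable names are arbitrary choices.

```python
import random

random.seed(0)  # Fixed seed so the run is repeatable (arbitrary choice)

# Balanced labels matching the counts in the data set: 12,500 of each class
labels = [1] * 12500 + [0] * 12500

# Guess each label with a fair coin flip
guesses = [random.randint(0, 1) for _ in labels]

matches = sum(1 for y, g in zip(labels, guesses) if y == g)
baseline_accuracy = 100 * matches / len(labels)

# Standard error of a fair-coin guesser's accuracy, in percentage points
std_error = 100 * (0.5 * 0.5 / len(labels)) ** 0.5

print('Random-guess accuracy: {:0.2f}%'.format(baseline_accuracy))
print('One standard error: {:0.2f} percentage points'.format(std_error))
```

With a standard error of roughly 0.3 percentage points, overall accuracies of 68.524% and 69.42% are far too high to be explained by chance.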

# Part 2: Prepping Text for a Custom Model
If you want to run your own model to classify text, it needs to be in proper form to do so.

In [295]:
# 1. Convert to lowercase

lower_reviews = []

for r in training_data['review']:  # Iterate through each review
    text = r.lower()
    lower_reviews.append(text)

training_data['review'] = lower_reviews

training_data.head(3)

Unnamed: 0,id,sentiment,review,tb_pol,tb_round,v_pol,v_round
0,5814_8,1,with all this stuff going down at the moment w...,0.001277,1,-0.8879,0
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",0.256349,1,0.9736,1
2,7759_3,0,the film starts with a manager (nicholas bell)...,-0.053941,0,-0.9883,0


In [296]:
# 2. Remove punctuation and special characters from the text

new_reviews = []

for r in training_data['review']:
    new_text = ''
    
    for character in r:
        if character.isalnum() or character == ' ':
            new_text += character
    
    new_reviews.append(new_text)
    
training_data['review'] = new_reviews

training_data.head(3)

Unnamed: 0,id,sentiment,review,tb_pol,tb_round,v_pol,v_round
0,5814_8,1,with all this stuff going down at the moment w...,0.001277,1,-0.8879,0
1,2381_9,1,the classic war of the worlds by timothy hines...,0.256349,1,0.9736,1
2,7759_3,0,the film starts with a manager nicholas bell g...,-0.053941,0,-0.9883,0
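
The lowercase and punctuation steps can also be written with regular expressions. This is a sketch rather than a drop-in replacement: the function name `clean_review` and the `strip_tags` option are hypothetical, the pattern keeps only ASCII letters, digits, and spaces (whereas `isalnum()` also keeps non-ASCII letters), and repeated spaces are collapsed. The tag option assumes the raw reviews may contain HTML such as `<br />`, which the truncated previews above neither confirm nor rule out.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')            # Matches an HTML tag such as <br />
NON_ALNUM_RE = re.compile(r'[^a-z0-9 ]')   # Anything except lowercase letters, digits, spaces


def clean_review(text, strip_tags=False):
    """Lowercase a review and remove punctuation and special characters."""
    text = text.lower()
    if strip_tags:
        text = TAG_RE.sub(' ', text)       # Replace each tag with a space before cleaning
    text = NON_ALNUM_RE.sub('', text)
    return ' '.join(text.split())          # Collapse runs of whitespace


sample = "Hello, World! <br />It's 2022."

print(clean_review(sample))                   # hello world br its 2022
print(clean_review(sample, strip_tags=True))  # hello world its 2022
```

Without tag stripping, the tag name survives as the stray token `br`, which would then flow into the stop word, stemming, and vectorizing steps.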


In [297]:
# 3. Remove stop words

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # Uncomment on first use if the stop word corpus is not installed

stop_words = set(stopwords.words('english'))  # A set makes each membership test fast

training_data['review'] = training_data['review'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

training_data.head(3)

Unnamed: 0,id,sentiment,review,tb_pol,tb_round,v_pol,v_round
0,5814_8,1,stuff going moment mj ive started listening mu...,0.001277,1,-0.8879,0
1,2381_9,1,classic war worlds timothy hines entertaining ...,0.256349,1,0.9736,1
2,7759_3,0,film starts manager nicholas bell giving welco...,-0.053941,0,-0.9883,0
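
The filtering step itself is a set lookup per token. The sketch below uses a hypothetical mini stop list in place of `stopwords.words('english')` so that it runs without the NLTK corpus. The `keep` argument is also an assumption: stop word lists often include negations such as "not", and keeping them may matter for sentiment.

```python
def remove_stop_words(text, stop_words, keep=()):
    """Drop every whitespace-separated token found in stop_words, except those in keep."""
    blocked = set(stop_words) - set(keep)
    return ' '.join(word for word in text.split() if word not in blocked)


# Hypothetical mini stop list; the notebook uses stopwords.words('english')
demo_stop_words = {'the', 'with', 'a', 'was', 'not', 'this', 'all'}

review = 'the film was not a success'

print(remove_stop_words(review, demo_stop_words))                # film success
print(remove_stop_words(review, demo_stop_words, keep={'not'}))  # film not success
```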


In [298]:
# 4. Apply NLTK’s PorterStemmer

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_reviews = []

for r in training_data['review']:
    words = r.split()
    stems = [stemmer.stem(word) for word in words]
    stemmed_review = ' '.join(stems)
    stemmed_reviews.append(stemmed_review)
    
training_data['review'] = stemmed_reviews

training_data.head(3)

Unnamed: 0,id,sentiment,review,tb_pol,tb_round,v_pol,v_round
0,5814_8,1,stuff go moment mj ive start listen music watc...,0.001277,1,-0.8879,0
1,2381_9,1,classic war world timothi hine entertain film ...,0.256349,1,0.9736,1
2,7759_3,0,film start manag nichola bell give welcom inve...,-0.053941,0,-0.9883,0


In [299]:
# 5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count 
# vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). 
# Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same 
# as the number of rows in your original data frame.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 

# create bag of words matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(training_data['review'])


# create dataframe (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
BoW_matrix = pd.DataFrame(bag_of_words.toarray(), columns = count.get_feature_names_out())


# Display the dimensions of your bag-of-words matrix
print(BoW_matrix.shape)

(25000, 92508)
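
The shape also shows why converting the matrix with `toarray()` is expensive. As a rough estimate, assuming 8 bytes per entry (CountVectorizer's default dtype is a 64-bit integer), the dense array needs about 18.5 GB:

```python
rows, cols = 25000, 92508   # Shape printed above
bytes_per_entry = 8         # Assumes 64-bit integer counts

entries = rows * cols
dense_bytes = entries * bytes_per_entry

print('{:,} entries'.format(entries))                    # 2,312,700,000 entries
print('{:.1f} GB when dense'.format(dense_bytes / 1e9))  # 18.5 GB when dense
```

Printing `bag_of_words.shape` directly gives the same dimensions without building the dense DataFrame.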


In [300]:
# 6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your 
# movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of 
# your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.


# Create the matrix
tf_idf = TfidfVectorizer()
feature_matrix = tf_idf.fit_transform(training_data['review'])


# Display the dimensions of your tf-idf matrix
print(feature_matrix.shape)

(25000, 92508)


# Additional Comments
The bag-of-words and tf-idf matrices returned by fit_transform are stored as sparse matrices because most entries are zero.
Each row in the bag-of-words/tf-idf matrices corresponds to a movie review.
The columns in the bag-of-words/tf-idf matrices correspond to unique words appearing in the movie reviews.
Entries in the bag-of-words matrix are the number of times a word appears in a review.
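Entries in the tf-idf matrix are weights representing a word's importance in a review: they increase with the word's count in that review and decrease for words that appear in many reviews.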
The bag-of-words/tf-idf matrices are possible feature (input) matrices for model building.
We will revisit this preprocessed text data to build a custom model in the future.
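
To make the difference between the two kinds of entries concrete, here is a small pure-Python sketch on a hypothetical two-review corpus. It assumes the smoothed formula that scikit-learn documents as the TfidfVectorizer default, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. The function names are illustrative, and tokens are split on whitespace only, which is simpler than the vectorizers' own tokenization.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one word-count row per document."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def tf_idf(rows):
    """Weight counts by smoothed idf, then scale each row to unit L2 length."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: the number of documents each term appears in
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    weighted = []
    for row in rows:
        raw = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # Avoid dividing by zero
        weighted.append([v / norm for v in raw])
    return weighted


docs = ['good film', 'bad film']  # Hypothetical stemmed reviews
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

print(vocab)                              # ['bad', 'film', 'good']
print(counts)                             # [[0, 1, 1], [1, 1, 0]]
print([round(w, 3) for w in weights[0]])  # [0.0, 0.58, 0.815]
```

Both matrices have one row per review and one column per vocabulary word. The shared word "film" receives a lower tf-idf weight than "good", which appears in only one review.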