# DSC 550 

### Week 3:  Exercise 3.2

Author: Kimberly Cable<br>
Date: 6-25-2022

## Exercise 3.2: Sentiment Analysis and Preprocessing Text

In [1]:
# import libraries
import pandas as pd

### Part 1: Using the TextBlob Sentiment Analyzer

**1. Import the movie review data as a data frame and ensure that the data is loaded properly**

In [2]:
# read bag of words meets bags of popcorn training data
path_reviews_training = r'labeledTrainData.tsv.zip'
reviews_training = pd.read_csv(path_reviews_training, delimiter  = '\t')
reviews_training.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [3]:
reviews_training.shape

(25000, 3)

**2. How many of each positive and negative reviews are there?**

In [4]:
reviews_training['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

There are 12,500 positive and 12,500 negative reviews, so the two classes are perfectly balanced.

**3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.**

In [5]:
# import library
from textblob import TextBlob

In [6]:
def sentiment_analysis_textblob(reviews):
    def getSubjectivity(text):
        return TextBlob(text).sentiment.subjectivity
    
    def getPolarity(text):
        return TextBlob(text).sentiment.polarity
    
    reviews['TextBlob_Subjectivity'] = reviews['review'].apply(getSubjectivity)
    reviews['TextBlob_Polarity'] = reviews['review'].apply(getPolarity)
    
    def getAnalysis(score):
        if score < 0:
            return 0
        else:
            return 1
        
    reviews['TextBlob_Analysis'] = reviews['TextBlob_Polarity'].apply(getAnalysis)

sentiment_analysis_textblob(reviews_training)
reviews_training.head()

| | id | sentiment | review | TextBlob_Subjectivity | TextBlob_Polarity | TextBlob_Analysis |
|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.606746 | 0.001277 | 1 |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.531111 | 0.256349 | 1 |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | 0.562933 | -0.053941 | 0 |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.492901 | 0.134753 | 1 |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | 0.459818 | -0.024842 | 0 |


**4. Check the accuracy of this model. Is this model better than random guessing?**

In [7]:
correct = 0
total = 0

for index, row in reviews_training.iterrows():
    # count reviews where the predicted label matches the true label
    if row['TextBlob_Analysis'] == row['sentiment']:
        correct += 1
    total += 1

accuracy = (correct / total) * 100.0

print(f"Accuracy is {accuracy}%")

Accuracy is 68.524%


In [10]:
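# Another look: distribution of TextBlob's predicted labels.
# Note: value_counts() tallies predictions (most frequent label first),
# not correct predictions, so these counts are not an accuracy measure.
pred_most_common, pred_least_common = reviews_training['TextBlob_Analysis'].value_counts()

print(pred_most_common)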

19017


The accuracy of the model is about 68.5%. Because the data set is balanced (12,500 positive and 12,500 negative reviews), random guessing would be right about 50% of the time, so the model is better than random guessing.
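The row-by-row loop can also be written as a single vectorized comparison. The following is a minimal sketch on a made-up five-row frame (the `toy` DataFrame and its values are assumptions for illustration, not the real data); it applies the same rule as `getAnalysis` and takes the mean of the element-wise matches.

```python
import pandas as pd

# Toy stand-in for reviews_training; the values are invented for illustration.
toy = pd.DataFrame({
    "sentiment": [1, 1, 0, 0, 1],
    "TextBlob_Polarity": [0.00, 0.26, -0.05, 0.13, -0.02],
})

# Same rule as getAnalysis: polarity >= 0 -> 1 (positive), otherwise 0 (negative).
toy["TextBlob_Analysis"] = (toy["TextBlob_Polarity"] >= 0).astype(int)

# Comparing the two columns yields booleans; their mean is the share of matches.
accuracy = (toy["TextBlob_Analysis"] == toy["sentiment"]).mean() * 100

print(f"Accuracy is {accuracy:.1f}%")
```

On the full data frame the same expression should agree with the loop, while avoiding the overhead of `iterrows()`.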

**5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).**

In [11]:
# import libraries
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [12]:
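# create the analyzer once and reuse it for every review
sid_obj = SentimentIntensityAnalyzer()

def sentiment_scores(review):
    # compound score < 0 is negative (0); otherwise positive (1)
    sentiment_dict = sid_obj.polarity_scores(review)

    if sentiment_dict['compound'] < 0:
        return 0
    else:
        return 1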

reviews_training['Vader_Analysis'] = reviews_training['review'].apply(sentiment_scores)

reviews_training.head()

| | id | sentiment | review | TextBlob_Subjectivity | TextBlob_Polarity | TextBlob_Analysis | Vader_Analysis |
|---|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.606746 | 0.001277 | 1 | 0 |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.531111 | 0.256349 | 1 | 1 |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | 0.562933 | -0.053941 | 0 | 0 |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.492901 | 0.134753 | 1 | 0 |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | 0.459818 | -0.024842 | 0 | 1 |


In [13]:
# Check the accuracy of Vader
correct = 0
total = 0

for index, row in reviews_training.iterrows():
    # count reviews where the VADER label matches the true label
    if row['Vader_Analysis'] == row['sentiment']:
        correct += 1
    total += 1

accuracy = (correct / total) * 100.0

print(f"Accuracy is {accuracy}%")

Accuracy is 69.404%


VADER's accuracy (about 69.4%) is slightly higher than TextBlob's (about 68.5%).
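A single accuracy figure does not show which class each analyzer gets wrong. As a rough sketch, the (actual, predicted) pairs can be tallied into confusion counts. The lists below copy only the five rows displayed by `head()` earlier, so the counts illustrate the technique and say nothing about the full data set; the helper name `confusion` is an assumption.

```python
from collections import Counter

# Labels copied from the five rows shown by head(); not the full data set.
sentiment = [1, 1, 0, 0, 1]
textblob_pred = [1, 1, 0, 1, 0]
vader_pred = [0, 1, 0, 0, 1]

def confusion(actual, predicted):
    """Count each (actual, predicted) pair, e.g. (0, 1) is a false positive."""
    return Counter(zip(actual, predicted))

textblob_counts = confusion(sentiment, textblob_pred)
vader_counts = confusion(sentiment, vader_pred)

for name, counts in [("TextBlob", textblob_counts), ("VADER", vader_counts)]:
    print(name, "false positives:", counts[(0, 1)], "false negatives:", counts[(1, 0)])
```

Running the same tally over all 25,000 rows would show whether the two analyzers lean toward false positives or false negatives.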

In [15]:
reviews_training.shape

(25000, 7)

### Part 2: Prepping Text for a Custom Model

To train your own text classifier, the text first needs to be converted into a consistent, model-ready form. The following steps outline one procedure for doing this on the movie review text.
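As a rough preview of steps 1 to 3 below, here is a self-contained sketch that chains them in one function. It uses only the standard library; the stop list `TOY_STOP_WORDS` and the function name `preprocess` are assumptions for illustration (the notebook itself uses NLTK's tokenizer and English stop list), and stemming is left out.

```python
import re
import string

# Hypothetical mini stop list for illustration; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "a", "is", "and", "this", "of"}

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def preprocess(text, stop_words=TOY_STOP_WORDS):
    text = text.lower()                      # step 1: lowercase
    text = re.sub(r"<[^<>]*>", " ", text)    # replace HTML tags with a space
    text = text.translate(STRIP_PUNCT)       # step 2: remove punctuation
    tokens = text.split()                    # whitespace split (a simplification)
    return [t for t in tokens if t not in stop_words]  # step 3: drop stop words

print(preprocess("This is a GREAT film!<br />The plot, however, drags."))
# → ['great', 'film', 'plot', 'however', 'drags']
```

Replacing HTML tags with a space rather than an empty string keeps the words on either side of a `<br />` from being glued together.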

**1. Convert all text to lowercase letters.**

In [16]:
reviews_training['review'] = reviews_training['review'].str.lower()
reviews_training.shape

(25000, 7)

**2. Remove punctuation and special characters from the text.**

In [17]:
# import libraries
import string
import re

In [18]:
# remove html from text
reviews_training['review_cleaned'] = reviews_training['review'].str.replace(r'<[^<>]*>', ' ', regex = True)

In [19]:
# remove all punctuation from the cleaned review (updates the review_cleaned column in place)
reviews_training['review_cleaned'] = reviews_training['review_cleaned'].str.translate(str.maketrans('', '', string.punctuation))
reviews_training.shape

(25000, 8)

**3. Remove stop words.**

In [20]:
# import libraries
from nltk.tokenize import word_tokenize

In [21]:
# tokenize text into new column
reviews_training['tokenized_review'] = reviews_training['review_cleaned'].apply(word_tokenize)
reviews_training.shape

(25000, 9)

In [22]:
# import libraries
import nltk

from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Osgiliath\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
# load stop words
stop_words = stopwords.words('english')

In [25]:
# View first 5 rows
reviews_training.head()

Unnamed: 0,id,sentiment,review,TextBlob_Subjectivity,TextBlob_Polarity,TextBlob_Analysis,Vader_Analysis,review_cleaned,tokenized_review
0,5814_8,1,with all this stuff going down at the moment w...,0.606746,0.001277,1,0,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,..."
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",0.531111,0.256349,1,1,the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot..."
2,7759_3,0,the film starts with a manager (nicholas bell)...,0.562933,-0.053941,0,0,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas..."
3,3630_4,0,it must be assumed that those who praised this...,0.492901,0.134753,1,0,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai..."
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,0.459818,-0.024842,0,1,superbly trashy and wondrously unpretentious 8...,"[superbly, trashy, and, wondrously, unpretenti..."


In [27]:
# Remove stop words
reviews_training['review_no_stopwords'] = reviews_training['tokenized_review'].apply(
    lambda x: ' '.join([item for item in x if item not in stop_words]))

reviews_training.shape

(25000, 10)
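One side effect of the ordering is worth noting: because punctuation is stripped in step 2, contractions lose their apostrophes before the stop-word filter runs. Stop-list entries that contain an apostrophe (NLTK's English list includes entries such as "don't") can then no longer match. The sketch below shows the effect with a hypothetical three-word stop list; the names and the sample sentence are assumptions for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
strip_punct = str.maketrans("", "", string.punctuation)

# Hypothetical stop list; it stores the contraction with its apostrophe.
toy_stop_words = {"don't", "the", "i"}

text = "i don't like the ending"

# Order used in this notebook: strip punctuation, then filter stop words.
cleaned = text.translate(strip_punct)
kept = [t for t in cleaned.split() if t not in toy_stop_words]

# Alternative order: filter stop words while the apostrophe is still present.
kept_if_filtered_first = [t for t in text.split() if t not in toy_stop_words]

print(kept)                    # → ['dont', 'like', 'ending']
print(kept_if_filtered_first)  # → ['like', 'ending']
```

Whether this matters depends on the downstream model; filtering before punctuation removal, or adding apostrophe-free variants to the stop list, are two possible adjustments.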

In [28]:
# View first 5 rows
reviews_training.head()

Unnamed: 0,id,sentiment,review,TextBlob_Subjectivity,TextBlob_Polarity,TextBlob_Analysis,Vader_Analysis,review_cleaned,tokenized_review,review_no_stopwords
0,5814_8,1,with all this stuff going down at the moment w...,0.606746,0.001277,1,0,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,...",stuff going moment mj ive started listening mu...
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",0.531111,0.256349,1,1,the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot...",classic war worlds timothy hines entertaining ...
2,7759_3,0,the film starts with a manager (nicholas bell)...,0.562933,-0.053941,0,0,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas...",film starts manager nicholas bell giving welco...
3,3630_4,0,it must be assumed that those who praised this...,0.492901,0.134753,1,0,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai...",must assumed praised film greatest filmed oper...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,0.459818,-0.024842,0,1,superbly trashy and wondrously unpretentious 8...,"[superbly, trashy, and, wondrously, unpretenti...",superbly trashy wondrously unpretentious 80s e...


**4. Apply NLTK’s PorterStemmer.**

In [29]:
# import libraries
from nltk.stem.porter import PorterStemmer

In [30]:
# create stemmer
porter = PorterStemmer()

In [31]:
# Apply stemmer
def stem_review(words):
    tokens = words.split()
    stemmed_tokens = [porter.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)
    
reviews_training['review_stemmer'] = reviews_training['review_no_stopwords'].apply(stem_review)

In [34]:
print(reviews_training.shape)
reviews_training.head()

(25000, 11)


Unnamed: 0,id,sentiment,review,TextBlob_Subjectivity,TextBlob_Polarity,TextBlob_Analysis,Vader_Analysis,review_cleaned,tokenized_review,review_no_stopwords,review_stemmer
0,5814_8,1,with all this stuff going down at the moment w...,0.606746,0.001277,1,0,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,...",stuff going moment mj ive started listening mu...,stuff go moment mj ive start listen music watc...
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",0.531111,0.256349,1,1,the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot...",classic war worlds timothy hines entertaining ...,classic war world timothi hine entertain film ...
2,7759_3,0,the film starts with a manager (nicholas bell)...,0.562933,-0.053941,0,0,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas...",film starts manager nicholas bell giving welco...,film start manag nichola bell give welcom inve...
3,3630_4,0,it must be assumed that those who praised this...,0.492901,0.134753,1,0,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai...",must assumed praised film greatest filmed oper...,must assum prais film greatest film opera ever...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,0.459818,-0.024842,0,1,superbly trashy and wondrously unpretentious 8...,"[superbly, trashy, and, wondrously, unpretenti...",superbly trashy wondrously unpretentious 80s e...,superbl trashi wondrous unpretenti 80 exploit ...


**5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.**

In [35]:
# import libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [38]:
# Create bag of words feature matrix
count = CountVectorizer()
    
bag_of_words = count.fit_transform(reviews_training['review_stemmer'])
bag_of_words

<25000x82861 sparse matrix of type '<class 'numpy.int64'>'
	with 2406125 stored elements in Compressed Sparse Row format>

In [39]:
bag_of_words.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
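The sparse matrix summary above reports 25,000 × 82,861 cells with 2,406,125 stored values. A quick back-of-the-envelope calculation shows why the sparse format matters and why `toarray()` is expensive on the full matrix. This is a sketch; the 8 bytes per cell assumes the `int64` dtype shown in the output.

```python
# Figures taken from the sparse matrix summary above.
rows, cols = 25_000, 82_861
stored = 2_406_125

total_cells = rows * cols            # every cell a dense array would hold
density = stored / total_cells       # fraction of cells that are non-zero
avg_terms_per_review = stored / rows # distinct vocabulary terms per review
dense_bytes = total_cells * 8        # assumes 8 bytes per int64 cell

print(f"total cells: {total_cells:,}")
print(f"density: {density:.2%}")
print(f"distinct terms per review (average): {avg_terms_per_review:.1f}")
print(f"dense array size: {dense_bytes / 1e9:.1f} GB")
```

Roughly one cell in a thousand is non-zero, so the dense array spends almost all of its memory storing zeros.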

**6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.**

In [40]:
# import libraries
from sklearn.feature_extraction.text import TfidfVectorizer

In [41]:
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
    
feature_matrix = tfidf.fit_transform(reviews_training['review_stemmer'])
feature_matrix

<25000x82861 sparse matrix of type '<class 'numpy.float64'>'
	with 2406125 stored elements in Compressed Sparse Row format>

In [42]:
feature_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [43]:
# show the vocabulary: a dict mapping each term to its column index
tfidf.vocabulary_

{'stuff': 70136,
 'go': 30203,
 'moment': 47596,
 'mj': 47392,
 'ive': 37800,
 'start': 69122,
 'listen': 42558,
 'music': 49071,
 'watch': 79305,
 'odd': 51904,
 'documentari': 20906,
 'wiz': 80901,
 'moonwalk': 47882,
 'mayb': 45395,
 'want': 79081,
 'get': 29613,
 'certain': 13406,
 'insight': 36853,
 'guy': 31741,
 'thought': 73308,
 'realli': 59775,
 'cool': 16526,
 'eighti': 22946,
 'make': 44244,
 'mind': 46896,
 'whether': 80211,
 'guilti': 31559,
 'innoc': 36788,
 'part': 54371,
 'biographi': 9183,
 'featur': 25972,
 'film': 26442,
 'rememb': 60469,
 'see': 64191,
 'cinema': 14602,
 'origin': 52991,
 'releas': 60372,
 'subtl': 70393,
 'messag': 46315,
 'feel': 26019,
 'toward': 74586,
 'press': 57499,
 'also': 4170,
 'obviou': 51812,
 'drug': 21904,
 'bad': 7095,
 'mkay': 47397,
 'visual': 78567,
 'impress': 36214,
 'cours': 16986,
 'michael': 46511,
 'jackson': 37881,
 'unless': 76878,
 'remot': 60501,
 'like': 42283,
 'anyway': 5306,
 'hate': 32802,
 'find': 26738,
 'bore': 

**Additional Comments**

* The bag-of-words and tf-idf matrices are stored as sparse matrices because most entries are zero.
* Each row in the bag-of-words/tf-idf matrices corresponds to a movie review.
* The columns in the bag-of-words/tf-idf matrices correspond to the unique stemmed tokens in the vocabulary (82,861 here).
* Entries in the bag-of-words matrix are the number of times a word appears in a review.
* Entries in the tf-idf matrix are weights that grow with a word's frequency in a review and shrink for words that appear in many reviews.
* The bag-of-words/tf-idf matrices are possible feature (input) matrices for model building.
* We will revisit this preprocessed text data to build a custom model in the future.
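To make the difference between the two matrices concrete, the sketch below builds both by hand for three made-up "stemmed reviews". It is a standard-library approximation, not scikit-learn itself: it assumes the documented `TfidfVectorizer` defaults (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization) and splits on whitespace rather than using scikit-learn's token pattern. The corpus and all names are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the stemmed reviews (invented for illustration).
docs = ["good film", "bad film", "good good plot"]
tokenized = [d.split() for d in docs]

# Vocabulary: one column per unique token, in sorted order.
vocab = sorted({t for doc in tokenized for t in doc})  # ['bad', 'film', 'good', 'plot']

# Bag of words: raw counts of each vocabulary term per document.
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents containing each term.
n_docs = len(docs)
df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency (assumed scikit-learn default formula).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of all zeros are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# tf-idf: count times idf, then L2-normalize each row.
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in bow]

print(vocab)
for counts, weights in zip(bow, tfidf):
    print(counts, [round(w, 3) for w in weights])
```

In the second toy review, "bad" appears in only one document and "film" in two, so "bad" receives the larger tf-idf weight even though both have a count of 1.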