# DSC 550

### Week 3: Exercise 3.2

Author: Kimberly Cable<br>
Date: 6-25-2022

## Exercise 3.2: Sentiment Analysis and Preprocessing Text

In [1]:
# import libraries
import pandas as pd

### Part 1: Using the TextBlob Sentiment Analyzer

**1. Import the movie review data as a data frame and ensure that the data is loaded properly**

In [2]:
# read bag of words meets bags of popcorn training data
path_reviews_training = r'labeledTrainData.tsv.zip'
reviews_training = pd.read_csv(path_reviews_training, delimiter  = '\t')
reviews_training.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [3]:
reviews_training.shape

(25000, 3)

**2. How many of each positive and negative reviews are there?**

In [4]:
reviews_training['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

There are 12,500 positive (`1`) and 12,500 negative (`0`) reviews, so the two classes are perfectly balanced.

**3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.**

In [5]:
# import library
from textblob import TextBlob

In [6]:
def sentiment_analysis_textblob(reviews):
    # Adds TextBlob subjectivity, polarity, and a 0/1 label to the data frame in place
    def getSubjectivity(text):
        return TextBlob(text).sentiment.subjectivity
    
    def getPolarity(text):
        return TextBlob(text).sentiment.polarity
    
    reviews['TextBlob_Subjectivity'] = reviews['review'].apply(getSubjectivity)
    reviews['TextBlob_Polarity'] = reviews['review'].apply(getPolarity)
    
    # Polarity below 0 is negative (0); zero or above is positive (1)
    def getAnalysis(score):
        if score < 0:
            return 0
        else:
            return 1
        
    reviews['TextBlob_Analysis'] = reviews['TextBlob_Polarity'].apply(getAnalysis)

sentiment_analysis_textblob(reviews_training)
reviews_training.head()

|   | id | sentiment | review | TextBlob_Subjectivity | TextBlob_Polarity | TextBlob_Analysis |
|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.606746 | 0.001277 | 1 |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.531111 | 0.256349 | 1 |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | 0.562933 | -0.053941 | 0 |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.492901 | 0.134753 | 1 |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | 0.459818 | -0.024842 | 0 |


**4. Check the accuracy of this model. Is this model better than random guessing?**

In [7]:
correct = 0
total = 0

# Count the reviews where the TextBlob label matches the true sentiment
for index, row in reviews_training.iterrows():
    if row['TextBlob_Analysis'] == row['sentiment']:
        correct += 1
    total += 1
    
accuracy = (correct / total) * 100.0

print(f"Accuracy is {accuracy}%")

Accuracy is 68.524%


The model's accuracy is about 68.5%. Because the data set is evenly split between positive and negative reviews, random guessing would be right about 50% of the time, so the TextBlob model does better than chance.
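The row-by-row loop above can also be written as a single vectorized comparison. The following is a minimal sketch, not the notebook's own method: the `toy` data frame is a stand-in that copies only the five rows shown in the `head()` output, so its accuracy does not reflect the full 25,000 reviews.

```python
import pandas as pd

# Stand-in for reviews_training: the five rows visible in the head() output above
toy = pd.DataFrame({
    "sentiment":         [1, 1, 0, 0, 1],
    "TextBlob_Analysis": [1, 1, 0, 1, 0],
})

# The comparison yields a boolean Series; its mean is the fraction of matches
accuracy = (toy["TextBlob_Analysis"] == toy["sentiment"]).mean() * 100.0

print(f"Accuracy is {accuracy:.1f}%")  # Accuracy is 60.0%
```

On the full data frame the same expression would replace the loop and avoid the overhead of `iterrows()`.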

**5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).**

In [8]:
# import libraries
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [9]:
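# Build the analyzer once and reuse it; constructing it per review reloads the lexicon each time
sid_obj = SentimentIntensityAnalyzer()

def sentiment_scores(review):
    # The compound score is VADER's normalized overall polarity, between -1 and 1
    sentiment_dict = sid_obj.polarity_scores(review)
        
    # Same rule as for TextBlob: below 0 is negative (0), otherwise positive (1)
    if sentiment_dict['compound'] < 0:
        return 0
    else:
        return 1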

reviews_training['Vader_Analysis'] = reviews_training['review'].apply(sentiment_scores)

reviews_training.head()

|   | id | sentiment | review | TextBlob_Subjectivity | TextBlob_Polarity | TextBlob_Analysis | Vader_Analysis |
|---|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.606746 | 0.001277 | 1 | 0 |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.531111 | 0.256349 | 1 | 1 |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | 0.562933 | -0.053941 | 0 | 0 |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.492901 | 0.134753 | 1 | 0 |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | 0.459818 | -0.024842 | 0 | 1 |


In [10]:
# Check the accuracy of Vader
correct = 0
total = 0

# Count the reviews where the Vader label matches the true sentiment
for index, row in reviews_training.iterrows():
    if row['Vader_Analysis'] == row['sentiment']:
        correct += 1
    total += 1
    
accuracy = (correct / total) * 100.0

print(f"Accuracy is {accuracy}%")

Accuracy is 69.404%


VADER's accuracy (about 69.4%) is slightly higher than TextBlob's (about 68.5%), a difference of less than one percentage point.
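Overall accuracy does not show whether an analyzer's mistakes fall mostly on positive or on negative reviews. One way to look at that is a cross-tabulation of true labels against predicted labels. The sketch below is an illustration only: `toy` copies the five rows from the `head()` output, and the `tables` dictionary is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Stand-in for reviews_training: the five rows visible in the head() output above
toy = pd.DataFrame({
    "sentiment":         [1, 1, 0, 0, 1],
    "TextBlob_Analysis": [1, 1, 0, 1, 0],
    "Vader_Analysis":    [0, 1, 0, 0, 1],
})

# One table per analyzer: rows are the true label, columns the predicted label
tables = {
    column: pd.crosstab(toy["sentiment"], toy[column])
    for column in ["TextBlob_Analysis", "Vader_Analysis"]
}

for column, table in tables.items():
    print(table, end="\n\n")
```

The diagonal cells count correct predictions; the off-diagonal cells separate negatives labeled positive from positives labeled negative.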

### Part 2: Prepping Text for a Custom Model

If you want to train your own model to classify text, the text first needs to be converted into a form the model can use. The following steps outline a procedure for doing this on the movie review text.
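As a rough preview, the first three steps can be sketched on a single string with the standard library alone. This is a simplified illustration under stated assumptions: `preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is deliberately tiny (the notebook uses NLTK's English list), tokens are split on whitespace rather than with `word_tokenize`, and stemming is left out.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set used only for this illustration
TOY_STOP_WORDS = {"the", "a", "is", "and", "of", "this", "it"}

def preprocess(text):
    """Lowercase, strip HTML tags and punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^<>]*>", " ", text)  # replace HTML tags such as <br /> with a space
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [token for token in tokens if token not in TOY_STOP_WORDS]

print(preprocess("This is GREAT!<br /><br />The plot, though, is thin."))
# ['great', 'plot', 'though', 'thin']
```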

**1. Convert all text to lowercase letters.**

In [11]:
reviews_training['review'] = reviews_training['review'].str.lower()
reviews_training['review'][0]

"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mj's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />the actual feature film bit when it finally sta

**2. Remove punctuation and special characters from the text.**

In [12]:
# import libraries
import string
import re

In [13]:
# remove html from text
reviews_training['review_cleaned'] = reviews_training['review'].str.replace(r'<[^<>]*>', ' ', regex = True)
reviews_training['review_cleaned'][0]

"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mj's feeling towards the press and also the obvious message of drugs are bad m'kay.  visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.  the actual feature film bit when it finally starts is only on for 2

In [14]:
# remove all punctuation from the cleaned review (overwrites the review_cleaned column)
reviews_training['review_cleaned'] = reviews_training['review_cleaned'].str.translate(str.maketrans('', '', string.punctuation))
reviews_training['review_cleaned'][0]

'with all this stuff going down at the moment with mj ive started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mjs feeling towards the press and also the obvious message of drugs are bad mkay  visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him  the actual feature film bit when it finally starts is only on for 20 minutes or

**3. Remove stop words.**

In [15]:
# import libraries
from nltk.tokenize import word_tokenize

In [16]:
# tokenize text into new column
reviews_training['tokenized_review'] = reviews_training['review_cleaned'].apply(word_tokenize)
reviews_training['tokenized_review'][0]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'ive',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again',
 'maybe',
 'i',
 'just',
 'want',
 'to',
 'get',
 'a',
 'certain',
 'insight',
 'into',
 'this',
 'guy',
 'who',
 'i',
 'thought',
 'was',
 'really',
 'cool',
 'in',
 'the',
 'eighties',
 'just',
 'to',
 'maybe',
 'make',
 'up',
 'my',
 'mind',
 'whether',
 'he',
 'is',
 'guilty',
 'or',
 'innocent',
 'moonwalker',
 'is',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'which',
 'i',
 'remember',
 'going',
 'to',
 'see',
 'at',
 'the',
 'cinema',
 'when',
 'it',
 'was',
 'originally',
 'released',
 'some',
 'of',
 'it',
 'has',
 'subtle',
 'messages',
 'about',
 'mjs',
 'feeling',
 'towards',
 'the',
 'press',
 'and',
 'also',
 'the',
 'obvious',
 'message',
 'of',
 'drugs',
 'are',
 

In [17]:
# import libraries
import nltk

from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Osgiliath\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
# load stop words as a set so membership tests are fast
stop_words = set(stopwords.words('english'))

In [19]:
# Remove stop words
reviews_training['review_no_stopwords'] = reviews_training['tokenized_review'].apply(
    lambda x: [item for item in x if item not in stop_words])

reviews_training['review_no_stopwords'][0]

['stuff',
 'going',
 'moment',
 'mj',
 'ive',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'wiz',
 'watched',
 'moonwalker',
 'maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'really',
 'cool',
 'eighties',
 'maybe',
 'make',
 'mind',
 'whether',
 'guilty',
 'innocent',
 'moonwalker',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'remember',
 'going',
 'see',
 'cinema',
 'originally',
 'released',
 'subtle',
 'messages',
 'mjs',
 'feeling',
 'towards',
 'press',
 'also',
 'obvious',
 'message',
 'drugs',
 'bad',
 'mkay',
 'visually',
 'impressive',
 'course',
 'michael',
 'jackson',
 'unless',
 'remotely',
 'like',
 'mj',
 'anyway',
 'going',
 'hate',
 'find',
 'boring',
 'may',
 'call',
 'mj',
 'egotist',
 'consenting',
 'making',
 'movie',
 'mj',
 'fans',
 'would',
 'say',
 'made',
 'fans',
 'true',
 'really',
 'nice',
 'actual',
 'feature',
 'film',
 'bit',
 'finally',
 'starts',
 '20',
 'minutes',
 'excluding',


**4. Apply NLTK’s PorterStemmer.**

In [20]:
# import libraries
from nltk.stem.porter import PorterStemmer

In [21]:
# create stemmer
porter = PorterStemmer()

In [22]:
# Apply stemmer
def stem_review(words):
    return [porter.stem(word) for word in words]
    
reviews_training['review_stemmer'] = reviews_training['review_no_stopwords'].apply(stem_review)

In [23]:
len(reviews_training['review_stemmer'][0])

222

**5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.**

In [24]:
# import libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
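# Create bag of words feature matrix
# Note: apply() calls fit_transform once per review, and each token in the list is
# treated as its own document, so every review gets a separate
# (tokens x per-review vocabulary) matrix rather than one row of a corpus-wide matrix
def bag_review(words):
    return count.fit_transform(words)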

count = CountVectorizer()
    
reviews_training['review_bags'] = reviews_training['review_stemmer'].apply(bag_review)

reviews_training['review_bags'][0].toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [26]:
reviews_training['review_bags'][0].shape

(222, 158)

In [27]:
# vocabulary from the most recent fit, i.e. the last review processed by apply()
count.get_feature_names_out()

array(['age', 'alon', 'anim', 'beauti', 'broke', 'broken', 'cant',
       'catharsi', 'child', 'chirin', 'choic', 'cri', 'cruelti', 'degrad',
       'depth', 'doesnt', 'eat', 'encount', 'end', 'even', 'favourit',
       'feel', 'follow', 'glad', 'great', 'grew', 'hard', 'heart', 'ill',
       'im', 'intricaci', 'ive', 'kill', 'lamb', 'life', 'like', 'loneli',
       'made', 'mani', 'mayb', 'mean', 'mother', 'mountain', 'movi',
       'need', 'never', 'night', 'one', 'other', 'rare', 'repair', 'sad',
       'saw', 'say', 'seen', 'someday', 'stori', 'tape', 'though',
       'tomorrow', 'undon', 'unfair', 'unfinish', 'unusu', 'use', 'watch',
       'wolf'], dtype=object)

**6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.**

In [28]:
# import libraries
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
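# Create the tf-idf feature matrix
# Note: apply() calls fit_transform once per review, and each token in the list is
# treated as its own document, so every review gets a separate
# (tokens x per-review vocabulary) matrix rather than one row of a corpus-wide matrix
def tfidf_review(words):
    return tfidf.fit_transform(words)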

tfidf = TfidfVectorizer()
    
reviews_training['review_tfidf'] = reviews_training['review_stemmer'].apply(tfidf_review)

reviews_training['review_tfidf'][0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [30]:
reviews_training['review_tfidf'][0].shape

(222, 158)

In [31]:
# show the term-to-column-index mapping from the most recent fit (the last review processed)
tfidf.vocabulary_

{'saw': 52,
 'movi': 43,
 'child': 8,
 'broke': 4,
 'heart': 27,
 'stori': 56,
 'unfinish': 62,
 'end': 18,
 'grew': 25,
 'mani': 38,
 'great': 24,
 'anim': 2,
 'one': 47,
 'favourit': 20,
 'unusu': 63,
 'unfair': 61,
 'cruelti': 12,
 'loneli': 36,
 'life': 34,
 'choic': 10,
 'cant': 6,
 'undon': 60,
 'need': 44,
 'other': 48,
 'chirin': 9,
 'made': 37,
 'alon': 1,
 'wolf': 66,
 'kill': 32,
 'mother': 41,
 'follow': 22,
 'mountain': 42,
 'doesnt': 15,
 'lamb': 33,
 'even': 19,
 'though': 58,
 'night': 46,
 'say': 53,
 'mayb': 39,
 'ill': 28,
 'eat': 16,
 'tomorrow': 59,
 'tape': 57,
 'broken': 5,
 'degrad': 13,
 'age': 0,
 'use': 64,
 'repair': 50,
 'watch': 65,
 'someday': 55,
 'cri': 11,
 'hard': 26,
 'like': 35,
 'depth': 14,
 'feel': 21,
 'intricaci': 30,
 'mean': 40,
 'rare': 49,
 'sad': 51,
 'ive': 31,
 'never': 45,
 'encount': 17,
 'catharsi': 7,
 'beauti': 3,
 'glad': 23,
 'seen': 54,
 'im': 29}
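The vocabulary above belongs to a single review because the vectorizer is refitted for every row. A corpus-level tf-idf matrix can be sketched by joining each token list into one string and fitting `TfidfVectorizer` once over all documents. The token lists below are made up for illustration, and `documents`, `bag_of_words`, and `tfidf_matrix` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up stand-in for reviews_training['review_stemmer']: one token list per review
stemmed_reviews = [
    ["movi", "great", "great", "stori"],
    ["movi", "bore"],
    ["stori", "sad", "end"],
]
documents = [" ".join(tokens) for tokens in stemmed_reviews]

# Both vectorizers tokenize the same way by default, so the shapes should agree
bag_of_words = CountVectorizer().fit_transform(documents)
tfidf_matrix = TfidfVectorizer().fit_transform(documents)

print(bag_of_words.shape, tfidf_matrix.shape)  # (3, 6) (3, 6)
```

Fitting both vectorizers on the same documents is what makes the two sets of dimensions match, as the exercise expects.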

**Additional Comments**

* The bag-of-words and tf-idf matrices are stored as sparse matrices because most entries are zero.
* Each row in the bag-of-words/tf-idf matrices corresponds to a movie review.
* The columns in the bag-of-words/tf-idf matrices correspond to unique words appearing in the movie reviews.
* Entries in the bag-of-words matrix are the number of times a word appears in a review.
* Entries in the tf-idf matrix are numbers representing the word importance in a review.
* The bag-of-words/tf-idf matrices are possible feature (input) matrices for model building.
* We will revisit this preprocessed text data to build a custom model in the future.
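To make the sparse-storage point concrete, the sketch below stores a small count matrix in SciPy's compressed sparse row format and reports how many cells are actually non-zero. The counts are hypothetical; a real review matrix would be far larger and far sparser.

```python
import numpy as np
from scipy import sparse

# Hypothetical word counts: 3 reviews (rows) by 6 vocabulary words (columns)
dense = np.array([
    [0, 0, 2, 1, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1],
])

# CSR format stores only the non-zero entries and their positions
matrix = sparse.csr_matrix(dense)
density = matrix.nnz / (matrix.shape[0] * matrix.shape[1])

print(f"{matrix.nnz} stored values out of {dense.size} cells (density {density:.2f})")
# 8 stored values out of 18 cells (density 0.44)
```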