# DSC 550

### Week 3: Exercise 3.2

### Author: Era Ebhodaghe

## Part 1: Using the TextBlob Sentiment Analyzer

### Step 1: Import the movie review data as a data frame and ensure that the data is loaded properly

In [219]:
# importing libraries
import pandas as pd

In [220]:
# read the "Bag of Words Meets Bags of Popcorn" labeled training data (tab-separated)
training_data = r'/Users/valuedcustomer/Downloads/labeledTrainData.tsv'
reviews_training_data = pd.read_csv(training_data, delimiter='\t')
reviews_training_data.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [221]:
reviews_training_data.shape

(25000, 3)
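
A few quick checks can confirm that a load like this worked as intended. The sketch below is only an illustration: it reads two made-up rows from an in-memory string (the `sample_tsv` text and its contents are assumptions, not part of the real file) and checks the column names, the label values, and the absence of missing values.

```python
import io
import pandas as pd

# Two made-up rows in the same tab-separated layout as the training file
sample_tsv = "id\tsentiment\treview\n1_8\t1\tGreat fun.\n2_3\t0\tDull and slow.\n"
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Basic sanity checks: expected columns, binary labels, no missing values
columns_ok = list(sample.columns) == ["id", "sentiment", "review"]
labels_ok = bool(sample["sentiment"].isin([0, 1]).all())
no_missing = not bool(sample.isna().any().any())

print(sample.shape, columns_ok, labels_ok, no_missing)
```

The same three checks could be run on `reviews_training_data` directly.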

### Step 2: How many of each positive and negative reviews are there?

In [223]:
reviews_training_data['sentiment'].value_counts()

sentiment
1    12500
0    12500
Name: count, dtype: int64

#### There are 12,500 positive and 12,500 negative reviews, so the two classes are perfectly balanced.

### Step 3: Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment

!pip install textblob

In [227]:
# import library
from textblob import TextBlob

In [228]:
def sentiment_analysis_textblob(reviews):
    """Add TextBlob subjectivity, polarity, and a 0/1 label column to reviews in place."""
    def getSubjectivity(text):
        return TextBlob(text).sentiment.subjectivity

    def getPolarity(text):
        return TextBlob(text).sentiment.polarity

    reviews['TextBlob_Subjectivity'] = reviews['review'].apply(getSubjectivity)
    reviews['TextBlob_Polarity'] = reviews['review'].apply(getPolarity)

    def getAnalysis(score):
        # polarity below 0 is negative (0); polarity of 0 or more is positive (1)
        return 0 if score < 0 else 1

    reviews['TextBlob_Analysis'] = reviews['TextBlob_Polarity'].apply(getAnalysis)

sentiment_analysis_textblob(reviews_training_data)
reviews_training_data.head()

Unnamed: 0,id,sentiment,review,TextBlob_Subjectivity,TextBlob_Polarity,TextBlob_Analysis
0,5814_8,1,With all this stuff going down at the moment w...,0.606746,0.001277,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.531111,0.256349,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,0.562933,-0.053941,0
3,3630_4,0,It must be assumed that those who praised this...,0.492901,0.134753,1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,0.459818,-0.024842,0
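
The labeling rule from this step can also be written as a single vectorized comparison. The sketch below applies it to the five polarity scores shown in the output above; the `polarity` and `labels` names are used here only for illustration.

```python
import pandas as pd

# Polarity scores copied from the first five rows shown above
polarity = pd.Series([0.001277, 0.256349, -0.053941, 0.134753, -0.024842])

# True where polarity >= 0, then cast to 1 (positive) or 0 (negative)
labels = (polarity >= 0).astype(int)

print(labels.tolist())
```

The resulting labels match the `TextBlob_Analysis` column for those five rows.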


### Step 4: Check the accuracy of this model. Is this model better than random guessing?

In [230]:
# count correct predictions and total reviews
correct = 0
total = 0

# compare each TextBlob label with the true sentiment label
for index, row in reviews_training_data.iterrows():
    if row['TextBlob_Analysis'] == row['sentiment']:
        correct += 1
    total += 1

accuracy = (correct / total) * 100.0

print(f"The accuracy is {accuracy}%")

The accuracy is 68.524%


#### With an accuracy of about 68.5%, this model is better than random guessing, which would be right about 50% of the time on this balanced data set.
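
The row-by-row loop can be replaced with a vectorized comparison, which is usually much faster on 25,000 rows. This is a minimal sketch on a made-up five-row frame; `toy` and `accuracy_percent` are hypothetical names, and the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for reviews_training_data (values are illustrative only)
toy = pd.DataFrame({
    "sentiment":         [1, 1, 0, 0, 1],
    "TextBlob_Analysis": [1, 1, 0, 1, 0],
})

def accuracy_percent(frame, predicted_col, actual_col="sentiment"):
    """Return the share of rows where the prediction equals the label, as a percentage."""
    return (frame[predicted_col] == frame[actual_col]).mean() * 100.0

toy_accuracy = accuracy_percent(toy, "TextBlob_Analysis")
print(f"The accuracy is {toy_accuracy}%")
```

Because the column name is a parameter, the same helper could score any other label column against `sentiment`.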

### Step 5: For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4)

In [233]:
!pip install vaderSentiment



In [234]:
# import libraries
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [235]:
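# create the analyzer once instead of rebuilding it for every review
sid_obj = SentimentIntensityAnalyzer()

def sentiment_scores(review):
    # a compound score below 0 is negative (0); 0 or more is positive (1)
    sentiment_dict = sid_obj.polarity_scores(review)
    return 0 if sentiment_dict['compound'] < 0 else 1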

reviews_training_data['Vader_Analysis'] = reviews_training_data['review'].apply(sentiment_scores)

reviews_training_data.head(3)

Unnamed: 0,id,sentiment,review,TextBlob_Subjectivity,TextBlob_Polarity,TextBlob_Analysis,Vader_Analysis
0,5814_8,1,With all this stuff going down at the moment w...,0.606746,0.001277,1,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.531111,0.256349,1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,0.562933,-0.053941,0,0


In [236]:
# checking the accuracy of VADER
correct = 0
total = 0

# compare each VADER label with the true sentiment label
for index, row in reviews_training_data.iterrows():
    if row['Vader_Analysis'] == row['sentiment']:
        correct += 1
    total += 1

accuracy = (correct / total) * 100.0

print(f"The accuracy is {accuracy}%")

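The accuracy is 69.404%

#### With an accuracy of about 69.4%, VADER is slightly more accurate than TextBlob (68.5%) and is also better than random guessing (50%).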


## Part 2: Prepping Text for a Custom Model

### Step 1: Convert all text to lowercase letters

In [239]:
reviews_training_data['review'] = reviews_training_data['review'].str.lower()
reviews_training_data['review'][0]

"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mj's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />the actual feature film bit when it finally sta

### Step 2: Remove punctuation and special characters from the text

In [241]:
# import libraries
import string
import re

In [242]:
# replace HTML tags such as <br /> with a space and store the result in a new column
reviews_training_data['review_clean'] = reviews_training_data['review'].str.replace(r'<[^<>]*>', ' ', regex=True)
reviews_training_data['review_clean'][0]

"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mj's feeling towards the press and also the obvious message of drugs are bad m'kay.  visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.  the actual feature film bit when it finally starts is only on for 2

In [243]:

# remove all punctuation, updating the review_clean column
reviews_training_data['review_clean'] = reviews_training_data['review_clean'].str.translate(str.maketrans('', '', string.punctuation))
reviews_training_data['review_clean'][0]

'with all this stuff going down at the moment with mj ive started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mjs feeling towards the press and also the obvious message of drugs are bad mkay  visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him  the actual feature film bit when it finally starts is only on for 20 minutes or
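
The lowercase, tag-removal, and punctuation-removal steps can be combined into one helper that works on a single string. This is a sketch using only the standard library; `clean_review`, `TAG_PATTERN`, and `PUNCT_TABLE` are hypothetical names, and the sample sentence is adapted from the first review.

```python
import re
import string

# Same tag pattern and punctuation table as in the cells above
TAG_PATTERN = re.compile(r'<[^<>]*>')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_review(text):
    """Lowercase, replace HTML tags with spaces, then drop ASCII punctuation."""
    lowered = text.lower()
    without_tags = TAG_PATTERN.sub(' ', lowered)
    return without_tags.translate(PUNCT_TABLE)

sample = "Drugs are bad m'kay.<br /><br />Visually impressive!"
cleaned = clean_review(sample)
print(cleaned)
```

As in the notebook output, each removed tag leaves a space behind, so two adjacent `<br />` tags produce a double space.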

### Step 3: Remove stop words

In [245]:
# import libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# download the stop word lists
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/valuedcustomer/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [246]:

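# work around SSL certificate errors when downloading NLTK data
# (note: this disables certificate verification for HTTPS downloads)
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# download the Punkt tokenizer models used by word_tokenize
nltk.download('punkt')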

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/valuedcustomer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [247]:
# tokenize text into new column
reviews_training_data['tokenized_review'] = reviews_training_data['review_clean'].apply(word_tokenize)
reviews_training_data['tokenized_review'][0]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'ive',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again',
 'maybe',
 'i',
 'just',
 'want',
 'to',
 'get',
 'a',
 'certain',
 'insight',
 'into',
 'this',
 'guy',
 'who',
 'i',
 'thought',
 'was',
 'really',
 'cool',
 'in',
 'the',
 'eighties',
 'just',
 'to',
 'maybe',
 'make',
 'up',
 'my',
 'mind',
 'whether',
 'he',
 'is',
 'guilty',
 'or',
 'innocent',
 'moonwalker',
 'is',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'which',
 'i',
 'remember',
 'going',
 'to',
 'see',
 'at',
 'the',
 'cinema',
 'when',
 'it',
 'was',
 'originally',
 'released',
 'some',
 'of',
 'it',
 'has',
 'subtle',
 'messages',
 'about',
 'mjs',
 'feeling',
 'towards',
 'the',
 'press',
 'and',
 'also',
 'the',
 'obvious',
 'message',
 'of',
 'drugs',
 'are',
 

In [248]:
# loading stop words into a set for fast membership checks
stop_words = set(stopwords.words('english'))

In [249]:
# removing stop words
reviews_training_data['review_no_stopwords'] = reviews_training_data['tokenized_review'].apply(
    lambda x: [item for item in x if item not in stop_words])

reviews_training_data['review_no_stopwords'][0]

['stuff',
 'going',
 'moment',
 'mj',
 'ive',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'wiz',
 'watched',
 'moonwalker',
 'maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'really',
 'cool',
 'eighties',
 'maybe',
 'make',
 'mind',
 'whether',
 'guilty',
 'innocent',
 'moonwalker',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'remember',
 'going',
 'see',
 'cinema',
 'originally',
 'released',
 'subtle',
 'messages',
 'mjs',
 'feeling',
 'towards',
 'press',
 'also',
 'obvious',
 'message',
 'drugs',
 'bad',
 'mkay',
 'visually',
 'impressive',
 'course',
 'michael',
 'jackson',
 'unless',
 'remotely',
 'like',
 'mj',
 'anyway',
 'going',
 'hate',
 'find',
 'boring',
 'may',
 'call',
 'mj',
 'egotist',
 'consenting',
 'making',
 'movie',
 'mj',
 'fans',
 'would',
 'say',
 'made',
 'fans',
 'true',
 'really',
 'nice',
 'actual',
 'feature',
 'film',
 'bit',
 'finally',
 'starts',
 '20',
 'minutes',
 'excluding',
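
The filtering step itself is a simple list comprehension over each token list. The sketch below shows it on the first few tokens of the first review, using a tiny made-up stop list (`STOP_WORDS` is an assumption for illustration; the notebook uses NLTK's full English list).

```python
# Hypothetical mini stop list; the notebook uses NLTK's full English list instead
STOP_WORDS = frozenset({"with", "all", "this", "down", "at", "the"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving order and duplicates."""
    return [token for token in tokens if token not in stop_words]

tokens = ["with", "all", "this", "stuff", "going", "down", "at", "the", "moment"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Holding the stop words in a set (or frozenset) keeps each membership check fast, which matters when the filter runs over every token of 25,000 reviews.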


### Step 4: Apply NLTK's PorterStemmer

In [251]:
# import libraries
from nltk.stem.porter import PorterStemmer

In [252]:
# create stemmer
porter = PorterStemmer()

In [253]:
# Apply stemmer
def stem_review(words):
    return [porter.stem(word) for word in words]
    
reviews_training_data['reviews_stemmer'] = reviews_training_data['review_no_stopwords'].apply(stem_review)

In [254]:
len(reviews_training_data['reviews_stemmer'][0])

222

### Step 5: Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [256]:
# import libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [257]:

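# Create a bag-of-words matrix for each review.
# Note: each review is a list of stemmed tokens, so CountVectorizer treats every
# token as a separate document and is refit on every review. This produces one
# small matrix per review rather than a single matrix with one row per review.
def bag_review(words):
    return count.fit_transform(words)

count = CountVectorizer()

reviews_training_data['review_bags'] = reviews_training_data['reviews_stemmer'].apply(bag_review)

reviews_training_data['review_bags'][0].toarray()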

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [258]:
# shape for the first review only: one row per stemmed token, one column per vocabulary term
reviews_training_data['review_bags'][0].shape

(222, 158)

In [259]:
# vocabulary from the most recent fit, i.e. the last review processed, not the first
count.get_feature_names_out()

array(['age', 'alon', 'anim', 'beauti', 'broke', 'broken', 'cant',
       'catharsi', 'child', 'chirin', 'choic', 'cri', 'cruelti', 'degrad',
       'depth', 'doesnt', 'eat', 'encount', 'end', 'even', 'favourit',
       'feel', 'follow', 'glad', 'great', 'grew', 'hard', 'heart', 'ill',
       'im', 'intricaci', 'ive', 'kill', 'lamb', 'life', 'like', 'loneli',
       'made', 'mani', 'mayb', 'mean', 'mother', 'mountain', 'movi',
       'need', 'never', 'night', 'one', 'other', 'rare', 'repair', 'sad',
       'saw', 'say', 'seen', 'someday', 'stori', 'tape', 'though',
       'tomorrow', 'undon', 'unfair', 'unfinish', 'unusu', 'use', 'watch',
       'wolf'], dtype=object)
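
The cells above fit the vectorizer separately on each review's token list, so the matrix shown has one row per token of a single review. To get the matrix the step asks for, with one row per review, one common approach is to join each token list back into a string and fit the vectorizer once on the whole column. The sketch below illustrates this on three made-up reviews built from stems that appear in the output above; `stemmed_reviews`, `documents`, and `bag_of_words` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stemmed reviews standing in for reviews_training_data['reviews_stemmer']
stemmed_reviews = [
    ["saw", "movi", "child"],
    ["stori", "sad", "movi", "movi"],
    ["watch", "lamb"],
]

# Join each token list into one string so that each review is one document
documents = [" ".join(tokens) for tokens in stemmed_reviews]

# Fit once on the whole corpus: one row per review, one column per term
count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(documents)

print(bag_of_words.shape)
```

On the real data the same pattern would be applied to `reviews_training_data['reviews_stemmer']`, and the first dimension of the result should then equal the number of rows in the data frame.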

### Step 6: Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [261]:
# importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer

In [262]:
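# Create a tf-idf matrix for each review.
# Note: as with the bag-of-words step, each stemmed token is treated as a separate
# document and the vectorizer is refit on every review, so this produces one small
# matrix per review rather than a single matrix with one row per review.
def tfidf_review(words):
    return tfidf.fit_transform(words)

tfidf = TfidfVectorizer()

reviews_training_data['review_tfidf'] = reviews_training_data['reviews_stemmer'].apply(tfidf_review)

reviews_training_data['review_tfidf'][0].toarray()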


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [263]:
# shape for the first review only: one row per stemmed token, one column per vocabulary term
reviews_training_data['review_tfidf'][0].shape

(222, 158)

In [264]:
# term-to-column-index mapping from the most recent fit (the last review processed)
tfidf.vocabulary_

{'saw': 52,
 'movi': 43,
 'child': 8,
 'broke': 4,
 'heart': 27,
 'stori': 56,
 'unfinish': 62,
 'end': 18,
 'grew': 25,
 'mani': 38,
 'great': 24,
 'anim': 2,
 'one': 47,
 'favourit': 20,
 'unusu': 63,
 'unfair': 61,
 'cruelti': 12,
 'loneli': 36,
 'life': 34,
 'choic': 10,
 'cant': 6,
 'undon': 60,
 'need': 44,
 'other': 48,
 'chirin': 9,
 'made': 37,
 'alon': 1,
 'wolf': 66,
 'kill': 32,
 'mother': 41,
 'follow': 22,
 'mountain': 42,
 'doesnt': 15,
 'lamb': 33,
 'even': 19,
 'though': 58,
 'night': 46,
 'say': 53,
 'mayb': 39,
 'ill': 28,
 'eat': 16,
 'tomorrow': 59,
 'tape': 57,
 'broken': 5,
 'degrad': 13,
 'age': 0,
 'use': 64,
 'repair': 50,
 'watch': 65,
 'someday': 55,
 'cri': 11,
 'hard': 26,
 'like': 35,
 'depth': 14,
 'feel': 21,
 'intricaci': 30,
 'mean': 40,
 'rare': 49,
 'sad': 51,
 'ive': 31,
 'never': 45,
 'encount': 17,
 'catharsi': 7,
 'beauti': 3,
 'glad': 23,
 'seen': 54,
 'im': 29}
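
A corpus-level tf-idf matrix, with one row per review, can be built the same way as a corpus-level bag-of-words matrix: join each token list into a string and fit `TfidfVectorizer` once on all reviews. The sketch below uses three made-up reviews built from stems that appear in the output above and checks that the tf-idf and count matrices have the same dimensions; all variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stemmed reviews standing in for reviews_training_data['reviews_stemmer']
stemmed_reviews = [
    ["saw", "movi", "child"],
    ["stori", "sad", "movi", "movi"],
    ["watch", "lamb"],
]

# Join each token list into one string so that each review is one document
documents = [" ".join(tokens) for tokens in stemmed_reviews]

# Fit each vectorizer once on the whole corpus
tfidf_matrix = TfidfVectorizer().fit_transform(documents)
count_matrix = CountVectorizer().fit_transform(documents)

print(tfidf_matrix.shape)
print(tfidf_matrix.shape == count_matrix.shape)
```

Both vectorizers use the same default tokenization, so they build the same vocabulary and the two matrices differ only in their cell values.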