# Week 3 Exercises

Name: Kesav Adithya Venkidusamy <br>
Course: DSC550 - Data Mining <br>
Instructor: Brett Werner <br>

### Part 1: Using the TextBlob Sentiment Analyzer

#### 1. Import the movie review data as a data frame and ensure that the data is loaded properly.

In [1]:
## Installing the TextBlob library required for this exercise

! pip install -U textblob

Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.17.1


In [2]:
## Importing the libraries required for this assignment

import pandas as pd
import numpy as np
from textblob import TextBlob

In [3]:
## Import the tsv file having movie review data into dataframe

movie_df = pd.read_csv('labeledTrainData.tsv',sep = '\t')

In [4]:
## Display few records in dataframe using head command

movie_df.head(10)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
5,8196_8,1,I dont know why people think this is such a ba...
6,7166_2,0,"This movie could have been very good, but come..."
7,10633_1,0,I watched this video at a friend's house. I'm ...
8,319_1,0,"A friend of mine bought this film for £1, and ..."
9,8713_10,1,<br /><br />This movie is full of references. ...


In [5]:
## Calculating total number of rows and columns using shape command

print("Total number of rows and columns: {}".format(movie_df.shape))

Total number of rows and columns: (25000, 3)


#### 2. How many of each positive and negative reviews are there?

In [6]:
## Calculating the total number of positive and negative feedbacks

movie_df.groupby('sentiment').count()

sentiment,id,review
0,12500,12500
1,12500,12500


The result above shows that the dataset is evenly split: 12,500 positive and 12,500 negative reviews.

#### 3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [8]:
## Creating functions to return polarity and subjectivity

## Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

## Create a function to get the polarity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

In [9]:
## Compute the TextBlob polarity of each movie review (polarity ranges from -1 to 1); the positive/negative label is derived from it in a later cell

movie_df['polarity'] = movie_df['review'].apply(getPolarity)
movie_df.head(5)

Unnamed: 0,id,sentiment,review,polarity
0,5814_8,1,With all this stuff going down at the moment w...,0.001277
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941
3,3630_4,0,It must be assumed that those who praised this...,0.134753
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842


In [10]:
## Add subjectivity field to the dataframe

movie_df['subjectivity'] = movie_df['review'].apply(getSubjectivity)
movie_df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818


In [11]:
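## Create score column based on the polarity column: 1 (positive) if polarity > 0, else 0 (negative)
## Note: a polarity of exactly 0 is labeled 0 here, whereas the rule stated in the question (>= 0) would label it positive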

movie_df['score'] = movie_df['polarity'].apply(lambda x: 1 if x > 0 else 0)
movie_df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,score
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0
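
The labeling rule can be illustrated on its own, without TextBlob. The sketch below is a minimal illustration, not part of the original notebook: `label_from_polarity` and its `zero_is_positive` flag are hypothetical names, and the polarity values are the five shown in the output above. It shows that the `>= 0` rule from the question and a strict `> 0` rule only disagree when the polarity is exactly zero.

```python
def label_from_polarity(polarity, zero_is_positive=True):
    """Map a polarity score in [-1, 1] to 1 (positive) or 0 (negative).

    Hypothetical helper for illustration; zero_is_positive=True follows
    the '>= 0 is positive' rule stated in the question.
    """
    if zero_is_positive:
        return 1 if polarity >= 0 else 0
    return 1 if polarity > 0 else 0


# Polarity values of the first five reviews shown above
polarities = [0.001277, 0.256349, -0.053941, 0.134753, -0.024842]
labels = [label_from_polarity(p) for p in polarities]
print(labels)  # [1, 1, 0, 1, 0]

# The two rules differ only for a polarity of exactly zero
print(label_from_polarity(0.0, zero_is_positive=True))   # 1
print(label_from_polarity(0.0, zero_is_positive=False))  # 0
```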


#### 4. Check the accuracy of this model. Is this model better than random guessing?

In [13]:
## Adding new field called "accuracy" to check if sentiment is equal to the score

movie_df['accuracy'] = np.where(movie_df['score'] == movie_df['sentiment'], 1, 0)
movie_df.head()

Unnamed: 0,id,sentiment,review,polarity,subjectivity,score,accuracy
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0


In [14]:
## Calculate the percentage of accuracy

movie_df['accuracy'].mean()

0.68528

The dataset is balanced (12,500 positive and 12,500 negative reviews), so random guessing would be correct about 50% of the time. The TextBlob-based classifier agrees with the labeled sentiment for about 68.5% of the reviews (accuracy 0.68528).

So, the accuracy of this model is <b> better compared to random guessing.</b>
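
The comparison with random guessing can be made explicit. The following is a small sketch in plain Python rather than the notebook's pandas code; `accuracy` and `majority_baseline` are hypothetical helper names, and the five label pairs are taken from the first rows displayed above. The baseline is computed as the share of the most frequent class, which is 50% for a balanced dataset like this one.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# 'sentiment' and 'score' values of the first five rows shown above
sentiment = [1, 1, 0, 0, 1]
score = [1, 1, 0, 1, 0]
print(accuracy(sentiment, score))  # 0.6

# A balanced label set, as in the full dataset (12,500 of each class)
balanced_labels = [0] * 12500 + [1] * 12500
print(majority_baseline(balanced_labels))  # 0.5
```

Any classifier whose accuracy is clearly above the baseline value does better than guessing; on the full dataset that comparison is 0.68528 against 0.5.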

#### 5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

In [20]:
## Installing VADER library

!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [21]:
## Import VADER sentiment library for analysis

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [22]:
## Creating a function to calculate the scores using SentimentIntensityAnalyzer
## Create a SentimentIntensityAnalyzer object.

sid_obj = SentimentIntensityAnalyzer()

def sentiment_scores(sentence):
     return sid_obj.polarity_scores(sentence)['compound']

In [23]:
# Create a new column for the VADER compound sentiment score (ranges from -1 to 1)

movie_df['vadar_compound'] = movie_df['review'].apply(sentiment_scores)

In [24]:
## Display few records using head command

movie_df.head()

Unnamed: 0,id,sentiment,review,polarity,subjectivity,score,accuracy,vadar_compound
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1,-0.8879
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1,0.9736
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1,-0.9883
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0,-0.1202
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0,0.6115


In [25]:
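## Adding a field called "vadar_sentiment" based on vadar_compound and displaying records using head command
## compound <= -0.05 is labeled 0 (negative); everything else, including the neutral band between -0.05 and 0.05, is labeled 1 (positive)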

movie_df['vadar_sentiment'] = np.where(movie_df['vadar_compound']>= 0.05, 1,
                              np.where(movie_df['vadar_compound'] <= -0.05, 0, 1))

movie_df.head()

Unnamed: 0,id,sentiment,review,polarity,subjectivity,score,accuracy,vadar_compound,vadar_sentiment
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1,-0.8879,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1,0.9736,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1,-0.9883,0
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0,-0.1202,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0,0.6115,1


In [28]:
## Calculating the accuracy: 1 if the VADER label matches the labeled sentiment, else 0

movie_df['accuracy_vadar'] = np.where(movie_df['vadar_sentiment'] == movie_df['sentiment'], 1, 0)
movie_df.head()

Unnamed: 0,id,sentiment,review,polarity,subjectivity,score,accuracy,vadar_compound,vadar_sentiment,accuracy_vadar
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1,-0.8879,0,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1,0.9736,1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1,-0.9883,0,1
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0,-0.1202,0,1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0,0.6115,1,1


In [29]:
## Calculating the overall accuracy of the VADER-based labels

movie_df['accuracy_vadar'].mean()

0.69224

VADER reaches an accuracy of about 69% (0.69224), slightly higher than the 68.5% obtained with TextBlob. So, the accuracy of this model is also <b>better than random guessing</b>, which would be correct about 50% of the time on this balanced dataset.
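
The mapping from compound score to label can be sketched without the VADER library. In the example below, `vader_label` and `vader_binary` are hypothetical helper names; the ±0.05 cutoffs are the ones used in the notebook's cell, and the compound values are the five shown in the output above. The three-way function makes the neutral band visible, while the binary function folds neutral into the positive class as the notebook does.

```python
def vader_label(compound, threshold=0.05):
    """Three-way label using the +/-0.05 cutoffs from the notebook."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_binary(compound, threshold=0.05):
    """Binary label: neutral is folded into the positive class (1)."""
    return 0 if compound <= -threshold else 1


# Compound scores of the first five reviews shown above
compounds = [-0.8879, 0.9736, -0.9883, -0.1202, 0.6115]
binary_labels = [vader_binary(c) for c in compounds]
print(binary_labels)  # [0, 1, 0, 0, 1]

# A score inside the neutral band is still counted as positive
print(vader_label(0.01), vader_binary(0.01))  # neutral 1
```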

### Part 2: Prepping Text for a Custom Model

#### If you want to run your own model to classify text, it needs to be in proper form to do so. The following steps will outline a procedure to do this on the movie reviews text.

In [32]:
## Import the tsv file having movie review data into dataframe

movie_custom_df = pd.read_csv('labeledTrainData.tsv',sep = '\t')

#### 1. Convert all text to lowercase letters.

#### 2. Remove punctuation and special characters from the text.

In [39]:
## Importing the lib required for this exercise

import re

In [107]:
## Creating a function for cleaning the text
## This function lowercases the input sentence, removes punctuation and special characters, and collapses repeated whitespace

def text_clean(sentence):
    
    sentence = sentence.lower() ## Convert the text to lowercase
    sentence = re.sub(r'[^\w\s]+|_', '', sentence) ## Remove punctuation, special characters, and underscores
    sentence = re.sub(r'\s+', ' ', sentence) ## Collapse any run of whitespace into a single space
    
    return sentence

In [108]:
## Cleaning the text in the "review" column (lowercase, punctuation removed) and printing a few values using head command

movie_custom_df['review_clean'] = movie_custom_df['review'].apply(text_clean)
movie_custom_df.head()

Unnamed: 0,id,sentiment,review,review_clean
0,5814_8,1,with all this stuff going down at the moment w...,with all this stuff going down at the moment w...
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",the classic war of the worlds by timothy hines...
2,7759_3,0,the film starts with a manager (nicholas bell)...,the film starts with a manager nicholas bell g...
3,3630_4,0,it must be assumed that those who praised this...,it must be assumed that those who praised this...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...
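
The effect of each cleaning step is easier to see on a short string. The sketch below is illustrative only: `clean_text` is a hypothetical name, the sample sentence is made up, and the function assumes that collapsing whitespace with `\s+` and trimming the ends is acceptable. It also shows why a pattern that matches exactly two whitespace characters can leave a double space behind.

```python
import re


def clean_text(sentence):
    """Lowercase, strip punctuation/underscores, collapse whitespace."""
    sentence = sentence.lower()
    # [^\w\s] matches anything that is not a word character or whitespace;
    # the underscore counts as a word character, so it is removed separately
    sentence = re.sub(r'[^\w\s]+|_', '', sentence)
    # \s+ collapses a run of whitespace of any length into one space
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence.strip()


sample = "The film, starring (Nicholas Bell), was  GREAT!!"
cleaned = clean_text(sample)
print(cleaned)  # the film starring nicholas bell was great

# Matching exactly two whitespace characters leaves a double space
# when the run has three
pairs_only = re.sub(r'\s\s', ' ', "a   b")
print(repr(pairs_only))  # 'a  b'
print(repr(re.sub(r'\s+', ' ', "a   b")))  # 'a b'
```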


#### 3. Remove stop words

In [110]:
## import nltk module and download stop words

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KesavAdithya\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [116]:
## Download the punkt tokenizer data and import word_tokenize
## Create a function called tokenize that splits a cleaned review into word tokens
## Create a new column by applying the tokenize function to review_clean

nltk.download('punkt')
from nltk.tokenize import word_tokenize

def tokenize(text):
    
    text = word_tokenize(text)
    return text
    

movie_custom_df['review_token'] = movie_custom_df['review_clean'].apply(tokenize)
movie_custom_df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\KesavAdithya\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


Unnamed: 0,id,sentiment,review,review_clean,review_stop,review_token
0,5814_8,1,with all this stuff going down at the moment w...,with all this stuff going down at the moment w...,stuff going moment mj ive started listening mu...,"[with, all, this, stuff, going, down, at, the,..."
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",the classic war of the worlds by timothy hines...,classic war worlds timothy hines entertaining ...,"[the, classic, war, of, the, worlds, by, timot..."
2,7759_3,0,the film starts with a manager (nicholas bell)...,the film starts with a manager nicholas bell g...,film starts manager nicholas bell giving welco...,"[the, film, starts, with, a, manager, nicholas..."
3,3630_4,0,it must be assumed that those who praised this...,it must be assumed that those who praised this...,must assumed praised film greatest filmed oper...,"[it, must, be, assumed, that, those, who, prai..."
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,superbly trashy wondrously unpretentious 80s e...,"[superbly, trashy, and, wondrously, unpretenti..."


In [113]:
# Import stopwords with nltk and choose English
# Storing the words in a set keeps the membership test in the next step fast

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

In [123]:
# Exclude stopwords with Python's list comprehension and pandas using Dataframe apply

movie_custom_df['review_stop'] = movie_custom_df['review_token'].apply(lambda x: [word for word in x if word not in (stop)])
movie_custom_df.head()

Unnamed: 0,id,sentiment,review,review_clean,review_token,review_stop
0,5814_8,1,with all this stuff going down at the moment w...,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,...","[stuff, going, moment, mj, ive, started, liste..."
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot...","[classic, war, worlds, timothy, hines, enterta..."
2,7759_3,0,the film starts with a manager (nicholas bell)...,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas...","[film, starts, manager, nicholas, bell, giving..."
3,3630_4,0,it must be assumed that those who praised this...,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai...","[must, assumed, praised, film, greatest, filme..."
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,"[superbly, trashy, and, wondrously, unpretenti...","[superbly, trashy, wondrously, unpretentious, ..."
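
Stop-word removal is a filter over the token list. The sketch below uses a small hand-written set, `SAMPLE_STOP_WORDS`, which is a hypothetical stand-in for NLTK's English list, and a hypothetical helper `remove_stop_words`; the tokens come from the start of the first review shown above.

```python
# Hypothetical stand-in for NLTK's English stop-word list
SAMPLE_STOP_WORDS = {"with", "all", "this", "down", "at", "the"}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word collection."""
    return [token for token in tokens if token not in stop_words]


tokens = "with all this stuff going down at the moment".split()
filtered = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
print(filtered)  # ['stuff', 'going', 'moment']
```

Using a set for the stop words makes each membership test a constant-time lookup, which matters when the filter runs over 25,000 reviews.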


#### 4. Apply NLTK’s PorterStemmer.

In [124]:
## Import the module required for PorterStemmer

from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer() #create porter_stemmer variable

In [125]:
# Apply stemmer to the above tokenized column as follows

movie_custom_df['review_stemmed']=movie_custom_df['review_stop'].apply(lambda x : [porter_stemmer.stem(y) for y in x])
movie_custom_df.head()

Unnamed: 0,id,sentiment,review,review_clean,review_token,review_stop,review_stemmed
0,5814_8,1,with all this stuff going down at the moment w...,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,...","[stuff, going, moment, mj, ive, started, liste...","[stuff, go, moment, mj, ive, start, listen, mu..."
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot...","[classic, war, worlds, timothy, hines, enterta...","[classic, war, world, timothi, hine, entertain..."
2,7759_3,0,the film starts with a manager (nicholas bell)...,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas...","[film, starts, manager, nicholas, bell, giving...","[film, start, manag, nichola, bell, give, welc..."
3,3630_4,0,it must be assumed that those who praised this...,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai...","[must, assumed, praised, film, greatest, filme...","[must, assum, prais, film, greatest, film, ope..."
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,"[superbly, trashy, and, wondrously, unpretenti...","[superbly, trashy, wondrously, unpretentious, ...","[superbl, trashi, wondrous, unpretenti, 80, ex..."


In [126]:
## Creating a review_final column by joining the stemmed words of each review back into a single string
## Printing a few rows of the dataframe using head command

movie_custom_df['review_final'] = movie_custom_df['review_stemmed'].apply(lambda txt: ' '.join(txt))
movie_custom_df.head()

Unnamed: 0,id,sentiment,review,review_clean,review_token,review_stop,review_stemmed,review_final
0,5814_8,1,with all this stuff going down at the moment w...,with all this stuff going down at the moment w...,"[with, all, this, stuff, going, down, at, the,...","[stuff, going, moment, mj, ive, started, liste...","[stuff, go, moment, mj, ive, start, listen, mu...",stuff go moment mj ive start listen music watc...
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",the classic war of the worlds by timothy hines...,"[the, classic, war, of, the, worlds, by, timot...","[classic, war, worlds, timothy, hines, enterta...","[classic, war, world, timothi, hine, entertain...",classic war world timothi hine entertain film ...
2,7759_3,0,the film starts with a manager (nicholas bell)...,the film starts with a manager nicholas bell g...,"[the, film, starts, with, a, manager, nicholas...","[film, starts, manager, nicholas, bell, giving...","[film, start, manag, nichola, bell, give, welc...",film start manag nichola bell give welcom inve...
3,3630_4,0,it must be assumed that those who praised this...,it must be assumed that those who praised this...,"[it, must, be, assumed, that, those, who, prai...","[must, assumed, praised, film, greatest, filme...","[must, assum, prais, film, greatest, film, ope...",must assum prais film greatest film opera ever...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,"[superbly, trashy, and, wondrously, unpretenti...","[superbly, trashy, wondrously, unpretentious, ...","[superbl, trashi, wondrous, unpretenti, 80, ex...",superbl trashi wondrous unpretenti 80 exploit ...


In [127]:
## Printing the shape of the dataframe to calculate the rows and columns

print("Total number of rows and columns present in dataframe: {}".format(movie_custom_df.shape))

Total number of rows and columns present in dataframe: (25000, 8)


#### 5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [129]:
## Create bag-of-words matrix using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
bag_of_words = count.fit_transform(movie_custom_df['review_final'])

In [130]:
## Printing the shape to verify if the row count is matching with original dataframe and displaying the column count

bag_of_words.shape

(25000, 92395)
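
What a bag-of-words matrix contains can be shown on a tiny corpus. The sketch below is a simplified illustration in plain Python, not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, the three documents are made up from stemmed words that appear above, and tokens are split on whitespace (`CountVectorizer` applies its own token pattern, so its vocabulary can differ). Each row is one document and each column is one vocabulary term.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix) with one word-count row per document."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


docs = ["film start film", "war world film", "classic war"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['classic', 'film', 'start', 'war', 'world']
print(matrix)      # [[0, 2, 1, 0, 0], [0, 1, 0, 1, 1], [1, 0, 0, 1, 0]]
print((len(matrix), len(vocabulary)))  # (3, 5)
```

The row count always equals the number of documents, which is why the matrix above has 25,000 rows; the 92,395 columns are the distinct terms found in the stemmed reviews.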

#### 6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [132]:
## Create the tf-idf matrix (a sparse matrix) from the stemmed text using TfidfVectorizer's fit_transform

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(movie_custom_df['review_final'])

In [133]:
## Displaying the total number of rows and columns

x.shape

(25000, 92395)