## Assignment 3.2 - Sentiment Analysis and Preprocessing Text
### by Alex Hamedaninia

In this assignment, we will be reading in a file titled "Bag of Words Meets Bags of Popcorn" found [here](https://www.kaggle.com/competitions/word2vec-nlp-tutorial/data?select=labeledTrainData.tsv.zip). 
We'll use sentiment analysis to discover how many positive and negative reviews there are, and when we classify them, we will assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.
As extra credit, we will do the same using another prebuilt text sentiment analyzer, VADER.


### Part 1: Using the TextBlob Sentiment Analyzer

In [1]:
import pandas as pd
import numpy as np
from textblob import TextBlob

In [2]:
# note: sep = \t required because it's a tsv (tab sep values) file not csv file
df = pd.read_csv('labeledTrainData.tsv', sep='\t')
df.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


First, let's see how many positive and negative reviews there are (1 = positive, 0 = negative).

In [3]:
df[['sentiment']].value_counts()

sentiment
0            12500
1            12500
dtype: int64

There are exactly as many positive reviews as negative ones.
Now, using TextBlob, let's see if we can produce similar results.

In [5]:
df['tb_model_sentiment'] = df['review'].apply(lambda review: TextBlob(review).polarity)
# reminder: >= 0 is positive sentiment, < 0 is negative sentiment
df.head()

| | id | sentiment | review | tb_model_sentiment |
|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.001277 |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.256349 |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | -0.053941 |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.134753 |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | -0.024842 |


In [6]:
# check if the value under tb_model_sentiment is >= 0, or a positive sentiment
df['predicted_sentiment'] = df['tb_model_sentiment'].apply(lambda x : x >= 0)
# this creates a column of T/F booleans
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment |
|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.001277 | True |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.256349 | True |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | -0.053941 | False |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.134753 | True |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | -0.024842 | False |


In [7]:
# convert the T/F to 1's and 0's, T = 1, F = 0
df['predicted_sentiment'] = df['predicted_sentiment'].apply(lambda x : 1 if x else 0)
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment |
|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.001277 | 1 |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.256349 | 1 |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | -0.053941 | 0 |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.134753 | 1 |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | -0.024842 | 0 |
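
The two thresholding cells above can be collapsed into a single step. Here is a minimal sketch, assuming a hypothetical helper named `label_from_polarity` (it is not part of TextBlob or pandas) and using the five polarity scores shown in the table:

```python
def label_from_polarity(score, threshold=0.0):
    """Return 1 (positive) if score >= threshold, else 0 (negative)."""
    return 1 if score >= threshold else 0

# polarity scores copied from the first five rows of the table above
polarities = [0.001277, 0.256349, -0.053941, 0.134753, -0.024842]
labels = [label_from_polarity(p) for p in polarities]
print(labels)  # [1, 1, 0, 1, 0]
```

On the DataFrame itself, `(df['tb_model_sentiment'] >= 0).astype(int)` should produce the same column without the intermediate booleans.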


In [8]:
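# now we can estimate accuracy
# first store the counts of each actual sentiment, looked up by label
# so the result does not depend on the order value_counts returns
actual_counts = df['sentiment'].value_counts()
actual_1, actual_0 = actual_counts.loc[1], actual_counts.loc[0]
print('Actual sentiment counts: ', actual_1, actual_0)
# now for the predicted sentiment
predicted_counts = df['predicted_sentiment'].value_counts()
predicted_1, predicted_0 = predicted_counts.loc[1], predicted_counts.loc[0]
print('Predicted sentiment counts: ', predicted_1, predicted_0)

# accuracy is (true positives + true neg) / total num of samples
# note: the two lines below estimate true pos / true neg from class totals only;
# they do not compare predictions to labels review by review
true_pos = actual_1 - abs(actual_1 - predicted_1)
true_neg = actual_0 - abs(actual_0 - predicted_0)
accuracy = (true_pos + true_neg) / len(df)
print('Accuracy: ', accuracy)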

Actual sentiment counts:  12500 12500
Predicted sentiment counts:  19017 5983
Accuracy:  0.47864
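
The figure printed above is derived from class totals, so it cannot tell which individual reviews were labeled correctly. A per-review comparison is the usual definition of accuracy. Here is a minimal sketch, assuming a hypothetical helper named `accuracy` and using only the five rows displayed earlier (not the full dataset):

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# labels copied from the first five rows shown earlier
actual = [1, 1, 0, 0, 1]
predicted = [1, 1, 0, 1, 0]
print(accuracy(actual, predicted))  # 0.6
```

Both lists contain three 1s and two 0s, so a comparison of totals alone would report perfect agreement even though two of the five predictions are wrong. On the full DataFrame, the equivalent per-review calculation would be `(df['sentiment'] == df['predicted_sentiment']).mean()`; its result is not shown here because it was not run as part of this notebook.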


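At 0.479, this score is slightly below the 0.5 we would expect from random guessing, not above it. Note that it is computed from class totals rather than by comparing each prediction with its label, so it is only a rough estimate. The totals do show that TextBlob labels far more reviews positive (19,017) than negative (5,983).
Let's see if we can get better results using the VADER sentiment analyzer.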

In [9]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# create a sentiment intensity analyzer object
sid_obj = SentimentIntensityAnalyzer()

In [10]:
# polarity_scores returns a sentiment dictionary with pos, neg, neu, and compound scores
df['vader_sentiments'] = df['review'].apply(lambda review: sid_obj.polarity_scores(review))
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment | vader_sentiments |
|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.001277 | 1 | {'neg': 0.128, 'neu': 0.751, 'pos': 0.121, 'co... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.256349 | 1 | {'neg': 0.08, 'neu': 0.713, 'pos': 0.207, 'com... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | -0.053941 | 0 | {'neg': 0.135, 'neu': 0.809, 'pos': 0.055, 'co... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.134753 | 1 | {'neg': 0.062, 'neu': 0.884, 'pos': 0.054, 'co... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | -0.024842 | 0 | {'neg': 0.122, 'neu': 0.743, 'pos': 0.135, 'co... |


In [11]:
# we want to know which sentiment has the larger share in the dict
# so we compare the 'neg' and 'pos' proportions to see which is higher
# if they are equal, let's make it positive
df['vader_prediction'] = df['vader_sentiments'].apply(lambda x : 1 if x['pos'] >= x['neg'] else 0)
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment | vader_sentiments | vader_prediction |
|---|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | 0.001277 | 1 | {'neg': 0.128, 'neu': 0.751, 'pos': 0.121, 'co... | 0 |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | 0.256349 | 1 | {'neg': 0.08, 'neu': 0.713, 'pos': 0.207, 'com... | 1 |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | -0.053941 | 0 | {'neg': 0.135, 'neu': 0.809, 'pos': 0.055, 'co... | 0 |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | 0.134753 | 1 | {'neg': 0.062, 'neu': 0.884, 'pos': 0.054, 'co... | 0 |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | -0.024842 | 0 | {'neg': 0.122, 'neu': 0.743, 'pos': 0.135, 'co... | 1 |
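
The labeling rule used above can be checked by hand on the displayed rows. Here is a minimal sketch, assuming a hypothetical helper named `label_from_vader` that takes a dictionary shaped like the output of `polarity_scores`:

```python
def label_from_vader(scores):
    """Return 1 if the 'pos' share is at least the 'neg' share, else 0."""
    return 1 if scores['pos'] >= scores['neg'] else 0

# 'neg', 'neu', 'pos' values copied from the first five rows above;
# the 'compound' entry is left out because the table truncates it
rows = [
    {'neg': 0.128, 'neu': 0.751, 'pos': 0.121},
    {'neg': 0.08, 'neu': 0.713, 'pos': 0.207},
    {'neg': 0.135, 'neu': 0.809, 'pos': 0.055},
    {'neg': 0.062, 'neu': 0.884, 'pos': 0.054},
    {'neg': 0.122, 'neu': 0.743, 'pos': 0.135},
]
vader_labels = [label_from_vader(r) for r in rows]
print(vader_labels)  # [0, 1, 0, 0, 1]
```

The dictionary also carries a `compound` score, which could be thresholded instead of comparing `pos` and `neg`. Whether that would score better on this dataset is not tested here.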


In [12]:
# now we can estimate the accuracy like before
print('Actual sentiment counts: ', actual_1, actual_0)
# now for the predicted sentiment, looked up by label
predicted_v_counts = df['vader_prediction'].value_counts()
predicted_v1, predicted_v0 = predicted_v_counts.loc[1], predicted_v_counts.loc[0]
print('Predicted Vader sentiment counts: ', predicted_v1, predicted_v0)

# accuracy is (true positives + true neg) / total num of samples
# note: as before, true pos / true neg are estimated from class totals only
true_pos = actual_1 - abs(actual_1 - predicted_v1)
true_neg = actual_0 - abs(actual_0 - predicted_v0)
accuracy = (true_pos + true_neg) / len(df)
print('Accuracy: ', accuracy)

Actual sentiment counts:  12500 12500
Predicted Vader sentiment counts:  16908 8092
Accuracy:  0.64736


With a score of 0.647 against TextBlob's 0.479 on the same count-based measure, VADER is a clear improvement and scores above the 0.5 expected from random guessing.

### Part 2: Prepping Text for a Custom Model
We can use the following procedure to prepare the text for our own model:
* Convert all text to lowercase letters.
* Remove punctuation and special characters from the text.
* Remove stop words.
* Apply NLTK’s PorterStemmer.
* Create a bag-of-words matrix from the stemmed text (the output of the previous step), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of the bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in the original data frame.
* Create a term frequency-inverse document frequency (tf-idf) matrix from the stemmed text (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of the tf-idf matrix. These dimensions should be the same as those of the bag-of-words matrix.

In [13]:
# first let's convert all text to lowercase
df['review'] = df['review'].apply(lambda x: x.lower())
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment | vader_sentiments | vader_prediction |
|---|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | with all this stuff going down at the moment w... | 0.001277 | 1 | {'neg': 0.128, 'neu': 0.751, 'pos': 0.121, 'co... | 0 |
| 1 | 2381_9 | 1 | \the classic war of the worlds\" by timothy hi... | 0.256349 | 1 | {'neg': 0.08, 'neu': 0.713, 'pos': 0.207, 'com... | 1 |
| 2 | 7759_3 | 0 | the film starts with a manager (nicholas bell)... | -0.053941 | 0 | {'neg': 0.135, 'neu': 0.809, 'pos': 0.055, 'co... | 0 |
| 3 | 3630_4 | 0 | it must be assumed that those who praised this... | 0.134753 | 1 | {'neg': 0.062, 'neu': 0.884, 'pos': 0.054, 'co... | 0 |
| 4 | 9495_8 | 1 | superbly trashy and wondrously unpretentious 8... | -0.024842 | 0 | {'neg': 0.122, 'neu': 0.743, 'pos': 0.135, 'co... | 1 |


In [14]:
# next let's remove punctuation and special characters
import re
# note: re.sub strips every character that is not a letter, digit, or whitespace
df['review'] = df['review'].apply(lambda review : re.sub(r'[^a-zA-Z0-9\s]+', '', review))
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment | vader_sentiments | vader_prediction |
|---|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | with all this stuff going down at the moment w... | 0.001277 | 1 | {'neg': 0.128, 'neu': 0.751, 'pos': 0.121, 'co... | 0 |
| 1 | 2381_9 | 1 | the classic war of the worlds by timothy hines... | 0.256349 | 1 | {'neg': 0.08, 'neu': 0.713, 'pos': 0.207, 'com... | 1 |
| 2 | 7759_3 | 0 | the film starts with a manager nicholas bell g... | -0.053941 | 0 | {'neg': 0.135, 'neu': 0.809, 'pos': 0.055, 'co... | 0 |
| 3 | 3630_4 | 0 | it must be assumed that those who praised this... | 0.134753 | 1 | {'neg': 0.062, 'neu': 0.884, 'pos': 0.054, 'co... | 0 |
| 4 | 9495_8 | 1 | superbly trashy and wondrously unpretentious 8... | -0.024842 | 0 | {'neg': 0.122, 'neu': 0.743, 'pos': 0.135, 'co... | 1 |
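
To see exactly what the pattern in the cell above removes, it helps to run it on one short string. Here is a minimal sketch using the same regular expression; the sample sentence is made up for illustration and does not come from the dataset:

```python
import re

# same pattern as the cell above: one or more characters that are
# not ASCII letters, digits, or whitespace
pattern = re.compile(r'[^a-zA-Z0-9\s]+')

# hypothetical sample text, not taken from the reviews
sample = "i've seen it (twice) - it's 80's fun!"
cleaned = pattern.sub('', sample)
print(cleaned)  # ive seen it twice  its 80s fun
```

Two side effects are worth noting: contractions collapse into single tokens such as `ive` and `its`, and whitespace is left untouched, so removing a standalone dash leaves a double space behind.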


In [15]:
# next let's remove the stop words (ex: and, but, can, or, the, etc.)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# first download the stopwords and the tokenizer data used by word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')
# then store the stop words in a set for fast membership checks
stop_words = set(stopwords.words('english'))

# now let's tokenize each review and drop the stop words
df['token_words'] = df['review'].apply(lambda review : [w for w in word_tokenize(review) if w not in stop_words])
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment | vader_sentiments | vader_prediction | token_words |
|---|---|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | with all this stuff going down at the moment w... | 0.001277 | 1 | {'neg': 0.128, 'neu': 0.751, 'pos': 0.121, 'co... | 0 | [stuff, going, moment, mj, ive, started, liste... |
| 1 | 2381_9 | 1 | the classic war of the worlds by timothy hines... | 0.256349 | 1 | {'neg': 0.08, 'neu': 0.713, 'pos': 0.207, 'com... | 1 | [classic, war, worlds, timothy, hines, enterta... |
| 2 | 7759_3 | 0 | the film starts with a manager nicholas bell g... | -0.053941 | 0 | {'neg': 0.135, 'neu': 0.809, 'pos': 0.055, 'co... | 0 | [film, starts, manager, nicholas, bell, giving... |
| 3 | 3630_4 | 0 | it must be assumed that those who praised this... | 0.134753 | 1 | {'neg': 0.062, 'neu': 0.884, 'pos': 0.054, 'co... | 0 | [must, assumed, praised, film, greatest, filme... |
| 4 | 9495_8 | 1 | superbly trashy and wondrously unpretentious 8... | -0.024842 | 0 | {'neg': 0.122, 'neu': 0.743, 'pos': 0.135, 'co... | 1 | [superbly, trashy, wondrously, unpretentious, ... |
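
The filtering step above can be illustrated without NLTK. Here is a minimal sketch, assuming a tiny hand-picked stop-word set (NLTK's English list is much longer) and using `str.split()` as a stand-in for `word_tokenize`:

```python
# small illustrative stop-word set (hypothetical, chosen for this example only)
stop_words = {'with', 'all', 'this', 'down', 'at', 'the'}

# opening words of the first review, as shown in the table above
review = 'with all this stuff going down at the moment'

# str.split() stands in for word_tokenize in this sketch
tokens = [w for w in review.split() if w not in stop_words]
print(tokens)  # ['stuff', 'going', 'moment']
```

Keeping the stop words in a set rather than a list makes each `w not in stop_words` check a constant-time lookup, which matters when the filter runs over 25,000 reviews.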


In [16]:
# next let's apply NLTK's Porter Stemmer
# Stemmers remove morphological affixes from words, leaving only the word stem.
# ex: boxes -> box, cats -> cat, jogging -> jog etc.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
df['token_words'] = df['token_words'].apply(lambda words : [stemmer.stem(word) for word in words])
df.head()

| | id | sentiment | review | tb_model_sentiment | predicted_sentiment | vader_sentiments | vader_prediction | token_words |
|---|---|---|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | with all this stuff going down at the moment w... | 0.001277 | 1 | {'neg': 0.128, 'neu': 0.751, 'pos': 0.121, 'co... | 0 | [stuff, go, moment, mj, ive, start, listen, mu... |
| 1 | 2381_9 | 1 | the classic war of the worlds by timothy hines... | 0.256349 | 1 | {'neg': 0.08, 'neu': 0.713, 'pos': 0.207, 'com... | 1 | [classic, war, world, timothi, hine, entertain... |
| 2 | 7759_3 | 0 | the film starts with a manager nicholas bell g... | -0.053941 | 0 | {'neg': 0.135, 'neu': 0.809, 'pos': 0.055, 'co... | 0 | [film, start, manag, nichola, bell, give, welc... |
| 3 | 3630_4 | 0 | it must be assumed that those who praised this... | 0.134753 | 1 | {'neg': 0.062, 'neu': 0.884, 'pos': 0.054, 'co... | 0 | [must, assum, prais, film, greatest, film, ope... |
| 4 | 9495_8 | 1 | superbly trashy and wondrously unpretentious 8... | -0.024842 | 0 | {'neg': 0.122, 'neu': 0.743, 'pos': 0.135, 'co... | 1 | [superbl, trashi, wondrous, unpretenti, 80, ex... |


In [24]:
# next we will be creating a bag-of-words matrix using CountVectorizer
# num of rows in this matrix will be equal to num of rows in original dataframe (25000)
# each row is a word-count vector for a single movie review

from sklearn.feature_extraction.text import CountVectorizer

# create a count vectorizer, keeping only the 25000 most frequent terms
vectorizer = CountVectorizer(max_features = 25000)

# let's put the list under token_words into a string separated by spaces
# by default CountVectorizer expects one string per document, not a list of tokens
df['token_words_string'] = df['token_words'].apply(lambda x : ' '.join(x))

# creating our bag of words matrix
features = vectorizer.fit_transform(df['token_words_string'])
# displaying the sparse matrix shows its dimensions (rows x columns)
features

<25000x25000 sparse matrix of type '<class 'numpy.int64'>'
	with 2353438 stored elements in Compressed Sparse Row format>
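
The last step in the procedure, the tf-idf matrix, is not run in the cells above. To show how it relates to the bag-of-words counts, here is a minimal pure-Python sketch on three made-up stemmed documents. The weighting used is the smoothed form, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 row normalization, which is our understanding of scikit-learn's default; treat that as an assumption rather than a reproduction of the library.

```python
import math
from collections import Counter

# three hypothetical stemmed documents (not taken from the dataset)
docs = [
    'film start manag',
    'film great film',
    'classic war world',
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# bag of words: one row per document, one column per vocabulary term
bow = [[Counter(doc)[t] for t in vocab] for doc in tokenized]

n_docs = len(docs)
# document frequency: number of documents containing each term
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
# smoothed inverse document frequency (assumed formula, see lead-in)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf = []
for row in bow:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    tfidf.append([v / norm for v in weighted])  # L2-normalize each row

print(len(bow), len(bow[0]))      # 3 7
print(len(tfidf), len(tfidf[0]))  # 3 7
```

Because tf-idf only reweights the existing counts, both matrices have the same dimensions. On the real data, `TfidfVectorizer` from `sklearn.feature_extraction.text`, fitted on `df['token_words_string']` with the same `max_features`, would be the library route.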