### Sentiment Analysis in Python - Part 1

In this post we will discuss sentiment analysis using Python. We will be using the movie review dataset, which can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/

**Sentiment Analysis** is the process of identifying or categorizing the opinions expressed in a piece of text in order to determine whether the attitude toward a particular product, movie, topic, etc. is negative or positive.

### Load the Data Files
The dataset is intended for binary sentiment classification and contains 25,000 training reviews and 25,000 test reviews.

In [1]:
from sklearn.datasets import load_files
import numpy as np
import pandas as pd
import os

In [2]:
path = "aclImdb/"
positiveFiles_train = [x for x in os.listdir("aclImdb/train/pos/") if x.endswith(".txt")]
negativeFiles_train = [x for x in os.listdir("aclImdb/train/neg/") if x.endswith(".txt")]

positiveFiles_test = [x for x in os.listdir("aclImdb/test/pos/") if x.endswith(".txt")]
negativeFiles_test = [x for x in os.listdir("aclImdb/test/neg/") if x.endswith(".txt")]

Looking at the folder of the downloaded dataset, it can be seen that the train and test folders each have a pos and a neg folder.
The pos folder contains the positive movie reviews while the neg folder contains the negative movie reviews. As the output of the print statements below shows, the training set is balanced: it contains 12,500 positive reviews and 12,500 negative reviews. The test set is split the same way.

In [3]:
print("Number of documents in train(pos) data: {}".format(len(positiveFiles_train)))
print("Number of documents in train(neg) data: {}".format(len(negativeFiles_train)))

print("Number of documents in test(pos) data: {}".format(len(positiveFiles_test)))
print("Number of documents in test(neg) data: {}".format(len(negativeFiles_test)))

Number of documents in train(pos) data: 12500
Number of documents in train(neg) data: 12500
Number of documents in test(pos) data: 12500
Number of documents in test(neg) data: 12500


### Read file content into a list

The code below reads the content of all the text files in the pos and neg folders. The text content is saved into two lists (positiveReviews and negativeReviews), each list is wrapped in a DataFrame, and the two are concatenated into a single pandas DataFrame named **train_reviews**, whose rows are then shuffled with `sample(frac=1)`. The concatenated DataFrame has 3 columns: **review**, **label** and **file**.

In [4]:
positiveReviews, negativeReviews = [], []
for pfile in positiveFiles_train:
    with open(path + "train/pos/" + pfile, encoding="latin1") as f:
        positiveReviews.append(f.read())
for nfile in negativeFiles_train:
    with open(path + "train/neg/" +nfile, encoding="latin1") as f:
        negativeReviews.append(f.read())
        

train_reviews = pd.concat([pd.DataFrame({"review":positiveReviews, "label":1, "file":positiveFiles_train}),
                           pd.DataFrame({"review":negativeReviews, "label":0, "file":negativeFiles_train})], 
                           ignore_index=True).sample(frac=1)

The same steps are applied to the test data to read it and convert it into a pandas DataFrame.

In [5]:
positiveReviews, negativeReviews = [], []
for pfile in positiveFiles_test:
    with open(path + "test/pos/" + pfile, encoding="latin1") as f:
        positiveReviews.append(f.read())
for nfile in negativeFiles_test:
    with open(path + "test/neg/" + nfile, encoding="latin1") as f:
        negativeReviews.append(f.read())
        

test_reviews = pd.concat([pd.DataFrame({"review":positiveReviews, "label":1, "file":positiveFiles_test}),
                          pd.DataFrame({"review":negativeReviews, "label":0, "file":negativeFiles_test})], 
                          ignore_index=True).sample(frac=1)
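The two reading cells above repeat the same loop four times. One possible way to fold them into a helper is sketched below. The names `read_reviews` and `read_split` are hypothetical (they are not part of the original notebook), and the sketch assumes the `aclImdb/<split>/<pos|neg>/` folder layout described earlier.

```python
from pathlib import Path


def read_reviews(folder, label, encoding="latin1"):
    """Return one record per .txt file in `folder`, tagged with `label`."""
    records = []
    # sorted() gives a stable file order; os.listdir() order is arbitrary.
    for txt_path in sorted(Path(folder).glob("*.txt")):
        records.append({
            "review": txt_path.read_text(encoding=encoding),
            "label": label,
            "file": txt_path.name,
        })
    return records


def read_split(root, split):
    """Collect positive (label 1) and negative (label 0) reviews for one split."""
    split_dir = Path(root) / split
    return read_reviews(split_dir / "pos", 1) + read_reviews(split_dir / "neg", 0)
```

Passing the result to pandas, for example `pd.DataFrame(read_split("aclImdb", "train")).sample(frac=1)`, should give a shuffled DataFrame with the same three columns as above.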

In [6]:
train_reviews.head()

| | review | label | file |
|---|---|---|---|
| 12704 | For months I've been hearing about this little... | 0 | 10184_1.txt |
| 15761 | Imagine that you are asked by your date what m... | 0 | 1686_1.txt |
| 7835 | Moonwalker is absolutely incredible !!!!!!! Wh... | 1 | 5802_10.txt |
| 19170 | I loved the original. It was brilliant and alw... | 0 | 4754_4.txt |
| 2010 | "Tulip" is on the "Australian All Shorts" vide... | 1 | 1180_9.txt |


In [7]:
test_reviews.head()

| | review | label | file |
|---|---|---|---|
| 11414 | This was a very funny movie, not Oscar-worthy,... | 1 | 9023_10.txt |
| 18551 | This is absolutely the most stupidest movie ev... | 0 | 4197_1.txt |
| 16536 | Synopsis Correction: The ending does not show ... | 0 | 2383_4.txt |
| 6837 | Lovingly crafted and terribly interesting to w... | 1 | 4904_7.txt |
| 7499 | This was a great movie. Something not only for... | 1 | 54_10.txt |
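The out-of-order index values in the two previews come from `.sample(frac=1)`, which returns all rows in random order. As written, the shuffle differs on every run. The sketch below shows one way to make it repeatable on a toy DataFrame. The seed value 42 and the toy rows are arbitrary assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the review data; the real frames hold 25,000 rows each.
toy_reviews = pd.DataFrame({
    "review": ["great", "awful", "fine", "dull"],
    "label": [1, 0, 1, 0],
})

# random_state fixes the shuffle order; reset_index(drop=True) renumbers rows 0..n-1.
shuffled = toy_reviews.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
# The class balance is unchanged by shuffling.
print(shuffled["label"].value_counts())
```

Resetting the index is optional. Keeping the original index, as the notebook does, makes it possible to trace a shuffled row back to its position before shuffling.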


### Data Cleaning and Preprocessing
Examining a sample review from the train and test data, we can see that the raw reviews are messy: they contain HTML line-break tags, inconsistent punctuation and mixed capitalization. Before we can do any analysis we need to clean the data. Cleaning includes removing unwanted text and symbols as well as converting the text to lowercase.

In [10]:
train_reviews.iloc[1,0]

'Imagine that you are asked by your date what movie you wanted to see, and you remember seeing a rather intriguing trailer about "The Grudge." So, in good faith, you recommend seeing that movie. It is the Halloween season, after all. And it did boffo box office this past weekend, so it must be pretty good...so you go.<br /><br />And you\'re actually in a state of shock when the movie ends the way it does, and you hear yourself audibly saying, "that can\'t be the end of the movie...." But, alas, it is. <br /><br />And imagine coming out of the movie theater being embarrassed and ashamed for recommending such a dog of a movie. You think that your date thinks you\'re a bonehead for suggesting such an atrocity, and your suggestion will certainly end a promising relationship. Actually, it was so bad that both of us cracked up laughing at how bad it was. I see no future for Miss Gellar in the movies, and suggest that she sticks to television in the future. Actually, it won\'t be long before 

In [11]:
test_reviews.iloc[1,0]

"This is absolutely the most stupidest movie ever produced in front of a camera. I cant believe I was gullable enough to rent this piece of junk. I have seen some bad movies in my time, But this takes the cake....Ice cream ,,,, and Chips Too. Omg, I still cant get over how bad this thing was. The acting was a Joke.... The Plot was Non Exsistant..and the camera work had to be done by a 3 year old child. I have never seen a movie take so long to go Nowhere. I mean the whole movie could have been shot is less than 30 minutes. I guess this guy had some extra time on his hands.... ( Like 3 Hours. ) And an extra 60 bucks in his wallet, and decided one night...( Hey ..Lets go make the stupidest movie ever made. ) And they did just that. Give me a break.I'm heading back to the video store right now to get Demand my money back.Anyone else who has watched this piece of trash, should do the same."

### Regular Expression

For a single small file, messy text can be cleaned by hand with find and replace (CTRL+F, CTRL+C, CTRL+V and DEL). With this much data, however, we need a more efficient way of finding unwanted text and replacing it with something appropriate. This is where **regular expressions**, or **regex**, come into the picture.

Regular expressions are essential in natural language processing. In this section we will use them to perform search and replace. To use regex we need to import the Python **re** library.

In [12]:
import re
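As a minimal illustration of search and replace with `re`, the sketch below finds and replaces HTML line-break tags in a made-up sentence. The sample string and the variable names are assumptions for illustration and do not come from the dataset.

```python
import re

raw = "It was great!<br /><br />Would watch again."

# re.search returns a match object for the first occurrence, or None.
match = re.search(r"<br\s*/>", raw)
first_tag = match.group()
print(first_tag)

# re.sub replaces every match; the + collapses a run of tags into one space.
replaced = re.sub(r"(?:<br\s*/>)+", " ", raw)
print(replaced)
```

The pattern `<br\s*/>` matches both `<br/>` and `<br />`, because `\s*` allows zero or more whitespace characters before the slash.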

Let's take a look at one sample of data. 

In [41]:
train_reviews.iloc[0,0]

'For months I\'ve been hearing about this little movie and now I\'ve seen it. I find it cute, cute how so many fledgling directors make movies where they combine other people\'s creative ideas in order to make their own one-joke premise of a movie. Troops, Swingblade, any of the million Blair Witch parodies come to mind. If all that these directors want is a foot inside Hollywood\'s door then they\'re doing the right thing and they should keep it up because combining plot outlines is how Hollywood makes films. How many times have you heard the phrase, "It\'s Animal House meets Back to the Future"; "It\'s Wall Street meets Dead Poet\'s Society"; or "Shakespeare in Love meets Star Wars"? I remember when independent films meant original and daring not safe and predictable.'

As observed, there are plenty of unwanted characters in the text. To deal with them we will create a function that deletes punctuation outright and replaces line-break tags, hyphens and slashes with a space.

In [35]:
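# Punctuation that is deleted outright (raw strings avoid invalid-escape warnings).
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# Double line-break tags, hyphens and slashes, which are replaced by a space.
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def preprocess_reviews(reviews):
    # Lowercase first, then delete punctuation and insert spaces for separators.
    reviews = REPLACE_NO_SPACE.sub('', reviews.lower())
    reviews = REPLACE_WITH_SPACE.sub(' ', reviews)
    return reviews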


In [37]:
# Clean the training data
train_reviews['tidy_review'] = train_reviews['review'].apply(preprocess_reviews)

In [55]:
# Clean the test data
test_reviews['tidy_review'] = test_reviews['review'].apply(preprocess_reviews)

In [38]:
train_reviews.head()

| | review | label | file | tidy_review |
|---|---|---|---|---|
| 12704 | For months I've been hearing about this little... | 0 | 10184_1.txt | for months ive been hearing about this little ... |
| 15761 | Imagine that you are asked by your date what m... | 0 | 1686_1.txt | imagine that you are asked by your date what m... |
| 7835 | Moonwalker is absolutely incredible !!!!!!! Wh... | 1 | 5802_10.txt | moonwalker is absolutely incredible what else... |
| 19170 | I loved the original. It was brilliant and alw... | 0 | 4754_4.txt | i loved the original it was brilliant and alwa... |
| 2010 | "Tulip" is on the "Australian All Shorts" vide... | 1 | 1180_9.txt | tulip is on the australian all shorts video fr... |


In [40]:
train_reviews.iloc[0,3]

'for months ive been hearing about this little movie and now ive seen it i find it cute cute how so many fledgling directors make movies where they combine other peoples creative ideas in order to make their own one joke premise of a movie troops swingblade any of the million blair witch parodies come to mind if all that these directors want is a foot inside hollywoods door then theyre doing the right thing and they should keep it up because combining plot outlines is how hollywood makes films how many times have you heard the phrase its animal house meets back to the future its wall street meets dead poets society or shakespeare in love meets star wars i remember when independent films meant original and daring not safe and predictable'
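One caveat about the cleaning function: its pattern only matches two line-break tags in a row, so a single `<br />` would lose just its slash and leave `<br  >` behind in the text. The sketch below is one possible variant that handles any number of consecutive tags. The names `BR_OR_SEPARATOR` and `preprocess_review_any_br` are hypothetical, and the sample sentence is made up.

```python
import re

# Same punctuation class as the cleaning function above.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# Hypothetical variant: a run of <br>, <br/> or <br /> tags, or a hyphen or slash.
BR_OR_SEPARATOR = re.compile(r"(?:<br\s*/?>)+|[-/]")


def preprocess_review_any_br(review):
    """Lowercase, turn line-break tags and separators into spaces, drop punctuation."""
    # Tags are replaced first so that their slashes are not treated as separators.
    review = BR_OR_SEPARATOR.sub(" ", review.lower())
    return REPLACE_NO_SPACE.sub("", review)


cleaned = preprocess_review_any_br("So-so plot.<br />Bad acting!<br /><br />Avoid.")
print(cleaned)
```

Whether single tags occur often enough in the dataset to matter is not checked here. The variant simply removes the dependency on tags always appearing in pairs.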

#### Removing Stop words
Another preprocessing step we can apply to our data is removing stop words. Stop words are common words in a language, such as "the", "is" and "and", that do not necessarily help in identifying the context or the true meaning of a sentence.

To remove the stop words from our data, we will tokenize each review and compare each word, or token, against the list of English stop words provided by NLTK.

In [44]:
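from nltk.corpus import stopwords
# Requires the NLTK stop word corpus; run nltk.download('stopwords') once if it is missing.
# A set makes the membership test in the next cells much faster than a list.
stop_words = set(stopwords.words('english'))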

In [46]:
#train data
train_reviews['reviews_without_stopwords'] = train_reviews['tidy_review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In [56]:
#test data
test_reviews['reviews_without_stopwords'] = test_reviews['tidy_review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In [47]:
train_reviews.head()

| | review | label | file | tidy_review | reviews_without_stopwords |
|---|---|---|---|---|---|
| 12704 | For months I've been hearing about this little... | 0 | 10184_1.txt | for months ive been hearing about this little ... | months ive hearing little movie ive seen find ... |
| 15761 | Imagine that you are asked by your date what m... | 0 | 1686_1.txt | imagine that you are asked by your date what m... | imagine asked date movie wanted see remember s... |
| 7835 | Moonwalker is absolutely incredible !!!!!!! Wh... | 1 | 5802_10.txt | moonwalker is absolutely incredible what else... | moonwalker absolutely incredible else say mich... |
| 19170 | I loved the original. It was brilliant and alw... | 0 | 4754_4.txt | i loved the original it was brilliant and alwa... | loved original brilliant always strangely thou... |
| 2010 | "Tulip" is on the "Australian All Shorts" vide... | 1 | 1180_9.txt | tulip is on the australian all shorts video fr... | tulip australian shorts video tribe first rite... |


**Sample result after removing stop words:**

In [49]:
train_reviews.iloc[0,4]

'months ive hearing little movie ive seen find cute cute many fledgling directors make movies combine peoples creative ideas order make one joke premise movie troops swingblade million blair witch parodies come mind directors want foot inside hollywoods door theyre right thing keep combining plot outlines hollywood makes films many times heard phrase animal house meets back future wall street meets dead poets society shakespeare love meets star wars remember independent films meant original daring safe predictable'
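The filtering step can be tried on a single sentence without downloading anything. The sketch below uses a tiny hand-picked stop list that stands in for NLTK's much longer English list. `SAMPLE_STOP_WORDS` and `remove_stop_words` are hypothetical names introduced for illustration.

```python
# A tiny, hand-picked stop list for illustration only; NLTK's list is much longer.
SAMPLE_STOP_WORDS = {"for", "been", "about", "this", "and", "now", "it", "i"}


def remove_stop_words(text, stop_words):
    """Keep only the whitespace-separated tokens that are not in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


sample = "for months ive been hearing about this little movie and now ive seen it"
filtered = remove_stop_words(sample, SAMPLE_STOP_WORDS)
print(filtered)
```

Note that "ive" survives, just as it does in the sample result above. The cleaning step stripped the apostrophe, and the token is compared as-is, so it only disappears if that exact spelling is in the stop list.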

#### Stemming
Stemming is a common step in natural language processing. It is the process of reducing a word to its root form, or stem, so that, for example, "hearing" becomes "hear". The stem does not have to be a dictionary word: "movie" is reduced to "movi".
One very popular stemming algorithm is the Porter stemmer. It is already included in the NLTK library, so we just need to import it and create an instance of PorterStemmer.

In [50]:
from nltk.stem.porter import PorterStemmer

In [51]:
stemmer = PorterStemmer()

In [52]:
train_reviews['normalized'] = train_reviews['reviews_without_stopwords'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [57]:
test_reviews['normalized'] = test_reviews['reviews_without_stopwords'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

**Sample result after stemming:**

In [53]:
train_reviews.head()

| | review | label | file | tidy_review | reviews_without_stopwords | normalized |
|---|---|---|---|---|---|---|
| 12704 | For months I've been hearing about this little... | 0 | 10184_1.txt | for months ive been hearing about this little ... | months ive hearing little movie ive seen find ... | month ive hear littl movi ive seen find cute c... |
| 15761 | Imagine that you are asked by your date what m... | 0 | 1686_1.txt | imagine that you are asked by your date what m... | imagine asked date movie wanted see remember s... | imagin ask date movi want see rememb see rathe... |
| 7835 | Moonwalker is absolutely incredible !!!!!!! Wh... | 1 | 5802_10.txt | moonwalker is absolutely incredible what else... | moonwalker absolutely incredible else say mich... | moonwalk absolut incred els say michael jackso... |
| 19170 | I loved the original. It was brilliant and alw... | 0 | 4754_4.txt | i loved the original it was brilliant and alwa... | loved original brilliant always strangely thou... | love origin brilliant alway strang though actu... |
| 2010 | "Tulip" is on the "Australian All Shorts" vide... | 1 | 1180_9.txt | tulip is on the australian all shorts video fr... | tulip australian shorts video tribe first rite... | tulip australian short video tribe first rite ... |
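To see what the stemmer does to individual words, the sketch below stems the opening words of the first review. It also wraps the stemmer in `functools.lru_cache`, which may save time on a corpus this size because the same words recur across many reviews. The names `cached_stem` and `stem_text` are hypothetical, and the caching is an optional addition, not part of the original notebook.

```python
from functools import lru_cache

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


@lru_cache(maxsize=None)
def cached_stem(word):
    """Stem a word once and reuse the stored result on later calls."""
    return stemmer.stem(word)


def stem_text(text):
    """Stem every whitespace-separated token and rejoin with single spaces."""
    return " ".join(cached_stem(word) for word in text.split())


stemmed = stem_text("months ive hearing little movie ive seen find cute")
print(stemmed)
```

The second occurrence of "ive" is served from the cache and is not stemmed again. No speed-up was measured for this post, so any gain on the full dataset is an expectation, not a benchmarked result.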
