# Sentiment Analysis for Movie Reviews

We are going to be exploring the Standford Large Movie Review Dataset http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
import nltk 
# nltk.download()

### Read in text data

The data is saved in the following file structure

In [2]:
# aclimdb/
#     test/
#          pos/
#              [id]_[rating]
#              ....
#          neg/
#              [id]_[rating]
#              ....
#     train/
#          pos/
#              [id]_[rating]
#              ....
#          neg/
#              -[id]_[rating]
#              ....

In [3]:
import glob
import pandas as pd
pd.set_option('display.max_colwidth', 200) # set max number of characters can see in pd dataframe

#path location to the positive directory
positive_path = 'data/aclimdb/train/pos/*.txt'
positive_files = glob.glob(positive_path)

body = []

for name in positive_files:
    with open(name) as f:
        body.append(f.read())

In [4]:
print(len(body))

12500


In [5]:
positive_corpus = pd.DataFrame({
    'label': '1',
    'body': body
})

positive_corpus.head()

Unnamed: 0,label,body
0,1,For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni...
1,1,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her ..."
2,1,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the phy..."
3,1,"It's a strange feeling to sit alone in a theater occupied by parents and their rollicking kids. I felt like instead of a movie ticket, I should have been given a NAMBLA membership.<br /><br />Base..."
4,1,"You probably all already know this by now, but 5 additional episodes never aired can be viewed on ABC.com I've watched a lot of television over the years and this is possibly my favorite show, eve..."


In [6]:
#path location to the positive directory
negative_path = 'data/aclimdb/train/pos/*.txt'
negative_files = glob.glob(negative_path)

body = []

for name in negative_files:
    with open(name) as f:
        body.append(f.read())

In [7]:
negative_corpus = pd.DataFrame({
    'label': '0',
    'body': body
})

negative_corpus.head()

Unnamed: 0,label,body
0,0,For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni...
1,0,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her ..."
2,0,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the phy..."
3,0,"It's a strange feeling to sit alone in a theater occupied by parents and their rollicking kids. I felt like instead of a movie ticket, I should have been given a NAMBLA membership.<br /><br />Base..."
4,0,"You probably all already know this by now, but 5 additional episodes never aired can be viewed on ABC.com I've watched a lot of television over the years and this is possibly my favorite show, eve..."


In [8]:
#shuffle the positive and negative reviews and reset the index 
full_corpus = pd.concat([positive_corpus, negative_corpus])
data = full_corpus.sample(frac=1).reset_index(drop=True)
data.head(10)

Unnamed: 0,label,body
0,1,"This movie answers the question, how does a relationship survive when your girlfriend is codependent, clinging, needy, jealous .. and has powers and abilities far beyond those of mortal women?<br ..."
1,0,"I'm no big fan of Martial Arts movies, but the video shop was nearly empty and Jet Li was in Lethal Weapon 4 and I got it free when the other films I'd rented, either way I rented it. I absolutely..."
2,1,"Beat a path to this important documentary that looks like an attractive feature. Forbidden Lie$(2007) is simply a better (cinematic) version of Norma Khouri's book Forbidden Love, and THAT was a b..."
3,0,"Tsui Hark's visual artistry is at its peek in this movie. Unfortunately the terrible acting by Ekin Cheng and especially Cecilia Cheung (I felt the urge to strangle her while watching this, it's t..."
4,0,Parker and Stone transplant their pacy expletive-ridden humour from their animated masterpiece to a feature length live action film with generally good results. Much of the film is Trey and Matt r...
5,1,"From start to finish, I laughed real hard throughout the whole movie. It's amazing that ""The Groove Tube"" is possibly the granddaddy, yet raunchiest, of all comedic skit movies.This is the way I e..."
6,1,"It ran 8 seasons, but it's first, in early 1959, and it's last, in the autumn of 1965, were shorter than seasons 2-7. CBS chief William Paley canceled Rawhide's production after watching the 1st s..."
7,1,"I happened to see this movie twice or more and found it well made! WWII had freshly ended and the so-called ""Cold War"" was about to begin. This movie could, therefore, be defined as one of the bes..."
8,1,"This movie's origins are a mystery to me, as I only know as much as IMDB did before I rented it. I assume that before ""Starship Troopers"", ""Killshot"" was one of the countless unaired pilots that n..."
9,1,"Elisha Cuthbert plays Sue a fourteen year old girl who has lost her mother and finds it hard to communicate with her father, until one day in the basement of her apartment she finds a secret magic..."


### Exploring Dataset

In [9]:
# What is the shape of this datataset

print("Input data has {} rows and {} columns".format(len(full_corpus), len(full_corpus.columns)))

Input data has 25000 rows and 2 columns


In [10]:
# How many revies are negative and positive?

print("Out of {} rows, {} are negative, {} are positive".format(len(full_corpus), 
                                                       len(full_corpus[full_corpus['label']=='0']), 
                                                       len(full_corpus[full_corpus['label']=='1'])))

Out of 25000 rows, 12500 are negative, 12500 are positive


In [11]:
# Is there any missing data?
print("Number of null in label: {}".format(full_corpus['label'].isnull().sum()))
print("Number of null in text: {}".format(full_corpus['body'].isnull().sum()))

Number of null in label: 0
Number of null in text: 0


## Pre-processing text data

Clean up the text data by removing unnecesary information with tools such as :
- Removing punctuation
- Tokenization 
- Removing stopwords
- Lemmatizing/Stemming words

### Removing breaks br

In [12]:
import re 

def remove_breaks(text):
    text_nobreaks = [re.sub("<br />", '', text)]
    return text_nobreaks

data['body_nobreaks'] = data['body'].apply(lambda x: remove_breaks(x))
data.head(5)

NameError: name 're_test' is not defined

### Removing Punctuation

- removing unuseful punctuation from text '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
import string 

string.punctuation

In [None]:
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

In [None]:
data['body_nopunct'] = data['body_nobreaks'].apply(lambda x: remove_punct(x))
data.head()

### Tokenization 
- Split the text body into list of words 

In [None]:
import re # regex

def tokenize(text):
    tokens = re.split('\W+', text)
    return tokens

In [None]:
data['body_tokenized'] = data['body_nopunct'].apply(lambda x: tokenize(x))
data.head()