# Sentiment Analysis for Movie Reviews

We are going to be exploring the Standford Large Movie Review Dataset http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
import nltk 
# nltk.download()

### Read in text data

The data is saved in the following file structure

In [2]:
# aclimdb/
#     test/
#          pos/
#              [id]_[rating]
#              ....
#          neg/
#              [id]_[rating]
#              ....
#     train/
#          pos/
#              [id]_[rating]
#              ....
#          neg/
#              -[id]_[rating]
#              ....

In [3]:
import glob
import pandas as pd
pd.set_option('display.max_colwidth', 200) # set max number of characters can see in pd dataframe

#path location to the positive directory
positive_path = 'data/aclimdb/train/pos/*.txt'
positive_files = glob.glob(positive_path)

body = []

for name in positive_files:
    with open(name) as f:
        body.append(f.read())

In [4]:
print(len(body))

12500


In [5]:
positive_corpus = pd.DataFrame({
    'label': '1',
    'body': body
})

positive_corpus.head()

Unnamed: 0,label,body
0,1,For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni...
1,1,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her ..."
2,1,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the phy..."
3,1,"It's a strange feeling to sit alone in a theater occupied by parents and their rollicking kids. I felt like instead of a movie ticket, I should have been given a NAMBLA membership.<br /><br />Base..."
4,1,"You probably all already know this by now, but 5 additional episodes never aired can be viewed on ABC.com I've watched a lot of television over the years and this is possibly my favorite show, eve..."


In [6]:
#path location to the positive directory
negative_path = 'data/aclimdb/train/pos/*.txt'
negative_files = glob.glob(negative_path)

body = []

for name in negative_files:
    with open(name) as f:
        body.append(f.read())

In [7]:
negative_corpus = pd.DataFrame({
    'label': '0',
    'body': body
})

negative_corpus.head()

Unnamed: 0,label,body
0,0,For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni...
1,0,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her ..."
2,0,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the phy..."
3,0,"It's a strange feeling to sit alone in a theater occupied by parents and their rollicking kids. I felt like instead of a movie ticket, I should have been given a NAMBLA membership.<br /><br />Base..."
4,0,"You probably all already know this by now, but 5 additional episodes never aired can be viewed on ABC.com I've watched a lot of television over the years and this is possibly my favorite show, eve..."


In [8]:
#shuffle the positive and negative reviews and reset the index 
full_corpus = pd.concat([positive_corpus, negative_corpus])
data = full_corpus.sample(frac=1).reset_index(drop=True)
data.head(10)

Unnamed: 0,label,body
0,1,"This film is a bit reminiscent of the German film, THE NEVERENDING STORY because a child is magically transported to a strange land in order to be a hero. However, due to far superior modern techn..."
1,0,An old vaudeville team of Willy Clark (Walter Matthau) and Al Lewis (George Burns) were one of the best known but they broke up hating each other. Over 20 years later they agree to get together fo...
2,1,"Burlinson and Thornton give an outstanding performance in this movie, along with Dennehy. Although it is at first thought to be only about love, it really goes down deeper than that. The beauty of..."
3,0,"Louis Sachar's compelling children's classic is about as Disney as Freddy Krueger. It's got murder, racism, facial disfigurement and killer lizards.<br /><br />Tightly plotted, it's a multi-layere..."
4,0,This is one of my favourite martial arts movies from Hong Kong. It is one of John Woo's earliest films and one of only a few traditional martial arts movies he directed. You can see his influences...
5,0,I have complained to ABC about the cancellation of six degrees. If enough people do the same then it could be enough to bring this fabulous show back to life!! Just go onto the official site and t...
6,0,"This Belgian film, directed by Tom Barman, singer of the well-known group dEUS, will not be favoured by everyone. For the simple reason that there isn't a clear story or even a plot. This movie ju..."
7,1,What is so taboo about love?! People seem to have major problems with the transgenered.<br /><br />The title of this movie didn't catch my eye. It was a grainy shot about 4 minutes into the movie ...
8,0,"The interesting aspect of ""The Apprentice"" is it demonstrates that the traditional job interview and resume do not necessarily predict teamwork skills, task dedication, and job performance. And th..."
9,0,"This TV-series was one of the ones I loved when I was a kid. Even though I see it now through the pink-shaded glasses of nostalgia, I can still tell it was a quality show, very educational but sti..."


### Exploring Dataset

In [9]:
# What is the shape of this datataset

print("Input data has {} rows and {} columns".format(len(full_corpus), len(full_corpus.columns)))

Input data has 25000 rows and 2 columns


In [10]:
# How many revies are negative and positive?

print("Out of {} rows, {} are negative, {} are positive".format(len(full_corpus), 
                                                       len(full_corpus[full_corpus['label']=='0']), 
                                                       len(full_corpus[full_corpus['label']=='1'])))

Out of 25000 rows, 12500 are negative, 12500 are positive


In [11]:
# Is there any missing data?
print("Number of null in label: {}".format(full_corpus['label'].isnull().sum()))
print("Number of null in text: {}".format(full_corpus['body'].isnull().sum()))

Number of null in label: 0
Number of null in text: 0


## Pre-processing text data

Clean up the text data by removing unnecesary information with tools such as :
- Removing punctuation
- Tokenization 
- Removing stopwords
- Lemmatizing/Stemming words

### Removing Punctuation

- removing unuseful punctuation from text '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
import string 

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [17]:
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

In [18]:
data['body_nopunct'] = data['body'].apply(lambda x: remove_punct(x))
data.head()

Unnamed: 0,label,body,body_nopunct
0,1,"This film is a bit reminiscent of the German film, THE NEVERENDING STORY because a child is magically transported to a strange land in order to be a hero. However, due to far superior modern techn...",This film is a bit reminiscent of the German film THE NEVERENDING STORY because a child is magically transported to a strange land in order to be a hero However due to far superior modern technolo...
1,0,An old vaudeville team of Willy Clark (Walter Matthau) and Al Lewis (George Burns) were one of the best known but they broke up hating each other. Over 20 years later they agree to get together fo...,An old vaudeville team of Willy Clark Walter Matthau and Al Lewis George Burns were one of the best known but they broke up hating each other Over 20 years later they agree to get together for a T...
2,1,"Burlinson and Thornton give an outstanding performance in this movie, along with Dennehy. Although it is at first thought to be only about love, it really goes down deeper than that. The beauty of...",Burlinson and Thornton give an outstanding performance in this movie along with Dennehy Although it is at first thought to be only about love it really goes down deeper than that The beauty of nat...
3,0,"Louis Sachar's compelling children's classic is about as Disney as Freddy Krueger. It's got murder, racism, facial disfigurement and killer lizards.<br /><br />Tightly plotted, it's a multi-layere...",Louis Sachars compelling childrens classic is about as Disney as Freddy Krueger Its got murder racism facial disfigurement and killer lizardsbr br Tightly plotted its a multilayered interlinking s...
4,0,This is one of my favourite martial arts movies from Hong Kong. It is one of John Woo's earliest films and one of only a few traditional martial arts movies he directed. You can see his influences...,This is one of my favourite martial arts movies from Hong Kong It is one of John Woos earliest films and one of only a few traditional martial arts movies he directed You can see his influences fr...


### Tokenization 
- Split the text body into list of words 

In [19]:
import re # regex

def tokenize(text):
    tokens = re.split('\W+', text)
    return tokens

In [20]:
data['body_tokenized'] = data['body_nopunct'].apply(lambda x: tokenize(x))
data.head()

Unnamed: 0,label,body,body_nopunct,body_tokenized
0,1,"This film is a bit reminiscent of the German film, THE NEVERENDING STORY because a child is magically transported to a strange land in order to be a hero. However, due to far superior modern techn...",This film is a bit reminiscent of the German film THE NEVERENDING STORY because a child is magically transported to a strange land in order to be a hero However due to far superior modern technolo...,"[This, film, is, a, bit, reminiscent, of, the, German, film, THE, NEVERENDING, STORY, because, a, child, is, magically, transported, to, a, strange, land, in, order, to, be, a, hero, However, due,..."
1,0,An old vaudeville team of Willy Clark (Walter Matthau) and Al Lewis (George Burns) were one of the best known but they broke up hating each other. Over 20 years later they agree to get together fo...,An old vaudeville team of Willy Clark Walter Matthau and Al Lewis George Burns were one of the best known but they broke up hating each other Over 20 years later they agree to get together for a T...,"[An, old, vaudeville, team, of, Willy, Clark, Walter, Matthau, and, Al, Lewis, George, Burns, were, one, of, the, best, known, but, they, broke, up, hating, each, other, Over, 20, years, later, th..."
2,1,"Burlinson and Thornton give an outstanding performance in this movie, along with Dennehy. Although it is at first thought to be only about love, it really goes down deeper than that. The beauty of...",Burlinson and Thornton give an outstanding performance in this movie along with Dennehy Although it is at first thought to be only about love it really goes down deeper than that The beauty of nat...,"[Burlinson, and, Thornton, give, an, outstanding, performance, in, this, movie, along, with, Dennehy, Although, it, is, at, first, thought, to, be, only, about, love, it, really, goes, down, deepe..."
3,0,"Louis Sachar's compelling children's classic is about as Disney as Freddy Krueger. It's got murder, racism, facial disfigurement and killer lizards.<br /><br />Tightly plotted, it's a multi-layere...",Louis Sachars compelling childrens classic is about as Disney as Freddy Krueger Its got murder racism facial disfigurement and killer lizardsbr br Tightly plotted its a multilayered interlinking s...,"[Louis, Sachars, compelling, childrens, classic, is, about, as, Disney, as, Freddy, Krueger, Its, got, murder, racism, facial, disfigurement, and, killer, lizardsbr, br, Tightly, plotted, its, a, ..."
4,0,This is one of my favourite martial arts movies from Hong Kong. It is one of John Woo's earliest films and one of only a few traditional martial arts movies he directed. You can see his influences...,This is one of my favourite martial arts movies from Hong Kong It is one of John Woos earliest films and one of only a few traditional martial arts movies he directed You can see his influences fr...,"[This, is, one, of, my, favourite, martial, arts, movies, from, Hong, Kong, It, is, one, of, John, Woos, earliest, films, and, one, of, only, a, few, traditional, martial, arts, movies, he, direct..."
