
This notebook uses a bag-of-n-grams approach to sentiment classification on the IMDB review dataset.
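To make the term concrete, here is a minimal sketch of what a bag of n-grams looks like for a short token list. The helper `extract_ngrams`, its `n_max` parameter, and the toy tokens are illustrative assumptions, not code from this notebook.

```python
from collections import Counter

def extract_ngrams(tokens, n_max=2):
    """Return all contiguous n-grams (as tuples) for n = 1..n_max."""
    ngrams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[i:i + n]))
    return ngrams

# Toy input (hypothetical); real reviews are tokenized further below
toy_tokens = ["not", "a", "good", "movie"]

# The "bag" ignores position and only keeps how often each n-gram occurs
bag = Counter(extract_ngrams(toy_tokens, n_max=2))
print(bag)
```

Bigrams such as `("not", "a")` keep a little word order that a plain bag of words would lose.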

# PyTorch

## Data Loading

The dataset was downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/

In [7]:
import os

In [31]:
data_loc = "data/imdb_reviews/"

In [92]:
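def read_txt_files(folder_path):
    """Reads all .txt files in a folder to a list"""

    all_reviews = []
    # Keep only .txt files so stray files in the folder are ignored
    file_list = [name for name in os.listdir(folder_path) if name.endswith(".txt")]
    print(file_list[:10])
    for file_name in file_list:
        # Only the first line is read; this assumes each review is stored on one line
        with open(os.path.join(folder_path, file_name), "r", encoding="utf-8") as f:
            all_reviews.append(f.readline())
    return all_reviews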

In [93]:
import numpy as np

In [94]:
train_pos = read_txt_files(folder_path=data_loc+"train/pos/")

['4715_9.txt', '12390_8.txt', '8329_7.txt', '9063_8.txt', '3092_10.txt', '9865_8.txt', '6639_10.txt', '10460_10.txt', '10331_10.txt', '11606_10.txt']


In [120]:
random_text = np.random.randint(0, high=len(train_pos))  # high is exclusive, so every index can be drawn
print(random_text)
train_pos[random_text]

11487


'I can understand why some people like this movie, and why some people don\'t. For me, though, I really like it, even if I noticed some good bits, and not so impressive bits. The animation was actually excellent, like Charlie\'s dream. The characters were a mixed bag, the best being Anne-Marie, voiced by the late Judith Barsi.(I was physically ill when I read what happened to her) Also, Carface is a very convincing villain,especially voiced by the wonderful Vic Tayback(I particularly loved "Morons I\'m surrounded by Morons") and along with Rasputin and Warren T.Rat is probably the most memorable of all the Don Bluth villains. Charlie and Itchy only just lacked the same sparkle, but I loved King Gator and his song. Some of the film is very haunting, like Annabelle\'s "You can never come come back", which kind of scares me still. Unfortunately, there were some bits I didn\'t like so much. The story had a tendency to become clumsy and unfocused, but Disney\'s Black Cauldron suffered from 

Now that we know how to read all the .txt files in a folder into a list, we do the same for the negative reviews.

In [105]:
train_neg = read_txt_files(folder_path=data_loc+"train/neg/")

['1821_4.txt', '10402_1.txt', '1062_4.txt', '9056_1.txt', '5392_3.txt', '2682_3.txt', '3351_4.txt', '399_2.txt', '10447_1.txt', '10096_1.txt']


In [106]:
print("Positive examples = " + str(len(train_pos)))
print("Negative examples = " + str(len(train_neg)))

Positive examples = 12500
Negative examples = 12500


We are going to use the same function later on to load our test data.

## Data Preparation

### Labeling the training dataset

In [107]:
np.ones((len(train_pos),), dtype=int)

array([1, 1, 1, ..., 1, 1, 1])

In [108]:
np.zeros((len(train_neg),), dtype=int)

array([0, 0, 0, ..., 0, 0, 0])
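The two label arrays above are displayed but not stored. As a minimal sketch of one way to pair reviews with labels, the snippet below uses plain Python lists and toy stand-ins for `train_pos` and `train_neg`; the names `train_texts` and `train_labels` are hypothetical and do not appear elsewhere in this notebook.

```python
# Toy stand-ins for train_pos and train_neg (hypothetical data)
toy_pos = ["great film", "loved it"]
toy_neg = ["terrible plot"]

# Positives first, then negatives, so labels line up with the texts by index
train_texts = toy_pos + toy_neg
train_labels = [1] * len(toy_pos) + [0] * len(toy_neg)

# Sanity check: exactly one label per review
assert len(train_texts) == len(train_labels)
print(list(zip(train_texts, train_labels)))
```

Because the positives and negatives end up in two contiguous blocks, the combined data would normally be shuffled before training.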

### Removing HTML tags

In [121]:
import re

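def cleanhtml(raw_html):
    """Strip HTML tags such as <br /> from a review."""
    # Non-greedy match, so each tag is removed individually
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext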

In [122]:
train_pos[random_text]

'I can understand why some people like this movie, and why some people don\'t. For me, though, I really like it, even if I noticed some good bits, and not so impressive bits. The animation was actually excellent, like Charlie\'s dream. The characters were a mixed bag, the best being Anne-Marie, voiced by the late Judith Barsi.(I was physically ill when I read what happened to her) Also, Carface is a very convincing villain,especially voiced by the wonderful Vic Tayback(I particularly loved "Morons I\'m surrounded by Morons") and along with Rasputin and Warren T.Rat is probably the most memorable of all the Don Bluth villains. Charlie and Itchy only just lacked the same sparkle, but I loved King Gator and his song. Some of the film is very haunting, like Annabelle\'s "You can never come come back", which kind of scares me still. Unfortunately, there were some bits I didn\'t like so much. The story had a tendency to become clumsy and unfocused, but Disney\'s Black Cauldron suffered from 

In [123]:
train_pos_clean = [cleanhtml(x) for x in train_pos]
train_neg_clean = [cleanhtml(x) for x in train_neg]

In [124]:
train_pos_clean[random_text]

'I can understand why some people like this movie, and why some people don\'t. For me, though, I really like it, even if I noticed some good bits, and not so impressive bits. The animation was actually excellent, like Charlie\'s dream. The characters were a mixed bag, the best being Anne-Marie, voiced by the late Judith Barsi.(I was physically ill when I read what happened to her) Also, Carface is a very convincing villain,especially voiced by the wonderful Vic Tayback(I particularly loved "Morons I\'m surrounded by Morons") and along with Rasputin and Warren T.Rat is probably the most memorable of all the Don Bluth villains. Charlie and Itchy only just lacked the same sparkle, but I loved King Gator and his song. Some of the film is very haunting, like Annabelle\'s "You can never come come back", which kind of scares me still. Unfortunately, there were some bits I didn\'t like so much. The story had a tendency to become clumsy and unfocused, but Disney\'s Black Cauldron suffered from 
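
The visible part of the sampled review contains no markup, so the effect of the tag removal is hard to see. Here is a small sketch on made-up strings; `strip_tags` and `TAG_PATTERN` are illustrative names that mirror `cleanhtml`, and the sample sentences are not taken from the dataset.

```python
import re

# Same pattern as cleanhtml: non-greedy, so matching stops at the first '>'
TAG_PATTERN = re.compile(r"<.*?>")

def strip_tags(raw_html):
    """Remove anything that looks like an HTML tag."""
    return TAG_PATTERN.sub("", raw_html)

sample = "A great film.<br /><br />I would watch it again."
cleaned = strip_tags(sample)
print(cleaned)

# A greedy pattern would swallow the text between two tags as well
inline = "Good <i>acting</i> overall."
non_greedy_result = strip_tags(inline)
greedy_result = re.sub(r"<.*>", "", inline)
print(non_greedy_result)
print(greedy_result)
```

Since tags are replaced with an empty string, the two sentences in `sample` end up fused at the dot. The punctuation step in the next section separates them again.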

### Replacing dots, question marks & parentheses with spaces

In [127]:
"asdasdasds.asdasda".replace("."," ")

'asdasdasds asdasda'

In [137]:
def remove_dqmp(review):
    """Replace dots, question marks and parentheses (dqmp) with spaces."""

    review = review.replace("."," ")
    review = review.replace("?"," ")
    review = review.replace(")"," ")
    review = review.replace("("," ")
    
    return review

In [138]:
remove_dqmp(train_pos_clean[random_text])

'Streisand fans only familiar with her work from the FUNNY GIRL film onwards need to see this show to see what a brilliant performer Streisand WAS - BEFORE she achieved her goal of becoming a Movie Star  There had never been a female singer quite like her ever before, and there never would be again  sorry, Celine - only in your dreams! , but never again would Streisand sing with the vibrancy, energy, and, above all, the ENTHUSIASM and VULNERABILITY with which she performs here - by the time she gets to that Central Park concert only 2 or 3 years later, she\'d been filming FUNNY GIRL in Hollywood and her performing style has become less spontaneous and more reserved, more rehearsed  and, let\'s face it: more angry  - there\'s a wall between her and the audience  Live performing was never what she really enjoyed - she did it because she knew it was her ticket to Hollywood, and once she no longer had to do it she\'s done it as little as possible  and oh, that legendary stage fright provid

In [139]:
train_pos_clean = [remove_dqmp(x) for x in train_pos_clean]
train_neg_clean = [remove_dqmp(x) for x in train_neg_clean]
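
The four chained `replace` calls could also be written as a single pass with a translation table. This is a sketch of an alternative, not the method used in this notebook; `DQMP_TABLE` and `remove_dqmp_translate` are hypothetical names.

```python
# Map each unwanted character to a single space; the table is built once
DQMP_TABLE = str.maketrans({".": " ", "?": " ", "(": " ", ")": " "})

def remove_dqmp_translate(review):
    """Replace dots, question marks and parentheses with spaces in one pass."""
    return review.translate(DQMP_TABLE)

def remove_dqmp_chained(review):
    """Reference version using chained str.replace calls."""
    for char in ".?()":
        review = review.replace(char, " ")
    return review

example = "Great (really great). Right?"
translated = remove_dqmp_translate(example)
print(repr(translated))
```

Both versions leave runs of consecutive spaces behind. That is harmless here as long as the tokenizer's whitespace tokens are filtered out afterwards.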

## Tokenization

In [140]:
import spacy
import string

# Load spaCy's small English pipeline (tokenizer, tagger, parser, NER); only its tokenization is used here
tokenizer = spacy.load('en_core_web_sm')
punctuations = string.punctuation

# lowercase and remove punctuation
def tokenize(sent):
    tokens = tokenizer(sent)
    return [token.text.lower() for token in tokens if (token.text not in punctuations)]
    #return [token.text.lower() for token in tokens]

In [141]:
random_text = np.random.randint(0, high=len(train_pos))  # high is exclusive, so every index can be drawn
print(random_text)

3090


In [142]:
# Example
tokens = tokenize(train_pos_clean[random_text])
print(tokens)

['the', 'year', '1995', 'when', 'so', 'many', 'people', 'talked', 'about', 'the', 'great', 'premiere', 'of', 'braveheart', 'by', 'mel', 'gibson', 'also', 'saw', 'another', 'very', 'fine', 'yet', 'underrated', 'movie', 'on', 'scottish', 'history', 'rob', 'roy', ' ', 'although', 'it', 'is', 'a', 'very', 'different', 'film', 'especially', 'due', 'to', 'the', 'historical', 'period', 'the', 'story', 'is', 'set', 'in', 'rob', 'roy', 'has', 'much', 'in', 'common', 'not', 'only', 'with', 'marvelous', 'braveheart', 'but', 'also', 'with', 'the', 'very', 'spirit', 'of', 'epic', 'movies', 'it', 'is', 'a', 'film', 'that', 'discusses', 'similar', 'themes', 'like', 'fight', 'for', 'dignity', 'courage', 'honor', 'revenge', 'family', 'being', 'a', 'key', 'to', 'happiness', ' ', 'it', 'also', 'leads', 'us', 'to', 'the', 'very', 'bliss', 'of', 'scottish', 'highlands', 'where', 'the', 'human', 'soul', 'finds', 'its', 'rest', 'being', 'surrounded', 'by', 'all', 'grandeur', 'of', 'nature', ' ', 'robert', 'r
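
The output above still contains `' '` tokens. One reason is that the filter in `tokenize` tests membership against `string.punctuation`, which is a plain string, so `in` performs a substring check. The sketch below compares that with a set-based check; `punct_set` is a hypothetical alternative and the sample tokens are made up.

```python
import string

punctuations = string.punctuation      # a str: `in` means "is a substring of"
punct_set = set(string.punctuation)    # a set: `in` means "is exactly one of"

results = {}
for tok in [",", "!!", "()", " ", "movie"]:
    results[tok] = (tok in punctuations, tok in punct_set)
    print(repr(tok), results[tok])
```

A single space is not punctuation under either check, which is why whitespace tokens pass the filter and need to be removed separately.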

### Remove blank space tokens