### LING1340 Homework 1
Name: **Daniel Zheng**

Email: **daniel.zheng@pitt.edu**

Due Date: **September 5, 2017**

#### Dataset
Large Movie Review Dataset from [here](http://ai.stanford.edu/~amaas/data/sentiment/).

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. *The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*

In [1]:
# useful libraries
import numpy as np
from nltk.tokenize import word_tokenize
import re
from nltk import pos_tag
import glob


In [2]:
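# loading the raw training and test reviews
def load(filepath):
    # filepath is a glob pattern; each matching file holds one review
    files = glob.glob(filepath)
    raw = []
    for file in files:
        # read as UTF-8 explicitly instead of relying on the platform default
        with open(file, encoding='utf-8') as f:
            raw.append(f.read())
    return raw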
train_neg_raw = load('data/aclImdb/train/neg/*')
train_pos_raw = load('data/aclImdb/train/pos/*')
test_neg_raw = load('data/aclImdb/test/neg/*')
test_pos_raw = load('data/aclImdb/test/pos/*')

In [3]:
print(train_neg_raw[:2])

["Wimpy stuffed shirt Armand Louque (blandly played by veteran character actor Dean Jagger in a rare lead role) joins a group of researchers who want to find and destroy the secret technique of creating zombies. Armand falls for the lovely Claire Duval (fetching blonde Dorothy Stone), who uses the meek sap to get Armand's colleague Clifford Grayson (the hopelessly wooden Robert Noland) to marry her. Furious over being used and spurned by Claire, Armand uses his knowledge of voodoo to get revenge. Sound exciting? Well, it sure ain't. For starters, Victor Halperin's static (non)direction lets the meandering and uneventful talk-ridden story plod along at an excruciatingly slow pace. Worse yet, Halperin crucially fails to bring any tension, atmosphere and momentum to the hideously tedious proceedings. The mostly blah acting from a largely insipid cast doesn't help matters any; only George Cleveland as the hearty General Duval and E. Alyn Warren as the irascible Dr. Trevissant manage to enl

In [4]:
print(train_pos_raw[:2])

["All Kira Reed fans MUST see this. The film's premise has struggling romance novelist Kira unable to come up with any new ideas. She's also getting over a divorce. However, she meets this guy at a restaurant and he helps her out of her shell (and clothing). They go into a corner room and they do it. Thankfully, Kira gets a condom out (Now don't ever tell me these Playboy films are worthless piles of soft-core fluff. Remember kids, safe sex). Later, she marvels to her publishist how great it was, but she didn't get his name. Despite this, the guy finds her and they continue their kinky games. But eventually she tires of his sneakiness and wants to know more. When she does, all hell breaks loose, and I'll leave it at that. This is easily the best of these soft-core Playboys films I've seen. Check this out, and marvel at the greatness of Kira.", 'I was not expecting much from this movie, but was very pleasantly surprised, as it is light and funny and very well observed. The central trio 

### Description of Data
The data is split into `train` and `test` sets. Each set contains a `neg` and a `pos` folder holding 12,500 negative and 12,500 positive reviews, respectively. The `train` folder also contains an `unsup` folder with 50,000 unlabeled examples for unsupervised learning.
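
The folder layout described above can be checked with a short sketch like the following. The helper name `count_reviews` is hypothetical, and the root path `data/aclImdb` is assumed to match the one used in the loading cell; if the folder is missing, nothing is printed.

```python
from pathlib import Path

def count_reviews(root):
    """Return {(split, label): number of files} for the folders that exist under root."""
    counts = {}
    for split in ("train", "test"):
        for label in ("neg", "pos", "unsup"):
            folder = Path(root) / split / label
            if folder.is_dir():
                counts[(split, label)] = sum(1 for p in folder.iterdir() if p.is_file())
    return counts

# 'data/aclImdb' is assumed to be where the dataset was unpacked; adjust if yours differs
for key, n in sorted(count_reviews("data/aclImdb").items()):
    print(key, n)
```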

### Processing
A lot of the data contains HTML `<br>` tags, which will have to be cleaned up. I will weight terms using term frequency-inverse document frequency (tf-idf) and train a naive Bayes classifier with scikit-learn. For classifying movie reviews, I suspect adjectives and verbs matter most, because words like "terrible", "amazing", "thrilled", and "enjoyed" intuitively seem more indicative of sentiment. For now, I will keep only those parts of speech, but using the other parts of speech is definitely worth investigating.
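
One way to handle the `<br>` tags is to replace them with spaces before tokenizing. This is only a sketch of that idea, not the cleaning used in this notebook (the `clean` function instead drops any whitespace-separated token containing a disallowed character). The helper name `strip_br` and the sample string are made up for illustration.

```python
import re

# matches <br>, <br/>, and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

def strip_br(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    return " ".join(BR_TAG.sub(" ", text).split())

sample = "Great film.<br /><br />Would watch again."
print(strip_br(sample))  # Great film. Would watch again.
```

Replacing the tag with a space rather than an empty string keeps the words on either side of it from being glued together.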

### Expected problems with approach
Much of the movie review text is background information about the movies, which probably won't help with learning sentiment. Although tf-idf de-emphasizes very common words like "is", it may weight rare background terms even more heavily than common review terms like "terrible" or "fantastic". Since background terms should occur with low frequency, this shouldn't make a huge difference.
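
To make the concern concrete, here is a small sketch of the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1. The document frequencies below are invented for illustration and are not measured from the dataset.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf = ln((1 + n) / (1 + df)) + 1, the smoothed form scikit-learn uses by default."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 25000  # size of the combined training set
# hypothetical document frequencies, chosen only for illustration
for term, df in [("is", 24000), ("terrible", 2000), ("voodoo", 20)]:
    print(term, round(smoothed_idf(n_docs, df), 2))
```

The rarer the term, the larger its idf, so a single occurrence of a rare background word is weighted above a single occurrence of a common sentiment word. The rare word still appears in few reviews, so it affects few predictions.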

In [5]:
# clean up the data: drop tokens containing disallowed characters, then keep only adjectives and verbs
def clean(raw_input):
    # characters a token may contain; note that '-/ inside the class is a range
    # (from ' through /), so ( ) * + are allowed as well
    allowed = re.compile('[a-zA-Z0-9_.,!"\'-/]')
    # drop whitespace-separated tokens containing any other character (e.g. '<br'), then tokenize
    tokens = [word_tokenize(' '.join(x for x in review.split() if allowed.sub('', x) == ''))
              for review in raw_input]
    # part of speech tags
    review_pos = [pos_tag(x) for x in tokens]
    # keep only adjectives (JJ*) and verbs (VB*) for every review
    cleaned = [' '.join(word for (word, pos) in phrase_pos if 'JJ' in pos or 'V' in pos)
               for phrase_pos in review_pos]
    return cleaned


In [6]:
# takes a while because tokenizing + pos_tag is slow
train_neg = clean(train_neg_raw)
train_pos = clean(train_pos_raw)
test_neg = clean(test_neg_raw)
test_pos = clean(test_pos_raw)

In [7]:
# example of what the filtered data looks like
print(train_neg[:2])
print('total negative training samples:', len(train_neg))

['stuffed shirt played veteran rare lead joins want find destroy secret creating falls lovely fetching uses meek get wooden marry Furious being used spurned uses get sure ai static lets uneventful talk-ridden slow Worse fails bring tedious blah acting insipid does help irascible enliven refreshing hammy drippy library obvious are pretty lousy unimpressive feeble fright is crummy uncredited staring great alleviate brain-numbing dull', "other mentioned walked had been wanted stay have left 's think have been good is worst adapted 've seen starts goes say goes goes slow slow are interesting happen is depth underneath 's get is single entire need be entertained love good next add entire do care happens single start hoping die least be interesting watching inexplicable is strange unpredictable think be compelling 's quirky noir-esquire acting hard-boiled recognize talented miscast raising reading bizarre relevant slow"]
total negative training samples: 12500


In [8]:
print(train_pos[:2])
print('total positive training samples:', len(train_pos))

["see has struggling unable come new 's getting meets helps go do gets do tell are worthless soft-core safe marvels publishist great was did get finds continue tires wants know more does leave is best soft-core 've seen Check", 'was expecting was surprised is light funny observed central were likable staggered clowned drug-centred was gentle subtle were witness sympathetic little old rural were captured was chosen scripted great detail have seen realistic alternative felt was light full expect difficult criticise thought last were lame have ended left overall unexpected']
total positive training samples: 12500


In [16]:
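# word frequencies in the negative training reviews, for visualization
from collections import Counter
word_counts = Counter()
for phrase in train_neg:
    # strip punctuation from each word (not from the whole phrase) before counting
    word_counts.update(word.strip('.,?!"\'').lower() for word in phrase.split())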

### Creating labels
Train and test labels are assigned using `0` as the negative class and `1` as the positive class.

In [9]:
train = train_neg + train_pos # concatenate for vectorizing
train_labels = [0]*len(train_neg) + [1]*len(train_pos) # labels
test = test_neg + test_pos
test_labels = [0]*len(test_neg) + [1]*len(test_pos)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
# should be the same as CountVectorizer combined with TfidfTransformer
tfidf = TfidfVectorizer()
train_vectors = tfidf.fit_transform(train)
# already fit to training set, so just transform
test_vectors = tfidf.transform(test)

In [11]:
print(train_vectors.shape)
print(test_vectors.shape)

(25000, 30195)
(25000, 30195)


In [12]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(train_vectors, train_labels)
print(np.mean(classifier.predict(train_vectors) == train_labels))

0.90296


In [13]:
predicted = classifier.predict(test_vectors)
print(np.mean(predicted == test_labels)*100, '% accuracy')

84.308 % accuracy
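
A single accuracy figure does not show which class the errors fall in. A minimal sketch for breaking the predictions down by true and predicted label follows; the helper name `confusion_counts` is hypothetical, and the toy labels are for illustration only, not results from the classifier.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# toy labels for illustration only, not output from the classifier above
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0]
counts = confusion_counts(toy_true, toy_pred)
for (t, p), n in sorted(counts.items()):
    print(f"true={t} predicted={p}: {n}")
```

Calling `confusion_counts(test_labels, predicted)` would give the same breakdown for the real test set.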


### Results
So 84.3% accuracy on the test set is pretty good, though it is noticeably lower than the 90.3% on the training set. Just for fun, I put together my own test set of movie review strings to see how it performs.

In [14]:
custom_test = ["This movie was the worst. I hate it.", "Terrible acting. Negative, bland, uninteresting.", 
               "This movie was great, I really enjoyed the acting!", 
               "Amazing storyline, hilarious characters, and a shocking ending.", 
               "The vague plot was ridiculously boring, and put me to sleep."]
custom_labels = [0,0,1,1,0]
custom_test_vectors = tfidf.transform(clean(custom_test))

In [15]:
custom_predictions = classifier.predict(custom_test_vectors)
print(np.mean(custom_predictions == custom_labels))
print(custom_predictions)

1.0
[0 0 1 1 0]


### Conclusions
This code does a few things:
1. Reads in the movie review data so that each review is one string in a list
2. Preprocesses each review, removing everything but adjectives and verbs
3. Creates train and test tf-idf vectors
4. Fits a naive Bayes classifier to the train vectors
5. Tests the classifier on the test data

So it looks like tf-idf with a multinomial naive Bayes classifier can guess the binary sentiment of a movie review fairly reliably (84.3% test accuracy). This was by no means a comprehensive study, however.

I would have liked to try other classifiers and to experiment with some of the `TfidfVectorizer` parameters. I would also like to understand tf-idf and naive Bayes in more depth, beyond just how to use them in code.
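
As a first step toward that understanding, here is a minimal from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing. It works on raw word counts rather than tf-idf weights, and it is not how scikit-learn implements `MultinomialNB` internally. The function names `train_nb` and `predict_nb` and the toy reviews are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on whitespace-tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in class_docs.items():
        # smoothed denominator: all word occurrences in the class plus alpha per vocabulary word
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior; words outside the vocabulary are ignored."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.lower().split() if w in log_like)
    return max(scores, key=scores.get)

# toy data for illustration only
toy_docs = ["terrible boring dull", "awful boring", "great funny", "amazing great enjoyed"]
toy_labels = [0, 0, 1, 1]
model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, "boring and dull"))   # 0
print(predict_nb(model, "great and funny"))   # 1
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was never seen in one class from driving that class's probability to zero.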

Another interesting project with this data would be to generate positive and negative movie reviews, or to use autoencoders for unsupervised learning, since the dataset conveniently includes a folder of 50,000 unlabeled movie reviews.