### LING1340 Homework 1
Name: **Daniel Zheng**

Email: **daniel.zheng@pitt.edu**

Due Date: **September 5, 2017**

#### Dataset
Large Movie Review Dataset from [here](http://ai.stanford.edu/~amaas/data/sentiment/).

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. *The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*

In [1]:
# useful libraries
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import re, glob


In [2]:
# loading training data
def load(filepath):
    files = glob.glob(filepath)
    raw = []
    for file in files:
        with open(file, encoding='utf-8') as f:
            raw.append(f.read())
    return raw
train_neg_raw = load('data/aclImdb/train/neg/*')
train_pos_raw = load('data/aclImdb/train/pos/*')
test_neg_raw = load('data/aclImdb/test/neg/*')
test_pos_raw = load('data/aclImdb/test/pos/*')

In [3]:
print('Negative training sample:', np.random.choice(train_neg_raw, 1))

Negative training sample: [ 'Black and White film. Good photography. Believable characters. <br /><br />Just awful.<br /><br />I have wasted another perfect evening watching a film that other rated as "worthy" and "very good." There is some good acting here and the back ground setting for the plot is good (more should have been done with this) but it is very slow to grow and never develops. It is totally bases on sex without much romance with much un needed nudity. More could have been done with the main characters. If you are looking for something to watch with you family this in not the movie and if not you will have trouble sitting through it. Though this film is long its only about 1 inch deep!']


In [4]:
print('Positive training sample: ', np.random.choice(train_pos_raw, 1))

Positive training sample:  [ '"Mr. Harvey Lights a Candle" is anchored by a brilliant performance by Timothy Spall.<br /><br />While we can predict that his titular morose, up tight teacher will have some sort of break down or catharsis based on some deep down secret from his past, how his emotions are unveiled is surprising. Spall\'s range of feelings conveyed is quite moving and more than he usually gets to portray as part of the Mike Leigh repertory.<br /><br />While an expected boring school bus trip has only been used for comic purposes, such as on "The Simpsons," this central situation of a visit to Salisbury Cathedral in Rhidian Brook\'s script is well-contained and structured for dramatic purposes, and is almost formally divided into acts.<br /><br />We\'re introduced to the urban British range of racially and religiously diverse kids (with their uniforms I couldn\'t tell if this is a "private" or "public" school), as they gather \x96 the rapping black kids, the serious South A

### Description of Data
The data is split into `train` and `test` sets. Each contains a `neg` and a `pos` folder with 12,500 reviews apiece, for 25,000 labeled reviews per split. The `train` folder also contains an `unsup` folder with 50,000 unlabeled reviews for unsupervised learning.

### Processing
Many of the reviews contain HTML `<br />` tags, which will have to be cleaned up. I will weight terms using term frequency-inverse document frequency (tf-idf) and train a naive Bayes classifier with scikit-learn. For classifying movie reviews, I suspect adjectives and verbs are the most important, because words like "terrible", "amazing", "thrilled", and "enjoyed" intuitively seem more indicative of sentiment. For now, I will keep only those parts of speech, but the other parts of speech are definitely worth investigating.
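
As a minimal sketch of the tag cleanup, the snippet below replaces `<br />` tags with a space before any further processing. The names `BR_TAG` and `strip_br_tags` are hypothetical and are not part of this notebook; the `clean` function used later takes a different route and drops any whitespace-separated chunk that contains a disallowed character such as `<` or `>`. The sketch also assumes `<br>` variants are the only markup in the reviews.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
# (assumed to be the only HTML markup present in the reviews).
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def strip_br_tags(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(' ', text)
    return ' '.join(without_tags.split())

# Fragment taken from the negative training sample printed above.
sample = ('Believable characters. <br /><br />Just awful.<br /><br />'
          'I have wasted another perfect evening')
print(strip_br_tags(sample))
# Believable characters. Just awful. I have wasted another perfect evening
```

Replacing the tag with a space rather than an empty string keeps words on either side of the tag, such as "awful." and "I", from being fused together.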

### Expected problems with approach
Much of the movie review text is background information about the movie, which probably won't help with learning sentiment. Even though tf-idf will de-emphasize very common words like "is", it may weight rare background terms more heavily than common reviewing terms like "terrible" and "fantastic". Since each background term should appear in only a few reviews, it shouldn't make a huge difference.
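
To make the concern concrete, here is a small worked example of the idf part of the weighting. It uses the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1. The document frequencies below are hypothetical values chosen only for illustration, not measurements from this dataset, and `smoothed_idf` is a helper name introduced for this sketch.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 25000  # number of labeled training reviews

# Hypothetical document frequencies (number of reviews containing the term).
doc_freqs = {'is': 24000, 'terrible': 1500, 'salisbury': 3}

for term, df in doc_freqs.items():
    print(f'{term:10s} df={df:6d} idf={smoothed_idf(n_docs, df):.2f}')
```

Under these assumed frequencies, the rare background term gets the largest per-occurrence weight, but it only contributes in the handful of reviews where it appears, which is why the effect on the classifier is expected to be small.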

In [5]:
# clean up data: drop chunks with special characters, then keep only adjectives and verbs
def clean(raw_input):
    # Characters allowed in a chunk. Note that '-/ inside the brackets is a
    # range (apostrophe through slash), so ( ) * + are allowed as well.
    allowed = re.compile('[a-zA-Z0-9_.,!"\'-/]')
    cleaned = []
    for review in raw_input:
        # drop every whitespace-separated chunk that contains a disallowed character
        # (this also drops words fused to <br /> tags), then tokenize what is left
        kept = [x for x in review.split() if allowed.sub('', x) == '']
        tokens = word_tokenize(' '.join(kept))
        # keep only adjectives (tags containing JJ) and verbs (tags containing V)
        tagged = pos_tag(tokens)
        cleaned.append(' '.join(word for (word, pos) in tagged if 'JJ' in pos or 'V' in pos))
    return cleaned


In [6]:
# takes a while because tokenizing + pos_tag is slow
train_neg = clean(train_neg_raw)
train_pos = clean(train_pos_raw)
test_neg = clean(test_neg_raw)
test_pos = clean(test_pos_raw)

In [7]:
# example of what the filtered data looks like
print(train_neg[:2])
print('total negative training samples:', len(train_neg))

['stuffed shirt played veteran rare lead joins want find destroy secret creating falls lovely fetching uses meek get wooden marry Furious being used spurned uses get sure ai static lets uneventful talk-ridden slow Worse fails bring tedious blah acting insipid does help irascible enliven refreshing hammy drippy library obvious are pretty lousy unimpressive feeble fright is crummy uncredited staring great alleviate brain-numbing dull', "other mentioned walked had been wanted stay have left 's think have been good is worst adapted 've seen starts goes say goes goes slow slow are interesting happen is depth underneath 's get is single entire need be entertained love good next add entire do care happens single start hoping die least be interesting watching inexplicable is strange unpredictable think be compelling 's quirky noir-esquire acting hard-boiled recognize talented miscast raising reading bizarre relevant slow"]
total negative training samples: 12500


In [8]:
print(train_pos[:2])
print('total positive training samples:', len(train_pos))

["see has struggling unable come new 's getting meets helps go do gets do tell are worthless soft-core safe marvels publishist great was did get finds continue tires wants know more does leave is best soft-core 've seen Check", 'was expecting was surprised is light funny observed central were likable staggered clowned drug-centred was gentle subtle were witness sympathetic little old rural were captured was chosen scripted great detail have seen realistic alternative felt was light full expect difficult criticise thought last were lame have ended left overall unexpected']
total positive training samples: 12500


In [9]:
# some visualizations
from collections import Counter
neg_word_counts = Counter()
pos_word_counts = Counter()
# count each class separately; zip() would silently truncate if the class sizes differed
for neg in train_neg:
    neg_word_counts.update(word.strip('.,?!"\'').lower() for word in neg.split())
for pos in train_pos:
    pos_word_counts.update(word.strip('.,?!"\'').lower() for word in pos.split())

In [10]:
print('100 most common negative words:',neg_word_counts.most_common(100))

100 most common negative words: [('is', 51501), ('was', 27479), ('have', 15414), ('s', 15220), ('are', 15001), ('be', 14262), ('do', 9823), ('has', 7690), ('good', 7014), ('bad', 6696), ('were', 6258), ('had', 6237), ('did', 5831), ('does', 5354), ('see', 5278), ('been', 5056), ('get', 4814), ('make', 4447), ('made', 4284), ('other', 4211), ('think', 3489), ('being', 3339), ('more', 3237), ('first', 3211), ('know', 3205), ('seen', 3166), ('watch', 3072), ('much', 2999), ('say', 2859), ('many', 2825), ('ve', 2748), ('m', 2705), ('little', 2667), ('go', 2481), ('great', 2466), ('watching', 2452), ('going', 2259), ('re', 2205), ('worst', 2199), ('few', 2134), ('want', 2102), ('better', 2080), ('such', 2056), ('same', 2018), ('real', 2003), ('seems', 2000), ('got', 1980), ('least', 1972), ('most', 1957), ('only', 1953), ('funny', 1929), ('old', 1892), ('original', 1882), ('find', 1831), ('give', 1782), ('makes', 1780), ('best', 1745), ('gets', 1692), ('take', 1658), ('whole', 1577), ('thou

In [11]:
print('100 most common positive words:',pos_word_counts.most_common(100))

100 most common positive words: [('is', 58219), ('was', 22565), ('are', 15035), ('s', 13437), ('have', 12948), ('be', 12187), ('has', 9336), ('good', 7245), ('do', 6926), ('great', 5885), ('see', 5712), ('had', 5222), ('does', 4909), ('were', 4897), ('other', 4737), ('did', 4305), ('been', 4157), ('get', 4115), ('best', 3683), ('many', 3663), ('more', 3637), ('made', 3610), ('think', 3513), ('first', 3512), ('seen', 3317), ('being', 3164), ('make', 3104), ('little', 2969), ('watch', 2911), ('know', 2731), ('much', 2660), ('real', 2369), ('makes', 2362), ('say', 2331), ('most', 2253), ('ve', 2234), ('find', 2220), ('go', 2200), ('such', 2156), ('young', 2091), ('own', 1966), ('love', 1943), ('old', 1932), ('same', 1923), ('few', 1891), ('watching', 1815), ('m', 1787), ('bad', 1739), ('going', 1729), ('funny', 1680), ('new', 1670), ('got', 1639), ('saw', 1599), ('seems', 1582), ('like', 1561), ('played', 1548), ('take', 1541), ('excellent', 1532), ('give', 1517), ('want', 1513), ('big', 

### Frequency counts
Looking at the processed data above, words like "is" and "was" are very common, as expected. The negative set also has negative words like "awful", "terrible", and "stupid", while the positive set has words like "perfect", "excellent", and "beautiful". This is a good sign! After applying tf-idf, it should be pretty easy for a classifier to determine sentiment.
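
One rough way to quantify how strongly a word leans toward one class is the log ratio of its two counts. The sketch below uses a handful of counts copied from the `most_common` output above; `log_ratio` and its add-one smoothing are assumptions made for illustration, and the raw counts are not normalized by the total number of words per class.

```python
import math
from collections import Counter

# Counts copied from the most_common output above.
neg_counts = Counter({'is': 51501, 'good': 7014, 'bad': 6696, 'great': 2466, 'funny': 1929})
pos_counts = Counter({'is': 58219, 'good': 7245, 'bad': 1739, 'great': 5885, 'funny': 1680})

def log_ratio(word, pos, neg, smoothing=1):
    """log(pos count / neg count) with add-one smoothing.

    Values above 0 lean positive, values below 0 lean negative.
    """
    return math.log((pos[word] + smoothing) / (neg[word] + smoothing))

# Print the words from most negative-leaning to most positive-leaning.
for word in sorted(neg_counts, key=lambda w: log_ratio(w, pos_counts, neg_counts)):
    print(f'{word:6s} {log_ratio(word, pos_counts, neg_counts):+.2f}')
```

On these counts, "bad" leans most negative and "great" leans most positive, while "good" is close to neutral because it is nearly as frequent in negative reviews as in positive ones.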

### Creating labels
Train and test labels are assigned using `0` as the negative class and `1` as the positive class.

In [12]:
train = train_neg + train_pos # concatenate for vectorizing
train_labels = [0]*len(train_neg) + [1]*len(train_pos) # labels
test = test_neg + test_pos
test_labels = [0]*len(test_neg) + [1]*len(test_pos)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
# should be the same as CountVectorizer combined with TfidfTransformer
tfidf = TfidfVectorizer()
train_vectors = tfidf.fit_transform(train)
# already fit to training set, so just transform
test_vectors = tfidf.transform(test)

In [14]:
print(train_vectors.shape)
print(test_vectors.shape)

(25000, 30197)
(25000, 30197)


In [15]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(train_vectors, train_labels)
# accuracy on the training set (optimistic, since the model has already seen this data)
print(np.mean(classifier.predict(train_vectors) == train_labels))

0.90296


In [16]:
predicted = classifier.predict(test_vectors)
print(np.mean(predicted == test_labels)*100, '% accuracy')

84.308 % accuracy


### Results
84.3% test accuracy is pretty good! Definitely better than expected, though it is about 6 points below the 90.3% training accuracy. Just for fun, I put together my own test set of movie review strings to see how it performs.

In [17]:
custom_test = ["This movie was the worst. I hate it.", "Terrible acting. Negative, bland, uninteresting.", 
               "This movie was great, I really enjoyed the acting!", 
               "Amazing storyline, hilarious characters, and a shocking ending.", 
               "The vague plot was ridiculously boring, and put me to sleep."]
custom_labels = [0,0,1,1,0]
custom_test_vectors = tfidf.transform(clean(custom_test))

In [18]:
custom_predictions = classifier.predict(custom_test_vectors)
print(np.mean(custom_predictions == custom_labels)*100, '% accuracy on custom test set')
print(custom_predictions)

100.0 % accuracy on custom test set
[0 0 1 1 0]


### Conclusions
This code does a few things:

1. Reads in the movie review data so that each review is one string in a list.
2. Preprocesses each review, removing everything but adjectives and verbs.
3. Creates train and test tf-idf vectors.
4. Fits a naive Bayes classifier to the training vectors.
5. Evaluates the classifier on the test data.

So it looks like tf-idf with a multinomial naive Bayes classifier can guess the binary sentiment of a movie review fairly reliably (84.3% test accuracy). This was by no means a comprehensive study, however.

I would have liked to try other classifiers and to experiment with some of the `TfidfVectorizer` parameters. I would also like to understand tf-idf and naive Bayes in more depth, beyond just how to use them in code.
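
As a starting point for that deeper understanding, here is a minimal from-scratch sketch of multinomial naive Bayes with add-alpha smoothing. It is not scikit-learn's implementation: `train_nb` and `predict_nb` are hypothetical helpers, the toy reviews are invented for illustration, and the sketch works on raw word counts rather than the tf-idf weights used in this notebook.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial naive Bayes on whitespace-tokenized docs with add-alpha smoothing."""
    classes = sorted(set(labels))
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {'classes': classes, 'vocab': vocab, 'log_prior': {}, 'log_lik': {}}
    for c in classes:
        # log P(class): fraction of training documents with this label
        model['log_prior'][c] = math.log(labels.count(c) / len(labels))
        # log P(word | class): smoothed share of the class's word occurrences
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model['log_lik'][c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for c in model['classes']:
        scores[c] = model['log_prior'][c] + sum(
            model['log_lik'][c][w] for w in doc.split() if w in model['vocab'])
    return max(scores, key=scores.get)

# Toy data, invented for illustration: 0 = negative, 1 = positive.
docs = ['terrible boring awful', 'boring slow bad', 'great funny excellent', 'excellent great loved']
labels = [0, 0, 1, 1]
model = train_nb(docs, labels)
print(predict_nb(model, 'boring awful'))     # 0
print(predict_nb(model, 'great excellent'))  # 1
```

The "naive" part is the sum of per-word log likelihoods: each word is treated as independent of the others given the class, so the score for a review is just the class prior plus one term per word.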

Another interesting project with this data would be to generate positive and negative movie reviews, or to use autoencoders for unsupervised learning, since the dataset conveniently includes a folder of 50,000 unlabeled movie reviews.