**Name:** WANG XI

**EID:** xwang258

**Kaggle Team Name:** eeriee

# CS4487 - Assignment 1 - Movie Review Sentiment Analysis
Due date: Oct 12, 2015 11:59pm HKT

## Goal
In this assignment, the task is to predict the sentiment of a movie review sentence.  For example, the review _"A very charming film with wonderful sentiment and heart"_ is positive, while the review _"this is probably the most irritating show I have ever seen in my entire life"_ is negative.  Your goal is to train a classifier to predict whether a sentence has positive or negative sentiment.


## Methodology
You need to train classifiers using the training data, and then predict on the test data. You are free to choose the feature extraction method and classifier algorithm.  You are free to use methods that were not introduced in class.  You should probably do cross-validation to select good parameters.


## Evaluation on Kaggle

You need to submit your test predictions to Kaggle for evaluation.  50% of the test data will be used to show your ranking on the live leaderboard.  After the assignment deadline, the remaining 50% will be used to calculate your final ranking. The entry with the highest final ranking will win a prize!

To submit to Kaggle you need to create an account, and use the competition invitation that will be posted on Canvas.

**Note:** You can only submit 2 times per day to Kaggle!

## What to hand in
You need to turn in the following things:

1. This ipynb file with your source code and documentation.
2. Your final submission file to Kaggle.

Files should be uploaded to Assignment 1 on Canvas.

## Grading
The marks of the assignment are distributed as follows:
- 50% - Results using various classifiers and feature representations.
- 30% - Trying out feature representations (e.g. adding additional features) or classifiers not used in the tutorials.
- 20% - Quality of the written report.  More points for insightful observations and analysis.
<hr>

# Load the Data

The training data is in the text file `imdb_train.txt`.  The training text data is in the format: `sentence \t label\n`

The label is either 1 for positive sentiment or 0 for negative sentiment.

The testing data is in the text file `imdb_test.txt`. The test text data is in the format: `sentence\n`

Kaggle submission files are CSV files: 

<pre>
Id,Prediction
1,0
2,0
3,1
...
</pre>

Here are two helpful functions for reading the text data and writing the Kaggle submission file.

In [1]:
%matplotlib inline
import IPython.core.display         
# setup output image format (Chrome works best)
IPython.core.display.set_matplotlib_formats("svg")
import matplotlib.pyplot as plt
import matplotlib
from numpy import *
from sklearn import *
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from scipy import stats
import nltk
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
import string
import re
from nltk.stem.snowball import SnowballStemmer
import IPython.utils.warn as warn
import csv
random.seed(100)

In [2]:
def read_text_data(fname):
    f = open(fname, 'r')
    contents = f.read()
    f.close()

    lines = contents.split("\n")
    txtdata = []
    classes = []
    for l in lines:
        if len(l)>0:
            tmp = l.split("\t")
            txtdata.append(tmp[0].strip())
            if (len(tmp)>1):
                classes.append(int(tmp[1]))
    if (len(classes)>0) and (len(txtdata) != len(classes)):        
        warn.error("mismatched length!")
    
    return (txtdata, asarray(classes))

def write_csv_kaggle_sub(fname, Y):
    # fname = file name
    # Y is a list/array with class entries
    
    # header
    tmp = [['Id', 'Prediction']]
    
    # add ID numbers for each Y
    for (i,y) in enumerate(Y):
        tmp2 = [(i+1), int(y)]
        tmp.append(tmp2)
        
    # write CSV file
    f = open(fname, 'wb')
    writer = csv.writer(f)
    writer.writerows(tmp)
    f.close()

# YOUR CODE and DOCUMENTATION HERE

In [3]:
# load the data
(traintxt, trainY) = read_text_data("imdb_train.txt")
(testtxt, _)       = read_text_data("imdb_test.txt")

print len(traintxt)
print len(testtxt)

900
100


In [4]:
tmp = feature_extraction.text.CountVectorizer()
trainXtmp = tmp.fit_transform(traintxt)
print len(tmp.get_feature_names())

2831


With only the default tokenizer, feature extraction yields 2831 features.

In [5]:
stp = feature_extraction.text.CountVectorizer(stop_words = 'english')
trainXstp = stp.fit_transform(traintxt)
print len(stp.get_feature_names())

2600


After excluding the default stop words, there are 2600 features.

In [6]:
#the NB Bernoulli model
alphasb = logspace(-1,0,30)
vocasb = range(1000,3000,200)
avgscoresb = empty((len(alphasb), len(vocasb)))

for i,al in enumerate(alphasb):
    for j,voca in enumerate(vocasb):    
        cntvect = feature_extraction.text.CountVectorizer(stop_words = 'english', max_features=voca)
        trainXb = cntvect.fit_transform(traintxt)
        bmodel = naive_bayes.BernoulliNB(alpha=al)
        myscoreb = cross_validation.cross_val_score(bmodel, trainXb, trainY, cv=5)
        avgscoresb[i,j] = mean(myscoreb)

In [7]:
bestib = argmax(avgscoresb)

(bestiab, bestivb) = unravel_index(bestib, avgscoresb.shape)
bestab = alphasb[bestiab]
bestvb = vocasb[bestivb]
print "vocabulary size = ", bestvb
print "max acc of cross-validation =", avgscoresb[bestiab, bestivb]

cntvect = feature_extraction.text.CountVectorizer(stop_words = 'english', max_features=bestvb)
trainXb = cntvect.fit_transform(traintxt)
bmodel = naive_bayes.BernoulliNB(alpha=bestab)
bmodel.fit(trainXb, trainY)
predTrainYb = bmodel.predict(trainXb)

print "acc of bernoulli = ", mean(predTrainYb == trainY)

vocabulary size =  1200
max acc of cross-validation = 0.807777777778
acc of bernoulli =  0.938888888889


Words such as 'like' that are very common in the documents can still be important, so it is not a good idea to use idf to downscale them. Therefore I use tf rather than tf-idf for feature extraction in the Naive Bayes multinomial model.
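
To make the downscaling concrete, here is a minimal sketch of the smoothed idf weight, assuming the formula that scikit-learn's tf-idf transformer documents for its default settings. The two document frequencies are hypothetical values chosen for illustration; they are not measured on the training data.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed idf, assumed to match scikit-learn's default:
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

n_docs = 900       # number of training sentences
df_common = 90     # hypothetical: a word such as 'like' found in 10% of the sentences
df_rare = 2        # hypothetical: a word found in only two sentences

idf_common = smoothed_idf(n_docs, df_common)
idf_rare = smoothed_idf(n_docs, df_rare)

print("idf of the common word: %.3f" % idf_common)
print("idf of the rare word:   %.3f" % idf_rare)
print("rare / common ratio:    %.2f" % (idf_rare / idf_common))
```

With idf switched on, every occurrence of the common word counts for less than an occurrence of the rare word, even if the common word is the one that carries the sentiment. Setting `use_idf=False` keeps the plain (normalized) term frequency instead.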

In [8]:
#tf
alphas = logspace(-1,0,30)
vocas = range(1000,3000,200)
avgscores = empty((len(alphas), len(vocas)))

for i,al in enumerate(alphas):
    for j,voca in enumerate(vocas):     
        tfvect = feature_extraction.text.TfidfVectorizer(use_idf=False, stop_words = 'english', max_features=voca)
        trainXtf = tfvect.fit_transform(traintxt)
        mmodel_tf = naive_bayes.MultinomialNB(alpha=al)      
        myscore = cross_validation.cross_val_score(mmodel_tf, trainXtf, trainY, cv=5)
        avgscores[i,j] = mean(myscore)


In [9]:
besti = argmax(avgscores)

(bestia, bestiv) = unravel_index(besti, avgscores.shape)
besta = alphas[bestia]
bestv = vocas[bestiv]
print "vocabulary size = ", bestv
print "max acc of cross-validation = ", avgscores[bestia,bestiv]

tfvect = feature_extraction.text.TfidfVectorizer(use_idf=False, max_features=bestv)
trainXtf = tfvect.fit_transform(traintxt)
    
mmodel_tf = naive_bayes.MultinomialNB(alpha=besta)
mmodel_tf.fit(trainXtf, trainY)

predtrainYtf = mmodel_tf.predict(trainXtf)

print "acc of tf-idf = ", mean(predtrainYtf==trainY)

vocabulary size =  1400
max acc of cross-validation =  0.813333333333
acc of tf-idf =  0.948888888889


Cross-validation selects a vocabulary of 1200 features for the Bernoulli model and 1400 for the multinomial model, so more than 1000 of the original features are discarded in each case. Although the training accuracy is high for both models (the NB multinomial model is slightly higher than the NB Bernoulli model), both perform relatively poorly on the test data, with the same 78% accuracy on Kaggle. Some important features have probably been removed.

Next, I try different classifiers to see how they perform on the selected features.

In [10]:
#logistic regression
logreg = linear_model.LogisticRegressionCV(Cs=logspace(-4,1,20), cv=5)
logreg.fit(trainXb, trainY)

# predict from the model
predYreg = logreg.predict(trainXb)

# calculate accuracy
print "acc of lr =", mean(trainY==predYreg)

acc of lr = 0.975555555556


The training accuracy of the logistic regression model is higher than that of the previous two models. However, the test accuracy is only 74% on Kaggle.

While searching for a more robust model, I found that SGD has been successfully applied to the large-scale, sparse machine learning problems often encountered in text classification, so I try an SGD classifier for this problem.

In [11]:
# SGD classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

parameters = {
    'vect__max_features': range(1000,2000,200),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': logspace(-1,0,10),
    #'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 20),
}

In [12]:
if __name__ == "__main__":
    # find the best parameters for both the feature extraction and the
    # classifier
    print("Performing grid search...")
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

    print("pipeline:", [name for name, _ in pipeline.steps])
    grid_search.fit(traintxt, trainY)
        
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done  50 jobs       | elapsed:   12.2s
[Parallel(n_jobs=1)]: Done 200 jobs       | elapsed:   52.1s
[Parallel(n_jobs=1)]: Done 450 jobs       | elapsed:  2.0min
[Parallel(n_jobs=1)]: Done 800 jobs       | elapsed:  3.4min
[Parallel(n_jobs=1)]: Done 1200 out of 1200 | elapsed:  5.1min finished


Performing grid search...
('pipeline:', ['vect', 'tfidf', 'clf'])
Fitting 3 folds for each of 400 candidates, totalling 1200 fits
Best score: 0.708
Best parameters set:
	clf__alpha: 0.21544346900318834
	tfidf__norm: 'l2'
	tfidf__use_idf: True
	vect__max_features: 1000
	vect__ngram_range: (1, 1)


In [13]:
pipeline.set_params(clf__alpha=0.21544346900318834,
        tfidf__norm='l2',
        tfidf__use_idf=True,
        vect__max_features=1000,
        vect__ngram_range=(1, 1))
pipeline.fit(traintxt, trainY)
predPip = pipeline.predict(traintxt) 

print "acc of SGD = ", mean(predPip == trainY)

acc of SGD =  0.874444444444


SGD requires a number of hyperparameters, such as the regularization parameter, so it is hard to tune them well enough to find the best solution. In addition, both the training accuracy and the best cross-validation accuracy are relatively low. As a result, the test accuracy on Kaggle is very low, around 70%.

Switching classifiers therefore did not increase the accuracy. A more promising way to improve it is a better feature extraction method. In the following steps I filter out unimportant features and add new ones, using the NB Bernoulli model to test the performance of the new features.
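
The size of the search reported in the grid search log can be reproduced by enumerating the grid by hand. This is a small sketch in plain Python: the list comprehension for `clf__alpha` is a stand-in for `logspace(-1,0,10)`, and the fold count is read off the "Fitting 3 folds" line of the log rather than set anywhere in the code.

```python
from itertools import product

# Same grid as the `parameters` dict used for the pipeline
grid = {
    "vect__max_features": list(range(1000, 2000, 200)),
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "tfidf__norm": ["l1", "l2"],
    # stand-in for logspace(-1, 0, 10): ten values from 0.1 to 1.0
    "clf__alpha": [10 ** (-1 + k / 9.0) for k in range(10)],
}

names = sorted(grid)
# One dict per combination of parameter values
candidates = [dict(zip(names, values))
              for values in product(*(grid[n] for n in names))]

n_folds = 3  # taken from the log output above
n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print("candidates: %d, fits: %d" % (n_candidates, n_fits))
```

Five vocabulary sizes, two n-gram ranges, two idf settings, two norms and ten alphas give 400 candidates and 1200 fits. Uncommenting the `clf__penalty` and `clf__n_iter` entries would multiply the number of fits by four, which illustrates why an exhaustive search over SGD's hyperparameters becomes expensive quickly.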

In [14]:
print stp.get_stop_words()

frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neit

First, the default stop-word list excludes some important words such as 'no' from the features, which hurts classification accuracy. It is therefore better to build a custom stop-word list instead of using the default one. I grew the list by iteratively inspecting the top features and adding the words that should not be there; the resulting stop-word list is shown below:


In [15]:
stp_list = ['a', 'an', 'the','there', 'theres', 'what', 'where','when', 'which', 'who', 'whom', 'then', 'those', 'which',
            'do','does', 'did', 'had', 'has', 'have', 'can', 'are', 'was', 'were', 'is','be','thats', 'this', 'that', 'such',
            'they',  'he','she',  'shes', 'your','youll','youself', 'youd','youve', 'me','hes', 'i','ill', 'ive', 'id', 'im', 'it',
            'them', 'its', 'themselves','itself','us', 'we', 'him', 'her', 'his','youre','my', 'our', 'you','am', 'having',
            'camera', 'same', 'show', 'some', 'action', 'actor', 'actors', 'actress', 'actresses', 'film', 'films','movie', 'movies', 'people',
            'to', 'written', 'for','by','as', 'also', 'with','of', 'end', 'or', 'and', 'in', 'about', 'into', 'yet',  'now', 'from']

The raw tokens produced by the existing tokenizer function contain punctuation, which is obviously useless for classification, so a new tokenizer is needed. The tokenizer below removes punctuation and then removes the words in the custom stop-word list.

In [16]:
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation]) # remove punctuation
    tokens = nltk.word_tokenize(text)
    filtered_tokens = []
        
    for t in tokens:
        if t not in stp_list:
            filtered_tokens.append(t)
            
    return filtered_tokens
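
To show what such a tokenizer returns, here is a simplified, self-contained sketch. It assumes whitespace splitting with `str.split` in place of `nltk.word_tokenize`, and `demo_stop_words` is a hypothetical five-word subset of `stp_list`, so the output is only indicative of what the real function produces.

```python
import string

# Hypothetical subset of stp_list, for illustration only
demo_stop_words = ["a", "the", "is", "this", "movie"]

def demo_tokenize(text, stop_words):
    # Drop punctuation characters, then split on whitespace
    # (str.split is a stand-in for nltk.word_tokenize here)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return [t for t in text.split() if t not in stop_words]

# CountVectorizer lowercases the text before calling a custom tokenizer,
# so the sample sentence is lowercased by hand here
sample = "This movie is not good, it's a mess!".lower()
tokens = demo_tokenize(sample, demo_stop_words)
print(tokens)
```

Because the apostrophe is stripped along with the other punctuation, "it's" turns into "its". This is why the custom stop-word list contains entries without apostrophes such as 'thats', 'youll' and 'ive'.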

A pair of adjacent words (a bigram) can carry a different sentiment from the two words taken separately (unigrams). For example, 'too many' can be negative, while 'too' and 'many' on their own are likely to appear in positive sentences. Adding bigrams therefore gives the classifier more information.

However, many bigrams are too trivial: they appear only once in the whole training data. I therefore set min_df to 0.002; since 2/900 is slightly larger than 0.002, a feature must occur in at least two training sentences to be kept. This also saves the time otherwise needed to select the best max_features value. The min_df threshold removes some unigram features as well.
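
The effect of combining bigrams with a document-frequency threshold can be sketched in plain Python. The helper names and the three toy documents are made up for illustration, and the filter assumes the usual rule that a term is kept when its document count is at least `min_df` times the number of documents.

```python
import math
from collections import Counter

def unigrams_and_bigrams(tokens):
    # ngram_range=(1, 2): every single token plus every adjacent pair
    pairs = [" ".join(p) for p in zip(tokens, tokens[1:])]
    return tokens + pairs

def build_vocabulary(docs, min_df):
    # docs: list of token lists; min_df: minimum fraction of documents
    doc_freq = Counter()
    for tokens in docs:
        # set() so that a term is counted once per document
        doc_freq.update(set(unigrams_and_bigrams(tokens)))
    min_count = min_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if df >= min_count)

docs = [
    "too many plot holes".split(),
    "too many characters".split(),
    "many great scenes".split(),
]
# min_df=0.5 on 3 documents: a term must appear in at least 2 of them
vocab = build_vocabulary(docs, min_df=0.5)
print(vocab)

# Smallest document count that passes min_df=0.002 on 900 sentences
min_docs_needed = int(math.ceil(0.002 * 900))
print(min_docs_needed)
```

In the toy data only 'many', 'too' and the bigram 'too many' survive; every feature that occurs in a single document is dropped, unigram or bigram alike.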

In [17]:
#the NB Bernoulli model
alphasb = logspace(-1,0,30)
avgscoresb = empty(len(alphasb))

cntvect = feature_extraction.text.CountVectorizer(tokenizer =  tokenize, ngram_range=(1, 2), min_df=0.002)
trainXb = cntvect.fit_transform(traintxt)

for i,al in enumerate(alphasb):       
        bmodel = naive_bayes.BernoulliNB(alpha=al)
        myscoreb = cross_validation.cross_val_score(bmodel, trainXb, trainY, cv=5)
        avgscoresb[i] = mean(myscoreb)

In [18]:
bestib = argmax(avgscoresb)
bestab = alphasb[bestib]


print "max acc of cross-validation =", avgscoresb[bestib]

bmodel = naive_bayes.BernoulliNB(alpha=bestab)
bmodel.fit(trainXb, trainY)
predTrainYb = bmodel.predict(trainXb)

print "acc of bernoulli = ", mean(predTrainYb == trainY)

max acc of cross-validation = 0.823333333333
acc of bernoulli =  0.936666666667


The best cross-validation accuracy is higher than for the previous NB Bernoulli model, while the training accuracy decreases slightly. The test accuracy on Kaggle, however, increases to 84%.

In [19]:
print len(cntvect.get_feature_names())

1210


The number of features is also reduced to 1210 (compared with 2831 for the default tokenizer).

Still, the same word can occur among the features in several different forms. Since the grammatical form of a word is irrelevant to the classifier, it is important to map those forms to a single one. I tried both stemming and lemmatization and found that both increase the accuracy.

Stemming and lemmatization reduce not only the redundant features generated from the training set, but also the chance of encountering words at test time that the algorithm has not been trained on. Since derived forms are mapped to a single form, the practical accuracy increases.
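
Both effects can be illustrated with a deliberately crude suffix stripper. Note that `crude_stem` is a toy stand-in written for this example; it is not the Snowball algorithm, and the token lists are made up.

```python
def crude_stem(word):
    # Toy suffix stripper for illustration only; NOT the Snowball algorithm
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

train_tokens = ["bored", "boring", "acting", "acted", "plot"]
test_tokens = ["bores", "acts", "plots"]

raw_vocab = set(train_tokens)
stem_vocab = set(crude_stem(t) for t in train_tokens)

# Test words that the classifier has never seen, without and with stemming
unseen_raw = [t for t in test_tokens if t not in raw_vocab]
unseen_stemmed = [t for t in test_tokens if crude_stem(t) not in stem_vocab]

print("features without stemming: %d" % len(raw_vocab))
print("features with stemming:    %d" % len(stem_vocab))
print("unseen test words without stemming: %d" % len(unseen_raw))
print("unseen test words with stemming:    %d" % len(unseen_stemmed))
```

The five training tokens collapse into three stems, and all three test words, none of which appears verbatim in the training tokens, map onto stems the model has already seen.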

In [20]:
stemmer = SnowballStemmer("english")
def tokenize_and_stem(text):
    text = "".join([ch for ch in text if ch not in string.punctuation]) # remove punctuation
    tokens = nltk.word_tokenize(text)
    filtered_tokens = []
        
    for t in tokens:
        if t not in stp_list:
            filtered_tokens.append(t)
            
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [21]:
#Lemmatization
from nltk.corpus import wordnet as wn

def is_noun(tag):
    return tag.startswith('N')


def is_verb(tag):
    return tag.startswith('V')


def is_adverb(tag):
    return tag.startswith('R')


def is_adjective(tag):
    return tag.startswith('J')


def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return wn.NOUN

def tokenize_and_lemma(text):
    text = "".join([ch for ch in text if ch not in string.punctuation]) # remove punctuation
    tokens = nltk.word_tokenize(text)
    lemmas = []
    
    filtered_tokens = []
        
    for t in tokens:
        if t not in stp_list:
            filtered_tokens.append(t)
            
    for (t, tag) in nltk.pos_tag(filtered_tokens):
        lemmas.append(WordNetLemmatizer().lemmatize(t, penn_to_wn(tag)))
                
    return lemmas

Take stemming as an example:

In [22]:
#the NB Bernoulli model
alphasb = logspace(-1,0,30)
avgscoresb = empty(len(alphasb))

cntvect = feature_extraction.text.CountVectorizer(tokenizer =  tokenize_and_stem, ngram_range=(1, 2), min_df=0.002)
trainXb = cntvect.fit_transform(traintxt)
testXb = cntvect.transform(testtxt)

for i,al in enumerate(alphasb):       
        bmodel = naive_bayes.BernoulliNB(alpha=al)
        myscoreb = cross_validation.cross_val_score(bmodel, trainXb, trainY, cv=5)
        avgscoresb[i] = mean(myscoreb)

In [23]:
bestib = argmax(avgscoresb)
bestab = alphasb[bestib]


print "max acc of cross-validation =", avgscoresb[bestib]

bmodel = naive_bayes.BernoulliNB(alpha=bestab)
bmodel.fit(trainXb, trainY)
predTrainYb = bmodel.predict(trainXb)
predY = bmodel.predict(testXb)
print "acc of bernoulli = ", mean(predTrainYb == trainY)


max acc of cross-validation = 0.822222222222
acc of bernoulli =  0.944444444444


In [24]:
print len(cntvect.get_feature_names())

1204


The number of features is further reduced to 1204. The training accuracy increases and the test accuracy on Kaggle reaches 90% (the classifier that uses lemmatization to produce features reaches 88%).

Applying the custom stop-word list, the tokenize_and_stem function, and bigrams to the tf model gives the same result.

In [25]:
# write your predictions on the test set 
predY = bmodel.predict(testXb)  # for example

In [26]:
write_csv_kaggle_sub("my_submission.csv", predY)