## Sentiment Analysis on Movie Reviews - Kaggle

"There's a thin line between likably old-fashioned and fuddy-duddy, and The Count of Monte Cristo ... never quite settles on either side."

The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [1]. In their work on sentiment treebanks, Socher et al. [2] used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.




The dataset consists of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId. Each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.

train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
test.tsv contains just phrases. You must assign a sentiment label to each phrase.
The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
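
The file layout described above can be explored with a few lines of standard-library Python. This is only a sketch: the three sample rows are made up to mimic the assumed PhraseId / SentenceId / Phrase / Sentiment column layout, and the names `SAMPLE_TSV`, `LABELS` and `phrases_by_sentence` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Label names as listed above.
LABELS = {0: "negative", 1: "somewhat negative", 2: "neutral",
          3: "somewhat positive", 4: "positive"}

# Made-up rows in the assumed train.tsv column layout (not real data).
SAMPLE_TSV = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\ta charming little film\t3\n"
    "2\t1\tcharming\t3\n"
    "3\t2\tdull and lifeless\t1\n"
)

def phrases_by_sentence(tsv_text):
    """Group (phrase, label name) pairs by SentenceId."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        label = LABELS[int(row["Sentiment"])]
        groups[int(row["SentenceId"])].append((row["Phrase"], label))
    return dict(groups)

grouped = phrases_by_sentence(SAMPLE_TSV)
for sentence_id, phrases in grouped.items():
    print(sentence_id, phrases)
```

Grouping by SentenceId makes it easy to see that a full sentence and its sub-phrases each carry their own label.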


In [1]:
# pipeline
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# classifiers
# cross_val_score lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
# explicit imports instead of wildcard imports
from sklearn.linear_model import (SGDClassifier, Perceptron,
                                  PassiveAggressiveClassifier, RidgeClassifier)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.ensemble import RandomForestClassifier

### Pipeline:
    - read file
    - tokenize into words
    - lowercase
    - remove stopwords
    - stemming
    - add 1-grams, 2-grams and 3-grams to features
    - implemented with the tf-idf vectorizer
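
To make the n-gram step concrete, here is a minimal pure-Python sketch of how word n-grams are built from a token list. The helper `word_ngrams` is hypothetical and only mimics what `ngram_range=(1,3)` asks the vectorizer to do; it is not the vectorizer's own code.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return every contiguous n-gram with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "Not a good movie".lower().split()
features = word_ngrams(tokens)
print(features)
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie',
#  'not a good', 'a good movie']
```

Multi-word features such as "not a good" are what let a bag-of-words model pick up some negation.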

In [2]:
# read both files, skipping the header row
with open('train_rotten_tomatoes.tsv', 'r') as f:
    train_data = [line.rstrip('\n').split('\t') for line in f][1:]
with open('test_rotten_tomatoes.tsv', 'r') as f:
    test_data = [line.rstrip('\n').split('\t') for line in f][1:]

# columns: 0 = PhraseId, 1 = SentenceId, 2 = phrase text, 3 = sentiment (train only)
train_x_raw = [x[2].lower() for x in train_data]
train_y = [int(x[3]) for x in train_data]

paragraph_id = [x[0] for x in test_data]  # PhraseId of each test phrase
test_x_raw = [x[2].lower() for x in test_data]
# the test file has no sentiment column: those labels are what we need to predict

print('size of training data:',len(train_x_raw))
print('size of test data:',len(test_x_raw))
    
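stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    """Return the Porter stem of every token."""
    return [stemmer.stem(item) for item in tokens]

def tokenize(text):
    """Split text into word tokens, then stem them."""
    # nltk.word_tokenize requires NLTK's Punkt tokenizer data (see nltk.download)
    tokens = nltk.word_tokenize(text)
    return stem_tokens(tokens, stemmer)

# sublinear tf scaling, English stop words removed, 1- to 3-gram features
tfidf = TfidfVectorizer(tokenizer=tokenize, sublinear_tf=True, stop_words='english', ngram_range=(1, 3))
train_X = tfidf.fit_transform(train_x_raw)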

size of training data: 156060
size of test data: 66292
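
For readers unfamiliar with the weighting, the sketch below computes sublinear tf-idf by hand on a toy corpus. It assumes the formulas documented for scikit-learn's `TfidfVectorizer` with `sublinear_tf=True` and the defaults `smooth_idf=True` and `norm='l2'`: tf = 1 + ln(count) and idf = ln((1 + n) / (1 + df)) + 1. The function `tfidf_rows` and the toy documents are hypothetical, and this is not the library's implementation.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        weights = {}
        for term, count in Counter(doc).items():
            tf = 1.0 + math.log(count)                            # sublinear tf
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0   # smoothed idf
            weights[term] = tf * idf
        # L2-normalise the row; "or 1.0" guards against an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy, already-stemmed documents (made up for illustration).
docs = [["good", "movi"], ["bad", "movi"], ["good", "good", "act"]]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second toy document, "bad" gets a larger weight than "movi" because it appears in fewer documents.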


### Testing classifiers (for multiclass classification)
    - cross validation to get a gauge of performance
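
As a rough illustration of what 5-fold cross validation does, the sketch below splits the data into k contiguous folds, trains on k-1 of them and scores on the held-out one. The names `k_fold_indices`, `cross_val_mean` and `majority` are hypothetical. Note the simplification: for classifiers, scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds, whereas this sketch uses plain contiguous folds.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_mean(fit_predict, X, y, k=5):
    """Mean accuracy over k folds; fit_predict(train_X, train_y, test_X) -> predictions."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Toy "classifier": always predict the most common training label.
def majority(train_X, train_y, test_X):
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)

X = list(range(10))
y = [2, 2, 2, 2, 2, 2, 3, 1, 2, 4]   # made-up labels
mean_score = cross_val_mean(majority, X, y, k=5)
print(mean_score)  # 0.7
```

Each fold's model sees only four fifths of the training data, which is one reason cross-validated scores can come out lower than the score of a model trained on everything.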

In [3]:
# max_iter replaces the n_iter argument, which was removed in scikit-learn 0.21;
# tol=None keeps the fixed number of passes that n_iter gave
# (the recorded output below comes from an older release and still shows n_iter)
sgd = SGDClassifier(alpha=.0001, max_iter=50, tol=None, penalty="elasticnet")
svc = LinearSVC(dual=False, tol=1e-3)
ridge = RidgeClassifier(tol=1e-2, solver="lsqr")
# ridge wrapped explicitly in one-vs-one (ova_ridge) and one-vs-rest (ovr_ridge) schemes
ova_ridge = OneVsOneClassifier(RidgeClassifier(tol=1e-2, solver="lsqr"))
ovr_ridge = OneVsRestClassifier(RidgeClassifier(tol=1e-2, solver="lsqr"))
#
pas_agg = PassiveAggressiveClassifier(max_iter=50, tol=None)
percep = Perceptron(max_iter=200, tol=None)
multi_nb = MultinomialNB(alpha=.01)
bern_nb = BernoulliNB(alpha=.01)


classifiers = [sgd, svc, ridge, ovr_ridge, ova_ridge, pas_agg, percep, multi_nb, bern_nb]

max_score = 0
best_clf = None
print('* note that cross validation usually gives pessimistic scores')

# keep the classifier with the best mean 5-fold accuracy
for clf in classifiers:
    score = cross_val_score(clf, train_X, train_y, cv=5).mean()
    if score > max_score:
        max_score = score
        best_clf = clf
    print(clf, ': score = ', score)

* note that cross validation usually gives pessimistic scores
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
       penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False) : score =  0.520037157072
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.001,
     verbose=0) : score =  0.524451718822
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, solver='lsqr', tol=0.01) : score =  0.529071809662
OneVsRestClassifier(estimator=RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, solver='lsqr', tol=0.01),
          n_jobs=1) : 

### Submission

In [4]:
# refit the best classifier on all training data and predict the test phrases
best_clf.fit(train_X, train_y)
y = best_clf.predict(tfidf.transform(test_x_raw))

# write one "PhraseId,Sentiment" row per test phrase
with open('submission.txt', 'w') as f:
    f.write('PhraseId,Sentiment\n')
    for phrase_id, label in zip(paragraph_id, y):
        f.write('{},{}\n'.format(phrase_id, label))
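
Before uploading, it can help to sanity-check the submission rows. The sketch below is one possible way to do that with the standard library, assuming the `PhraseId,Sentiment` header used above and labels in the range 0-4; the function `write_submission` and the two example ids are made up for illustration.

```python
import csv
import io

def write_submission(phrase_ids, labels, out):
    """Write PhraseId,Sentiment rows to a file-like object, validating as we go."""
    if len(phrase_ids) != len(labels):
        raise ValueError("one label is needed per phrase")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PhraseId", "Sentiment"])
    for phrase_id, label in zip(phrase_ids, labels):
        if int(label) not in range(5):
            raise ValueError("label {!r} is outside 0-4".format(label))
        writer.writerow([phrase_id, int(label)])

# Hypothetical ids and predictions, written to an in-memory buffer.
buffer = io.StringIO()
write_submission(["156061", "156062"], [2, 3], buffer)
print(buffer.getvalue())
```

Passing an open file instead of the `io.StringIO` buffer writes the same rows to disk.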

### Reflection

- The result is OK for the time I invested: the Kaggle submission accuracy is 0.60033.
- This places it at around 440 out of 860 on the leaderboard. The best result on Kaggle is 0.76, and the top-10 average is 0.70.


- A prepackaged library from Stanford (http://nlp.stanford.edu/sentiment/), which uses deep learning for sentiment analysis on the same 5-category classification used above, yields very good results of 0.65-0.7.

- So trying Keras or another deep learning library might be a good approach.


- I also didn't use Random Forest because of the time it takes to train. That might be interesting to try.


- I should have done more feature extraction:
    - dependency parsing to put all the parts of one sentence together
    - extra features using NLTK's sentiment analysis