# Sentiment Analysis code

Using Unigram and Bigram classifier

    1. NLTK
    2. SKLEARN
    
The main idea was to use an already developed human language interpreter, like NLTK (Natural Language Toolkit). I've tried this to save some time.
In this case, I also thought Bigram classifier would be one of the best things to use, specially in the negation cases. I've tried Unigram and Bigram separately, but the results were not so good. The best results (92% match) I got combining both classifiers.
The only change I've made was in the input files. I've merged them into one single file ("review_all.txt") and, at some point in the code, I shuffle the positive and negative reviews.
I also included in the files one extra information. One variable (called "sentiment") that is "1" if the review is positive and "0" if the review is negative.
Since the data is labeled, my first attempt was to use supervised machine learning.

In [14]:
import pandas as pd       
#opening the text file with the reviews
data = pd.read_csv("review_all.txt", header=0, delimiter="\t", quoting=2) 
# 10660 movie reviews (5330 positive and 5330 negative)
print data.shape # number of lines and "columns"(10660, 2) 
print data["review"][0]       # Check out the review - only first line
# in this file, a positive review means a sentiment = 1 and a 
# negative review means a sentiment = 0
print data["sentiment"][0]    # Check out the sentiment - only first line

(10660, 2)
the rock is destined to be the 21st century's new  conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal ."
1 


In [15]:
import random

review_data = zip(data["review"], data["sentiment"])
# just shuffle the data
random.shuffle(review_data)
# ~75% for training and ~25% for testing
train_X, train_y = zip(*review_data[2660:])
test_X, test_y = zip(*review_data[:8000])
#
# now our data set is ready to be used

In [16]:
#####################################################
# using NLTK (Natural Language Toolkit)
# a library that works with human language data
#####################################################
#
# tokenizer interface - used to divide a string into
# substrings by splitting on the specified string
# Ex: 
# from nltk.tokenize import TweetTokenizer
# tknzr = TweetTokenizer()
# s0 = "This is a cooool #dummysmiley: :-)"
# tknzr.tokenize(s0)
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)']
from nltk import word_tokenize
#
# "Convert a collection of text documents to a matrix of token counts" taken from 
# http://scikit-learn.org
# It belongs to scikit-learn utilities to extract numerical features from text content  
from sklearn.feature_extraction.text import CountVectorizer
#
# Pipeline - useful when there is a fixed sequence of steps in processing the data:
# you only have to call fit and predict once on your data to fit a whole 
# sequence of estimators.
from sklearn.pipeline import Pipeline
#
# LinearSVC - Linear Support Vector Classification
# Useful for labeled data, since it is a supervised learning model
# It learns from the training examples and builds a model that 
# assigns new examples to one category or the other
from sklearn.svm import LinearSVC

unigram_bigram_sap = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   # using ngram_range = (1, 2) one obtain
                                   # the unigram and bigram model
                                   ngram_range=(1, 2),
                                   tokenizer=word_tokenize,)),
    ('classifier', LinearSVC())
])
 
# Fit the model
unigram_bigram_sap.fit(train_X, train_y) 
# Apply transforms, and score with the final estimator
unigram_bigram_sap.score(test_X, test_y) 

# Check the feature names
# It is not important. Just if you are curious, like me!!!!
#print unigram_bigram_sap.named_steps['vectorizer'].get_feature_names()

0.92612499999999998