# Some Baselines for Sentiment Analysis

A good starting point for understanding recent work in sentiment analysis and text classification is 
[_Baselines and Bigrams: Simple, Good Sentiment and Topic Classification_](http://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf) by Sida Wang and Christopher D. Manning. In this notebook, I'll implement the models described in that paper and try to reproduce their results on several datasets.


| AthR  | XGraph | BbCrypt |   CR   |  IMDB  |  MPQA  | RT-2k  |  RT-s  |  subj  |         Model |
|-------|:------:|:-------:|:------:|:------:|:------:|:------:|:------:|:------:|--------------:|
| 85.13 |  91.19 |  99.40  |  79.97 |  86.59 |  86.27 |  85.85 |  79.03 |  93.56 |    MNB-bigram |
| 84.99 |  89.96 |  99.29  |  79.76 |  83.55 |  85.29 |  83.45 |  77.94 |  92.58 |   MNB-unigram |
| 83.73 |  86.17 |  97.68  |  80.85 |  89.16 |  86.72 |  87.40 |  77.72 |  91.74 |    SVM-bigram |
| 82.61 |  85.14 |  98.29  |  79.02 |  86.95 |  86.15 |  86.25 |  76.23 |  90.84 |   SVM-unigram |
| 87.66 |  90.68 |  99.50  |  81.75 |  91.22 |  86.32 |  89.45 |  79.38 |  93.18 |  NBSVM-bigram |
| 87.94 |  91.19 |  99.70  |  80.45 |  88.29 |  85.25 |  87.80 |  78.05 |  92.40 | NBSVM-unigram |

[peng](http://nlp.stanford.edu/wiki/Software/Classifier/Sentiment)


## Loading the Datasets

The _Baselines and Bigrams_ paper runs its sentiment analysis experiments on several datasets.
In this section I'll show how to prepare these datasets for training and evaluating classifiers.

### RT-s

The dataset consists of 10,662 single-sentence movie review snippets (5,331 positive and 5,331 negative) and was introduced in 
[Pang and Lee, 2005](http://www.aclweb.org/anthology/P05-1015).

### RT-2k

The dataset consists of 2,000 full-length movie reviews and was introduced in 
[Pang and Lee, 2004](http://www.aclweb.org/anthology/P04-1035).
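
As a starting point, here is a minimal sketch of a loader for a polarity dataset such as RT-2k. It assumes the archive has already been unpacked into a directory with `pos/` and `neg/` subfolders of `.txt` files; that layout, the function name `load_polarity_dir`, and the choice to replace undecodable bytes are my assumptions, not something the dataset prescribes.

```python
import glob
import os


def load_polarity_dir(data_directory):
    """Read every .txt file under pos/ and neg/ and return (texts, labels)."""
    texts, labels = [], []
    for label, subdir in ((1, 'pos'), (0, 'neg')):
        pattern = os.path.join(data_directory, subdir, '*.txt')
        # Sort the paths so the document order is reproducible across runs.
        for path in sorted(glob.glob(pattern)):
            # errors='replace' keeps the loader from failing on odd bytes.
            with open(path, encoding='utf-8', errors='replace') as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels
```

The returned `texts` list can be passed to a vectorizer constructed with its default `input='content'`, and `labels` converted to a NumPy array for training.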



## Multinomial Naive Bayes (MNB)
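
The paper's MNB baseline is a linear classifier whose weights are the smoothed log-count ratios of the two classes, applied to binarized features. Below is a minimal NumPy sketch of that idea, assuming a dense count matrix and labels in {0, 1}. The function names `fit_mnb` and `predict_mnb` are mine, not from the paper or any library.

```python
import numpy as np


def fit_mnb(X, y, alpha=1.0):
    """Return (w, b) for a dense count matrix X and binary labels y in {0, 1}."""
    X = (np.asarray(X) > 0).astype(float)   # binarize the counts
    p = alpha + X[y == 1].sum(axis=0)       # smoothed counts, positive class
    q = alpha + X[y == 0].sum(axis=0)       # smoothed counts, negative class
    w = np.log(p / p.sum()) - np.log(q / q.sum())   # log-count ratio
    b = np.log((y == 1).sum() / (y == 0).sum())     # log ratio of class sizes
    return w, b


def predict_mnb(X, w, b):
    """Predict 1 where the linear score w.x + b is positive, else 0."""
    X = (np.asarray(X) > 0).astype(float)
    return (X @ w + b > 0).astype(int)


# Toy data: feature 0 appears only in positive rows, feature 2 only in negative rows.
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 2],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

w, b = fit_mnb(X, y)
pred = predict_mnb(X, w, b)
```

On this toy data the weights come out as `[log 3, 0, -log 3]`, the bias is 0 because the classes are balanced, and all four rows are classified correctly.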

## NBSVM

There are several implementations of NBSVM available; for example:

* Sida Wang's original [implementation](https://github.com/sidaw/nbsvm) in Matlab.  
* A Python [version](https://github.com/mesnilgr/nbsvm) by Grégoire Mesnil.  
* Daniel Pressel's [version](https://github.com/dpressel/nbsvm-xl) in Java.  

I'll follow the beautiful [implementation](https://github.com/Joshua-Chin/nbsvm) in scikit-learn by Joshua Chin.
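
Before the code, it helps to write down what the class computes. The summary below is my reading of the paper's formulation, using the class's conventions (labels 1 and 0, smoothing `alpha`, interpolation weight `beta`). Here $\mathbf{f}^{(i)}$ is the feature vector of document $i$, $N_1$ and $N_0$ are the class sizes, and $\mathbf{w}$ is the weight vector of a linear SVM trained on the rescaled features $\mathbf{r} \circ \mathbf{f}^{(i)}$:

$$
\begin{aligned}
\mathbf{p} &= \alpha + \sum_{i:\,y^{(i)}=1} \mathbf{f}^{(i)}, \qquad
\mathbf{q} = \alpha + \sum_{i:\,y^{(i)}=0} \mathbf{f}^{(i)}, \\
\mathbf{r} &= \log\frac{\mathbf{p}/\lVert\mathbf{p}\rVert_1}{\mathbf{q}/\lVert\mathbf{q}\rVert_1}, \qquad
b = \log\frac{N_1}{N_0}, \\
\mathbf{w}' &= (1-\beta)\,\bar{w} + \beta\,\mathbf{w}, \qquad
\bar{w} = \lVert\mathbf{w}\rVert_1 / |V| .
\end{aligned}
$$

The class folds $\mathbf{r}$ back into the stored coefficients, so its `coef_` corresponds to $\mathbf{r} \circ \mathbf{w}'$ and can be applied directly to the unscaled feature matrix.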

In [3]:
import numpy as np

from scipy.sparse import spmatrix, coo_matrix

from sklearn.base import BaseEstimator
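try:
    # Recent scikit-learn releases keep these mixins in the private `_base` module.
    from sklearn.linear_model._base import LinearClassifierMixin, SparseCoefMixin
except ImportError:
    # Older releases exposed them from `sklearn.linear_model.base`.
    from sklearn.linear_model.base import LinearClassifierMixin, SparseCoefMixin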
from sklearn.svm import LinearSVC

class NBSVM(BaseEstimator, LinearClassifierMixin, SparseCoefMixin):

    def __init__(self, alpha=1, C=1, beta=0.25, fit_intercept=False):
        self.alpha = alpha
        self.C = C
        self.beta = beta
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        if len(self.classes_) == 2:
            coef_, intercept_ = self._fit_binary(X, y)
            self.coef_ = coef_
            self.intercept_ = intercept_
        else:
            coef_, intercept_ = zip(*[
                self._fit_binary(X, y == class_)
                for class_ in self.classes_
            ])
            self.coef_ = np.concatenate(coef_)
            self.intercept_ = np.array(intercept_).flatten()
        return self

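    def _fit_binary(self, X, y):
        # Smoothed feature counts per class; binary labels are assumed to be 0/1.
        p = np.asarray(self.alpha + X[y == 1].sum(axis=0)).flatten()
        q = np.asarray(self.alpha + X[y == 0].sum(axis=0)).flatten()
        # Log-count ratio of the L1-normalized counts, and log ratio of class sizes.
        r = np.log(p/np.abs(p).sum()) - np.log(q/np.abs(q).sum())
        b = np.log((y == 1).sum()) - np.log((y == 0).sum())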

        if isinstance(X, spmatrix):
            indices = np.arange(len(r))
            r_sparse = coo_matrix(
                (r, (indices, indices)),
                shape=(len(r), len(r))
            )
            X_scaled = X * r_sparse
        else:
            X_scaled = X * r

        lsvc = LinearSVC(
            C=self.C,
            fit_intercept=self.fit_intercept,
            max_iter=10000
        ).fit(X_scaled, y)

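        # Interpolate between the mean weight magnitude and the SVM weights,
        # folding r back in so the coefficients apply to the unscaled X.
        mean_mag = np.abs(lsvc.coef_).mean()
        coef_ = (1 - self.beta) * mean_mag * r + self.beta * (r * lsvc.coef_)
        intercept_ = (1 - self.beta) * mean_mag * b + self.beta * lsvc.intercept_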

        return coef_, intercept_

In [8]:
import glob
import os
import string

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

def load_imdb(data_directory='/home/data/sentiment-analysis-and-text-classification/aclImdb'):
    print("Vectorizing Training Text")
    
    train_pos = glob.glob(os.path.join(data_directory, 'train', 'pos', '*.txt'))
    train_neg = glob.glob(os.path.join(data_directory, 'train', 'neg', '*.txt'))

    token_pattern = r'\w+|[%s]' % string.punctuation

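    # `input` must be passed by keyword in current scikit-learn releases;
    # 'filename' makes the vectorizer read each path from disk.
    vectorizer = CountVectorizer(
        input='filename',
        ngram_range=(1, 3),
        token_pattern=token_pattern,
        binary=True
    )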
    X_train = vectorizer.fit_transform(train_pos+train_neg)
    y_train = np.array([1]*len(train_pos)+[0]*len(train_neg))

    print("Vocabulary Size: %s" % len(vectorizer.vocabulary_))
    print("Vectorizing Testing Text")

    test_pos = glob.glob(os.path.join(data_directory, 'test', 'pos', '*.txt'))
    test_neg = glob.glob(os.path.join(data_directory, 'test', 'neg', '*.txt'))

    X_test = vectorizer.transform(test_pos + test_neg)
    y_test = np.array([1]*len(test_pos)+[0]*len(test_neg))

    return X_train, y_train, X_test, y_test
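
To make the feature extraction concrete, here is a standard-library sketch of roughly what the vectorizer settings above produce for a single document: lowercased tokens (words or single punctuation marks), all n-grams up to length 3, and presence rather than counts. It imitates the behavior rather than calling `CountVectorizer`, and the helper name `binary_ngrams` is hypothetical.

```python
import re
import string

# Same token pattern as in load_imdb: a run of word characters or one punctuation mark.
TOKEN_PATTERN = r'\w+|[%s]' % string.punctuation


def binary_ngrams(text, ngram_range=(1, 3)):
    """Return the set of word n-grams present in `text` (presence only, no counts)."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    lo, hi = ngram_range
    features = set()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            features.add(' '.join(tokens[i:i + n]))
    return features


features = binary_ngrams("Not bad, not bad!")
```

For this sentence the set holds 12 features: 4 unigrams, 4 bigrams, and 4 trigrams. The repeated bigram `not bad` is stored once, which is the effect of `binary=True`. Keeping punctuation as tokens is also why the trigram vocabulary on the full IMDB data grows so large.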

In [10]:
X_train, y_train, X_test, y_test = load_imdb()

Vectorizing Training Text
Vocabulary Size: 4996192
Vectorizing Testing Text


In [6]:
mnbsvm = NBSVM()
mnbsvm.fit(X_train, y_train)

NBSVM(C=1, alpha=1, beta=0.25, fit_intercept=False)

In [11]:
print('Test Accuracy: %s' % mnbsvm.score(X_test, y_test))

Test Accuracy: 0.92032
