# 1. Introduction

In this post, we are going to use multiple classifiers to separate positive reviews from negative ones. With scikit-learn, it is convenient to train many different classifiers with very little code.

For classification tasks, we can use Naive Bayes, Logistic Regression, and other algorithms to train on the data. First, I will give a brief introduction to the principles of these algorithms. Then I will apply them to the dataset to see which one performs best. Two different feature sets will be used for training: the first is the 5000 most frequent words, and the second is the 5000 most frequent adjectives and adverbs.
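
To make the idea of a "most frequent words" feature set concrete, here is a minimal sketch. The helper names `top_words` and `find_features`, the toy `documents` list, and the cutoff `k=3` are illustrative assumptions and not the code used to build the real data; the sketch only assumes that each review ends up as a dictionary mapping vocabulary words to `True`/`False`, paired with its label.

```python
from collections import Counter

# Toy stand-in for the review corpus: (tokens, label) pairs (hypothetical data)
documents = [
    (["a", "great", "and", "moving", "film"], "pos"),
    (["a", "dull", "and", "idiotic", "film"], "neg"),
    (["great", "cast", "great", "story"], "pos"),
]

def top_words(docs, k):
    """Return the k most frequent words across all documents."""
    counts = Counter(word for tokens, _ in docs for word in tokens)
    return [word for word, _ in counts.most_common(k)]

def find_features(tokens, vocabulary):
    """Map every vocabulary word to True/False: does it occur in this review?"""
    present = set(tokens)
    return {word: word in present for word in vocabulary}

vocabulary = top_words(documents, k=3)  # the post uses the top 5000 instead
featuresets = [(find_features(tokens, vocabulary), label)
               for tokens, label in documents]

print(vocabulary)
print(featuresets[1])
```

Because every dictionary is built from the same `vocabulary` list, all reviews share the same keys in the same order, which matters for classifiers that expect fixed-length feature vectors.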

In Training Part II, I will use a multilayer perceptron to do the classification.

# 2. Classifiers for Classification

In this section, we are going to use the Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, SGD, SVC, Linear SVC, and NuSVC classifiers to do the classification. More information about these classifiers can be found here: [Classification - Supervised Learning](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning)

## 2.1 Naive Bayes

In scikit-learn, Naive Bayes refers to a set of supervised learning algorithms based on applying Bayes' theorem with **the "naive" assumption of conditional independence between every pair of features** given the value of the class variable. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:

$P(y|x_1...x_n) = \frac{P(y)P(x_1...x_n|y)}{P(x_1...x_n)}$

In the above formula, calculating $P(x_1...x_n|y)$ directly is very difficult. But using the "naive" assumption, we get:

$P(x_i|y,x_1,...,x_{i-1},x_{i+1},...,x_{n}) = P(x_i|y) \quad i = 1...n$

Then the relationship can be simplified to:

$P(y|x_1...x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1...x_n)}$

The denominator $P(x_1...x_n)$ is constant for a given input, so it does not affect which class scores highest. We can therefore get $y$ by maximizing $P(y)\prod_{i=1}^{n}P(x_i|y)$, which means:

\begin{equation}
	y = \mathop{\arg\max}_{y}P(y)\prod_{i=1}^{n}P(x_i|y) 
\end{equation}
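
The decision rule above can be sketched in a few lines of plain Python. The probabilities in `prior` and `likelihood` are hand-picked, hypothetical numbers for two binary word features, not values estimated from the review data. The sketch sums logarithms instead of multiplying probabilities, which gives the same $\arg\max$ while avoiding numerical underflow when there are thousands of features.

```python
import math

# Hypothetical class priors P(y)
prior = {"pos": 0.5, "neg": 0.5}
# Hypothetical likelihood[label][word] = P(word is present | label)
likelihood = {
    "pos": {"great": 0.8, "idiotic": 0.1},
    "neg": {"great": 0.3, "idiotic": 0.6},
}

def score(label, features):
    """Compute log P(y) + sum_i log P(x_i | y) for one class."""
    total = math.log(prior[label])
    for word, present in features.items():
        p = likelihood[label][word]
        # A word that is absent contributes 1 - P(word is present | label)
        total += math.log(p if present else 1.0 - p)
    return total

def classify(features):
    """Return the class with the highest log-score (the arg max over y)."""
    return max(prior, key=lambda label: score(label, features))

review = {"great": True, "idiotic": False}
print(classify(review))
```

For this review the unnormalized scores are $0.5 \cdot 0.8 \cdot 0.9 = 0.36$ for `pos` and $0.5 \cdot 0.3 \cdot 0.4 = 0.06$ for `neg`, so the rule picks `pos`.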

The Naive Bayes classifier has many versions. They differ in how they approximate $P(x_i|y)$. To classify with the version explained above, we can use the **NaiveBayesClassifier** in nltk:

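In [7]:
%%time
import nltk

# Train NLTK's Naive Bayes classifier on (feature_dict, label) pairs
classifier = nltk.NaiveBayesClassifier.train(training_set)
print('Naive Bayes Algo accuracy percent: ', (nltk.classify.accuracy(classifier, testing_set))*100)
# List the 15 features that best separate the two classes
classifier.show_most_informative_features(15)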

Naive Bayes Algo accuracy percent:  68.0
Most Informative Features
               ludicrous = True              neg : pos    =     10.6 : 1.0
               addresses = True              pos : neg    =      9.7 : 1.0
                  hudson = True              neg : pos    =      9.6 : 1.0
                 idiotic = True              neg : pos    =      9.2 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
                   dread = True              pos : neg    =      9.1 : 1.0
                   vocal = True              pos : neg    =      9.1 : 1.0
                    scum = True              pos : neg    =      8.4 : 1.0
                  feeble = True              neg : pos    =      7.0 : 1.0
                predator = True              neg : pos    =      7.0 : 1.0
               strongest = True              pos : neg    =      6.6 : 1.0
               hawthorne = True              pos : neg    =      6.4 : 1.0
                 cunning = True  

Another version, Gaussian Naive Bayes, assumes that each feature follows a normal distribution within each class. This method approximates $P(x_i|y)$ by:

$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)$

The parameters $\sigma_y$ and $\mu_y$ are estimated by maximum likelihood.
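
As a rough illustration of the Gaussian estimate, the sketch below fits the mean and variance of a single feature within a single class by maximum likelihood and then evaluates the density. The function names and the sample values are hypothetical; in practice the parameters are estimated separately for each feature and each class.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood estimates of the mean and variance of one feature in one class."""
    mu = sum(values) / len(values)
    # The maximum-likelihood variance divides by N, not N - 1
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def gaussian_likelihood(x, mu, var):
    """Evaluate the normal density with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical values of one feature for the training examples of one class
values_for_class = [1.0, 2.0, 3.0, 4.0]
mu, var = fit_gaussian(values_for_class)

print(mu, var)
print(gaussian_likelihood(2.5, mu, var))
```

Note that a feature that takes the same value in every example of a class has zero variance, which makes this density undefined. scikit-learn's `GaussianNB` guards against this with its `var_smoothing` parameter; the sketch above does not.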

In [49]:
%%time
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

# Each example is a (feature_dict, label) pair, but GaussianNB needs plain feature vectors.
# This relies on every feature dict listing the same words in the same order.
training_review = [list(features.values()) for features, _ in training_set]
training_target = [label for _, label in training_set]
test_review = [list(features.values()) for features, _ in testing_set]
test_target = [label for _, label in testing_set]

predictions = gnb.fit(training_review, training_target).predict(test_review)

# Count how many predictions match the true labels
count = sum(pred == target for pred, target in zip(predictions, test_target))

print('The Gaussian NB accuracy rate is: ', (count/len(predictions))*100, '%')

The Gaussian NB accuracy rate is:  60.0 %
Wall time: 612 ms


In contrast, Multinomial Naive Bayes estimates $P(x_i|y)$ as follows:

$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}$

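where $N_{yi}$ is the number of times feature $i$ appears in samples of class $y$ in the training set, $N_y$ is the total count of all features for class $y$, $n$ is the number of features, and $\alpha \ge 0$ is a smoothing parameter that prevents zero probabilities for features never seen in a class. For this movie review data set, we can also use the multinomial NB classifier to train on the data and make predictions.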

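In [50]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Wrap the scikit-learn estimator so it accepts NLTK-style (feature_dict, label) pairs
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print('MNB_classifier accuracy percent: ', (nltk.classify.accuracy(MNB_classifier, testing_set))*100)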

MNB_classifier accuracy percent:  70.0
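
To see what the smoothed multinomial estimate does, here is a small hand computation. The function name `smoothed_estimates`, the three-word vocabulary, and the token list are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

def smoothed_estimates(class_tokens, vocabulary, alpha=1.0):
    """Return theta_hat[word] = (N_yi + alpha) / (N_y + alpha * n) for one class."""
    counts = Counter(t for t in class_tokens if t in vocabulary)
    n_y = sum(counts.values())  # N_y: total count of all features in this class
    n = len(vocabulary)         # n: number of features
    return {word: (counts[word] + alpha) / (n_y + alpha * n) for word in vocabulary}

vocabulary = ["great", "dull", "film"]
# All tokens from the (hypothetical) positive reviews, concatenated
pos_tokens = ["great", "film", "great", "film", "film"]

theta_pos = smoothed_estimates(pos_tokens, vocabulary, alpha=1.0)
print(theta_pos)
```

With $\alpha = 1$, the word "dull" never occurs in the positive tokens but still receives probability $(0 + 1)/(5 + 3) = 0.125$ rather than zero, and the three estimates still sum to one.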


The last Naive Bayes approach we are going to introduce is the Bernoulli Naive Bayes classifier, which is based on the multivariate Bernoulli distribution. There may be multiple features, but each one is assumed to be **a binary-valued variable**. This classifier estimates $P(x_i|y)$ by:

$P(x_i|y) = P(i|y)x_i + (1-P(i|y))(1-x_i)$

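Here $i$ denotes a feature and $P(i|y)$ is the probability that feature $i$ occurs in class $y$. Unlike the multinomial variant, this rule explicitly penalizes the absence of a feature. The implementation of this method on the movie review corpus is given below: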

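In [51]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

# Wrap the scikit-learn estimator so it accepts NLTK-style (feature_dict, label) pairs
Bernoulli_classifier = SklearnClassifier(BernoulliNB())
Bernoulli_classifier.train(training_set)
print('Bernoulli_classifier accuracy percent: ', (nltk.classify.accuracy(Bernoulli_classifier, testing_set))*100)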

Bernoulli_classifier accuracy percent:  70.0
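
The Bernoulli rule can also be checked by hand. In the sketch below, `p_given_pos` holds hypothetical values of $P(i|y)$ for three words in one class, and each review is a 0/1 vector marking which of those words occur. Both the function name and the numbers are illustrative assumptions.

```python
def bernoulli_likelihood(x, p):
    """Multiply P(x_i | y) = p_i * x_i + (1 - p_i) * (1 - x_i) over all features."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        result *= p_i * x_i + (1.0 - p_i) * (1 - x_i)
    return result

# Hypothetical P(i | y) for three words in one class
p_given_pos = [0.8, 0.1, 0.5]

print(bernoulli_likelihood([1, 0, 1], p_given_pos))
print(bernoulli_likelihood([1, 1, 1], p_given_pos))
```

The first review omits the second word, which is rare in this class, so that feature contributes $1 - 0.1 = 0.9$ and the product is $0.8 \cdot 0.9 \cdot 0.5 = 0.36$. The second review contains the rare word and scores only $0.8 \cdot 0.1 \cdot 0.5 = 0.04$.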


## 2.2 Logistic Regression