# 1. Introduction

In this post, we are going to use multiple classifiers to separate the positive reviews from the negative ones. With scikit-learn, it is very convenient to train many more classifiers with much less code.

For classification tasks, we can use Naive Bayes, Logistic Regression, and other algorithms to train on the data. First, I will give a brief introduction to the principles of these algorithms. Then I will apply them to the dataset so that we can see which one performs best. As features, I am going to use the 5000 most frequent words. If you chose adjectives and adverbs instead, the accuracy would probably be higher. You can reuse the code below, but don't forget to change the feature sets.

In Training Part II, I will use a multilayer perceptron to do the classification.

# 2. Classifiers for Classification

In this section, we are going to use the Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, SGD, SVC, Linear SVC, and NuSVC classifiers to do the classification. More information about these classifiers can be found here: [Classification - Supervised Learning](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning)

## 2.1 Naive Bayes

In scikit-learn, Naive Bayes refers to a set of supervised learning algorithms based on applying Bayes' theorem with **the "naive" assumption of conditional independence between every pair of features**, given the value of the class variable. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:

\begin{equation}
    P(y|x_1...x_n) = \frac{P(y)P(x_1...x_n|y)}{P(x_1...x_n)}
\end{equation}

In the formula above, computing $P(x_1...x_n|y)$ directly is very difficult. Using the "naive" assumption, however, we get:

\begin{equation}
    P(x_i|y,x_1,...,x_{i-1},x_{i+1},...,x_{n}) = P(x_i|y) \quad i = 1...n
\end{equation}

Then the relationship can be simplified to:

\begin{equation}
    P(y|x_1...x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1...x_n)}
\end{equation}

For a given input, the denominator $P(x_1...x_n)$ is constant, so we can obtain the predicted class $\hat{y}$ by maximizing $P(y)\prod_{i=1}^{n}P(x_i|y)$ over $y$, which means:

\begin{equation}
	\hat{y} = \mathop{\arg\max}_{y}P(y)\prod_{i=1}^{n}P(x_i|y)
\end{equation}
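
To make the arg max rule concrete, here is a minimal sketch in plain Python. The priors and likelihoods are made-up numbers for two hypothetical binary features, not values estimated from the movie review data, and the function names are my own. It works with sums of logarithms instead of a product of probabilities, which avoids underflow when there are thousands of features.

```python
import math

# Hypothetical toy numbers, chosen only for illustration.
prior = {"pos": 0.5, "neg": 0.5}
# P(x_i = True | y) for two made-up binary features.
likelihood = {
    "pos": {"astounding": 0.30, "insipid": 0.05},
    "neg": {"astounding": 0.05, "insipid": 0.25},
}

def log_score(label, features):
    """Return log P(y) + sum_i log P(x_i | y) for the given label."""
    score = math.log(prior[label])
    for name, present in features.items():
        p_true = likelihood[label][name]
        score += math.log(p_true if present else 1.0 - p_true)
    return score

def predict(features):
    """Pick the label with the highest log score, i.e. the arg max."""
    return max(prior, key=lambda label: log_score(label, features))

review = {"astounding": True, "insipid": False}
print(predict(review))
```

Because the logarithm is monotonic, the label that maximizes the log score is the same one that maximizes the original product.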

The Naive Bayes classifier has many versions, which differ in how they approximate $P(x_i|y)$. To classify with the version explained above, just use the **NaiveBayesClassifier** in nltk:

In [9]:
%%time

classifier = nltk.NaiveBayesClassifier.train(training_set)
print('Naive Bayes Algo accuracy percent: ', (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

Naive Bayes Algo accuracy percent:  75.0
Most Informative Features
                  avoids = True              pos : neg    =     12.5 : 1.0
              astounding = True              pos : neg    =     11.9 : 1.0
             fascination = True              pos : neg    =     10.5 : 1.0
                fairness = True              neg : pos    =      8.8 : 1.0
                predator = True              neg : pos    =      8.2 : 1.0
                  feeble = True              neg : pos    =      8.2 : 1.0
             overwrought = True              neg : pos    =      6.9 : 1.0
            breathtaking = True              pos : neg    =      6.7 : 1.0
                 insipid = True              neg : pos    =      6.5 : 1.0
                    noah = True              pos : neg    =      6.4 : 1.0
               hawthorne = True              pos : neg    =      6.4 : 1.0
                supports = True              pos : neg    =      6.4 : 1.0
                    mena = True  

Another version, Gaussian Naive Bayes, assumes that the likelihood of each feature is Gaussian. This method approximates $P(x_i|y)$ by:

\begin{equation}
P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)
\end{equation}

The parameters $\sigma_y$ and $\mu_y$ are estimated by maximum likelihood.
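
As a rough illustration of the formula and of the maximum likelihood estimates, the sketch below fits $\mu_y$ and $\sigma_y^2$ for a single feature from a handful of made-up values and then evaluates the density. The helper names and the numbers are my own assumptions, and this is not how scikit-learn implements it internally.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood mean and variance (divides by N, not N - 1)."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical values of one feature for the training samples of one class.
feature_values = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, var = fit_gaussian(feature_values)
print(mu, var)
print(gaussian_pdf(3.0, mu, var))
```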

In [10]:
%%time

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

# Each item is a (feature_dict, label) pair. GaussianNB expects plain
# feature vectors, so keep only the dict values; this assumes every
# dict lists the same features in the same order.
training_review = [list(features.values()) for features, label in training_set]
training_target = [label for features, label in training_set]
test_review = [list(features.values()) for features, label in testing_set]
test_target = [label for features, label in testing_set]

predictions = gnb.fit(training_review, training_target).predict(test_review)

# Count the predictions that match the true labels.
count = sum(1 for predicted, actual in zip(predictions, test_target) if predicted == actual)

print('The Gaussian NB accuracy rate is: ', (count/len(predictions))*100, '%')

The Gaussian NB accuracy rate is:  51.0 %
Wall time: 593 ms


Multinomial Naive Bayes, in contrast, estimates $P(x_i|y)$ as follows:

\begin{equation}
    \hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
\end{equation}

where $N_{yi}$ is the number of times feature $i$ appears in samples of class $y$ in the training set, $N_y$ is the total count of all features for class $y$, $n$ is the number of features, and $\alpha$ is a smoothing parameter. For this movie review dataset, we can also use the multinomial NB classifier to train on the data and make predictions:

In [11]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)                           
print('MNB_classifier accuracy percent: ', (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

MNB_classifier accuracy percent:  76.0
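
To see what the smoothed estimate does, here is a small sketch that applies the formula to made-up word counts for one class. The counts and the function name are hypothetical. Note how a word that never appears in the class still gets a non-zero probability.

```python
def smoothed_estimates(counts, alpha=1.0):
    """Return (N_yi + alpha) / (N_y + alpha * n) for every feature i."""
    n = len(counts)               # n: number of features
    total = sum(counts.values())  # N_y: total count over all features
    return {word: (c + alpha) / (total + alpha * n) for word, c in counts.items()}

# Hypothetical word counts for one class.
counts_pos = {"great": 3, "boring": 0, "plot": 1}
theta = smoothed_estimates(counts_pos, alpha=1.0)
print(theta)
```

With these counts, $N_y = 4$ and $n = 3$, so every denominator is 7 and the estimates still sum to one.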


The last kind of Naive Bayes approach we are going to introduce is the Bernoulli Naive Bayes classifier, which is based on the multivariate Bernoulli distribution. There may be multiple features, but each one is assumed to be **a binary-valued variable**. This classifier estimates $P(x_i|y)$ by:

\begin{equation}
    P(x_i|y) = P(i|y)x_i + (1-P(i|y))(1-x_i)
\end{equation}

Here $i$ denotes a feature. Applying this method to the movie review corpus gives:

In [12]:
Bernoulli_classifier = SklearnClassifier(BernoulliNB())
Bernoulli_classifier.train(training_set)                           
print('Bernoulli_classifier accuracy percent: ', (nltk.classify.accuracy(Bernoulli_classifier, testing_set))*100)

Bernoulli_classifier accuracy percent:  74.0
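
The Bernoulli rule is easy to check by hand. The sketch below multiplies the per-feature terms for one sample, using made-up values of $P(i|y)$; the function name is my own. Unlike the multinomial version, a feature that is absent ($x_i = 0$) also contributes, through the factor $1 - P(i|y)$.

```python
def bernoulli_likelihood(x, p):
    """Product over features of p_i * x_i + (1 - p_i) * (1 - x_i)."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        result *= p_i * x_i + (1.0 - p_i) * (1 - x_i)
    return result

# Hypothetical P(i | y) for three binary features of one class.
p_given_y = [0.8, 0.1, 0.5]
print(bernoulli_likelihood([1, 0, 1], p_given_y))
```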


## 2.2 Logistic Regression

Logistic regression, also known as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier, is a linear model for classification. In this model, the probabilities describing the possible outcomes of a single trial are modeled using the following logistic function:

\begin{equation}
    f(x) = \frac{L}{1 + e^{-k(x-x_0)}}
\end{equation}

where $L$ is the curve's maximum value, $x_0$ is the x-value of the sigmoid's midpoint, and $k$ is the steepness of the curve. More information about this function can be found here: [logistic function](https://en.wikipedia.org/wiki/Logistic_function)

For logistic regression, we use the following form of the logistic function, with $L = 1$, $x_0 = 0$, and the linear combination $\theta^T x$ in the exponent:

\begin{equation}
    h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}
\end{equation}

This function is also called the sigmoid function, and it is widely used as an activation function in neural networks.
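
Here is a minimal sketch of the sigmoid and the resulting hypothesis, taking the maximum value $L$ to be 1 as is usual for a sigmoid. The function names are my own, and the two branches are only there to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function with L = 1, k = 1, x_0 = 0, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def hypothesis(theta, x):
    """h_theta(x): the sigmoid of the dot product of theta and x."""
    return sigmoid(sum(t * x_i for t, x_i in zip(theta, x)))

print(sigmoid(0.0))
print(hypothesis([1.0, -1.0], [2.0, 2.0]))
```

The output always lies between 0 and 1, so it can be read as the predicted probability of the positive class.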

The cost function for logistic regression is:

\begin{equation}
    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}\left[y_i \log h_{\theta}(x_i) + (1 - y_i)\log(1-h_{\theta}(x_i))\right]
\end{equation}

Why do we choose this cost function, and how do we derive it?

First, if we just compute the estimate $h_{\theta}(x)$ and plug it into the ordinary squared-error cost function:

\begin{equation}
    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x_i) - y_i)^2
\end{equation}

then $J(\theta)$ is non-convex, so gradient descent is not guaranteed to find the global minimum. Hence we change the cost function. Setting aside the factor $\frac{1}{m}$ and the summation, the cost of a single prediction is defined as:

\begin{equation}
cost=
\begin{cases}
-\log(h_{\theta}(x))& \text{if } y=1\\
-\log(1-h_{\theta}(x))& \text{if } y=0
\end{cases}
\end{equation}

Since $y$ takes only two possible values ($y=1$ and $y=0$), the two cases can be written as the single expression $-y\log h_{\theta}(x) - (1-y)\log(1-h_{\theta}(x))$, which is exactly the term inside the summation of the logistic regression cost function listed above.
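
A quick way to convince yourself that the two forms agree is to compute both. The sketch below uses made-up predicted probabilities and labels, and the function names are my own.

```python
import math

def case_cost(h, y):
    """Piecewise cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def combined_cost(h, y):
    """Single-expression form used inside J(theta)."""
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))

def cost(predictions, labels):
    """Average cost J over m samples."""
    m = len(labels)
    return sum(combined_cost(h, y) for h, y in zip(predictions, labels)) / m

# Hypothetical predicted probabilities and true labels.
preds = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
print(cost(preds, labels))
```

A confident wrong prediction, such as $h_{\theta}(x) = 0.1$ when $y = 1$, is penalized much more heavily than a confident right one.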

In scikit-learn, we can use LogisticRegression() to construct the classifier. A classifier constructed this way can fit **binary**, **One-vs-Rest**, or **multinomial logistic regression** with optional **L2** or **L1** regularization. Applying logistic regression to the movie reviews dataset gives:

In [3]:
logistic_classifier = SklearnClassifier(LogisticRegression())
logistic_classifier.train(training_set)                           
print('logistic_classifier accuracy percent: ', (nltk.classify.accuracy(logistic_classifier, testing_set))*100)

logistic_classifier accuracy percent:  67.0


## 2.3 SGD Classifier

This classifier implements regularized linear models, such as linear SVM and logistic regression, trained with the stochastic gradient descent (SGD) algorithm. With its default hinge loss, it fits a linear SVM.

In [4]:
SGD_classifier = SklearnClassifier(SGDClassifier())
SGD_classifier.train(training_set)                           
print('SGD_classifier accuracy percent: ', (nltk.classify.accuracy(SGD_classifier, testing_set))*100)

SGD_classifier accuracy percent:  71.0
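
To show the idea behind stochastic gradient descent, here is a minimal sketch in plain Python that trains a linear classifier on the hinge loss with a small L2 penalty, updating the weights after every single sample. It is only an illustration under my own assumptions (a fixed learning rate, a toy dataset, hypothetical function names) and not scikit-learn's actual implementation.

```python
import random

def sgd_hinge(samples, labels, epochs=50, lr=0.1, alpha=0.0001, seed=0):
    """Train a linear classifier with SGD on the L2-regularized hinge loss.

    Labels must be +1 or -1. Returns (weights, bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
            # The gradient of the L2 penalty shrinks the weights a little.
            w = [w_j * (1.0 - lr * alpha) for w_j in w]
            if margin < 1.0:
                # Hinge loss is active: step towards classifying x correctly.
                w = [w_j + lr * y * x_j for w_j, x_j in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]
w, b = sgd_hinge(X, y)
print([predict(w, b, x) for x in X])
```

Swapping the hinge loss for the logistic loss in the update step would give logistic regression trained the same way.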


## 2.4 SVM

SVC Classifier actually means C-support vector 