**To use Naive Bayes in sklearn, we need a feature matrix. Here, each distinct word that appears in any of the reviews becomes a feature.**

**So if there are 1800 reviews (900 positive and 900 negative) and the entire corpus has 45142 distinct words, the feature matrix will have 1800 rows and 45142 columns.**

**We could build it on our own or use CountVectorizer.**
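
**As a rough illustration of the two options, the sketch below builds a small count matrix by hand and checks it against CountVectorizer. The three-document corpus and the names `docs`, `vocab`, `col`, and `manual` are made up for this example. Note that CountVectorizer returns a SciPy sparse matrix rather than a dense numpy array.**

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the reviews (hypothetical data).
docs = ["good movie good plot", "bad movie", "good acting bad plot"]

# By hand: one column per distinct word, one row per document.
vocab = sorted({word for doc in docs for word in doc.split()})
col = {word: j for j, word in enumerate(vocab)}
manual = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        manual[i, col[word]] = count

# With CountVectorizer: the same counts, stored as a sparse matrix.
vectorizer = CountVectorizer()
sparse = vectorizer.fit_transform(docs)

print(vocab)
print(manual)
print((sparse.toarray() == manual).all())
```

**The two agree here because every toy word has at least two characters. By default CountVectorizer lowercases the text and ignores one-character tokens, so on real text its vocabulary can be smaller than a naive word count.**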

In [1]:
import glob
import os
from collections import defaultdict
import re
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

In [2]:
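def processFile(filename):
    # Read one review and return its words as a list.
    with open(filename, 'r') as f:
        content = f.read()
    # Drop everything except letters, spaces and newlines.
    # Note: the range A-z also lets through [ \ ] ^ _ and `; use A-Za-z for letters only.
    content = re.sub(r'[^A-z \n]', '', content)
    return content.split()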
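
**The character class `[^A-z \n]` is worth a closer look. In ASCII, the range `A-z` also covers the six characters that sit between `Z` and `a`. The sketch below compares it with `[^A-Za-z \n]` on a made-up string; `sample` and both result names are only for illustration.**

```python
import re

# Made-up text containing an underscore, brackets, a backtick, digits and '%'.
sample = "A_fine [film] it`s 100% fun\n"

# Pattern as used in processFile: keeps [ \ ] ^ _ ` in addition to letters.
kept_by_A_z = re.sub(r'[^A-z \n]', '', sample)

# Stricter alternative: keeps only letters, spaces and newlines.
letters_only = re.sub(r'[^A-Za-z \n]', '', sample)

print(repr(kept_by_A_z))
print(repr(letters_only))
```

**Switching to the stricter pattern would change the tokens, so the feature counts reported in this notebook would probably shift slightly.**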

**We read through all the documents and build a list of word lists, one per review:**

In [3]:
path1 = '/Users/vsenguttuvan/Downloads/movie_reviews/txt_sentoken/pos'
path2 = '/Users/vsenguttuvan/Downloads/movie_reviews/txt_sentoken/neg'
content = []
for filename in glob.glob(os.path.join(path1, '*.txt')):
    content.append(processFile(filename))
for filename in glob.glob(os.path.join(path2, '*.txt')):
    content.append(processFile(filename))

In [4]:
# join each word list back into one string, since CountVectorizer expects raw text
alter = [' '.join(c) for c in content]

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
features = CountVectorizer().fit_transform(alter)

In [19]:
features.shape

(1800, 44840)

**We set the labels: 1 for positive reviews and 0 for negative reviews.**

In [8]:
label = [1]*900 + [0]*900

**As we have often done before, we split the data into training and test sets:**

In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(features,label)
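
**By default `train_test_split` holds out 25% of the rows and shuffles differently on every run, so the scores below will vary from run to run. A minimal sketch of a reproducible, class-balanced split follows, assuming a recent scikit-learn; the toy `X` and `y` are stand-ins, and the `random_state` and `stratify` settings are a suggestion, not what this notebook used.**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same 50/50 class balance as the reviews (hypothetical).
X = np.arange(80).reshape(40, 2)
y = [1] * 20 + [0] * 20

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)
print(sum(y_te), len(y_te) - sum(y_te))
```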

**We choose Multinomial Naive Bayes:**

In [21]:
from sklearn import naive_bayes
from sklearn.metrics import accuracy_score, classification_report

model = naive_bayes.MultinomialNB()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.804
             precision    recall  f1-score   support

          0       0.80      0.80      0.80       216
          1       0.81      0.81      0.81       234

avg / total       0.80      0.80      0.80       450
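
**In the cells above the vocabulary is learned from all 1800 reviews before the split, so words that occur only in test reviews still get a column. One possible alternative, sketched here on made-up sentences, is to chain the vectorizer and the classifier with `make_pipeline` so the vocabulary is learned from the training documents only. The names `train_docs`, `train_labels`, and `clf` are illustrative.**

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written training set (hypothetical): 1 = positive, 0 = negative.
train_docs = ["great fun film", "great acting", "boring plot", "boring dull film"]
train_labels = [1, 1, 0, 0]

# fit() learns the vocabulary and the word probabilities from train_docs only.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

# predict() takes raw text; words unseen during training are ignored.
predictions = clf.predict(["great film", "dull plot"])
print(predictions)
```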



In [22]:
# Use both single words and bigrams (pairs of adjacent words) as features.
features = CountVectorizer(ngram_range=(1, 2)).fit_transform(alter)

In [23]:
features.shape

(1800, 509488)
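
**The jump in column count comes from `ngram_range=(1, 2)`, which adds every pair of adjacent tokens as a feature. A small sketch on one made-up sentence shows which features appear; `doc`, `unigrams`, and `uni_bi` are names chosen for this example.**

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up sentence; the default tokenizer drops the one-letter word "a".
doc = ["not a good movie"]

unigrams = CountVectorizer().fit(doc)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

**A bigram such as "not good" carries information that the separate words "not" and "good" do not, which is one plausible reason the bigram model scores higher below.**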

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(features,label)

In [25]:
model = naive_bayes.MultinomialNB()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.856
             precision    recall  f1-score   support

          0       0.86      0.84      0.85       216
          1       0.85      0.87      0.86       234

avg / total       0.86      0.86      0.86       450



**Let's try other classifiers:**

In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

**Linear SVC gets similar accuracy (0.836 vs. 0.856 for Naive Bayes on the same features) but is much slower:**

In [26]:
model = LinearSVC()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.836
             precision    recall  f1-score   support

          0       0.81      0.86      0.83       216
          1       0.86      0.81      0.84       234

avg / total       0.84      0.84      0.84       450



**Random Forest, with default settings, does noticeably worse:**

In [27]:
model = RandomForestClassifier()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.689
             precision    recall  f1-score   support

          0       0.63      0.87      0.73       216
          1       0.81      0.53      0.64       234

avg / total       0.72      0.69      0.68       450
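
**Each score above comes from a single random split, so small differences between classifiers may be noise. One way to compare them more fairly is k-fold cross-validation, sketched below on synthetic count data (not the reviews). The Poisson-generated `X`, the labels `y`, and the `models` dictionary are assumptions made for this example. Also note that RandomForestClassifier used 10 trees by default before scikit-learn 0.22 and uses 100 since, which may partly explain the low score above.**

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Synthetic word counts (hypothetical): 200 "documents" and 50 "words",
# where the first 10 words are more frequent in class 1.
y = np.array([1] * 100 + [0] * 100)
rates = np.ones((200, 50))
rates[y == 1, :10] = 3.0
X = rng.poisson(rates)

models = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 stratified folds for each classifier.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print("%-15s %.3f" % (name, scores[name]))
```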

