# Dataset
Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), which is included in the `NLTK` library.
It consists of 1000 positive and 1000 negative processed movie reviews. It was introduced in Pang and Lee (ACL 2004) and released in June 2004.

In [None]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
from nltk import word_tokenize
import string
from nltk.probability import FreqDist

In [None]:
from nltk.corpus import movie_reviews as mr
print("The corpus contains %d reviews" % len(mr.fileids()))

## Shuffling

Let's shuffle the documents; otherwise they remain sorted by category: ["neg", "neg", ..., "pos"].

In [None]:
import random
docnames = list(mr.fileids())  # work on a copy so the shuffle never reorders the reader's own list
random.shuffle(docnames)
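
`random.shuffle` gives a different order on every run, so the train/test split and the final accuracy change each time the notebook is executed. A minimal sketch of a reproducible shuffle follows; the helper `seeded_shuffle`, the seed value, and the toy file ids are assumptions made for illustration and are not part of the original notebook.

```python
import random

# Hypothetical file ids in the same "category/name" form the corpus uses.
toy_docnames = ["neg/a.txt", "neg/b.txt", "neg/c.txt",
                "pos/d.txt", "pos/e.txt", "pos/f.txt"]

def seeded_shuffle(names, seed=42):
    """Return a shuffled copy of names; the same seed gives the same order."""
    shuffled = list(names)                  # copy, so the input list is untouched
    random.Random(seed).shuffle(shuffled)   # private generator, global state unaffected
    return shuffled

first = seeded_shuffle(toy_docnames)
second = seeded_shuffle(toy_docnames)
print(first)
print(first == second)  # both calls use the same seed
```

Applied to this notebook, the same idea would be `docnames = seeded_shuffle(mr.fileids())`.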

## Let's do it using some useful scikit-learn functions 


In [None]:
import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

### Assuming that documents are shuffled
Go back to the shuffle section and make sure `docnames` contains a shuffled list of documents.

In [None]:
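documents = []  # raw text of each review
tags = []       # matching label, "neg" or "pos"
for doc in docnames:
    documents.append(mr.raw(doc))
    # file ids look like "neg/cv000_29416.txt"; the directory name is the label
    tags.append(doc.split('/')[0])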

In [None]:
for i in range(2):
    print(tags[i], documents[i])

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]
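
The slicing above relies on the earlier shuffle to mix the two classes. As an alternative sketch, scikit-learn's `train_test_split` can shuffle and split in one call, and its `stratify` argument keeps the class proportions equal in both parts. The toy documents and tags below are stand-ins invented for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for `documents` and `tags` (10 reviews per class).
toy_documents = ["review %d" % i for i in range(20)]
toy_tags = ["neg"] * 10 + ["pos"] * 10

tr_docs, te_docs, tr_tags, te_tags = train_test_split(
    toy_documents, toy_tags,
    test_size=0.2,       # same 80/20 ratio as the manual split
    random_state=0,      # fixed seed, so the split is reproducible
    stratify=toy_tags,   # keep the neg/pos proportions in both parts
)
print(len(tr_docs), len(te_docs))
print(Counter(tr_tags), Counter(te_tags))
```

With stratification the test set always holds the same share of each class, which a plain shuffle only achieves on average.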

In [None]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
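
Note the difference between the two calls above: `fit_transform` learns the vocabulary from the training documents, while `transform` only counts words that are already in that vocabulary. A small sketch on made-up sentences (the toy data is an assumption, not taken from the corpus) makes this visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["a good good movie", "a bad movie"]
toy_test = ["a good unseen movie"]

toy_vectorizer = CountVectorizer()
toy_train_X = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_X = toy_vectorizer.transform(toy_test)        # reuses it unchanged

# The default tokenizer keeps only tokens of two or more characters,
# so "a" never enters the vocabulary.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_X.toarray())  # one row per document, one column per word
print(toy_test_X.toarray())   # "unseen" has no column and is ignored
```

Fitting the vectorizer on the training documents only, as the notebook does, keeps information from the test set out of the features.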

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out()[:1000])

In [None]:
print(train_X.shape, test_X.shape)

In [None]:
classifier = MultinomialNB()

In [None]:
classifier.fit(train_X, train_tags)

In [None]:
pred = classifier.predict(test_X)
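
The vectorize, fit, and predict steps above can also be chained into a single object. The sketch below uses `make_pipeline` and swaps in `TfidfVectorizer` (imported earlier but not used so far) in place of `CountVectorizer`; the toy reviews are invented for illustration, and this is one possible arrangement rather than the notebook's own method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_train_docs = [
    "great wonderful film",
    "wonderful acting great plot",
    "terrible boring film",
    "boring plot terrible acting",
]
toy_train_tags = ["pos", "pos", "neg", "neg"]

# The pipeline fits the vectorizer and the classifier together,
# and applies the same vectorizer again at prediction time.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_train_docs, toy_train_tags)

toy_predictions = toy_model.predict(["great wonderful plot", "boring terrible film"])
print(toy_predictions)
```

On the real data the equivalent calls would be `toy_model.fit(train_documents, train_tags)` followed by `toy_model.predict(test_documents)`.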

In [None]:
score = sklearn.metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)
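
Accuracy is a single number and hides which class the errors fall on. The `classification_report` function imported earlier reports precision, recall, and F1 per class, and `confusion_matrix` gives the raw counts. The sketch below runs both on made-up labels; the toy lists are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

toy_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_guess = ["pos", "pos", "neg", "neg", "neg", "pos"]

toy_accuracy = accuracy_score(toy_true, toy_guess)
print("accuracy:   %0.3f" % toy_accuracy)

# Per-class precision, recall, and F1.
print(classification_report(toy_true, toy_guess))

# Rows are true labels, columns are predicted labels, in the order given.
toy_cm = confusion_matrix(toy_true, toy_guess, labels=["neg", "pos"])
print(toy_cm)
```

For the classifier trained in this notebook, the corresponding call would be `print(classification_report(test_tags, pred))`.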