# Training your own Sentiment Classifier from scratch

Let's use the [Sentiment Polarity Dataset 2.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/), which is included in the `NLTK` library. It consists of 1000 positive and 1000 negative processed movie reviews. It was introduced by Pang and Lee at ACL 2004 and released in June 2004.

We will be using scikit-learn, a very efficient library for text processing and machine learning.

## Setup

In [None]:
%pip install scikit-learn

In [None]:

import nltk
nltk.download(['movie_reviews','punkt','stopwords'])

# Let's use some useful scikit-learn functions 
import sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
import random

In [None]:
from nltk.corpus import movie_reviews as mr
print("The data contains %d reviews"% len(mr.fileids()))

### Shuffling the documents

We start by shuffling the documents; otherwise they would remain sorted by label ["neg", "neg", ..., "pos"], and a split by position would leave only one class in the test set. Then we proceed with scikit-learn.

In [None]:
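# Shuffle a copy of the file ids, so the corpus reader's own list is left untouched
docnames = list(mr.fileids())
random.shuffle(docnames)

# Create two parallel lists: raw document texts and their tags
documents = []
tags = []
for doc in docnames:
    documents.append(mr.raw(doc))
    tags.append(doc.split('/')[0])  # the part before '/' is the label ("neg" or "pos")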

Let's check the first few documents ...

In [None]:
for i in range(5):
    print("DOC:", documents[i][:400])
    print("TAG:", tags[i])

The first 80% of the documents will be used for training, and the final 20% will be used for testing...

In [None]:
numtrain = int(len(documents) * 80 / 100)  # number of training documents
train_documents, test_documents = documents[:numtrain], documents[numtrain:]
train_tags, test_tags = tags[:numtrain], tags[numtrain:]
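
As an alternative to shuffling and slicing by hand, scikit-learn's `train_test_split` can shuffle and split in one call, and its `stratify` argument keeps the class proportions equal in both parts. The sketch below uses a small made-up corpus (`toy_documents` and `toy_tags` are illustrative stand-ins, not the movie reviews), and the seed value is arbitrary.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the `documents` and `tags` lists
toy_documents = ["bad film %d" % i for i in range(5)] + ["good film %d" % i for i in range(5)]
toy_tags = ["neg"] * 5 + ["pos"] * 5

toy_train_docs, toy_test_docs, toy_train_tags, toy_test_tags = train_test_split(
    toy_documents, toy_tags,
    test_size=0.2,      # keep 20% for testing, like the manual split
    shuffle=True,       # shuffle before splitting
    stratify=toy_tags,  # preserve the neg/pos proportions in both parts
    random_state=0,     # arbitrary seed, only for reproducibility
)

print("train:", Counter(toy_train_tags))
print("test: ", Counter(toy_test_tags))
```

Fixing `random_state` makes the split repeatable across runs, which the plain `random.shuffle` call does not do unless `random.seed` is set first.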

Now that we have separate training and test sets, we convert the texts into vector representations. Scikit-learn provides two useful classes for this: `CountVectorizer` (raw token counts) and `TfidfVectorizer` (counts reweighted by TF-IDF). See the documentation for the available parameters:
- [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [None]:
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(train_documents)
test_X = vectorizer.transform(test_documents)
print("TRAIN SIZE:", train_X.shape)
print("TEST SIZE:", test_X.shape)

The features are simply the tokens extracted from the texts, and some strange "words" can also be found among them.

In [None]:
feature_names = vectorizer.get_feature_names_out()[600:700]
feature_names

In [None]:
classifier = MultinomialNB()

In [None]:
classifier.fit(train_X, train_tags)

In [None]:
pred = classifier.predict(test_X)

In [None]:
score = sklearn.metrics.accuracy_score(test_tags, pred)
print("accuracy:   %0.3f" % score)

In [None]:
print(classification_report(test_tags, pred, digits=3))

### Using the classifier for processing new texts

1. Note that the new sentences must go through exactly the same processing steps that were used during training.
2. Then you only have to apply the classifier to the processed sentences.
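
One way to make sure both steps are always applied together is to bundle the vectorizer and the classifier in a scikit-learn pipeline. This is a minimal sketch on a tiny made-up training set (`toy_train_docs` and `toy_train_tags` are illustrative), not the setup used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
toy_train_docs = [
    "a great and wonderful film",
    "wonderful acting, great plot",
    "a terrible and boring film",
    "boring plot, terrible acting",
]
toy_train_tags = ["pos", "pos", "neg", "neg"]

# The pipeline fits the vectorizer and the classifier in one call ...
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_train_docs, toy_train_tags)

# ... and reuses the fitted vectorizer automatically at prediction time
toy_predictions = model.predict(["great film", "boring film"])
print(toy_predictions)
```

Because `predict` receives raw strings, there is no separate `transform` call to forget or to apply with the wrong vectorizer.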

In [None]:
frases = ["I love movies very much", 
          "I hate my stupid life",
          "I am disappointed with the argument"]
frases_X = vectorizer.transform(frases)
classifier.predict(frases_X)