## Movie Reviews and Bag-of-Words Modeling

This dataset contains $2000$ movie reviews, each classified as either _"positive"_ or _"negative"_.

In [2]:
import pandas as pd
data = pd.read_csv("movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [3]:
data.shape

(2000, 2)

## 1. Preprocessing

We will write a function `preprocessing` that cleans a sentence (stripping surrounding whitespace, lowercasing, and removing digits and punctuation), then apply it to every review and store the result in a column called `clean_reviews`.

In [4]:
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

In [5]:
def preprocessing(sentence):
    # Remove leading and trailing whitespace
    sentence = sentence.strip()
    # Lowercase everything
    sentence = sentence.lower()
    # Drop digits
    sentence = ''.join(char for char in sentence if not char.isdigit())
    # Drop punctuation characters
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, "")
    return sentence
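
As a quick check of what the function does, here is a minimal sketch that applies it to a made-up sentence (the sample string is an assumption, not a review from the dataset). Because digits and punctuation are deleted rather than replaced, the words around them can end up separated by several spaces. The last step shows one optional way to collapse them; it is not part of the cleaning used in this notebook.

```python
import string

def preprocessing(sentence):
    # Same cleaning steps as the notebook's function
    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, "")
    return sentence

sample = "  The Movie: 10/10, GREAT!  "  # hypothetical input
cleaned = preprocessing(sample)
print(repr(cleaned))

# Optional extra step (an assumption, not in the notebook): collapse repeated spaces
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

The default `CountVectorizer` tokenizer extracts word tokens with a regular expression, so the leftover spaces should not change the counts used later.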

## Clean reviews

In [6]:
data['clean_reviews'] = data.reviews.apply(preprocessing)
data.head()

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couples go to a church party d...
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastards quick movie review \ndamn t...
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first fea...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing p...


## Label encoding

We will encode the target values with scikit-learn's `LabelEncoder` and store the result in a column called `target_encoded`.

In [11]:
# Import the class directly so the name `preprocessing` keeps pointing to our cleaning function
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)
data.head()

Unnamed: 0,target,reviews,clean_reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couples go to a church party d...,0
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastards quick movie review \ndamn t...,0
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first fea...,0
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing p...,0
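
To make the mapping between labels and integers explicit, the following minimal sketch fits a `LabelEncoder` on a toy list of labels. The label `"pos"` is an assumption, since only `neg` rows appear in the preview above. `LabelEncoder` sorts the classes, so with these two labels `neg` becomes `0` and `pos` becomes `1`.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["neg", "pos", "pos", "neg"]  # toy labels; "pos" is assumed

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))   # classes in sorted order; the position is the integer code
print(encoded.tolist())    # integer codes for the toy labels

# Map integer codes back to the original labels
decoded = le.inverse_transform([0, 1]).tolist()
print(decoded)
```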


## 2. Bag-of-Words modeling

Using `cross_validate`, we will score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts (Naive Bayes with unigrams).
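
To see what a Bag-of-Words representation looks like, here is a small sketch on a three-document toy corpus (the documents are invented for illustration). Each column corresponds to one word of the learned vocabulary and each row holds the word counts for one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good good plot"]  # toy corpus

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary words ordered by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)

# Dense view of the sparse count matrix: one row per document
counts = X.toarray()
print(counts)
```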

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

vectorizer = CountVectorizer()
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

cv_nb = cross_validate(naivebayes,
                       X_bow,
                       data.target_encoded,
                       scoring = "accuracy")

cv_nb['test_score'].mean()

0.817
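
In the cell above, the vectorizer is fitted on all reviews before cross-validation, so the vocabulary is also learned from the validation folds. One common alternative is to wrap the vectorizer and the classifier in a pipeline, so that the vectorizer is refit on each training fold. The sketch below shows the idea on an invented toy corpus. The texts, the labels, and `cv=3` are all assumptions. On the real data one would pass `data.clean_reviews` and `data.target_encoded` instead, and the resulting score may differ from the one reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (invented): 6 positive and 6 negative snippets
texts = [
    "great film", "loved this movie", "wonderful acting", "great plot",
    "really good movie", "loved the acting",
    "terrible film", "hated this movie", "awful acting", "boring plot",
    "really bad movie", "hated the plot",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is fitted inside each training fold only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

cv_results = cross_validate(pipeline, texts, labels, cv=3, scoring="accuracy")
scores = cv_results["test_score"]
print(scores.mean())
```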

## 3. N-gram modeling

Using `cross_validate`, we will score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts (Naive Bayes with bigrams).

In [13]:
vectorizer = CountVectorizer(ngram_range = (2,2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

cv_nb = cross_validate(naivebayes,
                       X_bow,
                       data.target_encoded,
                       scoring = "accuracy")

cv_nb['test_score'].mean()

0.8375
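
To illustrate what `ngram_range` changes, the sketch below lists the features extracted from a single invented sentence. With `ngram_range=(2, 2)`, as used above, only pairs of consecutive words are kept. The `(1, 2)` setting, which the notebook does not use, is shown only for comparison and keeps both single words and pairs. The helper `feature_names` is a hypothetical name introduced for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all"]  # toy sentence

def feature_names(ngram_range):
    # Hypothetical helper: fit a vectorizer and return its features in column order
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

bigrams_only = feature_names((2, 2))
unigrams_and_bigrams = feature_names((1, 2))

print(bigrams_only)
print(unigrams_and_bigrams)
```

A bigram such as `not good` keeps the negation attached to the word it modifies, which a unigram model cannot represent.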