# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-words*** modelling of sentences.

✍️ In the following dataset, we have $2000$ reviews classified either as _"positive"_ or _"negative"_.

In [27]:
import pandas as pd
import string
import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [28]:
data = pd.read_csv(
    "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv"
)
data.head()

| | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [29]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning sentence)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews in a column called `clean_reviews`.
- Do not remove stopwords in this challenge; the reason is explained in section `3. N-gram Modelling`.

In [38]:
def make_lower_case(sentence):
    return sentence.lower()


def remove_punctuation(sentence):
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, "")
    return sentence


def remove_numbers(sentence):
    pattern = r"[0-9]"
    return re.sub(pattern, "", sentence)


def tokenize(sentence):
    return word_tokenize(sentence)

def remove_whitespace(tokenized_sentence):
    # Drop empty tokens left over after cleaning
    return [word for word in tokenized_sentence if word]


def filter_stopwords(tokenized_sentence):
    sw = set(stopwords.words("english"))
    return [word for word in tokenized_sentence if word not in sw]


def lemmatize(tokenized_sentence):
    lemmatizer = WordNetLemmatizer()
    # Lemmatize successively as verb, noun, adjective and adverb
    v_lemmatized = [lemmatizer.lemmatize(word, pos="v") for word in tokenized_sentence]

    v_n_lemmatized = [lemmatizer.lemmatize(word, pos="n") for word in v_lemmatized]

    v_n_a_lemmatized = [lemmatizer.lemmatize(word, pos="a") for word in v_n_lemmatized]

    v_n_a_r_lemmatized = [
        lemmatizer.lemmatize(word, pos="r") for word in v_n_a_lemmatized
    ]

    return v_n_a_r_lemmatized


def preprocessing(sentence, remove_stopwords=True):
    # `remove_stopwords` is a boolean flag; the filtering itself is done by `filter_stopwords`
    sentence = make_lower_case(sentence)
    sentence = remove_numbers(sentence)
    sentence = remove_punctuation(sentence)
    tokenized_sentence = tokenize(sentence)
    if remove_stopwords:
        tokenized_sentence = filter_stopwords(tokenized_sentence)
    tokenized_sentence = remove_whitespace(tokenized_sentence)
    tokenized_sentence = lemmatize(tokenized_sentence)
    sentence = " ".join(tokenized_sentence)
    return sentence

In [39]:
data["clean_reviews"] = data["reviews"].apply(preprocessing, remove_stopwords=False)
data["clean_reviews"].head(5)

0    plot two teen couple go to a church party drin...
1    the happy bastard quick movie review damn that...
2    it be movie like these that make a jade movie ...
3    quest for camelot be warner bros first feature...
4    synopsis a mentally unstable man undergo psych...
Name: clean_reviews, dtype: object
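As a quick sanity check on the steps that do not need NLTK, the sketch below reproduces lowercasing, digit removal and punctuation removal with the standard library only. `basic_clean` is a hypothetical helper name, and `str.split()` stands in for `word_tokenize`, so its tokens may differ from NLTK's on real reviews. No lemmatization is applied.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character in one pass.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(sentence):
    """Hypothetical helper: lowercase, drop digits and punctuation, split on whitespace."""
    sentence = sentence.lower()
    sentence = re.sub(r"[0-9]", "", sentence)
    sentence = sentence.translate(PUNCTUATION_TABLE)
    # str.split() with no argument also discards leading, trailing and repeated whitespace.
    return sentence.split()


tokens = basic_clean("  Plot : 2 teen couples go to a Church party , drink and drive .  ")
print(tokens)
```

Using `str.translate` is one alternative to looping over `string.punctuation` with `replace`; both remove the same characters.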

❓ **Question (LabelEncoding)**❓

LabelEncode your target and store it into a column called `"target_encoded"`

In [40]:
label_encoder = LabelEncoder()

# fit_transform learns the classes and encodes the column in one step
data["target_encoded"] = label_encoder.fit_transform(data["target"])

data.head(1)

| | target | reviews | clean_reviews | target_encoded |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | plot two teen couple go to a church party drin... | 0 |


In [41]:
data.loc[data["target"] == "pos"].head(1)

| | target | reviews | clean_reviews | target_encoded |
|---|---|---|---|---|
| 1000 | pos | films adapted from comic books have had plenty... | film adapt from comic book have have plenty of... | 1 |
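The outputs above show `neg` encoded as 0 and `pos` as 1. This follows from `LabelEncoder` sorting the distinct labels and using each label's position as its code. The plain-Python sketch below mimics that mapping on a made-up list of labels; it is an illustration of the idea, not scikit-learn's implementation.

```python
# Made-up labels, for illustration only.
labels = ["neg", "pos", "neg", "pos", "pos"]

# Sort the distinct labels, then map each label to its position in that sorted list.
classes = sorted(set(labels))
label_to_code = {label: code for code, label in enumerate(classes)}
encoded = [label_to_code[label] for label in labels]

print(label_to_code)  # {'neg': 0, 'pos': 1}
print(encoded)        # [0, 1, 0, 1, 1]
```

On the fitted encoder, `label_encoder.classes_` should show the same ordering.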


In [42]:
# Quick check
data.head()

| | target | reviews | clean_reviews | target_encoded |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | plot two teen couple go to a church party drin... | 0 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | the happy bastard quick movie review damn that... | 0 |
| 2 | neg | it is movies like these that make a jaded movi... | it be movie like these that make a jade movie ... | 0 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | quest for camelot be warner bros first feature... | 0 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | synopsis a mentally unstable man undergo psych... | 0 |


## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the sentences.
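To make the Bag-of-Words representation concrete, here is a minimal hand-rolled sketch on two made-up sentences: each document becomes a row of word counts over a shared vocabulary. It only illustrates the idea. `CountVectorizer` returns a sparse matrix and, with its default settings, ignores one-character tokens such as "a".

```python
from collections import Counter

# Made-up, already cleaned sentences.
documents = [
    "the movie be good",
    "the movie be bad bad",
]

# Vocabulary: every distinct token, sorted so that each word gets a stable column index.
vocabulary = sorted({token for doc in documents for token in doc.split()})

# One row per document, one column per vocabulary word, each cell a raw count.
bag_of_words = []
for doc in documents:
    counts = Counter(doc.split())
    bag_of_words.append([counts[word] for word in vocabulary])

print(vocabulary)    # ['bad', 'be', 'good', 'movie', 'the']
print(bag_of_words)  # [[0, 1, 1, 1, 1], [2, 1, 0, 1, 1]]
```

Word order is lost in this representation: only the counts remain, which is what the Multinomial Naive Bayes model consumes.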

In [43]:
count_bayes_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

cv_results = cross_validate(
    count_bayes_pipeline,
    data["clean_reviews"],
    data["target_encoded"],
    scoring=["accuracy", "precision", "recall"],
    cv=5,
)

for k, v in cv_results.items():
    print(f"{k.title()}: {v.mean()}")

Fit_Time: 1.1883056163787842
Score_Time: 0.2546726703643799
Test_Accuracy: 0.8135
Test_Precision: 0.8188306311907745
Test_Recall: 0.805
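For readers who want to see what the three scores measure, the sketch below computes accuracy, precision and recall by hand on made-up predictions. It assumes, as in the encoding above, that 1 stands for "pos" and is treated as the positive class.

```python
# Toy labels: 1 = "pos", 0 = "neg" (made-up values for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (true_pos + true_neg) / len(pairs)   # share of correct predictions
precision = true_pos / (true_pos + false_pos)   # among predicted "pos", share truly "pos"
recall = true_pos / (true_pos + false_neg)      # among true "pos", share that was found

print(accuracy, precision, recall)  # 0.625 0.6 0.75
```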


## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. In a sentence like "I do not like coriander", the bigram "do not" carries the negation, and both of its words are standard English stopwords. Removing stopwords would delete that bigram and, with it, the signal that the sentence is negative.
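The effect can be sketched in a few lines of plain Python. The `bigrams` helper and the three-word `toy_stopwords` set are hypothetical and used only for illustration; the real NLTK list is much longer.

```python
def bigrams(tokens):
    """Return consecutive token pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


tokens = "i do not like coriander".split()

# Hypothetical mini stopword list, used here only for illustration.
toy_stopwords = {"i", "do", "not"}
without_stopwords = [token for token in tokens if token not in toy_stopwords]

print(bigrams(tokens))             # ['i do', 'do not', 'not like', 'like coriander']
print(bigrams(without_stopwords))  # ['like coriander']
```

Once the stopwords are gone, the only remaining bigram reads as if the sentence were positive.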

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the sentences.

In [44]:
vectorizer = CountVectorizer(ngram_range=(2, 2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

cv_nb = cross_validate(naivebayes, X_bow, data.target_encoded, scoring="accuracy")

round(cv_nb["test_score"].mean(), 2)

0.84
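In the cell above, the vectorizer is fitted on all reviews before cross-validation, so the bigram vocabulary is also learned from the validation folds. One possible variation, sketched below on a made-up toy corpus, puts the vectorizer inside a pipeline so that it is refitted on each training fold. The names `toy_reviews`, `toy_targets` and `bigram_pipeline` are illustrative assumptions, and on the real data the score may differ slightly from the 0.84 obtained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus, only to keep the sketch self-contained.
toy_reviews = [
    "i do not like this movie",
    "i do not recommend this film",
    "this movie be not good",
    "do not watch this film",
    "i really like this movie",
    "i do recommend this film",
    "this movie be very good",
    "go watch this film",
]
toy_targets = [0, 0, 0, 0, 1, 1, 1, 1]

# The vectorizer is a pipeline step, so each fold builds its own bigram vocabulary.
bigram_pipeline = make_pipeline(CountVectorizer(ngram_range=(2, 2)), MultinomialNB())

toy_cv = cross_validate(bigram_pipeline, toy_reviews, toy_targets, scoring="accuracy", cv=2)
print(toy_cv["test_score"])
```

With the notebook's data, the same pipeline could be passed to `cross_validate` together with `data["clean_reviews"]` and `data["target_encoded"]`, as in the unigram cell.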

🏁 Congratulations! You now know how to train a Naive Bayes model on vectorized sentences.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!