# Movie Reviews and Bag-of-Words Modeling

🎯 The goal of this challenge is to experiment with ***Bag-of-Words*** modeling of texts.

✍️ The following dataset contains $2000$ reviews, each classified as either _"positive"_ or _"negative"_.

In [1]:
import pandas as pd

data = pd.read_csv("reviews.csv")
data.head()

| | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [2]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

Write a function `preprocessing` that cleans a sentence, then apply it to all the reviews. Store the cleaned reviews in a column called `clean_reviews`.

In [3]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Requires the NLTK stopwords, tokenizer and WordNet data (see nltk.download)
# Named stop_words so that the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def preprocessing(sentence):
    # remove punctuation
    for p in string.punctuation:
        sentence = sentence.replace(p, "")

    # remove digits
    for d in string.digits:
        sentence = sentence.replace(d, "")

    # tokenize and remove stopwords
    tokens = [w for w in word_tokenize(sentence) if w not in stop_words]

    # lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # recompose into a single string, without a leading space
    return " ".join(tokens)

preprocessing("la vie est b66elle!!!!")


'la vie est belle'
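
The character-by-character `replace` loops can also be written as a single pass with `str.translate`. The sketch below uses only the standard library; `strip_punct_and_digits` is a hypothetical helper name, not part of the challenge, and it covers only the punctuation and digit steps (no stopword removal or lemmatization).

```python
import string

# Translation table that deletes every ASCII punctuation character and digit.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def strip_punct_and_digits(sentence):
    """Remove ASCII punctuation and digits in one pass (hypothetical helper)."""
    return sentence.translate(_DELETE_TABLE)


cleaned = strip_punct_and_digits("la vie est b66elle!!!!")
print(cleaned)  # la vie est belle
```

On long reviews this avoids rescanning the string once per punctuation character and once per digit.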

In [4]:
#clean reviews

data["clean_reviews"] = data.reviews.apply(preprocessing)
data.head()

| | target | reviews | clean_reviews |
|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | plot two teen couple go church party drink dr... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | happy bastard quick movie review damn yk bug ... |
| 2 | neg | it is movies like these that make a jaded movi... | movie like make jaded movie viewer thankful i... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | quest camelot warner bros first featurelength... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | synopsis mentally unstable man undergoing psy... |


❓ **Question (LabelEncoding)** ❓

Label-encode your target and store it in a column called `target_encoded`.

In [5]:
def encode(label):
    # "neg" -> 1, anything else (here "pos") -> 0
    return 1 if label == "neg" else 0

data["target_encoded"] = data.target.apply(encode)
data.head()

| | target | reviews | clean_reviews | target_encoded |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | plot two teen couple go church party drink dr... | 1 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | happy bastard quick movie review damn yk bug ... | 1 |
| 2 | neg | it is movies like these that make a jaded movi... | movie like make jaded movie viewer thankful i... | 1 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | quest camelot warner bros first featurelength... | 1 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | synopsis mentally unstable man undergoing psy... | 1 |
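
The hand-written `encode` maps `"neg"` to 1. scikit-learn's `LabelEncoder` instead assigns integers following the sorted order of the class names, which would give `"neg"` → 0 and `"pos"` → 1. The pure-Python sketch below mimics that sorted-order behaviour to make the difference visible; `fit_label_mapping` is a hypothetical helper and the four labels are made up for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order (hypothetical helper)."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Made-up labels standing in for data.target
targets = ["neg", "pos", "neg", "pos"]

mapping = fit_label_mapping(targets)
encoded = [mapping[t] for t in targets]

print(mapping)  # {'neg': 0, 'pos': 1}
print(encoded)  # [0, 1, 0, 1]
```

Either convention works for accuracy scoring; it only matters to remember which integer stands for which class when reading predictions.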


## 2. Bag-of-Words modeling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a unigram Bag-of-Words representation of the texts.

In [6]:
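from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# NOTE: ngram_range=(2, 2) extracts bigrams only; the default (1, 1) gives the unigram model
count_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = count_vectorizer.fit_transform(data.clean_reviews)  # sparse document-term matrix

# Dense view for inspection only (memory-hungry); the model is scored on the sparse X.
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
X_bow = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names_out())
y = data["target_encoded"]

nb_model = MultinomialNB()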



In [7]:
cross_val_score(nb_model, X, y, cv=3, scoring="accuracy", n_jobs=-1)

array([0.73463268, 0.73163418, 0.77177177])
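
To make the Bag-of-Words representation concrete, the sketch below counts n-grams by hand with `collections.Counter`. `ngram_counts` is a hypothetical helper and the review text is made up; it only approximates `CountVectorizer`, which by default also lowercases the text and ignores single-character tokens.

```python
from collections import Counter


def ngram_counts(text, n=1):
    """Count the n-grams of a whitespace-tokenized text (hypothetical helper)."""
    tokens = text.split()
    # Slide a window of n tokens over the text and join each window into one feature
    ngrams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(ngrams)


review = "good movie good plot"  # made-up example

unigrams = ngram_counts(review, n=1)  # what ngram_range=(1, 1) would count
bigrams = ngram_counts(review, n=2)   # what ngram_range=(2, 2) would count

print(unigrams)
print(bigrams)
```

Each distinct n-gram becomes one column of the document-term matrix, and the count is the value stored for that review. With unigrams, word order is lost entirely; bigrams keep pairs of adjacent words at the cost of a much larger vocabulary.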

## 3. N-gram modeling

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [10]:
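from sklearn.model_selection import cross_validate

# Bigram-only Bag-of-Words: ngram_range=(2, 2) keeps pairs of consecutive words
vectorizer = CountVectorizer(ngram_range=(2, 2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.clean_reviews)

# cross_validate uses 5-fold cross-validation by default
cv_nb = cross_validate(naivebayes,
                       X_bow,
                       data.target_encoded,
                       scoring="accuracy")

# mean accuracy across the folds
cv_nb['test_score'].mean()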

0.759
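
In the cell above, the vectorizer is fitted on all reviews before cross-validation, so the validation folds contribute to the vocabulary. One possible way to avoid this, sketched here on a tiny made-up corpus rather than on the real dataset, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is learned on the training folds only. The texts, labels, and `cv=3` are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for data.clean_reviews and data.target_encoded (made up)
texts = [
    "great movie loved plot",
    "great acting loved movie",
    "wonderful film great cast",
    "terrible movie hated plot",
    "terrible acting hated movie",
    "awful film terrible cast",
]
labels = [0, 0, 0, 1, 1, 1]  # 1 = negative, following the notebook's encoding

# The vectorizer is refitted inside each training fold, then applied to the held-out fold
pipeline = make_pipeline(CountVectorizer(ngram_range=(2, 2)), MultinomialNB())

cv_results = cross_validate(pipeline, texts, labels, cv=3, scoring="accuracy")
print(cv_results["test_score"].mean())
```

On the real data, the same pipeline could be passed `data.clean_reviews` and `data.target_encoded` directly; the score may differ slightly from the one above because unseen bigrams in a validation fold are then ignored.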

🏁 Congratulations! You now know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!