# Movie Reviews and Bag-of-Words Modelling

🎯 The goal of this challenge is to play with the ***Bag-of-words*** modelling of texts.

✍️ In the following dataset, we have $2000$ reviews classified either as _"positive"_ or _"negative"_.

In [15]:
import pandas as pd
import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [16]:
y= data.target

In [17]:
data.shape

(2000, 2)

## 1. Preprocessing

❓ **Question (Cleaning Text)** ❓

- Write a function `preprocessing` that will clean a sentence and apply it to all our reviews. It should:
    - remove whitespace
    - lowercase characters
    - remove numbers
    - remove punctuation
    - tokenize
    - lemmatize
- You can store the cleaned reviews into a column called `cleaned_reviews`.
- Do not remove stopwords in this challenge, we will explain why in the section `3. N-gram modelling`

In [40]:
def preprocessing(sentence):

    sentence = sentence.lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())
    
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '') 
    
    sentence = sentence.strip()
    
    tokenized_sentence = word_tokenize(sentence)
        
    verb_lemmatized = [                  
    WordNetLemmatizer().lemmatize(word, pos = "v") # v --> verbs
    for word in tokenized_sentence
    ]

    noun_lemmatized = [                  
    WordNetLemmatizer().lemmatize(word, pos = "n") # v --> verbs
    for word in tokenized_sentence
    ]
    
    preprocessed_sentence = ' '.join([word for word in verb_lemmatized])
    preprocessed_sentence = " ".join([word for word in noun_lemmatized])

    return preprocessed_sentence

In [66]:
# Clean reviews
data["cleaned_reviews"]=data.reviews.apply(preprocessing)
data.head()


Unnamed: 0,target,reviews,cleaned_reviews,target1
0,0,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,0,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,0,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,0,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,0,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0


In [67]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
test_values = data.cleaned_reviews.values
X_bow = count_vectorizer.fit_transform(test_values).toarray()

❓ **Question (LabelEncoding)**❓

LabelEncode your target and store it into a column called `"target_encoded"`

In [68]:
# YOUR CODE HERE
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
data["target"]=encoder.fit_transform(y)

In [69]:
# Quick check
data.head()

Unnamed: 0,target,reviews,cleaned_reviews,target1
0,0,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,0,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,0,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,0,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,0,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0


## 2. Bag-of-Words Modelling

❓ **Question (NaiveBayes with unigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [71]:
# YOUR CODE HERE

from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
model = MultinomialNB()
X=data.cleaned_reviews
y=data.target

cv_results = cross_validate(model, X_bow,y, cv = 5, scoring = ["accuracy"])
avg_accuracy = cv_results["test_accuracy"].mean()
avg_accuracy



0.8160000000000001

## 3. N-gram Modelling

👀 Remember that we asked you not to remove stopwords. Why? 

👉 We will train the Naive Bayes model with bigrams. Hence, in sentence like "I do not like coriander", it is important to scan the bigram "do not" to detect negativity in this sentence for example.

❓ **Question (NaiveBayes with bigrams)** ❓

Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [75]:
vectorizer = CountVectorizer(ngram_range = (2,2))
naivebayes = MultinomialNB()

X_bow = vectorizer.fit_transform(data.cleaned_reviews)

cv_nb = cross_validate(
    naivebayes,
    X_bow,
    y,
    scoring = "accuracy"
)

round(cv_nb['test_score'].mean(),2)

0.84

🏁 Congratulations! Now, you know how to train a Naive Bayes model on vectorized texts.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!

In [None]:
!git add .
!git commit -m"excercise 100%"
!