# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to build a Pipeline that chains a Vectorizer with a NaiveBayes model, and then to fine-tune that pipeline.

✍️ Let's reuse the previous dataset with $2000$ reviews classified either as "positive" or "negative".

In [1]:
import pandas as pd

data = pd.read_csv("reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers (classes are sorted, so "neg" -> 0 and "pos" -> 1)
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)

In [3]:
data.head()

Unnamed: 0,target,reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",0
1,neg,the happy bastard's quick movie review \ndamn ...,0
2,neg,it is movies like these that make a jaded movi...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",0
4,neg,synopsis : a mentally unstable man undergoing ...,0
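To double-check which integer each label received, the fitted encoder can be inspected. The snippet below is a minimal sketch on made-up labels (the `labels` list and the `label_mapping` name are illustrative, not part of the challenge); it relies on `LabelEncoder` storing its classes in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `target` column (illustrative only).
labels = ["neg", "pos", "pos", "neg"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so position 0 is "neg" and position 1 is "pos".
label_mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(label_mapping)

# inverse_transform turns integer codes back into the original labels.
print(encoder.inverse_transform([0, 1]))
```

On the real data, the same check would be `le.classes_` right after fitting.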


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts: strip surrounding whitespace, lowercase, and remove digits and punctuation.

In [4]:
import string

def preprocessing(sentence):
    # Remove leading/trailing whitespace and lowercase the text
    sentence = sentence.strip()
    sentence = sentence.lower()
    # Drop every digit character
    sentence = ''.join(char for char in sentence if not char.isdigit())
    # Drop every ASCII punctuation character
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')
    return sentence

In [6]:
data['clean_reviews'] = data['reviews'].apply(preprocessing)

In [7]:
data

Unnamed: 0,target,reviews,target_encoded,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",0,plot two teen couples go to a church party d...
1,neg,the happy bastard's quick movie review \ndamn ...,0,the happy bastards quick movie review \ndamn t...
2,neg,it is movies like these that make a jaded movi...,0,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs...",0,quest for camelot is warner bros first fea...
4,neg,synopsis : a mentally unstable man undergoing ...,0,synopsis a mentally unstable man undergoing p...
...,...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,1,wow what a movie \nits everything a movie ca...
1996,pos,"richard gere can be a commanding actor , but h...",1,richard gere can be a commanding actor but he...
1997,pos,"glory--starring matthew broderick , denzel was...",1,glorystarring matthew broderick denzel washin...
1998,pos,steven spielberg's second epic film on world w...,1,steven spielbergs second epic film on world wa...
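The cleaned column still shows double spaces where punctuation used to be. As a quick sanity check, here is a minimal sketch of the same cleaning steps applied to a made-up sentence. The `clean_text` helper and the `sample` string are hypothetical; the helper uses `str.translate` instead of the loop above, and it only removes ASCII digits, which is an assumption that holds for plain English text.

```python
import string

def clean_text(sentence: str) -> str:
    """Strip, lowercase, then drop ASCII digits and punctuation."""
    sentence = sentence.strip().lower()
    # One translation table deletes every digit and punctuation character.
    table = str.maketrans("", "", string.digits + string.punctuation)
    return sentence.translate(table)

sample = "  The Happy Bastard's 30-second review: 4/10 !  "
cleaned = clean_text(sample)
print(repr(cleaned))

# Removing characters leaves runs of spaces behind; collapse them if needed.
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

Extra spaces are harmless for `TfidfVectorizer`, whose default tokenizer ignores whitespace, so collapsing them is optional.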


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?

In [23]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
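from sklearn import set_config

# `display` must be passed by keyword: the first positional argument of set_config is `assume_finite`
set_config(display="diagram")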

# Create Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

pipeline.fit(data.clean_reviews, data.target_encoded)

# Define the grid of parameters (3 n-gram ranges x 2 smoothing values = 6 candidates)
parameters = {
    'tfidf__ngram_range': ((1, 1), (2, 2), (1, 2)),
    'nb__alpha': (0.1, 1),
}

# Perform Grid Search (candidates are ranked by recall on the positive class)
grid_search = GridSearchCV(pipeline, parameters, scoring="recall",
                           cv=5, n_jobs=-1, verbose=1)
grid_search.fit(data.clean_reviews, data.target_encoded)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


In [25]:
grid_search.best_params_

{'nb__alpha': 1, 'tfidf__ngram_range': (2, 2)}
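`best_params_` only reports the winning combination. To see how close the other candidates came, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below runs the same kind of search on a tiny made-up corpus so that it is self-contained; the `texts`, `labels`, and `toy_*` names are illustrative, and the scores it prints say nothing about the real reviews.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only here to keep the sketch self-contained.
texts = [
    "great movie loved it", "wonderful acting great plot",
    "loved the plot great fun", "wonderful fun movie",
    "boring movie hated it", "terrible acting boring plot",
    "hated the plot terrible mess", "boring mess of a movie",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
toy_grid = {
    "tfidf__ngram_range": [(1, 1), (2, 2), (1, 2)],
    "nb__alpha": [0.1, 1],
}
toy_search = GridSearchCV(toy_pipeline, toy_grid, scoring="recall", cv=2)
toy_search.fit(texts, labels)

# One row per candidate, sorted so the best-ranked candidate comes first.
results = pd.DataFrame(toy_search.cv_results_)
columns = ["param_tfidf__ngram_range", "param_nb__alpha",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

With the notebook's own search, the equivalent would be `pd.DataFrame(grid_search.cv_results_)`.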

In [26]:
tuned = grid_search.best_estimator_

In [29]:
from sklearn.model_selection import cross_validate

# Baseline: cross-validate the untuned pipeline (default 5-fold CV)
cv_nb = cross_validate(pipeline, data.clean_reviews, data.target_encoded, scoring="accuracy")

cv_nb['test_score'].mean()

0.8205

In [30]:
from sklearn.model_selection import cross_validate

# Cross-validate the tuned pipeline with the same setup for a fair comparison
cv_nb = cross_validate(tuned, data.clean_reviews, data.target_encoded, scoring="accuracy")

cv_nb['test_score'].mean()

0.8280000000000001
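Note that the grid search above ranked candidates by recall, while the two comparisons report accuracy. One way to see both metrics at once is to pass a list to `scoring` in `cross_validate`. The following is a minimal sketch on a made-up corpus (the `texts`, `labels`, and `model` names are illustrative), so the numbers it prints are not meaningful for the reviews dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only here to keep the sketch self-contained.
texts = [
    "great movie loved it", "wonderful acting great plot",
    "loved the plot great fun", "wonderful fun movie",
    "boring movie hated it", "terrible acting boring plot",
    "hated the plot terrible mess", "boring mess of a movie",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# A list of scorers yields one `test_<metric>` array per metric.
scores = cross_validate(model, texts, labels, cv=2, scoring=["accuracy", "recall"])
for metric in ("accuracy", "recall"):
    print(metric, scores[f"test_{metric}"].mean())
```

Running the same call on `pipeline` and `tuned` would show whether the tuned model's gain comes from recall, accuracy, or both.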

🏁 Congratulations! You've managed to chain a Vectorizer and an NLP model and fine-tune the resulting pipeline!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!