# Vectorizer Tuning

In [4]:
import pandas as pd

data = pd.read_pickle("reviews_3")

data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset contains movie reviews, each labeled positive (`pos`) or negative (`neg`) in the `target` column.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [5]:
import string

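def remove_punctuation(df, column):
    # Strip every character in string.punctuation from the column.
    # Note: this overwrites df[column] in place, so the source column changes too.
    for punctuation in string.punctuation:
        df[column] = df[column].apply(lambda x: x.replace(punctuation, ''))
    return df[column]


def lower(df, column):
    # Lowercase the column, also in place.
    df[column] = df[column].apply(lambda x: x.lower())
    return df[column]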

data['clean_text'] = remove_punctuation(data, "reviews")
data['clean_text'] = lower(data, "clean_text")
data

Unnamed: 0,target,reviews,clean_text
0,neg,plot two teen couples go to a church party d...,plot two teen couples go to a church party d...
1,neg,the happy bastards quick movie review \ndamn t...,the happy bastards quick movie review \ndamn t...
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...
3,neg,quest for camelot is warner bros first fe...,quest for camelot is warner bros first fe...
4,neg,synopsis a mentally unstable man undergoing p...,synopsis a mentally unstable man undergoing p...
...,...,...,...
1995,pos,wow what a movie \nits everything a movie ca...,wow what a movie \nits everything a movie ca...
1996,pos,richard gere can be a commanding actor but he...,richard gere can be a commanding actor but he...
1997,pos,glorystarring matthew broderick denzel washin...,glorystarring matthew broderick denzel washin...
1998,pos,steven spielbergs second epic film on world wa...,steven spielbergs second epic film on world wa...
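
The output above shows that `reviews` lost its punctuation as well, because the helper functions write back into the column they read from. A minimal sketch of an alternative that leaves the source column untouched, using pandas string methods, could look like this. The `toy` frame and its two rows are made up for illustration and stand in for `data`.

```python
import string

import pandas as pd

# Hypothetical two-row frame standing in for `data`
toy = pd.DataFrame({
    "reviews": [
        "\" Quest for Camelot \" is Warner Bros . ' first feature",
        "Wow ! What a movie .",
    ]
})

# One translation table that deletes every character in string.punctuation
punct_table = str.maketrans("", "", string.punctuation)

# Build the cleaned column from `reviews` without modifying `reviews` itself
toy["clean_text"] = toy["reviews"].str.translate(punct_table).str.lower()

print(toy)
```

Chaining `.str.translate` and `.str.lower` also avoids looping over the column once per punctuation character.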


## Tuning

👇 Tune a vectorizer of your choice (`CountVectorizer` or `TfidfVectorizer`, or try both!) together with a `MultinomialNB` model in a single grid search.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

X = data.clean_text
y = data.target

# Create Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

# Set parameters to search (model and vectorizer)

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__max_df': [0.5, 0.75, 1.0],
    'clf__alpha': [0.1, 1.0, 10.0],
}

# Perform grid search on pipeline
grid_search = GridSearchCV(text_clf, parameters, cv=5)
grid_search.fit(X, y)

print(grid_search.best_params_)

{'clf__alpha': 1.0, 'vect__max_df': 0.5, 'vect__ngram_range': (1, 2)}
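
Since the exercise invites trying more than one vectorizer, one possible approach is to treat the `vect` step itself as a searchable parameter, so that a single grid search compares `CountVectorizer` and `TfidfVectorizer`. The sketch below is not part of the original solution. It runs on a tiny made-up corpus (`toy_texts`, `toy_labels`) with `cv=3`, so its scores say nothing about the real dataset; on the reviews you would pass `X` and `y` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for data.clean_text / data.target
toy_texts = [
    "a wonderful moving film", "great acting and a great story",
    "i loved every minute", "a brilliant and funny movie",
    "superb direction wonderful cast", "truly great and moving",
    "a dull boring film", "terrible acting and a weak story",
    "i hated every minute", "a boring and awful movie",
    "poor direction terrible cast", "truly awful and dull",
]
toy_labels = ["pos"] * 6 + ["neg"] * 6

pipe = Pipeline([
    ("vect", CountVectorizer()),  # placeholder, replaced by the grid below
    ("clf", MultinomialNB()),
])

# The "vect" key swaps the whole pipeline step; "vect__..." keys tune it
param_grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

# Report which vectorizer won and its mean cross-validated accuracy
print(type(search.best_params_["vect"]).__name__)
print(search.best_score_)
```

Printing `best_score_` next to `best_params_` helps judge whether the winning combination is meaningfully better than the alternatives recorded in `search.cv_results_`.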


⚠️ Please push the exercise once you are done 🙃

## 🏁 