<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# Vectorizer Tuning

The task of this exercise is to simultaneously tune a vectorizer and a model. You will reuse your pre-processed text (the `processed_data.csv` file you created in the last exercise), and then:
- Stack a vectorizer and a model in a `Pipeline`
- Define a grid of parameter values for both the vectorizer and the model
- Run a grid search over the entire pipeline
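
The steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: the `texts` and `labels` lists are a made-up toy corpus (not the exercise data), and the parameter values are arbitrary. The point to notice is the `<step name>__<parameter>` naming that lets one grid address both pipeline steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only (not the exercise data)
texts = ["good movie", "great film", "loved it", "really good",
         "bad movie", "awful film", "hated it", "really bad"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# 1. Stack a vectorizer and a model; the step names are chosen freely
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("model", MultinomialNB()),
])

# 2. Grid keys follow the pattern "<step name>__<parameter name>"
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "model__alpha": [0.1, 1.0],
}

# 3. Grid search over the whole pipeline (2 x 2 = 4 candidates)
search = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training folds only at each cross-validation split, so no vocabulary leaks from the validation fold.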

## Load the data

First, load the data you already processed into a dataframe called `data`.

In [None]:
from nltk.stem import WordNetLemmatizer
import string
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

In [None]:
import pandas as pd

data = pd.read_csv("raw_data/processed_data.csv")

data.head()

## Tuning

Now use `GridSearchCV` to tune a vectorizer of your choice (or try both!) and a `MultinomialNB` model simultaneously. The goal is to beat your previous score. Save your best cross-validation score in a variable named `best_score`.
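
One possible way to "try both" in a single search is to treat the vectorizer step itself as a grid parameter. The sketch below is a minimal illustration, not the expected solution: the toy `texts` and `labels` are hypothetical, the parameter values are arbitrary, and on the real data you would pass your own text and label columns instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only (not the exercise data)
texts = ["good movie", "great film", "loved it", "really good",
         "bad movie", "awful film", "hated it", "really bad"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),  # placeholder, replaced by the grid
    ("model", MultinomialNB()),
])

# A step name used as a key swaps the whole estimator in that step
param_grid = {
    "vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "model__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

# One row per candidate, best-ranked first
columns = ["param_vectorizer", "param_vectorizer__ngram_range",
           "param_model__alpha", "mean_test_score", "rank_test_score"]
summary = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(summary)

# Best mean cross-validated accuracy across both vectorizers
best_score = search.best_score_
```

With this layout, `best_score_` already covers both vectorizers, so there is no need to compare two separate searches by hand.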

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


In [None]:
from sklearn.model_selection import GridSearchCV

# Set parameters to search (model and vectorizer).
# The alpha values are illustrative; adjust the ranges as needed.
params = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3)],
    'model__alpha': [0.1, 0.5, 1.0],
}

# One fitted grid search per vectorizer:
# results[0] is bag of words, results[1] is TF-IDF
results = []

for vectorizer in [CountVectorizer(), TfidfVectorizer()]:
    pipe = Pipeline([
        ('vectorizer', vectorizer),
        ('model', MultinomialNB()),
    ])
    search = GridSearchCV(estimator=pipe, param_grid=params,
                          scoring='accuracy', n_jobs=-1, verbose=1)
    results.append(search.fit(data.clean_text, data.sentiment))

In [None]:
print(f"Bag of words: {results[0].best_params_}")
print(f"Tfidf: {results[1].best_params_}")

In [None]:
print(f"Bag of words: {results[0].best_score_}")
print(f"Tfidf: {results[1].best_score_}")

In [None]:
# Keep the best cross-validation score across both vectorizers
best_score = max(result.best_score_ for result in results)

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('model_performance',
                         score=best_score)

result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.