# NLP Fun 🎉 - Solution


---
By Jeff Hale


### Learning Objectives:

By the end of this lesson students will:
- Have learned a workflow for using text data in models
- Understand sklearn's CountVectorizer and TfidfVectorizer
- Use nltk's lemmatization or stemming as part of CountVectorizer or TfidfVectorizer
- Use CountVectorizer and TfidfVectorizer in a Pipeline with GridSearchCV
- Be able to use make_column_transformer to create pipelines with text and non-text features


When you have text data and want to build a model, this is the workflow I suggest:



## Make a basic model first

Just use the text data
- Use CountVectorizer to transform
- Use MultinomialNB to predict

Now you have a baseline model.

## Then add complexity

### Options:
- Add lemmatization/stemming
- Hyperparameter tune (e.g. ngrams)
- Use Tfidf
- Add non-text features

Put your workflow in a Pipeline with GridSearchCV to be able to iterate faster and reduce the chance of errors.

In [None]:
# imports
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler

from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

### Data
The data is Yelp reviews.

We'll filter the data so it just includes the 5-star and 1-star responses.

The goal is to classify a review's number of stars as a 5 or a 1.

In [None]:
path = './data/yelp.csv'
yelp = pd.read_csv(path)
yelp.head(2)

### Filter the DataFrame to only keep reviews with 1 or 5 stars
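One way to do the filtering is with a boolean mask built from `isin`. The sketch below runs on a small made-up DataFrame (`toy_reviews` and its rows are hypothetical); only the `stars` and `text` column names come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the Yelp data; only the column names match the lesson.
toy_reviews = pd.DataFrame({
    'stars': [5, 1, 3, 4, 5, 1, 2],
    'text': ['great food', 'awful service', 'it was ok', 'pretty good',
             'loved it', 'never again', 'meh'],
})

# Keep only the rows whose star rating is 1 or 5.
toy_best_worst = toy_reviews[toy_reviews['stars'].isin([1, 5])]
print(toy_best_worst['stars'].value_counts())
```

Applying the same mask to `yelp` would give the `yelp_best_worst` DataFrame used later in the lesson.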

#### Split into X and y

### Split into train and test
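A minimal sketch of both splitting steps, again on made-up data (`toy_best_worst` and the `_toy` variable names are hypothetical). Note that `X_toy` is a single column selected as a Series, because the text vectorizers expect a 1-dimensional sequence of documents. Stratifying on the target is an optional choice made here, not something the lesson requires.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical filtered data: four 5-star and four 1-star reviews.
toy_best_worst = pd.DataFrame({
    'stars': [5, 1, 5, 1, 5, 1, 5, 1],
    'text': ['great food', 'awful service', 'loved it', 'never again',
             'amazing staff', 'rude staff', 'best tacos', 'worst tacos'],
})

# Split into features and target: a 1-D Series of documents and a Series of labels.
X_toy = toy_best_worst['text']
y_toy = toy_best_worst['stars']

# Split into train and test; stratify keeps the class balance in both parts.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)
print(X_train_toy.shape, X_test_toy.shape)
```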

## Basic CountVectorizer first

Use CountVectorizer to process the text data.

- use the built-in stopwords
- lowercase everything


#### Use a Naive Bayes

Instantiate, fit_transform, transform, and predict 
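The four steps can be sketched as follows. The documents and labels are made up for illustration (`train_docs`, `test_docs`, and `train_stars` are hypothetical names); with the real data you would pass the training and test text columns instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy documents and star labels.
train_docs = ['great food great service', 'terrible food rude service',
              'great place loved it', 'terrible place hated it']
train_stars = [5, 1, 5, 1]
test_docs = ['great great', 'terrible terrible']

# Instantiate the vectorizer (lowercasing is on by default).
cvec = CountVectorizer()

# fit_transform learns the vocabulary from the training text only.
X_train_counts = cvec.fit_transform(train_docs)

# transform reuses that vocabulary on the test text.
X_test_counts = cvec.transform(test_docs)

# Fit Naive Bayes on the training counts and predict on the test counts.
nb = MultinomialNB()
nb.fit(X_train_counts, train_stars)
preds = nb.predict(X_test_counts)
print(preds)
```

Calling `fit_transform` on the training text and only `transform` on the test text keeps test-set vocabulary from leaking into the model.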

#### How did that do?
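A common way to answer this is to compare test accuracy with the accuracy of always guessing the most frequent class. The labels and predictions below are made up for illustration (`y_true` and `y_pred` are hypothetical).

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for six test reviews.
y_true = pd.Series([5, 5, 5, 1, 5, 1])
y_pred = pd.Series([5, 5, 5, 1, 5, 5])

# Share of predictions that match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# Accuracy you would get by always predicting the most frequent class.
null_accuracy = y_true.value_counts(normalize=True).max()

print(f'model: {accuracy:.3f}  majority-class baseline: {null_accuracy:.3f}')
```

A model is only useful if it beats the majority-class baseline, which matters here because 5-star and 1-star reviews are unlikely to be evenly balanced.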

#### Let's add stop words
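As a small illustration, the sketch below fits one vectorizer without stop words and one with sklearn's built-in English list, then shows which words were dropped. The two documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example documents.
docs = ['The food was great and the service was great',
        'This is the worst place in town']

# Same text, with and without sklearn's built-in English stop word list.
plain = CountVectorizer().fit(docs)
no_stops = CountVectorizer(stop_words='english').fit(docs)

# Words that appear in the plain vocabulary but were dropped as stop words.
removed = sorted(set(plain.vocabulary_) - set(no_stops.vocabulary_))
print(removed)
```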

##  TFIDF

Now try a TFIDF model.
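Swapping in `TfidfVectorizer` follows the same fit/transform pattern as `CountVectorizer`. This is a sketch on made-up documents (`train_docs`, `test_docs`, and `train_stars` are hypothetical), not the lesson's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy documents and star labels.
train_docs = ['great food great service', 'terrible food rude service',
              'great place loved it', 'terrible place hated it']
train_stars = [5, 1, 5, 1]
test_docs = ['great great', 'terrible terrible']

# TF-IDF weights each word count by how rare the word is across documents.
tvec = TfidfVectorizer()
X_train_tfidf = tvec.fit_transform(train_docs)
X_test_tfidf = tvec.transform(test_docs)

# TF-IDF values are non-negative, so MultinomialNB accepts them.
nb = MultinomialNB()
nb.fit(X_train_tfidf, train_stars)
preds = nb.predict(X_test_tfidf)
print(preds)
```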

#### How does that perform?

# Stemming and Lemmatization
Adding a stemmer or lemmatizer from nltk or spaCy can be done several ways. We did something similar in global. I'm going to leave that for the solution, and we'll move to just using those in a pipeline because it's much more straightforward.

##  Lemmatize

### Let's make a function to lemmatize the text.

In [None]:
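def split_into_lemmas(text):
    '''Return a document as one string of lowercased, lemmatized words.'''
    lemmatizer = WordNetLemmatizer()  # requires nltk.download('wordnet')
    return ' '.join(lemmatizer.lemmatize(word) for word in text.lower().split())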
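Stemming can be plugged in the same way. The sketch below uses nltk's `SnowballStemmer`, which does not need any corpus download; `stem_text` is a hypothetical helper written for this example. Passing a custom `preprocessor` replaces the vectorizer's built-in lowercasing step, which is why the helper lowercases the text itself.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer('english')

def stem_text(text):
    '''Hypothetical helper: lowercase a document and stem each whitespace-separated word.'''
    return ' '.join(stemmer.stem(word) for word in text.lower().split())

# The preprocessor runs on each raw document before tokenization.
cvec = CountVectorizer(preprocessor=stem_text)
cvec.fit(['The waiters were running around', 'A waiter runs to our table'])
print(sorted(cvec.vocabulary_))
```

Because "running" and "runs" both stem to "run", they share one column in the resulting matrix instead of two.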


## Pipeline  


#### Let's do this the smart way and make a pipeline.
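A minimal sketch of the vectorizer and the classifier chained with `make_pipeline`, using made-up documents (`train_docs`, `test_docs`, and `train_stars` are hypothetical). The pipeline takes raw text in `fit` and `predict`, so there is no separate `fit_transform`/`transform` bookkeeping.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy documents and star labels.
train_docs = ['great food great service', 'terrible food rude service',
              'great place loved it', 'terrible place hated it']
train_stars = [5, 1, 5, 1]
test_docs = ['great great', 'terrible terrible']

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(
    CountVectorizer(stop_words='english'),
    MultinomialNB()
)

# Fit on raw training text, then predict on raw test text.
pipe.fit(train_docs, train_stars)
preds = pipe.predict(test_docs)
print(list(pipe.named_steps))
```

The step names printed at the end (`countvectorizer`, `multinomialnb`) are the prefixes used when naming hyperparameters in a grid search.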

## GridSearchCV

Want to tune hyperparameters? Let's do it! 🚀

### Warning: this might take 10 minutes to run.
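Here is a small sketch of a grid search over vectorizer settings. The documents, labels, and the choice of grid values are made up for illustration; on this tiny dataset it runs in a moment, unlike the full Yelp data. Hyperparameters are addressed as `<step name>__<parameter>`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: six 5-star and six 1-star reviews.
docs = ['great food and great service', 'loved the food', 'great place',
        'amazing service loved it', 'the best food in town',
        'great staff and amazing food',
        'terrible food and rude service', 'hated the food', 'terrible place',
        'awful service hated it', 'the worst food in town',
        'rude staff and awful food']
stars = [5] * 6 + [1] * 6

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Keys are '<step name>__<parameter>'; values are the options to try.
param_grid = {
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'countvectorizer__stop_words': [None, 'english'],
}

# 3-fold cross-validation over every combination in the grid.
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(docs, stars)

print(grid.best_params_)
print(grid.best_score_)
```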

#### How about them apples? 🍎

## Add non-text columns

In [None]:
X = yelp_best_worst[['text', 'cool', 'useful']]
y = yelp_best_worst['stars']

In [None]:
X.info()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

#### Let's use `make_column_transformer` and scale our data

##### FYI, MultinomialNB requires non-negative feature values, which is why we scale with MinMaxScaler

In [None]:
ct = make_column_transformer(
    (CountVectorizer(preprocessor=split_into_lemmas, ngram_range = (1,1)), 'text'),
    (MinMaxScaler(), ['cool', 'useful'])
)

# 'text' is passed as a plain string, not a list, so CountVectorizer receives
# a 1-dimensional column of documents: https://stackoverflow.com/a/56299794/4590385
# The numeric columns are passed as a list because MinMaxScaler expects
# 2-dimensional input with the shape [n_samples, n_features].
# The sklearn-pandas docs describe the same 1-D vs. 2-D distinction:
# https://github.com/scikit-learn-contrib/sklearn-pandas#map-the-columns-to-transformations

Make the pipeline

In [None]:
pipe = make_pipeline(
    ct,
    MultinomialNB()
)

pipe.fit(X_train, y_train)

#### Did adding the two non-text columns help?


### Bonus ⭐️

- Try differentiating the 4 star vs. 5 star reviews
- Try a logistic regression or knn model
- Try more transformer hyperparameters
- Use GridSearchCV with the final pipeline. Hint: `columntransformer__countvectorizer__ngram_range` is how to name grid elements.
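For the last bonus item, a sketch of how the nested parameter name might be used. The DataFrame `toy` and its values are made up; only the column names `text`, `cool`, `useful`, and `stars` come from the lesson. The name chains the pipeline step, the transformer inside the column transformer, and the parameter, joined by double underscores.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with one text column and two numeric columns.
toy = pd.DataFrame({
    'text': ['great food and great service', 'loved the food', 'great place',
             'amazing service loved it', 'the best food in town',
             'great staff and amazing food',
             'terrible food and rude service', 'hated the food',
             'terrible place', 'awful service hated it',
             'the worst food in town', 'rude staff and awful food'],
    'cool': [2, 0, 1, 3, 0, 1, 0, 1, 0, 2, 0, 0],
    'useful': [1, 0, 2, 1, 0, 3, 2, 0, 1, 4, 0, 1],
    'stars': [5] * 6 + [1] * 6,
})
X_toy = toy[['text', 'cool', 'useful']]
y_toy = toy['stars']

# 'text' is a string (1-D input); the numeric columns are a list (2-D input).
ct = make_column_transformer(
    (CountVectorizer(), 'text'),
    (MinMaxScaler(), ['cool', 'useful'])
)
pipe = make_pipeline(ct, MultinomialNB())

# Nested name: pipeline step, then transformer inside it, then the parameter.
param_grid = {
    'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_toy, y_toy)
preds = grid.predict(X_toy)
print(grid.best_params_)
```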

## Summary 

You've seen how to use `CountVectorizer` and `TfidfVectorizer` for NLP with a pipeline and grid searching.

### Check for Understanding

- What is bag of words?
- What is TF-IDF?

- What are stop words?
- What are n-grams?

- What's a document?
- What's a corpus?

- How are stemming and lemmatization different?




#### NLP is a big area, but you've covered lots of the core aspects! 🎉

