# NLP Fun 🎉 - Solution


---
By Jeff Hale


### Learning Objectives:

By the end of this lesson students will:
- Have learned a workflow for using text data in models
- Understand sklearn's CountVectorizer and TfidfVectorizer
- Use nltk's lemmatization or stemming as part of CountVectorizer or TfidfVectorizer
- Use CountVectorizer and TfidfVectorizer in a Pipeline with GridSearchCV
- Be able to use make_column_transformer to create pipelines with text and non-text features


When you have text data and you want to make a model this is the workflow I suggest:



## Make a basic model first

Just use the text data
- Use CountVectorizer to transform
- Use MultinomialNB to predict

Now you have a baseline model.

## Then add complexity

### Options:
- Add lemmatization/stemming
- Hyperparameter tune (e.g. ngrams)
- Use Tfidf
- Add non-text features

Put your workflow a Pipeline with GridSearchCV to be able to iterate faster and reduce the chance of errors.

In [235]:
# imports
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer 

### Data
The data is Yelp reviews.

We'll filter the data so it just includes the 5-star and 1-star responses.

The goal is to classify a reviews number of stars as a 5 or a 1.

In [236]:
path = './data/yelp.csv'
yelp = pd.read_csv(path)
yelp.head(2)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0


In [237]:
yelp_best_worst = yelp[(yelp['stars']==5) | (yelp['stars']==1)]

In [238]:
X = yelp_best_worst['text']
y = yelp_best_worst['stars']

### Split into train and test

In [239]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Basic CountVectorizer first

Use CountVectorizer to process the text data.

- use the built-in stopwords
- lowercase everything


In [2]:
cv = CountVectorizer()
X_train_cvec = cv.fit_transform(X_train)
X_test_cvec = cv.transform(X_test)

NameError: name 'CountVectorizer' is not defined

#### Use a Naive Bayes

Instantiate, fit_transform, transform, and predict 

In [241]:
nb = MultinomialNB() 

nb.fit(X_train_processed, y_train)
nb.score(X_test_processed, y_test)

0.9187866927592955

#### How did that do?

#### Let's add stop words

In [242]:
cv = CountVectorizer(stop_words='english')
X_train_processed = cv.fit_transform(X_train)
X_test_processed = cv.transform(X_test)

In [243]:
nb = MultinomialNB()

nb.fit(X_train_processed, y_train)
nb.score(X_test_processed, y_test)

0.9158512720156555

##  TFIDF

Now try a TFIDF model.

In [2]:
cv = TfidfVectorizer()
X_train_processed = cv.fit_transform(X_train)
X_test_processed = cv.transform(X_test)

NameError: name 'TfidfVectorizer' is not defined

In [3]:
nb = MultinomialNB()

nb.fit(X_train_processed, y_train)
nb.score(X_test_processed, y_test)

NameError: name 'MultinomialNB' is not defined

#### How does that perform?

## Add a Stemmer

#### Instantiate a stemmer object.

In [246]:
stemmer = SnowballStemmer('english')

#### Stem the text in the training and test sets.

In [247]:
X_train_stemmed = [' '.join([stemmer.stem(word) for word in text.split(' ')])
    for text in X_train]

In [248]:
X_train_stemmed[:3]

["filly-b's!!!!!  onli 8 reviews?? nine now!!!\n\nwow do i miss this place:\n\n- 24hrs\n- drive-thru or walk up only\n- ridicul cheap\n- ridicul tasty\n\nof cours the arizona burrito are good, everyth is good. i use to love one of the combos... you get a beef burrito, taco, rice and beans... for under $6.  wow.  color me silli and call me sally. they have bomb horchata too.\n\nreal good and fresh flautas/rol taco and breakfast burritos. damn, everyth here is good, whether drunk or sober.",
 'my husband and i absolut love this restaurant! anytim i find myself crave mexican food, the first place that pop in my head is salsa blanca. we have alway encount friendly, welcom staff and amazing, fulfil food. what more could you ask for?!',
 'we went today after lunch. i got my usual of lime basil and real mint chip (which i love for the real mint leaves) and my hubbi got chocol guiness and four peak hop knot. best ice cream in phoenix! the staff is alway super nice. they give us ice bag to take

In [249]:
type(X_train_stemmed)

list

In [250]:
X_test_stemmed = [' '.join([stemmer.stem(word) for word in text.split(' ')]) for text in X_test]

### Tokenize and vectorize with CountVectorizer

In [251]:
cv = CountVectorizer()
X_train_processed = cv.fit_transform(X_train_stemmed)
X_test_processed = cv.transform(X_test_stemmed)

### Use the transformed features with a NaiveBayes model.

In [252]:
nb = MultinomialNB()

nb.fit(X_train_processed, y_train)
nb.score(X_test_processed, y_test)

0.9178082191780822

### How does that look?

## Try Lemmatizer

### This time let's make a function to lemmatize the text.

In [253]:
def split_into_lemmas(text):
    '''return lemmatizeed list of words as a string from a document passed in '''
    text = text.lower()
    lemmer = WordNetLemmatizer()
    return "".join([lemmer.lemmatize(word) for word in text])

Let's use the pandas function `.apply()` to transform each row.

In [254]:
X_train_lem = X_train.apply(split_into_lemmas)
X_train_lem[:5]

6841    filly-b's!!!!!  only 8 reviews?? nine now!!!\n...
1728    my husband and i absolutely love this restaura...
3853    we went today after lunch. i got my usual of l...
671     totally dissapointed.  i had purchased a coupo...
4920    costco travel - my husband and i recently retu...
Name: text, dtype: object

In [181]:
X_test_lem = X_test.apply(split_into_lemmas)

In [182]:
cv = CountVectorizer()
X_train_processed = cv.fit_transform(X_train_lem)
X_test_processed = cv.transform(X_test_lem)

In [183]:
nb = MultinomialNB()

nb.fit(X_train_processed, y_train)
nb.score(X_test_processed, y_test)

0.9187866927592955

How did that do?

## Pipeline  


#### Let's do this the smart way and make a pipeline.

In [191]:
pipe = make_pipeline(
    CountVectorizer(preprocessor=split_into_lemmas, ngram_range = (1,1)), 
    MultinomialNB()
)

pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('countvectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1),
                                 preprocessor=<function split_into_lemmas at 0x1a2928fe60>,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('multinomialnb',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [192]:
pipe.score(X_test, y_test)

0.9187866927592955

## GridSearchCV

Want to tune hyperparameters? Let's do it! 🚀

### Warning, this might take 10min to run. 

In [161]:
from sklearn.model_selection import GridSearchCV

In [193]:
params = dict(
    countvectorizer__ngram_range=[(1,1), (1,2), (1,3)],
    multinomialnb__alpha=[.5, 1, 5]
)

In [194]:
gs = GridSearchCV(estimator=pipe, param_grid=params)

In [195]:
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('countvectorizer',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                              

In [196]:
gs.score(X_test, y_test)

0.9256360078277887

In [197]:
gs.best_params_

{'countvectorizer__ngram_range': (1, 1), 'multinomialnb__alpha': 0.5}

#### How about them apples? 🍎

## Add non-text columns

In [210]:
X = yelp_best_worst[['text', 'cool', 'useful']]
y = yelp_best_worst['stars']

In [212]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4086 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    4086 non-null   object
 1   cool    4086 non-null   int64 
 2   useful  4086 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 127.7+ KB


In [214]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [231]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [232]:
ct = make_column_transformer(
    (CountVectorizer(preprocessor=split_into_lemmas, ngram_range = (1,1)), 'text'),
    (MinMaxScaler(), ['cool', 'useful'])
)

# 'text' is not passed in a list - very confusing
# # https://stackoverflow.com/a/56299794/4590385  

# docs: https://github.com/scikit-learn-contrib/sklearn-pandas#map-the-columns-to-transformations
# Be aware that some transformers expect a 1-dimensional input 
# (the label-oriented ones) while some others, like OneHotEncoder or Imputer, 
# expect 2-dimensional input, with the shape [n_samples, n_features].

In [233]:
pipe = make_pipeline(
    ct,
    MultinomialNB()
)

pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('countvectorizer',
                                                  CountVectorizer(analyzer='word',
                                                                  binary=False,
                                                                  decode_error='strict',
                                                                  dtype=<class 'numpy.int64'>,
                                                                  encoding='utf-8',
                                                                  input='content',
                                                                  lowercase=True,
                                                                  max_df=1.0,
             

In [234]:
pipe.score(X_test, y_test)

0.9070450097847358

#### Did adding the two non-text columns help?


### Bonus ⭐️

- Try differentiating the 4 star vs. 5 star reviews.
- Try a logistic regression or knn model
- Try more transformer hyperparameters

## Summary 

You've seen how to use `CountVectorizer` and `TfidfVectorizer` for NLP with a pipeline and grid searching.

### Check for Understanding

- What is bag of words?
- What is TF-IDF?

- What are stop words?
- What are n-grams?

- What's a document?
- What's a corpus?

- How are stemming and lemmatization different?




#### NLP is a big area, but you've covered lots of the core aspects! 🎉

