# Incorporating Heterogeneous Features with NLP

Natural language processing with scikit-learn's CountVectorizer, or with libraries such as spaCy and Gensim, can provide powerful insights into text data, allowing us to extract topics that can then be added as features and used to generate predictions.

However, what if we want to use the sparse matrix that CountVectorizer produces as a feature alongside other categorical or numerical features in the dataset?

The answer is scikit-learn's ColumnTransformer, and I'll demonstrate its usage on some Yelp review data.

In [None]:
# standard imports
import pandas as pd
import numpy as np

# text processing imports
import re
import spacy

# scikit-learn imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import TruncatedSVD

# scikit-learn pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# scalers
from sklearn.preprocessing import StandardScaler

# randomized search CV
from sklearn.model_selection import RandomizedSearchCV

# !python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
yelp = pd.read_json('./data/review_sample.json', lines=True)

# First I clean the text with regex, then lemmatize with spaCy
cleaning = ['text']

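def regex_clean(dataframe, target_list):
    for target in target_list:
        # Series.apply returns a new Series, so the result must be assigned back
        # replace newlines with spaces first so adjacent words are not glued together
        dataframe[target] = dataframe[target].apply(lambda x: re.sub(r'\n', ' ', x))
        # then drop everything except letters, digits, and spaces
        dataframe[target] = dataframe[target].apply(lambda x: re.sub(r'[^a-zA-Z0-9 ]', '', x))
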

custom_stops = ['<', '>', '\n', '"', '\\', '|', '</div>', '<ul>', '<li>', '<p>', 'li', 'ul', 'p', ']', '\\>', '\n\n']

def tokenize(text):  # returns lemmas rather than raw tokens
    doc = nlp(text)
    lemmas = []
    for token in doc:
        # skip spaCy stop words, punctuation, and pronouns
        if token.is_stop or token.is_punct or token.pos_ == 'PRON':
            continue
        # NOTE: this is a substring match, so short entries such as 'p', 'li', and 'ul'
        # also drop ordinary words that contain them (e.g. 'pizza')
        if any(stop in token.text for stop in custom_stops):
            continue
        lemmas.append(token.lemma_)
    return lemmas

regex_clean(yelp, cleaning)
yelp['lemmas'] = yelp['text'].apply(tokenize)

spaCy's lemmas are now quite clean, and TF-IDF can turn them into a sparse matrix once I join them back into strings.

In [None]:
# Rejoin each list of lemmas into a single string that TfidfVectorizer can process (de-tokenize)
yelp['lemma_text'] = yelp['lemmas'].apply(' '.join)

Now, if we want to predict the star rating of a given review, we can use a pipeline to TF-IDF vectorize the text and then classify it. In fact, let's use our pipeline to feed the TF-IDF matrix into TruncatedSVD, which performs a PCA-like dimensionality reduction (latent semantic analysis) that works directly on sparse matrices. This may help our predictions.
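
To make the TF-IDF to SVD step concrete, here is a minimal sketch on a toy corpus. The four documents and the `_demo` variable names are made up for illustration and are not part of the Yelp sample. It only shows that the vectorizer produces a sparse matrix and that TruncatedSVD accepts that matrix directly and returns a smaller dense array.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the Yelp data)
docs_demo = [
    "great pizza friendly staff",
    "terrible service cold pizza",
    "friendly staff great coffee",
    "cold coffee terrible staff",
]

# TF-IDF returns a SciPy sparse matrix with one column per distinct term
tfidf_demo = TfidfVectorizer()
X_sparse_demo = tfidf_demo.fit_transform(docs_demo)

# TruncatedSVD works on the sparse matrix without densifying it first
svd_demo = TruncatedSVD(n_components=2, random_state=42)
X_reduced_demo = svd_demo.fit_transform(X_sparse_demo)

print(sparse.issparse(X_sparse_demo), X_sparse_demo.shape)
print(type(X_reduced_demo).__name__, X_reduced_demo.shape)
```

The number of components is a tuning choice; the value of 2 here is only small because the toy vocabulary is tiny.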

In [None]:
X = yelp['lemma_text']
y = yelp['stars']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vect = TfidfVectorizer(stop_words='english', max_features=5000)

svd = TruncatedSVD(n_components=500,
                  algorithm='randomized',
                  n_iter=10,
                  random_state=42)

# text processing pipeline
lsi = Pipeline([('vect', vect), ('svd', svd)])

# classifier
clf = SGDClassifier(max_iter=5)

# Classifier pipeline: TF-IDF -> SVD (the lsi pipeline above) -> classifier
pipe = Pipeline([('lsi', lsi), ('clf', clf)])

pipe.fit(X_train,y_train)
print("test model score: %.3f" % pipe.score(X_test, y_test))

This score is not particularly impressive (I got 57.8% when I ran it) and could be improved with some hyperparameter tuning using RandomizedSearchCV. First, though, perhaps we would like to take advantage of some of the other features in the dataset.
The 'cool', 'funny', and 'useful' ratings may be helpful. Also, maybe some users tend to leave a certain type of review? Or some businesses are just great, or terrible, and always receive the same type of review?
Combining those features with our NLP features could improve our model.
This brings us to the ColumnTransformer. It applies a separate transformer, or an entire pipeline, to each group of columns you specify and concatenates the outputs into a single feature matrix that we can fit a model on. Let's try adding some additional features.
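
As a rough illustration of that concatenation, the sketch below runs a ColumnTransformer on a three-row toy DataFrame. The data and the names `toy_ct`, `ct_demo`, and `features_demo` are invented for this example; only the column names mirror the Yelp data. The output has one column per scaled numeric feature plus one column per TF-IDF term.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values)
toy_ct = pd.DataFrame({
    "lemma_text": ["great pizza", "cold pizza", "great coffee"],
    "useful": [3, 0, 1],
    "funny": [0, 2, 1],
})

ct_demo = ColumnTransformer([
    # a list of column names hands the transformer a 2-D selection
    ("scale", StandardScaler(), ["useful", "funny"]),
    # a single string hands it a 1-D column, which text vectorizers expect
    ("tfidf", TfidfVectorizer(), "lemma_text"),
])

# 2 scaled columns + 4 TF-IDF terms (great, pizza, cold, coffee)
features_demo = ct_demo.fit_transform(toy_ct)
print(features_demo.shape)
```

Note the difference between passing `"lemma_text"` and `["lemma_text"]`: text vectorizers need the 1-D form, so the column is given as a plain string.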

In [None]:
column_trans = ColumnTransformer(
    [
    ('scale', StandardScaler(), ['cool', 'funny', 'useful']),
    ('lsi', lsi, 'lemma_text')],
    n_jobs=-1, remainder='drop', verbose=True)

column_pipe = Pipeline([('ct', column_trans), ('clf', clf)])

X = yelp.drop(columns=['stars']) 
y = yelp['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
column_pipe.fit(X_train,y_train)

print("test model score: %.3f" % column_pipe.score(X_test, y_test))

A couple of things to note. First, our score didn't improve, but we can do some hyperparameter tuning to see whether it does. Second, we can pass our original `lsi` pipeline directly into `ColumnTransformer()`. We also have the ability to drop all columns not specified with `remainder='drop'` (though we could instead pass them through with `remainder='passthrough'`, and the ColumnTransformer would append them to the matrix produced by the other transformers).
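
The effect of the `remainder` argument can be checked on a small example. This is a sketch on made-up data; `toy_rem` and the helper `n_output_columns` are hypothetical names used only here. It counts the output columns when the unspecified numeric columns are dropped versus passed through.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical values)
toy_rem = pd.DataFrame({
    "lemma_text": ["great pizza", "cold pizza", "great coffee"],
    "useful": [3, 0, 1],
    "funny": [0, 2, 1],
})

def n_output_columns(remainder):
    """Fit a ColumnTransformer that only names the text column and count its output columns."""
    ct = ColumnTransformer(
        [("tfidf", TfidfVectorizer(), "lemma_text")],
        remainder=remainder,
    )
    return ct.fit_transform(toy_rem).shape[1]

dropped = n_output_columns("drop")              # TF-IDF columns only
passed_through = n_output_columns("passthrough")  # TF-IDF columns plus 'useful' and 'funny'
print(dropped, passed_through)
```

Passing columns through only makes sense when they are already numeric, since the downstream classifier receives them unchanged.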

In [None]:
# use Randomized search cv to optimize
params = {
    'ct__lsi__vect__max_df': (.5, .7, 1.0),  # floats are proportions of documents; an integer 1 would mean "at most 1 document"
    'ct__lsi__vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'clf__loss' : ('hinge', 'modified_huber'),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'ct__lsi__svd__n_components': (2, 50, 75, 100)
}

rsCV = RandomizedSearchCV(estimator=column_pipe, param_distributions=params, 
                          cv=5, n_jobs=-1, random_state=42, verbose=1)

rsCV.fit(X_train,y_train)

rsCV.score(X_test,y_test)

And we get a minor improvement on our first attempt at using multiple features. The power of the ColumnTransformer is in enabling you to specify and then tune the treatment of each feature in your dataset.
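
The parameter names used in the search grid follow scikit-learn's double-underscore convention: each step name along the path is joined with `__`. A quick way to find the valid names is to list `get_params()` on the unfitted pipeline. The sketch below rebuilds a pipeline with the same step names as above using default estimators; the `_demo` objects are stand-ins and are not the tuned models.

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same step names as the pipeline above, with default (untuned) estimators
lsi_names_demo = Pipeline([("vect", TfidfVectorizer()), ("svd", TruncatedSVD())])
ct_names_demo = ColumnTransformer([
    ("scale", StandardScaler(), ["cool", "funny", "useful"]),
    ("lsi", lsi_names_demo, "lemma_text"),
])
pipe_names_demo = Pipeline([("ct", ct_names_demo), ("clf", SGDClassifier())])

# get_params() lists every settable name; nesting levels are joined with '__'
tunable_names = sorted(pipe_names_demo.get_params().keys())
print([name for name in tunable_names if name.startswith("ct__lsi__svd__")])
```

After a search has been fitted, `rsCV.best_params_` reports the winning combination using these same names.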