# 3 - Advanced Tokenization, Stemming, and Lemmatization

As mentioned previously, the feature extraction in CountVectorizer and TfidfVectorizer is relatively simple, and much more elaborate methods are possible. One particular step that is often improved in more sophisticated text-processing applications is the first step in the bag-of-words model: tokenization. This step defines what constitutes a word for the purpose of feature extraction.
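
To make the role of tokenization concrete, here is a minimal sketch that uses only Python's re module. The pattern is the default token_pattern of CountVectorizer; the helper name regexp_tokenize and the sample sentence are illustrative assumptions, not part of scikit-learn.

```python
import re

# Default token pattern of CountVectorizer: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def regexp_tokenize(text):
    """Lowercase the text and return every match of TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = regexp_tokenize("I'm scared of meeting the clients tomorrow")
print(tokens)
```

With this definition of a word, single-character pieces such as the two halves of "I'm" are dropped and punctuation never becomes a token.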

We saw earlier that the vocabulary often contains singular and plural versions of some words, as in "drawback" and "drawbacks", "drawer" and "drawers", and "drawing" and "drawings". For the purposes of a bag-of-words model, the semantics of "drawback" and "drawbacks" are so close that distinguishing them will only increase overfitting, and not allow the model to fully exploit the training data. Similarly, we found the vocabulary includes words like "replace", "replaced", "replacement", "replaces", and "replacing", which are different verb forms and a noun relating to the verb "to replace". As with singular and plural forms of a noun, treating different verb forms and related words as distinct tokens is disadvantageous for building a model that generalizes well.

This problem can be overcome by representing each word using its word stem, which involves identifying (or conflating) all the words that have the same word stem. If this is done by using a rule-based heuristic, like dropping common suffixes, it is usually referred to as stemming. If instead a dictionary of known word forms is used (an explicit and human-verified system), and the role of the word in the sentence is taken into account, the process is referred to as lemmatization and the standardized form of the word is referred to as the lemma. Both processing methods, lemmatization and stemming, are forms of normalization that try to extract some normal form of a word.
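
The rule-based side of this distinction can be sketched in a few lines. The suffix list and the function name toy_stem below are assumptions made for illustration; this is a deliberately crude heuristic and not the Porter algorithm.

```python
# Toy suffix-stripping rules, checked in order (NOT the Porter algorithm)
SUFFIXES = ("ings", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


words = ["drawback", "drawbacks", "drawer", "drawers", "drawing", "drawings"]
stems = [toy_stem(w) for w in words]
print(stems)
print(len(set(words)), "distinct tokens ->", len(set(stems)), "distinct stems")
```

Because the rules know nothing about the language, they conflate singular and plural forms but also produce non-words: toy_stem("was") returns "wa". A dictionary-based lemmatizer avoids this at the cost of needing a curated word list.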


To get a better understanding of normalization, let’s compare a method for stemming (the Porter stemmer, a widely used collection of heuristics, here imported from the nltk package) to lemmatization as implemented in the spacy package:


In [1]:
# Prepare and load the dataset

# scikit-learn's load_files function loads a text dataset from a directory
from sklearn.datasets import load_files
review_train = load_files('dataset/')
# load_files returns a Bunch containing the training texts and training labels
text_train, y_train = review_train.data, review_train.target

In [2]:
import spacy
import nltk



In [3]:
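# load spacy's English-language model ('en' is a shortcut name from older
# spacy releases; newer ones expect a full name such as 'en_core_web_sm')
en_nlp = spacy.load('en')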
# instantiate nltk's Porter stemmer
stemmer=nltk.stem.PorterStemmer()

In [4]:
def compare_normalization(doc):
    # tokenize document in spacy
    doc_spacy=en_nlp(doc)
    # print lemmas found by spacy
    print('Lemmatization:')
    print([token.lemma_ for token in doc_spacy])
    
    # print the tokens produced by the Porter stemmer
    print("Stemming:")
    print([stemmer.stem(token.norm_.lower()) for token in doc_spacy])

In [8]:
compare_normalization(u'Our meeting today was worse than yesterday, '
                     "I'm scared of meeting the clients tomorrow")

Lemmatization:
['-PRON-', 'meeting', 'today', 'be', 'bad', 'than', 'yesterday', ',', '-PRON-', 'be', 'scared', 'of', 'meet', 'the', 'client', 'tomorrow']
Stemming:
['our', 'meet', 'today', 'wa', 'wors', 'than', 'yesterday', ',', 'i', 'am', 'scare', 'of', 'meet', 'the', 'client', 'tomorrow']


Stemming is always restricted to trimming the word to a stem, so "was" becomes "wa", while lemmatization can retrieve the correct base verb form, "be". Similarly, lemmatization can normalize "worse" to "bad", while stemming produces "wors". Another major difference is that stemming reduces both occurrences of "meeting" to "meet". Using lemmatization, the first occurrence of "meeting" is recognized as a noun and left as is, while the second occurrence is recognized as a verb and reduced to "meet". In general, lemmatization is a much more involved process than stemming, but it usually produces better results than stemming when used for normalizing tokens for machine learning.
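
The noun/verb distinction for "meeting" can be mimicked with a lookup keyed on both the word and its part of speech. This is only a toy sketch: the table, the tag names, and the hand-assigned tags are assumptions for illustration, whereas a real lemmatizer such as the one in spacy predicts the tags itself.

```python
# Hypothetical lemma table keyed on (lowercased word, part-of-speech tag)
LEMMA_TABLE = {
    ("was", "VERB"): "be",
    ("worse", "ADJ"): "bad",
    ("meeting", "VERB"): "meet",
    ("clients", "NOUN"): "client",
}


def toy_lemmatize(tagged_tokens):
    """Look up each (word, tag) pair; unknown pairs are returned lowercased."""
    return [LEMMA_TABLE.get((word.lower(), tag), word.lower())
            for word, tag in tagged_tokens]


# The tags are assigned by hand here
tagged = [("meeting", "NOUN"), ("was", "VERB"), ("worse", "ADJ"),
          ("meeting", "VERB"), ("clients", "NOUN")]
lemmas = toy_lemmatize(tagged)
print(lemmas)
```

The same surface form maps to different lemmas depending on its tag, which is something a suffix-stripping stemmer cannot express.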

While scikit-learn implements neither form of normalization, CountVectorizer
allows specifying your own tokenizer to convert each document into a list of tokens
using the tokenizer parameter. We can use the lemmatization from spacy to create a
callable that will take a string and produce a list of lemmas:

In [9]:
# Technicality: ideally we would use the regexp-based tokenizer
# that is used by CountVectorizer and only use the lemmatization
# from spacy, by replacing en_nlp.tokenizer (the spacy tokenizer)
# with the regexp-based tokenization. This cell only prepares for that:
# the regexp and the old tokenizer are stored but never assigned back,
# so custom_tokenizer below still relies on spacy's own tokenization.

import re
from sklearn.feature_extraction.text import CountVectorizer
# regexp used in CountVectorizer
regexp = re.compile('(?u)\\b\\w\\w+\\b')

# load spacy language model and save old tokenizer
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer

# create a custom tokenizer using the spacy document processing pipeline
def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

# define a count vectorizer with the custom tokenizer
lemma_vect = CountVectorizer(tokenizer=custom_tokenizer, min_df=5)
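
The tokenizer parameter accepts any callable that maps a string to a list of tokens, so the same mechanism can be tried without spacy. The sketch below is an assumption-laden illustration: toy_stem, stemming_tokenizer, and the two sample documents are hypothetical, and the suffix rules are not a real stemmer.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
SUFFIXES = ("ings", "ing", "ed", "s")  # toy rules, not a real stemmer


def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def stemming_tokenizer(document):
    """Callable for CountVectorizer: string in, list of normalized tokens out."""
    return [toy_stem(token) for token in TOKEN_PATTERN.findall(document)]


docs = ["The drawings were replaced", "A drawing replaces the drawers"]

# token_pattern=None makes explicit that the default pattern is not used
stem_vect = CountVectorizer(tokenizer=stemming_tokenizer, token_pattern=None)
X_toy = stem_vect.fit_transform(docs)
vocabulary = sorted(stem_vect.vocabulary_)
print(X_toy.shape)
print(vocabulary)
```

CountVectorizer lowercases each document before calling the tokenizer, so the callable itself does not need to. Note that the toy rules merge "drawing" and "drawings" but leave "replaced" and "replaces" as two different stems, which is the kind of gap a proper stemmer or lemmatizer is meant to close.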

Let's transform the data and inspect the vocabulary size:

In [10]:
# transform text_train using CountVectorizer
X_train_lemma=lemma_vect.fit_transform(text_train)
print("X_train_lemma.shape: {}".format(X_train_lemma.shape))

X_train_lemma.shape: (25000, 23764)


In [11]:
# standard CountVectorizer for reference
vect=CountVectorizer(min_df=5).fit(text_train)
X_train=vect.transform(text_train)
print('X_train.shape: {}'.format(X_train.shape))

X_train.shape: (25000, 27272)


As you can see from the output, lemmatization reduced the number of features from 27,272 (with the standard CountVectorizer processing) to 23,764. Lemmatization can be seen as a kind of regularization, as it conflates certain features. Therefore, we expect lemmatization to improve performance most when the dataset is small. To illustrate how lemmatization can help, we will use StratifiedShuffleSplit for cross-validation, using only 1% of the data as training data and the rest as test data:

In [15]:
# build a grid search using only 1% of the data as the training set
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.99, train_size=0.01,
                            random_state=0)
grid = GridSearchCV(LogisticRegression(), param_grid, cv=cv)

# perform grid search with standard CountVectorizer
grid.fit(X_train,y_train)
print("Best cross-validation score (standard CountVectorizer): {:.3f}".format(grid.best_score_))

# perform grid search with lemmatization
grid.fit(X_train_lemma,y_train)
print("Best cross-validation score (lemmatization): {:.3f}".format(grid.best_score_))



Best cross-validation score (standard CountVectorizer): 0.719
Best cross-validation score (lemmatization): 0.719


In this case, lemmatization left the cross-validation score unchanged at 0.719 while using fewer features. As with many of the different feature extraction techniques, the result varies depending on the dataset. Lemmatization and stemming can sometimes help in building better (or at least more compact) models, so we suggest you give these techniques a try when trying to squeeze out the last bit of performance on a particular task.
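
The 1%/99% evaluation protocol used in the grid search above can also be examined on its own. The following sketch runs the same splitter on synthetic data (1,000 placeholder samples with two balanced classes, an assumption made purely for illustration) and records the size and class balance of each split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the real data: 1,000 samples, two balanced classes
X_demo = np.zeros((1000, 1))
y_demo = np.array([0] * 500 + [1] * 500)

cv_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.99,
                                 train_size=0.01, random_state=0)

split_sizes = []
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    # per-class counts inside the small training portion
    train_counts = np.bincount(y_demo[train_idx]).tolist()
    split_sizes.append((len(train_idx), len(test_idx), train_counts))

print(split_sizes)
```

Each of the five splits trains on 10 samples and tests on the remaining 990, and stratification keeps both classes equally represented in the tiny training portion. This is why such a setup is a reasonable way to probe how much a regularizing step like lemmatization helps when labeled data is scarce.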