# Stemming

Stemming is the process of reducing inflected or derived words to their word stem or root.

In practice, this means crudely cutting off the end of a word and leaving only its base.

Examples:

-- Stemming and Stemmed will be reduced to Stem

-- Electricity and Electrical will be reduced to Electr

-- Berries and Berry will be reduced to Berri

-- Connection, Connective, and Connected will all be reduced to Connect

Sometimes this crude method conflates words with different meanings (e.g. Meanness and Meaning are both reduced to Mean).
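To make the idea of "cutting off the end of a word" concrete, here is a minimal sketch of suffix stripping. It is a toy and not the Porter algorithm: the suffix list and the names `crude_stem` and `min_stem_len` are invented for illustration only.

```python
# Toy suffix stripper (NOT the Porter algorithm); the suffix list is an
# illustrative assumption, ordered so longer suffixes are tried first.
SUFFIXES = ("ness", "ion", "ive", "ing", "ed", "s")


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["Connection", "Connective", "Connected", "Meanness", "Meaning"]:
    print(w, "->", crude_stem(w))
```

Even this toy reproduces both behaviours described above: the three Connect- words collapse to `connect`, and Meanness and Meaning are wrongly merged into `mean`.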

Why might you want to use stemming?

-- It reduces the number of words the model is exposed to

-- It explicitly groups words with similar meanings under a single token
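The first point can be illustrated by counting distinct tokens before and after stemming. In this sketch the `STEM_OF` mapping is simply hard-coded from the examples listed earlier, and the token list is made up, so nothing here calls a real stemmer.

```python
from collections import Counter

# Word -> stem pairs copied from the examples above (hard-coded, no stemmer is run).
STEM_OF = {
    "stemming": "stem", "stemmed": "stem",
    "electricity": "electr", "electrical": "electr",
    "berries": "berri", "berry": "berri",
    "connection": "connect", "connective": "connect", "connected": "connect",
}

# A made-up token stream for illustration.
tokens = ["connection", "connected", "berries", "berry", "connective",
          "stemming", "stemmed", "electricity", "electrical", "berries"]

raw_counts = Counter(tokens)
stemmed_counts = Counter(STEM_OF[t] for t in tokens)

print(len(raw_counts), "distinct tokens before stemming")
print(len(stemmed_counts), "distinct tokens after stemming")
print(stemmed_counts)
```

The model sees fewer distinct features, and the counts for related word forms are pooled under one stem.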

What types of stemmers are there in the NLTK package?

-- Porter Stemmer

-- Snowball Stemmer

-- Lancaster Stemmer

-- Regex Stemmer

We will focus on the Porter Stemmer, one of the most widely used stemmers, in the code below.

In [1]:
import nltk

In [2]:
ps = nltk.PorterStemmer()

In [3]:
# what attributes and methods are contained in this stemmer?

dir(ps)

['MARTIN_EXTENSIONS',
 'NLTK_EXTENSIONS',
 'ORIGINAL_ALGORITHM',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_abc_cache',
 '_abc_negative_cache',
 '_abc_negative_cache_version',
 '_abc_registry',
 '_apply_rule_list',
 '_contains_vowel',
 '_ends_cvc',
 '_ends_double_consonant',
 '_has_positive_measure',
 '_is_consonant',
 '_measure',
 '_replace_suffix',
 '_step1a',
 '_step1b',
 '_step1c',
 '_step2',
 '_step3',
 '_step4',
 '_step5a',
 '_step5b',
 'mode',
 'pool',
 'stem',
 'unicode_repr',
 'vowels']

In [4]:
# we will focus on the stem method - we can pass in the word we want to stem

ps.stem('growing')

'grow'

In [6]:
ps.stem('grows')

'grow'

In [8]:
ps.stem('grow')

'grow'

In [9]:
# another example

ps.stem('running')

'run'

In [10]:
ps.stem('runner')

'runner'
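The 'runner' result is not about parts of speech; the stemmer only sees a string. In the published Porter rules, a suffix such as "-er" is stripped only when the remaining stem has a "measure" greater than 1, where the measure counts vowel-to-consonant transitions in the stem. The sketch below computes that measure in plain Python. The helper names `is_consonant` and `porter_measure` are our own and are not NLTK's API.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; 'y' is a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def porter_measure(stem):
    """Count vowel->consonant transitions, i.e. m in the form [C](VC)^m[V]."""
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


# 'runner' minus 'er' leaves 'runn'; 'controller' minus 'er' leaves 'controll'
print(porter_measure("runn"))      # 1 -> not greater than 1, so 'er' is kept
print(porter_measure("controll"))  # 2 -> greater than 1, so 'er' may be removed
```

Under this reading, 'runner' is left alone simply because 'runn' is too short by the measure criterion.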

In [11]:
# note that 'runner' is left unchanged: the stemmer has no notion of nouns vs. verbs, its rules simply do not strip '-er' from a stem as short as 'runn'

# let's bring in the SMS spam data

import re
import pandas as pd
import string

In [12]:
pd.set_option('display.max_colwidth', 100)

# requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
stops = nltk.corpus.stopwords.words('english')

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)

data.columns = ['label', 'text']

In [13]:
data.head()

Unnamed: 0,label,text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [14]:
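# make a function with all text cleaning steps

def clean_text(t):
    # drop punctuation characters
    text = ''.join(x for x in t if x not in string.punctuation)
    # split on runs of non-word characters (raw string avoids an invalid escape sequence)
    tokens = re.split(r'\W+', text)
    # remove stopwords
    text = [word for word in tokens if word not in stops]
    return text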
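One detail of the cleaning steps is worth checking in isolation: `re.split` returns empty strings when the text starts or ends with a separator. The sketch below is a standalone variant that needs no NLTK data. `clean_text_sketch` and the tiny `SAMPLE_STOPS` set are illustrative stand-ins, not the notebook's function or the real stopword list, and the extra `if tok` filter is our addition.

```python
import re
import string

# Tiny illustrative stand-in for the NLTK stopword list (assumption, not the real list).
SAMPLE_STOPS = {"i", "have", "a", "on", "with", "will"}


def clean_text_sketch(text, stops=SAMPLE_STOPS):
    """Lowercase, strip punctuation, tokenize, then drop stopwords and empty tokens."""
    text = text.lower()
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct)
    return [tok for tok in tokens if tok and tok not in stops]


print(re.split(r"\W+", " hi "))  # ['', 'hi', ''] -> empty strings at both ends
print(clean_text_sketch("I HAVE A DATE ON SUNDAY WITH WILL!!"))  # ['date', 'sunday']
```

Whether to drop empty tokens is a design choice; leaving them in is harmless for stemming but adds a meaningless feature if the tokens are later vectorized.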

In [17]:
data['clean'] = data['text'].apply(lambda x: clean_text(x.lower()))

In [18]:
data.head()

Unnamed: 0,label,text,clean
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"


In [19]:
# now that the text is cleaned, let's write a function to stem the words

def stem_it(t):
    
    stemmed_text = [ps.stem(word) for word in t]
    
    return stemmed_text

In [20]:
data['stemmed'] = data['clean'].apply(stem_it)

In [21]:
data.head()

Unnamed: 0,label,text,clean,stemmed
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"


In [22]:
# note that stemming isn't great with slang and abbreviations, so it may not be the best choice for this dataset

# Lemmatizing

Lemmatizing is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma (its dictionary form).

In practice, this means using vocabulary and morphological analysis to remove inflectional endings and return the dictionary form of a word.

How is lemmatizing different from stemming?

-- Both have the same goal, condensing words into their base forms, but they get there in different ways

-- Stemming is generally faster because it chops off the end of a word using heuristics, without any understanding of the context in which the word is used

-- Lemmatizing is typically more accurate because it relies on a vocabulary (and, when supplied, the word's part of speech) to map word forms to a shared dictionary entry
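The difference in approach can be sketched with a toy dictionary-backed lemmatizer. This is loosely in the spirit of vocabulary-based lemmatizers and does not reproduce what NLTK or WordNet actually do: the `EXCEPTIONS`, `VOCABULARY`, and `DETACHMENT_RULES` tables and the `toy_lemmatize` name are all invented for illustration.

```python
# Irregular forms are looked up directly (illustrative entries only).
EXCEPTIONS = {"geese": "goose", "mice": "mouse"}

# A candidate lemma is accepted only if it is a known word (tiny made-up vocabulary).
VOCABULARY = {"goose", "mouse", "word", "berry", "meaning", "meanness"}

# (suffix, replacement) pairs tried in order.
DETACHMENT_RULES = [("ies", "y"), ("es", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a dictionary form if one can be confirmed, else the word unchanged."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    return word


for w in ["geese", "berries", "words", "meaning", "meanness", "txt"]:
    print(w, "->", toy_lemmatize(w))
```

Unlike a stemmer, this never outputs a non-word such as `berri`: a suffix is removed only when the result is confirmed by the vocabulary, and unknown tokens such as `txt` pass through untouched.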

We will show lemmatization below

In [23]:
# create the word lemmatizer (requires the WordNet corpus; run nltk.download('wordnet') once if it is missing)

wn = nltk.WordNetLemmatizer()

In [24]:
# let's see some of the differences

ps.stem('meanness')

'mean'

In [25]:
# same stem for meanness and meaning despite differences in vocabulary definitions

ps.stem('meaning')

'mean'

In [26]:
# let's see how the lemmatizer handles this

wn.lemmatize('meanness')

'meanness'

In [27]:
# meaning and meanness were untouched

wn.lemmatize('meaning')

'meaning'

In [28]:
# stemming uses an algorithmic approach that only looks at the string it is given

# lemmatizing looks words up in a vocabulary (here WordNet) and maps them to their dictionary form

# let's use another example with goose and geese

print(ps.stem('goose'))
print(ps.stem('geese'))

goos
gees


In [29]:
# stemming doesn't really know how to handle the examples above - let's try lemmatization

print(wn.lemmatize('goose'))
print(wn.lemmatize('geese'))

goose
goose


In [30]:
# lemmatizing handled the example much better than stemming

# let's apply this to our SMS dataset

data.head()

Unnamed: 0,label,text,clean,stemmed
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"


In [31]:
# need to create a function for lemmatizing

def lemmatizing(t):
    
    text = [wn.lemmatize(word) for word in t]
    
    return text

In [32]:
data['lemmatized'] = data['clean'].apply(lemmatizing)

In [34]:
data.head(15)

Unnamed: 0,label,text,clean,stemmed,lemmatized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won...","[ive, searching, right, word, thank, breather, promise, wont, take, help, granted, fulfil, promi..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]","[nah, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]","[date, sunday]"
5,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,"[per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, pr...","[per, request, mell, mell, oru, minnaminungint, nurungu, vettam, set, callertun, caller, press, ...","[per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, caller, pre..."
6,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c...,"[winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170...","[winner, valu, network, custom, select, receivea, 900, prize, reward, claim, call, 09061701461, ...","[winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 0906170..."
7,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came...,"[mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile...","[mobil, 11, month, u, r, entitl, updat, latest, colour, mobil, camera, free, call, mobil, updat,...","[mobile, 11, month, u, r, entitled, update, latest, colour, mobile, camera, free, call, mobile, ..."
8,ham,"I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried ...","[im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today]","[im, gonna, home, soon, dont, want, talk, stuff, anymor, tonight, k, ive, cri, enough, today]","[im, gonna, home, soon, dont, want, talk, stuff, anymore, tonight, k, ive, cried, enough, today]"
9,spam,"SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, ...","[six, chances, win, cash, 100, 20000, pounds, txt, csh11, send, 87575, cost, 150pday, 6days, 16,...","[six, chanc, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6day, 16, tsa...","[six, chance, win, cash, 100, 20000, pound, txt, csh11, send, 87575, cost, 150pday, 6days, 16, t..."
