# Stemming

It's often (but not always) useful to reduce words to their stems. One reason is that tense or conjugation may not matter for your model, so it helps to combine the variations of a word into a single token. For models like Naive Bayes, where each distinct word is a feature, this can substantially reduce the feature space.
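
As a toy illustration of that reduction, the sketch below counts features before and after merging word variants. The token list and the `toy_stems` lookup table are made up for this example and stand in for a real stemmer.

```python
from collections import Counter

tokens = ['model', 'models', 'modeling', 'learn', 'learns', 'learning']

# Hypothetical lookup table standing in for a real stemmer.
toy_stems = {
    'models': 'model', 'modeling': 'model',
    'learns': 'learn', 'learning': 'learn',
}

# One feature per distinct token, before and after merging variants.
raw_counts = Counter(tokens)
stemmed_counts = Counter(toy_stems.get(t, t) for t in tokens)

print('features before stemming: %d' % len(raw_counts))      # 6
print('features after stemming: %d' % len(stemmed_counts))   # 2
```

Each merged feature now carries the combined count of its variants, which is what a bag-of-words model would see.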

Let's see what this looks like. First, let's tokenize a bit of text from the Wikipedia page on data science.

In [1]:
from nltk.tokenize import wordpunct_tokenize  # for tokenizing our text

In [2]:
# sample text from Wikipedia
with open('sample.txt') as f:
    text = f.read()

word_bag = wordpunct_tokenize(text)
# decode the byte strings to unicode (Python 2)
word_bag = [unicode(w, 'latin1') for w in word_bag]

print 'a few tokens:', word_bag[:10]
print 'number of tokens:', len(word_bag)
print 'number of unique tokens:', len(set(word_bag))

a few tokens: [u'Data', u'science', u'From', u'Wikipedia', u',', u'the', u'free', u'encyclopedia', u'Data', u'Science']
number of tokens: 1688
number of unique tokens: 666
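
To get a feel for what the tokenizer does, here is a rough standard-library sketch. It assumes that `wordpunct_tokenize` splits text into runs of word characters and runs of punctuation, which is my understanding of its behavior rather than something stated in this notebook. The `simple_wordpunct` name and the sample sentence are illustrative.

```python
import re

def simple_wordpunct(text):
    '''split text into runs of word characters and runs of punctuation'''
    return re.findall(r"\w+|[^\w\s]+", text)

sample = 'knowledge from data,[1] yet the key word is science.'
print(simple_wordpunct(sample))
```

Under that assumption, adjacent punctuation marks such as `,[` stay together as one token, which would explain tokens of that kind in the stemmed output.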


Look for common word endings to clip off. Start with the suffixes '-s', '-er', and '-ing', but be careful to strip them only when they appear at the end of a word. Write your rules into the function below.

In [3]:
# define a function to stem tokens based on rules.

def stem(tokens):
    '''rules-based stemming of a bunch of tokens'''
    
    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('ing'):
            new_bag.append(token[:-2])
        else:
            new_bag.append(token)

    return new_bag

In [4]:
# Check how well you're doing by running this cell:

print 'initial number of unique tokens:', len(set(word_bag))
print 'stemmed number of unique tokens:', len(set(stem(word_bag)))

initial number of unique tokens: 666
stemmed number of unique tokens: 645


In [5]:
# Do we have to refine our rules? Are we stripping away too many letters? Run this cell to see

for token in stem(word_bag):
    print token

Data
scien
From
Wikipedia
,
the
free
encyclopedia
Data
Scien
Venn
Diagram
Data
scien
i
the
study
of
the
generalizable
extrac
of
knowledge
from
data
,[
1
]
yet
the
key
word
i
scien
.[
2
]
It
incorporate
varyi
element
and
build
on
technique
and
theorie
from
many
field
,
includi
signal
processi
,
mathematic
,
probability
model
,
machine
learni
,
statistical
learni
,
comput
programmi
,
data
engineeri
,
pattern
recogni
and
learni
,
visualiza
,
uncertainty
modeli
,
data
warehousi
,
and
high
performan
computi
with
the
goal
of
extracti
meani
from
data
and
creati
data
product
.
The
subject
i
not
restricted
to
only
big
data
,
although
the
fact
that
data
i
scali
up
make
big
data
an
important
aspect
of
data
scien
.
A
practition
of
data
scien
i
called
a
data
scien
.
Data
scientist
solve
complex
data
problem
through
employi
deep
expertise
in
some
scientific
discipline
.
It
i
generally
expected
that
data
scientist
are
able
to
work
with
variou
element
of
mathematic
,
statistic
and
comput
scien
,
altho

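Feel free to add more rules and see how much you can pare down the feature set, i.e. the number of unique tokens. Try not to strip too much off the words! In the output above, for example, the '-s' rule turns 'is' into 'i', and the '-ing' rule removes only two characters, so 'learning' becomes 'learni'.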
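
One possible way to organize such rules is sketched below. It tries the longest suffix first and refuses to strip when the remaining stem would be too short. The `SUFFIXES` list, the `MIN_STEM` threshold, and the `stem_token` name are illustrative assumptions, not the rules used in the cell above.

```python
# Suffixes to strip; this list is only an example.
SUFFIXES = ['ing', 'ers', 'er', 'es', 's']
MIN_STEM = 3  # shortest stem a rule is allowed to leave behind

def stem_token(token):
    '''strip the longest matching suffix, unless the stem would be too short'''
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem_len = len(token) - len(suffix)
        if token.endswith(suffix) and stem_len >= MIN_STEM:
            return token[:stem_len]
    return token

for word in ['is', 'learning', 'computers', 'fields', 'data']:
    print('%s -> %s' % (word, stem_token(word)))
```

The length guard keeps short words like 'is' intact, and slicing by the suffix's own length avoids leaving part of the suffix behind.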

## Porter Stemmer

The classic stemmer is the Porter stemmer, which is [available in NLTK](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter). Others are available, too, such as the Snowball and Lancaster stemmers.

In [6]:
from nltk.stem.porter import PorterStemmer

In [12]:
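# Aside: a set keeps only distinct values, which is why len(set(...)) counts unique tokens.
set([1,2,3,4])
set([3,1,2,3,3])  # duplicates collapse; only this last expression is displayed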

{1, 2, 3}

In [14]:
# Run this cell to see how the Porter Stemmer performs.
ps = PorterStemmer()

print 'initial number of unique tokens:', len(set(word_bag))
print 'stemmed number of unique tokens:', len({ps.stem(token).lower() for token in word_bag})  # this uses a set comprehension

initial number of unique tokens: 666
stemmed number of unique tokens: 553


In [16]:
# examine how weird the tokens get

for token in word_bag:
    print ps.stem(token).lower()

data
scienc
from
wikipedia
,
the
free
encyclopedia
data
scienc
venn
diagram
data
scienc
is
the
studi
of
the
generaliz
extract
of
knowledg
from
data
,[
1
]
yet
the
key
word
is
scienc
.[
2
]
it
incorpor
vari
element
and
build
on
techniqu
and
theori
from
mani
field
,
includ
signal
process
,
mathemat
,
probabl
model
,
machin
learn
,
statist
learn
,
comput
program
,
data
engin
,
pattern
recognit
and
learn
,
visual
,
uncertainti
model
,
data
wareh
,
and
high
perform
comput
with
the
goal
of
extract
mean
from
data
and
creat
data
product
.
the
subject
is
not
restrict
to
onli
big
data
,
although
the
fact
that
data
is
scale
up
make
big
data
an
import
aspect
of
data
scienc
.
a
practition
of
data
scienc
is
call
a
data
scientist
.
data
scientist
solv
complex
data
problem
through
employ
deep
expertis
in
some
scientif
disciplin
.
it
is
gener
expect
that
data
scientist
are
abl
to
work
with
variou
element
of
mathemat
,
statist
and
comput
scienc
,
although
expertis
in
these
subject
are
not
requir
.[
3
]
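
Scanning a long list of stems is tedious, so it can help to group the original tokens by the stem they map to. The sketch below is one way to do that. `group_by_stem` and `toy_stem` are hypothetical helpers, and `toy_stem` is only a stand-in so the example runs without NLTK. In this notebook you could pass `ps.stem` as `stem_fn` instead.

```python
from collections import defaultdict

def group_by_stem(tokens, stem_fn):
    '''map each lowercased stem to the set of original tokens that produced it'''
    groups = defaultdict(set)
    for token in tokens:
        groups[stem_fn(token).lower()].add(token)
    return groups

def toy_stem(token):
    '''stand-in stemmer: drop a trailing "s" from words longer than three letters'''
    return token[:-1] if token.endswith('s') and len(token) > 3 else token

sample = ['Data', 'data', 'scientist', 'scientists', 'is', 'subjects', 'subject']
groups = group_by_stem(sample, toy_stem)

for stem in sorted(groups):
    if len(groups[stem]) > 1:  # show only stems that merged several tokens
        print('%s <- %s' % (stem, sorted(groups[stem])))
```

Stems that collect unrelated words are a sign the stemmer is too aggressive for your data.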

