# Exercise 1: Building a "little stemmer"

For this exercise, we will take a sample of Antoine de Saint-Exupéry's novella *The Little Prince* and use it to demonstrate tokenization and stemming.

Here is your sample text, which appears at the beginning of the book:

In [2]:
text = """
Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
Boa
In the book it said: "Boa constrictors swallow their prey whole, without chewing it. After that they are not able to move, and they sleep through the six months that they need for digestion."
I pondered deeply, then, over the adventures of the jungle. And after some work with a colored pencil I succeeded in making my first drawing. My Drawing Number One. It looked something like this:
Hat
I showed my masterpiece to the grown-ups, and asked them whether the drawing frightened them.
But they answered: "Frighten? Why should any one be frightened by a hat?"
My drawing was not a picture of a hat. It was a picture of a boa constrictor digesting an elephant. But since the grown-ups were not able to understand it, I made another drawing: I drew the inside of a boa constrictor, so that the grown-ups could see it clearly. They always need to have things explained. My Drawing Number Two looked like this:
Elephant inside the boa
The grown-ups' response, this time, was to advise me to lay aside my drawings of boa constrictors, whether from the inside or the outside, and devote myself instead to geography, history, arithmetic, and grammar. That is why, at the age of six, I gave up what might have been a magnificent career as a painter. I had been disheartened by the failure of my Drawing Number One and my Drawing Number Two. Grown-ups never understand anything by themselves, and it is tiresome for children to be always and forever explaining things to them.
"""

First, let's use NLTK's built-in functions to tokenize and stem this text. Convert the given text into an array of lowercase tokens using the NLTK functions word_tokenize and PorterStemmer.

In [16]:
from nltk import word_tokenize, PorterStemmer

ps = PorterStemmer()
words = word_tokenize(text)
stems = set()
lowercase_stems = set()

for word in words:
    stemmed_word = ps.stem(word)
    stems.add(stemmed_word)
    lowercase_stems.add(stemmed_word.lower())
    print(word + ":" + stemmed_word)

Once:onc
when:when
I:I
was:wa
six:six
years:year
old:old
I:I
saw:saw
a:a
magnificent:magnific
picture:pictur
in:in
a:a
book:book
,:,
called:call
True:true
Stories:stori
from:from
Nature:natur
,:,
about:about
the:the
primeval:primev
forest:forest
.:.
It:It
was:wa
a:a
picture:pictur
of:of
a:a
boa:boa
constrictor:constrictor
in:in
the:the
act:act
of:of
swallowing:swallow
an:an
animal:anim
.:.
Here:here
is:is
a:a
copy:copi
of:of
the:the
drawing:draw
.:.
Boa:boa
In:In
the:the
book:book
it:it
said:said
:::
``:``
Boa:boa
constrictors:constrictor
swallow:swallow
their:their
prey:prey
whole:whole
,:,
without:without
chewing:chew
it:it
.:.
After:after
that:that
they:they
are:are
not:not
able:abl
to:to
move:move
,:,
and:and
they:they
sleep:sleep
through:through
the:the
six:six
months:month
that:that
they:they
need:need
for:for
digestion:digest
.:.
'':''
I:I
pondered:ponder
deeply:deepli
,:,
then:then
,:,
over:over
the:the
adventures:adventur
of:of
the:the
jungle:jungl
.:.
And:and
after:after
some

**Questions:**
  1. How many unique tokens are there in the text?
  2. How many unique stemmed tokens are in the text? Lowercase stemmed tokens?
  3. What are some examples of words that have surprising stemmed forms? Can you explain why?

In [17]:
# len(words) counts every token, repeats included; len(set(words)) would count only distinct tokens.
print('There are {} tokens in the text (repeats included)'.format(len(words)))
print('There are {} unique stems in the text'.format(len(stems)))
print('There are {} unique lowercase stems in the text'.format(len(lowercase_stems)))

There are 353 tokens in the text (repeats included)
There are 152 unique stems in the text
There are 149 unique lowercase stems in the text
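
As a small aside on counting: calling `len()` on a token list counts repeats, while a `set` keeps only distinct values. The sketch below uses a short hand-written token list (an assumption for illustration, not the real tokenizer output) to show the three counts side by side.

```python
# Hand-written sample tokens (hypothetical; not produced by word_tokenize).
tokens = ["It", "was", "a", "picture", "of", "a", "boa", "constrictor", ".", "it"]

total = len(tokens)                              # every token, repeats included
unique = len(set(tokens))                        # distinct tokens, case-sensitive
unique_lower = len({t.lower() for t in tokens})  # distinct tokens, ignoring case

print(total, unique, unique_lower)
```

The same pattern applied to `words` would give the number of distinct tokens in the full text.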


Words that end in a letter or letter sequence that also serves as a suffix on other roots end up looking strange.
For example:
- 'why' becomes 'whi'. This looks odd, but the same rule turns 'geography' into 'geographi', which is closer to the spelling of related forms such as 'geographical'.
- 'failure' becomes 'failur'. A word like 'nature' has derivatives such as 'naturally', for which a shared stem makes sense, but 'failure' has no such derivatives.
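
The 'y' to 'i' behaviour can be approximated with a one-rule sketch. This is a simplified assumption for illustration only, not NLTK's actual PorterStemmer code, and the helper name `y_to_i` is hypothetical: it rewrites a final 'y' as 'i' when the letter before it is a consonant.

```python
VOWELS = "aeiou"

def y_to_i(word):
    """Rewrite a final 'y' as 'i' when it follows a consonant (simplified rule)."""
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word

for w in ["why", "geography", "copy", "prey", "they"]:
    print(w + ":" + y_to_i(w))
```

Under this rule 'why', 'geography' and 'copy' change, while 'prey' and 'they' are left alone because a vowel precedes the 'y', which agrees with the PorterStemmer output shown earlier.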

Now let's try writing our own stemmer. Write a function which takes in a token and returns its stem, by removing common English suffixes (e.g. remove the suffix -ed as in *listened* -> *listen*). Try to handle as many suffixes as you can think of. Then use this custom stemmer to convert the given text to an array of lowercase tokens.

In [30]:
# Candidate suffixes to strip. 'ary' and 'ize' appear twice, which is harmless
# because only the first matching suffix is ever used.
suffixes = ['s', 'ed', 'ing', 'e', 'es', 'y', 'ent', 'er', 'ize', 'ary', 'al', 'ly', 'ally', 'ion', 'ary', 'ic', 'izing', 'wise',
           'ize', 'ization', 'isation', 'ise', 'ies', 'ate', 'ation', 'ant', 'ice', 'isor', 'isory', 'ome', 'ically', 'cly', 'ical']

def sort_suffixes():
    # Sort longest-first, in place, so that the longest matching suffix is the one removed.
    suffixes.sort(key=len, reverse=True)

def jeremy_stemmer(word):
    # Leave words of three letters or fewer untouched.
    if len(word) <= 3:
        return word

    # Strip the first (and therefore longest) suffix that matches.
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]

    # No suffix matched: return the word unchanged.
    return word

sort_suffixes()
print(suffixes)

for word in words:
    print(word + ":" + jeremy_stemmer(word))

['ization', 'isation', 'ically', 'izing', 'ation', 'isory', 'ally', 'wise', 'isor', 'ical', 'ing', 'ent', 'ize', 'ary', 'ion', 'ary', 'ize', 'ise', 'ies', 'ate', 'ant', 'ice', 'ome', 'cly', 'ed', 'es', 'er', 'al', 'ly', 'ic', 's', 'e', 'y']
Once:Onc
when:when
I:I
was:was
six:six
years:year
old:old
I:I
saw:saw
a:a
magnificent:magnific
picture:pictur
in:in
a:a
book:book
,:,
called:call
True:Tru
Stories:Stor
from:from
Nature:Natur
,:,
about:about
the:the
primeval:primev
forest:forest
.:.
It:It
was:was
a:a
picture:pictur
of:of
a:a
boa:boa
constrictor:constrictor
in:in
the:the
act:act
of:of
swallowing:swallow
an:an
animal:anim
.:.
Here:Her
is:is
a:a
copy:cop
of:of
the:the
drawing:draw
.:.
Boa:Boa
In:In
the:the
book:book
it:it
said:said
:::
``:``
Boa:Boa
constrictors:constrictor
swallow:swallow
their:their
prey:pre
whole:whol
,:,
without:without
chewing:chew
it:it
.:.
After:Aft
that:that
they:the
are:are
not:not
able:abl
to:to
move:mov
,:,
and:and
they:the
sleep:sleep
through:through
the:the

**Questions:**
  4. What are some examples where your stemmer's output on the text differs from the PorterStemmer's?
  5. Can you explain why the differences occur?
  
**Bonus**: Use NLTK's WordNetLemmatizer to get an array of lemmatized tokens. Where does it differ from the stemmers' outputs? Why?
 

My stemmer only removes suffixes, whereas the PorterStemmer is more sophisticated: for instance, it rewrites a final 'y' as 'i' instead of dropping it ('copy' becomes 'copi', while my stemmer gives 'cop').
I am not sure in which cases this matters, though.
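
One way to list such differences mechanically is to run both stemmers over the same tokens and keep only the disagreements. The sketch below is a minimal illustration: `stem_differences` is a hypothetical helper, and the two lookup tables stand in for the real stemmers, using only pairs copied from the outputs above.

```python
def stem_differences(tokens, stem_a, stem_b):
    """Return (token, stem_a(token), stem_b(token)) for tokens where the stems differ."""
    seen = set()
    diffs = []
    for token in tokens:
        if token in seen:  # report each distinct token once
            continue
        seen.add(token)
        a, b = stem_a(token), stem_b(token)
        if a != b:
            diffs.append((token, a, b))
    return diffs

# Stand-ins for the two stemmers, filled with pairs taken from the printed outputs.
porter_out = {"was": "wa", "copy": "copi", "True": "true", "they": "they", "swallowing": "swallow"}
custom_out = {"was": "was", "copy": "cop", "True": "Tru", "they": "the", "swallowing": "swallow"}

sample = ["was", "copy", "True", "they", "swallowing", "was"]
diffs = stem_differences(sample, porter_out.get, custom_out.get)
for token, a, b in diffs:
    print(token, a, b)
```

With the real objects, `ps.stem` and `jeremy_stemmer` could be passed in place of the two `.get` lookups.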
Also, I have added some logic:
- Words of three letters or fewer are left untouched by my stemmer. The PorterStemmer is again more selective: 'the' stays the same, but 'was' becomes 'wa', so only some short words are left alone.
- The suffix list is sorted by descending length, which ensures that a word like 'alphabetization' becomes 'alphabet' and not 'alphabetizat'.
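
The effect of the ordering can be checked in isolation. This is a minimal sketch using a three-suffix list chosen for illustration; `strip_first_match` is a hypothetical helper that mirrors the first-match loop in the stemmer above.

```python
def strip_first_match(word, suffixes):
    """Remove the first suffix in the list that the word ends with."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

suffixes_demo = ["ion", "ation", "ization"]  # shortest first, as one might write them by hand

unsorted_result = strip_first_match("alphabetization", suffixes_demo)
sorted_result = strip_first_match(
    "alphabetization", sorted(suffixes_demo, key=len, reverse=True)
)

print(unsorted_result)  # alphabetizat
print(sorted_result)    # alphabet
```

Because the loop returns on the first match, the list order alone decides which suffix wins.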

In [37]:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jeremybensoussan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [41]:
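wl = WordNetLemmatizer()
lemmatized_tokens = set()
# Without a part of speech, lemmatize() treats every word as a noun:
# lemmatized_tokens = {wl.lemmatize(word) for word in words}
for word in words:
    # Try each WordNet part of speech: adjective, satellite adjective, adverb, noun, verb.
    for part_of_speech in ['a', 's', 'r', 'n', 'v']:
        lemma = wl.lemmatize(word, part_of_speech)
        # Keep only lemmas that differ from the original token.
        if lemma != word:
            lemmatized_tokens.add(lemma)
print(lemmatized_tokens)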

{'digest', 'year', 'constrictor', 'wa', 'look', 'ponder', 'show', 'frighten', 'draw', 'month', 'dishearten', 'color', 'chew', 'give', 'a', 'call', 'have', 'child', 'swallow', 'answer', 'thing', 'succeed', 'be', 'adventure', 'make', 'ask', 'say', 'explain', 'drawing'}


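By trying each part of speech that lemmatize() accepts, we were able to transform words like 'was' into 'be', which a stemmer cannot do because it only rewrites suffixes. Unlike the stems, the lemmas are also real words ('adventure' rather than 'adventur'). The set does contain a few doubtful entries such as 'wa' and 'a', which most likely come from lemmatizing a word under the wrong part of speech (for example, 'was' treated as a plural noun).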
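
To illustrate why the part of speech matters, here is a toy lookup-table lemmatizer. The table `TOY_LEMMAS` and the functions `toy_lemmatize` and `changed_lemmas` are hypothetical stand-ins for illustration and do not reflect how WordNetLemmatizer works internally; the entries mirror lemmas that appear in the output above.

```python
# (word, part of speech) -> lemma. Hypothetical entries for illustration only.
TOY_LEMMAS = {
    ("was", "v"): "be",
    ("drawing", "v"): "draw",
    ("drawings", "n"): "drawing",
    ("children", "n"): "child",
}

def toy_lemmatize(word, pos):
    """Look up a lemma for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

def changed_lemmas(word, parts_of_speech=("n", "v")):
    """Collect every lemma that differs from the word across the given parts of speech."""
    return {toy_lemmatize(word, pos) for pos in parts_of_speech} - {word}

print(toy_lemmatize("was", "n"))   # unchanged: no noun entry in the toy table
print(toy_lemmatize("was", "v"))   # 'be'
print(changed_lemmas("drawing"))   # {'draw'}
```

The same word yields a different lemma, or none at all, depending on the part of speech supplied, which is why the loop above tries several of them.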