# Exercise 2: Building a "little stemmer"

For this exercise, we will take a sample of Antoine de Saint-Exupéry's novella *The Little Prince* and use it to demonstrate tokenization and stemming.

Here is your sample text, which appears at the beginning of the book:

In [1]:
text = """
Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
Boa
In the book it said: "Boa constrictors swallow their prey whole, without chewing it. After that they are not able to move, and they sleep through the six months that they need for digestion."
I pondered deeply, then, over the adventures of the jungle. And after some work with a colored pencil I succeeded in making my first drawing. My Drawing Number One. It looked something like this:
Hat
I showed my masterpiece to the grown-ups, and asked them whether the drawing frightened them.
But they answered: "Frighten? Why should any one be frightened by a hat?"
My drawing was not a picture of a hat. It was a picture of a boa constrictor digesting an elephant. But since the grown-ups were not able to understand it, I made another drawing: I drew the inside of a boa constrictor, so that the grown-ups could see it clearly. They always need to have things explained. My Drawing Number Two looked like this:
Elephant inside the boa
The grown-ups' response, this time, was to advise me to lay aside my drawings of boa constrictors, whether from the inside or the outside, and devote myself instead to geography, history, arithmetic, and grammar. That is why, at the age of six, I gave up what might have been a magnificent career as a painter. I had been disheartened by the failure of my Drawing Number One and my Drawing Number Two. Grown-ups never understand anything by themselves, and it is tiresome for children to be always and forever explaining things to them.
"""

First, let's use NLTK's built-in functions to tokenize and stem this text. Start by converting the given text into an array of lowercase tokens using the NLTK functions word_tokenize and PorterStemmer.

In [2]:
from nltk import word_tokenize, PorterStemmer

In [9]:
stemmer = PorterStemmer()
text_tokens = word_tokenize(text, language='english')  # NLTK's Punkt model names are lowercase
lower_tokens = [word.lower() for word in text_tokens]

**Questions:**
  1. How many unique tokens are there in the text?
  1. How many unique stemmed tokens are in the text? Lowercase stemmed tokens?
  1. What are some examples of words that have surprising stemmed forms? Can you explain why?

In [4]:
import pandas as pd
import re

In [11]:
print("There are {} unique tokens in the text".format(len(set(lower_tokens))))

There are 155 unique tokens in the text


In [6]:
stems_lower = set([stemmer.stem(word) for word in lower_tokens])
stems = set([stemmer.stem(word) for word in text_tokens])

print("There are {} unique stemmed tokens".format(len(stems)))
print("There are {} unique lowercase stemmed tokens".format(len(stems_lower)))

There are 152 unique stemmed tokens
There are 149 unique lowercase stemmed tokens


Some examples of words with surprising stems are once, is, was, animal, able, something, and this.
They are surprising because each of these words ends in a letter sequence that looks like a common English suffix (the kind that normally marks a verb form or a plural), but here that sequence is part of the root itself. The stemmer strips it anyway, so these are simply cases where stemming fails to capture the root of the word.

Now let's try writing our own stemmer. Write a function which takes in a token and returns its stem, by removing common English suffixes (e.g. remove the suffix -ed as in *listened* -> *listen*). Try to handle as many suffixes as you can think of. Then use this custom stemmer to convert the given text to an array of lowercase tokens.

In [7]:
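def suffix_word(my_word, suf_list):
    """Strip the first suffix in suf_list that my_word ends with, then lowercase."""
    for my_suf in suf_list:
        # Suffixes are tried in list order, so earlier entries take priority
        if my_word.endswith(my_suf):
            return my_word[:-len(my_suf)].lower()
    # No suffix matched: return the word unchanged apart from lowercasing
    return my_word.lower()

def stem_text(my_text):
    """Tokenize my_text with a regex and stem every token."""
    # Tokens are runs of word characters/apostrophes, or single punctuation marks
    my_words = re.findall(r"[\w']+|[.,!?;:-]", my_text)
    my_suffixes = ['ed', 'es', 'ing', 'e', 's', 'ent', 'al', 'ion', 'y']

    return [suffix_word(word, my_suffixes) for word in my_words]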

In [8]:
stem_text(text)

['onc',
 'when',
 'i',
 'wa',
 'six',
 'year',
 'old',
 'i',
 'saw',
 'a',
 'magnific',
 'pictur',
 'in',
 'a',
 'book',
 ',',
 'call',
 'tru',
 'stori',
 'from',
 'natur',
 ',',
 'about',
 'th',
 'primev',
 'forest',
 '.',
 'it',
 'wa',
 'a',
 'pictur',
 'of',
 'a',
 'boa',
 'constrictor',
 'in',
 'th',
 'act',
 'of',
 'swallow',
 'an',
 'anim',
 '.',
 'her',
 'i',
 'a',
 'cop',
 'of',
 'th',
 'draw',
 '.',
 'boa',
 'in',
 'th',
 'book',
 'it',
 'said',
 ':',
 'boa',
 'constrictor',
 'swallow',
 'their',
 'pre',
 'whol',
 ',',
 'without',
 'chew',
 'it',
 '.',
 'after',
 'that',
 'the',
 'ar',
 'not',
 'abl',
 'to',
 'mov',
 ',',
 'and',
 'the',
 'sleep',
 'through',
 'th',
 'six',
 'month',
 'that',
 'the',
 'ne',
 'for',
 'digest',
 '.',
 'i',
 'ponder',
 'deepl',
 ',',
 'then',
 ',',
 'over',
 'th',
 'adventur',
 'of',
 'th',
 'jungl',
 '.',
 'and',
 'after',
 'som',
 'work',
 'with',
 'a',
 'color',
 'pencil',
 'i',
 'succeed',
 'in',
 'mak',
 'm',
 'first',
 'draw',
 '.',
 'm',
 

**Questions:**
  4. What are some examples where your stemmer's output on the text differs from the PorterStemmer's?
  5. Can you explain why the differences occur?
  
**Bonus**: Use NLTK's WordNetLemmatizer to get an array of lemmatized tokens. Where does it differ from the stemmers' outputs? Why?
 

4.
There are many examples where my stemmer differs, for example on "true", "deeply", and "making". For "true", my stemmer removes the "e" at the end (giving "tru"), while the PorterStemmer keeps it. "deeply" becomes "deepli" with the PorterStemmer and "deepl" with mine, and "making" becomes "make" with the PorterStemmer and "mak" with mine.
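
One way to collect such differences systematically, rather than by eye, is sketched below. The helper name `disagreements` is hypothetical, and the two stand-in stemmers exist only so the snippet runs on its own: `simple_stem` mirrors the suffix list used above, and `porter_stand_in` is a small lookup table holding the PorterStemmer results quoted in this answer. In the notebook itself you would presumably pass `stemmer.stem` instead of the stand-in.

```python
def disagreements(tokens, stem_a, stem_b):
    """Return sorted (token, stem_a, stem_b) triples where the two stems differ."""
    rows = set()
    for tok in tokens:
        low = tok.lower()
        a, b = stem_a(low), stem_b(low)
        if a != b:
            rows.add((low, a, b))
    return sorted(rows)

# Stand-in for the custom stemmer: same suffix list and order as above
SUFFIXES = ['ed', 'es', 'ing', 'e', 's', 'ent', 'al', 'ion', 'y']

def simple_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf):
            return word[:-len(suf)]
    return word

# Stand-in for PorterStemmer: only the stems quoted in the answer (assumption:
# any word not listed is returned unchanged, which is NOT what Porter does)
PORTER_SAMPLE = {"true": "true", "deeply": "deepli", "making": "make"}

def porter_stand_in(word):
    return PORTER_SAMPLE.get(word, word)

rows = disagreements(["True", "deeply", "making", "book"],
                     simple_stem, porter_stand_in)
for token, mine, theirs in rows:
    print(f"{token}: mine={mine}, porter={theirs}")
```

Words on which both stemmers agree (here "book") are left out, so the result only lists the cases worth explaining.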

5.
The differences occur because the PorterStemmer is a lot more complex. It applies its suffix rules conditionally and has many exceptions, so it recognizes the roots of words better. My stemmer is much simpler: it just removes the first matching suffix without considering which word it is attached to.
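
To illustrate what "conditional" can mean here, the sketch below guards the removal of a final "e" with a simplified version of the "measure" idea from Porter's algorithm (roughly, the number of vowel-then-consonant transitions in the remaining stem). This is not the real PorterStemmer: the function names are hypothetical, only one rule is shown, and the actual algorithm has additional conditions and steps.

```python
VOWELS = set("aeiou")

def is_vowel(word, i):
    """True if word[i] acts as a vowel; 'y' counts when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Count vowel-to-consonant transitions in stem (a simplified Porter 'm')."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        v = is_vowel(stem, i)
        if prev_vowel and not v:
            m += 1
        prev_vowel = v
    return m

def strip_final_e(word):
    """Drop a final 'e' only when the remaining stem has measure > 1."""
    if word.endswith("e") and measure(word[:-1]) > 1:
        return word[:-1]
    return word

for w in ["true", "the", "picture"]:
    print(w, "->", strip_final_e(w))
```

With this guard, short words such as "true" and "the" keep their final "e" because the stem left behind would be too short, while "picture" is still reduced to "pictur". An unconditional rule like the one in the custom stemmer cannot make that distinction.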