# Exercise 1: Building a "little stemmer"

For this exercise, we will take a sample of Antoine de Saint-Exupéry's novella *The Little Prince* and use it to demonstrate tokenization and stemming.

Here is your sample text, which appears at the beginning of the book:

In [0]:
text = """
Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
Boa
In the book it said: "Boa constrictors swallow their prey whole, without chewing it. After that they are not able to move, and they sleep through the six months that they need for digestion."
I pondered deeply, then, over the adventures of the jungle. And after some work with a colored pencil I succeeded in making my first drawing. My Drawing Number One. It looked something like this:
Hat
I showed my masterpiece to the grown-ups, and asked them whether the drawing frightened them.
But they answered: "Frighten? Why should any one be frightened by a hat?"
My drawing was not a picture of a hat. It was a picture of a boa constrictor digesting an elephant. But since the grown-ups were not able to understand it, I made another drawing: I drew the inside of a boa constrictor, so that the grown-ups could see it clearly. They always need to have things explained. My Drawing Number Two looked like this:
Elephant inside the boa
The grown-ups' response, this time, was to advise me to lay aside my drawings of boa constrictors, whether from the inside or the outside, and devote myself instead to geography, history, arithmetic, and grammar. That is why, at the age of six, I gave up what might have been a magnificent career as a painter. I had been disheartened by the failure of my Drawing Number One and my Drawing Number Two. Grown-ups never understand anything by themselves, and it is tiresome for children to be always and forever explaining things to them.
"""

First, let's use NLTK's built-in functions to tokenize and stem this text. Convert the given text into an array of lowercase stemmed tokens using the NLTK function word_tokenize and the class PorterStemmer.

In [35]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokens = [stemmer.stem(word.lower()) for word in word_tokenize(text)]
print(tokens)
# to answer the questions: unique tokens, unique stemmed tokens, unique lowercase stemmed tokens
print(len(set(word_tokenize(text))), len(set([stemmer.stem(word) for word in word_tokenize(text)])), len(set(tokens)))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['onc', 'when', 'i', 'wa', 'six', 'year', 'old', 'i', 'saw', 'a', 'magnific', 'pictur', 'in', 'a', 'book', ',', 'call', 'true', 'stori', 'from', 'natur', ',', 'about', 'the', 'primev', 'forest', '.', 'it', 'wa', 'a', 'pictur', 'of', 'a', 'boa', 'constrictor', 'in', 'the', 'act', 'of', 'swallow', 'an', 'anim', '.', 'here', 'is', 'a', 'copi', 'of', 'the', 'draw', '.', 'boa', 'in', 'the', 'book', 'it', 'said', ':', '``', 'boa', 'constrictor', 'swallow', 'their', 'prey', 'whole', ',', 'without', 'chew', 'it', '.', 'after', 'that', 'they', 'are', 'not', 'abl', 'to', 'move', ',', 'and', 'they', 'sleep', 'through', 'the', 'six', 'month', 'that', 'they', 'need', 'for', 'digest', '.', "''", 'i', 'ponder', 'deepli', ',', 'then', ',', 'over', 'the', 'adventur', 'of', 'the', 'jungl', '.', 'and', 'after', 'some', 'work', 'with', 'a', 'color', 'pencil', 'i', 'succeed', 'in', 'make', 'my', '

**Questions:**
  1. How many unique tokens are there in the text? **170**
  1. How many unique stemmed tokens are in the text? **152** Lowercase stemmed tokens? **149**
  1. What are some examples of words that have surprising stemmed forms? Can you explain why? **e.g. anim (animal), eleph (elephant), anoth (another): these words all look like they end in suffixes (-al, -ant, -er)**
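
One way to make the answer to the last question concrete: the Porter algorithm strips endings such as -al, -ant and -er only when the remaining stem is "long enough", which it judges by a measure m (the number of vowel-consonant sequences in the stem). The sketch below is a rough re-implementation of that measure for illustration only. The helper names `is_consonant` and `measure` are our own and are not part of NLTK.

```python
import re

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences in a stem written as [C](VC)^m[V]."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)  # e.g. 'VCVCC' -> 'VCVC'
    return collapsed.count("VC")

# Stems left over after removing -al, -ant and -er from animal, elephant, another
for stem in ["anim", "eleph", "anoth"]:
    print(stem, measure(stem))
```

All three stems have a measure greater than 1, so the stemmer treats the endings as removable suffixes even though *animal*, *elephant* and *another* are not derived words.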

Now let's try writing our own stemmer. Write a function which takes in a token and returns its stem, by removing common English suffixes (e.g. remove the suffix -ed as in *listened* -> *listen*). Try to handle as many suffixes as you can think of. Then use this custom stemmer to convert the given text to an array of lowercase tokens.

In [41]:
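import re
# Suffixes are tried in list order and every match is stripped,
# so a single token can lose more than one suffix.
suffixes = ['ed', 's', 'ing', 'es']
def custom_stemmer(token):
  out = token
  for suffix in suffixes:
    # '$' anchors the pattern so only a trailing suffix is removed
    out = re.sub(suffix + '$', '', out)
  return out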
tokens = [custom_stemmer(word.lower()) for word in word_tokenize(text)]
print(tokens)
# to answer questions:

['once', 'when', 'i', 'wa', 'six', 'year', 'old', 'i', 'saw', 'a', 'magnificent', 'picture', 'in', 'a', 'book', ',', 'call', 'true', 'storie', 'from', 'nature', ',', 'about', 'the', 'primeval', 'forest', '.', 'it', 'wa', 'a', 'picture', 'of', 'a', 'boa', 'constrictor', 'in', 'the', 'act', 'of', 'swallow', 'an', 'animal', '.', 'here', 'i', 'a', 'copy', 'of', 'the', 'draw', '.', 'boa', 'in', 'the', 'book', 'it', 'said', ':', '``', 'boa', 'constrictor', 'swallow', 'their', 'prey', 'whole', ',', 'without', 'chew', 'it', '.', 'after', 'that', 'they', 'are', 'not', 'able', 'to', 'move', ',', 'and', 'they', 'sleep', 'through', 'the', 'six', 'month', 'that', 'they', 'ne', 'for', 'digestion', '.', "''", 'i', 'ponder', 'deeply', ',', 'then', ',', 'over', 'the', 'adventure', 'of', 'the', 'jungle', '.', 'and', 'after', 'some', 'work', 'with', 'a', 'color', 'pencil', 'i', 'succeed', 'in', 'mak', 'my', 'first', 'draw', '.', 'my', 'draw', 'number', 'one', '.', 'it', 'look', 'someth', 'like', 'thi', '

**Questions:**
  4. What are some examples where your stemmer's output on the text differs from the PorterStemmer's? **once, magnificent, storie**
  5. Can you explain why the differences occur? **missing suffixes, missing rule y$ => i**
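
Some of the differences come from how the suffix list is applied rather than from missing rules: -s is stripped before -es is tried (*stories* -> *storie*), and very short words lose letters that are not suffixes at all (*need* -> *ne*, *is* -> *i*). A possible refinement, sketched below under our own assumptions, strips plural endings first and verb endings second, and refuses to leave a stem shorter than three letters. The names `strip_one`, `staged_stemmer` and the threshold `MIN_STEM_LEN` are illustrative choices, not part of the exercise or of NLTK.

```python
PLURAL_SUFFIXES = ("es", "s")   # longest first, so 'es' wins over 's'
VERB_SUFFIXES = ("ing", "ed")
MIN_STEM_LEN = 3                # assumed threshold; tune to taste

def strip_one(token, suffixes):
    """Remove the first matching suffix that leaves a long-enough stem."""
    for suffix in suffixes:
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            return stem
    return token

def staged_stemmer(token):
    """Strip a plural ending, then a verb ending (at most one of each)."""
    token = strip_one(token, PLURAL_SUFFIXES)
    return strip_one(token, VERB_SUFFIXES)

for word in ["stories", "need", "is", "drawings", "succeeded", "this"]:
    print(word, "->", staged_stemmer(word))
```

With these rules *stories* becomes *stori*, *need* and *is* are left alone, and *drawings* still reduces to *draw*. The length check is a blunt tool, though: *this* is still cut to *thi*.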
  
**Bonus**: Use NLTK's WordNetLemmatizer to get an array of lemmatized tokens. Where does it differ from the stemmers' outputs? Why?
 

In [43]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print([wordnet_lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text)])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
['once', 'when', 'i', 'wa', 'six', 'year', 'old', 'i', 'saw', 'a', 'magnificent', 'picture', 'in', 'a', 'book', ',', 'called', 'true', 'story', 'from', 'nature', ',', 'about', 'the', 'primeval', 'forest', '.', 'it', 'wa', 'a', 'picture', 'of', 'a', 'boa', 'constrictor', 'in', 'the', 'act', 'of', 'swallowing', 'an', 'animal', '.', 'here', 'is', 'a', 'copy', 'of', 'the', 'drawing', '.', 'boa', 'in', 'the', 'book', 'it', 'said', ':', '``', 'boa', 'constrictor', 'swallow', 'their', 'prey', 'whole', ',', 'without', 'chewing', 'it', '.', 'after', 'that', 'they', 'are', 'not', 'able', 'to', 'move', ',', 'and', 'they', 'sleep', 'through', 'the', 'six', 'month', 'that', 'they', 'need', 'for', 'digestion', '.', "''", 'i', 'pondered', 'deeply', ',', 'then', ',', 'over', 'the', 'adventure', 'of', 'the', 'jungle', '.', 'and', 'after', 'some', 'work', 'with', 'a', 'colored', 'pencil', 'i', 'suc
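
To answer the bonus question it helps to line the outputs up and list only the positions where they disagree. The helper below is a small sketch for that purpose. `differing_pairs` is a hypothetical name of our own, and the two sample lists are simply the first twelve tokens of the PorterStemmer and WordNetLemmatizer outputs printed above.

```python
def differing_pairs(tokens_a, tokens_b):
    """Return unique (a, b) pairs, in first-seen order, where aligned tokens differ."""
    seen = []
    for a, b in zip(tokens_a, tokens_b):
        if a != b and (a, b) not in seen:
            seen.append((a, b))
    return seen

# First twelve tokens of each output shown earlier in this notebook
porter_head = ['onc', 'when', 'i', 'wa', 'six', 'year', 'old', 'i', 'saw', 'a', 'magnific', 'pictur']
lemma_head = ['once', 'when', 'i', 'wa', 'six', 'year', 'old', 'i', 'saw', 'a', 'magnificent', 'picture']

print(differing_pairs(porter_head, lemma_head))
```

The lemmatizer returns dictionary forms (*once*, *magnificent*, *picture*) where the stemmer returns truncated stems. Note also that `lemmatize` treats every word as a noun unless a `pos` argument is given, which would explain why verb forms such as *called* and *swallowing* pass through unchanged while *was* is reduced to *wa*.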