# NLP Preprocessing exercise: Building a "little stemmer"

For this exercise, we will take a sample of Antoine de Saint-Exupéry's novella *The Little Prince* and use it to demonstrate tokenization and stemming.

Here is your sample text, which appears at the beginning of the book:

In [1]:
text = """
Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
Boa
In the book it said: "Boa constrictors swallow their prey whole, without chewing it. After that they are not able to move, and they sleep through the six months that they need for digestion."
I pondered deeply, then, over the adventures of the jungle. And after some work with a colored pencil I succeeded in making my first drawing. My Drawing Number One. It looked something like this:
Hat
I showed my masterpiece to the grown-ups, and asked them whether the drawing frightened them.
But they answered: "Frighten? Why should any one be frightened by a hat?"
My drawing was not a picture of a hat. It was a picture of a boa constrictor digesting an elephant. But since the grown-ups were not able to understand it, I made another drawing: I drew the inside of a boa constrictor, so that the grown-ups could see it clearly. They always need to have things explained. My Drawing Number Two looked like this:
Elephant inside the boa
The grown-ups' response, this time, was to advise me to lay aside my drawings of boa constrictors, whether from the inside or the outside, and devote myself instead to geography, history, arithmetic, and grammar. That is why, at the age of six, I gave up what might have been a magnificent career as a painter. I had been disheartened by the failure of my Drawing Number One and my Drawing Number Two. Grown-ups never understand anything by themselves, and it is tiresome for children to be always and forever explaining things to them.
"""

First, let's use NLTK's built-in functions to tokenize and stem this text. Convert the given text into an array of lowercase tokens using the NLTK function word_tokenize and the PorterStemmer class.

In [2]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
word_tokenized = word_tokenize(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words_stemmed = [stemmer.stem(word) for word in word_tokenized]

**Questions:**
  1. How many unique tokens are there in the text?

  1. How many unique stemmed tokens are in the text? How many lowercase stemmed tokens?
  
  1. What are some examples of words that have surprising stemmed forms? Can you explain why? Answer from a linguistic point of view.

In [4]:
#1.
import numpy as np
np.unique(word_tokenized).shape

(170,)

There are 170 unique tokens in the text.

In [5]:
#2.
np.unique(words_stemmed).shape

(149,)

There are 149 unique stemmed tokens in the text.

In [6]:
np.unique(list(map(lambda x : x.islower(), words_stemmed)), return_counts = True)

(array([False,  True]), array([ 48, 305]))
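
When reading the output above, it helps to remember two things: *str.islower()* returns False for any string that contains no cased letters (punctuation, for instance), and the counts cover every occurrence rather than unique tokens. A minimal sketch on a made-up list (the *sample_stems* values are illustrative and not taken from the cells above):

```python
# Illustrative stems; the punctuation entries mimic what a tokenizer emits.
sample_stems = ["the", "boa", ",", "the", "hat", ".", "?"]

# islower() is False for strings that contain no cased characters at all.
flags = [stem.islower() for stem in sample_stems]

occurrences_lower = sum(flags)  # counts repeats ("the" is counted twice)
unique_lower = len({s for s in sample_stems if s.islower()})  # each stem once

print(occurrences_lower, unique_lower)
```

Counting over a set, as in the last expression, is one way to get the number of unique lowercase stems rather than the number of occurrences.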

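There are 305 lowercase stemmed tokens. Note that this figure counts every occurrence rather than unique stems; the 48 tokens in the False group are those for which *islower()* returns False, such as punctuation.

In [7]:
#3.
words_stemmed[0], words_stemmed[3], words_stemmed[10]

('onc', 'wa', 'magnific')

These words (*once*, *was*, *magnificent*) have surprising stemmed forms. A stemmer strips endings with mechanical rules and never checks the result against a dictionary, so what remains is a truncated radical that need not be a real word. In *was*, for example, the final -s is not a plural or verbal suffix at all, yet it is removed as if it were one.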

Now let's try writing our own stemmer. Write a function which takes in a token and returns its stem, by removing common English suffixes (e.g. remove the suffix -ed as in *listened* -> *listen*). Handle at least four such suffixes in English. Then use this custom stemmer to convert the given text to an array of **lowercase stemmed tokens**.

In [8]:
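def mystemmer(token):
  # Lowercase first so that suffix matching is case-insensitive.
  token = token.lower()
  suffixes = ['ed','ing','ment','s']
  for suffix in suffixes:
    if token.endswith(suffix):
      # Strip only the first matching suffix (str.removesuffix needs Python 3.9+).
      return token.removesuffix(suffix)
  return token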



In [9]:
my_words_stemmed = [mystemmer(word) for word in word_tokenized]
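
The stemmer above strips a suffix whenever it matches, so short words such as *was* lose their final letter. One possible refinement, sketched here under assumptions, is to try longer suffixes first and to refuse a strip that would leave too little of the word. The name *guarded_stemmer*, the extra suffix *ments* and the threshold *min_stem_len* are hypothetical choices for illustration, not part of the exercise:

```python
def guarded_stemmer(token, min_stem_len=3):
    """Strip at most one suffix, and only if enough of the word is left."""
    word = token.lower()
    # Try the longest suffixes first, so "ments" is tested before "s".
    suffixes = sorted(["ments", "ment", "ing", "ed", "s"], key=len, reverse=True)
    for suffix in suffixes:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


for w in ["Drawings", "was", "frightened", "sing"]:
    print(w, "->", guarded_stemmer(w))
```

With these settings *was* and *sing* are left intact because the remaining stem would be shorter than three letters. The threshold is a crude heuristic and will still mangle some words.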

**Questions:**
  4. What are some examples where your stemmer's output on the text differs from the PorterStemmer's?
  
  5. Can you explain why the differences occur?


In [10]:
#4.
word_not_common = [word for word in my_words_stemmed if word not in words_stemmed]
word_not_common[:10]

['once',
 'magnificent',
 'picture',
 'storie',
 'nature',
 'primeval',
 'picture',
 'animal',
 'copy',
 'able']

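4. Examples include *once*, *magnificent*, *picture*, *storie*, *nature*, *primeval*, *animal*, *copy* and *able*: these are stems produced by our stemmer that never appear in the PorterStemmer output.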

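5. The differences occur because our stemmer only knows four suffixes and strips them unconditionally, whereas the PorterStemmer applies a much larger, ordered set of rules, including conditions on the stem that remains. For example, it removes the final -e of *once* (giving *onc*) and the -ent of *magnificent* (giving *magnific*), neither of which is in our suffix list.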
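
The comparison in cell 10 checks each custom stem against the whole list of Porter stems. A position-by-position comparison can be more precise, because it pairs each token with its own two stems. A minimal sketch, assuming a stand-in function *simple_stemmer* that mirrors the four-suffix idea, and reusing the three Porter stems printed earlier:

```python
def simple_stemmer(token):
    # Hypothetical stand-in mirroring the four-suffix stemmer of the exercise.
    word = token.lower()
    for suffix in ["ed", "ing", "ment", "s"]:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


tokens = ["Once", "was", "magnificent"]
porter_stems = ["onc", "wa", "magnific"]  # values shown in the output of cell 7

# Pair each token with both stems and keep only the rows that disagree.
differences = [
    (tok, simple_stemmer(tok), porter)
    for tok, porter in zip(tokens, porter_stems)
    if simple_stemmer(tok) != porter
]
print(differences)
```

Here *was* drops out of the list because both stemmers strip the final -s, while *Once* and *magnificent* remain as genuine disagreements.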

Finally, we will use the spaCy library to lemmatize the text and compare the output to the stemming performed above. First we load the default spaCy model for English:

In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Note: You may need to run the following command first to download the model:
# ! python -m spacy download en_core_web_sm

The loaded model contains spaCy's saved data about how to process English text. Now we will use it to lemmatize:

**Question:**
  6. Lemmatize the text and output an array of lemmatized tokens - **return lowercased tokens**. How many unique lemmas are in the text? Hint: Use *nlp(text)* as a Python iterator. Each item in the iterator has an attribute *.lemma_*.


  7. What is an example of a word which has different lemmatized and stemmed forms? Why? Answer from a linguistic point of view.

In [14]:
#6.
doc= nlp(text)
lemmatized_word = [token.lemma_.lower() for token in doc]

In [15]:
np.unique(lemmatized_word).shape

(141,)

There are 141 unique lemmas in the text. This count includes non-word tokens such as punctuation and the newline token.
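
The count above is taken over every token in the document. If only word lemmas are of interest, one option is to skip tokens that spaCy flags through the *is_punct* and *is_space* attributes. The sketch below uses a hypothetical helper *unique_word_lemmas* and stand-in token objects so that it runs without a spaCy model; with the model loaded you would pass *nlp(text)* instead of *fake_doc*:

```python
from types import SimpleNamespace


def unique_word_lemmas(doc):
    """Lowercased lemmas of tokens that are neither punctuation nor whitespace."""
    return {
        token.lemma_.lower()
        for token in doc
        if not token.is_punct and not token.is_space
    }


# Stand-in tokens exposing the same three attributes as spaCy tokens.
fake_doc = [
    SimpleNamespace(lemma_="\n", is_punct=False, is_space=True),
    SimpleNamespace(lemma_="Once", is_punct=False, is_space=False),
    SimpleNamespace(lemma_="story", is_punct=False, is_space=False),
    SimpleNamespace(lemma_=",", is_punct=True, is_space=False),
    SimpleNamespace(lemma_="story", is_punct=False, is_space=False),
]
print(sorted(unique_word_lemmas(fake_doc)))
```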

In [16]:
#7.
not_common_stemlem = [word for word in lemmatized_word if word not in words_stemmed]

In [17]:
not_common_stemlem[:10]

['\n',
 'once',
 'magnificent',
 'picture',
 'story',
 'nature',
 'primeval',
 'picture',
 'animal',
 'copy']

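Lemmatization reduces each word to its dictionary form (its lemma), so the result is always a real word, and it can take the word's role in the sentence into account. Stemming, by contrast, only truncates the word. For example, *once* and *magnificent* keep their full form as lemmas but are stemmed to *onc* and *magnific*, and *stories* is lemmatized to *story*.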
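
The contrast can be illustrated with a toy example. The lookup table below is a hypothetical mini-dictionary written for this sketch; it is not how spaCy works internally, since real lemmatizers rely on much larger resources and on part-of-speech information. The point is only that a dictionary can map irregular forms to a base word, which suffix stripping cannot do:

```python
# Hypothetical mini-dictionary, for illustration only.
LEMMA_TABLE = {"was": "be", "drew": "draw", "stories": "story"}


def toy_lemma(token):
    # Dictionary lookup: unknown words are returned unchanged.
    word = token.lower()
    return LEMMA_TABLE.get(word, word)


def toy_stem(token):
    # Rule-based truncation: strip a final "s" and nothing else.
    word = token.lower()
    return word[:-1] if word.endswith("s") else word


for w in ["was", "drew", "stories"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemma(w))
```

The stems (*wa*, *drew*, *storie*) are not all real words and do not link *drew* to *draw*, whereas the lookup returns the dictionary forms *be*, *draw* and *story*.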