# Stemming

Let's look at an example of stemming text with [NLTK](https://www.nltk.org/) (the Natural Language Toolkit). We will use its `SnowballStemmer` implementation. The implementation is available online, so if you are curious about how stemming is done in different languages, you can browse the [`nltk.stem` documentation](https://www.nltk.org/api/nltk.stem.html).

In [2]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 

In [3]:
# We need to download a package for word tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/marcvonwyl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Tokenization

In [4]:
text = "At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction."
text

'At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction.'

Let's start with word tokenization. Notice how it splits the text into words and symbols.

In [5]:
" ".join(word_tokenize(text))

'At first , historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction . [ 5 ] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages , using the comparative method and internal reconstruction .'

Rule-based tokenizers have to make implementation choices, for example keeping hyphenated words together (`proto-languages`) or splitting symbols into separate tokens (`[ 5 ]`). Different tokenizers will produce different results.

NLTK provides several [word and sentence tokenizers](https://www.nltk.org/api/nltk.tokenize.html).
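
To make the effect of these choices concrete, here is a small sketch comparing two other NLTK tokenizers on the same fragment. The sample string is built from pieces of the text above for illustration; `WhitespaceTokenizer` and `wordpunct_tokenize` are rule-based tokenizers from `nltk.tokenize` that need no extra download.

```python
from nltk.tokenize import WhitespaceTokenizer, wordpunct_tokenize

sample = "proto-languages [5]"  # illustrative fragment

# Splits on whitespace only, so hyphens and brackets stay attached
whitespace_tokens = WhitespaceTokenizer().tokenize(sample)
print(whitespace_tokens)  # ['proto-languages', '[5]']

# Separates runs of word characters from runs of punctuation
wordpunct_tokens = wordpunct_tokenize(sample)
print(wordpunct_tokens)  # ['proto', '-', 'languages', '[', '5', ']']
```

Neither result is wrong; which one is preferable depends on what the downstream step expects as a "word".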

## Stemming

Now let's apply stemming to every token made up solely of word characters (letters, digits, and underscores).

In [6]:
import re

re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")
stemmed = [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]
        
" ".join(stemmed)

'at first histor linguist serv as the cornerston of compar linguist primarili as a tool for linguist reconstruct 5 scholar were concern chiefli with establish languag famili and reconstruct prehistor use the compar method and intern reconstruct'

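Note how the words are simply truncated to their stems: the stemmer strips suffixes by rule, so the result is not always a dictionary word (`histor`, `primarili`). Note also that "were" didn't change: it is an irregular form, and suffix-stripping rules cannot reduce it to "be".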
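
The filter in the cell above deserves a closer look: `^\w+$` only accepts tokens made entirely of word characters, which is why punctuation and the hyphenated `proto-languages` are absent from the stemmed output while `5` remains. A minimal sketch with a hand-picked list of tokens:

```python
import re

re_word = re.compile(r"^\w+$")  # same pattern as in the stemming cell

# A few tokens picked by hand from the tokenized example sentence
tokens = ["linguistics", "5", ",", "[", "proto-languages"]

# Keep only tokens that consist of word characters from start to end
kept = [token for token in tokens if re_word.match(token)]
print(kept)  # ['linguistics', '5']
```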

Here is another example, with the irregular verb "went".

In [7]:
text = "I went to the cinema"
stemmed = [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]
        
" ".join(stemmed)

'i went to the cinema'
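
As the output shows, "went" is left untouched: the stemmer has no rule that relates it to "go". Where stemming does help is in collapsing regular inflections and derivations onto a shared stem. The sketch below reuses words from the examples above; the grouping logic is illustrative and not part of NLTK.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

words = ["linguistics", "linguistic", "reconstruction", "reconstructing", "were", "went"]

# Group the surface forms by the stem they are reduced to
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, forms in groups.items():
    print(f"{stem}: {forms}")
```

The regular forms end up under `linguist` and `reconstruct`, while "were" and "went" each stay in a group of their own.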

## Speed

Let's see how fast stemming is on a larger quantity of text. [TorchText](https://pytorch.org/text/stable/index.html) provides several [datasets](https://pytorch.org/text/stable/datasets.html), including the text-only part of the [Penn Treebank](https://pytorch.org/text/stable/datasets.html#penntreebank), whose training split contains 42068 lines.

In [8]:
from torchtext.datasets import PennTreebank
train, valid, test = PennTreebank()

In [9]:
from tqdm import tqdm

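re_word = re.compile(r"^\w+$")
# Despite the "nb_" prefix, these are sets holding the distinct tokens and stems seen so far
nb_unique_token = set()
nb_unique_stem = set()
for text in tqdm(train, total=len(train)):
    for token in word_tokenize(text):
        # Keep only tokens made up of word characters, as before
        if re_word.match(token):
            nb_unique_token.add(token)
            nb_unique_stem.add(stemmer.stem(token))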

100%|████████████████████████████████████████████████████████| 42068/42068 [00:10<00:00, 3991.21it/s]


In [10]:
print(f"nb unique token: {len(nb_unique_token)} vs nb unique stem: {len(nb_unique_stem)}")

nb unique token: 9519 vs nb unique stem: 6151
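
These counts mean that stemming shrinks the vocabulary from 9519 to 6151 entries, roughly 35% fewer. To repeat the measurement on other text without downloading the dataset, the loop can be wrapped in a small helper. This is a sketch under assumptions: `vocabulary_stats` is a hypothetical name, it splits on whitespace instead of calling `word_tokenize`, and the two sample lines are made up from the earlier example sentence.

```python
import re
import time

from nltk.stem.snowball import SnowballStemmer

re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")


def vocabulary_stats(lines):
    """Return (unique tokens, unique stems, elapsed seconds) for an iterable of lines."""
    unique_tokens, unique_stems = set(), set()
    start = time.perf_counter()
    for line in lines:
        # Assumption: whitespace splitting stands in for word_tokenize here
        for token in line.split():
            if re_word.match(token):
                unique_tokens.add(token)
                unique_stems.add(stemmer.stem(token))
    elapsed = time.perf_counter() - start
    return unique_tokens, unique_stems, elapsed


# Made-up sample lines reusing words from the earlier example
sample_lines = [
    "historical linguistics and comparative linguistics",
    "linguistic reconstruction and reconstructing",
]
tokens, stems, elapsed = vocabulary_stats(sample_lines)
reduction = 1 - len(stems) / len(tokens)
print(f"{len(tokens)} tokens, {len(stems)} stems, {reduction:.0%} smaller, {elapsed:.4f}s")
```

Timing the loop with `time.perf_counter` gives a wall-clock figure comparable to the one reported by the progress bar, though on such a tiny sample it is dominated by noise.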


## Going Further

NLTK provides several stemmer implementations covering several languages. Notably, [this short tutorial](http://www.nltk.org/howto/stem.html) shows how to use the Snowball stemmer in several languages. You can also look directly at [NLTK's implementation](https://www.nltk.org/_modules/nltk/stem/snowball.html) of the different stemmers.
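
As a starting point, the following sketch lists the languages the Snowball stemmer supports and stems one German word. The word "Autobahnen" is an arbitrary example; the exact list of languages depends on the installed NLTK version.

```python
from nltk.stem.snowball import SnowballStemmer

# Tuple of language names accepted by the SnowballStemmer constructor
print(SnowballStemmer.languages)

# Pick a language by name, then stem as before
german_stemmer = SnowballStemmer("german")
german_stem = german_stemmer.stem("Autobahnen")
print(german_stem)
```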