Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# PREPROCESSING

## Tokenization

*Tokenization* is the process of splitting an input text into tokens (words or other relevant elements, such as punctuation).

#### Making use of regular expressions

We can tokenize a piece of text by using a regular expression tokenizer, such as the one available in **NLTK**.

For starters, let's stick to alphanumerical sequences of characters.

In [5]:
import nltk
from nltk import regexp_tokenize

text = 'That U.S.A. poster-print costs $12.40...'

pattern = '[a-zA-Z0-9_]+'
tokens = regexp_tokenize(text, pattern)
print(len(tokens))
print(tokens)

9
['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']


We can refine the regular expression to obtain a more sensible tokenization.

In [6]:
pattern = r'''(?x)           # set flag to allow verbose regexps
        (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
        | \w+(?:-\w+)*       # words with optional internal hyphens
        | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
        | \.\.\.             # ellipsis
        | [][.,;"'?():_`-]   # these are separate tokens; includes ], [
        '''

tokens = regexp_tokenize(text, pattern)
print(len(tokens))
print(tokens)

6
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
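
One detail worth keeping in mind when listing punctuation inside a character class: an unescaped hyphen between two characters denotes a range rather than a literal hyphen. The short check below is only an illustration (the two patterns are made up for this purpose); it shows that `[:-_]` matches uppercase letters but not `-`, whereas placing the hyphen last makes it literal.

```python
import re

# Inside a character class, ':-_' is a range from ':' (U+003A) to '_' (U+005F),
# which covers the uppercase letters among other characters.
range_class = re.compile(r"[:-_]")
# With the hyphen placed last it is literal: only ':', '_' and '-' match.
literal_class = re.compile(r"[:_-]")

for ch in ["A", "-", ":", "_"]:
    print(repr(ch), bool(range_class.fullmatch(ch)), bool(literal_class.fullmatch(ch)))
```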


#### Using NLTK

NLTK also includes a word tokenizer, which gets roughly the same result (it finds "words" and punctuation).

In [7]:
from nltk import word_tokenize
nltk.download('punkt_tab')

text = 'That U.S.A. poster-print costs $12.40...'
tokens = word_tokenize(text)

print(len(tokens))
print(tokens)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\maxbp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


7
['That', 'U.S.A.', 'poster-print', 'costs', '$', '12.40', '...']


In [8]:
word_tokenize("I don't think we're flying today.")

['I', 'do', "n't", 'think', 'we', "'re", 'flying', 'today', '.']

You can try [other tokenizers](https://www.nltk.org/api/nltk.tokenize.html) available in NLTK.

In [36]:
# try out the wordpunct tokenizer
from nltk import wordpunct_tokenize

text = 'My U.S.A. flag cost more than my house...'
tokens = wordpunct_tokenize(text)

print(len(tokens))
print(tokens)

14
['My', 'U', '.', 'S', '.', 'A', '.', 'flag', 'cost', 'more', 'than', 'my', 'house', '...']


Let's get a sentence from the user and tokenize it.

In [37]:
s = input("Enter some text:")
tokens = word_tokenize(s)

print("You typed", len(tokens), "words:", tokens)

You typed 15 words: ['dio', 'mio', ',', 'I', 'am', 'surprised', '!', 'Do', "n't", 'come', 'around', 'these', 'parts', 'again', '...']


#### Sentence segmentation

We may also be interested in splitting the text into sentences.

In [38]:
from nltk import sent_tokenize

text = "Hello. Are you Mr. Smith? Just to let you know that I have finished my M.Sc. and Ph.D. on AI. I loved it!"
sentences = sent_tokenize(text)

print(sentences)
print("Number of sentences:", len(sentences))

['Hello.', 'Are you Mr. Smith?', 'Just to let you know that I have finished my M.Sc.', 'and Ph.D. on AI.', 'I loved it!']
Number of sentences: 5
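
For contrast, here is a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. It is a rough sketch using only the standard library (the regular expression is an assumption chosen for illustration), and it shows why abbreviations make sentence segmentation harder than it looks.

```python
import re

text = ("Hello. Are you Mr. Smith? Just to let you know that I have finished "
        "my M.Sc. and Ph.D. on AI. I loved it!")

# Split on whitespace that follows sentence-final punctuation.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
print("Number of pieces:", len(naive_sentences))
```

The naive rule cuts after "Mr." as well as after "M.Sc." and "Ph.D.", so it produces more fragments than `sent_tokenize` does on the same text.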


#### Experimenting with long texts

We can try downloading a book from [Project Gutenberg](https://www.gutenberg.org/).

In [12]:
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

print(len(raw))
print(raw[:75])

1135214
*** START OF THE PROJECT GUTENBERG EBOOK 2554 ***




CRIME AND PUNISHMENT



How many sentences are there? Print out the second sentence (index 1).

In [39]:
# insert your code here
sentences = sent_tokenize(raw)
print("Number of sentences:", len(sentences))
print(sentences[1])

Number of sentences: 11940
Dostoevsky was the son of a doctor.


How many tokens are there? What is the index of the first token in the second sentence?

In [40]:
# insert your code here
numTokens = len(word_tokenize(raw))

firstTokenIndex = len(word_tokenize(sentences[0]))

print("Number of tokens:", numTokens)
print("Index of first token in second sentence:", firstTokenIndex)

Number of tokens: 253688
Index of first token in second sentence: 43


And how many types (unique words) are there? Which is the most frequent one? *(Hint: use a [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) container from collections.)*

In [41]:
# insert your code here
from collections import Counter

tokenCount = Counter(word_tokenize(raw))
print("The number of unique tokens is: ", len(tokenCount))

mostFrequentWord, mostFrequentCount = tokenCount.most_common(1)[0]
print("The most frequent word is:", mostFrequentWord, "with a count of:", mostFrequentCount)


The number of unique tokens is:  11103
The most frequent word is: , with a count of: 16042
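
The most frequent "word" above is a comma, because punctuation marks are tokens too and case variants such as "The" and "the" count as different types. One possible normalization, sketched below on a small hand-made token list (an assumption standing in for the output of `word_tokenize`), keeps only alphabetic tokens and lowercases them before counting.

```python
from collections import Counter

# A short hand-made token list stands in for the output of word_tokenize.
tokens = ['The', 'cat', ',', 'the', 'dog', ',', 'and', ',', 'the', 'bird', '.']

# Counting raw tokens: punctuation and case variants are separate types.
raw_counts = Counter(tokens)

# Keep purely alphabetic tokens and lowercase them before counting.
word_counts = Counter(t.lower() for t in tokens if t.isalpha())

print(len(raw_counts), raw_counts.most_common(1))
print(len(word_counts), word_counts.most_common(1))
```

Whether to drop punctuation or fold case depends on the task; `str.isalpha` also discards numbers and hyphenated words, which may or may not be desirable.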


#### Dealing with multi-word expressions (MWE)

Sometimes we want certain words to stick together when tokenizing, such as in multi-word names.

In [16]:
word_tokenize("Good muffins cost $3.88\nin New York.")

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']

One way to do it is to supply our own lexicon and make use of NLTK's [MWE tokenizer](https://www.nltk.org/api/nltk.tokenize.mwe.html).

In [17]:
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize

s = "Good muffins cost $3.88\nin New York."
mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator=' ')

[mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]

[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York', '.']]

Try out your own multi-word expressions to tokenize text.

In [None]:
# try out your own multi-word expressions

s2 = "My best friend really likes ice cream."
mwe2 = MWETokenizer([('best', 'friend'), ('ice', 'cream')], separator=' ')

[mwe2.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s2)]


[['My', 'best friend', 'really', 'likes', 'ice cream', '.']]
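
To make the merging step concrete, here is a minimal pure-Python sketch of the idea: scan the token list left to right and join any run of tokens that matches a known expression. The helper `merge_mwes` is hypothetical and written for illustration only; NLTK's own implementation differs.

```python
def merge_mwes(tokens, mwes, separator=" "):
    """Greedy left-to-right merge of known multi-word expressions."""
    # Try longer expressions first so they win over shorter ones sharing a prefix.
    ordered = sorted(mwes, key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            n = len(mwe)
            if tuple(tokens[i:i + n]) == tuple(mwe):
                merged.append(separator.join(mwe))
                i += n
                break
        else:
            # No expression starts at position i: keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']
print(merge_mwes(tokens, [('New', 'York'), ('Hong', 'Kong')]))
```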

### Tokenization in large language models

Tokenization in large language models is more involved, and usually consists of training a vocabulary on a large corpus, using algorithms such as Byte-Pair Encoding (BPE), WordPiece, or Unigram.

OpenAI models use mostly BPE tokenizers, made available in the [tiktoken](https://github.com/openai/tiktoken) library, which you can also explore [here](https://tiktokenizer.vercel.app/).

The [SentencePiece](https://github.com/google/sentencepiece) library is another alternative.

A very nice introduction to GPT tokenizers can be found [here](https://www.youtube.com/watch?v=zduSFxRajkE).
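
To give a feel for how a BPE vocabulary is trained, here is a toy sketch: starting from single characters, it repeatedly merges the most frequent adjacent pair of symbols. It is a simplification under stated assumptions (characters instead of bytes, no word-boundary markers, no pre-tokenization), and the function `train_bpe` and the tiny corpus are illustrative, not what tiktoken does internally.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Rewrite every word, replacing the best pair by one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = train_bpe(corpus, num_merges=4)
print(merges)
print(dict(vocab))
```

After a few merges, frequent character sequences such as "low" and "est" become single symbols, while rarer words remain split into smaller pieces.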

In [20]:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # check the available tokenizers at https://tiktokenizer.vercel.app/

In [21]:
s = "Hellow world, how are you doing? NLP rocks!"
token_ids = enc.encode(s)

In [22]:
token_ids

[39, 5412, 1917, 11, 1268, 527, 499, 3815, 30, 452, 12852, 23902, 0]

In [23]:
enc.decode(token_ids)

'Hellow world, how are you doing? NLP rocks!'

Check out what the token ids correspond to by iterating over them and decoding one id at a time.

In [44]:
# insert your code here
for tokenId in token_ids:
    print("Token id:", tokenId, "corresponds to:", enc.decode([tokenId]))


Token id: 39 corresponds to: H
Token id: 5412 corresponds to: ellow
Token id: 1917 corresponds to:  world
Token id: 11 corresponds to: ,
Token id: 1268 corresponds to:  how
Token id: 527 corresponds to:  are
Token id: 499 corresponds to:  you
Token id: 3815 corresponds to:  doing
Token id: 30 corresponds to: ?
Token id: 452 corresponds to:  N
Token id: 12852 corresponds to: LP
Token id: 23902 corresponds to:  rocks
Token id: 0 corresponds to: !


## Stemming and Lemmatization

*Stemming* and *Lemmatization* are techniques used to normalize tokens, so as to reduce the size of the vocabulary.
Whereas lemmatization maps each word to its lemma (its dictionary form), stemming typically applies a set of transformation rules that aim to cut off word-final affixes.
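
The difference can be sketched with two toy functions: a rule-based suffix stripper and a dictionary lookup. Both are deliberately simplistic assumptions made for illustration (the suffix list and the small lemma table are invented here) and bear no relation to the actual rules or resources used by NLTK.

```python
# Hypothetical suffix list, far from a real stemmer's rule set.
SUFFIXES = ("ing", "ies", "es", "s", "ed")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table; a real lemmatizer relies on a full lexicon.
LEMMAS = {"feet": "foot", "women": "woman", "studies": "study", "studying": "study"}

def toy_lemmatize(word):
    """Look the word up in the table, falling back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ["studies", "studying", "feet", "women"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer may return strings that are not words ("stud") and cannot handle irregular forms ("feet"), whereas the lookup returns dictionary forms but only for words it knows.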

#### Stemming

NLTK includes one of the most well-known stemmers: the [Porter stemmer](https://www.emerald.com/insight/content/doi/10.1108/00330330610681286/full/pdf?casa_token=eT_IPtH_eLEAAAAA:Z3lAtxWdxf0FL479mL-A7tC-_QRzxNeeyC2DFLyWwGBlcj6DQcwu2Bnq37waDPcXKOnXkMMDtKGyCaYGZtYcb3lgBZ9uaHKUNO0JCMivSdPE4HTe).

In [25]:
from nltk.stem import PorterStemmer

# initialize the Porter Stemmer
porter = PorterStemmer()

Let's use an illustrative piece of text:

In [26]:
sentence = '''The European Commission has funded a numerical study to analyze the purchase of a pipe organ with no noise
for Europe's organization. Numerous donations have followed the analysis after a noisy debate.'''

# tokenize: split the text into words
word_list = nltk.word_tokenize(sentence)

print("\nOriginal word list:", word_list)
print("\nOriginal number of distinct tokens:", len(set(word_list)))


Original word list: ['The', 'European', 'Commission', 'has', 'funded', 'a', 'numerical', 'study', 'to', 'analyze', 'the', 'purchase', 'of', 'a', 'pipe', 'organ', 'with', 'no', 'noise', 'for', 'Europe', "'s", 'organization', '.', 'Numerous', 'donations', 'have', 'followed', 'the', 'analysis', 'after', 'a', 'noisy', 'debate', '.']

Original number of distinct tokens: 31


Now, we stem the tokens in the text:

In [27]:
# stem list of words and join
stemmed_output = ' '.join([porter.stem(w) for w in word_list])
print("Stemmed text:", stemmed_output)

# tokenize: split the text into words
stemmed_word_list = nltk.word_tokenize(stemmed_output)

print("\nStemmed word list:", stemmed_word_list)
print("\nStemmed number of distinct tokens:", len(set(stemmed_word_list)))

Stemmed text: the european commiss ha fund a numer studi to analyz the purchas of a pipe organ with no nois for europ 's organ . numer donat have follow the analysi after a noisi debat .

Stemmed word list: ['the', 'european', 'commiss', 'ha', 'fund', 'a', 'numer', 'studi', 'to', 'analyz', 'the', 'purchas', 'of', 'a', 'pipe', 'organ', 'with', 'no', 'nois', 'for', 'europ', "'s", 'organ', '.', 'numer', 'donat', 'have', 'follow', 'the', 'analysi', 'after', 'a', 'noisi', 'debat', '.']

Stemmed number of distinct tokens: 28


You can see the reduced vocabulary size. Some tokens are over-generalized (semantically different tokens that get the same stem), while others are under-generalized (semantically similar tokens that get different stems).
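
A quick way to spot both effects is to group words by their stem. The sketch below reuses words from the example sentence; the grouping helper is just one possible way to inspect the result.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ["organ", "organization", "numerical", "numerous",
         "analyze", "analysis", "noise", "noisy"]

# Map each stem to the list of words that produce it.
groups = defaultdict(list)
for w in words:
    groups[porter.stem(w)].append(w)

for stem, members in groups.items():
    print(stem, members)
```

Words with unrelated meanings that land in the same group (such as "organ" and "organization") are over-generalized, while related words that end up in different groups (such as "analyze" and "analysis") are under-generalized.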

Try out [other stemmers](https://www.nltk.org/api/nltk.stem.html) available in NLTK.

In [28]:
# try out other stemmers


We can try a few for Portuguese:

In [None]:
# Portuguese stemmer: https://www.nltk.org/_modules/nltk/stem/rslp.html
from nltk.stem import RSLPStemmer
nltk.download('rslp')

stemmer = RSLPStemmer()
sentence = "Estou mesmo a gostar desta unidade curricular, todos gostamos de unidades curriculares interessantes."

word_list = nltk.word_tokenize(sentence)
stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print(stemmed_output)

est mesm a gost dest unidad curricul , tod gost de unidad curricul interess .


In [46]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("portuguese")
sentence = "Estou mesmo a gostar desta unidade curricular, todos gostamos de unidades curriculares interessantes."

word_list = nltk.word_tokenize(sentence)
stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print(stemmed_output)

estou mesm a gost dest unidad curricul , tod gost de unidad curricul interess .


#### Lemmatization

NLTK includes a [lemmatizer based on WordNet](https://www.nltk.org/api/nltk.stem.wordnet.html).

In [33]:
# WordNet lemmatizer
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

sentence = "Men and women love to study artificial intelligence while studying data science. Don't you? My feet and teeth are clean!"


# tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)

# lemmatize list of words
lemmatized_output = [lemmatizer.lemmatize(w) for w in word_list]
print(lemmatized_output)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maxbp\AppData\Roaming\nltk_data...


['Men', 'and', 'women', 'love', 'to', 'study', 'artificial', 'intelligence', 'while', 'studying', 'data', 'science', '.', 'Do', "n't", 'you', '?', 'My', 'feet', 'and', 'teeth', 'are', 'clean', '!']
['Men', 'and', 'woman', 'love', 'to', 'study', 'artificial', 'intelligence', 'while', 'studying', 'data', 'science', '.', 'Do', "n't", 'you', '?', 'My', 'foot', 'and', 'teeth', 'are', 'clean', '!']


Compare the result with stemming applied to the same text.

In [34]:
# compare with stemming


## spaCy

spaCy includes several [language processing pipelines](https://spacy.io/usage/processing-pipelines), each of which performs several NLP tasks in a single pass. We can use one of the available [trained pipelines](https://spacy.io/models).

In [47]:
import spacy

# the model must be installed first, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

We simply pass the sentence through the language processing pipeline (in this case, for English):

In [None]:
sent = nlp(sentence)
print(sent)
print(len(sent))

As you can see, we now have a sequence of tokens, each of which has specific [attributes](https://spacy.io/api/token#attributes) attached. For instance, we can easily get the lemma for each word:

In [None]:
for token in sent:
    print(token.text, token.lemma_)