### Credits:

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is created by Zhuo Chen based on the notebooks created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />

Reused and modified for internal use at Università Cattolica del Sacro Cuore di Milano, by Deborah Grbac, email deborah.grbac@unicatt.it and Valentina Schiariti, email valentina.schiariti-collaboratore@unicatt.it, released under CC BY License.

This repository is founded on **Constellate notebooks**. The original Jupyter notebooks repository was designed by the educators at **ITHAKA's Constellate project**. The project was sunset on July 1, 2025. The current repository uses and reuses Constellate notebooks as Open Educational Resources (OER), free for re-use under a Creative Commons CC BY License.
___


# Tokenizers

**Description:**
This notebook focuses on the basic concepts surrounding tokenization. It includes material on the following concepts:

* Word segmentation
* n-grams
* Stemming
* Lemmatization
* Tokenizers

**Libraries Used:**
* urllib.request
* NLTK
* spaCy

___

## Tokenization and words

**Tokenization** is the process of segmenting text into smaller units, called **tokens**, which may be sentences, words, or sub-word chunks. It is typically the first step in a Natural Language Processing (NLP) pipeline and can be carried out by a variety of tokenizers, each reflecting different design choices.

A simple approach to word tokenization splits text on whitespace and punctuation.

> Now that summer's here, we're going to visit the beach at Lake Michigan and eat ice cream.

Splitting this sentence on whitespace only (and setting aside the comma and the final period) gives 17 words:

> Now, that, summer's, here, we're, going, to, visit, the, beach, at, Lake, Michigan, and, eat, ice, cream.
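
A minimal sketch of this whitespace split, using only Python's built-in `str.split()` (the variable names are illustrative):

```python
# Split the example sentence on whitespace only
sentence = ("Now that summer's here, we're going to visit the beach "
            "at Lake Michigan and eat ice cream.")

tokens = sentence.split()

print(len(tokens))  # number of whitespace-separated tokens
print(tokens)
```

Note that a pure whitespace split leaves punctuation attached to the neighbouring word (`here,` and `cream.`), which is one reason tokenizers usually treat punctuation separately.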

However, this raises questions: should “Lake Michigan” count as one token or two? Is “we’re” one word or two? Should “going” be treated differently from “go” or “went”?  

These challenges reveal that even the seemingly straightforward **concept of a “word”** becomes complicated when formalized for computational analysis. This is why more advanced tokenization methods, such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, were developed to address these issues in modern language models like BERT and GPT. 

We will look at a few examples of traditional tokenizers with a goal of gathering tokens into one-, two-, and three-word constructions. The general name for these is **n-grams**.

An **n-gram** is a **sequence of n items from a given sample of text or speech**. Most often, this refers to a sequence of words, but it can also be used to analyze text at the level of syllables, letters, or phonemes. N-grams are often described by their length. For example, word n-grams might include:

* stock (a 1-gram, or **unigram**)
* vegetable stock (a 2-gram, or **bigram**)
* homemade vegetable stock (a 3-gram, or **trigram**)

Analyzing text through n-grams allows us to capture meaning that extends beyond single words. Looking only at unigrams, we would not be able, for example, to differentiate between the "stock" in "stock market" and the "stock" in "vegetable stock." By including bigrams and trigrams in our analysis, we are able to look at concepts that extend across multiple words.
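
As a rough illustration, the sketch below builds bigrams from a made-up sentence with `zip` and counts them. The sentence and the variable names are assumptions chosen for the example.

```python
from collections import Counter

# A made-up sentence containing two different senses of "stock"
text = "the stock market fell while the vegetable stock simmered"
tokens = text.split()

# Pair every token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

print(unigram_counts["stock"])                # "stock" alone is ambiguous
print(bigram_counts[("stock", "market")])     # the financial sense
print(bigram_counts[("vegetable", "stock")])  # the culinary sense
```

The unigram count lumps both senses together, while the bigram counts keep them apart.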

One of the most popular examples of text analysis with n-grams is the [Google N-Gram Viewer](https://books.google.com/ngrams).

## Creating your own basic tokenizer

As explained, the most intuitive way to turn a text into tokens is by splitting on whitespace and punctuation. It is possible to create your own basic tokenizer by using Python string methods.

The following example uses the `.split()` method to gather unigrams. We will be using an extract from Shakespeare's *Othello*, provided at this [link](https://www.folger.edu/explore/shakespeares-works/othello/read/).

In [1]:
import urllib.request
from pathlib import Path

# Step 1: Create a ./data folder if it doesn’t exist
data_folder = Path("./data/")
data_folder.mkdir(exist_ok=True)

# Step 2: Download the text file into the ./data folder
text_address = "https://folger-main-site-assets.s3.amazonaws.com/uploads/2022/11/othello_TXT_FolgerShakespeare.txt"
text_name = './data/' + text_address.rsplit('/', 1)[-1]
urllib.request.urlretrieve(text_address, text_name)

('./data/othello_TXT_FolgerShakespeare.txt',
 <http.client.HTTPMessage at 0x20474e2f610>)

In [2]:
text_path = data_folder / "othello_TXT_FolgerShakespeare.txt"

if text_path.exists():
    print("File loaded successfully!")
else:
    print("File not found. Please download it and place it in the ./data/ folder.")

File loaded successfully!


In [3]:
# Opening a file in read mode
with open(text_path, "r") as f:
    othello_text_extract = f.read(1000) #reading an extract of the text
    print(othello_text_extract)

Othello
by William Shakespeare
Edited by Barbara A. Mowat and Paul Werstine
  with Michael Poston and Rebecca Niles
Folger Shakespeare Library
https://shakespeare.folger.edu/shakespeares-works/othello/
Created on May 11, 2016, from FDT version 0.9.2.1

Characters in the Play
OTHELLO, a Moorish general in the Venetian army
DESDEMONA, a Venetian lady
BRABANTIO, a Venetian senator, father to Desdemona
IAGO, Othello's standard-bearer, or "ancient"
EMILIA, Iago's wife and Desdemona's attendant
CASSIO, Othello's second-in-command, or lieutenant
RODERIGO, a Venetian gentleman
Duke of Venice
Venetian gentlemen, kinsmen to Brabantio:
  LODOVICO
  GRATIANO
Venetian senators
MONTANO, an official in Cyprus
BIANCA, a woman in Cyprus in love with Cassio
Clown, a comic servant to Othello and Desdemona
Gentlemen of Cyprus
Sailors
Servants, Attendants, Officers, Messengers, Herald, Musicians, Torchbearers.


ACT 1
=====

Scene 1
[Enter Roderigo and Iago.]


RODERIGO
Tush,


In [4]:
# See the raw string version of our text
othello_text_extract



In [5]:
# Splitting the text string into a list of strings
extract_tokenized_list =  othello_text_extract.split()
list(extract_tokenized_list)

['Othello',
 'by',
 'William',
 'Shakespeare',
 'Edited',
 'by',
 'Barbara',
 'A.',
 'Mowat',
 'and',
 'Paul',
 'Werstine',
 'with',
 'Michael',
 'Poston',
 'and',
 'Rebecca',
 'Niles',
 'Folger',
 'Shakespeare',
 'Library',
 'https://shakespeare.folger.edu/shakespeares-works/othello/',
 'Created',
 'on',
 'May',
 '11,',
 '2016,',
 'from',
 'FDT',
 'version',
 '0.9.2.1',
 'Characters',
 'in',
 'the',
 'Play',
 'OTHELLO,',
 'a',
 'Moorish',
 'general',
 'in',
 'the',
 'Venetian',
 'army',
 'DESDEMONA,',
 'a',
 'Venetian',
 'lady',
 'BRABANTIO,',
 'a',
 'Venetian',
 'senator,',
 'father',
 'to',
 'Desdemona',
 'IAGO,',
 "Othello's",
 'standard-bearer,',
 'or',
 '"ancient"',
 'EMILIA,',
 "Iago's",
 'wife',
 'and',
 "Desdemona's",
 'attendant',
 'CASSIO,',
 "Othello's",
 'second-in-command,',
 'or',
 'lieutenant',
 'RODERIGO,',
 'a',
 'Venetian',
 'gentleman',
 'Duke',
 'of',
 'Venice',
 'Venetian',
 'gentlemen,',
 'kinsmen',
 'to',
 'Brabantio:',
 'LODOVICO',
 'GRATIANO',
 'Venetian',
 's

In [6]:
# Cleaning up the tokens
unigrams = []

for token in extract_tokenized_list:
    token = token.lower() # lowercase tokens
    token = token.replace('.', '') # remove periods
    token = token.replace('!', '') # remove exclamation points
    token = token.replace('?', '') # remove question marks
    unigrams.append(token)

In [7]:
# Preview the unigrams
list(unigrams)

['othello',
 'by',
 'william',
 'shakespeare',
 'edited',
 'by',
 'barbara',
 'a',
 'mowat',
 'and',
 'paul',
 'werstine',
 'with',
 'michael',
 'poston',
 'and',
 'rebecca',
 'niles',
 'folger',
 'shakespeare',
 'library',
 'https://shakespearefolgeredu/shakespeares-works/othello/',
 'created',
 'on',
 'may',
 '11,',
 '2016,',
 'from',
 'fdt',
 'version',
 '0921',
 'characters',
 'in',
 'the',
 'play',
 'othello,',
 'a',
 'moorish',
 'general',
 'in',
 'the',
 'venetian',
 'army',
 'desdemona,',
 'a',
 'venetian',
 'lady',
 'brabantio,',
 'a',
 'venetian',
 'senator,',
 'father',
 'to',
 'desdemona',
 'iago,',
 "othello's",
 'standard-bearer,',
 'or',
 '"ancient"',
 'emilia,',
 "iago's",
 'wife',
 'and',
 "desdemona's",
 'attendant',
 'cassio,',
 "othello's",
 'second-in-command,',
 'or',
 'lieutenant',
 'roderigo,',
 'a',
 'venetian',
 'gentleman',
 'duke',
 'of',
 'venice',
 'venetian',
 'gentlemen,',
 'kinsmen',
 'to',
 'brabantio:',
 'lodovico',
 'gratiano',
 'venetian',
 'senator

In [8]:
# Count up the tokens using a Counter() object
from collections import Counter
word_counts = Counter(unigrams)
print(word_counts)



In [9]:
#Applying to the full text

#Step 1: open the text in read mode and create file variable
with open(text_path, "r") as f:
    othello_text_full = f.read()

#Step 2: split the text on whitespace    
full_tokenized_list =  othello_text_full.split()

#Step 3: text normalization
full_unigrams = []

for token in full_tokenized_list:
    token = token.lower() # lowercase tokens
    token = token.replace('.', '') # remove periods
    token = token.replace('!', '') # remove exclamation points
    token = token.replace('?', '') # remove question marks
    full_unigrams.append(token)

#Step 4: Unigram word frequency counting
from collections import Counter
full_word_counts = Counter(full_unigrams)
print(full_word_counts.most_common(50)) # print the 50 most common tokens as an example

[('i', 805), ('and', 781), ('the', 701), ('to', 575), ('of', 443), ('a', 436), ('my', 429), ('you', 418), ('that', 355), ('in', 332), ('is', 303), ('not', 299), ('iago', 298), ('othello', 289), ('it', 280), ('me', 241), ('for', 226), ('with', 224), ('but', 215), ('be', 213), ('do', 212), ('your', 210), ('this', 207), ('have', 204), ('he', 201), ('desdemona', 194), ('her', 193), ('cassio', 189), ('his', 167), ('as', 163), ('what', 159), ('him', 151), ('she', 150), ('will', 145), ('if', 140), ('thou', 140), ('so', 140), ('by', 113), ('on', 108), ('emilia', 103), ('are', 102), ('shall', 96), ('am', 90), ("'tis", 88), ('or', 87), ('all', 86), ('roderigo', 83), ('good', 83), ("'t", 82), ('would', 81)]
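
The cleaning step above removes only periods, exclamation points, and question marks, so tokens such as `othello,` or `"ancient"` keep their punctuation. One possible variation, sketched here on a short sample line rather than the full play, strips any leading or trailing punctuation with `str.strip(string.punctuation)`. The helper name `normalize` is hypothetical.

```python
import string
from collections import Counter

def normalize(token):
    """Lowercase a token and strip punctuation from both ends."""
    return token.lower().strip(string.punctuation)

sample = 'IAGO, Othello\'s standard-bearer, or "ancient"'

cleaned = [normalize(token) for token in sample.split()]
cleaned = [token for token in cleaned if token]  # drop tokens that were only punctuation

print(cleaned)
print(Counter(cleaned).most_common(3))
```

Because only the ends of each token are stripped, internal apostrophes and hyphens survive. The trade-off is that a leading apostrophe, as in `'tis`, is removed as well.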


## NLTK

While writing your own tokenizer may allow you to create highly customized results, it is often easier and more effective to use **existing tokenizers** offered in packages such as the **Natural Language Toolkit (NLTK)** and **spaCy**.


The NLTK library has multiple tokenizers available, each with its own specific advantages and disadvantages. 

### [Word Punctuation](https://www.nltk.org/_modules/nltk/tokenize/punkt.html)
The word punctuation tokenizer splits on whitespace and separates punctuation into its own tokens.
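
To see what this rule amounts to, the behaviour can be approximated with a single regular expression from Python's `re` module. This is a sketch for illustration, not NLTK's own code, and the pattern and function name are assumptions about how such a tokenizer could be written.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (punctuation and symbols)
PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return PATTERN.findall(text)

print(simple_wordpunct("It was amazing! See https://example.com"))
```

A rule this simple breaks a URL into several pieces, which is one of the disadvantages to keep in mind when choosing a tokenizer.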

### [Penn Treebank](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)
The Treebank tokenizer follows the conventions of the Penn Treebank, and NLTK's default `word_tokenize` function is based on it. It uses a variety of regular expressions to handle contractions, quotes, parentheses, brackets, and dashes.

### [Tweet](https://www.nltk.org/_modules/nltk/tokenize/casual.html#TweetTokenizer)
The Twitter tokenizer is designed to work with Twitter and social media text. It uses regular expressions for addressing emoticons, phone numbers, URLs, Twitter usernames, and email addresses.

### [Multi-Word Expression](https://www.nltk.org/_modules/nltk/tokenize/mwe.html)
The MWETokenizer takes a "string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs." The lexicon of multi-word expressions is constructed by the user. It can be built ad hoc, depending on the user's research interest, or discovered through techniques like part-of-speech tagging, collocation analysis, and named entity recognition.

In [10]:
# Import a variety of tokenizers
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import (TreebankWordTokenizer,
                          word_tokenize,
                          wordpunct_tokenize,
                          TweetTokenizer,
                          MWETokenizer)




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [11]:
string = "Last weekend, I traveled to New York City and wrote about my experience on https://example.com — it was amazing! #TravelDiaries"

In [12]:
# Python .split() tokenization
split_tokens = string.split()
print('Python .split()')
print(split_tokens, '\n')

# Punctuation-based tokenization
punct_tokens = wordpunct_tokenize(string)
print('Wordpunct tokenizer')
print(punct_tokens, '\n')

# Treebank Tokenizer
treebank_tokens = TreebankWordTokenizer().tokenize(string)
print('Treebank Tokenizer')
print(treebank_tokens, '\n')

# TweetTokenizer
tweet_tokens = TweetTokenizer().tokenize(string)
print('Tweet Tokenizer')
print(tweet_tokens, '\n')

# Multi-Word Expression Tokenizer
tokenizer = MWETokenizer([('New', 'York','City')])
MWE_tokens = tokenizer.tokenize(word_tokenize(string))
print('MWE Tokenizer')
print(MWE_tokens)

Python .split()
['Last', 'weekend,', 'I', 'traveled', 'to', 'New', 'York', 'City', 'and', 'wrote', 'about', 'my', 'experience', 'on', 'https://example.com', '—', 'it', 'was', 'amazing!', '#TravelDiaries'] 

Wordpunct tokenizer
['Last', 'weekend', ',', 'I', 'traveled', 'to', 'New', 'York', 'City', 'and', 'wrote', 'about', 'my', 'experience', 'on', 'https', '://', 'example', '.', 'com', '—', 'it', 'was', 'amazing', '!', '#', 'TravelDiaries'] 

Treebank Tokenizer
['Last', 'weekend', ',', 'I', 'traveled', 'to', 'New', 'York', 'City', 'and', 'wrote', 'about', 'my', 'experience', 'on', 'https', ':', '//example.com', '—', 'it', 'was', 'amazing', '!', '#', 'TravelDiaries'] 

Tweet Tokenizer
['Last', 'weekend', ',', 'I', 'traveled', 'to', 'New', 'York', 'City', 'and', 'wrote', 'about', 'my', 'experience', 'on', 'https://example.com', '—', 'it', 'was', 'amazing', '!', '#TravelDiaries'] 

MWE Tokenizer
['Last', 'weekend', ',', 'I', 'traveled', 'to', 'New_York_City', 'and', 'wrote', 'about', 'my',

The tokenizer will generate a list of unigrams, but we still need to generate our bigrams and trigrams. We can simply pass the tokens into NLTK's `bigrams` and `trigrams` functions and then store the results in lists.

In [13]:
# Creating our bigrams and trigrams
bigrams = list(nltk.bigrams(treebank_tokens))
trigrams = list(nltk.trigrams(treebank_tokens))

print('Bigrams: \n ', bigrams, '\n')
    
print('Trigrams: \n ', trigrams)


Bigrams: 
  [('Last', 'weekend'), ('weekend', ','), (',', 'I'), ('I', 'traveled'), ('traveled', 'to'), ('to', 'New'), ('New', 'York'), ('York', 'City'), ('City', 'and'), ('and', 'wrote'), ('wrote', 'about'), ('about', 'my'), ('my', 'experience'), ('experience', 'on'), ('on', 'https'), ('https', ':'), (':', '//example.com'), ('//example.com', '—'), ('—', 'it'), ('it', 'was'), ('was', 'amazing'), ('amazing', '!'), ('!', '#'), ('#', 'TravelDiaries')] 

Trigrams: 
  [('Last', 'weekend', ','), ('weekend', ',', 'I'), (',', 'I', 'traveled'), ('I', 'traveled', 'to'), ('traveled', 'to', 'New'), ('to', 'New', 'York'), ('New', 'York', 'City'), ('York', 'City', 'and'), ('City', 'and', 'wrote'), ('and', 'wrote', 'about'), ('wrote', 'about', 'my'), ('about', 'my', 'experience'), ('my', 'experience', 'on'), ('experience', 'on', 'https'), ('on', 'https', ':'), ('https', ':', '//example.com'), (':', '//example.com', '—'), ('//example.com', '—', 'it'), ('—', 'it', 'was'), ('it', 'was', 'amazing'), ('was

NLTK's `bigrams` and `trigrams` functions return each n-gram as a tuple. If we want the n-grams to be strings, we need to access each index of the tuple and build a string out of it.

In [14]:
# Function definitions for Converting NLTK tuples into strings

from collections import Counter

def convert_tuple_bigrams(tuples_to_convert):
    """Converts NLTK tuples into bigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        gram_string = f'{first_word} {second_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_tuple_trigrams(tuples_to_convert):
    """Converts NLTK tuples into trigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        third_word = tuple_grams[2]
        gram_string = f'{first_word} {second_word} {third_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_strings_to_counts(string_grams):
    """Counts a list of n-gram strings and returns the counts as a dictionary"""
    counter_of_grams = Counter(string_grams)
    dict_of_grams = dict(counter_of_grams)
    return dict_of_grams

In [15]:
# Converting the tuples
string_bigrams = convert_tuple_bigrams(bigrams)
bigramCount = convert_strings_to_counts(string_bigrams)

print('Bigrams as a dictionary of counts')
print(bigramCount, '\n')

string_trigrams = convert_tuple_trigrams(trigrams)
trigramCount = convert_strings_to_counts(string_trigrams)

print('Trigrams as a dictionary of counts')
print(trigramCount)

Bigrams as a dictionary of counts
{'Last weekend': 1, 'weekend ,': 1, ', I': 1, 'I traveled': 1, 'traveled to': 1, 'to New': 1, 'New York': 1, 'York City': 1, 'City and': 1, 'and wrote': 1, 'wrote about': 1, 'about my': 1, 'my experience': 1, 'experience on': 1, 'on https': 1, 'https :': 1, ': //example.com': 1, '//example.com —': 1, '— it': 1, 'it was': 1, 'was amazing': 1, 'amazing !': 1, '! #': 1, '# TravelDiaries': 1} 

Trigrams as a dictionary of counts
{'Last weekend ,': 1, 'weekend , I': 1, ', I traveled': 1, 'I traveled to': 1, 'traveled to New': 1, 'to New York': 1, 'New York City': 1, 'York City and': 1, 'City and wrote': 1, 'and wrote about': 1, 'wrote about my': 1, 'about my experience': 1, 'my experience on': 1, 'experience on https': 1, 'on https :': 1, 'https : //example.com': 1, ': //example.com —': 1, '//example.com — it': 1, '— it was': 1, 'it was amazing': 1, 'was amazing !': 1, 'amazing ! #': 1, '! # TravelDiaries': 1}
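
The two conversion functions above differ only in how many tuple positions they read. As a possible simplification, a single helper can join a tuple of any length with `str.join`. The name `tuples_to_strings` is hypothetical, and the sample tuples are written out by hand for the example.

```python
from collections import Counter

def tuples_to_strings(tuples_to_convert):
    """Join each n-gram tuple into one space-separated string."""
    return [" ".join(gram) for gram in tuples_to_convert]

sample_bigrams = [("New", "York"), ("York", "City"), ("New", "York")]
sample_trigrams = [("New", "York", "City")]

bigram_strings = tuples_to_strings(sample_bigrams)
trigram_strings = tuples_to_strings(sample_trigrams)

print(dict(Counter(bigram_strings)))
print(trigram_strings)
```

The same helper would work unchanged for 4-grams or longer sequences.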


### Stemming and Part-of-Speech Tagging

The NLTK library can also be used for stemming and part-of-speech tagging.

A stemmer removes the endings of words to obtain a base form. The idea is to group together related words that share the same core meaning, regardless of their tense or number (singular or plural). Ideally, we would like:

* ducks -> duck
* flown -> fly

This process has its limits: stemming relies on string rules, not grammar or meaning. A stemmer doesn’t “know” what a word means or how it’s used in a sentence. This can cause over-stemming (different words are merged incorrectly, e.g., organize, organization → organ) and under-stemming (related words are not reduced to the same stem, e.g., fly and flown). Stemming can also be difficult to apply to morphologically rich languages.
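
To make the idea of string rules concrete, here is a deliberately naive suffix-stripping function. It is a toy written for this illustration, not the Snowball algorithm, and its suffix list is an arbitrary assumption.

```python
# Suffixes are tried in order, longest first (an arbitrary toy list)
SUFFIXES = ("ization", "ize", "ing", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["ducks", "organize", "organization", "running", "flown"]:
    print(word, "->", toy_stem(word))
```

Even this tiny rule set shows both failure modes: "organize" and "organization" collapse onto "organ" (over-stemming), while "flown" is left untouched and "running" becomes "runn" instead of "run" (under-stemming).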


In [16]:
# Snowball stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

unstemmed_token1 = 'running'
unstemmed_token2 = 'flown'

stemmed_token1 = stemmer.stem(unstemmed_token1)
stemmed_token2 = stemmer.stem(unstemmed_token2)

print(stemmed_token1)
print(stemmed_token2)

run
flown


As you can see, the effectiveness of stemming depends on the type of word. The more variations a word has, the harder it is for stemming to handle it (e.g., fly → flew → flown).

Some of these problems can be fixed with a **lemmatizer**. A lemmatizer doesn't simply strip off letters but looks up verb tenses and takes into account the part of speech of each word. **Part of Speech (POS)** tagging allows us to see the parts of speech of various tokens.

In [17]:
# Part of Speech Tagging
pos_list = nltk.pos_tag(nltk.word_tokenize(string))
print(pos_list)

[('Last', 'JJ'), ('weekend', 'NN'), (',', ','), ('I', 'PRP'), ('traveled', 'VBD'), ('to', 'TO'), ('New', 'NNP'), ('York', 'NNP'), ('City', 'NNP'), ('and', 'CC'), ('wrote', 'VBD'), ('about', 'IN'), ('my', 'PRP$'), ('experience', 'NN'), ('on', 'IN'), ('https', 'NN'), (':', ':'), ('//example.com', 'NN'), ('—', 'IN'), ('it', 'PRP'), ('was', 'VBD'), ('amazing', 'VBG'), ('!', '.'), ('#', '#'), ('TravelDiaries', 'NNS')]


The above output shows a tag attached to each token (e.g., JJ → adjective, NN → singular noun, VBD → verb, past tense).
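
For readability, the tags can be translated with a small lookup table. The dictionary below covers only a handful of Penn Treebank tags, and the tagged pairs are copied by hand from the output above; both are illustrative rather than complete.

```python
# A partial glossary of Penn Treebank part-of-speech tags
TAG_NAMES = {
    "JJ": "adjective",
    "NN": "noun, singular",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "VBD": "verb, past tense",
    "CC": "coordinating conjunction",
    "IN": "preposition or subordinating conjunction",
}

tagged = [("Last", "JJ"), ("weekend", "NN"), ("I", "PRP"), ("traveled", "VBD")]

def describe(tagged_tokens):
    """Pair each token with a readable tag name, or the raw tag if unknown."""
    return [(token, TAG_NAMES.get(tag, tag)) for token, tag in tagged_tokens]

for token, name in describe(tagged):
    print(f"{token}: {name}")
```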

In the following example we can see a lemmatizer in action, compared to a stemmer:

In [18]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

words = ["running", "flies", "better", "studies", "mice", "geese", "went"]

print("Word\tStemmed\tLemmatized")

for word in words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word, pos='v')  # 'v' = verb
    print(f"{word}\t{stemmed}\t{lemmatized}")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Word	Stemmed	Lemmatized
running	run	run
flies	fli	fly
better	better	better
studies	studi	study
mice	mice	mice
geese	gees	geese
went	went	go


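The lemmatized column shows how lemmatization recovers dictionary forms that stemming misses (flies → fly, studies → study, went → go). Because every word here is lemmatized as a verb (`pos='v'`), the nouns "mice" and "geese" and the adjective "better" are left unchanged; lemmatizing them well requires passing their own part of speech.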
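
In the loop above, every word is lemmatized as a verb (`pos='v'`). A common refinement is to choose the part of speech for each token from its tag. The helper below is a sketch of that mapping only: the function name is hypothetical, and it relies on the single-letter codes that WordNet uses for parts of speech (`'n'`, `'v'`, `'a'`, `'r'`).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun

for tag in ["JJ", "VBD", "NNS", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter could then be passed as the `pos` argument of `lemmatizer.lemmatize(word, pos=...)`.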

## spaCy

spaCy takes a different approach from NLTK: it processes a text into a document object (a `Doc`) whose tokens carry their own annotations. It is more sophisticated, but uses a different syntax for NLP tasks.


In [19]:
# Install the spaCy Program
# For installation, see https://spacy.io/usage
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm



In [20]:
from spacy.lang.en import English

nlp = English()

string = "Last weekend, I traveled to New York City and wrote about it on https://example.com #TravelDiaries"

my_doc = nlp(string)

tokens = []
for token in my_doc:
    tokens.append(token.text)

print(tokens)

['Last', 'weekend', ',', 'I', 'traveled', 'to', 'New', 'York', 'City', 'and', 'wrote', 'about', 'it', 'on', 'https://example.com', '#', 'TravelDiaries']


If we want to improve on spaCy's default tokenization, it is possible to [add rules](https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/). In the example above, we could keep "New York City" together as a single token and do the same for "#TravelDiaries". We can do this in the following way.

In [29]:
from spacy.symbols import ORTH

# Add a special case for "New York City"
special_case = [{"ORTH": "New York City"}]
nlp.tokenizer.add_special_case("New York City", special_case)

# Add a special case for "#TravelDiaries"
special_case2 = [{"ORTH": "#TravelDiaries"}]
nlp.tokenizer.add_special_case("#TravelDiaries", special_case2)

my_doc = nlp(string)

tokens = []
for token in my_doc:
    tokens.append(token.text)

print(tokens)


['Last', 'weekend', ',', 'I', 'traveled', 'to', 'New York City', 'and', 'wrote', 'about', 'it', 'on', 'https://example.com', '#TravelDiaries']


spaCy also supports part-of-speech tagging and lemmatization:

In [21]:
import spacy
nlp = spacy.load('en_core_web_sm')
my_doc = nlp(string)

print('Parts of Speech')
for token in my_doc:
    print(token, token.pos_)

print('\nLemmatizations')
for token in my_doc:
    print(token, token.lemma_)

Parts of Speech
Last ADJ
weekend NOUN
, PUNCT
I PRON
traveled VERB
to ADP
New PROPN
York PROPN
City PROPN
and CCONJ
wrote VERB
about ADP
it PRON
on ADP
https://example.com X
# SYM
TravelDiaries PROPN

Lemmatizations
Last last
weekend weekend
, ,
I I
traveled travel
to to
New New
York York
City City
and and
wrote write
about about
it it
on on
https://example.com https://example.com
# #
TravelDiaries TravelDiaries


We can gather our n-grams by defining a function that accepts our tokens and an argument `n` for the "n" in "n-gram." So, a bigram would be n = 2.

In [22]:
# A function for gathering n-grams with spaCy
def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a list of lists."""
    grams = []  # named differently from the function to avoid shadowing it
    for i in range(len(tokens) - n + 1):
        grams.append(tokens[i:i + n])
    # Equivalent list comprehension: [tokens[i:i+n] for i in range(len(tokens)-n+1)]
    return grams

In [23]:
bigrams = n_grams(tokens, 2)
trigrams = n_grams(tokens, 3)
print(bigrams)
print(trigrams)

[['Last', 'weekend'], ['weekend', ','], [',', 'I'], ['I', 'traveled'], ['traveled', 'to'], ['to', 'New'], ['New', 'York'], ['York', 'City'], ['City', 'and'], ['and', 'wrote'], ['wrote', 'about'], ['about', 'it'], ['it', 'on'], ['on', 'https://example.com'], ['https://example.com', '#'], ['#', 'TravelDiaries']]
[['Last', 'weekend', ','], ['weekend', ',', 'I'], [',', 'I', 'traveled'], ['I', 'traveled', 'to'], ['traveled', 'to', 'New'], ['to', 'New', 'York'], ['New', 'York', 'City'], ['York', 'City', 'and'], ['City', 'and', 'wrote'], ['and', 'wrote', 'about'], ['wrote', 'about', 'it'], ['about', 'it', 'on'], ['it', 'on', 'https://example.com'], ['on', 'https://example.com', '#'], ['https://example.com', '#', 'TravelDiaries']]
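
Because this function returns each n-gram as a list, the results cannot be counted directly: lists are not hashable. A small sketch of one workaround converts each n-gram to a tuple before passing it to `Counter`. The token list here is a made-up example, and the function is restated as a list comprehension so the block runs on its own.

```python
from collections import Counter

def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a list of lists."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sample_tokens = ["to", "be", "or", "not", "to", "be"]

# Tuples are hashable, so they can serve as Counter keys
bigram_counts = Counter(tuple(gram) for gram in n_grams(sample_tokens, 2))

print(bigram_counts.most_common(2))
```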


While the NLTK and spaCy tokenizers are the most prominent, tokenizers are also available in packages such as:

* [Gensim](https://radimrehurek.com/gensim/)
* [Keras](https://keras.io/)
* [Stanford NLP](https://nlp.stanford.edu/software/tokenizer.shtml)