### Credits:

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />

Reused and modified for internal use at Università Cattolica del Sacro Cuore di Milano, by Deborah Grbac, email deborah.grbac@unicatt.it and released under CC BY License.

This repository is based on **Constellate notebooks**. The original Jupyter notebooks repository was designed by the educators at **ITHAKA's Constellate project**. The project was sunset on July 1, 2025. This repository reuses the Constellate notebooks as Open Educational Resources (OER), free for reuse under a Creative Commons CC BY License.

This repository, its notebooks, and this README file draw on the [Ithaka-Constellate repository](https://github.com/ithaka/constellate-notebooks?ref=cms-prod.constellate.org), but have been modified by Deborah Grbac and Valentina Schiariti for internal use at Università Cattolica del Sacro Cuore di Milano.

___


# Tokenizers

**Description:**
This notebook focuses on the basic concepts surrounding tokenization. It includes material on the following concepts:

* Word segmentation
* n-grams
* Stemming
* Lemmatization
* Tokenizers

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 
* Python Basics ([Start Python Basics 1](../Python-basics/python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 2](../Python-intermediate/python-intermediate-2.ipynb)

**Data Format:** None

**Libraries Used:**
* urllib.request
* NLTK
* spaCy

**Research Pipeline:**

1. Scan documents
2. OCR files
3. Clean up texts
4. **Tokenize text files** (this notebook)
___

## Tokenization and words

**Tokenization** is the process of segmenting text into smaller units, called **tokens**, which may be sentences, words, or sub-word chunks. It is typically the first step in a Natural Language Processing (NLP) pipeline and can be carried out by a variety of tokenizers, each reflecting different design choices.

A simple approach to word tokenization splits text on whitespace and punctuation.

> Now that summer's here, we're going to visit the beach at Lake Michigan and eat ice cream.

Splitting this sentence on whitespace alone yields 17 words:

> Now, that, summer's, here, we're, going, to, visit, the, beach, at, Lake, Michigan, and, eat, ice, cream.

However, this raises questions: should “Lake Michigan” count as one token or two? Is “we’re” one word or two? Should “going” be treated differently from “go” or “went”?  
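The whitespace split described above can be reproduced with Python's built-in `.split()` method. This is only a minimal sketch; the variable names `sentence` and `tokens` are chosen here for illustration.

```python
sentence = ("Now that summer's here, we're going to visit the beach "
            "at Lake Michigan and eat ice cream.")

# Whitespace-only split: punctuation stays attached to the neighbouring word
tokens = sentence.split()

print(len(tokens))
print(tokens)
```

Note that the comma and the final period remain attached to "here," and "cream.", which is one reason a whitespace split alone is rarely sufficient.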

These challenges reveal that even the seemingly straightforward **concept of a “word”** becomes complicated when formalized for computational analysis. Partly for this reason, modern language models like BERT and GPT rely on subword tokenization methods such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.

We will look at a few examples of traditional tokenizers with a goal of gathering tokens into one-, two-, and three-word constructions. The general name for these is **n-grams**.

An **n-gram** is a **sequence of n items from a given sample of text or speech**. Most often, this refers to a sequence of words, but it can also be used to analyze text at the level of syllables, letters, or phonemes. N-grams are often described by their length. For example, word n-grams might include:

* stock (a 1-gram, or **unigram**)
* vegetable stock (a 2-gram, or **bigram**)
* homemade vegetable stock (a 3-gram, or **trigram**)

Analyzing text through n-grams allows us to capture meaning that extends beyond single words. Looking only at unigrams, we would not be able, for example, to differentiate between the "stock" in "stock market" and the "stock" in "vegetable stock." By including bigrams and trigrams in our analysis, we are able to look at concepts that extend across multiple words.
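The "stock" example can be sketched in a few lines of plain Python. The sample sentence below is invented for illustration, and pairing each word with its successor via `zip` is just one simple way to build bigrams.

```python
from collections import Counter

# An invented sample sentence containing both senses of "stock"
text = "the stock market fell while the vegetable stock simmered"
words = text.split()

# Pair each word with its successor to form bigrams
bigrams = list(zip(words, words[1:]))

unigram_counts = Counter(words)
bigram_counts = Counter(bigrams)

# The unigram count merges both senses of "stock"
print(unigram_counts["stock"])
# The bigram counts keep the two senses apart
print(bigram_counts[("stock", "market")])
print(bigram_counts[("vegetable", "stock")])
```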

One of the most popular examples of text analysis with n-grams is the [Google N-Gram Viewer](https://books.google.com/ngrams).

## Creating your own basic tokenizer

As explained, the most intuitive way to turn a text into tokens is by splitting on whitespace and punctuation. It is possible to create your own basic tokenizer by using Python string methods.

The following example uses the `.split()` method to gather unigrams. We will be using an extract from Shakespeare's Othello, provided at this [link](https://github.com/ithaka/constellate-notebooks/blob/3121ac06e7f03651dc016b33536ffeebc180c33a/All-sample-files/othello_TXT_FolgerShakespeare.txt).

In [16]:
from pathlib import Path

# Step 1: Create a ./data folder if it doesn’t exist
data_folder = Path("./data/")
data_folder.mkdir(exist_ok=True)

# Step 2: Download the file manually from GitHub (URL: https://raw.githubusercontent.com/ithaka/constellate-notebooks/3121ac06e7f03651dc016b33536ffeebc180c33a/All-sample-files/othello_TXT_FolgerShakespeare.txt)
# Save it inside the ./data/ folder as "othello_TXT_FolgerShakespeare.txt"

# Step 3: Load the text file (after the person has placed it there)
text_path = data_folder / "othello_TXT_FolgerShakespeare.txt"

if text_path.exists():
    print("File loaded successfully!")
else:
    print("File not found. Please download it and place it in the ./data/ folder.")

File loaded successfully!


In [17]:
# Opening a file in read mode
with open(text_path, "r") as f:
    othello_text = f.read()
    print(othello_text)

Othello
by William Shakespeare
Edited by Barbara A. Mowat and Paul Werstine
  with Michael Poston and Rebecca Niles
Folger Shakespeare Library
https://shakespeare.folger.edu/shakespeares-works/othello/
Created on May 11, 2016, from FDT version 0.9.2.1

Characters in the Play
OTHELLO, a Moorish general in the Venetian army
DESDEMONA, a Venetian lady
BRABANTIO, a Venetian senator, father to Desdemona
IAGO, Othello's standard-bearer, or "ancient"
EMILIA, Iago's wife and Desdemona's attendant
CASSIO, Othello's second-in-command, or lieutenant
RODERIGO, a Venetian gentleman
Duke of Venice
Venetian gentlemen, kinsmen to Brabantio:
  LODOVICO
  GRATIANO
Venetian senators
MONTANO, an official in Cyprus
BIANCA, a woman in Cyprus in love with Cassio
Clown, a comic servant to Othello and Desdemona
Gentlemen of Cyprus
Sailors
Servants, Attendants, Officers, Messengers, Herald, Musicians, Torchbearers.


ACT 1
=====

Scene 1
[Enter Roderigo and Iago.]


RODERIGO
Tush, never tell me! I take it much 

In [18]:
# See the raw string version of our text
othello_text



In [20]:
# Splitting the text string into a list of strings
tokenized_list = othello_text.split()
list(tokenized_list)

['Othello',
 'by',
 'William',
 'Shakespeare',
 'Edited',
 'by',
 'Barbara',
 'A.',
 'Mowat',
 'and',
 'Paul',
 'Werstine',
 'with',
 'Michael',
 'Poston',
 'and',
 'Rebecca',
 'Niles',
 'Folger',
 'Shakespeare',
 'Library',
 'https://shakespeare.folger.edu/shakespeares-works/othello/',
 'Created',
 'on',
 'May',
 '11,',
 '2016,',
 'from',
 'FDT',
 'version',
 '0.9.2.1',
 'Characters',
 'in',
 'the',
 'Play',
 'OTHELLO,',
 'a',
 'Moorish',
 'general',
 'in',
 'the',
 'Venetian',
 'army',
 'DESDEMONA,',
 'a',
 'Venetian',
 'lady',
 'BRABANTIO,',
 'a',
 'Venetian',
 'senator,',
 'father',
 'to',
 'Desdemona',
 'IAGO,',
 "Othello's",
 'standard-bearer,',
 'or',
 '"ancient"',
 'EMILIA,',
 "Iago's",
 'wife',
 'and',
 "Desdemona's",
 'attendant',
 'CASSIO,',
 "Othello's",
 'second-in-command,',
 'or',
 'lieutenant',
 'RODERIGO,',
 'a',
 'Venetian',
 'gentleman',
 'Duke',
 'of',
 'Venice',
 'Venetian',
 'gentlemen,',
 'kinsmen',
 'to',
 'Brabantio:',
 'LODOVICO',
 'GRATIANO',
 'Venetian',
 's

In [21]:
# Cleaning up the tokens
unigrams = []

for token in tokenized_list:
    token = token.lower() # lowercase tokens
    token = token.replace('.', '') # remove periods
    token = token.replace('!', '') # remove exclamation points
    token = token.replace('?', '') # remove question marks
    unigrams.append(token)
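The loop above removes periods, exclamation points, and question marks wherever they occur in a token, so commas and quotation marks survive while URLs and version numbers lose their internal dots. One possible alternative, sketched here on a hand-picked sample of tokens rather than the full play, is to strip punctuation only from the two ends of each token with `str.strip` and `string.punctuation`.

```python
import string

# A hand-picked sample of tokens in the style of the .split() output
sample_tokens = ["OTHELLO,", "Tush,", '"ancient"', "Othello's", "0.9.2.1", "me!"]

cleaned = []
for token in sample_tokens:
    # strip() only removes characters from the two ends of the token,
    # so the apostrophe in "othello's" and the dots in "0.9.2.1" are kept
    token = token.lower().strip(string.punctuation)
    if token:  # skip tokens that consisted of nothing but punctuation
        cleaned.append(token)

print(cleaned)
```

Whether internal punctuation should be kept depends on the analysis, so treat this as one option rather than a recommended default.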

In [22]:
# Preview the unigrams
list(unigrams)

['othello',
 'by',
 'william',
 'shakespeare',
 'edited',
 'by',
 'barbara',
 'a',
 'mowat',
 'and',
 'paul',
 'werstine',
 'with',
 'michael',
 'poston',
 'and',
 'rebecca',
 'niles',
 'folger',
 'shakespeare',
 'library',
 'https://shakespearefolgeredu/shakespeares-works/othello/',
 'created',
 'on',
 'may',
 '11,',
 '2016,',
 'from',
 'fdt',
 'version',
 '0921',
 'characters',
 'in',
 'the',
 'play',
 'othello,',
 'a',
 'moorish',
 'general',
 'in',
 'the',
 'venetian',
 'army',
 'desdemona,',
 'a',
 'venetian',
 'lady',
 'brabantio,',
 'a',
 'venetian',
 'senator,',
 'father',
 'to',
 'desdemona',
 'iago,',
 "othello's",
 'standard-bearer,',
 'or',
 '"ancient"',
 'emilia,',
 "iago's",
 'wife',
 'and',
 "desdemona's",
 'attendant',
 'cassio,',
 "othello's",
 'second-in-command,',
 'or',
 'lieutenant',
 'roderigo,',
 'a',
 'venetian',
 'gentleman',
 'duke',
 'of',
 'venice',
 'venetian',
 'gentlemen,',
 'kinsmen',
 'to',
 'brabantio:',
 'lodovico',
 'gratiano',
 'venetian',
 'senator

In [23]:
# Count up the tokens using a Counter() object
from collections import Counter
word_counts = Counter(unigrams)
print(word_counts)
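Printing a whole `Counter` for a full play produces a very long line. As a small sketch, `Counter.most_common(n)` returns only the `n` most frequent items; the short token list below is a made-up sample standing in for `unigrams`.

```python
from collections import Counter

# A made-up sample standing in for the full list of unigrams
sample_unigrams = ["a", "venetian", "lady", "a", "venetian", "senator",
                   "a", "venetian", "gentleman", "duke", "of", "venice"]
sample_counts = Counter(sample_unigrams)

# most_common(n) returns the n most frequent (token, count) pairs,
# ordered from the highest count to the lowest
for token, count in sample_counts.most_common(3):
    print(f"{token:<10}{count}")
```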



## NLTK

While writing your own tokenizer may allow you to create highly customized results, it is usually easier and often more effective to use **existing tokenizers** offered in packages such as the **Natural Language Toolkit (NLTK)** and **spaCy**.


The NLTK library has multiple tokenizers available, each with its own specific advantages and disadvantages. 

### [Word Punctuation](https://www.nltk.org/_modules/nltk/tokenize/regexp.html)
The word punctuation tokenizer splits text into runs of word characters (letters, digits, and underscores) and runs of punctuation, discarding whitespace, so punctuation ends up in separate tokens.
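The behavior of this tokenizer can be approximated with the standard library alone. The sketch below uses the regular expression `\w+|[^\w\s]+`, which is the pattern NLTK documents for its `WordPunctTokenizer`; the variable name `sample` is chosen here for illustration.

```python
import re

sample = ("Nathan Kelber is helping us tokenize with the Constellate platform. "
          "http://constellate.org #NLP")

# \w+       matches a run of letters, digits, or underscores
# [^\w\s]+  matches a run of characters that are neither word characters nor whitespace
punct_tokens = re.findall(r"\w+|[^\w\s]+", sample)

print(punct_tokens)
```

Because punctuation is always split off, the URL is broken into several tokens, which is worth keeping in mind when texts contain links or hashtags.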

### [Penn Treebank](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)
The Treebank tokenizer underlies NLTK's default `word_tokenize` function, which combines an improved version of it with a sentence tokenizer. It features a variety of regular expressions for handling contractions and punctuation such as quotes, parentheses, brackets, and dashes.

### [Tweet](https://www.nltk.org/_modules/nltk/tokenize/casual.html#TweetTokenizer)
The Twitter tokenizer is designed to work with Twitter and social media text. It uses regular expressions for addressing emoticons, phone numbers, URLs, Twitter usernames, and email addresses.

### [Multi-Word Expression](https://www.nltk.org/_modules/nltk/tokenize/mwe.html)
The MWETokenizer takes a "string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs." The lexicon of multi-word expressions (MWEs) is constructed by the user. It can be built ad hoc, depending on the user's research interest, or discovered through techniques such as part-of-speech tagging, collocation analysis, and named entity recognition.
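To make the idea of retokenizing with a lexicon concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the helper `merge_mwes` and the two-entry lexicon are hypothetical and only illustrate the merging step.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Merge known multi-word expressions in a token list (illustrative helper).

    Assumes every expression in `mwes` is a non-empty tuple of strings.
    """
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in mwes:
            # Compare the next len(mwe) tokens against this expression
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position, so keep the token unchanged
            merged.append(tokens[i])
            i += 1
    return merged


# A hypothetical user-built lexicon of multi-word expressions
lexicon = [("Lake", "Michigan"), ("ice", "cream")]
tokens = "we visit the beach at Lake Michigan and eat ice cream".split()

print(merge_mwes(tokens, lexicon))
```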

In [27]:
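# Install NLTK, download its data, and import a variety of tokenizers
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Note: recent NLTK releases may instead request the 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' resources. If a LookupError names one
# of them, download it the same way with nltk.download().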
from nltk.tokenize import (TreebankWordTokenizer,
                          word_tokenize,
                          wordpunct_tokenize,
                          TweetTokenizer,
                          MWETokenizer)




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Utente\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


In [28]:
string = "Nathan Kelber is helping us tokenize with the Constellate platform. http://constellate.org #NLP"

In [29]:
# Python .split() tokenization
split_tokens = string.split()
print('Python .split()')
print(split_tokens, '\n')

# Punctuation-based tokenization
punct_tokens = wordpunct_tokenize(string)
print('Wordpunct tokenizer')
print(punct_tokens, '\n')

# Treebank Tokenizer
treebank_tokens = TreebankWordTokenizer().tokenize(string)
print('Treebank Tokenizer')
print(treebank_tokens, '\n')

# TweetTokenizer
tweet_tokens = TweetTokenizer().tokenize(string)
print('Tweet Tokenizer')
print(tweet_tokens, '\n')

# Multi-Word Expression Tokenizer
tokenizer = MWETokenizer([('Nathan', 'Kelber')])
MWE_tokens = tokenizer.tokenize(word_tokenize(string))
print('MWE Tokenizer')
print(MWE_tokens)

Python .split()
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform.', 'http://constellate.org', '#NLP'] 

Wordpunct tokenizer
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http', '://', 'constellate', '.', 'org', '#', 'NLP'] 

Treebank Tokenizer
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform.', 'http', ':', '//constellate.org', '#', 'NLP'] 

Tweet Tokenizer
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http://constellate.org', '#NLP'] 

MWE Tokenizer
['Nathan_Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http', ':', '//constellate.org', '#', 'NLP']


The tokenizer will generate a list of unigrams, but we still need to generate our bigrams and trigrams. We can pass the tokens to NLTK's `bigrams` and `trigrams` functions and store the results in lists.

In [30]:
# Creating our bigrams and trigrams
bigrams = list(nltk.bigrams(treebank_tokens))
trigrams = list(nltk.trigrams(treebank_tokens))

print('Bigrams: \n ', bigrams, '\n')

print('Trigrams: \n ', trigrams)


Bigrams: 
  [('Nathan', 'Kelber'), ('Kelber', 'is'), ('is', 'helping'), ('helping', 'us'), ('us', 'tokenize'), ('tokenize', 'with'), ('with', 'the'), ('the', 'Constellate'), ('Constellate', 'platform.'), ('platform.', 'http'), ('http', ':'), (':', '//constellate.org'), ('//constellate.org', '#'), ('#', 'NLP')] 

Trigrams: 
  [('Nathan', 'Kelber', 'is'), ('Kelber', 'is', 'helping'), ('is', 'helping', 'us'), ('helping', 'us', 'tokenize'), ('us', 'tokenize', 'with'), ('tokenize', 'with', 'the'), ('with', 'the', 'Constellate'), ('the', 'Constellate', 'platform.'), ('Constellate', 'platform.', 'http'), ('platform.', 'http', ':'), ('http', ':', '//constellate.org'), (':', '//constellate.org', '#'), ('//constellate.org', '#', 'NLP')]


NLTK's `bigrams` and `trigrams` functions produce each n-gram as a tuple of tokens. If we want the n-grams as strings, we need to access each element of the tuple and build a string from it.

In [31]:
# Function definitions for Converting NLTK tuples into strings

from collections import Counter

def convert_tuple_bigrams(tuples_to_convert):
    """Converts NLTK tuples into bigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        gram_string = f'{first_word} {second_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_tuple_trigrams(tuples_to_convert):
    """Converts NLTK tuples into trigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        third_word = tuple_grams[2]
        gram_string = f'{first_word} {second_word} {third_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_strings_to_counts(string_grams):
    """Counts a list of n-gram strings and returns the counts as a dictionary"""
    counter_of_grams = Counter(string_grams)
    dict_of_grams = dict(counter_of_grams)
    return dict_of_grams

In [32]:
# Converting the tuples
string_bigrams = convert_tuple_bigrams(bigrams)
bigramCount = convert_strings_to_counts(string_bigrams)

print('Bigrams as a dictionary of counts')
print(bigramCount, '\n')

string_trigrams = convert_tuple_trigrams(trigrams)
trigramCount = convert_strings_to_counts(string_trigrams)

print('Trigrams as a dictionary of counts')
print(trigramCount)

Bigrams as a dictionary of counts
{'Nathan Kelber': 1, 'Kelber is': 1, 'is helping': 1, 'helping us': 1, 'us tokenize': 1, 'tokenize with': 1, 'with the': 1, 'the Constellate': 1, 'Constellate platform.': 1, 'platform. http': 1, 'http :': 1, ': //constellate.org': 1, '//constellate.org #': 1, '# NLP': 1} 

Trigrams as a dictionary of counts
{'Nathan Kelber is': 1, 'Kelber is helping': 1, 'is helping us': 1, 'helping us tokenize': 1, 'us tokenize with': 1, 'tokenize with the': 1, 'with the Constellate': 1, 'the Constellate platform.': 1, 'Constellate platform. http': 1, 'platform. http :': 1, 'http : //constellate.org': 1, ': //constellate.org #': 1, '//constellate.org # NLP': 1}
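The bigram and trigram converters above differ only in how many tuple elements they read. As a hedged alternative, a single helper built on `str.join` handles tuples of any length; the name `tuples_to_strings` and the short sample lists are introduced here for illustration.

```python
from collections import Counter

def tuples_to_strings(tuples_to_convert):
    """Join each n-gram tuple into one space-separated string."""
    return [" ".join(gram) for gram in tuples_to_convert]

# Short sample n-grams in the same shape that NLTK produces
sample_bigrams = [("Nathan", "Kelber"), ("Kelber", "is"), ("is", "helping")]
sample_trigrams = [("Nathan", "Kelber", "is"), ("Kelber", "is", "helping")]

print(tuples_to_strings(sample_bigrams))
print(dict(Counter(tuples_to_strings(sample_trigrams))))
```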


### Stemming and Part-of-Speech Tagging

Depending on the analysis we are doing, we may want to group related word forms together, such as the singular and plural of a noun or the different tenses of a verb.

* ducks -> duck
* flown -> fly

To accomplish this, we could use a **stemmer**, such as the **Snowball stemmer**. A stemmer strips common suffixes from words to approximate a base form. It is fast, which makes it useful for very large datasets or for working with limited computing power.

A **lemmatizer** will usually do a better job. Rather than simply stripping off letters, it looks up word forms (so it can map "flown" to "fly") and takes into account the part of speech of each word.
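The difference between the two approaches can be illustrated with a toy example. The suffix rule and the three-entry lookup table below are deliberately crude stand-ins invented for this sketch; they are not the Snowball algorithm or a real lemmatizer.

```python
def toy_stem(word):
    """Strip one common suffix; a crude rule, not the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# A lemmatizer relies on vocabulary knowledge; this tiny hand-made
# table stands in for that knowledge
toy_lemmas = {"ducks": "duck", "flown": "fly", "went": "go"}

for word in ["ducks", "flown", "went"]:
    print(word, "->", toy_stem(word), "(stem),", toy_lemmas.get(word, word), "(lemma)")
```

Suffix stripping handles the regular plural but leaves the irregular verb forms untouched, while the lookup resolves them.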

In [33]:
# Snowball stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
unstemmed_token = 'running'

# A stemmer will not reduce 'flown' to 'fly'
# unstemmed_token = 'flown'

stemmed_token = stemmer.stem(unstemmed_token)

print(stemmed_token)

run


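Part-of-speech (POS) tagging labels each token with its grammatical category. NLTK's default tagger uses Penn Treebank tags, such as `NNP` for a proper noun and `VBZ` for a third-person singular present-tense verb.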

In [34]:
# Part of Speech Tagging
pos_list = nltk.pos_tag(nltk.word_tokenize(string))
print(pos_list)

[('Nathan', 'NNP'), ('Kelber', 'NNP'), ('is', 'VBZ'), ('helping', 'VBG'), ('us', 'PRP'), ('tokenize', 'VB'), ('with', 'IN'), ('the', 'DT'), ('Constellate', 'NNP'), ('platform', 'NN'), ('.', '.'), ('http', 'NN'), (':', ':'), ('//constellate.org', 'JJ'), ('#', '#'), ('NLP', 'NNP')]


## spaCy

spaCy takes a different approach from NLTK: it processes a text into a document object whose tokens carry their own attributes, such as the token text, part of speech, and lemma. It is more sophisticated, but uses a different syntax for NLP tasks.


In [35]:
# Install the spaCy library and its small English model
# For installation, see https://spacy.io/usage
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

In [37]:
from spacy.lang.en import English

nlp = English()

string = "Nathan Kelber is helping us tokenize with the Constellate platform. http://constellate.org #NLP"

my_doc = nlp(string)

tokens = []
for token in my_doc:
    tokens.append(token.text)

print(tokens)

['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http://constellate.org', '#', 'NLP']


To change how spaCy tokenizes text, you can [add rules](https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/). spaCy also supports part-of-speech tagging and lemmatization, which require a trained pipeline such as `en_core_web_sm`.

In [38]:
import spacy
nlp = spacy.load('en_core_web_sm')
my_doc = nlp(string)

print('Parts of Speech')
for token in my_doc:
    print(token, token.pos_)

print('\nLemmatizations')
for token in my_doc:
    print(token, token.lemma_)

Parts of Speech
Nathan PROPN
Kelber PROPN
is AUX
helping VERB
us PRON
tokenize VERB
with ADP
the DET
Constellate PROPN
platform NOUN
. PUNCT
http://constellate.org X
# SYM
NLP PROPN

Lemmatizations
Nathan Nathan
Kelber Kelber
is be
helping help
us we
tokenize tokenize
with with
the the
Constellate Constellate
platform platform
. .
http://constellate.org http://constellate.org
# #
NLP NLP


We can gather our n-grams by defining a function that accepts our tokens and an argument `n` for the "n" in "n-gram." So, a bigram would be n = 2.

In [39]:
# A function for gathering n-grams from a list of tokens
def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a list."""
    grams = []
    for i in range(len(tokens) - n + 1):
        grams.append(tokens[i:i + n])
    return grams
    # return [tokens[i:i+n] for i in range(len(tokens)-n+1)] # Written as a list comprehension

In [40]:
bigrams = n_grams(tokens, 2)
trigrams = n_grams(tokens, 3)
print(bigrams)
print(trigrams)

[['Nathan', 'Kelber'], ['Kelber', 'is'], ['is', 'helping'], ['helping', 'us'], ['us', 'tokenize'], ['tokenize', 'with'], ['with', 'the'], ['the', 'Constellate'], ['Constellate', 'platform'], ['platform', '.'], ['.', 'http://constellate.org'], ['http://constellate.org', '#'], ['#', 'NLP']]
[['Nathan', 'Kelber', 'is'], ['Kelber', 'is', 'helping'], ['is', 'helping', 'us'], ['helping', 'us', 'tokenize'], ['us', 'tokenize', 'with'], ['tokenize', 'with', 'the'], ['with', 'the', 'Constellate'], ['the', 'Constellate', 'platform'], ['Constellate', 'platform', '.'], ['platform', '.', 'http://constellate.org'], ['.', 'http://constellate.org', '#'], ['http://constellate.org', '#', 'NLP']]
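Because this function returns each n-gram as a list, the results cannot be counted directly with `Counter`, which requires hashable items. A small sketch of one workaround is to convert each n-gram to a tuple first; the sample tokens below are invented for illustration.

```python
from collections import Counter

def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a list."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

# Invented sample tokens with one repeated bigram
tokens = ["to", "be", "or", "not", "to", "be"]

# Lists are unhashable, so convert each n-gram to a tuple before counting
bigram_counts = Counter(tuple(gram) for gram in n_grams(tokens, 2))

print(bigram_counts.most_common(1))
```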


While the NLTK and spaCy tokenizers are the most prominent, tokenizers are also available in packages such as:

* [Gensim](https://radimrehurek.com/gensim/)
* [Keras](https://keras.io/)
* [Stanford NLP](https://nlp.stanford.edu/software/tokenizer.shtml)