## Tokenization lab
LLM's and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Due:** October 8


### Background
The purpose of this lab is to explore different tokenization methods. On their own, tokenization methods don't do much. However, they are the starting place for all natural language processing. 


#### Notes
This is a short lab using the same dataset throughout. Feel free to switch it up, but once you are comfortable with how the different algorithms approach the task of breaking up text, move on. 

You will be using the `datasets` package. You can [install the package](https://pypi.org/project/datasets/) with `$ pip install datasets`. If you do not have `pip` or `conda` installed on your machine, please install one of them now.

In [9]:
import nltk
import timeit

nltk.download('wordnet')  # got an error in cell 10 so added this line and reran...

from datasets import load_dataset

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cis-a\AppData\Roaming\nltk_data...


The next cell is just downloading the dataset. You need to be connected to the internet for this to work. 

This dataset is hosted by [Hugging Face](https://huggingface.co). Hugging Face hosts machine learning models, datasets, and more. We will reference them again. It's a great place to find corpora. 


The dataset is called [American Stories](https://huggingface.co/datasets/dell-research-harvard/AmericanStories). Please skim the Dataset Card. All models and datasets on the Hugging Face hub have these associated cards. 

In [4]:
# Decide what year you want between 1810 and 1963

my_year = "1942" # the year mom was born

# Decide how many articles you want to work with (keep this small - it's slow)
num_articles = 15

#  Download data for your choice of year (1810 to 1963)
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "subset_years",
    year_list=[my_year]
)

# Get the first n articles from that year
# instantiate the counter
i=0
# instantiate the string
my_articles = ''
# loop through each article for that year
for article in dataset[my_year]:
    #the article is a dictionary, 
    #we're getting the text of the article by accessing the key, "article"
    my_articles += article.get('article')
    #add one to our counter
    i+=1
    #if the counter is greater than num_articles-1, stop looping
    if i>(num_articles-1): break
    
#validate that it is what we expect by checking on first 1000 characters
print(my_articles[:1000])


Downloading and preparing dataset american_stories/subset_years to C:/Users/cis-a/.cache/huggingface/datasets/dell-research-harvard___american_stories/subset_years-22ef276adc874771/0.1.0/75a916c5166c4f1fe51a57e0f5074cc72e19157c2bb064a2dc3e6362e19892fb...
Only taking a subset of years. Change name to 'all_years' to use all years in the dataset.
{'1942': 'https://huggingface.co/datasets/dell-research-harvard/AmericanStories/resolve/main/faro_1942.tar.gz'}


Loading associated
Dataset american_stories downloaded and prepared to C:/Users/cis-a/.cache/huggingface/datasets/dell-research-harvard___american_stories/subset_years-22ef276adc874771/0.1.0/75a916c5166c4f1fe51a57e0f5074cc72e19157c2bb064a2dc3e6362e19892fb. Subsequent calls will reuse this data.


The meat ration of the American
soldier has been increased about 50
per cent to satisfy the soldiers
tastes.


The coffee ration has been cut
almost in half because this genera
ton of soldiers drinks milk.


These are the outstanding changes
which will be incorporated in the
1943 edition of the Bible of mess
sergeants," the Army Cook, on which
a group of 10 of the countrys fore
most woman nutritionists now are
at work.


Working under the direction of
the Quartermaster Corps, they have
undertaken revision of recipes
for Army dishes, based on the newer
knowledge of nutrition and the
changed eating habits in the homes
from which the soldiers come.


Most of the old standbys of 1918
remain. but sometimes with differ
ent and more exact proportions.
"Slumgullion" is practically un-
changed Properly cooked, the nu-
triton experts say, it is one of the
most tasty and satisfying dishes
possible. Badly cooked, it may be
terrible mess, but it is harder tomake a mess of it than of some
other reci

This section cleans up the formatting. It replaces the newline, carriage-return, and tab characters in these articles with spaces, so the text becomes one continuous string.

In [5]:
#replace newline, carriage-return, and tab characters with spaces
for char in ["\n", "\r", "\t"]:
    my_articles = my_articles.replace(char, " ")
my_articles[:1000]

'The meat ration of the American soldier has been increased about 50 per cent to satisfy the soldiers tastes.   The coffee ration has been cut almost in half because this genera ton of soldiers drinks milk.   These are the outstanding changes which will be incorporated in the 1943 edition of the Bible of mess sergeants," the Army Cook, on which a group of 10 of the countrys fore most woman nutritionists now are at work.   Working under the direction of the Quartermaster Corps, they have undertaken revision of recipes for Army dishes, based on the newer knowledge of nutrition and the changed eating habits in the homes from which the soldiers come.   Most of the old standbys of 1918 remain. but sometimes with differ ent and more exact proportions. "Slumgullion" is practically un- changed Properly cooked, the nu- triton experts say, it is one of the most tasty and satisfying dishes possible. Badly cooked, it may be terrible mess, but it is harder tomake a mess of it than of some other rec
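Replacing each formatting character with a space leaves runs of several spaces wherever the original had blank lines. As an alternative, here is a minimal sketch that collapses any run of whitespace into a single space with a regular expression. The helper name `normalize_whitespace` and the sample string are made up for illustration; they are not part of the lab.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# A short, hypothetical sample with the same kinds of line breaks as the articles
sample = (
    "The meat ration of the American\n"
    "soldier has been increased about 50\n"
    "per cent.\n\n\n"
    "The coffee ration\thas been cut"
)

cleaned = normalize_whitespace(sample)
print(cleaned)
```

Note that this does not rejoin words that were hyphenated across lines (such as `un-` / `changed`); that would need a separate rule.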

# Whitespace tokenization


First we'll just break up the words using whitespace. As noted in class, this is a really common first pass. 

In [6]:
%%time
#this is a magic function to determine how long a cell takes to run. 
#It MUST be the first thing in a cell

#split the whole string on spaces. This returns a list
whitespace_tokens = my_articles.split(' ')

#check the list
whitespace_tokens[:20]

CPU times: total: 0 ns
Wall time: 0 ns


['The',
 'meat',
 'ration',
 'of',
 'the',
 'American',
 'soldier',
 'has',
 'been',
 'increased',
 'about',
 '50',
 'per',
 'cent',
 'to',
 'satisfy',
 'the',
 'soldiers',
 'tastes.',
 '']

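Note: "ns" is nanoseconds, or one billionth of a second (1/1,000,000,000), and "µs" is microseconds, or one millionth of a second (1/1,000,000). A reported time of 0 ns most likely means the cell finished faster than the timer could measure.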
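The empty string `''` at the end of the token list above comes from splitting on a single space: every extra space in a run produces an empty token. The sketch below, on a short made-up string, compares `split(' ')` with `split()` called without an argument, which splits on runs of any whitespace and drops the empty strings.

```python
# Hypothetical sample with three spaces between the two sentences
text = "the soldiers tastes.   The coffee ration"

# Splitting on exactly one space yields an empty token for each extra space
on_single_space = text.split(" ")

# Splitting with no argument treats any run of whitespace as one separator
on_any_whitespace = text.split()

print(on_single_space)
print(on_any_whitespace)
```

Which behavior you want depends on the task; for counting tokens, the empty strings are usually noise.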

# Morphological Tokenization 

Lemmatizing is the process of breaking down text into tokens by first breaking it up into "words" and then using morphological knowledge of the language (in this case, English) to reduce each word to its base form, or lemma. 

Princeton maintains the [morphy project](https://wordnet.princeton.edu/documentation/morphy7wn#:~:text=Morphology%20in%20WordNet%20uses%20two,word%20that%20is%20in%20WordNet.), which powers `nltk`'s [WordNet Lemmatizer](https://www.nltk.org/api/nltk.stem.wordnet.html). You do NOT need to read this entire documentation, just acknowledge that it requires a significant amount of knowledge about English in order to make it work. 
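To give a feel for the kind of knowledge involved, here is a toy sketch in the spirit of Morphy: an exception list for irregular forms, suffix-detachment rules, and a lexicon check that accepts a candidate only if it is a known word. Everything in it (`LEXICON`, `EXCEPTIONS`, `DETACHMENT_RULES`, `toy_lemmatize`) is invented for illustration and is far smaller than WordNet's real data; it is not how `nltk` is implemented internally.

```python
# Toy data, invented for illustration; the real WordNet lists are far larger.
LEXICON = {
    "noun": {"soldier", "ration", "dish", "ha", "taste"},
    "verb": {"have", "increase", "drink", "taste"},
}
EXCEPTIONS = {
    "noun": {"men": "man"},
    "verb": {"has": "have", "been": "be"},
}
# (suffix, replacement) pairs, tried in order
DETACHMENT_RULES = {
    "noun": [("ies", "y"), ("shes", "sh"), ("s", "")],
    "verb": [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"), ("ed", ""), ("s", "")],
}


def toy_lemmatize(word: str, pos: str = "noun") -> str:
    """Return a lemma for `word`, or `word` unchanged if nothing applies."""
    # 1. Irregular forms are looked up directly.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2. A word that is already a known base form is returned as-is.
    if word in LEXICON[pos]:
        return word
    # 3. Try each suffix rule; accept the first candidate found in the lexicon.
    for suffix, replacement in DETACHMENT_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    # 4. Nothing matched: give the word back untouched.
    return word


print(toy_lemmatize("soldiers"))           # treated as a noun
print(toy_lemmatize("has"))                # treated as a noun
print(toy_lemmatize("has", pos="verb"))    # treated as a verb
print(toy_lemmatize("increased", pos="verb"))
print(toy_lemmatize("tastes."))            # trailing period blocks every rule
```

The part of speech matters: treated as a noun, `has` loses its final `s` and matches the toy entry `ha`, while treated as a verb it hits the exception list and becomes `have`. `nltk`'s `WordNetLemmatizer.lemmatize` also takes a `pos` argument, and it defaults to noun.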

In [7]:
#This lemmatizer is based on the Morphy project above
from nltk.stem import WordNetLemmatizer
 
#Uncomment these two lines if the WordNet data has not been downloaded on your machine yet
#nltk.download('wordnet')
#nltk.download('omw-1.4')
wn_lemmatizer = WordNetLemmatizer()

In [10]:
%%time

#first we have to split the string on spaces to get "words"
whitespace_tokens = my_articles.split(' ')

my_lemmas = []
for word in whitespace_tokens:
    w = wn_lemmatizer.lemmatize(word)
    my_lemmas.append(w)
my_lemmas[:20]

CPU times: total: 1.88 s
Wall time: 1.9 s


['The',
 'meat',
 'ration',
 'of',
 'the',
 'American',
 'soldier',
 'ha',
 'been',
 'increased',
 'about',
 '50',
 'per',
 'cent',
 'to',
 'satisfy',
 'the',
 'soldier',
 'tastes.',
 '']

Notice how much longer it takes to tokenize using morphological rules than to split on whitespace. Also check whether it produced the output you expected. Sometimes it doesn't: in the output above, `has` became `ha`. 

Note: "ms" is milliseconds, or one thousandth of a second (1/1,000).
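`%%time` measures a single run, which is why a very fast cell can report 0 ns. For a steadier comparison, the `timeit` module (already imported at the top of this notebook) can repeat a function many times and average the result. The sketch below is only an illustration: the sample text is made up, and `tokenize_with_suffix_rule` is a hypothetical stand-in for a slower rule-based step, not the WordNet lemmatizer.

```python
import timeit

# Hypothetical sample text, repeated so the work is large enough to measure
text = "The meat ration of the American soldier has been increased " * 2000


def tokenize_whitespace():
    """Split the sample text on single spaces."""
    return text.split(" ")


def tokenize_with_suffix_rule():
    """Split on spaces, then strip a trailing 's' from each token (a stand-in rule)."""
    return [tok[:-1] if tok.endswith("s") else tok for tok in text.split(" ")]


runs = 20
# timeit returns the total seconds for `number` calls; divide for a per-run average
t_split = timeit.timeit(tokenize_whitespace, number=runs) / runs
t_rule = timeit.timeit(tokenize_with_suffix_rule, number=runs) / runs

print(f"whitespace split:    {t_split * 1e3:.3f} ms per run")
print(f"split + suffix rule: {t_rule * 1e3:.3f} ms per run")
```

The exact numbers depend on your machine, so compare the two lines against each other rather than against anyone else's output.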

# Byte Pair Encoding

There are two implementations of BPE here. The first [uses a package (bpe)](https://github.com/soaxelbrooke/python-bpe) that you will have to install using `pip` (see above). 

This implements the algorithm we covered in class, which you can review in this [Hugging Face video](https://youtu.be/HEikzVL-lZU).
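As a reminder of the core idea, here is a minimal from-scratch sketch of BPE training: start from single characters, repeatedly find the most frequent adjacent pair of symbols, and merge it into a new symbol. This is not the `bpe` package's code. It leaves out end-of-word markers, unknown-token handling, and any word-level vocabulary, and the names `learn_bpe`, `merge_pair`, and `apply_bpe` are made up for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    # Each distinct word starts as a tuple of characters, stored with its frequency.
    vocab = Counter(tuple(word) for word in words if word)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the vocabulary with the best pair merged everywhere.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Tiny hypothetical corpus
corpus = "the soldier and the soldiers ate the ration".split()
merges = learn_bpe(corpus, num_merges=5)
print(merges)
print(apply_bpe("soldiers", merges))
```

Because `the` is the most frequent word in this tiny corpus, the first merges build it up from `t`, `h`, and `e`; rarer words stay split into smaller pieces.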

In [12]:
from bpe import Encoder

In [13]:
%%time
whitespace_tokens = my_articles.split(' ')

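# create the Encoder
# we've specified a 100-token vocabulary, with 95% of it (pct_bpe=0.95) reserved for BPE subword tokens
# the remaining 5% holds whole-word tokens; anything the vocabulary cannot cover is transformed into UNK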
encoder = Encoder(100, pct_bpe=0.95)
encoder.fit(whitespace_tokens)

CPU times: total: 31.2 ms
Wall time: 26.4 ms


In [14]:
#print(encoder.tokenize(my_articles))

print(next(encoder.inverse_transform(encoder.transform([my_articles]))))

the meat ration of the american soldier has been increased about __unk0 per cent to satisfy the soldiers tastes . the coffee ration has been cut almost in half because this genera ton of soldiers drinks milk . these are the outstanding changes which will be incorporated in the 1__unk__unk__unk edition of the bible of mess sergeants __unk__unk the army cook , on which a group of 10 of the countrys fore most woman nutritionists now are at work . working under the direction of the __unkuartermaster corps , they have undertaken revision of recipes for army dishes , based on the newer knowledge of nutrition and the changed eating habits in the homes from which the soldiers come . most of the old standbys of 1__unk1__unk remain . but sometimes with differ ent and more e__unkact proportions . __unk slumgullion __unk is practically un - changed properly cooked , the nu - triton e__unkperts say , it is one of the most tasty and satisfying dishes possible . badly cooked , it may be terrible mess