### Lab 8.1 Tokenization

This week we will work up to creating an RNN text generator.  In today's lab you will explore different methods of text tokenization.   Here's an overview of what you will try to do.

Imagine that our entire dataset consists of the following text:

    hello world hello a b c

We would first build a vocabulary of the words in the dataset:

    0: hello
    1: world
    2: a
    3: b
    4: c

Thus the dataset can be mapped to token indices:

    0 1 0 2 3 4

Now suppose that we have fixed the sequence length (`seq_len`) at 3.  We will use each window of `seq_len` consecutive tokens as the input to our RNN, and the token that immediately follows the window as the target.  Here are the possible input sequences and targets:

    0 1 0 -> 2
    1 0 2 -> 3
    0 2 3 -> 4
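
The toy example above can be reproduced in a few lines of plain Python. This is only a sketch of the idea (the variable names `vocab`, `tokens`, and `pairs` are illustrative and not part of the lab's starter code), but it shows the three steps in order: build the vocabulary, encode the text, and slide a window over the encoded tokens.

```python
text = "hello world hello a b c"
seq_len = 3

# Build the vocabulary in order of first appearance
vocab = {}
for word in text.split():
    if word not in vocab:
        vocab[word] = len(vocab)

# Encode the text as a list of token indices
tokens = [vocab[word] for word in text.split()]

# Slide a window of length seq_len over the tokens;
# the target is the token that follows each window
pairs = [(tokens[i:i + seq_len], tokens[i + seq_len])
         for i in range(len(tokens) - seq_len)]

for inputs, target in pairs:
    print(inputs, "->", target)
```

Note that a dataset of 6 tokens with `seq_len = 3` yields 6 - 3 = 3 input/target pairs, because the last window must leave one token free to serve as the target.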

You will build a subclass of `Dataset` to find all possible sequences for a given dataset, either at the word or character level.

The following code will download the text of Shakespeare's sonnets and read it in as one long string.

In [37]:
from torch.utils.data import Dataset

In [38]:
!wget --no-clobber "https://www.dropbox.com/scl/fi/7r68l64ijemidyb9lf80q/sonnets.txt?rlkey=udb47coatr2zbrk31hsfbr22y&dl=1" -O sonnets.txt
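# Read the whole file as one string; the context manager closes the file afterwards.
# Note: the file starts with a byte-order mark, which appears as '\ufeff' in the first token below.
with open("sonnets.txt", encoding="utf-8") as f:
    text = f.read()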


File ‘sonnets.txt’ already there; not retrieving.


In [39]:
text = text.lower()

In [40]:
print(text[:1000])

i

 from fairest creatures we desire increase,
 that thereby beauty's rose might never die,
 but as the riper should by time decease,
 his tender heir might bear his memory:
 but thou, contracted to thine own bright eyes,
 feed'st thy light's flame with self-substantial fuel,
 making a famine where abundance lies,
 thy self thy foe, to thy sweet self too cruel:
 thou that art now the world's fresh ornament,
 and only herald to the gaudy spring,
 within thine own bud buriest thy content,
 and tender churl mak'st waste in niggarding:
   pity the world, or else this glutton be,
   to eat the world's due, by the grave and thee.

 ii

 when forty winters shall besiege thy brow,
 and dig deep trenches in thy beauty's field,
 thy youth's proud livery so gazed on now,
 will be a tatter'd weed of small worth held:
 then being asked, where all thy beauty lies,
 where all the treasure of thy lusty days;
 to say, within thine own deep sunken eyes,
 were an all-eating shame, and thriftless praise.


### Exercises

1. Prepare a vocabulary of the unique words in the dataset.  (For simplicity's sake you can leave the punctuation in.)

In [41]:
worddict = {}
words = text.split()
for word in words:
    if word not in worddict:
        worddict[word] = len(worddict)
worddict

{'\ufeffi': 0,
 'from': 1,
 'fairest': 2,
 'creatures': 3,
 'we': 4,
 'desire': 5,
 'increase,': 6,
 'that': 7,
 'thereby': 8,
 "beauty's": 9,
 'rose': 10,
 'might': 11,
 'never': 12,
 'die,': 13,
 'but': 14,
 'as': 15,
 'the': 16,
 'riper': 17,
 'should': 18,
 'by': 19,
 'time': 20,
 'decease,': 21,
 'his': 22,
 'tender': 23,
 'heir': 24,
 'bear': 25,
 'memory:': 26,
 'thou,': 27,
 'contracted': 28,
 'to': 29,
 'thine': 30,
 'own': 31,
 'bright': 32,
 'eyes,': 33,
 "feed'st": 34,
 'thy': 35,
 "light's": 36,
 'flame': 37,
 'with': 38,
 'self-substantial': 39,
 'fuel,': 40,
 'making': 41,
 'a': 42,
 'famine': 43,
 'where': 44,
 'abundance': 45,
 'lies,': 46,
 'self': 47,
 'foe,': 48,
 'sweet': 49,
 'too': 50,
 'cruel:': 51,
 'thou': 52,
 'art': 53,
 'now': 54,
 "world's": 55,
 'fresh': 56,
 'ornament,': 57,
 'and': 58,
 'only': 59,
 'herald': 60,
 'gaudy': 61,
 'spring,': 62,
 'within': 63,
 'bud': 64,
 'buriest': 65,
 'content,': 66,
 'churl': 67,
 "mak'st": 68,
 'waste': 69,
 'in': 70

2. Now you will make a `Dataset` subclass that can return sequences of tokens, encoded as integers.

In [42]:
class WordDataset(Dataset):
  def __init__(self,text,seq_len=100):
    self.seq_len = seq_len
    # Build the vocabulary: map each unique word to an integer index,
    # in order of first appearance (same logic as exercise 1)
    self.worddict = {}
    words = text.split()
    for word in words:
        if word not in self.worddict:
            self.worddict[word] = len(self.worddict)

    # Build the inverse mapping (index -> word), used by decode()
    self.wordidxs = {idx: word for word, idx in self.worddict.items()}

    # Convert the text to a sequence of word indices
    self.word_indices = [self.worddict[word] for word in words]
    print("words", self.worddict)
    print("wordidxs", self.wordidxs)
    print("word_indices", self.word_indices)

  def __len__(self):
    # Each start position needs seq_len input tokens plus one target token
    return len(self.word_indices) - self.seq_len

  def __getitem__(self,i):
    # Input: seq_len token indices starting at i; target: the index of token i+seq_len
    return (
        self.word_indices[i:i+self.seq_len],
        self.word_indices[i+self.seq_len],
    )

  def decode(self,tokens):
    # Join with spaces, because text.split() discarded the whitespace between words
    return " ".join(self.wordidxs[token] for token in tokens)

3. Verify that your class can successfully encode and decode sequences.

In [None]:
# Example usage
dataset = WordDataset(text, seq_len=5)

print("Example encoded sequence:", dataset[0][0])  # List of word indices
print("Decoded:", dataset.decode(dataset[0][0]))  # Convert back to words


word_indices [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 11, 25, 22, 26, 14, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 35, 47, 35, 48, 29, 35, 49, 47, 50, 51, 52, 7, 53, 54, 16, 55, 56, 57, 58, 59, 60, 29, 16, 61, 62, 63, 30, 31, 64, 65, 35, 66, 58, 23, 67, 68, 69, 70, 71, 72, 16, 73, 74, 75, 76, 77, 78, 29, 79, 16, 55, 80, 19, 16, 81, 58, 82, 83, 84, 85, 86, 87, 88, 35, 89, 58, 90, 91, 92, 70, 35, 9, 93, 35, 94, 95, 96, 97, 98, 99, 100, 101, 102, 42, 103, 104, 105, 106, 107, 108, 109, 110, 111, 44, 112, 35, 113, 46, 44, 112, 16, 114, 105, 35, 115, 116, 29, 117, 63, 30, 31, 91, 118, 33, 119, 120, 121, 122, 58, 123, 124, 125, 126, 127, 128, 129, 35, 9, 130, 131, 52, 132, 133, 134, 135, 136, 105, 137, 87, 138, 139, 140, 58, 141, 139, 142, 143, 144, 22, 113, 19, 145, 146, 76, 119, 29, 102, 147, 148, 84, 52, 53, 149, 58, 150, 35, 151, 152, 84, 52, 153, 154, 155, 156, 157, 70, 35, 158, 58, 159, 16, 160, 52,
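
One point worth checking when you verify a word-level tokenizer: splitting on whitespace throws away the exact spacing and line breaks, so a round trip can only recover the text up to whitespace. The sketch below illustrates this with plain Python; `build_vocab` and the `sample` string are hypothetical helpers for this illustration, not part of the lab's starter code.

```python
def build_vocab(words):
    """Map each unique word to an index, in order of first appearance."""
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab))
    return vocab

# A short sample with a line break and irregular spacing
sample = "from fairest creatures\n we desire   increase,"
words = sample.split()

vocab = build_vocab(words)
inverse = {idx: word for word, idx in vocab.items()}

# Encode to indices, then decode back to a string
encoded = [vocab[word] for word in words]
decoded = " ".join(inverse[token] for token in encoded)

# The words survive the round trip, but the original whitespace does not
print(decoded.split() == words)
print(decoded == sample)
```

A character-level tokenizer does not have this limitation, since spaces and newlines are themselves tokens.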

4. Do the exercise again, but this time at the character level.

In [None]:
class CharacterDataset(Dataset):
  def __init__(self,text,seq_len=100):
    self.seq_len = seq_len
    # Build the vocabulary: map each unique character to an integer index,
    # in order of first appearance
    self.chardict = {}
    for char in text:
        if char not in self.chardict:
            self.chardict[char] = len(self.chardict)

    # Build the inverse mapping (index -> character), used by decode()
    self.charidxs = {idx: char for char, idx in self.chardict.items()}

    # Convert the text to a sequence of character indices
    self.char_indices = [self.chardict[char] for char in text]
    print("chars", self.chardict)
    print("charidxs", self.charidxs)
    print("char_indices", self.char_indices)
  def __len__(self):
    # Each start position needs seq_len input tokens plus one target token
    return len(self.char_indices) - self.seq_len

  def __getitem__(self,i):
    # Input: seq_len character indices starting at i; target: the index of character i+seq_len
    return (
        self.char_indices[i:i+self.seq_len],
        self.char_indices[i+self.seq_len],
    )

  def decode(self,tokens):
    # Characters include spaces and newlines, so join with no separator
    return "".join(self.charidxs[token] for token in tokens)

5. Compare the number of sequences for each tokenization method.

In [51]:
seq_len = 5

# Create datasets
word_dataset = WordDataset(text, seq_len)
char_dataset = CharacterDataset(text, seq_len)

# Compare sequence counts
print("Number of word-level sequences:", len(word_dataset))
print("Number of character-level sequences:", len(char_dataset))
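
To see why the two counts differ, it may help to run the same comparison on the toy text from the start of the lab. This is a sketch in plain Python; `num_sequences` is a hypothetical helper that mirrors the `len(tokens) - seq_len` rule used in `__len__`.

```python
def num_sequences(tokens, seq_len):
    """Number of (input window, next-token target) pairs in a token list."""
    return max(0, len(tokens) - seq_len)

sample = "hello world hello a b c"
toy_seq_len = 3

word_tokens = sample.split()   # one token per whitespace-separated word
char_tokens = list(sample)     # one token per character, spaces included

print("word-level:", num_sequences(word_tokens, toy_seq_len),
      "sequences, vocabulary size", len(set(word_tokens)))
print("char-level:", num_sequences(char_tokens, toy_seq_len),
      "sequences, vocabulary size", len(set(char_tokens)))
```

Character-level tokenization produces many more (but shorter-range) training sequences, while keeping the vocabulary small; word-level tokenization does the opposite.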


6. Optional: implement the byte pair encoding algorithm to make a Dataset class that uses word parts.
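
As a possible starting point for the optional exercise, here is a minimal sketch of byte pair encoding on a tiny corpus. It makes several simplifying assumptions that are not part of the lab: words are split on whitespace, each word ends with a `"</w>"` marker, ties between equally frequent pairs are broken by first occurrence, and the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`, `bpe_tokenize`) are illustrative. It is not a tuned or complete implementation.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each word starts as a tuple of characters plus an end-of-word marker
    word_freqs = dict(Counter(tuple(word) + ("</w>",) for word in text.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

def bpe_tokenize(word, merges):
    """Split one word into word parts by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)

merges = learn_bpe("low low low lower lowest", num_merges=3)
print(merges)
print(bpe_tokenize("lower", merges))
```

To turn this into a `Dataset`, one option is to tokenize every word in the text with `bpe_tokenize`, build a vocabulary over the resulting word parts, and then reuse the same windowing logic as in the word-level and character-level classes.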