# CS336 Assignments

| # | Topic                         | Description                                 |
|---|-------------------------------|---------------------------------------------|
| 1 | Basics                        | Train an LLM from scratch                   |
| 2 | Systems                       | Make it run fast!                           |
| 3 | Scaling                       | Make it performant at a FLOP budget         |
| 4 | Data                          | Prepare the right datasets                  |
| 5 | Alignment & Reasoning RL      | Align it to real-world use cases            |

# Assignment #1
- Implement all of the components (tokenizer, model, loss function, optimizer) necessary to train a standard Transformer language model
- Train a minimal language model

In [1]:
from datasets import load_dataset

tinystories = load_dataset("roneneldan/TinyStories")
tinystories

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

In [2]:
tinystories['train'][0:10]

{'text': ['One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.',
  'Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.\n\nOne day, Beep was driving in the park when he saw a big tree. The tree had many leav

# Tokenizer

In [3]:
sample_text = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "Python is a popular programming language.",
    "Machine learning enables computers to learn from data.",
    "Natural language processing helps computers understand text.",
    "Deep learning models require large amounts of data.",
    "Neural networks are inspired by the human brain.",
    "Data science combines statistics and computer science.",
    "Transformers have revolutionized language modeling.",
    "Open source software encourages collaboration."
]

Steps to create a tokenizer:
1. From all the words in our corpus, build a vocabulary
2. Create a mapping between vocab and integer IDs
3. Create a reverse mapping

In [4]:
set(sample_text[0].split())

{'The', 'brown', 'dog.', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the'}

In [5]:
words = ' '.join(sample_text).split()
print(len(words))
words = set(words)
print(len(words))

69
62


In [6]:
stoi = {s:i for i, s in enumerate(words)}
itos = {i:s for i, s in enumerate(words)}
stoi

{'combines': 0,
 'have': 1,
 'large': 2,
 'to': 3,
 'language': 4,
 'The': 5,
 'a': 6,
 'by': 7,
 'computer': 8,
 'learn': 9,
 'of': 10,
 'Deep': 11,
 'Machine': 12,
 'from': 13,
 'models': 14,
 'Neural': 15,
 'are': 16,
 'the': 17,
 'is': 18,
 'human': 19,
 'statistics': 20,
 'encourages': 21,
 'science.': 22,
 'lazy': 23,
 'science': 24,
 'data.': 25,
 'processing': 26,
 'Data': 27,
 'require': 28,
 'jumps': 29,
 'networks': 30,
 'Transformers': 31,
 'understand': 32,
 'learning': 33,
 'Natural': 34,
 'brown': 35,
 'transforming': 36,
 'quick': 37,
 'Artificial': 38,
 'source': 39,
 'helps': 40,
 'world.': 41,
 'amounts': 42,
 'dog.': 43,
 'enables': 44,
 'popular': 45,
 'Python': 46,
 'brain.': 47,
 'revolutionized': 48,
 'programming': 49,
 'collaboration.': 50,
 'computers': 51,
 'Open': 52,
 'modeling.': 53,
 'software': 54,
 'language.': 55,
 'fox': 56,
 'and': 57,
 'text.': 58,
 'intelligence': 59,
 'inspired': 60,
 'over': 61}

In [7]:
[stoi[x] for x in sample_text[0].split()]

[5, 37, 35, 56, 29, 61, 17, 23, 43]

In [8]:
class Tokenizer():
    def __init__(self):
        pass

    def encode(self, s: str):
        # The vocab is rebuilt from the global sample_text on every call.
        self.vocab = set(' '.join(sample_text).split())
        self.stoi = {w:i for i, w in enumerate(self.vocab)}
        self.itos = {i:w for i, w in enumerate(self.vocab)}
        # Encode the string that was passed in, using this instance's mapping.
        encoded_str = [self.stoi[x] for x in s.split()]
        return encoded_str

    def decode(self, indices: list[int]):
        pass

In [9]:
tok = Tokenizer()
text_encoded = tok.encode(sample_text[0])
text_encoded

[5, 37, 35, 56, 29, 61, 17, 23, 43]

Now, let's implement a decoder that takes in a list of integer IDs and returns the corresponding input text.

In [10]:
class Tokenizer():
    def __init__(self):
        pass

    def encode(self, s: str):
        # The vocab is rebuilt from the global sample_text on every call.
        self.vocab = set(' '.join(sample_text).split())
        self.stoi = {w:i for i, w in enumerate(self.vocab)}
        self.itos = {i:w for i, w in enumerate(self.vocab)}
        encoded_str = [self.stoi[x] for x in s.split()]
        return encoded_str

    def decode(self, indices: list[int]):
        # Relies on self.itos, which only exists after encode() has run.
        decoded_str = [self.itos[i] for i in indices]
        decoded_str = ' '.join(decoded_str)
        return decoded_str


In [11]:
tok = Tokenizer()
text_encoded = tok.encode(sample_text[0])

print(sample_text[0])
print(text_encoded)

The quick brown fox jumps over the lazy dog.
[5, 37, 35, 56, 29, 61, 17, 23, 43]


In [12]:
tok.decode(text_encoded)

'The quick brown fox jumps over the lazy dog.'

While this works, the vocab is rebuilt at runtime on every call to `encode`, which is not desirable. The vocab should be created ahead of time so that any token is encoded and decoded consistently. That means we should create the vocab during initialization!

In [13]:
class TokenizerV2():
    def __init__(self, text_corpus: list[str]):
        self.text_corpus = text_corpus
        # Build the vocab once, from the corpus passed in (not the global).
        self.vocab = set(' '.join(text_corpus).split())
        self.vocab_size = len(self.vocab)
        self.stoi = {s:i for i, s in enumerate(self.vocab)}
        self.itos = {i:s for i, s in enumerate(self.vocab)}

    def encode(self, text: str):
        encoded_str = [self.stoi[x] for x in text.split()]
        return encoded_str

    def decode(self, indices: list[int]):
        decoded_str = [self.itos[i] for i in indices]
        decoded_str = ' '.join(decoded_str)
        return decoded_str


In [14]:
tok = TokenizerV2(sample_text)
tok.encode(sample_text[0])

[5, 37, 35, 56, 29, 61, 17, 23, 43]

In [15]:
tok.decode([2, 9, 34, 13, 49, 0, 57, 5, 10])

'large learn Natural from programming combines and The of'

In [16]:
indices = [tok.encode(s) for s in sample_text]
indices

[[5, 37, 35, 56, 29, 61, 17, 23, 43],
 [38, 59, 18, 36, 17, 41],
 [46, 18, 6, 45, 49, 55],
 [12, 33, 44, 51, 3, 9, 13, 25],
 [34, 4, 26, 40, 51, 32, 58],
 [11, 33, 14, 28, 2, 42, 10, 25],
 [15, 30, 16, 60, 7, 17, 19, 47],
 [27, 24, 0, 20, 57, 8, 22],
 [31, 1, 48, 4, 53],
 [52, 39, 54, 21, 50]]

In [17]:
[tok.decode(i) for i in indices]

['The quick brown fox jumps over the lazy dog.',
 'Artificial intelligence is transforming the world.',
 'Python is a popular programming language.',
 'Machine learning enables computers to learn from data.',
 'Natural language processing helps computers understand text.',
 'Deep learning models require large amounts of data.',
 'Neural networks are inspired by the human brain.',
 'Data science combines statistics and computer science.',
 'Transformers have revolutionized language modeling.',
 'Open source software encourages collaboration.']

In [18]:
sample_text

['The quick brown fox jumps over the lazy dog.',
 'Artificial intelligence is transforming the world.',
 'Python is a popular programming language.',
 'Machine learning enables computers to learn from data.',
 'Natural language processing helps computers understand text.',
 'Deep learning models require large amounts of data.',
 'Neural networks are inspired by the human brain.',
 'Data science combines statistics and computer science.',
 'Transformers have revolutionized language modeling.',
 'Open source software encourages collaboration.']

We are now able to convert text to integers and back. That's good. Let's try a new sentence with new words.

In [19]:
tok.encode("Satya Nadella leads Microsoft")

KeyError: 'Satya'

Because the word `Satya` is not in the vocab, the lookup raises a `KeyError`. This is common in real-world use, where many previously unseen tokens appear.

We can add an unknown token during the vocab creation or use the more advanced BPE tokenizer which handles these situations quite well.

In [20]:
a = set([1,2,4])
a.add(15)
a

{1, 2, 4, 15}

In [21]:
class TokenizerV3():
    def __init__(self, text_corpus: list[str]):
        self.text_corpus = text_corpus
        # Sort the unique words so token IDs are reproducible across runs,
        # then reserve the last ID for unknown words.
        self.vocab = sorted(set(' '.join(text_corpus).split()))
        self.vocab.append("<UNK>")
        self.vocab_size = len(self.vocab)
        self.stoi = {s:i for i, s in enumerate(self.vocab)}
        self.itos = {i:s for i, s in enumerate(self.vocab)}

    def encode(self, text: str):
        # Words missing from the vocab fall back to the <UNK> ID.
        unk_id = self.stoi["<UNK>"]
        encoded_str = [self.stoi.get(x, unk_id) for x in text.split()]
        return encoded_str

    def decode(self, indices: list[int]):
        decoded_str = [self.itos.get(i, "<UNK>") for i in indices]
        decoded_str = ' '.join(decoded_str)
        return decoded_str


In [22]:
tok = TokenizerV3(sample_text)
tok.encode(sample_text[0])

[8, 48, 15, 25, 33, 44, 57, 37, 22]

In [23]:
print(tok.encode("The bird understand the dog"))
print(tok.decode(tok.encode("The bird understand the dog")))

[8, 62, 60, 57, 62]
The <UNK> understand the <UNK>


This is one way of handling unknown or special tokens. Next, we will build a BPE tokenizer, which is a better alternative because it handles out-of-vocabulary words, and language nuances such as singular and plural forms, without collapsing them into `<UNK>`.

A BPE tokenizer breaks words down into subwords that are common in the language, which has been shown to be more effective for language modeling tasks.

# BPE Tokenizer

## Pseudo Algorithm for BPE (Byte Pair Encoding) Tokenizer
(from GitHub Copilot)

1. **Initialize Vocabulary**
    - Start with a vocabulary of all unique characters in the corpus.

2. **Tokenize Corpus**
    - Split all words in the corpus into a list of characters (with a special end-of-word symbol, e.g., "l o w </w>").

3. **Count Pair Frequencies**
    - For all tokenized words, count the frequency of each adjacent character pair.

4. **Merge Most Frequent Pair**
    - Find the most frequent pair of characters.
    - Merge this pair into a new token (e.g., "l o" → "lo").

5. **Update Corpus**
    - Replace all occurrences of the merged pair in the corpus with the new token.

6. **Repeat**
    - Repeat steps 3–5 for a predefined number of merges or until no pairs remain.

7. **Build Final Vocabulary**
    - The final vocabulary consists of the initial characters plus all tokens created during the merge steps.

8. **Tokenize New Text**
    - For new input, iteratively apply the learned merges to split words into known tokens.

---

**Note:**  
- Add an `<UNK>` token for unknown words/subwords.
- Store the merge operations for encoding/decoding.
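
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized or reference implementation: the function names (`train_bpe`, `encode_word`, and the helpers), the `END` marker constant, and the toy corpus are assumptions made for the example, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

END = "</w>"  # end-of-word marker, as in the "l o w </w>" example


def apply_merge(symbols, pair):
    """Return `symbols` with every adjacent occurrence of `pair` fused."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def count_pairs(word_freqs):
    """Step 3: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def train_bpe(corpus, num_merges):
    """Steps 1-6: learn up to `num_merges` merges from a list of sentences."""
    words = Counter(" ".join(corpus).split())
    # Step 2: each word becomes a tuple of characters plus the end marker.
    word_freqs = {tuple(w) + (END,): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # step 4: most frequent pair
        # Step 5: rewrite every word with the merged symbol.
        updated = {}
        for symbols, freq in word_freqs.items():
            key = tuple(apply_merge(list(symbols), best))
            updated[key] = updated.get(key, 0) + freq
        word_freqs = updated
        merges.append(best)
    return merges


def encode_word(word, merges):
    """Step 8: split a new word by replaying the learned merges in order."""
    symbols = list(word) + [END]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


toy_corpus = [
    "low low low low low",
    "lower lower",
    "newest newest newest newest newest newest",
    "widest widest widest",
]
merges = train_bpe(toy_corpus, num_merges=5)
print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
print(encode_word("lowest", merges))
# ['low', 'est</w>']
```

Note that `lowest` never appears in the toy corpus, yet it is split into two subwords that were learned from other words, which is the out-of-vocabulary behaviour that motivates BPE.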

## initialize vocab & tokenize corpus

In [24]:
class BPETokenizer():
    def __init__(self, text_corpus: list[str]):
        # Start from every unique character in the corpus that was passed in.
        self.vocab = set(' '.join(text_corpus))
        self.vocab.add("<UNK>")
        self.vocab_size = len(self.vocab)

    def encode(self, text: str):
        pass

    def decode(self, indices: list[int]):
        pass

btok = BPETokenizer(sample_text)

In [25]:
list(btok.vocab)[:5], btok.vocab_size

(['f', 'k', 'w', 'j', 'i'], 36)

## count pair frequencies

In [26]:
text = ' '.join(sample_text)
text

'The quick brown fox jumps over the lazy dog. Artificial intelligence is transforming the world. Python is a popular programming language. Machine learning enables computers to learn from data. Natural language processing helps computers understand text. Deep learning models require large amounts of data. Neural networks are inspired by the human brain. Data science combines statistics and computer science. Transformers have revolutionized language modeling. Open source software encourages collaboration.'

In [27]:
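pair_freq = {}
# Slide a two-character window over the text. Stop one short of the end
# so the final character is not counted as a "pair" on its own.
for t in range(len(text) - 1):
    chars = text[t:t+2]
    if chars in pair_freq:
        pair_freq[chars] += 1
    else:
        pair_freq[chars] = 1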

sorted_pair_freq = dict(sorted(pair_freq.items(), key=lambda item: item[1], reverse=True))
sorted_pair_freq

{'e ': 15,
 's ': 14,
 'in': 11,
 '. ': 9,
 'ng': 9,
 ' l': 8,
 'ra': 8,
 'an': 8,
 'la': 7,
 'ar': 7,
 'er': 6,
 ' t': 6,
 'ge': 6,
 'en': 6,
 'co': 6,
 'at': 6,
 're': 6,
 'he': 5,
 'n ': 5,
 'ti': 5,
 'te': 5,
 'ce': 5,
 'or': 5,
 'g ': 5,
 ' c': 5,
 'om': 5,
 'ta': 5,
 ' s': 5,
 'ro': 4,
 'mp': 4,
 'th': 4,
 ' i': 4,
 'el': 4,
 'nc': 4,
 ' a': 4,
 'pu': 4,
 'ag': 4,
 'le': 4,
 'es': 4,
 'ut': 4,
 'rs': 4,
 'ur': 4,
 'd ': 4,
 'ic': 3,
 ' b': 3,
 'fo': 3,
 'r ': 3,
 ' d': 3,
 'ci': 3,
 'al': 3,
 'l ': 3,
 'is': 3,
 'ns': 3,
 'on': 3,
 ' p': 3,
 'gu': 3,
 'ua': 3,
 'ne': 3,
 'ea': 3,
 'rn': 3,
 'ni': 3,
 ' h': 3,
 'nd': 3,
 'de': 3,
 'st': 3,
 'mo': 3,
 'ou': 3,
 'qu': 2,
 'ui': 2,
 'br': 2,
 ' f': 2,
 'um': 2,
 'ps': 2,
 ' o': 2,
 've': 2,
 'y ': 2,
 'og': 2,
 'g.': 2,
 'nt': 2,
 'll': 2,
 'li': 2,
 'sf': 2,
 'rm': 2,
 'mi': 2,
 'wo': 2,
 'a ': 2,
 'pr': 2,
 'am': 2,
 'e.': 2,
 ' e': 2,
 'ab': 2,
 'da': 2,
 'a.': 2,
 ' N': 2,
 'un': 2,
 ' D': 2,
 ' m': 2,
 'od': 2,
 ' r': 2,
 'ir': 

Let's drop the pairs that contain a space: they span a word boundary, so they are really a single character plus whitespace.

In [28]:
text = ' '.join(sample_text)
text


'The quick brown fox jumps over the lazy dog. Artificial intelligence is transforming the world. Python is a popular programming language. Machine learning enables computers to learn from data. Natural language processing helps computers understand text. Deep learning models require large amounts of data. Neural networks are inspired by the human brain. Data science combines statistics and computer science. Transformers have revolutionized language modeling. Open source software encourages collaboration.'

In [29]:
pair_freq = {}
for t in range(len(text) - 1):
    chars = text[t:t+2]
    # Skip pairs that contain a space, since they straddle a word boundary.
    if " " in chars:
        continue
    if chars in pair_freq:
        pair_freq[chars] += 1
    else:
        pair_freq[chars] = 1

sorted_pair_freq = dict(sorted(pair_freq.items(), key=lambda item: item[1], reverse=True))
sorted_pair_freq

{'in': 11,
 'ng': 9,
 'ra': 8,
 'an': 8,
 'la': 7,
 'ar': 7,
 'er': 6,
 'ge': 6,
 'en': 6,
 'co': 6,
 'at': 6,
 're': 6,
 'he': 5,
 'ti': 5,
 'te': 5,
 'ce': 5,
 'or': 5,
 'om': 5,
 'ta': 5,
 'ro': 4,
 'mp': 4,
 'th': 4,
 'el': 4,
 'nc': 4,
 'pu': 4,
 'ag': 4,
 'le': 4,
 'es': 4,
 'ut': 4,
 'rs': 4,
 'ur': 4,
 'ic': 3,
 'fo': 3,
 'ci': 3,
 'al': 3,
 'is': 3,
 'ns': 3,
 'on': 3,
 'gu': 3,
 'ua': 3,
 'ne': 3,
 'ea': 3,
 'rn': 3,
 'ni': 3,
 'nd': 3,
 'de': 3,
 'st': 3,
 'mo': 3,
 'ou': 3,
 'qu': 2,
 'ui': 2,
 'br': 2,
 'um': 2,
 'ps': 2,
 've': 2,
 'og': 2,
 'g.': 2,
 'nt': 2,
 'll': 2,
 'li': 2,
 'sf': 2,
 'rm': 2,
 'mi': 2,
 'wo': 2,
 'pr': 2,
 'am': 2,
 'e.': 2,
 'ab': 2,
 'da': 2,
 'a.': 2,
 'un': 2,
 'od': 2,
 'ir': 2,
 'of': 2,
 'tw': 2,
 'ed': 2,
 'n.': 2,
 'sc': 2,
 'ie': 2,
 'ol': 2,
 'io': 2,
 'so': 2,
 'Th': 1,
 'ck': 1,
 'ow': 1,
 'wn': 1,
 'ox': 1,
 'ju': 1,
 'ov': 1,
 'az': 1,
 'zy': 1,
 'do': 1,
 'Ar': 1,
 'rt': 1,
 'if': 1,
 'fi': 1,
 'ia': 1,
 'ig': 1,
 'tr': 1,
 'rl': 1,

## merge most frequent pair

Let's use a frequency cutoff of 5: any pair that occurs at least 5 times is added to the vocab.

In [30]:
text = ' '.join(sample_text)
print(text)

chars = list(set(' '.join(sample_text)))
print(chars)

The quick brown fox jumps over the lazy dog. Artificial intelligence is transforming the world. Python is a popular programming language. Machine learning enables computers to learn from data. Natural language processing helps computers understand text. Deep learning models require large amounts of data. Neural networks are inspired by the human brain. Data science combines statistics and computer science. Transformers have revolutionized language modeling. Open source software encourages collaboration.
['f', 'k', 'w', 'j', 'i', 'a', 'y', 'N', 'T', 'p', 'o', 'x', 's', 'n', 'D', 'v', 't', 'z', 'l', '.', 'A', ' ', 'r', 'O', 'c', 'M', 'u', 'P', 'h', 'b', 'g', 'd', 'q', 'e', 'm']


In [31]:
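vocab = list(chars)  # copy, so the character list itself is left untouched
pair_freq = {}
for t in range(len(text) - 1):
    char_group = text[t:t+2]
    # Skip pairs that straddle a word boundary.
    if " " in char_group:
        continue
    if char_group in pair_freq:
        pair_freq[char_group] += 1
        # Add the pair to the vocab once, the moment it reaches the cutoff.
        if pair_freq[char_group] == 5:
            vocab.append(char_group)
    else:
        pair_freq[char_group] = 1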

sorted_pair_freq = dict(sorted(pair_freq.items(), key=lambda item: item[1], reverse=True))
sorted_pair_freq

{'in': 11,
 'ng': 9,
 'ra': 8,
 'an': 8,
 'la': 7,
 'ar': 7,
 'er': 6,
 'ge': 6,
 'en': 6,
 'co': 6,
 'at': 6,
 're': 6,
 'he': 5,
 'ti': 5,
 'te': 5,
 'ce': 5,
 'or': 5,
 'om': 5,
 'ta': 5,
 'ro': 4,
 'mp': 4,
 'th': 4,
 'el': 4,
 'nc': 4,
 'pu': 4,
 'ag': 4,
 'le': 4,
 'es': 4,
 'ut': 4,
 'rs': 4,
 'ur': 4,
 'ic': 3,
 'fo': 3,
 'ci': 3,
 'al': 3,
 'is': 3,
 'ns': 3,
 'on': 3,
 'gu': 3,
 'ua': 3,
 'ne': 3,
 'ea': 3,
 'rn': 3,
 'ni': 3,
 'nd': 3,
 'de': 3,
 'st': 3,
 'mo': 3,
 'ou': 3,
 'qu': 2,
 'ui': 2,
 'br': 2,
 'um': 2,
 'ps': 2,
 've': 2,
 'og': 2,
 'g.': 2,
 'nt': 2,
 'll': 2,
 'li': 2,
 'sf': 2,
 'rm': 2,
 'mi': 2,
 'wo': 2,
 'pr': 2,
 'am': 2,
 'e.': 2,
 'ab': 2,
 'da': 2,
 'a.': 2,
 'un': 2,
 'od': 2,
 'ir': 2,
 'of': 2,
 'tw': 2,
 'ed': 2,
 'n.': 2,
 'sc': 2,
 'ie': 2,
 'ol': 2,
 'io': 2,
 'so': 2,
 'Th': 1,
 'ck': 1,
 'ow': 1,
 'wn': 1,
 'ox': 1,
 'ju': 1,
 'ov': 1,
 'az': 1,
 'zy': 1,
 'do': 1,
 'Ar': 1,
 'rt': 1,
 'if': 1,
 'fi': 1,
 'ia': 1,
 'ig': 1,
 'tr': 1,
 'rl': 1,

In [33]:
text

'The quick brown fox jumps over the lazy dog. Artificial intelligence is transforming the world. Python is a popular programming language. Machine learning enables computers to learn from data. Natural language processing helps computers understand text. Deep learning models require large amounts of data. Neural networks are inspired by the human brain. Data science combines statistics and computer science. Transformers have revolutionized language modeling. Open source software encourages collaboration.'

In [34]:
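# Extend the same counts with three-character groups.
for t in range(len(text) - 2):
    char_group = text[t:t+3]
    # Skip groups that straddle a word boundary.
    if " " in char_group:
        continue
    if char_group in pair_freq:
        pair_freq[char_group] += 1
        # Add the group to the vocab once, the moment it reaches the cutoff.
        if pair_freq[char_group] == 5:
            vocab.append(char_group)
    else:
        pair_freq[char_group] = 1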

sorted_pair_freq = dict(sorted(pair_freq.items(), key=lambda item: item[1], reverse=True))
sorted_pair_freq

{'in': 11,
 'ng': 9,
 'ra': 8,
 'an': 8,
 'la': 7,
 'ar': 7,
 'er': 6,
 'ge': 6,
 'en': 6,
 'co': 6,
 'at': 6,
 're': 6,
 'ing': 6,
 'he': 5,
 'ti': 5,
 'te': 5,
 'ce': 5,
 'or': 5,
 'om': 5,
 'ta': 5,
 'ro': 4,
 'mp': 4,
 'th': 4,
 'el': 4,
 'nc': 4,
 'pu': 4,
 'ag': 4,
 'le': 4,
 'es': 4,
 'ut': 4,
 'rs': 4,
 'ur': 4,
 'enc': 4,
 'age': 4,
 'com': 4,
 'ers': 4,
 'ic': 3,
 'fo': 3,
 'ci': 3,
 'al': 3,
 'is': 3,
 'ns': 3,
 'on': 3,
 'gu': 3,
 'ua': 3,
 'ne': 3,
 'ea': 3,
 'rn': 3,
 'ni': 3,
 'nd': 3,
 'de': 3,
 'st': 3,
 'mo': 3,
 'ou': 3,
 'the': 3,
 'nce': 3,
 'lan': 3,
 'ang': 3,
 'ngu': 3,
 'gua': 3,
 'uag': 3,
 'lea': 3,
 'ear': 3,
 'arn': 3,
 'omp': 3,
 'mpu': 3,
 'put': 3,
 'ute': 3,
 'ter': 3,
 'ata': 3,
 'ura': 3,
 'qu': 2,
 'ui': 2,
 'br': 2,
 'um': 2,
 'ps': 2,
 've': 2,
 'og': 2,
 'g.': 2,
 'nt': 2,
 'll': 2,
 'li': 2,
 'sf': 2,
 'rm': 2,
 'mi': 2,
 'wo': 2,
 'pr': 2,
 'am': 2,
 'e.': 2,
 'ab': 2,
 'da': 2,
 'a.': 2,
 'un': 2,
 'od': 2,
 'ir': 2,
 'of': 2,
 'tw': 2,
 'ed': 

Now that the frequent bigrams and trigrams are in our vocab, we would keep repeating this with longer character groups until no new group reaches the cutoff frequency.

In [36]:
# TODO!
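
One possible way to fill in this step is sketched below. Instead of counting fixed-length character groups, it follows the merge idea from the pseudo algorithm: repeatedly fuse the most frequent adjacent pair and stop once the best pair falls below the cutoff. The function name `learn_merges`, the `min_freq` parameter, and the demo string are assumptions made for illustration; pairs containing a space are skipped, as in the cells above.

```python
from collections import Counter


def learn_merges(text, min_freq):
    """Fuse the most frequent adjacent pair until none occurs `min_freq` times."""
    tokens = list(text)  # start from single characters
    merges = []
    while True:
        # Count adjacent token pairs, ignoring any pair that touches a space.
        pairs = Counter(
            (a, b) for a, b in zip(tokens, tokens[1:]) if " " not in a + b
        )
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        if pairs[best] < min_freq:
            break
        # Rewrite the token list with the chosen pair fused into one token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges.append(best)
    return merges, tokens


demo = "low lower lowest slow slowly"
merges, tokens = learn_merges(demo, min_freq=3)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(tokens)
```

Calling `learn_merges(text, min_freq=5)` on the notebook's `text` should begin by merging `i` and `n`, since `in` is the most frequent pair in the table above.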

In practice, we use the efficient BPE tokenizer implementation from OpenAI's tiktoken library!

# Tiktoken BPETokenizer

In [37]:
import tiktoken

ttok = tiktoken.get_encoding("gpt2")
ttok

<Encoding 'gpt2'>

In [38]:
sentence = "Hey! I am doing well.. I hope you're well too! "
print(f"original  : {sentence}")

sentence_encoded = ttok.encode(sentence)
print(f"encoded   : {sentence_encoded}")

sentence_decoded = ttok.decode(sentence_encoded)
print(f"decoded   : {sentence_decoded}")


original  : Hey! I am doing well.. I hope you're well too! 
encoded   : [10814, 0, 314, 716, 1804, 880, 492, 314, 2911, 345, 821, 880, 1165, 0, 220]
decoded   : Hey! I am doing well.. I hope you're well too! 


In [74]:
ttok = tiktoken.get_encoding("gpt2")

ttok.encode(tinystories['train'][0]['text'])[0:10]

[3198, 1110, 11, 257, 1310, 2576, 3706, 20037, 1043, 257]

Let's now convert an entire batch of text into integer IDs.

In [76]:
ttok.encode_batch(tinystories['train'][0:8]['text'])[0][0:10]

[3198, 1110, 11, 257, 1310, 2576, 3706, 20037, 1043, 257]

In [57]:
[len(t) for t in ttok.encode_batch(tinystories['train'][0:8]['text'])]

[162, 177, 212, 193, 159, 168, 171, 181]

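We can convert each encoded sequence into a torch tensor. Because the sequences have different lengths, this gives a list of 1-D tensors rather than one 2-D tensor.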

In [72]:
import torch
ids = ttok.encode_batch(tinystories['train'][0:8]['text'])

In [73]:
[torch.tensor(x) for x in ids]

[tensor([ 3198,  1110,    11,   257,  1310,  2576,  3706, 20037,  1043,   257,
         17598,   287,   607,  2119,    13,  1375,  2993,   340,   373,  2408,
           284,   711,   351,   340,   780,   340,   373,  7786,    13, 20037,
          2227,   284,  2648,   262, 17598,   351,   607,  1995,    11,   523,
           673,   714, 34249,   257,  4936,   319,   607, 10147,    13,   198,
           198,    43,   813,  1816,   284,   607,  1995,   290,   531,    11,
           366, 29252,    11,   314,  1043,   428, 17598,    13,  1680,   345,
          2648,   340,   351,   502,   290, 34249,   616, 10147,  1701,  2332,
          1995, 13541,   290,   531,    11,   366,  5297,    11, 20037,    11,
           356,   460,  2648,   262, 17598,   290,  4259,   534, 10147,   526,
           198,   198, 41631,    11,   484,  4888,   262, 17598,   290,   384,
         19103,   262,  4936,   319, 20037,   338, 10147,    13,   632,   373,
           407,  2408,   329,   606,   780,   484,  

To combine them into a single 2-D tensor, we pad the shorter sequences using the `torch.nn.utils.rnn.pad_sequence` function.

In [71]:
tids = torch.nn.utils.rnn.pad_sequence([torch.tensor(x) for x in ids], batch_first=True)
tids

tensor([[3198, 1110,   11,  ...,    0,    0,    0],
        [7454, 2402,  257,  ...,    0,    0,    0],
        [3198, 1110,   11,  ...,  922, 2460,   13],
        ...,
        [7454, 2402,  257,  ...,    0,    0,    0],
        [7454, 2402,  257,  ...,    0,    0,    0],
        [7454, 2402,  257,  ...,    0,    0,    0]])
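
One caveat: `pad_sequence` pads with 0 by default, and 0 is also a real GPT-2 token ID (it encodes `!` in the `Hey!` example above), so padding and real tokens cannot be told apart from the tensor alone. A common remedy is to keep a mask alongside the padded batch. The sketch below uses plain Python lists to show the idea; `pad_batch` and its `pad_id` parameter are hypothetical names for this example, not part of PyTorch.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length ID lists and return a matching mask.

    The mask holds 1 for real tokens and 0 for padding positions.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


batch = [[10814, 0, 314], [3198, 1110, 11, 257, 1310]]
padded, mask = pad_batch(batch)
print(padded)  # [[10814, 0, 314, 0, 0], [3198, 1110, 11, 257, 1310]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

In the first row, the 0 at position 1 is a real token while the trailing zeros are padding; only the mask distinguishes them. Both lists can be passed to `torch.tensor` once every row has the same length.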