# CS336 Assignments

| # | Topic                         | Description                                 |
|---|-------------------------------|---------------------------------------------|
| 1 | Basics                        | Train an LLM from scratch                   |
| 2 | Systems                       | Make it run fast!                           |
| 3 | Scaling                       | Make it performant at a FLOP budget         |
| 4 | Data                          | Prepare the right datasets                  |
| 5 | Alignment & Reasoning RL      | Align it to real-world use cases            |

# Assignment #1
- Implement all of the components (tokenizer, model, loss function, optimizer) necessary to train a standard Transformer language model
- Train a minimal language model

In [3]:
from datasets import load_dataset

tinystories = load_dataset("roneneldan/TinyStories")
tinystories

Generating train split: 100%|██████████| 2119719/2119719 [00:01<00:00, 1615856.43 examples/s]
Generating validation split: 100%|██████████| 21990/21990 [00:00<00:00, 1467832.85 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

In [6]:
tinystories['train'][0:10]

{'text': ['One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.',
  'Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.\n\nOne day, Beep was driving in the park when he saw a big tree. The tree had many leav

# Tokenizer

In [1]:
sample_text = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "Python is a popular programming language.",
    "Machine learning enables computers to learn from data.",
    "Natural language processing helps computers understand text.",
    "Deep learning models require large amounts of data.",
    "Neural networks are inspired by the human brain.",
    "Data science combines statistics and computer science.",
    "Transformers have revolutionized language modeling.",
    "Open source software encourages collaboration."
]

Steps to create a word-level tokenizer:
1. Build a vocabulary from the unique words in our corpus
2. Create a mapping from each vocabulary entry to an integer ID
3. Create the reverse mapping, from integer IDs back to vocabulary entries

In [12]:
set(sample_text[0].split())

{'The', 'brown', 'dog.', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the'}

In [17]:
words = ' '.join(sample_text).split()
print(len(words))
words = set(words)
print(len(words))

69
62


In [20]:
stoi = {s:i for i, s in enumerate(words)}
itos = {i:s for i, s in enumerate(words)}
stoi

{'over': 0,
 'Neural': 1,
 'The': 2,
 'world.': 3,
 'amounts': 4,
 'lazy': 5,
 'computer': 6,
 'encourages': 7,
 'understand': 8,
 'quick': 9,
 'dog.': 10,
 'helps': 11,
 'Open': 12,
 'fox': 13,
 'science.': 14,
 'networks': 15,
 'software': 16,
 'text.': 17,
 'science': 18,
 'to': 19,
 'Machine': 20,
 'revolutionized': 21,
 'Transformers': 22,
 'transforming': 23,
 'from': 24,
 'source': 25,
 'programming': 26,
 'brain.': 27,
 'and': 28,
 'data.': 29,
 'modeling.': 30,
 'Natural': 31,
 'enables': 32,
 'models': 33,
 'brown': 34,
 'Data': 35,
 'Python': 36,
 'collaboration.': 37,
 'statistics': 38,
 'a': 39,
 'popular': 40,
 'language': 41,
 'learn': 42,
 'by': 43,
 'Artificial': 44,
 'computers': 45,
 'inspired': 46,
 'have': 47,
 'require': 48,
 'jumps': 49,
 'large': 50,
 'intelligence': 51,
 'is': 52,
 'Deep': 53,
 'of': 54,
 'combines': 55,
 'language.': 56,
 'the': 57,
 'learning': 58,
 'are': 59,
 'processing': 60,
 'human': 61}

In [23]:
[stoi[x] for x in sample_text[0].split()]

[2, 9, 34, 13, 49, 0, 57, 5, 10]

In [26]:
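class Tokenizer():
    def __init__(self):
        # Build the vocabulary and both mappings once, reusing the `words` set
        # from above so the IDs match the `stoi` dictionary shown earlier
        self.vocab = words
        self.stoi = {s: i for i, s in enumerate(self.vocab)}
        self.itos = {i: s for i, s in enumerate(self.vocab)}

    def encode(self, s: str) -> list[int]:
        # Split the input on whitespace and look up each word's integer ID
        return [self.stoi[x] for x in s.split()]

    def decode(self, indices: list[int]) -> str:
        pass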

In [28]:
tok = Tokenizer()
text_encoded = tok.encode(sample_text[0])
text_encoded

[2, 9, 34, 13, 49, 0, 57, 5, 10]

Now, let's implement a decoder that takes in a list of integer IDs and returns the corresponding input text.

In [36]:
class Tokenizer():
    def __init__(self):
        # Build the vocabulary and both mappings once, reusing the `words` set
        # from above so the IDs match the `stoi` dictionary shown earlier
        self.vocab = words
        self.stoi = {s: i for i, s in enumerate(self.vocab)}
        self.itos = {i: s for i, s in enumerate(self.vocab)}

    def encode(self, s: str) -> list[int]:
        # Split the input on whitespace and look up each word's integer ID
        return [self.stoi[x] for x in s.split()]

    def decode(self, indices: list[int]) -> str:
        # Map each ID back to its word and rejoin the words with single spaces
        return ' '.join(self.itos[i] for i in indices)


In [37]:
tok = Tokenizer()
text_encoded = tok.encode(sample_text[0])

print(sample_text[0])
print(text_encoded)

The quick brown fox jumps over the lazy dog.
[2, 9, 34, 13, 49, 0, 57, 5, 10]


In [38]:
tok.decode(text_encoded)

'The quick brown fox jumps over the lazy dog.'

Now that we have built a simple word-level tokenizer, let's build the BPETokenizer, which splits text into subword units and so copes better with words that are missing from the vocabulary.
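One limitation of the word-level approach is easy to demonstrate: any word that is not in the vocabulary, including a known word with different punctuation attached, cannot be encoded at all. The sketch below rebuilds a tiny word-level tokenizer so that it runs on its own. The names `demo_corpus`, `demo_stoi`, and `encode_words` are hypothetical and exist only for this illustration.

```python
# Two sentences from the sample text, repeated here so the example is self-contained
demo_corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a popular programming language.",
]

# sorted() makes the ID assignment reproducible across runs
demo_vocab = sorted(set(" ".join(demo_corpus).split()))
demo_stoi = {s: i for i, s in enumerate(demo_vocab)}


def encode_words(text: str) -> list[int]:
    """Encode whitespace-separated words; raises KeyError on an unknown word."""
    return [demo_stoi[w] for w in text.split()]


# Every word here appears in the vocabulary, so encoding succeeds
print(encode_words("The quick brown fox"))

# "cat" was never seen, and "dog" differs from the stored token "dog."
for text in ["The quick brown cat", "the lazy dog"]:
    try:
        print(encode_words(text))
    except KeyError as err:
        print(f"cannot encode {text!r}: unknown token {err}")
```

A subword tokenizer avoids this failure mode by falling back to smaller pieces, ultimately single characters or bytes, when a whole word is not in its vocabulary.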

# BPE Tokenizer

TODO
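
As a starting point, here is a minimal sketch of character-level byte-pair encoding, written as an illustration rather than as the assignment's reference implementation. It assumes whitespace pre-tokenization, single characters (not bytes) as the initial symbols, and no special tokens. The function names `merge_pair`, `train_bpe`, and `bpe_encode_word` are hypothetical. Ties between equally frequent pairs are broken in favor of the lexicographically larger pair, which is one possible convention.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Fuse every non-overlapping occurrence of `pair` in `symbols`, left to right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each word starts as a tuple of single characters, weighted by its frequency
    word_freqs = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically larger pair
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Apply the new merge to every word before the next round
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[merge_pair(symbols, best)] += freq
        word_freqs = new_freqs
    return merges


def bpe_encode_word(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split `word` into subword symbols by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["low low low lower lowest"], num_merges=2)
print(merges)                            # [('o', 'w'), ('l', 'ow')]
print(bpe_encode_word("lowly", merges))  # ['low', 'l', 'y']
```

The word `lowly` never appears in the training text, yet it still encodes: the learned `low` unit is reused and the remainder falls back to single characters. Mapping the resulting symbols to integer IDs would follow the same `stoi` / `itos` pattern used for the word-level tokenizer.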