# Assignment 1


**Problem (unicode1): Understanding Unicode**

(a) What Unicode character does `chr(0)` return?

`chr(0)` returns the null character, U+0000 (NUL), whose repr is `'\x00'`.

(b) How does this character’s string representation (`__repr__()`) differ from its printed representation?

The repr shows the escape sequence `'\x00'`, while `print(chr(0))` writes the character itself, which is invisible.

(c) What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations:

```python
>>> chr(0)
>>> print(chr(0))
>>> "this is a test" + chr(0) + "string"
>>> print("this is a test" + chr(0) + "string")
```

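The null character stays in the string but is not rendered when printed. The repr shows it as an escape sequence, `'this is a test\x00string'`, while `print` displays `this is a teststring`. The NUL is still written to the output; it just has no visible glyph.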
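A short check, written here as a plain script rather than an interpreter session, can confirm that the character is still part of the string even though it is not visible. The variable names are illustrative only.

```python
# The null character is an ordinary one-character string.
nul = chr(0)
combined = "this is a test" + nul + "string"

# repr() escapes the character so that it becomes visible.
print(repr(nul))       # '\x00'
print(repr(combined))  # 'this is a test\x00string'

# The character still counts toward the length and can be located.
print(len(combined))        # 21
print(combined.index(nul))  # 14
```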


**Problem (unicode2): Unicode Encodings**

(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various input strings.

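For most real-world text (especially English), UTF-8 is the most space-efficient of the three: ASCII characters take one byte each, versus two in UTF-16 and four in UTF-32. UTF-8 is also byte-oriented, so it has no byte-order concerns, and it is the de facto standard for the web, files, and APIs, so most text data you encounter is already UTF-8.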

UTF-16 and UTF-32 pad ASCII characters with zero bytes, which can break C-style string handling and some legacy systems. UTF-8 produces a zero byte (`\x00`) only when encoding U+0000 itself.

Byte-level tokenizers (like GPT-2’s BPE) work naturally with UTF-8: a base vocabulary of the 256 possible byte values (0–255) can represent any input text, so there are no out-of-vocabulary symbols.
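The size comparison can be checked with a few sample strings. This is a minimal sketch: the sample strings and the `encoded_sizes` helper are chosen here for illustration, and the little-endian codec names are used so that no byte-order mark is added to the counts.

```python
samples = ["hello", "héllo", "こんにちは", "hello 🙂"]

def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes for each encoding (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for s in samples:
    print(repr(s), encoded_sizes(s))

# ASCII text in UTF-16 carries one zero byte per character.
print("hello".encode("utf-16-le"))  # b'h\x00e\x00l\x00l\x00o\x00'
```

UTF-8 is not always the smallest: the Japanese sample takes 15 bytes in UTF-8 and 10 in UTF-16. For English-heavy corpora, though, UTF-8 roughly halves the byte count relative to UTF-16.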

(b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into a Unicode string. Why is this function incorrect? Provide an example of an input byte string that yields incorrect results.

```python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

>>> decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))
'hello'
```

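Decoding byte by byte breaks multi-byte characters: each byte of a multi-byte sequence is not valid UTF-8 on its own. For example, `"é".encode("utf-8")` is `b'\xc3\xa9'`, and the function raises `UnicodeDecodeError` on the first byte, `0xC3`, because it is a lead byte with no continuation byte.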
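The failure can be reproduced directly. In the sketch below, `decode_utf8_bytes_to_str` is a name introduced here for the corrected version; it simply decodes the whole byte string at once.

```python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Decodes each byte in isolation, which only works for ASCII input.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decodes the whole sequence at once so multi-byte characters stay intact.
    return bytestring.decode("utf-8")

data = "é".encode("utf-8")  # b'\xc3\xa9': one character, two bytes

try:
    decode_utf8_bytes_to_str_wrong(data)
    wrong_failed = False
except UnicodeDecodeError as err:
    wrong_failed = True
    print("byte-by-byte decoding failed:", err)

print(decode_utf8_bytes_to_str(data))  # é
```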

(c) Give a two byte sequence that does not decode to any Unicode character(s).

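`bytes([0xC0, 0x80])`, i.e. `b'\xc0\x80'`. This would be an overlong two-byte encoding of U+0000, which UTF-8 forbids; the byte `0xC0` never appears in valid UTF-8.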
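A quick check that Python's strict UTF-8 decoder rejects this sequence (the variable names are illustrative):

```python
candidate = bytes([0xC0, 0x80])

try:
    candidate.decode("utf-8")
    is_valid = True
except UnicodeDecodeError as err:
    is_valid = False
    print(err)

# For comparison, U+0000 has exactly one valid UTF-8 encoding: a single zero byte.
print(chr(0).encode("utf-8"))  # b'\x00'
```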





**Problem (train_bpe_tinystories): BPE Training on TinyStories**

(a) Train a byte-level BPE tokenizer on the TinyStories dataset, using a maximum vocabulary size of 10,000. Make sure to add the TinyStories <|endoftext|> special token to the vocabulary. Serialize the resulting vocabulary and merges to disk for further inspection. How many hours and memory did training take? What is the longest token in the vocabulary? Does it make sense?

Resource requirements: ≤30 minutes (no GPUs), ≤ 30GB RAM

Hint: You should be able to get under 2 minutes for BPE training using multiprocessing during pretokenization and the following two facts:

(a) The `<|endoftext|>` token delimits documents in the data files.

(b) The `<|endoftext|>` token is handled as a special case before the BPE merges are applied.

Deliverable: A one-to-two sentence response.

(b) Profile your code. What part of the tokenizer training process takes the most time?

Deliverable: A one-to-two sentence response.

```python
# train_bpe_tinystories

import bbpe_train
import pickle
import time

start_time = time.time()
vocab, merges = bbpe_train.train_bbpe(
    input_path="../data/TinyStoriesV2-GPT4-train.txt",
    vocab_size=10000,
    special_tokens=["<|endoftext|>"],
    num_chunks=16,
    num_processes=8,
)
end_time = time.time()
print(f"Training took {end_time - start_time} seconds")

with open("tinystories_vocab.pkl", "wb") as f:
    pickle.dump(vocab, f)

with open("tinystories_merges.pkl", "wb") as f:
    pickle.dump(merges, f)
```

```
Time taken before pretokenization: 1.02 seconds
Starting pre-tokenization with 8 processes on 16 chunks...

Processing chunks: 100%|██████████| 16/16 [00:40<00:00,  2.53s/it]

Pre-tokenization and initial counting time: 40.97 seconds
Total words processed: 539,309,867
Unique word patterns: 59,934
Initial vocab size: 257
Target vocab size: 10000

Performing BPE merges: 100%|██████████| 9743/9743 [00:06<00:00, 1470.51it/s]

Merge time: 6.63 seconds
Training took 49.74058437347412 seconds
```


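In the run above, training took about 50 seconds end to end: roughly 41 seconds for pre-tokenization and 6.6 seconds for the merges, so pre-tokenization is the dominant cost. The initial `merge_pair` implementation was very slow; the current version is much faster thanks to two main improvements:

1. Word-frequency-based counting: merges operate on unique pre-tokens weighted by their counts (59,934 unique patterns for 539,309,867 words) rather than on every occurrence.
2. Parallel pre-tokenization: the input is split into 16 chunks that are processed by 8 worker processes.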
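The word-frequency idea can be sketched on a toy corpus. This is not the `bbpe_train` implementation: `count_pairs` and `apply_merge` are hypothetical helpers, and breaking ties in favor of the lexicographically greater pair is an assumption made for this example.

```python
from collections import Counter

def count_pairs(word_freqs: dict[tuple[bytes, ...], int]) -> Counter:
    """Count adjacent token pairs, weighting each word by its frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += freq
    return pairs

def apply_merge(word_freqs, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged = {}
    for word, freq in word_freqs.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each unique pre-token is stored once, together with its count.
word_freqs = {
    (b"l", b"o", b"w"): 5,
    (b"l", b"o", b"w", b"e", b"r"): 2,
    (b"n", b"e", b"w"): 3,
}

pairs = count_pairs(word_freqs)
# Assumed tie-break: among equally frequent pairs, take the greater one.
best = max(pairs, key=lambda p: (pairs[p], p))
word_freqs = apply_merge(word_freqs, best)
print(best, word_freqs)
```

Each merge step touches one entry per unique pre-token, so its cost scales with the number of unique patterns rather than with the total word count.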

Profiling:

Use Scalene to find the bottleneck:

```
cd cs336_basics
scalene --html --outfile analysis_report.html scalene_unified_analysis.py
```
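Where Scalene is not installed, the standard library's `cProfile` gives a coarser, function-level view. The sketch below swaps in `cProfile` for Scalene and profiles a small stand-in workload; `pretokenize`, `count_words`, and `workload` are hypothetical placeholders, not functions from `bbpe_train`.

```python
import cProfile
import io
import pstats

def pretokenize(text: str) -> list[str]:
    # Hypothetical stand-in for the real pre-tokenization step.
    return text.split()

def count_words(words: list[str]) -> dict[str, int]:
    # Hypothetical stand-in for the initial frequency counting.
    counts: dict[str, int] = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def workload() -> dict[str, int]:
    return count_words(pretokenize("once upon a time " * 50_000))

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Show the five entries with the largest cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```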





**Problem (train_bpe_expts_owt): BPE Training on OpenWebText**

(a) Train a byte-level BPE tokenizer on the OpenWebText dataset, using a maximum vocabulary
size of 32,000. Serialize the resulting vocabulary and merges to disk for further inspection. What
is the longest token in the vocabulary? Does it make sense?

**Resource requirements**: ≤12 hours (no GPUs), ≤ 100GB RAM

**Deliverable**: A one-to-two sentence response.

Answer: 



```python
# train_bpe_expts_owt
import bbpe_train
import pickle
import time

start_owt_time = time.time()
owt_vocab, owt_merges = bbpe_train.train_bbpe(
    input_path="../data/owt_train.txt",
    vocab_size=32000,
    special_tokens=["<|endoftext|>"],
    num_chunks=32,
    num_processes=8,
)
end_owt_time = time.time()
print(f"Training took {end_owt_time - start_owt_time} seconds")

with open("owt_vocab.pkl", "wb") as f:
    pickle.dump(owt_vocab, f)

with open("owt_merges.pkl", "wb") as f:
    pickle.dump(owt_merges, f)
```

```
Time taken before pretokenization: 5.38 seconds
Starting pre-tokenization with 8 processes on 32 chunks...

Processing chunks: 100%|██████████| 32/32 [03:27<00:00,  6.48s/it]

Pre-tokenization and initial counting time: 209.36 seconds
Total words processed: 2,474,152,489
Unique word patterns: 6,601,893
Initial vocab size: 257
Target vocab size: 32000

Performing BPE merges: 100%|██████████| 31743/31743 [36:40<00:00, 14.43it/s]

Merge time: 2200.31 seconds
Training took 2601.484979391098 seconds
```
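To answer the longest-token question, the serialized vocabulary can be inspected after training. The sketch below assumes the vocabulary is a `dict` mapping token id to `bytes`; that layout is an assumption, since the return type of `train_bbpe` is not shown here. `longest_token` is a hypothetical helper, and a toy vocabulary stands in for the pickled one.

```python
def longest_token(vocab: dict[int, bytes]) -> tuple[int, bytes]:
    """Return the (id, token) pair whose byte string is longest."""
    return max(vocab.items(), key=lambda item: len(item[1]))

# Toy stand-in for the pickled vocabulary (assumed layout: token id -> bytes).
toy_vocab = {0: b"<|endoftext|>", 1: b"a", 2: b" the", 3: b" accomplishment"}

token_id, token = longest_token(toy_vocab)
# errors="replace" guards against tokens that are not complete UTF-8 sequences.
print(token_id, len(token), token.decode("utf-8", errors="replace"))
```

For the real run, the same helper could be applied to the object loaded with `pickle.load` from `owt_vocab.pkl` or `tinystories_vocab.pkl`, provided the layout assumption holds.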
