#  Problem (unicode1)

 (a) What Unicode character does chr(0) return?

In [1]:
chr(0)

'\x00'

`chr(0)` returns the null character (U+0000, NUL).

 (b) How does this character’s string representation (`__repr__()`) differ from its printed representation?

In [2]:
print('hello' + chr(0) + 'world')

hello world


In [3]:
repr('hello' + chr(0) + 'world')

"'hello\\x00world'"

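`print()` aims for readability and writes the character itself, so the null character is emitted but invisible in the output; `repr()` uses an unambiguous escape sequence (`\x00`), which is what you want for debugging and inspection.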

(c) What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations:

In [4]:
print(chr(0))
print('hello' + chr(0) + 'world')

 
hello world
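As a small check on (c), the sketch below inspects the string instead of printing it. It suggests that the null character is an ordinary code point in Python: it is kept in the string, counts toward its length, and encodes to a single zero byte in UTF-8, even though a terminal may render it as nothing or as a blank. The variable names are illustrative only.

```python
# Build the same string as in the cell above.
s = 'hello' + chr(0) + 'world'

length = len(s)              # the null character counts as one character
contains_null = '\x00' in s  # it is still present in the string
encoded = s.encode('utf-8')  # and it encodes to a single 0x00 byte

print(length)
print(contains_null)
print(encoded)
```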


#  Problem (unicode2)

**Unicode Encoding**: It’s impractical to train tokenizers directly on Unicode code points, since the vocabulary would be prohibitively large (around 150K items) and sparse (many characters are quite rare).

In [5]:
# Example of UTF-8 encode
test_string = 'hello! こんにちは!'
utf8_encode = test_string.encode('utf-8')
print(utf8_encode)
print(type(utf8_encode)) # 'bytes': an immutable sequence of 8-bit values
utf8_encode_list = list(utf8_encode)
print(utf8_encode_list)

# UTF-8 is a variable-length encoding (1 to 4 bytes per code point)
print(f'List length: {len(utf8_encode_list)}')
print(f'String length: {len(test_string)}')

print(utf8_encode.decode('utf-8'))

b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'
<class 'bytes'>
[104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33]
List length: 23
String length: 13
hello! こんにちは!


In [6]:
# Example of UTF-16 encode
test_string = 'hello! こんにちは!'
utf16_encode = test_string.encode('utf-16')
print(utf16_encode)
print(type(utf16_encode))
utf16_encode_list = list(utf16_encode)
print(utf16_encode_list)

# UTF-16 is also a variable-length encoding (2 or 4 bytes per code point)
print(f'List length: {len(utf16_encode_list)}')
print(f'String length: {len(test_string)}')

print(utf16_encode.decode('utf-16'))

b'\xff\xfeh\x00e\x00l\x00l\x00o\x00!\x00 \x00S0\x930k0a0o0!\x00'
<class 'bytes'>
[255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0, 32, 0, 83, 48, 147, 48, 107, 48, 97, 48, 111, 48, 33, 0]
List length: 28
String length: 13
hello! こんにちは!


(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various input strings.

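1. UTF-8 is the most space-efficient for ASCII-heavy text: 1 byte per ASCII character, versus 2 in UTF-16 and 4 in UTF-32.
2. UTF-8 is the most widely used encoding.
3. UTF-16/32 are byte-order dependent, so Python's generic codecs prepend a byte order mark (BOM), such as the leading `\xff\xfe` in the UTF-16 output above.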
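To compare the encodings on more than one input, here is a minimal sketch that counts the encoded bytes per codec. The helper name `byte_counts` and the sample strings are illustrative assumptions, not part of the assignment.

```python
def byte_counts(text: str) -> dict[str, int]:
    """Return the encoded length in bytes for each codec (BOM included where the codec adds one)."""
    return {enc: len(text.encode(enc)) for enc in ('utf-8', 'utf-16', 'utf-32')}

for sample in ['hello', 'こんにちは', 'hello! こんにちは!']:
    print(repr(sample), byte_counts(sample))
```

For ASCII text UTF-8 is clearly the smallest. For purely Japanese text UTF-16 can be smaller (2 bytes per character instead of 3), so the space argument holds mainly for ASCII-heavy corpora.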

(b) Consider the following (incorrect) function, which is intended to decode a UTF-8 byte string into a Unicode string. Why is this function incorrect? Provide an example of an input byte string that yields incorrect results.

(c) Give a two byte sequence that does not decode to any Unicode character(s).

In [7]:
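# Only correct for single-byte (ASCII) characters: each byte is decoded on its own,
# so the bytes of a multi-byte character are never decoded together.
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

print(decode_utf8_bytes_to_str_wrong("hello".encode("utf-8")))
try:
    print(decode_utf8_bytes_to_str_wrong("草".encode("utf-8")))
except UnicodeDecodeError:
    print('an incorrect example: 草')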

hello
an incorrect example: 草
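For contrast, and as one possible answer to (c), the sketch below decodes the whole byte string at once and then checks a two-byte sequence that is not valid UTF-8. The names `decode_utf8_bytes_to_str` and `is_valid_utf8` are hypothetical helpers introduced here for illustration.

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the full sequence so multi-byte characters are handled together.
    return bytestring.decode("utf-8")

def is_valid_utf8(bytestring: bytes) -> bool:
    # True if the bytes form a complete, well-formed UTF-8 sequence.
    try:
        bytestring.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

encoded = "草".encode("utf-8")   # three bytes: b'\xe8\x8d\x89'
truncated = encoded[:2]          # b'\xe8\x8d': lead byte plus one continuation byte

print(decode_utf8_bytes_to_str(encoded))
print(is_valid_utf8(truncated))
print(is_valid_utf8(b'\xff\xff'))
```

`b'\xe8\x8d'` fails because the lead byte `0xe8` announces a three-byte character but only one continuation byte follows; `b'\xff\xff'` fails because `0xff` never appears in well-formed UTF-8.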


In [8]:
# pre-tokenizer for fast merge compute
import regex as re
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
re.findall(PAT, "some text that i'll pre-tokenize")

['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']
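To illustrate why pre-tokenization helps with computing merges, here is a minimal sketch that counts adjacent byte pairs inside each pre-token from the output above, never across pre-token boundaries. It is only an illustration of the counting step under that assumption, not the `run_train_bpe` implementation used later in this notebook.

```python
from collections import Counter

# Pre-tokens copied from the regex output above.
pretokens = ['some', ' text', ' that', ' i', "'ll", ' pre', '-', 'tokenize']

pair_counts: Counter = Counter()
for token in pretokens:
    token_bytes = token.encode('utf-8')
    # Count adjacent byte pairs within this pre-token only.
    for left, right in zip(token_bytes, token_bytes[1:]):
        pair_counts[(bytes([left]), bytes([right]))] += 1

best_pair, best_count = pair_counts.most_common(1)[0]
print(best_pair, best_count)
```

In a full trainer the counts would typically be weighted by how often each pre-token occurs in the corpus, so each distinct pre-token is processed once.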

In [1]:
# train and save (1) vocabulary (2) merges (3) encoding over openweb
import base64
import json
from cs336_basics.Tokenizer.tokenizer import run_train_bpe, Tokenizer
import numpy as np

# train vocab and merges
vocab, merges = run_train_bpe(
    'data/owt_valid.txt',
    32000,
    ['<|endoftext|>'],
    n_processes=8,
)

# save vocab and merges
# base64-encode the raw token bytes so they round-trip through JSON/text without loss
vocab = {idx: base64.b64encode(token).decode('ascii')  for idx, token in vocab.items()}
with open('data/owt_vocab.json', 'w', encoding = 'utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent = 2)

with open('data/owt_merges.txt', 'w', encoding = 'utf-8') as f:
    for pair in merges:
        first = base64.b64encode(pair[0]).decode('ascii')
        second = base64.b64encode(pair[1]).decode('ascii')
        f.write(f'{first} {second}\n')


# load vocab and merges
with open('data/owt_vocab.json', 'r', encoding = 'utf-8') as f:
    vocab = json.load(f)
# JSON object keys are always strings, so convert the token IDs back to int
vocab = {int(idx): base64.b64decode(encode_b) for idx, encode_b in vocab.items()}

merges = []
with open('data/owt_merges.txt', 'r', encoding = 'utf-8') as f:
    for row in f:
        if not row:
            continue
        parts = row.split()
        if len(parts) == 2:
            first = base64.b64decode(parts[0])
            second = base64.b64decode(parts[1])
            merges.append((first, second))

tkn = Tokenizer(
    vocab, merges, ['<|endoftext|>']
)

# encode the OWT train/valid splits line by line and save the token IDs
for stage in ['train', 'valid']:
    encodes = []
    with open(f'data/owt_{stage}.txt', 'r', encoding = 'utf-8') as f:
        for line in f:
            # lines yielded by file iteration are never empty, so no check is needed
            encodes.extend(tkn.encode(line))
    # uint16 is sufficient because the vocabulary size (32000) is below 65536
    encodes = np.array(encodes, dtype = np.uint16)
    np.save(f'data/owt_{stage}.npy', encodes)

100%|██████████| 31743/31743 [00:52<00:00, 608.17it/s] 


KeyboardInterrupt: 
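The save/load logic above can be checked in isolation. The sketch below round-trips a tiny, made-up vocabulary through base64 and JSON in memory. It assumes token IDs are integers mapped to `bytes`, and it shows that JSON turns integer keys into strings, so they need to be converted back on load.

```python
import base64
import json

# A tiny hypothetical vocabulary, including bytes that are not valid UTF-8 on their own.
vocab = {0: b'\x00', 1: b'\xe8\x8d', 2: b' the'}

# Serialize: raw bytes -> base64 ASCII text -> JSON.
serialized = json.dumps(
    {idx: base64.b64encode(token).decode('ascii') for idx, token in vocab.items()}
)

# Deserialize: JSON keys come back as strings, so cast them to int.
loaded = json.loads(serialized)
restored = {int(idx): base64.b64decode(token) for idx, token in loaded.items()}

print(list(loaded.keys()))
print(restored == vocab)
```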

In [10]:
# Just test my implementation (fast enough)
from cs336_basics.Tokenizer.tokenizer import run_train_bpe

vocab, merges = run_train_bpe(
    'data/TinyStoriesV2-GPT4-train.txt',
    32000,
    ['<|endoftext|>'],
    n_processes=8,
)

print(vocab[20000])
print(merges[:10])

100%|██████████| 31743/31743 [00:03<00:00, 9167.62it/s] 


b' stillness'
[(b' ', b't'), (b'h', b'e'), (b' ', b'a'), (b' ', b's'), (b' ', b'w'), (b'n', b'd'), (b' t', b'he'), (b'e', b'd'), (b' ', b'b'), (b' t', b'o')]
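To see what these merges do at encoding time, here is a minimal sketch that applies the ten merges printed above, in the order they were learned, to a single pre-token. The function `apply_merges` is a hypothetical helper written for this illustration and is not the `Tokenizer` class from `cs336_basics`.

```python
# The first ten merges, copied from the output above.
merges = [(b' ', b't'), (b'h', b'e'), (b' ', b'a'), (b' ', b's'), (b' ', b'w'),
          (b'n', b'd'), (b' t', b'he'), (b'e', b'd'), (b' ', b'b'), (b' t', b'o')]

def apply_merges(word: bytes, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Split a pre-token into single bytes, then apply each merge in learned order."""
    parts = [bytes([b]) for b in word]
    for first, second in merges:
        merged = []
        i = 0
        while i < len(parts):
            # Merge every adjacent occurrence of (first, second), left to right.
            if i + 1 < len(parts) and parts[i] == first and parts[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts

print(apply_merges(b' the', merges))
print(apply_merges(b' and', merges))
```

With only these ten merges, `b' the'` collapses to a single token, while `b' and'` stops at two pieces because no merge for `(b' a', b'nd')` is in the list.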
