# Byte-Pair Encoding (BPE) Tokenizer

## 2.1 The Unicode Standard

In [5]:
# What Unicode character does chr(0) return?
chr(0)

'\x00'

\x00  ⇔  Unicode code point 0  ⇔  NUL

In [8]:
# How does this character’s string representation (__repr__()) differ from its printed representation?
print(chr(0))
print(repr(chr(0)))

print("\n")
print(repr("\n"))


'\x00'


'\n'


`__repr__()` returns an unambiguous, programmer-oriented representation: for `chr(0)` it is the quoted escape `'\x00'`, and for a newline it is `'\n'`. `print()` writes the character itself, which is invisible for NUL and a line break for `"\n"`.
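
As a small sketch of the same distinction (the variable names `nul` and `escaped` are only illustrative), `repr()` returns a string made of visible escape characters, while `str()` returns the single character unchanged:

```python
nul = chr(0)

# repr() yields a quoted, escaped form made of six visible characters.
escaped = repr(nul)
print(escaped)        # '\x00'
print(len(escaped))   # 6

# str() yields the character itself: one invisible character.
print(len(str(nul)))  # 1
```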

What happens when this character occurs in text? It may be helpful to play around with the following in your Python interpreter and see if it matches your expectations:

In [None]:
chr(0)

'\x00'

In [12]:
print(chr(0))

 


In [13]:
"this is a test" + chr(0) + "string"

'this is a test\x00string'

In [14]:
print("this is a test" + chr(0) + "string")

this is a test string


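`chr(0)` returns the Unicode NUL character (U+0000), which is non-printable. When it occurs in text, it is stored in the string like any other character: the `repr` shows it as the escape `\x00`, while `print` emits the raw character, which displays as nothing or as a blank depending on the terminal or notebook.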
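
A quick way to confirm that the NUL character really is part of the string, even though printing hides it, is to inspect the length, position, and encoded byte. This is a minimal sketch; the variable name `text` is an assumption for illustration:

```python
text = "this is a test" + chr(0) + "string"

# The NUL character counts toward the length like any other character.
print(len(text))                 # 21

# It sits right after the 14 characters of "this is a test".
print(text.index(chr(0)))        # 14

# In UTF-8 it is encoded as the single byte 0.
print(text.encode("utf-8")[14])  # 0
```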

## 2.2 Unicode Encodings

What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various input strings.

In [2]:
test_string = "Hello"
print(test_string.encode('utf-8'))
print(test_string.encode('utf-16'))
print(test_string.encode('utf-32'))

b'Hello'
b'\xff\xfeH\x00e\x00l\x00l\x00o\x00'
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'
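
One way to explore the question is to compare encoded sizes across a few inputs. The sketch below is illustrative only: the helper name `encoded_sizes` and the sample strings are assumptions, and the `-le` codec variants are used so that the byte-order mark (the leading `\xff\xfe` in the output above) is not counted.

```python
def encoded_sizes(s: str) -> dict[str, int]:
    """Return the number of bytes needed to encode s in each encoding."""
    # The "-le" codecs fix the byte order and therefore emit no byte-order mark.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(s.encode(enc)) for enc in encodings}


for sample in ["Hello", "héllo", "中文"]:
    print(repr(sample), encoded_sizes(sample))

# 'Hello' {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
# 'héllo' {'utf-8': 6, 'utf-16-le': 10, 'utf-32-le': 20}
# '中文' {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```

For ASCII-heavy text, UTF-8 is the most compact of the three and contains none of the zero padding bytes that fill the UTF-16 and UTF-32 output. For some scripts, such as Chinese, UTF-16 is shorter. UTF-8 also has no byte-order variants, so it needs no byte-order mark.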


In [None]:
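def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Incorrect: decodes every byte on its own.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

# Example input that fails. The call raises UnicodeDecodeError, so it is left commented out:
# decode_utf8_bytes_to_str_wrong("中文".encode("utf-8"))

# The function decodes each byte independently, but a UTF-8 character can span several bytes that must be decoded together. It only works for ASCII input.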
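
By contrast, a correct decoder hands the whole byte string to the codec at once. The sketch below uses a hypothetical name, `decode_utf8_bytes_to_str`, and also shows that a single byte taken from a multi-byte character cannot be decoded alone:

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the full byte string so multi-byte sequences stay intact.
    return bytestring.decode("utf-8")


encoded = "中文".encode("utf-8")
print(len(encoded))                       # 6 (three bytes per character)
print(decode_utf8_bytes_to_str(encoded))  # 中文

# Decoding only the first byte fails, which is why the per-byte version breaks.
try:
    bytes([encoded[0]]).decode("utf-8")
    first_byte_decodes = True
except UnicodeDecodeError:
    first_byte_decodes = False

print(first_byte_decodes)                 # False
```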

Give a two-byte sequence that does not decode to any Unicode character(s).

In [15]:
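# 0x80 is a continuation byte and cannot start a UTF-8 sequence, so this raises UnicodeDecodeError:
# print(bytes([0x80, 0x80]).decode("utf-8"))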
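
To check candidate byte sequences without crashing the notebook, the decode call can be wrapped in a small helper. This is a sketch; `is_valid_utf8` is a hypothetical name, not part of any library:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Two continuation bytes with no leading byte: invalid.
print(is_valid_utf8(bytes([0x80, 0x80])))  # False

# A leading byte followed by a continuation byte: valid (this is "é").
print(is_valid_utf8(bytes([0xC3, 0xA9])))  # True

# A leading byte followed by an ASCII byte instead of a continuation byte: invalid.
print(is_valid_utf8(bytes([0xC3, 0x28])))  # False
```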

In [5]:
list(test_string.encode("utf-8"))

[72, 101, 108, 108, 111]