<a href="https://colab.research.google.com/github/cahya-wirawan/rwkv-tokenizer/blob/main/rwkv_tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparison of RWKV Tokenizers

In [3]:
!apt install cargo
!pip install rwkv datasets rwkv_tokenizer tiktoken
!pip install ipywidgets

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [1]:
import os
from datetime import datetime
import torch
import rwkv
from rwkv import rwkv_tokenizer
from rwkv.rwkv_tokenizer import TRIE_TOKENIZER
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import rwkv_tokenizer
import tiktoken
import tqdm
import plotly.express as px
import pandas as pd

In [2]:
#ds = load_dataset("wikitext", "wikitext-103-v1")
#ds = load_dataset("cahya/datasets-test")
ds = load_dataset("wikipedia", "20220301.simple")
print(ds)



DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 205328
    })
})


In [3]:
MAX_ROW = len(ds["train"])
print(MAX_ROW)
time_diffs = {}

205328


## The Original RWKV Tokenizer (BlinkDL)

In [4]:
# The vocabulary file ships inside the installed `rwkv` package.
tokenizer_rwkv = TRIE_TOKENIZER(os.path.join(os.path.dirname(os.path.abspath(rwkv.__file__)), 'rwkv_vocab_v20230424.txt'))

In [6]:
chars_number = 0
tokens_number = 0
start = datetime.now()
for i, row in tqdm.tqdm(enumerate(ds["train"]), total=MAX_ROW):
    try:
        inputs = tokenizer_rwkv.encode(row["text"])
    except Exception as err:
        # Report the offending row and stop the benchmark.
        print(f"{i}: {err!r}: {row['text']}")
        break
    chars_number += len(row["text"])
    tokens_number += len(inputs)
    if i>=MAX_ROW:
        break
end = datetime.now()
time_diff = (end-start).total_seconds()
print()
print(f"Number of chars: {chars_number}")
print(f"Number of tokens: {tokens_number}")
print(f"Chars per token: {chars_number/tokens_number:.2f}")
print(f"Time difference: {time_diff:.2f}s")
print(f"Chars per second: {chars_number/time_diff:.2f}")
time_diffs["The Original RWKV Tokenizer (BlinkDL)"] = time_diff

100%|██████████| 205328/205328 [03:18<00:00, 1033.24it/s]


Number of chars: 215489882
Number of tokens: 53619552
Chars per token: 4.02
Time difference: 198.73s
Chars per second: 1084338.67
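
The measurement loop above is repeated unchanged for every tokenizer in this notebook. A minimal sketch of how it could be factored into one helper is shown below. The names `benchmark` and `toy_encode` are hypothetical and not part of the notebook, and the UTF-8 byte "tokenizer" is only a stand-in so that the sketch runs without any of the libraries installed above.

```python
import time
from typing import Callable, Iterable, Sequence


def benchmark(encode: Callable[[str], Sequence[int]], texts: Iterable[str]) -> dict:
    """Encode every text once and collect the statistics the cells print."""
    chars = 0
    tokens = 0
    start = time.perf_counter()
    for text in texts:
        tokens += len(encode(text))
        chars += len(text)
    seconds = time.perf_counter() - start
    return {
        "chars": chars,
        "tokens": tokens,
        "chars_per_token": chars / tokens if tokens else 0.0,
        "seconds": seconds,
        "chars_per_second": chars / seconds if seconds > 0 else float("inf"),
    }


def toy_encode(text: str) -> list:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


stats = benchmark(toy_encode, ["Today is a beautiful day.", "今天是美好的一天。"])
print(stats)
```

In the notebook, a call such as `benchmark(tokenizer_rwkv.encode, (row["text"] for row in ds["train"]))` would presumably give the same figures as the cell above, since the time spent iterating over the dataset is included in both cases.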





## The Hugging Face RWKV Tokenizer

In [7]:
tokenizer_hf_rwkv = AutoTokenizer.from_pretrained("RWKV/rwkv-6-world-3b", trust_remote_code=True)

In [8]:
chars_number = 0
tokens_number = 0
start = datetime.now()
for i, row in tqdm.tqdm(enumerate(ds["train"]), total=MAX_ROW):
    try:
        inputs = tokenizer_hf_rwkv.encode(row["text"])
    except Exception as err:
        # Report the offending row and stop the benchmark.
        print(f"{i}: {err!r}: {row['text']}")
        break
    chars_number += len(row["text"])
    tokens_number += len(inputs)
    if i>=MAX_ROW:
        break
end = datetime.now()
time_diff = (end-start).total_seconds()
print()
print(f"Number of chars: {chars_number}")
print(f"Number of tokens: {tokens_number}")
print(f"Chars per token: {chars_number/tokens_number:.2f}")
print(f"Time difference: {time_diff:.2f}s")
print(f"Chars per second: {chars_number/time_diff:.2f}")
time_diffs["The Huggingface RWKV Tokenizer"] = time_diff

100%|██████████| 205328/205328 [11:06<00:00, 308.26it/s]


Number of chars: 215489882
Number of tokens: 53767631
Chars per token: 4.01
Time difference: 666.10s
Chars per second: 323510.76





## The Hugging Face Llama Tokenizer

In [9]:
tokenizer_hf_llama = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

In [10]:
chars_number = 0
tokens_number = 0
start = datetime.now()
for i, row in tqdm.tqdm(enumerate(ds["train"]), total=MAX_ROW):
    try:
        inputs = tokenizer_hf_llama.encode(row["text"])
    except Exception as err:
        # Report the offending row and stop the benchmark.
        print(f"{i}: {err!r}: {row['text']}")
        break
    chars_number += len(row["text"])
    tokens_number += len(inputs)
    if i>=MAX_ROW:
        break
end = datetime.now()
time_diff = (end-start).total_seconds()
print()
print(f"Number of chars: {chars_number}")
print(f"Number of tokens: {tokens_number}")
print(f"Chars per token: {chars_number/tokens_number:.2f}")
print(f"Time difference: {time_diff:.2f}s")
print(f"Chars per second: {chars_number/time_diff:.2f}")
time_diffs["The Huggingface LLama Tokenizer"] = time_diff

100%|██████████| 205328/205328 [04:32<00:00, 753.76it/s] 


Number of chars: 215489882
Number of tokens: 64676875
Chars per token: 3.33
Time difference: 272.42s
Chars per second: 791012.82





## The OpenAI tiktoken Tokenizer

In [11]:
tokenizer_tiktoken = tiktoken.get_encoding("o200k_base")

In [13]:
chars_number = 0
tokens_number = 0
start = datetime.now()
for i, row in tqdm.tqdm(enumerate(ds["train"]), total=MAX_ROW):
    try:
        inputs = tokenizer_tiktoken.encode(row["text"])
    except Exception as err:
        # Report the offending row and stop the benchmark.
        print(f"{i}: {err!r}: {row['text']}")
        break
    chars_number += len(row["text"])
    tokens_number += len(inputs)
    if i>=MAX_ROW:
        break
end = datetime.now()
time_diff = (end-start).total_seconds()
print()
print(f"Number of chars: {chars_number}")
print(f"Number of tokens: {tokens_number}")
print(f"Chars per token: {chars_number/tokens_number:.2f}")
print(f"Time difference: {time_diff:.2f}s")
print(f"Chars per second: {chars_number/time_diff:.2f}")
time_diffs["The OpenAI Tiktoken"] = time_diff

100%|██████████| 205328/205328 [01:53<00:00, 1805.53it/s]


Number of chars: 215489882
Number of tokens: 51926917
Chars per token: 4.15
Time difference: 113.73s
Chars per second: 1894773.38





## The new RWKV Tokenizer in Rust (Cahya)

In [None]:
# Just to get the vocab file rwkv_vocab_v20230424.txt
!git clone https://github.com/cahya-wirawan/rwkv-tokenizer.git

In [14]:
tokenizer_rust = rwkv_tokenizer.Tokenizer("rwkv-tokenizer/src/rwkv_vocab_v20230424.txt")

In [18]:
chars_number = 0
tokens_number = 0
start = datetime.now()
for i, row in tqdm.tqdm(enumerate(ds["train"]), total=MAX_ROW):
    try:
        inputs = tokenizer_rust.encode(row["text"])
    except Exception as err:
        # Report the offending row and stop the benchmark.
        print(f"{i}: {err!r}: {row['text']}")
        break
    chars_number += len(row["text"])
    tokens_number += len(inputs)
    if i>=MAX_ROW:
        break
end = datetime.now()
time_diff = (end-start).total_seconds()
print()
print(f"Number of chars: {chars_number}")
print(f"Number of tokens: {tokens_number}")
print(f"Chars per token: {chars_number/tokens_number:.2f}")
print(f"Time difference: {time_diff:.2f}s")
print(f"Chars per second: {chars_number/time_diff:.2f}")
time_diffs["The new RWKV Tokenizer in Rust"] = time_diff

100%|██████████| 205328/205328 [00:33<00:00, 6067.64it/s]


Number of chars: 215489882
Number of tokens: 53619552
Chars per token: 4.02
Time difference: 33.85s
Chars per second: 6366744.23





**The new Rust RWKV tokenizer produces exactly the same number of tokens (53,619,552) as the original tokenizer.**

## Tokens Comparison
We compare the token IDs produced by the new Rust `rwkv_tokenizer` with those produced by the original BlinkDL RWKV tokenizer.

In [19]:
text = "Today is a beautiful day. 今天是美好的一天。"
encoded = tokenizer_rust.encode(text)
decoded = tokenizer_rust.decode(encoded)
print(f"encoded: {encoded}")
print(f"text:\t\t{text}")
print(f"decoded:\t{decoded}")

encoded: [33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]
text:		Today is a beautiful day. 今天是美好的一天。
decoded:	Today is a beautiful day. 今天是美好的一天。
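
The cell above checks a single sentence by eye. A small sketch of the same round-trip idea applied to a list of samples follows. `failed_roundtrips`, `toy_encode` and `toy_decode` are hypothetical names, and the UTF-8 byte codec is only a stand-in for a real tokenizer so that the sketch runs on its own.

```python
from typing import Callable, Iterable, List, Sequence


def failed_roundtrips(
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    samples: Iterable[str],
) -> List[str]:
    """Return every sample for which decode(encode(sample)) != sample."""
    return [s for s in samples if decode(encode(s)) != s]


def toy_encode(text: str) -> list:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def toy_decode(ids: Sequence[int]) -> str:
    """Inverse of toy_encode."""
    return bytes(ids).decode("utf-8")


samples = ["", " ", "Today is a beautiful day. 今天是美好的一天。"]
failures = failed_roundtrips(toy_encode, toy_decode, samples)
print(f"{len(failures)} round-trip failure(s)")
```

With the notebook's objects, `failed_roundtrips(tokenizer_rust.encode, tokenizer_rust.decode, samples)` would be the corresponding call.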


### English dataset

In [20]:
%%time

ds = load_dataset("wikipedia", "20220301.simple")
MAX_ROW = 10000 #len(ds["train"])
counter = 0
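for i in tqdm.tqdm(range(0, MAX_ROW), total=MAX_ROW):
  row_text = ds["train"][i]["text"]
  tokens_rust = tokenizer_rust.encode(row_text)
  tokens_rwkv = tokenizer_rwkv.encode(row_text)
  # zip() stops at the shorter sequence, so compare the lengths explicitly.
  if len(tokens_rust) != len(tokens_rwkv):
    print(f"{i}: length mismatch ({len(tokens_rust)} vs {len(tokens_rwkv)})")
    counter += 1
  # The loop variables are named tok_* so they do not shadow the imported `rwkv` module.
  for index, (tok_rust, tok_rwkv) in enumerate(zip(tokens_rust, tokens_rwkv)):
    if tok_rust != tok_rwkv:
      print(f"{i} - {index}: {tokens_rust[max(0,index-4):index+4]}")
      print(f"{i} - {index}: {tokens_rwkv[max(0,index-4):index+4]}")
      print(f"{i} - {index}: {tokenizer_rwkv.decode(tokens_rwkv[max(0,index-4):index+4])}\n")
      counter += 1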
if counter > 0:
  print(f"\nFound {counter} difference(s)!")
else:
  print("\nBoth encoded tokens are exactly the same :-)")

100%|██████████| 10000/10000 [00:38<00:00, 261.12it/s]


Both encoded tokens are exactly the same :-)
CPU times: user 34.9 s, sys: 396 ms, total: 35.2 s
Wall time: 39.2 s





### Chinese dataset

In [21]:
%%time

ds = load_dataset("Lifan-Z/Chinese-poetries-txt")
MAX_ROW = 10000 #len(ds["train"])
counter = 0
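for i in tqdm.tqdm(range(0, MAX_ROW), total=MAX_ROW):
  row_text = ds["train"][i]["text"]
  tokens_rust = tokenizer_rust.encode(row_text)
  tokens_rwkv = tokenizer_rwkv.encode(row_text)
  # zip() stops at the shorter sequence, so compare the lengths explicitly.
  if len(tokens_rust) != len(tokens_rwkv):
    print(f"{i}: length mismatch ({len(tokens_rust)} vs {len(tokens_rwkv)})")
    counter += 1
  # The loop variables are named tok_* so they do not shadow the imported `rwkv` module.
  for index, (tok_rust, tok_rwkv) in enumerate(zip(tokens_rust, tokens_rwkv)):
    if tok_rust != tok_rwkv:
      print(f"{i} - {index}: {tokens_rust[max(0,index-4):index+4]}")
      print(f"{i} - {index}: {tokens_rwkv[max(0,index-4):index+4]}")
      print(f"{i} - {index}: {tokenizer_rwkv.decode(tokens_rwkv[max(0,index-4):index+4])}\n")
      counter += 1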
if counter > 0:
  print(f"\nFound {counter} difference(s)!")
else:
  print("\nBoth encoded tokens are exactly the same :-)")

100%|██████████| 10000/10000 [00:02<00:00, 4826.87it/s]


Both encoded tokens are exactly the same :-)
CPU times: user 2.16 s, sys: 78.3 ms, total: 2.24 s
Wall time: 3.07 s





**On the first 10,000 rows of both the English and the Chinese dataset, the new Rust RWKV tokenizer produced exactly the same tokens as the original RWKV tokenizer.**
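
A pairwise comparison of two token sequences has one pitfall: `zip()` stops at the shorter sequence, so a tokenizer that drops or appends trailing tokens would go unnoticed. The sketch below shows one possible way to locate the first divergence, including a length difference. `first_mismatch` is a hypothetical helper, not a function of either tokenizer library.

```python
from itertools import zip_longest
from typing import Optional, Sequence

# Sentinel that never compares equal to a token ID.
_MISSING = object()


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first position where a and b differ, or None.

    A position that exists in only one of the sequences counts as a difference.
    """
    for index, (x, y) in enumerate(zip_longest(a, b, fillvalue=_MISSING)):
        if x != y:
            return index
    return None


print(first_mismatch([33520, 4600, 332], [33520, 4600, 332]))
print(first_mismatch([33520, 4600, 332], [33520, 4600]))
```

The first call reports no difference, while the second reports index 2, where the shorter sequence ends.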

## Performance

In [22]:
# One bar per tokenizer; `time` is the wall-clock encoding time in seconds.
df = pd.DataFrame({'tokenizer': list(time_diffs.keys()), 'time': list(time_diffs.values())})
fig = px.bar(df, x='tokenizer', y='time')
fig.show()
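
Because the bar chart is not rendered in this export, the relative speed can also be read directly from the recorded times. The sketch below hard-codes the "Time difference" values printed by the benchmark cells above; the dictionary name `recorded_times` and the short labels are chosen here for illustration only.

```python
# Wall-clock times in seconds, copied from the benchmark outputs above.
recorded_times = {
    "Original RWKV (BlinkDL)": 198.73,
    "Hugging Face RWKV": 666.10,
    "Hugging Face Llama": 272.42,
    "OpenAI tiktoken": 113.73,
    "Rust RWKV": 33.85,
}

# Express every time as a multiple of the Rust tokenizer's time.
baseline = recorded_times["Rust RWKV"]
slowdowns = {name: seconds / baseline for name, seconds in recorded_times.items()}

for name, factor in sorted(slowdowns.items(), key=lambda item: item[1]):
    print(f"{name}: {factor:.1f}x the Rust tokenizer's time")
```

On this run the Rust tokenizer was therefore about 5.9 times faster than the original Python implementation, about 19.7 times faster than the Hugging Face RWKV tokenizer, about 8.0 times faster than the Hugging Face Llama tokenizer, and about 3.4 times faster than tiktoken. These figures come from a single run on one Colab instance, and the tokenizers use different vocabularies, so they should be read as a rough indication only.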

## Unit Test

In [23]:
src = '''起業家イーロン・マスク氏が創業した宇宙開発企業「スペースX（エックス）」の巨大新型ロケット「スターシップ」が20日朝、初めて打ち上げられたが、爆発した。
打ち上げは米テキサス州の東海岸で行われた。無人の試験で、負傷者はいなかった。
打ち上げから2～3分後、史上最大のロケットが制御不能になり、まもなく搭載された装置で破壊された。
マスク氏は、数カ月後に再挑戦すると表明した。
スペースXのエンジニアたちは、それでもこの日のミッションは成功だったとしている。「早期に頻繁に試験する」ことを好む人たちなので、破壊を恐れていない。次のフライトに向け、大量のデータを収集したはずだ。2機目のスターシップは、ほぼ飛行準備が整っている。
マスク氏は、「SpaceXチームの皆さん、スターシップのエキサイティングな試験打ち上げ、おめでとう！　数カ月後に行われる次の試験打ち上げに向けて、多くを学んだ」とツイートした。
アメリカでのロケット打ち上げを認可する米連邦航空局（NASA）は、事故調査を監督するとした。広報担当者は、飛行中に機体が失われた場合の通常の対応だと述べた。
マスク氏は打ち上げ前、期待値を下げようとしていた。発射台の設備を破壊せずに機体を打ち上げるだけでも「成功」だとしていた。
その願いはかなった。スターシップは打ち上げ施設からどんどん上昇し、メキシコ湾の上空へと向かっていった。しかし1分もしないうち、すべてが計画通りに進んでいるのではないことが明らかになった。'''

In [25]:
print(src)
print(f'\n{len(src)} chars')
tokens = tokenizer_rwkv.encode(src)
assert tokenizer_rwkv.decode(tokens) == src
print()
print(tokens)
print(f'\n{len(tokens)} tokens\n')

起業家イーロン・マスク氏が創業した宇宙開発企業「スペースX（エックス）」の巨大新型ロケット「スターシップ」が20日朝、初めて打ち上げられたが、爆発した。
打ち上げは米テキサス州の東海岸で行われた。無人の試験で、負傷者はいなかった。
打ち上げから2～3分後、史上最大のロケットが制御不能になり、まもなく搭載された装置で破壊された。
マスク氏は、数カ月後に再挑戦すると表明した。
スペースXのエンジニアたちは、それでもこの日のミッションは成功だったとしている。「早期に頻繁に試験する」ことを好む人たちなので、破壊を恐れていない。次のフライトに向け、大量のデータを収集したはずだ。2機目のスターシップは、ほぼ飛行準備が整っている。
マスク氏は、「SpaceXチームの皆さん、スターシップのエキサイティングな試験打ち上げ、おめでとう！　数カ月後に行われる次の試験打ち上げに向けて、多くを学んだ」とツイートした。
アメリカでのロケット打ち上げを認可する米連邦航空局（NASA）は、事故調査を監督するとした。広報担当者は、飛行中に機体が失われた場合の通常の対応だと述べた。
マスク氏は打ち上げ前、期待値を下げようとしていた。発射台の設備を破壊せずに機体を打ち上げるだけでも「成功」だとしていた。
その願いはかなった。スターシップは打ち上げ施設からどんどん上昇し、メキシコ湾の上空へと向かっていった。しかし1分もしないうち、すべてが計画通りに進んでいるのではないことが明らかになった。

635 chars

[16944, 13436, 11920, 10169, 10242, 10237, 10239, 10241, 10222, 10189, 10179, 13651, 10108, 10827, 13436, 43347, 11887, 11898, 17693, 14728, 10412, 13436, 10088, 10189, 10218, 43528, 89, 19133, 10172, 43484, 10189, 19134, 10089, 10138, 12140, 11638, 13034, 11496, 10237, 10181, 43485, 10088, 58583, 10187, 43486, 10089, 10108, 640, 13053, 13198, 10079,

In [24]:
print(src)
print(f'\n{len(src)} chars')
tokens = tokenizer_rust.encode(src)
assert tokenizer_rust.decode(tokens) == src
print()
print(tokens)
print(f'\n{len(tokens)} tokens\n')

起業家イーロン・マスク氏が創業した宇宙開発企業「スペースX（エックス）」の巨大新型ロケット「スターシップ」が20日朝、初めて打ち上げられたが、爆発した。
打ち上げは米テキサス州の東海岸で行われた。無人の試験で、負傷者はいなかった。
打ち上げから2～3分後、史上最大のロケットが制御不能になり、まもなく搭載された装置で破壊された。
マスク氏は、数カ月後に再挑戦すると表明した。
スペースXのエンジニアたちは、それでもこの日のミッションは成功だったとしている。「早期に頻繁に試験する」ことを好む人たちなので、破壊を恐れていない。次のフライトに向け、大量のデータを収集したはずだ。2機目のスターシップは、ほぼ飛行準備が整っている。
マスク氏は、「SpaceXチームの皆さん、スターシップのエキサイティングな試験打ち上げ、おめでとう！　数カ月後に行われる次の試験打ち上げに向けて、多くを学んだ」とツイートした。
アメリカでのロケット打ち上げを認可する米連邦航空局（NASA）は、事故調査を監督するとした。広報担当者は、飛行中に機体が失われた場合の通常の対応だと述べた。
マスク氏は打ち上げ前、期待値を下げようとしていた。発射台の設備を破壊せずに機体を打ち上げるだけでも「成功」だとしていた。
その願いはかなった。スターシップは打ち上げ施設からどんどん上昇し、メキシコ湾の上空へと向かっていった。しかし1分もしないうち、すべてが計画通りに進んでいるのではないことが明らかになった。

635 chars

[16944, 13436, 11920, 10169, 10242, 10237, 10239, 10241, 10222, 10189, 10179, 13651, 10108, 10827, 13436, 43347, 11887, 11898, 17693, 14728, 10412, 13436, 10088, 10189, 10218, 43528, 89, 19133, 10172, 43484, 10189, 19134, 10089, 10138, 12140, 11638, 13034, 11496, 10237, 10181, 43485, 10088, 58583, 10187, 43486, 10089, 10108, 640, 13053, 13198, 10079,

In [27]:
import random

In [28]:
########################################################################################################
# Unit Test
########################################################################################################

print('Unit test...')

QQQ = ['', ' ', 'Õ\U000683b8', b'\xe6\xaa\x81'.decode('utf-8')]

for TRIAL in range(500):
    x = ''
    for xx in [
        ['0',' '],
        ['0','1'],
        ['0','1',' '],
        ['0','1',' ','00','11','  ','000','111','   '],
        list('01 \n\r\t,.;!?:\'\"-=你好')
    ]:
        for i in range(256):
            x += random.choice(xx)
    QQQ += [x]

for i in range(5000):
    QQQ += [' ' * i]

for TRIAL in range(5000):
    x = chr(random.randrange(0, 256))
    x = x * random.randrange(1, 32)
    QQQ += [x]

for TRIAL in range(99999):
    x = chr(random.randrange(256, 1114112))
    x = x * random.randrange(1, 4)
    try:
        # Surrogate code points (U+D800..U+DFFF) cannot be encoded as UTF-8; skip them.
        x.encode("utf-8")
        QQQ += [x]
    except UnicodeEncodeError:
        pass

Unit test...
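
The `try` block in the last loop of the cell above exists because `chr()` accepts any code point up to U+10FFFF, including the UTF-16 surrogates U+D800..U+DFFF, which Python refuses to encode as UTF-8. The sketch below illustrates that behaviour. `is_utf8_encodable` and `random_encodable_char` are hypothetical helpers and are not used by the notebook itself.

```python
import random


def is_utf8_encodable(ch: str) -> bool:
    """True if the string can be encoded as strict UTF-8."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def random_encodable_char(rng: random.Random) -> str:
    """Draw code points from U+0100..U+10FFFF until one is UTF-8 encodable."""
    while True:
        ch = chr(rng.randrange(256, 0x110000))
        if is_utf8_encodable(ch):
            return ch


print(is_utf8_encodable(chr(0xD800)))    # a lone surrogate
print(is_utf8_encodable(chr(0x10FFFF)))  # the last valid code point
```

Rejection sampling like this keeps the test inputs valid without hard-coding the surrogate range.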


In [31]:
QQQ += ['''
UTF-8 decoder capability and stress test
----------------------------------------

Markus Kuhn <http://www.cl.cam.ac.uk/~mgk25/> - 2015-08-28 - CC BY 4.0

This test file can help you examine, how your UTF-8 decoder handles
various types of correct, malformed, or otherwise interesting UTF-8
sequences. This file is not meant to be a conformance test. It does
not prescribe any particular outcome. Therefore, there is no way to
"pass" or "fail" this test file, even though the text does suggest a
preferable decoder behaviour at some places. Its aim is, instead, to
help you think about, and test, the behaviour of your UTF-8 decoder on a
systematic collection of unusual inputs. Experience so far suggests
that most first-time authors of UTF-8 decoders find at least one
serious problem in their decoder using this file.

The test lines below cover boundary conditions, malformed UTF-8
sequences, as well as correctly encoded UTF-8 sequences of Unicode code
points that should never occur in a correct UTF-8 file.

According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" and
"characters that are not within the adopted subset shall be indicated
to the user" by a receiving device. One commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. It might be a good idea to
visually distinguish a malformed UTF-8 sequence from a correctly
encoded Unicode character that is just not available in the current
font but otherwise fully legal, even though ISO 10646-1 doesn't
mandate this. In any case, just ignoring malformed sequences or
unavailable characters does not conform to ISO 10646, will make
debugging more difficult, and can lead to user confusion.

Please check, whether a malformed UTF-8 sequence is (1) represented at
all, (2) represented by exactly one single replacement character (or
equivalent signal), and (3) the following quotation mark after an
illegal UTF-8 sequence is correctly displayed, i.e. proper
resynchronization takes place immediately after any malformed
sequence. This file says "THE END" in the last line, so if you don't
see that, your decoder crashed somehow before, which should always be
cause for concern.

All lines in this file are exactly 79 characters long (plus the line
feed). In addition, all lines end with "|", except for the two test
lines 2.1.1 and 2.2.1, which contain non-printable ASCII controls
U+0000 and U+007F. If you display this file with a fixed-width font,
these "|" characters should all line up in column 79 (right margin).
This allows you to test quickly, whether your UTF-8 decoder finds the
correct number of characters in every line, that is whether each
malformed sequences is replaced by a single replacement character.

Note that, as an alternative to the notion of malformed sequence used
here, it is also a perfectly acceptable (and in some situations even
preferable) solution to represent each individual byte of a malformed
sequence with a replacement character. If you follow this strategy in
your decoder, then please ignore the "|" column.


Here come the tests:                                                          |
                                                                              |
1  Some correct UTF-8 text                                                    |
                                                                              |
You should see the Greek word 'kosme':       "κόσμε"                          |
                                                                              |
2  Boundary condition test cases                                              |
                                                                              |
2.1  First possible sequence of a certain length                              |
                                                                              |
2.1.1  1 byte  (U-00000000):        "�"
2.1.2  2 bytes (U-00000080):        ""                                       |
2.1.3  3 bytes (U-00000800):        "ࠀ"                                       |
2.1.4  4 bytes (U-00010000):        "𐀀"                                       |
2.1.5  5 bytes (U-00200000):        "�����"                                       |
2.1.6  6 bytes (U-04000000):        "������"                                       |
                                                                              |
2.2  Last possible sequence of a certain length                               |
                                                                              |
2.2.1  1 byte  (U-0000007F):        ""
2.2.2  2 bytes (U-000007FF):        "߿"                                       |
2.2.3  3 bytes (U-0000FFFF):        "￿"                                       |
2.2.4  4 bytes (U-001FFFFF):        "����"                                       |
2.2.5  5 bytes (U-03FFFFFF):        "�����"                                       |
2.2.6  6 bytes (U-7FFFFFFF):        "������"                                       |
                                                                              |
2.3  Other boundary conditions                                                |
                                                                              |
2.3.1  U-0000D7FF = ed 9f bf = "퟿"                                            |
2.3.2  U-0000E000 = ee 80 80 = ""                                            |
2.3.3  U-0000FFFD = ef bf bd = "�"                                            |
2.3.4  U-0010FFFF = f4 8f bf bf = "􏿿"                                         |
2.3.5  U-00110000 = f4 90 80 80 = "����"                                         |
                                                                              |
3  Malformed sequences                                                        |
                                                                              |
3.1  Unexpected continuation bytes                                            |
                                                                              |
Each unexpected continuation byte should be separately signalled as a         |
malformed sequence of its own.                                                |
                                                                              |
3.1.1  First continuation byte 0x80: "�"                                      |
3.1.2  Last  continuation byte 0xbf: "�"                                      |
                                                                              |
3.1.3  2 continuation bytes: "��"                                             |
3.1.4  3 continuation bytes: "���"                                            |
3.1.5  4 continuation bytes: "����"                                           |
3.1.6  5 continuation bytes: "�����"                                          |
3.1.7  6 continuation bytes: "������"                                         |
3.1.8  7 continuation bytes: "�������"                                        |
                                                                              |
3.1.9  Sequence of all 64 possible continuation bytes (0x80-0xbf):            |
                                                                              |
   "����������������                                                          |
    ����������������                                                          |
    ����������������                                                          |
    ����������������"                                                         |
                                                                              |
3.2  Lonely start characters                                                  |
                                                                              |
3.2.1  All 32 first bytes of 2-byte sequences (0xc0-0xdf),                    |
       each followed by a space character:                                    |
                                                                              |
   "� � � � � � � � � � � � � � � �                                           |
    � � � � � � � � � � � � � � � � "                                         |
                                                                              |
3.2.2  All 16 first bytes of 3-byte sequences (0xe0-0xef),                    |
       each followed by a space character:                                    |
                                                                              |
   "� � � � � � � � � � � � � � � � "                                         |
                                                                              |
3.2.3  All 8 first bytes of 4-byte sequences (0xf0-0xf7),                     |
       each followed by a space character:                                    |
                                                                              |
   "� � � � � � � � "                                                         |
                                                                              |
3.2.4  All 4 first bytes of 5-byte sequences (0xf8-0xfb),                     |
       each followed by a space character:                                    |
                                                                              |
   "� � � � "                                                                 |
                                                                              |
3.2.5  All 2 first bytes of 6-byte sequences (0xfc-0xfd),                     |
       each followed by a space character:                                    |
                                                                              |
   "� � "                                                                     |
                                                                              |
3.3  Sequences with last continuation byte missing                            |
                                                                              |
All bytes of an incomplete sequence should be signalled as a single           |
malformed sequence, i.e., you should see only a single replacement            |
character in each of the next 10 tests. (Characters as in section 2)          |
                                                                              |
3.3.1  2-byte sequence with last byte missing (U+0000):     "�"               |
3.3.2  3-byte sequence with last byte missing (U+0000):     "��"               |
3.3.3  4-byte sequence with last byte missing (U+0000):     "���"               |
3.3.4  5-byte sequence with last byte missing (U+0000):     "����"               |
3.3.5  6-byte sequence with last byte missing (U+0000):     "�����"               |
3.3.6  2-byte sequence with last byte missing (U-000007FF): "�"               |
3.3.7  3-byte sequence with last byte missing (U-0000FFFF): "�"               |
3.3.8  4-byte sequence with last byte missing (U-001FFFFF): "���"               |
3.3.9  5-byte sequence with last byte missing (U-03FFFFFF): "����"               |
3.3.10 6-byte sequence with last byte missing (U-7FFFFFFF): "�����"               |
                                                                              |
3.4  Concatenation of incomplete sequences                                    |
                                                                              |
All the 10 sequences of 3.3 concatenated, you should see 10 malformed         |
sequences being signalled:                                                    |
                                                                              |
   "�����������������������������"                                                               |
                                                                              |
3.5  Impossible bytes                                                         |
                                                                              |
The following two bytes cannot appear in a correct UTF-8 string               |
                                                                              |
3.5.1  fe = "�"                                                               |
3.5.2  ff = "�"                                                               |
3.5.3  fe fe ff ff = "����"                                                   |
                                                                              |
4  Overlong sequences                                                         |
                                                                              |
The following sequences are not malformed according to the letter of          |
the Unicode 2.0 standard. However, they are longer then necessary and         |
a correct UTF-8 encoder is not allowed to produce them. A "safe UTF-8         |
decoder" should reject them just like malformed sequences for two             |
reasons: (1) It helps to debug applications if overlong sequences are         |
not treated as valid representations of characters, because this helps        |
to spot problems more quickly. (2) Overlong sequences provide                 |
alternative representations of characters, that could maliciously be          |
used to bypass filters that check only for ASCII characters. For              |
instance, a 2-byte encoded line feed (LF) would not be caught by a            |
line counter that counts only 0x0a bytes, but it would still be               |
processed as a line feed by an unsafe UTF-8 decoder later in the              |
pipeline. From a security point of view, ASCII compatibility of UTF-8         |
sequences means also, that ASCII characters are *only* allowed to be          |
represented by ASCII bytes in the range 0x00-0x7f. To ensure this             |
aspect of ASCII compatibility, use only "safe UTF-8 decoders" that            |
reject overlong UTF-8 sequences for which a shorter encoding exists.          |
                                                                              |
4.1  Examples of an overlong ASCII character                                  |
                                                                              |
With a safe UTF-8 decoder, all of the following five overlong                 |
representations of the ASCII character slash ("/") should be rejected         |
like a malformed UTF-8 sequence, for instance by substituting it with         |
a replacement character. If you see a slash below, you do not have a          |
safe UTF-8 decoder!                                                           |
                                                                              |
4.1.1 U+002F = c0 af             = "��"                                        |
4.1.2 U+002F = e0 80 af          = "���"                                        |
4.1.3 U+002F = f0 80 80 af       = "����"                                        |
4.1.4 U+002F = f8 80 80 80 af    = "�����"                                        |
4.1.5 U+002F = fc 80 80 80 80 af = "������"                                        |
                                                                              |
4.2  Maximum overlong sequences                                               |
                                                                              |
Below you see the highest Unicode value that is still resulting in an         |
overlong sequence if represented with the given number of bytes. This         |
is a boundary test for safe UTF-8 decoders. All five characters should        |
be rejected like malformed UTF-8 sequences.                                   |
                                                                              |
4.2.1  U-0000007F = c1 bf             = "��"                                   |
4.2.2  U-000007FF = e0 9f bf          = "���"                                   |
4.2.3  U-0000FFFF = f0 8f bf bf       = "����"                                   |
4.2.4  U-001FFFFF = f8 87 bf bf bf    = "�����"                                   |
4.2.5  U-03FFFFFF = fc 83 bf bf bf bf = "������"                                   |
                                                                              |
4.3  Overlong representation of the NUL character                             |
                                                                              |
The following five sequences should also be rejected like malformed           |
UTF-8 sequences and should not be treated like the ASCII NUL                  |
character.                                                                    |
                                                                              |
4.3.1  U+0000 = c0 80             = "��"                                       |
4.3.2  U+0000 = e0 80 80          = "���"                                       |
4.3.3  U+0000 = f0 80 80 80       = "����"                                       |
4.3.4  U+0000 = f8 80 80 80 80    = "�����"                                       |
4.3.5  U+0000 = fc 80 80 80 80 80 = "������"                                       |
                                                                              |
5  Illegal code positions                                                     |
                                                                              |
The following UTF-8 sequences should be rejected like malformed               |
sequences, because they never represent valid ISO 10646 characters and        |
a UTF-8 decoder that accepts them might introduce security problems           |
comparable to overlong UTF-8 sequences.                                       |
                                                                              |
5.1 Single UTF-16 surrogates                                                  |
                                                                              |
5.1.1  U+D800 = ed a0 80 = "���"                                                |
5.1.2  U+DB7F = ed ad bf = "���"                                                |
5.1.3  U+DB80 = ed ae 80 = "���"                                                |
5.1.4  U+DBFF = ed af bf = "���"                                                |
5.1.5  U+DC00 = ed b0 80 = "���"                                                |
5.1.6  U+DF80 = ed be 80 = "���"                                                |
5.1.7  U+DFFF = ed bf bf = "���"                                                |
                                                                              |
5.2 Paired UTF-16 surrogates                                                  |
                                                                              |
5.2.1  U+D800 U+DC00 = ed a0 80 ed b0 80 = "������"                               |
5.2.2  U+D800 U+DFFF = ed a0 80 ed bf bf = "������"                               |
5.2.3  U+DB7F U+DC00 = ed ad bf ed b0 80 = "������"                               |
5.2.4  U+DB7F U+DFFF = ed ad bf ed bf bf = "������"                               |
5.2.5  U+DB80 U+DC00 = ed ae 80 ed b0 80 = "������"                               |
5.2.6  U+DB80 U+DFFF = ed ae 80 ed bf bf = "������"                               |
5.2.7  U+DBFF U+DC00 = ed af bf ed b0 80 = "������"                               |
5.2.8  U+DBFF U+DFFF = ed af bf ed bf bf = "������"                               |
                                                                              |
5.3 Noncharacter code positions                                               |
                                                                              |
The following "noncharacters" are "reserved for internal use" by              |
applications, and according to older versions of the Unicode Standard         |
"should never be interchanged". Unicode Corrigendum #9 dropped the            |
latter restriction. Nevertheless, their presence in incoming UTF-8 data       |
can remain a potential security risk, depending on what use is made of        |
these codes subsequently. Examples of such internal use:                      |
                                                                              |
 - Some file APIs with 16-bit characters may use the integer value -1         |
   = U+FFFF to signal an end-of-file (EOF) or error condition.                |
                                                                              |
 - In some UTF-16 receivers, code point U+FFFE might trigger a                |
   byte-swap operation (to convert between UTF-16LE and UTF-16BE).            |
                                                                              |
With such internal use of noncharacters, it may be desirable and safer        |
to block those code points in UTF-8 decoders, as they should never            |
occur legitimately in incoming UTF-8 data, and could trigger unsafe           |
behaviour in subsequent processing.                                           |
                                                                              |
Particularly problematic noncharacters in 16-bit applications:                |
                                                                              |
5.3.1  U+FFFE = ef bf be = "￾"                                                |
5.3.2  U+FFFF = ef bf bf = "￿"                                                |
                                                                              |
Other noncharacters:                                                          |
                                                                              |
5.3.3  U+FDD0 .. U+FDEF = "﷐﷑﷒﷓﷔﷕﷖﷗﷘﷙﷚﷛﷜﷝﷞﷟﷠﷡﷢﷣﷤﷥﷦﷧﷨﷩﷪﷫﷬﷭﷮﷯"|
                                                                              |
5.3.4  U+nFFFE U+nFFFF (for n = 1..10)                                        |
                                                                              |
       "🿾🿿𯿾𯿿𿿾𿿿񏿾񏿿񟿾񟿿񯿾񯿿񿿾񿿿򏿾򏿿                                    |
        򟿾򟿿򯿾򯿿򿿾򿿿󏿾󏿿󟿾󟿿󯿾󯿿󿿾󿿿􏿾􏿿"                                   |
                                                                              |
THE END                                                                       |


UTF-8 encoded sample plain-text file
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾

Markus Kuhn [ˈmaʳkʊs kuːn] <http://www.cl.cam.ac.uk/~mgk25/> — 2002-07-25 CC BY


The ASCII compatible UTF-8 encoding used in this plain-text file
is defined in Unicode, ISO 10646-1, and RFC 2279.


Using Unicode/UTF-8, you can write in emails and source code things such as

Mathematics and sciences:

  ∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i),      ⎧⎡⎛┌─────┐⎞⎤⎫
                                            ⎪⎢⎜│a²+b³ ⎟⎥⎪
  ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β),    ⎪⎢⎜│───── ⎟⎥⎪
                                            ⎪⎢⎜⎷ c₈   ⎟⎥⎪
  ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ,                   ⎨⎢⎜       ⎟⎥⎬
                                            ⎪⎢⎜ ∞     ⎟⎥⎪
  ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (⟦A⟧ ⇔ ⟪B⟫),      ⎪⎢⎜ ⎲     ⎟⎥⎪
                                            ⎪⎢⎜ ⎳aⁱ-bⁱ⎟⎥⎪
  2H₂ + O₂ ⇌ 2H₂O, R = 4.7 kΩ, ⌀ 200 mm     ⎩⎣⎝i=1    ⎠⎦⎭

Linguistics and dictionaries:

  ði ıntəˈnæʃənəl fəˈnɛtık əsoʊsiˈeıʃn
  Y [ˈʏpsilɔn], Yen [jɛn], Yoga [ˈjoːgɑ]

APL:

  ((V⍳V)=⍳⍴V)/V←,V    ⌷←⍳→⍴∆∇⊃‾⍎⍕⌈

Nicer typography in plain text files:

  ╔══════════════════════════════════════════╗
  ║                                          ║
  ║   • ‘single’ and “double” quotes         ║
  ║                                          ║
  ║   • Curly apostrophes: “We’ve been here” ║
  ║                                          ║
  ║   • Latin-1 apostrophe and accents: '´`  ║
  ║                                          ║
  ║   • ‚deutsche‘ „Anführungszeichen“       ║
  ║                                          ║
  ║   • †, ‡, ‰, •, 3–4, —, −5/+5, ™, …      ║
  ║                                          ║
  ║   • ASCII safety test: 1lI|, 0OD, 8B     ║
  ║                      ╭─────────╮         ║
  ║   • the euro symbol: │ 14.95 € │         ║
  ║                      ╰─────────╯         ║
  ╚══════════════════════════════════════════╝

Combining characters:

  STARGΛ̊TE SG-1, a = v̇ = r̈, a⃑ ⊥ b⃑

Greek (in Polytonic):

  The Greek anthem:

  Σὲ γνωρίζω ἀπὸ τὴν κόψη
  τοῦ σπαθιοῦ τὴν τρομερή,
  σὲ γνωρίζω ἀπὸ τὴν ὄψη
  ποὺ μὲ βία μετράει τὴ γῆ.

  ᾿Απ᾿ τὰ κόκκαλα βγαλμένη
  τῶν ῾Ελλήνων τὰ ἱερά
  καὶ σὰν πρῶτα ἀνδρειωμένη
  χαῖρε, ὦ χαῖρε, ᾿Ελευθεριά!

  From a speech of Demosthenes in the 4th century BC:

  Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν, ὦ ἄνδρες ᾿Αθηναῖοι,
  ὅταν τ᾿ εἰς τὰ πράγματα ἀποβλέψω καὶ ὅταν πρὸς τοὺς
  λόγους οὓς ἀκούω· τοὺς μὲν γὰρ λόγους περὶ τοῦ
  τιμωρήσασθαι Φίλιππον ὁρῶ γιγνομένους, τὰ δὲ πράγματ᾿
  εἰς τοῦτο προήκοντα,  ὥσθ᾿ ὅπως μὴ πεισόμεθ᾿ αὐτοὶ
  πρότερον κακῶς σκέψασθαι δέον. οὐδέν οὖν ἄλλο μοι δοκοῦσιν
  οἱ τὰ τοιαῦτα λέγοντες ἢ τὴν ὑπόθεσιν, περὶ ἧς βουλεύεσθαι,
  οὐχὶ τὴν οὖσαν παριστάντες ὑμῖν ἁμαρτάνειν. ἐγὼ δέ, ὅτι μέν
  ποτ᾿ ἐξῆν τῇ πόλει καὶ τὰ αὑτῆς ἔχειν ἀσφαλῶς καὶ Φίλιππον
  τιμωρήσασθαι, καὶ μάλ᾿ ἀκριβῶς οἶδα· ἐπ᾿ ἐμοῦ γάρ, οὐ πάλαι
  γέγονεν ταῦτ᾿ ἀμφότερα· νῦν μέντοι πέπεισμαι τοῦθ᾿ ἱκανὸν
  προλαβεῖν ἡμῖν εἶναι τὴν πρώτην, ὅπως τοὺς συμμάχους
  σώσομεν. ἐὰν γὰρ τοῦτο βεβαίως ὑπάρξῃ, τότε καὶ περὶ τοῦ
  τίνα τιμωρήσεταί τις καὶ ὃν τρόπον ἐξέσται σκοπεῖν· πρὶν δὲ
  τὴν ἀρχὴν ὀρθῶς ὑποθέσθαι, μάταιον ἡγοῦμαι περὶ τῆς
  τελευτῆς ὁντινοῦν ποιεῖσθαι λόγον.

  Δημοσθένους, Γ´ ᾿Ολυνθιακὸς

Georgian:

  From a Unicode conference invitation:

  გთხოვთ ახლავე გაიაროთ რეგისტრაცია Unicode-ის მეათე საერთაშორისო
  კონფერენციაზე დასასწრებად, რომელიც გაიმართება 10-12 მარტს,
  ქ. მაინცში, გერმანიაში. კონფერენცია შეჰკრებს ერთად მსოფლიოს
  ექსპერტებს ისეთ დარგებში როგორიცაა ინტერნეტი და Unicode-ი,
  ინტერნაციონალიზაცია და ლოკალიზაცია, Unicode-ის გამოყენება
  ოპერაციულ სისტემებსა, და გამოყენებით პროგრამებში, შრიფტებში,
  ტექსტების დამუშავებასა და მრავალენოვან კომპიუტერულ სისტემებში.

Russian:

  From a Unicode conference invitation:

  Зарегистрируйтесь сейчас на Десятую Международную Конференцию по
  Unicode, которая состоится 10-12 марта 1997 года в Майнце в Германии.
  Конференция соберет широкий круг экспертов по  вопросам глобального
  Интернета и Unicode, локализации и интернационализации, воплощению и
  применению Unicode в различных операционных системах и программных
  приложениях, шрифтах, верстке и многоязычных компьютерных системах.

Thai (UCS Level 2):

  Excerpt from a poetry on The Romance of The Three Kingdoms (a Chinese
  classic 'San Gua'):

  [----------------------------|------------------------]
    ๏ แผ่นดินฮั่นเสื่อมโทรมแสนสังเวช  พระปกเกศกองบู๊กู้ขึ้นใหม่
  สิบสองกษัตริย์ก่อนหน้าแลถัดไป       สององค์ไซร้โง่เขลาเบาปัญญา
    ทรงนับถือขันทีเป็นที่พึ่ง           บ้านเมืองจึงวิปริตเป็นนักหนา
  โฮจิ๋นเรียกทัพทั่วหัวเมืองมา         หมายจะฆ่ามดชั่วตัวสำคัญ
    เหมือนขับไสไล่เสือจากเคหา      รับหมาป่าเข้ามาเลยอาสัญ
  ฝ่ายอ้องอุ้นยุแยกให้แตกกัน          ใช้สาวนั้นเป็นชนวนชื่นชวนใจ
    พลันลิฉุยกุยกีกลับก่อเหตุ          ช่างอาเพศจริงหนาฟ้าร้องไห้
  ต้องรบราฆ่าฟันจนบรรลัย           ฤๅหาใครค้ำชูกู้บรรลังก์ ฯ

  (The above is a two-column text. If combining characters are handled
  correctly, the lines of the second column should be aligned with the
  | character above.)

Ethiopian:

  Proverbs in the Amharic language:

  ሰማይ አይታረስ ንጉሥ አይከሰስ።
  ብላ ካለኝ እንደአባቴ በቆመጠኝ።
  ጌጥ ያለቤቱ ቁምጥና ነው።
  ደሀ በሕልሙ ቅቤ ባይጠጣ ንጣት በገደለው።
  የአፍ ወለምታ በቅቤ አይታሽም።
  አይጥ በበላ ዳዋ ተመታ።
  ሲተረጉሙ ይደረግሙ።
  ቀስ በቀስ፥ ዕንቁላል በእግሩ ይሄዳል።
  ድር ቢያብር አንበሳ ያስር።
  ሰው እንደቤቱ እንጅ እንደ ጉረቤቱ አይተዳደርም።
  እግዜር የከፈተውን ጉሮሮ ሳይዘጋው አይድርም።
  የጎረቤት ሌባ፥ ቢያዩት ይስቅ ባያዩት ያጠልቅ።
  ሥራ ከመፍታት ልጄን ላፋታት።
  ዓባይ ማደሪያ የለው፥ ግንድ ይዞ ይዞራል።
  የእስላም አገሩ መካ የአሞራ አገሩ ዋርካ።
  ተንጋሎ ቢተፉ ተመልሶ ባፉ።
  ወዳጅህ ማር ቢሆን ጨርስህ አትላሰው።
  እግርህን በፍራሽህ ልክ ዘርጋ።

Runes:

  ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ ᚦᚫᛗ ᛚᚪᚾᛞᛖ ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ ᚹᛁᚦ ᚦᚪ ᚹᛖᛥᚫ

  (Old English, which transcribed into Latin reads 'He cwaeth that he
  bude thaem lande northweardum with tha Westsae.' and means 'He said
  that he lived in the northern land near the Western Sea.')

Braille:

  ⡌⠁⠧⠑ ⠼⠁⠒  ⡍⠜⠇⠑⠹⠰⠎ ⡣⠕⠌

  ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠙⠑⠁⠙⠒ ⠞⠕ ⠃⠑⠛⠔ ⠺⠊⠹⠲ ⡹⠻⠑ ⠊⠎ ⠝⠕ ⠙⠳⠃⠞
  ⠱⠁⠞⠑⠧⠻ ⠁⠃⠳⠞ ⠹⠁⠞⠲ ⡹⠑ ⠗⠑⠛⠊⠌⠻ ⠕⠋ ⠙⠊⠎ ⠃⠥⠗⠊⠁⠇ ⠺⠁⠎
  ⠎⠊⠛⠝⠫ ⠃⠹ ⠹⠑ ⠊⠇⠻⠛⠹⠍⠁⠝⠂ ⠹⠑ ⠊⠇⠻⠅⠂ ⠹⠑ ⠥⠝⠙⠻⠞⠁⠅⠻⠂
  ⠁⠝⠙ ⠹⠑ ⠡⠊⠑⠋ ⠍⠳⠗⠝⠻⠲ ⡎⠊⠗⠕⠕⠛⠑ ⠎⠊⠛⠝⠫ ⠊⠞⠲ ⡁⠝⠙
  ⡎⠊⠗⠕⠕⠛⠑⠰⠎ ⠝⠁⠍⠑ ⠺⠁⠎ ⠛⠕⠕⠙ ⠥⠏⠕⠝ ⠰⡡⠁⠝⠛⠑⠂ ⠋⠕⠗ ⠁⠝⠹⠹⠔⠛ ⠙⠑
  ⠡⠕⠎⠑ ⠞⠕ ⠏⠥⠞ ⠙⠊⠎ ⠙⠁⠝⠙ ⠞⠕⠲

  ⡕⠇⠙ ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠁⠎ ⠙⠑⠁⠙ ⠁⠎ ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲

  ⡍⠔⠙⠖ ⡊ ⠙⠕⠝⠰⠞ ⠍⠑⠁⠝ ⠞⠕ ⠎⠁⠹ ⠹⠁⠞ ⡊ ⠅⠝⠪⠂ ⠕⠋ ⠍⠹
  ⠪⠝ ⠅⠝⠪⠇⠫⠛⠑⠂ ⠱⠁⠞ ⠹⠻⠑ ⠊⠎ ⠏⠜⠞⠊⠊⠥⠇⠜⠇⠹ ⠙⠑⠁⠙ ⠁⠃⠳⠞
  ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲ ⡊ ⠍⠊⠣⠞ ⠙⠁⠧⠑ ⠃⠑⠲ ⠔⠊⠇⠔⠫⠂ ⠍⠹⠎⠑⠇⠋⠂ ⠞⠕
  ⠗⠑⠛⠜⠙ ⠁ ⠊⠕⠋⠋⠔⠤⠝⠁⠊⠇ ⠁⠎ ⠹⠑ ⠙⠑⠁⠙⠑⠌ ⠏⠊⠑⠊⠑ ⠕⠋ ⠊⠗⠕⠝⠍⠕⠝⠛⠻⠹
  ⠔ ⠹⠑ ⠞⠗⠁⠙⠑⠲ ⡃⠥⠞ ⠹⠑ ⠺⠊⠎⠙⠕⠍ ⠕⠋ ⠳⠗ ⠁⠝⠊⠑⠌⠕⠗⠎
  ⠊⠎ ⠔ ⠹⠑ ⠎⠊⠍⠊⠇⠑⠆ ⠁⠝⠙ ⠍⠹ ⠥⠝⠙⠁⠇⠇⠪⠫ ⠙⠁⠝⠙⠎
  ⠩⠁⠇⠇ ⠝⠕⠞ ⠙⠊⠌⠥⠗⠃ ⠊⠞⠂ ⠕⠗ ⠹⠑ ⡊⠳⠝⠞⠗⠹⠰⠎ ⠙⠕⠝⠑ ⠋⠕⠗⠲ ⡹⠳
  ⠺⠊⠇⠇ ⠹⠻⠑⠋⠕⠗⠑ ⠏⠻⠍⠊⠞ ⠍⠑ ⠞⠕ ⠗⠑⠏⠑⠁⠞⠂ ⠑⠍⠏⠙⠁⠞⠊⠊⠁⠇⠇⠹⠂ ⠹⠁⠞
  ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠁⠎ ⠙⠑⠁⠙ ⠁⠎ ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲

  (The first couple of paragraphs of "A Christmas Carol" by Dickens)

Compact font selection example text:

  ABCDEFGHIJKLMNOPQRSTUVWXYZ /0123456789
  abcdefghijklmnopqrstuvwxyz £©µÀÆÖÞßéöÿ
  –—‘“”„†•…‰™œŠŸž€ ΑΒΓΔΩαβγδω АБВГДабвгд
  ∀∂∈ℝ∧∪≡∞ ↑↗↨↻⇣ ┐┼╔╘░►☺♀ ﬁ�⑀₂ἠḂӥẄɐː⍎אԱა

Greetings in various languages:

  Hello world, Καλημέρα κόσμε, コンニチハ

Box drawing alignment tests:                                          █
                                                                      ▉
  ╔══╦══╗  ┌──┬──┐  ╭──┬──╮  ╭──┬──╮  ┏━━┳━━┓  ┎┒┏┑   ╷  ╻ ┏┯┓ ┌┰┐    ▊ ╱╲╱╲╳╳╳
  ║┌─╨─┐║  │╔═╧═╗│  │╒═╪═╕│  │╓─╁─╖│  ┃┌─╂─┐┃  ┗╃╄┙  ╶┼╴╺╋╸┠┼┨ ┝╋┥    ▋ ╲╱╲╱╳╳╳
  ║│╲ ╱│║  │║   ║│  ││ │ ││  │║ ┃ ║│  ┃│ ╿ │┃  ┍╅╆┓   ╵  ╹ ┗┷┛ └┸┘    ▌ ╱╲╱╲╳╳╳
  ╠╡ ╳ ╞╣  ├╢   ╟┤  ├┼─┼─┼┤  ├╫─╂─╫┤  ┣┿╾┼╼┿┫  ┕┛┖┚     ┌┄┄┐ ╎ ┏┅┅┓ ┋ ▍ ╲╱╲╱╳╳╳
  ║│╱ ╲│║  │║   ║│  ││ │ ││  │║ ┃ ║│  ┃│ ╽ │┃  ░░▒▒▓▓██ ┊  ┆ ╎ ╏  ┇ ┋ ▎
  ║└─╥─┘║  │╚═╤═╝│  │╘═╪═╛│  │╙─╀─╜│  ┃└─╂─┘┃  ░░▒▒▓▓██ ┊  ┆ ╎ ╏  ┇ ┋ ▏
  ╚══╩══╝  └──┴──┘  ╰──┴──╯  ╰──┴──╯  ┗━━┻━━┛  ▗▄▖▛▀▜   └╌╌┘ ╎ ┗╍╍┛ ┋  ▁▂▃▄▅▆▇█
                                               ▝▀▘▙▄▟

Sanskrit: ﻿काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥
Sanskrit (standard transcription): kācaṃ śaknomyattum; nopahinasti mām.
Classical Greek: ὕαλον ϕαγεῖν δύναμαι· τοῦτο οὔ με βλάπτει.
Greek (monotonic): Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
Greek (polytonic): Μπορῶ νὰ φάω σπασμένα γυαλιὰ χωρὶς νὰ πάθω τίποτα.
Etruscan: (NEEDED)
Latin: Vitrum edere possum; mihi non nocet.
Old French: Je puis mangier del voirre. Ne me nuit.
French: Je peux manger du verre, ça ne me fait pas mal.
Provençal / Occitan: Pòdi manjar de veire, me nafrariá pas.
Québécois: J'peux manger d'la vitre, ça m'fa pas mal.
Walloon: Dji pou magnî do vêre, çoula m' freut nén må.
Champenois: (NEEDED)
Lorrain: (NEEDED)
Picard: Ch'peux mingi du verre, cha m'foé mie n'ma.
Corsican/Corsu: (NEEDED)
Jèrriais: (NEEDED)
Kreyòl Ayisyen (Haitï): Mwen kap manje vè, li pa blese'm.
Basque: Kristala jan dezaket, ez dit minik ematen.
Catalan / Català: Puc menjar vidre, que no em fa mal.
Spanish: Puedo comer vidrio, no me hace daño.
Aragonés: Puedo minchar beire, no me'n fa mal .
Aranés: (NEEDED)
Mallorquín: (NEEDED)
Galician: Eu podo xantar cristais e non cortarme.
European Portuguese: Posso comer vidro, não me faz mal.
Brazilian Portuguese (8): Posso comer vidro, não me machuca.
Caboverdiano/Kabuverdianu (Cape Verde): M' podê cumê vidru, ca ta maguâ-m'.
Papiamentu: Ami por kome glas anto e no ta hasimi daño.
Italian: Posso mangiare il vetro e non mi fa male.
Milanese: Sôn bôn de magnà el véder, el me fa minga mal.
Roman: Me posso magna' er vetro, e nun me fa male.
Napoletano: M' pozz magna' o'vetr, e nun m' fa mal.
Venetian: Mi posso magnare el vetro, no'l me fa mae.
Zeneise (Genovese): Pòsso mangiâ o veddro e o no me fà mâ.
Sicilian: Puotsu mangiari u vitru, nun mi fa mali.
Campinadese (Sardinia): (NEEDED)
Lugudorese (Sardinia): (NEEDED)
Romansch (Grischun): Jau sai mangiar vaider, senza che quai fa donn a mai.
Romany / Tsigane: (NEEDED)
Romanian: Pot să mănânc sticlă și ea nu mă rănește.
Esperanto: Mi povas manĝi vitron, ĝi ne damaĝas min.
Pictish: (NEEDED)
Breton: (NEEDED)
Cornish: Mý a yl dybry gwéder hag éf ny wra ow ankenya.
Welsh: Dw i'n gallu bwyta gwydr, 'dyw e ddim yn gwneud dolur i mi.
Manx Gaelic: Foddym gee glonney agh cha jean eh gortaghey mee.
Old Irish (Ogham): ᚛᚛ᚉᚑᚅᚔᚉᚉᚔᚋ ᚔᚈᚔ ᚍᚂᚐᚅᚑ ᚅᚔᚋᚌᚓᚅᚐ᚜
Old Irish (Latin): Con·iccim ithi nglano. Ním·géna.
Irish: Is féidir liom gloinne a ithe. Ní dhéanann sí dochar ar bith dom.
Ulster Gaelic: Ithim-sa gloine agus ní miste damh é.
Scottish Gaelic: S urrainn dhomh gloinne ithe; cha ghoirtich i mi.
Anglo-Saxon (Runes): ᛁᚳ᛫ᛗᚨᚷ᛫ᚷᛚᚨᛋ᛫ᛖᚩᛏᚪᚾ᛫ᚩᚾᛞ᛫ᚻᛁᛏ᛫ᚾᛖ᛫ᚻᛖᚪᚱᛗᛁᚪᚧ᛫ᛗᛖ᛬
Anglo-Saxon (Latin): Ic mæg glæs eotan ond hit ne hearmiað me.
Middle English: Ich canne glas eten and hit hirtiþ me nouȝt.
English: I can eat glass and it doesn't hurt me.
English (IPA): [aɪ kæn iːt glɑːs ænd ɪt dɐz nɒt hɜːt miː] (Received Pronunciation)
English (Braille): ⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀⠁⠝⠙⠀⠊⠞⠀⠙⠕⠑⠎⠝⠞⠀⠓⠥⠗⠞⠀⠍⠑
Jamaican: Mi kian niam glas han i neba hot mi.
Lalland Scots / Doric: Ah can eat gless, it disnae hurt us.
Glaswegian: (NEEDED)
Gothic (4): 𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.
Old Norse (Runes): ᛖᚴ ᚷᛖᛏ ᛖᛏᛁ ᚧ ᚷᛚᛖᚱ ᛘᚾ ᚦᛖᛋᛋ ᚨᚧ ᚡᛖ ᚱᚧᚨ ᛋᚨᚱ
Old Norse (Latin): Ek get etið gler án þess að verða sár.
Norsk / Norwegian (Nynorsk): Eg kan eta glas utan å skada meg.
Norsk / Norwegian (Bokmål): Jeg kan spise glass uten å skade meg.
Føroyskt / Faroese: Eg kann eta glas, skaðaleysur.
Íslenska / Icelandic: Ég get etið gler án þess að meiða mig.
Svenska / Swedish: Jag kan äta glas utan att skada mig.
Dansk / Danish: Jeg kan spise glas, det gør ikke ondt på mig.
Sønderjysk: Æ ka æe glass uhen at det go mæ naue.
Frysk / Frisian: Ik kin glês ite, it docht me net sear.
Nederlands / Dutch: Ik kan glas eten, het doet mĳ geen kwaad.
Kirchröadsj/Bôchesserplat: Iech ken glaas èèse, mer 't deet miech jing pieng.
Afrikaans: Ek kan glas eet, maar dit doen my nie skade nie.
Lëtzebuergescht / Luxemburgish: Ech kan Glas iessen, daat deet mir nët wei.
Deutsch / German: Ich kann Glas essen, ohne mir zu schaden.
Ruhrdeutsch: Ich kann Glas verkasematuckeln, ohne dattet mich wat jucken tut.
Langenfelder Platt: Isch kann Jlaas kimmeln, uuhne datt mich datt weh dääd.
Lausitzer Mundart ("Lusatian"): Ich koann Gloos assn und doas dudd merr ni wii.
Odenwälderisch: Iech konn glaasch voschbachteln ohne dass es mir ebbs daun doun dud.
Sächsisch / Saxon: 'sch kann Glos essn, ohne dass'sch mer wehtue.
Pfälzisch: Isch konn Glass fresse ohne dasses mer ebbes ausmache dud.
Schwäbisch / Swabian: I kå Glas frässa, ond des macht mr nix!
Deutsch (Voralberg): I ka glas eassa, ohne dass mar weh tuat.
Bayrisch / Bavarian: I koh Glos esa, und es duard ma ned wei.
Allemannisch: I kaun Gloos essen, es tuat ma ned weh.
Schwyzerdütsch (Zürich): Ich chan Glaas ässe, das schadt mir nöd.
Schwyzerdütsch (Luzern): Ech cha Glâs ässe, das schadt mer ned.
Plautdietsch: (NEEDED)
Hungarian: Meg tudom enni az üveget, nem lesz tőle bajom.
Suomi / Finnish: Voin syödä lasia, se ei vahingoita minua.
Sami (Northern): Sáhtán borrat lása, dat ii leat bávččas.
Erzian: Мон ярсан суликадо, ды зыян эйстэнзэ а ули.
Northern Karelian: Mie voin syvvä lasie ta minla ei ole kipie.
Southern Karelian: Minä voin syvvä st'oklua dai minule ei ole kibie.
Vepsian: (NEEDED)
Votian: (NEEDED)
Livonian: (NEEDED)
Estonian: Ma võin klaasi süüa, see ei tee mulle midagi.
Latvian: Es varu ēst stiklu, tas man nekaitē.
Lithuanian: Aš galiu valgyti stiklą ir jis manęs nežeidžia
Old Prussian: (NEEDED)
Sorbian (Wendish): (NEEDED)
Czech: Mohu jíst sklo, neublíží mi.
Slovak: Môžem jesť sklo. Nezraní ma.
Polska / Polish: Mogę jeść szkło i mi nie szkodzi.
Slovenian: Lahko jem steklo, ne da bi mi škodovalo.
Bosnian, Croatian, Montenegrin and Serbian (Latin): Ja mogu jesti staklo, i to mi ne šteti.
Bosnian, Montenegrin and Serbian (Cyrillic): Ја могу јести стакло, и то ми не штети.
Macedonian: Можам да јадам стакло, а не ме штета.
Russian: Я могу есть стекло, оно мне не вредит.
Belarusian (Cyrillic): Я магу есці шкло, яно мне не шкодзіць.
Belarusian (Lacinka): Ja mahu jeści škło, jano mne ne škodzić.
Ukrainian: Я можу їсти скло, і воно мені не зашкодить.
Bulgarian: Мога да ям стъкло, то не ми вреди.
Georgian: მინას ვჭამ და არა მტკივა.
Armenian: Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։
Albanian: Unë mund të ha qelq dhe nuk më gjen gjë.
Turkish: Cam yiyebilirim, bana zararı dokunmaz.
Turkish (Ottoman): جام ييه بلورم بڭا ضررى طوقونمز
Tatar: Алам да бар, пыяла, әмма бу ранит мине.
Uzbek / O’zbekcha: (Roman): Men shisha yeyishim mumkin, ammo u menga zarar keltirmaydi.
Uzbek / Ўзбекча (Cyrillic): Мен шиша ейишим мумкин, аммо у менга зарар келтирмайди.
Bangla / Bengali: আমি কাঁচ খেতে পারি, তাতে আমার কোনো ক্ষতি হয় না।
Marathi (masculine): मी काच खाऊ शकतो, मला ते दुखत नाही.
Marathi (feminine):   मी काच खाऊ शकते, मला ते दुखत नाही.
Kannada: ನನಗೆ ಹಾನಿ ಆಗದೆ, ನಾನು ಗಜನ್ನು ತಿನಬಹುದು
Hindi (masculine): मैं काँच खा सकता हूँ और मुझे उससे कोई चोट नहीं पहुंचती.
Hindi (feminine):   मैं काँच खा सकती हूँ और मुझे उससे कोई चोट नहीं पहुंचती.
Malayalam: എനിക്ക് ഗ്ലാസ് തിന്നാം. അതെന്നെ വേദനിപ്പിക്കില്ല.
Tamil: நான் கண்ணாடி சாப்பிடுவேன், அதனால் எனக்கு ஒரு கேடும் வராது.
Telugu: నేను గాజు తినగలను మరియు అలా చేసినా నాకు ఏమి ఇబ్బంది లేదు
Sinhalese: මට වීදුරු කෑමට හැකියි. එයින් මට කිසි හානියක් සිදු නොවේ.
Urdu(3): میں کانچ کھا سکتا ہوں اور مجھے تکلیف نہیں ہوتی ۔
Pashto(3): زه شيشه خوړلې شم، هغه ما نه خوږوي
Farsi / Persian(3): .من می توانم بدونِ احساس درد شيشه بخورم
Arabic(3): أنا قادر على أكل الزجاج و هذا لا يؤلمني.
Aramaic: (NEEDED)
Maltese: Nista' niekol il-ħġieġ u ma jagħmilli xejn.
Hebrew(3): אני יכול לאכול זכוכית וזה לא מזיק לי.
Yiddish(3): איך קען עסן גלאָז און עס טוט מיר נישט װײ.
Judeo-Arabic: (NEEDED)
Ladino: (NEEDED)
Gǝʼǝz: (NEEDED)
Amharic: (NEEDED)
Twi: Metumi awe tumpan, ɜnyɜ me hwee.
Hausa (Latin): Inā iya taunar gilāshi kuma in gamā lāfiyā.
Hausa (Ajami) (2): إِنا إِىَ تَونَر غِلَاشِ كُمَ إِن غَمَا لَافِىَا
Yoruba(4): Mo lè je̩ dígí, kò ní pa mí lára.
Lingala: Nakokí kolíya biténi bya milungi, ekosála ngáí mabé tɛ́.
(Ki)Swahili: Naweza kula bilauri na sikunyui.
Malay: Saya boleh makan kaca dan ia tidak mencederakan saya.
Tagalog: Kaya kong kumain nang bubog at hindi ako masaktan.
Chamorro: Siña yo' chumocho krestat, ti ha na'lalamen yo'.
Fijian: Au rawa ni kana iloilo, ia au sega ni vakacacani kina.
Javanese: Aku isa mangan beling tanpa lara.
Burmese (Unicode 4.0): က္ယ္ဝန္‌တော္‌၊က္ယ္ဝန္‌မ မ္ယက္‌စားနုိင္‌သည္‌။ ၎က္ရောင္‌့ ထိခုိက္‌မ္ဟု မရ္ဟိပာ။ (9)
Burmese (Unicode 5.0): ကျွန်တော် ကျွန်မ မှန်စားနိုင်တယ်။ ၎င်းကြောင့် ထိခိုက်မှုမရှိပါ။ (9)
Vietnamese (quốc ngữ): Tôi có thể ăn thủy tinh mà không hại gì.
Vietnamese (nôm) (4): 些 𣎏 世 咹 水 晶 𦓡 空 𣎏 害 咦
Khmer: ខ្ញុំអាចញុំកញ្ចក់បាន ដោយគ្មានបញ្ហារ
Lao: ຂອ້ຍກິນແກ້ວໄດ້ໂດຍທີ່ມັນບໍ່ໄດ້ເຮັດໃຫ້ຂອ້ຍເຈັບ.
Thai: ฉันกินกระจกได้ แต่มันไม่ทำให้ฉันเจ็บ
Mongolian (Cyrillic): Би шил идэй чадна, надад хортой биш
Mongolian (Classic) (5): ᠪᠢ ᠰᠢᠯᠢ ᠢᠳᠡᠶᠦ ᠴᠢᠳᠠᠨᠠ ᠂ ᠨᠠᠳᠤᠷ ᠬᠣᠤᠷᠠᠳᠠᠢ ᠪᠢᠰᠢ
Dzongkha: (NEEDED)
Nepali: ﻿म काँच खान सक्छू र मलाई केहि नी हुन्‍न् ।
Tibetan: ཤེལ་སྒོ་ཟ་ནས་ང་ན་གི་མ་རེད།
Chinese: 我能吞下玻璃而不伤身体。
Chinese (Traditional): 我能吞下玻璃而不傷身體。
Taiwanese(6): Góa ē-tàng chia̍h po-lê, mā bē tio̍h-siong.
Japanese: 私はガラスを食べられます。それは私を傷つけません。
Korean: 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요
Bislama: Mi save kakae glas, hemi no save katem mi.
Hawaiian: Hiki iaʻu ke ʻai i ke aniani; ʻaʻole nō lā au e ʻeha.
Marquesan: E koʻana e kai i te karahi, mea ʻā, ʻaʻe hauhau.
Inuktitut (10): ᐊᓕᒍᖅ ᓂᕆᔭᕌᖓᒃᑯ ᓱᕋᙱᑦᑐᓐᓇᖅᑐᖓ
Chinook Jargon: Naika məkmək kakshət labutay, pi weyk ukuk munk-sik nay.
Navajo: Tsésǫʼ yishą́ągo bííníshghah dóó doo shił neezgai da.
Cherokee (and Cree, Chickasaw, Cree, Micmac, Ojibwa, Lakota, Náhuatl, Quechua, Aymara, and other American languages): (NEEDED)
Garifuna: (NEEDED)
Gullah: (NEEDED)
Lojban: mi kakne le nu citka le blaci .iku'i le se go'i na xrani mi
Nórdicg: Ljœr ye caudran créneþ ý jor cẃran.
''']

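# Round-trip every sample: decode(encode(q)) must reproduce q exactly.
failures = 0
for q in QQQ:
    tokens = tokenizer_rust.encode(q)
    if q != tokenizer_rust.decode(tokens):
        failures += 1
        print('ERROR', q)

# Report success only when no sample failed.
if failures == 0:
    print('All OK\n')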

All OK
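
Sections 4.2 to 5.2 of the test text say that overlong sequences and UTF-16 surrogates should be rejected like malformed UTF-8, and those entries appear in the string as U+FFFD replacement characters, which is consistent with a lossy decode of the listed bytes. The sketch below is an illustration only and is not part of the original notebook. It feeds a few of the listed byte sequences to Python's built-in UTF-8 decoder, once in strict mode and once with `errors="replace"`. The names `MALFORMED` and `strict_decode_fails` are hypothetical helpers introduced for this example.

```python
# Byte sequences copied from sections 4.2, 4.3 and 5.1 of the test text.
MALFORMED = {
    "overlong U+007F (4.2.1)": bytes.fromhex("c1bf"),
    "overlong NUL (4.3.1)": bytes.fromhex("c080"),
    "overlong NUL (4.3.2)": bytes.fromhex("e08080"),
    "lone surrogate U+D800 (5.1.1)": bytes.fromhex("eda080"),
}


def strict_decode_fails(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder rejects `raw`."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


for label, raw in MALFORMED.items():
    # errors="replace" substitutes U+FFFD for every undecodable part.
    lossy = raw.decode("utf-8", errors="replace")
    print(f"{label}: rejected={strict_decode_fails(raw)}, lossy={lossy!r}")
```

Because a Python `str` cannot hold these malformed bytes, the round-trip loop above exercises the tokenizer on the replacement characters rather than on the raw byte sequences themselves.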

