In [2]:
!pip install transformers tokenizers



In [68]:
from tokenizers import (ByteLevelBPETokenizer,
                            CharBPETokenizer,
                            SentencePieceBPETokenizer,
                            BertWordPieceTokenizer)

In [69]:
import csv

## Create data

In [77]:
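# Collect the second column of every row in the semicolon-separated CSV.
with open("bbcsport_train.csv", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter=";")
    data = [row[1] for row in reader]

# Write the entries to a plain-text file, separated by blank lines, for tokenizer training.
with open("bbcsport_train.txt", "w", encoding="utf-8") as out:
    out.write("\n\n".join(data))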

In [83]:
tokens = []
test_text = """Sri Lankans cleared of misconduct. Då är öarna döda sportåäö. Two Sri Lanka cricketers have been cleared of misconduct dating back to the ICC Champions Trophy in 2004.  Avishka Gunawardene and Kaushal Lokuarachchi were both the subject of an official disciplinary inquiry after allegations of drunken misconduct. A Colombo newspaper had made the claims after a defeat against England in Southampton which led to Sri Lanka exiting the tournament early. But the disciplinary panel could find no evidence against the players. Sri Lanka Cricket chief executive Duleep Mendis said: "Nobody was prepared to give evidence and there was absolutely no evidence to substantiate the article's allegations." Gunawardene, 27, a hard-hitting opener, and Lokuarachchi, a 22-year-old leg-spinning all-rounder, were both dropped from the national squad after Sri Lanka's tour of Pakistan in October."""
for t in (ByteLevelBPETokenizer, CharBPETokenizer, SentencePieceBPETokenizer, BertWordPieceTokenizer):
    tokenizer = t()
    tokenizer.train(["bbcsport_train.txt"], vocab_size=5000)
    out = tokenizer.encode(test_text)
    tokens.append(out.tokens)

In [84]:
print(test_text)
for t in tokens:
    print(" ".join(t))
    #print([i.encode('utf-8') for i in t])
    print("---")

Sri Lankans cleared of misconduct. Då är öarna döda sportåäö. Two Sri Lanka cricketers have been cleared of misconduct dating back to the ICC Champions Trophy in 2004.  Avishka Gunawardene and Kaushal Lokuarachchi were both the subject of an official disciplinary inquiry after allegations of drunken misconduct. A Colombo newspaper had made the claims after a defeat against England in Southampton which led to Sri Lanka exiting the tournament early. But the disciplinary panel could find no evidence against the players. Sri Lanka Cricket chief executive Duleep Mendis said: "Nobody was prepared to give evidence and there was absolutely no evidence to substantiate the article's allegations." Gunawardene, 27, a hard-hitting opener, and Lokuarachchi, a 22-year-old leg-spinning all-rounder, were both dropped from the national squad after Sri Lanka's tour of Pakistan in October.
Sri ĠLankans Ġcleared Ġof Ġmisconduct . ĠD Ã ¥ Ġ Ã ¤ r Ġ Ã ¶ ar na Ġd Ã ¶ d a Ġsport Ã ¥ Ã ¤ Ã ¶ . ĠTwo ĠSri ĠLanka Ġ

# BPE
BPE, or byte pair encoding, stems from a compression algorithm published in 1994. Its original purpose was to compress data by repeatedly replacing the most frequent pair of adjacent bytes with a single byte that does not occur in the data. Repeating this over multiple passes builds up progressively longer units.

Instead of replacing the most common units with placeholder symbols in order to compress the text, it is possible to use the same technique to build a vocabulary.

Run the same algorithm, but instead of replacing the most frequent pair, merge it into a new token and consider that token a part of your vocabulary.
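
The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not the implementation used by the `tokenizers` library: the function name `learn_bpe`, the toy word counts and the tie-breaking rule (the first pair encountered wins) are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} mapping (toy sketch)."""
    # Start with every word split into single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Pick the most frequent pair (assumption: ties go to the first one seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into a single symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: word -> number of occurrences.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=4)
print(merges)
print(list(segmented))
```

Each merged pair becomes one vocabulary entry, so the number of merges directly controls the vocabulary size.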
# Split by space and call it a day?
 - Discuss shortcomings of this simple method
     - Language specific
         - Normalisation
 - Mention the vocabulary size of word2vec and similar models, and why a word-level vocabulary is not really viable (it cannot deal with rare or unseen words)
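
A small sketch of why splitting on whitespace falls short. The two sentences are made up for the illustration and `build_vocab` is a hypothetical helper, not part of any library.

```python
def build_vocab(text):
    """Build a vocabulary by naive whitespace splitting, with no normalisation."""
    return set(text.split())


# Made-up example sentences.
train_text = "Sri Lanka won the match. The players were cleared."
vocab = build_vocab(train_text)

new_text = "The player was cleared, and the matches continue."
# Every token that was not seen verbatim during "training" is unknown.
unknown = [tok for tok in new_text.split() if tok not in vocab]
print(unknown)
```

Punctuation sticks to the words ("match." and "cleared," are separate entries from "match" and "cleared"), casing creates duplicates ("The" and "the"), and inflected forms such as "matches" are unknown even though "match." was seen.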

# 1. Character level BPE
Source: https://www.aclweb.org/anthology/P16-1162.pdf

Each word is first split into its characters, and a special `</w>` marker is attached to the final character to indicate the end of the word. BPE is then run to build subword units by merging the most frequent adjacent pairs.
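
The starting point before any merges can be sketched as below. The helper names `to_initial_symbols` and `detokenise` are hypothetical and this only mimics the `</w>` convention, not the actual `CharBPETokenizer` internals.

```python
END_OF_WORD = "</w>"


def to_initial_symbols(text):
    """Split each whitespace-separated word into characters and mark the last one."""
    symbols = []
    for word in text.split():
        chars = list(word)
        chars[-1] += END_OF_WORD  # the marker records where the word ended
        symbols.extend(chars)
    return symbols


def detokenise(symbols):
    """Glue symbols back together, turning each end-of-word marker into a space."""
    return "".join(symbols).replace(END_OF_WORD, " ").strip()


symbols = to_initial_symbols("low lower")
print(" ".join(symbols))
print(detokenise(symbols))
```

Because the marker survives merging, word boundaries can still be recovered once symbols have been fused into longer subword units.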


## Example code

In [125]:
s = "low lower lowest newer wider finder"
with open('test_string.txt', 'w') as out:
    out.write(s)
tokeniser = CharBPETokenizer()
print(tokeniser)

print(f"\n{'Vocab size':<10} - {'Encoded test string'}")
for vocab_size in [100, 250, 500, 1000]:
    tokeniser.train(['bbcsport_train.txt'], vocab_size=vocab_size)
    encoded = tokeniser.encode(s)
    print(f"{vocab_size:>10} - {' '.join(encoded.tokens)}")

Tokenizer(vocabulary_size=0, model=BPE, unk_token=<unk>, suffix=</w>, dropout=None, lowercase=False, unicode_normalizer=None, bert_normalizer=True, split_on_whitespace_only=False)

Vocab size - Encoded test string
       100 - l o w</w> l o w e r</w> l o w e s t</w> n e w e r</w> w i d e r</w> f i n d e r</w>
       250 - lo w</w> lo w er</w> lo w e st</w> ne w er</w> w i d er</w> f in d er</w>
       500 - lo w</w> lo w er</w> lo w est</w> ne w er</w> w i der</w> fin der</w>
      1000 - low</w> low er</w> low est</w> ne w er</w> wi der</w> fin der</w>


## Comments on CharLevel BPE
- Not an entirely lossless encoding.
- Requires language-specific segmentation for preprocessing.

Q: Is there an initial set of characters or does it adapt to the dataset at hand?
A: It does not seem to encode åäö in any particularly good way (the dots are simply removed from these characters). This is likely an effect of the `bert_normalizer=True` setting shown in the tokeniser configuration above rather than of BPE itself.

# 2. Byte level
Character-level tokenisation struggles with character-rich languages (Chinese, Japanese) because the base alphabet is much larger. This often leads to unnecessarily large vocabularies, which slows down processing. Byte-level tokenisation addresses this issue by initially representing each character as its UTF-8 byte sequence, which is between 1 and 4 bytes long. The 256 possible byte values include the ASCII set, which means that unaccented Latin letters, digits and common punctuation are each a single byte and appear in their original form, while accented characters such as å, ä and ö take two bytes.
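
A quick check of the 1 to 4 byte claim using only Python's built-in UTF-8 encoder. The sample characters are arbitrary picks for the illustration.

```python
# One sample character for each UTF-8 length (arbitrary choices).
samples = ["a", "å", "中", "😀"]

byte_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths[ch] = len(encoded)
    print(ch, len(encoded), list(encoded))
```

Whatever the script, every character ends up as a sequence drawn from the same 256 byte values, so the base alphabet never grows beyond 256 symbols.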

## Example

In [145]:
s = " Swédish example: när åt Örjan? (When did Orjan eat?)"

test_file_name = "test_text_with_swedish_characters.txt"
with open(test_file_name, "w") as out:
    out.write(s)

tokeniser = ByteLevelBPETokenizer()
print(tokeniser)
print(f"\n{'Vocab size':<10} - {'Encoded test string'}")
for vocab_size in [100, 500, 1000]:
    tokeniser.train(["bbcsport_train.txt"], vocab_size=vocab_size)
    encoded = tokeniser.encode(s)
    print(f"{vocab_size:>10} - {' '.join(encoded.tokens)}")

Tokenizer(vocabulary_size=0, model=ByteLevelBPE, add_prefix_space=False, lowercase=False, dropout=None, unicode_normalizer=None, continuing_subword_prefix=None, end_of_word_suffix=None, trim_offsets=False)

Vocab size - Encoded test string
       100 - Ġ S w Ã © d i s h Ġ e x a m p l e : Ġ n Ã ¤ r Ġ Ã ¥ t Ġ Ã ĸ r j a n ? Ġ ( W h e n Ġ d i d Ġ O r j a n Ġ e a t ? )
       500 - ĠS w Ã © d is h Ġex am p le : Ġn Ã ¤ r Ġ Ã ¥ t Ġ Ã ĸ r j an ? Ġ( W hen Ġd id ĠO r j an Ġe at ? )
      1000 - ĠS w Ã © d ish Ġex amp le : Ġn Ã ¤ r Ġ Ã ¥ t Ġ Ã ĸ r j an ? Ġ( W hen Ġdid ĠO r j an Ġe at ? )


In [150]:
print("é".encode('utf-8'))
print("ä".encode("utf-8"))
print("Ã ¤".encode('utf-8'))

b'\xc3\xa9'
b'\xc3\xa4'
b'\xc3\x83 \xc2\xa4'


In [148]:
tokeniser.decode(encoded.ids)

' Swédish example: när åt Örjan? (When did Orjan eat?)'

While the output above looks weird, there is structure to it. Each space is replaced by the special "Ġ" character (U+0120), and the other unusual symbols are likewise printable stand-ins for single bytes rather than real characters of the text.

## Notes
This new sequence of "characters" is then processed through the BPE algorithm in order to create tokens from the most common byte-level "subword" units. The original paper gives a great example: it shows the learnt tokens as the allowed vocabulary size increases. As the size grows, higher-level representations (often almost entire words) are stored, while at the smallest sizes the representations are more raw (closer to the original byte representation).

Resources: https://arxiv.org/pdf/1909.03341.pdf, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

### Questions for ByteLevel encodings
It does not seem like the original byte representation is used. Encoding "å" with UTF-8 results in a different set of bytes compared to what is used for the BPE process. Why is that? What has happened?
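
A likely explanation: the tokens are printed using one printable stand-in character per byte, so re-encoding a token such as "Ã ¤" with UTF-8 encodes the stand-ins, not the original bytes. The sketch below re-implements the byte-to-character table published with GPT-2. That the `tokenizers` library uses the same table is an assumption, although it is consistent with the output shown above ("Ġ" for a space, "Ã ĸ" for "Ö").

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 scheme)."""
    # Bytes that already correspond to printable characters keep their code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {}
    offset = 0
    for b in range(256):
        if b in printable:
            mapping[b] = chr(b)
        else:
            # Space, control bytes etc. are shifted to code points above U+00FF.
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}


def show(text):
    """Render the UTF-8 bytes of `text` as their printable stand-ins."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))


def restore(shown):
    """Invert `show`: map stand-ins back to bytes and decode them as UTF-8."""
    return bytes(char_to_byte[c] for c in shown).decode("utf-8")


print(show(" å"), show("Ö"))
print(restore(show(" när åt Örjan?")))
```

Under this table "å" (bytes `0xC3 0xA5`) is displayed as "Ã" followed by "¥", which matches the tokens above, and decoding maps the stand-ins back to the original bytes.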


# 3. SentencePiece
Addresses some issues with previous tokenisers:
1. They require language-specific pre-tokenisation. I think this basically means that we need to know the word units before the new vocabulary can be built from subwords. For European languages these can often be generated simply by splitting on spaces etc. This is however not perfect and requires extra work, especially when moving to languages such as Chinese and Japanese, where whitespace is not used to separate words.
2. Tokenisation is not reversible, i.e. we cannot go from tokenised text back to the original without ambiguity.
3. Reproducibility of the tokenisers described in research.
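
Point 2 is handled by treating whitespace as an ordinary symbol: spaces are replaced by the "▁" meta character before segmentation, so the original text can be recovered by concatenating the pieces. The sketch below only illustrates that idea. The helper names and the hand-picked segmentation are assumptions, and it does not use the SentencePiece library itself.

```python
META = "\u2581"  # "▁", the whitespace marker


def escape_whitespace(text):
    """Replace every space with the meta symbol so it becomes part of the pieces."""
    return text.replace(" ", META)


def detokenise(pieces):
    """Concatenate the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(META, " ")


text = "Hello world."
escaped = escape_whitespace(text)

# A hand-picked segmentation of the escaped text (any segmentation would do).
pieces = ["Hello", META + "wor", "ld", "."]
restored = detokenise(pieces)
print(escaped, pieces, restored)
```

Since no information about spacing is thrown away, detokenisation needs no language-specific rules.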

Implemented through BPE (the SentencePiece library also supports a unigram language model).
Utilises the character normalisation defined by the Unicode Standard Normalisation Forms (e.g. NFC and NFKC).
## Unicode Standard Normalisation Form
Character sequences can be canonically equivalent or compatible.
Canonically equivalent sequences should be displayed and treated in the same way, while a compatible sequence may be replaced by its compatible counterpart only in some contexts.
> Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
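
The difference between the two notions can be seen with Python's standard `unicodedata` module. The sample strings are arbitrary picks for the illustration.

```python
import unicodedata

# Canonical equivalence: "e" + combining acute accent vs. precomposed "é".
decomposed = "e\u0301"
composed = "\u00e9"
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed

# Compatibility: the "ﬁ" ligature is compatible with the two letters "fi".
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)    # left unchanged by NFC
nfkc_ligature = unicodedata.normalize("NFKC", ligature)  # replaced by "fi"

print(decomposed == composed, nfc_equal)
print(nfc_ligature == ligature, nfkc_ligature)
```

NFC only unifies canonically equivalent sequences, whereas NFKC also folds compatible forms such as ligatures, which changes the text more aggressively.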


In [157]:
s = "lower lowest newer finder, åäö entrécôte"
with open('test_string.txt', 'w') as out:
    out.write(s)
tokeniser = SentencePieceBPETokenizer()
print(tokeniser)

print(f"\n{'Vocab size':<10} - {'Encoded test string'}")
for vocab_size in [100, 250, 500, 1000, 5000]:
    tokeniser.train(['bbcsport_train.txt'], vocab_size=vocab_size)
    encoded = tokeniser.encode(s)
    print(f"{vocab_size:>10} - {' '.join(encoded.tokens)}")
print(tokeniser.decode(encoded.ids))

Tokenizer(vocabulary_size=0, model=SentencePieceBPE, unk_token=<unk>, replacement=▁, add_prefix_space=True, dropout=None)

Vocab size - Encoded test string
       100 - ▁ l o w er ▁ l o w e s t ▁ n e w er ▁f in d er , ▁ ▁ e n t r c t e
       250 - ▁l ow er ▁l ow est ▁n e w er ▁f in d er , ▁ ▁ ent r c t e
       500 - ▁l ow er ▁l ow est ▁ne w er ▁fin d er , ▁ ▁ ent r ct e
      1000 - ▁l ow er ▁l ow est ▁new er ▁fin d er, ▁ ▁ ent r ct e
      5000 - ▁lower ▁low est ▁new er ▁find er, ▁ ▁ent r ct e
lower lowest newer finder,  entrcte


In [161]:
from tokenizers import pre_tokenizers


## WordPiece

Resources: https://arxiv.org/pdf/1609.08144.pdf

In [61]:
byte_level_alphabet = pre_tokenizers.
byte_encoded_alphabet = [c.encode('utf-8') for c in byte_level_alphabet]
print(len(byte_level_alphabet))
for b, c in zip(byte_level_alphabet, byte_encoded_alphabet):
    print((b,c))

256
('ğ', b'\xc4\x9f')
('^', b'^')
('Z', b'Z')
('â', b'\xc3\xa2')
('Ö', b'\xc3\x96')
('8', b'8')
('q', b'q')
('Ć', b'\xc4\x86')
('c', b'c')
('Ä', b'\xc3\x84')
('¹', b'\xc2\xb9')
('ć', b'\xc4\x87')
('Ë', b'\xc3\x8b')
('V', b'V')
('T', b'T')
('Ì', b'\xc3\x8c')
('G', b'G')
('"', b'"')
('¸', b'\xc2\xb8')
('Õ', b'\xc3\x95')
('Ó', b'\xc3\x93')
('´', b'\xc2\xb4')
('ķ', b'\xc4\xb7')
('Ç', b'\xc3\x87')
('\\', b'\\')
('ĝ', b'\xc4\x9d')
('Ė', b'\xc4\x96')
('Ù', b'\xc3\x99')
('ĸ', b'\xc4\xb8')
('y', b'y')
('ü', b'\xc3\xbc')
('¡', b'\xc2\xa1')
('Æ', b'\xc3\x86')
('l', b'l')
('%', b'%')
('¾', b'\xc2\xbe')
('ħ', b'\xc4\xa7')
('ĩ', b'\xc4\xa9')
('Ł', b'\xc5\x81')
('Ĵ', b'\xc4\xb4')
('¬', b'\xc2\xac')
('«', b'\xc2\xab')
('ī', b'\xc4\xab')
('Ċ', b'\xc4\x8a')
('ď', b'\xc4\x8f')
('Ļ', b'\xc4\xbb')
('ł', b'\xc5\x82')
('.', b'.')
('S', b'S')
('_', b'_')
('(', b'(')
('U', b'U')
('ô', b'\xc3\xb4')
('Ń', b'\xc5\x83')
('æ', b'\xc3\xa6')
('ª', b'\xc2\xaa')
('W', b'W')
('=', b'=')
('ċ', b'\xc4\x8b')
('k', b'k')
(

In [82]:
"ĠD Ã ¥ Ġ Ã ¤ r Ġ Ã ¶ ar na Ġd Ã ¶ d a Ġsport Ã ¥ Ã ¤ Ã ¶ ".encode("utf-8")

b'\xc4\xa0D \xc3\x83 \xc2\xa5 \xc4\xa0 \xc3\x83 \xc2\xa4 r \xc4\xa0 \xc3\x83 \xc2\xb6 ar na \xc4\xa0d \xc3\x83 \xc2\xb6 d a \xc4\xa0sport \xc3\x83 \xc2\xa5 \xc3\x83 \xc2\xa4 \xc3\x83 \xc2\xb6 '

In [None]:
ĠD Ã ¥ Ġ Ã ¤ r Ġ Ã ¶ ar na Ġd Ã ¶ d a 