# Byte Pair Encoding 
## Implementing the algorithm to better understand the moving parts

BPE first prepares the text, then repeats a merge step until the required vocabulary size is reached:
- pre-tokenize the text to get individual words
- pad each word with a boundary marker (the code below appends `#` to the end of the word)
- repeatedly compute the frequencies of adjacent token pairs; the pair with the highest frequency is merged into a single token and added to the vocabulary
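
The steps above can be sketched end to end as follows. This is a minimal sketch rather than the notebook's own implementation: the function name `train_bpe`, its parameters, and the tuple-of-symbols representation are assumptions made for illustration. Pair counts are weighted by word frequency, and punctuation is left attached to the words to keep the example short.

```python
from collections import Counter

def train_bpe(text, vocab_size, end_marker="#"):
    """Learn merge rules until the vocabulary reaches `vocab_size` (hypothetical helper)."""
    # Pre-tokenize on whitespace; each word becomes a tuple of symbols plus an end marker.
    word_freqs = Counter(text.lower().split())
    corpus = {tuple(word) + (end_marker,): freq for word, freq in word_freqs.items()}
    vocab = {symbol for symbols in corpus for symbol in symbols}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_freqs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break  # every word is already a single token

        best = pair_freqs.most_common(1)[0][0]
        merged = "".join(best)
        merges.append(best)
        vocab.add(merged)

        # Rewrite every word, replacing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus

    return vocab, merges, corpus

vocab, merges, corpus = train_bpe("abc abd abe", vocab_size=7)
print(merges)  # [('a', 'b')]
```

The returned `merges` list records the merge rules in the order they were learned, which is the order a tokenizer would need to replay them in.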

### Data structures
- vocab: the set of unique tokens; it must be updated after every new merge rule
- corpus: the pre-tokenized words, each with its frequency
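
As a small illustration of these two structures, the sketch below builds them for a four-word toy input. The input string and the variable name `words` are assumptions for the example; the `#` end marker follows the code later in this notebook.

```python
from collections import Counter

words = "the hero the villain".split()

# corpus: each distinct word (with the end marker "#") mapped to its frequency
corpus = Counter(word + "#" for word in words)

# vocab: the unique symbols currently in use; it starts as the set of characters
vocab = set()
for word in corpus:
    vocab.update(word)

print(corpus)         # Counter({'the#': 2, 'hero#': 1, 'villain#': 1})
print(sorted(vocab))  # ['#', 'a', 'e', 'h', 'i', 'l', 'n', 'o', 'r', 't', 'v']
```

After a merge such as `t` + `h`, the new token `th` would be added to `vocab`, which is why the set has to be refreshed after every merge rule.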

In [18]:
from collections import Counter

In [20]:
lang = """
I am the a hero. 
I am looking for a villain because I am bored. 
What is the point of being a hero if there is no villain. 
Then all I am is a person, and that is not fun enough. 
So, villain, where are you?
"""

In [42]:
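# punctuation characters stripped from the ends of each word;
# defined here because it is the default argument of remove_punctuation
punct = list(".?,!")

def pad_word_with_end_token(word: str):
    # append '#' as an end-of-word marker
    if word is None:
        return None
    else:
        return word + '#'

# remove punctuation from the start and end of a word
def remove_punctuation(word: str, punct: list = punct):
    if len(word) == 0:
        return None

    # str.strip only touches the ends; str.replace would remove every occurrence
    word = word.strip("".join(punct))

    word = pad_word_with_end_token(word)
    return word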

In [51]:
punct = list(".?,!")
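# word (with end marker) -> number of occurrences in the text
tokens = [remove_punctuation(w) for w in lang.lower().split()]
corpus = Counter(tokens)

# character counts over the distinct words (word frequency is not applied here)
tokenized_corpus = Counter()

for word in corpus.keys():
    tokenized_corpus.update(word)

corpus, tokenized_corpus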

# adjacent character pair -> count, taken over the distinct words in the corpus
iter_frequencies = Counter()

for w in corpus:
    for merged_token in zip(w, w[1:]):
        merged_token = "".join(merged_token)
        iter_frequencies.update([merged_token])
for byte_pair in iter_frequencies.most_common(1):
    # add the pair to the tokenized_corpus
    # remove the counts from the fragments of the byte pair, remove entry if count goes to 0
    ...

iter_frequencies
    

Counter({'i#': 1,
         'am': 1,
         'm#': 1,
         'th': 4,
         'he': 5,
         'e#': 5,
         'a#': 1,
         'er': 4,
         'ro': 1,
         'o#': 3,
         'lo': 1,
         'oo': 1,
         'ok': 1,
         'ki': 1,
         'in': 4,
         'ng': 2,
         'g#': 2,
         'fo': 1,
         'or': 2,
         'r#': 1,
         'vi': 1,
         'il': 1,
         'll': 2,
         'la': 1,
         'ai': 1,
         'n#': 4,
         'be': 2,
         'ec': 1,
         'ca': 1,
         'au': 1,
         'us': 1,
         'se': 1,
         'bo': 1,
         're': 4,
         'ed': 1,
         'd#': 2,
         'wh': 2,
         'ha': 2,
         'at': 2,
         't#': 4,
         'is': 1,
         's#': 1,
         'po': 1,
         'oi': 1,
         'nt': 1,
         'of': 1,
         'f#': 2,
         'ei': 1,
         'if': 1,
         'no': 3,
         'en': 2,
         'al': 1,
         'l#': 1,
         'pe': 1,
         'rs': 1,
         '
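
The counts above come from iterating over the distinct words, so a pair in a word that occurs several times is still counted once per word. BPE as usually described weights each pair by the frequency of the word it appears in. The sketch below contrasts the two on a three-word toy corpus; `pair_counts` and its `weighted` flag are hypothetical names introduced only for this comparison.

```python
from collections import Counter

def pair_counts(corpus, weighted):
    """Count adjacent character pairs over the words in `corpus` (word -> frequency)."""
    counts = Counter()
    for word, freq in corpus.items():
        for left, right in zip(word, word[1:]):
            # weighted: add the word's frequency; unweighted: add 1 per distinct word
            counts[left + right] += freq if weighted else 1
    return counts

corpus = Counter({"the#": 2, "hero#": 2, "then#": 1})
print(pair_counts(corpus, weighted=False)["he"])  # 3
print(pair_counts(corpus, weighted=True)["he"])   # 5
```

Which pair wins the merge can change between the two schemes, so the choice matters once the corpus has repeated words.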

In [15]:
a = "allo."
a.replace(a[-1], "")  # str.replace removes every occurrence of the character


'allo'
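
Because `str.replace` removes every occurrence of a character, using it to trim punctuation can also delete characters inside a word. A possible alternative is `str.strip`, which only removes characters from the two ends. The helper name `strip_punctuation` and the constant `PUNCT` below are assumptions for this sketch, not part of the notebook.

```python
PUNCT = ".?,!"

def strip_punctuation(word, punct=PUNCT):
    """Remove punctuation from both ends of `word`, leaving the interior untouched."""
    return word.strip(punct)

print("u.s.".replace(".", ""))     # us
print(strip_punctuation("u.s."))   # u.s
print(strip_punctuation("allo."))  # allo
```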