## Training Corpus
We first create our corpus (the training data).

In [2]:
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

for doc in corpus:
    print(doc)

This is the first document.
This document is the second document.
And this is the third one.
Is this the first document?


## Initial Vocabulary
Next we build the initial vocabulary, which consists of the unique characters found in the corpus.

First, collect the unique characters:

In [3]:
# (no duplicates are allowed in sets)
unique_chars = set()

# add chars from corpus to set
for doc in corpus:
    for char in doc:
        unique_chars.add(char)

In [4]:
print(unique_chars)

{'A', '?', ' ', '.', 'u', 't', 's', 'm', 'f', 'I', 'd', 'n', 'T', 'c', 'h', 'i', 'o', 'e', 'r'}


We now convert it into a list.
(Sets are unordered and cannot be indexed, so a sorted list gives each character a stable position.)

In [5]:
vocab = list(unique_chars)
vocab.sort()        # sort for consistency and reproducibility

In [6]:
print(vocab)

[' ', '.', '?', 'A', 'I', 'T', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']
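
The same sorted character list can be produced in one expression. This is only an equivalent sketch, not the notebook's own cell; `compact_vocab` is a name introduced here for illustration, and the corpus is repeated so the snippet runs on its own.

```python
# Corpus repeated here so the snippet is self-contained.
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Joining the documents yields one long string; set() drops duplicate
# characters, and sorted() returns them as a list in code point order.
compact_vocab = sorted(set("".join(corpus)))

print(compact_vocab)
print(len(compact_vocab))
```

Because `sorted()` always returns a list, the separate `list(...)` and `.sort()` steps collapse into a single call.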


Add an **end of word** token.  
> It marks where each word ends, so the model can tell words apart and does not form irrelevant character pairs across word boundaries.

In [8]:
end_of_word = '/<w>'
vocab.append(end_of_word)

In [7]:
print('Initial Vocabulary:')
print(vocab)
print(f'size: {len(vocab)}')

Initial Vocabulary:
[' ', '.', '?', 'A', 'I', 'T', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u', '/<w>']
size: 20
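
To see what the marker contributes, the sketch below lists the adjacent pairs of a single word with and without it. `adjacent_pairs` is a hypothetical helper written for this illustration; the marker string is assumed to be the same `'/<w>'` used in the notebook.

```python
end_of_word = '/<w>'  # assumed to match the marker used in the notebook

def adjacent_pairs(symbols):
    """Return every pair of neighbouring symbols (illustrative helper)."""
    return list(zip(symbols, symbols[1:]))

# Without the marker, the word-final position of 's' is not represented.
without_marker = adjacent_pairs(list("is"))

# With the marker, the final 's' forms a pair with the end-of-word token.
with_marker = adjacent_pairs(list("is") + [end_of_word])

print(without_marker)  # [('i', 's')]
print(with_marker)     # [('i', 's'), ('s', '/<w>')]
```

The extra pair `('s', '/<w>')` lets a word-final `s` be counted separately from an `s` in the middle of a word such as "first".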


## Pre-Tokenization

Here, we split the corpus into words, and then each word into characters.
- to split into words, we'll use the space character
- we will append the end-of-word token (`/<w>`, stored in `end_of_word`) to each word

In [None]:
word_splits = {}

for doc in corpus:

    # splitting by ' ' character
    words = doc.split(' ')

    for word in words:
        
        char_list = list(word) + [end_of_word]      # split the word into characters and append the end-of-word token

        # convert to a tuple because dictionary keys must be immutable (hashable)
        word_tuple = tuple(char_list)

        if word_tuple not in word_splits:
            word_splits[word_tuple] = 0
        word_splits[word_tuple] += 1                # increment the count each time the word is seen

print('\nThe final dictionary with word count:')
print(word_splits)


The final dictionary with word count:
{('T', 'h', 'i', 's', '/<w>'): 2, ('i', 's', '/<w>'): 3, ('t', 'h', 'e', '/<w>'): 4, ('f', 'i', 'r', 's', 't', '/<w>'): 2, ('d', 'o', 'c', 'u', 'm', 'e', 'n', 't', '.', '/<w>'): 2, ('d', 'o', 'c', 'u', 'm', 'e', 'n', 't', '/<w>'): 1, ('s', 'e', 'c', 'o', 'n', 'd', '/<w>'): 1, ('A', 'n', 'd', '/<w>'): 1, ('t', 'h', 'i', 's', '/<w>'): 2, ('t', 'h', 'i', 'r', 'd', '/<w>'): 1, ('o', 'n', 'e', '.', '/<w>'): 1, ('I', 's', '/<w>'): 1, ('d', 'o', 'c', 'u', 'm', 'e', 'n', 't', '?', '/<w>'): 1}
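
The same counts can likely be obtained with `collections.Counter`, which handles the "initialise to zero, then increment" bookkeeping itself. This is an alternative sketch rather than the notebook's approach; `word_counts` is a name introduced here, and the corpus and marker are repeated so the snippet is self-contained.

```python
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
end_of_word = '/<w>'  # assumed to match the marker used in the notebook

# tuple(word) splits a string into its characters; appending a one-element
# tuple adds the end-of-word marker. Counter tallies identical tuples.
word_counts = Counter(
    tuple(word) + (end_of_word,)
    for doc in corpus
    for word in doc.split(' ')
)

print(len(word_counts))                              # number of distinct words
print(word_counts[('t', 'h', 'e', end_of_word)])     # occurrences of "the"
```

Note that splitting on spaces keeps punctuation attached, so `document`, `document.` and `document?` are counted as three different words, as in the output above.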


In [55]:
word_tuple

('d', 'o', 'c', 'u', 'm', 'e', 'n', 't', '?', '/<w>')

## Helper Function: get_pair_stats  
This function pairs adjacent symbols within each word and sums their frequencies, weighting each pair by the count of the word it appears in.  
Example:  
**input** =  
``` {('T', 'h', 'i', 's', '/<w>'): 2, ('i', 's', '/<w>'): 2, ...} ```
  
**output** =  
``` {('i', 's'): 4, ('s', '/<w>'): 4, ('T', 'h'): 2, ...} ```


In [None]:
import collections

def get_pair_stats(splits):
    # defaultdict(int) returns 0 for a missing key, so new pairs need no explicit initialisation
    pair_counts = collections.defaultdict(int)

    for word_tuple, freq in splits.items():
        symbols = list(word_tuple)                  # converting tuple to list

        for i in range(len(symbols) - 1):           # stop one short of the end so symbols[i+1] stays in range
            pair = (symbols[i], symbols[i+1])       # pair each symbol with the one that follows it
            pair_counts[pair] += freq               # add the frequency of the word
    
    return pair_counts
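
As a quick check of the pair-counting idea, the sketch below runs a `zip`-based variant on a two-word toy input modelled on the example above. `count_adjacent_pairs` and `toy_splits` are names made up for this illustration; the variant is meant to behave like the helper described in this section, not to replace it.

```python
import collections

def count_adjacent_pairs(splits):
    """Count adjacent symbol pairs, weighted by word frequency (illustrative variant)."""
    pair_counts = collections.defaultdict(int)
    for word_tuple, freq in splits.items():
        # zip stops at the shorter sequence, so the last symbol is never
        # paired with a non-existent successor.
        for pair in zip(word_tuple, word_tuple[1:]):
            pair_counts[pair] += freq
    return pair_counts

# Toy input: two words, each seen twice.
toy_splits = {
    ('T', 'h', 'i', 's', '/<w>'): 2,
    ('i', 's', '/<w>'): 2,
}

stats = count_adjacent_pairs(toy_splits)
print(dict(stats))
```

The pair `('i', 's')` occurs in both words, so its count is 2 + 2 = 4, while `('T', 'h')` occurs only in the first word and gets 2.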
