### Exercise 1 - Implementing a tokenizer
Implement a basic whitespace tokenizer in Python from scratch, without using any NLP
libraries. The tokenizer should drop whitespace and create tokens for the following cases:

(a) End-of-sentence (EOS) symbols, brackets, and separators

(b) Abbreviations - Assume these are limited to the following: Ph.D., Dr., M.Sc.

(c) Special characters as in prices, kept together with the amount as a single token (e.g. $45.55)

(d) Dates - Assume that they follow the format dd/mm/yy (e.g. 01/02/06)

(e) URLs - Assume that they follow the format:
http[s]://[...] (e.g. https://www.stanford.edu)

(f) Hashtags, each as a separate token (e.g. #nlproc)

(g) Email addresses - Assume that they follow the format:
name@domain.xyz (e.g. someOne@brown.edu)

Apply your code to the test example below, which should yield the specified tokens:

In [10]:
import re

In [78]:
test_text = "He has a M.Sc. in Math and she has a Ph.D. in NLP. A session costs 45.55$ or $50.00. As of 01/02/06, please email X/Y at someone@brown.edu or visit http://www.stanford.edu and if link does not work try https://www.stanford.edu instead. #test#test2#nlproc"

In [79]:
class VanillaTokenizer():
    def __init__(self):
        self.abbreviations = ["Ph.D.", "Dr.", "M.Sc."]
        self.abbr_pattern = r'Ph\.D\.|Dr\.|M\.Sc\.'
        self.separators = [".", ",", "/", "(", ")", "!"]
        self.separators_pattern = r'[.,/()!]'
        self.date_pattern = r'\b\d{2}/\d{2}/\d{2,4}\b'
        self.prices_pattern = r'\$\d+(?:\.\d{1,2})?|\d+(?:\.\d{1,2})?\$'
        self.urls_pattern = r'https?://(?:www\.)?[a-zA-Z0-9\-]+\.[a-zA-Z]{2,}'
        self.emails_pattern = r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]+'
        self.hashtags_pattern = r'\#[a-zA-Z0-9]+'
        self.words = r'[a-zA-Z0-9]+'
        self.combined_pattern = f'{self.date_pattern}|{self.abbr_pattern}|{self.prices_pattern}|{self.urls_pattern}|{self.emails_pattern}|{self.hashtags_pattern}|{self.words}|{self.separators_pattern}'
    
    def tokenize(self, text):
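        # Scan left to right; at each position the first matching alternative in
        # combined_pattern wins, so specific patterns (dates, prices, URLs) precede plain words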
        tokens = re.findall(self.combined_pattern, text)
        return tokens

In [80]:
tokenizer = VanillaTokenizer()
tokens = tokenizer.tokenize(test_text)

In [82]:
tokens

['He',
 'has',
 'a',
 'M.Sc.',
 'in',
 'Math',
 'and',
 'she',
 'has',
 'a',
 'Ph.D.',
 'in',
 'NLP',
 '.',
 'A',
 'session',
 'costs',
 '45.55$',
 'or',
 '$50.00',
 '.',
 'As',
 'of',
 '01/02/06',
 ',',
 'please',
 'email',
 'X',
 '/',
 'Y',
 'at',
 'someone@brown.edu',
 'or',
 'visit',
 'http://www.stanford.edu',
 'and',
 'if',
 'link',
 'does',
 'not',
 'work',
 'try',
 'https://www.stanford.edu',
 'instead',
 '.',
 '#test',
 '#test2',
 '#nlproc']
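
The order of the alternatives in `combined_pattern` matters, because `re.findall` takes the first alternative that matches at each position. The sketch below illustrates this with simplified patterns borrowed from the tokenizer above; the variable names `price`, `word`, `separator`, `specific_first`, and `generic_first` are introduced here for illustration only.

```python
import re

# Simplified patterns taken from the tokenizer above (illustrative names)
price = r'\$\d+(?:\.\d{1,2})?|\d+(?:\.\d{1,2})?\$'
word = r'[a-zA-Z0-9]+'
separator = r'[.,/()!]'

text = "A session costs 45.55$."

# Specific pattern first: the price is kept as one token
specific_first = re.findall(f'{price}|{word}|{separator}', text)

# Generic pattern first: the word pattern consumes "45" before the price
# pattern is tried, so the price is split and the "$" is silently dropped
generic_first = re.findall(f'{word}|{separator}|{price}', text)

print(specific_first)  # ['A', 'session', 'costs', '45.55$', '.']
print(generic_first)   # ['A', 'session', 'costs', '45', '.', '55', '.']
```

This is why the plain word and separator patterns are placed last in the combined pattern.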

### Exercise 2 - Implementing a BPE tokenizer
Implement a Byte Pair Encoding (BPE) tokenizer as shown in the lecture and apply it to a
sample text. You may use any libraries you like. You can assume that the corpus consists
only of a list of words.

In [88]:
from collections import defaultdict

class BPETokenizer:
    def __init__(self, num_merges=10):
        self.num_merges = num_merges
        self.bpe_codes = {}

    def get_stats(self, corpus):
        pairs = defaultdict(int)
        for word, freq in corpus.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

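    def merge_vocab(self, pair, corpus):
        new_vocab = {}
        # Match the pair only at symbol boundaries (whitespace or string edge), so that
        # e.g. the pair ('a', 'b') is not merged inside "xa b". A plain str.replace would
        # do that. Uses the `re` module imported in an earlier cell.
        bigram = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        replacement = ''.join(pair)
        for word in corpus:
            new_word = bigram.sub(lambda m: replacement, word)
            new_vocab[new_word] = corpus[word]
        return new_vocab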

    def train(self, corpus):
        # Corpus format: {"l o w </w>": 5, ...}
        for i in range(self.num_merges):
            pairs = self.get_stats(corpus)
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            corpus = self.merge_vocab(best, corpus)
            self.bpe_codes[best] = i
        self.vocab = corpus

    def tokenize(self, word):
        word = list(word) + ['</w>']
        while True:
            pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
            merge_candidates = [p for p in pairs if p in self.bpe_codes]
            if not merge_candidates:
                break
            best = min(merge_candidates, key=lambda p: self.bpe_codes[p])
            i = pairs.index(best)
            word = word[:i] + [''.join(best)] + word[i+2:]
        return word


In [84]:
# Example training
corpus = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3
}

tokenizer = BPETokenizer(num_merges=10)
tokenizer.train(corpus)

# Example tokenization
print(tokenizer.tokenize("lowest"))

['low', 'est</w>']


In [86]:
tokenizer.vocab

{'low</w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'wi d est</w>': 3}

In [87]:
tokenizer.bpe_codes

{('e', 's'): 0,
 ('es', 't'): 1,
 ('est', '</w>'): 2,
 ('l', 'o'): 3,
 ('lo', 'w'): 4,
 ('n', 'e'): 5,
 ('ne', 'w'): 6,
 ('new', 'est</w>'): 7,
 ('low', '</w>'): 8,
 ('w', 'i'): 9}
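
The exercise assumes the corpus is a plain list of words, whereas `train` expects a dictionary that maps space-separated symbols (ending in `</w>`) to frequencies. A minimal sketch of that conversion follows; `build_corpus` and `corpus_from_words` are hypothetical names that are not part of the solution above.

```python
from collections import Counter

def build_corpus(words):
    """Turn a list of words into the {"l o w </w>": count} format used by train()."""
    counts = Counter(words)
    # Split each word into characters and append the end-of-word marker
    return {' '.join(word) + ' </w>': freq for word, freq in counts.items()}

# Word list that reproduces the hand-written example corpus
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
corpus_from_words = build_corpus(words)
print(corpus_from_words)
```

The resulting dictionary can be passed to `tokenizer.train(...)` in place of the hand-written `corpus`.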

### Exercise 3 - Using pre-implemented tokenizers
Use an existing tokenizer from the T5 Transformer or any other tokenizer of choice from the
HuggingFace library. Apply the tokenizer to a text sample of choice. Compare the output of
this tokenizer with the two tokenizers you implemented in the previous questions and explain
the similarities and differences.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")



In [4]:
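# test_text must exist in the current session; it is redefined here
# (same sentence as in Exercise 1) so that this cell runs on its own
test_text = "He has a M.Sc. in Math and she has a Ph.D. in NLP. A session costs 45.55$ or $50.00. As of 01/02/06, please email X/Y at someone@brown.edu or visit http://www.stanford.edu and if link does not work try https://www.stanford.edu instead. #test#test2#nlproc"
tokens = tokenizer.tokenize(test_text)
print(tokens)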

### Exercise 4 - RegEx
Assume we have a lookup table named lookup storing abbreviation definitions. Using regular
expressions in Python, write code that uses lookup to replace abbreviations in any given text
with their full-text counterparts. Apply your code to the following snippet so that example
is transformed to match target_output:

lookup = {'doa': 'dead on arrival', 'gf': 'girlfriend', 'bf': 'boyfriend', 'btw': 'by the way', 'lol': 'laughing out loud'}
example = "I was lol when my bf’s phone was doa." 
target_output = "I was laughing out loud when my boyfriend’s phone was dead on arrival."

In [101]:

lookup = {
    'doa': 'dead on arrival',
    'gf': 'girlfriend',
    'bf': 'boyfriend',
    'btw': 'by the way',
    'lol': 'laughing out loud',
}

example = "I was lol when my bf’s phone was doa."

# Pattern to match any abbreviation as a whole word (case-sensitive: no re.IGNORECASE flag is set)
pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in lookup.keys()) + r')\b')

print(pattern)


re.compile('\\b(doa|gf|bf|btw|lol)\\b')


In [102]:
# Replace using lookup
def replace(match):
    word = match.group(0)
    return lookup.get(word, word)

# Apply substitution
output = pattern.sub(replace, example)
print(output)

I was laughing out loud when my boyfriend’s phone was dead on arrival.
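
The pattern above only matches abbreviations spelled exactly as the keys in `lookup`. If upper-case or mixed-case input such as "LOL" should be expanded as well, one option is to compile the pattern with `re.IGNORECASE` and lower-case the match before the dictionary lookup. The following is a sketch under that assumption; `ci_pattern`, `replace_ci`, and the sample sentence are introduced here for illustration.

```python
import re

lookup = {
    'doa': 'dead on arrival',
    'gf': 'girlfriend',
    'bf': 'boyfriend',
    'btw': 'by the way',
    'lol': 'laughing out loud',
}

# Same whole-word pattern as above, but matching regardless of case
ci_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(key) for key in lookup) + r')\b',
    re.IGNORECASE,
)

def replace_ci(match):
    word = match.group(0)
    # Keys are lower-case, so normalise the match before looking it up
    return lookup.get(word.lower(), word)

result = ci_pattern.sub(replace_ci, "LOL, my BF's phone was DOA btw.")
print(result)
# laughing out loud, my boyfriend's phone was dead on arrival by the way.
```

Note that the replacement text is always lower-case, so the capitalisation of the original abbreviation is not preserved.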


In [1]:
# Exercise 3 (continued): T5 tokenizer applied to a variant of the Exercise 1 test sentence
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
test_text = "He has a M.Sc. in Math and a Ph.D. in NLP. A session costs 45.55$ or $50. As of 01/02/06, please email X at someone@brown.edu or visit http://www.stanford.edu (if link does not work try https://www.stanford.edu). #test#test2#nlproc"
print(tokenizer.convert_ids_to_tokens(tokenizer(test_text).input_ids))


['▁He', '▁has', '▁', 'a', '▁M', '.', 'S', 'c', '.', '▁in', '▁Math', '▁and', '▁', 'a', '▁Ph', '.', 'D', '.', '▁in', '▁N', 'LP', '.', '▁A', '▁session', '▁costs', '▁45', '.', '55', '$', '▁or', '▁$50', '.', '▁As', '▁of', '▁01', '/', '02', '/', '06', ',', '▁please', '▁email', '▁', 'X', '▁at', '▁someone', '@', 'brow', 'n', '.', 'e', 'du', '▁or', '▁visit', '▁http', '://', 'www', '.', 'stan', 'ford', '.', 'e', 'du', '▁(', 'if', '▁link', '▁does', '▁not', '▁work', '▁try', '▁https', '://', 'www', '.', 'stan', 'ford', '.', 'e', 'du', ').', '▁#', 'test', '#', 'test', '2', '#', 'n', 'l', 'pro', 'c', '</s>']
