### Exercise 1 - Implementing a tokenizer
Implement a basic whitespace tokenizer in Python from scratch, without using any NLP
libraries. The tokenizer should drop whitespace and create tokens for the following cases:

(a) End-of-sentence (EOS) symbols, brackets and separators

(b) Abbreviations - Assume these are limited to the following: Ph.D., Dr., M.Sc.

(c) Prices, kept together with their currency symbol as one token (e.g. $45.55)

(d) Dates - Assume that they follow the format dd/mm/yy (e.g. 01/02/06)

(e) URLs - Assume that they follow the format:
http[s]://[...] (e.g. https://www.stanford.edu)

(f) Hashtags, each as a separate token even when written back to back (e.g. #nlproc)

(g) Email addresses - Assume that they follow the format:
name@domain.xyz (e.g. someOne@brown.edu)

Apply your code to the test example below, which should yield the tokens shown in the final output:

In [10]:
import re

In [78]:
test_text = "He has a M.Sc. in Math and she has a Ph.D. in NLP. A session costs 45.55$ or $50.00. As of 01/02/06, please email X/Y at someone@brown.edu or visit http://www.stanford.edu and if link does not work try https://www.stanford.edu instead. #test#test2#nlproc"

In [79]:
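class VanillaTokenizer:
    def __init__(self):
        # Fixed abbreviation list from the exercise; re.escape makes the dots literal.
        self.abbreviations = ["Ph.D.", "Dr.", "M.Sc."]
        self.abbr_pattern = '|'.join(re.escape(a) for a in self.abbreviations)
        # EOS symbols, separators and brackets; each becomes a one-character token.
        self.separators = [".", ",", ";", ":", "!", "?", "/", "(", ")", "[", "]", "{", "}"]
        self.separators_pattern = '[' + ''.join(re.escape(s) for s in self.separators) + ']'
        self.date_pattern = r'\b\d{2}/\d{2}/\d{2,4}\b'
        # A price is a number with a leading or a trailing dollar sign.
        self.prices_pattern = r'\$\d+(?:\.\d{1,2})?|\d+(?:\.\d{1,2})?\$'
        # Matches the scheme and a host of the form [www.]name.tld; paths are not covered.
        self.urls_pattern = r'https?://(?:www\.)?[a-zA-Z0-9\-]+\.[a-zA-Z]{2,}'
        self.emails_pattern = r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]+'
        self.hashtags_pattern = r'#[a-zA-Z0-9]+'
        self.words = r'[a-zA-Z0-9]+'
        # Order matters: specific patterns come before the generic word and separator patterns.
        self.combined_pattern = '|'.join([
            self.date_pattern, self.abbr_pattern, self.prices_pattern, self.urls_pattern,
            self.emails_pattern, self.hashtags_pattern, self.words, self.separators_pattern,
        ])
        self.combined_regex = re.compile(self.combined_pattern)

    def tokenize(self, text):
        # findall skips every character that matches no alternative, which drops whitespace.
        return self.combined_regex.findall(text)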
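
The order of the alternatives in `combined_pattern` is what makes this approach work: at each position Python's `re` module tries the alternatives from left to right and takes the first one that matches, so the specific patterns (dates, abbreviations, prices, and so on) have to come before `words` and `separators_pattern`. The sketch below illustrates this on a fragment of the test text. The variable names `date`, `word`, `sep`, `specific_first` and `generic_first` are chosen here for illustration only.

```python
import re

# Three of the sub-patterns from the tokenizer above, copied for illustration.
date = r'\b\d{2}/\d{2}/\d{2,4}\b'
word = r'[a-zA-Z0-9]+'
sep = r'[.,/()!]'

text = "As of 01/02/06, please"

# Specific pattern first: the date is matched as a whole.
specific_first = re.findall(f'{date}|{word}|{sep}', text)

# Generic patterns first: the word pattern grabs '01' before the date
# alternative is ever tried, so the date falls apart into pieces.
generic_first = re.findall(f'{word}|{sep}|{date}', text)

print(specific_first)  # ['As', 'of', '01/02/06', ',', 'please']
print(generic_first)   # ['As', 'of', '01', '/', '02', '/', '06', ',', 'please']
```

The same reasoning applies to the other token types: an email address or a price would be split at `@`, `.` or `$` if the word pattern were listed before them.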

In [80]:
tokenizer = VanillaTokenizer()
tokens = tokenizer.tokenize(test_text)

In [82]:
tokens

['He',
 'has',
 'a',
 'M.Sc.',
 'in',
 'Math',
 'and',
 'she',
 'has',
 'a',
 'Ph.D.',
 'in',
 'NLP',
 '.',
 'A',
 'session',
 'costs',
 '45.55$',
 'or',
 '$50.00',
 '.',
 'As',
 'of',
 '01/02/06',
 ',',
 'please',
 'email',
 'X',
 '/',
 'Y',
 'at',
 'someone@brown.edu',
 'or',
 'visit',
 'http://www.stanford.edu',
 'and',
 'if',
 'link',
 'does',
 'not',
 'work',
 'try',
 'https://www.stanford.edu',
 'instead',
 '.',
 '#test',
 '#test2',
 '#nlproc']
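
One property of this `re.findall` approach is worth keeping in mind: whitespace is dropped only because no alternative matches it, and any other unmatched character disappears in the same silent way. The following is a minimal sketch of how such gaps could be surfaced with `re.finditer`. The reduced pattern and the helper name `skipped_spans` are assumptions made for this illustration and are not part of the exercise.

```python
import re

# Deliberately reduced pattern for illustration: words plus '.' and ','.
pattern = re.compile(r'[a-zA-Z0-9]+|[.,]')

def skipped_spans(text):
    """Return the non-whitespace substrings that no alternative matched."""
    skipped = []
    pos = 0
    for match in pattern.finditer(text):
        # Text between the previous match and this one was skipped.
        gap = text[pos:match.start()].strip()
        if gap:
            skipped.append(gap)
        pos = match.end()
    # Anything left after the last match was skipped as well.
    tail = text[pos:].strip()
    if tail:
        skipped.append(tail)
    return skipped

sample = "It costs 50% today."
print(pattern.findall(sample))  # ['It', 'costs', '50', 'today', '.']
print(skipped_spans(sample))    # ['%']
```

A check like this can help when extending the tokenizer to new inputs, since a missing token type then shows up as a reported gap instead of vanishing from the output.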