# Tokenization

Most NLP applications require input text to be tokenized, so that each token represents a meaningful linguistic unit such as a word.

## Contents

* [Split by Whitespace](#Split-by-Whitespace)
* [Substring Matching](#Substring-Matching)
* [Function Definition](#Function-Definition)
* [Exercise](#Exercise)

### Split by Whitespace

It is easy to tokenize a string by whitespace using the `str.split()` method.

In [10]:
text = 'Mr. Wayne is Batman'
tokens = text.split()
print(tokens)

['Mr.', 'Wayne', 'is', 'Batman']


* [`str.split(sep=None, maxsplit=-1)`](https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split)
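The two parameters in the signature above change how the string is split. The following is a minimal sketch of their effect; the variable names (`spaced`, `by_default`, `by_space`, `first_only`) are chosen here for illustration only.

```python
# A string with two consecutive spaces after 'Wayne'
spaced = 'Mr. Wayne  is Batman'

# sep=None (the default): runs of whitespace count as a single separator
by_default = spaced.split()

# An explicit separator: every single space splits, so an empty string appears
by_space = spaced.split(' ')

# maxsplit=1: at most one split is made, the rest of the string is kept intact
first_only = spaced.split(maxsplit=1)

print(by_default)  # ['Mr.', 'Wayne', 'is', 'Batman']
print(by_space)    # ['Mr.', 'Wayne', '', 'is', 'Batman']
print(first_only)  # ['Mr.', 'Wayne  is Batman']
```

For tokenization, the default `sep=None` is usually the safer choice because it never produces empty tokens.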

However, splitting on whitespace alone can leave the resulting tokens noisy:

In [11]:
text = 'Mr. Wayne isn\'t the hero we need, but "the one" we deserve.'
tokens = text.split()
print(tokens)

['Mr.', 'Wayne', "isn't", 'the', 'hero', 'we', 'need,', 'but', '"the', 'one"', 'we', 'deserve.']


Ideally, the following tokens would be split further:

* `"isn't"` &rarr; `['is', "n't"]`
* `'need,'` &rarr; `['need', ',']`
* `'"the'` &rarr; `['"', 'the']`
* `'one"'` &rarr; `['one', '"']`
* `'deserve.'` &rarr; `['deserve', '.']`

### Substring Matching

The above issues can be resolved through substring matching, which splits known prefixes (`STARTS`) and suffixes (`ENDS`) off each token:

In [12]:
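# Substrings to split off the front and the back of a token
STARTS = ['"']
ENDS = ["n't", '.', ',', '"']

new_tokens = []
for token in tokens:
    # Split off the first matching prefix, if any
    start = next((t for t in STARTS if token.startswith(t)), None)
    if start:
        n = len(start)
        t1 = token[:n]
        t2 = token[n:]
        new_tokens.extend([t1, t2])
        continue

    # Otherwise, split off the first matching suffix, if any
    end = next((t for t in ENDS if token.endswith(t)), None)
    if end:
        n = len(end)
        t1 = token[:-n]
        t2 = token[-n:]
        # Keep the title 'Mr.' as a single token
        if not (t1 == 'Mr' and t2 == '.'):
            new_tokens.extend([t1, t2])
            continue

    # No split applies: keep the token unchanged
    new_tokens.append(token)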

print(new_tokens)

['Mr.', 'Wayne', 'is', "n't", 'the', 'hero', 'we', 'need', ',', 'but', '"', 'the', 'one', '"', 'we', 'deserve', '.']


* [List Comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
* [More on Lists](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists)
* [`next(iterator[, default])`](https://docs.python.org/3/library/functions.html#next)
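The prefix and suffix splits in the loop above rely on string slicing with the length of the matched substring. As a small illustration (the variable names `word`, `head`, `tail`, `first`, and `rest` are introduced here and are not part of the original code):

```python
# Suffix split: cut the last n characters off the token
word = 'deserve.'
end = '.'
n = len(end)
head, tail = word[:-n], word[-n:]
print(head, tail)   # deserve .

# Prefix split: cut the first n characters off the token
word = '"the'
start = '"'
n = len(start)
first, rest = word[:n], word[n:]
print(first, rest)  # " the
```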

The first argument passed to `next()` is a generator expression, which creates an iterator:

In [13]:
d = (t for t in STARTS if token.startswith(t))
print(d)

<generator object <genexpr> at 0x7fb228301890>
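Calling `next()` on such an iterator returns its first item, or the default value given as the second argument once the iterator has nothing left to yield. A short sketch of this behavior, using illustrative names (`candidates`, `first`, `again`, `no_match`) that do not appear in the original code:

```python
STARTS = ['"']

# The generator yields every prefix in STARTS that the token starts with
candidates = (t for t in STARTS if '"the'.startswith(t))

first = next(candidates, None)  # the first (and only) matching prefix
again = next(candidates, None)  # the generator is exhausted: default is returned

# No prefix matches 'hero', so the default is returned immediately
no_match = next((t for t in STARTS if 'hero'.startswith(t)), None)

print(repr(first), again, no_match)  # '"' None None
```

Without the default argument, `next()` raises `StopIteration` on an exhausted iterator, which is why the tokenizer passes `None`.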


### Function Definition

Let us convert the above code into a function:

In [14]:
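def tokenize_strmat_0(text):
    """Tokenize text by whitespace, then split off one prefix in STARTS or one suffix in ENDS per token."""
    tokens = text.split()
    new_tokens = []

    for token in tokens:
        # Split off the first matching prefix, if any
        start = next((t for t in STARTS if token.startswith(t)), None)
        if start:
            n = len(start)
            t1 = token[:n]
            t2 = token[n:]
            new_tokens.extend([t1, t2])
            continue

        # Otherwise, split off the first matching suffix, if any
        end = next((t for t in ENDS if token.endswith(t)), None)
        if end:
            n = len(end)
            t1 = token[:-n]
            t2 = token[-n:]
            # Keep the title 'Mr.' as a single token
            if not (t1 == 'Mr' and t2 == '.'):
                new_tokens.extend([t1, t2])
                continue

        # No split applies: keep the token unchanged
        new_tokens.append(token)

    return new_tokens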

In [15]:
print(tokenize_strmat_0(text))

['Mr.', 'Wayne', 'is', "n't", 'the', 'hero', 'we', 'need', ',', 'but', '"', 'the', 'one', '"', 'we', 'deserve', '.']


### Exercise

Let us consider the following example:

In [16]:
text = 'Ms. Wayne is "Batgirl" but not "the one".'
print(tokenize_strmat_0(text))

['Ms', '.', 'Wayne', 'is', '"', 'Batgirl"', 'but', 'not', '"', 'the', 'one"', '.']


The output needs the following corrections:

* `['Ms', '.']` &rarr; `'Ms.'`
* `'Batgirl"'` &rarr; `['Batgirl', '"']`
* `'one"'` &rarr; `['one', '"']`

**Complete `tokenize_strmat_1()` below, a modified version of `tokenize_strmat_0()`, so that it handles the above example.**

Expected output:
```python
['Ms.', 'Wayne', 'is', '"', 'Batgirl', '"', 'but', 'not', '"', 'the', 'one', '"', '.']
```

In [17]:
def tokenize_strmat_1(text):
    tokens = text.split()
    new_tokens = []
    # to be filled
    return new_tokens
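One possible approach, offered as a sketch rather than the intended solution, is to peel prefixes and suffixes off each token repeatedly instead of only once, and to stop as soon as the remaining token is a known abbreviation. The function name `tokenize_strmat_sketch` and the `ABBREVIATIONS` set are assumptions introduced for this illustration; they are not part of the original notebook.

```python
STARTS = ['"']
ENDS = ["n't", '.', ',', '"']
ABBREVIATIONS = {'Mr.', 'Ms.'}  # hypothetical exception list, extend as needed


def tokenize_strmat_sketch(text):
    new_tokens = []
    for token in text.split():
        prefixes, suffixes = [], []

        # Peel matching prefixes off the front, one at a time
        while True:
            start = next((t for t in STARTS
                          if token.startswith(t) and len(token) > len(t)), None)
            if start is None:
                break
            prefixes.append(start)
            token = token[len(start):]

        # Peel matching suffixes off the back, unless an abbreviation remains
        while token not in ABBREVIATIONS:
            end = next((t for t in ENDS
                        if token.endswith(t) and len(token) > len(t)), None)
            if end is None:
                break
            suffixes.append(end)
            token = token[:-len(end)]

        # Suffixes were collected from the outside in, so restore their order
        new_tokens.extend(prefixes + [token] + suffixes[::-1])
    return new_tokens


result = tokenize_strmat_sketch('Ms. Wayne is "Batgirl" but not "the one".')
print(result)
# ['Ms.', 'Wayne', 'is', '"', 'Batgirl', '"', 'but', 'not', '"', 'the', 'one', '"', '.']
```

The `len(token) > len(t)` checks keep a token that consists solely of a prefix or suffix, such as a lone `'"'`, from being split into an empty string.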