In [1]:
%%html
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [2]:
%%capture
%load_ext autoreload
%autoreload 2
%cd ..
import statnlpbook.tokenization as tok

# Tokenisation

* Identify the **meaningful units** in a string of characters: for example, **words**.

![nospaces](../img/nospaces.jpg)

In Python you can tokenise text via `split`:

In [3]:
text = """Mr. Bob Dobolina is thinkin' of a master plan.
Why doesn't he quit?"""
text.split(" ")

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.\nWhy',
 "doesn't",
 'he',
 'quit?']

What is wrong with this?

In Python you can also tokenise using **patterns** at which to split tokens:

### Regular Expressions

A **regular expression** is a compact definition of a **set** of (character) sequences (strings).

Examples:
* `Mr.`: all strings consisting of `Mr` followed by any single character (except a newline)
* `Mr\.`: only the string `Mr.`
* <code>&nbsp;</code>`|\n|!!!`: only the strings <code>&nbsp;</code> (space), `\n` and `!!!`
* `[abc]`: only the characters `a`, `b` and `c`
* `\s`: all whitespace characters
* `1+`: all sequences of at least one `1`
* `\w+`: all non-empty sequences of alphanumeric characters and `_`
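
To check these readings, `re.fullmatch` tests whether a whole string belongs to the set a pattern defines. A minimal sketch; the helper name `in_set` is made up for illustration:

```python
import re

def in_set(pattern, string):
    """Return True if the whole string is in the set defined by pattern."""
    return re.fullmatch(pattern, string) is not None

# An unescaped `.` stands for any single character, so `Mr.` also accepts "Mrs".
print(in_set(r"Mr.", "Mrs"))           # True
print(in_set(r"Mr\.", "Mrs"))          # False
print(in_set(r"Mr\.", "Mr."))          # True

# Alternation lists the allowed strings explicitly.
print(in_set(r" |\n|!!!", "!!!"))      # True
print(in_set(r" |\n|!!!", "!!"))       # False

# Character classes and repetition.
print(in_set(r"[abc]", "b"))           # True
print(in_set(r"\s", "\t"))             # True
print(in_set(r"1+", "111"))            # True
print(in_set(r"1+", ""))               # False
print(in_set(r"\w+", "snake_case1"))   # True
print(in_set(r"\w+", "a-b"))           # False
```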


In [4]:
import re
re.compile(r'\s').split(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan.',
 'Why',
 "doesn't",
 'he',
 'quit?']

Problems:
* Bad treatment of punctuation.  
* Easier to **define a token** than a gap. 

Let us use `findall` instead:

In [5]:
re.compile(r'\w+|[.?]').findall(text)

['Mr',
 '.',
 'Bob',
 'Dobolina',
 'is',
 'thinkin',
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 'doesn',
 't',
 'he',
 'quit',
 '?']

Problems:
* `Mr.` and `doesn't` are each split into two tokens.
* The apostrophes are lost (`thinkin'`, `doesn't`).

Both are fixed below ...

In [6]:
re.compile(r"Mr\.|[\w']+|[.?]").findall(text)

['Mr.',
 'Bob',
 'Dobolina',
 'is',
 "thinkin'",
 'of',
 'a',
 'master',
 'plan',
 '.',
 'Why',
 "doesn't",
 'he',
 'quit',
 '?']

## Learning to Tokenise?
* For English, simple pattern matching is often sufficient.
* In some languages (e.g., Japanese), words are not separated by whitespace.
* In some languages (e.g., Vietnamese), whitespace does not always indicate a word boundary.


In [7]:
jap = "今日もしないといけない。"
viet = "thuế  thu nhập cá nhân"

Try lexicon-based tokenisation ...

In [8]:
re.compile('もし|今日|も|しない|と|いけない').findall(jap)

['今日', 'もし', 'と', 'いけない']

In [9]:
re.compile('thuế  thu nhập|cá nhân').findall(viet)

['thuế  thu nhập', 'cá nhân']
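
The Japanese result above drops しない because regex alternation is tried left to right, so もし wins over も at the same position. A small sketch, assuming the intended segmentation is 今日 / も / しない / と / いけない, shows how sensitive the lexicon approach is to the order of its entries:

```python
import re

jap = "今日もしないといけない。"

# Original order: もし is tried before も, so し is consumed and しない can no longer match.
original_order = re.compile("もし|今日|も|しない|と|いけない").findall(jap)

# Same lexicon, with もし moved to the end: も now matches first at that position.
reordered = re.compile("今日|も|しない|と|いけない|もし").findall(jap)

print(original_order)  # ['今日', 'もし', 'と', 'いけない']
print(reordered)       # ['今日', 'も', 'しない', 'と', 'いけない']
```

The reordering only helps for this sentence; a different sentence containing もし ("if") would now be segmented wrongly.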

Tokenisation is equally complex for certain English domains (e.g., biomedical text).

In [10]:
bio = """We developed a nanocarrier system of herceptin-conjugated nanoparticles
of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin
prodrug ..."""

* d-alpha-tocopheryl-co-poly is **one** token
* (TPGS)-cisplatin are **five**: 
  * ( 
  * TPGS 
  * ) 
  * - 
  * cisplatin 

In [11]:
re.compile(r'\s').split(bio)[:15]

['We',
 'developed',
 'a',
 'nanocarrier',
 'system',
 'of',
 'herceptin-conjugated',
 'nanoparticles',
 'of',
 'd-alpha-tocopheryl-co-poly(ethylene',
 'glycol)',
 '1000',
 'succinate',
 '(TPGS)-cisplatin',
 'prodrug']

In [12]:
re.compile(r'\s').split("New York-based companies")

['New', 'York-based', 'companies']
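
A hand-written pattern can get this particular snippet right, which makes the limits of rules easy to see. The sketch below uses a pattern chosen only for this example: it keeps hyphenated words together and splits off brackets and stray hyphens. It yields the desired tokens for `bio`, but still cannot tell that `New York` belongs together:

```python
import re

bio = """We developed a nanocarrier system of herceptin-conjugated nanoparticles
of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin
prodrug ..."""

# Hyphen-joined word characters form one token; brackets and lone hyphens are tokens too.
# This pattern is an illustrative assumption, not a general biomedical tokeniser.
pattern = re.compile(r"\w+(?:-\w+)*|[()-]")

bio_tokens = pattern.findall(bio)
print(bio_tokens[9])      # d-alpha-tocopheryl-co-poly
print(bio_tokens[16:21])  # ['(', 'TPGS', ')', '-', 'cisplatin']

# The same rule keeps "York-based" together and leaves "New" on its own.
ny_tokens = pattern.findall("New York-based companies")
print(ny_tokens)          # ['New', 'York-based', 'companies']
```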

Solution: Treat tokenisation as a **statistical problem**.

# Subword Tokenisation

Learn from data how best to break strings down into tokens.

- **Why Subword Tokenisation?**
  - Efficient handling of Out-of-Vocabulary (OOV) words.
  - Capture meaningful subword information.
- **Popular Algorithms**: 
  - WordPiece
  - Byte Pair Encoding (BPE)
  - Unigram (SentencePiece)

## WordPiece
- **Used In**: BERT, DistilBERT
- **Origin**: Developed by Google for speech recognition and later adapted for text.
- **How it Works**: 
  1. Initialize vocabulary with characters and special tokens.
  2. Merge subwords iteratively based on scoring criteria to form the new vocabulary.
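
The two steps can be sketched on a toy corpus. This is a minimal illustration, assuming the pair score commonly given for WordPiece (as in the Hugging Face course): the pair frequency divided by the product of the frequencies of its two parts. The corpus and all function names are made up for the example.

```python
from collections import Counter

# Toy corpus of word frequencies (made up for illustration).
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def initial_split(word):
    """Split a word into characters; non-initial pieces get the '##' prefix."""
    return [word[0]] + ["##" + ch for ch in word[1:]]

def pair_scores(splits, word_freqs):
    """Score each adjacent pair as freq(pair) / (freq(left) * freq(right))."""
    piece_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        pieces = splits[word]
        for piece in pieces:
            piece_freq[piece] += freq
        for pair in zip(pieces, pieces[1:]):
            pair_freq[pair] += freq
    return {
        (left, right): freq / (piece_freq[left] * piece_freq[right])
        for (left, right), freq in pair_freq.items()
    }

def merge_pair(pair, pieces):
    """Replace every adjacent occurrence of pair in pieces by the merged piece."""
    left, right = pair
    merged = left + (right[2:] if right.startswith("##") else right)
    out, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(pieces[i])
            i += 1
    return out

# Step 1: character-level vocabulary. Step 2: one merge of the best-scoring pair.
splits = {word: initial_split(word) for word in word_freqs}
scores = pair_scores(splits, word_freqs)
best_pair = max(scores, key=scores.get)
splits = {word: merge_pair(best_pair, pieces) for word, pieces in splits.items()}

print(best_pair)       # ('##g', '##s')
print(splits["hugs"])  # ['h', '##u', '##gs']
```

Under this score the most frequent pair (`##u`, `##g`) is not merged first, because its parts are themselves very frequent; the score favours pairs whose parts rarely occur apart.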

### WordPiece Examples
- **English**: "hugging"
  - Initial: ("h", "u", "g", "g", "i", "n", "g")
  - After training: ("hug", "##ging")
- **Danish**: "hygge"
  - Initial: ("h", "y", "g", "g", "e")
  - After training: ("hy", "##gge")
- **Japanese**: "こんにちは" 
  - Initial: ("こ", "ん", "に", "ち", "は")
  - After training: ("こん", "##に", "##ち", "##は")
- **Vietnamese**: "xin chào"
  - Initial: ("x", "i", "n"), ("c", "h", "à", "o")
  - After training: ("x", "##in"), ("ch", "##à", "##o")

## Byte Pair Encoding (BPE)
- **Used In**: GPT, RoBERTa
- **Origin**: Initially developed for data compression.
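
The core of BPE can be sketched in a few lines: start from characters and repeatedly merge the most frequent adjacent pair. The toy corpus and the function name `learn_bpe` are made up for illustration; real tokenisers add pre-tokenisation, byte-level handling and special tokens on top of this.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a dict of word frequencies."""
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs in the corpus.
        pair_freq = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_freq[pair] += freq
        if not pair_freq:
            break  # every word is already a single symbol
        best = max(pair_freq, key=pair_freq.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        new_splits = {}
        for word, symbols in splits.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_splits[word] = out
        splits = new_splits
    return merges, splits

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = learn_bpe(word_freqs, num_merges=3)
print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```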


## Unigram Algorithm in SentencePiece
- **Used In**: ALBERT, T5, mBART, Big Bird, XLNet
- **Origin**: Developed by Google for machine translation.
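
As a rough illustration of the idea behind Unigram: every token in the vocabulary has a probability, and a word is segmented into the token sequence whose probabilities have the highest product. The sketch below finds that segmentation with dynamic programming. The probabilities are toy values (not normalised, not from any trained model), and `best_segmentation` is a hypothetical helper, not the SentencePiece API.

```python
import math

def best_segmentation(word, token_probs):
    """Return the most probable segmentation of word, or None if none exists."""
    # best[i] = (log-probability of the best segmentation of word[:i],
    #            start index of the last token in that segmentation)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(token_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None
    # Walk back from the end of the word to recover the tokens.
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1]

# Toy token probabilities, assumed for illustration only.
token_probs = {"h": 0.1, "u": 0.1, "g": 0.1, "hu": 0.15, "ug": 0.2, "hug": 0.01}
segmentation = best_segmentation("hug", token_probs)
print(segmentation)  # ['h', 'ug']
```

Here `h` + `ug` (0.1 × 0.2 = 0.02) beats `hu` + `g` (0.015), `hug` (0.01) and `h` + `u` + `g` (0.001).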

# Sentence Segmentation

* Many NLP tools work sentence-by-sentence. 
* Often trivial after tokenisation: split sentences at sentence-ending punctuation tokens.

In [13]:
text

"Mr. Bob Dobolina is thinkin' of a master plan.\nWhy doesn't he quit?"

In [14]:
tokens = re.compile(r"Mr\.|[\w']+|[.?]").findall(text)
# try different regular expressions
tok.sentence_segment(re.compile(r'\.'), tokens)

[['Mr.',
  'Bob',
  'Dobolina',
  'is',
  "thinkin'",
  'of',
  'a',
  'master',
  'plan',
  '.'],
 ['Why', "doesn't", 'he', 'quit', '?']]
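
A segmenter of this kind can be sketched as follows. `segment_sentences` is a hypothetical stand-in written for this example, not the `statnlpbook` implementation: it closes a sentence after every token that fully matches the given pattern.

```python
import re

def segment_sentences(end_pattern, tokens):
    """Group tokens into sentences, ending one after each token matching end_pattern."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if end_pattern.fullmatch(token):
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack closing punctuation
        sentences.append(current)
    return sentences

text = "Mr. Bob Dobolina is thinkin' of a master plan.\nWhy doesn't he quit?"
tokens = re.compile(r"Mr\.|[\w']+|[.?]").findall(text)
sentences = segment_sentences(re.compile(r"[.?!]"), tokens)
print(len(sentences))  # 2
print(sentences[1])    # ['Why', "doesn't", 'he', 'quit', '?']
```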

<center><img src="../img/quiz_time.png"></center>

What are the challenges in sentence splitting? 

Discuss and enter your answer(s) here:

# [tinyurl.com/diku-nlp-q2](https://tinyurl.com/diku-nlp-q2)

([Responses](https://docs.google.com/forms/d/1WANt_ndHZhGkOwPu1klR4HmGAUH1QL9W4AAkNwU6Ulg/edit))

# Background Reading

* Jurafsky & Martin, [Speech and Language Processing (Third Edition)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf): Chapter 2, Regular Expressions, Text Normalization, Edit Distance.
* Hugging Face's excellent NLP course: [Tokenizers](https://huggingface.co/learn/nlp-course/chapter6/1)