## <span style="color:purple"> Text segmentation: Tokens </span>

### General overview

Tagging the tokens means determining the start and end position of each token, based on whitespace and/or punctuation. There are many whitespace characters, of which spaces, tabs, and newlines occur most frequently. The type of whitespace does not matter when tokens are tagged, but later analysis steps may take into account whether or not two tokens were separated by whitespace.

Note that segmenting the text into tokens is the most atomic segmentation: later analysis steps never split tokens any further; they only join them when necessary (e.g. to create words or phrases).
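
As a small illustration of the whitespace remark above, the following sketch (plain Python, no EstNLTK required) uses only character offsets to check whether neighbouring tokens were separated by whitespace. The offsets are written out by hand for the string `Mis aias sa-das`, and the helper name `gap_between` is hypothetical.

```python
text = 'Mis aias sa-das'
# (start, end) offsets of the tokens 'Mis', 'aias', 'sa', '-', 'das', written out by hand
spans = [(0, 3), (4, 8), (9, 11), (11, 12), (12, 15)]

def gap_between(text, left, right):
    """Return the raw text between the end of `left` and the start of `right`."""
    return text[left[1]:right[0]]

# One flag per pair of neighbouring tokens: True if whitespace lies between them
whitespace_flags = [gap_between(text, left, right).isspace()
                    for left, right in zip(spans, spans[1:])]

for (left, right), flag in zip(zip(spans, spans[1:]), whitespace_flags):
    left_text = text[left[0]:left[1]]
    right_text = text[right[0]:right[1]]
    print(left_text, right_text, flag)
```

Because tokens keep their exact offsets, this information never has to be stored separately: an empty gap means the two tokens were adjacent in the original text.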

In the following example, we create a text object with the tokens layer and print out the tokens layer.

In [1]:
from estnltk import Text
from estnltk.taggers import TokensTagger
text = TokensTagger().tag(Text('Mis aias sa-das 3me sorti s-saia?'))
text['tokens']

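| layer name | attributes | parent | enveloping | ambiguous | span count |
|------------|------------|--------|------------|-----------|------------|
| tokens     |            |        |            | False     | 11         |

| text  |
|-------|
| Mis   |
| aias  |
| sa    |
| -     |
| das   |
| 3me   |
| sorti |
| s     |
| -     |
| saia  |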


The text contains 11 tokens. To see the start and end position of each token, execute:

In [2]:
text.tokens[['start','end','text']]

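|   | start | end | text  |
|---|-------|-----|-------|
| 0 | 0     | 3   | Mis   |
| 1 | 4     | 8   | aias  |
| 2 | 9     | 11  | sa    |
| 3 | 11    | 12  | -     |
| 4 | 12    | 15  | das   |
| 5 | 16    | 19  | 3me   |
| 6 | 20    | 25  | sorti |
| 7 | 26    | 27  | s     |
| 8 | 27    | 28  | -     |
| 9 | 28    | 32  | saia  |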


#### Under the hood
The `TokensTagger` applies NLTK's [WordPunctTokenizer](https://www.nltk.org/api/nltk.tokenize.regexp.html#nltk.tokenize.regexp.WordPunctTokenizer) to split the text into tokens. The aim is a tokenization in which words ("alphanumeric sequences") are separated from each other, and punctuation symbols are separated both from words and from each other. However, `WordPunctTokenizer` leaves runs of consecutive punctuation symbols unsplit, so `TokensTagger` applies an additional post-correction step to ensure that every punctuation symbol becomes a token of its own. For instance, `WordPunctTokenizer` tokenizes the string `"(1989.a.)."` into `['(', '1989', '.', 'a', '.).']`, and the post-correction step splits it further into `['(', '1989', '.', 'a', '.', ')', '.']`.
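
The two stages described above can be imitated in plain Python. This is only a sketch: the regular expression is the one `WordPunctTokenizer` is built on, while `word_punct_spans` and `split_punctuation_runs` are hypothetical helpers that approximate the behaviour described here, not EstNLTK's actual implementation.

```python
import re

# Pattern used by NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r'\w+|[^\w\s]+')

def word_punct_spans(text):
    """Return (start, end) offsets of the raw word/punctuation runs."""
    return [(m.start(), m.end()) for m in WORD_PUNCT.finditer(text)]

def split_punctuation_runs(text, spans):
    """Simplified post-correction: break each multi-character
    punctuation run into single-character spans."""
    result = []
    for start, end in spans:
        chunk = text[start:end]
        if len(chunk) > 1 and not re.match(r'\w', chunk):
            result.extend((i, i + 1) for i in range(start, end))
        else:
            result.append((start, end))
    return result

sample = '(1989.a.).'
raw_spans = word_punct_spans(sample)
fixed_spans = split_punctuation_runs(sample, raw_spans)

raw_tokens = [sample[s:e] for s, e in raw_spans]
fixed_tokens = [sample[s:e] for s, e in fixed_spans]
print(raw_tokens)    # ['(', '1989', '.', 'a', '.).']
print(fixed_tokens)  # ['(', '1989', '.', 'a', '.', ')', '.']
```

Working with offsets rather than strings keeps every token tied to its position in the original text, which is what the `start` and `end` values of the tokens layer record.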

### Adding new splitting rules

In some situations, applying `TokensTagger` is not enough and you need to add your own, text- or domain-specific splitting rules. 
In the following, we'll show how to add extra splitting rules via `TokenSplitter` and `LocalTokenSplitter`.

### `TokenSplitter`

Use `TokenSplitter` to make additional splits if you can determine splitting locations solely based on regular expression patterns.

In [3]:
from estnltk import Text
from estnltk.taggers import TokenSplitter

# Create an example Text that requires specific token splitting
text = Text('Esimene peatükkKui Arno isaga koolijõudis, olid tunnid juba alanud.')
# Add the tokens layer
text.tag_layer('tokens')
# Browse results
text.tokens

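| layer name | attributes | parent | enveloping | ambiguous | span count |
|------------|------------|--------|------------|-----------|------------|
| tokens     |            |        |            | False     | 11         |

| text        |
|-------------|
| Esimene     |
| peatükkKui  |
| Arno        |
| isaga       |
| koolijõudis |
| ,           |
| olid        |
| tunnid      |
| juba        |
| alanud      |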


Now, create a `TokenSplitter` with splitting patterns. Each pattern must contain a named group `end`, which marks the substring of the token after which the token is split into two pieces:

In [4]:
import re
token_splitter = TokenSplitter(patterns=[re.compile(r'(?P<end>peatükk)Kui'),
                                         re.compile(r'(?P<end>kooli)jõudis')])

In [5]:
# Apply token splitter on text
token_splitter.retag( text )
# Browse results
text.tokens

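| layer name | attributes | parent | enveloping | ambiguous | span count |
|------------|------------|--------|------------|-----------|------------|
| tokens     |            |        |            | False     | 13         |

| text    |
|---------|
| Esimene |
| peatükk |
| Kui     |
| Arno    |
| isaga   |
| kooli   |
| jõudis  |
| ,       |
| olid    |
| tunnid  |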


Restrictions:
* A token can be split only once; recursive splitting is not supported.
* If several patterns match, the first one in the pattern list is applied.
* The decision to split can depend only on the token itself, not on its wider context.
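
The restrictions above can be made concrete with a plain-Python imitation of the splitting logic. This is a sketch under assumptions: `split_once` is a hypothetical helper, and it assumes that patterns are searched anywhere inside the token and that the split point is the end of the `end` group. It is not the `TokenSplitter` source code.

```python
import re

def split_once(token, patterns, group='end'):
    """Split `token` at most once, using the first pattern that matches."""
    for pattern in patterns:
        match = pattern.search(token)
        if match:
            cut = match.end(group)
            # Drop an empty piece if the cut falls on the token boundary
            return [piece for piece in (token[:cut], token[cut:]) if piece]
    return [token]

patterns = [re.compile(r'(?P<end>peatükk)Kui'),
            re.compile(r'(?P<end>kooli)jõudis')]

tokens = ['Esimene', 'peatükkKui', 'Arno', 'isaga', 'koolijõudis']
result = [piece for token in tokens for piece in split_once(token, patterns)]
print(result)

# Both patterns match this made-up token, but only the first one is applied,
# and the remainder is not split again.
double = split_once('peatükkKuikoolijõudis', patterns)
print(double)
```

In this sketch, a token that needs more than one split would have to be handled with a different approach, since the remainder of a split is never examined again.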

### `LocalTokenSplitter`

Use `LocalTokenSplitter` if you need more fine-grained control over the splitting location. In addition to a regular expression, you provide a custom function that determines the exact split point based on the matching token and the match object.
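
The interplay between the pattern and the decision function can be sketched in plain Python. Everything here is illustrative: `apply_rules` is a hypothetical helper, the convention that the function returns a split index or `-1` for "do not split" is an assumption, and the `KNOWN_WORDS` set is a made-up stand-in for a real morphological check.

```python
import re

SUPERSCRIPT = re.compile('[⁰¹²³⁴⁵⁶⁷⁸⁹]$')

# Hypothetical stand-in for a morphological "is this a word?" check
KNOWN_WORDS = {'pindala', 'suur'}

def split_if_prefix_is_known(token, match):
    """Return the split index, or -1 if the token should stay whole."""
    prefix = token[:match.start()]
    if re.fullmatch('[0-9]+', prefix):
        return -1  # e.g. '20²': a number with an exponent, keep together
    return match.start() if prefix.lower() in KNOWN_WORDS else -1

def apply_rules(token, rules):
    """Apply (pattern, decision function) pairs; split at most once."""
    for pattern, decide in rules:
        match = pattern.search(token)
        if match:
            cut = decide(token, match)
            if 0 < cut < len(token):
                return [token[:cut], token[cut:]]
    return [token]

rules = [(SUPERSCRIPT, split_if_prefix_is_known)]
pieces = {token: apply_rules(token, rules)
          for token in ['Pindala¹', 'm²', '20²', 'suur²']}
for token, parts in pieces.items():
    print(token, parts)
```

The pattern only proposes a candidate location; the function has the final say, which is what allows footnote markers to be detached from words while units such as `m²` stay intact.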

In [6]:
import re

from estnltk import Text
from estnltk.taggers import LocalTokenSplitter

from estnltk.taggers.standard.morph_analysis.proxy import MorphAnalyzedToken

SUPERSCRIPT_SYMBOLS = '[⁰¹²³⁴⁵⁶⁷⁸⁹]'

def split_if_prefix_is_word(text: str, match: re.Match) -> int:
    prefix = text[0:match.start()]
    if re.match('^[0-9]+$', prefix):
        return -1
    return match.start() if MorphAnalyzedToken(prefix).is_word else -1

token_splitter = LocalTokenSplitter(
    split_rules=[
        # separate prefix from a superscript number only if prefix is a word
        (re.compile(f'({SUPERSCRIPT_SYMBOLS})$'), split_if_prefix_is_word),
    ])

# Create an example Text that requires specific token splitting
text = Text('Pindala¹ oli umbes 20 m², ruumala aga võrratult suur².')
# Add the tokens layer
text.tag_layer('tokens')
# Browse results
text.tokens

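| layer name | attributes | parent | enveloping | ambiguous | span count |
|------------|------------|--------|------------|-----------|------------|
| tokens     |            |        |            | False     | 11         |

| text      |
|-----------|
| Pindala¹  |
| oli       |
| umbes     |
| 20        |
| m²        |
| ,         |
| ruumala   |
| aga       |
| võrratult |
| suur²     |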


In [7]:
token_splitter.retag(text)
# Browse results
text.tokens

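| layer name | attributes | parent | enveloping | ambiguous | span count |
|------------|------------|--------|------------|-----------|------------|
| tokens     |            |        |            | False     | 13         |

| text      |
|-----------|
| Pindala   |
| ¹         |
| oli       |
| umbes     |
| 20        |
| m²        |
| ,         |
| ruumala   |
| aga       |
| võrratult |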


* More about [MorphAnalyzedToken](https://github.com/estnltk/estnltk/blob/789c32c64dbf6e0508a640002f469d24eba5720b/tutorials/nlp_pipeline/B_morphology/xx_MorphAnalyzedToken.ipynb)
* More examples of using [LocalTokenSplitter](https://github.com/estnltk/smart-search/blob/469f54a1382d5cb2e717cdc7224774b7678a647e/demod/toovood/riigi_teataja_pealkirjaotsing/01_dokumentide_indekseerimine/estnltk_patches/tests/test_local_token_splitter.ipynb)

---