# <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Text segmentation: Tokens </span>

### General overview

Tagging tokens means determining the start and end position of each token, based on whitespace and/or punctuation. There are many whitespace characters, of which spaces, tabs, and newlines occur most frequently. The type of whitespace does not matter when tokens are tagged, but later analysis steps may take into account whether or not there was whitespace between two tokens.
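To make "start and end position" concrete, here is a minimal sketch in plain Python that does not use EstNLTK at all. It assumes the regular expression `\w+|[^\w\s]+`, which is the pattern behind NLTK's `WordPunctTokenizer`. The helper name `token_spans` and the `preceded_by_whitespace` flag are made up for this illustration and are not part of any library.

```python
import re

# Assumption: same pattern as NLTK's WordPunctTokenizer, i.e. runs of word
# characters, or runs of characters that are neither word characters nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def token_spans(text):
    """Return (token, start, end, preceded_by_whitespace) for each token.

    Hypothetical helper for illustration only; not EstNLTK's implementation.
    """
    spans = []
    previous_end = 0
    for match in TOKEN_PATTERN.finditer(text):
        start, end = match.span()
        # The pattern only skips whitespace, so any gap between the previous
        # token and this one means whitespace separated them.
        spans.append((match.group(), start, end, start > previous_end))
        previous_end = end
    return spans


for token, start, end, spaced in token_spans("Mis aias sa-das 3me sorti s-saia?"):
    print(f"{token!r:8} start={start:2} end={end:2} whitespace_before={spaced}")
```

The kind of whitespace (space, tab, newline) is not recorded, only whether a gap existed, which mirrors the distinction made above.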

In the following example, we create a text object with the tokens layer and print out the tokens layer.

In [1]:
from estnltk import Text
from estnltk.taggers import TokensTagger
text = TokensTagger().tag(Text('Mis aias sa-das 3me sorti s-saia?'))
text['tokens']

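| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| tokens |  |  |  | False | 11 |

| text |
|---|
| Mis |
| aias |
| sa |
| - |
| das |
| 3me |
| sorti |
| s |
| - |
| saia |
| ? |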


Here we have 11 tokens in the text. To see the start and end position of each token, print out the span list.

In [2]:
text.tokens

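| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| tokens |  |  |  | False | 11 |

| text |
|---|
| Mis |
| aias |
| sa |
| - |
| das |
| 3me |
| sorti |
| s |
| - |
| saia |
| ? |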


#### Under the hood
The `TokensTagger` applies NLTK's [WordPunctTokenizer](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.WordPunctTokenizer) to split the text into tokens. The aim is a tokenization in which words ("alphanumeric sequences") are separated from each other, and punctuation symbols are separated both from words and from each other. However, `WordPunctTokenizer` keeps a run of adjacent punctuation symbols together as one token, so `TokensTagger` applies an additional post-correction step to ensure that every punctuation symbol becomes a token of its own. For instance, `WordPunctTokenizer` tokenizes the string `"(1989.a.)."` into `['(', '1989', '.', 'a', '.).']`, and the post-correction step splits this further into `['(', '1989', '.', 'a', '.', ')', '.']`.
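
The two steps can be sketched in plain Python as follows. This is an approximation under stated assumptions, not the actual `TokensTagger` code: the first step reuses the regular expression behind `WordPunctTokenizer` (`\w+|[^\w\s]+`), and the function names `word_punct_tokenize` and `split_punctuation` are hypothetical.

```python
import re

# Assumption: the pattern used by NLTK's WordPunctTokenizer.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")
# A token that consists only of punctuation (no word characters, no whitespace).
PUNCT_ONLY = re.compile(r"[^\w\s]+")


def word_punct_tokenize(text):
    """First pass: words and punctuation runs, as WordPunctTokenizer produces them."""
    return WORD_PUNCT.findall(text)


def split_punctuation(tokens):
    """Post-correction sketch: break every punctuation run into single symbols."""
    corrected = []
    for token in tokens:
        if PUNCT_ONLY.fullmatch(token):
            # '.).' becomes '.', ')', '.'
            corrected.extend(token)
        else:
            corrected.append(token)
    return corrected


first_pass = word_punct_tokenize("(1989.a.).")
print(first_pass)                     # ['(', '1989', '.', 'a', '.).']
print(split_punctuation(first_pass))  # ['(', '1989', '.', 'a', '.', ')', '.']
```

Splitting unconditionally is the simplest reading of the description above; the real tagger may treat some punctuation sequences differently.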

---