# <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Text segmentation: Compound tokens </span>

###  General overview
Although words and tokens mostly coincide, there are cases where several tokens must be combined to form a word in the traditional sense - the smallest meaningful unit of language. There are also special types of text units -- such as emoticons and web and email addresses -- which need to be detected as a whole (as full token sequences) in order to avoid ambiguities in later processing steps (for instance, a period inside an email address should not be mistaken for a sentence-ending period).

`CompoundTokenTagger` takes care of these cases: it adds a `compound_tokens` layer that envelops the `tokens` layer. This means that every element of the `compound_tokens` layer is a list of elements of the `tokens` layer. That makes it easy to glue the tokens together to form words later on.

Compound tokens never overlap: no token belongs to more than one compound token.

In the following example, a text object with the prerequisite layer (`tokens`) is created, and then the `compound_tokens` layer is added to it:

In [1]:
from estnltk import Text
from estnltk.taggers import TokensTagger
text = TokensTagger().tag(Text('Mis aias sa-das 3me sorti s-saia?'))

In [2]:
from estnltk.taggers import CompoundTokenTagger

CompoundTokenTagger().tag(text)
text['compound_tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
"['sa', '-', 'das']",['hyphenation'],sadas
"['s', '-', 'saia']",['hyphenation'],saia


In this example, two compound tokens are found, both of which consist of three tokens.

Note that the type `hyphenation` includes: a) hyphenations (such as 'sa-das'), b) stammers / stretched out words (such as 's-saia'), and c) compound nouns with hyphens (such as 'Vana-Hiina', 'Mari-Liis').

Here we can see the list of lists of tokens that make up the compound tokens.

In [3]:
[compound_token.text for compound_token in text.compound_tokens]

[['sa', '-', 'das'], ['s', '-', 'saia']]

#### Types of compound tokens

The main aim of the `CompoundTokenTagger` is to join together tokens that were split apart by `TokensTagger`. `CompoundTokenTagger` addresses different types of compound tokens, and most of these types can be switched on or off by flags passed to the constructor. The following subsections list the compounding types along with the corresponding flags (by default, all flags are switched on, except `tag_hashtags_and_usernames`).

##### Numeric expressions (`tag_numbers`)

Tags numeric expressions with decimal separators, numbers with digit group separators, and common date and time formats.

In [4]:
text = Text('02.02.2010 22:55 Mati : saad sa mulle 100,50 asemel 10 000 laenata?')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_numbers = True).tag(text) # tagging numbers switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,type,normalized
"['02', '.', '02', '.', '2010']",['numeric_date'],02.02.2010
"['22', ':', '55']",['numeric_time'],22:55
"['100', ',', '50']",['numeric'],10050
"['10', '000']",['numeric'],10000


As can be seen from the previous example, a compound token can also have attribute `normalized`, which contains a normalized string value for the token. In most cases, the normalization involves removal of whitespace from the string (e.g. `'10 000' => '10000'`). If the pattern that captured the string does not use normalization, then `normalized==None`.

In addition, if `tag_numbers` is switched on, numeric expressions are also augmented with sign symbols and percentages:

In [5]:
text = Text('Mati : +100% kindel, et toon tagasi!!')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_numbers = True).tag(text) # tagging numbers switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,1

text,type,normalized
"['+', '100', '%']","['sign', 'percentage']",+100%


Note: if more than one compounding rule is applied, the resulting compound token can have multiple compound types (like `(sign, percentage)` in the previous example). 

##### Units x-per-y (`tag_units`)

Tags commonly used x-per-y style units that follow numeric expressions:

In [6]:
text = Text('Tänase seisuga tuleb ikka suur lohe vaiksema tuule (6-12 m/s) jaoks ja teine väiksem tormikaks (12-20 m/s) võtta…')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_units = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
"['m', '/', 's']",['unit'],m/s
"['m', '/', 's']",['unit'],m/s


##### XML tags (`tag_xml`)

In [7]:
text = Text('<u>Kirjavahemärgid, hingamiskohad</u>.')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_xml = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
"['<', 'u', '>']",['xml_tag'],
"['<', '/', 'u', '>']",['xml_tag'],


##### Email and www addresses (`tag_email_and_www`)

In [8]:
text = Text('Saada need e-postiaadressile big@boss.com või tule sisesta lehelt www.iamboss.com')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_email_and_www = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,3

text,type,normalized
"['e', '-', 'postiaadressile']",['hyphenation'],
"['big', '@', 'boss', '.', 'com']",['email'],
"['www', '.', 'iamboss', '.', 'com']",['www_address'],www.iamboss.com


##### Common emoticons (`tag_emoticons`)

Tags most common (Western) emoticons:

In [9]:
text = Text('Maja on fantastiline :)) ja mõte on hea :-)')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_emoticons = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
"[':', ')', ')']",['emoticon'],:))
"[':', '-', ')']",['emoticon'],:-)


##### Hashtags and username mentions (`tag_hashtags_and_usernames`)

Tags Twitter-style hashtags and username mentions:

In [10]:
text = Text('@porgandisalat @KalaJaKapsad jah, väga deep, jube lausa #naerma#ajab')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_hashtags_and_usernames = True).tag(text) # tagging switched on
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,type,normalized
"['@', 'porgandisalat']",['username_mention'],@porgandisalat
"['@', 'KalaJaKapsad']",['username_mention'],@KalaJaKapsad
"['#', 'naerma']",['hashtag'],#naerma
"['#', 'ajab']",['hashtag'],#ajab


Note: this flag is _switched off_ by default.

##### Names preceded by initials (`tag_initials`)

In [11]:
text = Text('(arhitektid M. Port, M. Meelak, O. Zhemtshugov, R.-L. Kivi)')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_initials = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,type,normalized
"['M', '.', 'Port']",['name_with_initial'],M. Port
"['M', '.', 'Meelak']",['name_with_initial'],M. Meelak
"['O', '.', 'Zhemtshugov']",['name_with_initial'],O. Zhemtshugov
"['R', '.', '-', 'L', '.', 'Kivi']",['name_with_initial'],R.-L. Kivi


##### Common abbreviations (`tag_abbreviations`)

Tags commonly used abbreviations:

In [12]:
text = Text('Nt. hädas oli juba Vana-Hiina suurim ajaloolane Sima Qian (II—I saj. e. m. a.).')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_abbreviations = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,type,normalized
"['Nt', '.']",['non_ending_abbreviation'],Nt.
"['Vana', '-', 'Hiina']",['hyphenation'],
"['saj', '.']",['abbreviation'],saj.
"['e', '.', 'm', '.', 'a', '.']",['abbreviation'],e.m.a.


Abbreviations are divided into two categories: 1) `non_ending_abbreviation`-s which most likely do not end the sentence (usually it can be expected that some sentence content follows them), and 2) `abbreviation`-s which can also appear at the end of the sentence.

###### Using custom non-ending abbreviations

Tagging only common abbreviations may not be enough, especially if you want to analyse a corpus that is rich in domain-specific abbreviations. Therefore, if `tag_abbreviations` is switched on, you can use the parameter `custom_abbreviations` to define your own list of non-ending abbreviations. These will be used as additional hints for creating compound tokens:

In [13]:
text = Text('Kas ta töötab nüüd med., sots . või hoopis maj. valdkonnas?')
text.tag_layer(['tokens'])
my_abbreviations = ['med', 'sots', 'maj']
CompoundTokenTagger(tag_abbreviations = True, custom_abbreviations = my_abbreviations).tag(text) # include custom abbreviations
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,3

text,type,normalized
"['med', '.']",['non_ending_abbreviation'],med.
"['sots', '.']",['non_ending_abbreviation'],sots.
"['maj', '.']",['non_ending_abbreviation'],maj.


A few things to keep in mind:
 * custom non-ending abbreviations will have a higher priority than system-detected compound tokens; so, in case of an overlap between two types of compound tokens, system-detected compound tokens will be removed;
 * custom non-ending abbreviation strings can contain letters and numbers, but *not* whitespace or punctuation. More technically: they must be strings that `TokensTagger` does not split into smaller tokens.

Note: tagging custom `non_ending_abbreviation`-s also affects the results of sentence tokenization: if a sentence boundary is mistakenly added after a custom `non_ending_abbreviation`, then post-correction rules of sentence tokenizer will remove it automatically:

In [14]:
text.tag_layer(['sentences'])
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1

text
"['Kas', 'ta', 'töötab', 'nüüd', 'med.', ',', 'sots .', 'või', 'hoopis', 'maj.', 'valdkonnas', '?']"


##### Morphological case endings (`tag_case_endings`)

Tags morphological case endings preceded by single tokens, and also by compound tokens:  

In [15]:
text = Text("10 000-st LinkedIn 'i kontaktist mitte üks ei hoolinud meie SKT -st, aga meie workshop ' e väisasid küll.")
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_case_endings = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,type,normalized
"['10', '000', '-', 'st']","['numeric', 'hyphenation', 'case_ending']",10000-st
"['LinkedIn', ""'"", 'i']",['case_ending'],LinkedIn'i
"['SKT', '-', 'st']",['case_ending'],SKT-st
"['workshop', ""'"", 'e']",['case_ending'],workshop'e


##### Hyphenations (`tag_hyphenations`)

If consecutive tokens are separated by a hyphen, and these tokens consist of letters, then they are joined together into a "hyphenated word":

In [16]:
text = Text('See on v-vä-väga huvitav, aga kas ka ka-su-lik?!')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_hyphenations=True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
"['v', '-', 'vä', '-', 'väga']",['hyphenation'],väga
"['ka', '-', 'su', '-', 'lik']",['hyphenation'],kasulik


Note that the linguistic phenomenon covered by "hyphenation compound tokens" is actually wider: in addition to hyphenated / syllabified words (such as _'ka-su-lik'_), it also covers stretched out words (such as _'vää-ää-ääga'_), and compound nouns with hyphens (such as _'Vana-Hiina'_, _'Mari-Liis'_).

##### Discard joining over specific strings (`do_not_join_on_strings`)

Upon initialization of `CompoundTokenTagger`, you can specify a list of strings (`do_not_join_on_strings`) that are not allowed  inside a compound token -- if any of the strings happens to be inside a compound token, the compound token will not be created.
For instance, if you have systematically separated sentences and paragraphs in the text with special strings (e.g. sentences by `'\n'`, and paragraphs by `'\n\n'`), then you can use this list to discard token joining at the locations of sentence and paragraph boundaries.
By default, the list only contains `'\n\n'`, in order to avoid joining tokens over paragraph boundaries.
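
The effect of `do_not_join_on_strings` can be pictured with a small stand-alone sketch. It does not use EstNLTK: the helper `may_join` and the sample sentence are hypothetical and only mimic the rule described above, namely that a candidate compound token is rejected if the raw text it covers contains any of the forbidden strings.

```python
def may_join(raw_text, start, end, do_not_join_on_strings=('\n\n',)):
    """Return True if raw_text[start:end] contains none of the forbidden strings.

    Hypothetical helper; it only illustrates the rule, not EstNLTK's code.
    """
    covered = raw_text[start:end]
    return not any(s in covered for s in do_not_join_on_strings)

# A paragraph break falls between '10' and '000', which would otherwise
# look like a number with a digit group separator.
raw = 'Hind oli 10\n\n000 inimest tuli kohale.'
start = raw.index('10')
end = raw.index('000') + len('000')

joined_default = may_join(raw, start, end)            # default list: ['\n\n']
joined_without_guard = may_join(raw, start, end, ())  # empty list: no restriction
print(joined_default, joined_without_guard)
```

With the default setting the candidate is discarded; with an empty list nothing prevents the join.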

#### Technical details

**`CompoundTokenTagger`** combines the knowledge about token spans (produced by `TokensTagger`), and the knowledge about the original tokenization (i.e. which text units were separated by whitespace in the original text) to determine which tokens should be joined into compound ones. This process consists of the following steps:

1. **Tagging of _strict tokenization hints_**: `estnltk.taggers.RegexTagger` is applied to find non-overlapping text spans that correspond to tokens that need to be joined. In this phase, the following types of compounding hints are tagged:

      1.1 `numeric` expressions like numbers with decimal separators (e.g. `'10,5'`), numbers with digit group separators (e.g. `'10 000 000'`), and common formats of numeric dates (`'02.02.2010'`) and times (`'22:55'`);      
      1.2 "X-per-Y" style `units` (e.g. speed units like `'km/h'` or `'MB/s'`, and emission units like `'g/km'`);   
      1.3 `xml_tags` (like `'<p>'` or `'</br>'`);  
      1.4 `emails` and `www_addresses` (like `'big@boss.com'` or `'www.neti.ee'` or `'https://www.postimees.ee'`);  
      1.5 commonly used `emoticons` (like `:)`, `:D` or  `:-P`);   
      1.6 `names_with_initial` (like `'A. H. Tammsaare'` or `'J.K. Rowling'`);   
      1.7 commonly used `abbreviations` (like `'s.o.'`, `'st'`, or `'a.'`);  
            
2. **Creating an initial list of compound tokens** based on the _strict tokenization hints_ (produced in the previous step),   _the hyphenation logic_ and the user-defined list of custom abbreviations (optional). 

      2.1 (optional) _User-defined non-ending abbreviations_ are marked in the text, and joined into compound tokens if they match token boundaries. If a period follows a user-defined abbreviation, it is added to the compound (e.g. `'med'` + `'.'`). If a user-defined abbreviation overlaps with an existing compound token, then the existing compound token will be deleted if it is not a `non_ending_abbreviation`; if the existing one is also a `non_ending_abbreviation`, then the shorter of the two will be deleted;
      
      2.2 _Strict tokenization hints_ are used in the following way: if a hint's text span starts _exactly_ where a token starts, and ends _exactly_ where a sequence of tokens ends, then, and only then, a compound token is created from the sequence of tokens covered by the hint. So, no compound token is created if the hint's text span either starts or ends in the middle of a token;
      
      2.3 _Hyphenation logic_ collects consecutive tokens that are connected by the hyphen symbol '-', with no space in between, and creates corresponding compound tokens. For instance, the token sequence `['v','-','v','-','v','-','ve','-','ve','-','veri']` (originating from the string `'v-v-v-ve-ve-veri'`) will be joined into a compound token. A normalization of the compound token into the corresponding orthographic form is also attempted (e.g. `'v-v-v-ve-ve-veri'` => `'veri'`), and if it succeeds, the result is stored in the attribute `normalized`.

3. **Tagging of _non-strict tokenization hints_**: `estnltk.taggers.RegexTagger` is applied with a second set of patterns to find non-overlapping text spans that hint at potential joining places of tokens and/or compound tokens (from step 2). The following types of compounding hints are tagged:

      3.1 morphological `case_endings` preceded by single tokens (e.g. `"Palace'ist"`), and compound tokens (e.g. in numeric expressions like `"10 000-ni"`, or in web addresses like `"www.neti.ee-st"`);  
      3.2 `sign` symbols (-, +, ±) followed by numbers (like in `'+20'` or `'-10 000'`);        
      3.3 `percentage` symbols preceded by numbers (like in `'20%'` or `'30,567%'`);    

4. **Extending tokens and compound tokens based on the _non-strict tokenization hints_ (produced in the previous step)**.

      _Non-strict tokenization hints_ differ from the _strict ones_ in that they leave one end of the hint's span (either left or right) unspecified. For instance, the pattern detecting `case_endings` leaves the left side of the sequence unspecified: the left side could be a single token, or a compound token of unspecified length. The pattern only describes the end of the sequence, which must consist of a letter (or a number) followed by a case separator (like `'′'` or `'-'`), and finally followed by a case ending in a single token (like `'st'` or `'ni'`). In a similar manner, the pattern adding signs to numbers leaves the right side open (the actual extent of the numeric expression);
      
      Note: as long as the regions described by the hints do not overlap, one token or compound token can be modified by multiple hints, e.g. `sign` symbol could be added before a numeric token, and `percentage` symbol could be added after that token;
      
5. **Creating the layer `'compound_tokens'` based on the compound tokens acquired in the previous steps.**
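
The alignment rule from step 2.2 can be sketched in plain Python. This is only an illustration under simplifying assumptions: `token_spans` is a crude stand-in for `TokensTagger`, and `tokens_covered_by_hint` is a hypothetical helper, not a function of EstNLTK.

```python
import re

def token_spans(text):
    """Crude stand-in for TokensTagger: words and single punctuation marks."""
    return [m.span() for m in re.finditer(r'\w+|[^\w\s]', text)]

def tokens_covered_by_hint(tokens, hint_start, hint_end):
    """Return the indices of the tokens covered by a strict hint, or None
    if the hint does not start at a token start and end at a token end."""
    starts = {s: i for i, (s, e) in enumerate(tokens)}
    ends = {e: i for i, (s, e) in enumerate(tokens)}
    if hint_start not in starts or hint_end not in ends:
        return None
    return list(range(starts[hint_start], ends[hint_end] + 1))

text = 'Kell 22:55 algas'
tokens = token_spans(text)

# A hint over '22:55' matches token boundaries on both sides.
aligned = tokens_covered_by_hint(tokens, 5, 10)
# A hint over '2:55' starts in the middle of the token '22'.
misaligned = tokens_covered_by_hint(tokens, 6, 10)

print(aligned, misaligned)
```

Only the first hint yields a compound token; the second one is dropped because its start falls inside a token.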

##### Tokenization hints

Basically, each tokenization hint is a result of applying a regular expression over the original text. All patterns producing tokenization hints are in the module `estnltk.taggers.text_segmentation.patterns`.  The file contains lists of records in the `estnltk.taggers.RegexTagger` vocabulary format. For instance, a pattern for capturing simple email addresses is conveyed by the following entry:

         {'comment': '*) Pattern for detecting common e-mail formats;',
          'example': 'bla@bla.bl',
          'pattern_type': 'email',
          '_group_': 1,
          '_priority_': (0, 0, 1),
          '_regex_pattern_': r'([{ALPHANUM}_.+-]+@[{ALPHANUM}-]+\.[{ALPHANUM}-.]+)'.format(**MACROS),
          'normalized': 'lambda m: None'},
          
Attribute `'comment'` is used to give a short description of the pattern, and `'example'` exemplifies a string captured by the pattern. Although these attributes are not mandatory, it is highly advisable to use them when adding new entries, as it helps to maintain interpretability of the vocabulary file.

Attribute `'pattern_type'` is mandatory and expresses the category of the compound token. If a compound token is created based on the tokenization hint, then compound token's attribute `'type'` will get its value from the `'pattern_type'`. Compound token's attribute `'type'` is a tuple, as it needs to store more than one type if compound tokens are merged (during the application of _non-strict tokenization hints_).

  * If `'pattern_type'` of a _strict tokenization hint_ (a "1st level pattern") contains prefix `negative:` (e.g.  `'negative:ps_abbreviation'`), then the pattern does not produce any tokenization hints, but it is used instead to prevent other patterns from matching. Basically, it describes strings that are similar to ones captured by some positive pattern, and that should not be captured (as they would be false positives). For instance, a negative pattern is created to capture temperature units followed by sentence ending (e.g. ... _kuumarekord on 38**º C.** Talved on_ ... ) in order to prevent patterns capturing names with initials from matching (e.g. capturing _**C. Talved**_ as a name with an initial). Note that a negative pattern must have `'_priority_'` value smaller than `'_priority_'` values of patterns it prevents from matching.

Attribute `'_priority_'` describes the priority of the pattern: the smaller the value, the higher the priority. Priority comes into play when multiple patterns capture the same string region, or when the captured regions overlap. In such cases, the string captured by the pattern with the highest priority (lowest priority value) will be chosen. In case of equal `'_priority_'` values, the default strategy is to choose the longest string.
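
As a rough illustration of this rule, the sketch below resolves overlaps greedily: hints are ranked by priority value (lower first) and, for equal priorities, by length (longer first). The function `resolve_conflicts`, the hint dictionaries and all priority values are made up for the example; the conflict resolver actually used by `RegexTagger` may differ in details.

```python
def resolve_conflicts(hints):
    """Keep, among overlapping hints, the one with the lowest priority value;
    if priorities are equal, keep the longest one. Greedy sketch only."""
    ranked = sorted(hints, key=lambda h: (h['priority'], -(h['end'] - h['start'])))
    kept = []
    for h in ranked:
        # Accept the hint only if it overlaps none of the already kept hints.
        if all(h['end'] <= k['start'] or h['start'] >= k['end'] for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h['start'])

hints = [
    {'type': 'email',        'start': 0,  'end': 12, 'priority': (0, 0, 1)},
    {'type': 'www_address',  'start': 4,  'end': 12, 'priority': (1, 0, 0)},
    {'type': 'abbreviation', 'start': 20, 'end': 22, 'priority': (5, 0)},
    {'type': 'abbreviation', 'start': 20, 'end': 24, 'priority': (5, 0)},
]

resolved = resolve_conflicts(hints)
print([(h['type'], h['start'], h['end']) for h in resolved])
```

Python compares the priority tuples element by element, so tuples of different lengths, such as `(0, 0, 1)` and `(3, 0)`, can be ranked against each other.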

Attribute `'_regex_pattern_'` gives the regular expression for capturing the string of the compound token. It can be a regular expression pattern string, but also a pre-compiled regular expression object. In the previous example, the pattern string is given as a template, in which named placeholders (`{ALPHANUM}`) are filled in using the information from the dictionary `MACROS`.

Attribute `'_group_'` gives the number (or the name) of the group captured by the regular expression which represents the _actual compound token_. So, the regular expression can also describe compound tokens with some context, and the group number can be used to pick out the _compound token_.
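
To see how the template and the group number work together, here is a minimal sketch that uses only the standard `re` module. The `MACROS` dictionary below is a hypothetical stand-in: the real one in `estnltk.taggers.text_segmentation.patterns` may define `ALPHANUM` differently (e.g. to include Estonian letters).

```python
import re

# Hypothetical stand-in for the MACROS dictionary of the patterns module.
MACROS = {'ALPHANUM': 'A-Za-z0-9'}

email_entry = {
    'pattern_type': 'email',
    '_group_': 1,
    # Named placeholders in the template are filled in from MACROS.
    '_regex_pattern_': r'([{ALPHANUM}_.+-]+@[{ALPHANUM}-]+\.[{ALPHANUM}-.]+)'.format(**MACROS),
}

text = 'Saada need e-postiaadressile big@boss.com või tule sisesta lehelt www.iamboss.com'
match = re.search(email_entry['_regex_pattern_'], text)

# '_group_' selects the part of the match that becomes the compound token.
captured = match.group(email_entry['_group_'])
span = match.span(email_entry['_group_'])
print(captured, span)
```

The character span of the selected group is what later gets compared against token boundaries.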

_Non-strict tokenization hints_ ("2nd level patterns") have two additional mandatory attributes, `'left_strict'` and `'right_strict'`, which can reduce strictness of matching either on left or right end of the token sequence:

  * if **`left_strict==False`** `and right_strict==True`, then the pattern only describes the right end of the token sequence, and the left end is unspecified (could be a single token, or a compound token with unspecified length);
  
  * if **`right_strict==False`** `and left_strict==True`, then the pattern only describes the left end of the token sequence, and the right end is unspecified (could be a single token, or a compound token with unspecified length);
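
The case `left_strict==False`, `right_strict==True` can be sketched as follows. The helper `extend_left` and its data representation (character spans for tokens, index pairs for compound tokens) are assumptions made for this illustration; they do not mirror the internals of `CompoundTokenTagger`.

```python
def extend_left(tokens, compounds, hint_start, hint_end):
    """Apply a hint with left_strict=False and right_strict=True.

    tokens    -- list of (start, end) character spans
    compounds -- list of (first_token_index, last_token_index) pairs
    Returns (first, last) token indices of the extended compound token,
    or None if the right end of the hint does not match a token end.
    """
    last = next((i for i, (s, e) in enumerate(tokens) if e == hint_end), None)
    first = next((i for i, (s, e) in enumerate(tokens) if s <= hint_start < e), None)
    if last is None or first is None:
        return None
    # If the leftmost token belongs to an existing compound token,
    # the whole compound token is taken along.
    for c_first, c_last in compounds:
        if c_first <= first <= c_last:
            first = c_first
    return (first, last)

# '10 000-st' is tokenized as '10', '000', '-', 'st'
tokens = [(0, 2), (3, 6), (6, 7), (7, 9)]
# Suppose a case-ending pattern matched '0-st', i.e. characters 5..9.
with_numeric = extend_left(tokens, [(0, 1)], 5, 9)  # '10 000' already joined
without_numeric = extend_left(tokens, [], 5, 9)     # no earlier compound token
print(with_numeric, without_numeric)
```

In the first call the existing numeric compound is extended to cover all four tokens; in the second call only the tokens `'000'`, `'-'` and `'st'` are joined.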

And finally, attribute `'normalized'` gives a lambda function (or a string describing a lambda function) which is to be applied on a match object to produce a normalized version of the captured string. If normalization is not necessary, the value can be `'lambda m: None'` (like in the previous example).
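
A string describing a lambda function has to be turned into a callable before it can be applied. The sketch below does this with `eval`, which is an assumption about the mechanism rather than a description of `RegexTagger`'s code; the helper `make_normalizer` is hypothetical. Since `eval` executes arbitrary code, such strings should only come from trusted pattern files.

```python
import re

def make_normalizer(spec):
    """Accept a callable, or a string describing a lambda function."""
    if callable(spec):
        return spec
    # The string is evaluated with 're' available, so the lambda can use re.sub
    return eval(spec, {'re': re})

strip_whitespace = make_normalizer(r"lambda m: re.sub(r'\s', '', m.group(1))")
no_normalization = make_normalizer('lambda m: None')

m = re.search(r'(cd\s*/\s*m²)', 'heledus: 5000 cd / m²')
normalized = strip_whitespace(m)
print(repr(normalized), no_normalization(m))
```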


##### Normalized word forms

`CompoundTokenTagger` also includes all word normalization that is related to compound tokens, for instance, removal of whitespace inside abbreviations (`'e. m. a.'` => `'e.m.a.'`) and finding orthographic forms of words containing hyphens (`'v-vä-väga'` => `'väga'`).
The resulting normalized form will be put into the attribute `normalized` that all compound tokens have. Note that if no normalization was applied for the compound token, then `normalized=None`.

During the creation of layer `'words'`, the content of `normalized` will be carried over to the attribute `normalized_form` in the `'words'` layer. This information is then used in the morphological analysis, where words that have `normalized_form != None` will be analysed not according to their surface forms, but according to their normalized forms.

##### Improving `CompoundTokenTagger`'s rules: an example

When analyzing texts from a specific domain, you may need to improve `CompoundTokenTagger`'s existing rules, or add additional rules. Here is a short example about how it can be done.

Let's consider an example text, where we have a unit that is not analysed as a compound token by the current rules:

In [17]:
text = Text('LED-infotabloo heledus: 5000 cd/m²').tag_layer(['words'])
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,7

text,normalized_form
LED-infotabloo,
heledus,
:,
5000,
cd,
/,
m²,


We want to analyse `'cd/m²'` as a single word.

First, we should examine the module [`estnltk.taggers.text_segmentation.patterns`](https://github.com/estnltk/estnltk/blob/version_1.6/estnltk/taggers/text_segmentation/patterns.py) and find out which list of rules addresses the same (or a similar) type of token compounding. For the previous example, the list `unit_patterns` is the best match.

Now, let's import the list that contains all 1st level patterns, including unit patterns:

In [18]:
# Import list containing all first level patterns
from estnltk.taggers.text_segmentation.compound_token_tagger import ALL_1ST_LEVEL_PATTERNS

Now, let's create our own pattern (see the description of the format in the subsection _Tokenization hints_ ):

In [19]:
import regex as re
# Create a new unit pattern
new_unit_pattern = \
    { 'comment': '2.3) A pattern for capturing cd/m² units;',
      'example': 'cd/m²',
      'pattern_type': 'unit',
      '_regex_pattern_': re.compile(r'(cd\s*/\s*m²)'),
      '_group_': 1,
      '_priority_': (3, 0),
      'normalized': r"lambda m: re.sub(r'\s' ,'' , m.group(1))",
    }

Note: we have set the `_priority_` of the new pattern to `(3, 0)`, so that it has the highest priority among the rules in `unit_patterns`, and it will be executed before all the other _unit patterns_. This is required because the new pattern overlaps partially with one of the old patterns. In case of an overlap, only the rule with the highest priority will be used for compounding tokens. As the old rule did not manage to do the compounding, we must make sure that the new rule has the higher priority, so that it will affect the compounding in that specific context.

In [20]:
# Add the new pattern to the list of all patterns
ALL_1ST_LEVEL_PATTERNS.append( new_unit_pattern )

Ok. Now, we have updated the default set of rules. 
Next, we must create a new `CompoundTokenTagger` that uses updated rules instead of the default rules:

In [21]:
# Create new CompoundTokenTagger that uses updated patterns
new_compound_token_tagger = CompoundTokenTagger(patterns_1=ALL_1ST_LEVEL_PATTERNS)

Finally, let's use our improved tagger to analyse the text:

In [22]:
# Prepare text
text = Text('LED-infotabloo heledus: 5000 cd/m²').tag_layer(['tokens'])
# Apply the new tagger on the text
new_compound_token_tagger.tag(text)
# Check the results
text['compound_tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
"['LED', '-', 'infotabloo']",['hyphenation'],
"['cd', '/', 'm²']",['unit'],cd/m²


_Mission complete!_

#### Comparisons to compounding rules used in the EstSyntax pre-processing module

When building EstNLTK's compounding rules, the tokenization postcorrection rules of the pre-processing module of EstSyntax (available at https://github.com/EstSyntax/preprocessing-module and https://github.com/kristiinavaik/ettenten-eeltootlus) were taken as a starting point. A number of these rules were reimplemented in EstNLTK, but not all of them. The following table compares EstNLTK's and EstSyntax's token compounding approaches:

Type of compound token | Examples | Compounded by EstSyntax <br> preprocessing module? \*\* | Compounded by EstNLTK <br> 1.6?
--- | --- | --- | ---
**`numerics`** `with digit grouping` | `20 000` | yes | yes
`numerics with decimal separator` | `3 , 5` <br> `3,5` <br> `3.5` | yes | yes
`numerics followed by period` <br> (ordinal numbers) | `1995.`  <br> `1 .` | yes | yes
`numerics with sign` | `-3` <br> `± 500` | yes | yes
`numerics with percentage sign` | `10 %` <br> `25%` | yes | yes
`common` **`date and time patterns`** | `15. 04. 2005` <br> `16:30` | yes\*\* | yes
**`ranges`** `of numbers` | `40 000-45 000` <br> `8 - 16%` <br> `14.00 – 16.30` <br> `2 ... 3 , 5` | yes | no
**`scales/ratios`** `of numbers` | `1 , 5 : 0 , 5` <br> `0 : 4` | yes | no
**`proportions`** `of numbers` | `5 36-st` | yes | no
`(binary)` **`arithmetic operations`** | `17± 5` <br> `3 x 15` | yes | no
**`arithmetic expressions`** and <br> formula-like expressions | `2 + 3 = 5` <br> `n = 122` | yes | no
**`units`** `"X-per-Y"`  | `km / h` <br> `g/km` | yes | yes
`quantities with units` | `60 km / h` <br> `2,3 h/m` <br> `1,0 mM` | yes | no
`1-letter` **`abbreviations`** <br> `with numbers`  | `E 961` <br> `I 26` | yes | yes
`common` **`abbreviations`** <br> | `s. o.` <br> `Nt .` <br> `Jr.` | yes | yes
`names with` **`initials`** | `A . H . Tammsaare` <br> `A. H. Tammsaare` <br> `D . Trump` | yes | yes
`names with` **`ampersands`** | `Simon &amp; Schusteri` | yes | no
`morphological` **`case endings`** | `4000-le` <br> `SKT-st` <br> `workshop ' e` | yes | yes
**`xml tags`** | `<p heading="0">` <br> `</br>` | partially\*\* | yes
**`email addresses`** | `big@boss.com` <br> `user [ -at- ] dumb.com` | yes\*\* | yes
**`www addresses`** | `http : //www.offa.org/ stats` <br> `www.esindus.ee/korteriturg` | yes\*\* | yes
`common` **`emoticons`** | `:-)` <br> `:)))` | yes | yes


\*\* The most important difference between EstNLTK's and EstSyntax's token compounding approaches is the following. EstSyntax aims to provide postcorrections -- that is, to fix tokenization that has been broken (e.g. by an earlier automatic tokenization). So, in many cases, EstSyntax's patterns focus only on problematic cases, and do not specifically address similar cases with correct tokenization. For instance, EstSyntax's email detection patterns can capture address `"dumb . user [ -at- ] dumb.com"`, but there is no pattern for capturing address `"big@boss.com"`. EstNLTK, on the other hand, aims to cover correctly tokenized cases, and also to provide postcorrections where necessary (e.g. both email addresses `"dumb . user [ -at- ] dumb.com"` and `"big@boss.com"` are captured).

##### Known limitations and points for further improvement

 * The order in which the _tokenization hints_ are applied is **important** in the current implementation. For instance, consider the case when pattern A comes before pattern B (i.e.  A has higher priority than B), and these patterns cover overlapping strings. If the _tokenization hint_ captured by A is not realized because it fails to meet start/end positions of tokens, then the hint captured by B is also not realized (even if it meets the start/end positions), because B's hint is deleted as a hint being subsumed by A's hint. In order to overcome this situation, you'll currently need to either change the order of patterns, or, if possible, change the patterns in a way that there is no overlap anymore.    
 Technical reason behind the problem: the `RegexTagger`-s (in `CompoundTokenTagger`) apply conflict resolution: in case of overlapping spans, only the spans with the highest priority are returned. An alternative solution would be to return all spans (regardless of priorities), remove spans that do not meet the start/end positions of tokens, and _only after that_ apply conflict resolution to the remaining spans. This alternative would require some reimplementation in `CompoundTokenTagger`, but the benefit would be more flexibility in defining the patterns, so that tokenization hints would not get unexpectedly subsumed by other hints;
 
 * Although `CompoundTokenTagger` accepts a list of custom non-ending abbreviations, these abbreviations cannot contain punctuation symbols (because punctuation symbols are split into different tokens by `TokensTagger`). Still, defining custom abbreviations that contain punctuation symbols can be desirable, e.g. to capture acronyms such as `'U.S.A.'` or `'J.M.K.E.'`, or (orthographically incorrect) abbreviations such as `'j.n.e.'` and `'n.t.x'`. For this purpose, the custom abbreviation detection logic in the method `CompoundTokenTagger.tag()` should be reimplemented in a way that custom abbreviations are firstly matched on the plain text (not on token texts as currently), and then their spans (starts/ends) are matched with token starts/ends in a similar way as tokenization hints currently are;
 
 * Patterns for capturing `names_with_initials` currently do not cover names where an initial is inside the name (e.g. `'George W. Bush'`), at the end of the name (e.g. `'Precht, D.'`), or names that have nobiliary particles / prepositions in the middle (e.g. `'O. M. von Stackelberg'`). For capturing such names, the list `initial_patterns` (in the module `estnltk.taggers.text_segmentation.patterns`) needs improvement.
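
The first limitation in the list above (a hint of pattern A subsuming a hint of pattern B) can be reproduced with a toy model. Everything in the sketch is hypothetical: the helpers `resolve` and `aligned`, the token spans and the priority values are invented to contrast the current order of operations with the alternative one, and do not come from EstNLTK.

```python
def overlaps(a, b):
    return a['start'] < b['end'] and b['start'] < a['end']

def resolve(hints):
    """Greedy conflict resolution: the lower priority value wins."""
    kept = []
    for h in sorted(hints, key=lambda h: h['priority']):
        if not any(overlaps(h, k) for k in kept):
            kept.append(h)
    return kept

def aligned(hint, tokens):
    """True if the hint starts at a token start and ends at a token end."""
    return (any(s == hint['start'] for s, e in tokens) and
            any(e == hint['end'] for s, e in tokens))

tokens = [(0, 2), (3, 6), (7, 9)]
hint_a = {'name': 'A', 'start': 1, 'end': 6, 'priority': (1, 0)}  # starts mid-token
hint_b = {'name': 'B', 'start': 3, 'end': 6, 'priority': (2, 0)}  # fits token 1 exactly

# Current order: resolve conflicts first, then check token boundaries.
current = [h['name'] for h in resolve([hint_a, hint_b]) if aligned(h, tokens)]
# Alternative order: check token boundaries first, then resolve conflicts.
alternative = [h['name'] for h in resolve([h for h in (hint_a, hint_b) if aligned(h, tokens)])]

print(current, alternative)
```

In the current order, A wins the conflict but then fails the boundary check, so nothing is joined; in the alternative order, B survives.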