# <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Text segmentation: Sentences </span>

Once the layer of `words` is established, sentence boundaries between the words will be determined. In EstNLTK, the component responsible for this is the `SentenceTokenizer`.

The `SentenceTokenizer` splits the text into sentences using NLTK's `PunktSentenceTokenizer` with an Estonian-specific model (available from [here](https://github.com/nltk/nltk_data/tree/gh-pages/packages/tokenizers)). After the initial sentence tokenization, `SentenceTokenizer` applies a series of post-corrections to the obtained results. Post-corrections include removing sentence endings after non-ending abbreviations, adding sentence endings after emoticons, fixing endings in case of prolonged ending punctuation (e.g. `'...'` or `'??'`), and merging sentences that have been mistakenly split because of period-containing numeric expressions (e.g. dates/times) or periods near double quotes and parentheses.

In the following example, we create a text object, add the prerequisite layer (words), and then segment the text into sentences:

In [1]:
from estnltk import Text
from estnltk.taggers import SentenceTokenizer

text = Text('''Esimene lõik. Teine lause.

Teine lõik.''')
text.tag_layer(['words'])
SentenceTokenizer().tag(text)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,3

text
"['Esimene', 'lõik', '.']"
"['Teine', 'lause', '.']"
"['Teine', 'lõik', '.']"


#### Post-corrections of SentenceTokenizer

After obtaining the initial sentence boundaries, the `SentenceTokenizer` applies post-corrections to fix erroneous sentence boundaries. Different types of fixes are grouped together and flags can be used to switch these fixes off (by default, all fixes are switched on).

##### Fixes related to compound tokens (`fix_compound_tokens`)

The list of sentence endings is filtered, and all the sentence endings that fall inside `compound_tokens` are removed. Special attention is paid to `compound_tokens` of type `non_ending_abbreviation`: a sentence ending added after a `non_ending_abbreviation` will be removed.

In [2]:
text = Text('''Lp. esimees, vt. seda joonist seal leheküljel.''')
text.tag_layer(['words'])
SentenceTokenizer( fix_compound_tokens=True ).tag(text)  # fix sentence endings related to compound tokens (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1

text
"['Lp.', 'esimees', ',', 'vt.', 'seda', 'joonist', 'seal', 'leheküljel', '.']"


In addition to `non_ending_abbreviation`-s, regular `abbreviations` in specific contexts can also cause wrong sentence endings. So, if `fix_compound_tokens=True`, then special merge patterns are also applied to detect regular abbreviations that are followed by a period, and then by lowercase letters or non-ending punctuation (e.g. comma, semicolon). If a sentence break appears after the period in such contexts, then the two consecutive sentences are joined together:

In [3]:
text = Text('''Jooniste, tabelite jm. abil on kõik selgeks tehtud.''')

text.tag_layer(['words'])
SentenceTokenizer( fix_compound_tokens=True ).tag(text)  # switch on fixes related to compound tokens (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1

text
"['Jooniste', ',', 'tabelite', 'jm.', 'abil', 'on', 'kõik', 'selgeks', 'tehtud', '.']"


In [4]:
text['compound_tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,1

text,type,normalized
"['jm', '.']",['abbreviation'],jm.


In the previous example, the compound token _'jm.'_ has type `abbreviation`, not `non_ending_abbreviation`, because it can also appear at the end of a sentence. So, we can only heuristically determine that it is likely not a sentence ending if the following text passage starts with lowercase letters or non-ending punctuation.

Note: using this setting requires that `abbreviation`-s and `non_ending_abbreviation`-s are detected during the previous processing, and are available in the layer `compound_tokens` (see `CompoundTokenTagger` for details).
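
The heuristic described above can be illustrated with a small stand-alone sketch. It uses only Python's standard `re` module and does not call EstNLTK; the function name `should_merge_after_abbreviation` and the tiny abbreviation set are hypothetical and chosen for illustration only (EstNLTK takes this information from the `compound_tokens` layer):

```python
import re

# Hypothetical, tiny set of regular abbreviations (illustration only)
ABBREVIATIONS = {'jm', 'jt', 'jne'}

def should_merge_after_abbreviation(first, second):
    """Return True if the break between two sentence strings is likely wrong:
    `first` ends with a known abbreviation and a period, and `second` starts
    with a lowercase letter or with non-ending punctuation (comma, semicolon)."""
    m = re.search(r'(\w+)\s*\.\s*$', first)
    if m is None or m.group(1).lower() not in ABBREVIATIONS:
        return False
    second = second.lstrip()
    return bool(second) and (second[0].islower() or second[0] in ',;')

# A lowercase continuation suggests that the sentence has not ended
print(should_merge_after_abbreviation('Jooniste, tabelite jm.',
                                      'abil on kõik selgeks tehtud.'))
# An uppercase start is left alone: 'jm.' may well end a sentence
print(should_merge_after_abbreviation('Jooniste, tabelite jm.',
                                      'Kõik on selge.'))
```

The second call shows why the rule is only a heuristic: when the next passage starts with an uppercase letter, the sketch cannot tell whether the abbreviation ended the sentence, so it keeps the break.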

##### Fixes related to numeric expressions (`fix_numeric`)

Removes sentence endings that are mistakenly added after periods that end date, time and (other) numeric expressions.

In [5]:
text = Text("17 . okt. 1998 a . laekus firmale täpselt 700.- eeku.")

text.tag_layer(['words'])
SentenceTokenizer( fix_numeric=True ).tag(text)  # switch on fixes related to numeric expressions (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1

text
"['17 .', 'okt', '.', '1998', 'a .', 'laekus', 'firmale', 'täpselt', '700.', '-', 'eeku', '.']"


##### Fixes related to parentheses (`fix_parentheses`)

Removes sentence endings that are mistakenly added inside parentheses, and fixes endings that are misplaced with respect to parentheses.

In [6]:
text = Text("( Naerab. )\nEriti siis , kui sõidan mootorratta või jalgrattaga ( v. tasakaaluliikuriga ).")

text.tag_layer(['words'])
SentenceTokenizer( fix_parentheses=True ).tag(text)  # switch on fixes related to parentheses (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2

text
"['(', 'Naerab', '.', ')']"
"['Eriti', 'siis', ',', 'kui', 'sõidan', 'mootorratta', 'või', 'jalgrattaga', '(' ..., type: <class 'list'>, length: 14"


##### Fixes related to double quotes (`fix_double_quotes`)

Removes sentence endings that are misplaced with respect to quotations / double quotes.

In [7]:
text = Text("""« Meeste lihas on tühjem , aga võtab taastamistegevust vastu paremini kui varem . »
« Meie treeningutel on üks uus peateema ! » elavneb Alaver .""")

text.tag_layer(['words'])
SentenceTokenizer( fix_double_quotes=True ).tag(text)  # switch on fixes related to double quotes (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2

text
"['«', 'Meeste', 'lihas', 'on', 'tühjem', ',', 'aga', 'võtab', 'taastamistegevust ..., type: <class 'list'>, length: 15"
"['«', 'Meie', 'treeningutel', 'on', 'üks', 'uus', 'peateema', '!', '»', 'elavneb', 'Alaver', '.']"


Note: these fixes only involve _local context_: every pair of two consecutive sentences is checked for a misplaced sentence boundary in between them. 
However, not all of the misplacements can be detected this way, e.g. if a quotation covers more than two sentences, it will be out of the scope.
For this reason, there is another set of fixes, which considers the usage of quotation marks in the whole document. 
See below for details.

##### Fixes related to inner titles ending with punctuation (`fix_inner_title_punct`)

Removes sentence endings that are mistakenly placed after titles inside the sentence. Currently, it only fixes cases where a question mark or an exclamation mark is followed by a sentence ending, and the next sentence starts with a comma or a semicolon.

In [8]:
text = Text("Laval olid jõulise naissolistiga Conflict OK!, kitarripoppi mängivad Claires Birthday ja Seachers.")

text.tag_layer(['words'])
SentenceTokenizer( fix_inner_title_punct=True ).tag(text)  # switch on fixes related to inner titles (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1

text
"['Laval', 'olid', 'jõulise', 'naissolistiga', 'Conflict', 'OK', '!', ',', 'kitar ..., type: <class 'list'>, length: 15"


##### Fixes related to prolonged sentence ending punctuation (`fix_repeated_ending_punct`)

Adds sentence endings that are missed in places of prolonged ending punctuation (including ellipsis / triple dots), and also fixes misplaced sentence endings in such contexts.

In [9]:
text = Text("Seda ma ei teadnud... Ja tegelikult ei saanudki teada ! !! ")

text.tag_layer(['words'])
SentenceTokenizer( fix_repeated_ending_punct=True ).tag(text)  # switch on fixes related to repeated ending punct (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2

text
"['Seda', 'ma', 'ei', 'teadnud', '...']"
"['Ja', 'tegelikult', 'ei', 'saanudki', 'teada', '!', '!!']"


##### Use emoticons as sentence endings (`use_emoticons_as_endings`)

If switched on (the default setting), then emoticons are treated as sentence endings. Note: this requires that emoticons are detected during the previous processing, and are available in the layer `compound_tokens` (see `CompoundTokenTagger` for details).

In [10]:
text = Text("Nii habras, ilus ja minu oma :) Kõige parem mis kunagi juhtuda saab :):) Magamata öid mul muidugi ei olnud.")

text.tag_layer(['words'])
SentenceTokenizer( use_emoticons_as_endings=True ).tag(text)  # switch on using emoticons as sentence endings (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,3

text
"['Nii', 'habras', ',', 'ilus', 'ja', 'minu', 'oma', ':)']"
"['Kõige', 'parem', 'mis', 'kunagi', 'juhtuda', 'saab', ':)', ':)']"
"['Magamata', 'öid', 'mul', 'muidugi', 'ei', 'olnud', '.']"


##### Fix paragraph endings (`fix_paragraph_endings`)

If switched on (the default setting), then paragraph endings (double newlines) are treated as sentence endings.

In [11]:
text = Text('''
Herbes de Provence maitseainesegu

Teistes keeltes

English: herbes de Provence, Provençal herbs

French: herbes de Provence

Kirjeldus

1970ndatel prantsuse köögis populaarseks muutunud maitseainesegu.
''')

text.tag_layer(['words'])
SentenceTokenizer( fix_paragraph_endings=True ).tag(text)  # switch on using double newlines as sentence endings (default)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6

text
"['Herbes', 'de', 'Provence', 'maitseainesegu']"
"['Teistes', 'keeltes']"
"['English', ':', 'herbes', 'de', 'Provence', ',', 'Provençal', 'herbs']"
"['French', ':', 'herbes', 'de', 'Provence']"
['Kirjeldus']
"['1970ndatel', 'prantsuse', 'köögis', 'populaarseks', 'muutunud', 'maitseainesegu', '.']"


_Note_: `fix_paragraph_endings` has higher priority than fixes made by merge rules (see the step "_applying merge patterns (and merge-and-split patterns)_" in technical details below). This means that if `fix_paragraph_endings` has been switched on, and a paragraph ending is between two sentences that could be merged by the rules, then the merging will be cancelled.
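
The priority rule in the note above can be sketched without EstNLTK. The following stand-alone example uses only the standard `re` module; the helper names `paragraph_spans` and `merge_allowed` are hypothetical, and the logic is a simplified illustration under the assumption that a paragraph ending is a run of two or more newlines:

```python
import re

def paragraph_spans(text):
    """Return (start, end) character spans of chunks separated by
    paragraph endings (two or more consecutive newlines)."""
    spans, start = [], 0
    for m in re.finditer(r'\n{2,}', text):
        if m.start() > start:
            spans.append((start, m.start()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans

def merge_allowed(text, first_end, second_start):
    """A merge of two consecutive sentences is cancelled if a paragraph
    ending lies between the end of the first and the start of the second."""
    return '\n\n' not in text[first_end:second_start]

text = 'Teistes keeltes\n\nEnglish: herbes de Provence\n\nKirjeldus'
spans = paragraph_spans(text)
chunks = [text[s:e] for s, e in spans]
print(chunks)
# The first two chunks are separated by a paragraph ending, so no merge
print(merge_allowed(text, spans[0][1], spans[1][0]))
```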

##### Fixes of double quotes based on counting (`fix_double_quotes_based_on_counts`)

If switched on, then starting and ending double quotes are counted in the whole text, and this information is used to make additional sentence boundary fixes. 
For instance, if double quotes ending a quotation are wrongly placed at the beginning of a sentence, then they will be moved to the end of a previous sentence:

In [12]:
text = Text('''
" Minul pole häda midägit .
Tervis on enam-vähem .
Pension käib ja puha . "
Selgub, et traktor aias on poja oma .
''')

text.tag_layer(['words'])
SentenceTokenizer( fix_double_quotes_based_on_counts=True ).tag(text)  # count double quotes in the whole document and make fixes based on counting results
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,4

text
"['""', 'Minul', 'pole', 'häda', 'midägit', '.']"
"['Tervis', 'on', 'enam-vähem', '.']"
"['Pension', 'käib', 'ja', 'puha', '.', '""']"
"['Selgub', ',', 'et', 'traktor', 'aias', 'on', 'poja', 'oma', '.']"


_Note_: currently, this fix is switched off by default, because it slows down the sentence tokenization process a bit.
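
To make the idea of counting-based fixes more concrete, here is a minimal stand-alone sketch of the first kind of fix (moving a misplaced ending quote to the previous sentence). It is not EstNLTK's implementation: the function name `move_leading_end_quotes` is hypothetical, sentences are plain strings, and the sketch assumes that only straight double quotes are used, so that an odd number of quotes seen so far means a quotation is still open:

```python
def move_leading_end_quotes(sentences, quote='"'):
    """If a sentence starts with a quote that closes an open quotation,
    move that quote to the end of the previous sentence."""
    fixed = []
    seen_quotes = 0  # number of double quotes seen so far in the text
    for sent in sentences:
        stripped = sent.strip()
        # An odd count means a quotation is open, so a quote at the very
        # start of this sentence must be the closing one
        if fixed and seen_quotes % 2 == 1 and stripped.startswith(quote):
            fixed[-1] = fixed[-1] + ' ' + quote
            stripped = stripped[len(quote):].lstrip()
            seen_quotes += 1
        seen_quotes += stripped.count(quote)
        if stripped:
            fixed.append(stripped)
    return fixed

broken = ['" Minul pole häda midägit .',
          'Tervis on enam-vähem .',
          'Pension käib ja puha .',
          '" Selgub, et traktor aias on poja oma .']
for sentence in move_leading_end_quotes(broken):
    print(sentence)
```

Because the decision depends on all quotes seen earlier in the text, this kind of fix cannot be expressed as a check on two neighbouring sentences alone.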

#### Technical details

The initial sentence tokenization is obtained via `PunktSentenceTokenizer`'s method [sentences_from_tokens()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer.sentences_from_tokens), which takes a list of words as input and groups the words into sentences. Using this method ensures that:

  * words consisting of compound tokens (e.g. names with initials like _'A. H. Tammsaare'_) will be treated as single text units, and sentence boundaries will not be added mistakenly inside such words;
  * if a sentence ending symbol mistakenly "glues together" words in text (e.g. `'Kas on niipalju vaja?Ei ole ju.'`), then the list of words maintains the correct separation (e.g. `['Kas', 'on', 'niipalju', 'vaja', '?', 'Ei', 'ole', 'ju', '.']`), and provides a basis for a correct sentence tokenization (otherwise, the sentence boundary would be missed because of the "words glued together");

  _Remark_: If you really need to, then you can also change the initial sentence tokenizer (see the subsection _"Customizing base tokenizer of the `SentenceTokenizer`"_ below), but please keep in mind that post-corrections of `SentenceTokenizer` have been specifically created for the `PunktSentenceTokenizer`, and they may not work properly with other sentence tokenizers;
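
The benefit of grouping words rather than splitting raw text can be seen in a small stand-alone sketch. It uses only the standard `re` module and a deliberately naive rule (close a sentence at every `.`, `?` or `!` token); it is not the Punkt algorithm, and the helper `group_tokens` is a hypothetical name used only for this illustration:

```python
import re

text = 'Kas on niipalju vaja?Ei ole ju.'

# Naive approach on raw text: split after ending punctuation that is
# followed by whitespace. The "glued" boundary in 'vaja?Ei' is missed.
naive = re.split(r'(?<=[.?!])\s+', text)

# Token-based approach: first separate words and punctuation ...
tokens = re.findall(r'\w+|[^\w\s]', text)

# ... then group the tokens, closing a sentence at each ending punctuation
def group_tokens(tokens):
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {'.', '?', '!'}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

print(len(naive))            # number of sentences found on raw text
print(group_tokens(tokens))  # sentences found on the list of words
```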

After the initial sentence tokenization, the following post-correction steps are applied:

  1. _fixing compound tokens_ ( flag `fix_compound_tokens` ) -- a built-in logic is used to remove all sentence endings that fall inside `compound_tokens`, and also sentence endings that are added after `compound_tokens` of type `non_ending_abbreviation` are removed. These fixes also have a continuation, see 6.1 for details;
  2. _fixing repeated ending punctuation_ ( flag `fix_repeated_ending_punct` ) -- a built-in logic is used to add sentence endings after prolonged ending punctuation (including ellipsis/triple dots) if the prolonged ending punctuation is followed by a titlecased word. An exception: if prolonged ending punctuation is immediately after starting brackets, then sentence ending won't be added. These fixes also have a continuation, see 6.2 for details;
  3. _adding sentence endings after emoticons_ ( flag `use_emoticons_as_endings` ) -- a built-in logic is used to add sentence endings after emoticons (`compound_tokens` of type `emoticon`) if emoticons are followed by a titlecased word. Also, if a sentence ending already exists before the first emoticon, then it is removed (to assure that emoticons belong with the ending sentence);
  4. _collecting spans of potential sentences_ -- a built-in logic is used to collect spans (starts, ends) of potential sentences. Note that in steps 1-3, only sentence endings were processed, and at this step, full sentence spans are created. As a side effect, spans are also aligned with starts and ends of the words;
  5. _fixing sentence endings related to paragraph endings_ ( flag `fix_paragraph_endings` ) -- a built-in logic is used to add sentence breaks in places where paragraphs end. The current logic marks double newlines as paragraph endings;
  6. _applying merge patterns (and merge-and-split patterns)_ -- merging patterns are applied to join together consecutive "sentences" if the sentence break between them was erroneous. Patterns include a special subset called _merge-and-split patterns_, which first join two sentences together, and then split the result into two sentences at some other location inside one of the sentences. Merging patterns are divided into different types, which can be switched off / on by passing flags to `SentenceTokenizer`'s constructor. The following types of patterns are applied by default:

  6.1. _fixing sentence endings mistakenly added after regular `abbreviation`-s_ ( flag `fix_compound_tokens` ) -- if a regular `abbreviation` is followed by a sentence break, and then by a lowercase word or non-ending punctuation (comma or semicolon), then the sentence break is removed after such abbreviation. Patterns that are applied in this step have `'fix_type'` that starts with the prefix `'abbrev'`;   
  6.2. _fixing sentence endings mistakenly added inside prolonged punctuation_ ( flag `fix_repeated_ending_punct` ) -- if there is a sentence break inside the prolonged ending punctuation (e.g. the last exclamation mark forms a new sentence), then the sentence break will be removed. Merge patterns that are applied in this step have `'fix_type'` that starts with the prefix `'repeated_ending_punct'`;  
  6.3. _removing sentence endings that were mistakenly added after periods that end date, time and (other) numeric expressions_ ( flag `fix_numeric`). Merge patterns applied in this step have `'fix_type'` starting with `'numeric'`;  
  6.4. _fixing sentence endings that were misplaced with respect to parentheses_ ( flag `fix_parentheses` ). Merge patterns applied in this step have `'fix_type'` starting with `'parentheses'`;  
  6.5. _fixing sentence endings that were misplaced with respect to quotations / double quotes_ ( flag `fix_double_quotes` ); Merge patterns applied in this step have `'fix_type'` starting with `'double_quotes'`;    
  6.6. _removing sentence endings that were mistakenly placed after titles inside the sentence_ ( flag `fix_inner_title_punct` ); Merge patterns applied in this step have `'fix_type'` starting with `'inner_title_punct'`;       

  7. _fixing sentence boundaries based on counting quotation marks in the whole text_ ( flag `fix_double_quotes_based_on_counts` ) -- a built-in logic is used to count starting and ending double quotes in the whole text, and to provide sentence boundary corrections based on the found information. The following fixes will be applied:

  7.1. if a sentence starts with an ending quotation mark, then the ending quote is moved to the end of the previous sentence;  
  7.2. if the movable ending quote is followed by the attribution part of the quote (describing "who uttered the quote"), then the attribution part is also moved to the end of the previous sentence;  
  7.3. if there is an ending quotation mark inside a sentence, followed instantly by a starting quotation mark, then the sentence is split in two after the ending quotation mark;
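
As an illustration of step 2 in the list above, the following stand-alone sketch marks sentence endings after prolonged ending punctuation when a titlecased word follows, and skips the case where the punctuation comes right after a starting bracket. It works on a plain list of word strings and uses only the standard `re` module; the function name `endings_after_prolonged_punct` and the exact regular expression are assumptions made for this example, not EstNLTK's built-in logic:

```python
import re

# Prolonged ending punctuation: '..', '...', '??', '!!', '?!' or a single '…'
PROLONGED = re.compile(r'^(?:\.{2,}|[?!]{2,}|…)$')

def endings_after_prolonged_punct(tokens):
    """Return indexes of tokens after which a sentence ending is added."""
    endings = []
    for i, tok in enumerate(tokens[:-1]):
        if not PROLONGED.match(tok):
            continue
        # Exception: punctuation immediately after a starting bracket
        after_bracket = i > 0 and tokens[i - 1] in ('(', '[', '{')
        if tokens[i + 1].istitle() and not after_bracket:
            endings.append(i)
    return endings

words = ['Seda', 'ma', 'ei', 'teadnud', '...', 'Ja', 'tegelikult',
         'ei', 'saanudki', 'teada']
print(endings_after_prolonged_punct(words))
```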

The final step of `SentenceTokenizer` is creating the layer 'sentences' based on the fixed / post-corrected list of sentence spans.

If `record_fix_types=True` is passed as a parameter to the `SentenceTokenizer`'s constructor, then the layer `'sentences'` will have the attribute `'fix_types'`, which records the types of merge patterns (from step 6) that were applied to the sentences. This information can be used for testing / debugging purposes.

##### Merge patterns (and merge-and-split patterns)

Each merge pattern contains two regular expressions describing two consecutive sentences that need to be joined. The list of patterns is defined in the variable `merge_patterns` inside the module `estnltk.taggers.text_segmentation.sentence_tokenizer`. The following is an example of a pattern that merges sentences that have been mistakenly broken in the middle of a range of ordinal numbers:

      { 'comment'  : '{Numeric_range_start} {period} + {dash} {Numeric_range_end}', \
        'example'  : '"Tartu Muinsuskaitsepäevad toimusid 1988. a 14." + "- 17. aprillil."', \
        'fix_type' : 'numeric_range', \
        'regexes'  : [ re.compile(r'(.+)?([0-9]+)\s*\.$', re.DOTALL), \
                   re.compile(r'-+\s*([0-9]+)\s*\.(.*)?$', re.DOTALL)], \
      },
      
Attribute `'comment'` is used to give a generic description of the pattern, and `'example'` exemplifies the string joining performed by the pattern. Although these attributes are not mandatory, it is highly advisable to use them when adding new entries, as it helps to maintain interpretability.

Attribute `'fix_type'` is mandatory and expresses the type of the fix. Flags passed to `SentenceTokenizer`'s constructor instruct which types of fixes will be used during the post-correction, and which ones will be skipped. For instance, the previously exemplified pattern will only be used if the flag `fix_numeric` is switched on. See 'Technical details' above for more information.

Attribute `'regexes'` should be a list containing exactly two precompiled regular expressions that are used for finding the joining spot. The first should describe a sentence ending, and the second a beginning of a follow-up sentence.

Attribute `'shift_end'` is an optional boolean attribute, which can be used to turn the pattern into a _merge-and-split pattern_. If switched on, then one of the regular expressions defined in `'regexes'` should contain a group named `'end'` (see [how to define named groups](https://docs.python.org/3.5/howto/regex.html#non-capturing-and-named-groups)), which marks the new sentence end. After two sentences are joined together, a new sentence end is created at the end of the string captured by the group `'end'`. Note that if the group `'end'` is not defined, or it does not match, then the sentences are only merged, with no following split.
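
The behaviour described above can be sketched as a small stand-alone function. This is a simplified illustration and not EstNLTK's actual implementation: the function name `apply_pattern`, the use of `match()`, joining the sentences with a single space, and the `paren_shift` pattern are all assumptions made for this example. The `numeric_range` regular expressions are the ones shown earlier in this subsection, and the standard `re` module is used in place of `regex`:

```python
import re

def apply_pattern(first, second, pattern):
    """Apply one merge pattern to two consecutive sentence strings.
    Returns the original pair if the pattern does not match, one sentence
    after a plain merge, or two sentences after a merge-and-split."""
    end_re, start_re = pattern['regexes']
    m1, m2 = end_re.match(first), start_re.match(second)
    if m1 is None or m2 is None:
        return [first, second]
    merged = first + ' ' + second
    if not pattern.get('shift_end', False):
        return [merged]
    # Merge-and-split: the named group 'end' marks the new sentence end
    if 'end' in end_re.groupindex and m1.group('end') is not None:
        split_at = m1.end('end')
    elif 'end' in start_re.groupindex and m2.group('end') is not None:
        split_at = len(first) + 1 + m2.end('end')
    else:
        return [merged]  # no usable 'end' group: merge only
    return [merged[:split_at].strip(), merged[split_at:].strip()]

numeric_range = {
    'fix_type': 'numeric_range',
    'regexes': [re.compile(r'(.+)?([0-9]+)\s*\.$', re.DOTALL),
                re.compile(r'-+\s*([0-9]+)\s*\.(.*)?$', re.DOTALL)],
}
# Hypothetical merge-and-split pattern: a closing parenthesis that starts
# a sentence is moved to the end of the previous sentence
paren_shift = {
    'fix_type': 'parentheses',
    'shift_end': True,
    'regexes': [re.compile(r'.+\.$', re.DOTALL),
                re.compile(r'^(?P<end>\))\s*[A-ZÖÄÜÕ].*$', re.DOTALL)],
}

merged = apply_pattern('Tartu Muinsuskaitsepäevad toimusid 1988. a 14.',
                       '- 17. aprillil.', numeric_range)
resplit = apply_pattern('Ta naeris ( kõva häälega .',
                        ') Siis lahkus .', paren_shift)
print(merged)
print(resplit)
```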

##### Improving `SentenceTokenizer`'s rules: an example

If you are analysing texts from a specific domain, you may encounter situations where you need to improve `SentenceTokenizer`'s existing rules, or add new rules. Here is a short example of how to do it.

Let's consider an example text consisting of two sentences:

In [13]:
text = Text('Meie puuvarud: 3 rm. märgasid lepahalge sellest aastast, 4 rm. poolkuivasid '+\
            'halge eelmisest aastast ning kuivi halge 2 rm. Kas sellest piisab?')

During the sentence segmentation, the first sentence is mistakenly split into three sentences, because the period after the abbreviation _rm_ is misinterpreted as a sentence ending:

In [14]:
text.tag_layer(['sentences'])
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,4

text
"['Meie', 'puuvarud', ':', '3', 'rm', '.']"
"['märgasid', 'lepahalge', 'sellest', 'aastast', ',', '4', 'rm', '.']"
"['poolkuivasid', 'halge', 'eelmisest', 'aastast', 'ning', 'kuivi', 'halge', '2', 'rm', '.']"
"['Kas', 'sellest', 'piisab', '?']"


This is a tricky case: the abbreviation _rm_ can appear in the middle of a sentence, but it can also end a sentence.
So, while we can solve part of the problem by adding _rm_ to the list of `non_ending_abbreviation`-s in `CompoundTokenTagger` (see the tutorial `B_02_*` for the details), this would still leave one error: the last _rm_ would then be mistakenly considered to be in the middle of the sentence, and we would end up with one big sentence (instead of two).

The workaround here is to introduce a new merge pattern, which will cancel the sentence break after the string `'rm.'` if the string is followed by an unlikely sentence start (e.g. lowercase letter).

The module `estnltk.taggers.text_segmentation.sentence_tokenizer` has the list of `merge_patterns`, which contains all merge patterns (and merge-and-split patterns). Let's import it:

In [15]:
from estnltk.taggers.text_segmentation.sentence_tokenizer import merge_patterns

Now, let's create our own pattern (see the description of the format in the subsection _Merge patterns_ above), and add it to the `merge_patterns`:

In [16]:
import regex as re
# Create a new post-correction
rm_fix = \
{ 'comment'  : '{rm} {period} + {lowercase letter}', \
  'example'  : '"Meie puuvarud: 3 rm." + "märgasid lepahalge"', \
  'fix_type' : 'abbrev_common', \
  'regexes'  : [ re.compile(r'(.+)?\srm\s*\.$', re.DOTALL), \
                 re.compile(r'^([a-zöäüõžš])\s*(.*)?$', re.DOTALL)], \
}
# Add it to the list of corrections
merge_patterns.append( rm_fix )
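
Before relying on a new pattern, it can be useful to check its two regular expressions directly against the sentence fragments they are meant to join. The following stand-alone sketch does this with the standard `re` module (the cell above uses `regex`, but these particular expressions use only syntax that both modules share); the helper `pair_matches` and the use of `match()` are assumptions made for this check, not part of EstNLTK:

```python
import re

end_re   = re.compile(r'(.+)?\srm\s*\.$', re.DOTALL)
start_re = re.compile(r'^([a-zöäüõžš])\s*(.*)?$', re.DOTALL)

def pair_matches(first, second):
    """True if `first` ends like the pattern's sentence ending and
    `second` begins like the pattern's follow-up sentence."""
    return bool(end_re.match(first)) and bool(start_re.match(second))

# Break after 'rm.' followed by a lowercase word: should be merged
print(pair_matches('Meie puuvarud: 3 rm.',
                   'märgasid lepahalge sellest aastast, 4 rm.'))
# Break after 'rm.' followed by an uppercase word: should be kept
print(pair_matches('poolkuivasid halge eelmisest aastast ning kuivi halge 2 rm.',
                   'Kas sellest piisab?'))
```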

OK. We have now updated the default set of rules. Next, we must create a new `SentenceTokenizer` that uses the new rules instead of the old ones:

In [17]:
new_sentence_tokenizer = SentenceTokenizer(patterns = merge_patterns)

Finally, let's use our improved tagger to analyse the text:

In [18]:
# Create text and tag prerequisite layers:
text = Text('Meie puuvarud: 3 rm. märgasid lepahalge sellest aastast, 4 rm. poolkuivasid '+\
            'halge eelmisest aastast ning kuivi halge 2 rm. Kas sellest piisab?')
text.tag_layer(['words'])
# Tag sentences with the new post-corrections
new_sentence_tokenizer.tag(text)
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2

text
"['Meie', 'puuvarud', ':', '3', 'rm', '.', 'märgasid', 'lepahalge', 'sellest', 'a ..., type: <class 'list'>, length: 24"
"['Kas', 'sellest', 'piisab', '?']"


_Mission accomplished!_

##### Customizing base tokenizer of the `SentenceTokenizer`

If you really need to, then you can also customize the `SentenceTokenizer`, and change its base tokenizer from `PunktSentenceTokenizer` to some other tokenizer that inherits from [`nltk.tokenize.api.TokenizerI`](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.api.TokenizerI). For example, if you have an input text where each sentence is systematically placed on a new line, then you may want to use NLTK's [`LineTokenizer`](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.simple.LineTokenizer) instead of the default `PunktSentenceTokenizer`:

In [19]:
# Create a sentence tokenizer that only splits sentences in places of new lines
from nltk.tokenize.simple import LineTokenizer
newline_sentence_tokenizer = SentenceTokenizer( base_sentence_tokenizer=LineTokenizer() )

In [20]:
# Prepare text
text = Text('''
See on esimene lause
Ja see teine lause
Kolmas lause on kolmandal real
''')
text.tag_layer(['words'])
# Apply the customized sentence tokenizer
newline_sentence_tokenizer.tag(text)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,3

text
"['See', 'on', 'esimene', 'lause']"
"['Ja', 'see', 'teine', 'lause']"
"['Kolmas', 'lause', 'on', 'kolmandal', 'real']"


_Things to keep in mind:_
 * by default, the method [sentences_from_tokens()](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer.sentences_from_tokens) is used for tokenization, but if the new `base_sentence_tokenizer` does not have that method, then the method [`span_tokenize()`](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.api.TokenizerI.span_tokenize) is used instead;
 * the post-corrections have been created specifically for the default `base_sentence_tokenizer` (which is: `PunktSentenceTokenizer` with the Estonian-specific model). If you change the `base_sentence_tokenizer`, then there is no guarantee that all the post-corrections still work properly. So, it may be a good idea to turn off the post-corrections while using a custom `base_sentence_tokenizer`;

##### Known limitations and points for further improvement

 * `SentenceTokenizer` provides sentence ending fixes related to the most commonly used abbreviations and acronyms (that were found during the analysis of a sample from [KoondKorpus](https://keeleressursid.ee/et/keeleressursid-cl-ut/korpused/83-article/clutee-lehed/192-segakorpus) and [etTenTen](http://www2.keeleveeb.ee/dict/corpus/ettenten/about.html)). If you need to analyse a corpus from a specific domain, you likely need to provide additional / domain-specific fixes. For this purpose, `merge_patterns` defined in `estnltk.taggers.text_segmentation.sentence_tokenizer` could be augmented with additional rules;
 * Sentence breaks can be erroneously added inside _enumerations that contain periods_ (such as `1. ... ; 2. ... ; 3. ...`, or `a. ... ; b. ... ; c. ...`). Note that resolving these cases requires that enumerations are first detected in the text (so that enumerations are made distinct from sentences ending with numbers). As one enumeration item can contain several sentences, this problem goes beyond checking tokens in close proximity (as done in `CompoundTokenTagger`), and checking ends and starts of consecutive sentences (as done in `SentenceTokenizer`) -- a special logic involving analysis of the whole document is likely required for the detection of enumerations;
 * A sentence break can be erroneously added after a regular `abbreviation` followed by a titlecased word. For instance, in the sentence `'Bodhidharma tõi zeni 6. sajandil e.Kr. Hiinasse.'`, the break is added after `'e.Kr.'`, although the actual sentence continues. A solution to this problem would involve verifying that the titlecased word is actually a proper noun (or a named entity), and also checking that the second sentence is not too short (or that it contains verbs);