## Clause Segmenter

A simple sentence, also called an independent clause, typically contains a finite verb and expresses a complete thought. However, natural language sentences can also be long and complex, consisting of two or more clauses joined together. Embedded clauses make the structure more complex still, because each one divides its parent clause into two parts, for instance:

        'Mees, keda seal kohtasime, oli tuttav ja teretas meid.'
        '[Mees, [keda seal kohtasime,] oli tuttav ja] [teretas meid.]'
        (in the example, clauses are surrounded by brackets)

The clause segmenter is a program that splits long and complex natural language sentences into clauses. Example:

In [1]:
from estnltk import Text

text = Text('Mees, keda seal kohtasime, oli tuttav ja teretas meid.')

text.tag_layer(['clauses'])
text.clauses

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| clauses | clause_type | | words | False | 3 |

| text | clause_type |
|---|---|
| ['Mees', 'oli', 'tuttav', 'ja'] | regular |
| [',', 'keda', 'seal', 'kohtasime', ','] | embedded |
| ['teretas', 'meid', '.'] | regular |

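The bracketed notation from the start of this section can be reproduced from the clause word lists. The following is a minimal pure-Python sketch, not part of EstNLTK: the helper `bracket_clauses` and the hand-written word indices are assumptions made for illustration. It opens a bracket before the first word of each clause and closes one after the last.

```python
from collections import Counter


def bracket_clauses(words, clauses):
    """Render clause boundaries as brackets around a list of word tokens.

    `clauses` is a list of lists of word indices, one list per clause.
    """
    # Count how many clauses start and end at each word position
    opens = Counter(min(indices) for indices in clauses)
    closes = Counter(max(indices) for indices in clauses)
    marked = []
    for i, word in enumerate(words):
        marked.append('[' * opens[i] + word + ']' * closes[i])
    return ' '.join(marked)


# Word tokens of the example sentence
words = ['Mees', ',', 'keda', 'seal', 'kohtasime', ',',
         'oli', 'tuttav', 'ja', 'teretas', 'meid', '.']
# Word indices per clause, matching the three clauses listed above
clauses = [[0, 6, 7, 8],      # regular:  Mees ... oli tuttav ja
           [1, 2, 3, 4, 5],   # embedded: , keda seal kohtasime ,
           [9, 10, 11]]       # regular:  teretas meid .

bracketed = bracket_clauses(words, clauses)
print(bracketed)
# [Mees [, keda seal kohtasime ,] oli tuttav ja] [teretas meid .]
```

Because the brackets follow the word lists in the output, both commas around _keda seal kohtasime_ fall inside the embedded clause.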

There are two types of clauses:
 * **regular** clauses, which are usually separated from one another by punctuation and/or conjunctions (such as _ja, ning, et, sest, kuid_);
 * **embedded** clauses, which are nested inside other clauses and divide their parent clauses into two parts. Embedded clauses can also be nested inside other embedded clauses, that is, embedding can be recursive.

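Recursive embedding can be made concrete by treating each clause as a range of word positions. The sketch below is illustrative only and does not use EstNLTK: the functions `find_parents` and `nesting_depth` and the index ranges are hypothetical. It assumes that an embedded clause lies strictly inside its parent's range, which follows from the parent being split into two parts.

```python
def find_parents(clause_spans):
    """Map each clause index to the index of its innermost enclosing clause.

    `clause_spans` holds one (first_word, last_word) index pair per clause.
    Clauses that are not nested inside any other clause map to None.
    """
    parents = {}
    for i, (start, end) in enumerate(clause_spans):
        # Collect every clause whose range strictly contains this one
        enclosing = [(p_end - p_start, j)
                     for j, (p_start, p_end) in enumerate(clause_spans)
                     if j != i and p_start < start and end < p_end]
        # The smallest enclosing range is the direct parent
        parents[i] = min(enclosing)[1] if enclosing else None
    return parents


def nesting_depth(i, parents):
    """Count how many clauses enclose clause `i`."""
    depth = 0
    while parents[i] is not None:
        i = parents[i]
        depth += 1
    return depth


# Hypothetical word-index ranges: clause 0 is regular, clause 1 is embedded
# in clause 0, clause 2 is embedded in clause 1, clause 3 is regular.
spans = [(0, 14), (2, 11), (5, 8), (15, 20)]
parents = find_parents(spans)
depths = [nesting_depth(i, parents) for i in range(len(spans))]
print(parents)  # {0: None, 1: 0, 2: 1, 3: None}
print(depths)   # [0, 1, 2, 0]
```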

The linguistic motivations behind clause segmentation are discussed by Kaalep and Muischnek in the articles [(2012A)](http://www.lrec-conf.org/proceedings/lrec2012/summaries/229.html) and [(2012B)](http://arhiiv.rakenduslingvistika.ee/ajakirjad/index.php/aastaraamat/article/view/ERYa8.04/6).

**Note**: EstNLTK's clause segmenter uses a Java-based implementation of the tool. Before using the clause segmenter, make sure that:
  * Java SE Runtime Environment (version >= 1.8) is installed on the system;
  * `java` is in the [PATH environment variable](https://docs.oracle.com/javase/tutorial/essential/environment/paths.html).

The source code of the Java-based clause segmenter is available [here](https://github.com/soras/osalausestaja).

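The two Java prerequisites can be checked from Python before the segmenter is created. This is a rough sketch using only the standard library; the helper names `parse_java_major` and `installed_java_major` are made up for this example, and the parsing assumes the usual `java -version` report format (`version "1.8.0_281"` for Java 8, `version "17.0.1"` for Java 17).

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from the output of `java -version`.

    Handles both the legacy scheme ("1.8.0_281" means Java 8) and the
    modern scheme ("17.0.1" means Java 17). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    if first == '1' and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None."""
    java_path = shutil.which('java')
    if java_path is None:
        return None
    try:
        result = subprocess.run([java_path, '-version'], capture_output=True,
                                text=True, timeout=30)
    except (OSError, ValueError, subprocess.SubprocessError):
        return None
    # `java -version` writes its report to stderr
    return parse_java_major(result.stderr + result.stdout)


major = installed_java_major()
if major is None:
    print('No usable java executable found on PATH')
elif major < 8:
    print('Java', major, 'is too old for the clause segmenter')
else:
    print('Java', major, 'found')
```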
### ClauseSegmenter class

#### The basic mode

The clause segmenter can also be used as a stand-alone class. Before using `ClauseSegmenter`, the input `Text` object must have the layers `"words"`, `"sentences"`, and `"morph_analysis"`. The last layer is required because clause tagging needs information about the finite verbs in the text.

In [2]:
from estnltk import Text
from estnltk.taggers import ClauseSegmenter

# Create text with required layers
text = Text('Igaüks, kes traktori eest miljon krooni lauale laob, on huvitatud sellest, '+
            'et traktor meenutaks lisavõimaluste poolest võimalikult palju kosmoselaeva.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])

text
"Igaüks, kes traktori eest miljon krooni lauale laob, on huvitatud sellest, et traktor meenutaks lisavõimaluste poolest võimalikult palju kosmoselaeva."

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| sentences | | | words | False | 1 |
| tokens | | | | False | 23 |
| compound_tokens | type, normalized | | tokens | False | 0 |
| words | normalized_form | | | True | 23 |
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 23 |

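Before calling the segmenter, it can help to confirm that none of the required layers is missing. The helper below is a hypothetical, plain-Python sketch that works on any collection of layer names. With a real `Text` object you would presumably pass its layer names (for example `text.layers`), but that is an assumption to check against your EstNLTK version.

```python
# Layers that ClauseSegmenter expects on its input
REQUIRED_LAYERS = ('words', 'sentences', 'morph_analysis')


def missing_layers(available_layers, required=REQUIRED_LAYERS):
    """Return the required layer names absent from `available_layers`."""
    available = set(available_layers)
    return [name for name in required if name not in available]


# Layer names as listed in the table above
available = ['sentences', 'tokens', 'compound_tokens', 'words', 'morph_analysis']
print(missing_layers(available))            # []
print(missing_layers(['tokens', 'words']))  # ['sentences', 'morph_analysis']
```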

Because Java resources need to be cleaned up after using the segmenter, we recommend using `ClauseSegmenter` in a **`with`** statement as a _context manager_, so that the resources are cleaned up automatically afterwards:

In [3]:
# Add clause annotations
with ClauseSegmenter() as clause_segmenter:
    clause_segmenter.tag(text)

# Browse results
text.clauses

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| clauses | clause_type | | words | False | 3 |

| text | clause_type |
|---|---|
| ['Igaüks', 'on', 'huvitatud', 'sellest', ','] | regular |
| [',', 'kes', 'traktori', 'eest', 'miljon', 'krooni', 'lauale', 'laob', ','] | embedded |
| ['et', 'traktor', 'meenutaks', 'lisavõimaluste', 'poolest', 'võimalikult', 'palj ..., type: <class 'list'>, length: 9 | regular |


_Note_: after the **`with`** block has exited, the `ClauseSegmenter` instance can no longer be used for tagging texts.

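To illustrate why a tagger backed by an external process benefits from a `with` statement, here is a small stand-alone sketch. The class `ExternalTool` is a hypothetical stand-in, not the `ClauseSegmenter` implementation: it starts a child Python process instead of Java, and its `tag()` method only returns a placeholder string.

```python
import subprocess
import sys


class ExternalTool:
    """Hypothetical stand-in for a tagger backed by an external process."""

    def __init__(self):
        # The child process just waits on stdin until it is closed
        self._process = subprocess.Popen(
            [sys.executable, '-c', 'import sys; sys.stdin.read()'],
            stdin=subprocess.PIPE)
        self.closed = False

    def tag(self, text):
        if self.closed:
            raise RuntimeError('the external process has been terminated')
        return 'tagged: ' + text

    def close(self):
        if not self.closed:
            self._process.stdin.close()   # end of input tells the child to exit
            self._process.wait(timeout=30)
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()    # runs even if the body raised an exception
        return False    # do not suppress exceptions


with ExternalTool() as tool:
    result = tool.tag('Mees teretas meid.')

# After the block, the process is gone and the instance refuses further work
try:
    tool.tag('Mees teretas meid.')
    reusable = True
except RuntimeError:
    reusable = False

print(result, reusable)
```

The cleanup lives in `__exit__`, so it also runs when tagging fails partway through the block.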
#### Terminating `ClauseSegmenter` manually

If you need to use `ClauseSegmenter` outside a `with` statement, you should terminate its Java process manually.

In [4]:
from estnltk import Text
from estnltk.taggers import ClauseSegmenter

# Create clause segmenter
clause_segmenter = ClauseSegmenter()
clause_segmenter

| name | output layer | output attributes | input layers |
|---|---|---|---|
| ClauseSegmenter | clauses | ('clause_type',) | ('words', 'sentences', 'morph_analysis') |

| parameter | value |
|---|---|
| ignore_missing_commas | False |
| use_normalized_word_form | True |


Tag clauses, and use the method `close()` after tagging to terminate the process manually:

In [5]:
# Create text with required layers
text = Text('Igaüks, kes traktori eest miljon krooni lauale laob, on huvitatud sellest, '+
            'et traktor meenutaks lisavõimaluste poolest võimalikult palju kosmoselaeva.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])

# Tag clause annotations
clause_segmenter.tag(text)

# Terminate the process
clause_segmenter.close()

# Browse results
text.clauses

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| clauses | clause_type | | words | False | 3 |

| text | clause_type |
|---|---|
| ['Igaüks', 'on', 'huvitatud', 'sellest', ','] | regular |
| [',', 'kes', 'traktori', 'eest', 'miljon', 'krooni', 'lauale', 'laob', ','] | embedded |
| ['et', 'traktor', 'meenutaks', 'lisavõimaluste', 'poolest', 'võimalikult', 'palj ..., type: <class 'list'>, length: 9 | regular |


_Notes_:

  * After calling the method `close()`, the `ClauseSegmenter` instance can no longer be used for tagging texts.
  
  * Creating many `ClauseSegmenter` instances without terminating them properly is likely to lead to unexpected errors.

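One way to respect both notes is to share a single segmenter across a batch of texts and to close it in a `finally` clause, so that it is terminated even when tagging fails. The sketch below is hedged accordingly: `tag_all` and `FakeSegmenter` are hypothetical names, and the stand-in class only mimics the `tag()` and `close()` methods shown above. With EstNLTK installed you would pass a real `ClauseSegmenter()` and prepared `Text` objects instead.

```python
def tag_all(texts, segmenter):
    """Tag every text with one shared segmenter, then always close it."""
    try:
        for text in texts:
            segmenter.tag(text)
    finally:
        # Runs on success and on error, so the process is never left behind
        segmenter.close()
    return texts


class FakeSegmenter:
    """Hypothetical stand-in exposing only tag() and close()."""

    def __init__(self):
        self.closed = False
        self.tagged = 0

    def tag(self, text):
        if text is None:
            raise ValueError('cannot tag None')
        self.tagged += 1

    def close(self):
        self.closed = True


# Normal run: both texts are tagged, then the segmenter is closed
ok = FakeSegmenter()
tag_all(['first text', 'second text'], ok)

# Failing run: the second item raises, but close() is still called
failing = FakeSegmenter()
try:
    tag_all(['first text', None], failing)
except ValueError:
    pass

print(ok.tagged, ok.closed, failing.tagged, failing.closed)
# 2 True 1 True
```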
#### The `ignore_missing_commas` mode

Because commas are important clause delimiters in Estonian, the quality of clause segmentation may suffer when commas are accidentally missing from the input text. To address this issue, the clause segmenter can be initialized in a mode in which it tries to be less sensitive to missing commas while detecting clause boundaries. Example:

In [6]:
from estnltk import Text
from estnltk.taggers import ClauseSegmenter

with ClauseSegmenter(ignore_missing_commas=True) as clause_segmenter_2:
    text = Text('Keegi teine ka siin ju kirjutas et ütles et saab ise asjadele järgi '+
                'minna aga vastust seepeale ei tulnudki.')
    # Add required layers
    text.tag_layer(['words', 'sentences', 'morph_analysis'])
    # Add clause annotation
    clause_segmenter_2.tag(text)

# Browse results
text.clauses

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| clauses | clause_type | | words | False | 4 |

| text | clause_type |
|---|---|
| ['Keegi', 'teine', 'ka', 'siin', 'ju', 'kirjutas'] | regular |
| ['et', 'ütles'] | regular |
| ['et', 'saab', 'ise', 'asjadele', 'järgi', 'minna'] | regular |
| ['aga', 'vastust', 'seepeale', 'ei', 'tulnudki', '.'] | regular |


Note that, compared to the basic mode, this mode may introduce additional incorrect clause boundaries, although it improves clause boundary detection in texts with many missing commas.

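Since the mode is a trade-off, it may be worth estimating whether a text is actually missing commas before enabling it. The following heuristic is only a sketch and is not part of EstNLTK: the function name, the tokenizing regular expression and the conjunction list are all assumptions. It counts conjunctions that normally follow a comma but appear without one.

```python
import re

# Assumption: conjunctions that are normally preceded by a comma
COMMA_CONJUNCTIONS = {'et', 'sest', 'kuid', 'aga'}


def count_suspect_conjunctions(sentence):
    """Count listed conjunctions that are not directly preceded by a comma."""
    # Split into word tokens and single punctuation marks
    tokens = re.findall(r'\w+|[^\w\s]', sentence)
    count = 0
    for previous, current in zip(tokens, tokens[1:]):
        if current.lower() in COMMA_CONJUNCTIONS and previous != ',':
            count += 1
    return count


without_commas = ('Keegi teine ka siin ju kirjutas et ütles et saab ise asjadele '
                  'järgi minna aga vastust seepeale ei tulnudki.')
with_commas = ('Igaüks, kes traktori eest miljon krooni lauale laob, on huvitatud '
               'sellest, et traktor meenutaks lisavõimaluste poolest võimalikult '
               'palju kosmoselaeva.')

print(count_suspect_conjunctions(without_commas))  # 3
print(count_suspect_conjunctions(with_commas))     # 0
```

A high count relative to the number of sentences hints that `ignore_missing_commas=True` may be worth trying. The heuristic ignores sentence-initial conjunctions and other legitimate comma-free uses.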
#### The `use_normalized_word_form` parameter

The boolean parameter `use_normalized_word_form` specifies whether normalized word forms are used in the clause segmenter's input instead of surface word forms (provided that normalized word forms are available). 
You can pass `use_normalized_word_form` to the `ClauseSegmenter` constructor when initializing the tagger.

Notes: 
  * if `normalized_form` is ambiguous, the first `normalized_form` is picked for the input;
  * if `normalized_form` is `None`, the surface word form is picked for the input.
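
The selection rules in the notes above can be summarized in a few lines of plain Python. This is a sketch of the described behaviour, not the segmenter's actual code: the function `pick_input_form` is hypothetical, normalized forms are modelled as a plain list, and the misspelled word is an invented example.

```python
def pick_input_form(surface_form, normalized_forms, use_normalized_word_form=True):
    """Choose the word form that would be passed to the clause segmenter."""
    # Surface form is used when normalization is switched off or unavailable
    if not use_normalized_word_form or not normalized_forms:
        return surface_form
    # In case of ambiguity, only the first normalized form counts
    first = normalized_forms[0]
    return surface_form if first is None else first


# Hypothetical misspelling with a single normalized form
print(pick_input_form('kohtasme', ['kohtasime']))              # kohtasime
# Ambiguous normalization: the first candidate wins
print(pick_input_form('kohtasme', ['kohtasime', 'kohtasin']))  # kohtasime
# No normalized form recorded: fall back to the surface form
print(pick_input_form('meid', [None]))                         # meid
# Parameter switched off: the surface form is always used
print(pick_input_form('kohtasme', ['kohtasime'],
                      use_normalized_word_form=False))         # kohtasme
```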