## Clause Segmenter

A simple sentence, also called an independent clause, typically contains a finite verb and expresses a complete thought. Natural language sentences, however, are often long and complex, consisting of two or more clauses joined together. The clause structure becomes more complex still when embedded clauses divide their parent clauses into two parts, for instance:

        'Mees, keda seal kohtasime, oli tuttav ja teretas meid.'
        '[Mees, [keda seal kohtasime,] oli tuttav ja] [teretas meid.]'
        (in the example, clauses are surrounded by brackets)

A clause segmenter is a program that splits long and complex natural language sentences into clauses. Example:

In [1]:
from estnltk import Text

text = Text('Mees, keda seal kohtasime, oli tuttav ja teretas meid.')

text.tag_layer(['clauses'])
text.clauses

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| clauses | clause_type | | words | False | 3 |

| text | clause_type |
|---|---|
| ['Mees', 'oli', 'tuttav', 'ja'] | regular |
| [',', 'keda', 'seal', 'kohtasime', ','] | embedded |
| ['teretas', 'meid', '.'] | regular |


There are two types of clauses:
 * **regular** clauses, which are usually separated from one another by punctuation and/or conjunctions (such as _ja, ning, et, sest, kuid_);
 * **embedded** clauses, which are nested inside other clauses and divide their parent clauses into two parts. Embedded clauses can also be nested inside other embedded clauses, that is, embedding can be recursive.
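
The way an embedded clause splits its parent can be illustrated without running the segmenter. The sketch below is plain Python, not EstNLTK code: `render_brackets` and `is_split` are hypothetical helpers, and the word indices are copied by hand from the example output above. A clause whose word indices are not contiguous has another clause nested inside it.

```python
def render_brackets(words, clauses):
    """Render clause structure as a bracketed string.

    words   -- list of word strings
    clauses -- list of lists of word indices, one list per clause
    """
    opens = {}   # word index -> number of clauses starting at that word
    closes = {}  # word index -> number of clauses ending at that word
    for indices in clauses:
        first, last = min(indices), max(indices)
        opens[first] = opens.get(first, 0) + 1
        closes[last] = closes.get(last, 0) + 1
    pieces = []
    for i, word in enumerate(words):
        pieces.append('[' * opens.get(i, 0) + word + ']' * closes.get(i, 0))
    return ' '.join(pieces)


def is_split(indices):
    """Return True if the clause's words are not contiguous in the sentence."""
    ordered = sorted(indices)
    return ordered[-1] - ordered[0] + 1 != len(ordered)


# Words and clause membership copied by hand from the example output above
words = ['Mees', ',', 'keda', 'seal', 'kohtasime', ',',
         'oli', 'tuttav', 'ja', 'teretas', 'meid', '.']
clauses = [
    [0, 6, 7, 8],      # regular: Mees ... oli tuttav ja
    [1, 2, 3, 4, 5],   # embedded: , keda seal kohtasime ,
    [9, 10, 11],       # regular: teretas meid .
]

print(render_brackets(words, clauses))
# [Mees [, keda seal kohtasime ,] oli tuttav ja] [teretas meid .]
print([is_split(c) for c in clauses])
# [True, False, False]
```

Only the first clause is reported as split, because the embedded clause sits between `Mees` and `oli`.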


The linguistic motivations behind clause segmentation are discussed by Kaalep and Muischnek in the articles [(2012A)](http://www.lrec-conf.org/proceedings/lrec2012/summaries/229.html) and [(2012B)](http://arhiiv.rakenduslingvistika.ee/ajakirjad/index.php/aastaraamat/article/view/ERYa8.04/6).

**Note**: EstNLTK's clause segmenter uses a Java-based implementation of the tool. Before using the clause segmenter, make sure that:
  * Java SE Runtime Environment (version >= 1.8) is installed on the system;
  * `java` is in the [PATH environment variable](https://docs.oracle.com/javase/tutorial/essential/environment/paths.html).

The source code of the Java-based clause segmenter is available [here](https://github.com/soras/osalausestaja).
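
The Java prerequisites can be checked from Python before the segmenter is created. The following is a minimal sketch using only the standard library. `find_java` and `java_version_banner` are hypothetical helper names, not part of EstNLTK, and the version text is printed as-is rather than parsed, since its format differs between Java distributions.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which('java')


def java_version_banner(java_path):
    """Return the raw output of `java -version`, or None if the call fails."""
    try:
        result = subprocess.run([java_path, '-version'],
                                capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    # `java -version` typically writes its banner to stderr, not stdout
    return (result.stderr or result.stdout).strip() or None


java_path = find_java()
if java_path is None:
    print('java was not found on PATH; the clause segmenter needs it.')
else:
    print('Found java at:', java_path)
    print(java_version_banner(java_path))
```

If `java` is found, compare the printed version against the minimum listed above by eye.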

### ClauseSegmenter class

#### The basic mode

The clause segmenter can also be instantiated as a stand-alone class:

In [2]:
from estnltk.taggers import ClauseSegmenter
clause_segmenter = ClauseSegmenter()
clause_segmenter

| name | output layer | output attributes | input layers |
|---|---|---|---|
| ClauseSegmenter | clauses | ('clause_type',) | ['words', 'sentences', 'morph_analysis'] |

| parameter | value |
|---|---|
| ignore_missing_commas | False |
| depends_on | ['words', 'sentences', 'morph_analysis'] |
| layer_name | clauses |


Note that before applying `ClauseSegmenter`, the input `Text` object must have the layers `"words"`, `"sentences"`, and `"morph_analysis"`. The last layer is required because clause tagging also needs information about finite verbs in the text.
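
The layer requirement can be verified up front. The helper below is a hypothetical sketch in plain Python, not an EstNLTK function: `missing_layers` takes any collection of layer names and reports which required ones are absent. It is an assumption that, in practice, the names would come from the `Text` object (for example via `text.layers`).

```python
REQUIRED_LAYERS = ('words', 'sentences', 'morph_analysis')


def missing_layers(available, required=REQUIRED_LAYERS):
    """Return the required layer names absent from `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Assumption: with EstNLTK, the names of the existing layers would be passed
# in here (e.g. from `text.layers`); plain lists keep the sketch self-contained.
print(missing_layers(['words', 'sentences']))
# ['morph_analysis']
print(missing_layers(['words', 'sentences', 'morph_analysis', 'tokens']))
# []
```

An empty result means all three input layers are present and the segmenter can be applied.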

In [3]:
from estnltk import Text

text = Text('Igaüks, kes traktori eest miljon krooni lauale laob, on huvitatud sellest, '+
            'et traktor meenutaks lisavõimaluste poolest võimalikult palju kosmoselaeva.')
# Add required layers
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Add clause annotation
clause_segmenter.tag(text)
text.clauses

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| clauses | clause_type | | words | False | 3 |

| text | clause_type |
|---|---|
| ['Igaüks', 'on', 'huvitatud', 'sellest', ','] | regular |
| [',', 'kes', 'traktori', 'eest', 'miljon', 'krooni', 'lauale', 'laob', ','] | embedded |
| ['et', 'traktor', 'meenutaks', 'lisavõimaluste', 'poolest', 'võimalikult', 'palju', 'kosmoselaeva', '.'] | regular |


#### The `ignore_missing_commas` mode

Because commas are important clause delimiters in Estonian, the quality of clause segmentation may suffer when commas are accidentally missing from the input text. To address this, the clause segmenter can be initialized in a mode in which it tries to be less sensitive to missing commas when detecting clause boundaries. Example:

In [4]:
from estnltk import Text
from estnltk.taggers import ClauseSegmenter
clause_segmenter_2 = ClauseSegmenter(ignore_missing_commas=True)

text = Text('Keegi teine ka siin ju kirjutas et ütles et saab ise asjadele järgi '+
            'minna aga vastust seepeale ei tulnudki.')
# Add required layers
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Add clause annotation
clause_segmenter_2.tag(text)
text.clauses

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| clauses | clause_type | | words | False | 4 |

| text | clause_type |
|---|---|
| ['Keegi', 'teine', 'ka', 'siin', 'ju', 'kirjutas'] | regular |
| ['et', 'ütles'] | regular |
| ['et', 'saab', 'ise', 'asjadele', 'järgi', 'minna'] | regular |
| ['aga', 'vastust', 'seepeale', 'ei', 'tulnudki', '.'] | regular |


Note that, compared to the basic mode, this mode improves clause boundary detection in texts with many missing commas, but it may also introduce additional incorrect clause boundaries.
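
To see where boundaries were placed in the example above, the clause word lists can be inspected directly. The sketch below is plain Python with the word lists copied by hand from the output above. `clause_openers` and `count_commas` are hypothetical helpers, not part of EstNLTK.

```python
# Clause word lists copied by hand from the output above
clauses = [
    ['Keegi', 'teine', 'ka', 'siin', 'ju', 'kirjutas'],
    ['et', 'ütles'],
    ['et', 'saab', 'ise', 'asjadele', 'järgi', 'minna'],
    ['aga', 'vastust', 'seepeale', 'ei', 'tulnudki', '.'],
]


def clause_openers(clauses):
    """Return the first word of every clause except the first one."""
    return [clause[0] for clause in clauses[1:] if clause]


def count_commas(clauses):
    """Count comma tokens across all clauses."""
    return sum(clause.count(',') for clause in clauses)


print(clause_openers(clauses))
# ['et', 'et', 'aga']
print(count_commas(clauses))
# 0
```

Every boundary in this example falls before a conjunction (`et`, `aga`), and the input contains no commas at all.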