## Inconsistencies between clauses annotation and syntax annotation

### Detecting clause errors with syntax

EstNLTK has a function for detecting potential clause errors based on automatic syntactic annotations.

In [1]:
from estnltk.converters import json_to_text
from estnltk.consistency.clauses_and_syntax_consistency import detect_clause_errors

Load example pre-annotated data:

In [2]:
examples_texts = []
with open('example_clause_errors.jsonl', 'r', encoding='utf-8') as in_f:
    for line in in_f:
        line = line.strip()
        if not line:
            continue  # skip blank lines, which are not valid JSON
        text_obj = json_to_text(line)
        examples_texts.append(text_obj)

Each text contains one sentence with erroneous clause annotations, extracted from the Estonian Reference Corpus:

In [3]:
examples_texts[0].text

'Ka võib kütusehind , mis moodustab piletihinnast kolmandiku , teha ootamatuid pöördeid .'

In [4]:
# Browse erroneous annotation
examples_texts[0]['v169_clauses'][['start', 'end', 'text', 'clause_type']]

Unnamed: 0,start,end,text,clause_type
0,0,20,"['Ka', 'võib', 'kütusehind', ',']",regular
1,21,88,"['mis', 'moodustab', 'piletihinnast', 'kolmandiku', ',', 'teha', 'ootamatuid', 'pöördeid', '.']",regular


To detect errors, a text must have a sentences layer and a syntax layer in addition to the clauses layer:

In [5]:
examples_texts[0].layers

{'v166_sentences', 'v166_words', 'v168_stanza_syntax', 'v169_clauses'}

Preferably the syntax layer is created by StanzaSyntaxTagger (which has the highest parsing accuracy), but layers created by other [syntactic taggers](https://github.com/estnltk/estnltk/tree/main/tutorials/nlp_pipeline/C_syntax) should also work.

Use parameters `clauses_layer`, `syntax_layer` and `sentences_layer` to specify layer names, if these are different from the defaults:

In [6]:
cl_errors_layer = detect_clause_errors(examples_texts[0], 
                                       clauses_layer='v169_clauses', 
                                       syntax_layer='v168_stanza_syntax', 
                                       sentences_layer='v166_sentences')

In [7]:
cl_errors_layer.display()

In [8]:
cl_errors_layer

0,1
attributive_mis_embedded_clause_wrong_end,1

layer name,attributes,parent,enveloping,ambiguous,span count
clause_errors,"err_type, sent_id, correction_description",,,False,1

text,err_type,sent_id,correction_description
",",attributive_mis_embedded_clause_wrong_end,0,Split clause after position 61 and then embed the clause 19:61 into clause 0:88.


The clause errors layer has the following attributes:
* `text` -- text snippet indicating the error type;
* `err_type` -- type of the error; for details about the different error types, see "Details: clause error detection patterns" below;
* `sent_id` -- index of the sentence in which the error occurred;
* `correction_description` -- detailed description of how the error can be fixed; for details about the different corrections, see "Details: clause error detection patterns" below.

Note that the metadata of the clause errors layer contains counts of the different error types detected (in the output above, one `attributive_mis_embedded_clause_wrong_end` error).
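
Because `correction_description` is free text, it can be useful to turn it into structured data before further processing. The following is a minimal sketch, not part of EstNLTK: the helper `parse_correction` and its regular expressions are hypothetical and assume only the three description formats listed in this tutorial.

```python
import re

# Hypothetical helper, not part of EstNLTK. The patterns are modelled only on
# the correction description formats shown in this tutorial.
_PATTERNS = [
    # The longest format comes first, because the plain "split" pattern
    # would also match its prefix.
    ('split_and_embed',
     re.compile(r'Split clause after position (\d+) and then embed the clause '
                r'(\d+):(\d+) into clause (\d+):(\d+)')),
    ('split',
     re.compile(r'Split clause after position (\d+)')),
    ('embed',
     re.compile(r'Embed the adjusted clause (\d+):(\d+) into clause (\d+):(\d+)')),
]


def parse_correction(description):
    """Return (action, positions) for a correction description, or None."""
    for action, pattern in _PATTERNS:
        match = pattern.match(description)
        if match:
            return action, tuple(int(group) for group in match.groups())
    return None


parsed = parse_correction('Split clause after position 61 and then embed '
                          'the clause 19:61 into clause 0:88.')
print(parsed)
```

For the example sentence, the parsed positions are the split position, the embedded clause span and the enclosing clause span, in that order.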

### Repairing clause errors with syntax

The function `fix_clause_errors_with_syntax` creates a new, corrected clauses layer based on the errors detected by `detect_clause_errors`.

In [9]:
from estnltk.consistency.clauses_and_syntax_consistency import fix_clause_errors_with_syntax

Use parameters `clauses_layer`, `syntax_layer` and `sentences_layer` to specify input layer names, and `output_layer` to change the name of the new clauses layer:

In [10]:
for text_obj in examples_texts:
    new_clauses_layer = fix_clause_errors_with_syntax(text_obj, 
                                                      clauses_layer='v169_clauses', 
                                                      syntax_layer='v168_stanza_syntax', 
                                                      sentences_layer='v166_sentences',
                                                      output_layer='clauses_fixed')
    # Attach new layer to the text
    text_obj.add_layer( new_clauses_layer )

In [11]:
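# display is imported explicitly so that the next cell also works outside a notebook's implicit namespace
from IPython.display import Markdown, display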

In [12]:
# Browse results
for text_obj in examples_texts:
    print('OLD:')
    display(text_obj['v169_clauses'][['start', 'end', 'text', 'clause_type']])
    print('FIXED:')
    display(text_obj['clauses_fixed'][['start', 'end', 'text', 'clause_type']])
    display(Markdown('---'))
    print()

OLD:


Unnamed: 0,start,end,text,clause_type
0,0,20,"['Ka', 'võib', 'kütusehind', ',']",regular
1,21,88,"['mis', 'moodustab', 'piletihinnast', 'kolmandiku', ',', 'teha', 'ootamatuid', 'pöördeid', '.']",regular


FIXED:


Unnamed: 0,start,end,text,clause_type
0,0,88,"['Ka', 'võib', 'kütusehind', 'teha', 'ootamatuid', 'pöördeid', '.']",regular
1,19,61,"[',', 'mis', 'moodustab', 'piletihinnast', 'kolmandiku', ',']",embedded


---


OLD:


Unnamed: 0,start,end,text,clause_type
0,0,8,"['Riigid', ',']",regular
1,9,111,"['kelle', 'vaatluspunktide', 'arv', 'on', 'piiratud', ',', 'võivad', 'andmed', ' ..., type: <class 'list'>, length: 13",regular
2,47,58,"['(', 'alla', '20', ')']",embedded


FIXED:


Unnamed: 0,start,end,text,clause_type
0,0,111,"['Riigid', 'võivad', 'andmed', 'esitada', 'lihtsalt', 'kirjalikus', 'vormis', '.']",regular
1,7,60,"[',', 'kelle', 'vaatluspunktide', 'arv', 'on', 'piiratud', ',']",embedded
2,47,58,"['(', 'alla', '20', ')']",embedded


---


OLD:


Unnamed: 0,start,end,text,clause_type
0,0,11,"['Metoodika', ',']",regular
1,12,62,"['kuidas', 'teda', 'ujuma', 'õpetada', ',', 'oleme', 'selgeks', 'teinud', '.']",regular


FIXED:


Unnamed: 0,start,end,text,clause_type
0,0,62,"['Metoodika', 'oleme', 'selgeks', 'teinud', '.']",regular
1,10,39,"[',', 'kuidas', 'teda', 'ujuma', 'õpetada', ',']",embedded


---


OLD:


Unnamed: 0,start,end,text,clause_type
0,0,20,"['Kui', 'tuli', 'otsustada', ',']",regular
1,21,54,"['mida', 'valida', ',', 'jäi', 'peale', 'muusika', '.']",regular


FIXED:


Unnamed: 0,start,end,text,clause_type
0,0,20,"['Kui', 'tuli', 'otsustada', ',']",regular
1,21,34,"['mida', 'valida', ',']",regular
2,35,54,"['jäi', 'peale', 'muusika', '.']",regular


---


OLD:


Unnamed: 0,start,end,text,clause_type
0,0,20,"['Kui', 'tekkis', 'küsimus', ',']",regular
1,21,54,"['keda', 'võtta', ',', 'tehti', 'minuga', 'juttu', '.']",regular


FIXED:


Unnamed: 0,start,end,text,clause_type
0,0,20,"['Kui', 'tekkis', 'küsimus', ',']",regular
1,21,33,"['keda', 'võtta', ',']",regular
2,34,54,"['tehti', 'minuga', 'juttu', '.']",regular


---




### Details: clause error detection patterns

The following error patterns are used (square brackets mark clause boundaries in examples):

**1 -** Errors related to attributive clauses starting with _mis/kes/millal/kus/kust/kuhu/kuna/kuidas/kas_, followed by a verb, a comma and then the syntactic root, with no clause boundary between the start of the attributive clause and the syntactic root:
 * Error type name: `attributive_(mis|kes|millal|kus|kust|kuhu|kuna|kuidas|kas)_embedded_clause_wrong_end`;
 * Error pattern: `A:[ ... ] B:[ mis/kes/... ... VERB ... , C: ... ROOT ]`
 * Repair pattern: `A:[ ... B:[ mis/kes/... ... VERB ...  , ] C: ... ROOT ]`
 * Example error:  `[Kahtlemata on iga keel,] [mida inimene valdab (VERB) , topeltrikkus (ROOT) .]`
 * Example repair: `[Kahtlemata on iga keel [, mida inimene valdab (VERB) ,] topeltrikkus (ROOT) .]`
 * Repair splits clauses B and C, then joins A and C, and creates a new embedded (attributive) clause B inside the joined clause.
 * Correction description: `Split clause after position x and then embed the clause y1:y2 into clause z1:z2`


 * 1.1) Exceptions: split clauses at the comma, but do not create an embedded clause if: 
   * 1.1.1) the comma is followed by phrase "välja arvatud";
   * 1.1.2) the clause preceding the attributive clause starts with "kui" or "et";
   * 1.1.3) the clause preceding the attributive clause starts with a discourse marker, such as "ah";
   * Error type name: `attributive_(mis|kes|millal|kus|kust|kuhu|kuna|kuidas|kas)_clause_wrong_end`;
   * Correction description: `Split clause after position x.`
 * 1.2) Exceptions: do not apply the pattern at all if: 
   * 1.2.1) the comma is in the middle of a conjunction phrase; it is then likely a false signal that belongs to the conjunction rather than marking a clause break;
   * 1.2.2) the attributive clause is actually at the beginning of a sentence;
   * 1.2.3) the clause preceding the attributive clause is likely another attributive clause, i.e. it starts with _mis/kes/millal/kus/kust/kuhu/kuna/kuidas/kas_;
   * 1.2.4) there is no verb in the clause preceding the attributive clause, nor in the clause containing the syntactic root;
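
The effect of a "split and embed" correction can be illustrated with plain character offsets. This is a toy sketch under stated assumptions, not the EstNLTK implementation: the helpers `word_spans` and `embed_clause` are hypothetical, and the sentence is assumed to be tokenized by single spaces, as in the first example text of this tutorial.

```python
def word_spans(sentence):
    """Yield (start, end, word) for a sentence tokenized by single spaces."""
    position = 0
    for word in sentence.split(' '):
        yield position, position + len(word), word
        position += len(word) + 1  # move past the word and one separating space


def embed_clause(sentence, embedded, outer):
    """Assign each word to the embedded span or to the enclosing outer span."""
    emb_start, emb_end = embedded
    out_start, out_end = outer
    outer_words, embedded_words = [], []
    for start, end, word in word_spans(sentence):
        if emb_start <= start and end <= emb_end:
            embedded_words.append(word)
        elif out_start <= start and end <= out_end:
            outer_words.append(word)
    return outer_words, embedded_words


sentence = ('Ka võib kütusehind , mis moodustab piletihinnast kolmandiku , '
            'teha ootamatuid pöördeid .')
# Spans taken from the correction description of the first example:
# "... embed the clause 19:61 into clause 0:88."
outer_words, embedded_words = embed_clause(sentence, embedded=(19, 61), outer=(0, 88))
print(outer_words)
print(embedded_words)
```

The two word lists correspond to the `regular` and `embedded` rows of the first FIXED table shown earlier.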

**2 -** Errors related to disconnected root clauses: a clause is headed by the syntactic root but contains neither the root nor a verb; it is followed by a simple clause and then by the clause containing the root:
 * Error type name: `disconnected_root_clause`;
 * Error pattern: `A:[ ... NO_VERB/NO_ROOT/HEADED_BY_ROOT ... , ] B:[ ... , ] C:[ ... ROOT ... ]`
 * Repair pattern: `A:[ ... NO_VERB/NO_ROOT/HEADED_BY_ROOT B:[ , ... , ] C: ... ROOT ... ]`
 * Example error #1:  `[Varakevadel,] [kui talvevarudest napib,] [käime Naissaarel hülgeid küttimas.]`
 * Example repair #1: `[Varakevadel   [,kui talvevarudest napib,] käime Naissaarel hülgeid küttimas.]`
 * Example error #2:  `[Sest selleks,] [et saada relvaluba,] [on vaja täita palju formaalsusi.]`
 * Example repair #2: `[Sest selleks [, et saada relvaluba ,] on vaja täita palju formaalsusi.]`
 * Repair joins A and C into one clause, and creates a new embedded clause B inside the joined clause.
 * Correction description: `Embed the adjusted clause x1:x2 into clause y1:y2`


 * 2.1) Exceptions: do not apply the pattern at all if: 
   * 2.1.1) Clause A contains any of the lemmas _lugupeetud, austatud, tõepoolest, tõsi, muide, muuseas, võimalik, alati, ükskõik, niipea, iseasi, iseküsimus_ or words _kallid, juhul, juhtudel_;
   * 2.1.2) Clause B ends with quotes (likely part of direct speech);
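
The repair pattern for disconnected root clauses can be sketched on plain token lists. This is a simplified illustration, not the EstNLTK implementation: the function `embed_middle_clause` is hypothetical, works on tokens rather than layer spans, and applies none of the exceptions listed above.

```python
def embed_middle_clause(clause_a, clause_b, clause_c):
    """Join clauses A and C and return (joined_clause, embedded_clause).

    Following the repair pattern, the comma that closes clause A is moved
    to the start of the embedded clause B.
    """
    embedded = list(clause_b)
    head = list(clause_a)
    if head and head[-1] == ',':
        head.pop()
        embedded.insert(0, ',')
    joined = head + list(clause_c)
    return joined, embedded


# Example error #2 from the list above, written as token lists.
joined, embedded = embed_middle_clause(
    ['Sest', 'selleks', ','],
    ['et', 'saada', 'relvaluba', ','],
    ['on', 'vaja', 'täita', 'palju', 'formaalsusi', '.'],
)
print(joined)
print(embedded)
```

The joined clause keeps the tokens of A and C in their original order, while the embedded clause is delimited by commas on both sides, as in example repair #2.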

#### Performance evaluation

Error detection patterns were developed and evaluated on random subsets of the Reference Corpus of Estonian.
The final performance was evaluated on a 5-million-word subset of the corpus (~12k randomly picked documents), where processing yielded:

   * 624 `disconnected_root_clause` errors:
       * results of manual checking (of 100 randomly picked errors): 86 correct, 6 incorrect, 8 dubious corrections;
   * 146 `attributive_.+_clause_wrong_end` errors:
       * results of manual checking (of 100 randomly picked errors): 87 correct, 5 incorrect, 8 dubious corrections;
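
Since only 100 errors of each type were checked manually, the share of correct corrections is an estimate. As a rough sketch of the uncertainty involved, the snippet below computes a 95% Wilson score interval for each sample. The interval is an addition for illustration and not part of the original evaluation; dubious corrections are counted as not correct here, which is an assumption.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


# (correct, checked) counts from the manual checking reported above.
samples = {
    'disconnected_root_clause': (86, 100),
    'attributive_.+_clause_wrong_end': (87, 100),
}
intervals = {name: wilson_interval(correct, n)
             for name, (correct, n) in samples.items()}
for name, (low, high) in intervals.items():
    correct, n = samples[name]
    print(f'{name}: {correct / n:.2f} correct, 95% interval {low:.2f}-{high:.2f}')
```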


#### Source

For more information about the patterns, including examples of the exceptions, please see the source of the function: https://github.com/estnltk/estnltk/blob/fc6811e1f329244bb3623f2beba7fe7f082afdac/estnltk/estnltk/consistency/clauses_and_syntax_consistency.py#L274-L650

Source code of the performance evaluation: https://github.com/estnltk/estnltk-workflows/tree/master/detect_inconsistencies/clauses_and_syntax