# How to systematically build a decorator for rule taggers 

The logic of all rule taggers is the same: 
1. Rules are used to extract potential matches.
2. A matching rule assigns a set of initial attributes to each match. 
3. A conflict resolver strategy is used to deal with overlapping matches.
4. A global or rule-specific decorator is used to update attribute values or filter out spurious matches.
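
The four steps can be sketched in plain Python. This is a toy illustration under simplifying assumptions, not EstNLTK's implementation: the names `RULES`, `extract`, `keep_maximal`, `run_rule_tagger` and `add_length` are hypothetical, and matching is done on a plain list of lowercase lemmas.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy rules: a tuple of lemmas mapped to initial attributes.
RULES: Dict[Tuple[str, ...], dict] = {
    ('aadu', 'must'): {'entity_type': 'PER'},
    ('euroopa', 'liit'): {'entity_type': 'ORG'},
    ('euroopa',): {'entity_type': 'LOC'},
}

Match = Tuple[int, int, dict]  # (start index, end index, attributes)


def extract(lemmas: List[str]) -> List[Match]:
    """Steps 1 and 2: find rule matches and attach a copy of the rule attributes."""
    matches = []
    for pattern, attributes in RULES.items():
        n = len(pattern)
        for i in range(len(lemmas) - n + 1):
            if tuple(lemmas[i:i + n]) == pattern:
                matches.append((i, i + n, dict(attributes)))
    return sorted(matches, key=lambda m: (m[0], m[1]))


def keep_maximal(matches: List[Match]) -> List[Match]:
    """Step 3: drop every match that lies inside a longer match."""
    return [m for m in matches
            if not any(o is not m and o[0] <= m[0] and m[1] <= o[1]
                       and (o[1] - o[0]) > (m[1] - m[0]) for o in matches)]


def run_rule_tagger(lemmas: List[str],
                    decorator: Callable[[List[str], Tuple[int, int], dict], Optional[dict]]):
    """Step 4: let the decorator update the attributes or reject the match."""
    results = []
    for start, end, attributes in keep_maximal(extract(lemmas)):
        decorated = decorator(lemmas, (start, end), attributes)
        if decorated is not None:
            results.append((start, end, decorated))
    return results


def add_length(lemmas, span, attributes):
    """Toy decorator: derive one extra attribute from the span."""
    attributes['length'] = span[1] - span[0]
    return attributes


lemmas = ['täna', 'rääkima', 'aadu', 'must', 'ja', 'euroopa', 'liit']
tagged = run_rule_tagger(lemmas, add_length)
```

Here the one-word match `euroopa` is removed by the conflict resolver because it lies inside the longer match `euroopa liit`, and the decorator then sees only the two remaining matches.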

In this notebook, we show how to systematically develop decorators. 
Writing a decorator can be quite error-prone, as its input structure is complex and its behaviour is hard to test.
As a concrete example, we develop a decorator for `PhraseTagger` that should match proper names from a list specified by lemma tuples.
We can refine this extraction strategy by requiring that the words in the match have matching cases. 
For that we need to access the morphological annotations and check their consistency.
To demonstrate attribute derivation, we also convert proper names to their normal form in the nominative case.

In [1]:
from estnltk import Text
from estnltk.taggers.system.rule_taggers import Ruleset 
from estnltk.taggers.system.rule_taggers import StaticExtractionRule 
from estnltk.taggers.system.rule_taggers import DynamicExtractionRule
from estnltk.taggers import PhraseTagger

## I. Define initial set of extraction rules 

We need an initial ruleset to proceed with development.
This ruleset does not have to be complete as long as it creates enough matches in test data.

In [2]:
extraction_rules = Ruleset([
    StaticExtractionRule(pattern=('aadu','must'), attributes = {'entity_type': 'PER', 'profession': 'politician', 'age': 63}),
    StaticExtractionRule(pattern=('euroopa','liit'), attributes = {'entity_type': 'ORG'})
])

## II. Define the initial test data
The aim here is to define a set of revealing example sentences which contain true and false matches.
The list here does not have to be complete. 
We need enough examples to carve out main code paths in the decorator.
We also need to add the morphological analysis layer to these texts, otherwise we cannot check for consistency. 

In [3]:
text1 = Text('Täna räägime Aadu Mustast ja Euroopa Liidust.')
text2 = Text('Eile rääkisime Aadule musta kaabu kinkimisest ja Euroopas liidu sõlmimisest.')

text1.tag_layer('morph_analysis')
text2.tag_layer('morph_analysis')

display(text1['morph_analysis'])
display(text2['morph_analysis'])

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,8

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Täna,Täna,täna,täna,['täna'],0,,,D
räägime,räägime,rääkima,rääki,['rääki'],me,,me,V
Aadu,Aadu,Aadu,Aadu,['Aadu'],0,,sg n,H
Mustast,Mustast,must,must,['must'],st,,sg el,A
ja,ja,ja,ja,['ja'],0,,,J
Euroopa,Euroopa,Euroopa,Euroopa,['Euroopa'],0,,sg g,H
Liidust,Liidust,Liidu,Liidu,['Liidu'],st,,sg el,H
,Liidust,Liidud,Liidud,['Liidud'],st,,sg el,H
,Liidust,Liit,Liit,['Liit'],st,,sg el,H
.,.,.,.,['.'],,,,Z


layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,11

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Eile,Eile,eile,eile,['eile'],0,,,D
rääkisime,rääkisime,rääkima,rääki,['rääki'],sime,,sime,V
Aadule,Aadule,Aadu,Aadu,['Aadu'],le,,sg all,H
musta,musta,must,must,['must'],0,,sg g,A
kaabu,kaabu,kaabu,kaabu,['kaabu'],0,,sg g,S
kinkimisest,kinkimisest,kinkimine,kinkimine,['kinkimine'],st,,sg el,S
ja,ja,ja,ja,['ja'],0,,,J
Euroopas,Euroopas,Euroopa,Euroopa,['Euroopa'],s,,sg in,H
liidu,liidu,liit,liit,['liit'],0,,sg g,S
sõlmimisest,sõlmimisest,sõlmimine,sõlmimine,['sõlmimine'],st,,sg el,S
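
The `form` values shown above are what a case check has to inspect, and a word may carry several analyses. A minimal sketch of such a check on plain lists of form strings follows. It assumes that nominal forms look like `'sg el'` or `'pl n'` (number followed by case) and skips everything else; the helper names `cases_of` and `has_case` are hypothetical.

```python
from typing import Iterable, Set


def cases_of(forms: Iterable[str]) -> Set[str]:
    """Collect case tags from form strings such as 'sg n' or 'pl el'.

    Assumption: only two-part forms starting with 'sg' or 'pl' are nominal forms;
    anything else (verb forms, empty strings) is ignored.
    """
    cases = set()
    for form in forms:
        parts = form.split()
        if len(parts) == 2 and parts[0] in ('sg', 'pl'):
            cases.add(parts[1])
    return cases


def has_case(forms: Iterable[str], case: str) -> bool:
    """True if at least one analysis of the word is in the given case."""
    return case in cases_of(forms)


aadu = ['sg n']                         # forms of 'Aadu' in the first text
aadule = ['sg all']                     # forms of 'Aadule' in the second text
liidust = ['sg el', 'sg el', 'sg el']   # three analyses of 'Liidust'

first_is_nominative = has_case(aadu, 'n')
second_is_nominative = has_case(aadule, 'n')
liidust_cases = cases_of(liidust)
```

Working on the set of cases rather than on the first analysis keeps the check meaningful for ambiguous words.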


## III. Extract initial matches for the decorator 

The decorator is a function that matches the following template 

```python
def decorator(text: Text, base_span: BaseSpan, annotations: Dict[str, Any]) -> Optional[Dict[str, Any]]
```

where 
* the input `text` gives a full access to the text object that is tagged  
* the input `base_span` is the current match to be decorated
* the input `annotations` contains the initial set of attributes for the match

The function should return `None` if the match is a false positive and the updated dictionary for real matches.
It is safe to modify existing annotations.
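
A minimal sketch of this contract is given below. The toy decorator ignores `text` and `base_span`, so it can be run without EstNLTK; the name `require_adult` and the age threshold are hypothetical and serve only to show the two possible outcomes.

```python
from typing import Any, Dict, Optional


def require_adult(text: Any, base_span: Any,
                  annotations: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Toy decorator: reject persons younger than 18, otherwise add an attribute."""
    age = annotations.get('age')
    if annotations.get('entity_type') == 'PER' and age is not None and age < 18:
        return None                            # the match is dropped
    annotations['has_age'] = age is not None   # in-place update of the annotation
    return annotations


kept = require_adult(None, None, {'entity_type': 'PER', 'age': 63})
dropped = require_adult(None, None, {'entity_type': 'PER', 'age': 12})
```

The first call returns the same dictionary with one extra key, while the second call returns `None`, which signals that the match should be discarded.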

**Shortcut for decorator development:**
In principle, one can define the decorator function in one go without running it on real data, but this is a mentally difficult and error-prone way to develop complex logic. It is far simpler to extract all the inputs to which the decorator is applied on the test data and use them as inputs during development. 

For that we need to define the initial phrase tagger and use the following method to get the list which will be processed by the decorator:


```python
def extract_annotations(self, raw_text: str, layers: Dict[str, Layer]) -> List[Tuple[EnvelopingBaseSpan, str, Any]]
```

where `raw_text` is the underlying `Text.text` field of the `Text` object. Note that the output list will be processed by a conflict resolver if one is provided as an argument during the initialisation of the phrase tagger. Hence, you can also test the behaviour of conflict resolvers on the same input. 
For standard conflict resolving strategies `KEEP_MAXIMAL` and `KEEP_MINIMAL` the corresponding resolvers can be imported as follows:

```python
from estnltk.taggers.system.rule_taggers.helper_methods.helper_methods import keep_maximal_matches
from estnltk.taggers.system.rule_taggers.helper_methods.helper_methods import keep_minimal_matches
```
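
To build intuition for the two strategies, the sketch below applies them to plain `(start, end)` intervals. It assumes that `KEEP_MAXIMAL` keeps matches that are not contained in another match and `KEEP_MINIMAL` keeps matches that contain no other match. The functions `toy_keep_maximal` and `toy_keep_minimal` are hypothetical stand-ins; the library resolvers operate on match tuples and may differ in details.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) character offsets


def covers(outer: Interval, inner: Interval) -> bool:
    """True if `outer` contains `inner` and the two intervals are not identical."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


def toy_keep_maximal(spans: List[Interval]) -> List[Interval]:
    """Keep only the spans that are not covered by any other span."""
    return [s for s in spans if not any(covers(o, s) for o in spans)]


def toy_keep_minimal(spans: List[Interval]) -> List[Interval]:
    """Keep only the spans that do not cover any other span."""
    return [s for s in spans if not any(covers(s, o) for o in spans)]


# 'Aadu' vs 'Aadu Mustast' and 'Euroopa' vs 'Euroopa Liidust' in the first test text
spans = [(13, 17), (13, 25), (29, 36), (29, 44)]
maximal = toy_keep_maximal(spans)
minimal = toy_keep_minimal(spans)
```

With these inputs, the maximal strategy keeps the two-word phrases and the minimal strategy keeps the single words.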

In [4]:
from estnltk.taggers.system.rule_taggers.helper_methods.helper_methods import keep_maximal_matches

In [5]:
initial_tagger = PhraseTagger(
    output_layer='entities',
    input_layer='morph_analysis',
    input_attribute='lemma', 
    ruleset=extraction_rules,
    output_attributes=('match', 'entity_type', 'profession', 'age'),
    conflict_resolver='KEEP_MAXIMAL',
    ignore_case=True)

In [6]:
raw_text = text1.text
layers = {'morph_analysis': text1['morph_analysis']}
output = initial_tagger.extract_annotations(raw_text, layers)
filtered_output = keep_maximal_matches(output)

##  IV. Decorator development 

Let us start with the first input from the extraction output to nail the overall structure of the decorator.
For that we need to collect base spans and corresponding annotations given by static rules. 
To get going, we simply write out the corresponding annotation by hand, knowing that the first match corresponds to Aadu Must. 

Note that the necessary input span is the first element of the tuple for the first match. The initial annotation also contains the phrase under the key `initial_tagger.phrase_attribute`, which is currently `phrase`. This key can be renamed by specifying the `phrase_attribute` argument during the initialisation of the tagger. The third element of the tuple holds the value of the phrase attribute.   

**TODO:** Update this when we have corrected the errors in the tagger

In [7]:
input_text = text1
input_span = output[0][0]
input_annotation = {'phrase': output[0][2], 'entity_type': 'PER', 'profession': 'politician', 'age': 63}

In [8]:
def decorator_that_prints_some_relevant_information(text, span, annotation):
    print(f"Match as compound span: {span}")
    print(f"Match phrase as a list of words: {annotation['phrase']}")
    print(f"The first word of the match: {span[0]}")
    print(f"Number of morphanalysis of the first word: {len(text['morph_analysis'].get(span[0]).annotations)}")
    print(f"Morphological forms of the first word: {text['morph_analysis'].get(span[0])['form']}")

decorator_that_prints_some_relevant_information(input_text, input_span, input_annotation)

Match as compound span: EnvelopingBaseSpan((ElementaryBaseSpan(13, 17), ElementaryBaseSpan(18, 25)))
Match phrase as a list of words: ('aadu', 'must')
The first word of the match: ElementaryBaseSpan(13, 17)
Number of morphanalysis of the first word: 1
Morphological forms of the first word: ['sg n']


Let us now build a decorator that checks the consistency of the morphological forms of the matched phrase.

In [9]:
def morphological_consistency_checker(text, span, annotation):
    # Check full names (use the `span` argument, not the global `input_span`)
    if len(span) == 2 and annotation['entity_type'] == 'PER':
        word_1 = text['morph_analysis'].get(span[0])
        word_2 = text['morph_analysis'].get(span[1])

        # The first word in the name must be in nominative (singular or plural)
        if 'sg n' not in word_1['form'] and 'pl n' not in word_1['form']:
            return None
        
        # For simplicity, let us assume that the first analysis is correct 
        annotation['normal_form'] = f"{word_1['lemma'][0]} {word_2['lemma'][0]}"
        return annotation
        
morphological_consistency_checker(input_text, input_span, input_annotation)        

{'phrase': ('aadu', 'must'),
 'entity_type': 'PER',
 'profession': 'politician',
 'age': 63,
 'normal_form': 'Aadu must'}

This works for the first input. Let us check how it behaves over all inputs.
For that we need to convert all matches into decorator inputs and run the decorator over all of them.
We use the function `get_decorator_inputs` for that.

In [10]:
# TODO: move this method into PhraseTagger and generalize
def get_decorator_inputs(tagger, text, match_list):
    output = [None] * len(match_list)
    for i, (base_span, _, phrase) in enumerate(match_list):
        static_rulelist = tagger.static_ruleset_map.get(phrase, None)
        # If several rules share the phrase, only the last annotation is kept
        for group, priority, annotation in static_rulelist:
            annotation = annotation.copy()
            annotation[tagger.phrase_attribute] = phrase
        # Optional bookkeeping attributes (read from `tagger`, there is no `self` here)
        if tagger.group_attribute:
            annotation[tagger.group_attribute] = group
        if tagger.priority_attribute:
            annotation[tagger.priority_attribute] = priority
        if tagger.pattern_attribute:
            annotation[tagger.pattern_attribute] = phrase
        output[i] = (text, base_span, annotation)
    return output

In [11]:
decorator_inputs = get_decorator_inputs(initial_tagger, text1, output)
for i, (text, span, annotation) in enumerate(decorator_inputs):
    print(f"Input no: {i}")
    print(f"Input pattern: {' '.join(annotation['phrase'])}")
    print(f"Output: {morphological_consistency_checker(text, span, annotation)}")
    print('')

Input no: 0
Input pattern: aadu must
Output: {'entity_type': 'PER', 'profession': 'politician', 'age': 63, 'phrase': ('aadu', 'must'), 'normal_form': 'Aadu must'}

Input no: 1
Input pattern: euroopa liit
Output: None



**Observation:** The decorator returns `None` for the phrase Euroopa Liit, which is expected as we did not specify how to handle organisations. Let us refine the decorator to handle this case. Here we can use the list of decorator inputs to define a new target input.  

In [12]:
input_text = decorator_inputs[1][0]
input_span = decorator_inputs[1][1]
input_annotation = decorator_inputs[1][2]

In [13]:
def morphological_consistency_checker(text, span, annotation):
    # Check full names (use the `span` argument, not the global `input_span`)
    if len(span) == 2 and annotation['entity_type'] == 'PER':
        word_1 = text['morph_analysis'].get(span[0])
        word_2 = text['morph_analysis'].get(span[1])

        # The first word in the name must be in nominative (singular or plural)
        if 'sg n' not in word_1['form'] and 'pl n' not in word_1['form']:
            return None
        
        # For simplicity, let us assume that the first analysis is correct 
        annotation['normal_form'] = f"{word_1['lemma'][0]} {word_2['lemma'][0]}"
        return annotation
    # Check organisation names
    elif len(span) == 2 and annotation['entity_type'] == 'ORG':
        word_1 = text['morph_analysis'].get(span[0])
        word_2 = text['morph_analysis'].get(span[1])

        # The first word in the name must be in genitive (singular or plural)
        if 'sg g' not in word_1['form'] and 'pl g' not in word_1['form']:
            return None
        
        # For simplicity, let us assume that the first analysis is correct 
        annotation['normal_form'] = f"{word_1['lemma'][0]} {word_2['lemma'][0]}"
        return annotation
        
morphological_consistency_checker(input_text, input_span, input_annotation)  

{'entity_type': 'ORG',
 'phrase': ('euroopa', 'liit'),
 'normal_form': 'Euroopa Liidu'}

The consistency check is fine, but the normal form is incorrect: we get `Euroopa Liidu` instead of `Euroopa Liit`. We are not going to correct this in the decorator, as a better option is to define the normal form as an attribute in the static rule.
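
One way to combine the two sources is to let the decorator prefer a rule-provided value and fall back to the lemmas only when the rule gives none. The sketch below works on plain dictionaries and lists; the helper `choose_normal_form` and the `normal_form` key in the rule attributes are assumptions made for illustration.

```python
from typing import List


def choose_normal_form(annotation: dict, lemmas: List[str]) -> str:
    """Prefer a normal form supplied by the rule; otherwise join the lemmas."""
    return annotation.get('normal_form') or ' '.join(lemmas)


# Attributes as they might be written in a static rule (hypothetical 'normal_form' key)
org_attributes = {'entity_type': 'ORG', 'normal_form': 'Euroopa Liit'}
per_attributes = {'entity_type': 'PER'}

org_form = choose_normal_form(org_attributes, ['Euroopa', 'Liidu'])
per_form = choose_normal_form(per_attributes, ['Aadu', 'must'])
```

This keeps hand-checked forms authoritative while still deriving a value for rules that do not specify one.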
Let us now define the new tagger and see how it works on test texts.

In [14]:
updated_tagger = PhraseTagger(
    output_layer='entities',
    input_layer='morph_analysis',
    input_attribute='lemma', 
    ruleset=extraction_rules,
    output_attributes=('match', 'entity_type', 'profession', 'age'),
    decorator=morphological_consistency_checker,
    conflict_resolver='KEEP_MAXIMAL',
    ignore_case=True)

In [15]:
if 'entities' in text1.layers:
    text1.pop_layer('entities')
display(updated_tagger(text1)['entities'])

if 'entities' in text2.layers:
    text2.pop_layer('entities')
display(updated_tagger(text2)['entities'])

layer name,attributes,parent,enveloping,ambiguous,span count
entities,"match, entity_type, profession, age",,morph_analysis,False,2

text,match,entity_type,profession,age
"['Aadu', 'Mustast']",,PER,politician,63.0
"['Euroopa', 'Liidust']",,ORG,,


layer name,attributes,parent,enveloping,ambiguous,span count
entities,"match, entity_type, profession, age",,morph_analysis,False,0

text,match,entity_type,profession,age


As it seems to work, we do not have to tweak the decorator further. 
However, we should now define test cases to pin down the intended behaviour.    

## V. Development of regression tests

The simplest way to pin down the desired behaviour is to define test functions that can be run with `pytest`. There are two options for that. First, we can write tests for the decorator. Second, we can test the behaviour of the entire tagger. Both options are useful.

In [16]:
def test_morphological_consistency_checker():
    
    tagger = PhraseTagger(
        output_layer='entities',
        input_layer='morph_analysis',
        input_attribute='lemma', 
        ruleset=extraction_rules,
        output_attributes=('match', 'entity_type', 'profession', 'age'),
        decorator=morphological_consistency_checker,
        conflict_resolver='KEEP_MAXIMAL',
        ignore_case=True)

    # The first test text
    text = Text('Täna räägime Aadu Mustast ja Euroopa Liidust.').tag_layer('morph_analysis')
    raw_text = text.text
    # Use the layers of the locally created text, not the global text1
    layers = {'morph_analysis': text['morph_analysis']}
    output = tagger.extract_annotations(raw_text, layers)
    decorator_inputs = get_decorator_inputs(tagger, text, output)

    result = morphological_consistency_checker(*decorator_inputs[0])
    target = {'entity_type': 'PER', 'profession': 'politician', 'age': 63, 'phrase': ('aadu', 'must'), 'normal_form': 'Aadu must'}
    assert result == target

    result = morphological_consistency_checker(*decorator_inputs[1])
    target = {'entity_type': 'ORG', 'phrase': ('euroopa', 'liit'), 'normal_form': 'Euroopa Liidu'}
    assert result == target
    
    # The second test text
    text = Text('Eile rääkisime Aadule musta kaabu kinkimisest ja Euroopas liidu sõlmimisest.').tag_layer('morph_analysis')
    raw_text = text.text
    # Use the layers of the locally created text, not the global text2
    layers = {'morph_analysis': text['morph_analysis']}
    output = tagger.extract_annotations(raw_text, layers)
    decorator_inputs = get_decorator_inputs(tagger, text, output)

    result = morphological_consistency_checker(*decorator_inputs[0])
    assert result is None
    
    result = morphological_consistency_checker(*decorator_inputs[1])
    assert result is None

    return True

test_morphological_consistency_checker()

True

In [17]:
def test_final_tagger():
    
    tagger = PhraseTagger(
        output_layer='entities',
        input_layer='morph_analysis',
        input_attribute='lemma', 
        ruleset=extraction_rules,
        output_attributes=('match', 'entity_type', 'profession', 'age'),
        decorator=morphological_consistency_checker,
        conflict_resolver='KEEP_MAXIMAL',
        ignore_case=True)

    # The first test text
    text = tagger(Text('Täna räägime Aadu Mustast ja Euroopa Liidust.').tag_layer('morph_analysis'))
    assert len(text['entities']) == 2
    assert text['entities'][0].text == ['Aadu', 'Mustast']
    assert text['entities'][1].text == ['Euroopa', 'Liidust']

    # The second test text    
    text = tagger(Text('Eile rääkisime Aadule musta kaabu kinkimisest ja Euroopas liidu sõlmimisest.').tag_layer('morph_analysis'))
    assert len(text['entities']) == 0

    return True 

**Final comments:** These tests are naive, since they mix data with code. 
It is much wiser to build a separate test suite that reads inputs and desired outputs from text files.
For taggers, the corresponding framework is implemented in the `estnltk_core.taggers.tagger_tester` module.
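
The idea of separating data from code can be sketched without EstNLTK as follows. The JSON layout, the file name, and the functions `load_cases`, `toy_decorator` and `run_cases` are hypothetical; this is not the format or the API of the `tagger_tester` module, only an illustration of a data-driven check.

```python
import json
import tempfile
from pathlib import Path


def load_cases(path):
    """Read a list of test cases from a JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def toy_decorator(forms, annotation):
    """Stand-in decorator: keep persons only if the first word is in nominative."""
    if annotation['entity_type'] == 'PER' and 'sg n' not in forms[0]:
        return None
    return annotation


def run_cases(cases, decorator):
    """Return the names of the cases whose result differs from the expectation."""
    failures = []
    for case in cases:
        result = decorator(case['forms'], dict(case['annotation']))
        if result != case['expected']:
            failures.append(case['name'])
    return failures


# In a real project this data would live in a file under version control.
cases = [
    {'name': 'Aadu Mustast', 'forms': [['sg n'], ['sg el']],
     'annotation': {'entity_type': 'PER'}, 'expected': {'entity_type': 'PER'}},
    {'name': 'Aadule musta', 'forms': [['sg all'], ['sg g']],
     'annotation': {'entity_type': 'PER'}, 'expected': None},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'decorator_cases.json'
    path.write_text(json.dumps(cases, ensure_ascii=False), encoding='utf-8')
    failures = run_cases(load_cases(path), toy_decorator)
```

New examples can then be added by editing the data file alone, and a `pytest` test only needs to assert that the list of failures is empty.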