# PhraseTagger
`PhraseTagger` tags sequences of consecutive attribute values in a layer. The result is an enveloping layer.
Let's create a `Text` object with a `morph_analysis` layer.

In [1]:
from estnltk import Text

text = Text('Eestimaal tunnevad inimesed palju puudust päikesest ja energiast.').tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Eestimaal,Eestimaal,Eestimaa,Eesti_maa,"['Eesti', 'maa']",l,,sg ad,H
tunnevad,tunnevad,tundma,tund,['tund'],vad,,vad,V
inimesed,inimesed,inimene,inimene,['inimene'],d,,pl n,S
palju,palju,palju,palju,['palju'],0,,,D
puudust,puudust,puudus,puudus,['puudus'],t,,sg p,S
päikesest,päikesest,päike,päike,['päike'],st,,sg el,S
,päikesest,päikene,päikene,['päikene'],st,,sg el,S
ja,ja,ja,ja,['ja'],0,,,J
energiast,energiast,energia,energia,['energia'],st,,sg el,S
.,.,.,.,['.'],,,,Z


Use `PhraseTagger` to tag sequences of lemmas in that text. The lemma sequences are read from a file. An `AmbiguousRuleset` is used because the file contains different annotations for the same pattern.

In [2]:
from estnltk.taggers.system.rule_taggers import AmbiguousRuleset

vocabulary_file = 'phrase_vocabulary.csv'
ruleset = AmbiguousRuleset()
ruleset.load(file_name=vocabulary_file, key_column='_phrase_')
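
The contents of `phrase_vocabulary.csv` are not shown in this tutorial. As a rough illustration of why an ambiguous ruleset is needed, the sketch below groups the rows of a hypothetical CSV by phrase. The column layout, the space-separated phrase format and the helper `group_by_phrase` are assumptions made for this example, not the behaviour of `AmbiguousRuleset.load`.

```python
import csv
import io
from collections import defaultdict

# Hypothetical vocabulary content; the real file layout is an assumption.
SAMPLE_CSV = """_phrase_,value
tundma inimene,TI_1
tundma inimene,TI_2
tundma inimene palju,TIP
päike,P
"""


def group_by_phrase(csv_text, key_column="_phrase_"):
    """Map each phrase (a tuple of lemmas) to the list of its annotations."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The key column holds the phrase; the remaining columns are attributes.
        phrase = tuple(row.pop(key_column).split())
        grouped[phrase].append(dict(row))
    return dict(grouped)


rules = group_by_phrase(SAMPLE_CSV)
# One pattern, two annotations: this is what makes the ruleset ambiguous.
print(rules[("tundma", "inimene")])
```

A plain mapping from pattern to a single annotation could not hold both `TI_1` and `TI_2` for the same phrase, which is why each pattern maps to a list here.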

To change an annotation, the decorator function must return the updated annotation. If the decorator returns `None`, the annotation is not added to the span. A span is added to the output layer only if it has at least one annotation.

In [3]:
from estnltk.taggers import PhraseTagger


def decorator(layer, span, annotation):
    annotation['attr_1'] = 'default_1'
    annotation['attr_2'] = len(span)
    annotation['_priority_'] = 1
    return annotation
    

tagger = PhraseTagger(output_layer='phrases',
                      input_layer='morph_analysis',
                      input_attribute='lemma',
                      ruleset=ruleset,
                      output_attributes=['value', '_priority_', 'attr_1', 'attr_2', '_phrase_'],
                      decorator=decorator,
                      conflict_resolver='KEEP_ALL')
tagger

name,output layer,output attributes,input layers
PhraseTagger,phrases,"('value', '_priority_', 'attr_1', 'attr_2', '_phrase_')","('morph_analysis',)"

0,1
input_attribute,lemma
ruleset,"<estnltk.taggers.system.rule_taggers.extraction_rules.ambiguous_ruleset.Ambiguou ..., type: <class 'estnltk.taggers.system.rule_taggers.extraction_rules.ambiguous_ruleset.AmbiguousRuleset'>"
decorator,<function __main__.decorator>
ignore_case,False
conflict_resolver,KEEP_ALL
phrase_attribute,phrase
static_ruleset_map,"{('tundma', 'inimene'): [(0, 0, {'value': 'TI_1'}), (0, 0, {'value': 'TI_2'})], ..., type: <class 'dict'>, length: 3"
dynamic_ruleset_map,{}


**output_layer** - name of the output layer<br>
**input_layer** - name of the input layer<br>
**input_attribute** - name of the input layer attribute<br>
**ruleset** - Ruleset or AmbiguousRuleset with the rules for creating annotations<br>
**output_attributes** - list of output layer attributes<br>
**decorator** - decorator function that takes three arguments (layer, span and annotation) and returns an updated annotation or None to discard the annotation<br>
**conflict_resolver** - conflict resolving strategy, for example `'KEEP_ALL'` or `'KEEP_MAXIMAL'`<br>
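
The decorator contract described above can be sketched in plain Python. This is only an illustration of the rule "return the annotation to keep it, return `None` to discard it, and drop the span if nothing is left". The helper `apply_decorator` and the decorator `keep_long_phrases` are hypothetical names, and the sketch does not reproduce how `PhraseTagger` is implemented internally.

```python
def apply_decorator(decorator, layer, span, annotations):
    """Return the annotations that survive the decorator.

    Returns None when no annotation is left, meaning the span
    would not be added to the output layer.
    """
    kept = []
    for annotation in annotations:
        # Pass a copy so the original rule attributes stay untouched.
        result = decorator(layer, span, dict(annotation))
        if result is not None:
            kept.append(result)
    return kept or None


def keep_long_phrases(layer, span, annotation):
    # Hypothetical decorator: discard single-word matches, annotate the rest.
    if len(span) < 2:
        return None
    annotation['attr_2'] = len(span)
    return annotation


two_words = apply_decorator(keep_long_phrases, None, ['tunnevad', 'inimesed'],
                            [{'value': 'TI_1'}, {'value': 'TI_2'}])
one_word = apply_decorator(keep_long_phrases, None, ['päikesest'],
                           [{'value': 'P'}])
print(two_words)
print(one_word)
```

Here the span is represented by a simple list of word strings and the layer argument is unused; in EstNLTK both are proper objects.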

The single lemma 'päike' is tagged. The sequence of lemmas 'tundma', 'inimene' is tagged twice because the vocabulary contains two interpretations for that phrase. The longer sequence 'tundma', 'inimene', 'palju' is tagged as well:

In [4]:
tagger.tag(text)
text.phrases

layer name,attributes,parent,enveloping,ambiguous,span count
phrases,"value, _priority_, attr_1, attr_2, _phrase_",,morph_analysis,True,3

text,value,_priority_,attr_1,attr_2,_phrase_
"['tunnevad', 'inimesed']",TI_1,1,default_1,2,
,TI_2,1,default_1,2,
"['tunnevad', 'inimesed', 'palju']",TIP,1,default_1,3,
['päikesest'],P,1,default_1,1,


Conflict resolution keeps all of the spans because their priorities are equal.
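
To make the matching step concrete, here is a minimal sketch of how lemma patterns could be matched against an ambiguous layer, where each word may have several candidate lemmas. The function `find_phrases` is a hypothetical helper written for this illustration, and the lemma sets are copied by hand from the `morph_analysis` table above. It is not the algorithm `PhraseTagger` uses internally.

```python
def find_phrases(lemma_sets, patterns):
    """Return (start, end, pattern) for every pattern match in the text.

    lemma_sets: one set of candidate lemmas per word.
    patterns:   list of tuples of lemmas.
    """
    matches = []
    for start in range(len(lemma_sets)):
        for pattern in patterns:
            end = start + len(pattern)
            if end > len(lemma_sets):
                continue
            # A word matches if any of its candidate lemmas equals the pattern item.
            if all(lemma in lemma_sets[start + i]
                   for i, lemma in enumerate(pattern)):
                matches.append((start, end, pattern))
    return matches


lemma_sets = [{'Eestimaa'}, {'tundma'}, {'inimene'}, {'palju'}, {'puudus'},
              {'päike', 'päikene'}, {'ja'}, {'energia'}, {'.'}]
patterns = [('tundma', 'inimene'), ('tundma', 'inimene', 'palju'), ('päike',)]

matches = find_phrases(lemma_sets, patterns)
for start, end, pattern in matches:
    print(start, end, pattern)
```

The word 'päikesest' matches the pattern `('päike',)` even though it has a second analysis with the lemma 'päikene', because one matching candidate is enough in this sketch.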

In the next two examples, conflicts are resolved only by priority. The first example is case-sensitive; the second is not.

In [5]:
from estnltk.taggers.system.rule_taggers import StaticExtractionRule, Ruleset

ruleset = Ruleset()
rules = [StaticExtractionRule(pattern=('SUUR', 'ja', 'väike'), attributes={'_priority_': 0}),
         StaticExtractionRule(pattern=('väike',), attributes={'_priority_': 1})]
ruleset.add_rules(rules)

In [6]:
text = Text('Suur ja väike.').tag_layer(['words'])
# Case-sensitive matching: the pattern 'SUUR' does not match the word 'Suur'.
size_tagger = PhraseTagger(input_layer='words',
                           output_layer='size',
                           input_attribute='text',
                           output_attributes=['_priority_'],
                           conflict_resolver='KEEP_ALL',
                           ruleset=ruleset,
                           ignore_case=False)
size_tagger.tag(text)
text.size

layer name,attributes,parent,enveloping,ambiguous,span count
size,_priority_,,words,False,1

text,_priority_
['väike'],1


In [7]:
text = Text('Suur ja väike.').tag_layer(['words'])
# Case-insensitive matching: the pattern 'SUUR' now matches the word 'Suur'.
size_tagger = PhraseTagger(input_layer='words',
                           output_layer='size',
                           input_attribute='text',
                           output_attributes=['_priority_'],
                           conflict_resolver='KEEP_ALL',
                           ruleset=ruleset,
                           ignore_case=True)
size_tagger.tag(text)
text.size

layer name,attributes,parent,enveloping,ambiguous,span count
size,_priority_,,words,False,2

text,_priority_
"['Suur', 'ja', 'väike']",0
['väike'],1
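
The difference between the two runs comes down to how the pattern is compared with the words. The sketch below shows one simple way to implement such a comparison. `matches_at` is a hypothetical helper, and lowercasing both sides is an assumption about how case can be ignored, not a description of the `ignore_case` implementation in EstNLTK.

```python
def matches_at(words, start, pattern, ignore_case=False):
    """Check whether the pattern matches the words beginning at index start."""
    window = list(words[start:start + len(pattern)])
    if len(window) < len(pattern):
        return False
    expected = list(pattern)
    if ignore_case:
        # Normalise both sides so that 'SUUR' and 'Suur' compare equal.
        window = [word.lower() for word in window]
        expected = [item.lower() for item in expected]
    return window == expected


words = ['Suur', 'ja', 'väike', '.']
pattern = ('SUUR', 'ja', 'väike')

case_sensitive = matches_at(words, 0, pattern)
case_insensitive = matches_at(words, 0, pattern, ignore_case=True)
print(case_sensitive, case_insensitive)
```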


## One more example

Note how the phrase attribute can be used to add the phrase to the annotation. The decorator can also make other edits or validate the annotation based on the phrase attribute.

In [8]:
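from estnltk.taggers import PhraseTagger
from estnltk.taggers.system.rule_taggers import StaticExtractionRule, Ruleset

# TODO: a helper function will be added to make creating rulesets from lists easier.
# The string literal below is not used as rules; it shows the same list with the
# 'match' attribute set explicitly instead of through phrase_attribute.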
'''
phrase_list = [StaticExtractionRule(pattern=('jalg',), attributes={'match': ('jalg',)}),
               StaticExtractionRule(pattern=('vasak','jalg'), attributes={'match': ('vasak','jalg')}),
               StaticExtractionRule(pattern=('parem','jalg'), attributes={'match': ('parem','jalg')}),
               StaticExtractionRule(pattern=('kops',), attributes={'match': ('kops',)}),
               StaticExtractionRule(pattern=('vasak','kops'), attributes={'match': ('vasak','kops')}),
               StaticExtractionRule(pattern=('parem','kops'), attributes={'match': ('parem','kops')}),
               StaticExtractionRule(pattern=('kõõlus',), attributes={'match': ('kõõlus',)}),
               StaticExtractionRule(pattern=('lihas',), attributes={'match': ('lihas',)}),
               StaticExtractionRule(pattern=('maks',), attributes={'match': ('maks',)}),
               StaticExtractionRule(pattern=('neer',), attributes={'match': ('neer',)}),
               StaticExtractionRule(pattern=('vasak','neer'), attributes={'match': ('vasak','neer')}),
               StaticExtractionRule(pattern=('parem','neer'), attributes={'match': ('parem','neer')}),
               StaticExtractionRule(pattern=('varvas',), attributes={'match': ('varvas',)}),
               StaticExtractionRule(pattern=('suur','varvas'), attributes={'match': ('suur','varvas')}),
              ]
'''
phrase_list = [StaticExtractionRule(pattern=('jalg',)),
               StaticExtractionRule(pattern=('vasak','jalg')),
               StaticExtractionRule(pattern=('parem','jalg')),
               StaticExtractionRule(pattern=('kops',)),
               StaticExtractionRule(pattern=('vasak','kops')),
               StaticExtractionRule(pattern=('parem','kops')),
               StaticExtractionRule(pattern=('kõõlus',)),
               StaticExtractionRule(pattern=('lihas',)),
               StaticExtractionRule(pattern=('maks',)),
               StaticExtractionRule(pattern=('neer',)),
               StaticExtractionRule(pattern=('vasak','neer')),
               StaticExtractionRule(pattern=('parem','neer')),
               StaticExtractionRule(pattern=('varvas',)),
               StaticExtractionRule(pattern=('suur','varvas')),
              ]
ruleset = Ruleset()
ruleset.add_rules(phrase_list)


latin_dict = {('suur', 'varvas'): 'hallux', ('kõõlus',):'tendo', ('kops',):'pulmo'}

def decorator(layer, span, annotation):
    print(annotation)
    annotation['latin_term'] = latin_dict.get(annotation['match'])
    return annotation


tagger = PhraseTagger(output_layer='body_parts',
                      input_layer='morph_analysis',
                      input_attribute='lemma',
                      ruleset=ruleset,
                      output_attributes=('match', 'latin_term'),
                      decorator=decorator,
                      conflict_resolver='KEEP_MAXIMAL',
                      ignore_case=True,
                      phrase_attribute='match')
tagger

name,output layer,output attributes,input layers
PhraseTagger,body_parts,"('match', 'latin_term')","('morph_analysis',)"

0,1
input_attribute,lemma
ruleset,<estnltk.taggers.system.rule_taggers.extraction_rules.ruleset.Ruleset object at 0x000001CE8498C9B0>
decorator,<function __main__.decorator>
ignore_case,True
conflict_resolver,KEEP_MAXIMAL
phrase_attribute,match
static_ruleset_map,"{('jalg',): [(0, 0, {})], ('vasak', 'jalg'): [(0, 0, {})], ('parem', 'jalg'): [( ..., type: <class 'dict'>, length: 14"
dynamic_ruleset_map,{}


In [9]:
from estnltk import Text

text_1 = Text('Patsient lasi jalga, sest vasaku jala suure varba pika '
              'painutajalihase kõõluse rebend ajas tal kopsu üle maksa.')
text_1.tag_layer('morph_analysis')

tagger.tag(text_1)
text_1['body_parts']

{'match': ('vasak', 'jalg')}
{'match': ('suur', 'varvas')}
{'match': ('kõõlus',)}
{'match': ('kops',)}


layer name,attributes,parent,enveloping,ambiguous,span count
body_parts,"match, latin_term",,morph_analysis,False,4

text,match,latin_term
"['vasaku', 'jala']","('vasak', 'jalg')",
"['suure', 'varba']","('suur', 'varvas')",hallux
['kõõluse'],"('kõõlus',)",tendo
['kopsu'],"('kops',)",pulmo


Note that 'maksa' is not tagged because Vabamorf assigns it the lemma 'maksma', which is not in the ruleset.
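
The output also contains no separate span for 'jala' or 'varba', although the ruleset has rules for `('jalg',)` and `('varvas',)`. This is consistent with the `KEEP_MAXIMAL` resolver configured above, which is assumed here to keep only spans that are not covered by a longer span. A minimal sketch of that idea follows. `keep_maximal` is a hypothetical helper, the spans are hand-written `(start, end)` word indices, and the actual resolver in EstNLTK may differ in its details.

```python
def keep_maximal(spans):
    """Keep only spans that are not contained in another, different span.

    Each span is a (start, end) pair of word indices, end exclusive.
    """
    unique = sorted(set(spans))
    return [
        span for span in unique
        if not any(other != span
                   and other[0] <= span[0] and span[1] <= other[1]
                   for other in unique)
    ]


# Hand-written candidates for 'vasaku jala suure varba' (words 5 to 8):
candidates = [
    (5, 7),  # ('vasak', 'jalg')
    (6, 7),  # ('jalg',) inside 'vasaku jala'
    (7, 9),  # ('suur', 'varvas')
    (8, 9),  # ('varvas',) inside 'suure varba'
]

maximal = keep_maximal(candidates)
print(maximal)
```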