# PhraseTagger
`PhraseTagger` tags sequences of consecutive attribute values in a layer. The result is an enveloping layer.
Let's create a `Text` object with a `morph_analysis` layer.

In [1]:
from estnltk import Text

text = Text('Eestimaal tunnevad inimesed palju puudust päikesest ja energiast.').tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form...",words,,True,9

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Eestimaal,Eestimaa,Eesti_maa,"('Eesti', 'maa')",l,,sg ad,H
tunnevad,tundma,tund,"('tund',)",vad,,vad,V
inimesed,inimene,inimene,"('inimene',)",d,,pl n,S
palju,palju,palju,"('palju',)",0,,,D
puudust,puudus,puudus,"('puudus',)",t,,sg p,S
päikesest,päikene,päikene,"('päikene',)",st,,sg el,S
,päike,päike,"('päike',)",st,,sg el,S
ja,ja,ja,"('ja',)",0,,,J
energiast,energia,energia,"('energia',)",st,,sg el,S
.,.,.,"('.',)",,,,Z


Use `PhraseTagger` to tag sequences of lemmas in that text. The lemma sequences are read from a vocabulary file, parsed here with `_phrase_` as the key.

In [2]:
from estnltk.taggers import Vocabulary

vocabulary_file = 'phrase_vocabulary.csv'
Vocabulary.parse(vocabulary=vocabulary_file, key='_phrase_')

_phrase_,value
"('päike',)",P
"('tundma', 'inimene')",TI_1
,TI_2
"('tundma', 'inimene', 'palju')",TIP

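The parsed table lists four rows but only three distinct phrases, because `('tundma', 'inimene')` carries two interpretations. The sketch below illustrates that grouping in plain Python. The `records` list and the `group_by_key` helper are hypothetical stand-ins written for this illustration; they are not part of EstNLTK, and the actual layout of `phrase_vocabulary.csv` is not shown here.

```python
from collections import defaultdict

# Hypothetical in-memory records mirroring the parsed table above.
records = [
    {'_phrase_': ('päike',), 'value': 'P'},
    {'_phrase_': ('tundma', 'inimene'), 'value': 'TI_1'},
    {'_phrase_': ('tundma', 'inimene'), 'value': 'TI_2'},
    {'_phrase_': ('tundma', 'inimene', 'palju'), 'value': 'TIP'},
]


def group_by_key(records, key):
    """Group records by the value stored under `key` (illustrative helper)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record)
    return dict(grouped)


grouped = group_by_key(records, '_phrase_')
values_for_ti = [r['value'] for r in grouped[('tundma', 'inimene')]]

print(len(grouped))    # number of distinct phrases
print(values_for_ti)   # interpretations of ('tundma', 'inimene')
```

A phrase with several records is what later produces an ambiguous span in the output layer.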

In [3]:
from estnltk.taggers import PhraseTagger


def decorator(span, raw_text):
    return {'attr_1': 'default_1', 'attr_2': len(span), '_priority_': 1}


def validator(span, raw_text):
    return True

tagger = PhraseTagger(output_layer='phrases',
                      input_layer='morph_analysis',
                      input_attribute='lemma',
                      vocabulary=vocabulary_file,
                      key='_phrase_',
                      output_attributes=['value', '_priority_', 'attr_1', 'attr_2', '_phrase_'],
                      global_validator=validator,
                      validator_attribute='_validator_',
                      decorator=decorator,
                      conflict_resolving_strategy='ALL',
                      priority_attribute='_priority_',
                      output_ambiguous=True)
tagger

name,output layer,output attributes,input layers
PhraseTagger,phrases,"('value', '_priority_', 'attr_1', 'attr_2', '_phrase_')","('morph_analysis',)"

0,1
input_attribute,lemma
vocabulary,"Vocabulary(key='_phrase_', len=3)"
global_validator,<function __main__.validator>
validator_attribute,_validator_
decorator,<function __main__.decorator>
conflict_resolving_strategy,ALL
priority_attribute,_priority_
output_ambiguous,True
ignore_case,False


**output_layer** - name of the output layer<br>
**input_layer** - name of the input layer<br>
**input_attribute** - name of the input layer attribute<br>
**vocabulary** - input vocabulary: `str`, `list`, `dict` or `Vocabulary`<br>
**key** - vocabulary key<br>
**output_attributes** - list of output layer attributes<br>
**global_validator** - global validator function that takes two arguments (span and raw text) and returns a bool<br>
**validator_attribute** - name of the vocabulary attribute that points to the validator function<br>
**decorator** - decorator function that takes two arguments (span and raw text) and returns a dict of attribute names and their values; these values overwrite the ones from the vocabulary<br>
**conflict_resolving_strategy** - strategy for resolving conflicts between overlapping spans<br>
**priority_attribute** - name of the priority attribute in the vocabulary<br>
**output_ambiguous** - whether the output layer is ambiguous<br>
**ignore_case** - whether phrases are matched case-insensitively; `False` unless set, as the tagger summary above shows<br>
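
To make the roles of the validators and the decorator concrete, here is a minimal plain-Python sketch of how one candidate span might be turned into an annotation. It is an illustration under stated assumptions, not EstNLTK's implementation: `build_annotation` is a hypothetical helper, the order of the two validator checks is assumed, the span is simplified to a list of word strings, and the vocabulary priority of 5 is invented only to show the decorator's value replacing it.

```python
def default_validator(span, raw_text):
    # Hypothetical stand-in: accept every candidate span.
    return True


def decorator(span, raw_text):
    # Same shape as the decorator defined in the cell above.
    return {'attr_1': 'default_1', 'attr_2': len(span), '_priority_': 1}


def build_annotation(span, raw_text, record, global_validator, decorator,
                     validator_attribute, output_attributes):
    """Return the annotation for one candidate span, or None if rejected."""
    # Assumed order: the global validator first, then the record's own validator.
    if not global_validator(span, raw_text):
        return None
    if not record[validator_attribute](span, raw_text):
        return None
    merged = dict(record)
    merged.update(decorator(span, raw_text))  # decorator values win
    return {name: merged.get(name) for name in output_attributes}


raw_text = 'Eestimaal tunnevad inimesed palju puudust päikesest ja energiast.'
span = ['tunnevad', 'inimesed']  # simplified stand-in for an enveloping span
record = {'_phrase_': ('tundma', 'inimene'), 'value': 'TI_1',
          '_priority_': 5,  # invented value, replaced by the decorator
          '_validator_': default_validator}

annotation = build_annotation(
    span, raw_text, record,
    global_validator=default_validator,
    decorator=decorator,
    validator_attribute='_validator_',
    output_attributes=['value', '_priority_', 'attr_1', 'attr_2', '_phrase_'],
)
print(annotation)
```

Only the names listed in `output_attributes` survive in the result, which is why `_validator_` does not appear in the output layer.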

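The vocabulary read from the CSV file looks like this. The file defines no validators, so every record points to the default validator under `_validator_`.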

In [4]:
tagger.vocabulary

_phrase_,value,_validator_
"('päike',)",P,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>
"('tundma', 'inimene')",TI_1,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>
,TI_2,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>
"('tundma', 'inimene', 'palju')",TIP,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>


The single lemma 'päike' is tagged. The consecutive lemmas 'tundma', 'inimene' are tagged twice, because the vocabulary contains two interpretations of that phrase. The sequence of lemmas 'tundma', 'inimene', 'palju' is tagged as well:

In [5]:
tagger.tag(text)
text.phrases

layer name,attributes,parent,enveloping,ambiguous,span count
phrases,"value, _priority_, attr_1, attr_2, _phrase_",,morph_analysis,True,3

text,value,_priority_,attr_1,attr_2,_phrase_
tunnevad inimesed,TI_1,1,default_1,2,"('tundma', 'inimene')"
,TI_2,1,default_1,2,"('tundma', 'inimene')"
tunnevad inimesed palju,TIP,1,default_1,3,"('tundma', 'inimene', 'palju')"
päikesest,P,1,default_1,1,"('päike',)"
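
The matching step behind this output can be sketched in plain Python. The sketch assumes that a word with several analyses (here `päikesest`, with lemmas `päikene` and `päike`) matches a phrase element if any of its lemmas equals that element. `find_phrases` is a hypothetical helper written for this illustration and the lemma sets are copied by hand from the `morph_analysis` table; this is not how EstNLTK implements the search.

```python
def find_phrases(lemma_sets, phrases):
    """Return (start, end, phrase) for every phrase matching consecutive words.

    lemma_sets[i] holds all candidate lemmas of word i. A phrase matches at
    `start` if each element is among the candidates of the corresponding word.
    """
    matches = []
    for start in range(len(lemma_sets)):
        for phrase in phrases:
            end = start + len(phrase)
            if end > len(lemma_sets):
                continue  # phrase would run past the end of the text
            if all(lemma in lemma_sets[start + i]
                   for i, lemma in enumerate(phrase)):
                matches.append((start, end, phrase))
    return matches


# Candidate lemmas per word, taken from the morph_analysis layer above.
lemma_sets = [
    {'Eestimaa'}, {'tundma'}, {'inimene'}, {'palju'}, {'puudus'},
    {'päikene', 'päike'}, {'ja'}, {'energia'}, {'.'},
]
phrases = [('päike',), ('tundma', 'inimene'), ('tundma', 'inimene', 'palju')]

matches = find_phrases(lemma_sets, phrases)
for start, end, phrase in matches:
    print(start, end, phrase)
```

The three matches correspond to the three spans of the `phrases` layer; the two interpretations `TI_1` and `TI_2` share one span, which is why the layer is ambiguous.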


Conflict resolution keeps all of the spans because their priorities are equal.

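In the next two examples, conflicts are resolved only by priority. The first example is case sensitive, so the phrase `('SUUR', 'ja', 'väike')` does not match and only `väike` is tagged. The second example ignores case, so both phrases match; the spans overlap, and the one with priority 0 is kept.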

In [6]:
text = Text('Suur ja väike.').tag_layer(['words'])
size_tagger = PhraseTagger(input_layer='words',
                           output_layer='size',
                           output_ambiguous=True,
                           input_attribute='text',
                           output_attributes=[],
                           key='_phrase_',
                           priority_attribute='_priority_',
                           conflict_resolving_strategy='ALL',
                           vocabulary=[{'_priority_': 0, '_phrase_': ('SUUR', 'ja', 'väike')},
                                       {'_priority_': 1, '_phrase_': ('väike',)}],
                           ignore_case=False  # 'SUUR' does not match 'Suur'
                           )
size_tagger.tag(text)
text.size

layer name,attributes,parent,enveloping,ambiguous,span count
size,_priority_,,words,True,1

text,_priority_
väike,1


In [7]:
text = Text('Suur ja väike.').tag_layer(['words'])
size_tagger = PhraseTagger(input_layer='words',
                           output_layer='size',
                           output_ambiguous=True,
                           input_attribute='text',
                           output_attributes=[],
                           key='_phrase_',
                           priority_attribute='_priority_',
                           conflict_resolving_strategy='ALL',
                           vocabulary=[{'_priority_': 0, '_phrase_': ('SUUR', 'ja', 'väike')},
                                       {'_priority_': 1, '_phrase_': ('väike',)}],
                           ignore_case=True  # 'SUUR' now matches 'Suur'
                           )
size_tagger.tag(text)
text.size

layer name,attributes,parent,enveloping,ambiguous,span count
size,_priority_,,words,True,1

text,_priority_
Suur ja väike,0
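
The two results above can be reproduced with a small plain-Python sketch of case handling and priority-based conflict resolution. It rests on assumptions inferred from the outputs: a smaller `_priority_` value wins, and a span is dropped only when it overlaps a span with a strictly smaller value. `tag_sizes` is a hypothetical helper and the word list is a hand-written stand-in for the `words` layer; none of this is EstNLTK's actual code.

```python
def tag_sizes(words, vocabulary, ignore_case):
    """Match phrases against words and resolve overlaps by priority."""
    normalise = str.lower if ignore_case else (lambda s: s)
    tokens = [normalise(w) for w in words]

    # Collect every (start, end, priority) where a phrase matches.
    candidates = []
    for record in vocabulary:
        phrase = tuple(normalise(p) for p in record['_phrase_'])
        n = len(phrase)
        for start in range(len(tokens) - n + 1):
            if tuple(tokens[start:start + n]) == phrase:
                candidates.append((start, start + n, record['_priority_']))

    # Keep a span unless an overlapping span has a strictly smaller priority.
    kept = [
        (s, e, p) for (s, e, p) in candidates
        if not any(s < e2 and s2 < e and p2 < p
                   for (s2, e2, p2) in candidates)
    ]
    return [(' '.join(words[s:e]), p) for (s, e, p) in kept]


words = ['Suur', 'ja', 'väike', '.']  # assumed tokenisation of the text
vocabulary = [{'_priority_': 0, '_phrase_': ('SUUR', 'ja', 'väike')},
              {'_priority_': 1, '_phrase_': ('väike',)}]

case_sensitive = tag_sizes(words, vocabulary, ignore_case=False)
case_insensitive = tag_sizes(words, vocabulary, ignore_case=True)
print(case_sensitive)
print(case_insensitive)
```

With equal priorities the overlap test never fires, which is consistent with the earlier `phrases` layer keeping all of its spans.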
