# PhraseTagger
`PhraseTagger` tags sequences of attribute values in a layer. The result is an enveloping layer.
Let's create a `Text` object with a `morph_analysis` layer.

In [1]:
from estnltk import Text

text = Text('Eestimaal tunnevad inimesed palju puudust päikesest ja energiast.').tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Eestimaal,Eestimaa,Eesti_maa,"(Eesti, maa)",l,,sg ad,H
tunnevad,tundma,tund,"(tund,)",vad,,vad,V
inimesed,inimene,inimene,"(inimene,)",d,,pl n,S
palju,palju,palju,"(palju,)",0,,,D
puudust,puudus,puudus,"(puudus,)",t,,sg p,S
päikesest,päikene,päikene,"(päikene,)",st,,sg el,S
,päike,päike,"(päike,)",st,,sg el,S
ja,ja,ja,"(ja,)",0,,,J
energiast,energia,energia,"(energia,)",st,,sg el,S
.,.,.,"(.,)",,,,Z


Use `PhraseTagger` to tag sequences of lemmas in that text. The lemma sequences are read from a vocabulary file.

In [2]:
from estnltk.taggers import Vocabulary

vocabulary_file = 'phrase_vocabulary.csv'
Vocabulary(vocabulary=vocabulary_file, key='_phrase_')

_phrase_,value,_priority_
"('päike',)",P,0
"('tundma', 'inimene')",TI_1,0
,TI_2,0
"('tundma', 'inimene', 'palju')",TIP,0


In [3]:
from estnltk.taggers import PhraseTagger

tagger = PhraseTagger(output_layer='phrases',
                      input_layer='morph_analysis',
                      input_attribute='lemma',
                      vocabulary=vocabulary_file,
                      output_attributes=['value', '_priority_'],
                      conflict_resolving_strategy='ALL',
                      priority_attribute='_priority_')
tagger

name,output layer,output attributes,input layers
PhraseTagger,phrases,"['value', '_priority_']",['morph_analysis']

0,1
input_attribute,lemma
global_validator,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>
validator_attribute,_validator_
conflict_resolving_strategy,ALL
priority_attribute,_priority_
ambiguous,False


The vocabulary, parsed into a dict, looks like this:

In [4]:
tagger._vocabulary

_phrase_,value,_priority_,_validator_
"('päike',)",P,0,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>
"('tundma', 'inimene')",TI_1,0,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>
,TI_2,0,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>
"('tundma', 'inimene', 'palju')",TIP,0,<function estnltk.taggers.dict_taggers.phrase_tagger.default_validator>


The tagger can be initialized with the vocabulary either as a dict or as a file name.
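To illustrate what a dict form of this vocabulary could look like, the sketch below builds a plain Python mapping with the same content as the table above: each key is a tuple of lemmas and each value is a list of records, so an ambiguous phrase simply has more than one record. The exact dict layout that `PhraseTagger` accepts is an assumption here, not something this tutorial confirms, so check it against your EstNLTK version before passing it to the tagger.

```python
from collections import defaultdict

# Rows mirroring the vocabulary table above: (phrase, value, priority).
rows = [
    (('päike',), 'P', 0),
    (('tundma', 'inimene'), 'TI_1', 0),
    (('tundma', 'inimene'), 'TI_2', 0),
    (('tundma', 'inimene', 'palju'), 'TIP', 0),
]

# Group the records by phrase. This layout (tuple key -> list of record
# dicts) is an assumed illustration, not a confirmed EstNLTK format.
grouped = defaultdict(list)
for phrase, value, priority in rows:
    grouped[phrase].append({'_phrase_': phrase,
                            'value': value,
                            '_priority_': priority})
vocabulary_dict = dict(grouped)

for phrase, records in vocabulary_dict.items():
    print(phrase, [record['value'] for record in records])
```

Keeping a list of records per key is what lets the phrase `('tundma', 'inimene')` carry both `TI_1` and `TI_2`.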

The single lemma 'päike' is tagged once, the lemma sequence 'tundma', 'inimene' is tagged twice because the vocabulary contains two interpretations of that phrase, and the sequence 'tundma', 'inimene', 'palju' is tagged as well:

In [5]:
tagger.tag(text)
text.phrases

layer name,attributes,parent,enveloping,ambiguous,span count
phrases,"value, _priority_",,morph_analysis,False,4

text,value,_priority_
tunnevad inimesed,TI_1,0
tunnevad inimesed,TI_2,0
tunnevad inimesed palju,TIP,0
päikesest,P,0


No spans are removed during conflict resolution: all priorities are equal and the conflict resolving strategy is `'ALL'`.
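The matching behaviour seen above can be imitated in a few lines of plain Python. This is only a simplified sketch and not the EstNLTK implementation: the helper `match_phrases` and the data structures are hypothetical, and validators, priorities and conflict resolution are left out. Each word is represented by the set of its possible lemmas, so the ambiguous word 'päikesest' matches the phrase `('päike',)` through one of its two analyses.

```python
# Surface forms and lemma sets copied from the morph_analysis table above.
words = ['Eestimaal', 'tunnevad', 'inimesed', 'palju', 'puudust',
         'päikesest', 'ja', 'energiast', '.']
lemma_sets = [{'Eestimaa'}, {'tundma'}, {'inimene'}, {'palju'}, {'puudus'},
              {'päikene', 'päike'}, {'ja'}, {'energia'}, {'.'}]

# Simplified vocabulary: phrase (tuple of lemmas) -> list of values.
vocabulary = {
    ('päike',): ['P'],
    ('tundma', 'inimene'): ['TI_1', 'TI_2'],
    ('tundma', 'inimene', 'palju'): ['TIP'],
}


def match_phrases(lemma_sets, vocabulary):
    """Return sorted (start, end, value) triples for every phrase match."""
    matches = []
    for start in range(len(lemma_sets)):
        for phrase, values in vocabulary.items():
            end = start + len(phrase)
            if end > len(lemma_sets):
                continue  # the phrase would run past the end of the text
            # Every lemma of the phrase must be among the word's analyses.
            if all(lemma in lemma_sets[start + offset]
                   for offset, lemma in enumerate(phrase)):
                matches.extend((start, end, value) for value in values)
    return sorted(matches)


results = [(' '.join(words[start:end]), value)
           for start, end, value in match_phrases(lemma_sets, vocabulary)]
for text_span, value in results:
    print(text_span, value)
```

Run on the example sentence, the sketch yields the same four phrases as the `phrases` layer: 'tunnevad inimesed' twice, 'tunnevad inimesed palju', and 'päikesest'.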