# SpanTagger

**SpanTagger** tags spans on a pre-annotated layer of a **Text** object. For example, to tag by lemma, specify 'morph_analysis' as input_layer and 'lemma' as input_attribute. The tagger also needs a ruleset that lists the tokens to tag and the attributes to attach to them. In this example, the ruleset is loaded from the file *span_vocabulary.csv*.

An AmbiguousRuleset is used because the same pattern appears in several rules with different values.

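The difference between a ruleset with unique patterns and an ambiguous one can be pictured with plain Python containers. This is only a conceptual sketch, not EstNLTK's implementation; the patterns and attribute values below are hypothetical and are not taken from *span_vocabulary.csv*.

```python
from collections import defaultdict

# Hypothetical rules as (pattern, attributes) pairs; illustrative values only.
rules = [
    ('tundma', {'value': 'T', '_priority_': 1}),
    ('inimene', {'value': 'K', '_priority_': 2}),
    ('inimene', {'value': 'I', '_priority_': 3}),
]

# A plain dict keeps only the last rule for a repeated pattern.
unique = dict(rules)

# Grouping by pattern keeps every rule, which is what an ambiguous ruleset needs.
ambiguous = defaultdict(list)
for pattern, attributes in rules:
    ambiguous[pattern].append(attributes)

print(unique['inimene'])
print(ambiguous['inimene'])
```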
In [1]:
from estnltk.taggers.system.rule_taggers import AmbiguousRuleset, SpanTagger

vocabulary_file = 'span_vocabulary.csv'
ruleset = AmbiguousRuleset()
ruleset.load(file_name=vocabulary_file, key_column='_token_')

In [2]:
tagger = SpanTagger(output_layer='tagged_tokens',
                    input_layer='morph_analysis',
                    input_attribute='lemma',
                    ruleset=ruleset,
                    output_attributes=['value', '_priority_'], # default: None
                    )

Let's create the **Text** object with the layer that we want to tag on, and then tag the spans:

In [3]:
from estnltk import Text
text = Text('Eestimaal tunnevad inimesed palju puudust päikesest ja energiast.').tag_layer(['morph_analysis'])

In [4]:
tagger.tag(text)
text.tagged_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
tagged_tokens,"value, _priority_",morph_analysis,,True,3

text,value,_priority_
tunnevad,T,1
inimesed,K,2
,I,3
päikesest,P,2

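The output has three spans but four rows because the span 'inimesed' carries two annotations. The sketch below imitates that lookup in plain Python, assuming the vocabulary is keyed by lemma. The lemma list and the vocabulary are hand-written stand-ins (hypothetical, not produced by EstNLTK and not read from the CSV file), and `tag_by_lemma` is an illustrative helper rather than part of the library.

```python
# Hypothetical stand-in for the 'morph_analysis' layer: (surface text, lemma) pairs.
tokens = [
    ('Eestimaal', 'Eestimaa'), ('tunnevad', 'tundma'), ('inimesed', 'inimene'),
    ('palju', 'palju'), ('puudust', 'puudus'), ('päikesest', 'päike'),
    ('ja', 'ja'), ('energiast', 'energia'), ('.', '.'),
]

# Hypothetical vocabulary keyed by lemma; each entry is a list of (value, priority).
vocabulary = {
    'tundma': [('T', 1)],
    'inimene': [('K', 2), ('I', 3)],
    'päike': [('P', 2)],
}

def tag_by_lemma(tokens, vocabulary):
    """Return (surface text, annotations) for each token whose lemma is listed."""
    return [(text, vocabulary[lemma]) for text, lemma in tokens if lemma in vocabulary]

spans = tag_by_lemma(tokens, vocabulary)
for text, annotations in spans:
    print(text, annotations)
```

The lookup key is the lemma, while the reported span text is the surface form, which is why 'tunnevad' appears in the output even though a rule in this sketch would be written for 'tundma'.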

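A ruleset can also be constructed manually. Without ignore_case, matching is case sensitive, so the pattern 'SUUR' below does not match the word 'Suur'.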

In [5]:
from estnltk.taggers.system.rule_taggers import Ruleset, StaticExtractionRule

text = Text('Suur ja väike.').tag_layer(['words'])
ruleset = Ruleset()
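# Two rules without attributes: each one only marks the matching word
ruleset.add_rules([StaticExtractionRule(pattern='SUUR'),
                   StaticExtractionRule(pattern='väike')])
# Match rule patterns against the 'text' attribute of the 'words' layer
size_tagger = SpanTagger(input_layer='words',
                         output_layer='size',
                         input_attribute='text',
                         output_attributes=[],
                         ruleset=ruleset)
size_tagger.tag(text)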
text.size

layer name,attributes,parent,enveloping,ambiguous,span count
size,,words,,False,1

text
väike


The next example reuses the same ruleset but sets ignore_case=True, so 'SUUR' now matches 'Suur' as well.

In [6]:
text = Text('Suur ja väike.').tag_layer(['words'])
# Same ruleset as above, now matched without regard to letter case
size_tagger = SpanTagger(input_layer='words',
                         output_layer='size',
                         input_attribute='text',
                         output_attributes=[],
                         ruleset=ruleset,
                         ignore_case=True
                         )
size_tagger.tag(text)
text.size

layer name,attributes,parent,enveloping,ambiguous,span count
size,,words,,False,2

text
Suur
väike

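The effect of ignore_case on the two examples above can be reproduced with a small stand-alone function. This is a sketch under the assumption that case-insensitive matching amounts to comparing lowercased strings; `match_words` is a hypothetical helper, not an EstNLTK function.

```python
def match_words(words, patterns, ignore_case=False):
    """Return the words equal to one of the patterns, optionally ignoring case."""
    if ignore_case:
        # Assumption: lowercasing both sides is enough for case-insensitive matching.
        wanted = {p.lower() for p in patterns}
        return [w for w in words if w.lower() in wanted]
    wanted = set(patterns)
    return [w for w in words if w in wanted]

words = ['Suur', 'ja', 'väike', '.']
patterns = ['SUUR', 'väike']

case_sensitive = match_words(words, patterns)
case_insensitive = match_words(words, patterns, ignore_case=True)

print(case_sensitive)
print(case_insensitive)
```

The matched words are returned in their original form, so the case-insensitive run reports 'Suur' rather than the pattern 'SUUR', as in the layer output above.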