# Raw text taggers

## RegexTagger

A RegexTagger is initialised with a vocabulary. The `vocabulary` argument may be the name of a CSV file:

In [1]:
vocabulary = 'raw_text_taggers/vocabulary.csv'

a pandas DataFrame:

In [2]:
from pandas import read_csv
vocabulary = read_csv(vocabulary, na_filter=False, index_col=False)
vocabulary

| Unnamed: 0 | `_regex_pattern_` | `_group_` | `_priority_` | normalized | comment | example |
|---|---|---|---|---|---|---|
| 0 | `-?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?` | 0 | 1 | `lambda m: re.sub('[\s\.]' ,'' , m.group(0))` | number | -34 567 000 123 , 456 |
| 1 | `([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)` | 1 | 2 | `lambda m: None` | e-mail | bla@bla.bl |


or a list of dicts:

In [3]:
vocabulary = vocabulary.to_dict('records')
vocabulary

[{'_group_': 0,
  '_priority_': 1,
  '_regex_pattern_': '-?(\\d[\\s\\.]?)+(,\\s?(\\d[\\s\\.]?)+)?',
  'comment': 'number',
  'example': '-34 567 000 123 , 456',
  'normalized': "lambda m: re.sub('[\\s\\.]' ,'' , m.group(0))"},
 {'_group_': 1,
  '_priority_': 2,
  '_regex_pattern_': '([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)',
  'comment': 'e-mail',
  'example': 'bla@bla.bl',
  'normalized': 'lambda m: None'}]

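Each vocabulary entry must define a **term**, a **group** and a **priority**; in the example above these are stored under the `_regex_pattern_`, `_group_` and `_priority_` keys. The banned keywords are **start** and **end**.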
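The key requirements can be checked before a vocabulary is handed to the tagger. The following is a minimal sketch, not RegexTagger's own validation: `check_vocabulary` is a hypothetical helper, and the required key names are an assumption taken from the example records above.

```python
# Assumption: required key names are copied from the example vocabulary records.
REQUIRED_KEYS = {'_regex_pattern_', '_group_', '_priority_'}
BANNED_KEYS = {'start', 'end'}


def check_vocabulary(records):
    """Return a list of problems found in a list-of-dicts vocabulary."""
    problems = []
    for index, record in enumerate(records):
        keys = set(record)
        missing = REQUIRED_KEYS - keys
        banned = BANNED_KEYS & keys
        if missing:
            problems.append(f'record {index}: missing keys {sorted(missing)}')
        if banned:
            problems.append(f'record {index}: banned keys {sorted(banned)}')
    return problems


good_record = {'_regex_pattern_': r'\d+', '_group_': 0, '_priority_': 1,
               'normalized': 'lambda m: None'}
bad_record = {'_regex_pattern_': r'\d+', '_group_': 0, 'start': 3}

good_problems = check_vocabulary([good_record])
bad_problems = check_vocabulary([bad_record])
```

Here `good_problems` is empty, while `bad_problems` reports one missing key and one banned key for the second record.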

**term** is a regular expression, **group** is the index of the capture group whose match is tagged in the text (0 means the whole match), and **priority** is used to resolve conflicts.
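The effect of **group** can be illustrated with plain `re` match objects. This is only a sketch: the pattern and the sample sentence are hypothetical and not part of the vocabulary above, and it assumes the group index is interpreted the same way as in `Match.span(group)`.

```python
import re

# Hypothetical pattern: a number followed by a unit; the number is capture group 1.
pattern = re.compile(r'(\d+)\s*kg')
text = 'Kott kaalub 25 kg.'
match = pattern.search(text)

# Group 0 covers the whole match, group 1 only the captured number.
whole_start, whole_end = match.span(0)
number_start, number_end = match.span(1)

whole_text = text[whole_start:whole_end]      # '25 kg'
number_text = text[number_start:number_end]   # '25'
```

With group 0 the tagged span would cover `25 kg`; with group 1 it would cover only `25`.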

If spans intersect, the span with the smaller priority is removed; if the priorities are equal, the span with the bigger start or end is removed.

Any string value (other than the term itself) that starts with `lambda m:` is evaluated as a lambda function whose argument `m` is the match object. The function should return the value of the corresponding attribute.
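How such a string could be turned into an attribute value can be sketched with the standard library alone. `build_attribute_getter` is a hypothetical helper, and the assumption that the string is evaluated with `eval` and has access to `re` is an illustration, not a description of RegexTagger's internals. The normalizer below is a variant of the one in the vocabulary, written with a raw-string pattern.

```python
import re


def build_attribute_getter(value):
    """Return a function of the match object for one vocabulary value."""
    if isinstance(value, str) and value.startswith('lambda m:'):
        # Assumption: the string is evaluated as Python source with `re` visible.
        # eval is unsafe on untrusted input; use it only on trusted vocabularies.
        return eval(value, {'re': re})
    # Any other value is treated as a constant attribute value.
    return lambda m: value


number_pattern = re.compile(r'-?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?')
normalizer_source = "lambda m: re.sub(r'[\\s.]', '', m.group(0))"

normalize = build_attribute_getter(normalizer_source)
match = number_pattern.search('tuli 10 456 kirja')

normalized = normalize(match)                              # '10456'
constant = build_attribute_getter('number')(match)         # 'number'
nothing = build_attribute_getter('lambda m: None')(match)  # None
```

The number match includes the spaces between digit groups, and the normalizer strips them, which corresponds to the `normalized` value `'10456'` in the tagger output.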

In [4]:
from estnltk.taggers import RegexTagger
from estnltk import Text

# Tag vocabulary matches on a new layer and keep the 'normalized' attribute.
tokenization_hints_tagger = RegexTagger(vocabulary=vocabulary,
                                        attributes={'normalized'},
                                        layer_name='tokenization_hints')

text = Text('Aadressilt bla@bla.ee tuli 10 456 kirja aadressile foo@foo.ee 10 tunni jooksul.')

# The tagger reports the number of conflicts in the status dict.
status = {}
tokenization_hints_tagger.tag(text, status)
print(status)
print(tokenization_hints_tagger._number_of_conflicts)
# Display the spans of the new layer.
text.tokenization_hints

{'conflicts': 0}
0


SL[Span(bla@bla.ee, {'normalized': None}),
Span(10 456 , {'normalized': '10456'}),
Span(foo@foo.ee, {'normalized': None}),
Span(10 , {'normalized': '10'})]