# Raw text taggers

## RegexTagger

RegexTagger is initialised with a vocabulary. The vocabulary argument may be the name of a csv file,

In [1]:
vocabulary = 'raw_text_taggers/vocabulary.csv'

a pandas DataFrame

In [2]:
from pandas import read_csv
vocabulary = read_csv(vocabulary, na_filter=False, index_col=False)
vocabulary

| | `_regex_pattern_` | `_group_` | `_priority_` | normalized | comment | example |
|---|---|---|---|---|---|---|
| 0 | `-?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?` | 0 | 1 | `lambda m: re.sub('[\s\.]' ,'' , m.group(0))` | number | -34 567 000 123 , 456 |
| 1 | `([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)` | 1 | 2 | `lambda m: None` | e-mail | bla@bla.bl |


or a list of dicts

In [3]:
vocabulary = vocabulary.to_dict('records')
vocabulary

[{'_group_': 0,
  '_priority_': 1,
  '_regex_pattern_': '-?(\\d[\\s\\.]?)+(,\\s?(\\d[\\s\\.]?)+)?',
  'comment': 'number',
  'example': '-34 567 000 123 , 456',
  'normalized': "lambda m: re.sub('[\\s\\.]' ,'' , m.group(0))"},
 {'_group_': 1,
  '_priority_': 2,
  '_regex_pattern_': '([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)',
  'comment': 'e-mail',
  'example': 'bla@bla.bl',
  'normalized': 'lambda m: None'}]

Each vocabulary record must contain the **\_regex\_pattern\_** keyword. The **\_group\_** and **\_priority\_** keywords are optional and default to zero. The keywords **start** and **end** are banned.
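The keyword rules above can be illustrated with a small check. This is only a sketch of the described behaviour, not RegexTagger's own code; the helper name `with_defaults` and the error messages are hypothetical.

```python
def with_defaults(record):
    """Return a copy of a vocabulary record with the default keywords filled in.

    Hypothetical helper: it only mirrors the rules stated in the text.
    """
    if '_regex_pattern_' not in record:
        raise ValueError("'_regex_pattern_' is required")
    banned = {'start', 'end'} & set(record)
    if banned:
        raise ValueError('banned keywords: {}'.format(sorted(banned)))
    completed = {'_group_': 0, '_priority_': 0}  # defaults
    completed.update(record)                     # explicit values win
    return completed


record = with_defaults({'_regex_pattern_': r'\d+', 'comment': 'number'})
```

Here `record` contains the two default keywords in addition to the ones given explicitly.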

**\_regex\_pattern\_** is a regular expression given as a string or a compiled regex pattern. **\_group\_** is an integer that determines which group of the pattern is tagged in the text. **\_priority\_** is used to resolve conflicts. Priorities can be numbers or any other comparable values: a smaller value means a higher priority and a bigger value a lower priority.
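To see what choosing a group means, the standard `re` module is enough. The pattern and the sentence below are made up for illustration and are not part of the vocabulary above.

```python
import re

# Hypothetical pattern: a number followed by the unit 'kg'.
pattern = re.compile(r'(\d+)\s*kg')
text = 'Pakk kaalub 12 kg.'

m = pattern.search(text)
whole_match = m.span(0)   # group 0: the whole match, '12 kg'
number_only = m.span(1)   # group 1: only the number, '12'
```

With group 0 the tagged span would cover `12 kg`; with group 1 it would cover only `12`.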

When spans intersect, the span with the lower priority is removed. If the priorities are equal, the conflict is resolved according to the chosen conflict resolving strategy.
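The priority rule can be sketched in plain Python. This is an illustration under simplifying assumptions (spans are dicts with `start`, `end` and `_priority_`), not the algorithm RegexTagger actually uses; the function name `drop_lower_priority` is hypothetical.

```python
def drop_lower_priority(spans):
    """Keep a span unless it intersects a span with a smaller priority value."""
    kept = []
    for i, span in enumerate(spans):
        beaten = any(
            j != i
            and other['start'] < span['end']
            and span['start'] < other['end']            # the spans intersect
            and other['_priority_'] < span['_priority_']  # other has higher priority
            for j, other in enumerate(spans)
        )
        if not beaten:
            kept.append(span)
    return kept


spans = [
    {'start': 0, 'end': 6, '_priority_': 1},
    {'start': 3, 'end': 10, '_priority_': 2},   # intersects the first span
    {'start': 12, 'end': 15, '_priority_': 2},  # intersects nothing
]
kept = drop_lower_priority(spans)
```

The second span is removed because it intersects a span with a smaller priority value; the third span survives because it has no conflict.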

Any callable value in the vocabulary is assumed to take one argument m, the match object, and to return the value of the corresponding attribute.
Any string value (except the values of \_regex\_pattern\_, \_group\_ and \_priority\_) that starts with 'lambda m:' is compiled into a function. That function should return the value of the corresponding attribute. This is also a way to express Python objects in the csv file. For example,
```python
lambda m: re.sub('[\s\.]' ,'' , m.group(0)) # remove whitespace and periods from the match
lambda m: 'lambda m: ...' # represent a string that starts with 'lambda m:'
lambda m: None # represent None
```
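One plausible way to turn such strings into functions is `eval` with the `re` module in scope. The sketch below is an assumption about the mechanism, not RegexTagger's actual code, and `compile_value` is a hypothetical name. Since `eval` executes arbitrary code, a vocabulary handled this way should come from a trusted source.

```python
import re


def compile_value(value):
    """Compile strings starting with 'lambda m:'; return other values unchanged."""
    if isinstance(value, str) and value.startswith('lambda m:'):
        return eval(value, {'re': re})  # make `re` available inside the lambda
    return value


normalizer = compile_value(r"lambda m: re.sub(r'[\s.]', '', m.group(0))")
to_none = compile_value('lambda m: None')
plain = compile_value('number')  # not a lambda string, stays a string

m = re.search(r'\d[\d\s.]*\d', 'Kokku 10 456 kirja')
normalized = normalizer(m)
```

In this run `normalized` is the string `'10456'`, `to_none(m)` is `None`, and `plain` is still `'number'`.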

**attributes** is a set of the names of the attributes to be annotated in the layer in addition to 'start' and 'end'. The default is the empty set.

**conflict_resolving_strategy** is either 'MAX', 'MIN' or 'ALL'. When intersecting spans have equal priorities, 'MAX' keeps the longer span, 'MIN' keeps the shorter span and 'ALL' keeps all spans. The default is 'MAX'.
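The three strategies can be compared on a toy example. The function below is a simplified sketch that works on `(start, end)` tuples of equal priority; it is not taken from the library, the name `resolve` is hypothetical, and how intersecting spans of equal length are treated is an assumption (here both are kept).

```python
def resolve(spans, strategy='MAX'):
    """Resolve intersections between (start, end) spans of equal priority."""
    if strategy == 'ALL':
        return list(spans)

    def length(span):
        return span[1] - span[0]

    kept = []
    for i, span in enumerate(spans):
        beaten = False
        for j, other in enumerate(spans):
            if i == j or not (other[0] < span[1] and span[0] < other[1]):
                continue  # same span, or no intersection
            if strategy == 'MAX' and length(other) > length(span):
                beaten = True
            if strategy == 'MIN' and length(other) < length(span):
                beaten = True
        if not beaten:
            kept.append(span)
    return kept


spans = [(0, 6), (3, 5), (8, 10)]
longest = resolve(spans, 'MAX')
shortest = resolve(spans, 'MIN')
everything = resolve(spans, 'ALL')
```

With 'MAX' the span `(3, 5)` is dropped, with 'MIN' the span `(0, 6)` is dropped, and with 'ALL' nothing is dropped.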

If **overlapped** is True, overlapped matches are permitted. The default is False.
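The difference between the two settings can be shown with the standard `re` module. This sketch assumes that "overlapped" means a new match may start inside the previous one for the same pattern; whether RegexTagger works exactly this way is an assumption, and `find_spans` is a hypothetical helper.

```python
import re


def find_spans(pattern, text, overlapped=False):
    """Collect match spans, optionally letting matches overlap each other."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        if overlapped:
            pos = m.start() + 1               # retry right after the match start
        else:
            pos = max(m.end(), m.start() + 1)  # continue after the match end
    return spans


plain = find_spans(r'\d\d', '1234')
overlapping = find_spans(r'\d\d', '1234', overlapped=True)
```

Without overlapping the two-digit pattern matches `12` and `34`; with overlapping it also matches `23`.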

If **return_layer** is True, the layer is returned and the text object is unchanged. If False, the layer is attached to the text object and None is returned. The default is False.

**layer_name** is the name of the layer.

In [4]:
from estnltk.taggers import RegexTagger
from estnltk import Text
from estnltk.layer_operations import repr_html

tokenization_hints_tagger = RegexTagger(vocabulary=vocabulary,
                                        attributes={'normalized', '_priority_'},
                                        conflict_resolving_strategy='MAX',
                                        overlapped=False,
                                        return_layer=False,
                                        layer_name='tokenization_hints')

text = Text('Aadressilt bla@bla.ee tuli 10 456 kirja aadressile foo@foo.ee 10 tunni jooksul.')
status = {}
tokenization_hints_tagger.tag(text, status)
repr_html(text['tokenization_hints'])

| | text | normalized | `_priority_` |
|---|---|---|---|
| 0 | bla@bla.ee | | 2 |
| 1 | 10 456 | 10456.0 | 1 |
| 2 | foo@foo.ee | | 2 |
| 3 | 10 | 10.0 | 1 |


In [5]:
# The number of intersecting pairs of spans before conflict resolving
status

{'number_of_conflicts': 0}