# RegexTagger

To initialise, `RegexTagger` needs a vocabulary. The vocabulary argument may be the name of a csv file,

In [1]:
from estnltk.taggers import Vocabulary

vocabulary = 'regex_vocabulary.csv'

A `Vocabulary` object

In [2]:
vocabulary = Vocabulary.read_csv(vocabulary_file=vocabulary,
                                 key='_regex_pattern_')
vocabulary

_regex_pattern_,_group_,_priority_,_validator_,normalized,comment,example
<Regex ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)>,1,2,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,e-mail,bla@bla.bl
"<Regex -?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?>",0,1,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,number,"-34 567 000 123 , 456"


a pandas `DataFrame` where the regex column title is '\_regex\_pattern\_'

or a dict in the `Vocabulary` internal format

In [3]:
vocabulary.mapping

{regex.Regex('-?(\\d[\\s\\.]?)+(,\\s?(\\d[\\s\\.]?)+)?', flags=regex.V0): [{'comment': 'number',
   'example': '-34 567 000 123 , 456',
   '_group_': 0,
   '_priority_': 1,
   '_validator_': <function estnltk.taggers.dict_taggers.vocabulary.<lambda>(m)>,
   'normalized': <function estnltk.taggers.dict_taggers.vocabulary.<lambda>(m)>,
   '_regex_pattern_': regex.Regex('-?(\\d[\\s\\.]?)+(,\\s?(\\d[\\s\\.]?)+)?', flags=regex.V0)}],
 regex.Regex('([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)', flags=regex.V0): [{'comment': 'e-mail',
   'example': 'bla@bla.bl',
   '_group_': 1,
   '_priority_': 2,
   '_validator_': <function estnltk.taggers.dict_taggers.vocabulary.<lambda>(m)>,
   'normalized': <function estnltk.taggers.dict_taggers.vocabulary.<lambda>(m)>,
   '_regex_pattern_': regex.Regex('([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)', flags=regex.V0)}]}

**\_regex\_pattern\_** is a regular expression, given either as a string (for example in a csv file) or as a compiled regex pattern. **\_group\_** is an integer that determines which group of the pattern is tagged on the text (the default is `0`). **\_priority\_** is used to resolve conflicts (the default is `0`). Smaller numbers (or any other comparable values) represent higher priorities and bigger numbers represent lower priorities.

In case of intersecting spans, the span with the lower priority is removed. If the priorities are equal, the outcome depends on the chosen conflict resolving strategy.
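As a rough illustration of the `_group_` setting described above, the following sketch uses only Python's standard `re` module rather than `RegexTagger` itself. The pattern and the sample text are made up for the example. It shows how choosing group `0` or group `1` changes the span that would be tagged.

```python
import re

# Hypothetical pattern: a number followed by a unit; group 1 captures the number only.
pattern = re.compile(r'(\d+)\s*kg')
text = 'Kott kaalus 25 kg.'

m = pattern.search(text)

# Group 0 is the whole match, group 1 is only the captured number.
whole_span = m.span(0)
number_span = m.span(1)

print(whole_span, repr(text[whole_span[0]:whole_span[1]]))     # (12, 17) '25 kg'
print(number_span, repr(text[number_span[0]:number_span[1]]))  # (12, 14) '25'
```

With `_group_` set to `1`, only the number would end up in the layer, while the unit still has to be present for the pattern to match.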

**\_validator\_** is used to validate matches (the default is `lambda m: True`).

Any callable is assumed to take one argument m, the match object, and return a value for the corresponding attribute.
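To make the role of callable attribute values concrete, here is a minimal sketch that relies only on the standard `re` module. The function `normalize_number` is a hypothetical stand-in and is not the lambda stored in the vocabulary above; it only shows how a callable can turn a match object into an attribute value.

```python
import re

# The number pattern from the vocabulary above.
number_pattern = re.compile(r'-?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?')


def normalize_number(m):
    """Hypothetical attribute callable: convert the matched text to a float."""
    # Drop whitespace and dots, which are treated as digit separators here.
    digits = re.sub(r'[\s.]', '', m.group(0))
    # A comma is treated as the decimal separator.
    return float(digits.replace(',', '.'))


m = number_pattern.search('tuli 10 456 kirja')
attributes = {'normalized': normalize_number(m)}
print(attributes)  # {'normalized': 10456.0}
```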

**output_attributes** is a sequence of the attribute names to be annotated in the layer.

**conflict_resolving_strategy** is either 'MAX', 'MIN' or 'ALL'. In case of intersecting spans, 'MAX' keeps longer, 'MIN' keeps shorter and 'ALL' keeps all spans. The default is 'MAX'.

If **overlapped** is True, overlapped matches are permitted. The default is False.
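The difference between overlapped and non-overlapped matching can be sketched with the standard `re` module. The helper `find_spans` below is hypothetical and only approximates the idea; it is not how `RegexTagger` is implemented.

```python
import re


def find_spans(pattern, text, overlapped=False):
    """Return (start, end) spans of the matches of a compiled pattern."""
    spans = []
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        if overlapped:
            # Resume the search one character after the start of this match.
            pos = m.start() + 1
        else:
            # Resume after the end of this match (always advance at least one character).
            pos = max(m.end(), m.start() + 1)
    return spans


pattern = re.compile('aba')
print(find_spans(pattern, 'ababa'))                   # [(0, 3)]
print(find_spans(pattern, 'ababa', overlapped=True))  # [(0, 3), (2, 5)]
```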

If **return_layer** is True, the layer is returned and the text object is unchanged. If False, the layer is attached to the text object and None is returned. The default is False.

**output_layer** is the name of the output layer. The default is 'regexes'.

In [4]:
from estnltk.taggers import RegexTagger
from estnltk import Text

tokenization_hints_tagger = RegexTagger(vocabulary=vocabulary,
                                        output_layer='tokenization_hints', # default 'regexes'
                                        output_attributes=['normalized', '_priority_'], # default: None
                                        conflict_resolving_strategy='MAX', # default 'MAX'
                                        overlapped=False, # default False
                                        priority_attribute=None, # default None
                                        ignore_case=False  # default False
                                        ) 
tokenization_hints_tagger

name,output layer,output attributes,input layers
RegexTagger,tokenization_hints,"('normalized', '_priority_')",()

0,1
conflict_resolving_strategy,MAX
overlapped,False
priority_attribute,
vocabulary,"Vocabulary(key='_regex_pattern_', len=2)"
ambiguous,False


In [5]:
text = Text('Aadressilt bla@bla.ee tuli 10 456 kirja aadressile foo@foo.ee 10 tunni jooksul.')
status = {}
tokenization_hints_tagger.tag(text, status=status)
text['tokenization_hints']

layer name,attributes,parent,enveloping,ambiguous,span count
tokenization_hints,"normalized, _priority_",,,False,4

text,normalized,_priority_
bla@bla.ee,,2
10 456,10456.0,1
foo@foo.ee,,2
10,10.0,1


In [6]:
tokenization_hints_tagger.vocabulary

_regex_pattern_,_group_,_validator_,normalized,_priority_
<Regex ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)>,1,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,2
"<Regex -?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?>",0,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,<function estnltk.taggers.dict_taggers.vocabulary.<lambda>>,1


In [7]:
# The number of intersecting pairs of spans before conflict resolving
status

{'number_of_conflicts': 0}

## Conflict resolving

In [8]:
# no _priority_, conflict_resolving_strategy='ALL'
vocabulary = [
              {'_regex_pattern_': 'kaks'},
              {'_regex_pattern_': 'kümme'},
              {'_regex_pattern_': 'kakskümmend'},
              {'_regex_pattern_': 'kakskümmend kaks'}
             ]
vocabulary = Vocabulary.from_records(records=vocabulary,
                                     key='_regex_pattern_',
                                     attributes=['_group_', '_validator_'],
                                     default_rec={'_group_': 0, '_validator_': lambda s: True}
                                     )
vocabulary

_regex_pattern_,_group_,_validator_
kaks,0,<function __main__.<lambda>>
kakskümmend,0,<function __main__.<lambda>>
kakskümmend kaks,0,<function __main__.<lambda>>
kümme,0,<function __main__.<lambda>>


In [9]:
regex_tagger = RegexTagger(vocabulary=vocabulary, conflict_resolving_strategy='ALL')
text = Text('kakskümmend kaks')
status = {}
regex_tagger.tag(text, status)
print(status)
text['regexes']

{'number_of_conflicts': 6}


layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,False,5

text
kaks
kakskümmend
kakskümmend kaks
kümme
kaks
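The conflict count reported in `status` can be reproduced by hand. The sketch below assumes that a conflict is a pair of intersecting spans, as stated in the comment of the earlier `status` cell, and uses the standard `re` module instead of `RegexTagger`.

```python
import re
from itertools import combinations

text = 'kakskümmend kaks'
patterns = ['kaks', 'kümme', 'kakskümmend', 'kakskümmend kaks']

# Collect the (start, end) spans of all matches of all patterns.
spans = sorted(m.span() for p in patterns for m in re.finditer(p, text))


def intersects(a, b):
    """Two half-open spans intersect if each one starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


number_of_conflicts = sum(intersects(a, b) for a, b in combinations(spans, 2))
print(len(spans), number_of_conflicts)  # 5 6
```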


In [10]:
# no _priority_, conflict_resolving_strategy='MAX'
regex_tagger = RegexTagger(vocabulary=vocabulary, conflict_resolving_strategy='MAX')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,False,1

text
kakskümmend kaks


In [11]:
# no _priority_, conflict_resolving_strategy='MIN'
regex_tagger = RegexTagger(vocabulary=vocabulary, conflict_resolving_strategy='MIN')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,False,3

text
kaks
kümme
kaks
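The three outputs above can be mimicked with a simplified rule: under 'MAX' a span is dropped if it intersects a strictly longer span, and under 'MIN' if it intersects a strictly shorter one. The function `resolve` is a hypothetical approximation that happens to agree with this example; the actual algorithm in `RegexTagger` may differ in other cases.

```python
def intersects(a, b):
    """Two half-open spans intersect if each one starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def length(span):
    return span[1] - span[0]


def resolve(spans, strategy='MAX'):
    """Simplified conflict resolving over (start, end) spans."""
    if strategy == 'ALL':
        return list(spans)
    if strategy not in ('MAX', 'MIN'):
        raise ValueError('unknown strategy: {!r}'.format(strategy))
    kept = []
    for s in spans:
        rivals = [o for o in spans if o != s and intersects(s, o)]
        if strategy == 'MAX':
            beaten = any(length(o) > length(s) for o in rivals)
        else:
            beaten = any(length(o) < length(s) for o in rivals)
        if not beaten:
            kept.append(s)
    return kept


text = 'kakskümmend kaks'
# Spans of 'kaks', 'kakskümmend', 'kakskümmend kaks', 'kümme' and the second 'kaks'.
spans = [(0, 4), (0, 11), (0, 16), (4, 9), (12, 16)]

for strategy in ('ALL', 'MAX', 'MIN'):
    print(strategy, [text[a:b] for a, b in resolve(spans, strategy)])
```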


In [12]:
# _priority_ given and conflict_resolving_strategy='ALL'
vocabulary = [
              {'_regex_pattern_': 'kaks', '_priority_': 0},
              {'_regex_pattern_': 'kümme', '_priority_': 1},
              {'_regex_pattern_': 'kakskümmend', '_priority_': 2},
              {'_regex_pattern_': 'kakskümmend kaks', '_priority_': 3}
             ]
_vocabulary = Vocabulary.from_records(records=vocabulary,
                                      key='_regex_pattern_',
                                      default_rec={'_group_': 0, '_validator_': lambda s: True},
                                      attributes=['_regex_pattern_', '_group_', '_validator_'])

regex_tagger = RegexTagger(vocabulary=vocabulary,
                           output_attributes=['_priority_'],
                           conflict_resolving_strategy='ALL',
                           priority_attribute='_priority_')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,_priority_,,,False,3

text,_priority_
kaks,0
kümme,1
kaks,0
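Priority-based resolving can be sketched in the same spirit. The function `resolve_by_priority` is hypothetical: it drops every span that intersects a span with a strictly smaller priority number, which reproduces the result above but is not taken from the `RegexTagger` source.

```python
def intersects(a, b):
    """Two half-open spans intersect if each one starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def resolve_by_priority(spans):
    """Keep (start, end, priority) spans not beaten by an intersecting span
    with a smaller priority number (that is, a higher priority)."""
    kept = []
    for s in spans:
        beaten = any(o is not s and intersects(s, o) and o[2] < s[2] for o in spans)
        if not beaten:
            kept.append(s)
    return sorted(kept)


text = 'kakskümmend kaks'
# (start, end, priority) for every match in the text.
spans = [(0, 4, 0), (4, 9, 1), (0, 11, 2), (0, 16, 3), (12, 16, 0)]

for start, end, priority in resolve_by_priority(spans):
    print(text[start:end], priority)
```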


## Validating
Match results can be validated using a validator function. The validator must take one argument, the match object `m`, and return an object that can be tested for its truth value. If the result is false, the match is omitted. The validator is passed to the tagger inside the vocabulary under the `_validator_` key. It can be a regular function or a lambda function given as a string starting with `lambda m:`. The example below uses a regular function and a lambda expression. The default validator is
```python
lambda m: True
```

In [13]:
def validator(m):
    return not m.group(0).startswith('0')
    

vocabulary = [
              # Raw strings avoid invalid-escape warnings for '\d'.
              {'_regex_pattern_': r'\d+',
               '_validator_': validator,
               'comment': 'starts with non-zero'},
              {'_regex_pattern_': r'\d+',
               '_validator_': lambda m: m.group(0).startswith('0'),
               'comment': 'starts with zero'}
             ]

regex_tagger = RegexTagger(output_layer='numbers',
                           vocabulary=vocabulary,
                           output_attributes=['comment'])
text = Text('3209 n  0930 093 2304209 093402')
regex_tagger.tag(text)
text['numbers']

layer name,attributes,parent,enveloping,ambiguous,span count
numbers,comment,,,False,5

text,comment
3209,starts with non-zero
0930,starts with zero
093,starts with zero
2304209,starts with non-zero
093402,starts with zero
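Conceptually, validation amounts to filtering match objects. The sketch below redoes the example with the standard `re` module only; the `rules` list and the loop are assumptions made for illustration and do not mirror the internals of `RegexTagger`.

```python
import re


def starts_with_nonzero(m):
    return not m.group(0).startswith('0')


# (compiled pattern, validator, comment) triples; a hypothetical stand-in for a vocabulary.
rules = [
    (re.compile(r'\d+'), starts_with_nonzero, 'starts with non-zero'),
    (re.compile(r'\d+'), lambda m: m.group(0).startswith('0'), 'starts with zero'),
]

text = '3209 n  0930 093 2304209 093402'

results = []
for pattern, validator, comment in rules:
    for m in pattern.finditer(text):
        # A match is kept only if its validator returns a true value.
        if validator(m):
            results.append((m.start(), m.group(0), comment))

# Order the kept matches by their position in the text.
results.sort()
for _, matched, comment in results:
    print(matched, comment)
```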


In [14]:
regex_tagger.vocabulary

_regex_pattern_,_group_,_validator_,comment
<Regex \d+>,0,<function __main__.validator>,starts with non-zero
<Regex \d+>,0,<function __main__.<lambda>>,starts with zero


## Ignore case
Setting the flag `ignore_case=True` makes the matching of the regular expressions in the `RegexTagger` vocabulary case-insensitive.

In [15]:
vocabulary = [
              {'_regex_pattern_': r'\w*Sõna\w*'},
             ]

regex_tagger = RegexTagger(output_layer='matches',
                           vocabulary=vocabulary,
                           output_attributes=[],
                           ignore_case=True
                          )
text = Text('Miljonisõnaline SÕNArikas SõnaRaamat')
regex_tagger.tag(text)
text['matches']

layer name,attributes,parent,enveloping,ambiguous,span count
matches,,,,False,3

text
Miljonisõnaline
SÕNArikas
SõnaRaamat
