# RegexTagger

RegexTagger is initialised with a vocabulary. The vocabulary argument may be the name of a csv file,

In [1]:
vocabulary = 'vocabulary.csv'

a pandas DataFrame

In [2]:
from pandas import read_csv
vocabulary = read_csv(vocabulary, na_filter=False, index_col=False)
vocabulary

Unnamed: 0,_regex_pattern_,_group_,_priority_,_validator_,normalized,comment,example
0,"-?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?",0,1,lambda m: True,"lambda m: re.sub('[\s\.]' ,'' , m.group(0))",number,"-34 567 000 123 , 456"
1,([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+),1,2,lambda m: True,lambda m: None,e-mail,bla@bla.bl


or a list of dicts

In [3]:
vocabulary = vocabulary.to_dict('records')
vocabulary

[{'_group_': 0,
  '_priority_': 1,
  '_regex_pattern_': '-?(\\d[\\s\\.]?)+(,\\s?(\\d[\\s\\.]?)+)?',
  '_validator_': 'lambda m: True',
  'comment': 'number',
  'example': '-34 567 000 123 , 456',
  'normalized': "lambda m: re.sub('[\\s\\.]' ,'' , m.group(0))"},
 {'_group_': 1,
  '_priority_': 2,
  '_regex_pattern_': '([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)',
  '_validator_': 'lambda m: True',
  'comment': 'e-mail',
  'example': 'bla@bla.bl',
  'normalized': 'lambda m: None'}]
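
A file like `vocabulary.csv` could be produced with the standard `csv` module. The sketch below is only an illustration: the helper name `write_vocabulary_csv` is hypothetical, and the column names and row values are copied from the two entries shown above.

```python
import csv

# Column names as they appear in the vocabulary table above.
FIELDNAMES = ['_regex_pattern_', '_group_', '_priority_', '_validator_',
              'normalized', 'comment', 'example']

# The two example entries; raw strings keep the backslashes intact.
rows = [
    {'_regex_pattern_': r'-?(\d[\s\.]?)+(,\s?(\d[\s\.]?)+)?',
     '_group_': 0,
     '_priority_': 1,
     '_validator_': 'lambda m: True',
     'normalized': r"lambda m: re.sub('[\s\.]' ,'' , m.group(0))",
     'comment': 'number',
     'example': '-34 567 000 123 , 456'},
    {'_regex_pattern_': r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)',
     '_group_': 1,
     '_priority_': 2,
     '_validator_': 'lambda m: True',
     'normalized': 'lambda m: None',
     'comment': 'e-mail',
     'example': 'bla@bla.bl'},
]


def write_vocabulary_csv(path, rows):
    """Write vocabulary rows to a csv file with a header line (hypothetical helper)."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == '__main__':
    write_vocabulary_csv('vocabulary.csv', rows)
```

Callables and `None` are written as `lambda m: ...` strings because a csv cell can only hold text.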

The keyword **\_regex\_pattern\_** is required. The **\_group\_** and **\_priority\_** keywords default to zero. The keywords **start** and **end** are reserved and must not be used.

**\_regex\_pattern\_** is a regular expression given as a string or a compiled regex pattern. **\_group\_** is an integer that determines which group of the pattern should be tagged on the text (the default is `0`). **\_priority\_** is used to resolve conflicts (the default is `0`). Smaller numbers (or any other comparables) represent higher priorities and bigger numbers lower priorities.

In case of intersecting spans, the span with the lower priority is removed. If the priorities are equal, the conflict is resolved according to the chosen conflict resolving strategy.
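
The priority rule can be pictured with a few lines of plain Python. This is a simplified sketch, not the EstNLTK implementation: spans are assumed to be half-open `(start, end, priority)` tuples, and the function names are hypothetical.

```python
def intersects(a, b):
    """True if the half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def resolve_by_priority(spans):
    """Drop every span that intersects a span with a smaller priority number.

    A smaller number means a higher priority, so the dropped span is the
    lower-priority one. Spans of equal priority are left untouched here.
    """
    return [s for s in spans
            if not any(intersects(s, t) and t[2] < s[2] for t in spans)]


# Matches in 'kakskümmend kaks' of the patterns
# 'kaks' (priority 0), 'kümme' (1), 'kakskümmend' (2), 'kakskümmend kaks' (3).
spans = [(0, 4, 0), (4, 9, 1), (0, 11, 2), (0, 16, 3), (12, 16, 0)]
print(resolve_by_priority(spans))
# [(0, 4, 0), (4, 9, 1), (12, 16, 0)]
```

The spans for 'kaks' and 'kümme' touch at position 4 but do not intersect, so both survive.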

**\_validator\_** is used to validate matches (the default is `lambda m: True`).

Any callable value in the vocabulary is assumed to take one argument `m`, the match object, and to return the value of the corresponding attribute.
Any string (except the values of \_regex\_pattern\_, \_group\_ and \_priority\_) that starts with 'lambda m:' is compiled into such a function, which should return the value of the corresponding attribute. This is also a way to express Python objects in the csv file. For example,
```python
lambda m: re.sub('[\s\.]' ,'' , m.group(0)) # remove all whitespace and periods from the match
lambda m: 'lambda m: ...' # represent a string that starts with 'lambda m:'
lambda m: None # represent None
```
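
How such a string might be turned into a function can be sketched as follows. The use of `eval` and the helper name `compile_if_lambda` are assumptions made for illustration; the tagger's actual compilation step may differ.

```python
import re


def compile_if_lambda(value):
    """Compile a string starting with 'lambda m:' into a callable.

    Any other value is returned unchanged. `re` is made available to the
    evaluated expression so that lambdas like the ones above can use it.
    """
    if isinstance(value, str) and value.startswith('lambda m:'):
        return eval(value, {'re': re})
    return value


normalizer = compile_if_lambda(r"lambda m: re.sub('[\s\.]' ,'' , m.group(0))")
match = re.search(r'-?(\d[\s\.]?)+', 'tuli 10 456 kirja')
print(repr(normalizer(match)))
# '10456'
```

Because `eval` executes arbitrary code, a vocabulary file handled this way should come from a trusted source only.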

**attributes** is a set of the names of the attributes to be annotated in the layer in addition to 'start' and 'end'. The default is the empty set.

**conflict_resolving_strategy** is either 'MAX', 'MIN' or 'ALL'. In case of intersecting spans, 'MAX' keeps longer, 'MIN' keeps shorter and 'ALL' keeps all spans. The default is 'MAX'.
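
The three strategies can be illustrated on bare `(start, end)` spans. The rule used below (drop a span when it intersects a strictly longer or strictly shorter one) is a deliberate simplification with a hypothetical function name, not the algorithm EstNLTK uses internally.

```python
def resolve_conflicts(spans, strategy='MAX'):
    """Filter half-open (start, end) spans according to a strategy.

    'ALL' keeps everything, 'MAX' drops spans that intersect a longer span,
    'MIN' drops spans that intersect a shorter span.
    """
    if strategy == 'ALL':
        return list(spans)
    if strategy not in ('MAX', 'MIN'):
        raise ValueError('unknown strategy: {!r}'.format(strategy))

    def length(s):
        return s[1] - s[0]

    def loses_to(s, t):
        # s is dropped if it intersects t and t is preferred by the strategy.
        if not (s[0] < t[1] and t[0] < s[1]):
            return False
        if strategy == 'MAX':
            return length(t) > length(s)
        return length(t) < length(s)

    return [s for s in spans if not any(loses_to(s, t) for t in spans)]


# 'kaks', 'kümme', 'kakskümmend', 'kakskümmend kaks', 'kaks' in 'kakskümmend kaks'
spans = [(0, 4), (4, 9), (0, 11), (0, 16), (12, 16)]
print(resolve_conflicts(spans, 'MAX'))  # [(0, 16)]
print(resolve_conflicts(spans, 'MIN'))  # [(0, 4), (4, 9), (12, 16)]
print(len(resolve_conflicts(spans, 'ALL')))  # 5
```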

If **overlapped** is True, overlapped matches are permitted. The default is False.

If **return_layer** is True, the layer is returned and the text object is unchanged. If False, the layer is attached to the text object and None is returned. The default is False.

**layer_name** is the name of the layer.

In [4]:
from estnltk.taggers import RegexTagger
from estnltk import Text

tokenization_hints_tagger = RegexTagger(vocabulary=vocabulary,
                                        attributes=['normalized', '_priority_'],
                                        conflict_resolving_strategy='MAX',
                                        overlapped=False,
                                        layer_name='tokenization_hints')
tokenization_hints_tagger

name,layer,attributes,depends_on
RegexTagger,tokenization_hints,"[normalized, _priority_]",[]

0,1
overlapped,False
conflict_resolving_strategy,MAX


In [5]:
text = Text('Aadressilt bla@bla.ee tuli 10 456 kirja aadressile foo@foo.ee 10 tunni jooksul.')
status = {}
tokenization_hints_tagger.tag(text, return_layer=False, status=status)
text['tokenization_hints']

layer name,attributes,parent,enveloping,ambiguous,span count
tokenization_hints,"normalized, _priority_",,,False,4

text,normalized,_priority_
bla@bla.ee,,2
10 456,10456,1
foo@foo.ee,,2
10,10,1


In [6]:
# The number of intersecting pairs of spans before conflict resolving
status

{'number_of_conflicts': 0}
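
What "the number of intersecting pairs of spans" means can be shown on plain `(start, end)` tuples. The function below is a hypothetical stand-in written for this explanation; it makes no claim about how the tagger computes `number_of_conflicts`.

```python
from itertools import combinations


def count_conflicts(spans):
    """Count the unordered pairs of half-open (start, end) spans that intersect."""
    return sum(1 for a, b in combinations(spans, 2)
               if a[0] < b[1] and b[0] < a[1])


# Four spans that never overlap, as in the example above: no conflicts.
print(count_conflicts([(11, 21), (27, 33), (51, 61), (62, 64)]))  # 0

# (0, 11) overlaps both shorter spans; the shorter spans only touch each other.
print(count_conflicts([(0, 4), (4, 9), (0, 11)]))  # 2
```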

## Conflict resolving

In [7]:
# no _priority_, conflict_resolving_strategy='ALL'
vocabulary = [
              {'_regex_pattern_': 'kaks'},
              {'_regex_pattern_': 'kümme'},
              {'_regex_pattern_': 'kakskümmend'},
              {'_regex_pattern_': 'kakskümmend kaks'}
             ]

regex_tagger = RegexTagger(vocabulary=vocabulary, conflict_resolving_strategy='ALL')
text = Text('kakskümmend kaks')
status = {}
regex_tagger.tag(text, status)
print(status)
text['regexes']

{}


layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,False,5

text
kaks
kakskümmend
kakskümmend kaks
kümme
kaks


In [8]:
# no _priority_, conflict_resolving_strategy='MAX'
regex_tagger = RegexTagger(vocabulary=vocabulary, conflict_resolving_strategy='MAX')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,False,1

text
kakskümmend kaks


In [9]:
# no _priority_, conflict_resolving_strategy='MIN'
regex_tagger = RegexTagger(vocabulary=vocabulary, conflict_resolving_strategy='MIN')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,False,3

text
kaks
kümme
kaks


In [10]:
# _priority_ given and conflict_resolving_strategy='ALL'
event_vocabulary = [
                    {'_regex_pattern_': 'kaks', '_priority_': 0},
                    {'_regex_pattern_': 'kümme', '_priority_': 1},
                    {'_regex_pattern_': 'kakskümmend', '_priority_': 2},
                    {'_regex_pattern_': 'kakskümmend kaks', '_priority_': 3}
                   ]

regex_tagger = RegexTagger(vocabulary=event_vocabulary,attributes=['_priority_'], conflict_resolving_strategy='ALL')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,_priority_,,,False,3

text,_priority_
kaks,0
kümme,1
kaks,0


## Validating
Match results can be validated using a validator function. A validator function must take one argument, the match object `m`, and return an object that can be tested for its truth value. If the result is falsy, the match is omitted. The validator is passed to the tagger inside the vocabulary using the `_validator_` keyword. It can be a regular function or a lambda function given as a string starting with `lambda m:`. The next example demonstrates both options. The default validator is
```python
lambda m: True
```

In [11]:
def validator(m):
    return not m.group(0).startswith('0')
    

vocabulary = [
              {'_regex_pattern_': r'\d+',
               '_validator_': validator, 
               'comment':'starts with non-zero'},
              {'_regex_pattern_': r'\d+', 
               '_validator_': "lambda m: m.group(0).startswith('0')",
               'comment':'starts with zero'}
             ]

regex_tagger = RegexTagger(layer_name='numbers', vocabulary=vocabulary, attributes=['comment'])
text = Text('3209 n  0930 093 2304209 093402')
regex_tagger.tag(text)
text['numbers']

layer name,attributes,parent,enveloping,ambiguous,span count
numbers,comment,,,False,5

text,comment
3209,starts with non-zero
0930,starts with zero
093,starts with zero
2304209,starts with non-zero
093402,starts with zero
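
The effect of a validator can be reproduced with the standard `re` module alone. This is a minimal sketch of the idea with a hypothetical helper name; it produces no layer and skips conflict resolving.

```python
import re


def validated_matches(pattern, text, validator=lambda m: True):
    """Return the matched strings for which validator(m) is truthy."""
    return [m.group(0) for m in re.finditer(pattern, text) if validator(m)]


text = '3209 n  0930 093 2304209 093402'

# A regular function and a lambda are both just callables taking the match object.
def starts_with_non_zero(m):
    return not m.group(0).startswith('0')

print(validated_matches(r'\d+', text, starts_with_non_zero))
# ['3209', '2304209']
print(validated_matches(r'\d+', text, lambda m: m.group(0).startswith('0')))
# ['0930', '093', '093402']
```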
