# RegexTagger

RegexTagger must be initialised with a ruleset. The ruleset can be read from a CSV file as follows.

In [1]:
from estnltk.taggers.system.rule_taggers import RegexTagger, Ruleset

ruleset_file = 'regex_vocabulary.csv'
ruleset = Ruleset()

In [2]:
ruleset.load(file_name=ruleset_file, key_column='_regex_pattern_')

In [3]:
ruleset.static_rules

[StaticExtractionRule(pattern=regex.Regex('-?(\\d[\\s\\.]?)+(,\\s?(\\d[\\s\\.]?)+)?', flags=regex.V0), attributes={'_group_': 0, '_priority_': 1, 'comment': 'number', 'example': '-34 567 000 123 , 456'}, group=0, priority=0),
 StaticExtractionRule(pattern=regex.Regex('([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)', flags=regex.V0), attributes={'_group_': 1, '_priority_': 2, 'comment': 'e-mail', 'example': 'bla@bla.bl'}, group=0, priority=0)]

**pattern** is a regular expression given as a string. **\_group\_** is an integer that determines which group of the pattern is tagged in the text (the default is `0`, the whole match). **\_priority\_** is used to resolve conflicts between intersecting spans (the default is `0`). Smaller values mean higher priority; any comparable values can be used, not only numbers. How priorities are applied depends on the conflict resolving strategy.
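
To make the role of `_group_` concrete, here is a minimal sketch that uses Python's standard `re` module rather than EstNLTK itself. The pattern and the sample sentence are invented for illustration; the only point is that the group index selects which part of a match would become the tagged span.

```python
import re

# Hypothetical pattern: group 1 captures the number, group 2 the unit.
pattern = re.compile(r'(\d+)\s*(kg|g)')
text = 'Pakk kaalub 250 g.'

match = pattern.search(text)

# Group 0 is the whole match; higher indices are the capture groups.
spans = {group: (match.span(group), match.group(group)) for group in (0, 1, 2)}

for group, (span, matched_text) in spans.items():
    print(group, span, repr(matched_text))
```

With group `0` the span covers the number and the unit together, while group `1` would restrict the span to the number alone.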

Additional attributes must be strings that specify the initial annotation for each rule. To alter the annotation dynamically, add a DynamicExtractionRule for the same (pattern, group, priority) tuple. Dynamic rules cannot be read from the .csv file; they must be added manually.

RegexTagger accepts the following parameters.

**ruleset** is the ruleset as specified above.

**output_layer** is the name of the output layer.

**output_attributes** is a sequence of the attribute names to be annotated in the layer.

**conflict_resolver** is either 'KEEP_MAXIMAL', 'KEEP_MINIMAL', 'KEEP_ALL' or a callable. In case of intersecting spans, 'KEEP_MAXIMAL' keeps longer, 'KEEP_MINIMAL' keeps shorter and 'KEEP_ALL' keeps all spans. A callable should be a function that does the conflict resolving itself. The default is 'KEEP_MAXIMAL'.

If **overlapped** is True, overlapped matches are permitted. The default is False.

If **lowercase_text** is True, matches are searched for in the lowercased version of the input text. The default is False.

**decorator** is a function applied to each annotation to validate or edit it. The default is None.

**match_attribute** is the name of the annotation attribute in which the match object is stored (the default is 'match'). It can be used in dynamic rules or in the decorator function, as shown in the example below.

In [4]:
from estnltk.taggers import RegexTagger
from estnltk import Text
import re

def global_decorator(layer, span, annotation):
    # Strip whitespace and dots from the matched text; the raw string avoids invalid-escape warnings
    annotation['normalized'] = re.sub(r'[\s.]', '', annotation['match'].group(0))
    return annotation

tokenization_hints_tagger = RegexTagger(ruleset=ruleset,
                                        output_layer='tokenization_hints', # default 'regexes'
                                        output_attributes=['normalized', '_priority_'], # default: None
                                        conflict_resolver='KEEP_MAXIMAL', # default 'KEEP_MAXIMAL'
                                        overlapped=False, # default False
                                        lowercase_text=False,  # default False
                                        decorator=global_decorator, #default None
                                        match_attribute='match' #default 'match'
                                        ) 
tokenization_hints_tagger

name,output layer,output attributes,input layers
RegexTagger,tokenization_hints,"('normalized', '_priority_')",()

0,1
conflict_resolver,KEEP_MAXIMAL
overlapped,False
ruleset,<estnltk.taggers.system.rule_taggers.extraction_rules.ruleset.Ruleset object at 0x00000220E1447F98>
global_decorator,<function __main__.global_decorator>
match_attribute,match
static_ruleset_map,"{regex.Regex('-?(\\d[\\s\\.]?)+(,\\s?(\\d[\\s\\.]?)+)?', flags=regex.V0): [(0, 0 ..., type: <class 'dict'>, length: 2"
dynamic_ruleset_map,{}
lowercase_text,False


In [5]:
text = Text('Aadressilt bla@bla.ee tuli 10 456 kirja aadressile foo@foo.ee 10 tunni jooksul.')
status = {}
tokenization_hints_tagger.tag(text, status=status)
text['tokenization_hints']

layer name,attributes,parent,enveloping,ambiguous,span count
tokenization_hints,"normalized, _priority_",,,True,4

text,normalized,_priority_
bla@bla.ee,bla@blaee,2
10 456,10456,1
foo@foo.ee,foo@fooee,2
10,10,1
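
The example above leaves `overlapped` at its default. As a rough illustration of what overlapped matching means, the sketch below uses only the standard `re` module, which has no overlapped mode, and emulates it with a lookahead. This is a stand-in for the idea and makes no claim about how RegexTagger implements the flag; the pattern `aba` and the input string are invented.

```python
import re

text = 'ababa'

# Ordinary scanning resumes after the end of the previous match,
# so the occurrence starting at index 2 is skipped.
non_overlapped = [m.span() for m in re.finditer(r'aba', text)]

# A zero-width lookahead with a capture group is tried at every start
# position, which yields matches that share characters.
overlapped = [m.span(1) for m in re.finditer(r'(?=(aba))', text)]

print(non_overlapped)
print(overlapped)
```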


## Conflict resolving

In [6]:
from estnltk.taggers.system.rule_taggers import StaticExtractionRule
import regex

# no _priority_ given, so every rule gets the default priority
rule_list = [
              StaticExtractionRule(pattern=regex.Regex('kaks')),
              StaticExtractionRule(pattern=regex.Regex('kümme')),
              StaticExtractionRule(pattern=regex.Regex('kakskümmend')),
              StaticExtractionRule(pattern=regex.Regex('kakskümmend kaks'))
             ]

ruleset = Ruleset()
ruleset.add_rules(rule_list)

In [7]:
# no _priority_, conflict_resolver='KEEP_ALL'
regex_tagger = RegexTagger(ruleset=ruleset, conflict_resolver='KEEP_ALL')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,True,5

text
kaks
kakskümmend
kakskümmend kaks
kümme
kaks


In [8]:
# no _priority_, conflict_resolver='KEEP_MAXIMAL'
regex_tagger = RegexTagger(ruleset=ruleset, conflict_resolver='KEEP_MAXIMAL')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,True,1

text
kakskümmend kaks


In [9]:
# no _priority_, conflict_resolver='KEEP_MINIMAL'
regex_tagger = RegexTagger(ruleset=ruleset, conflict_resolver='KEEP_MINIMAL')
text = Text('kakskümmend kaks')
regex_tagger.tag(text)
text['regexes']

layer name,attributes,parent,enveloping,ambiguous,span count
regexes,,,,True,3

text
kaks
kümme
kaks
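
The three outcomes above can be reproduced with a few lines of plain Python. The sketch below is a simplified model and not EstNLTK's implementation: it assumes that 'KEEP_MAXIMAL' drops every span that lies inside another span and that 'KEEP_MINIMAL' drops every span that covers another span. The helper names are hypothetical, and priorities are ignored.

```python
import re

def find_spans(patterns, text):
    """Collect the (start, end) span of every match of every pattern."""
    spans = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            spans.add(match.span())
    return sorted(spans)

def contains(outer, inner):
    """True if `inner` lies inside `outer` and the two spans differ."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

def keep_maximal(spans):
    # Drop every span that is covered by some other span.
    return [s for s in spans if not any(contains(o, s) for o in spans)]

def keep_minimal(spans):
    # Drop every span that covers some other span.
    return [s for s in spans if not any(contains(s, o) for o in spans)]

text = 'kakskümmend kaks'
patterns = ['kaks', 'kümme', 'kakskümmend', 'kakskümmend kaks']

all_spans = find_spans(patterns, text)
maximal = [text[a:b] for a, b in keep_maximal(all_spans)]
minimal = [text[a:b] for a, b in keep_minimal(all_spans)]

print(len(all_spans), maximal, minimal)
```

Under these assumptions the model finds five spans in total, keeps one as maximal and three as minimal, which agrees with the span counts shown above. Spans that overlap only partially are left untouched by this model; the real tagger may treat them differently.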


## Validating
Match results can be validated using the decorator. The decorator takes three arguments: layer, span and annotation. If the annotation passes validation, the decorator must return the annotation; otherwise it should return `None`, which omits the match.

In [10]:
def decorator(layer,span,annotation):
    if annotation['match'].group(0).startswith('0'):
        return annotation
    else:
        return None
    

rules = [StaticExtractionRule(pattern=regex.Regex(r'\d+'), attributes={'comment': 'starts with zero'})]

ruleset = Ruleset()
ruleset.add_rules(rules)

regex_tagger = RegexTagger(output_layer='numbers',
                           ruleset=ruleset,
                           output_attributes=['comment'],
                           decorator=decorator)
text = Text('3209 n  0930 093 2304209 093402')
regex_tagger.tag(text)
text['numbers']

layer name,attributes,parent,enveloping,ambiguous,span count
numbers,comment,,,True,3

text,comment
0930,starts with zero
093,starts with zero
093402,starts with zero
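
The return-or-`None` contract can be shown in isolation with the standard `re` module. This is a simplified sketch, not the tagger's internals: the hypothetical `starts_with_zero` function receives the match object directly instead of the (layer, span, annotation) triple, and the loop stands in for whatever RegexTagger does with the decorator's return value.

```python
import re

def starts_with_zero(match):
    """Return an annotation dict for accepted matches and None otherwise."""
    if match.group(0).startswith('0'):
        return {'text': match.group(0)}
    return None

text = '3209 n  0930 093 2304209 093402'

annotations = []
for match in re.finditer(r'\d+', text):
    annotation = starts_with_zero(match)
    if annotation is not None:  # None means "discard this match"
        annotations.append(annotation)

accepted = [a['text'] for a in annotations]
print(accepted)
```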


## Lowercase text
Setting `lowercase_text=True` makes RegexTagger search for matches in the lowercased version of the text. The tagged spans still refer to the original text, as the output below shows.

In [11]:
rules = [StaticExtractionRule(pattern=regex.Regex(r'\w*sõna\w*'))]

ruleset = Ruleset()
ruleset.add_rules(rules)

regex_tagger = RegexTagger(output_layer='matches',
                           ruleset=ruleset,
                           output_attributes=[],
                           lowercase_text=True
                          )
text = Text('Miljonisõnaline SÕNArikas SõnaRaamat')
regex_tagger.tag(text)
text['matches']

layer name,attributes,parent,enveloping,ambiguous,span count
matches,,,,True,3

text
Miljonisõnaline
SÕNArikas
SõnaRaamat
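
The same effect can be sketched with the standard library: search the lowercased copy, then use the resulting offsets to slice the original string. This is an illustration of the idea only. It assumes that lowercasing does not change the length of the string, which holds for this input but not for every Unicode character.

```python
import re

text = 'Miljonisõnaline SÕNArikas SõnaRaamat'
lowered = text.lower()

# Assumption: lowercasing keeps the string length unchanged, so offsets
# found in `lowered` are valid offsets into `text` as well.
assert len(lowered) == len(text)

# Match against the lowercased copy, but report the original spelling.
matches = [text[m.start():m.end()] for m in re.finditer(r'\w*sõna\w*', lowered)]
print(matches)
```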
