## Rulesets and Extraction Rules

Rule-based taggers need rules for tagging. These rules are defined using ExtractionRule classes and added to rulesets, which are then passed to the taggers. There are two kinds of extraction rules, StaticExtractionRule and DynamicExtractionRule, and two kinds of rulesets, Ruleset and AmbiguousRuleset.

#### Ruleset and AmbiguousRuleset

Ruleset and AmbiguousRuleset behave very similarly; the only difference is that AmbiguousRuleset allows two rules to have the same left-hand side (pattern). Here are some very simple examples:

In [1]:
from estnltk.taggers.system.rule_taggers import Ruleset, AmbiguousRuleset, StaticExtractionRule

#this is a basic working example
ruleset = Ruleset()
different_rules = [StaticExtractionRule(pattern="abc",attributes={"value": "xyz"}),
                   StaticExtractionRule(pattern="pattern",attributes={"value": "foobar"})]
ruleset.add_rules(different_rules)

#print the (static) rules currently in the ruleset
ruleset.static_rules

[StaticExtractionRule(pattern='abc', attributes={'value': 'xyz'}, group=0, priority=0),
 StaticExtractionRule(pattern='pattern', attributes={'value': 'foobar'}, group=0, priority=0)]

In [2]:
#duplicate rules in Ruleset
ruleset = Ruleset()
same_rules = [StaticExtractionRule(pattern="abc",attributes={"value": "xyz"}),
              StaticExtractionRule(pattern="abc",attributes={"value": "foobar"})]
try:
    ruleset.add_rules(same_rules)
except ValueError as err:
    print(err)

Two rules in ruleset give a conflicting attribute definition for the same pattern but ambiguous ruleset is not allowed.


Note that rules only need to have different patterns to be allowed in a Ruleset; other properties are not checked. In an AmbiguousRuleset, all rules are allowed even if they share the same pattern:

In [3]:
#duplicate rules in AmbiguousRuleset
ruleset = AmbiguousRuleset()
same_rules = [StaticExtractionRule(pattern="abc",attributes={"value": "xyz"}),
              StaticExtractionRule(pattern="abc",attributes={"value": "foobar"})]
ruleset.add_rules(same_rules)

ruleset.static_rules

[StaticExtractionRule(pattern='abc', attributes={'value': 'xyz'}, group=0, priority=0),
 StaticExtractionRule(pattern='abc', attributes={'value': 'foobar'}, group=0, priority=0)]
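
The difference between the two ruleset kinds can be pictured with a small stand-alone sketch. It is a toy model in plain Python, not EstNLTK's implementation: `ToyRule` and `add_rules` are hypothetical names, and the check only looks at patterns, as described in the text (the real check may also compare attributes, as the error message above suggests).

```python
from collections import namedtuple

# Hypothetical stand-in for StaticExtractionRule; only the two fields used here.
ToyRule = namedtuple("ToyRule", ["pattern", "attributes"])


def add_rules(existing, new_rules, allow_ambiguous):
    """Return existing + new_rules, rejecting repeated patterns unless allowed."""
    seen = {rule.pattern for rule in existing}
    result = list(existing)
    for rule in new_rules:
        if rule.pattern in seen and not allow_ambiguous:
            raise ValueError(f"pattern {rule.pattern!r} is already in the ruleset")
        seen.add(rule.pattern)
        result.append(rule)
    return result


same_rules = [ToyRule("abc", {"value": "xyz"}),
              ToyRule("abc", {"value": "foobar"})]

# Ambiguous mode keeps both rules, like AmbiguousRuleset.
ambiguous = add_rules([], same_rules, allow_ambiguous=True)

# Strict mode refuses the second rule, like Ruleset.
try:
    add_rules([], same_rules, allow_ambiguous=False)
    strict_error = None
except ValueError as err:
    strict_error = str(err)
```

In this model the only thing that separates the two behaviours is a single flag, which mirrors how similar the two classes are in use.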

As can be seen, rules can be added to a ruleset with the `add_rules` method which takes a list of rules as an argument. It is also possible to give the rules as an argument when creating the ruleset or have the ruleset read the rules from a file.

In [4]:
#identical result to the approach above
ruleset = AmbiguousRuleset(same_rules)
ruleset.static_rules

[StaticExtractionRule(pattern='abc', attributes={'value': 'xyz'}, group=0, priority=0),
 StaticExtractionRule(pattern='abc', attributes={'value': 'foobar'}, group=0, priority=0)]

In [5]:
#read from a file
ruleset = AmbiguousRuleset()
ruleset.load("phrase_vocabulary.csv")
ruleset.static_rules

[StaticExtractionRule(pattern=('tundma', 'inimene'), attributes={'value': 'TI_1'}, group=0, priority=0),
 StaticExtractionRule(pattern=('päike',), attributes={'value': 'P'}, group=0, priority=0),
 StaticExtractionRule(pattern=('tundma', 'inimene'), attributes={'value': 'TI_2'}, group=0, priority=0),
 StaticExtractionRule(pattern=('tundma', 'inimene', 'palju'), attributes={'value': 'TIP'}, group=0, priority=0)]

Only static rules can be created by reading from a file.

The CSV file must have the following format:
* The first header row must contain attribute names for rules.
* The second header row must contain attribute types for each column.
* Attribute type must be int, float, regex, or string.
* The remaining rows must be the rules. All columns must be filled.

In [6]:
from pandas import read_csv

print(read_csv("phrase_vocabulary.csv"))

                       _phrase_   value
0                      callable  string
1           'tundma', 'inimene'    TI_1
2                      'päike',       P
3           'tundma', 'inimene'    TI_2
4  'tundma', 'inimene', 'palju'     TIP
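
A file with the layout printed above could be produced with Python's standard `csv` module, roughly as sketched below. The cell values are copied from the printed table; the file name `phrase_vocabulary_sketch.csv` and the on-disk quoting are assumptions, since the original file itself is not shown. The sketch only writes the file, reads it back and checks that every column is filled; it does not call EstNLTK.

```python
import csv
import os
import tempfile

# Row 1: attribute names, row 2: attribute types, remaining rows: the rules.
rows = [
    ["_phrase_", "value"],
    ["callable", "string"],
    ["'tundma', 'inimene'", "TI_1"],
    ["'päike',", "P"],
    ["'tundma', 'inimene'", "TI_2"],
    ["'tundma', 'inimene', 'palju'", "TIP"],
]

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "phrase_vocabulary_sketch.csv")

# csv.writer quotes cells that contain commas, so each phrase stays in one column.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.reader(f))

# Sanity checks matching the format rules: no empty cells, equal row widths.
all_filled = all(cell != "" for row in loaded for cell in row)
same_width = len({len(row) for row in loaded}) == 1
```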


#### StaticExtractionRule

Some examples of StaticExtractionRule were already shown above. A StaticExtractionRule is a data structure that defines a rule with static attributes. It takes a pattern to match and a dictionary of attributes as parameters, and can optionally be given a group and a priority. The extracted spans are decorated with the given attribute values. By default, groups and priorities are ignored, but they can be used to resolve conflicts.

In [7]:
rules = [StaticExtractionRule(pattern="abc",attributes={"value": "xyz"}),
         StaticExtractionRule(pattern="abc",attributes={"value": "foobar"})]

In [8]:
ruleset = AmbiguousRuleset(rules)

In [9]:
from estnltk.taggers.system.rule_taggers import SubstringTagger
tagger = SubstringTagger(ruleset=ruleset,output_attributes=['value'])

In [10]:
from estnltk import Text

tekst = Text("abcdef")

In [11]:
tagger.tag(tekst)

text
abcdef

layer name,attributes,parent,enveloping,ambiguous,span count
terms,value,,,True,1


In [12]:
tekst.terms

layer name,attributes,parent,enveloping,ambiguous,span count
terms,value,,,True,1

text,value
abc,xyz
,foobar
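
To make the result above more concrete, here is a toy model of what static rules do when patterns are matched as substrings. It is a plain-Python sketch under simplifying assumptions, not the algorithm used by SubstringTagger: `tag_substrings` is a hypothetical helper, rules are given as `(pattern, attributes)` pairs, and groups, priorities and conflict resolution are left out.

```python
def tag_substrings(text, rules):
    """Map each matched (start, end) span to the attribute dicts of all
    rules whose pattern occurs at that position."""
    spans = {}
    for pattern, attributes in rules:
        start = text.find(pattern)
        while start != -1:
            span = (start, start + len(pattern))
            # Several rules with the same pattern give several annotations
            # on the same span, which is what makes the layer ambiguous.
            spans.setdefault(span, []).append(dict(attributes))
            start = text.find(pattern, start + 1)
    return spans


rules = [("abc", {"value": "xyz"}),
         ("abc", {"value": "foobar"})]
spans = tag_substrings("abcdef", rules)
```

As in the layer shown above, there is a single span for `abc`, and it carries two annotations, one per rule.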


#### DynamicExtractionRule

A DynamicExtractionRule has a decorator instead of attributes. The decorator works much like the decorator parameter of rule-based taggers, but it applies only to a specific rule rather than to all rules.

Dynamic rules are applied after static rules. Each dynamic rule only changes the annotation made by the static rule with the same pattern, group and priority. Just like the tagger decorators, the DynamicExtractionRule decorator takes text, span and annotation as arguments and returns the modified annotation.

In [13]:
from estnltk.taggers.system.rule_taggers import DynamicExtractionRule

In [14]:
rules = [StaticExtractionRule(pattern="abc",attributes={"value": "xyz"}),
         StaticExtractionRule(pattern="def",attributes={"value": "foobar"}),
         DynamicExtractionRule(pattern="def",decorator = lambda text, span, annotation: {"value": annotation['value'].upper()})
        ]

In [15]:
#Here the same pattern in two rules is allowed because one rule is static and the other is dynamic
ruleset = Ruleset(rules)

In [16]:
tagger = SubstringTagger(ruleset=ruleset,output_attributes=['value'])
tekst = Text("abcdef")
tagger.tag(tekst)

text
abcdef

layer name,attributes,parent,enveloping,ambiguous,span count
terms,value,,,True,2


In [17]:
#the dynamic rule is applied only to the corresponding static rule
tekst.terms

layer name,attributes,parent,enveloping,ambiguous,span count
terms,value,,,True,2

text,value
abc,xyz
def,FOOBAR


Dynamic rules are always applied to the corresponding static rule, but if no such rule exists, an empty one is automatically created in the background. Therefore, dynamic rules can also be used like this:

In [18]:
rules = [StaticExtractionRule(pattern="abc",attributes={"value": "xyz"}),
         DynamicExtractionRule(pattern="def",decorator = lambda text, span, annotation: {"value": text.text})
        ]
ruleset = Ruleset(rules)

tagger = SubstringTagger(ruleset=ruleset,output_attributes=['value'])
tekst = Text("abcdef")
tagger.tag(tekst)
tekst.terms

layer name,attributes,parent,enveloping,ambiguous,span count
terms,value,,,True,2

text,value
abc,xyz
def,abcdef
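
The order of operations described in this section (static annotation first, then the dynamic decorator, with an empty annotation when no static rule exists) can be summarised in a small stand-alone sketch. This is a toy model, not EstNLTK code: `apply_rules` is a hypothetical helper, rules are plain dictionaries keyed by pattern, only the first occurrence of each pattern is matched, and the decorator receives the raw string instead of a Text object.

```python
def apply_rules(text, static_rules, dynamic_rules):
    """static_rules: pattern -> attributes dict.
    dynamic_rules: pattern -> decorator(text, span, annotation)."""
    # Patterns that only have a dynamic rule are processed too.
    patterns = list(static_rules) + [p for p in dynamic_rules if p not in static_rules]
    annotations = {}
    for pattern in patterns:
        start = text.find(pattern)
        if start == -1:
            continue
        span = (start, start + len(pattern))
        # Step 1: static annotation, or an empty one if no static rule exists.
        annotation = dict(static_rules.get(pattern, {}))
        # Step 2: the dynamic rule with the same pattern rewrites it.
        decorator = dynamic_rules.get(pattern)
        if decorator is not None:
            annotation = decorator(text, span, annotation)
        annotations[span] = annotation
    return annotations


# A dynamic rule modifying the annotation of the matching static rule.
with_static = apply_rules(
    "abcdef",
    {"abc": {"value": "xyz"}, "def": {"value": "foobar"}},
    {"def": lambda text, span, annotation: {"value": annotation["value"].upper()}},
)

# A dynamic rule without a static counterpart starts from an empty annotation.
without_static = apply_rules(
    "abcdef",
    {"abc": {"value": "xyz"}},
    {"def": lambda text, span, annotation: {"value": text}},
)
```

Both calls reproduce the values of the two layers shown in this section: `FOOBAR` for the decorated static rule, and the full text `abcdef` for the dynamic rule that stands alone.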
