# How to systematically build regular expression patterns

In this notebook, we show how to systematically develop regular expression patterns that can be used for information extraction.

EstNLTK provides the class `RegexElement`, which wraps around the [regex library](https://pypi.org/project/regex/) and simplifies 
documenting and testing regex patterns. 
You can add tests for positive, negative and partial matches, and run them automatically. 
`RegexElement` subclasses add a way to construct regular expressions in a hierarchical manner together with test synthesis, which automatically combines the existing tests of sub-expressions.

As an example, let's develop a regex for extracting ingredient information from food recipes, such as:

In [1]:
example_ingredients = """
1 pakk tordipulbrit 
4 tk muna 
400 g kohupiimapastat
4 sl suhkrut
200 g hapukoort
100 g võid
"""

In [2]:
from estnltk.taggers.system.rule_taggers.regex_library.regex_element import RegexElement

# Create pattern
ARV = RegexElement('([0-9]+[.,])?[0-9]+', group_name='quantity', 
                   description='Captures integer and decimal quantities.')
# Add tests
ARV.full_match('1')
ARV.full_match('100')
ARV.full_match('1,5')
ARV.full_match('2.5')
ARV.no_match('x')
ARV.no_match('N/A')
ARV.no_match('teadmata arv')
ARV.partial_match('<=3', '3')
ARV.partial_match('~2', '2')

# Create pattern
# (Note: there's a better way to do a string choice, read about the StringList below)
KOGUS = RegexElement('(supilusika[ts]|teelusika[ts]|pakki?|grammi?|tk|sl|g)', group_name='unit',
                     description='Captures 4 types of food units, including 3 unit abbreviations.')
# Add tests
KOGUS.full_match('supilusikat')
KOGUS.full_match('teelusikas')
KOGUS.full_match('tk')
KOGUS.full_match('pakk')
KOGUS.full_match('gramm')
KOGUS.full_match('sl')
KOGUS.full_match('g')
KOGUS.no_match('tonni')
KOGUS.no_match('puuda')
KOGUS.partial_match('100g', 'g')
KOGUS.partial_match('2tk', 'tk')

# Create a pattern combining other patterns
# (a raw f-string keeps \s as a regex escape instead of a string escape)
KOOSTISOSA = RegexElement(rf'{ARV}\s*{KOGUS}\s+(?P<ingredient>[a-zöäüõšž]+)', 
                          description='Captures ingredients together with quantities and units.')
# Add tests
KOOSTISOSA.full_match('1 supilusikas suhkrut')
KOOSTISOSA.full_match('2.5 sl võid')
KOOSTISOSA.full_match('4 teelusikat siirupit')
KOOSTISOSA.full_match('300g vahukoort')
KOOSTISOSA.no_match('10 tonni telliskive')
KOOSTISOSA.partial_match('100g vahukomme ka', {'quantity':'100', 'unit':'g', 'ingredient':'vahukomme'})
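The combined pattern relies on string interpolation: placing `{ARV}` and `{KOGUS}` inside the f-string embeds the source of each sub-pattern, wrapped in its named group. The sketch below reproduces the same composition with plain strings and the standard library `re` module. It is only a stand-in for `RegexElement`, chosen so that the example runs without EstNLTK; the sub-pattern strings are copied from the definitions above.

```python
import re

# Sub-patterns written as plain strings, each wrapped in its named group
ARV = r'(?P<quantity>([0-9]+[.,])?[0-9]+)'
KOGUS = r'(?P<unit>(supilusika[ts]|teelusika[ts]|pakki?|grammi?|tk|sl|g))'

# A raw f-string interpolates the sub-patterns and keeps \s as a regex escape
KOOSTISOSA = rf'(?:{ARV}\s*{KOGUS}\s+(?P<ingredient>[a-zöäüõšž]+))'

match = re.fullmatch(KOOSTISOSA, '2.5 sl võid')
print(match.groupdict())
```

Each named group of a sub-pattern stays addressable in the combined pattern, which is what makes the dictionary form of `partial_match` possible.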

### Testing patterns

Once you've defined a pattern and added tests, you can use the method `test()` to run all tests:

In [3]:
ARV.test()
KOGUS.test()
KOOSTISOSA.test()

The method runs silently if all tests pass. 
However, if any test fails, an `AssertionError` is raised with the details of the test case on which the pattern fails. 

In a notebook, you can get a quick overview of a pattern and its testing status by displaying the `RegexElement` object:

In [4]:
ARV

Test group,passed,failed
positive examples,4,0
negative examples,3,0
extraction tests,2,0


In [5]:
KOGUS

Test group,passed,failed
positive examples,7,0
negative examples,2,0
extraction tests,2,0


In [6]:
KOOSTISOSA

Test group,passed,failed
positive examples,4,0
negative examples,1,0
extraction tests,1,0


To get **exact testing results**, use methods `evaluate_positive_examples()`, `evaluate_negative_examples()` and `evaluate_extraction_examples()`. The results will be returned in a `DataFrame`:

In [7]:
KOGUS.evaluate_positive_examples()

Unnamed: 0,Example,Status
0,supilusikat,+
1,teelusikas,+
2,tk,+
3,pakk,+
4,gramm,+
5,sl,+
6,g,+


In [8]:
KOGUS.evaluate_negative_examples()

Unnamed: 0,Example,Status
0,tonni,+
1,puuda,+


In [9]:
KOGUS.evaluate_extraction_examples()

Unnamed: 0,Example,Status
0,100g,+
1,2tk,+


Status `+` indicates a positive match (passed test) and status `F` indicates a failed match.

Under the hood, different tests use different matching strategies.

* `full_match` (positive example) requires that the pattern matches with the whole example;
* `no_match` (negative example) requires that the pattern cannot be found inside the example. In other words, the pattern must not match even with a substring of the example;
* `partial_match` (extraction example) requires that the pattern matches a substring of the example (a target); the target substring is specified as the second argument of the `partial_match`. Alternatively, you can pass a dictionary as the target, specifying exact matches required from specific capture groups.
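The three strategies can be approximated with the standard library `re` module. This is a rough sketch of the semantics described in the list above, not EstNLTK's implementation: the helper names are hypothetical, and it assumes that a partial match is judged on the first match found in the example.

```python
import re

ARV = re.compile(r'(?P<quantity>([0-9]+[.,])?[0-9]+)')

def is_full_match(pattern, example):
    # Positive example: the pattern must cover the whole string
    return pattern.fullmatch(example) is not None

def is_no_match(pattern, example):
    # Negative example: the pattern must not match any substring
    return pattern.search(example) is None

def is_partial_match(pattern, example, target):
    # Extraction example: the first match must equal the target string,
    # or its named groups must equal the values given in a dictionary
    match = pattern.search(example)
    if match is None:
        return False
    if isinstance(target, dict):
        return all(match.group(name) == value for name, value in target.items())
    return match.group(0) == target

print(is_full_match(ARV, '1,5'))
print(is_no_match(ARV, 'teadmata arv'))
print(is_partial_match(ARV, '<=3', '3'))
```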

### Examples for displaying

You can add some of the positive examples as _display examples_, which will be shown when the object is rendered in a notebook.

For instance, let's redefine KOOSTISOSA with the first 3 `full_match` tests as _display examples_:

In [10]:
KOOSTISOSA = RegexElement(rf'{ARV}\s*{KOGUS}\s+(?P<ingredient>[a-zöäüõšž]+)', 
                          description='Captures ingredients together with quantities and units.')
# Add tests & examples
KOOSTISOSA.example('1 supilusikas suhkrut')
KOOSTISOSA.example('2.5 sl võid')
KOOSTISOSA.example('4 teelusikat siirupit')
KOOSTISOSA.full_match('300g vahukoort')
KOOSTISOSA.no_match('10 tonni telliskive')
KOOSTISOSA.partial_match('100g vahukomme ka', {'quantity':'100', 'unit':'g', 'ingredient':'vahukomme'})

In [11]:
# Browse the pattern
KOOSTISOSA

Example,Status
1 supilusikas suhkrut,+
2.5 sl võid,+
4 teelusikat siirupit,+

Test group,passed,failed
positive examples,4,0
negative examples,1,0
extraction tests,1,0


Note that there are 4 positive examples: all _display examples_ were also included in positive (`full_match`) examples.

### Choice groups

Regex choice groups can contain sub-expressions with overlapping targets. 
For instance, _KOGUS_ (the unit pattern) in the previous example contains the choice sub-expressions `grammi?` and `g`, both of which can match at the start of the string `'gramm'`.
However, the extent of the match depends on the order of the sub-expressions in the group: the maximal extent is achieved only if sub-expressions capturing longest strings come first. 
The result of a wrong ordering is an incomplete match, e.g. `(g|grammi?)` matches only `'g'` inside the string `'gramm'`.
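The effect of ordering is easy to reproduce. The sketch below uses the standard library `re` module, where alternatives in a choice group are tried from left to right and the first one that succeeds wins:

```python
import re

wrong_order = re.compile(r'(g|grammi?)')
right_order = re.compile(r'(grammi?|g)')

# The short alternative comes first and wins, so the match is incomplete
print(wrong_order.search('gramm').group(0))

# The long alternative comes first; 'g' is used only as a fallback
print(right_order.search('gramm').group(0))
print(right_order.search('100g').group(0))
```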

If regex choice groups grow large and complex, it can be difficult to achieve a correct ordering. 
It requires rigorous pattern development and testing. 
However, for subsets of patterns satisfying specific constraints, the correct ordering of the sub-expressions can be guaranteed automatically.

#### StringList

Use `StringList` to make a choice group out of (a large number of) strings. It guarantees that the resulting regular expression matches even the longest string in the list. It also escapes all the meta symbols (such as `.` or `+`) and, if needed, converts the pattern to a case-insensitive format.

For instance, let's redefine KOGUS as a `StringList`, which has more units / unit abbreviations and which ignores case while matching the strings:

In [12]:
from estnltk.taggers.system.rule_taggers.regex_library.string_list import StringList

# Create pattern
KOGUS = StringList(['supilusikas', 'supilusikat', 'sl', 
                    'teelusikas', 'teelusikat', 'tl',
                    'pakk', 'pakki', 'pk', 
                    'tükk', 'tükki', 'tk', 
                    'gramm', 'grammi', 'g'], group_name='unit',
                     description='Captures 5 types of food units (incl abbreviations).',
                     ignore_case=True)
# Add tests
KOGUS.full_match('SUPILUSIKAS')
KOGUS.full_match('SupiLusiKaT')
KOGUS.full_match('teelusikas')
KOGUS.full_match('tk')
KOGUS.full_match('TK')
KOGUS.full_match('pakk')
KOGUS.full_match('PAKK')
KOGUS.full_match('gramm')
KOGUS.full_match('sl')
KOGUS.full_match('g')
KOGUS.no_match('tonni')
KOGUS.no_match('puuda')
KOGUS.partial_match('100g', 'g')
KOGUS.partial_match('2tk', 'tk')
KOGUS.partial_match('2Tk', 'Tk')

In [13]:
KOGUS

Test group,passed,failed
positive examples,10,0
negative examples,2,0
extraction tests,3,0
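The idea behind such a choice group can be sketched with the standard library `re` module. This illustrates the guarantees described above (longest strings first, escaped meta symbols, optional case insensitivity) and is not the actual `StringList` implementation; the function `build_choice_group` is hypothetical.

```python
import re

def build_choice_group(strings, ignore_case=False):
    # Longest strings first, so that 'grammi' is tried before 'gramm' and 'g';
    # ties are broken alphabetically to keep the result deterministic
    ordered = sorted(set(strings), key=lambda s: (-len(s), s))
    # Escape meta symbols so that every string is matched literally
    body = '|'.join(re.escape(s) for s in ordered)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(f'(?:{body})', flags)

units = build_choice_group(['g', 'gramm', 'grammi', 'tk', 'tükk', 'tükki'],
                           ignore_case=True)
print(units.search('200 Grammi').group(0))
print(units.search('2Tk').group(0))
```

Sorting by length is sufficient here because a string can only shadow another one if it is a shorter prefix of it.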


You can also define modifications to the strings, e.g. replace all spaces with the more general pattern `r'\s+'`. 
The parameter `replacements` lets you define a dictionary of character-to-regex replacements, which are applied to all strings. 
For instance:

In [14]:
example_ingredients = """
1 pakk tordipulbrit 
4 tk muna 
400 g kohupiimapastat
4 sl suhkrut
200 g hapukoort
100 g võid
"""

TOIDUAINED = StringList(['muna', 'tordipulbrit', 'kohupiimapastat', 'suhkrut', 'hapukoort', 'võid', 'siirupit', 
                         'valget šokolaadi', 'tumedat šokolaadi', 'maasika jäätist', 'vahukoort', 'vahukomme'], 
                         group_name='ingredient',
                         description='Captures names of food ingredients.',
                         replacements={' ' : r'\s+'}, 
                         ignore_case=True)
# Add tests & examples
TOIDUAINED.example('suhkrut')
TOIDUAINED.example('võid')
TOIDUAINED.example('siirupit')
TOIDUAINED.full_match('vahukoort')
TOIDUAINED.full_match('tumedat   šokolaad')
TOIDUAINED.full_match('valget  šokolaadi')
TOIDUAINED.no_match('telliskive')
# TODO : fix the test
TOIDUAINED

Example,Status
suhkrut,+
võid,+
siirupit,+

Test group,passed,failed
positive examples,5,1
negative examples,1,0
extraction tests,0,0


The left-hand side of each replacement rule must be a single character, which is interpreted as a plain character, not as a regex meta character. Properly escaping special symbols on the right-hand side of the rule is the responsibility of the user.
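A character-level replacement of this kind can be sketched as follows. The helper `string_to_pattern` is hypothetical and only mimics the behaviour described above using the standard library `re` module: characters found in the dictionary are swapped for the given regex fragment, all other characters are escaped and matched literally.

```python
import re

def string_to_pattern(string, replacements):
    # Characters listed in `replacements` become the given regex fragment
    # (inserted as-is, so escaping it is up to the caller);
    # every other character is escaped and matched literally
    return ''.join(replacements.get(ch, re.escape(ch)) for ch in string)

pattern = re.compile(string_to_pattern('valget šokolaadi', {' ': r'\s+'}))

# Any run of whitespace between the words is accepted
print(pattern.fullmatch('valget  šokolaadi') is not None)

# A string that lacks the final 'i' is not a full match
print(pattern.fullmatch('valget  šokolaad') is not None)
```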

#### ChoiceGroup

TODO

### Pattern truncation in the output

TODO

### Finalizing the pattern

Once you've documented and covered a regex pattern with tests, use function `str(...)` to reveal the full regular expression string:

In [15]:
str(KOOSTISOSA)

'(?:(?P<quantity>([0-9]+[.,])?[0-9]+)\\s*(?P<unit>(supilusika[ts]|teelusika[ts]|pakki?|grammi?|tk|sl|g))\\s+(?P<ingredient>[a-zöäüõšž]+))'

Use the method `compile()` to convert the pattern into a compiled `regex.Regex` object:

In [16]:
KOOSTISOSA.compile()

regex.Regex('(?:(?P<quantity>([0-9]+[.,])?[0-9]+)\\s*(?P<unit>(supilusika[ts]|teelusika[ts]|pakki?|grammi?|tk|sl|g))\\s+(?P<ingredient>[a-zöäüõšž]+))', flags=regex.V0)

You can also pass flags, such as `regex.IGNORECASE` and `regex.DOTALL`, to the `compile()` method.
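To illustrate what such a flag changes, the sketch below compiles the unit sub-pattern with and without case-insensitive matching. It uses the standard library `re.compile` as a stand-in, since the exact way `compile()` of `RegexElement` accepts flags is not shown here; the pattern string is copied from the output above.

```python
import re

UNIT = r'(?P<unit>(supilusika[ts]|teelusika[ts]|pakki?|grammi?|tk|sl|g))'

case_sensitive = re.compile(UNIT)
case_insensitive = re.compile(UNIT, re.IGNORECASE)

# Without the flag, upper-case abbreviations are rejected
print(case_sensitive.fullmatch('SL'))

# With the flag, the same pattern accepts them
print(case_insensitive.fullmatch('SL').group('unit'))
```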

### Using the pattern with RegexTagger

After creating a pattern, you can use it in a `RegexTagger`:

In [17]:
from estnltk import Text
from estnltk.taggers import RegexTagger
from estnltk.taggers.system.rule_taggers import Ruleset
from estnltk.taggers.system.rule_taggers import StaticExtractionRule

# Make rule set for RegexTagger
rule_list = [StaticExtractionRule(pattern=KOOSTISOSA.compile())]
ruleset = Ruleset()
ruleset.add_rules(rule_list)

# Use a decorator to rewrite capture group values to annotations
def decorator(layer, base_span, annotation):
    annotation['quantity'] = annotation['match'].group('quantity')
    annotation['unit'] = annotation['match'].group('unit')
    annotation['ingredient'] = annotation['match'].group('ingredient')
    return annotation

# Create RegexTagger
regex_tagger = RegexTagger(ruleset=ruleset, 
                           output_layer='food_ingredients',
                           output_attributes=['quantity', 'unit', 'ingredient'], 
                           decorator=decorator)

In [18]:
# Example for analysis
example_text = """
1 pakk tordipulbrit 
4 tk muna 
400 g kohupiimapastat
4 sl suhkrut
200 g hapukoort
100 g võid
"""

# Apply the tagger
text = Text(example_text)
regex_tagger.tag(text)
text['food_ingredients']

layer name,attributes,parent,enveloping,ambiguous,span count
food_ingredients,"quantity, unit, ingredient",,,True,6

text,quantity,unit,ingredient
1 pakk tordipulbrit,1,pakk,tordipulbrit
4 tk muna,4,tk,muna
400 g kohupiimapastat,400,g,kohupiimapastat
4 sl suhkrut,4,sl,suhkrut
200 g hapukoort,200,g,hapukoort
100 g võid,100,g,võid
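In plain regex terms, the tagger and the decorator together amount to iterating over all matches in the text and copying the values of the named groups into a record. The sketch below shows this with the standard library `re` module and the pattern string printed by `str(KOOSTISOSA)`. It does not reflect how `RegexTagger` is implemented, and it produces plain dictionaries rather than an EstNLTK layer.

```python
import re

# Pattern string as printed by str(KOOSTISOSA), split over lines for readability
PATTERN = re.compile(
    r'(?:(?P<quantity>([0-9]+[.,])?[0-9]+)\s*'
    r'(?P<unit>(supilusika[ts]|teelusika[ts]|pakki?|grammi?|tk|sl|g))\s+'
    r'(?P<ingredient>[a-zöäüõšž]+))'
)

example_text = """
1 pakk tordipulbrit
4 tk muna
400 g kohupiimapastat
4 sl suhkrut
200 g hapukoort
100 g võid
"""

# One record per match, holding only the named capture groups
rows = [match.groupdict() for match in PATTERN.finditer(example_text)]
for row in rows:
    print(row)
```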


### Limitations

The encapsulation provided by `RegexElement` makes it much safer to specify what should be matched by the regular expression, but it still has limitations. 
First, one cannot specify additional consistency constraints inside the hierarchical definition nor aggregate the contents of capture groups. If you need such features, use grammar rules instead. 
Second, self-overlapping can cause subtle errors. This is particularly true in the case of string replacement. One way to diagnose it is to compare the result of a single `regex.sub(..., count=0)` call, which replaces all occurrences, with the result of several successive `regex.sub(..., count=1)` calls and see whether they differ.
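A minimal sketch of this diagnostic, written with the standard library `re.sub` (its `count` argument has the same meaning: `0` replaces all occurrences). The helper `sub_repeatedly` and the toy pattern `aa` are illustrative assumptions, chosen because the pattern can overlap with itself.

```python
import re

def sub_repeatedly(pattern, repl, text, max_rounds=100):
    # Replace one occurrence at a time until the text stops changing
    for _ in range(max_rounds):
        new_text = re.sub(pattern, repl, text, count=1)
        if new_text == text:
            break
        text = new_text
    return text

pattern, repl, text = r'aa', 'a', 'aaaa'

all_at_once = re.sub(pattern, repl, text, count=0)
one_by_one = sub_repeatedly(pattern, repl, text)

# Different results indicate that matches overlap or that a replacement
# creates new matches
print(all_at_once, one_by_one, all_at_once != one_by_one)
```

If the two results agree on representative inputs, the replacement is less likely to be affected by self-overlapping, although agreement on a few examples is not a proof.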