# Introduction to GrammarParsingTagger

GrammarParsingTagger is a tool that lets us write a context-free grammar and apply it to a Text object, creating a new layer that contains all the matches found by the grammar. In other words, we can define the sequences of symbols that we want to extract from the text. Let's look at an example of extracting addresses from a text.

First, we need to have our Text object. Let's create one containing an address:

In [1]:
from estnltk import Text

text = Text('Jüri Homenja kontsert toimub E, 22. mai kl 18:00 kultuurimajas Veski 5, Elva, Tartumaa.')

In [2]:
text

text
"Jüri Homenja kontsert toimub E, 22. mai kl 18:00 kultuurimajas Veski 5, Elva, Tartumaa."


Next, we need to tag the symbols whose sequences we want to search for with our grammar. You can read about the different taggers [here](https://github.com/estnltk/estnltk/tree/devel_1.6/tutorials/taggers), but for this GrammarParsingTagger example we will not go into the details and will simply use an existing tagger called AddressPartTagger. AddressPartTagger needs the text to be segmented into words, which is why we tag the *words* layer on the text before applying the tagger.

In [3]:
from estnltk.taggers import AddressPartTagger

address_part_tagger = AddressPartTagger()
text.tag_layer(['words'])
address_part_tagger.tag(text)
text.address_parts

layer name,attributes,parent,enveloping,ambiguous,span count
address_parts,"grammar_symbol, type",,,True,11

text,grammar_symbol,type
,RANDOM_TEXT,
,RANDOM_TEXT,
Jüri,ASULA,asula
,ASULA,asula
,TÄNAV,tänav
22,MAJA,
18,MAJA,
00,MAJA,
kultuurimajas,RANDOM_TEXT,
Veski,ASULA,asula


Now, we can see that we have different symbols tagged on the text in the layer called *address_parts*. Some of the tagged symbols are in fact parts of the address, but others are not. To know whether the symbol is part of an address, we have to define the sequences of symbols that make up an address. These sequences are called grammar **rules**. 

## Rules and Grammar

To define rules and a grammar, we first need to import the classes Rule and Grammar.

In [4]:
from estnltk.finite_grammar import Rule, Grammar

Then we can start defining the rules. A rule consists of a left side (a non-terminal), a right side (a sequence of non-terminals and terminals), and optional parameters. In the following example, the left side is 'ADDRESS', the right side is 'TÄNAV MAJA ASULA', and the optional parameters *group* and *priority* are set. The rule says that if the symbols 'TÄNAV', 'MAJA', and 'ASULA' occur in this order, they form an 'ADDRESS'. The parameter *group* names the group to which the rule belongs. The name can be anything, but all rules that we want to put into one group must use the same *group* value. *priority* defines which of the rules belonging to the same *group* is applied if several rules match. **NB!** The **higher** the value, the **lower** the priority. Therefore, if two rules with priorities 2 and 3 both match, the rule with priority 2 is applied.

In [5]:
Rule('ADDRESS', 'TÄNAV MAJA ASULA', group='g0', priority=3)

ADDRESS -> TÄNAV MAJA ASULA	: 3, val: default_validator, dec: default_decorator, scoring: default_scoring
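
The priority convention can be illustrated without the library. The sketch below is a plain-Python analogy and not EstNLTK's conflict resolution: the `Match` tuple and the `resolve` function are made up for this example, and the sketch assumes that the given matches compete with each other within one group.

```python
from collections import namedtuple

# Made-up stand-in for a grammar match; not an EstNLTK class.
Match = namedtuple('Match', 'words group priority')

def resolve(conflicting):
    """Keep the matches with the smallest priority value (= highest priority)."""
    best = min(m.priority for m in conflicting)
    return [m for m in conflicting if m.priority == best]

# Two competing matches from the same group
long_match = Match(('TÄNAV', 'MAJA', 'ASULA'), 'g0', 2)
short_match = Match(('TÄNAV', 'MAJA'), 'g0', 3)
print(resolve([long_match, short_match]))

# With equal priority values, this sketch keeps both matches
tie = [long_match._replace(priority=3), short_match]
print(resolve(tie))
```

In the first call only the match with priority 2 remains; in the second call neither match outranks the other.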

To apply the rules, we need to create a grammar:

In [6]:
grammar = Grammar(start_symbols=['ADDRESS'], 
                  depth_limit=float('inf'), # the default
                  width_limit=float('inf'), # the default
                  legal_attributes=None # the default
                  )

And then we need to add rules to the grammar. Let's first add two rules that belong to the same group and have equal priorities:

In [7]:
grammar.add_rule('ADDRESS', 'TÄNAV MAJA ASULA', group='g0', priority=3)
grammar.add_rule('ADDRESS', 'TÄNAV MAJA',       group='g0', priority=3)
grammar


Grammar:
	start: ADDRESS
	terminals: ASULA, MAJA, TÄNAV
	nonterminals: ADDRESS
	legal attributes: frozenset()
	depth_limit: inf
	width_limit: inf
Rules:
	ADDRESS -> TÄNAV MAJA ASULA	: 3, val: default_validator, dec: default_decorator, scoring: default_scoring
	ADDRESS -> TÄNAV MAJA	: 3, val: default_validator, dec: default_decorator, scoring: default_scoring

To apply the grammar to our text, we need to create a tagger, a GrammarParsingTagger object. The tagger receives our grammar through the parameter *grammar*. *layer_of_tokens* is the name of the layer to which we want to apply the grammar, and *layer_name* is the name of the layer that GrammarParsingTagger creates.

In [8]:
from estnltk.taggers import GrammarParsingTagger

tagger = GrammarParsingTagger(grammar=grammar,
                              layer_of_tokens='address_parts',
                              layer_name='addresses', # default: 'parse'
                              output_ambiguous=True # default False, True recommended
                              )
tagger

name,output layer,output attributes,input layers
GrammarParsingTagger,addresses,(),['address_parts']

0,1
grammar,"\nGrammar:\n\tstart: ADDRESS\n\tterminals: ASULA, MAJA, TÄNAV\n\tnonterminals: ADDRESS\n ..., type: <class 'estnltk.finite_grammar.grammar.Grammar'>"
name_attribute,grammar_symbol
output_nodes,{'ADDRESS'}
resolve_support_conflicts,True
resolve_start_end_conflicts,True
resolve_terminals_conflicts,True
ambiguous,True
gap_validator,
debug,False


Then we can use the tagger to tag the text:

In [9]:
tagger.tag(text)

text
"Jüri Homenja kontsert toimub E, 22. mai kl 18:00 kultuurimajas Veski 5, Elva, Tartumaa."

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,21
compound_tokens,"type, normalized",,tokens,False,2
words,normalized_form,,,False,18
address_parts,"grammar_symbol, type",,,True,11
addresses,,,address_parts,True,3


In [10]:
text.addresses

layer name,attributes,parent,enveloping,ambiguous,span count
addresses,,,address_parts,True,3

text
"['Jüri', '22']"
"['Veski', '5']"
"['Veski', '5', 'Elva']"


The first problem that we see is that 'Jüri', '22' has been tagged as an address, although the tokens are not even next to each other in the original text. However, as the grammar looks only at the layer *address_parts* and there is nothing between the grammar symbols of these tokens, they are tagged as an address. To overcome this problem, we can use a *gap_validator* function in which we define what kind of gaps we allow between our tagged symbols. For example, we would probably want to accept a space between the parts of an address but not long sequences of words or sentences.
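
As a rough illustration of such a function, the sketch below accepts a gap only if it consists of whitespace and commas. It assumes that the validator is called with the raw text lying between two consecutive tagged symbols and returns True when the gap is acceptable. The name `allow_small_gaps` is made up for this example, and the exact signature expected by the *gap_validator* parameter should be checked against the EstNLTK documentation.

```python
import re

# Hypothetical helper, not part of EstNLTK.
# Assumption: the validator receives the text between two consecutive symbols.
_SMALL_GAP = re.compile(r'[\s,]*')

def allow_small_gaps(gap: str) -> bool:
    """Return True if the gap contains nothing but whitespace and commas."""
    return _SMALL_GAP.fullmatch(gap) is not None

# Gap between 'Veski' and '5', then between '5' and 'Elva'
print(allow_small_gaps(' '))
print(allow_small_gaps(', '))
# Gap between 'Jüri' and '22' in the example sentence
print(allow_small_gaps(' Homenja kontsert toimub E, '))
```

If the assumption about the signature holds, such a function would be passed to the tagger as `gap_validator=allow_small_gaps`.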

## Priorities

In [11]:
grammar = Grammar(start_symbols=['ADDRESS'], 
                  rules=None, # the default, deprecated
                  depth_limit=float('inf'), # the default
                  width_limit=float('inf'), # the default
                  legal_attributes=None # the default
                  )

grammar.add_rule('ADDRESS', 'TÄNAV MAJA ASULA', group='g0', priority=2)
grammar.add_rule('ADDRESS', 'TÄNAV MAJA',       group='g0', priority=3)
tagger = GrammarParsingTagger(grammar=grammar,
                              layer_of_tokens='address_parts',
                              name_attribute='grammar_symbol', # the default
                              layer_name='addresses_2', # default: 'parse'
                              attributes=(), # default: ()
                              output_ambiguous=True # default False
                              )
tagger.tag(text)
text.addresses_2

layer name,attributes,parent,enveloping,ambiguous,span count
addresses_2,,,address_parts,True,2

text
"['Jüri', '22']"
"['Veski', '5', 'Elva']"


## Decorators

In [12]:
def address_decorator(nodes):
    asula = ''
    maakond = ''
    t2nav = ''
    indeks = ''
    maja = ''
    for node in nodes:
        if node.name == 'ASULA':
            asula = node.text
        elif node.name == 'TÄNAV':
            t2nav = node.text
        elif node.name == 'MAAKOND':
            maakond = node.text
        elif node.name == 'MAJA':
            maja = node.text
        elif node.name == 'INDEKS':
            indeks = node.text
    return {'grammar_symbol': 'ADDRESS',
            'ASULA': asula,
            'TÄNAV': t2nav,
            'INDEKS': indeks,
            'MAAKOND': maakond,
            'MAJA': maja}

grammar = Grammar(start_symbols=['ADDRESS'], 
                  rules=None, # the default, deprecated
                  depth_limit=float('inf'), # the default
                  width_limit=float('inf'), # the default
                  legal_attributes=['INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA']
                  )

grammar.add_rule('ADDRESS', 'TÄNAV MAJA ASULA', group='g0', priority=3, decorator=address_decorator)
grammar.add_rule('ADDRESS', 'TÄNAV MAJA',       group='g0', priority=3, decorator=address_decorator)
tagger = GrammarParsingTagger(grammar=grammar,
                              layer_of_tokens='address_parts',
                              name_attribute='grammar_symbol',
                              layer_name='addresses_3',
                              attributes=('INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA'),
                              output_nodes=None,
                              resolve_support_conflicts=True,
                              resolve_start_end_conflicts=True,
                              resolve_terminals_conflicts=True,
                              output_ambiguous=False # default False
                              )
tagger.tag(text)
text.addresses_3

layer name,attributes,parent,enveloping,ambiguous,span count
addresses_3,"INDEKS, grammar_symbol, MAJA, TÄNAV, MAAKOND, ASULA",,address_parts,False,3

text,INDEKS,grammar_symbol,MAJA,TÄNAV,MAAKOND,ASULA
"['Jüri', '22']",,ADDRESS,22,Jüri,,
"['Veski', '5']",,ADDRESS,5,Veski,,
"['Veski', '5', 'Elva']",,ADDRESS,5,Veski,,Elva

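Judging by the cell above, a decorator receives the nodes matched by a rule and returns a dictionary whose keys appear as attributes of the output layer. The sketch below runs a condensed, hypothetical variant of `address_decorator` on stand-in objects, so it can be tried without a tagged text. The stand-ins carry only the `name` and `text` attributes that the decorator reads, and `compact_address_decorator` is a name made up for this example.

```python
from types import SimpleNamespace

def compact_address_decorator(nodes):
    """Condensed variant of address_decorator: one attribute per symbol name."""
    attributes = {name: '' for name in ('ASULA', 'TÄNAV', 'INDEKS', 'MAAKOND', 'MAJA')}
    for node in nodes:
        if node.name in attributes:
            attributes[node.name] = node.text
    attributes['grammar_symbol'] = 'ADDRESS'
    return attributes

# Stand-in objects with only the two attributes the decorator reads
nodes = [SimpleNamespace(name='TÄNAV', text='Veski'),
         SimpleNamespace(name='MAJA', text='5'),
         SimpleNamespace(name='ASULA', text='Elva')]
result = compact_address_decorator(nodes)
print(result)
```

Symbols that do not occur in the match, here 'INDEKS' and 'MAAKOND', keep the empty string, which corresponds to the empty cells in the table above.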

## Validators

In [13]:
text = Text('Inimesed, kes töötavad Tartus Ülikooli 5, Elva haiglas \
            ja Tõravere observatooriumis, söövad esmaspäeviti õunu.').tag_layer(['words'])
address_part_tagger.tag(text)

grammar = Grammar(start_symbols=['ADDRESS'],
                  legal_attributes=['INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA']
                  )

grammar.add_rule('ADDRESS', 'TÄNAV MAJA ASULA', group='g0', priority=3, decorator=address_decorator)
grammar.add_rule('ADDRESS', 'TÄNAV MAJA',       group='g0', priority=3, decorator=address_decorator)
tagger = GrammarParsingTagger(grammar=grammar,
                              layer_of_tokens='address_parts',
                              name_attribute='grammar_symbol',
                              layer_name='addresses_4',
                              attributes=('INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA')
                              )
tagger.tag(text)
text.addresses_4

layer name,attributes,parent,enveloping,ambiguous,span count
addresses_4,"INDEKS, grammar_symbol, MAJA, TÄNAV, MAAKOND, ASULA",,address_parts,False,2

text,INDEKS,grammar_symbol,MAJA,TÄNAV,MAAKOND,ASULA
"['Ülikooli', '5']",,ADDRESS,5,Ülikooli,,
"['Ülikooli', '5', 'Elva']",,ADDRESS,5,Ülikooli,,Elva


In [14]:
text = Text('Inimesed, kes töötavad Tartus Ülikooli 5, Elva haiglas \
            ja Tõravere observatooriumis, söövad esmaspäeviti õunu.').tag_layer(['words'])
address_part_tagger.tag(text)

town_streets = {'Elva': {'Veski', 'Tuletõrje'},
                'Tartu': {'Veski', 'Ülikooli'}}

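def validator(node):
    # node holds the symbols matched by 'TÄNAV MAJA ASULA', in that order
    street = node[0].text
    town = node[2].text
    # Reject the match if the town is known but does not contain this street
    if town in town_streets:
        return street in town_streets[town]
    # Towns that are not in town_streets are accepted as they are
    return True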

grammar = Grammar(start_symbols=['ADDRESS'], 
                  legal_attributes=['INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA']
                  )

grammar.add_rule('ADDRESS', 'TÄNAV MAJA ASULA', group='g0', priority=3, decorator=address_decorator, validator=validator)
grammar.add_rule('ADDRESS', 'TÄNAV MAJA',       group='g0', priority=3, decorator=address_decorator)
tagger = GrammarParsingTagger(grammar=grammar,
                              layer_of_tokens='address_parts',
                              name_attribute='grammar_symbol',
                              layer_name='addresses_4',
                              attributes=('INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA'),
                              output_ambiguous=True
                              )
tagger.tag(text)
text.addresses_4

layer name,attributes,parent,enveloping,ambiguous,span count
addresses_4,"INDEKS, grammar_symbol, MAJA, TÄNAV, MAAKOND, ASULA",,address_parts,True,1

text,INDEKS,grammar_symbol,MAJA,TÄNAV,MAAKOND,ASULA
"['Ülikooli', '5']",,ADDRESS,5,Ülikooli,,

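In the cell above, the candidate 'Ülikooli 5 Elva' disappears because the validator attached to the three-symbol rule rejects it: 'Ülikooli' is not listed among the streets of Elva. The check can be tried in isolation on stand-in objects. `fake_match` is a helper made up for this sketch, and the only assumption about real nodes is that indexing them yields children with a `text` attribute, which is what the validator above relies on.

```python
from types import SimpleNamespace

town_streets = {'Elva': {'Veski', 'Tuletõrje'},
                'Tartu': {'Veski', 'Ülikooli'}}

def validator(node):
    street = node[0].text
    town = node[2].text
    if town in town_streets:
        return street in town_streets[town]
    return True

def fake_match(*texts):
    """Made-up stand-in for a matched node: a list of objects with .text."""
    return [SimpleNamespace(text=t) for t in texts]

rejected = validator(fake_match('Ülikooli', '5', 'Elva'))   # street not in Elva
accepted = validator(fake_match('Veski', '5', 'Elva'))      # street listed for Elva
unknown = validator(fake_match('Pikk', '1', 'Tallinn'))     # town not in the table
print(rejected, accepted, unknown)
```

Only the first call returns False. A town that is missing from `town_streets` is accepted, because the validator has nothing to check it against.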

## `SEQ` rules

In [15]:
def address_decorator(nodes):
    asula = ''
    maakond = ''
    t2nav = ''
    indeks = ''
    maja = ''
    for node in nodes:
        if node.name == 'ASULA':
            asula = node.text
        elif node.name == 'TÄNAV':
            t2nav = node.text
        elif node.name == 'MAAKOND':
            maakond = node.text
        elif node.name == 'MAJA':
            maja = node.text
        elif node.name == 'SEQ(MAJA)':
            maja = [n.text for n in node.support]
        elif node.name == 'INDEKS':
            indeks = node.text
    return {'grammar_symbol': 'ADDRESS',
            'ASULA': asula,
            'TÄNAV': t2nav,
            'INDEKS': indeks,
            'MAAKOND': maakond,
            'MAJA': maja}

text = Text('Veekatkestus Tartu Veski tänava majades 3, 5, 7.').tag_layer(['words'])
address_part_tagger.tag(text)

grammar = Grammar(start_symbols=['ADDRESS'],
                  legal_attributes=['INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA']
                  )
def scoring(node):
    # Score a match by the number of symbols in its SEQ(MAJA) part (the third symbol of the rule)
    return len(node[2].support)

grammar.add_rule('ADDRESS', 'ASULA TÄNAV SEQ(MAJA)', group='g0', priority=3, decorator=address_decorator, scoring=scoring)
tagger = GrammarParsingTagger(grammar=grammar,
                              layer_of_tokens='address_parts',
                              name_attribute='grammar_symbol',
                              layer_name='addresses_4',
                              attributes=('INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA'),
                              output_ambiguous=True
                              )
tagger.tag(text)
text.addresses_4

layer name,attributes,parent,enveloping,ambiguous,span count
addresses_4,"INDEKS, grammar_symbol, MAJA, TÄNAV, MAAKOND, ASULA",,address_parts,True,3

text,INDEKS,grammar_symbol,MAJA,TÄNAV,MAAKOND,ASULA
"['Tartu', 'Veski', '3']",,ADDRESS,['3'],Veski,,Tartu
"['Tartu', 'Veski', '3', '5']",,ADDRESS,"['3', '5']",Veski,,Tartu
"['Tartu', 'Veski', '3', '5', '7']",,ADDRESS,"['3', '5', '7']",Veski,,Tartu


## `MSEQ` rules

An MSEQ rule matches only the maximal (longest) sequence of the given symbol, instead of every shorter sequence as well.
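
The difference from SEQ can be pictured on a plain list of symbols. The sketch below is an analogy in ordinary Python rather than the library's parsing code, and `seq_spans` and `mseq_spans` are made-up names: the first lists every contiguous run of a symbol, the second only the runs that cannot be extended.

```python
def seq_spans(symbols, target):
    """All contiguous runs of `target`, for every start and every length."""
    spans = []
    for start in range(len(symbols)):
        end = start
        while end < len(symbols) and symbols[end] == target:
            end += 1
            spans.append((start, end))
    return spans

def mseq_spans(symbols, target):
    """Only the runs of `target` that cannot be extended to the left or right."""
    spans = []
    start = None
    # The trailing None closes a run that reaches the end of the list
    for i, symbol in enumerate(list(symbols) + [None]):
        if symbol == target and start is None:
            start = i
        elif symbol != target and start is not None:
            spans.append((start, i))
            start = None
    return spans

symbols = ['ASULA', 'TÄNAV', 'MAJA', 'MAJA', 'MAJA']
print(seq_spans(symbols, 'MAJA'))
print(mseq_spans(symbols, 'MAJA'))
```

Three of the `seq_spans` results start right after 'TÄNAV', which corresponds to the three spans in the SEQ output above, whereas `mseq_spans` leaves a single span covering all three house numbers.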

In [16]:
def address_decorator(nodes):
    asula = ''
    maakond = ''
    t2nav = ''
    indeks = ''
    maja = ''
    for node in nodes:
        if node.name == 'ASULA':
            asula = node.text
        elif node.name == 'TÄNAV':
            t2nav = node.text
        elif node.name == 'MAAKOND':
            maakond = node.text
        elif node.name == 'MAJA':
            maja = node.text
        elif node.name == 'MSEQ(MAJA)':
            maja = [n.text for n in node.support]
        elif node.name == 'INDEKS':
            indeks = node.text
    return {'grammar_symbol': 'ADDRESS',
            'ASULA': asula,
            'TÄNAV': t2nav,
            'INDEKS': indeks,
            'MAAKOND': maakond,
            'MAJA': maja}

text = Text('Veekatkestus Tartu Veski tänava majades 3, 5, 7.').tag_layer(['words'])
address_part_tagger.tag(text)

grammar = Grammar(start_symbols=['ADDRESS'],
                  legal_attributes=['INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA']
                  )
def scoring(node):
    # Score a match by the number of symbols in its MSEQ(MAJA) part (the third symbol of the rule)
    return len(node[2].support)

grammar.add_rule('ADDRESS', 'ASULA TÄNAV MSEQ(MAJA)', group='g0', priority=3, decorator=address_decorator, scoring=scoring)
tagger = GrammarParsingTagger(grammar=grammar,
                              layer_of_tokens='address_parts',
                              name_attribute='grammar_symbol',
                              layer_name='addresses_4',
                              attributes=('INDEKS', 'grammar_symbol', 'MAJA', 'TÄNAV', 'MAAKOND', 'ASULA'),
                              output_ambiguous=True
                              )
tagger.tag(text)
text.addresses_4

layer name,attributes,parent,enveloping,ambiguous,span count
addresses_4,"INDEKS, grammar_symbol, MAJA, TÄNAV, MAAKOND, ASULA",,address_parts,True,1

text,INDEKS,grammar_symbol,MAJA,TÄNAV,MAAKOND,ASULA
"['Tartu', 'Veski', '3', '5', '7']",,ADDRESS,"['3', '5', '7']",Veski,,Tartu
