#### Tagging tokens on text

To use a grammar for information extraction, we first need to split the text into tokens that serve as terminal symbols in our grammar. These tokens do not need to coincide with words: a token can be a whole word, a part of a word, or a sequence of several words.
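
As a minimal illustration of tokens that do not line up with whitespace-separated words, the sketch below tags one "word" from the corpus with plain `re`. The symbol names and both patterns are simplified assumptions made for this example; they are not the expressions used later in this tutorial.

```python
import re

# Simplified, illustrative patterns (assumptions, not the tutorial's vocabulary).
TOKEN_PATTERNS = [
    ("NUMBER", r"\d+(?:,\d+)?"),
    ("UNIT", r"ng/ml"),
]

def tag_tokens(text):
    """Return (symbol, start, end, matched_text) tuples sorted by start."""
    found = []
    for symbol, pattern in TOKEN_PATTERNS:
        for m in re.finditer(pattern, text):
            found.append((symbol, m.start(), m.end(), m.group(0)))
    return sorted(found, key=lambda t: t[1])

# One whitespace-delimited "word" yields two tokens.
tokens = tag_tokens("0,83ng/ml")
```

Here the single string `0,83ng/ml` is split into a number token and a unit token, which is the kind of sub-word tokenisation the grammar needs.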

To tag the tokens on text, we can use RegexTagger. Its full tutorial can be found [here](https://github.com/estnltk/estnltk/blob/devel_1.6/tutorials/raw_text_taggers/regex_tagger.ipynb), but let's look at an example here as well.

Let's use the following sentences as an example corpus:

In [1]:
lines = ['PSA 03042012 - 0,83ng/ml perearsti poolt .',
 'PSA 2010. 3ng/ml, PSA 2012. 1,53ng/ml . - Bx va',
 'PSA 20105,99 ja 26.01.2012 uuesti .',
 'PSA 2011 oli 0 , 4 nG7ml .',
 'PSA 201222,25ng/ml',
 'PSA 2 aastajooksuldünaamikata , eriuuring',
 ':psa 16,81! ! ! ! ! ,',
 'Happe-aluse tasakaal 6.0 ( 5.0 .. 8.0 )',
 'loli 25 mgx1 ja Monoprili 10 mg Kolesterool 2011a',
 'Kolesterool 1k aastas .',
 'Kõrgenenud kolesterool 2a ( mõõdetud ). Ei pea dieetist kinni',
 'Kontr Verekol 08.12a Per-le juurde .',
 's vas munajuha kasvaja op , günekol 3a tagssi .',
 '08.11.2010 PSA 13.12.2011 7,2ng/ml PSADT on väike .',
 'Rütmihäire tsüklipikkus 330 msek',
 'Loote pikkus : \xa0 3 mm - vastab\xa0 5 nädalat 6 päeva.',
 'Põhjendus: PALAT 10 # ALAT maksanäitaja',
 'ärme vähk 2007 aastast cT3N0M0PSA 59ng/ml .',
 'PSA 8,5( püsib aastaid selles väärtus',
 'S , P-PSA 4.130( <4.100 µg/L )',
 'PSA 5,2.',
 'Kolesterool oli 7,9 mmol/l 0',
 'kolesterool 6.4.',
 'Kolesterool 5,2 mmol/l - esialgu dieet .',
 'SK 3900 g , SP 51 cm .',
 'Lapse kaal 5,4 kg/82 mg/0,82 ml i/m .',
 'Kehakaal 80,2 kg , KMI 25,9',
 'S , P-NT-proBNP 668 ( <125 pg/mL ) S , P-Albumiin 43 ( 35 .. 52 g/L ) S , P-ALAT 25 ( <33 U/L )',
 'PSA 6,5 ng/ml, eesnäärme maht67cm3',
 'rjeldus : Siinusbradükardia Fr 587min']

Assuming that we want to extract the measurements from the texts, we have to decide what kind of tokens we need to define for that. As there are different measurements in the sentences (PSA, cholesterol, etc.), we need to find the **measurement object**. We also need the measurement itself, which is expressed as a **number**. However, not every number following a measurement object is actually a measurement. Looking at the examples above, we can deal with these problems in the following ways:

1) A **number** following a **measurement object** that does not express this measurement: we can also extract **units** and check whether the unit and the measurement object are in agreement (e.g. in "Kõrgenenud kolesterool 2a", cholesterol is not 2, because 'a' is not a unit that can measure cholesterol but signifies time).

2) **Dates** right after the measurement objects in texts: we should also tag **dates** in addition to **numbers**. 
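
The unit agreement check from point 1 can be sketched as a simple lookup. The table `COMPATIBLE_UNITS` and the function `is_measurement` are hypothetical helpers written for this illustration; the unit type names are assumptions and only imitate a naming style such as `psa_unit` or `time_unit`.

```python
# Hypothetical compatibility table: measurement object -> unit types that can
# express it. The table and all names in it are assumptions for illustration.
COMPATIBLE_UNITS = {
    "psa": {"psa_unit"},
    "kolesterool": {"chol_unit"},
    "pulss": {"pulss_unit"},
}

def is_measurement(mo_value, unit_type=None):
    """A number counts as a measurement if it has no unit or a compatible one."""
    if unit_type is None:
        return True
    return unit_type in COMPATIBLE_UNITS.get(mo_value, set())

# "Kõrgenenud kolesterool 2a": 'a' is a time unit, so 2 is not a cholesterol value.
rejected = is_measurement("kolesterool", "time_unit")
# "Kolesterool 5,2 mmol/l": mmol/l is a cholesterol unit.
accepted = is_measurement("kolesterool", "chol_unit")
```

Treating a missing unit as acceptable is a design choice of this sketch; a stricter grammar could require a unit for some measurement objects.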

So, we should define appropriate regular expressions for all the symbols mentioned. We can do this in a CSV file, where we can also describe the attributes that we want to add to the tagged tokens.

However, as we can see from the examples, tagging dates and numbers is not a trivial task: sometimes there is a space inside the number before and/or after the decimal separator; in other cases, there is no space between a date and a number expressing a measurement. To deal with the latter cases, we will add another symbol type, **datenum**, because these are the only cases where we want to allow a digit directly before a **number** (the digit being part of the date).
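
To make the **datenum** idea concrete, here is a rough sketch with plain `re`. The pattern `DATENUM` and the helper `split_datenum` are assumptions made for this example; the actual expressions in the vocabulary file may look quite different.

```python
import re

# Illustrative pattern (an assumption, not the expression in the vocabulary
# file): a four-digit year immediately followed by a (decimal) number.
DATENUM = re.compile(r"(?P<date>(?:19|20)\d{2})(?P<number>\d{1,3}(?:,\d+)?)")

def split_datenum(text):
    """Return (date, number) if a year is glued to a number, else None."""
    m = DATENUM.search(text)
    return (m.group("date"), m.group("number")) if m else None

glued = split_datenum("PSA 201222,25ng/ml")
also_glued = split_datenum("PSA 20105,99 ja 26.01.2012 uuesti .")
spaced = split_datenum("PSA 2011 oli 0 , 4 nG7ml .")
```

The first two corpus lines contain a year glued to a value, so they are split into a date part and a number part; the third line has a space after the year and yields no datenum match.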

Let's look at the regexes defined for the examples file:

In [2]:
from pandas import read_csv

In [3]:
vocabulary = read_csv("regexes_fixed.csv", na_filter=False, index_col=False, encoding='utf8')

In [4]:
vocabulary

Unnamed: 0,_regex_pattern_,_group_,_priority_,normalized,regex_type,value,grammar_symbol
0,((K|k)olesteroo?l|KOLESTEROOL|(K|k)olester|Cho...,0,-1,,measurement_object,kolesterool,MO
1,(((S|s)iinus)?r.tm(iline|ilised)?|[Ff]rekv?(en...,0,-1,,measurement_object,pulss,MO
2,((([Ss]ünni)|([Kk]eha))?(p|P)ikkus|PIKKUS|pikk...,0,-1,,measurement_object,pikkus,MO
3,(psa|Psa|S-PSA|[Pp]rostataspetsiifiline\s*anti...,0,-1,,measurement_object,psa,MO
4,((((K|k)eha)|((S|s)ünni))?(K|k)aal(uga)?|kAAL|...,0,-1,,measurement_object,kaal,MO
5,ALAT,0,-1,,measurement_object,alat,MO
6,(ng/mL|ng/L|mk(ro)?g/[Ll]|ng/\s*ml|ng7ml|mg/ml...,0,1,,psa_unit,lambda m: re.search('(ng/mL|ng/L|mk(ro)?g/[Ll]...,UNIT
7,(a[. ]|k[. ]|aasta|kuu|nädal|[Xx]|kord|min|mse...,0,1,,time_unit,x,UNIT
8,((mmoo?l?i?|mm|MMOL|mol)(\s*[-/]\s*(L|l))?|MMO...,0,1,,chol_unit,x,UNIT
9,(((l|x|X|lööki))\s*/?\s*(1\s*)?min(utis)?)|/mi...,0,1,,pulss_unit,x,UNIT


The *vocabulary* table can contain several regular expressions for one token type or grammar symbol. For example, there are currently 6 different measurement objects, each with a different *value* attribute, and 5 regular expressions for dates, which all have *partial_date* as their value.
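
The following sketch shows how several rows can map to the same grammar symbol. It uses the standard `csv` module and a tiny in-memory file instead of `regexes_fixed.csv`; the three rows are simplified assumptions, and only the column names follow the table shown above.

```python
import csv
import io

# A tiny stand-in for a vocabulary file. The rows are simplified assumptions;
# only the column names follow the tutorial's table.
VOCABULARY_CSV = '''_regex_pattern_,_group_,_priority_,regex_type,value,grammar_symbol
(psa|Psa|PSA),0,-1,measurement_object,psa,MO
ALAT,0,-1,measurement_object,alat,MO
ng/ml,0,1,psa_unit,ng/ml,UNIT
'''

rows = list(csv.DictReader(io.StringIO(VOCABULARY_CSV)))

# Group the patterns by the grammar symbol they produce.
patterns_by_symbol = {}
for row in rows:
    patterns_by_symbol.setdefault(row["grammar_symbol"], []).append(row["_regex_pattern_"])
```

Two different patterns end up under `MO`, while their *value* columns (`psa`, `alat`) still tell the measurement objects apart.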

In addition to the *value* attribute, which helps us understand which symbols together make up a measurement in the grammar (e.g. psa and psa_unit are compatible), there are also the columns _priority_ and _group_. _priority_ defines which regular expression should be matched on the text if several regexes would match the same character(s). _group_ defines the regex match group that should be extracted.
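
To make the _group_ column more concrete: a match group is a numbered part of a regular expression match, with group 0 standing for the whole match. The sketch below uses plain `re` and an illustrative pattern (an assumption, not one from the vocabulary) to show how the chosen group changes the extracted span.

```python
import re

# Illustrative pattern (an assumption): a number followed by the unit ng/ml.
pattern = re.compile(r"(\d+(?:,\d+)?)\s*(ng/ml)")
m = pattern.search("PSA 6,5 ng/ml")

whole = (m.start(0), m.end(0), m.group(0))   # group 0: the entire match
number = (m.start(1), m.end(1), m.group(1))  # group 1: only the number
```

With group 0 the token would cover the number and the unit together; with group 1 only the number would be extracted, while the unit still serves as required context.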

Now, we should define the RegexTagger(s) to tag the example sentences with the symbols. To keep things simple, let's make only one RegexTagger that tags all the symbols and adds the attributes from the _vocabulary_ file to them:

In [5]:
from estnltk.taggers import RegexTagger

In [6]:
regex_tagger = RegexTagger(vocabulary=vocabulary[:30],
                           attributes=['regex_type', 'value', 'grammar_symbol'],
                           conflict_resolving_strategy='ALL',
                           overlapped=True,
                           layer_name='type')

In [7]:
regex_tagger

name,layer,attributes,depends_on
RegexTagger,type,"[regex_type, value, grammar_symbol]",[]

0,1
conflict_resolving_strategy,ALL
overlapped,True


In [8]:
regex_tagger._vocabulary

[{'_group_': 0,
  '_priority_': -1,
  '_regex_pattern_': regex.Regex('((K|k)olesteroo?l|KOLESTEROOL|(K|k)olester|Chol|(K|k)olest?|kol|chol|CHol|CHL|KOL|Kol|cHOL|CHOL|ÜK|ük|Ük)', flags=regex.V0),
  '_validator_': <function estnltk.taggers.raw_text_tagging.regex_tagger.RegexTagger._read_expression_vocabulary.<locals>.<lambda>>,
  'grammar_symbol': 'MO',
  'regex_type': 'measurement_object',
  'value': 'kolesterool'},
 {'_group_': 0,
  '_priority_': -1,
  '_regex_pattern_': regex.Regex('(((S|s)iinus)?r.tm(iline|ilised)?|[Ff]rekv?(ents)?|fr\\.?|Fr|BPM|bpm|SR|SLS|FR|HR|(P|p)ulss(i)?|Ps)(\\s*[xX]\\s*)?', flags=regex.V0),
  '_validator_': <function estnltk.taggers.raw_text_tagging.regex_tagger.RegexTagger._read_expression_vocabulary.<locals>.<lambda>>,
  'grammar_symbol': 'MO',
  'regex_type': 'measurement_object',
  'value': 'pulss'},
 {'_group_': 0,
  '_priority_': -1,
  '_regex_pattern_': regex.Regex('((([Ss]ünni)|([Kk]eha))?(p|P)ikkus|PIKKUS|pikkusega|[^A-Z]SP|sp|pikk|kasv|Kasv)', flags=r

With 'ALL' as the conflict_resolving_strategy, all possible matches of the different regular expressions are returned, even the conflicting or overlapping ones. overlapped=True ensures that we also get overlapping matches of one and the same regular expression.
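
What "overlapping matches of the same regular expression" means can be shown with the standard library alone. The lookahead trick below is only a stand-in chosen for this illustration; it is not a description of how RegexTagger produces its matches.

```python
import re

text = "12 , 53"

# Ordinary scanning consumes characters, so matches never overlap.
non_overlapping = [m.group(0) for m in re.finditer(r"\d+", text)]

# A lookahead with a capturing group reports a match at every start position,
# which is one way to emulate overlapping matches with the standard library.
overlapping = [m.group(1) for m in re.finditer(r"(?=(\d+))", text)]
```

The second list also contains matches that start inside an earlier match, which is why a later step (the grammar) has to choose between competing readings.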

In addition to the defined symbols, we want to check whether there is **any other text** between the tagged symbols: if a **measurement object** is at the beginning of a sentence and a **number** comes 20 tokens later at the end of the sentence, the number might not be the measurement, even though no other defined symbols appear between them. For this, we need to define another tagger, GapTagger:

In [9]:
import re

As we do not want every space and other stray character to be considered "text between symbols", we exclude those with a trim function that removes the characters that are accepted between symbols. These are mostly punctuation marks, but here the verbs *on* and *oli* are included as well:

In [10]:
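def trim(text):
    # Raw strings keep the regex escapes (\s, \.) from being read as
    # (invalid) Python string escapes.
    # Strip accepted filler from the start of the gap: punctuation,
    # whitespace and the verbs "on"/"oli".
    t_1 = re.sub(r'^[-=.>< ]*', '', text)
    t_1 = re.sub(r'^\.?\s*-?\s*', '', t_1)
    t_1 = re.sub(r'^[-=.>< ]*(on|oli)\s*', '', t_1)
    t_1 = re.sub(r'^\s*-?:?\s*<?', '', t_1)
    t_1 = re.sub(r'^\s*', '', t_1)
    # Strip the same filler from the end of the gap.
    t_1 = re.sub(r'[-=.>< ]*$', '', t_1)
    t_1 = re.sub(r'\.?\s*-?\s*$', '', t_1)
    t_1 = re.sub(r'[-=.>< ]*(on|oli)\s*$', '', t_1)
    t_1 = re.sub(r'\s*-?:?\s*<?$', '', t_1)
    t_1 = re.sub(r'\s*$', '', t_1)

    return t_1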

We also add a decorator that gives the gaps a grammar_symbol attribute, here it is named **RANDOM_TEXT**:

In [11]:
def decorator(text:str):
    return {'gap_length':len(text), 'grammar_symbol': 'RANDOM_TEXT'}

In [12]:
from estnltk.taggers.gaps_tagging.gap_tagger import GapTagger

In [13]:
gap_tagger = GapTagger(layer_name='gaps',
                       input_layers=['type'],
                       trim=trim, 
                       decorator=decorator,
                       attributes=['gap_length', 'grammar_symbol'])
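
Conceptually, a gap is a stretch of text that no tagged span covers and that is not empty after trimming. The sketch below illustrates this with plain Python; `find_gaps` is a hypothetical helper, the span list is typed in by hand, and `strip(" .")` is a much simplified stand-in for the trim function. It is not the GapTagger implementation.

```python
def find_gaps(text, spans):
    """Yield (start, end) for stretches of text not covered by any span.

    `spans` is a list of (start, end) pairs and may contain overlaps.
    """
    position = 0
    for start, end in sorted(spans):
        if start > position:
            yield position, start
        position = max(position, end)
    if position < len(text):
        yield position, len(text)

text = "PSA 2010. 3ng/ml, PSA"
# Hand-written spans for illustration; (12, 13) overlaps with (11, 16).
spans = [(0, 3), (4, 8), (10, 11), (11, 16), (12, 13), (18, 21)]
gaps = [(s, e, text[s:e]) for s, e in find_gaps(text, spans)]

# Simplified stand-in for trim(): strip only spaces and periods,
# then drop the gaps that become empty.
kept = [t.strip(" .") for _, _, t in gaps if t.strip(" .")]
```

Only the comma survives the trimming here; the space and the period after the year are treated as acceptable filler between symbols.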

If we want the symbols tagged with different taggers to be in the same layer, we can create and use a MergeTagger. Here, it takes the layer **type** created with regex_tagger and the layer **gaps** created with gap_tagger and merges them into the **grammar_tags** layer.

In [14]:
from estnltk.taggers import MergeTagger

In [15]:
merge_tagger = MergeTagger(layer_name='grammar_tags',
                           input_layers=['type', 'gaps'],
                           attributes=('grammar_symbol', 'value'))
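
One way to picture the result of merging is a single list of spans ordered by position. The span tuples below are hypothetical records written for this illustration, and sorting a combined list is only a mental model; it is not a claim about how MergeTagger works internally.

```python
# Hypothetical span records: (start, end, grammar_symbol).
type_spans = [(0, 3, "MO"), (4, 8, "DATE"), (10, 11, "NUMBER"), (11, 16, "UNIT")]
gap_spans = [(16, 17, "RANDOM_TEXT")]

# Combine the spans of both layers and order them by position.
merged = sorted(type_spans + gap_spans)
symbols = [symbol for _, _, symbol in merged]
```

The resulting sequence of grammar symbols is what a grammar would later consume as its terminal symbols.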

Next, we need to run all the taggers on our examples in the correct order. GapTagger uses the layer from RegexTagger as its input, so it must run second; MergeTagger needs all the previous layers, so it must run last:

In [16]:
from estnltk import Text

In [17]:
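tagged_lines = []
for line in lines:
    text = Text(line)
    # The order matters: each tagger relies on the layers created before it.
    regex_tagger.tag(text)   # creates the 'type' layer
    gap_tagger.tag(text)     # creates the 'gaps' layer from 'type'
    merge_tagger.tag(text)   # merges 'type' and 'gaps' into 'grammar_tags'
    tagged_lines.append(text)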

And now we can see the layers tagged on the texts:

In [18]:
tagged_lines[1]

text
"PSA 2010. 3ng/ml, PSA 2012. 1,53ng/ml . - Bx va"

layer name,attributes,parent,enveloping,ambiguous,span count
gaps,"gap_length, grammar_symbol",,,False,3
grammar_tags,"grammar_symbol, value",,,False,17
type,"regex_type, value, grammar_symbol",,,False,14


In [19]:
tagged_lines[1].type

layer name,attributes,parent,enveloping,ambiguous,span count
type,"regex_type, value, grammar_symbol",,,False,14

text,start,end,regex_type,value,grammar_symbol
PSA,0,3,measurement_object,psa,MO
2010,4,8,date9,partial_date,DATE
3,10,11,anynumber,3,NUMBER
ng/ml,11,16,psa_unit,ng/ml,UNIT
g,12,13,kaal_unit,x,UNIT
m,14,15,pikkus_unit,x,UNIT
PSA,17,21,measurement_object,psa,MO
PSA,18,21,measurement_object,psa,MO
2012,22,26,date9,partial_date,DATE
"1,53",28,32,anynumber,1.53,NUMBER


In [20]:
tagged_lines[1].gaps

layer name,attributes,parent,enveloping,ambiguous,span count
gaps,"gap_length, grammar_symbol",,,False,3

text,start,end,gap_length,grammar_symbol
",",16,17,1,RANDOM_TEXT
B,42,43,1,RANDOM_TEXT
va,45,47,2,RANDOM_TEXT


In [21]:
tagged_lines[1].grammar_tags

layer name,attributes,parent,enveloping,ambiguous,span count
grammar_tags,"grammar_symbol, value",,,False,17

text,start,end,grammar_symbol,value
PSA,0,3,MO,psa
2010,4,8,DATE,partial_date
3,10,11,NUMBER,3
ng/ml,11,16,UNIT,ng/ml
g,12,13,UNIT,x
m,14,15,UNIT,x
",",16,17,RANDOM_TEXT,
PSA,17,21,MO,psa
PSA,18,21,MO,psa
2012,22,26,DATE,partial_date


### TESTING

In [22]:
import unittest

In [23]:
def test_function(line):
    # Return the texts of all spans tagged as NUMBER, DATE or DATENUM.
    line = Text(line)
    regex_tagger.tag(line)
    numbers = []
    for n, r in zip(line.type.text, line.type.grammar_symbol):
        if r in ('NUMBER', 'DATE', 'DATENUM'):
            numbers.append(n)
    return numbers

In [24]:
test_function('PSA 12 , 53')

['12 , ', '12 , 53', '53']

In [25]:
def more_complex_test_function(line):
    # Return the whole 'type' layer so that tests can inspect
    # the attributes and positions of individual spans.
    line = Text(line)
    regex_tagger.tag(line)
    return line.type

In [26]:
class MyTest(unittest.TestCase):
    def test(self):
        self.assertEqual(test_function('PSA 2012. 1,53'), ['2012', '1,53'])
    def test2(self):
        self.assertEqual(test_function('PSA 12. 53'), ['12. ', '12. 53', '53'])
    def test3(self):
        self.assertEqual(test_function('PSA 12, 53'), ['12, ', '12, 53', '53'])
    def test4(self):
        self.assertEqual(test_function('PSA 12 , 53'), ['12 , ', '12 , 53', '53'])
    def test5(self):
        self.assertEqual(test_function('PSA 2012.1,53'), ['2012', '1,53'])
    def test6(self):
        self.assertEqual(test_function('PSA 20121,53'), ['2012', '1,53'])
    def test7(self):
        self.assertEqual(test_function('PSA ,315'), [' ,315'])    
    def test8(self):
        self.assertEqual(test_function('PSA 030420121,53'), ['03042012', '1,53'])  
        
    def test9(self):
        t = more_complex_test_function('PSA 2012. 1,53')
        self.assertEqual(t[0].regex_type, 'measurement_object')
        self.assertEqual(t[0].start, 0)
        self.assertEqual(t[0].end, 3)
        self.assertEqual(t[1].regex_type, 'date9')
        self.assertEqual(t[1].start, 4)
        self.assertEqual(t[1].end, 8)
        self.assertEqual(t[2].regex_type, 'anynumber')
        self.assertEqual(t[2].start, 10)
        self.assertEqual(t[2].end, 14) 
    
    def test10(self):
        t = more_complex_test_function('PSA 12. 53')
        self.assertEqual(t[0].regex_type, 'measurement_object')
        self.assertEqual(t[0].start, 0)
        self.assertEqual(t[0].end, 3)
        self.assertEqual(t[1].regex_type, 'numbercomma')
        self.assertEqual(t[1].start, 4)
        self.assertEqual(t[1].end, 8)
        self.assertEqual(t[2].regex_type, 'anynumber')
        self.assertEqual(t[2].start, 4)
        self.assertEqual(t[2].end, 10)
        self.assertEqual(t[3].regex_type, 'anynumber')
        self.assertEqual(t[3].start, 8)
        self.assertEqual(t[3].end, 10) 

In [27]:
if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

..........
----------------------------------------------------------------------
Ran 10 tests in 0.022s

OK


In [28]:
class MyTest(unittest.TestCase):
    def test(self):
        t = more_complex_test_function('PSA 2012. 1,53')
        # First span: the measurement object.
        self.assertEqual(t[0].grammar_symbol, 'MO')
        self.assertEqual(t[0].start, 0)
        self.assertEqual(t[0].end, 3)
        # Second span: the date.
        self.assertEqual(t[1].grammar_symbol, 'DATE')
        self.assertEqual(t[1].start, 4)
        self.assertEqual(t[1].end, 8)
        # Third span: the number.
        self.assertEqual(t[2].grammar_symbol, 'NUMBER')
        self.assertEqual(t[2].start, 10)
        self.assertEqual(t[2].end, 14)