# EventTagger

A class that finds events in a **Text** object based on a user-provided vocabulary. Each event is tagged with several position metrics (**start**, **end**, **cstart**, **wstart**) and with user-provided classifiers.

## Usage
### Example 1

In [1]:
```python
from episode_miner import EventTagger
from pprint import pprint
from estnltk import Text
from pandas import DataFrame
```

Create the event vocabulary as a ``pandas`` ``DataFrame``

In [2]:
```python
event_vocabulary = DataFrame([['Harv',    'sagedus'], 
                              ['peavalu', 'sümptom']], 
                      columns=['term',    'type'])
```

or as a list of ``dict``s

In [3]:
```python
event_vocabulary = [{'term': 'Harv',    'type': 'sagedus'}, 
                    {'term': 'peavalu', 'type': 'sümptom'}]
```

or as a file `data/event vocabulary.csv` in *csv* format:
```
term,type
Harv,sagedus
peavalu,sümptom
```

In [4]:
```python
event_vocabulary = 'data/event vocabulary.csv'
```
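To illustrate how the *csv* form relates to the list-of-dicts form, here is a minimal sketch that parses the same content with the standard library's `csv.DictReader`. It is only an illustration: `read_vocabulary` is a hypothetical helper, and how **EventTagger** reads the file internally is not described here.

```python
import csv
import io


def read_vocabulary(csv_text):
    """Parse CSV text with a header row into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each row becomes a dict keyed by the header names.
    return [dict(row) for row in reader]


csv_text = "term,type\nHarv,sagedus\npeavalu,sümptom\n"
vocabulary = read_vocabulary(csv_text)
```

Under these assumptions, `vocabulary` holds the same two entries as the list of ``dict``s shown earlier.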

``event_vocabulary`` must contain one key (column) called **term**, which holds the strings to search for in the text. Other keys (**type** in this example) are optional. No key may be named **start**, **end**, **cstart**, **wstart**, **wstart_raw** or **wend_raw**.
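The key rules above can be checked before building the tagger. The following is a minimal sketch, assuming the vocabulary is already a list of ``dict``s; `check_vocabulary` and `RESERVED_KEYS` are hypothetical names and not part of `episode_miner`.

```python
# Key names that the text above forbids in a vocabulary.
RESERVED_KEYS = {'start', 'end', 'cstart', 'wstart', 'wstart_raw', 'wend_raw'}


def check_vocabulary(vocabulary):
    """Raise ValueError if an entry breaks the key rules (hypothetical helper)."""
    for entry in vocabulary:
        if 'term' not in entry:
            raise ValueError("missing 'term' key: %r" % (entry,))
        clashes = RESERVED_KEYS & set(entry)
        if clashes:
            raise ValueError("reserved key(s) used: %s" % sorted(clashes))
    return True


ok = check_vocabulary([{'term': 'Harv', 'type': 'sagedus'},
                       {'term': 'peavalu', 'type': 'sümptom'}])
```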

Create a **Text** object and an **EventTagger** object, then find the list of events.

In [5]:
```python
text = Text('Harva esineb peavalu.')
event_tagger = EventTagger(event_vocabulary, search_method='ahocorasick', case_sensitive=True,
                           conflict_resolving_strategy='ALL', return_layer=True)
event_tagger.tag(text)
```

```
[{'cstart': 0,
  'end': 4,
  'start': 0,
  'term': 'Harv',
  'type': 'sagedus',
  'wend_raw': 1,
  'wstart': 0,
  'wstart_raw': 0},
 {'cstart': 10,
  'end': 20,
  'start': 13,
  'term': 'peavalu',
  'type': 'sümptom',
  'wend_raw': 3,
  'wstart': 2,
  'wstart_raw': 2}]
```

The **search_method** is either 'ahocorasick' or 'naive'. 'naive' is generally slower but does not depend on the **pyahocorasick** package.
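To give an idea of what a naive search involves, the sketch below scans the text once per vocabulary term with `str.find`, whereas an Aho-Corasick automaton matches all terms in a single pass. This is an illustrative assumption about the approach, not the package's actual code; `naive_search` is a hypothetical function.

```python
def naive_search(text, terms, case_sensitive=True):
    """Return (start, end, term) for every occurrence, sorted by start, then end."""
    haystack = text if case_sensitive else text.lower()
    matches = []
    for term in terms:
        needle = term if case_sensitive else term.lower()
        position = haystack.find(needle)
        while position != -1:
            matches.append((position, position + len(needle), term))
            # Step forward by one character so overlapping occurrences are found too.
            position = haystack.find(needle, position + 1)
    return sorted(matches)


matches = naive_search('Harva esineb peavalu.', ['Harv', 'peavalu'])
```

With the text and terms of Example 1, the spans found this way agree with the ``start`` and ``end`` values in the output above.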

The **conflict_resolving_strategy** is either 'ALL', 'MIN' or 'MAX' (see the next example).

The events in the output are ordered by ``start`` and then by ``end``.

The word start ``wstart`` and char start ``cstart`` are calculated as if every event consisted of a single character.
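The ``cstart`` values in the example outputs can be reproduced with a small calculation. The sketch below assumes the events do not overlap and are sorted by ``start``; `compute_cstart` is a hypothetical helper, and ``wstart`` is not covered.

```python
def compute_cstart(events):
    """Add 'cstart' to non-overlapping events sorted by 'start' (hypothetical helper)."""
    removed = 0  # characters saved so far by shrinking earlier events to one character
    result = []
    for event in events:
        result.append(dict(event, cstart=event['start'] - removed))
        # An event of length n is counted as 1 character, which saves n - 1.
        removed += (event['end'] - event['start']) - 1
    return result


tagged = compute_cstart([{'start': 0, 'end': 4}, {'start': 13, 'end': 20}])
```

For the two events of Example 1 this gives ``cstart`` values 0 and 10: the first event shrinks from four characters to one, so the second event moves three positions to the left.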

The defaults are:

```python
search_method='naive' # for Python < 3
search_method='ahocorasick' # for Python >= 3
case_sensitive=True
conflict_resolving_strategy='MAX'
return_layer=False
layer_name='events'
```

### Example 2

In [6]:
```python
event_vocabulary = [
                    {'term': 'kaks', 'value': 2, 'type': 'väike'},
                    {'term': 'kümme', 'value': 10, 'type': 'keskmine'},
                    {'term': 'kakskümmend', 'value': 20, 'type': 'suur'},
                    {'term': 'kakskümmend kaks', 'value': 22, 'type': 'suur'}
                   ]
text = Text('kakskümmend kaks')
```

``conflict_resolving_strategy='ALL'`` returns all events.

In [7]:
```python
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='ALL', return_layer=True)
event_tagger.tag(text)
```

```
[{'end': 4,
  'start': 0,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 1,
  'wstart_raw': 0},
 {'end': 11,
  'start': 0,
  'term': 'kakskümmend',
  'type': 'suur',
  'value': 20,
  'wend_raw': 1,
  'wstart_raw': 0},
 {'end': 16,
  'start': 0,
  'term': 'kakskümmend kaks',
  'type': 'suur',
  'value': 22,
  'wend_raw': 0,
  'wstart_raw': 0},
 {'end': 9,
  'start': 4,
  'term': 'kümme',
  'type': 'keskmine',
  'value': 10,
  'wend_raw': 1,
  'wstart_raw': 0},
 {'end': 16,
  'start': 12,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 0,
  'wstart_raw': 2}]
```

``conflict_resolving_strategy='MAX'`` returns all the events that are not contained by any other event.

In [8]:
```python
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='MAX', return_layer=True)
event_tagger.tag(text)
```

```
[{'cstart': 0,
  'end': 16,
  'start': 0,
  'term': 'kakskümmend kaks',
  'type': 'suur',
  'value': 22,
  'wend_raw': 0,
  'wstart': 0,
  'wstart_raw': 0}]
```

``conflict_resolving_strategy='MIN'`` returns all the events that don't contain any other event.

In [9]:
```python
event_tagger = EventTagger(event_vocabulary, search_method='naive', conflict_resolving_strategy='MIN', return_layer=True)
event_tagger.tag(text)
```

```
[{'cstart': 0,
  'end': 4,
  'start': 0,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 1,
  'wstart': 0,
  'wstart_raw': 0},
 {'cstart': 1,
  'end': 9,
  'start': 4,
  'term': 'kümme',
  'type': 'keskmine',
  'value': 10,
  'wend_raw': 1,
  'wstart': 0,
  'wstart_raw': 0},
 {'cstart': 5,
  'end': 16,
  'start': 12,
  'term': 'kaks',
  'type': 'väike',
  'value': 2,
  'wend_raw': 0,
  'wstart': 2,
  'wstart_raw': 2}]
```
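The three strategies can be summarised as filters over the event spans. The sketch below is a minimal illustration, assuming that "contained" means the span of one event lies within the span of another and the two spans differ; `contains` and `resolve` are hypothetical names, not the package's API.

```python
def contains(outer, inner):
    """True if outer's span covers inner's span and the two spans differ."""
    same_span = (outer['start'], outer['end']) == (inner['start'], inner['end'])
    return (outer['start'] <= inner['start']
            and inner['end'] <= outer['end']
            and not same_span)


def resolve(events, strategy):
    """Filter events by conflict resolving strategy (hypothetical helper)."""
    if strategy == 'ALL':
        return list(events)
    if strategy == 'MAX':
        # Keep events that are not contained in any other event.
        return [e for e in events if not any(contains(o, e) for o in events)]
    if strategy == 'MIN':
        # Keep events that do not contain any other event.
        return [e for e in events if not any(contains(e, o) for o in events)]
    raise ValueError("unknown strategy: %r" % (strategy,))


# Spans of the five events found with 'ALL' in Example 2.
spans = [{'start': 0, 'end': 4}, {'start': 0, 'end': 11}, {'start': 0, 'end': 16},
         {'start': 4, 'end': 9}, {'start': 12, 'end': 16}]
max_spans = [(e['start'], e['end']) for e in resolve(spans, 'MAX')]
min_spans = [(e['start'], e['end']) for e in resolve(spans, 'MIN')]
```

Applied to the spans of Example 2, 'MAX' keeps only the span 0 to 16, and 'MIN' keeps the spans 0 to 4, 4 to 9 and 12 to 16, which agrees with the outputs shown above.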