# episode-miner

Provides methods to find events in text based on a user-defined vocabulary. The results are stored in the ```'events'``` layer of an ```EventText``` object. The Winepi algorithm is then used to find frequent serial episodes in the event sequence.

## Installation
```bash
git clone https://github.com/estnltk/episode-miner.git
cd episode-miner
python setup.py install

# or pre-install the requirements separately:
pip install -r requirements.txt
```
## Usage
For more details see [docs](docs).

In [1]:
from episode_miner import EventTagger, EventText, EventSequence, Episode, find_sequential_episodes, rel_support
from pprint import pprint
from IPython.display import HTML

We shall find the frequent serial episodes which consist of the words ```üks``` and ```kaks``` in a short example text. First, tag the events and build the event sequence.

In [2]:
event_vocabulary = [{'term': 'üks'},
                    {'term': 'kaks'}]
event_tagger = EventTagger(event_vocabulary, case_sensitive=False, return_layer=True)
event_text = EventText('Üks kaks kolm neli kolm. Kaks üks kaks kolm neli kolm üks kaks.', event_tagger=event_tagger)
event_sequence = EventSequence(event_text=event_text, classificator='term', time_scale='start')
html = event_sequence.pretty_print()
HTML(html)

Let the width of the Winepi search window be 31 characters and the minimal relative frequency of serial episodes be 30%.

In [3]:
find_sequential_episodes(event_sequence, 
                         window_width=31, 
                         min_frequency=0.3, 
                         only_full_windows=False, 
                         allow_intermediate_events=True)

[85 ('kaks',),
 85 ('üks',),
 38 ('kaks', 'kaks'),
 35 ('kaks', 'üks'),
 78 ('üks', 'kaks'),
 29 ('kaks', 'üks', 'kaks')]

It turns out that the episode ```('kaks', 'üks', 'kaks')``` appears in 29 Winepi windows. Since the text is 63 characters long and partial windows at both ends are counted (```only_full_windows=False```), there are 63 + 31 - 1 = 93 windows in total, so the relative frequency of this episode is 29 / 93 ≈ 31%. The same value can be computed with the ```rel_support``` function.

In [4]:
rel_support(event_sequence,
            Episode(('kaks', 'üks', 'kaks')),
            window_width=31, 
            only_full_windows=False, 
            allow_intermediate_events=True)

[0.3118279569892473]
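
The window arithmetic above can be cross-checked without the library. The following is a minimal pure-Python sketch, not the implementation used by ```episode_miner```: it assumes that events are timed by their start character, that windows are half-open intervals of 31 characters, and that partial windows at both ends of the text are counted. The helper names ```contains_serial``` and ```count_windows``` are hypothetical and exist only for this illustration.

```python
import re

text = 'Üks kaks kolm neli kolm. Kaks üks kaks kolm neli kolm üks kaks.'
width = 31

# (start position, term) pairs, matched case-insensitively.
events = [(m.start(), m.group().lower())
          for m in re.finditer('üks|kaks', text, flags=re.IGNORECASE)]


def contains_serial(labels, episode):
    """True if `episode` occurs in `labels` in order, gaps allowed."""
    it = iter(labels)
    return all(any(label == term for label in it) for term in episode)


def count_windows(events, episode, width, length):
    """Count windows [t, t + width) that contain the serial episode."""
    count = 0
    # Window starts run from 1 - width to length - 1, so every window
    # overlaps the text by at least one character.
    for t in range(1 - width, length):
        labels = [term for pos, term in events if t <= pos < t + width]
        count += contains_serial(labels, episode)
    return count


n_windows = len(text) + width - 1
hits = count_windows(events, ('kaks', 'üks', 'kaks'), width, len(text))
print(hits, n_windows, round(hits / n_windows, 4))  # 29 93 0.3118
```

Under these assumptions the brute-force count agrees with the output above: 29 of the 93 windows contain the episode. Scanning every window is quadratic and only suitable for small examples like this one.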

Finally, find all instances of that episode in the event sequence.

In [5]:
examples = event_sequence.find_episode_examples(Episode(('kaks','üks', 'kaks')), 
                                                window_width=31, allow_intermediate_events=True)
html = event_sequence.pretty_print(sequence_of_events_generator=examples)
HTML(html)