# `run_201905181215`

## Configuration

### `EstNLTK` version
```bash
(py36) ptammo@p12:~/Projects/cda-data-cleaning$ git -C ../estnltk show --summary
```
```
commit fd5b8314c31a227fb1135985c59ffe0d13aa1ece
Author: pault <pault@ut.ee>
Date:   Sat May 18 11:43:03 2019 +0300

    replaced re.Pattern with re.regex.Pattern in Vocabulary for compatibility with newer version of regex module
```

### `cda-data-cleaning` version
```bash
(py36) ptammo@p12:~/Projects/cda-data-cleaning$ git show --summary
```
```
commit f4b47bedf437511f1457b08c54ed3e33eefcdfe4
Author: pault <pault@ut.ee>
Date:   Fri May 17 23:06:15 2019 +0300

    #1120 update EventSegmentsTagger.output_attributes
```

### `egcut_epi_original.ini`
```
(py36) ptammo@p12:~/Projects/cda-data-cleaning$ cat configurations/egcut_epi_original.ini 
[database-configuration]
host = p12.stacc.ee
port = 5432
database_name = egcut_epi

username = ptammo
password = xxxxxx

role = egcut_epi_work_create
table_create_role = egcut_epi_work_create
table_read_role = egcut_epi_work_read

original_schema = original
work_schema = work


[luigi]
folder = luigi_targets


# LEGACY TO BE REMOVED
[classifications-database-configuration]
host = p12.stacc.ee
port = 5432
database_name = qs_rel
schema = rel_classifications
```
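The settings above can be inspected from Python with the standard `configparser` module. This is only a minimal sketch: how `cda_data_cleaning` itself reads the file is not shown in this log, the helper name `load_database_settings` is hypothetical, and the text is inlined (credentials omitted) so the snippet runs without the real file.

```python
import configparser

# Inline copy of part of the configuration shown above (credentials omitted).
INI_TEXT = """
[database-configuration]
host = p12.stacc.ee
port = 5432
database_name = egcut_epi
role = egcut_epi_work_create
original_schema = original
work_schema = work

[luigi]
folder = luigi_targets
"""


def load_database_settings(ini_text):
    """Hypothetical helper: return [database-configuration] as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["database-configuration"]
    settings = dict(section)
    # configparser stores every value as a string; convert the port explicitly.
    settings["port"] = section.getint("port")
    return settings


settings = load_database_settings(INI_TEXT)
print(settings["host"], settings["port"], settings["work_schema"])
# p12.stacc.ee 5432 work
```

To read the file on disk instead, `parser.read(path)` can replace `read_string`; full-line `#` comments such as the legacy marker are skipped by `configparser` by default.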

## Run `CreateTextsCollection`
```bash
(py36) ptammo@p12:~/Projects/cda-data-cleaning$ luigi --scheduler-port 8089 --module cda_data_cleaning CreateTextsCollection  --prefix=run_201905181215 --conf=egcut_epi_original.ini --workers=8
```

## Run `CreateEventsCollection`
```bash
(py36) ptammo@p12:~/Projects/cda-data-cleaning$ luigi --scheduler-port 8089 --module cda_data_cleaning CreateEventsCollection --prefix=run_201905181215 --conf=egcut_epi_original.ini --workers=1
```

## Results

In [1]:
```python
from estnltk.storage import PostgresStorage

storage = PostgresStorage(pgpass_file='~/.pgpass',
                          dbname='egcut_epi',
                          schema='work',
                          role='egcut_epi_work_create')
```
```
INFO:storage.py:41: connecting to host: 'p12.stacc.ee', port: '5432', dbname: 'egcut_epi', user: 'ptammo'
INFO:storage.py:57: schema: 'work', temporary: False, role: 'egcut_epi_work_create'
```


### `run_201905181215_texts` collection

In [2]:
```python
texts_collection = storage['run_201905181215_texts']
texts_collection
```

| | data type |
|---|---|
| epi_id | text |
| epi_type | text |
| schema | text |
| table | text |
| field | text |
| row_id | text |
| effective_time | timestamp without time zone |

| | layer_type | attributes | ambiguous | parent | enveloping | _base | meta |
|---|---|---|---|---|---|---|---|
| anonymised | attached | (id, type, form, partofspeech) | True | | | anonymised | [] |
| event_headers | detached | (DATE, HEADERWORD, DOCTOR, DOCTOR_CODE, SPECIA... | True | | event_tokens | event_headers | [] |
| event_segments | detached | (header, header_offset, DATE, HEADERWORD, DOCT... | False | | | event_segments | [] |
| event_tokens | detached | (grammar_symbol, unit_type, value, specialty_c... | True | | | event_tokens | [] |


In [3]:
```python
texts_collection.selected_layers = ['anonymised', 'event_tokens', 'event_headers', 'event_segments']
```

In [4]:
```python
# find a text with an interesting `event_segments` layer
stop = False
for t in texts_collection:
    if len(t.event_segments) > 1:
        for span in t.event_segments:
            if span.DOCTOR:
                stop = True
                break
    if stop:
        break
```
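The same search can be written without the `stop` flag by combining `next` with `any`. The sketch below is an alternative formulation, not what was run: `first_text_with_doctor` and `make_text` are hypothetical names, and `SimpleNamespace` objects stand in for EstNLTK texts and spans so the snippet runs without a database.

```python
from types import SimpleNamespace


def first_text_with_doctor(texts):
    """Return the first text that has more than one event segment and
    at least one segment with a non-empty DOCTOR value, else None."""
    return next(
        (t for t in texts
         if len(t.event_segments) > 1
         and any(span.DOCTOR for span in t.event_segments)),
        None,
    )


def make_text(*doctors):
    """Build a stand-in text with one fake segment per DOCTOR value."""
    segments = [SimpleNamespace(DOCTOR=d) for d in doctors]
    return SimpleNamespace(event_segments=segments)


texts = [
    make_text(None),                  # only one segment: skipped
    make_text(None, None),            # no DOCTOR value: skipped
    make_text(None, "VESKE, KARIN"),  # matches
]
found = first_text_with_doctor(texts)
print(found is texts[2])
```

Unlike the loop above, which leaves `t` bound to the last text visited when nothing matches, this version returns `None` in that case.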

In [5]:
```python
t.event_segments
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| event_segments | header, header_offset, DATE, HEADERWORD, DOCTO... | | | False | 7 |

| text | header | header_offset | DATE | HEADERWORD | DOCTOR | DOCTOR_CODE | SPECIALTY | SPECIALTY_CODE | ANONYM |
|---|---|---|---|---|---|---|---|---|---|
| `OPERATSIOONID \nOperatsioon 07.04.2011 14:01 Kirurg: PUUORG, EGON - D01460, E260 ..., type: <class 'str'>, length: 644` | | | | | | | | | |
| `WBC 8.91 (3,5 .. 8,8 E9/L ) \nRBC 5.08 (4,2 .. 5,7 E12/L ) \nHGB 153 (134 .. 1 ..., type: <class 'str'>, length: 261` | 06.04.2011 Hemogramm | 644.0 | `\n06.04.2011` | `Hemogramm \n` | | | | | [] |
| `\nS,P-Glükoos 5.3 (mmol/L ) \nS,P-Uurea 3.6 (<8.3 mmol/L ) \nS,P-CRP <1 (<5 ) \nS ..., type: <class 'str'>, length: 252` | 06.04.2011 | 930.0 | `\n06.04.2011` | | | | | | [] |
| `\nMärkus: Osakonnas määratud veregrupp 0 \nErütrotsütaarsete antikehade sõeluurin ..., type: <class 'str'>, length: 230` | 06.04.2011 | 1194.0 | `\n06.04.2011` | | | | | | [] |
| `Radioloogiline uuring, vastus nr 849_15803_1 \nKIRJELDUS: Vasak põlveliiges 2-s. ..., type: <class 'str'>, length: 284` | 07.04.2011 12:29 - VESKE, KARIN - D04454 - E340 - radioloogia | 1436.0 | `\n07.04.2011 12:29` | | VESKE, KARIN | D04454 | radioloogia | E340 | [] |
| `\nT. pallidum IgM+IgG Negatiivne` | 07.04.2011 | 1784.0 | `\n07.04.2011` | | | | | | [] |
| `\nT. pallidum IgM+IgG Negatiivne \n` | 07.04.2011 | 1832.0 | `\n07.04.2011` | | | | | | [] |


### `run_201905181215_events` collection

In [6]:
```python
collection = storage['run_201905181215_events']
collection
```

| | data type |
|---|---|
| texts_id | integer |
| epi_id | text |
| epi_type | text |
| schema | text |
| table | text |
| field | text |
| row_id | text |
| effective_time | timestamp without time zone |
| header | text |
| header_offset | integer |

| | layer_type | attributes | ambiguous | parent | enveloping | _base | meta |
|---|---|---|---|---|---|---|---|
| anonymised | attached | (id, type, form, partofspeech) | True | | | anonymised | [] |


In [7]:
```python
for id_, text, meta in collection.select(
    collection_meta=['texts_id', 'epi_id', 'epi_type', 'schema', 'table', 'field',
                     'row_id', 'effective_time', 'header', 'header_offset', 'event_offset'],
    layers=['anonymised']):

    if len(text.anonymised) > 1:
        break
```

In [8]:
```python
meta
```
```
{'texts_id': 5,
 'epi_id': '1000808',
 'epi_type': 's',
 'schema': 'original',
 'table': 'procedures',
 'field': 'text',
 'row_id': '9',
 'effective_time': None,
 'header': '09.07.2009 13:27 - <ANONYM id="6" type="per" morph="_Y_ ?"/>, <ANONYM id="7" type="per" morph="_H_ sg n"/>',
 'header_offset': 1374,
 'event_offset': 1482}
```
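The `header` value carries the anonymisation placeholders as `<ANONYM .../>` tags. To list them from such a string, a regular expression is enough for a quick check. This is a rough sketch that assumes the attribute order seen in this output (`id`, `type`, `morph`); the pattern and the helper `anonym_tags` are illustrative and not part of `cda-data-cleaning` or EstNLTK.

```python
import re

# Assumes the attribute order seen in the output above: id, type, morph.
ANONYM_TAG = re.compile(
    r'<ANONYM id="(?P<id>[^"]*)" type="(?P<type>[^"]*)" morph="(?P<morph>[^"]*)"/>'
)


def anonym_tags(value):
    """Hypothetical helper: return one dict per <ANONYM .../> tag in a string."""
    return [match.groupdict() for match in ANONYM_TAG.finditer(value)]


header = ('09.07.2009 13:27 - <ANONYM id="6" type="per" morph="_Y_ ?"/>, '
          '<ANONYM id="7" type="per" morph="_H_ sg n"/>')
tags = anonym_tags(header)
for tag in tags:
    print(tag["id"], tag["type"], tag["morph"])
```

For anything beyond inspection, the `anonymised` layer of the text is the more reliable source, since it already exposes these values as span attributes.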

In [9]:
```python
text
```

| text |
|---|
| `- D04467 - RadioloogRadioloogiline uuring, vastus nr 14_15166_1KIRJELDUS: Paremal ACI algusosas põhiliselt pehmekoline naast, stenoos 59%. hemodünaamikat ei mõjuta. Vasakul ACI algusosas lubitihedaid naaste, senoosi maksimaalselt 30%. Vertebraalarterites vool tavapärane. 15.07.2009 14:28 - <ANONYM id="8" type="per" morph="_H_ sg n"/>, <ANONYM id="9" type="per" morph="_H_ sg n"/> - D04454 - E62 - radioloogiaRadioloogiline uuring, vastus nr 612_15172_1KIRJELDUS: MRT angiograafia kaelaarteritest. Süstitud Omniscani 20,0. Kaelaarterites hemodünaamiliselt olulisi stenoose nähtavale ei tule. Paremal vertebraalarteri läbimõõt on asümmeetriliselt väiksem, alguskoht ei tule selgelt nähtavale- siin tõenäoliselt stenoos. Vasak vertebraalarter iseärasusteta. Ajuarterid Willise ringi piirkonnas normipärast, sümmeetrilist laiust. FLAIR uuringul bilat. periventrikulaarses valgeaines tõusnud signaaliga vaskulaarsele entsefalopaatiale viitavad alad. Paremal periventriklulaarsel ja vasemal basaaltuuamde piirkonnas likvoriseerunud lakunaarse infarkti kolded.` |

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| anonymised | id, type, form, partofspeech | | | True | 2 |


In [10]:
```python
storage.close()
```