In [1]:
import numpy as np

# Open a connection to OMOP
from fleming_lib.tools import connect_to_omop
conn = connect_to_omop()

# Get ready for SQL query processing
import pandas as pd

[INFO] adding /home/paulroujansky/git/DataForGood/batch4_diafoirus_fleming to sys.path


# Building the data

## Filters
Beforehand, we apply a set of filters to the patients in order to build the base cohort.

Criteria: patients need to:
- enter an ICU block (?)
- be over 15 years old
- have an IGS II score above 20
- not be "limited" (see https://docs.google.com/document/d/12Co20pVyEfQrSa9fBaKpUe1_TLoMTJzK74vK1NY2vIQ/edit)
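The criteria above can be sketched as a boolean mask over a table of candidate patients. This is only an illustration on toy data: the column names (`age`, `igs2`, `icu_stay`, `limited`) are hypothetical and do not come from the OMOP schema, and how each flag is actually derived from the database is left open.

```python
import pandas as pd

# Toy candidates; every column name except person_id is a hypothetical placeholder.
candidates = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "age": [72, 14, 55, 63, 40],
    "igs2": [35, 40, 18, 50, 28],
    "icu_stay": [True, True, True, False, True],
    "limited": [False, False, False, False, True],
})

# One boolean condition per criterion, combined with a logical AND.
in_cohort = (
    candidates["icu_stay"]
    & (candidates["age"] > 15)
    & (candidates["igs2"] > 20)
    & ~candidates["limited"]
)

# Keep only the identifiers of the patients that pass every filter.
base_cohort = candidates.loc[in_cohort, ["person_id"]].reset_index(drop=True)
print(base_cohort)
```

Keeping each criterion as a separate condition makes it easy to count how many patients each filter removes.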


## Type of data

1. Metadata: age, gender, ethnicity...

2. Measurements: categorical and continuous numerical variables, each associated with a timestamp

3. Conditions: type of illness (categorical variables) diagnosed at a given timestamp

4. Final label: death


To fit our model, we would ideally build a matrix in which each row contains all features at a given timestamp.

Each of the L metadata columns is based on a mapping:
- Meta 1 = Age
- ...
- Meta L = Ethnicity

Some metadata might change over time (e.g. age, if the patient stays in the hospital long enough).

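Each of the N measures is based on a mapping:
- Measure 1 = Heart Rate (bpm)
- Measure 2 = Temperature (°F)
- ...
- Measure N = Blood pH

Units are assumed to be identical across time for each measure. This should be verified: the sample output at the end of this notebook reports body temperature in both Deg. C and Deg. F.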

Similarly, each of the M conditions is based on a mapping:
- Condition 1 = Type II diabetes
- ...
- Condition M = Obesity

1 encodes the presence of the condition and 0 its absence.


Finally, Dead = 1 marks the death of the patient and is potentially the last sample for that patient in the table.

| Timestamp | Meta 1 | ... |  Meta L | Measure 1 | ... | Measure N | Condition 1 | ... | Condition M | Dead |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | ---  | --- |
|2018-04-22 12:15 | 60 | ... | 30 | 10 | ... | 140 | 1 | ... | 0 | 0 |
|2018-04-22 18:33 | 60 | ... | 30 | 40 | ... | 130 | 1 | ... | 1 | 0 |
|2018-04-23 15:00 | 60 | ... | 30 | 50 | ... | 100 | 1 | ... | 1 | 1 |

Data may be missing at a given timestamp, since not every measure is recorded every time. This has to be handled with great caution.
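As a rough sketch of how the matrix above could be obtained from long-format measurements (one row per measurement), the toy example below pivots to one row per timestamp and one column per measure. The data are made up, the `value` column is a hypothetical numeric version of `value_source_value`, and the forward-fill is only one possible way to treat gaps, not a recommendation.

```python
import pandas as pd

# Toy long-format measurements; values are invented for illustration.
long_df = pd.DataFrame({
    "measurement_datetime": pd.to_datetime([
        "2018-04-22 12:15", "2018-04-22 12:15",
        "2018-04-22 18:33",
        "2018-04-23 15:00", "2018-04-23 15:00",
    ]),
    "measurement_concept_name": [
        "Heart rate", "BP systolic",
        "Heart rate",
        "Heart rate", "BP systolic",
    ],
    "value": [115.0, 130.0, 110.0, 98.0, 118.0],
})

# One row per timestamp, one column per measure.
# If a measure appears twice at the same timestamp, keep the last value.
wide = long_df.pivot_table(
    index="measurement_datetime",
    columns="measurement_concept_name",
    values="value",
    aggfunc="last",
)

# Record where values were missing before any imputation.
missing_mask = wide.isna()

# One option among many: carry the last observed value forward in time.
filled = wide.ffill()
print(filled)
```

Keeping `missing_mask` alongside the filled matrix preserves the information that a value was imputed rather than observed, which can itself be used as a feature.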


## HOW-TO:

The idea is to extract or build each kind of data independently, then aggregate everything into the matrix structure above afterwards (in SQL or Python).

Where to find the data:
- metadata can be found in the `person` table
- measures can be found in the `measurement` table
- conditions can be found in the `condition_occurrence` table
- death status can be found in the `death` table

Additional information might be worth taking into account, such as the unit the patient is currently in (available in `visit_occurrence`, `visit_detail` and elsewhere).
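As an illustration of the aggregation step, the sketch below derives the `Dead` label by left-joining `person` with `death`. It runs against a throwaway in-memory SQLite database with toy rows, not against the real OMOP connection; only a minimal subset of columns is created, and the SQL dialect of the actual database may differ.

```python
import sqlite3

# Throwaway in-memory database standing in for the OMOP tables (toy data).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE death (person_id INTEGER, death_datetime TEXT);
INSERT INTO person VALUES (1, 1950), (2, 1980);
INSERT INTO death VALUES (1, '2018-04-23 15:00:00');
""")

# A patient is labelled dead = 1 if a matching row exists in death, else 0.
rows = demo_conn.execute("""
SELECT p.person_id,
       p.year_of_birth,
       CASE WHEN d.person_id IS NULL THEN 0 ELSE 1 END AS dead
FROM person AS p
LEFT JOIN death AS d ON d.person_id = p.person_id
ORDER BY p.person_id
""").fetchall()
print(rows)
```

The left join keeps patients without a `death` row, which is what allows the label to be 0 for survivors.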



# Filters

Build a table containing the ID (`patient_id`) of each patient in the base cohort (see `omop_create table.ipynb` for how to build a table).

# Data

##  Meta

## Measures

We only take the measures defined in https://docs.google.com/spreadsheets/d/18DAAFagLvbwgmFobmtUyAnRnyjIPCQ1PKO7PxGgvKzY/edit#gid=451798074.

Counts

In [5]:
query = """
SELECT person_id, COUNT(measurement_id)
FROM measurement
WHERE
    measurement_concept_id IN
    (3022318,   -- heart_rhythm
     3024171,   -- respiratory_rate
     3028354,   -- vent_settings
     3012888,   -- diastolic_bp
     3027598,   -- map_bp
     3004249,   -- systolic_bp
     3027018,   -- heart_rate
     3020891,   -- temperature
     3016502,   -- spo2
     3020716,   -- fio2
     3032652    -- glasgow coma scale
    )
GROUP BY person_id
LIMIT 10
;"""

temp = pd.read_sql_query(query, conn)
temp

Unnamed: 0,person_id,L3
0,62073122,988
1,62080413,600
2,62106943,2404
3,62102837,1176
4,62081774,93
5,62102210,8161
6,62090850,364
7,62073886,444
8,62069455,2690
9,62085688,176
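The queries in this section repeat the same list of concept IDs. One possible way to avoid the duplication is to keep the list in a single Python mapping and generate the `IN (...)` clause from it. The helper names below (`MEASUREMENT_CONCEPTS`, `concept_id_list`, `build_patient_query`) are hypothetical and only sketch the idea; the IDs and labels are copied from the query above.

```python
# Concept IDs and labels copied from the query above.
MEASUREMENT_CONCEPTS = {
    3022318: "heart_rhythm",
    3024171: "respiratory_rate",
    3028354: "vent_settings",
    3012888: "diastolic_bp",
    3027598: "map_bp",
    3004249: "systolic_bp",
    3027018: "heart_rate",
    3020891: "temperature",
    3016502: "spo2",
    3020716: "fio2",
    3032652: "glasgow_coma_scale",
}


def concept_id_list(concepts):
    """Return the concept IDs as a comma-separated string for an IN clause."""
    # Casting to int ensures only integer literals end up in the SQL text.
    return ", ".join(str(int(cid)) for cid in concepts)


def build_patient_query(person_id, limit=100):
    """Build the per-patient measurement query for one person_id."""
    return f"""
SELECT person_id, measurement_datetime, measurement_concept_name,
       value_source_value, unit_source_value
FROM measurement
WHERE measurement_concept_id IN ({concept_id_list(MEASUREMENT_CONCEPTS)})
AND person_id = {int(person_id)}
ORDER BY measurement_datetime
LIMIT {int(limit)}
;"""


query = build_patient_query(62073122)
print(query)
```

The resulting string can be passed to `pd.read_sql_query(query, conn)` exactly like the hand-written queries.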


Example of selection for patient 62073122.

In [6]:
query = """
SELECT person_id, measurement_datetime, measurement_concept_name, value_source_value, unit_source_value
FROM measurement
WHERE
    measurement_concept_id IN
    (3022318,   -- heart_rhythm
     3024171,   -- respiratory_rate
     3028354,   -- vent_settings
     3012888,   -- diastolic_bp
     3027598,   -- map_bp
     3004249,   -- systolic_bp
     3027018,   -- heart_rate
     3020891,   -- temperature
     3016502,   -- spo2
     3020716,   -- fio2
     3032652    -- glasgow coma scale
    )
AND person_id = 62073122
ORDER BY measurement_datetime
LIMIT 100
;"""

temp = pd.read_sql_query(query, conn)
temp

Unnamed: 0,person_id,measurement_datetime,measurement_concept_name,value_source_value,unit_source_value
0,62073122,2108-04-06 16:30:00,Heart rate,115,BPM
1,62073122,2108-04-06 16:30:00,Body temperature,36.111099243164062,Deg. C
2,62073122,2108-04-06 16:30:00,Mean blood pressure,100.66699981689453,mmHg
3,62073122,2108-04-06 16:30:00,Respiratory rate,22,BPM
4,62073122,2108-04-06 16:30:00,Oxygen saturation in Arterial blood,100,%
5,62073122,2108-04-06 16:30:00,Heart rate rhythm,Sinus Tachy,
6,62073122,2108-04-06 16:30:00,BP systolic,130,mmHg
7,62073122,2108-04-06 16:30:00,BP diastolic,86,mmHg
8,62073122,2108-04-06 16:30:00,Body temperature,97,Deg. F
9,62073122,2108-04-06 17:00:00,BP systolic,118,mmHg
