### Merging data from WardWatcher (ICNARC) and Philips (ICCA).

Here we use the following files:

* 'encounter_summary (1).rpt'  - a tab separated file with output from a simple SQL run on ICCA to extract basic information about patient encounters (ITU stays).

* 'ICNARC 2015-2018 encounterIds and Readmissions.TXT' - a file containing ICNARC patient IDs and the corresponding 'CIS Patient ID', which links to the encounterId in Philips.

* 'Philips encounterId Issue List (New).xlsx' - a file documenting known issues with either encounterIds in Philips or CIS Patient IDs in WW. We clean up the IDs using this file before joining the two datasets.

* 'ICNARC_Dataset_2015-2018__clean_.xml' - an XML file containing the output of the ICNARC dataset

* 'ICNARC CMP Dataset Properties.xlsx' - description of variables in the ICNARC dataset

In merging these data we simplify somewhat the table structure used by the Philips ICCA system. This is possible in part because we exclude data that is not currently deemed relevant for research. We draw inspiration for our table names from [MIMIC-III](https://mimic.physionet.org/).

In [1]:
VERBOSE = False ## For reasons of data protection we suppress printing of results and data summaries.
import numpy as np
import pandas as pd


#### Functions for cleaning up the encounter IDs and parsing the ICNARC xml file are imported: 

In [2]:
from clean_encounterids import *
from parse_ICNARC_xml import *

#### Clean up ICNARC CIS ID numbers.

Each record in the ICNARC data links to an encounterId in Philips. In a few cases the link is wrong (e.g. it points to an empty record). The following function fixes those links by replacing the erroneous encounterIds in the column "CIS Patient ID". It also creates a new column, "CIS Patient ID Original", which contains the original encounterIds prior to replacement. At the time of writing there are only 5 such errors for GICU.

In [3]:
icnarc_numbers = clean_icnarc_cis_ids('../ICNARC 2015-2018 encounterIds and Readmissions.TXT', 
                                      '../Philips encounterId Issue List (New).xlsx',
                                    verbose=VERBOSE)
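
The internals of clean_icnarc_cis_ids are not shown in this notebook. As a rough illustration of the idea described above, and not the actual implementation, the sketch below keeps the original IDs in a second column and then overwrites only the known-bad ones. The `corrections` mapping and all values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the ICNARC ID file (all values are invented).
icnarc_ids = pd.DataFrame({
    'ICNARC number': [1, 2, 3],
    'CIS Patient ID': [101, 999, 103],
})

# Hypothetical corrections read from an issue list: wrong ID -> right ID.
corrections = {999: 102}

# Keep the original IDs, then overwrite only the known-bad ones.
icnarc_ids['CIS Patient ID Original'] = icnarc_ids['CIS Patient ID']
icnarc_ids['CIS Patient ID'] = icnarc_ids['CIS Patient ID'].replace(corrections)
```

Keeping the untouched column alongside the corrected one means every replacement can be audited later.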

#### Clean up Philips encounterId numbers.

The Philips database creates a new encounterId for each intensive care stay. Sometimes this process doesn't work as desired: for example, erroneous records are created, or patients are accidentally discharged and readmitted, creating a new record.

The following function fixes the class of errors that results in multiple encounterIds that correspond to the same physical intensive care stay. It does so by replacing the erroneous encounterIds with the 'main' encounterId for that stay (i.e. the encounterId that links to the corresponding ICNARC record). It also creates a new column, encounterId_original, that contains the original ids prior to replacement. 

When the flag "log_error_type" is set to True, the function creates another new column, "error_type", which specifies the type of error associated with each encounterId. It specifies one of 16 error types identified by Josh Inoue (UHB) and described in the file "Philips encounterId Read Me.txt", or contains "NA" if there is no known error. This flag is useful because there are encounterIds which we know have an issue but which we do not alter. For example, very occasionally a patient is discharged from ICU and then readmitted but is never removed from the Philips system. In such cases we know that a single encounterId in Philips (and in ICNARC) corresponds to two physical ICU stays. With the "error_type" column, the user of the data can decide how they want to deal with different types of error during data processing.

Note: here we are processing a patient summary (demographic) table, so we want to log the error type for each encounter. It is neither recommended nor required to do this when processing a large table of physiological measurements/interventions, so the default value of log_error_type is False.

In [4]:
philips_data = clean_philips_encounterids('../encounter_summary (1).rpt', 
                                  '../Philips encounterId Issue List (New).xlsx',
                                  verbose=VERBOSE, log_error_type=True)
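
The implementation of clean_philips_encounterids is not shown here. One way the behaviour described above could be sketched is a left join against the issue list followed by fillna, so that encounters without a known issue keep their own ID and receive "NA" as error type. The `issues` table, its column names and the error label are hypothetical.

```python
import pandas as pd

# Toy encounter summary: encounters 11 and 12 belong to the same physical stay.
encounters = pd.DataFrame({'encounterId': [10, 11, 12, 13]})

# Hypothetical issue list: which 'main' ID to use instead, and why.
issues = pd.DataFrame({
    'encounterId': [12],
    'main_encounterId': [11],
    'error_type': ['duplicate record'],
})

cleaned = encounters.merge(issues, how='left', on='encounterId')
cleaned['encounterId_original'] = cleaned['encounterId']

# Rows without a known issue keep their own ID and get error_type 'NA'.
cleaned['encounterId'] = (cleaned['main_encounterId']
                          .fillna(cleaned['encounterId'])
                          .astype(int))
cleaned['error_type'] = cleaned['error_type'].fillna('NA')
cleaned = cleaned.drop(columns='main_encounterId')
```

After this step two rows share encounterId 11, which is why duplicate rows need combining later when a one-row-per-stay table is wanted.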

In [5]:
philips_data

Unnamed: 0,ptCensusId,encounterId_original,inTime,outTime,age,tNumber,lengthOfStay (mins),gender,encounterId,error_type
0,1099,1089,2015-03-01 13:27:05,2015-03-09 16:42:57,58.0,T7788877,11715,Male,1089,
1,1102,1092,2015-03-01 19:01:34,2015-03-02 17:06:43,60.0,T7364685,1325,Female,1092,
2,1103,1093,2015-03-01 22:03:42,2015-03-04 02:57:47,90.0,T5004466,3174,Female,1093,
3,1104,1094,2015-03-01 22:45:51,2015-03-01 23:03:52,73.0,T5875123,18,Female,1094,
4,1105,1095,2015-03-01 22:49:16,2015-04-13 19:26:25,43.0,T5835459,61657,Female,1095,
5,1106,1096,2015-03-02 18:35:46,2015-03-04 02:58:16,54.0,T7806868,1943,Male,1096,
6,1107,1097,2015-03-02 18:56:55,2015-03-04 00:05:24,79.0,T5089790,1749,Female,1097,
7,1108,1098,2015-03-02 20:57:39,2015-03-10 12:24:30,78.0,T5050350,11007,Female,1098,
8,1109,1099,2015-03-03 06:01:25,2015-03-12 15:58:11,70.0,T5052198,13557,Female,1099,
9,1110,1100,2015-03-03 15:26:41,2015-03-05 16:31:14,61.0,T5007198,2945,Male,1100,


#### Link ICNARC to Philips.

Having cleaned the encounterId numbers, we can now link the records in the ICNARC data to the corresponding ICU stays in Philips. The following function renames the "encounterId" column in the Philips data to "CIS Patient ID" and then does an SQL-style inner join on that column using pandas.DataFrame.merge.

In [6]:
merged_data = join_icnarc_to_philips(philips_data, icnarc_numbers, verbose=VERBOSE)
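
For readers unfamiliar with the join, the rename-then-merge step described above can be sketched on toy data as follows. This is only an illustration of the pattern, not the body of join_icnarc_to_philips; the values are invented.

```python
import pandas as pd

# Toy Philips and ICNARC tables (values are invented).
philips = pd.DataFrame({'encounterId': [1, 2, 3],
                        'age': [58.0, 60.0, 90.0]})
icnarc = pd.DataFrame({'CIS Patient ID': [2, 3, 4],
                       'ICNARC number': [20150222, 20150223, 20150224]})

# Rename the key so both tables share it, then inner join on that key.
linked = (philips
          .rename(columns={'encounterId': 'CIS Patient ID'})
          .merge(icnarc, how='inner', on='CIS Patient ID'))
```

An inner join silently drops rows that have no partner on either side (IDs 1 and 4 here), so it is worth comparing row counts before and after the merge.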

Because the function clean_philips_encounterids() replaced some erroneous encounterIds with their correct value, it produced multiple rows with the same ID (now called CIS Patient ID). In general this is not an issue (for example when processing physiological data extracted from Philips), but here we are trying to produce a table with one row per stay that contains summary data about that stay. The following function combines the duplicate rows:

In [7]:
merged_data = combine_non_unique_encounters(merged_data, combine='simple', verbose=VERBOSE)
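
What combine='simple' does internally is not documented in this notebook. Purely as an assumption, one plausible way to collapse duplicate rows is to take the earliest inTime and the latest outTime per stay and recompute the length of stay from those, as in this toy sketch.

```python
import pandas as pd

# Toy table: CIS Patient ID 1 appears twice after ID clean-up.
stays = pd.DataFrame({
    'CIS Patient ID': [1, 1, 2],
    'inTime': pd.to_datetime(['2015-03-01 10:00', '2015-03-02 10:00',
                              '2015-03-05 08:00']),
    'outTime': pd.to_datetime(['2015-03-02 09:00', '2015-03-04 10:00',
                               '2015-03-06 08:00']),
    'gender': ['Male', 'Male', 'Female'],
})

# One row per stay: earliest admission, latest discharge, first gender seen.
combined = (stays
            .groupby('CIS Patient ID', as_index=False)
            .agg({'inTime': 'min', 'outTime': 'max', 'gender': 'first'}))

# Recompute length of stay in minutes from the combined timestamps.
combined['lengthOfStay (mins)'] = (
    (combined['outTime'] - combined['inTime']).dt.total_seconds() / 60
).astype(int)
```

Note that this choice counts any gap between the duplicate records as time in ICU; summing the individual lengths of stay would be an alternative with the opposite bias.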

#### We now have a clean link between ICNARC and Philips with some summary data on each intensive care stay. This dataframe will serve as our lookup table as we continue to process more data. We now rename this dataframe 'icustays' because it defines each intensive care stay and as such will be a core table in the research database.

#### Next:
* We will add variables that have been extracted from the ICNARC CMP dataset in a standard XML format, by joining on 'ICNARC number' and 'Unit ID'
* We will process patient physiological data extractions from Philips, which link to this dataframe on 'CIS Patient ID'

The dataframe contains the following columns:

In [8]:
icustays = merged_data
icustays.columns

Index([u'CIS Patient ID', u'Readmission during this hospital stay',
       u'ICNARC number', u'tNumber', u'encounterId_original', u'inTime',
       u'outTime', u'lengthOfStay (mins)', u'gender', u'Unit ID',
       u'CIS Patient ID Original', u'ptCensusId', u'CIS Episode ID', u'age'],
      dtype='object')

#### Add ICNARC CMP variables.

We now parse the xml file that contains the ICNARC CMP dataset variables. This will form another core table in the research database and would be standard across all UK intensive care units.  

In [9]:
icnarc_data = parse_icnarc_xml("../ICNARC_Dataset_2015-2018__clean_.xml",
                               "../ICNARC CMP Dataset Properties.xlsx",
                              verbose=VERBOSE)

The ICNARC data links to 'icustays' on 'ICNARC number' and 'Unit ID':

In [10]:
icnarc_data = icnarc_data.merge(merged_data, on=['ICNARC number', 'Unit ID'])

Note: There are three icustays missing in this xml extraction of the ICNARC data. These stays can be determined with the following line of code:
    icustays[~icustays['ICNARC number'].isin(icnarc_data['ICNARC number'])]
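
The check above compares 'ICNARC number' only. If the same number could occur under more than one 'Unit ID', a check on both join keys may be safer. A minimal sketch on toy data, using the indicator option of pandas.DataFrame.merge:

```python
import pandas as pd

# Toy stand-ins for icustays and the parsed xml data (values are invented).
icustays_toy = pd.DataFrame({'ICNARC number': [1, 2, 3], 'Unit ID': [1, 1, 1]})
xml_toy = pd.DataFrame({'ICNARC number': [1, 3], 'Unit ID': [1, 1]})

keys = ['ICNARC number', 'Unit ID']

# Left join with an indicator column, then keep rows found only on the left.
check = icustays_toy.merge(xml_toy[keys].drop_duplicates(),
                           how='left', on=keys, indicator=True)
missing = check[check['_merge'] == 'left_only'].drop(columns='_merge')
```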

## Cleanup from here...
#### We now evaluate some basic statistics to characterise the patient cohort:

In [11]:
convert_minutes_to_days = lambda x: x/float(24*60)

#### First we use variables from the initial WW/Philips join:

In [13]:
def print_ww_philips_summary(df):
    # Summarise age, length of stay, gender and readmission for the stays in df.
    n_stays = float(len(df))

    median_age = np.median(df['age'].values)
    age_q25, age_q75 = np.percentile(df['age'].values, [25,75])
    print("Age, median years (IQR): %.1f (%.1f, %.1f)" %(median_age,age_q25,age_q75))

    median_los = convert_minutes_to_days(np.median(df['lengthOfStay (mins)'].values))
    los_q25, los_q75 = map(convert_minutes_to_days, np.percentile(df['lengthOfStay (mins)'].values, [25,75]))
    print("LOS, median days (IQR): %.1f (%.1f, %.1f)" %(median_los,los_q25,los_q75))
    print("")
    
    n_male = sum(df['gender']=='Male')
    n_female = sum(df['gender']=='Female')
    no_gender = sum(df['gender'].isna())    
    print("Gender, %% female: %.1f" %(100 * n_female / float(n_male + n_female)))
    print("(%.1f %% of patients have no gender recorded in Philips.)" %(100*no_gender/n_stays))
    print(" ")
    
    # Use the dataframe passed in rather than the global merged_data/icnarc_data.
    readmit = sum(df['Readmission during this hospital stay']=='Yes')
    no_readmit = sum(df['Readmission during this hospital stay'].isna())
    print("Readmission to ICU, #(%%) : %d (%.1f)" %(readmit, 100*readmit/n_stays))
    print("(%.1f %% of patients have no recording for this variable in WW.)" %(100*no_readmit/n_stays))



In [14]:
print_ww_philips_summary(icnarc_data)

Age, median years (IQR): 64.0 (50.0, 73.0)
LOS, median days (IQR): 3.0 (1.7, 5.6)

Gender, % female: 39.7
(0.1 % of patients have no gender recorded in Philips.)
 
Readmission to ICU, #(%) : 147 (3.0)
(0.0 % of patients have no recording for this variable in WW.)


#### To validate these we now look at the same variables in the XML file of the ICNARC dataset:

### TODO: convert the dates and times to datetime objects and calculate Age and LOS...
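
A possible starting point for this TODO is sketched below. The column names ('Date of birth', 'Admitted', 'Discharged') and the date format are assumptions, since the actual field names in the parsed xml are not shown here, and the values are invented.

```python
import pandas as pd

# Toy data with hypothetical column names and invented values.
toy = pd.DataFrame({
    'Date of birth': ['1957-01-15', '1955-06-30'],
    'Admitted': ['2015-03-01 13:27:05', '2015-03-01 19:01:34'],
    'Discharged': ['2015-03-09 16:42:57', '2015-03-02 17:06:43'],
})
for col in ['Date of birth', 'Admitted', 'Discharged']:
    toy[col] = pd.to_datetime(toy[col])

# Length of stay in days.
toy['los_days'] = ((toy['Discharged'] - toy['Admitted'])
                   .dt.total_seconds() / (24 * 60 * 60))

# Age in completed years at admission: subtract one year if the
# birthday has not yet been reached in the admission year.
admitted, born = toy['Admitted'], toy['Date of birth']
before_birthday = ((admitted.dt.month < born.dt.month) |
                   ((admitted.dt.month == born.dt.month) &
                    (admitted.dt.day < born.dt.day)))
toy['age_years'] = admitted.dt.year - born.dt.year - before_birthday.astype(int)
```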

#### How much missing data in the ICNARC xml?

### Note: is this right or has the parsing missed values? e.g. 1231 with no outcome on leaving hospital?

In [15]:
print sum(icnarc_data['Weight in kg'].isna())
print sum(icnarc_data['Height in cm'].isna())
print sum(icnarc_data['Status at discharge from your hospital'].isna())
print sum(icnarc_data['Status at ultimate discharge from hospital'].isna())
print sum(icnarc_data['Time when fully ready to discharge'].isna())
print sum(icnarc_data['Status at discharge from your unit'].isna())
print sum(icnarc_data['Status at ultimate discharge from ICU/HDU'].isna())
print sum(icnarc_data['Reason for discharge from your unit'].isna())
print sum(icnarc_data['Sex'].isna())

0
0
1231
4505
1103
0
4733
703
0


#### We calculate BMI:

In [16]:
icnarc_data['bmi'] = icnarc_data['Weight in kg'].astype(float)/((icnarc_data['Height in cm'].astype(float)/100.0)**2)

In [17]:
print sum(icnarc_data['Sex']=='F')/float(len(icnarc_data))
print np.median(icnarc_data['bmi'].values)

0.39639826123
26.564344746162927


In [18]:
no_outcome = icnarc_data[icnarc_data['Status at discharge from your hospital'].isna()]

#### Make some plots...

### Need to get dictionary for ICNARC codes

In [19]:
icnarc_data['Primary reason for admission to your unit']

0        2.7.1.13.3
1       2.11.1.27.1
2        1.3.5.30.1
3        2.3.9.28.1
4        1.1.5.27.2
5        2.6.8.34.5
6        1.3.3.39.1
7        2.7.1.13.1
8       1.3.10.27.6
9        1.3.9.39.1
10       2.1.2.30.1
11       2.1.2.27.1
12        2.1.4.6.1
13       2.2.1.30.1
14       1.7.3.39.1
15       1.3.6.39.2
16      2.2.12.35.4
17       2.4.2.33.1
18       1.3.9.39.1
19      2.2.12.35.2
20        2.6.8.1.2
21       1.3.2.15.1
22       1.3.7.27.1
23       2.9.2.25.1
24       2.1.4.27.5
25       2.1.4.27.1
26       1.1.4.39.1
27        2.1.4.6.1
28       2.7.1.13.4
29       1.3.9.39.1
           ...     
4801      2.4.2.7.3
4802      2.4.2.7.3
4803     1.3.5.41.4
4804     2.6.8.1.10
4805     2.1.2.30.1
4806     2.6.8.34.5
4807      2.4.2.7.3
4808     1.3.1.39.3
4809    1.10.2.56.2
4810     1.3.6.57.1
4811     2.7.1.13.1
4812      2.4.2.7.3
4813     2.1.10.7.3
4814     2.6.8.34.9
4815     1.7.3.39.1
4816     1.1.9.56.1
4817     1.4.1.39.5
4818     2.7.1.27.1
4819     1.1.5.39.2
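
Until a dictionary for these codes is available, the dotted codes can at least be split into their five numeric components so that admissions can be grouped by the leading parts. The sketch below uses three codes from the output above; the `level_1` to `level_5` names are placeholders, and their meaning has to come from the ICNARC code dictionary.

```python
import pandas as pd

# Three codes copied from the column shown above.
codes = pd.Series(['2.7.1.13.3', '2.11.1.27.1', '1.3.5.30.1'])

# Placeholder names; the real meaning of each level must come from the dictionary.
levels = ['level_1', 'level_2', 'level_3', 'level_4', 'level_5']

# Split on the dots into one integer column per component.
parts = codes.str.split('.', expand=True).astype(int)
parts.columns = levels

# Example grouping: number of admissions per leading component.
top_level_counts = parts['level_1'].value_counts()
```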


#### We now load and process physiological data from Philips.

This data comes from two tables in ICCA:
* PtAssessment - containing flowsheet data, often recorded at high frequency (~1 record/hr)
* PtLabResult - containing laboratory results data (and similar), usually recorded at lower frequency (~1 record/day)

For simplicity we combine these data into a single table, 'chartevents'. 

In [12]:
chartevents = clean_philips_encounterids('../ptassess_physiological_data.rpt', 
                                  '../Philips encounterId Issue List (New).xlsx',
                                  verbose=VERBOSE, date_columns=['chartTime', 'storeTime'])

KeyboardInterrupt: 

In [17]:
# Read both extracts and stack them; pd.concat replaces the deprecated DataFrame.append.
chartevents = pd.concat([pd.read_csv('ptassess_physiological_data.rpt', delimiter='\t'),
                         pd.read_csv('labresults_physiological_data.rpt', delimiter='\t')])

Next we:
* clean the encounter IDs (and rename the column)
* do the same for labresults and concatenate it
* do the simple stats (frequency of recording, frequency of missingness, average values and variance...)
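
As a sketch of what the simple stats step might look like, the example below summarises a toy long-format table per variable. The column names 'variable' and 'value' are assumptions about the extract layout, and the data are invented.

```python
import numpy as np
import pandas as pd

# Toy long-format events table (hypothetical column names, invented values).
events = pd.DataFrame({
    'encounterId': [1, 1, 1, 2, 2, 2],
    'variable': ['heart_rate', 'heart_rate', 'lactate',
                 'heart_rate', 'heart_rate', 'heart_rate'],
    'chartTime': pd.to_datetime(['2015-03-01 10:00', '2015-03-01 11:00',
                                 '2015-03-01 10:30', '2015-03-01 12:00',
                                 '2015-03-01 13:00', '2015-03-01 15:00']),
    'value': [80.0, 90.0, 1.5, 70.0, np.nan, 75.0],
})
events = events.sort_values(['encounterId', 'variable', 'chartTime'])

# Count of non-missing values, mean and sample variance per variable.
summary = events.groupby('variable')['value'].agg(['count', 'mean', 'var'])

# Fraction of records with a missing value, per variable.
summary['fraction_missing'] = (events['value'].isna()
                               .groupby(events['variable']).mean())

# Recording frequency: median gap in hours between consecutive records
# of the same variable within the same stay.
gap_hours = (events.groupby(['encounterId', 'variable'])['chartTime']
             .diff().dt.total_seconds() / 3600.0)
summary['median_gap_hours'] = gap_hours.groupby(events['variable']).median()
```

Variables with a single record per stay (lactate in this toy table) have no gap to measure, so their median gap is NaN.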

In [32]:
t1 = pd.DataFrame()
t1['o'] = [1,2,3,4,5,6]

In [33]:
t2 = pd.DataFrame()
t2['o'] = [3,6]
t2['p'] = [1,5]
t2['q'] = [np.nan,'tom']

In [34]:
t3 = t1.merge(t2, how='left', on='o')

In [35]:
t3

Unnamed: 0,o,p,q
0,1,,
1,2,,
2,3,1.0,
3,4,,
4,5,,
5,6,5.0,tom


In [36]:
t3['p'] = t3['p'].fillna(t3['o'])
t3['q'] = t3['q'].fillna('NA')

In [37]:
t3

Unnamed: 0,o,p,q
0,1,1.0,
1,2,2.0,
2,3,1.0,
3,4,4.0,
4,5,5.0,
5,6,5.0,tom


In [38]:
icustays

Unnamed: 0,CIS Patient ID,Readmission during this hospital stay,ICNARC number,tNumber,encounterId_original,inTime,outTime,lengthOfStay (mins),gender,Unit ID,CIS Patient ID Original,ptCensusId,CIS Episode ID,age
0,1064.0,No,20150290,T6834947,1064,2015-03-20 15:46:19,2015-03-26 12:34:26,8448,Male,1,1064.0,1297,0,76.0
1,1089.0,No,20150221,T7788877,1089,2015-03-01 13:27:05,2015-03-09 16:42:57,11715,Male,1,1089.0,1099,0,58.0
2,1092.0,No,20150222,T7364685,1092,2015-03-01 19:01:34,2015-03-02 17:06:43,1325,Female,1,1092.0,1102,0,60.0
3,1093.0,No,20150223,T5004466,1093,2015-03-01 22:03:42,2015-03-04 02:57:47,3174,Female,1,1093.0,1103,0,90.0
4,1095.0,No,20150224,T5835459,1095,2015-03-01 22:49:16,2015-04-13 19:26:25,61657,Female,1,1095.0,1105,0,43.0
5,1096.0,No,20150225,T7806868,1096,2015-03-02 18:35:46,2015-03-04 09:50:52,1944,Male,1,1096.0,1106,0,54.0
6,1097.0,No,20150226,T5089790,1097,2015-03-02 18:56:55,2015-03-04 09:51:00,1751,Female,1,1097.0,1107,0,79.0
7,1098.0,No,20150227,T5050350,1098,2015-03-02 20:57:39,2015-03-10 12:24:30,11007,Female,1,1098.0,1108,0,78.0
8,1099.0,No,20150228,T5052198,1099,2015-03-03 06:01:25,2015-03-12 15:58:11,13557,Female,1,1099.0,1109,0,70.0
9,1100.0,Yes,20150229,T5007198,1100,2015-03-03 15:26:41,2015-03-05 16:31:14,2945,Male,1,1100.0,1110,0,61.0
