### Merging data from WardWatcher (ICNARC) and Philips (ICCA)

Here we use the following files:

* 'encounter_summary (1).rpt'  - a tab separated file with output from a simple SQL run on ICCA to extract basic information about patient encounters (ITU stays).

* 'ICNARC 2015-2018 encounterIds and Readmissions.TXT' - a file containing ICNARC patient IDs and the corresponding 'CIS Patient ID', which links to the encounterId in Philips.

* 'Philips encounterId Issue List (New).xlsx' - a file documenting known issues with either encounterIds in Philips or CIS Patient IDs in WW. We clean up the IDs using this file before joining the two datasets.

* 'ICNARC_Dataset_2015-2018__clean_.xml' - an XML file containing the ICNARC dataset.

* 'ICNARC CMP Dataset Properties.xlsx' - a description of the variables in the ICNARC dataset.

In [1]:
VERBOSE = False ## For reasons of data protection we suppress printing of results and data summaries.
import numpy as np

#### Functions for cleaning up the encounter IDs and parsing the ICNARC xml file are imported: 

In [2]:
from clean_encounterids import *
from parse_ICNARC_xml import *


In [3]:
icnarc_numbers = clean_icnarc_cis_ids('../ICNARC 2015-2018 encounterIds and Readmissions.TXT', 
                                      '../Philips encounterId Issue List (New).xlsx',
                                    verbose=VERBOSE)

In [4]:
philips_data = clean_philips_encounterids('../encounter_summary (1).rpt', 
                                  '../Philips encounterId Issue List (New).xlsx',
                                  verbose=VERBOSE)

In [5]:
merged_data = join_icnarc_to_philips(philips_data, icnarc_numbers, verbose=VERBOSE)

In [6]:
merged_data = combine_non_unique_encounters(merged_data, combine='simple', verbose=VERBOSE)

#### 'merged_data' now has a one-to-one mapping from ICNARC numbers to Philips encounterIds (CIS Patient ID). It also contains some summary data on each ITU episode that was extracted from Philips:

In [7]:
print(merged_data.columns)

Index([u'CIS Patient ID', u'Readmission during this hospital stay',
       u'ICNARC number', u'tNumber', u'encounterId_original', u'inTime',
       u'outTime', u'lengthOfStay (mins)', u'gender', u'Unit ID',
       u'CIS Patient ID Original', u'ptCensusId', u'CIS Episode ID', u'age'],
      dtype='object')
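The one-to-one claim above can be checked directly. The following is a minimal sketch on a toy frame, not the notebook's own code: `is_one_to_one` and `toy` are hypothetical names, and the choice of key columns is an assumption (if ICNARC numbers are only unique within a unit, 'Unit ID' would need to be added to the left-hand keys).

```python
import pandas as pd

# Toy stand-in for merged_data; the real frame is built in the cells above.
toy = pd.DataFrame({
    'ICNARC number': [101, 102, 103],
    'CIS Patient ID': [9001, 9002, 9003],
})


def is_one_to_one(df, left_keys, right_keys):
    """Return True if no key combination is repeated on either side."""
    left_dupes = df.duplicated(subset=left_keys).any()
    right_dupes = df.duplicated(subset=right_keys).any()
    return not (left_dupes or right_dupes)


print(is_one_to_one(toy, ['ICNARC number'], ['CIS Patient ID']))
```

On the real data the same call would be made on `merged_data` after `combine_non_unique_encounters` has run.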


#### We now parse the xml file that contains the ICNARC CMP dataset variables:

In [10]:
icnarc_data = parse_icnarc_xml("../ICNARC_Dataset_2015-2018__clean_.xml",
                               "../ICNARC CMP Dataset Properties.xlsx",
                              verbose=VERBOSE)

In [18]:
icnarc_data = icnarc_data.merge(merged_data, on=['ICNARC number', 'Unit ID'])

In [71]:
print("The XML file has %d ITU stays," %len(icnarc_data))
print("But the merged WW and Philips dataframe has %d stays." %len(merged_data))

The XML file has 4831 ITU stays,
But the merged WW and Philips dataframe has 4834 stays.


#### There are 3 stays missing from the XML file. These are:

In [None]:
merged_data[~merged_data['ICNARC number'].isin(icnarc_data['ICNARC number'])] ## Output deleted for data protection.
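The lookup above matches on 'ICNARC number' only, whereas the merge was done on both 'ICNARC number' and 'Unit ID'. As a hedged alternative, unmatched rows can be found on the full key with a left merge and `indicator=True`. This is a sketch on toy data; `rows_missing_from`, `left` and `right` are hypothetical names.

```python
import pandas as pd

# Toy frames standing in for merged_data (left) and the parsed XML (right).
left = pd.DataFrame({'ICNARC number': [1, 2, 3, 4],
                     'Unit ID': ['A', 'A', 'A', 'B']})
right = pd.DataFrame({'ICNARC number': [1, 2, 4],
                      'Unit ID': ['A', 'A', 'A']})


def rows_missing_from(left, right, keys):
    """Return rows of `left` whose key combination has no match in `right`."""
    flagged = left.merge(right[keys].drop_duplicates(),
                         on=keys, how='left', indicator=True)
    # '_merge' is 'left_only' for rows that found no partner in `right`.
    return flagged.loc[flagged['_merge'] == 'left_only', left.columns]


missing = rows_missing_from(left, right, ['ICNARC number', 'Unit ID'])
print(missing)
```

In the toy data the stay with ICNARC number 4 is reported as missing because its 'Unit ID' differs, which a check on 'ICNARC number' alone would not show.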

#### We now evaluate some basic statistics to characterise the patient cohort:

In [8]:
convert_minutes_to_days = lambda x: x/float(24*60)

#### First we use variables from the initial WW/Philips join:

In [126]:
def print_ww_philips_summary(df):

    # Age: median and interquartile range, in years.
    median_age = np.median(df['age'].values)
    age_q25, age_q75 = np.percentile(df['age'].values, [25,75])
    print("Age, median years (IQR): %.1f (%.1f, %.1f)" %(median_age,age_q25,age_q75))

    # Length of stay: recorded in minutes, reported in days.
    median_los = convert_minutes_to_days(np.median(df['lengthOfStay (mins)'].values))
    los_q25, los_q75 = map(convert_minutes_to_days, np.percentile(df['lengthOfStay (mins)'].values, [25,75]))
    print("LOS, median days (IQR): %.1f (%.1f, %.1f)" %(median_los,los_q25,los_q75))
    print("")

    # Gender: the percentage is taken over stays with a recorded gender.
    n_male = sum(df['gender']=='Male')
    n_female = sum(df['gender']=='Female')
    no_gender = sum(df['gender'].isna())
    print("Gender, %% female: %.1f" %(100 * n_female / float(n_male + n_female)))
    print("(%.1f %% of patients have no gender recorded in Philips.)" %(100*no_gender/float(len(df))))
    print(" ")

    # NOTE: readmission is read from the global merged_data (all joined WW/Philips stays), not from df.
    readmit = sum(merged_data['Readmission during this hospital stay']=='Yes')
    no_readmit = sum(merged_data['Readmission during this hospital stay'].isna())
    print("Readmission to ICU, #(%%) : %d (%.1f)" %(readmit, 100*readmit/float(len(merged_data))))
    print("(%.1f %% of patients have no recording for this variable in WW.)" %(100*no_readmit/float(len(merged_data))))


In [127]:
print_ww_philips_summary(icnarc_data)

Age, median years (IQR): 64.0 (50.0, 73.0)
LOS, median days (IQR): 3.0 (1.7, 5.6)

Gender, % female: 39.7
(0.1 % of patients have no gender recorded in Philips.)
 
Readmission to ICU, #(%) : 147 (3.0)
(0.0 % of patients have no recording for this variable in WW.)


#### To validate these we now look at the same variables in the XML file of the ICNARC dataset:

### TODO: convert the dates and times to datetime objects and calculate Age and LOS...
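One possible way to approach this TODO is sketched below on a toy row. The column names and the date and time formats are assumptions and would need to be checked against 'ICNARC CMP Dataset Properties.xlsx'; `add_age_and_los` is a hypothetical helper, not part of `parse_ICNARC_xml`.

```python
import pandas as pd

# Toy row; column names and string formats are assumed, not taken from the XML.
toy = pd.DataFrame({
    'Date of birth': ['1950-06-15'],
    'Date of admission to your unit': ['2016-03-01'],
    'Time of admission to your unit': ['08:30:00'],
    'Date of discharge from your unit': ['2016-03-04'],
    'Time of discharge from your unit': ['20:30:00'],
})


def add_age_and_los(df):
    """Add 'age_years' (completed years at admission) and 'los_days'."""
    out = df.copy()
    fmt = '%Y-%m-%d %H:%M:%S'  # assumed format of the combined date and time
    admit = pd.to_datetime(out['Date of admission to your unit'] + ' ' +
                           out['Time of admission to your unit'], format=fmt)
    discharge = pd.to_datetime(out['Date of discharge from your unit'] + ' ' +
                               out['Time of discharge from your unit'], format=fmt)
    birth = pd.to_datetime(out['Date of birth'], format='%Y-%m-%d')

    out['los_days'] = (discharge - admit).dt.total_seconds() / (24 * 60 * 60.0)

    # Subtract one year if the birthday has not yet occurred in the admission year.
    had_birthday = ((admit.dt.month > birth.dt.month) |
                    ((admit.dt.month == birth.dt.month) &
                     (admit.dt.day >= birth.dt.day)))
    out['age_years'] = admit.dt.year - birth.dt.year - (~had_birthday).astype(int)
    return out


result = add_age_and_los(toy)
print(result[['age_years', 'los_days']])
```

The resulting values could then be compared stay by stay with 'age' and 'lengthOfStay (mins)' from Philips.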

#### How much missing data in the ICNARC xml?

### Note: is this right or has the parsing missed values? e.g. 1231 with no outcome on leaving hospital?

In [130]:
print(sum(icnarc_data['Weight in kg'].isna()))
print(sum(icnarc_data['Height in cm'].isna()))
print(sum(icnarc_data['Status at discharge from your hospital'].isna()))
print(sum(icnarc_data['Status at ultimate discharge from hospital'].isna()))
print(sum(icnarc_data['Time when fully ready to discharge'].isna()))
print(sum(icnarc_data['Status at discharge from your unit'].isna()))
print(sum(icnarc_data['Status at ultimate discharge from ICU/HDU'].isna()))
print(sum(icnarc_data['Reason for discharge from your unit'].isna()))
print(sum(icnarc_data['Sex'].isna()))

0
0
1231
4505
1103
0
4733
703
0
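The same counts can be collected for any set of columns at once, with percentages alongside. This is a minimal sketch on a toy frame; `missing_summary` is a hypothetical helper, and it only counts values that pandas treats as missing (NaN or None).

```python
import pandas as pd

# Toy frame with a few of the ICNARC column names used above.
toy = pd.DataFrame({
    'Weight in kg': [70.0, 82.5, 64.0, 90.0],
    'Status at discharge from your hospital': ['A', None, 'D', None],
    'Sex': ['F', 'M', 'M', 'F'],
})


def missing_summary(df, columns=None):
    """Count and percentage of missing values per column, largest first."""
    cols = list(df.columns) if columns is None else list(columns)
    n_missing = df[cols].isna().sum()
    summary = pd.DataFrame({
        'n_missing': n_missing,
        'pct_missing': 100.0 * n_missing / len(df),
    })
    return summary.sort_values('n_missing', ascending=False)


summary = missing_summary(toy)
print(summary)
```

If the parser stores absent values as empty strings rather than NaN, they would not be counted here, which is worth checking when deciding whether the parsing has missed values.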


#### We calculate BMI:

In [70]:
icnarc_data['bmi'] = icnarc_data['Weight in kg'].astype(float)/((icnarc_data['Height in cm'].astype(float)/100.0)**2)

In [128]:
print(sum(icnarc_data['Sex']=='F')/float(len(icnarc_data)))
print(np.median(icnarc_data['bmi'].values))

0.39639826123
26.564344746162927


In [50]:
no_outcome = icnarc_data[icnarc_data['Status at discharge from your hospital'].isna()]

#### Make some plots...

### Need to get dictionary for ICNARC codes
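Until a dictionary is available, the dotted codes in 'Primary reason for admission to your unit' (for example 2.7.1.13.3) can at least be split into their numeric components so that stays can be grouped by prefix. This is a sketch only: `split_icnarc_codes` is a hypothetical helper, and the column names `level_1` to `level_5` are placeholders whose meaning would have to come from the ICNARC code list.

```python
import pandas as pd

# A few codes copied from the column shown in this notebook.
codes = pd.Series(['2.7.1.13.3', '1.3.5.30.1', '2.11.1.27.1'])


def split_icnarc_codes(series):
    """Split dotted codes into one integer column per component."""
    parts = series.str.split('.', expand=True)
    parts.columns = ['level_%d' % (i + 1) for i in range(parts.shape[1])]
    # Components that are absent or non-numeric become NaN instead of raising.
    return parts.apply(pd.to_numeric, errors='coerce')


parts = split_icnarc_codes(codes)
print(parts)

# Number of stays per combination of the first two components.
print(parts.groupby(['level_1', 'level_2']).size())
```

Once the code dictionary is obtained, each level (or the full code) could be mapped to its description with `Series.map`.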

In [131]:
icnarc_data['Primary reason for admission to your unit']

0        2.7.1.13.3
1       2.11.1.27.1
2        1.3.5.30.1
3        2.3.9.28.1
4        1.1.5.27.2
5        2.6.8.34.5
6        1.3.3.39.1
7        2.7.1.13.1
8       1.3.10.27.6
9        1.3.9.39.1
10       2.1.2.30.1
11       2.1.2.27.1
12        2.1.4.6.1
13       2.2.1.30.1
14       1.7.3.39.1
15       1.3.6.39.2
16      2.2.12.35.4
17       2.4.2.33.1
18       1.3.9.39.1
19      2.2.12.35.2
20        2.6.8.1.2
21       1.3.2.15.1
22       1.3.7.27.1
23       2.9.2.25.1
24       2.1.4.27.5
25       2.1.4.27.1
26       1.1.4.39.1
27        2.1.4.6.1
28       2.7.1.13.4
29       1.3.9.39.1
           ...     
4801      2.4.2.7.3
4802      2.4.2.7.3
4803     1.3.5.41.4
4804     2.6.8.1.10
4805     2.1.2.30.1
4806     2.6.8.34.5
4807      2.4.2.7.3
4808     1.3.1.39.3
4809    1.10.2.56.2
4810     1.3.6.57.1
4811     2.7.1.13.1
4812      2.4.2.7.3
4813     2.1.10.7.3
4814     2.6.8.34.9
4815     1.7.3.39.1
4816     1.1.9.56.1
4817     1.4.1.39.5
4818     2.7.1.27.1
4819     1.1.5.39.2
