### Merging data from WardWatcher (ICNARC) and Philips (ICCA).

Here we use the following files:

* 'encounter_summary (1).rpt' - a tab-separated file containing the output of a simple SQL query run on ICCA to extract basic information about patient encounters (ITU stays).

* 'ICNARC 2015-2018 encounterIds and Readmissions.TXT' - a file containing ICNARC patient IDs and the corresponding 'CIS Patient ID', which links to the encounterId in Philips.

* 'Philips encounterId Issue List (New).xlsx' - a file documenting known issues with either encounterIds in Philips or CIS Patient IDs in WW. We clean up the IDs using this file before joining the two datasets.

* 'ICNARC_Dataset_2015-2018__clean_.xml' - an XML file containing the output of the ICNARC dataset.

* 'ICNARC CMP Dataset Properties.xlsx' - a description of the variables in the ICNARC dataset.
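
As a rough illustration of what cleaning IDs against an issue list can look like, the sketch below applies a table of known corrections to an ID column before any join. It is not the code in `clean_encounterids`; the column names (`recorded_id`, `corrected_id`, `cis_patient_id`, `icnarc_number`) and the toy values are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical issue list: each row maps a known-bad ID to its correction.
issues = pd.DataFrame({
    "recorded_id": ["1001a", "2002 "],
    "corrected_id": ["1001", "2002"],
})

# Hypothetical ICNARC-side table containing two mistyped CIS Patient IDs.
icnarc_ids = pd.DataFrame({
    "icnarc_number": [1, 2, 3],
    "cis_patient_id": ["1001a", "3003", "2002 "],
})

# Build a lookup and substitute the corrections.
# IDs that do not appear in the issue list are left unchanged.
corrections = dict(zip(issues["recorded_id"], issues["corrected_id"]))
icnarc_ids["cis_patient_id"] = icnarc_ids["cis_patient_id"].replace(corrections)
```

Applying the corrections before the join means that a mistyped ID is matched to its encounter rather than silently dropped.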

In [1]:
VERBOSE = False ## For reasons of data protection we suppress printing of results and data summaries.

In [None]:
from clean_encounterids import *

In [3]:
icnarc_numbers = clean_icnarc_cis_ids('../ICNARC 2015-2018 encounterIds and Readmissions.TXT', 
                                      '../Philips encounterId Issue List (New).xlsx',
                                    verbose=VERBOSE)

In [4]:
philips_data = clean_philips_encounterids('../encounter_summary (1).rpt', 
                                  '../Philips encounterId Issue List (New).xlsx',
                                  verbose=VERBOSE)

In [5]:
merged_data = join_icnarc_to_philips(philips_data, icnarc_numbers, verbose=VERBOSE)

In [6]:
merged_data = combine_non_unique_encounters(merged_data, combine='simple', verbose=VERBOSE)

In [9]:
merged_data['age'].values.mean()
#merged_data['age_min'].values.mean() ## if using concat

60.50744724865536

In [11]:
# Parenthesised single-argument print works in both Python 2 and Python 3.
print(sum(merged_data['gender']=='Male'))
print(sum(merged_data['gender']=='Female'))
print(len(merged_data['gender']))

#print(sum(merged_data['gender_first']=='Male'))
#print(sum(merged_data['gender_first']=='Female'))
#print(len(merged_data['gender_first']))

2911
1915
4834
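
The Male and Female counts above sum to 4826, eight fewer than the 4834 rows, so some records presumably carry a missing or different value. One way to see every category at once is `value_counts(dropna=False)`. The sketch below uses a toy series in place of `merged_data['gender']`; the values are illustrative and say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for merged_data['gender']; values are illustrative only.
gender = pd.Series(["Male", "Female", "Male", None, "Female", "Male"])

# dropna=False keeps missing values as their own row in the tally.
counts = gender.value_counts(dropna=False)
print(counts)

# Number of rows that are neither 'Male' nor 'Female'.
n_other = len(gender) - (gender == "Male").sum() - (gender == "Female").sum()
print(n_other)
```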


#### We now parse the XML file

In [12]:
from parse_ICNARC_xml import *

In [13]:
icnarc_data = parse_icnarc_xml("../ICNARC_Dataset_2015-2018__clean_.xml",
                               "../ICNARC CMP Dataset Properties.xlsx",
                              verbose=VERBOSE)

In [14]:
icnarc_data.columns

Index([u'Assisted conception used for recent pregnancy',
       u'Advanced cardiovascular support days', u'ICNARC Number', u'HIV/AIDS',
       u'Acute myelogenous/lymphocytic leukaemia or multiple myeloma',
       u'Antimicrobial use after 48 hours in your unit',
       u'Associated neutrophil count highest WBC',
       u'Associated neutrophil count lowest WBC',
       u'Associated neutrophil count pre-admit WBC',
       u'Advanced respiratory support days',
       ...
       u'Type of adult ICU/HDU (out)',
       u'Status at ultimate discharge from ICU/HDU',
       u'Status at ultimate discharge from hospital', u'Urine output',
       u'Ultimate primary reason for admission to your unit',
       u'Associated verbal component', u'VRE present',
       u'Very severe cardiovascular disease', u'Weight in kg',
       u'Weight in kg estimated'],
      dtype='object', length=205)

#### We now join on ICNARC number and look at some of the important columns.

In particular:
* do the LOS numbers match?
* do ages match?
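
A minimal sketch of how such a check could be set up with `DataFrame.merge` is shown below. Only `ICNARC Number` is a column name taken from the parsed dataset above; the other column names, the toy values, and the half-day tolerance for length of stay are assumptions for illustration, and the sketch also assumes that `merged_data` carries the ICNARC number.

```python
import pandas as pd

# Toy stand-ins; apart from 'ICNARC Number' every column name is hypothetical.
merged = pd.DataFrame({
    "ICNARC Number": [1, 2, 3],
    "age": [64, 71, 58],
    "los_days": [2.5, 10.0, 4.0],
})
icnarc = pd.DataFrame({
    "ICNARC Number": [1, 2, 4],
    "Age (ICNARC)": [64, 70, 33],
    "LOS days (ICNARC)": [2.5, 10.2, 1.0],
})

# An outer join with indicator=True records which side each row came from.
joined = merged.merge(icnarc, on="ICNARC Number", how="outer", indicator=True)
both = joined[joined["_merge"] == "both"]

# Flag disagreements; the half-day LOS tolerance is an arbitrary choice.
age_mismatch = both[both["age"] != both["Age (ICNARC)"]]
los_diff = (both["los_days"] - both["LOS days (ICNARC)"]).abs()
los_mismatch = both[los_diff > 0.5]

print(joined["_merge"].value_counts())
print(age_mismatch[["ICNARC Number", "age", "Age (ICNARC)"]])
```

The `_merge` column makes unmatched records on either side visible, which is worth inspecting before comparing values on the matched rows.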

In [None]:
merged_data