### Merging data from WardWatcher (ICNARC) and Philips (ICCA).

Here we use the following files:

* 'encounter_summary (1).rpt'  - a tab separated file with output from a simple SQL run on ICCA to extract basic information about patient encounters (ITU stays).

* 'ICNARC 2015-2018 encounterIds and Readmissions.TXT' - a file containing ICNARC patient IDs and the corresponding 'CIS Patient ID', which link to encounterID in Philips.

* 'Philips encounterId Issue List (New).xlsx' - a file documenting known issues with either encounterIds in Philips or CIS Patient IDs in WW. We clean up the IDs using this file before joining the two datasets.

* 'ICNARC_Dataset_2015-2018__clean_.xml' - an XML file containing the output of the ICNARC dataset.

* 'ICNARC CMP Dataset Properties.xlsx' - a description of the variables in the ICNARC dataset.

In [10]:
VERBOSE = False  # For reasons of data protection we suppress printing of results and data summaries.
import numpy as np

#### Functions for cleaning up the encounter IDs and parsing the ICNARC xml file are imported: 

In [2]:
from clean_encounterids import *
from parse_ICNARC_xml import *


In [3]:
icnarc_numbers = clean_icnarc_cis_ids('../ICNARC 2015-2018 encounterIds and Readmissions.TXT', 
                                      '../Philips encounterId Issue List (New).xlsx',
                                    verbose=VERBOSE)

In [4]:
philips_data = clean_philips_encounterids('../encounter_summary (1).rpt', 
                                  '../Philips encounterId Issue List (New).xlsx',
                                  verbose=VERBOSE)

In [5]:
merged_data = join_icnarc_to_philips(philips_data, icnarc_numbers, verbose=VERBOSE)

In [6]:
merged_data = combine_non_unique_encounters(merged_data, combine='simple', verbose=VERBOSE)

#### 'merged_data' now has a one-to-one mapping from ICNARC numbers to Philips encounterIds (CIS Patient ID). It also contains some summary data on each ITU episode that was extracted from Philips:

In [9]:
print(merged_data.columns)

Index([u'CIS Patient ID', u'Readmission during this hospital stay',
       u'ICNARC number', u'tNumber', u'encounterId_original', u'inTime',
       u'outTime', u'lengthOfStay (mins)', u'gender', u'Unit ID',
       u'CIS Patient ID Original', u'ptCensusId', u'CIS Episode ID', u'age'],
      dtype='object')
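
The one-to-one claim above can be checked directly. The following is a minimal sketch on toy frames, not the real data: the two column names are taken from the listing above, while `is_one_to_one` is a hypothetical helper and is not part of `clean_encounterids`.

```python
import pandas as pd

def is_one_to_one(df, left, right):
    """Return True if every value of `left` pairs with exactly one
    value of `right`, and vice versa."""
    # Collapse repeated identical pairs first, so that only genuinely
    # conflicting pairings show up as duplicates in either column.
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Invented IDs for illustration only.
toy = pd.DataFrame({
    'ICNARC number': [101, 102, 103],
    'CIS Patient ID': [9001, 9002, 9003],
})
# Here ICNARC number 102 maps to two different CIS Patient IDs.
toy_bad = pd.DataFrame({
    'ICNARC number': [101, 102, 102],
    'CIS Patient ID': [9001, 9002, 9003],
})

print(is_one_to_one(toy, 'ICNARC number', 'CIS Patient ID'))
print(is_one_to_one(toy_bad, 'ICNARC number', 'CIS Patient ID'))
```

On the real data the call would presumably be `is_one_to_one(merged_data, 'ICNARC number', 'CIS Patient ID')`, assuming neither column contains missing values.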


In [20]:
def convert_minutes_to_days(x):
    return x / float(24 * 60)

In [24]:
# Median and interquartile range of age
median_age = np.median(merged_data['age'].values)
age_q25, age_q75 = np.percentile(merged_data['age'].values, [25, 75])

print(age_q25)
print(age_q75)

# Gender counts; rows that are neither 'Male' nor 'Female' count as missing
n_male = sum(merged_data['gender'] == 'Male')
n_female = sum(merged_data['gender'] == 'Female')
no_gender = len(merged_data['gender']) - (n_male + n_female)
print(no_gender)
print(n_female / float(n_male + n_female))  # proportion female among known genders

# Median and interquartile range of length of stay, converted to days
median_los = convert_minutes_to_days(np.median(merged_data['lengthOfStay (mins)'].values))
los_q25, los_q75 = map(convert_minutes_to_days, np.percentile(merged_data['lengthOfStay (mins)'].values, [25, 75]))

print(los_q25)
print(los_q75)

50.0
73.0
8
0.396808951513
1.7300347222222223
5.5671875
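
The quartiles above use `np.percentile`, which would return `nan` if a column contained missing values. The sketch below shows a NaN-aware variant on invented numbers. `median_iqr` is a hypothetical helper, and whether the real columns contain any missing values is not known from this notebook.

```python
import numpy as np

def median_iqr(values):
    """Return (median, q25, q75) of `values`, ignoring NaN entries."""
    arr = np.asarray(values, dtype=float)
    q25, med, q75 = np.nanpercentile(arr, [25, 50, 75])
    return med, q25, q75

# Invented lengths of stay in minutes, with one missing value.
los_mins = [1440.0, 2880.0, np.nan, 4320.0]
med, q25, q75 = median_iqr(los_mins)

# Convert minutes to days, as in the cell above.
minutes_per_day = 24 * 60.0
print(med / minutes_per_day)
print(q25 / minutes_per_day)
print(q75 / minutes_per_day)
```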


#### We now parse the XML file

In [38]:
icnarc_data = parse_icnarc_xml("../ICNARC_Dataset_2015-2018__clean_.xml",
                               "../ICNARC CMP Dataset Properties.xlsx",
                              verbose=VERBOSE)

In [28]:
print(len(icnarc_data))
print(len(merged_data))

5068
4834


In [29]:
icnarc_data.merge(merged_data, on='CIS Patient ID')

KeyError: 'CIS Patient ID'

In [30]:
for col in icnarc_data.columns:
    print(col)

Assisted conception used for recent pregnancy
Advanced cardiovascular support days
ICNARC Number
HIV/AIDS
Acute myelogenous/lymphocytic leukaemia or multiple myeloma
Antimicrobial use after 48 hours in your unit
Associated neutrophil count highest WBC
Associated neutrophil count lowest WBC
Associated neutrophil count pre-admit WBC
Advanced respiratory support days
Assent for solid organ or tissue donation
Basic cardiovascular support days
Biopsy proven cirrhosis
Basic respiratory support days
Burned surface area
Brainstem death declared
Critical care visit prior to this admission to your unit
Critical care visit post-discharge from your unit
Level 0 days
Level 1 days
Level 2 days
Level 3 days
Clostridium difficile present
Chemotherapy
Congenital immunohumoral or cellular immune deficiency state
Classification of surgery
Chronic myelogenous/lymphocytic leukaemia
Cardiopulmonary resuscitation (CPR) within 24 hours prior to admission to your unit
Admission currently/recently pregnant
Chro

#### We now join on ICNARC number and look at some of the important columns.

In particular:
* do the LOS numbers match?
* do the ages match?
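
The earlier merge on 'CIS Patient ID' raised a `KeyError` because `icnarc_data` has no column of that name. The key appears as 'ICNARC Number' in `icnarc_data` and as 'ICNARC number' in `merged_data`, so a join needs `left_on` and `right_on`. The following is a sketch on toy frames with invented values. 'Age (ICNARC)' is a hypothetical column name, since the real ICNARC age column is not shown in this notebook, and the key is assumed to have the same dtype in both frames.

```python
import pandas as pd

# Toy stand-ins for icnarc_data and merged_data.
icnarc_toy = pd.DataFrame({
    'ICNARC Number': [1, 2, 3],
    'Age (ICNARC)': [50, 73, 61],   # hypothetical column name
})
philips_toy = pd.DataFrame({
    'ICNARC number': [1, 2, 4],
    'age': [50, 72, 40],
})

# Join on the ICNARC number, which is spelled differently in each frame.
joined = icnarc_toy.merge(philips_toy,
                          left_on='ICNARC Number',
                          right_on='ICNARC number',
                          how='inner')

# Rows where the two sources disagree on age.
age_mismatch = joined[joined['Age (ICNARC)'] != joined['age']]

print(len(joined))
print(age_mismatch[['ICNARC Number', 'Age (ICNARC)', 'age']])
```

Using `how='outer'` together with `indicator=True` instead would add a `_merge` column that marks rows present in only one of the two frames, which may help explain why the two row counts differ.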

In [37]:
merged_data['Unit ID']
#icnarc_data['ICNARC CMP Number']
#icnarc_data['ICNARC Number']
#merged_data['ICNARC number']

0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
       ..
4804    1
4805    1
4806    1
4807    1
4808    1
4809    1
4810    1
4811    1
4812    1
4813    1
4814    1
4815    1
4816    1
4817    1
4818    1
4819    1
4820    1
4821    1
4822    1
4823    1
4824    1
4825    1
4826    1
4827    1
4828    1
4829    1
4830    1
4831    1
4832    1
4833    1
Name: Unit ID, Length: 4834, dtype: int64