## Need-To-Do

- Analyze how much of a difference recoded transfers make?
- Properly characterize missingness based on dropped discharge dates
- Properly characterize amount of data loss with each cohort change
- Remove patients with no demographic information -- detectable by a null age

Shape of final table:

| ruid | visit_id | admit_date | discharge_date | hospital_day | n_transfers | stay_length | readmit_time | readmit_30d |
|------|----------|------------|----------------|--------------|-------------|-------------|--------------|-------------|
| user id | hospital stay # | date admitted | date discharged | date in hospital | number of transfers | duration of stay | time from last discharge to this admission | was the patient a 30d readmit? |


## Nice-To-Do

- Construct missing discharge/admit dates from CPT codes -- do not do this for events where both are missing as these may be ER visits w/o admit, but do check if they fall in the range of an existing stay
- Characterize the amount of missingness of entire hospital visits from CPT codes

## Loading data

In [1]:
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
adt = pd.read_table('../data/FONNESBECK_ADT_20151202.csv', encoding='latin1', sep = ',', 
                    infer_datetime_format=True, parse_dates=['Admission_date','Event_Date','DISCHARGE_DATE'],
                    dtype={'RUID':np.str})
pheno = pd.read_table('../data/FONNESBECK_phenotype_20151202.csv', encoding='latin1', sep = ',',
                    infer_datetime_format=True, parse_dates=['DOB','DOD'],
                    dtype={'RUID':np.str})
cpt = pd.read_table('../data/FONNESBECK_CPT_20151202.csv', encoding='latin1', sep = ',',
                    infer_datetime_format=True, parse_dates=['Event_date'],
                   dtype={'RUID':np.str})

In [3]:
svc = pd.read_excel('../data/FONNESBECK_DD_2014102014.xlsx',sheet_name='Service code', sep = ',', header=0, names=['SVC','Desc'])

In [4]:
adt.Event = pd.Categorical(adt.Event,categories = ['Admit','Transfer','Discharge'])
adt = adt.sort_values(by = ['RUID','Admission_date','Event','Event_Date']).reset_index(drop = True)

## Adding ages and filtering out pediatric & psych patients

In [5]:
# calculate age of patient when an event occurs
adt_age = pd.merge(adt,pheno)
events = adt_age.Event_Date.dt
birthdays = adt_age.DOB.dt

adt_age['age'] = events.year - birthdays.year + ((events.month < birthdays.month) & (events.day < birthdays.day))
# above from https://stackoverflow.com/questions/2217488/age-from-birthdate-in-python/9754466#9754466

In [21]:
# getting rid of peds & psychiatric patients...
# we're removing these because they aren't part of the CMS criteria so 30-day readmits for them don't lose the hospital money
ped_svc = '|'.join(svc.SVC[svc.Desc.str.contains("CHILD|PED")])
psych_svc = '|'.join(svc.SVC[svc.Desc.str.contains("PSYCH")])

ped_filter = (adt_age.age < 18)
# removing all rows where the patient would be classified as a pediatric patient;
# we also experimented with removing all rows with a pediatric service code, but this is more straightforward
# it does not however account for older patients getting services through the pediatric hospital, e.g.
# 20-year-olds getting follow-up for pediatric cancers

psych_filter = (adt_age.SRV_CODE.str.contains(psych_svc) & (adt_age.Event == "Admit"))
# explicitly remove psychiatric admits but not psych consults during admits for other reasons;
# because we're only working with admit dates as records of patient visits later, this means we effectively remove any visits where the patient
# was a primary admit to psychiatric services

In [28]:
adt_cms = adt[~(ped_filter | psych_filter | adt_age.age.isnull())].copy()

## Filtering to admits & eliminating missing discharges

In [29]:
adt_cms['imputed_transfer'] = [0]*adt_cms.shape[0] # create a new column to store flag

txmask = (adt_cms.Event == "Admit") & (adt_cms.Admission_date != adt_cms.Event_Date)
# there are 431 of these that we suspect are transfers, not admits
# because they're labeled "admit" but happen after the listed admit date for the visit

adt_cms.loc[txmask,'Event'] = "Transfer"
adt_cms.loc[txmask,'imputed_transfer'] = 1

In [30]:
adt_cms_admits = adt_cms[(adt_cms.Event == 'Admit') & ~(adt_cms.DISCHARGE_DATE.isnull())].copy().reset_index(drop = True)

# remove everything where we're missing a discharge date

## Constructing variables

In [12]:
adt_cms_admits['stay_length'] = adt_cms_admits.DISCHARGE_DATE - adt_cms_admits.Admission_date
adt_cms_admits['readmit_time'] = adt_cms_admits.Admission_date - adt_cms_admits.DISCHARGE_DATE.shift()

didx = ~(adt_cms_admits.RUID.shift() == adt_cms_admits.RUID)

adt_cms_admits['readmit_time'] = adt_cms_admits['readmit_time'].mask(didx)

adt_cms_admits['readmit_30d'] = np.where(adt_cms_admits.readmit_time <= datetime.timedelta(days=30),1,0)
adt_cms_admits = adt_cms_admits[~(adt_cms_admits.readmit_time < datetime.timedelta(days=0))] # get rid of double admits where we had a different
# chief complaint or svc code

In [13]:
event_counts = (adt_cms[~(adt_cms.DISCHARGE_DATE.isnull())].groupby(by=['RUID','Admission_date'])
                .Event
                .value_counts(sort=False)
                .unstack(fill_value = 0))

n_transfers = event_counts['Transfer'] # now pull the number of transfers and we're good
# merge this by multindex onto the other table once it's cleaned & ready to go

In [14]:
adt_cms_admits2 = (adt_cms_admits.drop(labels=['Event','Event_Date','SRV_CODE','imputed_transfer','CHIEF_COMPLAINT'], axis = 1)
                  .set_index(['RUID','Admission_date'])
                  .join(n_transfers)
                  .reset_index(drop = False)
                  .rename({'RUID': 'ruid', 'Admission_date': 'admit_date', 'DISCHARGE_DATE': 'discharge_date', 'Transfer': 'n_transfers'},axis = 1))

In [15]:
adt_cms_admits2['visit_id'] = adt_cms_admits2.groupby('ruid').cumcount()

In [16]:
def date_ranger(x):
    start = x.iloc[0]['admit_date']
    end = x.iloc[0]['discharge_date']
    return pd.DataFrame(pd.date_range(start=start, end=end).tolist())

In [17]:
hospital_day = (adt_cms_admits2.groupby(['ruid','visit_id'])
                .apply(date_ranger)
                .reset_index(drop = False)
                .drop('level_2',axis = 1)
                .set_index(['ruid','visit_id']))

# takes a bit to run

In [18]:
adt_cms_final = (adt_cms_admits2.set_index(['ruid','visit_id'])
                .join(hospital_day)
                .reset_index(drop = False)
                .rename({0:'hospital_day'},axis=1))[['ruid','visit_id','admit_date','discharge_date','hospital_day',
                                                     'stay_length','n_transfers','readmit_time','readmit_30d']]

In [19]:
adt_cms_final

Unnamed: 0,ruid,visit_id,admit_date,discharge_date,hospital_day,stay_length,n_transfers,readmit_time,readmit_30d
0,50135262,0,2007-02-08,2007-02-12,2007-02-08,4 days,2,NaT,0
1,50135262,0,2007-02-08,2007-02-12,2007-02-09,4 days,2,NaT,0
2,50135262,0,2007-02-08,2007-02-12,2007-02-10,4 days,2,NaT,0
3,50135262,0,2007-02-08,2007-02-12,2007-02-11,4 days,2,NaT,0
4,50135262,0,2007-02-08,2007-02-12,2007-02-12,4 days,2,NaT,0
5,50135262,1,2007-08-03,2007-08-06,2007-08-03,3 days,3,172 days,0
6,50135262,1,2007-08-03,2007-08-06,2007-08-04,3 days,3,172 days,0
7,50135262,1,2007-08-03,2007-08-06,2007-08-05,3 days,3,172 days,0
8,50135262,1,2007-08-03,2007-08-06,2007-08-06,3 days,3,172 days,0
9,50135262,2,2007-08-28,2007-08-29,2007-08-28,1 days,1,22 days,1


In [20]:
final_ruids = adt_cms_final.ruid.unique()

In [34]:
len(final_ruids) # from 8000 patients, we're down to 5664.

5664

In [35]:
adt_cms_final.to_pickle("../data/adt_cms_final.pkl")