# 2019 Labels

This notebook generates labels for the 2019 data.

We will generate the following labels:
1. `admit_label`: level of care at time of admit
2. `label_24hr_recent`: most recent level of care prior to 24 hrs after admit
3. `label_12hr_recent`: most recent level of care prior to 12 hrs after admit
4. `acute_to_critical_label`: level of care switched from acute to ICU (critical care) between admit and the 24 hr mark
5. `critical_to_acute_label`: level of care switched from ICU (critical care) to acute between admit and the 24 hr mark


Labels 1 through 3 reflect the following:
- 1 = critical care
- 0 = not critical care 

Labels 4 and 5 reflect:
- 1 = switched care level
- 0 = did not switch care level
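
As a minimal sketch of the 1/0 encoding described above, the toy example below turns a level-of-care column into a binary label. The series `toy_care` and the value "Acute Care" are made up for illustration. One caveat worth keeping in mind: comparing a missing value against "Critical Care" gives `False`, so a missing level of care becomes 0 unless it is masked out.

```python
import numpy as np
import pandas as pd

# Toy levels of care; "Acute Care" is an illustrative value, not taken from the data.
toy_care = pd.Series(["Critical Care", "Acute Care", np.nan, "Critical Care"])

# Direct comparison: NaN compares as False, so the missing entry turns into 0.
naive_label = (toy_care == "Critical Care").astype(int)

# Nullable integer dtype keeps the missing entry missing.
safe_label = naive_label.astype("Int64").mask(toy_care.isnull())

print(naive_label.tolist())
print(safe_label.tolist())
```

The notebook sidesteps this by filtering out events with a null `pat_lv_of_care` before building the labels.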

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import os.path
from os import path
import math

# Download latest cohort data

# Load Data

The data was downloaded from BigQuery (BQ), from the `Traige_TE.triage_cohort_final` table. This table contains the newest cohort.

In [None]:
forrepo = 1 # change this to 0 if you want to see hidden columns displayed

hidecols = []
if forrepo:
    hidecols=['anon_id', 'pat_enc_csn_id_coded', 'inpatient_data_id_coded'] # these are hidden columns

In [None]:
# folder with downloaded data
datadir = "../../../../2019_data"

cohort = pd.read_csv("{}/triage_cohort_final.csv".format(datadir))

cohort.drop('Unnamed: 0', axis=1, inplace=True)

print(cohort.shape)
cohort.drop(hidecols, axis=1, errors='ignore').head()

The `admit_time` column contains the adjusted/earliest time between `admit_time_jittered` from the `ADT` table and the admission ordered time from the `order_proc` table. This is the time that we will use for the admission time.

We are not going to use this admit time. Instead, we use the earliest admit time based on the `effective_time_jittered_utc` column in the `ADT` table. We'll grab the admit time from Conor's cohort for consistency.

In [None]:
conors_cohort = pd.read_csv("{}/triage_cohort_draft_2019.csv".format(datadir))
conors_cohort.head()

# join this to the cohort table
cohort_joined = cohort.merge(conors_cohort[['pat_enc_csn_id_coded', 'admit_time_jittered']], how='left')
cohort_joined.head()

cohort_joined[cohort_joined.admit_time == cohort_joined.admit_time_jittered].drop(hidecols, axis=1, errors='ignore')

Looks like the admit time was already updated in this table. We can just use the original time.
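
A more explicit way to confirm that the two admit times agree is to count mismatches after the merge. The sketch below uses two made-up tables (`toy_cohort`, `toy_draft`) with the same column names as above. It assumes one row per CSN in each table, which `validate="one_to_one"` enforces.

```python
import pandas as pd

# Toy stand-ins for the cohort table and the draft cohort (values are invented).
toy_cohort = pd.DataFrame({
    "pat_enc_csn_id_coded": [1, 2, 3],
    "admit_time": ["2019-01-01 10:00:00", "2019-02-01 09:30:00", "2019-03-05 23:15:00"],
})
toy_draft = pd.DataFrame({
    "pat_enc_csn_id_coded": [1, 2, 3],
    "admit_time_jittered": ["2019-01-01 10:00:00", "2019-02-01 09:30:00", "2019-03-05 23:15:00"],
})

# validate raises if either side has duplicate CSNs, which would inflate the join.
joined = toy_cohort.merge(toy_draft, on="pat_enc_csn_id_coded",
                          how="left", validate="one_to_one")

# Compare as datetimes so formatting differences in the strings do not matter.
n_mismatch = (pd.to_datetime(joined.admit_time)
              != pd.to_datetime(joined.admit_time_jittered)).sum()
print("mismatched admit times:", n_mismatch)
```

Rows without a match in the draft cohort get `NaT`, which never compares equal, so they are counted as mismatches too.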

# Pull the ADT table for this cohort

We need to pull the `ADT` data for this new cohort. I already ran a SQL call directly on BQ to form the cohort and stored the table under `traige_TE.triage_cohort_adt_2019`.

Here's the SQL code used:

`select shc.anon_id, shc.pat_enc_csn_id_coded, pat_class, pat_lv_of_care, event_type, pat_service, status_of_bed, accomodation
  from shc_core.adt shc
  right join conor_db.triage_cohort_draft_2019 c 
  on shc.anon_id = c.anon_id and shc.pat_enc_csn_id_coded = c.pat_enc_csn_id_coded`

In [None]:
# connect to BQ - uncomment all of this if the data is not on computer
# cohort_adt_file = "../2019_data/triage_cohort_adt_2019.csv"

# %load_ext google.cloud.bigquery

In [None]:
# %%bigquery adt
# select *
# from traige_TE.triage_cohort_2019_adt

In [None]:
# # save the data
# adt.to_csv(cohort_adt_file, index=False)

# Load ADT table

Start here if the data has already been pulled from BQ.

In [None]:
# use this to describe the tables
def describe_df(df):
    print("df shape", df.shape)
    print("unique CSNs: ", df.pat_enc_csn_id_coded.nunique())
    print("unique patients: ", df.anon_id.nunique())

In [None]:
adt = pd.read_csv("{}/triage_cohort_adt_2019.csv".format(datadir))

describe_df(adt)

adt.drop(hidecols, axis=1, errors='ignore').head()

In [None]:
# join the adt to the cohort to make sure we have the correct cohort
joined_cohort_adt = cohort.merge(adt, on=['anon_id', 'pat_enc_csn_id_coded'],
                       how='left')

describe_df(joined_cohort_adt)

joined_cohort_adt.drop(hidecols, axis=1, errors='ignore').head()

In [None]:
print("cohort: ", cohort.shape, "adt: ", adt.shape, "joined_cohort_adt: ", joined_cohort_adt.shape)

print("\nNumber of unique CSNs:")
print("cohort:", cohort.pat_enc_csn_id_coded.nunique(),
     "adt:", adt.pat_enc_csn_id_coded.nunique(),
     "joined:", joined_cohort_adt.pat_enc_csn_id_coded.nunique())

print("\nNumber of unique anon_ids:")
print("cohort:", cohort.anon_id.nunique(),
     "adt:", adt.anon_id.nunique(),
     "joined:", joined_cohort_adt.anon_id.nunique())

# Prepare table for making labels

We need to change the times into datetime format so we can work with them to create the labels. Also, create a column that tells us how long it's been since admit time for each event in the `ADT` table.
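
The time-since-admit column can also be computed column-wise, which gives the same values as a row-wise `apply` and is usually much faster. A small sketch on invented timestamps, using the same column names:

```python
import pandas as pd

# Two toy ADT events for one admission (timestamps are invented).
toy = pd.DataFrame({
    "admit_time": ["2019-01-01 10:00:00", "2019-01-01 10:00:00"],
    "effective_time_jittered_utc": ["2019-01-01 08:30:00", "2019-01-02 09:00:00"],
})

# Parse the strings into datetimes.
for col in ["admit_time", "effective_time_jittered_utc"]:
    toy[col] = pd.to_datetime(toy[col])

# Subtracting two datetime columns yields a timedelta column.
toy["effective_time_since_admit"] = toy.effective_time_jittered_utc - toy.admit_time

# Dividing by a one-hour Timedelta converts to float hours.
toy["hours_since_admit"] = toy.effective_time_since_admit / pd.Timedelta(hours=1)

print(toy[["effective_time_since_admit", "hours_since_admit"]])
```

Negative values mark events that happened before the admit time.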

In [None]:
# change the admit time to datetime
joined_cohort_adt.admit_time = pd.to_datetime(joined_cohort_adt.admit_time)

# change the effective time to datetime
joined_cohort_adt['effective_time_jittered_utc'] = pd.to_datetime(joined_cohort_adt.effective_time_jittered_utc)

# change the event time to datetime
joined_cohort_adt['event_time_jittered_utc'] = pd.to_datetime(joined_cohort_adt.event_time_jittered_utc)

In [None]:
# compute time since admit for each event - this step takes a little while
joined_cohort_adt['effective_time_since_admit'] = joined_cohort_adt.apply(lambda x: x.effective_time_jittered_utc - x.admit_time, axis=1)
joined_cohort_adt['event_time_since_admit'] = joined_cohort_adt.apply(lambda x: x.event_time_jittered_utc - x.admit_time, axis=1)

Check out the CSN from last time that was messed up. This CSN had two patients last time.

In [None]:
joined_cohort_adt[joined_cohort_adt.pat_enc_csn_id_coded == 131227093710].drop(hidecols, axis=1, errors='ignore') # looks fine

In [None]:
time_used = "effective" # change this to look at the different times (either effective or event)
# time_used = "event"

key = "{}_time_since_admit".format(time_used)

# check out time since admit
print(joined_cohort_adt[key].describe())

# look at all of the ADT events for these CSNs that occur prior to admission
prior_to_admit = joined_cohort_adt[joined_cohort_adt[key] < timedelta(hours=0)]

# number of ADT events that occur before admission
print(prior_to_admit.shape)

# this tells us where people were before admission
print(prior_to_admit.pat_class.value_counts())

prior_to_admit.drop(hidecols, axis=1, errors='ignore').head()

# Some patients have many CSNs in the cohort

Most patients only have one CSN in the `demo` table, but there are some with many CSNs. The max number of CSNs for one individual is 37. 

In [None]:
counts = joined_cohort_adt.groupby('anon_id')['pat_enc_csn_id_coded'].nunique().sort_values(ascending=False)

# distribution of CSNs per patient
print(counts.describe())

# number of patients with more than one CSN
print("\nPatients with multiple CSNs: ", sum(counts > 1))

counts.head(20)
# joined_cohort_adt[filtered_demo.anon_id == 'JCe8f38d'].sort_values('admit_time')

# Look into event time and effective time differences

In [None]:
# get difference between events
joined_cohort_adt['effective_minus_event_time'] = (joined_cohort_adt.effective_time_jittered_utc 
                                                   - joined_cohort_adt.event_time_jittered_utc)

joined_cohort_adt.drop(hidecols, axis=1, errors='ignore').head()

# Make Label: Level of Care at Admit

We need to get the first admission event. 

# Patients missing lv of care 

There are some patients that are missing the `pat_lv_of_care` entry for the admit event. I looked through a handful manually and it looks like these individuals tend to be missing the `pat_lv_of_care` for all of their ADT events. This means we can't assign them any labels, so they should be removed from the dataset.

There are **1007** cases that have no admit lv of care. See the next two cells below for more.
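
One way to find CSNs whose `pat_lv_of_care` is missing for every ADT event is a grouped `all()` over the null mask. This is a sketch on a made-up table, not the exact procedure used in the cells below; "Acute Care" is an illustrative value.

```python
import numpy as np
import pandas as pd

# Toy ADT events: CSN 2 has no level of care on any event.
toy_adt = pd.DataFrame({
    "pat_enc_csn_id_coded": [1, 1, 2, 2, 3],
    "pat_lv_of_care": ["Critical Care", np.nan, np.nan, np.nan, "Acute Care"],
})

# True for a CSN only if every one of its events has a null level of care.
all_missing = (toy_adt.pat_lv_of_care.isnull()
               .groupby(toy_adt.pat_enc_csn_id_coded)
               .all())

fully_missing_csns = all_missing[all_missing].index.tolist()
print(fully_missing_csns)
```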

In [None]:
# try filtering for exact admit events
admit_events = joined_cohort_adt[joined_cohort_adt.admit_time == joined_cohort_adt.effective_time_jittered_utc] 
# admit_events = admit_events[admit_events.pat_class == "Inpatient"]

print("joined_cohort_adt")
describe_df(joined_cohort_adt)
print("\nadmit events")
describe_df(admit_events) # we don't lose any CSNs so that's good



# there are multiple admit events for some CSNs - check that the lv of care are the same
care_counts = admit_events.groupby(['pat_enc_csn_id_coded']).pat_lv_of_care.nunique()
print("\n",care_counts.describe())
# looks like some CSNs are missing an admit event label, 
# some have multiple level of care labels --> need to pull sequence numbers for events

missing_admits = care_counts[care_counts != 1].reset_index()

missing_admits.drop(hidecols, axis=1, errors='ignore')

In [None]:
cols = ['pat_class', 'pat_lv_of_care', 
        'admit_time', 'effective_time_jittered_utc', 'event_time_jittered_utc', 'effective_time_since_admit',
       'event_time_since_admit', 'effective_minus_event_time', 
       'event_type', 'anon_id', 'pat_enc_csn_id_coded']

# check on these missing admit CSNs
i = 6
missing_csns = missing_admits[missing_admits.pat_lv_of_care == 0]


missing_csns = missing_csns.pat_enc_csn_id_coded.values
print(len(missing_csns))
missing_csn = missing_csns[i]
check = joined_cohort_adt[joined_cohort_adt.pat_enc_csn_id_coded == missing_csn]
check[cols].sort_values(by=['anon_id', 'pat_enc_csn_id_coded', 'effective_time_jittered_utc']).drop(hidecols, axis=1, errors='ignore')

In [None]:
# find out when these missing admit individuals get their first label
len(missing_csns)

In [None]:
# read in the final labels that we used
final_labels = pd.read_csv("{}/triage_to_keep_cohort_with_labels_updated.csv".format(datadir))

final_labels.head().drop(hidecols, axis=1, errors='ignore')

In [None]:
final_labels_sub = final_labels[final_labels.pat_enc_csn_id_coded.isin(cohort.pat_enc_csn_id_coded)]

final_labels_sub.has_admit_label.value_counts()

In [None]:
extra_labels = final_labels_sub[(~final_labels_sub.pat_enc_csn_id_coded.isin(missing_csns)) &
                               (final_labels_sub.has_admit_label == 0)]
extra_labels.drop(hidecols, axis=1, errors='ignore')

In [None]:
# check out these extra labels that are in the final labels but not found as missing admit labels here
# this might be an error - there are only 9 though so it's not a huge deal
i = 0
extra_csns = extra_labels.pat_enc_csn_id_coded.values
csn = extra_csns[i]

print(len(extra_csns))

joined_cohort_adt[joined_cohort_adt.pat_enc_csn_id_coded == csn].drop(hidecols, axis=1, errors='ignore')

In [None]:
# find patients with NaN only for lv of care
missing_adt = joined_cohort_adt[joined_cohort_adt.pat_enc_csn_id_coded.isin(missing_csns)]

not_all_nan = missing_adt[~missing_adt.pat_lv_of_care.isnull()]

not_all_nan_csns = not_all_nan.pat_enc_csn_id_coded.unique()

fully_missing_csns = [x for x in missing_csns if not x in not_all_nan_csns]

missing_csns_adt = cohort[cohort.pat_enc_csn_id_coded.isin(fully_missing_csns)]

print(len(fully_missing_csns))

# missing_csns_adt.to_csv("../2019_data/cohort_2019_missing_pat_lv_of_care.csv")

# Remove patients with no labels in any events

In [None]:
filtered_cohort_adt = joined_cohort_adt[~joined_cohort_adt.pat_enc_csn_id_coded.isin(fully_missing_csns)]

print("joined_cohort_adt")
describe_df(joined_cohort_adt)
print("\nfiltered_cohort_adt")
describe_df(filtered_cohort_adt) # removed 202 patients

filtered_cohort = cohort[~cohort.pat_enc_csn_id_coded.isin(fully_missing_csns)]
print("\n\njoined_cohort_adt")
describe_df(joined_cohort_adt)
print("\nfiltered_cohort")
describe_df(filtered_cohort) # removed 202 patients

# Continue with patients that have admit events with lv of care 

In [None]:
# keep only patients with one lv of care label at admit time
counts = care_counts.reset_index()
keep_csns = counts[counts.pat_lv_of_care == 1].pat_enc_csn_id_coded.values

keep_admits = admit_events[admit_events.pat_enc_csn_id_coded.isin(keep_csns)]

print("filtered_cohort_adt")
describe_df(filtered_cohort_adt)
print("\nkeep admits")
describe_df(keep_admits) # there are multiple inpatient admit events, but they have the same label so it's okay

print("\nkeep admits + removed csns =", 
      keep_admits.pat_enc_csn_id_coded.nunique() + len(missing_csns)) # we're only missing the ones with no labels

print("\nmissing csns: ", len(missing_csns))

# keep only the first admit event for each csn
# all admit events have same label so it's okay to keep
unique_admits = keep_admits.groupby("pat_enc_csn_id_coded").first().reset_index()

admit_labels = unique_admits[['pat_enc_csn_id_coded', 'pat_lv_of_care']].copy() # copy so the new column below is not set on a slice
print(admit_labels.pat_lv_of_care.value_counts()) # these NeoNatal ICU and Newborn Nursery might be weird to keep ***

admit_labels['admit_label'] = (admit_labels.pat_lv_of_care == "Critical Care").astype(int)

print(admit_labels.admit_label.value_counts())

admit_labels.drop(hidecols, axis=1, errors='ignore')

In [None]:
csn = admit_labels[admit_labels.pat_lv_of_care == 'Newborn Nursery - VC Only'].pat_enc_csn_id_coded.values[0]

In [None]:
check = filtered_cohort_adt[filtered_cohort_adt.pat_enc_csn_id_coded == csn]
check[cols].sort_values('event_time_jittered_utc').drop(hidecols, axis=1, errors='ignore')

In [None]:
filtered_cohort_labels = filtered_cohort.merge(admit_labels[['pat_enc_csn_id_coded', 'admit_label']], how='left')

# some are missing an admit label because they were NaN at admit time
sum(filtered_cohort_labels.admit_label.isnull())

filtered_cohort_labels.drop(hidecols, axis=1, errors='ignore')

# Most Recent Care Labels

Now we'll get the labels for the most recent event prior to X hours, for X=12 and 24.
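
The idea can be sketched on a toy table as follows. The helper `most_recent_label` is hypothetical and the events are invented. It uses `tail(1)` after sorting so that each CSN contributes its actual last row. Note that `groupby(...).last()` returns the last non-null value of each column, which can mix values from different rows when some columns contain nulls.

```python
import pandas as pd

# Toy ADT events with time since admit ("Acute Care" is an illustrative value).
toy_adt = pd.DataFrame({
    "pat_enc_csn_id_coded": [1, 1, 1, 2, 2],
    "pat_lv_of_care": ["Acute Care", "Critical Care", "Acute Care",
                       "Critical Care", "Acute Care"],
    "effective_time_since_admit": pd.to_timedelta([0, 10, 30, 0, 13], unit="h"),
})

def most_recent_label(adt, hours):
    """Label each CSN by its last non-null level of care within `hours` of admit."""
    window = adt[(adt.effective_time_since_admit >= pd.Timedelta(0))
                 & (adt.effective_time_since_admit <= pd.Timedelta(hours=hours))
                 & adt.pat_lv_of_care.notnull()]
    window = window.sort_values(["pat_enc_csn_id_coded", "effective_time_since_admit"])
    # tail(1) keeps the last full row per CSN after sorting.
    last = window.groupby("pat_enc_csn_id_coded").tail(1).copy()
    last["label"] = (last.pat_lv_of_care == "Critical Care").astype(int)
    return last.set_index("pat_enc_csn_id_coded").label

label_12 = most_recent_label(toy_adt, 12)
label_24 = most_recent_label(toy_adt, 24)
print(label_12.to_dict(), label_24.to_dict())
```

In this toy data CSN 2 is in critical care at 12 hours but has moved to acute care by 24 hours, so its two labels differ.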

In [None]:
X = 24

def get_adtX(X):
    # take only events that occur after admit but before the X hours mark
    adt_X = filtered_cohort_adt[(filtered_cohort_adt.effective_time_since_admit >= timedelta(hours=0))
                                 &
                                 (filtered_cohort_adt.effective_time_since_admit <= timedelta(hours=X))
                                &
                                (~filtered_cohort_adt.pat_lv_of_care.isnull())
                                ]

    # check out distribution of times
    print(adt_X.effective_time_since_admit.describe(), "\n")
    
    print("filtered_cohort_adt")
    describe_df(filtered_cohort_adt)
    print("\nadt_X")
    # check if we lost anyone - we're missing some
    describe_df(adt_X)


    print("\nmissing number of csns:")
    filtered_cohort_csns = filtered_cohort_adt.pat_enc_csn_id_coded.unique()
    adt_X_csns = adt_X.pat_enc_csn_id_coded.unique()
    print(len(filtered_cohort_csns) - len(adt_X_csns))
    
    # get these missing csns to manual check
    missing_csns = [x for x in filtered_cohort_csns if x not in adt_X_csns]
    
    return (adt_X, missing_csns)


adt_X, missing_csns = get_adtX(X=24)

# Missing 24 hour labels

There are 84 individuals missing 24 hr labels. I checked the first three and they seem to be those with NaN values for `pat_lv_of_care` for all events that occur within 24 hours.

In [None]:
i = 0
missing_csn = missing_csns[i]
check = filtered_cohort_adt[filtered_cohort_adt.pat_enc_csn_id_coded == missing_csn]
check[cols].sort_values('effective_time_jittered_utc').drop(hidecols, axis=1, errors='ignore')

In [None]:
def get_recent_events():
    # sort by the effective time of events
    adt_X.sort_values(by=['anon_id', 'pat_enc_csn_id_coded','effective_time_since_admit', 'seq_num_in_enc', 'seq_num_in_bed_min'], inplace=True)

    describe_df(adt_X)

    # group by CSN and get the last (most recent) event
    recent_events = adt_X.groupby('pat_enc_csn_id_coded').last().reset_index()

    print(recent_events.effective_time_since_admit.describe())

    return recent_events

recent_events = get_recent_events()

# Some patients' most recent event is very early

Minimum is 1 minute after admit. Check this guy out.

In [None]:
recent_events.sort_values(by=['effective_time_since_admit']).head().drop(hidecols, axis=1, errors='ignore')

In [None]:
# this checks out, so at least nothing wrong with code
check = filtered_cohort_adt[filtered_cohort_adt.pat_enc_csn_id_coded == 131087352675]
check[cols].sort_values('effective_time_since_admit').drop(hidecols, axis=1, errors='ignore')

In [None]:
# we have one row for each CSN now - each row is the last event for that CSN prior to the 24 hr mark
print(recent_events.pat_enc_csn_id_coded.nunique())
print(recent_events.shape)

describe_df(recent_events)

recent_events.effective_time_since_admit.describe()

## Continue with labels again

In [None]:
# look at the level of care assignments across all individuals
print(recent_events.pat_lv_of_care.value_counts())

sum(recent_events.pat_lv_of_care.isnull()) # these are NaN lv of care prior to 24 hr

In [None]:
label_name = "label_{}hr_recent".format(X)
recent_events[label_name] = (recent_events.pat_lv_of_care == 'Critical Care').astype(int)
recent_events.head().drop(hidecols, axis=1, errors='ignore')

In [None]:
# grab relevant columns
labels = recent_events[['pat_enc_csn_id_coded', label_name]]

# join to demo table
filtered_cohort_labels2 = filtered_cohort_labels.merge(labels, how='left', on='pat_enc_csn_id_coded')
filtered_cohort_labels2[label_name] = filtered_cohort_labels2[label_name]

filtered_cohort_labels2.head().drop(hidecols, axis=1, errors='ignore')

# Make 12 hour most recent labels

In [None]:
X=12
adt_X, missing_csns = get_adtX(X=12)

# Missing 12 hr labels

There are 136 CSNs missing 12 hour labels. We can check out a couple of them. Looks legit. This should be a superset of the missing 24 hour label CSNs.
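
The superset claim can be checked directly with sets. The CSN values below are invented. In the notebook, `missing_csns` is overwritten when `get_adtX(X=12)` runs, so the 24 hour list would need to be saved under another name first (`missing_24hr` here is a hypothetical name).

```python
# Toy CSN lists standing in for the missing 24 hr and 12 hr label CSNs.
missing_24hr = {101, 102}
missing_12hr = {101, 102, 103}

# Every CSN missing a 24 hr label should also be missing a 12 hr label,
# because the 12 hr window is contained in the 24 hr window.
is_superset = missing_12hr >= missing_24hr

# CSNs whose first usable event falls between 12 and 24 hours.
only_12hr = sorted(missing_12hr - missing_24hr)

print(is_superset, only_12hr)
```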

In [None]:
i = 0
missing_csn = missing_csns[i]
check = filtered_cohort_adt[filtered_cohort_adt.pat_enc_csn_id_coded == missing_csn]
check[cols].sort_values('effective_time_jittered_utc').drop(hidecols, axis=1, errors='ignore')

In [None]:
# get the last event within 12 hours for each CSN
recent_events = get_recent_events()

In [None]:
# look at the level of care assignments across all individuals
print(recent_events.pat_lv_of_care.value_counts())

sum(recent_events.pat_lv_of_care.isnull()) # these are NaN lv of care prior to 12 hr

In [None]:
label_name = "label_{}hr_recent".format(X)
recent_events[label_name] = (recent_events.pat_lv_of_care == 'Critical Care').astype(int)

print(recent_events[label_name].value_counts())
recent_events.head().drop(hidecols, axis=1, errors='ignore')

In [None]:
# grab relevant columns
labels = recent_events[['pat_enc_csn_id_coded', label_name]]

# join to demo table
filtered_cohort_labels3 = filtered_cohort_labels2.merge(labels, how='left', on='pat_enc_csn_id_coded')
filtered_cohort_labels3[label_name] = filtered_cohort_labels3[label_name]

filtered_cohort_labels3.head().drop(hidecols, axis=1, errors='ignore')

In [None]:
filtered_cohort_labels3.admit_label = filtered_cohort_labels3.admit_label.astype("Int64")
filtered_cohort_labels3['label_24hr_recent'] = filtered_cohort_labels3['label_24hr_recent'].astype("Int64")
filtered_cohort_labels3['label_12hr_recent'] = filtered_cohort_labels3['label_12hr_recent'].astype("Int64")

filtered_cohort_labels3.drop(hidecols, axis=1, errors='ignore')

In [None]:
# check the missing label count for all three labels
print(sum(filtered_cohort_labels3.admit_label.isnull()))

print(sum(filtered_cohort_labels3['label_24hr_recent'].isnull()))

print(sum(filtered_cohort_labels3['label_12hr_recent'].isnull()))

In [None]:
print(filtered_cohort_labels3['admit_label'].value_counts(), "\n")
print(filtered_cohort_labels3['label_24hr_recent'].value_counts(), "\n")
print(filtered_cohort_labels3['label_12hr_recent'].value_counts(), "\n")

In [None]:
# save these to file to upload to BQ
# filtered_cohort_labels3.to_csv("../2019_data/triage_cohort_final_with_labels.csv", index=False)

In [None]:
describe_df(filtered_cohort_labels3)

# Create Switch Labels

We're creating labels that indicate whether a CSN switched from acute -> ICU or vice versa between admit time and 24 hours. For CSNs with both labels, this will be easy. 

However, some CSNs do not have admit labels, but they do have 24 hour labels. For these CSNs, we will:
1. Assign a label based on their earliest level of care and their 24 hour label.
2. Include a flag to indicate that they didn't have admit labels.
3. Include the time of their earliest label.
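
A compact sketch of this plan on a toy table: fall back to the earliest label when the admit label is missing, keep a flag, then derive the two switch labels. Column names follow the notebook and the values are invented. `fillna` is used here as an alternative to a row-wise function.

```python
import pandas as pd

# Rows 0-1 have admit labels; rows 2-3 only have an earliest (first) label.
toy = pd.DataFrame({
    "admit_label": pd.array([0, 1, None, None], dtype="Int64"),
    "first_label": pd.array([None, None, 0, 1], dtype="Int64"),
    "label_24hr_recent": pd.array([1, 1, 1, 0], dtype="Int64"),
})

# Flag: 1 if the CSN had a label at admit time.
toy["has_admit_label"] = toy.admit_label.notnull().astype(int)

# Use the admit label where present, otherwise the earliest available label.
toy["first_label_full"] = toy.admit_label.fillna(toy.first_label)

# Switch labels compare the starting level of care with the 24 hr label.
toy["acute_to_critical_label"] = ((toy.first_label_full == 0)
                                  & (toy.label_24hr_recent == 1)).astype(int)
toy["critical_to_acute_label"] = ((toy.first_label_full == 1)
                                  & (toy.label_24hr_recent == 0)).astype(int)

print(toy)
```

The final `astype(int)` assumes no missing values remain in either input column. With nullable dtypes it raises if any are left.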

In [None]:
# get flag to indicate whether individuals had an admit label
switch_labels = filtered_cohort_labels3
switch_labels['has_admit_label'] = (~switch_labels.admit_label.isnull()).astype(int)
print(switch_labels.has_admit_label.value_counts())

switch_labels.head()

# subset to those that do not have admit label
no_admit = switch_labels[switch_labels.has_admit_label == 0]
no_admit.head()
no_admit_csns = no_admit.pat_enc_csn_id_coded.values

print("new cohort:")
describe_df(filtered_cohort_labels3)
print("\nno admit")
describe_df(no_admit) # 796 CSNs with no admit label

# Missing all labels

There are some people who are missing labels for admit time, 12 hr, and 24 hr. We can't really do anything with these right now. We'll identify these individuals and remove them from the cohort.

We can also look at the distribution of their first labels.
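
Rows that are missing every label can be found with a row-wise `all()` over the null mask. A sketch on invented data, using the label column names from this notebook:

```python
import pandas as pd

# CSN 3 has no label at admit, 12 hr, or 24 hr.
toy_labels = pd.DataFrame({
    "pat_enc_csn_id_coded": [1, 2, 3],
    "admit_label": pd.array([1, None, None], dtype="Int64"),
    "label_12hr_recent": pd.array([1, None, None], dtype="Int64"),
    "label_24hr_recent": pd.array([0, 1, None], dtype="Int64"),
})

label_cols = ["admit_label", "label_12hr_recent", "label_24hr_recent"]

# True only where all three labels are null.
no_label_mask = toy_labels[label_cols].isnull().all(axis=1)
no_labels_csns = toy_labels.loc[no_label_mask, "pat_enc_csn_id_coded"].tolist()

print(no_labels_csns)
```

The cell below checks only the admit and 24 hr labels. That should be equivalent, since the 12 hr window is contained in the 24 hr window.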

In [None]:
# find csns with no labels across the board
missing_labels = filtered_cohort_labels3[['admit_label', 
                        'label_12hr_recent', 'label_24hr_recent']].isnull().astype(int)
missing_labels['pat_enc_csn_id_coded'] = filtered_cohort_labels3['pat_enc_csn_id_coded']

no_admit = missing_labels[missing_labels.admit_label == 1] 
no_labels = no_admit[no_admit.label_24hr_recent == 1]

print(no_labels['admit_label'].value_counts()) # 82 people have no labels before 24 hrs

# manually check some of them - looks good
# new_cohort[new_cohort.pat_enc_csn_id_coded == no_labels.pat_enc_csn_id_coded.values[50]]

no_labels_csns = no_labels.pat_enc_csn_id_coded.values # will probably remove these from cohort

# pull the adt for these CSNs with no admit times
no_labels_adt = filtered_cohort_adt[filtered_cohort_adt.pat_enc_csn_id_coded.isin(no_labels_csns)]

## these individuals must have a label somewhere, find out where

# sort out events with no lv of care
has_lv_of_care = no_labels_adt[~no_labels_adt.pat_lv_of_care.isnull()]
has_lv_of_care = has_lv_of_care[has_lv_of_care.effective_time_since_admit > timedelta(hours=0)]
has_lv_of_care.pat_enc_csn_id_coded.nunique() # we have at least one per csn 

# find first event for each csn
has_lv_of_care.sort_values(by=['pat_enc_csn_id_coded', 'event_time_jittered_utc'], inplace=True)
first_label = has_lv_of_care.groupby(['pat_enc_csn_id_coded']).first()

# convert to hours directly; in pandas 2.x astype('timedelta64[s]') keeps a timedelta dtype instead of float seconds
hour_since_admit = first_label.effective_time_since_admit / pd.Timedelta('1 hour')
print(hour_since_admit.describe())
hour_since_admit.hist()

# Continue without these individuals

In [None]:
# keep the no_admit cohort but not the no_label cohort
has_some_label = filtered_cohort_labels3[~filtered_cohort_labels3.pat_enc_csn_id_coded.isin(no_labels_csns)] # 714 are left
switch_cohort = has_some_label[has_some_label.pat_enc_csn_id_coded.isin(no_admit_csns)]

## find the earliest level of care 
switch_cohort_csns = switch_cohort.pat_enc_csn_id_coded.values

# pull the adt for these CSNs with no admit times
switch_cohort_adt = filtered_cohort_adt[
    filtered_cohort_adt.pat_enc_csn_id_coded.isin(switch_cohort_csns)]

## these individuals must have a label somewhere, find out where

# sort out events with no lv of care
has_lv_of_care = switch_cohort_adt[~switch_cohort_adt.pat_lv_of_care.isnull()]
print(has_lv_of_care.pat_enc_csn_id_coded.nunique())
has_lv_of_care = has_lv_of_care[has_lv_of_care.effective_time_since_admit > timedelta(hours=0)]
print(has_lv_of_care.pat_enc_csn_id_coded.nunique()) # lost one patient here
has_lv_of_care.pat_enc_csn_id_coded.nunique() # we have at least one per csn 

# find first event for each csn
has_lv_of_care.sort_values(by=['pat_enc_csn_id_coded', 'event_time_jittered_utc',
                              'seq_num_in_enc', 'seq_num_in_bed_min'], inplace=True)
first_label = has_lv_of_care.groupby(['pat_enc_csn_id_coded']).first().reset_index()

# convert to hours directly; in pandas 2.x astype('timedelta64[s]') keeps a timedelta dtype instead of float seconds
hour_since_admit = first_label.effective_time_since_admit / pd.Timedelta('1 hour')
print(hour_since_admit.describe())
hour_since_admit.hist()

In [None]:
# we now have the first event for each CSN
first_label.pat_enc_csn_id_coded.nunique() # 713 - we have everyone, except that one that dropped off
first_label_short = first_label[['pat_enc_csn_id_coded', 'pat_lv_of_care', 'effective_time_since_admit']].copy() # copy so new columns are not set on a slice

# check out lv of care here
print(first_label_short.pat_lv_of_care.value_counts())

# make sure no null values
print(sum(first_label_short.pat_lv_of_care.isnull())) # 0 = good

# create label
first_label_short['first_label'] = (first_label_short.pat_lv_of_care == 'Critical Care').astype(int)

print(first_label_short.first_label.value_counts()) # makes sense

# rename the time since admit column
first_label_short.rename({'effective_time_since_admit': 'first_label_time_since_admit'}, 
                         axis='columns', inplace=True)

first_label_short.drop(hidecols, axis=1, errors='ignore')

# Checking into the first label time distribution

In [None]:
first_label_short['hours_since_admit'] = first_label_short.first_label_time_since_admit / pd.Timedelta('1 hour')

In [None]:
ordered_first_labels = first_label_short.sort_values('hours_since_admit', ascending=False)
ordered_first_labels.drop(hidecols, axis=1, errors='ignore')

In [None]:
i=1
csn = ordered_first_labels.pat_enc_csn_id_coded.values[i]
joined_cohort_adt[joined_cohort_adt.pat_enc_csn_id_coded == csn][['pat_enc_csn_id_coded', 'admit_time',
                                                                 'pat_class', 'pat_lv_of_care',
                                                                 'effective_time_jittered_utc', 
                                                                 'event_time_jittered_utc',
                                                                 'effective_time_since_admit']].sort_values('effective_time_jittered_utc').drop(hidecols, axis=1, errors='ignore')

In [None]:
i=1
csn = ordered_first_labels.pat_enc_csn_id_coded.values[i]
switch_cohort_adt[switch_cohort_adt.pat_enc_csn_id_coded == csn][['pat_enc_csn_id_coded', 'admit_time',
                                                                 'pat_class', 'pat_lv_of_care',
                                                                 'effective_time_jittered_utc', 
                                                                 'event_time_jittered_utc',
                                                                 'effective_time_since_admit']].sort_values('effective_time_jittered_utc').drop(hidecols, axis=1, errors='ignore')

In [None]:
first_label_short.hist('hours_since_admit')

In [None]:
# join the first labels to the new cohort
new_cohort_labels = filtered_cohort_labels3.merge(first_label_short[['pat_enc_csn_id_coded', 'first_label', 'first_label_time_since_admit']],
                how='left')

In [None]:
new_cohort_labels_full = new_cohort_labels[~new_cohort_labels.pat_enc_csn_id_coded.isin(no_labels_csns)].copy() # copy so new columns can be added later without warnings

print("new cohort labels")
describe_df(new_cohort_labels)
print("\nnew cohort labels full")
describe_df(new_cohort_labels_full) # we lose those 82 csns that had no labels across the board

# Looking at patient trajectories

Some people do not go straight from Emergency Services to Inpatient. They go to other places like Observation in between. I don't remember if this was what we wanted to do. 
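
To make the idea of a trajectory concrete, the sketch below collapses consecutive repeats of `pat_class` within each CSN and joins what is left into a string. The table is invented and the approach (a per-CSN `shift`) is a simplified variant of the cell that follows.

```python
import pandas as pd

# Toy ADT events: CSN 1 passes through Observation, CSN 2 does not.
toy_adt = pd.DataFrame({
    "pat_enc_csn_id_coded": [1, 1, 1, 1, 2, 2],
    "pat_class": ["Emergency Services", "Emergency Services", "Observation",
                  "Inpatient", "Emergency Services", "Inpatient"],
    "effective_time_jittered_utc": pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 08:30", "2019-01-01 09:00",
        "2019-01-01 12:00", "2019-02-01 10:00", "2019-02-01 11:00"]),
})

toy_adt = toy_adt.sort_values(["pat_enc_csn_id_coded", "effective_time_jittered_utc"])

# Previous pat_class within the same CSN (NaN for the first event of each CSN).
prev_class = toy_adt.groupby("pat_enc_csn_id_coded").pat_class.shift()

# Keep only rows where the class differs from the previous row of that CSN.
changes = toy_adt[toy_adt.pat_class != prev_class]

trajectory = changes.groupby("pat_enc_csn_id_coded").pat_class.agg(" -> ".join)
print(trajectory.to_dict())
```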

In [None]:
df_temp = filtered_cohort_adt.sort_values(by=['anon_id', 'pat_enc_csn_id_coded', 
                                            'event_time_jittered_utc', 'seq_num_in_enc', 'seq_num_in_bed_min'])

# keep only cases where pat_class changed from row above or csn changed
# logic: (row.csn == last.csn) --> (row.pat_class != last.pat_class) ::: p --> q
# equivalent: (!(row.csn == last.csn) OR (row.pat_class != last.pat_class)) ::: !p OR q
df_temp['csn_pat_class'] = df_temp.pat_enc_csn_id_coded.astype(str) + df_temp.pat_class

df_temp['match'] = ~(df_temp.csn_pat_class ==  df_temp.csn_pat_class.shift())
df_temp

df_change = df_temp[~(df_temp.csn_pat_class ==  df_temp.csn_pat_class.shift())]
                                      

# want to see trajectories for each patient on the adt table
grouped = df_change.sort_values(by=['anon_id', 'pat_enc_csn_id_coded', 'effective_time_jittered_utc'])[['pat_enc_csn_id_coded', 'pat_class']]
pat_traj = grouped.groupby('pat_enc_csn_id_coded').pat_class.apply(lambda x: ' -> '.join(x)).reset_index()

# join to the adt table
pat_traj['trajectory'] = pat_traj.pat_class
new_adt = filtered_cohort_adt.merge(pat_traj, how='left', on='pat_enc_csn_id_coded')

trajectories = new_adt.trajectory.unique()

new_adt.drop(hidecols, axis=1, errors='ignore').head()

In [None]:
counts = pat_traj.groupby('trajectory').pat_enc_csn_id_coded.count().reset_index()
print(counts.sort_values('pat_enc_csn_id_coded', ascending=False).head(50))

In [None]:
for i in range(len(trajectories)):
    print(i, " : ", trajectories[i])

In [None]:
i = 12
traj = trajectories[i]
print(traj)
traj_csns = new_adt[new_adt.trajectory == traj].pat_enc_csn_id_coded.values

k = 0
df_temp[df_temp.pat_enc_csn_id_coded == traj_csns[k]].sort_values('effective_time_jittered_utc').drop(hidecols, axis=1, errors='ignore')

# Back to the switch labels

In [None]:
print(new_cohort_labels_full.has_admit_label.value_counts())
new_cohort_labels_full.drop(hidecols, axis=1, errors='ignore')

In [None]:
# move values from events that have admit labels to first label
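def first_label(row):
    # use the admit label when it exists, otherwise fall back to the first available label
    # (pd.notnull handles NaN and pd.NA, unlike an isinstance check on int)
    if pd.notnull(row['admit_label']):
        return row.admit_label
    else:
        return row.first_label

def first_label_time(row):
    # CSNs with an admit label get 0; the rest keep the time of their first label
    if pd.notnull(row['admit_label']):
        return 0
    else:
        return row.first_label_time_since_admit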

new_cohort_labels_full['first_label_full'] = new_cohort_labels_full.apply(lambda row: 
                                                                          first_label(row),
                                                                         axis=1)
new_cohort_labels_full['first_label_time_since_admit_full'] = new_cohort_labels_full.apply(lambda row: 
                                                                          first_label_time(row),
                                                                         axis=1)

print(new_cohort_labels_full.first_label_full.value_counts())
print(new_cohort_labels_full.first_label.value_counts())
print(sum(new_cohort_labels_full.first_label_full.isnull()))

new_cohort_labels_full.drop(hidecols, axis=1, errors='ignore')

In [None]:
# sanity check
check = new_cohort_labels_full
print(sum(check.first_label_full.isnull()))
print(sum(check.first_label_time_since_admit_full.isnull()))


check = new_cohort_labels_full[new_cohort_labels_full.admit_label.isnull()]

print(sum(check.first_label_full.isnull()))
print(sum(check.first_label_time_since_admit_full.isnull()))

check.drop(hidecols, axis=1, errors='ignore')

In [None]:
check = new_cohort_labels_full[~new_cohort_labels_full.admit_label.isnull()]

print(sum(check.first_label_full.isnull()))
print(sum(check.first_label_time_since_admit_full.isnull()))

check.drop(hidecols, axis=1, errors='ignore')

# Create switch labels now

In [None]:
# new_cohort_labels_full.first_label_full = new_cohort_labels_full.first_label_full

# acute to critical
new_cohort_labels_full['acute_to_critical_label'] = (
    (new_cohort_labels_full.first_label_full == 0) &
    (new_cohort_labels_full.label_24hr_recent == 1)).astype(int)

new_cohort_labels_full.acute_to_critical_label.value_counts()

In [None]:
# critical to acute
new_cohort_labels_full['critical_to_acute_label'] = (
    (new_cohort_labels_full.first_label_full == 1) &
    (new_cohort_labels_full.label_24hr_recent == 0)).astype(int)

new_cohort_labels_full.critical_to_acute_label.value_counts()

In [None]:
new_cohort_labels_full.drop(hidecols, axis=1, errors='ignore')

In [None]:
final_labels = new_cohort_labels_full.copy()

# drop unwanted columns
final_labels.drop(['first_label', 'first_label_time_since_admit'], axis=1, errors='ignore',
                 inplace=True)

# rename columns
final_labels.rename({'first_label_full': 'first_label',
                    'first_label_time_since_admit_full': 'first_label_time_since_admit'},
                   inplace=True, axis=1)

# final_labels['first_label'] = final_labels.first_label.astype(int)

final_labels.drop(hidecols, axis=1, errors='ignore')

In [None]:
# save to file
final_labels.to_csv("{}/triage_cohort_2019_all_labels.csv".format(datadir), index=False)

In [None]:
final_labels = pd.read_csv("{}/triage_cohort_2019_all_labels.csv".format(datadir))

In [None]:
describe_df(final_labels)

In [None]:
for colname in final_labels.columns:
    print(final_labels[colname].value_counts())
    print(sum(final_labels[colname].isnull()))

# Update format of table columns

Some of the label columns are read back from the CSV as doubles (floats), so we will convert them to integers to keep everything consistent.
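
A small illustration of why this happens: when an integer column contains missing values, `pd.read_csv` reads it as `float64`. Casting to the nullable `Int64` dtype restores integers while keeping the missing entries. The CSV text below is invented.

```python
import io
import pandas as pd

# Toy CSV where one admit_label is missing.
csv_text = "pat_enc_csn_id_coded,admit_label\n1,1\n2,\n3,0\n"
toy = pd.read_csv(io.StringIO(csv_text))

dtype_before = str(toy.admit_label.dtype)

# Nullable integer dtype: values become integers, missing stays <NA>.
toy["admit_label"] = toy.admit_label.astype("Int64")
dtype_after = str(toy.admit_label.dtype)

print(dtype_before, "->", dtype_after)
```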

In [None]:
final_labels = pd.read_csv("{}/triage_cohort_2019_all_labels.csv".format(datadir))
final_labels.head().drop(hidecols, axis=1, errors='ignore')

In [None]:
# find the types of each column
final_labels.dtypes

In [None]:
# change any of the float columns to int64
float_cols = ['admit_label', 'label_12hr_recent', 'first_label']

for col in float_cols:
    final_labels[col] = final_labels[col].astype('Int64')

final_labels.dtypes

In [None]:
final_labels.head().drop(hidecols, axis=1, errors='ignore')

In [None]:
# final cohort size is 42,475
print(sum(final_labels.admit_label.isnull()))
print(sum(~final_labels.admit_label.isnull()))

final_labels.shape

In [None]:
# save to file
final_labels.to_csv("{}/triage_to_keep_cohort_with_labels_updated.csv".format(datadir), index=False)