In [None]:
# set up notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read the notes table
df_notes = pd.read_csv('NOTEEVENTS.csv')

The main columns of interest are:
- SUBJECT_ID
- HADM_ID
- CATEGORY: includes ‘Discharge summary’, ‘Echo’, ‘ECG’, ‘Nursing’, ‘Physician ‘, ‘Rehab Services’, ‘Case Management ‘, ‘Respiratory ‘, ‘Nutrition’, ‘General’, ‘Social Work’, ‘Pharmacy’, ‘Consult’, ‘Radiology’, ‘Nursing/other’
- TEXT: our clinical notes column

The dataset has 2,083,180 rows, which indicates that there are multiple notes per hospitalization.

In the notes, the dates and PHI (name, doctor, location) have been converted for confidentiality.

In [None]:
# filter to discharge summary
df_notes_dis_sum = df_notes.loc[df_notes.CATEGORY == 'Discharge summary']

## Merge the notes onto the admissions table

We might assume that there is one discharge summary per admission, but we should check this. An assert statement does the job, and here it fails.

In [None]:
assert df_notes_dis_sum.duplicated(['HADM_ID']).sum() == 0, 'Multiple discharge summaries per admission'

It would be worth investigating why there are multiple summaries, but for simplicity let's just use the last one.
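
As a starting point for that investigation, one could count the summaries per admission and pull out the admissions that have more than one. This is only a sketch: `toy_dis_sum` is a hypothetical stand-in for `df_notes_dis_sum`, and its values are made up rather than taken from MIMIC.

```python
import pandas as pd

# Toy stand-in for df_notes_dis_sum (hypothetical values, not MIMIC data)
toy_dis_sum = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2, 3, 3, 3],
    'HADM_ID': [100, 100, 200, 300, 300, 300],
    'TEXT': ['summary', 'addendum', 'summary', 'summary', 'addendum', 'addendum'],
})

# number of discharge summaries recorded for each admission
n_per_adm = toy_dis_sum.groupby('HADM_ID').size()

# admissions that have more than one summary
multi = n_per_adm[n_per_adm > 1]
print(multi)

# all notes for those admissions, so they can be read side by side
dup_rows = toy_dis_sum[toy_dis_sum.HADM_ID.isin(multi.index)]
print(dup_rows)
```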

In [None]:
df_notes_dis_sum_last = (df_notes_dis_sum.groupby(['SUBJECT_ID','HADM_ID']).nth(-1)).reset_index()
assert df_notes_dis_sum_last.duplicated(['HADM_ID']).sum() == 0, 'Multiple discharge summaries per admission'

Use a left merge to account for when notes are missing. 

A merge can easily produce multiple rows per key (although we dealt with that above), so I like to add an assert statement after a merge.

In [None]:
df_adm_notes = pd.merge(df_adm[['SUBJECT_ID','HADM_ID','ADMITTIME','DISCHTIME','DAYS_NEXT_ADMIT','NEXT_ADMITTIME','ADMISSION_TYPE','DEATHTIME']],
                        df_notes_dis_sum_last[['SUBJECT_ID','HADM_ID','TEXT']], 
                        on = ['SUBJECT_ID','HADM_ID'],
                        how = 'left')
assert len(df_adm) == len(df_adm_notes), 'Number of rows increased'
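
As an alternative to the row-count assert, `pd.merge` has a `validate` argument that raises a `MergeError` when the keys are not unique on the side you expect. The sketch below uses small hypothetical frames (`adm`, `notes`) instead of the real tables.

```python
import pandas as pd

# Toy stand-ins for the admissions and discharge-summary tables (hypothetical)
adm = pd.DataFrame({'SUBJECT_ID': [1, 2, 3], 'HADM_ID': [100, 200, 300]})
notes = pd.DataFrame({'SUBJECT_ID': [1, 2], 'HADM_ID': [100, 200],
                      'TEXT': ['note a', 'note b']})

# left merge keeps admission 300 even though it has no note
merged = pd.merge(adm, notes, on=['SUBJECT_ID', 'HADM_ID'],
                  how='left', validate='one_to_one')

# add a duplicate note for admission 100 and merge again
notes_dup = pd.concat([notes, notes.iloc[[0]]], ignore_index=True)
try:
    pd.merge(adm, notes_dup, on=['SUBJECT_ID', 'HADM_ID'],
             how='left', validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    # pandas refuses the merge because the right-hand keys are not unique
    raised = True

print(len(merged), raised)
```

With `validate`, the check happens inside the merge itself, so a duplicate key fails loudly instead of silently adding rows.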

10.6% of the admissions are missing a discharge summary (`df_adm_notes.TEXT.isnull().sum() / len(df_adm_notes)`), so I investigated a bit further with


In [None]:
# fraction of admissions with a missing discharge summary, per admission type
df_adm_notes.TEXT.isnull().groupby(df_adm_notes.ADMISSION_TYPE).mean()


And discovered that 53% of the NEWBORN admissions were missing discharge summaries vs ~4% for the others. 

At this point I decided to remove the NEWBORN admissions. Most likely, these missing NEWBORN admissions have their discharge summary stored outside of the MIMIC dataset.
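
The later cells refer to a frame named `df_adm_notes_clean`. A minimal sketch of how the NEWBORN rows could be dropped to produce such a frame is shown here on a hypothetical toy table (`toy_adm_notes`); it is an assumption about the step, not necessarily the exact filter used.

```python
import pandas as pd

# Toy stand-in for df_adm_notes (hypothetical values)
toy_adm_notes = pd.DataFrame({
    'HADM_ID': [100, 200, 300, 400],
    'ADMISSION_TYPE': ['EMERGENCY', 'NEWBORN', 'ELECTIVE', 'NEWBORN'],
    'TEXT': ['note', None, 'note', 'note'],
})

# keep every admission that is not a NEWBORN admission;
# .copy() avoids chained-assignment warnings when columns are added later
toy_clean = toy_adm_notes.loc[toy_adm_notes.ADMISSION_TYPE != 'NEWBORN'].copy()
print(toy_clean)
```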

For this problem, we are going to classify whether a patient will be readmitted within the next 30 days.

Therefore, we need to create a variable with the output label (1 = readmitted, 0 = not readmitted)

In [None]:
df_adm_notes_clean['OUTPUT_LABEL'] = (df_adm_notes_clean.DAYS_NEXT_ADMIT < 30).astype('int')
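
One detail worth knowing about this comparison: a missing `DAYS_NEXT_ADMIT` (no later admission) compares as `False`, so it becomes a negative label. The toy values below are hypothetical and only illustrate that behavior.

```python
import numpy as np
import pandas as pd

# Toy stand-in: NaN means the patient has no subsequent admission
toy = pd.DataFrame({'DAYS_NEXT_ADMIT': [5.0, 45.0, np.nan, 29.9]})

# NaN < 30 evaluates to False, so the NaN row is labeled 0
toy['OUTPUT_LABEL'] = (toy.DAYS_NEXT_ADMIT < 30).astype('int')
print(toy)
```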

Counting the positive and negative labels gives:
- 3004 positive samples
- 48109 negative samples

This indicates that we have an imbalanced dataset, which is a common occurrence in healthcare data science.
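
To put a number on the imbalance, the prevalence can be computed directly from the two counts above. The variable names are just illustrative.

```python
# counts reported above
n_pos = 3004
n_neg = 48109

# prevalence = fraction of samples with a positive label
prevalence = n_pos / (n_pos + n_neg)
print(f'prevalence: {prevalence:.1%}')  # about 5.9%

# accuracy of a trivial model that always predicts "not readmitted"
baseline_accuracy = n_neg / (n_pos + n_neg)
print(f'always-negative accuracy: {baseline_accuracy:.1%}')  # about 94.1%
```

A model can therefore reach roughly 94% accuracy without learning anything, which is why accuracy alone is a poor metric here.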

## Split the data into training, validation, and test sets

For reproducible results, I always set `random_state` to 42.

In [None]:
# shuffle the samples
df_adm_notes_clean = df_adm_notes_clean.sample(n = len(df_adm_notes_clean), random_state = 42)
df_adm_notes_clean = df_adm_notes_clean.reset_index(drop = True)

# Save 30% of the data as validation and test data 
df_valid_test=df_adm_notes_clean.sample(frac=0.30,random_state=42)
df_test = df_valid_test.sample(frac = 0.5, random_state = 42)
df_valid = df_valid_test.drop(df_test.index)

# use the rest of the data as training data
df_train_all=df_adm_notes_clean.drop(df_valid_test.index)
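
A quick sanity check after a split like this is to confirm the sizes, confirm that no admission appears in two splits, and compare the prevalence across splits. The sketch below repeats the same steps on a hypothetical 100-row frame (`toy`), so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_adm_notes_clean: 100 admissions, 10% positive (hypothetical)
toy = pd.DataFrame({'HADM_ID': range(100), 'OUTPUT_LABEL': [1] * 10 + [0] * 90})

# same steps as above: shuffle, hold out 30%, halve it into test and validation
toy = toy.sample(n=len(toy), random_state=42).reset_index(drop=True)
toy_valid_test = toy.sample(frac=0.30, random_state=42)
toy_test = toy_valid_test.sample(frac=0.5, random_state=42)
toy_valid = toy_valid_test.drop(toy_test.index)
toy_train_all = toy.drop(toy_valid_test.index)

splits = {'train': toy_train_all, 'valid': toy_valid, 'test': toy_test}

# the splits should cover every row exactly once
assert sum(len(d) for d in splits.values()) == len(toy)
ids = [set(d.HADM_ID) for d in splits.values()]
assert ids[0].isdisjoint(ids[1])
assert ids[0].isdisjoint(ids[2])
assert ids[1].isdisjoint(ids[2])

# size and prevalence of each split
for name, d in splits.items():
    print(name, len(d), d.OUTPUT_LABEL.mean())
```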

Since the prevalence is so low, we want to prevent the model from always predicting negative (not re-admitted). To do this, we have a few options to balance the training data:
- sub-sampling the negatives
- over-sampling the positives
- creating synthetic data (e.g. SMOTE)

In [None]:
# split the training data into positive and negative
rows_pos = df_train_all.OUTPUT_LABEL == 1
df_train_pos = df_train_all.loc[rows_pos]
df_train_neg = df_train_all.loc[~rows_pos]
# merge the balanced data
df_train = pd.concat([df_train_pos, df_train_neg.sample(n = len(df_train_pos), random_state = 42)],axis = 0)
# shuffle the order of training samples 
df_train = df_train.sample(n = len(df_train), random_state = 42).reset_index(drop = True)
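
The cell above takes the first option, sub-sampling the negatives. For comparison, over-sampling the positives could look like the sketch below, which samples the positive rows with replacement until they match the number of negatives. The frame `toy_train_all` is a hypothetical stand-in; this is not the approach used in the rest of the notebook.

```python
import pandas as pd

# Toy stand-in for df_train_all: 4 positives, 16 negatives (hypothetical)
toy_train_all = pd.DataFrame({'HADM_ID': range(20),
                              'OUTPUT_LABEL': [1] * 4 + [0] * 16})

rows_pos = toy_train_all.OUTPUT_LABEL == 1
toy_pos = toy_train_all.loc[rows_pos]
toy_neg = toy_train_all.loc[~rows_pos]

# draw positives with replacement until there are as many as negatives
toy_pos_over = toy_pos.sample(n=len(toy_neg), replace=True, random_state=42)

# combine and shuffle
toy_train_over = pd.concat([toy_pos_over, toy_neg], axis=0)
toy_train_over = toy_train_over.sample(n=len(toy_train_over),
                                       random_state=42).reset_index(drop=True)

print(len(toy_train_over), toy_train_over.OUTPUT_LABEL.mean())
```

Over-sampling should only be applied to the training split, after the validation and test sets have been held out, so that duplicated positives do not leak into the evaluation data.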