# Predicting hospital readmissions using NLP-DL
### By Adam Bradfield

Dataset: https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008

Journal: http://downloads.hindawi.com/journals/bmri/2014/781670.pdf

In [5]:
import numpy as np
import pandas as pd
import keras
import tensorflow as tf
import matplotlib.pyplot as plt

In [8]:
# Import the data

pdata = pd.read_csv('dataset_diabetes/diabetic_data.csv')
pdata.shape

(101766, 50)

In [14]:
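# Note: without parentheses this displays the bound method's repr (the whole frame);
# pdata.head() would show only the first five rows
pdata.head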

<bound method NDFrame.head of         encounter_id  patient_nbr             race  gender      age weight  \
0            2278392      8222157        Caucasian  Female   [0-10)      ?   
1             149190     55629189        Caucasian  Female  [10-20)      ?   
2              64410     86047875  AfricanAmerican  Female  [20-30)      ?   
3             500364     82442376        Caucasian    Male  [30-40)      ?   
4              16680     42519267        Caucasian    Male  [40-50)      ?   
...              ...          ...              ...     ...      ...    ...   
101761     443847548    100162476  AfricanAmerican    Male  [70-80)      ?   
101762     443847782     74694222  AfricanAmerican  Female  [80-90)      ?   
101763     443854148     41088789        Caucasian    Male  [70-80)      ?   
101764     443857166     31693671        Caucasian  Female  [80-90)      ?   
101765     443867222    175429310        Caucasian    Male  [70-80)      ?   

        admission_type_id  discha

The raw table is messy and contains columns we may not need, so let's clean it up a bit.

In [10]:
list(pdata.columns)

['encounter_id',
 'patient_nbr',
 'race',
 'gender',
 'age',
 'weight',
 'admission_type_id',
 'discharge_disposition_id',
 'admission_source_id',
 'time_in_hospital',
 'payer_code',
 'medical_specialty',
 'num_lab_procedures',
 'num_procedures',
 'num_medications',
 'number_outpatient',
 'number_emergency',
 'number_inpatient',
 'diag_1',
 'diag_2',
 'diag_3',
 'number_diagnoses',
 'max_glu_serum',
 'A1Cresult',
 'metformin',
 'repaglinide',
 'nateglinide',
 'chlorpropamide',
 'glimepiride',
 'acetohexamide',
 'glipizide',
 'glyburide',
 'tolbutamide',
 'pioglitazone',
 'rosiglitazone',
 'acarbose',
 'miglitol',
 'troglitazone',
 'tolazamide',
 'examide',
 'citoglipton',
 'insulin',
 'glyburide-metformin',
 'glipizide-metformin',
 'glimepiride-pioglitazone',
 'metformin-rosiglitazone',
 'metformin-pioglitazone',
 'change',
 'diabetesMed',
 'readmitted']

Looking at all the fields here, there are a few identifying columns:

'patient_nbr' contains a unique identifier for each patient. This is useful because it lets us track return visits and see how the patient does over time.

'race', 'gender', & 'age' are all useful for spotting trends among the patients who are being admitted to the hospital. 'weight' would also be a good metric, but this data set is missing that value for 97% of encounters, which is too many to work with.

The 'readmitted' field is particularly useful, as it tells us whether the patient was readmitted after the encounter.
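
For a model, the 'readmitted' column usually needs to become a numeric target. The sketch below is one possible encoding, assuming the column takes the values 'NO', '>30', and '<30' and that the label of interest is readmission within 30 days. The `to_binary_label` helper and the choice of positive class are assumptions for illustration, not something this notebook fixes.

```python
import pandas as pd

def to_binary_label(readmitted):
    """Map a readmitted column to 1 for '<30' and 0 for every other value."""
    return (readmitted == '<30').astype(int)

# Toy series standing in for pdata['readmitted'] (hypothetical values)
toy = pd.Series(['NO', '>30', '<30', 'NO'], name='readmitted')
labels = to_binary_label(toy)
print(labels.tolist())
```

On the real frame this would be called as `to_binary_label(pdata['readmitted'])`.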

'payer_code' & 'medical_specialty' also have large amounts of missing data (40% and 47% respectively), so these will not be making the cut either.
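
The missing-value percentages quoted above can be checked directly. This is a minimal sketch, assuming missing entries are encoded as the string '?' (as in the 'weight' column shown earlier). `missing_fraction` is a hypothetical helper, and the toy frame only stands in for `pdata`.

```python
import pandas as pd

def missing_fraction(df, marker='?'):
    """Return the fraction of entries equal to `marker` per column, largest first."""
    return (df == marker).mean().sort_values(ascending=False)

# Toy frame standing in for pdata (hypothetical values)
toy = pd.DataFrame({
    'weight': ['?', '?', '?', '[75-100)'],
    'payer_code': ['MC', '?', 'HM', '?'],
    'age': ['[0-10)', '[10-20)', '[20-30)', '[30-40)'],
})
fractions = missing_fraction(toy)
print(fractions)
```

Calling `missing_fraction(pdata)` on the full frame would list the columns with the most '?' entries first.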

Using only the first visit of each patient also keeps bias from creeping in across multiple visits, so we drop repeated visits.

In [26]:
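# Note: drop() returns a new DataFrame; pdata itself is unchanged unless the result is assigned back
pdata.drop(['weight', 'payer_code', 'medical_specialty'], axis=1)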

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),6,25,1,1,41,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),1,1,7,3,59,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),1,1,7,2,11,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),1,1,7,2,44,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),1,1,7,1,51,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
5,35754,82637451,Caucasian,Male,[50-60),2,1,2,3,31,...,No,Steady,No,No,No,No,No,No,Yes,>30
6,55842,84259809,Caucasian,Male,[60-70),3,1,2,4,70,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
7,63768,114882984,Caucasian,Male,[70-80),1,1,7,5,73,...,No,No,No,No,No,No,No,No,Yes,>30
8,12522,48330783,Caucasian,Female,[80-90),2,1,4,13,68,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,15738,63555939,Caucasian,Female,[90-100),3,3,4,12,33,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
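
`DataFrame.drop` returns a new frame rather than modifying `pdata` in place, so the call above only displays the result. If the columns should actually be removed, the result has to be assigned. A minimal sketch on a toy frame (the values and the `trimmed` name are hypothetical):

```python
import pandas as pd

# Toy frame standing in for pdata (hypothetical values)
toy = pd.DataFrame({
    'patient_nbr': [1, 2],
    'weight': ['?', '?'],
    'payer_code': ['?', 'MC'],
    'medical_specialty': ['?', 'Cardiology'],
})

# drop() returns a new DataFrame; assign it to keep the result
trimmed = toy.drop(columns=['weight', 'payer_code', 'medical_specialty'])

print(toy.shape)      # the original frame still has all four columns
print(trimmed.shape)  # only patient_nbr remains
```

The equivalent on the real data would be `pdata = pdata.drop(columns=[...])`, which would also reduce the column count reported by `pdata.shape`.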


In [29]:
# pdata.loc[pdata['patient_nbr'] == 33230016]

# Remove duplicates: keep only the first row listed for each patient
pdata = pdata.drop_duplicates('patient_nbr')

# Remove encounters that ended in death or a discharge to hospice
pdata = pdata[~pdata['discharge_disposition_id'].isin([11, 13, 14, 19, 20, 21])]
# Now it is considerably smaller, but hopefully will produce better results
pdata.shape

(69973, 50)
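
`drop_duplicates` keeps the first row it encounters for each 'patient_nbr', which is the first visit only if the rows are already in chronological order. The sketch below sorts before deduplicating, assuming that a smaller 'encounter_id' means an earlier encounter. That is an assumption about the dataset, not something verified here, and `first_encounters` is a hypothetical helper.

```python
import pandas as pd

def first_encounters(df):
    """Keep one row per patient: the row with the smallest encounter_id."""
    return (df.sort_values('encounter_id')
              .drop_duplicates('patient_nbr', keep='first'))

# Toy frame standing in for pdata (hypothetical values)
toy = pd.DataFrame({
    'encounter_id': [30, 10, 20, 40],
    'patient_nbr': [1, 1, 2, 2],
})
result = first_encounters(toy)
print(result)
```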

In [35]:
pdata['diag_3']

0              ?
1            255
2            V27
3            403
4            250
           ...  
101754    250.02
101755       518
101756       403
101758       304
101765       787
Name: diag_3, Length: 69973, dtype: object