In [1]:
import pandas as pd
import numpy as np

# Diabetes study

## Problem understanding

- Diabetes is a major contributor to risk for hospital readmission, representing nearly one-fifth of all 30-day unplanned hospital readmissions. (Soh et al. 2020)
- <u>Patient characteristics (such as gender, age, race, and comorbidities) may affect the outcomes</u> (Soh et al. 2020)

<u>**Objective**</u>

- Create a model for early identification of readmission. Such a model could help plan interventions for high-risk patients.
- Point out patterns and insights about readmission. Which variables are the strongest readmission predictors?

## Data understanding

The data set comprises two files, `diabetic_data.csv` and `IDs_mapping.csv`. The former contains anonymized medical data collected during about 100k encounters across several hospitals over nearly a decade. The latter contains a legend for some of the numerical categories (like admission type and discharge disposition).

In [2]:
df = pd.read_csv("../data/diabetic_data.zip")
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


Missing values are represented by question marks in this data set. I will replace them with `np.nan` so that pandas' built-in methods can detect and handle them as missing values.

In [3]:
df.replace("?", np.nan, inplace=True)
df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
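As an alternative sketch, the placeholder can also be converted while the file is being read, using the `na_values` parameter of `pd.read_csv`. The example below uses a small in-memory CSV that I made up for illustration (`sample_csv` and `sample` are hypothetical names), not the real file.

```python
import io

import pandas as pd

# Tiny made-up sample that mimics the "?" placeholder of the real file.
sample_csv = io.StringIO(
    "encounter_id,race,weight\n"
    "1,Caucasian,?\n"
    "2,?,[50-75)\n"
)

# na_values tells the parser to treat "?" as missing while reading.
sample = pd.read_csv(sample_csv, na_values="?")
print(sample.isnull().sum())
```

Either approach should give the same result; doing it at load time simply avoids a second pass over the dataframe.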


Overall view of the dataframe, including non-null counts and dtypes:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      99493 non-null   object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    3197 non-null    object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                61510 non-null   object
 11  medical_specialty         51817 non-null   object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

Fraction of missing values for each column:

In [56]:
df.isnull().mean()

encounter_id                0.000000
patient_nbr                 0.000000
race                        0.022336
gender                      0.000000
age                         0.000000
weight                      0.968585
admission_type_id           0.000000
discharge_disposition_id    0.000000
admission_source_id         0.000000
time_in_hospital            0.000000
payer_code                  0.395574
medical_specialty           0.490822
num_lab_procedures          0.000000
num_procedures              0.000000
num_medications             0.000000
number_outpatient           0.000000
number_emergency            0.000000
number_inpatient            0.000000
diag_1                      0.000206
diag_2                      0.003518
diag_3                      0.013983
number_diagnoses            0.000000
max_glu_serum               0.000000
A1Cresult                   0.000000
metformin                   0.000000
repaglinide                 0.000000
nateglinide                 0.000000
c
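To make the worst offenders easier to spot, the same fractions can be turned into percentages, filtered, and sorted. This is a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `df`); only the pattern is meant to carry over.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df; the values are illustrative only.
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[50-75)"],
    "race": ["Caucasian", np.nan, "Asian", "Other"],
    "age": ["[0-10)", "[10-20)", "[20-30)", "[30-40)"],
})

# Fraction of missing values converted to a percentage.
missing_pct = toy.isnull().mean().mul(100)

# Keep only the columns that actually have missing values, worst first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct.round(1))
```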

I need more details about the possible values. Knowing only the variables' dtypes won't be enough, so let's see how many unique values each column has, along with a few examples.

In [59]:
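# Print the number of distinct values and a few examples for each column.
print("Variable              Unique values  Examples")
print("="*100)
for column in df:
    uniques = set(df[column])  # distinct values, NaN included
    n_uniques = len(uniques)
    examples = list(uniques)[:5]  # arbitrary sample, since sets are unordered
    output = "{:<25} {:<10} {:<20}".format(column, n_uniques, str(examples))
    print(output)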

Variable              Unique values  Examples
encounter_id              101766     [77856768, 173015040, 84934662, 273678342, 17563668]
patient_nbr               71518      [33947649, 92667906, 82706436, 83623941, 128319494]
race                      6          [nan, 'Other', 'Asian', 'Caucasian', 'AfricanAmerican']
gender                    3          ['Female', 'Unknown/Invalid', 'Male']
age                       10         ['[30-40)', '[0-10)', '[40-50)', '[80-90)', '[50-60)']
weight                    10         [nan, '[175-200)', '[100-125)', '[50-75)', '[25-50)']
admission_type_id         8          [1, 2, 3, 4, 5]     
discharge_disposition_id  26         [1, 2, 3, 4, 5]     
admission_source_id       17         [1, 2, 3, 4, 5]     
time_in_hospital          14         [1, 2, 3, 4, 5]     
payer_code                18         [nan, 'SP', 'CP', 'DM', 'MD']
medical_specialty         73         [nan, 'Resident', 'PhysicianNotFound', 'Orthopedics', 'Hematology']
num_lab_procedures  

### IDs Mapping

Taking a look at `IDs_mapping.csv` now: the file isn't in standard CSV format. It's actually a text file mapping numerical categories to strings. Using this file, I'll replace each number with the appropriate name in the original dataframe.

In [8]:
# Map for admission_type_id
admtype_map = pd.read_csv("../data/IDs_mapping.zip", skiprows=0, nrows=8)

# Map for discharge_disposition_id
discharge_map = pd.read_csv("../data/IDs_mapping.zip", skiprows=10, nrows=30)

# Map for admission_source_id
admsrc_map = pd.read_csv("../data/IDs_mapping.zip", skiprows=42, nrows=25)

In [9]:
admsrc_map.head()

Unnamed: 0,admission_source_id,description
0,1,Physician Referral
1,2,Clinic Referral
2,3,HMO Referral
3,4,Transfer from a hospital
4,5,Transfer from a Skilled Nursing Facility (SNF)
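One possible way to apply such a map is to turn it into a lookup Series and call `Series.map`. The sketch below uses made-up stand-ins (`toy_map`, `toy_df`, and the new column name `admission_source` are hypothetical) with descriptions taken from the output above; the id `99` is an invented value that shows what happens to ids missing from the map.

```python
import pandas as pd

# Stand-in for admsrc_map, using descriptions from the output above.
toy_map = pd.DataFrame({
    "admission_source_id": [1, 2, 3],
    "description": ["Physician Referral", "Clinic Referral", "HMO Referral"],
})

# Stand-in for df; 99 is an invented id that is absent from the map.
toy_df = pd.DataFrame({"admission_source_id": [1, 3, 2, 99]})

# Build an id -> description lookup and apply it to the id column.
lookup = toy_map.set_index("admission_source_id")["description"]
toy_df["admission_source"] = toy_df["admission_source_id"].map(lookup)
print(toy_df)
```

Ids that are not in the lookup come out as NaN, so it is worth checking for new missing values after the mapping.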


### Data set characteristics

- Covers a 10-year span (1999-2008).
- Hospital admissions in the data set are supposed to be diabetes-related **only**.
- Lab tests were performed during admission.
- Medications were administered during admission.
- Data set contains <u>multiple readmissions of the same people</u>.
- Missing values are represented by "?".
  - `weight` is nearly useless, with about 97% of its values missing.
  - `payer_code` could be a useful variable, but has a concerning number of missing values (around 40%).
  - `medical_specialty` has about half of its values missing.
- Lots of low-variance variables in this data set (e.g. `examide` and `citoglipton` have only 1 unique value), which won't add to the model's predictive power.
- The `diag_1` to `diag_3` columns contain both strings and floating-point numbers. Googling the keyword "diagnosis" plus some of these cryptic codes showed that they are, in fact, ICD-9-CM diagnosis codes (e.g. 427 = Cardiac dysrhythmias, E852 = Accidental poisoning by other sedatives and hypnotics).
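Two of the observations above can be checked programmatically. The sketch below, on a made-up frame, flags columns with a single distinct value and separates the numeric ICD-9-CM codes from the supplementary ones that start with "E" or "V". The frame `toy` and the codes `250.83` and `V57` are illustrative assumptions, not values taken from the outputs above.

```python
import pandas as pd

# Made-up frame; column names follow the data set, values are illustrative.
toy = pd.DataFrame({
    "examide": ["No", "No", "No", "No"],
    "insulin": ["No", "Up", "Steady", "No"],
    "diag_1": ["427", "E852", "250.83", "V57"],
})

# Columns with a single distinct value carry no information for a model.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)

# Codes starting with "E" or "V" belong to the supplementary classifications.
is_supplementary = toy["diag_1"].str[0].isin(["E", "V"])

# The remaining codes are numeric and can be converted for range-based grouping.
numeric_codes = pd.to_numeric(toy["diag_1"].where(~is_supplementary), errors="coerce")
print(numeric_codes)
```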

**Target variable:** 30-day readmission, i.e. unplanned or unexpected readmission to the same hospital within 30 days of being discharged.
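A binary target could be derived from the `readmitted` column along these lines. This is only a sketch: the values `NO` and `>30` appear in the rows shown earlier, while `<30` is my assumption about how readmissions within 30 days are labelled, and `readmitted_30d` is a hypothetical column name.

```python
import pandas as pd

# "NO" and ">30" appear in the sample rows; "<30" is an assumed label.
toy = pd.DataFrame({"readmitted": ["NO", ">30", "<30", "NO"]})

# 1 = readmitted within 30 days, 0 = everything else.
toy["readmitted_30d"] = (toy["readmitted"] == "<30").astype(int)
print(toy)
```

Before relying on this, the actual labels should be confirmed with `df["readmitted"].value_counts()`.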

## Data preparation

## Modeling

## Evaluation