################################################################################################################
# PREDICTING DIABETES 30-DAY READMISSIONS #
################################################################################################################

## DESCRIPTION :: Diabetes 130-US hospitals for years 1999-2008 Data Set 

## SOURCE :: UC IRVINE MACHINE LEARNING REPOSITORY ::

* https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008 

## WHY I'M INTERESTED / MY DOMAIN EXPERTISE

With the Affordable Care Act (ObamaCare), more people are getting care than before. This is great, but it is forcing the health system to change its payment model from 'fee for service' to 'pay for performance'. For example, a hospital used to be paid for your stay when you were admitted for a heart attack, but not for anything that happened while you recovered afterwards. The new paradigm shifts more risk onto hospitals by penalizing them when a patient is readmitted within a set time period for problems that are deemed preventable.

- My group at work is addressing this change structurally.
- My role is to help patients better self-manage their chronic conditions so they need less health-system utilization.
- I work specifically with patients with diabetes, focusing on keeping them out of the ED and the hospital.
- A hospital stay exposes patients to additional risks (deconditioning, hospital-acquired infections, and potential errors), and on top of that, all of the tests are expensive.

## WHAT I'M HOPING TO GAIN FROM THIS
- What factors predict a readmission within 30 days, which triggers a costly penalty?
- What factors predict a readmission after 30 days (not currently penalized, but possibly preventable)?
- What factors lead to no readmission?  
- Can we find any top and bottom performers?

> # BASIC PACKAGES IMPORTS

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

> # DATA IMPORT

In [3]:
data = pd.read_csv('http://localhost:8889/files/project/dataset_diabetes/diabetic_data.csv')
data.shape

(101766, 50)

### overview

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
encounter_id                101766 non-null int64
patient_nbr                 101766 non-null int64
race                        101766 non-null object
gender                      101766 non-null object
age                         101766 non-null object
weight                      101766 non-null object
admission_type_id           101766 non-null int64
discharge_disposition_id    101766 non-null int64
admission_source_id         101766 non-null int64
time_in_hospital            101766 non-null int64
payer_code                  101766 non-null object
medical_specialty           101766 non-null object
num_lab_procedures          101766 non-null int64
num_procedures              101766 non-null int64
num_medications             101766 non-null int64
number_outpatient           101766 non-null int64
number_emergency            101766 non-null int64
number_inpatient            10176

### basic cleaning

In [14]:
data.columns = [each.replace('-','_').lower() for each in data.columns]
data = data.rename(columns = {'a1cresult' : 'a1c', 'diabetesmed' : 'diabetes_med' })
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
encounter_id                101766 non-null int64
patient_nbr                 101766 non-null int64
race                        101766 non-null object
gender                      101766 non-null object
age                         101766 non-null object
weight                      101766 non-null object
admission_type_id           101766 non-null int64
discharge_disposition_id    101766 non-null int64
admission_source_id         101766 non-null int64
time_in_hospital            101766 non-null int64
payer_code                  101766 non-null object
medical_specialty           101766 non-null object
num_lab_procedures          101766 non-null int64
num_procedures              101766 non-null int64
num_medications             101766 non-null int64
number_outpatient           101766 non-null int64
number_emergency            101766 non-null int64
number_inpatient            10176

### first look

In [15]:
data.head(50)

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide_metformin,glipizide_metformin,glimepiride_pioglitazone,metformin_rosiglitazone,metformin_pioglitazone,change,diabetes_med,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
5,35754,82637451,Caucasian,Male,[50-60),?,2,1,2,3,...,No,Steady,No,No,No,No,No,No,Yes,>30
6,55842,84259809,Caucasian,Male,[60-70),?,3,1,2,4,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
7,63768,114882984,Caucasian,Male,[70-80),?,1,1,7,5,...,No,No,No,No,No,No,No,No,Yes,>30
8,12522,48330783,Caucasian,Female,[80-90),?,2,1,4,13,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,15738,63555939,Caucasian,Female,[90-100),?,3,3,4,12,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
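`data.info()` reports every column as fully non-null, yet the `weight` column above is filled with `?`. The placeholder is a string, so pandas counts it as a value. A minimal sketch of how the share of `?` per column could be measured, using a toy frame whose values are made up for illustration (on the real data, `toy` would be `data`):

```python
import pandas as pd

# Toy stand-in for `data`; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    'race': ['Caucasian', '?', 'AfricanAmerican'],
    'weight': ['?', '?', '[75-100)'],
    'time_in_hospital': [1, 3, 2],
})

# Compare every cell with the '?' marker, then average the booleans per column
# to get the fraction of placeholder values, largest first.
placeholder_share = (toy == '?').mean().sort_values(ascending=False)
print(placeholder_share)
```

Columns with a very high share are candidates for dropping rather than imputing.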


# SOME NOTES

### Will need to clean columns
- drop encounter_id
- drop patient_nbr
- age is stored as a bracket string (e.g. '[0-10)'), should be an ordered category
- weight uses '?' for missing values, convert to n/a
- change codes to category names for
- - admission_type_id	
- - discharge_disposition_id
- - admission_source_id
- - time_in_hospital (not a code; confirm units, likely days)
- - diag_1, 2, 3
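The cleaning notes above could be sketched roughly as follows. This is an illustration on a toy frame, not the final cleaning step: `basic_clean` is a hypothetical helper, the toy values are made up, and the code-to-name dictionary is a placeholder, since the real names would have to come from the dataset's documentation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; column names follow the notebook, values are illustrative.
toy = pd.DataFrame({
    'encounter_id': [1, 2, 3],
    'patient_nbr': [10, 11, 10],
    'age': ['[0-10)', '[70-80)', '[10-20)'],
    'weight': ['?', '[75-100)', '?'],
    'admission_type_id': [6, 1, 1],
})

def basic_clean(df, admission_type_names):
    """Hypothetical helper: apply the cleaning notes to a copy of df."""
    # Identifier columns carry no predictive signal of their own.
    out = df.drop(columns=['encounter_id', 'patient_nbr'])
    # '?' is the missing-value marker; turn it into NaN so pandas sees it as missing.
    out = out.replace('?', np.nan)
    # Age brackets are strings like '[0-10)'; build them in ascending order.
    brackets = ['[{}-{})'.format(lo, lo + 10) for lo in range(0, 100, 10)]
    out['age'] = pd.Categorical(out['age'], categories=brackets, ordered=True)
    # Map numeric codes to names; codes missing from the dict become NaN.
    out['admission_type'] = out['admission_type_id'].map(admission_type_names)
    return out

# Placeholder labels only (assumption): replace with the documented code names.
placeholder_names = {1: 'type_1', 6: 'type_6'}
cleaned = basic_clean(toy, placeholder_names)
print(cleaned)
```

Making `age` an ordered categorical keeps the brackets sortable and comparable, which a plain string column would not be.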

### CLASSIFY based on 
- demographics
- - race, gender, age, weight, payer_code, a1c

- meds
- - various meds Y/N
- - count of meds whose value is not 'No'
- - classify med types (oral vs. injectable) and flag each as Y/N
- - insulin changes

- utilization
- - admission_type_id, discharge_disposition_id, admission_source_id, time_in_hospital
- - previous :: number_outpatient, number_emergency, number_inpatient
- - num_lab_procedures, num_procedures, num_medications, number_diagnoses
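As one possible way to build the medication features listed above, the sketch below counts the medication columns whose value is not 'No' and flags insulin dose changes. The column names come from the `head()` output; the toy values are made up, the feature names `n_meds_active` and `insulin_changed` are hypothetical, and 'Down' is assumed to be a valid level alongside the 'Up' and 'Steady' values seen above.

```python
import pandas as pd

# Toy stand-in for the medication columns of `data`; values are illustrative.
toy = pd.DataFrame({
    'insulin': ['No', 'Up', 'Steady'],
    'glyburide_metformin': ['No', 'Steady', 'No'],
    'glipizide_metformin': ['No', 'No', 'No'],
})
med_cols = ['insulin', 'glyburide_metformin', 'glipizide_metformin']

features = pd.DataFrame({
    # Number of medications with any value other than 'No' for each encounter.
    'n_meds_active': (toy[med_cols] != 'No').sum(axis=1),
    # True when the insulin dose moved; 'Down' is an assumed level.
    'insulin_changed': toy['insulin'].isin(['Up', 'Down']),
})
print(features)
```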

### PREDICT :: READMISSION == NO | < 30 | > 30 
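Before modelling, it may help to look at how the three outcome classes are distributed and to derive a binary flag for the penalized case. The sketch below uses a toy series; the label spelling '<30' is an assumption (only 'NO' and '>30' appear in the rows shown above), and `within_30` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for data['readmitted']; values are illustrative.
readmitted = pd.Series(['NO', '>30', 'NO', 'NO', '<30'], name='readmitted')

# Share of each class, useful for spotting class imbalance.
class_share = readmitted.value_counts(normalize=True)

# Binary target: 1 when the readmission happened within 30 days, else 0.
within_30 = (readmitted == '<30').astype(int)

print(class_share)
print(within_30.tolist())
```

Keeping the three-class label and the binary flag side by side leaves both the multi-class and the penalty-focused framing open.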