## Exploratory Data Analysis

### Identifiers
- SUBJECT_ID: identifies a unique patient
- HADM_ID: identifies a unique hospital admission
- ICUSTAY_ID: identifies a unique intensive care unit (ICU) stay

### Charted events
- The OUTPUT events table contains all output-related measurements for a given patient.

### Create the primary outcome variable
- The length of stay is the time from hospital admission to hospital discharge.
- The data contain protected health information, so the actual admission and discharge times are shifted.
- The distribution of the outcome variable is highly skewed. There are two options:
    - Dichotomize the outcome variable (which yields an imbalanced classification problem).
    - Keep the variable continuous and find suitable methods for modeling a skewed distribution.
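The two options above can be sketched as follows. This is a minimal illustration, not the processing actually used for `processed_data1.csv`: the three rows are copied from the sample output in this notebook, the 7-day cut-off `LONG_STAY_DAYS` is a hypothetical choice, and the log transform is only one possible way to handle skew. It assumes the date shift is applied consistently within an admission, so differences between timestamps remain meaningful.

```python
import numpy as np
import pandas as pd

# Toy rows; ADMITTIME / DISCHTIME values are taken from the sample output.
toy = pd.DataFrame({
    "ADMITTIME": ["2138-07-17 19:04:00", "2101-10-20 19:08:00", "2175-05-30 07:15:00"],
    "DISCHTIME": ["2138-07-21 15:48:00", "2101-10-31 13:58:00", "2175-06-15 16:00:00"],
})

admit = pd.to_datetime(toy["ADMITTIME"])
disch = pd.to_datetime(toy["DISCHTIME"])

# Length of stay in fractional days (discharge minus admission).
toy["los_days"] = (disch - admit).dt.total_seconds() / 86400

# Option 1: dichotomize. The threshold is an assumption for illustration only.
LONG_STAY_DAYS = 7
toy["long_stay"] = (toy["los_days"] > LONG_STAY_DAYS).astype(int)

# Option 2: keep the outcome continuous and reduce skew with log(1 + x).
toy["log_los"] = np.log1p(toy["los_days"])

print(toy[["los_days", "long_stay", "log_los"]])
```

With a real threshold, checking `toy["long_stay"].mean()` gives a quick view of how imbalanced the resulting classes are.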

### Discharge Location
- Most patients are discharged home, under one of four categories: home, home health care, hospice-home, or home with home IV provider.
- Some patients are discharged to a SNF (skilled nursing facility). This can also serve as a secondary outcome variable.
- Some patients are discharged to a hospice-medical facility.

**Note:**  
- Home Health Care: Home health care is a wide range of health care services that can be given in your home for an illness or injury. Home health care is usually less expensive, more convenient, and just as effective as care you get in a hospital or skilled nursing facility (SNF). Reference: https://www.medicare.gov/what-medicare-covers/whats-home-health-care
- Long-term care hospital: Patients who need intensive care for an extended time are often transferred to a long-term care hospital to continue that care.

### Potential predictors
#### Demographic Data
- Insurance type, language, ethnicity, marital status, age, sex

#### Diagnosis
- The diagnosis variable contains too many levels.
    - Delete levels with too few patients.
    - Combine related levels, such as diagnoses starting with "nausea" and the subtypes of coronary artery disease.
- After combining the patients table with the admissions table, there is no missing data.
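One way to reduce the number of diagnosis levels is sketched below. The diagnosis strings, the prefix list `PREFIX_GROUPS`, and the minimum count `MIN_PATIENTS` are all hypothetical values chosen for illustration; the real column would need its own list of prefixes and a threshold chosen from the data.

```python
import pandas as pd

# Toy diagnosis strings (made up for this example).
diag = pd.Series([
    "CORONARY ARTERY DISEASE",
    "CORONARY ARTERY DISEASE\\CABG",
    "NAUSEA",
    "NAUSEA;VOMITING",
    "PNEUMONIA",
    "PNEUMONIA",
    "RARE CONDITION",
], name="DIAGNOSIS")

# Hypothetical prefixes whose variants are merged into one level.
PREFIX_GROUPS = ("NAUSEA", "CORONARY ARTERY DISEASE")
# Hypothetical minimum number of patients a level needs to be kept.
MIN_PATIENTS = 2


def combine_by_prefix(value):
    """Map a diagnosis to its prefix group, or return it unchanged."""
    for prefix in PREFIX_GROUPS:
        if value.startswith(prefix):
            return prefix
    return value


# Step 1: normalize the text and merge levels that share a prefix.
grouped = diag.str.upper().str.strip().map(combine_by_prefix, na_action="ignore")

# Step 2: drop levels that have too few patients.
counts = grouped.value_counts()
rare_levels = counts[counts < MIN_PATIENTS].index
kept = grouped[~grouped.isin(rare_levels)]

print(kept.value_counts())
```

Instead of dropping the rare levels, they could be relabeled as a single "OTHER" level, which keeps every patient in the data set at the cost of a heterogeneous category.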

#### Vitals

### DRG codes
- The diagnosis-related group (DRG) codes form a system that classifies hospital cases into one of 467 groups.

According to the [PhysioNet tutorial](https://physionet.org/content/mimiciii/1.4/), all patients older than 89 years are assigned a fake age. Here, I set the age of every such patient to 89.
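The capping step can be written as a single `clip` call. The sketch below uses a toy `age` series; the value 300 is only an assumed stand-in for a fake age, not a value taken from this data set.

```python
import pandas as pd

# Toy ages; 300 and 91 stand in for patients older than 89 (assumed values).
ages = pd.Series([0, 76, 48, 66, 300, 91], name="age")

# Replace every age above 89 with 89 and leave the others unchanged.
capped = ages.clip(upper=89)

print(capped.tolist())
```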

________________________________________

In [1]:
import pandas as pd
import numpy as np

In [18]:
df = pd.read_csv('../Data/processed_data1.csv')

In [19]:
df = df.drop(columns=['Unnamed: 0', 'ROW_ID'])

In [20]:
df

| | SUBJECT_ID | HADM_ID | ADMITTIME | DISCHTIME | DEATHTIME | ADMISSION_TYPE | ADMISSION_LOCATION | DISCHARGE_LOCATION | INSURANCE | LANGUAGE | ... | INTIME | OUTTIME | ICU_LOS | Height | Weight | GENDER | DOB | DOD | Hosp_LOS | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 163353 | 2138-07-17 19:04:00 | 2138-07-21 15:48:00 | | NEWBORN | PHYS REFERRAL/NORMAL DELI | HOME | Private | | ... | 2138-07-17T21:20:07 | 2138-07-17T23:32:21 | 0.0918 | | | M | 2138-07-17 00:00:00 | | 3 days 20:44:00.000000000 | 0 |
| 1 | 3 | 145834 | 2101-10-20 19:08:00 | 2101-10-31 13:58:00 | | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Medicare | | ... | 2101-10-20T19:10:11 | 2101-10-26T20:43:09 | 6.0646 | 179.07 | 96.8 | M | 2025-04-11 00:00:00 | 2102-06-14T00:00:00 | 10 days 18:50:00.000000000 | 76 |
| 2 | 4 | 185777 | 2191-03-16 00:28:00 | 2191-03-23 18:41:00 | | EMERGENCY | EMERGENCY ROOM ADMIT | HOME WITH HOME IV PROVIDR | Private | | ... | 2191-03-16T00:29:31 | 2191-03-17T16:46:31 | 1.6785 | | 53.6 | F | 2143-05-12 00:00:00 | | 7 days 18:13:00.000000000 | 48 |
| 3 | 5 | 178980 | 2103-02-02 04:31:00 | 2103-02-04 12:15:00 | | NEWBORN | PHYS REFERRAL/NORMAL DELI | HOME | Private | | ... | 2103-02-02T06:04:24 | 2103-02-02T08:06:00 | 0.0844 | | | M | 2103-02-02 00:00:00 | | 2 days 07:44:00.000000000 | 0 |
| 4 | 6 | 107064 | 2175-05-30 07:15:00 | 2175-06-15 16:00:00 | | ELECTIVE | PHYS REFERRAL/NORMAL DELI | HOME HEALTH CARE | Medicare | ENGL | ... | 2175-05-30T21:30:54 | 2175-06-03T13:39:54 | 3.6729 | | | F | 2109-06-21 00:00:00 | | 16 days 08:45:00.000000000 | 66 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 46515 | 99985 | 176670 | 2181-01-27 02:47:00 | 2181-02-12 17:05:00 | | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Private | ENGL | ... | 2181-01-29T05:33:34 | 2181-02-09T12:45:20 | 11.2998 | | | M | 2127-04-08 00:00:00 | | 16 days 14:18:00.000000000 | 54 |
| 46516 | 99991 | 151118 | 2184-12-24 08:30:00 | 2185-01-05 12:15:00 | | ELECTIVE | PHYS REFERRAL/NORMAL DELI | HOME | Private | ENGL | ... | 2184-12-28T17:30:58 | 2184-12-31T20:56:20 | 3.1426 | | 100.5 | M | 2137-04-07 00:00:00 | | 12 days 03:45:00.000000000 | 47 |
| 46517 | 99992 | 197084 | 2144-07-25 18:03:00 | 2144-07-28 17:56:00 | | EMERGENCY | CLINIC REFERRAL/PREMATURE | SNF | Medicare | ENGL | ... | 2144-07-25T18:04:42 | 2144-07-27T17:27:55 | 1.9745 | | 65.4 | F | 2078-10-17 00:00:00 | | 2 days 23:53:00.000000000 | 66 |
| 46518 | 99995 | 137810 | 2147-02-08 08:00:00 | 2147-02-11 13:15:00 | | ELECTIVE | PHYS REFERRAL/NORMAL DELI | HOME | Medicare | ENGL | ... | 2147-02-08T13:53:58 | 2147-02-10T17:46:30 | 2.1615 | 159.00 | 68.0 | F | 2058-05-29 00:00:00 | 2147-09-29T00:00:00 | 3 days 05:15:00.000000000 | 89 |


In [21]:
np.sum(df['SUBJECT_ID'].value_counts() != 1)

0

In [39]:
df.dtypes

SUBJECT_ID                int64
HADM_ID                   int64
ADMITTIME                object
DISCHTIME                object
DEATHTIME                object
ADMISSION_TYPE           object
ADMISSION_LOCATION       object
DISCHARGE_LOCATION       object
INSURANCE                object
LANGUAGE                 object
RELIGION                 object
MARITAL_STATUS           object
ETHNICITY                object
EDREGTIME                object
EDOUTTIME                object
DIAGNOSIS                object
HOSPITAL_EXPIRE_FLAG      int64
HAS_CHARTEVENTS_DATA      int64
HeartRate_Min           float64
HeartRate_Max           float64
HeartRate_Mean          float64
SysBP_Min               float64
SysBP_Max               float64
SysBP_Mean              float64
DiasBP_Min              float64
DiasBP_Max              float64
DiasBP_Mean             float64
TempC_Max               float64
RespRate_Max            float64
RespRate_Mean           float64
HeartRate_Mean_1        float64
HeartRat

In [27]:
df['DIAGNOSIS'].nunique()

12607

In [31]:
df['HeartRate_Min']

0        140.0
1         75.0
2         74.0
3          NaN
4         76.0
         ...  
46515     74.0
46516     91.0
46517     60.0
46518     49.0
46519     78.0
Name: HeartRate_Min, Length: 46520, dtype: float64

## Correlation

In [37]:
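df_new = df.drop(['SUBJECT_ID', 'HADM_ID'], axis=1)
# Restrict to numeric columns; pandas >= 2.0 raises an error on object columns otherwise
df_new.corr(numeric_only=True)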

| | HOSPITAL_EXPIRE_FLAG | HAS_CHARTEVENTS_DATA | HeartRate_Min | HeartRate_Max | HeartRate_Mean | SysBP_Min | SysBP_Max | SysBP_Mean | DiasBP_Min | DiasBP_Max | ... | HeartRate_Mean_1 | HeartRate_Min_1 | Glucose_Max | Glucose_Min | Glucose_Mean | icustay_id | ICU_LOS | Height | Weight | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HOSPITAL_EXPIRE_FLAG | 1.0 | 0.039791 | -0.09434 | 0.021636 | -0.026477 | -0.199091 | -0.00318 | -0.113825 | -0.155743 | -0.004643 | ... | -0.026477 | -0.09434 | 0.000306 | 0.112074 | 0.007575 | -0.002174 | 0.024218 | -0.040363 | -0.061062 | 0.185233 |
| HAS_CHARTEVENTS_DATA | 0.039791 | 1.0 | | | | | | | | | ... | | | | | | -0.00048 | 0.017055 | 0.000673 | 0.000244 | -0.020156 |
| HeartRate_Min | -0.09434 | | 1.0 | 0.769227 | 0.934514 | 0.016643 | -0.098277 | -0.063177 | 0.152474 | 0.064184 | ... | 0.934514 | 1.0 | -0.009288 | -0.132156 | -0.017269 | -0.002164 | 0.227665 | 0.000393 | 0.047394 | -0.712567 |
| HeartRate_Max | 0.021636 | | 0.769227 | 1.0 | 0.905678 | -0.162068 | 0.042769 | -0.087142 | 0.029541 | 0.186637 | ... | 0.905678 | 0.769227 | -0.004493 | -0.051748 | -0.006641 | -0.002317 | 0.29083 | -0.023418 | -0.005984 | -0.623271 |
| HeartRate_Mean | -0.026477 | | 0.934514 | 0.905678 | 1.0 | -0.110468 | -0.043447 | -0.102403 | 0.09357 | 0.131375 | ... | 1.0 | 0.934514 | -0.007778 | -0.103479 | -0.013083 | -0.001725 | 0.270255 | -0.010141 | 0.028596 | -0.714095 |
| SysBP_Min | -0.199091 | | 0.016643 | -0.162068 | -0.110468 | 1.0 | 0.319046 | 0.743117 | 0.544952 | 0.170438 | ... | -0.110468 | 0.016643 | -0.002894 | 0.078657 | -0.003539 | 0.008767 | -0.102206 | 0.029536 | 0.037662 | -0.131625 |
| SysBP_Max | -0.00318 | | -0.098277 | 0.042769 | -0.043447 | 0.319046 | 1.0 | 0.745131 | 0.142468 | 0.504979 | ... | -0.043447 | -0.098277 | 0.01298 | 0.051855 | 0.016799 | 0.001734 | 0.072462 | -0.025801 | 0.02297 | 0.115947 |
| SysBP_Mean | -0.113825 | | -0.063177 | -0.087142 | -0.102403 | 0.743117 | 0.745131 | 1.0 | 0.40956 | 0.382058 | ... | -0.102403 | -0.063177 | 0.001658 | 0.099509 | 0.004327 | 0.00542 | -0.029253 | 0.000465 | 0.036976 | 0.006108 |
| DiasBP_Min | -0.155743 | | 0.152474 | 0.029541 | 0.09357 | 0.544952 | 0.142468 | 0.40956 | 1.0 | 0.287293 | ... | 0.09357 | 0.152474 | -0.004299 | 0.029177 | -0.005917 | 0.008619 | -0.056817 | 0.120198 | 0.074982 | -0.305714 |
| DiasBP_Max | -0.004643 | | 0.064184 | 0.186637 | 0.131375 | 0.170438 | 0.504979 | 0.382058 | 0.287293 | 1.0 | ... | 0.131375 | 0.064184 | 0.002558 | 0.074082 | 0.004533 | 0.00586 | 0.023929 | 0.01087 | 0.051855 | -0.095927 |
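In the matrix above, `HeartRate_Mean_1` and `HeartRate_Min_1` have a correlation of 1.0 with `HeartRate_Mean` and `HeartRate_Min`, which suggests duplicated columns. A possible way to list such pairs automatically is sketched below on synthetic data; the 0.95 cut-off `THRESHOLD` and the generated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the vitals columns (values are randomly generated).
rng = np.random.default_rng(0)
hr_mean = rng.normal(80, 10, size=200)
toy = pd.DataFrame({
    "HeartRate_Mean": hr_mean,
    "HeartRate_Mean_1": hr_mean,  # exact duplicate of the column above
    "SysBP_Mean": rng.normal(120, 15, size=200),
})

corr = toy.corr()

# Keep only the upper triangle so that each pair of columns appears once
# and the diagonal (always 1.0) is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Hypothetical cut-off for "suspiciously high" correlation.
THRESHOLD = 0.95
high = pairs[pairs.abs() > THRESHOLD]

print(high)
```

Columns flagged this way are candidates for removal before modeling, since exact duplicates add no information and make regression coefficients unstable.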


## Added variable plots

In [41]:
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
#prestige_model = ols("prestige ~ income + education", data=prestige).fit()

In [None]:
#fig = sm.graphics.plot_partregress_grid(prestige_model)
#fig.tight_layout(pad=1.0)

In [None]:
#fig = sm.graphics.plot_partregress("prestige", "income", ["income", "education"], data=prestige)
#fig.tight_layout(pad=1.0)
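For reference, an added variable (partial regression) plot for a predictor shows the residuals of the outcome against the residuals of that predictor, after both have been regressed on the remaining predictors. The sketch below computes those residuals with NumPy on synthetic data; the variables `x1`, `x2`, `y` and the helper `residualize` are hypothetical and are not part of this data set. It also checks that the slope through the residuals equals the coefficient from the full regression.

```python
import numpy as np

# Synthetic data: y depends on two correlated predictors.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)


def residualize(target, others):
    """Residuals of `target` after least squares on `others` plus an intercept."""
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


# Coordinates of the added variable plot for x1, adjusting for x2.
y_res = residualize(y, x2)
x1_res = residualize(x1, x2)

# Slope of the least-squares line through the residual scatter.
slope = (x1_res @ y_res) / (x1_res @ x1_res)

# Coefficient of x1 in the full regression of y on x1 and x2.
full_design = np.column_stack([np.ones(n), x1, x2])
full_coef, *_ = np.linalg.lstsq(full_design, y, rcond=None)

print(slope, full_coef[1])
```

Plotting `x1_res` against `y_res` with `plt.scatter` gives the added variable plot for `x1`; `sm.graphics.plot_partregress_grid` draws the same kind of plot for every regressor of a fitted statsmodels model.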