## To-do list
* Some cohort exploration based on deletion of subjects

* Look at variable correlations & clustering (if possible). Based on previous work, I suspect that e.g. systolic/diastolic will cluster, which would mean we can get away with using a combined variable score, or one of the pair but not the other

* Model checking!!
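
The correlation and clustering item above could be approached roughly as follows. This is a minimal sketch, not the analysis actually run here: the function name `correlation_clusters`, the Spearman/average-linkage choices, the 0.5 distance cutoff, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def correlation_clusters(data, threshold=0.5):
    """Group columns whose absolute Spearman correlation is high.

    `threshold` is a cutoff on the distance 1 - |rho| (illustrative value).
    """
    corr = data.corr(method="spearman").abs()
    dist = 1.0 - corr.to_numpy()          # new, writable array
    np.fill_diagonal(dist, 0.0)
    dist = (dist + dist.T) / 2.0          # enforce symmetry against rounding
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")
    return pd.Series(labels, index=data.columns, name="cluster")


# Synthetic stand-in data (not the real cohort)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "systolic": base + rng.normal(scale=0.3, size=500),
    "diastolic": base + rng.normal(scale=0.3, size=500),
    "bmi": rng.normal(size=500),
})
clusters = correlation_clusters(demo)
```

Variables that land in the same cluster are candidates for being collapsed into a score or represented by a single member.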

**Noted:**
* labs need transformation
* meds can't be used; too sparse (though try exposure variables)
* everything else should be good
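
Two of the notes above (transforming labs, trying exposure variables for meds) might look something like this. It is only a sketch under assumptions: the notes do not say which transformation is intended, so `log1p` is a guess, and the helper names and the columns `med_a`, `med_b`, `lab_x` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_exposure(data, med_cols, name="med_any"):
    """Collapse sparse medication columns into one 0/1 exposure flag."""
    out = data.copy()
    out[name] = (out[med_cols].fillna(0).sum(axis=1) > 0).astype(int)
    return out


def log_transform(data, cols):
    """Apply log(1 + x) to non-negative, right-skewed columns."""
    out = data.copy()
    out[cols] = np.log1p(out[cols])
    return out


# Hypothetical toy frame (not real data)
demo = pd.DataFrame({
    "med_a": [0, 1, 0, 0],
    "med_b": [0, 0, 0, 2],
    "lab_x": [0.0, 9.0, 99.0, 999.0],
})
demo = add_exposure(demo, ["med_a", "med_b"])
demo = log_transform(demo, ["lab_x"])
```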

## For a general modeling pipeline:
Need a utility function that:
1. Takes a DF and a list of variables,
2. Makes a GLM and samples from it,
3. Returns the trace & the model.

This will produce a set of models & traces on which to check diagnostics (LOO, WAIC, etc.) and from which to pick some "good" models.
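
One way to produce that set of models is to loop over named variable lists and hand each to the fitting utility. The sketch below assumes the fitting function is passed in as a callable returning `(model, trace)`; `build_formula`, `fit_candidates`, `stub_fit`, and the candidate names are hypothetical, and the stub only stands in for a real sampler call.

```python
def build_formula(outcome, variables):
    """Build a formula string such as 'y ~ a + b'."""
    return outcome + " ~ " + " + ".join(variables)


def fit_candidates(candidates, fit_fn, outcome="readmit_30d"):
    """Fit one model per named variable list.

    fit_fn(outcome, variables) must return a (model, trace) pair.
    """
    results = {}
    for name, variables in candidates.items():
        model, trace = fit_fn(outcome, variables)
        results[name] = {
            "formula": build_formula(outcome, variables),
            "model": model,
            "trace": trace,
        }
    return results


def stub_fit(outcome, variables):
    # Placeholder standing in for a real sampler call
    return "model(" + build_formula(outcome, variables) + ")", None


candidates = {
    "demos_only": ["age", "sex"],
    "demos_visit": ["age", "sex", "stay_length", "n_transfers"],
}
fits = fit_candidates(candidates, stub_fit)
```

Keeping the fitter as an argument means the same loop can be dry-run with a stub before spending time on sampling.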

Then need utility functions for:
1. Plotting chosen diagnostics.
2. Doing model comparisons using WAIC/LOO.
3. Measuring accuracy against the validation set. This will require refitting the model & using sample_ppc w/ shared theano variables to predict on the validation set.
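
For item 3, the quantity of interest is a predicted probability per validation row, averaged over posterior draws. The sketch below computes that by hand with NumPy as a stand-in for the sample_ppc route; it averages probabilities and does not sample outcomes. The function name, the array shapes, and the synthetic "draws" are assumptions, not output from a fitted model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def posterior_mean_prob(intercept, coefs, X_new):
    """Average the logistic probability over posterior draws.

    intercept: shape (S,), coefs: shape (S, P), X_new: shape (N, P).
    Returns an array of shape (N,).
    """
    logits = intercept[:, None] + coefs @ X_new.T  # shape (S, N)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs.mean(axis=0)


# Synthetic stand-ins for a trace and a validation set (not real data)
rng = np.random.default_rng(1)
X_valid = rng.normal(size=(200, 2))
true_coefs = np.array([2.0, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + X_valid @ true_coefs)))
y_valid = rng.binomial(1, p_true)

# Fake "posterior draws" scattered around the generating values
intercept_draws = -0.5 + rng.normal(scale=0.1, size=500)
coef_draws = true_coefs + rng.normal(scale=0.1, size=(500, 2))

p_hat = posterior_mean_prob(intercept_draws, coef_draws, X_valid)
auc = roc_auc_score(y_valid, p_hat)
```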

FINALLY, take the best model that comes out of the above, refit it on the combined train + validation sets, and use it to predict the test set as a final check of accuracy.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pymc3 as pm
sns.set()

from sklearn import preprocessing, model_selection
from sklearn.metrics import make_scorer, confusion_matrix, f1_score, roc_auc_score

In [2]:
import warnings
warnings.filterwarnings("ignore", module="mkl_fft")
warnings.filterwarnings("ignore", module="matplotlib")

In [65]:
df = pd.read_csv('../data/merged.csv', parse_dates=['admit_date','discharge_date','dob','dod'])

In [4]:
df.head()

| Unnamed: 0 | ruid | visit_id | admit_date | discharge_date | stay_length | n_transfers | readmit_time | readmit_30d | sex | dob | ... | pcv | plt-ct | systolic | diastolic | bmi | pregnancy_indicator | egfr | age | total_encounters | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50135262 | 0 | 2007-02-08 | 2007-02-12 | 4 | 2 | 172 days 00:00:00.000000000 | 0 | F | 1949-09-20 | ... | 32.0 | 334.0 | 140.0 | 58.0 | 44.71 | 0.0 | 123.67783 | 57.385352 | 10 | train |
| 1 | 50135262 | 1 | 2007-08-03 | 2007-08-06 | 3 | 3 | 22 days 00:00:00.000000000 | 1 | F | 1949-09-20 | ... | 39.0 | 291.5 | 121.0 | 61.0 | 45.025 | 0.0 | 89.505 | 57.867214 | 10 | train |
| 2 | 50135262 | 2 | 2007-08-28 | 2007-08-29 | 1 | 1 | 179 days 00:00:00.000000000 | 0 | F | 1949-09-20 | ... | 38.0 | 308.0 | 131.0 | 60.0 | 46.23 | 0.0 | 107.45 | 57.935661 | 10 | train |
| 3 | 50135262 | 3 | 2008-02-24 | 2008-02-28 | 4 | 2 | 44 days 00:00:00.000000000 | 0 | F | 1949-09-20 | ... | 38.0 | 274.0 | 151.0 | 74.0 | 47.14 | 0.0 | 73.01077 | 58.428474 | 10 | train |
| 4 | 50135262 | 4 | 2008-04-12 | 2008-04-13 | 1 | 1 | 928 days 00:00:00.000000000 | 0 | F | 1949-09-20 | ... | 36.0 | 330.0 | 134.0 | 66.0 | 47.36 | 0.0 | 84.358415 | 58.55989 | 10 | train |


In [5]:
df.group.value_counts()

train    12912
test      4229
valid     3992
Name: group, dtype: int64

In [66]:
df.sex = (df.sex == 'M')*1
df.drop('race',axis=1, inplace = True) # can't use this; there are just too few subjects of some races for it to be meaningful

In [67]:
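# " |," is a regex (space or comma); regex=True keeps that behaviour on newer pandas, where the default is literal matching
df.columns = df.columns.str.replace(" |,", "", regex=True).str.replace("-", "_", regex=False)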
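
To make the effect of the column clean-up concrete, here is a small sketch on made-up column names (only `plt-ct` appears in the data above; the other two are hypothetical). Presumably the point is to get names that can be dropped straight into a model formula, where spaces, commas, and hyphens would cause trouble.

```python
import pandas as pd

# Made-up column names for illustration
cols = pd.Index(["plt-ct", "lab_white cell, total", "icd dx_1"])

clean = (
    cols.str.replace(r"[ ,]", "", regex=True)   # drop spaces and commas
        .str.replace("-", "_", regex=False)     # hyphen -> underscore
)
```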

In [69]:
id_vars = list(df.columns[0:3]) + ['total_encounters','group','dob','dod']
meds = list(df.columns[df.columns.str.contains('med_')])
cpts = list(df.columns[df.columns.str.contains('cpt_')])
dx = list(df.columns[df.columns.str.contains('icd_dx')])
labs = list(df.columns[df.columns.str.contains('lab_')])
demos = ['age','sex']
outcomes = ['readmit_time','readmit_30d']
visit = cpts + ['n_transfers','stay_length','icd_proc','icd_visit']
final_var = demos + dx + labs + visit # drop meds; can't use them
trans_var = ['age'] + labs

In [60]:
meds_desc = df[meds].describe()
visit_desc = df[visit].describe()
dx_desc = df[dx].describe()
labs_desc = df[labs].describe()
demos_desc = df[demos].describe()

In [70]:
X = df[['readmit_30d'] + final_var + ['group']].copy() # copy so later column assignments don't trigger SettingWithCopyWarning

In [70]:
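# StandardScaler lives in sklearn.preprocessing, imported above as `preprocessing`
scaler = preprocessing.StandardScaler()
# trans_var (age + labs) is the list defined earlier; fit_transform is a method call
X[trans_var] = scaler.fit_transform(X[trans_var])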
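
The cell above fits the scaler on all rows, so validation and test rows contribute to the means and standard deviations. If that leakage matters, one alternative is to fit on the training rows only and apply the result everywhere. This is a sketch of that alternative, not what the notebook does; `scale_by_train` and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_by_train(data, cols, group_col="group", train_label="train"):
    """Standardize `cols` using statistics from the training rows only."""
    out = data.copy()
    out[cols] = out[cols].astype(float)
    is_train = out[group_col] == train_label
    scaler = StandardScaler().fit(out.loc[is_train, cols])
    out[cols] = scaler.transform(out[cols])
    return out, scaler


# Toy frame (not real data): train mean is 50, train std is 10
demo = pd.DataFrame({
    "age": [40.0, 60.0, 80.0, 100.0],
    "group": ["train", "train", "valid", "test"],
})
scaled, fitted_scaler = scale_by_train(demo, ["age"])
```

Returning the fitted scaler makes it possible to reuse the same statistics on any later data.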

In [71]:
train = X[X.group=='train'].copy()
train.drop(columns='group', inplace=True)

valid = X[X.group=='valid'].copy()
valid.drop(columns='group', inplace=True)

test = X[X.group=='test'].copy()
test.drop(columns='group', inplace=True)

assert(X.shape[0]==(train.shape[0] + valid.shape[0] + test.shape[0]))

In [72]:
train_red = train.dropna()
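
Before relying on a complete-case frame, it may be worth checking how much data dropna() removes and which columns drive the loss. A minimal sketch, with a hypothetical helper `missing_report` and a toy frame:

```python
import numpy as np
import pandas as pd


def missing_report(data):
    """Return per-column missing fractions and the share of complete rows."""
    per_col = data.isna().mean().sort_values(ascending=False)
    complete_share = float(data.notna().all(axis=1).mean())
    return per_col, complete_share


# Toy frame (not real data)
demo = pd.DataFrame({
    "age": [50.0, 61.0, 72.0, 48.0],
    "bmi": [np.nan, 30.1, np.nan, 27.5],
    "egfr": [90.0, np.nan, 75.0, 88.0],
})
per_col, complete_share = missing_report(demo)
```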

In [104]:
def model_gen(outcome = 'readmit_30d', variables = final_var, data = train, draws = 1000):
    """Fit a binomial (logistic) GLM with PyMC3 and return (model, trace)."""
    # Formula string, e.g. 'readmit_30d ~ age + sex + ...'
    formula = outcome + ' ~ ' + ' + '.join(variables)
    # Binomial family for the 0/1 outcome
    family = pm.glm.families.Binomial()

    with pm.Model() as model:
        pm.glm.GLM.from_formula(formula, data, family = family)

        # Use the MAP estimate as the sampler's starting point
        start = pm.find_MAP()

        trace = pm.sample(draws, start = start)

    return model, trace
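
For the later comparison step, PyMC3's own pm.waic and pm.loo would be the first choice. The hand-rolled version below is only a sketch of what WAIC computes, assuming a pointwise log-likelihood matrix of shape (draws, observations) has already been extracted; the function names and the toy inputs are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp


def bernoulli_loglik(y, probs):
    """Pointwise Bernoulli log-likelihood; probs has shape (draws, obs)."""
    y = np.asarray(y)
    probs = np.clip(probs, 1e-12, 1 - 1e-12)
    return y * np.log(probs) + (1 - y) * np.log1p(-probs)


def waic_from_loglik(loglik):
    """Return (WAIC on the deviance scale, effective parameter count)."""
    loglik = np.asarray(loglik, dtype=float)
    n_draws = loglik.shape[0]
    # Log pointwise predictive density: log of the mean likelihood over draws
    lppd = logsumexp(loglik, axis=0) - np.log(n_draws)
    # Penalty: variance of the log-likelihood over draws, per observation
    p_waic = loglik.var(axis=0, ddof=1)
    elpd = np.sum(lppd - p_waic)
    return -2.0 * elpd, float(np.sum(p_waic))


# Toy check: identical draws give zero penalty
y = np.array([1, 0, 1])
probs = np.tile(np.array([0.8, 0.3, 0.6]), (4, 1))
waic, p_eff = waic_from_loglik(bernoulli_loglik(y, probs))
```

Lower WAIC on this scale indicates better estimated out-of-sample fit, and the penalty term grows as the draws disagree more about individual observations.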