# Report Machine Learning in Health Care - Project 2

## Introduction

In this jupyter notebook we are analysing the Diabetes 130-US hospitals for years 1999-2008 data set. The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals. It includes over 50 features representing patient and hospital outcomes.

## Methods

We started with all features and categorised them manually into text features (for NLP), irrelevant features, categorical features and numeric features. This was done based on manual inspection of the data set. We observed that many features have missing values which we all replaced with NaN. Furthermore we observed that some numeric features also have special entries in rare cases. For example the otherwise numeric feature "diag_3" has the some entries like "V15". Such values were replaced with NaN as well. 

Once all features were casted to the right type we imputed the missing values in the numeric features. For categorical features we regarded missing values as belonging to a nan class. All categorical features were then replaced by one hot encodings.

The we used different approaches to classify the patients.
* No NLP: This basic approch disregarded text features and simply used numeric and categorical features
* NLP: Here we implemented different approaches as discussed in the tutorial. We filtered for stop words, used ngram (from 1, i.e Bag of Words) up to trigrams, used stemming and Tfidf transforms of the count matrix.

The processed features were then used to train different classifiers like random forest classifiers, AdaBoost and SVM. We performed a 5-fold crossvalidation with the training and validation set before selecting the most promising parameters and evaluating it on the test set. For the parameter search we performed a grid serach with various parameter scans (see 'Run grid search'). 

## Results

### No NLP

In our first approach we disregarded the text features and only used numeric and categorical features. We observed an CV-accuracy of around 59%. The grid search did not improve this accurecy score by much.

### NLP

The NLP approach was more exhaustive since we were playing around with differend processing pipelines.
* Filtering stop words or not
* Using raw count or Tfidf
* Stemming or not
* Degree of n-grams
* Pooling the different text features for each patient into one string or processing them separately
After introducing NLP processing steps we observed CV-accuracy scores of around 64% for various parameter settings.

### Best model

Our best model was a random forest model with stop word removal, stemming, tfidf of 2-grams. We observed a accuracy of 0.6195 on the test set and 

## Discussion

Using the NLP features did not provide a significant improvement of the accuracy score even though we used many different NLP and classifier settings. Furthermore the NLP processing introduced many features, especially when using n-grams > 1.

In [11]:
import numpy as np
import pandas as pd

In [12]:
df_train = pd.read_csv("10k_diabetes/diab_train.csv",
                       na_values = ["?", "Not Available", "Not Mapped"])
df_test = pd.read_csv("10k_diabetes/diab_test.csv")
df_validate = pd.read_csv("10k_diabetes/diab_validation.csv")

In [13]:
print(df_train.shape)
print(df_train.dtypes)

(6000, 52)
Unnamed: 0                   int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id           object
discharge_disposition_id    object
admission_source_id         object
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride                 object
acetohexa

In [14]:
type_txt = ["diag_1_desc",
           "diag_2_desc",
           "diag_3_desc"]

type_drop = ["discharge_disposition_id",
           "medical_specialty",]

type_cat = ["race",
          "gender",
          "age",
          "weight",
          "admission_type_id",
        "admission_source_id",
        "payer_code",
          "max_glu_serum",
           "A1Cresult",
           "metformin",
           "repaglinide",
           "nateglinide",
           "chlorpropamide",
           "glimepiride",
           "acetohexamide",
           "glipizide",
           "glyburide",
           "tolbutamide",
           "pioglitazone",
           "rosiglitazone",
           "acarbose",
           "miglitol",
           "troglitazone",
           "tolazamide",
           "examide",
           "citoglipton",
           "insulin",
           "glyburide.metformin",
           "glipizide.metformin",
           "glimepiride.pioglitazone",
           "metformin.rosiglitazone",
           "metformin.pioglitazone",
           "change",
           "diabetesMed"]

type_le = ["age", "weight", "A1Cresult"]

type_int = ["time_in_hospital",
           "num_lab_procedures",
           "num_procedures",
           "num_medications",
           "number_outpatient",
           "number_emergency",
           "number_inpatient",
           "number_diagnoses"]

type_float = ["diag_1",
             "diag_2",
             "diag_3"]

In [15]:
def prep_df(df, pool=True):
    y = df["readmitted"]
    df = df.drop(columns=['readmitted', 'Unnamed: 0'])
    df = df.drop(columns=type_drop)
    
    #Convert data types
    for i in type_int:
        #df_train[i] = df_train[i].astype('int32')
        df[i] = pd.to_numeric(df[i], errors='coerce', downcast='integer')

    for i in type_float:
        df[i] = pd.to_numeric(df[i], errors='coerce', downcast='float')
        
    for i in type_txt:
        df[i] = df[i].astype('str')
        df[i] = df[i].str.lower()
        
    for i in type_cat:
        df[i] = df[i].astype('str')
        df[i] = df[i].str.lower()
     
    if pool:
        #Combine descriptions
        tmp = df[type_txt[0]] + " " + df[type_txt[1]] + " " + df[type_txt[2]]
        tmp = pd.DataFrame({'description':tmp})
        df = pd.concat([tmp, df], axis = 1)
        df = df.drop(columns = type_txt)
    
    return df, y

In [26]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

def get_transforms(df, impute=True, imp="mean"):
    ll = type_txt.copy()
    ll.append('description')
    
    #Get features that are categorical and create oh encoding
    ohe_mask = df.dtypes==object
    txt_mask = [i not in ll for i in df.columns]
    mask = [i == True and j == True for i,j in zip(txt_mask, ohe_mask.tolist())]
    col_mask = df.columns[mask]
    
    #Generate OneHotEncoder
    ohe = [OneHotEncoder().fit(df[i].values.reshape(-1,1)) for i in col_mask]
    enc = [ohe[i].transform(df[name].values.reshape(-1,1)).toarray() for i,name in enumerate(col_mask)]
    
    #Concat transformed features
    tmp = np.concatenate(enc, axis = 1)
    tmp = pd.DataFrame(tmp, columns = ['ohe' + str(i) for i in range(tmp.shape[1])])
    
    #Append to dataframe and drop categorical features that have been transformed
    df = pd.concat([df.reset_index(drop=True), tmp.reset_index(drop=True)], axis = 1)
    df = df.drop(columns=col_mask)
    
    #Impute missing values
    if impute:
        idx = pd.isnull(df).any().tolist()
        print("Impute values for the following attributes")
        print(df.columns[idx])

        df_imp = SimpleImputer(strategy=imp).fit_transform(df.loc[:,idx])
        df.loc[:,idx] = df_imp
        
    return df

In [17]:
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

def tokenize(x,stem = False):
    rm_word = stopwords.words('english')
    rm_word.extend(',')
    
    if stem:
        stemmer = PorterStemmer()
        return [stemmer.stem(i) for i in list(filter(lambda x: x not in rm_word, word_tokenize(x)))]
    else:
        return list(filter(lambda x: x not in rm_word, word_tokenize(x)))

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def get_nlp(df, n_gram_range=(1,1), stem=False, tfidf=False):
    
    #NLP 
    tfidfTransformer = TfidfTransformer()
    count_vect = CountVectorizer(ngram_range=n_gram_range)
    
    if 'description' in df.columns:
        #Filter description and possibly stem
        df['description'] = df['description'].apply(lambda txt: ' '.join(tokenize(txt, stem=stem)))

        X_transformed = count_vect.fit_transform(df['description'].values)

        if tfidf:
            X_transformed = tfidfTransformer.fit_transform(X_transformed)

        tmp = pd.DataFrame(X_transformed.A, columns = ['nlp' + str(i) for i in range(X_transformed.shape[1])])    
        df = pd.concat([df.reset_index(drop=True), tmp.reset_index(drop=True)], axis = 1)
        df = df.drop(columns=['description'])
    
    else:
        for i in type_txt:
            #Filter description and possibly stem
            df[i] = df[i].apply(lambda txt: ' '.join(tokenize(txt, stem=stem)))

            X_transformed = count_vect.fit_transform(df[i].values)

            if tfidf:
                X_transformed = tfidfTransformer.fit_transform(X_transformed)

            tmp = pd.DataFrame(X_transformed.A, columns = [str(i) for i in range(X_transformed.shape[1])])    
            df = pd.concat([df.reset_index(drop=True), tmp.reset_index(drop=True)], axis = 1)
            df = df.drop(columns=[i])
    
    return df

In [19]:
def split_df(DF,df,y):
    idx = np.cumsum([i.shape[0] for i in DF])
    X_train = df.iloc[0:idx[0],]
    y_train = y.iloc[0:idx[0],]
    
    X_val = df.iloc[idx[0]:idx[1],]
    y_val = y.iloc[idx[0]:idx[1],]
    
    X_test = df.iloc[idx[1]:idx[2],]
    y_test = y.iloc[idx[1]:idx[2],]
    
    return X_train, y_train, X_val, y_val, X_test, y_test

In [35]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import ParameterGrid

param_dict = {"RandomForest": {
                             'max_depth': [None, 10, 50],
                             'min_samples_split': [2, 10],
                             'nlp': [True],
                             'pool':[True, False],
                             'stem': [True],
                             'tfidf':[True],
                             'n_gram_range': [None, (1,1), (2,2)]},
              "AdaBoost": {'n_estimators': [50, 100],
                           'learning_rate':[0.1, 1.0],
                           'nlp': [True],
                           'pool':[True, False],
                           'stem': [True],
                           'tfidf':[True],
                           'n_gram_range': [None, (1,1), (2,2)]}
             }

param_grid_rf = list(ParameterGrid(param_dict['RandomForest']))
param_grid_ab = list(ParameterGrid(param_dict['AdaBoost']))

In [37]:
from sklearn.model_selection import KFold

#Run grid search
res_rf = list()
n_fold = 5
DF = [df_train, df_validate, df_test]

# RandomForest
for param in param_grid_rf:

    if not param['nlp'] and param['stem']:
        #Skip this parameter setting since unreasonable
        continue
        
    if not param['nlp'] and not param['n_gram_range'] == None:
        #Skip this parameter setting since unreasonable
        continue
        
    if not param['nlp'] and param['pool'] == True:
        #Skip this parameter setting since unreasonable
        continue
        
    print("Model params:", param)
        
    df = pd.concat(DF, axis=0)    
    df,y = prep_df(df, pool=param['pool'])
    df = get_transforms(df, impute=True, imp="mean")
    
    if param['nlp']:
        df = get_nlp(df, n_gram_range=(1,1), stem=param['stem'], tfidf=param['tfidf'])
    else:
        if param['pool']:
            df = df.drop(columns=['description'])
        else:
            df = df.drop(columns=type_txt)
            
    X_train, y_train, X_val, y_val, X_test, y_test = split_df(DF,df,y)
    
    #CV over grid search
    acc_cv = []
    for train, val in zip(KFold(n_splits=n_fold).split(X_train), KFold(n_splits=n_fold).split(X_val)):
        idx_train,_ = train
        idx_val,_ = val
        
        clf = RandomForestClassifier(max_depth=param['max_depth'], min_samples_split=param['min_samples_split'])
        clf.fit(X_train.values[idx_train],y_train[idx_train])
        pred_test = clf.predict(X_val.values[idx_val])
        acc_cv.append(accuracy_score(y_val[idx_val], pred_test))
    
    acc = np.mean(acc_cv)
    acc_sd = np.std(acc_cv)
    print('Mean: {}, Std: {}'.format(acc, acc_sd))
    res_rf.append((acc, acc_sd, param))

Model params: {'max_depth': None, 'min_samples_split': 2, 'n_gram_range': None, 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5967499999999999, Std: 0.007909646009778198
Model params: {'max_depth': None, 'min_samples_split': 2, 'n_gram_range': None, 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.57925, Std: 0.0072715197861244665
Model params: {'max_depth': None, 'min_samples_split': 2, 'n_gram_range': (1, 1), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5900000000000001, Std: 0.013139872526017912
Model params: {'max_depth': None, 'min_samples_split': 2, 'n_gram_range': (1, 1), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.598625, Std: 0.008342286856731803
Model params: {'max_depth': None, 'min_samples_split': 2, 'n_gram_range': (2, 2), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.58525, Std: 0.0045859840819610686
Model params: {'max_depth': None, 'min_samples_split': 2, 'n_gram_range': (2, 2), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.592875, Std: 0.01667989058717114
Model params: {'max_depth': None, 'min_samples_split': 10, 'n_gram_range': None, 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.585625, Std: 0.017997395644925966
Model params: {'max_depth': None, 'min_samples_split': 10, 'n_gram_range': None, 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.6014999999999999, Std: 0.01247246968326644
Model params: {'max_depth': None, 'min_samples_split': 10, 'n_gram_range': (1, 1), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.578125, Std: 0.013020416659999779
Model params: {'max_depth': None, 'min_samples_split': 10, 'n_gram_range': (1, 1), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.595375, Std: 0.022031937046024797
Model params: {'max_depth': None, 'min_samples_split': 10, 'n_gram_range': (2, 2), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5773750000000001, Std: 0.01688842058926765
Model params: {'max_depth': None, 'min_samples_split': 10, 'n_gram_range': (2, 2), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5825, Std: 0.027397536385594983
Model params: {'max_depth': 10, 'min_samples_split': 2, 'n_gram_range': None, 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.6172500000000001, Std: 0.006876136269737552
Model params: {'max_depth': 10, 'min_samples_split': 2, 'n_gram_range': None, 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.6050000000000001, Std: 0.018787462574813032
Model params: {'max_depth': 10, 'min_samples_split': 2, 'n_gram_range': (1, 1), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.606375, Std: 0.021536306554281775
Model params: {'max_depth': 10, 'min_samples_split': 2, 'n_gram_range': (1, 1), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.603, Std: 0.03697423089125724
Model params: {'max_depth': 10, 'min_samples_split': 2, 'n_gram_range': (2, 2), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.604, Std: 0.020545680811304343
Model params: {'max_depth': 10, 'min_samples_split': 2, 'n_gram_range': (2, 2), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.6243749999999999, Std: 0.01311606838957465
Model params: {'max_depth': 10, 'min_samples_split': 10, 'n_gram_range': None, 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.604375, Std: 0.009536115561380306
Model params: {'max_depth': 10, 'min_samples_split': 10, 'n_gram_range': None, 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.6118750000000001, Std: 0.025501225460749913
Model params: {'max_depth': 10, 'min_samples_split': 10, 'n_gram_range': (1, 1), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.618875, Std: 0.007929375763576869
Model params: {'max_depth': 10, 'min_samples_split': 10, 'n_gram_range': (1, 1), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.61025, Std: 0.01325589491509344
Model params: {'max_depth': 10, 'min_samples_split': 10, 'n_gram_range': (2, 2), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.615125, Std: 0.022449248317037256
Model params: {'max_depth': 10, 'min_samples_split': 10, 'n_gram_range': (2, 2), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.617, Std: 0.012458681711962948
Model params: {'max_depth': 50, 'min_samples_split': 2, 'n_gram_range': None, 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.586, Std: 0.014888124462134248
Model params: {'max_depth': 50, 'min_samples_split': 2, 'n_gram_range': None, 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5900000000000001, Std: 0.01641264756216985
Model params: {'max_depth': 50, 'min_samples_split': 2, 'n_gram_range': (1, 1), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5737500000000001, Std: 0.013732943966972267
Model params: {'max_depth': 50, 'min_samples_split': 2, 'n_gram_range': (1, 1), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.57025, Std: 0.019512015528899106
Model params: {'max_depth': 50, 'min_samples_split': 2, 'n_gram_range': (2, 2), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.584375, Std: 0.016737868741270506
Model params: {'max_depth': 50, 'min_samples_split': 2, 'n_gram_range': (2, 2), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.574125, Std: 0.020716086744363683
Model params: {'max_depth': 50, 'min_samples_split': 10, 'n_gram_range': None, 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5855, Std: 0.014558073361540658
Model params: {'max_depth': 50, 'min_samples_split': 10, 'n_gram_range': None, 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.59975, Std: 0.008068069781552474
Model params: {'max_depth': 50, 'min_samples_split': 10, 'n_gram_range': (1, 1), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.58275, Std: 0.014260960696951663
Model params: {'max_depth': 50, 'min_samples_split': 10, 'n_gram_range': (1, 1), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5962500000000001, Std: 0.011504075364843531
Model params: {'max_depth': 50, 'min_samples_split': 10, 'n_gram_range': (2, 2), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.57325, Std: 0.016280740155164945
Model params: {'max_depth': 50, 'min_samples_split': 10, 'n_gram_range': (2, 2), 'nlp': True, 'pool': False, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')




Mean: 0.5945, Std: 0.011280514172678472


In [39]:
from sklearn.model_selection import KFold

#Run grid search
res_ab = list()
n_fold = 5
DF = [df_train, df_validate, df_test]

# RandomForest
for param in param_grid_ab:

    if not param['nlp'] and param['stem']:
        #Skip this parameter setting since unreasonable
        continue
        
    if not param['nlp'] and not param['n_gram_range'] == None:
        #Skip this parameter setting since unreasonable
        continue
        
    if not param['nlp'] and param['pool'] == True:
        #Skip this parameter setting since unreasonable
        continue
        
    print("Model params:", param)
        
    df = pd.concat(DF, axis = 0)    
    df,y = prep_df(df, pool=param['pool'])
    df = get_transforms(df, impute=True, imp="mean")

    if param['nlp']:
        df = get_nlp(df, n_gram_range=(1,1), stem=True, tfidf=True)
    else:
        if param['pool']:
            df = df.drop(columns=['description'])
        else:
            df = df.drop(columns=[type_txt])
            
    X_train, y_train, X_val, y_val, X_test, y_test = split_df(DF,df,y)
    
    #CV over grid search
    acc_cv = []
    for train, val in zip(KFold(n_splits=n_fold).split(X_train), KFold(n_splits=n_fold).split(X_val)):
        idx_train,_ = train
        idx_val,_ = val
        
        clf = AdaBoostClassifier(n_estimators=param['n_estimators'], learning_rate=param['learning_rate'])
        clf.fit(X_train.values[idx_train],y_train[idx_train])
        pred_test = clf.predict(X_val.values[idx_val])
        acc_cv.append(accuracy_score(y_val[idx_val], pred_test))
    
    acc = np.mean(acc_cv)
    acc_sd = np.std(acc_cv)
    print('Mean: {}, Std: {}'.format(acc, acc_sd))
    res_ab.append((acc, acc_sd, param))

Model params: {'learning_rate': 0.1, 'n_estimators': 50, 'n_gram_range': None, 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')
Mean: 0.6232500000000001, Std: 0.014461154863979538
Model params: {'learning_rate': 0.1, 'n_estimators': 50, 'n_gram_range': None, 'nlp': True, 'pool

Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')
Mean: 0.60625, Std: 0.016630168369562597
Model params: {'learning_rate': 0.1, 'n_estimators': 100, 'n_gram_range': (2, 2), 'nlp': True, 'pool': True, 'stem': True, 'tfidf': True}
Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')
Mean: 0.

KeyboardInterrupt: 

In [46]:
best_model = 35
param_best_model = res_rf[35][2]
param_best_model

df = pd.concat(DF, axis = 0)    
df,y = prep_df(df, pool=False)
df = get_transforms(df, impute=True, imp="mean")
df = get_nlp(df, n_gram_range=(2,2), stem=True, tfidf=True)
X_train, y_train, X_val, y_val, X_test, y_test = split_df(DF,df,y)

Index(['race', 'gender', 'age', 'weight', 'admission_type_id',
       'admission_source_id', 'payer_code', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide.metformin', 'glipizide.metformin',
       'glimepiride.pioglitazone', 'metformin.rosiglitazone',
       'metformin.pioglitazone', 'change', 'diabetesMed'],
      dtype='object')
Impute values for the following attributes
Index(['diag_1', 'diag_2', 'diag_3'], dtype='object')


In [48]:
X = np.concatenate((X_train, X_val), axis = 0)
y = np.concatenate((y_train, y_val), axis = 0)

In [52]:
clf = RandomForestClassifier(max_depth=50, min_samples_split=10)
clf.fit(X, y)

pred_test = clf.predict(X_test.values)
acc_test = accuracy_score(y_test, pred_test)
con_test = confusion_matrix(y_test, pred_test)



In [53]:
acc_test

0.6195

In [54]:
con_test

array([[967, 240],
       [521, 272]])