# Diabetes Classification - Models
[Jose R. Zapata](https://joserzapata.github.io)
- https://joserzapata.github.io
- https://twitter.com/joserzapata
- https://www.linkedin.com/in/jose-ricardo-zapata-gonzalez/       


## Introduction


This notebook analyzes factors related to hospital readmission, together with other patient outcomes, in order to classify the outcome of a patient's hospital stay.

There are 3 possible outcomes:

1. No readmission.

2. A readmission in less than `30` days (a bad outcome, because the treatment may not have been appropriate).

3. A readmission in more than 30 days (also not a good outcome, although the cause could be the patient's condition rather than the treatment).
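As an illustration of how these three outcomes could be turned into a model target, here is a minimal sketch. It assumes the label strings used in the UCI dataset (`NO`, `<30`, `>30`) and a hypothetical binary target that flags only readmissions within 30 days; how the processed file loaded later in this notebook was actually derived is not shown here.

```python
import pandas as pd

# Toy values using the label strings of the UCI dataset ("NO", "<30", ">30").
raw = pd.Series(["NO", "<30", ">30", "NO", "<30"], name="readmitted")

# Hypothetical binary target (an assumption, not the notebook's own preprocessing):
# 1 only for a readmission in less than 30 days, 0 otherwise.
binary_target = (raw == "<30").astype(int)

print(binary_target.tolist())  # [0, 1, 0, 0, 1]
```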


## Main Objective

> **How effective was the treatment received in hospital?**

## Principal References

### Paper

Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

https://www.hindawi.com/journals/bmri/2014/781670/

### Dataset

https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#

### Data description

https://www.hindawi.com/journals/bmri/2014/781670/tab1/

# Import libraries

In [54]:
import pandas as pd

from sklearn.model_selection import train_test_split

# Load Dataset

In [24]:
data = pd.read_csv("../data/processed/data_imbalanced.csv")

In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53214 entries, 0 to 53213
Data columns (total 43 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   race                      53214 non-null  object
 1   gender                    53214 non-null  int64 
 2   age                       53214 non-null  object
 3   admission_type_id         53214 non-null  int64 
 4   discharge_disposition_id  53214 non-null  int64 
 5   admission_source_id       53214 non-null  int64 
 6   time_in_hospital          53214 non-null  int64 
 7   medical_specialty         53214 non-null  object
 8   num_lab_procedures        53214 non-null  int64 
 9   num_procedures            53214 non-null  int64 
 10  num_medications           53214 non-null  int64 
 11  number_outpatient         53214 non-null  int64 
 12  number_emergency          53214 non-null  int64 
 13  number_inpatient          53214 non-null  int64 
 14  diag_1                

## Encode categorical columns

In [26]:
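# One-hot encode every object (string) column;
# drop_first=True drops one redundant dummy column per feature
cat_cols = list(data.select_dtypes('object').columns)
for col in cat_cols:
    dummies = pd.get_dummies(data[col], prefix=col, drop_first=True)
    data = pd.concat([data.drop(col, axis=1), dummies], axis=1)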
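To make the effect of the encoding step concrete, the following sketch applies one-hot encoding with `drop_first=True` to a tiny made-up frame. The toy values are assumptions for illustration only. It uses the `columns=` argument of `pd.get_dummies`, which should yield the same set of dummy columns as a per-column loop.

```python
import pandas as pd

# Toy frame (illustrative values only): one categorical and one numeric column.
toy = pd.DataFrame({
    "race": pd.Series(
        ["Caucasian", "AfricanAmerican", "Asian", "Caucasian"], dtype="object"
    ),
    "time_in_hospital": [3, 5, 2, 7],
})

cat_cols = list(toy.select_dtypes("object").columns)

# Encode all categorical columns at once; the first category of each
# column (alphabetically) is dropped and becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)

print(encoded.columns.tolist())
# ['time_in_hospital', 'race_Asian', 'race_Caucasian']
```

A row with both dummies equal to zero corresponds to the dropped category (`AfricanAmerican` in this toy example).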

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53214 entries, 0 to 53213
Data columns (total 85 columns):
 #   Column                                             Non-Null Count  Dtype
---  ------                                             --------------  -----
 0   gender                                             53214 non-null  int64
 1   admission_type_id                                  53214 non-null  int64
 2   discharge_disposition_id                           53214 non-null  int64
 3   admission_source_id                                53214 non-null  int64
 4   time_in_hospital                                   53214 non-null  int64
 5   num_lab_procedures                                 53214 non-null  int64
 6   num_procedures                                     53214 non-null  int64
 7   num_medications                                    53214 non-null  int64
 8   number_outpatient                                  53214 non-null  int64
 9   number_emergency            

## Input and output data

Separate the input features (`X`) from the output feature (`y`).

In [28]:
X = data.drop("readmitted", axis=1)
y = data["readmitted"]

In [29]:
X.shape

(53214, 84)

In [30]:
y.shape

(53214,)

## Split dataset 70/30

- 70% to train the model
- 30% to test the model

The split is stratified (`stratify=y`) because the dataset is imbalanced, so the train and test sets keep the same proportion of positives.

In [31]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)

In [32]:
perc_test = y_test.value_counts()[1]*100/len(y_test)
perc_train = y_train.value_counts()[1]*100/len(y_train)

print(f" Percentage of positives in test set={perc_test} , train set={perc_train}")

Percentage of positives in test set=4.046351393673661 , train set=4.0457461945287125
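The nearly identical percentages above are what a stratified split is expected to produce. The sketch below reproduces the behavior on synthetic labels with 4% positives; the arrays `X_demo` and `y_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 40 positives out of 1000 samples (4%).
y_demo = np.array([1] * 40 + [0] * 960)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123, stratify=y_demo
)

# With stratify, both splits keep the class proportion of the full data.
print(f"train positives: {y_tr.mean():.3f}, test positives: {y_te.mean():.3f}")
```

Without `stratify`, the share of positives in a small test set can drift noticeably from the overall rate, which makes recall estimates less stable.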


## Balance train data

In [33]:
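# SMOTE oversampling is left disabled here. Current imbalanced-learn releases use
# `sampling_strategy` and `fit_resample` (formerly `ratio` and `fit_sample`).
#from imblearn.over_sampling import SMOTE
#sm = SMOTE(random_state=27, sampling_strategy=1.0)
#X_train, y_train = sm.fit_resample(X_train, y_train)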
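Since the SMOTE step is commented out, here is a minimal sketch of a simpler balancing technique, random oversampling of the minority class with `sklearn.utils.resample`. This is a swapped-in alternative, not SMOTE, and the toy frame is an assumption for illustration. Whatever method is used, it should be applied to the training split only, so the test set keeps the real class distribution.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training data (illustrative): 5 positives and 95 negatives.
train = pd.DataFrame({
    "feature": range(100),
    "readmitted": [1] * 5 + [0] * 95,
})

majority = train[train["readmitted"] == 0]
minority = train[train["readmitted"] == 1]

# Sample minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=27
)

# Combine and shuffle the balanced training set.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=27)

counts = balanced["readmitted"].value_counts().to_dict()
print(counts)
```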

# Simple Baseline

In [34]:
#import lazypredict
#from lazypredict.Supervised import LazyClassifier

In [35]:
#clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
#models,predictions = clf.fit(X_train, X_test, y_train, y_test)

In [36]:
#models

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>ROC AUC</th>
      <th>F1 Score</th>
      <th>recall_score</th>
      <th>Time Taken</th>
    </tr>
    <tr>
      <th>Model</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>NearestCentroid</th>
      <td>0.63</td>
      <td>0.83</td>
      <td>0.49</td>
      <td>0.11</td>
    </tr>
    <tr>
      <th>DecisionTreeClassifier</th>
      <td>0.53</td>
      <td>0.92</td>
      <td>0.12</td>
      <td>0.28</td>
    </tr>
    <tr>
      <th>LinearDiscriminantAnalysis</th>
      <td>0.53</td>
      <td>0.94</td>
      <td>0.07</td>
      <td>0.23</td>
    </tr>
    <tr>
      <th>ExtraTreeClassifier</th>
      <td>0.52</td>
      <td>0.93</td>
      <td>0.08</td>
      <td>0.09</td>
    </tr>
    <tr>
      <th>LabelPropagation</th>
      <td>0.52</td>
      <td>0.93</td>
      <td>0.07</td>
      <td>60.74</td>
    </tr>
    <tr>
      <th>LabelSpreading</th>
      <td>0.52</td>
      <td>0.93</td>
      <td>0.06</td>
      <td>75.70</td>
    </tr>
    <tr>
      <th>XGBClassifier</th>
      <td>0.51</td>
      <td>0.94</td>
      <td>0.02</td>
      <td>45.12</td>
    </tr>
    <tr>
      <th>Perceptron</th>
      <td>0.51</td>
      <td>0.93</td>
      <td>0.04</td>
      <td>0.17</td>
    </tr>
    <tr>
      <th>CalibratedClassifierCV</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.01</td>
      <td>28.87</td>
    </tr>
    <tr>
      <th>DummyClassifier</th>
      <td>0.50</td>
      <td>0.92</td>
      <td>0.05</td>
      <td>0.06</td>
    </tr>
    <tr>
      <th>LogisticRegression</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.01</td>
      <td>0.25</td>
    </tr>
    <tr>
      <th>BaggingClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.01</td>
      <td>1.42</td>
    </tr>
    <tr>
      <th>KNeighborsClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.01</td>
      <td>38.69</td>
    </tr>
    <tr>
      <th>PassiveAggressiveClassifier</th>
      <td>0.50</td>
      <td>0.92</td>
      <td>0.04</td>
      <td>0.13</td>
    </tr>
    <tr>
      <th>LGBMClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>82.92</td>
    </tr>
    <tr>
      <th>LinearSVC</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>8.18</td>
    </tr>
    <tr>
      <th>RidgeClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>0.11</td>
    </tr>
    <tr>
      <th>RidgeClassifierCV</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>0.19</td>
    </tr>
    <tr>
      <th>AdaBoostClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>1.32</td>
    </tr>
    <tr>
      <th>GaussianNB</th>
      <td>0.50</td>
      <td>0.01</td>
      <td>1.00</td>
      <td>0.08</td>
    </tr>
    <tr>
      <th>QuadraticDiscriminantAnalysis</th>
      <td>0.50</td>
      <td>0.01</td>
      <td>1.00</td>
      <td>0.18</td>
    </tr>
    <tr>
      <th>CheckingClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>0.06</td>
    </tr>
    <tr>
      <th>RandomForestClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>3.30</td>
    </tr>
    <tr>
      <th>SGDClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>0.26</td>
    </tr>
    <tr>
      <th>SVC</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>145.44</td>
    </tr>
    <tr>
      <th>ExtraTreesClassifier</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>3.25</td>
    </tr>
    <tr>
      <th>BernoulliNB</th>
      <td>0.50</td>
      <td>0.94</td>
      <td>0.00</td>
      <td>0.10</td>
    </tr>
  </tbody>
</table>

# MODELS

## Evaluation Metric

The selected evaluation metric is
> **Recall**

because the cost of a False Negative is much higher than the cost of a False Positive.

A False Negative might delay further tests or treatment and put a life at stake, whereas a False Positive may just lead to additional tests or treatments, which is far less costly.
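Recall is defined as $TP / (TP + FN)$, so every missed positive lowers it directly. The sketch below, on made-up labels with 4% positives, shows why accuracy alone is misleading on imbalanced data: a model that never predicts a readmission scores high accuracy but zero recall, while a model that always predicts one has perfect recall and very low accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up ground truth: 4 positives out of 100 samples.
y_true = np.array([1] * 4 + [0] * 96)

# Two degenerate "models": always negative and always positive.
y_all_negative = np.zeros_like(y_true)
y_all_positive = np.ones_like(y_true)

acc_neg = accuracy_score(y_true, y_all_negative)  # 96 of 100 correct
rec_neg = recall_score(y_true, y_all_negative)    # 0 of 4 positives found
acc_pos = accuracy_score(y_true, y_all_positive)  # 4 of 100 correct
rec_pos = recall_score(y_true, y_all_positive)    # 4 of 4 positives found

print(f"always negative: accuracy={acc_neg:.2f}, recall={rec_neg:.2f}")
print(f"always positive: accuracy={acc_pos:.2f}, recall={rec_pos:.2f}")
```

This also suggests reading recall together with precision, since a recall close to 1.0 can come from a model that flags almost every patient.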


In [37]:
# Evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

Interpretability is important, so the first attempts use classic machine learning models: no ensembles and no neural networks.

In [38]:
# Selected algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

## Modeling Functions

In [39]:
# list of feature columns
FEATURES = data.drop("readmitted", axis=1).columns

In [40]:
# Results of every trained model, keyed by model name
result_dict = {}

def summarize_classification(y_test, y_pred):
    """Return accuracy, precision and recall for the positive class."""
    acc = accuracy_score(y_test, y_pred, normalize=True)
    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    return {'accuracy': acc,
            'precision': prec,
            'recall': recall}

In [41]:
def build_model(classifier_fn,                
                name_of_y_col, 
                names_of_x_cols, 
                dataset, 
                test_frac=0.2):
    
    X = dataset[names_of_x_cols]
    Y = dataset[name_of_y_col]

    #Split data stratify = y because the dataset is imbalanced
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac, stratify = Y)
       
    model = classifier_fn(x_train, y_train)
    
    y_pred = model.predict(x_test)

    y_pred_train = model.predict(x_train)
    
    train_summary = summarize_classification(y_train, y_pred_train)
    test_summary = summarize_classification(y_test, y_pred)
    
    pred_results = pd.DataFrame({'y_test': y_test,
                                 'y_pred': y_pred})
    
    model_crosstab = pd.crosstab(pred_results.y_pred, pred_results.y_test)
    
    return {'training': train_summary, 
            'test': test_summary,
            'confusion_matrix': model_crosstab}
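The confusion matrix returned above is a `pd.crosstab` with predictions as rows and true labels as columns. As a small sketch on made-up labels, this is how the four cells could be read and how recall and precision follow from them.

```python
import pandas as pd

# Made-up labels for illustration only.
pred_results = pd.DataFrame({
    "y_test": [0, 0, 0, 1, 1, 1, 0, 1],
    "y_pred": [0, 0, 1, 1, 0, 1, 0, 0],
})

# Rows are predicted labels, columns are true labels.
ct = pd.crosstab(pred_results.y_pred, pred_results.y_test)

tn = ct.loc[0, 0]  # predicted 0, truly 0
fn = ct.loc[0, 1]  # predicted 0, truly 1 (missed readmission)
fp = ct.loc[1, 0]  # predicted 1, truly 0
tp = ct.loc[1, 1]  # predicted 1, truly 1

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(ct)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```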

In [42]:
def compare_results():
    for key in result_dict:
        print('Classification: ', key)

        print()
        print('Training data')
        for score in result_dict[key]['training']:
            print(score, result_dict[key]['training'][score])

        print()
        print('Test data')
        for score in result_dict[key]['test']:
            print(score, result_dict[key]['test'][score])
       
        print()

## Training

#### Quadratic Discriminant Analysis


In [43]:
def quadratic_discriminant_fn(x_train, y_train):
    model = QuadraticDiscriminantAnalysis()
    model.fit(x_train, y_train)    
    return model

In [44]:
result_dict['quadratic_discriminant_analysis'] = build_model(quadratic_discriminant_fn, "readmitted",FEATURES, data)

#### Linear SVC

In [45]:
def linear_svc_fn(x_train, y_train, C=1.0, max_iter=1000, tol=1e-3):
    
    model = LinearSVC(C=C, max_iter=max_iter, tol=tol, dual=False)
    model.fit(x_train, y_train) 
    
    return model

In [46]:
result_dict['linear_svc'] = build_model(linear_svc_fn, "readmitted",FEATURES, data)

#### Decision Tree classifier

In [47]:
def decision_tree_fn(x_train, y_train, max_depth=None, max_features=None): 
    
    model = DecisionTreeClassifier(max_depth=max_depth, max_features=max_features)
    model.fit(x_train, y_train)
    
    return model

In [48]:
result_dict['decision_tree'] = build_model(decision_tree_fn, "readmitted",FEATURES, data)

#### Naïve Bayes

In [49]:
def naive_bayes_fn(x_train,y_train, priors=None):
    
    model = GaussianNB(priors=priors)
    model.fit(x_train, y_train)
    
    return model

In [50]:
result_dict['naive_bayes'] = build_model(naive_bayes_fn, "readmitted",FEATURES, data)

#### Logistic Regression

In [51]:
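def logistic_fn(x_train, y_train):
    
    model = LogisticRegression(solver='liblinear')
    model.fit(x_train, y_train)
    
    return model

# Train and store the results under the 'logistic' key used in the results below
result_dict['logistic'] = build_model(logistic_fn, "readmitted",FEATURES, data)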

In [65]:
final_result

| | quadratic_discriminant_analysis | linear_svc | decision_tree | naive_bayes | logistic |
|---|---|---|---|---|---|
| training | {'accuracy': 0.05076225599586573, 'precision':... | {'accuracy': 0.9595969086937117, 'precision': ... | {'accuracy': 1.0, 'precision': 1.0, 'recall': ... | {'accuracy': 0.1843273590002584, 'precision': ... | {'accuracy': 0.9591271053064293, 'precision': ... |
| test | {'accuracy': 0.048858404585173355, 'precision'... | {'accuracy': 0.9594099408061637, 'precision': ... | {'accuracy': 0.9160950859720004, 'precision': ... | {'accuracy': 0.1808700554354975, 'precision': ... | {'accuracy': 0.9584703561026027, 'precision': ... |
| confusion_matrix | y_test 0 1 y_pred 0 ... | y_test 0 1 y_pred 0 ... | y_test 0 1 y_pred 0 97... | y_test 0 1 y_pred 0 15... | y_test 0 1 y_pred 0 ... |


In [None]:
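# Alternative comparison: 10-fold cross-validation over a list of candidate models
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

seed = 123  # assumed value, matching the random_state of the split above
models = []
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # shuffle=True is required when random_state is set
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)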
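The cross-validation loop above scores models by accuracy with plain K-fold splits. Given that recall is the metric selected earlier and the classes are imbalanced, one possible variant is stratified folds scored by recall. The sketch below runs on synthetic data from `make_classification`; the data, the 5 folds and the seed are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (roughly 10% positives), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=123
)

# Stratified folds keep the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

scores = cross_val_score(
    LogisticRegression(solver="liblinear"),
    X_demo,
    y_demo,
    cv=cv,
    scoring="recall",  # recall of the positive class
)

print(f"recall: {scores.mean():.3f} ({scores.std():.3f})")
```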

## Results

In [53]:
compare_results()

Classification:  quadratic_discriminant_analysis

Training data
accuracy 0.05076225599586573
precision 0.040827952905431064
recall 0.9988385598141696

Test data
accuracy 0.048858404585173355
precision 0.04066350710900474
recall 0.9953596287703016

Classification:  linear_svc

Training data
accuracy 0.9595969086937117
precision 0.6
recall 0.003484320557491289

Test data
accuracy 0.9594099408061637
precision 0.3333333333333333
recall 0.002320185614849188

Classification:  decision_tree

Training data
accuracy 1.0
precision 1.0
recall 1.0

Test data
accuracy 0.9160950859720004
precision 0.08
recall 0.10208816705336426

Classification:  naive_bayes

Training data
accuracy 0.1843273590002584
precision 0.04467439293598234
recall 0.9401858304297329

Test data
accuracy 0.1808700554354975
precision 0.044018928139099814
recall 0.9280742459396751

Classification:  logistic

Training data
accuracy 0.9591271053064293
precision 0.3269230769230769
recall 0.009872241579558653

Test data
accuracy 0.958
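The printed listing is hard to compare across models. As a sketch, the nested `result_dict` structure could be flattened into one table and sorted by test recall. The helper `results_to_frame` is hypothetical, and the demo dictionary holds only two models with values rounded from the output above.

```python
import pandas as pd

def results_to_frame(results):
    """Flatten {model: {'training': {...}, 'test': {...}}} into one row per model and split."""
    rows = []
    for model_name, summary in results.items():
        for split in ("training", "test"):
            rows.append({"model": model_name, "split": split, **summary[split]})
    return pd.DataFrame(rows)

# Values rounded from the printed results above (two models only).
result_dict_demo = {
    "quadratic_discriminant_analysis": {
        "training": {"accuracy": 0.0508, "precision": 0.0408, "recall": 0.9988},
        "test": {"accuracy": 0.0489, "precision": 0.0407, "recall": 0.9954},
    },
    "linear_svc": {
        "training": {"accuracy": 0.9596, "precision": 0.6, "recall": 0.0035},
        "test": {"accuracy": 0.9594, "precision": 0.3333, "recall": 0.0023},
    },
}

table = results_to_frame(result_dict_demo)

# Rank models by recall on the test split.
test_table = table[table["split"] == "test"].sort_values("recall", ascending=False)
print(test_table.to_string(index=False))
```

Seen side by side, the pattern in the output is easier to read: the models with recall near 1.0 have precision around 0.04, and the models with accuracy near 0.96 have recall close to zero.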

# Hyper Parameter optimization

# Final evaluation Test set

# Save Model

In [None]:
from joblib import dump  # serialization library

# save the model to a file
#dump(Modelo_final, 'Nombre_Archivo_Modelo.joblib')
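Since no final model is selected in this notebook yet, the following is only a sketch of the save and reload round trip with `joblib`. It uses a throwaway logistic regression fitted on synthetic data and a hypothetical file name inside a temporary directory.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model on synthetic data (stands in for the final model).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=123)
model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "final_model.joblib"  # hypothetical file name
    dump(model, model_path)     # serialize the fitted model to disk
    restored = load(model_path)  # load it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A saved model should be loaded with the same scikit-learn version it was trained with, since pickled estimators are not guaranteed to be compatible across versions.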

# Communicating Results (Data Storytelling)

# Conclusions

# Resources and References

- Correction to: Hospital Readmission of Patients with Diabetes - https://link.springer.com/article/10.1007/s11892-018-0989-1

- Center for disease control and prevention, Diabetes atlas- https://gis.cdc.gov/grasp/diabetes/DiabetesAtlas.html

- https://medium.com/@joserzapata/paso-a-paso-en-un-proyecto-machine-learning-bcdd0939d387
- [a-complete-machine-learning-walk-through-in-python-part-one](https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-one-c62152f39420)

- https://www.kaggle.com/vignesh1609/readmission-classification-model

- https://www.kaggle.com/kavyarall/predicting-effective-treatments/
