# Diabetes Classification - Models
[Jose R. Zapata](https://joserzapata.github.io)
- https://joserzapata.github.io
- https://twitter.com/joserzapata
- https://www.linkedin.com/in/jose-ricardo-zapata-gonzalez/       


## Introduction


Analyze factors related to readmission as well as other outcomes pertaining to patients in order to classify a patient-hospital outcome

3 different outputs:

1. No readmission

2. A readmission in less than `30` days (this situation is not good, because maybe your treatment was not appropriate);

3. A readmission in more than 30 days (this one is not so good as well the last one, however, the reason could be the state of the patient.


## Main Objective

> **How effective was the treatment received in hospital?**

## Principal References

### Paper

Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

https://www.hindawi.com/journals/bmri/2014/781670/

### Dataset

https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#

### Data description

https://www.hindawi.com/journals/bmri/2014/781670/tab1/

# Import libraries

In [20]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import model_selection
import plotly.express as px

# Load Dataset

In [21]:
data = pd.read_csv("../data/processed/data_imbalanced.csv")

In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53214 entries, 0 to 53213
Data columns (total 43 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   race                      53214 non-null  object
 1   gender                    53214 non-null  int64 
 2   age                       53214 non-null  object
 3   admission_type_id         53214 non-null  int64 
 4   discharge_disposition_id  53214 non-null  int64 
 5   admission_source_id       53214 non-null  int64 
 6   time_in_hospital          53214 non-null  int64 
 7   medical_specialty         53214 non-null  object
 8   num_lab_procedures        53214 non-null  int64 
 9   num_procedures            53214 non-null  int64 
 10  num_medications           53214 non-null  int64 
 11  number_outpatient         53214 non-null  int64 
 12  number_emergency          53214 non-null  int64 
 13  number_inpatient          53214 non-null  int64 
 14  diag_1                

## Encode categorical colums

In [23]:
cat_cols = list(data.select_dtypes('object').columns)
class_dict = {}
for col in cat_cols:
    data = pd.concat([data.drop(col, axis=1), pd.get_dummies(data[col], prefix=col, drop_first=True)], axis=1)

In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53214 entries, 0 to 53213
Data columns (total 85 columns):
 #   Column                                             Non-Null Count  Dtype
---  ------                                             --------------  -----
 0   gender                                             53214 non-null  int64
 1   admission_type_id                                  53214 non-null  int64
 2   discharge_disposition_id                           53214 non-null  int64
 3   admission_source_id                                53214 non-null  int64
 4   time_in_hospital                                   53214 non-null  int64
 5   num_lab_procedures                                 53214 non-null  int64
 6   num_procedures                                     53214 non-null  int64
 7   num_medications                                    53214 non-null  int64
 8   number_outpatient                                  53214 non-null  int64
 9   number_emergency            

## Input and output data

Get the names of input features and output feature

In [25]:
X = data.drop("readmitted", axis=1)
y= data["readmitted"]

In [26]:
X.shape

(53214, 84)

In [27]:
y.shape

(53214,)

## Split dataset 70/30

70% to train the model
30% to test the model

Split data stratify = y because the dataset is imbalanced

In [28]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,random_state =123, stratify=y)

In [29]:
perc_test = y_test.value_counts()[1]*100/len(y_test)
perc_train = y_train.value_counts()[1]*100/len(y_train)

print(f" Percentage of positives in test set={perc_test} , train set={perc_train}")

Percentage of positives in test set=4.046351393673661 , train set=4.0457461945287125


# MODELS

## Evaluation Metric

The selected evaluation metric is the 
> **Recall - Sensitivity**

A False Negative might delay more tests or treatment, 
however a False Positive may just lead to more tests or treatments – not as costly as putting a life at stake.


In [30]:
# Evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

Interpretability is important so first intents will be with classic Machine Learning Models, No ensembles, No Neural Networks

In [31]:
# Selected algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

## Model Selection

In [32]:
seed =2
models = []

#logistic Regression
models.append(('LR', LogisticRegression(solver='liblinear')))

# Decision Tree classifier
models.append(('CART', DecisionTreeClassifier()))

# Naïve Bayes
models.append(('NB', GaussianNB()))
# SVM
models.append(('SVM', SVC(C=1.0, kernel='rbf', max_iter=1000, tol=1e-3)))
# evaluate each model in turn
results = []
names = []
scoring = 'recall'
for name, model in models:
	# Kfol cross validation for model selection
	kfold = model_selection.KFold(n_splits=10, random_state=seed)
	#X train , y train
	cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = f"({name}, {cv_results.mean()}, {cv_results.std()}"
	print(msg)

(LR, 0.010619194876535438, 0.0072737053369331325
(KNN, 0.0, 0.0
(NB, 0.9378174963868734, 0.008952183865945028
(SVM, 0.7663281896928434, 0.03593820728648552


## Results

In [33]:
result_df = pd.DataFrame(results, index=names).T
px.box(result_df,title = 'Algorithm Comparison')

Test if results gest better results transforming the variables using standard scaler


In [34]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [35]:
# Naïve Bayes with standardscaler
NB_pipeline = make_pipeline( StandardScaler(), GaussianNB())
models.append(('NB_pipeline', NB_pipeline))

('NB_pipeline',
 Pipeline(steps=[('standardscaler', StandardScaler()),
                 ('gaussiannb', GaussianNB())]))

In [38]:
scoring = 'recall'
name, model
# Kfol cross validation for model selection
kfold = model_selection.KFold(n_splits=10, random_state=seed)
#X train , y train
cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = f"({name}, {cv_results.mean()}, {cv_results.std()}"
print(msg)

ValueError: too many values to unpack (expected 2)

In [None]:
result_df = pd.DataFrame(results, index=names).T
px.box(result_df,title = 'Algorithm Comparison')

# Hyper Parameter optimization


In [None]:
#from sklearn.model_selection import GridSearchCV

In [None]:


#parameters = {'max_depth': [2, 4, 5, 7, 9, 10]}


#grid_search = GridSearchCV(GaussianNB(), parameters, cv=10, return_train_score=True)
#grid_search.fit(x_train, y_train)

#grid_search.best_params_

# Final evaluation Test set

In [None]:
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

## Naïve Bayes

In [None]:
NB = GaussianNB()
NB.fit(X_train,y_train)
y_pred = NB.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
plot_confusion_matrix(NB, X_test, y_test,cmap=plt.cm.Blues); 

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp+fn)
sensitivity

# Model Interpretability

In [None]:
len(NB.theta_[0])

In [None]:
NB.epsilon_

In [None]:
NB.sigma_

# Save Model

In [None]:
from joblib import dump # libreria de serializacion

# garbar el modelo en un archivo
dump(NB, '../models/NB_imbalance_recall.joblib')

# Conclusiones

# Ayudas Y Referencias

- Correction to: Hospital Readmission of Patients with Diabetes - https://link.springer.com/article/10.1007/s11892-018-0989-1

- Center for disease control and prevention, Diabetes atlas- https://gis.cdc.gov/grasp/diabetes/DiabetesAtlas.html

- https://medium.com/@joserzapata/paso-a-paso-en-un-proyecto-machine-learning-bcdd0939d387
- [a-complete-machine-learning-walk-through-in-python-part-one](https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-one-c62152f39420)

- https://www.kaggle.com/vignesh1609/readmission-classification-model

- https://www.kaggle.com/kavyarall/predicting-effective-treatments/

[Jose R. Zapata](https://joserzapata.github.io)
- https://joserzapata.github.io
- https://twitter.com/joserzapata
- https://www.linkedin.com/in/jose-ricardo-zapata-gonzalez/   