# <center>Diabetes Classification</center>

### Aim :
- To classify / predict whether a patient is prone to diabetes depending on multiple features.
- It is a **binary classification** with multiple numerical features.

### <center>Dataset Attributes</center>
    
- **Pregnancies** : Number of times pregnant
- **Glucose** : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- **BloodPressure** : Diastolic blood pressure (mm Hg)
- **SkinThickness** : Triceps skin fold thickness (mm)
- **Insulin** : 2-Hour serum insulin (mu U/ml)
- **BMI** : Body mass index (weight in kg/(height in m)^2)
- **DiabetesPedigreeFunction** : A function that scores the likelihood of diabetes based on family history
- **Age** : Age (years)
- **Outcome** : Class variable (0 or 1) 

---
# Load Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams
import scipy.stats as stats
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings(action='ignore')


### Outliers Function

In [None]:
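def out_remove(col_name, df, cond, m):
    """Drop rows whose `col_name` value lies outside the IQR fences.

    cond: 'both' trims both tails, 'lower' trims only the low tail,
    any other value trims only the high tail. m is the IQR multiplier.
    """
    quartile1 = df[col_name].quantile(0.25)
    quartile3 = df[col_name].quantile(0.75)
    iqr = quartile3 - quartile1
    upper = quartile3 + m * iqr
    lower = quartile1 - m * iqr
    if cond == 'both':
        new_df = df[(df[col_name] < upper) & (df[col_name] > lower)]
    elif cond == 'lower':
        new_df = df[df[col_name] > lower]
    else:
        new_df = df[df[col_name] < upper]
    return new_df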
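
To make the IQR rule used by `out_remove` concrete, here is a small sketch on made-up numbers. The helper name `iqr_bounds` and the toy values are assumptions for illustration only and are not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, m=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - m * iqr, q3 + m * iqr

# Toy column with one obvious extreme value (illustrative data only)
toy = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
lower, upper = iqr_bounds(toy["value"])

# Same strict inequalities as the 'both' branch of out_remove
kept = toy[(toy["value"] > lower) & (toy["value"] < upper)]
print(lower, upper, len(kept))
```

With `m=1.5` only the value 100 falls outside the fences, so nine rows remain.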

------
# Reading and Checking data

In [None]:
diabetes_df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
diabetes_df.head()

In [None]:
diabetes_df.shape

In [None]:
diabetes_df.info()

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 5))
sns.heatmap(diabetes_df.isnull(), cbar=False, ax=ax1)
percent_missing = diabetes_df.isnull().mean() * 100
sns.barplot(x=percent_missing.index, y=percent_missing ,ax=ax2)
plt.xticks(rotation=90)
plt.show()

- **No null values** are present in the data, so **there is no need to preprocess for explicitly missing values.**

In [None]:
num_cols=diabetes_df.columns
rcParams['figure.figsize'] =5,5

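# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x='Outcome', data=diabetes_df, palette=["#FC766AFF","#5B84B1FF"]).set_title('Distribution of Outcome')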

- The dataset is **imbalanced**.
- Because of this, predictions will tend to be biased towards **Non-Diabetes** cases.
- So we have to **balance** the classes before training.
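
Why imbalance matters can be seen with a hedged toy example: a classifier that always predicts the majority class reaches a decent accuracy while finding no positive case at all. The class counts below are invented for illustration and are not the counts of this dataset.

```python
import numpy as np

# Illustrative labels only: 65 negatives and 35 positives
y_true = np.array([0] * 65 + [1] * 35)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_pos = np.sum((y_pred == 1) & (y_true == 1))
recall = true_pos / np.sum(y_true == 1)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

This is one reason the models later in the notebook are tuned on recall rather than accuracy.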

In [None]:
diabetes_df.describe().T

**Many features have a min() value of 0, which is not plausible for measurements such as Glucose or BMI. Let's look at these features more closely.**

In [None]:
#Replace zeros with nan
d_copy = diabetes_df.copy()
d_copy=d_copy.drop(columns=['Outcome'],axis=1)
d_copy = d_copy.replace(0,np.nan)
#sns.heatmap(d_copy.isnull(),cmap = 'magma',cbar = False);

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 5))
sns.heatmap(d_copy.isnull(), cbar=False, ax=ax1)
percent_missing = d_copy.isnull().mean() * 100
sns.barplot(x=percent_missing.index, y=percent_missing ,ax=ax2)
plt.xticks(rotation=90)
plt.show()

- Insulin ------ about <span style="color:red"> 50% </span> of its values are zero (shown as NaN here), so we may remove the Insulin column.
- Age and DiabetesPedigreeFunction ------ <span style="color:blue"> NO </span> zero (NaN) values.
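
The share of zeros per column can also be read as numbers rather than from the plot. The sketch below uses a made-up frame; the column names mirror the dataset but the values do not.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
    "Age": [50, 31, 32, 21],
})

# Share of zeros per column, in percent
zero_pct = (toy == 0).mean() * 100

# Treat zeros as missing; the NaN share then matches the zero share
as_nan = toy.replace(0, np.nan)
nan_pct = as_nan.isnull().mean() * 100
print(zero_pct)
```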


### Data Splitting

In [None]:
X = diabetes_df.iloc[:,:-1]
y = diabetes_df.iloc[:,-1]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
X_train=pd.concat([X_train, y_train], axis=1)
X_test=pd.concat([X_test, y_test], axis=1)
X_train.to_csv('train_data.csv', index=False)
X_test.to_csv('test_data.csv', index=False)
print(X_train.shape)
print(y_train.shape)
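
The `stratify=y` argument in the split above keeps the class ratio the same in both parts. A minimal sketch on synthetic labels (the 60/40 ratio is an assumption chosen for easy arithmetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 rows, 40% positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With stratify=y both parts keep the 40% positive rate
print(y_tr.mean(), y_te.mean())
```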

--------------------------------------------
# EDA

In [None]:
rcParams['figure.figsize'] =30,15
sns.set(font_scale = 1.5)
sns.set_style("white")
plt.subplots_adjust(hspace=1)
fig, axes = plt.subplots(2, 4)
for i in range(4):
    sns.distplot(X_train[num_cols[i]],ax = axes[0,i],rug=True,color='darkblue')
    #sns.boxplot(diabetes_df[num_cols[i]],ax = axes[1,i],color='red')  
    #stats.probplot(diabetes_df[num_cols[i]],plot = axes[2,i])
    sns.despine()
for i in range(4,8):
    sns.distplot(X_train[num_cols[i]],ax = axes[1,i-4],rug=True,color='darkblue')
    #sns.boxplot(diabetes_df[num_cols[i]],ax = axes[4,i-4],color='red')  
    #stats.probplot(diabetes_df[num_cols[i]],plot = axes[5,i-4])
    sns.despine()
    #5B84B1FF

- **Pregnancies**, **Insulin**, **DiabetesPedigreeFunction**, **SkinThickness** and **Age** have **positively (right) skewed** distributions.
- The distributions of **Glucose**, **BloodPressure** & **BMI** are close to a **normal distribution**.
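
Skewness can also be checked numerically instead of by eye. The sketch below uses `Series.skew()` on two invented series; applying it to the training columns would be the analogous step, which is not run here.

```python
import pandas as pd

# Illustrative values only
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])   # long tail to the right
symmetric = pd.Series([1, 2, 3, 4, 5])

# Series.skew() is positive for a right tail and about 0 for symmetric data
print(right_skewed.skew(), symmetric.skew())
```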

--------------------------------------------
## Preprocessing

### Pregnancies

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[0]],ax = plt.subplot(2,3,1)  ,rug=True,color='Aqua')
    sns.boxplot(X_train[num_cols[0]],ax = plt.subplot(2,3,2)  ,color='Aqua')  
    stats.probplot(X_train[num_cols[0]],plot = plt.subplot(2,3,3))  

In [None]:
print(X_train.shape)
print(y_train.shape)

Zero is a valid value for **Pregnancies**, so the **zeroes** are kept. There are also some **outliers**.

In [None]:
#Treating Outlier and then verifying it
X_train = out_remove('Pregnancies',X_train,'both',1.5)
#---------------------
X_test = out_remove('Pregnancies',X_test,'both',1.5)
X_train

### Glucose

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[1]],ax = plt.subplot(2,3,1)  ,rug=True,color='Violet')
    sns.boxplot(X_train[num_cols[1]],ax = plt.subplot(2,3,2)  ,color='Violet')  
    stats.probplot(X_train[num_cols[1]],plot = plt.subplot(2,3,3)) 

There are **few outliers** and the distribution is close to **normal**, so we decided to fill the **zeroes** with the **mean** value.

In [None]:
X_train['Glucose'] = X_train['Glucose'].replace(0,X_train['Glucose'].mean())
#------------------
X_test['Glucose'] = X_test['Glucose'].replace(0,X_test['Glucose'].mean())
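
The cell above fills zeros in each split with that split's own mean, and that mean still includes the zeros. One possible variant, shown only as a sketch on made-up data, computes the statistic from the non-zero training values and reuses it for the test split, so that no information from the test set enters preprocessing. The frames `train` and `test` are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"Glucose": [100.0, 120.0, 0.0, 140.0]})
test = pd.DataFrame({"Glucose": [0.0, 110.0]})

# Mean of the valid (non-zero) training values only
nonzero = train.loc[train["Glucose"] != 0, "Glucose"]
fill_value = nonzero.mean()

# Apply the same training statistic to both splits
train["Glucose"] = train["Glucose"].replace(0, fill_value)
test["Glucose"] = test["Glucose"].replace(0, fill_value)
print(fill_value, test["Glucose"].tolist())
```

The same pattern would apply to the median-based fills used for the other columns.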

In [None]:
X_train = out_remove('Glucose',X_train,'both',1.5)
#-----------------
X_test = out_remove('Glucose',X_test,'both',1.5)


### BloodPressure

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[2]],ax = plt.subplot(2,3,1)  ,rug=True,color='DarkOrange')
    sns.boxplot(X_train[num_cols[2]],ax = plt.subplot(2,3,2)  ,color='DarkOrange')  
    stats.probplot(X_train[num_cols[2]],plot = plt.subplot(2,3,3)) 

It looks like there are a few outliers at both the higher and the lower end. **At the higher end the maximum blood pressure is 122, which is plausible.** At the lower end, a blood pressure near 25 does not make sense, so we decided to replace the zeroes with the median and remove the lower outliers.

In [None]:
X_train['BloodPressure'] = X_train['BloodPressure'].replace(0,X_train['BloodPressure'].median())
#------------
X_test['BloodPressure'] = X_test['BloodPressure'].replace(0,X_test['BloodPressure'].median())

In [None]:
X_train = out_remove('BloodPressure',X_train,'lower',1.5)
#------------
X_test = out_remove('BloodPressure',X_test,'lower',1.5)

### SkinThickness

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[3]],ax = plt.subplot(2,3,1)  ,rug=True,color='blue')
    sns.boxplot(X_train[num_cols[3]],ax = plt.subplot(2,3,2)  ,color='blue')  
    stats.probplot(X_train[num_cols[3]],plot = plt.subplot(2,3,3)) 

In [None]:
X_train['SkinThickness'] = X_train['SkinThickness'].replace(0,X_train['SkinThickness'].mean())
#-----------
X_test['SkinThickness'] = X_test['SkinThickness'].replace(0,X_test['SkinThickness'].mean())

In [None]:
X_train = out_remove('SkinThickness',X_train,'both',1.5)
#------------
X_test = out_remove('SkinThickness',X_test,'both',1.5)

### Insulin

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[4]],ax = plt.subplot(2,3,1)  ,rug=True,color='black')
    sns.boxplot(X_train[num_cols[4]],ax = plt.subplot(2,3,2)  ,color='black')  
    stats.probplot(X_train[num_cols[4]],plot = plt.subplot(2,3,3)) 

**We can see there are many outliers, so we decided to fill the zeroes with the median of Insulin and then treat the outliers.**

In [None]:
X_train['Insulin'] = X_train['Insulin'].replace(0,X_train['Insulin'].median())
#-----------
X_test['Insulin'] = X_test['Insulin'].replace(0,X_test['Insulin'].median())

In [None]:
X_train = out_remove('Insulin',X_train,'both',1.5)
#------------
X_test = out_remove('Insulin',X_test,'both',1.5)

### BMI

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[5]],ax = plt.subplot(2,3,1)  ,rug=True,color='green')
    sns.boxplot(X_train[num_cols[5]],ax = plt.subplot(2,3,2)  ,color='green')  
    stats.probplot(X_train[num_cols[5]],plot = plt.subplot(2,3,3)) 

In [None]:
X_train['BMI'] = X_train['BMI'].replace(0,X_train['BMI'].mean())
#----------
X_test['BMI'] = X_test['BMI'].replace(0,X_test['BMI'].mean())

### DPF

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[6]],ax = plt.subplot(2,3,1)  ,rug=True,color='brown')
    sns.boxplot(X_train[num_cols[6]],ax = plt.subplot(2,3,2)  ,color='brown')  
    stats.probplot(X_train[num_cols[6]],plot = plt.subplot(2,3,3)) 

In [None]:
X_train = out_remove('DiabetesPedigreeFunction',X_train,'both',1.5)
#------------
X_test = out_remove('DiabetesPedigreeFunction',X_test,'both',1.5)

### Age

In [None]:
fig, axes = plt.subplots(1,3)
for i in range(1):
    sns.distplot(X_train[num_cols[7]],ax = plt.subplot(2,3,1)  ,rug=True,color='y')
    sns.boxplot(X_train[num_cols[7]],ax = plt.subplot(2,3,2)  ,color='y')  
    stats.probplot(X_train[num_cols[7]],plot = plt.subplot(2,3,3)) 

There are some **outliers**, but they are meaningful: an age of **60, 70 or 80** is plausible.

## Data after Cleaning

In [None]:
rcParams['figure.figsize'] =30,15
sns.set(font_scale = 1.5)
sns.set_style("white")
plt.subplots_adjust(hspace=1)
fig, axes = plt.subplots(2, 4)
for i in range(4):
    sns.distplot(X_train[num_cols[i]],ax = axes[0,i],rug=True,color='brown')
    #sns.boxplot(diabetes_df[num_cols[i]],ax = axes[1,i],color='red')  
    #stats.probplot(diabetes_df[num_cols[i]],plot = axes[2,i])
    sns.despine()
for i in range(4,8):
    sns.distplot(X_train[num_cols[i]],ax = axes[1,i-4],rug=True,color='brown')
    #sns.boxplot(diabetes_df[num_cols[i]],ax = axes[4,i-4],color='red')  
    #stats.probplot(diabetes_df[num_cols[i]],plot = axes[5,i-4])
    sns.despine()

----
## Correlation Matrix :

In [None]:
#plt.figure(figsize = (35,15))
sns.heatmap(X_train.corr(),cmap = 'RdGy',annot = True,cbar=True);

<span style="color:red"> **Observation:** </span>
* The correlation between Outcome and Glucose is high.
* Pregnancies and Age are highly correlated.
* SkinThickness and BMI are highly correlated.

**- Now, let's visualize the relationships between BMI & SkinThickness and between Age & Pregnancies**

In [None]:
plt.figure(figsize=(8,4),dpi=110)
sns.scatterplot(data=X_train, x='BMI', y='SkinThickness', color='darkblue');
plt.plot([18, 47], [0, 55], 'red', linewidth=4)
plt.xlim(20, 60)
plt.ylim(4, 60)


In [None]:
plt.figure(figsize=(8,4),dpi=110)
sns.scatterplot(data=X_train, x='Age', y='Pregnancies',color='darkblue');
plt.plot([0, 80], [1, 12], 'red', linewidth=4)
plt.ylim(0, 18)
plt.xlim(18, 90)


plt.show()

----
## Feature Importance

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Train a decision tree model
plt.figure(figsize=(10, 5))

model = DecisionTreeClassifier()
z_copy = X_train.copy()
y = z_copy.iloc[:,-1]
X = z_copy.iloc[:,:-1]
model.fit(X, y)
importances = model.feature_importances_
#print(model.feature_importances_)
indices = np.argsort(importances)[::-1]
df = pd.DataFrame({'Feature': X.columns[indices], 'Importance': importances[indices]})
sns.barplot(x='Importance', y='Feature', palette="Blues_r",data=df )


<span style="color:red"> **Observation:** </span>
* Glucose, BMI and Age have high importance.
* Since about 50% of the Insulin values were missing (recorded as zero) and its importance is low, we decided to **remove** it.
* Since SkinThickness is correlated with **BMI** and has low importance, we decided to **remove** it.
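
How `feature_importances_` behaves can be sanity-checked on data where the answer is known. In this sketch the synthetic label depends on the first column only, so that column should receive essentially all of the importance. The data and the fixed `random_state` are assumptions for illustration; fixing the seed also makes repeated runs return the same tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 2))            # column 0 is informative, column 1 is noise
y = (X[:, 0] > 0.5).astype(int)     # label depends on column 0 only

# random_state fixes the tie-breaking so repeated runs give the same tree
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_
print(importances)
```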

In [None]:
X_train.drop(['Insulin', 'SkinThickness'], axis=1, inplace=True)
X_test.drop(['Insulin', 'SkinThickness'], axis=1, inplace=True)


-----
### Glucose vs. Other Features

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(18,10 ),dpi=140)
sns.scatterplot(data=X_train, x='Glucose', y='BMI',  hue='Outcome', ax=ax[0][0])
sns.scatterplot(data=X_train, x='Glucose', y='Age', hue='Outcome', ax=ax[0][1])
sns.scatterplot(data=X_train, x='Glucose', y='BloodPressure',  hue='Outcome', ax=ax[1][0])
sns.scatterplot(data=X_train, x='Glucose', y='DiabetesPedigreeFunction',  hue='Outcome', ax=ax[1][1])


---
# Scaling & Modeling

In [None]:
y_train = X_train.iloc[:,-1]
X_train = X_train.iloc[:,:-1]
y_test = X_test.iloc[:,-1]
X_test = X_test.iloc[:,:-1]

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
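from sklearn.metrics import accuracy_score, classification_report, RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so the cells below still run
plot_roc_curve = RocCurveDisplay.from_estimator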
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier 
from imblearn.over_sampling import SMOTE



-----
## Data Balancing:

In [None]:
SMT=SMOTE()

**StandardScaler**

In [None]:
scaler= StandardScaler()

**QuantileTransformer**

In [None]:
QUAT = QuantileTransformer(random_state=5, output_distribution='normal')


----
## AdaBoost Model

In [None]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
Param = {
    'ada__n_estimators': [10, 25, 50, 75, 100, 125, 150, 175 , 200],
    'ada__learning_rate': [0.1, 0.5, 1.0],
}
p = Pipeline([('SMT', SMT),('scaler',scaler),('ada', ada)])
Grid=GridSearchCV(p,param_grid=Param,cv=5,scoring='recall')
Grid.fit(X_train,y_train)
print(Grid.best_score_)
print(Grid.best_params_)
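
The grid keys above follow the `<step name>__<parameter>` convention, so `ada__n_estimators` targets the `n_estimators` parameter of the pipeline step named `ada`. The sketch below shows the same convention in isolation. As a simplification it uses scikit-learn's own `Pipeline` with a `LogisticRegression` stand-in on synthetic data and leaves out the SMOTE step, so it is not the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# "clf__C" means: parameter C of the step named "clf"
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="recall")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which keeps the validation folds out of the preprocessing.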

In [None]:
Final = AdaBoostClassifier(learning_rate= 0.5, n_estimators= 50)

Model = Pipeline([('SMT', SMT),('scaler',scaler), ('Final', Final)])
Model.fit(X_train, y_train)
Pred = Model.predict(X_test)

In [None]:
confusion = confusion_matrix(y_test, Pred)
print(classification_report(y_test, Pred))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
ax1.matshow(confusion, cmap='Blues',alpha=0.8)
for i in range(confusion.shape[0]):
    for j in range(confusion.shape[1]):
        ax1.text(x=j, y=i, s=confusion[i, j], va='center', ha='center')
ax1.set_xlabel('Predicted label')
ax1.set_ylabel('True label')

plot_roc_curve(Model, X_test, y_test, ax=ax2)
ax2.plot([0, 1], [0, 1], linestyle='--', color='red')
ax2.set_xlabel('False positive rate')
ax2.set_ylabel('True positive rate')

----
## RandomForest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

RANDOM_FOREST=RandomForestClassifier()
Param = {'RANDOM_FOREST__bootstrap': [True, False],
  'RANDOM_FOREST__max_depth': [10, 20, 30, None],
  'RANDOM_FOREST__n_estimators': [200, 600, 800]}
p = Pipeline([('SMT', SMT),('scaler',scaler),('RANDOM_FOREST', RANDOM_FOREST)])
Grid=GridSearchCV(p,param_grid=Param,cv=5,scoring='recall')
Grid.fit(X_train,y_train)
print(Grid.best_score_)
print(Grid.best_params_)


In [None]:
Final = RandomForestClassifier(bootstrap=True,n_estimators=600,max_depth=10)

Model = Pipeline([('SMT', SMT),('scaler',scaler), ('Final', Final)])
Model.fit(X_train, y_train)
Pred = Model.predict(X_test)

In [None]:
confusion = confusion_matrix(y_test, Pred)
print(classification_report(y_test, Pred))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
ax1.matshow(confusion, cmap='Blues',alpha=0.8)
for i in range(confusion.shape[0]):
    for j in range(confusion.shape[1]):
        ax1.text(x=j, y=i, s=confusion[i, j], va='center', ha='center')
ax1.set_xlabel('Predicted label')
ax1.set_ylabel('True label')

plot_roc_curve(Model, X_test, y_test, ax=ax2)
ax2.plot([0, 1], [0, 1], linestyle='--', color='red')
ax2.set_xlabel('False positive rate')
ax2.set_ylabel('True positive rate')

----
## KNN Model

In [None]:
KNN=KNeighborsClassifier()
lis = list(range(1,300))
Param = {'KNN__n_neighbors': lis}
p = Pipeline([('SMT', SMT),('QUAT',QUAT),('KNN', KNN)])
Grid=GridSearchCV(p,param_grid=Param,cv=5,scoring='recall')
Grid.fit(X_train,y_train)
print(Grid.best_score_)
print(Grid.best_params_)


In [None]:
Final = KNeighborsClassifier(n_neighbors=59)

Model = Pipeline([('SMT', SMT),('QUAT',QUAT), ('Final', Final)])
Model.fit(X_train, y_train)
Pred = Model.predict(X_test)

In [None]:
confusion = confusion_matrix(y_test, Pred)
print(classification_report(y_test, Pred))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
ax1.matshow(confusion, cmap='Blues',alpha=0.8)
for i in range(confusion.shape[0]):
    for j in range(confusion.shape[1]):
        ax1.text(x=j, y=i, s=confusion[i, j], va='center', ha='center')
ax1.set_xlabel('Predicted label')
ax1.set_ylabel('True label')

plot_roc_curve(Model, X_test, y_test, ax=ax2)
ax2.plot([0, 1], [0, 1], linestyle='--', color='red')
ax2.set_xlabel('False positive rate')
ax2.set_ylabel('True positive rate')

--- 
## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
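# liblinear supports both the l1 and l2 penalties searched below (the default lbfgs solver rejects l1)
LGR = LogisticRegression(solver='liblinear')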

Param={"LGR__C":np.logspace(-3,3,7), "LGR__penalty":["l1","l2"]}

p = Pipeline([('SMT', SMT),('QUAT',QUAT),('LGR', LGR)])
Grid=GridSearchCV(p,param_grid=Param,cv=5,scoring='recall')
Grid.fit(X_train,y_train)
print(Grid.best_score_)
print(Grid.best_params_)


In [None]:
Final = LogisticRegression(penalty='l2',C=0.1)

Model = Pipeline([('SMT', SMT),('QUAT',QUAT), ('Final', Final)])
Model.fit(X_train, y_train)
Pred = Model.predict(X_test)

In [None]:
confusion = confusion_matrix(y_test, Pred)
print(classification_report(y_test, Pred))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
ax1.matshow(confusion, cmap='Blues',alpha=0.8)
for i in range(confusion.shape[0]):
    for j in range(confusion.shape[1]):
        ax1.text(x=j, y=i, s=confusion[i, j], va='center', ha='center')
ax1.set_xlabel('Predicted label')
ax1.set_ylabel('True label')

plot_roc_curve(Model, X_test, y_test, ax=ax2)
ax2.plot([0, 1], [0, 1], linestyle='--', color='red')
ax2.set_xlabel('False positive rate')
ax2.set_ylabel('True positive rate')

-----
## Ensemble Classifier

In [None]:
from sklearn.ensemble import VotingClassifier
ada = AdaBoostClassifier(learning_rate= 0.5, n_estimators= 50)
rf = RandomForestClassifier(bootstrap=True,n_estimators=600,max_depth=10)
knn = KNeighborsClassifier(n_neighbors=59)
LGR =LogisticRegression(penalty='l2',C=0.1)

# Resample into new variables so the original training split is not overwritten
X_res, y_res = SMT.fit_resample(X_train, y_train)

voting_clf = VotingClassifier(estimators=[('ada', ada), ('rf', rf),('knn', knn)], voting='hard')
voting_clf.fit(X_res, y_res)

accuracy = voting_clf.score(X_test, y_test)
print('Voting classifier accuracy: {:.2f}'.format(accuracy))
Pred = voting_clf.predict(X_test)

In [None]:
confusion = confusion_matrix(y_test, Pred)
print(classification_report(y_test, Pred))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
ax1.matshow(confusion, cmap='Blues',alpha=0.8)
for i in range(confusion.shape[0]):
    for j in range(confusion.shape[1]):
        ax1.text(x=j, y=i, s=confusion[i, j], va='center', ha='center')
ax1.set_xlabel('Predicted label')
ax1.set_ylabel('True label')

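# `Model` still refers to the logistic-regression pipeline, so plot the voting classifier's own predictions.
# Hard voting exposes no probabilities, so the curve is built from the predicted labels.
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(y_test, Pred, ax=ax2)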
ax2.plot([0, 1], [0, 1], linestyle='--', color='red')
ax2.set_xlabel('False positive rate')
ax2.set_ylabel('True positive rate')

----
## Final Evaluation

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
import numpy as np
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(26, 7),dpi=100)
# logistic Regression
Final = LogisticRegression(penalty='l2',C=0.1)
LGR = Pipeline([('SMT', SMT),('scaler',scaler), ('Final', Final)])
LGR.fit(X_train, y_train)
# KNN
Final = KNeighborsClassifier(n_neighbors=59)
KNN = Pipeline([('SMT', SMT),('QUAT',QUAT), ('Final', Final)])
KNN.fit(X_train, y_train)
# Random Forest
Final = RandomForestClassifier(bootstrap=True,n_estimators=600,max_depth=10)
rf = Pipeline([('SMT', SMT),('scaler',scaler), ('Final', Final)])
rf.fit(X_train, y_train)
# AdaBoost
Final = AdaBoostClassifier(learning_rate= 0.5, n_estimators= 50)
ada = Pipeline([('SMT', SMT),('scaler',scaler), ('Final', Final)])
ada.fit(X_train, y_train)
#ensemble
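# Use the KNN pipeline fitted above (not the bare `knn` estimator) so all three voters are preprocessed pipelines
voting_clf = VotingClassifier(estimators=[('ada', ada), ('rf', rf),('knn', KNN)], voting='hard')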
voting_clf.fit(X_train, y_train)

y_pred = np.array([ada.predict(X_test), rf.predict(X_test),KNN.predict(X_test),LGR.predict(X_test),voting_clf.predict(X_test)])
model = ['AdaBoost', 'Random Forest', 'KNN', 'Logistic Regression', 'Ensemble']
recalls=[]
r=[]
Acurve=[]
for i in range(5):
    # Calculate FPR and TPR for this model
    fpr, tpr, _ = roc_curve(y_test, y_pred[i])
    roc_auc = auc(fpr, tpr)
    recalls.append([model[i],recall_score(y_test, y_pred[i])])
    r.append(recall_score(y_test, y_pred[i]))
    Acurve.append(roc_auc_score(y_test, y_pred[i]))
    plt.plot(fpr, tpr, label=' {} (AUC = {:.3f})'.format(model[i], roc_auc))

plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
   
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')

df = pd.DataFrame(recalls, columns=['Model', 'Recall'])
df = df.sort_values(by='Recall', ascending=False)
sns.barplot(x='Model', y='Recall', data=df, palette='Blues_r',ax=ax1)
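
The loop above feeds hard class labels from `predict` into `roc_curve`, so each curve has a single operating point. As a hedged illustration on made-up numbers, the sketch below shows that AUC computed from labels can differ from AUC computed from continuous scores such as `predict_proba(...)[:, 1]`. The arrays are invented and are not outputs of the notebook's models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.2, 0.3, 0.4, 0.9])     # stands in for predicted probabilities
labels = (scores >= 0.5).astype(int)        # stands in for predict() output

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

Here the scores rank every positive above every negative, yet thresholding loses that information and lowers the AUC.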

In [None]:

plt.figure_format = 'svg'
plt.rcParams['figure.figsize'] = (10, 5)
df = pd.DataFrame({'AUC': Acurve, 'Recall': r}, index=model)
df.style.background_gradient()

## KNN is the best model, as it has the highest recall
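
Recall, the metric used for this comparison, is the share of actual positives that the model finds: $TP / (TP + FN)$. A minimal sketch on made-up labels shows that the manual formula agrees with `recall_score`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Illustrative labels only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_manual = tp / (tp + fn)      # share of true positives that were found
print(recall_manual, recall_score(y_true, y_pred))
```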

----
# References & Links

https://github.com/a5medashraf/Diabetes-Classification<br>

https://medium.com/analytics-vidhya/what-is-balance-and-imbalance-dataset-89e8d7f46bc5