# Diabetes 
#### Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy.

#### Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy.

#### With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease.

#### There are 3 types of diabetes, and in this data we are dealing with type 2 diabetes.

#### Diabetes has 2 stages:
- Pre-diabetes
- Diabetes

## Summary:

### We are trying to build a model that predicts diabetes in patients.


In [None]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import linear_model

from sklearn.linear_model import LogisticRegression
from sklearn.tree  import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, recall_score, precision_score, roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2 , f_classif

from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error
from sklearn.feature_selection import RFE
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv ('../input/diabetes-health-indicators-dataset/diabetes_012_health_indicators_BRFSS2015.csv') 


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# Exploratory Data Analysis (EDA)

## Take a quick look at the data:

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

### Data types: all 22 columns are float64


In [None]:
data.columns

In [None]:
data.describe()

In [None]:
data.isnull().sum()

#### There is no missing data

In [None]:
data.duplicated().sum()

#### There are 23899 duplicated rows, so they should be removed

In [None]:
data.drop_duplicates(inplace=True)

In [None]:
data.shape

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(data.corr(), annot=True, cmap="YlGnBu")

In [None]:
df_vis=data.copy()

In [None]:
# Transform numeric codes into readable labels for plotting.
# Series.replace is used instead of chained assignment (df.col[mask] = value),
# which raises SettingWithCopyWarning and may fail to update df_vis.
yes_no = {0: 'No', 1: 'Yes'}
label_maps = {
    'Diabetes_012': {0: 'No Diabetes', 1: 'Pre Diabetes', 2: 'Diabetes'},
    'HighBP': {0: 'No High', 1: 'High BP'},
    'HighChol': {0: 'No High Cholesterol', 1: 'High Cholesterol'},
    'CholCheck': {0: 'No Cholesterol Check in 5 Years', 1: 'Cholesterol Check in 5 Years'},
    'GenHlth': {1: 'Excellent', 2: 'Very Good', 3: 'Good', 4: 'Fair', 5: 'Poor'},
    'Sex': {0: 'Female', 1: 'Male'},
    'Education': {1: 'Never Attended School', 2: 'Elementary', 3: 'Some high school',
                  4: 'High school graduate', 5: 'Some college or technical school',
                  6: 'College graduate'},
    # Income codes are collapsed into three groups. On the BRFSS INCOME2 scale,
    # codes 1-4 are below $25,000 and codes 5-7 are $25,000 to below $75,000.
    'Income': {1: 'Less Than $25,000', 2: 'Less Than $25,000',
               3: 'Less Than $25,000', 4: 'Less Than $25,000',
               5: '$25,000 to Less Than $75,000', 6: '$25,000 to Less Than $75,000',
               7: '$25,000 to Less Than $75,000', 8: '$75,000 or More'},
}

# All remaining binary columns share the same No/Yes labels.
for col in ['Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
            'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'DiffWalk']:
    label_maps[col] = yes_no

for col, mapping in label_maps.items():
    df_vis[col] = df_vis[col].replace(mapping)


## Visualizing Data


In [None]:
unique_values = {}
for col in df_vis.columns:
    unique_values[col] = df_vis[col].value_counts().shape[0]

pd.DataFrame(unique_values, index=['unique value count']).transpose()

In [None]:
cols = list(df_vis.columns)
cols_df=cols[1:]

In [None]:
plt.figure(figsize=(15,40))
for i in range(len(cols_df)):
    plt.subplot(8,3,i+1)
    plt.title(cols_df[i])
    plt.xticks(rotation=90)
    plt.hist(df_vis[cols_df[i]])
    
plt.tight_layout()

## Proportion of each diabetes class in the dataset


In [None]:
df_vis['Diabetes_012'].value_counts()

In [None]:
# pie plot of diabetes ratio
plt.figure(figsize=(8,6))
# value_counts() sorts from most to least frequent; taking the labels from its
# index keeps every slice paired with the right class name.
counts = df_vis['Diabetes_012'].value_counts()
colors = ['lightskyblue', 'grey', 'lightcoral']
explode = (0.05, 0.05, 0)  # offset the first two slices
plt.pie(counts.values, explode=explode, labels=counts.index, autopct='%.1f%%', colors=colors);



#### "No diabetes" is the most common case in the dataset, followed by "diabetes" and then "pre-diabetes".

## Correlation with Diabetes_012 as a bar graph

In [None]:
data.drop('Diabetes_012', axis=1).corrwith(data.Diabetes_012).plot(kind='bar', grid=True, figsize=(15, 6)
, title="Correlation with Diabetes_012",color="blue");

**Diabetes_012's relation with the other columns, from the bar graph:**

1. Fruits, AnyHealthcare, NoDocbcCost and Sex are the least correlated with Diabetes_012.

2. HighBP, HighChol, BMI, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Veggies, HvyAlcoholConsump, GenHlth, PhysHlth, Age, Education, Income and DiffWalk show a noticeably stronger correlation with Diabetes_012.
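
To make this ranking explicit rather than reading it off the bar chart, the correlations can be sorted by absolute value. The sketch below uses a small invented frame (`toy`) in place of `data`; only the `corrwith` and sorting steps are the point.

```python
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    "Diabetes_012": [0, 0, 1, 2, 2, 0, 2, 1],
    "HighBP":       [0, 0, 1, 1, 1, 0, 1, 0],
    "Fruits":       [1, 0, 1, 0, 1, 1, 0, 0],
    "BMI":          [22, 24, 29, 35, 33, 21, 31, 27],
})

# Pearson correlation of every feature with the target.
corr_with_target = toy.drop(columns="Diabetes_012").corrwith(toy["Diabetes_012"])

# Order by magnitude so weak and strong features are easy to separate.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting by magnitude keeps negatively correlated features (here `Fruits`) from being mistaken for unimportant ones just because their bars point downwards.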

## Distribution of diabetes among genders

In [None]:
plt.figure(figsize=(12,4))
x= sns.countplot(x='Diabetes_012',data=df_vis,hue='Sex')
plt.xticks(rotation=90)
plt.title('Diabetes by gender',fontdict={'fontsize':20})
for i in x.patches:
    x.annotate('{:.2f}'.format((i.get_height()/df_vis.shape[0])*100)+'%',(i.get_x()+0.25, i.get_height()+0.01))
plt.show()

#### In this data, the class distribution is similar for both sexes, so gender appears to have little effect on developing diabetes.
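
The percentages annotated on the count plot are shares of the whole dataset, so they also reflect how many men and women were surveyed. A fairer comparison looks at the class mix within each sex. A minimal sketch with invented rows, using `pd.crosstab` with `normalize="index"`:

```python
import pandas as pd

# Invented rows standing in for df_vis; only the column names match the notebook.
toy = pd.DataFrame({
    "Sex": ["Female"] * 6 + ["Male"] * 4,
    "Diabetes_012": ["No Diabetes", "No Diabetes", "No Diabetes", "No Diabetes",
                     "Pre Diabetes", "Diabetes",
                     "No Diabetes", "No Diabetes", "No Diabetes", "Diabetes"],
})

# normalize="index" makes every row (one sex) sum to 100 %.
rates = pd.crosstab(toy["Sex"], toy["Diabetes_012"], normalize="index") * 100
print(rates.round(1))
```

If the rows of this table are close to each other on the real data, that supports the reading that sex has little effect.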

## Smoker

In [None]:

plt.figure(figsize=(12,5))

x= sns.countplot(x='Smoker', hue='Diabetes_012' , data = df_vis);
for i in x.patches:
    x.annotate('{:.2f}'.format((i.get_height()/df_vis.shape[0])*100)+'%',(i.get_x()+0.25, i.get_height()+0.01))
plt.show()

### HvyAlcoholConsump

In [None]:
df_vis['HvyAlcoholConsump'].value_counts()

In [None]:
plt.figure(figsize=(12,5))

x= sns.countplot(x='HvyAlcoholConsump', hue='Diabetes_012' , data = df_vis);
for i in x.patches:
    x.annotate('{:.2f}'.format((i.get_height()/df_vis.shape[0])*100)+'%',(i.get_x()+0.25, i.get_height()+0.01))
plt.show()

## Smoker and HvyAlcoholConsump's combined effect on Diabetes

In [None]:
# (1 in Smoker is Yes), (1 in HvyAlcoholConsump is Yes), and (0 is No Diabetes, 1 is Pre Diabetes, 2 is Diabetes)

sns.catplot(x="Smoker" , y ="HvyAlcoholConsump" , data = data , hue="Diabetes_012"  , kind="bar"  );  
plt.title("Relation b/w Smoker ,HvyAlcoholConsump and Diabetes")

**Result: According to this data, smoking and HvyAlcoholConsump together increase the risk of diabetes.**

---

### Stroke

In [None]:
sns.countplot(data=df_vis, x='Stroke')

In [None]:

plt.figure(figsize=(12,5))

x= sns.countplot(x='Stroke', hue='Diabetes_012' , data = df_vis);
for i in x.patches:
    x.annotate('{:.2f}'.format((i.get_height()/df_vis.shape[0])*100)+'%',(i.get_x()+0.25, i.get_height()+0.01))
plt.show()

In [None]:
# plt.figure(figsize=(10,6))
sns.countplot(data=df_vis[df_vis['Diabetes_012']=='Diabetes'],x='Stroke',palette='Set1');

#### In this data, only a small share of diabetics report a stroke. Even so, diabetes is known to increase the chance of having a stroke, which can damage brain tissue and cause disability or even death.

### HeartDiseaseorAttack

In [None]:
sns.countplot(data=df_vis,x='HeartDiseaseorAttack',hue='Stroke')

In [None]:
plt.figure(figsize=(12,5))

x= sns.countplot(x='HeartDiseaseorAttack', hue='Diabetes_012' , data = df_vis);
for i in x.patches:
    x.annotate('{:.2f}'.format((i.get_height()/df_vis.shape[0])*100)+'%',(i.get_x()+0.25, i.get_height()+0.01))
plt.show()

#### People who have had heart disease or a heart attack are more likely to have diabetes

In [None]:
# plt.figure(figsize=(10,6))
sns.countplot(data=df_vis[df_vis['HeartDiseaseorAttack']=="Yes"],x='Stroke',palette='Set1');

#### Heart disease or a heart attack is a known risk factor for stroke, so some overlap between the two is expected

## Stroke and HeartDiseaseorAttack's combined effect on Diabetes

In [None]:
# (1 in Stroke is Yes), (1 in HeartDiseaseorAttack is Yes), and (0 is No Diabetes, 1 is Pre Diabetes, 2 is Diabetes)  

sns.catplot(x="Stroke" , y ="HeartDiseaseorAttack" , data = data , hue="Diabetes_012"  , kind="bar"  );
plt.title("Relation b/w Stroke ,HeartDiseaseorAttack and Diabetes")

**Result: According to this data, Stroke and HeartDiseaseorAttack together increase the risk of diabetes.**

---

### High BP

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=df_vis,x='Diabetes_012',hue='HighBP',palette='husl')

In [None]:
sns.displot(data=df_vis,x='Diabetes_012',col='HighBP',color='#b3b3ff')

### HighChol

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=df_vis,x='Diabetes_012',hue='HighChol',palette='husl')

In [None]:
sns.displot(data=df_vis,x='Diabetes_012',col='HighChol',color='#b3b3ff')

#### Most diabetics have high blood pressure and high cholesterol

In [None]:
# HighChol with HighBP
plt.figure(figsize=(10,6))
x=sns.countplot(data=df_vis,x='HighChol',hue='HighBP',palette='hls')
for i in x.patches:
    x.annotate('{:.2f}'.format(i.get_height()/df_vis.shape[0]*100)+'%',(i.get_x()+0.25, i.get_height()+0.01))
plt.show()

#### High cholesterol and high blood pressure are closely related: people with high cholesterol tend to have high blood pressure

The link between high blood pressure and high cholesterol goes in both directions. When the body can’t clear cholesterol from the bloodstream, that excess cholesterol can deposit along artery walls. When arteries become stiff and narrow from deposits, the heart has to work overtime to pump blood through them. This causes blood pressure to go up and up.

## Checking HighBP and HighChol's combined effect on Diabetes

In [None]:
# (1 in HighBP is Yes), (1 in HighChol is Yes), and (0 is No Diabetes, 1 is Pre Diabetes, 2 is Diabetes)  

sns.catplot(x="HighBP" , y ="HighChol" , data = data , hue="Diabetes_012" , kind="bar" );
plt.title("Relation b/w HighBP ,HighChol and Diabetes")

**Result: According to this data, HighBP and HighChol together increase the risk of diabetes.**
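
One way to check a combined effect numerically is to compute the share of diabetics in each HighBP / HighChol combination. The sketch below does this on invented rows; in the notebook the same steps would run on `data`.

```python
import pandas as pd

# Invented data; only the column names match the notebook.
toy = pd.DataFrame({
    "HighBP":       [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
    "HighChol":     [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1],
    "Diabetes_012": [0, 0, 0, 2, 0, 2, 2, 2, 1, 2, 0, 0],
})

# Share of rows with diabetes (class 2) in every HighBP / HighChol combination.
is_diabetic = toy["Diabetes_012"].eq(2)
rate = is_diabetic.groupby([toy["HighBP"], toy["HighChol"]]).mean().unstack()
print(rate)
```

A combined effect would show up as the cell with both risk factors having a clearly higher rate than either factor alone.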

---

### BMI

In [None]:

# displot creates its own figure, so a preceding plt.figure() only adds an empty one.
sns.displot(x='BMI', col='Diabetes_012', data=df_vis, kind="kde");


In [None]:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(8,6))
fig = sns.scatterplot(data=df_vis, x="Age", y="BMI", hue='Sex')
fig.axhline(y= 25, linewidth=3, color='k', linestyle= '--')
plt.show()

#### Most people have a BMI above the normal range (the dashed line marks BMI = 25)

## Split the BMI into (Underweight,Normal weight,Overweight,Obesity)

In [None]:
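# Left-closed bins with an open upper end, so BMI = 25 counts as overweight and very high values are not dropped.
BMI=pd.cut( data['BMI'],bins=[0,18.5,25,30,np.inf],right=False,labels=['Underweight','Normal weight','Overweight','Obesity'])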
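
The usual adult BMI categories are below 18.5 (underweight), 18.5 to below 25 (normal weight), 25 to below 30 (overweight), and 30 or above (obesity). The short sketch below, on made-up values, shows how `pd.cut` treats the boundary values when the bins are left-closed and the last bin is open-ended.

```python
import numpy as np
import pandas as pd

# Made-up BMI values chosen to sit on or near the category boundaries.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 92.0])

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
groups = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    right=False,
    labels=["Underweight", "Normal weight", "Overweight", "Obesity"],
)
print(groups.tolist())
```

With the default `right=True`, a BMI of exactly 25 would fall into "Normal weight", and a fixed upper edge would turn any value above it into `NaN`.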

In [None]:
dd=pd.crosstab(df_vis['Diabetes_012'],BMI,rownames=['Diabetes'])
dd=dd.astype(float)
dd

In [None]:
Diabetes_sum_lst=list(dd.transpose().sum().values)
Diabetes_sum_lst

In [None]:
# Convert each row to percentages of that diabetes class.
dd = dd.div(Diabetes_sum_lst, axis=0) * 100

dd

In [None]:
dd.plot(kind="bar",figsize=(10,6));

---

### Age

In [None]:

plt.figure(figsize=(12,5))
sns.displot(x='Age', col='Diabetes_012' , data = df_vis, kind="kde")
plt.show()

In [None]:
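# Age holds BRFSS five-year age codes (1 = 18-24, ..., 13 = 80 or older), so the bin edges are codes, not years:
# codes 1-3 -> 18-34, 4-6 -> 35-49, 7-9 -> 50-64, 10-11 -> 65-74, 12-13 -> 75 and older.
age = pd.cut(df_vis['Age'],bins=[0,3,6,9,11,13],labels=['18:34','35:49','50:64','65:74','75 and older'])
age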

In [None]:
plt.figure(figsize=(10,6))
sns.displot(data=df_vis,col='Diabetes_012',x=age,color='#993366');

#### People from 50 to 64 have a higher chance of developing diabetes
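
`Age` in this dataset is a category code rather than an age in years. The sketch below assumes the BRFSS five-year age groups for codes 1 to 13 (an assumption; the notebook does not state the coding) and shows which codes end up in each of the coarser groups.

```python
import pandas as pd

# Assumed BRFSS five-year age groups for codes 1..13 (not stated in the notebook).
age_labels = {
    1: "18-24", 2: "25-29", 3: "30-34", 4: "35-39", 5: "40-44",
    6: "45-49", 7: "50-54", 8: "55-59", 9: "60-64", 10: "65-69",
    11: "70-74", 12: "75-79", 13: "80+",
}

codes = pd.Series([1, 3, 4, 6, 7, 9, 10, 11, 12, 13])

# Coarser groups: codes 1-3, 4-6, 7-9, 10-11, 12-13 (right-closed bins).
groups = pd.cut(codes, bins=[0, 3, 6, 9, 11, 13],
                labels=["18:34", "35:49", "50:64", "65:74", "75 and older"])

print(pd.DataFrame({"code": codes, "range": codes.map(age_labels), "group": groups}))
```

Printing the code, its assumed range, and the assigned group side by side is a quick way to confirm that the bin edges and the labels agree.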

---

### PhysHlth

In [None]:
plt.figure(figsize=(12,5))
sns.displot(x='PhysHlth', col='Diabetes_012' , data = df_vis, kind="kde")
plt.show()

### MentHlth

In [None]:

plt.figure(figsize=(12,5))
x= sns.displot(x='MentHlth', col='Diabetes_012', data = df_vis, kind="kde")
plt.show()

In [None]:
sns.displot(data=df_vis.loc[(df_vis['MentHlth']>0)&(df_vis['Diabetes_012']!="No Diabetes")],x='MentHlth',col='Diabetes_012',col_wrap=2,kde=True);

#### In these plots, mental health does not show a clear effect on diabetes

### GenHlth

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=df_vis,x='Diabetes_012',hue='GenHlth',palette='Set1');

---

### Income

In [None]:
plt.figure(figsize=(12,5))
sns.countplot(x='Income', hue='Diabetes_012' , data = df_vis)
plt.show()

In [None]:
# The effect of the income on the healthcare
plt.figure(figsize=(10,6))
sns.displot(data=df_vis,x='Income',col='AnyHealthcare');

#### People with a higher income are more likely to have healthcare coverage

## Education

In [None]:
plt.figure(figsize=(12,5))
sns.countplot(x='Education', hue='Diabetes_012' , data = df_vis)
plt.show()

## Veggies 

In [None]:
pd.crosstab(df_vis.Veggies,df_vis.Diabetes_012).plot(kind="bar",figsize=(5,4))

plt.title('Diabetes Disease Frequency for Veggies')
plt.xlabel("Veggies")
plt.ylabel('Frequency')
plt.show()

## Fruits

In [None]:
pd.crosstab(df_vis.Fruits,df_vis.Diabetes_012).plot(kind="bar",figsize=(5,4))

plt.title('Diabetes Disease Frequency for Fruits')
plt.xlabel("Fruits")
plt.ylabel('Frequency')
plt.show()

## PhysActivity

In [None]:
pd.crosstab(df_vis.PhysActivity,df_vis.Diabetes_012).plot(kind="bar",figsize=(5,4))

plt.title('Diabetes Disease Frequency for PhysActivity')
plt.xlabel("PhysActivity")
plt.ylabel('Frequency')
plt.show()

# More information from visualization

In [None]:
plt.figure(figsize = (10,6))
sns.countplot(data=df_vis,x=df_vis['PhysActivity'],hue='DiffWalk',palette='husl')

#### People who are physically active are less likely to report difficulty walking

In [None]:
plt.figure(figsize = (14,6))
# plt.subplot(1, 1, 1)
x=sns.catplot(data=df_vis[df_vis['BMI']<60],x="PhysActivity", y="BMI", kind="boxen",aspect=1,palette='hls')
plt.show()

y=sns.catplot(data=df_vis[df_vis['BMI']<60],x="DiffWalk", y="BMI", kind="boxen",aspect=1,palette='hls')
plt.show()


#### People who do physical activity have lower BMI levels, while people who have difficulty walking or climbing stairs tend to have higher BMI levels

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=df_vis,x='PhysActivity',hue='GenHlth',palette='Set1');

#### Physical activity is strongly related to general health: people who are physically active report much better general health

**The summary of visualization:**

1. Males and females are about equally vulnerable to diabetes.

2. People older than 45 are more vulnerable to diabetes than younger people; the number of diabetics increases with age.

3. More than half of the diabetics are obese, and almost half of the pre-diabetics are obese.

4. The percentages of diabetics and pre-diabetics who are overweight or obese are much higher than the percentage of non-diabetics who are.

5. The number of diabetics decreases as the education level increases.

6. People with a lower income have a higher risk of diabetes than people with a higher income.

7. GenHlth has a major effect on diabetes: the risk of diabetes increases rapidly as general health gets worse.

8. MentHlth shows only a weak effect in the plots above, although a long period of poor mental health may be associated with a higher risk of diabetes.

9. The same goes for PhysHlth.

10. Physical activity is associated with a lower risk of diabetes.

11. Eating at least one fruit a day is associated with a lower risk of diabetes.

12. Eating at least one vegetable a day is associated with a lower risk of diabetes.

# Preprocessing

In [None]:
plt.figure(figsize = (25,8))
u = sns.boxplot(palette = 'cool', data=data)
u.set_xticklabels(u.get_xticklabels(),rotation=45)

In [None]:
data.columns

In [None]:
data.plot(kind="box", subplots=True, layout=(7,4), figsize=(15,14));

### Handling the outliers of the BMI

In [None]:
plt.figure(figsize = (12,6))
plt.subplot(1, 2, 1)
sns.boxplot(data=data,y='BMI',color='#cc6699')
plt.subplot(1, 2, 2)
sns.scatterplot(data=data,x='Diabetes_012',y='BMI',color='#cc6699')
plt.show()

In [None]:
x=data[data['BMI']>=70]
x.shape

In [None]:
df=data.copy()

In [None]:
df=data[data['BMI']<70].copy()  # copy so later changes to df do not touch data

In [None]:
plt.figure(figsize = (12,6))
plt.subplot(1, 2, 1)
sns.boxplot(data=df,y='BMI',color='#cc6699')
plt.subplot(1, 2, 2)
sns.scatterplot(data=df,x='Diabetes_012',y='BMI',color='#cc6699')
plt.show()

In [None]:
df['Diabetes_012'].value_counts()

# Split the data

In [None]:
y = df['Diabetes_012']
x = df.drop(['Diabetes_012'], axis=1)

In [None]:
x_train , x_test , y_train , y_test = train_test_split(x,y , test_size= 0.25 , random_state=42)
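
With classes as imbalanced as these, a purely random split can leave the test set with a slightly different class mix from the training set. A common option is to pass `stratify` to `train_test_split`. The sketch below uses invented labels (`X_toy`, `y_toy`) to show that the class shares are preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 % class 0, 4 % class 1, 16 % class 2 (invented proportions).
y_toy = np.array([0] * 800 + [1] * 40 + [2] * 160)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Class counts in the test set follow the 80 / 4 / 16 split of the full data.
print(np.bincount(y_te))
```

In the notebook this would correspond to adding `stratify=y` to the existing call; whether it changes the results noticeably is something to check rather than assume.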

## Resampling

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(x_train, y_train)
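
SMOTE does not duplicate minority rows. It creates new points on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step with NumPy; it is a simplified illustration, not imblearn's implementation, and `make_synthetic` is a hypothetical helper.

```python
import numpy as np

def make_synthetic(minority, k=2, n_new=5, seed=0):
    """Create n_new points by interpolating between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distances from sample i to every minority sample.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # Skip position 0 of the ordering, which is the sample itself
        # (assumes there are no exact duplicate rows).
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = make_synthetic(minority)
print(new_points)
```

Because SMOTE treats every feature as continuous, interpolated values for binary or ordinal columns such as `HighBP` or `GenHlth` can fall between valid codes. imblearn also provides `SMOTENC` for mixed categorical and continuous data, which may be worth comparing.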

# First Try
## Resample every class to the size of the diabetes class

In [None]:
df[df['Diabetes_012']==2].shape

In [None]:
df[df['Diabetes_012']==2].shape[0]/df[df['Diabetes_012']==0].shape[0]

In [None]:
df_new=df[df['Diabetes_012']==0]
df_new = df_new.sample(frac=0.18473323520567242)
df_new['Diabetes_012'].value_counts()

In [None]:
df_new2=df[df['Diabetes_012']==2]
df_new2['Diabetes_012'].value_counts()

In [None]:
df_new.shape[0]/df[df['Diabetes_012']==1].shape[0]

In [None]:
df_new3=df[df['Diabetes_012']==1]
df_new3 = df_new3.sample(frac=7.58414554905783, replace=True)
df_new3['Diabetes_012'].value_counts()

In [None]:
df_all=pd.concat([df_new,df_new2,df_new3])

In [None]:
df_all.shape

In [None]:
y = df_all['Diabetes_012']
x = df_all.drop(['Diabetes_012'], axis=1)
x_train , x_test , y_train , y_test = train_test_split(x,y , test_size= 0.30 , random_state=42)
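
Oversampling with `replace=True` before `train_test_split` lets copies of the same row land in both the training and the test set, which tends to inflate test scores. A sketch of the alternative ordering, split first and then resample only the training part, on invented data; `oversample_train` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def oversample_train(train_df, target, seed=42):
    """Bring every smaller class up to the size of the largest class by sampling with replacement."""
    n_max = train_df[target].value_counts().max()
    parts = [
        grp if len(grp) == n_max
        else grp.sample(n=n_max, replace=True, random_state=seed)
        for _, grp in train_df.groupby(target)
    ]
    return pd.concat(parts)

# Invented imbalanced data standing in for df.
toy = pd.DataFrame({
    "BMI": range(100),
    "Diabetes_012": [0] * 80 + [1] * 5 + [2] * 15,
})

train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Diabetes_012"]
)
balanced_train = oversample_train(train_df, "Diabetes_012")

print(balanced_train["Diabetes_012"].value_counts())
# Empty set: no original row appears in both sets, because resampling happened after the split.
print(set(balanced_train.index) & set(test_df.index))
```

The test set keeps the real class proportions, so the scores measured on it are closer to what the model would see in practice.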

# Modeling

## 1. Decision tree

In [None]:
dt= DecisionTreeClassifier(max_features=12 , max_depth=14)
dt.fit(x_train , y_train)

In [None]:
print(dt.score(x_train , y_train))
print(dt.score(x_test, y_test))

In [None]:
y_pred_train_dt = dt.predict(x_train)
acc_train_dt = accuracy_score(y_train, y_pred_train_dt)

y_pred_test_dt = dt.predict(x_test)
acc_test_dt = accuracy_score(y_test, y_pred_test_dt)
print(acc_train_dt)
print(acc_test_dt)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_test_dt))

## 2. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_features=4 , max_depth=14)
rf.fit(x_train,y_train)


In [None]:
print(rf.score(x_train, y_train))
print(rf.score(x_test, y_test))

In [None]:
y_pred_train_rf = rf.predict(x_train)
acc_train_rf = accuracy_score(y_train, y_pred_train_rf)

y_pred_test_rf = rf.predict(x_test)
acc_test_rf = accuracy_score(y_test, y_pred_test_rf)
print(acc_train_rf)
print(acc_test_rf)

In [None]:
print(classification_report(y_test, y_pred_test_rf))

## 3. XGBoost

In [None]:
xgb= XGBClassifier(max_depth=7)
xgb.fit(x_train, y_train)


In [None]:
print(xgb.score(x_train, y_train))
print(xgb.score(x_test, y_test))

In [None]:
y_pred_train_xgb = xgb.predict(x_train)
acc_train_xgb = accuracy_score(y_train, y_pred_train_xgb)

y_pred_test_xgb = xgb.predict(x_test)
acc_test_xgb = accuracy_score(y_test, y_pred_test_xgb)
print(acc_train_xgb)
print(acc_test_xgb)

In [None]:
print(classification_report(y_test, y_pred_test_xgb))

# Second Try
## Resample every class to the size of the prediabetes class

In [None]:
df[df['Diabetes_012']==1].shape

In [None]:
df[df['Diabetes_012']==1].shape[0]/df[df['Diabetes_012']==0].shape[0]

In [None]:
df_new=df[df['Diabetes_012']==0]
df_new = df_new.sample(frac=0.024357817767437444)
df_new['Diabetes_012'].value_counts()

In [None]:
df_new2=df[df['Diabetes_012']==1]
df_new2['Diabetes_012'].value_counts()

In [None]:
df_new.shape[0]/df[df['Diabetes_012']==2].shape[0]

In [None]:
df_new3=df[df['Diabetes_012']==2]
df_new3 = df_new3.sample(frac=0.13185400959561344)
df_new3['Diabetes_012'].value_counts()

In [None]:
df_all=pd.concat([df_new,df_new2,df_new3])

In [None]:
df_all.shape

In [None]:
y = df_all['Diabetes_012']
x = df_all.drop(['Diabetes_012'], axis=1)
x_train , x_test , y_train , y_test = train_test_split(x,y , test_size= 0.30 , random_state=42)

## Modeling

## 1. Decision Tree

In [None]:
dt= DecisionTreeClassifier(max_features=16 , max_depth=16)
dt.fit(x_train , y_train)

In [None]:
print(dt.score(x_train , y_train))
print(dt.score(x_test, y_test))

In [None]:
y_pred_train_dt = dt.predict(x_train)
acc_train_dt = accuracy_score(y_train, y_pred_train_dt)

y_pred_test_dt = dt.predict(x_test)
acc_test_dt = accuracy_score(y_test, y_pred_test_dt)
print(acc_train_dt)
print(acc_test_dt)

In [None]:
print(classification_report(y_test, y_pred_test_dt))

## 2. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_features=4 , max_depth=14)
rf.fit(x_train,y_train)


In [None]:
print(rf.score(x_train, y_train))
print(rf.score(x_test, y_test))

In [None]:
y_pred_train_rf = rf.predict(x_train)
acc_train_rf = accuracy_score(y_train, y_pred_train_rf)

y_pred_test_rf = rf.predict(x_test)
acc_test_rf = accuracy_score(y_test, y_pred_test_rf)
print(acc_train_rf)
print(acc_test_rf)

In [None]:
print(classification_report(y_test, y_pred_test_rf))

# Third Try

In [None]:
df[df['Diabetes_012']==2].shape

In [None]:
df[df['Diabetes_012']==2].shape[0]/df[df['Diabetes_012']==0].shape[0]*2

In [None]:
df_new=df[df['Diabetes_012']==0]
df_new = df_new.sample(frac=0.36946647041134484)
df_new['Diabetes_012'].value_counts()

In [None]:
df_new2=df[df['Diabetes_012']==2]
df_new2 = df_new2.sample(frac=2, replace=True)
df_new2['Diabetes_012'].value_counts()

In [None]:
df_new2.shape[0]/df[df['Diabetes_012']==1].shape[0]

In [None]:
df_new3=df[df['Diabetes_012']==1]
df_new3 = df_new3.sample(frac=15.16829109811566, replace=True)
df_new3['Diabetes_012'].value_counts()

In [None]:
df_all=pd.concat([df_new,df_new2,df_new3])

In [None]:
df_all.shape

In [None]:
y = df_all['Diabetes_012']
x = df_all.drop(['Diabetes_012'], axis=1)

In [None]:
y_orig = df['Diabetes_012']
x_orig = df.drop(['Diabetes_012'], axis=1)

In [None]:
x_train , x_test , y_train , y_test = train_test_split(x_orig,y_orig , test_size= 0.25 , random_state=42)

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(x_train, y_train)

## Modeling

## 1. Decision Tree

In [None]:
dt= DecisionTreeClassifier(max_features=4 , max_depth=16)
dt.fit(x , y)

In [None]:
print(dt.score(x , y))
print(dt.score(X_res, y_res))
print(dt.score(x_test, y_test))

In [None]:
y_pred_train_dt = dt.predict(X_res)
acc_train_dt = accuracy_score(y_res, y_pred_train_dt)

y_pred_test_dt = dt.predict(x_test)
acc_test_dt = accuracy_score(y_test, y_pred_test_dt)
print(acc_train_dt)
print(acc_test_dt)

In [None]:
print(classification_report(y_test, y_pred_test_dt))

## 2. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_features=4 , max_depth=14)
rf.fit(x, y)


In [None]:
print(rf.score(x, y))
print(rf.score(X_res, y_res))
print(rf.score(x_test, y_test))

In [None]:
y_pred_train_rf = rf.predict(X_res)
acc_train_rf = accuracy_score(y_res, y_pred_train_rf)

y_pred_test_rf = rf.predict(x_test)
acc_test_rf = accuracy_score(y_test, y_pred_test_rf)
print(acc_train_rf)
print(acc_test_rf)

In [None]:
print(classification_report(y_test, y_pred_test_rf))

## 3. XGB

In [None]:
xgb= XGBClassifier(max_depth=7)
xgb.fit(x, y)


In [None]:
print(xgb.score(x, y))
print(xgb.score(X_res, y_res))
print(xgb.score(x_test, y_test))

In [None]:
y_pred_train_xgb = xgb.predict(X_res)
acc_train_xgb = accuracy_score(y_res, y_pred_train_xgb)

y_pred_test_xgb = xgb.predict(x_test)
acc_test_xgb = accuracy_score(y_test, y_pred_test_xgb)
print(acc_train_xgb)
print(acc_test_xgb)

In [None]:
print(classification_report(y_test, y_pred_test_xgb))

## Stop Trying

---

In [None]:
y_orig = df['Diabetes_012']
x_orig = df.drop(['Diabetes_012'], axis=1)
x_train , x_test , y_train , y_test = train_test_split(x_orig,y_orig , test_size= 0.25 , random_state=42)

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(x_train, y_train)

# Modeling 

## 1. Logistic Regression

In [None]:
lg = LogisticRegression()
lg.fit(X_res, y_res)

In [None]:
print(lg.score(X_res, y_res))
print(lg.score(x_test, y_test))

In [None]:
y_pred_train_lg = lg.predict(X_res)
acc_train_lg = accuracy_score(y_res, y_pred_train_lg)

y_pred_test_lg = lg.predict(x_test)
acc_test_lg = accuracy_score(y_test, y_pred_test_lg)
print(acc_train_lg)
print(acc_test_lg)

In [None]:
print(classification_report(y_test, y_pred_test_lg))

In [None]:
print('Precision: %.3f' % precision_score(y_test, y_pred_test_lg,average="micro"))
print('Recall: %.3f' % recall_score(y_test, y_pred_test_lg,average="micro"))
print('F-measure: %.3f' % f1_score(y_test, y_pred_test_lg,average="micro"))
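
For single-label multiclass problems, micro-averaged precision, recall and F1 all equal the accuracy, so the three numbers above carry no extra information. Macro averaging gives every class the same weight and is more sensitive to the minority classes. A small sketch with invented labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels: class 0 dominates and class 1 is never predicted.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 0])

acc = accuracy_score(y_true, y_pred)
micro_p = precision_score(y_true, y_pred, average="micro")
macro_r = recall_score(y_true, y_pred, average="macro")

# Micro precision matches accuracy; macro recall is the mean of the per-class recalls (1, 0, 0.5).
print(acc, micro_p, macro_r)
```

Reporting `average="macro"` (or the per-class rows of `classification_report`) alongside accuracy makes it easier to see how the prediabetes class is handled.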

In [None]:
y_pred_prob_lg = lg.predict_proba(x_test)
roc_auc_score_lg= roc_auc_score(y_test, y_pred_prob_lg, multi_class="ovr")

print('ROC AUC Score: ',roc_auc_score_lg)
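
With `multi_class="ovr"` and the default macro averaging, the score is the unweighted mean of one binary AUC per class (that class against the rest). The sketch below reproduces this by hand on invented probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)

# Invented class probabilities: random scores nudged towards the true class,
# then normalised so each row sums to 1.
scores = rng.random((200, 3))
scores[np.arange(200), y_true] += 0.5
proba = scores / scores.sum(axis=1, keepdims=True)

# One binary AUC per class (that class vs. the rest), then the unweighted mean.
per_class = [roc_auc_score(y_true == k, proba[:, k]) for k in range(3)]
manual = np.mean(per_class)

builtin = roc_auc_score(y_true, proba, multi_class="ovr")
print(per_class, manual, builtin)
```

Looking at the per-class values separately can show whether a good overall AUC hides a weak class.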

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator replaces it.
ConfusionMatrixDisplay.from_estimator(lg, X_res, y_res)
plt.show()

## 2. Random Forest

In [None]:
### 2) Random Forest Classification

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 100,max_depth=16,max_features=10)
rf.fit(X_res, y_res)


In [None]:
print(rf.score(X_res, y_res))
print(rf.score(x_test, y_test))

In [None]:
y_pred_train_rf = rf.predict(X_res)
acc_train_rf = accuracy_score(y_res, y_pred_train_rf)

y_pred_test_rf = rf.predict(x_test)
acc_test_rf = accuracy_score(y_test, y_pred_test_rf)
print(acc_train_rf)
print(acc_test_rf)

In [None]:
print(classification_report(y_test, y_pred_test_rf))

In [None]:
print('Precision: %.3f' % precision_score(y_test, y_pred_test_rf,average="micro"))
print('Recall: %.3f' % recall_score(y_test, y_pred_test_rf,average="micro"))
print('F-measure: %.3f' % f1_score(y_test, y_pred_test_rf,average="micro"))

In [None]:
y_pred_prob_rf = rf.predict_proba(x_test)
roc_auc_score_rf = roc_auc_score(y_test, y_pred_prob_rf, multi_class="ovr")
print('ROC AUC Score:', roc_auc_score_rf)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(rf, X_res, y_res)
plt.show()

## 2.2. Random Forest with cross val score

In [None]:
rf_crossval = RandomForestClassifier(n_estimators=100, random_state=0, max_depth=16, max_features=10) 
rf_crossval.fit(X_res, y_res)
score = cross_val_score(rf_crossval, X_res, y_res, cv=3, scoring="accuracy")
score.mean()
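
Cross-validating on `X_res` means that a synthetic point and the real row it was derived from can sit in different folds, so the fold scores tend to be optimistic. A sketch of resampling inside each fold instead, on invented data. Plain random oversampling is used here as a stand-in for SMOTE so the example depends only on scikit-learn, and `random_oversample` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Invented imbalanced 3-class data standing in for x_train / y_train.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.8, 0.05, 0.15], random_state=0)

def random_oversample(X, y, rng):
    """Stand-in for SMOTE: draw counts.max() rows with replacement from every class."""
    classes, counts = np.unique(y, return_counts=True)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=counts.max(), replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
scores = []
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    # Resample the training fold only; the validation fold keeps its real class mix.
    X_bal, y_bal = random_oversample(X[train_idx], y[train_idx], rng)
    model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_bal, y_bal)
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(np.mean(scores))
```

With imblearn installed, placing `SMOTE` and the classifier in an `imblearn.pipeline.Pipeline` and passing that pipeline to `cross_val_score` on the unresampled training data should achieve the same separation.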

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(rf_crossval, X_res, y_res)
plt.show()

In [None]:
y_train_pred = rf_crossval.predict(X_res)
precision_score(y_res, y_train_pred, average='macro')

In [None]:
recall_score(y_res, y_train_pred, average='macro')

In [None]:
# Evaluate the fitted model on the test set; cross_val_predict would retrain a new model on the test folds.
y_test_pred = rf_crossval.predict(x_test)


In [None]:
recall_score(y_test, y_test_pred, average='micro')

In [None]:
precision_score(y_test, y_test_pred, average='micro')

## 3. XGB Classifier

In [None]:
xgb= XGBClassifier(max_depth=10)
xgb.fit(X_res, y_res)


In [None]:
print(xgb.score(X_res, y_res))
print(xgb.score(x_test, y_test))

In [None]:
y_pred_train_xgb = xgb.predict(X_res)
acc_train_xgb = accuracy_score(y_res, y_pred_train_xgb)

y_pred_test_xgb = xgb.predict(x_test)
acc_test_xgb = accuracy_score(y_test, y_pred_test_xgb)
print(acc_train_xgb)
print(acc_test_xgb)

In [None]:
print(classification_report(y_test, y_pred_test_xgb))

In [None]:
print('Precision: %.3f' % precision_score(y_test, y_pred_test_xgb,average="micro"))
print('Recall: %.3f' % recall_score(y_test, y_pred_test_xgb,average="micro"))
print('F-measure: %.3f' % f1_score(y_test, y_pred_test_xgb,average="micro"))

In [None]:
y_pred_prob_xgb = xgb.predict_proba(x_test)
roc_auc_score_xgb = roc_auc_score(y_test, y_pred_prob_xgb, multi_class="ovr")
print('ROC AUC Score:',roc_auc_score_xgb)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(xgb, X_res, y_res)
plt.show()

## 4. Decision Tree

In [None]:
dt= DecisionTreeClassifier(max_features=10 , max_depth=16)
dt.fit(X_res, y_res)

In [None]:
print(dt.score(X_res, y_res))
print(dt.score(x_test, y_test))

In [None]:
y_pred_train_dt = dt.predict(X_res)
acc_train_dt = accuracy_score(y_res, y_pred_train_dt)

y_pred_test_dt = dt.predict(x_test)
acc_test_dt = accuracy_score(y_test, y_pred_test_dt)
print(acc_train_dt)
print(acc_test_dt)

In [None]:
print(classification_report(y_test, y_pred_test_dt))

In [None]:
print('Precision: %.3f' % precision_score(y_test, y_pred_test_dt,average="micro"))
print('Recall: %.3f' % recall_score(y_test, y_pred_test_dt,average="micro"))
print('F-measure: %.3f' % f1_score(y_test, y_pred_test_dt,average="micro"))

In [None]:
y_pred_prob_dt = dt.predict_proba(x_test)
roc_auc_score_dt = roc_auc_score(y_test, y_pred_prob_dt, multi_class="ovr")
print('ROC AUC Score:',roc_auc_score_dt)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(dt, X_res, y_res)
plt.show()

# Compare between algorithms

In [None]:
# Collect the test accuracy and ROC AUC of each model trained on the resampled data
Performance = pd.DataFrame(
    data = {
        'Model_after_resampling': ['LogReg','RF','XGB','DT'],
        'Test_score': [accuracy_score(y_test, y_pred_test_lg),
                       accuracy_score(y_test, y_pred_test_rf),
                       accuracy_score(y_test, y_pred_test_xgb),
                       accuracy_score(y_test, y_pred_test_dt)],

        'ROC_AUC_Score': [roc_auc_score_lg,
                          roc_auc_score_rf,
                          roc_auc_score_xgb,
                          roc_auc_score_dt]
    }
)


def show_values_on_bars(axs):
    """Write each bar's height above it, for a single Axes or an array of Axes."""
    def _show_on_single_plot(ax):
        for p in ax.patches:
            _x = p.get_x() + p.get_width() / 2
            _y = p.get_y() + p.get_height()
            value = '{:.2f}'.format(p.get_height())
            ax.text(_x, _y, value, ha="center")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)


plt.figure(figsize=(12, 8))
ax = sns.barplot(x="Model_after_resampling", y="Test_score", data=Performance, palette="YlOrBr")
show_values_on_bars(ax)


In [None]:
Performance.sort_values('ROC_AUC_Score',ascending=False)

# Feature Selection

## 1. Feature Selection by chi2

We keep the top 70% of the features, ranked by their chi2 score. <br>
The `chi2` scorer computes the chi-squared statistic between each non-negative feature and the class label.<br>
The chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are most likely to be independent of the class and therefore irrelevant for classification.
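
To make the score concrete, here is a minimal from-scratch sketch of the observed-versus-expected calculation for a single feature. For each class it compares the observed sum of the feature with the sum expected if the feature were independent of the class. It is an illustration, not scikit-learn's implementation, and the toy feature values are hypothetical.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic between one non-negative feature and the class labels."""
    n = len(labels)
    total = sum(feature)
    score = 0.0
    for k in sorted(set(labels)):
        # Observed: how much of the feature falls in class k
        observed = sum(v for v, y in zip(feature, labels) if y == k)
        # Expected under independence: class share times the feature total
        expected = total * sum(1 for y in labels if y == k) / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score


# Hypothetical toy data: 4 people without diabetes (0) and 4 with diabetes (1)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
high_bp = [0, 0, 0, 1, 1, 1, 1, 1]   # concentrated in class 1
fruits = [1, 0, 1, 0, 1, 0, 1, 0]    # spread evenly over both classes

print('HighBP-like feature:', chi2_score(high_bp, labels))
print('Fruits-like feature:', chi2_score(fruits, labels))
```

The feature that is spread evenly over the classes scores 0 and would be among the first to be dropped, while the feature concentrated in one class gets a higher score.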

In [None]:
data.columns

In [None]:
FeatureSelection = SelectPercentile(score_func = chi2, percentile=70) 
x = FeatureSelection.fit_transform(x_orig, y_orig)

#showing X Dimension 
print('X Shape is ' , x.shape)
print('Selected Features are : ' , FeatureSelection.get_support())

In [None]:
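# The support mask is aligned with the columns of x_orig, so the names must come
# from x_orig as well (this assumes x_orig is a DataFrame); indexing df.columns
# with the same positions can shift every name by one if df also holds the target.
fe = FeatureSelection.get_support()
selected_features = [col for col in x_orig.columns[fe] if col != 'Diabetes_012']
selected_features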

## Modeling after feature selection

In [None]:
y2 = df['Diabetes_012']
x2 = df[selected_features]

In [None]:
x_train2 , x_test2 , y_train2 , y_test2 = train_test_split(x2,y2 , test_size= 0.25 , random_state=42)

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res2, y_res2 = sm.fit_resample(x_train2, y_train2)
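
For readers unfamiliar with SMOTE, the sketch below shows the core idea in a simplified form: a synthetic minority sample is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. This is not imblearn's implementation. The function name `smote_like`, its parameters and the toy points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (squared Euclidean distance)
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


# Hypothetical minority-class samples with two features
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]

for point in smote_like(minority, n_new=3):
    print(point)
```

Because every new point lies between two existing minority samples, the resampled training set gains minority examples without copying rows. Only the training split is resampled, so the test split keeps its original class balance.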

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf2 = RandomForestClassifier(n_estimators = 100,max_depth=16,max_features=10)
rf2.fit(X_res2, y_res2)


In [None]:
print(rf2.score(X_res2, y_res2))
print(rf2.score(x_test2, y_test2))

In [None]:
y_pred_train_rf2 = rf2.predict(X_res2)
acc_train_rf2 = accuracy_score(y_res2, y_pred_train_rf2)

y_pred_test_rf2 = rf2.predict(x_test2)
acc_test_rf2 = accuracy_score(y_test2, y_pred_test_rf2)
print(acc_train_rf2)
print(acc_test_rf2)

In [None]:
print(classification_report(y_test2, y_pred_test_rf2))

In [None]:
print('Precision: %.3f' % precision_score(y_test2, y_pred_test_rf2,average="micro"))
print('Recall: %.3f' % recall_score(y_test2, y_pred_test_rf2,average="micro"))
print('F-measure: %.3f' % f1_score(y_test2, y_pred_test_rf2,average="micro"))

In [None]:
y_pred_prob_rf2 = rf2.predict_proba(x_test2)
roc_auc_score_rf2 = roc_auc_score(y_test2, y_pred_prob_rf2, multi_class="ovr")
print('ROC AUC Score:', roc_auc_score_rf2)

In [None]:
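from sklearn.metrics import ConfusionMatrixDisplay
# Confusion matrix on the resampled training set (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(rf2, X_res2, y_res2)
plt.show()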

## XGBoost

In [None]:
xgb2= XGBClassifier()
xgb2.fit(X_res2, y_res2)

In [None]:
print(xgb2.score(X_res2, y_res2))
print(xgb2.score(x_test2, y_test2))

In [None]:
y_pred_train_xgb2 = xgb2.predict(X_res2)
acc_train_xgb2 = accuracy_score(y_res2, y_pred_train_xgb2)

y_pred_test_xgb2 = xgb2.predict(x_test2)
acc_test_xgb2 = accuracy_score(y_test2, y_pred_test_xgb2)
print(acc_train_xgb2)
print(acc_test_xgb2)

In [None]:
print(classification_report(y_test2, y_pred_test_xgb2))

In [None]:
print('Precision: %.3f' % precision_score(y_test2, y_pred_test_xgb2,average="micro"))
print('Recall: %.3f' % recall_score(y_test2, y_pred_test_xgb2,average="micro"))
print('F-measure: %.3f' % f1_score(y_test2, y_pred_test_xgb2,average="micro"))

In [None]:
y_pred_prob_xgb2 = xgb2.predict_proba(x_test2)
roc_auc_score_xgb2 = roc_auc_score(y_test2, y_pred_prob_xgb2, multi_class="ovr")
print('ROC AUC Score:',roc_auc_score_xgb2)

In [None]:
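from sklearn.metrics import ConfusionMatrixDisplay
# Confusion matrix on the resampled training set (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(xgb2, X_res2, y_res2)
plt.show()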

## 2. Feature selection by random forest

In [None]:
rf2 = RandomForestClassifier(random_state = 1, max_features = 'sqrt', n_jobs = 1, verbose = 1)
%time rf2.fit(x_train, y_train)
rf2.score(x_test, y_test)

In [None]:
print(rf2.score(x_train, y_train))
print(rf2.score(x_test, y_test))

In [None]:
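from sklearn.metrics import ConfusionMatrixDisplay
# Confusion matrix on the original training set (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(rf2, x_train, y_train)
plt.show()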

In [None]:
feature = pd.Series(rf2.feature_importances_, index = x_train.columns).sort_values(ascending = False)
print(feature)

In [None]:
plt.figure(figsize = (10,6))
sns.barplot(x = feature, y = feature.index)
plt.title("Feature Importance")
plt.xlabel('Score')
plt.ylabel('Features')
plt.show()

In [None]:
my_features=list(feature[feature>0.025].index)

In [None]:
my_features

## Modeling after feature selection

In [None]:
y3 = df['Diabetes_012']
x3 = df[my_features]

In [None]:
x_train3 , x_test3 , y_train3 , y_test3 = train_test_split(x3,y3 , test_size= 0.25 , random_state=42)

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res3, y_res3 = sm.fit_resample(x_train3, y_train3)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf3 = RandomForestClassifier(n_estimators = 100,max_depth=16,max_features=10)
rf3.fit(X_res3, y_res3)

In [None]:
print(rf3.score(X_res3, y_res3))
print(rf3.score(x_test3, y_test3))

In [None]:
y_pred_train_rf3 = rf3.predict(X_res3)
acc_train_rf3 = accuracy_score(y_res3, y_pred_train_rf3)

y_pred_test_rf3 = rf3.predict(x_test3)
acc_test_rf3 = accuracy_score(y_test3, y_pred_test_rf3)
print(acc_train_rf3)
print(acc_test_rf3)

In [None]:
print(classification_report(y_test3, y_pred_test_rf3))

In [None]:
print('Precision: %.3f' % precision_score(y_test3, y_pred_test_rf3,average="micro"))
print('Recall: %.3f' % recall_score(y_test3, y_pred_test_rf3,average="micro"))
print('F-measure: %.3f' % f1_score(y_test3, y_pred_test_rf3,average="micro"))

In [None]:
y_pred_prob_rf3 = rf3.predict_proba(x_test3)
roc_auc_score_rf3 = roc_auc_score(y_test3, y_pred_prob_rf3, multi_class="ovr")
print('ROC AUC Score:', roc_auc_score_rf3)

In [None]:
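from sklearn.metrics import ConfusionMatrixDisplay
# Confusion matrix on the resampled training set (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(rf3, X_res3, y_res3)
plt.show()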

## XGBoost

In [None]:
xgb3= XGBClassifier(max_depth=10)
xgb3.fit(X_res3, y_res3)

In [None]:
print(xgb3.score(X_res3, y_res3))
print(xgb3.score(x_test3, y_test3))

In [None]:
y_pred_train_xgb3 = xgb3.predict(X_res3)
acc_train_xgb3 = accuracy_score(y_res3, y_pred_train_xgb3)

y_pred_test_xgb3 = xgb3.predict(x_test3)
acc_test_xgb3 = accuracy_score(y_test3, y_pred_test_xgb3)
print(acc_train_xgb3)
print(acc_test_xgb3)

In [None]:
print(classification_report(y_test3, y_pred_test_xgb3))

In [None]:
print('Precision: %.3f' % precision_score(y_test3, y_pred_test_xgb3,average="micro"))
print('Recall: %.3f' % recall_score(y_test3, y_pred_test_xgb3,average="micro"))
print('F-measure: %.3f' % f1_score(y_test3, y_pred_test_xgb3,average="micro"))

In [None]:
y_pred_prob_xgb3 = xgb3.predict_proba(x_test3)
roc_auc_score_xgb3 = roc_auc_score(y_test3, y_pred_prob_xgb3, multi_class="ovr")
print('ROC AUC Score:',roc_auc_score_xgb3)

In [None]:
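from sklearn.metrics import ConfusionMatrixDisplay
# Confusion matrix on the resampled training set (plot_confusion_matrix was removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(xgb3, X_res3, y_res3)
plt.show()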

# Conclusion

**1. The major feature variables for diabetes are: HighBP, HighChol, BMI, Stroke, GenHlth, MentHlth, PhysHlth, Age, Education and Income.**

**2. The feature variables that increase the risk of diabetes together are: Smoking and HvyAlcoholConsump, Stroke and HeartDiseaseorAttack, HighBP and HighChol.**

**3. The feature variables with the least effect on diabetes, although they can help decrease the risk, are: PhysActivity, Fruits, Veggies, AnyHealthcare, CholCheck.**

**4. Because the data is imbalanced, the basic accuracy score was misleading, so I used more suitable evaluation metrics: precision, recall (sensitivity), F1 score, and AUC.**

**5. After preprocessing and resampling, the training data became balanced and the models' results improved.**

**6. There are clear differences in the metrics between the algorithms after resampling.**

**7. XGBoost is the best algorithm: it has the highest test score and AUC score.**

**8. After feature selection by chi2, keeping 70% of the features (14 features), the accuracy increased to 100%.**