#### Setup

The data is prepared the same way as before, using the same preprocessing steps.

In [1]:
import pandas as pd
import numpy as np
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
warnings.simplefilter(action='ignore')

In [2]:
df = pd.read_csv('diabetes_prediction_dataset.csv')
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [3]:
# One-hot encode the text columns (gender, smoking_history) into 0/1 indicator columns
dummies = pd.get_dummies(df[['gender', 'smoking_history']], dtype='int')
df = df.join(dummies)
df = df.drop(['gender', 'smoking_history'], axis=1)

In [4]:
df.head()

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,diabetes,gender_Female,gender_Male,gender_Other,smoking_history_No Info,smoking_history_current,smoking_history_ever,smoking_history_former,smoking_history_never,smoking_history_not current
0,80.0,0,1,25.19,6.6,140,0,1,0,0,0,0,0,0,1,0
1,54.0,0,0,27.32,6.6,80,0,1,0,0,1,0,0,0,0,0
2,28.0,0,0,27.32,5.7,158,0,0,1,0,0,0,0,0,1,0
3,36.0,0,0,23.45,5.0,155,0,1,0,0,0,1,0,0,0,0
4,76.0,1,1,20.14,4.8,155,0,0,1,0,0,1,0,0,0,0
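
The encoding step can be seen in isolation on a tiny hand-made frame. The three rows below are made up for illustration and are not taken from the dataset. In this sketch, `pd.get_dummies` creates one 0/1 column per category value and orders the new columns by sorted value, which is why the capitalized 'No Info' column appears before 'current' in the output above.

```python
import pandas as pd

# Tiny made-up frame, used only to illustrate the encoding step
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "smoking_history": ["never", "current", "No Info"],
    "age": [80.0, 28.0, 54.0],
})

# One indicator column per distinct value of each text column
toy_dummies = pd.get_dummies(toy[["gender", "smoking_history"]], dtype="int")

# Attach the indicators and drop the original text columns
toy_encoded = toy.join(toy_dummies).drop(["gender", "smoking_history"], axis=1)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

The resulting column order is the order in which any hand-built input has to supply its values.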


In [5]:
# Separate the features from the target, then hold out 20% of the rows as a test set
X = df.drop(['diabetes'], axis=1)
y = df['diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Scale the features, then fit a random forest on the training split
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('rfc', RandomForestClassifier(bootstrap=True, max_depth=None, min_samples_leaf=4,
                                   min_samples_split=10, n_estimators=200)),
]).fit(X_train, y_train)

#### Testing the Model with Non-Diabetic and Diabetic Cases

Two test cases are created by hand: one intended to be non-diabetic and one intended to be diabetic. Each case supplies the 15 feature columns of the prepared dataset, in the same order as X_train, and omits the 'diabetes' target. The model is then asked to predict the status of each case.

In [7]:
non_diabetic_case = [
    30,   # age
    0,    # hypertension
    0,    # heart_disease
    22.0, # bmi
    5.5,  # HbA1c_level
    100,  # blood_glucose_level
    1, 0, 0,  # gender: gender_Female=1, gender_Male=0, gender_Other=0
    0, 0, 0, 0, 1, 0  # smoking_history: No Info=0, current=0, ever=0, former=0, never=1, not current=0
]

diabetic_case = [
    60,   # age
    1,    # hypertension
    1,    # heart_disease
    32.5, # bmi
    8.0,  # HbA1c_level
    200,  # blood_glucose_level
    0, 1, 0,  # gender: gender_Female=0, gender_Male=1, gender_Other=0
    0, 1, 0, 0, 0, 0  # smoking_history: No Info=0, current=1, ever=0, former=0, never=0, not current=0
]
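
Because each case is a positional list, every value has to line up with the column order of the prepared dataset, and a misplaced 1 in the one-hot block silently changes the meaning of the case. As a hedged alternative, the sketch below builds a case by column name. `make_case` and `FEATURE_COLUMNS` are hypothetical names introduced here for illustration; the column list is copied from the `df.head()` output above.

```python
import pandas as pd

# Feature columns in the order shown by df.head() after encoding
FEATURE_COLUMNS = [
    "age", "hypertension", "heart_disease", "bmi", "HbA1c_level",
    "blood_glucose_level", "gender_Female", "gender_Male", "gender_Other",
    "smoking_history_No Info", "smoking_history_current",
    "smoking_history_ever", "smoking_history_former",
    "smoking_history_never", "smoking_history_not current",
]


def make_case(columns, gender, smoking_history, **numeric):
    """Hypothetical helper: build a one-row DataFrame keyed by column name."""
    row = {col: 0 for col in columns}
    for name, value in numeric.items():
        if name not in row:
            raise KeyError(f"unknown feature: {name}")
        row[name] = value
    # Set exactly one indicator per categorical feature
    for key in (f"gender_{gender}", f"smoking_history_{smoking_history}"):
        if key not in row:
            raise KeyError(f"unknown category column: {key}")
        row[key] = 1
    return pd.DataFrame([row], columns=columns)


non_diabetic_df = make_case(
    FEATURE_COLUMNS, gender="Female", smoking_history="never",
    age=30, hypertension=0, heart_disease=0,
    bmi=22.0, HbA1c_level=5.5, blood_glucose_level=100,
)
print(non_diabetic_df.T)
```

A frame built this way can be passed straight to `pipeline.predict`, and an unknown category raises an error instead of producing a mislabeled row.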

In [8]:
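# Wrap each case in a one-row DataFrame with the training column names,
# so the pipeline receives the features in the order it was fitted on
non_diabetic_input = pd.DataFrame([non_diabetic_case], columns=X_train.columns)
diabetic_input = pd.DataFrame([diabetic_case], columns=X_train.columns)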

# Make predictions
non_diabetic_prediction = pipeline.predict(non_diabetic_input)
diabetic_prediction = pipeline.predict(diabetic_input)

In [9]:
print("Non-Diabetic Case Prediction: ", "Diabetic" if non_diabetic_prediction[0] == 1 else "Non-Diabetic")
print("Diabetic Case Prediction: ", "Diabetic" if diabetic_prediction[0] == 1 else "Non-Diabetic")

Non-Diabetic Case Prediction:  Non-Diabetic
Diabetic Case Prediction:  Diabetic
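
The setup also holds out a test split (X_test, y_test) that the two hand-built cases do not use. The sketch below shows one way a fitted pipeline could be scored on held-out rows with accuracy and a confusion matrix. So that it runs without the CSV, it uses synthetic stand-in data from `make_classification` and a smaller forest; `evaluate`, `demo_pipeline`, and the `*_demo` variables are hypothetical names, and the numbers it prints say nothing about the diabetes model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same two-step structure as the notebook's pipeline, with fewer trees
demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("rfc", RandomForestClassifier(n_estimators=50, random_state=42)),
]).fit(X_tr, y_tr)


def evaluate(fitted_pipeline, X_holdout, y_holdout):
    """Return accuracy and a 2x2 confusion matrix on held-out rows."""
    predictions = fitted_pipeline.predict(X_holdout)
    accuracy = accuracy_score(y_holdout, predictions)
    # Rows are true classes, columns are predicted classes (0 then 1)
    matrix = confusion_matrix(y_holdout, predictions, labels=[0, 1])
    return accuracy, matrix


accuracy, matrix = evaluate(demo_pipeline, X_te, y_te)
print(f"Accuracy: {accuracy:.3f}")
print(matrix)
```

In the notebook itself, the equivalent call would be `evaluate(pipeline, X_test, y_test)`.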


#### Model Interpretation

Now that the model has classified both test cases correctly, we can check which features carry the most weight in its predictions.

In [10]:
# Access the trained RandomForestClassifier from the pipeline
rf_model = pipeline.named_steps['rfc']

# Get feature importances
feature_importances = rf_model.feature_importances_

# Create a DataFrame to display features and their importance
feature_names = X_train.columns
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

# Sort by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display all features, most important first
print(feature_importance_df)

                        Feature    Importance
4                   HbA1c_level  4.616447e-01
5           blood_glucose_level  3.661862e-01
0                           age  6.471996e-02
3                           bmi  6.150661e-02
1                  hypertension  1.751374e-02
2                 heart_disease  1.227708e-02
9       smoking_history_No Info  4.865724e-03
12       smoking_history_former  2.687846e-03
13        smoking_history_never  1.968476e-03
7                   gender_Male  1.666687e-03
6                 gender_Female  1.612688e-03
10      smoking_history_current  1.290324e-03
14  smoking_history_not current  1.090065e-03
11         smoking_history_ever  9.694957e-04
8                  gender_Other  3.510803e-07
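
One-hot encoding spreads the importance of 'gender' and 'smoking_history' across several indicator columns, so each looks small on its own. As a rough sketch, the reported importances can be summed back per original column. The values below are copied from the output above; `source_column` is a hypothetical helper, and summing indicator importances is only an approximate way to compare an encoded feature with a numeric one.

```python
import pandas as pd

# Importances copied from the printed table above
importances = pd.Series({
    "HbA1c_level": 4.616447e-01,
    "blood_glucose_level": 3.661862e-01,
    "age": 6.471996e-02,
    "bmi": 6.150661e-02,
    "hypertension": 1.751374e-02,
    "heart_disease": 1.227708e-02,
    "smoking_history_No Info": 4.865724e-03,
    "smoking_history_former": 2.687846e-03,
    "smoking_history_never": 1.968476e-03,
    "gender_Male": 1.666687e-03,
    "gender_Female": 1.612688e-03,
    "smoking_history_current": 1.290324e-03,
    "smoking_history_not current": 1.090065e-03,
    "smoking_history_ever": 9.694957e-04,
    "gender_Other": 3.510803e-07,
})


def source_column(feature):
    """Map an indicator column back to the column it was encoded from."""
    for prefix in ("gender", "smoking_history"):
        if feature.startswith(prefix + "_"):
            return prefix
    return feature


# Group by the original column name and sum the indicator importances
grouped = importances.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

With these numbers, the six smoking_history columns together sum to roughly 0.013 and the three gender columns to roughly 0.003, so the ordering of the top five features is unchanged.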


#### Conclusion

The fitted random forest relies on some features far more than others when predicting diabetes. The top five features, in order of importance, are HbA1c level, blood glucose level, age, BMI, and hypertension. HbA1c level and blood glucose level alone account for about 83% of the total importance (0.46 and 0.37), while each gender and smoking-history indicator contributes less than 1%.