# Description
This notebook trains a simple gradient boosting model on the diabetes dataset stored under `./models`. Despite the `xgb` in the output file name, the model is scikit-learn's `GradientBoostingClassifier`, not the xgboost library. We first clean the original dataset and then train on the cleaned version. We do not reuse the user study model because it relies on an older version of scikit-learn, and we preferred to use the newer one.

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
import pickle

In [30]:
df = pd.read_csv("./models/original_diabetes_dataset.csv")

In [31]:
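# Encode gender as integer category codes.
df['gender'] = df['gender'].astype('category').cat.codes
# Drop rows with unknown smoking history; .copy() avoids a SettingWithCopyWarning on the next assignment.
df = df[df['smoking_history'] != 'No Info'].copy()
# Binary flag: 0 = never smoked, 1 = any other smoking history.
df['smoking_history'] = df['smoking_history'].apply(lambda x: 1 if x != 'never' else 0)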
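
To see what the cleaning steps do, here is a minimal sketch on a toy frame. The rows are hypothetical, and the labels `Female`, `Male`, `Other`, `current`, and `former` are assumptions made for illustration, not a claim about what the real CSV contains.

```python
import pandas as pd

# Toy rows with hypothetical values, covering the two columns the cleaning step touches.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Other", "Female"],
    "smoking_history": ["never", "No Info", "current", "former"],
})

# Category codes follow the sorted category order: Female -> 0, Male -> 1, Other -> 2.
toy["gender"] = toy["gender"].astype("category").cat.codes

# Drop rows with unknown smoking history; .copy() gives an independent frame to modify.
toy = toy[toy["smoking_history"] != "No Info"].copy()

# Collapse the remaining categories to a binary flag: 0 = never smoked, 1 = anything else.
toy["smoking_history"] = (toy["smoking_history"] != "never").astype(int)

print(toy)
```

Note that the binary flag is lossy: every value other than `never` ends up as 1, so any distinction between those categories is no longer available to the model.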

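Balance the dataset by downsampling the negative (non-diabetic) class to 10,000 samples. This throws away a lot of data, so it is a bad idea if the goal is a well-performing model.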

In [32]:
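# Split by class so that only the negative class is downsampled.
diabetes_false_df = df[df['diabetes'] == 0]
diabetes_true_df = df[df['diabetes'] == 1]

# Keep a fixed random subset of 10,000 negatives; all positives are kept.
diabetes_false_df = diabetes_false_df.sample(n=10000, random_state=1)

df = pd.concat([diabetes_true_df, diabetes_false_df])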
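
One alternative to dropping rows is to keep every sample and reweight the classes when fitting. This is not what the notebook does. The sketch below uses synthetic data from `make_classification` as a stand-in for the real frame, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (roughly 90% negatives) standing in for the real frame.
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# "balanced" gives each sample a weight inversely proportional to its class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y_imb)

weighted_clf = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1, random_state=42)
weighted_clf.fit(X_imb, y_imb, sample_weight=weights)

# After reweighting, both classes carry the same total weight.
print(weights[y_imb == 0].sum(), weights[y_imb == 1].sum())
```

Whether reweighting beats downsampling on this dataset would have to be measured; the sketch only shows the mechanics.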

In [33]:
df['diabetes'].value_counts()

0    10000
1     7046
Name: diabetes, dtype: int64
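
The counts above are useful later for putting the accuracy into context. As a small sketch, the share of positives and the accuracy of a trivial "always predict the majority class" model can be computed directly from them. This is computed on the whole balanced frame, so the figure for the test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 10000, 1: 7046}

total = sum(counts.values())
positive_share = counts[1] / total

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total

print(f"positive share:    {positive_share:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```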

In [34]:
df.head()

    gender   age  hypertension  heart_disease  smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
6        0  44.0             0              0                0  19.31          6.5                  200         1
26       1  67.0             0              1                1  27.32          6.5                  200         1
38       1  50.0             1              0                1  27.32          5.7                  260         1
40       1  73.0             0              0                1  25.91          9.0                  160         1
53       0  53.0             0              0                1  27.32          7.0                  159         1


In [35]:
# Features are all columns except the target.
X = df.drop("diabetes", axis=1)
y = df["diabetes"]

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [36]:
# scikit-learn gradient boosting with 30 trees (not the xgboost library).
clf = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1, random_state=42).fit(X_train, y_train)

In [37]:
with open("./models/diabetes_xgb.pkl", "wb") as f:
    pickle.dump(clf, f)

In [38]:
with open('./models/diabetes_xgb.pkl', 'rb') as f:
    model = pickle.load(f)
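
Pickled scikit-learn models are tied to the library version that produced them, which is the reason the older user study model is not reused here. A minimal sketch of one way to guard against this is to store the version next to the model and check it on load. The `payload` layout and the synthetic data are assumptions for illustration, not part of this notebook.

```python
import pickle

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the real notebook uses the cleaned diabetes frame.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_clf = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1, random_state=42)
demo_clf.fit(X_demo, y_demo)

# Bundle the model with the scikit-learn version that produced it (hypothetical layout).
payload = {"model": demo_clf, "sklearn_version": sklearn.__version__}
blob = pickle.dumps(payload)

restored = pickle.loads(blob)
if restored["sklearn_version"] != sklearn.__version__:
    print("warning: model was pickled with a different scikit-learn version")

# A faithful round trip reproduces the predicted probabilities exactly.
same = np.array_equal(
    demo_clf.predict_proba(X_demo), restored["model"].predict_proba(X_demo)
)
print(same)
```

As with any pickle file, only load models from sources you trust, since unpickling can execute arbitrary code.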

In [41]:
print(log_loss(y_test, model.predict_proba(X_test)))
print(accuracy_score(y_test, model.predict(X_test)))

0.2516082536930491
0.8850439882697947
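
The first number is the log loss (lower is better) and the second is the accuracy. As a rough sketch of what the two metrics measure, they can be written out by hand for the binary case. The helper names, the clipping constant `eps`, and the toy labels and probabilities are hypothetical; scikit-learn's own implementations may handle edge cases differently.

```python
import math


def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, p_pos, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, p_pos))
    return hits / len(y_true)


# Hypothetical labels and probabilities, chosen only for illustration.
y_toy = [1, 0, 1, 0]
p_toy = [0.9, 0.2, 0.4, 0.1]

print(binary_log_loss(y_toy, p_toy))
print(accuracy(y_toy, p_toy))
```

Log loss penalizes confident wrong probabilities, while accuracy only looks at which side of the threshold a prediction falls on, so the two can disagree about which of two models is better.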
