# Kaggle Project: Diabetes prediction dataset

## Describe Your Dataset
**URL:** https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/data?select=diabetes_prediction_dataset.csv

**Task:**
1. Import the required libraries
2. Build and split the dataset
3. Define the models: Logistic Regression, Decision Tree, SVM, Neural Network
4. Train and validate each model
5. Evaluate final performance on the test data

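**Datasets**
: The full dataset is split into train, validation, and test sets at a ratio of 6:2:2.

* Train dataset: 60% of the full dataset after preprocessing

* Validation dataset: 20% of the full dataset after preprocessing

* Test dataset: 20% of the full dataset after preprocessing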

**Features(x):** gender, age, hypertension, heart_disease, smoking_history, bmi, HbA1c_level, blood_glucose_level

**Target(y):** diabetes

## Build My Model

### Data preprocessing

In [100]:
# Import the required libraries
import pandas as pd
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

In [101]:
# Load the data
diabetes_data = pd.read_csv('/kaggle/input/diabetes-prediction-dataset/diabetes_prediction_dataset.csv', header=0, encoding='euc-kr')
diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


### Inspect the data

In [102]:
diabetes_data.head()

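| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |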


In [103]:
for col in diabetes_data.columns:
    print(diabetes_data[col].value_counts())
    print('='*100)

gender
Female    58552
Male      41430
Other        18
Name: count, dtype: int64
age
80.00    5621
51.00    1619
47.00    1574
48.00    1568
53.00    1542
         ... 
0.48       83
1.00       83
0.40       66
0.16       59
0.08       36
Name: count, Length: 102, dtype: int64
hypertension
0    92515
1     7485
Name: count, dtype: int64
heart_disease
0    96058
1     3942
Name: count, dtype: int64
smoking_history
No Info        35816
never          35095
former          9352
current         9286
not current     6447
ever            4004
Name: count, dtype: int64
bmi
27.32    25495
23.00      103
27.12      101
27.80      100
24.96      100
         ...  
58.23        1
48.18        1
55.57        1
57.07        1
60.52        1
Name: count, Length: 4247, dtype: int64
HbA1c_level
6.6    8540
5.7    8413
6.5    8362
5.8    8321
6.0    8295
6.2    8269
6.1    8048
3.5    7662
4.8    7597
4.5    7585
4.0    7542
5.0    7471
8.8     661
8.2     661
9.0     654
7.5     643
6.8     642
7.0   

In [104]:
# Check for missing values
missing_values = diabetes_data.isnull().sum()
print(missing_values)

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64


In [105]:
diabetes_data.shape

(100000, 9)

In [106]:
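# Drop the 18 rows with gender 'Other'; .copy() avoids chained-assignment warnings in later cells
diabetes_data = diabetes_data[diabetes_data['gender'] != 'Other'].copy()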

In [107]:
diabetes_data.shape

(99982, 9)

In [108]:
diabetes_data['smoking_history'].value_counts()

smoking_history
No Info        35810
never          35092
former          9352
current         9286
not current     6439
ever            4003
Name: count, dtype: int64

In [109]:
# Replace 'No Info' in smoking_history with the mode of the remaining values
mode_smoking_history = diabetes_data['smoking_history'][diabetes_data['smoking_history'] != 'No Info'].mode()[0]
diabetes_data['smoking_history'] = diabetes_data['smoking_history'].replace('No Info', mode_smoking_history)

In [110]:
diabetes_data['smoking_history'].value_counts()

smoking_history
never          70902
former          9352
current         9286
not current     6439
ever            4003
Name: count, dtype: int64

In [111]:
# Collapse smoking_history into a binary flag (0 = never, 1 = any other category)
diabetes_data['smoking_history'] = diabetes_data['smoking_history'].replace({'never': 0, 'former': 1, 'current': 1, 'not current': 1, 'ever': 1})

In [112]:
diabetes_data['smoking_history'].value_counts()

smoking_history
0    70902
1    29080
Name: count, dtype: int64

In [113]:
_0_diabetes_data = diabetes_data[diabetes_data['smoking_history']==0]
print(_0_diabetes_data['smoking_history'].value_counts())

_1_diabetes_data = diabetes_data[diabetes_data['smoking_history']==1]
print(_1_diabetes_data['smoking_history'].value_counts())

print(diabetes_data['smoking_history'].value_counts())

smoking_history
0    70902
Name: count, dtype: int64
smoking_history
1    29080
Name: count, dtype: int64
smoking_history
0    70902
1    29080
Name: count, dtype: int64


In [114]:
# Convert the 'gender' column to numeric values
diabetes_data['gender'] = diabetes_data['gender'].replace({'Male': 1, 'Female': 0})

# Label-encode the 'smoking_history' column (commented out: the column was already mapped to 0/1 above)
# smoking_history_label_encoder = LabelEncoder()
# diabetes_data['smoking_history'] = smoking_history_label_encoder.fit_transform(diabetes_data['smoking_history'])

In [115]:
diabetes_data

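| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 80.0 | 0 | 1 | 0 | 25.19 | 6.6 | 140 | 0 |
| 1 | 0 | 54.0 | 0 | 0 | 0 | 27.32 | 6.6 | 80 | 0 |
| 2 | 1 | 28.0 | 0 | 0 | 0 | 27.32 | 5.7 | 158 | 0 |
| 3 | 0 | 36.0 | 0 | 0 | 1 | 23.45 | 5.0 | 155 | 0 |
| 4 | 1 | 76.0 | 1 | 1 | 1 | 20.14 | 4.8 | 155 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 0 | 80.0 | 0 | 0 | 0 | 27.32 | 6.2 | 90 | 0 |
| 99996 | 0 | 2.0 | 0 | 0 | 0 | 17.37 | 6.5 | 100 | 0 |
| 99997 | 1 | 66.0 | 0 | 0 | 1 | 27.83 | 5.7 | 155 | 0 |
| 99998 | 0 | 24.0 | 0 | 0 | 0 | 35.42 | 4.0 | 100 | 0 |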


In [116]:
print(diabetes_data.columns)

Index(['gender', 'age', 'hypertension', 'heart_disease', 'smoking_history',
       'bmi', 'HbA1c_level', 'blood_glucose_level', 'diabetes'],
      dtype='object')


### Model Construction

In [117]:
# Separate the data into features and target
X = diabetes_data.drop('diabetes', axis=1) 
y = diabetes_data['diabetes'] 

In [118]:
# Split the data into train, validation, and test sets (6:2:2), stratified on the target
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)
print("y_test shape:", y_test.shape)

X_train shape: (59989, 8)
X_val shape: (19996, 8)
X_test shape: (19997, 8)
y_train shape: (59989,)
y_val shape: (19996,)
y_test shape: (19997,)
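
The shapes above follow from how the two calls compose: holding out 40% of the rows and then halving the held-out part leaves a 60/20/20 split. Below is a minimal arithmetic sketch, assuming (as scikit-learn does for a fractional `test_size`) that the held-out count is rounded up and the remainder goes to the other side. The helper name `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_samples, first_test_size=0.4, second_test_size=0.5):
    """Return (train, val, test) row counts for a two-stage hold-out split."""
    # Stage 1: hold out a fraction of all rows (count rounded up).
    n_temp = math.ceil(first_test_size * n_samples)
    n_train = n_samples - n_temp
    # Stage 2: split the held-out rows into validation and test.
    n_test = math.ceil(second_test_size * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

# 99,982 rows remain after dropping gender == 'Other'.
n_train, n_val, n_test = split_sizes(99982)
print(n_train, n_val, n_test)
```

For the 99,982 rows kept after preprocessing this gives 59,989 / 19,996 / 19,997 rows, which agrees with the printed shapes.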


In [119]:
# Scale the continuous features (the scaler is fitted on the training set only)
scaler = MinMaxScaler()
columns_to_scale = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
X_train[columns_to_scale] = scaler.fit_transform(X_train[columns_to_scale])
X_val[columns_to_scale] = scaler.transform(X_val[columns_to_scale])
X_test[columns_to_scale] = scaler.transform(X_test[columns_to_scale])
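
Min-max scaling maps each value to `(x - min) / (max - min)`, with `min` and `max` learned from the training split and then reused for the validation and test splits. The sketch below illustrates this in plain Python with hypothetical age values (not taken from the dataset); the helper names `fit_min_max` and `apply_min_max` are illustrative only.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Rescale values with previously learned bounds: (x - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical ages, not taken from the dataset.
train_age = [20.0, 40.0, 80.0]
val_age = [10.0, 50.0]

lo, hi = fit_min_max(train_age)
train_scaled = apply_min_max(train_age, lo, hi)
val_scaled = apply_min_max(val_age, lo, hi)
print(train_scaled)
print(val_scaled)
```

Because the bounds come from the training split alone, a validation or test value outside the training range lands outside [0, 1]. That is expected and avoids leaking information from the held-out data into preprocessing.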

### Train Model & Select Model (logistic regression model)

In [120]:
# Initialize and train the model (logistic regression)
logistic_model = LogisticRegression(random_state = 42)
logistic_model.fit(X_train, y_train)

In [121]:
# Predict on the validation set (logistic regression)
y_val_pred = logistic_model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)

# Compute the binary cross-entropy (BCE) loss
y_val_prob = logistic_model.predict_proba(X_val)
val_bce_loss = log_loss(y_val, y_val_prob)
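
The BCE loss computed here is the mean negative log-likelihood of the true labels, `-(1/N) * sum(y * log(p) + (1 - y) * log(1 - p))`, where `p` is the predicted probability of class 1. The following is a minimal sketch of that formula on hypothetical labels and probabilities; the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions made for illustration.

```python
import math

def binary_cross_entropy(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood for binary labels and class-1 probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        # Clip probabilities so log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
p_positive = [0.1, 0.2, 0.7, 0.9]

bce = binary_cross_entropy(y_true, p_positive)
print(f"BCE: {bce:.4f}")
```

Confident correct predictions contribute little to the loss, while confident wrong ones are penalized heavily, so the loss complements accuracy by reflecting how well calibrated the probabilities are.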

In [122]:
# Evaluate model performance (logistic regression)
print("Validation Set Metrics:")
print(f"Accuracy: {val_accuracy:.2f}")
print(f"BCE Loss: {val_bce_loss:.2f}")
print("Classification Report:\n", val_classification_report)
print("Confusion Matrix:\n", val_confusion_matrix)

Validation Set Metrics:
Accuracy: 0.96
BCE Loss: 0.11
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98     18296
           1       0.89      0.63      0.73      1700

    accuracy                           0.96     19996
   macro avg       0.93      0.81      0.86     19996
weighted avg       0.96      0.96      0.96     19996

Confusion Matrix:
 [[18161   135]
 [  634  1066]]
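
The class-1 figures in the report can be recomputed directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. A short sketch using the validation counts printed above:

```python
# Validation confusion matrix from the output above
# (rows: true class, columns: predicted class).
tn, fp = 18161, 135
fn, tp = 634, 1066

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted diabetics, how many are diabetic
recall = tp / (tp + fn)      # of actual diabetics, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The high accuracy is driven largely by the majority class (18,296 of 19,996 validation rows are non-diabetic), which is why recall on class 1 is worth reading alongside it.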


### Performance

In [123]:
# Final evaluation on the test data (logistic regression)
y_test_pred = logistic_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)

# Compute the binary cross-entropy (BCE) loss
y_test_prob = logistic_model.predict_proba(X_test)
test_bce_loss = log_loss(y_test, y_test_prob)

print("\nTest Set Metrics:")
print(f"Accuracy: {test_accuracy:.2f}")
print(f"BCE Loss: {test_bce_loss:.2f}")
print("Classification Report:\n", test_classification_report)
print("Confusion Matrix:\n", test_confusion_matrix)


Test Set Metrics:
Accuracy: 0.96
BCE Loss: 0.11
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98     18297
           1       0.86      0.63      0.73      1700

    accuracy                           0.96     19997
   macro avg       0.92      0.81      0.85     19997
weighted avg       0.96      0.96      0.96     19997

Confusion Matrix:
 [[18128   169]
 [  623  1077]]


### Train Model & Select Model (Decision Tree model)

In [124]:
from sklearn.tree import DecisionTreeClassifier

# Initialize and train the model (Decision Tree)
decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)

In [125]:
# Predict on the validation set (Decision Tree)
y_val_pred = decision_tree_model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_classification_report = classification_report(y_val, y_val_pred)
val_confusion_matrix = confusion_matrix(y_val, y_val_pred)

In [126]:
# Evaluate model performance (Decision Tree)
print("Decision Tree Validation Set Metrics:")
print(f"Accuracy: {val_accuracy:.2f}")
print("Classification Report:\n", val_classification_report)
print("Confusion Matrix:\n", val_confusion_matrix)

Decision Tree Validation Set Metrics:
Accuracy: 0.95
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.97     18296
           1       0.72      0.73      0.73      1700

    accuracy                           0.95     19996
   macro avg       0.85      0.85      0.85     19996
weighted avg       0.95      0.95      0.95     19996

Confusion Matrix:
 [[17805   491]
 [  454  1246]]


### Performance

In [127]:
# Final evaluation on the test data (Decision Tree)
y_test_pred = decision_tree_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_classification_report = classification_report(y_test, y_test_pred)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)

print("\nDecision Tree Test Set Metrics:")
print(f"Accuracy: {test_accuracy:.2f}")
print("Classification Report:\n", test_classification_report)
print("Confusion Matrix:\n", test_confusion_matrix)


Decision Tree Test Set Metrics:
Accuracy: 0.95
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.97     18297
           1       0.72      0.74      0.73      1700

    accuracy                           0.95     19997
   macro avg       0.85      0.86      0.85     19997
weighted avg       0.95      0.95      0.95     19997

Confusion Matrix:
 [[17809   488]
 [  442  1258]]


### Train Model & Select Model (support vector model)

In [128]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [129]:
# Initialize and train the model (support vector model)
svm_model = SVC(kernel='linear', C=1.0, random_state=42)
svm_model.fit(X_train, y_train)

In [131]:
# Predict on the validation set (support vector model)
val_pred = svm_model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
val_report = classification_report(y_val, val_pred)
val_confusion = confusion_matrix(y_val, val_pred)

In [132]:
# Predict on the test set, then report validation performance (support vector model)
test_pred = svm_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
test_report = classification_report(y_test, test_pred)
test_confusion = confusion_matrix(y_test, test_pred)

print("Validation Set Metrics:")
print(f"Accuracy: {val_accuracy:.2f}")
print("Classification Report:\n", val_report)
print("Confusion Matrix:\n", val_confusion)

Validation Set Metrics:
Accuracy: 0.96
Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98     18296
           1       0.93      0.59      0.72      1700

    accuracy                           0.96     19996
   macro avg       0.95      0.79      0.85     19996
weighted avg       0.96      0.96      0.96     19996

Confusion Matrix:
 [[18222    74]
 [  695  1005]]


### Performance

In [133]:
# Report final performance on the test data (support vector model)
print("\nTest Set Metrics:")
print(f"Accuracy: {test_accuracy:.2f}")
print("Classification Report:\n", test_report)
print("Confusion Matrix:\n", test_confusion)


Test Set Metrics:
Accuracy: 0.96
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.98     18297
           1       0.92      0.60      0.73      1700

    accuracy                           0.96     19997
   macro avg       0.94      0.80      0.85     19997
weighted avg       0.96      0.96      0.96     19997

Confusion Matrix:
 [[18203    94]
 [  674  1026]]


### Model Construction (Neural Network)

In [141]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [142]:
# Separate the data into features and target
X = diabetes_data.drop('diabetes', axis=1)
y = diabetes_data['diabetes']

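In [143]:
# Split the data into train, validation, and test sets (6:2:2)
X_train, X_inter, y_train, y_inter = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_inter, y_inter, test_size=0.5, random_state=42)

In [144]:
# Scale after splitting, fitting the scaler on the training set only.
# The results are kept as DataFrames so that .values is still available
# for the tensor conversion below.
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_val = pd.DataFrame(scaler.transform(X_val), columns=X.columns, index=X_val.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)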

### Train Model & Select Model (Neural Network model)

In [147]:
# Convert the data to PyTorch tensors
X_train = torch.FloatTensor(X_train.values)
y_train = torch.FloatTensor(y_train.values).view(-1, 1)
X_val = torch.FloatTensor(X_val.values)
y_val = torch.FloatTensor(y_val.values).view(-1, 1)
X_test = torch.FloatTensor(X_test.values)
y_test = torch.FloatTensor(y_test.values).view(-1, 1)

In [148]:
# Combine features and targets with TensorDataset
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)
test_dataset = TensorDataset(X_test, y_test)

In [149]:
# Use DataLoader for mini-batch processing
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [150]:
class NeuralNetwork(nn.Module):
    def __init__(self, input_size):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        x = self.sigmoid(x)
        return x
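
For a sense of the model's size, the number of trainable parameters can be worked out by hand: each `nn.Linear(n_in, n_out)` layer holds `n_in * n_out` weights plus `n_out` biases. The sketch below assumes 8 input features, matching the 8 feature columns in the shapes printed earlier; the helper name `linear_params` is hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Assumed input width of 8 features, followed by the layer widths 64, 32, 1.
layer_sizes = [8, 64, 32, 1]
per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

With fewer than three thousand parameters and roughly sixty thousand training rows, this is a small network relative to the data.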

In [163]:
# Initialize and train the model (Neural Network)
input_size = X_train.shape[1]
model = NeuralNetwork(input_size)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model (Neural Network)
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item()}')

Epoch [10/50], Loss: 0.05682094395160675
Epoch [20/50], Loss: 0.07554303854703903
Epoch [30/50], Loss: 0.05462734401226044
Epoch [40/50], Loss: 0.2385360598564148
Epoch [50/50], Loss: 0.025082528591156006


In [164]:
# Evaluate model performance (Neural Network)
model.eval()
with torch.no_grad():
    val_accuracy = []
    val_true = []
    val_pred = []
    for inputs, targets in val_loader:
        outputs = model(inputs)
        predicted = (outputs > 0.5).float()
        accuracy = accuracy_score(targets.numpy(), predicted.numpy())
        val_accuracy.append(accuracy)
        val_true.extend(targets.numpy())
        val_pred.extend(predicted.numpy())

    val_accuracy = np.mean(val_accuracy)
    print(f'Validation Accuracy: {val_accuracy:.2f}')
    print("Validation Classification Report:\n", classification_report(val_true, val_pred))
    print("Validation Confusion Matrix:\n", confusion_matrix(val_true, val_pred))

Validation Accuracy: 0.96
Validation Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     18305
         1.0       0.99      0.52      0.68      1691

    accuracy                           0.96     19996
   macro avg       0.97      0.76      0.83     19996
weighted avg       0.96      0.96      0.95     19996

Validation Confusion Matrix:
 [[18298     7]
 [  818   873]]


### Performance

In [165]:
# Final evaluation on the test data (Neural Network)
test_accuracy = []
test_true = []
test_pred = []

model.eval()
with torch.no_grad():  # gradients are not needed for evaluation
    for inputs, targets in test_loader:
        outputs = model(inputs)
        predicted = (outputs > 0.5).float()
        accuracy = accuracy_score(targets.numpy(), predicted.numpy())
        test_accuracy.append(accuracy)
        test_true.extend(targets.numpy())
        test_pred.extend(predicted.numpy())

test_accuracy = np.mean(test_accuracy)
print(f'Test Accuracy: {test_accuracy:.2f}')
print("Test Classification Report:\n", classification_report(test_true, test_pred))
print("Test Confusion Matrix:\n", confusion_matrix(test_true, test_pred))

Test Accuracy: 0.96
Test Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     18247
         1.0       0.98      0.51      0.67      1750

    accuracy                           0.96     19997
   macro avg       0.97      0.76      0.82     19997
weighted avg       0.96      0.96      0.95     19997

Test Confusion Matrix:
 [[18226    21]
 [  855   895]]


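### Test accuracy by model
1. Logistic Regression: 0.96
2. Decision Tree: 0.95
3. Support Vector Machine: 0.96
4. Neural Network: 0.96

All four models reach a similar test accuracy (0.95 to 0.96). The **neural network model** was judged to be somewhat more suitable; it has the highest precision on the diabetic class (0.98), although also the lowest recall (0.51).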
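
Since the test accuracies are so close, the test confusion matrices reported in the sections above allow a finer comparison. The sketch below recomputes accuracy, precision, and recall for the diabetic class from those counts; the dictionary layout and the helper name `summarize` are illustrative assumptions.

```python
# Test confusion matrices reported above, as (tn, fp, fn, tp).
test_confusion = {
    "Logistic Regression": (18128, 169, 623, 1077),
    "Decision Tree": (17809, 488, 442, 1258),
    "Support Vector Machine": (18203, 94, 674, 1026),
    "Neural Network": (18226, 21, 855, 895),
}

def summarize(tn, fp, fn, tp):
    """Accuracy plus precision and recall for the positive (diabetic) class."""
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

summary = {name: summarize(*cm) for name, cm in test_confusion.items()}
for name, m in summary.items():
    print(f"{name:<24} acc={m['accuracy']:.3f} "
          f"prec={m['precision']:.3f} rec={m['recall']:.3f}")

best_recall = max(summary, key=lambda name: summary[name]["recall"])
best_precision = max(summary, key=lambda name: summary[name]["precision"])
```

Which model is preferable depends on the cost of each error type: the neural network produces the fewest false positives, while the decision tree misses the fewest diabetic cases.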