# Introduction

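Diabetes is a chronic disease whose global prevalence keeps rising; it lowers patients' quality of life and places a heavy burden on healthcare systems. Accurately predicting diabetes risk matters for early intervention, prevention, and the allocation of medical resources. The dataset used in this project was obtained from Heywhale.com and originates from Kaggle. It covers a wide range of patient health indicators: age, gender, ethnicity, socioeconomic status, education level, BMI, smoking and alcohol consumption, diet and sleep quality, family history, history of gestational diabetes, polycystic ovary syndrome, previous prediabetes, blood pressure, blood glucose, glycated hemoglobin (HbA1c), blood lipid levels, medication use, symptoms, quality-of-life score, exposure to heavy metals and chemicals, water quality, frequency of medical checkups, medication adherence, and health literacy, together with the final diabetes diagnosis. These data provide the basis for analyzing the factors associated with diabetes and for building a risk prediction model, which can help practitioners identify risk factors and design more targeted prevention and treatment strategies.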


# Import the required libraries

In [73]:
import warnings

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Ignore FutureWarning messages
warnings.filterwarnings("ignore", category=FutureWarning)

%matplotlib inline

In [66]:
# Ignore FutureWarning messages
warnings.filterwarnings("ignore", category=FutureWarning)

# Data exploration

## Data distribution analysis

Load the dataset:

In [23]:
file_path = "/home/mw/input/data3501/diabetes_data.csv"

df = pd.read_csv(file_path)

Inspect the shape of the dataset:

In [24]:
df.shape

(1879, 46)

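The dataset has 1,879 rows (samples) and 46 columns. Besides the predictive features, the columns include the identifier `PatientID`, the target `Diagnosis`, and `DoctorInCharge`.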

In [25]:
df.head()

| | PatientID | Age | Gender | Ethnicity | SocioeconomicStatus | EducationLevel | BMI | Smoking | AlcoholConsumption | PhysicalActivity | ... | TinglingHandsFeet | QualityOfLifeScore | HeavyMetalsExposure | OccupationalExposureChemicals | WaterQuality | MedicalCheckupsFrequency | MedicationAdherence | HealthLiteracy | Diagnosis | DoctorInCharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6000 | 44 | 0 | 1 | 2 | 1 | 32.985284 | 1 | 4.499365 | 2.443385 | ... | 1 | 73.765109 | 0 | 0 | 0 | 1.782724 | 4.48698 | 7.211349 | 1 | Confidential |
| 1 | 6001 | 51 | 1 | 0 | 1 | 2 | 39.916764 | 0 | 1.578919 | 8.301264 | ... | 0 | 91.445753 | 0 | 0 | 1 | 3.38107 | 5.961705 | 5.024612 | 1 | Confidential |
| 2 | 6002 | 89 | 1 | 0 | 1 | 3 | 19.782251 | 0 | 1.177301 | 6.103395 | ... | 0 | 54.485744 | 0 | 0 | 0 | 2.701019 | 8.950821 | 7.034944 | 0 | Confidential |
| 3 | 6003 | 21 | 1 | 1 | 1 | 2 | 32.376881 | 1 | 1.714621 | 8.645465 | ... | 0 | 77.866758 | 0 | 0 | 1 | 1.409056 | 3.124769 | 4.717774 | 0 | Confidential |
| 4 | 6004 | 27 | 1 | 0 | 1 | 3 | 16.8086 | 0 | 15.462549 | 4.629383 | ... | 0 | 37.731808 | 0 | 0 | 0 | 1.218452 | 6.977741 | 7.88794 | 0 | Confidential |


Check for missing values:

In [26]:
df.isnull().sum().sum()

0

The dataset contains no missing values.

Check the data type of each column:

In [27]:
df.dtypes

PatientID                          int64
Age                                int64
Gender                             int64
Ethnicity                          int64
SocioeconomicStatus                int64
EducationLevel                     int64
BMI                              float64
Smoking                            int64
AlcoholConsumption               float64
PhysicalActivity                 float64
DietQuality                      float64
SleepQuality                     float64
FamilyHistoryDiabetes              int64
GestationalDiabetes                int64
PolycysticOvarySyndrome            int64
PreviousPreDiabetes                int64
Hypertension                       int64
SystolicBP                         int64
DiastolicBP                        int64
FastingBloodSugar                float64
HbA1c                            float64
SerumCreatinine                  float64
BUNLevels                        float64
CholesterolTotal                 float64
CholesterolLDL  

In [28]:
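# Drop DoctorInCharge, which holds only the placeholder "Confidential" in the rows shown above
df = df.drop("DoctorInCharge", axis = 1)

In [33]:
# Drop the patient identifier, which is not a predictive feature
df = df.drop("PatientID", axis = 1)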

Next, look at the summary statistics:

In [34]:
df.describe()

| | Age | Gender | Ethnicity | SocioeconomicStatus | EducationLevel | BMI | Smoking | AlcoholConsumption | PhysicalActivity | DietQuality | ... | SlowHealingSores | TinglingHandsFeet | QualityOfLifeScore | HeavyMetalsExposure | OccupationalExposureChemicals | WaterQuality | MedicalCheckupsFrequency | MedicationAdherence | HealthLiteracy | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | ... | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 | 1879.0 |
| mean | 55.043108 | 0.487493 | 0.755721 | 0.992017 | 1.699308 | 27.687601 | 0.281533 | 10.096587 | 5.20079 | 4.895801 | ... | 0.102714 | 0.111229 | 48.508643 | 0.052155 | 0.103246 | 0.200639 | 1.997101 | 4.957539 | 5.011736 | 0.400213 |
| std | 20.515839 | 0.499977 | 1.047558 | 0.76494 | 0.885665 | 7.190975 | 0.449866 | 5.914216 | 2.857012 | 2.867144 | ... | 0.303666 | 0.3145 | 28.758488 | 0.2224 | 0.304361 | 0.400585 | 1.122632 | 2.910934 | 2.920908 | 0.490072 |
| min | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.025898 | 0.0 | 0.000928 | 0.004089 | 0.000885 | ... | 0.0 | 0.0 | 0.00239 | 0.0 | 0.0 | 0.0 | 0.004013 | 0.005384 | 0.000362 | 0.0 |
| 25% | 38.0 | 0.0 | 0.0 | 0.0 | 1.0 | 21.469981 | 0.0 | 4.789725 | 2.751022 | 2.476802 | ... | 0.0 | 0.0 | 23.974098 | 0.0 | 0.0 | 0.0 | 1.057801 | 2.420024 | 2.410113 | 0.0 |
| 50% | 55.0 | 0.0 | 0.0 | 1.0 | 2.0 | 27.722988 | 0.0 | 10.173865 | 5.249002 | 4.888566 | ... | 0.0 | 0.0 | 47.519693 | 0.0 | 0.0 | 0.0 | 1.98717 | 4.843886 | 5.035208 | 0.0 |
| 75% | 73.0 | 1.0 | 1.0 | 2.0 | 2.0 | 33.85646 | 1.0 | 15.285359 | 7.671402 | 7.356058 | ... | 0.0 | 0.0 | 72.883179 | 0.0 | 0.0 | 0.0 | 2.946019 | 7.513933 | 7.586865 | 1.0 |
| max | 90.0 | 1.0 | 3.0 | 2.0 | 3.0 | 39.998811 | 1.0 | 19.996231 | 9.993893 | 9.998677 | ... | 1.0 | 1.0 | 99.78853 | 1.0 | 1.0 | 1.0 | 3.999715 | 9.997165 | 9.993029 | 1.0 |


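Visualization:

- Categorical features: show the number of samples in each category and how the target variable is distributed across categories.
- Numeric features: visualize the distributions with histograms, violin plots, and similar charts.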
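The two kinds of plots listed above could be sketched as follows. This is only an illustration on a small synthetic frame, not the real dataset: the column names `Smoking`, `BMI`, and `Diagnosis` match the dataset, but the values are randomly generated, and the names `demo`, `counts`, and `groups` are made up for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real DataFrame (random values, illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Smoking": rng.integers(0, 2, size=200),
    "BMI": rng.uniform(15, 40, size=200),
    "Diagnosis": rng.integers(0, 2, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Categorical feature: number of samples per category, split by the target
counts = pd.crosstab(demo["Smoking"], demo["Diagnosis"])
counts.plot(kind="bar", ax=axes[0])
axes[0].set_title("Smoking by Diagnosis")

# Numeric feature: histogram of the overall distribution
axes[1].hist(demo["BMI"], bins=20)
axes[1].set_title("BMI distribution")

# Numeric feature: one violin per target class
groups = [demo.loc[demo["Diagnosis"] == k, "BMI"].values for k in (0, 1)]
axes[2].violinplot(groups, positions=[0, 1], showmedians=True)
axes[2].set_xticks([0, 1])
axes[2].set_title("BMI by Diagnosis")

plt.tight_layout()
plt.show()
```

On the real data, the same calls would take `df` in place of `demo`, looping over the categorical and numeric columns as needed.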

## Feature selection

**Correlation analysis:**

In [83]:
len(df.columns)

44

In [37]:
features_cols = ['Age', 'Gender', 'Ethnicity', 'SocioeconomicStatus', 'EducationLevel',
       'BMI', 'Smoking', 'AlcoholConsumption', 'PhysicalActivity',
       'DietQuality', 'SleepQuality', 'FamilyHistoryDiabetes',
       'GestationalDiabetes', 'PolycysticOvarySyndrome', 'PreviousPreDiabetes',
       'Hypertension', 'SystolicBP', 'DiastolicBP', 'FastingBloodSugar',
       'HbA1c', 'SerumCreatinine', 'BUNLevels', 'CholesterolTotal',
       'CholesterolLDL', 'CholesterolHDL', 'CholesterolTriglycerides',
       'AntihypertensiveMedications', 'Statins', 'AntidiabeticMedications',
       'FrequentUrination', 'ExcessiveThirst', 'UnexplainedWeightLoss',
       'FatigueLevels', 'BlurredVision', 'SlowHealingSores',
       'TinglingHandsFeet', 'QualityOfLifeScore', 'HeavyMetalsExposure',
       'OccupationalExposureChemicals', 'WaterQuality',
       'MedicalCheckupsFrequency', 'MedicationAdherence', 'HealthLiteracy']

In [45]:
# Compute the correlation matrix
corr_matrix = df[features_cols].corr()

In [49]:
# Draw the heatmap
plt.figure(figsize=(20, 15))  # adjust the figure size as needed
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)

# Set the title
plt.title('Correlation Heatmap')

# Show the figure
plt.show()
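With 43 features, an annotated heatmap can be hard to read, so a ranked list of the most strongly correlated pairs may be a useful complement. The sketch below is one possible way to build it; it uses a tiny synthetic frame (`demo`, with hypothetical columns `a`, `b`, `c`) instead of `corr_matrix`, and the column `b` is deliberately constructed to track `a`.

```python
import numpy as np
import pandas as pd

# Small synthetic frame (illustration only): "b" is built to follow "a" closely
rng = np.random.default_rng(0)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

corr = demo.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first
top_pairs = pairs.iloc[np.argsort(-pairs.abs().values)]
print(top_pairs.head(10))
```

Applied to the notebook, `corr_matrix` would take the place of `corr`.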

# Model building

## Train/test split

In [63]:
train_data, test_data = train_test_split(df, test_size=0.3, random_state=42)
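The split above is purely random. Since roughly 40% of the samples are positive (the mean of `Diagnosis` is 0.400213), one optional variation is a stratified split, which keeps the class ratio the same in both parts. The sketch below shows the idea on a toy frame; `demo`, `train_demo`, and `test_demo` are illustrative names, and the notebook itself does not use `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 60/40 class mix (illustration only)
demo = pd.DataFrame({
    "x": range(100),
    "Diagnosis": [0] * 60 + [1] * 40,
})

# stratify= makes the split preserve the class proportions of the given labels
train_demo, test_demo = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo["Diagnosis"]
)

# Share of positive samples in each part
print(train_demo["Diagnosis"].mean(), test_demo["Diagnosis"].mean())
```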

## Logistic regression model

Start with a logistic regression model:

In [67]:
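# Separate the features from the target variable
X = train_data.drop('Diagnosis', axis=1)
y = train_data['Diagnosis']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create the logistic regression model; liblinear supports both the l1 and l2 penalties
logreg = LogisticRegression(max_iter=1000, solver='liblinear')

# Parameter grid to search
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # inverse of the regularization strength
    'penalty': ['l1', 'l2']  # regularization type
}

# Create the GridSearchCV object with 5-fold cross-validation
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')

# Run the grid search on the training data
grid_search.fit(X_scaled, y)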

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=1000, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
                         'penalty': ['l1', 'l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
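In the cell above, the scaler is fitted on all training rows before cross-validation, so each validation fold contributes to the scaling statistics. A possible alternative, sketched here under the assumption of synthetic data from `make_classification` (the names `X_demo`, `y_demo`, `pipe`, and `search` are illustrative), is to wrap the scaler and the model in a `Pipeline` so that scaling is refitted inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features and labels (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    # liblinear supports both the l1 and l2 penalties
    ("logreg", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "logreg__C": [0.01, 0.1, 1, 10],
    "logreg__penalty": ["l1", "l2"],
}

# The whole pipeline is refitted on the training part of each of the 5 folds
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

For a dataset of this size the difference in scores is usually small, but the pipeline form keeps the cross-validated estimate free of information from the validation folds.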

In [68]:
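# Print the best parameter combination
print("Best parameters:", grid_search.best_params_)

# Print the mean cross-validated accuracy of the best parameter combination
print("Best cross-validated accuracy:", grid_search.best_score_)

# Retrieve the model that GridSearchCV refitted on the full training set with those parameters
best_model = grid_search.best_estimator_

Best parameters: {'C': 0.1, 'penalty': 'l1'}
Best cross-validated accuracy: 0.8334600760456273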


Inspect the confusion matrix on the training data:

In [71]:
# Predict on the training data
y_pred = best_model.predict(X_scaled)
y_pred_proba = best_model.predict_proba(X_scaled)[:, 1]

# Print the confusion matrix
conf_matrix = confusion_matrix(y, y_pred)
print("Confusion matrix:")
print(conf_matrix)

Confusion matrix:
[[682  98]
 [108 427]]


In [74]:
# Print the classification report
class_report = classification_report(y, y_pred)
print("Classification report:")
print(class_report)

Classification report:
              precision    recall  f1-score   support

           0       0.86      0.87      0.87       780
           1       0.81      0.80      0.81       535

    accuracy                           0.84      1315
   macro avg       0.84      0.84      0.84      1315
weighted avg       0.84      0.84      0.84      1315



In [72]:
# Plot the ROC curve and compute the AUC
fpr, tpr, thresholds = roc_curve(y, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
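The confusion matrix, classification report, and ROC curve above are computed on the same rows the model was fitted on, so they tend to be optimistic. A minimal sketch of a held-out evaluation follows. It assumes synthetic data from `make_classification` in place of `train_data` and `test_data`, and the names `X_tr`, `X_te`, `scaler_demo`, and `model` are illustrative. The scaler is fitted on the training rows only and then reused on the test rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the full dataset (illustration only)
X_all, y_all = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=42)

# Fit the scaler on the training rows only
scaler_demo = StandardScaler().fit(X_tr)

model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
model.fit(scaler_demo.transform(X_tr), y_tr)

# Reuse the training-set mean and standard deviation on the test rows
X_te_scaled = scaler_demo.transform(X_te)
test_acc = accuracy_score(y_te, model.predict(X_te_scaled))
test_auc = roc_auc_score(y_te, model.predict_proba(X_te_scaled)[:, 1])
print(test_acc, test_auc)
```

In the notebook, the equivalent step would apply `scaler.transform` to `test_data.drop('Diagnosis', axis=1)` and score `best_model` against `test_data['Diagnosis']`.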

In [79]:
best_model

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [80]:
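from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Fit an L1-regularized logistic regression with the best parameters found above
# (liblinear is set explicitly because it supports the l1 penalty)
logreg = LogisticRegression(penalty="l1", C=0.1, solver='liblinear')
logreg.fit(X_scaled, y)

# Select features with SelectFromModel; 'mean' keeps the features whose absolute
# coefficient is at least the mean absolute coefficient
sfm = SelectFromModel(logreg, threshold='mean')
X_selected = sfm.fit_transform(X_scaled, y)

# Print the selected features
selected_features = X.columns[sfm.get_support()]
print("Selected features:", selected_features)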

Selected features: Index(['Hypertension', 'FastingBloodSugar', 'HbA1c', 'FrequentUrination',
       'ExcessiveThirst', 'UnexplainedWeightLoss', 'BlurredVision'],
      dtype='object')
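To make the `threshold='mean'` rule concrete, the sketch below reproduces the selection by hand: a feature is kept when the absolute value of its coefficient is at least the mean absolute coefficient. It runs on synthetic data from `make_classification`, and `X_demo`, `y_demo`, `l1_model`, `selector`, and `manual_mask` are illustrative names rather than objects from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized training data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
selector = SelectFromModel(l1_model, threshold="mean").fit(X_demo, y_demo)

# Reproduce the rule by hand from the fitted estimator's coefficients
importance = np.abs(selector.estimator_.coef_).ravel()
manual_mask = importance >= importance.mean()

print(selector.get_support())
print(manual_mask)
```

Because the L1 penalty drives many coefficients to exactly zero, the mean is pulled down and only the features with clearly non-zero coefficients pass the threshold.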


## Model with the selected features

In [81]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import warnings

# Ignore FutureWarning messages
warnings.filterwarnings("ignore", category=FutureWarning)

# train_data is the training split created above; it holds the features and the target 'Diagnosis'

# Separate the features from the target variable
X = train_data.drop('Diagnosis', axis=1)
y = train_data['Diagnosis']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# L1-regularized logistic regression used to rank the features
logreg = LogisticRegression(penalty="l1", C=0.1, max_iter=1000, solver='liblinear')

# Select features with SelectFromModel
sfm = SelectFromModel(logreg, threshold='mean')
X_selected = sfm.fit_transform(X_scaled, y)

# Print the selected features
selected_features = X.columns[sfm.get_support()]
print("Selected features:", selected_features)

# Refit a logistic regression model on the selected features only
logreg_selected = LogisticRegression(max_iter=1000, solver='lbfgs')
logreg_selected.fit(X_selected, y)

# Predict on the training data
y_pred = logreg_selected.predict(X_selected)
y_pred_proba = logreg_selected.predict_proba(X_selected)[:, 1]

# Print the confusion matrix
conf_matrix = confusion_matrix(y, y_pred)
print("Confusion matrix:")
print(conf_matrix)

# Print the classification report
class_report = classification_report(y, y_pred)
print("Classification report:")
print(class_report)

# Plot the ROC curve and compute the AUC
fpr, tpr, thresholds = roc_curve(y, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Selected features: Index(['Hypertension', 'FastingBloodSugar', 'HbA1c', 'FrequentUrination',
       'ExcessiveThirst', 'UnexplainedWeightLoss', 'BlurredVision'],
      dtype='object')
Confusion matrix:
[[686  94]
 [119 416]]
Classification report:
              precision    recall  f1-score   support

           0       0.85      0.88      0.87       780
           1       0.82      0.78      0.80       535

    accuracy                           0.84      1315
   macro avg       0.83      0.83      0.83      1315
weighted avg       0.84      0.84      0.84      1315



In [88]:
import pandas as pd

# Reuses logreg_selected and selected_features from the previous cell.
# The inputs were standardized, so the coefficient magnitudes are comparable across features.
feature_importances = pd.Series(logreg_selected.coef_[0], index=selected_features)
print("Feature importances:\n", feature_importances)

Feature importances:
 Hypertension             0.567630
FastingBloodSugar        1.884523
HbA1c                    1.740358
FrequentUrination        0.563268
ExcessiveThirst          0.465295
UnexplainedWeightLoss    0.274633
BlurredVision            0.239278
dtype: float64
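Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. The sketch below applies this to the coefficients reported above; because the features were standardized before fitting, each odds ratio describes the change in the odds of a positive diagnosis per one standard deviation increase in that feature. The names `coefs` and `odds_ratios` are illustrative.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the "Feature importances" output above
coefs = pd.Series({
    "Hypertension": 0.567630,
    "FastingBloodSugar": 1.884523,
    "HbA1c": 1.740358,
    "FrequentUrination": 0.563268,
    "ExcessiveThirst": 0.465295,
    "UnexplainedWeightLoss": 0.274633,
    "BlurredVision": 0.239278,
})

# odds ratio = exp(coefficient); values above 1 mean higher odds of a positive diagnosis
odds_ratios = np.exp(coefs).sort_values(ascending=False)
print(odds_ratios.round(2))
```

All seven coefficients are positive, so every selected feature is associated with higher odds of diabetes in this model, with `FastingBloodSugar` and `HbA1c` contributing the most.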



## Tree model - decision tree

## Tree model - random forest