 1. Problem

Diabetes is a serious health issue that affects millions of people around the world.
This project analyzes a medical dataset collected from medical institutions in Iraq that includes measurements
such as cholesterol levels, BMI, long-term blood sugar (HbA1c), and other health indicators.

The goal is to use data mining techniques to better understand the patterns and factors that influence diabetes.
Specifically, we want to classify patients into three categories:
- Diabetic
- Non-Diabetic
- Predict-Diabetic (likely to develop diabetes)

This analysis can help identify early warning signs, support doctors in diagnosing patients more accurately,
and suggest lifestyle or medical changes to prevent complications.

 2. Data Mining Task

The dataset includes 1,000 records and 14 attributes such as age, gender, BMI, cholesterol types, and kidney function indicators.
The class label indicates whether the patient is diabetic, non-diabetic, or likely to become diabetic.

We are applying two main data mining techniques:

1. Classification:
To build a predictive model that assigns new patients, based on their health metrics, to the
correct diabetes category (Diabetic / Non-Diabetic / Predict-Diabetic).

2. Clustering:
To group patients with similar health profiles and discover hidden patterns in the data without
using the class label. This can reveal subgroups within the population that share common traits or risk levels.
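
The two techniques can be sketched as follows. This is only a minimal illustration on synthetic data, not the project's actual pipeline: the random forest classifier, k-means with three clusters, the three made-up features, and all numeric values are assumptions chosen for the example, since the text above does not name specific algorithms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset): three health metrics
# (think BMI, HbA1c, cholesterol) for three made-up patient groups.
rng = np.random.default_rng(0)
centers = {"N": [24.0, 5.0, 4.0], "P": [28.0, 6.0, 4.8], "Y": [31.0, 9.0, 5.5]}
X = np.vstack([rng.normal(c, 0.8, size=(100, 3)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 100)

# Classification: learn the label from the metrics, then score on held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# Clustering: group the same rows WITHOUT using y. Features are standardized
# first so that no single metric dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

print(f"held-out accuracy: {test_accuracy:.2f}")
print("cluster sizes:", np.bincount(cluster_labels))
```

Comparing `cluster_labels` against `y` afterwards is one way to see whether the unsupervised groups line up with the diagnostic categories.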

By combining both methods, we hope to uncover valuable insights into how different health indicators relate to diabetes risk.


3. Data

3.1 Dataset Overview
The dataset used in this project is the Diabetes Dataset from Kaggle. It contains 1,000 entries and 14 attributes related
to diabetes health indicators, collected from medical institutions in Iraq. The primary goal of the analysis is to classify
patients as Diabetic (Y), Non-Diabetic (N), or Predict-Diabetic (P) using the numeric and categorical medical attributes.

3.2 Attribute Description


| Attribute | Type                 | Description                                                            |
|-----------|----------------------|------------------------------------------------------------------------|
| ID        | Nominal              | Unique identifier for each data entry                                  |
| No_Pation | Nominal              | Internal patient number (not used in analysis)                         |
| Gender    | Nominal (Binary)     | Male or Female                                                         |
| AGE       | Numeric (Ratio)      | Patient's age in years                                                 |
| Urea      | Numeric (Ratio)      | Urea level in blood (mg/dL), indicates kidney function                 |
| Cr        | Numeric (Ratio)      | Creatinine level (kidney health indicator)                             |
| HbA1c     | Numeric (Ratio)      | Glycated hemoglobin, reflects average blood sugar over 2-3 months      |
| Chol      | Numeric (Ratio)      | Total cholesterol (mg/dL)                                              |
| TG        | Numeric (Ratio)      | Triglycerides level                                                    |
| HDL       | Numeric (Ratio)      | High-density lipoprotein ("good cholesterol")                          |
| LDL       | Numeric (Ratio)      | Low-density lipoprotein ("bad cholesterol")                            |
| VLDL      | Numeric (Ratio)      | Very low-density lipoprotein                                           |
| BMI       | Numeric (Ratio)      | Body Mass Index (kg/m²)                                                |
| CLASS     | Nominal (Multiclass) | Diabetic status: Diabetic (Y), Non-Diabetic (N), Predict-Diabetic (P) |
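
As a rough sketch of how the attributes in the table above might be split into model inputs and a target, the example below uses a tiny made-up DataFrame with a subset of the listed column names. The sample values, the decision to drop `ID` alongside `No_Pation`, and the `F`/`M` coding of `Gender` are assumptions for illustration, not facts taken from the dataset.

```python
import pandas as pd

# Tiny made-up sample that reuses column names from the attribute table.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "No_Pation": [101, 102, 103, 104],
    "Gender": ["F", "M", "M", "F"],   # assumed coding of Male/Female
    "AGE": [50, 26, 33, 45],
    "HbA1c": [4.9, 4.9, 7.1, 8.8],
    "BMI": [24.0, 23.0, 27.5, 31.0],
    "CLASS": ["N", "N", "P", "Y"],
})

# Identifiers carry no clinical information, so they are dropped along
# with the class label, which is kept separately as the target.
features = df.drop(columns=["ID", "No_Pation", "CLASS"])
target = df["CLASS"]

# Encode the binary nominal attribute as 0/1 (assumed mapping: F -> 0, M -> 1).
features["Gender"] = features["Gender"].map({"F": 0, "M": 1})

print(features.dtypes)
print(target.tolist())
```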

3.3 Data Summary
The dataset is clean with no missing values and consists mostly of continuous numeric variables. Here is a statistical summary of the key attributes:


| Attribute | Mean | Std Dev | Min  | Max  |
|-----------|------|---------|------|------|
| Age       | 53.5 | 8.8     | 20   | 74   |
| BMI       | 29.6 | 5.0     | 19.0 | 44.0 |
| HbA1c     | 8.28 | 2.53    | 0.9  | 14.0 |
| LDL       | 2.61 | 1.12    | 0.3  | 6.4  |
| HDL       | 1.20 | 0.66    | 0.2  | 3.9  |
| Urea      | 5.12 | 2.94    | 0.5  | 23.1 |

These values provide an initial sense of the patients' health characteristics. For example, the mean BMI of 29.6 sits just below the clinical obesity threshold of 30 kg/m², and the mean HbA1c of 8.28 is high (an HbA1c of 6.5% or above is a common diagnostic criterion for diabetes), which indicates poor long-term glucose control on average.
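
A summary of this shape can be produced with pandas roughly as follows. The DataFrame below holds a few made-up rows purely to keep the sketch runnable, so its output will not match the table above; on the real data, `df` would be the loaded dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "AGE": [50, 26, 33, 45, 60],
    "BMI": [24.0, 23.0, 27.5, 31.0, 29.0],
    "HbA1c": [4.9, 4.9, 7.1, 8.8, 10.2],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = df.isna().sum()

# One row per attribute, one column per statistic.
summary = df.agg(["mean", "std", "min", "max"]).T.round(2)
summary.columns = ["Mean", "Std Dev", "Min", "Max"]

print(missing_per_column)
print(summary)
```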

3.4 Class Distribution
The target attribute CLASS includes three categories:

- Y (Diabetic)
- N (Non-Diabetic)
- P (Predict-Diabetic)

Before modeling, we checked the class balance with `df['CLASS'].value_counts()` to make sure that no single class dominates training.

If the class distribution is highly imbalanced, resampling techniques (like SMOTE or undersampling) may be applied during preprocessing.
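
The balance check and one possible remedy can be sketched as below. Note the swap: this example uses plain random oversampling via `sklearn.utils.resample` rather than SMOTE, simply to stay within scikit-learn, and the small DataFrame is made up for illustration. In practice, resampling would normally be applied to the training split only, so that the test data keeps its original distribution.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up, deliberately imbalanced sample (NOT the real class counts).
df = pd.DataFrame({
    "HbA1c": [9.1, 8.4, 10.2, 7.9, 9.6, 8.8, 4.9, 5.1, 6.0],
    "CLASS": ["Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "P"],
})

# Absolute and relative class frequencies.
counts = df["CLASS"].value_counts()
shares = df["CLASS"].value_counts(normalize=True)

# Random oversampling: draw rows with replacement from each class
# until every class is as large as the biggest one.
majority_size = int(counts.max())
balanced_parts = [
    resample(group, replace=True, n_samples=majority_size, random_state=0)
    for _, group in df.groupby("CLASS")
]
balanced = pd.concat(balanced_parts, ignore_index=True)

print(counts)
print(shares.round(2))
print(balanced["CLASS"].value_counts())
```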


4. Data Preprocessing

 5. Data Mining Technique


 6. Evaluation and Comparison

 7. Findings


 8. References