In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = './data/diabetes.csv'
diabetes_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
diabetes_data.head()

## 1. Exploring the Dataset Structure

### 1.1 Dataset Structure

The `diabetes.csv` dataset contains various biomedical measurements aimed at predicting diabetes. Each row represents data from an individual, along with the outcome of a diabetes diagnosis. The dataset includes the following columns:

1. **Pregnancies**: Number of times pregnant.
2. **Glucose**: Plasma glucose concentration 2 hours into an oral glucose tolerance test.
3. **BloodPressure**: Diastolic blood pressure (mm Hg).
4. **SkinThickness**: Triceps skin fold thickness (mm).
5. **Insulin**: 2-hour serum insulin (µU/mL).
6. **BMI**: Body mass index (weight in kg/(height in m)^2).
7. **DiabetesPedigreeFunction**: Diabetes pedigree function (a measure of genetic influence on diabetes likelihood).
8. **Age**: Age (years).
9. **Outcome**: Binary result indicating whether the individual has diabetes (1: Yes, 0: No).
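
As a quick sanity check, the column list above can be compared against what pandas actually loads. The sketch below is only an illustration: it assumes the nine column names listed above, the two inline rows are made-up values standing in for the real file, and `expected_columns`, `sample`, and `missing` are names introduced here.

```python
import io
import pandas as pd

# Column names taken from the list above
expected_columns = [
    'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
    'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome',
]

# Made-up rows standing in for the real CSV file (illustration only)
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,110,70,25,90,28.5,0.35,31,0\n"
    "5,160,80,0,0,34.1,0.60,47,1\n"
)
sample = pd.read_csv(sample_csv)

# List any expected column that is absent from the loaded frame
missing = [col for col in expected_columns if col not in sample.columns]
print("Missing columns:", missing)

# Outcome should contain only the two class labels 0 and 1
outcome_is_binary = bool(sample['Outcome'].isin([0, 1]).all())
print("Outcome is binary:", outcome_is_binary)
```

The same two checks could be run on `diabetes_data` right after loading it.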

### 1.2 Data Preprocessing and Cleaning

#### 1.2.1 Checking for Missing Values

In [None]:
# Check for missing values
missing_values = diabetes_data.isnull().sum()
print("Actual Missing Values:\n", missing_values)

# Check for zero values in specific columns (as zero may indicate missing data)
columns_to_check = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
zero_values = (diabetes_data[columns_to_check] == 0).sum()
print("Zero Values (considered as missing data):\n", zero_values)
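
Raw zero counts are easier to judge as a share of all rows. A minimal sketch, using a made-up four-row frame (`toy` and `zero_share` are names introduced here, not part of the notebook):

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'Glucose': [0, 100, 120, 90],
    'Insulin': [0, 0, 80, 0],
})

# The mean of a boolean mask is the fraction of True values per column
zero_share = (toy == 0).mean() * 100
print(zero_share.round(1))
```

Applied to `diabetes_data[columns_to_check]`, the same expression would show which columns are affected most heavily.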

#### 1.2.2 Handling Missing Values

No `NaN` or null values are found in the dataset. However, a zero in Glucose, BloodPressure, SkinThickness, Insulin, or BMI is biologically implausible, so these zeros are treated as missing data and replaced with the median of their respective columns.

In [None]:
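# Replace zero values with the median of the non-zero values in each column,
# so that the placeholder zeros do not pull the median down
for col in columns_to_check:
    non_zero_median = diabetes_data.loc[diabetes_data[col] != 0, col].median()
    diabetes_data[col] = diabetes_data[col].replace(0, non_zero_median)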

# Verify replacement
zero_values_after = (diabetes_data[columns_to_check] == 0).sum()
print("Zero Values After Replacement:\n", zero_values_after)
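
How strongly placeholder zeros affect a median can be seen on a small made-up column. The sketch below is an illustration on invented numbers, not the notebook's data; `toy`, `median_with_zeros`, `median_without_zeros`, and `imputed` are names introduced here.

```python
import pandas as pd

# Made-up column in which half of the entries are placeholder zeros
toy = pd.Series([0, 0, 0, 100, 120, 140], name='Insulin')

# Median over all entries, zeros included
median_with_zeros = toy.median()

# Median over the plausible (non-zero) entries only
median_without_zeros = toy[toy != 0].median()

# Turn zeros into NaN, then fill with the median of the remaining values
# (Series.median skips NaN by default)
masked = toy.mask(toy == 0)
imputed = masked.fillna(masked.median())

print(median_with_zeros, median_without_zeros)
print(imputed.tolist())
```

When a large share of a column is zero, the two medians can differ substantially, so it is worth deciding explicitly which one is used as the fill value.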

#### 1.2.3 Statistical Summary

A statistical summary (count, mean, standard deviation, minimum, quartiles, and maximum) shows how each measurement is distributed and helps spot potential anomalies such as implausible minimum or maximum values:

In [None]:
# Statistical summary
statistics_summary = diabetes_data.describe()
statistics_summary

### 1.3 Visualizing the Data

#### 1.3.1 Histograms

Histograms visualize the distribution of each variable: where values concentrate, how widely they spread, and whether the distribution is skewed.

In [None]:
# Plot histograms for all features
diabetes_data.hist(bins=15, figsize=(15, 10), layout=(3, 3), edgecolor='black')
plt.suptitle('Distribution of Features in the Diabetes Dataset', fontsize=16)
plt.show()

#### 1.3.2 Boxplots

Boxplots display the median, interquartile range, and potential outliers for each variable. Points drawn beyond the whiskers are unusual values worth inspecting; they are not necessarily errors.

In [None]:
# Plot boxplots for all features
plt.figure(figsize=(15, 10))
sns.boxplot(data=diabetes_data)
plt.title('Boxplot of Features in the Diabetes Dataset')
plt.show()
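
The points a boxplot draws as outliers can also be listed numerically. The sketch below assumes the default whisker length of 1.5 times the interquartile range (IQR) used by matplotlib and seaborn boxplots; the values are made up, and `lower`, `upper`, and `outliers` are names introduced here.

```python
import pandas as pd

# Made-up values with one obvious extreme point
values = pd.Series([1, 2, 3, 4, 5, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the assumed default whisker rule)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are the ones plotted as individual points
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

Running the same steps per column of `diabetes_data` would give a count of flagged values for each feature.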

#### 1.3.3 Scatter Plots

Scatter plots are used to visualize the relationship between two variables. The `Outcome` variable is used to differentiate diabetic and non-diabetic individuals.

In [None]:
# Scatter plot for Glucose vs BMI
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Glucose', y='BMI', hue='Outcome', data=diabetes_data)
plt.title('Glucose vs BMI Scatter Plot')
plt.show()

# Scatter plot for Age vs BMI
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='BMI', hue='Outcome', data=diabetes_data)
plt.title('Age vs BMI Scatter Plot')
plt.show()

#### 1.3.4 Pair Plot

Pair plots provide a broader view of relationships between all pairs of features, highlighting general trends and interrelationships.

In [None]:
# Pairplot of all features
sns.pairplot(diabetes_data, hue='Outcome', diag_kind='kde')
# y > 1 lifts the title above the top row of subplots so it does not overlap them
plt.suptitle('Pairplot of All Features', fontsize=16, y=1.02)
plt.show()

### 1.4 Correlation Analysis and Heatmap

A correlation matrix measures the linear relationship between each pair of variables. The Pearson correlation coefficient (the default for `corr()`) ranges from -1 to +1: +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and values near 0 indicate little or no linear relationship.

In [None]:
# Correlation matrix
correlation_matrix = diabetes_data.corr()

# Plotting the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
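
To make the coefficient concrete, the sketch below computes Pearson's r by hand on a made-up pair of columns and compares it with the value returned by `DataFrame.corr()`. The data and the names `toy`, `r_manual`, and `r_pandas` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 5]})

# Deviations of each value from its column mean
dx = toy['x'] - toy['x'].mean()
dy = toy['y'] - toy['y'].mean()

# Pearson's r: sum of co-deviations divided by the product of the
# square roots of the summed squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient taken from the pandas correlation matrix
r_pandas = toy.corr().loc['x', 'y']

print(round(r_manual, 4), round(r_pandas, 4))
```

Each cell of the heatmap is this quantity for one pair of columns.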

#### Interpretation of Correlation Analysis

- **Glucose and Outcome**: Moderate positive correlation (0.47), the strongest of those listed here. Higher glucose levels are associated with a diabetes diagnosis, which makes glucose a key predictor.
- **BMI and Outcome**: Positive correlation (0.31). Higher BMI values are associated with a diabetes diagnosis, consistent with obesity being a risk factor.
- **Age and Outcome**: Weak positive correlation (0.24). Older individuals are somewhat more likely to have a diabetes diagnosis.
- **Pregnancies and Outcome**: Weak positive correlation (0.22). A higher number of pregnancies is slightly associated with a diabetes diagnosis.
- **BloodPressure and Outcome**: Very weak positive correlation (0.065). Blood pressure has almost no linear relationship with the diagnosis.

None of the correlations with `Outcome` listed above is negative: each of these variables tends to be higher among individuals diagnosed with diabetes.
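
Rather than reading the values off the heatmap, the correlations with `Outcome` can be pulled out of the matrix and ranked. A minimal sketch on a made-up six-row frame (the numbers are invented and do not reproduce the coefficients quoted above; `outcome_corr` is a name introduced here):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Glucose': [85, 90, 150, 160, 95, 170],
    'BMI': [22.0, 30.0, 28.0, 35.0, 24.0, 31.0],
    'Age': [25, 40, 30, 50, 45, 35],
    'Outcome': [0, 0, 1, 1, 0, 1],
})

# Take the Outcome column of the correlation matrix, drop its
# self-correlation, and sort from strongest to weakest
outcome_corr = (
    toy.corr()['Outcome']
    .drop('Outcome')
    .sort_values(ascending=False)
)
print(outcome_corr.round(2))
```

The same expression on `correlation_matrix['Outcome']` would give the ranking for the actual dataset, including the features not listed above.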

### 1.5 General Conclusions

- **Glucose**, **BMI**, and **Age** show the strongest correlations with `Outcome` (0.47, 0.31, and 0.24) and are natural candidate predictors of diabetes.
- **Insulin** and **SkinThickness** show lower correlations but are still noteworthy.
- **Pregnancies** and **DiabetesPedigreeFunction** have less obvious effects, but they may still carry information about diabetes risk.

These analyses provide a foundation for further modeling and deeper understanding of diabetes prediction.

### 1.6 Saving Cleaned Data

In [None]:
# Save the cleaned dataset to a new CSV file
cleaned_data_file = 'cleaned_diabetes_data.csv'
diabetes_data.to_csv(cleaned_data_file, index=False)
print(f"Cleaned data has been saved to {cleaned_data_file}")