{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory Data Analysis for Diabetes Prediction\n",
    "\n",
    "This notebook explores the Pima Indians Diabetes Dataset to understand its characteristics and prepare for model development."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import os\n",
    "from sklearn.impute import SimpleImputer, KNNImputer\n",
    "\n",
    "# Set plot style\n",
    "plt.style.use('seaborn-whitegrid')\n",
    "sns.set(font_scale=1.2)\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load and Inspect the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "data_path = os.path.join('..', 'data', 'diabetes.csv')\n",
    "data = pd.read_csv(data_path)\n",
    "\n",
    "# Display basic information\n",
    "print(f\"Dataset shape: {data.shape}\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check data types and other information\n",
    "data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "print(\"Missing values per column:\")\n",
    "data.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Identify Zero Values in Physiologically Impossible Fields"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for zeros in columns where zero is not a valid physiological value\n",
    "zero_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']\n",
    "\n",
    "for column in zero_columns:\n",
    "    zero_count = (data[column] == 0).sum()\n",
    "    print(f\"{column}: {zero_count} zeros ({zero_count/len(data)*100:.2f}%)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize zeros in these columns\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.heatmap((data[zero_columns] == 0).transpose(), \n",
    "            cmap='YlOrRd', \n",
    "            cbar_kws={'label': 'Is Zero'},\n",
    "            yticklabels=zero_columns)\n",
    "plt.title('Zero Values in Physiologically Impossible Fields')\n",
    "plt.xlabel('Row Index')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Explore Target Variable Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Target variable distribution\n",
    "plt.figure(figsize=(8, 6))\n",
    "ax = sns.countplot(x='Outcome', data=data)\n",
    "plt.title('Distribution of Diabetes Outcome')\n",
    "plt.xlabel('Outcome (0 = No Diabetes, 1 = Diabetes)')\n",
    "\n",
    "# Add count and percentage labels\n",
    "total = len(data)\n",
    "for p in ax.patches:\n",
    "    height = p.get_height()\n",
    "    ax.text(p.get_x() + p.get_width()/2., height + 5,\n",
    "            f'{height} ({height/total*100:.1f}%)', \n",
    "            ha=\"center\", fontsize=12)\n",
    "\n",
    "plt.show()\n",
    "\n",
    "print(f\"Percentage of diabetic cases: {data['Outcome'].mean() * 100:.2f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Feature Distributions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the distribution of each feature\n",
    "plt.figure(figsize=(15, 10))\n",
    "for i, column in enumerate(data.columns[:-1], 1):\n",
    "    plt.subplot(3, 3, i)\n",
    "    sns.histplot(data[column], kde=True)\n",
    "    plt.title(f'Distribution of {column}')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for skewness in distributions\n",
    "skewness = data.skew()\n",
    "print(\"Skewness of features:\")\n",
    "print(skewness)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Box plots by outcome\n",
    "plt.figure(figsize=(15, 10))\n",
    "for i, column in enumerate(data.columns[:-1], 1):\n",
    "    plt.subplot(3, 3, i)\n",
    "    sns.boxplot(x='Outcome', y=column, data=data)\n",
    "    plt.title(f'{column} by Outcome')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Statistical comparison between groups\n",
    "print(\"Feature statistics by outcome:\")\n",
    "for column in data.columns[:-1]:\n",
    "    diabetic_mean = data[data['Outcome'] == 1][column].mean()\n",
    "    non_diabetic_mean = data[data['Outcome'] == 0][column].mean()\n",
    "    difference = diabetic_mean - non_diabetic_mean\n",
    "    print(f\"{column}: Diabetic mean = {diabetic_mean:.2f}, Non-diabetic mean = {non_diabetic_mean:.2f}, Difference = {difference:.2f} ({difference/non_diabetic_mean*100:.1f}%)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Correlation Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Correlation matrix\n",
    "plt.figure(figsize=(12, 10))\n",
    "correlation_matrix = data.corr()\n",
    "mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', mask=mask)\n",
    "plt.title('Correlation Matrix')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature correlations with target\n",
    "plt.figure(figsize=(12, 6))\n",
    "correlation_with_target = pd.DataFrame(\n",
    "    {'correlation': data.corr()['Outcome'].drop('Outcome')}\n",
    ").sort_values('correlation', ascending=False)\n",
    "\n",
    "sns.barplot(x=correlation_with_target.index, y='correlation', data=correlation_with_target)\n",
    "plt.xticks(rotation=45)\n",
    "plt.title('Feature Correlations with Diabetes Outcome')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"Correlations with Outcome (sorted):\")\n",
    "print(correlation_with_target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Feature Relationships"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scatter plot for Glucose vs BMI colored by outcome\n",
    "plt.figure(figsize=(10, 8))\n",
    "sns.scatterplot(x='Glucose', y='BMI', hue='Outcome', data=data, palette='viridis', alpha=0.7)\n",
    "plt.title('Glucose vs BMI by Diabetes Outcome')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scatter plot for Age vs Glucose colored by outcome\n",
    "plt.figure(figsize=(10, 8))\n",
    "sns.scatterplot(x='Age', y='Glucose', hue='Outcome', data=data, palette='viridis', alpha=0.7)\n",
    "plt.title('Age vs Glucose by Diabetes Outcome')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pairplot for key features\n",
    "key_features = ['Glucose', 'BMI', 'Age', 'Insulin', 'DiabetesPedigreeFunction', 'Outcome']\n",
    "sns.pairplot(data[key_features], hue='Outcome', palette='Set1')\n",
    "plt.suptitle('Pairplot of Key Features', y=1.02)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Handle Missing Values (Zeros)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace zeros with NaN for columns where zero is not a valid value\n",
    "data_processed = data.copy()\n",
    "for column in zero_columns:\n",
    "    data_processed[column] = data_processed[column].replace(0, np.nan)\n",
    "\n",
    "print(\"Missing values after replacing zeros with NaN:\")\n",
    "data_processed.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize missing data\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.heatmap(data_processed.isnull(), cbar=False, yticklabels=False, cmap='viridis')\n",
    "plt.title('Missing Value Map (after replacing zeros with NaN)')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check if missing values are related to outcome\n",
    "for column in zero_columns:\n",
    "    missing_outcome_1 = data_processed[data_processed['Outcome'] == 1][column].isnull().mean() * 100\n",
    "    missing_outcome_0 = data_processed[data_processed['Outcome'] == 0][column].isnull().mean() * 100\n",
    "    print(f\"{column}: Missing in diabetic patients: {missing_outcome_1:.1f}%, Missing in non-diabetic patients: {missing_outcome_0:.1f}%\")"
   ]
  },
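  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The percentages above show whether missingness differs between the two outcome groups, but not whether the difference is larger than chance would produce. As a rough check, the sketch below runs a chi-squared test of independence between a missing-value indicator and `Outcome` for each column. It assumes `data_processed` and `zero_columns` from the cells above and that SciPy is installed. For columns with very few missing values the expected counts are small, so the p-values there should be read with caution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch (not part of the original analysis)\n",
    "import pandas as pd\n",
    "from scipy.stats import chi2_contingency\n",
    "\n",
    "for column in zero_columns:\n",
    "    # Contingency table: is the value missing? vs. diabetes outcome\n",
    "    table = pd.crosstab(data_processed[column].isnull(), data_processed['Outcome'])\n",
    "    chi2, p_value, dof, expected = chi2_contingency(table)\n",
    "    print(f'{column}: chi2 = {chi2:.2f}, p = {p_value:.4f}, smallest expected count = {expected.min():.1f}')"
   ]
  },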
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Imputation Strategy Comparison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create copies for different imputation strategies\n",
    "data_median = data_processed.copy()\n",
    "data_knn = data_processed.copy()\n",
    "data_mean = data_processed.copy()\n",
    "\n",
    "# Median imputation\n",
    "imputer_median = SimpleImputer(strategy='median')\n",
    "data_median.iloc[:, :-1] = imputer_median.fit_transform(data_processed.iloc[:, :-1])\n",
    "\n",
    "# Mean imputation\n",
    "imputer_mean = SimpleImputer(strategy='mean')\n",
    "data_mean.iloc[:, :-1] = imputer_mean.fit_transform(data_processed.iloc[:, :-1])\n",
    "\n",
    "# KNN imputation\n",
    "imputer_knn = KNNImputer(n_neighbors=5)\n",
    "data_knn.iloc[:, :-1] = imputer_knn.fit_transform(data_processed.iloc[:, :-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare distributions before and after imputation\n",
    "for column in zero_columns:\n",
    "    plt.figure(figsize=(15, 5))\n",
    "    \n",
    "    # Original data (excluding zeros)\n",
    "    plt.subplot(1, 4, 1)\n",
    "    sns.histplot(data[data[column] > 0][column], kde=True, color='blue')\n",
    "    plt.title(f'Original {column} (non-zero)')\n",
    "    \n",
    "    # Mean imputed\n",
    "    plt.subplot(1, 4, 2)\n",
    "    sns.histplot(data_mean[column], kde=True, color='orange')\n",
    "    plt.title(f'Mean Imputed {column}')\n",
    "    \n",
    "    # Median imputed\n",
    "    plt.subplot(1, 4, 3)\n",
    "    sns.histplot(data_median[column], kde=True, color='green')\n",
    "    plt.title(f'Median Imputed {column}')\n",
    "    \n",
    "    # KNN imputed\n",
    "    plt.subplot(1, 4, 4)\n",
    "    sns.histplot(data_knn[column], kde=True, color='red')\n",
    "    plt.title(f'KNN Imputed {column}')\n",
    "    \n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare correlations after different imputation methods\n",
    "corr_original = data[data['Insulin'] > 0].corr()['Outcome']['Insulin']\n",
    "corr_median = data_median.corr()['Outcome']['Insulin']\n",
    "corr_mean = data_mean.corr()['Outcome']['Insulin']\n",
    "corr_knn = data_knn.corr()['Outcome']['Insulin']\n",
    "\n",
    "print(f\"Correlation between Insulin and Outcome:\")\n",
    "print(f\"Original (non-zero only): {corr_original:.4f}\")\n",
    "print(f\"After median imputation: {corr_median:.4f}\")\n",
    "print(f\"After mean imputation: {corr_mean:.4f}\")\n",
    "print(f\"After KNN imputation: {corr_knn:.4f}\")"
   ]
  },
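  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The histograms above compare the imputation strategies visually. As a rough numeric complement, the sketch below tabulates the mean, standard deviation, and skewness of each affected column for the observed (non-missing) values and for each imputed copy. The helper name `summarize_imputation` is introduced here for illustration only; it assumes `data_processed`, `data_mean`, `data_median`, `data_knn`, and `zero_columns` from the cells above. Rows whose statistics stay close to the `observed` row indicate a strategy that distorts that column less."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative helper (not part of the original analysis)\n",
    "import pandas as pd\n",
    "\n",
    "def summarize_imputation(observed, imputed_frames, columns):\n",
    "    \"\"\"Return mean/std/skew per column for the observed values and each imputed frame.\"\"\"\n",
    "    rows = []\n",
    "    for column in columns:\n",
    "        # pandas skips NaN by default, so the 'observed' statistics use non-missing values only\n",
    "        sources = {'observed': observed[column]}\n",
    "        sources.update({name: frame[column] for name, frame in imputed_frames.items()})\n",
    "        for name, series in sources.items():\n",
    "            rows.append({\n",
    "                'column': column,\n",
    "                'source': name,\n",
    "                'mean': series.mean(),\n",
    "                'std': series.std(),\n",
    "                'skew': series.skew(),\n",
    "            })\n",
    "    return pd.DataFrame(rows).set_index(['column', 'source'])\n",
    "\n",
    "imputation_summary = summarize_imputation(\n",
    "    data_processed,\n",
    "    {'mean': data_mean, 'median': data_median, 'knn': data_knn},\n",
    "    zero_columns,\n",
    ")\n",
    "imputation_summary.round(2)"
   ]
  },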
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Feature Engineering Ideas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a copy for feature engineering experiments\n",
    "data_featured = data_knn.copy()  # Using KNN imputed data\n",
    "\n",
    "# Create ratio features\n",
    "data_featured['Glucose_to_BMI_Ratio'] = data_featured['Glucose'] / data_featured['BMI']\n",
    "data_featured['Insulin_to_Glucose_Ratio'] = data_featured['Insulin'] / data_featured['Glucose']\n",
    "\n",
    "# Create interaction features\n",
    "data_featured['Age_BMI_Interaction'] = data_featured['Age'] * data_featured['BMI'] / 100\n",
    "data_featured['Glucose_Age_Interaction'] = data_featured['Glucose'] * data_featured['Age'] / 100\n",
    "\n",
    "# Log transform skewed features\n",
    "data_featured['Insulin_Log'] = np.log1p(data_featured['Insulin'])\n",
    "data_featured['DiabetesPedigreeFunction_Log'] = np.log1p(data_featured['DiabetesPedigreeFunction'])\n",
    "\n",
    "# BMI categories according to WHO\n",
    "bins = [0, 18.5, 25, 30, 35, 100]\n",
    "labels = ['Underweight', 'Normal', 'Overweight', 'Obese_I', 'Obese_II_III']\n",
    "data_featured['BMI_Category'] = pd.cut(data_featured['BMI'], bins=bins, labels=labels)\n",
    "\n",
    "# Age groups\n",
    "age_bins = [0, 30, 45, 60, 100]\n",
    "age_labels = ['Young', 'Middle_Aged', 'Senior', 'Elderly']\n",
    "data_featured['Age_Group'] = pd.cut(data_featured['Age'], bins=age_bins, labels=age_labels)\n",
    "\n",
    "# Blood pressure categories\n",
    "bp_bins = [0, 60, 80, 90, 120, 200]\n",
    "bp_labels = ['Low', 'Normal', 'Elevated', 'High_Stage1', 'High_Stage2']\n",
    "data_featured['BP_Category'] = pd.cut(data_featured['BloodPressure'], bins=bp_bins, labels=bp_labels)\n",
    "\n",
    "# Display the new features\n",
    "print(\"New features created:\")\n",
    "new_features = set(data_featured.columns) - set(data_knn.columns)\n",
    "print(new_features)\n",
    "data_featured[list(new_features) + ['Outcome']].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze relationships between new features and outcome\n",
    "# For numeric features\n",
    "numeric_new_features = ['Glucose_to_BMI_Ratio', 'Insulin_to_Glucose_Ratio', 'Age_BMI_Interaction', \n",
    "                       'Glucose_Age_Interaction', 'Insulin_Log', 'DiabetesPedigreeFunction_Log']\n",
    "\n",
    "plt.figure(figsize=(18, 10))\n",
    "for i, column in enumerate(numeric_new_features, 1):\n",
    "    plt.subplot(2, 3, i)\n",
    "    sns.boxplot(x='Outcome', y=column, data=data_featured)\n",
    "    plt.title(f'{column} by Outcome')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For categorical features\n",
    "plt.figure(figsize=(18, 6))\n",
    "\n",
    "plt.subplot(1, 3, 1)\n",
    "sns.countplot(x='BMI_Category', hue='Outcome', data=data_featured)\n",
    "plt.title('Diabetes Outcome by BMI Category')\n",
    "plt.xticks(rotation=45)\n",
    "\n",
    "plt.subplot(1, 3, 2)\n",
    "sns.countplot(x='Age_Group', hue='Outcome', data=data_featured)\n",
    "plt.title('Diabetes Outcome by Age Group')\n",
    "\n",
    "plt.subplot(1, 3, 3)\n",
    "sns.countplot(x='BP_Category', hue='Outcome', data=data_featured)\n",
    "plt.title('Diabetes Outcome by BP Category')\n",
    "plt.xticks(rotation=45)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate percentage of diabetic cases in each categorical group\n",
    "for cat_feature in ['BMI_Category', 'Age_Group', 'BP_Category']:\n",
    "    print(f\"\\nPercentage of diabetic cases by {cat_feature}:\")\n",
    "    diabetic_pct = data_featured.groupby(cat_feature)['Outcome'].mean() * 100\n",
    "    counts = data_featured.groupby(cat_feature).size()\n",
    "    result = pd.DataFrame({'Count': counts, 'Diabetic_Percentage': diabetic_pct})\n",
    "    print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Feature Selection Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert categorical variables to dummy variables\n",
    "data_featured_encoded = pd.get_dummies(data_featured, columns=['BMI_Category', 'Age_Group', 'BP_Category'], drop_first=False)\n",
    "\n",
    "# Correlation analysis with new features\n",
    "plt.figure(figsize=(14, 8))\n",
    "correlation_with_target = pd.DataFrame(\n",
    "    {'correlation': data_featured_encoded.corr()['Outcome'].drop('Outcome')}\n",
    ").sort_values('correlation', ascending=False)\n",
    "\n",
    "top_corr = correlation_with_target.head(15)\n",
    "sns.barplot(x=top_corr.index, y='correlation', data=top_corr)\n",
    "plt.xticks(rotation=90)\n",
    "plt.title('Top 15 Features Correlated with Diabetes Outcome')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print all correlations with outcome\n",
    "print(\"All features sorted by correlation with outcome:\")\n",
    "print(correlation_with_target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for high correlation between features\n",
    "plt.figure(figsize=(18, 16))\n",
    "correlation_matrix = data_featured_encoded.drop('Outcome', axis=1).corr()\n",
    "mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))\n",
    "sns.heatmap(correlation_matrix, mask=mask, cmap='coolwarm', annot=False, \n",
    "            vmax=1.0, vmin=-1.0, linewidths=0.5)\n",
    "plt.title('Correlation Matrix of Features')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Identify highly correlated features (|correlation| > 0.8)\n",
    "corr_matrix = correlation_matrix.abs()\n",
    "upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))\n",
    "high_corr_features = [(upper.index[i], upper.columns[j], upper.iloc[i, j]) \n",
    "                     for i, j in zip(*np.where(upper > 0.8))]\n",
    "\n",
    "print(\"Highly correlated features (|correlation| > 0.8):\")\n",
    "for feat1, feat2, corr in high_corr_features:\n",
    "    print(f\"{feat1} & {feat2}: {corr:.3f}\")"
   ]
  },
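  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One simple way to act on the pairs listed above is to drop every feature that is highly correlated with a feature appearing earlier in the column order. This is only a heuristic sketch: it does not consider which member of a pair is more predictive, and the names `threshold`, `features_to_drop`, and `data_reduced` are introduced here for illustration. It reuses `upper` and `data_featured_encoded` from the cells above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Heuristic sketch (not part of the original analysis)\n",
    "threshold = 0.8\n",
    "\n",
    "# `upper` holds only the entries above the diagonal, so each column lists its\n",
    "# absolute correlations with the columns that come before it\n",
    "features_to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]\n",
    "data_reduced = data_featured_encoded.drop(columns=features_to_drop)\n",
    "\n",
    "print(f'Dropping {len(features_to_drop)} features: {features_to_drop}')\n",
    "print(f'Remaining columns (including Outcome): {data_reduced.shape[1]}')"
   ]
  },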
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11. Data Preprocessing Strategy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summarize the preprocessing strategy based on our EDA\n",
    "preprocessing_steps = [\n",
    "    \"1. Replace zeros with NaN in columns where zero is physiologically impossible\",\n",
    "    \"2. Use KNN imputation for missing values as it preserves distributions better\",\n",
    "    \"3. Create ratio features (Glucose_to_BMI_Ratio, etc.) to capture interactions\",\n",
    "    \"4. Apply log transformations to skewed features (Insulin, DiabetesPedigreeFunction)\",\n",
    "    \"5. Create categorical features from continuous variables (BMI_Category, Age_Group)\",\n",
    "    \"6. Remove highly correlated features to avoid multicollinearity\",\n",
    "    \"7. Apply feature scaling as features have different ranges\",\n",
    "    \"8. Consider class imbalance techniques (SMOTE) as the dataset has more non-diabetic cases\"\n",
    "]\n",
    "\n",
    "print(\"Recommended Data Preprocessing Strategy:\")\n",
    "for i, step in enumerate(preprocessing_steps, 1):\n",
    "    print(step)"
   ]
  },
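  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The core of the preprocessing steps listed above can be combined in a scikit-learn `Pipeline`, so that scaling and imputation are fitted only on the training folds during cross-validation and no information leaks from the validation folds. The sketch below is a baseline under assumptions, not a tuned model: it omits the engineered features, uses logistic regression with `class_weight='balanced'` as a stand-in for resampling, and the fold count and random seed are arbitrary choices. Scaling comes before imputation so that the KNN distances are comparable across features; `StandardScaler` ignores NaN when fitting and passes it through."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Baseline sketch under assumptions (not part of the original analysis)\n",
    "import numpy as np\n",
    "from sklearn.impute import KNNImputer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import StratifiedKFold, cross_validate\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Start from the raw data and mark impossible zeros as missing\n",
    "X = data.drop(columns='Outcome').copy()\n",
    "X[zero_columns] = X[zero_columns].replace(0, np.nan)\n",
    "y = data['Outcome']\n",
    "\n",
    "baseline = Pipeline([\n",
    "    ('scale', StandardScaler()),\n",
    "    ('impute', KNNImputer(n_neighbors=5)),\n",
    "    ('model', LogisticRegression(class_weight='balanced', max_iter=1000)),\n",
    "])\n",
    "\n",
    "# Stratified folds keep the class ratio similar in every split\n",
    "cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n",
    "metrics = ['recall', 'f1', 'roc_auc']\n",
    "scores = cross_validate(baseline, X, y, cv=cv, scoring=metrics)\n",
    "\n",
    "for metric in metrics:\n",
    "    values = scores[f'test_{metric}']\n",
    "    print(f'{metric}: {values.mean():.3f} +/- {values.std():.3f}')"
   ]
  },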
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12. Key Insights and Recommendations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Key Findings:\n",
    "\n",
    "1. **Missing Data**: Several features contain zero values that are physiologically impossible, particularly Insulin (374 zeros, 48.7%), SkinThickness (227 zeros, 29.6%), and BloodPressure (35 zeros, 4.6%). These should be treated as missing values.\n",
    "\n",
    "2. **Class Imbalance**: The dataset has an imbalance with approximately 65% negative cases (non-diabetic) and 35% positive cases (diabetic).\n",
    "\n",
    "3. **Important Features**: Glucose shows the strongest correlation with diabetes outcome, followed by BMI, Age, and DiabetesPedigreeFunction.\n",
    "\n",
    "4. **Feature Engineering**: Creating new features like Glucose-to-BMI ratio and age-BMI interaction could improve model performance. Categorical features (BMI categories and age groups) also show clear patterns with diabetes outcome.\n",
    "\n",
    "5. **Imputation Strategy**: KNN imputation appears to preserve the distribution of the data better than median or mean imputation, especially for features like Insulin and SkinThickness.\n",
    "\n",
    "6. **Feature Correlations**: Some engineered features show stronger correlations with the outcome than original features. For example, Glucose_Age_Interaction and BMI_Category_Obese_II_III have high correlations with diabetes outcome.\n",
    "\n",
    "7. **Multicollinearity**: Several features are highly correlated with each other, particularly the engineered features derived from the same original features. This could affect model performance and should be addressed.\n",
    "\n",
    "8. **Skewed Distributions**: Features like Insulin and DiabetesPedigreeFunction show significant skewness, suggesting that log transformations might be beneficial for modeling.\n",
    "\n",
    "### Recommendations for Modeling:\n",
    "\n",
    "1. **Data Preprocessing**:\n",
    "   - Use KNN imputation for missing values\n",
    "   - Apply feature scaling due to different ranges\n",
    "   - Consider SMOTE or other techniques to address class imbalance\n",
    "   - Apply log transformations to skewed features\n",
    "\n",
    "2. **Feature Engineering**:\n",
    "   - Include ratio features (Glucose-to-BMI, Insulin-to-Glucose)\n",
    "   - Use interaction terms between key features (Age×BMI, Glucose×Age)\n",
    "   - Create categorical features from BMI, Age, and Blood Pressure\n",
    "   - Consider polynomial features for key variables like Glucose and BMI\n",
    "\n",
    "3. **Feature Selection**:\n",
    "   - Remove highly correlated features to avoid multicollinearity\n",
    "   - Focus on features with stronger correlation to the target\n",
    "   - Consider using feature selection methods like Recursive Feature Elimination\n",
    "   - Evaluate feature importance from tree-based models\n",
    "\n",
    "4. **Model Selection**:\n",
    "   - Try tree-based models (Random Forest, Gradient Boosting) which handle non-linear relationships well\n",
    "   - Compare with logistic regression as a baseline\n",
    "   - Consider SVM with non-linear kernels\n",
    "   - Use cross-validation to ensure robust performance estimation\n",
    "   - Ensemble different models for potentially better performance\n",
    "\n",
    "5. **Model Evaluation**:\n",
    "   - Use multiple metrics beyond accuracy (precision, recall, F1-score, ROC-AUC)\n",
    "   - Pay special attention to sensitivity (recall) as missing a potential diabetic case is more costly\n",
    "   - Analyze confusion matrices to understand error patterns\n",
    "   - Use SHAP or LIME for model interpretability\n",
    "\n",
    "6. **Clinical Relevance**:\n",
    "   - Develop risk categories (low, moderate, high) based on probability thresholds\n",
    "   - Create actionable insights for healthcare providers\n",
    "   - Consider different models for different age groups or BMI categories"
   ]
  },
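  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The clinical-relevance recommendation above mentions risk categories based on probability thresholds. A minimal sketch of that mapping follows. The function name `assign_risk_category` and the thresholds 0.3 and 0.6 are placeholders chosen for illustration, not clinically validated cut-offs; in practice they would be set from the validation data and the relative cost of false negatives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch (not part of the original analysis)\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def assign_risk_category(probabilities, thresholds=(0.3, 0.6)):\n",
    "    \"\"\"Map predicted probabilities in [0, 1] to 'low', 'moderate', or 'high'.\"\"\"\n",
    "    low, high = thresholds\n",
    "    # Intervals are [0, low], (low, high], (high, 1]\n",
    "    bins = [0.0, low, high, 1.0]\n",
    "    return pd.cut(np.asarray(probabilities, dtype=float), bins=bins,\n",
    "                  labels=['low', 'moderate', 'high'], include_lowest=True)\n",
    "\n",
    "# Example with made-up probabilities\n",
    "assign_risk_category([0.05, 0.30, 0.45, 0.90])"
   ]
  },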
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13. Next Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. **Implement the preprocessing pipeline** based on findings from this analysis\n",
    "   - Replace zeros with NaNs in appropriate columns\n",
    "   - Apply KNN imputation\n",
    "   - Create engineered features\n",
    "   - Scale features\n",
    "\n",
    "2. **Engineer and select features** as recommended\n",
    "   - Create all proposed features\n",
    "   - Remove highly correlated features\n",
    "   - Select most important features\n",
    "\n",
    "3. **Address class imbalance**\n",
    "   - Apply SMOTE to balance the training data\n",
    "   - Consider different sampling techniques\n",
    "\n",
    "4. **Train and compare multiple models**, including:\n",
    "   - Logistic Regression\n",
    "   - Random Forest\n",
    "   - Gradient Boosting\n",
    "   - XGBoost\n",
    "   - SVM\n",
    "\n",
    "5. **Tune hyperparameters** for the best performing models\n",
    "   - Use grid search or random search\n",
    "   - Apply cross-validation\n",
    "\n",
    "6. **Evaluate models** using appropriate metrics\n",
    "   - Accuracy, precision, recall, F1-score, ROC-AUC\n",
    "   - Confusion matrices\n",
    "   - Learning curves\n",
    "\n",
    "7. **Interpret model results** using tools like SHAP for explainability\n",
    "   - Understand feature importance\n",
    "   - Analyze individual predictions\n",
    "   - Create visualizations for interpretability\n",
    "\n",
    "8. **Develop a risk stratification system**\n",
    "   - Define risk categories based on prediction probabilities\n",
    "   - Create actionable recommendations for each risk level\n",
    "\n",
    "9. **Document findings and methodology**\n",
    "   - Write up the research paper\n",
    "   - Include visualizations and insights from this analysis\n",
    "   - Compare results with existing literature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 14. Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This exploratory data analysis has provided valuable insights into the Pima Indians Diabetes Dataset. We've identified key patterns, relationships, and challenges that will inform our modeling approach. The dataset contains several features with strong predictive potential, particularly glucose levels, BMI, and age.\n",
    "\n",
    "Missing values pose a significant challenge, especially in the Insulin and SkinThickness features, but KNN imputation appears to be an effective solution. Feature engineering opportunities are abundant, with several derived features showing promising relationships with diabetes outcomes.\n",
    "\n",
    "By implementing the recommended preprocessing steps and modeling strategies, we can develop a robust diabetes risk prediction model that provides actionable insights for healthcare providers. This model has the potential to identify individuals at high risk of developing diabetes, enabling timely interventions that could prevent or delay the disease's onset."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}