<a href="https://colab.research.google.com/github/fareehanezam/GNCIPL_Projects/blob/main/GNCIPL_Project2_Diabetes_Risk_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**

Diabetes Risk Analysis

# **Project Summary**
The primary goal of this analysis is to explore the PIMA Indian Diabetes Dataset to identify key risk factors and patterns associated with diabetes. This notebook details the process of data acquisition, cleaning, exploratory data analysis (EDA), and statistical testing to uncover significant relationships between physiological attributes and diabetes outcomes. Key areas of focus include the impact of **Glucose levels**, **Body Mass Index (BMI)**, and **Age**, as well as their interaction effects.

# **GitHub Link**

https://github.com/fareehanezam/GNCIPL_Projects/blob/main/GNCIPL_Project2_Diabetes_Risk_Analysis.ipynb

# **Problem Statement**

This project aims to analyze the PIMA Indian Diabetes Dataset to identify key risk factors and patterns associated with diabetes. By performing exploratory data analysis, data cleaning, and statistical testing, the goal is to understand the relationships between physiological attributes such as Glucose, BMI, and Age, and the likelihood of diabetes. Furthermore, the project investigates the interaction effects between these factors to provide insights into how their combinations influence diabetes risk. The ultimate objective is to leverage these findings for better risk assessment and potentially inform strategies for prevention and early detection of diabetes within the PIMA population.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import shutil
import zipfile
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.graph_objects as go
import plotly.express as px

In [3]:
# --- Kaggle API Setup ---
kaggle_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
os.makedirs(kaggle_dir, exist_ok=True)

kaggle_json_src = "/content/drive/MyDrive/.kaggle/kaggle.json"
kaggle_json_dest = os.path.join(kaggle_dir, "kaggle.json")
shutil.copyfile(kaggle_json_src, kaggle_json_dest)
os.chmod(kaggle_json_dest, 0o600)

In [4]:
# --- Fetching and Loading Data ---
!kaggle datasets download -d uciml/pima-indians-diabetes-database -q
with zipfile.ZipFile('pima-indians-diabetes-database.zip', 'r') as zip_ref:
    zip_ref.extractall('.')
df = pd.read_csv('diabetes.csv')

print("Dataset loaded successfully.")

Dataset URL: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
License(s): CC0-1.0
Dataset loaded successfully.


## 2. Initial Data Inspection and Cleaning
Before analysis, it's crucial to inspect the data for structural issues, missing values, and inconsistencies.

### Initial Data Inspection

In [5]:
print("Initial DataFrame Info:")
df.info()

Initial DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


* **Dataset Size:** The dataset contains 768 entries (rows), indexed from 0 to 767.
* **Columns:** There are 9 columns in total.
* **Non-Null Counts**: All columns show a "Non-Null Count" of 768, which means there are no explicitly missing values (represented as NaN) in any of the columns at this stage. However, Glucose, BloodPressure, SkinThickness, Insulin, and BMI contain values of 0 that are physiologically implausible and stand in for missing data; these are handled in the cleaning step.
* **Data Types:** The columns have either integer (int64) or floating-point (float64) data types.
* **Memory Usage:** The DataFrame occupies approximately 54.1 KB of memory.

In summary, the df.info() output tells us that the dataset is relatively small and complete in terms of non-null values, but it highlights the need for further data cleaning to address the '0' values that represent missing data in certain columns.

In [6]:
print("\nDescriptive Statistics:")
display(df.describe())


Descriptive Statistics:


| Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


* **count:** This row shows the number of non-null values for each column. In this case, all columns have 768 non-null values, confirming the observation from df.info().
* **mean:** This is the average value for each column. For example, the average age in the dataset is approximately 33.24 years, and the average BMI is about 31.99.
* **std:** This is the standard deviation, which measures the spread or dispersion of the data. A higher standard deviation indicates that the data points are more spread out from the mean.
* **min:** This shows the minimum value in each column. Notably, several columns (Glucose, BloodPressure, SkinThickness, Insulin, and BMI) have a minimum value of 0. As mentioned before, these zeros are likely placeholders for missing data and will need to be handled.
* **25% (Q1):** This is the first quartile, representing the 25th percentile of the data.
* **50% (Median):** This is the median, or the middle value, when the data is sorted.
* **75% (Q3):** This is the third quartile, representing the 75th percentile of the data.
* **max:** This shows the maximum value in each column.

From these statistics, we can get a sense of the central tendency, variability, and range of each feature. The presence of 0 in the minimums of some columns is a clear indicator that data cleaning is required to address these missing values.
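Before cleaning, it can help to quantify how many zeros each affected column actually holds. The sketch below is a minimal illustration on an invented toy DataFrame (the values and the name `toy` are assumptions, not the real dataset); on the notebook's `df` the same two expressions could be applied to the five columns named above.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are invented for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 1, 0, 0],  # a zero is a legitimate value in this column
})

# Columns in which a zero is physiologically implausible.
zero_as_missing = ["Glucose", "Insulin", "BMI"]

# Boolean mask of zeros: summing counts them, averaging gives their share.
is_zero = toy[zero_as_missing] == 0
summary = pd.DataFrame({
    "zeros": is_zero.sum(),
    "percent": is_zero.mean().mul(100).round(1),
})
print(summary)
```

Columns with a large share of zeros deserve extra caution, because any single imputed value will then dominate their distribution.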

### Data Cleaning

In [7]:
columns_with_zeros_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[columns_with_zeros_as_missing] = df[columns_with_zeros_as_missing].replace(0, np.nan)

medians = df[columns_with_zeros_as_missing].median()
df[columns_with_zeros_as_missing] = df[columns_with_zeros_as_missing].fillna(medians)

print("\nData after imputing zero values:")
display(df.describe())


Data after imputing zero values:


| Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 121.65625 | 72.386719 | 29.108073 | 140.671875 | 32.455208 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.438286 | 12.096642 | 8.791221 | 86.38306 | 6.875177 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.75 | 64.0 | 25.0 | 121.5 | 27.5 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 29.0 | 125.0 | 32.3 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


After replacing the zero values with NaN and then imputing those NaN values with the median for each of the specified columns (Glucose, BloodPressure, SkinThickness, Insulin, and BMI), the descriptive statistics provide a clearer picture of the data distribution without the misleading zeros.

* **Minimum Values:** The most significant change is in the 'min' row for the columns where zeros were replaced. Instead of 0, we now see more realistic minimum values (e.g., Glucose: 44.0, BloodPressure: 24.0, SkinThickness: 7.0, Insulin: 14.0, BMI: 18.2). This confirms that the imputation successfully addressed the zero values that represented missing data.
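* **Mean and Median:** The shifts are comparatively small for Glucose, BloodPressure, and BMI (for example, mean Glucose moves from 120.89 to 121.66) but large for the two columns with many zeros: mean SkinThickness rises from 20.54 to 29.11 and mean Insulin from 79.80 to 140.67, while the Insulin median rises from 30.5 to 125.0. The medians now reflect the non-zero data points only.
* **Standard Deviation:** The standard deviation falls in every affected column (for example, Insulin from 115.24 to 86.38 and SkinThickness from 15.95 to 8.79). Part of this comes from removing the artificial zeros, and part is a side effect of imputation: every filled-in entry takes the same median value, which compresses the spread. For Insulin, the quartiles now sit at 121.5 and 127.25, a very narrow band around the imputed median of 125.0.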
* **Count:** The 'count' row still shows 768 for all columns, indicating that there are no missing values after the imputation process.


In essence, the descriptive statistics after imputation demonstrate that the data cleaning step was successful in handling the zero values that were acting as placeholders for missing data, providing a more accurate representation of the central tendency and spread of these features.
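The mechanics of the zero-to-NaN-to-median step, and its side effect on the spread, can be seen on a handful of values. This is a minimal sketch on an invented series (the numbers and variable names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Invented insulin-like values; three zeros stand in for missing readings.
insulin = pd.Series([0, 0, 0, 90, 120, 150, 400], name="Insulin", dtype="float64")

# Step 1: mark zeros as missing so they no longer count as observations.
with_nan = insulin.replace(0, np.nan)

# Step 2: the median skips NaN, so it is computed from observed values only.
median_with_zeros = insulin.median()
median_observed = with_nan.median()

# Step 3: fill the gaps with that median.
imputed = with_nan.fillna(median_observed)

print(f"Median including zeros: {median_with_zeros}")
print(f"Median of observed values: {median_observed}")
print(f"Std of observed values: {with_nan.std():.1f}")
print(f"Std after imputation:   {imputed.std():.1f}")
```

Because every imputed entry receives the same value, the standard deviation after imputation is smaller than that of the observed values alone. That is worth keeping in mind when interpreting variance-based statistics on heavily imputed columns.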

## 3. Exploratory Data Analysis (EDA)
EDA is performed to visualize relationships and uncover patterns in the data. This section focuses on the correlation between features and the relationship between key variables (`Glucose`, `BMI`, `Age`) and the diabetes `Outcome`.

In [8]:
plt.style.use('seaborn-v0_8-whitegrid')

In [9]:
# --- Correlation Matrix and Heatmap ---
# Selecting only numeric columns before calculating correlation
numeric_df = df.select_dtypes(include=np.number)
correlation_matrix = numeric_df.corr()

fig = go.Figure(data=go.Heatmap(
                   z=correlation_matrix.values,
                   x=correlation_matrix.columns,
                   y=correlation_matrix.columns,
                   colorscale='Viridis',
                   colorbar=dict(title='Correlation'),
                   text=correlation_matrix.round(2).values,
                   texttemplate="%{text}",
                   hovertemplate="Correlation between %{x} and %{y}: %{z:.2f}<extra></extra>"
                   ))

fig.update_layout(
    title='Feature Correlation Matrix',
    xaxis_nticks=36
)

fig.show()

* **Color Scale:** The colors represent the strength and direction of the correlation between pairs of features. With the `Viridis` colorscale used here, low values are drawn in dark purple and high values in yellow. Correlation coefficients range from -1 (strong negative) through 0 (no linear correlation) to 1 (strong positive), but because no `zmin`/`zmax` is set, the color bar on the right spans only the values actually present in the matrix.
* **Diagonal:** The diagonal of the heatmap is always 1 because each feature is perfectly correlated with itself.
* **Symmetry:** The heatmap is symmetric. The correlation between feature A and feature B is the same as the correlation between feature B and feature A.
* **Annotated Values:** The numbers displayed on each cell of the heatmap are the correlation coefficients. These values quantify the strength and direction of the linear relationship.
* **Key Correlations:** By looking at the colors and values, you can identify which features are strongly positively correlated (values close to 1), strongly negatively correlated (values close to -1), or have weak or no linear correlation (values close to 0). For example, there is a moderate positive correlation between 'Glucose' and 'Outcome' (around 0.49), indicating that higher glucose levels are associated with a higher likelihood of diabetes.

In summary, the heatmap provides a quick and intuitive way to understand the relationships between all pairs of numeric variables in your dataset, which is crucial for identifying potential predictors and understanding the data structure.
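When the question is specifically which features relate most to the target, the `Outcome` column of the correlation matrix can be ranked directly instead of being read off the heatmap. The sketch below uses synthetic data (the generated columns, effect sizes, and the name `toy` are assumptions), so only the technique carries over to the real DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic data: Glucose is built to depend strongly on Outcome, BMI weakly,
# and Noise not at all. None of these numbers come from the real dataset.
rng = np.random.default_rng(0)
n = 200
outcome = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "Glucose": 100 + 40 * outcome + rng.normal(0, 15, n),
    "BMI": 30 + 3 * outcome + rng.normal(0, 6, n),
    "Noise": rng.normal(0, 1, n),
    "Outcome": outcome,
})

# Take the Outcome column of the correlation matrix and drop the trivial
# self-correlation of 1.0.
corr_with_outcome = toy.corr()["Outcome"].drop("Outcome")

# Rank by absolute value so strong negative correlations are not buried.
order = corr_with_outcome.abs().sort_values(ascending=False).index
ranked = corr_with_outcome.reindex(order)
print(ranked.round(2))
```

Note that Pearson correlation with a binary target only captures linear association, so a low value here does not rule out a non-linear relationship.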

In [10]:
# --- Scatter plot of Glucose vs. Age ---

fig = px.scatter(df, x='Age', y='Glucose', color='Outcome', title='Glucose Level vs. Age by Outcome')
fig.show()

This scatter plot visualizes the relationship between Glucose levels and Age, with points colored based on the Outcome (whether an individual has diabetes or not).


* **Separation by Outcome:** There is a noticeable separation between the two outcome groups, particularly at higher Glucose levels. Individuals with diabetes (Outcome = 1) tend to have higher glucose levels than those without diabetes (Outcome = 0). Because `Outcome` is stored as an integer, Plotly colors the points on a continuous scale, with the two classes at its two ends.
* **Glucose and Diabetes:** This reinforces the understanding that higher glucose levels are associated with a higher likelihood of having diabetes, which aligns with the correlation heatmap observation.
* **Age and Diabetes:** While not as clear-cut as the relationship with glucose, there seems to be a higher concentration of individuals with diabetes in older age groups, especially among those with elevated glucose levels.
* **Combined Effect:** The plot suggests a potential interaction between age and glucose. Older individuals with high glucose levels appear to be at a particularly high risk of diabetes, as indicated by the density of Outcome = 1 points in the upper-right portion of the plot.

In summary, the scatter plot visually confirms the importance of glucose levels as a predictor of diabetes and suggests that age also plays a role, potentially interacting with glucose to further increase risk.



In [18]:
# --- Box plots for key features vs. Outcome ---

features_to_box = ['Glucose', 'BMI', 'Age']
# Create a mapping for the x-axis labels
outcome_labels = {0: 'No Diabetes', 1: 'Diabetes'}
df['Outcome_Label'] = df['Outcome'].map(outcome_labels)


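# Reshape to long form so each feature gets its own panel. Passing the list
# as y together with x and color would pool all three features into one box per group.
box_df = df.melt(id_vars='Outcome_Label', value_vars=features_to_box, var_name='Feature', value_name='Value')

fig = px.box(box_df, x='Outcome_Label', y='Value', facet_col='Feature', title='Distribution by Outcome', color='Outcome_Label', color_discrete_sequence=['#1f77b4', '#ff7f0e'])
fig.update_yaxes(matches=None, showticklabels=True)  # independent y-axis per feature
fig.show()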

This code generates box plots for 'Glucose', 'BMI', and 'Age', showing their distribution for individuals with and without diabetes. Here's what we can conclude from these box plots:

* **Glucose Distribution:** The box plot for Glucose clearly shows that individuals with diabetes (Outcome = 1) have significantly higher glucose levels on average compared to those without diabetes (Outcome = 0). The median glucose level for the diabetic group is higher, and the overall box (representing the interquartile range) is shifted upwards. There are also some outliers with very high glucose levels in the non-diabetic group, but the overall distribution is lower.
* **BMI Distribution:** The box plot for BMI also indicates a difference between the two outcome groups. Individuals with diabetes tend to have a higher BMI compared to those without diabetes. The median BMI is higher for the diabetic group, and the box is generally shifted upwards. This suggests that higher BMI is associated with an increased risk of diabetes.
* **Age Distribution:** The box plot for Age shows that the diabetic group tends to be older than the non-diabetic group. The median age is higher for individuals with diabetes, and the distribution of ages is generally shifted towards older values. This supports the idea that age is a risk factor for diabetes.

In summary, the box plots visually confirm that higher Glucose levels, higher BMI, and older age are all associated with a higher likelihood of having diabetes. The distributions of these key features are noticeably different between the diabetic and non-diabetic groups.

## 4. Feature Engineering and Hypothesis Testing
To deepen the analysis, categorical features for BMI and Age are created. Following this, statistical tests are conducted to validate that the observed differences between the diabetic and non-diabetic groups are statistically significant.
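The binning rules used in this section (BMI cut at 18.5, 25, and 30; Age cut at 30, 40, 50, and 60) can also be expressed with `pd.cut`, which avoids a row-by-row `apply`. This is a sketch of an equivalent formulation, not the notebook's own code; the sample values are invented, and `right=False` is chosen so that each bin includes its lower edge, matching conditions of the form `18.5 <= bmi < 25`.

```python
import numpy as np
import pandas as pd

# Invented sample values that sit on and around the bin edges.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 30.0, 45.2])
age = pd.Series([21, 29, 30, 45, 60, 81])

# right=False makes intervals closed on the left: [0, 18.5), [18.5, 25), ...
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
age_group = pd.cut(
    age,
    bins=[0, 30, 40, 50, 60, np.inf],
    labels=["20-29", "30-39", "40-49", "50-59", "60+"],
    right=False,
)

print(pd.DataFrame({"BMI": bmi, "BMI_Category": bmi_category,
                    "Age": age, "Age_Group": age_group}))
```

The result is an ordered categorical, so group-by summaries and plots keep the bins in their natural order without an explicit sort.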

In [12]:
def bmi_category(bmi):
    if bmi < 18.5: return 'Underweight'
    elif 18.5 <= bmi < 25: return 'Normal'
    elif 25 <= bmi < 30: return 'Overweight'
    else: return 'Obese'
df['BMI_Category'] = df['BMI'].apply(bmi_category)

def age_group(age):
    if age < 30: return '20-29'
    elif 30 <= age < 40: return '30-39'
    elif 40 <= age < 50: return '40-49'
    elif 50 <= age < 60: return '50-59'
    else: return '60+'
df['Age_Group'] = df['Age'].apply(age_group)

group_0 = df[df['Outcome'] == 0]
group_1 = df[df['Outcome'] == 1]

print("--- Independent Samples T-Tests ---")
for feature in ['Glucose', 'BMI', 'Age']:
    t_stat, p_val = stats.ttest_ind(group_1[feature], group_0[feature])
    print(f"\nFeature: {feature}")
    print(f"  T-statistic: {t_stat:.2f}, P-value: {p_val:.4f}")
    if p_val < 0.05:
        print("  -> Conclusion: The difference in means is statistically significant.")

--- Independent Samples T-Tests ---

Feature: Glucose
  T-statistic: 15.67, P-value: 0.0000
  -> Conclusion: The difference in means is statistically significant.

Feature: BMI
  T-statistic: 9.09, P-value: 0.0000
  -> Conclusion: The difference in means is statistically significant.

Feature: Age
  T-statistic: 6.79, P-value: 0.0000
  -> Conclusion: The difference in means is statistically significant.


This code performs independent samples t-tests for 'Glucose', 'BMI', and 'Age' to determine if there is a statistically significant difference in the means of these features between the group with diabetes (Outcome = 1) and the group without diabetes (Outcome = 0).


* **Independent Samples T-Tests:** This section presents the results of the t-tests for each specified feature.
* **Feature: Glucose:**
  * **T-statistic:** 15.67. This is a large positive t-statistic, indicating that the mean glucose level in the diabetic group is significantly higher than in the non-diabetic group.
  * **P-value:** < 0.0001 (printed as 0.0000 because the output is rounded to four decimal places). This is far below the common significance level of 0.05.
  * **Conclusion:** The difference in means for Glucose is statistically significant. This means we have strong evidence to conclude that the average glucose level is different between individuals with and without diabetes.
* **Feature: BMI:**
  * **T-statistic:** 9.09. This positive t-statistic suggests that the mean BMI in the diabetic group is higher than in the non-diabetic group.
  * **P-value:** < 0.0001 (printed as 0.0000), again far below 0.05.
  * **Conclusion:** The difference in means for BMI is statistically significant. We can conclude that the average BMI is different between individuals with and without diabetes.
* **Feature: Age:**
  * **T-statistic:** 6.79. This positive t-statistic indicates that the mean age in the diabetic group is higher than in the non-diabetic group.
  * **P-value:** < 0.0001 (printed as 0.0000), again far below 0.05.
  * **Conclusion:** The difference in means for Age is statistically significant. We can conclude that the average age is different between individuals with and without diabetes.

  
In summary, the independent samples t-tests indicate that the observed differences in the mean values of Glucose, BMI, and Age between the diabetic and non-diabetic groups are very unlikely to be due to random chance alone. All three features differ significantly between individuals with and without diabetes, with Glucose showing the largest t-statistic.
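A p-value indicates whether a difference is detectable, not how large it is. One common complement is Cohen's d, the difference in means divided by the pooled standard deviation. The helper below is a minimal sketch; the function name `cohens_d` and the sample values are assumptions introduced for illustration, not part of the original analysis.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Invented glucose-like readings for two small groups.
diabetic = [150, 160, 140, 170]
non_diabetic = [100, 110, 120, 90, 105]

d = cohens_d(diabetic, non_diabetic)
print(f"Cohen's d: {d:.2f}")
```

In the notebook this could be called as `cohens_d(group_1[feature], group_0[feature])` inside the existing loop. Separately, `stats.ttest_ind` accepts `equal_var=False` to run Welch's t-test, which does not assume equal variances in the two groups.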

## 5. Advanced Outlier Detection and Handling 📊
Extreme values (outliers) can disproportionately influence statistical analyses. Instead of removing them and losing data, we use **capping** (Winsorization) to limit their effect. Values outside the 5th and 95th percentiles are replaced by the value at that percentile, creating a more robust dataset for the subsequent analysis.

In [13]:
df_handled = df.copy()
features_to_handle = ['Insulin', 'SkinThickness', 'DiabetesPedigreeFunction', 'BloodPressure']

fig = px.box(df_handled, y=features_to_handle, title='Box Plots Before Outlier Handling')
fig.show()

This code generates box plots for 'Insulin', 'SkinThickness', 'DiabetesPedigreeFunction', and 'BloodPressure' before any outlier handling is applied.

* **Visualizing Distribution:** The box plots provide a visual summary of the distribution of each feature, showing the median (the line within the box), the interquartile range (the box itself), the whiskers (extending to the minimum and maximum values within a certain range), and any points that fall outside the whiskers, which are considered outliers.
* **Presence of Outliers:** For all four features, there are clearly individual data points plotted as circles outside the whiskers. These represent outliers, or extreme values, in the dataset.
* **Skewness:** The position of the median within the box and the length of the whiskers can give an indication of the skewness of the distribution. Some of these plots might suggest a skewed distribution, with a tail of outliers extending in one direction (e.g., upwards for 'Insulin').
* **Variability:** The size of the box and the length of the whiskers give a sense of the variability or spread of the data for each feature.

In summary, these box plots visually confirm the presence of outliers in 'Insulin', 'SkinThickness', 'DiabetesPedigreeFunction', and 'BloodPressure' before any outlier handling techniques are applied. This step is important for identifying the need to address these extreme values, as they can affect subsequent statistical analyses.



In [14]:
for feature in features_to_handle:
    lower_bound = df_handled[feature].quantile(0.05)
    upper_bound = df_handled[feature].quantile(0.95)
    df_handled[feature] = np.clip(df_handled[feature], lower_bound, upper_bound)


fig = px.box(df_handled, y=features_to_handle, title='Box Plots After Outlier Handling')
fig.show()

This code performs outlier handling using capping (Winsorization) on the specified features ('Insulin', 'SkinThickness', 'DiabetesPedigreeFunction', and 'BloodPressure') by replacing values outside the 5th and 95th percentiles with the values at those percentiles. The box plots are then generated for these features after this process.

* **Reduced Outliers:** Compared to the "Before" box plots, the number and extent of outliers (the individual points outside the whiskers) are significantly reduced or eliminated for most features. This is a direct result of the capping process, which limits extreme values.
* **Changed Distribution:** After capping, no value lies below the original 5th percentile or above the original 95th percentile, which gives a clearer view of the central distribution. Points can still appear beyond the whiskers when the interquartile range is narrow, as with Insulin, whose quartiles (121.5 and 127.25) sit very close together after median imputation.
* **Impact on Variability:** The range of the data shown in the box plots (from the lower to the upper whisker) is likely smaller than in the "Before" plots, reflecting the reduction in the impact of extreme values on the overall spread.

In summary, the box plots after outlier handling visually demonstrate that the capping method was successful in mitigating the influence of extreme values in the selected features. This results in a dataset that is less affected by outliers, which can lead to more robust statistical analyses and potentially better performance in machine learning models.
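To see exactly what percentile capping does, it helps to apply it to a series whose percentiles are easy to work out by hand. The helper below is a sketch of the same 5th/95th percentile rule; the function name `cap_at_percentiles` and the test series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def cap_at_percentiles(series, lower=0.05, upper=0.95):
    """Clip a series to its own lower/upper quantiles and report how many values moved."""
    low_value, high_value = series.quantile([lower, upper])
    capped = series.clip(low_value, high_value)
    n_changed = int((capped != series).sum())
    return capped, n_changed

# The integers 1..100 make the effect easy to verify by eye.
values = pd.Series(np.arange(1, 101), dtype="float64")
capped, n_changed = cap_at_percentiles(values)

print(f"Range before: {values.min()} to {values.max()}")
print(f"Range after:  {capped.min():.2f} to {capped.max():.2f}")
print(f"Values changed: {n_changed} of {len(values)}")
```

One design consequence is worth noting: with fixed 5th and 95th percentile cut-offs, roughly a tenth of the observations are altered by construction, whether or not they are genuine outliers. Reporting the number of changed values, as above, makes that cost visible.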

## 6. Interaction Effect Analysis: Glucose & BMI 💡

`Glucose` and `BMI` are strong individual predictors, but their combined effect might be even more significant. To investigate this, we create a **`Risk_Profile`** feature by categorizing individuals based on whether their levels are "High" (above the 75th percentile). We then visualize the diabetes rate for each profile to see if the risk is amplified.

In [15]:
glucose_threshold = df_handled['Glucose'].quantile(0.75)
bmi_threshold = df_handled['BMI'].quantile(0.75)

print(f"High Glucose Threshold (75th percentile): {glucose_threshold:.2f}")
print(f"High BMI Threshold (75th percentile): {bmi_threshold:.2f}\n")

df_handled['High_Glucose'] = df_handled['Glucose'] > glucose_threshold
df_handled['High_BMI'] = df_handled['BMI'] > bmi_threshold

def create_risk_profile(row):
    if row['High_Glucose'] and row['High_BMI']: return 'High Glucose & High BMI'
    elif row['High_Glucose']: return 'High Glucose Only'
    elif row['High_BMI']: return 'High BMI Only'
    else: return 'Low Risk'
df_handled['Risk_Profile'] = df_handled.apply(create_risk_profile, axis=1)


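profile_order = ['Low Risk', 'High BMI Only', 'High Glucose Only', 'High Glucose & High BMI']
# px.bar stacks one segment per row, so aggregate to the mean Outcome
# (the diabetes rate) per profile before plotting.
diabetes_rate = df_handled.groupby('Risk_Profile', as_index=False)['Outcome'].mean()
fig = px.bar(diabetes_rate, x='Risk_Profile', y='Outcome', category_orders={'Risk_Profile': profile_order}, labels={'Outcome': 'Diabetes Rate'}, title='Diabetes Rate by Glucose & BMI Interaction')
fig.show()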

High Glucose Threshold (75th percentile): 140.25
High BMI Threshold (75th percentile): 36.60



This code calculates the 75th percentile thresholds for Glucose and BMI, creates new categorical features to define risk profiles based on these thresholds, and then generates a bar plot showing the diabetes rate for each risk profile.


* **Thresholds:** The output first prints the calculated thresholds for "High Glucose" (above the 75th percentile, which is 140.25) and "High BMI" (above the 75th percentile, which is 36.60).
* **Risk Profiles**: The bar plot visualizes the proportion of individuals with diabetes ('Outcome' = 1) within four different risk profiles: 'Low Risk' (neither high glucose nor high BMI), 'High BMI Only', 'High Glucose Only', and 'High Glucose & High BMI'.
* **Increasing Risk:** The bar heights clearly show an increasing trend in the diabetes rate across the risk profiles. The 'Low Risk' group has the lowest diabetes rate, followed by 'High BMI Only' and 'High Glucose Only', with the 'High Glucose & High BMI' group having the highest diabetes rate.
* **Interaction Effect:** The most notable finding is the sharp increase in the diabetes rate for the 'High Glucose & High BMI' group, whose bar is substantially higher than those for 'High Glucose Only' or 'High BMI Only'; the rate for this group is around 80%. This is consistent with an interaction effect, in which high glucose and high BMI together raise the risk by more than the sum of their individual effects, although the bar chart alone does not test that formally.

In summary, the output of this code provides strong evidence that while both high glucose and high BMI are individual risk factors for diabetes, their combination poses a significantly higher risk, highlighting a clear interaction effect between these two features.
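One simple way to check whether a combined rate exceeds what the two factors would contribute separately is to compare it with an additive expectation: the baseline rate plus each factor's individual excess over baseline. The sketch below does this on invented counts (the numbers, `make_rows`, and `toy` are assumptions), and treating "additive on the probability scale" as the reference is itself only one possible definition of no interaction.

```python
import pandas as pd

def make_rows(profile, n_total, n_diabetic):
    """Build n_total rows for one profile, the first n_diabetic with Outcome = 1."""
    return [{"Risk_Profile": profile, "Outcome": int(i < n_diabetic)}
            for i in range(n_total)]

# Invented counts, chosen only to make the arithmetic easy to follow.
toy = pd.DataFrame(
    make_rows("Low Risk", 10, 2)
    + make_rows("High BMI Only", 10, 3)
    + make_rows("High Glucose Only", 10, 5)
    + make_rows("High Glucose & High BMI", 10, 9)
)

# Diabetes rate and group size per profile.
summary = toy.groupby("Risk_Profile")["Outcome"].agg(rate="mean", n="size")

baseline = summary.loc["Low Risk", "rate"]
bmi_excess = summary.loc["High BMI Only", "rate"] - baseline
glucose_excess = summary.loc["High Glucose Only", "rate"] - baseline

# Rate expected for the combined group if the two excesses simply added up.
additive_expectation = baseline + bmi_excess + glucose_excess
observed_both = summary.loc["High Glucose & High BMI", "rate"]
interaction_excess = observed_both - additive_expectation

print(summary)
print(f"Additive expectation: {additive_expectation:.2f}")
print(f"Observed combined rate: {observed_both:.2f}")
print(f"Excess over additive: {interaction_excess:.2f}")
```

Including the group size `n` alongside each rate matters, because a high rate in a very small group is much less reliable than the same rate in a large one.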

## 7. Further Interaction Analysis: Age & Pregnancies
EDA suggested a correlation between `Age` and `Pregnancies`. This section explores their interaction effect on diabetes risk, hypothesizing that a high number of pregnancies might pose a greater risk for older women.

In [16]:
age_threshold = 40 # A common threshold for increased health screening
preg_threshold = df_handled['Pregnancies'].quantile(0.75) # High pregnancies

df_handled['Older_Age'] = df_handled['Age'] > age_threshold
df_handled['High_Pregnancies'] = df_handled['Pregnancies'] > preg_threshold

def create_age_preg_profile(row):
    if row['Older_Age'] and row['High_Pregnancies']: return 'Older & High Pregnancies'
    elif row['Older_Age']: return 'Older & Low Pregnancies'
    elif row['High_Pregnancies']: return 'Younger & High Pregnancies'
    else: return 'Younger & Low Pregnancies'

df_handled['Age_Pregnancy_Profile'] = df_handled.apply(create_age_preg_profile, axis=1)

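# px.bar stacks one segment per row, so aggregate to the mean Outcome
# (the diabetes rate) per profile before plotting.
age_preg_rate = df_handled.groupby('Age_Pregnancy_Profile', as_index=False)['Outcome'].mean()
fig = px.bar(age_preg_rate, x='Age_Pregnancy_Profile', y='Outcome', labels={'Outcome': 'Diabetes Rate'}, title='Diabetes Rate by Age & Pregnancy Interaction')
fig.show()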

This code calculates thresholds for 'Age' and 'Pregnancies', creates new categorical features to define risk profiles based on these thresholds, and then generates a bar plot showing the diabetes rate for each risk profile.


* **Risk Profiles:** The bar plot visualizes the proportion of individuals with diabetes ('Outcome' = 1) within four different risk profiles: 'Younger & Low Pregnancies', 'Older & Low Pregnancies', 'Younger & High Pregnancies', and 'Older & High Pregnancies'.
* **Increasing Risk:** The bar heights show varying diabetes rates across the profiles. The 'Younger & Low Pregnancies' group has the lowest diabetes rate. The 'Older & Low Pregnancies' and 'Younger & High Pregnancies' groups show a higher risk than the 'Younger & Low Pregnancies' group. The 'Older & High Pregnancies' group appears to have the highest diabetes rate. The chart carries no error bars, so whether the differences among these three higher-risk groups exceed sampling noise cannot be judged from it alone.
* **Interaction Effect:** While there seems to be an increased risk for older women with a high number of pregnancies, the interaction effect doesn't appear as strong or as clearly separated as the interaction between Glucose and BMI. The bars for the 'Older & Low Pregnancies', 'Younger & High Pregnancies', and 'Older & High Pregnancies' groups are relatively close in height.

In summary, the bar plot suggests that both older age and a higher number of pregnancies are associated with an increased risk of diabetes. There might be a slight interaction effect where older women with more pregnancies have a higher risk, but this interaction is not as pronounced as the one observed between Glucose and BMI.
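Since the groups being compared here are close in height, an interval around each rate helps judge whether the differences are meaningful. The sketch below computes a Wilson score interval for a proportion using only the standard library; the function name `wilson_interval` and the example counts are assumptions for illustration, not values from the dataset.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denominator = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denominator
    return centre - half_width, centre + half_width

# Hypothetical group counts: (diabetic cases, group size).
example_groups = {
    "small group": (8, 10),
    "large group": (50, 100),
}

for name, (cases, size) in example_groups.items():
    low, high = wilson_interval(cases, size)
    print(f"{name}: rate={cases / size:.2f}, 95% interval=({low:.3f}, {high:.3f})")
```

For the notebook's data, the counts could be obtained with `df_handled.groupby('Age_Pregnancy_Profile')['Outcome'].agg(['sum', 'size'])`. Overlapping intervals between two profiles would suggest that the apparent difference in their bars should be read with caution.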

## 8. Conclusion

This exploratory analysis of the PIMA Indian Diabetes Dataset successfully identified key physiological markers and their interactions related to diabetes risk.

### Summary of Findings
* **Primary Risk Factors**: **Glucose, BMI, and Age** all differ significantly between the diabetic and non-diabetic groups (independent samples t-tests, p < 0.0001 for each), with Glucose showing the largest t-statistic (15.67). Individuals with higher values in these metrics show a substantially greater prevalence of the condition.

* **Interaction Effects Amplify Risk**: The risk of diabetes is not merely additive. The combination of **High Glucose and High BMI** dramatically increases diabetes prevalence to ~80%, far greater than the risk posed by either factor alone. A similar, though less pronounced, interaction was observed between advanced age and a high number of pregnancies, highlighting that risk assessment should consider combinations of factors.

* **Data Quality**: Initial data cleaning, including the imputation of biologically implausible zero values and the capping of extreme outliers, was crucial for ensuring the accuracy and robustness of the analytical findings.
