Case Study: Statistical Analysis on Diabetes Dataset

Objective: Determine whether specific physiological features (Glucose, BMI, Age, etc.) differ significantly between diabetic and non-diabetic individuals.

In [3]:
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

# Load dataset
df = pd.read_csv("diabetes.csv")

# --- Handle Missing or Null Values ---
# Replace 0 values in certain physiological columns (where 0 is biologically invalid) with NaN
cols_with_zeros = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

# Impute missing values using median (robust against outliers)
df = df.fillna(df.median(numeric_only=True))

# Confirm no missing values remain
print("Missing values after cleaning:\n", df.isnull().sum())


Missing values after cleaning:
 Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


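In healthcare data, a zero recorded for a physiological metric (e.g., BMI = 0) is biologically impossible and almost always stands for a missing measurement.
Imputing with the column median preserves sample size and is less sensitive to outliers than the mean, although it reduces the variance of columns with many imputed values.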
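
As a minimal sketch of the cleaning step above, the following applies the same zero-to-NaN replacement and median imputation to a small made-up frame. The values and the name `toy` are illustrative assumptions, not rows from diabetes.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: zeros stand in for missing measurements
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

cols = ["Glucose", "BMI"]
# Treat biologically impossible zeros as missing (Outcome zeros are left alone)
toy[cols] = toy[cols].replace(0, np.nan)
n_missing = int(toy[cols].isna().sum().sum())

# Fill each column with its own median, computed from the observed values only
toy = toy.fillna(toy.median(numeric_only=True))

print(f"Imputed {n_missing} values")
print(toy)
```

Counting the NaNs before filling them gives a quick sense of how much of each column is imputed rather than measured.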



Independent Two-Sample T-Test

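Purpose: Compare the means of a continuous variable between two independent groups (e.g., diabetics vs. non-diabetics). The code below passes equal_var=False, which selects Welch's variant and does not assume equal group variances.


Use case: Do mean glucose levels differ significantly between diabetic and non-diabetic patients?

Hypotheses (μ₁ = mean glucose for Outcome = 0, μ₂ = mean glucose for Outcome = 1):

H₀: μ₁ = μ₂ (no difference)

H₁: μ₁ ≠ μ₂ (significant difference)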

In [4]:
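# Split glucose readings by diabetes status (Outcome: 0 = non-diabetic, 1 = diabetic)
group0 = df.loc[df["Outcome"] == 0, "Glucose"]
group1 = df.loc[df["Outcome"] == 1, "Glucose"]
# equal_var=False runs Welch's t-test (no equal-variance assumption)
t_stat, t_p = stats.ttest_ind(group0, group1, equal_var=False)
print(f"T-test → statistic={t_stat:.4f}, p-value={t_p:.4f}")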

T-test → statistic=-14.8527, p-value=0.0000


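Interpretation:
The p-value prints as 0.0000 because it is smaller than the four-decimal display precision (p < 0.0001). Since p < 0.05 → Reject H₀ → Mean glucose differs significantly between diabetic and non-diabetic patients. The negative statistic means the non-diabetic group (Outcome = 0) has the lower mean.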


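Significance:
The t-test shows that the difference in mean glucose is unlikely to be due to chance, which supports glucose as a candidate discriminative feature for diabetes. It does not by itself measure how well glucose separates the two groups.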
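
To make the statistic above less of a black box, here is a minimal sketch that recomputes Welch's t statistic by hand and adds Cohen's d as a rough effect size. The arrays `a` and `b` are synthetic stand-ins generated for illustration, not the notebook's glucose columns, and Cohen's d is an addition that the original analysis does not report.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two glucose groups
a = rng.normal(110, 25, size=200)
b = rng.normal(140, 30, size=120)

# Welch's t statistic: mean difference over the unpooled standard error
va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Cohen's d with a pooled standard deviation, as a rough effect size
pooled_sd = np.sqrt(
    ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1))
    / (a.size + b.size - 2)
)
cohens_d = (a.mean() - b.mean()) / pooled_sd

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"manual t={t_manual:.4f}, scipy t={t_scipy:.4f}, d={cohens_d:.3f}")
```

Reporting an effect size next to the p-value helps separate "statistically detectable" from "large enough to matter", which the p-value alone cannot do.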


Z-Test

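Purpose: Like the t-test, compares two group means, but uses the standard normal distribution as the reference. It is appropriate when each sample is large (commonly n > 30), so that the sample variance is a good approximation of the population variance.

Use case: Test whether average BMI differs significantly between diabetic and non-diabetic individuals.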

In [5]:
z_stat, z_p = ztest(df[df["Outcome"] == 0]["BMI"], df[df["Outcome"] == 1]["BMI"])
print(f"Z-test → statistic={z_stat:.4f}, p-value={z_p:.4f}")


Z-test → statistic=-9.0901, p-value=0.0000


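Interpretation:
The p-value prints as 0.0000 (p < 0.0001), which is below 0.05 → Reject H₀ → Mean BMI differs significantly between the two groups. The negative statistic means the non-diabetic group (Outcome = 0) has the lower mean BMI.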

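Significance:
The z-test is a large-sample check of mean differences in continuous data. For large samples it leads to nearly the same conclusion as the t-test, because the t distribution approaches the normal distribution as the degrees of freedom grow.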
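
The relationship between the z-test and the t-test can be sketched by hand. The example below assumes a pooled variance estimate (to the best of our understanding, the default in statsmodels' `ztest`) and uses synthetic BMI-like arrays `x1` and `x2` that are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical BMI-like samples for the two groups
x1 = rng.normal(31, 6, size=300)
x2 = rng.normal(35, 7, size=200)

n1, n2 = x1.size, x2.size
# Pooled variance and the standard error of the mean difference
pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# z statistic and two-sided p-value from the standard normal distribution
z = (x1.mean() - x2.mean()) / se
p_z = 2 * stats.norm.sf(abs(z))

# The pooled t-test uses the same statistic with a t reference distribution
t_stat, p_t = stats.ttest_ind(x1, x2, equal_var=True)
print(f"z={z:.4f} (p={p_z:.3e}), t={t_stat:.4f} (p={p_t:.3e})")
```

The two statistics coincide; only the reference distribution differs, and with hundreds of observations per group that difference is negligible.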

F-Test (Variance Comparison)

Purpose: Check if variances of two groups are significantly different.

Use case: Are age variances the same between diabetic and non-diabetic individuals?

Hypotheses:

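H₀: σ₁² = σ₂² (σ₁² = age variance for Outcome = 0, σ₂² = age variance for Outcome = 1)

H₁: σ₁² > σ₂² (one-sided: the code below computes only the upper-tail p-value of the ratio σ₁²/σ₂²)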

In [6]:
age0 = df[df["Outcome"] == 0]["Age"]
age1 = df[df["Outcome"] == 1]["Age"]

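# F statistic: ratio of the unbiased sample variances (Outcome 0 over Outcome 1)
f_stat = np.var(age0, ddof=1) / np.var(age1, ddof=1)
df1, df2 = len(age0) - 1, len(age1) - 1  # numerator and denominator degrees of freedom
# Upper-tail (one-sided) p-value; sf is the survival function, i.e. 1 - cdf
f_p = stats.f.sf(f_stat, df1, df2)
print(f"F-test → statistic={f_stat:.4f}, p-value={f_p:.4f}")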


F-test → statistic=1.1316, p-value=0.1284
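
The p-value above comes from the upper tail only. As a hedged sketch of how a two-sided variance comparison could look, the helper below (a hypothetical name, `f_test_two_sided`, not part of SciPy) doubles the smaller tail probability. Because the F-test is sensitive to departures from normality, the sketch also runs Levene's test with `center="median"` (the Brown-Forsythe variant) as a more robust alternative. The arrays `g0` and `g1` are synthetic and only illustrate the calls.

```python
import numpy as np
from scipy import stats

def f_test_two_sided(x, y):
    """Hypothetical helper: two-sided F-test for equality of two variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    f = x.var(ddof=1) / y.var(ddof=1)
    dfn, dfd = x.size - 1, y.size - 1
    # Double the smaller tail probability, capped at 1
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, min(p, 1.0)

rng = np.random.default_rng(2)
# Hypothetical age-like samples for the two groups
g0 = rng.normal(31, 11, size=300)
g1 = rng.normal(37, 11, size=200)

f, p_two = f_test_two_sided(g0, g1)
# Levene's test centered on the median is less sensitive to non-normal data
lev_stat, lev_p = stats.levene(g0, g1, center="median")
print(f"F={f:.4f}, two-sided p={p_two:.4f}; Levene p={lev_p:.4f}")
```

With a two-sided test the result does not depend on which group is placed in the numerator, which matches a question phrased as "are the variances the same?".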
