# 🧬 Chronic Kidney Disease (CKD) Prediction

**Problem Statement**: To predict whether a patient suffers from Chronic Kidney Disease based on diagnostic, clinical, and demographic attributes.

**Dataset**: [CKD Dataset from Kaggle](https://www.kaggle.com/datasets/mansoordaku/ckdisease)

**Significance**:
- CKD is a major global health issue that often goes undiagnosed until advanced stages
- Early identification helps reduce severe health complications and costs
- Using data science, we can uncover relationships between measurable health indicators and disease presence

---

## 📚 Step 1: Data Collection & Description

### Import Required Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency, ttest_ind, spearmanr

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Plot styling
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")

### Load the Dataset

In [None]:
# Load the dataset
df = pd.read_csv('kidney_disease.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

### A. Data Overview

In [None]:
# Display first few rows
print("\n📊 First 10 rows of the dataset:")
df.head(10)

In [None]:
# Display last few rows
print("\n📊 Last 10 rows of the dataset:")
df.tail(10)

In [None]:
# Dataset info
print("\n🔍 Dataset Information:")
df.info()

In [None]:
# Column names
print("\n📋 Column Names:")
print(df.columns.tolist())

### B. Feature Descriptions

| Feature | Type | Description |
|---------|------|-------------|
| **id** | Numeric | Patient ID |
| **age** | Numeric | Patient's age (years) |
| **bp** | Numeric | Blood pressure (mmHg) |
| **sg** | Ordinal | Specific gravity (1.005, 1.010, 1.015, 1.020, 1.025) |
| **al** | Ordinal | Albumin levels (0-5) |
| **su** | Ordinal | Sugar levels (0-5) |
| **rbc** | Categorical | Red blood cell type (normal/abnormal) |
| **pc** | Categorical | Pus cell type (normal/abnormal) |
| **pcc** | Categorical | Pus cell clumps (present/notpresent) |
| **ba** | Categorical | Bacteria (present/notpresent) |
| **bgr** | Numeric | Blood glucose random (mg/dl) |
| **bu** | Numeric | Blood urea (mg/dl) |
| **sc** | Numeric | Serum creatinine (mg/dl) |
| **sod** | Numeric | Sodium (mEq/L) |
| **pot** | Numeric | Potassium (mEq/L) |
| **hemo** | Numeric | Hemoglobin (gms) |
| **pcv** | Numeric | Packed cell volume (%) |
| **wc** | Numeric | White blood cell count (cells/cumm) |
| **rc** | Numeric | Red blood cell count (millions/cumm) |
| **htn** | Binary | Hypertension (yes/no) |
| **dm** | Binary | Diabetes mellitus (yes/no) |
| **cad** | Binary | Coronary artery disease (yes/no) |
| **appet** | Categorical | Appetite (good/poor) |
| **pe** | Binary | Pedal edema (yes/no) |
| **ane** | Binary | Anemia (yes/no) |
| **classification** | Target | **CKD status (ckd/notckd)** |

In [None]:
# Separate features by type
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()

# Remove id and classification from feature lists
if 'id' in numeric_features:
    numeric_features.remove('id')
if 'classification' in categorical_features:
    categorical_features.remove('classification')

print(f"\n🔢 Numeric Features ({len(numeric_features)}):")
print(numeric_features)

print(f"\n🏷️ Categorical Features ({len(categorical_features)}):")
print(categorical_features)

print("\n🎯 Target Variable: classification")

### C. Basic Statistics

In [None]:
# Statistical summary of numeric features
print("\n📈 Statistical Summary (Numeric Features):")
df[numeric_features].describe().T

In [None]:
# Summary of categorical features
print("\n📊 Categorical Features Summary:")
for col in categorical_features:
    print(f"\n{col}:")
    print(df[col].value_counts())
    print("-" * 50)

In [None]:
# Target variable distribution
print("\n🎯 Target Variable Distribution:")
print(df['classification'].value_counts())
print("\nPercentage:")
print(df['classification'].value_counts(normalize=True) * 100)

---
## 🔍 Step 2: Exploratory Data Analysis (EDA)

### A. Data Quality Assessment

In [None]:
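# Check for duplicate rows, ignoring 'id' (a per-row identifier would
# otherwise make every row look unique)
feature_cols = [c for c in df.columns if c != 'id']
duplicates = df.duplicated(subset=feature_cols).sum()
print(f"\n🔍 Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print("\nRemoving duplicates...")
    df = df.drop_duplicates(subset=feature_cols)
    print(f"✅ Duplicates removed. New shape: {df.shape}")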

In [None]:
# Normalize string cells: strip whitespace, then convert '?' and empty strings to NaN
print("\n🔄 Converting '?' and empty strings to NaN...")

# Strip whitespace first so that padded placeholders (for example ' ?')
# are reduced to '?' before the replacement below
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip()

# Replace '?' placeholders with NaN
df = df.replace('?', np.nan)

# Replace empty or whitespace-only strings with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)

print("✅ Data cleaning completed!")
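
The feature table lists `pcv`, `wc`, and `rc` as numeric, but a column that contains even one non-numeric string is loaded by pandas as text and would then be counted among the categorical features. Whether that happens here depends on the CSV, so `df.info()` is the place to check. The sketch below shows one way to coerce such columns; the helper name `coerce_numeric_columns` and the demo values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def coerce_numeric_columns(frame, columns):
    """Return a copy with the given columns converted to a numeric dtype.

    Entries that cannot be parsed (for example a stray '?') become NaN.
    """
    out = frame.copy()
    for col in columns:
        if col in out.columns:
            out[col] = pd.to_numeric(out[col], errors='coerce')
    return out


# Columns the feature table describes as numeric (assumed candidates).
expected_numeric = ['pcv', 'wc', 'rc']

# Tiny made-up frame, not rows from the dataset.
demo = pd.DataFrame({
    'pcv': ['44', '38', '?'],
    'wc': ['7800', '6000', None],
    'htn': ['yes', 'no', 'yes'],
})
demo_clean = coerce_numeric_columns(demo, expected_numeric)
print(demo_clean.dtypes)
```

Applied to the notebook's data this would be `df = coerce_numeric_columns(df, expected_numeric)`, after which `numeric_features` and `categorical_features` would need to be recomputed so the later plots and tests pick up the converted columns.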

### B. Missing Value Analysis

In [None]:
# Calculate missing values
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum().values,
    'Missing_Percentage': (df.isnull().sum().values / len(df)) * 100
})

missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print("\n❌ Missing Value Analysis:")
print(missing_data.to_string(index=False))

print(f"\n📊 Total missing values: {df.isnull().sum().sum()}")
print(f"📊 Percentage of missing data: {(df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100:.2f}%")

In [None]:
# Visualize missing values - Bar plot
plt.figure(figsize=(14, 6))
missing_pct = (df.isnull().sum() / len(df)) * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

plt.barh(missing_pct.index, missing_pct.values, color='coral')
plt.xlabel('Missing Percentage (%)', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Missing Values by Feature', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)

for i, v in enumerate(missing_pct.values):
    plt.text(v + 0.5, i, f'{v:.1f}%', va='center')

plt.tight_layout()
plt.show()

In [None]:
# Visualize missing values - Matrix using missingno
# msno.matrix creates its own figure, so no separate plt.figure() call is needed
ax = msno.matrix(df, figsize=(14, 8), fontsize=10)
ax.set_title('Missing Data Matrix Visualization', fontsize=14, fontweight='bold')
plt.show()

In [None]:
# Heatmap of missing values
plt.figure(figsize=(14, 8))
sns.heatmap(df.isnull(), cbar=True, cmap='viridis', yticklabels=False)
plt.title('Missing Data Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Features', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Check if missingness correlates with target variable
print("\n🔍 Missing Data Pattern Analysis (by Target Class):")
print("\nMissing value percentage by classification:")

for col in df.columns:
    if df[col].isnull().sum() > 0:
        missing_by_class = df.groupby('classification')[col].apply(lambda x: x.isnull().sum() / len(x) * 100)
        if len(missing_by_class) > 0:
            print(f"\n{col}:")
            print(missing_by_class)
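
The per-class percentages show whether a feature is missing more often in one class than the other. One way to judge whether such a gap is larger than chance would suggest is a chi-square test on a missing/present indicator. This is a minimal sketch under that assumption; the helper name `missingness_vs_target` and the demo frame are illustrative, and with small expected counts the p-value should be read with caution.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def missingness_vs_target(frame, feature, target):
    """Chi-square test of independence between 'feature is missing' and the target.

    Returns None when the test is undefined (the feature is never or always
    missing, or only one target class is present).
    """
    indicator = frame[feature].isnull()
    table = pd.crosstab(indicator, frame[target])
    if table.shape[0] < 2 or table.shape[1] < 2:
        return None
    chi2, p_value, dof, _ = chi2_contingency(table)
    return {'feature': feature, 'chi2': chi2, 'p_value': p_value}


# Made-up example: 'rbc' is missing only for the 'ckd' rows.
demo = pd.DataFrame({
    'rbc': [None, None, None, None, 'normal', 'normal', 'normal', 'normal'],
    'classification': ['ckd'] * 4 + ['notckd'] * 4,
})
result = missingness_vs_target(demo, 'rbc', 'classification')
never_missing = missingness_vs_target(demo, 'classification', 'classification')
print(result)
```

A small p-value suggests the data are not missing completely at random with respect to the class, which matters when choosing between deletion and imputation.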

### C. Univariate Analysis

#### Numeric Features Analysis

In [None]:
# Detailed statistics for numeric features including skewness and kurtosis
print("\n📊 Detailed Statistics (Numeric Features):")

numeric_stats = pd.DataFrame({
    'Mean': df[numeric_features].mean(),
    'Median': df[numeric_features].median(),
    'Std': df[numeric_features].std(),
    'IQR': df[numeric_features].quantile(0.75) - df[numeric_features].quantile(0.25),
    'Skewness': df[numeric_features].skew(),
    'Kurtosis': df[numeric_features].kurtosis()
})

numeric_stats

In [None]:
# Distribution plots for numeric features - Histograms
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(18, 20))
axes = axes.flatten()

for idx, col in enumerate(numeric_features):
    if idx < len(axes):
        df[col].hist(bins=30, ax=axes[idx], color='skyblue', edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'Distribution of {col}', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Frequency')
        axes[idx].grid(alpha=0.3)

# Hide extra subplots
for idx in range(len(numeric_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Histograms of Numeric Features', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

In [None]:
# KDE plots for numeric features
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(18, 20))
axes = axes.flatten()

for idx, col in enumerate(numeric_features):
    if idx < len(axes):
        df[col].dropna().plot(kind='kde', ax=axes[idx], color='coral', linewidth=2)
        axes[idx].set_title(f'KDE of {col}', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Density')
        axes[idx].grid(alpha=0.3)

# Hide extra subplots
for idx in range(len(numeric_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('KDE Plots of Numeric Features', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

In [None]:
# Boxplots for numeric features to identify outliers
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(18, 20))
axes = axes.flatten()

for idx, col in enumerate(numeric_features):
    if idx < len(axes):
        df.boxplot(column=col, ax=axes[idx], patch_artist=True,
                   boxprops=dict(facecolor='lightblue', color='blue'),
                   medianprops=dict(color='red', linewidth=2))
        axes[idx].set_title(f'Boxplot of {col}', fontweight='bold')
        axes[idx].set_ylabel(col)
        axes[idx].grid(alpha=0.3)

# Hide extra subplots
for idx in range(len(numeric_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Boxplots of Numeric Features', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

#### Categorical Features Analysis

In [None]:
# Bar plots for categorical features
n_categorical = len(categorical_features)
fig, axes = plt.subplots(nrows=(n_categorical + 2) // 3, ncols=3, figsize=(18, 4 * ((n_categorical + 2) // 3)))
axes = axes.flatten()

for idx, col in enumerate(categorical_features):
    if idx < len(axes):
        value_counts = df[col].value_counts()
        value_counts.plot(kind='bar', ax=axes[idx], color='teal', edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'Distribution of {col}', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Count')
        axes[idx].tick_params(axis='x', rotation=45)
        axes[idx].grid(alpha=0.3)
        
        # Add value labels on bars
        for i, v in enumerate(value_counts.values):
            axes[idx].text(i, v + max(value_counts.values) * 0.01, str(v), ha='center', va='bottom')

# Hide extra subplots
for idx in range(n_categorical, len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Bar Plots of Categorical Features', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

In [None]:
# Pie charts for binary features
binary_features = ['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']
binary_features = [f for f in binary_features if f in df.columns]

fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(20, 12))
axes = axes.flatten()

for idx, col in enumerate(binary_features):
    if idx < len(axes):
        value_counts = df[col].value_counts()
        axes[idx].pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%',
                     startangle=90, colors=['#ff9999', '#66b3ff', '#99ff99', '#ffcc99'])
        axes[idx].set_title(f'{col}', fontweight='bold', fontsize=12)

# Hide extra subplots
for idx in range(len(binary_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Pie Charts of Binary/Categorical Features', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Target variable visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
target_counts = df['classification'].value_counts()
target_counts.plot(kind='bar', ax=axes[0], color=['#ff6b6b', '#4ecdc4'], edgecolor='black', alpha=0.8)
axes[0].set_title('Target Variable Distribution (Bar)', fontweight='bold', fontsize=14)
axes[0].set_xlabel('Classification', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].tick_params(axis='x', rotation=0)
axes[0].grid(alpha=0.3)

for i, v in enumerate(target_counts.values):
    axes[0].text(i, v + 5, str(v), ha='center', fontweight='bold')

# Pie chart
axes[1].pie(target_counts.values, labels=target_counts.index, autopct='%1.1f%%',
           startangle=90, colors=['#ff6b6b', '#4ecdc4'], explode=[0.05, 0.05])
axes[1].set_title('Target Variable Distribution (Pie)', fontweight='bold', fontsize=14)

plt.tight_layout()
plt.show()

print("\n📊 Class Distribution:")
print(f"CKD: {target_counts.get('ckd', 0)} ({target_counts.get('ckd', 0)/len(df)*100:.2f}%)")
print(f"Not CKD: {target_counts.get('notckd', 0)} ({target_counts.get('notckd', 0)/len(df)*100:.2f}%)")

### D. Bivariate Analysis (Feature vs Target)

#### Numeric Features vs Target

In [None]:
# Boxplots: Numeric features by classification
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(18, 22))
axes = axes.flatten()

for idx, col in enumerate(numeric_features):
    if idx < len(axes):
        df.boxplot(column=col, by='classification', ax=axes[idx], patch_artist=True)
        axes[idx].set_title(f'{col} by Classification', fontweight='bold')
        axes[idx].set_xlabel('Classification')
        axes[idx].set_ylabel(col)
        axes[idx].get_figure().suptitle('')  # Remove automatic title

# Hide extra subplots
for idx in range(len(numeric_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Numeric Features Distribution by Target Class', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

In [None]:
# Statistical testing: t-test for numeric features
print("\n📊 T-Test Results (Numeric Features vs Target):")
print("\nTesting if mean values differ significantly between CKD and Not CKD groups\n")
print(f"{'Feature':<15} {'CKD Mean':<12} {'NotCKD Mean':<14} {'t-statistic':<14} {'p-value':<12} {'Significant?'}")
print("="*95)

ttest_results = []

for col in numeric_features:
    # Get data for both groups (drop NaN values)
    ckd_data = df[df['classification'] == 'ckd'][col].dropna()
    notckd_data = df[df['classification'] == 'notckd'][col].dropna()
    
    if len(ckd_data) > 0 and len(notckd_data) > 0:
        # Perform t-test
        t_stat, p_value = ttest_ind(ckd_data, notckd_data)
        
        # Determine significance
        significant = '‚úì Yes' if p_value < 0.05 else '‚úó No'
        
        ttest_results.append({
            'Feature': col,
            'CKD_Mean': ckd_data.mean(),
            'NotCKD_Mean': notckd_data.mean(),
            't_statistic': t_stat,
            'p_value': p_value,
            'Significant': significant
        })
        
        print(f"{col:<15} {ckd_data.mean():<12.3f} {notckd_data.mean():<14.3f} {t_stat:<14.3f} {p_value:<12.6f} {significant}")

ttest_df = pd.DataFrame(ttest_results)
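
The t-test above uses SciPy's default, which assumes equal variances in both groups, and it is repeated once per feature. As a hedged alternative, the sketch below runs Welch's t-test (no equal-variance assumption) and the rank-based Mann-Whitney U test, then applies a Bonferroni correction for multiple comparisons. The function names and the synthetic samples are illustrative assumptions; they are not the notebook's method.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind


def compare_groups(group_a, group_b):
    """Return p-values from Welch's t-test and the Mann-Whitney U test."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    a = a[~np.isnan(a)]  # drop missing values
    b = b[~np.isnan(b)]
    _, p_welch = ttest_ind(a, b, equal_var=False)
    _, p_mwu = mannwhitneyu(a, b, alternative='two-sided')
    return p_welch, p_mwu


def bonferroni(p_values, alpha=0.05):
    """Flag p-values that remain below alpha after Bonferroni correction."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / len(p)


# Synthetic samples with clearly different means and spreads.
rng = np.random.default_rng(0)
group_low = rng.normal(loc=10.0, scale=2.0, size=60)
group_high = rng.normal(loc=15.0, scale=1.0, size=40)

p_welch, p_mwu = compare_groups(group_low, group_high)
flags = bonferroni([p_welch, p_mwu, 0.04])
print(p_welch, p_mwu, flags)
```

With three tests the corrected threshold is 0.05 / 3, so a raw p-value of 0.04 is no longer flagged as significant.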

In [None]:
# Violin plots for better distribution comparison
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(18, 22))
axes = axes.flatten()

for idx, col in enumerate(numeric_features):
    if idx < len(axes):
        sns.violinplot(data=df, x='classification', y=col, ax=axes[idx], palette='Set2')
        axes[idx].set_title(f'{col} by Classification', fontweight='bold')
        axes[idx].set_xlabel('Classification')
        axes[idx].set_ylabel(col)

# Hide extra subplots
for idx in range(len(numeric_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Violin Plots: Numeric Features by Target Class', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

#### Categorical Features vs Target

In [None]:
# Stacked bar plots for categorical features vs target
n_categorical = len(categorical_features)
fig, axes = plt.subplots(nrows=(n_categorical + 2) // 3, ncols=3, figsize=(18, 4 * ((n_categorical + 2) // 3)))
axes = axes.flatten()

for idx, col in enumerate(categorical_features):
    if idx < len(axes):
        # Create contingency table
        ct = pd.crosstab(df[col], df['classification'])
        ct.plot(kind='bar', stacked=True, ax=axes[idx], color=['#ff6b6b', '#4ecdc4'], 
                edgecolor='black', alpha=0.8)
        axes[idx].set_title(f'{col} vs Classification', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Count')
        axes[idx].tick_params(axis='x', rotation=45)
        axes[idx].legend(title='Classification', loc='upper right')
        axes[idx].grid(alpha=0.3)

# Hide extra subplots
for idx in range(n_categorical, len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Stacked Bar Plots: Categorical Features vs Target', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

In [None]:
# Chi-square test for categorical features
print("\n📊 Chi-Square Test Results (Categorical Features vs Target):")
print("\nTesting if categorical features are independent of target variable\n")
print(f"{'Feature':<15} {'Chi2-statistic':<18} {'p-value':<12} {'Significant?'}")
print("="*60)

chi2_results = []

for col in categorical_features:
    # Create contingency table (NaN is kept as its own 'Missing' category)
    ct = pd.crosstab(df[col].fillna('Missing'), df['classification'])
    
    # Perform chi-square test
    chi2, p_value, dof, expected = chi2_contingency(ct)
    
    # Determine significance
    significant = '‚úì Yes' if p_value < 0.05 else '‚úó No'
    
    chi2_results.append({
        'Feature': col,
        'chi2_statistic': chi2,
        'p_value': p_value,
        'Significant': significant
    })
    
    print(f"{col:<15} {chi2:<18.3f} {p_value:<12.6f} {significant}")

chi2_df = pd.DataFrame(chi2_results)
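
A chi-square p-value says whether an association is detectable, not how strong it is. One common effect-size measure for contingency tables is Cramér's V, defined as $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for a table with $r$ rows, $c$ columns, and $n$ observations. The sketch below is an illustrative helper (the name `cramers_v` and the demo data are assumptions) and uses the uncorrected chi-square statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V for two categorical Series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    if n == 0 or k <= 0:
        return float('nan')  # undefined for empty or single-category tables
    chi2 = chi2_contingency(table, correction=False)[0]
    return float(np.sqrt(chi2 / (n * k)))


# Made-up data: 'htn' perfectly separates the two classes.
demo = pd.DataFrame({
    'htn': ['yes', 'yes', 'yes', 'no', 'no', 'no'],
    'classification': ['ckd', 'ckd', 'ckd', 'notckd', 'notckd', 'notckd'],
})
v_perfect = cramers_v(demo['htn'], demo['classification'])

# Made-up data with no association at all.
v_none = cramers_v(pd.Series(['a', 'a', 'b', 'b']), pd.Series(['c', 'd', 'c', 'd']))
print(v_perfect, v_none)
```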

In [None]:
# Grouped bar plots for better comparison
n_categorical = len(categorical_features)
fig, axes = plt.subplots(nrows=(n_categorical + 2) // 3, ncols=3, figsize=(18, 4 * ((n_categorical + 2) // 3)))
axes = axes.flatten()

for idx, col in enumerate(categorical_features):
    if idx < len(axes):
        # Create contingency table with percentages
        ct = pd.crosstab(df[col], df['classification'], normalize='index') * 100
        ct.plot(kind='bar', ax=axes[idx], color=['#ff6b6b', '#4ecdc4'], 
                edgecolor='black', alpha=0.8)
        axes[idx].set_title(f'{col} vs Classification (%)', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Percentage (%)')
        axes[idx].tick_params(axis='x', rotation=45)
        axes[idx].legend(title='Classification', loc='upper right')
        axes[idx].grid(alpha=0.3)

# Hide extra subplots
for idx in range(n_categorical, len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Grouped Bar Plots: Categorical Features vs Target (Percentage)', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

#### Correlation Analysis

In [None]:
# Pearson correlation heatmap for numeric features
plt.figure(figsize=(14, 12))
correlation_matrix = df[numeric_features].corr(method='pearson')

sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Pearson Correlation Heatmap (Numeric Features)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Print highly correlated pairs (|correlation| > 0.7)
print("\n🔗 Highly Correlated Feature Pairs (|r| > 0.7):")
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            high_corr.append({
                'Feature 1': correlation_matrix.columns[i],
                'Feature 2': correlation_matrix.columns[j],
                'Correlation': correlation_matrix.iloc[i, j]
            })

if high_corr:
    high_corr_df = pd.DataFrame(high_corr).sort_values('Correlation', ascending=False)
    print(high_corr_df.to_string(index=False))
else:
    print("No feature pairs with |correlation| > 0.7 found.")
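
The nested loop above visits each pair in the upper triangle of the correlation matrix. The same selection can be written without explicit loops by masking the upper triangle and stacking it into a long table. This is a sketch with an illustrative function name (`high_correlation_pairs`) and made-up data.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(corr, threshold=0.7):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    # Keep only entries strictly above the diagonal so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ['Feature 1', 'Feature 2', 'Correlation']
    # The comparison is False for NaN, so masked entries are dropped here too.
    strong = pairs[pairs['Correlation'].abs() > threshold]
    return strong.sort_values('Correlation', ascending=False).reset_index(drop=True)


# Made-up data: 'a' and 'b' move together, 'c' does not.
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 11.0],
    'c': [1.0, -1.0, 1.0, -1.0, 1.0],
})
pairs = high_correlation_pairs(demo.corr(method='pearson'))
print(pairs)
```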

In [None]:
# Spearman correlation heatmap (for non-normal data)
plt.figure(figsize=(14, 12))
spearman_corr = df[numeric_features].corr(method='spearman')

sns.heatmap(spearman_corr, annot=True, fmt='.2f', cmap='viridis', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Spearman Correlation Heatmap (Numeric Features)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

### E. Multivariate Analysis

In [None]:
# Pairplot for key numeric features
key_features = ['bp', 'bgr', 'bu', 'sc', 'hemo']
key_features = [f for f in key_features if f in numeric_features]

# Add classification to the features
pairplot_data = df[key_features + ['classification']].copy()

print("\n🎨 Creating pairplot for key features...")
sns.pairplot(pairplot_data, hue='classification', palette=['#ff6b6b', '#4ecdc4'],
             diag_kind='kde', plot_kws={'alpha': 0.6}, height=2.5)
plt.suptitle('Pairplot: Key Numeric Features by Classification', y=1.02, fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Scatter plots with interesting relationships
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Hemoglobin vs Serum Creatinine
if 'hemo' in df.columns and 'sc' in df.columns:
    for classification in df['classification'].unique():
        mask = df['classification'] == classification
        color = '#ff6b6b' if classification == 'ckd' else '#4ecdc4'
        axes[0, 0].scatter(df.loc[mask, 'hemo'], df.loc[mask, 'sc'], 
                          alpha=0.6, s=50, label=classification, color=color)
    axes[0, 0].set_xlabel('Hemoglobin (gms)', fontsize=12)
    axes[0, 0].set_ylabel('Serum Creatinine (mg/dl)', fontsize=12)
    axes[0, 0].set_title('Hemoglobin vs Serum Creatinine', fontweight='bold', fontsize=14)
    axes[0, 0].legend()
    axes[0, 0].grid(alpha=0.3)

# 2. Blood Urea vs Serum Creatinine
if 'bu' in df.columns and 'sc' in df.columns:
    for classification in df['classification'].unique():
        mask = df['classification'] == classification
        color = '#ff6b6b' if classification == 'ckd' else '#4ecdc4'
        axes[0, 1].scatter(df.loc[mask, 'bu'], df.loc[mask, 'sc'], 
                          alpha=0.6, s=50, label=classification, color=color)
    axes[0, 1].set_xlabel('Blood Urea (mg/dl)', fontsize=12)
    axes[0, 1].set_ylabel('Serum Creatinine (mg/dl)', fontsize=12)
    axes[0, 1].set_title('Blood Urea vs Serum Creatinine', fontweight='bold', fontsize=14)
    axes[0, 1].legend()
    axes[0, 1].grid(alpha=0.3)

# 3. Hemoglobin vs Blood Pressure
if 'hemo' in df.columns and 'bp' in df.columns:
    for classification in df['classification'].unique():
        mask = df['classification'] == classification
        color = '#ff6b6b' if classification == 'ckd' else '#4ecdc4'
        axes[1, 0].scatter(df.loc[mask, 'hemo'], df.loc[mask, 'bp'], 
                          alpha=0.6, s=50, label=classification, color=color)
    axes[1, 0].set_xlabel('Hemoglobin (gms)', fontsize=12)
    axes[1, 0].set_ylabel('Blood Pressure (mmHg)', fontsize=12)
    axes[1, 0].set_title('Hemoglobin vs Blood Pressure', fontweight='bold', fontsize=14)
    axes[1, 0].legend()
    axes[1, 0].grid(alpha=0.3)

# 4. Blood Glucose vs Blood Pressure
if 'bgr' in df.columns and 'bp' in df.columns:
    for classification in df['classification'].unique():
        mask = df['classification'] == classification
        color = '#ff6b6b' if classification == 'ckd' else '#4ecdc4'
        axes[1, 1].scatter(df.loc[mask, 'bgr'], df.loc[mask, 'bp'], 
                          alpha=0.6, s=50, label=classification, color=color)
    axes[1, 1].set_xlabel('Blood Glucose Random (mg/dl)', fontsize=12)
    axes[1, 1].set_ylabel('Blood Pressure (mmHg)', fontsize=12)
    axes[1, 1].set_title('Blood Glucose vs Blood Pressure', fontweight='bold', fontsize=14)
    axes[1, 1].legend()
    axes[1, 1].grid(alpha=0.3)

plt.suptitle('Multivariate Analysis: Key Feature Relationships', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

In [None]:
# 3D scatter plot for cluster visualization
from mpl_toolkits.mplot3d import Axes3D

if 'hemo' in df.columns and 'sc' in df.columns and 'bu' in df.columns:
    fig = plt.figure(figsize=(14, 10))
    ax = fig.add_subplot(111, projection='3d')
    
    for classification in df['classification'].unique():
        mask = df['classification'] == classification
        color = '#ff6b6b' if classification == 'ckd' else '#4ecdc4'
        ax.scatter(df.loc[mask, 'hemo'], df.loc[mask, 'sc'], df.loc[mask, 'bu'],
                  alpha=0.6, s=50, label=classification, color=color)
    
    ax.set_xlabel('Hemoglobin (gms)', fontsize=12)
    ax.set_ylabel('Serum Creatinine (mg/dl)', fontsize=12)
    ax.set_zlabel('Blood Urea (mg/dl)', fontsize=12)
    ax.set_title('3D Scatter Plot: Hemoglobin, Creatinine, Urea', fontweight='bold', fontsize=14)
    ax.legend()
    
    plt.tight_layout()
    plt.show()

### F. Outlier Detection

In [None]:
# Detect outliers using IQR method
print("\n🔍 Outlier Detection using IQR Method:")
print("\nOutliers are defined as values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]\n")
print(f"{'Feature':<15} {'Total Values':<15} {'Outliers':<12} {'Outlier %':<12} {'Lower Bound':<14} {'Upper Bound'}")
print("="*90)

outlier_info = []

for col in numeric_features:
    # Calculate Q1, Q3, and IQR
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Count outliers
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    n_outliers = len(outliers)
    total_values = df[col].notna().sum()
    outlier_pct = (n_outliers / total_values * 100) if total_values > 0 else 0
    
    outlier_info.append({
        'Feature': col,
        'Total_Values': total_values,
        'Outliers': n_outliers,
        'Outlier_Percentage': outlier_pct,
        'Lower_Bound': lower_bound,
        'Upper_Bound': upper_bound
    })
    
    print(f"{col:<15} {total_values:<15} {n_outliers:<12} {outlier_pct:<12.2f} {lower_bound:<14.2f} {upper_bound:.2f}")

outlier_df = pd.DataFrame(outlier_info)
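
The bounds computed above can also be used to treat outliers by capping, one of the options this notebook lists as a next step. The sketch below clips a series to [Q1 - k·IQR, Q3 + k·IQR]; the helper name `cap_at_iqr_bounds` and the demo values are illustrative. Whether capping is appropriate is a judgment call, since extreme laboratory values may be genuine in patients with kidney disease.

```python
import pandas as pd


def cap_at_iqr_bounds(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; NaN entries are left as NaN."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up values with one obvious outlier and one missing entry.
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, None])
capped = cap_at_iqr_bounds(demo)
print(capped.tolist())
```

In the demo, Q1 is 2.25 and Q3 is 4.75, so the upper bound is 8.5 and the value 100.0 is pulled down to it while the other values are unchanged.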

In [None]:
# Visualize outliers
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(18, 20))
axes = axes.flatten()

for idx, col in enumerate(numeric_features):
    if idx < len(axes):
        # Calculate bounds
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Create boxplot
        bp = axes[idx].boxplot(df[col].dropna(), patch_artist=True,
                               boxprops=dict(facecolor='lightblue'),
                               flierprops=dict(marker='o', markerfacecolor='red', markersize=6, alpha=0.5))
        
        # Add horizontal lines for bounds
        axes[idx].axhline(y=lower_bound, color='green', linestyle='--', linewidth=1, label='Lower Bound')
        axes[idx].axhline(y=upper_bound, color='orange', linestyle='--', linewidth=1, label='Upper Bound')
        
        axes[idx].set_title(f'{col}', fontweight='bold')
        axes[idx].set_ylabel(col)
        axes[idx].grid(alpha=0.3)
        axes[idx].legend(fontsize=8)

# Hide extra subplots
for idx in range(len(numeric_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Outlier Detection: Boxplots with IQR Bounds', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

In [None]:
# Identify medically impossible or extreme values
print("\n⚠️ Checking for Medically Impossible/Extreme Values:\n")

# Blood Pressure
if 'bp' in df.columns:
    extreme_bp = df[(df['bp'] < 40) | (df['bp'] > 200)]
    print(f"Blood Pressure < 40 or > 200 mmHg: {len(extreme_bp)} cases")
    if len(extreme_bp) > 0:
        print(f"  Range: {df['bp'].min():.1f} - {df['bp'].max():.1f}")

# Serum Creatinine
if 'sc' in df.columns:
    extreme_sc = df[df['sc'] > 15]
    print(f"\nSerum Creatinine > 15 mg/dl: {len(extreme_sc)} cases")
    if len(extreme_sc) > 0:
        print(f"  Max value: {df['sc'].max():.1f}")

# Hemoglobin
if 'hemo' in df.columns:
    extreme_hemo = df[(df['hemo'] < 3) | (df['hemo'] > 20)]
    print(f"\nHemoglobin < 3 or > 20 gms: {len(extreme_hemo)} cases")
    if len(extreme_hemo) > 0:
        print(f"  Range: {df['hemo'].min():.1f} - {df['hemo'].max():.1f}")

# Blood Glucose
if 'bgr' in df.columns:
    extreme_bgr = df[(df['bgr'] < 50) | (df['bgr'] > 500)]
    print(f"\nBlood Glucose < 50 or > 500 mg/dl: {len(extreme_bgr)} cases")
    if len(extreme_bgr) > 0:
        print(f"  Range: {df['bgr'].min():.1f} - {df['bgr'].max():.1f}")

# Age
if 'age' in df.columns:
    extreme_age = df[(df['age'] < 0) | (df['age'] > 120)]
    print(f"\nAge < 0 or > 120 years: {len(extreme_age)} cases")
    if len(extreme_age) > 0:
        print(f"  Range: {df['age'].min():.1f} - {df['age'].max():.1f}")
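
The five checks above share one pattern: a column, an optional lower limit, and an optional upper limit. A table-driven version keeps the limits in one place and makes it easy to add features. The sketch reuses the thresholds from the cell above; the names `PLAUSIBLE_RANGES` and `count_out_of_range` and the demo frame are illustrative assumptions.

```python
import pandas as pd

# (lower, upper) limits taken from the checks above; None means "no limit".
PLAUSIBLE_RANGES = {
    'bp': (40, 200),
    'sc': (None, 15),
    'hemo': (3, 20),
    'bgr': (50, 500),
    'age': (0, 120),
}


def count_out_of_range(frame, ranges):
    """Count values outside the given limits for each column that is present."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col not in frame.columns:
            continue
        values = frame[col]
        flagged = pd.Series(False, index=frame.index)
        if low is not None:
            flagged |= (values < low)
        if high is not None:
            flagged |= (values > high)
        counts[col] = int(flagged.sum())  # NaN never compares as out of range
    return counts


# Made-up values, not rows from the dataset.
demo = pd.DataFrame({'bp': [80, 30, 210, None], 'age': [45, 130, 60, 5]})
out_of_range = count_out_of_range(demo, PLAUSIBLE_RANGES)
print(out_of_range)
```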

---
## 📝 Summary of EDA Findings

In [None]:
print("="*80)
print("📊 EXPLORATORY DATA ANALYSIS - KEY FINDINGS")
print("="*80)

print("\n1. DATASET OVERVIEW:")
print(f"   • Total Records: {df.shape[0]}")
print(f"   • Total Features: {df.shape[1]}")
print(f"   • Numeric Features: {len(numeric_features)}")
print(f"   • Categorical Features: {len(categorical_features)}")

print("\n2. TARGET VARIABLE:")
target_dist = df['classification'].value_counts()
print(f"   • CKD Cases: {target_dist.get('ckd', 0)} ({target_dist.get('ckd', 0)/len(df)*100:.1f}%)")
print(f"   • Not CKD Cases: {target_dist.get('notckd', 0)} ({target_dist.get('notckd', 0)/len(df)*100:.1f}%)")
if abs(target_dist.get('ckd', 0) - target_dist.get('notckd', 0)) / len(df) > 0.3:
    print("   ⚠️ Dataset is imbalanced - may need balancing techniques")

print("\n3. DATA QUALITY:")
print(f"   • Missing Values: {df.isnull().sum().sum()} ({(df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100:.2f}%)")
print(f"   • Duplicate Rows: {duplicates}")
if not missing_data.empty:
    print(f"   • Features with >20% missing: {len(missing_data[missing_data['Missing_Percentage'] > 20])}")

print("\n4. STATISTICAL SIGNIFICANCE:")
if 'ttest_df' in locals():
    # Count by p-value rather than by the display label
    sig_features = ttest_df[ttest_df['p_value'] < 0.05]
    print(f"   • Numeric features significantly different between classes: {len(sig_features)}/{len(numeric_features)}")
if 'chi2_df' in locals():
    sig_cat_features = chi2_df[chi2_df['p_value'] < 0.05]
    print(f"   • Categorical features associated with target: {len(sig_cat_features)}/{len(categorical_features)}")

print("\n5. CORRELATIONS:")
if 'high_corr' in locals() and len(high_corr) > 0:
    print(f"   • Highly correlated feature pairs (|r| > 0.7): {len(high_corr)}")
    print("   ⚠️ May need feature selection to remove multicollinearity")
else:
    print("   • No highly correlated feature pairs found (|r| > 0.7)")

print("\n6. OUTLIERS:")
if 'outlier_df' in locals():
    high_outlier_features = outlier_df[outlier_df['Outlier_Percentage'] > 5]
    print(f"   • Features with >5% outliers: {len(high_outlier_features)}")
    if len(high_outlier_features) > 0:
        print("   • These features may need treatment (capping/winsorization)")

print("\n7. NEXT STEPS:")
print("   ✓ Handle missing values (imputation/deletion)")
print("   ✓ Treat outliers (capping/winsorization)")
print("   ✓ Encode categorical variables")
print("   ✓ Feature scaling/normalization")
print("   ✓ Feature selection")
print("   ✓ Train ML models for CKD prediction")

print("\n" + "="*80)
print("✅ EDA COMPLETED SUCCESSFULLY!")
print("="*80)

---
## 🎯 Conclusion

The Exploratory Data Analysis revealed several important insights:

### Key Findings:
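1. **Missing Data**: The dataset contains substantial missing values; the missing-value table and the per-class breakdown show which features are most affected and whether missingness differs between CKD and non-CKD patients
2. **Class Imbalance**: The CKD and non-CKD classes are not equally represented; the summary cell flags the dataset as imbalanced only when the class shares differ by more than 30 percentage points
3. **Feature Relationships**: Several features show statistically significant associations with the target (t-tests for numeric features, chi-square tests for categorical features, both at the 0.05 level)
4. **Outliers**: Multiple features contain values outside the 1.5×IQR bounds and may need treatment such as capping or winsorization
5. **Data Quality**: Some values should be validated for medical plausibility, using range checks like those applied to blood pressure, serum creatinine, hemoglobin, blood glucose, and age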

### Next Steps:
- **Data Preprocessing**: Handle missing values and outliers
- **Feature Engineering**: Create new features if needed
- **Model Building**: Train and evaluate classification models
- **Model Optimization**: Fine-tune hyperparameters for best performance
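
As a starting point for the preprocessing step, the sketch below fills numeric gaps with the column median and categorical gaps with the most frequent value, then one-hot encodes the categorical column. It is a simple baseline under stated assumptions, not a recommendation for this dataset: the helper name `simple_impute` and the demo values are illustrative, and in a modeling workflow the fill values would normally be learned from the training split only to avoid leakage.

```python
import pandas as pd


def simple_impute(frame, numeric_cols, categorical_cols):
    """Fill numeric NaNs with the median and categorical NaNs with the mode."""
    out = frame.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        modes = out[col].mode(dropna=True)
        if not modes.empty:
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Made-up values, not rows from the dataset.
demo = pd.DataFrame({
    'hemo': [15.4, None, 9.6, 11.2],
    'htn': ['yes', 'no', None, 'yes'],
})
imputed = simple_impute(demo, numeric_cols=['hemo'], categorical_cols=['htn'])

# One-hot encode; drop_first avoids a redundant column for binary features.
encoded = pd.get_dummies(imputed, columns=['htn'], drop_first=True)
print(encoded)
```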

---