# Exploratory Data Analysis (EDA)
## Philippine Health Indicators

**Purpose**
Understand the structure, quality, and distribution of national health indicators
for the Philippines prior to statistical modeling and policy analysis.

**Dataset Source**
https://www.kaggle.com/datasets/thedevastator/philippine-health-indicators

**Key Outputs**
- Data schema and summary statistics
- Missing data analysis
- Distribution plots (histograms, boxplots)
- Correlation matrix
- Cleaned dataset snapshot


In [None]:
# Core libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display options
pd.set_option("display.max_columns", 100)
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an older alias

# Load dataset
# In Colab, upload the CSV or mount Google Drive
df = pd.read_csv("/content/philippine_health_indicators.csv")

# Preview dataset
df.head()


In [None]:
# Dataset shape
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

# Column data types
df.info()


In [None]:
# Count unique entities (adjust column names if needed)
summary_counts = {
    "Unique Years": df["Year"].nunique() if "Year" in df.columns else "N/A",
    "Unique Indicators": df["Indicator"].nunique() if "Indicator" in df.columns else "N/A",
    "Unique Regions": df["Region"].nunique() if "Region" in df.columns else "N/A"
}

summary_counts
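

The lookups above fall back to "N/A" whenever a column is spelled differently in the file (for example `year` or `YEAR`). As a minimal sketch, assuming the mismatch is only in letter case or surrounding whitespace, a hypothetical helper `resolve_columns` (not part of pandas) can map the expected names to whatever the file actually uses. The toy frame and its values are invented stand-ins for the real dataset.

```python
import pandas as pd


def resolve_columns(frame, expected):
    """Map each expected name to the matching column in `frame`, or None.

    Hypothetical helper: matching ignores case and surrounding whitespace only.
    """
    normalized = {str(col).strip().lower(): col for col in frame.columns}
    return {name: normalized.get(name.strip().lower()) for name in expected}


# Toy frame with made-up values standing in for the real dataset
toy = pd.DataFrame({" year ": [2019, 2020], "INDICATOR": ["a", "b"], "Value": [1.0, 2.0]})

mapping = resolve_columns(toy, ["Year", "Indicator", "Region"])
print(mapping)

# Rename only the columns that were found, so later cells can use the expected names
toy = toy.rename(columns={actual: name for name, actual in mapping.items() if actual is not None})
print(list(toy.columns))
```

Columns that map to `None` are genuinely absent (or named too differently for this simple rule) and need a manual look at `df.columns`.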


In [None]:
# Missing value count and percentage
missing_df = pd.DataFrame({
    "Missing Count": df.isnull().sum(),
    "Missing %": (df.isnull().mean() * 100).round(2)
}).sort_values("Missing %", ascending=False)

missing_df
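

Column-level counts can hide patterns that depend on which indicator a row belongs to. A minimal sketch, assuming a long-format table with an `Indicator` column and a numeric `Value` column (both names are assumptions about the file, and the toy values are invented), computes the share of missing values per indicator:

```python
import numpy as np
import pandas as pd

# Toy long-format data: indicator "B" is missing most of its values
toy = pd.DataFrame({
    "Indicator": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Value": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan, np.nan, 5.0],
})

# Mean of a boolean mask is the fraction of True values, here the missing share
missing_by_indicator = (
    toy["Value"].isnull()
    .groupby(toy["Indicator"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(missing_by_indicator)
```

Large differences between groups would suggest that missingness depends on the indicator rather than being spread evenly across rows.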


In [None]:
# Missing data heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Data Heatmap")
plt.show()


In [None]:
# Drop columns with more than 40% missing data
threshold = 0.4
df = df.loc[:, df.isnull().mean() <= threshold]

# Drop rows with missing year or indicator values
# (.copy() avoids SettingWithCopyWarning when columns are modified later)
critical_cols = [c for c in ["Year", "Indicator"] if c in df.columns]
df = df.dropna(subset=critical_cols).copy()

df.shape
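

The threshold rule is easier to audit when the names of the dropped columns are reported alongside the new shape. The sketch below uses a hypothetical helper, `drop_sparse_columns`, that drops a column only when its missing share is strictly greater than the threshold. The toy frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold=0.4):
    """Return (kept_frame, dropped_names); drops columns whose missing share exceeds `threshold`."""
    missing_share = frame.isnull().mean()
    dropped = missing_share[missing_share > threshold].index.tolist()
    return frame.drop(columns=dropped), dropped


toy = pd.DataFrame({
    "full": [1, 2, 3, 4, 5],
    "at_threshold": [1.0, 2.0, 3.0, np.nan, np.nan],  # exactly 40% missing
    "sparse": [1.0, np.nan, np.nan, np.nan, 5.0],     # 60% missing
})

kept, dropped = drop_sparse_columns(toy, threshold=0.4)
print("Dropped:", dropped)
print("Kept:", list(kept.columns))
```

In this sketch a column sitting exactly at the threshold is kept; change the comparison to `>=` if the boundary case should be dropped instead.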


In [None]:
# Convert Year to integer
if "Year" in df.columns:
    df["Year"] = df["Year"].astype(int)

# Convert categorical columns
categorical_cols = df.select_dtypes(include="object").columns
for col in categorical_cols:
    df[col] = df[col].astype("category")

# Verify data types
df.dtypes


In [None]:
# Identify numeric columns (Year is a time index, not a measurement, so it is excluded)
numeric_cols = [c for c in df.select_dtypes(include=np.number).columns if c != "Year"]
numeric_cols


In [None]:
# Histogram of every numeric column, one subplot per column
df[numeric_cols].hist(
    figsize=(15, 10),
    bins=30,
    edgecolor="black"
)
plt.suptitle("Distribution of Numeric Health Indicators", fontsize=16)
plt.show()


In [None]:
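# Horizontal boxplots on one shared axis; columns on very different scales
# will look compressed and may need separate plots
plt.figure(figsize=(14, 6))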
sns.boxplot(data=df[numeric_cols], orient="h")
plt.title("Boxplot of Numeric Indicators")
plt.show()


In [None]:
# Z-score based outlier detection
from scipy.stats import zscore

z_scores = np.abs(zscore(df[numeric_cols], nan_policy="omit"))
outlier_mask = (z_scores > 3)

outlier_counts = pd.Series(outlier_mask.sum(axis=0), index=numeric_cols)
outlier_counts.sort_values(ascending=False)
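

Z-scores depend on the mean and standard deviation, which extreme values themselves inflate, so on skewed columns or small samples an outlier can go unflagged. As a hedged alternative (not the method used in this notebook), the sketch below computes the modified z-score of Iglewicz and Hoaglin, based on the median and the median absolute deviation (MAD), with their commonly cited cutoff of 3.5. The helper name `modified_zscore` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def modified_zscore(series):
    """Modified z-score: 0.6745 * (x - median) / MAD, with NaN values ignored."""
    median = series.median()
    mad = (series - median).abs().median()
    if mad == 0:
        # MAD is zero when more than half the values are identical; the score is undefined
        return pd.Series(np.nan, index=series.index)
    return 0.6745 * (series - median) / mad


toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0, np.nan])

# Classic z-score (population standard deviation, as scipy.stats.zscore uses by default)
classic = (toy - toy.mean()) / toy.std(ddof=0)
classic_outliers = classic.abs() > 3

# Robust variant; NaN compares as False, so missing values are never flagged
scores = modified_zscore(toy)
robust_outliers = scores.abs() > 3.5

print("Classic flags:", toy[classic_outliers].tolist())
print("Robust flags:", toy[robust_outliers].tolist())
```

On this toy series the value 100 stays below the classic cutoff of 3 because it inflates the standard deviation it is measured against, while the MAD-based score flags it.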


In [None]:
# Flag but do not remove outliers (health indicator data often contains genuine extremes)
df["outlier_flag"] = outlier_mask.any(axis=1)

df["outlier_flag"].value_counts()


In [None]:
# Pairwise Pearson correlations between numeric columns (the pandas default method)
corr_matrix = df[numeric_cols].corr()

plt.figure(figsize=(12, 8))
sns.heatmap(
    corr_matrix,
    cmap="coolwarm",
    center=0,
    linewidths=0.5
)
plt.title("Correlation Matrix of Health Indicators")
plt.show()
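

A heatmap makes it hard to read off which pairs are most strongly related. As a minimal sketch, a hypothetical helper `top_correlated_pairs` ranks column pairs by absolute Pearson correlation, using only the upper triangle of the matrix so that each pair is listed once. The toy columns and values are invented.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=5):
    """Return up to n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Strict upper triangle: each pair appears once and the diagonal is excluded
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # exact multiple of x
    "z": [5, 3, 4, 1, 2],   # tends to fall as x rises
})

top = top_correlated_pairs(toy, n=5)
print(top)
```

Sorting by absolute value keeps strong negative relationships in view alongside strong positive ones.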


In [None]:
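# Summary statistics, one row per numeric column
summary_stats = df[numeric_cols].describe().T
# describe() already reports the median as the "50%" column; this adds it under an explicit name
summary_stats["median"] = df[numeric_cols].median()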

summary_stats
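

The table from `describe()` does not include a skewness measure. A minimal sketch on invented toy data shows how `DataFrame.skew()` and the gap between mean and median can be used to quantify asymmetry per column:

```python
import pandas as pd

toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 50.0],
})

# Sample skewness per column: near 0 for symmetric data, positive for a long right tail
skewness = toy.skew()

# A mean well above the median is a second, simpler sign of right skew
mean_minus_median = toy.mean() - toy.median()

print(pd.DataFrame({"skew": skewness, "mean - median": mean_minus_median}))
```

The same two calls could be applied to `df[numeric_cols]` to decide which columns might benefit from a transformation before modeling.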


In [None]:
# Save cleaned dataset snapshot
df.to_csv("/content/cleaned_philippine_health_indicators.csv", index=False)


## Key Takeaways from EDA

- Dataset structure and coverage validated
- Missingness is non-random and indicator-specific
- Several indicators exhibit skewness and outliers consistent with real-world health data
- Strong correlations exist between selected indicators, justifying multivariate analysis

