# **Correlation & Relationship Analysis**

---

Quantify and visualize **statistical relationships** among key variables: student enrollment, teacher counts, students-per-teacher ratios, time, regions, and school categories. This notebook assesses whether staffing scales with enrollment and where that relationship diverges across regions, school categories, and school years.


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import pearsonr, spearmanr

pd.set_option("display.max_columns", None)
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is a legacy alias

In [None]:
# Dataset source:
# https://www.kaggle.com/datasets/franksebastiancayaco/philippine-public-school-teachers-and-students

DATA_PATH = "../data/raw/philippine_public_school_teachers_students.csv"

df = pd.read_csv(DATA_PATH)
df.head()
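
The cells below assume the CSV exposes the columns `school_year`, `students`, `teachers`, `school_category`, and `region`. These names are taken from the code in this notebook, not from a published schema, so a quick check before continuing can save a confusing `KeyError` later. A minimal sketch, with `missing_columns` as a hypothetical helper and a toy frame standing in for the real data:

```python
import pandas as pd

# Columns the later cells rely on (names taken from this notebook's own code).
REQUIRED_COLUMNS = ["school_year", "students", "teachers", "school_category", "region"]


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Toy frame with hypothetical values; in the notebook, pass `df` instead.
toy = pd.DataFrame({"school_year": ["2020-2021"], "students": [100], "teachers": [4]})
print(missing_columns(toy))
```

An empty list means every expected column is present.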

In [None]:
# Normalize time variable
df["school_year"] = df["school_year"].astype(str)
df["year_start"] = df["school_year"].str[:4].astype(int)

# Numeric coercion
df["students"] = pd.to_numeric(df["students"], errors="coerce")
df["teachers"] = pd.to_numeric(df["teachers"], errors="coerce")

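# Derived metric; non-positive teacher counts become NaN so the ratio is never
# infinite (inf values would survive dropna() and break the correlations below)
df["students_per_teacher"] = df["students"] / df["teachers"].where(df["teachers"] > 0)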

df.info()

In [None]:
numeric_df = df[
    ["students", "teachers", "students_per_teacher", "year_start"]
].dropna()

corr_matrix = numeric_df.corr(method="pearson")
corr_matrix

In [None]:
plt.figure(figsize=(6, 4))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap="coolwarm",
    fmt=".2f"
)
plt.title("Pearson Correlation Matrix")
plt.show()

In [None]:
pearson_st, pearson_p = pearsonr(
    numeric_df["students"],
    numeric_df["teachers"]
)

spearman_st, spearman_p = spearmanr(
    numeric_df["students"],
    numeric_df["teachers"]
)

pd.DataFrame({
    "Method": ["Pearson", "Spearman"],
    "Correlation": [pearson_st, spearman_st],
    "p_value": [pearson_p, spearman_p]
})
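
Pearson measures linear association, while Spearman measures monotonic association through ranks, which is why the two can disagree when the relationship is curved or dominated by a few very large schools. The sketch below illustrates the difference on synthetic data (not the dataset used in this notebook):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, strictly increasing but strongly nonlinear relationship.
x = np.arange(1, 51, dtype=float)
y = np.exp(x / 5.0)

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)

# Spearman sees a perfect monotonic ordering; Pearson is pulled down by curvature.
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

A large gap between the two coefficients in the table above would suggest inspecting the scatter plot for nonlinearity or outliers before interpreting the Pearson value.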

In [None]:
plt.figure(figsize=(6, 4))
sns.scatterplot(
    data=df,
    x="students",
    y="teachers",
    alpha=0.6
)

sns.regplot(
    data=df,
    x="students",
    y="teachers",
    scatter=False,
    color="red"
)

plt.title("Students vs Teachers Relationship")
plt.xlabel("Number of Students")
plt.ylabel("Number of Teachers")
plt.show()

In [None]:
plt.figure(figsize=(8, 5))

sns.scatterplot(
    data=df,
    x="students",
    y="teachers",
    hue="school_category",
    alpha=0.6
)

plt.title("Students vs Teachers by School Category")
plt.xlabel("Number of Students")
plt.ylabel("Number of Teachers")
plt.legend(title="School Category")
plt.show()

In [None]:
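# Select the two columns before apply so the grouping column is not passed to
# the lambda (avoids the pandas deprecation warning for groupby.apply)
category_corr = (
    df.groupby("school_category")[["students", "teachers"]]
      .apply(lambda g: g["students"].corr(g["teachers"]))
      .reset_index(name="student_teacher_corr")
)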

category_corr

In [None]:
# Per-region Pearson correlation, sorted from weakest to strongest
regional_corr = (
    df.groupby("region")[["students", "teachers"]]
      .apply(lambda g: g["students"].corr(g["teachers"]))
      .reset_index(name="student_teacher_corr")
      .sort_values("student_teacher_corr")
)

regional_corr
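
A per-group coefficient is only as reliable as the number of rows behind it, so it can help to report group sizes next to the correlations. The following is a minimal sketch on toy data; `group_corr_with_n` and the `min_n` threshold are hypothetical choices for illustration, not part of the original analysis:

```python
import numpy as np
import pandas as pd


def group_corr_with_n(frame, by, x="students", y="teachers", min_n=3):
    """Per-group Pearson correlation of x and y plus the count of complete pairs.

    Groups with fewer than `min_n` complete pairs get NaN instead of a coefficient.
    """
    rows = []
    for key, grp in frame.groupby(by):
        pairs = grp[[x, y]].dropna()
        n = len(pairs)
        r = pairs[x].corr(pairs[y]) if n >= min_n else np.nan
        rows.append({by: key, "n": n, "corr": r})
    return pd.DataFrame(rows)


# Toy data with hypothetical values; in the notebook, pass `df` and "region".
toy = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B"],
    "students": [100, 200, 300, 400, 150, 250],
    "teachers": [5, 9, 16, 20, 6, 11],
})
result = group_corr_with_n(toy, "region")
print(result)
```

Region B has only two rows, where any two distinct points correlate perfectly, so the helper reports NaN rather than a misleading coefficient.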

In [None]:
plt.figure(figsize=(8, 5))

sns.barplot(
    data=regional_corr,
    y="region",
    x="student_teacher_corr"
)

plt.title("Student–Teacher Correlation by Region")
plt.xlabel("Correlation Coefficient")
plt.ylabel("Region")
plt.show()
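
Differences between bars are hard to judge without a sense of uncertainty. One common approximation is a confidence interval from the Fisher z-transform, sketched below under the assumption of roughly bivariate-normal data; `fisher_ci` is a hypothetical helper and the inputs are illustrative values, not results from this dataset:

```python
import numpy as np


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation.

    Uses the Fisher z-transform; expects -1 < r < 1 and returns NaNs when n <= 3.
    """
    if n <= 3:
        return (float("nan"), float("nan"))
    z = np.arctanh(r)              # transform r to an approximately normal scale
    se = 1.0 / np.sqrt(n - 3)      # standard error on the z scale
    return (float(np.tanh(z - z_crit * se)), float(np.tanh(z + z_crit * se)))


# Same coefficient, different sample sizes (illustrative values only).
for n in (20, 200):
    low, high = fisher_ci(0.80, n)
    print(f"r = 0.80, n = {n:>3}: ({low:.2f}, {high:.2f})")
```

The interval narrows as the sample grows, so regions with few rows should be compared with more caution than regions with many.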

In [None]:
plt.figure(figsize=(6, 4))

sns.scatterplot(
    data=df,
    x="students",
    y="students_per_teacher",
    alpha=0.6
)

plt.title("Enrollment vs Student–Teacher Ratio")
plt.xlabel("Number of Students")
plt.ylabel("Students per Teacher")
plt.show()
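
A correlation coefficient shows whether teachers and students move together, not whether staffing grows in proportion to enrollment. One way to probe proportionality is the slope of a log-log fit: a slope near 1 suggests proportional scaling, while a slope below 1 suggests teacher counts grow more slowly than enrollment. The sketch below uses a hypothetical helper, `staffing_elasticity`, on synthetic data; it is an illustration of the idea, not the method used in this notebook:

```python
import numpy as np
import pandas as pd


def staffing_elasticity(frame, x="students", y="teachers"):
    """Slope of log(y) on log(x), estimated by ordinary least squares.

    Rows with missing or non-positive values are dropped before fitting.
    """
    pairs = frame[[x, y]].dropna()
    pairs = pairs[(pairs[x] > 0) & (pairs[y] > 0)]
    log_x = np.log(pairs[x].to_numpy())
    log_y = np.log(pairs[y].to_numpy())
    slope, _intercept = np.polyfit(log_x, log_y, deg=1)
    return float(slope)


# Synthetic data built so that teachers = 0.5 * students ** 0.8.
toy = pd.DataFrame({"students": [100.0, 400.0, 1600.0, 6400.0]})
toy["teachers"] = 0.5 * toy["students"] ** 0.8
print(round(staffing_elasticity(toy), 3))
```

In the notebook, calling `staffing_elasticity(df)` would give a pooled estimate, and it could also be applied per region or school category.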

In [None]:
# Pearson correlation between students and teachers within each school year
time_corr = (
    df.groupby("year_start")[["students", "teachers"]]
      .apply(lambda g: g["students"].corr(g["teachers"]))
      .reset_index(name="student_teacher_corr")
)

time_corr

In [None]:
plt.figure(figsize=(6, 4))
plt.plot(
    time_corr["year_start"],
    time_corr["student_teacher_corr"],
    marker="o"
)

plt.title("Student–Teacher Correlation Over Time")
plt.xlabel("School Year (Start)")
plt.ylabel("Correlation")
plt.show()

### Key Correlation and Relationship Insights

1. Student enrollment and teacher counts exhibit a strong positive correlation
   in the pooled data, indicating that staffing generally scales with enrollment
   at the national level.
2. Correlation strength varies by school category, suggesting differing staffing
   responsiveness across educational levels.
3. Regional correlations reveal heterogeneity, with some regions showing weaker
   alignment between enrollment levels and teacher deployment.
4. The relationship between enrollment size and students per teacher indicates
   that larger enrollments are not always matched by proportionate staffing.

These findings motivate inequality measurement, structural break testing, and
causal modeling in subsequent notebooks.