### IPEDS + CIP Dataset (Cleaned)

This notebook uses the cleaned IPEDS–CIP dataset:

- **File:** `ipeds_cip_final_for_dissertation_clean.csv`
- **Rows:** ~11,000 institution–year observations (2017–2022)
- **Population:** 4-year, public and private not-for-profit institutions
- **Grain:** One row per (institution, year)

**Target variable**

- `completers` – total number of completers (graduates) at the institution in that year.  
  This is a right-skewed count that the notebook treats as a continuous regression target (a few very large universities have 10k+ completers).

**Key numeric features**

- `longitude`, `latitude` – geographic location of the institution (cleaned; impossible latitudes removed).
- `student_faculty_ratio` – student–faculty ratio, median-imputed where missing.
- `headcount` – total headcount enrollment, median-imputed where missing.
- `cbsa` – Core-Based Statistical Area code, with invalid/negative codes set to the modal value. Although it is a nominal code, this notebook passes it through as a numeric column.
- Four binary enrollment flags (0/1):
  - `enrolled_undergrad_fulltime`
  - `enrolled_undergrad_parttime`
  - `enrolled_graduate_fulltime`
  - `enrolled_graduate_parttime`

**Key categorical features**

- `state_abbr` – state or territory (≈ 58 levels).
- `region` – IPEDS region (10 levels).
- `sector` – broad institutional sector (e.g., Public, Private not-for-profit).
- `inst_control` – control of institution (public vs private not-for-profit).
- `urban_centric_locale` – urban/rural classification (13 levels).
- `inst_size` – institutional size category (e.g., "Under 1", "1", "5", "10", "20", "Missing").
- `cbsa_type` – metro vs micro vs non-core classification (3 levels).
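- `inst_affiliation` – institutional affiliation (3 levels).

Only `region` and `sector` are used as model inputs in the baseline encoding benchmarks in this notebook; the remaining categoricals are carried along as metadata.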

**High-cardinality feature (for embeddings)**

- `cips` – comma-separated list of CIP codes offered by the institution in that year  
  (on average ~54 CIPs per row, ~8,700 unique CIP combinations).
- This column will **not** be one-hot encoded. Instead, separate notebooks will:
  - explode CIPs,
  - build CIP-level embeddings (Node2Vec, Poincaré, etc.),
  - and aggregate embeddings back to the institution–year level.
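
The explode, embed, and aggregate steps listed above could look roughly like the sketch below. It uses a made-up two-row frame and random 4-dimensional vectors standing in for trained CIP embeddings. The names `toy`, `cip_vectors`, and `inst_year_emb`, the embedding size, and the choice of mean pooling are all illustrative assumptions, not the method used in the later notebooks.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: one row per (unitid, year) with a comma-separated `cips` string.
toy = pd.DataFrame({
    "unitid": [1001, 1002],
    "year": [2020, 2020],
    "cips": ["11.0101,13.1210,51.3301", "51.3301,99"],
})

# 1) Explode: one row per (unitid, year, cip).
long = (
    toy.assign(cip=toy["cips"].str.replace(" ", "", regex=False).str.split(","))
       .explode("cip")
       .drop(columns="cips")
       .reset_index(drop=True)
)

# 2) Placeholder embeddings (hypothetical): one random vector per CIP code.
rng = np.random.default_rng(0)
dim = 4
cip_vectors = {cip: rng.normal(size=dim) for cip in long["cip"].unique()}
emb_cols = [f"cip_emb_{i}" for i in range(dim)]
emb = pd.DataFrame(np.vstack([cip_vectors[c] for c in long["cip"]]), columns=emb_cols)
long = pd.concat([long, emb], axis=1)

# 3) Aggregate back to the institution-year grain (mean pooling is an assumption).
inst_year_emb = long.groupby(["unitid", "year"], as_index=False)[emb_cols].mean()
print(inst_year_emb.shape)
```
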

**Identifiers**

- `unitid` – IPEDS institution identifier.
- `year` – reporting year.

In this notebook, we:

1. Load the cleaned IPEDS–CIP file.
2. Define column groups (numeric features, categorical features, high-cardinality features, IDs, target).
3. Create a 60/20/20 train/validation/test split, stratified on quantile bins of `completers`, suitable for downstream encoding, embedding, and synthetic-data experiments.
4. Save the resulting splits to disk for use in later notebooks (encoding baselines, embedding experiments, CTGAN/TVAE, LLM fine-tuning).
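
Before splitting, it can help to confirm the stated grain of one row per (institution, year). The sketch below defines a hypothetical helper, `check_grain`, and runs it on a toy frame. In the notebook it would be applied to the frame read from `ipeds_cip_final_for_dissertation_clean.csv`; the directory that file lives in is not given here, so the `read_csv` call is left as a comment with an assumed relative path.

```python
import pandas as pd

def check_grain(df: pd.DataFrame, keys=("unitid", "year")) -> int:
    """Return the number of rows that repeat an earlier (unitid, year) key."""
    return int(df.duplicated(subset=list(keys)).sum())

# In the notebook this would be something like (path is an assumption):
# ipeds_data = pd.read_csv("ipeds_cip_final_for_dissertation_clean.csv")
toy = pd.DataFrame({
    "unitid": [1001, 1001, 1002, 1002],
    "year": [2017, 2018, 2017, 2017],  # the last row repeats (1002, 2017)
    "completers": [120, 130, 45, 45],
})

n_dupes = check_grain(toy)
print("duplicate (unitid, year) rows:", n_dupes)
```

A non-zero count would mean the frame is not at institution-year grain and should be deduplicated before the split.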


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import os
PROJ = "/content/drive/MyDrive/dissertation"
DATA_DIR = f"{PROJ}/data"
os.makedirs(DATA_DIR, exist_ok=True)

# Save the full IPEDS frame; `ipeds_data` must already be loaded in this session
local_path = "/content/ipeds_data.csv"
ipeds_data.to_csv(local_path, index=False)
print("Saved locally to:", local_path)

drive_path = f"{DATA_DIR}/ipeds_data.csv"
ipeds_data.to_csv(drive_path, index=False)
print("Also saved to dissertation drive:", drive_path)


Mounted at /content/drive
Saved locally to: /content/ipeds_data.csv
Also saved to dissertation drive: /content/drive/MyDrive/dissertation/data/ipeds_data.csv


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# -----------------------------------
# FEATURE GROUPS FOR IPEDS (FINAL)
# -----------------------------------

# IDs and target
id_cols = ["unitid", "year"]
target_col = "completers"

# High-cardinality (for embeddings later, NOT encoded here)
high_card_cols = ["cips"]

# Numeric features (including flags)
numeric_cols = [
    "longitude",
    "latitude",
    "student_faculty_ratio",
    "headcount",
    "cbsa",
    "enrolled_undergrad_fulltime",
    "enrolled_undergrad_parttime",
    "enrolled_graduate_fulltime",
    "enrolled_graduate_parttime",
]

# Categorical features used in *baseline encoding benchmarks*
# (keep this small on purpose: consistent with other datasets)
categorical_model_cols = [
    "region",
    "sector",
]

# Metadata categoricals: kept in the data, not used as model inputs
metadata_cats = [
    "state_abbr",
    "inst_control",
    "urban_centric_locale",
    "inst_size",
    "cbsa_type",
    "inst_affiliation",
]

print("Numeric cols:", numeric_cols)
print("Categorical MODEL cols:", categorical_model_cols)
print("High-card cols:", high_card_cols)
print("Metadata categoricals (not used in X):", metadata_cats)

# Sanity checks
for lst, name in [
    (id_cols, "ID"),
    ([target_col], "Target"),
    (numeric_cols, "Numeric"),
    (categorical_model_cols, "Model categoricals"),
    (high_card_cols, "High-card"),
    (metadata_cats, "Metadata categoricals"),
]:
    missing = [c for c in lst if c not in ipeds_data.columns]
    if missing:
        print(f"⚠ Missing {name} columns:", missing)


Numeric cols: ['longitude', 'latitude', 'student_faculty_ratio', 'headcount', 'cbsa', 'enrolled_undergrad_fulltime', 'enrolled_undergrad_parttime', 'enrolled_graduate_fulltime', 'enrolled_graduate_parttime']
Categorical MODEL cols: ['region', 'sector']
High-card cols: ['cips']
Metadata categoricals (not used in X): ['state_abbr', 'inst_control', 'urban_centric_locale', 'inst_size', 'cbsa_type', 'inst_affiliation']
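
The sanity check above only confirms that each listed column exists. A complementary check, sketched below, is that no column sits in two groups at once and that the groups add up to the 21 columns seen in the final split frames. The `groups` dictionary simply restates the lists defined in the cell; its key names are illustrative.

```python
from itertools import combinations

# Restatement of the column groups defined in the cell above.
groups = {
    "id": ["unitid", "year"],
    "target": ["completers"],
    "high_card": ["cips"],
    "numeric": [
        "longitude", "latitude", "student_faculty_ratio", "headcount", "cbsa",
        "enrolled_undergrad_fulltime", "enrolled_undergrad_parttime",
        "enrolled_graduate_fulltime", "enrolled_graduate_parttime",
    ],
    "model_cat": ["region", "sector"],
    "metadata_cat": [
        "state_abbr", "inst_control", "urban_centric_locale",
        "inst_size", "cbsa_type", "inst_affiliation",
    ],
}

# Collect any column names shared by two groups.
overlaps = {
    (a, b): sorted(set(groups[a]) & set(groups[b]))
    for a, b in combinations(groups, 2)
    if set(groups[a]) & set(groups[b])
}
total_cols = sum(len(cols) for cols in groups.values())

print("overlapping groups:", overlaps)
print("total columns across groups:", total_cols)
```
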


In [None]:
# -----------------------------------
# BUILD TRAIN / VAL / TEST SPLITS
# -----------------------------------

df = ipeds_data.copy()

# Bin target for stratification
num_bins = 5
df["_completers_bin"] = pd.qcut(
    df[target_col],
    q=num_bins,
    labels=False,
    duplicates="drop"
)

# Features used as model inputs in 05B
feature_cols = numeric_cols + categorical_model_cols

X = df[feature_cols]
y = df[target_col]
strat = df["_completers_bin"]

# First split: train vs temp (val+test)
X_train, X_temp, y_train, y_temp, strat_train, strat_temp = train_test_split(
    X, y, strat, test_size=0.4, random_state=42, stratify=strat
)

# Second split: temp -> val + test (50/50 of temp => 20/20 each)
X_val, X_test, y_val, y_test, strat_val, strat_test = train_test_split(
    X_temp, y_temp, strat_temp, test_size=0.5, random_state=42, stratify=strat_temp
)

print("Train shape (X):", X_train.shape)
print("Val shape (X):  ", X_val.shape)
print("Test shape (X): ", X_test.shape)

# Build full frames with IDs, target, high-card, and metadata
def build_split_frame(X_part, y_part, idx):
    out = X_part.copy()
    out[target_col] = y_part.values
    out[id_cols] = df.loc[idx, id_cols].values
    out[high_card_cols] = df.loc[idx, high_card_cols].values
    # keep metadata categoricals for analysis / later use
    for col in metadata_cats:
        out[col] = df.loc[idx, col].values
    return out

train_idx = X_train.index
val_idx = X_val.index
test_idx = X_test.index

ipeds_train = build_split_frame(X_train, y_train, train_idx)
ipeds_val   = build_split_frame(X_val,   y_val,   val_idx)
ipeds_test  = build_split_frame(X_test,  y_test,  test_idx)

print("\nFinal split shapes (including IDs, target, cips, metadata):")
print("ipeds_train:", ipeds_train.shape)
print("ipeds_val:  ", ipeds_val.shape)
print("ipeds_test: ", ipeds_test.shape)


Train shape (X): (6602, 11)
Val shape (X):   (2201, 11)
Test shape (X):  (2201, 11)

Final split shapes (including IDs, target, cips, metadata):
ipeds_train: (6602, 21)
ipeds_val:   (2201, 21)
ipeds_test:  (2201, 21)
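
Because the split above is drawn at the row level while the grain is institution-year, the same `unitid` can land in more than one split. Whether that matters depends on the downstream experiment. A minimal sketch for measuring the overlap is below; `shared_ids` is a hypothetical helper, and the three small frames are toy stand-ins for `ipeds_train`, `ipeds_val`, and `ipeds_test`.

```python
import pandas as pd

def shared_ids(a: pd.DataFrame, b: pd.DataFrame, key: str = "unitid") -> set:
    """Return the set of `key` values present in both frames."""
    return set(a[key]) & set(b[key])

# Toy stand-ins for the real split frames.
toy_train = pd.DataFrame({"unitid": [1, 1, 2, 3], "year": [2017, 2018, 2017, 2019]})
toy_val = pd.DataFrame({"unitid": [1, 4], "year": [2019, 2017]})
toy_test = pd.DataFrame({"unitid": [2, 5], "year": [2018, 2017]})

overlap_val = shared_ids(toy_train, toy_val)
overlap_test = shared_ids(toy_train, toy_test)
print("train/val shared institutions:", overlap_val)
print("train/test shared institutions:", overlap_test)
```

If the overlap is a concern, scikit-learn's `GroupShuffleSplit` can split by `unitid` instead, at the cost of the target-bin stratification used here.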


In [None]:
# -----------------------------------
# SAVE SPLITS (LOCAL + DRIVE)
# -----------------------------------

# Local
train_path_local = "/content/ipeds_train.csv"
val_path_local   = "/content/ipeds_val.csv"
test_path_local  = "/content/ipeds_test.csv"

ipeds_train.to_csv(train_path_local, index=False)
ipeds_val.to_csv(val_path_local, index=False)
ipeds_test.to_csv(test_path_local, index=False)

print("Saved locally:")
print(" -", train_path_local)
print(" -", val_path_local)
print(" -", test_path_local)

# Drive (dissertation/data)
PROJ = "/content/drive/MyDrive/dissertation"
DATA_DIR = f"{PROJ}/data"

train_path_drive = f"{DATA_DIR}/ipeds_train.csv"
val_path_drive   = f"{DATA_DIR}/ipeds_val.csv"
test_path_drive  = f"{DATA_DIR}/ipeds_test.csv"

ipeds_train.to_csv(train_path_drive, index=False)
ipeds_val.to_csv(val_path_drive, index=False)
ipeds_test.to_csv(test_path_drive, index=False)

print("\nAlso saved to dissertation drive:")
print(" -", train_path_drive)
print(" -", val_path_drive)
print(" -", test_path_drive)


Saved locally:
 - /content/ipeds_train.csv
 - /content/ipeds_val.csv
 - /content/ipeds_test.csv

Also saved to dissertation drive:
 - /content/drive/MyDrive/dissertation/data/ipeds_train.csv
 - /content/drive/MyDrive/dissertation/data/ipeds_val.csv
 - /content/drive/MyDrive/dissertation/data/ipeds_test.csv


In [None]:
import pandas as pd
import numpy as np

# Make sure these match what you defined earlier
numeric_cols = [
    "longitude",
    "latitude",
    "student_faculty_ratio",
    "headcount",
    "cbsa",
    "enrolled_undergrad_fulltime",
    "enrolled_undergrad_parttime",
    "enrolled_graduate_fulltime",
    "enrolled_graduate_parttime",
]

categorical_model_cols = ["region", "sector"]
target_col = "completers"

feature_cols = numeric_cols + categorical_model_cols

# X/y for each split
X_train = ipeds_train[feature_cols].copy()
y_train = ipeds_train[target_col].copy()

X_val = ipeds_val[feature_cols].copy()
y_val = ipeds_val[target_col].copy()

X_test = ipeds_test[feature_cols].copy()
y_test = ipeds_test[target_col].copy()

X_train.head(), y_train.head()


(       longitude   latitude  student_faculty_ratio  headcount     cbsa  \
 121   -87.681358  34.806740                   15.0    10003.0  22520.0   
 1062 -118.289520  34.065971                   10.0      191.0  31080.0   
 8025  -75.305206  40.007450                    8.0     1349.0  37980.0   
 9311  -96.285652  32.735508                   15.0      104.0  19100.0   
 1514  -72.767975  41.688248                   14.0     2350.0  25540.0   
 
       enrolled_undergrad_fulltime  enrolled_undergrad_parttime  \
 121                             1                            1   
 1062                            1                            1   
 8025                            1                            1   
 9311                            1                            1   
 1514                            1                            1   
 
       enrolled_graduate_fulltime  enrolled_graduate_parttime  \
 121                            1                           1   
 1062         

In [None]:
!pip install category_encoders


Collecting category_encoders
  Downloading category_encoders-2.9.0-py3-none-any.whl.metadata (7.9 kB)
Downloading category_encoders-2.9.0-py3-none-any.whl (85 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.9/85.9 kB 3.2 MB/s eta 0:00:00
Installing collected packages: category_encoders
Successfully installed category_encoders-2.9.0


In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import math

def eval_regression(model_name, y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return {
        "model": model_name,
        "R2": r2,
        "RMSE": rmse,
        "MAE": mae,
    }


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline

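# One-hot encode the categoricals and pass the numeric columns through unchanged.
# sparse_threshold=0.0 forces a dense output, which HistGradientBoostingRegressor requires.
ohe_preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_model_cols),
    ],
    sparse_threshold=0.0,
)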

ohe_model = Pipeline(
    steps=[
        ("preprocess", ohe_preprocessor),
        ("regressor", HistGradientBoostingRegressor(random_state=42)),
    ]
)

ohe_model.fit(X_train, y_train)

y_val_pred_ohe = ohe_model.predict(X_val)
y_test_pred_ohe = ohe_model.predict(X_test)

results = []
results.append(eval_regression("HGBR + OHE (region, sector) [VAL]", y_val, y_val_pred_ohe))
results.append(eval_regression("HGBR + OHE (region, sector) [TEST]", y_test, y_test_pred_ohe))

pd.DataFrame(results)


                                model        R2         RMSE         MAE
0   HGBR + OHE (region, sector) [VAL]  0.844357   968.658822  437.621210
1  HGBR + OHE (region, sector) [TEST]  0.844578  1019.147201  441.292875
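
The R2, RMSE, and MAE columns come from `eval_regression`. As a hand-checkable illustration of what those three numbers mean, the sketch below computes them from their standard definitions with NumPy on three made-up values; the arrays are toy data, not notebook results.

```python
import numpy as np

# Toy targets and predictions (every prediction is off by exactly 10).
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

resid = y_true - y_pred

# RMSE: square root of the mean squared residual (same units as the target).
rmse = float(np.sqrt(np.mean(resid ** 2)))

# MAE: mean absolute residual (same units as the target).
mae = float(np.mean(np.abs(resid)))

# R2: 1 minus residual sum of squares over total sum of squares.
ss_res = float(np.sum(resid ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

print(rmse, mae, round(r2, 3))
```

RMSE and MAE coincide here because every error has the same size. In the notebook's results RMSE is more than twice MAE, which suggests that a small number of large errors dominate, as one might expect with a right-skewed target.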


In [None]:
import category_encoders as ce
from sklearn.ensemble import HistGradientBoostingRegressor

# Work on copies so the original feature frames stay unchanged
X_train_te = X_train.copy()
X_val_te   = X_val.copy()
X_test_te  = X_test.copy()

# TargetEncoder on categorical columns
te = ce.TargetEncoder(cols=categorical_model_cols, smoothing=0.3)

X_train_te[categorical_model_cols] = te.fit_transform(X_train[categorical_model_cols], y_train)
X_val_te[categorical_model_cols]   = te.transform(X_val[categorical_model_cols])
X_test_te[categorical_model_cols]  = te.transform(X_test[categorical_model_cols])

te_model = HistGradientBoostingRegressor(random_state=42)
te_model.fit(X_train_te, y_train)

y_val_pred_te = te_model.predict(X_val_te)
y_test_pred_te = te_model.predict(X_test_te)

results.append(eval_regression("HGBR + TargetEnc (region, sector) [VAL]", y_val, y_val_pred_te))
results.append(eval_regression("HGBR + TargetEnc (region, sector) [TEST]", y_test, y_test_pred_te))

results_df = pd.DataFrame(results)
results_df


                                      model        R2         RMSE         MAE
0         HGBR + OHE (region, sector) [VAL]  0.844357   968.658822  437.621210
1        HGBR + OHE (region, sector) [TEST]  0.844578  1019.147201  441.292875
2   HGBR + TargetEnc (region, sector) [VAL]  0.848827   954.646948  430.647646
3  HGBR + TargetEnc (region, sector) [TEST]  0.843249  1023.498105  441.720725
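
With the split label embedded in the model name, the two encoders are awkward to compare side by side. The sketch below re-enters the four result rows shown above and pivots them into one row per encoder; the column names `encoder` and `split` and the frame `wide` are illustrative choices.

```python
import pandas as pd

# The four rows reported above, re-entered by hand.
results_df = pd.DataFrame({
    "model": [
        "HGBR + OHE (region, sector) [VAL]",
        "HGBR + OHE (region, sector) [TEST]",
        "HGBR + TargetEnc (region, sector) [VAL]",
        "HGBR + TargetEnc (region, sector) [TEST]",
    ],
    "R2": [0.844357, 0.844578, 0.848827, 0.843249],
    "RMSE": [968.658822, 1019.147201, 954.646948, 1023.498105],
    "MAE": [437.62121, 441.292875, 430.647646, 441.720725],
})

# Split "name [SPLIT]" into an encoder label and a split label.
parts = results_df["model"].str.extract(r"^(?P<encoder>.*) \[(?P<split>VAL|TEST)\]$")

# One row per encoder, with (metric, split) as column levels.
wide = (
    pd.concat([parts, results_df[["R2", "RMSE", "MAE"]]], axis=1)
      .pivot(index="encoder", columns="split", values=["R2", "RMSE", "MAE"])
)
print(wide.round(3))
```

Read this way, target encoding is slightly ahead on validation and one-hot encoding is slightly ahead on test, with the R2 gap below 0.005 on both splits.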


In [None]:
import pandas as pd

# Use the training split only, so embeddings are fit without seeing val/test rows
cip_train = ipeds_train[["unitid", "cips"]].copy()

# Strip spaces and split the comma-separated string into a list of CIP codes
cip_train["cips_list"] = (
    cip_train["cips"]
    .astype(str)
    .str.replace(" ", "", regex=False)  # literal space removal, not a regex
    .str.split(",")                     # one list of codes per row
)

# Preview
cip_train[["cips", "cips_list"]].head()


                                                   cips                                          cips_list
121   04.0501,09.0101,09.0102,11.0101,11.0103,12.059...  [04.0501, 09.0101, 09.0102, 11.0101, 11.0103, ...
1062                                         51.3301,99                                      [51.3301, 99]
8025  03.0103,04.0301,05.0104,05.0207,11.0101,16.010...  [03.0103, 04.0301, 05.0104, 05.0207, 11.0101, ...
9311                                 24.0101,39.0201,99                             [24.0101, 39.0201, 99]
1514  11.1003,13.1210,13.1501,13.9999,15.1102,19.070...  [11.1003, 13.1210, 13.1501, 13.9999, 15.1102, ...
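
One plausible next step toward graph-based embeddings such as Node2Vec is a CIP co-occurrence edge list, where two codes are linked whenever the same institution-year row offers both. The notebook does not say how the CIP graph is built, so treat this as a sketch under that assumption. `cooccurrence_edges` is a hypothetical helper, and the toy lists mimic the `cips_list` column.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(cip_lists):
    """Count how often each unordered pair of CIP codes appears in the same row."""
    counts = Counter()
    for codes in cip_lists:
        # Deduplicate and sort so each pair has one canonical orientation.
        unique_codes = sorted(set(codes))
        counts.update(combinations(unique_codes, 2))
    return counts

# Toy lists shaped like the `cips_list` column.
toy_lists = [
    ["51.3301", "99"],
    ["24.0101", "39.0201", "99"],
    ["51.3301", "99", "24.0101"],
]

edges = cooccurrence_edges(toy_lists)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting weighted pairs could be loaded into a graph library as an undirected edge list. Rows with many CIPs contribute quadratically many pairs, so some thresholding may be needed on the real data.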
