A UIUC undergraduate research project measuring how class-release pedestrian surges at signalized intersections delay CUMTD buses along campus transit corridors. Four companion datasets, archived 2026-04-23 → 2026-04-29 (6.76 days), are released on Hugging Face for community reuse.
All four are CC-BY 4.0 (NWS subset is CC0; see each repo card).
| dataset | what | size | repo |
|---|---|---|---|
| Campus Cams | 1.55M JPEGs at 1 Hz from Quad / Alma / Morrow YouTube live streams + per-frame metadata | 489 hourly WebDataset shards | dr-pod/data-dive-campus-cams |
| MTD GTFS-Realtime archive | vehicle_positions + trip_updates + service_alerts polled every 5 s, length-prefixed gzipped protobuf streams + JSONL index | ~3.6 GB | dr-pod/data-dive-mtd-gtfs-rt |
| Google Directions traversal-times | 8 campus OD pairs polled every 60 s, ~70K route-time observations | ~32 MB | dr-pod/data-dive-google-directions |
| KCMI hourly weather | NWS observations from Willard Airport (closest official station to UIUC) | <1 MB | dr-pod/data-dive-cu-weather |
To my knowledge, no continuous public archive of CUMTD's real-time feeds existed before this one; the live feeds overwrite themselves.
Hypothesis: UIUC class-release pedestrian surges at signalized intersections (Green/Wright, Wright/Springfield, Gregory/Wright, etc.) cause measurable bus delays via signal-phase saturation. The mechanism is the pedestrian volume crossing at the intersection, not bus ridership; most class-hopping on this compact campus is on foot.
Evaluation framework: residual-correction layer on top of CUMTD's published predictions, scored against the Transit ETA Accuracy Benchmark methodology, with class-release timing derived from UIUC Course Explorer XML.
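
As a rough illustration of what a residual-correction layer could look like, the sketch below learns the mean prediction error (actual minus published ETA) separately for observations inside and outside a post-release window, then adds that mean back to new predictions. Everything in it is an assumption made for illustration: the function names, the 10-minute window, the toy numbers, and the choice of a per-bucket mean are not taken from the project code, and the real layer may use a different model.

```python
from collections import defaultdict
from statistics import mean

SURGE_WINDOW_S = 600  # assumed: 10 minutes after a class release


def in_surge(t, release_times, window=SURGE_WINDOW_S):
    """True if timestamp t (seconds) falls within `window` after any release."""
    return any(0 <= t - r <= window for r in release_times)


def fit_residuals(observations, release_times):
    """observations: iterable of (t, predicted_eta_s, actual_eta_s).

    Returns {bucket: mean residual}; bucket is True (surge) or False.
    """
    by_bucket = defaultdict(list)
    for t, predicted, actual in observations:
        by_bucket[in_surge(t, release_times)].append(actual - predicted)
    return {bucket: mean(vals) for bucket, vals in by_bucket.items()}


def correct(t, predicted, residuals, release_times):
    """Add the learned mean residual for t's bucket; unseen buckets add 0."""
    return predicted + residuals.get(in_surge(t, release_times), 0.0)


# Toy data (hypothetical): a single class release at t = 3000 s.
releases = [3000]
history = [
    (1000, 120, 125),  # off-surge, 5 s late
    (1500, 200, 195),  # off-surge, 5 s early
    (3100, 150, 210),  # surge, 60 s late
    (3400, 100, 140),  # surge, 40 s late
]
residuals = fit_residuals(history, releases)
print(correct(3200, 180, residuals, releases))  # → 230
```

A per-bucket mean is the simplest possible correction; it is shown here only because it makes the "published prediction plus learned residual" structure easy to see.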
Candidate advisor: Prof. Lewis Lehe (UIUC CEE, transit operations). Target venues: TRB Annual Meeting 2027 (abstract due Aug 2026) or IEEE ITSC.
┌──────────────────────┐
│ Railway (decommissioned 2026-04-30)
│ data-dive-archiver │
│ ├─ gtfs-rt poller (5s)
│ ├─ google directions poller (60s, halted Apr 29)
│ ├─ NWS weather poller (15min)
│ └─ status server :8080
└──────────────────────┘
▲ poll /vehicle-positions,
│ /trip-updates, /service-alerts
┌───────────────────────────┐
│ MTD GTFS-RT (public feed) │
└───────────────────────────┘
┌──────────────────────────────────┐
│ Spare Mac (UIUC network) │ ┌──────────────────────┐
│ ├─ cam.quad launchd (1 Hz) │────► │ ~/data-dive/ │
│ ├─ cam.alma launchd (1 Hz) │ │ cam_frames/{cam}/... │ → HF dataset
│ ├─ cam.morrow launchd (1 Hz) │ └──────────────────────┘
│ ├─ cam.dashboard launchd :8080 │
│ └─ cam.uploader launchd (HF push)
└──────────────────────────────────┘
▲
│ yt-dlp → ffmpeg → JPEG
│
YouTube live streams (Quad, Alma Mater, Morrow Plots)
One-shot pulls (laptop, on demand):
- UIUC Course Explorer XML (Spring 2026 sections + rooms)
- MTD static GTFS zip (stops, routes, shapes, schedule)

    import gzip
    import struct

    from google.transit import gtfs_realtime_pb2
    from huggingface_hub import HfFileSystem

    # Pull a single hour of vehicle positions
    fs = HfFileSystem()
    fs.get(
        "datasets/dr-pod/data-dive-mtd-gtfs-rt/vehicle_positions/2026-04-25/18.pb.gz",
        "/tmp/18.pb.gz",
    )

    # Each record is a 4-byte big-endian length followed by one serialized FeedMessage
    with gzip.open("/tmp/18.pb.gz", "rb") as f:
        while True:
            head = f.read(4)
            if not head:
                break
            (n,) = struct.unpack(">I", head)
            msg = gtfs_realtime_pb2.FeedMessage()
            msg.ParseFromString(f.read(n))
            for entity in msg.entity:
                ...

The pipeline that built these datasets is in scrapers/. The Railway volume is preserved in case the service ever needs to come back online.
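
The record framing that the reader loop above relies on (a 4-byte big-endian length followed by that many payload bytes, all inside one gzip stream) can be exercised without network access or protobuf. The following is a minimal sketch with hypothetical helper names `write_frames` and `read_frames`; the payloads are arbitrary bytes standing in for serialized `FeedMessage`s, and raising on a truncated record is an assumption about how a partially written hour should be handled, not the archiver's documented behavior.

```python
import gzip
import io
import struct


def write_frames(fileobj, payloads):
    """Write each payload as a 4-byte big-endian length plus the bytes."""
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for payload in payloads:
            gz.write(struct.pack(">I", len(payload)))
            gz.write(payload)


def read_frames(fileobj):
    """Yield payloads from a gzipped length-prefixed stream."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        while True:
            head = gz.read(4)
            if not head:
                return  # clean end of stream
            if len(head) < 4:
                raise EOFError("truncated length prefix")
            (n,) = struct.unpack(">I", head)
            body = gz.read(n)
            if len(body) < n:
                raise EOFError("truncated record body")
            yield body


# Round-trip three stand-in payloads through an in-memory buffer.
buf = io.BytesIO()
write_frames(buf, [b"first", b"", b"third record"])
buf.seek(0)
frames = list(read_frames(buf))
print(frames)  # → [b'first', b'', b'third record']
```

To parse real archive files, each yielded payload would be passed to `FeedMessage.ParseFromString` as in the loop above.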

    # GTFS-RT archiver (in container)
    DATA_DIR=./data/raw/gtfs_rt python scrapers/gtfs_rt_archiver.py

    # Cam archivers: three launchd agents, see scrapers/com.datadive.cam.*.plist
    launchctl load ~/Library/LaunchAgents/com.datadive.cam.quad.plist

| event | timestamp |
|---|---|
| Cams + GTFS-RT polling started | 2026-04-23 ~05:30 UTC |
| Google Directions polling went live | 2026-04-23 ~22:00 UTC |
| Cams + Directions stopped | 2026-04-29 23:42 UTC |
| Railway service decommissioned | 2026-04-30 ~01:50 UTC |
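
The headline numbers can be cross-checked from the timeline above with a few lines of arithmetic. This sketch treats the approximate (~) timestamps as exact, which is an assumption, so the results should only be expected to agree with the published figures to within rounding; the variable names are illustrative.

```python
from datetime import datetime, timezone

UTC = timezone.utc
cams_start = datetime(2026, 4, 23, 5, 30, tzinfo=UTC)        # ~05:30 UTC
directions_start = datetime(2026, 4, 23, 22, 0, tzinfo=UTC)  # ~22:00 UTC
stop = datetime(2026, 4, 29, 23, 42, tzinfo=UTC)

# Collection window in days (the overview states 6.76).
window_days = (stop - cams_start).total_seconds() / 86400
print(round(window_days, 2))  # → 6.76

# Upper bound on frames: 3 cameras at 1 Hz with no gaps.
max_frames = 3 * int((stop - cams_start).total_seconds())
print(max_frames)  # → 1751760

# Expected Directions observations: 8 OD pairs, one poll per minute.
directions_obs = 8 * int((stop - directions_start).total_seconds() // 60)
print(directions_obs)  # → 69936
```

Under these assumptions the 1.55M archived frames are roughly 88% of the gap-free upper bound, and the ~70K Directions figure is consistent with the 60 s polling schedule.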

- docs/challenges.md: running log of research-process obstacles, decisions, and incidents (including the Apr 29 Google Directions API overspend post-mortem).
- docs/outreach_emails.md: research-collaboration and data-access outreach drafts (Lewis Lehe, Transit Inc., CUMTD developer team, MetroLab Network).
- literature-review/NOTES.md: what we use from each paper.
- cam_frames/{cam}/camera_meta.json: provenance per camera.

CC-BY 4.0 for all derived datasets and code, except the NWS weather subset which is CC0 (US government public domain).