drPod/data-dive

data-dive

A UIUC undergraduate research project measuring how class-release pedestrian surges at signalized intersections cause CUMTD bus delays along campus transit corridors. Four companion datasets, archived 2026-04-23 → 2026-04-29 (6.76 days), are released on Hugging Face for community reuse.


Published datasets

All four are CC-BY 4.0, except the NWS weather data, which is CC0 (see each repo card).

| dataset | what | size | repo |
| --- | --- | --- | --- |
| Campus Cams | 1.55M JPEGs at 1 Hz from Quad / Alma / Morrow YouTube live streams + per-frame metadata | 489 hourly WebDataset shards | dr-pod/data-dive-campus-cams |
| MTD GTFS-Realtime archive | vehicle_positions + trip_updates + service_alerts polled every 5 s, length-prefixed gzipped protobuf streams + JSONL index | ~3.6 GB | dr-pod/data-dive-mtd-gtfs-rt |
| Google Directions traversal-times | 8 campus OD pairs polled every 60 s, ~70K route-time observations | ~32 MB | dr-pod/data-dive-google-directions |
| KCMI hourly weather | NWS observations from Willard Airport (closest official station to UIUC) | <1 MB | dr-pod/data-dive-cu-weather |

To my knowledge, no continuous public archive of CUMTD's real-time feeds existed before this one: the live feeds overwrite themselves.


Research thesis (paper extension)

UIUC class-release pedestrian surges at signalized intersections (Green/Wright, Wright/Springfield, Gregory/Wright, etc.) cause measurable bus delays via signal-phase saturation. The mechanism is the pedestrian volume crossing at the intersection, not bus ridership — most class-hopping on this compact campus is on foot.

Evaluation framework: residual-correction layer on top of CUMTD's published predictions, scored against the Transit ETA Accuracy Benchmark methodology, with class-release timing derived from UIUC Course Explorer XML.
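
As a rough illustration of what a residual-correction layer means here, the sketch below learns the average error of a published ETA, grouped by whether the observation falls in a class-release window, and adds that error back to new predictions. The observations are toy values invented for the example, and `fit_residuals` and `corrected_eta` are hypothetical names; this is not the project's model, only the simplest form the idea could take.

```python
from collections import defaultdict
from statistics import mean

# Toy rows, invented for illustration only (all times in seconds):
# (published_eta_s, actual_arrival_s, in_class_release_window)
observations = [
    (120, 150, True),
    (300, 340, True),
    (200, 205, False),
    (90, 85, False),
]


def fit_residuals(rows):
    """Return the mean (actual - published) error for each window flag."""
    by_group = defaultdict(list)
    for published, actual, in_window in rows:
        by_group[in_window].append(actual - published)
    return {group: mean(errors) for group, errors in by_group.items()}


def corrected_eta(published, in_window, residuals):
    """Add the learned average error to a published ETA."""
    return published + residuals.get(in_window, 0.0)


residuals = fit_residuals(observations)
print(residuals)
print(corrected_eta(180, True, residuals))
```

A real layer would presumably condition on more than a single flag (intersection, time of day, weather), but the scoring idea stays the same: compare the error of the published ETA with the error of the corrected one.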

Candidate advisor: Prof. Lewis Lehe (UIUC CEE, transit operations). Target venues: TRB Annual Meeting 2027 (abstract due Aug 2026) or IEEE ITSC.


Architecture (capture period: Apr 23 – Apr 29 2026)

                                 ┌──────────────────────┐
                                 │ Railway (decommissioned 2026-04-30)
                                 │  data-dive-archiver  │
                                 │  ├─ gtfs-rt poller (5s)
                                 │  ├─ google directions poller (60s, halted Apr 29)
                                 │  ├─ NWS weather poller (15min)
                                 │  └─ status server :8080
                                 └──────────────────────┘
                                            ▲ poll /vehicle-positions,
                                            │ /trip-updates, /service-alerts
                              ┌───────────────────────────┐
                              │ MTD GTFS-RT (public feed) │
                              └───────────────────────────┘

┌──────────────────────────────────┐
│ Spare Mac (UIUC network)         │      ┌──────────────────────┐
│  ├─ cam.quad   launchd (1 Hz)    │────► │ ~/data-dive/         │
│  ├─ cam.alma   launchd (1 Hz)    │      │ cam_frames/{cam}/... │  → HF dataset
│  ├─ cam.morrow launchd (1 Hz)    │      └──────────────────────┘
│  ├─ cam.dashboard launchd :8080  │
│  └─ cam.uploader launchd (HF push)
└──────────────────────────────────┘
              ▲
              │ yt-dlp → ffmpeg → JPEG
              │
        YouTube live streams (Quad, Alma Mater, Morrow Plots)

One-shot pulls (laptop, on demand):
  - UIUC Course Explorer XML (Spring 2026 sections + rooms)
  - MTD static GTFS zip (stops, routes, shapes, schedule)

Reproducing the analysis

Quick start (with the published datasets)

import gzip
import struct

from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings
from huggingface_hub import HfFileSystem

# Pull a single hour of vehicle positions
fs = HfFileSystem()
fs.get(
    "datasets/dr-pod/data-dive-mtd-gtfs-rt/vehicle_positions/2026-04-25/18.pb.gz",
    "/tmp/18.pb.gz",
)

# Each record is a 4-byte big-endian length followed by one serialized FeedMessage
with gzip.open("/tmp/18.pb.gz", "rb") as f:
    while True:
        head = f.read(4)
        if len(head) < 4:  # clean EOF or truncated header
            break
        (n,) = struct.unpack(">I", head)
        msg = gtfs_realtime_pb2.FeedMessage()
        msg.ParseFromString(f.read(n))
        for entity in msg.entity:
            ...
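
The length-prefixed framing used above can be tried without downloading anything. The following is a minimal stdlib-only sketch, assuming (as the `">I"` unpack in the quick start implies) that each record is a 4-byte big-endian length followed by that many payload bytes. `write_frames` and `read_frames` are hypothetical helper names, and the payloads are placeholder bytes rather than real protobuf messages.

```python
import gzip
import io
import struct
from typing import BinaryIO, Iterable, Iterator


def write_frames(stream: BinaryIO, payloads: Iterable[bytes]) -> None:
    """Write each payload as a 4-byte big-endian length plus the bytes."""
    for payload in payloads:
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)


def read_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield payloads until EOF; raise if the stream ends mid-record."""
    while True:
        head = stream.read(4)
        if not head:
            return  # clean end of stream
        if len(head) < 4:
            raise EOFError("truncated length prefix")
        (n,) = struct.unpack(">I", head)
        payload = stream.read(n)
        if len(payload) < n:
            raise EOFError("truncated payload")
        yield payload


# Round trip through an in-memory gzip stream.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    write_frames(gz, [b"first poll", b"", b"third poll"])

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    frames = list(read_frames(gz))

print(frames)
```

Raising on a short read, rather than silently stopping, makes it easier to tell a cleanly closed hourly file from one that was cut off mid-write.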

Re-running the archivers

The pipeline that built these datasets is in scrapers/. The Railway volume is preserved in case the service ever needs to come back online.

# GTFS-RT archiver (in container)
DATA_DIR=./data/raw/gtfs_rt python scrapers/gtfs_rt_archiver.py

# Cam archivers — three launchd agents, see scrapers/com.datadive.cam.*.plist
launchctl load ~/Library/LaunchAgents/com.datadive.cam.quad.plist

Capture window

| event | timestamp |
| --- | --- |
| Cams + GTFS-RT polling started | 2026-04-23 ~05:30 UTC |
| Google Directions polling went live | 2026-04-23 ~22:00 UTC |
| Cams + Directions stopped | 2026-04-29 23:42 UTC |
| Railway service decommissioned | 2026-04-30 ~01:50 UTC |
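
The headline figures can be checked against these timestamps. The sketch below treats the approximate start times as exact (an assumption, since the table marks them with `~`) and derives the window length, an upper bound on the camera frame count at 1 Hz across three cameras, and the number of Directions observations expected from 8 OD pairs polled once a minute.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Timestamps from the capture-window table; "~" values are treated as exact.
cams_start = datetime(2026, 4, 23, 5, 30, tzinfo=UTC)
directions_start = datetime(2026, 4, 23, 22, 0, tzinfo=UTC)
capture_stop = datetime(2026, 4, 29, 23, 42, tzinfo=UTC)

window_seconds = (capture_stop - cams_start).total_seconds()
window_days = window_seconds / 86400

# Upper bound: 3 cameras, one frame per second, no gaps.
max_frames = 3 * int(window_seconds)

# 8 OD pairs, one poll per minute while Directions polling was live.
directions_minutes = (capture_stop - directions_start).total_seconds() / 60
expected_directions = 8 * int(directions_minutes)

print(f"window: {window_days:.2f} days")
print(f"max camera frames: {max_frames:,}")
print(f"expected Directions observations: {expected_directions:,}")
```

Under these assumptions, the published 1.55M frames fall below the gap-free upper bound, and the ~70K Directions observations are close to the expected count.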

See also

  • docs/challenges.md — running log of research-process obstacles, decisions, and incidents (including the Apr 29 Google Directions API overspend post-mortem).
  • docs/outreach_emails.md — research-collaboration and data-access outreach drafts (Lewis Lehe, Transit Inc., CUMTD developer team, MetroLab Network).
  • literature-review/NOTES.md — what we use from each paper.
  • cam_frames/{cam}/camera_meta.json — provenance per camera.

License

CC-BY 4.0 for all derived datasets and code, except the NWS weather subset, which is CC0 (US government public domain).

About

UIUC Data Dive Sp26 — MTD v3 transit analysis
