Warning
Not all source files have been fully polished yet (e.g., comments, code structure, and style). Some cleanup work has been done, but much remains. All code will be brought up to professional engineering quality before the end of the semester. Apologies for the current state.
Prof: Ramesh Govindan
Proposal Report: Proposal.pdf
This project builds a clean-slate, bare-metal FaaS (Function-as-a-Service) worker node from scratch in C++, targeting the cold-start problem in edge serverless environments.
The core research question: can a lightweight, O(1) prediction heuristic (EWMA + CUSUM + Little's Law) reduce cold-start penalties in cyclic workloads while consuming far less overhead than heavyweight time-series models like ARIMA?
- No Kubernetes, no Docker. OS-level process primitives + Copy-on-Write as a "proxy sandbox." The goal is to decouple and precisely measure our prediction algorithm's overhead without KVM/container noise masking the signal. This is a methodological adjustment made after the advisor raised the Firecracker question: we are measuring the control plane, not reinventing data-plane isolation technology.
- Strict Reactor / DispatchPool separation. The epoll event loop never blocks; all UDS I/O is offloaded to a 64-thread DispatchPool. The predictor runs on the same thread as the event loop with zero contention.
- Simplicity-first. EWMA + CUSUM run in O(1) time and O(1) memory. We claim this is sufficient for cyclic IoT workloads and intend to prove it against ARIMA on a Pareto (latency vs. memory) curve.
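
The three-ingredient policy is small enough to show in full. Below is a minimal Python sketch of the per-tick update; the real implementation lives in Predictor.hpp, and the α, drift, threshold, and one-spare-worker values here are illustrative, not the tuned parameters:

```python
import math

class Predictor:
    """O(1)-time, O(1)-memory sketch of EWMA + CUSUM + Little's Law.

    All parameter values are illustrative, not the project's tuning."""

    def __init__(self, alpha=0.2, drift=5.0, threshold=8.0, service_time=0.5):
        self.ewma = 0.0        # smoothed RPS baseline
        self.cusum = 0.0       # one-sided CUSUM accumulator
        self.alpha = alpha
        self.drift = drift     # slack subtracted from each deviation
        self.threshold = threshold
        self.T = service_time  # per-request service time (seconds)

    def tick(self, rps):
        # EWMA update: constant time, constant state
        self.ewma = self.alpha * rps + (1 - self.alpha) * self.ewma
        # One-sided CUSUM on the positive deviation from the baseline
        self.cusum = max(0.0, self.cusum + (rps - self.ewma) - self.drift)
        spike = self.cusum > self.threshold
        # Little's Law sizing: N = ceil(lambda * T) plus one spare worker
        target = math.ceil(self.ewma * self.T) + 1
        return target, spike
```

Feeding a steady 2 RPS converges to a target of 2 workers with no alarm; a jump to 30 RPS trips the CUSUM alarm on the first tick.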
| Module | File | Status |
|---|---|---|
| C++ Epoll Reactor (ET mode, non-blocking) | tcp_server.hpp | ✅ Done |
| 64-thread DispatchPool (all blocking I/O here) | DispatchPool.hpp | ✅ Done |
| WorkerPool: fork + scale-down scavenger thread | WorkerPool.hpp | ✅ Done |
| Predictor: EWMA + CUSUM + Little's Law | Predictor.hpp | ✅ Done |
| Dynamic T feedback (UpdateServiceTime) | web_server.hpp | ✅ Done |
| Python Worker (simulates T=500ms AI inference) | worker.py | ✅ Done |
| Two-cycle comparison test + drain phase | load_tester.py | ✅ Done |
| Missing Item | Type | Priority |
|---|---|---|
| ARIMA baseline (Python statsmodels) | Code + Experiment | 🔴 High: explicitly requested by advisor |
| Reactive baseline (predictor disabled, cold fallback only) | Code + Experiment | 🔴 High: needed to show worst case |
| Static baseline (fixed N workers, no scaling) | Code + Experiment | 🟡 Medium: needed for Pareto curve |
| Memory / CPU overhead measurement | Experiment | 🔴 High: core data for Pareto curve |
| Predictor inference latency measurement | Experiment | 🟡 Medium: proves EWMA/CUSUM << ARIMA |
| CloudLab bare-metal testing | Environment | 🟡 Medium: loopback has no network noise |
| Cyclic bursty workload with wrk | Experiment | 🟡 Medium: proposal uses wrk, not Python script |
Log: test_20260303_013519.log
In V1.0, the Python worker processed a 1×1 pixel image, so service time T ≈ 0.001 s. With such a tiny T, Little's Law gives N = ⌈60 × 0.001⌉ + 1 = 2 workers even at peak 60 RPS. The predictor never needed to scale beyond N=2, so cold starts never occurred. Worse, scale-down was not yet implemented: workers forked in Cycle 1 stayed alive forever, so Cycle 2 trivially had zero cold starts, not because of EWMA memory, but because workers were never killed.
Result: P50=1.4ms, P95=1.7ms, P99=2.1ms, 0 cold starts. Looks great, means nothing.
Root cause of the misleading result: T was too small → N was always 2 → the predictor was irrelevant. No scale-down meant Cycle 1 "pre-warmed" Cycle 2 for free.
Logs: test_20260303_022336.log, test_20260303_022743.log
After adding time.sleep(0.5) to the worker (T=500ms), the warm path latency became ~502ms. The cold-start threshold in load_tester.py was still set at rtt > 500, so every warm request (502ms) was classified as COLD: 100% COLD across all phases. The fix was to set the threshold at 700ms, between the warm path (~502ms) and the cold fallback (~800ms).
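
The classification logic is tiny; here is a sketch of the bug and the fix (the `classify` helper is hypothetical, the thresholds are the ones discussed above):

```python
def classify(rtt_ms, threshold_ms):
    """Label one request by its round-trip time (hypothetical helper)."""
    return "COLD" if rtt_ms > threshold_ms else "WARM"

# V2.0 bug: warm-path requests (~502 ms) exceed the old 500 ms threshold,
# so every request in every phase was counted as COLD
assert classify(502, threshold_ms=500) == "COLD"   # misclassified

# Fix: 700 ms sits between the warm path (~502 ms) and cold fallback (~800 ms)
assert classify(502, threshold_ms=700) == "WARM"
assert classify(806, threshold_ms=700) == "COLD"
```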
Log: test_20260303_220742.log
With T=500ms, threshold=700ms, idle_timeout=6s, and an 8-second drain phase between cycles to force the scavenger to kill all workers:
| Phase | Total | WARM | COLD | P50 | P99 |
|---|---|---|---|---|---|
| C1-Warmup (2 RPS, 10s) | 20 | 18 | 2 | 503 ms | 806 ms |
| C1-Spike (30 RPS, 5s) | 150 | 55 | 95 | 801 ms | 806 ms |
| C1-Cooldown (2 RPS, 6s) | 12 | 12 | 0 | 502 ms | 504 ms |
| [Drain: 8 s zero traffic → scavenger kills all workers] | | | | | |
| C2-Warmup (2 RPS, 5s) | 10 | 8 | 2 | 515 ms | 812 ms |
| C2-Spike (30 RPS, 5s) | 150 | 64 | 86 | 801 ms | 804 ms |
Key comparison (spike phase only):
- C1-Spike: 55 warm / 95 cold → 37% warm hit rate
- C2-Spike: 64 warm / 86 cold → 43% warm hit rate
- Cold-start reduction: ≈9.5%
What caused the improvement? After C1's spike, EWMA climbs to ~15 RPS. During cooldown it decays to ~2.9 by C2-Warmup. Little's Law gives ⌈2.9 × 0.5⌉ + 1 = 3 workers, one more than in C1's warmup. That extra pre-warmed worker accounts for the +9 warm hits in C2's spike.
Why is the improvement modest? CUSUM is reactive, not predictive: it fires at second 2 of the spike, but new workers need 0.9s to warm up. Most spike improvement comes from EWMA memory during warmup, not CUSUM during the spike itself. The real bottleneck is that 30 RPS × 0.5s = 15 concurrent workers are needed, but we only pre-warm ~3. The system simply doesn't have enough workers during a steep spike.
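
The arithmetic behind the extra worker can be replayed in a few lines. This sketch assumes a 1 Hz predictor tick (an assumption; the text does not state the tick rate), with α and T taken from the analysis above:

```python
import math

alpha = 0.2   # EWMA weight from the analysis above
T = 0.5       # service time in seconds
ewma = 15.0   # EWMA baseline right after C1's spike

# ~12 ticks of 2 RPS traffic across cooldown + drain (1 Hz tick assumed)
for _ in range(12):
    ewma = alpha * 2.0 + (1 - alpha) * ewma

workers = math.ceil(ewma * T) + 1
# ewma has decayed to ~2.9, so Little's Law pre-warms 3 workers instead
# of the 2 that C1's warmup started with
```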
To reach paper-quality evaluation, three things are needed in order of priority:
- Implement baselines: Reactive (no predictor), Static (fixed N), and ARIMA (Python statsmodels). Without these, we have no Pareto curve and no paper claim.
- Measure overhead: CPU and memory footprint of each policy. This is the axis that differentiates us from ARIMA.
- CloudLab deployment: real bare-metal nodes with real network latency. Loopback measurements are clean but don't represent edge conditions.
```shell
# Dependencies
pip install Pillow

# Build
make clean && make

# Run server (port 8080)
./server

# Run two-cycle comparison test
python3 load_tester.py
```
Log output: `logs/test_<TIMESTAMP>.log`
Check #2 report: checkin2.pdf
| Change | Description |
|---|---|
| CoW Template Process | Added worker_template.py: loads Pillow once at startup; all subsequent workers are CoW-forked from it. Per-worker cold start drops from ~800 ms to ~100 ms |
| Narrative Reframe | Dropped the circular "bypass KVM for clean measurement" argument. New framing: edge inference nodes cannot afford MicroVM overhead, so OS process isolation is the appropriate data plane |
| Reactive Baseline | ./server reactive → pure scale-on-demand, no prediction; establishes cold-start lower bound |
| Static Baseline | ./server static 15 → fixed pool of 15 workers kept alive throughout; establishes resource upper bound |
| ARIMA Baseline | ./server arima → separate Python process running ARIMA(2,1,2); measures cost of heavyweight forecasting |
| 4-Cycle Workload | load_tester.py upgraded to 4 cycles (C1: 8 s warmup; C2–C4: 35 s warmup) to give ARIMA sufficient history |
| Experiment Automation | run_experiments.sh runs all 4 baselines serially and archives logs to logs/exp_<mode>_<ts>/ |
| Resource Monitor | resource_monitor.py samples RSS, CPU, and worker count at 1 Hz; outputs CSV for Pareto analysis |
Completed:
- ✅ CoW template process (worker_template.py + refactored WorkerPool.hpp)
- ✅ Reactive / Static / ARIMA baselines
- ✅ 4-cycle bursty-ramp load tester
- ✅ Automated experiment runner + log archiving
- ✅ Resource monitor + Pareto analysis
- ✅ Visualizations (3 figures: cold-start bar chart, worker time-series, Pareto scatter)
Remaining (Final Deliverable):
- ❌ CloudLab bare-metal deployment
- ❌ wrk-based high-frequency load generation
- ❌ Adaptive CUSUM threshold (fix C2/C4 regression)
- ❌ Frontend Web UI
- ❌ Final written report
```shell
# Dependencies
pip install Pillow statsmodels

# Build
make clean && make

# Run a specific mode
./server ewma       # default: EWMA+CUSUM prediction
./server reactive   # reactive baseline
./server static 15  # fixed pool of 15 workers
./server arima      # ARIMA prediction

# Run 4-cycle load test
python3 load_tester.py

# Run all 4 baselines (recommended: auto-archives logs)
./run_experiments.sh

# Run a single baseline
./run_experiments.sh ewma
```
Log output: `logs/exp_<mode>_<timestamp>/` (contains test.log, server.log, resource.csv)
Key Findings:
- EWMA+CUSUM achieves zero cold starts in C1/C3 (odd cycles), which Reactive never does. Predictive pre-warming works for bursty-ramp traffic
- C2/C4 regress to cold=36. Root cause: the first Ramp measurement window straddles the Cooldown/Ramp boundary and captures only ~9 RPS; CUSUM accumulates to 7.83, just 0.17 below the threshold of 8, so the alarm fires only after the Spike has begun
- ARIMA requires 4+ cycles to converge (C4 cold=10); it underperforms Reactive in C1–C3, confirming the "history cold-start" cost
- EWMA+CUSUM has the lowest average RSS (2,021 MB) despite peaking at 23 workers: the scavenger recycles workers every 6 s, keeping each new fork in a clean CoW state
Known Issues / Limitations:
- Fixed CUSUM threshold causes every-other-cycle regression; adaptive threshold is the planned fix
- All tests run on localhost loopback; CloudLab results with real network latency may differ
- RSS figures are summed per-process and double-count CoW shared pages; true physical memory (PSS) is lower
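
The RSS double-counting can be avoided on Linux by sampling PSS instead, which charges each shared page to its sharers proportionally. A minimal sketch, assuming `/proc/<pid>/smaps_rollup` is available (Linux 4.14+); this is not part of the current resource_monitor.py:

```python
def mem_kb(pid="self"):
    """Return (rss_kb, pss_kb) for one process from /proc (Linux-only).

    PSS charges each shared CoW page as page_size / number_of_sharers,
    so summing PSS across the template and its forked workers does not
    double-count the pages they share; summed RSS does."""
    rss = pss = 0
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Rss:"):
                rss = int(line.split()[1])   # value is in kB
            elif line.startswith("Pss:"):
                pss = int(line.split()[1])
    return rss, pss

# For any single process, PSS can never exceed RSS
rss, pss = mem_kb()
```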
This section captures the consolidated final results compiled ahead of the 2026-04-30 CSCI 599 class presentation. It systematically supplements Check #2 with three additions: CoW cold-start quantification, the Adaptive CUSUM baseline, and the Warmup-Sweep ablation. All figures live in `figures/pre/`; see `MANIFEST.md` for provenance.
| Addition | Files / Command | Description |
|---|---|---|
| CoW cold-start quantification | figures/plot_cow.py → slide05_cow.png | Hand-measured from worker.py simulated cold start + server.log "Worker N ready (CoW fork)" lines: Naive (exec Python + import) ≈ 900 ms vs CoW (fork from warm parent) ≈ 100 ms → 9× faster, no runtime dependency |
| CUSUM real-data trace | figures/plot_rps_cusum.py → slide06_rps_cusum.png | CUSUM accumulator trajectory reconstructed from sweep #3's real server.log (90 predictor ticks, 11 SPIKE DETECTED events, drift=5, h=8); confirms alarms fire entirely during the ramp phase |
| Adaptive CUSUM baseline | ./server ewma_adaptive | Uses EWMSD (running σ) for z-score normalization, the alternative to fixed-drift CUSUM |
| Workload design figure | figures/plot_workload.py → slide07_workload.png | load_tester.py's 4-cycle Bursty-Ramp parameters visualized as a timeline; the Ramp band is labeled as the CUSUM detection window |
| 5-mode main result | figures/plot_main_result.py → slide08_main_result.png | 2026-04-20 re-run of all 5 modes × 4 cycles; cold counts parsed from each load_tester_output.txt SPIKE COMPARISON table |
| Warmup-Sweep ablation | figures/plot_sweep.py → slide10_sweep.png | sweep #1 (W=5, 120 s endpoints) + sweep #3 (W=10, 20, 35, 60 s interior); fixed vs adaptive comparison |
| Mode | C1 | C2 | C3 | C4 | Total | Notes |
|---|---|---|---|---|---|---|
| Static-15 (over-provisioned) | 0 | 0 | 0 | 0 | 0 | 15 workers pinned the entire run: upper-bound reference |
| Adaptive CUSUM (EWMSD z-score) | 0 | 0 | 0 | 0 | 0 | W=35 happens to sit in the sweet spot |
| Fixed CUSUM (ours) | 0 | 14 | 33 | 0 | 47 | C3 hit a clock-aliasing event |
| Reactive (scale on backlog) | 20 | 15 | 20 | 12 | 67 | Reactive baseline |
| ARIMA (smoothed Target) | 20 | 18 | 31 | 16 | 85 | Heavyweight time-series forecasting |
Total cold starts = number of spike-phase requests (out of 600 across 4 cycles) classified as cold (RTT > 700 ms).
Key findings:
- Predict-based (Fixed CUSUM) clearly beats reactive baselines: 30% fewer cold starts than Reactive, 45% fewer than ARIMA
- 33 of those 47 come from a single C3 clock-aliasing event: the 2-second measurement window happened to truncate the ramp, leaving CUSUM at 18.24 instead of the expected ~8-10 (server.log at t=1776671064), delaying the first SPIKE DETECTED by 4 seconds. This is a known failure mode of fixed-drift CUSUM, not a bug. Without this event, total ≈ 14, almost at the Static-15 floor, but without pinning 15 workers
- CoW Template drops per-worker spin-up from 900 ms to 100 ms (9× speedup, no runtime dependency)
- CUSUM fires during the ramp, not after the peak: a single 200 s 4-cycle run produces 11 SPIKE DETECTED alarms, all of them during the ramp climb and none after the peak
We pre-import Pillow and set up the socket once in a single template Python process. Every new worker is cloned from it via fork(). Linux's copy-on-write makes the fork itself nearly free: Pillow's code and import tables are read-only, so almost no pages are duplicated.
→ No image. No snapshot. No registry. Just fork() from a warm parent.
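
A toy version of the template trick, assuming `json` as a stand-in for Pillow so the sketch has no third-party dependency (the real worker_template.py also inherits a listening socket):

```python
import os

# Template startup: pay the heavy import once. In the real
# worker_template.py this is Pillow; json stands in here.
import json

def spawn_worker(handler):
    """Clone a worker from the warm template via fork().

    The child starts with every imported module already in memory;
    copy-on-write means the fork duplicates almost no pages."""
    pid = os.fork()
    if pid == 0:              # child: run the job, never return
        try:
            handler()
        finally:
            os._exit(0)       # never unwind into the parent's stack
    return pid                # parent: remember the child for reaping

pid = spawn_worker(lambda: json.dumps({"ok": True}))
os.waitpid(pid, 0)
```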
Top panel: blue is measured RPS, dashed orange is the EWMA baseline (α=0.2). The baseline lags by design: that lag is what lets CUSUM observe the gap when RPS pulls away.
Bottom panel: green is the CUSUM accumulator. It crosses the red threshold h = 8 and alarms fire: all 11 alarms land during the ramp climb, none after the peak. This is the empirical proof of "catch the ramp, not the peak".
Each cycle simulates one "train arrival" pattern: Warmup → Ramp (30 s) → Spike (30 RPS × 5 s) → Cooldown → Drain.
- C1 warmup = 8 s: tests the truly cold case with no prior history
- C2–C4 warmup = 35 s: long enough for the scavenger to scale workers back down, but the EWMA baseline still remembers the last spike; this tests the predictor's cross-cycle memory
- The orange Ramp band is the CUSUM detection window: every worker we want at peak time has to be forked within those 30 seconds
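
The cycle can be written down as a per-second RPS plan. Ramp and spike numbers follow the text; the cooldown and drain lengths here are assumed values for illustration:

```python
def cycle_schedule(warmup_s, ramp_s=30, spike_s=5, cooldown_s=6, drain_s=8,
                   base=2.0, peak=30.0):
    """Per-second RPS targets for one Bursty-Ramp cycle.

    Ramp (30 s) and Spike (30 RPS x 5 s) match the workload figure;
    the cooldown/drain durations are assumptions, not the exact ones."""
    plan = [base] * warmup_s                            # Warmup
    plan += [base + (peak - base) * (i + 1) / ramp_s    # Ramp: linear climb
             for i in range(ramp_s)]
    plan += [peak] * spike_s                            # Spike
    plan += [base] * cooldown_s                         # Cooldown
    plan += [0.0] * drain_s                             # Drain: zero traffic
    return plan

# C1 gets the short 8 s warmup; C2-C4 use 35 s
workload = cycle_schedule(8) + 3 * cycle_schedule(35)
```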
| W (s) | Fixed CUSUM | Adaptive CUSUM |
|---|---|---|
| 5 | 48 | 287 |
| 10 | 45 | 135 |
| 20 | 0 | 0 |
| 35 | 0 | 0 |
| 60 | 0 | 0 |
| 120 | 32 | 0 |
- Adaptive fails at short W (τ_σ cliff): the running σ estimator cannot decay between back-to-back bursts, so the z-score never crosses the threshold. Measured τ_σ ≈ 6.6 s
- Fixed fails at long W (aliasing miss): threshold h=8 was tuned for the typical ramp, and the 2 s measurement window can truncate the ramp at boundary cases
- Sweet spot: W = 20-60 s, where both predictors work. The main result at W=35 sits inside this band, which is why both score 0
- Grid totals: fixed=131 < adaptive=423, so raw counts actually favor fixed
→ Framing: two failure modes, no winner. Fixed handles tight cadence; adaptive handles loose cadence. The predictor isn't a choice; it's a knob. Adaptive's real contribution is scale invariance + aliasing robustness, not lower cold-start counts.
- n = 1 per sweep point: the sweep grid is single-trial due to the time budget heading into the class talk; CloudLab multi-trial (n ≥ 5) is deferred to the final report
- The current Python load_tester caps near 300 RPS, not enough to model realistic edge bursts (target ≥ 2k RPS requires wrk or a Rust async generator)
- A regime-aware ensemble is the natural research follow-up to the sweep result: fixed + adaptive + a meta-controller, picking automatically by workload cadence; this closes the W ≤ 10 s gap without giving up adaptive's scale invariance
- All tests still run on localhost loopback (no real network latency)
- Reported RSS sums per-process residency and double-counts CoW shared pages; true PSS is lower
- Presentation deliverables: docs/pre_how_4.md (13-slide script, EN/ZH bilingual) + figures/pre/ (5 figures)