ffengc/edge-faas-cpp
Edge FaaS Cold-Start Mitigation via EWMA + CUSUM Prediction

Language: English | 中文版 (Chinese)

Warning

Not all source files have been fully polished yet (e.g., comments, code structure, and style). Some cleanup work has been done, but much remains. All code will be brought up to professional engineering quality before the end of the semester. Apologies for the current state. 🙏

Prof: Ramesh Govindan

Proposal

Proposal Report: Proposal.pdf

Check #1

#1 - What We're Doing

This project builds a clean-slate, bare-metal FaaS (Function-as-a-Service) worker node from scratch in C++, targeting the cold-start problem in edge serverless environments.

The core research question: can a lightweight, O(1) prediction heuristic (EWMA + CUSUM + Little's Law) reduce cold-start penalties in cyclic workloads while incurring far lower overhead than heavyweight time-series models like ARIMA?

Key Design Principles

  • No Kubernetes, no Docker. OS-level process primitives + Copy-on-Write as a "proxy sandbox." The goal is to decouple and precisely measure our prediction algorithm's overhead without KVM/container noise masking the signal. This is a methodological adjustment made after the advisor raised the Firecracker question: we are measuring the control plane, not reinventing data-plane isolation technology.
  • Strict Reactor / DispatchPool separation. The epoll event loop never blocks; all UDS I/O is offloaded to a 64-thread DispatchPool. The predictor runs on the same thread as the event loop with zero contention.
  • Simplicity-first. EWMA + CUSUM run in O(1) time and O(1) memory. We claim this is sufficient for cyclic IoT workloads and intend to prove it against ARIMA on a Pareto (latency vs. memory) curve.

Architecture


#2 - Completion Status

Implemented (Core System - All Running)

| Module | File | Status |
|---|---|---|
| C++ Epoll Reactor (ET mode, non-blocking) | tcp_server.hpp | ✅ Done |
| 64-thread DispatchPool (all blocking I/O here) | DispatchPool.hpp | ✅ Done |
| WorkerPool: fork + scale-down scavenger thread | WorkerPool.hpp | ✅ Done |
| Predictor: EWMA + CUSUM + Little's Law | Predictor.hpp | ✅ Done |
| Dynamic T feedback (UpdateServiceTime) | web_server.hpp | ✅ Done |
| Python Worker (simulates T=500 ms AI inference) | worker.py | ✅ Done |
| Two-cycle comparison test + drain phase | load_tester.py | ✅ Done |

Not Yet Done (Gap to Paper-Level Evaluation)

| Missing Item | Type | Priority |
|---|---|---|
| ARIMA baseline (Python statsmodels) | Code + Experiment | 🔴 High: explicitly requested by advisor |
| Reactive baseline (predictor disabled, cold fallback only) | Code + Experiment | 🔴 High: needed to show worst case |
| Static baseline (fixed N workers, no scaling) | Code + Experiment | 🟡 Medium: needed for Pareto curve |
| Memory / CPU overhead measurement | Experiment | 🔴 High: core data for Pareto curve |
| Predictor inference latency measurement | Experiment | 🟡 Medium: proves EWMA/CUSUM << ARIMA |
| CloudLab bare-metal testing | Environment | 🟡 Medium: loopback has no network noise |
| Cyclic bursty workload with wrk | Experiment | 🟡 Medium: proposal uses wrk, not a Python script |

#3 - Test Journey & Results

Phase 1 - V1.0: Trivial Workload (Misleading "Perfect" Results)

Log: test_20260303_013519.log

In V1.0, the Python worker processed a 1×1-pixel image, so service time was T ≈ 0.001 s. With such a tiny T, Little's Law gives N = ⌈60 × 0.001⌉ + 1 = 2 workers even at the peak of 60 RPS. The predictor never needed to scale beyond N = 2, so cold starts never occurred. Worse, scale-down was not yet implemented: workers forked in Cycle 1 stayed alive forever, so Cycle 2 trivially had zero cold starts, not because of EWMA memory but because workers were never killed.

Result: P50=1.4ms, P95=1.7ms, P99=2.1ms, 0 cold starts. Looks great, means nothing.

Root cause of the misleading result: T was too small, so N was always 2 and the predictor was irrelevant. With no scale-down, Cycle 1 "pre-warmed" Cycle 2 for free.


Phase 2 - Threshold Bug: Everything Classified as COLD

Logs: test_20260303_022336.log, test_20260303_022743.log

After adding time.sleep(0.5) to the worker (T=500 ms), the warm-path latency became ~502 ms. The cold-start threshold in load_tester.py was still set at rtt > 500, so every warm request (~502 ms) was classified as COLD: 100% COLD across all phases. The fix was to set the threshold at 700 ms, between the warm path (~502 ms) and the cold fallback (~800 ms).


Phase 3 - V2.0: Properly Calibrated Experiment

Log: test_20260303_220742.log

With T=500ms, threshold=700ms, idle_timeout=6s, and an 8-second drain phase between cycles to force the scavenger to kill all workers:

| Phase | Total | WARM | COLD | P50 | P99 |
|---|---|---|---|---|---|
| C1-Warmup (2 RPS, 10 s) | 20 | 18 | 2 | 503 ms | 806 ms |
| C1-Spike (30 RPS, 5 s) | 150 | 55 | 95 | 801 ms | 806 ms |
| C1-Cooldown (2 RPS, 6 s) | 12 | 12 | 0 | 502 ms | 504 ms |
| Drain (8 s zero traffic; scavenger kills all workers) | - | - | - | - | - |
| C2-Warmup (2 RPS, 5 s) | 10 | 8 | 2 | 515 ms | 812 ms |
| C2-Spike (30 RPS, 5 s) | 150 | 64 | 86 | 801 ms | 804 ms |

Key comparison (spike phase only):

  • C1-Spike: 55 warm / 95 cold → 37% warm hit rate
  • C2-Spike: 64 warm / 86 cold → 43% warm hit rate
  • Cold starts reduced by 9.5% (95 → 86)

What caused the improvement? After C1's spike, the EWMA climbs to ~15 RPS. During cooldown it decays to ~2.9 by C2-Warmup. Little's Law gives ⌈2.9 × 0.5⌉ + 1 = 3 workers, one more than in C1's warmup. That extra pre-warmed worker accounts for the +9 warm hits in C2's spike.

Why is the improvement modest? CUSUM is reactive, not predictive: it fires at second 2 of the spike, but new workers need 0.9 s to warm up. Most of the spike improvement comes from EWMA memory during warmup, not from CUSUM during the spike itself. The real bottleneck: 30 RPS × 0.5 s = 15 concurrent workers are needed, but we only pre-warm ~3. The system simply does not have enough workers during a steep spike.


#4 - Next Steps

To reach paper-quality evaluation, three things are needed in order of priority:

  1. Implement baselines: Reactive (no predictor), Static (fixed N), and ARIMA (Python statsmodels). Without these, we have no Pareto curve and no paper claim.
  2. Measure overhead: CPU and memory footprint of each policy. This is the axis that differentiates us from ARIMA.
  3. CloudLab deployment: real bare-metal nodes with real network latency. Loopback measurements are clean but do not represent edge conditions.

Build & Run

# Dependencies
pip install Pillow

# Build
make clean && make

# Run server (port 8080)
./server

# Run two-cycle comparison test
python3 load_tester.py

Log output: logs/test_<TIMESTAMP>.log


Check #2

Check #2 report: checkin2.pdf

What Changed Since Check #1

| Change | Description |
|---|---|
| CoW Template Process | Added worker_template.py: loads Pillow once at startup; all subsequent workers are CoW-forked from it. Per-worker cold start drops from ~800 ms to ~100 ms |
| Narrative Reframe | Dropped the circular "bypass KVM for clean measurement" argument. New framing: edge inference nodes cannot afford MicroVM overhead, so OS process isolation is the appropriate data plane |
| Reactive Baseline | ./server reactive: pure scale-on-demand, no prediction; establishes the cold-start lower bound |
| Static Baseline | ./server static 15: fixed pool of 15 workers kept alive throughout; establishes the resource upper bound |
| ARIMA Baseline | ./server arima: separate Python process running ARIMA(2,1,2); measures the cost of heavyweight forecasting |
| 4-Cycle Workload | load_tester.py upgraded to 4 cycles (C1: 8 s warmup; C2-C4: 35 s warmup) to give ARIMA sufficient history |
| Experiment Automation | run_experiments.sh runs all 4 baselines serially and archives logs to logs/exp_<mode>_<ts>/ |
| Resource Monitor | resource_monitor.py samples RSS, CPU, and worker count at 1 Hz; outputs CSV for Pareto analysis |
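The 1 Hz sampling loop at the heart of the resource monitor can be sketched as follows. This is an illustrative reconstruction, not the actual resource_monitor.py: it assumes a Linux /proc filesystem and samples only RSS, whereas the real script also records CPU and worker count.

```python
import time

def rss_kib(pid="self"):
    """Read a process's resident set size (KiB) from Linux /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # procfs reports the value in kB
    return 0

def sample(seconds=3, hz=1):
    """Collect one (timestamp, rss) row per tick at the given rate."""
    rows = []
    for _ in range(int(seconds * hz)):
        rows.append((time.time(), rss_kib()))
        time.sleep(1.0 / hz)
    return rows
```

Each row would then be written out as one CSV line, giving the time series behind the Pareto analysis.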

Check #2 Checklist

Completed:

  • β˜‘ CoW template process (worker_template.py + refactored WorkerPool.hpp)
  • β˜‘ Reactive / Static / ARIMA baselines
  • β˜‘ 4-cycle bursty-ramp load tester
  • β˜‘ Automated experiment runner + log archiving
  • β˜‘ Resource monitor + Pareto analysis
  • β˜‘ Visualizations (3 figures: cold-start bar chart, worker time-series, Pareto scatter)

Remaining (Final Deliverable):

  • ☐ CloudLab bare-metal deployment
  • ☐ wrk-based high-frequency load generation
  • ☐ Adaptive CUSUM threshold (fix C2/C4 regression)
  • ☐ Frontend Web UI
  • ☐ Final written report

How to Run

# Dependencies
pip install Pillow statsmodels

# Build
make clean && make

# Run a specific mode
./server ewma          # default: EWMA+CUSUM prediction
./server reactive      # reactive baseline
./server static 15     # fixed pool of 15 workers
./server arima         # ARIMA prediction

# Run 4-cycle load test
python3 load_tester.py

# Run all 4 baselines (recommended; auto-archives logs)
./run_experiments.sh

# Run a single baseline
./run_experiments.sh ewma

Log output: logs/exp_<mode>_<timestamp>/ (contains test.log, server.log, resource.csv)

Key Findings & Known Issues

Key Findings:

  • EWMA+CUSUM achieves zero cold starts in C1/C3 (odd cycles) β€” Reactive never does. Predictive pre-warming works for bursty-ramp traffic
  • C2/C4 regress to cold=36. Root cause: the first Ramp measurement window straddles the Cooldown/Ramp boundary and captures only ~9 RPS; CUSUM accumulates to 7.83 β€” just 0.17 below the threshold of 8 β€” so the alarm fires only after the Spike has begun
  • ARIMA requires 4+ cycles to converge (C4 cold=10); it underperforms Reactive in C1–C3, confirming the "history cold-start" cost
  • EWMA+CUSUM has the lowest average RSS (2,021 MB) despite peaking at 23 workers β€” the scavenger recycles workers every 6 s, keeping each new fork in a clean CoW state

Known Issues / Limitations:

  • Fixed CUSUM threshold causes every-other-cycle regression; adaptive threshold is the planned fix
  • All tests run on localhost loopback; CloudLab results with real network latency may differ
  • RSS figures are summed per-process and double-count CoW shared pages; true physical memory (PSS) is lower

Before the Presentation - Final Results Heading into the Class Talk

This section captures the consolidated final results compiled ahead of the 2026-04-30 CSCI 599 class presentation. It supplements Check #2 with three additions: CoW cold-start quantification, the Adaptive CUSUM baseline, and the Warmup-Sweep ablation. All figures live in figures/pre/; see MANIFEST.md for provenance.

What Was Added Since Check #2

| Addition | Files / Command | Description |
|---|---|---|
| CoW cold-start quantification | figures/plot_cow.py → slide05_cow.png | Hand-measured from worker.py's simulated cold start plus server.log "Worker N ready (CoW fork)" lines: Naive (exec Python + import) ≈ 900 ms vs CoW (fork from warm parent) ≈ 100 ms, a 9× speedup with no runtime dependency |
| CUSUM real-data trace | figures/plot_rps_cusum.py → slide06_rps_cusum.png | CUSUM accumulator trajectory reconstructed from sweep #3's real server.log (90 predictor ticks, 11 SPIKE DETECTED events, drift=5, h=8); confirms alarms fire entirely during the ramp phase |
| Adaptive CUSUM baseline | ./server ewma_adaptive | Uses EWMSD (running σ) for z-score normalization, the alternative to fixed-drift CUSUM |
| Workload design figure | figures/plot_workload.py → slide07_workload.png | load_tester.py's 4-cycle Bursty-Ramp parameters visualized as a timeline; the Ramp band is labeled as the CUSUM detection window |
| 5-mode main result | figures/plot_main_result.py → slide08_main_result.png | 2026-04-20 re-run of all 5 modes × 4 cycles; cold counts parsed from each load_tester_output.txt SPIKE COMPARISON table |
| Warmup-Sweep ablation | figures/plot_sweep.py → slide10_sweep.png | sweep #1 (W=5, 120 s endpoints) + sweep #3 (W=10, 20, 35, 60 s interior); fixed vs adaptive comparison |

Main Result (5 modes × 4 cycles, single trial on 2026-04-20)

| Mode | C1 | C2 | C3 | C4 | Total | Notes |
|---|---|---|---|---|---|---|
| Static-15 (over-provisioned) | 0 | 0 | 0 | 0 | 0 | 15 workers pinned the entire run; upper-bound reference |
| Adaptive CUSUM (EWMSD z-score) | 0 | 0 | 0 | 0 | 0 | W=35 happens to sit in the sweet spot |
| Fixed CUSUM (ours) | 0 | 14 | 33 | 0 | 47 | C3 hit a clock-aliasing event |
| Reactive (scale on backlog) | 20 | 15 | 20 | 12 | 67 | Reactive baseline |
| ARIMA (smoothed Target) | 20 | 18 | 31 | 16 | 85 | Heavyweight time-series forecasting |

Total cold starts = number of spike-phase requests (out of 600 across 4 cycles) classified as cold (RTT > 700 ms).

Key findings:

  • Predict-based (Fixed CUSUM) clearly beats reactive baselines: 30% fewer cold starts than Reactive, 45% fewer than ARIMA
  • 33 of those 47 come from a single C3 clock-aliasing event: the 2-second measurement window happened to truncate the ramp, leaving CUSUM at 18.24 instead of the expected ~8–10 (server.log at t=1776671064), delaying the first SPIKE DETECTED by 4 seconds. This is a known failure mode of fixed-drift CUSUM, not a bug. Without this event, total β‰ˆ 14 β€” almost at the Static-15 floor, but without pinning 15 workers
  • CoW Template drops per-worker spin-up from 900 ms to 100 ms (9Γ— speedup, no runtime dependency)
  • CUSUM fires during the ramp, not after the peak: a single 200 s 4-cycle run produces 11 SPIKE DETECTED alarms, all of them during the ramp climb β€” none after the peak

CoW Template - 9× Cold-Start Speedup

We pre-import Pillow and set up the socket once in a single template Python process. Every new worker is cloned from it via fork(). Linux's copy-on-write makes the fork itself nearly free: Pillow's code and import tables are read-only, so almost no pages are duplicated.

→ No image. No snapshot. No registry. Just fork() from a warm parent.
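The template idea fits in a few lines. This is an illustrative sketch, not worker_template.py itself: the real template imports Pillow and speaks the server's UDS protocol, which is not reproduced here.

```python
import os

# Template process: pay the heavy import cost once at startup.
# (A real template would `import PIL.Image` or load a model here.)
warm_state = {"model": "loaded"}  # stands in for pre-imported runtime state

def spawn_worker():
    """Clone a ready-to-serve worker via fork(); CoW shares all warm pages."""
    pid = os.fork()
    if pid == 0:
        # Child: inherits warm_state without re-importing anything.
        ok = warm_state["model"] == "loaded"
        os._exit(0 if ok else 1)  # a real worker would enter its serve loop
    return pid

pid = spawn_worker()
_, status = os.waitpid(pid, 0)
```

Because the child only ever reads the imported modules, copy-on-write leaves those pages shared, which is what collapses the ~900 ms naive spin-up to ~100 ms.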

CUSUM Real-Data Trace - Alarms During the Ramp, Not the Peak

Top panel: blue is measured RPS, dashed orange is the EWMA baseline (α=0.2). The baseline lags by design; that lag is what lets CUSUM observe the gap when RPS pulls away.

Bottom panel: green is the CUSUM accumulator. It crosses the red threshold h = 8 and ★ alarms fire. All 11 alarms land during the ramp climb, none after the peak: this is the empirical proof of "catch the ramp, not the peak".

Workload Design - Bursty-Ramp × 4 Cycles

Each cycle simulates one "train arrival" pattern: Warmup → Ramp (30 s) → Spike (30 RPS × 5 s) → Cooldown → Drain.

  • C1 warmup = 8 s: tests the truly cold case β€” no prior history
  • C2–C4 warmup = 35 s: long enough for the scavenger to scale workers back down, but the EWMA baseline still remembers the last spike β€” this tests the predictor's cross-cycle memory
  • The orange Ramp band is the CUSUM detection window: every worker we want at peak time has to be forked within those 30 seconds

Warmup-Sweep Ablation - Two Failure Modes, No Winner

| W (s) | Fixed CUSUM | Adaptive CUSUM |
|---|---|---|
| 5 | 48 | 287 |
| 10 | 45 | 135 |
| 20 | 0 | 0 |
| 35 | 0 | 0 |
| 60 | 0 | 0 |
| 120 | 32 | 0 |
  • Adaptive fails at short W (Ο„_Οƒ cliff): the running Οƒ estimator cannot decay between back-to-back bursts, so the z-score never crosses threshold. Measured Ο„_Οƒ β‰ˆ 6.6 s
  • Fixed fails at long W (aliasing miss): threshold h=8 was tuned for the typical ramp, and the 2 s measurement window can truncate the ramp at boundary cases
  • Sweet spot: W = 20 ~ 60 s β€” both predictors work here. The main result at W=35 sits inside this band, which is why both score 0
  • Grid totals: fixed=131 < adaptive=423 β€” raw counts actually favor fixed

⚠ Framing: two failure modes, no winner. Fixed handles tight cadence; adaptive handles loose cadence. The predictor isn't a choice, it's a knob. Adaptive's real contribution is scale invariance + aliasing robustness, not lower cold-start counts.
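A sketch of the EWMSD z-score idea behind ./server ewma_adaptive, as inferred from the description above (the C++ implementation may differ; k and h here are hypothetical values in σ units): instead of a fixed drift, the gap between RPS and the baseline is normalized by a running estimate of σ before it enters the accumulator.

```python
import math

class AdaptiveCusum:
    """Z-score CUSUM: normalize (rps - ewma) by a running sigma (EWMSD).

    Illustrative sketch; k (slack) and h (threshold) are hypothetical
    values in sigma units, not tuned constants from the repo.
    """

    def __init__(self, alpha=0.2, k=0.5, h=5.0):
        self.alpha = alpha   # smoothing for both running mean and variance
        self.k = k           # slack in sigma units (replaces the fixed drift)
        self.h = h           # alarm threshold in sigma units
        self.ewma = 0.0      # running mean of RPS
        self.ewmvar = 1.0    # running variance (EWMSD squared); seeded nonzero
        self.s = 0.0         # one-sided accumulator

    def observe(self, rps):
        sigma = max(math.sqrt(self.ewmvar), 1e-6)
        z = (rps - self.ewma) / sigma           # scale-invariant gap
        self.s = max(0.0, self.s + z - self.k)
        alarm = self.s > self.h
        # Update mean and variance after the gap is measured (baseline lags).
        delta = rps - self.ewma
        self.ewma += self.alpha * delta
        self.ewmvar = (1 - self.alpha) * (self.ewmvar + self.alpha * delta * delta)
        return alarm

det = AdaptiveCusum()
# Quiet traffic at 2 RPS, then a steep ramp: alarms should begin at the ramp.
fired = [t for t, rps in enumerate([2] * 10 + [10, 20, 30, 30]) if det.observe(rps)]
```

Because the threshold is expressed in σ units, the same detector works whether a "burst" means 10 RPS or 1000 RPS, which is the scale-invariance claim above; the τ_σ cliff appears exactly because ewmvar needs several windows to decay between back-to-back bursts.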

Known Limitations / Final Report TODOs

  • n = 1 per sweep point: the sweep grid is single-trial due to time budget heading into the class talk β†’ CloudLab multi-trial (n β‰₯ 5) is deferred to the final report
  • Current Python load_tester caps near 300 RPS, not enough to model realistic edge bursts (target β‰₯ 2k RPS, requires wrk or a Rust async generator)
  • Regime-aware ensemble is the natural research follow-up to the sweep result: fixed + adaptive + meta-controller, picking automatically by workload cadence β€” closes the W ≀ 10 s gap without giving up adaptive's scale invariance
  • All tests still run on localhost loopback (no real network latency)
  • Reported RSS sums per-process residency and double-counts CoW shared pages; true PSS is lower
  • Presentation deliverables: docs/pre_how_4.md (13-slide script, EN/ZH bilingual) + figures/pre/ (5 figures)

About

USC CS599 | Mitigating Serverless Cold Starts in Edge Environments via Histogram-based Prediction 🚀