fronted/scanner: client-side CDN front discovery (draft)#488
Draft
myleshorton wants to merge 8 commits into
Adds a probe-based scanner that turns the existing fronted.yaml.gz masquerades, plus opportunistic CloudFront-range samples and Akamai hostname-regex draws, into a ranked list of (IP, outer SNI, inner Host) tuples that work from the client's network position.

Why this exists at all: censorship in IR moves fast enough that a config push isn't a tight enough loop, and the working fronts are per-(ISP, geography, time-of-day) per Samim Mirhosseini. The scanner runs client-side and reports per-client truth.

Pieces:
- scanner.go: Candidate / Result / Probe / Scan / RankWorking. Probe does TCP + uTLS handshake + HTTPS GET to TestURL with the inner Host header. Only OK on a 2xx.
- candidates.go: CandidatesFromConfig flattens domainfront.Config into the primary probe pool. SNIsForProvider extracts the masquerade-domain pool for use with CloudFrontCandidates.
- cloudfront.go: 204 CloudFront IPv4 prefixes embedded; weighted random sampling pairs IPs with caller-supplied outer SNIs.
- akamai.go: SystemResolver (OS/ISP resolver; the ISP is the right source in IR). Akamai candidates leave SNI empty, matching fronted.yaml.gz, and verify against AkamaiCertHostname for every entry. GenerateAkamaiHostnames produces the Psiphon/MahsaNG regex pattern.

22 unit tests, plus opt-in (SCANNER_INTEGRATION=1) live-network tests. Akamai integration: ~100% hit rate against the canonical edge hostname.
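A minimal sketch of drawing hostnames from the Akamai regex quoted in the PR description, `a([1-9]|1[0-9])([0-9]{2})\.(dsc)?(b|d|g|g2|na|r|w7)\.akamai\.net`. The function name `sampleAkamaiHostnames`, the seed parameter, and the attempt bound are assumptions for illustration; this is not the PR's GenerateAkamaiHostnames.

```go
package main

import (
	"fmt"
	"math/rand"
	"regexp"
)

// akamaiPattern is the regex from the PR description, anchored so that
// MatchString checks the whole hostname.
var akamaiPattern = regexp.MustCompile(
	`^a([1-9]|1[0-9])([0-9]{2})\.(dsc)?(b|d|g|g2|na|r|w7)\.akamai\.net$`)

// sampleAkamaiHostnames (hypothetical helper) draws up to n distinct
// hostnames that match akamaiPattern. A fixed seed makes runs reproducible.
func sampleAkamaiHostnames(seed int64, n int) []string {
	if n <= 0 {
		return nil
	}
	r := rand.New(rand.NewSource(seed))
	suffixes := []string{"b", "d", "g", "g2", "na", "r", "w7"}
	seen := make(map[string]bool, n)
	out := make([]string, 0, n)
	// Bound the attempts so an oversized n cannot loop forever.
	for attempts := 0; len(out) < n && attempts < 20*n; attempts++ {
		lead := 1 + r.Intn(19) // ([1-9]|1[0-9]) -> 1..19
		tail := r.Intn(100)    // ([0-9]{2})     -> 00..99
		dsc := ""
		if r.Intn(2) == 1 { // (dsc)? is optional
			dsc = "dsc"
		}
		h := fmt.Sprintf("a%d%02d.%s%s.akamai.net",
			lead, tail, dsc, suffixes[r.Intn(len(suffixes))])
		if !seen[h] {
			seen[h] = true
			out = append(out, h)
		}
	}
	return out
}

func main() {
	for _, h := range sampleAkamaiHostnames(1, 5) {
		fmt.Println(h, akamaiPattern.MatchString(h))
	}
}
```

Each drawn hostname would then be resolved through the system resolver, with certificate verification pinned to the canonical edge hostname rather than the drawn name.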
Adds the layer on top of the probe primitives: a Service that runs scans on a schedule, persists working fronts to disk, exposes a round-robin Pick API for consumers, and re-scans when consumers report failures.

Lifecycle:
- NewService(cfg) loads any prior cache (filtered by CacheTTL so stale entries don't seed the live pool with already-blocked IPs)
- Start(ctx) runs the periodic refresh loop until ctx is canceled or Close is called
- Working() returns the current ranked list; Pick() returns the next one round-robin so all working fronts get traffic rather than every dial pinning to the lowest-latency entry
- ReportFailure(c) tracks per-front failures; after two failures within a refresh cycle the front is dropped, and if the working list falls below MinWorkingFronts a refresh is signaled
- Refresh() is a manual trigger

BuildPool composes candidates from the three feeders (known masquerades from fronted.yaml.gz, regex-generated Akamai hostnames resolved via SystemResolver, random CloudFront IPs paired with masquerade SNIs). Sample sizes <= 0 disable a feeder.

Cache schema is versioned JSON written atomically (write tmp + rename). Missing file is not an error: first-boot loads nothing and proceeds to the first scan.

Defaults: RefreshInterval 1h, CacheTTL 6h (matches Samim's "time-of-day" observation that working fronts shift on roughly that timescale), MinWorkingFronts 3.

Tests: 11 new (cache save/load/TTL/missing/version + service round-robin/empty/failure-removal/low-water-signal/cache-restore/no-config-is-error + BuildPool known-only and CloudFront paths).
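The Pick / ReportFailure behavior described above can be sketched roughly as follows. The types `front` and `pool`, the buffered refresh channel, and the `Replace` method are hypothetical stand-ins, not the Service in this PR; the two-failure threshold and the low-water signal follow the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// front is a hypothetical, comparable stand-in for a working candidate.
type front struct{ IP, Host string }

type pool struct {
	mu         sync.Mutex
	fronts     []front
	next       int
	failures   map[front]int
	minWorking int
	refresh    chan struct{} // buffered(1): at most one pending refresh signal
}

func newPool(fronts []front, minWorking int) *pool {
	p := &pool{minWorking: minWorking, refresh: make(chan struct{}, 1)}
	p.Replace(fronts)
	return p
}

// Replace installs a fresh scan result and starts a new failure-counting cycle.
func (p *pool) Replace(fronts []front) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.fronts = append([]front(nil), fronts...)
	p.failures = map[front]int{}
	p.next = 0
}

// Pick returns fronts round-robin so traffic spreads over every working entry.
func (p *pool) Pick() (front, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.fronts) == 0 {
		return front{}, false
	}
	f := p.fronts[p.next%len(p.fronts)]
	p.next = (p.next + 1) % len(p.fronts)
	return f, true
}

// ReportFailure drops a front after its second failure in the current cycle
// and signals a refresh when the list falls below minWorking.
func (p *pool) ReportFailure(f front) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.failures[f]++
	if p.failures[f] < 2 {
		return
	}
	kept := make([]front, 0, len(p.fronts))
	for _, x := range p.fronts {
		if x != f {
			kept = append(kept, x)
		}
	}
	p.fronts = kept
	delete(p.failures, f)
	if len(p.fronts) < p.minWorking {
		select {
		case p.refresh <- struct{}{}: // wake the refresh loop
		default: // a refresh is already pending
		}
	}
}

func main() {
	p := newPool([]front{{"192.0.2.1", "a"}, {"192.0.2.2", "b"}, {"192.0.2.3", "c"}}, 3)
	p.ReportFailure(front{"192.0.2.2", "b"})
	p.ReportFailure(front{"192.0.2.2", "b"})
	for i := 0; i < 4; i++ {
		f, _ := p.Pick()
		fmt.Println(f.Host)
	}
	fmt.Println("refresh pending:", len(p.refresh) == 1)
}
```

The non-blocking send keeps ReportFailure from stalling a dial path when a refresh has already been requested.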
Adds the consumer layer that converts the scanner.Service's working list into []FrontSpec entries ready for the lantern-box meek outbound's JSON configuration.

Provider owns the Service lifecycle, wires the bypass dialer so probes don't loop through the active VPN TUN, and uses TrustedCAsPool from the loaded domainfront config so cert validation matches production.

FrontSpec is a local mirror of lantern-box/option.FrontSpec: same JSON shape, kept local to avoid version-coupling radiance to lantern-box's release cadence (the meek option type lands in lantern-box#265 and isn't published yet).

Service lifecycle fix: Close no longer hangs when Start was never called. NewProvider returns an error for nil Config instead of panicking inside TrustedCAsPool.

Adds a live-network timing benchmark (TestLive_TimeToFirstWorking, gated on SCANNER_INTEGRATION=1) that loads the production fronted.yaml.gz, builds a 70+ candidate pool, runs a full scan, and reports time-to-first-working / total scan time / per-feeder hit rate / per-probe latency p50/p90.

On a sample run from a US dev network:
- pool: 72 candidates (50 known + Akamai-DNS-resolved + 10 CloudFront-random)
- time to first working front: 205ms
- scan complete: 35/72 working in 8.79s
- akamai: 35/36 working (97%)
- cloudfront: 0/36 working (0%); fronted.yaml.gz cloudfront testurl is stale
- per-probe latency: p50=218ms, p90=1.47s, min=142ms

Sub-second time-to-usability means a cold-boot client gets a working front before the user notices. CloudFront's 0% is the known POP-vs-distribution issue (#3525); production deployment with a fresh, globally-served test URL would lift that.
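For readers who want to reproduce the p50/p90 figures from their own probe results, here is one possible way to compute them, assuming the nearest-rank percentile definition. The benchmark's actual method is not stated in the source; `percentile` and `millis` are hypothetical helpers and the sample values are synthetic.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// millis builds a latency slice from millisecond values (illustrative helper).
func millis(values ...int) []time.Duration {
	out := make([]time.Duration, len(values))
	for i, v := range values {
		out[i] = time.Duration(v) * time.Millisecond
	}
	return out
}

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of the
// given latencies, or 0 for an empty input. The input slice is not modified.
func percentile(latencies []time.Duration, p float64) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Synthetic latencies, not measurements from the PR.
	samples := millis(500, 100, 300, 200, 400)
	fmt.Println("p50:", percentile(samples, 50))
	fmt.Println("p90:", percentile(samples, 90))
}
```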
Flips the default candidate pool composition so per-scan-fresh IPs from the AWS CloudFront prefix list and DNS-resolved Akamai edges are the primary discovery source, with the pre-resolved IPs in fronted.yaml.gz reduced to opt-in via KnownSample > 0.

Why: the YAML's pre-resolved IPs are the same baked list every user gets and don't move per (ISP, location, time-of-day). The raw-range feeders self-heal as CDN edges rotate and produce per-user-fresh candidates, matching Samim Mirhosseini's observation that the working fronts vary across all three dimensions.

BuildPool semantic change: KnownSample <= 0 now skips the known feeder entirely (previously it meant "use all known"). Callers explicitly opt in by passing KnownSample > 0.

Provider defaults: KnownSample removed from defaults() (defaults to 0 → skip), CloudFrontSample=30, AkamaiSample=3 (4 hostnames after adding canonical → typically ~8 unique IPs after DNS dedup).

Re-ran the live timing benchmark with new defaults from a US dev network against the production fronted.yaml.gz:
- pool: 38 candidates (30 CloudFront-raw + 8 Akamai-DNS-resolved)
- time to first working front: 154ms (was 205ms)
- scan complete: 8/38 working in 10.7s
- akamai: 8/8 working (100%)
- cloudfront: 0/30 working (0%); stale testurl in YAML
- per-probe latency: p50=244ms p90=292ms min=154ms

Tail latency tightened (p90 1.47s → 292ms) because the working pool is now uniformly fresh rather than mixing pre-resolved IPs of varying age. CloudFront's 0% is a fixable production deployment issue (fresh globally-served distribution), not a discovery flaw. Sub-200ms time-to-first-working means cold-boot clients have a working front before the user notices.
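The new feeder gating can be illustrated with a small sketch in which every feeder, including the known one, is skipped when its sample size is <= 0. The types `candidate`, `poolConfig`, `feeder`, and the `fakeFeeder` helper are hypothetical; the real BuildPool signature is not shown in this PR text.

```go
package main

import "fmt"

// candidate and poolConfig are hypothetical stand-ins for the scanner types.
type candidate struct{ Feeder, IP string }

type poolConfig struct {
	KnownSample      int // <= 0 skips the fronted.yaml.gz feeder (new semantics)
	CloudFrontSample int
	AkamaiSample     int
}

// feeder returns up to n candidates from one discovery source.
type feeder func(n int) []candidate

// buildPool composes the pool from the feeders whose sample size is positive.
func buildPool(cfg poolConfig, known, cloudfront, akamai feeder) []candidate {
	feeders := []struct {
		n    int
		draw feeder
	}{
		{cfg.CloudFrontSample, cloudfront},
		{cfg.AkamaiSample, akamai},
		{cfg.KnownSample, known},
	}
	var pool []candidate
	for _, f := range feeders {
		if f.n <= 0 || f.draw == nil {
			continue // feeder disabled
		}
		pool = append(pool, f.draw(f.n)...)
	}
	return pool
}

// fakeFeeder fabricates n placeholder candidates from the documentation
// range 192.0.2.0/24; a real feeder would sample prefixes or resolve DNS.
func fakeFeeder(name string) feeder {
	return func(n int) []candidate {
		out := make([]candidate, n)
		for i := range out {
			out[i] = candidate{Feeder: name, IP: fmt.Sprintf("192.0.2.%d", i+1)}
		}
		return out
	}
}

func main() {
	cfg := poolConfig{CloudFrontSample: 30, AkamaiSample: 3} // KnownSample left at 0
	pool := buildPool(cfg, fakeFeeder("known"), fakeFeeder("cloudfront"), fakeFeeder("akamai"))
	fmt.Println("candidates:", len(pool))
}
```

With the fake feeders each sample size maps one-to-one to candidates; in the PR an AkamaiSample of 3 typically expands to about 8 IPs after DNS resolution, which this sketch does not model.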
http.Transport routes via DialTLSContext (our pre-opened fronted TLS conn) only for https URLs. With an http:// TestURL the request fell through to plain DNS + port 80, bypassing the front entirely — every probe was effectively a direct-DNS plaintext request to the inner hostname instead of a fronted request via the chosen CDN edge. Akamai's TestURL in fronted.yaml.gz is https:// so its probes were fine; CloudFront's is http:// so its probes were structurally broken. The fix surfaces a separate finding: even with probes routed correctly, CloudFront returns HTTP 421 "Misdirected Request" for every (random IP × masquerade SNI) pair AND for every pre-validated pair in fronted.yaml.gz. AWS now strictly enforces SNI/Host match, killing the cross-distribution Host header routing technique our YAML attempts. CloudFront fronting via this scheme is not just stale data — it's structurally disabled at the AWS layer. Workable CloudFront fronting requires alternate-domain-names on the same distribution (outer SNI and inner Host both belong to one CloudFront distribution AWS owns the cert for), which is a different deployment than fronted.yaml.gz uses today. Tracking as follow-up.
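Since net/http's Transport consults DialTLSContext only for non-proxied HTTPS requests, one defensive option is to reject any non-https TestURL before probing. A minimal sketch, assuming a hypothetical `validateTestURL` helper (the PR's actual fix is not shown in this text):

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// errNotHTTPS marks a test URL that would bypass the fronted TLS connection.
var errNotHTTPS = errors.New("test URL must be https to go through DialTLSContext")

// validateTestURL (hypothetical helper) rejects URLs that http.Transport
// would serve via plain DNS + port 80 instead of the pre-opened fronted conn.
func validateTestURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse test URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("%w: got scheme %q", errNotHTTPS, u.Scheme)
	}
	if u.Host == "" {
		return errors.New("test URL has no host")
	}
	return nil
}

func main() {
	// example.com URLs are placeholders, not the TestURLs in fronted.yaml.gz.
	for _, raw := range []string{"https://example.com/ping", "http://example.com/ping"} {
		fmt.Printf("%s -> %v\n", raw, validateTestURL(raw))
	}
}
```

Failing fast here turns a silent direct request into an explicit probe error, which keeps the hit-rate numbers honest.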
CloudFront fronting works when the client sends no SNI extension and keeps the inner Host in the request. The TLS handshake completes with CloudFront's default *.cloudfront.net cert (or a customer cert pinned to that edge); CloudFront then routes by inner Host alone since no SNI claims a different distribution.

Sending a non-empty SNI triggered HTTP 421 "Misdirected Request" because CloudFront strictly enforces SNI/Host match, which is exactly the behavior the earlier 0% hit rate exposed. Production's fronted.yaml.gz CloudFront masquerades have always shipped with sni: "" for the same reason; the bug was in my scanner's CloudFrontCandidates setting SNI = masquerade-domain.

Two changes in CloudFrontCandidates:
- SNI: "" (was masquerade-domain): sidesteps 421 enforcement.
- VerifyHostname: InnerHost (was masquerade-domain): when no SNI is sent, CloudFront serves either the *.cloudfront.net default cert (which wildcards the inner Host) or a customer-pinned cert. Verifying against InnerHost filters to the former, where cross-distribution Host routing actually reaches our backend. Verifying against the masquerade-domain rejected the wildcard cert and lost the working cases.

Live-network results after the fix:
- CloudFront random sampling: 1-3/30 working (3-8%), was 0/30. The hit rate is structural (POP-vs-distribution coverage); each hit is an edge that genuinely routes to our distribution.
- Akamai: 100% unchanged.
- Time to first working front: 149ms.
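The "no SNI, verify against a chosen hostname" combination can be sketched with the standard library's crypto/tls. Note the swap: the scanner uses uTLS, while this example uses crypto/tls for illustration only. `frontedTLSConfig` and the hostname `inner.example.com` are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// frontedTLSConfig (hypothetical helper) builds a client config that sends
// sni in the ClientHello ("" means no SNI extension) while checking the
// server certificate against verifyHostname instead of the SNI.
// A nil roots pool falls back to the system roots.
func frontedTLSConfig(sni, verifyHostname string, roots *x509.CertPool) *tls.Config {
	return &tls.Config{
		ServerName: sni,
		// The built-in check would verify against ServerName, so it is
		// disabled here and replaced by VerifyConnection below.
		InsecureSkipVerify: true,
		VerifyConnection: func(cs tls.ConnectionState) error {
			if len(cs.PeerCertificates) == 0 {
				return errors.New("server presented no certificate")
			}
			opts := x509.VerifyOptions{
				Roots:         roots,
				DNSName:       verifyHostname,
				Intermediates: x509.NewCertPool(),
			}
			for _, cert := range cs.PeerCertificates[1:] {
				opts.Intermediates.AddCert(cert)
			}
			_, err := cs.PeerCertificates[0].Verify(opts)
			return err
		},
	}
}

func main() {
	// CloudFront-style candidate after the fix: empty SNI, verify the inner Host.
	cfg := frontedTLSConfig("", "inner.example.com", nil)
	fmt.Printf("SNI=%q customVerify=%v\n", cfg.ServerName, cfg.VerifyConnection != nil)
}
```

Setting InsecureSkipVerify here does not skip validation; it only hands the chain and hostname check to VerifyConnection, which fails the handshake when the certificate does not cover the verification hostname.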
Adds the radiance-side wiring that takes a FrontSpec list (from the fronted/scanner Service via kindling/meek.Provider) and turns it into a sing-box outbound the live tunnel can route through.

Two pieces:
1. kindling/meek.BuildOutbound(tag, url, fronts) constructs a sing-box O.Outbound with Type="meek" and a local MeekOutboundOptions struct whose JSON shape mirrors lantern-box/option.MeekOutboundOptions exactly. The local copy sidesteps the lantern-box version-coupling: lantern-box v0.0.82 doesn't have the meek outbound type registered, so we can't import lbO.MeekOutboundOptions today. Once the lantern-box bump lands the local copy + MeekOutboundType constant can be replaced one-for-one with the upstream symbols. Returns ok=false when fronts is empty so callers skip injection when the scanner hasn't produced anything yet.
2. vpn.BoxOptions gains an optional MeekOutbound *O.Outbound field. buildOptions injects it into Outbounds and appends its Tag to the selector tags list immediately after mergeAndCollectTags (and before the auto/manual selector outbounds are built) so the meek outbound participates in routing alongside API-supplied ones. Nil = no-op, no behavior change for callers that don't set it.

Until lantern-box's meek type is registered in radiance's pinned version, setting MeekOutbound is a no-op end-to-end: libbox will reject the unknown "meek" type at config unmarshal. The wiring is ready; activation flips when (a) lantern-box bumps and (b) the caller (whoever owns the VPNClient) populates MeekOutbound from a meek.Provider's FrontSpecs.

Tests: 2 new in kindling/meek (BuildOutbound empty-fronts/shape), 2 new in vpn (MeekInjection/MeekOmittedWhenNil) confirming the selector tag list includes the meek tag and Outbounds is augmented correctly.
Single source of truth for the meek-server URL the production wiring will dial through Akamai. End-to-end verified 2026-05-23: domain-fronted POST returns the echoed payload in ~470ms.
Draft — companion to lantern-box#TBD (meek outbound).
Summary
Adds a client-side probe-based scanner that turns the existing fronted.yaml.gz masquerades, plus opportunistic CloudFront-range samples and Akamai hostname-regex draws, into a ranked list of (IP, outer SNI, inner Host) tuples that work from this client's network position right now.
Why this can't be a server-curated list: censorship in IR moves faster than our config-push cadence, and per Samim Mirhosseini (developer behind patterniha/MITM-DomainFronting, in this Slack thread) the working fronts are "different for each person depending on what ISP they use, their location and time of day". The right discovery loop is per-client, not per-deploy.
Architecture
Layer 1: client-side probe (this PR)
Probe(ctx, Candidate, Options) Result performs the full check that any one front actually works for this client:
- TCP dial to Candidate.IPAddress:443 via the supplied Dialer
- uTLS handshake with ServerName = Candidate.SNI (empty = no SNI sent, Akamai style; non-empty = sent verbatim, CloudFront style)
- Certificate verification against Candidate.VerifyHostname
- HTTPS GET of Candidate.TestURL with Host: Candidate.InnerHost
Scan(ctx, []Candidate, Options) []Result runs Probe concurrently with a configurable budget.
Layer 2: candidate generation
Three feeders into the candidate pool, each suited to a different CDN's edge model:
- CandidatesFromConfig(*domainfront.Config) flattens the existing fronted.yaml.gz masquerades. Pre-validated (IP, SNI) pairs; this is the primary input.
- CloudFrontCandidates(n, snis, ...) for discovering CloudFront edges beyond the curated list. Embedded snapshot of AWS's 204 CloudFront IPv4 prefixes (cloudfront_prefixes.txt); weighted random sampling pairs IPs with caller-supplied outer SNIs. Expected hit rate is partial: each CloudFront edge serves a subset of distributions per POP, so the probe filters mismatches. Acceptable for discovery.
- AkamaiCandidates(ctx, hostnames, SystemResolver{}, ...) for discovering Akamai edges via the OS/ISP resolver. Critical: this is the correct path even in IR. The ISP returns real Akamai IPs (Akamai isn't blocked, hosts too much Iranian critical infra) and those IPs are geographically near the client's network. DoH endpoints themselves are blocked in IR. GenerateAkamaiHostnames(n) produces draws from a([1-9]|1[0-9])([0-9]{2})\.(dsc)?(b|d|g|g2|na|r|w7)\.akamai\.net, the same regex pattern shipped in Psiphon's server entries and adopted by MahsaNG / Shir-o-Khorshid. ~3,500 hostnames in the regex space, all resolve through the same Akamai general edge property.
For Akamai, VerifyHostname is always set to the canonical a248.e.akamai.net regardless of which regex hostname was used to discover the IP: the regex hostnames aren't in the cert's SAN list, but the edge's default cert always validates against a248.e.akamai.net. This was a non-obvious bug in the initial draft; live-network testing exposed it (cert-mismatch failures with regex-generated VerifyHostnames).
Sequence
sequenceDiagram
    participant App as radiance client
    participant Scn as fronted/scanner
    participant SYS as System Resolver (ISP)
    participant CDN as CDN edge (Akamai/CloudFront)
    Note over App: needs working front
    App->>Scn: Scan(candidates)
    par per candidate
        Scn->>SYS: LookupHost(a248.e.akamai.net) [Akamai feeder]
        SYS-->>Scn: real edge IPs
    end
    loop concurrent probes
        Scn->>CDN: TCP + uTLS(SNI=⟂ or masquerade)
        Scn->>CDN: GET TestURL with Host: api.iantem.io
        CDN-->>Scn: 200 OK / 403 / TLS-mismatch
    end
    Scn-->>App: RankWorking(), sorted by latency
Test coverage
22 unit tests, all green:
- CandidatesFromConfig flattening
- CloudFrontPrefixes weighted sampling
Plus opt-in (SCANNER_INTEGRATION=1) live-network tests:
- TestLive_AkamaiSystemResolver: ~100% hit rate (16/16 in latest run; IPs spanning 3 different POP clusters)
- TestLive_CloudFrontRandomIPs: diagnostic only; hit rate is partial as expected
- TestLive_CloudFrontKnownMasquerades: diagnostic; confirms how fronted.yaml.gz stale entries filter out
What's NOT in this PR
Integration with kindling/domainfront (refresh its working-pool from scanner output) or the lantern-box meek outbound (feed scanner-discovered fronts as Fronts config) happens separately.
Reference
🤖 Generated with Claude Code