
Add /load-shareurl proxy for hosts blocked on some school networks (#623)

Merged
jpolitz merged 4 commits into horizon from shareurl-proxy on Apr 30, 2026

Conversation

jpolitz (Member) commented Apr 28, 2026

Joe says:

We really want to make sure that shareurl works from raw.githubusercontent.com. Notably this proxies only that domain for now, specifically to make sure we support this common use case. In general I expect that less constrained users (e.g. ugrad) just fetch code happily with CORS from the browser in the long run.

This also comes with some improvements to proxying, extracted into a helper:

  • Real size limits
  • Streaming, not blocking waiting for the whole response
  • CSP headers so if someone tries to load something executable (e.g. in SVG) it really can't run
  • Some filtering on redirects to avoid weird cases of redirecting to 127.0.0.1 inside the server

Since we have existing proxying for images it's nice to improve this and have a good helper.
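
The redirect filtering mentioned above could be sketched roughly as follows. This is a minimal illustration, assuming hypothetical helper names; the PR itself relies on request-filtering-agent plus its own redirect checks rather than exactly this code.

```javascript
// Reject redirect targets that would point the server at itself or at
// private address space. Names and exact checks are illustrative.
function isPrivateHost(hostname) {
  if (hostname === 'localhost') return true;
  // IPv4 literals in 127/8, 10/8, 172.16/12, 192.168/16 stay inside the
  // server's network and must never be followed as redirect targets.
  const m = hostname.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (!m) return false;
  const a = Number(m[1]), b = Number(m[2]);
  return a === 127 || a === 10 ||
         (a === 172 && b >= 16 && b <= 31) ||
         (a === 192 && b === 168);
}

function isSafeRedirect(location, allowedHosts) {
  let url;
  try { url = new URL(location); } catch { return false; }
  if (url.protocol !== 'https:') return false;
  if (isPrivateHost(url.hostname)) return false;
  return allowedHosts.includes(url.hostname);
}
```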

Claude says (lightly edited by Joe):

Some classrooms will likely load curriculum via #shareurl from raw.githubusercontent.com on networks that block GitHub. The new /load-shareurl endpoint fetches an allowlisted URL on the browser's behalf and streams the response back, so those classrooms get a working path.

A small window.fetch shim in beforePyret.js routes any fetch to an allowlisted host through the proxy automatically. That covers every browser fetch path in one place: drive.js's makeUrlFile, the url/url-file import prefetches in cpo-main.js, and the Pyret runtime's F.fetch via cross-fetch.
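
The routing decision that shim makes might look like the sketch below. The endpoint's query-parameter name (`url`) and the wrapper shape are assumptions, not taken from the PR's diff.

```javascript
// Allowlist of hosts whose fetches go through the server proxy (assumed).
const PROXIED_HOSTS = ['raw.githubusercontent.com'];

// Rewrite allowlisted URLs to hit the /load-shareurl proxy endpoint;
// everything else passes through untouched.
function toProxyUrl(requestUrl) {
  let url;
  try { url = new URL(requestUrl); } catch { return requestUrl; }
  if (!PROXIED_HOSTS.includes(url.hostname)) { return requestUrl; }
  return '/load-shareurl?url=' + encodeURIComponent(url.href);
}

// The actual shim would then wrap window.fetch roughly like:
//   const realFetch = window.fetch;
//   window.fetch = (input, init) => realFetch(toProxyUrl(String(input)), init);
```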

/downloadImg moves to share the same streaming-fetch helper, picking up nosniff + sandbox response headers, an IP-filtering HTTP agent (request-filtering-agent), a response size cap, and a request timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jpolitz requested review from blerner and schanzer on April 28, 2026 04:52
blerner commented Apr 28, 2026

Seems reasonably straightforward.

Do we want to proxy raw.githubusercontent automatically, or only if it fails first?

Do you intend to have different timeouts in the two proxy usages? Or should those be made constants instead of magic numbers?

jpolitz commented Apr 28, 2026

I'll do some magic # fixup, that's good feedback.

Re: eager vs. fallback proxying – the problem with fallback is that in the failure case it can take a long time for the first request to fail on a school content filter. So the experience is lousy.

If we end up proxying a lot, we can do some heuristic thing where we try for 2s or something and give up and proxy if that doesn't work.

blerner commented Apr 28, 2026

Maybe we start proxying automatically, and then in the background, we try to request a specific file that we own, without a proxy: if it goes through, we can disable the proxy?

jpolitz commented Apr 28, 2026

Good point. Can even just do proxy/non-proxy in parallel on the first request, and then disable the proxy if the regular request succeeds.

If we add more domains that becomes a dictionary of "proxy-for" different ones. But that's not the worst thing ever, and would adapt to firewall rules at schools...

jpolitz and others added 3 commits April 28, 2026 06:51
Claude says (Joe felt little need to edit):

Replaces the four magic numbers in the /downloadImg and /load-shareurl
proxy call sites with named module-level constants. Bumps the image
proxy to 20 MB / 30 s so that fetches of phone-sized images and slow
Drive ?export= responses don't get cut off; /load-shareurl stays at
1 MB / 10 s since it's plaintext from a fast host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
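
The limits described in that commit message could be expressed as named constants like the following; the values come from the message, but the constant names are illustrative, not necessarily those used in the PR.

```javascript
// Limits for the two proxy call sites (values per the commit message;
// names are illustrative).
const IMAGE_PROXY_MAX_BYTES = 20 * 1024 * 1024;   // 20 MB for /downloadImg
const IMAGE_PROXY_TIMEOUT_MS = 30 * 1000;         // 30 s
const SHAREURL_PROXY_MAX_BYTES = 1 * 1024 * 1024; // 1 MB for /load-shareurl
const SHAREURL_PROXY_TIMEOUT_MS = 10 * 1000;      // 10 s
```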
Joe says:

Constraints:

- Don't want to proxy everything (be kind to our servers)
- Don't want to wait a long time for the direct request to fail (flaky network,
  weird school blocker)

So we kick off both the proxy and the direct request the first time we have a
proxy-able candidate.

- If the direct request succeeds, cancel the proxy request to CPO and do direct
  requests for that domain from this page
- If the direct request fails, use only proxy requests for that domain for this
  page

A direct request counts as success if it returns 200 OK and text/plain (making
sure we don't weirdly accept 200 OK with, like, an HTML login page because of a
school SSO portal)

I think this is a good balance of server load and letting working clients do
their thing.
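
That success check, as a sketch (the function name is invented; `res` is assumed to be a Fetch API Response):

```javascript
// A direct response "verifies" only if it is 200 OK with a text/plain
// media type, so a school SSO portal's 200 HTML login page doesn't pass.
function directResponseVerifies(res) {
  if (res.status !== 200) return false;
  const ctype = res.headers.get('content-type') || '';
  // Strip any '; charset=...' suffix before comparing the media type.
  return ctype.split(';')[0].trim().toLowerCase() === 'text/plain';
}
```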

Claude says:

Responding to Ben's PR feedback to do better than always-proxy. The
first fetch to an allowlisted host now runs direct + /load-shareurl in
parallel; whichever returns first that verifies (direct = 2xx
text/plain) is served to the caller, and the outcome of the direct
request locks per-host shouldProxy state for the rest of the page-load.
So unrestricted networks pay the proxy hop only once and rapidly switch
to direct; blocked networks see direct fail and switch to proxy.

Header-only verification (rather than body comparison) lets us abort
the in-flight proxy fetch as soon as direct is trusted, keeping server
load near zero on the healthy path. proxyStreamFetch now also tears
down the upstream connection when the client disconnects, so the abort
actually saves bandwidth on the server side too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joe says:

I think I constrained Claude so much in this session that it had to say things
I like below :-)

This is post-convo with Ben where we were still unhappy with (a) variable names
and (b) closed over mutable variables instead of declarative promise APIs.

Claude says:

Following a second pass of code review: the previous version used a
new Promise((resolve, reject) => { let gotSomeResponse = false; ... })
block with a flag tracking which side had won. That conflated two
concerns — "first success wins the response" and "abort proxy iff direct
beat it" — into one imperative state machine.

This pass separates them. The race is expressed as four composed
Promise primitives:

  - directP: direct fetch, gated on content-type verification (rejects
    if direct returned the wrong shape, so it drops out of the response
    race entirely).
  - shouldProxyPromise (Promise.race): false iff direct verified before
    the timeout; locks the per-host policy for the page-load.
  - directFinishedSuccessfullyAndFirstP (Promise.race): does direct
    settle before proxy, AND was that settlement a verified success?
    Only then do we proxyCtrl.abort(). This keeps the abort safe — we
    never abort once proxy has already returned headers, which would
    error the caller's body stream mid-read.
  - responsePromise (Promise.any): whichever of directP or proxyP
    fulfills first.
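
The decomposition above might be sketched like this. Promise names follow the commit message; `fetchDirect`, `fetchProxy`, `verifies`, and the timeout parameter are illustrative stand-ins for the real call sites, not the PR's actual code.

```javascript
// Race a direct fetch against a proxy fetch, with the abort-and-policy
// logic expressed as composed promises rather than a mutable flag.
function raceDirectAndProxy(fetchDirect, fetchProxy, verifies, timeoutMs) {
  const proxyCtrl = new AbortController();

  // directP: rejects when the direct response has the wrong shape, so an
  // unverified direct response drops out of the response race entirely.
  const directP = fetchDirect().then((res) => {
    if (!verifies(res)) throw new Error('direct response failed verification');
    return res;
  });

  // proxyP: the parallel proxy fetch, cancellable via proxyCtrl.
  const proxyP = fetchProxy(proxyCtrl.signal);

  // shouldProxyP: false iff direct verified before the timeout; this is
  // what locks the per-host policy for the rest of the page-load.
  const shouldProxyP = Promise.race([
    directP.then(() => false, () => true),
    new Promise((resolve) => setTimeout(() => resolve(true), timeoutMs)),
  ]);

  // Abort the proxy only when direct settled first AND that settlement was
  // a verified success; never abort after proxy has already won the race.
  const directFirstP = Promise.race([
    directP.then(() => true, () => false),
    proxyP.then(() => false, () => false),
  ]);
  directFirstP.then((directWonFirst) => {
    if (directWonFirst) proxyCtrl.abort();
  });

  // responseP: whichever of directP or proxyP fulfills first serves the caller.
  const responseP = Promise.any([directP, proxyP]);
  return { responseP, shouldProxyP };
}
```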

Tested both healthy and DevTools-blocked flows. Observed the timing
edge case where both proxy and direct return 200 within a few ms of
each other: shouldProxy still locks 'direct' as expected, since the
verdict is decoupled from whether the abort caught proxy in time.

Also adds a comment explaining the deliberate non-handling of caller-
provided init.signal: the signal overwrite means a caller aborting
won't cancel the proxy fetch, but the alternative (event-listener
forwarding or AbortSignal.any) is more complexity than the
not-quite-fully-aborted case justifies right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jpolitz jpolitz merged commit 5713053 into horizon Apr 30, 2026
2 checks passed