
Correct / real URLs should be enforced, to avoid breaking adblockers #551

Open
pes10k opened this issue Jan 28, 2020 · 43 comments

@pes10k

pes10k commented Jan 28, 2020

Currently there is no enforced relationship between the URL used to look up resources in the package and where the resource came from online. Consistent URLs are an imperfect but extremely useful signal for privacy-protecting tools (filter lists, adblockers, Disconnect, Firefox and Edge's built-in protections, Safe Browsing, etc.).

The current proposal would allow all WebPackage'd sites to circumvent all URL-based tools by simply randomizing URLs as a post-processing step in amppackager or similar. This could even be done per request, per page. Since URLs are effectively just indexes into the package (and not keys for decision making, caching, etc.), they can be changed arbitrarily w/o affecting how the package loads, while preventing the URL-based privacy-preserving tools from running.

A (partial) possible solution to the problem is to play a cut-and-choose, commitment-auditing style game with the URLs. At package time, the packager has to make commitments about which URL each resource came from, and the size, shape, etc., of the resource. These commitments can be made with / mixed into the URL of the page being packaged.

The client can then, w/ some probability, audit some number of the URLs in the package. If the commitments fail to verify, deterrent countermeasures can be taken against the packaging origin (e.g. a global, decaying block list of misbehaving packagers).
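
Roughly, the audit flow could look something like the sketch below (illustration only: the commitment format, the sampling rate, and the enforcement mechanism are all open questions, and none of these structures exist in the Web Packaging spec):

```ts
// Hypothetical sketch of the cut-and-choose / commitment-audit idea above.
import { createHash } from "node:crypto";

interface ResourceCommitment {
  claimedUrl: string;   // where the packager says the bytes came from
  bodyDigest: string;   // sha-256 of the packaged bytes
}

const sha256 = (data: string | Uint8Array) =>
  createHash("sha256").update(data).digest("hex");

// Packager side: commit to (URL, content) pairs when building the bundle.
function commit(url: string, body: Uint8Array): ResourceCommitment {
  return { claimedUrl: url, bodyDigest: sha256(body) };
}

// Client side: with some probability, re-fetch a sampled subset of claimed
// URLs and check that the live response matches the commitment.
async function audit(commitments: ResourceCommitment[],
                     sampleRate = 0.05): Promise<string[]> {
  const failures: string[] = [];
  for (const c of commitments) {
    if (Math.random() > sampleRate) continue;     // audit ~5% of entries
    const res = await fetch(c.claimedUrl);        // hit the "real" URL
    const body = new Uint8Array(await res.arrayBuffer());
    if (sha256(body) !== c.bodyDigest) failures.push(c.claimedUrl);
  }
  return failures; // non-empty => feed a decaying block list of bad packagers
}
```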

@jyasskin
Member

jyasskin commented Feb 5, 2020

Because this is an issue where the potential attackers may not have thought of all the attacks we want to defend against, I don't want to discuss this issue in public. I'm going to try to discuss it in https://github.com/WICG/webpackage/security/advisories/GHSA-g5qv-3cw4-38gv instead. Send me an email with an aspect of the problem that isn't yet discussed here in order to be added to that discussion.

@plehegar plehegar added the privacy-tracker Group bringing to attention of Privacy, or tracked by the Privacy Group but not needing response. label Feb 10, 2020
@pes10k
Author

pes10k commented Apr 1, 2020

Just wanted to check in on this: has anything changed / any updates?

@pes10k
Author

pes10k commented Apr 20, 2020

Copying comments over from the closed PR thread in #573, and editing slightly given the new context

In general, I'm happy to continue discussing point by point above, but let's not lose the forest for the trees. The general claim is that:

  1. consistent, descriptive URLs are useful for adblocking (edit, and content blocking in general)
  2. this proposal reduces the consistency and descriptiveness of URLs by changing them into arbitrary, opaque indexes into an archive.

Are we disagreeing about either of the above points?

Since rollup got mentioned above, it's a perfect example here. Before the rollup-and-the-like world, content blocking was ideal; URLs described (both conceptually and frequently) one resource, and the user agent could reason about each URL independently. Post-rollup, URLs are less useful (though not useless), since JS URLs now often describe many resources that are increasingly difficult for the UA to reason about individually (ongoing research here, etc.). Such URLs represent multiple interests the user will often feel differently about, but about which UAs are (generally) forced into an all-or-nothing position.

This proposal does the same thing, but for websites entirely! The UA effectively gets just one URL to reason about (the entire web package), but loses the ability to reason about sub-resources. This is very (very!) bad if we intend the web to be an open, transparent, user-first system!

Okie, now, replying to individual points, but eager to not lose sight of the above big picture…

@jyasskin

#573 (comment) is wrong about the performance implications

This is not correct. It's partially correct in V8, because in some cases V8 will defer the parsing of function bodies, but (i) even then there are exceptions, and (ii) I have even less familiarity with how other JS engines do this. I know that, for example, SpiderMonkey does not defer parsing in cases where V8 will (e.g. JS in HTML attributes, onclick=X), but I don't have enough information to say in general (and I know even less about JavaScriptCore). But the point is:

  1. there is in all cases some difference, because there is at least some additional parsing and executing going on
  2. there may be significant differences on other platforms
  3. caching makes all this even more variable, as platforms may differ on how and when they cache inline script
  4. none of this difference hangs on standards-defined behavior, and so is not a sound basis for this standard to rely on

@twifkak

can't the site choose not to include the 3p script in the bundle

Sure, a site could choose this, but I'm not sure I follow the point. My point isn't that sites have to evade content blockers in the proposal, it's that it gives them new options to circumvent the user's goals / aims / wishes.

As for collisions between bundles and the unbundled web…

Again, I'm not sure I follow you here. My point is that it'd be simple to change URLs during "bundling" so that they're (i) impossible for content blockers to reason about, and (ii) guaranteed not to collide with real-world URLs. Say, every bundled resource has its URL changed to a random 256-character domain and path.
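
As a sketch of how little work that post-processing step would be (hypothetical code, not anything in amppackager; the Resource shape and the reference-rewriting step are placeholders):

```ts
// Hypothetical post-processing pass over a bundle's contents. The point is
// only that the mapping is arbitrary and can differ per bundle or per request.
import { randomBytes } from "node:crypto";

interface Resource { url: string; body: Uint8Array; }

const randomLabel = (len: number) =>
  randomBytes(len).toString("hex").slice(0, len);

function obfuscateUrls(resources: Resource[]): Map<string, string> {
  const mapping = new Map<string, string>(); // original URL -> opaque URL
  for (const r of resources) {
    const opaque = `https://${randomLabel(63)}.example/${randomLabel(180)}`;
    mapping.set(r.url, opaque);
    r.url = opaque;
  }
  // A real pass would also rewrite references inside HTML/CSS/JS bodies
  // using `mapping`; omitted here.
  return mapping;
}
```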

My example involves changes that would mostly be internal to the CMS, and hence the cost amortized across its customers

Needing to update the large number of existing CMSes seems like a perfect example of why this is difficult for sites! Let alone other costs (losing cache, paying an extra network request in your hash-guessing scheme and, on some platforms, an OS thread or process, making static sites unworkable, etc. etc. etc.).

TL;DR as much as possible,

  1. Yes, URLs can be opaque on the web now
  2. they are nevertheless still useful (see Google Safe Browsing, EasyList, Disconnect, caching policies, etc. etc. etc.)
  3. the claim isn't that this proposal does something to fundamentally change URLs, it's that it (i) takes something that is expensive but possible for the server to do now, and makes it free and trivial, and (ii) makes (for packages) the entire package into a single yes / no decision for the UA, where before the UA had far more information and ability to choose / advocate on behalf of the user

@twifkak
Collaborator

twifkak commented Apr 20, 2020

  1. this proposal reduces the consistency and descriptiveness of URLs by changing them into arbitrary, opaque indexes into an archive.

"Disagreeing" implies 100% confidence to me, so let's just say I'm skeptical of this. I'm certainly counter-arguing the point.

This proposal does the same thing [as rollup]

Except that it establishes distinct boundaries between resources. It may be easier to detect matching JS subresources inside a bundle than inside a rollup, since they are distinct and the publisher has less incentive to mangle them (no need to avoid JS global namespace conflicts).

It could include rollup'd JS payloads that contain a mix of 1p and 3p content, but I don't see how doing so helps evade detection over unbundled rollup.

can't the site choose not to include the 3p script in the bundle

Sure, a site could choose this, but I'm not sure I follow the point. My point isn't that sites have to evade content blockers in the proposal, it's that it gives them new options to circumvent the user's goals / aims / wishes.

I was responding to your comment "web bundles give sites a new way of delivering code to users... in a way that has zero additional marginal cost... since the code is already delivered / downloaded as part of the bundle, there is no additional cost to making it an async request vs inlining it".

We're somewhat in subjective space here, but I'd argue that the apples-to-apples comparison is:

  • unbundled: link to 3p script vs rollup or inline or 1p mirror
  • bundled: link to 3p script vs rollup or inline or 1p mirror or bundle

Regarding bytes delivered over the network from edge server to browser, bundling doesn't appear to change the cost relative to baseline. I haven't thought through bytes at rest, or between various layers of serving hierarchy. I wonder the degree to which such a cost is the limiting factor right now.

It does offer another option for 1p-ifying the script in order to evade detection, but one that doesn't seem to offer the site any reduced marginal cost.

As for collisions between bundles and the unbundled web…

Again, I'm not sure I follow you here. My point is that it'd be simple to change URLs during "bundling" so that they're (i) impossible for content blockers to reason about, and (ii) guaranteed not to collide with real-world URLs. Say, every bundled resource has its URL changed to a random 256-character domain and path.

I was arguing that it might not be so simple, depending on the circumstances. HTTP cache and ServiceWorker might offer spaces for collision between bundled URLs and unbundled URLs. Thus, making random paths undetectable seems similarly hard in both the unbundled and bundled world.

Your point about random domains is interesting. In order to be undetectable, the random domains and paths have to look real. Given that servers may vary their responses to different requestors, it's impossible to know that a real-looking 3p URL doesn't name a real resource (and that it won't over the length of the bundled resource's lifetime). However, it's probably sufficient to assert no collision on the (e.g. double- or triple-keyed) cache key. A bundle generator needs only an avoid-list of 3p URLs the site uses. I'm not sure if this allows an easier implementation than my proposed path randomizer, though.

Needing to update the large number of existing CMSes seems like a perfect example of why this is difficult for sites!

Isn't an update also necessary for adding bundle support?

Let alone other costs (losing cache, paying an extra network request in your hash-guessing scheme and, on some platforms, an OS thread or process, making static sites unworkable, etc. etc. etc.).

In these aspects, it would be interesting to compare the costs between bundled and unbundled blocklist-avoidance in more detail.

@pes10k
Author

pes10k commented Apr 21, 2020

It could include rollup'd JS payloads that contain a mix of 1p and 3p content, but I don't see how doing so helps evade detection over unbundled rollup.

There are big differences. You can't roll up 3p scripts (easily), and you can't roll up the other kinds of resources folks might want to block (images, videos, etc etc).

I didn't mean to suggest that this is just like rollup on a technical level; only that it further turns websites into black boxes that UAs can't be selective about or advocate for the user within, and in that way it is similar to rollup.

It does offer another option for 1p-ifying the script in order to evade detection, but one that doesn't seem to offer the site any reduced marginal cost.

The difference here is that to get the kind of evasion you can get in a web bundle, you'd need to roll it into an existing script, inline the code, or pull it into a 1p URL (which could itself be targeted by filter lists, etc.). In a WebBundle world, the bundler has the best option for evading, without having to do the more difficult work (i.e. zero marginal cost).

I'm fine saying "small marginal cost" if that gets us past this point, but the general point is that sites get new evasion capabilities at little to no cost.

A bundle generator need only an avoid-list of 3p URLs the site uses. I'm not sure if this allows an easier implementation than my proposed path randomizer, though.

It's easier because

  1. You only have to write the evasion in one place (the bundler) instead of changing every app on the web!
  2. You get the evasion / opaque URL without needing to play network guessing games, etc., which can be expensive (think Drupal plus Apache!) or straight-up impossible (static site generators)

Isn't an update also necessary for adding bundle support?

I can't see why. At least for sites where the bundle content is static (AMP-like pages), I have all the information I need to build the bundle just by pointing at an existing site / URL, with no changes to the CMS needed (you might want to add options for excluding certain domains, resources, etc., but that's all equally easy and do-once-for-the-whole-web).

@twifkak
Collaborator

twifkak commented Apr 21, 2020

Isn't an update also necessary for adding bundle support?

I can't see why. At least for sites where the bundle content is static (AMP-like pages), I have all the information I need to build the bundle just by pointing at an existing site / URL, with no changes to the CMS needed (you might want to add options for excluding certain domains, resources, etc., but that's all equally easy and do-once-for-the-whole-web).

By "I" in your second sentence, who do you mean? If "a distributor" or "the site's CDN", why would they limit such a technique to unsigned bundles? The argument that pulling content into a 1p URL is difficult enough to impede adoption doesn't seem to apply in this case. (edit: Likewise with a "2p" subdomain dedicated to mirroring content of a given 3p.)

The degree to which the HTML is amenable to static analysis seems to affect the feasibility of such an implementation, but not along the bundled-or-not axis.

@pes10k
Author

pes10k commented Apr 21, 2020

In that particular case, I just meant a site maintainer looking to create a web bundle. You were making the argument (if I understood correctly) that it would be the same amount of work to rewrite a CMS to create web bundles as it would be to rewrite a CMS to do other kinds of URL-based filtering evasion. My point was just that no CMS rewriting would be needed at all to web bundle. I can treat the server-side code as a black box, poke at it with automation, and create a bundle from the results (i.e. if I can create a record-replay-style HAR of the site, I can create a web bundle of the site).
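
As a sketch of that record-replay approach (the HAR reading follows the standard HAR 1.2 structure; the final bundle-writing step is left as a placeholder, since no particular bundler's API is assumed):

```ts
// Sketch only: turn a HAR capture into the (url, headers, body) triples a
// bundling tool would need. The actual bundle-writing step is a placeholder.
import { readFileSync } from "node:fs";

interface HarEntry {
  request: { url: string };
  response: {
    status: number;
    headers: { name: string; value: string }[];
    content: { text?: string; encoding?: string };
  };
}

function resourcesFromHar(path: string) {
  const har = JSON.parse(readFileSync(path, "utf8"));
  const entries: HarEntry[] = har.log.entries;
  return entries
    .filter(e => e.response.status === 200 && e.response.content.text !== undefined)
    .map(e => ({
      url: e.request.url,
      headers: Object.fromEntries(
        e.response.headers.map(h => [h.name.toLowerCase(), h.value])),
      body: e.response.content.encoding === "base64"
        ? Buffer.from(e.response.content.text!, "base64")
        : Buffer.from(e.response.content.text!, "utf8"),
    }));
}

// A bundling tool would then write these resources into a .wbn file
// (via whatever bundle-construction API or CLI it exposes).
```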

(I also think this is not the right way to think about the comparison, since rewriting the CMS is just one of many things you'd need to change to do filter list evasion: caching, performance concerns, etc etc etc)

@twifkak
Collaborator

twifkak commented Apr 21, 2020

Ah, I see. I lack sufficient awareness of sites and their owners to judge whether "update my CMS version" or "install an HTTP middlebox" is harder, on average (including auxiliary changes such as you mention, in both cases). I could see it going both ways.

@pes10k
Author

pes10k commented Apr 22, 2020

I don't think it's likely useful for us to speculate about which is easier in general, but at the least I think / hope we can agree on:

  1. Web packaging would allow new categories of folks to obscure URLs who currently can't, because they're not in a position to run middleware: static site generators, people on shared hosting that scrapes down to Apache and (S)FTP, people with heavy cache needs because of DDoS or other concerns, etc.
  2. It would allow them to do so w/o having to pay the costs they'd pay to do something similar on live websites (again caching, handling extra requests due to hash-misses in the scheme you proposed, etc)

@jyasskin
Member

I appreciate the focus on the high-level problem, but I need you to be precise about a single situation where web packaging would hurt ad blocking, so we can figure out if that's actually the case. If I answer the first situation, and your response is to bring up a second without acknowledging that you were wrong about the first, we're not going to make progress.

For example, take a static site running on apache with no interesting modules, where the author can run arbitrary tools to generate the files they then serve statically. That author wants to run fingerprint.js hosted by a CDN, but it's getting blocked by an ad blocker. So they download the script to their static site, naming it onIhE6oDT7A7LKUj.js so as not to be obvious about it, and refer to that instead. They lose caching on browsers without partitioned caches. Putting it in a web package doesn't get them that caching back, and might lose same-site caching.

So what's the most compelling situation where web packaging does help the author avoid an ad blocker?

@pes10k
Author

pes10k commented Apr 23, 2020

I appreciate the focus on the high-level problem, but I need you to be precise about a single situation where web packaging would hurt ad blocking, so we can figure out if that's actually the case. If I answer the first situation, and your response is to bring up a second without acknowledging that you were wrong about the first, we're not going to make progress.

Which point are you referring to? About defer? I referred to that at length above. What did I miss?

So they download the script to their static site, naming it onIhE6oDT7A7LKUj.js

I similarly feel like we've discussed this several times. It's bad enough to have to create a rule per site (static site copies the file locally and serves it from one or a fixed number of URLs), but with packaging you can easily create a new URL for the same resource per page (or even per bundle, or per request).

The point isn't that you can't do these things on the web today, it's that packaging makes them trivial and free to do. I really feel like this point has been made as well as it can be made, and that the gap between what's possible on the web today and what web packaging would make easy and free is large and self-evident.

I don't think arguing about this same point further is productive. If I haven't made the case already, more from me is not likely to be useful. If you're curious how other filter list maintainers or content blockers would feel about it, it'd be best to bring them back into the conversation.

@jyasskin
Member

Ok, so you're claiming it's difficult for the static site to have foo.html refer to onIhE6oDT7A7LKUj.js, but bar.html refer to GBtTuFJWnrLgXMs6.js with the same content? Why is that?

Having it different per request breaks your assumption that this is a static site running no interesting middleware, so can't happen.

@pes10k
Author

pes10k commented Apr 23, 2020

I'm saying that each step like that is additional work, all of which makes it more difficult for sites to do, and so less likely.

Again, you're arguing it's possible, and I'm happy to cede that; I'm arguing that your proposal makes it much easier. As evidence, you keep suggesting extra work sites could do (some easy, some costly) to get a weaker form of what your proposal gives them. That's making my point twice.

Put differently, there are expensive services sites subscribe to that use dynamic URL tricks to keep their unwanted resources from winding up on filter lists (as said before, Admiral is the highest-profile example, but not the only one). Your proposal gives a stronger ability to avoid content-blocking tools (or security and privacy tools like ITP, ETP, Disconnect) to all sites, for free.

Having it different per request breaks your assumption that this is a static site running no interesting middleware, so can't happen.

Like I said in #551 (comment), I can build a bundle by pointing a web crawler at my site (a la catapult or record-replay or anything else) and then turning the result into a bundle; there is no middleware needed.

Sincerely, I've explained these points fully and to the best of my ability. If there are new points of disagreement, let's move the conversation to those. Otherwise, I think we've hit a stalemate and it'd be best to either bring in other opinions from folks who have a strong interest in content blocking, and / or just move the disagreement to another forum (the larger web community, TAG, etc.).

@twifkak
Collaborator

twifkak commented Apr 23, 2020

I don't think it's likely useful for us to speculate about which is easier in general

Happy to discuss a different aspect. I think we got to this place because we were trying to address your earlier comment that:

this proposal reduces the consistency and descriptiveness of URLs by changing them into arbitrary, opaque indexes into an archive.

I think we agree that it changes URLs into arbitrary, opaque indexes only to the extent that they aren't already. Obviously they can be used that way today:

  1. URLs may be arbitrary indexes; my sha example tried to demonstrate something that looks as arbitrary as possible to the outside user while being stateless.
  2. Servers may even Vary by Referer, or by some server-side user fingerprint (e.g. IP plus low-precision timestamp plus header order). Given that, one could imagine a stateful implementation that generates entirely arbitrary URLs.

So it's more gray-area than that. It's about prevalence. That's where ease of adoption came into the discussion. If you think there are other axes that affect prevalence (e.g. ease of revenue generation), we should discuss those as well.

Still, it seems like your recent comment discusses ease/difficulty of adoption, so I'm guessing your comment was narrower in scope. You're just saying the relative ease of CMS upgrade vs gateway install is not worth discussing because not all site owners run CMSes. That's fair, but I think it should also be fair that "# of site owners who meet this constraint" is a relevant variable. For instance, "bundles make it easier to avoid adblockers when running a site in Unlambda" is uninteresting, unless it leads to a broader issue.

  1. Web packaging would allow new categories of folks to obscure URLs who currently can't, because they're not in a position to run middleware: static site generators, people on shared hosting that scrapes down to Apache and (S)FTP, people with heavy cache needs because of DDoS or other concerns, etc.

When you say "folks [who can't] run middleware", I assume you're not including commercial CDNs in your definition of middleware, but rather custom software running in their internal stack, even though "heavy cache needs" are usually met through the addition of edge infrastructure, usually provided by CDNs.

On the one hand, I think that's a limiting definition, because if this technique is profitable enough to become prevalent, then it's likely that either at least one CDN would add support for this, or at least one person would publish how to do it on existing edge compute services provided by popular CDNs.

On the other hand, I'll try to stick with the constraint. Both nginx and Apache provide support for custom error pages. So any URL that's not generated by the static site generator could serve fingerprint.js. (One might be able to further restrict this with an if directive on some header that distinguishes navigations from subresource fetches, so that users still see normal 404 pages.)

  1. It would allow them to do so w/o having to pay the costs they'd pay to do something similar on live websites (again caching, handling extra requests due to hash-misses in the scheme you proposed, etc)

I believe many popular CDNs provide some means of manipulating the cache key. For instance, here's Varnish. I'm not familiar enough with VCL to know how expressive that language is, so this is just speculation:

Make the cache key based on hash(url) mod N. Make N high enough to guarantee ~no collisions on real pages. Make N low enough that you can deliberately generate enough fake URLs that all have the same hash mod N.

If this sounds a bit like my old scheme, it's because I'm not very clever. Forgive the lack of creativity. :)
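
To make the bucketing idea concrete, here's the same scheme sketched in TypeScript instead of VCL (the constants and the cache-key hook are illustrative only; a real deployment would express this in whatever the CDN's config language allows):

```ts
// Sketch of the "cache key = hash(url) mod N" idea described above.
import { createHash } from "node:crypto";

// Tension noted above: N high enough to avoid collisions between real pages,
// low enough that decoy URLs in the same bucket are cheap to generate.
const N = 1 << 16;

function bucket(url: string): number {
  const digest = createHash("sha256").update(url).digest();
  return digest.readUInt32BE(0) % N;
}

// Generate decoy URLs that share a cache bucket with a real resource, so
// rotating among them never causes extra origin fetches.
function decoysFor(realUrl: string, count: number): string[] {
  const target = bucket(realUrl);
  const decoys: string[] = [];
  for (let i = 0; decoys.length < count; i++) {
    const candidate = `${new URL(realUrl).origin}/${i.toString(36)}.js`;
    if (bucket(candidate) === target) decoys.push(candidate);
  }
  return decoys;
}
```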

@jyasskin
Member

"For free" is not true since it's not like the bundling tool we write is going to rename subresources to obfuscated strings, but I agree that we need some other folks involved to figure out who's confused here. We'll send a TAG review soon, and the privacy consideration about this should help them know to think about it.

@KenjiBaheux
Collaborator

I'm sorry but when I read this:

I don't think its likely useful for us to speculate about which is easier in general, but at the least I think / hope we can agree on

and this:

Again, you're arguing its possible, im happy to ceed that; im arguing that your proposal makes it much easier.

... it seems hard to reconcile.

I may have missed it but was there a solid explanation, not a speculation or belief, of why it's "much easier" with the proposal vs. without?

@pes10k
Author

pes10k commented Apr 24, 2020

I'm going to attempt to summarize my case here one last time, and then stop, because I really think we're retreading the same points over and over. If the following doesn't express the concern, more words from me aren't going to help.

  1. The proposal turns the package and all the contents in the package into an all or nothing decision. It's trivially easy for the packager to randomize the URLs or otherwise strip all information out of the URL.
  2. Even worse, the bundler can change the URLs of the bundled resources to be ones that it knows won't be blocked, because they're needed for other sites to work. E.g. say I want to bundle example.org/index.html, which has some user-desirable code called users-love-this.js, and some code people definitely don't want called coin-miner.js. Assume filter lists make sure not to block the former, and intentionally block the latter. When I'm building my bundle, I can rename coin-miner.js to be users-love-this.js, while leaving the "real web" example.org/{users-love-this.js,coin-miner.js} resources unmodified. So it's worse than URLs having no information: URLs in bundles can have negative information; URLs can be misleading, by pointing to something different in the bundle than outside the bundle (or having the same URL point to different resources in different bundles). (See the sketch after this list.)
  3. All the above can be done treating the page / site being bundled as a black box.
  4. Sites can currently play games with URLs, but the non-web-bundle world constrains this in many ways. Current sites are constrained in how much URL obfuscation they can do by some combination of the below costs (partial list):
    (i) additional network requests (the hash guessing games mentioned above)
    (ii) maintaining additional state somewhere to map between the different names for the same file (on disk, in db, somewhere). This is especially difficult if you expect most of your requests to be fulfilled from CDNs / edge caches
    (iii) the high performance / caching / etc costs from inlining resources
    (iv) all sorts of costs from moving from static sites to dynamic sites
    (v) breaking existing references to existing pages
    (vi) sinking cache hit rates
    (vii) implementation complexity
  5. Even if all the costs described in point 4 went to zero, it would at the very least require modifications to the millions of sites / applications on the web. With web bundles, you just need to write one tool, once, to turn the entire web into arbitrary URLs
  6. Given you're going to build a web bundle out of a set of resources, obfuscating the URLs in the bundle has exactly the same cost / performance characteristics as honest URLs (i.e. zero marginal cost). This is not true of any of the suggested ways of obfuscating URLs in an application
  7. Examples that involve any level of cleverness between multiple parties (e.g. custom CDN modifications), increasing the number of non-cacheable responses, etc. are making my point. Much more of the web looks like "install WordPress on HostMonster, copy-paste your tracking script snippet into the template, done" than Google-level dynamic web applications that can reason differently about the core application, middleware, reverse proxies, and CDNs.
  8. Anyone who maintains a filter list or similar can easily testify that websites / trackers / etc. play these kinds of URL games all the time, and will do much more if given the opportunity (there are businesses that do exactly this as their entire business model). Things like those mentioned in point 4 are important, practical, real-world constraints that keep things from getting worse.
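
Here's the sketch referenced in point 2 (hypothetical bundler-side code; the BundledResource shape is a placeholder):

```ts
// Hypothetical bundler-side rename: store the blocked script's bytes under
// the allowed script's URL inside the bundle. Nothing on the live site
// (example.org/users-love-this.js, example.org/coin-miner.js) changes.
interface BundledResource { url: string; body: Uint8Array; }

function disguise(resources: BundledResource[],
                  blockedUrl: string,
                  allowedUrl: string): void {
  const blocked = resources.find(r => r.url === blockedUrl);
  if (!blocked) return;
  // Drop the real allowed resource from the bundle (or rename it elsewhere)
  // and hand its well-known URL to the blocked payload.
  const idx = resources.findIndex(r => r.url === allowedUrl);
  if (idx !== -1) resources.splice(idx, 1);
  blocked.url = allowedUrl; // filter lists now see only "users-love-this.js"
}
```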

So, I stand by my original claim: the proposal turns something that is currently possible (but constrained by the points in item 4) into something that can be done at no additional cost to the bundler (once the approach is implemented, once, in any bundling tool).

Put differently, put yourself in the shoes of someone who runs a Drupal or WordPress site on GoDaddy or Pantheon or WP Engine (so anything from cheap-as-possible hosting to real-money PaaS deployments). Which is the easier task:

  1. implementing and paying for a way to do URL obfuscation from your Drupal or WordPress application, or
  2. doing URL obfuscation when building a web bundle?

@jyasskin
Member

It's just fundamentally confused to write "With web bundles, you just need to write one tool once [to rewrite the source and destination of URLs]" but deny "With Wordpress, you just need to write one plugin once". I think other reviewers will understand that.

@pes10k
Author

pes10k commented Apr 24, 2020

That's not the claim at all, anywhere, @jyasskin.

The claim is that 1) there are more types of sites on the web than WordPress, 2) you would not do that in most WordPress applications because you need to cache everything aggressively in WordPress to keep it from falling over, and 3) if you did that in WordPress you would have all the costs mentioned in point 4.

@twifkak
Collaborator

twifkak commented Apr 24, 2020

i really think we're retreading the same points over and over.

I think there is a fair bit of repetition in the discussion, but some new stuff over time, too. I still believe that we could resolve this difference of opinion (either by convincing one of us, or revealing the underlying axiomatic difference), but yeah it would take a lot of time that maybe neither of us has.

FWIW, I don't think you're being disingenuous by not "acknowledging you were wrong" about anything. I think it's a classic problem of conversations (especially textual ones): whether to treat silence as concurrence. Whenever discussing, my primary goal is to convince myself (in any direction); thus, I'm happy to keep my prior beliefs until an update. As for convincing others, I believe they have to do the hard work of wanting to be convinced (in any direction), and from that will follow the right questions.

That said, I'll respect your decision to stop.

Just wanted to make one comment. :)

  1. Even if all the costs described in point 4 went to zero, it would at the very least require modifications to the millions of sites / applications on the web. With web bundles, you just need to write one tool, once, to turn the entire web into arbitrary URLs

I focused on a non-middleware solution in my last comment because you had suggested that as the case to focus on in your previous comment.

I also believe it would be feasible to write a gateway that randomizes unbundled URLs (rewriting HTML minimally) using a stateless method as I proposed earlier. It might even be possible to serve the unwanted JS at URLs that are otherwise used for navigational HTML, by varying on headers. A quick inspection of DevTools shows that Chrome varies Accept and Upgrade-Insecure-Requests between navigation and subresource requests. AFAICT, this could be deployed in one place, as an edge worker, and thus have minimal impact on cache hit ratio.
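
A rough sketch of that gateway, written against the standard Fetch API types (the secret, the path scheme, and the use of Sec-Fetch-Dest as the distinguishing header are all assumptions for illustration, not a worked-out design):

```ts
// Sketch of a stateless edge gateway: requests that look like subresource
// fetches and whose path matches an HMAC of the current day get served the
// "unwanted" script; everything else falls through to the origin.
import { createHmac } from "node:crypto";

const SECRET = "rotate-me";                 // hypothetical shared secret
const trackerBody = "/* contents of the script being hidden */";

function expectedPath(): string {
  const day = new Date().toISOString().slice(0, 10); // rotates daily
  return "/" + createHmac("sha256", SECRET).update(day).digest("hex") + ".js";
}

async function handle(req: Request): Promise<Response> {
  const url = new URL(req.url);
  const isSubresource = req.headers.get("sec-fetch-dest") === "script";
  if (isSubresource && url.pathname === expectedPath()) {
    return new Response(trackerBody,
      { headers: { "content-type": "text/javascript" } });
  }
  return fetch(req); // pass through to the origin / static files
}
```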

@mikesherov

mikesherov commented Aug 26, 2020

Leaving my comment here, as a follow up to my tweet, at the risk of rehashing points already made here.

I do believe WebBundles and Signed Exchanges are a net positive, but it's important to discuss the tradeoffs. The crux of the argument in this issue in favor of Signed Exchanges / WebBundles is "this is not a new threat". @pes10k has been making the argument "while it's not a new kind of threat, its feasibility is dramatically increased once you build a web standard that allows the threat". His arguments ultimately boil down to: literal economic cost, and universality of the exploit.

  1. economic cost: while the cost of servers is rapidly approaching zero, it's still a non-zero cost to run a reverse proxy somewhere. @slightlyoff dismisses this on twitter as a "20 line cloudflare worker" to underscore the ease of implementation, but it ignores the non-zero cost associated with doing that... including having to give CF a credit card. Yes, free tier server hosting also exists, but the limits of free servers are still much much lower than limits of free static hosting. I can provide many examples of this, but I don't want to belabor the point. It should suffice to say that moving the exploit client side opens up a much wider distribution surface.

  2. Universality of exploit: by allowing for this exploit in a web standard, we're canonicalizing the machinery required to weaken the Same-Origin Policy, and giving the entire web a single way to do this. Yes, WordPress powers a huge swath of the net, but there is no single language / implementation that is a web-standard backend language. What that means in practice today is that if you want to install anti-adblock tech for your site, you need to install a WordPress plugin, or a Drupal plugin, or a 20-line Cloudflare worker, or a Lambda@Edge function, or a.... you get the point. By building a web standard that is essentially a client-side reverse proxy, we're creating a single surface area in a standardized set of technologies. This means a single piece of code can be written that handles this for all cases.... because the web makes it easy and standard!

It's worth noting that in either case, URL allowlists/blocklists aren't a great way to block ads and tracking anyway, but we can't ignore that it's all that is still available to Chrome extension developers as of Manifest V3, which removed more powerful ad-blocking features for extension devs under the banner of performance. We should consider whether this proposal is yet another cut in the death by 1000 cuts of ad-blocking tech.

Ultimately I believe in WebBundles and Signed Exchanges, but we should not write this off as a non-concern simply because this is also exploitable server-side.

@jyasskin
Member

@mikesherov Thanks for chiming in! I think the part I'm missing is how the cost of dynamically generating bundles to avoid blocked URLs is less than the cost of dynamically picking which URLs to reply to.

If someone just replaces a static URL on a server with a (different) static URL inside a bundle, it seems straightforward for the URL blocker to block the one that's inside the bundle, so putting the bundles on a static host won't actually break URL-based blockers. And once you have to write code to dynamically generate the bundles, you're back at the economic cost and non-universality of the existing circumvention techniques.

@mikesherov

@mikesherov Thanks for chiming in! I think the part I'm missing is how the cost of dynamically generating bundles to avoid blocked URLs is less than the cost of dynamically picking which URLs to reply to.

@pes10k laid it out as such: "Even worse, the bundler can change the URLs of the bundled resources to be ones that it knows won't be blocked, because they're needed for other sites to work. E.g. say I want to bundle example.org/index.html, which has some user-desirable code called users-love-this.js, and some code people definitely don't want called coin-miner.js. Assume filter lists make sure not to block the former, and intentionally block the latter. When I'm building my bundle, I can rename coin-miner.js to be users-love-this.js, while leaving the "real web" example.org/{users-love-this.js,coin-miner.js} resources unmodified. So it's worse than URLs having no information: URLs in bundles can have negative information; URLs can be misleading, by pointing to something different in the bundle than outside the bundle (or having the same URL point to different resources in different bundles)."

If someone just replaces a static URL on a server with a (different) static URL inside a bundle, it seems straightforward for the URL blocker to block the one that's inside the bundle, so putting the bundles on a static host won't actually break URL-based blockers. And once you have to write code to dynamically generate the bundles, you're back at the economic cost and non-universality of the existing circumvention techniques.

"it seems straightforward for the URL blocker to block the one that's inside the bundle" I think this is the thing that remains to be seen. What would resolve this (for me at least), is a description and POC on how ad blockers that are chrome extensions with the limitations of Manifest V3 will be able to function in a post Signed Exchanges world. Perhaps I'm lacking imagination in the solution space, but I think this is where a lot of the questions come from.

@jyasskin
Member

Ah, I think I see. We're proposing a way to name resources inside of bundles (discussion on wpack@ietf.org), so if you have a bundle at https://example.org/tricky.wbn which contains https://example.org/users-love-this.js (with the content of coin-miner.js), the ad blocker could block that particular resource by naming package:https%3a%2f%2fexample.org%2ftricky.wbn$https%3a%2f%2fexample.org/users-love-this.js.

We'll have to make sure that Chrome Manifest V3 lets that block even "authoritative" subresources.

@jyasskin
Member

Also, Signed Exchanges contain just one resource, so the blocker would just block the SXG itself. Only bundles (possibly containing signed exchanges or signatures for groups of resources to make them authoritative) have this risk.

@kuro68k

kuro68k commented Aug 30, 2020

What about the user's bandwidth? Will ad-blockers be able to reliably prevent fetching giant-banner-ad-generator.js to save bandwidth? It sounds like the answer is no, because it will come as part of the bundle. Bandwidth saving is one of the major use cases for ad-blockers, especially on mobile.

@briankanderson

What about the user's bandwidth? Will ad-blockers be able to reliably prevent fetching giant-banner-ad-generator.js to save bandwidth? It sounds like the answer is no, because it will come as part of the bundle. Bandwidth saving is one of the major use cases for ad-blockers, especially on mobile.

This is critical for low-bandwidth/high-latency links as well (VSAT). I used to manage a handful-of-megabits sat connection for several hundred users, and the only way to actually make anything work was through extensive selective blocking (coupled with local caching). The approach being developed here would eliminate any possibility of this and have a huge impact on such users. When each 1 Mbps costs upwards of $10K USD PER MONTH, "just buying more bandwidth" isn't a solution. As such, it seems that this would mainly benefit those in the developed world and have very real consequences for those who are not.

I strongly suggest that the authors consider the impact of these "fringe" cases not as "fringe", but actually how the majority of the people in the world access and use the Internet.

@kuro68k

kuro68k commented Aug 30, 2020

For that matter, will it be compatible with browser settings such as "disable images"? The expectation is that images won't be downloaded. Or "disable JavaScript", for that matter.

@WICG WICG deleted a comment from tkocou Aug 30, 2020
@WICG WICG locked as too heated and limited conversation to collaborators Aug 30, 2020
@jyasskin
Member

Speculation about how this project is an evil plot belongs somewhere else. I'll reopen this issue tomorrow.

To the extent that sites have a local giant-banner-ad-generator.js today, instead of compiling it into the rest of their Javascript, they can equally well have a separate bundle for the ad-related things tomorrow. Doing so improves their user experience even for high-bandwidth users, since it improves caching and everyone's sensitive to loading latency.

The low-bandwidth/high-latency case is one of the core use cases for the overall web packaging project, but we need signing (or adoption) to let the local cache distribute trusted packages, in addition to the bundles discussed in this issue.

@WICG WICG unlocked this conversation Aug 31, 2020
@ron-wolf

ron-wolf commented Aug 31, 2020

Thread is long as hell, so I haven’t read it all; perhaps what I’m about to say has already been addressed. All I’ll say is I think there are other (possibly better) reasons to enforce single canonical resources for URLs, besides preserving ad-blocking functionality. In short: the scope of this issue is broader than its title suggests.

@jyasskin, thanks for unlocking the issue! I hope the discussion will be thoughtful and civil.

@jyasskin
Member

jyasskin commented Sep 1, 2020

@ron-wolf I think this thread has focused on the use case of blocking resources, rather than other reasons to encourage resources to live at just one URL or for each URL to have just one representation. I think it'll be easier to discuss your other use case(s) in a new issue, just to prevent them from getting lost in the noise here. Could you elaborate what use cases you're hoping to preserve, and how you see bundles causing problems for those use cases?

@KenjiBaheux
Collaborator

@jyasskin would you mind filing a new issue to discuss the bandwidth / disable X use case brought up by @briankanderson and @kuro68k ? This feels different enough from the original issue.

@jyasskin
Member

jyasskin commented Sep 1, 2020

@KenjiBaheux Done: #594.

@kuro68k

kuro68k commented Sep 1, 2020

@jyasskin I think anything which relies on the kindness of the site operator is a bad idea.

While it is possible to inline JS ads, it's unusual because of the way ad networks work and their desire for metrics.

@ocumo

ocumo commented Sep 1, 2020

I am a bit appalled.

The elephant in the room is the argument that, because evil exists and thrives anyway despite costs or existing countermeasures, the "obvious" solution is to go ahead and make it the standard.

In other words: because, say, theft will always happen no matter what society does to prevent it, society should make it legal, providing thieves with a set of free and standard tools and means to commit the perfect crime. That way society saves the hassle of investigations, arresting, judging, etc. On top of that, good citizens would also always find cool ways to use those tools too. Clever, huh?

Kind of akin to the freedom-of-gun-possession talk. Yeah, it'd be a cool thing to have brutally powerful automatic weapons cheaply available in the nearest news kiosk, bakery, or gas station. I would love to have one, to shoot at cans and impress girls with what a wonderful and heroic male I am for mating. On the other hand, all kinds of crooks, nuts, inferiority-complex wackos and other psychosocial lunatics, perverts, criminals and troglodytes would also love it. But not exactly to shoot at cans.

I watch the news. There is a reason why God didn't give donkeys horns.

As a developer, I am extremely excited to use this extremely cool technology. I can't wait to learn the (soon-to-be?) "new standard". And, boy, do I already have ideas of how to use it, dude!

As a responsible citizen and father, and as a simple user myself, I feel totally different, though. The geek kid inside me can't convince the mature man that worries about a rare, dying concept: ethics. Society can't be all about technology and code details, without the faintest high level concerns about boundaries, purposes and consequences framing it. Without that frame, all this is nothing but juvenile. Cool, but juvenile. Sorry.

That blurred and now very alien concept ("ethics") still matters to me, more than the excitement of debating geeky latency details or implementations of countermeasures against countermeasures of "clever" pieces of code.

In the contest of "who is more clever": the geek who wants to do a thing just because he can, vs. the geek who disagrees on details of the implementation, the winner is...

...definitely not society or civil rights, let alone ethics, decency or intelligence itself.

I trust that in Google there should certainly be senior management willing to look at this from high above the nuts and bolts, variables, arrays, classes, functions, if/then, semi-colons, O(N) and so forth, and give it a truly responsible thought, purpose and direction for the sake of the Internet for the Good of the Society.

@jyasskin jyasskin added the discuss Needs a verbal or face-to-face discussion label Oct 20, 2020
@johannesrld

Frankly this is going to ruin the web

@qertis

qertis commented Jan 27, 2021

We need to stop webpackage

@EternityForest

EternityForest commented Feb 16, 2021

@ocumo I very much respect your position on ethics, but with anything like this we need to look at the positives.

Signed bundles may give access to resources in places with no, or heavily censored internet. They may

When most people think privacy, they probably think PRISM at some point. Making it harder to track you may keep some commercial entities from tracking you, but it will not stop someone with full access to all the fiber and 83974 other ways to track you.

Privacy may be a human right, but it has to be balanced against what many consider the right to education and other benefits of the internet, and it has to be real privacy, of the kind that actually benefits people.

The increase in bandwidth is a very real concern. Advertisers can, and will, and do, crap down your connection.

So what if we just limit the scope of these things? Web bundles have some amazing possible benefits, but they mostly have to do with offline functionality, sharing them via email, and things like that.

Why do we need to allow a behind-the-scenes web bundle at all? Why can these not be under full and absolute user control?

Web bundles could be installable "apps". Browsable in a list, offline searchable, installable from file, exportable to file, or just viewable by double clicking the file just like .html.

This would provide the feeling of control you might hope to get from an online app, without allowing anyone to waste your bandwidth behind the scenes.

This would cover all the best use cases, and in fact cover them a little better, giving users explicit visibility into their apps.

Bundles could even have their own top level origin, based on a unique public key just for that bundle, allowing them to have a persistent identity across updates without needing to have a domain name to make one.

Another possibility is "Support Bundles", which are explicitly marked as being accessible from other pages. If a site wants to embed resources from a bundle, it says "Foo.com would like to use a resource pack. Need to download 50MB".

Some other site could come in and ask for resource pack support using a meta tag. The UA would then say "Foo.com would like to access your existing resource packs", allowing large libraries to be shared, while warning users about fingerprinting.

If the user accepts, the existing resource packs are used to serve any resources requested in the rest of that page.

If the user declines, those resources are fetched one-by-one as normal.

But this is ultimately still a small niche use case.

The real beauty of web bundles, I think, is in offline apps which a user can redistribute, back up, or delete, and in proposals like the self-modifying, PDF-forms-like features.

@kuro68k

kuro68k commented Feb 16, 2021

If we are talking about balancing privacy, then I'd make two points.

  1. People must be able to choose the balance they want, and not be penalised for it. That's actually the law in GDPR jurisdictions.

  2. The benefits of web bundles are so vanishingly small that it seems unlikely this trade off would be worth it in almost all cases.

By the way, the idea that this will be censorship-resistant is unfortunately misguided. I suggest speaking to some people living in such countries about it; you will find that op-sec is the issue, not access. Clearly an opaque bundle of data that the browser has more limited ability to pre-screen is not going to help.

@EternityForest

EternityForest commented Feb 16, 2021

@kuro68k Assuming Web Bundles are used in (what I consider to be) their most useful configuration, which is as manually downloaded, shared, or remixed bundles, I don't see how this in any way limits the browser's ability to filter anything.

They may be shared monolithically, but the browser is completely free to pretend that a certain resource does not exist. The individual requests in a bundle can still be discarded after download by the same blocker APIs.

I'm being generous and thinking from the perspective of a proper implementation, regardless of current proposals. Perhaps bad things can be done in some nonsense implementation where URLs are random and meaningless and can be freely changed regardless of the original source.

But in a properly designed system, web bundles are essentially just enabling a sneakernet or cache proxy based transport, that does the same thing HTTP does now. If adblock can't catch jdnekndiwjkfinrjrPrivacyViolator.js in a bundle, why would it catch exactly the same URL served over HTTPS?

The current proposal may have issues, what bothers me is that people seem to be more interested in killing the project entirely than fixing them.

There is of course still a bandwidth issue, many of us can't afford the data to download the bad content in the first place even if blockers make them inert.

But that issue largely goes away if we just disallow transparently navigating to a webbundle. If they are instead treated as installable apps, publishers won't want to hide content behind a manual download and install process any more than they would with current app store apps.

Bundles do seem problematic when used as originally proposed, downloading a bandwidth-wasting crapload in the background. But much less so when used as sharable offline apps, just like a lightweight APK.

@kuro68k

kuro68k commented Feb 16, 2021

Being able to email someone a browser exploit doesn't sound like a good idea. Or for that matter something that could reveal their true location, e.g. unmask their IP address using other APIs.

Think about how heavily HTML email is sanitised.

And really, what's the benefit to the user? It seems extremely small.

@EternityForest

EternityForest commented Feb 16, 2021 via email

@levicki

levicki commented Sep 1, 2021

@jyasskin

I would like to chime in with my 2 cents on the subject of WebBundles as a developer, webmaster, browser user, and browser extension user.

I will do so by asking a series of questions for you to kindly demonstrate how WebBundles would benefit us in each case.

How are bundles going to improve caching for large sites with dynamic content such as Facebook, LinkedIn, YouTube?

For example, I open my Facebook page and the browser receives a bundle. Seconds later, after dozens of comments and posted pictures of cats (not to mention different ads, news, and promoted posts) have been shown, is the bundle my browser received still a valid representation of that page, or will the site have to create a new bundle each time the content changes? If the bundles are supposed to be updated, what is the benefit of updating what basically amounts to an archive, compared to overwriting individually cached files?

Will it be possible for a browser to selectively download resources from the bundle?

The specification mentions random access and streaming, and I would like to understand this better. People developed a way to selectively download an individual file from a ZIP archive in order to avoid downloading the whole ZIP archive when they need just a single file from it. I would like to know whether the browser will be able to download, say, fun-page.html itself, jQuery.js, cute-cat-jumping.gif, and considerate-ad.jpg, but refuse to download a dozen copies of large-flashy-animated-ad.gif. If the answer is that you always have to download the whole bundle, then I am afraid you haven't properly considered the use case of people on metered and low-bandwidth connections.

For bundles appended to generic self-extracting executables will it be possible to have the executables digitally signed?

Specification says:

Recipients loading the bundle in a random-access context SHOULD start by reading the last 8 bytes and seeking backwards by that many bytes to find the start of the bundle, instead of assuming that the start of the file is also the start of the bundle. This allows the bundle to be appended to another format such as a generic self-extracting executable.

The digital signature for Windows PE executables works by appending the signing certificate at the end of the file. How is this appending of the bundle supposed to work without breaking the parsing specification or executable signing, or, worse yet, without encouraging distribution of web bundles with unsigned executables? I cannot comment on the ELF format, but for PE, the proper way to include a web bundle in an executable would be as a binary resource.

How can a browser receiving a bundle for a first time verify that a received bundle matches (and contains) what was requested?

From what I see in the specification, the primary URL and all metadata in the bundle are optional, and from a brief skimming I see no mechanism for ensuring bundle data integrity past the initial total length check. I also see no way of proving bundle origin. What would happen if, say, a rogue CDN re-bundled the original bundle by adding coin-miner.js and returned that?

How can a IDS/IPS/AV solution block malicious content in web bundles without blocking whole pages?

Currently, a FortiGate firewall with web filtering will intercept individual resource requests from a browser and block only the ones containing malicious content. This does not necessarily result in blocking of a whole web page. If I understand correctly, once there is a bundle, there will be just one request to the website, and if the bundle contains malicious data, said bundle will not be received at all because it will be blocked. Having to scan large bundles will also dramatically increase already high memory demand on web filtering hardware. Most of them work in proxy mode and they will have to receive the whole bundle before scanning and deciding whether to pass it on or block it.

Will ad-blockers still conserve user's bandwidth by blocking resources in web bundles?

Currently, ad-blockers such as uBlock Origin prevent loading of individual resources by the browser if the user deems that content undesirable. Blocking the individual fetches considerably increases page loading and rendering speed and conserves a lot of bandwidth. Many people use ad-blockers to conserve bandwidth on metered connections and speed up page loads on low bandwidth connections. How is that supposed to work with web bundles?

Can end-user still customize content received from a website?

I am an avid user of Stylus and TamperMonkey. These extensions work by allowing me either to amend CSS (for those sites that don't respect web accessibility guidelines), or to inject scripts to be executed to modify page look or behavior. How will those extensions work with web bundles?

I apologize in advance if some of those were already answered, and I am eagerly awaiting your response.
