[wasm] Set default for disableIntegrityCheck to true #102515

Closed
wants to merge 1 commit into from

Contributor

@kg kg commented May 21, 2024

From my profiling, in Firefox we spend ~11% of CPU time during startup in the browser's network stack, making copies of data coming from network/cache and hashing it to perform the integrity check. By disabling SRI all of this CPU usage goes away. The cost of this integrity check will scale as files get larger, so it is more significant for big binaries such as AOT'd applications.

It's hard to measure the actual impact in terms of wall-clock time, but from my understanding of how all this works, our .wasm modules and assemblies can't begin being processed until the integrity check has finished. That means the whole asset has to be loaded over the network and hashed in a blocking fashion before something like streaming wasm compilation can start. (It's possible they are doing some of this work in parallel, but ultimately the integrity check has to block execution.)

I don't have any visibility into what impact this has on Chrome because their profiler doesn't expose internals like Firefox's.

To make this mergeable we would need to enforce a set of rules to determine when it is a valid default. From my past discussions with experts, if the following rules hold:

  • Both the HTML origin and the subresources must be under the control of the same entity
  • The subresources are hosted in a location with security controls no looser than the HTML origin's

then it should be possible to safely disable SRI for assets served over HTTPS. Potential scenarios where SRI would matter, and my reasoning for why SRI isn't a meaningful improvement:

  • SRI hash verification protects against data corruption at rest and during transit
    • Other parts of the stack should already do this. Packets are checksummed at the transport level, and assets are typically served via gzip/brotli, which will amplify corruption to the point that the file will catastrophically fail to load.
    • In practice this would detect rare corruptions "earlier" and turn them into a specific type of error instead of a class of weird data corruption errors. I'm not sure this is worth the cost of potentially higher startup time and higher CPU usage.
    • How common are these kinds of file corruptions? I have in the past seen Cloudflare serve up corrupted bytes from their cache, but other than that I've never seen it.
  • SRI hash verification protects against man-in-the-middle attacks
    • I view this as fully theoretical. Anyone who can MITM the individual subresources can MITM the HTML file or the config that contains the hashes.
    • SRI seems aimed at protecting against things like malicious CDNs, so we would want to make the default SRI-on for scenarios like that.
  • Protection against intentional modifications at rest
    • Like the MITM scenario, any attacker able to edit assets for an application at rest could edit the HTML or the config containing the hashes too.

So I think the rules we would enforce should be:

  • All assets must come from the same origin as the host HTML file and configs
  • All assets must be served over HTTPS
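
The two rules above could be checked mechanically. A minimal sketch in TypeScript, assuming the loader knows the page URL and the list of asset URLs; the function name `canSkipIntegrityByDefault` is hypothetical and not part of the runtime:

```typescript
// Hypothetical helper, for illustration only (not an existing runtime API).
// Returns true only when the page and every asset are served over HTTPS
// and every asset shares the page's origin.
function canSkipIntegrityByDefault(pageUrl: string, assetUrls: string[]): boolean {
  const page = new URL(pageUrl);
  if (page.protocol !== "https:") {
    return false;
  }
  return assetUrls.every((raw) => {
    // Relative asset URLs are resolved against the page URL.
    const asset = new URL(raw, page.href);
    return asset.protocol === "https:" && asset.origin === page.origin;
  });
}
```

Under this sketch, a single cross-origin asset (for example one hosted on a CDN) keeps the SRI-on default for the whole application.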

It might also make sense to enable SRI for the very first page load, since that one is already slow and "corruption on the server or over the network" is more likely than "corruption at rest in the browser cache". But cold start time is also more important for conversion rates, so the cost of SRI on that first load might be exactly what matters to users.

@kg kg added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label May 21, 2024
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@javiercn
Member

@kg overall what you mention makes sense. However, I would add that we didn't use integrity for security purposes. As you mentioned, those things are covered by HTTPS, and SRI doesn't add anything on the same origin: if you are able to modify an asset in that context, you are also able to modify the code that makes the request with the integrity value.

The reason we used integrity was to break reliably on invalid caching. In many cases the app is sitting behind an HTTP proxy (corporate proxy, Cloudflare, etc.) and people will not correctly configure the caching rules in those situations.

We always request blazor.boot.json with the no-cache directive, but we don't/can't do the same for other assets without taking a perf hit, so having the integrity check is important to block the app from potentially loading a mix of DLLs and causing really hard-to-debug errors.
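
To illustrate the split described above, here is a rough sketch of how per-asset fetch options might be chosen. The `AssetRequest` shape and the `fetchOptionsFor` helper are hypothetical names used for illustration, and the sketch assumes the "no-cache directive" maps to the standard fetch `cache: "no-cache"` option; `integrity` is the standard fetch option that enables the SRI check.

```typescript
// Hypothetical asset description; the field names are illustrative only.
interface AssetRequest {
  url: string;
  hash?: string;          // SRI string such as "sha256-...", taken from the boot config
  isBootConfig?: boolean; // true for blazor.boot.json
}

// The subset of standard fetch() options this sketch uses.
interface FetchOptions {
  cache?: "no-cache";
  integrity?: string;
}

function fetchOptionsFor(asset: AssetRequest, disableIntegrityCheck: boolean): FetchOptions {
  if (asset.isBootConfig) {
    // Always revalidate the manifest so the hashes it lists are current.
    return { cache: "no-cache" };
  }
  if (disableIntegrityCheck || !asset.hash) {
    // Normal HTTP caching applies; a stale cached copy would go unnoticed.
    return {};
  }
  // With `integrity` set, the browser rejects a response whose digest does
  // not match, so a stale file fails the fetch instead of being loaded.
  return { integrity: asset.hash };
}
```

The result would be passed as the second argument to `fetch(asset.url, options)`.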

I would ask the following questions:

  > in Firefox we spend ~11% of CPU time during startup in their network stack

  • Does this mean we spend 11% of startup time, or just a percentage of what's considered download time?
  • While Firefox is a browser we support, Chrome and Safari have a much bigger market share. We should understand whether this impacts those, as otherwise we are not benefiting the majority of our customers.
  • We should be able to turn it on/off, perform a number of runs, and average the times to get some numbers on the actual impact, couldn't we? (I understand that there are many other factors that might impact the numbers here.)
    • If we are not able to see a clear difference, is it worth removing it? Especially if the difference isn't in Chrome/Edge/Safari, which account for 95%+ of our market share.

Now, with all this said, I'm not opposed to us removing this check. However, I think we should skip this check based on whether we are fingerprinting or not. That will still break things early if there are caching issues, and in practice will be equivalent to disabling it, since we are fingerprinting all assets by default.

Finally, some of this overhead might go away in the future or be required for other reasons (we start preloading things and the time spent on it becomes irrelevant, or we set up CSP in a way that requires including integrity on the requests).

So, in summary,

I'm OK if we do this automatically based on whether the app is fingerprinting its assets (with the option to turn the check on even in that case), but we should also understand if this is having a significant impact on the main browsers we target.
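
The rule proposed here can be sketched as a small decision function. Both field names (`assetsAreFingerprinted`, `forceIntegrityCheck`) are hypothetical; the actual build properties or loader options are not specified in this thread.

```typescript
// Hypothetical inputs, for illustration only.
interface IntegrityPolicyInput {
  assetsAreFingerprinted: boolean; // assumed to be known from the build
  forceIntegrityCheck?: boolean;   // assumed explicit opt-in override
}

// Returns true when the loader should still perform the SRI check.
function shouldCheckIntegrity(input: IntegrityPolicyInput): boolean {
  if (input.forceIntegrityCheck) {
    // The app author asked for the check even though assets are fingerprinted.
    return true;
  }
  // Fingerprinted names already change with content, so a stale cache entry
  // cannot be served under the new name; without them, keep the check.
  return !input.assetsAreFingerprinted;
}
```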

@kg
Contributor Author

kg commented Jul 19, 2024

> @kg overall what you mention makes sense. However, I would add that we didn't use integrity for security purposes. As you mentioned, those things are covered by HTTPS, and SRI doesn't add anything on the same origin: if you are able to modify an asset in that context, you are also able to modify the code that makes the request with the integrity value.
>
> The reason we used integrity was to break reliably on invalid caching. In many cases the app is sitting behind an HTTP proxy (corporate proxy, Cloudflare, etc.) and people will not correctly configure the caching rules in those situations.

Yeah, this is tough to solve. It's possible SRI is still the best real-world solution for it, but my hope is we can find something cheaper.

> We always request blazor.boot.json with the no-cache directive, but we don't/can't do the same for other assets without taking a perf hit, so having the integrity check is important to block the app from potentially loading a mix of DLLs and causing really hard-to-debug errors.
>
> I would ask the following questions:
>
> > in Firefox we spend ~11% of CPU time during startup in their network stack
>
> * Does this mean we spend 11% of startup time, or just a percentage of what's considered download time?

It's 11% of total CPU samples during the startup loop profile (100 warm starts with a 10ms delay between them). I don't know of any browser profiling tool that allows me to determine how many of those samples are on the 'hot path' (that is, blocking startup), but based on the stacks, the SRI hashing prevents bytes from flowing to the parts of the browser that would parse JS or decode WASM; it is functionally synchronous. However, on even mid-spec devices the CPU is not likely to be maxed out during asset loading, so it's possible that the CPU time spent hashing is "free". I don't know how to determine whether this is true.

> * While Firefox is a browser we support, Chrome and Safari have a much bigger market share. We should understand whether this impacts those, as otherwise we are not benefiting the majority of our customers.

I haven't figured out how to determine whether Chrome has the same limitations - it's possible it would show up in their low-level tracing. From my understanding of the spec though, it would have to, and it's just a question of whether they hide the latency introduced by the synchronous hashing through some cleverness.

> * We should be able to turn it on/off, perform a number of runs, and average the times to get some numbers on the actual impact, couldn't we? (I understand that there are many other factors that might impact the numbers here.)

I tried this, and the noise level in the measurements was too high for me to decide for sure whether it was an improvement. The CPU usage attributed to SRI in the profiles was gone, and from examining the stacks it also appeared that, at least in Firefox, the codepaths used without SRI are more efficient ones. I'm not sure how one would go about measuring this effectively - maybe on a low-spec device with CPU turbo & downclocking disabled?
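
One rough way to judge whether an on/off difference rises above the noise is to compare the gap between the two means against their combined standard error. The sketch below only illustrates that arithmetic (a Welch-style heuristic, not the method used for the profiles in this thread); the helper names and the default threshold of 2 are assumptions.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Squared standard error of the mean, using the sample variance (n - 1).
function squaredStandardError(xs: number[]): number {
  if (xs.length < 2) {
    throw new Error("need at least two samples");
  }
  const m = mean(xs);
  const sumSquares = xs.reduce((sum, x) => sum + (x - m) * (x - m), 0);
  return sumSquares / (xs.length - 1) / xs.length;
}

// True when the means differ by more than `threshold` combined standard errors.
function differenceExceedsNoise(sriOn: number[], sriOff: number[], threshold = 2): boolean {
  const gap = Math.abs(mean(sriOn) - mean(sriOff));
  const combined = Math.sqrt(squaredStandardError(sriOn) + squaredStandardError(sriOff));
  return gap > threshold * combined;
}
```

With startup times in milliseconds, `differenceExceedsNoise(timesWithSri, timesWithoutSri)` returning `false` would mean the runs cannot distinguish the two configurations, which matches the situation described above.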

>   * If we are not able to see a clear difference, is it worth removing it? Especially if the difference isn't in Chrome/Edge/Safari, which account for 95%+ of our market share.

If it has no impact on Chrome or Safari, I would say it's not worth it.

> Now, with all this said, I'm not opposed to us removing this check. However, I think we should skip this check based on whether we are fingerprinting or not. That will still break things early if there are caching issues, and in practice will be equivalent to disabling it, since we are fingerprinting all assets by default.

Yeah, that seems like a good way to decide on whether to do it. I agree that if we don't have something like fingerprinting to provide the same value as SRI, we can't disable SRI.

> Finally, some of this overhead might go away in the future or be required for other reasons (we start preloading things and the time spent on it becomes irrelevant, or we set up CSP in a way that requires including integrity on the requests).
>
> So, in summary,
>
> I'm OK if we do this automatically based on whether the app is fingerprinting its assets (with the option to turn the check on even in that case), but we should also understand if this is having a significant impact on the main browsers we target.

Agreed, we shouldn't do this without measurement data to support it. I've been unable to find a way to gather data for or against this that seems trustworthy.

Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

Labels
area-VM-meta-mono NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons)

2 participants