feat(core): persist asset fingerprinting cache#37822
Conversation
During some recent investigations package fingerprinting was found to do a bunch of duplicate work (mostly `stat`ting the same files over and over again). That and a number of minor tweaks bring the fingerprinting time of one specific directory down from ~19s to ~13s (about ~30% improvement), while maintaining byte-for-byte compatibility with the previous implementation. In a future change, we will investigate changes that are allowed to change the hash to improve the performance even more.
Asset fingerprinting is now at the highest possible speed, and yet it can still take a lot of time to fingerprint a large directory (~13s to do ~4k files on my machine). We already used to have an in-memory fingerprinting cache to speed up multiple fingerprints of the same files. This PR now persists that cache across executions, to bring the same speed to re-synths. The cache file itself has a maximum number of entries, and so does the CDK cache subdirectory (`~/.cdk/cache/fingerprints`) that holds the set of all possible fingerprints.
|
PRs without a linked issue will receive lower priority for review and merging. Please update the description to follow the PR template and include a line like |
…nto huijbers/asset-fingerprinting-cache
kumvprat
left a comment
There was a problem hiding this comment.
Curious to see the improvement with these new changes
Do we need to add new unit tests to test out the behaviour of this new directory based cached or the existing tests already cover this ?
| } | ||
| } | ||
| } catch { | ||
| // Cache file doesn't exist or is corrupt — start fresh |
There was a problem hiding this comment.
What happens here ? Should we delete the corrupt file if it exists so that the data can be generated fresh and cached on next save call ?
There was a problem hiding this comment.
That happens automatically. If the file is corrupt, then one the next save() we will overwrite it with a good file.
There was a problem hiding this comment.
Should we have some log lines that tell us about cache hit/misses ? If this already being logged inside the logic in fingerprint.ts it should be okay
There was a problem hiding this comment.
I don't really want to do this right now. There's no real place for that information to go at the moment. I definitely don't want to send it to stdout or stderr, because it will interfere with user-controlled app output. There is also no place yet in telemetry for us to send non-timing numbers.
Both of those seem too heavy a lift for the current PR. And ultimately we mostly care about the duration (which we will have numbers on) and I'm pretty confident that the duration is going to linearly correlate with cache hit rate.
So let me turn the question around: what future decisions are you thinking of making from that cache hit rate?
There was a problem hiding this comment.
The idea was to directly correlate any synth time improvements to this change, without logging or telemetry data how can we definitely attribute that this change led to improvements
There was a problem hiding this comment.
Asking again: what future decisions would we make based on that?
The only one I can see is "take the code out again". I suppose that is fair. We could take it out at any point if we want to, because it doesn't affect functionality.
If we ever think this code is producing more problems than it's worth maintaining, we can always instrument it then to see if it's pulling its weight.
Unfortunately, we will not have before/after telemetry to compare, so we can't see the impact of the change to pat ourselves on the back for a job well done. But we've run CDK for years without that kind of telemetry, and we've done an okay job, I would say. I think we'll manage.
In the mean time, we will be looking at the telemetry of synthesis times and duration hotspots and continuing to drive those down.
There was a problem hiding this comment.
My main concern here is not whether we congratulate ourselves for a job well done or badly done.
Like you said with telemetry we can determine if later we want to keep the code or not, my counter question would be : If this code is not pulling it's weight in near or long term future, why this optimisation is not working as we expect it to work? And the answer to that question is almost always hidden in the opaque implementation changes we do to cdk. Opaque not in the sense to customers but opaque to telemetry/metrics.
If a proper tracking mechanism via telemetry is harder to implement or beyond the scope of this PR, I agree maybe we tackle it together and have proper tracking mechanism so that we become pro-active in these kind of optimisation opportunities.
Again, nothing against the changes here but advocating for better instrumentation, that's all
My plan was indeed to say: the current existing tests and integ tests are exercising this code path, ensuring that it doesn't break any functionality. Otherwise, all tests I really want to add are end-to-end. I don't want to add mocks (or whatever) than confirm that "exactly this code path is followed", because those end up brittle against code changes. The real functionality we would test is "the second time you call So I opted for no test and manual confirmation. |
mrgrain
left a comment
There was a problem hiding this comment.
Help me understand this @rix0rrr
The large file fingerprint cache is effectively replacing hash(content) with a faster hash(inode+mtime+size). Previously this was kept in memory, to avoid re-caching large files for a second time in case we have lookups and the synth loop needs to run multiple time.
If I understand it correctly, this change seems to propose:
- we now cache the hash of all files
- we now persist this cache across multiple executions
Doesn't this effectively mean we will always use hash(inode+mtime+size)? Sure the first time we see a file we calculate it's hash, but after that we seem to not care anymore. Why not just switch to hash(inode+mtime+size) in general?
Well it shouldn't be that, so I sure hope that's not what accidentally happened 👀 . Let me double-check the code to be sure. We are still calculating and outputting the hash of the contents of the target file. What is supposed to be happening is that the key into the cache table is
Because that's not portable. The same file on a different computer would most likely have a different |
Make sense 👍🏻
I can tell you this: It's the standard way to quickly check if a file changed without hashing its content and when you don't care that much about accuracy (e.g. because a different slower process will catch a change later on). |
Well yeah, but you can also just compare In fact it wasn't |
|
Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
Merge Queue Status
This pull request spent 30 minutes 30 seconds in the queue, including 30 minutes 5 seconds running CI. Required conditions to merge
|
|
Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
|
Comments on closed issues and PRs are hard for our team to see. |
Asset fingerprinting is now at the highest possible speed (dominated by single-threaded reading of all files), and yet it can still take a lot of time to fingerprint a large directory (~13s to do ~37k files on my machine).
We already used to have an in-memory fingerprinting cache to speed up multiple fingerprints of the same files.
This PR now persists that cache across executions, to bring the same speed to re-synths, bringing the time down to ~3s (a ~75% reduction).
The cache file itself has a maximum number of entries, and so does the CDK cache subdirectory
(
~/.cdk/cache/fingerprints) that holds the set of all possible fingerprints.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license