feat(core): persist asset fingerprinting cache #37822

Merged
mergify[bot] merged 12 commits into main from huijbers/asset-fingerprinting-cache
May 13, 2026

Conversation

@rix0rrr
Contributor

@rix0rrr rix0rrr commented May 11, 2026

Asset fingerprinting is now at the highest possible speed (dominated by single-threaded reading of all files), and yet it can still take a lot of time to fingerprint a large directory (~13s to do ~37k files on my machine).

We already used to have an in-memory fingerprinting cache to speed up multiple fingerprints of the same files.

This PR now persists that cache across executions, to bring the same speed to re-synths, bringing the time down to ~3s (a ~75% reduction).

The cache file itself has a maximum number of entries, and so does the CDK cache subdirectory (~/.cdk/cache/fingerprints) that holds the set of all possible fingerprints.
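
A minimal sketch of what "a maximum number of entries" could look like, assuming a simple oldest-first eviction policy. The class name `BoundedCache`, the limit, and the eviction order are all hypothetical; the PR does not say which policy the real cache uses.

```typescript
// Hypothetical sketch: a fingerprint cache capped at a fixed number of entries.
// The name, the limit, and the oldest-first eviction order are assumptions.
class BoundedCache {
  private readonly entries = new Map<string, string>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): string | undefined {
    return this.entries.get(key);
  }

  set(key: string, value: string): void {
    // Deleting first makes a re-inserted key the newest in the Map's insertion order.
    this.entries.delete(key);
    this.entries.set(key, value);
    // Drop the oldest insertions until the table is back under the limit.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because an evicted entry only costs one extra read of the file the next time it is fingerprinted, a cap like this trades a little speed for a bounded file size.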


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

rix0rrr added 2 commits May 8, 2026 15:35
During some recent investigations package fingerprinting was found to do
a bunch of duplicate work (mostly `stat`ting the same files over and
over again).

That and a number of minor tweaks bring the fingerprinting time of one
specific directory down from ~19s to ~13s (a ~30% improvement),
while maintaining byte-for-byte compatibility with the previous
implementation.

In a future change, we will investigate changes that are allowed
to change the hash to improve the performance even more.

Asset fingerprinting is now at the highest possible speed, and yet it
can still take a lot of time to fingerprint a large directory (~13s
to do ~4k files on my machine).

We already used to have an in-memory fingerprinting cache to speed
up multiple fingerprints of the same files.

This PR now persists that cache across executions, to bring the same
speed to re-synths. The cache file itself has a maximum number of
entries, and so does the CDK cache subdirectory
(`~/.cdk/cache/fingerprints`) that holds the set of all possible
fingerprints.
@rix0rrr rix0rrr added the pr-linter/exempt-integ-test The PR linter will not require integ test changes label May 11, 2026
@github-actions github-actions Bot added the p2 label May 11, 2026
@mergify mergify Bot added the contribution/core This is a PR that came from AWS. label May 11, 2026
@mergify mergify Bot temporarily deployed to automation May 11, 2026 09:06 Inactive
@mergify mergify Bot temporarily deployed to automation May 11, 2026 09:06 Inactive
@github-actions
Contributor

github-actions Bot commented May 11, 2026

⚠️ This pull request description does not follow the correct template structure.

PRs without a linked issue will receive lower priority for review and merging. Please update the description to follow the PR template and include a line like Closes #123 in the Issue section. If no existing issue matches your change, create one first.

Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

(This review is outdated)

@rix0rrr rix0rrr added pr-linter/exempt-readme The PR linter will not require README changes pr-linter/exempt-test The PR linter will not require test changes labels May 11, 2026
@rix0rrr rix0rrr changed the title feat: persist asset fingerprinting cache feat(core): persist asset fingerprinting cache May 13, 2026
Contributor

@kumvprat kumvprat left a comment

Curious to see the improvement with these new changes

Do we need to add new unit tests to test the behaviour of this new directory-based cache, or do the existing tests already cover this?

Comment thread packages/aws-cdk-lib/core/lib/fs/fingerprint-disk-cache.ts
}
}
} catch {
// Cache file doesn't exist or is corrupt — start fresh
Contributor

What happens here? Should we delete the corrupt file if it exists, so that the data can be generated fresh and cached on the next save call?

Contributor Author

@rix0rrr rix0rrr May 13, 2026

That happens automatically. If the file is corrupt, then on the next save() we will overwrite it with a good file.
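
The behaviour described here can be sketched as follows. This is an illustration under stated assumptions, not the code from `fingerprint-disk-cache.ts`: the function names `loadCache` and `saveCache`, the `CacheTable` type, and the JSON-object file format are hypothetical.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical on-disk format: one JSON object mapping cache key -> content hash.
type CacheTable = Record<string, string>;

function loadCache(file: string): CacheTable {
  try {
    const parsed: unknown = JSON.parse(fs.readFileSync(file, 'utf-8'));
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      return parsed as CacheTable;
    }
  } catch {
    // Missing or corrupt file: fall through and start with an empty table.
  }
  return {};
}

function saveCache(file: string, table: CacheTable): void {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  // A plain overwrite replaces whatever was there, including a corrupt file.
  fs.writeFileSync(file, JSON.stringify(table), 'utf-8');
}

// Usage: a corrupt file is ignored on load and replaced on the next save.
const cacheFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'fp-cache-')), 'cache.json');
fs.writeFileSync(cacheFile, '{not json');
const table = loadCache(cacheFile); // empty table, because parsing failed
table['some-key'] = 'some-hash';
saveCache(cacheFile, table);
```

No explicit delete is needed in this sketch: the load path never trusts a bad file, and the save path never appends to one.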

Contributor

Should we have some log lines that tell us about cache hits/misses? If this is already being logged inside the logic in fingerprint.ts, it should be okay.

Contributor Author

I don't really want to do this right now. There's no real place for that information to go at the moment. I definitely don't want to send it to stdout or stderr, because it will interfere with user-controlled app output. There is also no place yet in telemetry for us to send non-timing numbers.

Both of those seem too heavy a lift for the current PR. And ultimately we mostly care about the duration (which we will have numbers on) and I'm pretty confident that the duration is going to linearly correlate with cache hit rate.

So let me turn the question around: what future decisions are you thinking of making from that cache hit rate?

Contributor

The idea was to directly correlate any synth time improvements to this change. Without logging or telemetry data, how can we definitively attribute improvements to this change?

Contributor Author

Asking again: what future decisions would we make based on that?

The only one I can see is "take the code out again". I suppose that is fair. We could take it out at any point if we want to, because it doesn't affect functionality.

If we ever think this code is producing more problems than it's worth maintaining, we can always instrument it then to see if it's pulling its weight.

Unfortunately, we will not have before/after telemetry to compare, so we can't see the impact of the change to pat ourselves on the back for a job well done. But we've run CDK for years without that kind of telemetry, and we've done an okay job, I would say. I think we'll manage.

In the meantime, we will be looking at the telemetry of synthesis times and duration hotspots and continuing to drive those down.

Contributor

My main concern here is not whether we congratulate ourselves for a job well done or badly done.

Like you said, with telemetry we can later decide whether to keep the code or not. My counter-question would be: if this code is not pulling its weight in the near or long-term future, why is the optimisation not working as we expect? The answer to that question is almost always hidden in the opaque implementation changes we make to CDK. Opaque not to customers, but opaque to telemetry/metrics.
If a proper tracking mechanism via telemetry is harder to implement or beyond the scope of this PR, I agree; maybe we tackle it together and build a proper tracking mechanism, so that we become proactive about these kinds of optimisation opportunities.

Again, nothing against the changes here, just advocating for better instrumentation, that's all.

@rix0rrr
Contributor Author

rix0rrr commented May 13, 2026

> Do we need to add new unit tests to test the behaviour of this new directory-based cache, or do the existing tests already cover this?

My plan was indeed to say: the existing tests and integ tests are exercising this code path, ensuring that it doesn't break any functionality.

Otherwise, all tests I really want to add are end-to-end. I don't want to add mocks (or whatever) that confirm that "exactly this code path is followed", because those end up brittle against code changes. The real functionality we would test is "the second time you call fingerprint() on the same directory is faster"... but that is a timing test which is easily disturbed by machine load, which ends up as a flaky test.

So I opted for no test and manual confirmation.

Contributor

@mrgrain mrgrain left a comment

Help me understand this @rix0rrr

The large file fingerprint cache is effectively replacing hash(content) with a faster hash(inode+mtime+size). Previously this was kept in memory, to avoid re-caching large files for a second time in case we have lookups and the synth loop needs to run multiple times.

If I understand it correctly, this change seems to propose:

  • we now cache the hash of all files
  • we now persist this cache across multiple executions

Doesn't this effectively mean we will always use hash(inode+mtime+size)? Sure, the first time we see a file we calculate its hash, but after that we seem to not care anymore. Why not just switch to hash(inode+mtime+size) in general?

@rix0rrr
Contributor Author

rix0rrr commented May 13, 2026

> The large file fingerprint cache is effectively replacing hash(content) with a faster hash(inode+mtime+size).

Well it shouldn't be that, so I sure hope that's not what accidentally happened 👀 . Let me double-check the code to be sure.

We are still calculating and outputting the hash of the contents of the target file.

What is supposed to be happening is that the key into the cache table is hash(inode+mtime+size).

  • Why the hash() of those values? Shrug, good question. That was already there. I suppose a simple concatenation of those values would also be just fine, and saves a hashing operation so slightly faster. Why not.
  • Why inode and not file name? I think this is done for quick invalidation. A different inode could obviously not be the same file, so that's an easy way to force a re-read.
  • Why store fields that invalidate the cache in the key and not the value? In other words, why { [inode+mtime+size] -> hash } and not { inode -> [mtime, size, hash] }? Just another easy way of invalidating: if we have a cache hit we know it's good, and we don't have to have another if to check the additional fields.
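
The scheme described in these bullets can be sketched as follows. This is an illustration only: the names `cacheKey`, `contentHash`, and `contentHashCache`, the `:` separator, and the choice of SHA-256 are assumptions and are not taken from the PR.

```typescript
import * as crypto from 'crypto';
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Per-machine lookup table: filesystem-metadata key -> hash of the file contents.
const contentHashCache = new Map<string, string>();

// Hypothetical key format: plain concatenation of inode, mtime, and size.
function cacheKey(stat: fs.Stats): string {
  return `${stat.ino}:${stat.mtimeMs}:${stat.size}`;
}

// The returned value is always a hash of the contents; metadata only decides
// whether the file has to be read again.
function contentHash(filePath: string): string {
  const key = cacheKey(fs.statSync(filePath));
  const cached = contentHashCache.get(key);
  if (cached !== undefined) {
    return cached; // hit: any change to inode, mtime, or size would have produced a different key
  }
  const hash = crypto.createHash('sha256').update(fs.readFileSync(filePath)).digest('hex');
  contentHashCache.set(key, hash);
  return hash;
}

// Usage: the second call finds the same key and skips reading the file.
const assetFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'fp-demo-')), 'asset.txt');
fs.writeFileSync(assetFile, 'hello');
const first = contentHash(assetFile);
const second = contentHash(assetFile);
```

Keeping the invalidating fields in the key means a stale entry is simply never looked up again, so no extra comparison is needed on a hit.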

> Why not just switch to hash(inode+mtime+size) in general?

Because that's not portable. The same file on a different computer would most likely have a different mtime and definitely would have a different inode. So snapshot tests would basically always fail. We are still fingerprinting the file contents, it's just that the (per-machine) cache uses (per-machine) filesystem metadata.

@mrgrain mrgrain self-requested a review May 13, 2026 09:51
@mrgrain mrgrain dismissed their stale review May 13, 2026 09:51

satisfied

@mrgrain
Contributor

mrgrain commented May 13, 2026

> Because that's not portable. The same file on a different computer would most likely have a different mtime and definitely would have a different inode. So snapshot tests would basically always fail. We are still fingerprinting the file contents, it's just that the (per-machine) cache uses (per-machine) filesystem metadata.

Makes sense 👍🏻

> • Why the hash() of those values? Shrug, good question. That was already there. I suppose a simple concatenation of those values would also be just fine, and saves a hashing operation so slightly faster. Why not.

I can tell you this: It's the standard way to quickly check if a file changed without hashing its content and when you don't care that much about accuracy (e.g. because a different slower process will catch a change later on).

@rix0rrr
Contributor Author

rix0rrr commented May 13, 2026

> I can tell you this: It's the standard way to quickly check if a file changed without hashing its content and when you don't care that much about accuracy (e.g. because a different slower process will catch a change later on).

Well yeah, but you can also just compare prev_inode + prev_time + prev_size === cur_inode + cur_time + cur_size. No need to do hash(prev_inode + prev_time + prev_size) === hash(cur_inode + cur_time + cur_size).

In fact it wasn't hash(), it was JSON.stringify(), which is more akin to the plain comparison. But equally unnecessary, I replaced it with a string concat.

@mergify
Contributor

mergify Bot commented May 13, 2026

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify
Contributor

mergify Bot commented May 13, 2026

Merge Queue Status

  • Entered queue 2026-05-13 10:29 UTC · Rule: default-squash
  • Checks passed · in-place
  • Merged 2026-05-13 10:59 UTC · at b9b8573468718f6aa4d54ed1ad145b5bf607ce8d · squash

This pull request spent 30 minutes 30 seconds in the queue, including 30 minutes 5 seconds running CI.

@mergify mergify Bot temporarily deployed to automation May 13, 2026 10:29 Inactive
@mergify mergify Bot temporarily deployed to automation May 13, 2026 10:29 Inactive
@mergify mergify Bot merged commit 605a776 into main May 13, 2026
19 of 20 checks passed
@mergify mergify Bot deleted the huijbers/asset-fingerprinting-cache branch May 13, 2026 10:59
@github-actions
Contributor

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 13, 2026
@aws-cdk-automation aws-cdk-automation removed the pr/needs-maintainer-review This PR needs a review from a Core Team Member label May 13, 2026

Labels

contribution/core This is a PR that came from AWS. p2 pr-linter/exempt-integ-test The PR linter will not require integ test changes pr-linter/exempt-readme The PR linter will not require README changes pr-linter/exempt-test The PR linter will not require test changes

Projects

None yet

4 participants