
remove unsafe use of maven dependency caching, and cache deps explicitly#15106

Closed
xvrl wants to merge 11 commits into apache:master from xvrl:cache-mvn-deps-explicitly

Conversation

@xvrl (Member) commented Oct 7, 2023

Fix unsafe usage of maven dependency caching in actions/setup-java, and replace it with explicit save/restore of maven dependencies only, without relying on mvn install.

Using the setup-java maven cache is not safe for steps calling mvn install, since setup-java cache keys are based only on the pom hashfiles, whereas mvn install causes artifacts specific to that commit to end up in the local maven repository.

This disables the use of the setup-java maven cache for all steps.

This also removes references to polluted setup-java cache keys, which we were referencing as a fall-back cache using restore-keys: in some integration and unit tests.

Instead, to avoid pulling dependencies every time, maven dependencies are first resolved explicitly using maven compile / test-compile, and then cached and restored explicitly, or used as a fallback for cache misses with explicit commit tags.
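The approach described above could be sketched as workflow steps like the following (illustrative YAML only; step names, the cache-key prefix, and action versions are assumptions, not the PR's exact configuration):

```yaml
# Restore dependencies keyed only on pom hashes, warm the cache with
# compile/test-compile (no `install`, so no project artifacts leak in),
# then save the result for later jobs.
- name: restore maven dependencies
  uses: actions/cache/restore@v4
  with:
    path: ~/.m2/repository
    key: maven-${{ runner.os }}-depsonly-${{ hashFiles('**/pom.xml') }}

- name: resolve and compile dependencies
  run: mvn -B compile test-compile

- name: save maven dependencies
  uses: actions/cache/save@v4
  with:
    path: ~/.m2/repository
    key: maven-${{ runner.os }}-depsonly-${{ hashFiles('**/pom.xml') }}
```

Splitting save and restore into separate steps (rather than the combined actions/cache action) lets the save run even when a later step like mvn install would otherwise pollute a combined cache.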

@abhishekagarwal87 (Contributor)

@xvrl - The static checks are failing.

@kgyrtkirk (Member) left a comment

could you please reference a build that failed because of this caching?

since the cache is labeled by the pom hashes, I don't think it's a showstopper to have installed artifacts from the project being built - but yeah, it could be left out

my problem with using random maven commands to download artifacts is that they don't always do everything - for example, druid has some extra maven plugin stuff that downloads a nodejs.tgz into the local m2 repo - we were hitting some issues because of rate limiting...

@xvrl (Member Author) commented Oct 12, 2023

could you please reference a build that failed because of this caching?

I have experienced this locally where I had artifacts from a different build causing integration tests to fail. This change removes that possibility. See also further below in my comment for an example that happened as part of this change.

since the cache is labeled by the pom hashes I don't think its showstopper to have installed artifacts from the project being built

The pom hash is not sufficient: different code could have the same pom hash, causing incorrect artifacts in the maven repository to be used at runtime. Some tests may also accidentally pass due to artifacts being present, instead of failing if they are not rebuilt.

Only those cases where we explicitly want to reuse artifacts from a previous build step for same commit should be restoring maven install artifacts. Those are already marked as such https://github.com/apache/druid/blob/master/.github/workflows/unit-and-integration-tests-unified.yml#L72-L80

my problem with using random maven commands to download artifacts is that they don't always do everything - for example: druid has some extra maven plugin stuff which downloads a nodejs.tgz into the local m2 repo

fair, although the goal of caching is to download as many dependencies as possible without affecting the integrity of the build. If we miss a few it should not be a dealbreaker. I'm trying to achieve this with as few commands a possible. We may need to resolve additional dependencies explicitly if they are only called during some specific phases of the build.

It looks like mvn dependency:resolve trips up trying to download druid artifacts in submodules so we may need a different approach than what I tried regardless. This is actually a good example of where it worked locally because my maven repo had druid artifacts from a previous mvn install, but failed in CI without those artifacts. Our default maven cache should only ever contain artifacts already in maven central.
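The failure mode described here could be illustrated with steps like the following (hypothetical workflow fragment, not the PR's actual YAML):

```yaml
# `dependency:resolve` tries to fetch every declared dependency from remote
# repositories, including inter-module druid artifacts that only exist after
# a local `mvn install` -- so it can pass locally but fail in clean CI.
- name: resolve dependencies (problematic)
  run: mvn -B dependency:resolve   # may fail on inter-module druid artifacts

# Compiling resolves inter-module dependencies through the reactor instead,
# without needing project artifacts in the local repository.
- name: resolve dependencies (reactor-aware)
  run: mvn -B compile test-compile
```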

@xvrl (Member Author) commented Oct 12, 2023

@xvrl - The static checks are failing.

@abhishekagarwal87 yes, see my comment above about dependency:resolve failing on druid artifacts; I'll have to look for an alternative way to pull maven dependencies.

@kgyrtkirk (Member)

could you please reference a build that failed because of this caching?

I have experienced this locally where I had artifacts from a different build causing integration tests to fail. This change removes that possibility. See also further below in my comment for an example that happened as part of this change.

I don't understand how these caches affect your local builds - if you can't point to a build that failed or misbehaved, then I don't really understand why we need to change this.

those artifacts could be there, cached and reloaded - but they will not affect things, as they will get overwritten by the local build: incremental compile cannot kick in for a fresh checkout, so the install target will need to compile and create the artifacts from zero.

[...] the goal of caching is to download as many dependencies as possible without affecting the integrity [...]

I don't think it should be a best-effort thing... we should try to avoid as many external (re)downloads as possible...
let's say you miss 1 artifact, and you cache stuff and fan out to 100 jobs; that will cause 100 hits on that artifact

I think as an alternative we could probably purge the local druid artifacts (remove ~/.m2/repository/org/apache/druid/ ?) at the end of the job so that they are not placed in the caches
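The purge idea could look like this as a workflow step (a sketch only; the path is the one suggested above, the step name is made up):

```yaml
# Runs even if earlier steps fail, so project artifacts never reach the
# cache that setup-java (or actions/cache) saves at the end of the job.
- name: purge project artifacts from the local maven repo
  if: always()
  run: rm -rf ~/.m2/repository/org/apache/druid/
```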

The pom hash is not sufficient: different code could have the same pom hash, causing incorrect artifacts in the maven repository to be used at runtime. Some tests may also accidentally pass due to artifacts being present, instead of failing if they are not rebuilt.

That depends on the use case:

  • If it just builds the project from zero it should work, since the actual code should not affect the referenced dependencies.
  • If it's used during testing to avoid rebuilding the project, like here, then yes - all files of the project should be accounted for, as it will try to skip the build phase.

Could you describe the scenario you have in mind?

@xvrl (Member Author) commented Oct 19, 2023

Could you describe the scenario you have in mind?

In any phase where we restore the maven repository without explicitly restoring from a specific git commit sha, it should be safe to assume that the maven repository only contains publicly available dependencies.

We cannot assume that every phase will always call install first, since that is already not the case today, e.g. our web console step does not https://github.com/apache/druid/blob/master/.github/workflows/static-checks.yml#L173-L177

The cache might be deleted for any reason, and in cases where we fall back to using the setup-java maven cache such as here https://github.com/apache/druid/blob/master/.github/workflows/unit-and-integration-tests-unified.yml#L80 it's possible the maven repo would contain artifacts that do not get built again.

For example, if a PR removes a submodule, but some code depending on that submodule still exists, it could pass if the cache contains that artifact, but fail if the cache does not.

@xvrl (Member Author) commented Oct 19, 2023

@kgyrtkirk to answer your question in the other thread

I think instead we should change install to package so that mvn doesn't pollute the local .m2 cache - or does that not work?

package works in some cases, e.g. unit tests, but for integration tests the artifacts need to be in the local maven repo.
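The distinction could be sketched like this (illustrative steps, not the PR's workflow; step names are made up):

```yaml
# `package` builds artifacts into each module's target/ directory only,
# while `install` also copies them into ~/.m2/repository, where
# integration tests (and, unintentionally, a shared cache) can see them.
- name: unit tests                    # package is enough; nothing touches ~/.m2
  run: mvn -B package
- name: build for integration tests   # ITs resolve druid artifacts from the local repo
  run: mvn -B install -DskipTests
```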

@kgyrtkirk (Member)

We cannot assume that every phase will always call install first, since that is already not the case today, e.g. our web console step does not https://github.com/apache/druid/blob/master/.github/workflows/static-checks.yml#L173-L177

that's pretty interesting :D a cache should never be taken for granted...
for the record, it does run the tests with 0 downloaded artifacts... so:

  • it's either working incorrectly even in that case
  • or doesn't care at all what's in the cache

The cache might be deleted for any reason, and in cases where we fall back to using the setup-java maven cache such as here https://github.com/apache/druid/blob/master/.github/workflows/unit-and-integration-tests-unified.yml#L80 it's possible the maven repo would contain artifacts that do not get built again.

For example, if a PR removes a submodule, but some code depending on that submodule still exists, it could pass if the cache contains that artifact, but fail if the cache does not.

I agree that we should prefer to not keep artifacts produced from the current sources in the cache

I think we would need to cache more than what the proposed PR tries (include the node stuff, etc.) and possibly remove the attempts to avoid compilation in some jobs - because they use up too much cache space and cause churn.
So instead of some random maven commands that are supposed to work, consider:

  • somehow configure maven to install artifacts into a secondary location (and also look them up there)
    • last time I checked this was impossible; I don't see a way to do it right now either
  • configure the cache to exclude a set of directories during save
    • it seems path is a glob, so it could possibly be configured to include+exclude (I haven't tried it yet)
    • this will need an extra action in every job to maintain the cache
  • purge the druid artifacts from the .m2 at the end (before save)
    • this could be compatible with the setup-java action we are using, so it could be less intrusive
    • failing to add this step will pollute the cache with artifacts from the project...
  • change the project to not use the maven local repository
    • not sure if this is possible...
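The include+exclude idea from the second bullet might look like this (an untested sketch; actions/cache path patterns accept `!` exclusions, though exclusion behavior has caveats):

```yaml
- uses: actions/cache@v4
  with:
    path: |
      ~/.m2/repository
      !~/.m2/repository/org/apache/druid
    key: maven-${{ runner.os }}-${{ hashFiles('**/pom.xml') }}
```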

I'm currently biased toward the "configure the cache to not include stuff" direction, for a couple of reasons:

  • it may not pick up garbage
  • it's less likely to make things worse

some notes on current state:

@xvrl force-pushed the cache-mvn-deps-explicitly branch from 2e5c736 to baba1b6 on October 24, 2023 03:43
Comment on lines +62 to +63
```yaml
restore-keys: |
  maven-${{ runner.os }}-depsonly-
```
Member:

I think this should be removed, as it could result in dragging outdated artifacts forward:

  • project declares artifact#1.0
  • cache stores artifact#1.0
  • pom changes to use artifact#2.0
  • because of the fallback, the next cache will still contain artifact#1.0

Member Author:

I'm following the approach suggested in https://github.com/actions/cache/blob/main/examples.md#java---maven.

Even if this includes older artifacts, I believe it would still save us time compared to downloading all artifacts again on every change to the pom. This is no different from what happens when developing locally. I rarely have to clean my local .m2 cache, and it does not grow that fast.

GitHub also limits the size of cached artifacts, so I expect the size to stay bounded. If this becomes a problem we can revisit.

Member:

it can be in an example, but I disagree with that: right now we load the repo cache from zero 4 times for every PR; switching that to open a new one when the pom.xml-s change is already a huge step forward... I think it's safer to not pick up "some" earlier cache when we will be saving it

Member Author:

if we add a mvn dependency:purge-local-repository step before compile, would that make you feel better?
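For illustration, such a step might use the maven-dependency-plugin like this (a sketch; the chosen flags are assumptions about what we would want to purge, not the PR's configuration):

```yaml
# Purge only the project's own group from the local repository, without
# re-resolving anything, before the dependency cache is saved.
- name: purge druid artifacts
  run: mvn -B dependency:purge-local-repository -DmanualInclude=org.apache.druid -DreResolve=false
```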

Member Author:

bump @kgyrtkirk ^

Member:

I'm not sure how purge-local-repository would help here; wouldn't that still remove the local repo in case there is a cache hit?

I wonder if the problem is the opportunistic caching of project artifacts; can't we somehow avoid them getting into the cache:

  • exclude them from the cache
    • I did experiment with this - and it's a bit convoluted due to the fact that excludes work by filtering the included list, see here
  • remove them with an always() step

I think this second approach is better - because if someone decides to add another exclude, it will be less likely to trigger the exclusion issue differently

Member Author:

  • we can use purge-local-repository to do what you suggest; it has lots of options to include/exclude various artifacts or purge only snapshots, and it doesn't make any assumptions about where the maven repo is stored.
  • I fail to see how your suggestion addresses your earlier point that the cache would just keep growing, though.


```yaml
- name: compile and cache maven dependencies
  # run maven compile step to cache all shared maven dependencies
  # maven install will be run in a subsequent step to avoid polluting the global maven cache
```
Member:

I wonder why you haven't replied to my comments;
why can't we just exclude the artifacts built by the project from the cache?

Member Author:

excluding artifacts is brittle, since it requires knowing every artifact that the project might write to the local maven repo, and knowing what might or might not be safe to cache.

Instead I'd rather we err on the safe side and only add things we know are safe to cache.

Member:

I believe we are developing the project... and it should only produce org.apache.druid artifacts - if it doesn't, we should fix that...
but all of them seem to be under .m2/repository/org/apache/druid - do you know of any that is not installed there?

I don't feel that using a crafted maven command to try to warm up the cache would be a better approach

@xvrl (Member Author) Oct 24, 2023

if we also want to be able to cache things like npm for the web console, like you suggested, then we will need to rely on a compile/test-compile step to get all the cache entries warmed up. Without that, we cannot guarantee that all plugins will be downloaded or invoked to pull their dependencies.

@gianm (Contributor) commented Oct 30, 2023

Any idea if this patch would help with #15276 (comment)? I just noticed some weird behavior where an IT repeatedly fails on an npm step even after multiple attempts to re-run the failed step. It might be some caching issue?

@xvrl (Member Author) commented Nov 9, 2023

@gianm I don't think so, I would imagine the npm caching is separate, and we don't restore any of that today. This might be a different issue.

github-actions bot commented Mar 7, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@xvrl (Member Author) commented Mar 21, 2024

@kgyrtkirk I no longer have time to work on this, so if you or someone else would like to push it over the finish line, feel free. In the meantime I will close it.

@xvrl xvrl closed this Mar 21, 2024