
Jenkins copyArtifacts can be very slow #4500

Closed
llxia opened this issue Apr 6, 2023 · 5 comments · Fixed by #4902

Comments

@llxia
Contributor

llxia commented Apr 6, 2023

Problem:
In order to ensure we can smoothly download all 3rd party test dependencies during test execution, we have the test.getDependency and systemtest.getDependency jobs to pre-stage them on Jenkins. During test execution, the copyArtifacts API is used to download 3rd party test dependencies from the getDependency jobs. Ideally, this approach should be fast and simple because we are copying between Jenkins nodes. However, due to infra globalization, our machines are in different locations with varying network speeds. In OpenJ9, copyArtifacts may take anywhere from under a minute to 30+ minutes. This significantly impacts users' experience, especially when they only want to run a short test that may take only seconds to complete.

Ideally, we should resolve the network issue across the different geos, but it looks like we do not have a good solution so far. From a test pipeline perspective, I think we could try to improve the user experience by leveraging the existing 3rd party lib downloading logic in TKG, at least for Grinder.

Solution 1:
In Grinder, instead of copyArtifacts from getDependency by default, we do nothing. With the current logic, it should take care of itself. :)
Details: In TKG, each test will try to download its own required 3rd party libs if they are missing. If a lib already exists (i.e., it was downloaded by a previous test), its sha will be verified; only a mismatched sha results in a re-download. This ensures that we do not download all 3rd party test dependencies, only the ones needed on demand. Similar to dynamic compilation.
https://github.com/adoptium/TKG/blob/master/scripts/getDependencies.pl
For regular test jobs, I think the behaviour should stay the same for now, because most likely all 3rd party test dependencies are needed. The on-demand download can be enabled with a flag.

Additional improvement:
Currently, when re-downloading a 3rd party lib via TKG, it uses the 3rd party URLs directly. Those URLs may not be available at test execution time. The risk is low, but it can happen; this was the reason we originally created getDependency to pre-stage the dependencies. We can update TKG to point the download URL at our Jenkins job first and, if that fails, fall back to the 3rd party URL.
Note: To download from Jenkins job URL, we will need Jenkins credentials.

Solution 2:
Pre-stage 3rd party test dependencies on the machine. In this case, the test job should pre-stage the 3rd party test dependencies on the machine (outside the workspace) if they do not exist, and re-download any whose sha does not match (i.e., is out of date).

Summary
The logic for Solution 1 is very simple, as we only need to add a flag. I really like its simplicity and the fact that we only download on demand, but we may still be affected by network connectivity (even with the Jenkins URL). Solution 2 seems to be the best solution in general.

Also, regardless of which solution we choose, I think the systemtest.getDependency logic should be updated. It has its own logic (in both the STF and aqa-systemtest repos), which does not align with the rest of TKG (i.e., I do not think systemtest has the ability to check the sha and re-download).

@smlambert @pshipton @renfeiw do you have any suggestions?

@pshipton
Contributor

pshipton commented Apr 6, 2023

I cast my vote for Solution 2, which seems similar to having a git reference repo: we'd have a "reference repo" of 3rd party libs which is updated as required. Which perhaps suggests another solution? If the 3rd party libs are in a git repo which persists on the machines (outside of the workspace), then refreshing it does nothing when it's already up to date.

@smlambert
Contributor

smlambert commented Apr 6, 2023

Solution 1 was selected originally because we were under the impression that we'd be working with ephemeral/dynamic test nodes more than we currently are. As it turns out, most of our nodes are very rarely refreshed (pets, not cattle). While I would like to start moving towards the dynamic-nodes model, there is no 'loss' in using Solution 2 even if we 'shoot' the machine at the end of the test run, and there is a gain for the many cases where machines remain in the pool untouched for weeks.

Solution 2 is not unlike how we deal with JCK material. I am not sure about a Solution 3 (keeping git repos for each 3rd party lib): there is a fairly long list at this point, and it would imply requiring other things to be installed on machines to support building new test material.

Agree re: an update to systemtest.getDependency, it has also been failing consistently for a long time now (at least at ci.adoptium.net):

Cloning into 'STF'...
[Pipeline] sh
+ ant -f ./aqa-systemtest/openjdk.build/build.xml -Djava.home=/home/jenkins/workspace/systemtest.getDependency/j2sdk-image/jre -Dprereqs_root=/home/jenkins/workspace/systemtest.getDependency/systemtest_prereqs configure
Error: JAVA_HOME is not defined correctly.
  We cannot execute java
[Pipeline] cleanWs
[WS-CLEANUP] Deleting project workspace...
[WS-CLEANUP] Deferred wipeout is used...
[WS-CLEANUP] done

@AdamBrousseau
Contributor

I don't think a git repo is needed; Git doesn't play well with binaries in general anyway. I like Solution 2: keeping a local cache, verifying via checksums, and only downloading (refreshing the cache) what changes, either via copyArtifacts or curling the original download site. If the cache location is configurable, or we define a set of locations a user can use (e.g. WORKSPACE/something and $HOME/something), I think that will keep all or most use cases happy. If we keep all the logic in the test code, it saves us writing a separate job for it. It might even eliminate the getDependency job altogether?
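The configurable-location idea could look something like the sketch below. The setting name AQA_DEP_CACHE is hypothetical; only WORKSPACE and HOME are standard Jenkins/shell environment variables:

```python
import os

def resolve_cache_dir(env=os.environ):
    """Pick a cache dir: explicit setting, then the workspace, then $HOME."""
    candidates = (
        env.get("AQA_DEP_CACHE"),                                  # explicit override
        env.get("WORKSPACE") and os.path.join(env["WORKSPACE"], "dep-cache"),
        os.path.join(env.get("HOME", "/tmp"), "dep-cache"),        # per-machine default
    )
    for candidate in candidates:
        if candidate:
            return candidate
```

A workspace-local cache behaves like today's per-build staging, while the $HOME fallback gives the persistent per-machine cache of Solution 2, so one knob covers both use cases.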

@AdamBrousseau
Contributor

@llxia could an interim solution be to cache the jars to Artifactory rather than Jenkins? Would that be faster to implement than caching to local machines? That would likely solve the issues we've been having at OpenJ9.

@pshipton
Contributor

pshipton commented Nov 7, 2023

I believe this problem is the current blocker for running OpenJ9 testing on the UNB machines. The pipe to download these resources seems to be very slow, taking 40+ min, or maybe longer depending on how many machines are trying to download at the same time. Unfortunately, when testing starts, there can be a large number of machines trying to download these resources all at once.
