
Jenkins copyArtifacts can be very slow #4500

Closed
llxia opened this issue Apr 6, 2023 · 5 comments · Fixed by #4902

Comments

@llxia
Contributor

llxia commented Apr 6, 2023

Problem:
In order to ensure we can smoothly download all 3rd party test dependencies during test execution, we have the test.getDependency and systemtest.getDependency jobs to pre-stage them on Jenkins. During test execution, the copyArtifacts API is used to download 3rd party test dependencies from the getDependency jobs. Ideally, this approach should be fast and simple because we are copying between Jenkins nodes. However, due to infra globalization, our machines are in different locations with varying network speeds. In OpenJ9, copyArtifacts may take anywhere from under a minute to 30+ minutes. This significantly impacts users' experience, especially when they only want to run a short test that may take only seconds to complete.

Ideally, we should resolve the network issue across the different geos, but it looks like we do not have a good solution so far. From a test pipeline perspective, I think we could try to improve the user experience by leveraging the existing 3rd party lib downloading logic in TKG, at least for Grinder.

Solution 1:
In Grinder, instead of copyArtifacts from getDependency by default, we do nothing. With the current logic, it should take care of itself. :)
Details: In TKG, each test will try to download its own required 3rd party libs if they are missing. If a lib already exists (i.e., it was downloaded by a previous test), its sha will be verified; only a mismatched sha results in a re-download. This ensures that we do not download all 3rd party test dependencies, only the ones needed on demand. Similar to dynamic compilation.
https://github.com/adoptium/TKG/blob/master/scripts/getDependencies.pl
For regular test jobs, I think the behaviour should stay the same for now, because most likely all 3rd party test dependencies are needed. The on-demand download can be enabled with a flag.

Additional improvement:
Currently, when re-downloading a 3rd party lib via TKG, it uses the 3rd party URLs directly. Those URLs may not be available at test execution time. The risk is low, but it can happen; this was the reason we originally created getDependency to pre-stage the dependencies. We can update TKG to point the download URL at our Jenkins job first and, if that fails, fall back to the 3rd party URL.
Note: To download from Jenkins job URL, we will need Jenkins credentials.

Solution 2:
Pre-stage 3rd party test dependencies on the machine. In this case, the test job should pre-stage the 3rd party test dependencies on the machine (outside the workspace) if they do not exist, and re-download any whose sha does not match (i.e., is out of date).

Summary
The logic for Solution 1 is very simple, as we only need to add a flag. I really like its simplicity and the fact that we only download on demand, but we may still be affected by network connectivity (even with the Jenkins URL). Solution 2 seems to be the best solution in general.

Also, regardless of which solution we choose, I think the systemtest.getDependency logic should be updated. It has its own logic (in both the STF and aqa-systemtest repos), which does not align with the rest of TKG (i.e., I do not think systemtest has the ability to check the sha and re-download).

@smlambert @pshipton @renfeiw do you have any suggestions?

@pshipton
Contributor

pshipton commented Apr 6, 2023

I cast my vote for Solution 2, which seems similar to having a git reference repo: we'd have a "reference repo" of 3rd party libs which is updated as required. Which perhaps suggests another solution? If the 3rd party libs are in a git repo which persists on the machines (outside of the workspace), then refreshing it does nothing when it's already up to date.

@smlambert
Contributor

smlambert commented Apr 6, 2023

Solution 1 was selected originally because we were under the impression that we'd be working with ephemeral/dynamic test nodes more than we currently are. As it turns out, most of our nodes are very rarely refreshed (pets, not cattle). While I would like to start moving towards the dynamic-nodes model, there is no 'loss' in using Solution 2 even if we 'shoot' the machine at the end of the test run, and there is a gain for the many cases where machines remain in the pool untouched for weeks.

Solution 2 is not unlike how we deal with JCK material. I am not sure about a Solution 3 (keeping git repos for each 3rd party lib): there is a fairly long list at this point, and it would imply requiring other things to be installed on machines to support building new test material.

Agree re: an update to systemtest.getDependency, it has also been failing consistently for a long time now (at least at ci.adoptium.net):

Cloning into 'STF'...
[Pipeline] sh
+ ant -f ./aqa-systemtest/openjdk.build/build.xml -Djava.home=/home/jenkins/workspace/systemtest.getDependency/j2sdk-image/jre -Dprereqs_root=/home/jenkins/workspace/systemtest.getDependency/systemtest_prereqs configure
Error: JAVA_HOME is not defined correctly.
  We cannot execute java
[Pipeline] cleanWs
[WS-CLEANUP] Deleting project workspace...
[WS-CLEANUP] Deferred wipeout is used...
[WS-CLEANUP] done

@AdamBrousseau
Contributor

I don't think a git repo is needed; Git doesn't play well with binaries in general anyway. I like Solution 2: keeping a local cache, verifying via checksums, and only downloading (refreshing the cache) what changes, either via copyArtifacts or curling the original download site. If the cache location is configurable, or we define a set of locations a user can use (e.g. WORKSPACE/something and $HOME/something), I think that will keep all or most use cases happy. If we keep all the logic in the test code, it saves us writing a separate job for it. It might even eliminate the getDependency job altogether?
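The configurable-location idea could look something like the sketch below. The setting name AQA_DEP_CACHE is hypothetical; only WORKSPACE and HOME are standard Jenkins/shell environment variables:

```python
import os

def resolve_cache_dir(env=os.environ):
    """Pick a cache dir: explicit setting, then the workspace, then $HOME."""
    candidates = (
        env.get("AQA_DEP_CACHE"),                                  # explicit override
        env.get("WORKSPACE") and os.path.join(env["WORKSPACE"], "dep-cache"),
        os.path.join(env.get("HOME", "/tmp"), "dep-cache"),        # per-machine default
    )
    for candidate in candidates:
        if candidate:
            return candidate
```

A workspace-local cache behaves like today's per-build staging, while the $HOME fallback gives the persistent per-machine cache of Solution 2, so one knob covers both use cases.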

@AdamBrousseau
Contributor

@llxia could an interim solution be to cache the jars to Artifactory rather than Jenkins? Would that be faster to implement than caching to local machines? That would likely solve the issues we've been having at OpenJ9.

@pshipton
Contributor

pshipton commented Nov 7, 2023

I believe this problem is the current blocker for running OpenJ9 testing on the UNB machines. The pipe to download these resources seems to be very slow, taking 40+ min, or maybe longer depending on how many machines are trying to download at the same time. Unfortunately, when testing starts, there can be a large number of machines trying to download these resources all at once.
