Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci] - Test skipping publishing step for daily 7.17 and 8.x test jobs #5884

Merged
merged 1 commit into from
Apr 21, 2023

Conversation

andrewkroh
Copy link
Member

@andrewkroh andrewkroh commented Apr 14, 2023

What does this PR do?

As an optimization add a parameter to skip the package publishing step.
This avoids any interaction with the remote publishing Jenkins pipeline
so it can avoid errors not related to testing.

There is no need to attempt publishing for jobs that only are concerned
with testing (e.g. daily tests agains 7.17 and 8.x).

Background

This PR began as just a test to try to understand why CI jobs are failing.

I am concerned that daily CI testing is not running for all packages against the latest 8.7 snapshot. There is a job that is running daily, but I suspect it’s not doing any real testing. It mostly fails on publishing and never gets to the point of running package or system tests.

  • Test-only type of jobs should never be doing any publishing. We should skip that stage to speed it up.
  • Why does the job publish before it tests? Shouldn't we check that the tests are passing before publishing?

Investigation

Screenshot 2023-04-14 at 09 59 46

Screenshot 2023-04-14 at 09 52 36

The integrations/main job (https://fleet-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/main/1283/pipeline) fails while invoking the remote publishing job. Then it never advances to testing. Why not move the testing before the publishing?

[2023-04-14T03:17:49.092Z] Triggering parameterized remote job 'https://internal-ci.elastic.co/job/package_storage/job/publishing-job-remote'
[2023-04-14T03:17:49.092Z]   Using job-level defined 'Credentials Authentication' as user 'local-readonly' (Credentials ID 'local-readonly-api-token')
[2023-04-14T03:17:49.092Z] Triggering remote job now.
[2023-04-14T03:17:49.092Z] reuse cached crumb: internal-ci.elastic.co
[2023-04-14T03:17:49.286Z]   Remote job queue number: 1327319
[2023-04-14T03:17:49.286Z] Remote build started!
[2023-04-14T03:17:49.286Z]   Remote build URL: https://internal-ci.elastic.co/job/package_storage/job/publishing-job-remote/5504/
[2023-04-14T03:17:49.286Z]   Remote build number: 5504

Screenshot 2023-04-14 at 10 00 35

publishing-job-remote fails with NPE.

https://internal-ci.elastic.co/job/package_storage/job/publishing-job-remote/5504/console

[2023-04-14T03:21:05.633Z] INFO: fetchAndPrepareBuildInfo (see build-info.json)
[2023-04-14T03:21:05.633Z] INFO: curl https://internal-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/package_storage/publishing-job-remote/runs/5504/ -o build-info.json
[2023-04-14T03:21:05.834Z] jq: error (at build-info.json:5): Cannot iterate over null (null)
[2023-04-14T03:21:05.935Z] INFO: prepareErrorMetrics(errorMetrics)
[2023-04-14T03:21:05.941Z] [Pipeline] archiveArtifacts
[2023-04-14T03:21:05.944Z] Archiving artifacts
[2023-04-14T03:21:05.959Z] [Pipeline] timeout
[2023-04-14T03:21:05.961Z] Timeout set to expire in 5 min 0 sec
[2023-04-14T03:21:05.961Z] [Pipeline] {
[2023-04-14T03:21:05.976Z] [Pipeline] readJSON
...
[2023-04-14T03:21:06.134Z] ERROR: generateBuildReport: Error generating build report
[2023-04-14T03:21:06.134Z] java.lang.NullPointerException: Cannot invoke method length() on null object

@andrewkroh andrewkroh force-pushed the ci/skip-publishing-param branch 2 times, most recently from b99a8d6 to 5955947 Compare April 14, 2023 13:41
@andrewkroh andrewkroh added the ci label Apr 14, 2023
@elasticmachine
Copy link

elasticmachine commented Apr 14, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-04-20T18:18:44.236+0000

  • Duration: 110 min 21 sec

Test stats 🧪

Test Results
Failed 0
Passed 4741
Skipped 11
Total 4752

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@elasticmachine
Copy link

elasticmachine commented Apr 14, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 100.0% (401/401) 💚
Files 96.974% (673/694) 👍
Classes 96.974% (673/694) 👍
Methods 91.704% (6688/7293) 👍
Lines 92.027% (142924/155307) 👍
Conditionals 100.0% (0/0) 💚

@andrewkroh
Copy link
Member Author

andrewkroh commented Apr 14, 2023

The result of this run against 8.7.1-SNAPSHOT shows a lot test failures. This image shows only test cases that failed. There a more integrations that failed to install before a test case executed (like jolokia_input, cel, zeek, cloud_defend, and barracuda_cloudgen_firewall).

Screenshot 2023-04-14 at 15 05 03

An interesting failure is that input packages no longer install due to the template. The errors are:

Error: error running package system tests: could not complete test run: could not add data stream config to policy: could not add package to policy; API status code = 400; response body = {"statusCode":400,"error":"Bad Request","message":"illegal_argument_exception\n\tCaused by:\n\t\tillegal_argument_exception: invalid composite mappings for [logs-cel.cel]\n\tRoot causes:\n\t\tillegal_argument_exception: composable template [logs-cel.cel] template after composition with component templates [logs-cel.cel@package, logs-cel.cel@custom, .fleet_globals-1, .fleet_agent_id_verification-1] is invalid"}

Error: error running package system tests: could not complete test run: could not add data stream config to policy: could not add package to policy; API status code = 400; response body = {"statusCode":400,"error":"Bad Request","message":"illegal_argument_exception\n\tCaused by:\n\t\tillegal_argument_exception: invalid composite mappings for [metrics-jolokia.jolokia]\n\tRoot causes:\n\t\tillegal_argument_exception: composable template [metrics-jolokia.jolokia] template after composition with component templates [metrics-jolokia.jolokia@package, metrics-jolokia.jolokia@custom, .fleet_globals-1, .fleet_agent_id_verification-1] is invalid"}

cloud_defend was found in the registry.

[2023-04-14T14:39:41.566Z] Error: error running package asset tests: could not complete test run: can't install the package: can't install the package: could not install package; API status code = 404; response body = {"statusCode":404,"error":"Not Found","message":"[cloud_defend] package not found in registry"}

UDP package failed with:

Error: error running package system tests: could not complete test run: failed to tear down runner: error reassigning original policy to agent: could not assign policy to agent; API status code = 500; response body = {"statusCode":500,"error":"Internal Server Error","message":"version_conflict_engine_exception\n\tRoot causes:\n\t\tversion_conflict_engine_exception: [dc55f519-adce-4ad4-af2f-2699759ed029]: version conflict, required seqNo [49], primary term [1]. current document has seqNo [50] and primary term [1]"}

As an optimization add a parameter to skip the package publishing step.
This avoids any interaction with the remote publishing Jenkins pipeline
so it can avoid errors not related to testing.

There is no need to attempt publishing for jobs that only are concerned
with testing (e.g. daily tests agains 7.17 and 8.x).
@andrewkroh andrewkroh changed the title [ci] - Test skipping publishing [ci] - Test skipping publishing step for daily 7.17 and 8.x test jobs Apr 14, 2023
@andrewkroh andrewkroh added Team:Fleet Label for the Fleet team Team:Ecosystem Label for the Packages Ecosystem team labels Apr 14, 2023
@andrewkroh andrewkroh marked this pull request as ready for review April 14, 2023 19:32
@andrewkroh andrewkroh requested a review from a team as a code owner April 14, 2023 19:32
@elasticmachine
Copy link

Pinging @elastic/fleet (Team:Fleet)

@kpollich
Copy link
Member

cc @mrodm - I think you'd be best suited to look into this.

@andrewkroh
Copy link
Member Author

I created an issue for the input packages not working. elastic/kibana#155057

@mrodm
Copy link
Contributor

mrodm commented Apr 18, 2023

Why not move the testing before the publishing?

Previously it was like that, but publishing new package versions were delayed too much when PR. Even if testing was not successful, publishing was performed too:

  post {
    always {
      publishCoverageReports()
      packageStoragePublish()
    }

So it was decided to move that step before testing, so publish of new packages was carried out first (#4647).

publishing-job-remote fails with NPE.
https://internal-ci.elastic.co/job/package_storage/job/publishing-job-remote/5504/console

There is an ongoing PR to try to solve these errors. They are related to a limit of GitHub checks per SHA.

Following this approach to speed up the tests by avoiding to run the publishing stage, what about adding a third job in parallel to just run the publishing stage ? @andrewkroh @jsoriano That would require adding another boolean parameter like skip_testing

@andrewkroh
Copy link
Member Author

Thanks for the background info on the existing structure.

what about adding a third job in parallel to just run the publishing stage ?

That's sound equivalent and should address the issue of testing being blocked (or stopped) by publishing. The scheduled 7.latest/8.latest testing jobs would not launch the publishing job, right?

@andrewkroh
Copy link
Member Author

andrewkroh commented Apr 18, 2023

Another issue that went undiagnosed was #5912.

@mrodm
Copy link
Contributor

mrodm commented Apr 19, 2023

That's sound equivalent and should address the issue of testing being blocked (or stopped) by publishing. The scheduled 7.latest/8.latest testing jobs would not launch the publishing job, right?

That's right, with that this schedule daily would trigger three Jenkins builds:

  • Build running 7.17.x (latest) without publishing stage.
  • Build running 8.x (latest) without publishing stage.
  • Build running just publishing without testing stage (a new parameter should be added for that).

PR related to remove notifications to GitHub was already merged, that also should help to avoid those errors 🤞

@andrewkroh
Copy link
Member Author

andrewkroh commented Apr 20, 2023

PR related to remove notifications to GitHub was already merged, that also should help to avoid those errors

It looks like the publishing errors have gone away. 👍

Could we merge this PR to take that publishing step out of those daily jobs anyways?

@mrodm
Copy link
Contributor

mrodm commented Apr 21, 2023

Could we merge this PR to take that publishing step out of those daily jobs anyways?

Yes, I think it could help in some cases.

In any case, I've created this other issue #5947 to create that third job mentioned in #5884 (comment)

@andrewkroh andrewkroh merged commit b988f03 into elastic:main Apr 21, 2023
1 check passed
@mrodm mrodm mentioned this pull request Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Team:Ecosystem Label for the Packages Ecosystem team Team:Fleet Label for the Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants