
[Detection Engine] Fixing ML FTR tests #182183

Closed
rylnd wants to merge 22 commits

Conversation


rylnd commented Apr 30, 2024

Summary

🚧 🚧

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Multiple Spaces—unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service being operational. |
See more potential risk examples

For maintainers


rylnd commented Apr 30, 2024

Flaky run with only ML FTR tests running (25x): https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5819

Let's see if we can't reproduce the failure in isolation.

These don't appear to fail in isolation, so let's see if another test is
what's causing the failure.

rylnd commented Apr 30, 2024

None of the isolated tests failed, so I've added debugging code and I'm now running 50x tests not in isolation: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5821

We got a single failure in 50 executions, but there was no useful debug
info. Trying again with a broader pattern since that might tell us more.

rylnd commented Apr 30, 2024

We got a failure in the above run, but the debugging info did not provide anything useful. I've tried broadening the debugging data we're collecting, and ran it again: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5824

My theory is that this dynamic template is being used in a rare
situation where data is being inserted before the index mappings have
been applied. If true, removing this will (at least) cause a different error to be
produced in the same situation, if not fix the issue.
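For context on the theory above, here is a hedged sketch of the kind of dynamic_templates entry in question; the index name and template body are illustrative, not the archive's actual mapping. When a document arrives before explicit mappings exist, Elasticsearch maps its unseen string fields according to rules like this one.

```ts
// Illustrative only: a dynamic template that maps any new string field as
// keyword the first time a document containing it is indexed.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function createExampleAnomaliesIndex() {
  await client.indices.create({
    index: '.ml-anomalies-custom-example',
    mappings: {
      dynamic_templates: [
        {
          strings_as_keywords: {
            match_mapping_type: 'string',
            mapping: { type: 'keyword' },
          },
        },
      ],
    },
  });
}
```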

rylnd commented May 1, 2024

While there was a failure in the previous round, many of the runs were cancelled, so I ran again: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5825.

The results were mostly the same, although I did observe the tests fail when no anomaly data/mappings were present. That all but eliminates a "dirty environment" as the cause and leaves a race condition as the most likely explanation.

I've triggered another run without the dynamic_template mapping on the index that's being affected, to see if we can change/eliminate the error: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5828


rylnd commented May 1, 2024

Interesting development: the previous run of 60x without the dynamic_template was 100% successful. I'm going to run another 200x to see if that holds, but if so my theory about a race condition between mappings and dynamic_template seems to be correct.

200x build: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5832

This reverts commit fde0334.

The error still occurred when this was absent, meaning it's not
involved.
Something's happening within es_archiver, and I'm trying to figure out
what.

rylnd commented May 3, 2024

I added some more verbose debugging in bdab2be; running another 60x now: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5848

* Debug concurrency setting (Might reduce this to 1 to eliminate that as
  an issue)
* Debug order of file streams (to ensure mappings are being picked up
  first)
* Debug index creation (to see how/whether the ML index is being
  created)
* Debug index creation response (to see if there's some non-fatal
  error/warning).
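To make the last two bullets concrete, here is a hedged sketch of the kind of debug wrapper being described; createIndexWithDebug, the client setup, and the log format are illustrative assumptions, not the actual es_archiver code.

```ts
// Hypothetical debug wrapper: log the index being created and the full create
// response, so non-fatal warnings (e.g. acknowledged: false) become visible.
import { Client } from '@elastic/elasticsearch';
import type { IndicesCreateRequest } from '@elastic/elasticsearch/lib/api/types';

const client = new Client({ node: 'http://localhost:9200' });

async function createIndexWithDebug(request: IndicesCreateRequest) {
  console.debug('[es_archiver debug] creating index', request.index);
  const response = await client.indices.create(request);
  console.debug('[es_archiver debug] create response', JSON.stringify(response));
  return response;
}
```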

rylnd commented May 6, 2024

The previous run had two failures. The debug info showed es_archiver receiving the documents as they're written in the archive (e.g. as host.name: '' and not host: { name: '' }), but since ops is a WeakMap we can't really enumerate and print its contents.
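As a quick aside on why those ops can't just be dumped (a minimal illustration; the actual es_archiver structure may differ): a WeakMap holds its keys weakly and exposes no way to enumerate them, so unless you already hold a reference to every key there is nothing to print.

```ts
// WeakMap intentionally provides only get/set/has/delete; there is no keys(),
// entries(), forEach(), or iterator, so its contents cannot be listed.
const ops = new WeakMap<object, string>();
const doc = { id: 'record_1' };

ops.set(doc, 'index-op');
console.log(ops.get(doc)); // 'index-op' (lookups work when you hold the key)

// Neither of these compiles/runs:
// for (const [key, value] of ops) {}  // WeakMap is not iterable
// console.log([...ops.keys()]);       // Property 'keys' does not exist
```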

Since we consistently see that the anomalies index has no mappings in these failures, I'm adding more debugging around the creation of the index and mappings, and triggering another run: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5860


rylnd commented May 7, 2024

The last run was unusably verbose, as I neglected to limit debugging to just the tests/calls I cared about. New run: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5861


rylnd commented May 7, 2024

No failures in the previous 60 runs; it's possible that the act of logging the actions before taking them gives enough time for the race condition to resolve consistently. Going to try another 60 to see. https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5882


rylnd commented May 10, 2024

While investigating a bug in ML suppression, I ended up adding the security_solution/anomalies archive to a Cypress test and was immediately met with our familiar error about being unable to parse host.name. I'm now 99% certain that this archive is just bad.

I tracked down the last change to this archive to #133510, which notably modified the data but not the mappings. Checking out the data from before and after that commit and diffing it, we can see what changed:
git diff --no-index --word-diff=porcelain --word-diff-regex=. <(head -n1 data_old.json) <(head -n1 data.json)
diff --git a/dev/fd/63 b/dev/fd/62
--- a/dev/fd/63
+++ b/dev/fd/62
@@ -1 +1 @@
 {"type":"doc","value":{"id":"
+v3_
 linux_anomalous_network_activity_
-ecs_
 record_1586274300000_900_0_-96106189301704594950079884115725560577_5","index":".ml-anomalies-custom-
+v3_
 linux_anomalous_network_activity
-_ecs
 ","source":{"actual":[1],"bucket_span":900,"by_field_name":"process.name","by_field_value":"store","detector_index":0,"function":"rare","function_description":"rare","host.name":["mothra"],"influencers":[{"influencer_field_name":"user.name","influencer_field_values":["root"]},{"influencer_field_name":"process.name","influencer_field_values":["store"]},{"influencer_field_name":"host.name","influencer_field_values":["mothra"]}],"initial_record_score":33.36147565024334,"is_interim":false,"job_id":"
+v3_
 linux_anomalous_network_activity
-_ecs
 ","multi_bucket_impact":0,"probability":0.007820139656036713,"process.name":["store"],"record_score":33.36147565024334,"result_type":"record","timestamp":1605567488000,"typical":[0.007820139656036711],"user.name":["root"]}}}

This is just the first line, but you can see that the id and index of the documents were changed, while the mappings were not. I suspect this mismatch is causing the issues we're seeing, so I'm going to try to resolve it to validate that theory.
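To make the suspected mismatch concrete, here is a rough sketch of how one could list the index names targeted by the archive's documents and compare them against the mappings file by hand. It assumes one JSON record per line in data.json (as in the diff above) and is not part of the actual fix.

```ts
// Hypothetical helper: collect the distinct index names that the archive's
// documents are written to. Assumes newline-delimited records in data.json.
import * as fs from 'fs';

const indices = new Set<string>();

for (const line of fs.readFileSync('data.json', 'utf8').split('\n')) {
  if (!line.trim()) continue;
  const record = JSON.parse(line) as { type: string; value: { index: string } };
  if (record.type === 'doc') indices.add(record.value.index);
}

// Compare this output against the indices defined in mappings.json; any index
// that documents target but mappings never define is a red flag.
console.log([...indices].sort().join('\n'));
```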

rylnd added 2 commits May 10, 2024 13:02
These had previously diverged, and I suspect they are causing sporadic
failures.

rylnd commented May 10, 2024

I updated the mappings to match data in 03f6073, and have triggered a new 100x build on that: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5933.

Edit: and another 100x since the former was flaky: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5934


rylnd commented May 14, 2024

All of the failures in the previous build were test timeouts (3/100) rather than the original parsing error, which raises confidence that the parsing error was due to the data/mapping mismatch. Merging the latest main and running 100x more just to be sure: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5975

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#5975

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 73/100 tests passed.

see run history

With the mappings changes, the error has now changed, but we're still
occasionally getting test failures due to timeouts.

I _suspect_ that the debug logging itself may occasionally be
interrupting or blocking some other process, so I'm removing it and
running this again to see how it behaves.

rylnd commented May 16, 2024

Well, the mapping change has certainly changed the error, but now we're just getting random timeouts on our test runs. I'm going to see if removing the debugging output addresses that at all: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6021

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6021

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 116/200 tests passed.

see run history


rylnd commented Jul 9, 2024

Update: with ML Suppression now merged, most of the data-related fixes are in main. However, we're still seeing the same failures occurring, notably with the new FTR tests that were added/enabled in that PR.

Those tests are now skipped in main, and I'm continuing to pursue the failures on this branch (which has now been updated with latest main).

When we last checked in, the silent failures seemed to be due to no alerts being generated by the rule. Silent failure aside (which I'm also investigating), it's not yet clear whether the alerts aren't being generated due to:

  1. A rule error
  2. The anomalies not being mapped/indexed properly, and thus unavailable to ES
  3. The ML API not returning the indexed anomalies for some other reason.


rylnd commented Jul 9, 2024

New build to see where we're at: build

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6502

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 73/100 tests passed.

see run history

The big idea with this commit is to log both the actual ML API results
used by the rule and the results of a "pretty close" raw ES call. This
should allow us to determine whether the ML API is misbehaving somehow.
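For illustration, a "pretty close" raw ES call might look like the sketch below: searching the anomalies indices directly for records above the rule's threshold. This is an assumption for illustration, not the exact query used by the rule or the debug code.

```ts
// Hedged approximation of an ML anomaly search done directly against ES;
// the real rule goes through the ML APIs and its filters may differ.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function rawAnomalySearch(jobId: string, anomalyThreshold: number) {
  return client.search({
    index: '.ml-anomalies-*',
    size: 100,
    query: {
      bool: {
        filter: [
          { term: { result_type: 'record' } },
          { term: { job_id: jobId } },
          { range: { record_score: { gte: anomalyThreshold } } },
        ],
      },
    },
  });
}

// e.g. rawAnomalySearch('v3_linux_anomalous_network_activity', 50);
```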
Since the tests seem to be hanging waiting for a successful rule run,
let's see how it's failing.

rylnd commented Jul 10, 2024

Running the latest changes here. Lots more output; hopefully we can see whether the ML API is behaving as we expect. If nothing else, we'll see what rule failure is causing the timeout.

@kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6509

[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 100/100 tests passed.

see run history


rylnd commented Jul 10, 2024

No failures in the last run of 100 ☹️. Running another 150 here.

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6516

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 123/150 tests passed.

see run history


rylnd commented Jul 10, 2024

Alright, I think I finally found the cause of these failures in the response to our "setup modules" request to ML. Attaching here for posterity:

Setup Modules Failure Response
{
  "jobs": [
    { "id": "v3_linux_anomalous_network_port_activity", "success": true },
    {
      "id": "v3_linux_anomalous_network_activity",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_network_configuration_discovery",
              "node": "dKzpvp06ScO0OxqHilETEA",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            }
          ]
        },
        "status": 503
      }
    }
  ],
  "datafeeds": [
    {
      "id": "datafeed-v3_linux_anomalous_network_port_activity",
      "success": true,
      "started": false,
      "awaitingMlNodeAllocation": false
    },
    {
      "id": "datafeed-v3_linux_anomalous_network_activity",
      "success": false,
      "started": false,
      "awaitingMlNodeAllocation": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "resource_not_found_exception",
              "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
            }
          ],
          "type": "resource_not_found_exception",
          "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
        },
        "status": 404
      }
    }
  ],
  "kibana": {}
}

I'm still investigating what the error means, but we can see in the most recent build that all of the failures have that same errant response (while the green runs do not), so this looks very promising.

Beyond the error itself, it appears that multiple jobs fail to be set up because of a single job index being unavailable, as can be observed in this run:

Multiple Job Failures due to (reportedly) single job index
{
  "jobs": [
    { "id": "v3_linux_anomalous_network_port_activity", "success": true }, // NB: JOB WAS SUCCESSFUL
    { "id": "v3_linux_rare_metadata_process", "success": true },
    {
      "id": "v3_linux_rare_metadata_user",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity", // NB: FAILURE DUE TO OTHER INDEX
              "node": "OiEtZdepT-ep8cToYLs-7w",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            }
          ]
        },
        "status": 503
      }
    },
    {
      "id": "v3_rare_process_by_host_linux",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            },
            { "type": "no_shard_available_action_exception", "reason": null }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity",
              "node": "OiEtZdepT-ep8cToYLs-7w",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            },
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_network_connection_discovery",
              "node": null,
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": null
              }
            }
          ]
        },
        "status": 503
      }
    },
    {
      "id": "v3_linux_anomalous_network_activity",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            },
            { "type": "no_shard_available_action_exception", "reason": null }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity",
              "node": "OiEtZdepT-ep8cToYLs-7w",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            },
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_network_connection_discovery",
              "node": null,
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": null
              }
            }
          ]
        },
        "status": 503
      }
    }
  ]
}


As a quick fix to the error we occasionally encounter, it might be as
simple as retrying the call. I'll run these changes in the flaky test
runner and see whether the sporadic issues resolve themselves.

rylnd commented Jul 11, 2024

@yctercero pointed out that the solution here might simply be to retry that setup call until all the jobs have been installed. d8334cb accomplishes that, and here is the accompanying 150x flaky run. 🤞
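For reference, a minimal sketch of what that retry could look like; setupMlModulesWithRetry, the attempt count, and the delay are assumptions for illustration rather than the actual implementation in d8334cb.

```ts
// Hedged sketch: call the ML "setup modules" endpoint until every job reports
// success, since the no_shard_available failures above are transient.
interface SetupModulesResponse {
  jobs: Array<{ id: string; success: boolean; error?: unknown }>;
}

async function setupMlModulesWithRetry(
  setupMlModules: () => Promise<SetupModulesResponse>, // the existing setup call
  maxAttempts = 5,
  delayMs = 2000
): Promise<SetupModulesResponse> {
  let lastResponse: SetupModulesResponse | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    lastResponse = await setupMlModules();
    if (lastResponse.jobs.every((job) => job.success)) {
      return lastResponse;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }

  throw new Error(
    `ML module setup did not succeed after ${maxAttempts} attempts: ${JSON.stringify(lastResponse)}`
  );
}
```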

@kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6517

[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 150/150 tests passed.

see run history


rylnd commented Jul 11, 2024

The previous 150x run with the retry logic is green; that most likely means we have a solution! 🎉

However, since the failure rate was so low, I'm running another 200x to see if anything pops up: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6525

@kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6525

[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 200/200 tests passed.

see run history

rylnd added a commit to rylnd/kibana that referenced this pull request Jul 11, 2024
The flakiness here ends up being caused by sporadic unavailability of
shards during module setup. The underlying cause of that unavailability
is likely a race condition between ML, ES, and/or FTR, but luckily we
don't need to worry about that because simply retrying the API call
causes it to eventually succeed.

In those cases, some of the jobs will report a 4xx status, but that's
expected.

This is the result of a lot of prodding and CPU cycles on CI; see elastic#182183 for
the full details.

rylnd commented Jul 11, 2024

Alright, I'm happy with the last 350 tests being green. I'm running another batch of flaky runs on #188155, but I'm going to close this for now.

rylnd closed this Jul 11, 2024
rylnd deleted the ml_rule_ftr_debugging branch July 11, 2024 19:33
rylnd added a commit to rylnd/kibana that referenced this pull request Jul 11, 2024
This call was found to be sporadically failing in elastic#182183. This applies
the same changes made in elastic#188155, but for Cypress tests instead of FTR.
rylnd added a commit that referenced this pull request Jul 12, 2024
## Summary

The full chronicle of this endeavor can be found
[here](#182183), but [this
comment](#182183 (comment))
summarizes the identified issue:

> I [finally
found](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)
the cause of these failures in the response to our "setup modules"
request to ML. Attaching here for posterity:
>
> <details>
> <summary>Setup Modules Failure Response</summary>
> 
> ```json
> {
>   "jobs": [
> { "id": "v3_linux_anomalous_network_port_activity", "success": true },
>     {
>       "id": "v3_linux_anomalous_network_activity",
>       "success": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>             }
>           ],
>           "type": "search_phase_execution_exception",
>           "reason": "all shards failed",
>           "phase": "query",
>           "grouped": true,
>           "failed_shards": [
>             {
>               "shard": 0,
> "index":
".ml-anomalies-custom-v3_linux_network_configuration_discovery",
>               "node": "dKzpvp06ScO0OxqHilETEA",
>               "reason": {
>                 "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>               }
>             }
>           ]
>         },
>         "status": 503
>       }
>     }
>   ],
>   "datafeeds": [
>     {
>       "id": "datafeed-v3_linux_anomalous_network_port_activity",
>       "success": true,
>       "started": false,
>       "awaitingMlNodeAllocation": false
>     },
>     {
>       "id": "datafeed-v3_linux_anomalous_network_activity",
>       "success": false,
>       "started": false,
>       "awaitingMlNodeAllocation": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>             }
>           ],
>           "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>         },
>         "status": 404
>       }
>     }
>   ],
>   "kibana": {}
> }
> 
> ```
> </details>

This branch, then, fixes said issue by (relatively simply) retrying the
failed API call until it succeeds.

### Related Issues
Addresses:
- #171426
- #187478
- #187614
- #182009
- #171426

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] [ESS Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)
- [x] [Serverless Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)


### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
rylnd added a commit to rylnd/kibana that referenced this pull request Jul 12, 2024
(cherry picked from commit 3df635e)
rylnd added a commit that referenced this pull request Jul 12, 2024
… (#188259)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Detection Engine] Addresses Flakiness in ML FTR tests
(#188155)](#188155)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ryland
Herrick","email":"ryalnd@gmail.com"},"sourceCommit":{"committedDate":"2024-07-12T19:10:25Z","message":"[Detection
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](#182183 (comment)
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
#171426
#187478
#187614
#182009
#171426
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b","branchLabelMapping":{"^v8.16.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","backport:skip","Feature:Detection
Rules","Feature:ML Rule","Feature:Security ML Jobs","Feature:Rule
Creation","Team:Detection Engine","Feature:Rule
Edit","v8.16.0"],"number":188155,"url":"#188155
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](#182183 (comment)
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
#171426
#187478
#187614
#182009
#171426
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.16.0","labelRegex":"^v8.16.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/188155","number":188155,"mergeCommit":{"message":"[Detection
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](#182183 (comment)
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
#171426
#187478
#187614
#182009
#171426
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b"}}]}]
BACKPORT-->
rylnd added a commit that referenced this pull request Jul 16, 2024
This API call was found to be sporadically failing in #182183. This
applies the same changes made in #188155, but for Cypress tests instead
of FTR.

Since none of the cypress tests are currently skipped, this PR just
serves to add robustness to the suite, which performs nearly identical
setup to that of the FTR tests. I think the biggest difference is how
often these tests are run vs FTRs. Combined with the low failure rate
for the underlying issue, cypress's auto-retrying may smooth over many
of these failures when they occur.


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] [Detection Engine Cypress - ESS x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)
- [ ] [Detection Engine Cypress - Serverless x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jul 16, 2024
(cherry picked from commit ed934e3)
kibanamachine added a commit that referenced this pull request Jul 16, 2024
(#188483)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Detection Engine] Fix flake in ML Rule Cypress tests
(#188164)](#188164)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ryland
Herrick","email":"ryalnd@gmail.com"},"sourceCommit":{"committedDate":"2024-07-16T19:21:13Z","message":"[Detection
Engine] Fix flake in ML Rule Cypress tests (#188164)\n\nThis API call
was found to be sporadically failing in #182183. This\r\napplies the
same changes made in #188155, but for Cypress tests instead\r\nof
FTR.\r\n\r\nSince none of the cypress tests are currently skipped, this
PR just\r\nserves to add robustness to the suite, which performs nearly
identical\r\nsetup to that of the FTR tests. I think the biggest
difference is how\r\noften these tests are run vs FTRs. Combined with
the low failure rate\r\nfor the underlying issue, cypress's
auto-retrying may smooth over many\r\nof these failures when they
occur.\r\n\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] [Detection Engine Cypress -
ESS
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)\r\n-
[ ] [Detection Engine Cypress - Serverless
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)","sha":"ed934e3253b47a6902904633530ec181037d4946","branchLabelMapping":{"^v8.16.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Detection
Rules","Feature:ML Rule","Feature:Security ML Jobs","Feature:Rule
Creation","backport:prev-minor","Team:Detection Engine","Feature:Rule
Edit","v8.16.0"],"title":"[Detection Engine] Fix flake in ML Rule
Cypress
tests","number":188164,"url":"#188164
Engine] Fix flake in ML Rule Cypress tests (#188164)\n\nThis API call
was found to be sporadically failing in #182183. This\r\napplies the
same changes made in #188155, but for Cypress tests instead\r\nof
FTR.\r\n\r\nSince none of the cypress tests are currently skipped, this
PR just\r\nserves to add robustness to the suite, which performs nearly
identical\r\nsetup to that of the FTR tests. I think the biggest
difference is how\r\noften these tests are run vs FTRs. Combined with
the low failure rate\r\nfor the underlying issue, cypress's
auto-retrying may smooth over many\r\nof these failures when they
occur.\r\n\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] [Detection Engine Cypress -
ESS
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)\r\n-
[ ] [Detection Engine Cypress - Serverless
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)","sha":"ed934e3253b47a6902904633530ec181037d4946"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/188164","number":188164,"mergeCommit":{"message":"[Detection
Engine] Fix flake in ML Rule Cypress tests (#188164)\n\nThis API call
was found to be sporadically failing in #182183. This\r\napplies the
same changes made in #188155, but for Cypress tests instead\r\nof
FTR.\r\n\r\nSince none of the cypress tests are currently skipped, this
PR just\r\nserves to add robustness to the suite, which performs nearly
identical\r\nsetup to that of the FTR tests. I think the biggest
difference is how\r\noften these tests are run vs FTRs. Combined with
the low failure rate\r\nfor the underlying issue, cypress's
auto-retrying may smooth over many\r\nof these failures when they
occur.\r\n\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] [Detection Engine Cypress -
ESS
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)\r\n-
[ ] [Detection Engine Cypress - Serverless
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)","sha":"ed934e3253b47a6902904633530ec181037d4946"}}]}]
BACKPORT-->

Co-authored-by: Ryland Herrick <ryalnd@gmail.com>