
[Detection Engine] Fixing ML FTR tests #182183

Closed
rylnd wants to merge 22 commits

Conversation


rylnd commented Apr 30, 2024

Summary

🚧 🚧

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Multiple Spaces—unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service being operational. |
See more potential risk examples

For maintainers


rylnd commented Apr 30, 2024

Flaky run with only ML FTR tests running (25x): https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5819

Let's see if we can't reproduce the failure in isolation.

These don't appear to fail in isolation, so let's see if another test is
what's causing the failure.

rylnd commented Apr 30, 2024

None of the isolated tests failed, so I've added debugging code and I'm now running 50x tests not in isolation: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5821

We got a single failure in 50 executions, but there was no useful debug
info. Trying again with a broader pattern since that might tell us more.

rylnd commented Apr 30, 2024

We got a failure in the above run, but the debugging info did not provide anything useful. I've tried broadening the debugging data we're collecting, and ran it again: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5824

My theory is that this dynamic template is being used in a rare
situation where data is being inserted before the index mappings have
been applied. If true, removing this will (at least) cause a different error to be
produced in the same situation, if not fix the issue.
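For context on the theory above, here is a hedged sketch of the kind of dynamic_templates entry in question; the index name and template body are illustrative, not the archive's actual mapping. When a document arrives before explicit mappings exist, Elasticsearch maps its unseen string fields according to rules like this one.

```ts
// Illustrative only: a dynamic template that maps any new string field as
// keyword the first time a document containing it is indexed.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function createExampleAnomaliesIndex() {
  await client.indices.create({
    index: '.ml-anomalies-custom-example',
    mappings: {
      dynamic_templates: [
        {
          strings_as_keywords: {
            match_mapping_type: 'string',
            mapping: { type: 'keyword' },
          },
        },
      ],
    },
  });
}
```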

rylnd commented May 1, 2024

While there was a failure in the previous round, many of the runs were cancelled, so I ran again: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5825.

The results were mostly the same, although I did observe the tests fail when no anomaly data/mappings were present. That all but eliminates a "dirty environment" as the cause and leaves a race condition as the most likely explanation.

I've triggered another run without the dynamic_template mapping on the index that's being affected, to see if we can change/eliminate the error: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5828


rylnd commented May 1, 2024

Interesting development: the previous run of 60x without the dynamic_template was 100% successful. I'm going to run another 200x to see if that holds, but if so my theory about a race condition between mappings and dynamic_template seems to be correct.

200x build: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5832

This reverts commit fde0334.

The error still occurred when this was absent, meaning it's not
involved.
Something's happening within es_archiver, and I'm trying to figure out
what.

rylnd commented May 3, 2024

I added some more verbose debugging in bdab2be; running another 60x now: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5848

* Debug concurrency setting (Might reduce this to 1 to eliminate that as
  an issue)
* Debug order of file streams (to ensure mappings are being picked up
  first)
* Debug index creation (to see how/whether the ML index is being
  created)
* Debug index creation response (to see if there's some non-fatal
  error/warning).
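To make the last two bullets concrete, here is a hedged sketch of the kind of debug wrapper being described; createIndexWithDebug, the client setup, and the log format are illustrative assumptions, not the actual es_archiver code.

```ts
// Hypothetical debug wrapper: log the index being created and the full create
// response, so non-fatal warnings (e.g. acknowledged: false) become visible.
import { Client } from '@elastic/elasticsearch';
import type { IndicesCreateRequest } from '@elastic/elasticsearch/lib/api/types';

const client = new Client({ node: 'http://localhost:9200' });

async function createIndexWithDebug(request: IndicesCreateRequest) {
  console.debug('[es_archiver debug] creating index', request.index);
  const response = await client.indices.create(request);
  console.debug('[es_archiver debug] create response', JSON.stringify(response));
  return response;
}
```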

rylnd commented May 6, 2024

The previous run had two failures. The debug info showed es_archiver receiving the documents as they're written in the archive (e.g. as host.name: '' and not host: { name: '' }), but since ops is a WeakMap we can't really enumerate and print its contents.
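As a quick aside on why those ops can't just be dumped (a minimal illustration; the actual es_archiver structure may differ): a WeakMap holds its keys weakly and exposes no way to enumerate them, so unless you already hold a reference to every key there is nothing to print.

```ts
// WeakMap intentionally provides only get/set/has/delete; there is no keys(),
// entries(), forEach(), or iterator, so its contents cannot be listed.
const ops = new WeakMap<object, string>();
const doc = { id: 'record_1' };

ops.set(doc, 'index-op');
console.log(ops.get(doc)); // 'index-op' (lookups work when you hold the key)

// Neither of these compiles/runs:
// for (const [key, value] of ops) {}  // WeakMap is not iterable
// console.log([...ops.keys()]);       // Property 'keys' does not exist
```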

Since we consistently see that the anomalies index has no mappings in these failures, I'm adding more debugging around the creation of the index and mappings, and triggering another run: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5860


rylnd commented May 7, 2024

The last run was unusably verbose, as I neglected to limit debugging to just the tests/calls I cared about. New run: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5861


rylnd commented May 7, 2024

No failures in the previous 60 runs; it's possible that the act of logging the actions before taking them gives enough time for the race condition to resolve consistently. Going to try another 60 to see. https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5882


rylnd commented May 10, 2024

While investigating a bug in ML suppression, I ended up adding the security_solution/anomalies archive to a Cypress test and was immediately met with our familiar error about being unable to parse host.name. I'm now 99% certain that this archive is just bad.

I tracked down the last change to this archive to #133510, which notably modified the data but not the mappings. Checking out the data from before and after that commit and diffing it, we can see what changed:
git diff --no-index --word-diff=porcelain --word-diff-regex=. <(head -n1 data_old.json) <(head -n1 data.json)
diff --git a/dev/fd/63 b/dev/fd/62
--- a/dev/fd/63
+++ b/dev/fd/62
@@ -1 +1 @@
 {"type":"doc","value":{"id":"
+v3_
 linux_anomalous_network_activity_
-ecs_
 record_1586274300000_900_0_-96106189301704594950079884115725560577_5","index":".ml-anomalies-custom-
+v3_
 linux_anomalous_network_activity
-_ecs
 ","source":{"actual":[1],"bucket_span":900,"by_field_name":"process.name","by_field_value":"store","detector_index":0,"function":"rare","function_description":"rare","host.name":["mothra"],"influencers":[{"influencer_field_name":"user.name","influencer_field_values":["root"]},{"influencer_field_name":"process.name","influencer_field_values":["store"]},{"influencer_field_name":"host.name","influencer_field_values":["mothra"]}],"initial_record_score":33.36147565024334,"is_interim":false,"job_id":"
+v3_
 linux_anomalous_network_activity
-_ecs
 ","multi_bucket_impact":0,"probability":0.007820139656036713,"process.name":["store"],"record_score":33.36147565024334,"result_type":"record","timestamp":1605567488000,"typical":[0.007820139656036711],"user.name":["root"]}}}

This is just the first line, but you can see that the id and index of the documents were changed, while the mappings were not. I suspect this mismatch is causing the issues we're seeing, so I'm going to try to resolve it to validate that theory.
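To make the suspected mismatch concrete, here is a rough sketch of how one could list the index names targeted by the archive's documents and compare them against the mappings file by hand. It assumes one JSON record per line in data.json (as in the diff above) and is not part of the actual fix.

```ts
// Hypothetical helper: collect the distinct index names that the archive's
// documents are written to. Assumes newline-delimited records in data.json.
import * as fs from 'fs';

const indices = new Set<string>();

for (const line of fs.readFileSync('data.json', 'utf8').split('\n')) {
  if (!line.trim()) continue;
  const record = JSON.parse(line) as { type: string; value: { index: string } };
  if (record.type === 'doc') indices.add(record.value.index);
}

// Compare this output against the indices defined in mappings.json; any index
// that documents target but mappings never define is a red flag.
console.log([...indices].sort().join('\n'));
```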

rylnd added 2 commits May 10, 2024 13:02
These had previously diverged, and I suspect they are causing sporadic
failures.

rylnd commented May 10, 2024

I updated the mappings to match data in 03f6073, and have triggered a new 100x build on that: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5933.

Edit: and another 100x since the former was flaky: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5934


rylnd commented May 14, 2024

All of the failures in the previous build were test timeouts (3/100) rather than the original parsing error, which raises confidence that the parsing error was due to the data/mapping mismatch. Merging the latest main and running 100x more just to be sure: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5975

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#5975

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 73/100 tests passed.

see run history

With the mappings changes, the error has now changed, but we're still
occasionally getting test failures due to timeouts.

I _suspect_ that the debug logging itself may occasionally be
interrupting or blocking some other process, so I'm removing it and
running this again to see how it behaves.

rylnd commented May 16, 2024

Well, the mapping change has certainly changed the error, but now we're just getting random timeouts on our test runs. I'm going to see if removing the debugging output addresses that at all: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6021

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6021

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 116/200 tests passed.

see run history


rylnd commented Jul 9, 2024

Update: with ML Suppression now merged, most of the data-related fixes are in main. However, we're still seeing the same failures occurring, notably with the new FTR tests that were added/enabled in that PR.

Those tests are now skipped in main, and I'm continuing to pursue the failures on this branch (which has now been updated with latest main).

When we last checked in, the silent failures seemed to be due to no alerts being generated by the rule. Silent failure aside (which I'm also investigating), it's not yet clear whether the alerts aren't being generated due to:

  1. A rule error
  2. The anomalies not being mapped/indexed properly, and thus unavailable to ES
  3. The ML API not returning the indexed anomalies for some other reason.


rylnd commented Jul 9, 2024

New build to see where we're at: build

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6502

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 73/100 tests passed.

see run history

The big idea with this commit is to log both the actual ML API results
used by the rule and the results of a "pretty close" raw ES call. This
should allow us to determine whether the ML API is misbehaving somehow.
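For illustration, a "pretty close" raw ES call might look like the sketch below: searching the anomalies indices directly for records above the rule's threshold. This is an assumption for illustration, not the exact query used by the rule or the debug code.

```ts
// Hedged approximation of an ML anomaly search done directly against ES;
// the real rule goes through the ML APIs and its filters may differ.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function rawAnomalySearch(jobId: string, anomalyThreshold: number) {
  return client.search({
    index: '.ml-anomalies-*',
    size: 100,
    query: {
      bool: {
        filter: [
          { term: { result_type: 'record' } },
          { term: { job_id: jobId } },
          { range: { record_score: { gte: anomalyThreshold } } },
        ],
      },
    },
  });
}

// e.g. rawAnomalySearch('v3_linux_anomalous_network_activity', 50);
```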
Since the tests seem to be hanging waiting for a successful rule run,
let's see how it's failing.

rylnd commented Jul 10, 2024

Running the latest changes here. Lots more output; hopefully we can see whether the ML API is behaving as we expect. If nothing else, we'll see what rule failure is causing the timeout.

@kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6509

[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 100/100 tests passed.

see run history


rylnd commented Jul 10, 2024

No failures in the last run of 100 ☹️. Running another 150 here.

@kibanamachine

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6516

[❌] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 123/150 tests passed.

see run history


rylnd commented Jul 10, 2024

Alright, I think I finally found the cause of these failures in the response to our "setup modules" request to ML. Attaching here for posterity:

Setup Modules Failure Response
{
  "jobs": [
    { "id": "v3_linux_anomalous_network_port_activity", "success": true },
    {
      "id": "v3_linux_anomalous_network_activity",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_network_configuration_discovery",
              "node": "dKzpvp06ScO0OxqHilETEA",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            }
          ]
        },
        "status": 503
      }
    }
  ],
  "datafeeds": [
    {
      "id": "datafeed-v3_linux_anomalous_network_port_activity",
      "success": true,
      "started": false,
      "awaitingMlNodeAllocation": false
    },
    {
      "id": "datafeed-v3_linux_anomalous_network_activity",
      "success": false,
      "started": false,
      "awaitingMlNodeAllocation": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "resource_not_found_exception",
              "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
            }
          ],
          "type": "resource_not_found_exception",
          "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
        },
        "status": 404
      }
    }
  ],
  "kibana": {}
}

I'm still investigating what the error means, but we can see in the most recent build that all of the failures have that same errant response (while the green runs do not), so this looks very promising.

Beyond the error itself, it appears that multiple jobs fail to be set up because of a single job index being unavailable, as can be observed in this run:

Multiple Job Failures due to (reportedly) single job index
{
  "jobs": [
    { "id": "v3_linux_anomalous_network_port_activity", "success": true }, // NB: JOB WAS SUCCESSFUL
    { "id": "v3_linux_rare_metadata_process", "success": true },
    {
      "id": "v3_linux_rare_metadata_user",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity", // NB: FAILURE DUE TO OTHER INDEX
              "node": "OiEtZdepT-ep8cToYLs-7w",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            }
          ]
        },
        "status": 503
      }
    },
    {
      "id": "v3_rare_process_by_host_linux",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            },
            { "type": "no_shard_available_action_exception", "reason": null }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity",
              "node": "OiEtZdepT-ep8cToYLs-7w",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            },
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_network_connection_discovery",
              "node": null,
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": null
              }
            }
          ]
        },
        "status": 503
      }
    },
    {
      "id": "v3_linux_anomalous_network_activity",
      "success": false,
      "error": {
        "error": {
          "root_cause": [
            {
              "type": "no_shard_available_action_exception",
              "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
            },
            { "type": "no_shard_available_action_exception", "reason": null }
          ],
          "type": "search_phase_execution_exception",
          "reason": "all shards failed",
          "phase": "query",
          "grouped": true,
          "failed_shards": [
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_anomalous_network_port_activity",
              "node": "OiEtZdepT-ep8cToYLs-7w",
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": "[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
              }
            },
            {
              "shard": 0,
              "index": ".ml-anomalies-custom-v3_linux_network_connection_discovery",
              "node": null,
              "reason": {
                "type": "no_shard_available_action_exception",
                "reason": null
              }
            }
          ]
        },
        "status": 503
      }
    }
  ]
}


As a quick fix to the error we occasionally encounter, it might be as
simple as retrying the call. I'll run these changes in the flaky test
runner and see whether the sporadic issues resolve themselves.

rylnd commented Jul 11, 2024

@yctercero pointed out that the solution here might simply be to retry that setup call until all the jobs have been installed. d8334cb accomplishes that, and here is the accompanying 150x flaky run. 🤞
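For reference, a minimal sketch of what that retry could look like; setupMlModulesWithRetry, the attempt count, and the delay are assumptions for illustration rather than the actual implementation in d8334cb.

```ts
// Hedged sketch: call the ML "setup modules" endpoint until every job reports
// success, since the no_shard_available failures above are transient.
interface SetupModulesResponse {
  jobs: Array<{ id: string; success: boolean; error?: unknown }>;
}

async function setupMlModulesWithRetry(
  setupMlModules: () => Promise<SetupModulesResponse>, // the existing setup call
  maxAttempts = 5,
  delayMs = 2000
): Promise<SetupModulesResponse> {
  let lastResponse: SetupModulesResponse | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    lastResponse = await setupMlModules();
    if (lastResponse.jobs.every((job) => job.success)) {
      return lastResponse;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }

  throw new Error(
    `ML module setup did not succeed after ${maxAttempts} attempts: ${JSON.stringify(lastResponse)}`
  );
}
```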

@kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6517

[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 150/150 tests passed.

see run history


rylnd commented Jul 11, 2024

The previous 150x run with the retry logic is green; that most likely means we have a solution! 🎉

However, since the failure rate was so low, I'm running another 200x to see if anything pops up: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6525

@kibanamachine

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6525

[✅] x-pack/test/security_solution_api_integration/test_suites/detections_response/detection_engine/rule_execution_logic/trial_license_complete_tier/configs/ess.config.ts: 200/200 tests passed.

see run history

rylnd added a commit to rylnd/kibana that referenced this pull request Jul 11, 2024
The flakiness here ends up being caused by sporadic unavailability of
shards during module setup. The underlying cause of that unavailability
is likely a race condition between ML, ES, and/or FTR, but luckily we
don't need to worry about that because simply retrying the API call
causes it to eventually succeed.

In those cases, some of the jobs will report a 4xx status, but that's
expected.

This is the result of a lot of prodding and CPU cycles on CI; see elastic#182183 for
the full details.

rylnd commented Jul 11, 2024

Alright, I'm happy with the last 350 tests being green. I'm running another batch of flaky runs on #188155, but I'm going to close this for now.

rylnd closed this Jul 11, 2024
rylnd deleted the ml_rule_ftr_debugging branch July 11, 2024 19:33
rylnd added a commit to rylnd/kibana that referenced this pull request Jul 11, 2024
This call was found to be sporadically failing in elastic#182183. This applies
the same changes made in elastic#188155, but for Cypress tests instead of FTR.
rylnd added a commit that referenced this pull request Jul 12, 2024
## Summary

The full chronicle of this endeavor can be found
[here](#182183), but [this
comment](#182183 (comment))
summarizes the identified issue:

> I [finally
found](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)
the cause of these failures in the response to our "setup modules"
request to ML. Attaching here for posterity:
>
> <details>
> <summary>Setup Modules Failure Response</summary>
> 
> ```json
> {
>   "jobs": [
> { "id": "v3_linux_anomalous_network_port_activity", "success": true },
>     {
>       "id": "v3_linux_anomalous_network_activity",
>       "success": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>             }
>           ],
>           "type": "search_phase_execution_exception",
>           "reason": "all shards failed",
>           "phase": "query",
>           "grouped": true,
>           "failed_shards": [
>             {
>               "shard": 0,
> "index":
".ml-anomalies-custom-v3_linux_network_configuration_discovery",
>               "node": "dKzpvp06ScO0OxqHilETEA",
>               "reason": {
>                 "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>               }
>             }
>           ]
>         },
>         "status": 503
>       }
>     }
>   ],
>   "datafeeds": [
>     {
>       "id": "datafeed-v3_linux_anomalous_network_port_activity",
>       "success": true,
>       "started": false,
>       "awaitingMlNodeAllocation": false
>     },
>     {
>       "id": "datafeed-v3_linux_anomalous_network_activity",
>       "success": false,
>       "started": false,
>       "awaitingMlNodeAllocation": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>             }
>           ],
>           "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>         },
>         "status": 404
>       }
>     }
>   ],
>   "kibana": {}
> }
> 
> ```
> </details>

This branch, then, fixes said issue by (relatively simply) retrying the
failed API call until it succeeds.

### Related Issues
Addresses:
- #171426
- #187478
- #187614
- #182009
- #171426

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] [ESS Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)
- [x] [Serverless Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)


### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
rylnd added a commit to rylnd/kibana that referenced this pull request Jul 12, 2024
(cherry picked from commit 3df635e)
rylnd added a commit that referenced this pull request Jul 12, 2024
… (#188259)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Detection Engine] Addresses Flakiness in ML FTR tests
(#188155)](#188155)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ryland
Herrick","email":"ryalnd@gmail.com"},"sourceCommit":{"committedDate":"2024-07-12T19:10:25Z","message":"[Detection
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](#182183 (comment)
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
#171426
#187478
#187614
#182009
#171426
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b","branchLabelMapping":{"^v8.16.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","backport:skip","Feature:Detection
Rules","Feature:ML Rule","Feature:Security ML Jobs","Feature:Rule
Creation","Team:Detection Engine","Feature:Rule
Edit","v8.16.0"],"number":188155,"url":"#188155
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](#182183 (comment)
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
#171426
#187478
#187614
#182009
#171426
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.16.0","labelRegex":"^v8.16.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/188155","number":188155,"mergeCommit":{"message":"[Detection
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](#182183 (comment)
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
#171426
#187478
#187614
#182009
#171426
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b"}}]}]
BACKPORT-->
rylnd added a commit that referenced this pull request Jul 16, 2024
This API call was found to be sporadically failing in #182183. This
applies the same changes made in #188155, but for Cypress tests instead
of FTR.

Since none of the cypress tests are currently skipped, this PR just
serves to add robustness to the suite, which performs nearly identical
setup to that of the FTR tests. I think the biggest difference is how
often these tests are run vs FTRs. Combined with the low failure rate
for the underlying issue, cypress's auto-retrying may smooth over many
of these failures when they occur.


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] [Detection Engine Cypress - ESS x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)
- [ ] [Detection Engine Cypress - Serverless x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Jul 16, 2024
(cherry picked from commit ed934e3)
kibanamachine added a commit that referenced this pull request Jul 16, 2024
(#188483)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Detection Engine] Fix flake in ML Rule Cypress tests
(#188164)](#188164)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ryland
Herrick","email":"ryalnd@gmail.com"},"sourceCommit":{"committedDate":"2024-07-16T19:21:13Z","message":"[Detection
Engine] Fix flake in ML Rule Cypress tests (#188164)\n\nThis API call
was found to be sporadically failing in #182183. This\r\napplies the
same changes made in #188155, but for Cypress tests instead\r\nof
FTR.\r\n\r\nSince none of the cypress tests are currently skipped, this
PR just\r\nserves to add robustness to the suite, which performs nearly
identical\r\nsetup to that of the FTR tests. I think the biggest
difference is how\r\noften these tests are run vs FTRs. Combined with
the low failure rate\r\nfor the underlying issue, cypress's
auto-retrying may smooth over many\r\nof these failures when they
occur.\r\n\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] [Detection Engine Cypress -
ESS
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)\r\n-
[ ] [Detection Engine Cypress - Serverless
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)","sha":"ed934e3253b47a6902904633530ec181037d4946","branchLabelMapping":{"^v8.16.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Detection
Rules","Feature:ML Rule","Feature:Security ML Jobs","Feature:Rule
Creation","backport:prev-minor","Team:Detection Engine","Feature:Rule
Edit","v8.16.0"],"title":"[Detection Engine] Fix flake in ML Rule
Cypress
tests","number":188164,"url":"#188164
Engine] Fix flake in ML Rule Cypress tests (#188164)\n\nThis API call
was found to be sporadically failing in #182183. This\r\napplies the
same changes made in #188155, but for Cypress tests instead\r\nof
FTR.\r\n\r\nSince none of the cypress tests are currently skipped, this
PR just\r\nserves to add robustness to the suite, which performs nearly
identical\r\nsetup to that of the FTR tests. I think the biggest
difference is how\r\noften these tests are run vs FTRs. Combined with
the low failure rate\r\nfor the underlying issue, cypress's
auto-retrying may smooth over many\r\nof these failures when they
occur.\r\n\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] [Detection Engine Cypress -
ESS
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)\r\n-
[ ] [Detection Engine Cypress - Serverless
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)","sha":"ed934e3253b47a6902904633530ec181037d4946"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/188164","number":188164,"mergeCommit":{"message":"[Detection
Engine] Fix flake in ML Rule Cypress tests (#188164)\n\nThis API call
was found to be sporadically failing in #182183. This\r\napplies the
same changes made in #188155, but for Cypress tests instead\r\nof
FTR.\r\n\r\nSince none of the cypress tests are currently skipped, this
PR just\r\nserves to add robustness to the suite, which performs nearly
identical\r\nsetup to that of the FTR tests. I think the biggest
difference is how\r\noften these tests are run vs FTRs. Combined with
the low failure rate\r\nfor the underlying issue, cypress's
auto-retrying may smooth over many\r\nof these failures when they
occur.\r\n\r\n\r\n### Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] [Detection Engine Cypress -
ESS
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6530)\r\n-
[ ] [Detection Engine Cypress - Serverless
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6531)","sha":"ed934e3253b47a6902904633530ec181037d4946"}}]}]
BACKPORT-->

Co-authored-by: Ryland Herrick <ryalnd@gmail.com>