
[Response Ops][Alerting] Investigate auto-healing when no write index is set for alerts as data alias #179829

Closed · 2 tasks

ymao1 opened this issue Apr 2, 2024 · 1 comment · Fixed by #184161

Labels: Feature:Alerting · Feature:Alerting/Alerts-as-Data (Issues related to Alerts-as-data and RuleRegistry) · research · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

ymao1 (Contributor) commented Apr 2, 2024

We've run into multiple SDHs where concrete indices exist for an alerts-as-data resource but none of them are set as the write index for an alias. In this scenario we throw an error and alert resources fail to install. We should look into whether we can automatically pick a concrete index and set it as the write index to avoid these types of failures. It seems this scenario can occur when alerts indices are restored from snapshot.
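
For reference, the broken state is visible from Dev Tools. Using the default stack alerts alias from the verification steps below (other rule contexts use different alias names), resource installation fails when none of the concrete indices behind the alias report `"is_write_index": true`:

```
GET _alias/.alerts-stack.alerts-default
```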

Definition of Done

  • System automatically selects a write index when none is set
  • System no longer fails to install the alert resources
@ymao1 ymao1 added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry labels Apr 2, 2024
elasticmachine (Contributor) commented:

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 added the research label Apr 2, 2024
@doakalexi doakalexi self-assigned this May 17, 2024
doakalexi added a commit that referenced this issue May 29, 2024
… alerts as data alias (#184161)

Resolves #179829

## Summary

We've run into multiple SDHs where concrete indices exist for an
alerts-as-data resource but none of them are set as the write index for
an alias. This PR adds code to pick a concrete index and set it as the
write index to avoid these types of failures.
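
The manual equivalent of that fix, sketched here with the default stack alerts index and alias names used in the verification steps below (the PR performs this selection programmatically during resource installation, so the index it picks may differ), is an aliases update that promotes an existing concrete index to write index:

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000001",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": true
      }
    }
  ]
}
```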

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios


### To verify

1. Go to [dev tools](http://localhost:5601/app/dev_tools#/console)
2. Create an ES Query rule
```
POST kbn:/api/alerting/rule
{
  "params": {
    "searchType": "esQuery",
    "timeWindowSize": 5,
    "timeWindowUnit": "m",
    "threshold": [
      -1
    ],
    "thresholdComparator": ">",
    "size": 100,
    "esQuery": "{\n    \"query\":{\n      \"match_all\" : {}\n    }\n  }",
    "aggType": "count",
    "groupBy": "all",
    "termSize": 5,
    "excludeHitsFromPreviousRun": false,
    "sourceFields": [],
    "index": [
      ".kibana"
    ],
    "timeField": "created_at"
  },
  "consumer": "stackAlerts",
  "schedule": {
    "interval": "1m"
  },
  "tags": [],
  "name": "test",
  "rule_type_id": ".es-query",
  "actions": []
}
```
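
Creating the rule also installs the alerts-as-data resources for the stack alerts context. Assuming a default space and no pre-existing alerts indices, you can confirm that the concrete backing index exists before modifying its alias:

```
GET _cat/indices/.internal.alerts-stack.alerts-default-*?v
```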

3. Run the following commands to set `"is_write_index": false`
```
POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": ".internal.alerts-stack.alerts-default-000001",
        "alias": ".alerts-stack.alerts-default"
      }
    }, {
      "add": {
        "index": ".internal.alerts-stack.alerts-default-000001",
        "alias": ".alerts-stack.alerts-default",
        "is_write_index": false
      }
    }
  ]
}

GET .internal.alerts-stack.alerts-default-000001/_alias/*
```
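
After the update, the GET should report that the concrete index is no longer the write index for the alias; the response follows the standard get-alias format (additional alias metadata may also be present) and should look roughly like:

```
{
  ".internal.alerts-stack.alerts-default-000001": {
    "aliases": {
      ".alerts-stack.alerts-default": {
        "is_write_index": false
      }
    }
  }
}
```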
4. Stop Kibana, but keep ES running
5. Start Kibana and verify that the rule runs successfully
6. Run the GET alias command to verify `"is_write_index": true`
```
GET .internal.alerts-stack.alerts-default-000001/_alias/*
```
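
With the fix in place, the response should now show the index promoted back to write index, roughly:

```
{
  ".internal.alerts-stack.alerts-default-000001": {
    "aliases": {
      ".alerts-stack.alerts-default": {
        "is_write_index": true
      }
    }
  }
}
```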