
[RAC][Meta] Consolidate the two indexing implementations in rule_registry plugin #101016

banderror opened this issue May 31, 2021 · 25 comments
@banderror commented May 31, 2021

Summary

Consolidate #98353 and #98935 into a single implementation.

We ended up with four implementations of index management/writing/reading related to the problem we're trying to solve in RAC: two in rule_registry (RuleDataService and EventLogService), one in security_solution (for the .siem-signals index), and one in the event_log plugin. We should compare them, weigh their strengths and weaknesses, and build a consolidated implementation in rule_registry.

High-level plan:

Regarding robust index bootstrapping, consider this:

  • Race conditions during index bootstrapping should be handled one way or another. Possible options: a) robust idempotent logic with error handling (see the sketch after this list); b) leveraging task_manager to make sure the bootstrapping procedure runs only once at a time; c) using some sort of distributed lock; d) maybe something else I'm missing. We could also check how the Saved Objects service bootstraps the .kibana index.
  • Errors should be handled correctly. Pay special attention to errors from Elasticsearch APIs.
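
For option "a", here's a minimal sketch of idempotent, race-tolerant index creation (the client shape follows the 7.x Elasticsearch JS client with `{ body }` responses; names are illustrative, not actual rule_registry code):

```ts
// Minimal sketch: create an index only if missing, tolerating a concurrent
// Kibana instance winning the race. Illustrative, not the actual implementation.
async function createIndexIfNotExists(clusterClient: any, index: string): Promise<void> {
  const { body: exists } = await clusterClient.indices.exists({ index });
  if (exists) {
    return;
  }
  try {
    await clusterClient.indices.create({ index });
  } catch (e) {
    // Another instance may have created the index between our check and the
    // create call; treat that as success.
    if (e?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
      throw e;
    }
  }
}
```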

When the consolidated implementation is ready, make sure to update all references to it from plugins which already use it: security_solution, observability, apm, uptime, etc.

Tasks for 7.15

Tasks for 7.16

Backlog

Indexing and index bootstrapping logic:

API enhancements for RuleDataService and RuleDataClient:

User-defined resources:

Misc:

Consider these as well:

@elasticmachine

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@banderror

We had a chat with @marshallmain last week, and here's a summary of what we discussed and how we decided to proceed with the consolidated implementation:

  • We'd like the consolidated implementation to have an API mostly compatible with EventLogService ([RAC] Rule monitoring: Event Log for Rule Registry #98353), but many implementation details (especially around index bootstrapping) taken from RuleDataService/RuleDataClient ([RAC] Decouple registry from alerts-as-data client #98935).
  • We'd like it to provide a schema for documents (alerts, events) so that we could have static compiler checks in the code.
  • We'd like to decouple the implementation from FieldMap. We'd leave FieldMap in the code, but remove it from generic parameters of classes and functions, making it a helper utility, rather than a hard dependency of the consolidated implementation.
  • We'd like the implementation to be based on component templates and the new index template API.
    • Leveraging component templates would allow us to address one specific problem we have in Security Solution with users who need to be able to extend or override detection alert mappings.
  • We'd like to have index bootstrapping logic similar to what we already have in Security Solution.
  • We'd like component/index templates to be versioned, so that when we change the schema in the code, we can update the templates and roll over indices, all driven by the versioning.

@banderror

I opened a draft PR (#101453) for the consolidated implementation. Here's a comparison of the 3 implementations in rule_registry:

| Feature | RuleDataService | EventLogService | Consolidated implementation |
| --- | --- | --- | --- |
| Schema | Implicit, schema-less (indexing function accepts documents as `any`) | Explicit schema for documents | Explicit schema for documents |
| Dependency on FieldMap | No hard dependency | Hard dependency | No hard dependency |
| Component templates | Yes | No | Yes |
| Versioning of schema, mappings, settings | No | No | Yes |
| Soft document migrations (index rollover) | No | No | Yes |
| Index management API | Imperative | Declarative | Declarative |
| Index management encapsulation | Low-level methods are exposed to clients; clients are responsible for correct index management | Low-level methods are encapsulated; the mechanism is responsible for index management | Low-level methods are encapsulated; the mechanism is responsible for index management |

@banderror commented Jun 9, 2021

One more thing I forgot to mention is about component templates. In the consolidated implementation, I'm proposing the following "architecture" for templates (as you can see here and here in the code):

During index bootstrapping a number of templates will be created by the Event Log mechanism. They will have a certain order of precedence and the "next" template will override properties from all the "previous" ones. Here's the list of templates from "start" (most generic, least precedence) to "finish" (most specific, most precedence):

  1. Mechanism-level .alerts-mappings component template. Specified internally by the Event Log mechanism. Contains index mappings common to all logs (observability alerts, security execution events, etc).
  2. Mechanism-level .alerts-settings component template. Specified internally by the Event Log mechanism. Contains index settings which make sense to all logs by default.
  3. Log-level .alerts-{log.name}-app application-defined component template. Specified and versioned externally by the application (plugin) which defines the log. Contains index mappings and/or settings specific to this particular log. This is the place where you as application developer can override or extend the default framework mappings and settings.
  4. Log-level .alerts-{log.name}-user user-defined component template. Specified internally by the Event Log mechanism, is empty, not versioned. By updating it, the user can override default mappings and settings.
  5. Log-level .alerts-{log.name}-user-{spaceId} user-defined space-aware component template. Specified internally by the Event Log mechanism, is empty, not versioned. By updating it, the user can override default mappings and settings of the log in a certain Kibana space.
  6. Log-level .alerts-{log.name}-{spaceId}-template index template. Its version and most of its options can be specified externally by the application (plugin) which defines the log. This is the place where you as application developer can override user settings. However, mind that the Event Log mechanism has the last word and injects some hard defaults into the final index template to make sure it works as it should.

Template #6 overrides #5, which overrides #4, which overrides #3, etc. More on composing multiple templates in the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-template.html#multiple-component-templates

Mechanism-level .alerts-mappings and .alerts-settings component templates are managed purely by the mechanism, bootstrapped on rule_registry plugin start, and common to all logs defined via the consolidated implementation; it's not possible for a client (application/plugin) to change them. There are only 2 of them in the cluster.

All the other log-level templates (3 component templates and 1 index template) are created per each log defined via the consolidated implementation.

As a developer defining a log (and thus its mappings and settings), you are able to specify templates #3 (application-defined component template) and optionally #6 (index template). A sketch of how this template chain could be wired up follows below.
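
For illustration, here's a minimal sketch of bootstrapping such a chain with the Elasticsearch template APIs (template names follow the convention above; the code is a simplified assumption, not the actual implementation):

```ts
// Sketch: mechanism-level building blocks are created once per cluster; the
// final index template composes all component templates. In composed_of,
// later entries take precedence over earlier ones.
async function bootstrapLogTemplates(clusterClient: any, logName: string, spaceId: string) {
  // (1) Mechanism-level mappings, shared by all logs.
  await clusterClient.cluster.putComponentTemplate({
    name: '.alerts-mappings',
    body: { template: { mappings: { properties: { '@timestamp': { type: 'date' } } } } },
  });

  // (6) Log-level index template, composing templates (1)-(5).
  await clusterClient.indices.putIndexTemplate({
    name: `.alerts-${logName}-${spaceId}-template`,
    body: {
      index_patterns: [`.alerts-${logName}-${spaceId}-*`],
      composed_of: [
        '.alerts-mappings',
        '.alerts-settings',
        `.alerts-${logName}-app`,
        `.alerts-${logName}-user`,
        `.alerts-${logName}-user-${spaceId}`,
      ],
    },
  });
}
```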

@banderror

@marshallmain @madirey @peluja1012 @xcrzx @ecezalp @spong here's my 2 cents on our consolidation efforts and #101453 vs #102586

This is all just my subjective opinion, my understanding might be incomplete, and I don't have any strong opinions on anything below.

#101453

Pros:

  • Provides an explicit TypeScript schema for documents (alerts, events) based on io-ts. This gives us 2 things: 1) static compiler checks and 2) ability to implement validation-on-write and/or normalization-on-read, if we wish to add it. Not a critical thing, but very nice to have, and as we discussed (and as far as I understood), we will likely want to have this in the final implementation.
  • Provides a declarative way of defining a log (its name, TypeScript schema, templates, ILM policy). Instead of imperatively calling a bunch of methods to bootstrap a log, a developer defines an object describing it, and then resolves this definition to get the actual log client.
  • This design allows us to encapsulate the mechanism of index bootstrapping, which is a good thing in my opinion. I'm not sure the implementation should give solutions a lot of flexibility in terms of index bootstrapping; this might, and likely will, lead to inconsistencies in the naming of index and component templates, how many templates each solution uses, when it bootstraps them, in which order, etc. I just don't think this flexibility is necessary. I feel like encapsulating low-level details is better than exposing them via methods and making a solution responsible for proper index bootstrapping.
  • I tried to add this sort of flexibility by introducing a system of standardised component templates (see [RAC][Meta] Consolidate the two indexing implementations in rule_registry plugin #101016 (comment)). This is a draft and could be tweaked for our needs; e.g. we could remove space id from there and make all these parts shared between spaces. This is an implementation detail.

Cons:

  • All the current clients already using RuleDataService would need to be updated and migrated to the implementation based on EventLogService. This should not be a huge deal, but considering the fact that we wanted to have a consolidated implementation by 7.14 FF, this is now a big risk.
  • This migration would need to be reviewed and approved by the Observability team => so it would take even more time and could add friction.

#102586

Pros:

  • Fixes the most important part (index bootstrapping) in RuleDataService. Even without explicit schema and everything else, this should be enough to use this as a consolidated implementation. So we could move forward sooner.

Cons:

  • Opposite to the draft based on EventLogService: no schema, no encapsulation (exposing low-level methods for index bootstrapping), imho too much flexibility and so on.

Random thoughts regarding a system of component templates and their versioning

I really liked the implementation in #102586 which puts a version in the _meta of each template => which gets propagated to the final index template => which gets propagated to the index:

```
_meta: {
  "versions": {
    ".alerts-security-solution-index-template": 1,
    ".alerts-security-solution-mappings": 1,
    ".alerts-ecs-mappings": 1,
    ".alerts-technical-mappings": 1
  }
}
```

I also like the idea of lazy index creation. FWIW, in #101453 index bootstrapping is also done lazily. The difference is that in #102586 more parts are shared between spaces, so 2/3 of the bootstrapping happens outside of RuleDataClient (the client of a concrete log) and 1/3 inside (index rollover or creation), whereas in #101453 most of the index bootstrapping is done by the client of a concrete log.

I'm wondering, however, what you think about standardizing templates similar to how I did it in the draft (see #101016 (comment)).

Random thoughts regarding the new agreement on namespacing strategy

Let me copy-paste it here:

Rationale and proposed UI for the {namespace} portion
The {namespace} portion of the index naming strategy is meant to provide flexibility for our users to segment the Alerts data in different ways, similar to how the data stream naming strategy has a {namespace} that can be used for different use cases. For example, users might want different ILM policies for the alerts created by some rules compared to others.

We generally don’t want to make a direct connection between this concept of {namespace} and Kibana spaces. Users can choose to include the Kibana space id in the {namespace}, but they might also choose some other means to segment the data.

Therefore, we’d like the {namespace} to be configurable on a per-rule basis via a simple text input widget. The value of the {namespace} is saved in the Rule object. In addition, there will be a Kibana advanced setting per space that defines the default value of this input widget.

In other words, users who would like to separate alerts into different indices by space ID can set the value of the Kibana advanced setting to be equal to the space ID. When creating a new rule in that space, the {namespace} option will be pre-filled with the space ID. They can still change it on every rule if they wish to.

Note about multi-space future: This should continue to work the same way even when multi-space rules become available. The namespace value will be stored on each rule, so it will continue to be tied to the space in which it was created. A user who shares that rule to a 2nd space, for example, would need to decide for themselves the best namespace value in that scenario. A warning could be shown at Rule sharing time indicating the index in which the alerts will be written and giving the user the opportunity to change it. Newly created rules would continue to read in the “default value” from the advanced settings in the current Kibana Space where the rule is being created from.

I think this agreement does not have much effect on either of the two approaches. The #101453 API does not force a developer to bind a log to a Kibana space. It provides a method for resolving a log which accepts a space id in its parameters:

```ts
export interface IEventLogResolver {
  resolve<TMap extends FieldMap>(
    definition: IEventLogDefinition<TMap>,
    spaceId: string
  ): Promise<IEventLog<Event<TMap>>>;
}
```

For its implementation, spaceId is just a string, and we could just rename it to namespace and pass this value from parameters of a rule instance.

Final thoughts

I easily could be biased towards #101453, but I feel like it provides a better API in general (although it may not be perfect, missing certain methods, etc.).

But given the time left before 7.14 FF and overall fatigue from working on alerts-as-data/event-log/RAC stuff, I think as of now we should proceed with #102586, because "progress over perfection".

We need to ship something and move forward, and with #102586 there's a much better chance of finalizing this implementation sooner.

@marshallmain

Thanks for summarizing, Georgii! I'm copy-pasting the note I sent to Georgii on Slack here to summarize my perspective, developed over the past 2 weeks.

At a very high level, there are 2 primary goals I have in mind for the eventual converged Alert Data client.

  1. Ease of understanding for developers who may not have worked with it before. Since this client is shared across multiple teams and doesn't have clear ownership right now, it's even more critical IMO that it's simple for developers to understand how the client works.
  2. Flexibility as the RAC initiative evolves. Since we're developing a client to be shared across teams, the client must be able to meet both teams' requirements (and ideally future teams' requirements as well) without requiring significant changes to accommodate them. To me, this implies that the client should act as a library of useful functions for the most part, allowing solutions to compose those functions as they see fit.

I spent 2 days or so comparing the RuleDataClient with the draft consolidated EventLogService to build a mental model of each implementation, assess the existing features, and start estimating the difficulty of adding the additional required features to each implementation. After those 2 days, I felt very confident in my mental model of the RuleDataClient, but I was still struggling to remember how the abstraction layers fit together in the EventLogService. I could follow the call stack when looking at a specific piece, but the number of calls into various classes meant I would quickly forget the details of what each class did and stored. As for features, your comment does an excellent job summing up all the key differences we were aware of at the time. When I started digging into the implementations, I realized that the namespacing strategy additionally differed between them. The EventLogService creates templates and indices at the same time and creates a separate template for each space, whereas the RuleDataClient does not create concrete indices until documents are written, and its template creation process is decoupled from index creation. This positions the RuleDataClient well to support the arbitrary namespaces that users can define for their RAC alerts, since a single set of templates already serves all namespaces for a particular solution's alert indices. Due to the requirement for arbitrary namespaces, it looked to me like the EventLogService would have to be updated to decouple template creation from index creation - in the process breaking the existing template/index versioning and rollover logic in the EventLogService.

So overall, there appeared to be 2 paths forward:

  • Continue building a consolidated implementation on the EventLogService
    • Decouple template creation and index creation to support arbitrary user defined namespaces (namespaces are defined per-rule)
    • Update versioning and index rollover logic to handle decoupled creation process
    • Need to get buy-in from Observability to replace their existing usage of RuleDataClient with EventLogService
  • Build a consolidated implementation on the RuleDataClient
    • Add versioning and index rollover logic immediately - we can go without this initially, but when the time comes where we need to migrate it's easier if we have it built from the beginning
    • Add static typing for RuleDataClient writer as a follow up - lack of static typing is not a blocker for migrating rules to use the RuleDataClient

Since (1) there are no blockers preventing the RuleDataClient from being used as the basis for Security Solution rules in RAC - any enhancements are great but not required for shipping so we can more easily make enhancements in parallel with migrating rules, (2) the RuleDataClient is in use already by Observability, and (3) I was more confident in my understanding of the RuleDataClient implementation and thus felt that other devs new to the consolidated client would have an easier time working on it, I decided to try adding the versioning and rollover logic to the RuleDataClient.

@marshallmain commented Jun 24, 2021

Summarizing next steps after meeting with @banderror and @xcrzx:
Priorities:

  • Simple, declarative way to create schemas for .alerts-* indices
  • Solution plugins should not have to know about internal details of the rule registry
    • We shouldn't have to create a ready signal in solution plugins and pass it into the rule data client constructor
    • Devs should be clearly guided away from inefficient patterns even if they don't know the internal details of why it's inefficient. e.g. with the rollover check in getWriter, a single writer should be used for the duration of a rule executor. getWriter should not be called repeatedly. We should make this clear in documentation, and also rename getWriter to something that sounds more appropriate like initializeWriter - this has a better chance of letting devs know that initializeWriter is not something you want to call repeatedly.
  • Rule registry should be able to enforce that templates created with the RuleDataPluginService have the required technical fields mapped correctly

Details:

  • RuleDataPluginService should provide a function that takes a "log definition" in some form, creates the appropriate component and index templates, and returns a Promise<RuleDataClient> for that log. Rule executors can then await this promise to ensure that they don't write to the log until it is properly bootstrapped. This removes the need for solution plugins to define a "ready signal" to pass into the RuleDataClient that signals when the templates are created; instead, the rule registry handles all the synchronization (see the usage sketch after this list).
    • "log definition" implementation TBD - initially the log definition could be as simple as an array of component templates and an index template that would be passed to RuleDataPluginService.createOrUpdateComponentTemplate and RuleDataPluginService.createOrUpdateIndexTemplate. A schema for the log can be passed in as an optional parameter alongside the log definition to get static typing for the documents written by the RuleDataClient writer.
    • In the context of the current security solution RuleDataPluginService usage, we'd take the templates here and pass them in to a single function like RuleDataPluginService.bootstrapSolutionTemplates(templates): Promise<RuleDataClient> instead of imperatively creating the templates and later directly instantiating a RuleDataClient.
    • The RuleDataPluginService can automatically add the technicalComponentTemplate as the final component template in the index template, ensuring that the technical fields will be included and mapped correctly.
    • The log definition should be simple and flexible, allowing solutions to define arbitrary component templates
    • A more advanced log definition could be defined as a set of FieldMaps that the RuleDataPluginService would transform into the appropriate component and index templates. This would enable the RuleDataPluginService to build the TS types for the log automatically. However, this could also be done as an additional refactor after starting with a simpler implementation.
  • The technical component template is the only required template for .alerts indices. The rule registry may provide other component templates that solutions can share for ease of use, e.g. the full ECS component template.

This approach should adopt the declarative style from the event_log. The main difference is it aims to be a lighter-weight implementation that encapsulates template bootstrapping details while providing some flexibility in log definitions. Rather than caching objects for repeated retrieval from the service, the RuleDataPluginService will simply create templates and return a Promise<RuleDataClient>. Solution plugins will be responsible for passing the RuleDataClient around to the appropriate places where it is needed.
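
A rough sketch of what this could look like from a solution plugin's perspective (the function name bootstrapSolutionTemplates comes from the discussion above; the log-definition shape and variable names are placeholders, not a final API):

```ts
// Hypothetical usage sketch. The service creates the templates and resolves
// once bootstrapping is complete; solution code awaits before writing.
const ruleDataClientPromise: Promise<RuleDataClient> =
  ruleDataPluginService.bootstrapSolutionTemplates({
    componentTemplates: [securityAlertsComponentTemplate],
    indexTemplate: securityAlertsIndexTemplate,
    schema: securityAlertSchema, // optional: static typing for written documents
  });

// Inside a rule executor: nothing is written until bootstrapping succeeded.
const ruleDataClient = await ruleDataClientPromise;
```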

@banderror

I'd like to just add my 2 cents and try to provide a more high-level overview of our recent discussions with @marshallmain and @xcrzx.

We had a chat on the following topics:

  • TypeScript schema based on io-ts

    • manually defined io-ts types vs dev-time code generation vs runtime conversion
    • based on FieldMap vs mappings from component and index templates
    • do we need validation on write?
    • do we need normalization on read?
  • Component and index templates

    • standardized vs total freedom
    • what exactly are "technical fields"
    • how do we reuse building blocks and merge mappings: in the TS code vs via component templates
  • Index bootstrapping logic

    • level of encapsulation and flexibility: what should be exposed to the user, and what should be hidden inside the mechanism
    • imperative (e.g. a hook where you would call low-level methods for component template updates etc) vs declarative (log definition object)
    • invariants: what the implementation must guarantee
    • error handling and race conditions
  • Misc

    • API of getting RuleDataClient instance. What should be bound to a namespace - RuleDataClient or writer and reader?
    • caching of RuleDataClient instances

@marshallmain perfectly described all the details, I just want to again list what we're going to do in the next PR in a more high-level way:

  • Add a way to specify TypeScript schema for RuleDataClient. Schema should be optional. Dev should be able to define it manually or convert a FieldMap to it. This is basically what was already implemented in [RAC] Build a consolidated implementation in Rule Registry (Draft) #101453
    • Out of scope (maybe later): generating schema from Elasticsearch mappings (so we could have templates as a single source of truth); validation on write; normalization on read.
  • Use a flexible approach to component templates, but enforce anything that is critical to the correct operation of this implementation. For example, a solution dev should be able to specify any number of component templates for a log (e.g. 3 custom component templates for security.events carrying something that makes sense for the security rule execution log), but the implementation must include references to the common templates (e.g. with "technical fields") under the hood.
    • We're not going to standardize templates like proposed in [RAC][Meta] Consolidate the two indexing implementations in rule_registry plugin #101016 (comment)
    • All templates will be space-agnostic, so shared between all concrete logs. There won't be a space-aware template.
    • We're probably not going to introduce a user component template, because if the user updates mappings while Kibana is running, we will need to check "if we need to rollover the index" before each write (we'd rather do that at the start and only once). Instead, as far as we understand at this point, the recommendation is to use runtime fields.
    • We would like to leverage component templates by default and avoid merging mappings in the code where possible. So that could be our idiomatic way of combining mappings and settings.
  • Index bootstrapping logic should be encapsulated. RuleDataService should provide a method that would implement bootstrapping; it would accept a log definition object and return a RuleDataClient. Invariants should be enforced by this logic, race conditions and errors handled correctly.

Again, for more details please refer to #101016 (comment).

@banderror

@jasonrhodes @dgieselaar @smith please review ^^^^ and give us some feedback 🙂 Let me know if we should elaborate more on anything. The plan is to start working on the next PR soon, would be great to unblock #102586 sooner rather than later because it brings changes to the public API of RuleDataService and RuleDataClient.

@weltenwort commented Jul 1, 2021

The plan makes sense to me. I'm absolutely in favor of a declarative asset/template API which takes care of the "technical fields". Automatic type derivation sounds nice but a bit of duplication (mapping and io-ts type, for example) might be acceptable to unblock the follow-up efforts sooner IMO.

@xcrzx commented Jul 14, 2021

As discussed with @banderror, here are some weak spots in the current implementation that we could address in the consolidated implementation.

1. Index name creation & update logic

Currently, index name creation and update logic is split into two pieces.

rule_registry/server/rule_data_client/index.ts contains index name creation logic:

```ts
function createWriteTargetIfNeeded() {
  // ...
  const concreteIndexName = `${alias}-000001`;
  // ...
}
```

rule_registry/server/rule_data_plugin_service/utils.ts contains index name update logic:

```ts
function incrementIndexName(oldIndex: string) {
  const baseIndexString = oldIndex.slice(0, -6);
  const newIndexNumber = Number(oldIndex.slice(-6)) + 1;
  if (isNaN(newIndexNumber)) {
    return undefined;
  }
  return baseIndexString + String(newIndexNumber).padStart(6, '0');
}
```

incrementIndexName accepts index names created by createWriteTargetIfNeeded and returns undefined if the name doesn't match the pattern. However, the relationship between the two functions is not explicit, which could lead to bugs if they fall out of sync. A more robust option would be a single function that appends -000001 if the name doesn't contain a numeric suffix, or increments the number otherwise.

Also, it should be flexible and not rely on the last 6 digits, because there can be edge cases in general (#102586 (comment)).
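
A minimal sketch of such a unified helper (illustrative only, not the actual implementation):

```ts
// Sketch: one function owns the naming pattern, so creation and increment
// logic cannot fall out of sync. Handles suffixes longer than 6 digits.
function getNextIndexName(indexOrAlias: string): string {
  const match = indexOrAlias.match(/^(.*)-(\d{6,})$/);
  if (match == null) {
    // No numeric suffix yet: start the sequence.
    return `${indexOrAlias}-000001`;
  }
  const [, base, suffix] = match;
  const next = Number(suffix) + 1;
  // Preserve the suffix width so existing names keep sorting correctly.
  return `${base}-${String(next).padStart(suffix.length, '0')}`;
}
```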

2. Index mapping update logic

Index mapping update logic has several blind spots. I've added my concerns as comments to the updateAliasWriteIndexMapping method below:

```ts
// Excerpted method of RuleDataPluginService, with my concerns added as comments.
async function updateAliasWriteIndexMapping({ index, alias }: { index: string; alias: string }) {
  const clusterClient = await this.getClusterClient();

  const simulatedIndexMapping = await clusterClient.indices.simulateIndexTemplate({
    name: index,
  });
  const simulatedMapping = get(simulatedIndexMapping, ['body', 'template', 'mappings']);
  try {
    await clusterClient.indices.putMapping({
      index,
      body: simulatedMapping,
    });
    return;
  } catch (err) {
    if (err.meta?.body?.error?.type !== 'illegal_argument_exception') {
      /*
       * This part is unclear. Why do we skip the rollover if we've caught 'illegal_argument_exception'?
       */
      this.options.logger.error(`Failed to PUT mapping for alias ${alias}: ${err.message}`);
      return;
    }
    const newIndexName = incrementIndexName(index);
    if (newIndexName == null) {
      /*
       * I think we should not fail here. If the index name had no "-000001" suffix, we should append it.
       */
      this.options.logger.error(`Failed to increment write index name for alias: ${alias}`);
      return;
    }
    try {
      /*
       * We don't check the response here. It could return "{ acknowledged: false }".
       * Should we consider a rollover to be successful in that case?
       */
      await clusterClient.indices.rollover({
        alias,
        new_index: newIndexName,
      });
    } catch (e) {
      if (e?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
        /*
         * This part is also unclear. We log only 'resource_already_exists_exception' but silence all other exceptions.
         * It looks incorrect.
         */
        this.options.logger.error(`Failed to rollover index for alias ${alias}: ${e.message}`);
      }
      /*
       * There could be many reasons for the first attempt to update mappings to fail (network, etc.).
       * I think we should retry in case of failure.
       */
    }
  }
}
```

3. Error handling during bootstrapping

Currently, index bootstrapping is considered successful even if some requests failed. There is no way to programmatically retrieve the outcome of index template initialization, as we only log errors to the console. See the initialization in security_solution/server/plugin.ts, for example:

```ts
const initializeRuleDataTemplatesPromise = initializeRuleDataTemplates().catch((err) => {
  this.logger!.error(err);
});
```

It could lead to a situation when index bootstrapping fails, but RuleDataClient doesn't know anything about it and writes documents into an index that doesn't have proper mappings. In the worst case, dynamic mappings would be applied to the indexed documents, leading to unexpected behavior, like broken aggregations in the application code.

A safer approach would be to disable all write operations if index bootstrapping failed. This should probably be done at the rule_registry library level.
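
A minimal sketch of that idea, assuming the writer can await a bootstrap promise that rejects on failure (names are illustrative):

```ts
// Sketch: every write awaits the bootstrap outcome, so if bootstrapping
// rejected, writes fail loudly instead of hitting a mis-mapped index.
class GuardedWriter {
  constructor(
    private readonly bootstrapped: Promise<void>,
    private readonly bulkIndex: (docs: object[]) => Promise<void>
  ) {}

  async write(docs: object[]): Promise<void> {
    await this.bootstrapped; // rejects if index bootstrapping failed
    await this.bulkIndex(docs);
  }
}
```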

@banderror

4. Retry logic during bootstrapping

We should also add retry logic to index bootstrapping.

I talked to the Elasticsearch team, and they recommended adding retries to component and index template updates, or IMO wherever else we could get { acknowledged: false } from ES during index bootstrapping (see the sketch after the Q&A below).

Here's the log of our discussion:


Q: In Kibana (specifically, RAC project), we started to use the new template APIs for bootstrapping our indices (PR for reference #102586). This code is not well tested and not production ready, but it’s going to be used from Security and Observability solutions in the upcoming releases, so I’d like to make sure that we implemented it correctly. Specifically, we started to use:

We make sure to wait for the responses from these endpoints before any other code will be able to write anything to the corresponding indices. So any writing happens only if these calls succeed.

In those responses, we normally get 200 OK with the body:

{ acknowledged: true }

I wasn’t able to find it in the docs or elsewhere, so I’d like to double-check two things.

The first thing is about acknowledged: true:

  • The meaning - does it mean that the request is accepted, but work needs to be done asynchronously in the background?
  • Can we get acknowledged: false in a 200 OK response?
  • What errors can we get there in general, can we refer to some spec with listed errors and response structure?

The second thing is about race conditions between index mgmt and indexing requests. Is it possible to get a race condition between successfully acknowledged template mgmt requests, and the following (potentially, immediately after that) attempt to write a document to an index which name matches the name of the updated index template?

  • Here we have 2 cases in the code:
    • Initial index bootstrapping. This index doesn’t exist yet, and we create it explicitly in the code of Kibana.
    • Subsequent index upgrade (when we changed our templates). The index already exists. We do 2 things: first, we try to update its mappings/settings directly (to get them from the index template, we use POST /_index_template/_simulate API); second, if this fails, we do a rollover of the alias.
  • So basically, when creating initial index OR updating existing one OR rolling over alias, is it possible that the concrete index won’t get the most recent mappings and settings from the (just updated) templates?
  • If yes, is there a way to handle it? E.g. wait for completion of those index mgmt requests?

When developing locally, we got into a weird broken state where the concrete index ended up with dynamic mappings instead of mappings from the index template. Most likely, this was caused by our incorrectly implemented code + constant server reloads during development. However, I decided to double-check whether any race conditions are possible in general.


A:

The first thing is about acknowledged: true:

  • The meaning - does it mean that the request is accepted, but work needs to be done asynchronously in the background?

That means that the change has been acknowledged by the master and accepted with a cluster state update

  • Can we get acknowledged: false in a 200 OK response?
  • What errors can we get there in general, can we refer to some spec with listed errors and response structure?

You can get "false" for the acknowledged, if, when publishing the new cluster state, the request times out before enough nodes have acknowledged and accepted the cluster state

there aren't any listed errors, so generally a good rule would be to retry in that case

  • So basically, when creating initial index OR updating existing one OR rolling over alias, is it possible that the concrete index won’t get the most recent mappings and settings from the (just updated) templates?

if the template update response has "acknowledged: true", then retrieving those settings through something like the simulate API or creating a new index will use the just-updated template


Q: If this question even makes sense, how many nodes is enough though? Does acknowledged: true guarantee that any subsequent write request will be handled by a node that received the most recent cluster state with the new mappings etc?

Just want to make sure there’s no way to get a race condition, leading to some weird state like let’s say a node which doesn’t know about a newly created index, processes write request and creates an index with dynamic mappings instead of mappings specified in the templates.


A: You shouldn't have to worry about this assuming the request is acknowledged: true, because that means the master node has acknowledged it, and all those requests (index creation, mappings updates) go through the master node
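
Based on that advice, a retry wrapper could look something like this (a sketch under the stated assumptions; the helper name and options are illustrative):

```ts
// Sketch: retry an operation that resolves to { acknowledged: boolean },
// retrying on acknowledged: false (cluster state update timed out) and on
// transient errors, with simple linear backoff.
async function retryUntilAcknowledged(
  operation: () => Promise<{ acknowledged: boolean }>,
  { attempts = 3, delayMs = 1000 }: { attempts?: number; delayMs?: number } = {}
): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const { acknowledged } = await operation();
      if (acknowledged) {
        return;
      }
    } catch (e) {
      if (attempt === attempts) {
        throw e;
      }
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
  }
  throw new Error('Template update was not acknowledged after retries');
}
```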

@banderror

We identified a few potential issues with index bootstrapping and described them in the 2 previous comments.

I'm planning to address them as part of this consolidation task. I can split this into several PRs or do it in a single one; I'm not sure at this point which way is better.

@jasonrhodes @weltenwort @marshallmain @xcrzx

@marshallmain

Thanks for the writeups @xcrzx and @banderror ! Those sound like good improvements to make. The Q&A with answers from the ES team is especially useful, thanks Georgii for contacting them and writing up the answers!

This part is unclear. Why do we skip the rollover if we've caught 'illegal_argument_exception'?

It's inverted, we skip the rollover if we catch anything except for illegal_argument_exception - that's the error returned by ES when the mapping update contains a conflicting field definition (e.g. a field changes types). We expect to get that error for some mapping changes we might make, and in those cases we want to continue on to rollover the index. Other errors are unexpected and should trigger a retry IMO.

We log only 'resource_already_exists_exception' but silence all other exceptions.

This is inverted as well, we silence resource_already_exists_exception and log all others.

I think we should not fail here. If the index name had no "-000001" suffix, we should append it.

I'm a little worried about appending to the index name in the case of failures without knowing what caused the failure. Appending might fix the case where the suffix gets dropped somehow, but it could also introduce other edge cases. If the suffix gets dropped somehow during the rollover process and we fail to find the suffix on the next check and start with "-000001" again, then "-000001" could already exist. In that case it would appear that the rollover process has completed successfully when really nothing has happened.

I think the most robust approach would be to go back to using the dry_run parameter on the rollover API to retrieve the next_index name directly from Elasticsearch and avoid any attempts to replicate the rollover naming logic ourselves. We pay a slight performance cost by making an additional call to ES, but we're only doing this once at startup time so it seems worth it to minimize bug potential here.
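
For illustration, the dry-run approach might look like this (a sketch assuming the 7.x client's { body } response wrapper):

```ts
// Sketch: let Elasticsearch compute the next concrete index name via a
// dry-run rollover instead of replicating its naming logic in Kibana.
const { body: rolloverPreview } = await clusterClient.indices.rollover({
  alias,
  dry_run: true,
});
const nextIndexName = rolloverPreview.new_index;
```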

We don't check response here. It could return "{ acknowledged: false }". Should we consider a rollover to be successful in that case?

This is a great point and something we should definitely handle. "{ acknowledged: false }" seems like a failure to me.

There could be many reasons for the first attempt to update mappings to fail (network, etc.). I think we should retry in case of failure.

Adding retry logic for the mapping update process in the case of (unexpected) errors would definitely be a good improvement.

A safer approach would be to disable all write operations if index bootstrapping failed. This should probably be done at the rule_registry library level.

Definitely agree with this, we should not allow writes if the templates are not installed correctly.

@xcrzx commented Jul 15, 2021

It's inverted, we skip the rollover if we catch anything except for illegal_argument_exception ...

Sorry, my bad, didn't see that little !. Thank you for the clarification. I'll add it as a comment to the code.

If the suffix gets dropped somehow during the rollover process and we fail to find the suffix on the next check and start with "-000001" again, then "-000001" could already exist.

The current implementation doesn't handle this case either. It fails when the index name doesn't contain a numerical suffix. So yeah, going back to using the dry_run is probably the best option we have.

@banderror

Adding @marshallmain's suggestions regarding index bootstrapping:

  1. There is currently no ILM policy associated with the concrete indices created by the RuleDataClient. We can start by associating all concrete indices with the default ILM policy created by the rule registry. However, we also need to add an additional index template per index alias in order to specify the rollover_alias for the concrete index. (https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#ilm-gs-alias-apply-policy) Without an ILM policy, the index will eventually become much larger than necessary and could hinder performance.
  2. In the security solution, we'd like to add an additional alias to the alerts-as-data indices in order to make them backwards compatible with existing user defined queries and visualizations built on the .siem-signals naming scheme. To accomplish this, we'd add an additional parameter to RuleDataClient that allows callers to specify an optional "secondary alias" that would be added to the concrete indices.

A note regarding ILM policy, index template and namespace:

  • Each alerts-as-data index managed via rule_registry must include a (in general, user-defined) namespace: .alerts-{registrationContext}.{datasetSuffix}-{namespace}
  • At least in Security, this namespace is not known at start-up time (the setup phase of the plugin). It becomes known in the scope of a rule instance, e.g. inside the rule type executor.
  • We'd like to have an ILM policy applied to our indices. We'd like to be able to decide, should we use the default shared policy, or specify a custom one for a particular index.
  • Correctly applying a policy with rollover logic requires creating an index template that would be pointing to the corresponding index alias with namespace in it:
    • index.lifecycle.name should be equal to the policy name
    • index.lifecycle.rollover_alias has to be the alias to rollover when the concrete index reaches the condition defined in the policy
  • Since the alias includes the namespace, we need to specify a different index.lifecycle.rollover_alias for every namespace - Elasticsearch doesn't automatically find the correct alias for us.
  • Which means we will have to delay creating/updating the index template until the namespace becomes known (see the sketch after this list).
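
For illustration, the per-namespace index template would need settings along these lines (a sketch; names follow the .alerts-{registrationContext}.{datasetSuffix}-{namespace} convention, with a hypothetical default policy name):

```ts
// Sketch: ILM settings for one namespace. The rollover_alias must point at
// the concrete per-namespace alias, which is why this template can only be
// created once the namespace is known.
const indexTemplateBody = {
  index_patterns: ['.alerts-security.alerts-default-*'],
  template: {
    settings: {
      'index.lifecycle.name': '.alerts-ilm-policy', // default or custom policy
      'index.lifecycle.rollover_alias': '.alerts-security.alerts-default',
    },
  },
};
```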

@banderror

Adding suggestions after syncing with Observability folks (cc @weltenwort; btw I don't know the GitHub usernames of Kerry and Panagiota, so please feel free to tag them):

  1. Enforce allowed values for the datasetSuffix at the API level. In the index naming convention .alerts-{registrationContext}.{datasetSuffix}-{namespace}, datasetSuffix can only be alerts or events. It would make sense to restrict it in the RuleDataService interface to make it safer and more explicit for developers (see the sketch after this list).
  2. Provide different default values for namespace:
    • in getWriter set namespace = 'default' by default
    • in getReader maybe set namespace = '*' by default
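
A quick sketch of both points at the type level (all names are illustrative, not the final API):

```ts
// Sketch: restrict the dataset suffix to the two allowed values and apply
// the proposed namespace defaults for writers and readers.
type DatasetSuffix = 'alerts' | 'events';

// Builds the index name for writes; the namespace defaults to 'default'.
function indexNameForWrite(context: string, suffix: DatasetSuffix, namespace = 'default'): string {
  return `.alerts-${context}.${suffix}-${namespace}`;
}

// Builds the index pattern for reads; defaults to '*' to read all namespaces.
function indexPatternForRead(context: string, suffix: DatasetSuffix, namespace = '*'): string {
  return `.alerts-${context}.${suffix}-${namespace}`;
}
```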

@banderror

Just want to socialise here that something like #101541 (a higher-level abstraction on top of index management) is out of scope of this consolidation effort.

@weltenwort

cc @Kerry350 @mgiota

@banderror banderror changed the title [RAC] Consolidate the two implementations in Rule Registry [RAC][Meta] Consolidate the two indexing implementations in rule_registry plugin Jul 21, 2021
@mgiota commented Sep 2, 2021

@banderror I took the liberty of adding one more ticket to your list for 7.15 (@jasonrhodes, can you confirm this is for 7.15?). This is a new bug that we spotted while working on #110788.

@banderror

@mgiota Thank you, I can only support that 👍 👍

@banderror

Hey everyone, I removed this ticket from the backlog of the Detection Rules area. We (@elastic/security-detections-response-rules) are not the owners anymore (however feel free to still ping us if you have any tech questions about the ticket).

Ownership of this epic and its sub-tasks now goes to the Detection Alerts area (Team:Detection Alerts label). Please ping @peluja1012 and @marshallmain if you have any questions.

@marshallmain

Transferring again to @elastic/response-ops as they now own the rule registry implementation.
