Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ability to auto-install ingest pipelines #95782

Merged
merged 12 commits into from May 10, 2023

Conversation

eyalkoren
Copy link
Contributor

@eyalkoren eyalkoren commented May 3, 2023

Adding the ability to auto-install ingest pipelines and refer to them from index templates through index.default_pipeline and index.final_pipeline.

This would enable required capabilities like #95551 and #95522.

@eyalkoren eyalkoren changed the title Adding ability to auto-install ingest pipelines through index templates Adding ability to auto-install ingest pipelines May 3, 2023
@eyalkoren eyalkoren self-assigned this May 3, 2023
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label external-contributor Pull request authored by a developer outside the Elasticsearch team v8.9.0 labels May 3, 2023
@eyalkoren eyalkoren added >feature and removed needs:triage Requires assignment of a team area label external-contributor Pull request authored by a developer outside the Elasticsearch team v8.9.0 labels May 3, 2023
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label May 3, 2023
@eyalkoren eyalkoren added :Data Management/Data streams Data streams and their lifecycles v8.9.0 and removed needs:triage Requires assignment of a team area label labels May 3, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label May 3, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @eyalkoren, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment for the forbidden APIs check - It does look like the CI is failing on some templates not being ready, might be worth trying to reproduce locally?

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! Put the previous comment on the wrong file 😵‍💫

@eyalkoren
Copy link
Contributor Author

@elastic/enterprise-search this PR adds the same capability you added through #95198, only to core. It does pretty much the exact same thing, but I also added a validation step to component templates, so that those are not allowed to be registered if they are dependent on an ingest pipeline that is not yet in place. I assume this doesn't break any assumption you are relying on, however it did break AnalyticsTemplateRegistryTests, because behavioral_analytics-events-settings.json has a final_pipeline dependency. Please review my fix for that.

@afoucret
Copy link
Contributor

afoucret commented May 4, 2023

@eyalkoren I appreciate that you pick this change to be part of the core. I was about to prepare a PR for this now I have more bandwidth.

I will review the issue later today but would like to highlight two recently merged PRs to get sure these fetaures are part of the new implementation:

@eyalkoren
Copy link
Contributor Author

@afoucret thanks for taking a look!
I know too little to judge about this, so input from @jbaiera and @dakrone will be required here.

My take on the two:

  • Checking that all nodes in the cluster are >= 8.8.0 before installing… #95780 - sounds like a generic thing, so probably makes sense to add. I will gladly add this, I will only wait for approval from the DM team
  • [8.8] Check if an analytics event data stream exists before installing pipeline. (#95621) #95775 - this actually sounds like a more case-specific issue and not sure the general logic is suitable for the core. I think it makes sense to install dependencies in order. What would happen if a template T refers to a pipeline P and both are added at the same cluster-change-event handling and T gets installed slightly before P (components are added asynchronously)? Could it be that an incoming request will cause the first backing index to use settings that don't include the pipeline and the following indices will? Can you think of a different event to trigger the late pipeline addition, and use a pipeline processor with ignore_missing_pipeline: true in the pipeline you register before registering the template?

@eyalkoren
Copy link
Contributor Author

The only failing test now is EnterpriseSearchRestIT, which probably cannot be fixed without properly addressing @afoucret's comment above.
We either need to change the recently added Enterprise Search check, or the verification added within this PR.

@jbaiera @dakrone @felixbarny - your thoughts on this will be appreciated.

@jbaiera
Copy link
Member

jbaiera commented May 5, 2023

I don't think #95780 will be required for the index registry work since the StackTemplateRegistry requires the items to be installed on the master node only, which is guaranteed to be the latest version in the cluster based on our upgrade guidance.

As for the pipeline jam - I'm not sure I understand how this works today. Is the data stream created via an explicit API call or when a document is first ingested? When a document arrives for the data stream and it doesn't exist, the ingest service will resolve the template for the index the document would go into and uses the pipelines defined on the template, but if the pipelines only exist when the data stream exists then there will be no pipeline definition for the document and the ingest will fail, thus never creating the data stream. I'm assuming that the data stream is created via some other API call or else I'm missing something. @afoucret can you clarify this?

Assuming everything does work just fine otherwise, then we could modify the getPipeline method to accept the current cluster state as an argument so that it can modify which pipelines to return based on which resources already exist. Since this check is done on every cluster update we could easily detect the first time the dependent data stream is present and conditionally add the pipeline to the list to be installed.

@eyalkoren
Copy link
Contributor Author

eyalkoren commented May 7, 2023

I don't think #95780 will be required for the index registry work since the StackTemplateRegistry requires the items to be installed on the master node only, which is guaranteed to be the latest version in the cluster based on our upgrade guidance.

Hmm, is that what we want though? Can this (applying pipeline only on master node) cause a scalability issue? Does this mean that the master is the only node with ingest role, or that only this pipeline is installed on it?
I am not knowledgeable enough about the finer details.

@eyalkoren eyalkoren requested a review from jbaiera May 8, 2023 14:05
@eyalkoren eyalkoren requested a review from jbaiera May 9, 2023 05:02
Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the iterations!

@eyalkoren eyalkoren merged commit 1dda989 into elastic:main May 10, 2023
12 checks passed
@eyalkoren eyalkoren deleted the ingest-pipeline-registry branch May 10, 2023 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Data streams Data streams and their lifecycles >feature Team:Data Management Meta label for data/management team v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants