Adding a simulate ingest api #101409

Merged (42 commits) on Nov 15, 2023

@masseyke (Member) commented Oct 26, 2023

This PR introduces a new _ingest/_simulate API. Given a set of documents and a target index, it runs every pipeline that would be executed if those documents were indexed into that index, but instead of indexing the data it returns the transformed documents. The difference from the simulate pipeline API is that the simulate pipeline API runs only the single pipeline it is given. The new API can run any number of pipelines: the given pipeline, the default pipeline of the given index, the default pipelines of any indices that the reroute processor forwards the data to, and the final pipeline of the last index in the chain.
For example, if we have the following pipelines:

curl -X PUT "localhost:9200/_ingest/pipeline/my-pipeline?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-long-field",
        "value": 10
      }
    },
    {
      "set": {
        "field": "my-boolean-field",
        "value": true
      }
    },
    {
      "lowercase": {
        "field": "my-keyword-field"
      }
    },
    {
      "reroute": {
        "destination": "my-index-2"
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-final-pipeline?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-boolean-field",
        "value": false
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-pipeline-2?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-long-field",
        "value": 20
      }
    },
    {
      "uppercase": {
        "field": "my-keyword-field"
      }
    }
  ]
}
'

curl -X PUT "localhost:9200/_ingest/pipeline/my-final-pipeline-2?pretty" -H 'Content-Type: application/json' -d'
{
  "processors": [
    {
      "set": {
        "field": "my-new-boolean-field",
        "value": false
      }
    }
  ]
}
'

And then the following index:

curl -X PUT "localhost:9200/my-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "default_pipeline": "my-pipeline",
      "final_pipeline": "my-final-pipeline"
    }
  }
}
'

Then calling _ingest/_simulate with this data:

curl -X POST "localhost:9200/_ingest/_simulate?pretty&index=my-index" -H 'Content-Type: application/json' -d'
{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO"
      }
    },
    {
      "_source": {
        "my-keyword-field": "BAR"
      }
    }
  ]
}
'

would return

{
  "docs" : [
    {
      "doc" : {
        "_id" : "_id",
        "_index" : "my-index-2",
        "version" : -3,
        "_source" : {
          "my-long-field" : 20,
          "my-keyword-field" : "foo",
          "my-boolean-field" : true
        },
        "executed_pipelines" : [
          "my-pipeline",
          "my-pipeline-2",
          "my-final-pipeline-2"
        ]
      }
    },
    {
      "doc" : {
        "_id" : "_id",
        "_index" : "my-index-2",
        "version" : -3,
        "_source" : {
          "my-long-field" : 20,
          "my-keyword-field" : "bar",
          "my-boolean-field" : true
        },
        "executed_pipelines" : [
          "my-pipeline",
          "my-pipeline-2",
          "my-final-pipeline-2"
        ]
      }
    }
  ]
}
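
The executed_pipelines list above can be traced with a small toy model of the resolution order described at the top of this PR. This is only a sketch, not Elasticsearch code: the function name and the two lookup tables are hypothetical, it tracks pipeline names only (not document contents), and it assumes my-index-2 exists with my-pipeline-2 as its default pipeline and my-final-pipeline-2 as its final pipeline, which the description does not show.

```python
from typing import Dict, List, Optional

# Toy model only. Index settings as created above, plus an ASSUMED my-index-2
# (its settings are not shown in the PR description).
INDEX_SETTINGS: Dict[str, Dict[str, str]] = {
    "my-index": {
        "default_pipeline": "my-pipeline",
        "final_pipeline": "my-final-pipeline",
    },
    "my-index-2": {
        "default_pipeline": "my-pipeline-2",
        "final_pipeline": "my-final-pipeline-2",
    },
}

# Which pipelines contain a reroute processor, and where it sends documents.
REROUTE_DESTINATION: Dict[str, str] = {"my-pipeline": "my-index-2"}


def executed_pipelines(index: str) -> List[str]:
    """Follow default pipelines through reroutes, then add the final
    pipeline of the last index in the chain."""
    executed: List[str] = []
    seen = set()
    current: Optional[str] = index
    last_index = index
    while current is not None:
        if current in seen:
            # Guard for the toy model; real cluster behaviour is not modelled.
            raise ValueError(f"reroute cycle detected at index {current!r}")
        seen.add(current)
        last_index = current
        default = INDEX_SETTINGS.get(current, {}).get("default_pipeline")
        if default is None:
            break
        executed.append(default)
        current = REROUTE_DESTINATION.get(default)
    final = INDEX_SETTINGS.get(last_index, {}).get("final_pipeline")
    if final is not None:
        executed.append(final)
    return executed


print(executed_pipelines("my-index"))
```

Under these assumptions the model lists my-pipeline, my-pipeline-2, and my-final-pipeline-2. my-final-pipeline does not appear, because only the final pipeline of the last index in the chain is run.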

You can also specify substitute pipeline definitions, which lets you try out pipeline changes without modifying the pipelines stored in the cluster. For example, to substitute a new my-pipeline-2, you could do the following:

curl -X POST "localhost:9200/_ingest/_simulate?pretty&index=my-index" -H 'Content-Type: application/json' -d'
{
  "docs": [
    {
      "_source": {
        "my-keyword-field": "FOO"
      }
    },
    {
      "_source": {
        "my-keyword-field": "BAR"
      }
    }
  ],
  "pipeline_substitutions": {
    "my-pipeline-2": {
      "processors": [
        {
          "set": {
            "field": "my-new-boolean-field",
            "value": true
          }
        }
      ]
    }
  }
}
'

This substitutes the pipeline body given in the request for the my-pipeline-2 stored in the cluster. The pipeline definition is changed only for this request, and the change does not affect anything else running on the cluster, now or in the future.
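
For callers that build this request programmatically, the body can be assembled along these lines. This is a minimal sketch that only constructs the path and JSON body shown in the curl examples above; build_simulate_request is a hypothetical helper, not part of any Elasticsearch client, and sending the request is left to whatever HTTP client is in use.

```python
import json
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import quote


def build_simulate_request(
    index: str,
    sources: List[Dict[str, Any]],
    pipeline_substitutions: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Tuple[str, str]:
    """Return (path, JSON body) for a POST to _ingest/_simulate."""
    # Each document is wrapped in {"_source": ...}, as in the curl examples.
    body: Dict[str, Any] = {"docs": [{"_source": source} for source in sources]}
    if pipeline_substitutions:
        # Maps a pipeline name to the definition to use for this request only.
        body["pipeline_substitutions"] = pipeline_substitutions
    path = f"/_ingest/_simulate?index={quote(index, safe='')}"
    return path, json.dumps(body, indent=2)


path, body = build_simulate_request(
    "my-index",
    [{"my-keyword-field": "FOO"}, {"my-keyword-field": "BAR"}],
    {
        "my-pipeline-2": {
            "processors": [
                {"set": {"field": "my-new-boolean-field", "value": True}}
            ]
        }
    },
)
print(path)
print(body)
```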

@masseyke added the >feature, :Data Management/Ingest Node, and v8.12.0 labels on Oct 26, 2023
@github-actions

Documentation preview:

@elasticsearchmachine (Collaborator)

Hi @masseyke, I've created a changelog YAML for you.

@joshdover (Member)

Really like the executed_pipelines and pipeline_substitutions features here. This will make our routing features much simpler to understand and work with 👍

@masseyke added the buildkite-opt-in label on Nov 3, 2023
@masseyke (Member, Author) commented Nov 3, 2023

@elasticmachine run elasticsearch-ci/part-1

@masseyke requested a review from dakrone on November 9, 2023 23:38

@dakrone (Member) left a comment

This looks great, thanks Keith! I left one more comment which should be addressed (and maybe a yaml test for it), but nothing major

docs/reference/ingest/apis/simulate-ingest.asciidoc (outdated)
<titleabbrev>Simulate ingest</titleabbrev>
++++

Executes ingest pipelines against a set of provided documents, optionally

In this and in the doc for the existing simulate, I'm surprised there is no mention of the simulation/testing aspect. In isolation, this reads as if it actually executes the pipeline (and thus indexes some data).

@masseyke (Member, Author)

There's a line below that reads "No data is indexed into Elasticsearch." I'll add a line about the intended use of this API at the top though.

@yuliacech (Contributor)

Hi @masseyke, I was just looking at this new API to assess how we could use it in Kibana to extend the functionality of testing an ingest pipeline. I think we would probably mostly use the substitutions. When testing with substitutions we need an index that uses the pipeline that is being tested, so I'm curious if there is a way to get a list of those indices from ES?

@masseyke (Member, Author)

> Hi @masseyke, I was just looking at this new API to assess how we could use it in Kibana to extend the functionality of testing an ingest pipeline. I think we would probably mostly use the substitutions. When testing with substitutions we need an index that uses the pipeline that is being tested, so I'm curious if there is a way to get a list of those indices from ES?

You're saying that you would like a way to give Elasticsearch a pipeline name, and have it return the list of indices that have that pipeline as the default pipeline or final pipeline?

@yuliacech (Contributor)

@masseyke yes, the context is as follows: the user is editing an existing pipeline and wants to test their changes. In the UI we would show them a list of indices that already use this pipeline and let them simulate an ingest. We would substitute the existing pipeline with the payload that the user has edited in the UI so far.

@masseyke (Member, Author)

> @masseyke yes, the context is as follows: the user is editing an existing pipeline and wants to test their changes. In the UI we would show them a list of indices that already use this pipeline and let them simulate an ingest. We would substitute the existing pipeline with the payload that the user has edited in the UI so far.

We had an offline discussion about this. We do not currently have an API that returns the list of indices for a given pipeline. I think the expected use of this API is a little different though, and it is not something that we currently have a Kibana UI for. This API is meant for developing and troubleshooting the integration of a collection of pipelines all working together (as opposed to developing an individual pipeline; the simulate pipeline API is more appropriate for that). For example, say ingestion into an index has been going fine in production, and has then broken on some new piece of data. This API could be used to figure out which pipelines are running, see what the output of all those pipelines is, and experiment with modifications to one or more of the pipelines. Once a pipeline change has been worked out, you'd still want to run it through whatever regression testing you have in order to make sure that the change doesn't break some other index.
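
Since no such API exists, a client could approximate the lookup by filtering index settings itself. The following is a rough sketch under stated assumptions: indices_using_pipeline is a hypothetical helper, and its input is assumed to be shaped like the nested (non-flat) response of the get index settings API. It only matches the index.default_pipeline and index.final_pipeline settings, so it would miss indices that reach a pipeline through a reroute processor or a pipeline processor.

```python
from typing import Any, Dict, List


def indices_using_pipeline(settings_response: Dict[str, Any], pipeline: str) -> List[str]:
    """Return indices whose default or final pipeline is `pipeline`.

    `settings_response` is assumed to look like
    {"<index>": {"settings": {"index": {"default_pipeline": "...", ...}}}}.
    """
    matches = []
    for index_name, entry in settings_response.items():
        index_settings = entry.get("settings", {}).get("index", {})
        configured = (
            index_settings.get("default_pipeline"),
            index_settings.get("final_pipeline"),
        )
        if pipeline in configured:
            matches.append(index_name)
    return sorted(matches)


# Hand-written sample input for illustration, not output from a real cluster.
example_settings = {
    "my-index": {
        "settings": {
            "index": {
                "default_pipeline": "my-pipeline",
                "final_pipeline": "my-final-pipeline",
            }
        }
    },
    "other-index": {"settings": {"index": {"number_of_shards": "1"}}},
}

print(indices_using_pipeline(example_settings, "my-pipeline"))
```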

@joegallo (Contributor) left a comment

Code LGTM. I don't see how this could affect bulk processing performance (nice, btw!), but I've been surprised before, so I think you should keep an eye on the benchmarks once this has been merged and be ready to do some digging if anything seems off.

@masseyke (Member, Author)

@elasticmachine run elasticsearch-ci/part-1

@masseyke merged commit 643d825 into elastic:main on Nov 15, 2023
14 checks passed
@masseyke deleted the adding-simulate-ingest-api branch on November 15, 2023 23:25
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Nov 16, 2023
This commit introduces a new _ingest/simulate API that runs any pipelines
on the given data that would be executed for a given index, but instead of
indexing the data into the index, returns the transformed documents and
the list of pipelines that were executed.
andreidan pushed a commit to andreidan/elasticsearch that referenced this pull request Nov 22, 2023
This commit introduces a new _ingest/simulate API that runs any pipelines
on the given data that would be executed for a given index, but instead of
indexing the data into the index, returns the transformed documents and
the list of pipelines that were executed.
Labels
buildkite-opt-in (Opts your PR into Buildkite instead of Jenkins)
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP)
>feature
Team:Data Management (Meta label for data/management team)
v8.12.0
7 participants