final_pipeline appears to process twice when using the date match index processor #83653

Closed
edsyoo opened this issue Feb 8, 2022 · 2 comments · Fixed by #94000
Labels: >bug, :Data Management/Ingest Node, Team:Data Management

edsyoo commented Feb 8, 2022

Elasticsearch Version

7.16.2

Installed Plugins

No response

Java Version

bundled

OS Version

Cloud

Problem Description

When a pipeline that uses the date_index_name processor indexes a document into an index with the final_pipeline setting, the final_pipeline appears to run twice. There are two ways to show this: an append processor in the final pipeline adds the same value twice, and a rename processor fails on its second run because the field has already been renamed, so the original field name is no longer present.

Also referenced in a discuss topic to confirm that the example is valid.

Additional note: This runs fine on a 7.6.2 environment, so a change between 7.6.2 and 7.16.2 may have introduced the issue. #69727 or #75047 could be related.

Steps to Reproduce

Append method of showing the final_pipeline running twice:

PUT _ingest/pipeline/test-final-pipeline
{
  "processors": [
    {
      "append": {
        "field": "pipeline",
        "value": "completed"
      }
    }
  ]
}

PUT _template/test-template
{
  "index_patterns": [
    "test*"
  ],
  "settings": {
    "final_pipeline": "test-final-pipeline",
    "number_of_shards": 2
  }
}

POST /test-1/_doc
{
  "time-field": "2021-02-10T12:02:01.789Z",
  "CONTENT": "blah1"
}

PUT /_ingest/pipeline/routing-test-pipeline
{
  "description": "Time series DAY pipeline",
  "processors": [
     {
         "date_index_name": {
             "field": "time-field",
             "index_name_prefix": "test-",
             "date_rounding": "d",
             "index_name_format": "yyyy-MM-dd"
         }
     }
 ]
}

POST /test-1/_doc?pipeline=routing-test-pipeline
{
  "time-field": "2021-02-10T12:02:01.789Z",
  "CONTENT": "blah2"
}
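
One way to check the symptom after fetching the indexed document is to count how often the appended value occurs in its _source. The sketch below is a hypothetical helper, not part of Elasticsearch or its clients; it assumes the final pipeline's append processor writes "completed" into the "pipeline" field as in the request above, and the sample dictionaries are hand-written illustrations, not captured output.

```python
def append_run_count(source, field="pipeline", marker="completed"):
    """Count how often the final pipeline's append processor added its marker.

    `source` is the document's _source as a dict. The field and marker
    defaults match the test-final-pipeline definition in this report.
    """
    values = source.get(field, [])
    # The field may hold a scalar rather than an array.
    if not isinstance(values, list):
        values = [values]
    return values.count(marker)


# Hand-written illustrations of a single run and of the double run.
single_run = {"CONTENT": "blah1", "pipeline": ["completed"]}
double_run = {"CONTENT": "blah2", "pipeline": ["completed", "completed"]}

print(append_run_count(single_run))
print(append_run_count(double_run))
```

A count above 1 for the document indexed through routing-test-pipeline would match the behaviour described in this report.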

Rename method, which surfaces the second run of the final_pipeline as a failure:

PUT _ingest/pipeline/test-final-pipeline
{
  "processors": [
    {
      "rename": {
        "field": "CONTENT",
        "target_field": "MALCONTENT"
      }
    }
  ]
}

PUT _template/test-template
{
  "index_patterns": [
    "test*"
  ],
  "settings": {
    "final_pipeline": "test-final-pipeline",
    "number_of_shards": 2
  }
}

POST /test-1/_doc
{
  "time-field": "2021-02-10T12:02:01.789Z",
  "CONTENT": "blah1"
}

PUT /_ingest/pipeline/routing-test-pipeline
{
  "description": "Time series DAY pipeline",
  "processors": [
     {
         "date_index_name": {
             "field": "time-field",
             "index_name_prefix": "test-",
             "date_rounding": "d",
             "index_name_format": "yyyy-MM-dd"
         }
     }
 ]
}

POST /test-1/_doc?pipeline=routing-test-pipeline
{
  "time-field": "2021-02-10T12:02:01.789Z",
  "CONTENT": "blah2"
}

Logs (if relevant)

No response

@edsyoo added the >bug, :Data Management/Ingest Node, and needs:triage labels Feb 8, 2022
@elasticmachine added the Team:Data Management label Feb 8, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@pquentin removed the needs:triage label Feb 14, 2022
@HiDAl
Contributor

HiDAl commented Dec 15, 2022

Hello, I've tested this on the latest version of Elasticsearch and the issue still occurs. I was also able to reproduce it in other scenarios. Basically, every time a pipeline changes the _index of the current request, the final_pipeline is executed twice.

This is another version, this time using a script processor:


PUT _ingest/pipeline/test-final-pipeline
{
    "processors": [
        {
            "set": {
                "field": "some-field",
                "value": "value"
            }
        },
        {
            "append": {
                "field": "pipeline",
                "value": [
                    "completed",
                    "{{{_index}}}"
                ]
            }
        },
        {
            "script": {
                "lang": "painless",
                "source": "ctx.CONTENT += ' >> ';" // mimic an append, just to validate the whole pipeline is executed twice and not only the final processor
            }
        }
    ]
}


PUT _template/test-template
{
  "index_patterns": [
    "test*"
  ],
  "settings": {
    "final_pipeline": "test-final-pipeline",
    "number_of_shards": 3
  }
}


PUT _ingest/pipeline/routing-test-pipeline
{
    "description": "Time series DAY pipeline",
    "processors": [
        {
            "date_index_name": {
                "field": "time-field",
                "index_name_prefix": "test-",
                "date_rounding": "d",
                "index_name_format": "yyyy-MM-dd"
            }
        },
        {
            "set": {
                "field": "where",
                "value": "from routing-test-pipeline"
            }
        }
    ]
}

PUT _ingest/pipeline/change-by-script
{
    "processors": [
        {
            "set": {
                "field": "where",
                "value": "from changed-by-script"
            }
        },
        {
            "script": {
                "lang": "painless",
                "source": "ctx._index = 'testing-index-final';"
            }
        }
    ]
}

Either of the following index requests will execute the final_pipeline twice:


POST /test-1/_doc?pipeline=routing-test-pipeline
{
  "time-field": "2022-02-10T12:02:01.789Z",
  "CONTENT": "routing-pipeline"
}

POST /test-1/_doc?pipeline=change-by-script
{
    "time-field": "2022-02-10T12:02:01.789Z",
    "CONTENT": "change-by-script"
}

Right now I'm working on a bugfix for this issue.

HiDAl added a commit to HiDAl/elasticsearch that referenced this issue Dec 15, 2022
Because all the steps of the current pipeline are handled by the
IngestService, we have to mark the final pipeline as NOOP so that
the next iteration of TransportBulkAction doesn't re-run the
`executePipelines` method later (see
TransportBulkAction#doInternalExecute)

Closes (elastic#83653)
HiDAl added a commit to HiDAl/elasticsearch that referenced this issue Dec 16, 2022
Because all the steps of the current pipeline are handled by the
IngestService, we have to mark the final pipeline as NOOP when there
is an `_index` rewrite, so that the next iteration of
TransportBulkAction doesn't re-run the `executePipelines` method later
(see TransportBulkAction#doInternalExecute)

Closes (elastic#83653)
HiDAl added a commit to HiDAl/elasticsearch that referenced this issue Dec 19, 2022
Because all the steps of the current pipeline are handled by the
IngestService, we have to mark the final pipeline as NOOP when there
is an `_index` rewrite, so that the next iteration of
TransportBulkAction doesn't re-run the `executePipelines` method later
(see TransportBulkAction#doInternalExecute)

Closes (elastic#83653)
HiDAl added a commit to HiDAl/elasticsearch that referenced this issue Dec 20, 2022
When a processor changes the index of a request and the new index
has a final_pipeline, that pipeline is executed twice. This happens because
`IngestService` overwrites the pipeline information but never
cleans it up later. This change ensures that `IngestService` does not
change the pipeline information in this case, by using methods without
side effects.


Closes elastic#83653
HiDAl added a commit to HiDAl/elasticsearch that referenced this issue Dec 21, 2022
When a processor changes the index of a request and the new index
has a final_pipeline, that pipeline is executed twice. This happens because
`IngestService` overwrites the pipeline information but never
cleans it up later. This change ensures that `IngestService` does not
change the pipeline information in this case, by using methods without
side effects.


Closes elastic#83653
HiDAl added a commit to HiDAl/elasticsearch that referenced this issue Jan 23, 2023
When a processor changes the index of a request and the new index
has a final_pipeline, that pipeline is executed twice. This happens because
`IngestService` overwrites the pipeline information but never
cleans it up later. This change ensures that `IngestService` does not
change the pipeline information in this case, by using methods without
side effects.

Closes elastic#83653
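
The mechanism described in these commit messages can be illustrated with a toy model. The sketch below is not Elasticsearch code: the class, function names, the "_none" sentinel, and the control flow are all simplifying assumptions made for illustration. It only shows how writing the re-resolved final pipeline back onto the request, without resetting it, can cause an outer loop to run that pipeline a second time.

```python
from dataclasses import dataclass, field

NOOP = "_none"  # sentinel meaning "nothing left to run" (assumed name)


@dataclass
class IndexRequest:
    index: str
    source: dict = field(default_factory=dict)
    pipeline: str = NOOP
    final_pipeline: str = NOOP


def routing_pipeline(req):
    # Stands in for date_index_name: rewrites the target index.
    req.index = "test-2021-02-10"


def final_pipeline(req):
    # Stands in for the append processor of test-final-pipeline.
    req.source.setdefault("pipeline", []).append("completed")


PIPELINES = {
    "routing-test-pipeline": routing_pipeline,
    "test-final-pipeline": final_pipeline,
}


def lookup_final_pipeline(index):
    # Mirrors the test* template: matching indices get the final pipeline.
    return "test-final-pipeline" if index.startswith("test") else NOOP


def ingest(req, side_effects):
    # Take the pipeline names off the request and mark them as consumed.
    pipeline, final = req.pipeline, req.final_pipeline
    req.pipeline = req.final_pipeline = NOOP

    original_index = req.index
    if pipeline != NOOP:
        PIPELINES[pipeline](req)

    if req.index != original_index:
        # The index changed, so the new index's final pipeline applies.
        final = lookup_final_pipeline(req.index)
        if side_effects:
            # Modelled bug: the lookup result is written back to the
            # request and never reset afterwards.
            req.final_pipeline = final

    if final != NOOP:
        PIPELINES[final](req)


def bulk(req, side_effects):
    req.final_pipeline = lookup_final_pipeline(req.index)
    # The outer action re-enters ingest while a pipeline is still named.
    while req.pipeline != NOOP or req.final_pipeline != NOOP:
        ingest(req, side_effects)
    return req


buggy = bulk(IndexRequest("test-1", pipeline="routing-test-pipeline"), True)
fixed = bulk(IndexRequest("test-1", pipeline="routing-test-pipeline"), False)
print(buggy.source["pipeline"])  # ['completed', 'completed']
print(fixed.source["pipeline"])  # ['completed']
```

In this model the side-effect-free lookup keeps the re-resolved pipeline name in a local variable, so the request carries no pending pipeline when control returns to the outer loop.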
@HiDAl HiDAl closed this as completed Feb 16, 2023
@HiDAl HiDAl reopened this Feb 16, 2023