pipeline processors fails when field is not present #19995

Closed

djschny opened this issue Aug 15, 2016 · 15 comments
Labels: :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >enhancement, v5.0.0-beta1

Comments

@djschny (Contributor) commented Aug 15, 2016

Elasticsearch version: v5.0.0-alpha5

Plugins installed: []

JVM version: 1.8.0_92-b14

OS version: Mac 10.11.15

Description of the problem including expected versus actual behavior:

The convert pipeline processor throws an error if the field to convert does not exist in the document. Ideally an option such as ignore_missing would exist and default to true, so that for the majority of users it just works. The ignore_failure option would not work in this scenario, because we would still want to fail when there is a parse error or some other unexpected error.

This can be illustrated with the very common, poster-child example of Apache log data, where the equivalent of the usual Logstash config as an ingest node pipeline would be:

PUT _ingest/pipeline/apachelogs
{
  "description": "Pipeline to parse Apache logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{COMBINEDAPACHELOG}"
        ]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "target_field": "timestamp",
        "formats": [
          "dd/MMM/YYYY:HH:mm:ss Z"
        ]
      }
    },
    {
      "convert": {
        "field": "response",
        "type": "integer"
      }
    },
    {
      "convert": {
        "field": "bytes",
        "type": "integer"
      }
    }
  ]
}

In this situation bytes is optional and is not present for HEAD/OPTIONS/DELETE requests. If the bytes field is missing, the pipeline should ideally still work for the end user. See the related elastic/beats#2229 for the original discovery and background information.
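
For illustration, a minimal sketch of how the final convert processor above could look if an option along the lines of the proposed ignore_missing existed (it is not available in alpha5; this is only the shape suggested in this issue):

    {
      "convert": {
        "field": "bytes",
        "type": "integer",
        "ignore_missing": true
      }
    }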

CC @talevy, as we talked about this briefly earlier this morning.

@talevy added the :Data Management/Ingest Node, >enhancement, and v5.0.0-beta1 labels Aug 15, 2016
@talevy self-assigned this Aug 15, 2016
@clintongormley

I can imagine this situation is pretty common. Wondering if convert should know to ignore missing fields and null values out of the box? Not sure how consistent this is with other processors?

@talevy (Contributor) commented Aug 23, 2016

Will open a PR for this shortly. A new ignore_missing boolean option will be added to processors so that we can handle such field-missing exceptions separately from the broader ignore_failure.

@djschny (Contributor, Author) commented Aug 26, 2016

Hitting this a lot across many processors. For example, I'm running into the issue with the convert processor as well. Looking forward to the addition of ignore_missing.

@talevy (Contributor) commented Aug 27, 2016

@djschny I am scared of there being too many ways to control pipeline flow using exception handling. Mind sharing how you are hitting this with other processors and why the ignore_failure option is not sufficient?

@djschny (Contributor, Author) commented Aug 28, 2016

Mind sharing how you are hitting this with other processors and why the ignore_failure option is not sufficient?

Sure, consider the following simple example and processor:

      "convert": {
        "field": "test_score",
        "type": "float"
      }

In the above, the field test_score is a string and I want to convert it to a float. Some of the documents do not have this field, and for those I want the processor to not throw an error. The issue with ignore_failure is that it would also ignore problems where a document does have the field but the value does not parse because it is not a float (for example "none").

For many of these processors, the absence of a field, a field with a null value, and potentially even an empty string should all be ignored rather than throwing an error, IMO.
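
To make the distinction concrete, here is a sketch using the simulate API (reusing the field from the example above): with ignore_failure, both documents below pass through silently, even though the first contains a genuine parse error ("none") that should surface, whereas an ignore_missing option would only skip the second, field-less document.

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "convert": {
          "field": "test_score",
          "type": "float",
          "ignore_failure": true
        }
      }
    ]
  },
  "docs": [
    { "_source": { "test_score": "none" } },
    { "_source": {} }
  ]
}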

@djschny (Contributor, Author) commented Aug 29, 2016

I'm making the title more generic, as I have so far run into this issue when implementing pipelines with the following processors (a minimal sketch for one of them follows the list):

  • convert
  • trim
  • grok
  • rename

@djschny changed the title from "convert pipeline processor fails when field is not present" to "pipeline processors fails when field is not present" Aug 29, 2016
@talevy (Contributor) commented Sep 7, 2016

@djschny I've updated the PR to include these, except for rename. I feel that ignore_failure captures the escape behavior well enough for that case (the only other way it would fail is if you supply an invalid path as the destination).

@djschny (Contributor, Author) commented Sep 7, 2016

Cool thanks @talevy

However, from an external user's perspective, consistency would be better. Externally, ignore_missing feels like a global setting that would apply to all processors (regardless of how it is implemented internally).

To have to set it one way on some processors and another way on others just seems odd IMO.

@talevy (Contributor) commented Sep 7, 2016

However from an external user perspective, consistency would be better

I can't disagree there; I am all for consistency. It is just not the case that all processors operate by extracting information from one field.

feels like a global setting that would apply to all processors

Although this may be a common theme (extracting data from one field) for many, if not all, of the existing processors, it is more of a coincidence than a rule.

Since processors can be as flexible as "fetch data from somewhere else and inject it into the document" (#20340) or "translate all field values to Spanish", ignore_missing would not make sense in all those cases, while something like ignore_failure does.

As you can see from the search-processor discussion, the scope of what processors can/will/should do is still contested. I would like to avoid introducing further generalizations that may become difficult to move away from in the future as things stabilize.

TL;DR I think ignore_missing should be applied on a case-by-case basis, and the documentation should help correct any assumption that such an option exists on all processors.

What do you think?

@djschny (Contributor, Author) commented Sep 7, 2016

I'm fine with not making it a global/common property. No problem there, and I understand the concern.

However, I believe it should still be applied to the rename processor. In addition to the consistency point above, it would help distinguish between a missing field and an actual error. As a user, I would want a document that does not have the field to simply be a no-op. But if the rename itself causes an error (which is possible if the target name is a field or ingest metadata reference that is not a valid JSON field name), then I would want that to actually fail; with ignore_failure set to true, that failure would be masked.
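
A sketch of the desired behavior (the field names here are only illustrative): with ignore_missing, the processor below would be a no-op for documents lacking hostname, yet a rename that itself goes wrong, for example because the destination is invalid or already occupied, would still fail as it should.

    {
      "rename": {
        "field": "hostname",
        "target_field": "host.name",
        "ignore_missing": true
      }
    }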

@talevy (Contributor) commented Sep 7, 2016

++ I'll add it to the rename

@djschny (Contributor, Author) commented Sep 10, 2016

Saw that the PR merged, thanks! However, I noticed ignore_missing defaults to false. From my experience so far working with pipelines, the majority of the time I need ignore_missing to be true.

Were there any particular considerations into why defaulting to false was chosen?

@clintongormley

Were there any particular considerations into why defaulting to false was chosen?

Because it is explicit. We tell you by default if there is a problem, but we give you the tools to deal with it as you need.

@talevy (Contributor) commented Sep 13, 2016

Closing since this has been added to the discussed processors here: #20194

@djschny (Contributor, Author) commented Oct 10, 2016

Looks like this is a problem with the split processor as well. I have opened a new issue to address it.
