Ingest: Add conditional per processor #32398

Merged
original-brownbear merged 24 commits into elastic:master from original-brownbear:21248 on Aug 30, 2018

6 participants
@original-brownbear
Copy link
Member

commented Jul 26, 2018

  • Adds conditional if setting to all processors in a pipeline
  • closes #21248
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Jul 26, 2018

@original-brownbear

Member Author

commented Jul 26, 2018

Still WIP, missing tests and some cleanup. I'd just like to get the ok from everyone that this is what we're looking for.

With this change you can add an if field (script type) to any processor; the boolean return value of the script decides whether the processor runs.

E.g.:

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "if": "ctx.foo == 'bar'",
          "field" : "field2",
          "value" : "_value"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

->

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_type": "_doc",
        "_id": "id",
        "_source": {
          "field2": "_value",
          "foo": "bar"
        },
        "_ingest": {
          "timestamp": "2018-07-26T13:09:23.228451Z"
        }
      }
    },
    {
      "doc": {
        "_index": "index",
        "_type": "_doc",
        "_id": "id",
        "_source": {
          "foo": "rab"
        },
        "_ingest": {
          "timestamp": "2018-07-26T13:09:23.228473Z"
        }
      }
    }
  ]
}
@talevy

Contributor

commented Jul 26, 2018

@original-brownbear yup. that is what I had in mind! thanks! I was wondering why we don't extend AbstractProcessor to understand conditionals and bake it into the framework instead of conditionally overriding the type to be "conditional", I feel like it should have the conditional support by default, but only invoked if the if exists. what do you think? I understand this means that the script dependencies may be passed into all the factories, but maybe not?

@original-brownbear

Member Author

commented Jul 26, 2018

@talevy

I was wondering why we don't extend AbstractProcessor to understand conditionals and bake it into the framework instead

Yea that sounds nice in theory (though only in terms of code quality; performance-wise it probably doesn't matter either way, since you'll have the same number of interface(ish) calls in there).
In practice, see below :(

I understand this means that the script dependencies may be passed into all the factories, but maybe not?

Yea this, but worse yet, you'd have to change the Processor API in one way or another. Either you add a top-level conditionallyExecute that invokes the execute method if the script returns true, and change all the high-level calls in the composite/foreach etc. processors to invoke that, or you have to change literally every implementation to check the conditional via some provided API => huge changes for some aesthetic benefit at best imo.
Also, you'd probably have to change literally every processor factory (you gotta extract the if script field somehow and pass the script to the abstract class still, so that would make this even less practical).

=> I think wrapping the processor as the implementation is probably the smallest/safest change we're going to get here.
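
For readers following along, the wrapping idea can be sketched roughly as below. This is a minimal illustration and not the code in this PR: SketchProcessor, SetSketch and ConditionalSketch are hypothetical names, the document is modelled as a plain Map, and a java.util.function.Predicate stands in for the compiled if script.

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for the ingest Processor interface.
interface SketchProcessor {
    void execute(Map<String, Object> doc);
    String getType();
    String getTag();
}

// Toy processor that sets one field, loosely modelled on "set".
final class SetSketch implements SketchProcessor {
    private final String tag;
    private final String field;
    private final Object value;

    SetSketch(String tag, String field, Object value) {
        this.tag = tag;
        this.field = field;
        this.value = value;
    }

    @Override public void execute(Map<String, Object> doc) { doc.put(field, value); }
    @Override public String getType() { return "set"; }
    @Override public String getTag() { return tag; }
}

// Decorator: evaluates the condition first and only then delegates.
final class ConditionalSketch implements SketchProcessor {
    // Assumption: a Predicate stands in for the compiled "if" script.
    private final Predicate<Map<String, Object>> condition;
    private final SketchProcessor inner;

    ConditionalSketch(Predicate<Map<String, Object>> condition, SketchProcessor inner) {
        this.condition = condition;
        this.inner = inner;
    }

    @Override public void execute(Map<String, Object> doc) {
        if (condition.test(doc)) {
            inner.execute(doc);
        }
    }

    @Override public String getType() { return "conditional"; }

    // The wrapper and the inner processor share the same tag.
    @Override public String getTag() { return inner.getTag(); }
}
```

None of the existing processors or their factories need to change in this shape; only the code that builds the pipeline has to decide whether to wrap.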

@talevy

Contributor

commented Jul 26, 2018

though just in terms of code quality performance wise it probably doesn't matter in any way

true, I didn't even think it would 😄

OK, this sounds good. One thing we should double-check is that when an exception occurs in either the inner processor or the conditional (if), the correct processor type is shown, since both the wrapping conditional and the inner processor share the same tag.

@rjernst

Member

commented Jul 27, 2018

The general approach here seems ok. My main concern before doing a deeper review is the object passed into the conditional script needs to be immutable. I think we should look at what variables are available too, as ctx only exists for legacy reasons (because that is what update scripts exposed when ingest was added).

@original-brownbear

Member Author

commented Jul 27, 2018

@rjernst why don't we just expose a new kind of interface there instead of ctx, we could call it doc or whatever and only make a method .get(String Path) available on it for lookup (we could make that method wrap collections/maps as discussed ... those structures won't really be used much anyway and people will rather look up scalars directly by path so the performance hit probably is irrelevant)?

@jakelandis

Contributor

commented Jul 27, 2018

I like the 'if' syntax and general direction.

I share @talevy's concern over handling messaging. For example, today, if you change your example from "ctx.foo == 'bar'" to "ctx.foo = 'bar'", the following error occurs:

"error": {
        "root_cause": [
          {
            "type": "exception",
            "reason": "java.lang.IllegalArgumentException: ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
            "header": {
              "processor_type": "conditional"
            }
          }
        ],
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
          "caused_by": {
            "type": "script_exception",
            "reason": "runtime error",
            "script_stack": [
              "ctx.foo = 'bar'",
              "^---- HERE"
            ],
            "script": "ctx.foo = 'bar'",
            "lang": "painless",
            "caused_by": {
              "type": "class_cast_exception",
              "reason": "java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean"
            }
          }
        },
        "header": {
          "processor_type": "conditional"
        }
      }
    }

It's a bit misleading where the error is happening due to the conditional processor type. However, adding a tag helps a lot, so maybe it's a non-issue.

...
"header": {
              "processor_type": "conditional",
              "processor_tag": "my_set"
            }

If we add per processor metrics, there would be a similar concern to accurately represent the combination of the inner processor and outer processor in the same metric.

I am a bit concerned by allowing arbitrary processing inside the if condition, meaning the if condition can be much more than just a single true/false expression evaluation. When you combine this change with the ability to call other processors from scripting, it can compound this concern. For example, you could call grok from inside an if and then make your true/false decision based on that. I am not sure if arbitrary processing inside the if is a good or bad thing.

If we don't want to allow that kind of arbitrary processing, could we implement a much simpler DSL that only allows single-expression boolean evaluations? Perhaps a custom subset of painless?

Also, since we are using scripting for the true/false evaluation, would we also support alternative scripting languages?

@original-brownbear

Member Author

commented Jul 27, 2018

@jakelandis

It's a bit misleading where the error is happening due to the conditional processor type. However, adding a tag helps a lot, so maybe it's a non-issue.

I think this isn't an issue; afaik the bucket selector aggregation scripts behave the same way.

Also, since we are using scripting for the true/false evaluation, would we also support alternative scripting languages?

We can also use the other languages here. Expressions in particular can be made to work like in the bucket selector aggregation scripts: treat a return of 1.0 as true and count all other returns as false.
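
As a rough illustration of that mapping (a sketch only; ExpressionConditionSketch and toBoolean are hypothetical names, not part of this PR), a numeric expression result could be turned into a condition like this:

```java
// Hypothetical helper: maps a numeric expression result to a boolean condition.
final class ExpressionConditionSketch {
    private ExpressionConditionSketch() {}

    // Exactly 1.0 counts as true; every other value (including NaN) counts as false.
    static boolean toBoolean(double expressionResult) {
        return expressionResult == 1.0;
    }
}
```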

I am not sure if arbitrary processing inside the if is a good or bad thing.

I'm a little torn here too. Obviously something like the LS config "language" has a much lower barrier to entry and is really short to write. Then again ... we already have the ES scripting, it's super powerful and existing users are probably proficient in it to a degree too.
IMO it may be worth exploring adding a simpler conditional language (if not outright the LS one) as an enhancement to this down the road (downside obviously would be that it's yet another thing to support).

That said, one strong argument I would have for "this is a good thing" is that when looking at LS you often see the pattern of:

  1. Run grok on an event and make it set some field depending on a complex regex
  2. Check if that field was set to a specific value in another conditional
  3. Run another grok action setting some other field
  4. And then another conditional checking that field

... those cases would become a lot simpler to implement on the user side with the flexibility of Painless.

Perhaps a custom subset of painless?

As far as I understand the code this would be very hard. Also, it becomes a very strange situation where you're essentially just arbitrarily taking away flexibility from the user (without making our lives easier in return, if anything constraining what subset of Painless or the other languages we're supporting is more effort than just outright using them).
I'd just address this (if we want to address it even) by maybe suggesting to users to keep things simple in these conditionals (though I don't really see a good argument to put behind that other than aesthetics).


@rjernst what do you think about moving forward with wrapping the ctx (we can rename it next week if you want :)) with an immutable map for now to make some progress here?
Maybe it's easiest to not deviate too much from the script processor here anyway? The more I think about it, the less I like the idea of the conditional not getting the same ctx that the script processor gets. Having the if script field work in a totally different way (even though its JSON schema is the same and you'd probably even have both of them on the same screen next to each other when coding things up) is really weird ergonomics imo.

@rjernst

Member

commented Jul 30, 2018

what do you think about moving forward with wrapping the ctx

IMO we should think about what we want the variables to be for the ingest script context long term. If ctx is what we want, then keeping it in conditional ingest scripts is fine. But if we think another script signature would work better long term, then the new conditional script should match the future signature we want. This will make it easier to migrate to the new signature, rather than adding new uses of ctx that will make the ability to deprecate and change to something else more painful.

@original-brownbear

Member Author

commented Jul 31, 2018

@rjernst

IMO we should think about what we want the variables to be for the ingest script context long term. If ctx is what we want, then keeping it in conditional ingest scripts is fine.

fair point :) IMO, with the way we are currently approaching calling processors from scripts (having static methods that do what the processors do, as opposed to setting up and invoking actual Processor instances like in my POC #32043), ctx being a nested map is an ok approach (the name is weird and I'd like doc or so better :P but w/e), especially with painless allowing the map.key lookup syntax here. To me, using IngestDocument as an input really only made sense if we cared about passing the actual IngestDocument along to other processors. If we don't want to do that, it just complicates the syntax in scripts I guess.
... but we should/could probably have that discussion elsewhere :)

original-brownbear added some commits Aug 19, 2018

@original-brownbear

Member Author

commented Aug 20, 2018

@rjernst added the lazy wrapping (urgh, that's quite a bit of code to handle all the cases that we have to look out for because they could leak a mutable view of the Map :)).

Can you take a quick look to see if that's an approach you're ok with? If so, then I think we only need some tests for the lazy wrapping (+ whatever you may find) here :)
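
To give a feel for what lazy wrapping means here, below is a minimal sketch. It is not the PR's UnmodifiableIngestData class: UnmodifiableSketch, MapView and ListView are hypothetical names, and the sketch assumes the data only contains maps, lists and already-immutable scalars (strings, boxed numerics, booleans, null). Nested values are wrapped only when they are read, and every mutating call ends in an UnsupportedOperationException.

```java
import java.util.AbstractList;
import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UnmodifiableSketch {
    private UnmodifiableSketch() {}

    // Wraps maps and lists in read-only views; anything else is assumed immutable already.
    @SuppressWarnings("unchecked")
    static Object wrap(Object raw) {
        if (raw instanceof Map) {
            return new MapView((Map<String, Object>) raw);
        }
        if (raw instanceof List) {
            return new ListView((List<Object>) raw);
        }
        return raw;
    }

    // Read-only map view; AbstractMap already rejects put().
    static final class MapView extends AbstractMap<String, Object> {
        private final Map<String, Object> data;

        MapView(Map<String, Object> data) {
            this.data = data;
        }

        // Direct lookup, wrapping the value lazily on access.
        @Override public Object get(Object key) { return wrap(data.get(key)); }
        @Override public boolean containsKey(Object key) { return data.containsKey(key); }

        @Override public Set<Map.Entry<String, Object>> entrySet() {
            return new AbstractSet<Map.Entry<String, Object>>() {
                @Override public int size() { return data.size(); }

                @Override public Iterator<Map.Entry<String, Object>> iterator() {
                    Iterator<Map.Entry<String, Object>> it = data.entrySet().iterator();
                    // remove() is not overridden, so it throws UnsupportedOperationException.
                    return new Iterator<Map.Entry<String, Object>>() {
                        @Override public boolean hasNext() { return it.hasNext(); }
                        @Override public Map.Entry<String, Object> next() {
                            Map.Entry<String, Object> e = it.next();
                            return new AbstractMap.SimpleImmutableEntry<>(e.getKey(), wrap(e.getValue()));
                        }
                    };
                }
            };
        }
    }

    // Read-only list view; AbstractList already rejects add(), set() and remove().
    static final class ListView extends AbstractList<Object> {
        private final List<Object> data;

        ListView(List<Object> data) {
            this.data = data;
        }

        @Override public Object get(int index) { return wrap(data.get(index)); }
        @Override public int size() { return data.size(); }
    }
}
```

The cases that need care are the indirect ones (entry sets, iterators, nested collections), since each of them could otherwise hand the script a mutable view of the underlying map.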

@original-brownbear

Member Author

commented Aug 24, 2018

@jakelandis

I (~50%) agree it may be worth exploring a simpler language :)

The upsides definitely are:

  • Having an easy DSL for conditionals worked well for LS too (and probably will for Beats as well)
    • I think product wise I like an easy DSL a lot
  • It removes the complication of having to deal with mutability
    • Though I'm not so sure how troubling this is with the map wrapping added here, it's slightly annoying code, but the performance hit shouldn't be so bad (or even visible) thanks to escaping, I think.

The downsides are:

@rjernst

Member

commented Aug 27, 2018

I don't think we should have another language. This is a huge burden to maintain. We have been trying very hard to reduce the number of languages we must support within elasticsearch. For example, we got rid of the groovy, python, and javascript scripting languages, and we have been working towards making painless on par performance-wise with expressions so we can remove that too.

IMO the overhead of wrapping these objects to make them immutable should be small, and the consistency of interacting with documents in the same way across conditionals or general script processors is hugely beneficial.

@original-brownbear

Member Author

commented Aug 27, 2018

@rjernst sweet, so add tests + be happy with this? :)

@rjernst
Member

left a comment

Thanks @original-brownbear. Tests for the immutability would be good. I left a few more comments as well.

@Override
public void execute(IngestDocument ingestDocument) throws Exception {
    if (scriptService.compile(condition, ProcessorConditionalScript.CONTEXT)
        .newInstance(condition.getParams()).execute(new UnmodifiableIngestData(ingestDocument.getSourceAndMetadata()))) {

@rjernst

rjernst Aug 27, 2018

Member

break this long line into compiling the script in a separate line from executing it?

@original-brownbear

original-brownbear Aug 27, 2018

Author Member

Sure will do :)

@@ -82,6 +82,7 @@ public IngestCommonPlugin() {
processors.put(KeyValueProcessor.TYPE, new KeyValueProcessor.Factory());
processors.put(URLDecodeProcessor.TYPE, new URLDecodeProcessor.Factory());
processors.put(BytesProcessor.TYPE, new BytesProcessor.Factory());
processors.put(ConditionalProcessor.TYPE, new ConditionalProcessor.Factory(parameters.scriptService));

@rjernst

rjernst Aug 27, 2018

Member

I thought we were only exposing the conditional via each processor as if, not as a processor on its own?

@original-brownbear

original-brownbear Aug 27, 2018

Author Member

See #32398 (comment) and discussion leading up to it. This is just an implementation to prevent us from having to adjust every processor.

@rjernst

rjernst Aug 27, 2018

Member

Sorry I don't understand. Can you reiterate? I don't understand why using it from parsing if requires it be generally available. As it is here, it would be available to construct directly by a user, right?

@original-brownbear

original-brownbear Aug 27, 2018

Author Member

@rjernst

Sure, let me try to make it short:
In order to get the conditional we have to either add a conditional evaluation to every execute method for every processor or make some abstract parent have an execute implementation that holds the conditional evaluation. (I mean there's other options, but I think the two mentioned are the simplest and others will be even more noisy/risky/...)
Either option requires us to change all existing processors and also their factories (as a result of us parsing the configuration in each processor factory).
=> I went for this implementation since it was shortest and didn't have any functional/performance downsides over alternatives anyway (as far as I can see).

As it is here, it would be available to construct directly by a user, right?

Yea true, if you think that's a problem I can prevent that in the parser easily though :) Should I?

@rjernst

rjernst Aug 27, 2018

Member

Yes I think we should prevent that. But I'm still not understanding why it is necessary for this processor to have a factory or be registered. It can be constructed completely locally, and directly via its ctor, within ConfigurationUtils.

@original-brownbear

original-brownbear Aug 28, 2018

Author Member

@rjernst that's more of a convenience thing, because org.elasticsearch.ingest.ConfigurationUtils#readProcessor(java.util.Map<java.lang.String,org.elasticsearch.ingest.Processor.Factory>, java.lang.String, java.lang.Object) (which is called from like 3 places in prod. code) would then also need to take the script factory as an input (which would trigger a pretty big change if you factor in test code).

Just tried to keep this less noisy again :) => bad idea?, better to add the script factory as an input here?

@rjernst

rjernst Aug 28, 2018

Member

With what I'm suggesting, the script factory is not needed. I don't think the ConditionalProcessor should be constructed from the script factory at all. Instead, have maybeWrapConditional or something like that which takes the processor we have already constructed, and creates/wraps it with a conditional processor (or returns the input if there is no conditional). There is no reason to have a factory for the conditional processor, just parse the config directly in that method, and construct there based on the config.
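
A rough sketch of that shape, for illustration only: WrapSketch, Proc and the compiler parameter are hypothetical stand-ins (the real code would work with Processor, Script and the script service), and the document is again modelled as a plain Map.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

final class WrapSketch {
    private WrapSketch() {}

    // Hypothetical minimal processor abstraction.
    interface Proc {
        void execute(Map<String, Object> doc);
    }

    // Takes an already constructed processor and its config. If the config carries
    // an "if" entry, that entry is removed and the processor is wrapped; otherwise
    // the input processor is returned unchanged.
    // Assumption: "compiler" stands in for compiling the condition script.
    static Proc maybeWrapConditional(Proc processor,
                                     Map<String, Object> config,
                                     Function<String, Predicate<Map<String, Object>>> compiler) {
        Object source = config.remove("if");
        if (source == null) {
            return processor;
        }
        Predicate<Map<String, Object>> condition = compiler.apply(source.toString());
        return doc -> {
            if (condition.test(doc)) {
                processor.execute(doc);
            }
        };
    }
}
```

With this shape there is no factory and no registered processor type for the conditional; the price is that whatever compiles the script has to be passed into the config-reading code.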

@original-brownbear

original-brownbear Aug 28, 2018

Author Member

@rjernst you will still need the ScriptFactory as a method input to org.elasticsearch.ingest.ConfigurationUtils#readProcessorConfigs in some form or another won't you?

I 100% agree, that the current solution looks kind of convoluted :), but it was the only way I saw out of blowing up the change-set with passing the ScriptFactory into the methods inside ConfigurationUtils.
Looking at your discussion so far it seems it's probably preferable to go with the bigger changeset + cleaner solution I guess? :)
Sorry again about the confusion I caused here :)

@rjernst

rjernst Aug 29, 2018

Member

Yeah, you are right, that method would have to take ScriptFactory. But I think this makes sense? And I think the signature changes should be mostly minimal? Maybe I am underestimating the extent, but I think it is fairly well contained to a handful of methods.

@original-brownbear

original-brownbear Aug 29, 2018

Author Member

When I first tried it, it looked like a fairly massive change :) (but this is also really subjective).
I'll just code it up today and we can take a look, the result is def. nicer than what we have here ... so probably worth it

/**
* A script used by the Ingest Script Processor.
*/
public abstract class ProcessorConditionalScript {

@rjernst

rjernst Aug 27, 2018

Member

Can we call this IngestConditionalScript?

@original-brownbear

original-brownbear Aug 27, 2018

Author Member

sure :)

}

private static Object wrapUnmodifiable(Object raw) {
if (raw instanceof Map) {

@rjernst

rjernst Aug 27, 2018

Member

Can you add a comment here that these types must match what the json parser can create, and that anything not handled here must be immutable already (eg boxed numerics)?

@original-brownbear

original-brownbear Aug 27, 2018

Author Member

sure :)

@original-brownbear original-brownbear removed the WIP label Aug 29, 2018

@original-brownbear

Member Author

commented Aug 29, 2018

@rjernst alright, handled this by passing down the ScriptFactory now and removing the magic around a conditional processor factory and rewriting the config maps :)
Also added some tests around the immutability of the data passed in.
Should be good for another review :)

@jakelandis jakelandis referenced this pull request Aug 29, 2018

Open

ingest: Documentation improvements for ingest node #33188

9 of 11 tasks complete
@rjernst
Member

left a comment

Thanks @original-brownbear! LGTM

} catch (Exception e) {
throw newConfigurationException(type, tag, null, e);
}
}
throw newConfigurationException(type, tag, null, "No processor type exists with name [" + type + "]");
}

private static Script maybeExtractConditional(Map<String, Object> config) throws IOException {

@rjernst

rjernst Aug 29, 2018

Member

since you are returning the script, I think this can just be called extractConditional

LoggingDeprecationHandler.INSTANCE, stream)) {
return Script.parse(parser);
}
} else {

@rjernst

rjernst Aug 29, 2018

Member

No need for an else, it can just be outside the if

@original-brownbear

Member Author

commented Aug 29, 2018

@rjernst thanks! Will merge once green :)

@original-brownbear original-brownbear merged commit cc4d705 into elastic:master Aug 30, 2018

4 checks passed

CLA Commit author has signed the CLA
elasticsearch-ci Build finished.
elasticsearch-ci/oss-distro-docs Build finished.
elasticsearch-ci/packaging-sample Build finished.

@original-brownbear original-brownbear deleted the original-brownbear:21248 branch Aug 30, 2018

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 4, 2018

Ingest: Add conditional per processor (elastic#32398)
* Ingest: Add conditional per processor
* closes elastic#21248

original-brownbear added a commit that referenced this pull request Sep 4, 2018

Ingest: Add conditional per processor (#32398) (#33380)
* Ingest: Add conditional per processor
* closes #21248

@Mpdreamz Mpdreamz referenced this pull request Dec 13, 2018

Closed

[meta] 6.5.0 Release #3457

61 of 107 tasks complete

@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
