Skip to content

expression transform improvements and fixes#13947

Merged
clintropolis merged 6 commits intoapache:masterfrom
clintropolis:expr-transform-improvements-and-binding-consistency
Mar 22, 2023
Merged

expression transform improvements and fixes#13947
clintropolis merged 6 commits intoapache:masterfrom
clintropolis:expr-transform-improvements-and-binding-consistency

Conversation

@clintropolis
Copy link
Member

@clintropolis clintropolis commented Mar 19, 2023

Description

changes:

  • fixes inconsistent handling of byte[] values between ExprEval.bestEffortOf and ExprEval.ofType, which could cause byte[] values to end up as java toString values instead of base64 encoded strings in ingest time transforms
  • improved ExpressionTransform binding to re-use ExprEval.bestEffortOf when evaluating a binding instead of throwing it away
  • improved ExpressionTransform array handling in anticipation of nested columns + arrays = array columns! #13803, added RowFunction.evalDimension that returns List<String> to back Row.getDimension and remove the automatic coercing of array types that would typically happen to ExpressionTransform now unless using Row.getDimension
  • improved ExpressionPostAggregator to use partial type information from decoration
  • added some tests for ExpressionTransform with array inputs

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

changes:
* fixes inconsistent handling of byte[] values between ExprEval.bestEffortOf and ExprEval.ofType, which could cause byte[] values to end up as java toString values instead of base64 encoded strings in ingest time transforms
* improved ExpressionTransform binding to re-use ExprEval.bestEffortOf when evaluating a binding instead of throwing it away
* improved ExpressionTransform array handling, added RowFunction.evalDimension that returns List<String> to back Row.getDimension and remove the automatic coercing of array types that would typically happen to expression transforms unless using Row.getDimension
* added some tests for ExpressionTransform with array inputs
Comment on lines +518 to +520
if (value instanceof byte[]) {
return new StringExprEval(StringUtils.encodeBase64String((byte[]) value));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has me wondering, what if an expression actually wants the byte[] how would it be defined so that if one expression returns a byte[] and the next one wants to use it, then it will just be passed through without being base64 encoded in between?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is for things that are asking for STRING typed values, (though they might be multi-value strings as well). COMPLEX types will accept bytes as is and try to deserialize them into the appropriate object using a TypeStrategy that wraps the ObjectStrategy.

However, to complicate this answer slightly, where the type passed to this method comes from varies depending on where it is being called. This ofType method is what backs IdentifierExpr which is what feeds input values into expressions. When backed by a segment, the type will be the type which was stored in the segment, etc. For places we don't know though, such as expression transforms, we fall back to using bestEffortOf, which will handle byte[] as a STRING type. It lacks the complex type name so cant handle byte[] to complex object translation, so we turn it into a string because at least then something could do something with it.

Responding to this made me realize that the expression post-agg bindings could be improved with the partial type information derived from the aggregators 'result' type, so I have updated them to use it accordingly in the latest commit.

Back to byte[], we could alternatively consider leaving it as 'unknown' COMPLEX, though that would cause some issue with nested columns which is the other main user of 'bestEffortOf', which uses it to try to derive the type information of these values. Since we don't have a native binary blob type, STRING is most useful here so we can at least preserve the values (and for JSON, byte[] already come in as base64 strings, so byte[] really only appear in other nested formats, such as Avro, Parquet, Protobuf, and ORC).

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

builder.put("e", new String[] {null, "foo", "bar"});
builder.put("f", new String[0]);
bindings = InputBindings.withMap(builder.build());
bindings = InputBindings.forMap(builder.build());

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [InputBindings.forMap](1) should be avoided because it has been deprecated.
public void testDoubleEval()
{
Expr.ObjectBinding bindings = InputBindings.withMap(ImmutableMap.of("x", 2.0d));
Expr.ObjectBinding bindings = InputBindings.forMap(ImmutableMap.of("x", 2.0d));

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [InputBindings.forMap](1) should be avoided because it has been deprecated.
public void testLongEval()
{
Expr.ObjectBinding bindings = InputBindings.withMap(ImmutableMap.of("x", 9223372036854775807L));
Expr.ObjectBinding bindings = InputBindings.forMap(ImmutableMap.of("x", 9223372036854775807L));

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [InputBindings.forMap](1) should be avoided because it has been deprecated.
Expr.ObjectBinding bindings = InputBindings.withMap(bindingsMap);
Expr.ObjectBinding bindings = InputBindings.forMap(bindingsMap);

try {

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [InputBindings.forMap](1) should be avoided because it has been deprecated.
.put("str1", "v1")
.put("str2", "v2");
bindings = InputBindings.withMap(builder.build());
bindings = InputBindings.forMap(builder.build());

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [InputBindings.forMap](1) should be avoided because it has been deprecated.
@clintropolis clintropolis merged commit f4392a3 into apache:master Mar 22, 2023
@clintropolis clintropolis deleted the expr-transform-improvements-and-binding-consistency branch March 22, 2023 06:26
@hongming-xu
Copy link

hongming-xu commented Mar 27, 2023

hi, @clintropolis , the modification that adds evalDimension(Row row) method in RowFunction interface,
Will it have an impact on the transform extensions that were previously implemented?

References:
(1)Adding a Transform Extension
https://druid.apache.org/docs/latest/development/modules.html#adding-a-transform-extension

@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
abhishekagarwal87 pushed a commit that referenced this pull request Jan 2, 2024
…15452)

The PR: #13947 introduced a function evalDimension() in the interface RowFunction.
There was no default implementation added for this interface which causes all the implementations and custom transforms to fail and require to implement their own version of evalDimension method. This PR adds a default implementation in the interface which allows the evalDimension to return value as a Singleton array of eval result.
@maytasm maytasm mentioned this pull request Apr 7, 2025
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants