modify equality and typed in filter behavior for numeric match values on string columns #16593
Conversation
… string columns

changes:
* EqualityFilter and TypedInFilter numeric match values against string columns will now use `StringComparators.NUMERIC` instead of converting the numeric values directly to string for pure string equality. This is effectively an implicit cast of the STRING values to the numeric match value type, which is consistent with the casts which are eaten in the SQL layer, as well as classic Druid behavior.
* added tests to cover numeric equality matching. Double match values in particular would fail to match the string values since `1.0` would become `'1.0'`, which does not match `'1'`.
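To make the bug concrete, here is a minimal standalone sketch (the class name `NumericMatchSketch` is hypothetical; `StringComparators.NUMERIC` is the real comparator from Druid's processing module):

```java
import org.apache.druid.query.ordering.StringComparators;

public class NumericMatchSketch
{
  public static void main(String[] args)
  {
    // Old behavior: the numeric match value 1.0 was stringified, so pure
    // string equality compared "1.0" against the stored string "1" and failed.
    System.out.println(String.valueOf(1.0).equals("1")); // false

    // New behavior: StringComparators.NUMERIC compares both sides as numbers,
    // effectively casting the string column value to the match value's type.
    System.out.println(StringComparators.NUMERIC.compare("1", "1.0") == 0); // true
  }
}
```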
I think the functions in Calcite which make it easy to discard casts (like `RexUtil#isLiteral`) could have made it too easy to discard such things :)
final Set<String> stringSet;
// when matching strings to numeric match values, use numeric comparator to implicitly cast the string to number
if (matchValueType.isNumeric()) {
  stringSet = new ObjectAVLTreeSet<>(StringComparators.NUMERIC);
I feel like regardless of the fact that the values arrive as `sortedValues`, the build cost of this type of `TreeSet` will still be O(NlogN), and I guess lookup will also be O(logN); the old approach of a hashmap was build O(N), lookup O(1).

I feel like delaying the normalization until the comparison is a bit too much - why not introduce normalization of the input values instead?
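A rough sketch of the trade-off being raised here, with `java.util.TreeSet` standing in for fastutil's `ObjectAVLTreeSet` (both are balanced binary trees) and `numericComparator` standing in for `StringComparators.NUMERIC`; names are illustrative, not code from the PR:

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class BuildCostSketch
{
  // Tree-based set: build is O(NlogN) even for pre-sorted input, and every
  // contains() is O(logN), invoking the numeric comparator at each step.
  static Set<String> treeBased(List<String> sortedValues, Comparator<String> numericComparator)
  {
    Set<String> set = new TreeSet<>(numericComparator);
    set.addAll(sortedValues);
    return set;
  }

  // Hash-based set: build is O(N) and contains() is O(1) expected, with no
  // comparator calls at lookup time.
  static Set<String> hashBased(List<String> values)
  {
    return new HashSet<>(values);
  }
}
```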
See the other comments about trying to have consistent behavior no matter how `sqlUseBoundAndSelectors` is set. If `sqlUseBoundAndSelectors` is set to `true` (which was always the old behavior before these new filters and the context flag were introduced), something like `stringCol in (1.0, 2.0, 3.0)` would be planned into an `OrFilter` of `BoundFilter`, e.g. `(1.0 <= stringCol <= 1.0) OR (2.0 <= stringCol <= 2.0) OR (3.0 <= stringCol <= 3.0)`, which is even less efficient than this sorted set.

The goal here is to get this to behave as though `stringCol` was cast to double, which means we cannot convert the match value set to strings and use a plain set. We either need to use this numeric comparator, or actually cast the entire string column to a double externally so that this filter uses a double predicate instead of a string predicate (not eating the cast in SQL, in which case this filter would instead be using a double predicate because the column would be a double virtual column instead of a string regular column).

Since using this comparator is basically guaranteed not to be worse than the older behavior (and in fact slightly better, since we don't have to do the more expensive `BoundFilter` check and can use the set instead of the loop that `OR` would need to do), this is the 'easier' approach. I agree it's potentially not the best approach, but the possibly best approach requires quite a lot more work and validation, which is what I really meant by saving not eating the cast for a follow-up PR.
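For reference, a sketch of roughly what the old plan cost per row; `matchesOrOfBounds` is a hypothetical name illustrating the semantics, not the actual `BoundFilter` code:

```java
import org.apache.druid.query.ordering.StringComparators;

import java.util.List;

class OldPlanSketch
{
  // The OrFilter of BoundFilters does a linear scan over the match values for
  // every row, with two numeric comparisons per bound: (v <= row) AND (row <= v).
  static boolean matchesOrOfBounds(String rowValue, List<String> matchValues)
  {
    for (String match : matchValues) {
      if (StringComparators.NUMERIC.compare(match, rowValue) <= 0
          && StringComparators.NUMERIC.compare(rowValue, match) <= 0) {
        return true;
      }
    }
    return false;
  }
}
```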
if (value == null) {
  return DruidPredicateMatch.UNKNOWN;
}
return DruidPredicateMatch.of(StringComparators.NUMERIC.compare(value, castForComparison.asString()) == 0);
`StringComparators.NUMERIC` is full of interesting calls - like `convertStringToBigDecimal` - seems like something which could possibly generate some garbage collector pressure... especially if it gets used inside a for-each-row comparator.
This change is trying to make this filter, when comparing strings to numerics, consistent with the behavior of the old `BoundFilter`, which was previously always used for comparing strings to numeric values until these new filters were added; equality was planned to something like `1.0 <= stringCol <= 1.0`. `BoundFilter` uses this comparator when comparing strings to numeric values, see https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/filter/BoundFilter.java#L265. The goal here is to make the behavior the same regardless of how `sqlUseBoundAndSelectors` (the flag that controls whether we use the new or old filters) is set.

Casting the string column to a double column is also relatively expensive: currently we use https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/ExprEval.java#L627, where `Doubles.tryParse` involves a regex. This could probably be done more efficiently, but again goes back to requiring more dramatic changes be made in the native layer, which I was trying to avoid for now.
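To make that per-row cost concrete, a sketch of the cast path being described (Guava's `Doubles.tryParse` is the call the linked `ExprEval` code uses; the wrapper method here is hypothetical):

```java
import com.google.common.primitives.Doubles;

class CastRowSketch
{
  // Doubles.tryParse validates the string against a floating-point regex
  // before parsing, so paying this for every row across millions of rows adds up.
  static Double castRow(String rowValue)
  {
    return rowValue == null ? null : Doubles.tryParse(rowValue);
  }
}
```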
The `TypedInFilter` class knows in which type the match should happen (`matchValueType`) - the values it has should already be prepared to be used with that.

> Casting the string column to a double column is also relatively expensive

With a virtual column that might be true; but I think that casting will happen quite a lot with the current approach as well. Let me try to explain what I see right now:

- load an AVL tree
- do comparisons with a comparator, which will create a `BigDecimal` when:
  - it is not a comparison against `null`
  - the number(s) are not a `long`

  that's kinda like all the interesting cases...
- due to the AVL tree lookup there will be an O(logN) scaler - the `BigDecimal` creation will also happen in cases where the comparator returns `false`

I wonder why not:

- remove the AVL tree stuff
- the sorted values should be `double`-s if not already... but we could populate the set with `BigDecimal`-s for simplicity
- add `decimalValue = convertStringToBigDecimal(value)`
- use `decimalValue` to look up in the map

Couldn't this avoid the tree build penalty / multiple comparisons / normalizations during every comparison?
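A sketch of this suggestion, assuming match values parse cleanly with `new BigDecimal(String)` (Druid's `convertStringToBigDecimal` is more lenient with malformed input); class and method names are hypothetical:

```java
import java.math.BigDecimal;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NormalizedSetSketch
{
  // Normalize once at build time: stripTrailingZeros() makes "1", "1.0", and
  // "1.00" equal (and hash the same) as BigDecimals, so an O(N)-build HashSet works.
  static Set<BigDecimal> build(List<String> matchValues)
  {
    Set<BigDecimal> set = new HashSet<>();
    for (String v : matchValues) {
      set.add(new BigDecimal(v).stripTrailingZeros());
    }
    return set;
  }

  // One normalization per row value, then an O(1) expected hash lookup.
  static boolean matches(Set<BigDecimal> set, String rowValue)
  {
    return rowValue != null && set.contains(new BigDecimal(rowValue).stripTrailingZeros());
  }
}
```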
The set is built once per query, not every comparison; however, fair point about lookup speed using a plain HashSet for predicates, though there is a fair bit of complexity here that needs careful consideration and probably a bunch of measurement, which I was looking to avoid, at least until the bug is fixed.

`TypedInFilter` currently is constructed with a sorted `List` of match values; the sortedness is very important for computing indexes. We definitely optimized for that use case over the predicate filtering use case.

Right now, when the column type (predicate type) matches the filter's `matchValueType`, we just use the sorted list directly with `Collections.binarySearch`, so no set is constructed at all, though it has this same logN lookup speed that you're complaining about here, which is again a fair complaint.

As mentioned before though, we do need the sorted list, so it can't really go away; at best we can also build a hashset, so the next question is where and how to do it.

It's really not that awesome to be computing any sets in the 'makePredicate' functions, because technically if several segments are in the processing pool they would be blocked while one of them builds the sets, which is kind of sad.
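One pattern that fits the blocking behavior described here is a memoizing supplier, so the set is built at most once per filter instance; this is a sketch with hypothetical names, not the actual `TypedInFilter` code:

```java
import com.google.common.base.Supplier;
import com.google.common.base.Suppliers;

import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LazySetSketch
{
  private final Supplier<Set<String>> stringSet;

  LazySetSketch(List<String> sortedValues)
  {
    // Built at most once: the first thread to call get() pays the build cost,
    // and any concurrent callers block until the set is ready.
    this.stringSet = Suppliers.memoize(() -> new HashSet<>(sortedValues));
  }

  boolean contains(String value)
  {
    return stringSet.get().contains(value);
  }
}
```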
We could consider building the "default" set (meaning a set of the `matchValueType`) outside of the predicate factory, as part of the `TypedInFilter` constructor, though the downside is that it guarantees a double memory footprint instead of only when `matchValueType` doesn't match the column predicate type, along with the added up-front cost of constructing it.

When building the set we also need to coerce the values to handle any ugliness that happens during JSON deserialization, such as longs coming in as ints or doubles/floats or whatever, which isn't really a problem when computing indexes or with the current predicates using `Collections.binarySearch`, because the match type comparator handles that (not to be confused with the `StringComparators.NUMERIC` added in this PR, which is string predicate + number match value).

Anyway, this would let matching predicate and `matchValueType`, plus these string to numeric comparisons, use that default set which would be computed up front, and then only have to deal with the more blocky other set construction for stranger cases of `matchValueType` not matching the predicate type.

The predicate maker functions currently build sets backwards, trying to cast the match values of `matchValueType` to the predicate type, on the assumption that the number of values the predicate is going to be evaluated against is going to be a lot bigger than the match value set size (millions of rows vs. however big the IN filter is). I feel like there still seems to be some benefit to converting the set when safe rather than converting every row value (like number predicates with string match values), so I don't think we necessarily would always want to use a hashset of the match values and convert the row values either, because it means we always have to cast every row value the predicate is processing. I also don't think we want to build all of these combinations up front, since the expense would be too much if things are never used.
Anyway, all of this is why I was suggesting that longer term it makes sense to make virtual columns more efficient and also not eat the casts in the SQL layer, so that the predicate chosen would always match the `matchValueType`. However, we still have an open question of whether we should build a hashset up front or not, and the answer is I don't really know yet, maybe?

I guess all of this is why I was suggesting comparison against the behavior of `"sqlUseBoundAndSelectors": true`, which makes an `OrFilter` of `BoundFilter` for `IN`, and a single `BoundFilter` for numeric equality using `StringComparators.NUMERIC`, so that this bug can be fixed without answering all of the other hard questions right now, because the current behavior is simply incorrect. However, I can at least try to explore building a same-type hashset to cut down on the sets built inside the supplier, and take some measurements, if you view it as a blocker to fixing this bug.
It is basically a 1 line change to not discard the casts in the SQL layer. The additional work is all at the native layer, to make it so that there is no performance penalty to having the casts be explicit (lots of things implicitly perform the cast at the native layer, which is why we eat them in the first place).
if (matchValue.type().isNumeric()) {
  return value -> {
    if (value == null) {
      return DruidPredicateMatch.UNKNOWN;
Checking if this change was a regression from the original code: earlier, there was `DruidObjectPredicate.alwaysFalseWithNullUnknown`, however now we return `UNKNOWN`. Since the match value is null, the modified code looks right.
`DruidObjectPredicate.alwaysFalseWithNullUnknown` is still there; it is the call to `DruidObjectPredicate.equalTo` that is modified. `equalTo` returns unknown if the value input is null.

The `DruidObjectPredicate.alwaysFalseWithNullUnknown` is for when `ExprEval.castForEqualityComparison` returns null, which is a short circuit for things that can never match, such as matching the double value `1.1` to a long column and such.
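A sketch of why that short circuit is safe (illustrative only, not the actual `ExprEval.castForEqualityComparison` code):

```java
class ShortCircuitSketch
{
  // Matching a double against a LONG column: if the match value does not
  // round-trip through long unchanged, no long row value can ever equal it,
  // so the predicate can be replaced with alwaysFalseWithNullUnknown.
  static boolean canEverMatchLongColumn(double matchValue)
  {
    return (double) (long) matchValue == matchValue; // true for 1.0, false for 1.1
  }
}
```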
thank you for the updates!
… on string columns (apache#16593)

* fix equality and typed in filter behavior for numeric match values on string columns

changes:
* EqualityFilter and TypedInFilter numeric match values against string columns will now cast strings to numeric values instead of converting the numeric values directly to string for pure string equality, which is consistent with the casts which are eaten in the SQL layer, as well as classic Druid behavior
* added tests to cover numeric equality matching. Double match values in particular would fail to match the string values since `1.0` would become `'1.0'`, which does not match `'1'`.
Description

This PR changes `EqualityFilter` and `TypedInFilter` numeric match values against string columns to now cast the strings to numbers instead of converting the numeric values directly to string for pure string equality. This makes using these filters (the default in SQL-compatible mode) behave consistently with 'default' value mode, which uses the `BoundFilter` for numeric comparison of string values.

Added tests to cover numeric equality matching. Double match values in particular would fail to match the string values since `1.0` would become `'1.0'`, which does not match `'1'`.

I investigated an alternative of just having the SQL planner not eat `CAST` operators like it currently does, since a query like `... WHERE stringColumn = 1.0` ends up as `... WHERE CAST(stringColumn as DOUBLE) = 1.0` before we eat the cast, which makes the native layer a lot more explicit; however, this change basically does the same thing and is a lot less disruptive. It may still be worth investigating not eating casts during SQL planning, but I'll save that for follow-up work.
Release note

Modified the behavior of `EqualityFilter` and `TypedInFilter` when matching numeric typed values (particularly DOUBLE) against string columns: the strings are now effectively cast so that numeric comparison is used, giving more consistent Druid behavior regardless of the `sqlUseBoundAndSelectors` context flag.