-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent empty-set filtering behavior on multi-value columns #2750
Comments
Looking for thoughts on what makes sense. IMO: filters are supposed to be filtering on values, not the entire set, so from that standpoint, it makes sense for a But, this behavior sorta conflicts with what happens in single-value columns, where missing fields and fields equal to So I guess I'm conflicted on what should happen here. |
One possibility is to make it so empty sets in multi-value columns on disk don't match null (don't include those rows in that bitmap), and then make IncrementalIndex behave that way too with some logic like this:
I believe this behavior makes sense if you think "null" is an actual value rather than meaning "not present". |
Another possibility is to make it so empty sets always match I believe this behavior makes sense if you think "null" means "not present". |
@gianm I think that approach makes the most sense for the same example I give in #995 (comment) :
A counter -example is: If there is a sequence of events for which "cake" is absent, and another sequence of events where "cake" is empty |
@drcrallen I don't follow -- are you suggesting a particular approach? |
@gianm I propose a selector on null should match any of the following:
|
Since we treat columns that aren't present as if they match null (and we treat missing fields and empty arrays in JSON the same as null fields in JSON), it seems to me like Druid already has a lot of built-in bias towards treating null as "not present" rather than as an actual value. So I'm leaning towards thinking that filters on |
@drcrallen I think I agree with you |
@drcrallen |
I agree with @xvrl regarding not focusing on |
I feel like having the selector match both |
@vogievetsky IMHO if they don't want to match on null they shouldn't put null in the multi-value set. The only reason it would be put there is because they want to use it in filtering. |
@drcrallen I actually agree that I guess ideally it would be great to have some way to tell Maybe a |
@vogievetsky does that also mean |
@vogievetsky byRow filters will probably be added at some point (#2217 proposes them, for example) so let's assume we are only talking about non-byRow filters right now. |
BTW also going with @drcrallen suggestion and just telling people not to put nulls in MV dimension sets unless they know what they are doing is fine I think. |
PR #2753 makes things behave such that filtering on null DOES match empties. |
@drcrallen yes I assumed that |
|
The behavior is now that filters on "null" will match rows with no values. The behavior in the past was inconsistent; sometimes these filters would match and sometimes they wouldn't. Adds tests for this behavior to SelectorFilterTest and BoundFilterTest, for query-level filters and filtered aggregates. Fixes apache#2750.
Right now filtering on nulls on multi-value columns with a filter like
{"type": "selector", "dimension": "foo", "value": null}
sometimes matches empty sets and sometimes doesn't.Empty sets occur when writing a row in a multi-value column where the underlying input row's field was either missing,
null
, or[]
.Query-level filters on
dim = null
:null
and[]
tonull
when indexing, and its ValueMatcherFactory considers that representation a match fornull
)null
)null
)FilteredAggregator filters on
dim = null
:[null]
)[]
, which FilteredAggregator does not consider a match fornull
)[]
, which FilteredAggregator does not consider a match fornull
)The text was updated successfully, but these errors were encountered: