-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature to "fix" filtering on multi-valued dimensions #2130
Conversation
c9bdece
to
7a4a4aa
Compare
@himanshug can you comment on the operational difference between the |
@drcrallen filter in DimensionSpec only make sense for multi-valued dimensions. Existing [Regex]Dim filter is used to select row from segment. one row would explode into multiple rows based on multivalued dimension. Then DimensionSpec filter would apply on these exploded rows and select from them. |
} | ||
|
||
if (matched == null) { | ||
matched = new HashSet<>(values.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is applied to multiple segments at the same time, the HashSet
reference is gonna be changing out from under us. Would be better to make a new one here and then close over the DimensionSelector
666cc04
to
c2dcc1d
Compare
@cheddar review comment addressed. |
### Filtering DimensionSpecs | ||
These are only valid for multi-valued dimensions. They take a delegate DimensionSpec and a filtering criteria, multiple values list of dimension is filtered as per given criteria. | ||
|
||
Following filtered dimension spec acts as a whiltelist or blacklist for values as per the configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitelist*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one or two examples of how to use this optimization will help clear up the confusion.
… for arbitrary filtering/transformations to returned dimension values
41dddb2
to
c988d02
Compare
@fjy updated the doc, see if it is better now? |
m adding some more examples to doc. |
I'm 👍, I find the docs easier to read and understand now as well. @fjy ? |
Following filtered dimension spec retains only the values matching regex. Note that `listFiltered` is faster than this and one should use that for whitelist or blacklist usecase. | ||
```json | ||
{ "type" : "regexFiltered", "delegate" : <dimensionSpec>, "pattern": <java regex pattern> } | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just make these as having filters? Unless I am mistaken these are basically restricted HAVING filters. At the very least there should be a link in the HAVING filter spec doc to this doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to expand in this I filed #1984 some time ago, it looks like #2043 fixed it (I am not 100% sure, need to try it out).
So doing a having filter with:
{
"type": "dimSelector",
"dimension": "<dimension>",
"value": <dimension_value>
}
Should work. (I have not tested that yet but the PR was merged).
What would be the difference between doing listFiltered
and dimSelector
on the relevant dimensions?
I understand that listFiltered
would work for topNs but ideally topN
should support having
filters just like groupBy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one of my teammates added the new having specs to work on dimension values to solve the problem. However having specs are only applied at the broker after all the processing is done, so historicals will process/merge all the unwanted rows and pass them to broker where broker will further merge them and then in the end having filter will discard those. that would cause a lot of unnecessary memory and cpu consumption across the cluster. filters in this PR will get applied to the lowest possible level in the pipeline.
That said, I would add a line in the doc saying similar results can be obtained via having filters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, this will work for both topN and groupBy
…dimensions can be filtered correctly also adding UTs for multi-valued dimensions
c988d02
to
635982f
Compare
@fjy @vogievetsky I added documentation for query beharior on multi-valued dimensions , please see multi-valued-dimensions.md introcuded in this PR. hopefully, that makes things clearer. |
635982f
to
bad571f
Compare
👍 after travis |
bad571f
to
e1ea93b
Compare
Feature to "fix" filtering on multi-valued dimensions
I understand that this is more efficient. I think ideally Druid should be able to separate having filters into work that can be performed on the brokers and work that needs to be done in the post process step. While a smart library like Plywood could auto split the having filters to leverage this use-case users using the native Druid API directly would benefit from this greatly. |
Currently, if you have a row in druid that has a multi-valued dimension with values ["v1", "v2", "v3"] and you send a query grouping by that dimension with filter for value "v1". In the response you will get rows containing "v2" and "v3" as well.
For more details see the UT MultiValuedDimensionTest.testGroupByWithDimFilter() in this PR.
This PR introduces the feature to add regex or list-of-string-values filters in new DimensionSpec implementations which can be used to do proper filtering on multi-valued dimensions so that unwanted rows are discarded as early in processing pipeline as possible.
Not having this ability causes unnecessary major resource utilization across brokers and historicals where result set would sometime contain 90% unwanted rows.