Feature to "fix" filtering on multi-valued dimensions #2130

himanshug · 2015-12-19T09:11:03Z

Currently, if a row in Druid has a multi-valued dimension with values ["v1", "v2", "v3"] and you send a query grouping by that dimension with a filter for value "v1", the response will also contain rows for "v2" and "v3".
For more details, see the unit test MultiValuedDimensionTest.testGroupByWithDimFilter() in this PR.
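
A minimal sketch of the behavior described above, written in plain Java rather than against Druid's actual classes. The class `MultiValueGroupBySketch` and its helpers are hypothetical names used only for illustration; the assumption is that a row-level filter selects whole rows and that grouping then emits one output row per value of each selected row.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; these are not Druid classes.
final class MultiValueGroupBySketch {
  // A row-level filter matches when ANY value of the dimension equals the wanted value.
  static boolean rowMatches(List<String> values, String wanted) {
    return values.contains(wanted);
  }

  // Group by the dimension: every selected row contributes one output row per value.
  static Map<String, Integer> groupByCounts(List<List<String>> rows, String filterValue) {
    Map<String, Integer> counts = new TreeMap<>();
    for (List<String> values : rows) {
      if (!rowMatches(values, filterValue)) {
        continue; // the whole row is dropped
      }
      for (String value : values) {
        counts.merge(value, 1, Integer::sum); // "v2" and "v3" come along with "v1"
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<List<String>> rows = List.of(List.of("v1", "v2", "v3"), List.of("v4"));
    System.out.println(groupByCounts(rows, "v1")); // prints {v1=1, v2=1, v3=1}
  }
}
```

The row containing only "v4" is dropped by the filter, but "v2" and "v3" still appear because they live in the same row as "v1".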

This PR adds new DimensionSpec implementations that accept a regex or a list of string values as a filter. They can be used to filter multi-valued dimensions properly, so that unwanted rows are discarded as early in the processing pipeline as possible.

Without this ability, brokers and historicals waste significant resources, since the result set would sometimes contain 90% unwanted rows.

drcrallen · 2015-12-22T18:40:13Z

@himanshug can you comment on the operational difference between the RegexFilteredDimensionSpec and an extraction filter using regex?

himanshug · 2015-12-22T19:07:42Z

@drcrallen a filter in a DimensionSpec only makes sense for multi-valued dimensions. The existing [Regex]Dim filter is used to select rows from the segment. One row is then exploded into multiple rows based on the multi-valued dimension, and the DimensionSpec filter applies to these exploded rows and selects from them.
In many cases the regex pattern would be the same in both places, but it can differ depending on the use case.
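
The two stages described in this reply could be sketched as follows. This is plain Java, not Druid's implementation: `TwoStageFilterSketch`, `rowMatches` and `explodeAndFilter` are hypothetical names, and the two patterns are deliberately different to show that the row-level filter and the DimensionSpec-level filter need not agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; these are not Druid classes.
final class TwoStageFilterSketch {
  // Stage 1: a row is selected when any of its values matches the row-level pattern.
  static boolean rowMatches(List<String> values, Pattern rowPattern) {
    for (String value : values) {
      if (rowPattern.matcher(value).matches()) {
        return true;
      }
    }
    return false;
  }

  // Stage 2: selected rows are exploded, and each exploded value is filtered again.
  static List<String> explodeAndFilter(
      List<List<String>> rows, Pattern rowPattern, Pattern valuePattern) {
    List<String> out = new ArrayList<>();
    for (List<String> values : rows) {
      if (!rowMatches(values, rowPattern)) {
        continue;
      }
      for (String value : values) {
        if (valuePattern.matcher(value).matches()) {
          out.add(value);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<String>> rows = List.of(List.of("a1", "a2", "b1"), List.of("b1", "b2"));
    // Select rows containing any "a..." value, then keep only "a1" and "b1" from them.
    System.out.println(
        explodeAndFilter(rows, Pattern.compile("a.*"), Pattern.compile("a1|b1")));
    // prints [a1, b1]
  }
}
```

The second row never reaches stage 2 because none of its values matches the row-level pattern, even though it contains "b1".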

cheddar · 2015-12-29T17:47:28Z

processing/src/main/java/io/druid/query/dimension/ListFilteredDimensionSpec.java

+    }
+
+    if (matched == null) {
+      matched = new HashSet<>(values.size());


If this is applied to multiple segments at the same time, the HashSet reference is gonna be changing out from under us. It would be better to make a new one here and then close over the DimensionSelector.
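
One way to read this suggestion, sketched in plain Java. The `ValueSelector` interface below is a hypothetical, simplified stand-in and not Druid's actual `DimensionSelector` API, and the sketch assumes the set caches segment-specific dictionary ids, which the excerpt above does not confirm. The point is only that the set is created inside `decorate` and captured by the returned object, so no mutable state is shared between segments.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; these are not Druid classes.
final class PerSelectorStateSketch {
  // Simplified stand-in for a per-segment selector.
  interface ValueSelector {
    List<Integer> currentRowIds(); // dictionary ids of the current row's values
    String lookupName(int id);     // id to value, valid for one segment only
    int cardinality();
  }

  private final Set<String> values; // configuration only, never mutated

  PerSelectorStateSketch(Set<String> values) {
    this.values = new HashSet<>(values);
  }

  ValueSelector decorate(final ValueSelector delegate) {
    // Created fresh on every call instead of being cached in a field,
    // so concurrent calls for different segments cannot overwrite each other.
    final Set<Integer> matched = new HashSet<>(values.size());
    for (int id = 0; id < delegate.cardinality(); id++) {
      if (values.contains(delegate.lookupName(id))) {
        matched.add(id);
      }
    }
    // The anonymous class closes over both the delegate and its own matched set.
    return new ValueSelector() {
      @Override public List<Integer> currentRowIds() {
        List<Integer> kept = new ArrayList<>();
        for (int id : delegate.currentRowIds()) {
          if (matched.contains(id)) {
            kept.add(id);
          }
        }
        return kept;
      }
      @Override public String lookupName(int id) { return delegate.lookupName(id); }
      @Override public int cardinality() { return delegate.cardinality(); }
    };
  }

  // Test helper: a selector over a fixed dictionary and a single fixed row.
  static ValueSelector fixed(final List<String> dictionary, final List<Integer> rowIds) {
    return new ValueSelector() {
      @Override public List<Integer> currentRowIds() { return rowIds; }
      @Override public String lookupName(int id) { return dictionary.get(id); }
      @Override public int cardinality() { return dictionary.size(); }
    };
  }

  public static void main(String[] args) {
    PerSelectorStateSketch spec = new PerSelectorStateSketch(Set.of("v1"));
    // Two "segments" whose dictionaries assign different ids to "v1".
    ValueSelector a = spec.decorate(fixed(List.of("v1", "v2"), List.of(0, 1)));
    ValueSelector b = spec.decorate(fixed(List.of("v2", "v1"), List.of(0, 1)));
    System.out.println(a.currentRowIds() + " " + b.currentRowIds()); // prints [0] [1]
  }
}
```

Both selectors are created before either is read, yet each still resolves "v1" against its own dictionary, which is the isolation the review comment asks for.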

himanshug · 2015-12-30T19:24:11Z

@cheddar review comment addressed.

fjy · 2015-12-30T20:27:12Z

docs/content/querying/dimensionspecs.md

+### Filtering DimensionSpecs
+These are only valid for multi-valued dimensions. They take a delegate DimensionSpec and a filtering criteria, multiple values list of dimension is filtered as per given criteria.
+
+Following filtered dimension spec acts as a whiltelist or blacklist for values as per the configuration.


The following*

what configuration?

will change

I think one or two examples of how to use this optimization will help clear up the confusion.

himanshug · 2015-12-30T21:40:15Z

@fjy I updated the doc; please see if it is better now.

himanshug · 2015-12-30T21:57:15Z

I'm adding some more examples to the doc.

cheddar · 2015-12-30T22:09:11Z

I'm 👍, I find the docs easier to read and understand now as well. @fjy ?

vogievetsky · 2015-12-30T22:12:14Z

docs/content/querying/dimensionspecs.md

+Following filtered dimension spec retains only the values matching regex. Note that `listFiltered` is faster than this and one should use that for whitelist or blacklist usecase.
+```json
+{ "type" : "regexFiltered", "delegate" : <dimensionSpec>, "pattern": <java regex pattern> }
+```


Why not just make these having filters? Unless I am mistaken, these are basically restricted HAVING filters. At the very least there should be a link in the HAVING filter spec doc to this doc.

Just to expand on this: I filed #1984 some time ago, and it looks like #2043 fixed it (I am not 100% sure, I need to try it out).

So doing a having filter with:

{ "type": "dimSelector", "dimension": "<dimension>", "value": <dimension_value> }

Should work. (I have not tested that yet but the PR was merged).

What would be the difference between doing listFiltered and dimSelector on the relevant dimensions?

I understand that listFiltered would work for topNs, but ideally topN should support having filters just like groupBy.

One of my teammates added the new having specs that work on dimension values to solve this problem. However, having specs are only applied at the broker after all the processing is done, so historicals will process and merge all the unwanted rows and pass them to the broker, where the broker will merge them further, and only at the end will the having filter discard them. That causes a lot of unnecessary memory and CPU consumption across the cluster. The filters in this PR are applied at the lowest possible level in the pipeline.
That said, I would add a line in the doc saying similar results can be obtained via having filters.

Also, this will work for both topN and groupBy.
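
A rough sketch of the cost difference argued above, in plain Java with hypothetical names (`EarlyVsLateFilterSketch`, `partial`, `mergeAndHaving`, `shipped`). It does not model Druid's real query pipeline; it only counts how many grouped rows each simulated historical would hand to the broker when filtering happens late (having-style) versus early (at the dimension level).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; these are not Druid classes.
final class EarlyVsLateFilterSketch {
  // One simulated historical: explode and group values.
  // keepEarly == null means no dimension-level filtering is applied.
  static Map<String, Integer> partial(List<List<String>> rows, Set<String> keepEarly) {
    Map<String, Integer> counts = new TreeMap<>();
    for (List<String> values : rows) {
      for (String value : values) {
        if (keepEarly == null || keepEarly.contains(value)) {
          counts.merge(value, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  // Simulated broker: merge all partial results, then apply a having-style filter.
  static Map<String, Integer> mergeAndHaving(
      List<Map<String, Integer>> partials, Set<String> wanted) {
    Map<String, Integer> merged = new TreeMap<>();
    for (Map<String, Integer> p : partials) {
      p.forEach((key, count) -> merged.merge(key, count, Integer::sum));
    }
    merged.keySet().retainAll(wanted);
    return merged;
  }

  // Number of grouped rows sent from historicals to the broker.
  static int shipped(List<Map<String, Integer>> partials) {
    int total = 0;
    for (Map<String, Integer> p : partials) {
      total += p.size();
    }
    return total;
  }

  public static void main(String[] args) {
    List<List<String>> h1 = List.of(List.of("v1", "v2", "v3"), List.of("v1", "v4"));
    List<List<String>> h2 = List.of(List.of("v1", "v5"));
    Set<String> wanted = Set.of("v1");

    List<Map<String, Integer>> late = List.of(partial(h1, null), partial(h2, null));
    List<Map<String, Integer>> early = List.of(partial(h1, wanted), partial(h2, wanted));

    System.out.println(shipped(late) + " rows -> " + mergeAndHaving(late, wanted));
    // prints 6 rows -> {v1=3}
    System.out.println(shipped(early) + " rows -> " + mergeAndHaving(early, wanted));
    // prints 2 rows -> {v1=3}
  }
}
```

Both paths produce the same final answer; the difference is how many intermediate rows have to be grouped, shipped and merged before the unwanted ones are thrown away.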

himanshug · 2015-12-31T00:05:13Z

@fjy @vogievetsky I added documentation for query behavior on multi-valued dimensions; please see multi-valued-dimensions.md introduced in this PR. Hopefully that makes things clearer.

fjy · 2015-12-31T00:10:18Z

👍 after travis

vogievetsky · 2015-12-31T06:03:12Z

I understand that this is more efficient. I think ideally Druid should be able to separate having filters into work that can be performed on the brokers and work that needs to be done in the post-process step. While a smart library like Plywood could auto-split the having filters to leverage this use case, users of the native Druid API would benefit from this greatly.

himanshug added the Release Notes label Dec 19, 2015

himanshug added this to the 0.9.0 milestone Dec 19, 2015

himanshug force-pushed the fix_filter_multi_valued branch 2 times, most recently from c9bdece to 7a4a4aa Compare December 20, 2015 06:22

cheddar reviewed Dec 29, 2015

himanshug force-pushed the fix_filter_multi_valued branch 3 times, most recently from 666cc04 to c2dcc1d Compare December 30, 2015 19:19

fjy reviewed Dec 30, 2015

adding decorate(DimensionSelector) to DimensionSpec to enable support…

fa5c3bb

… for arbitrary filtering/transformations to returned dimension values

himanshug force-pushed the fix_filter_multi_valued branch 2 times, most recently from 41dddb2 to c988d02 Compare December 30, 2015 21:35

vogievetsky reviewed Dec 30, 2015

Add support for filtering at DimensionSpec level so that multivalued …

b47d807

…dimensions can be filtered correctly also adding UTs for multi-valued dimensions

himanshug force-pushed the fix_filter_multi_valued branch from c988d02 to 635982f Compare December 31, 2015 00:02

himanshug force-pushed the fix_filter_multi_valued branch from 635982f to bad571f Compare December 31, 2015 00:06

documenting querying behavior on multi-valued dimensions

e1ea93b

himanshug force-pushed the fix_filter_multi_valued branch from bad571f to e1ea93b Compare December 31, 2015 00:14

fjy added a commit that referenced this pull request Dec 31, 2015

Merge pull request #2130 from himanshug/fix_filter_multi_valued

f821943

Feature to "fix" filtering on multi-valued dimensions

fjy merged commit f821943 into apache:master Dec 31, 2015

binlijin mentioned this pull request Jan 12, 2016

fix topN filtering on multi-valued dimension bug #2255

Closed

himanshug deleted the fix_filter_multi_valued branch February 8, 2016 16:16
