Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature to "fix" filtering on multi-valued dimensions #2130

Merged
merged 3 commits into from
Dec 31, 2015

Conversation

himanshug
Copy link
Contributor

Currently, if you have a row in druid that has a multi-valued dimension with values ["v1", "v2", "v3"] and you send a query grouping by that dimension with filter for value "v1". In the response you will get rows containing "v2" and "v3" as well.
For more details see the UT MultiValuedDimensionTest.testGroupByWithDimFilter() in this PR.

This PR introduces the feature to add regex or list-of-string-values filters in new DimensionSpec implementations which can be used to do proper filtering on multi-valued dimensions so that unwanted rows are discarded as early in processing pipeline as possible.

Not having this ability causes unnecessary major resource utilization across brokers and historicals where result set would sometime contain 90% unwanted rows.

@himanshug himanshug added this to the 0.9.0 milestone Dec 19, 2015
@himanshug himanshug force-pushed the fix_filter_multi_valued branch 2 times, most recently from c9bdece to 7a4a4aa Compare December 20, 2015 06:22
@drcrallen
Copy link
Contributor

@himanshug can you comment on the operational difference between the RegexFilteredDimensionSpec and an extraction filter using regex?

@himanshug
Copy link
Contributor Author

@drcrallen filter in DimensionSpec only make sense for multi-valued dimensions. Existing [Regex]Dim filter is used to select row from segment. one row would explode into multiple rows based on multivalued dimension. Then DimensionSpec filter would apply on these exploded rows and select from them.
In many cases the regex pattern used might be same in both places but it can be different as well depending upon the use case.

}

if (matched == null) {
matched = new HashSet<>(values.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is applied to multiple segments at the same time, the HashSet reference is gonna be changing out from under us. Would be better to make a new one here and then close over the DimensionSelector

@himanshug himanshug force-pushed the fix_filter_multi_valued branch 3 times, most recently from 666cc04 to c2dcc1d Compare December 30, 2015 19:19
@himanshug
Copy link
Contributor Author

@cheddar review comment addressed.

### Filtering DimensionSpecs
These are only valid for multi-valued dimensions. They take a delegate DimensionSpec and a filtering criteria, multiple values list of dimension is filtered as per given criteria.

Following filtered dimension spec acts as a whiltelist or blacklist for values as per the configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitelist*

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following*

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what configuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one or two examples of how to use this optimization will help clear up the confusion.

… for arbitrary filtering/transformations to returned dimension values
@himanshug himanshug force-pushed the fix_filter_multi_valued branch 2 times, most recently from 41dddb2 to c988d02 Compare December 30, 2015 21:35
@himanshug
Copy link
Contributor Author

@fjy updated the doc, see if it is better now?

@himanshug
Copy link
Contributor Author

m adding some more examples to doc.

@cheddar
Copy link
Contributor

cheddar commented Dec 30, 2015

I'm 👍, I find the docs easier to read and understand now as well. @fjy ?

Following filtered dimension spec retains only the values matching regex. Note that `listFiltered` is faster than this and one should use that for whitelist or blacklist usecase.
```json
{ "type" : "regexFiltered", "delegate" : <dimensionSpec>, "pattern": <java regex pattern> }
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just make these as having filters? Unless I am mistaken these are basically restricted HAVING filters. At the very least there should be a link in the HAVING filter spec doc to this doc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to expand in this I filed #1984 some time ago, it looks like #2043 fixed it (I am not 100% sure, need to try it out).

So doing a having filter with:

{
    "type": "dimSelector",
    "dimension": "<dimension>",
    "value": <dimension_value>
}

Should work. (I have not tested that yet but the PR was merged).

What would be the difference between doing listFiltered and dimSelector on the relevant dimensions?

I understand that listFiltered would work for topNs but ideally topN should support having filters just like groupBy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one of my teammates added the new having specs to work on dimension values to solve the problem. However having specs are only applied at the broker after all the processing is done, so historicals will process/merge all the unwanted rows and pass them to broker where broker will further merge them and then in the end having filter will discard those. that would cause a lot of unnecessary memory and cpu consumption across the cluster. filters in this PR will get applied to the lowest possible level in the pipeline.
That said, I would add a line in the doc saying similar results can be obtained via having filters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, this will work for both topN and groupBy

…dimensions can be filtered correctly

also adding UTs for multi-valued dimensions
@himanshug
Copy link
Contributor Author

@fjy @vogievetsky I added documentation for query beharior on multi-valued dimensions , please see multi-valued-dimensions.md introcuded in this PR. hopefully, that makes things clearer.

@fjy
Copy link
Contributor

fjy commented Dec 31, 2015

👍 after travis

fjy added a commit that referenced this pull request Dec 31, 2015
Feature to "fix" filtering on  multi-valued dimensions
@fjy fjy merged commit f821943 into apache:master Dec 31, 2015
@vogievetsky
Copy link
Contributor

I understand that this is more efficient. I think ideally Druid should be able to separate having filters into work that can be performed on the brokers and work that needs to be done in the post process step. While a smart library like Plywood could auto split the having filters to leverage this use-case users using the native Druid API directly would benefit from this greatly.

@himanshug himanshug deleted the fix_filter_multi_valued branch February 8, 2016 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants