Support filtering on long columns (including __time) #3180

Merged: 11 commits, Jul 21, 2016

jon-wei
Contributor

@jon-wei jon-wei commented Jun 23, 2016

Fixes #2816

This PR adds support for filtering on long columns, including __time, using the non-bitmap indexed column filtering support added by #3018.

This patch changes the interface of the ValueMatcherFactory regarding predicate handling. Filters will now create a DruidPredicateFactory, an object that can create a predicate suitable for each filterable column type (currently String and long only).
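The factory idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this patch: every `Sketch*` name and both method names are hypothetical, and `java.util.function.Predicate` stands in for whatever predicate type the real interfaces use.

```java
import java.util.function.Predicate;

// Hypothetical primitive predicate: takes a long directly, so values are not boxed.
interface SketchLongPredicate {
    boolean applyLong(long value);
}

// Hypothetical factory that a filter hands to the value matcher factory.
// It can produce one predicate per filterable column type.
interface SketchPredicateFactory {
    Predicate<String> makeStringPredicate();

    SketchLongPredicate makeLongPredicate();
}

// A selector-style filter: matches a single value on either column type.
final class SketchSelectorPredicateFactory implements SketchPredicateFactory {
    private final String value;

    SketchSelectorPredicateFactory(String value) {
        this.value = value;
    }

    @Override
    public Predicate<String> makeStringPredicate() {
        return value::equals;
    }

    @Override
    public SketchLongPredicate makeLongPredicate() {
        final long parsed;
        try {
            parsed = Long.parseLong(value);
        } catch (NumberFormatException e) {
            // A non-numeric filter value can never equal a long column value.
            return input -> false;
        }
        return input -> input == parsed;
    }
}
```

A caller that knows the column type asks the factory for the matching predicate, for example `factory.makeLongPredicate().applyLong(rowValue)` for a long column.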

I have included benchmarks that check the performance of predicate matching. They apply a set of the affected filters, combined in an OrFilter, during an IncrementalIndex read and during a TimeseriesQuery with FilteredAggregators on both index types (incremental and queryable).

Patch

Benchmark                                                (rowsPerSegment)  (schema)  Mode  Cnt       Score       Error  Units
IncrementalIndexReadBenchmark.readWithFilters                     1500000     basic  avgt   50  717525.937 ± 17697.485  us/op
FilteredAggregatorBenchmark.querySingleIncrementalIndex           1500000     basic  avgt   50  375765.889 ±  2526.498  us/op
FilteredAggregatorBenchmark.querySingleQueryableIndex             1500000     basic  avgt   50  115972.728 ±   318.759  us/op

Master

Benchmark                                                (rowsPerSegment)  (schema)  Mode  Cnt       Score      Error  Units
IncrementalIndexReadBenchmark.readWithFilters                     1500000     basic  avgt   50  739239.845 ± 17020.546  us/op
FilteredAggregatorBenchmark.querySingleIncrementalIndex           1500000     basic  avgt   50  375252.713 ±  3771.233  us/op
FilteredAggregatorBenchmark.querySingleQueryableIndex             1500000     basic  avgt   50  116420.209 ±   545.924  us/op

Benchmarks for basic queries are shown below:

Patch

Benchmark                                              (defaultStrategy)  (initialBuckets)  (limit)  (numProcessingThreads)  (numSegments)  (pagingThreshold)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt        Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndex                             v1                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   422807.560 ± 11421.148  us/op
GroupByBenchmark.queryMultiQueryableIndex                             v2                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   149036.531 ±  4474.590  us/op
GroupByBenchmark.querySingleIncrementalIndex                          v1                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   232410.637 ±   886.871  us/op
GroupByBenchmark.querySingleIncrementalIndex                          v2                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50    63459.120 ±  1587.029  us/op
GroupByBenchmark.querySingleQueryableIndex                            v1                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   197057.165 ±  1406.680  us/op
GroupByBenchmark.querySingleQueryableIndex                            v2                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50    43364.877 ±   371.839  us/op
SearchBenchmark.queryMultiQueryableIndex                             N/A               N/A     1000                     N/A              1                N/A            750000           basic.A          N/A  avgt   50    19444.809 ±   132.849  us/op
SearchBenchmark.querySingleIncrementalIndex                          N/A               N/A     1000                     N/A              1                N/A            750000           basic.A          N/A  avgt   50   267336.014 ±  6745.615  us/op
SearchBenchmark.querySingleQueryableIndex                            N/A               N/A     1000                     N/A              1                N/A            750000           basic.A          N/A  avgt   50    19897.377 ±   120.242  us/op
SelectBenchmark.queryIncrementalIndex                                N/A               N/A      N/A                     N/A              1               1000             25000           basic.A          N/A  avgt   50    59108.504 ±   242.895  us/op
SelectBenchmark.queryMultiQueryableIndex                             N/A               N/A      N/A                     N/A              1               1000             25000           basic.A          N/A  avgt   50    88906.562 ±   643.717  us/op
SelectBenchmark.queryQueryableIndex                                  N/A               N/A      N/A                     N/A              1               1000             25000           basic.A          N/A  avgt   50    88235.696 ±   441.188  us/op
TimeseriesBenchmark.queryFilteredSingleQueryableIndex                N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50    15608.863 ±    27.664  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                         N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50   141074.858 ±   641.014  us/op
TimeseriesBenchmark.querySingleIncrementalIndex                      N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50  1503494.382 ± 18168.028  us/op
TimeseriesBenchmark.querySingleQueryableIndex                        N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50   138702.814 ±   894.963  us/op
TopNBenchmark.queryMultiQueryableIndex                               N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A           10  avgt   50   181720.297 ±  1162.611  us/op
TopNBenchmark.querySingleIncrementalIndex                            N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A           10  avgt   50  1463885.039 ± 27795.640  us/op
TopNBenchmark.querySingleQueryableIndex                              N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A           10  avgt   50   176074.100 ±   739.388  us/op

Master

Benchmark                                              (defaultStrategy)  (initialBuckets)  (limit)  (numProcessingThreads)  (numSegments)  (pagingThreshold)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt        Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndex                             v1                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   424694.116 ± 13803.220  us/op
GroupByBenchmark.queryMultiQueryableIndex                             v2                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   147794.277 ±  4489.902  us/op
GroupByBenchmark.querySingleIncrementalIndex                          v1                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   234001.315 ±  2181.479  us/op
GroupByBenchmark.querySingleIncrementalIndex                          v2                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50    64369.325 ±   497.782  us/op
GroupByBenchmark.querySingleQueryableIndex                            v1                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50   195698.273 ±  2367.676  us/op
GroupByBenchmark.querySingleQueryableIndex                            v2                -1      N/A                       4              4                N/A            100000           basic.A          N/A  avgt   50    43429.411 ±   289.636  us/op
SearchBenchmark.queryMultiQueryableIndex                             N/A               N/A     1000                     N/A              1                N/A            750000           basic.A          N/A  avgt   50    19438.288 ±   103.333  us/op
SearchBenchmark.querySingleIncrementalIndex                          N/A               N/A     1000                     N/A              1                N/A            750000           basic.A          N/A  avgt   50   273534.285 ±  8815.582  us/op
SearchBenchmark.querySingleQueryableIndex                            N/A               N/A     1000                     N/A              1                N/A            750000           basic.A          N/A  avgt   50    20063.783 ±   102.915  us/op
SelectBenchmark.queryIncrementalIndex                                N/A               N/A      N/A                     N/A              1               1000             25000           basic.A          N/A  avgt   50    58486.232 ±   712.939  us/op
SelectBenchmark.queryMultiQueryableIndex                             N/A               N/A      N/A                     N/A              1               1000             25000           basic.A          N/A  avgt   50    84278.384 ±   532.538  us/op
SelectBenchmark.queryQueryableIndex                                  N/A               N/A      N/A                     N/A              1               1000             25000           basic.A          N/A  avgt   50    90504.923 ±   695.419  us/op
TimeseriesBenchmark.queryFilteredSingleQueryableIndex                N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50    15731.451 ±    22.003  us/op
TimeseriesBenchmark.queryMultiQueryableIndex                         N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50   140230.784 ±   788.050  us/op
TimeseriesBenchmark.querySingleIncrementalIndex                      N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50  1475996.560 ± 17033.559  us/op
TimeseriesBenchmark.querySingleQueryableIndex                        N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A          N/A  avgt   50   142505.862 ±   765.027  us/op
TopNBenchmark.queryMultiQueryableIndex                               N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A           10  avgt   50   186033.316 ±  1075.556  us/op
TopNBenchmark.querySingleIncrementalIndex                            N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A           10  avgt   50  1458937.409 ± 24749.999  us/op
TopNBenchmark.querySingleQueryableIndex                              N/A               N/A      N/A                     N/A              1                N/A            750000           basic.A           10  avgt   50   175634.645 ±   763.507  us/op

@jon-wei jon-wei added this to the 0.9.2 milestone Jun 23, 2016

String jsFn = "function(x) { return(x === 'Wednesday' || x === 'Thursday') }";
assertFilterMatches(
new JavaScriptDimFilter(Column.TIME_COLUMN_NAME, jsFn, exfn, JavaScriptConfig.getDefault()),
Member

since we don't automatically convert to strings for javascript, can we add a JavaScriptDimFilter test that operates on the time column values directly?

Contributor Author

@xvrl I have one in the testTimeFilterAsLong() test; the JavaScriptDimFilter there compares directly on longs

@nishantmonu51
Member

@jon-wei can you also add some docs with an example of filtering on a time column?
Or are you planning to do that in a follow-up PR?

@jon-wei
Contributor Author

jon-wei commented Jun 27, 2016

@nishantmonu51 I've added a section to the docs on filtering on __time

@jon-wei jon-wei force-pushed the time_filtering branch 2 times, most recently from a18ec5d to 0e1cdda Compare June 27, 2016 22:03
### Filtering on the Timestamp Column
Filters can also be applied to the timestamp column. The timestamp column has long millisecond values.

To refer to the timestamp column, use the string "__time" as the dimension name.
Contributor

`__time`

would probably mess with the syntax highlighting less

Contributor Author

@gianm cool, fixed the quotes
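Since the docs snippet above says `__time` holds long millisecond values, a filter on it comes down to a comparison on epoch milliseconds. A rough sketch of that idea, for illustration only: the class and method names are hypothetical, the range semantics are an assumption, and it uses the JDK's `java.time.Instant` and `LongPredicate` rather than anything from this patch.

```java
import java.time.Instant;
import java.util.function.LongPredicate;

// Hypothetical helper, not part of the patch.
final class SketchTimeFilter {
    private SketchTimeFilter() {
    }

    // Convert an ISO-8601 instant such as "2016-06-23T00:00:00Z" to epoch milliseconds.
    static long toMillis(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }

    // Build a predicate that matches timestamps in the half-open range [start, end).
    static LongPredicate inRange(String startIso, String endIso) {
        final long start = toMillis(startIso);
        final long end = toMillis(endIso);
        return ts -> ts >= start && ts < end;
    }
}
```

For example, `SketchTimeFilter.inRange("2016-06-23T00:00:00Z", "2016-06-24T00:00:00Z")` would accept any millisecond timestamp that falls on June 23, 2016 (UTC) and reject the first millisecond of June 24.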

@fjy
Contributor

fjy commented Jun 28, 2016

minor comments to be fixed, but 👍 after those are addressed

**Example**

Filtering on a long timestamp value:
```
Contributor

json highlighting

Contributor Author

@gianm Added json highlighting

@jon-wei
Contributor Author

jon-wei commented Jun 29, 2016

@fjy I've moved the common logic to functions in Filters and addressed the other comments as well

@jon-wei
Contributor Author

jon-wei commented Jun 29, 2016

Ran the ValueMatcher-using benchmarks again, results haven't changed:

IncrementalIndexReadBenchmark.readWithFilters                     1500000     basic  avgt   50  727398.052 ± 15572.380  us/op
FilteredAggregatorBenchmark.querySingleIncrementalIndex           1500000     basic  avgt   50  377579.446 ±  1635.633  us/op
FilteredAggregatorBenchmark.querySingleQueryableIndex             1500000     basic  avgt   50  115233.656 ±   345.338  us/op

@@ -157,6 +163,45 @@ public void remove()
);
}

public static ValueMatcher getTimeValueMatcher(
Contributor

How about calling this getLongValueMatcher so it can be used with other long dims too, when the time comes?
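To illustrate why a generic long matcher would cover `__time` and any other long column alike, here is a small sketch. It is not the method from this diff: the `Sketch*` interfaces, `makeLongValueMatcher`, and `countMatches` are all hypothetical names, and the JDK's `LongPredicate` is used as the primitive predicate type.

```java
import java.util.function.LongPredicate;

// Hypothetical matcher: answers whether the current row passes the filter.
interface SketchValueMatcher {
    boolean matches();
}

// Hypothetical selector: returns the long value of one column at the current row.
interface SketchLongColumnSelector {
    long get();
}

final class SketchLongMatchers {
    private SketchLongMatchers() {
    }

    // Works for any long column, because it only depends on the selector.
    static SketchValueMatcher makeLongValueMatcher(
        SketchLongColumnSelector selector,
        LongPredicate predicate
    ) {
        return () -> predicate.test(selector.get());
    }

    // Demo cursor: walks an in-memory column and counts the matching rows.
    static int countMatches(long[] column, LongPredicate predicate) {
        final int[] row = {0};
        SketchValueMatcher matcher = makeLongValueMatcher(() -> column[row[0]], predicate);
        int count = 0;
        for (row[0] = 0; row[0] < column.length; row[0]++) {
            if (matcher.matches()) {
                count++;
            }
        }
        return count;
    }
}
```

Nothing in the matcher refers to time, so the same code path can serve the timestamp column now and other long dimensions later.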

/**
* Compound predicate class that accepts all supported types
*/
public interface DruidPredicate extends Predicate<Object>, DruidLongPredicate
Contributor

seems like long predicate should extend druid predicate

Contributor

also I don't understand the purpose of interfaces with no methods

Contributor Author

@fjy To support separate String/Long ValueMatchers on the factory, the Filter side needs to pass in a different kind of predicate for each type, so this interface is used to combine the predicate implementations into a single object

Contributor

Can we get a more descriptive name than DruidPredicate?

Contributor Author

@fjy renamed this to DruidCompositePredicate

@jon-wei jon-wei changed the title Support filtering on __time column Support filtering on __time column [WIP] Jul 1, 2016
@jon-wei jon-wei changed the title Support filtering on __time column [WIP] Support filtering on __time column Jul 1, 2016
/**
* Composite predicate class that can accept all supported types
*
* The apply() method inherited from Predicate<Object> is intended for String values
Contributor

Hmm, the Predicate<Object> seems like premature generalization. We aren't getting any use out of this now (we don't have non-primitive Object types that filters support), and generally the filters all call toString on this object anyway. How do you feel about making this a Predicate<String>?

@jon-wei
Contributor Author

jon-wei commented Jul 16, 2016

going to change DruidCompositePredicate to DruidPredicateFactory, will update this PR

*
* A separate method is present for each supported primitive type to avoid boxing (for performance reasons)
*/
public interface DruidCompositePredicate extends Predicate<Object>, DruidLongPredicate
Contributor

This file is unused now and could be removed.

Contributor Author

@gianm ah, my bad, should've put a WIP on the last commit, I wasn't quite done with the PR changes yet but wasn't expecting you to review so soon :D

Contributor

haha, okay, I'll wait for you to finish :)

@jon-wei
Contributor Author

jon-wei commented Jul 19, 2016

Updated patch benchmarks in original comment

public InFilter(String dimension, Set<String> values, ExtractionFn extractionFn)
{
this.dimension = dimension;
this.values = values;
this.extractionFn = extractionFn;
setLongValues();
Contributor

Hmm, it might be worth only doing this if a long predicate is actually requested. I could see the parsing taking a while for long IN filters. But make sure we only do this once even if getLongPredicate is called many times.
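One possible shape for that suggestion, sketched with double-checked locking on a final lock object. This is an assumption about how it could be done, not the code that ended up in the PR: `SketchLazyLongValues` and its members are hypothetical, and the `parseCount` field exists only to make the once-only behavior observable.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder for the values of an IN-style filter.
final class SketchLazyLongValues {
    private final Set<String> values;
    private final Object initLock = new Object();

    // Volatile flag: once a thread sees true, longValues is safely published.
    private volatile boolean parsed = false;
    private Set<Long> longValues;
    private int parseCount = 0; // demonstration only

    SketchLazyLongValues(Set<String> values) {
        this.values = values;
    }

    // Parsing happens on the first call only, and only if a long match is requested.
    boolean matchesLong(long input) {
        ensureParsed();
        // Boxes the input for brevity; a primitive long set would avoid that.
        return longValues.contains(input);
    }

    private void ensureParsed() {
        if (parsed) {
            return;
        }
        synchronized (initLock) {
            if (parsed) {
                return;
            }
            Set<Long> result = new HashSet<>();
            for (String value : values) {
                try {
                    result.add(Long.parseLong(value));
                } catch (NumberFormatException e) {
                    // Non-numeric values cannot match a long column, so skip them.
                }
            }
            longValues = result;
            parseCount++;
            parsed = true;
        }
    }

    int parseCount() {
        synchronized (initLock) {
            return parseCount;
        }
    }
}
```

Constructing the object costs nothing beyond storing the string set, and repeated calls to `matchesLong` reuse the set parsed on the first call.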

@@ -44,6 +45,11 @@
private final String value;
private final ExtractionFn extractionFn;

private Object initLock = new Object();
Contributor

initLock should be final

Contributor Author

@gianm fixed non-final lock object

@gianm
Contributor

gianm commented Jul 20, 2016

👍 after travis

@fjy
Contributor

fjy commented Jul 21, 2016

Still 👍 from me

@fjy fjy merged commit a42ccb6 into apache:master Jul 21, 2016
@gianm gianm mentioned this pull request Sep 29, 2016
@jon-wei jon-wei deleted the time_filtering branch October 6, 2017 22:20