Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to SearchQuery to choose a search query execution strategy #3792

Merged
merged 10 commits into from Jan 11, 2017

Conversation

jihoonson
Copy link
Contributor

@jihoonson jihoonson commented Dec 20, 2016

This PR is related to #3775

Supported strategies are

  1. Index-only query execution
  2. Cursor-based scan
  3. Auto: choose an efficient strategy for a given query

This change is Reviewable

…egy.

Supported strategies are
1) Index-only query execution
2) Cursor-based scan
3) Auto: choose an efficient strategy for a given query
@jihoonson
Copy link
Contributor Author

jihoonson commented Dec 20, 2016

Here are some JMH benchmark results (SearchBenchmark.java) to decide proper thresholds for selecting a good strategy. The benchmark tests were conducted with my local machine.

Operation time with varying filter selectivity

Operation time with varying search dimension's cardinality

@fjy fjy added this to the 0.10.0 milestone Dec 20, 2016
@fjy fjy assigned fjy and gianm Dec 20, 2016
@@ -71,6 +73,12 @@
public class SearchQueryRunner implements QueryRunner<Result<SearchResultValue>>
{
private static final EmittingLogger log = new EmittingLogger(SearchQueryRunner.class);

private static final double HIGH_FILTER_SELECTIVITY_THRESHOLD_FOR_CONCISE = 0.99;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the strategies themselves return these constants? This way if we add additional bitmap algorithms we can just add new strategies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually reading more over this code, I think this might belong in the bitmap factories

@gianm thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these should be in the bitmap factories, since the thresholds are experimentally determined and specific to the Search Query algorithm. I doubt anything else would find them useful, so IMO, they belong in the Search Query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added SearchQueryDecisionHelper which provides such information.

}
if (retVal.size() >= limit) {
return makeReturnResult(limit, retVal);
switch (strategy) {
Copy link
Contributor

@fjy fjy Dec 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of having a switch statement here, can we instead have the strategy return the execution strategy?

return strategy.getExecutionPlan();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -49,6 +49,14 @@
private final List<DimensionSpec> dimensions;
private final SearchQuerySpec querySpec;
private final int limit;
private final Strategy strategy;

public enum Strategy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if SearchStrategy should be an interface that we extend

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

throw new ISE(
"Null storage adapter found. Probably trying to issue a query against a segment being memory unmapped."
);
private static boolean isLowCardinality(final BitmapFactory bitmapFactory, final long totalCard) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should have a method on bitmapFactory that returns if its a low or high cardinality bitmap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to SearchQueryDecisionHelper

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, I think this method works better in the SearchQuery since the meaning of "low" and "high" is very subjective and, in this case, determined experimentally for a specific query. It's possible that for a different query, different thresholds would be needed.

final Iterable<DimensionSpec> dimsToSearch;
if (dimensions == null || dimensions.isEmpty()) {
dimsToSearch = Iterables.transform(adapter.getAvailableDimensions(), Druids.DIMENSION_IDENTITY);
private static boolean isHighSelectivityFilter(final QueryableIndex index,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, the bitmap should tell us this information

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to SearchQueryDecisionHelper

@@ -254,6 +275,10 @@ public boolean equals(Object o)
return false;
}

if (!strategy.equals(that.strategy)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about hashcode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson
Copy link
Contributor Author

@fjy thank you for your review. I updated my patch.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial review since I started before @jihoonson's recent refactoring. Will review the new version again.

@@ -71,6 +73,12 @@
public class SearchQueryRunner implements QueryRunner<Result<SearchResultValue>>
{
private static final EmittingLogger log = new EmittingLogger(SearchQueryRunner.class);

private static final double HIGH_FILTER_SELECTIVITY_THRESHOLD_FOR_CONCISE = 0.99;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these should be in the bitmap factories, since the thresholds are experimentally determined and specific to the Search Query algorithm. I doubt anything else would find them useful, so IMO, they belong in the Search Query.

@@ -94,6 +102,7 @@ public SearchQueryRunner(Segment segment)
final SearchQuerySpec searchQuerySpec = query.getQuery();
final int limit = query.getLimit();
final boolean descending = query.isDescending();
final Strategy strategy = query.getStrategy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we do tunings like this as "context" flags rather than explicitly making objects for them. So I'd do something similar here. The code would be like:

final Strategy strategy = Strategy.valueOf(query.getContextValue(CTX_KEY_STRATEGY, DEFAULT_STRATEGY).toUpperCase());

Where CTX_KEY_STRATEGY and DEFAULT_STRATEGY are string constants that have values like "searchQueryStrategy" and "auto". See the groupBy "Query context" docs for a similar example. http://druid.io/docs/latest/querying/groupbyquery.html#query-context

final int limit,
final BitmapFactory bitmapFactory)
{
log.info("Index only query execution strategy is selected.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't log anything at info level during a query, it will be way too many logs.

throw new ISE(
"Null storage adapter found. Probably trying to issue a query against a segment being memory unmapped."
);
private static boolean isLowCardinality(final BitmapFactory bitmapFactory, final long totalCard) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, I think this method works better in the SearchQuery since the meaning of "low" and "high" is very subjective and, in this case, determined experimentally for a specific query. It's possible that for a different query, different thresholds would be needed.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far, left a few comments. thanks @jihoonson.

for (int i = startIndex; i <= endIndex; i++) {
timeBitmap.add(i);
}
final SearchStrategy strategy = query.getStrategy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we do tunings like this as "context" flags rather than explicitly making objects for them. So I'd do something similar here. The code would be like:

final SearchStrategy strategy = SearchStrategy.fromString(query.getContextValue(CTX_KEY_STRATEGY, DEFAULT_STRATEGY));

Where CTX_KEY_STRATEGY and DEFAULT_STRATEGY are string constants that have values like "searchQueryStrategy" and "auto", and SearchStrategy.fromString is something like QueryGranularity.fromString.

To me the main reason to prefer context parameters is that it allows users to add tunings from newer versions of Druid without their client needing to be aware of a new query field. Almost all Druid clients allow users to add arbitrary context parameters but not all of them allow arbitrary top-level query parameters.

See also GroupByStrategySelector and the groupBy "Query context" http://druid.io/docs/latest/querying/groupbyquery.html#query-context for how strategy selection & tuning is handled in GroupBy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll change.

if (filter == null ||
index.getDecisionHelper().hasLowCardinality(index, dimsToSearch) ||
index.getDecisionHelper().hasHighSelectivity(index, timeFilteredBitmap)) {
log.info("Index-only execution strategy is selected");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't log anything at info level during query execution, there will be too many logs. debug or trace is preferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Metadata getMetadata();
Map<String, DimensionHandler> getDimensionHandlers();

SearchQueryDecisionHelper getDecisionHelper();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this decision-making should be self-contained within the Search query rather than bleeding into the QueryableIndex interface.

My rationale is that the thresholds embodied in this decision helper are experimentally determined and specific to the Search query algorithm, and ideally nothing else should know about them. Moving this to the Search query does mean that the Search query will need to know about different kinds of bitmaps, but I think that's better than the QueryableIndex knowing about the Search query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. How about making DecisionHelper based on index type in SearchStrategy? DecisionHelper can be used in only SearchStrategy as well as separating from index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me.

// Index-only strategy is selected when
// 1) there is no filter,
// 2) the total cardinality is very low, or
// 3) the filter has a very high selectivity.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the filter is highly selective, doesn't that mean we want to use the cursor-based strategy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think we're using different definitions of "high selectivity". I think of it the same way as this guy: https://blogs.msdn.microsoft.com/bartd/2011/01/25/query-tuning-fundamentals-density-predicates-selectivity-and-cardinality/

Selectivity for a filter predicate against a base table can be calculated as “[# rows that pass the predicate]/[# rows in the table]”. If the predicate passes all rows in the table, its selectivity is 1.0. If it disqualifies all rows, its selectivity is 0. (This can be confusing. Note that 0.000001 reflects a high selectivity even though the number is small, while 1.0 is low selectivity even though the number is higher.)

So > .99 selectivity would be "very low" if the two of us are thinking about it the right way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I was confused. Thanks.

throw new IAE("Should only have one interval, got[%s]", intervals);
}
final Interval interval = intervals.get(0);
final ImmutableBitmap timeFilteredBitmap = IndexOnlyExecutor.makeTimeFilteredBitmap(index,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of performance issues here:

  1. makeTimeFilteredBitmap isn't cheap, and if we use the index-based execution we will end up calling it twice.
  2. filter.getBitmapIndex isn't cheap either, and if we end up doing the cursor-based execution, it'll get called twice too (once by makeTimeFilteredBitmap and once by makeCursors).

Some possible strategies to help:

  • Don't generate the timeFilteredBitmap until we actually need to call hasHighSelectivity. i.e. if the search dims are low cardinality, just go straight to the index-only strategy. And if they're very high cardinality, similar to the number of rows in the segment, then possibly go straight to the cursor-only strategy (please do some benchmarks to verify this guess though). We should only need to check the filter selectivity if search dimension cardinality is medium.
  • Save the timeFilteredBitmap after generating it, and pass it to the index-based executor if we choose that one.

I think one thing we should not do is save the bitmap and pass it to makeCursors. That's probably too leaky and I would rather accept the performance hit there.

static Iterable<DimensionSpec> getDimsToSearch(Indexed<String> availableDimensions, List<DimensionSpec> dimensions)
{
if (dimensions == null || dimensions.isEmpty()) {
return Iterables.transform(availableDimensions, Druids.DIMENSION_IDENTITY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is lazily computed, to save on allocations you can instead wrap it in ImmutableList.copyOf, like

return ImmutableList.copyOf(Iterables.transform(availableDimensions, Druids.DIMENSION_IDENTITY));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -171,6 +159,39 @@ private void setupQueries()
basicQueries.put("A", queryBuilderA);
}

{ // basic.B
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please attach results of this benchmark before and after the patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add.

@JsonSubTypes.Type(name = "indexOnly", value = IndexOnlyStrategy.class),
@JsonSubTypes.Type(name = "cursorBased", value = CursorBasedStrategy.class)
})
public abstract class SearchStrategy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Jackson annotations here are unnecessary if this is changed to be a context parameter.

@gianm
Copy link
Contributor

gianm commented Dec 21, 2016

@fjy could you chime in about a whether you agree with having the strategy be a context parameter (#3792 (comment)) and whether you agree that the search query should self-contain its own decision making with respect to the bitmap types (#3792 (comment))?

@fjy
Copy link
Contributor

fjy commented Dec 21, 2016

@gianm agree

@jihoonson
Copy link
Contributor Author

Addressed comments. I'll fix the conflicts and add benchmark results soon.

@jihoonson
Copy link
Contributor Author

Here is a result of simple benchmark.

  • 2 filters for dimUniform and dimHyperUnique dimensions.
    • Each filter has 0.1 selectivity.
    • The cardinality of both dimensions is 100,000.

With concise bitmap

cursorBased

Benchmark                                  (limit)  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt       Score      Error  Units
SearchBenchmark.queryMultiQueryableIndex    750000              1            750000           basic.B  avgt   25  267250.791 ± 9636.399  us/op
SearchBenchmark.querySingleQueryableIndex   750000              1            750000           basic.B  avgt   25  242133.054 ± 7486.306  us/op


indexOnly (before patch)

Benchmark                                  (limit)  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt         Score        Error  Units
SearchBenchmark.queryMultiQueryableIndex    750000              1            750000           basic.B  avgt   25  14607220.224 ± 133086.094  us/op
SearchBenchmark.querySingleQueryableIndex   750000              1            750000           basic.B  avgt   25  14819905.575 ± 180125.594  us/op


auto (after patch)

Benchmark                                  (limit)  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt       Score       Error  Units
SearchBenchmark.queryMultiQueryableIndex    750000              1            750000           basic.B  avgt   25  491054.876 ± 25474.403  us/op
SearchBenchmark.querySingleQueryableIndex   750000              1            750000           basic.B  avgt   25  468489.379 ± 11342.444  us/op

With roaring bitmap

cursorBased

Benchmark                                  (limit)  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt       Score      Error  Units
SearchBenchmark.queryMultiQueryableIndex    750000              1            750000           basic.B  avgt   25  438806.602 ± 7831.121  us/op
SearchBenchmark.querySingleQueryableIndex   750000              1            750000           basic.B  avgt   25  435590.852 ± 8456.003  us/op


indexOnly (before patch)

Benchmark                                  (limit)  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt       Score       Error  Units
SearchBenchmark.queryMultiQueryableIndex    750000              1            750000           basic.B  avgt   25  1438733.826 ± 27503.264  us/op
SearchBenchmark.querySingleQueryableIndex   750000              1            750000           basic.B  avgt   25  1494863.759 ± 59221.451  us/op


auto (after patch)

Benchmark                                  (limit)  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  Mode  Cnt       Score       Error  Units
SearchBenchmark.queryMultiQueryableIndex    750000              1            750000           basic.B  avgt   25  830984.652 ± 22974.701  us/op
SearchBenchmark.querySingleQueryableIndex   750000              1            750000           basic.B  avgt   25  824529.731 ± 18906.598  us/op

You can also see the comparison graph here.

Auto strategy is little bit slower than Cursor-based strategy. I suppose Auto strategy calls IndexOnlyStrategy.makeTimeFilteredBitmap() internally and this makes the difference.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jihoonson. Other than the line comments I've got these two general ones:


switch (strategyString) {
case AutoStrategy.NAME:
log.debug("Auto strategy is selected");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to include the query ID in these log messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

if (filter == null ||
helper.hasLowCardinality(index, dimsToSearch) ||
helper.hasLowSelectivity(index, timeFilteredBitmap)) {
log.debug("Index-only execution strategy is selected");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to include the query id in this log message too, from query.getId()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.


public class IndexOnlyStrategy extends SearchStrategy
{
public static final String NAME = "indexOnly";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"indexOnly" isn't really the right name for this, since we might not always use the index. How about naming the three strategies "cursorOnly", "useIndexes", and "auto"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I changed names.

@JsonProperty
@Min(1)
private int maxSearchLimit = 1000;

@JsonProperty
private String searchStrategy = AutoStrategy.NAME;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made me wonder if auto is a good default. TLDR: I think default to auto is fine.

Longer answer: It looks to me like there will be no necessary performance degradation from the old code by defaulting to "auto". There could be some degradation for some queries if we choose cursor-based when index-based would actually be better, but I'm willing to take that risk in order to have #3775 fixed with the default config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Let's make index-only strategy default. We can change later if auto strategy becomes sufficiently smart and efficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think that's fine, in that case I'll tag this with "release notes" so we can let people know in the notes that there's a new option they can turn on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After improving auto strategy by avoiding computing timeFilteredBitmap, I think we can use it as the default search strategy. It shows nearly optimal performance. Please refer to #3792 (comment).

@@ -0,0 +1,12 @@
package io.druid.query.search.search;

public class ConciseBitmapDecisionHelper extends SearchQueryDecisionHelper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this a singleton.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

import io.druid.collections.bitmap.BitmapFactory;
import io.druid.collections.bitmap.ConciseBitmapFactory;
import io.druid.collections.bitmap.ImmutableBitmap;
import io.druid.collections.bitmap.WrappedImmutableConciseBitmap;
import it.uniroma3.mat.extendedset.intset.ImmutableConciseSet;

import java.nio.ByteBuffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or this file.

QueryRunnerFactory factory = new SearchQueryRunnerFactory(new SearchQueryQueryToolChest(
new SearchQueryConfig(),
final SearchQueryConfig config = new SearchQueryConfig();
QueryRunnerFactory factory = new SearchQueryRunnerFactory(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This supplier can be Suppliers.ofInstance(config).

// Index-only strategy is selected when
// 1) there is no filter,
// 2) the total cardinality is very low, or
// 3) the filter has a very low selectivity.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I wonder if we can do a better heuristic here. Instead of checking for "low cardinality" and "low selectivity" does it make sense to use cost functions like:

  • cursor-based is numRowsInSegment * filterSelectivity * costOfCheckingSingleValue
  • index-based with a filter is dimensionCardinality * (costOfCheckingSingleValue + searchQuerySelectivity * costOfFilterIntersection)
  • index-based without a filter is dimensionCardinality * costOfCheckingSingleValue

And choose the strategy with the lowest cost. Unfortunately we don't know what searchQuerySelectivity is in advance, so I think we'd have to just use some constant.

One behavior difference is that if dimensionCardinality is equal to numRowsInSegment, like a unique key, then cost functions like that would choose the cursor-based strategy regardless of filter selectivity (since dimensionCardinality >= numRowsInSegment * filterSelectivity for all values of filterSelectivity). The current code would still use the index-based approach if filter selectivity is lower than the threshold. I'd guess that using cursor-based is actually better, though.

We can also skip computing timeFilteredBitmap if cursor-based already beats index-based (or comes close) before applying the * filterSelectivity term. Based on your benchmarks, it looks like this can save substantial time.

@jihoonson what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to doing either the cost based approach or your current approach, so feel free to try to convince me either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the good suggestion. I agree that cost-based approach is cool. Here are some issues what I'm considering.

  • Disk access patterns of cursor-based and index-only strategies look different. The disk access cost can be ignored like you said above only when they are almost same, but I'm not sure that they are.
  • When using cursor-based strategy, some filters can be pushed down into the cursor (preFilters), thereby actually skipping reading non-result data. However, it is difficult to figure out which part of filters are preFilters outside QueryableIndexStorageAdapter.makeCursors(). To do so, some refactoring is needed. This also makes implementing cost-based approach difficult.

I think we can simply ignore these details at first, and improve later. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, ignoring this for now sounds good.

Although your comment makes me realize that SearchQueryRunner has a bug right now… it calls filter.getBitmapIndex without first calling filter.supportsBitmapIndex to see if that call is valid. The Filter interface doesn't say this (it probably should) but it is bad to call getBitmapIndex if supportsBitmapIndex is false. Callers should use makeMatcher instead.

We should fix that; do you prefer to do it as part of this patch of a follow up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, since we are ignoring this for now, could you please write up a github issue about how we could make this smarter in the future? That way, if someone ever complains about the smartness, we can point them to that issue and ask them to help :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is equivalent now, but only by luck. We are planning to add some filters in the future that don't ever support bitmaps (like filters that filter based on expressions). Those would return false from "supportsBitmapIndex" even if the underlying dimensions did have indexes, since they'd always want to operate on a row by row basis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, filter.getBitmapIndex() is called for the query-level filter, which might be on a different column than the dimension being search, so even with the current set of filters, things will not work properly. For example try filtering on a long column and searching on a different, string column with a single search query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I missed this is possible. I'll fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm, I created #3832 for improving cost-based planner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And for timeFilteredBitmap, making it is expensive because it requires a lot of unions of bitmaps for filters.
I have a plan to avoid unions by estimating filter selectivity. I already have some benchmark result which shows nearly optimal performance. However, this work needs some new codes and refactoring, so I think it would be better to proceed in another issue. I'll make a PR soon.

public class ConciseBitmapDecisionHelper extends SearchQueryDecisionHelper
{
private static final double LOW_FILTER_SELECTIVITY_THRESHOLD = 0.99;
private static final int LOW_CARDINALITY_THRESHOLD = 5000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please briefly describe, in a comment, where these numbers came from? (if you decide to keep using them rather than the cost based approach)

public class RoaringBitmapDecisionHelper extends SearchQueryDecisionHelper
{
private static final double LOW_FILTER_SELECTIVITY_THRESHOLD = 0.65;
private static final int LOW_CARDINALITY_THRESHOLD = 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please briefly describe, in a comment, where these numbers came from? (if you decide to keep using them rather than the cost based approach)

@jihoonson
Copy link
Contributor Author

@gianm, thank you for the review. I updated the patch.

@@ -0,0 +1,60 @@
package io.druid.query.search.search;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing license header here and in other new files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems you saw an old patch. I added license headers.

@fjy
Copy link
Contributor

fjy commented Jan 10, 2017

👍 from me

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson - looks good other than some minor comments about docs & tests.


#### Strategies

Search queries can be executed using two different strategies. The default strategy is determeined by the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"determined" (spelling)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks.

only the rows which satisfy those filters, thereby saving I/O cost. However, it might be slow with filters of low selectivity.

- "auto" strategy uses a cost-based planner for choosing an optimal search strategy. It estimates the cost of index-only
and cursor-based execution plans, and chooses the optimal one. Currently, its performance is suboptimal due to the large
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "it is not enabled by default due to the overhead of cost estimation."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


#### Query context

The following runtime properties apply:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are "query context parameters" not "runtime properties".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

configs[0] = new SearchQueryConfig();
configs[0].setSearchStrategy(UseIndexesStrategy.NAME);
configs[1] = new SearchQueryConfig();
configs[1].setSearchStrategy(CursorOnlyStrategy.NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test for AutoStrategy too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

return new UseIndexesStrategy(query, false, timeFilteredBitmap);
}

private UseIndexesStrategy(SearchQuery query,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this file have codestyle applied? Are you using Eclipse or IntelliJ? It's not the same as what my IntelliJ does when I apply the codestyle.xml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about this problem. I'll apply the codestyle to all changed files.

index
);

// Index-only plan is used only when any filter is not specified or every filter supports bitmap indexes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "the filter" instead of "every filter" (there's just one filter). Maybe this is "every" since you were thinking of decomposing ANDs, but that's not happening right now, so it should be "the"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

@jihoonson
Copy link
Contributor Author

@gianm, thank you for your review. I updated the patch.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, 👍

@gianm gianm merged commit c099977 into apache:master Jan 11, 2017
dgolitsyn pushed a commit to metamx/druid that referenced this pull request Feb 14, 2017
…egy (apache#3792)

* Add an option to SearchQuery to choose a search query execution strategy.

Supported strategies are
1) Index-only query execution
2) Cursor-based scan
3) Auto: choose an efficient strategy for a given query

* Add SearchStrategy and SearchQueryExecutor

* Address comments

* Rename strategies and set UseIndexesStrategy as the default strategy

* Add a cost-based planner for auto strategy

* Add document

* Fix code style

* apply code style

* apply comments
@clambertus clambertus unassigned fjy and gianm Jul 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants