
Guard against exponential increase of filters during CNF conversion #12314

Merged: 2 commits merged into apache:master on Mar 9, 2022

Conversation

rohangarg (Member):

Fixes #12311

Currently, the CNF conversion of a filter is unbounded, meaning it can create an arbitrarily large number of filters, which can lead to OOMs in the historical heap. We should throw an error or disable CNF conversion when the filter count starts getting out of hand. There are ways to do CNF conversion with only a linear increase in filter count, but those algorithms add new variables to the predicate, which can be contentious, so they have been left out of the scope of this change. An improved conversion algorithm can be taken up later upon discussion.
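For background (an illustrative sketch, not Druid code): the blow-up comes from distributing OR over AND. Converting (a1 AND b1) OR ... OR (an AND bn) to CNF produces one clause per way of picking one literal from each conjunct, i.e. 2^n clauses:

```java
// Hypothetical illustration of CNF blow-up: each disjunct that is a
// conjunction of two literals doubles the number of clauses in the
// distributed result, so n disjuncts yield 2^n clauses.
public class CnfBlowup {
  // Count clauses produced by naively distributing OR over AND,
  // where each of the n disjuncts is a conjunction of two literals.
  static long cnfClauseCount(int n) {
    long clauses = 1;
    for (int i = 0; i < n; i++) {
      clauses *= 2; // each disjunct doubles the number of combinations
    }
    return clauses;
  }

  public static void main(String[] args) {
    // 20 small conjunctions already yield over a million clauses.
    System.out.println(cnfClauseCount(20));
  }
}
```

With the 10,000-filter guard below, a disjunction of just 14 two-term conjunctions (2^14 = 16,384 clauses) would already trip the limit.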

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@@ -178,6 +179,11 @@ private static void generateAllCombinations(
a.add(child);
// Result must receive an actual OrFilter.
result.add(idempotentOr(Filters.or(a)));
if (result.size() >= 10_000) {
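A minimal, self-contained sketch of how such a guard works (the 10,000 constant and the exception name come from this PR; `addWithGuard` and the `Set<String>` representation are hypothetical stand-ins for Druid's filter types):

```java
import java.util.Set;

// Sketch (assumed, based on the diff): the combination generator bails
// out with a checked exception once the result set grows past a fixed
// bound, instead of expanding without limit.
public class GuardSketch {
  static final int CNF_MAX_FILTER_COUNT = 10_000;

  static class CNFFilterExplosionException extends Exception {
    CNFFilterExplosionException(String message) {
      super(message);
    }
  }

  // Add one generated filter, then check the bound so the set can
  // never grow past CNF_MAX_FILTER_COUNT entries.
  static void addWithGuard(Set<String> result, String filter) throws CNFFilterExplosionException {
    result.add(filter);
    if (result.size() >= CNF_MAX_FILTER_COUNT) {
      throw new CNFFilterExplosionException("Exceeded maximum allowed filters for CNF conversion");
    }
  }
}
```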
Member:

any reason for this limit to be configurable?

rohangarg (Member Author):

This could be configurable; I didn't add a config for now to avoid more configs.
But if it feels disruptive (because it could fail some existing queries that create more than 10k filters and still run today), I can add a config for this.

rohangarg (Member Author):

I've added a change so that exceeding this limit does not fail queries for now. When the limit is exceeded, the join flow disables the push-down optimization and the normal filter flow returns the non-CNF filter. Does that seem OK, so that we don't need a config to toggle the limit?

Member:

Yeah, I think that is OK; the transformation shouldn't be changing results AFAIK, though per my other comment there should be some way to know that this has happened.


import org.apache.druid.java.util.common.StringUtils;

public class CNFFilterExplosionException extends Exception

Member:

Should this be a BadQueryException? Since I think this is user-controllable by setting a context flag or not, doing that would make it surface as a user error when it eventually comes up.

rohangarg (Member Author):

Yes, BadQueryException makes sense, but one problem is that it is an unchecked exception. I was trying to make this a checked exception so that callers must decide how to handle it. WDYT?
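A minimal illustration of the trade-off being discussed (hypothetical names; not Druid code): a checked exception forces every caller to make the fallback decision explicitly, and the compiler enforces it:

```java
// Illustration (hypothetical): because CNFFilterExplosionException
// extends Exception (checked) rather than RuntimeException, callers
// of toCnf cannot ignore the explosion case; they must catch it or
// redeclare it in their own signature.
public class CheckedVsUnchecked {
  static class CNFFilterExplosionException extends Exception {
    CNFFilterExplosionException(String message) {
      super(message);
    }
  }

  // Stand-in for the conversion: 'explodes' simulates hitting the limit.
  static String toCnf(String filter, boolean explodes) throws CNFFilterExplosionException {
    if (explodes) {
      throw new CNFFilterExplosionException("too many filters");
    }
    return "cnf(" + filter + ")";
  }

  // A caller that decides how to handle the explosion: keep the
  // original, non-CNF filter.
  static String convertOrFallBack(String filter, boolean explodes) {
    try {
      return toCnf(filter, explodes);
    } catch (CNFFilterExplosionException e) {
      return filter; // fall back to the filter as is
    }
  }
}
```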

Member:

I think that makes sense. Looking closer, I see now that these exceptions are always eaten and don't make it to the user; I admittedly just did a drive-by scan of this PR earlier and thought they might make it to the surface 😅

Though thinking about it, maybe it is worth a log of some sort, so that there is some indicator of why the expected query plan isn't happening?

Member:

I guess a log might be kind of noisy if this was coming from the SQL planner, and it would need to be collected and printed at the end instead. We have a mechanism for collecting errors this way, but not for informative messages; this could be done at a later time, I think (if it is actually useful to have a way to know why the planner did or didn't do a thing that was asked for).

rohangarg (Member Author):

Yes, I agree with that; it would make more sense to have something in a per-query structure which is shown to the user.

@cryptoe (Contributor) left a comment:

LGTM

return useCNF ? Filters.toCnf(filter) : filter;
}
catch (CNFFilterExplosionException cnfFilterExplosionException) {
return filter; // cannot convert to CNF, return the filter as is

@cryptoe (Contributor), Mar 9, 2022:

I would just add a log saying that we are not converting to CNF due to CNFFilterExplosionException, so that people debugging the query know what's happening.

Contributor:

For clusters where this happens regularly, that can generate a lot of log spew. You could argue that operators should respond to the spew by not running those queries, but it would still be a sudden chunk of log spew. In general, CNF is a potential optimization done by the planner, and we don't log other optimization choices we make; we just make them. In this case the planner is deeming that converting to CNF is more expensive than leaving the filter as is, which is yet another optimization decision. I don't think it needs a log, or if it does, maybe at debug level.

Contributor:

I understand where you're coming from. In that case, I would tend toward not throwing an exception and instead refactoring Filters.toCnf(filter) to return an appropriate state that tells us whether the conversion was successful or not, perhaps using Optional, so that we also avoid the performance overhead of throwing exceptions.
We can take this up as future work, maybe when we make this limit configurable.
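A sketch of the suggested Optional-based refactor (hypothetical API; `estimatedClauses` is an assumed stand-in for whatever signal the real conversion would use to detect an explosion):

```java
import java.util.Optional;

// Sketch of the reviewer's suggestion: return Optional.empty() when
// the CNF conversion would explode, instead of throwing, so callers
// branch on presence rather than on exceptions.
public class OptionalCnf {
  static final int MAX_CLAUSES = 10_000;

  // Returns the converted filter, or empty if conversion would have
  // produced MAX_CLAUSES clauses or more.
  static Optional<String> toCnf(String filter, int estimatedClauses) {
    if (estimatedClauses >= MAX_CLAUSES) {
      return Optional.empty(); // conversion abandoned; caller keeps the original
    }
    return Optional.of("cnf(" + filter + ")");
  }

  // Caller-side usage: fall back to the original filter when absent.
  static String convert(String filter, int estimatedClauses) {
    return toCnf(filter, estimatedClauses).orElse(filter);
  }
}
```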

@abhishekagarwal87 abhishekagarwal87 merged commit 56fbd2a into apache:master Mar 9, 2022
@abhishekagarwal87 (Contributor):

Merged since failures on jdk15 tests are unrelated.

TSFenwick pushed a commit to TSFenwick/druid that referenced this pull request Apr 11, 2022
Guard against exponential increase of filters during CNF conversion (apache#12314)

@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022

Successfully merging this pull request may close these issues.

Historical crashes with OOM on SQL join query
6 participants