Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015

Closed
wants to merge 10 commits into from

Conversation

abhishekrb19
Copy link
Contributor

@abhishekrb19 abhishekrb19 commented Sep 19, 2023

Druid's retention rules are evaluated in the order in which they are set. Unfortunately, a mistake in the order of rules by an operator can lead to unintentional data loss, and if it goes unnoticed and auto-kill is enabled, data could be permanently deleted.

To address this issue, a guard rail is added in the rules API. This guard rail aims to identify rules that fully contains the interval for any subsequent rules A rule is considered good as long as it will run at some point (and doesn't fully hide other rules in the chain) - it's ensured by checking at the interval boundaries as well.

Some examples of bad rules that will be disallowed (see unit tests for more examples):

  • [loadByPeriod(P6M), loadByPeriod(P3M)]
  • [loadByInterval(2020-01-01/2050-01-01), loadByInterval(2021-01-01/2023-01-01)]
  • [dropBeforeByPeriod(P3M), dropBeforeByPeriod(P6M)]
  • [loadForever, loadByPeriod(P1M)]
  • [loadForever, dropForever]

Examples of rules that will be allowed because they will be evaluated at some point:

  • [loadByPeriod(P5Y), loadByInterval(2021-01-01/2023-01-01)]; the loadByInterval rule will be evaluated at some point.
  • [loadByPeriod(P1Y), loadByPeriod(P2Y)]

For testing purposes, if a user wants to intentionally set a rule that fully contains other rules, they can do it temporarily and then fetch the correct set of rules from the rules audit history after their testing is done.

Release note

Datasource rules API will reject the rules payload if a rule fully contains subsequent rules' interval.


Key changed/added classes in this PR
  • Rule.java
  • Rules.java
  • RulesTest.java
  • RulesEligibleIntervalTest.java

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

This would prevent users from shooting themselves in the foot.
For example, if an interval in a rule is "large enough" and
covers an interval in subsequent rule(s), then those rules
will never run.

Implement guard rails for rules in the rule chain.
@kfaraz
Copy link
Contributor

kfaraz commented Sep 20, 2023

@abhishekrb19 , it seems like a nice idea to have these checks while updating retention rules.

If there is a cluster that already defines rules that would fail the check, would that cluster break after this change?
Edit: I suppose not since the check is enforced only when updating rules.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't done a full review of the validation logic yet. Left some comments on the structure.

@abhishekrb19
Copy link
Contributor Author

Thanks for the review, @kfaraz. Re:

If there is a cluster that already defines rules that would fail the check, would that cluster break after this change?

No, this is an API-only validation change, so any existing rules in a cluster will remain as-is.

Comment on lines 65 to 81
final Rule currRule = rules.get(i);
final Rule nextRule = rules.get(i + 1);
final Interval currInterval = currRule.getEligibleInterval(now);
final Interval nextInterval = nextRule.getEligibleInterval(now);
if (currInterval.contains(nextInterval)) {
// If the current rule has an eternity interval, it covers everything following it.
// Or if the current rule still covers the next rule at the current interval boundaries, then the
// next rule will never fire at any time, so throw an exception.
if (Intervals.ETERNITY.equals(currInterval) ||
(currRule.getEligibleInterval(currInterval.getStart()).contains(nextRule.getEligibleInterval(currInterval.getStart()))
&& currRule.getEligibleInterval(currInterval.getEnd()).contains(nextRule.getEligibleInterval(currInterval.getEnd())))) {
throw InvalidInput.exception(
"Rule[%s] has an interval that contains interval for rule[%s]. The interval[%s] also covers interval[%s].",
currRule,
nextRule,
currInterval,
nextInterval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behaviour in the following situation:
Rule#1 - Covers the interval 2020-2021
Rule#2 - Covers the interval 2021-2022
Rule#3 - Covers the interval 2020-2022

In this case validateRules would not throw an exception, however, from my understanding, Rule#3 would never trigger.

If that's the case, we should handle these cases as well (probably by maintaining a set of intervals that the rules seen so far cover and checking if the next rule isn't completely overshadowed by the interval set seen so far)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a nice check to have as well. Let me think about it more and adjust as needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LakshSingla, actually, Druid currently does a fully contains interval check for drop rules, so rule #3 is valid with dropByInterval. So we can't really guarantee the overlapping condition in a type agnostic way (which apply to load rules, so we would probably need a separate contract)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I didn't get why that is the case. Rule#1 spans from [2020-2021). rule #2 spans from [2021-2022). Therefore, collectively, rule#1 and rule#2 span from [2020-2022). Is there any instant in the timeline that would cause rule#3 to fire without triggering either rule#1 or rule#2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, rule #3 is valid for the time interval if it existed in isolation. However, when present along with other rules, the precedence of those takes place, causing rule#3 to be effectively useless, which is what this check exists to detect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Rule #3 would never be applied on any segment irrespective of the rule types.
While evaluating retention rules, as soon as a rule applies to a segment, we break and don't look at the other rules.

Also, in general, I am not entirely sure of the validation logic here.

  • For one, the set thing suggested by @LakshSingla makes sense. But we would have to ensure that the fully covered check is done correctly.
    • Additive logic (cleaner to implement): Say your set contains intervals A [4, 10) and B [15, 20) and a new interval C [8, 17) comes in. Adding this new interval C should merge all of A, B and C into one interval [4, 17).
    • Subtractive logic: To check if an interval has any part which is not fully covered by the preceding intervals, remove all parts of it that overlap with intervals already encountered. If you are left with an empty interval, the rule is not valid.
  • Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.
  • There could even be other types of rules in the future, but I guess we need not worry about them now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's consider a segment with an interval 2020/2022 and the following drop rules in order:

  1. dropByInterval('2020/2021')
  2. dropByInterval('2021/2022')
  3. dropByInterval('2020/2022')

For the segment in question, only rule 3 will match as the complete interval is contained in the drop interval.

On the other hand, if we consider loadByInterval rules for the same intervals, you're right that all the rules will apply for the segment as load rules will fire as long as there is at least an overlap.

Hope that clarifies why we cannot detect the overlapping case without baking in some type awareness, but let me know if I am missing something!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the incorrect assumption I made in my original example. This digression occurs from the semantics of how intervals are applied in load rules v/s in drop rules. In load rules, we only need for partial overlap for the rule to fire, while in the drop rule, we need a complete overlap for the rule to fire

However when we are doing a contains check in this logic, we are inherently assuming something about the structure of the rules.

Consider the following case:
The system has 2 types of segments, one ranging from 2020-2021 and the other from 2020-2022. If the user wants to drop the segments ranging from 2020-2021 while preserving the other segments, how does the user craft intuitive rules for the same without causing the newly added check to fail? I only came up with the following, however, that fails the check.

  1. Drop 2020-2021
  2. Load 2020-2022

I think this ties back to the common use case that we are trying to prevent the user from doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re the use case above - I'd write the same rules as you've listed in the example. The first rule's interval 2020/2021 does not contain the second rule's interval 2020/2022. So, I'm not sure why the validation check would flag this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kfaraz - please see my responses to your comments:

we would have to ensure that the fully covered check is done correctly.

To check if an interval fully contains another, the proposed algorithm has a nested loop that goes through the sequence of rules to find if a larger interval contains another subsequent interval down in the list; O(n^2) algorithmic runtime, but practical runtime should be trivial though.

I suppose there are alternate approaches like condensing intervals - the additive and subtractive logic you're describing. My first thought is that this could work well for the overlapping case (which isn't the intent of this patch).

Re this:

Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.

The boundary check primarily exists to allow the case where a rule will fire at some point. Unlike the forever and interval rules, period rules factor in the reference timestamp thereby yielding distinct intervals at the boundaries. Something like:

PeriodRuleR3.eligibleInterval(now) = IntervalR3
PeriodRuleR3.eligibleInterval(1yAgo) = IntervalR4

The boundaries check is essentially to verify that a legit rule will fire at some point and allow such a rule. For example, consider the rules [loadByPeriod(P10Y), loadByInterval('2020/2025')]:

  • With the boundary check (before and after 10 years from now), the first rule will not contain the second interval. So the validation check will not flag the rule set. The second rule will indeed fire at some point in the future.
  • Without the boundary check, the first rule will fully cover the second rule's interval, incorrectly flagging the second rule.

I realize I need to update the javadocs. :)

@abhishekrb19 abhishekrb19 changed the title Validate datasource retention rules: Disallow rules that overshadow other rules Validate datasource retention rules: Reject rules that fully contain the following rules' interval Sep 20, 2023
Co-authored-by: Laksh Singla <lakshsingla@gmail.com>
@abhishekrb19 abhishekrb19 changed the title Validate datasource retention rules: Reject rules that fully contain the following rules' interval Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval Sep 20, 2023
@Override
public Interval getEligibleInterval(DateTime referenceTimestamp)
{
return new Interval(referenceTimestamp.minus(period), referenceTimestamp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return new Interval(referenceTimestamp.minus(period), referenceTimestamp);
return new Interval(period, referenceTimestamp);

Comment on lines 65 to 81
final Rule currRule = rules.get(i);
final Rule nextRule = rules.get(i + 1);
final Interval currInterval = currRule.getEligibleInterval(now);
final Interval nextInterval = nextRule.getEligibleInterval(now);
if (currInterval.contains(nextInterval)) {
// If the current rule has an eternity interval, it covers everything following it.
// Or if the current rule still covers the next rule at the current interval boundaries, then the
// next rule will never fire at any time, so throw an exception.
if (Intervals.ETERNITY.equals(currInterval) ||
(currRule.getEligibleInterval(currInterval.getStart()).contains(nextRule.getEligibleInterval(currInterval.getStart()))
&& currRule.getEligibleInterval(currInterval.getEnd()).contains(nextRule.getEligibleInterval(currInterval.getEnd())))) {
throw InvalidInput.exception(
"Rule[%s] has an interval that contains interval for rule[%s]. The interval[%s] also covers interval[%s].",
currRule,
nextRule,
currInterval,
nextInterval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Rule #3 would never be applied on any segment irrespective of the rule types.
While evaluating retention rules, as soon as a rule applies to a segment, we break and don't look at the other rules.

Also, in general, I am not entirely sure of the validation logic here.

  • For one, the set thing suggested by @LakshSingla makes sense. But we would have to ensure that the fully covered check is done correctly.
    • Additive logic (cleaner to implement): Say your set contains intervals A [4, 10) and B [15, 20) and a new interval C [8, 17) comes in. Adding this new interval C should merge all of A, B and C into one interval [4, 17).
    • Subtractive logic: To check if an interval has any part which is not fully covered by the preceding intervals, remove all parts of it that overlap with intervals already encountered. If you are left with an empty interval, the rule is not valid.
  • Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.
  • There could even be other types of rules in the future, but I guess we need not worry about them now.

@LakshSingla
Copy link
Contributor

To allow for the introduction of new rules and cases where the user does want to add these absurd rules that have been supported till now, WDYT of having a URL parameter ?forced which bypasses these validation checks. The new check would be the default, however, if someone wants to override it, they can set the flag and bypass them.

@abhishekrb19
Copy link
Contributor Author

@LakshSingla, re:

where the user does want to add these absurd rules that have been supported till now, WDYT of having a URL parameter ?forced which bypasses these validation checks.

hmm, a parameter to bypass API validation seems a bit hacky to me. If users want to "temporarily" set something for testing or so, they can set effective rules and then always retrieve the "actual" rules from the audit history from the console/API. I think that should suffice?

Copy link

github-actions bot commented Mar 5, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 5, 2024
Copy link

github-actions bot commented Apr 3, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants