Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015

abhishekrb19 · 2023-09-19T21:58:47Z

Druid's retention rules are evaluated in the order in which they are set. Unfortunately, a mistake in the order of rules by an operator can lead to unintentional data loss, and if it goes unnoticed and auto-kill is enabled, data could be permanently deleted.

To address this issue, a guard rail is added in the rules API. This guard rail aims to identify rules that fully contains the interval for any subsequent rules A rule is considered good as long as it will run at some point (and doesn't fully hide other rules in the chain) - it's ensured by checking at the interval boundaries as well.

Some examples of bad rules that will be disallowed (see unit tests for more examples):

[loadByPeriod(P6M), loadByPeriod(P3M)]
[loadByInterval(2020-01-01/2050-01-01), loadByInterval(2021-01-01/2023-01-01)]
[dropBeforeByPeriod(P3M), dropBeforeByPeriod(P6M)]
[loadForever, loadByPeriod(P1M)]
[loadForever, dropForever]

Examples of rules that will be allowed because they will be evaluated at some point:

[loadByPeriod(P5Y), loadByInterval(2021-01-01/2023-01-01)]; the loadByInterval rule will be evaluated at some point.
[loadByPeriod(P1Y), loadByPeriod(P2Y)]

For testing purposes, if a user wants to intentionally set a rule that fully contains other rules, they can do it temporarily and then fetch the correct set of rules from the rules audit history after their testing is done.

Release note

Datasource rules API will reject the rules payload if a rule fully contains subsequent rules' interval.

Key changed/added classes in this PR

Rule.java
Rules.java
RulesTest.java
RulesEligibleIntervalTest.java

This PR has:

been self-reviewed.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

server/src/test/java/org/apache/druid/server/http/RulesResourceTest.java

This would prevent users from shooting themselves in the foot. For example, if an interval in a rule is "large enough" and covers an interval in subsequent rule(s), then those rules will never run. Implement guard rails for rules in the rule chain.

kfaraz · 2023-09-20T03:19:46Z

@abhishekrb19 , it seems like a nice idea to have these checks while updating retention rules.

If there is a cluster that already defines rules that would fail the check, would that cluster break after this change?
Edit: I suppose not since the check is enforced only when updating rules.

kfaraz

Haven't done a full review of the validation logic yet. Left some comments on the structure.

...rc/main/java/org/apache/druid/server/coordinator/rules/ForeverBroadcastDistributionRule.java

...src/main/java/org/apache/druid/server/coordinator/rules/PeriodBroadcastDistributionRule.java

server/src/main/java/org/apache/druid/server/coordinator/rules/Rule.java

server/src/main/java/org/apache/druid/server/coordinator/rules/Rules.java

server/src/main/java/org/apache/druid/server/http/RulesResource.java

abhishekrb19 · 2023-09-20T04:45:39Z

Thanks for the review, @kfaraz. Re:

If there is a cluster that already defines rules that would fail the check, would that cluster break after this change?

No, this is an API-only validation change, so any existing rules in a cluster will remain as-is.

server/src/main/java/org/apache/druid/server/coordinator/rules/Rule.java

LakshSingla · 2023-09-20T05:31:49Z

server/src/main/java/org/apache/druid/server/coordinator/rules/Rules.java

+      final Rule currRule = rules.get(i);
+      final Rule nextRule = rules.get(i + 1);
+      final Interval currInterval = currRule.getEligibleInterval(now);
+      final Interval nextInterval = nextRule.getEligibleInterval(now);
+      if (currInterval.contains(nextInterval)) {
+        // If the current rule has an eternity interval, it covers everything following it.
+        // Or if the current rule still covers the next rule at the current interval boundaries, then the
+        // next rule will never fire at any time, so throw an exception.
+        if (Intervals.ETERNITY.equals(currInterval) ||
+            (currRule.getEligibleInterval(currInterval.getStart()).contains(nextRule.getEligibleInterval(currInterval.getStart()))
+             && currRule.getEligibleInterval(currInterval.getEnd()).contains(nextRule.getEligibleInterval(currInterval.getEnd())))) {
+          throw InvalidInput.exception(
+              "Rule[%s] has an interval that contains interval for rule[%s]. The interval[%s] also covers interval[%s].",
+              currRule,
+              nextRule,
+              currInterval,
+              nextInterval


What's the behaviour in the following situation:
Rule#1 - Covers the interval 2020-2021
Rule#2 - Covers the interval 2021-2022
Rule#3 - Covers the interval 2020-2022

In this case validateRules would not throw an exception, however, from my understanding, Rule#3 would never trigger.

If that's the case, we should handle these cases as well (probably by maintaining a set of intervals that the rules seen so far cover and checking if the next rule isn't completely overshadowed by the interval set seen so far)

That sounds like a nice check to have as well. Let me think about it more and adjust as needed.

@LakshSingla, actually, Druid currently does a fully contains interval check for drop rules, so rule #3 is valid with dropByInterval. So we can't really guarantee the overlapping condition in a type agnostic way (which apply to load rules, so we would probably need a separate contract)

I am sorry, I didn't get why that is the case. Rule#1 spans from [2020-2021). rule #2 spans from [2021-2022). Therefore, collectively, rule#1 and rule#2 span from [2020-2022). Is there any instant in the timeline that would cause rule#3 to fire without triggering either rule#1 or rule#2.

From my understanding, rule #3 is valid for the time interval if it existed in isolation. However, when present along with other rules, the precedence of those takes place, causing rule#3 to be effectively useless, which is what this check exists to detect.

Yes, I agree. Rule #3 would never be applied on any segment irrespective of the rule types.
While evaluating retention rules, as soon as a rule applies to a segment, we break and don't look at the other rules.

Also, in general, I am not entirely sure of the validation logic here.

For one, the set thing suggested by @LakshSingla makes sense. But we would have to ensure that the fully covered check is done correctly.

Additive logic (cleaner to implement): Say your set contains intervals A [4, 10) and B [15, 20) and a new interval C [8, 17) comes in. Adding this new interval C should merge all of A, B and C into one interval [4, 17).

Subtractive logic: To check if an interval has any part which is not fully covered by the preceding intervals, remove all parts of it that overlap with intervals already encountered. If you are left with an empty interval, the rule is not valid.

Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.

There could even be other types of rules in the future, but I guess we need not worry about them now.

Yeah, let's consider a segment with an interval 2020/2022 and the following drop rules in order:

dropByInterval('2020/2021')

dropByInterval('2021/2022')

dropByInterval('2020/2022')

For the segment in question, only rule 3 will match as the complete interval is contained in the drop interval.

On the other hand, if we consider loadByInterval rules for the same intervals, you're right that all the rules will apply for the segment as load rules will fire as long as there is at least an overlap.

Hope that clarifies why we cannot detect the overlapping case without baking in some type awareness, but let me know if I am missing something!

I get the incorrect assumption I made in my original example. This digression occurs from the semantics of how intervals are applied in load rules v/s in drop rules. In load rules, we only need for partial overlap for the rule to fire, while in the drop rule, we need a complete overlap for the rule to fire

However when we are doing a contains check in this logic, we are inherently assuming something about the structure of the rules.

Consider the following case:
The system has 2 types of segments, one ranging from 2020-2021 and the other from 2020-2022. If the user wants to drop the segments ranging from 2020-2021 while preserving the other segments, how does the user craft intuitive rules for the same without causing the newly added check to fail? I only came up with the following, however, that fails the check.

Drop 2020-2021

Load 2020-2022

I think this ties back to the common use case that we are trying to prevent the user from doing?

Re the use case above - I'd write the same rules as you've listed in the example. The first rule's interval 2020/2021 does not contain the second rule's interval 2020/2022. So, I'm not sure why the validation check would flag this?

@kfaraz - please see my responses to your comments:

we would have to ensure that the fully covered check is done correctly.

To check if an interval fully contains another, the proposed algorithm has a nested loop that goes through the sequence of rules to find if a larger interval contains another subsequent interval down in the list; O(n^2) algorithmic runtime, but practical runtime should be trivial though.

I suppose there are alternate approaches like condensing intervals - the additive and subtractive logic you're describing. My first thought is that this could work well for the overlapping case (which isn't the intent of this patch).

Re this:

Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.

The boundary check primarily exists to allow the case where a rule will fire at some point. Unlike the forever and interval rules, period rules factor in the reference timestamp thereby yielding distinct intervals at the boundaries. Something like:

PeriodRuleR3.eligibleInterval(now) = IntervalR3
PeriodRuleR3.eligibleInterval(1yAgo) = IntervalR4

The boundaries check is essentially to verify that a legit rule will fire at some point and allow such a rule. For example, consider the rules [loadByPeriod(P10Y), loadByInterval('2020/2025')]:

With the boundary check (before and after 10 years from now), the first rule will not contain the second interval. So the validation check will not flag the rule set. The second rule will indeed fire at some point in the future.

Without the boundary check, the first rule will fully cover the second rule's interval, incorrectly flagging the second rule.

I realize I need to update the javadocs. :)

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

server/src/main/java/org/apache/druid/server/coordinator/rules/PeriodDropBeforeRule.java

kfaraz · 2023-09-20T04:48:44Z

server/src/main/java/org/apache/druid/server/coordinator/rules/PeriodDropRule.java

+  @Override
+  public Interval getEligibleInterval(DateTime referenceTimestamp)
+  {
+    return new Interval(referenceTimestamp.minus(period), referenceTimestamp);


Suggested change

return new Interval(referenceTimestamp.minus(period), referenceTimestamp);

return new Interval(period, referenceTimestamp);

server/src/main/java/org/apache/druid/server/coordinator/rules/PeriodLoadRule.java

server/src/main/java/org/apache/druid/server/coordinator/rules/Rule.java

server/src/main/java/org/apache/druid/server/coordinator/rules/Rules.java

server/src/test/java/org/apache/druid/server/http/RulesResourceTest.java

server/src/main/java/org/apache/druid/server/coordinator/rules/Rule.java

kfaraz · 2023-09-21T05:51:28Z

server/src/main/java/org/apache/druid/server/coordinator/rules/Rules.java

+      final Rule currRule = rules.get(i);
+      final Rule nextRule = rules.get(i + 1);
+      final Interval currInterval = currRule.getEligibleInterval(now);
+      final Interval nextInterval = nextRule.getEligibleInterval(now);
+      if (currInterval.contains(nextInterval)) {
+        // If the current rule has an eternity interval, it covers everything following it.
+        // Or if the current rule still covers the next rule at the current interval boundaries, then the
+        // next rule will never fire at any time, so throw an exception.
+        if (Intervals.ETERNITY.equals(currInterval) ||
+            (currRule.getEligibleInterval(currInterval.getStart()).contains(nextRule.getEligibleInterval(currInterval.getStart()))
+             && currRule.getEligibleInterval(currInterval.getEnd()).contains(nextRule.getEligibleInterval(currInterval.getEnd())))) {
+          throw InvalidInput.exception(
+              "Rule[%s] has an interval that contains interval for rule[%s]. The interval[%s] also covers interval[%s].",
+              currRule,
+              nextRule,
+              currInterval,
+              nextInterval


Yes, I agree. Rule #3 would never be applied on any segment irrespective of the rule types.
While evaluating retention rules, as soon as a rule applies to a segment, we break and don't look at the other rules.

Also, in general, I am not entirely sure of the validation logic here.

For one, the set thing suggested by @LakshSingla makes sense. But we would have to ensure that the fully covered check is done correctly.

Additive logic (cleaner to implement): Say your set contains intervals A [4, 10) and B [15, 20) and a new interval C [8, 17) comes in. Adding this new interval C should merge all of A, B and C into one interval [4, 17).

Subtractive logic: To check if an interval has any part which is not fully covered by the preceding intervals, remove all parts of it that overlap with intervals already encountered. If you are left with an empty interval, the rule is not valid.

Secondly, all of this is easy to follow when we are dealing with ForeverXyzRule or IntervalXyzRule (rule with fixed interval). For PeriodXyzRule, I am not entirely sure of the boundaries that would need to be checked. The proposed change does compute the eligibleInterval at both start and end of the currInterval but I need to give it some more thought.

There could even be other types of rules in the future, but I guess we need not worry about them now.

...src/main/java/org/apache/druid/server/coordinator/rules/PeriodBroadcastDistributionRule.java

server/src/main/java/org/apache/druid/server/coordinator/rules/PeriodDropRule.java

LakshSingla · 2023-09-21T07:03:44Z

To allow for the introduction of new rules and cases where the user does want to add these absurd rules that have been supported till now, WDYT of having a URL parameter ?forced which bypasses these validation checks. The new check would be the default, however, if someone wants to override it, they can set the flag and bypass them.

abhishekrb19 · 2023-09-27T05:22:17Z

@LakshSingla, re:

where the user does want to add these absurd rules that have been supported till now, WDYT of having a URL parameter ?forced which bypasses these validation checks.

hmm, a parameter to bypass API validation seems a bit hacky to me. If users want to "temporarily" set something for testing or so, they can set effective rules and then always retrieve the "actual" rules from the audit history from the console/API. I think that should suffice?

github-actions · 2024-03-05T00:16:22Z

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions · 2024-04-03T00:16:07Z

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

abhishekrb19 added the Area - Operations label Sep 19, 2023

abhishekrb19 force-pushed the validate_rules_api branch 3 times, most recently from 2d15660 to 2f3b80b Compare September 19, 2023 23:02

github-advanced-security bot found potential problems Sep 19, 2023

View reviewed changes

server/src/test/java/org/apache/druid/server/http/RulesResourceTest.java Fixed Show fixed Hide fixed

abhishekrb19 force-pushed the validate_rules_api branch from 2f3b80b to f72eda0 Compare September 20, 2023 00:14

kfaraz requested changes Sep 20, 2023

View reviewed changes

Address review comments.

91d4797

abhishekrb19 requested a review from kfaraz September 20, 2023 04:41

LakshSingla reviewed Sep 20, 2023

View reviewed changes

abhishekrb19 changed the title ~~Validate datasource retention rules: Disallow rules that overshadow other rules~~ Validate datasource retention rules: Reject rules that fully contain the following rules' interval Sep 20, 2023

Apply suggestions from code review

2166d6c

Co-authored-by: Laksh Singla <lakshsingla@gmail.com>

abhishekrb19 changed the title ~~Validate datasource retention rules: Reject rules that fully contain the following rules' interval~~ Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval Sep 20, 2023

abhishekrb19 added 4 commits September 19, 2023 23:23

includeFuture change and unit test.

c3edb12

Merge branch 'master' into validate_rules_api

eed59b4

Some test cleanup.

20133fc

fix and add more tests.

f8975e2

kfaraz reviewed Sep 21, 2023

View reviewed changes

abhishekrb19 added 2 commits September 26, 2023 22:06

Minor adjustments for consistency

320cd90

Add separate test classes. Do some code cleanup.

d729fdf

fixup intellij inspection and update comment.

966c902

abhishekrb19 force-pushed the validate_rules_api branch from ce89122 to 966c902 Compare September 27, 2023 06:14

github-actions bot added the stale label Mar 5, 2024

github-actions bot closed this Apr 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015

Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015

abhishekrb19 commented Sep 19, 2023 •

edited

kfaraz commented Sep 20, 2023 •

edited

kfaraz left a comment •

edited

abhishekrb19 commented Sep 20, 2023

LakshSingla Sep 20, 2023

abhishekrb19 Sep 20, 2023

abhishekrb19 Sep 21, 2023

LakshSingla Sep 21, 2023 •

edited

LakshSingla Sep 21, 2023 •

edited

kfaraz Sep 21, 2023

abhishekrb19 Sep 21, 2023

LakshSingla Sep 21, 2023

abhishekrb19 Sep 21, 2023

abhishekrb19 Sep 21, 2023 •

edited

kfaraz Sep 20, 2023

kfaraz Sep 21, 2023

LakshSingla commented Sep 21, 2023

abhishekrb19 commented Sep 27, 2023

github-actions bot commented Mar 5, 2024

github-actions bot commented Apr 3, 2024

	return new Interval(referenceTimestamp.minus(period), referenceTimestamp);
	return new Interval(period, referenceTimestamp);

Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015

Validate datasource retention rules: Reject rules that fully contain subsequent rules' interval #15015

Conversation

abhishekrb19 commented Sep 19, 2023 • edited

Release note

Key changed/added classes in this PR

kfaraz commented Sep 20, 2023 • edited

kfaraz left a comment • edited

Choose a reason for hiding this comment

abhishekrb19 commented Sep 20, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LakshSingla Sep 21, 2023 • edited

Choose a reason for hiding this comment

LakshSingla Sep 21, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abhishekrb19 Sep 21, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LakshSingla commented Sep 21, 2023

abhishekrb19 commented Sep 27, 2023

github-actions bot commented Mar 5, 2024

github-actions bot commented Apr 3, 2024

abhishekrb19 commented Sep 19, 2023 •

edited

kfaraz commented Sep 20, 2023 •

edited

kfaraz left a comment •

edited

LakshSingla Sep 21, 2023 •

edited

LakshSingla Sep 21, 2023 •

edited

abhishekrb19 Sep 21, 2023 •

edited