Affected Version
0.15.1. I believe it affects 0.16 and master, but I have not tested those versions.
Description
Say you have load rules configured to load only data from `2019-02-01/P1M`, and you have a single segment covering the interval `2019-01-20/2019-02-10`. Note that this segment is loaded by historicals (see #5595) because it overlaps with the load rule. Historicals and brokers will in fact even serve queries for the end of January, because load rules are (as far as I understand) only used to decide which segments to load onto historicals, and don't affect how queries are served later.
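For concreteness, a retention configuration producing this situation might look like the following (illustrative only: `loadByInterval` and `dropForever` are real Druid rule types, but the tier name and replicant count here are assumptions, and `2019-02-01/2019-03-01` is just `2019-02-01/P1M` spelled out):

```json
[
  {
    "type": "loadByInterval",
    "interval": "2019-02-01/2019-03-01",
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
```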
Now you run batch ingestion (say, native batch ingestion with the `ingestSegment` firehose applying a filter) with output segment granularity `DAY` over the range `2019-01-01/2019-02-05`. This produces one segment for each of the 31 days in January plus the first four days of February.
The 4 February segments will be loaded onto historicals by the load rules, but the 31 January segments will not. This means that queries against late January (from the 20th onward) will return results based on the "old" pre-reingestion data, not the new data!
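A minimal sketch of the coordinator's per-segment decision, as I understand it (simplified: a single load interval, a plain overlap test, no tiers), reproduces the 4-loaded/31-dropped split:

```python
from datetime import date, timedelta

def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end) overlap iff each starts before the other ends.
    return a_start < b_end and b_start < a_end

# Load rule covers 2019-02-01/P1M, i.e. [2019-02-01, 2019-03-01).
rule = (date(2019, 2, 1), date(2019, 3, 1))

# DAY-granularity segments produced by re-ingesting 2019-01-01/2019-02-05.
days = [date(2019, 1, 1) + timedelta(days=i) for i in range(35)]
segments = [(d, d + timedelta(days=1)) for d in days]

loaded = [s for s in segments if overlaps(*s, *rule)]
dropped = [s for s in segments if not overlaps(*s, *rule)]
print(len(loaded), len(dropped))  # 4 loaded (Feb 1-4), 31 dropped (all of January)
```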
Moreover, if the cluster is configured to automatically kill unused segments, the new data will be permanently deleted. That means that if you later change the load rules to include the intervals of the old data, even intervals covered by load rules will start returning "old" data. And because the new segments have been killed, they won't be combined with the older segment when automatic compaction runs.
Potential fixes
I can think of a few classes of fixes:
Change load rule semantics
Change load rules to load any segments which overlap with loaded segments that they overshadow. This is relatively simple and only affects one part of the code, and solves both the "immediate queries give old data" and the "increasing load periods later actively loads old data" issues.
My main concern is that people might be confused by the fact that segments are loaded which don't themselves match load rules and think something is wrong with their configuration. (Maybe we could show an explanation in the new console UI when these situations happen?)
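The proposed semantics could be sketched like this (all names here are illustrative, not Druid's actual coordinator code; the integer `version` stands in for Druid's segment-version comparison):

```python
from collections import namedtuple
from datetime import date

Segment = namedtuple("Segment", ["start", "end", "version"])

def overlaps(a, b):
    # Half-open intervals [start, end) overlap iff each starts before the other ends.
    return a.start < b.end and b.start < a.end

def should_load(segment, rule_start, rule_end, loaded):
    # Current semantics: load only if the segment overlaps a load rule interval.
    if segment.start < rule_end and rule_start < segment.end:
        return True
    # Proposed addition: also load if the segment overshadows (has a newer
    # version than, and overlaps) a segment that is currently loaded.
    return any(old.version < segment.version and overlaps(segment, old)
               for old in loaded)

rule = (date(2019, 2, 1), date(2019, 3, 1))  # 2019-02-01/P1M

# The old segment from the example, on historicals because it overlaps the rule.
old = Segment(date(2019, 1, 20), date(2019, 2, 10), version=1)

# A January DAY segment from the re-ingestion; it overshadows `old`.
jan25 = Segment(date(2019, 1, 25), date(2019, 1, 26), version=2)

print(should_load(jan25, *rule, loaded=[old]))  # True; the rule match alone gives False
```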
Apply load rules in more places
The coordinator could tell historicals, when loading segments onto them, to pretend the segment is smaller than it actually is, covering only the part of its interval that overlaps with the load rules.
This would involve changing more places than "change load rule semantics", and it would not fix the second issue: unused segments can still be killed, so old data will still be served if the load rules are widened later.
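The clamping the coordinator would ask historicals to perform could look like this (a hypothetical helper, not an existing Druid API):

```python
from datetime import date

def queryable_interval(seg_start, seg_end, rule_start, rule_end):
    # Intersect the segment's interval with the load rule's interval; the
    # historical would answer queries as if the segment only covered this part.
    start = max(seg_start, rule_start)
    end = min(seg_end, rule_end)
    return (start, end) if start < end else None

# The old 2019-01-20/2019-02-10 segment clamped to the 2019-02-01/P1M rule:
print(queryable_interval(date(2019, 1, 20), date(2019, 2, 10),
                         date(2019, 2, 1), date(2019, 3, 1)))
# -> (datetime.date(2019, 2, 1), datetime.date(2019, 2, 10))
```

With the old segment clamped to `2019-02-01/2019-02-10`, queries for late January would no longer match it, so stale results would stop immediately; but as noted above, this does nothing about segments that have already been killed.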