-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for compaction reservation expiration was not implemented. #4454
Comments
There is some background from the 2.1 implementation that would be helpful in understanding this. I plan to add a comment w/ that information. |
In 2.1 and forward user compactions reserve a set of files per tablet and then create one or more jobs to compact these reserved files. The reserved files are not available for system compactions. It is possible that the scenario happens.
The property referenced in this ticket deals with the above situation by allowing the system compaction to cancel the reservation if nothing has happened for a while. Then user compaction will eventually acquire a new reservation on the files after this happens. The way the reservation cancellation works it can only happen when zero jobs have run against a tablet for a user compaction. Once a single job has run, the reservation can not be canceled. This is done to avoid wasting work. Canceling the reservation when zero jobs have run waste no work. In elasticity there is currently nothing in place for system compactions to cancel a user compaction reservation. In elasticity there is a new selected files column that holds the per tablet reserved files. This selected files column is created by the Fate operation that drives user compactions. The tablet group watcher sees it and creates compaction jobs based on it. The coordinator modifies it using conditional mutations as compaction jobs run. We could possibly do something like the following in elasticity.
However I am not sure how to handle time in the above situation. The above is conceptually what 2.1 does, but its implemented ina completely different way using conditional mutations vs in memory data structs in a tablet server. Maybe we could drop the notion of a timeout and always let system compactions run. The reason the timeout was added was to prevent system compactions from starving user compactions in the case were a tablet always has new small files arriving. |
I can start taking a look at this |
Below are some pointers for the code.
We could probably drop the timeout in elasticity for now and replace it with a log messages indicating a system compaction removed the selected files of a user compaction. This would allow detection of the case where a system compaction is starving a user compaction. This logging would be done by the code at number 3 above. |
@keith-turner - I am diving into this more now and I just remembered I recently added a new column to prevent user compaction starvation by system compactions in #4254 . I'm wondering if we need to worry about this column? At first glance I don't think so because it should be cleared after files are selected so I think we just need to worry about the selected files column and clearing that to cancel a user compaction. |
Actually, it looks like the column is not cleared until after the compaction finishes. The original issue mentions clearing the column after selecting files, but currently it is not cleared until CleanUp runs on completion here So, I am wondering if we should be clearing that column instead when we mutate the tablet to insert selected files (not sure if we should still have clean up check as well) or if we keep the current behavior where the marker isn't removed until after it's finished and instead we also would need to clear that marker as well as selected files |
I have a prototype of this in a branch and I'm working on tweaking things and an some tests and an IT. This may be another feature that would be suited for being tested with #4415 when merged, but a regular IT is also necessary as well as there's more to test here than just the fate operations and is good to test end to end with everything working together. I should have at least a draft PR up this weekend. |
This change will allow system compactions to postpone user compactions that have had no jobs run yet. Before this, if a user compaction was in the queue and selected files that overlapped it would block system compactions from running. Now if there are selected files but the user compaction is not running and hasn't had any jobs completed, the coordinator will clear the selectedFiles column so that the system compaction can run. The fate operation will reset the column again while trying to make progress. This clsoes apache#4454
This change will allow system compactions to postpone user compactions that have had no jobs run yet. Before this, if a user compaction was in the queue and had selected files that overlapped it would block system compactions from running. Now if there are selected files, but the user compaction is not running and hasn't had any jobs completed, the coordinator will clear the selectedFiles column so that the system compaction can run. The fate operation will reset the column again while trying to make progress. This closes apache#4454
One possible way to handle time is to use existing steady time in the manager which is suitable for persisting and using across multiple manager processes. One potential drawback to this approach is that the TabletManagment iterator would need access to the current steady time to pass to the compaction job generator so it can make decisions. One way to handle this would be to add a steadytime parameter to this class that is used to pass information from the manager to the tablet mgmt iterator. Trying this could be explored in its own PR. If it were done, thinking a good prerequisite change would be to introduce a SteadyTime type to replace the long currently returned by the method. If SteadyTime is more widely used a more narrow type in the code would be good for correctness. Update: Steady time could be persisted int he selected files column at creation time. Then everything making decisions based on it could use that to compute an age. |
…#4480) This change will allow system compactions to postpone user compactions that have had no jobs run yet. Before this, if a user compaction was in the queue and had selected files that overlapped it would block system compactions from running. Now if there are selected files, but the user compaction is not running and hasn't had any jobs completed, the coordinator will clear the selectedFiles column so that the system compaction can run if the expiration time has passed. The fate operation will reset the column again while trying to make progress. This closes #4454 Co-authored-by: Keith Turner <kturner@apache.org>
When the compaction management code was moved from the tablet server to the manager support for the functionality behind this property was dropped. This property controlled functionality that would allows system compaction to cancel the reservation a user compaction had on files if the user compaction was queued for too long and never ran.
The text was updated successfully, but these errors were encountered: