
Support for compaction reservation expiration was not implemented. #4454

Closed
keith-turner opened this issue Apr 12, 2024 · 8 comments

@keith-turner
Contributor

When the compaction management code was moved from the tablet server to the manager, support for the functionality behind this property was dropped. This property controlled functionality that allowed a system compaction to cancel the reservation a user compaction had on files if the user compaction was queued for too long and never ran.

@keith-turner
Contributor Author

There is some background from the 2.1 implementation that would be helpful in understanding this. I plan to add a comment w/ that information.

@keith-turner
Contributor Author

keith-turner commented Apr 12, 2024

In 2.1 and forward, user compactions reserve a set of files per tablet and then create one or more jobs to compact these reserved files. The reserved files are not available for system compactions. It is possible for the following scenario to happen:

  1. Tablet A has many files that are currently candidates for system compaction to reduce the tablet's file count.
  2. A user compaction is started that includes Tablet A, and its files are reserved.
  3. Once the files are reserved for Tablet A, a compaction job CJ1 is queued to process some subset of the files.
  4. CJ1 sits in a queue for hours and does not run.
  5. System compactions are prevented from running during the time the files are reserved.

The property referenced in this ticket deals with the above situation by allowing the system compaction to cancel the reservation if nothing has happened for a while. The user compaction will eventually acquire a new reservation on the files after this happens.

The way reservation cancellation works, it can only happen when zero jobs have run against a tablet for a user compaction. Once a single job has run, the reservation can not be canceled. This is done to avoid wasting work; canceling the reservation when zero jobs have run wastes no work.
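
To make the rule concrete, here is a minimal sketch of the cancellation check, assuming a hypothetical `Reservation` record that tracks completed jobs; the names are illustrative, not the actual 2.1 code:

```java
/**
 * Minimal sketch of the 2.1 cancellation rule. All names here are
 * hypothetical, not the actual Accumulo code.
 */
class ReservationExpirySketch {

  record Reservation(int completedJobs, long createdMillis, long expirationMillis) {}

  /**
   * A system compaction may cancel a user compaction's file reservation only
   * when zero jobs have completed (so no work is wasted) and the reservation
   * has sat idle past its configured expiration.
   */
  static boolean canCancel(Reservation r, long nowMillis) {
    return r.completedJobs() == 0 && (nowMillis - r.createdMillis()) > r.expirationMillis();
  }
}
```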

In elasticity there is currently nothing in place for system compactions to cancel a user compaction reservation. Elasticity has a new selected files column that holds the per-tablet reserved files. This selected files column is created by the Fate operation that drives user compactions. The tablet group watcher sees it and creates compaction jobs based on it. The coordinator modifies it using conditional mutations as compaction jobs run.

We could possibly do something like the following in elasticity (a rough sketch follows the list).

  1. Add a new field to the selected files column with a count of the number of jobs that have completed. This would be updated by the coordinator when it commits compactions using conditional mutations.
  2. When the above count is zero, the tablet group watcher could queue both system and user compaction jobs. When the count is >0, only user compaction jobs could be scheduled.
  3. When the selected files column exists, the compaction coordinator will only allow system compaction jobs to start when the count is zero. If a system compaction does start, it will remove the selected files column, forcing the fate op to recreate it when the system compaction is done.
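
A rough sketch of the scheduling decision in item 2, with hypothetical types (the real tablet group watcher code is organized differently):

```java
import java.util.EnumSet;

/**
 * Rough sketch of item 2 above; types and names are hypothetical, not the
 * actual tablet group watcher code.
 */
class JobSchedulingSketch {

  enum CompactionKind { SYSTEM, USER }

  /**
   * While the selected files column exists, system jobs may only be queued
   * while its completed-job count is still zero; once any user job has
   * finished, only user jobs are scheduled so completed work is not wasted.
   */
  static EnumSet<CompactionKind> schedulableKinds(boolean selectedFilesExist, int completedJobs) {
    if (!selectedFilesExist) {
      return EnumSet.of(CompactionKind.SYSTEM);
    }
    return completedJobs == 0 ? EnumSet.of(CompactionKind.SYSTEM, CompactionKind.USER)
        : EnumSet.of(CompactionKind.USER);
  }
}
```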

However, I am not sure how to handle time in the above situation. The above is conceptually what 2.1 does, but it's implemented in a completely different way, using conditional mutations instead of in-memory data structures in a tablet server. Maybe we could drop the notion of a timeout and always let system compactions run. The reason the timeout was added was to prevent system compactions from starving user compactions in the case where a tablet always has new small files arriving.

@cshannon cshannon self-assigned this Apr 12, 2024
@cshannon
Contributor

I can start taking a look at this

@keith-turner
Contributor Author

keith-turner commented Apr 19, 2024

Below are some pointers for the code.

  1. This code selects files for user compactions, creating the selected column.
  2. This code creates compaction jobs that are queued. If we add a count to the selected files column, then this code could generate system jobs when that count is zero (and no user jobs are running) in addition to user jobs. We would not want to exclude the selected files as candidates when planning system compactions in this case, as this code currently does.
  3. This code creates a running compaction entry in the metadata table when a compaction starts. This code could allow system compactions to start if the selected files column exists and its count is zero (and no user jobs are running); in this case it would need to remove the selected files column.
  4. This code commits a compaction. It will modify the selected files column, removing the compacted files for user compactions. It will also update the tablet's files and remove the running compaction entry. This code could increment the count in the selected files column (see the sketch after this list).
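
As a simplified illustration of item 4, incrementing such a count could use Accumulo's public conditional writer API, as sketched below. The real coordinator goes through internal metadata code, and the table, row, and column names here are hypothetical:

```java
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;
import org.apache.accumulo.core.data.Value;

/**
 * Simplified sketch of item 4: incrementing a completed-job count with a
 * conditional mutation. The real coordinator uses Accumulo's internal
 * metadata code; the table, row, and column names here are hypothetical.
 */
class CommitCountSketch {

  static ConditionalWriter.Status incrementCount(AccumuloClient client, String metaTable,
      String tabletRow, int expectedCount) throws Exception {
    try (ConditionalWriter writer =
        client.createConditionalWriter(metaTable, new ConditionalWriterConfig())) {
      ConditionalMutation cm = new ConditionalMutation(tabletRow);
      // Only succeed if the count is still the value this commit read earlier;
      // any concurrent change (e.g. the column being deleted) rejects the write.
      cm.addCondition(
          new Condition("file", "selected.count").setValue(Integer.toString(expectedCount)));
      cm.put("file", "selected.count", new Value(Integer.toString(expectedCount + 1)));
      return writer.write(cm).getStatus();
    }
  }
}
```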

We could probably drop the timeout in elasticity for now and replace it with a log message indicating that a system compaction removed the selected files of a user compaction. This would allow detection of the case where a system compaction is starving a user compaction. This logging would be done by the code at number 3 above.
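
The log message itself could be as simple as the following sketch (wording and fields are hypothetical):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class StarvationLogSketch {
  private static final Logger log = LoggerFactory.getLogger(StarvationLogSketch.class);

  /** Hypothetical message emitted when a system compaction clears selected files. */
  static void logSelectedFilesRemoved(String tabletExtent, String systemJobId) {
    log.info("Cleared selected files of user compaction on tablet {} so system job {} can run;"
        + " the fate op will select files again later", tabletExtent, systemJobId);
  }
}
```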

@cshannon
Contributor

@keith-turner - I am diving into this more now, and I just remembered that I recently added a new column to prevent user compaction starvation by system compactions in #4254.

I'm wondering if we need to worry about this column? At first glance I don't think so, because it should be cleared after files are selected, so I think we just need to worry about the selected files column and clearing that to cancel a user compaction.

@cshannon
Contributor

cshannon commented Apr 19, 2024

Actually, it looks like the column is not cleared until after the compaction finishes. The original issue mentions clearing the column after selecting files, but currently it is not cleared until CleanUp runs on completion here.

So, I am wondering if we should instead clear that column when we mutate the tablet to insert selected files (not sure if we should still have CleanUp check as well), or if we should keep the current behavior where the marker isn't removed until after the compaction finishes, in which case we would also need to clear that marker as well as the selected files.

@cshannon
Contributor

I have a prototype of this in a branch, and I'm working on tweaking things and adding some tests and an IT. This may be another feature that would be suited for being tested with #4415 when merged, but a regular IT is also necessary, as there's more to test here than just the fate operations, and it's good to test end to end with everything working together. I should have at least a draft PR up this weekend.

cshannon added a commit to cshannon/accumulo that referenced this issue Apr 20, 2024
This change will allow system compactions to postpone user compactions
that have had no jobs run yet. Before this, if a user compaction was in
the queue and had selected files that overlapped, it would block system
compactions from running. Now if there are selected files, but the user
compaction is not running and hasn't had any jobs completed, the
coordinator will clear the selectedFiles column so that the system
compaction can run. The fate operation will reset the column again while
trying to make progress.

This closes apache#4454
@keith-turner
Contributor Author

keith-turner commented Apr 20, 2024

One possible way to handle time is to use the existing steady time in the manager, which is suitable for persisting and using across multiple manager processes. One potential drawback to this approach is that the TabletManagement iterator would need access to the current steady time to pass to the compaction job generator so it can make decisions. One way to handle this would be to add a steadytime parameter to this class that is used to pass information from the manager to the tablet mgmt iterator. This could be explored in its own PR. If it were done, I think a good prerequisite change would be to introduce a SteadyTime type to replace the long currently returned by the method. If SteadyTime were more widely used, a narrower type in the code would be good for correctness.

Update: Steady time could be persisted in the selected files column at creation time. Then everything making decisions based on it could use that to compute an age.
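
A minimal sketch of what that prerequisite could look like, assuming a simple wrapper type (the actual type, if introduced, may differ):

```java
import java.time.Duration;

/**
 * Minimal sketch of a SteadyTime wrapper; the actual type, if introduced,
 * may differ. Wrapping the raw long makes it a compile error to mix steady
 * time with wall-clock millis.
 */
record SteadyTime(long millis) {

  /** Age of something stamped with an earlier steady time, e.g. the selected files column. */
  Duration elapsedSince(SteadyTime earlier) {
    return Duration.ofMillis(millis - earlier.millis());
  }
}
```

Anything evaluating the selected files column could then compare the steady time persisted at creation with the manager's current steady time to decide whether the reservation has expired.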

cshannon added a commit that referenced this issue May 18, 2024
…#4480)

This change will allow system compactions to postpone user compactions
that have had no jobs run yet. Before this, if a user compaction was in
the queue and had selected files that overlapped, it would block system
compactions from running. Now if there are selected files, but the user
compaction is not running and hasn't had any jobs completed, the
coordinator will clear the selectedFiles column so that the system
compaction can run if the expiration time has passed. The fate operation 
will reset the column again while trying to make progress.

This closes #4454

Co-authored-by: Keith Turner <kturner@apache.org>
@ctubbsii ctubbsii added this to the 4.0.0 milestone Jul 12, 2024