
Support for compaction reservation expiration was not implemented. #4454

Closed
keith-turner opened this issue Apr 12, 2024 · 8 comments

@keith-turner
Contributor

When the compaction management code was moved from the tablet server to the manager, support for the functionality behind this property was dropped. This property controlled functionality that allowed a system compaction to cancel the reservation a user compaction had on files if the user compaction was queued for too long and never ran.

@keith-turner
Contributor Author

There is some background from the 2.1 implementation that would be helpful in understanding this. I plan to add a comment w/ that information.

@keith-turner
Contributor Author

keith-turner commented Apr 12, 2024

In 2.1 and forward, user compactions reserve a set of files per tablet and then create one or more jobs to compact these reserved files. The reserved files are not available for system compactions. It is possible for the following scenario to happen:

  1. Tablet A has many files that are currently candidates for system compaction to reduce the tablet's file count.
  2. A user compaction is started that includes Tablet A, and its files are reserved.
  3. Once the files are reserved for Tablet A, a compaction job CJ1 is queued to process some subset of the files.
  4. CJ1 sits in a queue for hours and does not run.
  5. System compactions are prevented from running during the time the files are reserved.

The property referenced in this ticket deals with the above situation by allowing the system compaction to cancel the reservation if nothing has happened for a while. The user compaction will eventually acquire a new reservation on the files after this happens.

The way reservation cancellation works, it can only happen when zero jobs have run against a tablet for a user compaction. Once a single job has run, the reservation can not be canceled. This is done to avoid wasting work; canceling the reservation when zero jobs have run wastes no work.
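
To make the rule concrete, here is a minimal sketch of the cancellation check, assuming a hypothetical `Reservation` record that tracks completed jobs; the names are illustrative, not the actual 2.1 code:

```java
/**
 * Minimal sketch of the 2.1 cancellation rule. All names here are
 * hypothetical, not the actual Accumulo code.
 */
class ReservationExpirySketch {

  record Reservation(int completedJobs, long createdMillis, long expirationMillis) {}

  /**
   * A system compaction may cancel a user compaction's file reservation only
   * when zero jobs have completed (so no work is wasted) and the reservation
   * has sat idle past its configured expiration.
   */
  static boolean canCancel(Reservation r, long nowMillis) {
    return r.completedJobs() == 0 && (nowMillis - r.createdMillis()) > r.expirationMillis();
  }
}
```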

In elasticity there is currently nothing in place for system compactions to cancel a user compaction reservation. Elasticity has a new selected files column that holds the per-tablet reserved files. This selected files column is created by the Fate operation that drives user compactions. The tablet group watcher sees it and creates compaction jobs based on it. The coordinator modifies it using conditional mutations as compaction jobs run.

We could possibly do something like the following in elasticity (a rough sketch follows the list).

  1. Add a new field to the selected files column with a count of the number of jobs that have completed. This would be updated by the coordinator when it commits compactions using conditional mutations.
  2. When the above count is zero, the tablet group watcher could queue both system and user compaction jobs. When the count is >0, only user compaction jobs could be scheduled.
  3. When the selected files column exists, the compaction coordinator will only allow system compaction jobs to start when the count is zero. If a system compaction does start, it will remove the selected files column, forcing the fate op to recreate it when the system compaction is done.
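
A rough sketch of the scheduling decision in item 2, with hypothetical types (the real tablet group watcher code is organized differently):

```java
import java.util.EnumSet;

/**
 * Rough sketch of item 2 above; types and names are hypothetical, not the
 * actual tablet group watcher code.
 */
class JobSchedulingSketch {

  enum CompactionKind { SYSTEM, USER }

  /**
   * While the selected files column exists, system jobs may only be queued
   * while its completed-job count is still zero; once any user job has
   * finished, only user jobs are scheduled so completed work is not wasted.
   */
  static EnumSet<CompactionKind> schedulableKinds(boolean selectedFilesExist, int completedJobs) {
    if (!selectedFilesExist) {
      return EnumSet.of(CompactionKind.SYSTEM);
    }
    return completedJobs == 0 ? EnumSet.of(CompactionKind.SYSTEM, CompactionKind.USER)
        : EnumSet.of(CompactionKind.USER);
  }
}
```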

However, I am not sure how to handle time in the above situation. The above is conceptually what 2.1 does, but it's implemented in a completely different way, using conditional mutations instead of in-memory data structures in a tablet server. Maybe we could drop the notion of a timeout and always let system compactions run. The reason the timeout was added was to prevent system compactions from starving user compactions in the case where a tablet always has new small files arriving.

@cshannon cshannon self-assigned this Apr 12, 2024
@cshannon
Contributor

I can start taking a look at this

@keith-turner
Contributor Author

keith-turner commented Apr 19, 2024

Below are some pointers for the code.

  1. This code selects files for user compactions, creating the selected column.
  2. This code creates compaction jobs that are queued. If we add a count to the selected files column, then this code could generate system jobs when that count is zero (and no user jobs are running) in addition to user jobs. We would not want to exclude the selected files as candidates when planning system compactions in this case, as this code currently does.
  3. This code creates a running compaction entry in the metadata table when a compaction starts. This code could allow system compactions to start if the selected files column exists and its count is zero (and no user jobs are running); in this case it would need to remove the selected files column.
  4. This code commits a compaction. It will modify the selected files column, removing the compacted files for user compactions. It will also update the tablet's files and remove the running compaction entry. This code could increment the count in the selected files column (see the sketch after this list).
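
As a simplified illustration of item 4, incrementing such a count could use Accumulo's public conditional writer API, as sketched below. The real coordinator goes through internal metadata code, and the table, row, and column names here are hypothetical:

```java
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;
import org.apache.accumulo.core.data.Value;

/**
 * Simplified sketch of item 4: incrementing a completed-job count with a
 * conditional mutation. The real coordinator uses Accumulo's internal
 * metadata code; the table, row, and column names here are hypothetical.
 */
class CommitCountSketch {

  static ConditionalWriter.Status incrementCount(AccumuloClient client, String metaTable,
      String tabletRow, int expectedCount) throws Exception {
    try (ConditionalWriter writer =
        client.createConditionalWriter(metaTable, new ConditionalWriterConfig())) {
      ConditionalMutation cm = new ConditionalMutation(tabletRow);
      // Only succeed if the count is still the value this commit read earlier;
      // any concurrent change (e.g. the column being deleted) rejects the write.
      cm.addCondition(
          new Condition("file", "selected.count").setValue(Integer.toString(expectedCount)));
      cm.put("file", "selected.count", new Value(Integer.toString(expectedCount + 1)));
      return writer.write(cm).getStatus();
    }
  }
}
```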

We could probably drop the timeout in elasticity for now and replace it with a log message indicating that a system compaction removed the selected files of a user compaction. This would allow detection of the case where a system compaction is starving a user compaction. This logging would be done by the code at number 3 above.
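
The log message itself could be as simple as the following sketch (wording and fields are hypothetical):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class StarvationLogSketch {
  private static final Logger log = LoggerFactory.getLogger(StarvationLogSketch.class);

  /** Hypothetical message emitted when a system compaction clears selected files. */
  static void logSelectedFilesRemoved(String tabletExtent, String systemJobId) {
    log.info("Cleared selected files of user compaction on tablet {} so system job {} can run;"
        + " the fate op will select files again later", tabletExtent, systemJobId);
  }
}
```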

@cshannon
Contributor

@keith-turner - I am diving into this more now, and I just remembered that I recently added a new column to prevent user compaction starvation by system compactions in #4254.

I'm wondering if we need to worry about this column? At first glance I don't think so, because it should be cleared after files are selected, so I think we just need to worry about the selected files column and clearing that to cancel a user compaction.

@cshannon
Contributor

cshannon commented Apr 19, 2024

Actually, it looks like the column is not cleared until after the compaction finishes. The original issue mentions clearing the column after selecting files, but currently it is not cleared until CleanUp runs on completion here.

So, I am wondering if we should instead clear that column when we mutate the tablet to insert selected files (not sure if we should still have CleanUp check as well), or if we should keep the current behavior where the marker isn't removed until after the compaction finishes, in which case we would also need to clear that marker as well as the selected files.

@cshannon
Contributor

I have a prototype of this in a branch, and I'm working on tweaking things and adding some tests and an IT. This may be another feature that would be suited for being tested with #4415 when merged, but a regular IT is also necessary, as there's more to test here than just the fate operations, and it's good to test end to end with everything working together. I should have at least a draft PR up this weekend.

cshannon added a commit to cshannon/accumulo that referenced this issue Apr 20, 2024
This change will allow system compactions to postpone user compactions
that have had no jobs run yet. Before this, if a user compaction was in
the queue and had selected files that overlapped, it would block system
compactions from running. Now if there are selected files, but the user
compaction is not running and hasn't had any jobs completed, the
coordinator will clear the selectedFiles column so that the system
compaction can run. The fate operation will reset the column again while
trying to make progress.

This closes apache#4454
@keith-turner
Contributor Author

keith-turner commented Apr 20, 2024

One possible way to handle time is to use the existing steady time in the manager, which is suitable for persisting and using across multiple manager processes. One potential drawback to this approach is that the TabletManagement iterator would need access to the current steady time to pass to the compaction job generator so it can make decisions. One way to handle this would be to add a steadytime parameter to this class that is used to pass information from the manager to the tablet mgmt iterator. This could be explored in its own PR. If it were done, I think a good prerequisite change would be to introduce a SteadyTime type to replace the long currently returned by the method. If SteadyTime were more widely used, a narrower type in the code would be good for correctness.

Update: Steady time could be persisted in the selected files column at creation time. Then everything making decisions based on it could use that to compute an age.
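
A minimal sketch of what that prerequisite could look like, assuming a simple wrapper type (the actual type, if introduced, may differ):

```java
import java.time.Duration;

/**
 * Minimal sketch of a SteadyTime wrapper; the actual type, if introduced,
 * may differ. Wrapping the raw long makes it a compile error to mix steady
 * time with wall-clock millis.
 */
record SteadyTime(long millis) {

  /** Age of something stamped with an earlier steady time, e.g. the selected files column. */
  Duration elapsedSince(SteadyTime earlier) {
    return Duration.ofMillis(millis - earlier.millis());
  }
}
```

Anything evaluating the selected files column could then compare the steady time persisted at creation with the manager's current steady time to decide whether the reservation has expired.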

cshannon added a commit that referenced this issue May 18, 2024
…#4480)

This change will allow system compactions to postpone user compactions
that have had no jobs run yet. Before this, if a user compaction was in
the queue and had selected files that overlapped, it would block system
compactions from running. Now if there are selected files, but the user
compaction is not running and hasn't had any jobs completed, the
coordinator will clear the selectedFiles column so that the system
compaction can run if the expiration time has passed. The fate operation 
will reset the column again while trying to make progress.

This closes #4454

Co-authored-by: Keith Turner <kturner@apache.org>
@ctubbsii ctubbsii added this to the 4.0.0 milestone Jul 12, 2024