Tablet unload impacted by long-running compaction cancellation #4485

Closed
dtspence opened this issue Apr 22, 2024 · 15 comments · Fixed by #4628
@dtspence (Contributor)

Describe the bug
A tablet unload (e.g. due to a migration request) may be delayed because the tablet cannot close while compaction cancellations are pending. We have observed tablets waiting 50+ minutes for compactions to cancel.

Versions (OS, Maven, Java, and others, as appropriate):

  • Affected version(s) of this project: 2.1.2

To Reproduce
We are attempting to gather additional information to reproduce. Some preliminary information:

  • Tablets are being compacted by t-servers (i.e. not using external compactions).
  • The data being compacted is not expected to be filtered; however, we are unsure whether the iterator might simply not return a key/value for an extended period.

Expected behavior
The migration request should complete within a much shorter time.

Screenshots
N/A

Additional context
The manager logs:

2024-04-22T16:23:58,434 [balancer.HostRegexTableLoadBalancer] WARN: Not balancing tables due to 1 outstanding migrations
@dtspence dtspence added the bug This issue has been verified to be a bug. label Apr 22, 2024
@dtspence dtspence changed the title Tablet unload may be impacted by long-running compaction cancellation Tablet unload impacted by long-running compaction cancellation Apr 22, 2024
@dlmarion (Contributor)

@dtspence - is it the case that you manually cancelled the compactions? If so, did your command complete, or did it hang too?

@dlmarion (Contributor)

The FileCompactor checks whether the compaction is still enabled for every key that it writes. I'm curious whether the compaction was making progress (you said filtering was not expected, which would otherwise be a likely cause). Is this happening often? If not, is bouncing the tserver an option?
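
A minimal sketch of the per-key cancellation check described above (illustrative plain Java, not the actual FileCompactor code). Because the flag is only polled after a key/value is produced, a source that stops returning key/values never observes the cancellation:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: the compaction write loop polls a cancellation flag once per
// emitted key/value. If next() blocks (e.g. everything is filtered out
// or reads stall), the flag is never checked and cancellation hangs.
public class PerKeyCancelCheck {

  static final AtomicBoolean compactionEnabled = new AtomicBoolean(true);

  static long writeLoop(Iterator<Map.Entry<String,String>> source) {
    long written = 0;
    while (source.hasNext()) {
      Map.Entry<String,String> kv = source.next(); // may block for a long time
      // ... write kv to the output file ...
      written++;
      if (!compactionEnabled.get()) { // cancellation is observed only here
        throw new IllegalStateException("compaction canceled after " + written + " entries");
      }
    }
    return written;
  }
}
```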

@dtspence (Contributor, Author)

@dlmarion

> is it the case that you manually cancelled the compactions? If so, did your command complete, or did it hang too?

No, the compaction eventually logs that it has been canceled. We have not taken any manual action.

> The FileCompactor checks whether the compaction is still enabled for every key that it writes. I'm curious whether the compaction was making progress (you said filtering was not expected, which would otherwise be a likely cause). Is this happening often? If not, is bouncing the tserver an option?

Yes, the issue keeps reappearing. It does not appear to be localized to a single t-server. We have been wondering whether something tablet-related correlates with the issue. At least one tablet we were looking at was a hot-spot and contained a lot of i-files from imports.

@dlmarion (Contributor) commented Apr 23, 2024

@dtspence - are you seeing a lot of these messages?

@dtspence (Contributor, Author)

@dlmarion

> are you seeing a lot of these messages?

No, we are not seeing the message above. Just for reference (it may already be known), we do see:

2024-04-23T17:58:17,298 [tserver.UnloadTabletHandler] DEBUG: Failed to unload tablet <tablet-name> ... it was already closing or closed

@dlmarion (Contributor)

I think that log message might be from the Manager continuing to tell the TabletServer to unload the tablet.

@dlmarion (Contributor)

I'm still thinking that maybe the compaction is not making progress. I don't think there is good logging for this with compactions that run in the Tablet Server. IIRC, the way to tell if it's making progress is to check out the output file for the compaction in HDFS and see if its size is increasing. If nothing is getting written to the file for a long time, then either it's filtering out a lot of data, or it's waiting on input from HDFS.
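
For reference, a hedged sketch of that progress check using only the standard Hadoop FileSystem API. The argument is whatever temporary output RFile the compaction is writing in HDFS; note that HDFS may under-report the length of a file still open for write, so watch the trend rather than a single sample:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Polls the compaction's output file in HDFS and reports whether its
// reported length is growing. Pass the tmp RFile path as the argument.
public class CompactionProgressWatcher {
  public static void main(String[] args) throws Exception {
    Path output = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration());
    long last = -1;
    while (true) {
      long len = fs.getFileStatus(output).getLen();
      System.out.println(len > last ? "growing: " + len + " bytes" : "no growth: " + len + " bytes");
      last = len;
      Thread.sleep(30_000); // sample every 30 seconds
    }
  }
}
```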

@ivakegg (Contributor) commented Apr 29, 2024

I would love to have something in place to prevent compactions from holding up the unloading of tablets. Is this something that is relatively easy to do? It would save us from long shutdowns as well.

@dlmarion (Contributor)

> Is this something that is relatively easy to do?

It's not a switch that exists today. We would need to develop and test a solution. If you can identify which compactions are causing a tablet not to close, then you could run them as External Compactions. The existing codepath does not wait for the External Compactions to complete. It only waits for them if they are in the process of committing their changes to the tablet metadata.
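
For anyone exploring that workaround, a sketch of routing a table's major compactions to an external queue in 2.1 via the client API. The service name cs1, queue name extq, table name, and client properties path are placeholders, and the property names should be verified against the 2.1 external-compaction documentation before use:

```java
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;

// Hedged sketch: point a compaction service at an external queue, then
// dispatch a table's compactions to that service. Compactors serving
// queue "extq" must be running for these compactions to execute.
public class RouteToExternalCompactions {
  public static void main(String[] args) throws Exception {
    try (AccumuloClient client =
        Accumulo.newClient().from("/path/to/accumulo-client.properties").build()) {
      client.instanceOperations().setProperty(
          "tserver.compaction.major.service.cs1.planner",
          "org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner");
      client.instanceOperations().setProperty(
          "tserver.compaction.major.service.cs1.planner.opts.executors",
          "[{\"name\":\"all\",\"type\":\"external\",\"queue\":\"extq\"}]");
      client.tableOperations().setProperty("mytable", "table.compaction.dispatcher",
          "org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher");
      client.tableOperations().setProperty("mytable",
          "table.compaction.dispatcher.opts.service", "cs1");
    }
  }
}
```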

@dtspence (Contributor, Author)

We believe the compaction is not emitting any key/values, so it never gets a chance to observe the cancellation. We also observed that the issue appears to occur when a tablet splits (due to growth).

In all (or most) of the observed instances the following occurs:

  • T-server performs a sized-based split on a tablet into A and B.
  • T-server starts compaction for A.
  • Manager begins to migrate A and asks the t-server to close A.
  • T-server attempts to unload tablet and cancel the running compaction(s).
  • T-server eventually cancels compactions and tablet is unloaded.

@dlmarion (Contributor)

If we introduced a delay in starting compactions for newly split tablets, that might help mitigate this issue. We would most likely have to make the delay time configurable, since the timing of the balancer can't really be predicted.

@dlmarion (Contributor)

> If we introduced a delay in starting compactions for newly split tablets, that might help mitigate this issue. We would most likely have to make the delay time configurable, since the timing of the balancer can't really be predicted.

In 2.1, Tablet.split() is the only caller of the TabletData constructor that sets Tablet.splitCreationTime. In a normal Tablet load scenario, splitCreationTime remains zero. We could use this variable to introduce a delay in starting compactions for newly split tablets considering there is a good chance that at least one of them will be migrated.
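
A hypothetical sketch of that gate (the delay value, property sourcing, and method name are illustrative, not actual Accumulo code):

```java
// Hypothetical sketch: suppress system compactions for a configurable
// window after the split that created this tablet. A splitCreationTime
// of zero means the tablet was loaded normally (no recent split).
public class SplitCompactionDelay {
  static final long DELAY_MS = 15 * 60 * 1000; // would come from a (new) table property

  static boolean compactionAllowed(long splitCreationTime) {
    if (splitCreationTime == 0) {
      return true; // normal tablet load, not a fresh split
    }
    return System.currentTimeMillis() - splitCreationTime >= DELAY_MS;
  }
}
```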

@keith-turner (Contributor)

Delaying the start of compactions after a split could negatively impact scans. It may be hard to calibrate the delay to avoid both scan problems and compaction-cancel problems.

We could attempt to repeatedly interrupt the compaction thread. Currently an atomic boolean is used instead of interrupting the thread because, in the past, some external code would eat interrupts. The atomic boolean is only checked when a key/value is returned. I experimented with pushing the atomic boolean check lower in the iterator stack in the past, and it had an impact on performance.

The code could take both approaches to cancellation: add new code that repeatedly interrupts the compaction thread, and keep the existing code that sets the atomic boolean.
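
An illustrative sketch of that two-pronged cancellation (hypothetical names, not the Accumulo classes): the existing atomic-boolean check remains, and a scheduled task repeatedly interrupts the compaction thread so a blocked read can also be broken out of:

```java
import java.io.InterruptedIOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: combine the existing per-key/value AtomicBoolean check with a
// scheduled task that repeatedly interrupts the compaction thread.
public class DualCancellation {
  public static void main(String[] args) throws Exception {
    AtomicBoolean enabled = new AtomicBoolean(true);
    Thread compactionThread = new Thread(() -> {
      try {
        while (enabled.get()) { // existing boolean check, seen once per key/value
          doBlockingWork();     // may block; only the interrupt can reach it here
        }
      } catch (InterruptedIOException e) {
        System.out.println("compaction canceled via interrupt");
      }
    });
    compactionThread.start();

    ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
    enabled.set(false); // cancellation requested
    ses.scheduleAtFixedRate(compactionThread::interrupt, 0, 3, TimeUnit.SECONDS);

    compactionThread.join();
    ses.shutdownNow();
  }

  static void doBlockingWork() throws InterruptedIOException {
    try {
      Thread.sleep(60_000); // stands in for a read that returns no key/values
    } catch (InterruptedException e) {
      throw new InterruptedIOException("interrupted while reading");
    }
  }
}
```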

@dlmarion (Contributor)

> Delaying the start of compactions after a split could negatively impact scans. It may be hard to calibrate the delay to avoid both scan problems and compaction-cancel problems.

If we made it configurable at the table level, then the user could decide whether to enable it and how long the delay should be. We could also potentially ignore the timeout if the tablet is over the scan max files.

@keith-turner (Contributor)

> If we made it configurable at the table level, then the user could decide whether to enable it and how long the delay should be. We could also potentially ignore the timeout if the tablet is over the scan max files.

The code would need to emit enough information through logging to know whether the property value needs adjustment. The property could be dropped in elasticity, as it would no longer be needed there.

The property only addresses one potential cause of stuck compactions causing problems. A stuck compaction started by other means could cause the same trouble: for example, a user-initiated compaction that is not returning data and is preventing tablet unload, or a tablet that did not recently split but is stuck in a system compaction that prevents unload.

dlmarion added a commit to dlmarion/accumulo that referenced this issue May 31, 2024
This changes HeapIterator.next and MultiIterator.seek to throw
an InterruptedIOException if the current Thread is interrupted, and
modifies CompactableImpl.compact so that it creates a scheduled
task that periodically checks whether the majc should still be
running and, if not, interrupts the thread.

Fixes apache#4485
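The interrupt check that commit describes might look roughly like this inside an iterator method that already declares IOException (illustrative only, not the actual Accumulo source):

```java
import java.io.IOException;
import java.io.InterruptedIOException;

// Sketch of the check added to HeapIterator.next / MultiIterator.seek:
// bail out with InterruptedIOException when the thread was interrupted.
public class InterruptCheck {
  static void checkInterrupted() throws IOException {
    // isInterrupted() leaves the interrupt status set for upstream callers
    if (Thread.currentThread().isInterrupted()) {
      throw new InterruptedIOException("compaction thread interrupted");
    }
  }

  // e.g. called at the top of a next() implementation that declares IOException
  void next() throws IOException {
    checkInterrupted();
    // ... advance the underlying iterators ...
  }
}
```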
@dlmarion dlmarion linked a pull request May 31, 2024 that will close this issue
dlmarion added a commit to dlmarion/accumulo that referenced this issue Jun 3, 2024
Added an interrupt method on FileCompactor that is invoked from
CompactableUtils when the majc env is no longer enabled.
FileCompactor.interrupt sets an AtomicBoolean to true, and
the reference to the AtomicBoolean is set on the RFiles. Only
internal compactions use this interrupt behavior, as external
compactions already interrupt the compaction thread when
compaction cancellation is requested.

Fixes apache#4485
@dlmarion dlmarion linked a pull request Jun 5, 2024 that will close this issue
@dlmarion dlmarion self-assigned this Jun 5, 2024
@ctubbsii ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024