I saw a deadlock while running continuous ingest to test 1.9.2 RC1. I was looking into why a long hold time was happening. Luckily, I got this stack trace before the agitator whacked the tserver.
What is going on is that two tablets both try to close around the same time. Each tablet minor compacts with its own tablet lock held (which only happens at close). At the end of the minor compaction, each tablet runs a check to see which WALs are referenced. This check attempts to get every tablet's lock, so each closing thread ends up waiting on a lock that the other one holds. Locking needs to be avoided in this check.
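To illustrate why the check must not take tablet locks, here is a minimal sketch of the pattern. It is not the Accumulo code and not necessarily how the fix should be done: `SketchTablet`, `closeAndFindUnreferenced`, and the latch are hypothetical names invented for this example, and keeping the in-use WALs in a concurrent set is just one assumed way to make the check lock-free. Each "tablet" holds its own monitor while it asks every tablet which WALs are still in use. If `removeInUseLogs` were `synchronized`, the two threads below would block on each other's monitor in the same way the stack traces show.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class WalCheckSketch {

  // Hypothetical stand-in for a tablet; NOT the real Accumulo Tablet class.
  static class SketchTablet {
    // Assumption: in-use WAL names live in a concurrent set, so readers
    // never need this tablet's monitor.
    private final Set<String> inUseLogs = ConcurrentHashMap.newKeySet();

    SketchTablet(String... wals) {
      inUseLogs.addAll(List.of(wals));
    }

    // Deliberately not synchronized. Marking this method synchronized
    // would recreate the lock cycle when two tablets close together.
    void removeInUseLogs(Set<String> candidates) {
      candidates.removeAll(inUseLogs);
    }

    // Models close: runs with this tablet's monitor held and, at the end,
    // asks every tablet which WALs are still referenced.
    synchronized Set<String> closeAndFindUnreferenced(
        List<SketchTablet> all, Set<String> allWals, CountDownLatch bothLocked)
        throws InterruptedException {
      // Wait until both closing threads hold their own tablet monitor,
      // which forces the interleaving that matters.
      bothLocked.countDown();
      bothLocked.await();
      Set<String> candidates = new HashSet<>(allWals);
      for (SketchTablet t : all) {
        t.removeInUseLogs(candidates);
      }
      return candidates;
    }
  }

  // Closes two tablets concurrently and returns the unreferenced WALs.
  static Set<String> runConcurrentClose() throws Exception {
    SketchTablet a = new SketchTablet("wal1");
    SketchTablet b = new SketchTablet("wal2");
    List<SketchTablet> all = List.of(a, b);
    Set<String> allWals = Set.of("wal1", "wal2", "wal3");
    CountDownLatch bothLocked = new CountDownLatch(2);

    Callable<Set<String>> closeA = () -> a.closeAndFindUnreferenced(all, allWals, bothLocked);
    Callable<Set<String>> closeB = () -> b.closeAndFindUnreferenced(all, allWals, bothLocked);

    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      Future<Set<String>> fa = pool.submit(closeA);
      Future<Set<String>> fb = pool.submit(closeB);
      // The timeout turns a deadlock into a TimeoutException.
      Set<String> resultA = fa.get(5, TimeUnit.SECONDS);
      fb.get(5, TimeUnit.SECONDS);
      return resultA;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runConcurrentClose());
  }
}
```

In this sketch both closes complete because the check only reads the concurrent set. With a `synchronized` `removeInUseLogs`, the `get` calls would time out instead.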
I am not 100% sure, but this issue may also occur in 1.9.1, because of the locking that was added in 84791ec for the removeInUseLogs() method. It may be that the only continuous ingest test done for 1.9.1 was run with agitation. This is why it's important to test both with and without agitation: agitation hides bugs like this unless someone is closely watching the test. I got lucky when I found this.
Java stack information for the threads listed above:
===================================================
"Minor compacting !0;~<":
at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
- waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
at org.apache.accumulo.tserver.tablet.Tablet.minorCompactNow(Tablet.java:1047)
at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2388)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:64)
at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:748)
"Minor compacting 2;42005;41804":
at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
- waiting to lock <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
- locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
at org.apache.accumulo.tserver.tablet.Tablet.split(Tablet.java:2291)
- locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2109)
at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2089)
at org.apache.accumulo.tserver.TabletServer.access$2300(TabletServer.java:271)
at org.apache.accumulo.tserver.TabletServer$SplitRunner.run(TabletServer.java:1978)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:748)
"Minor compacting 2;69c0cc;6980bb":
at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
- waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
- locked <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
at org.apache.accumulo.tserver.tablet.Tablet.close(Tablet.java:1318)
at org.apache.accumulo.tserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:2206)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:748)
Found 1 deadlock.