Saw deadlock when two tablets tried to close around the same time. #558

Closed
keith-turner opened this issue Jul 12, 2018 · 0 comments

Saw a deadlock while running continuous ingest to test 1.9.2 RC1. I was looking into why a long hold time was happening. Luckily I got this stack trace before the agitator whacked the tserver.

What is going on is that two tablets both try to close around the same time. Each tablet minor compacts with its tablet lock held (which only happens at close). At the end of the minor compaction, each tablet runs a check to see which WALs are referenced. This check attempts to get every tablet's lock, so each closing thread ends up waiting for the lock the other one holds. Locking needs to be avoided in this check.
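
One way to picture "avoid locking in this check" is sketched below. This is only an illustration under assumptions, not the actual change made in Accumulo: `WalCheckSketch`, `SketchTablet`, and `findUnreferencedWals` are hypothetical names, and WALs are represented as plain strings. Each tablet publishes an immutable snapshot of its in-use logs through a `volatile` field, so the check reads the snapshot without taking the tablet's monitor. Whether reading a possibly stale snapshot is safe depends on invariants in the real code that are not shown here.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; these are not Accumulo's Tablet or TabletServer classes.
final class WalCheckSketch {

  static final class SketchTablet {
    // Immutable snapshot of the WALs this tablet references, swapped atomically.
    private volatile Set<String> inUseLogs = Collections.emptySet();

    // Writers publish a new snapshot while holding the tablet lock.
    synchronized void setInUseLogs(Set<String> logs) {
      inUseLogs = Collections.unmodifiableSet(new HashSet<>(logs));
    }

    // Readers take no lock, so a thread that holds one tablet's monitor
    // never waits on another tablet's monitor here.
    void removeInUseLogs(Set<String> candidates) {
      candidates.removeAll(inUseLogs);
    }
  }

  // Returns the candidate WALs that no tablet references.
  static Set<String> findUnreferencedWals(Set<String> allWals, List<SketchTablet> tablets) {
    Set<String> candidates = new HashSet<>(allWals);
    for (SketchTablet t : tablets) {
      t.removeInUseLogs(candidates);
    }
    return candidates;
  }

  public static void main(String[] args) throws InterruptedException {
    SketchTablet a = new SketchTablet();
    SketchTablet b = new SketchTablet();
    a.setInUseLogs(Set.of("wal1"));
    b.setInUseLogs(Set.of("wal2"));
    List<SketchTablet> tablets = List.of(a, b);
    Set<String> all = Set.of("wal1", "wal2", "wal3");

    // Each thread holds its own tablet's monitor while running the check,
    // mirroring the close path in the stack trace.
    Thread t1 = new Thread(() -> { synchronized (a) { findUnreferencedWals(all, tablets); } });
    Thread t2 = new Thread(() -> { synchronized (b) { findUnreferencedWals(all, tablets); } });
    t1.start();
    t2.start();
    t1.join();
    t2.join();

    System.out.println(findUnreferencedWals(all, tablets)); // prints [wal3]
  }
}
```

If `removeInUseLogs` were `synchronized` instead, the two threads in `main` could block forever, each holding its own tablet's monitor and waiting for the other's.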

I am not 100% sure, but this issue may occur in 1.9.1 with locking that was added in 84791ec for the removeInUseLogs() method. It may be that the only continuous ingest test that was done for 1.9.1 was with agitation. This is why it's important to test with and without agitation, because agitation hides bugs like this unless someone is closely watching the test. I got lucky when I found this.

Java stack information for the threads listed above:
===================================================
"Minor compacting !0;~<":
        at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
        - waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
        at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
        at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
        at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
        at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
        at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
        at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
        at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
        at org.apache.accumulo.tserver.tablet.Tablet.minorCompactNow(Tablet.java:1047)
        at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2388)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:64)
        at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:748)
"Minor compacting 2;42005;41804":
        at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
        - waiting to lock <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
        at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
        at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
        at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
        at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
        at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
        at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
        at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
        at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
        - locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
        at org.apache.accumulo.tserver.tablet.Tablet.split(Tablet.java:2291)
        - locked <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
        at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2109)
        at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:2089)
        at org.apache.accumulo.tserver.TabletServer.access$2300(TabletServer.java:271)
        at org.apache.accumulo.tserver.TabletServer$SplitRunner.run(TabletServer.java:1978)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:748)
"Minor compacting 2;69c0cc;6980bb":
        at org.apache.accumulo.tserver.tablet.Tablet.removeInUseLogs(Tablet.java:2462)
        - waiting to lock <0x000000079ca85988> (a org.apache.accumulo.tserver.tablet.Tablet)
        at org.apache.accumulo.tserver.TabletServer$12.removeInUse(TabletServer.java:3413)
        at org.apache.accumulo.tserver.TabletServer.findOldestUnreferencedWals(TabletServer.java:3370)
        at org.apache.accumulo.tserver.TabletServer.markUnusedWALs(TabletServer.java:3421)
        at org.apache.accumulo.tserver.TabletServer.minorCompactionFinished(TabletServer.java:3245)
        at org.apache.accumulo.tserver.tablet.DatafileManager.bringMinorCompactionOnline(DatafileManager.java:459)
        at org.apache.accumulo.tserver.tablet.Tablet.minorCompact(Tablet.java:914)
        at org.apache.accumulo.tserver.tablet.MinorCompactionTask.run(MinorCompactionTask.java:90)
        at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1428)
        - locked <0x0000000794a37d20> (a org.apache.accumulo.tserver.tablet.Tablet)
        at org.apache.accumulo.tserver.tablet.Tablet.close(Tablet.java:1318)
        at org.apache.accumulo.tserver.TabletServer$UnloadTabletHandler.run(TabletServer.java:2206)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
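
The trace above was captured by hand before the agitator killed the process. As a rough sketch of how a test could notice a monitor deadlock like this without someone watching, the JDK's `ThreadMXBean` can be polled. The class and method names here (`DeadlockWatchSketch`, `describeDeadlocks`) are hypothetical, and nothing like this is claimed to exist in Accumulo or its test suite.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; a sketch of in-process deadlock detection.
final class DeadlockWatchSketch {

  // Returns a description of deadlocked threads, or an empty string if none are found.
  static String describeDeadlocks() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is detected
    if (ids == null) {
      return "";
    }
    StringBuilder sb = new StringBuilder();
    // Request locked monitors and synchronizers so the report shows who holds what.
    for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
      if (info != null) { // a thread may have exited between the two calls
        sb.append(info);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String report = describeDeadlocks();
    System.out.println(report.isEmpty() ? "no deadlock detected" : report);
  }
}
```

A test harness could call `describeDeadlocks()` periodically and fail the run when the result is non-empty. Note that `ThreadInfo.toString()` prints only a limited number of stack frames, so a full `jstack` dump is still the better artifact to attach to a report.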
keith-turner added the v2.0.0, blocker, and bug labels Jul 12, 2018
keith-turner self-assigned this Jul 12, 2018
keith-turner added a commit to keith-turner/accumulo that referenced this issue Jul 13, 2018
ctubbsii added this to the 1.9.2 milestone Jul 12, 2024