lastCompactID is inconsistent with metadata Exception #2667
@keith-turner Any thoughts on this consistency check? I am not sure if there was a problem or not.
I'll take a look at it.
I think I have found one possible cause of this. Before doing the consistency check on close, the following code will wait for there to be zero non-external compactions: accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java (Lines 1504 to 1505 in b3ac225).
The following code is called when a compaction completes, and it removes itself from that set: accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java (Lines 1220 to 1233 in b3ac225).
The relevant call is in accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableUtils.java (Lines 439 to 444 in b3ac225).
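Pieced together from those snippets, the wait/notify interaction on close looks roughly like the following minimal sketch (the class and member names are assumptions for illustration, not the actual Accumulo code):

```java
import java.util.HashSet;
import java.util.Set;

class CompactionWaitSketch {
  // ids of non-external compactions currently running on the tablet
  private final Set<Long> runningCompactions = new HashSet<>();

  // Close side: block until no non-external compactions remain; the caller
  // then runs the close consistency check.
  synchronized void waitForZeroRunning() throws InterruptedException {
    while (!runningCompactions.isEmpty()) {
      wait();
    }
  }

  // Completion side: a finishing compaction removes itself from the running
  // set and wakes any thread blocked in waitForZeroRunning().
  synchronized void compactionCompleted(long compactionId) {
    runningCompactions.remove(compactionId);
    notifyAll();
  }
}
```

If the metadata table update and the in-memory bookkeeping around compaction completion are not covered by the same lock as this close-side wait, a closing thread can run its consistency check between the two updates.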
Looking at the error message, the values are consistent with the metadata table being updated first, followed by the lastCompactID instance variable. I think the code with this race condition only executes when an empty tablet is forced to compact, so all it does is write the compact id to the metadata table to signify it's done without any file changes. There was a forced compaction in the manager logs a bit before this split that closed the tablet. I am going to look into this some more and see if I can figure out if there is a problem with reordering the code to avoid the race condition.
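A hedged sketch of that suspected ordering (the method and field names here are hypothetical, not the actual Accumulo code):

```java
class EmptyTabletCompactionSketch {
  private volatile long lastCompactID;

  void commitEmptyCompaction(long compactId) {
    // 1. The compact id is persisted to the metadata table first...
    writeCompactIdToMetadata(compactId); // hypothetical helper

    // <-- A tablet close that runs its consistency check in this window sees
    //     the two values mid-update (metadata and in-memory disagree) and
    //     throws "lastCompactID is inconsistent with metadata".

    // 2. ...and only afterwards is the in-memory copy updated. Reordering
    //    these steps, or doing both under one lock, would close the window.
    lastCompactID = compactId;
  }

  void writeCompactIdToMetadata(long compactId) {
    // stand-in for the metadata table update
  }
}
```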
@keith-turner Did you make any changes for this race condition bug?
It seems like recent changes (possibly the refactor in #2574) have made this less frequent, but I just saw this happen again. I was running 4 Bulk RWalk jobs again.

```
2022-09-26T09:45:25,362 [util.FileUtil] DEBUG: Found first and last keys for 0 map files in 0.00 secs
2022-09-26T09:45:25,363 [tablet.Tablet] DEBUG: Tablet 2;r143fe< had no dir, creating hdfs://localhost:8020/accumulo/tables/2/t-00000rs
2022-09-26T09:45:25,369 [tablet.Tablet] ERROR: Closed tablet 2;r143fe< lastCompactID is inconsistent with metadata : 115 != 112
2022-09-26T09:45:25,370 [tablet.Tablet] ERROR: Failed to do close consistency check for tablet 2;r143fe<
java.lang.RuntimeException: Closed tablet 2;r143fe< lastCompactID is inconsistent with metadata : 115 != 112
	at org.apache.accumulo.tserver.tablet.Tablet.lambda$closeConsistencyCheck$4(Tablet.java:1090) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at java.util.OptionalLong.ifPresent(OptionalLong.java:163) ~[?:?]
	at org.apache.accumulo.tserver.tablet.Tablet.closeConsistencyCheck(Tablet.java:1085) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1003) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.tablet.Tablet.split(Tablet.java:1465) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:489) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.TabletClientHandler.splitTablet(TabletClientHandler.java:1004) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at jdk.internal.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.accumulo.core.trace.TraceUtil.lambda$wrapService$0(TraceUtil.java:206) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at com.sun.proxy.$Proxy38.splitTablet(Unknown Source) ~[?:?]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$splitTablet.getResult(TabletClientService.java:2648) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$splitTablet.getResult(TabletClientService.java:2627) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:54) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.accumulo.server.rpc.CustomNonBlockingServer$CustomFrameBuffer.invoke(CustomNonBlockingServer.java:129) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
```
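For context, the OptionalLong.ifPresent frame in the trace suggests the failing check is shaped roughly like this sketch (helper names, parameters, and the order of the two values in the message are assumptions, not the actual Tablet code):

```java
import java.util.OptionalLong;

class CloseCheckSketch {
  // Rough reconstruction of the close consistency check, inferred from the
  // stack trace above; everything named here is a guess for illustration.
  static void closeConsistencyCheck(String extent, OptionalLong lastCompactID,
      long compactIdFromMetadata) {
    lastCompactID.ifPresent(id -> {
      if (id != compactIdFromMetadata) {
        throw new RuntimeException("Closed tablet " + extent
            + " lastCompactID is inconsistent with metadata : " + id + " != "
            + compactIdFromMetadata);
      }
    });
  }
}
```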
Any advice if I wanted to try to reproduce this? How many tservers were you running? You had 4 random walk processes each running the bulk graph? Was everything running on a single VM or multiple VMs?
I was able to produce this on a single EC2 Ubuntu instance, running Uno. I installed a snapshot build and set up the following:

```
for (( i=1; i<5; i++ )); do
  ~/workspace/accumulo-testing/bin/rwalk Bulk.xml > /tmp/rwalk"$i".log 2>&1 &
done
```

FYI, I found a bug in the Bulk RW test, so make sure you get the fix from accumulo-testing first.
@keith-turner I also dropped the verify and END nodes from Bulk, just so the test would keep running.
I started looking into this and ran the bulk import RW test. I saw the following error, which is a different problem. I didn't keep the RW test logs, so I can't dig into it much. Going to keep running the RW bulk test and keeping the test logs.
I wonder if this error occurred when a table was compacting and splitting at the same time and the split failed due to the bug I fixed in #2977.
I will look into that. I have been trying to recreate your scenario with a slight modification to the RW test graph. Instead of never verifying and therefore never ending, I changed my graph locally to start over after verify. So I am doing the following locally.
I have been bumping into a few little issues with the test when trying to verify, so I have not had a super long run yet. But in the runs so far, I don't think I saw any problems. I will take a look at the split thing you mentioned. I also found I needed to add nohup to your little script so that the processes would not terminate.
I think I am close to getting a test going that could run for days on end. I think this is a good test to run even if this issue may be fixed by the change you mentioned. @milleruntime when you did see the error on the tserver, did it cause the RW test to hang or error?
@milleruntime when you did see this error, how long did you run the test before seeing it? How many times did you see it?
I was able to reproduce this yesterday. It seems like the same situation that was speculated as a cause in an earlier comment: a compaction of an empty tablet followed by a close. Saw the following in the tserver logs.
Saw the following in the manager logs.
I added some logging to the tserver that will help verify the hypothesis and kicked the test off again. Didn't see the problem again, but did see #3006. Also noticed some of the bulk import random walk tests are stuck this morning; I have not had a chance to track that down. Going to keep trying to reproduce the problem with improved logging.
Interesting. I am going to try to write a test to reproduce it. |
This commit fixes a bug where:

- user compactions take multiple compaction steps (because the tablet has many files)
- the intermediate steps produce no output

When the above happened, CompactableImpl and Tablet would disagree about what files there were. The Tablet code would ignore the empty file produced by an intermediate compaction. CompactableImpl would expect Tablet to know of this file. When this happened, things would hang until a tserver was restarted. Ran into this bug while continually running the Bulk random walk test to reproduce #2667.
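A toy illustration of the disagreement the commit message describes, with made-up names (not the actual Accumulo bookkeeping):

```java
import java.util.HashSet;
import java.util.Set;

class FileViewMismatchSketch {
  private final Set<String> tabletFiles = new HashSet<>();   // Tablet's view
  private final Set<String> expectedFiles = new HashSet<>(); // CompactableImpl's view

  void intermediateStepFinished(String outputFile, long entriesWritten) {
    expectedFiles.add(outputFile); // CompactableImpl expects the output to appear
    if (entriesWritten > 0) {
      tabletFiles.add(outputFile); // but Tablet ignores an empty output file
    }
    // When entriesWritten == 0, the two views never reconcile and the user
    // compaction waits forever, until the tserver is restarted.
  }
}
```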
I am continuing to run the bulk import test with the fixes for #3013 and #3014 applied. I have not seen the error with the inconsistent id again; I suspect I know what the problem is but want to confirm it with the additional logging I added. On the run last night, all of the tservers died with out-of-memory problems. I looked at the Uno config and the tserver memory was 768m; I upped this to 4G and restarted the test. I have never seen the out-of-memory problem when running the test before; maybe fixing #3013 and #3014 allowed the test to run longer and that caused it.
I opened #3028, which I think may fix this issue. I saw the problem happen multiple times while running the bulk RW test. With additional logging added, I confirmed the problem happened under the following conditions, all happening to the same tablet at around the same time.
Steps to reproduce:
```
~/workspace/accumulo-testing$ ./bin/rwalk Bulk.xml > /tmp/log4 2>&1 &
```
From the tserver log:
Manager log: