lastCompactID is inconsistent with metadata Exception #2667
@keith-turner Any thoughts on this consistency check? I am not sure if there was a problem or not.
I'll take a look at it.
I think I have found one possible cause of this. Before doing the consistency check on close, the following code will wait for there to be zero non-external compactions: accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java (Lines 1504 to 1505 in b3ac225).
The following code is called when a compaction completes, and it removes itself from that set: accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableImpl.java (Lines 1220 to 1233 in b3ac225).
The relevant call is in accumulo/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactableUtils.java (Lines 439 to 444 in b3ac225).
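Pieced together from those snippets, the wait/notify interaction on close looks roughly like the following minimal sketch (the class and member names are assumptions for illustration, not the actual Accumulo code):

```java
import java.util.HashSet;
import java.util.Set;

class CompactionWaitSketch {
  // ids of non-external compactions currently running on the tablet
  private final Set<Long> runningCompactions = new HashSet<>();

  // Close side: block until no non-external compactions remain; the caller
  // then runs the close consistency check.
  synchronized void waitForZeroRunning() throws InterruptedException {
    while (!runningCompactions.isEmpty()) {
      wait();
    }
  }

  // Completion side: a finishing compaction removes itself from the running
  // set and wakes any thread blocked in waitForZeroRunning().
  synchronized void compactionCompleted(long compactionId) {
    runningCompactions.remove(compactionId);
    notifyAll();
  }
}
```

If the metadata table update and the in-memory bookkeeping around compaction completion are not covered by the same lock as this close-side wait, a closing thread can run its consistency check between the two updates.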
Looking at the error message, the values are consistent with the metadata table being updated first, followed by the lastCompactID instance variable. I think the code with this race condition only executes when an empty tablet is forced to compact, so all it does is write the compact id to the metadata table to signify it's done without any file changes. There was a forced compaction in the manager logs a bit before this split that closed the tablet. I am going to look into this some more and see if I can figure out if there is a problem with reordering the code to avoid the race condition.
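A hedged sketch of that suspected ordering (the method and field names here are hypothetical, not the actual Accumulo code):

```java
class EmptyTabletCompactionSketch {
  private volatile long lastCompactID;

  void commitEmptyCompaction(long compactId) {
    // 1. The compact id is persisted to the metadata table first...
    writeCompactIdToMetadata(compactId); // hypothetical helper

    // <-- A tablet close that runs its consistency check in this window sees
    //     the two values mid-update (metadata and in-memory disagree) and
    //     throws "lastCompactID is inconsistent with metadata".

    // 2. ...and only afterwards is the in-memory copy updated. Reordering
    //    these steps, or doing both under one lock, would close the window.
    lastCompactID = compactId;
  }

  void writeCompactIdToMetadata(long compactId) {
    // stand-in for the metadata table update
  }
}
```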
@keith-turner Did you make any changes for this race condition bug?
It seems like recent changes (possibly the refactor in #2574) have made this less frequent, but I just saw this happen again. I was running 4 Bulk RWalk jobs again.

```
2022-09-26T09:45:25,362 [util.FileUtil] DEBUG: Found first and last keys for 0 map files in 0.00 secs
2022-09-26T09:45:25,363 [tablet.Tablet] DEBUG: Tablet 2;r143fe< had no dir, creating hdfs://localhost:8020/accumulo/tables/2/t-00000rs
2022-09-26T09:45:25,369 [tablet.Tablet] ERROR: Closed tablet 2;r143fe< lastCompactID is inconsistent with metadata : 115 != 112
2022-09-26T09:45:25,370 [tablet.Tablet] ERROR: Failed to do close consistency check for tablet 2;r143fe<
java.lang.RuntimeException: Closed tablet 2;r143fe< lastCompactID is inconsistent with metadata : 115 != 112
	at org.apache.accumulo.tserver.tablet.Tablet.lambda$closeConsistencyCheck$4(Tablet.java:1090) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at java.util.OptionalLong.ifPresent(OptionalLong.java:163) ~[?:?]
	at org.apache.accumulo.tserver.tablet.Tablet.closeConsistencyCheck(Tablet.java:1085) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.tablet.Tablet.completeClose(Tablet.java:1003) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.tablet.Tablet.split(Tablet.java:1465) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.TabletServer.splitTablet(TabletServer.java:489) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.tserver.TabletClientHandler.splitTablet(TabletClientHandler.java:1004) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at jdk.internal.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.accumulo.core.trace.TraceUtil.lambda$wrapService$0(TraceUtil.java:206) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at com.sun.proxy.$Proxy38.splitTablet(Unknown Source) ~[?:?]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$splitTablet.getResult(TabletClientService.java:2648) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$splitTablet.getResult(TabletClientService.java:2627) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:54) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.accumulo.server.rpc.CustomNonBlockingServer$CustomFrameBuffer.invoke(CustomNonBlockingServer.java:129) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.16.0.jar:0.16.0]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
```
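For context, the OptionalLong.ifPresent frame in the trace suggests the failing check is shaped roughly like this sketch (helper names, parameters, and the order of the two values in the message are assumptions, not the actual Tablet code):

```java
import java.util.OptionalLong;

class CloseCheckSketch {
  // Rough reconstruction of the close consistency check, inferred from the
  // stack trace above; everything named here is a guess for illustration.
  static void closeConsistencyCheck(String extent, OptionalLong lastCompactID,
      long compactIdFromMetadata) {
    lastCompactID.ifPresent(id -> {
      if (id != compactIdFromMetadata) {
        throw new RuntimeException("Closed tablet " + extent
            + " lastCompactID is inconsistent with metadata : " + id + " != "
            + compactIdFromMetadata);
      }
    });
  }
}
```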
Any advice if I wanted to try to reproduce this? How many tservers were you running? You had 4 random walk processes each running the bulk graph? Was everything running on a single VM or multiple VMs?
I was able to produce this on a single EC2 Ubuntu instance, running Uno. I installed a snapshot build and set up the following:

```
for (( i=1; i<5; i++ )); do
  ~/workspace/accumulo-testing/bin/rwalk Bulk.xml > /tmp/rwalk"$i".log 2>&1 &
done
```

FYI, I found a bug in the Bulk RW test, so make sure you get the fix from accumulo-testing first.
@keith-turner I also dropped the verify and END nodes from Bulk, just so the test would keep running.
I started looking into this and ran the bulk import RW test. I saw the following error, which is a different problem. I didn't keep the RW test logs, so I can't dig into it much. Going to keep running the RW bulk test and keeping the test logs.
I wonder if this error occurred when a table was compacting and splitting at the same time and the split failed due to the bug I fixed in #2977.
I will look into that. I have been trying to recreate your scenario with a slight modification to the RW test graph. Instead of never verifying and therefore never ending, I changed my graph locally to start over after verify. So I am doing the following locally.
I have been bumping into a few little issues with the test when trying to verify, so I have not had a super long run yet. But in the runs so far, I don't think I saw any problems. I will take a look at the split thing you mentioned. I also found I needed to add nohup to your little script so that the processes would not terminate.
I think I am close to getting a test going that could run for days on end. I think this is a good test to run even if this issue may be fixed by the change you mentioned. @milleruntime when you did see the error on the tserver, did it cause the RW test to hang or error?
@milleruntime when you did see this error, how long did you run the test before seeing it? How many times did you see it?
I was able to reproduce this yesterday. It seems like the same situation that was speculated as a cause in an earlier comment: a compaction of an empty tablet followed by a close. Saw the following in the tserver logs.
Saw the following in the manager logs.
I added some logging to the tserver that will help verify the hypothesis and kicked the test off again. Didn't see the problem again, but did see #3006. Also noticed some of the bulk import random walk tests are stuck this morning; I have not had a chance to track that down. Going to keep trying to reproduce the problem with improved logging.
Interesting. I am going to try to write a test to reproduce it. |
This commit fixes a bug where:

- user compactions take multiple compaction steps (because the tablet has many files)
- the intermediate steps produce no output

When the above happened, CompactableImpl and Tablet would disagree about what files there were. The Tablet code would ignore the empty file produced by an intermediate compaction. CompactableImpl would expect Tablet to know of this file. When this happened, things would hang until a tserver was restarted. Ran into this bug while continually running the Bulk random walk test to reproduce #2667.
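A toy illustration of the disagreement the commit message describes, with made-up names (not the actual Accumulo bookkeeping):

```java
import java.util.HashSet;
import java.util.Set;

class FileViewMismatchSketch {
  private final Set<String> tabletFiles = new HashSet<>();   // Tablet's view
  private final Set<String> expectedFiles = new HashSet<>(); // CompactableImpl's view

  void intermediateStepFinished(String outputFile, long entriesWritten) {
    expectedFiles.add(outputFile); // CompactableImpl expects the output to appear
    if (entriesWritten > 0) {
      tabletFiles.add(outputFile); // but Tablet ignores an empty output file
    }
    // When entriesWritten == 0, the two views never reconcile and the user
    // compaction waits forever, until the tserver is restarted.
  }
}
```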
I am continuing to run the bulk import test with the fixes for #3013 and #3014 applied. I have not seen the error with the inconsistent id again; I suspect I know what the problem is but want to confirm it with the additional logging I added. On the run last night, all of the tservers died with out-of-memory problems. I looked at the Uno config and the tserver memory was 768m; I upped this to 4G and restarted the test. I have never seen the out-of-memory problem when running the test before; maybe fixing #3013 and #3014 allowed the test to run longer and that caused it.
I opened #3028, which I think may fix this issue. I saw the problem happen multiple times while running the bulk RW test. With additional logging added, I confirmed the problem happened under the following conditions, all happening to the same tablet at around the same time.
Steps to reproduce:
```
~/workspace/accumulo-testing$ ./bin/rwalk Bulk.xml > /tmp/log4 2>&1 &
```
From the tserver log:
Manager log: