Bunch of threads throwing IllegalStateException #2961

Closed
milleruntime opened this issue Sep 26, 2022 · 6 comments · Fixed by #2969
Labels
blocker This issue blocks any release version labeled on it. bug This issue has been verified to be a bug.
Comments

@milleruntime
Contributor

I was running Bulk Rwalk test on a Uno cluster with 2 tservers. I killed one of the tservers (with kill -9) and then restarted it (with accumulo-cluster start-tservers). A little while later 50+ of these errors flooded the Monitor.

java.lang.RuntimeException: java.lang.IllegalStateException: Unexpected family chopped
	at org.apache.accumulo.tserver.AssignmentHandler.run(AssignmentHandler.java:141)
	at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:63)
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalStateException: Unexpected family chopped
	at org.apache.accumulo.core.metadata.schema.TabletMetadata.convertRow(TabletMetadata.java:403)
	at org.apache.accumulo.core.metadata.schema.TabletsMetadata$Builder.lambda$buildNonRoot$8(TabletsMetadata.java:216)
	at com.google.common.collect.Iterators$6.transform(Iterators.java:829)
	at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:52)
	at com.google.common.collect.Iterators$5.computeNext(Iterators.java:672)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:132)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at org.apache.accumulo.core.metadata.schema.AmpleImpl.readTablet(AmpleImpl.java:47)
	at org.apache.accumulo.core.metadata.schema.Ample.readTablet(Ample.java:144)
	at org.apache.accumulo.tserver.AssignmentHandler.run(AssignmentHandler.java:109)
	... 6 more

I think these errors are related to a Chop compaction that was taking place when the tserver was killed.

2022-09-26T10:36:12,167 [manager.Manager] INFO : Asking localhost:10000[100007556480005] to chop a;r0265c;r01179
tserver2_ip-10-113-14-231.log:2022-09-26T10:36:24,770 [tablet.files] DEBUG: Compacting a;r0265c;r01179 on i.default.small for CHOP from [I00002vg.rf, I00002it.rf, I00002h5.rf, C00001an.rf, I00002xr.rf, I0000303.rf, I00002zo.rf, I00002vh.rf, I00002hd.rf, I00002ob.rf] size 36 KB
tserver2_ip-10-113-14-231.log:2022-09-26T10:36:31,369 [tablet.files] DEBUG: Compacted a;r0265c;r01179 for CHOP created hdfs://localhost:8020/accumulo/tables/a/t-00002pq/C000033u.rf from [I00002oe.rf, I00002nv.rf, I00002nw.rf, I00002no.rf, I00002oh.rf, I00002nb.rf, I00002nj.rf, I00002oc.rf, I00002m2.rf]
...
2022-09-26T10:36:36,121 [manager.Manager] ERROR: unable to get tablet server status localhost:10000[100007556480005]
2022-09-26T10:37:03,458 [manager.Manager] DEBUG: 85 assigned to dead servers: ...  a;r0265c;r01179
2022-09-26T10:37:03,468 [tablet.location] DEBUG: Suspended a;r0265c;r01179 to localhost:10000 at 3696264 ms with 1 walogs
...
2022-09-26T10:37:37,537 [manager.Manager] INFO : New servers: [localhost:10000[10000755648001d]]
2022-09-26T10:37:37,539 [manager.EventCoordinator] INFO : There are now 2 tablet servers
2022-09-26T10:37:37,683 [tablet.location] DEBUG: Assigned a;r0265c;r01179 to localhost:10000[10000755648001d]
2022-09-26T10:37:41,266 [manager.Manager] ERROR: localhost:10000 reports assignment failed for tablet a;r0265c;r01179
@milleruntime milleruntime added the bug This issue has been verified to be a bug. label Sep 26, 2022
@milleruntime milleruntime added this to To do in 2.1.0 via automation Sep 26, 2022
@milleruntime milleruntime added the blocker This issue blocks any release version labeled on it. label Sep 26, 2022
@milleruntime
Contributor Author

This is similar to problems we had with other Metadata checks in #2574 and #2667

@keith-turner
Contributor

keith-turner commented Sep 26, 2022

Looking at the code, that function is used by Ample to convert a metadata table row to a POJO. I think the function just does not handle the chopped family, and that needs to be added.
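
For readers following along, here is a minimal sketch of the failure mode described above. It is not Accumulo's `TabletMetadata.convertRow`; the class `TabletRowSketch`, its fields, the `handleChopped` flag, and the use of plain strings for column families are all assumptions made for illustration. It only shows how a switch over column families can throw `IllegalStateException` for a family it has no case for, and how adding a case avoids that.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a tablet metadata POJO; not Accumulo's TabletMetadata. */
final class TabletRowSketch {
  String location;
  boolean chopped;

  /**
   * Converts the column-family/value pairs of one metadata row into a POJO.
   * The handleChopped flag is an illustrative switch between the "before"
   * behavior (no case for the chopped family) and the "after" behavior.
   */
  static TabletRowSketch convertRow(Map<String, String> row, boolean handleChopped) {
    TabletRowSketch tm = new TabletRowSketch();
    for (Map.Entry<String, String> e : row.entrySet()) {
      String family = e.getKey();
      switch (family) {
        case "loc": // illustrative family name for the current location
          tm.location = e.getValue();
          break;
        case "chopped":
          if (!handleChopped) {
            // Same outcome as falling through to the default branch.
            throw new IllegalStateException("Unexpected family " + family);
          }
          tm.chopped = true;
          break;
        default:
          throw new IllegalStateException("Unexpected family " + family);
      }
    }
    return tm;
  }

  public static void main(String[] args) {
    Map<String, String> row = new LinkedHashMap<>();
    row.put("loc", "localhost:10000");
    row.put("chopped", "chopped");
    TabletRowSketch tm = convertRow(row, true);
    System.out.println("location=" + tm.location + " chopped=" + tm.chopped);
  }
}
```

Calling `convertRow(row, false)` on the same row throws with the message seen in the stack trace, which is why a tablet that carries a chopped marker could not be read during assignment.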

@keith-turner
Contributor

@milleruntime I could submit a PR to fix this if you would like, just let me know.

@milleruntime
Contributor Author

> @milleruntime I could submit a PR to fix this if you would like, just let me know.

I am looking at it today. I see where the column family needs to be added to the switch case, but I wasn't sure if there is anything else special that we need to do within TabletMetadata. For example, do we need to set this chopped boolean to true in the following call?

var tls = new TabletLocationState(extent, future, current, last, suspend, null, false);

Or can I just create a boolean for chopped?

@keith-turner
Contributor

keith-turner commented Sep 27, 2022

> I am looking at it today. I see where the column family needs to be added to the switch case but I wasn't sure if there was anything else special that we need to do within TabletMetadata. Like, do we need to set this chopped boolean to true?

Looking at the code inside TabletLocationState and how it is being used there, it does not matter if chopped is passed, so we can probably leave that alone. I think it's unrelated to fixing this issue, but that code that uses TabletLocationState should probably be refactored to avoid not passing in unneeded params to compute something.

@keith-turner
Contributor

keith-turner commented Sep 27, 2022

> I think it's unrelated to fixing this issue, but that code that uses TabletLocationState should probably be refactored to avoid not passing in unneeded params to compute something.

Going to try restating above w/o double negatives. Thinking we should refactor the code in question to have a function that computes tablet state only passing in parameters that are needed by the computation. That refactor could possibly be a separate PR from fixing this.
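
One possible shape for such a refactor, as a rough sketch only: a static function that takes just the inputs the state computation reads. Everything here is hypothetical, including the names `TabletStateSketch` and `TabletStateComputer`, the use of strings for server instances, and the state rules themselves, which are assumptions for illustration and not copied from Accumulo.

```java
import java.util.Set;

/** Hypothetical tablet states for this sketch; Accumulo's own enum may differ. */
enum TabletStateSketch { ASSIGNED, HOSTED, ASSIGNED_TO_DEAD_SERVER, SUSPENDED, UNASSIGNED }

final class TabletStateComputer {
  private TabletStateComputer() {}

  /**
   * Computes a tablet state from only the values the computation reads.
   * No extent, last location, walogs, or chopped flag is required.
   */
  static TabletStateSketch compute(String future, String current, String suspend,
      Set<String> liveServers) {
    if (future != null) {
      // A pending assignment is only useful if the target server is alive.
      return liveServers.contains(future) ? TabletStateSketch.ASSIGNED
          : TabletStateSketch.ASSIGNED_TO_DEAD_SERVER;
    }
    if (current != null) {
      return liveServers.contains(current) ? TabletStateSketch.HOSTED
          : TabletStateSketch.ASSIGNED_TO_DEAD_SERVER;
    }
    return suspend != null ? TabletStateSketch.SUSPENDED : TabletStateSketch.UNASSIGNED;
  }

  public static void main(String[] args) {
    Set<String> live = Set.of("localhost:10000[new-session]");
    // A tablet whose current server is no longer in the live set.
    System.out.println(compute(null, "localhost:10000[old-session]", null, live));
  }
}
```

With a function like this, callers that only need the state would not have to build a full TabletLocationState with placeholder arguments such as `null` and `false`.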

milleruntime added a commit to milleruntime/accumulo that referenced this issue Sep 27, 2022
2.1.0 automation moved this from To do to Done Sep 27, 2022
milleruntime added a commit that referenced this issue Sep 27, 2022