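[Bug] With coroutines enabled, after 8 hours of stable operation Kafka throws invalid generation:[commit=265,broker=266(error=0)] and stops consuming messages, while Redis intermittently hits connection timeouts #226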

Closed
lykxqhh opened this issue Apr 14, 2021 · 15 comments
Assignees
Labels
bug Something isn't working

Comments

@lykxqhh

lykxqhh commented Apr 14, 2021

Description
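We are running version 8.6.6 with coroutines enabled on production canary instances. After about 8 hours of stable operation, Kafka throws invalid generation:[commit=265,broker=266(error=0)] and stops consuming messages; at the same time, Redis connections time out. The problem occurs only on the Dragonwell canary instances.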

JVM options:
JVM_GC=" -XX:+UseG1GC -XX:G1HeapRegionSize=4M -XX:InitiatingHeapOccupancyPercent=40 -XX:MaxGCPauseMillis=100 -XX:+TieredCompilation -XX:CICompilerCount=4 -XX:-UseBiasedLocking -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintStringTableStatistics -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCApplicationStoppedTime -XX:+PrintFlagsFinal -XX:-UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -XX:+UnlockExperimentalVMOptions -XX:+UseWisp2 -Dio.netty.transport.noNative=true -Dio.netty.noUnsafe=true"

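@lykxqhh lykxqhh changed the title from "[Bug] Using coroutines on production canary instances: after 8 hours of stable operation, Kafka throws invalid generation:[commit=265,broker=266(error=0)] and stops consuming messages, while Redis hits connection timeouts" to "[Bug] With coroutines enabled, after 8 hours of stable operation Kafka throws invalid generation:[commit=265,broker=266(error=0)] and stops consuming messages, while Redis intermittently hits connection timeouts" Apr 14, 2021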
@joeyleeeeeee97
Contributor

joeyleeeeeee97 commented Apr 14, 2021

Hello, could you provide the jstack output captured when the problem occurred?

@joeyleeeeeee97 joeyleeeeeee97 added the bug Something isn't working label Apr 14, 2021
@joeyleeeeeee97 joeyleeeeeee97 self-assigned this Apr 14, 2021
@lykxqhh
Author

lykxqhh commented Apr 14, 2021

Coroutine [0x7fd61021b2e0] "batch-consume-thread#zcm_settle_voucher_to_clear_stage#com.nt.server.fundCheck" #833 active=816000 steal=19416 steal_fail=319 preempt=0 park=0/-1 cg=0/0 ttr=0
at java.dyn.CoroutineSupport.unsafeSymmetricYieldTo(CoroutineSupport.java:140)
- parking to wait for <0x00000006f1e03f00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at com.alibaba.wisp.engine.WispTask.switchTo(WispTask.java:335)
at com.alibaba.wisp.engine.WispCarrier.yieldTo(WispCarrier.java:427)
at com.alibaba.wisp.engine.WispCarrier.schedule(WispCarrier.java:265)
at com.alibaba.wisp.engine.WispTask.parkInternal(WispTask.java:426)
at com.alibaba.wisp.engine.WispTask.jdkPark(WispTask.java:479)
at com.alibaba.wisp.engine.WispEngine$5.park(WispEngine.java:267)
at sun.misc.Unsafe.park(Unsafe.java:1034)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:216)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2087)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:471)
at com.nt.kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:72)
at com.nt.kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:36)
at com.nt.kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
at com.nt.kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
at com.nt.mafka.client.consumer.DefaultConsumerProcessor$MafkaStreamWorkInBatch.getMafkaMessage(DefaultConsumerProcessor.java:2121)
at com.nt.mafka.client.consumer.DefaultConsumerProcessor$MafkaStreamWorkInBatch.run(DefaultConsumerProcessor.java:1924)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:853)
at com.alibaba.wisp.engine.WispTask.runOutsideWisp(WispTask.java:305)
at com.alibaba.wisp.engine.WispTask.runCommand(WispTask.java:280)
at com.alibaba.wisp.engine.WispTask.access$100(WispTask.java:55)
at com.alibaba.wisp.engine.WispTask$CacheableCoroutine.run(WispTask.java:250)
at java.dyn.CoroutineBase.startInternal(CoroutineBase.java:62)

  • Coroutine [0x7fd610242620] "ConsumerFetcherThread-conch.clearing.cost.detail_staging#com.nt.server.datacollect_set-zf-sdpt-settle-settle" #924 active=1059484 steal=33406 steal_fail=36 preempt=0 park=0/-1 cg=0/0 ttr=0
    at java.dyn.CoroutineSupport.unsafeSymmetricYieldTo(CoroutineSupport.java:140)
    at com.alibaba.wisp.engine.WispTask.switchTo(WispTask.java:335)
    at com.alibaba.wisp.engine.WispCarrier.yieldTo(WispCarrier.java:427)
    at com.alibaba.wisp.engine.WispCarrier.schedule(WispCarrier.java:265)
    at com.alibaba.wisp.engine.WispTask.parkInternal(WispTask.java:426)
    at com.alibaba.wisp.engine.WispTask.jdkPark(WispTask.java:479)
    at com.alibaba.wisp.engine.WispEngine$5.park(WispEngine.java:267)
    at com.alibaba.wisp.engine.WispEngine$5.poll(WispEngine.java:364)
    at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:992)
    - locked <0x00000006f1c18b28> (a java.lang.Object)
    at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:217)

@lykxqhh
Author

lykxqhh commented Apr 14, 2021

There are no BLOCKED threads.

@lykxqhh lykxqhh closed this as completed Apr 14, 2021
@yuleil
Collaborator

yuleil commented Apr 14, 2021

Could you provide the complete jstack file?

@lykxqhh
Author

lykxqhh commented Apr 15, 2021

jstack.txt
This is the jstack from this morning's reproduction.

@lykxqhh
Author

lykxqhh commented Apr 15, 2021

image

@lykxqhh lykxqhh reopened this Apr 15, 2021
@joeyleeeeeee97
Contributor

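@lykxqhh Hello. Coroutines cannot switch because com.nent.rasp.rasp.net.UnixDomainSocket is in use: its native read blocks the Wisp worker thread, as the stack below shows. Please try the blacklist mechanism
-Dcom.alibaba.wisp.threadAsWisp.black=class:com.nent.rasp.thread.CommandTask
so that this thread is not converted into a coroutine.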


"Wisp-Root-Worker-2" #6 daemon prio=5 os_prio=0 tid=0x00007ff0c47d2cf0 nid=0x281 runnable [0x00007ff001940000]
   java.lang.Thread.State: RUNNABLE
	at com.nent.rasp.rasp.net.UnixDomainSocket.nativeRead(Native Method)
	at com.nent.rasp.rasp.net.UnixDomainSocket$UnixDomainSocketInputStream.read(UnixDomainSocket.java:165)
	at java.io.DataInputStream.readByte(DataInputStream.java:265)
	at com.nent.rasp.command.WireIO.read(WireIO.java:27)
	at com.nent.rasp.thread.CommandTask.run(CommandTask.java:60)
	at java.lang.Thread.run(Thread.java:853)
	at com.alibaba.wisp.engine.WispTask.runOutsideWisp(WispTask.java:305)
	at com.alibaba.wisp.engine.WispTask.runCommand(WispTask.java:280)
	at com.alibaba.wisp.engine.WispTask.access$100(WispTask.java:55)
	at com.alibaba.wisp.engine.WispTask$CacheableCoroutine.run(WispTask.java:250)
	at java.dyn.CoroutineBase.startInternal(CoroutineBase.java:62)
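
The pattern in this stack (a "Wisp-Root-Worker-N" thread whose top frame is a native method) can be searched for mechanically in a jstack dump. The following is only a heuristic sketch: the class and method names are hypothetical, the thread-name prefix is taken from the dump above, and a single hit is not proof of a stuck worker, so it is worth comparing several dumps taken a few seconds apart.

```java
import java.util.ArrayList;
import java.util.List;

public class CarrierScanSketch {

    /**
     * Returns one "thread name -> top frame" entry for every Wisp worker
     * thread in the dump whose top stack frame is a native method.
     */
    static List<String> findWorkersInNative(String jstack) {
        List<String> hits = new ArrayList<>();
        String[] lines = jstack.split("\\R");
        for (int i = 0; i < lines.length; i++) {
            String header = lines[i];
            // Thread headers in jstack output start with the quoted thread name.
            if (!header.startsWith("\"Wisp-Root-Worker-")) {
                continue;
            }
            String name = header.substring(1, header.indexOf('"', 1));
            // The first "at ..." line after the header is the top frame.
            for (int j = i + 1; j < lines.length && !lines[j].trim().isEmpty(); j++) {
                String frame = lines[j].trim();
                if (frame.startsWith("at ")) {
                    if (frame.endsWith("(Native Method)")) {
                        hits.add(name + " -> " + frame.substring(3));
                    }
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
            "\"Wisp-Root-Worker-2\" #6 daemon prio=5 os_prio=0 tid=0x00007ff0c47d2cf0 nid=0x281 runnable [0x00007ff001940000]",
            "   java.lang.Thread.State: RUNNABLE",
            "\tat com.nent.rasp.rasp.net.UnixDomainSocket.nativeRead(Native Method)",
            "\tat com.nent.rasp.rasp.net.UnixDomainSocket$UnixDomainSocketInputStream.read(UnixDomainSocket.java:165)",
            "");
        findWorkersInNative(dump).forEach(System.out::println);
    }
}
```

Run against the excerpt above, this prints the worker name together with the UnixDomainSocket.nativeRead frame, which is the frame that points at CommandTask as the task to blacklist.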

@yuleil
Collaborator

yuleil commented Apr 15, 2021

A more complete solution would be something like Go's handoff mechanism; see the "Synchronous System Calls" section of this article.

@lykxqhh
Author

lykxqhh commented Apr 15, 2021

I will verify -Dcom.alibaba.wisp.threadAsWisp.black=class:com.nent.rasp.thread.CommandTask today.

@lykxqhh
Author

lykxqhh commented Apr 15, 2021

Can I configure it as -Dcom.alibaba.wisp.threadAsWisp.black=class:com.nent.rasp.thread? Would that exclude every class whose name starts with "com.nent.rasp.thread"?

@lykxqhh
Author

lykxqhh commented Apr 15, 2021

image

-Dcom.alibaba.wisp.threadAsWisp.black=class:com.nent.rasp.thread

Does this mean the exclusion worked? I don't see any change...

@yuleil
Collaborator

yuleil commented Apr 15, 2021

-Dcom.alibaba.wisp.threadAsWisp.black=class:com.nent.rasp.thread.CommandTask

This means that when a Runnable of type CommandTask is run as a thread, it is not converted into a coroutine.

See this code: https://github.com/alibaba/dragonwell8_jdk/blob/master/src/linux/classes/com/alibaba/wisp/engine/ThreadAsWisp.java#L99

@lykxqhh
Author

lykxqhh commented Apr 15, 2021

image

image

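Judging from the code, it is an equals comparison, but the "startWith" wording in the documentation is very ambiguous, and class=... is not supported.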

@yuleil
Collaborator

yuleil commented Apr 15, 2021

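> Judging from the code, it is an equals comparison, but the "startWith" wording in the documentation is very ambiguous, and class=... is not supported.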

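What the documentation means is that if the configuration string starts with class:, the name after the colon is used to match the type.
For example:
class:com.nent.rasp.thread.CommandTask
denotes a class entry that matches Runnable objects of type com.nent.rasp.thread.CommandTask.

Few users have touched this feature so far, so the documentation is incomplete. We will keep improving it.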
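
The semantics described in this exchange (a class: prefix on the entry, followed by an exact equals comparison against the Runnable's class name) can be illustrated with a small sketch. This is not the ThreadAsWisp implementation: the class name BlacklistEntrySketch, the matches helper, and the nested CommandTask stand-in are all hypothetical, and only a single class: entry is modeled.

```java
public class BlacklistEntrySketch {

    static final String CLASS_PREFIX = "class:";

    /**
     * Returns true only if the entry starts with "class:" and the rest of
     * the entry equals the fully qualified class name of the task.
     */
    static boolean matches(String entry, Runnable task) {
        if (!entry.startsWith(CLASS_PREFIX)) {
            return false; // other entry kinds are out of scope for this sketch
        }
        String className = entry.substring(CLASS_PREFIX.length());
        // Exact comparison: a package or class-name prefix does not match.
        return className.equals(task.getClass().getName());
    }

    /** Hypothetical stand-in for com.nent.rasp.thread.CommandTask. */
    static class CommandTask implements Runnable {
        @Override
        public void run() {
        }
    }

    public static void main(String[] args) {
        Runnable task = new CommandTask();
        String fqcn = task.getClass().getName();
        String prefixOnly = fqcn.substring(0, fqcn.length() - 4);

        System.out.println(matches("class:" + fqcn, task));       // true
        System.out.println(matches("class:" + prefixOnly, task)); // false
        System.out.println(matches("class=" + fqcn, task));       // false
    }
}
```

Under these assumptions, an entry such as class:com.nent.rasp.thread names no existing class exactly, which would explain why that setting appeared to change nothing.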

@chulong

chulong commented Oct 23, 2021

How can one tell that threads and coroutines are not switching properly?
