[CH] Make sure that all processors are cancelled before releasing global resources in native engine and after yarn kill or driver shutdown command comes #5788

Open
taiyang-li opened this issue May 17, 2024 · 1 comment
Labels
bug (Something isn't working), triage

Comments

@taiyang-li
Contributor

taiyang-li commented May 17, 2024

Backend

CH (ClickHouse)

Bug description

[Screenshot attached: 企业微信截图_6fbcdb51-b2ff-4129-bb31-aad816b7586b]

Spark version

3.3

Spark configurations

AQE enabled

System information

No response

Relevant logs

2024/05/17 03:41:06,432 INFO [Executor task launch worker for task 15.0 in stage 1.0 (TID 67)] Executor: Finished task 15.0 in stage 1.0 (TID 67). 5183 bytes result sent to driver
2024/05/17 03:41:11,329 INFO [dispatcher-Executor] YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
2024/05/17 03:41:11,376 INFO [CoarseGrainedExecutorBackend-stop-executor] MemoryStore: MemoryStore cleared
2024/05/17 03:41:11,376 INFO [CoarseGrainedExecutorBackend-stop-executor] BlockManager: BlockManager stopped
2024-05-17 03:41:11.381 <Information> jni: start nativeOnTerminate
2024/05/17 03:41:12,445 INFO [shutdown-hook-0] JniLibLoader: Start unload library path /data17/hadoop/yarn/local/filecache/1281177/libch_240516.so 
2024/05/17 03:41:12,446 INFO [shutdown-hook-0] JniLibLoader: verbosePath: /usr/hdp/3.1.0.0-78/hadoop/lib/native/libhadoop.so.1.0.0, libPath: /data17/hadoop/yarn/local/filecache/1281177/libch_240516.so
2024/05/17 03:41:12,446 INFO [shutdown-hook-0] JniLibLoader: Skip finalize because libhadoop.so.1.0.0 != libch_240516.so
2024/05/17 03:41:12,446 INFO [shutdown-hook-0] JniLibLoader: verbosePath: /data17/hadoop/yarn/local/filecache/1281177/libch_240516.so, libPath: /data17/hadoop/yarn/local/filecache/1281177/libch_240516.so
2024/05/17 03:41:12,446 INFO [shutdown-hook-0] JniLibLoader: Start finalize library file libch_240516.so
2024-05-17 03:41:12.446 <Information> jni: start JNI_OnUnload
2024/05/17 03:41:12,446 INFO [shutdown-hook-0] JniLibLoader: verbosePath: /data5/hadoop/yarn/local/usercache/xumingyong/appcache/application_1710462985812_773900/container_e57_1710462985812_773900_01_000004/tmp/liblz4-java-1735226946575805094.so, libPath: /data17/hadoop/yarn/local/filecache/1281177/libch_240516.so
2024/05/17 03:41:12,447 INFO [shutdown-hook-0] JniLibLoader: Skip finalize because liblz4-java-1735226946575805094.so != libch_240516.so
2024/05/17 03:41:12,456 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook called
taiyang-li added the bug and triage labels on May 17, 2024
@taiyang-li
Contributor Author

taiyang-li commented May 17, 2024

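Hypothesis (to be confirmed): when the driver sends the shutdown command to an executor, the executor runs the hook functions registered in ShutdownHookManager, and one of them releases the global resources in the native engine. Judging from the log above, however, a source operator in the native engine is still running at that point. It depends on global resources that have already been released (the global thread pool), which causes the crash.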

TODO

  • Add cancel support to LocalExecutor, calling PullingPipelineExecutor::cancel underneath.
  • Implement the onCancel interface for Gluten's custom processors and release each processor's own resources there.
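
The ordering these two items aim for can be sketched roughly as follows. This is a self-contained illustration only, not Gluten or ClickHouse code: `GlobalResources`, `SketchProcessor`, `SketchExecutor`, `shutdownInOrder`, and `runSketch` are all hypothetical names, and the worker loop merely stands in for whatever `PullingPipelineExecutor::cancel` would stop in the real engine. The point it tries to show is that every executor is cancelled and joined, and every processor's `onCancel` has run, before the shared global state is released.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for shared global state (e.g. the global thread pool).
struct GlobalResources {
    std::atomic<bool> alive{true};
};

// Hypothetical stand-in for a Gluten custom processor (e.g. a source operator).
class SketchProcessor {
public:
    explicit SketchProcessor(GlobalResources & globals) : globals_(globals) {}

    // One unit of work; records whether it ever ran after the globals were released.
    void work() {
        if (!globals_.alive.load()) touched_released_globals_.store(true);
    }
    // Cancellation hook: processor-owned resources would be released here.
    void onCancel() { own_resources_released_.store(true); }

    bool touchedReleasedGlobals() const { return touched_released_globals_.load(); }
    bool ownResourcesReleased() const { return own_resources_released_.load(); }

private:
    GlobalResources & globals_;
    std::atomic<bool> touched_released_globals_{false};
    std::atomic<bool> own_resources_released_{false};
};

// Hypothetical stand-in for LocalExecutor: drives one processor on its own thread.
class SketchExecutor {
public:
    explicit SketchExecutor(std::shared_ptr<SketchProcessor> p) : processor_(std::move(p)) {
        assert(processor_ && "executor needs a processor");
        worker_ = std::thread([this] {
            while (!cancelled_.load()) {
                processor_->work();
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        });
    }
    ~SketchExecutor() { cancel(); }

    // Idempotent: stop the loop, wait for the worker to exit, then notify the processor.
    void cancel() {
        cancelled_.store(true);
        std::lock_guard<std::mutex> lock(join_mutex_);
        if (worker_.joinable()) {
            worker_.join();
            processor_->onCancel();
        }
    }

private:
    std::shared_ptr<SketchProcessor> processor_;
    std::atomic<bool> cancelled_{false};
    std::mutex join_mutex_;
    std::thread worker_;
};

// Shutdown order proposed in the issue: cancel every executor first, release globals last.
void shutdownInOrder(std::vector<std::unique_ptr<SketchExecutor>> & executors,
                     GlobalResources & globals) {
    for (auto & executor : executors) executor->cancel();
    globals.alive.store(false);
}

// Runs the scenario; returns true if no processor ever observed released globals
// and every processor had its onCancel hook invoked.
bool runSketch() {
    GlobalResources globals;
    std::vector<std::shared_ptr<SketchProcessor>> processors;
    std::vector<std::unique_ptr<SketchExecutor>> executors;
    for (int i = 0; i < 4; ++i) {
        processors.push_back(std::make_shared<SketchProcessor>(globals));
        executors.push_back(std::make_unique<SketchExecutor>(processors.back()));
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    shutdownInOrder(executors, globals);

    bool ok = true;
    for (auto & p : processors)
        ok = ok && p->ownResourcesReleased() && !p->touchedReleasedGlobals();
    return ok;
}
```

Because `cancel()` joins the worker before returning, no `work()` call can overlap with the release of the globals. In the scenario from the log, the equivalent step would have to complete inside the shutdown hook before the native library is finalized; how that is wired into the JNI teardown path is not covered by this sketch.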
