[CH] Make shuffle writer exit gracefully when tasks in executors are killed #5823
Comments
Root cause:
// do push merged data
try {
  if (!isPushTargetWorkerExcluded(batches.get(0).loc, wrappedCallback)) {
    if (!testRetryRevive || remainReviveTimes < 1) {
      assert dataClientFactory != null;
      TransportClient client = dataClientFactory.createClient(host, port);
      client.pushMergedData(mergedData, pushDataTimeout, wrappedCallback);
    } else {
      wrappedCallback.onFailure(
          new CelebornIOException(
              StatusCode.PUSH_DATA_FAIL_NON_CRITICAL_CAUSE_PRIMARY,
              new RuntimeException("Mock push merge data failed.")));
    }
  }
} catch (Exception e) {
  logger.error(
      "Exception raised while pushing merged data for shuffle {} map {} attempt {} partition {} groupedBatch {} batch {} location {}.",
      shuffleId,
      mapId,
      attemptId,
      Arrays.toString(partitionIds),
      groupedBatchId,
      Arrays.toString(batchIds),
      addressPair,
      e);
  wrappedCallback.onFailure(
      new CelebornIOException(StatusCode.PUSH_DATA_CREATE_CONNECTION_FAIL_PRIMARY, e));
}
#define LOCAL_ENGINE_JNI_JMETHOD_END(env) \
    if ((env)->ExceptionCheck()) \
    { \
        LOG_ERROR(&Poco::Logger::get("local_engine"), "Enter java exception handle."); \
        auto excp = (env)->ExceptionOccurred(); \
        (env)->ExceptionDescribe(); \
        (env)->ExceptionClear(); \
        jclass cls = (env)->GetObjectClass(excp); \
        jmethodID mid = env->GetMethodID(cls, "toString", "()Ljava/lang/String;"); \
        jstring jmsg = static_cast<jstring>((env)->CallObjectMethod(excp, mid)); \
        const char * nmsg = (env)->GetStringUTFChars(jmsg, NULL); \
        std::string msg = std::string(nmsg); \
        env->ReleaseStringUTFChars(jmsg, nmsg); \
        throw DB::Exception::createRuntime(DB::ErrorCodes::LOGICAL_ERROR, msg); \
    }
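Solution: avoid throwing exceptions from destructors, which is also a general C++ development guideline. Change the ReservationListenerWrapper::free interface so that, if the JVM returns an exception, it clears the exception state and logs a warning, but does not throw DB::Exception.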
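The idea behind this fix can be sketched as follows. This is only an illustration, not the Gluten code: FakeEnv is a hypothetical stand-in for the JNIEnv exception calls (ExceptionCheck, ExceptionClear), and ReservationGuard, clearPendingExceptionNoThrow and runKilledTask are made-up names. The point it tries to show is that a destructor running during stack unwinding (for example, after a task is killed) should swallow and log a pending Java exception rather than throw, since a second exception thrown during unwinding calls std::terminate.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the few JNIEnv calls used here; NOT the real JNI API.
struct FakeEnv
{
    bool pending = false;   // true while a "Java exception" is pending
    std::string message;    // text of the pending exception
    int free_calls = 0;     // how many times the JVM-side free was invoked

    // Simulates calling into the JVM; may leave a pending Java exception.
    void callFree(bool fail)
    {
        ++free_calls;
        if (fail)
        {
            pending = true;
            message = "java.lang.InterruptedException: task killed";
        }
    }
    bool ExceptionCheck() const { return pending; }
    void ExceptionClear() { pending = false; message.clear(); }
};

// Clears any pending exception and logs a warning instead of throwing.
// Returns true if an exception was swallowed.
inline bool clearPendingExceptionNoThrow(FakeEnv & env, const char * where) noexcept
{
    if (!env.ExceptionCheck())
        return false;
    std::cerr << "[warning] " << where << ": ignoring Java exception: " << env.message << '\n';
    env.ExceptionClear();
    return true;
}

// Hypothetical RAII wrapper that releases a reservation in its destructor.
class ReservationGuard
{
public:
    ReservationGuard(FakeEnv & env_, bool jvm_fails_) : env(env_), jvm_fails(jvm_fails_) {}

    // Destructors are implicitly noexcept, so nothing may escape from here.
    ~ReservationGuard()
    {
        env.callFree(jvm_fails);
        clearPendingExceptionNoThrow(env, "ReservationGuard::~ReservationGuard");
        assert(!env.ExceptionCheck()); // post-condition: no pending exception is left behind
    }

private:
    FakeEnv & env;
    bool jvm_fails;
};

// Simulates a killed task: an exception propagates while the guard is alive,
// so the guard's destructor runs during stack unwinding.
inline std::string runKilledTask(FakeEnv & env)
{
    try
    {
        ReservationGuard guard(env, /*jvm_fails=*/true);
        throw std::runtime_error("task killed");
    }
    catch (const std::runtime_error & e)
    {
        return e.what(); // the original error survives; the free failure was only logged
    }
}
```

In this sketch the original "task killed" error still reaches the caller, while the failure of the free call is reduced to a warning. If the destructor threw instead, the process would terminate rather than let the task exit gracefully.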
liuneng1994 pushed a commit that referenced this issue Jun 6, 2024
… in executors are killed or interrupted (#5839)
What changes were proposed in this pull request?
Changes:
Clean code: remove useless JNIs and classes under cpp-ch
Support cancel for all gluten processors. It was triggered when task is killed or shut down.
Make sure offheap memory free does not throw exception. Ref: https://zhuanlan.zhihu.com/p/65454580
(Fixes: #5787 #5823)
How was this patch tested?
Manual
Backend
CH (ClickHouse)
Bug description
When Spark speculative execution is enabled, some tasks may be killed by the driver because another attempt has already finished successfully.
Those tasks fail because they do not exit gracefully.
Spark version
Spark-3.3.x
Spark configurations
System information
No response
Relevant logs