Skip to content

[VL] native code crash: __cxa_pure_virtual #6292

@RaoZhiRou-Z

Description

@RaoZhiRou-Z

Backend

VL (Velox)

Bug description

after running for a period of time,yarn executor exit with core error:

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f2688bfd640 (LWP 2915))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007f2708818527 in __GI_abort () at abort.c:79
#2  0x00007f26ccaa1919 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007f26ccaacf3a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007f26ccaacfa5 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00007f26ccaadca3 in __cxxabiv1::__cxa_pure_virtual () at ../../../../libstdc++-v3/libsupc++/pure.cc:50
#6  0x00007f25beba86fe in ?? ()
#7  0x062eabc3a9a65d26 in ?? ()
#8  0x0000000000000000 in ?? ()

Spark version

Spark-3.3.x

Spark configurations

spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 20g
spark.plugins org.apache.gluten.GlutenPlugin
spark.gluten.loadLibFromJar true
spark.gluten.loadLibOS CentOS
spark.gluten.loadLibOSVersion 7
spark.gluten.sql.native.writer.enabled true

System information

No response

Relevant logs

Retriable: False
Context: Operator: ValueStream[0] 0
Function: runInternal
File: /home/work/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 611
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4  _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5  _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6  _ZN6gluten24WholeStageResultIterator4nextEv
# 7  Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8  0x00007f9a1a0b9a30

	at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
	at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
	at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
	at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
	at org.apache.gluten.execution.VeloxColumnarToRowExec$$anon$1.hasNext(VeloxColumnarToRowExec.scala:131)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:521)
	at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
	at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.nativeConvert(RowToVeloxColumnarExec.scala:179)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:226)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:137)
	at org.apache.gluten.utils.InvocationFlowProtection.next(Iterators.scala:154)
	at org.apache.gluten.utils.IteratorCompleter.next(Iterators.scala:77)
	at org.apache.gluten.utils.PayloadCloser.next(Iterators.scala:39)
	at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
	at org.apache.gluten.vectorized.GeneralInIterator.nextColumnarBatch(GeneralInIterator.java:38)
	at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:33)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
	at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
	at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
	at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
	at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:118)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:236)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:552)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1535)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:555)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.spark.TaskKilledException
	at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:219)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:554)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.nativeConvert(RowToVeloxColumnarExec.scala:179)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:226)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:137)
	at org.apache.gluten.utils.InvocationFlowProtection.next(Iterators.scala:154)
	at org.apache.gluten.utils.IteratorCompleter.next(Iterators.scala:77)
	at org.apache.gluten.utils.PayloadCloser.next(Iterators.scala:39)
	at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
	at org.apache.gluten.vectorized.GeneralInIterator.nextColumnarBatch(GeneralInIterator.java:38)
	at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:33)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
	at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
	at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
	at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
	at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
	at org.apache.gluten.execution.VeloxColumnarToRowExec$$anon$1.hasNext(VeloxColumnarToRowExec.scala:131)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:521)
	at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
	at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.nativeConvert(RowToVeloxColumnarExec.scala:179)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:226)
	at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:137)
	at org.apache.gluten.utils.InvocationFlowProtection.next(Iterators.scala:154)
	at org.apache.gluten.utils.IteratorCompleter.next(Iterators.scala:77)
	at org.apache.gluten.utils.PayloadCloser.next(Iterators.scala:39)
	at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
	at org.apache.gluten.vectorized.GeneralInIterator.nextColumnarBatch(GeneralInIterator.java:38)
	at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:33)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
	at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
	at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
	at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
	at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:118)
	at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:236)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:552)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1535)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:555)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Retriable: False
Context: Operator: ValueStream[0] 0
Function: runInternal
File: /home/work/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 611
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4  _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5  _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6  _ZN6gluten24WholeStageResultIterator4nextEv
# 7  Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8  0x00007f9a1a0b9a30

	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
	at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
	... 43 more

Retriable: False
Function: runInternal
File: /home/work/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 611
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4  _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5  _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6  _ZN6gluten24WholeStageResultIterator4nextEv
# 7  Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8  0x00007f9a1a0b9a30

	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
	at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
	at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
	... 18 more
24/06/29 04:20:02 INFO Executor: Executor interrupted and killed task 33739.0 in stage 2594.0 (TID 25417349), reason: another attempt succeeded

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingtriage

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions