Backend
VL (Velox)
Bug description
After running for a period of time, the YARN executor exits with a core dump:
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f2688bfd640 (LWP 2915))]
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1 0x00007f2708818527 in __GI_abort () at abort.c:79
#2 0x00007f26ccaa1919 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007f26ccaacf3a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4 0x00007f26ccaacfa5 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#5 0x00007f26ccaadca3 in __cxxabiv1::__cxa_pure_virtual () at ../../../../libstdc++-v3/libsupc++/pure.cc:50
#6 0x00007f25beba86fe in ?? ()
#7 0x062eabc3a9a65d26 in ?? ()
#8 0x0000000000000000 in ?? ()
Spark version
Spark-3.3.x
Spark configurations
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 20g
spark.plugins org.apache.gluten.GlutenPlugin
spark.gluten.loadLibFromJar true
spark.gluten.loadLibOS CentOS
spark.gluten.loadLibOSVersion 7
spark.gluten.sql.native.writer.enabled true
System information
No response
Relevant logs
Retriable: False
Context: Operator: ValueStream[0] 0
Function: runInternal
File: /home/work/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 611
Stack trace:
# 0 _ZN8facebook5velox7process10StackTraceC1Ei
# 1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6 _ZN6gluten24WholeStageResultIterator4nextEv
# 7 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8 0x00007f9a1a0b9a30
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
at org.apache.gluten.execution.VeloxColumnarToRowExec$$anon$1.hasNext(VeloxColumnarToRowExec.scala:131)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:521)
at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.nativeConvert(RowToVeloxColumnarExec.scala:179)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:226)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:137)
at org.apache.gluten.utils.InvocationFlowProtection.next(Iterators.scala:154)
at org.apache.gluten.utils.IteratorCompleter.next(Iterators.scala:77)
at org.apache.gluten.utils.PayloadCloser.next(Iterators.scala:39)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
at org.apache.gluten.vectorized.GeneralInIterator.nextColumnarBatch(GeneralInIterator.java:38)
at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:33)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:118)
at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:236)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:552)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1535)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:555)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:219)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:124)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:554)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.nativeConvert(RowToVeloxColumnarExec.scala:179)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:226)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:137)
at org.apache.gluten.utils.InvocationFlowProtection.next(Iterators.scala:154)
at org.apache.gluten.utils.IteratorCompleter.next(Iterators.scala:77)
at org.apache.gluten.utils.PayloadCloser.next(Iterators.scala:39)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
at org.apache.gluten.vectorized.GeneralInIterator.nextColumnarBatch(GeneralInIterator.java:38)
at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:33)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
at org.apache.gluten.execution.VeloxColumnarToRowExec$$anon$1.hasNext(VeloxColumnarToRowExec.scala:131)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:521)
at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.nativeConvert(RowToVeloxColumnarExec.scala:179)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:226)
at org.apache.gluten.execution.RowToVeloxColumnarExec$$anon$1.next(RowToVeloxColumnarExec.scala:137)
at org.apache.gluten.utils.InvocationFlowProtection.next(Iterators.scala:154)
at org.apache.gluten.utils.IteratorCompleter.next(Iterators.scala:77)
at org.apache.gluten.utils.PayloadCloser.next(Iterators.scala:39)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:33)
at org.apache.gluten.vectorized.GeneralInIterator.nextColumnarBatch(GeneralInIterator.java:38)
at org.apache.gluten.vectorized.ColumnarBatchInIterator.next(ColumnarBatchInIterator.java:33)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at org.apache.gluten.utils.InvocationFlowProtection.hasNext(Iterators.scala:135)
at org.apache.gluten.utils.IteratorCompleter.hasNext(Iterators.scala:69)
at org.apache.gluten.utils.PayloadCloser.hasNext(Iterators.scala:35)
at org.apache.gluten.utils.PipelineTimeAccumulator.hasNext(Iterators.scala:98)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:118)
at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:236)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:552)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1535)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:555)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Retriable: False
Context: Operator: ValueStream[0] 0
Function: runInternal
File: /home/work/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 611
Stack trace:
# 0 _ZN8facebook5velox7process10StackTraceC1Ei
# 1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6 _ZN6gluten24WholeStageResultIterator4nextEv
# 7 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8 0x00007f9a1a0b9a30
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
... 43 more
Retriable: False
Function: runInternal
File: /home/work/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Driver.cpp
Line: 611
Stack trace:
# 0 _ZN8facebook5velox7process10StackTraceC1Ei
# 1 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKSsEEvRKNS1_18VeloxCheckFailArgsET0_
# 3 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE.cold
# 4 _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
# 5 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 6 _ZN6gluten24WholeStageResultIterator4nextEv
# 7 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 8 0x00007f9a1a0b9a30
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
at org.apache.gluten.vectorized.ColumnarBatchOutIterator.hasNextInternal(ColumnarBatchOutIterator.java:65)
at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
... 18 more
24/06/29 04:20:02 INFO Executor: Executor interrupted and killed task 33739.0 in stage 2594.0 (TID 25417349), reason: another attempt succeeded