
[Bug] During the loading phase of a Spark Load job, the BE crashes while reading the parquet files generated by the Spark ETL step, and restarting does not recover it until the current load transaction ends #27427

@trunckman

Description


Search before asking

  • I had searched in the issues and found no similar issues.

Version

1.2.2-rc01

What's Wrong?

When the Spark Load job reaches the loading phase, the BE crashes immediately with the following exception:

2023-11-22 19:12:35 I1122 19:12:35.430696 591 daemon.cpp:256] OS physical memory 5.79 GB. Process memory usage 411.33 MB, limit 4.63 GB, soft limit 4.17 GB. Sys available memory 4.64 GB, low water mark 593.28 MB, warning water mark 1.16 GB. Refresh interval memory growth 0 B
2023-11-22 19:12:37 I1122 19:12:37.808765 1226 tablet_manager.cpp:899] find expired transactions for 0 tablets
2023-11-22 19:12:37 I1122 19:12:37.809018 1226 tablet_manager.cpp:937] success to build all report tablets info. tablet_count=1
2023-11-22 19:12:37 I1122 19:12:37.809084 1225 data_dir.cpp:734] path: /opt/apache-doris/be/storage total capacity: 245107195904, available capacity: 20833218560
2023-11-22 19:12:37 I1122 19:12:37.809141 1225 storage_engine.cpp:367] get root path info cost: 0 ms. tablet counter: 1
2023-11-22 19:12:37 I1122 19:12:37.814081 1224 task_worker_pool.cpp:1519] successfully report TASK|host=10.193.235.56|port=9020
2023-11-22 19:12:37 I1122 19:12:37.816437 1226 task_worker_pool.cpp:1519] successfully report TABLET|host=10.193.235.56|port=9020
2023-11-22 19:12:37 I1122 19:12:37.819788 1225 task_worker_pool.cpp:1519] successfully report DISK|host=10.193.235.56|port=9020
2023-11-22 19:12:37 I1122 19:12:37.826227 1372 task_worker_pool.cpp:252] successfully submit task|type=REALTIME_PUSH|signature=10012|queue_size=1
2023-11-22 19:12:37 I1122 19:12:37.826923 1184 task_worker_pool.cpp:619] get push task. signature=10012, priority=NORMAL push_type=LOAD_V2
2023-11-22 19:12:37 I1122 19:12:37.827843 1184 engine_batch_load_task.cpp:253] begin to process push. transaction_id=10011 tablet_id=14007, version=-1
2023-11-22 19:12:37 I1122 19:12:37.827883 1184 push_handler.cpp:55] begin to realtime push. tablet=14007.1531981042.5742ffccb4a3bd8a-55e35e96369a1a99, transaction_id=10011
2023-11-22 19:12:37 I1122 19:12:37.829012 1184 push_handler.cpp:211] tablet=14007.1531981042.5742ffccb4a3bd8a-55e35e96369a1a99, file path=hdfs://10.78.2.133:8020/tmp/doris1/jobs/11001/spark_load_test86/24015/V1.spark_load_test86.14005.14004.14006.0.1531981042.parquet, file size=896
2023-11-22 19:12:37 *** Query id: 0-0 ***
2023-11-22 19:12:37 *** Aborted at 1700651557 (unix time) try "date -d @1700651557" if you are using GNU date ***
2023-11-22 19:12:37 *** Current BE git commitID: a3521b3 ***
2023-11-22 19:12:37 *** SIGSEGV address not mapped to object (@0x20) received by PID 584 (TID 0xfffccfe16b40) from PID 32; stack trace: ***
2023-11-22 19:12:38 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /root/doris/be/src/common/signal_handler.h:420
2023-11-22 19:12:38 1# 0x0000FFFF997367A0 in linux-vdso.so.1
2023-11-22 19:12:38 2# doris::PushHandler::_convert_v2(std::shared_ptr<doris::Tablet>, std::shared_ptr<doris::Rowset>*, std::shared_ptr<doris::TabletSchema>) at /root/doris/be/src/olap/push_handler.cpp:247
2023-11-22 19:12:38 3# doris::PushHandler::_do_streaming_ingestion(std::shared_ptr<doris::Tablet>, doris::TPushReq const&, doris::PushType, std::vector<doris::TTabletInfo, std::allocator<doris::TTabletInfo> >*) at /root/doris/be/src/olap/push_handler.cpp:147
2023-11-22 19:12:38 4# doris::PushHandler::process_streaming_ingestion(std::shared_ptr<doris::Tablet>, doris::TPushReq const&, doris::PushType, std::vector<doris::TTabletInfo, std::allocator<doris::TTabletInfo> >*) at /root/doris/be/src/olap/push_handler.cpp:63
2023-11-22 19:12:38 5# doris::EngineBatchLoadTask::_push(doris::TPushReq const&, std::vector<doris::TTabletInfo, std::allocator<doris::TTabletInfo> >*) at /root/doris/be/src/olap/task/engine_batch_load_task.cpp:281
2023-11-22 19:12:38 6# doris::EngineBatchLoadTask::_process() at /root/doris/be/src/olap/task/engine_batch_load_task.cpp:232
2023-11-22 19:12:38 7# doris::EngineBatchLoadTask::execute() at /root/doris/be/src/olap/task/engine_batch_load_task.cpp:66
2023-11-22 19:12:38 8# doris::StorageEngine::execute_task(doris::EngineTask*) at /root/doris/be/src/olap/storage_engine.cpp:1022
2023-11-22 19:12:38 9# doris::TaskWorkerPool::_push_worker_thread_callback() at /root/doris/be/src/agent/task_worker_pool.cpp:627
2023-11-22 19:12:38 10# doris::ThreadPool::dispatch_thread() at /root/doris/be/src/util/threadpool.cpp:542
2023-11-22 19:12:38 11# doris::Thread::supervise_thread(void*) at /root/doris/be/src/util/thread.cpp:455
2023-11-22 19:12:38 12# start_thread in /lib/aarch64-linux-gnu/libpthread.so.0
2023-11-22 19:12:38 13# 0x0000FFFF9958001C in /lib/aarch64-linux-gnu/libc.so.6
2023-11-22 19:12:38
2023-11-22 19:12:38 /opt/apache-doris/be/bin/start_be.sh: line 244: 584 Segmentation fault ${LIMIT:+${LIMIT}} "${DORIS_HOME}/lib/doris_be" "$@" 2>&1 < /dev/null
2023-11-22 19:12:38 finished

What You Expected?

I would like to know why this happens and how to resolve it.

How to Reproduce?

1. Table creation statement
CREATE TABLE load_obs_file_test (
    id int(11) NULL,
    name varchar(50) NULL,
    age tinyint(4) NULL
) ENGINE = OLAP
UNIQUE KEY(id)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1",
    "in_memory" = "false",
    "storage_format" = "V2",
    "disable_auto_compaction" = "false"
);

2. Load statement
LOAD LABEL spark_load_test85
(
DATA INFILE("hdfs://xx.xx.xx.xx:8020/tmp/test/test.csv")
INTO TABLE load_obs_file_test
COLUMNS TERMINATED BY ","
FORMAT AS "csv"
(id,name,age)
set (
id=id,
name=name,
age=age
)
)
WITH RESOURCE 'spark2'
(
"spark.executor.memory" = "1g",
"spark.shuffle.compress" = "true"
)
PROPERTIES
(
"timeout" = "3600",
"max_filter_ratio"="1"
);
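For reference, the load label can be polled to confirm it reaches the LOADING stage mentioned in the notes below. A minimal sketch using pymysql against the FE's MySQL-protocol port; the host, port, user and database are placeholders, and the State column position follows the usual SHOW LOAD layout (adjust if your version differs):

import time
import pymysql

# Placeholders: point these at the FE's MySQL-protocol endpoint.
conn = pymysql.connect(host="xx.xx.xx.xx", port=9030, user="root",
                       password="", database="test_db")

try:
    with conn.cursor() as cur:
        while True:
            cur.execute("SHOW LOAD WHERE LABEL = 'spark_load_test85'")
            row = cur.fetchone()
            if row is not None:
                # State is normally the third column of SHOW LOAD output
                # (JobId, Label, State, Progress, ...).
                state = row[2]
                print("state:", state)
                if state in ("FINISHED", "CANCELLED"):
                    break
            time.sleep(5)
finally:
    conn.close()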

Anything Else?

1. The Spark Load job successfully reaches the loading stage.
2. The parquet file is generated successfully.
3. The parquet file contents are as follows:
parquet-tools show V1.spark_load_test85.24165.24164.24166.0.544377767.parquet
+------+----------+-------+
|   id | name     |   age |
|------+----------+-------|
|    1 | zhangsan |    14 |
|    2 | lisi     |    19 |
+------+----------+-------+

parquet-tools inspect V1.spark_load_test85.24165.24164.24166.0.544377767.parquet

############ file meta data ############
created_by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
num_columns: 3
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 651

############ Columns ############
id
name
age

############ Column(id) ############
name: id
path: id
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -5%)

############ Column(name) ############
name: name
path: name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -4%)

############ Column(age) ############
name: age
path: age
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
compression: SNAPPY (space_saved: -5%)
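For cross-checking, a small pyarrow sketch that reads the same file locally and prints the schema and rows the BE is about to consume; the path assumes the file has been copied out of HDFS to local disk. The output should match the parquet-tools dump above: id and age as INT32 (age with an Int(bitWidth=8) logical type, matching the tinyint column) and name as a UTF8 byte array.

import pyarrow.parquet as pq

# Path assumes the ETL output file has been copied from HDFS to the local disk.
path = "V1.spark_load_test85.24165.24164.24166.0.544377767.parquet"
pf = pq.ParquetFile(path)

print(pf.metadata.num_rows, "rows,", pf.metadata.num_columns, "columns")
print(pf.schema_arrow)   # Arrow view: id int32, name string, age int8
print(pf.schema)         # Raw parquet schema with logical/converted types

# Read both rows to confirm the data itself is intact.
print(pf.read().to_pylist())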

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
