[Fix](compaction) Fix nullptr in CloudStorageEngine due to concurrent access to compaction maps #50819

Merged
merged 1 commit into apache:master on May 14, 2025

Yukang-Lian
Collaborator

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #49882

Problem Summary:

*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1746727905 (unix time) try "date -d @1746727905" if you are using GNU date ***
*** Current BE git commitID: ace825a ***
*** SIGSEGV address not mapped to object (@0x8) received by PID 3151893 (TID 3152363 OR 0x7f1186c00640) from PID 8; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x00007F12D9FEE520 in /lib/x86_64-linux-gnu/libc.so.6
4# std::_Hashtable<long, std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> >, std::allocator<std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> > >, std::__detail::_Select1st, std::equal_to<long>, std::hash<long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_before_node(unsigned long, long const&, unsigned long) const at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/hashtable.h:1817
5# std::pair<std::__detail::_Node_iterator<std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> >, false, false>, bool> std::_Hashtable<long, std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> >, std::allocator<std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> > >, std::__detail::_Select1st, std::equal_to<long>, std::hash<long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_emplace<long, decltype(nullptr)>(std::integral_constant<bool, true>, long&&, decltype(nullptr)&&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/hashtable.h:1947
6# doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr<doris::CloudTablet> const&) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be
7# doris::CloudStorageEngine::submit_compaction_task(std::shared_ptr<doris::CloudTablet> const&, doris::CompactionType) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/cloud/cloud_storage_engine.cpp:917
8# doris::CloudStorageEngine::_compaction_tasks_producer_callback() at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/cloud/cloud_storage_engine.cpp:494
9# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/util/thread.cpp:499
10# start_thread at ./nptl/pthread_create.c:442
11# 0x00007F12DA0D2850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83
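
The trace above ends inside `std::unordered_map::emplace(tablet_id, nullptr)`, called from `_submit_base_compaction_task`. The fault address `0x8` is consistent with a read through a null pointer at a small offset, which is what a hash table can do when another thread modifies it mid-lookup. The diff itself is not shown on this page, so the following is only a minimal sketch of the general remedy the title implies: taking one mutex around every access to the shared map. `Compaction`, `CompactionRegistry`, `try_reserve`, and `run_stress` are hypothetical names used for illustration, not Doris identifiers.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for doris::CloudBaseCompaction.
struct Compaction {
    std::int64_t tablet_id;
};

// Hypothetical registry: every read and write of the map takes the same mutex.
class CompactionRegistry {
public:
    // Reserve a slot with a null placeholder, mirroring the emplace(id, nullptr)
    // call visible in the trace. Returns false if the tablet is already present.
    bool try_reserve(std::int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        return _submitted.emplace(tablet_id, nullptr).second;
    }

    // Replace the placeholder with the real compaction object.
    void set(std::int64_t tablet_id, std::shared_ptr<Compaction> c) {
        std::lock_guard<std::mutex> lock(_mtx);
        _submitted[tablet_id] = std::move(c);
    }

    // Drop the entry once the task has finished or failed.
    void erase(std::int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        _submitted.erase(tablet_id);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(_mtx);
        return _submitted.size();
    }

private:
    mutable std::mutex _mtx;
    std::unordered_map<std::int64_t, std::shared_ptr<Compaction>> _submitted;
};

// Run `threads` workers that each reserve, fill, and erase `per_thread`
// distinct tablet ids. Returns the number of successful reservations.
inline std::int64_t run_stress(CompactionRegistry& reg, int threads, int per_thread) {
    assert(threads > 0 && per_thread > 0);
    std::atomic<std::int64_t> reserved{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&reg, &reserved, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                const std::int64_t id = static_cast<std::int64_t>(t) * per_thread + i;
                if (reg.try_reserve(id)) {
                    reg.set(id, std::make_shared<Compaction>(Compaction{id}));
                    ++reserved;
                    reg.erase(id);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return reserved.load();
}
```

Without the lock, concurrent `emplace` and `erase` calls on one `std::unordered_map` are a data race and therefore undefined behavior. A single coarse mutex is the simplest option; whether the merged change uses an existing engine lock or a new one cannot be determined from this page.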

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Yukang-Lian
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33908 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a9a1bdee79e7ccce7beb56d52721c2dcf7b7202c, data reload: false

------ Round 1 ----------------------------------
q1	25508	5079	4995	4995
q2	2056	300	198	198
q3	10362	1226	713	713
q4	10238	993	509	509
q5	7541	2347	2350	2347
q6	182	163	135	135
q7	920	740	615	615
q8	9308	1301	1081	1081
q9	6792	5112	5099	5099
q10	6870	2295	1898	1898
q11	500	294	274	274
q12	348	353	225	225
q13	17756	3825	3117	3117
q14	234	221	212	212
q15	526	464	496	464
q16	423	427	385	385
q17	636	858	386	386
q18	7498	7166	7065	7065
q19	1411	960	557	557
q20	343	324	217	217
q21	4406	3371	2489	2489
q22	1019	1017	927	927
Total cold run time: 114877 ms
Total hot run time: 33908 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5106	5063	5053	5053
q2	235	321	225	225
q3	2144	2672	2261	2261
q4	1326	1760	1367	1367
q5	4440	4378	4336	4336
q6	220	171	130	130
q7	1996	1898	1738	1738
q8	2570	2594	2458	2458
q9	7165	7185	6958	6958
q10	3070	3199	2761	2761
q11	578	519	497	497
q12	721	738	626	626
q13	3464	3801	3284	3284
q14	301	309	292	292
q15	522	480	482	480
q16	452	490	435	435
q17	1140	1498	1396	1396
q18	7694	7515	7296	7296
q19	809	806	866	806
q20	1983	2026	1842	1842
q21	5082	4735	4739	4735
q22	1037	1048	1044	1044
Total cold run time: 52055 ms
Total hot run time: 50020 ms

@doris-robot

TPC-DS: Total hot run time: 193080 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a9a1bdee79e7ccce7beb56d52721c2dcf7b7202c, data reload: false

query1	1379	1081	1068	1068
query2	6211	1824	1842	1824
query3	11014	4635	4543	4543
query4	54814	25487	23084	23084
query5	5020	502	446	446
query6	332	229	206	206
query7	4870	502	295	295
query8	307	255	243	243
query9	5362	2692	2665	2665
query10	434	329	257	257
query11	14947	15007	14875	14875
query12	157	114	106	106
query13	1028	521	408	408
query14	10227	6323	6313	6313
query15	205	208	187	187
query16	7067	659	503	503
query17	1121	733	615	615
query18	1548	406	320	320
query19	205	192	175	175
query20	133	126	121	121
query21	211	137	112	112
query22	4318	4575	4160	4160
query23	34301	33603	33731	33603
query24	6612	2460	2450	2450
query25	473	471	411	411
query26	707	273	157	157
query27	2328	514	353	353
query28	2855	2185	2168	2168
query29	617	604	441	441
query30	275	215	188	188
query31	833	838	763	763
query32	80	62	60	60
query33	446	357	321	321
query34	774	854	551	551
query35	812	837	760	760
query36	984	996	918	918
query37	111	102	79	79
query38	4310	4220	4359	4220
query39	1517	1457	1431	1431
query40	220	120	112	112
query41	56	54	50	50
query42	126	113	105	105
query43	504	511	481	481
query44	1383	849	840	840
query45	177	176	175	175
query46	891	1060	659	659
query47	1862	1902	1803	1803
query48	408	421	333	333
query49	680	512	431	431
query50	680	704	420	420
query51	4202	4230	4237	4230
query52	112	111	102	102
query53	239	273	196	196
query54	597	579	530	530
query55	88	85	89	85
query56	296	314	299	299
query57	1186	1172	1130	1130
query58	275	265	263	263
query59	2630	2752	2596	2596
query60	340	351	315	315
query61	132	149	123	123
query62	738	778	676	676
query63	238	188	198	188
query64	1877	1062	712	712
query65	4292	4242	4227	4227
query66	738	391	310	310
query67	15875	15600	15464	15464
query68	7583	891	515	515
query69	555	308	270	270
query70	1229	1137	1103	1103
query71	515	340	297	297
query72	5998	4868	4844	4844
query73	1527	697	361	361
query74	8860	9064	8749	8749
query75	3836	3228	2731	2731
query76	4180	1194	748	748
query77	621	373	286	286
query78	10082	10321	9349	9349
query79	2764	790	574	574
query80	665	494	439	439
query81	476	256	218	218
query82	440	129	135	129
query83	377	250	227	227
query84	295	96	86	86
query85	783	352	304	304
query86	389	321	289	289
query87	4409	4431	4329	4329
query88	3409	2288	2293	2288
query89	412	310	285	285
query90	1804	202	205	202
query91	138	150	111	111
query92	76	61	65	61
query93	2032	922	580	580
query94	666	407	312	312
query95	375	287	280	280
query96	492	573	280	280
query97	3167	3192	3093	3093
query98	232	213	213	213
query99	1443	1419	1272	1272
Total cold run time: 299617 ms
Total hot run time: 193080 ms

@doris-robot

ClickBench: Total hot run time: 29.07 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a9a1bdee79e7ccce7beb56d52721c2dcf7b7202c, data reload: false

query1	0.03	0.04	0.03
query2	0.12	0.10	0.11
query3	0.25	0.19	0.19
query4	1.59	0.19	0.19
query5	0.58	0.58	0.58
query6	1.20	0.72	0.72
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.58	0.52	0.51
query10	0.56	0.57	0.57
query11	0.16	0.11	0.11
query12	0.14	0.11	0.11
query13	0.61	0.60	0.59
query14	0.79	0.81	0.81
query15	0.88	0.84	0.86
query16	0.39	0.38	0.37
query17	1.01	1.08	1.03
query18	0.23	0.22	0.22
query19	1.93	1.84	1.78
query20	0.01	0.01	0.02
query21	15.40	0.91	0.55
query22	0.73	1.22	0.65
query23	14.97	1.40	0.61
query24	7.13	1.31	0.56
query25	0.53	0.07	0.13
query26	0.66	0.16	0.15
query27	0.05	0.05	0.06
query28	9.54	0.94	0.47
query29	12.55	3.91	3.31
query30	0.25	0.09	0.07
query31	2.82	0.60	0.38
query32	3.23	0.54	0.47
query33	3.04	3.11	3.14
query34	15.78	5.10	4.53
query35	4.48	4.57	4.48
query36	0.65	0.50	0.48
query37	0.08	0.07	0.06
query38	0.06	0.04	0.04
query39	0.03	0.03	0.03
query40	0.17	0.14	0.12
query41	0.08	0.02	0.02
query42	0.03	0.02	0.03
query43	0.03	0.03	0.04
Total cold run time: 103.41 s
Total hot run time: 29.07 s

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 55.76% (14894/26709)
Line Coverage 44.57% (131780/295647)
Region Coverage 43.63% (66266/151882)
Branch Coverage 38.23% (33956/88814)

@Yukang-Lian
Collaborator Author

run cloud_p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.46% (20877/26275)
Line Coverage 72.69% (214873/295583)
Region Coverage 70.83% (126352/178382)
Branch Coverage 64.45% (65367/101422)

Contributor

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 14, 2025
Contributor

PR approved by at least one committer and no changes requested.

@dataroaring dataroaring merged commit 732d8c6 into apache:master May 14, 2025
27 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request May 14, 2025
… access to compaction maps (#50819)

dataroaring pushed a commit that referenced this pull request May 14, 2025
…o concurrent access to compaction maps #50819 (#50881)

Cherry-picked from #50819

Co-authored-by: abmdocrt <lianyukang@selectdb.com>
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
… access to compaction maps (apache#50819)

Labels
approved Indicates a PR has been approved by one committer. dev/3.0.6-merged reviewed
5 participants