[fix](be) Fix DCHECK in LocalExchangeSharedState::sub_total_mem_usage #63742

Open
jacktengg wants to merge 1 commit into apache:master from jacktengg:wt-fix-qa-sign-unsign

@jacktengg
Contributor

Issue Number: close #xxx

Problem Summary: In LocalExchangeSharedState::sub_total_mem_usage(), mem_usage is std::atomic<int64_t> but delta is size_t. The existing debug check

DCHECK_GE(prev_usage - delta, 0);

was never effective: on 64-bit targets the usual arithmetic conversions convert prev_usage to size_t, so prev_usage - delta is evaluated as an unsigned expression, which is trivially >= 0. The guard against mem_usage underflow (subtracting more than was added) therefore passed silently in all debug builds, leaving any over-subtraction undetected.
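The wrap-around described above can be reproduced in isolation. The following is a minimal sketch, not the Doris code: `mixed_difference` is a hypothetical helper that only mirrors the operand types of the old check, and it assumes a 64-bit target where `size_t` is at least as wide as `int64_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Assumption: 64-bit target. There, int64_t - size_t has an unsigned type.
static_assert(std::is_unsigned<decltype(std::int64_t{0} - std::size_t{0})>::value,
              "mixed int64_t/size_t arithmetic is expected to be unsigned here");

// Hypothetical helper mirroring the operand types of the old check.
// The signed operand is converted to size_t before the subtraction,
// so a mathematically negative result wraps to a huge positive value.
std::size_t mixed_difference(std::int64_t prev_usage, std::size_t delta) {
    return prev_usage - delta;
}
```

With `prev_usage = 10` and `delta = 20` the helper yields `SIZE_MAX - 9` rather than `-10`, so a `>= 0` test on that value cannot fail.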

Fix: compare prev_usage (int64_t) against cast_set<int64_t>(delta) so the comparison is performed entirely in signed space, and a real underflow will actually trip the DCHECK with the original prev_usage and delta values in the failure message. The release-mode guard on the next line (cast_set<int64_t>(prev_usage - delta) throws on underflow because the wrapped size_t result exceeds INT64_MAX) is preserved as-is.
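A signed comparison of this kind could look roughly like the sketch below. It is illustrative only: `to_signed_checked` is a hypothetical stand-in for Doris's `cast_set<int64_t>` (whose real implementation is not shown in this PR), and `MemCounter` is an invented miniature of the shared state, not the actual class.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-in for cast_set<int64_t>(size_t): a narrowing cast
// that throws instead of silently producing a negative value.
std::int64_t to_signed_checked(std::size_t v) {
    const auto max = static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max());
    if (static_cast<std::uint64_t>(v) > max) {
        throw std::overflow_error("size_t value does not fit in int64_t");
    }
    return static_cast<std::int64_t>(v);
}

// The comparison is done entirely in signed space, so it can be false.
bool signed_check_passes(std::int64_t prev_usage, std::size_t delta) {
    return prev_usage >= to_signed_checked(delta);
}

// Invented miniature of the accounting state, for illustration only.
struct MemCounter {
    std::atomic<std::int64_t> mem_usage{0};

    void add(std::size_t delta) { mem_usage.fetch_add(to_signed_checked(delta)); }

    // Returns false when more is subtracted than was previously added.
    bool sub(std::size_t delta) {
        const std::int64_t prev = mem_usage.fetch_sub(to_signed_checked(delta));
        return signed_check_passes(prev, delta);
    }
};
```

In the real code the boolean would feed a DCHECK rather than being returned; returning it here just makes the over-subtraction case observable.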

Release note

None

@jacktengg
Contributor Author

/review

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Contributor

@github-actions github-actions Bot left a comment

Review result: no blocking issues found.

Critical checkpoint conclusions:

  • Goal and proof: the PR fixes an ineffective debug underflow check in LocalExchangeSharedState::sub_total_mem_usage; the new signed comparison directly validates prev_usage >= delta before the existing post-subtraction threshold check. No new test is included, but this is a narrow DCHECK-only correction.
  • Scope and clarity: the modification is minimal and focused on the broken assertion.
  • Concurrency: the changed code remains on the existing atomic mem_usage.fetch_sub path; no new shared state, locks, dependency transitions, or lock ordering are introduced.
  • Lifecycle/static initialization: no new objects or lifecycle management are introduced.
  • Configuration/compatibility: no config, protocol, storage, or rolling-upgrade compatibility impact.
  • Parallel code paths: this is the local-exchange total memory accounting path; no separate changed path appears necessary for this assertion fix.
  • Error handling and invariants: using DCHECK for the invariant is appropriate, and the existing release-path cast_set behavior is preserved.
  • Memory safety/performance/observability: no allocations, ownership changes, hot-path extra work in release builds, or observability requirements are introduced.
  • Tests: no test result changes. Given the change only corrects a debug assertion expression, existing coverage is acceptable.

User focus: no additional user-provided review focus was specified.

@jacktengg
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31494 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a51b03bc32e2c8ec5e02a3690f0244453ddfb283, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17638	4029	4000	4000
q2	
q3	10820	1353	818	818
q4	4682	482	338	338
q5	7603	2412	2136	2136
q6	247	178	135	135
q7	948	792	647	647
q8	9425	1705	1447	1447
q9	5288	5035	4960	4960
q10	6377	2187	1869	1869
q11	439	270	241	241
q12	635	425	294	294
q13	18130	3439	2784	2784
q14	263	258	239	239
q15	
q16	823	780	713	713
q17	1018	953	1015	953
q18	7112	5754	5484	5484
q19	1304	1295	1041	1041
q20	654	459	296	296
q21	6308	2919	2746	2746
q22	464	436	353	353
Total cold run time: 100178 ms
Total hot run time: 31494 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4718	5090	4800	4800
q2	
q3	4932	5285	4670	4670
q4	2092	2174	1371	1371
q5	5058	4712	4752	4712
q6	233	184	134	134
q7	1871	1733	1565	1565
q8	2401	2151	2110	2110
q9	7927	7511	7472	7472
q10	4777	4630	4246	4246
q11	536	390	354	354
q12	728	730	523	523
q13	3010	3402	2760	2760
q14	283	274	257	257
q15	
q16	672	691	614	614
q17	1292	1260	1266	1260
q18	7263	6734	6913	6734
q19	1113	1097	1089	1089
q20	2214	2200	1943	1943
q21	5247	4586	4352	4352
q22	520	458	401	401
Total cold run time: 56887 ms
Total hot run time: 51367 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172108 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a51b03bc32e2c8ec5e02a3690f0244453ddfb283, data reload: false

query5	4309	654	529	529
query6	338	233	202	202
query7	4260	548	311	311
query8	334	242	221	221
query9	8800	4031	4083	4031
query10	466	350	300	300
query11	5803	2525	2235	2235
query12	183	129	125	125
query13	1296	626	439	439
query14	6151	5468	5173	5173
query14_1	4505	4485	4478	4478
query15	209	202	187	187
query16	999	459	446	446
query17	1112	715	595	595
query18	2442	494	352	352
query19	213	205	158	158
query20	136	131	132	131
query21	212	138	117	117
query22	13693	13606	13338	13338
query23	17443	16583	16190	16190
query23_1	16463	16461	16287	16287
query24	7407	1804	1295	1295
query24_1	1303	1328	1335	1328
query25	552	479	416	416
query26	1326	317	174	174
query27	2710	580	342	342
query28	4429	2008	2037	2008
query29	998	613	497	497
query30	293	241	201	201
query31	1144	1092	958	958
query32	96	80	73	73
query33	569	365	310	310
query34	1190	1163	676	676
query35	798	813	697	697
query36	1427	1391	1300	1300
query37	159	108	97	97
query38	3242	3190	3070	3070
query39	937	920	882	882
query39_1	868	874	879	874
query40	235	154	131	131
query41	72	69	67	67
query42	113	112	116	112
query43	335	331	292	292
query44	
query45	215	208	198	198
query46	1129	1166	781	781
query47	2343	2338	2229	2229
query48	436	445	304	304
query49	649	529	405	405
query50	986	359	256	256
query51	4382	4320	4320	4320
query52	109	111	99	99
query53	268	291	210	210
query54	329	288	276	276
query55	98	101	90	90
query56	325	333	317	317
query57	1440	1400	1322	1322
query58	312	286	276	276
query59	1597	1710	1434	1434
query60	339	342	322	322
query61	188	186	181	181
query62	705	658	590	590
query63	252	210	214	210
query64	2493	836	641	641
query65	
query66	1704	486	356	356
query67	29894	29785	29616	29616
query68	
query69	465	345	301	301
query70	1080	999	1010	999
query71	302	276	273	273
query72	3102	2683	2428	2428
query73	859	765	433	433
query74	5112	4921	4822	4822
query75	2708	2618	2285	2285
query76	2271	1145	777	777
query77	410	405	320	320
query78	12459	12577	11856	11856
query79	1469	1002	802	802
query80	654	561	462	462
query81	454	279	243	243
query82	1384	161	123	123
query83	361	284	259	259
query84	262	142	114	114
query85	884	538	470	470
query86	406	347	309	309
query87	3426	3384	3255	3255
query88	3615	2750	2717	2717
query89	457	393	353	353
query90	1897	186	182	182
query91	181	169	135	135
query92	80	79	73	73
query93	1556	1455	888	888
query94	556	364	325	325
query95	667	391	354	354
query96	1037	820	356	356
query97	2729	2721	2616	2616
query98	237	232	233	232
query99	1168	1144	1039	1039
Total cold run time: 254885 ms
Total hot run time: 172108 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 50.00% (1/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.87% (20915/38828)
Line Coverage 37.42% (198042/529215)
Region Coverage 33.71% (155165/460338)
Branch Coverage 34.71% (67553/194620)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 50.00% (1/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.83% (28075/38028)
Line Coverage 57.73% (304754/527859)
Region Coverage 54.89% (255093/464761)
Branch Coverage 56.46% (110287/195346)
