[improvement](filecache) Adapt file cache queue consumption #63504

Open
freemandealer wants to merge 1 commit into apache:master from freemandealer:task-master-pick-file-cache-adaptive-queue-consu

@freemandealer
Member

Problem Summary: File cache background consumers used fixed intervals and batch sizes for LRU recorder log replay and _need_update_lru_blocks updates. When producers outpaced those consumers, backlog growth was hard to observe and could increase memory pressure. This change adds queue length metrics for LRU recorder log queues, exposes queue-size accessors, supports bounded LRU log replay, and makes both background consumers adapt their interval and batch size according to backlog watermarks. It also slices block LRU update work into smaller lock-hold batches and skips LRU log recording when tail-record retention is disabled.
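
As a rough illustration of the bounded replay and smaller lock-hold batches described above, the following Python sketch drains a queue up to a per-round cap and re-acquires the lock for each small slice. It is not the Doris implementation (which is C++); `drain_in_slices`, `max_items`, and `slice_size` are hypothetical names chosen for this example.

```python
import threading
from collections import deque
from typing import Callable, Tuple


def drain_in_slices(queue: deque,
                    lock: threading.Lock,
                    apply: Callable[[object], None],
                    max_items: int,
                    slice_size: int) -> Tuple[int, int]:
    """Consume at most max_items entries, holding the lock for at most
    slice_size entries at a time. Returns (items consumed, slices used).
    All parameter names are assumptions made for this sketch."""
    done = 0
    slices = 0
    while done < max_items and queue:
        with lock:  # the lock is released again after every slice
            slices += 1
            take = min(slice_size, max_items - done)
            for _ in range(take):
                if not queue:
                    break
                apply(queue.popleft())
                done += 1
    return done, slices


# Usage: ten pending entries, a cap of seven per round, three per lock hold.
pending = deque(range(10))
applied = []
done, slices = drain_in_slices(pending, threading.Lock(), applied.append,
                               max_items=7, slice_size=3)
# done == 7 and slices == 3; three entries stay queued for the next round.
```

Releasing the lock between slices gives other threads a chance to acquire it, at the cost of more lock acquisitions per round.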

Release note: None

  • Test: Unit Test
    • CCACHE_DISABLE=1 DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 EXTRA_CXX_FLAGS='-Wno-error=deprecated-literal-operator' sh run-be-ut.sh --run --filter=BlockFileCacheTest.test_lru_log_replay_bound_and_disable_record
    • build-support/check-format.sh
    • git diff --check
    • Tried build-support/run-clang-tidy.sh --base origin/master --build-dir be/ut_build_ASAN; it was blocked by pre-existing/file-level diagnostics and system header lookup errors before producing a clean result.
  • Behavior changed: Yes. File cache background queue consumers can now shorten their consume interval and increase their batch size when the backlog crosses configured watermarks.
  • Does this need documentation: No
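
The adaptive behavior noted in the checklist can be pictured as a small policy function. This is a minimal sketch under assumed values: the watermark thresholds, intervals, and batch sizes below are invented for illustration and are not the configuration names or defaults from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumePlan:
    interval_ms: int  # sleep between consumer rounds
    batch_size: int   # maximum entries drained per round


def plan_consume(queue_len: int,
                 low_watermark: int = 10_000,      # hypothetical threshold
                 high_watermark: int = 100_000,    # hypothetical threshold
                 base: ConsumePlan = ConsumePlan(1000, 1000),
                 busy: ConsumePlan = ConsumePlan(200, 5000),
                 urgent: ConsumePlan = ConsumePlan(50, 20000)) -> ConsumePlan:
    """Pick an interval and batch size from the current backlog length."""
    if queue_len >= high_watermark:
        return urgent  # large backlog: run most often, drain the most
    if queue_len >= low_watermark:
        return busy    # growing backlog: speed up moderately
    return base        # small backlog: keep the fixed defaults


# Usage: a consumer loop would re-evaluate the plan before each round.
plan = plan_consume(queue_len=25_000)
```

A step function like this keeps the policy easy to reason about; a real consumer might smooth transitions between tiers to avoid oscillating around a watermark.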


@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@freemandealer
Member Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31083 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a3410c382f1b8f3a57c58bb9792961b0c76a58b4, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17677	3867	3893	3867
q2	q3	10803	1407	802	802
q4	4682	472	349	349
q5	7621	2292	2087	2087
q6	383	175	141	141
q7	960	767	662	662
q8	9370	1672	1592	1592
q9	6971	4907	4922	4907
q10	6438	2089	1802	1802
q11	444	279	254	254
q12	636	427	290	290
q13	18156	3363	2764	2764
q14	264	253	237	237
q15	q16	825	779	721	721
q17	1008	921	981	921
q18	7028	5757	5577	5577
q19	1185	1289	1084	1084
q20	515	415	270	270
q21	5505	2556	2438	2438
q22	431	362	318	318
Total cold run time: 100902 ms
Total hot run time: 31083 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4306	4134	4176	4134
q2	q3	4543	4923	4344	4344
q4	2100	2222	1395	1395
q5	4395	4280	4315	4280
q6	232	185	212	185
q7	2160	1813	1641	1641
q8	2585	2188	2130	2130
q9	7886	7765	7678	7678
q10	4565	4487	4068	4068
q11	591	463	487	463
q12	731	738	526	526
q13	3385	3645	3068	3068
q14	299	310	311	310
q15	q16	721	752	647	647
q17	1363	1326	1290	1290
q18	8046	7346	7150	7150
q19	1127	1104	1115	1104
q20	2218	2240	1943	1943
q21	5312	4612	4488	4488
q22	547	467	422	422
Total cold run time: 57112 ms
Total hot run time: 51266 ms
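
The column meanings are not labeled in the report. Judging from the numbers alone, each row appears to hold one cold run, two hot runs, and the faster of the two hot runs, and the totals appear to be column sums. The Python sketch below checks that reading against the Round 1 rows; `parse_rows` and `totals` are hypothetical helper names, and the interpretation of the columns is an inference, not something the report states.

```python
ROUND_1 = """\
q1 17677 3867 3893 3867
q2 q3 10803 1407 802 802
q4 4682 472 349 349
q5 7621 2292 2087 2087
q6 383 175 141 141
q7 960 767 662 662
q8 9370 1672 1592 1592
q9 6971 4907 4922 4907
q10 6438 2089 1802 1802
q11 444 279 254 254
q12 636 427 290 290
q13 18156 3363 2764 2764
q14 264 253 237 237
q15 q16 825 779 721 721
q17 1008 921 981 921
q18 7028 5757 5577 5577
q19 1185 1289 1084 1084
q20 515 415 270 270
q21 5505 2556 2438 2438
q22 431 362 318 318
"""


def parse_rows(text):
    """Return (label, cold, hot1, hot2, best) tuples from the report rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        numbers = [int(f) for f in fields if f.isdigit()]
        labels = [f for f in fields if not f.isdigit()]
        if len(numbers) != 4 or not labels:
            continue  # skip rows without a full set of timings
        # Rows such as "q2 q3 ..." carry two labels; the timings are
        # attributed to the last one here, which is an assumption.
        rows.append((labels[-1], *numbers))
    return rows


def totals(rows):
    """Sum the assumed cold column and the assumed best-hot column."""
    return sum(r[1] for r in rows), sum(r[4] for r in rows)


rows = parse_rows(ROUND_1)
cold_total, hot_total = totals(rows)
# cold_total == 100902 and hot_total == 31083, matching the reported totals.
```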

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170549 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a3410c382f1b8f3a57c58bb9792961b0c76a58b4, data reload: false

query5	4338	656	515	515
query6	337	230	209	209
query7	4272	552	320	320
query8	331	244	221	221
query9	8903	4052	4027	4027
query10	479	352	298	298
query11	5818	2394	2196	2196
query12	189	128	129	128
query13	1308	621	438	438
query14	5979	5347	5068	5068
query14_1	4367	4358	4363	4358
query15	210	203	184	184
query16	986	457	514	457
query17	1021	748	601	601
query18	2474	502	373	373
query19	221	205	172	172
query20	136	135	128	128
query21	217	144	121	121
query22	13568	13573	13279	13279
query23	17147	16416	16052	16052
query23_1	16095	16197	16191	16191
query24	7430	1729	1300	1300
query24_1	1326	1305	1305	1305
query25	597	506	441	441
query26	1305	330	180	180
query27	2674	548	358	358
query28	4481	1985	1971	1971
query29	991	644	522	522
query30	311	244	200	200
query31	1112	1070	971	971
query32	94	76	71	71
query33	535	349	286	286
query34	1165	1106	648	648
query35	756	776	677	677
query36	1342	1388	1184	1184
query37	153	109	101	101
query38	3192	3133	3085	3085
query39	926	919	902	902
query39_1	873	865	866	865
query40	224	143	122	122
query41	64	64	62	62
query42	108	105	107	105
query43	327	324	278	278
query44	
query45	207	201	198	198
query46	1056	1165	723	723
query47	2322	2339	2231	2231
query48	405	390	288	288
query49	640	477	370	370
query50	1055	342	259	259
query51	4335	4273	4220	4220
query52	105	106	91	91
query53	256	279	205	205
query54	303	264	250	250
query55	92	91	82	82
query56	294	303	303	303
query57	1409	1406	1346	1346
query58	300	260	257	257
query59	1529	1619	1408	1408
query60	321	323	297	297
query61	161	158	148	148
query62	666	629	564	564
query63	239	199	204	199
query64	2390	813	618	618
query65	
query66	1745	485	369	369
query67	29342	29969	29695	29695
query68	
query69	454	334	308	308
query70	1039	1004	974	974
query71	306	272	268	268
query72	2976	2663	2418	2418
query73	854	781	380	380
query74	5038	4889	4723	4723
query75	2638	2629	2246	2246
query76	2311	1135	765	765
query77	390	401	331	331
query78	12167	12099	11664	11664
query79	1448	995	729	729
query80	642	542	456	456
query81	456	276	239	239
query82	1379	156	127	127
query83	363	271	252	252
query84	297	143	107	107
query85	870	527	459	459
query86	403	341	338	338
query87	3393	3402	3211	3211
query88	3514	2677	2654	2654
query89	436	384	335	335
query90	2011	180	179	179
query91	178	167	137	137
query92	80	80	74	74
query93	1506	1404	847	847
query94	531	340	295	295
query95	687	373	348	348
query96	1012	774	336	336
query97	2669	2729	2568	2568
query98	234	228	235	228
query99	1133	1087	959	959
Total cold run time: 251885 ms
Total hot run time: 170549 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 86.30% (126/146) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.65% (20744/38667)
Line Coverage 37.23% (196431/527596)
Region Coverage 33.55% (153980/458920)
Branch Coverage 34.57% (67090/194082)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 86.30% (126/146) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.90% (27245/37892)
Line Coverage 55.25% (290897/526546)
Region Coverage 52.23% (242116/463520)
Branch Coverage 53.56% (104369/194861)
