[fix](filecache) fix clear_file_cache right after reboot causing file cache size percent overflow #63410

Open
freemandealer wants to merge 1 commit into apache:master from freemandealer:task-master-fix-file-cache-reset-range-size-accounting

@freemandealer
Member

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: When file cache LRU restore creates a block from dump metadata and later lazy loading finds the same hash/offset with a smaller real file size, reset_range only updated the LRU queue size and _cur_cache_size. The FileBlock range still kept the old restored size, so a later async clear or eviction subtracted the old block size and could underflow _cur_cache_size, producing huge size_percent values in need-evict-cache-in-advance logs. This change makes reset_range update the FileBlock range as the single place that keeps the FileBlock, LRU queue, _cur_cache_size, and TTL size accounting consistent. FileBlock::finalize now delegates the range shrink to reset_range instead of changing the range before calling it.
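
The arithmetic behind the huge size_percent values can be replayed in isolation. The following is a minimal sketch, not the Doris code: `ToyCounter`, `replay_old_accounting`, and `size_percent` are hypothetical names, and the byte counts passed in are made up. It assumes the counter is an unsigned integer, so subtracting a stale block size that is larger than the counter wraps around instead of going negative.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the cache-wide byte counter; not the Doris class.
struct ToyCounter {
    std::size_t cur_cache_size = 0;
};

// Replays the sequence from the summary with the pre-fix accounting and
// returns the final counter value.
std::size_t replay_old_accounting(std::size_t restored_size, std::size_t real_size) {
    assert(real_size <= restored_size);  // the summary only covers a shrink

    ToyCounter cache;
    cache.cur_cache_size += restored_size;  // LRU restore adds the dumped size

    // The block's own view of its size; the old path never updated it.
    const std::size_t block_range_size = restored_size;

    // Lazy load finds a smaller file: the counter shrinks, the block range does not.
    cache.cur_cache_size -= (restored_size - real_size);

    // Clear or eviction subtracts whatever size the block reports.
    // With a stale range this is larger than the counter, so the value wraps.
    cache.cur_cache_size -= block_range_size;
    return cache.cur_cache_size;
}

// Percentage of capacity in use, computed from the (possibly wrapped) counter.
double size_percent(std::size_t cur_cache_size, std::size_t capacity) {
    return 100.0 * static_cast<double>(cur_cache_size) / static_cast<double>(capacity);
}
```

Calling `replay_old_accounting(100, 40)` leaves the counter 60 bytes below zero in modular arithmetic, which reads back as a value close to the maximum of `std::size_t`, and `size_percent` on that value is far above 100. When the restored and real sizes match, the same sequence ends at zero.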

Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added BE UT BlockFileCacheTest.lru_restore_size_mismatch_does_not_underflow_on_clear
    - Ran build-support/clang-format.sh with clang-format 16
    - Ran build-support/check-format.sh with clang-format 16
    - Ran DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter=BlockFileCacheTest.lru_restore_size_mismatch_does_not_underflow_on_clear
- Behavior changed: No
- Does this need documentation: No
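
The scenario named by the new unit test (restore with a dumped size, shrink to the real size, then clear) can be modeled with a toy cache in which `reset_range` is the only place that shrinks a block. This is a sketch under stated assumptions, not the actual BE test or the Doris classes: `ToyBlock`, `ToyCache`, and `restore_shrink_clear` are hypothetical, the TTL accounting is left out, and the sizes are arbitrary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy types; the names echo the PR text but are not the Doris classes.
struct ToyBlock {
    std::size_t size = 0;  // stands in for the FileBlock range size
};

struct ToyCache {
    std::size_t cur_cache_size = 0;  // stands in for _cur_cache_size
    std::size_t queue_size = 0;      // stands in for the LRU queue byte total

    // Registers a block with the given size in all three places.
    void add(ToyBlock& block, std::size_t size) {
        block.size = size;
        queue_size += size;
        cur_cache_size += size;
    }

    // Single place that shrinks a block: range, queue, and counter move together.
    void reset_range(ToyBlock& block, std::size_t new_size) {
        assert(new_size <= block.size);
        const std::size_t delta = block.size - new_size;
        block.size = new_size;
        queue_size -= delta;
        cur_cache_size -= delta;
    }

    // Clear or eviction subtracts the size the block currently reports.
    void remove(ToyBlock& block) {
        queue_size -= block.size;
        cur_cache_size -= block.size;
        block.size = 0;
    }
};

// Restore with a dumped size, shrink to the real size, then clear.
ToyCache restore_shrink_clear(std::size_t restored_size, std::size_t real_size) {
    ToyCache cache;
    ToyBlock block;
    cache.add(block, restored_size);
    cache.reset_range(block, real_size);
    cache.remove(block);
    return cache;
}
```

Because the block's size is updated inside `reset_range`, the later `remove` subtracts exactly what is still accounted for, so `restore_shrink_clear(100, 40)` ends with both totals at zero rather than a wrapped value.
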
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@freemandealer
Member Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31750 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9a71b509ad7c33f9750c39cb3d49fc14c1ad0714, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17977	4096	4089	4089
q2	q3	10752	1430	821	821
q4	4684	496	350	350
q5	7708	2336	2156	2156
q6	337	186	151	151
q7	993	797	625	625
q8	9517	1784	1647	1647
q9	6813	4937	4910	4910
q10	6468	2127	1788	1788
q11	438	280	245	245
q12	686	439	317	317
q13	18157	3473	2813	2813
q14	276	263	243	243
q15	q16	832	777	713	713
q17	972	909	938	909
q18	7003	5780	5667	5667
q19	1265	1381	1051	1051
q20	543	424	272	272
q21	6172	2815	2661	2661
q22	472	390	322	322
Total cold run time: 102065 ms
Total hot run time: 31750 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4674	4606	4960	4606
q2	q3	4937	5298	4642	4642
q4	2343	2385	1630	1630
q5	4959	4817	4744	4744
q6	252	197	137	137
q7	2005	1813	1590	1590
q8	2578	2062	2073	2062
q9	7403	7354	7297	7297
q10	4610	4468	4011	4011
q11	593	433	389	389
q12	720	715	531	531
q13	3024	3363	2838	2838
q14	285	275	256	256
q15	q16	708	715	648	648
q17	1302	1271	1263	1263
q18	8023	7068	7151	7068
q19	1102	1125	1075	1075
q20	2269	2259	1966	1966
q21	5486	4898	4714	4714
q22	540	489	426	426
Total cold run time: 57813 ms
Total hot run time: 51893 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168581 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9a71b509ad7c33f9750c39cb3d49fc14c1ad0714, data reload: false

query5	4336	653	551	551
query6	323	218	204	204
query7	4224	568	310	310
query8	325	232	223	223
query9	8888	4012	3974	3974
query10	443	335	308	308
query11	5759	2410	2208	2208
query12	182	131	128	128
query13	1311	632	412	412
query14	6114	5378	5038	5038
query14_1	4349	4337	4337	4337
query15	208	202	190	190
query16	1022	469	461	461
query17	1161	749	604	604
query18	2720	480	358	358
query19	225	213	180	180
query20	144	139	135	135
query21	214	142	117	117
query22	13620	13835	13655	13655
query23	17301	16330	15939	15939
query23_1	16066	16096	16071	16071
query24	7448	1794	1287	1287
query24_1	1291	1310	1290	1290
query25	532	459	420	420
query26	1299	307	174	174
query27	2697	563	331	331
query28	4393	1998	1941	1941
query29	978	615	497	497
query30	308	235	196	196
query31	1110	1083	938	938
query32	86	73	71	71
query33	545	354	293	293
query34	1176	1128	652	652
query35	753	777	664	664
query36	1307	1306	1123	1123
query37	154	108	90	90
query38	3215	3129	3027	3027
query39	925	914	897	897
query39_1	877	874	862	862
query40	225	144	128	128
query41	63	62	63	62
query42	114	109	115	109
query43	321	322	280	280
query44	
query45	211	202	190	190
query46	1054	1191	749	749
query47	2305	2372	2200	2200
query48	394	440	300	300
query49	628	480	371	371
query50	967	357	261	261
query51	4314	4269	4245	4245
query52	112	103	92	92
query53	260	289	204	204
query54	313	270	262	262
query55	94	92	86	86
query56	298	303	303	303
query57	1398	1385	1269	1269
query58	291	258	257	257
query59	1527	1613	1400	1400
query60	345	314	296	296
query61	154	148	145	145
query62	663	636	559	559
query63	253	199	209	199
query64	2321	767	603	603
query65	
query66	1674	481	349	349
query67	30020	29894	29816	29816
query68	
query69	455	332	302	302
query70	980	975	948	948
query71	307	275	304	275
query72	2942	2678	2367	2367
query73	843	805	417	417
query74	5019	4872	4699	4699
query75	2675	2585	2268	2268
query76	2287	1153	781	781
query77	402	400	334	334
query78	12007	11886	11655	11655
query79	1435	1015	722	722
query80	650	553	462	462
query81	451	276	240	240
query82	1392	158	123	123
query83	351	272	256	256
query84	260	135	114	114
query85	901	535	444	444
query86	401	334	339	334
query87	3394	3351	3238	3238
query88	3612	2680	2611	2611
query89	434	387	359	359
query90	1871	185	185	185
query91	174	170	139	139
query92	80	77	75	75
query93	1485	1430	898	898
query94	537	368	307	307
query95	676	379	343	343
query96	1082	794	346	346
query97	2717	2686	2584	2584
query98	238	238	228	228
query99	1111	1077	970	970
Total cold run time: 252601 ms
Total hot run time: 168581 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 57.14% (4/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.52% (27874/37914)
Line Coverage 57.47% (302170/525751)
Region Coverage 54.64% (252795/462697)
Branch Coverage 56.16% (109183/194407)

@freemandealer
Member Author

run external

@freemandealer
Member Author

run cloud_p0
