Skip to content

[fix](backup) reject upload snapshots on broken storage path#61251

Merged
gavinchou merged 2 commits intoapache:masterfrom
luwei16:fix/upload-broken-snapshot-path
Mar 17, 2026
Merged

[fix](backup) reject upload snapshots on broken storage path#61251
gavinchou merged 2 commits intoapache:masterfrom
luwei16:fix/upload-broken-snapshot-path

Conversation

@luwei16
Copy link
Contributor

@luwei16 luwei16 commented Mar 12, 2026

Backup upload reuses snapshot paths returned by MAKE_SNAPSHOT. When a data dir is later marked as broken, the stale snapshot directory can still remain on that disk and be picked up by upload. In that case the upload task may continue into file checksum and remote upload logic with a snapshot source that is no longer safe to read.

This change adds a broken-storage-path validation step to SnapshotLoader local source path checking for upload. The check canonicalizes the snapshot path, matches it to its DataDir, and rejects the source early when the owning DataDir is offline or the path is listed in broken_storage_path. That turns the broken-disk case into a normal task error instead of letting upload continue on an invalid local snapshot source.

The unit tests cover both the direct broken-path case and a canonicalized symlink path to ensure the validation cannot be bypassed by path indirection.

Backup upload reuses snapshot paths returned by MAKE_SNAPSHOT. When a
data dir is later marked as broken, the stale snapshot directory can
still remain on that disk and be picked up by upload. In that case the
upload task may continue into file checksum and remote upload logic with
a snapshot source that is no longer safe to read.

This change adds a broken-storage-path validation step to
SnapshotLoader local source path checking for upload. The check
canonicalizes the snapshot path, matches it to its DataDir, and rejects
the source early when the owning DataDir is offline or the path is
listed in broken_storage_path. That turns the broken-disk case into a
normal task error instead of letting upload continue on an invalid local
snapshot source.

The unit tests cover both the direct broken-path case and a
canonicalized symlink path to ensure the validation cannot be bypassed
by path indirection.
@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@luwei16 luwei16 changed the title [fix] reject upload snapshots on broken storage path [fix](backup) reject upload snapshots on broken storage path Mar 12, 2026
@luwei16
Copy link
Contributor Author

luwei16 commented Mar 12, 2026

run buildall

@luwei16
Copy link
Contributor Author

luwei16 commented Mar 12, 2026

run buildall

@hello-stephen
Copy link
Contributor

BE UT Coverage Report

Increment line coverage 94.74% (36/38) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.62% (19681/37399)
Line Coverage 36.18% (183511/507261)
Region Coverage 32.31% (141724/438694)
Branch Coverage 33.47% (61874/184847)

@doris-robot
Copy link

TPC-H: Total hot run time: 27795 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 83a443615ae4d4d35fb1974bc783b72cae67ef41, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17642	4593	4296	4296
q2	q3	10647	821	518	518
q4	4675	368	261	261
q5	7561	1206	1022	1022
q6	175	175	152	152
q7	795	856	671	671
q8	9301	1476	1322	1322
q9	4928	4780	4757	4757
q10	6253	1913	1674	1674
q11	478	267	256	256
q12	696	575	473	473
q13	18069	2979	2219	2219
q14	234	230	216	216
q15	943	812	805	805
q16	784	736	710	710
q17	715	845	412	412
q18	6097	5422	5236	5236
q19	1139	987	616	616
q20	498	515	390	390
q21	4399	2261	1528	1528
q22	396	330	261	261
Total cold run time: 96425 ms
Total hot run time: 27795 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4668	4542	4503	4503
q2	q3	3872	4325	3805	3805
q4	902	1226	767	767
q5	4082	4376	4379	4376
q6	181	178	142	142
q7	1803	1655	1552	1552
q8	2484	2870	2606	2606
q9	7665	7465	7406	7406
q10	3757	3987	3639	3639
q11	520	447	437	437
q12	507	594	447	447
q13	2773	3179	2277	2277
q14	273	288	275	275
q15	857	803	862	803
q16	749	791	758	758
q17	1184	1466	1329	1329
q18	7198	6968	6791	6791
q19	914	886	902	886
q20	2068	2189	2037	2037
q21	4081	3533	3542	3533
q22	457	445	384	384
Total cold run time: 50995 ms
Total hot run time: 48753 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 153766 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 83a443615ae4d4d35fb1974bc783b72cae67ef41, data reload: false

query5	4327	641	530	530
query6	326	232	211	211
query7	4205	477	270	270
query8	329	248	231	231
query9	8706	2714	2759	2714
query10	531	389	346	346
query11	7543	5907	5652	5652
query12	191	129	128	128
query13	1258	468	341	341
query14	5694	3911	3658	3658
query14_1	2876	2830	2831	2830
query15	212	203	180	180
query16	1012	490	469	469
query17	1104	721	648	648
query18	2460	458	389	389
query19	208	210	180	180
query20	135	132	128	128
query21	219	141	118	118
query22	4767	4913	4736	4736
query23	16819	16036	15675	15675
query23_1	15817	15717	15860	15717
query24	7956	1690	1223	1223
query24_1	1279	1246	1246	1246
query25	574	475	420	420
query26	1245	273	153	153
query27	2769	470	303	303
query28	4498	1861	1831	1831
query29	891	592	503	503
query30	310	244	210	210
query31	1353	1293	1239	1239
query32	82	78	72	72
query33	513	328	293	293
query34	907	918	567	567
query35	643	686	592	592
query36	1058	1128	935	935
query37	130	97	95	95
query38	2985	2987	2922	2922
query39	888	873	839	839
query39_1	827	834	839	834
query40	247	155	135	135
query41	62	60	59	59
query42	308	306	297	297
query43	239	263	242	242
query44	
query45	197	198	186	186
query46	863	989	600	600
query47	2073	2122	2038	2038
query48	308	319	230	230
query49	623	459	386	386
query50	679	279	222	222
query51	4106	4076	4050	4050
query52	289	297	283	283
query53	301	346	279	279
query54	349	272	274	272
query55	93	93	84	84
query56	329	329	316	316
query57	1362	1321	1299	1299
query58	288	289	285	285
query59	1374	1459	1326	1326
query60	337	335	330	330
query61	145	141	143	141
query62	637	574	540	540
query63	318	280	280	280
query64	5093	1266	985	985
query65	
query66	1452	460	360	360
query67	16355	16354	16357	16354
query68	
query69	392	311	278	278
query70	1005	956	960	956
query71	343	315	306	306
query72	2862	2870	2601	2601
query73	561	550	328	328
query74	10027	9908	9856	9856
query75	2886	2762	2500	2500
query76	2296	1038	695	695
query77	388	396	324	324
query78	11241	11414	10705	10705
query79	1125	786	601	601
query80	1551	642	579	579
query81	567	292	269	269
query82	1192	156	128	128
query83	366	300	253	253
query84	250	114	97	97
query85	1211	480	435	435
query86	416	304	289	289
query87	3204	3156	3044	3044
query88	3544	2664	2650	2650
query89	443	381	366	366
query90	1883	183	184	183
query91	165	162	137	137
query92	79	76	77	76
query93	1034	853	517	517
query94	662	324	304	304
query95	586	399	320	320
query96	640	504	229	229
query97	2499	2538	2453	2453
query98	235	224	208	208
query99	1012	1007	917	917
Total cold run time: 235438 ms
Total hot run time: 153766 ms

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.61% (26233/36632)
Line Coverage 54.33% (275084/506358)
Region Coverage 51.40% (227869/443291)
Branch Coverage 52.90% (98191/185611)

@luwei16
Copy link
Contributor Author

luwei16 commented Mar 13, 2026

run p0

@luwei16
Copy link
Contributor Author

luwei16 commented Mar 13, 2026

run external

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.39% (26885/36632)
Line Coverage 56.67% (286931/506358)
Region Coverage 53.82% (238560/443291)
Branch Coverage 55.64% (103275/185611)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.39% (26884/36632)
Line Coverage 56.66% (286926/506358)
Region Coverage 53.82% (238571/443291)
Branch Coverage 55.64% (103276/185611)

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 17, 2026
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@gavinchou gavinchou merged commit 625b0c4 into apache:master Mar 17, 2026
30 of 32 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 17, 2026
Backup upload reuses snapshot paths returned by MAKE_SNAPSHOT. When a
data dir is later marked as broken, the stale snapshot directory can
still remain on that disk and be picked up by upload. In that case the
upload task may continue into file checksum and remote upload logic with
a snapshot source that is no longer safe to read.

This change adds a broken-storage-path validation step to SnapshotLoader
local source path checking for upload. The check canonicalizes the
snapshot path, matches it to its DataDir, and rejects the source early
when the owning DataDir is offline or the path is listed in
broken_storage_path. That turns the broken-disk case into a normal task
error instead of letting upload continue on an invalid local snapshot
source.

The unit tests cover both the direct broken-path case and a
canonicalized symlink path to ensure the validation cannot be bypassed
by path indirection.
github-actions bot pushed a commit that referenced this pull request Mar 17, 2026
Backup upload reuses snapshot paths returned by MAKE_SNAPSHOT. When a
data dir is later marked as broken, the stale snapshot directory can
still remain on that disk and be picked up by upload. In that case the
upload task may continue into file checksum and remote upload logic with
a snapshot source that is no longer safe to read.

This change adds a broken-storage-path validation step to SnapshotLoader
local source path checking for upload. The check canonicalizes the
snapshot path, matches it to its DataDir, and rejects the source early
when the owning DataDir is offline or the path is listed in
broken_storage_path. That turns the broken-disk case into a normal task
error instead of letting upload continue on an invalid local snapshot
source.

The unit tests cover both the direct broken-path case and a
canonicalized symlink path to ensure the validation cannot be bypassed
by path indirection.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants