[Enhancement](be) Fix small_file_mgr to support HTTPS when FE runs in HTTPS-only mode #63918

Open
nsivarajan wants to merge 1 commit into apache:master from nsivarajan:handle-small-file-mgr

@nsivarajan
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

When Doris FE is configured with enable_https=true and http_port=0 (HTTPS-only hardened deployment), the BE's SmallFileMgr fails to download small files (SSL certificates, UDF jars, Kerberos keytabs) from the FE master.

SmallFileMgr is the only BE→FE path that uses HTTP rather than Thrift/RPC. It downloads files via /api/get_small_file using a hardcoded http:// scheme. When the FE disables HTTP (http_port=0), this connection is refused and the download fails, which breaks features that depend on small files, such as Routine Load with Kafka SSL certificates.

What is changed and how does it work?

_download_file() in small_file_mgr.cpp now tries HTTP first (preserving zero-overhead behavior for existing HTTP deployments), then falls back to HTTPS if HTTP fails. The HTTPS attempt uses use_untrusted_ssl() which skips TLS certificate chain verification.
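
The try-HTTP-then-HTTPS order described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `FetchResult`, `Fetcher`, and `download_with_fallback` are hypothetical names, the transport is injected as a callback so the control flow can be exercised without a network, and the `skip_tls_verify` flag only models what use_untrusted_ssl() is described as doing.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for doris::Status: success flag plus an error message.
struct FetchResult {
    bool ok;
    std::string error;
};

// Hypothetical transport hook. `skip_tls_verify` models the effect this PR
// attributes to use_untrusted_ssl(); it is not the real HttpClient API.
using Fetcher = std::function<FetchResult(const std::string& url, bool skip_tls_verify)>;

// Tries plain HTTP first and retries over HTTPS on the same host and port only
// if the HTTP attempt fails. Every URL that is attempted is appended to `attempted`.
FetchResult download_with_fallback(const std::string& host, int port, const std::string& path,
                                   const Fetcher& fetch, std::vector<std::string>& attempted) {
    assert(fetch != nullptr);  // a transport must be supplied
    const std::string authority = host + ":" + std::to_string(port);

    const std::string http_url = "http://" + authority + path;
    attempted.push_back(http_url);
    FetchResult result = fetch(http_url, /*skip_tls_verify=*/false);
    if (result.ok) {
        return result;  // existing HTTP deployments stop here with no extra request
    }

    const std::string https_url = "https://" + authority + path;
    attempted.push_back(https_url);
    return fetch(https_url, /*skip_tls_verify=*/true);
}
```

With a fetcher that only accepts https:// URLs, the function records two attempts and returns the HTTPS result; with a fetcher that accepts http://, it returns after the first attempt.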

This is safe for two reasons:

  1. This is internal cluster traffic on a private network (FE master → BE).
  2. Every downloaded file is independently verified via MD5 checksum after download, making it computationally infeasible for a tampered file to pass undetected.

Note: A companion FE PR (#60921) is needed for the complete fix. The FE HeartbeatMgr must send https_port (not http_port) to BEs when enable_https=true, so that master_fe_http_port contains the correct port for both the HTTP and HTTPS attempts. Without the FE change, this BE change is safe (no regression). With both PRs merged, the end-to-end fix is complete.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@nsivarajan force-pushed the handle-small-file-mgr branch from 130baac to 9d88913 on May 31, 2026 09:48
@nsivarajan
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/22) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.01% (21025/38928)
Line Coverage 37.55% (199185/530503)
Region Coverage 33.82% (155965/461109)
Branch Coverage 34.80% (67858/195000)

@nsivarajan force-pushed the handle-small-file-mgr branch from 9d88913 to 7acfeb8 on May 31, 2026 11:59
@nsivarajan
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31600 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7acfeb84a08b2c337451a0a7421cb17a4a12e0f0, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17610	3997	3964	3964
q2	
q3	10852	1364	811	811
q4	4688	478	342	342
q5	7666	2220	2097	2097
q6	250	176	134	134
q7	956	778	635	635
q8	9366	1624	1737	1624
q9	6785	4963	4931	4931
q10	6438	2236	1858	1858
q11	436	269	240	240
q12	701	426	292	292
q13	18248	3389	2785	2785
q14	268	260	234	234
q15	
q16	826	769	708	708
q17	1026	995	1066	995
q18	6714	5731	5541	5541
q19	1223	1255	1204	1204
q20	551	445	288	288
q21	5888	2768	2598	2598
q22	447	370	319	319
Total cold run time: 100939 ms
Total hot run time: 31600 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4804	4748	4974	4748
q2	
q3	4926	5275	4615	4615
q4	2174	2210	1410	1410
q5	4851	4824	4679	4679
q6	247	185	135	135
q7	1853	1722	1560	1560
q8	2432	1983	1928	1928
q9	7437	7367	7385	7367
q10	4749	4691	4207	4207
q11	539	395	363	363
q12	735	749	535	535
q13	3044	3394	2800	2800
q14	268	280	253	253
q15	
q16	685	705	619	619
q17	1282	1265	1247	1247
q18	7273	7019	6805	6805
q19	1102	1091	1107	1091
q20	2219	2225	1934	1934
q21	5265	4633	4429	4429
q22	520	455	395	395
Total cold run time: 56405 ms
Total hot run time: 51120 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171308 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7acfeb84a08b2c337451a0a7421cb17a4a12e0f0, data reload: false

query5	4348	665	512	512
query6	360	216	200	200
query7	4316	517	310	310
query8	323	235	225	225
query9	8791	4061	4086	4061
query10	471	347	314	314
query11	5796	2401	2217	2217
query12	186	134	126	126
query13	1293	623	440	440
query14	6103	5538	5166	5166
query14_1	4447	4507	4474	4474
query15	210	209	185	185
query16	1020	486	449	449
query17	1159	766	634	634
query18	2611	512	375	375
query19	230	216	183	183
query20	141	136	139	136
query21	222	139	119	119
query22	13629	13614	13353	13353
query23	17521	16582	16238	16238
query23_1	16281	16286	16393	16286
query24	7512	1768	1325	1325
query24_1	1296	1320	1307	1307
query25	592	504	446	446
query26	1308	321	181	181
query27	2681	562	349	349
query28	4442	2026	1969	1969
query29	997	672	523	523
query30	301	240	207	207
query31	1137	1096	959	959
query32	92	80	76	76
query33	583	375	308	308
query34	1178	1150	655	655
query35	773	826	699	699
query36	1392	1420	1232	1232
query37	151	101	96	96
query38	3214	3111	3075	3075
query39	941	931	879	879
query39_1	889	889	879	879
query40	252	147	128	128
query41	65	63	62	62
query42	107	108	110	108
query43	333	330	303	303
query44	
query45	216	202	192	192
query46	1106	1200	739	739
query47	2440	2404	2252	2252
query48	419	434	293	293
query49	652	511	397	397
query50	1061	376	262	262
query51	4361	4375	4292	4292
query52	113	108	98	98
query53	263	286	213	213
query54	311	295	269	269
query55	97	94	91	91
query56	312	325	309	309
query57	1447	1429	1341	1341
query58	289	268	270	268
query59	1611	1659	1448	1448
query60	318	316	307	307
query61	159	156	158	156
query62	696	652	577	577
query63	237	205	203	203
query64	2416	797	662	662
query65	
query66	1692	479	359	359
query67	29756	29133	29510	29133
query68	
query69	469	339	355	339
query70	1030	1020	993	993
query71	315	280	266	266
query72	3012	2675	2407	2407
query73	876	797	426	426
query74	5128	4989	4780	4780
query75	2697	2634	2310	2310
query76	2296	1142	793	793
query77	413	409	345	345
query78	12478	12354	11847	11847
query79	1471	1023	774	774
query80	811	569	475	475
query81	480	282	244	244
query82	1333	164	125	125
query83	363	274	251	251
query84	267	146	119	119
query85	916	533	473	473
query86	428	371	339	339
query87	3466	3394	3230	3230
query88	3617	2719	2707	2707
query89	460	396	343	343
query90	1874	185	182	182
query91	180	174	140	140
query92	85	77	76	76
query93	1529	1451	957	957
query94	609	357	279	279
query95	667	388	455	388
query96	1038	788	348	348
query97	2774	2722	2619	2619
query98	242	238	241	238
query99	1191	1156	1006	1006
Total cold run time: 254996 ms
Total hot run time: 171308 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/24) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.01% (21025/38928)
Line Coverage 37.55% (199185/530505)
Region Coverage 33.84% (156063/461123)
Branch Coverage 34.80% (67864/195002)

@nsivarajan
Contributor Author

run external

1 similar comment
@nsivarajan
Contributor Author

run external

@nsivarajan marked this pull request as ready for review on May 31, 2026 16:49
@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/24) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.85% (28149/38117)
Line Coverage 57.79% (305769/529127)
Region Coverage 54.98% (255965/465523)
Branch Coverage 56.42% (110422/195728)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/24) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.88% (28161/38117)
Line Coverage 57.83% (305970/529127)
Region Coverage 55.02% (256130/465523)
Branch Coverage 56.47% (110525/195728)

@morrySnow
Contributor

/review

@morrySnow requested a review from morningman on June 1, 2026 07:01
Contributor

@github-actions (bot) left a comment

I found one blocking issue: the BE fallback still uses the FE HTTP port value, so this PR by itself does not fix the stated HTTPS-only small-file download case.

Critical checkpoint conclusions:

  • Goal/test: The goal is to let BE SmallFileMgr download files when FE HTTPS is enabled or HTTP is disabled. The current code does not accomplish that on this branch because FE heartbeat still sends Config.http_port, and no test/manual result is provided to prove the end-to-end HTTPS-only case.
  • Scope/focus: The change is small and localized, but it depends on an FE protocol/heartbeat change that is not present in this PR.
  • Concurrency/lifecycle: No new shared-state concurrency, lock-order, static initialization, or lifecycle issue found beyond the existing SmallFileMgr lock-held download behavior.
  • Configuration/compatibility: The change relies on the semantics of master_fe_http_port changing to contain https_port when enable_https=true. That compatibility/protocol change is not included here.
  • Parallel paths: Other BE HTTP clients are not modified; for this specific small-file path, the missing FE port propagation is the blocking parallel-side change.
  • Error handling/memory/observability: Existing Status propagation is mostly preserved. The retry handles partial temp output by rewinding/truncating, though it cannot recover the intended scenario without the right port.
  • Tests: No regression/unit/manual evidence was included, and the primary HTTPS-only scenario remains unverified in this PR.
  • User focus: No additional user-provided review focus was present.
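
The rewind/truncate step mentioned under error handling can be illustrated with a small POSIX sketch. It assumes the temp output is a `std::FILE*` and is not taken from small_file_mgr.cpp; `reset_partial_output` and `stream_size` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdio>
#include <unistd.h>  // ftruncate (POSIX)

// Discards anything a failed attempt wrote so that a retry starts from an
// empty file. Returns true on success.
bool reset_partial_output(std::FILE* fp) {
    assert(fp != nullptr);
    if (std::fflush(fp) != 0) return false;           // push buffered bytes to the fd first
    if (ftruncate(fileno(fp), 0) != 0) return false;  // drop the partial content
    std::rewind(fp);                                  // move the write position back to 0
    return true;
}

// Demonstration helper: current size of the stream in bytes.
long stream_size(std::FILE* fp) {
    assert(fp != nullptr);
    const long pos = std::ftell(fp);
    std::fseek(fp, 0, SEEK_END);
    const long size = std::ftell(fp);
    std::fseek(fp, pos, SEEK_SET);  // restore the caller's position
    return size;
}
```

Truncating without rewinding would leave the write offset past the end of the file, so a retry would produce a file with a leading hole; doing both keeps the retried download byte-identical to a first-try download.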

HttpClient https_client;
RETURN_IF_ERROR(https_client.init(url));
// Skip TLS cert verification: internal cluster traffic only; file integrity
// is guaranteed independently by MD5 checksum verification below.
Contributor

This fallback still builds the HTTPS URL from master_fe_http_port, but on this branch FE heartbeats populate that field from Config.http_port (HeartbeatMgr.setMaster() calls tMasterInfo.setHttpPort(Config.http_port)). In the stated HTTPS-only case (enable_https=true, http_port=0), the first request goes to http://host:0/... and this retry goes to https://host:0/..., so small-file download still fails. Please include the FE heartbeat change that sends https_port when HTTPS is enabled, or otherwise make BE use an actual HTTPS port before adding this fallback.
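
To make the port problem concrete, here is a small sketch of which URLs the fallback would try under each heartbeat behavior. It is an illustration only: the function names are hypothetical, `fe-host` is a placeholder, and the "proposed" selection simply mirrors the Java ternary quoted later in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Port carried in the heartbeat as described for this branch:
// always http_port, regardless of enable_https.
int heartbeat_port_current(bool /*enable_https*/, int http_port, int /*https_port*/) {
    return http_port;
}

// Port selection proposed in #60921: https_port when HTTPS is enabled.
int heartbeat_port_proposed(bool enable_https, int http_port, int https_port) {
    return enable_https ? https_port : http_port;
}

// The two URLs the BE fallback would try, in order, for a given heartbeat port.
std::vector<std::string> candidate_urls(const std::string& host, int port,
                                        const std::string& path) {
    assert(port >= 0);
    const std::string authority = host + ":" + std::to_string(port);
    return {"http://" + authority + path, "https://" + authority + path};
}
```

With enable_https=true, http_port=0, and https_port=8050, the current heartbeat yields http://fe-host:0/... followed by https://fe-host:0/..., while the proposed heartbeat yields the same pair on port 8050, where only the HTTPS attempt can succeed.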

Contributor Author

This PR is a preparatory BE change: it adds an HTTPS fallback to _download_file(). When HTTP fails, it retries with HTTPS using use_untrusted_ssl() (safe for internal cluster traffic; the MD5 checksum independently verifies file integrity).

This PR causes no regression:

  • enable_https=false: HTTP succeeds on first try, no change.
  • enable_https=true, http_port=0 (without FE PR): both attempts fail as before, so the outcome is unchanged apart from ~1ms of extra overhead on failure.

Activation

This fix takes effect automatically once the companion FE PR #60921 is merged. That PR changes HeartbeatMgr to send https_port to BEs when enable_https=true. From the next heartbeat cycle, master_fe_http_port on every BE becomes 8050 (or whatever https_port is configured), the HTTP attempt fails fast, and the HTTPS fallback succeeds. No BE restart or config change is needed.

Contributor Author

For reference, the HeartbeatMgr fix in #60921, which sets the HTTPS scheme and HTTPS port:

    public void setMaster(int clusterId, String token, long epoch) {
        TMasterInfo tMasterInfo = new TMasterInfo(
                new TNetworkAddress(FrontendOptions.getLocalHostAddress(), Config.rpc_port), clusterId, epoch);
        tMasterInfo.setToken(token);
        tMasterInfo.setHttpPort(Config.enable_https ? Config.https_port : Config.http_port);
        long flags = heartbeatFlags.getHeartbeatFlags();
        tMasterInfo.setHeartbeatFlags(flags);
        if (Config.isCloudMode()) {
            // Set the endpoint for the metadata service in cloud mode
            tMasterInfo.setMetaServiceEndpoint(Config.meta_service_endpoint);
        }
        masterInfo.set(tMasterInfo);
    }

@nsivarajan
Contributor Author

@morrySnow @morningman Thanks for the review. As mentioned, this PR is preparatory work for #60921. small_file_mgr is the only place where the BE calls the FE over HTTP, so adding a fast fallback does no harm. Once #60921 is merged, this path automatically picks up the HTTPS scheme and the HTTPS port (8050 or any user-defined value).

Please help me get this PR merged, and help with the follow-up PR: #60921

