Skip to content

Conversation

@kaijchen
Copy link
Member

What problem does this PR solve?

Problem Summary:

For event-driven warmup jobs, ensure warmup only triggers if tablet exists on the target backend.
Because the source BE will cache the tablet location for a short time,
after rebalancing, the old target BE will still receive warm up requests during that time.
This PR prevents the old target BE to re-fetch the tablet cache when it no longer holds the tablet.
For the new target BE, it will take a different path (warm_up_cache_async) to do the warm up.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaijchen kaijchen force-pushed the fix-warmup-rowset-local-only-master branch from 729ba78 to 87b39be Compare August 6, 2025 08:41
@kaijchen kaijchen marked this pull request as ready for review August 6, 2025 08:41
@kaijchen
Copy link
Member Author

kaijchen commented Aug 6, 2025

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 33746 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 87b39be41e60ed3a5c62dedf3b07644baa1b66dd, data reload: false

------ Round 1 ----------------------------------
q1	17563	5272	5329	5272
q2	1976	298	189	189
q3	10282	1392	694	694
q4	10223	1020	532	532
q5	7949	2366	2275	2275
q6	198	166	133	133
q7	914	753	598	598
q8	9294	1305	1060	1060
q9	6799	5188	5105	5105
q10	6888	2349	1956	1956
q11	472	285	270	270
q12	344	364	223	223
q13	17777	3456	2969	2969
q14	236	234	221	221
q15	540	451	460	451
q16	413	425	377	377
q17	561	819	351	351
q18	7348	7018	7023	7018
q19	3236	993	547	547
q20	327	309	209	209
q21	3438	3054	2300	2300
q22	1054	1074	996	996
Total cold run time: 107832 ms
Total hot run time: 33746 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5519	5332	5307	5307
q2	238	310	210	210
q3	2051	2547	2224	2224
q4	1339	1692	1313	1313
q5	4200	4477	4487	4477
q6	231	187	141	141
q7	2053	1928	1686	1686
q8	2601	2417	2615	2417
q9	7338	7297	7319	7297
q10	3239	3360	2834	2834
q11	536	540	497	497
q12	784	987	645	645
q13	3433	3893	3399	3399
q14	275	295	280	280
q15	497	461	451	451
q16	410	521	462	462
q17	1144	1357	1324	1324
q18	8279	7758	7959	7758
q19	10798	905	861	861
q20	3771	1906	1731	1731
q21	14086	4330	4179	4179
q22	1124	1050	977	977
Total cold run time: 73946 ms
Total hot run time: 50470 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 159513 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 87b39be41e60ed3a5c62dedf3b07644baa1b66dd, data reload: false

reason	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 22:43:56	2023-12-26 22:44:01	NULL	utf-8	NULL	NULL	
============================================
query1	981	367	431	367
query2	6533	1778	1684	1684
query3	6748	230	225	225
query4	26703	23724	22854	22854
query5	4377	594	503	503
query6	312	227	224	224
query7	4625	528	295	295
query8	274	256	237	237
query9	8593	2948	2927	2927
query10	489	339	284	284
query11	15366	15321	14749	14749
query12	183	135	133	133
query13	1670	535	418	418
query14	8512	5785	5728	5728
query15	205	188	179	179
query16	7129	664	505	505
query17	981	769	652	652
query18	1993	445	329	329
query19	222	213	185	185
query20	148	149	136	136
query21	217	125	113	113
query22	4162	3980	3873	3873
query23	34371	34156	34471	34156
query24	5130	2361	2422	2361
query25	474	510	448	448
query26	720	298	165	165
query27	2251	518	353	353
query28	3082	2314	2301	2301
query29	673	597	478	478
query30	287	233	192	192
query31	841	782	713	713
query32	90	76	78	76
query33	494	413	371	371
query34	790	862	515	515
query35	797	829	754	754
query36	999	1022	898	898
query37	133	108	90	90
query38	3904	3938	3882	3882
query39	1446	1360	1367	1360
query40	243	141	135	135
query41	62	63	54	54
query42	134	122	124	122
query43	511	512	465	465
query44	1384	850	870	850
query45	198	186	170	170
query46	943	1059	660	660
query47	1772	1877	1750	1750
query48	392	422	317	317
query49	667	503	441	441
query50	660	703	414	414
query51	4175	4162	4097	4097
query52	129	132	114	114
query53	255	291	210	210
query54	650	631	548	548
query55	88	85	92	85
query56	354	351	342	342
query57	1210	1223	1137	1137
query58	347	362	328	328
query59	2740	2703	2511	2511
query60	409	379	410	379
query61	129	123	125	123
query62	758	732	664	664
query63	248	224	212	212
query64	2383	1087	767	767
query65	4263	4123	4101	4101
query66	1086	442	334	334
query67	query68	16644	633	590	590
query69	1002	324	303	303
query70	1425	1088	1105	1088
query71	740	345	314	314
query72	9252	2274	2374	2274
query73	3538	627	354	354
query74	8978	8957	8657	8657
query75	7563	3106	2634	2634
query76	8827	1206	787	787
query77	1161	409	352	352
query78	9612	11469	query79	16205	594	578	578
query80	3167	595	476	476
query81	563	256	223	223
query82	608	150	119	119
query83	407	293	277	277
query84	304	96	159	96
query85	1267	368	333	333
query86	395	329	284	284
query87	4237	4192	4149	4149
query88	5555	2228	2224	2224
query89	541	346	310	310
query90	2492	222	227	222
query91	144	139	108	108
query92	91	70	65	65
query93	6673	959	660	660
query94	1226	392	276	276
query95	427	321	311	311
query96	499	610	275	275
query97	2677	2722	2627	2627
query98	252	227	225	225
query99	1476	1334	1279	1279
Total cold run time: 291431 ms
Total hot run time: 159513 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 33.25 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 87b39be41e60ed3a5c62dedf3b07644baa1b66dd, data reload: false

query1	0.04	0.04	0.04
query2	0.07	0.04	0.04
query3	0.25	0.07	0.07
query4	1.60	0.10	0.10
query5	0.42	0.40	0.41
query6	1.16	0.65	0.66
query7	0.02	0.02	0.02
query8	0.05	0.04	0.03
query9	0.55	0.48	0.47
query10	0.52	0.52	0.53
query11	0.16	0.10	0.11
query12	0.14	0.11	0.11
query13	0.64	0.64	0.63
query14	0.93	1.19	0.97
query15	0.94	0.87	0.89
query16	0.38	0.40	0.39
query17	1.04	1.06	1.07
query18	0.22	0.20	0.20
query19	1.96	1.80	1.90
query20	0.02	0.01	0.02
query21	15.40	0.84	0.57
query22	0.79	1.37	0.71
query23	14.76	1.22	0.62
query24	6.56	1.48	1.01
query25	0.48	0.30	0.23
query26	0.54	0.15	0.12
query27	0.05	0.06	0.05
query28	10.39	0.86	0.44
query29	12.59	3.76	3.30
query30	3.00	2.98	3.06
query31	2.82	0.57	0.39
query32	3.24	0.57	0.50
query33	3.02	3.18	3.21
query34	15.98	5.38	4.94
query35	4.88	4.91	4.95
query36	0.69	0.52	0.49
query37	0.12	0.07	0.07
query38	0.06	0.04	0.05
query39	0.03	0.02	0.02
query40	0.18	0.13	0.12
query41	0.09	0.03	0.03
query42	0.04	0.02	0.03
query43	0.04	0.04	0.03
Total cold run time: 106.86 s
Total hot run time: 33.25 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 58.30% (16382/28100)
Line Coverage 47.19% (148052/313717)
Region Coverage 36.15% (110792/306510)
Branch Coverage 39.03% (49192/126025)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/4) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 81.20% (22392/27577)
Line Coverage 73.92% (231868/313658)
Region Coverage 61.20% (192601/314713)
Branch Coverage 65.26% (83240/127553)

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Aug 11, 2025
@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@dataroaring dataroaring merged commit ad9138b into apache:master Aug 11, 2025
26 of 28 checks passed
dataroaring pushed a commit that referenced this pull request Aug 13, 2025
…backend (#53852) (#53850)

### What problem does this PR solve?

Cherry-pick: #53852

Problem Summary:

For event-driven warmup jobs, ensure warmup only triggers if tablet
exists on the target backend.
Because the source BE will cache the tablet location for a short time,
after rebalancing, the old target BE will still receive warm up requests
during that time.
This PR prevents the old target BE to re-fetch the tablet cache when it
no longer holds the tablet.
For the new target BE, it will take a different path
(`warm_up_cache_async`) to do the warm up.

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
morrySnow pushed a commit that referenced this pull request Aug 14, 2025
…error #53852 (#54445)

### What problem does this PR solve?

Related PR: #53852

When warm up rowset on event, if the tablet is not found on the target
BE, it will now return an error to the source BE.
The source BE will try to refresh the tablet location cache, and retry
the warmup with `force` flag.
When the `force` flag is set, the target BE will do the warm up
regardless of the tablet existence.
kaijchen added a commit to kaijchen/doris that referenced this pull request Aug 14, 2025
…error apache#53852 (apache#54445)

Related PR: apache#53852

When warm up rowset on event, if the tablet is not found on the target
BE, it will now return an error to the source BE.
The source BE will try to refresh the tablet location cache, and retry
the warmup with `force` flag.
When the `force` flag is set, the target BE will do the warm up
regardless of the tablet existence.
dataroaring pushed a commit that referenced this pull request Aug 20, 2025
### What problem does this PR solve?

Related PR: #53852

When warm up rowset on event, if the tablet is not found on the target
BE, it will now return an error to the source BE.
The source BE will try to refresh the tablet location cache, and retry
the warmup with force flag.
When the force flag is set, the target BE will do the warm up regardless
of the tablet existence.

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
@gavinchou gavinchou mentioned this pull request Sep 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants