[fix](proxy) Fix getProxy frequently parse hostname to ip #28479

Closed
wants to merge 2 commits into from

Conversation

xinyiZzz
Contributor

Proposed changes

Every RPC between a Doris FE and BE calls getProxy, and every getProxy call resolves the backend hostname to check whether its IP has changed.

When the hostname resolution server is unstable, these frequent lookups often fail and cause performance problems.

Therefore, this change checks whether the backend IP has changed only during heartbeat and, if it has, recreates the RPC client.
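
A minimal sketch of the idea described above, not the actual Doris code: the per-RPC path reuses an existing client without touching DNS, and only the heartbeat path re-resolves the hostname and replaces the client when the IP differs. The class `HeartbeatProxyCache`, the nested `Client` type, the `onHeartbeat` method, and the injected resolver function are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch; names and structure are assumptions, not Doris APIs. */
class HeartbeatProxyCache {
    /** Stand-in for an RPC client bound to one resolved address. */
    static final class Client {
        final String host;
        final String ip;

        Client(String host, String ip) {
            this.host = host;
            this.ip = ip;
        }
    }

    private final Map<String, Client> clients = new ConcurrentHashMap<>();
    // hostname -> IP; injected so the sketch does not depend on a real DNS server
    private final Function<String, String> resolver;

    HeartbeatProxyCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Hot path, called on every RPC: resolves only if no client exists yet. */
    Client getProxy(String host) {
        return clients.computeIfAbsent(host, h -> new Client(h, resolver.apply(h)));
    }

    /**
     * Cold path, called once per heartbeat: re-resolves the hostname and
     * replaces the client only when none exists or the IP has changed.
     * Returns true if a client was created or replaced.
     */
    boolean onHeartbeat(String host) {
        String ip = resolver.apply(host);
        Client current = clients.get(host);
        if (current != null && current.ip.equals(ip)) {
            return false;
        }
        clients.put(host, new Client(host, ip));
        return true;
    }
}
```

With this shape, the number of DNS lookups scales with the heartbeat rate rather than the RPC rate, at the cost of noticing an IP change only at the next heartbeat.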

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Contributor

run buildall

@doris-robot

TeamCity pipeline (from new machine), clickbench performance test result:
the sum of best hot time: 43.84 seconds
stream load tsv: 584 seconds loaded 74807831229 Bytes, about 122 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17220810681 Bytes

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 2d7a4e5e5416a91fbab187242cba8b1f6f098d6d, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4701	4483	4456	4456
q2	363	152	159	152
q3	1450	1239	1239	1239
q4	1123	901	943	901
q5	3155	3201	3199	3199
q6	242	128	124	124
q7	995	496	479	479
q8	2184	2218	2196	2196
q9	6717	6699	6715	6699
q10	3245	3276	3294	3276
q11	331	207	206	206
q12	354	214	214	214
q13	4552	3806	3794	3794
q14	240	220	208	208
q15	569	523	534	523
q16	442	389	393	389
q17	1031	621	571	571
q18	7096	6913	6776	6776
q19	1542	1468	1414	1414
q20	534	307	282	282
q21	3076	2629	2628	2628
q22	347	284	284	284
Total cold run time: 44289 ms
Total hot run time: 40010 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4421	4394	4393	4393
q2	268	163	177	163
q3	3514	3518	3496	3496
q4	2381	2371	2371	2371
q5	5742	5744	5748	5744
q6	239	121	119	119
q7	2356	1829	1841	1829
q8	3533	3540	3515	3515
q9	9044	8931	8971	8931
q10	3927	3987	4021	3987
q11	506	386	383	383
q12	769	596	631	596
q13	4320	3567	3533	3533
q14	291	247	266	247
q15	584	531	527	527
q16	506	461	456	456
q17	1903	1852	1839	1839
q18	8613	8207	8182	8182
q19	1753	1756	1747	1747
q20	2255	1939	1944	1939
q21	6686	6114	6105	6105
q22	502	417	416	416
Total cold run time: 64113 ms
Total hot run time: 60518 ms

@xinyiZzz
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Tpch sf100 test result on commit 5448962f329479148b1c4541bae3f2d172377c68, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4744	4475	4451	4451
q2	361	154	154	154
q3	1459	1270	1199	1199
q4	1112	910	895	895
q5	3134	3151	3164	3151
q6	251	127	128	127
q7	985	477	494	477
q8	2184	2220	2211	2211
q9	6691	6679	6692	6679
q10	3204	3276	3271	3271
q11	327	196	205	196
q12	354	208	210	208
q13	4581	3842	3839	3839
q14	239	215	220	215
q15	568	524	524	524
q16	443	376	383	376
q17	1014	623	548	548
q18	7192	6994	6919	6919
q19	1538	1451	1392	1392
q20	518	324	285	285
q21	3128	2628	2627	2627
q22	357	282	285	282
Total cold run time: 44384 ms
Total hot run time: 40026 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4402	4420	4422	4420
q2	266	162	173	162
q3	3541	3524	3517	3517
q4	2388	2372	2367	2367
q5	5736	5744	5731	5731
q6	242	120	120	120
q7	2379	1897	1872	1872
q8	3521	3540	3530	3530
q9	9004	8983	9008	8983
q10	3945	3982	4011	3982
q11	504	390	380	380
q12	765	601	602	601
q13	4328	3567	3566	3566
q14	284	263	249	249
q15	569	523	527	523
q16	512	453	441	441
q17	1887	1881	1856	1856
q18	8641	8348	8304	8304
q19	1739	1736	1765	1736
q20	2253	1963	1926	1926
q21	6539	6185	6182	6182
q22	501	419	433	419
Total cold run time: 63946 ms
Total hot run time: 60867 ms

@xinyiZzz
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Tpch sf100 test result on commit 838929d7631480f74ad3b4ae0abce01c54654061, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4715	4473	4450	4450
q2	362	150	159	150
q3	1452	1217	1198	1198
q4	1128	920	919	919
q5	3148	3132	3158	3132
q6	244	126	128	126
q7	985	486	489	486
q8	2213	2241	2179	2179
q9	6705	6676	6669	6669
q10	3217	3277	3268	3268
q11	329	197	208	197
q12	349	218	213	213
q13	4594	3820	3796	3796
q14	237	212	214	212
q15	561	519	524	519
q16	433	388	382	382
q17	1028	605	497	497
q18	7238	6907	6891	6891
q19	1524	1457	1366	1366
q20	513	307	281	281
q21	3086	2605	2643	2605
q22	353	282	290	282
Total cold run time: 44414 ms
Total hot run time: 39818 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4429	4383	4379	4379
q2	267	165	181	165
q3	3539	3512	3502	3502
q4	2371	2371	2352	2352
q5	5705	5734	5726	5726
q6	243	123	122	122
q7	2390	1874	1882	1874
q8	3509	3522	3519	3519
q9	8994	8996	8955	8955
q10	3907	4003	4005	4003
q11	497	369	381	369
q12	775	604	590	590
q13	4291	3533	3564	3533
q14	293	259	239	239
q15	569	524	517	517
q16	511	444	446	444
q17	1908	1851	1854	1851
q18	8611	8336	8226	8226
q19	1737	1724	1733	1724
q20	2244	1950	1931	1931
q21	6509	6152	6108	6108
q22	502	442	420	420
Total cold run time: 63801 ms
Total hot run time: 60549 ms

@hello-stephen
Contributor

run clickbench

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Tpch sf100 test result on commit 838929d7631480f74ad3b4ae0abce01c54654061, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4701	4480	4519	4480
q2	365	149	155	149
q3	1477	1290	1246	1246
q4	1115	921	919	919
q5	3183	3190	3187	3187
q6	246	126	130	126
q7	1021	482	488	482
q8	2213	2221	2212	2212
q9	6716	6725	6706	6706
q10	3214	3258	3266	3258
q11	327	205	205	205
q12	344	220	209	209
q13	4685	3798	3810	3798
q14	243	219	215	215
q15	568	534	530	530
q16	443	381	386	381
q17	1039	659	561	561
q18	7234	6923	6926	6923
q19	1551	1476	1360	1360
q20	514	317	311	311
q21	3111	2635	2706	2635
q22	353	282	285	282
Total cold run time: 44663 ms
Total hot run time: 40175 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4426	4461	4411	4411
q2	269	165	171	165
q3	3543	3543	3554	3543
q4	2392	2384	2377	2377
q5	5754	5734	5753	5734
q6	242	122	120	120
q7	2389	1883	1866	1866
q8	3529	3534	3520	3520
q9	9045	9053	9031	9031
q10	3909	4025	4024	4024
q11	501	397	395	395
q12	779	611	597	597
q13	4401	3622	3574	3574
q14	286	257	264	257
q15	586	521	528	521
q16	522	461	462	461
q17	1897	1879	1866	1866
q18	8675	8335	8274	8274
q19	1751	1767	1754	1754
q20	2260	1963	1921	1921
q21	6556	6191	6152	6152
q22	515	435	430	430
Total cold run time: 64227 ms
Total hot run time: 60993 ms

@doris-robot

TeamCity pipeline (from new machine), clickbench performance test result:
the sum of best hot time: 44.15 seconds
stream load tsv: 600 seconds loaded 74807831229 Bytes, about 118 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 67 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.6 seconds inserted 10000000 Rows, about 349K ops/s
storage size: 17220663357 Bytes

@xiaokang added the usercase and dev/2.0.4 labels Dec 18, 2023
@wm1581066 added the need_more_review label and removed the usercase label Jan 9, 2024
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@xinyiZzz closed this Mar 27, 2024
morningman added a commit that referenced this pull request Mar 28, 2024
Previously, when FQDN was enabled, Doris called the DNS resolver to get the IP from the hostname
each time 1) an FE got a BE's grpc client, or 2) a BE got another BE's brpc client.
Under high concurrency, the DNS resolver could become overloaded and fail to resolve hostnames.

This PR mainly changes:

1. Add DNSCache for both FE and BE.
    The DNSCache runs on every FE and BE node. It holds a cache whose key is the hostname and whose value is the IP.
    A caller gets the IP for a hostname from this cache; if the hostname is not present, the cache tries to resolve it
    and stores the result.
    In addition, DNSCache has a daemon thread that refreshes the cache every minute, in case an IP
    changes at any time.
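
The cache behaviour described above could look roughly like the following. This is a hedged sketch under stated assumptions, not the DNSCache from the referenced commit: the class name `DnsCacheSketch`, the injected resolver function, and the `refreshAll` and `withSystemResolver` helpers are hypothetical. It keeps the last known IP when a refresh fails, which is a design choice of this sketch rather than something the commit message states.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch of a hostname -> IP cache with periodic refresh. */
class DnsCacheSketch implements AutoCloseable {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver; // hostname -> IP
    private final ScheduledExecutorService refresher;

    DnsCacheSketch(Function<String, String> resolver, long refreshPeriodSeconds) {
        this.resolver = resolver;
        // Daemon thread, so the refresher never keeps the JVM alive.
        this.refresher = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dns-cache-refresher");
            t.setDaemon(true);
            return t;
        });
        refresher.scheduleWithFixedDelay(this::refreshAll,
                refreshPeriodSeconds, refreshPeriodSeconds, TimeUnit.SECONDS);
    }

    /** Convenience factory using the JVM resolver and a 60-second refresh. */
    static DnsCacheSketch withSystemResolver() {
        return new DnsCacheSketch(host -> {
            try {
                return InetAddress.getByName(host).getHostAddress();
            } catch (UnknownHostException e) {
                throw new UncheckedIOException(e);
            }
        }, 60);
    }

    /** Returns the cached IP; resolves and stores it on a cache miss. */
    String get(String hostname) {
        String ip = cache.get(hostname);
        if (ip == null) {
            ip = resolver.apply(hostname);
            cache.put(hostname, ip);
        }
        return ip;
    }

    /** Re-resolves every cached hostname; a failed lookup keeps the old IP. */
    void refreshAll() {
        for (String hostname : cache.keySet()) {
            try {
                cache.put(hostname, resolver.apply(hostname));
            } catch (RuntimeException e) {
                // Assumption of this sketch: prefer a stale IP over no IP.
            }
        }
    }

    @Override
    public void close() {
        refresher.shutdownNow();
    }
}
```

Callers on the RPC path would call `get(hostname)` and never block on DNS after the first lookup; an IP change becomes visible after at most one refresh period.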

There are other implementations of this DNS cache:

1.  kaka11chen@36fed13
    This is for the BE side, but it does not handle the IP change case.

3. #28479
    This is for the FE side, but it only works with the Master FE. Other FE nodes will not be aware of the IP change.
    Also, there are several BackendServiceProxy instances, and this PR only handles the cache in one of them.
morningman added a commit to morningman/doris that referenced this pull request Mar 28, 2024
morningman added a commit to morningman/doris that referenced this pull request Apr 7, 2024