fix multi_cluster_management check timeout #7377

hslightdb · 2023-12-12T05:48:33Z

DESCRIPTION: Fixes a bug that occurs when a hostname in pg_dist_node resolves to multiple IPs.
When localhost is repeated in /etc/hosts, as in the following
/etc/hosts:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   localhost

the multi_cluster_management check fails:


@@ -857,20 +857,21 @@
 ERROR:  group 14 already has a primary node
 -- check that you can add secondaries and unavailable nodes to a group
 SELECT groupid AS worker_2_group FROM pg_dist_node WHERE nodeport = :worker_2_port \gset
 SELECT 1 FROM master_add_node('localhost', 9998, groupid => :worker_1_group, noderole => 'secondary');
  ?column?
 ----------
         1
 (1 row)

 SELECT 1 FROM master_add_node('localhost', 9997, groupid => :worker_1_group, noderole => 'unavailable');
+WARNING:  could not establish connection after 5000 ms
  ?column?
 ----------
         1
 (1 row)

This PR is an attempt to fix it.
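To make the setup concrete, here is a minimal sketch that lists every address the /etc/hosts file above associates with a name. The function `addresses_for` and the constant `HOSTS` are hypothetical names used only for illustration. The sketch only parses the file text; it does not reproduce how the system resolver orders or deduplicates results.

```python
def addresses_for(hosts_text, name):
    """Return every address whose hosts-file line lists `name`, in file order."""
    addresses = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address + hostnames.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            addresses.append(fields[0])
    return addresses


# The hosts file from the description above.
HOSTS = """\
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   localhost
"""

print(addresses_for(HOSTS, "localhost"))
# prints ['127.0.0.1', '::1', '127.0.0.1']
```

With this file, localhost maps to three entries, so a client that tries addresses one by one may move on to a second or third candidate during a single connection attempt.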


JelteF · 2023-12-22T09:44:31Z

The change seems reasonable in general, but could you explain a bit more how it fixes the issue you were seeing?

codecov · 2023-12-22T09:47:08Z

Codecov Report

Merging #7377 (6d3453e) into main (8e979f7) will decrease coverage by 9.34%.
The diff coverage is 66.66%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7377      +/-   ##
==========================================
- Coverage   89.51%   80.17%   -9.34%     
==========================================
  Files         280      280              
  Lines       60304    60299       -5     
  Branches     7505     7505              
==========================================
- Hits        53979    48343    -5636     
- Misses       4160     9204    +5044     
- Partials     2165     2752     +587

cstarc · 2023-12-29T09:06:09Z

> The change seems reasonable in general, but could you explain a bit more how it fixes the issue you were seeing?

When /etc/hosts has two localhost entries, the socket changes after MultiConnectionStatePoll (the connection moves from the first localhost address to the next one), but the wait events are not reset for the new socket, so it waits forever.
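The failure mode described above can be sketched with a small simulation. This is not Citus or libpq code: `FakeConnection`, `wait_for_connection`, and the `rearm` flag are hypothetical names, and a pair of local socketpairs stands in for the two addresses. It only illustrates the general pattern of re-registering the wait on the connection's current socket after each poll.

```python
import selectors
import socket


class FakeConnection:
    """Hypothetical stand-in for a connection whose host has two addresses."""

    def __init__(self):
        # One socketpair per candidate address; only the second peer answers.
        self._pairs = [socket.socketpair(), socket.socketpair()]
        for ours, _peer in self._pairs:
            ours.setblocking(False)
        self._index = 0
        self.established = False

    @property
    def sock(self):
        """The socket currently in use; it changes when the address changes."""
        return self._pairs[self._index][0]

    def poll(self):
        """Advance the handshake; may switch to the next address."""
        if self._index == 0:
            # The first address is "unreachable": move on to the next one.
            self._index = 1
            self._pairs[1][1].send(b"ok")  # the second peer answers
            return
        try:
            if self.sock.recv(2) == b"ok":
                self.established = True
        except BlockingIOError:
            pass  # nothing to read yet

    def close(self):
        for ours, peer in self._pairs:
            ours.close()
            peer.close()


def wait_for_connection(conn, timeout=0.2, rearm=True):
    """Return True once established, False if the wait times out."""
    selector = selectors.DefaultSelector()
    registered = conn.sock
    selector.register(registered, selectors.EVENT_READ)
    try:
        while True:
            conn.poll()
            if conn.established:
                return True
            if rearm and conn.sock is not registered:
                # The socket changed during poll: wait on the new one.
                selector.unregister(registered)
                registered = conn.sock
                selector.register(registered, selectors.EVENT_READ)
            if not selector.select(timeout):
                return False  # no event arrived on the registered socket
    finally:
        selector.close()
```

Calling `wait_for_connection(FakeConnection(), rearm=False)` keeps waiting on the first socket and times out, while the default `rearm=True` follows the connection to the second socket and completes.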

JelteF · 2024-01-04T12:27:14Z

Thank you for this fix. afaict it's actually not just a bug during testing then, it could also occur in normal usage when a hostname in pg_dist_node resolves to multiple IPs.

…7377)

When there are multiple localhost entries in /etc/hosts like following
/etc/hosts:
```
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   localhost
```
multi_cluster_management check will failed:
```
@@ -857,20 +857,21 @@
 ERROR:  group 14 already has a primary node
 -- check that you can add secondaries and unavailable nodes to a group
 SELECT groupid AS worker_2_group FROM pg_dist_node WHERE nodeport = :worker_2_port \gset
 SELECT 1 FROM master_add_node('localhost', 9998, groupid => :worker_1_group, noderole => 'secondary');
  ?column?
 ----------
         1
 (1 row)

 SELECT 1 FROM master_add_node('localhost', 9997, groupid => :worker_1_group, noderole => 'unavailable');
+WARNING:  could not establish connection after 5000 ms
  ?column?
 ----------
         1
 (1 row)
```

This actually isn't just a problem in test environments, but could occur as well during actual usage when a hostname in pg_dist_node resolves to multiple IPs and one of those IPs is unreachable. Postgres will then automatically continue with the next IP, but Citus should listen for events on the new socket. Not on the old one.

Co-authored-by: chuhx43211 <chuhx43211@hundsun.com>
(cherry picked from commit 9a91136)

fix multi_cluster_management check timeout when repeat localhost in /etc/hosts

89ae723

hslightdb force-pushed the fix_timeout branch from a1c72cb to 89ae723 Compare December 15, 2023 07:01

Merge branch 'main' into fix_timeout

20c3e05

JelteF enabled auto-merge (squash) January 4, 2024 12:24

Merge branch 'main' into fix_timeout

0a4c555

hslightdb and others added 2 commits January 9, 2024 14:46

Merge branch 'main' into fix_timeout

0d4e717

Merge branch 'main' into fix_timeout

6d3453e

JelteF approved these changes Jan 10, 2024


JelteF merged commit 9a91136 into citusdata:main Jan 10, 2024
123 of 126 checks passed

hslightdb deleted the fix_timeout branch January 11, 2024 03:20

onurctirtir mentioned this pull request Apr 18, 2024

Adds changelog for 12.1.3 #7587

Merged


hslightdb commented Dec 12, 2023 • edited by gurkanindibay
