feat: add port scan diagnostics to mria waiting for tables checks #11637
5ee673f to 8a1fa8c
-define(PORT_PROBE_TIMEOUT, 10_000).

open_ports_check() ->
    %% expected nodes according to mnesia schema
    OtherNodes = mnesia:system_info(db_nodes) -- [node()],
todo: fix for replicants
Replicants won't enter the waiting for tables loop: they'll probably be stuck waiting to connect to a core node (`mria:find_upstream_node`).
Also, connectivity issues between core nodes are arguably more relevant for the diagnostics than those involving replicants.
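For readers following the thread, a single port probe of the kind discussed here can be sketched as below. This is a minimal illustration, not the code from this PR: the module name `port_probe_sketch` and both function names are hypothetical, and the timeout is passed in by the caller (the diff above defines `PORT_PROBE_TIMEOUT` as 10 seconds).

```erlang
-module(port_probe_sketch).
-export([probe/3, probe_all/3]).

%% Try to open a TCP connection to Host:Port and close it right away.
%% Returns true when the connection was accepted within Timeout ms.
-spec probe(inet:ip_address() | inet:hostname(), inet:port_number(), timeout()) ->
    boolean().
probe(Host, Port, Timeout) ->
    case gen_tcp:connect(Host, Port, [binary, {active, false}], Timeout) of
        {ok, Socket} ->
            ok = gen_tcp:close(Socket),
            true;
        {error, _Reason} ->
            false
    end.

%% Probe several ports on one host; returns a map of Port => boolean().
-spec probe_all(inet:ip_address() | inet:hostname(), [inet:port_number()], timeout()) ->
    #{inet:port_number() => boolean()}.
probe_all(Host, Ports, Timeout) ->
    maps:from_list([{Port, probe(Host, Port, Timeout)} || Port <- Ports]).
```

A call such as `port_probe_sketch:probe_all({172,100,239,4}, [4370, 5370], 10000)` would yield a map shaped like the `open_ports` entries in the example output of this PR.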
8a1fa8c to 28b679c
Pull Request Test Coverage Report for Build 6252084448 - Coveralls
28b679c to 6234da9
node_to_ips(Node) ->
    NodeBin0 = atom_to_binary(Node),
    HostOrIP = re:replace(NodeBin0, <<"^.+@">>, <<"">>, [{return, list}]),
    case inet:gethostbyname(HostOrIP, inet) of
Why specify `inet` here?
I was trying to normalize the IPs to IPv4 addresses, so that IPv6 addresses are not returned. Do you think it would be ok to leave the address family unspecified here?
Maybe we could make it an argument, so the caller chooses when running `open_ports_check`, or we could just check both?
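The "check both" option could look roughly like the sketch below, which resolves a node's host for each requested address family and merges the results. It assumes node names of the form `name@host`; the module `resolve_sketch` and its functions are hypothetical and are not part of this PR. It uses `inet:getaddrs/2` rather than `inet:gethostbyname/2` only to keep the example short.

```erlang
-module(resolve_sketch).
-export([host_of/1, node_to_ips/2]).

%% Extract the host part of a node name such as 'emqx@172.100.239.4'.
%% Crashes on names without an "@", which is acceptable for a sketch.
-spec host_of(node()) -> string().
host_of(Node) ->
    [_Name, Host] = string:split(atom_to_list(Node), "@"),
    Host.

%% Resolve the node's host for every family in Families (inet and/or inet6)
%% and return the de-duplicated list of addresses.
-spec node_to_ips(node(), [inet | inet6]) -> [inet:ip_address()].
node_to_ips(Node, Families) ->
    Host = host_of(Node),
    lists:usort(lists:append([resolve(Host, Family) || Family <- Families])).

%% A failed lookup for one family simply contributes no addresses.
resolve(Host, Family) ->
    case inet:getaddrs(Host, Family) of
        {ok, Addrs} -> Addrs;
        {error, _Reason} -> []
    end.
```

Passing `[inet]` keeps the IPv4-only behavior from the diff, while `[inet, inet6]` checks both families.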
This check is called by `mria_mnesia:diagnosis/1`, which in turn is called when mria has been stuck waiting for tables for more than 30 seconds. In this situation, there's not much interactivity available for changing this configuration.
If necessary, perhaps we can make it configurable via a persistent term key that controls this behavior and can be set via remsh. What do you think?
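The persistent term idea could be sketched as follows. The key, the module name `probe_config_sketch`, and the default of `[inet]` are all assumptions made for illustration; nothing in this thread fixes them.

```erlang
-module(probe_config_sketch).
-export([families/0, set_families/1]).

%% Hypothetical key; a real implementation would pick its own name.
-define(KEY, {?MODULE, address_families}).

%% Address families to probe; defaults to IPv4 only when nothing is set.
-spec families() -> [inet | inet6].
families() ->
    persistent_term:get(?KEY, [inet]).

%% Intended to be called from a remote shell on the affected node.
-spec set_families([inet | inet6]) -> ok.
set_families(Families) when is_list(Families) ->
    persistent_term:put(?KEY, Families).
```

From remsh, an operator could then run `probe_config_sketch:set_families([inet, inet6]).` before the next diagnosis pass.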
Thanks for the info; I thought it was called manually by a human.
Since it is for 'wait for tables', it must be about the Erlang distribution port. Could we get that info from the VM, i.e. whether it uses inet or inet6?
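One way to ask the VM is to read the `-proto_dist` emulator flag, as sketched below. This assumes the flag value starts with `inet6` for IPv6 distribution (for example `inet6_tcp` or `inet6_tls`) and that an absent flag means the default IPv4 distribution. The module `dist_family_sketch` is hypothetical, and the approach actually taken in this PR may differ.

```erlang
-module(dist_family_sketch).
-export([dist_family/0, family_of/1]).

%% Address family used by Erlang distribution, derived from -proto_dist.
%% Falls back to inet when the flag is not set.
-spec dist_family() -> inet | inet6.
dist_family() ->
    case init:get_argument(proto_dist) of
        {ok, [[Proto | _] | _]} -> family_of(Proto);
        _ -> inet
    end.

%% Map a proto_dist value such as "inet6_tcp" to its address family.
-spec family_of(string()) -> inet | inet6.
family_of(Proto) ->
    case lists:prefix("inet6", Proto) of
        true -> inet6;
        false -> inet
    end.
```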
6234da9 to 9456290
Fixes https://emqx.atlassian.net/browse/EMQX-10944

Also updates ekka -> 0.15.15, mria -> 0.6.4

How to test
===========

1. Start 2 or more EMQX nodes and merge them in a cluster.
2. Stop them in order.
3. Start only the first node that was stopped in the previous step.
4. Wait until the log is printed.

Or, more easily:

1. Start 2 or more EMQX nodes and merge them in a cluster.
2. Stop all but one.
3. Run `mria_mnesia:diagnosis([]).` on that node.

Example output
==============

```
Check check_open_ports should get ok but got
#{msg => "some ports are unreachable",
  results =>
      #{'emqx@172.100.239.4' =>
            #{open_ports => #{4370 => false, 5370 => false},
              ports_to_check => [4370,5370],
              resolved_ips => [{172,100,239,4}],
              status => bad_ports},
        'emqx@172.100.239.5' =>
            #{open_ports => #{4370 => false, 5370 => false},
              ports_to_check => [4370,5370],
              resolved_ips => [{172,100,239,5}],
              status => bad_ports}}}
```

After one node is back:

```
Check check_open_ports should get ok but got
#{msg => "some ports are unreachable",
  results =>
      #{'emqx@172.100.239.4' =>
            #{ports_to_check => [4370,5370],
              resolved_ips => [{172,100,239,4}],
              status => ok},
        'emqx@172.100.239.5' =>
            #{open_ports => #{4370 => false, 5370 => false},
              ports_to_check => [4370,5370],
              resolved_ips => [{172,100,239,5}],
              status => bad_ports}}}
```
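To read a report like the ones above at a glance, the nodes with closed ports can be pulled out of the map. The sketch below assumes only the shape visible in the example output (`results`, `status`, `open_ports`); the module `diag_summary_sketch` and the function `unreachable/1` are hypothetical helpers, not part of mria or EMQX.

```erlang
-module(diag_summary_sketch).
-export([unreachable/1]).

%% Given a report shaped like the example output, return a sorted list of
%% {Node, ClosedPorts} for every node whose status is bad_ports.
-spec unreachable(map()) -> [{node(), [inet:port_number()]}].
unreachable(#{results := Results}) ->
    Bad = maps:fold(
        fun(Node, #{status := bad_ports, open_ports := Open}, Acc) ->
                %% Keep only the ports whose probe result was false.
                Closed = [Port || {Port, false} <- maps:to_list(Open)],
                [{Node, lists:sort(Closed)} | Acc];
           (_Node, _Other, Acc) ->
                Acc
        end,
        [],
        Results),
    lists:sort(Bad).
```

Applied to the second report above, this would return only the entry for `'emqx@172.100.239.5'` with ports 4370 and 5370.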
9456290 to d6935b6
Follow up improvement after discussing with @qzhuyan : get the distribution port type ( |
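Fixes https://emqx.atlassian.net/browse/EMQX-10944

Also updated libraries: ekka -> 0.15.15, mria -> 0.6.4.

How to test

1. Start 2 or more EMQX nodes and merge them in a cluster.
2. Stop them in order.
3. Start only the first node that was stopped in the previous step.
4. Wait until the log is printed.

Or, more easily:

1. Start 2 or more EMQX nodes and merge them in a cluster.
2. Stop all but one.
3. Run `mria_mnesia:diagnosis([]).` on that node.

Example output

With two nodes down

After one node is back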
Summary

Generated by Copilot at 26156ed

Add a cluster TCP port availability check to `emqx_machine`. This check uses `open_ports_check/0` to test the connectivity of `ekka` and `gen_rpc` ports among cluster nodes, and reports any failures to `mria`.

PR Checklist
Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

- `changes/(ce|ee)/(feat|perf|fix)-<PR-id>.en.md` files

Checklist for CI (.github/workflows) changes

- `changes/` dir for user-facing artifacts update