feat: add port scan diagnostics to mria waiting for tables checks #11637

thalesmg · 2023-09-19T20:30:20Z

Fixes https://emqx.atlassian.net/browse/EMQX-10944

Also updated libraries: ekka -> 0.15.15, mria -> 0.6.4.

How to test

1. Start 2 or more EMQX nodes and merge them in a cluster.
2. Stop them in order.
3. Start only the first node that was stopped in the previous step.
4. Wait until the log is printed.

Or, more easily:

1. Start 2 or more EMQX nodes and merge them in a cluster.
2. Stop all but one.
3. Run `mria_mnesia:diagnosis([]).` on that node.

Example output

With two nodes down

   Check check_open_ports should get ok but got #{msg =>
                                                     "some ports are unreachable",
                                                 results =>
                                                     #{'emqx@172.100.239.4' =>
                                                           #{open_ports =>
                                                                 #{4370 => false,
                                                                   5370 =>
                                                                       false},
                                                             ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   4}],
                                                             status =>
                                                                 bad_ports},
                                                       'emqx@172.100.239.5' =>
                                                           #{open_ports =>
                                                                 #{4370 => false,
                                                                   5370 =>
                                                                       false},
                                                             ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   5}],
                                                             status =>
                                                                 bad_ports}}}

After one node is back

   Check check_open_ports should get ok but got #{msg =>
                                                     "some ports are unreachable",
                                                 results =>
                                                     #{'emqx@172.100.239.4' =>
                                                           #{ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   4}],
                                                             status => ok},
                                                       'emqx@172.100.239.5' =>
                                                           #{open_ports =>
                                                                 #{4370 => false,
                                                                   5370 =>
                                                                       false},
                                                             ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   5}],
                                                             status =>
                                                                 bad_ports}}}

Summary

`🤖 Generated by Copilot at 26156ed`

Add a cluster TCP port availability check to emqx_machine. This check uses open_ports_check/0 to test the connectivity of ekka and gen_rpc ports among cluster nodes, and reports any failures to mria.
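To illustrate the kind of probe described above, here is a minimal sketch of a TCP port check. It is not the code from this PR: the module name `port_probe_sketch` and the function `probe_ports/3` are hypothetical, and the sketch only assumes the standard `gen_tcp:connect/4` and `gen_tcp:close/1` calls. The result shape loosely mirrors the `open_ports` map in the example output.

```erlang
-module(port_probe_sketch).
-export([probe_ports/3]).

%% Hypothetical helper, not part of emqx_machine.
%% Tries to open a TCP connection to each port on the given IP address and
%% returns a map of Port => boolean(), where true means the connect succeeded.
-spec probe_ports(inet:ip_address(), [inet:port_number()], timeout()) ->
    #{inet:port_number() => boolean()}.
probe_ports(IP, Ports, Timeout) ->
    maps:from_list([{Port, is_open(IP, Port, Timeout)} || Port <- Ports]).

%% A port counts as open if the TCP handshake completes within Timeout.
is_open(IP, Port, Timeout) ->
    case gen_tcp:connect(IP, Port, [binary, {active, false}], Timeout) of
        {ok, Socket} ->
            ok = gen_tcp:close(Socket),
            true;
        {error, _Reason} ->
            false
    end.
```

With the addresses from the example output, a call might look like `port_probe_sketch:probe_ports({172,100,239,4}, [4370, 5370], 10000).`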

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

[na] Added tests for the changes
[na] Added property-based tests for code which performs user input validation
[na] Changed lines covered in coverage report
Change log has been added to changes/(ce|ee)/(feat|perf|fix)-<PR-id>.en.md files
For internal contributor: there is a jira ticket to track this change
[na] Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
[na] Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

[na] If changed package build workflow, pass this action (manual trigger)
[na] Change log has been added to changes/ dir for user-facing artifacts update

thalesmg · 2023-09-20T16:13:11Z

apps/emqx_machine/src/emqx_machine.erl

+-define(PORT_PROBE_TIMEOUT, 10_000).
+open_ports_check() ->
+    %% expected nodes according to mnesia schema
+    OtherNodes = mnesia:system_info(db_nodes) -- [node()],


~~todo: fix for replicants~~

Replicants won't enter the waiting for tables loop: they'll probably be stuck waiting to connect to a core node (mria:find_upstream_node).

Also, connectivity issues between core nodes are arguably more relevant for the diagnostics than those involving replicants.

coveralls · 2023-09-20T19:15:36Z

Pull Request Test Coverage Report for Build 6252084448

4 of 38 (10.53%) changed or added relevant lines in 1 file are covered.
25 unchanged lines in 8 files lost coverage.
Overall coverage decreased (-0.06%) to 81.641%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| apps/emqx_machine/src/emqx_machine.erl | 4 | 38 | 10.53% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| apps/emqx_durable_storage/src/emqx_ds_message_storage_bitmask.erl | 1 | 93.42% |
| apps/emqx_management/src/emqx_mgmt_util.erl | 1 | 13.79% |
| apps/emqx_resource/src/emqx_resource_buffer_worker.erl | 1 | 93.81% |
| apps/emqx/src/emqx_quic_stream.erl | 1 | 75.0% |
| apps/emqx/src/emqx_router_helper.erl | 1 | 84.91% |
| apps/emqx_gateway_mqttsn/src/emqx_mqttsn_frame.erl | 2 | 64.04% |
| apps/emqx/src/emqx_alarm.erl | 4 | 88.24% |
| apps/emqx/src/emqx_reason_codes.erl | 14 | 87.5% |

Totals
Change from base Build 6251999884:	-0.06%
Covered Lines:	33169
Relevant Lines:	40628

💛 - Coveralls

qzhuyan · 2023-09-21T07:30:22Z

apps/emqx_machine/src/emqx_machine.erl

+node_to_ips(Node) ->
+    NodeBin0 = atom_to_binary(Node),
+    HostOrIP = re:replace(NodeBin0, <<"^.+@">>, <<"">>, [{return, list}]),
+    case inet:gethostbyname(HostOrIP, inet) of


Why specify `inet` here?

thalesmg · 2023-09-21

I was trying to normalize the IPs to IPv4 addresses, to prevent IPv6 addresses from being returned. Do you think it would be ok to leave it unspecified here?

qzhuyan · 2023-09-22

Maybe we could make it an argument, so the caller can choose when running `open_ports_check`, or we could just check both?

thalesmg · 2023-09-22

This check is called by `mria_mnesia:diagnosis/1`, which in turn is called when mria has been stuck waiting for tables for more than 30 seconds. In that situation, there is little room for interactively changing this configuration.

If necessary, perhaps we can make it configurable via a persistent term key that controls this behavior and can be set via remsh. What do you think?
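As a rough illustration of the persistent term idea, the sketch below reads the address family from a persistent term and falls back to `inet` when nothing is set. The module name and the key are hypothetical (nothing in this PR defines them); only `persistent_term:get/2` and `persistent_term:put/2` from OTP are assumed.

```erlang
-module(addr_family_sketch).
-export([address_family/0]).

%% Hypothetical key; this PR does not define one.
-define(FAMILY_KEY, {addr_family_sketch, address_family}).

%% Returns inet unless an operator has overridden the value, for example
%% from a remote shell with:
%%   persistent_term:put({addr_family_sketch, address_family}, inet6).
-spec address_family() -> inet | inet6.
address_family() ->
    persistent_term:get(?FAMILY_KEY, inet).
```

The result could then be passed as the second argument of `inet:gethostbyname/2` in place of the hard-coded `inet`.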

qzhuyan · 2023-09-22

Thanks for the info; I thought it was a manual call made by a human.

Since this is for the "waiting for tables" case, it must concern the Erlang distribution port. Could we get that information (inet or inet6) from the VM?

thalesmg · 2023-09-22T16:06:15Z

Follow up improvement after discussing with @qzhuyan : get the distribution port type (inet / inet6) to define which address type to fetch.
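One possible way to derive the address family from the distribution protocol is sketched below. This is an assumption, not the code from the follow-up PR: it reads the standard `-proto_dist` VM flag through `init:get_argument/1`, whereas the follow-up may read the setting from ekka instead. The module and function names are hypothetical.

```erlang
-module(dist_family_sketch).
-export([dist_address_family/0, family_from_proto_dist/1]).

%% Reads the -proto_dist emulator flag, if present, and maps it to an
%% address family. Defaults to inet when the flag is absent.
-spec dist_address_family() -> inet | inet6.
dist_address_family() ->
    case init:get_argument(proto_dist) of
        {ok, [[Proto | _] | _]} -> family_from_proto_dist(Proto);
        _ -> inet
    end.

%% Maps a distribution module name to the address family it uses.
%% Anything that is not an inet6 variant is treated as inet.
-spec family_from_proto_dist(string()) -> inet | inet6.
family_from_proto_dist("inet6_tcp") -> inet6;
family_from_proto_dist("inet6_tls") -> inet6;
family_from_proto_dist(_Other) -> inet.
```

The returned atom can be used directly as the family argument of `inet:gethostbyname/2`.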

Follow up to emqx#11637 (comment) Fixes https://emqx.atlassian.net/browse/EMQX-10944

thalesmg force-pushed the port-scan-mria-check-m-20230919 branch 2 times, most recently from 5ee673f to 8a1fa8c Compare September 20, 2023 15:47

thalesmg commented Sep 20, 2023

thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 8a1fa8c to 28b679c Compare September 20, 2023 17:30

thalesmg marked this pull request as ready for review September 20, 2023 19:15

thalesmg requested review from lafirest and a team as code owners September 20, 2023 19:15

thalesmg marked this pull request as draft September 20, 2023 19:58

thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 28b679c to 6234da9 Compare September 20, 2023 20:26

qzhuyan reviewed Sep 21, 2023

thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 6234da9 to 9456290 Compare September 21, 2023 12:12

thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 9456290 to d6935b6 Compare September 21, 2023 17:29

thalesmg marked this pull request as ready for review September 21, 2023 19:07

qzhuyan approved these changes Sep 22, 2023

thalesmg merged commit 5e40057 into emqx:master Sep 22, 2023
133 checks passed

thalesmg deleted the port-scan-mria-check-m-20230919 branch September 22, 2023 16:05

thalesmg mentioned this pull request Sep 22, 2023

chore: check ekka proto dist module type when resolving node address #11662

Merged

9 tasks

thalesmg added a commit to thalesmg/emqx that referenced this pull request Sep 25, 2023

chore: check ekka proto dist module type when resolving node address

806017e

Follow up to emqx#11637 (comment) Fixes https://emqx.atlassian.net/browse/EMQX-10944

chenrui333 mentioned this pull request Nov 15, 2023

emqx 5.3.1 Homebrew/homebrew-core#154443

Merged

BrewTestBot mentioned this pull request Nov 15, 2023

emqx 5.3.1 Homebrew/homebrew-core#154455

Closed
