feat: add port scan diagnostics to mria waiting for tables checks #11637

Merged
merged 1 commit into from Sep 22, 2023

@thalesmg thalesmg commented Sep 19, 2023

Fixes https://emqx.atlassian.net/browse/EMQX-10944

Also updated libraries: ekka -> 0.15.15, mria -> 0.6.4.

How to test

  1. Start 2 or more EMQX nodes and merge them in a cluster.
  2. Stop them one at a time, noting the order.
  3. Start only the first node that was stopped in the previous step.
  4. Wait until the log is printed.

Or, more easily:

  1. Start 2 or more EMQX nodes and merge them in a cluster.
  2. Stop all but one.
  3. Run mria_mnesia:diagnosis([]). on that node.

Example output

With two nodes down
   Check check_open_ports should get ok but got #{msg =>
                                                     "some ports are unreachable",
                                                 results =>
                                                     #{'emqx@172.100.239.4' =>
                                                           #{open_ports =>
                                                                 #{4370 => false,
                                                                   5370 =>
                                                                       false},
                                                             ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   4}],
                                                             status =>
                                                                 bad_ports},
                                                       'emqx@172.100.239.5' =>
                                                           #{open_ports =>
                                                                 #{4370 => false,
                                                                   5370 =>
                                                                       false},
                                                             ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   5}],
                                                             status =>
                                                                 bad_ports}}}
After one node is back
   Check check_open_ports should get ok but got #{msg =>
                                                     "some ports are unreachable",
                                                 results =>
                                                     #{'emqx@172.100.239.4' =>
                                                           #{ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   4}],
                                                             status => ok},
                                                       'emqx@172.100.239.5' =>
                                                           #{open_ports =>
                                                                 #{4370 => false,
                                                                   5370 =>
                                                                       false},
                                                             ports_to_check =>
                                                                 [4370,5370],
                                                             resolved_ips =>
                                                                 [{172,100,239,
                                                                   5}],
                                                             status =>
                                                                 bad_ports}}}
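
When reading this output, the useful part is usually which nodes failed and on which ports. A minimal sketch of how the `results` map could be summarized, assuming the map shape shown in the examples above (the module name `port_report` and its function are hypothetical and not part of EMQX or mria):

```erlang
-module(port_report).
-export([unreachable/1]).

%% Given the `results` map from the diagnostic output, return a sorted list
%% of {Node, ClosedPorts} for every node whose status is not `ok`.
unreachable(Results) ->
    lists:sort(
      [{Node, closed_ports(Info)}
       || {Node, #{status := Status} = Info} <- maps:to_list(Results),
          Status =/= ok]).

%% Ports whose probe returned `false`. Entries with status `ok` carry no
%% `open_ports` key in the examples, so default to an empty map.
closed_ports(Info) ->
    OpenPorts = maps:get(open_ports, Info, #{}),
    lists:sort([Port || {Port, false} <- maps:to_list(OpenPorts)]).
```

Applied to the "after one node is back" results, this would return `[{'emqx@172.100.239.5', [4370,5370]}]`.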

Summary

馃 Generated by Copilot at 26156ed

Add a cluster TCP port availability check to emqx_machine. This check uses open_ports_check/0 to test the connectivity of ekka and gen_rpc ports among cluster nodes, and reports any failures to mria.
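
For readers unfamiliar with this kind of check, a TCP port probe can be sketched with `gen_tcp:connect/4` and a timeout. This is only an illustration of the technique, not the code in `emqx_machine`; the module name `port_probe` and both function names are hypothetical:

```erlang
-module(port_probe).
-export([is_port_open/3, probe/3]).

%% Try to open a TCP connection to Address:Port.
%% Returns `true` if the connect succeeds within Timeout milliseconds.
is_port_open(Address, Port, Timeout) ->
    case gen_tcp:connect(Address, Port, [binary, {active, false}], Timeout) of
        {ok, Socket} ->
            ok = gen_tcp:close(Socket),
            true;
        {error, _Reason} ->
            false
    end.

%% Probe several ports on one address and return a map Port => boolean().
probe(Address, Ports, Timeout) ->
    maps:from_list(
      [{Port, is_port_open(Address, Port, Timeout)} || Port <- Ports]).
```

For example, `port_probe:probe({172,100,239,5}, [4370, 5370], 10_000)` would probe the two ports listed in the example output, using the same 10-second value as `?PORT_PROBE_TIMEOUT` in the diff.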

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • [na] Added tests for the changes
  • [na] Added property-based tests for code which performs user input validation
  • [na] Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • [na] Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • [na] Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • [na] If changed package build workflow, pass this action (manual trigger)
  • [na] Change log has been added to changes/ dir for user-facing artifacts update

@thalesmg thalesmg force-pushed the port-scan-mria-check-m-20230919 branch 2 times, most recently from 5ee673f to 8a1fa8c Compare September 20, 2023 15:47
-define(PORT_PROBE_TIMEOUT, 10_000).
open_ports_check() ->
    %% expected nodes according to mnesia schema
    OtherNodes = mnesia:system_info(db_nodes) -- [node()],
Contributor Author

@thalesmg thalesmg Sep 20, 2023

  • todo: fix for replicants

Replicants won't enter the waiting for tables loop: they'll probably be stuck waiting to connect to a core node (mria:find_upstream_node).

Also, connectivity issues between core nodes are arguably more relevant for the diagnostics than replicants.

@thalesmg thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 8a1fa8c to 28b679c Compare September 20, 2023 17:30
@coveralls
Collaborator

Pull Request Test Coverage Report for Build 6252084448

  • 4 of 38 (10.53%) changed or added relevant lines in 1 file are covered.
  • 25 unchanged lines in 8 files lost coverage.
  • Overall coverage decreased (-0.06%) to 81.641%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| apps/emqx_machine/src/emqx_machine.erl | 4 | 38 | 10.53% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| apps/emqx_durable_storage/src/emqx_ds_message_storage_bitmask.erl | 1 | 93.42% |
| apps/emqx_management/src/emqx_mgmt_util.erl | 1 | 13.79% |
| apps/emqx_resource/src/emqx_resource_buffer_worker.erl | 1 | 93.81% |
| apps/emqx/src/emqx_quic_stream.erl | 1 | 75.0% |
| apps/emqx/src/emqx_router_helper.erl | 1 | 84.91% |
| apps/emqx_gateway_mqttsn/src/emqx_mqttsn_frame.erl | 2 | 64.04% |
| apps/emqx/src/emqx_alarm.erl | 4 | 88.24% |
| apps/emqx/src/emqx_reason_codes.erl | 14 | 87.5% |

| Totals | |
|---|---|
| Change from base Build 6251999884 | -0.06% |
| Covered Lines | 33169 |
| Relevant Lines | 40628 |

Coveralls

@thalesmg thalesmg marked this pull request as ready for review September 20, 2023 19:15
@thalesmg thalesmg requested review from lafirest and a team as code owners September 20, 2023 19:15
@thalesmg thalesmg marked this pull request as draft September 20, 2023 19:58
@thalesmg thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 28b679c to 6234da9 Compare September 20, 2023 20:26
node_to_ips(Node) ->
    NodeBin0 = atom_to_binary(Node),
    HostOrIP = re:replace(NodeBin0, <<"^.+@">>, <<"">>, [{return, list}]),
    case inet:gethostbyname(HostOrIP, inet) of
Contributor

why specify inet here?

Contributor Author

I was trying to normalize the IPs to ipv4 addresses, to avoid ipv6 from being returned. Do you think it would be ok to keep it unspecified here?
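
To make the question concrete, here is a rough sketch of resolving a node name to addresses with the family passed as an argument. It is not the PR's `node_to_ips/1`: it uses `string:split/2` instead of `re:replace/4`, and the module name `node_addr` and both function names are hypothetical.

```erlang
-module(node_addr).
-export([host_of/1, resolve/2]).

-include_lib("kernel/include/inet.hrl").

%% Strip the "name@" prefix from a node name,
%% e.g. 'emqx@172.100.239.4' -> "172.100.239.4".
host_of(Node) when is_atom(Node) ->
    case string:split(atom_to_list(Node), "@") of
        [_Name, Host] -> Host;
        [Host] -> Host
    end.

%% Resolve a host name or IP string for the given family (inet | inet6).
%% Returns the list of address tuples, or [] if resolution fails.
resolve(Host, Family) ->
    case inet:gethostbyname(Host, Family) of
        {ok, #hostent{h_addr_list = Addrs}} -> Addrs;
        {error, _Reason} -> []
    end.
```

With `inet` the result contains only 4-tuples, which is the normalization described above; passing `inet6` instead asks the resolver for IPv6 addresses.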

Contributor

Maybe we could make it an argument, so the caller chooses when running open_ports_check, or we could just check both?

Contributor Author

This check is called by mria_mnesia:diagnosis/1, which in turn is called when mria is stuck waiting for tables for more than 30 seconds. In this situation, there's not much interactivity to be able to change this configuration.

If necessary, perhaps we can make it configurable via a persistent term key that controls this behavior and can be set via remsh. What do you think?
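
A persistent term toggle of that kind could look roughly like the sketch below. The key, the module name `probe_config`, and both functions are hypothetical; nothing here is defined by EMQX or mria.

```erlang
-module(probe_config).
-export([address_family/0, set_address_family/1]).

%% Hypothetical key, chosen only for this illustration.
-define(KEY, {?MODULE, address_family}).

%% Address family to use when resolving node addresses.
%% Defaults to `inet` (IPv4) when nothing has been set.
address_family() ->
    persistent_term:get(?KEY, inet).

%% Intended to be called from a remote shell to change the behavior.
set_address_family(Family) when Family =:= inet; Family =:= inet6 ->
    persistent_term:put(?KEY, Family).
```

From a remsh session, `probe_config:set_address_family(inet6).` would switch the family without restarting the node.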

Contributor

@qzhuyan qzhuyan Sep 22, 2023

Thanks for the info; I thought it was a manual call by a human.

Since this is for 'waiting for tables', it must concern the Erlang distribution port. Could we get that info (inet or inet6) from the VM?

@thalesmg thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 6234da9 to 9456290 Compare September 21, 2023 12:12
@thalesmg thalesmg force-pushed the port-scan-mria-check-m-20230919 branch from 9456290 to d6935b6 Compare September 21, 2023 17:29
@thalesmg thalesmg marked this pull request as ready for review September 21, 2023 19:07
@thalesmg thalesmg merged commit 5e40057 into emqx:master Sep 22, 2023
133 checks passed
@thalesmg thalesmg deleted the port-scan-mria-check-m-20230919 branch September 22, 2023 16:05
@thalesmg
Contributor Author

Follow-up improvement after discussing with @qzhuyan: get the distribution port type (inet / inet6) and use it to decide which address type to fetch.
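
One possible way to derive the family from the VM is to look at the `-proto_dist` emulator flag via `init:get_argument/1`. This is a sketch under that assumption and not necessarily how the follow-up was implemented; the module name `dist_family` is hypothetical, and a custom distribution module would need its own mapping.

```erlang
-module(dist_family).
-export([current/0, from_proto/1]).

%% Address family implied by the `-proto_dist` flag the VM was started with.
%% Falls back to `inet` when the flag is absent.
current() ->
    case init:get_argument(proto_dist) of
        {ok, [[Proto | _] | _]} -> from_proto(Proto);
        _ -> inet
    end.

%% Map the standard OTP distribution modules to an address family.
%% Anything unrecognized is treated as IPv4 in this sketch.
from_proto("inet6_tcp") -> inet6;
from_proto("inet6_tls") -> inet6;
from_proto(_Other) -> inet.
```

The result could then be passed as the second argument to `inet:gethostbyname/2` in place of the hard-coded `inet`.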
