Security: only apply the outbound connection rate-limit to actual connections #2278

Merged: teor2345 merged 8 commits into main from only-rate-limit-actual-handshakes on Jun 14, 2021

Conversation

@teor2345 (Collaborator) commented Jun 10, 2021

Motivation

Zebra advances the next peer timer even when there are no eligible peers (and the method is about to return `None`).

This creates large peer wait times when there are no eligible peers. This can be a significant issue on testnet.

Solution

  • Only advance the timer when we're actually going to open a connection (see the sketch after this list)
  • Test that there is no delay when there are no candidate peers
  • Test that the maximum delay is bounded
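
A minimal sketch of the fix's shape, using a simple deadline field rather than Zebra's actual `CandidateSet` internals (the `Candidates` type, the `next_handshake_at` field, and the 100 ms interval below are illustrative assumptions, not the real code):

```rust
use std::net::SocketAddr;

use tokio::time::{sleep_until, Duration, Instant};

/// Illustrative value only; the real constant is Zebra's
/// `constants::MIN_PEER_CONNECTION_INTERVAL`.
const MIN_PEER_CONNECTION_INTERVAL: Duration = Duration::from_millis(100);

/// Hypothetical stand-in for Zebra's candidate set.
struct Candidates {
    peers: Vec<SocketAddr>,
    /// Earliest time the next outbound handshake may start.
    next_handshake_at: Instant,
}

impl Candidates {
    /// Returns the next peer to connect to, or `None` if there are no eligible peers.
    async fn next(&mut self) -> Option<SocketAddr> {
        // Pick a candidate *before* touching the rate limit, so an empty set
        // returns `None` immediately and adds no delay to later calls.
        let peer = self.peers.pop()?;

        // Only rate-limit actual connections: wait out the previous interval,
        // then schedule the next one *after* the sleep returns, so intervals
        // never overlap.
        sleep_until(self.next_handshake_at).await;
        self.next_handshake_at = Instant::now() + MIN_PEER_CONNECTION_INTERVAL;

        Some(peer)
    }
}
```

The real change (shown in the review comments below) keeps a `Sleep` future and swaps it with `mem::swap`, but the ordering is the same: pick a candidate first, then wait, then schedule the next interval after the previous sleep returns.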

Review

@jvff can review this PR.

Reviewer Checklist

  • Code implements Specs and Designs
  • Tests for Expected Behaviour
  • Tests for Errors

Previously, we were advancing the timer even when we returned `None`.
This created large wait times when there were no eligible peers.
@teor2345 added the C-bug (Category: This is a bug), A-rust (Area: Updates to Rust code), I-slow (Problems with performance or responsiveness), and A-network (Area: Network protocol updates or fixes) labels Jun 10, 2021
@teor2345 added this to the 2021 Sprint 11 - Zcon2 milestone Jun 10, 2021
@teor2345 requested a review from jvff June 10, 2021 06:56
@teor2345 force-pushed the only-rate-limit-actual-handshakes branch from e8e2f23 to 05a3673 June 10, 2021 06:56
@teor2345 changed the base branch from dynamic-local-listener-to-peers to main June 10, 2021 06:56

dconnolly previously approved these changes Jun 11, 2021

@dconnolly (Contributor) commented:

@jvff would you mind adding a proptest to catch this issue?

If next() returns None, the sleep timer shouldn't be advanced at all.

Is that proptest a blocker to this merging?

@teor2345 (Collaborator, Author) replied:

> @jvff would you mind adding a proptest to catch this issue?
> If next() returns None, the sleep timer shouldn't be advanced at all.
>
> Is that proptest a blocker to this merging?

This is my third attempt at implementing/reviewing this rate-limit, so I need better tests for it.
(I've added two different proptests to this PR. So we should be good if this PR reliably passes CI.)

I have an idea for one more proptest, but it needs to be done after PR #2275 merges to avoid conflicts. The existing proptests put pretty strict bounds on the wait time. So I don't think we need extra tests any time soon.

macOS VMs seem to need this extra time to pass their tests.
This change avoids test failures due to cumulative errors.

Also use a single call to `Instant::now` for each test round.
And print the times when the tests fail.
The candidate set proptests will fail if enough generated peers are
invalid for outbound connections.
Comment on lines +7 to +8
cc e27ea62986a1ace4e16dde4b99d8ae8557a6b1e92e3f9a7c6aba0d7b179eb817 # shrinks to peers = [MetaAddr { addr: 0.0.0.0:0, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:0, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:1151, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 129.65.108.101:9019, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 95.68.199.21:46887, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 167.206.253.131:43927, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 160.138.155.239:63430, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: [6a4e:a543:8c93:e074:8a2a:5cc4:96f8:ce95]:51071, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }], initial_candidates = 0, extra_candidates = 2
cc 788fe2c47a4af559e9a921764d1ce8fcfd0743361c19e6e7a06adc093bc19078 # shrinks to peers = [MetaAddr { addr: [::e9:8050:dbf7:5665]:27188, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 106.27.232.177:51742, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 127.0.0.1:30827, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:21948, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:38186, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: [4db2:96e6:38c5:fd62:e964:9338:4ce8:bfa0]:32686, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:22952, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 114.164.73.233:10203, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }], initial_candidates = 3, extra_candidates = 3
@teor2345 (Collaborator, Author) commented Jun 13, 2021

Addresses containing any of the following are invalid for outbound connections (see the sketch below):

  • IPv4: 0.0.0.0
  • IPv6: ::
  • Port: 0
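
For context, the check these addresses fail looks roughly like the helper below. This is a hypothetical standalone function mirroring the idea, not Zebra's actual `MetaAddr::is_valid_for_outbound`, which operates on full `MetaAddr` values:

```rust
use std::net::SocketAddr;

/// Hypothetical helper mirroring the address part of the outbound validity
/// check: unspecified IPs (0.0.0.0 or ::) and port 0 cannot be dialed,
/// so they are rejected.
fn addr_is_valid_for_outbound(addr: SocketAddr) -> bool {
    !addr.ip().is_unspecified() && addr.port() != 0
}

#[test]
fn rejects_unspecified_addresses_and_port_zero() {
    assert!(!addr_is_valid_for_outbound("0.0.0.0:8233".parse().unwrap()));
    assert!(!addr_is_valid_for_outbound("[::]:8233".parse().unwrap()));
    assert!(!addr_is_valid_for_outbound("192.0.2.1:0".parse().unwrap()));
    assert!(addr_is_valid_for_outbound("192.0.2.1:8233".parse().unwrap()));
}
```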

@teor2345 (Collaborator, Author) commented:

I did about 5000 runs of this test overnight; it seems like the generated test cases are (almost) all valid now.

@teor2345 (Collaborator, Author) commented:

Coverage failed with:

warning: spurious network error (2 tries remaining): failed to get 200 response from https://crates.io/api/v1/crates/anyhow/1.0.40/download, got 502
warning: spurious network error (1 tries remaining): failed to get 200 response from https://crates.io/api/v1/crates/anyhow/1.0.40/download, got 502
error: failed to download from https://crates.io/api/v1/crates/anyhow/1.0.40/download

Caused by:
failed to get 200 response from https://crates.io/api/v1/crates/anyhow/1.0.40/download, got 502

This seems to be happening a bit lately. It would be good to wait between retries, but there are no retry or timeout options on the action:
https://github.com/actions-rs/toolchain#inputs

So this likely needs a cargo or actions-rs/toolchain code fix.

Comment on lines +24 to +28
pub fn alternate_strategy() -> BoxedStrategy<Self> {
    (canonical_socket_addr(), any::<PeerServices>())
        .prop_map(|(socket_addr, services)| MetaAddr::new_alternate(&socket_addr, &services))
        .boxed()
}
@teor2345 (Collaborator, Author) commented:

Fix the alternate strategy so it returns the full range of possible service bits. (Node and client.)

//
// TODO: create a "Zebra supported services" constant
let meta_addr = MetaAddr::new_alternate(&address, &PeerServices::NODE_NETWORK);
if meta_addr.is_valid_for_outbound() {
@teor2345 (Collaborator, Author) commented:

Create a specific strategy for valid and ready outbound peers.

This was the missing check in the previous code:

                if meta_addr.is_valid_for_outbound() {
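
For illustration, a strategy along those lines can be built with proptest's `prop_filter`. This is a minimal sketch over bare socket addresses, not Zebra's real `MetaAddr`-based strategy; `valid_outbound_addr_strategy` and the filter closure are assumptions for the example:

```rust
use std::net::SocketAddr;

use proptest::prelude::*;

/// Sketch of a "valid outbound peers" strategy: generate arbitrary socket
/// addresses, then keep only those that pass the outbound check.
fn valid_outbound_addr_strategy() -> BoxedStrategy<SocketAddr> {
    any::<SocketAddr>()
        .prop_filter("address must be valid for outbound connections", |addr| {
            !addr.ip().is_unspecified() && addr.port() != 0
        })
        .boxed()
}

proptest! {
    #[test]
    fn generated_addrs_are_valid_for_outbound(addr in valid_outbound_addr_strategy()) {
        prop_assert!(!addr.ip().is_unspecified());
        prop_assert!(addr.port() != 0);
    }
}
```

Heavy filtering can make proptest reject too many cases, which is one reason the PR also stops generating invalid outbound peers in the first place.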

Comment on lines +315 to +318
// SECURITY: rate-limit new outbound peer connections
(&mut self.wait_next_handshake).await;
let mut sleep = sleep(constants::MIN_PEER_CONNECTION_INTERVAL);
mem::swap(&mut self.wait_next_handshake, &mut sleep);
@teor2345 (Collaborator, Author) commented:

Two fixes:

  • only wait if we're returning Some
  • schedule the next sleep time after the previous sleep returns, avoiding issues with overlapping times

Comment on lines +121 to +126
// Allow enough time for the maximum number of candidates,
// plus some extra time for test machines with high CPU load.
// (The max delay asserts usually fail before hitting this timeout.)
let max_rate_limit_sleep = 3 * MAX_TEST_CANDIDATES * MIN_PEER_CONNECTION_INTERVAL;
let max_extra_delay = (2 * MAX_TEST_CANDIDATES + 1) * MAX_SLEEP_EXTRA_DELAY;
assert!(runtime.block_on(timeout(max_rate_limit_sleep + max_extra_delay, checks)).is_ok());
@teor2345 (Collaborator, Author) commented Jun 13, 2021

Increase the test timeout to match any increases in the candidates or times.
(Currently it's 10.2 seconds rather than 10 seconds!)

Also use named constants rather than integer literals.
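
For reference, one possible breakdown of that 10.2 second figure, assuming MIN_PEER_CONNECTION_INTERVAL is 100 ms, MAX_SLEEP_EXTRA_DELAY is 1 second, and MAX_TEST_CANDIDATES is 4 (these values are assumptions, not taken from the diff): 3 × 4 × 0.1 s + (2 × 4 + 1) × 1 s = 1.2 s + 9 s = 10.2 s.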

Comment on lines -87 to +169
-    let mut minimum_reconnect_instant = Instant::now();
+    let mut now = Instant::now();
+    let mut minimum_reconnect_instant = now;
+    // Allow extra time for test machines with high CPU load
+    let mut maximum_reconnect_instant = now + MAX_SLEEP_EXTRA_DELAY;

     for _ in 0..candidates {
-        assert!(candidate_set.next().await.is_some());
-        assert!(Instant::now() >= minimum_reconnect_instant);
-
-        minimum_reconnect_instant += MIN_PEER_CONNECTION_INTERVAL;
+        assert!(
+            candidate_set.next().await.is_some(),
+            "there are enough available candidates"
+        );
+
+        now = Instant::now();
+        assert!(
+            now >= minimum_reconnect_instant,
+            "all candidates should obey the minimum rate-limit: now: {:?} min: {:?}",
+            now,
+            minimum_reconnect_instant,
+        );
+        assert!(
+            now <= maximum_reconnect_instant,
+            "rate-limited candidates should not be delayed too long: now: {:?} max: {:?}. Hint: is the test machine overloaded?",
+            now,
+            maximum_reconnect_instant,
+        );
+
+        minimum_reconnect_instant = now + MIN_PEER_CONNECTION_INTERVAL;
+        maximum_reconnect_instant = now + MIN_PEER_CONNECTION_INTERVAL + MAX_SLEEP_EXTRA_DELAY;
@teor2345 (Collaborator, Author) commented:

Two test changes:

  • add a maximum time that's a second after the minimum time (unfortunately macOS tends to fail on shorter times)
  • get the value of `now` once per loop iteration

/// TODO: after PR #2275 merges, add a similar proptest,
/// using a "not ready for attempt" peer generation strategy
#[test]
fn skipping_outbound_peer_connection_skips_rate_limit(next_peer_attempts in 0..TEST_ADDRESSES) {
@teor2345 (Collaborator, Author) commented:

The `skipping_outbound_peer_connection_skips_rate_limit` test checks the bug fix in the code.

(The other test is useful for making sure we're not delaying way too long when we actually have peers, but it can't be too strict due to OS scheduling delays.)
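
As a rough, self-contained illustration of that property (not the PR's actual proptest), here is what the check looks like against the hypothetical `Candidates` type from the earlier sketch, using tokio's paused test clock (requires tokio's `test-util` feature); the test name and setup are assumptions:

```rust
use tokio::time::Instant;

/// If `next()` never returns a peer, it must not consume any rate-limit time.
/// `Candidates` is the hypothetical type from the earlier sketch, not Zebra's
/// real `CandidateSet`.
#[tokio::test(start_paused = true)]
async fn empty_candidate_set_is_not_rate_limited() {
    let mut candidates = Candidates {
        peers: Vec::new(),
        next_handshake_at: Instant::now(),
    };

    let start = Instant::now();
    for _ in 0..100 {
        assert!(candidates.next().await.is_none());
    }

    // With the virtual clock paused, time only advances if something sleeps,
    // so skipped connections must leave the clock untouched.
    assert_eq!(Instant::now(), start);
}
```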

@teor2345 requested a review from dconnolly June 14, 2021 01:08
@mpguerra removed this from the 2021 Sprint 11 - Zcon2 milestone Jun 14, 2021
@teor2345 changed the title from "Only apply the outbound connection rate-limit to actual connections" to "Security: only apply the outbound connection rate-limit to actual connections" Jun 14, 2021
@teor2345 added the I-usability (Zebra is hard to understand or use) and I-hang (A Zebra component stops responding to requests) labels Jun 14, 2021
@teor2345 merged commit 86f23f7 into main Jun 14, 2021
@teor2345 deleted the only-rate-limit-actual-handshakes branch June 14, 2021 22:29
dconnolly pushed a commit that referenced this pull request Jun 15, 2021
…nections (#2278)

* Only advance the outbound connection timer when it returns an address

Previously, we were advancing the timer even when we returned `None`.
This created large wait times when there were no eligible peers.

* Refactor to avoid overlapping sleep timers

* Add a maximum next peer delay test

Also refactor peer numbers into constants.

* Make the number of proptests overridable by the standard env var

Also cleanup the test constants.

* Test that skipping peer connections also skips their rate limits

* Allow an extra second after each sleep on loaded machines

macOS VMs seem to need this extra time to pass their tests.

* Restart test time bounds from the current time

This change avoids test failures due to cumulative errors.

Also use a single call to `Instant::now` for each test round.
And print the times when the tests fail.

* Stop generating invalid outbound peers in proptests

The candidate set proptests will fail if enough generated peers are
invalid for outbound connections.
dconnolly pushed a commit that referenced this pull request Jun 16, 2021
…nections (#2278)

Labels
  • A-network (Area: Network protocol updates or fixes)
  • A-rust (Area: Updates to Rust code)
  • C-bug (Category: This is a bug)
  • I-hang (A Zebra component stops responding to requests)
  • I-slow (Problems with performance or responsiveness)
  • I-usability (Zebra is hard to understand or use)
Projects: None yet
4 participants