Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.17: BankingStage Forwarding Filter (backport of #685) #696

Merged
merged 5 commits into from
Apr 11, 2024

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented Apr 9, 2024

Problem

  • Forwarding all packets, when many will ultimately be duplicates

Summary of Changes

  • Only forward packets that were from staked connections (swqos)

Fixes #


This is an automatic backport of pull request #685 done by [Mergify](https://mergify.com).

@mergify mergify bot added the conflicts label Apr 9, 2024
Copy link
Author

mergify bot commented Apr 9, 2024

Cherry-pick of 1744e9e has failed:

On branch mergify/bp/v1.17/pr-685
Your branch is up to date with 'origin/v1.17'.

You are currently cherry-picking commit 1744e9efd7.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   bench-streamer/src/main.rs
	modified:   core/src/banking_stage/forwarder.rs
	modified:   core/src/fetch_stage.rs
	modified:   core/src/repair/ancestor_hashes_service.rs
	modified:   core/src/repair/serve_repair_service.rs
	modified:   core/src/shred_fetch_stage.rs
	modified:   gossip/src/gossip_service.rs
	modified:   local-cluster/tests/local_cluster.rs
	modified:   streamer/src/nonblocking/quic.rs
	modified:   streamer/src/streamer.rs

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   sdk/src/packet.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

tao-stones
tao-stones previously approved these changes Apr 10, 2024
Copy link

@tao-stones tao-stones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backport looks good, but the CI failed on the same spot of local-cluster test. I restarted it, see if it's just flaky.

@apfitzge
Copy link

@steviez I removed test_ledger_cleanup_service, we're unsure why that is affected by this change. The test was removed, by you, in 1.18+. Can you confirm that removing test here is reasonable choice?

@bw-solana
Copy link

@steviez I removed test_ledger_cleanup_service, we're unsure why that is affected by this change. The test was removed, by you, in 1.18+. Can you confirm that removing test here is reasonable choice?

Looks like the test is verifying we can clean up the ledger at the expected rate (i.e. purge slots). Expectation is that the max slots we have is whatever we set in the validator config + num validators. We're seeing more than 3x that during this test.

No idea why this change would impact this. Is this test consistently passing w/o these changes?

@@ -4273,49 +4295,6 @@ fn test_leader_failure_4() {
);
}

#[test]
#[serial]
fn test_ledger_cleanup_service() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this removed? it was failing?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, see comment here: #696 (comment)

@tao-stones
Copy link

No idea why this change would impact this. Is this test consistently passing w/o these changes?

This test passed rather consistently without change, on my local runs

sakridge
sakridge previously approved these changes Apr 10, 2024
@steviez
Copy link

steviez commented Apr 10, 2024

@steviez I removed test_ledger_cleanup_service, we're unsure why that is affected by this change. The test was removed, by you, in 1.18+. Can you confirm that removing test here is reasonable choice?

Looks like the test is verifying we can clean up the ledger at the expected rate (i.e. purge slots). Expectation is that the max slots we have is whatever we set in the validator config + num validators. We're seeing more than 3x that during this test.

No idea why this change would impact this. Is this test consistently passing w/o these changes?

Ok, so digging up git history which I'm sure y'all already did, I removed the test in this PR solana-labs#33947 with this justification:

This PR deletes two tests:

  • The one from local cluster exercises behavior that can be tested in isolation on a "single" node, and is already done so in blockstore_purge.rs

I think it is probably fine to remove, but let me recreate the failure locally and see if I can figure it out real quick just to be 100% sure

@steviez
Copy link

steviez commented Apr 10, 2024

Even with the tip of v1.17, it looks like test_ledger_cleanup_service is an already long running test (this from my 24-core/256 GB RAM machine):

test test_ledger_cleanup_service has been running for over 60 seconds
test test_ledger_cleanup_service ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 50 filtered out; finished in 259.65s

However, with this branch with the test deletion reverted, I've seen test_ledger_cleanup_service running for 12+ minutes without yet finishing

I still think this test isn't great in that it is testing something in an overly complex way (don't need a local cluster to test that blockstore functionality), but it did seem to reveal some underlying change of behavior that we should probably understand before pushing the PR

apfitzge and others added 5 commits April 10, 2024 17:32
* add PacketFlags::FROM_STAKED_NODE

* Only forward packets from staked node

* fix local-cluster test forwarding

* review comment

* tpu_votes get marked as from_staked_node

(cherry picked from commit 1744e9e)

# Conflicts:
#	sdk/src/packet.rs
@codecov-commenter
Copy link

codecov-commenter commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 63.63636% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 81.6%. Comparing base (be83469) to head (ce0d46f).

Additional details and impacted files
@@           Coverage Diff           @@
##            v1.17     #696   +/-   ##
=======================================
  Coverage    81.6%    81.6%           
=======================================
  Files         806      806           
  Lines      219284   219295   +11     
=======================================
+ Hits       179014   179050   +36     
+ Misses      40270    40245   -25     

Copy link

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ ci. only difference from previous approvals is removing test hacks in favor of a more appropriate fix in #723

Copy link

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, glad we took the time to figure out the local cluster failures

@apfitzge apfitzge requested a review from sakridge April 11, 2024 13:46
@t-nelson t-nelson merged commit 6a46c01 into v1.17 Apr 11, 2024
34 checks passed
@t-nelson t-nelson deleted the mergify/bp/v1.17/pr-685 branch April 11, 2024 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants