aws ec2 bandwidth between different regions #16

Open
shan-chen opened this issue Jul 27, 2021 · 15 comments

@shan-chen

It seems that a peering connection needs to be established between any two regions to achieve high bandwidth on AWS EC2, but I didn't find one in the benchmark scripts.

@asonnino
Contributor

Indeed we are not using it. The idea was to simulate (as much as possible) a distributed setting where each node is run by a different authority (although using only AWS machines undermines this argument). Also, I still have no evidence that bandwidth is the bottleneck (do you think otherwise?).

@shan-chen
Author

I used iperf to test the bandwidth between two instances in two different regions, and the result shows only about 60 Mbps without a peering connection. Is that normal?

@asonnino
Contributor

Which instance type are you using? Fig 6 of the paper (https://arxiv.org/pdf/2105.11827.pdf) seems to show that we get higher than that using m5.8xlarge instances.

@shan-chen
Author

I actually use m5.8xlarge, one in Ohio, one in Sydney. I am also confused; what bandwidth do you get in your environment?

@asonnino
Contributor

What's the exact command you use? I will try to run the same test.

@shan-chen
Author

iperf -s on one instance (A) and iperf -c <A's IP> on another instance (B)

@asonnino
Contributor

I get similar results to yours (around 60 Mbps), and I looked back at the benchmark results:

Benchmark with 4 nodes:

-----------------------------------------
 SUMMARY:
-----------------------------------------
 + CONFIG:
 Faults: 0 node(s)
 Committee size: 4 node(s)
 Worker(s) per node: 1 worker(s)
 Collocate primary and workers: False
 Input rate: 50,000 tx/s
 Transaction size: 512 B
 Execution time: 301 s

 Header size: 1,000 B
 Max header delay: 200 ms
 GC depth: 50 round(s)
 Sync retry delay: 10,000 ms
 Sync retry nodes: 3 node(s)
 batch size: 500,000 B
 Max batch delay: 200 ms

 + RESULTS:
 Consensus TPS: 45,819 tx/s
 Consensus BPS: 23,459,168 B/s
 Consensus latency: 2,418 ms

 End-to-end TPS: 45,667 tx/s
 End-to-end BPS: 23,381,658 B/s
 End-to-end latency: 3,354 ms
-----------------------------------------

The result seems to report a total throughput of about 23MB/s. Since there are 4 nodes (and the codebase uses one TCP connection per node), each connection ships about 23/4*8 = 46 Mbps.

Benchmark with 50 nodes:

-----------------------------------------
 SUMMARY:
-----------------------------------------
 + CONFIG:
 Faults: 0 node(s)
 Committee size: 50 node(s)
 Worker(s) per node: 1 worker(s)
 Collocate primary and workers: True
 Input rate: 180,000 tx/s
 Transaction size: 512 B
 Execution time: 301 s

 Header size: 1,000 B
 Max header delay: 200 ms
 GC depth: 50 round(s)
 Sync retry delay: 10,000 ms
 Sync retry nodes: 3 node(s)
 batch size: 500,000 B
 Max batch delay: 200 ms

 + RESULTS:
 Consensus TPS: 172,685 tx/s
 Consensus BPS: 88,414,625 B/s
 Consensus latency: 3,007 ms

 End-to-end TPS: 171,999 tx/s
 End-to-end BPS: 88,063,671 B/s
 End-to-end latency: 3,657 ms
-----------------------------------------

The result seems to report a total throughput of about 88MB/s. Since there are 50 nodes (and the codebase uses one TCP connection per node), each connection ships about 88/50*8 = 14 Mbps.
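For reference, the same back-of-the-envelope calculation as a small Python sketch (the BPS values are the end-to-end figures from the two summaries above, and the one-TCP-connection-per-node assumption is the one stated in this comment):

# Rough per-connection bandwidth implied by the two benchmark runs above,
# assuming one TCP connection per peer node.
def per_connection_mbps(total_bps: float, nodes: int) -> float:
    """Convert a total throughput in bytes/s into Mbps per peer connection."""
    return total_bps / nodes * 8 / 1e6

print(per_connection_mbps(23_381_658, 4))   # ~46 Mbps (4-node run)
print(per_connection_mbps(88_063_671, 50))  # ~14 Mbps (50-node run)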

Now it is not clear how useful the above calculations are: (1) iperf seems to measure the BW of a single TCP connection (which doesn't mean we get twice the BW if we open 2 TCP connections), and (2) each time a batch is transmitted, the sender worker needs to broadcast it (at best), thus using O(n) BW (but through different TCP connections).

Any thoughts?

@asonnino
Contributor

BTW, did you try to reproduce some of the results?

@shan-chen
Author

I have reproduced the results in another environment where the BW per instance is limited, but not on AWS. In my previous experiments, iperf could reach the max BW with only one TCP connection. I will study iperf further and test on AWS again. I'm only confused by AWS's advertised 10 Gbps.
Thank you very much!

@asonnino
Contributor

asonnino commented Jul 27, 2021

It is a good point. I only selected m5.8xlarge instances because they seemed standard for research papers (and reasonably priced for this kind of project), and I trusted the advertised 10 Gbps BW. Please let me know how your experiments go.

@shan-chen
Author

OK, I will discuss it with you once I have results. Thank you!

@shan-chen
Author

shan-chen commented Jul 29, 2021

I deployed 4 nodes in Sydney, Ohio, Frankfurt and Stockholm, all using EC2 t3.xlarge instances, which provide up to 5 Gbps BW. iperf can normally reach the max BW, but it seems that the BW per TCP connection is limited on AWS. The iperf documentation says: "If the total aggregate bandwidth is more than what an individual stream gets, something is wrong. Either the TCP window size is too small, or the OS's TCP implementation has bugs, or the network itself has deficiencies." But on EC2, using iperf -P n gets almost n times the BW.
Anyway, I used a batch size of 1,000, 1,000 bytes per transaction, and one worker per node, getting a TPS of around 37K. So the total throughput is around 1K * 37K * 8 = 296 Mbps. The max BW per connection is 60 Mbps. One worker needs to send its transactions 3 times (once to each other worker), so at most 180 Mbps. Is the remaining 116 Mbps used for consensus? Is that reasonable?
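As a quick sanity check of these numbers (a sketch only; the transaction size, TPS, per-connection BW, and committee size are the values quoted above):

# Back-of-the-envelope check of the figures quoted above.
TX_SIZE_B = 1_000        # bytes per transaction
TPS = 37_000             # observed throughput (tx/s)
PER_CONN_MBPS = 60       # iperf result for a single TCP connection
OTHER_WORKERS = 3        # each worker broadcasts its batches to 3 peers

total_mbps = TX_SIZE_B * TPS * 8 / 1e6          # ~296 Mbps of committed payload
broadcast_mbps = OTHER_WORKERS * PER_CONN_MBPS  # ~180 Mbps if every connection is saturated
remainder_mbps = total_mbps - broadcast_mbps    # ~116 Mbps left unexplained

print(total_mbps, broadcast_mbps, remainder_mbps)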

@asonnino
Contributor

Is the remaining 116 Mbps used for consensus? Is that reasonable?

No, the consensus layer uses 0 BW. The Tusk consensus protocol achieves total ordering by merely interpreting the DAG (without sending any messages).

Most of the BW should be used by the workers: they assemble client txs into batches and share the batches with the other workers. Then the primary consistently broadcasts (Byzantine Consistent Broadcast) block headers containing hash references to the workers' batches. The first field of node_params, called 'header_size', sets the size of each block header in bytes, and thus how many of these hashes it includes. That is, setting 'header_size = 1_000' means that each header includes about 1000/32 ≈ 31 hashes. We typically try to keep block headers small (about 2KB for 30 nodes), so they shouldn't use a lot of BW.
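As a sketch of that calculation (assuming 32-byte digests and ignoring any fixed header overhead):

# Number of 32-byte batch digests that fit in a header of 'header_size' bytes.
DIGEST_SIZE_B = 32

def hashes_per_header(header_size_b: int) -> int:
    return header_size_b // DIGEST_SIZE_B

print(hashes_per_header(1_000))  # 31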

Another source of BW consumption could be the synchronization protocol. If a node falls behind (that is, the rest of the nodes made progress without it), it will try to sync any missing data. The current sync protocol is quite naive, to be fair (and could be a bottleneck). The sync mechanism, however, uses separate TCP connections.

I used a batch size of 1,000, 1,000 bytes per transaction, and one worker per node

Which bench_params and node_params did you set in your fabfile.py? A (worker) batch size of 1kB seems very small (I typically use 500KB):

node_params = {
    'header_size': 1_000,
    'max_header_delay': 100,
    'gc_depth': 50,
    'sync_retry_delay': 10_000,
    'sync_retry_nodes': 3,
    'batch_size': 500_000,
    'max_batch_delay': 100
}

Also it could be that the bottleneck is not BW (could it be storage/IO?). Another experiment could be to use many workers per node, all on the same machine (setting collocate=True), and see if the system can take advantage of the extra BW:

bench_params = {
    'nodes': [10, 20, 30],
    'workers': 2,
    'collocate': True,
    'rate': [20_000, 30_000, 40_000],
    'tx_size': 512,
    'faults': 0,
    'duration': 300,
    'runs': 2,
}

Btw, I just added a few benchmark results to the repo (in a folder called data).

@shan-chen
Author

I used bench_params and node_params as:

bench_params = {
    'faults': 0,
    'nodes': [4],
    'workers': 1,
    'collocate': True,
    'rate': [60_000],
    'tx_size': 1_000,
    'duration': 30,
    'runs': 1,
}
node_params = {
    'header_size': 1_000,
    'max_header_delay': 200,
    'gc_depth': 50,
    'sync_retry_delay': 10_000,
    'sync_retry_nodes': 3,
    'batch_size': 1_000_000,
    'max_batch_delay': 200
}

What I mean by consensus is that primaries need to broadcast hashes and certificates to construct the DAG, right? These parts indeed don't use a lot of BW. I will further test how many blocks proposed by each primary are committed.

I'm still confused by the result. As shown above, 296 Mbps is the total throughput achieved, not the actual BW spent, and one worker uses 180 Mbps at most. Is the proportion reasonable?

@asonnino
Contributor

I will further test how many blocks proposed by each primary are committed.

That would be great, actually. Let me know how it goes.

I'm still confused by the result. As shown above, 296 Mbps is the total throughput achieved, not the actual BW spent, and one worker uses 180 Mbps at most. Is the proportion reasonable?

It is possible, I guess. Each worker is collocated with a client (i.e., the input load is load-balanced amongst the workers), so there are 4 clients in your testbed (each submitting transactions at a rate of 60,000/4 = 15,000 tx/s). Now I suspect that the workers do not use all their available BW.
