qa: rotating node test on v0.38.x #883

Merged 8 commits on May 24, 2023
150 changes: 136 additions & 14 deletions docs/qa/CometBFT-QA-38.md
range of ABCI++ functionality (ABCI 2.0), and several other improvements
described in the
[CHANGELOG](https://github.com/cometbft/cometbft/blob/v0.38.0-alpha.2/CHANGELOG.md).

## Issues discovered

* (critical, fixed) [\#539] and [\#546] - This bug causes the proposer to crash in
`PrepareProposal` because it does not have vote extensions when it should.
It happens mainly when the proposer is catching up.
* (critical, fixed) [\#562] - Several bugs in the metrics-related
logic caused panics when the testnets were started.

## 200 Node Testnet

As in other iterations of our QA process, we have used a 200-node network as
testbed, plus nodes to introduce load and collect metrics.
saturation point is beyond the diagonal defined by `r=200,c=2` and `r=100,c=4`.
|       | c=1   | c=2       | c=4       |
| :---- | ----: | --------: | --------: |
| r=100 | 8900  | 17800     | **35600** |
| r=200 | 17800 | **35600** | 38660     |
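As a rough sanity check, the totals in the table are consistent with each of `c` connections sustaining `r` tx/s over an experiment window of roughly 89 s. The following is a minimal sketch, not part of the QA tooling; the 89 s duration and the mapping of table cells to `(r, c)` pairs are assumptions inferred from the table values, not stated in this section:

```python
# Expected vs. observed transactions per load configuration, assuming each of
# c connections submits r tx/s for an experiment window of ~89 s (an
# assumption inferred from the table values).
DURATION_S = 89

def expected_txs(r: int, c: int, duration: int = DURATION_S) -> int:
    """Upper bound on the number of transactions the load runner can inject."""
    return r * c * duration

# Observed totals (assumed cell layout); a large shortfall vs. the expected
# value, as for r=200, c=4, signals that the saturation point has been passed.
observed = {
    (100, 1): 8900, (100, 2): 17800, (100, 4): 35600,
    (200, 1): 17800, (200, 2): 35600, (200, 4): 38660,
}
for (r, c), seen in observed.items():
    print(f"r={r} c={c}: expected {expected_txs(r, c)}, observed {seen}")
```

Under these assumptions, every configuration except `r=200,c=4` processes essentially all injected transactions, matching the claim that saturation lies beyond the `r=200,c=2` / `r=100,c=4` diagonal.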

### Latencies

The following figure plots the latencies of the experiment carried out with the
configuration `c=1,r=400`.
from this small experiment, we can say that the latencies measured on the two
versions are equivalent, or at least that the version under test is not worse
than the baseline.

### Prometheus Metrics on the Chosen Experiment

This section further examines key metrics extracted from Prometheus data for
the chosen experiment with configuration `c=1,r=400`.

#### Mempool Size

The mempool size, a count of the number of transactions in the mempool, was
shown to be stable and homogeneous at all full nodes. It did not exhibit any
The behavior is similar to that observed in the baseline, presented next.
![mempool-avg-baseline](img37/200nodes_cmt037/avg_mempool_size.png)
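These plots are produced by the `mempool_size` queries in `scripts/qa/reporting/prometheus_plotter.py`; the two query strings used there can be sketched as:

```python
# Query strings behind the mempool plots, mirroring the ones in
# scripts/qa/reporting/prometheus_plotter.py: the per-node mempool size,
# and the average mempool size over all nodes.
fork = 'cometbft'  # or 'tendermint' for the baseline
per_node_query = fork + '_mempool_size'
avg_query = 'avg(' + fork + '_mempool_size)'
print(per_node_query)  # cometbft_mempool_size
print(avg_query)       # avg(cometbft_mempool_size)
```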


#### Peers

The number of peers was stable at all nodes. It was higher for the seed nodes
(around 140) than for the rest (between 20 and 70 for most nodes). The red
than 50 peers is due to [\#9548].
![peers](img37/200nodes_cmt037/peers.png)


#### Consensus Rounds per Height

Most heights took just one round, that is, round 0, but some nodes needed to
advance to round 1.
The following specific run of the baseline required some nodes to reach round 1.
![rounds](img37/200nodes_cmt037/rounds.png)


#### Blocks Produced per Minute, Transactions Processed per Minute

The following plot shows the rate at which blocks were created, from the point
of view of each node. That is, it shows when each node learned that a new block
similar to above.
![total-txs-baseline](img37/200nodes_cmt037/total_txs_rate.png)


#### Memory Resident Set Size

The following graph shows the Resident Set Size of all monitored processes, with
maximum memory usage of 1.6GB, slightly lower than the baseline shown after.

![rss](img38/200nodes/memory.png)

The memory of all processes went down as the load was removed, showing no signs
of unconstrained growth.


#### CPU utilization

##### Comparison to baseline

The best metric from Prometheus to gauge CPU utilization in a Unix machine is
`load1`, as it usually appears in the [output of
The baseline had similar behavior.

![load1-baseline](img37/200nodes_cmt037/cpu.png)

##### Impact of vote extension signature verification

It is important to notice that the baseline (`v0.37.x`) does not implement vote extensions,
whereas the version under test (`v0.38.0-alpha.2`) _does_ implement them, and they are
has no performance impact in the current version of the system: the bottlenecks
Thus, we should focus on optimizing other parts of the system: the ones that cause the current
bottlenecks (mempool gossip duplication, leaner proposal structure, optimized consensus gossip).

### Test Results

The comparison against the baseline results shows that both scenarios had similar
numbers and are therefore equivalent.
commit versions used in the experiments.

| Scenario | Date | Version | Result |
| -------- | ---------- | ---------------------------------------------------------- | ------ |
| 200-node | 2023-05-21 | v0.38.0-alpha.2 (1f524d12996204f8fd9d41aa5aca215f80f06f5e) | Pass |


## Rotating Node Testnet

We use `c=1,r=400` as load, which can be considered a safe workload, as it was close to (but below)
the saturation point in the 200-node testnet. This testnet has fewer nodes (10 validators and 25 full nodes).

Importantly, the baseline considered in this section is `v0.37.0-alpha.2` (Tendermint Core),
which is **different** from the one used in the [previous section](#200-node-testnet).
The reason is that this testnet was not re-tested for `v0.37.0-alpha.3` (CometBFT),
since it was not deemed necessary.

Unlike in the baseline tests, the version of CometBFT used for these tests is _not_ affected by [\#9539],
which was fixed right after the rotating testnet was run for `v0.37`.
As a result, the load introduced in this iteration of the test is higher, as transactions do not get rejected.

### Latencies

The plot of all latencies can be seen here.

![rotating-all-latencies](img38/rotating/rotating_latencies.png)

It is similar to the baseline.

![rotating-all-latencies-bl](img37/200nodes_tm037/v037_rotating_latencies.png)

The average increase of about 1 second with respect to the baseline is due to the higher
transaction load produced (remember the baseline was affected by [\#9539], whereby most transactions
produced were rejected by `CheckTx`).

### Prometheus Metrics

The set of metrics shown here roughly match those shown on the baseline (`v0.37`) for the same experiment.
We also show the baseline results for comparison.

#### Blocks and Transactions per minute

The following plot shows the blocks produced per minute.

![rotating-heights](img38/rotating/rotating_block_rate.png)

This is similar to the baseline, shown below.

![rotating-heights-bl](img37/rotating/rotating_block_rate.png)

The following plot shows only the heights reported by ephemeral nodes, both when they were blocksyncing
and when they were running consensus.
The second plot is the baseline plot for comparison. The baseline lacks the heights when the nodes were
blocksyncing as that metric was implemented afterwards.

![rotating-heights-ephe](img38/rotating/rotating_eph_heights.png)

![rotating-heights-ephe-bl](img37/rotating/rotating_eph_heights.png)
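The ephemeral-node plot above is obtained by combining two metrics with PromQL's `or` operator: the consensus height while a node runs consensus, falling back to the blocksync height while it is catching up, as done in `scripts/qa/reporting/prometheus_plotter.py`:

```python
# PromQL for the heights of ephemeral nodes, mirroring prometheus_plotter.py:
# `or` returns samples of the blocksync height for nodes where the
# consensus-height series is absent (i.e., nodes still block-syncing).
fork = 'cometbft'
eph_heights_query = (
    fork + '_consensus_height{job=~"ephemeral.*"} or '
    + fork + '_blocksync_latest_block_height{job=~"ephemeral.*"}'
)
print(eph_heights_query)
```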

We see that heights follow a similar pattern in both plots: they grow in length as the experiment advances.

The following plot shows the transactions processed per minute.

![rotating-total-txs](img38/rotating/rotating_txs_rate.png)

For comparison, this is the baseline plot.

![rotating-total-txs-bl](img37/rotating/rotating_txs_rate.png)

We can see the rate is much lower in the baseline plot.
The reason is that the baseline was affected by [\#9539], whereby `CheckTx` rejected most transactions
produced by the load runner.

#### Peers

The plot below shows the evolution of the number of peers throughout the experiment.

![rotating-peers](img38/rotating/rotating_peers.png)

This is the baseline plot, for comparison.

![rotating-peers-bl](img37/rotating/rotating_peers.png)

The plotted values and their evolution are comparable in both plots.

For further details on these plots, see [this section](./TMCore-QA-34.md#peers-1).

#### Memory Resident Set Size

The average Resident Set Size (RSS) over all processes is notably larger on `v0.38.0-alpha.2` than on the baseline.
The reason, again, is that `CheckTx` rejected most transactions submitted on the baseline,
and therefore the overall transaction load was lower on the baseline.
This is consistent with the difference seen in the transaction rate plots
in the [previous section](#blocks-and-transactions-per-minute).

![rotating-rss-avg](img38/rotating/rotating_avg_memory.png)

![rotating-rss-avg-bl](img37/rotating/rotating_avg_memory.png)

#### CPU utilization

The plots show metric `load1` for all nodes for `v0.38.0-alpha.2` and for the baseline.

![rotating-load1](img38/rotating/rotating_cpu.png)

![rotating-load1-bl](img37/rotating/rotating_cpu.png)

In both cases, it stays under 5 most of the time, which is considered normal load.
The load is higher on `v0.38.0-alpha.2` on average because of the larger
number of transactions processed per minute compared to the baseline.

### Test Result

| Scenario | Date | Version | Result |
| -------- | ---------- | ---------------------------------------------------------- | ------ |
| Rotating | 2023-05-23 | v0.38.0-alpha.2 (e9abb116e29beb830cf111b824c8e2174d538838) | Pass |

[\#9539]: https://github.com/tendermint/tendermint/issues/9539
[\#9548]: https://github.com/tendermint/tendermint/issues/9548
[\#539]: https://github.com/cometbft/cometbft/issues/539
[\#546]: https://github.com/cometbft/cometbft/issues/546
[\#562]: https://github.com/cometbft/cometbft/issues/562
Binary file added docs/qa/img37/rotating/rotating_avg_memory.png
Binary file added docs/qa/img37/rotating/rotating_cpu.png
Binary file added docs/qa/img37/rotating/rotating_eph_heights.png
Binary file added docs/qa/img37/rotating/rotating_peers.png
Binary file added docs/qa/img37/rotating/rotating_txs_rate.png
Binary file added docs/qa/img38/rotating/rotating_avg_memory.png
Binary file added docs/qa/img38/rotating/rotating_cpu.png
Binary file added docs/qa/img38/rotating/rotating_latencies.png
Binary file added docs/qa/img38/rotating/rotating_peers.png
Binary file added docs/qa/img38/rotating/rotating_txs_rate.png
3 changes: 2 additions & 1 deletion scripts/qa/reporting/README.md
python3 latency_plotter.py /path/to/csv/files/raw.csv
2. Tweak the script to your needs
1. Adjust the time window
2. Select the right fork
3. Select the right test case
4. Tweak/add/remove metrics
3. Run the script as follows
```bash
# Do the following while ensuring that the virtual environment is activated (see
```
46 changes: 35 additions & 11 deletions scripts/qa/reporting/prometheus_plotter.py

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as md

import numpy as np
import pandas as pd

from prometheus_pandas import query

#release = 'v0.37.0-alpha.2'
release = 'v0.38.0-alpha.2'
path = os.path.join('imgs')
prometheus = query.Prometheus('http://localhost:9090')

# Time window
#window_size = dict(seconds=150) #CMT 0.37.x-alpha3
#window_size = dict(seconds=126) #TM v0.37 (200 nodes) baseline
#window_size = dict(hours=1, minutes=28, seconds=25) #TM v0.37.0-alpha.2 (rotating)
#window_size = dict(seconds=130) #homogeneous
#window_size = dict(seconds=127) #baseline
#window_size = dict(seconds=115) #CMT v0.38.0-alpha.2 (200 nodes)
window_size = dict(hours=1, minutes=46) #CMT v0.38.0-alpha.2 (rotating)


ext_window_size = dict(seconds=145)

# Use the time provided by latency_plotter for the selected experiment.
Expand All @@ -32,10 +38,12 @@
#left_end = '2023-02-14T15:18:00Z' #cmt1 tm1
#left_end = '2023-02-07T18:07:00Z' #homogeneous
#left_end = '2022-10-13T19:41:23Z' #baseline
#left_end = '2023-02-22T18:56:29Z' #CMT v0.37.x-alpha3
#left_end = '2022-10-13T15:57:50Z' #TM v0.37 (200 nodes) baseline
#left_end = '2023-03-20T19:45:35Z' #feature/abci++vef merged with main (7d8c9d426)
#left_end = '2023-05-22T09:39:20Z' #CMT v0.38.0-alpha.2 - 200 nodes
#left_end = '2022-10-10T15:47:15Z' #TM v0.37.0-alpha.2 - rotating
left_end = '2023-05-23T08:09:50Z' #CMT v0.38.0-alpha.2 - rotating

right_end = pd.to_datetime(left_end) + pd.Timedelta(**window_size)
time_window = (left_end, right_end.strftime('%Y-%m-%dT%H:%M:%SZ'))
fork='cometbft'
#fork='tendermint'

# Do prometheus queries, depending on the test case
queries200Nodes = [
(( fork + '_mempool_size', time_window[0], time_window[1], '1s'), 'mempool_size', dict(ylabel='TXs', xlabel='time (s)', title='Mempool Size', legend=False, figsize=(10,6), grid=True, kind='area',stacked=True), False),
(( fork + '_p2p_peers', time_window[0], time_window[1], '1s'), 'peers', dict(ylabel='# Peers', xlabel='time (s)', title='Peers', legend=False, figsize=(10,6), grid=True), True),
(( 'avg(' + fork + '_mempool_size)', time_window[0], time_window[1], '1s'), 'avg_mempool_size', dict(ylabel='TXs', xlabel='time (s)', title='Average Mempool Size', legend=False, figsize=(10,6), grid=True), False),
(( 'rate(' + fork + '_consensus_total_txs[20s])*60', ext_time_window[0], ext_time_window[1], '1s'), 'total_txs_rate', dict(ylabel='TXs/min', xlabel='time (s)', title='Rate of transaction processing', legend=False, figsize=(10,6), grid=True), True),
]

queriesRotating = [
(( 'rate(' + fork + '_consensus_height[20s])*60', time_window[0], time_window[1], '1s'), 'rotating_block_rate', dict(ylabel='blocks/min', xlabel='time', title='Rate of Block Creation', legend=False, figsize=(10,6), grid=True), False),
(( 'rate(' + fork + '_consensus_total_txs[20s])*60', time_window[0], time_window[1], '1s'), 'rotating_txs_rate', dict(ylabel='TXs/min', xlabel='time', title='Rate of Transaction processing', legend=False, figsize=(10,6), grid=True), False),
(( fork + '_consensus_height{job=~"ephemeral.*"} or ' + fork + '_blocksync_latest_block_height{job=~"ephemeral.*"}',
time_window[0], time_window[1], '1s'), 'rotating_eph_heights', dict(ylabel='height', xlabel='time', title='Heights of Ephemeral Nodes', legend=False, figsize=(10,6), grid=True), False),
(( fork + '_p2p_peers', time_window[0], time_window[1], '1s'), 'rotating_peers', dict(ylabel='# peers', xlabel='time', title='Peers', legend=False, figsize=(10,6), grid=True), False),
(( 'avg(process_resident_memory_bytes)', time_window[0], time_window[1], '1s'), 'rotating_avg_memory', dict(ylabel='memory (bytes)', xlabel='time', title='Average Memory Usage', legend=False, figsize=(10,6), grid=True), False),
(( 'node_load1', time_window[0], time_window[1], '1s'), 'rotating_cpu', dict(ylabel='load', xlabel='time', title='Node Load', legend=False, figsize=(10,6), grid=True), False),
]

#queries = queries200Nodes
queries = queriesRotating


for (query, file_name, pandas_params, plot_average) in queries:
print(query)

data_frame = prometheus.query_range(*query)
#Tweak the x ticks
data_frame = data_frame.set_index(md.date2num(data_frame.index))


pandas_params["title"] += " - " + release
ax = data_frame.plot(**pandas_params)
if plot_average:
average = data_frame.mean(axis=1)
data_frame['__average__'] = average
pandas_params['lw'] = 8
pandas_params['style'] = ['--']
pandas_params['color'] = ['red']
ax = data_frame['__average__'].plot(**pandas_params)

ax.xaxis.set_major_formatter(md.DateFormatter('%H:%M:%S'))
plt.savefig(os.path.join(path, file_name + '.png'))
plt.plot()
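The `(left_end, right_end)` window the script passes to each range query is derived from the experiment start time plus a duration; that computation can be sketched in isolation:

```python
# Sketch of the time-window computation used by prometheus_plotter.py:
# the window start (`left_end`, as reported by latency_plotter for the
# selected experiment) plus a duration yields the (start, end) pair
# passed to Prometheus range queries.
import pandas as pd

left_end = '2023-05-23T08:09:50Z'        # CMT v0.38.0-alpha.2 - rotating
window_size = dict(hours=1, minutes=46)

right_end = pd.to_datetime(left_end) + pd.Timedelta(**window_size)
time_window = (left_end, right_end.strftime('%Y-%m-%dT%H:%M:%SZ'))
print(time_window)  # ('2023-05-23T08:09:50Z', '2023-05-23T09:55:50Z')
```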
