
Log messages rate limit #4536

Closed
matsuro-hadouken opened this issue Feb 13, 2024 · 6 comments

matsuro-hadouken commented Feb 13, 2024

Issue Summary:

On the 27th of September, 2023, I reported a bug to multiple core team members involving an excessive logging issue that resulted in a server outage. The detailed log message is as follows (the message is emitted at WARN level, even though it describes a fatal error):

{"timestamp":"2023-09-27T00:13:01.143965Z","level":"WARN","fields":{"message":"consensus is active, not shutting down","fatal_ann":"fatal error [node/src/reactor/main_reactor/control.rs:199]: attempt to deactivate inactive era era 10894"},"target":"casper_node::reactor::main_reactor"}

This issue led to a rapid accumulation of log files, consuming all available disk space and subsequently causing the server to crash. Recently, I was informed by a team member, Joe, that this issue was addressed in release 1.5.6. While this is commendable, it appears that the broader implications of the problem were not fully addressed.

Please refer to the timestamp in the following screenshot for context:

[screenshot: log timestamps]

These log messages are emitted at an unsustainable rate; with the default configuration this can lead to node failure.

Although the specific error causing the log spam in the screenshot above has been patched, a recent discussion on the Casper TestNet Telegram channel (https://t.me/CasperTestNet/29313) revealed a similar issue, with 3.5G of log data being reported. The following screenshot was shared, indicating a persistent problem:

[screenshot: reported log volume]

This suggests that there are still numerous unhandled error scenarios that could lead to rapid disk space consumption and potentially bring down nodes in a short timeframe.

Proposed Solution:

To mitigate this issue, I recommend implementing a rate limit for log messages and improving error handling so that errors are logged more efficiently and logrotate can keep up with them.

Immediate Workaround:

Given the critical nature of this issue, a temporary workaround could involve setting log rate limits at the journal level or in the node launcher's unit by using the following systemd settings:

LogRateLimitIntervalSec=
LogRateLimitBurst=

This measure could help prevent disk space exhaustion until a more comprehensive solution is implemented.
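
For example, a systemd drop-in for the node service unit might look like this (the unit name and the values are illustrative only and would need tuning per environment):

# /etc/systemd/system/casper-node-launcher.service.d/10-log-rate-limit.conf
[Service]
# If more than LogRateLimitBurst messages are logged by the unit within the
# interval, journald drops the rest until the interval is over.
LogRateLimitIntervalSec=30s
LogRateLimitBurst=10000

After creating the drop-in, a systemctl daemon-reload and a restart of the unit are needed for it to take effect. Messages above the burst are silently dropped, so the limits should be generous enough not to hide real errors.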

sacherjj commented Feb 16, 2024

We are looking at addressing the core cause of the messages and demoting them to debug if needed, rather than rate limiting. I'll be using this issue to collect "problem" messages.
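
As a rough sketch of what demoting to debug means with the node's tracing-based logging (illustrative only, not the actual casper-node code):

use tracing::debug;

fn register_finality_signature(block_hash: &str) {
    // Previously emitted at info!, so it showed up in default logs for every
    // single signature; at debug! it is only visible when the corresponding
    // target is explicitly enabled, e.g. via RUST_LOG.
    debug!(%block_hash, "BlockAcquisition: registering finality signature");
}

fn main() {
    tracing_subscriber::fmt::init();
    register_finality_signature("b8c1..55d3");
}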

@sacherjj

BlockAcquisition and BlockSynchronizer seem to be the most chatty.

We have a countdown of validators that is not useful:

BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (100 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (95 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (90 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (85 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (80 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (75 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (70 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (65 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (60 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (55 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (50 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (45 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (40 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (35 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (30 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (25 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (20 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (15 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (10 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (5 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (80 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (79 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (78 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (77 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (76 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (75 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (74 validators) with 5 peers

@sacherjj

Tons of "BlockAcquisition: registering finality signature for: block hash b8c1..55d3" messages per block. Unneeded.

@sacherjj

Is this logging once per peer of the 5 for each deploy?

BlockSynchronizer: need next for block hash 8e78..de1f: deploy deploy-id(deploy-hash(e66e..5567), approvals-hash(a023..bb2f)) with 5 peers
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f

@sacherjj

Breakdown of a day of logs with top counts. Eliminating the first two would remove about a million needless log lines, and possibly the third as well.

 783032 BlockAcquisition: registering finality signa
 239298 BlockSynchronizer: need next for block hash
  35328 BlockAcquisition: registering deploy for: bl
  10186 outgoing connection failed
   8249 new incoming connection established
   8247 regular connection closing
   7931 new outgoing connection established

@marc-casperlabs

We will clean up the logs. The high-frequency log messages usually fall into three categories:

  • Verbose messages that are bounded, but plentiful: these should usually be debug messages. An example is logging every deploy received in a block; while this is theoretically bounded by the number of blocks, "finite" is not enough in this case. Our policy is to use discretion, but usually NOT to leave them at info or higher.
  • Messages that can be triggered by external parties. An example is an external party sending invalid data. These should always be handled quietly (i.e. debug or lower) for security reasons (i.e. don't make the log spammable); if they are left at a higher level, that is a bug.
  • The third category is the same as the previous one, except the operator needs to see the messages, since a clean mitigation of the spam is not possible (e.g. a resource is being exhausted because of it). Typically our choice is then to either create a metric to make these visible, or to rate limit the logging.

For the third category, we are currently considering some rate limiting; in the past, we have always added metrics instead.
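
As a rough illustration of the metric-based alternative (a sketch using the prometheus crate; the metric name and placement are hypothetical, not the node's actual metrics):

use prometheus::{IntCounter, Registry};

fn main() {
    let registry = Registry::new();
    // Count externally-triggered events instead of logging each one at warn!.
    let invalid_sigs = IntCounter::new(
        "invalid_finality_signatures_total",
        "Finality signatures rejected as invalid",
    )
    .expect("metric definition is valid");
    registry
        .register(Box::new(invalid_sigs.clone()))
        .expect("metric registers");

    // At the point where the spammy log line used to be, bump the counter;
    // operators can then alert on the rate of the metric rather than watch log volume.
    invalid_sigs.inc();
}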

We will be looking at the log messages mentioned in this thread in the coming days and will likely have a solution for the juliet release. In the meantime, the log level of specific components can be set individually through RUST_LOG as a workaround.
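
For example, something along these lines in the node's environment would silence the block synchronizer's info-level chatter while keeping everything else at info (the module path is an assumption; the real target can be read off the "target" field of the JSON log lines):

RUST_LOG="info,casper_node::components::block_synchronizer=warn"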

The systemd rate limiting is something that could be added as an additional measure; however, it would need frequent changes, or at least different settings depending on whether a node is on testnet, mainnet, or an internal network, all of which have different logging requirements.

marc-casperlabs self-assigned this Feb 20, 2024
marc-casperlabs added the bug label Feb 20, 2024