
Log messages rate limit #4536

Closed
matsuro-hadouken opened this issue Feb 13, 2024 · 6 comments

matsuro-hadouken commented Feb 13, 2024

Issue Summary:

On the 27th of September, 2023, I reported a bug to multiple core team members involving an excessive logging issue that resulted in a server outage. The detailed log message is as follows (the message is emitted at WARN level, even though it describes a fatal error):

{"timestamp":"2023-09-27T00:13:01.143965Z","level":"WARN","fields":{"message":"consensus is active, not shutting down","fatal_ann":"fatal error [node/src/reactor/main_reactor/control.rs:199]: attempt to deactivate inactive era era 10894"},"target":"casper_node::reactor::main_reactor"}

This issue led to a rapid accumulation of log files, consuming all available disk space and subsequently causing the server to crash. Recently, I was informed by a team member, Joe, that this issue was addressed in release 1.5.6. While this is commendable, it appears that the broader implications of the problem were not fully addressed.

Please refer to the timestamp in the following screenshot for context:

[screenshot: log timestamps]

These log messages are emitted at an unsustainable rate; with the default configuration this can lead to node failure.

Although the specific error causing the log spam in the screenshot above has been patched, a recent discussion on the Casper TestNet Telegram channel (https://t.me/CasperTestNet/29313) revealed a similar issue, with 3.5G of log data being reported. The following screenshot was shared, indicating a persistent problem:

[screenshot: reported log volume]

This suggests that there are still numerous unhandled error scenarios that could lead to rapid disk space consumption and potentially bring down nodes in a short timeframe.

Proposed Solution:

To mitigate this issue, I recommend implementing a rate limit for log messages and improving error handling so that errors are logged more efficiently and logrotate can keep up with them.

Immediate Workaround:

Given the critical nature of this issue, a temporary workaround could involve setting log rate limits at the journal level or in the node launcher's unit by using the following systemd settings:

LogRateLimitIntervalSec=
LogRateLimitBurst=

This measure could help prevent disk space exhaustion until a more comprehensive solution is implemented.
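
For example, a systemd drop-in for the node service unit might look like this (the unit name and the values are illustrative only and would need tuning per environment):

# /etc/systemd/system/casper-node-launcher.service.d/10-log-rate-limit.conf
[Service]
# If more than LogRateLimitBurst messages are logged by the unit within the
# interval, journald drops the rest until the interval is over.
LogRateLimitIntervalSec=30s
LogRateLimitBurst=10000

After creating the drop-in, a systemctl daemon-reload and a restart of the unit are needed for it to take effect. Messages above the burst are silently dropped, so the limits should be generous enough not to hide real errors.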

sacherjj commented Feb 16, 2024

We are looking at addressing the core cause of the messages and demoting them to debug if needed, rather than rate limiting. I'll be using this issue to collect "problem" messages.
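
As a rough sketch of what demoting to debug means with the node's tracing-based logging (illustrative only, not the actual casper-node code):

use tracing::debug;

fn register_finality_signature(block_hash: &str) {
    // Previously emitted at info!, so it showed up in default logs for every
    // single signature; at debug! it is only visible when the corresponding
    // target is explicitly enabled, e.g. via RUST_LOG.
    debug!(%block_hash, "BlockAcquisition: registering finality signature");
}

fn main() {
    tracing_subscriber::fmt::init();
    register_finality_signature("b8c1..55d3");
}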

@sacherjj

BlockAcquisition and BlockSynchronizer seem to be the most chatty.

We have a countdown of validators that is not useful:

BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (100 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (95 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (90 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (85 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (80 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (75 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (70 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (65 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (60 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (55 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (50 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (45 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (40 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (35 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (30 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (25 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (20 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (15 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (10 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (5 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (80 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (79 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (78 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (77 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (76 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (75 validators) with 5 peers
BlockSynchronizer: need next for block hash d4f3..9fbf: finality signatures at era 12531 (74 validators) with 5 peers

@sacherjj

Tons of "BlockAcquisition: registering finality signature for: block hash b8c1..55d3" messages per block. Unneeded.

@sacherjj

Is this logging once per peer of the 5 for each deploy?

BlockSynchronizer: need next for block hash 8e78..de1f: deploy deploy-id(deploy-hash(e66e..5567), approvals-hash(a023..bb2f)) with 5 peers
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f
BlockAcquisition: registering deploy for: block hash 8e78..de1f

@sacherjj

Breakdown of a day of logs with top counts. Eliminating the first two would remove about a million needless log lines, and possibly the third as well.

 783032 BlockAcquisition: registering finality signa
 239298 BlockSynchronizer: need next for block hash
  35328 BlockAcquisition: registering deploy for: bl
  10186 outgoing connection failed
   8249 new incoming connection established
   8247 regular connection closing
   7931 new outgoing connection established

@marc-casperlabs

We will clean up the logs. The high-frequency log messages usually fall into three categories:

  • Verbose messages that are bounded, but plentiful: these should usually be debug messages. An example is logging every deploy received in a block; while this is theoretically bounded by the number of blocks, "finite" is not enough in this case. Our policy is to use discretion, but usually NOT to leave them at info or higher.
  • Messages that can be triggered by external parties. An example is an external party sending invalid data. These should always be handled quietly (i.e. debug or lower) for security reasons (i.e. don't make the log spammable); if they are left at a higher level, that is a bug.
  • The third category is the same as the previous one, except the operator needs to see the messages, since a clean mitigation of the spam is not possible (e.g. a resource is being exhausted because of it). Typically our choice is then to either create a metric to make these visible, or to rate limit the logging.

For the third category, we are currently considering some rate limiting; in the past, we have always added metrics instead.
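
As a rough illustration of the metric-based alternative (a sketch using the prometheus crate; the metric name and placement are hypothetical, not the node's actual metrics):

use prometheus::{IntCounter, Registry};

fn main() {
    let registry = Registry::new();
    // Count externally-triggered events instead of logging each one at warn!.
    let invalid_sigs = IntCounter::new(
        "invalid_finality_signatures_total",
        "Finality signatures rejected as invalid",
    )
    .expect("metric definition is valid");
    registry
        .register(Box::new(invalid_sigs.clone()))
        .expect("metric registers");

    // At the point where the spammy log line used to be, bump the counter;
    // operators can then alert on the rate of the metric rather than watch log volume.
    invalid_sigs.inc();
}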

We will be looking at the log messages mentioned in this thread in the coming days and will likely have a solution for the juliet release. In the meantime, the log level of specific components can be set individually through RUST_LOG as a workaround.
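
For example, something along these lines in the node's environment would silence the block synchronizer's info-level chatter while keeping everything else at info (the module path is an assumption; the real target can be read off the "target" field of the JSON log lines):

RUST_LOG="info,casper_node::components::block_synchronizer=warn"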

The systemd rate limiting is something that could be added as an additional measure; however, it would need frequent changes, or at least different settings depending on whether a node is on testnet, mainnet, or an internal network, all of which have different logging requirements.

marc-casperlabs self-assigned this Feb 20, 2024
marc-casperlabs added the bug label Feb 20, 2024