Log messages rate limit #4536
Comments
We are looking at addressing the core cause of these messages and demoting them to debug where needed, rather than rate limiting. I'll be using this issue to collect "problem" messages.
BlockAcquisition and BlockSynchronizer seem to be the most chatty. We have a strand countdown of validators that is not useful:
Tons of: [log excerpt not captured in this copy]

Is this logging once per peer of the 5 for each deploy?
Breakdown of a day of logs with top counts: [table not captured in this copy]. Eliminating the first two would kill a million needless log lines, possibly the third as well.
We will clean up the logs. The high-frequency log messages usually fall into three categories: [list not captured in this copy]

For the third category we are currently considering some rate limiting; in the past, we have always added metrics instead. We will be looking at the log messages mentioned in this thread in the coming days and will likely have a solution for the juliet release. In the meantime, specific components can have their log level set individually through the node's logging configuration. The systemd rate limiting is something that could be added as an additional measure; however, it would need frequent changes, or at least different settings depending on whether a node is on testnet, mainnet, or an internal network, all of which have different logging requirements.
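For illustration, assuming the node honors a `tracing`-style `RUST_LOG` filter (that mechanism and the module path below are assumptions, not confirmed casper-node configuration), a systemd drop-in could raise the threshold for one chatty component while leaving the rest of the node at the default level:

```ini
# /etc/systemd/system/casper-node-launcher.service.d/logging.conf
# Hypothetical drop-in: keep the node at info overall, but log only
# warn and above from one noisy component. The module path is
# illustrative; substitute the real target name for your build.
[Service]
Environment="RUST_LOG=info,casper_node::components::block_synchronizer=warn"
```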
Issue Summary:
On the 27th of September, 2023, I reported a bug to multiple core team members involving an excessive logging issue that resulted in a server outage. The detailed log message is as follows (the message has WARN level, although in effect it is FATAL):

This issue led to a rapid accumulation of log files, consuming all available disk space and subsequently causing the server to crash. Recently, I was informed by a team member, Joe, that this issue was addressed in release 1.5.6. While this is commendable, it appears that the broader implications of the problem were not fully addressed. Please refer to the timestamp in the following screenshot for context:
The frequency of these log messages is unsustainable and, with default settings, can lead to system failure.
Although the specific error causing the log spam in the screenshot above has been patched, a recent discussion on the Casper TestNet Telegram channel (https://t.me/CasperTestNet/29313) revealed a similar issue, with 3.5G of log data being reported. The following screenshot was shared, indicating a persistent issue:

This suggests that there are still numerous unhandled error scenarios that could lead to rapid disk space consumption and potentially bring down nodes in a short timeframe.
Proposed Solution:
To mitigate this issue, I recommend implementing a rate limit for log messages and enhancing error handling across the board, so that errors are logged more efficiently and LogRotate can handle them accordingly.
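For illustration, here is a minimal sketch of what per-call-site log rate limiting could look like, assuming the node logs through the `tracing` crate; all names here (`RateLimiter`, `warn_unresponsive_peer`, the 60-second window, the burst of 10) are illustrative, not existing casper-node code:

```rust
// Minimal sketch of per-call-site log rate limiting, assuming the node
// logs via the `tracing` crate (add `tracing` to Cargo.toml to run this).
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Allows at most `burst` messages per `interval_secs` window for one
/// call site; the excess is counted and reported as a single summary.
/// Best-effort under concurrency: races only skew the counts slightly.
pub struct RateLimiter {
    window_start: AtomicU64, // start of current window, seconds since epoch
    count: AtomicU64,        // messages allowed in the current window
    suppressed: AtomicU64,   // messages dropped in the current window
    interval_secs: u64,
    burst: u64,
}

impl RateLimiter {
    pub const fn new(interval_secs: u64, burst: u64) -> Self {
        Self {
            window_start: AtomicU64::new(0),
            count: AtomicU64::new(0),
            suppressed: AtomicU64::new(0),
            interval_secs,
            burst,
        }
    }

    /// Returns `Some(n)` if the caller may log, where `n` is how many
    /// messages were suppressed since the last allowed one; `None` if
    /// this message should be dropped.
    pub fn check(&self) -> Option<u64> {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock before UNIX epoch")
            .as_secs();
        let start = self.window_start.load(Ordering::Relaxed);
        if now.saturating_sub(start) >= self.interval_secs {
            // A new window begins: reset and report what we dropped.
            self.window_start.store(now, Ordering::Relaxed);
            self.count.store(1, Ordering::Relaxed);
            return Some(self.suppressed.swap(0, Ordering::Relaxed));
        }
        if self.count.fetch_add(1, Ordering::Relaxed) < self.burst {
            Some(0)
        } else {
            self.suppressed.fetch_add(1, Ordering::Relaxed);
            None
        }
    }
}

// Usage: one static limiter per noisy call site (hypothetical example).
static PEER_WARN_LIMIT: RateLimiter = RateLimiter::new(60, 10);

fn warn_unresponsive_peer(peer: &str) {
    if let Some(dropped) = PEER_WARN_LIMIT.check() {
        if dropped > 0 {
            tracing::warn!(dropped, "identical warnings were suppressed");
        }
        tracing::warn!(%peer, "peer is unresponsive");
    }
}
```

The summary line preserves the signal ("this happened N more times") while capping the volume, which also keeps the output manageable for LogRotate.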
Immediate Workaround:
Given the critical nature of this issue, a temporary workaround could involve setting log rate limits at the journal level or in the node launcher's unit file, using a configuration such as the one below:
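The configuration snippet did not survive in this copy of the issue. A plausible reconstruction using standard systemd settings (journald's global rate limit, or the per-unit LogRateLimit* options available since systemd 240) would look like the following; the interval and burst values are illustrative, and the unit name should be adjusted to your deployment:

```ini
# /etc/systemd/journald.conf.d/rate-limit.conf
# Journal-level limit: messages beyond the burst within one interval
# are dropped, and journald logs one line noting how many it skipped.
[Journal]
RateLimitIntervalSec=30s
RateLimitBurst=1000

# /etc/systemd/system/casper-node-launcher.service.d/rate-limit.conf
# Per-unit alternative (systemd >= 240): overrides the journald
# default for the node launcher's unit only.
[Service]
LogRateLimitIntervalSec=30s
LogRateLimitBurst=1000
```

Applying the journald change requires `systemctl restart systemd-journald`; the unit drop-in takes effect after `systemctl daemon-reload` and a restart of the service.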
This measure could help prevent disk space exhaustion until a more comprehensive solution is implemented.