DotNext.Net.Cluster: System.ArgumentOutOfRangeException: Non-negative number required. (Parameter 'length') #244
Comments
Related #242. |
Hi @sakno I've attached a zip of the node.state-file of the problematic node202: node.zip The scenario is as follows: I restart the 6 Raspberry PIs in our cluster and our software including Raft is automatically started on startup. They all start from a clean slate. Here's what made it into their respective log-files (the callstack for the exceptions in node202 has been modified by my re-throwing in order to get various values):
Is this the data you were asking for? |
Do you mean these? |
Yep, thanks! You've posted a WAL in the legacy binary format, but you're trying to open it as the new format, which is not binary compatible with the previous one. See Release Notes. There are a few ways to fix that:
|
That surprises me since we store the node.state and partition files in a RAM disk folder, and I know the cluster was both power cycled and rebooted multiple times yesterday. |
Yep, the easiest way is to look at the first 512 bytes. In the new format, all of them are zeroes (except the first byte, which can be 1 or 0). The old format doesn't have such a header, so its first 40 bytes are mostly non-zero. |
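The check described above can be sketched as a small standalone script. This is a minimal sketch in Python, not dotNext code: the names `looks_like_new_format`, `looks_like_new_format_file`, and `HEADER_SIZE` are hypothetical, and the rule is only a heuristic taken from the description in this comment (first 512 bytes are zero, except the first byte, which may be 0 or 1).

```python
HEADER_SIZE = 512  # size of the new-format header, as described in the comment above


def looks_like_new_format(data: bytes) -> bool:
    """Heuristically decide whether the data starts with a new-format header."""
    header = data[:HEADER_SIZE]
    if len(header) < HEADER_SIZE:
        return False  # too short to contain a full header
    # The first byte may be 0 or 1; every other header byte must be zero.
    return header[0] in (0, 1) and not any(header[1:])


def looks_like_new_format_file(path: str) -> bool:
    """Same check, reading only the first HEADER_SIZE bytes of a file."""
    with open(path, "rb") as f:
        return looks_like_new_format(f.read(HEADER_SIZE))
```

A file that fails this check is not necessarily in the legacy format; the sketch only tells you that it does not start with the all-zero header described here.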
I've published release 5.7.1 with a patch for new binary format that improves reliability of WAL in case of power loss or process crash. |
Hi @sakno |
Definitely not. You can check it yourself. Instantiate your WAL class inside a test and open those files with it. For instance, for a log entry with index
The offset of the first log entry matches the size of the legacy format header (50 entries per partition × 40 bytes of metadata = 2000 bytes). The length is meaningful, as well as the timestamp (which is |
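The arithmetic in this comment can be written out as a short sketch. It is plain Python, not dotNext code; the names `LEGACY_METADATA_SIZE` and `legacy_header_size` are hypothetical, and the only inputs are the two numbers quoted above (50 entries per partition, 40 bytes of metadata per entry).

```python
LEGACY_METADATA_SIZE = 40  # bytes of metadata per log entry, per the comment above


def legacy_header_size(records_per_partition: int) -> int:
    """Size in bytes of the legacy metadata table at the start of a partition file."""
    return records_per_partition * LEGACY_METADATA_SIZE


# With 50 entries per partition, the first log entry starts right after the table.
first_entry_offset = legacy_header_size(50)
print(first_entry_offset)  # 2000
```

An offset of 2000 for the first entry is therefore what a legacy-format partition with 50 entries would be expected to show.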
There are two places in the code inside of |
You mentioned ramfs. Is there any chance that the file content is shared between deletions of the file? Could you reset the Linux page cache or reboot the device? |
I'm rebooting all the devices between each test run. |
maxLogEntrySize is always 0 |
Another approach to catch the mistake: go to |
I did this and |
The new format uses a footer of the same size as the header in the old format. Probably, due to some bug, the footer is placed at position |
OK 👍 let me know if you want me to try out something. |
I've added some assertions to catch this situation. You can try |
Your assertions are not hit 😕 |
Does it make sense to add the same assertion in Table.WriteThroughAsync (
Btw, are you counting on the footer memory buffer being initialized with all 0s (in the Table constructor)? |
Please check dependency version
Line 590 points to a different location in dotNext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs Lines 590 to 597 in 2edc792
|
I've changed the way the partition restores the metadata table. You can try |
Thank you 👍 I'll try it out tomorrow |
It still happens with the latest changes on the I see this happening for all nodes receiving a snapshot.
|
Finally I've reproduced it. You can execute |
This is how it works on production environment. Let's assume that we have three nodes A, B, C in the cluster. A is the leader.
It's not a bug in the new binary format; it's unexpected behavior of compaction on the follower side that leads to a failed assertion. With the legacy format it was not a problem, because the previous log entry is not needed to calculate the position of a new log entry. |
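To illustrate the difference described above, here is a toy model in Python; it is not dotNext code. It assumes (this is an assumption, not something confirmed in the thread) that the new format derives a new entry's position from where the previous entry ends. The names `Entry`, `next_offset_new_format`, and `HEADER_SIZE` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

HEADER_SIZE = 512  # assumed start of the first entry in this toy model


@dataclass
class Entry:
    offset: int  # position of the entry in the partition file
    length: int  # size of the entry in bytes

    @property
    def end(self) -> int:
        return self.offset + self.length


def next_offset_new_format(previous: Optional[Entry], is_first_in_partition: bool) -> int:
    """Position of a new entry when it depends on the previous entry (assumed model)."""
    if is_first_in_partition:
        return HEADER_SIZE
    if previous is None:
        # The previous entry is gone (for example, removed by compaction),
        # so the position cannot be derived; the toy model fails loudly.
        raise LookupError("previous log entry is missing")
    return previous.end
```

In this model, a follower that compacts away the previous entry before appending has nothing to compute the next position from, which is the kind of dependency the legacy format did not have according to the comment above.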
Done, please check |
Yeah, I forgot to fix it in one more place, inside of |
Not sure if you were done with the fix, but I tried your changes to |
I see election timeouts locally with |
Found more issues with partition sealing. Fixed and pushed to |
Release 5.7.3 has been published. |
Reopened for confirmation. |
Hi, thanks, it seems to be gone (tested in the same test system). I think it makes sense to close it now. We will do some longer tests, roll it out to more systems and will let you know if we see it again. |
Discussed in #243
Originally posted by LarsWithCA June 25, 2024
Hi @sakno,
Once in a while we get a series of this exception during startup (possibly after a restart or power cycle), preventing the cluster from getting fully up and running:
Our setup:
Here are the values of various arguments/variables/fields inside PersistentState.Table.WriteThroughAsync when the exception happens:
Are there any other values I should try and capture that might help you investigate/solve this?
In the previous version we were running (5.5.0), apart from this exception we also saw "System.ArgumentOutOfRangeException: Specified file length was too large for the file system." in the same area of the code - we might not be seeing that in the newest version.