
src/uv_truncate.c:58: Assertion `i < n_segments' failed #591

Open
jerrytesting opened this issue Jul 28, 2022 · 12 comments
Labels: Bug (Confirmed to be a bug)

@jerrytesting

jerrytesting commented Jul 28, 2022

At the latest commit, 05a49e0, an assertion failed: src/uv_truncate.c:58: void uvTruncateWorkCb(uv_work_t *): Assertion `i < n_segments'.
Attached is the log file n_segments-assertion-failure.log; I hope it helps with debugging.

@MathieuBordere
Contributor

Thanks for testing, always valuable! I'll have a look.

@cole-miller
Contributor

cole-miller commented Aug 24, 2022

It seems what happens here is that our node believes that an open segment open-1 exists, and contains the entries at indices from 2 to 55. It decides to truncate its log at index 51 due to receiving messages from a more current leader, and the barrier for the truncate operation kicks off finalization of open-1 -- but that fails, because open-1 doesn't exist on disk for some reason. The truncation code then only sees one closed segment, which has index 1 only, and one empty open segment open-4, so it can't find the index it's looking for and trips the assertion.

Unfortunately, we're missing the earlier part of the log, which might tell us how open-1 ended up not existing. @jerrytesting, do you happen to have the complete log handy? What kind of setup/conditions were you using for this run?

@MathieuBordere MathieuBordere added the Bug Confirmed to be a bug label Jan 18, 2023
@jerrytesting
Author

Hi, I uploaded the full logs with our Jepsen++ tool: https://github.com/jerrytesting/Canonical-raft-bug-logs/. You can find the `i < n_segments` assertion failure in the logs of nodes n1, n3, and n5.

@jerrytesting
Author

Besides this bug, these logs contain another assertion failure: in the logs of node n1, the assertion `r->state == RAFT_FOLLOWER || r->state == RAFT_CANDIDATE` failed.

@cole-miller
Contributor

@jerrytesting We've got a separate issue tracking that one: canonical/raft#386.

@MathieuBordere
Contributor

@cole-miller
Contributor

cole-miller commented Aug 17, 2023

Mathieu's logs tell a sort of similar story, in the sense that a bunch of files go missing. On n3, at line 30245 we have

LIBRAFT   1692019474892072099 src/uv_list.c:92 segment 0000000000000769-0000000000000896
LIBRAFT   1692019474892078499 src/uv_list.c:92 segment 0000000000000897-0000000000001024
LIBRAFT   1692019474892081599 src/uv_list.c:92 segment 0000000000001025-0000000000001042
LIBRAFT   1692019474892084799 src/uv_list.c:92 segment 0000000000001043-0000000000001153
LIBRAFT   1692019474892087799 src/uv_list.c:92 segment 0000000000001154-0000000000001461
LIBRAFT   1692019474892090899 src/uv_list.c:92 segment 0000000000001462-0000000000001463
LIBRAFT   1692019474892094000 src/uv_list.c:92 segment 0000000000001464-0000000000001590
LIBRAFT   1692019474892097000 src/uv_list.c:92 segment 0000000000001591-0000000000001719
LIBRAFT   1692019474892099900 src/uv_list.c:92 segment 0000000000001720-0000000000001846
LIBRAFT   1692019474892103100 src/uv_list.c:92 segment 0000000000001847-0000000000001974
LIBRAFT   1692019474892106400 src/uv_list.c:97 ignore cluster.yaml
LIBRAFT   1692019474892109300 src/uv_list.c:97 ignore info.yaml
LIBRAFT   1692019474892111900 src/uv_list.c:71 ignore metadata1
LIBRAFT   1692019474892114400 src/uv_list.c:71 ignore metadata2
LIBRAFT   1692019474892117600 src/uv_list.c:92 segment open-17
LIBRAFT   1692019474892120700 src/uv_list.c:92 segment open-18
LIBRAFT   1692019474892124100 src/uv_list.c:97 ignore snapshot-3-1718-480167
LIBRAFT   1692019474892132400 src/uv_list.c:82 snapshot snapshot-3-1718-480167.meta
LIBRAFT   1692019474892135500 src/uv_list.c:97 ignore snapshot-3-1846-483196
LIBRAFT   1692019474892141500 src/uv_list.c:82 snapshot snapshot-3-1846-483196.meta
LIBRAFT   1692019474892144700 src/uv_list.c:97 ignore snapshot-4-1974-487159
LIBRAFT   1692019474892150600 src/uv_list.c:82 snapshot snapshot-4-1974-487159.meta

And then at line 33331, just before the offending truncate operation:

LIBRAFT   1692019482640606483 src/uv_list.c:71 ignore metadata2
LIBRAFT   1692019482640614283 src/uv_list.c:92 segment open-20
LIBRAFT   1692019482640617383 src/uv_list.c:97 ignore removed

That second list of files is pretty weird:

  • there's only one metadata file and no snapshots
  • I don't know where the file named removed comes from

@cole-miller
Contributor

cole-miller commented Aug 17, 2023

This bit in between is also interesting:

LIBRAFT   1692019482619984787 src/replication.c:1055 log mismatch -> truncate (2011)
LIBRAFT   1692019482619987887 src/uv_truncate.c:163 uv truncate 2011
LIBRAFT   1692019482619991088 src/uv_append.c:839 UvBarrier uv->append_next_index:2011
LIBRAFT   1692019482620014288 src/uv_append.c:622 enqueue 1 entries
LIBRAFT   1692019482620018788 src/uv_finalize.c:39 finalize open-17 into 0000000000001975-0000000000002011
LIBRAFT   1692019482620044388 src/uv_finalize.c:68 truncate segment open-17: open: no such file or directory
LIBRAFT   1692019482620020688 src/uv_prepare.c:166 create open segment open-20

It seems the catastrophe happens before that point.

@cole-miller
Contributor

In jerrytesting's logs I see corrupt segments in the lists, and it's possible something different is going on.

@cole-miller
Contributor

cole-miller commented Aug 18, 2023

Okay, so in Mathieu's logs, the issue is that Jepsen tries to remove n3 from the cluster (shrink!), which involves wiping the data directory, but the dqlite/raft process doesn't stop running and eventually notices that important files are missing. This might still be a raft issue, but a more niche one that's not related to consistency.

This doesn't apply to jerrytesting's logs, which as mentioned above seem to be running into a different issue.

@cole-miller
Contributor

cole-miller commented Aug 24, 2023

Happened again:

@cole-miller
Contributor

cole-miller commented Aug 24, 2023

I think this and canonical/jepsen.dqlite#125 are two manifestations of the same problem: fatal signals are not reliably causing the jepsen.dqlite app to shut down.

@cole-miller cole-miller transferred this issue from canonical/raft Feb 13, 2024