New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeated mailbox metadata corruption with cyrus-imapd 3.0.4 on FreeBSD 11.1-STABLE: "Fatal error: Internal error: assertion failed: imap/index.c: 228: record->uid == im->uid" #2208

Open
rwatson opened this Issue Dec 5, 2017 · 14 comments

Comments

Projects
None yet
5 participants
@rwatson

rwatson commented Dec 5, 2017

I recently migrated from a (much) older FreeBSD/Cyrus server configuration to Cyrus 3.0.4 and FreeBSD 11.1-STABLE (August 2017) running ZFS; cyrus is installed using precompiled binaries downloaded via the FreeBSD package system:

root@cyrus:/ # pkg info | grep cyrus
cyrus-imapd30-3.0.4            The cyrus mail server, supporting POP3 and IMAP4 protocols
cyrus-sasl-2.1.26_12           RFC 2222 SASL (Simple Authentication and Security Layer)
cyrus-sasl-gssapi-2.1.26_7     SASL GSSAPI authentication plugin
cyrus-sasl-saslauthd-2.1.26_3  SASL authentication server for cyrus-sasl2

Since that time, I've been seeing regular assertion failures when accessing frequently written mailboxes, leading to errors reported over IMAP and via server logs. These are resolved by running reconstruct, but occur again pretty quickly soon thereafter (e.g., within a few hours or days) for the same mailboxes.

Sample output from alpine's debug logs:

IMAP 09:18:03 12/4 mm_notify bye: {cyrus.watson.org:993/imap/notls/ssl/user="weiler"}lists.net.nanog: Fatal error: Internal error: assertion failed: imap/index.c: 228: record->uid == im->uid
IMAP 09:18:03 12/4 mm_notify bye: {cyrus.watson.org:993/imap/notls/ssl/user="weiler"}lists.net.nanog: [CLOSED] IMAP connection broken (server response)
IMAP 09:18:03 12/4 mm_log error: [CLOSED] IMAP connection broken (server response)
IMAP 09:18:03 12/4 mm_notify bye: {cyrus.watson.org:993/imap/notls/ssl/user="weiler"}lists.net.nanog: [CLOSED] IMAP connection lost
IMAP 09:18:03 12/4 mm_log error: [CLOSED] IMAP connection lost

Running reconstruct while the server is shut down shows the following:

root@cyrus:/ # /usr/local/cyrus/sbin/reconstruct -rG lists.net
...
lists.net.dnsop: update uniqueid from header (null) => 7b745c5e37fe0569
lists.net.dnsop uid 19810 rediscovered - appending
lists.net.dnsop uid 19853 rediscovered - appending
lists.net.dnsop uid 20117 rediscovered - appending
lists.net.dnsop updating quota_mailbox_used: 138872177 => 138855932
lists.net.dnsop: updating exists 20182 => 20179
lists.net.dnsop: updating sync_crc 3824694062 => 3692858271
lists.net.dnsop
...
lists.net.nanog: update uniqueid from header (null) => 2cf7541035865f76
lists.net.nanog uid 176606 rediscovered - appending
lists.net.nanog uid 177318 rediscovered - appending
lists.net.nanog uid 177610 rediscovered - appending
lists.net.nanog updating quota_mailbox_used: 783360092 => 783337968
lists.net.nanog: updating exists 177671 => 177668
lists.net.nanog: updating sync_crc 1069219518 => 3346397954
lists.net.nanog
...

Similar problems occur with other mailboxes with moderate frequency, and running reconstruct per the above will usually turn up multiple mailboxes with rediscovered messages (as it did above -- I've only included excerpts for two mailboxes).

Default settings (e.g., twoskip) are used for the various indices and metadata except for seen state where I've inherited existing files and do not wish to convert them yet:

root@cyrus:/var/spool/imap/lists/net/nanog # file *cyrus*
cyrus.annotations: Cyrus twoskip DB
cyrus.cache:       data
cyrus.header:      data
cyrus.index:       data
cyrus.seen:        ASCII text, with very long lines
@@ -284,6 +288,7 @@
 # Allowed values: flat, skiplist, twoskip
 #
 #seenstate_db: twoskip
+seenstate_db: flat
 
 # The cyrusdb backend to use for the imap status cache.
 # Allowed values: skiplist, sql, twoskip

Enabling server debug logging for sessions, I see that these assertions typically fail as follows:

07000008 OK Completed (0.000 sec)
<1512397083<07000009 NOOP
>1512397083>07000009 OK Completed
<1512397083<0700000a FETCH 177608 (BODYSTRUCTURE FLAGS)
>1512397083>* BYE Fatal error: Internal error: assertion failed: imap/index.c: 228: record->uid == im->uid

This appears to be reproducible in the sense that problems recur pretty quickly after a regenerate, although I do not know what is causing the problem to arise. Please let me know if there are further things I can do to help debug this problem. I can pull FreeBSD forwarded to the latest 11-STABLE if desired but am currently sticking with this snapshot rather than introduce new variables into the equation.

@elliefm

This comment has been minimized.

Contributor

elliefm commented Dec 6, 2017

Do you know where we could find a list of patches (if any) and build settings used by the FreeBSD precompiled binaries? I think all the active Cyrus devs are using Linux at the moment

@elliefm elliefm added the 3.0 label Dec 6, 2017

@rwatson

This comment has been minimized.

rwatson commented Dec 6, 2017

The metadata for the FreeBSD cyrus-imapd30 port can be found here:

https://svnweb.freebsd.org/ports/head/mail/cyrus-imapd30/

Makefile contains a fair number of configuration parameters, dependency information, etc. files contains patches applied during the build process.

Similarly directories (a level or two up) exist for SASL, etc.

Thanks!

@pxnky

This comment has been minimized.

pxnky commented Dec 31, 2017

I am also seeing similar problems running Cyrus 3.0.4 on FreeBSD 11.1-STABLE-HBSD (HardenedBSD, which is FreeBSD with some kernel patches), mine look like this when deleting messages:

Dec 31 15:29:23 xx imap[50534]: Fatal error: Internal error: assertion failed: imap/index.c: 228: record->uid == im->uid
Dec 31 15:29:23 xx master[11797]: process type:SERVICE name:imap path:/usr/local/cyrus/libexec/imapd age:63.628s pid:50534 exited, status 75

I did migrate from a much older install that was running on linux with twoskip by just rsyncing the directories and then running reconstruct, it is possible I did something wrong there. I am going to try another reconstruct.

@elliefm

This comment has been minimized.

Contributor

elliefm commented Jan 2, 2018

What level of optimisations are the FreeBSD ports built with? I don't see a setting for this in the metadata that @rwatson has linked to, but I'd guess it's probably set at a higher level than per-package.

Are you able to rebuild Cyrus without optimisations, and report whether the problem persists?

@karagian

This comment has been minimized.

karagian commented Mar 12, 2018

I can confirm that this problem exists when upgrading from older versions of cyrus. We experienced the same issue after upgrading one of our servers to version 3.0.5. The messages we are seeing after update are of these two forms, mostly the latter:

Fatal error: Internal error: assertion failed: imap/index.c: 228: record->uid == im->uid
Fatal error: Internal error: assertion failed: imap/message.c: 4286: !message_need(m, M_RECORD)

Further details I can provide regarding our installation is that the system was upgraded from version 2.5.11, however the issue seems to be a leftover from a much older version:
After updating cyrus to 3.0.5 a reconstruct -V max was performed, which was failing on some mailboxes with the same message:

fatal error: Internal error: assertion failed: imap/message.c: 4286: !message_need(m, M_RECORD)

The reconstruct would be interrupted when this error occured, but it would continue the next time it was run. Eventually the process finished, however this did not fix the problem.

We tried running reconstruct on all mailboxes, which reported several messages of this form:

user.username.mailboxname failed to read at record xxxxxx (Mailbox format corruption detected), wiping
user.username.mailboxname uid xxxxxx rediscovered - appending
user.username.mailboxname uid xxxxxx rediscovered - appending

After this procedure was completed, the messages that were rediscovered, all appeared as unread to the clients. As it turns out, the messages that were rediscovered, they were all older than a very specific date, which is the date when an older cyrus upgrade was performed. At that date, cyrus was upgraded from version 2.2.12 to version 2.4.16.

I must note that a reconstruct -V max was never run until now. The last time the index files were upgraded was after that upgrade to v 2.4.16, when it was done automatically the first time users logged in. 2.5.x release notes mention that this would work without a problem with the exception of some newer features not working (https://www.cyrusimap.org/2.5/imap/release-notes/2.5/x/2.5.0.html), so no index upgrade to v13 was performed during update to 2.5.x. And there really was no problem with 2.5.x running with version 12 index files.

Hope this helps investigating the problem. I could probably provide further info if you want and maybe help to further pin down the problem.

@samuelweiler

This comment has been minimized.

samuelweiler commented Mar 18, 2018

I usually see this problem when the second or later message is read in a shared mailbox - possibly it relates to the seen state? I've never seen this before the first message is read, nor have I seen it on other-than-shared mailboxes.

@pxnky

This comment has been minimized.

pxnky commented Mar 21, 2018

I see this all the time but not sure what the trigger is.. My mailbox is not 'shared' but I do have mulitple imap clients connecting to it, from my laptop and phone .. Sometimes I click messages in my IMAP client (MacOS Mail.app) and they'll just disappear, and I'll check the logs and will have gotten an assertion failure and imapd process exit with return code 75 right then. Sorry about the late reply @elliefm I am just now getting around to trying to recompile without optimizations. I was running the freebsd (hardenedbsd actually) port of 3.0.4, then I switched to the precompiled package version of 3.0.5, now I'm compiling my own source tree after applying a couple of the minor patches needed from the port (just enough to get it to compile) and with CFLAGS=-O0, so lets see how that does. Also I have recently tried to go back to skiplist instead of twoskip, and run a reconstruct, but still saw the issue.

@karagian

This comment has been minimized.

karagian commented Mar 21, 2018

It seems that running a
reconstruct -G -V max [mailboxname]
fixes the index file that 3.0.x considers corrupt and cause the issue.
Note that without the mailboxname parameter reconstruct will be performed on all mailboxes and it may take a long time to complete.
Can anyone confirm this as a workaround for this issue?

@samuelweiler

This comment has been minimized.

samuelweiler commented Mar 21, 2018

@karagian: What @rwatson and I have seen is that a reconstruct will find the messages that have caused but problems, but new errors keep happening in the same mailboxes going forward.

@karagian

This comment has been minimized.

karagian commented Mar 21, 2018

@samuelweiler I know that a plain reconstruct will not fix the problem, but with "-G" and "-V max" I think the issue is gone. The index files get reparsed and upgraded to v13. Can you perform it on the mailbox that has this issue and see if it makes any difference?
Was that mailbox created from a much older version of cyrus (2.2.x)?

@pxnky

This comment has been minimized.

pxnky commented Mar 21, 2018

I will try this now! Thanks! I'll switch back to twoskip before I run it

@rwatson

This comment has been minimized.

rwatson commented Mar 22, 2018

This was from a much older version of Cyrus (2.2.x? 2.1.x?). I believe I used -G and -V max at some point after discovering this recurring issue, but my notes are unclear on this. I've rerun that reconstruct command on a few frequently affected mailboxes, and we'll now watch them to see if the problem recurs.

@pxnky

This comment has been minimized.

pxnky commented Mar 22, 2018

I am pretty sure this worked for me, nothing in the logs since I ran it! Thanks everyone! I am sure I missed RTFMing.. but haven't actually looked at the docs yet.

@elliefm

This comment has been minimized.

Contributor

elliefm commented Aug 10, 2018

I'm not certain, but I think this commit from Bron might fix the corruption/crashes: c665af5. This will be included in 3.0.8, due for release very soon, hopefully today :)

You should still reconstruct mailboxes with "-V max" to get the full benefit of upgrading Cyrus, and if you have mailboxes that pre-date GUIDs, you will also need the "-G" argument to force GUIDs to be recalculated (otherwise their GUID will be defaulted to 0, which will make replication and things not work for them). I think GUIDs turned up in 2.4 or 2.5 but I'm not sure precisely when, offhand.

I believe you can check if a mailboxes is missing GUIDs by looking for "GUID:0" in the output from cyrdump [mailbox]. If any messages have a zero GUID, you will need to reconstruct this mailbox with -G. Depending on the size of your installation it may be quicker/easier to just reconstruct -G -Vmax everything.

Note that cyrdump is currently broken in 3.0.0-3.0.7 (#2466), but it is fixed in git and will be fixed in 3.0.8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment