
feat: automatic oldest-first message removal from mailboxes to always stay under max_mailbox_size#929

Open
hpk42 wants to merge 4 commits into main from auto-expire-oldest

Conversation

@hpk42
Contributor

@hpk42 hpk42 commented Apr 20, 2026

Revamped #927, trying to address all suggestions from there.

Both dovecot-quota-threshold triggers and the daily expiry routine will expunge oldest messages from mailboxes automatically when the mailbox reaches 75% of max_mailbox_size.

Delta Chat users should not see any warnings (at 80/95 percent) or bounce messages, and existing over-quota mailboxes should start receiving mails again.
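A minimal sketch of the oldest-first selection described above. It is not the code from this PR: the `Message` class, `select_for_expiry` and the `percent` parameter are hypothetical names for illustration, and it assumes the expiry target equals the 75% trigger threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Message:
    # mtime is the first field so that sorted() orders messages oldest-first.
    mtime: float
    quota_size: int
    path: str = ""


def select_for_expiry(messages, max_mailbox_size, percent=75):
    """Return the oldest messages to expunge so the mailbox fits the target."""
    target_bytes = max_mailbox_size * percent // 100
    total_size = sum(m.quota_size for m in messages)
    doomed = []
    for entry in sorted(messages):
        if total_size <= target_bytes:
            break
        doomed.append(entry)
        total_size -= entry.quota_size
    return doomed


if __name__ == "__main__":
    mailbox = [Message(mtime=t, quota_size=30, path=f"msg{t}") for t in (3, 1, 2, 4)]
    # 120 bytes stored with a 100-byte limit: the target is 75 bytes,
    # so the two oldest messages are selected.
    print([m.path for m in select_for_expiry(mailbox, max_mailbox_size=100)])
```

Because only the oldest entries are selected, the surviving messages always form a contiguous recent window, which is the "queue" property discussed later in this thread.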

Contributor

@j4n j4n left a comment

looks good, tested overfilling a mailbox with

for i in $(seq 1 400); do
  dd if=/dev/urandom bs=1M count=1 2>/dev/null | base64 |
    (echo "From: test@example.org"; echo "To: test@do.main"; echo "Subject: fill $i"; echo; cat) |
    doveadm save -u test@do.main
done

Running /usr/local/lib/chatmaild/venv/bin/chatmail-quota-expire 70 /home/vmail/mail/... worked, but the logs show that it did not fire automatically:

doveadm(test@do.main): Error: program unix:/run/dovecot/quota-warning: net_connect_unix(/run/dovecot/quota-warning) failed: Permission denied (euid=999(vmail) egid=996(vmail) missing +r perm: /run/dovecot/quota-warning, dir owned by 0:0 mode=0755)
doveadm(test@do.main): Error: Saving failed: Quota exceeded (mailbox for user is full)
...

In cmdeploy/src/cmdeploy/dovecot/dovecot.conf.j2, the unix_listener quota-warning socket needs explicit permissions for vmail to connect:

   unix_listener quota-warning {
+    user = vmail
+    mode = 0600
   }

Then it works:

Apr 20 13:07:14 host dovecot[3834579]: quota-warning: Error: quota-expire: removed 6 message(s) from test@do.main

Comment thread chatmaild/src/chatmaild/expire.py
Contributor

@missytake missytake left a comment

Sounds very promising!

Maybe it would make sense to first expire larger messages by date, and only then the small messages; missing an attachment is much more harmless than missing membership changes.

Comment thread chatmaild/src/chatmaild/expire.py
Comment thread chatmaild/src/chatmaild/expire.py
@hpk42
Contributor Author

hpk42 commented Apr 21, 2026

In a prior version of the quota-expire effort, I had file size play into the sorting, so larger messages would be deleted first if they were not much younger than small messages. @link2xt was skeptical of that, and eventually we arrived at saying "let's just do oldest first for now", also because it's easier to understand and reason about.

However, i am still wondering about it. Maybe we can keep the last 7 days of small messages unconditionally, so a few big messages can not push all the young small messages out of the queue?

sidenote: with this PR merged, addresses conceptually become handles for message queues rather than static mail storage.

@hpk42 hpk42 requested a review from j4n April 21, 2026 07:10
@hpk42 hpk42 force-pushed the auto-expire-oldest branch from 1b6c4d1 to f132881 Compare April 21, 2026 07:10
@hpk42 hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 21, 2026 07:10 — with GitHub Actions Inactive
@link2xt
Contributor

link2xt commented Apr 21, 2026

Maybe it would make sense to first expire larger messages by date, and only then the small messages; missing an attachment is much more harmless than missing membership changes.

In #926 messages were scored by age multiplied by size. That could work, but if you have not read messages for some time, you will get some history with no "member added" messages and with randomly dropped messages around the time gossip happens, instead of a history that starts at roughly some timestamp. We could drop messages with the Chat-Is-Post-Message header first, or just very large messages (likely videos) first, followed by all messages, but just dropping by date is likely good enough already.
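To illustrate why an age-times-size score does not produce a clean cut-off timestamp, here is a small sketch. It is not the code from #926; the `score` and `removal_order` helpers, the `Message` class and the sample mailbox are all hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    mtime: float
    quota_size: int
    label: str


def score(msg, now):
    """Higher score means 'remove sooner': age in seconds times size in bytes."""
    return (now - msg.mtime) * msg.quota_size


def removal_order(messages, now=None):
    """Return messages sorted so that the highest-scoring ones come first."""
    now = time.time() if now is None else now
    return sorted(messages, key=lambda m: score(m, now), reverse=True)


if __name__ == "__main__":
    day = 86400
    now = 100 * day
    mailbox = [
        Message(now - 30 * day, 2_000, "old 'member added'"),
        Message(now - 2 * day, 5_000_000, "recent video"),
        Message(now - 10 * day, 3_000, "older text"),
    ]
    for m in removal_order(mailbox, now):
        print(m.label)
```

With these sample values the recent video is removed first, then the 30-day-old "member added" message, then the 10-day-old text, so the removal order does not follow the message timestamps.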

Maybe we can keep the last 7 days of small messages unconditionally, so a few big messages can not push all the young small messages out of the queue?

Something simple with two thresholds is probably fine: first delete huge messages (videos), then, if that is not sufficient, apply the same logic as now. Still, it is not clear whether anyone will actually notice the improvement, and you lose the "queue" property; it might also be difficult to reimplement this logic if we get to replacing Maildir with something more efficient (chatmail basically needs a log storage, just write new messages into mbox and rotate "pages").
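A rough sketch of such a two-stage selection, under the assumption that "huge" is decided by a single size cutoff. `two_stage_expiry`, `huge_bytes` and the `Message` class are hypothetical names, not part of this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Message:
    # mtime first, so sorted() yields oldest-first order.
    mtime: float
    quota_size: int
    label: str = ""


def two_stage_expiry(messages, target_bytes, huge_bytes):
    """Select huge messages first (oldest-first), then everything else oldest-first."""
    total = sum(m.quota_size for m in messages)
    ordered = sorted(messages)
    huge = [m for m in ordered if m.quota_size >= huge_bytes]
    rest = [m for m in ordered if m.quota_size < huge_bytes]
    doomed = []
    for entry in huge + rest:
        if total <= target_bytes:
            break
        doomed.append(entry)
        total -= entry.quota_size
    return doomed


if __name__ == "__main__":
    mailbox = [
        Message(1, 10, "text a"),
        Message(2, 500, "video"),
        Message(3, 10, "text b"),
    ]
    # Only the video has to go; plain oldest-first would also drop "text a".
    print([m.label for m in two_stage_expiry(mailbox, target_bytes=100, huge_bytes=100)])
```

The trade-off mentioned above is visible here: after the first stage the surviving messages are no longer a contiguous window in time.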

… stay under max_mailbox_size

Both dovecot-quota-threshold triggers and the daily expiry routine
will now expunge oldest messages from mailboxes automatically
when the mailbox reaches 75% of max_mailbox_size.
Delta Chat users should not see any warnings (at 80/95 percent) or bounce messages,
and existing over-quota mailboxes should start receiving mails again.
@hpk42 hpk42 force-pushed the auto-expire-oldest branch from f132881 to df3c460 Compare April 23, 2026 11:07
@hpk42 hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 23, 2026 11:07 — with GitHub Actions Inactive
Contributor

@j4n j4n left a comment

lgtm

Comment thread chatmaild/src/chatmaild/tests/test_expire.py Outdated
Comment thread chatmaild/src/chatmaild/expire.py Outdated
Comment thread chatmaild/src/chatmaild/expire.py
Comment thread chatmaild/src/chatmaild/expire.py Outdated
total_size = sum(m.quota_size for m in messages)
removed = 0
for entry in sorted(messages):
    if total_size <= target_bytes:
Contributor

@link2xt link2xt Apr 23, 2026

One suggestion (from Ellie) is to unconditionally not delete messages that are new to make complete flushing of the mailbox noticeable:

there could also be a mixed mode, for example only delete old messages if they have been around at least X day

Maybe stop here not only if we reached quota, but also if we reached a message that is e.g. 2 days old. Then if you get a flood of messages, you start using the remaining 30% and the user will get the 80% quota warning. And if it happens persistently, you eventually go over quota and this does not resolve itself, but that actually means something has gone wrong and should be investigated.

tl;dr the suggestion is to add or entry.mtime > time.time() - 2 * 86400 here and break out of the loop if we reached very new messages. If we eat through 400 MB of storage in two days, something has likely gone really wrong and warning the user sounds reasonable.
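As a sketch of how that extra condition could sit in the loop (an illustration, not the code in expire.py): `expire_with_floor`, `min_age_seconds` and the `now` parameter are hypothetical, while `quota_size`, `mtime`, `total_size` and `target_bytes` follow the names used in the snippet and suggestion above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Message:
    # mtime first, so sorted() yields oldest-first order.
    mtime: float
    quota_size: int


def expire_with_floor(messages, target_bytes, min_age_seconds=2 * 86400, now=None):
    """Oldest-first selection that never touches messages newer than the floor."""
    now = time.time() if now is None else now
    total_size = sum(m.quota_size for m in messages)
    removed = []
    for entry in sorted(messages):
        # Stop once under target, or on reaching a message newer than the floor.
        if total_size <= target_bytes or entry.mtime > now - min_age_seconds:
            break
        removed.append(entry)
        total_size -= entry.quota_size
    return removed, total_size


if __name__ == "__main__":
    day = 86400
    now = 10 * day
    mailbox = [Message(now - age * day, 100) for age in (5, 3, 1, 0.5)]
    removed, remaining = expire_with_floor(mailbox, target_bytes=150, now=now)
    # The two messages older than 2 days are removed; 200 bytes remain,
    # which is still above the 150-byte target.
    print(len(removed), remaining)
```

With a floor in place the function can return while the mailbox is still above the target, which is exactly the case where the quota warning would become visible to the user.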

Contributor Author

@hpk42 hpk42 Apr 23, 2026

currently, if an attacker burst-floods a mailbox with small messages, expire_to_target will delete them. when the burst is over, the address can receive messages again.

with a 2-day floor, a single burst of small messages fills the mailbox and none of them can be removed for 2 days. Dovecot rejects all incoming mail during that window. the attacker just repeats the burst every 2 days: minimal effort, permanent denial of service.
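The burst scenario can be played through with a toy model. Everything in it is an assumption made for illustration: sizes are abstract units, one legitimate message arrives per hour, and an expiry pass runs before each delivery attempt. `simulate` is a hypothetical helper and does not reflect how Dovecot or expire.py schedule their work.

```python
def simulate(min_age_hours, quota=100, target=75, burst=100, hours=72):
    """Count legitimate 1-unit messages accepted after a burst at hour 0."""
    mailbox = [(0, 1)] * burst  # (arrival hour, size): the burst fills the quota
    accepted = 0
    for hour in range(1, hours + 1):
        # Expiry pass: oldest first, down to target, honouring the age floor.
        mailbox.sort()
        total = sum(size for _, size in mailbox)
        while mailbox and total > target and hour - mailbox[0][0] >= min_age_hours:
            total -= mailbox.pop(0)[1]
        # Delivery attempt: rejected if it would exceed the quota.
        if total + 1 <= quota:
            mailbox.append((hour, 1))
            accepted += 1
    return accepted


if __name__ == "__main__":
    print("no floor:     ", simulate(min_age_hours=0))
    print("48-hour floor:", simulate(min_age_hours=48))
```

In this model, without a floor every hourly message over the 72 hours is accepted, while with a 48-hour floor nothing is accepted until the burst messages become old enough to expire.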

But it would be a visible denial of service, rather than a silent one, where you flush the messages out without a limit. I think for small groups that's better. For larger groups, the errors could be hidden, for example.

I think this is simply a scenario of the needs of the large room vs. the needs of the small room. In small rooms, just about anything seems preferable to me to silent message loss. And whatever path is taken, it is unavoidable that some messages won't reach a user if an attacker does something nasty, so you can only control how visible that is.

Contributor Author

it's not about visible versus invisible denial of service. When you get flooded with messages, you will notice that in either case. But only with the remove-oldest algo do you start receiving messages again once the DoS stops, whereas with a 2-day floor you stay incommunicado for two days after the first burst.

Contributor

A DoS attack is pretty much hypothetical anyway; you can do more annoying stuff with less effort if you really want. More likely is some accidental flooding, e.g. a bot flooding the bot mailbox where nobody looks, or users flooding their mailbox with invisible webxdc updates or subscribing to a channel that sends them videos every day. If we do this change, it will either be pretty much never triggered or might uncover some problem if a mailbox gets over quota and the user actually investigates and reports it, but most likely it will just never get triggered, because eating through 400 MB in two days accidentally is unlikely.

One last suggestion from me is adding or entry.mtime > time.time() - 3600 or so, to not rotate through the mailbox continuously in case something is really-really broken and mailbox gets a new 10MB message every minute, but if not then just mark it as resolved.

Contributor Author

@hpk42 hpk42 Apr 24, 2026

k, i added the 1h-guard as suggested. 80ccf4d

addendum: only now saw ell1e's 6-hour ago comment for some reason, this message here was originally next to link2xt's last comment in the thread.

Contributor Author

@hpk42 hpk42 Apr 24, 2026

Sender visibility of a sending failure in my opinion is even more important than the receiver noticing that something is up.

When a sender transmits a message to 3 addresses (a multi-relay chat contact), and one address fails because of quota overrun, it is the receiving device which should notice the quota-full event and do something about it, like clearing the storage or switching to a different relay. For users nothing in particular needs to happen. No message needs to be flagged, no warning needs showing. Everything works, and auto-repairs to continue to work. <------ this is at least the rough UX direction of current discussions about "automated relay management" designed to work for mass-users who might not know any details of the relay network, but just onboard on it.

Contributor

addendum: only now saw ell1e's 6-hour ago comment for some reason

GitHub broke the caching logic for messages now; it was already broken for issue counters and pull request counters recently. I also get emails about your messages, then go to the page and have to refresh it manually to see the message. Also, when I reload the page, I don't see any comments at first, without any spinner. It is indistinguishable from the PR having no comments, so I always think for a moment I opened the wrong PR or something. The comments load only later, not necessarily in the latest state, sometimes without the new comment just received by email.


@ell1e ell1e Apr 24, 2026

I don't think multi relay is relevant for this problem, since an attacker could just spam all relays. (Edit: or am I missing something? I'm happy to be wrong! ❤️ )

The only way to know that my message wasn't eaten by some spam a few hours after, while the user wasn't online yet, seems to be if there is a time window of at least 48 hours or so where it's guaranteed to be stored. That way it would actually arrive for most people. Otherwise you're allowing silent failures even though everything looked fine.

I honestly don't know if it's just me, but I abandoned multiple messengers just because of silent delivery failures. (That being XMPP OMEMO, as well as Wire messenger.) A single message not arriving without anybody being aware in a tense situation can be socially nuclear.


@ell1e ell1e Apr 24, 2026

Oh another thing I just realized is relevant, I've had multiple people be offline for a week or two at some point. If you don't have a minimum retention time of two weeks, you would be able to write them really detailed messages in private while they might never arrive and nobody involved might ever know. While with the old configuration, you absolutely would know once their inbox ran full that both the new messages didn't arrive, and that the old messages would typically have been retained until read.

I realize this is an extreme case and it might not make sense to tweak this for the extreme cases. However, I also think the assumption every user is offline at most for 10 minutes is the opposite extreme. People might e.g. be in a building without reception for half the day, that's pretty common. Or their battery simply ran out and they were too busy to deal with it for a day. It happens to the best of us...

@hpk42 hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 23, 2026 21:00 — with GitHub Actions Inactive
Contributor

@link2xt link2xt left a comment

There is a seemingly unrelated CI failure, from IP-based relays.

@hpk42 hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 24, 2026 08:11 — with GitHub Actions Inactive
5 participants