feat: automatic oldest-first message removal from mailboxes to always stay under max_mailbox_size by hpk42 · Pull Request #929 · chatmail/relay

hpk42 · 2026-04-20T09:38:14Z

Revamped #927, trying to address all suggestions from there.

Both dovecot-quota-threshold triggers and the daily expiry routine will expunge oldest messages from mailboxes automatically when the mailbox reaches 75% of max_mailbox_size.

Delta Chat users should not see any warnings (at 80/95 percent) or bounce messages, and existing over-quota mailboxes should start receiving mails again.
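The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Message` class and `select_for_removal` function are hypothetical names, and the sketch assumes the expiry target is the same 75% figure as the trigger, which the description does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Message:
    # Hypothetical record; field order makes sorting oldest-first.
    mtime: float      # delivery time, seconds since epoch
    quota_size: int   # bytes counted against the quota
    path: str = ""    # location of the message file


def select_for_removal(messages, max_mailbox_size, threshold=0.75):
    """Return the oldest messages whose removal brings usage down to the threshold."""
    # Assumption: the target equals threshold * max_mailbox_size.
    target_bytes = int(max_mailbox_size * threshold)
    total = sum(m.quota_size for m in messages)
    doomed = []
    for m in sorted(messages):  # oldest first
        if total <= target_bytes:
            break
        doomed.append(m)
        total -= m.quota_size
    return doomed
```

With four messages of 40, 30, 20 and 10 bytes (oldest to newest) and a 100-byte limit, only the oldest message is selected, because removing it already brings usage to 60 bytes.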

j4n

looks good, tested overfilling a mailbox with

for i in $(seq 1 400); do
  dd if=/dev/urandom bs=1M count=1 2>/dev/null | base64 |
    (echo "From: test@example.org"; echo "To: test@do.main"; echo "Subject: fill $i"; echo; cat) |
    doveadm save -u test@do.main
done

Running /usr/local/lib/chatmaild/venv/bin/chatmail-quota-expire 70 **/home/vmail/mail/... worked, but when looking at the logs, it did not fire automatically:

doveadm(test@do.main): Error: program unix:/run/dovecot/quota-warning: net_connect_unix(/run/dovecot/quota-warning) failed: Permission denied (euid=999(vmail) egid=996(vmail) missing +r perm: /run/dovecot/quota-warning, dir owned by 0:0 mode=0755)
doveadm(test@do.main): Error: Saving failed: Quota exceeded (mailbox for user is full)
...

In cmdeploy/src/cmdeploy/dovecot/dovecot.conf.j2, the unix_listener quota-warning socket needs explicit permissions for vmail to connect:

   unix_listener quota-warning {
+    user = vmail
+    mode = 0600
   }

Then it works:

Apr 20 13:07:14 host dovecot[3834579]: quota-warning: Error: quota-expire: removed 6 message(s) from test@do.main

missytake

Sounds very promising!

Maybe it would make sense to first expire larger messages by date, and only then the small messages; missing an attachment is much more harmless than missing membership changes.

hpk42 · 2026-04-21T07:01:13Z

In a prior version of the quota-expire effort, i had file-size play into the sorting, so larger messages would be deleted first if they are not much younger than small messages. @link2xt was skeptical of that and eventually we arrived at saying: "let's just do oldest first for now" also because it's easier to understand it and reason about it.

However, i am still wondering about it. Maybe we can keep the last 7 days of small messages unconditionally, so a few big messages can not push all the young small messages out of the queue?

sidenote: with this PR merged, addresses conceptually become handles for message queues rather than static mail storage.

link2xt · 2026-04-21T19:21:03Z

Maybe it would make sense to first expire larger messages by date, and only then the small messages; missing an attachment is much more harmless than missing membership changes.

In #926 messages were scored by age multiplied by size. Could work, but then if you have really not read messages for some time, you will get some history but no "member added" messages and randomly dropped messages at the time when gossip happens instead of the history starting at roughly some timestamp. Can drop messages with Chat-Is-Post-Message header first or just very large messages first (likely videos) followed by all messages, but just dropping by date is likely good enough already.
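To illustrate why an age-times-size score can leave gaps in the history, here is a small sketch. The exact formula used in #926 is not shown in this thread, so the plain product below is an assumption, and `removal_order_by_score` is a hypothetical name. Messages are modelled as `(mtime, size)` tuples.

```python
import time


def removal_order_by_score(messages, now=None):
    """Order (mtime, size) tuples so that the highest age * size score comes first."""
    now = time.time() if now is None else now
    # Assumption: score is simply age in seconds multiplied by size in bytes.
    return sorted(messages, key=lambda m: (now - m[0]) * m[1], reverse=True)


now = 1_000_000
old_small = (now - 10 * 86400, 1_000)       # ten days old, 1 kB
young_big = (now - 1 * 86400, 10_000_000)   # one day old, 10 MB
order = removal_order_by_score([old_small, young_big], now=now)
```

Here the one-day-old large message is removed before the ten-day-old small one, so the surviving history no longer starts at a single point in time.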

Maybe we can keep the last 7 days of small messages unconditionally, so a few big messages can not push all the young small messages out of the queue?

Something simple with two thresholds is probably fine, first delete huge messages (videos), then if it is not sufficient the same as the current logic. But then still not clear if anyone will actually notice the improvement, but you lose the "queue" property and it might be difficult to reimplement this logic if we get to replacing Maildir with something more efficient (chatmail basically needs a log storage, just write new messages into mbox and rotate "pages").
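A two-threshold variant along these lines might look like the sketch below. It is only an illustration of the idea: `two_pass_removal` and the 5 MiB `huge_bytes` cut-off are made-up values for the example, and messages are modelled as `(mtime, size)` tuples.

```python
def two_pass_removal(messages, target_bytes, huge_bytes=5 * 1024 * 1024):
    """Pick messages to remove: huge ones first, then everything else, oldest first."""
    total = sum(size for _, size in messages)
    # First pass candidates: huge messages (likely videos), oldest first.
    huge = sorted(m for m in messages if m[1] >= huge_bytes)
    # Second pass candidates: the remaining messages, oldest first.
    rest = sorted(m for m in messages if m[1] < huge_bytes)
    removed = []
    for m in huge + rest:
        if total <= target_bytes:
            break
        removed.append(m)
        total -= m[1]
    return removed
```

The second pass only runs if dropping every huge message was not enough, which matches "then if it is not sufficient the same as the current logic".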


j4n

lgtm

link2xt · 2026-04-23T20:02:38Z

+    total_size = sum(m.quota_size for m in messages)
+    removed = 0
+    for entry in sorted(messages):
+        if total_size <= target_bytes:


One suggestion (from Ellie) is to unconditionally not delete messages that are new to make complete flushing of the mailbox noticeable:

there could also be a mixed mode, for example only delete old messages if they have been around at least X day

Maybe stop here not only if we reached quota, but also if we reached the message that is e.g. 2 days old. Then if you get a flood of messages, you start using the remaining 30% and user will get 80% quota warning. And if it happens persistently, then you eventually get over quota and this does not resolve, but this actually means something has gone wrong and should be investigated.

tl;dr the suggestion is to add or entry.mtime > time.time() - 2 * 86400 here and break out of the loop if we reached very new messages. If we eat through 400 MB of storage in two days, something has likely gone really wrong and warning the user sounds reasonable.
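For reference, the suggested stop condition could be sketched like this. The loop shape follows the diff excerpt quoted above, but the rest is assumed: `Entry`, `expire_with_floor` and the `min_age_seconds` parameter are hypothetical names, and the function only selects messages rather than deleting files.

```python
import time
from collections import namedtuple

# Hypothetical record; tuples sort by mtime first, i.e. oldest first.
Entry = namedtuple("Entry", "mtime quota_size path")


def expire_with_floor(messages, target_bytes, min_age_seconds=2 * 86400, now=None):
    """Select oldest messages for removal, but never ones newer than the floor."""
    now = time.time() if now is None else now
    total_size = sum(m.quota_size for m in messages)
    removed = []
    for entry in sorted(messages):
        # Stop at the target, or as soon as we reach a message that is too new.
        if total_size <= target_bytes or entry.mtime > now - min_age_seconds:
            break
        removed.append(entry)
        total_size -= entry.quota_size
    return removed, total_size
```

If the remaining messages are all newer than the floor, the function returns with the mailbox still above the target, which is the situation where the quota warning would become visible to the user.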

currently, if an attacker burst-floods a mailbox with small messages, expire_to_target will delete them. when the burst is over, the address can receive messages again.

with a 2-day floor, a single burst of small messages fills the mailbox and none of them can be removed for 2 days. Dovecot rejects all incoming mail during that window. the attacker just repeats the burst every 2 days, minimal effort, permanent denial of service.

But it would be a visible denial of service, rather than a silent one, where you flush the messages out without a limit. I think for small groups that's better. For larger groups, the errors could be hidden, for example.

I think this is simply a needs of the large room vs needs of the small room scenario. In small rooms, just about anything seems preferable to me to silent message loss. And whatever path is taken, that some messages won't reach a user if an attacker does something nasty, is unavoidable, so you can only control how visible that is.

it's not about visible versus invisible denial of service. When you get flooded with messages, you will notice that in either case. Only in the remove-oldest algo you start receiving messages when the DoS stops, whereas with 2-day floor you stay incommunicado for two days after the first burst.
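The difference between the two policies can be made concrete with a toy model. This is a strong simplification and not how Dovecot actually schedules quota-warning scripts: `deliver` is a hypothetical function that expires oldest-first before every delivery, messages are `(mtime, size)` tuples, and the 100-byte limit is arbitrary.

```python
def deliver(mailbox, msg, max_size, now, floor_seconds=0, threshold=0.75):
    """Expire oldest-first down to the threshold, then try to store msg.

    Returns True if the message fits, False if the mailbox is over quota.
    """
    mailbox.sort()
    used = sum(size for _, size in mailbox)
    target = max_size * threshold
    while mailbox and used > target:
        mtime, size = mailbox[0]
        if floor_seconds and mtime > now - floor_seconds:
            break  # the oldest message is still protected by the floor
        mailbox.pop(0)
        used -= size
    if used + msg[1] > max_size:
        return False
    mailbox.append(msg)
    return True


# A burst of 200 one-byte messages at t=0, then one legitimate message an hour later.
plain, floored = [], []
for _ in range(200):
    deliver(plain, (0, 1), 100, now=0)
    deliver(floored, (0, 1), 100, now=0, floor_seconds=2 * 86400)
ok_plain = deliver(plain, (3600, 1), 100, now=3600)
ok_floored = deliver(floored, (3600, 1), 100, now=3600, floor_seconds=2 * 86400)
```

In this model the mailbox without a floor accepts the later message, while the mailbox with a 2-day floor rejects it until the burst messages are old enough to be removed.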

A DoS attack is pretty much hypothetical anyway, you can do more annoying stuff with less effort if you really want. More likely is some accidental flooding, e.g. a bot flooding the bot mailbox where nobody looks, or users flooding their mailbox with invisible webxdc updates or subscribing to a channel that sends them videos every day. If we do this change it will either be pretty much never triggered or might uncover some problem if a mailbox gets over quota and the user actually investigates and reports it, but most likely it will just never get triggered because eating through 400 MB in two days accidentally is unlikely.

One last suggestion from me is adding or entry.mtime > time.time() - 3600 or so, to not rotate through the mailbox continuously in case something is really-really broken and mailbox gets a new 10MB message every minute, but if not then just mark it as resolved.

k, i added the 1h-guard as suggested. 80ccf4d

addendum: only now saw ell1e's 6-hour ago comment for some reason, this message here was originally next to link2xt's last comment in the thread.

Sender visibility of a sending failure in my opinion is even more important than the receiver noticing that something is up.

When a sender transmits a message to 3 addresses (a multi-relay chat contact), and one address fails because of quota overrun, it is the receiving device which should notice quota-full event and do something about it, like clearing the storage or switching to a different relay. For users nothing in particular needs to happen. No message needs to be flagged, no warning needs showing. Everything works, and auto-repairs to continue to work. <------ this is at least the rough UX direction of current discussions about "automated relay management" designed to work for mass-users who might not know any details of the relay network, but just onboard on it.

addendum: only now saw ell1e's 6-hour ago comment for some reason

GitHub broke the caching logic for messages now, it was already broken for issue counters and pull request counters recently. I also get emails about your messages, then go to the page and have to refresh it manually to see the message. Also when I reload the page, i don't see any comments at first without any spinner, it is indistinguishable from the PR having no comments so i always think for a moment i opened the wrong PR or something, and the comments load only later, not necessarily in the latest state, sometimes without the new comment just received by email.

I don't think multi relay is relevant for this problem, since an attacker could just spam all relays. (Edit: or am I missing something? I'm happy to be wrong! ❤️ )

The only way to know that my message wasn't eaten by some spam a few hours after, while the user wasn't online yet, seems to be if there is a time window of at least 48 hours or so where it's guaranteed to be stored. That way it would actually arrive for most people. Otherwise you're allowing silent failures even though everything looked fine.

I honestly don't know if it's just me, but I abandoned multiple messengers just because of silent delivery failures. (That being XMPP OMEMO, as well as Wire messenger.) A single message not arriving without anybody being aware in a tense situation can be socially nuclear.

Oh another thing I just realized is relevant, I've had multiple people be offline for a week or two at some point. If you don't have a minimum retention time of two weeks, you would be able to write them really detailed messages in private while they might never arrive and nobody involved might ever know. While with the old configuration, you absolutely would know once their inbox ran full that both the new messages didn't arrive, and that the old messages would typically have been retained until read.

I realize this is an extreme case and it might not make sense to tweak this for the extreme cases. However, I also think the assumption every user is offline at most for 10 minutes is the opposite extreme. People might e.g. be in a building without reception for half the day, that's pretty common. Or their battery simply ran out and they were too busy to deal with it for a day. It happens to the best of us...

link2xt

There is a seemingly unrelated CI failure, from IP-based relays.

hpk42 had a problem deploying to staging.chatmail.at/doc/relay/ April 20, 2026 09:38 — with GitHub Actions Failure

hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 20, 2026 12:51 — with GitHub Actions Inactive

j4n requested changes Apr 20, 2026

missytake approved these changes Apr 20, 2026

missytake mentioned this pull request Apr 20, 2026

feat: add quota-triggered per-user mailbox cleanup so get an always below-quota relay experience #927

Closed

hpk42 force-pushed the auto-expire-oldest branch from fcbf626 to 1b6c4d1 Compare April 20, 2026 20:41

hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 20, 2026 20:41 — with GitHub Actions Inactive

hpk42 requested a review from j4n April 21, 2026 07:10

hpk42 force-pushed the auto-expire-oldest branch from 1b6c4d1 to f132881 Compare April 21, 2026 07:10

hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 21, 2026 07:10 — with GitHub Actions Inactive

hpk42 force-pushed the auto-expire-oldest branch from f132881 to df3c460 Compare April 23, 2026 11:07

hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 23, 2026 11:07 — with GitHub Actions Inactive

j4n approved these changes Apr 23, 2026

link2xt reviewed Apr 23, 2026

link2xt reviewed Apr 23, 2026

address link2xt review comments

7f58c8b

hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 23, 2026 21:00 — with GitHub Actions Inactive

link2xt approved these changes Apr 23, 2026

add 1h protection

80ccf4d

hpk42 temporarily deployed to staging.chatmail.at/doc/relay/ April 24, 2026 08:11 — with GitHub Actions Inactive

fixate madmail to v0.42.2 which according to my local cmlxc run works.

98184ce

hpk42 deployed to staging.chatmail.at/doc/relay/ April 24, 2026 12:11 — with GitHub Actions View deployment

5 participants
