feat: automatic oldest-first message removal from mailboxes to always stay under max_mailbox_size #929
j4n
left a comment
looks good, tested overfilling a mailbox with
for i in $(seq 1 400); do dd if=/dev/urandom bs=1M count=1 2>/dev/null | base64 | (echo "From: test@example.org"; echo "To: test@do.main"; echo "Subject: fill $i"; echo; cat) | doveadm save -u test@do.main; done
Running usr/local/lib/chatmaild/venv/bin/chatmail-quota-expire 70 **/home/vmail/mail/... worked, but when looking at the logs, it did not fire automatically:
doveadm(test@do.main): Error: program unix:/run/dovecot/quota-warning: net_connect_unix(/run/dovecot/quota-warning) failed: Permission denied (euid=999(vmail) egid=996(vmail) missing +r perm: /run/dovecot/quota-warning, dir owned by 0:0 mode=0755)
doveadm(test@do.main): Error: Saving failed: Quota exceeded (mailbox for user is full)
...
In cmdeploy/src/cmdeploy/dovecot/dovecot.conf.j2, the unix_listener quota-warning socket needs explicit permissions for vmail to connect:
unix_listener quota-warning {
+ user = vmail
+ mode = 0600
}
Then it works:
Apr 20 13:07:14 host dovecot[3834579]: quota-warning: Error: quota-expire: removed 6 message(s) from test@do.main
missytake
left a comment
Sounds very promising!
Maybe it would make sense to first expire larger messages by date, and only then the small messages; missing an attachment is much more harmless than missing membership changes.
fcbf626 to 1b6c4d1
In a prior version of the quota-expire effort, i had file size play into the sorting, so larger messages would be deleted first if they were not much younger than small messages. @link2xt was skeptical of that and eventually we arrived at saying "let's just do oldest first for now", also because it's easier to understand and reason about. However, i am still wondering about it. Maybe we can keep the last 7 days of small messages unconditionally, so a few big messages cannot push all the young small messages out of the queue?
sidenote: with this PR merged, addresses conceptually become handles for message queues rather than static mail storage.
1b6c4d1 to f132881
In #926 messages were scored by age multiplied by size. Could work, but then if you have really not read messages for some time, you will get some history but no "member added" messages and randomly dropped messages at the time when gossip happens instead of the history starting at roughly some timestamp. Can drop messages with
Something simple with two thresholds is probably fine: first delete huge messages (videos), then, if that is not sufficient, apply the same logic as now. But it is still not clear whether anyone would actually notice the improvement, and you lose the "queue" property. It might also be difficult to reimplement this logic if we get to replacing Maildir with something more efficient (chatmail basically needs a log storage: just write new messages into mbox and rotate "pages").
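For illustration only, the two-threshold idea could look roughly like the sketch below. It is not code from this PR: `Msg`, `plan_removals` and `huge_bytes` are hypothetical names, and the cut-off for what counts as "huge" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Msg:
    # Hypothetical stand-in for a mailbox entry.
    mtime: float     # delivery time, seconds since the epoch
    quota_size: int  # bytes counted against the quota


def plan_removals(messages, target_bytes, huge_bytes):
    """Return the messages to delete so the mailbox fits into target_bytes.

    Phase 1 takes huge messages (oldest first); phase 2 falls back to
    plain oldest-first over everything else.
    """
    total = sum(m.quota_size for m in messages)
    oldest_first = sorted(messages, key=lambda m: m.mtime)
    huge = [m for m in oldest_first if m.quota_size >= huge_bytes]
    rest = [m for m in oldest_first if m.quota_size < huge_bytes]

    doomed = []
    for m in huge + rest:
        if total <= target_bytes:
            break
        doomed.append(m)
        total -= m.quota_size
    return doomed
```

Note that once phase 1 has run, the surviving messages no longer form a contiguous "everything since timestamp T" window, which is the loss of the queue property mentioned above.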
… stay under max_mailbox_size
Both dovecot-quota-threshold triggers and the daily expiry routine will now expunge oldest messages from mailboxes automatically when the mailbox reaches 75% of max_mailbox_size.
Delta Chat users should not see any warnings (at 80/95 percent) or bounce messages, and existing over-quota mailboxes should start receiving mails again.
f132881 to df3c460
total_size = sum(m.quota_size for m in messages)
removed = 0
for entry in sorted(messages):
    if total_size <= target_bytes:
There was a problem hiding this comment.
One suggestion (from Ellie) is to unconditionally not delete messages that are new to make complete flushing of the mailbox noticeable:
there could also be a mixed mode, for example only delete old messages if they have been around at least X day
Maybe stop here not only if we reached the quota target, but also if we reached a message that is e.g. only 2 days old. Then if you get a flood of messages, you start using the remaining 30% and the user will get the 80% quota warning. And if it happens persistently, you eventually get over quota and this does not resolve, but that actually means something has gone wrong and should be investigated.
tl;dr the suggestion is to add `or entry.mtime > time.time() - 2 * 86400` here and break out of the loop if we reached very new messages. If we eat through 400 MB of storage in two days, something has likely gone really wrong and warning the user sounds reasonable.
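A minimal, self-contained sketch of the loop with that extra break condition, for readers who want to try it out. Only the loop shape and the `entry.mtime` / `quota_size` attributes come from the diff under review; the `Msg` class, the `min_age_seconds` and `now` parameters, and the return value are assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Msg:
    # Hypothetical stand-in; mtime comes first so sorted() is oldest-first.
    mtime: float     # delivery time, seconds since the epoch
    quota_size: int  # bytes counted against the quota


def expire_to_target(messages, target_bytes, min_age_seconds=2 * 86400, now=None):
    """Pick oldest messages for removal until target_bytes is reached,
    but never touch messages younger than min_age_seconds."""
    now = time.time() if now is None else now
    total_size = sum(m.quota_size for m in messages)
    removed = []
    for entry in sorted(messages):
        # Stop when we are under the target OR when we hit a "new" message.
        if total_size <= target_bytes or entry.mtime > now - min_age_seconds:
            break
        removed.append(entry)
        total_size -= entry.quota_size
    return removed
```

With this floor the function may return while the mailbox is still above `target_bytes`; that is the intended effect, since the regular quota warnings then become visible to the user.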
There was a problem hiding this comment.
currently, if an attacker burst-floods a mailbox with small messages, expire_to_target will delete them. when the burst is over, the address can receive messages again.
with a 2-day floor, a single burst of small messages fills the mailbox and none of them can be removed for 2 days. Dovecot rejects all incoming mail during that window. the attacker just repeats the burst every 2 days: minimal effort, permanent denial of service.
There was a problem hiding this comment.
But it would be a visible denial of service, rather than a silent one, where you flush the messages out without a limit. I think for small groups that's better. For larger groups, the errors could be hidden, for example.
I think this is simply a "needs of the large room vs. needs of the small room" scenario. In small rooms, just about anything seems preferable to me to silent message loss. And whatever path is taken, it is unavoidable that some messages won't reach a user if an attacker does something nasty, so you can only control how visible that is.
There was a problem hiding this comment.
it's not about visible versus invisible denial of service. When you get flooded with messages, you will notice that in either case. But only with the remove-oldest algo do you start receiving messages again when the DoS stops, whereas with a 2-day floor you stay incommunicado for two days after the first burst.
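The difference between the two behaviours can be checked with a toy model. This is only a sketch: it ignores how Dovecot really accounts quota, and the 500 MB limit, the 50 KB message size and all function names are assumptions chosen for the example.

```python
HOUR = 3600
MAX_BYTES = 500_000_000            # assumed max_mailbox_size
TARGET_BYTES = MAX_BYTES * 3 // 4  # expire down to 75%


def expire(messages, target_bytes, min_age, now):
    """messages is a list of (mtime, size); return the survivors after
    oldest-first expiry that spares messages younger than min_age."""
    survivors = sorted(messages)
    total = sum(size for _, size in survivors)
    drop = 0
    while (drop < len(survivors) and total > target_bytes
           and survivors[drop][0] <= now - min_age):
        total -= survivors[drop][1]
        drop += 1
    return survivors[drop:]


def accepts(messages, incoming_size):
    """Toy quota check: the new message must fit under MAX_BYTES."""
    return sum(size for _, size in messages) + incoming_size <= MAX_BYTES


def mail_accepted_after_burst(min_age, hours_after_burst):
    """A burst of small messages fills the mailbox at t=0; can a
    legitimate message be delivered hours_after_burst later?"""
    burst = [(0.0, 50_000)] * 10_000  # 500 MB in total
    now = hours_after_burst * HOUR
    return accepts(expire(burst, TARGET_BYTES, min_age, now), 50_000)
```

In this model `mail_accepted_after_burst(0, 1)` is true, `mail_accepted_after_burst(48 * HOUR, 1)` is false, and the mailbox only opens up again once the burst is older than the floor, e.g. `mail_accepted_after_burst(48 * HOUR, 49)`.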
There was a problem hiding this comment.
A DoS attack is pretty much hypothetical anyway; you can do more annoying stuff with less effort if you really want. More likely is some accidental flooding, e.g. a bot flooding the bot mailbox where nobody looks, or users flooding their mailbox with invisible webxdc updates or subscribing to a channel that sends them videos every day. If we do this change, it will either pretty much never be triggered, or it might uncover some problem if a mailbox gets over quota and the user actually investigates and reports it. Most likely it will just never get triggered, because eating through 400 MB in two days accidentally is unlikely.
One last suggestion from me is adding `or entry.mtime > time.time() - 3600` or so, to not rotate through the mailbox continuously in case something is really-really broken and the mailbox gets a new 10 MB message every minute, but if not then just mark it as resolved.
There was a problem hiding this comment.
k, i added the 1h-guard as suggested. 80ccf4d
addendum: only now saw ell1e's 6-hour ago comment for some reason, this message here was originally next to link2xt's last comment in the thread.
There was a problem hiding this comment.
Sender visibility of a sending failure in my opinion is even more important than the receiver noticing that something is up.
When a sender transmits a message to 3 addresses (a multi-relay chat contact), and one address fails because of quota overrun, it is the receiving device which should notice the quota-full event and do something about it, like clearing the storage or switching to a different relay. For users nothing in particular needs to happen. No message needs to be flagged, no warning needs showing. Everything works, and auto-repairs to continue to work. <------ this is at least the rough UX direction of current discussions about "automated relay management" designed to work for mass-users who might not know any details of the relay network, but just onboard on it.
There was a problem hiding this comment.
addendum: only now saw ell1e's 6-hour ago comment for some reason
GitHub broke the caching logic for messages now; it was already broken for issue counters and pull request counters recently. I also get emails about your messages, then go to the page and have to refresh it manually to see the message. Also, when I reload the page, I don't see any comments at first and there is no spinner, so it is indistinguishable from the PR having no comments, and I always think for a moment that I opened the wrong PR or something. The comments load only later, not necessarily in the latest state, sometimes without the new comment I just received by email.
There was a problem hiding this comment.
I don't think multi relay is relevant for this problem, since an attacker could just spam all relays. (Edit: or am I missing something? I'm happy to be wrong! ❤️ )
The only way to know that my message wasn't eaten by some spam a few hours after, while the user wasn't online yet, seems to be if there is a time window of at least 48 hours or so where it's guaranteed to be stored. That way it would actually arrive for most people. Otherwise you're allowing silent failures even though everything looked fine.
I honestly don't know if it's just me, but I abandoned multiple messengers just because of silent delivery failures. (That being XMPP OMEMO, as well as Wire messenger.) A single message not arriving without anybody being aware in a tense situation can be socially nuclear.
There was a problem hiding this comment.
Oh another thing I just realized is relevant, I've had multiple people be offline for a week or two at some point. If you don't have a minimum retention time of two weeks, you would be able to write them really detailed messages in private while they might never arrive and nobody involved might ever know. While with the old configuration, you absolutely would know once their inbox ran full that both the new messages didn't arrive, and that the old messages would typically have been retained until read.
I realize this is an extreme case and it might not make sense to tweak this for the extreme cases. However, I also think the assumption every user is offline at most for 10 minutes is the opposite extreme. People might e.g. be in a building without reception for half the day, that's pretty common. Or their battery simply ran out and they were too busy to deal with it for a day. It happens to the best of us...
link2xt
left a comment
There is a seemingly unrelated CI failure, from IP-based relays.
Revamped #927, trying to address all suggestions from there.
Both dovecot-quota-threshold triggers and the daily expiry routine will expunge oldest messages from mailboxes automatically when the mailbox reaches 75% of max_mailbox_size.
Delta Chat users should not see any warnings (at 80/95 percent) or bounce messages, and existing over-quota mailboxes should start receiving mails again.