Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: addmailuser - Remove delaying completion until /var/mail is ready #2729

Merged

Conversation

polarathene
Copy link
Member

@polarathene polarathene commented Aug 19, 2022

Description

Quick Summary

Resolves a TODO task with addmailuser.

Overview

The main change is adding three new methods in common.bash, which replace the completion delay in addmailuser / setup email add command.

Other than that:

  • I swapped sh -c 'addmailuser ...' to setup email add ....
  • Improved three tests in setup-cli.bats for setup email add|update|del (logic remains effectively the same still).
  • Rewrote the TODO comment for setup-cli.bats test on setup email del to better clarify the concern, but the test itself was no longer affected due to changes prior to this PR, so I enabled the commented out assertion.
  • Removed unnecessary waits. The two skip tests in test/tests.bats could be enabled again after this PR.

Commit messages provide additional details if helpful.

Details

Presently when adding a user to postfix-accounts.cf with a running container, the addmailuser (aka setup email add) command will delay completion until the account has a directory in /var/mail (which is created by the changedetector service during creation of the Dovecot user).

AFAIK the delay was added primarily for our test suite and serves no other purpose. This PR migrates that delay logic into common.bash test helper methods, resolving the TODO.

Future work

A future improvement might instead skip the changedetector service and add the Dovecot account directly.

Presently that is not an option as that functionality is wrapped in a loop and assumes a clean slate.

The upcoming Dovecot mailcrypt plugin feature does require addmailuser delaying until /var/mail account folder is available however. This PR will not help users of that feature until Dovecot accounts can be created directly without relying on changedetector.


TODO addressed by this PR

Removed comment inline-docs TODO for reference (and having Github link the PR to those issues if traceability is needed in future)

TODO: Remove this method or at least it's usage in addmailuser. If tests are failing, correct the tests.

This method was added to delay command completion until a change detection event had processed the newly added user,
confirmed once maildir was created.
It was a workaround to accommodate the test suite apparently, but otherwise
prevents batch adding users (each one would have to go through their own change detection event).

Originally introduced in PR 1980 (afterwards two futher PRs deleted, and then reverted that deletion):
Not much details/discussion were in the PR, these are the specific commits:

Original reasoning for this method (sounds like a network storage I/O issue):
Tests fail if the creation of /var/mail/${DOMAIN}/${USER} doesn't happen fast enough after addmailuser executes (check-for-changes.sh race-condition)
Prevent infinite loop in tests like: "checking accounts: user3 should have been added to /tmp/docker-mailserver/postfix-accounts.cf even when that file does not exist"


Fixes: #2096 (comment)

Type of change

  • Improvement (non-breaking change that does improve existing functionality)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes

This was apparently only for supporting tests that need to wait on account creation being ready to test against.

As per the removed inline docs, it should be fine to remove once tests are updated to work correctly without it.
`wait_until_account_maildir_exists()` provides the same logic `addmailuser` command was carrying, to wait upon the account dir creation in `/var/mail`.

As this was specifically to support tests, it makes more sense as a test method.

`add_mail_account_then_wait_until_ready()` was added to handle the common pattern of creating account and waiting on it. An internal assert will ensure the account was successfully created first during the test before attempting to wait.
…essed

The current helper is more complicated for no real benefit, it only detects when a change is made that would trigger a change event in the `changedetector` service. Our usage of this in tests however is only interested in waiting out the completion of the change event.

Remove unnecessary change event waits. These waits should not be necessary if handled correctly.
The method is unaltered, just repositioned.
`wait_until_account_maildir_exists` was added to ensure the account is ready in two tests.

Some blank lines were added/removed for consistency.
This helper method is used where appropriate.
- A password is not relevant (optional).
- We need to wait on the creation on the account (Dovecot and `/var/mail` directory).
The delete test was failing as the `/var/mail` directory did not yet exist.

There is now a proper delay imposed in the `add` test now shares the same account for both `update` and `del` tests resolving that failure.

Additionally tests use better asserts where appropriate and the wait + sleep logic in `add` has been improved (now takes 10 seconds to complete, approx half the time than before).

The `del` test TODO while not technically addressed is no longer relevant due to the tests being switched to `-c` option (there is a separate `no container` test file, but it doesn't provide a `del` test).
@polarathene

This comment was marked as resolved.

This test helper relies on a `debug` log output by `check-for-changes.sh`.

Additionally `setup-cli.bats` was not using it properly as the container name wasn't passed as an arg.. and no failure was raised.
@polarathene polarathene marked this pull request as draft August 19, 2022 20:26
`/etc/dms-settings` quote wraps the value, but the assert wasn't helpful either with the failure (no match).

Chose to fetch the ENV from container environment with `env` as we don't rely on `start-mailserver.sh` processing for this ENV anyway. Also switched to `assert_regex` for better failure output.

The `changedetector` state conditions were returning the wrong status code (0 is true/ok in bash). An earlier misunderstanding to resolve a failure had me add explicit status codes to return for true/false, as it didn't seem to work correctly with just the `[[ expression ]]` but was due to something else?
@polarathene polarathene marked this pull request as ready for review August 20, 2022 13:30
@georglauterbach

This comment was marked as outdated.

@polarathene

This comment was marked as outdated.

Copy link
Member

@georglauterbach georglauterbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look again, got a nice overview with your comments, and this looks good - LGTM 👍🏼

casperklein
casperklein previously approved these changes Aug 21, 2022
@polarathene

This comment was marked as resolved.

There is not much reason to check before waiting on ClamAV.

It is more helpful to debug failures from `nc` mail send commands if we know that nothing went wrong inbetween the ClamAV wait time.

Additionally added an assertion which should provide more information if this part of the test setup fails again.
@polarathene
Copy link
Member Author

polarathene commented Aug 22, 2022

Healthcheck test failure investigation (fixed)

Well it didn't fail at the same spot, this time it was the healthcheck test:

not ok 312 checking container healthcheck in 87ms
# (from function `assert_output' in file test/test_helper/bats-assert/src/assert_output.bash, line 194,
#  in test file test/tests.bats, line 969)
#   `assert_output "healthy"' failed
#
# -- output differs --
# expected : healthy
# actual   : unhealthy
# --
#

@test "checking container healthcheck" {
run bash -c "docker inspect mail | jq -r '.[].State.Health.Status'"
assert_output "healthy"
assert_success
}

--health-cmd "ss --listening --tcp | grep -P 'LISTEN.+:smtp' || exit 1" \

That's fairly far down in the test, not sure how it ended up unhealthy?

Defaults:

--interval=DURATION (default: 30s)
--timeout=DURATION (default: 30s)
--start-period=DURATION (default: 0s)
--retries=N (default: 3)

I think retries has no delay between attempts, and all retries are part of the same check in the timeout bounds? So likely a false-positive here due to the check being affected at a time postfix was being restarted due to change detection event.. I'll shift this test up to the start then.

This test is a bit fragile. It relies on defaults for the healthcheck with intervals of 30 seconds.

If the check occurs while Postfix is down due a change event from earlier tests and the healthcheck kicks in at that point, then if there is not enough time to refresh the health status from `unhealthy`, the test will fail with a false-positive as Postfix is actually working and up again..
@georglauterbach
Copy link
Member

I hope I got this right: as the new addmailuser does not wait for the changedetector to pick up the changes and create the Dovecot account, there is a slight delay after the addmailuser command succeeds, right? Our test suite has always been fragile when it comes to timing in certain areas, your logic and code seem hardened to me however (nice work BTW). I do not see why tests would fail at the moment. I had experienced the healthcheck failing in the past too, so moving it up seems like a good idea, but I would add a comment if there really is a inter-test-dependency.

@polarathene
Copy link
Member Author

polarathene commented Aug 22, 2022

I hope I got this right: as the new addmailuser does not wait for the changedetector to pick up the changes and create the Dovecot account, there is a slight delay after the addmailuser command succeeds, right?

Before / After (verbose)

Before:

  • addmailuser make an account to postfix-accounts.cf, don't exit until /var/mail has created an associated folder /var/mail/<domain>/<user>.
  • changedetector notice update in postfix-accounts.cf:
    • Add account to UserDB and PassDB for Dovecot.
    • Create the folder (addmailuser condition met).
    • Apply check for user/group ownership (mkdir called by root in script, so condition triggers chown) - potential race condition (encountered in PR for Dovecot encrypted storage, I have a fix for that to wait on ownership instead of only checking /var/mail account directory exists).
    • Change detection restarts Postfix and Dovecot.
    • Change detection complete.
    • Update checksum file.
    • Dovecot and Postfix may still be coming back up for a moment.

Now:

  • addmailuser make an account to postfix-accounts.cf, exit.
  • Tests can use wait_until_account_maildir_exists():
    • Wait until directory exists, just like addmailuser did before.
  • Tests can use wait_until_change_detection_event_completes():
    • Wait on "Change detection complete" (after first waiting on a "Change detection start" equivalent event is present in logs).
    • Only confirms a change detection has occurred (may be sufficient, account now exist in Dovecot UserDB and PassDB, doveadm utilities can work without running Dovecot instance I think)
    • Otherwise use additional wait on service needed, or worse sleep for some time. No different to handling addmailuser before.

So mostly, addmailuser will finish much more quickly now, and if you actually need the delay to wait until maildir creation, you use wait_until_account_maildir_exists(), it should be have the same. If you want to instead wait a little bit longer, but avoid some potential race conditions, wait_until_change_detection_event_completes() is more useful.

Neither guarantees Dovecot or Postfix is up, as changedetector only reports when it's done processing, not on async activity of services restarting which can still occur after changedetector is finished it's part. wait_on_service() is available, but it just queries supervisord if the service in RUNNING state, which should be the case when the service is not yet ready during it's restart.

v12 can change from restart to reload of the services which should help minimize that issue assuming it causes no regressions.


New race condition failure - `email del -y` (resolved)

New transient failure (this PR is just the gift that keeps on giving with random failing tests..):

not ok 186 checking setup.sh: setup.sh email del in 429ms
# (from function `assert_failure' in file test/test_helper/bats-assert/src/assert_failure.bash, line 66,
#  in test file test/setup-cli.bats, line 116)
#   `assert_failure' failed
#
# -- command succeeded, but it was expected to fail --
# output :
# --
#

This time it's a test this PR actually touches, and is enabling a check that was previously disabled:

  # Mail storage for account was actually removed by `-y`:
  run docker exec "${TEST_NAME}" ls /var/mail/example.com/user
  assert_failure

Potentially a race condition with the previous setup email update test. Change detection is likely running, and perhaps starts to create all accounts from scratch while setup email del deletes the directory, and that is recreated by the changedetector? Doesn't look like it'd have much of a window to do that within :\

This is perhaps one of those scenarios where proper locking with flock() would avoid the issue as the user management commands and the change detector would be actually aware of each other (present file locking was changed around about 10.2, maybe earlier but mistakenly places a lock on the active script running, not the relevant config).

For now I'll ensure the change detector is done processing.. Opted for repeating the check with a timeout instead which should work just as well and minimize the wait time.

Workaround that tries not to introduce heavier delays by waiting on a full change event to complete in the previous `email update` if possible.

There is a chance that the account has the folder deleted, but restored from an active change event (for password update, then the account delete).

---

Additionally apply review feedback to add inline maintainer note about the healthcheck test to avoid refactors making the same mistake.
@polarathene polarathene merged commit 75a75bf into docker-mailserver:master Aug 22, 2022
@georglauterbach georglauterbach mentioned this pull request Oct 6, 2022
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tests kind/improvement Improve an existing feature, configuration file or the documentation service/dovecot service/postfix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

addmailuser is slow when number of mailboxes increases (especially with NFS)
3 participants