New approach for IMAP Idle connection handling #2208

csb0730 · 2021-02-08T22:06:06Z

IMHO these findings could be important for further development, so I think it is best to post them here as an issue. I believe they are worth discussing, and a discussion may even be necessary. So let me start:

1. Motivation

Under bad/flaky network conditions, the following issues came up and led to a bad user experience:

High battery drain by Delta Chat 1.2.1
Receipt of incoming messages is sometimes unreliable
Messages waiting to be sent are sometimes not delivered (red X; sending stops)

Testing environment: Android 4.1.2

2. Goals

Reduce battery drain as much as possible => perform only actions that are really necessary!
IMAP idle timeout length limited only by technical conditions, max. 29 min.
=> This should be handled properly by core alone (where possible).
Do job handling only if the network is available; all other actions will fail anyway and are wasted effort!
Reliable operation under all conditions
(For example: DC not opened, no manual intervention for a long time, device screen off)
** This is a must **

3. Background, Findings and Issues

While examining connection handling, job handling and the use of Android system functions, I found that no single issue is responsible for the unfavorable behaviour of DC; rather, several factors act in combination!

In detail:

a) Periodic Work Request (PWR) (Android; interval 15 min)

With the default interval of 15 min, it is not possible to use the desired longer idle periods of up to 29 min.
After 15 min at the latest, all idles are interrupted.
Even worse, the Periodic Work Request is not synchronized with the start of the IMAP idle timeout, so
the result is very often a much shorter idle duration.
Trying a longer interval for the PWR (for example 30 min) showed that the Android system does not
accept it; only the 15 min interval worked.

Maybe this is an Android 4 limitation, but that is what I observed.

In the end, for all tests with timeouts longer than 15 min, the Periodic Work Request was either (a) disabled completely or (b) the actions it triggered were skipped by core!

b) One long timeout for IMAP idle connection lets core sleep unpredictably and doesn't show network errors while waiting.

For the IMAP idle connection, a timeout of 23 min is set (23 * 60 s). Core then waits
for an external interrupt or for the timeout to expire.
The problem observed is that this long connection timeout leads to unreliable core behaviour.
The timeout never expires reliably. Most of the time it takes much, much longer than 23 min to expire,
or it never expires at all! DC then sleeps completely until the user wakes it up by a manual trigger!
As a result, the mail server closes the connection and core does not become aware of it!
Thus, broken connections and/or external connection problems are not detected.
There were often minutes or even hours during which DC didn't receive any messages.

c) Max idle connection timeout depends on network type (wifi/mobile)

When working with really long timeout periods, I found that the maximum connection timeout depends not only on the mail server but also on the type of network!
At home (wifi) I measured a maximum of approx. 13 min;
on the mobile network I saw stable operation up to the full 29 min that the mail server allows.

d) Network status delivered by device is sometimes incorrect/unreliable

The FFI interface has no provision for network-down events, so every network event triggers the start of job actions. Under flaky network conditions this happens very often and unnecessarily.
Under flaky network conditions many events are fired in quick succession. Core does not ignore/suppress these rapid events in a reasonable way.
Sometimes no event is fired at all, even though the network was temporarily unavailable and came back.
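
One way to suppress such bursts of events could look like the following minimal Rust sketch. It is not taken from the linked branch: `EventDebouncer`, `min_gap` and `accept` are hypothetical names, and the gap length is left to the caller.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of Delta Chat core): drops network events
/// that arrive too soon after the last accepted one.
pub struct EventDebouncer {
    min_gap: Duration,
    last_accepted: Option<Instant>,
}

impl EventDebouncer {
    pub fn new(min_gap: Duration) -> Self {
        Self { min_gap, last_accepted: None }
    }

    /// Returns true if an event observed at `now` should trigger job handling.
    /// Rejected events do not move the reference point, so a steady stream of
    /// events still gets through once per `min_gap`.
    pub fn accept(&mut self, now: Instant) -> bool {
        match self.last_accepted {
            Some(prev) if now.saturating_duration_since(prev) < self.min_gap => false,
            _ => {
                self.last_accepted = Some(now);
                true
            }
        }
    }
}
```

With a gap of 10 s, an event at t0 is accepted, one at t0 + 3 s is dropped, and one at t0 + 12 s is accepted again. This only addresses the "too many events" half of the problem; a missing event still has to be caught by observing the connection itself.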

e) Unnecessary job handling, retransmissions and retries, even and especially when the network is not available.

This is caused by d) and by the fact that an event does not tell core whether the network went up or down.
The interface between UI and core doesn't provide this up/down information (!)
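
A rough Rust sketch of how an up/down flag could gate job handling, under the assumption that the UI can pass an `up: bool` with each network event. All names (`ConnectionStatus`, `run_job`, `JobOutcome`) and the delay values in `RETRY_DELAYS_SECS` are placeholders chosen for illustration, not the actual core API or the values used in the branch.

```rust
/// Hypothetical status flag fed by device network events and by the
/// outcome of real connection attempts.
pub struct ConnectionStatus {
    online: bool,
}

impl ConnectionStatus {
    pub fn new() -> Self {
        Self { online: false }
    }
    /// Device network event carrying the up/down information.
    pub fn on_network_event(&mut self, up: bool) {
        self.online = up;
    }
    /// Result of an actual connection attempt (error or success).
    pub fn on_connection_result(&mut self, success: bool) {
        self.online = success;
    }
    pub fn is_online(&self) -> bool {
        self.online
    }
}

/// Placeholder retry delays in seconds; the real list is not given here.
pub const RETRY_DELAYS_SECS: [u64; 3] = [30, 120, 600];

#[derive(Debug, PartialEq, Eq)]
pub enum JobOutcome {
    Skipped,      // offline: nothing attempted, no retry counted
    Done,         // attempt succeeded
    RetryIn(u64), // attempt failed, retry after this many seconds
    GaveUp,       // attempt failed and the delay list is exhausted
}

/// Attempts a job only while online; `tries` grows only for real attempts.
pub fn run_job(
    status: &ConnectionStatus,
    tries: &mut usize,
    attempt: impl FnOnce() -> bool,
) -> JobOutcome {
    if !status.is_online() {
        return JobOutcome::Skipped;
    }
    if attempt() {
        return JobOutcome::Done;
    }
    let outcome = match RETRY_DELAYS_SECS.get(*tries) {
        Some(&delay) => JobOutcome::RetryIn(delay),
        None => JobOutcome::GaveUp,
    };
    *tries += 1;
    outcome
}
```

The point of the sketch is that a job started while offline neither runs nor uses up one of its retries.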

f) Interleaved parallel job handling (old core version)

Job handling is not locked properly. When many network events are fired within seconds, some (identical) actions are started again in a new thread while the first action is still in progress.
=> Maybe this has been solved in the meantime by a newer core version.
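
To illustrate what "locked properly" could mean, here is a small Rust sketch that skips a job run while another one is still in progress. `JobGate` and `run_exclusive` are hypothetical names; how current core versions actually serialize jobs is not covered here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical guard: at most one job run at a time, later callers are skipped.
pub struct JobGate {
    running: AtomicBool,
}

impl JobGate {
    pub const fn new() -> Self {
        Self { running: AtomicBool::new(false) }
    }

    /// Runs `work` unless another run is in progress.
    /// Returns `None` when the call was skipped.
    pub fn run_exclusive<T>(&self, work: impl FnOnce() -> T) -> Option<T> {
        // Atomically flip false -> true; failure means someone else is running.
        if self
            .running
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            return None;
        }

        // Clears the flag when dropped, even if `work` panics.
        struct Reset<'a>(&'a AtomicBool);
        impl<'a> Drop for Reset<'a> {
            fn drop(&mut self) {
                self.0.store(false, Ordering::Release);
            }
        }
        let _reset = Reset(&self.running);

        Some(work())
    }
}
```

A second network event that arrives while the first job run is still active then gets `None` back instead of starting the same actions again in a new thread.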

g) Permanent notification (Android) - necessary to keep DC reliable, even with Android 4 (!)

Regardless of which measures were chosen, sooner or later DC ended up not receiving any messages.
The ONLY way to keep DC working reliably is to introduce (force) the "Permanent background notification", even on Android 4.

4. Debugging

At the beginning of these examinations it was very hard to understand what core was really doing.
The pre-existing logging was not sufficient to show all the necessary information.
=> Logging was extended and reworked (text messages, trigger points and format) to reveal the issues in core and their root causes.

5. Basics of "new approach"

Use Android's "Periodic Work Request" only to check whether core is working properly.
Interrupt only when a timing problem is detected.
=> Store the end time of the next required IMAP idle timeout in a variable and check it in a shorter loop!
Handle the IMAP idle connection with many short loop timeouts (5 s) instead of one long timeout of 23 min.
- Timeout controlled by a timer, not by the connection timeout!
- Every connection error now interrupts idle.
- Log every loop iteration that takes longer than 1 min.
=> This approach guarantees a maximum non-operational period of 2 min for core!
Dynamic IMAP idle duration of 11-23 min, controlled by connection failures.
- Start with 11 min; after each idle without failure, increase by 1 min (max. duration 23 min).
- Every connection failure reduces the duration by 5 min.
Extend the FFI interface to pass the on/off status of a network event to core.
New internal connection status flag in core
=> controlled by the device's network events AND by connection behaviour (error, success).
No job handling while offline (connection status flag).
Increment job retries only when the connection status flag shows "online".
Change the retry timer calculation to a predefined list of durations and reduce the number of retries.
No interleaved parallel job handling.
Suppress quick repetition and overlapping job starts caused by rapid network events.
Permanent background notification forced (this is a must, even on Android 4!)
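
The dynamic idle duration rule from the list above can be written down in a few lines of Rust. This is only a sketch of the rule as described: the function name is made up, and treating 11 min as the lower bound is an assumption read from "11-23 min".

```rust
/// Bounds of the dynamic idle duration in minutes.
/// The lower bound is an assumption based on the "11-23 min" range.
pub const MIN_IDLE_MIN: u32 = 11;
pub const MAX_IDLE_MIN: u32 = 23;

/// Hypothetical helper: computes the next idle duration from the current one.
/// A failure-free idle adds 1 min, a connection failure removes 5 min,
/// and the result always stays within 11..=23 min.
pub fn next_idle_minutes(current: u32, idle_succeeded: bool) -> u32 {
    let next = if idle_succeeded {
        current.saturating_add(1)
    } else {
        current.saturating_sub(5)
    };
    next.max(MIN_IDLE_MIN).min(MAX_IDLE_MIN)
}
```

Starting at 11 min, it takes twelve failure-free idles to reach 23 min, while a single failure at 20 min drops the duration back to 15 min. On a network that only sustains about 13 min (like the wifi case in 3c), the duration would therefore settle near the low end of the range.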

6. Experiences and Outlook with new approach

Experiences with flaky network conditions and overall operation:
- Very stable message reception
- Low network traffic
- Very low battery drain (always!)
- No job handling while being in "Flight Mode" or offline
No unsent messages any more.
There is good potential to optimize connection handling even in current DC
(DC 1.14.5, core 1.50.0): I checked the current sources and found the basic approach unchanged compared to DC 1.2.1 (core 1.27).

This has become a long summary, but as I mentioned at the start: no single issue is responsible for the unfavorable behaviour of DC. Meanwhile I have been running the "new approach" for some months with great success.
I would say it meets the goal :-)

csb0730 · 2021-02-08T22:38:48Z

https://github.com/csb0730/deltachat-core-rust/tree/lesser-battery-drain/src/imap

Link to the sources

gerryfrancis · 2021-02-09T08:41:54Z

Related: deltachat/deltachat-android#1573

link2xt · 2021-02-19T01:09:24Z

Link to commits, easier to review than the sources:
https://github.com/csb0730/deltachat-core-rust/commits/lesser-battery-drain

csb0730 · 2021-04-10T21:41:11Z

@link2xt thanks for pointing to the commit history. That is much clearer 👍

link2xt · 2021-05-09T21:19:22Z

b) One long timeout for IMAP idle connection lets core sleep unpredictably and doesn't show network errors while waiting.

I have looked into the implementation of wait_with_timeout from async-imap. It uses async_std::future::timeout, which in turn uses Timer from the async-io crate. This crate uses std::time::Instant. We already had a problem with std::time::Instant not advancing while the app is sleeping on Android: #1706. This is why the IDLE timer may be effectively infinite if you are not using the app. I think we need to report this problem upstream, at least to https://github.com/smol-rs/async-io and then probably to the Rust standard library, as explained in https://users.rust-lang.org/t/std-now-with-android/41774

Filed an issue: smol-rs/async-io#63
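
As an illustration of the difference, a deadline can be stored as wall-clock time and compared against the current wall-clock time on every check, instead of relying on a timer built on `Instant`. The sketch below assumes that `std::time::SystemTime` keeps advancing while the device sleeps; `WallClockDeadline` is a hypothetical type, not something from async-io or core.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical deadline stored as wall-clock time, so that time spent
/// asleep counts towards expiry when the deadline is checked again.
pub struct WallClockDeadline {
    ends_at: SystemTime,
}

impl WallClockDeadline {
    pub fn after(now: SystemTime, timeout: Duration) -> Self {
        Self { ends_at: now + timeout }
    }

    /// Time left until the deadline, or zero if it has already passed.
    pub fn remaining(&self, now: SystemTime) -> Duration {
        self.ends_at.duration_since(now).unwrap_or(Duration::ZERO)
    }

    pub fn is_expired(&self, now: SystemTime) -> bool {
        self.remaining(now) == Duration::ZERO
    }
}
```

The trade-off is that wall-clock time can jump when the user or the network changes the system clock, so a real implementation would have to tolerate deadlines that expire early or late.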

csb0730 · 2021-06-07T21:41:14Z

I also found that the OS treats the app differently when a shorter connection timeout value is used in a loop.

I can describe it like this:

no loop, connection timeout is set to the desired max timeout => timeout is unpredictable, very often endless
loop, connection timeout is set to 2 min => the OS throttles the loop very quickly (the 1st iteration is OK, then the duration rises quickly to 2 ... 10 min and more!)
loop, connection timeout is set to 5 s => the OS throttles the loop much more slowly (5 s, 10 s, 30 s, ... max 2 min!) and every broken connection is signaled.

In the end the loop duration is still not very accurate, but it is reliable enough for safe operation. To reach this goal I used a max timeout duration of 23 min instead of the 29 min that is the mail server's limit.

It seems that when an app intends to run and is not waiting for a semaphore or similar, it is scheduled more often and more reliably by the OS.

=> These observations were made with Android 4.1, but they may be valid for other OSes too.
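
The loop variant can be sketched in Rust roughly as follows. This is a simplified model, not the code from the branch: `idle_in_slices`, `SliceEvent` and `IdleEnd` are invented names, the clock and the per-slice wait are passed in as closures, and the deadline is assumed to be kept as wall-clock time.

```rust
use std::time::{Duration, SystemTime};

/// What one short wait on the connection reported (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum SliceEvent {
    Quiet,
    NewData,
    ConnectionError,
}

/// Why the whole idle period ended (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum IdleEnd {
    NewData,
    ConnectionError,
    DeadlineReached,
}

/// Waits for up to `total`, but in slices of at most `slice`, so that a
/// broken connection or an overrun deadline is noticed after each slice.
pub fn idle_in_slices(
    total: Duration,
    slice: Duration,
    mut now: impl FnMut() -> SystemTime,
    mut wait_slice: impl FnMut(Duration) -> SliceEvent,
) -> IdleEnd {
    let deadline = now() + total;
    loop {
        // Re-read the clock every round: a throttled slice may have taken
        // far longer than requested.
        let remaining = match deadline.duration_since(now()) {
            Ok(left) if left > Duration::from_secs(0) => left,
            _ => return IdleEnd::DeadlineReached,
        };
        match wait_slice(remaining.min(slice)) {
            SliceEvent::Quiet => continue,
            SliceEvent::NewData => return IdleEnd::NewData,
            SliceEvent::ConnectionError => return IdleEnd::ConnectionError,
        }
    }
}
```

With `total` = 23 min and `slice` = 5 s, a slice that the OS stretches to several minutes simply shortens the number of remaining rounds, and a connection error ends the idle immediately instead of going unnoticed until the long timeout fires.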

r10s · 2022-03-08T20:33:30Z

IMAP connectivity has improved since then; meanwhile it is more SMTP that causes problems.

leaving this as reference in resurrection, however. if needed, we can split off smaller, actionable tasks (cmp. march2022 cleanup)

csb0730 mentioned this issue Feb 24, 2021

Compile version for Android 4.1 deltachat/deltachat-android#1212

Closed

r10s mentioned this issue Sep 8, 2021

Detect active airplane mode deltachat/deltachat-android#1573

Closed

csb0730 mentioned this issue Oct 7, 2021

sometimes unacceptable battery usage deltachat/deltachat-android#1858

Closed

r10s added the discussion label Mar 8, 2022

r10s closed this as completed Mar 8, 2022
