New approach for IMAP Idle connection handling #2208

Closed
csb0730 opened this issue Feb 8, 2021 · 7 comments

csb0730 commented Feb 8, 2021

IMHO these findings could be important for further development, so I think it is best to post them here as an issue. I think it is worthwhile, and maybe necessary, to discuss them. So let me start:

1. Motivation

Under bad/flaky network conditions the following issues came up and led to a bad user experience:

  • High battery drain by Delta Chat 1.2.1
  • Receipt of incoming messages is sometimes not reliable
  • Messages that need to be sent are sometimes not delivered (red X; sending stops)

Testing environment: Android 4.1.2

2. Goals

  • Reduce battery drain as much as possible => do only the really necessary actions!

  • IMAP idle timeout length limited only by technical conditions, max 29 min.
    => This should be handled properly by core alone (when possible).

  • Do job handling only if the network is available; all other actions will fail anyway and are wasted effort!

  • Reliable operation under all conditions
    (For example: DC not opened, no manual intervention for a long time, device screen off)
    ** This is a must **

3. Background, Findings and Issues

While examining the approach to connection handling, job handling and the use of Android system functions, I found that no single issue is responsible for the unfavorable behaviour of DC; rather, several factors are responsible in combination!

In detail:

a) Periodic Work Request (PWR) (Android; interval 15 min)

With the default interval of 15 min, it is not possible to use the desired longer idle periods of up to 29 min.
After 15 min at the latest, all idles are interrupted.
Even worse, the Periodic Work Request is not synchronized with the start of the IMAP idle timeout, so
very often the result is a much shorter idle duration.
Trying a longer PWR interval (for example 30 min) showed that the Android system does not accept it;
only the 15 min interval works.

Maybe this is an Android 4 limitation, but this is what I observed.

In the end, for all tests with timeouts longer than 15 min, either a) the Periodic Work Request was disabled completely or b) the actions it triggered were skipped by core!

b) One long timeout for the IMAP idle connection lets core sleep unpredictably and doesn't reveal network errors while waiting.

For the IMAP idle connection a timeout of 23 min is set (23 * 60s). Core then waits
for an external interrupt or for the timeout to expire.
The problem is that this long connection timeout leads to unreliable core behaviour.
The timeout never expires reliably. Most of the time it takes much longer than 23 min to expire,
or it never expires at all! DC sleeps completely until the user wakes it up with a manual trigger!
As a result, the mail server closes the connection and core does not notice.
Thus, broken connections and/or external connection problems are not detected.
Often there were minutes or even hours in which DC didn't receive any messages.

c) Max idle connection timeout depends on the network type (wifi/mobile)

When dealing with really long timeout periods, I found that the maximum connection timeout depends not only on the mail server but also on the type of network!
At home (wifi) I observed a maximum of approx. 13 min;
on the mobile network I found stable operation up to the full 29 min that the mail server allows.

d) Network status delivered by the device is sometimes incorrect/unreliable

  • Network-down events are not provided for by the FFI interface, so every network event triggers the start of job actions. Under flaky network conditions this happens very often and unnecessarily.
  • Under flaky network conditions many events are fired. These rapid events are not ignored/suppressed by core in a reasonable way.
  • Sometimes no event is fired, even though the network is temporarily unavailable and then comes back.

e) Unnecessary job handling, retransmissions and retries, even and especially when the network is not available.

This is caused by d) and by the fact that an event does not tell core whether the network went up or down.
The interface between UI and core doesn't provide this up/down information (!)
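
To make the missing up/down information concrete, here is a minimal sketch of a connectivity flag that gates job handling and retry counting. It is not the code from my branch or from core; all names (`Connectivity`, `Scheduler`, `run_job`, `JobOutcome`) are hypothetical and only illustrate the idea.

```rust
/// Hypothetical connectivity state kept inside core (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Connectivity {
    Online,
    Offline,
}

/// Result of asking the scheduler to run a job.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobOutcome {
    /// Offline: the job was not attempted and its retry counter is untouched.
    Skipped,
    Done,
    /// Online but the attempt failed: the retry counter was incremented.
    Failed,
}

/// A pending job with a retry counter.
#[derive(Debug)]
pub struct Job {
    pub tries: u32,
}

pub struct Scheduler {
    state: Connectivity,
}

impl Scheduler {
    pub fn new() -> Self {
        // Assumption: start offline until an event or a connection says otherwise.
        Scheduler { state: Connectivity::Offline }
    }

    /// Called by the UI with an explicit up/down flag for each network event.
    pub fn on_network_event(&mut self, network_up: bool) {
        self.state = if network_up { Connectivity::Online } else { Connectivity::Offline };
    }

    /// Called by core itself after a connection attempt succeeded or failed.
    pub fn on_connection_result(&mut self, ok: bool) {
        self.state = if ok { Connectivity::Online } else { Connectivity::Offline };
    }

    /// Runs `attempt` only while online; retries are counted only for real attempts.
    pub fn run_job(&mut self, job: &mut Job, attempt: impl FnOnce() -> bool) -> JobOutcome {
        if self.state == Connectivity::Offline {
            return JobOutcome::Skipped;
        }
        if attempt() {
            JobOutcome::Done
        } else {
            job.tries += 1;
            JobOutcome::Failed
        }
    }
}
```

With such a flag a job that is "tried" in flight mode does not burn one of its retries, because it is never attempted in the first place.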

f) Interleaved parallel job handling (old core version)

Job handling is not locked properly. When many network events are fired within seconds, some (identical) actions are started again in a new thread while the first action is still in progress.
=> Maybe this has been solved in the meantime by a newer core version.
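
One way to avoid such overlapping runs is a simple "already running" flag. The following is only a sketch using the standard library; `JobGuard` and `run_exclusive` are hypothetical names, not existing core API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical guard that lets only one job run at a time.
pub struct JobGuard {
    running: AtomicBool,
}

impl JobGuard {
    pub const fn new() -> Self {
        JobGuard { running: AtomicBool::new(false) }
    }

    /// Runs `work` unless another run is still in progress.
    /// Returns `None` when the call was suppressed.
    pub fn run_exclusive<T>(&self, work: impl FnOnce() -> T) -> Option<T> {
        // Atomically switch the flag from false to true; fail if it is already true.
        if self
            .running
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            return None;
        }

        // Clear the flag when leaving this function, even if `work` panics.
        struct Reset<'a>(&'a AtomicBool);
        impl Drop for Reset<'_> {
            fn drop(&mut self) {
                self.0.store(false, Ordering::Release);
            }
        }
        let _reset = Reset(&self.running);

        Some(work())
    }
}
```

A second network event arriving while the first job is still running then simply gets `None` back instead of starting the same action again in a new thread.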

g) Permanent notification (Android) - necessary to keep DC reliable, even with Android 4 (!)

Regardless of which actions were chosen, sooner or later DC ended up not receiving any messages.
The ONLY way to keep DC working reliably is to introduce (force) the "Permanent background notification", even on Android 4.

4. Debugging

At the beginning of these examinations it was very hard to understand what core is really doing.
The existing logging was not sufficient to show all the necessary information.
=> logging has been extended and reworked (text messages, trigger points and format) to reveal the issues and their root causes in core.

5. Basics of "new approach"

  • Use Android's "Periodic Work Request" only to check whether core is working properly.
    Interrupt only when a timing problem is detected.
    => store the end time of the next IMAP idle timeout in a variable and check it in a shorter loop!

  • Handle the IMAP idle connection with many short loop timeouts (5s) instead of one long timeout of 23 min.

    • timeout controlled by a timer, not by the connection timeout!
    • every connection error now interrupts idle.
    • log every loop iteration that takes longer than 1 min.

    => This approach guarantees a maximum inoperative duration of 2 min for core!

  • Dynamic IMAP idle duration, controlled by connection failures, 11-23 min

    • start with 11 min; after each idle without failure, increment by 1 min (max duration 23 min).
    • every connection failure reduces the duration by 5 min.

  • Extend the FFI interface to pass the on/off status of a network event to core

  • New internal core connection status flag
    => controlled by device's network events AND connection behavior (error, success).

  • No job handling while offline (connection status flag).

  • Increment job retries only when the connection status flag shows "online".

  • Change the retry timer calculation to a predefined list of durations and reduce the number of retries.

  • No interleaved parallel job handling.
    Suppress quick repetition and overlapping job starts caused by fast network events.

  • Permanent background notification forced (this is a must, even on Android 4!)
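
The dynamic idle duration from the list above can be sketched roughly as follows. This is an illustration, not the code from the branch: the type and method names are hypothetical, and clamping at the 11 min lower bound after a failure is an assumption derived from the stated 11-23 min range.

```rust
use std::time::Duration;

const MIN_IDLE_MINUTES: u64 = 11;
const MAX_IDLE_MINUTES: u64 = 23;

/// Hypothetical helper tracking the idle duration to use for the next IMAP idle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IdleDuration {
    minutes: u64,
}

impl IdleDuration {
    /// Start with the shortest duration (11 min).
    pub fn new() -> Self {
        IdleDuration { minutes: MIN_IDLE_MINUTES }
    }

    /// An idle period ended without a connection failure: grow by 1 min, capped at 23 min.
    pub fn on_success(&mut self) {
        self.minutes = (self.minutes + 1).min(MAX_IDLE_MINUTES);
    }

    /// A connection failure occurred: shrink by 5 min.
    /// Assumption: never go below the 11 min starting value.
    pub fn on_failure(&mut self) {
        self.minutes = self.minutes.saturating_sub(5).max(MIN_IDLE_MINUTES);
    }

    pub fn minutes(&self) -> u64 {
        self.minutes
    }

    pub fn as_duration(&self) -> Duration {
        Duration::from_secs(self.minutes * 60)
    }
}
```

On a stable mobile network the duration creeps up to 23 min after twelve clean idle periods; on a network that cuts connections early (like my wifi) it keeps falling back toward 11 min.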

6. Experiences and Outlook with new approach

  • Experiences with flaky network conditions and overall operation:
    - Very stable message reception
    - Low network traffic
    - Very low battery drain (always!)
    - No job handling while being in "Flight Mode" or offline

  • No unsent messages any more.

  • There is good potential to optimize even the current DC connection handling
    (DC 1.14.5, core 1.50.0). I checked the current sources and found the basic approach unchanged compared to DC 1.2.1 (core 1.27).

This has become a long summary, but as I mentioned at the start: no single issue is responsible for the unfavorable behaviour of DC. I have now been running the "new approach" for some months with great success.
I would say it meets the goal :-)

Author

csb0730 commented Feb 8, 2021

@gerryfrancis
Contributor

Related: deltachat/deltachat-android#1573

Collaborator

link2xt commented Feb 19, 2021

Link to commits, easier to review than the sources:
https://github.com/csb0730/deltachat-core-rust/commits/lesser-battery-drain

Author

csb0730 commented Apr 10, 2021

@link2xt thanks for pointing to the commit history. This is much clearer 👍

Collaborator

link2xt commented May 9, 2021

b) One long timeout for the IMAP idle connection lets core sleep unpredictably and doesn't reveal network errors while waiting.

I have looked into the implementation of wait_with_timeout from async-imap. It uses async_std::future::timeout, which in turn uses Timer from the async-io crate. This crate uses std::time::Instant. We already had a problem with std::time::Instant not advancing while the app is sleeping on Android: #1706. This is why the IDLE timer may be effectively infinite if you are not using the app. I think we need to report this problem upstream, at least to https://github.com/smol-rs/async-io and then probably to the Rust standard library as explained in https://users.rust-lang.org/t/std-now-with-android/41774

Filed an issue: smol-rs/async-io#63
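
Until that is resolved upstream, one possible workaround is to track the deadline on the wall clock instead. The sketch below assumes that std::time::SystemTime keeps advancing while the device sleeps; `WallClockDeadline` is a hypothetical helper, not part of async-imap or async-io. Note that a wall clock can jump when the user or the network changes the time, so this is a trade-off rather than a full fix.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical deadline that is checked against the wall clock on every wakeup.
pub struct WallClockDeadline {
    deadline: SystemTime,
}

impl WallClockDeadline {
    /// Creates a deadline `timeout` after `now`.
    pub fn after(now: SystemTime, timeout: Duration) -> Self {
        WallClockDeadline { deadline: now + timeout }
    }

    /// True once the wall clock has reached the deadline.
    pub fn is_expired(&self, now: SystemTime) -> bool {
        now >= self.deadline
    }

    /// Time left until the deadline, or zero if it has already passed.
    pub fn remaining(&self, now: SystemTime) -> Duration {
        self.deadline
            .duration_since(now)
            .unwrap_or(Duration::ZERO)
    }
}
```

The caller would pass `SystemTime::now()` each time it wakes up, so a long sleep shows up as an expired deadline instead of a timer that never fires.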

Author

csb0730 commented Jun 7, 2021

I also found that the OS handles the app differently when a shorter connection timeout is used in a loop.

I can describe it like this:

  1. no loop, connection timeout set to the desired max timeout => timeout is unpredictable, very often endless
  2. loop, connection timeout set to 2 min => the OS throttles the loop calls very quickly (1st loop is ok, then the duration rises quickly to 2 ... 10 min and more!)
  3. loop, connection timeout set to 5s => the OS throttles the loop calls much more slowly (5s, 10s, 30s, ... max 2 min!) and every broken connection is signaled.

In the end the loop duration is still not very accurate, but it is reliable enough that dependable operation is possible. To reach the goal I used a max timeout of 23 min instead of the 29 min that is the mail server's limit.

It seems that when an app intends to run, rather than waiting for a semaphore or similar, the OS calls it more often and more reliably.

=> These observations were made with Android 4.1 but may be valid for other OSes too.
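
Variant 3 can be sketched roughly like this. It is a simplified illustration, not the code from my branch: `idle_in_short_steps`, `Poll` and `IdleOutcome` are hypothetical names, and the clock and the per-step wait are passed in as closures so the sketch does not depend on any IMAP library.

```rust
use std::time::{Duration, SystemTime};

/// What one short wait on the connection reported (hypothetical type).
pub enum Poll {
    Nothing,
    NewData,
    Error,
}

/// Why the idle loop ended (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum IdleOutcome {
    NewData,
    ConnectionError,
    TimedOut,
}

/// Waits up to `total`, but in short steps of at most 5 s each.
/// `now` returns the current wall-clock time; `poll_once` waits on the
/// connection for at most the given step and reports what happened.
pub fn idle_in_short_steps(
    total: Duration,
    mut now: impl FnMut() -> SystemTime,
    mut poll_once: impl FnMut(Duration) -> Poll,
) -> IdleOutcome {
    const STEP: Duration = Duration::from_secs(5);
    let deadline = now() + total;

    loop {
        // The overall timeout is decided by comparing clock readings,
        // not by the connection timeout itself.
        let current = now();
        if current >= deadline {
            return IdleOutcome::TimedOut;
        }
        let remaining = deadline.duration_since(current).unwrap_or(Duration::ZERO);

        match poll_once(STEP.min(remaining)) {
            Poll::NewData => return IdleOutcome::NewData,
            // Every connection error interrupts idle immediately.
            Poll::Error => return IdleOutcome::ConnectionError,
            Poll::Nothing => {}
        }
    }
}
```

Even if the OS stretches a single 5 s step to 2 min, the loop notices the expired deadline on the next iteration, so the overshoot stays within about one throttled step.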

Member

r10s commented Mar 8, 2022

imap connectivity has improved since then; meanwhile it is rather smtp that causes some problems.

leaving this as reference in resurrection, however. if needed, we can split off smaller, actionable tasks (cmp march2022 cleanup)

r10s closed this as completed Mar 8, 2022