
Adaptive compression often wrongly locks at high compression levels #2200

Open
mgorny opened this issue Jun 12, 2020 · 5 comments


mgorny commented Jun 12, 2020

Describe the bug
In my experience, adaptive compression sometimes increases the compression level while passing through low-entropy data and afterwards fails to decrease it, even though the CPU becomes the bottleneck. I aborted the compression after it had run for a minute or two, so I don't know whether it would eventually have managed to reduce the level.

To Reproduce
Most recently, I reproduced it while making a backup to a slower HDD.

Steps to reproduce the behavior:

  1. tar -c / | zstd -v --adapt -T12 > /mnt/backup/foo.tar.zstd

Expected behavior
I expected the compression level to be adapted to achieve the best throughput, with CPU utilization close to 100%.

Screenshots and charts
[plot: input rate vs. time, constant level (p2.txt) vs. adaptive compression (p2-adapt.txt)]

This is a very rough benchmark of input rate vs. time. p2.txt is a constant compression level that keeps CPU utilization low, so it's entirely I/O-bound; p2-adapt.txt is adaptive compression. There's a visible peak due to I/O caching, but it's still quite clear that adaptive compression is doing very badly here.

Desktop (please complete the following information):

  • OS: Gentoo Linux
  • Version: (not versioned)
  • Compiler: gcc 9.3.0
  • Flags: -march=znver2 --param l1-cache-size=32 --param l1-cache-line-size=64 -O2 -pipe
  • Other relevant hardware specs: Ryzen 5 3600 (6 cores, 12 threads)
  • Build system: Makefile

Additional context
Maybe it'd be better to base the adaptation on perceived CPU utilization rather than on I/O conditions? I think that would yield more stable results and avoid the instability caused by buffering. I don't know what the other use cases for adaptive compression are, but in my case the primary goal would be realtime compression, and only secondarily the strongest compression possible while at it.


mgorny commented Jun 12, 2020

I've made a quick proof-of-concept that uses Δclock_gettime(CLOCK_PROCESS_CPUTIME_ID) / Δclock_gettime(CLOCK_MONOTONIC) / nthreads to measure CPU use. For example, with 12 threads, 0.6 s of process CPU time over a 0.1 s wall-clock interval gives 0.6 / 0.1 / 12 = 0.5, i.e. 50% CPU use. This will probably work only on Linux. I've also had to add some flattening to keep the compression level from jumping around like crazy ;-).

[plot: input rate vs. time, CPU-time-based PoC vs. current --adapt vs. static level]

The poor man's plot shows that it's performing better than the current algorithm, but still worse than a static level. I suppose this could be improved further by someone who knows what they're doing.

The proof-of-concept diff:

```diff
diff --git a/programs/fileio.c b/programs/fileio.c
index e0f7fde..aef44b5 100644
--- a/programs/fileio.c
+++ b/programs/fileio.c
@@ -20,6 +20,8 @@
 #  define _POSIX_SOURCE 1          /* disable %llu warnings with MinGW on Windows */
 #endif
 
+#define EXPERIMENTAL_ADAPTIVE_CPUTIME 1
+
 /*-*************************************
 *  Includes
 ***************************************/
@@ -33,6 +35,9 @@
 #include <limits.h>     /* INT_MAX */
 #include <signal.h>
 #include "timefn.h"     /* UTIL_getTime, UTIL_clockSpanMicro */
+#if EXPERIMENTAL_ADAPTIVE_CPUTIME
+#include <time.h>
+#endif
 
 #if defined (_MSC_VER)
 #  include <sys/stat.h>
@@ -1172,6 +1177,15 @@ FIO_compressLz4Frame(cRess_t* ress,
 #endif
 
 
+#if EXPERIMENTAL_ADAPTIVE_CPUTIME
+double timespec_subtract(struct timespec t1, struct timespec t2)
+{
+    double nsec = t1.tv_nsec - t2.tv_nsec;
+    return t1.tv_sec - t2.tv_sec + nsec/1000000000.;
+}
+#endif
+
+
 static unsigned long long
 FIO_compressZstdFrame(FIO_prefs_t* const prefs,
                       const cRess_t* ressPtr,
@@ -1193,6 +1207,11 @@ FIO_compressZstdFrame(FIO_prefs_t* const prefs,
     unsigned inputPresented = 0;
     unsigned inputBlocked = 0;
     unsigned lastJobID = 0;
+#if EXPERIMENTAL_ADAPTIVE_CPUTIME
+    struct timespec prevWallTime = {0, 0}, prevCpuTime = {0, 0};  /* zero-init to avoid reading indeterminate values on the first sample */
+    speedChange_e prevSpeedChange = noChange;
+    int speedChangeCount = 0;
+#endif
 
     DISPLAYLEVEL(6, "compression using zstd format \n");
 
@@ -1274,6 +1293,38 @@ FIO_compressZstdFrame(FIO_prefs_t* const prefs,
                         assert(zfp.produced >= previous_zfp_update.produced);
                         assert(prefs->nbWorkers >= 1);
 
+#if EXPERIMENTAL_ADAPTIVE_CPUTIME
+                        struct timespec wallTime, cpuTime;
+                        double cpuUse;
+
+                        clock_gettime(CLOCK_MONOTONIC, &wallTime);
+                        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpuTime);
+                        cpuUse =
+                            timespec_subtract(cpuTime, prevCpuTime) /
+                            timespec_subtract(wallTime, prevWallTime) /
+                            prefs->nbWorkers;
+
+                        if (cpuUse > 0.95)
+                            speedChange = faster;
+                        else if (cpuUse < 0.8)
+                            speedChange = slower;
+
+                        /* avoid very rapid speed changes */
+                        if (speedChange == prevSpeedChange) {
+                            if (++speedChangeCount < 10)
+                                speedChange = noChange;
+                            else
+                                speedChangeCount = 0;
+                        } else {
+                            prevSpeedChange = speedChange;
+                            speedChange = noChange;
+                            speedChangeCount = 0;
+                        }
+
+                        prevWallTime = wallTime;
+                        prevCpuTime = cpuTime;
+#else
+
                         /* test if compression is blocked
                          * either because output is slow and all buffers are full
                          * or because input is slow and no job can start while waiting for at least one buffer to be filled.
@@ -1328,6 +1379,7 @@ FIO_compressZstdFrame(FIO_prefs_t* const prefs,
                             inputBlocked = 0;
                             inputPresented = 0;
                         }
+#endif
 
                         if (speedChange == slower) {
                            DISPLAYLEVEL(6, "slower speed , higher compression \n")
```


mgorny commented Jun 12, 2020

In case it isn't clear, the main question is whether you'd be interested in doing this yourselves, or in having me work on it. If the latter, I could use some tips on how to do it. In particular, it'd be platform-specific, so one question is whether it should be implemented as a new option available only on some platforms.

The clock_gettime method was convenient for the PoC, but I think the final implementation would have to grab CPU usage from /proc in order to account for other processes using the CPU. The current code would keep reading low CPU use whenever other processes were using the CPU, and would therefore keep increasing the compression level wrongly. It also doesn't account for -T being set higher than the CPU thread count.
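
A minimal sketch of what that /proc-based sampling could look like (Linux-only; proc(5) documents the field layout, and the helper names are made up for illustration, not existing zstd code):

```c
/* Hypothetical sketch: sample system-wide CPU use from /proc/stat (Linux-only).
 * Fields of the "cpu" line, per proc(5):
 * user nice system idle iowait irq softirq steal ... */
#include <stdio.h>

typedef struct { unsigned long long busy, total; } CpuSample;

static int sampleCpu(CpuSample* out)
{
    unsigned long long v[8];
    FILE* const f = fopen("/proc/stat", "r");
    if (f == NULL) return -1;
    int const n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                         &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    fclose(f);
    if (n != 8) return -1;
    out->total = v[0]+v[1]+v[2]+v[3]+v[4]+v[5]+v[6]+v[7];
    out->busy  = out->total - v[3] - v[4];   /* everything except idle + iowait */
    return 0;
}

/* System-wide utilization in [0,1] over the interval between two samples. */
static double cpuUseBetween(CpuSample prev, CpuSample cur)
{
    unsigned long long const dTotal = cur.total - prev.total;
    return dTotal ? (double)(cur.busy - prev.busy) / (double)dTotal : 0.0;
}
```

Since this measures the whole machine rather than just the zstd process, the adaptation would stop raising the level as soon as anything saturates the CPUs, which is exactly the failure mode described above.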

Cyan4973 (Contributor) commented

Thanks @mgorny,

The issue described is correct: --adapt has trouble coming back to higher speeds after plunging towards high compression levels. That's primarily because each job becomes both larger and slower, which seriously reduces the granularity at which decisions are taken (the compression level can only be changed between jobs).

The logic has even more difficulty keeping up when using a lot of threads, because a lot of work is already queued in the worker threads, and by the time it realizes it has targeted too slow a speed, all of that queued work still has to drain at the slow level, so there is a huge latency to recover from.
And this is made even more difficult by subtler I/O limitations, where input and output are partially synchronized simply by virtue of being handled in the same I/O thread.

Anyway, I totally agree that the current logic is imperfect and could be improved.

Thanks for your great measurements. They're helpful, and they also show that your proposal can offer sensible improvements for this use case. The patch is also nicely done, with the logic concentrated in a few places.

Using time instead of buffer fill rate seems like a good idea, but it also introduces portability issues.
To reduce portability concerns, zstd has a module named timefn, which tries to offer a common interface over various timer implementations (C90, C11, POSIX, Windows, etc.). Unfortunately, this module isn't able to distinguish CLOCK_MONOTONIC from CLOCK_PROCESS_CPUTIME_ID, and it's unclear whether such a capability is portable.
Also, one of the long-term objectives is to bring this capability inside libzstd, where a dependency on time functions is not allowed. That's a long-term objective though; in the meantime it's fine for the CLI to accept dependencies on time, provided they're portable. Adding a further dependency on /proc, though, is more dangerous, as it makes the mechanism even more platform-limited.
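
Not timefn's actual API, but a sketch of the kind of per-platform fallback such a capability would require (getProcessCpuTimeMicros is a hypothetical name; unsupported platforms report failure so the caller can keep the existing buffer-based logic):

```c
/* Illustrative sketch: query process CPU time in microseconds,
 * returning -1 where the platform offers no suitable clock. */
#if defined(_WIN32)
#  include <windows.h>
#elif defined(__unix__) || defined(__APPLE__)
#  include <time.h>
#endif

static long long getProcessCpuTimeMicros(void)
{
#if defined(_WIN32)
    FILETIME creation, exit_, kernel, user;
    ULARGE_INTEGER k, u;
    if (!GetProcessTimes(GetCurrentProcess(), &creation, &exit_, &kernel, &user))
        return -1;
    /* FILETIME counts 100 ns ticks */
    k.LowPart = kernel.dwLowDateTime; k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;   u.HighPart = user.dwHighDateTime;
    return (long long)((k.QuadPart + u.QuadPart) / 10);
#elif defined(CLOCK_PROCESS_CPUTIME_ID)
    struct timespec ts;
    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
#else
    return -1;  /* unknown platform: caller falls back to the existing logic */
#endif
}
```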

I suspect your suggested approach is to make your solution platform-specific, falling back to the existing logic on unsupported platforms. That's both interesting, since it makes it possible to directly compare the two logics and measure the differences, and a bit scary from a maintenance perspective.
I presume one key point is to ensure that the new logic is well centralized, so that the relevant lines of code are grouped together, making it easy to enable or disable and easier to follow.

To be more complete, my own preference would be to improve the buffer fill-rate methodology so that it handles these scenarios better, given its inherent portability advantages. But since my availability is very limited, I don't plan to work on it soon, which makes this preference somewhat moot.
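
For context, a deliberately simplified caricature of that buffer fill-rate style of decision; the actual conditions in fileio.c are more involved, and the inputs here are made up for illustration:

```c
/* Caricature of the fill-rate approach: infer the bottleneck from queue
 * pressure instead of clocks. Not the real fileio.c conditions. */
typedef enum { noChange = 0, slower, faster } speedChange_e;

static speedChange_e fillRateDecision(int inputIsStarved,     /* source reads keep coming up short */
                                      int outputBuffersFull,  /* sink cannot drain compressed data */
                                      int workersAlwaysBusy)  /* no free worker when data arrives */
{
    if (inputIsStarved || outputBuffersFull)
        return slower;    /* I/O is the bottleneck: spend spare cycles on a higher level */
    if (workersAlwaysBusy)
        return faster;    /* compression is the bottleneck: drop to a cheaper level */
    return noChange;      /* balanced: keep the current level */
}
```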

Cyan4973 self-assigned this Jun 15, 2020
Cyan4973 added the long-term label (valid topics that are expected to take a long time to make progress) Jun 15, 2020

stati64 commented Jan 23, 2021

Adaptive compression is a way of investing more compute cycles where profitable. The real problem is detecting idle (i.e. available for this use) compute cycles at fine resolution. With too much CPU, the CPU is on the critical path; with too little, I/O etc. is on the critical path.

Perhaps it's only stable when modulated by something like an idle-time goal. Because there are large differences in cost between Pareto-optimal compression levels, it needs to pulse-width modulate between adjacent levels, so as not to leave resources idle by oscillating between whole levels.
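
A minimal sketch of that pulse-width-modulation idea, assuming a per-job level choice and a measured idle fraction per window (all names here are hypothetical, not zstd code):

```c
/* PWM between two adjacent levels: run a fraction `duty` of jobs one level
 * higher instead of jumping whole levels, and steer `duty` toward an
 * idle-time goal. */
typedef struct {
    int    baseLevel;  /* cheaper of the two adjacent levels */
    double duty;       /* fraction of jobs run at baseLevel+1, in [0,1) */
    double acc;        /* accumulator for spreading the duty cycle evenly */
} PwmAdapt;

/* Level for the next job: Bresenham-style spreading of `duty`. */
static int pwmNextLevel(PwmAdapt* s)
{
    s->acc += s->duty;
    if (s->acc >= 1.0) { s->acc -= 1.0; return s->baseLevel + 1; }
    return s->baseLevel;
}

/* After each measurement window, nudge duty toward the idle-time goal;
 * carry/borrow into the base level when duty leaves [0,1). */
static void pwmUpdate(PwmAdapt* s, double measuredIdle, double targetIdle)
{
    s->duty += 0.1 * (measuredIdle - targetIdle);   /* small proportional step */
    if (s->duty >= 1.0) { s->duty -= 1.0; s->baseLevel += 1; }  /* compress harder */
    if (s->duty <  0.0) { s->duty += 1.0; s->baseLevel -= 1; }  /* compress lighter */
}
```

Because `duty` moves in small steps, throughput shifts smoothly between the two adjacent Pareto points instead of oscillating across a whole level.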

YellowOnion commented
Are there any known workarounds? Perhaps limiting the bounds to levels whose throughput is roughly 0.5x and 1.5x the link bandwidth would prevent wild swings, for example --adapt=min=12,max=16 for a 20 MB/s internet connection.
