Detect count of logical cores instead of physical #2071

Closed
mako2580 opened this issue Apr 5, 2020 · 12 comments
Labels: feature request, long-term, optimization


@mako2580

mako2580 commented Apr 5, 2020

Describe the bug
The -T0 parameter detects the number of physical cores, but I think it should detect logical cores instead, to take advantage of hyper-threading.
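
For illustration, here is a minimal C sketch of querying the logical-processor count (this assumes a POSIX/glibc system and is not zstd's actual detection code, which counts physical cores for -T0):

/* sketch: count logical (online) processors on a POSIX/glibc system */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long logical = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs currently online */
    if (logical < 1) logical = 1;                  /* conservative fallback on error */
    printf("logical processors: %ld\n", logical);
    return 0;
}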

To Reproduce
Steps to reproduce the behavior:

  1. Download data from http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia and convert zip to tar
  2. Run zstd with parameter -T0
  3. Rerun zstd with parameter -T#ofLogicalCores

Expected behavior
zstd should use all logical cores to speed up compression.

Screenshots and charts
As you can see from the data below, -T0 behaves the same as -T2, which is the number of physical cores, as expected from the source code. Only at the default compression level 3 is using the number of physical cores faster than the number of logical cores; at every other compression level the situation changes, and the higher the compression level, the more visible the speed-up.

Compression level 2:

$ time zstd -2 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar          : 32.88%   (211957760 => 69684920 bytes, silesia.tar.zst) 

real    0m0.761s
user    0m1.444s
sys     0m0.134s
$ time zstd -2 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar          : 32.88%   (211957760 => 69684920 bytes, silesia.tar.zst) 

real    0m0.754s
user    0m1.480s
sys     0m0.113s
$ time zstd -2 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar          : 32.88%   (211957760 => 69684920 bytes, silesia.tar.zst) 

real    0m0.732s
user    0m2.393s
sys     0m0.148s

Compression level 3:

$ time zstd -3 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar          : 31.47%   (211957760 => 66713509 bytes, silesia.tar.zst) 

real    0m1.273s
user    0m2.437s
sys     0m0.110s
$ time zstd -3 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar          : 31.47%   (211957760 => 66713509 bytes, silesia.tar.zst) 

real    0m1.282s
user    0m2.423s
sys     0m0.159s
$ time zstd -3 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar          : 31.47%   (211957760 => 66713509 bytes, silesia.tar.zst) 

real    0m1.389s
user    0m4.698s
sys     0m0.156s

Compression level 4:

$ time zstd -4 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar          : 30.94%   (211957760 => 65572880 bytes, silesia.tar.zst) 

real    0m1.882s
user    0m3.659s
sys     0m0.139s
$ time zstd -4 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar          : 30.94%   (211957760 => 65572880 bytes, silesia.tar.zst) 

real    0m1.906s
user    0m3.638s
sys     0m0.168s
$ time zstd -4 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar          : 30.94%   (211957760 => 65572880 bytes, silesia.tar.zst) 

real    0m1.810s
user    0m6.132s
sys     0m0.162s

Compression level 10:

$ time zstd -10 -f -T0 silesia.tar -o silesia.tar.zst
silesia.tar          : 28.11%   (211957760 => 59582278 bytes, silesia.tar.zst) 

real    0m10.516s
user    0m19.922s
sys     0m0.179s
$ time zstd -10 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar          : 28.11%   (211957760 => 59582278 bytes, silesia.tar.zst) 

real    0m10.571s
user    0m20.508s
sys     0m0.186s
$ time zstd -10 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar          : 28.11%   (211957760 => 59582278 bytes, silesia.tar.zst) 

real    0m8.313s
user    0m28.980s
sys     0m0.205s

Desktop:

  • OS: Arch Linux
  • Version: rolling release
  • Compiler: gcc
  • Flags: not sure - used the default package from the distribution; not relevant to the bug
  • Other relevant hardware specs: Intel i5 4200U - 2 physical cores and 4 logical cores
  • Build system: Makefile
@bimbashrestha
Contributor

Hi @mako2580,

Thanks for posting the issue.

I managed to reproduce your results on my machine

lscpu
...
Thread(s) per core:  2
Core(s) per socket:  14
Socket(s):           2
...

perf stat -r 10 zstd -10 -T0 silesia.tar -o /dev/null
1.79603 +- 0.00911 seconds time elapsed  ( +-  0.51% )

perf stat -r 10 zstd -10 -T28 silesia.tar -o /dev/null
1.7640 +- 0.0132 seconds time elapsed  ( +-  0.75% )
# as expected, the same results as -T0

perf stat -r 10 zstd -10 -T56 silesia.tar -o /dev/null
1.6867 +- 0.0220 seconds time elapsed  ( +-  1.31% )
# a little better than when using physical cores

Unfortunately, the same improvement isn't there for higher levels that use more CPU.

perf stat -r 5 zstd -19 -T0 silesia.tar -o /dev/null
22.875 +- 0.592 seconds time elapsed  ( +-  2.59% )

perf stat -r 5 zstd -19 -T56 silesia.tar -o /dev/null
22.463 +- 0.577 seconds time elapsed  ( +-  2.57% )
# effectively the same

I'm not an expert on this topic, but my understanding is that hyper-threading only helps when the running program doesn't fully saturate your cores (i.e. when it's bound by something other than the CPU). This should be the case when compressing large files (which should be I/O bound). So the speed improvement at level 10 but not at level 19 isn't terribly surprising to me.

I wasn't around when the decision to use the number of physical cores was made, so maybe @Cyan4973, @terrelln or @felixhandte can provide some context here? Trying different options (not just -T) for specific host machines/data is something zstd users should do to extract the most utility out of the library. As for selecting a good default, I suppose if we can show that -T(#logical_cores) is never worse than -T(#physical_cores) but is sometimes quite a bit better, that would be a compelling enough reason to switch?

@mako2580
Author

mako2580 commented Apr 6, 2020

I retested at compression level 19, and I'm getting the expected results: -T#logical_cores is about 15% faster than -T#physical_cores. What type of storage are you using for the test? I was testing on a ramdisk, to avoid being I/O bound as much as possible; that might be the reason why you are getting almost the same results even at a high compression level.

$ time zstd -19 -f -T2 silesia.tar -o silesia.tar.zst
silesia.tar          : 25.12%   (211957760 => 53251489 bytes, silesia.tar.zst) 

real    1m13.561s
user    2m23.948s
sys     0m0.266s
$ time zstd -19 -f -T4 silesia.tar -o silesia.tar.zst
silesia.tar          : 25.12%   (211957760 => 53251489 bytes, silesia.tar.zst) 

real    1m3.259s
user    3m17.664s
sys     0m0.517s

@Cyan4973
Contributor

Cyan4973 commented Apr 6, 2020

This is a difficult topic.

Basically, the idea that one can use hyper-threading to improve performance has some truth, but it depends on multiple factors, and it's less true for zstd than for most applications.

The main idea is that modern out-of-order (OoO) cores have multiple execution units in each core (and multiple cores per CPU; don't confuse those 2 levels). These execution units are able to issue instructions in parallel, as long as the instructions are unrelated. At its peak, an Intel CPU can issue up to 4 micro-instructions per cycle. Of course, no one reaches this theoretical limit, and any code featuring > 3 micro-instructions per cycle is considered very well optimized.

In practice, most code is not that optimized, and it's common for many applications to linger in ~0.5 instructions-per-clock territory. That's because the primary performance limitation nowadays is the memory wall. Fetching some random data from memory costs hundreds of cycles, and the core just stalls, starving for data, before it can proceed with further calculation.

Modern CPUs can re-order instructions to take advantage of memory-fetch stalls, but there is a limit, both to the CPU's and the program's capabilities. At some point, the program cannot proceed forward, because there is absolutely nothing else to do except wait for the one critical piece of data that every future action depends on.
In that case, hyper-threading is helpful, because the core processes 2 threads in parallel, so when one thread is stalled, the other one can make use of the idle execution units. This improves the occupancy rate.

The cost of this capability is not zero: higher processing requirements for the CPU, on top of sharing essential core resources such as the L1 cache, the register file, or the micro-op cache. So it had better be useful.

Now let's talk about zstd.
Remember the topic of the number of micro-instructions per cycle? zstd is highly optimized. Critical components of zstd, such as FSE, Huff0, or xxHash, are able to extract more than 3 micro-instructions per cycle. They do so by organizing data dependencies so that multiple instructions can be processed in parallel without depending on each other. This happens within a single thread. That means there is basically no room left for a second thread to tap into the "idle" execution units.
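
For illustration only (this is not actual FSE/Huff0 code), here is a minimal C sketch of the general technique: splitting work into independent streams so an out-of-order core can execute several instructions per cycle within a single thread.

#include <stddef.h>
#include <stdint.h>

/* sketch: summing a buffer with 4 independent accumulators; the 4 additions
 * in each iteration have no dependency on one another, so the out-of-order
 * engine can issue them in parallel, raising instructions per cycle */
uint64_t sum_4streams(const uint32_t* data, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; i++) s0 += data[i];   /* tail */
    return s0 + s1 + s2 + s3;
}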

That being said, this is actually an "ideal" situation. In practice, one can't completely avoid fetching data from outside the cache as the compression level becomes higher, because zstd uses larger tables and has more history to look into. This necessarily brings it out of cache, and therefore stalls waiting for memory do happen.

So, as the average number of micro-instructions per cycle becomes lower, there is more room for other threads to tap into unused resources. At higher compression levels, then, I'm not surprised to see that hyper-threading is actually beneficial.

Now, that would be one reason to use logical cores. But there are other considerations to factor in.

Primarily, the "system" within which zstd works also has other tasks to perform, even if only to keep the "system" alive (bookkeeping, logging, interrupts, etc.). Therefore, a "default" setting that brings the whole system to a halt looks dangerous.
Also, more threads means more memory. At level 19, the cost is approximately ~100 MB per thread, so doubling the number of threads doubles the memory budget. While some systems may be able to cope with that, others may feel that it's just too much and trigger an OOM failure, or worse, the OOM killer, with random impact on the system.
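For a rough illustration using the figures above (an estimate, not a measurement): on the 28-core / 56-thread machine used in the earlier benchmarks, -T28 at level 19 would need on the order of 28 × ~100 MB ≈ 2.8 GB, while -T56 would need roughly twice that, ≈ 5.6 GB.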

Therefore, due to these additional constraints, it felt "safer", or more conservative, to select a default which preserves system resources for other tasks, especially system-critical ones.

Now, it's not that using more threads is forbidden: anyone who wants more threads can request them explicitly. It's more about "what's a good default", and I'm sure this can be debated at length.

In the future, if this turns out to be an important topic, we can probably pour more resources into it, to provide a more complete and satisfactory answer, with maybe more performance and more control. But there is only so much resource we can spend at any moment in time, and for now we have settled on a "good enough" middle ground by selecting this "number of physical cores" policy. Maybe something that can still be improved upon in the future.

@bimbashrestha added the long-term and optimization labels Apr 6, 2020
@mako2580
Author

mako2580 commented Apr 6, 2020

Thanks for the detailed answer. You agree that using the number of logical cores can bring a performance improvement. Not a big one, but it's something. The only problem you see is that there would be no room left for the OS (background threads and free memory). That's a good reason, but I think it is not as critical as it might seem.

First one - free memory: you will still run out of memory if you have a CPU with many physical cores and not enough RAM. You are not solving the problem of RAM consumption, you are just moving the line further. And if you do run out of memory, then the next time you run zstd you just specify -T#number_of_threads_you_have_ram_for; but if you have enough RAM, you get maximum performance even with the default.

And the second one - OS background tasks: I think that might only happen if you set a high process priority, and I think that's not the default case. Even if I wanted to be highly paranoid, I would rather set some constant that reserves unused threads for background tasks - for example, use (#logical_threads - 1) for -T0. Because on systems with a higher number of cores, leaving half the cores unused is too much for background tasks, IMHO.

@magiblot

> Primarily, the "system" within which zstd works also has other tasks to perform, even if only to keep the "system" alive (bookkeeping, logging, interrupts, etc.). Therefore, a "default" setting that brings the whole system to a halt looks dangerous.

But that's the job of the CPU scheduler. My Linux system does not become unresponsive or sluggish when zstd uses all logical processors; instead, the overall CPU consumption of zstd is lowered when other programs also need CPU time.

Obviously, I don't mean to say that all systems will handle 100% CPU consumption gracefully. What I mean is that if the user knows it works well for them, there's nothing wrong with allowing them to enable this behaviour systematically.

In my case, I was trying to tweak makepkg's compression command. Since /etc/makepkg.conf works like a bash script, I can set COMPRESSZST=(zstd -c -T$(nproc) -18 -) and avoid hardcoding the number of logical cores of my current system. But not all use cases can automate this so easily.

What you say makes a lot of sense, and I do not believe the default behaviour of -T0 should be changed. So I'd suggest adding a separate option (or a new suffix for -T) to take the count of logical processors instead of physical cores.

@mako2580
Author

I think that adding a separate option is acceptable, but in that case it should be stated more explicitly in --help which option uses logical cores and which uses physical ones. Nevertheless, I think the default behavior users want is to use logical cores to maximize performance. And a lot of software out there has already hardcoded -T0, so changing the behavior of -T0 is the better idea, since it would improve the performance of third-party software as well. Just my opinion.

@Cyan4973
Contributor

Cyan4973 commented Jun 15, 2020

I'm not comfortable with changing the default for the time being.
When in doubt, I prefer to lean towards the more conservative option (i.e. lower CPU usage).

Note that anyone can take advantage of hyper-threading, but it must be requested explicitly; it's just not the "default".

@magiblot

Yann, I support your decision, but can we have an argument that explicitly requests the count of logical cores, without the caller having to know the exact number of cores?

As I said before, in Bash I can currently do zstd -c -T$(nproc), but I'm sure doing this in e.g. cmd.exe is not as straightforward.

@Cyan4973
Contributor

Which command name would make sense?
Since it can't be -T#, it would have to use a different convention.

@magiblot

Can't -T# be used with a letter instead of a number? For example: -TL or -Tl.

@Cyan4973
Contributor

Not with the current code. For the # part, it expects an ASCII number.

@magiblot

I guess there are two ways to do this:

  • --logical-cores: when combined with -T0, the number of logical CPU cores is used instead of physical CPU cores.
  • --auto-threads={physical,logical} (default: physical): determines whether the number of working threads with -T0 is the number of physical CPU cores or logical CPU cores. (This interface is more consistent and easier to extend.)
