This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

Performance improvement unexpectedly small #239

Closed
akleber opened this issue Oct 31, 2016 · 25 comments

@akleber
Contributor

akleber commented Oct 31, 2016

After resolving my last issues I got back to doing some performance tests on my main project, and I see a surprisingly small performance improvement. I am building with ninja on a 24-core machine with an SSD. A build with an empty cache takes about 48 min; the rebuild with a full cache takes about 30 min. Here are the cache statistics for the cold and warm builds.

clcache statistics:
  current cache dir         : D:\j2/clcache
  cache size                : 4,354,861,756 bytes
  maximum cache size        : 10,073,741,824 bytes
  cache entries             : 6476
  cache hits                : 6394
  cache misses
    total                      : 6476
    evicted                    : 0
    header changed             : 40
    source changed             : 6436
  passed to real compiler
    called w/ invalid argument : 0
    called for preprocessing   : 762
    called for linking         : 8
    called for external debug  : 22
    called w/o source          : 0
    called w/ multiple sources : 0
    called w/ PCH              : 0

I profiled both runs: prof4-cold.txt and prof4-warm.txt
If I read the 'warm' one correctly, most of the time is spent computing file hashes, over 2 million times. Are these all the headers of all files? If that is correct, what I do not understand is why the 'warm' case has slightly more calls to getFileHash() and why it takes more than double the time.
Any ideas or input on how I might improve clcache's performance gains?

FYI: My still unfinished lockfile mode does not change these numbers significantly. Also using xcopy instead of shutil.copyfile() or using readinto() instead of read() in getFileHash() does not change the situation.

@frerich
Owner

frerich commented Nov 1, 2016

Thanks for your thoughtful post! I agree with your initial analysis of the 'warm' profile: most of the time is spent in computing the hashes of about 2.1 million include files. The numbers you got match those you mentioned in an earlier comment of yours -- after working around what seems to be an inefficient usage of threads in the subprocess module, a large proportion of the time is spent in the getFileHash function.

Scrolling down in the profile report, one can see that almost all of the calls to getFileHash are triggered by a list comprehension in clcache.py:242 -- that is in the definition of the ManifestRepository.getIncludesContentHashForFiles method which is given a list of paths and then returns a list of file hashes. This method is called once for each entry in the computed manifest to see whether all include files used by a cache entry still have the last seen hash sum, i.e. to check whether we have a cache hit.

The difference in the number of calls to getFileHash between the warm and the cold profile is also visible in the 'callers' list further down in the profile report. This is the distribution of the calls per caller in the cold report:

clcache.py:747(getFileHash)                             <-    6435    0.555   11.960  clcache.py:224(getManifestHash)
                                                           2080352  145.234 13788.261  clcache.py:1388(<dictcomp>)

...and this is in the warm report:

clcache.py:747(getFileHash)                             <-    6435    0.627   18.588  clcache.py:224(getManifestHash)
                                                           2077999  149.706 29645.308  clcache.py:242(<listcomp>)
                                                             36998    2.241  317.594  clcache.py:1388(<dictcomp>)

That dictionary comprehension in clcache.py:1388 has since been removed; it was in createManifestEntry, a function which is relevant when handling cache misses. I.e. even in the 'warm' run there are still cache misses -- and I think the list of include paths which get hashed is not cleansed of duplicates.

As for how to improve the situation, clearly there are two approaches:

  1. Reduce the number of calls to getFileHash. Seeing that ManifestRepository.getIncludesContentHashForFiles is called in a loop (once for each entry in a manifest), there is clearly some chance that multiple entries reference the same includes, or that a single entry (unexpectedly) references the same include twice. An easy way to improve this might be to use the functools.lru_cache decorator on getFileHash (see the sketch after this list). I just opened Introduce memoization to getFileHash #240 to test this theory -- it would be much appreciated if you could give that PR a try!

    A more aggressive approach which would even avoid re-hashing files across multiple clcache invocations might be to launch a little server script which speaks e.g. JSON over HTTP or something convenient and which allows getting the hash sum of a file. The server could monitor the file system and mark cached hash sums as outdated in case the file on disk was modified.

  2. Make getFileHash faster. IIRC, I noticed that exchanging the hash algorithm helped; in particular, sha1 was quite a bit faster (30%) than MD5. That aside, it might also be possible to optimize the file reading a bit, but I don't know how -- reading the whole file contents into memory and then feeding that to the hashing module seems inefficient compared to reading the file in small blocks and feeding those to the hash object, but maybe it's not so bad after all.
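
For illustration, this is roughly what the memoization idea from approach 1 amounts to -- a minimal sketch, not the actual change in #240, and the getFileHash body below is a simplified stand-in for the real function in clcache.py:

    import hashlib
    from functools import lru_cache

    @lru_cache(maxsize=4096)  # cache size is a guess; keys are file paths
    def getFileHash(filePath):
        # Stand-in for clcache's getFileHash: hash the whole file with MD5.
        hasher = hashlib.md5()
        with open(filePath, 'rb') as inFile:
            hasher.update(inFile.read())
        return hasher.hexdigest()

Note that the memoization only lives as long as a single clcache process, so it can only help when one invocation hashes the same include more than once.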

@akleber
Contributor Author

akleber commented Nov 1, 2016

Thanks for your input. I tried #240 but unfortunately it does not improve my timings, which is a pity as I like the idea of the lru_cache decorator very much. I added some lru_cache usage statistics output:
cache_info_warm.txt. This statistic is printed after each clcache run, so there are about 6800 outputs (one per invocation). It is from a warm cache run, and one can see that only very few clcache runs have any lru_cache hits at all. In the cold case there is not a single hit.
You were also right that I was using 3.3.0 before; I had missed the 3.3.1 release. This run was done with the PR, which is based on 3.3.1. As the hashes changed, I did a cold & warm run; they had the same timings as before.

@frerich
Owner

frerich commented Nov 1, 2016

Thanks for trying #240 - I suppose that if it doesn't help, it means that on every individual clcache run, (almost) no include was hashed twice -- which is good. However, seeing how over two million files get hashed when compiling about eight thousand files, I suspect that many source files simply end up using the same includes (e.g. windows.h).

I'm currently experimenting with different approaches to solve this; one thing which @vchigrin implemented about five years ago was a 'daemon' mode in which clcache (among other things) talks to servers via named pipes in order to cache the include hashes - it seems he did this with great success. The code is still available at https://github.com/vchigrin/clcache/tree/cacodaemons it seems.

I'm currently studying @vchigrin's implementation as well as other local-host IPC mechanisms (TCP sockets appear to be way too slow for our purposes, unless I badly screwed up my usage of the socket API) to see whether there's anything nice.

@inorton
Contributor

inorton commented Nov 1, 2016

I wonder if pypy would help hashing performance?

When I started cclash I briefly trialed Python (which I love and use constantly for other stuff), but back then C# was something like 1000x faster than Python at hashing files with md5 or sha1 (I even tried md4 in the quest for more speed).

My only real solution was to memoize the hash results and use a file monitor to re-hash on file changes, by adopting a client/server model where the server was long-lived and could therefore avoid hashing the same Windows headers or libs again and again.

@frerich
Owner

frerich commented Nov 2, 2016

@inorton pypy is an interesting idea, I'll give that a try - just for fun.

I looked closer at how the getFileHash function spends its time and had it hash 214 header files (those which are indirectly pulled in by just including windows.h...) a hundred times. This is the profiling output:

    21400    0.451    0.000    3.374    0.000 clcache.py:760(getFileHash)
    21400    0.010    0.000    0.010    0.000 {built-in method _hashlib.openssl_md5}
        1    0.000    0.000    3.461    3.461 {built-in method builtins.exec}
    21400    0.733    0.000    0.733    0.000 {built-in method io.open}
    21400    0.007    0.000    0.007    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    21400    0.030    0.000    0.030    0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
    21400    0.677    0.000    0.677    0.000 {method 'read' of '_io.BufferedReader' objects}
    21400    1.473    0.000    1.473    0.000 {method 'update' of '_hashlib.HASH' objects}

What's interesting here is that 21% of the time is spent in io.open and 20% is spent in _io.BufferedReader.read, i.e. about 41% is spent in I/O alone. The calls to the update method for updating the hash sum account for another 44% of the runtime. So the majority (85%) of the time is spent half in I/O, and half in computing the hash sums.
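
For reference, a measurement like this one can be reproduced with a few lines of Python -- the header list and the getFileHash stand-in below are assumptions, not the exact code which produced the numbers above:

    import cProfile
    import hashlib

    def getFileHash(filePath):
        hasher = hashlib.md5()
        with open(filePath, 'rb') as inFile:
            hasher.update(inFile.read())
        return hasher.hexdigest()

    # Placeholder: a text file listing the header paths pulled in by windows.h.
    with open('windows_h_includes.txt') as listFile:
        headers = [line.strip() for line in listFile if line.strip()]

    def benchmark():
        for _ in range(100):
            for path in headers:
                getFileHash(path)

    cProfile.run('benchmark()', sort='tottime')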

I suppose getting a faster hash sum is more likely to happen (in initial experiments, sha1 was faster than md5 for me...) but I wonder how to address the I/O issue. That might be something one can indeed only address by avoiding the I/O in the first place and caching hash sums instead -- is that server you wrote which memoizes the hash results available somewhere? I suppose it could be a fairly generic piece of software which just sits somewhere, reads from e.g. a named pipe and then replies with a hash sum.

I'll write a little C program for comparison to see how fast it could go if I talk to the Windows API directly.

@frerich
Owner

frerich commented Nov 2, 2016

Alas, the PyPy Download Page has no Windows binaries for PyPy for Python 3, and the page also warns that

this is an alpha release supporting the Python 3.3 language. It's also known to be (sometimes much) slower than PyPy 2

...so I think I'll not pursue that idea further for the time being.

@frerich
Owner

frerich commented Nov 2, 2016

For the record, two C programs I wrote which perform the same job (generating MD5 hash sums of 216 header files a hundred times) actually did not perform better than Python either: one uses the MD5 support in the Windows standard libraries (I basically took the code from a Microsoft example) and the other uses the popular Aladdin MD5 code.

The Windows API version was slightly slower, but neither of the two versions was faster than the Python version -- the one based on Aladdin MD5 was actually exactly as fast as the Python version.

Based on this, I think Python itself is not to blame here - there's very little happening in Python code, the hash calculation is in C (it seems it uses the Aladdin MD5 code), and the rest is I/O, waiting for the operating system.

Maybe a viable alternative to a server process would be a database (a single, binary, file) which stores the modification times as well as the hashes of headers. If it was stored efficiently, it might be very fast to open the file, test for membership of a given path in the database, verify that the mtime is unchanged and then (if it is indeed unchanged) reuse the hash.
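
A minimal sketch of that idea, assuming a pickled dict mapping path -> (mtime, hash); the file name and format are made up for illustration, and a real implementation would need locking and an invalidation policy:

    import hashlib
    import os
    import pickle

    CACHE_FILE = 'hashcache.dat'  # hypothetical location of the hash database

    def loadHashCache():
        try:
            with open(CACHE_FILE, 'rb') as f:
                return pickle.load(f)
        except (OSError, pickle.PickleError):
            return {}

    def saveHashCache(cache):
        with open(CACHE_FILE, 'wb') as f:
            pickle.dump(cache, f)

    def cachedFileHash(path, cache):
        mtime = os.path.getmtime(path)
        entry = cache.get(path)
        if entry is not None and entry[0] == mtime:
            return entry[1]  # mtime unchanged: reuse the stored hash
        with open(path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        cache[path] = (mtime, digest)
        return digest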

@akleber
Contributor Author

akleber commented Nov 2, 2016

I took a look at the files getFileHash() hashes in my case. Of the 2 million files, 0.9 million are system headers and 1.1 million are files in my project. As the system headers rarely change, maybe some kind of shortcut could be implemented there?
Might it be possible, for example, to specify a list of directories to clcache whose files are ignored when computing the resulting hash? Of course the user would then have to make sure that the whole clcache cache gets invalidated when the system headers change.
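
Just to make the idea concrete, a rough sketch of such a filter -- the environment variable name is invented for illustration and does not exist in clcache:

    import os

    # Hypothetical setting: semicolon-separated list of directories whose files
    # are excluded from hashing (e.g. the system header directories).
    excludedDirs = [os.path.normcase(os.path.abspath(d))
                    for d in os.environ.get('CLCACHE_IGNORED_INCLUDE_DIRS', '').split(';')
                    if d]

    def isExcluded(path):
        path = os.path.normcase(os.path.abspath(path))
        return any(path.startswith(d + os.sep) for d in excludedDirs)

    def includesToHash(includePaths):
        # Only hash includes outside the excluded directories.
        return [p for p in includePaths if not isExcluded(p)]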

@akleber
Contributor Author

akleber commented Nov 2, 2016

Intuitively I was surprised that generating the hashes would take this much time. To get a better picture I printed all 2 million filenames passed to getFileHash() into a file. Then I wrote a small Python script which reads in the 2 million lines of filenames (takes less than 2 sec) and passes them to getFileHash(), all in one simple loop, i.e. single-threaded. On the same fast machine as before this takes 1000 sec. The profile of the warm cache run reports something like 29274 sec for getFileHash() if I read it correctly. Any idea why this might be so much slower in the context of clcache? Maybe generating the hashes from 24 processes leads to some kind of congestion? Am I missing something in the profile report?

@akleber
Contributor Author

akleber commented Nov 2, 2016

I just analyzed how multiprocessing influences the numbers for my simple hashing-only script. Single-threaded: 1000 s, 12 processes: 1400 s, 24 processes: 1500 s. I am using multiprocessing.Pool. So I think we do see some congestion, but nowhere near the numbers from the clcache profile reports.
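
For reference, a sketch of what such an experiment could look like -- not the actual script; the file name, pool size, and getFileHash body are placeholders:

    import hashlib
    import sys
    from multiprocessing import Pool

    def getFileHash(path):
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()

    if __name__ == '__main__':
        with open('filenames.txt') as f:  # the 2 million file names, one per line
            paths = [line.strip() for line in f if line.strip()]

        if len(sys.argv) > 1 and sys.argv[1] == 'pool':
            with Pool(processes=24) as pool:  # 24 to match the machine above
                hashes = pool.map(getFileHash, paths, chunksize=1000)
        else:
            hashes = [getFileHash(p) for p in paths]  # single-threaded baseline

        print(len(hashes), 'files hashed')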

@frerich
Owner

frerich commented Nov 2, 2016

In your multiprocessing experiment, you're dividing the list of 2M entries by the number of processes, right? I.e. a single process needed 1000 s, 12 processes hashing ~166k files each take 1400 s in total, etc.? Two ideas came to mind:

  1. There may be an overhead due to seek time, but seeing that you're using an SSD, that's probably not the case.
  2. Maybe getFileHash accidentally opens the file for exclusive access, so there was some blocking when having more than one process hash a file, but that does not seem to be the case either.

Right now, I'm wondering whether this is maybe related to the operating system or the file system. The individual Python processes are perfectly decoupled as far as I can see, but I can imagine that there is some enforced synchronisation when checking (or updating?) ACLs -- this might also be something which an anti-virus scanner performs as part of its on-the-fly scanning.

@frerich
Owner

frerich commented Nov 14, 2016

In #243, moving clcache to xxhash was proposed, and I have read good things about xxhash in the past. It might be interesting to see how much this can improve the run time for builds with warm caches in your use case.

@TiloW
Contributor

TiloW commented Dec 2, 2016

I am wondering, is getFileHash() called for every header of every source file that we want to compile?

It should be easy to store the hashes in a database along with the last modification time thus eliminating the need to read the same header file twice during the same build.

Maybe this is already implemented but I just didn't find it?

@frerich
Owner

frerich commented Dec 2, 2016

@TiloW That's correct, getFileHash is called for every header of every source file. For warm caches (i.e. when you get a lot of cache hits), this is indeed a performance bottleneck.

A separate PR (#248) proposes a 'daemon' process which caches the hashes. It opens a named pipe and waits for paths to be given, for which it then computes the hash sums. It also caches the hash sums and uses some file system monitoring to decide when to invalidate the cached hashes. The PR works in principle but is still work in progress. The main issue seems to be that it doesn't actually help as much as I would have hoped -- maybe it's because I merely traded runtimes (getFileHash resp. getFileHashes is faster now, but writing to and reading from the named pipe is slow).

I hope that this is just a matter of me not being terribly familiar with named pipes. In #248 I already mentioned that avoiding any buffering may improve (i.e. reduce) latency.
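
To illustrate the request/response idea only (this is not the code from #248, and the pipe name and protocol are invented), a daemon of this kind could be sketched with multiprocessing.connection, which supports Windows named pipes via the AF_PIPE family:

    import hashlib
    from multiprocessing.connection import Listener

    ADDRESS = r'\\.\pipe\clcache_hash_daemon'  # hypothetical pipe name

    def serveHashes():
        cache = {}  # path -> hash; invalidation via file system monitoring omitted
        with Listener(ADDRESS, family='AF_PIPE') as listener:
            while True:
                with listener.accept() as conn:
                    path = conn.recv()  # client sends a path...
                    if path not in cache:
                        with open(path, 'rb') as f:
                            cache[path] = hashlib.md5(f.read()).hexdigest()
                    conn.send(cache[path])  # ...and receives the hash sum

A client would connect with Client(ADDRESS, family='AF_PIPE'), send the path and receive the digest; whether that round trip is actually cheaper than just hashing the file is exactly the open question here.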

@TiloW
Contributor

TiloW commented Dec 3, 2016

I've tested our build on the machine that it's usually run on (CI, 4 cores @ 2 GHz) and also on a faster one (Fast, 4 cores @ 3.3 GHz, everything on SSD). I've also tested whether a full cache directory slows build times down (it doesn't) and the changes from #248. For what it's worth, I've also implemented a very rudimentary caching mechanism for header files using CouchDB (and also tested PouchDB with in-memory storage); however, this doesn't seem to improve things. Assuming that Windows' built-in file caching picks up most of the header files, the only thing we can speed up with this is the actual hash computation.

machine  clcache  hash caching  warm cache  build time
CI       0        -             -           4:49
CI       1        0             0           6:22
CI       1        0             1           2:06
CI       1        1             0           6:30
CI       1        1             1           2:04
Fast     0        -             -           2:21
Fast     1        0             0           3:40
Fast     1        0             1           1:10

This is contrary to my last observation of 23 minutes for cold caches on the VM. It could be that our VM simply couldn't deliver the required performance at that time. Decreasing the aggressiveness of our virus scanner reduced build time by ~3 minutes (only when using clcache).

I feel like the increase in runtime for cold caches compared to compiling without clcache is quite large.

@frerich
Owner

frerich commented Dec 3, 2016

@TiloW Thanks for those timings! It appears that for your case, there are three insights:

In case you're curious - you can set an environment variable CLCACHE_PROFILE=1 to have clcache generate profiling information for each invocation. After building a project, you can then use the showprofilereport.py script to aggregate all generated profile information and print a nice report which depicts where the time is spent. That might yield some interesting insights.
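
In case it helps to see what the aggregation amounts to, this is roughly the idea behind the script (not its actual code; the profile file name pattern below is a guess):

    import glob
    import pstats

    profiles = glob.glob('clcache-*.prof')  # wherever the per-invocation dumps ended up
    if profiles:
        stats = pstats.Stats(*profiles)     # pstats can merge several profile dumps
        stats.sort_stats('cumulative').print_stats(20)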

@TiloW
Contributor

TiloW commented Dec 4, 2016

generate profiling information for each invocation

Here's the first few lines of the profiling report for my faster machine:

         18139399 function calls (17733649 primitive calls) in 1342.530 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      519    0.014    0.000 1342.533    2.587 {built-in method builtins.exec}
      519    0.004    0.000 1342.519    2.587 <string>:1(<module>)
      519    0.013    0.000 1342.513    2.587 clcache.py:1442(main)
      519    0.020    0.000 1342.227    2.586 clcache.py:1523(processCompileRequest)
      896    0.022    0.000 1297.121    1.448 subprocess.py:1264(wait)
     3391 1297.084    0.383 1297.084    0.383 {built-in method _winapi.WaitForSingleObject}
      516    0.035    0.000 1061.014    2.056 clcache.py:1579(processDirect)
      516    0.030    0.000 1025.521    1.987 clcache.py:1164(invokeRealCompiler)
        3    0.743    0.248  278.702   92.901 clcache.py:1270(reinvokePerSourceFile)
        3    0.015    0.005  277.943   92.648 clcache.py:1246(runJobs)
      515    0.003    0.000  275.647    0.535 clcache.py:1206(waitForAnyProcess)
      516    0.018    0.000   17.304    0.034 clcache.py:1652(ensureArtifactsExist)
      516    0.018    0.000   13.231    0.026 clcache.py:1419(createManifestEntry)
      516    0.321    0.001   12.742    0.025 clcache.py:1421(<listcomp>)
    88068    2.019    0.000   12.548    0.000 clcache.py:773(getFileHash)

When looking through the code I stumbled across this section and tested whether polling instead of waiting for the first job would improve runtime:

    while True:
        for p in procs:
            # poll() returns the exit code once the process has finished,
            # and None while it is still running.
            if p.poll() is not None:
                procs.remove(p)
                return p
        time.sleep(0.01)  # avoid busy-waiting between polling rounds

But this yields only a few seconds for my build at best.

Enabling CLCACHE_HARDLINK also improves runtime by a few (<10) seconds.
Furthermore, I switched from hashlib.md5 to xxHash which also gives a minor improvement.
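
That swap is essentially a one-line change in the hashing helper; a sketch, using the third-party python-xxhash package and a simplified stand-in for clcache's getFileHash:

    import xxhash

    def getFileHash(filePath):
        hasher = xxhash.xxh64()  # xxh64 instead of hashlib.md5()
        with open(filePath, 'rb') as inFile:
            hasher.update(inFile.read())
        return hasher.hexdigest()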

All of this does not explain the decrease in runtime when enabling clcache. So I removed the /MP flag from my builds to have msbuild invoke the compiler with exactly one source file. As expected, build time increases independently of whether clcache is used. However, there is a significant difference:

clcache  warm cache  build time
0        -           6:04
1        0           5:32
1        1           2:02

So building with clcache actually improves runtime even with cold caches.
I'm not entirely sure why this occurs; to rule out any effects caused by the same object file being built multiple times for different build targets, I performed a reduced build where only one library is generated.

clcache  warm cache  build time
0        -           4:41
1        0           4:18
1        1           1:37

So here's my proposal: When clcache receives a list of source files, check which source files need to be compiled (as it's done currently) but call the real compiler only once with the entire list of source files that need to be compiled. Maybe the compiler does some funny optimizations when handling multiple files that we are currently missing out on. This would also negate the need for joining multiple processes.

@frerich
Owner

frerich commented Dec 4, 2016

Thanks for the report! For what it's worth, the fact that clcache invokes itself for each source file (and the 'outer' clcache instance spends most of its time waiting for the inner processes to finish) means that it's tricky to interpret reports generated by invoking clcache with multiple source files. Such reports usually show that calls to _winapi.WaitForSingleObject dominate the runtime -- in this case, 1297 out of 1342 seconds (~97%!) are spent in that function. Scrolling further down, one can see that all this time is caused by calls from subprocess.py:1264(wait). This function is called by clcache.py:1164(invokeRealCompiler) (1021 seconds) and clcache.py:1206(waitForAnyProcess) (275 seconds). This means that 275 of the 1297 seconds spent in _winapi.WaitForSingleObject can be deducted, since they are executed concurrently with other _winapi.WaitForSingleObject calls.

One can also see that clcache was invoked 519 times, and 516 times it called processDirect (i.e. 516 of 519 invocations are eligible for caching and are not passed through right away), and this resulted in 516 calls to invokeRealCompiler. So all calls were forwarded to the real compiler - i.e. this report stems from a clcache run with a cold cache. With that in mind, it's not surprising that most of the time is spent in _winapi.WaitForSingleObject, waiting for the real compiler to do its work.

With that being said...

So building with clcache actually improves runtime even with cold caches.

This cannot possibly be true; clcache always adds some overhead. In fact, in #239 (comment) you measured a 50% overhead for building with cold caches. So I can only assume that some other side effect is influencing the timings.

Enabling CLCACHE_HARDLINK also improves runtime by a few (<10) seconds.

That would be a < 1% improvement. This switch stems from times when mechanical drives were still a lot more common. It seems that with SSDs, it's not really useful anymore.

Furthermore, I switched from hashlib.md5 to xxHash which also gives a minor improvement.

Do you remember how much? Of course, for cold caches, it won't be much in percentages since the build time is dominated by invoking the real compiler. You might see a more noticeable improvement with warm caches, when clcache is mostly busy hashing and restoring data.

@TiloW
Contributor

TiloW commented Dec 4, 2016

In fact, in #239 (comment) you measured a 50% overhead for building with cold caches.

Yes, but that was using /MP. That's the crucial result here: Heavily increased runtime when using clcache with cold caches seems to be tied to builds where the compiler is invoked with multiple sources.

So building with clcache actually improves runtime even with cold caches.

This cannot possibly be true

I have no idea yet why this occurs. But it's certainly reproducible. There are no cache hits registered by clcache so it is indeed a cold cache.

Do you remember how much?

cumtime for getFileHash() went from 12.548 to 9.525. So really just a very minor improvement.

Of course, for cold caches, it won't be much in percentages

Shouldn't be a problem as long as we compare absolute build times, right?

@TiloW
Contributor

TiloW commented Dec 4, 2016

I believe runJobs() spawns up to j+1 subprocesses (which is contrary to the docs). This could lead to more concurrent compiler calls than expected, thus decreasing runtime when using clcache even on cold builds. This could, however, only be the reason if reinvokePerSourceFile() is actually called. Surprisingly, this seems to be the case even without /MP (as my local profiling report suggests).

@TiloW
Contributor

TiloW commented Dec 4, 2016

Surprisingly, this seems to be the case even without /MP (as my profiling report suggests).

Okay, so I wrongly assumed that the compiler would be invoked with multiple source files only when these source files are to be compiled in parallel. That's not true: msbuild may invoke the compiler with all source files of a target independently of /MP. This also explains (in combination with the j+1=2 concurrent compiler invocations) why enabling clcache accelerates the build even on cold caches, when the regular compiler would handle all files sequentially.

Edit: Changing the aforementioned line to while len(running) >= j: indeed increases runtime for cold builds to 9:36. This is consistent with the times observed when utilizing all cores.

@frerich
Owner

frerich commented Dec 5, 2016

Changing the aforementioned line to while len(running) >= j: indeed increases runtime for cold builds to 9:36. This is consistent with the times observed when utilizing all cores.

Indeed, good catch! There's an off-by-one.
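
For the record, a simplified illustration of the guard in question -- not the literal runJobs() from clcache.py, just the shape of the fix:

    import subprocess
    import time

    def waitForAnyProcess(procs):
        # Return (and remove) the first process that has finished.
        while True:
            for p in procs:
                if p.poll() is not None:
                    procs.remove(p)
                    return p
            time.sleep(0.01)

    def runJobs(commands, j):
        # Keep at most j child processes alive. The old condition effectively
        # started waiting only once j+1 processes were already running; using
        # '>= j' caps the concurrency at j as intended.
        running = []
        exitCodes = []
        for cmd in commands:
            while len(running) >= j:
                exitCodes.append(waitForAnyProcess(running).returncode)
            running.append(subprocess.Popen(cmd))
        while running:
            exitCodes.append(waitForAnyProcess(running).returncode)
        return exitCodes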

frerich pushed a commit that referenced this issue Dec 5, 2016
In the course of discussing GitHub issue #239, @TiloW noticed that the
number of clcache instances launched by the runJobs() function was
always one larger than the number of intended concurrent instances. A
simple off-by-one. This patch fixes it.

Alas, I couldn't think of a nice way to create a test for this since
controlling concurrency is quite difficult and I'd rather have no test
than a flaky test.
@TiloW
Contributor

TiloW commented Dec 5, 2016

When clcache receives a list of source files, check which source files need to be compiled (as it's done currently) but call the real compiler only once with the entire list of source files that need to be compiled.

@frerich Are you planning to implement something like this in the foreseeable future? I could give it a go, even though you would probably be much quicker.

@frerich
Owner

frerich commented Dec 6, 2016

@frerich Are you planning to implement something like this in the foreseeable future? I could give it a go, even though you would probably be much quicker.

@TiloW I'd love to give it a shot, but I cannot really promise anything. If you would like to give it a try, feel free to go ahead - please don't hesitate to shout in case you're wondering how things are supposed to work.

For what it's worth, you would probably have to start looking at processCompileRequest which is the function getting called once clcache decides that it's invoked to compile one or more source files (as opposed to e.g. linking object files). This function calls CommandLineAnalyzer.analyze to parse the given command line and determine the source file(s) to compile. If there's more than one source file, clcache reinvokes itself once for each source file by calling reinvokePerSourceFile. Otherwise, it proceeds to check whether the invocation has a cached result by calling processDirect or processNoDirect (depending on whether CLCACHE_DIRECT is set).

@frerich
Owner

frerich commented Dec 15, 2016

I think with #248 and #255 merged, and various other things like different hash algorithms (e.g. #243) considered, this issue can be closed for now.

If there are still areas which are noticeably slower than expected, new issues should be created.

@frerich frerich closed this as completed Dec 15, 2016