Performance improvement unexpectedly small #239
Thanks for your thoughtful post! I agree with your initial analysis of the 'warm' profile: most of the time is spent in computing the hashes of about 2.1 million include files. The numbers you got match those you mentioned in an earlier comment of yours, after working around what seems to be an inefficient usage of threads.

Scrolling down in the profile report, one can see that almost all of the calls to getFileHash() stem from hashing include files. The difference in the number of calls to getFileHash() between the cold and the warm report is interesting, too.

That dictionary comprehension in clcache.py:1388 has been removed in the meanwhile. As for how to improve the situation, clearly there are two approaches: avoid hashing the same include files over and over again (e.g. by caching the hashes across clcache invocations), and make hashing an individual file faster.
Thanks for your input. I tried #240 but unfortunately it does not improve my timings, which is a pity as I like the idea of the lru_cache decorator very much. I added some lru_cache usage statistics output.
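In essence the statistics output relies on the cache_info() counters that functools.lru_cache provides; a minimal sketch (the getFileHash body shown here is simplified, not clcache's actual code):

```python
import functools
import hashlib

@functools.lru_cache(maxsize=None)
def getFileHash(filePath):
    # Simplified stand-in for clcache's getFileHash(): MD5 over the raw bytes.
    with open(filePath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def printHashCacheStatistics():
    # cache_info() reports how often the memoized getFileHash() was hit.
    info = getFileHash.cache_info()
    print('getFileHash: {} hits, {} misses, {} cached entries'.format(
        info.hits, info.misses, info.currsize))
```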
Thanks for trying #240 - I suppose that if it doesn't help, it means that on every individual clcache run, (almost) no include was hashed twice -- which is good. However, seeing how over two million files get hashed when compiling about eight thousand files, I suspect that many source files simply end up using the same includes (e.g. common system headers).

I'm currently experimenting with different approaches to solve this; one thing which @vchigrin implemented about five years ago was a 'daemon' mode in which clcache (among other things) talks to servers via named pipes in order to cache the include hashes - it seems he did this with great success. The code is still available at https://github.com/vchigrin/clcache/tree/cacodaemons it seems.

I'm currently studying @vchigrin 's implementation as well as other local-host IPC mechanisms (TCP sockets appear to be way too slow for our purposes, unless I badly screwed up my usage of the socket API) to see whether there's anything nice.
I wonder if pypy would help hashing performance? When I started cclash I briefly trialed python (which I love and use constantly for other stuff), but back then C# was something like 1000x faster than python at hashing files with md5 or sha1 (I even tried md4 in the quest for more speed). My only real solution was to memoize the hash results and use a file monitor to re-hash on file changes, by adopting a client/server model where the server was long-lived and could therefore avoid repeatedly hashing the same windows headers or libs again and again.
@inorton pypy is an interesting idea, I'll give that a try - just for fun. I also looked closer at how the time spent in getFileHash() is composed.

What's interesting here is that 21% of the time is spent reading the file contents; most of the remainder goes into computing the digests. I suppose getting a faster hash sum is more likely to happen than speeding up the I/O. I'll write a little C program for comparison to see how fast it could go if I talk to the Windows API directly.
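For the Python side of such a comparison, a micro-benchmark can be as simple as this sketch (the include directory is just a placeholder; the file count and repeat count mirror the experiment described below):

```python
import glob
import hashlib
import time

def hashFile(path):
    # Same work as clcache's getFileHash(): MD5 over the file contents.
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Hash a fixed set of headers a hundred times and report the wall time.
headers = glob.glob(r'C:\some\include\dir\*.h')  # placeholder directory
start = time.perf_counter()
for _ in range(100):
    for header in headers:
        hashFile(header)
print('{} files x 100: {:.2f}s'.format(len(headers),
                                       time.perf_counter() - start))
```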
Alas, the PyPy Download Page has no Windows binaries for PyPy for Python 3, and the page also warns that the Python 3 support is not mature yet -- so I think I'll not pursue that idea further for the time being.
For the record, two C programs I wrote which perform the same job (generating MD5 hash sums of 216 header files a hundred times) actually did not perform better than Python either: one uses the MD5 support in the Windows standard libraries (I basically took the code from a Microsoft example) and the other uses the popular Aladdin MD5 code. The Windows API version was slightly slower, and neither of the two was faster than the Python version -- the one based on Aladdin MD5 was actually exactly as fast as the Python version.

Based on this, I think Python itself is not to blame here: there's very little code executing in Python, the hash calculation is done in C (it seems Python uses the Aladdin MD5 code, too) and the rest is I/O, waiting for the operating system.

Maybe a viable alternative to a server process would be a database (a single, binary file) which stores the modification times as well as the hashes of headers. If it was stored efficiently, it might be very fast to open the file, test for membership of a given path in the database, verify that the mtime is unchanged and then (if it is indeed unchanged) reuse the hash.
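A minimal sketch of that idea using sqlite3 (the table layout and function names are made up for illustration):

```python
import hashlib
import os
import sqlite3

def openHashDb(dbPath):
    # One table mapping a path to its last known mtime and content hash.
    db = sqlite3.connect(dbPath)
    db.execute('CREATE TABLE IF NOT EXISTS hashes '
               '(path TEXT PRIMARY KEY, mtime REAL, hash TEXT)')
    return db

def cachedFileHash(db, path):
    mtime = os.path.getmtime(path)
    row = db.execute('SELECT mtime, hash FROM hashes WHERE path = ?',
                     (path,)).fetchone()
    if row is not None and row[0] == mtime:
        return row[1]  # mtime unchanged: reuse the stored hash
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    db.execute('INSERT OR REPLACE INTO hashes VALUES (?, ?, ?)',
               (path, mtime, digest))
    return digest
```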
I took a look at the files getFileHash() hashes in my case. Of the 2 million files, 0.9 million are system headers and 1.1 million are files in my project. As the system headers rarely change, maybe some kind of shortcut could be implemented there?
Intuitively, I was surprised that generating the hashes would take this much time. To get a better picture, I printed all 2 million filenames passed to getFileHash() into a file. Then I wrote a small Python script which reads in the 2 million lines of filenames (takes less than 2 sec) and passes them to getFileHash() -- all in one simple loop, so single-threaded! On the same fast machine as before, this takes 1000 sec. The profile of the warm cache run says something about 29274 sec for getFileHash(), if I read it correctly. Any idea why this might be so much slower in the context of clcache? Maybe generating the hashes from 24 processes leads to some kind of congestion? Am I missing something in the profile report?
Just analyzed how multiprocessing influences the numbers for my simple hashing-only script. Single-threaded: 1000s, 12 processes: 1400s, 24 processes: 1500s. I am using multiprocessing.Pool. So I think we do see some congestion, but nowhere near the numbers from the clcache profile reports.
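For reference, the experiment is roughly of this shape (a sketch; filenames.txt stands in for the dumped list of include paths):

```python
import hashlib
import multiprocessing
import time

def hashFile(path):
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

if __name__ == '__main__':
    with open('filenames.txt') as f:
        paths = [line.strip() for line in f if line.strip()]
    start = time.perf_counter()
    # 12 here; 24 for the other measurement. Pool.map splits the list
    # into chunks and distributes them over the worker processes.
    with multiprocessing.Pool(processes=12) as pool:
        pool.map(hashFile, paths, chunksize=1000)
    print('{:.0f}s'.format(time.perf_counter() - start))
```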
In your multiprocessing experiment, you're dividing the list of 2M entries by the number of processes, right? I.e. a single process needed 1000s, 12 processes hashing ~166k files each take 1400s in total, etc.? Two ideas came to my mind.

Right now, I'm thinking about whether this is maybe related to the operating system or the file system. The individual Python processes are perfectly decoupled as far as I can see, but I can imagine that in order to check (or update?) ACLs (this might also be something which an Anti-Virus scanner performs as part of its on-the-fly scanning), there is some enforced synchronisation.
In #243 , moving clcache to ... is being discussed.
I am wondering: is every header file hashed again on each invocation of clcache? It should be easy to store the hashes in a database along with the last modification time, thus eliminating the need to read the same header file twice during the same build. Maybe this is already implemented but I just didn't find it?
@TiloW That's correct, the computed hashes are currently not reused across clcache invocations.

A separate PR (#248) proposes a separate 'daemon' process which caches the hashes. It opens a named pipe and waits for paths to be given, for which it then computes the hash sums. It also caches the hash sums and uses some file system monitoring to decide when to invalidate the cached hashes. The PR works in principle but is still work in progress. The main issue seems to be that it doesn't actually help as much as I would have hoped -- maybe because I merely traded runtimes (time spent hashing for time spent talking to the daemon).

I hope that this is just a matter of me not being terribly familiar with named pipes. In #248 I already mentioned that avoiding any buffering may improve (i.e. reduce) latency.
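For what it's worth, the client side of such a daemon round-trip with pywin32 boils down to a single call per request; a sketch (the pipe name and wire format are made up, not what #248 actually uses):

```python
import win32pipe  # part of the pywin32 package

PIPE_NAME = r'\\.\pipe\clcache_hash_daemon'  # hypothetical pipe name

def requestHash(path):
    # One transaction: connect, write the path, read the hash, disconnect.
    response = win32pipe.CallNamedPipe(PIPE_NAME,
                                       path.encode('utf-8'),
                                       64,
                                       win32pipe.NMPWAIT_WAIT_FOREVER)
    return response.decode('utf-8')
```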
I've tested our build on the machine that it's usually run on (CI, 4 cores @ 2 GHz) and also on a faster one (Fast, 4 cores @ 3.3 GHz, everything on SSD). I've also tested whether a full cache directory slows build times down (it doesn't) and the changes from #248. For what it's worth, I've also implemented a very rudimentary caching mechanism for header files using CouchDB (also tested PouchDB with in-memory storage). However, this doesn't seem to improve things. Assuming that Windows' built-in file caching picks up most of the header files, the only thing that we can speed up with this is the actual hash computation.

This is contrary to my last observation of 23 minutes for cold caches on the VM. It could be that our VM simply couldn't deliver the required performance at that time. Decreasing the aggressiveness of our virus scanner reduced build time by ~3 minutes (only when using clcache). I feel like the increase in runtime for cold caches compared to compiling without clcache is quite substantial.
@TiloW Thanks for those timings! It appears that for your case, there are three insights...
In case you're curious - you can set the environment variable CLCACHE_PROFILE to make clcache generate profiling data for each invocation.
Here are the first few lines of the profiling report for my faster machine:
When looking through the code I stumbled across the section which waits for running jobs, and tested whether polling instead of waiting for the first job would improve runtime.
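Expressed with concurrent.futures, the difference between the two variants is roughly this (a hypothetical sketch, not the actual clcache code):

```python
import concurrent.futures
import time

def waitForFreeSlot(runningJobs, maxConcurrency):
    # Blocking variant: sleep until the first running job completes.
    if len(runningJobs) >= maxConcurrency:
        concurrent.futures.wait(
            runningJobs, return_when=concurrent.futures.FIRST_COMPLETED)

def pollForFreeSlot(runningJobs, maxConcurrency):
    # Polling variant: repeatedly count unfinished jobs instead of blocking.
    while sum(1 for job in runningJobs if not job.done()) >= maxConcurrency:
        time.sleep(0.01)
```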
But this yields only a few seconds for my build at best. Enabling CLCACHE_HARDLINK helped a bit, too.

All of this does not explain the decrease in runtime when enabling clcache, though.

So building with clcache seems to be faster than building without it, even with a cold cache.

So here's my proposal: When ...
Thanks for the report! For what it's worth, the fact that clcache invokes itself for each source file (and the 'outer' clcache instance spends most of its time waiting for the inner processes to finish) means that it's tricky to interpret reports generated by invoking clcache with multiple source files. Such reports usually show that most of the time goes into calls which merely wait for the child processes to finish. One can also see that clcache was invoked 519 times, and 516 times it called the real compiler.

With that being said...
This cannot possibly be true; clcache always adds some overhead. In fact, in #239 (comment) you measured a 50% overhead for building with cold caches. So I can only assume that some other side effect is influencing the timings.
That would be a < 1% improvement. This switch stems from times when mechanical drives were still a lot more common. It seems that with SSDs, it's not really useful anymore.
Do you remember how much? Of course, for cold caches, it won't be much in percentage terms, since the build time is dominated by invoking the real compiler. You might see a more noticeable improvement with warm caches, when clcache is mostly busy hashing and restoring data.
Yes, but that was using ...
I have no idea yet why this occurs, but it's certainly reproducible. There are no cache hits registered by the clcache statistics.
Shouldn't be a problem as long as we compare absolute build times, right?
I believe ...
Okay, so I wrongly assumed that the compiler would be invoked with multiple source files only when these source files should be compiled in parallel. That's not true, and ...

Edit: Changing the aforementioned line to ... seems to fix it.
Indeed, good catch! There's an off-by-one.
In the course of discussing GitHub issue #239, @TiloW noticed that the number of clcache instances launched by the runJobs() function was always one larger than the number of intended concurrent instances. A simple off-by-one. This patch fixes it. Alas, I couldn't think of a nice way to create a test for this since controlling concurrency is quite difficult and I'd rather have no test than a flaky test.
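To illustrate the kind of off-by-one at hand (a hypothetical reduction, not the literal runJobs() implementation):

```python
import subprocess
import time

def runJobs(commands, maxConcurrency):
    # Launch at most maxConcurrency child processes at any one time.
    running = []
    for cmd in commands:
        # The off-by-one: testing 'len(running) > maxConcurrency' here would
        # admit maxConcurrency + 1 concurrent children; '>=' is correct.
        while len(running) >= maxConcurrency:
            running = [p for p in running if p.poll() is None]
            time.sleep(0.01)
        running.append(subprocess.Popen(cmd))
    return [p.wait() for p in running]
```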
@frerich Are you planning to implement something like this in the foreseeable future? I could give it a go, even though you would probably be much quicker.
@TiloW I'd love to give it a shot, but I cannot really promise anything. If you would like to give it a try, feel free to go ahead - please don't hesitate to shout in case you're wondering how things are supposed to work. For what it's worth, you would probably have to start looking at ...
After resolving my last issues, I got back to doing some performance tests on my main project, and I see a surprisingly small performance improvement: I am building with ninja on a 24-core machine with an SSD. A build with an empty cache takes about 48 min; the rebuild with a full cache takes about 30 min. Here are the cache statistics for a cold and a warm build.
I profiled both runs: prof4-cold.txt and prof4-warm.txt
If I read the 'warm' one correctly, most of the time is spent computing file hashes, over 2 million times. Are these all the headers from all files? If this is correct, what I do not understand is why the 'warm' case has slightly more calls to getFileHash() and why it takes more than double the time.
Any ideas or input on how I might improve the speedup clcache gives me?
FYI: My still-unfinished lockfile mode does not change these numbers significantly. Also, using xcopy instead of shutil.copyfile(), or using readinto() instead of read() in getFileHash(), does not change the situation.
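For reference, the readinto() variant I tried looks roughly like this (a sketch, not the exact code; the chunk size is arbitrary):

```python
import hashlib

def getFileHashReadinto(filePath, chunkSize=512 * 1024):
    # Variant of getFileHash() which reuses a single buffer instead of
    # allocating a fresh bytes object for every read() call.
    hasher = hashlib.md5()
    buf = bytearray(chunkSize)
    view = memoryview(buf)
    with open(filePath, 'rb') as f:
        while True:
            n = f.readinto(buf)
            if not n:
                break
            hasher.update(view[:n])
    return hasher.hexdigest()
```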