This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

Switch file hashing to xxhash #243

Closed
wants to merge 1 commit into from

Conversation

@hubx (Contributor) commented Nov 12, 2016

xxhash was mentioned in #217 and hashing is currently under discussion in #241. I wanted to show how trivial switching to xxhash is and throw it into the discussion again.

For me (SSD, massively parallel compilation) it decreased compilation time by only 2.4%, but others might see more significant improvements.
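
For illustration, a minimal sketch of the swap, assuming clcache funnels file hashing through a single helper (the helper's name here is made up, not necessarily clcache's actual function):

import xxhash  # pip install xxhash

def file_hash(path):
    # xxh64 is a fast non-cryptographic hash; for cache keys, good
    # collision behaviour is what matters, not cryptographic strength.
    hasher = xxhash.xxh64()  # previously: hashlib.md5()
    with open(path, 'rb') as f:
        hasher.update(f.read())
    return hasher.hexdigest()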

// cc @heremaps/cci @Jimilian @sschuberth

@sschuberth

@hubx AppVeyor CI fails with

Traceback (most recent call last):
  File "clcache.py", line 23, in <module>
    import xxhash
ImportError: No module named 'xxhash'

Would you mind fixing this by adding pip install xxhash to appveyor.yml?
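
Something like the following should do, assuming appveyor.yml has (or gets) an install section; the exact placement depends on the existing file:

install:
  - pip install xxhash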

@webmaster128 (Contributor)

I don't think the minor speed improvement justifies the requirement of an external dependency. I'd leave this as an opt-in speed improvement for large-scale projects. For simple programs that take 5 or 10 minutes to compile, it is more important to get the cache up and running quickly.

@frerich (Owner) commented Nov 14, 2016

it decreased compilation time by only 2.4%, but others might see more significant improvements.

To be fair, others might see less significant improvements, too. ;-)

Joking aside, I suspect that xxhash might be an improvement -- I recall having come across it, too, when checking for ways to make cache hits faster (the runtime for cache hits seems to be dominated by reading files and hashing data). However, I'd like to see some benchmarks before making a call. It's good to see my assumption confirmed that changing the hash algorithm is easy, though!

@frerich (Owner) commented Nov 14, 2016

Reopening this; I didn't actually mean to close the PR just yet.

@Jimilian (Contributor)

@frerich Would a microbenchmark on "real" data (random input sequences of realistic lengths) satisfy your requirements? Of course, it should be the same test for Python 2.7/3.3/3.4/3.5.
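
A minimal sketch of such a microbenchmark (the 1 MiB input size is an assumption, chosen to roughly match a preprocessed source file; it runs unchanged on Python 2.7 and 3.x):

import hashlib
import os
import timeit

import xxhash  # pip install xxhash

data = os.urandom(1024 * 1024)  # 1 MiB of random input

for name, func in [("md5", lambda: hashlib.md5(data).digest()),
                   ("xxh64", lambda: xxhash.xxh64(data).digest())]:
    seconds = timeit.timeit(func, number=1000)
    print("{0}: {1:.3f}s for 1000 rounds".format(name, seconds))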

@frerich (Owner) commented Nov 14, 2016

For the record, it seems that xxhash is indeed a lot faster than hashlib.md5 (about twice as fast in my experiments), but it's also a non-trivial dependency: it's not part of the standard library, and since the module includes C code, it requires a very specific Visual Studio installation (matching the Python version).

Right now, clcache supports Python 3.3, 3.4 and 3.5. The WindowsCompilers page explains that for 3.3 and 3.4, Visual Studio 2010 needs to be installed, and for 3.5, Visual Studio 2015. I have neither, though (I use Visual Studio 2008 and Visual Studio 2013 for my daily work), so I had to hunt a bit for the right downloads. Since I use Python 3.5, there is a download of the Visual C++ Build Tools available which provides just the command line tools needed for pip install xxhash to work.

The bottom line is: I think documenting this alternative hashing algorithm probably makes sense, or making it optional at runtime -- but it does not seem to have enough impact to warrant complicating the installation.

@frerich (Owner) commented Nov 14, 2016

@Jimilian I think the performance test included in performancetests.py would be a good start. It actually exercises the entire roundtrip. It would be nice to see how it changes with different hash algorithms.

For what it's worth, I did a quick test run in a Windows VM, and here's the output I get:

  • with hashlib.md5:

Compiling 30 source files sequentially, cold cache: 12.269236257898246 seconds
Compiling 30 source files sequentially, hot cache: 3.1958666312493627 seconds
Compiling 30 source files concurrently via /MP2, hot cache: 3.219691360815654 seconds

  • with xxhash.xxh64:

Compiling 30 source files sequentially, cold cache: 11.387133487436277 seconds
Compiling 30 source files sequentially, hot cache: 3.1238149097423946 seconds
Compiling 30 source files concurrently via /MP2, hot cache: 3.1892145836394654 seconds

In this particular scenario, the improvement is certainly measurable, but I'm not sure whether it's big enough to warrant changing the default hash algorithm and imposing the more complicated installation. My test setup suffers from somewhat slow I/O, though (it's in a Windows VM), so with faster disks you might see much bigger improvements.

I suspect that a much larger effect can be achieved by not hashing as many files in the first place (in other reports, e.g. #239, it was found that since clcache processes don't communicate, a large number of files are hashed repeatedly).
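
A sketch of that idea: memoize hashes per (path, mtime) so each file is read and hashed at most once per process (names are illustrative, not clcache's actual code; sharing the cache across processes would additionally need some form of shared storage):

import hashlib
import os

_hash_cache = {}  # (path, mtime) -> hex digest

def cached_file_hash(path):
    # Re-hash only when the file changed; repeated lookups hit the cache.
    key = (path, os.path.getmtime(path))
    digest = _hash_cache.get(key)
    if digest is None:
        with open(path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        _hash_cache[key] = digest
    return digest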

@hubx (Contributor, Author) commented Nov 14, 2016

@sschuberth done. Let's see if this works, given the difficulties explained above around building native extensions for Python on Windows with the matching Visual Studio version.

@frerich I forgot about the build dependencies there; I had recently set up scandir for Python 3.4, so the installation of xxhash was transparent for me :/

But I would refrain from making it configurable. Rather, check what is available and prefer the faster hashing algorithm.

@frerich (Owner) commented Nov 14, 2016

But I would refrain from making it configurable. Rather, check what is available and prefer the faster hashing algorithm.

That sounds like a plan. If you can adjust this PR such that it optionally uses xxhash, I think there's no real reason not to merge it.
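
A minimal sketch of such an availability check, falling back to hashlib.md5 when xxhash isn't installed (illustrative, not the PR's actual code):

try:
    import xxhash

    def new_hasher():
        return xxhash.xxh64()
except ImportError:
    import hashlib

    def new_hasher():
        return hashlib.md5()

Both hasher objects expose the same update()/hexdigest() interface, so callers don't need to care which implementation they got.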

@sasobadovinac

I was not able to get any useful speedup in my tests when switching to xxhash: https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.360

@frerich (Owner) commented Dec 14, 2016

Closing this; xxhash is indeed faster, but it seems not so much faster that it would justify the more difficult installation. Furthermore, it was shown that a much better optimisation is not to compute hash sums repeatedly in the first place, but rather to cache them.

Thanks everyone for the constructive discussion!

@frerich closed this Dec 14, 2016
@sasobadovinac

Would it make sense to retest this with all the new updates in the 4.0.0 release?
