This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

Switch file hashing to xxhash #243

Closed
wants to merge 1 commit into from

Conversation

@hubx (Contributor) commented Nov 12, 2016

xxhash was mentioned in #217 and hashing is currently under discussion in #241. I wanted to show how trivial switching to xxhash is and throw it into the discussion again.

For me (SSD, massively parallel compilation) it decreased compilation time by only 2.4%, but others might see more significant improvements.
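
For illustration, a minimal sketch of the swap, assuming clcache funnels file hashing through a single helper (the helper's name here is made up, not necessarily clcache's actual function):

import xxhash  # pip install xxhash

def file_hash(path):
    # xxh64 is a fast non-cryptographic hash; for cache keys, good
    # collision behaviour is what matters, not cryptographic strength.
    hasher = xxhash.xxh64()  # previously: hashlib.md5()
    with open(path, 'rb') as f:
        hasher.update(f.read())
    return hasher.hexdigest()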

// cc @heremaps/cci @Jimilian @sschuberth

@sschuberth

@hubx AppVeyor CI fails with

Traceback (most recent call last):
  File "clcache.py", line 23, in <module>
    import xxhash
ImportError: No module named 'xxhash'

Would you mind fixing this by adding pip install xxhash to appveyor.yml?
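
Something like the following should do, assuming appveyor.yml has (or gets) an install section; the exact placement depends on the existing file:

install:
  - pip install xxhash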

@webmaster128 (Contributor)

I don't think the minor speed improvement justifies the requirement of an external dependency. I'd leave this as an opt-in speed improvement for large-scale projects. For simple programs that take 5 or 10 minutes to compile, it is more important to get the cache up and running quickly.

@frerich (Owner) commented Nov 14, 2016

it decreased compilation time by only 2.4%, but others might see more significant improvements.

To be fair, others might see less significant improvements, too. ;-)

Joking aside, I suspect that xxhash might be an improvement -- I recall having come across it, too, when checking for ways to make cache hits faster (the runtime for cache hits seems to be dominated by reading files and hashing data). However, I'd like to see some benchmarks before making a call. It's good to see my assumption confirmed that changing the hash algorithm is easy, though!

@frerich (Owner) commented Nov 14, 2016

Reopening this; I didn't actually mean to close the PR just yet.

@Jimilian (Contributor)

@frerich Would a microbenchmark on "real" data (random input sequences of realistic lengths) satisfy your requirements? Of course, it should be the same test for Python 2.7/3.3/3.4/3.5.
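
A minimal sketch of such a microbenchmark (the 1 MiB input size is an assumption, chosen to roughly match a preprocessed source file; it runs unchanged on Python 2.7 and 3.x):

import hashlib
import os
import timeit

import xxhash  # pip install xxhash

data = os.urandom(1024 * 1024)  # 1 MiB of random input

for name, func in [("md5", lambda: hashlib.md5(data).digest()),
                   ("xxh64", lambda: xxhash.xxh64(data).digest())]:
    seconds = timeit.timeit(func, number=1000)
    print("{0}: {1:.3f}s for 1000 rounds".format(name, seconds))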

@frerich (Owner) commented Nov 14, 2016

For the record, it seems that xxhash is indeed a lot faster than hashlib.md5 (about twice as fast in my experiments), but it's also a non-trivial dependency: it's not part of the standard library, and since the module includes C code, it requires a very specific Visual Studio installation (matching the Python version).

Right now, clcache supports Python 3.3, 3.4 and 3.5. The WindowsCompilers page explains that for 3.3 and 3.4, Visual Studio 2010 needs to be installed, and for 3.5, Visual Studio 2015. I have neither, though (I use Visual Studio 2008 and Visual Studio 2013 for my daily work), so I had to hunt a bit for the right downloads. Since I use Python 3.5, there is a download of the Visual C++ Build Tools available which provides just the command line tools needed for pip install xxhash to work.

The bottom line is: I think documenting this alternative hashing algorithm probably makes sense, or making it optional at runtime -- but it does not seem to have enough impact to warrant complicating the installation.

@frerich (Owner) commented Nov 14, 2016

@Jimilian I think the performance test included in performancetests.py would be a good start. It actually exercises the entire roundtrip. It would be nice to see how it changes with different hash algorithms.

For what it's worth, I did a quick test run in a Windows VM, and here's the output I get:

  • with hashlib.md5:

Compiling 30 source files sequentially, cold cache: 12.269236257898246 seconds
Compiling 30 source files sequentially, hot cache: 3.1958666312493627 seconds
Compiling 30 source files concurrently via /MP2, hot cache: 3.219691360815654 seconds

  • with xxhash.xxh64:

Compiling 30 source files sequentially, cold cache: 11.387133487436277 seconds
Compiling 30 source files sequentially, hot cache: 3.1238149097423946 seconds
Compiling 30 source files concurrently via /MP2, hot cache: 3.1892145836394654 seconds

In this particular scenario, the improvement is certainly measurable, but I'm not sure whether it's big enough to warrant changing the default hash algorithm and imposing the more complicated installation. My test setup suffers from somewhat slow I/O, though (it's in a Windows VM), so with faster disks you might see much bigger improvements.

I suspect that a much larger effect can be achieved by not hashing as many files in the first place (in other reports, e.g. #239, it was found that since clcache processes don't communicate, a large number of files are hashed repeatedly).
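
A sketch of that idea: memoize hashes per (path, mtime) so each file is read and hashed at most once per process (names are illustrative, not clcache's actual code; sharing the cache across processes would additionally need some form of shared storage):

import hashlib
import os

_hash_cache = {}  # (path, mtime) -> hex digest

def cached_file_hash(path):
    # Re-hash only when the file changed; repeated lookups hit the cache.
    key = (path, os.path.getmtime(path))
    digest = _hash_cache.get(key)
    if digest is None:
        with open(path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        _hash_cache[key] = digest
    return digest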

@hubx (Contributor, Author) commented Nov 14, 2016

@sschuberth done. Let's see if this works, given the difficulties explained above around building native extensions for Python on Windows with the matching Visual Studio version.

@frerich I forgot about the build dependencies there; I had recently set up scandir for Python 3.4, so the installation of xxhash was transparent for me :/

But I would refrain from making it configurable. Rather, check what is available and prefer the faster hashing algorithm.

@frerich (Owner) commented Nov 14, 2016

But I would refrain from making it configurable. Rather, check what is available and prefer the faster hashing algorithm.

That sounds like a plan. If you can adjust this PR such that it optionally uses xxhash, I think there's no real reason not to merge it.
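
A minimal sketch of such an availability check, falling back to hashlib.md5 when xxhash isn't installed (illustrative, not the PR's actual code):

try:
    import xxhash

    def new_hasher():
        return xxhash.xxh64()
except ImportError:
    import hashlib

    def new_hasher():
        return hashlib.md5()

Both hasher objects expose the same update()/hexdigest() interface, so callers don't need to care which implementation they got.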

@sasobadovinac

I was not able to get any useful speedup in my tests when switching to xxhash: https://ci.appveyor.com/project/sasobadovinac/freecad/build/1.0.360

@frerich (Owner) commented Dec 14, 2016

Closing this; xxhash is indeed faster, but it seems not so much faster that it would justify the more difficult installation. Furthermore, it was shown that a much better optimisation is not to compute hash sums repeatedly in the first place, but rather to cache them.

Thanks everyone for the constructive discussion!

@frerich closed this Dec 14, 2016
@sasobadovinac

Would it make sense to retest this with all the new updates in the 4.0.0 release?
