
UA Compared to other tools

This comparison is not scientific or exhaustive in any sense, but it should give some idea of how ua fares compared to similar tools (by popular demand). Let us know your favorite tool and we will include it in the comparison.

UA on Synthetic Data

  • 830 files
  • There are 110 duplicate clusters of exactly 2 elements, with file sizes ranging from 8,510 to 3,103,200 bytes
  • There is one duplicate cluster of five elements, each of 10,000,000 bytes
  • There are four duplicate clusters of 55, 67, 72 and 206 elements, where each file is 100,000 bytes
  • 315 files are unique

Results

Tool      Correct   Invocation                                             Performance [sec]   Ratio
ua        yes       time find . -type f | ua -2m256 -b2048 - > /dev/null   0.411               1.000
duff      yes       time duff -qr . > /dev/null                            1.689               4.109
fdf       yes       time find . -type f | fdf > /dev/null                  2.175               5.292
fdupes    yes       time fdupes -qr1 . > /dev/null                         21.274              51.762

Tool homepages:

  • duff: http://duff.sourceforge.net/
  • fdf: http://iredale.dyndns.org/Perl/find-duplicated-files.html
  • fdupes: http://premium.caribe.net/~adrian2/fdupes.html

UA on Organic Data

This is an actual development tree of a large web shop, together with third-party tools.

  • there are 28,456 files of various sizes: some images, some compiled files, many source files (Java, JSP), some jars ...
  • there are 4,932 clusters with duplicates, each containing 2 to 6 files (but predominantly 2)

Results

Tool      Correct   Invocation                                             Performance [sec]   Ratio
ua        yes       time find . -type f | ua -2m256 -b2048 - > /dev/null   1.521               1.000
fdf       yes       time find . -type f | fdf > /dev/null                  5.966               3.922
fdupes    yes       time fdupes -qr1 . > /dev/null                         16.207              10.655
duff      yes       time duff -qr . > /dev/null                            18.052              11.869

(Tool homepages are listed above.)

Performance Discussion

What are those seconds? To be fair, we ran each tool three times in a row and took the best real time of the three (see man time). This is to account for caching ... we are not trying to measure how fast we can open and close files. Except for the very first, "un-warmed-up" run, the times did not differ much between runs of the same tool. The test machine is quite a beast: a dual-core Dell with 4 GB of RAM.

All programs are correct in the sense that they return the same result. UA seems to be significantly faster than the other programs. Interestingly, fdupes and duff behave quite differently on organic and synthetic data. The reason is that fdupes double-checks duplicates by comparing them byte by byte, which hurts when there are large duplicates, while duff compares the candidates in pairs (thus comparing more often than needed, which hurts when there are many duplicates). fdf is a Perl program, and it is about 4 times slower than ua in both cases.

So why is ua faster? It uses some tricks. I should note that speed was not the reason to develop ua (see below), but the tricks that make ua faster may be generally useful. All these programs use the same obvious trick: ask the file system for the file sizes and get rid of files whose size is unique. In addition, ua uses the following tricks:

1. Optionally, perform a two-stage hash: hash a small prefix of each file first. This really speeds up processing when comparing large files of the same size, and more importantly it greatly speeds up white-space-ignoring comparison (which is the real reason we needed something like ua).
2. When there are only two files in a candidate cluster (e.g. two files of the same size), compare them byte by byte (buffered, of course), since computing MD5 would be more costly.
3. MD5 is a good hash: it is extremely unlikely (truly next to impossible) that two files with the same MD5 hash, the same prefix hash and the same size collide by accident. MD5 is also good for deriving a primitive-type hash (e.g. a size_t) by folding the 128-bit value. In fact, it is so good that there were 0 collisions in the experiments. This smaller hash is calculated when the MD5 value is derived and is used in the data structures both as the hash and as the means of ordering; this way, the full 128 bits only have to be compared when the files really are identical.
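To make the candidate-narrowing idea concrete, here is a minimal sketch; it is not ua's actual implementation. Files are grouped by size first, then by a hash of a small prefix, then by a full-content hash. std::hash over the raw bytes stands in for MD5 folded to a size_t, the 2048-byte prefix length and the tab-separated output are arbitrary illustration choices, and the byte-by-byte shortcut for two-file clusters is omitted for brevity.

    // Sketch of multi-stage duplicate-candidate narrowing (not ua's code).
    // Stage 1: group by file size.  Stage 2: hash a small prefix.
    // Stage 3: hash the whole content.  std::hash stands in for MD5.
    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Read at most `limit` bytes of `path` into a string.
    static std::string read_bytes(const std::string& path, std::size_t limit) {
        std::ifstream in(path, std::ios::binary);
        std::string buf(limit, '\0');
        if (limit > 0) in.read(&buf[0], static_cast<std::streamsize>(limit));
        buf.resize(static_cast<std::size_t>(in.gcount()));
        return buf;
    }

    int main() {
        const std::size_t kPrefix = 2048;           // arbitrary prefix length for this sketch

        // Line-oriented input: one file name per line on stdin (e.g. piped from find).
        std::vector<std::string> paths;
        for (std::string line; std::getline(std::cin, line); )
            if (!line.empty()) paths.push_back(line);

        // Stage 1: bucket by file size; a unique size cannot be part of a duplicate cluster.
        std::map<std::uintmax_t, std::vector<std::string>> by_size;
        for (const auto& p : paths) {
            std::error_code ec;
            const auto size = std::filesystem::file_size(p, ec);
            if (!ec) by_size[size].push_back(p);
        }

        const std::hash<std::string> h;             // stand-in for MD5 folded to size_t
        for (const auto& [size, group] : by_size) {
            if (group.size() < 2) continue;

            // Stage 2: hash a small prefix; Stage 3: hash the whole content.
            std::map<std::pair<std::size_t, std::size_t>, std::vector<std::string>> clusters;
            for (const auto& p : group) {
                const std::size_t prefix = h(read_bytes(p, kPrefix));
                const std::size_t full   = h(read_bytes(p, static_cast<std::size_t>(size)));
                clusters[{prefix, full}].push_back(p);
            }

            // Line-oriented output: one duplicate cluster per line, tab-separated
            // (illustrative format only, not necessarily ua's).
            for (const auto& [key, files] : clusters) {
                if (files.size() < 2) continue;
                for (std::size_t i = 0; i < files.size(); ++i)
                    std::cout << files[i] << (i + 1 < files.size() ? '\t' : '\n');
            }
        }
        return 0;
    }

The point of the staging is that cheap checks (size, prefix hash) eliminate most candidates before any full-content work is done; only files that survive every stage ever have their whole content read.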

Why did we need UA?

We needed ua because we were hunting for source duplicates (specifically Java Server Pages). We really did not care about white space, and moreover our team uses Linux while the client mostly uses Windows (and we all know the "\r\n" business well). We needed to be able to find duplicates even if they differ only in white space. The two-stage hashing really helps with this, too! This is what makes ua different from the other tools. On top of that we also have "ignore case", which was thrown in just because it was cheap to add; we did not really need it.
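As an illustration of what white-space-ignoring comparison means here, a minimal sketch follows; it is not ua's actual code, and as in the sketch above std::hash stands in for MD5. The idea is simply to drop the white-space characters from a file's bytes before hashing, so that files differing only in spaces, tabs or line endings produce the same digest.

    // Sketch: hash a file's content with traditional ASCII white space removed,
    // so "foo\r\n", "foo\n" and "foo " all hash the same.
    // std::hash stands in for MD5; ua's real option and behavior may differ.
    #include <cctype>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <string>

    static std::size_t whitespace_ignoring_hash(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::string kept;
        for (char c; in.get(c); )
            if (!std::isspace(static_cast<unsigned char>(c)))   // drop ' ', '\t', '\r', '\n', ...
                kept.push_back(c);
        return std::hash<std::string>{}(kept);                   // hash only the non-space bytes
    }

    int main(int argc, char** argv) {
        // Prints one "hash <tab> path" line per argument; equal hashes mark
        // candidates that are identical up to white space.
        for (int i = 1; i < argc; ++i)
            std::cout << whitespace_ignoring_hash(argv[i]) << '\t' << argv[i] << '\n';
    }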

The other design principle was to make it line oriented. Unlike fdupes, ua was specifically designed to work with find, ls, grep ... because we had more complicated things to do with duplicates than just removing them. To appreciate this, here is what we did:

  • If the duplicate was not referenced from files under its own web root, delete it
  • If the duplicate was referenced from several web roots, update the references by sed (yes, the Unix tool), move one copy to a shared location and then delete the duplicates
  • Plus a number of other complications, which we won't bore you with

So ua was designed to be embedded in shell, Perl, etc. scripts: it takes lines of file names as input and produces lines as output, which can be passed on through shell pipes.
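As a toy illustration of that lines-in, lines-out contract, a downstream step can itself be a tiny filter. The sketch below assumes the same hypothetical one-cluster-per-line, tab-separated output as the earlier sketch (not necessarily ua's real format): it keeps the first file of each cluster and prints the remaining ones as removal candidates for further scripting.

    // Toy downstream filter for a duplicate-finder pipeline: read one cluster
    // per line (tab-separated file names, an assumed format), keep the first
    // file and print the rest, one per line, for whatever comes next.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        for (std::string line; std::getline(std::cin, line); ) {
            std::istringstream cluster(line);
            std::string path;
            bool first = true;                       // the first path in a cluster is kept
            while (std::getline(cluster, path, '\t')) {
                if (!first && !path.empty()) std::cout << path << '\n';
                first = false;
            }
        }
    }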

I hope this helps to explain why we have ua. Viktor Szathmary should be credited with ua being open source: I just put it together quickly in C++ and used it internally; he had the idea that other people could make use of it. To be fair, I shall add a final note: white-space ignoring means traditional white space, not Unicode white space. I *$#! Unicode so much that I am not planning to improve on this.

Enjoy.