euedge edited this page Aug 22, 2011 · 1 revision

UA MAN Page

NAME

ua - find sets of identical files (the name comes from the Hungarian word "ugyanaz", meaning "the same")

SYNOPSIS

ua [__OPTION__]... [__FILE__]...

DESCRIPTION

Given a list of files, ua finds the sets of identical ones. ua was designed to take input from find or ls and to produce output that is trivial to process with line-oriented tools such as sed, xargs, awk, wc, and grep. For example, to count the sets of duplicates:

   
    $ find ~ -type f | ua - | wc -l 

or to find the largest such set:

    $ find ~ -type f | ua -s\; - | \
      awk -F\; '{if (NF>M) { M=NF;S=$0;}} END {print(S);}' 

OPTIONS

    -i        ignore letter case
    -w        ignore white spaces
    -n        do not ask the file system for file size
    -v        verbose output (prints stuff to stderr), verbose help
    -m max    consider only the first max bytes in the hash
    -2        perform two stage hashing, first hash on the prefix of size set with -m and throw away candidates with unique prefix hashes
    -s sep    separator (default SPACE)
    -p        also print the hash value
    -b size   set internal buffer size (default 1024)
    -h        this help (-vh more verbose help)
    -         read file names from stdin (this must be the last option in the list)

OUTPUT

Each line of the output represents one set of identical files. The columns are the path names, separated by sep (set with -s). When -p is set, the first column is the hash value. Remember that if -i or -w is set, the hash value will likely differ from what md5sum would give.
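Because each line is one set, a short awk filter is enough to summarize the output. The sketch below counts the redundant copies, meaning every file beyond the first in each set. The helper name redundant_copies and the sample lines are made up for illustration; the sample stands in for real ua output and assumes -p was not given.

```shell
# Hypothetical helper: counts the files beyond the first in each set.
# Expects ua-style lines on stdin; $1 is the separator that was passed to -s.
redundant_copies() {
    awk -F"$1" '{ extra += NF - 1 } END { print extra + 0 }'
}

# Made-up sample standing in for the output of: ua -s\; ...
printf '%s\n' 'a.txt;b.txt;c.txt' 'x.h;y.h' | redundant_copies ';'   # prints 3
```

With real data, the same filter would sit at the end of a pipeline such as find ~ -type f | ua -s\; - | redundant_copies ';'.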

ALGORITHM

Calculation proceeds in three steps:

1. Ask the FS for file size and throw away files with unique byte counts.
2. If so requested (-2), calculate a fast hash on a fixed-size prefix (given by -m) of the files with the same byte count and throw away the ones with unique prefix hash values.
3. The still matching files will go through a full MD5 hash; the files with the same hash will be deemed identical.

-w implies -n, since the byte count is irrelevant information in this case. The two-stage hashing algorithm first calculates identical sets considering only a fixed-size prefix (thus the -2 option requires -m) and then calculates the final result from these sets. This can be much faster when there are many files of the same size or when comparing files with white space ignored. When -w and -m max are both set, max refers to the first max non-white-space characters.
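The size-then-hash idea can be approximated with standard tools. The following is a rough sketch, not ua's implementation: the function name dup_sets is made up, the optional prefix stage (-2) is left out, and file names containing tabs or newlines are not handled.

```shell
# Sketch only. Reads file names from stdin, one per line, and prints one
# line per set of identical files, members separated by a single space.
dup_sets() {
    # Step 1: pair every path with its byte count (tab separated)
    while IFS= read -r f; do
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done |
    # Throw away files whose byte count is unique
    awk -F'\t' '
        { n[$1]++; line[NR] = $0; size[NR] = $1 }
        END { for (i = 1; i <= NR; i++) if (n[size[i]] > 1) print line[i] }' |
    cut -f2- |
    # Step 3: full MD5 of the remaining candidates
    while IFS= read -r f; do
        printf '%s\t%s\n' "$(md5sum < "$f" | cut -d' ' -f1)" "$f"
    done |
    # Group the paths by hash and print the groups with two or more members
    awk -F'\t' '
        { if ($1 in set) set[$1] = set[$1] " " $2
          else { set[$1] = $2; order[++k] = $1 }
          cnt[$1]++ }
        END { for (i = 1; i <= k; i++) if (cnt[order[i]] > 1) print set[order[i]] }'
}
```

It is used like ua with the - option, for instance ls | dup_sets. It spawns one process per file, so it will be far slower than ua on large trees.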

EXAMPLES

Get help on usage

    $ ua -h
    $ ua -vh 

Find identical files in the current directory

    $ ua *
    $ ls | ua -p - 

In the first case, the file names are read from the command line, while in the second they are read from the standard input. The latter also prints the hash value.

Compare text files

    $ ua -iwvb256 f1.txt f2.txt f3.txt 

Compares the three files, ignoring letter case and white space. Intermediate steps are reported on stderr (-v). Since -w implies -n, the files are not first grouped by size. The internal buffer size is reduced to 256, since skipping white space causes data to be moved within the buffer.
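To see why -i and -w can make files with different bytes count as identical, it may help to hash a normalized copy of each file by hand. This is only an illustration: the helper name normalized_md5 is made up, and which characters ua treats as white space is an assumption here (tr's [:space:] class).

```shell
# Hypothetical helper: MD5 of a file after dropping white space and folding
# letters to lower case. ua's exact normalization rules may differ.
normalized_md5() {
    tr -d '[:space:]' < "$1" | tr '[:upper:]' '[:lower:]' | md5sum | cut -d' ' -f1
}
```

Two files that differ only in case and spacing give the same normalized_md5 value, while plain md5sum reports different hashes for them.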

Count the sets of identical files under home

    $ find ~ -type f | ua -2m256 - | wc -l 

Considering the large number of files, the calculation will be performed with a two stage hash (-2). Only files that pass the 256 byte prefix hash will be fully hashed.
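The prefix stage can be pictured as hashing only the head of each file. In this sketch the helper name prefix_md5 is made up, and MD5 merely stands in for the fast prefix hash, which is not named here.

```shell
# Hypothetical helper: MD5 of the first N bytes of a file (N defaults to 256).
prefix_md5() {
    head -c "${2:-256}" "$1" | md5sum | cut -d' ' -f1
}
```

Files whose prefix_md5 values differ cannot be identical, so only files that share a prefix hash need the more expensive full hash.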

Find identical header files

    $ find /usr/include -name '*.h' | ua -b256 -wm256 -2s, - 

Ignore white space with -w (hence the smaller buffer, -b256). Perform the calculation in two stages (-2), first clustering on the first 256 non-white-space characters (-m256). Also, separate the identical files in the output with commas (-s,).

AUTHOR

(c) Istvan T. Hernadvolgyi (istvan.hernadvolgyi@gmail.com), EU.EDGE LLC (http://www.euedge.com), 2007

LICENSE

This is free software. You may redistribute copies of it under the terms of the MOZILLA PUBLIC LICENSE VERSION 1.1 (http://www.mozilla.org/MPL/MPL-1.1.txt). There is NO WARRANTY, to the extent permitted by law.

SEE ALSO

__MD5__(3), __md5sum__(1), __find__(1)