About

Find duplicate files and directories.

Like other tools, we use file hashes, but we additionally report duplicate directories, using a Merkle tree to calculate directory hashes.
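
The idea is roughly the following (a minimal sketch only -- the function names are made up here, and the actual implementation in the findsame package differs in detail):

    import hashlib
    import os

    def file_hash(path, blocksize=256 * 1024):
        # Hash a file's content block-wise.
        h = hashlib.sha1()
        with open(path, "rb") as fd:
            for block in iter(lambda: fd.read(blocksize), b""):
                h.update(block)
        return h.hexdigest()

    def dir_hash(path):
        # Merkle tree: a dir's hash is the hash of its children's hashes,
        # so dirs with identical content end up with identical hashes.
        # Note that an empty dir then hashes like an empty file (sha1 of
        # nothing), as in the example output below.
        hashes = []
        for entry in sorted(os.listdir(path)):
            full = os.path.join(path, entry)
            if os.path.isdir(full):
                hashes.append(dir_hash(full))
            elif os.path.isfile(full):
                hashes.append(file_hash(full))
        return hashlib.sha1("".join(sorted(hashes)).encode()).hexdigest()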

To increase performance, we use

  • parallel hash calculation (-t/--nthreads option), see Benchmarks below
  • optional limits on data to be hashed (-l/--limit option)

Install

From PyPI:

    $ pip install findsame

Dev install of this repo:

    $ git clone ...
    $ cd findsame
    $ pip install -e .

The core part (package findsame and the CLI bin/findsame) has no external dependencies. If you want to run the benchmarks (see "Benchmarks" below), install:

    $ pip install -r requirements_benchmark.txt

Usage

usage: findsame [-h] [-b BLOCKSIZE] [-l LIMIT] [-p NPROCS] [-t NTHREADS]
                [-o OUTMODE] [-v]
                file/dir [file/dir ...]

Find same files and dirs based on file hashes.

positional arguments:
  file/dir              files and/or dirs to compare

optional arguments:
  -h, --help            show this help message and exit
  -b BLOCKSIZE, --blocksize BLOCKSIZE
                        blocksize in hash calculation, use units K,M,G as in
                        100M, 256K or just 1024 (bytes), if LIMIT is used and
                        BLOCKSIZE < LIMIT then we require mod(LIMIT,
                        BLOCKSIZE) = 0 else we set BLOCKSIZE = LIMIT [default:
                        256.0K]
  -l LIMIT, --limit LIMIT
                        read limit (bytes, see also BLOCKSIZE), calculate hash
                        only over the first LIMIT bytes, makes things go
                        faster for many large files, try 512K [default: None]
  -p NPROCS, --nprocs NPROCS
                        number of parallel processes [default: 1]
  -t NTHREADS, --nthreads NTHREADS
                        threads per process [default: 4]
  -o OUTMODE, --outmode OUTMODE
                        1: list of dicts (values of dict from mode 2), one
                        dict per hash, 2: dict of dicts (full result), keys
                        are hashes, 3: compact, sort by type (file, dir)
                        [default: 3]
  -v, --verbose         enable verbose/debugging output
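
The BLOCKSIZE/LIMIT interplay from the help text, spelled out as code (a sketch of the rule only, not the package's actual implementation):

    def adjust_blocksize(blocksize, limit):
        # Only relevant when a read limit is set.
        if limit is not None:
            if blocksize < limit:
                # The limit must be hit exactly at a block boundary.
                assert limit % blocksize == 0, "need mod(LIMIT, BLOCKSIZE) = 0"
            else:
                # Never read more than LIMIT bytes at once.
                blocksize = limit
        return blocksize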

The output format is JSON (see -o/--outmode; the default is -o 3). An example using the test suite data:

    $ cd findsame/tests
    $ findsame data | jq .
    {
      "dir:empty": [
        [
          "data/dir2/empty_dir",
          "data/dir2/empty_dir_copy",
          "data/empty_dir",
          "data/empty_dir_copy"
        ]
      ],
      "dir": [
        [
          "data/dir1",
          "data/dir1_copy"
        ]
      ],
      "file:empty": [
        [
          "data/dir2/empty_dir/empty_file",
          "data/dir2/empty_dir_copy/empty_file",
          "data/empty_dir/empty_file",
          "data/empty_dir_copy/empty_file",
          "data/empty_file",
          "data/empty_file_copy"
        ]
      ],
      "file": [
        [
          "data/dir1/file2",
          "data/dir1/file2_copy",
          "data/dir1_copy/file2",
          "data/dir1_copy/file2_copy",
          "data/file2"
        ],
        [
          "data/lena.png",
          "data/lena_copy.png"
        ],
        [
          "data/file1",
          "data/file1_copy"
        ]
      ]
    }

This returns a dict whose keys are the path types (file, dir). The values are nested lists; each sub-list contains paths having the same hash. Note that we also report empty files and dirs.

Use jq for pretty-printing, e.g.

    $ findsame data | jq .

    # keep colors in less(1)
    $ findsame data | jq . -C | less -R

To check out large amounts of data (as in GiB) for the first time, use the -l/--limit option for speed, and use less -n as well (without line numbers, less doesn't wait for the whole input to load):

    $ findsame -l512K data | jq . -C | less -nR

Post-processing is limited only by your ability to process JSON (using jq, Python, ...).
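
For instance, to read the default -o3 output into Python (a sketch; calling findsame via subprocess is just one way to get at the JSON):

    import json
    import subprocess

    # Run findsame and parse its JSON output (-o3, the default format).
    proc = subprocess.run(["findsame", "data"], capture_output=True, text=True)
    result = json.loads(proc.stdout)

    # result maps path type to a nested list of equal-hash path groups.
    for group in result.get("file", []):
        print("equal files:", group)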

Note that the order of key-value entries in the output from both findsame and jq is not deterministic.

Note that currently, we skip symlinks.

Performance

Parallel hash calculation

By default, we use --nthreads equal to the number of cores. See "Benchmarks" below.

Limit data to be hashed

Apart from parallelization, by far the most speed is gained by using --limit. Note that this may lead to false positives if files happen to be identical in the first LIMIT bytes but differ later. Finding a good enough value is a matter of trial and error; try 512K, which is still quite fast and seems to cover most real-world data.
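
If in doubt, groups found with --limit can be verified by hashing the full file content afterwards. A sketch using hashlib (this is not part of findsame):

    import hashlib
    from collections import defaultdict

    def full_hash(path, blocksize=256 * 1024):
        # Hash the complete file, no read limit.
        h = hashlib.sha1()
        with open(path, "rb") as fd:
            for block in iter(lambda: fd.read(blocksize), b""):
                h.update(block)
        return h.hexdigest()

    def verify(group):
        # Split a candidate group (files equal in the first LIMIT bytes)
        # into groups of files that are truly identical.
        by_hash = defaultdict(list)
        for path in group:
            by_hash[full_hash(path)].append(path)
        return [paths for paths in by_hash.values() if len(paths) > 1]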

Tests

Run nosetests, pytest, or any other test runner with test discovery.

Benchmarks

You may run the benchmark script to find the best blocksize and number of threads and/or processes for hash calculations on your machine.

    $ cd findsame/benchmark
    $ ./clean.sh
    $ ./benchmark.py
    $ ./plot.py

This writes test files of various sizes to benchmark/files and runs a couple of benchmarks (runtime ~10 min for all benchmarks). Make sure to avoid running other IO-intensive tasks while the benchmarks run.

The default value of "maxsize" in benchmark.py (in the __main__ part) is only a few MiB to allow quick testing. Change it to, say, 1 GiB in order to get meaningful benchmarks.

Observations:

  • blocksizes below 512 KiB (-b/--blocksize 512K) work best for all file sizes on most systems, even though the difference to the worst timings is at most a factor of 1.25 (e.g. 1 vs. 1.25 seconds)
  • multithreading (-t/--nthreads): up to 2x speedup on a dual-core box -- very efficient; use NTHREADS = number of cores for good baseline performance (the problem is mostly IO-bound)
  • multiprocessing (-p/--nprocs): less efficient speedup, but on some systems combining NPROCS and NTHREADS is even a bit faster than NTHREADS alone; testing is mandatory
  • runtime increases linearly with file size, of course

Output modes

Default (-o3)

The default output format is -o3 (same as the initial example above).

    $ findsame -o3 data | jq .
    {
      "dir:empty": [
        [
          "data/dir2/empty_dir",
          "data/dir2/empty_dir_copy",
          "data/empty_dir",
          "data/empty_dir_copy"
        ]
      ],
      "dir": [
        [
          "data/dir1",
          "data/dir1_copy"
        ]
      ],
      "file:empty": [
        [
          "data/dir2/empty_dir/empty_file",
          "data/dir2/empty_dir_copy/empty_file",
          "data/empty_dir/empty_file",
          "data/empty_dir_copy/empty_file",
          "data/empty_file",
          "data/empty_file_copy"
        ]
      ],
      "file": [
        [
          "data/dir1/file2",
          "data/dir1/file2_copy",
          "data/dir1_copy/file2",
          "data/dir1_copy/file2_copy",
          "data/file2"
        ],
        [
          "data/lena.png",
          "data/lena_copy.png"
        ],
        [
          "data/file1",
          "data/file1_copy"
        ]
      ]
    }

Output with hashes (-o2)

    $ findsame -o2 data | jq .
    {
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": {
        "dir:empty": [
          "data/dir2/empty_dir",
          "data/dir2/empty_dir_copy",
          "data/empty_dir",
          "data/empty_dir_copy"
        ],
        "file:empty": [
          "data/dir2/empty_dir/empty_file",
          "data/dir2/empty_dir_copy/empty_file",
          "data/empty_dir/empty_file",
          "data/empty_dir_copy/empty_file",
          "data/empty_file",
          "data/empty_file_copy"
        ]
      },
      "55341fe74a3497b53438f9b724b3e8cdaf728edc": {
        "dir": [
          "data/dir1",
          "data/dir1_copy"
        ]
      },
      "9619a9b308cdebee40f6cef018fef0f4d0de2939": {
        "file": [
          "data/dir1/file2",
          "data/dir1/file2_copy",
          "data/dir1_copy/file2",
          "data/dir1_copy/file2_copy",
          "data/file2"
        ]
      },
      "0a96c2e755258bd46abdde729f8ee97d234dd04e": {
        "file": [
          "data/lena.png",
          "data/lena_copy.png"
        ]
      },
      "312382290f4f71e7fb7f00449fb529fce3b8ec95": {
        "file": [
          "data/file1",
          "data/file1_copy"
        ]
      }
    }

The output is one dict (JSON object) in which all files/dirs with the same hash are grouped under that hash as the key.
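
This form is convenient when you want to key on hashes, e.g. to summarize the groups in Python (a sketch reading -o2 output from stdin; the script name below is made up):

    import json
    import sys

    # -o2 output: {hash: {path_type: [paths, ...]}, ...}
    result = json.load(sys.stdin)

    for hsh, groups in result.items():
        for typ, paths in groups.items():
            print(f"{typ} {hsh[:10]}: {len(paths)} equal paths")

Run it as findsame -o2 data | python summary.py.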

Dict values (-o1)

The format -o1 lists only the dict values from -o2, i.e. a list of dicts.

    $ findsame -o1 data | jq .
    [
      {
        "dir:empty": [
          "data/dir2/empty_dir",
          "data/dir2/empty_dir_copy",
          "data/empty_dir",
          "data/empty_dir_copy"
        ],
        "file:empty": [
          "data/dir2/empty_dir/empty_file",
          "data/dir2/empty_dir_copy/empty_file",
          "data/empty_dir/empty_file",
          "data/empty_dir_copy/empty_file",
          "data/empty_file",
          "data/empty_file_copy"
        ]
      },
      {
        "dir": [
          "data/dir1",
          "data/dir1_copy"
        ]
      },
      {
        "file": [
          "data/file1",
          "data/file1_copy"
        ]
      },
      {
        "file": [
          "data/dir1/file2",
          "data/dir1/file2_copy",
          "data/dir1_copy/file2",
          "data/dir1_copy/file2_copy",
          "data/file2"
        ]
      },
      {
        "file": [
          "data/lena.png",
          "data/lena_copy.png"
        ]
      }
    ]

More usage examples

Here we show examples of common post-processing tasks using jq. When the jq command works for all three output modes, we don't specify the -o option.

Count the total number of all equals:

    $ findsame data | jq '.[]|.[]|.[]' | wc -l

Find only groups of equal dirs:

    $ findsame -o1 data | jq '.[]|select(.dir)|.dir'
    $ findsame -o2 data | jq '.[]|select(.dir)|.dir'
    $ findsame -o3 data | jq '.dir|.[]'
    [
      "data/dir1",
      "data/dir1_copy"
    ]

Groups of equal files:

    $ findsame -o1 data | jq '.[]|select(.file)|.file'
    $ findsame -o2 data | jq '.[]|select(.file)|.file'
    $ findsame -o3 data | jq '.file|.[]'
    [
      "data/dir1/file2",
      "data/dir1/file2_copy",
      "data/dir1_copy/file2",
      "data/dir1_copy/file2_copy",
      "data/file2"
    ]
    [
      "data/lena.png",
      "data/lena_copy.png"
    ]
    [
      "data/file1",
      "data/file1_copy"
    ]

Find the first element in a group of equal items (file or dir):

    $ findsame data | jq '.[]|.[]|[.[0]]'
    [
      "data/lena.png"
    ]
    [
      "data/dir2/empty_dir"
    ]
    [
      "data/dir2/empty_dir/empty_file"
    ]
    [
      "data/dir1/file2"
    ]
    [
      "data/file1"
    ]
    [
      "data/dir1"
    ]

or, more compactly, without the length-1 lists:

    $ findsame data | jq '.[]|.[]|.[0]'
    "data/dir2/empty_dir"
    "data/dir2/empty_dir/empty_file"
    "data/dir1/file2"
    "data/lena.png"
    "data/file1"
    "data/dir1"

Find all but the first element in a group of equal items (file or dir):

    $ findsame data | jq '.[]|.[]|.[1:]'
    [
      "data/dir1_copy"
    ]
    [
      "data/lena_copy.png"
    ]
    [
      "data/dir1/file2_copy",
      "data/dir1_copy/file2",
      "data/dir1_copy/file2_copy",
      "data/file2"
    ]
    [
      "data/dir2/empty_dir_copy/empty_file",
      "data/empty_dir/empty_file",
      "data/empty_dir_copy/empty_file",
      "data/empty_file",
      "data/empty_file_copy"
    ]
    [
      "data/dir2/empty_dir_copy",
      "data/empty_dir",
      "data/empty_dir_copy"
    ]
    [
      "data/file1_copy"
    ]

And more compactly:

    $ findsame data | jq '.[]|.[]|.[1:]|.[]'
    "data/file1_copy"
    "data/dir1/file2_copy"
    "data/dir1_copy/file2"
    "data/dir1_copy/file2_copy"
    "data/file2"
    "data/lena_copy.png"
    "data/dir2/empty_dir_copy/empty_file"
    "data/empty_dir/empty_file"
    "data/empty_dir_copy/empty_file"
    "data/empty_file"
    "data/empty_file_copy"
    "data/dir2/empty_dir_copy"
    "data/empty_dir"
    "data/empty_dir_copy"
    "data/dir1_copy"

The last one can be used to collect all but the first item in each group of equal files/dirs, e.g. copy them to a separate directory before removing them:

    $ findsame data | jq '.[]|.[]|.[1:]|.[]' | xargs cp -rvt duplicates/
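
The same with an explicit dry run in Python (a sketch; review the printed list before doing any actual removal):

    import json
    import subprocess

    proc = subprocess.run(["findsame", "data"], capture_output=True, text=True)
    result = json.loads(proc.stdout)

    # All but the first path in each group of equal files/dirs.
    dupes = [path for groups in result.values()
             for group in groups for path in group[1:]]

    for path in dupes:
        print("would remove:", path)
        # After review: shutil.rmtree() for dirs, os.remove() for files.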

Other tools

fdupes, jdupes, duff, rdfind, rmlint, findup (from fslint)