Find exact duplicate files across one or more directories and reclaim the wasted space —
safely. DupeSweep never deletes anything by default: it reports first, and when you do ask it
to act, --apply quarantine moves duplicates into a folder with a manifest so the move can be
undone with dsweep restore.
Most "find duplicates" scripts either trust file size alone (wrong — two different files can be the same size) or hash every file up front (slow — hashing a 4 GB video you'll never compare against anything is wasted work). DupeSweep runs a three-stage funnel so the expensive work only happens on real candidates:
| Stage | Cost | What it does |
|---|---|---|
| 1. Group by size | Free (already have it from the directory listing) | Files with a unique size can't have a duplicate — discarded immediately. |
| 2. Quick hash | Cheap | SHA-256 of just the first 64 KB, computed in parallel across CPU cores. Narrows same-size files down to real candidates. |
| 3. Full hash | Only for survivors | Full-file SHA-256 confirms a true byte-for-byte match before anything is reported as a duplicate. |
The result: a ~/Downloads or photo library scan with tens of thousands of files only fully
hashes the handful that are actually worth comparing.
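
The funnel above can be sketched in a few lines. This is a minimal, sequential illustration in Python, not DupeSweep's own C# implementation: the function names `sha256_of`, `regroup`, and `find_duplicates` are hypothetical, and the parallel hashing described in the table is left out for brevity.

```python
import hashlib
import os
from collections import defaultdict

QUICK_BYTES = 64 * 1024  # quick-hash window from the table above: first 64 KB


def sha256_of(path, limit=None, chunk_size=1024 * 1024):
    """SHA-256 of a file; if limit is set, hash only the first `limit` bytes."""
    digest = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            to_read = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(to_read)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def regroup(groups, key):
    """Split each group by key(path) and drop groups left with a single file."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def find_duplicates(paths):
    """Return groups of paths whose full-file SHA-256 digests match."""
    by_size = regroup([list(paths)], os.path.getsize)                 # stage 1
    by_quick = regroup(by_size, lambda p: sha256_of(p, QUICK_BYTES))  # stage 2
    return regroup(by_quick, sha256_of)                               # stage 3
```

Each stage only sees the groups that survived the previous one, so a file with a unique size is never opened, and a full-file hash is computed only for files that already agree on size and on their first 64 KB.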
DupeSweep targets .NET 10. Build a single-file executable from source:
git clone https://github.com/amasen02/dupesweep.git
cd dupesweep
# Framework-dependent single file (uses the installed .NET 10 runtime):
dotnet publish src/DupeSweep -c Release -r win-x64 --self-contained false -p:PublishSingleFile=true -o dist
# Linux: -r linux-x64; macOS (Apple Silicon): -r osx-arm64
# …or fully self-contained (no runtime needed on the target machine):
dotnet publish src/DupeSweep -c Release -r linux-x64 --self-contained true -p:PublishSingleFile=true -o dist

git clone https://github.com/amasen02/dupesweep.git
cd dupesweep
docker build -t dupesweep .
# Mount the directory you want to scan at /scan:
docker run --rm -v "$PWD":/scan dupesweep /scan
docker run --rm -v "$PWD":/scan dupesweep /scan --json
docker run --rm -v "$PWD":/scan dupesweep /scan --apply quarantine

The image runs the full test suite as part of the build (so a broken build never produces an
image) and runs as the container's built-in unprivileged app user, not root.
The binary is named dsweep (dist/dsweep, or dist\dsweep.exe on Windows). Put dist on
your PATH to call dsweep from anywhere. To run without publishing:
dotnet run --project src/DupeSweep -- <dir> [options]

dsweep <dir> [dir2 ...] [options]
dsweep restore <manifest.json> [--dry-run]
dsweep ~/Downloads

[1] 4.2 MB x 3 copies (sha256 74c18fa049de…)
KEEP /home/ama/Downloads/report.pdf
DUPLICATE /home/ama/Downloads/report (1).pdf
DUPLICATE /home/ama/Downloads/old/report.pdf
dsweep: 1 duplicate group(s), 2 reclaimable file(s), 8.4 MB reclaimable.
# Only image files, keep the oldest copy of each duplicate
dsweep ~/Photos ~/Backups/Photos --ext .jpg,.png --keep oldest
# Skip noisy directories, machine-readable output
dsweep . --exclude node_modules --exclude .git --json
# Ignore anything smaller than 1 MB, cap hashing to 8 threads
dsweep /data --min-size 1M --parallel 8

dsweep ~/Downloads --apply quarantine

This moves every duplicate (never the file chosen to keep) into
~/Downloads/.dupesweep-quarantine/, grouped by duplicate set, and writes a manifest.json
recording each file's original location. Review the quarantine folder, then either delete it
once you're confident, or undo the whole thing:
dsweep restore ~/Downloads/.dupesweep-quarantine/manifest.json --dry-run # preview
dsweep restore ~/Downloads/.dupesweep-quarantine/manifest.json # actually restore

Restore never overwrites a file that already exists at the destination: if you recreated a file since quarantining it, that entry is skipped and reported instead of clobbering your work.
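
The quarantine-then-restore cycle can be illustrated with a small Python sketch. It is not DupeSweep's code, and it makes assumptions: the manifest field names `original` and `stored` are hypothetical (the real manifest.json layout is not documented here), and files are stored flat with an index prefix rather than grouped by duplicate set.

```python
import json
import os
import shutil


def quarantine(duplicates, quarantine_dir):
    """Move each duplicate into quarantine_dir and record where it came from."""
    os.makedirs(quarantine_dir, exist_ok=True)
    entries = []
    for index, path in enumerate(duplicates):
        original = os.path.abspath(path)
        # Index prefix (an assumption) keeps same-named files from colliding.
        stored = os.path.join(
            os.path.abspath(quarantine_dir), f"{index}_{os.path.basename(original)}"
        )
        shutil.move(original, stored)
        entries.append({"original": original, "stored": stored})
    manifest_path = os.path.join(quarantine_dir, "manifest.json")
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=2)
    return manifest_path


def restore(manifest_path, dry_run=False):
    """Move files back; skip any entry whose original path exists again."""
    with open(manifest_path, encoding="utf-8") as handle:
        entries = json.load(handle)
    restored, skipped = [], []
    for entry in entries:
        if os.path.exists(entry["original"]):
            skipped.append(entry["original"])  # never overwrite newer work
            continue
        if not dry_run:
            os.makedirs(os.path.dirname(entry["original"]), exist_ok=True)
            shutil.move(entry["stored"], entry["original"])
        restored.append(entry["original"])
    return restored, skipped
```

With `dry_run=True` the function only reports what would be restored and what would be skipped, which mirrors the preview-then-apply flow of the two restore commands.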
For an irreversible cleanup once you've reviewed the report, --apply delete permanently
deletes duplicates instead of quarantining them. There is no undo for delete — use
quarantine unless you are certain.
| Flag | Meaning |
|---|---|
| --no-recursive | Only scan the given directories, not subdirectories. |
| --min-size <size> | Ignore files smaller than this (default 1 byte; 0 includes empty files). Accepts K/M/G suffixes. |
| --ext <list> | Only consider these extensions, e.g. .jpg,.png. |
| --exclude <glob> | Exclude files/directories matching a glob (repeatable), e.g. node_modules, *.tmp. |
| --follow-symlinks | Follow symlinked files and directories (off by default; no cycle detection). |
| --keep <strategy> | Which copy survives per group: first (default, scan order), oldest, newest, shortest-path. |
| --apply <mode> | quarantine (reversible, recommended) or delete (permanent). Omit to just report. |
| --quarantine-dir <path> | Destination for --apply quarantine (default: <first-dir>/.dupesweep-quarantine). |
| -j, --parallel <n> | Hashing worker count (default: CPU core count, max 64). |
| --json | Emit a machine-readable JSON report instead of text. |
| -q / -v | Quiet / verbose (verbose reports files skipped because they couldn't be read). |
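
To make the four --keep strategies concrete, here is a minimal Python sketch of how a keeper could be chosen from one duplicate group. It is an illustration under assumptions, not the logic in KeepSelector.cs: `choose_keeper` and `split_group` are hypothetical names, "oldest" and "newest" are assumed to compare modification times, and "shortest-path" is assumed to compare path length in characters.

```python
import os


def choose_keeper(paths, strategy="first"):
    """Pick the one file in a duplicate group that is kept."""
    if strategy == "first":
        return paths[0]  # scan order
    if strategy == "oldest":
        return min(paths, key=os.path.getmtime)  # assumption: by mtime
    if strategy == "newest":
        return max(paths, key=os.path.getmtime)  # assumption: by mtime
    if strategy == "shortest-path":
        return min(paths, key=len)  # assumption: fewest characters
    raise ValueError(f"unknown keep strategy: {strategy}")


def split_group(paths, strategy="first"):
    """Return (keeper, duplicates) for one group of identical files."""
    keeper = choose_keeper(paths, strategy)
    return keeper, [p for p in paths if p != keeper]
```

Whatever the strategy, exactly one file per group is kept and every other file in the group becomes a candidate for --apply.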
dotnet test DupeSweep.slnx

The suite is deterministic: every test that touches the filesystem creates its own temp
directory (see tests/DupeSweep.Tests/Support/TempDirectory.cs) and cleans up after itself. It
covers the full pipeline — scanning/filtering, the size → quick-hash → full-hash funnel
(including same-size-but-different-content files, which must not be grouped), all four keep
strategies, quarantine collision handling, restore's no-overwrite guarantee, CLI parsing, and
both report formats. CI runs the suite on both Ubuntu and Windows.
src/DupeSweep/
  Program.cs             CLI orchestration: parse -> scan -> find duplicates -> report -> apply
  CommandLine.cs         argv parsing + usage (scan mode and restore mode)
  ScanOptions.cs         parsed configuration + ApplyMode/KeepStrategy enums
  FileScanner.cs         recursive walk with size/extension/exclude/symlink filtering
  Hashing.cs             two-tier SHA-256: 64 KB quick hash, then full-file hash
  DuplicateFinder.cs     the size -> quick-hash -> full-hash funnel, parallelised
  KeepSelector.cs        chooses which file in a group survives, per strategy
  QuarantineService.cs   moves duplicates to a quarantine folder + writes the manifest
  RestoreService.cs      reads a manifest and moves files back, never overwriting
  ReportWriter.cs        text and JSON rendering
  Format.cs / Models.cs  byte/duration formatting + records
tests/DupeSweep.Tests/   xUnit tests against isolated temp directories
Contributions are welcome — bug fixes, new keep/apply strategies, better docs. See
CONTRIBUTING.md for the workflow and coding bar, and please be mindful of
the Code of Conduct. Use the issue templates; green CI (build + test
on Ubuntu and Windows) is required on every pull request. Report security issues privately per
SECURITY.md — never as a public issue.
This project is, and will remain, free and open source. As maintainer I commit to:
- A permissive licence, kept stable. MIT — use it commercially, fork it, build on it. No relicensing of accepted contributions.
- No CLA. Contributions are accepted under the MIT licence; you keep the copyright to your work.
- An honest history. Real, walkable commits — no fabricated activity, no rewritten releases.
- Best-effort, transparent triage. Issues and pull requests are read and answered; security reports are acknowledged within 72 hours.
- A welcoming community governed by the Code of Conduct.
- Reproducible builds. Green CI — build, tests on two OSes, and CodeQL security analysis — on every change.
MIT — see LICENSE. You are free to use, modify, and distribute this software,
including for commercial purposes, provided the copyright notice is retained.
Ama Senevirathne — GitHub