This script scans a directory tree and identifies duplicate files with a given file extension. It uses SHA256 hashing to compare the files and outputs the duplicate matches to a CSV file.
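The hash-based comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `sha256_of` and `find_duplicates` and the chunk size are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(dirpath: str, ext: str) -> Dict[str, List[Path]]:
    """Group files under dirpath with the given extension by content hash.

    Hypothetical helper: only hashes shared by two or more files are returned.
    """
    groups: Dict[str, List[Path]] = defaultdict(list)
    pattern = f"*.{ext.lstrip('.')}"
    # rglob walks the whole directory tree, not just the top level.
    for path in Path(dirpath).rglob(pattern):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}
```

Two files land in the same group only when their contents are byte-for-byte identical, regardless of file name. Note that the glob pattern in this sketch is case-sensitive on most non-Windows systems, so `pdf` would not match `report.PDF`.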
File signatures courtesy of: fleep @ua-nick
Python 3.8 or higher
- Clone the repository:
git clone https://github.com/dfirsec/dup_file_finder.git
- Navigate to the project directory:
cd dup_file_finder
- Install the dependencies using Poetry:
poetry install
- Activate the virtual environment:
poetry shell
- Run the script with the following command:
python dup_file_finder.py dirpath ext
dirpath: The directory path to scan for duplicate files.
ext: The file extension to scan for.
python dup_file_finder.py /path/to/directory pdf
This will scan the specified directory for PDF files and identify duplicate matches. The results will be saved to a CSV file named duplicate_matches.csv in the results directory.
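As a rough illustration of the output step, the sketch below writes duplicate groups to `results/duplicate_matches.csv`. The function name `write_matches`, the input shape (a mapping of hash to file paths), and the column names `sha256` and `path` are assumptions for the example; the actual script's CSV layout may differ.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, Union


def write_matches(
    duplicates: Dict[str, Iterable[Union[str, Path]]],
    results_dir: Union[str, Path] = "results",
    filename: str = "duplicate_matches.csv",
) -> Path:
    """Write one CSV row per duplicate file and return the output path.

    Hypothetical helper: `duplicates` maps a SHA-256 hex digest to the
    paths of the files that share it.
    """
    out_dir = Path(results_dir)
    # Create the results directory if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    # newline="" lets the csv module control line endings on every platform.
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sha256", "path"])  # assumed column names
        for digest, paths in duplicates.items():
            for path in paths:
                writer.writerow([digest, str(path)])
    return out_path
```

Writing one row per file, rather than one row per group, keeps the file easy to sort or filter by hash in a spreadsheet.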
Contributions are welcome! If you find any issues or have suggestions for improvement, please create an issue or submit a pull request.
This project is licensed under the MIT License.