Research options for handling duplicate files #4

atc0005 · 2020-01-09T20:02:01Z

From #1:

A later revision could recursively perform this task and move duplicates into a subfolder alongside the image? What if the images (or files) are in entirely different folders? Perhaps create a log instead? Or, an option to choose which of those steps are performed?

The initial v0.1 release will focus on creating a CSV file for manual review, but it would be useful to perform some sort of cleanup option either automatically or based on a column entry for the CSV file.

For example, the generated CSV file could create a column with prefilled "keep" action entries that the user could replace with delete. This could involve adding new flags:

simulate
prune


atc0005 · 2020-01-11T08:23:04Z

Perhaps run a small built-in web server and serve up a thin web UI for managing the duplicates? This would almost necessitate creating/maintaining an in-memory state of the files as they're pruned/managed.

atc0005 · 2020-02-09T13:00:52Z

Status

After working further with some photos, particularly trying to figure out where I left off on uploading files, I'm ready to push forward with this support.

I think my earlier notes on this issue are still relevant; I think that pre-filling the column entries with keep helps to make clear what will take place, but on the other hand keeping all column entries blank by default and only filling in those entries where we want something to happen also makes sense. This would also add a nice contrast when quickly scanning the entries in the CSV file: if you see something in the column/row, then something is going to happen to that file.

Flag changes

Some flag changes (based off of some scratch notes):

input-csv
- read in a marked-up copy of the original output CSV file
- generate a warning/error if nothing is found to do
prune
- as indicated before
dry-run
- as implied, this will report what would happen
- not 100% sure, but if this option is specified it seems like a good idea to emit results to stdout since the user is intentionally opting into feedback
backup-dir
- instead of removing a file, start at the specified path, recreate the subdirectory structure holding a file removal candidate and then move that file into the matching subdirectory path
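
A rough sketch of the backup-dir idea, assuming two hypothetical helpers (`backupPath` and `moveToBackup`, neither taken from the project's code). It only illustrates recreating the original directory structure under the backup path and moving the file there:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// backupPath maps a removal candidate to a location under backupDir that
// mirrors the candidate's original directory structure.
// Hypothetical helper for illustration only.
func backupPath(backupDir, original string) string {
	// Drop any volume name (e.g. "C:" on Windows) so the rest can be nested.
	rel := original[len(filepath.VolumeName(original)):]
	// Rooting and cleaning the path resolves ".." elements so the result
	// cannot climb out of backupDir.
	rel = filepath.Clean(string(filepath.Separator) + rel)
	return filepath.Join(backupDir, rel)
}

// moveToBackup moves original into its mirrored location under backupDir.
// Assumption: both paths are on the same filesystem; os.Rename can fail
// across devices, in which case a copy-then-remove fallback would be needed.
func moveToBackup(backupDir, original string) (string, error) {
	dst := backupPath(backupDir, original)
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return "", fmt.Errorf("creating backup subdirectory: %w", err)
	}
	if err := os.Rename(original, dst); err != nil {
		return "", fmt.Errorf("moving %q to backup: %w", original, err)
	}
	return dst, nil
}

func main() {
	// Only prints the computed destination; nothing is moved here.
	fmt.Println(backupPath("backup", "/home/user/photos/a.jpg"))
}
```

Keeping the path calculation separate from the move would also let a dry-run report the destination without touching any files.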

When processing a specified CSV input file, here are the requirements for each entry in the CSV file:

Input CSV file requirements

appropriate removal keyword
- probably best to only support one pattern/phrase such as "remove" (if we name the column header action) or YES (if we name the column header remove)
filename
directory
checksum
- this might help prove that we are dealing with one of our generated files
size (human)
size (bytes)
- this isn't yet recorded in the output CSV file, but may be worth doing so since we have access to that detail anyway when processing files

atc0005 · 2020-02-09T17:09:32Z

Notes to self:

create type to represent rows in CSV file
- fields in struct to represent each field in CSV
create validation method to confirm all required fields are present
additional conversion methods, etc on type
wrapper around desired/required settings for encoding/csv stdlib package
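
A minimal sketch of what such a row type could look like. The type name `csvRow`, the column order, and the choice of which fields count as required are all assumptions for illustration, not the project's actual layout:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// csvRow represents one row of the marked-up CSV file.
// Field names and column order are assumptions for this sketch.
type csvRow struct {
	Remove    string
	Filename  string
	Directory string
	Checksum  string
	SizeHuman string
	SizeBytes string
}

// csvRowFieldCount is the number of columns expected in each row.
const csvRowFieldCount = 6

// newCSVRow converts a raw record (as returned by encoding/csv) into a csvRow.
func newCSVRow(record []string) (csvRow, error) {
	if len(record) != csvRowFieldCount {
		return csvRow{}, fmt.Errorf("expected %d fields, got %d",
			csvRowFieldCount, len(record))
	}
	return csvRow{
		Remove:    strings.TrimSpace(record[0]),
		Filename:  record[1],
		Directory: record[2],
		Checksum:  record[3],
		SizeHuman: record[4],
		SizeBytes: record[5],
	}, nil
}

// validate confirms the fields needed to locate and verify a file are present.
// Assumption: the Remove field may be blank, meaning "leave this file alone".
func (r csvRow) validate() error {
	switch {
	case r.Filename == "":
		return errors.New("missing filename")
	case r.Directory == "":
		return errors.New("missing directory")
	case r.Checksum == "":
		return errors.New("missing checksum")
	}
	return nil
}

func main() {
	row, err := newCSVRow([]string{"true", "a.jpg", "/photos", "abc123", "1.0 KiB", "1024"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if err := row.validate(); err != nil {
		fmt.Println("invalid row:", err)
		return
	}
	fmt.Printf("%+v\n", row)
}
```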

atc0005 · 2020-02-10T03:43:08Z

Other (unsorted) thoughts:

read in csv file
- trim spaces on both ends of Remove column
- skip empty rows
- enforce fixed number of columns (for non-empty rows)
Does the encoding/csv package create named fields similar to how I (vaguely) recall the Python library offering the option (Dictionary?)?
Build the objects ourselves?
- slice of slices
- map back to CSV file object?
- row at a time back to FileChecksumIndex of FileMatch entries?
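
On the named-fields question: as far as I can tell, encoding/csv has no dictionary-style reader; `Reader.Read` returns a plain `[]string` per record, so any name-to-value mapping has to be built by hand from the header row. A sketch of that, with the "Remove" header name and the helper names being assumptions:

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"strings"
)

// removeColumn is an assumed header name for the action column.
const removeColumn = "Remove"

// isEmptyRecord reports whether every field is blank or whitespace only.
func isEmptyRecord(record []string) bool {
	for _, field := range record {
		if strings.TrimSpace(field) != "" {
			return false
		}
	}
	return true
}

// readRows reads a header row plus data rows and returns one map per
// non-empty row, keyed by header name.
func readRows(r io.Reader, wantFields int) ([]map[string]string, error) {
	reader := csv.NewReader(r)
	// A positive FieldsPerRecord makes Read fail on rows with a different
	// number of fields.
	reader.FieldsPerRecord = wantFields

	header, err := reader.Read()
	if err != nil {
		return nil, fmt.Errorf("reading header row: %w", err)
	}

	var rows []map[string]string
	for {
		record, err := reader.Read()
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("reading row: %w", err)
		}
		// Truly blank lines are skipped by the reader itself; rows such as
		// ",," still come through and are skipped here.
		if isEmptyRecord(record) {
			continue
		}
		row := make(map[string]string, len(header))
		for i, name := range header {
			value := record[i]
			if name == removeColumn {
				// Only the Remove column is trimmed; file names may
				// legitimately contain surrounding spaces.
				value = strings.TrimSpace(value)
			}
			row[name] = value
		}
		rows = append(rows, row)
	}
	return rows, nil
}

// readRowsFromString is a convenience wrapper for quick experiments.
func readRowsFromString(data string, wantFields int) ([]map[string]string, error) {
	return readRows(strings.NewReader(data), wantFields)
}

func main() {
	data := "Remove,Filename,Directory\n true ,a.jpg,/photos\n,,\n,b.jpg,/photos\n"
	rows, err := readRowsFromString(data, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, row := range rows {
		fmt.Printf("%q %q\n", row[removeColumn], row["Filename"])
	}
}
```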

atc0005 · 2020-02-10T12:53:14Z

https://gobyexample.com/command-line-subcommands

It's worth considering subcommands as a means of splitting application logic (somewhat) down the middle between two "modes" of operation:

create CSV, report duplicates
read CSV, prune duplicates

Still need to give it some more thought, but these two subcommands seem like a decent starting point:

bridge report
bridge prune

We can then tie specific flags to each of them. For example, having a dry-run flag makes sense for pruning files, but not so much for creating a report (that should be a safe thing to do and is a starting point for further work using the tool).
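
A minimal sketch of the two subcommands using `flag.NewFlagSet` from the standard library. The `config` struct, the `parseArgs` helper, and the `csvfile` flag on `report` are hypothetical; only the prune flag names come from the notes above:

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// config collects the settings for whichever subcommand was chosen.
// Hypothetical type for this sketch.
type config struct {
	subcommand string
	csvFile    string // report: assumed flag for the output CSV file
	inputCSV   string // prune: marked-up CSV file to read
	dryRun     bool   // prune: report what would happen, change nothing
	backupDir  string // prune: move files here instead of removing them
}

// parseArgs expects the arguments after the program name, e.g. os.Args[1:].
func parseArgs(args []string) (config, error) {
	if len(args) < 1 {
		return config{}, errors.New("expected 'report' or 'prune' subcommand")
	}
	cfg := config{subcommand: args[0]}

	switch args[0] {
	case "report":
		fs := flag.NewFlagSet("report", flag.ContinueOnError)
		fs.StringVar(&cfg.csvFile, "csvfile", "", "CSV file to create")
		if err := fs.Parse(args[1:]); err != nil {
			return config{}, err
		}
	case "prune":
		// dry-run is only defined here, so "report -dry-run" is rejected.
		fs := flag.NewFlagSet("prune", flag.ContinueOnError)
		fs.StringVar(&cfg.inputCSV, "input-csv", "", "marked-up CSV file to process")
		fs.BoolVar(&cfg.dryRun, "dry-run", false, "report what would happen")
		fs.StringVar(&cfg.backupDir, "backup-dir", "", "move files here instead of removing")
		if err := fs.Parse(args[1:]); err != nil {
			return config{}, err
		}
	default:
		return config{}, fmt.Errorf("unknown subcommand %q", args[0])
	}
	return cfg, nil
}

func main() {
	cfg, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

With this layout, something like `bridge prune -input-csv dupes.csv -dry-run` would parse, while `bridge report -dry-run` would fail with an undefined-flag error.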

atc0005 · 2020-02-17T03:25:26Z

Scratch notes for removal logic:

don't require a value in the remove* field
allow only a small set of values that can be (easily) coerced to boolean
- initial implementation can enforce just true or false
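
A small sketch of that coercion, assuming a hypothetical `shouldRemove` helper. Treating the values case-insensitively is my own assumption; `strconv.ParseBool` would be one way to widen the accepted set later (it also accepts values such as `1`, `t`, `0`, and `f`):

```go
package main

import (
	"fmt"
	"strings"
)

// shouldRemove converts the contents of the remove field into a decision.
// An empty field is allowed and means "do nothing"; anything other than
// true/false is rejected rather than guessed at.
func shouldRemove(field string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(field)) {
	case "", "false":
		return false, nil
	case "true":
		return true, nil
	default:
		return false, fmt.Errorf("unsupported remove value %q", field)
	}
}

func main() {
	for _, value := range []string{"", " true ", "false", "yes"} {
		remove, err := shouldRemove(value)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %v\n", value, remove)
	}
}
```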

atc0005 · 2020-02-23T04:37:26Z

Re subcommand: stdlib covers this, but alexflint/go-arg does too and looks like it would be much easier to implement:

https://github.com/alexflint/go-arg#subcommands

- Split application logic (and flags) into subcommands
  - `report` for existing behavior and set of flags
  - `prune` for new behavior and new set of flags
- Add support for pruning flagged/marked items in an input CSV file previously generated by the `report` subcommand
- Documentation updates
  - README updates to cover both subcommands
  - CHANGELOG updates
  - GoDocs coverage of subcommands
- GitHub Actions Workflow update to reflect cmd subdirectory
- Makefile updates
  - new cmd subdirectory path
  - use "Simply expanded variables" (`:=`) vs those which are evaluated at each use (`=`) since we don't yet use that feature
- Fix various linting errors

refs #4

- Split application logic (and flags) into subcommands
  - `report` for existing behavior and set of flags
  - `prune` for new behavior and new set of flags
- Create subpackages for related chunks of code
  - e.g., `matches`, `paths`, ...
- Add support for pruning flagged/marked items in an input CSV file previously generated by the `report` subcommand
- Documentation updates
  - README updates to cover both subcommands
  - CHANGELOG updates
  - GoDocs coverage of subcommands
- GitHub Actions Workflow update to reflect cmd subdirectory
- Makefile updates
  - new cmd subdirectory path
  - use "Simply expanded variables" (`:=`) vs those which are evaluated at each use (`=`) since we don't yet use that feature
- Fix various linting errors

refs #4

atc0005 · 2020-02-24T12:59:54Z

Re subcommand: stdlib covers this, but alexflint/go-arg does too and looks like it would be much easier to implement:

https://github.com/alexflint/go-arg#subcommands

Going to stick with the standard library for now. There are still some (many) lessons I can learn from implementing subcommand support using the `flag` stdlib package.

atc0005 added the enhancement (New feature or request) and question (Further information is requested) labels Jan 9, 2020

atc0005 added this to the v0.2 milestone Jan 9, 2020

atc0005 self-assigned this Jan 9, 2020

atc0005 mentioned this issue Jan 11, 2020

Test creating Excel (or equivalent) spreadsheet of duplicate files #6

Closed

atc0005 modified the milestones: v0.2.0, Future Jan 14, 2020

atc0005 modified the milestones: Future, v0.4.0 Feb 9, 2020

atc0005 pinned this issue Feb 10, 2020

atc0005 mentioned this issue Feb 16, 2020

Add support for blocking file removal operations if ALL files from a set would be removed #46

Open

atc0005 mentioned this issue Feb 24, 2020

Add support for removing duplicate files #48

Merged

atc0005 closed this as completed in #48 Feb 24, 2020

atc0005 unpinned this issue Feb 26, 2020
