Research options for handling duplicate files #4

Closed
atc0005 opened this issue Jan 9, 2020 · 8 comments · Fixed by #48

Comments


atc0005 commented Jan 9, 2020

From #1:

A later revision could recursively perform this task and move duplicates into a subfolder alongside the image? What if the images (or files) are in entirely different folders? Perhaps create a log instead? Or, an option to choose which of those steps are performed?

The initial v0.1 release will focus on creating a CSV file for manual review, but it would be useful to perform some sort of cleanup option either automatically or based on a column entry for the CSV file.

For example, the generated CSV file could create a column with prefilled "keep" action entries that the user could replace with delete. This could involve adding new flags:

  • simulate
  • prune
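To make the pre-filled "keep" idea concrete, here is a minimal sketch of writing such a report with the standard encoding/csv package. The column names, the `reportRow` type, and the `renderReport` helper are all hypothetical and only illustrate the layout; they are not the tool's actual output format.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// reportRow is a hypothetical record describing one duplicate file.
type reportRow struct {
	Directory string
	Filename  string
	Checksum  string
}

// renderReport builds CSV text with an (assumed) "action" column that is
// pre-filled with "keep", so the user only edits rows they want removed.
func renderReport(rows []reportRow) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)

	// Header row; the column names are assumptions for this sketch.
	if err := w.Write([]string{"directory", "filename", "checksum", "action"}); err != nil {
		return "", err
	}
	for _, r := range rows {
		if err := w.Write([]string{r.Directory, r.Filename, r.Checksum, "keep"}); err != nil {
			return "", err
		}
	}

	// Flush buffered data and surface any deferred write error.
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderReport([]reportRow{
		{Directory: "/photos/2019", Filename: "img001.jpg", Checksum: "abc123"},
		{Directory: "/photos/backup", Filename: "img001.jpg", Checksum: "abc123"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The user would then change "keep" to "delete" on the rows to prune and feed the edited file back in.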
@atc0005 atc0005 added enhancement New feature or request question Further information is requested labels Jan 9, 2020
@atc0005 atc0005 added this to the v0.2 milestone Jan 9, 2020
@atc0005 atc0005 self-assigned this Jan 9, 2020

atc0005 commented Jan 11, 2020

Perhaps run a small built-in web server and serve up a thin web UI for managing the duplicates? This would almost necessitate creating/maintaining an in-memory state of the files as they're pruned/managed.

@atc0005 atc0005 modified the milestones: v0.2.0, Future Jan 14, 2020
@atc0005 atc0005 modified the milestones: Future, v0.4.0 Feb 9, 2020

atc0005 commented Feb 9, 2020

Status

After working further with some photos, particularly trying to figure out where I left off on uploading files, I'm ready to push forward with this support.

I think my earlier notes on this issue are still relevant. Pre-filling the column entries with keep helps make clear what will take place. On the other hand, leaving all column entries blank by default and filling in only those where we want something to happen also makes sense. It would also add a nice contrast when quickly scanning the entries in the CSV file: if you see something in the column/row, then something is going to happen to that file.

Flag changes

Some flag changes (based off of some scratch notes):

  • input-csv
    • read in a marked-up copy of the original output CSV file
    • generate a warning/error if nothing is found to do
  • prune
    • as indicated before
  • dry-run
    • as implied, this will report what would happen
    • not 100% sure, but if this option is specified it seems like a good idea to emit results to stdout since the user is intentionally opting into feedback
  • backup-dir
    • instead of removing a file, start at the specified path, recreate the subdirectory structure holding a file removal candidate and then move that file into the matching subdirectory path
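The backup-dir behavior described above could look roughly like the following. This is only a sketch under stated assumptions: `backupTarget` and `moveToBackup` are hypothetical helper names, and the dry-run handling here simply reports the computed target without touching the filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// backupTarget returns the path a removal candidate would be moved to,
// recreating the candidate's directory structure beneath backupDir.
func backupTarget(backupDir, candidate string) string {
	// Strip any volume name (e.g. "C:" on Windows) so the remainder of
	// the path can be appended beneath backupDir.
	rel := candidate[len(filepath.VolumeName(candidate)):]
	return filepath.Join(backupDir, rel)
}

// moveToBackup moves candidate under backupDir instead of deleting it.
// With dryRun set, it only reports where the file would go.
func moveToBackup(backupDir, candidate string, dryRun bool) (string, error) {
	target := backupTarget(backupDir, candidate)
	if dryRun {
		return target, nil
	}
	if err := os.MkdirAll(filepath.Dir(target), 0755); err != nil {
		return "", err
	}
	// os.Rename can fail when source and target live on different
	// filesystems; a copy-then-remove fallback is omitted here.
	if err := os.Rename(candidate, target); err != nil {
		return "", err
	}
	return target, nil
}

func main() {
	// Dry run only: nothing is created or moved.
	target, err := moveToBackup("/backup", "/photos/2019/img001.jpg", true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would move to:", target)
}
```

Emitting the target path in dry-run mode fits the idea of sending results to stdout when the user has opted into feedback.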

When processing a specified CSV input file, here are the requirements for each entry in the CSV file:

Input CSV file requirements

  • appropriate removal keyword
    • probably best to only support one pattern/phrase such as "remove" (if we name the column header action) or YES (if we name the column header remove)
  • filename
  • directory
  • checksum
    • this might help prove that we are dealing with one of our generated files
  • size (human)
  • size (bytes)
    • this isn't yet recorded in the output CSV file, but may be worth doing so since we have access to that detail anyway when processing files


atc0005 commented Feb 9, 2020

Notes to self:

  • create type to represent rows in CSV file
    • fields in struct to represent each field in CSV
  • create validation method to confirm all required fields are present
  • additional conversion methods, etc on type
  • wrapper around desired/required settings for encoding/csv stdlib package
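A rough sketch of what such a row type and validation method might look like follows. The `csvRow` name, the field order, and the choice to allow a blank remove field are assumptions made for illustration, not the final design.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// numFields is the column count assumed by this sketch.
const numFields = 6

// csvRow is a hypothetical type with one struct field per CSV column.
type csvRow struct {
	Directory string
	Filename  string
	SizeHuman string
	SizeBytes string
	Checksum  string
	Remove    string
}

// newCSVRow converts one raw CSV record into a csvRow, rejecting
// records that do not have the expected number of fields.
func newCSVRow(record []string) (csvRow, error) {
	if len(record) != numFields {
		return csvRow{}, fmt.Errorf("got %d fields, want %d", len(record), numFields)
	}
	return csvRow{
		Directory: record[0],
		Filename:  record[1],
		SizeHuman: record[2],
		SizeBytes: record[3],
		Checksum:  record[4],
		Remove:    record[5],
	}, nil
}

// Validate confirms that the fields needed to locate and verify a file
// are present. The Remove field may be blank (assumed to mean "no action").
func (r csvRow) Validate() error {
	switch {
	case strings.TrimSpace(r.Directory) == "":
		return errors.New("missing directory")
	case strings.TrimSpace(r.Filename) == "":
		return errors.New("missing filename")
	case strings.TrimSpace(r.Checksum) == "":
		return errors.New("missing checksum")
	}
	return nil
}

func main() {
	row, err := newCSVRow([]string{"/photos", "a.jpg", "1.0 KiB", "1024", "abc123", ""})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("valid:", row.Validate() == nil)
}
```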


atc0005 commented Feb 10, 2020

Other (unsorted) thoughts:

  • read in csv file

    • trim spaces on both ends of Remove column
    • skip empty rows
    • enforce fixed number of columns (for non-empty rows)
  • Does the encoding/csv package create named fields similar to how I (vaguely) recall the Python library offering the option (Dictionary?)?

  • Build the objects ourselves?

    • slice of slices
    • map back to CSV file object?
    • row at a time back to FileChecksumIndex of FileMatch entries?

@atc0005 atc0005 pinned this issue Feb 10, 2020

atc0005 commented Feb 10, 2020

https://gobyexample.com/command-line-subcommands

It's worth considering subcommands as a means of splitting application logic (somewhat) down the middle between two "modes" of operation:

  • create CSV, report duplicates
  • read CSV, prune duplicates

Still need to give it some more thought, but these two subcommands seem like a decent starting point:

  • bridge report
  • bridge prune

We can then tie specific flags to each of them. For example, having a dry-run flag makes sense for pruning files, but not so much for creating a report (that should be a safe thing to do and is a starting point for further work using the tool).
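A minimal sketch of tying flags to subcommands with the standard library's flag package, using one FlagSet per invocation. The `config` type, the `parseArgs` helper, and the `output-csv` flag are hypothetical; the prune flags reuse the names from the earlier notes.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// config collects the settings gathered from whichever subcommand ran.
type config struct {
	Subcommand string
	OutputCSV  string // report only (hypothetical flag)
	InputCSV   string // prune only
	BackupDir  string // prune only
	DryRun     bool   // prune only
}

// parseArgs expects args without the program name, e.g. os.Args[1:].
func parseArgs(args []string) (config, error) {
	if len(args) == 0 {
		return config{}, errors.New("expected subcommand: report or prune")
	}
	cfg := config{Subcommand: args[0]}

	// ContinueOnError makes Parse return an error instead of exiting.
	fs := flag.NewFlagSet(args[0], flag.ContinueOnError)
	fs.SetOutput(io.Discard) // the caller decides how to report errors

	switch args[0] {
	case "report":
		fs.StringVar(&cfg.OutputCSV, "output-csv", "duplicates.csv", "CSV file to create")
	case "prune":
		fs.StringVar(&cfg.InputCSV, "input-csv", "", "marked-up CSV file to read")
		fs.StringVar(&cfg.BackupDir, "backup-dir", "", "move files here instead of removing them")
		fs.BoolVar(&cfg.DryRun, "dry-run", false, "report what would happen")
	default:
		return config{}, fmt.Errorf("unknown subcommand %q", args[0])
	}

	if err := fs.Parse(args[1:]); err != nil {
		return config{}, err
	}
	return cfg, nil
}

func main() {
	// A real program would pass os.Args[1:] here.
	cfg, err := parseArgs([]string{"prune", "-input-csv", "dupes.csv", "-dry-run"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Because dry-run is registered only on the prune FlagSet, passing it to report is rejected as an undefined flag.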


atc0005 commented Feb 17, 2020

Scratch notes for removal logic:

  • don't require a value in the remove* field
  • allow only a small set of values that can be (easily) coerced to boolean
    • initial implementation can enforce just true or false
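These rules could be sketched as follows. `shouldRemove` is a hypothetical helper, and treating the value case-insensitively is an assumption. A later revision could widen the accepted set with strconv.ParseBool, which also accepts values such as 1, t, 0, and f.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldRemove interprets the removal field of one CSV row.
// An empty field is allowed and means "do not remove".
func shouldRemove(field string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(field)) {
	case "", "false":
		return false, nil
	case "true":
		return true, nil
	default:
		return false, fmt.Errorf("invalid removal value %q: want true or false", field)
	}
}

func main() {
	for _, v := range []string{"", " true ", "false", "yes"} {
		remove, err := shouldRemove(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> remove=%t\n", v, remove)
	}
}
```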


atc0005 commented Feb 23, 2020

Re subcommand: stdlib covers this, but alexflint/go-arg does too and looks like it would be much easier to implement:

https://github.com/alexflint/go-arg#subcommands

atc0005 added a commit that referenced this issue Feb 24, 2020
- Split application logic (and flags) into subcommands
  - `report` for existing behavior and set of flags
  - `prune` for new behavior and new set of flags
- Add support for pruning flagged/marked items in an input CSV file
  previously generated by the `report` subcommand
- Documentation updates
  - README updates to cover both subcommands
  - CHANGELOG updates
  - GoDocs coverage of subcommands
- GitHub Actions Workflow update to reflect cmd subdirectory
- Makefile updates
  - new cmd subdirectory path
  - use "Simply expanded variables" (`:=`) vs those which are
    evaluated at each use (`=`) since we don't yet use that feature
- Fix various linting errors

refs #4
atc0005 added a commit that referenced this issue Feb 24, 2020
- Split application logic (and flags) into subcommands
  - `report` for existing behavior and set of flags
  - `prune` for new behavior and new set of flags
- Create subpackages for related chunks of code
  - e.g., `matches`, `paths`, ...
- Add support for pruning flagged/marked items in an input CSV file
  previously generated by the `report` subcommand
- Documentation updates
  - README updates to cover both subcommands
  - CHANGELOG updates
  - GoDocs coverage of subcommands
- GitHub Actions Workflow update to reflect cmd subdirectory
- Makefile updates
  - new cmd subdirectory path
  - use "Simply expanded variables" (`:=`) vs those which are
    evaluated at each use (`=`) since we don't yet use that feature
- Fix various linting errors

refs #4

atc0005 commented Feb 24, 2020

Re subcommand: stdlib covers this, but alexflint/go-arg does too and looks like it would be much easier to implement:

https://github.com/alexflint/go-arg#subcommands

Going to stick with the standard library for now. There are still some (many) lessons I can learn from implementing subcommand support using the flag stdlib package.

@atc0005 atc0005 unpinned this issue Feb 26, 2020