addurls: Improve reporting and handling of file name collisions #5675
Codecov Report
@@ Coverage Diff @@
## master #5675 +/- ##
==========================================
- Coverage 90.48% 82.75% -7.74%
==========================================
Files 305 302 -3
Lines 41883 41926 +43
==========================================
- Hits 37898 34696 -3202
- Misses 3985 7230 +3245
The logic around collisions is going to get more involved. Move it to a helper to avoid blowing up an already-too-long __call__().
addurls() checks whether there is a collision by taking the set() of file names across rows and checking that its length equals the total number of rows. In preparation for letting the caller specify different ways to handle a collision (e.g., take the last row), rework the logic to store the collisions as a dictionary.
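A minimal sketch of the difference between the two checks, not the actual datalad code: the helper name `find_collisions` and the sample file names are hypothetical, and the sketch assumes the dictionary maps each colliding file name to the row positions that produce it.

```python
from collections import defaultdict


def find_collisions(filenames):
    """Map each file name that occurs more than once to its row positions.

    Hypothetical helper for illustration; not datalad's implementation.
    """
    positions = defaultdict(list)
    for idx, name in enumerate(filenames):
        positions[name].append(idx)
    # Keep only the names that more than one row produces.
    return {name: idxs for name, idxs in positions.items() if len(idxs) > 1}


names = ["a.dat", "b.dat", "a.dat", "c.dat", "b.dat"]

# Set-based check: reveals that a collision exists, but not where.
has_collision = len(set(names)) != len(names)

# Dictionary-based check: also records which rows collide.
collisions = find_collisions(names)
```

With the dictionary in hand, a later step can decide per file name whether to error out or to pick one of the rows.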
addurls() filters out rows where the URL is an empty string (or matches --missing-value). That means the row positions in the input don't necessarily align with the rows of extracted information. Store the original index so that a row of extracted information can be linked to a row in the input (e.g., to give informative debugging output).
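As a rough illustration of why the original index is worth storing, the sketch below filters rows the way the message describes and tags each kept row with its input position. The function name `extract_rows`, the `"url"` key, and the `input_idx` field are assumptions made for this example, not names taken from datalad.

```python
def extract_rows(rows, missing_value=None):
    """Drop rows without a usable URL, remembering each kept row's input position.

    Hypothetical helper for illustration; not datalad's implementation.
    """
    kept = []
    for input_idx, row in enumerate(rows):
        url = row["url"]
        if url == "" or (missing_value is not None and url == missing_value):
            continue  # filtered out, so later positions shift by one
        kept.append({"input_idx": input_idx, **row})
    return kept


rows = [
    {"url": "https://example.com/a", "name": "a"},
    {"url": "", "name": "skipped"},
    {"url": "NA", "name": "also-skipped"},
    {"url": "https://example.com/b", "name": "b"},
]
extracted = extract_rows(rows, missing_value="NA")
```

Here the second extracted row sits at position 1 in the output but came from input row 3, which is the position a debugging message should report.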
addurls() aborts if any rows produce the same file name, even if all of the relevant details are shared among the colliding rows. Let the caller specify --on-collision=error-if-different to proceed without cleaning up the input as long as addurls() would treat each set of colliding rows exactly the same: the same URL would be used and the same metadata (if any) would be added.
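The error-if-different condition can be sketched as a comparison of the colliding rows against the first one. This is an assumption-laden illustration: the helper name `colliding_rows_agree` and the `"url"` and `"meta"` keys are hypothetical stand-ins for whatever addurls() extracts per row.

```python
def colliding_rows_agree(rows):
    """Return True if all colliding rows share the same URL and metadata.

    Hypothetical helper for illustration; not datalad's implementation.
    """
    first = rows[0]
    return all(
        row["url"] == first["url"] and row.get("meta") == first.get("meta")
        for row in rows[1:]
    )


same = [
    {"url": "https://example.com/a", "meta": {"group": "x"}},
    {"url": "https://example.com/a", "meta": {"group": "x"}},
]
different_url = [
    {"url": "https://example.com/a", "meta": {"group": "x"}},
    {"url": "https://example.com/b", "meta": {"group": "x"}},
]
different_meta = [
    {"url": "https://example.com/a", "meta": {"group": "x"}},
    {"url": "https://example.com/a", "meta": {"group": "y"}},
]
```

Only the `same` case would be allowed to proceed under this reading; the other two would still be reported as errors.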
The previous commit taught addurls() to ignore file name collisions when the colliding rows have the same URL and metadata. Go one step further and allow the caller to say "I don't care if they're different, just take the {first,last} row and ignore the rest".
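One way to picture the take-first and take-last modes is a single pass that keeps one row per file name. The sketch below is only an illustration under assumed names (`pick_rows`, the `"filename"` key); it is not how datalad implements the option.

```python
def pick_rows(rows, on_collision):
    """Keep one row per file name: the first or the last occurrence.

    Hypothetical helper for illustration; not datalad's implementation.
    """
    if on_collision not in ("take-first", "take-last"):
        raise ValueError("unsupported mode: {!r}".format(on_collision))
    chosen = {}
    for row in rows:
        name = row["filename"]
        # take-last overwrites earlier rows; take-first keeps the earliest.
        if on_collision == "take-last" or name not in chosen:
            chosen[name] = row
    return list(chosen.values())


rows = [
    {"filename": "a.dat", "url": "https://example.com/1"},
    {"filename": "b.dat", "url": "https://example.com/2"},
    {"filename": "a.dat", "url": "https://example.com/3"},
]
first = pick_rows(rows, "take-first")
last = pick_rows(rows, "take-last")
```

In both modes the result contains one entry per file name; only the row chosen for `a.dat` differs.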
When more than one row produces a file name, the error message just reports that there are collisions, leaving the caller to figure out where. Help the caller by providing 1) the row positions that conflict for each file name and 2) a sample of two rows that conflict. Given that there may be many conflicts and that an input row may be very long, the error message of the result doesn't seem like a good place to put this information. Instead log it at the debug level.
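A sketch of the reporting split described above, with the details going to the debug log and only a short summary returned for the error message. The logger name, the function `report_collisions`, and the wording of every message are hypothetical; none of them is taken from datalad.

```python
import logging

lgr = logging.getLogger("addurls_sketch")  # hypothetical logger name


def report_collisions(collisions, rows):
    """Log collision details at debug level and return a short summary.

    `collisions` maps a file name to the positions of the rows producing it.
    Hypothetical helper for illustration; not datalad's implementation.
    """
    for name, idxs in collisions.items():
        # 1) the row positions that conflict for this file name
        lgr.debug("File name %r is produced by rows %s", name, idxs)
        # 2) a sample of two conflicting rows
        lgr.debug("Sample of conflicting rows:\n  %s\n  %s",
                  rows[idxs[0]], rows[idxs[1]])
    return "{} file name(s) are produced by more than one row".format(
        len(collisions))


rows = [
    {"filename": "a.dat", "url": "https://example.com/1"},
    {"filename": "b.dat", "url": "https://example.com/2"},
    {"filename": "a.dat", "url": "https://example.com/3"},
]
summary = report_collisions({"a.dat": [0, 2]}, rows)
```

Keeping long rows out of the summary means the error result stays short no matter how many conflicts or how wide the input rows are.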
When there are file name collisions, the details are now logged at the debug level. Rephrase the error message so that it still gives the caller a sense of how many conflicts there are.
f6ece8f to 03b2ced
The AppVeyor failure is unrelated (https://ci.appveyor.com/project/mih/datalad/builds/39238237/job/a3hye565bo383me2#L3233).
This LGTM, thx!
args=("--on-collision",),
constraints=EnsureChoice("error", "error-if-different",
                         "take-first", "take-last"),
doc="""What to do when more than one row produces the same file
I was about to say that "we always start lower-case!", but that is apparently not the case 🤔
% git grep 'doc="""[a-z]' | wc -l
185
% git grep 'doc="""[A-Z]' | wc -l
96
> % git grep 'doc="""[A-Z]' | wc -l
> 96
Relieved to see that at least I'm not responsible for all of those.
@mih Thanks for taking a look.
As mentioned in gh-4840, addurls doesn't provide the caller any help in troubleshooting file name collisions. This series improves the error message and debugging output. It also implements an option like the one suggested in gh-4840 that allows the caller to either 1) ignore collisions if the end result is the same or 2) specify that the first/last row should be used. Closes #4840.
The tests in this series require the fix from PR gh-5674, which is against maint. That PR is merged into this one.