Skip to content

addurls: Improve reporting and handling of file name collisions #5675

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 25, 2021

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented May 19, 2021

As mentioned in gh-4840, addurls doesn't provide the caller any help in troubleshooting file name collisions. This series improves the error message and debugging output. It also implements an option like the one suggested in gh-4840 that allows the caller to either 1) ignore collisions if the end result is the same or 2) specify that the first/last row should be used.

Closes #4840.


The tests in this series require the fix from PR gh-5674, which is against maint. That PR is merged into this one.

  • Drop temporary merge.

@codecov
Copy link

codecov bot commented May 19, 2021

Codecov Report

Merging #5675 (545df5e) into master (fd7229a) will decrease coverage by 7.73%.
The diff coverage is 94.18%.

❗ Current head 545df5e differs from pull request most recent head 03b2ced. Consider uploading reports for the commit 03b2ced to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5675      +/-   ##
==========================================
- Coverage   90.48%   82.75%   -7.74%     
==========================================
  Files         305      302       -3     
  Lines       41883    41926      +43     
==========================================
- Hits        37898    34696    -3202     
- Misses       3985     7230    +3245     
Impacted Files Coverage Δ
datalad/local/addurls.py 94.92% <91.07%> (-2.16%) ⬇️
datalad/local/tests/test_addurls.py 99.36% <100.00%> (-0.19%) ⬇️
datalad/plugin/wtf.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/addurls.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/no_annex.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/add_readme.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/check_dates.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/support/tests/utils.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/export_archive.py 0.00% <0.00%> (-100.00%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd7229a...03b2ced. Read the comment docs.

kyleam added 7 commits May 19, 2021 15:55
The logic around collisions is going to get more involved.  Move it to
a helper to avoid blowing up an already-too-long __call__().
addurls() checks if there a collision by taking the set() of file
names across rows and checking that the length is equal to the number
of total rows.  In preparation for letting the caller specify
different ways to handle the collision (e.g., take the last row),
rework the logic to store the collisions as a dictionary.
addurls() filters out rows where the URL is an empty string (or
matches --missing-value).  That means the row positions in the input
don't necessarily align with the rows of extracted information.

Store the original index so that a row of extracted information can be
linked to a row in the input (e.g., to give informative debugging
output).
addurls() aborts if any rows produce the same file name, even if all
of the relevant details are shared among the colliding rows.

Let the caller specify --on-collision=error-if-different to proceed
without cleaning up the input as long as addurls() would treat each
set of colliding rows exactly the same: the same URL would be used and
the same metadata (if any) would be added.
The previous commit taught addurls() to ignore file name collisions
when the colliding rows have the same URL and metadata.  Go one step
farther and allow the caller to say "I don't care if they're
different, just take the {first,last} row and ignore the rest".
When more than one row produces a file name, the error message just
reports that there are collisions, leaving the caller to figure out
where.

Help the caller by providing 1) the row positions that conflict for
each file name and 2) a sample of two rows that conflict.  Given that
there may be many conflicts and that an input row may be very long,
the error message of the result doesn't seem like a good place to put
this information.  Instead log it at the debug level.
When there are file name collisions, more details are logged at the
debug level, but rephrase the error message to give the caller a sense
of how many conflicts there are.
@kyleam kyleam force-pushed the addurls-collisions branch from f6ece8f to 03b2ced Compare May 19, 2021 21:55
@kyleam kyleam marked this pull request as ready for review May 19, 2021 21:56
@kyleam
Copy link
Contributor Author

kyleam commented May 20, 2021

The appveyor failure is unrelated (test_asyncio_forked).

https://ci.appveyor.com/project/mih/datalad/builds/39238237/job/a3hye565bo383me2#L3233

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, thx!

args=("--on-collision",),
constraints=EnsureChoice("error", "error-if-different",
"take-first", "take-last"),
doc="""What to do when more than one row produces the same file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to say that "we always start lower-case!", but that is apparently not the case 🤔

% git grep 'doc="""[a-z]' | wc -l
185
% git grep 'doc="""[A-Z]' | wc -l
96

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

% git grep 'doc="""[A-Z]' | wc -l
96

Relieved to see that at least I'm not responsible for all of those.

@kyleam
Copy link
Contributor Author

kyleam commented May 25, 2021

@mih Thanks for taking a look.

@kyleam kyleam merged commit ddc7108 into datalad:master May 25, 2021
@kyleam kyleam deleted the addurls-collisions branch May 25, 2021 20:17
@yarikoptic yarikoptic added the semver-minor Increment the minor version when merged label Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

addurls: better reporting and handling of collisions
3 participants