Skip to content

Introduce AnnexRepo.get_file_annexinfo(), deprecate AnnexRepo.get_file_key()#6104

Merged
bpoldrack merged 3 commits intodatalad:masterfrom
mih:rf-getfilekey2
Oct 25, 2021
Merged

Introduce AnnexRepo.get_file_annexinfo(), deprecate AnnexRepo.get_file_key()#6104
bpoldrack merged 3 commits intodatalad:masterfrom
mih:rf-getfilekey2

Conversation

@mih
Copy link
Copy Markdown
Member

@mih mih commented Oct 19, 2021

This is another step in a slow paradigm change to assemble more information with fewer calls to Git. Here it is get_file_key() which is often used in conjunction with file_has_content() (or a local is_available() which does the same thing with a different implementation), or get_contentlocation(), or get_key|file_backend().

All this information is readily provided via get_content_annexinfo(). However, in most places, a query for a single files is made (that this can be wasteful is a problem, but also a reality right now). The new get_file_annexinfo() makes the single file query use case more convenient, but otherwise relies on get_content_annexinfo() as the work horse.

Beyond this addition the previous get_file_key() was reimplemented using get_content_annexinfo() alone. Simultaneously it is being deprecated, because better means exist now, to perform its tasks. For example the OpenNeuro usage referred to in #6104 (comment) is fully replaced by a single get_file_annexinfo() call now (not just the get_file_key() method call, but the entire loop body). Moreover, the behavior of that function is extremely complex (different error behavior, dependent on parameter types). It's exception behavior (3 different exceptions for why a key could not be obtained) also led to abuse of this method for type-checking of repository file types (call that didn't even bother checking the return value of a get_* method). And lastly, it uses normalize_paths.

TODO:

  • see if current tests pass
  • implement dedicated tests for get_file_annexinfo() pulling together test code for all the related methods listed above.
  • follow up PR will then remove their usage too

Developer notes:

This is furthering the goal of finally getting rid of normalize_paths() #4595

This is a second and different attempt to this. The last one was #5069

Final progress ping to #3333

@mih mih changed the title Deprecated AnnexRepo.get_file_key() Deprecate AnnexRepo.get_file_key() Oct 19, 2021
@codecov
Copy link
Copy Markdown

codecov Bot commented Oct 19, 2021

Codecov Report

Merging #6104 (8690e81) into master (4ba1388) will increase coverage by 0.02%.
The diff coverage is 94.62%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6104      +/-   ##
==========================================
+ Coverage   89.80%   89.83%   +0.02%     
==========================================
  Files         319      319              
  Lines       42451    42477      +26     
==========================================
+ Hits        38124    38159      +35     
+ Misses       4327     4318       -9     
Impacted Files Coverage Δ
datalad/distributed/export_to_figshare.py 26.25% <0.00%> (ø)
datalad/metadata/aggregate.py 74.92% <0.00%> (+0.44%) ⬆️
datalad/customremotes/archives.py 62.71% <100.00%> (ø)
datalad/customremotes/tests/test_base.py 97.89% <100.00%> (ø)
datalad/distribution/tests/test_get.py 100.00% <100.00%> (ø)
datalad/distribution/tests/test_uninstall.py 100.00% <100.00%> (ø)
datalad/distribution/tests/test_update.py 100.00% <100.00%> (ø)
datalad/interface/add_archive_content.py 89.85% <100.00%> (ø)
...atalad/interface/tests/test_add_archive_content.py 99.62% <100.00%> (ø)
datalad/metadata/extractors/annex.py 93.54% <100.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ba1388...8690e81. Read the comment docs.

@mih mih added the semver-patch Increment the patch version when merged label Oct 19, 2021
@mih
Copy link
Copy Markdown
Member Author

mih commented Oct 19, 2021

So this works! Will now proceed with removing remaining usage of get_file_key()

@yarikoptic
Copy link
Copy Markdown
Member

note: well, we just had discussion on how deprecations/removal could be affecting the downstream code etc - there seems to be a good number (didn't group to see which repos) of uses of this method, e.g. openneuro https://github.com/OpenNeuroOrg/datalad-service/blob/4f501d3a808cca1ad04fc5d808f49bc0c68614f1/datalad_service/migrate/versions.py . I am not saying that it should not be deprecated if so needed, but also do not see an immediate need.

@mih
Copy link
Copy Markdown
Member Author

mih commented Oct 21, 2021

The need comes from the need to get rid of normalize_paths (#4595).

Right now, I am exploring what migration path makes sense. The usage in OpenNeuro code is a common one (ie., operation on a single file). Once I cataloged all our usage, I will know whether my current plan hold. This plan is to implement something like .get_file_props(path) as a convenience front-end for .get_content_(annex)info() for the single file use case.

The OpenNeuro usage transition would then become:

get_file_key(name) -> get_file_props(name)['key']

Instead of handling three types of exceptions (not in annex, not in git, not there), the full set up file properties can always be obtained for any known file and acted on, including the key.

All of what the openneuro loop body is doing, is to implement get_content_annexinfo() without this functionality. And it much faster to use this single call than to assemble this information incrementally, like they do, because back then the Repo interface offered nothing else.

Alternatively, we could consider just removing the normalize path decorator. But this would either require to deprecate all its usage at once (this is already taking years, and will still take more), or to silently break the behavior of the method. I consider my proposal less problematic, right now.

@mih mih mentioned this pull request Oct 21, 2021
21 tasks
@mih mih changed the title Deprecate AnnexRepo.get_file_key() Introduce AnnexRepo.get_file_annexinfo(), deprecate AnnexRepo.get_file_key() Oct 22, 2021
@mih
Copy link
Copy Markdown
Member Author

mih commented Oct 23, 2021

No impact on benchmarks after complete removal of get_file_key() usage from codebase.

image

mih added 3 commits October 23, 2021 08:48
This is the companion of get_content_annexinfo() for single file
queries. This is a common use case, not only in the tests.

This is (for now), not channeling all functionality that
get_content_annexinfo() provides (amending Git properties, etc),
but the API could be extended, if this turns out to be
useful in the future.

Importantly, with this method in use, in many places dedicated
additional calls to file_has_content() can now be reduced to a single
call to this new method.
This is furthering the goal of finally getting rid of
`normalize_paths()` datalad#4595

This is a second and different attempt to this. The last one was
datalad#5069
@mih mih added the merge-if-ok OP considers this work done, and requests review/merge label Oct 23, 2021
@mih
Copy link
Copy Markdown
Member Author

mih commented Oct 24, 2021

FTR: This would also simplify some code on #6105

@bpoldrack bpoldrack merged commit c86643f into datalad:master Oct 25, 2021
@mih mih deleted the rf-getfilekey2 branch November 11, 2021 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-if-ok OP considers this work done, and requests review/merge semver-patch Increment the patch version when merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants