Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement iter_gittree() and support it in ls_file_collections #580

Merged
merged 4 commits into from
Jan 5, 2024

Conversation

mih
Copy link
Member

@mih mih commented Jan 4, 2024

This is the first iterator that is not simply reporting on some file system that is identifiable via a single location parameter. Instead, it requires a repository location and a second "tree-ish" identifier.

I decided to skip the implementation of a file-pointer inclusion into the returned items. Two reasons: (1) to make this efficient it would need a lazy-execution of git cat-file, and (2) I cannot think of a use case for content hashing beyond identification, and that use case is perfectly addressed by the reported gitsha already.

For the integration with ls-file-collection I decided to stay close to the behavior of git ls-tree itself. The collection identifier parameter is used for the tree-ish, while the repository selection is done via the working directory. This includes constraining reports to subdirectories (in non-bare repos), like git ls-tree does too.

Closes: #349

TODO:

  • Add tests

This is the first iterator that is not simply reporting on some file
system that is identifiable via a single location parameter.  Instead,
it requires a repository location and a second "tree-ish" identifier.

I decided to skip the implementation of a file-pointer inclusion
into the returned items. Two reasons: (1) to make this efficient
it would need a lazy-execution of `git cat-file`, and (2) I cannot
think of a use case for content hashing beyond identification,
and that use case is perfectly addressed by the reported `gitsha`
already.

For the integration with `ls-file-collection` I decided to stay
close to the behavior of `git ls-tree` itself. The collection identifier
parameter is used for the tree-ish, while the repository selection is
done via the working directory. This includes constraining reports
to subdirectories (in non-bare repos), like `git ls-tree` does too.

Closes: datalad#349
Copy link

codecov bot commented Jan 4, 2024

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (8bf779e) 92.71% compared to head (30d68eb) 92.48%.

Files Patch % Lines
datalad_next/commands/ls_file_collection.py 15.78% 15 Missing and 1 partial ⚠️
...ad_next/iter_collections/tests/test_itergittree.py 95.45% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #580      +/-   ##
==========================================
- Coverage   92.71%   92.48%   -0.23%     
==========================================
  Files         147      149       +2     
  Lines       10922    10775     -147     
  Branches     1630     1618      -12     
==========================================
- Hits        10126     9965     -161     
- Misses        619      633      +14     
  Partials      177      177              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mih
Copy link
Member Author

mih commented Jan 5, 2024

Some performance figures (again a Git repo with 36k tracked files):

❯ multitime -q -n 10 git ls-tree -r HEAD
===> multitime results
1: -q git ls-tree -r HEAD
            Mean        Std.Dev.    Min         Median      Max
real        0.015       0.002       0.013       0.014       0.022       
user        0.013       0.002       0.009       0.014       0.014       
sys         0.002       0.003       0.000       0.000       0.010       
In [1]: from datalad_next.iter_collections.gittree import iter_gittree
In [2]: %timeit list(iter_gittree('.', 'HEAD', recursive='repository'))
104 ms ± 9.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

~6x slower than the plain Git command at this point.

@mih
Copy link
Member Author

mih commented Jan 5, 2024

Given the underwhelming performance I did some profiling and found that half the runtime it spent on converting path strings to PurePath instances. That seems a bit out of proportion...

%timeit list(iter_gittree('.', 'HEAD', recursive='repository'))
92.7 ms ± 737 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

turns into

%timeit list(iter_gittree('.', 'HEAD', recursive='repository'))
40.4 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

with

-        name=PurePosixPath(path),
+        #name=PurePosixPath(path),
+        name=path,

#581

@mih
Copy link
Member Author

mih commented Jan 5, 2024

I think this is good now. I will merge it shortly in order to then address #581 and #554 jointly for all collection types.

@mih mih merged commit a27a4eb into datalad:main Jan 5, 2024
5 of 8 checks passed
@mih mih deleted the nf-349 branch January 5, 2024 12:56
@mih mih added this to the 1.1 milestone Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement itergittree() for ls-file-collection
1 participant