Skip to content

Reimplement _get_dot_git() to be more efficient#6325

Merged
mih merged 4 commits intodatalad:masterfrom
mih:perf-dotgit
Dec 22, 2021
Merged

Reimplement _get_dot_git() to be more efficient#6325
mih merged 4 commits intodatalad:masterfrom
mih:perf-dotgit

Conversation

@mih
Copy link
Copy Markdown
Member

@mih mih commented Dec 16, 2021

It now tests the common scenarios first, rather then leaving them for last.

Moreover, it introduces an explicit resolved flag for result path reporting, and applies it consistently when enabled -- rather than just for symlinked directories, but not files, or gitdirs specified in .git files. Searching through the code, I did not spot a case where this features is needed, nor desirable.

Fixes #6321
Fixes #6324

It now tests the common scenarios first, rather then leaving them
for last.

Moreover, it introduces an explicit `resolved` flag for result path
reporting, and applies it consistently when enabled -- rather than just
for symlinked directories, but not files, or gitdirs specified in .git
files. Searching through the code, I did not spot a case where this
features is needed, nor desirable.

Fixes datalad#6321
Fixes datalad#6324
@mih mih added semver-minor Increment the minor version when merged bare-mode Issue regarding operation on bare mode repositories labels Dec 16, 2021
@codecov
Copy link
Copy Markdown

codecov bot commented Dec 16, 2021

Codecov Report

Merging #6325 (1a2be15) into master (2b92a76) will decrease coverage by 28.13%.
The diff coverage is 37.93%.

❗ Current head 1a2be15 differs from pull request most recent head bb25c26. Consider uploading reports for the commit bb25c26 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           master    #6325       +/-   ##
===========================================
- Coverage   89.87%   61.74%   -28.14%     
===========================================
  Files         331      331               
  Lines       43210    43189       -21     
===========================================
- Hits        38835    26666    -12169     
- Misses       4375    16523    +12148     
Impacted Files Coverage Δ
datalad/dataset/tests/test_gitrepo.py 0.00% <0.00%> (-88.49%) ⬇️
datalad/distribution/tests/test_utils.py 100.00% <ø> (ø)
datalad/dataset/gitrepo.py 93.95% <83.33%> (-3.22%) ⬇️
datalad/support/gitrepo.py 82.40% <100.00%> (-8.51%) ⬇️
datalad/version.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/wtf.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/addurls.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/interface/clean.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/interface/rerun.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
... and 189 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b92a76...bb25c26. Read the comment docs.

@yarikoptic
Copy link
Copy Markdown
Member

FWIW, IIRC I think resolving any path was "accepted" as a practice a while ago for obscure setups like mine where an entire TMPDIR might be a symlink to somewhere else and then comparing resolved paths to non-resolved ones. I think we have some travis matrix runs simulating some scenario like that ... one of the succeeds and another one fails, but there are other fails as well, so I guess those should be addressed first

@mih
Copy link
Copy Markdown
Member Author

mih commented Dec 17, 2021

FWIW, IIRC I think resolving any path was "accepted" as a practice a while ago for obscure setups like mine where an entire TMPDIR might be a symlink to somewhere else and then comparing resolved paths to non-resolved ones. I think we have some travis matrix runs simulating some scenario like that ... one of the succeeds and another one fails, but there are other fails as well, so I guess those should be addressed first

Hmm, I thought this was all about Repo.path being resolved, and Dataset.path not. But what is happening here is resolving Repo.dot_git beyond Repo.path.

I am not saying that it has to be one or the other. However the current state is broken. The documentation says

Given a pathobj to a repository return path to .git/ directory

and later

Absolute path to resolved .git/ directory

(let's ignore misleading POSIX notation)

So the return value must be a resolved path. However the code looks like this (note that I labeled the conditionals for easier reference and also note that they are mutually exclusive):

    dot_git = pathobj / '.git'
    # 1
    if dot_git.is_file():
        with dot_git.open() as f:
            line = f.readline()
            if line.startswith("gitdir: "):
                dot_git = pathobj / line[7:].strip()
            else:
                raise InvalidGitRepositoryError("Invalid .git file")
    # 2
    elif dot_git.is_symlink():
        dot_git = dot_git.resolve()
    # 3
    elif not dot_git.exists() and \
            (pathobj / 'HEAD').exists() and \
            (pathobj / 'config').exists():
        # looks like a bare repo
        dot_git = pathobj
    # 4
    elif not (ok_missing or dot_git.exists()):
        raise RuntimeError("Missing .git in %s." % pathobj)
    return dot_git

(1) IF we have an existing (symlinked) .git file that points to a symlinked location: not resolved

(2) ELIF we have a symlinked directory or non-existing file: resolved (and since PY3.6 no longer crash in case of the symlink to a non-existing file)

(3) ELIF we have a bare repo: not resolved

(4) ELIF .git exists or it is OK not to exist: not resolved

(also not that ok_missing=False with a symlinked .git pointing to a non-existing path will not raise an exception).

So when we are talking about acceptance here, it is the acceptance implied by "my uncle is a lunatic, there is nothing I can do about it". ;-) I would prefer to sort this out and fix it up, even if that means a behavior change.

It is made redundant by
`datalad.dataset.tests.test_gitrepo.py:test_get_dot_git`

This old test also required an undesirable .git path resolving,
as an implied special case that did not match the described
behavior. Resolving paths is now done consistently and explicitly
when requested.
@mih
Copy link
Copy Markdown
Member Author

mih commented Dec 17, 2021

From chat with @bpoldrack

I don't currently think it's [resolving the path] necessary. But: If we ever switch the design, to represent the repository and a/the worktree separately, then the identifier for the actual Repo needs to become the resolved .git I think

Don't see another reason ATM

I think it's OK to change the test accordingly. It may have only be done for

  • sense of consistency: If it's bare, it's resolved already (== path), so have it resolved always.
  • Anticipation of changing *Repo's flyweight ID to the resolved .git, rather than the worktree path.

mih added 2 commits December 17, 2021 08:47
Resolving is now an explicit action. If it is explicitly off, and
relative paths are passed, the only way to turn them absolute is to
consider them relative to CWD -- this is by no means guaranteed to be
meaningful or intentional.

I now document the behavior. All internal usage is with absolute paths
anyways.
@mih mih merged commit ec8fdab into datalad:master Dec 22, 2021
@mih mih deleted the perf-dotgit branch December 22, 2021 05:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bare-mode Issue regarding operation on bare mode repositories semver-minor Increment the minor version when merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Repo.dot_git inconsistent Repo.dot_git inefficient

2 participants