
Reduce number of commits when cloning into a dataset #5480

Merged
bpoldrack merged 6 commits into datalad:master on Jun 11, 2021

Conversation

@bpoldrack (Member) commented Mar 10, 2021

When a dataset is cloned into another one as a subdataset, we record the originally given URL in .gitmodules. This used to be a dedicated commit in addition to the one adding the new dataset as a subdataset. The former is now amended into the latter.
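
For illustration, a minimal sketch using the datalad Python API (paths are made up, and get_revisions is assumed to return the list of commits on the current branch):

from datalad.api import Dataset

super_ds = Dataset('/tmp/super').create()
n_before = len(super_ds.repo.get_revisions())

# clone into the superdataset, registering 'sub' as a subdataset
super_ds.clone(source='/tmp/origin', path='sub')

# previously this grew the history by two commits (add the subdataset,
# then record the URL in .gitmodules); now it is a single commit
assert len(super_ds.repo.get_revisions()) == n_before + 1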

@mih (Member) commented Mar 10, 2021

Restarted Appveyor

@bpoldrack (Member, Author) commented Mar 10, 2021

There appears to be a real issue with parallel execution - converting to draft until resolved.

Edit:
It had nothing to do with parallel execution; .gitmodules was unintentionally annexed. Fixed.
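
For context, a common way to keep a file such as .gitmodules in git rather than the annex is a largefiles rule in .gitattributes (illustrative only, not necessarily the exact fix applied here):

.gitmodules annex.largefiles=nothing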

@bpoldrack bpoldrack marked this pull request as draft March 10, 2021 16:54
@bpoldrack bpoldrack marked this pull request as ready for review March 11, 2021 11:41
codecov bot commented Mar 11, 2021

Codecov Report

Merging #5480 (c9fd7ea) into master (0b0e7fe) will decrease coverage by 34.74%.
The diff coverage is 66.66%.

❗ Current head c9fd7ea differs from pull request most recent head e24063f. Consider uploading reports for the commit e24063f to get more accurate results.

@@             Coverage Diff             @@
##           master    #5480       +/-   ##
===========================================
- Coverage   90.63%   55.89%   -34.75%     
===========================================
  Files         308      305        -3     
  Lines       42168    42145       -23     
===========================================
- Hits        38220    23555    -14665     
- Misses       3948    18590    +14642     
Impacted Files Coverage Δ
datalad/core/distributed/tests/test_clone.py 0.00% <0.00%> (-97.34%) ⬇️
datalad/local/add_readme.py 92.98% <ø> (ø)
datalad/core/distributed/clone.py 63.36% <85.71%> (-28.60%) ⬇️
datalad/local/tests/test_add_readme.py 100.00% <100.00%> (ø)
datalad/plugin/wtf.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/addurls.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/no_annex.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/add_readme.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/check_dates.py 0.00% <0.00%> (-100.00%) ⬇️
... and 197 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

yield r
ds.subdatasets(path,
               set_property=[("datalad-url", source)])
continue
@kyleam (Contributor) commented:

Stepping through a simple datalad clone -d. /path/to/a b, the logic under status=ok looks correct to me.

The logic under the else arm may need some more thought. For example, it'd be triggered by the notneeded record that comes out of a failed attempt to clone a dataset (say datalad clone -d. http://nothere.datalad.org/abc sub). While the subdatasets call is a noop there, it seems confusing to do, and in general I wonder if saving via the subdatasets call is a good action if we know the type=dataset, action=save call had a non-ok status.

A couple more thoughts:

  • A regression test for the "more than one commit" behavior is probably in order.

  • Relying on result records to decide to amend makes me a little uneasy (mostly in terms of changes upstream having unanticipated effects). Do you think it's worth adding a get_hexsha call pre and post ds.save(path, ...) as a sanity check that HEAD moved? (a minimal sketch follows this list)
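
A minimal sketch of that guard (simplified; ds, path, and source as in the surrounding clone code, with save consumed as a generator):

# sketch of the suggested sanity check, not the PR's actual code
head_pre = ds.repo.get_hexsha()
yield from ds.save(path, return_type='generator')
head_post = ds.repo.get_hexsha()
if head_post != head_pre:
    # HEAD demonstrably moved, so there is a fresh commit to amend
    ds.subdatasets(path, set_property=[("datalad-url", source)])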

@bpoldrack (Member, Author) commented Mar 12, 2021

While the subdatasets call is a noop there, it seems confusing to do, and in general I wonder if saving via the subdatasets call is a good action if we know the type=dataset, action=save call had a non-ok status.

Agree - I did wonder, too. The else is there to maintain the previous behavior when there wasn't an ok result. However, I'm happy to remove it entirely and say: either saving did something and we amend the commit to record datalad-url, or something went wrong and we don't want to put a commit on top of something broken, or no action was needed anyway.

A regression test for the "more than one commit" behavior is probably in order.

I'll add a test.

Relying on result records to decide to amend makes me a little uneasy (mostly in terms of changes upstream having unanticipated effects). Do you think it's worth adding a get_hexsha call pre and post ds.save(path, ...) as a sanity check that HEAD moved?

I don't think so. Relying on return values is no special case when it's a result record, in my view. We do that everywhere. Plus: if we think we can't rely on it internally, we should urgently remove the result-hook mechanism. IMO it's the other way around: we need to consider the result dicts part of the interface that we shouldn't easily break.
Regarding get_hexsha: it's more expensive and not that simple either. We need to consider adjusted branches. save calls AnnexRepo.localsync internally. If we came from a not-yet-synced state, we would get a different hexsha due to the annex sync even if the save itself was notneeded, I think. We can of course check the corresponding branch instead, ending up with three git calls (figure out the active branch + two get_hexsha calls). But the issue remains: if we were not yet synced, the different hexsha might not stem from us actually committing something in save.

I'd prefer to rely on what save tells us. However, we might then agree that the result record from save should be enhanced accordingly. save makes the get_hexsha calls anyway and could include them in the result, which might be useful elsewhere, too. With that, we could still check on commit SHAs instead of relying solely on the current notion of ok.
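
A sketch of what such an enhanced record could look like (the hexsha field names here are hypothetical, not existing datalad result keys):

def save_result(ds_path, status, hexsha_before, hexsha_after):
    # hypothetical enhanced 'save' result record
    return dict(
        action='save', type='dataset', status=status,
        path=ds_path, refds=ds_path,
        prev_hexsha=hexsha_before,  # HEAD before the save (made-up key)
        hexsha=hexsha_after,        # HEAD after the save (made-up key)
    )

# a consumer could then amend only if the save demonstrably moved HEAD
res = save_result('/tmp/super', 'ok', 'abc123', 'def456')
if res['status'] == 'ok' and res['prev_hexsha'] != res['hexsha']:
    pass  # safe to amend the just-created commit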

@kyleam (Contributor) commented Mar 12, 2021

I don't think so. Relying on return values is no special case when it is a result record in my view. We do that everywhere.

Right, result records are used everywhere. I think the thing that is making me pause here is a combination of this being an operation that rewrites history and the specific result check (action=save, type=dataset, status=ok for each result yielded by save(path, ...)). In the context of this call, I can't think of a future scenario where, say, there would be two action=save, type=dataset, status=ok results yielded, and I think risk of breakage is probably pretty low. I'm just undecided about whether the code should be more defensive given the operation (and am okay with your answer of "no").

Plus: If we think, we can't rely on it internally, we should urgently remove the result-hook mechanism.

I think it depends on what specific features of the result records are being relied on. For example, I'd guess we have had to adjust assert_result_count tests enough to know relying on a specific number of records matching a certain set of conditions is prone to become stale.
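
For instance, an assertion of this shape (illustrative, not a test from this PR) pins both the number and the exact field values of matching records:

from datalad.tests.utils import assert_result_count

results = [dict(action='save', type='dataset', status='ok', path='/tmp/ds')]
# breaks as soon as upstream yields a different number of matching records
assert_result_count(results, 1, action='save', type='dataset', status='ok')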

IMO it's the other way around: We need to consider the result dicts as part of the interface that we shouldn't easily break.

I don't think anyone disagrees that result dicts should be considered as part of the interface (and this was one of the aspects discussed when it was introduced).

Regarding get_hexsha: It's more expensive and not that simple either. [...]

Hmm, I haven't hit any issues quickly testing with ./tools/eval_under_testloopfs, but either way I was proposing "head pre != head post" as an additional guard/check, so notneeded would still be passed over as it is now.

save has the get_hexsha calls anyway and could include them in the result, which might be useful elsewhere, too.

If it's readily available, going that route sounds fine to me.

But as I mentioned above, I'm on the fence about the additional check in general, and it's okay with me if you decide to leave it as is.

@mih (Member) requested changes Mar 26, 2021

I have mixed feelings about this. Somewhat echoing @kyleam on the reliability of the result records, I'd say a few things:

  • it is OK to rely on result records, but that reliance needs to be specific
  • it is not reasonable to assume that order and number of records will never change

With that in mind, some observations:

  • amending is triggered by a {status=ok, type=dataset, action=save} result
  • there is no check on ds actually being the subject of that record
  • further modification of ds is done while the original ds.save could still be running -- the knowledge that the target result is (presently) the last one is implicit
  • whenever any of these assumptions change, problems could arise

My request:

  • at the very minimum we should check whether the dataset being reported as saved is actually ds
  • I find it much safer to delay the amend until we know for sure that no further modification of ds can take place before we attempt an amend or other change -- that is, after ds.save has returned (see the sketch after this list)
  • the result record inspection can still be used to decide between amend and property change.
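
Something along these lines (a sketch of the requested shape; ds, path, and source as in the surrounding clone code):

# collect all results first: save has fully returned, so nothing can
# further modify ds between this check and the amend
save_results = ds.save(path)  # default return_type is a list
if any(r.get('action') == 'save' and r.get('type') == 'dataset'
       and r.get('refds') == ds.path and r.get('status') == 'ok'
       for r in save_results):
    # the subdataset commit is in place; amend in the original URL
    ds.subdatasets(path, set_property=[("datalad-url", source)])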

@mih (Member) commented Apr 28, 2021

It would be very helpful to have this merged. @bpoldrack can you please have a look at what is left to do? Thx!

Reduce the number of commits created when cloning into an existing
dataset, making the clone a new subdataset. Amend the commit adding the
new subdataset with the recording of the originally given URL, if possible.

Note that ideally the subdatasets command should see a refactoring
exposing the relevant parts via public helpers.

(Closes datalad#5412)
Changes to .gitmodules on clone should be included in
the commit that adds the subdataset.
Wait for save to actually finish before amending the expected commit,
instead of reacting to the result when it's yielded.

Furthermore, be a bit more specific about the save result we expect.
However, it's actually not easy to be sure the subdataset was saved
last based on save's results, because the 'save' result doesn't really
contain that information. Instead, the paths of the submodule (and
.gitmodules) are part of 'add' results beforehand. How to be
future-safe about when exactly we amend (and when we don't) is not
obvious to me. If we can't rely on the order, number, etc. of save's
results, how then to test whether the last commit (which we amend!) is
the right one? We could use git directly to inspect, but that's rather
expensive, given that at this point we are in clone, saving a freshly
cloned subdataset. I don't think there are future changes in `save` to
safeguard against here that would warrant expensive additional checks.
If anything, save's results need to be enhanced to include a list of
the paths that were saved.
@mih (Member) left a comment:

Thx for the update. I made a proposal to simplify this change a bit. Please have a look.

# particularly relevant for postclone routines on a later `get`
# for that subdataset. See gh-5256.
if any(r['action'] == 'save' and r['type'] == 'dataset' and
       r['refds'] == ds.path for r in save_results):
@mih (Member) commented:

It seems that this is the only place where save_results is used. I am not sure how many results there could be, but a simpler approach would be to build the boolean indicator that triggers this conditional in the loop above. So something like:

actually_saved_subds = None
...
actually_saved_subds = actually_saved_subds or (
    r['status'] if r['action'] == 'save' and r['type'] == 'dataset'
    and r['refds'] == ds.path else None)
...
if actually_saved_subds:
    ...

This seems leaner and clearer as to why and how the information is retained.

@bpoldrack (Member, Author) commented:

Agree.

@bpoldrack (Member, Author) commented:

Done. With slight variation.

# for that subdataset. See gh-5256.
if any(r['action'] == 'save' and r['type'] == 'dataset' and
       r['refds'] == ds.path for r in save_results):
    if r['status'] == 'ok':
@mih (Member) commented:

Suggested change:
-if r['status'] == 'ok':
+if actually_saved_subdataset == 'ok':

@mih mih dismissed their stale review June 10, 2021 12:26


@mih (Member) commented Jun 10, 2021

Thx! Works for me now. Please add a proper title and top comment, so we will end up with something intelligible in the changelog.

@bpoldrack bpoldrack changed the title Clone: Amend subds commit Reduce number of commits when cloning into a dataset Jun 10, 2021
@bpoldrack (Member, Author) commented:

Like this, @mih?

@mih (Member) left a comment:

Yes, thx!

@bpoldrack bpoldrack merged commit f5b6db0 into datalad:master Jun 11, 2021