Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retire distribution.create and promote rev_create #3383

Merged
merged 8 commits into from May 7, 2019

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented May 3, 2019

Remaining things to do:

Closes #3379.

kyleam added 6 commits May 3, 2019 11:42
There is no additional commit in the submodule; it just has staged
changes.
Unlike distribution.create, rev_create puts a .noannex file in the
repository if --no-annex was given.  Adjust test_unlock_raises to
remove the .noannex file before trying to convert the existing
repository into an annex repository.
8389502 (RF: internal use of `create` -> `rev_create`, 2019-03-22)
converted most spots, but there are still a few places outside of
distribution/tests/test_create.py where we use distribution.create.
Convert the remaining spots, leaving only a few in test_add.py
(to be addressed next) and the ones in the benchmarks.
These tests rely on create's save argument, which rev_create dropped.
In test_add_dirty_tree, the save=False is unnecessary, so use a plain
rev_create call.  In the other spots, use an unbound rev_create call
to avoid saving the subdataset in the superdataset.  The results
aren't identical, but the differences don't matter in the context of
these tests.
@codecov
Copy link

codecov bot commented May 3, 2019

Codecov Report

Merging #3383 into master will decrease coverage by 0.15%.
The diff coverage is 99.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3383      +/-   ##
==========================================
- Coverage   91.24%   91.08%   -0.16%     
==========================================
  Files         265      263       -2     
  Lines       34452    34152     -300     
==========================================
- Hits        31436    31109     -327     
- Misses       3016     3043      +27
Impacted Files Coverage Δ
datalad/interface/__init__.py 100% <ø> (ø) ⬆️
datalad/customremotes/tests/test_archives.py 89.47% <0%> (-0.6%) ⬇️
datalad/interface/tests/test_utils.py 100% <100%> (ø) ⬆️
datalad/distribution/tests/test_add.py 82.95% <100%> (+0.46%) ⬆️
datalad/metadata/extractors/tests/test_exif.py 93.54% <100%> (ø) ⬆️
datalad/plugin/tests/test_plugins.py 89.02% <100%> (ø) ⬆️
datalad/support/tests/test_repo_save.py 100% <100%> (ø) ⬆️
datalad/distribution/tests/test_create_github.py 93.02% <100%> (ø) ⬆️
datalad/distribution/tests/test_publish.py 100% <100%> (ø) ⬆️
datalad/interface/tests/test_run_procedure.py 100% <100%> (ø) ⬆️
... and 43 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca35877...9cced04. Read the comment docs.

With rev-create (now create), anything after the path is taken as an
option to 'git init'.
@kyleam kyleam added the WIP work in progress label May 3, 2019
kyleam added a commit to kyleam/datalad-revolution that referenced this pull request May 3, 2019
datalad/datalad#3383 replaces core's create with rev-create.  Add back
a compatibility alias, like we already do for diff and status.
@mih
Copy link
Member

mih commented May 3, 2019

The diff looks pleasantly straightforward. Thx.

@kyleam
Copy link
Contributor Author

kyleam commented May 4, 2019

The crawler used the backend argument of the old create to set the backend to datalad.crawl.default_backend. The new create relies soley on config values, which works nicely from the command line:

datalad -c datalad.repo.backend=SHA256E create blah

What's the cleanest way for a python caller to set this for one invocation of create? The two best options I can think of, neither pretty IMO, are (1) temporarily overriding the environment variable and reloading datalad.cfg or (2) using a fresh config instance instead of datalad.cfg and specifying overrides. Anything better that I'm overlooking?

@mih
Copy link
Member

mih commented May 4, 2019

What's the cleanest way for a python caller to set this for one invocation of create?

I would take one further step back and ignore the create default entirely. I see no problem or disadvantage of resetting the value after create is done, as this is a setting that lives in the local clone, and does not affect any content in the create commit.

Hence:

>>> import datalad.api as dl
>>> ds = dl.rev_create('.')
[INFO   ] Creating a new annex repo at /tmp/backend

% cat .git/config
[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[annex]
        uuid = a878132b-1be5-41e0-923d-74c1754b02ef
        version = 5
        backends = MD5E

# replace value SHA256E with value of `datalad.crawl.default_backend`
>>> ds.config.set('annex.backends', 'SHA256E', where='local')

% cat .git/config
[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[annex]
        uuid = a878132b-1be5-41e0-923d-74c1754b02ef
        version = 5
        backends = SHA256E                                                                                                                           

So that is doable and less complicated than any global overrides. However, I wonder why the crawler needs a different default, and what requirements it has that demand it to be different than MD5E -- and whether we should have the same default in -core?

Looking into this I see:

datalad_crawler/nodes/annex.py:        backend = self.backend or cfg.obtain('datalad.crawl.default_backend', default='MD5E')

So it seems to me that those settings are redundant, the requirements are identical, and datalad.crawl.default_backend can be retired in favor of the more general datalad.repo.backend.

@yarikoptic what do you think?

@kyleam
Copy link
Contributor Author

kyleam commented May 4, 2019

I see no problem or disadvantage of resetting the value after create is done, as this is a setting that lives in the local clone, and does not affect any content in the create commit.

It is more than just a setting in the local clone. Both the old create and rev-create set the annex backend in .gitattributes with set_default_backend(..., persistent=True). With your suggestion to change git/.config, adding new files would stil honor the value set in .gitattributes:

import os
import tempfile
import datalad.api as dl

ds = dl.create(tempfile.mkdtemp(prefix="dl-"))  # rev_create
ds.config.set('annex.backends', 'SHA256E', where='local')
(ds.pathobj / "blah").write_text("ooo")
ds.rev_save()
print(os.readlink(str(ds.pathobj / "blah")))
# .git/annex/objects/37/Mv/MD5E-s3--7f94dd413148ff9ac9e9e4b6ff2b6ca9/MD5E-s3--7f94dd413148ff9ac9e9e4b6ff2b6ca9

Before asking about this, I briefly explored doing it after, with something like

diff --git a/datalad_crawler/nodes/annex.py b/datalad_crawler/nodes/annex.py
index 442397f..af802d3 100644
--- a/datalad_crawler/nodes/annex.py
+++ b/datalad_crawler/nodes/annex.py
@@ -138,8 +138,6 @@ def _initiate_dataset(self, path, name):
             # TODO: RF whenevever create becomes a dedicated factory/method
             # and/or branch becomes an option for the "creator"

-        backend = self.backend or cfg.obtain('datalad.crawl.default_backend', default='MD5E')
-
         ds = create(
                 path=path,
                 force=False,
@@ -148,11 +146,14 @@ def _initiate_dataset(self, path, name):
                 #  custom backend was specified, but now with dataset id -- should always save
                 # save=not bool(backend),
                 # annex_version=None,
-                annex_backend=backend,
                 #git_opts=None,
                 #annex_opts=None,
                 #annex_init_opts=None
         )
+        ds.repo.set_default_backend(
+            self.backend or cfg.obtain(
+                'datalad.crawl.default_backend', default='MD5E'),
+            persistent=True, commit=True)
         if self.add_to_super:
             # place hack from 'add-to-super' times here
             # MIH: tests indicate that this wants to discover any dataset above

But set_default_backend() won't override an existing value (perhaps it should learn a force option or just do so if there are no annex objects yet?).

@mih
Copy link
Member

mih commented May 5, 2019

Sorry, I missed the attributes. I see no better way than using cfg overrides (as you initially pointed out). You can also dump a file with just * annex.backend=SHA256 in the directory, and run create with --force. But I would not call this pretty either.

But @yarikoptic's feedback is still out on why having this second same default to begin with.

@yarikoptic
Copy link
Member

I think that ideally all python API commands should support passing dictionary of configuration variables, to be in line with command line API.

As for crawler, we could indeed retire it anyone asks for it back. I typically really to default anyways for any new dataset

kyleam added a commit to kyleam/datalad-crawler that referenced this pull request May 6, 2019
rev-create, soon to be the main create, doesn't allow configuring the
backend via an argument.  To be compatible with the new create, we
could set datalad.repo.backend to whatever value is configured via a
crawler-specific mechanism.  But it's not clear that any
crawler-specific mechanisms are needed to configure the backend, so
instead warn that the crawler setting isn't honored and point to
datalad.repo.backend.

Re: datalad/datalad#3383 (comment)
@kyleam kyleam removed the WIP work in progress label May 6, 2019
@kyleam
Copy link
Contributor Author

kyleam commented May 6, 2019

This PR and the sister PRs (datalad/datalad-revolution#121 and datalad/datalad-crawler#41) are ready to be reviewed.

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I was hoping for the diff to be like this, and now I am happy that it actually worked out like this. Thanks much for taking care of this PR.

@mih
Copy link
Member

mih commented May 6, 2019

datalad/datalad-revolution#121 is now waiting for this to be merged. I think an rc3 release is also due after the merge.

@mih
Copy link
Member

mih commented May 7, 2019

@yarikoptic I will merge this now in order to get rc3 out as a depencency for the metadata work. Given the nature of the diff (almost exclusively a renaming) I think the chances for unexpected things a low. And we can always adjust later on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace create with rev_create (and rename to create)
3 participants