New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Towards supporting access to RIA stores in clone() #4022
Conversation
Can now theoretically provide/decode additional properties, such as a version identifier.
Not strictly necessary, but if we can fail early on an invalid URI, we should.
Makes it possible to use labels instead of hardcoded locations in ria+:// URIs
basepath = '' | ||
|
||
dsid, version = source_ri.fragment.split('@', maxsplit=1) \ | ||
if '@' in source_ri.fragment else (source_ri.fragment, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my reservations for use of @
remain: #2109 (comment) Delaying coming up with generally applicable "schema" for URLs would only bring pains (or even prevent) down the road.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But here the @
is used withn fragment
, so after #
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rright, and then for other types of URLs we would need to place it into a fragment explicitly as http://host/path#@version
even when there is nothing else to go into fragment.
It might work - we would need anyways to generally urlencode all values for the keys in fragment, so @
or ;
in e.g. email address or even a path (if present among keys in the fragment) shouldn't end up seen in URL, confusing our parser.
Also having fragment for ria as #datasetid
makes it probably ria specific -- so we would need per url backend handling or just in general decide that non-keyed element in fragment is the dataset id
. So, general "schema" would then be
#[datasetid][;key=value]*[@version]
although probably datasetid
wouldn't be handled by majority of "url backends"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note though, that IIRC @
is not officially an allowed character within fragment
https://tools.ietf.org/html/rfc3986#section-3.5 :
fragment = *( pchar / "/" / "?" )
but above it says:
A fragment identifier component is indicated by the presence of a
number sign ("#") character and terminated by the end of the URI.
so it seems that splitting it out from a url is unambiguous and having @
in it shouldn't confuse any system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for the detailed assessment!
AFAICT Implementation and tests are missing out on a local RIA store. While it would work to specify such as ria+ssh://my-local-ria-remote#23h4afc53 ...
with no ssh-host configured, it would actually pass an SSH-Scheme URL on to git with either no host or 0
. Or am I missing something?
I concur with the general sentiment here that this is not a (or the) general solution to all URL setups, but it should be a meaningful setup for this particular use case. I am adding a bunch of additional TODOs to the top comment. |
For more specialized/structured source like RIA stores, more meaningful defaults are possible. This RF makes it easier to tailor the default.
Codecov Report
@@ Coverage Diff @@
## master #4022 +/- ##
=========================================
Coverage ? 89.72%
=========================================
Files ? 272
Lines ? 36536
Branches ? 0
=========================================
Hits ? 32781
Misses ? 3755
Partials ? 0
Continue to review full report at Codecov.
|
Otherwise `install` filters the results out.
# can we clone from the store w/o any dedicated config | ||
ds = clone('ria+http://store.datalad.org#{}'.format(datalad_store_testds_id), path) | ||
ok_(ds.is_installed()) | ||
eq_(ds.id, datalad_store_testds_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- special remote needs to be autoenabled
- origin needs annex-ignore
- ideally we should probably set publish-depends for origin
- test availability via whereis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re availability: May be test, that fsck doesn't spit out some (fixed loaction log ...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
get
needs work. -
May be also include some annexed content in archive?
-
if and when we include complete post-procedure, configs for submodule result hooks must be there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx!
I will not test get
. A get would require ria-remote
. And we can test a get in the ria-remote tests for the main store. The test dataset already contains an annexed file.
If we decide to suck ria-remote into datalad-core, adding a test would be trivial, and can be done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test dataset already contains an annexed file.
Yes, but we should prob. put an archive in there as well. However, we can do that within git-annex-ria-remote tests for now - agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would require SSH access to the main store. Not a good idea, IMHO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a new test that goes through all the steps for a dataset with a subdataset and ensures a proper setup for both
…n clone So far bits were a bit scattered, so reporting would not have access to all information on each source candidate. This is the precondition for per-source post-processing of datasets. Coming next...
Should now be able to deal with ports and paths. This also fixes a bug where we generated SSH URLs with ':' host-path delimiters, which should not be there.
Makes it more obvious what the exact conditions are that we build on.
A dataset cloned from a store can have any of its subdataset also in that store and datalad will find them there, regardless of the submodule url setting.
From my POV this is reasonably complete, and ready for review. It enables full and convenient access to RIA stores for any clone-related functionality. I would add documentation to this PR, once we agree on the trajectory. A next step (in a separeate PR) would be to decide on and implement #4024 Once done,
|
I think, this approach lacks one aspect of the current setup in |
True, but I see this as a something different that only has marginally to do with RIA. It is a reckless mode for any local annex that uses hashdir mixed. Hence, I prefer to handle that separately to avoid overfitting, and to keep things orthogonal. |
Yes, makes sense. Just want to make sure, we are aware. |
datalad/core/distributed/clone.py
Outdated
# TODO are there any advantages of going through a URL-reconversion | ||
# by the RI instance, instead of using the original input right away? | ||
# it feels like we could only accumulate bugs. Keeping how it was, | ||
# for now... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it'd be better to use the original value. This is particularly important because git supports flexible sources via gitremote-helper
.
By the way, in the context of the call's discussion today on whether "ria" (or whatever name) should be in the scheme part or in the fragment, in my view gitremote-helper
provides two related points in favor of putting it in the scheme: (1) to be consistent with what git does (<transport>://<address>
or <transport>::<address>
) and (2) it keeps open the possibility of eventually implementing remote helpers for ria+
URLs so that git commands can work directly with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kyleam for pointing to gitremote-helper
-- I should have remembered about it since I must have looked at it in the past.
My non-executive summary:
- I am Ok with going with this PR as is ATM
but in the long run (although datalad::ria::
instead of ria+
could be a short one) I think
- we should RF to provide
datalad
git remote helper and advise to usedatalad::url
or at least to adhere to such schema ourselves to start. pros: even for local paths it would become sensibledatalad::some/possibly/relative/path
(what would it be now in this PR? haven't seen an example, could it would be smth likeria+some/possibly/relative/path
if to stay consistent). That would make it clear to the user who seesdatalad::
prefix that it relates to datalad, similarly to howhg::URL
gives immediate information to me that remote URL I am giving togit clone
ishg
repository. (please no grandma examples, use prototypical datalad users) - layout could be specified as explicit option in the fragment, e.g.
#layout=uuid2
(and describe in docs what uuid2 is and that ria uses such layout scheme (hardcoded)) e.g.datalad::path#id=UUID;layout=uuid2
ordatalad::path#layout=dataset-tarballs
(for whenever we implement export of the datasets with tarball for.git
itself), but - I think that we might want to altogether avoid hard coding assumptions of the layout on remote end into URL, be it a
ria
(abbreviation which has nothing to do with layout per se) or anything else; neither in prefix of transport nor in URL fragment like I was proposing. I think, we might want it to be "queried" by datalad from that remote (e.g. some config option within.datalad/config
on the remote end) similarly how to git-annex queries UUID etc. It would though require providing support for underlying transport mechanisms to requests such a file (or just entire super/top dataset to some temporary location if not present), but pros: it would open possibilities for migrating layout/etc on remote end without requiring adjustment to the URLs or code, and avoid possible guesswork then otherwise needed. But ATM we even do guesswork for either there is/.git
or not, so I guess it would be a long shot. - may alternatively in the short run we could/should "chain" hardcoded
ria
asdatalad::ria::URL
but I am not found of it as much as ofria+
because then no alternative tool (e.g. browser rendering top JS page of the URL, such as http://store.datalad.org) to datalad would be given a chance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ria (abbreviation which has nothing to do with layout per se)
Yes and no. You can have that layout without a RIA special remote and current implementation here does that. But: The other way around: RIA would produce such a layout. So, if you know your store was created by it, you wouldn't need know about the layout but just that it's a RIA(-created) store.
Another aspect of it (not currently implemented here): That layout RIA special remote is created is versioned and stored in that store's base-path. So, if we know it's a RIA store and the special remote makes it into core, we can query the special remote for the correct layout without changing that URL. RIA remote's own implementation does exactly that. To do it here would require to import it, of course.
I think, we might want it to be "queried" by datalad from that remote (e.g. some config option within .datalad/config on the remote end) similarly how to git-annex queries UUID etc.
I actually wanted to make a very similar point as an argument for using the scheme rather than fragment: Currently we can have a label for a store, that just determines the base of the dataset tree as an actual URL. But: If it's not a regular URL we can enhance that to adress a distrubuted store. That is, several of such stores the user would know about as one thing and let datalad query an index to figure out which store to use for a given dataset id.
datalad/core/distributed/clone.py
Outdated
'annex.ria-remote.{}.ssh-host'.format(source_ri.hostname), | ||
source_ri.hostname) | ||
basepath = cfg.get( | ||
'annex.ria-remote.{}.base-path'.format(source_ri.hostname), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand the role of these options, I think url.<base>.insteadOf
could also be used to handle the URL rewrites, in which case pointing users to a more general existing solution might be better than adding more options on our side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original intend for those options is about the special remote itself, which doesn't pull its configuration from git-annex:remote.log
but from those, allowing to reconfigure the mapping of an (autoenabled) special remote to the actual store being used by it depending on config. This was intended to easily switch the store depending on where a clone is (infrastructure reasons) based on a system config the user doesn't have to care about.
However, as the RIA approach evolves, we do indeed want to switch to something like you propose here, since we now can have all required info in that kind of URL. But this needs to be in sync with git-annex-ria-remote itself, so I'd postpone that change wrt this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the insteadOf
idea. The consequence for this PR would be to simply remove the config query, and let lower-level code handle any conversion. I will do just that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@kyleam suggested to consult `url.<base>.insteadOf` when necessary. This means we can just pass on whatever we get here. datalad#4022 (comment)
Conflicts: datalad/core/distributed/tests/test_clone.py
This is done from my POV. Given that we are all somewhat OK with this, I will merge once the tests pass -- unless someone stops me. It is clear that this is not the end of road, but it seems to me that this PR is not blocking any future developments. Moreover, I consider it a good thing (TM) that we use a special "ria" label for all of this, and do not claim any definitive "datalad" namespace. Whenever I have done such a power-move in the past, I quickly came to regret it. |
generalize existing URL normalization function to a helper to decode more it less arbitrary specifications to git-compatible clone url plus any number of additional properties
add support for
ria+<protocol>://
URIspopulated
store.datalad.org
with a test dataset using datalad/git-annex-ria-remote#38. Requires the client-side setup described below and then:datalad create test
datalad -d test create-sibling-ria storedataladorg
Result now sitting at
http://store.datalad.org/76b/6ca66-36b1-11ea-a2e6-f0d5bf7b5561
make
datalad clone ria+http://store.datalad.org#76b6ca66-36b1-11ea-a2e6-f0d5bf7b5561
workgit update-server-info
on the server, would a smart http setup on the server help @yarikoptic ?make
datalad clone ria+ssh://store.datalad.org#76b6ca66-36b1-11ea-a2e6-f0d5bf7b5561
workria-remote
.add tests against
store.datalad.org
enhance the clone source specification decoding to also suggest a destination path.Often more clever names will be possible then just the basename derived from the final git-clone url
describe
ria+http|ssh://
URIs in docscheck that
ria+http
logic can deal with URLs were the store needs a path in addition to the hostname, .e.g.example.com/mystore#dsid