Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Towards supporting access to RIA stores in clone() #4022

Merged
merged 18 commits into from Jan 17, 2020
Merged

Conversation

mih
Copy link
Member

@mih mih commented Jan 13, 2020

  • generalize existing URL normalization function to a helper to decode more it less arbitrary specifications to git-compatible clone url plus any number of additional properties

  • add support for ria+<protocol>:// URIs

  • populated store.datalad.org with a test dataset using datalad/git-annex-ria-remote#38. Requires the client-side setup described below and then:

    • datalad create test
    • datalad -d test create-sibling-ria storedataladorg

    Result now sitting at http://store.datalad.org/76b/6ca66-36b1-11ea-a2e6-f0d5bf7b5561

  • make datalad clone ria+http://store.datalad.org#76b6ca66-36b1-11ea-a2e6-f0d5bf7b5561 work

    • for now I had to manually run git update-server-info on the server, would a smart http setup on the server help @yarikoptic ?
  • make datalad clone ria+ssh://store.datalad.org#76b6ca66-36b1-11ea-a2e6-f0d5bf7b5561 work

    • only works with additional client-side configuration (see below), because the actual location on the server cannot be inferred from the identifier. This is not a problem, because only people with write access will need to have this, and they already need that for ria-remote.
  • add tests against store.datalad.org

    • clone from http
    • verify required post-clone configuration is in place
  • enhance the clone source specification decoding to also suggest a destination path.Often more clever names will be possible then just the basename derived from the final git-clone url

  • describe ria+http|ssh:// URIs in docs

  • check that ria+http logic can deal with URLs were the store needs a path in addition to the hostname, .e.g. example.com/mystore#dsid

mih added 5 commits Jan 13, 2020
Can now theoretically provide/decode additional properties, such as a
version identifier.
Not strictly necessary, but if we can fail early on an invalid URI,
we should.
Makes it possible to use labels instead of hardcoded locations in
ria+:// URIs
basepath = ''

dsid, version = source_ri.fragment.split('@', maxsplit=1) \
if '@' in source_ri.fragment else (source_ri.fragment, None)
Copy link
Member

@yarikoptic yarikoptic Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my reservations for use of @ remain: #2109 (comment) Delaying coming up with generally applicable "schema" for URLs would only bring pains (or even prevent) down the road.

Copy link
Member

@bpoldrack bpoldrack Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But here the @ is used withn fragment, so after #.

Copy link
Member

@yarikoptic yarikoptic Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rright, and then for other types of URLs we would need to place it into a fragment explicitly as http://host/path#@version even when there is nothing else to go into fragment.
It might work - we would need anyways to generally urlencode all values for the keys in fragment, so @ or ; in e.g. email address or even a path (if present among keys in the fragment) shouldn't end up seen in URL, confusing our parser.
Also having fragment for ria as #datasetid makes it probably ria specific -- so we would need per url backend handling or just in general decide that non-keyed element in fragment is the dataset id. So, general "schema" would then be
#[datasetid][;key=value]*[@version]
although probably datasetid wouldn't be handled by majority of "url backends"?

Copy link
Member

@yarikoptic yarikoptic Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note though, that IIRC @ is not officially an allowed character within fragment
https://tools.ietf.org/html/rfc3986#section-3.5 :

      fragment    = *( pchar / "/" / "?" )

but above it says:

A fragment identifier component is indicated by the presence of a
number sign ("#") character and terminated by the end of the URI.

so it seems that splitting it out from a url is unambiguous and having @ in it shouldn't confuse any system.

Copy link
Member Author

@mih mih Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the detailed assessment!

Copy link
Member

@bpoldrack bpoldrack left a comment

AFAICT Implementation and tests are missing out on a local RIA store. While it would work to specify such as ria+ssh://my-local-ria-remote#23h4afc53 ... with no ssh-host configured, it would actually pass an SSH-Scheme URL on to git with either no host or 0. Or am I missing something?

datalad/core/distributed/clone.py Outdated Show resolved Hide resolved
@mih
Copy link
Member Author

@mih mih commented Jan 14, 2020

I concur with the general sentiment here that this is not a (or the) general solution to all URL setups, but it should be a meaningful setup for this particular use case.

I am adding a bunch of additional TODOs to the top comment.

mih added 2 commits Jan 14, 2020
For more specialized/structured source like RIA stores, more meaningful
defaults are possible. This RF makes it easier to tailor the default.
@codecov
Copy link

@codecov codecov bot commented Jan 14, 2020

Codecov Report

No coverage uploaded for pull request base (master@fd3de38). Click here to learn what that means.
The diff coverage is 85.71%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #4022   +/-   ##
=========================================
  Coverage          ?   89.72%           
=========================================
  Files             ?      272           
  Lines             ?    36536           
  Branches          ?        0           
=========================================
  Hits              ?    32781           
  Misses            ?     3755           
  Partials          ?        0
Impacted Files Coverage Δ
datalad/support/gitrepo.py 89.73% <ø> (ø)
datalad/version.py 37.5% <100%> (ø)
datalad/core/distributed/clone.py 92.79% <100%> (ø)
datalad/core/distributed/tests/test_clone.py 90.47% <83.33%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd3de38...345c533. Read the comment docs.

# can we clone from the store w/o any dedicated config
ds = clone('ria+http://store.datalad.org#{}'.format(datalad_store_testds_id), path)
ok_(ds.is_installed())
eq_(ds.id, datalad_store_testds_id)
Copy link
Member

@bpoldrack bpoldrack Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • special remote needs to be autoenabled
  • origin needs annex-ignore
  • ideally we should probably set publish-depends for origin
  • test availability via whereis?

Copy link
Member

@bpoldrack bpoldrack Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re availability: May be test, that fsck doesn't spit out some (fixed loaction log ...)?

Copy link
Member

@bpoldrack bpoldrack Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • get needs work.

  • May be also include some annexed content in archive?

  • if and when we include complete post-procedure, configs for submodule result hooks must be there

Copy link
Member Author

@mih mih Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx!

I will not test get. A get would require ria-remote. And we can test a get in the ria-remote tests for the main store. The test dataset already contains an annexed file.

If we decide to suck ria-remote into datalad-core, adding a test would be trivial, and can be done

Copy link
Member

@bpoldrack bpoldrack Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test dataset already contains an annexed file.

Yes, but we should prob. put an archive in there as well. However, we can do that within git-annex-ria-remote tests for now - agreed.

Copy link
Member Author

@mih mih Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require SSH access to the main store. Not a good idea, IMHO

Copy link
Member Author

@mih mih Jan 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new test that goes through all the steps for a dataset with a subdataset and ensures a proper setup for both

mih added 4 commits Jan 15, 2020
…n clone

So far bits were a bit scattered, so reporting would not have access to
all information on each source candidate.

This is the precondition for per-source post-processing of datasets.
Coming next...
Should now be able to deal with ports and paths.

This also fixes a bug where we generated SSH URLs with ':' host-path
delimiters, which should not be there.
Makes it more obvious what the exact conditions are that we build on.
A dataset cloned from a store can have any of its subdataset also in
that store and datalad will find them there, regardless of the submodule
url setting.
@mih mih marked this pull request as ready for review Jan 15, 2020
@mih
Copy link
Member Author

@mih mih commented Jan 15, 2020

From my POV this is reasonably complete, and ready for review. It enables full and convenient access to RIA stores for any clone-related functionality. I would add documentation to this PR, once we agree on the trajectory.

A next step (in a separeate PR) would be to decide on and implement #4024

Once done, get()-related functionality should be tested for RIA clones, too, as suggested by @bpoldrack above. The test included here can easily be extended to do that.

  • check if a RIA special remote is or can be enabled
  • add a publication dependency for origin on the RIA special remote
  • add another reckless mode for ephemeral clones #3528
  • deploy ria special remote configuration automatically in a clone (if not already set). All information can be obtained from a ria+http|ssh:// URL.

@mih mih added merge-if-ok feedback-desired and removed merge-if-ok labels Jan 15, 2020
@bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Jan 15, 2020

I think, this approach lacks one aspect of the current setup in git-annex-ria-remote with its ria-post-intall procedure: The configuration to setup subdatasets in the same reckless/ephemeral mode as the superdataset. Since this is not an information from within the URL but a parameter to Clone itself, we would still need such a procedure (which might be connected to the result hooks by postclonecfg_annexdataset now).

@mih
Copy link
Member Author

@mih mih commented Jan 15, 2020

True, but I see this as a something different that only has marginally to do with RIA. It is a reckless mode for any local annex that uses hashdir mixed. Hence, I prefer to handle that separately to avoid overfitting, and to keep things orthogonal.

@bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Jan 15, 2020

Hence, I prefer to handle that separately to avoid overfitting, and to keep things orthogonal.

Yes, makes sense. Just want to make sure, we are aware.
Otherwise this LGTM.

@mih mih added the conference agenda item label Jan 15, 2020
@mih mih removed the conference agenda item label Jan 15, 2020
# TODO are there any advantages of going through a URL-reconversion
# by the RI instance, instead of using the original input right away?
# it feels like we could only accumulate bugs. Keeping how it was,
# for now...
Copy link
Contributor

@kyleam kyleam Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it'd be better to use the original value. This is particularly important because git supports flexible sources via gitremote-helper.

By the way, in the context of the call's discussion today on whether "ria" (or whatever name) should be in the scheme part or in the fragment, in my view gitremote-helper provides two related points in favor of putting it in the scheme: (1) to be consistent with what git does (<transport>://<address> or <transport>::<address>) and (2) it keeps open the possibility of eventually implementing remote helpers for ria+ URLs so that git commands can work directly with them.

Copy link
Member

@yarikoptic yarikoptic Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kyleam for pointing to gitremote-helper -- I should have remembered about it since I must have looked at it in the past.
My non-executive summary:

  • I am Ok with going with this PR as is ATM

but in the long run (although datalad::ria:: instead of ria+ could be a short one) I think

  • we should RF to provide datalad git remote helper and advise to use datalad::url or at least to adhere to such schema ourselves to start. pros: even for local paths it would become sensible datalad::some/possibly/relative/path (what would it be now in this PR? haven't seen an example, could it would be smth like ria+some/possibly/relative/path if to stay consistent). That would make it clear to the user who sees datalad:: prefix that it relates to datalad, similarly to how hg::URL gives immediate information to me that remote URL I am giving to git clone is hg repository. (please no grandma examples, use prototypical datalad users)
  • layout could be specified as explicit option in the fragment, e.g. #layout=uuid2 (and describe in docs what uuid2 is and that ria uses such layout scheme (hardcoded)) e.g. datalad::path#id=UUID;layout=uuid2 or datalad::path#layout=dataset-tarballs (for whenever we implement export of the datasets with tarball for .git itself), but
  • I think that we might want to altogether avoid hard coding assumptions of the layout on remote end into URL, be it a ria (abbreviation which has nothing to do with layout per se) or anything else; neither in prefix of transport nor in URL fragment like I was proposing. I think, we might want it to be "queried" by datalad from that remote (e.g. some config option within .datalad/config on the remote end) similarly how to git-annex queries UUID etc. It would though require providing support for underlying transport mechanisms to requests such a file (or just entire super/top dataset to some temporary location if not present), but pros: it would open possibilities for migrating layout/etc on remote end without requiring adjustment to the URLs or code, and avoid possible guesswork then otherwise needed. But ATM we even do guesswork for either there is /.git or not, so I guess it would be a long shot.
  • may alternatively in the short run we could/should "chain" hardcoded ria as datalad::ria::URL but I am not found of it as much as of ria+ because then no alternative tool (e.g. browser rendering top JS page of the URL, such as http://store.datalad.org) to datalad would be given a chance.

Copy link
Member

@bpoldrack bpoldrack Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ria (abbreviation which has nothing to do with layout per se)

Yes and no. You can have that layout without a RIA special remote and current implementation here does that. But: The other way around: RIA would produce such a layout. So, if you know your store was created by it, you wouldn't need know about the layout but just that it's a RIA(-created) store.
Another aspect of it (not currently implemented here): That layout RIA special remote is created is versioned and stored in that store's base-path. So, if we know it's a RIA store and the special remote makes it into core, we can query the special remote for the correct layout without changing that URL. RIA remote's own implementation does exactly that. To do it here would require to import it, of course.

I think, we might want it to be "queried" by datalad from that remote (e.g. some config option within .datalad/config on the remote end) similarly how to git-annex queries UUID etc.

I actually wanted to make a very similar point as an argument for using the scheme rather than fragment: Currently we can have a label for a store, that just determines the base of the dataset tree as an actual URL. But: If it's not a regular URL we can enhance that to adress a distrubuted store. That is, several of such stores the user would know about as one thing and let datalad query an index to figure out which store to use for a given dataset id.

'annex.ria-remote.{}.ssh-host'.format(source_ri.hostname),
source_ri.hostname)
basepath = cfg.get(
'annex.ria-remote.{}.base-path'.format(source_ri.hostname),
Copy link
Contributor

@kyleam kyleam Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the role of these options, I think url.<base>.insteadOf could also be used to handle the URL rewrites, in which case pointing users to a more general existing solution might be better than adding more options on our side.

Copy link
Member

@bpoldrack bpoldrack Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original intend for those options is about the special remote itself, which doesn't pull its configuration from git-annex:remote.log but from those, allowing to reconfigure the mapping of an (autoenabled) special remote to the actual store being used by it depending on config. This was intended to easily switch the store depending on where a clone is (infrastructure reasons) based on a system config the user doesn't have to care about.

However, as the RIA approach evolves, we do indeed want to switch to something like you propose here, since we now can have all required info in that kind of URL. But this needs to be in sync with git-annex-ria-remote itself, so I'd postpone that change wrt this PR.

Copy link
Member Author

@mih mih Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the insteadOf idea. The consequence for this PR would be to simply remove the config query, and let lower-level code handle any conversion. I will do just that.

Copy link
Member Author

@mih mih Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

mih added 3 commits Jan 16, 2020
@kyleam suggested to consult `url.<base>.insteadOf` when necessary.
This means we can just pass on whatever we get here.

datalad#4022 (comment)
 Conflicts:
	datalad/core/distributed/tests/test_clone.py
@mih
Copy link
Member Author

@mih mih commented Jan 16, 2020

This is done from my POV. Given that we are all somewhat OK with this, I will merge once the tests pass -- unless someone stops me.

It is clear that this is not the end of road, but it seems to me that this PR is not blocking any future developments. Moreover, I consider it a good thing (TM) that we use a special "ria" label for all of this, and do not claim any definitive "datalad" namespace. Whenever I have done such a power-move in the past, I quickly came to regret it.

@mih mih added merge-if-ok and removed feedback-desired labels Jan 16, 2020
@mih mih merged commit 47980eb into datalad:master Jan 17, 2020
17 checks passed
@mih mih deleted the enh-clone branch Jan 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants