
Switch/abandon ORA abstraction paradigm #30

Open
mih opened this issue Nov 23, 2021 · 8 comments

Comments

@mih
Member

mih commented Nov 23, 2021

The ORA remote uses an internal IO abstraction that aims to make handling uniform across protocols (file://, ssh://, http(s)://) while everything is going through a single special remote implementation.

This sounds nice on paper, but it creates the complex problem of having to support a uniform set of operations, and to use these exact same operations, across all protocol implementations. The present implementation fails to deliver on this promise.

I'd argue that a simpler system can be implemented that is more in line with the paradigm preferred by git-annex. Rather than having a single complex beast, let's have the individual pieces implemented properly (one protocol per implementation). Rather than supporting push/pull URL combinations in a single remote, let's use two remotes in such cases (with --sameas): one for pull, and possibly another one, or none at all, for push. Rather than fiddling with the internal parameterization of a single special remote type, let's switch externaltype= when a reconfig is required.
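A rough sketch of what such a setup could look like (the remote names, per-protocol externaltype values, and URL parameters below are hypothetical, just to illustrate the idea, not an existing implementation):

❯ git annex initremote store-http type=external externaltype=ora-http url=https://store.example.org/store encryption=none
❯ git annex initremote store-ssh --sameas=store-http type=external externaltype=ora-ssh url=ssh://store.example.org/store

Consumers who only pull would enable just the first remote; the second one (or none at all) handles push.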

This will make the code base simpler, easier to maintain, and most importantly enable 3rd-party extensions without having to touch -core code.

@bpoldrack
Member

I see two things to think about:

1.)

Rather than fiddling with the internal parameterization of a single special remote type, let's switch externaltype= when a reconfig is required.

This comes with the implication that we can't have a local reconfig (which we recently introduced), since externaltype needs to be committed, as far as I'm aware. Protocol switching is an inherently local thing, though. Hence, if we switch to that approach, we either get back to committed reconfig ping-pong or remove local reconfiguration entirely, meaning one would start going over HTTP although the store is on the same file system ... (see the note at the end of this comment for where that committed configuration lives).

2.)
Just generally: special remote configurations are committed, therefore we need a "backwards compatibility shim" of sorts. This would need to be a layer that is actually still a special remote (ORA) and then "redirects" to different, protocol-specific implementations (classes) based on its config. But then we would have built the thing I wanted (that complex beast) anyway, and the question would be: what do we need the different special remotes for?
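For reference on where that committed configuration lives (assuming a standard git-annex setup): the special remote parameters, including externaltype=, are recorded in remote.log on the git-annex branch and are therefore shared by all clones. They can be inspected with:

❯ git cat-file -p git-annex:remote.log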

@mih
Member Author

mih commented Nov 23, 2021

This comes with the implication that we can't have a local reconfig (which we recently introduced), since externaltype needs to be committed, as far as I'm aware. Protocol switching is an inherently local thing, though. Hence, if we switch to that approach, we either get back to committed reconfig ping-pong or remove local reconfiguration entirely, meaning one would start going over HTTP although the store is on the same file system ...

I could not come up with a use case that would require a local reconfiguration. AFAIR all such scenarios lead to problems down the line. Most, if not all, consumption scenarios are fully addressed via datalad/datalad#5835 (stale), in which committing a local reconfiguration is not an issue.

Just generally: special remote configurations are committed, therefore we need a "backwards compatibility shim" of sorts. This would need to be a layer that is actually still a special remote (ORA) and then "redirects" to different, protocol-specific implementations (classes) based on its config. But then we would have built the thing I wanted (that complex beast) anyway, and the question would be: what do we need the different special remotes for?

I don't understand what you are saying. The current system can stay in place forever. If it works for people with its limitations, nothing needs to be done on their end. And there are no redirections needed.

@bpoldrack
Member

I could not come up with a use case that would require a local reconfiguration. AFAIR all such scenarios lead to problems down the line.

I think we had a bunch of cases where one would want to have a local clone from a store that is also served over HTTP/SSH. Operations on such a local clone would ideally not go via HTTP/SSH. All the issues with that that I remember were either that we couldn't detect whether this is needed, or that the reconfiguration was committed. Both led to changes, and with local reconfiguration in particular, a lot of the trouble in that regard should be addressed.
In a scenario where the store is served via HTTP and some people also need local clones, I would assume that the latter are more likely to not be pure consumption scenarios.

I don't understand what you are saying. The current system can stay in place forever.

Yes, that's what I am saying. It needs to stay in some shape. But it would ideally share code with the new special remotes you aim for, rather than us having two implementations, I think. Hence, it seems to me that it would evolve into the very thing this approach tries to avoid.

Anyway, that's not a fundamental objection. Maybe it helps getting there. However, the more special remote types there are (and are part of existing datasets), the more we need to maintain.

If we figure out a way to have a proper RIA abstraction along the way, one that can be used with pretty much any (special) remote, that would be cool nevertheless.

@mih
Member Author

mih commented Nov 23, 2021

I think we had a bunch of cases where one would want to have a local clone from a store that is also served over HTTP/SSH. Operations on such a local clone would ideally not go via HTTP/SSH. All the issues with that that I remember were either that we couldn't detect whether this is needed, or that the reconfiguration was committed. Both led to changes, and with local reconfiguration in particular, a lot of the trouble in that regard should be addressed.
In a scenario where the store is served via HTTP and some people also need local clones, I would assume that the latter are more likely to not be pure consumption scenarios.

Can you describe a concrete case where this is desired, and that is not a plain consumption (read-only) case?

@bpoldrack
Member

Can you describe a concrete case where this is desired, and that is not a plain consumption (read-only) case?

A RIA store which, for consumption, is set up to be served over HTTP. A dataset maintainer/curator makes updates to datasets in that store. For large data additions, a local clone is desired, because an excellent network connection allows quickly downloading lots of content into a local clone, committing, and pushing to the store locally, rather than downloading elsewhere and pushing over SSH.
However, that machine is not meant for computation. Hence, other users need to push their results from other machines over SSH. Without reconfiguration this can only be captured by different datasets, I think. It might be advisable to separate such datasets anyway, but in the general case we can't really know to what extent, say, preprocessed data needs to be mixed in that sense (downloads + results).
Furthermore, smaller fixes, deletions, etc. could be done by the curator from a different machine (working offline) and later pushed over SSH. Arguably, the latter is "only" convenience, but depending on work/network conditions (think of the rise in remote work), that convenience may be quite desirable.

Am I making sense?

@bpoldrack
Member

However, this business might be addressable by having yet another sameas remote (plus proper costs, obviously) rather than reconfiguration. Maybe that's the better way.
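A minimal sketch of that idea, assuming an already configured HTTP-based ORA remote named store-http and the same store mounted at /data/store (names and paths are made up):

❯ git annex initremote store-local --sameas=store-http type=external externaltype=ora url="ria+file:///data/store"
❯ git config remote.store-local.annex-cost 100

With a lower cost than the HTTP remote, git-annex would prefer the local route whenever it is available, and the cost setting is local configuration, not committed.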

@mih
Member Author

mih commented Nov 23, 2021

I think the scenario you describe should be covered by the normal "ephemeral clone" setup, which can directly interface any store on the local machine. All file content is directly available.
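A minimal sketch of such a clone, assuming a store at /data/store (the path and dataset ID are placeholders):

❯ datalad clone --reckless ephemeral 'ria+file:///data/store#<dataset-id>' work-clone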

The only case not covered is a store that hosts file content in 7z archives. So, taken together, it would not cover the use case of having to modify existing file content in a dataset kept in a store with an archive.7z, and pushing the modified content back to that store.

That seems like a corner case. If there was an archive.7z before, there will likely have to be one after the update too. And if so, a push doesn't give you that; instead, the archive file needs to be updated by a manual process outside the special remote universe.

@mih
Member Author

mih commented Sep 21, 2022

I just came across the need to turn a provided special remote configuration into a working one (the configured URL was not accessible to me, but the location was accessible via another channel).

❯ git annex initremote mynewname --private --sameas=abda3a9a-8581-4c60-9f27-b6264fa8c0b1 type=<type> <whatever-is-different-too> [exporttree=yes]

It has worked great. It promises to create a new special remote that is not shared and points to the original source.

Worth noting that one can override type= and all parameters that need changing, as expected. However, it does not inherit exporttree=yes -- which I found confusing initially, but ultimately it also made sense.
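For example (the parameter values below are hypothetical, only to show the shape of such an override): keeping the same remote type, but pointing at the channel that is actually accessible; an exporttree=yes setting would likewise have to be re-stated explicitly, since it is not inherited:

❯ git annex initremote mynewname --private --sameas=abda3a9a-8581-4c60-9f27-b6264fa8c0b1 type=external externaltype=ora url="ria+ssh://store.example.org/data/store"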
