
Switch/abandon ORA abstraction paradigm #30

Open
mih opened this issue Nov 23, 2021 · 8 comments

Comments

@mih
Member

mih commented Nov 23, 2021

The ORA remote uses an internal IO abstraction that aims to make handling uniform across protocols (file://, ssh://, http(s)://) while everything is going through a single special remote implementation.

This sounds nice on paper, but it creates the complex problem of having to support a uniform set of operations, and to use these exact same operations, across all protocol implementations. The present implementation fails to deliver on this promise.

I'd argue that a simpler system can be implemented that is more in line with the paradigm preferred by git-annex. Rather than having a single complex beast, let's have the individual pieces implemented properly (one protocol per implementation). Rather than supporting push/pull URL combinations in a single remote, let's use two remotes in such cases (with --sameas): one for pull, and possibly another one, or none at all, for push. Rather than fiddling with the internal parameterization of a single special remote type, let's switch externaltype= when a reconfig is required.
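A rough sketch of what such a setup could look like (the remote names, per-protocol externaltype values, and URL parameters below are hypothetical, just to illustrate the idea, not an existing implementation):

❯ git annex initremote store-http type=external externaltype=ora-http url=https://store.example.org/store encryption=none
❯ git annex initremote store-ssh --sameas=store-http type=external externaltype=ora-ssh url=ssh://store.example.org/store

Consumers who only pull would enable just the first remote; the second one (or none at all) handles push.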

This will make the code base simpler, easier to maintain, and most importantly enable 3rd-party extensions without having to touch -core code.

@bpoldrack
Member

I see two things to think about:

1.)

Rather than fiddling with the internal parameterization of a single special remote type, let's switch externaltype= when a reconfig is required.

This comes with the implication that we can't have a local reconfig (which we recently introduced), since externaltype needs to be committed, as far as I'm aware. Protocol switching is an inherently local thing, though. Hence, if we switch to that approach, we either get back to committed reconfig ping-pong or remove local reconfiguration entirely, meaning one would start going over HTTP although the store is on the same file system ... (see the note at the end of this comment for where that committed configuration lives).

2.)
Just generally: special remote configurations are committed, therefore we need a "backwards compatibility shim" of sorts. This would need to be a layer that is actually still a special remote (ORA) and then "redirects" to different, protocol-specific implementations (classes) based on its config. But then we would have built the thing I wanted (that complex beast) anyway, and the question would be: what do we need the different special remotes for?
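For reference on where that committed configuration lives (assuming a standard git-annex setup): the special remote parameters, including externaltype=, are recorded in remote.log on the git-annex branch and are therefore shared by all clones. They can be inspected with:

❯ git cat-file -p git-annex:remote.log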

@mih
Member Author

mih commented Nov 23, 2021

This comes with the implication that we can't have a local reconfig (which we recently introduced), since externaltype needs to be committed, as far as I'm aware. Protocol switching is an inherently local thing, though. Hence, if we switch to that approach, we either get back to committed reconfig ping-pong or remove local reconfiguration entirely, meaning one would start going over HTTP although the store is on the same file system ...

I could not come up with a use case that would require a local reconfiguration. AFAIR all such scenarios lead to problems down the line. Most, if not all, consumption scenarios are fully addressed via datalad/datalad#5835 (stale), in which committing a local reconfiguration is not an issue.

Just generally: special remote configurations are committed, therefore we need a "backwards compatibility shim" of sorts. This would need to be a layer that is actually still a special remote (ORA) and then "redirects" to different, protocol-specific implementations (classes) based on its config. But then we would have built the thing I wanted (that complex beast) anyway, and the question would be: what do we need the different special remotes for?

I don't understand what you are saying. The current system can stay in place forever. If it works for people with its limitations, nothing needs to be done on their end. And there are no redirections needed.

@bpoldrack
Member

I could not come up with a use case that would require a local reconfiguration. AFAIR all such scenarios lead to problems down the line.

I think we had a bunch of cases where one would want to have a local clone from a store that is also served over HTTP/SSH. Operations on such a local clone would ideally not go via HTTP/SSH. All the issues with that that I remember were either that we couldn't detect whether this is needed, or that the reconfiguration was committed. Both led to changes, and with local reconfiguration in particular, a lot of the trouble in that regard should be addressed.
In a scenario where the store is served via HTTP and some people also need local clones, I would assume that the latter are more likely to not be pure consumption scenarios.

I don't understand what you are saying. The current system can stay in place forever.

Yes, that's what I am saying. It needs to stay in some shape. But it would ideally share code with the new special remotes you aim for, rather than us having two implementations, I think. Hence, it seems to me that it would evolve into the very thing this approach tries to avoid.

Anyway, that's not a fundamental objection. Maybe it helps getting there. However, the more special remote types there are (and are part of existing datasets), the more we need to maintain.

If we figure out a way to have a proper RIA abstraction along the way, one that can be used with pretty much any (special) remote, that would be cool nevertheless.

@mih
Member Author

mih commented Nov 23, 2021

I think we had a bunch of cases where one would want to have a local clone from a store that is also served over HTTP/SSH. Operations on such a local clone would ideally not go via HTTP/SSH. All the issues with that that I remember were either that we couldn't detect whether this is needed, or that the reconfiguration was committed. Both led to changes, and with local reconfiguration in particular, a lot of the trouble in that regard should be addressed.
In a scenario where the store is served via HTTP and some people also need local clones, I would assume that the latter are more likely to not be pure consumption scenarios.

Can you describe a concrete case where this is desired, and that is not a plain consumption (read-only) case?

@bpoldrack
Member

Can you describe a concrete case where this is desired, and that is not a plain consumption (read-only) case?

A RIA store which, for consumption, is set up to be served over HTTP. A dataset maintainer/curator makes updates to datasets in that store. For large data additions, a local clone is desired, because an excellent network connection allows quickly downloading lots of content into a local clone, committing, and pushing to the store locally, rather than downloading elsewhere and pushing over SSH.
However, that machine is not meant for computation. Hence, other users need to push their results from other machines over SSH. Without reconfiguration this can only be captured by different datasets, I think. It might be advisable to separate such datasets anyway, but in the general case we can't really know to what extent, say, preprocessed data needs to be mixed in that sense (downloads + results).
Furthermore, smaller fixes, deletions, etc. could be done by the curator from a different machine (working offline) and later pushed over SSH. Arguably, the latter is "only" convenience, but depending on work/network conditions (think of the rise in remote work), that convenience may be quite desirable.

Am I making sense?

@bpoldrack
Member

However, this business might be addressable by having yet another sameas remote (plus proper costs, obviously) rather than reconfiguration. Maybe that's the better way.
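A minimal sketch of that idea, assuming an already configured HTTP-based ORA remote named store-http and the same store mounted at /data/store (names and paths are made up):

❯ git annex initremote store-local --sameas=store-http type=external externaltype=ora url="ria+file:///data/store"
❯ git config remote.store-local.annex-cost 100

With a lower cost than the HTTP remote, git-annex would prefer the local route whenever it is available, and the cost setting is local configuration, not committed.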

@mih
Member Author

mih commented Nov 23, 2021

I think the scenario you describe should be covered by the normal "ephemeral clone" setup, which can directly interface any store on the local machine. All file content is directly available.
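A minimal sketch of such a clone, assuming a store at /data/store (the path and dataset ID are placeholders):

❯ datalad clone --reckless ephemeral 'ria+file:///data/store#<dataset-id>' work-clone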

The only case not covered is a store that hosts file content in 7z archives. So, taken together, it would not cover the use case of having to modify existing file content in a dataset kept in a store with an archive.7z, and pushing the modified content back to that store.

That seems like a corner case. If there was an archive.7z before, there will likely have to be one after the update too. And if so, a push doesn't give you that; instead, the archive file needs to be updated by a manual process outside the special remote universe.

@mih
Member Author

mih commented Sep 21, 2022

I just came across the need to turn a provided special remote configuration into a working one (the configured URL was not accessible to me, but the location was accessible via another channel).

❯ git annex initremote mynewname --private --sameas=abda3a9a-8581-4c60-9f27-b6264fa8c0b1 type=<type> <whatever-is-different-too> [exporttree=yes]

It has worked great. It promises to create a new special remote that is not shared and points to the original source.

Worth noting that one can override type= and all parameters that need changing, as expected. However, it does not inherit exporttree=yes -- which I found confusing initially, but ultimately it also made sense.
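For example (the parameter values below are hypothetical, only to show the shape of such an override): keeping the same remote type, but pointing at the channel that is actually accessible; an exporttree=yes setting would likewise have to be re-stated explicitly, since it is not inherited:

❯ git annex initremote mynewname --private --sameas=abda3a9a-8581-4c60-9f27-b6264fa8c0b1 type=external externaltype=ora url="ria+ssh://store.example.org/data/store"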
