Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RF: RIA remote #4203

Merged
merged 49 commits into from May 5, 2020
Merged

RF: RIA remote #4203

merged 49 commits into from May 5, 2020

Conversation

bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Feb 26, 2020

First of the announced follow up PR sitting on top of PR #4164 .

This includes:

  • move creation routines out of the special remote itself, leading to initremote and prepare to expect an existing and accessible storage location. Furthermore split store from dataset representation in that regard, meaning two creation routines (one for a store and its layout and for a dataset within a store and its layout)
  • rewrite tests
  • New concept of storing rewritten URL in special remote config and its uuid in the bare repo's config in a store. clone gets a fallback solution in postclone_ria in case we cloned from a RIA store and annex-init didn't enable an ORA remote. We would now look for an ORA remote UUID in origin's config and try to enable this one using the ria+ URL we cloned from. This is useful for example when the special remote URL used to publish to the store isn't valid from the point of view of a consumer. Note, that this currently supports file: and ssh: URLS. Meaningful http:support depends on actual support for HTTP in the special remote itself.
  • renaming of the special remote and related configs, etc. from RIA to ORA. Includes command renaming ria-export-archive to ora-export-archive
  • exclude @turtle tests in AppVeyor, too. Not just in Travis.

There is still some RF'ing to do and issues like #4436 and #4437 to address/investigate. However, this should be based on the remaining RF'ing which depends on PR #4068. Plan is to build a remote execution abstraction using the approach of SSHRemoteIO in the special remote but with the async based new runner.

@codecov
Copy link

codecov bot commented Feb 26, 2020

Codecov Report

Merging #4203 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4203   +/-   ##
=======================================
  Coverage   89.20%   89.20%           
=======================================
  Files         285      285           
  Lines       38526    38526           
=======================================
  Hits        34366    34366           
  Misses       4160     4160           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d3edaa...2d3edaa. Read the comment docs.

@mih
Copy link
Member

mih commented Mar 7, 2020

Needs to start from #4260 now

@bpoldrack bpoldrack force-pushed the rf-ria branch 2 times, most recently from d90a779 to 379ec0a Compare Mar 16, 2020
@bpoldrack
Copy link
Member Author

bpoldrack commented Mar 16, 2020

Update: rebased on master and hopefully fixed the fallout. test_ria_basics should now pass again.
Proceeding to rewrite/restructure the remaining tests to account for change that requires creation of the store location beforehand now.

bpoldrack added 21 commits Apr 15, 2020
objtree -> store. This was confusing, since the naming is based on an earlier
to RIA, that didn't contain actual repositories. It didn't refer to
annex/objects, which we also call 'object tree'.
Separates check for layout(version) and creation thereof
as well as treatment of store and the dataset therein.
note, that this is a temporary RF step. Creation should go out of
special remote protocol altogether. However, one step at a time.
This allows to better test/investigate implications of further
RF.
One more RF step to get creation out of the special
special remote altogether.
Also don't set configs in initremote before success.
Special remote now expects an existing setup at the remote end.
It was a bad idea to begin with, since initremote are even more so
prepare are called in all kinds of situations leading to
unintended creation (trials). Furthermore we need autoenable in a
clone to fail on non-existing/non-accessible remote end.
and move git-annex-testremote run to basic tests.
Note, that those will ultimately be reported by git-annex after being
converted to a RemoteError (see handle_error decorator).
While configs should still work ATM, ORA will cease to support them.
Therefore no need to adapt tests.
This is about dataset layout version 1, which uses dirhashlower
(as does annex for a bare repo). Different test for version 2
still needed, to ensure it doesn't interfere.
Changed since annex can consume dirhashlower as well
as dirhashmixed from a bare repo. But using it for upload
git-annex will use dirhashlower. Therefore, dataset layout
versions are different in whether or not they can access
keys uploaded via a git remote into the bare repo.
@mih
Copy link
Member

mih commented Apr 29, 2020

@bpoldrack I have no issues with disabling Windows tests and fixing the bugs later. Usage on Windows is still largely hypothetical, and we need to make a release.

@@ -582,20 +582,136 @@ def postclonecfg_ria(ds, props):
# and was enabled (there could be RIA stores that actually only have repos)
# make this function be a generator
ria_remotes = [s for s in ds.siblings('query', result_renderer='disabled')
if s.get('annex-externaltype', None) == 'ora'
]
if s.get('annex-externaltype', None) == 'ora']
if not ria_remotes:
Copy link
Member

@mih mih Apr 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this isn't introduced here, but shouldn't the entire lookup operation be conditioned upon the actual presence of any ora special remote configuration, i.e. look into AnnexRepo.get_special_remote(). That would be cheap, but all the remote talking is relatively expensive. For example, the HCP datasets live in a RIA store but none has an ORA remote configured. It already takes 5+ hours to clone them. This would make it even longer.

Copy link
Member Author

@bpoldrack bpoldrack Apr 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. But to be solved separately, right?

Copy link
Member

@mih mih May 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can fix that in master right now. Conflicts will then have to be resolved in here. I am OK with either approach.

Copy link
Member

@mih mih May 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I take that back. The problem doesn't exist in master. It is introduced here.

Copy link
Member

@mih mih May 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a fix in here. cadfdac

Copy link
Member Author

@bpoldrack bpoldrack May 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought replacing the siblings call right away and collect info on all special remotes (enabled or not) in one call. Anyways - your commit solves the bigger issue. Agree.

@kyleam
Copy link
Collaborator

kyleam commented Apr 29, 2020

Sorry, I've canceled the remaining AppVeyor builds for this PR. Those one hour hangs are really clogging up the queued jobs given AppVeyor runs one job at a time. If it's just one or a few tests hanging, perhaps decorating the tests with @nose.tools.timed(<some reasonable value>) would be a good way to debug these.

See dataladgh-4436
AppVeyor hangs in SSH tests.
@mih mih mentioned this pull request Apr 30, 2020
@mih
Copy link
Member

mih commented Apr 30, 2020

Trying this code I see this:

dl clone ria+ssh://bulk1.htc.inm7.de/ds/cat_12.5/srv#f2e9193e-8401-11ea-99ab-a0369f287950 myclone
[INFO   ] RIA store doesn't exist at /ds/cat_12.5/srv                                        
[INFO   ] Configure additional publication dependency on "inm7-storage"                      
configure-sibling(ok): . (sibling)
install(ok): /tmp/myclone (dataset)
action summary:
  configure-sibling (ok: 1)
  install (ok: 1)

The store does exist, cloning and getting work fine. The message is wrong, or misleading.

@mih
Copy link
Member

mih commented Apr 30, 2020

Performance anecdote: Rsyncing a dataset store location vs getting a full dataset (7k files, 23GB total, all files via ora-special remote) is ~70% faster (both going through an SSH connection). I am not sure which pieces in the (obviously more complicated and more capable) stack contribute to the slowdown, but it might be worth investigation at some point.

@mih
Copy link
Member

mih commented Apr 30, 2020

This is broken for GitRepo:

% datalad create --no-annex
% datalad create-sibling-ria -d . -s inm7 --trust-level trust ria+file:///ds/cat_12.5/srv
[INFO   ] create siblings 'inm7' and 'inm7-storage' ... 
[ERROR  ] 'GitRepo' object has no attribute 'init_remote' [create_sibling_ria.py:_create_sibling_ria:491] (AttributeError) 

It should not attempt to enable a storage sibling and then crash for a non-annex dataset.

Edit: in a follow-up attempt to make it work, I hit another bug (already appeared in #4415):

% datalad create-sibling-ria -d . -s inm7 --trust-level trust ria+file:///ds/cat_12.5/srv --no-storage-sibling --existing reconfigure
[ERROR  ] [Errno 17] File exists: '/ds/cat_12.5/srv/653/8a7fe-8ac7-11ea-83ba-a0369f287950' [pathlib.py:mkdir:1251] (FileExistsError) 

Interestingly, even a rm -rf /ds/cat_12.5/srv/653/8a7fe-8ac7-11ea-83ba-a0369f287950 didn't help.

Couldn't deal with plain git repositories
- Creating remote directories when there's no special remote isn't
  necessary anymore due to RF'ing of creation routines and there
  explicit use in the command. - deleted.
- ignore storage-sibling options when ds isn't an AnnexRepo. This is
  particularly meant to enable recursive invocation to deal with a mix
  of annex/git repos.
@bpoldrack
Copy link
Member Author

bpoldrack commented Apr 30, 2020

Plain git issues addressed in 18e45e4

@bpoldrack
Copy link
Member Author

bpoldrack commented Apr 30, 2020

For now skipping ORA tests completely on Windows. #4469
No time to investigate ATM.

@bpoldrack
Copy link
Member Author

bpoldrack commented Apr 30, 2020

Trying this code I see this:

dl clone ria+ssh://bulk1.htc.inm7.de/ds/cat_12.5/srv#f2e9193e-8401-11ea-99ab-a0369f287950 myclone
[INFO   ] RIA store doesn't exist at /ds/cat_12.5/srv                                        
[INFO   ] Configure additional publication dependency on "inm7-storage"                      
configure-sibling(ok): . (sibling)
install(ok): /tmp/myclone (dataset)
action summary:
 configure-sibling (ok: 1)
 install (ok: 1)

The store does exist, cloning and getting work fine. The message is wrong, or misleading.

On current commit doesn't happen for me:

❱ datalad clone "ria+ssh://bulk1.htc.inm7.de/ds/cat_12.5/srv#f2e9193e-8401-11ea-99ab-a0369f287950" myclone 
[INFO   ] Configure additional publication dependency on "inm7-storage"                                                                       
configure-sibling(ok): . (sibling)                                                                                                            
install(ok): /home/bpoldrack/myclone (dataset)
action summary:
  configure-sibling (ok: 1)
  install (ok: 1)

Can you confirm, @mih?

@bpoldrack
Copy link
Member Author

bpoldrack commented May 1, 2020

While above mentioned call worked for @mih as well, the actual issue was still not solved.

Issue was that the [INFO ] RIA store doesn't exist at /ds/cat_12.5/srv is a factually correct but irrelevant failure. It was checking that path locally rather than at the store's machine via SSH. This is because the special remote URL committed in the dataset is a ria+file: URL, while we cloned on a different machine via ria+ssh:. Therefore autoenabling during git-annex-init failed (correctly!) and new feature in datalad-clone fix the situation by using the URL we cloned from to reconfigure the special remote.
Addressed in 58016f8

In the case we discovered this would now change the output from

[INFO   ] RIA store doesn't exist at /ds/cat_12.5/srv 

to:

[INFO   ] RIA store doesn't exist at ria+file:///ds/cat_12.5/srv                                                                                                                                                                                                                                                                                           
[INFO   ] Reconfigured inm7-storage for ria+ssh://cat_12.5.ds.inm7.de/ds/cat_12.5/srv                                                                                                                                                                                                                                                                      

bpoldrack added 2 commits May 1, 2020
This is taken care of in GitRepo already.
(Closes datalad#4415)
When committed URL for ORA remote is invalid for someone cloning from
RIA store, autoenabling will (correctly!) fail during datalad-clone's
git-annex-init, while clone will (attempt to) fix this afterwards by
explicitly enabling based on the URL it cloned from. This used to lead
to an INFO message during git-annex-init claiming that there was no RIA
store but then everything works just fine, which is confusing.

Changed to printing out the entire URL of what failed within ORA remote
and add a message on successful reconfiguration by clone.
bpoldrack and others added 3 commits May 1, 2020
Be more specific about what's wrong and how to fix it.
(Closes datalad#4329)
mih
mih approved these changes May 2, 2020
Copy link
Member

@mih mih left a comment

I have spent the last few days exercising this code on real data by building and updating largish stores, and consuming datasets on the same machine and across the network. The additionally discovered issues are in the comments of this PR. Given the complexity of the changes I cannot say that I have tried all combinations, but I am easily convinced that merging this will overall improve the situation, and in many cases will make things work at all. With this present state, I was able to perform all pieces necessary for the envisioned workflow that the RIA code aims to enable. I'd vote for a merge and continue from here in much smaller PRs. Thx @bpoldrack !

Matching a local remote configuration with a store is a costly
procedure. We can quickly rule out the possibility of success, by
checking for a known special remote config locally. If there is none,
we can skip early.

Also rename `ria_remote` -> `ora_remote` to avoid confusion.
Copy link
Member

@yarikoptic yarikoptic left a comment

This is a cursory review, largely from a position of a fellow developer not familiar with the code base and not yet using this functionality. Overall looks great, but I have left some comments at various spots pointing to aspects which IMHO could make code (more consistent with other code in datalad, and thus more readable) and UX (more informative logs and exceptions etc) better. Feel free to ignore, but I hope that at least some of them would be addressed before the merge.

datalad/core/distributed/clone.py Outdated Show resolved Hide resolved
datalad/distributed/tests/test_ria_git_remote.py Outdated Show resolved Hide resolved
datalad/core/distributed/clone.py Outdated Show resolved Hide resolved
datalad/core/distributed/clone.py Outdated Show resolved Hide resolved
datalad/core/distributed/clone.py Outdated Show resolved Hide resolved
datalad/customremotes/ria_utils.py Show resolved Hide resolved
datalad/customremotes/ria_utils.py Outdated Show resolved Hide resolved
datalad/customremotes/ria_utils.py Outdated Show resolved Hide resolved
datalad/customremotes/ria_utils.py Show resolved Hide resolved
datalad/distributed/tests/test_ria_basics.py Outdated Show resolved Hide resolved
- move import statements
- improve logging and exceptions
- rely on dict.get's default
- more specific split
- constant for relevant remote name
- add docstrings for helper functions
- properly mark windows failures instead of just skipping them
@bpoldrack
Copy link
Member Author

bpoldrack commented May 4, 2020

@yarikoptic :
See #4203 (comment)
Apart from that addressed your comments in 2d3edaa

@yarikoptic yarikoptic merged commit 19ef978 into datalad:master May 5, 2020
12 checks passed
@mih mih deleted the rf-ria branch May 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants