OPT: gen_URLS (to replace get_URLS) - query schemes in order of prior "successes" #4955
Conversation
it seems it was "adopted" from archives remote but never really implemented etc
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #4955      +/-   ##
==========================================
- Coverage   89.80%   89.61%   -0.20%
==========================================
  Files         289      289
  Lines       40709    40715       +6
==========================================
- Hits        36559    36487      -72
- Misses       4150     4228      +78
```
```diff
@@ -93,7 +92,7 @@ def req_CHECKPRESENT(self, key):
         """
         lgr.debug("VERIFYING key %s" % key)
```
note: hm, coverage suggests that this piece of code is not covered... that is a bit surprising since test_basic_scenario_local_url
does hit that code but in the external process of special remote. I guess at some point we lost proper functioning of merging coverage from multiple reports (i.e. from the main nose process + special remotes which run underneath). And indeed, filed #4956
it can cut interactions between git-annex and the datalad special remote significantly, and will pollute the log less.
One thought I had when reading over these changes is whether it'd be better to call GETURLS
with an empty prefix and then do post-processing of all the URLs on our side.
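For illustration, that empty-prefix alternative could look roughly like this on the datalad side (a sketch only, not actual datalad code; `urls_by_scheme` is a hypothetical helper operating on whatever the single GETURLS round-trip returned):

```python
from collections import defaultdict
from urllib.parse import urlparse

def urls_by_scheme(all_urls):
    """Group URLs (as returned for an empty GETURLS prefix) by scheme."""
    grouped = defaultdict(list)
    for url in all_urls:
        grouped[urlparse(url).scheme].append(url)
    return grouped

# With the grouped dict in hand, the caller could iterate schemes in its
# own preferred order without any further round-trips to git-annex.
```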
```python
        else:
            break
    if scheme_urls:
        # note: generator would ceise to exist thus not asking
        # for URLs for other schemes if this scheme is good enough
```
s/ceise/cease/
I don't understand this comment and how it relates to the commit where it was added (7da0b69).
it somewhat relates to both changes: 1. making it a generator; 2. sorting so that we first hit the scheme we had previously succeeded with; and probably more to 1. than to 2. But since both commits are in the same PR, I guess there is no need to move the comment, unless you insist.
As for more explanation re 1.: since it is a generator, if the first scheme succeeds in finding URLs against which we e.g. successfully perform CHECKPRESENT, then we would not even consider the next scheme/prefix, since the generator will not be asked for the "next" one. If it was not a generator -- we would go through all schemes regardless.
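The early-stop behavior being described can be sketched like this (illustrative only; `gen_urls`, `query`, and `fake_query` are hypothetical stand-ins, not datalad's actual functions):

```python
def gen_urls(key, schemes, query):
    """Yield URLs for `key`, querying one scheme prefix at a time.

    `query(key, scheme)` stands in for one GETURLS round-trip to git-annex.
    """
    for scheme in schemes:
        for url in query(key, scheme):
            yield url

queried = []
def fake_query(key, scheme):
    # Record which schemes actually triggered a round-trip.
    queried.append(scheme)
    return ["%s://host/%s" % (scheme, key)]

# A caller satisfied by the first URL never advances the generator,
# so only the first scheme is ever queried.
first = next(gen_urls("KEY", ["http", "s3", "ftp"], fake_query))
```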
since it is a generator, if the first scheme succeeds in finding URLs against which we e.g. successfully perform CHECKPRESENT, then we would not even consider the next scheme/prefix, since the generator will not be asked for the "next" one. If it was not a generator -- we would go through all schemes regardless.
If I understand correctly, perhaps something like this then: "Note: Yield a list of URLs per scheme so that a caller can decide to stop when it gets a scheme that it considers good enough".
I'm not confident that my rephrasing keeps the same meaning. Either way, I don't want to hold up the improvements in this PR with this comment or my likely muddled remarks about alternative approaches. Unless any objections come in, I'll plan to merge this later today. (I'll s/ceise/cease/ locally after the merge to avoid another run of the tests.)
yeah, I was thinking about that as well... in some cases it might be beneficial (all URLs are "datalad" URLs) but in some it might be the opposite -- a bunch of URLs we do not want to handle, which we would now need to fetch and parse from annex first before we proceed. With this reordering of supported protocols, asking first for the one we previously got responses for, we make it more efficient as a whole (no URLs we would not care about, an immediate hit for the prefix we do care about). So I decided to go with the sorting approach.
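The sorting idea can be sketched as a tiny priority list (a sketch under assumed names, not datalad's actual API):

```python
class SchemePriority:
    """Remember which scheme worked last time and try it first next time."""

    def __init__(self, schemes):
        self.schemes = list(schemes)

    def ordered(self):
        # Current query order for GETURLS prefixes.
        return list(self.schemes)

    def record_success(self, scheme):
        # Move the scheme that just produced URLs to the front.
        self.schemes.remove(scheme)
        self.schemes.insert(0, scheme)

prio = SchemePriority(["http", "https", "s3"])
prio.record_success("s3")
```

Combined with the generator above, the first prefix queried is the one most likely to yield URLs, so subsequent round-trips are usually skipped.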
I don't know that it is "more efficient on the whole". Do you expect it to be common to get a whole bunch of URLs we don't care about? And even if you did, how does the processing weigh against eliminating additional interactions with annex? Plus, wouldn't asking for all the URLs have the advantage of gen_URLS yielding URLs for all supported schemes? At any rate, without actual timings it's hard to say anything. Feel free of course to go with whatever approach you think is best.
it can happen. E.g. in the course of openneuro we had a mix of regular URLs which could be handled internally by git-annex or by the special remote. Also, without using a prefix and being able to stop the generator without checking all schemes, there would be no benefit from the generator approach.
it would yield them all as well in the current implementation, although looping through schemes (in order of preference based on prior interactions).
Indeed, and for any realistic example it would be heavily masked by actual interactions with the resource (remote URLs), making it hard to impossible to time. BUT it already resulted in less pollution of the logs, so it should be of benefit even if the performance advantage is in question.
We're talking past each other, but it doesn't matter. Please go with whatever you want.
ok, although I think I might be getting at what you are trying to hint at -- that every request to git-annex for GETURLS has an overhead of first getting all the URLs and then filtering for a specified prefix (on the git-annex side). Then indeed, if the remote ends up querying more than a single prefix -- we would just multiply that overhead. Is that what you are talking about? The idea/hope here is that a single prefix hit would be sufficient.

meanwhile, I wondered how much of a penalty we get from doing multiple queries to annex. Here is the original script's performance, and then the performance of a modified one which also does those GETURLS requests to annex for one or more schemes.

script from `~datalad/trash/speedyannex2`:

```sh
$> cat git-annex-remote-datalad
#!/bin/bash
set -e
schemes=( $SCHEMES )
# Debugging helper; uncomment the echo to trace the protocol exchange
report () {
    # echo "$@" >&2
    :
}
# Read one protocol line and report whether it matches the expected message
recv () {
    read resp
    #resp=${resp%\n}
    target="$@"
    if [ "$resp" != "$target" ]; then
        report "! exp $target"
        report "  got $resp"
    else
        report "+ got $resp"
    fi
}
send () {
    echo "$@"
    # report "sent $@"
}
send VERSION 1
recv EXTENSIONS INFO ASYNC
send UNSUPPORTED-REQUEST
recv PREPARE
send PREPARE-SUCCESS
send DEBUG Encodings: filesystem utf-8, default utf-8
# from now on should be a CHECKPRESENT sequence
# recv CHECKPRESENT MD5E-s70189716--45f4514f972325f481d5c09434c1c94a.nii.gz
while read CMD KEY; do
    if [ $CMD != CHECKPRESENT ]; then
        echo "Was expecting CHECKPRESENT, got $CMD"
        exit 1
    fi
    for scheme in ${schemes[*]}; do
        send GETURLS $KEY $scheme:
        url=bogusforuntil
        while [ ! -z "$url" ]; do
            read resp url
            test "$resp" = VALUE
        done
    done
    send CHECKPRESENT-SUCCESS $KEY
done
```

```sh
$> time PATH=~datalad/trash/speedyannex:$PATH git annex fsck --from datalad --fast >/dev/null
PATH=~datalad/trash/speedyannex:$PATH git annex fsck --from datalad --fast >  11.18s user 2.51s system 141% cpu 9.690 total
$> time PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >/dev/null
PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >  11.75s user 2.74s system 141% cpu 10.241 total
$> time SCHEMES=s3 PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >/dev/null
SCHEMES=s3 PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from  25.33s user 5.68s system 123% cpu 25.158 total
$> time SCHEMES="s3 http https ftp" PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >/dev/null
SCHEMES="s3 http https ftp" PATH=~datalad/trash/speedyannex2:$PATH git annex  28.21s user 7.41s system 122% cpu 29.180 total
```
Thank you Kyle. Fwiw, I have been using this in "production" since then and spotted no side effects.
Ref: #4954
I do not think it would be a tremendous performance boost (since most of the time is likely to be spent in interaction with external resources), but it can cut interactions between git-annex and the `datalad` special remote significantly, and will pollute the log less.

I think the change is quite straightforward and I would not mind rebasing it against `maint`, but I am also ok to keep it in master.