
OPT: gen_URLS (to replace get_URLS) - query schemes in order of prior "successes" #4955

Merged: 4 commits on Oct 1, 2020

Conversation

yarikoptic (Member):

Ref: #4954

I do not think it will provide a tremendous performance boost (since most of the time is likely to be spent interacting with external resources), but it can cut the number of interactions between git-annex and the datalad special remote significantly, and will pollute the log less.

I think the change is quite straightforward, and I would not mind rebasing it against maint, but I am also OK with keeping it in master.

@yarikoptic added the `performance` (Improve performance of an existing feature) label on Sep 23, 2020
codecov bot commented Sep 23, 2020

Codecov Report

Merging #4955 into master will decrease coverage by 0.19%.
The diff coverage is 85.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4955      +/-   ##
==========================================
- Coverage   89.80%   89.61%   -0.20%     
==========================================
  Files         289      289              
  Lines       40709    40715       +6     
==========================================
- Hits        36559    36487      -72     
- Misses       4150     4228      +78     
Impacted Files Coverage Δ
datalad/customremotes/datalad.py 40.32% <50.00%> (-1.22%) ⬇️
datalad/customremotes/base.py 70.00% <93.33%> (-13.93%) ⬇️
datalad/customremotes/archives.py 63.21% <100.00%> (ø)
datalad/customremotes/tests/__init__.py 91.66% <0.00%> (-8.34%) ⬇️
datalad/support/keyring_.py 85.10% <0.00%> (-6.39%) ⬇️
datalad/tests/test_base.py 96.87% <0.00%> (-3.13%) ⬇️
datalad/customremotes/tests/test_archives.py 86.87% <0.00%> (-3.13%) ⬇️
datalad/cmd.py 90.20% <0.00%> (-2.83%) ⬇️
datalad/__init__.py 89.36% <0.00%> (-1.42%) ⬇️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f6cf0ae...7da0b69.

@@ -93,7 +92,7 @@ def req_CHECKPRESENT(self, key):
"""
lgr.debug("VERIFYING key %s" % key)
yarikoptic (Member, Author):

note: hm, coverage suggests that this piece of code is not covered... That is a bit surprising, since test_basic_scenario_local_url does hit that code, but in the external process of the special remote. I guess at some point we lost proper merging of coverage from multiple reports (i.e. from the main nose process + the special remotes which run underneath it). And indeed, filed #4956

@yarikoptic added the `merge-if-ok` (OP considers this work done, and requests review/merge) label on Sep 24, 2020
@kyleam kyleam (Collaborator) left a comment:

it can cut interactions between git-annex and datalad special remote significantly, and will pollute log less.

One thought I had when reading over these changes is whether it'd be better to call GETURLS with an empty prefix and then do post-processing of all the URLs on our side.
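The alternative sketched in this comment (one GETURLS call with an empty prefix, then post-processing of all returned URLs on our side) could look roughly like the following; `group_urls_by_scheme` is a hypothetical helper for illustration, not part of datalad:

```python
from urllib.parse import urlsplit

def group_urls_by_scheme(urls):
    """Group a flat URL list (as a single empty-prefix GETURLS
    call would return it) by scheme, preserving per-scheme order."""
    by_scheme = {}
    for url in urls:
        # urlsplit parses custom schemes such as dl+archive: as well
        by_scheme.setdefault(urlsplit(url).scheme, []).append(url)
    return by_scheme

# Hypothetical mixed reply containing URLs of several schemes
urls = [
    "http://example.com/a",
    "dl+archive:KEY#path=f",
    "http://example.com/b",
]
grouped = group_urls_by_scheme(urls)
```

The caller could then pick out only the schemes it handles, at the cost of having received every URL from annex up front.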

else:
break
if scheme_urls:
# note: generator would ceise to exist thus not asking
# for URLs for other schemes if this scheme is good enough
kyleam (Collaborator):

s/ceise/cease/

I don't understand this comment and how it relates to the commit where it was added (7da0b69).

yarikoptic (Member, Author):

it somewhat relates to both changes: 1. making it a generator; 2. sorting so that we first hit the scheme we had succeeded with before; and probably more to 1. than to 2. But since both commits are in the same PR, I guess there is no need to move the comment, unless you insist.
As for more explanation regarding 1.: since it is a generator, if the first scheme succeeds in finding URLs for which we, e.g., successfully perform CHECKPRESENT, then we would not even consider the next scheme/prefix, since the generator will not be asked for the "next" one. If it was not a generator, we would go through all schemes regardless.
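The early-stop behavior described in point 1. can be illustrated with a plain Python generator; `gen_urls` and `fake_query` below are illustrative stand-ins, not the actual datalad code:

```python
def gen_urls(key, schemes, query_urls):
    """Yield URLs scheme by scheme. If the caller stops consuming
    after a satisfactory URL, later schemes are never queried."""
    for scheme in schemes:
        # One (hypothetical) GETURLS round-trip per prefix
        for url in query_urls(key, scheme + ":"):
            yield url

queried = []

def fake_query(key, prefix):
    # Record which prefixes actually got queried
    queried.append(prefix)
    if prefix == "http:":
        return ["http://host/" + key]
    return []

# Taking only the first URL: the second scheme is never queried
first = next(gen_urls("KEY1", ["http", "ftp"], fake_query))
```

With a plain list-returning function instead, both prefixes would be queried before the caller ever saw a URL.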

kyleam (Collaborator):

since it is a generator, if the first scheme succeeds to find URLs to which we e.g. perform successfully the CHECKPRESENT, then we would not even consider the next scheme/prefix since generator will not be asked for the "next" one. If it was not a generator -- we go through all schemes regardless.

If I understand correctly, perhaps something like this then: "Note: Yield a list of URLs per scheme so that a caller can decide to stop when it gets a scheme that it considers good enough".

kyleam (Collaborator):

I'm not confident that my rephrasing keeps the same meaning. Either way, I don't want to hold up the improvements in this PR with this comment or my likely muddled remarks about alternative approaches. Unless any objections come in, I'll plan to merge this later today. (I'll s/ceise/cease/ locally after the merge to avoid another run of the tests.)

yarikoptic (Member, Author):

One thought I had when reading over these changes is whether it'd be better to call GETURLS with an empty prefix and then do post-processing of all the URLs on our side.

yeah, I was thinking about that as well... In some cases it might be beneficial (when all URLs are "datalad" URLs), but in others it might be the opposite: a bunch of URLs we do not want to handle, which we would nevertheless need to fetch line by line from annex before we proceed. With this reordering of supported protocols, asking first for the one we previously got responses for, we make it more efficient as a whole (no URLs we would not care about, an immediate hit for the prefix we do care about). So I decided to go with the sorting approach.
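The "sorting approach" could be approximated with a per-scheme success counter, as in this minimal sketch (names are hypothetical; it only assumes the remote records a success whenever a scheme's URLs work out):

```python
from collections import Counter

class SchemeOrderer:
    """Track per-scheme 'successes' and order schemes so that the
    most successful one is queried first on subsequent requests."""

    def __init__(self, schemes):
        self.schemes = list(schemes)
        self.successes = Counter()

    def record_success(self, scheme):
        self.successes[scheme] += 1

    def ordered(self):
        # Stable sort: ties keep the original declaration order
        return sorted(self.schemes, key=lambda s: -self.successes[s])

o = SchemeOrderer(["http", "https", "s3"])
o.record_success("s3")
o.record_success("s3")
o.record_success("https")
```

After the recorded hits, `o.ordered()` puts `s3` first, so the next GETURLS query starts with the prefix most likely to match.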

kyleam commented Sep 24, 2020

With this reordering of supported protocols and asking for the one which we got responses for before, we make it more efficient as a whole (no URLs we would not care, immediate hit for the prefix we do care about). So I decided to go with the sorting approach.

I don't know that it is "more efficient on the whole". Do you expect it to be common to get a whole bunch of URLs we don't care about? And even if you did, how does the processing cost weigh against eliminating additional interactions with annex? Plus, wouldn't asking for all the URLs have the advantage of gen_URLS yielding URLs for all supported schemes?

At any rate, without actual timings it's hard to say anything. Feel free of course to go with whatever approach you think is best.

yarikoptic (Member, Author):

I don't know that it is "more efficient on the whole". Do you expect it to be common to get a whole bunch of URLs we don't care about?

it can happen. E.g., in the course of crawling OpenNeuro we had a mix of regular URLs, which could be handled internally by git-annex, and datalad + datalad-archives special remote URLs:

(git-annex)lena:~/datalad/openfmri/ds000001[master]sub-01/anat
$> git annex whereis sub-01_inplaneT2.nii.gz
whereis sub-01_inplaneT2.nii.gz (5 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	899f0347-0888-48ef-91b6-bac213ca8cef -- [datalad-archives]
   	c8bd3d05-33d4-4b59-9d53-ca7efbdcdd13 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000001
   	ce96a357-a40c-4107-9039-77124f428aae -- [datalad]
   	e89f69c2-d802-415d-a819-b29ca80daefd -- [mydirectory]

  The following untrusted locations may also have copies:
  	766676d0-7be4-4b0b-a583-d115a5d18353 -- [mydirectory-exported2]
   	9f989208-cccf-471a-8b92-877dec8bb622 -- yoh@hopa:~/.annex-cache [cache]

  web: http://openneuro.s3.amazonaws.com/ds000001/ds000001_R1.1.0/uncompressed/sub001/anatomy/inplane001.nii.gz?versionId=ystKDnaPkdzSwzdRPZH0PtMMknZJCQV4
  web: http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.1/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=wVn1tHi1XBn7avQ9Q1oXuoqE2Splyc.X
  web: http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.2/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=RQFsRCUlj0X77.OTMJZ01Dcx5MRjfljQ
  web: http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.3/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=XdlgqNsFbXeHxzHsdZkR1CUNxozBcL8W
  web: http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=ZY7d3uaj43tD6oPwq8JCuRyEBD91OgCY

  datalad-archives: dl+archive:MD5E-s2415911708--63e02dea2a059517f25d1c2c9ae6c423.zip#path=ds000001_R2.0.2/sub-01/anat/sub-01_inplaneT2.nii.gz&size=669578
  datalad-archives: dl+archive:MD5E-s2415911710--5e509bac5a729d668f900d0440228402.zip#path=ds000001_R2.0.3/sub-01/anat/sub-01_inplaneT2.nii.gz&size=669578
  datalad-archives: dl+archive:MD5E-s2416078169--53fbcb667531f96e2532723e74ed218d.zip#path=ds001_R2.0.4/sub-01/anat/sub-01_inplaneT2.nii.gz&size=669578
  datalad-archives: dl+archive:MD5E-s2416581890--662e0713d0ce42bcdbadb8251b893b8a.tgz#path=ds001/sub-01/anat/sub-01_inplaneT2.nii.gz&size=669578
  datalad-archives: dl+archive:MD5E-s2426534406--aa96b8184370de7170ab46e1a643c3c0.0raw.tgz#path=ds001_R2.0.0/sub-01/anat/sub-01_inplaneT2.nii.gz&size=669578
  datalad-archives: dl+archive:MD5E-s2426534406--aa96b8184370de7170ab46e1a643c3c0.0raw.tgz/ds001_R2.0.0/sub-01/anat/sub-01_inplaneT2.nii.gz#size=669578
  datalad-archives: dl+archive:MD5E-s2527262329--bd3ea399057c529b37b09dcecec1ca60.0raw.tgz/ds001_R1.1.0/sub001/anatomy/inplane001.nii.gz#size=669578

Note that in the case of the datalad-archives special remote it makes no sense not to specify the only prefix it supports. So the question becomes: ask for everything and filter (which clearly cannot be more efficient there) vs. ask only for specific prefixes.

Also, without using a prefix and being able to stop the generator before checking all schemes, there would be no benefit to having gen_URLS be a generator, since we would be doomed to first suck in all responses from annex for a given prefix.
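For reference, the per-prefix exchange discussed here amounts to sending one GETURLS line and reading VALUE replies until a bare VALUE terminator, per the git-annex external special remote protocol. A sketch over simple line callbacks (the callbacks simulate the annex pipe and are not real API):

```python
def geturls(send_line, recv_line, key, prefix):
    """One GETURLS round-trip: git-annex replies with 'VALUE <url>'
    lines and terminates the list with a bare 'VALUE'."""
    send_line("GETURLS %s %s" % (key, prefix))
    urls = []
    while True:
        resp = recv_line()
        if not resp.startswith("VALUE"):
            raise RuntimeError("unexpected reply: %r" % resp)
        value = resp[len("VALUE"):].strip()
        if not value:
            break  # bare VALUE: end of the URL list
        urls.append(value)
    return urls

# Simulated annex replies for a demo run
replies = iter(["VALUE http://e/a", "VALUE http://e/b", "VALUE"])
sent = []
urls = geturls(sent.append, lambda: next(replies), "KEY", "http:")
```

Each additional prefix queried repeats this whole exchange, which is exactly the overhead the timing experiment below tries to measure.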

Plus, wouldn't asking for all the URLs have the advantage of gen_URLS yielding URLs for all supported schemes?

it would yield them all as well in the current implementation, although looping through the schemes (in order of preference based on prior interactions).

And even if you did, how does the processing weigh against eliminating additional interactions with annex?
At any rate, without actual timings it's hard to say anything

Indeed, and in any realistic example it would be heavily masked by the actual interactions with the resource (remote URLs), making it hard to impossible to time. BUT it has already resulted in less pollution of the logs, so it should be a benefit even if the performance advantage is in question.

kyleam commented Sep 24, 2020

We're talking past each other, but it doesn't matter. Please go with whatever you want.

yarikoptic (Member, Author):

ok, although I think I might be getting what you are hinting at -- that every GETURLS request to git-annex has the overhead of first getting all the URLs and then filtering for the specified prefix (on the git-annex side). Then indeed, if the remote ends up querying more than a single prefix, we would just multiply that overhead. Is that what you are talking about? The idea/hope here is that a single prefix hit would be sufficient.

meanwhile, I wondered how much of a penalty comes from doing multiple queries to annex. Here is the original script's performance, and then the performance of a modified one that also does those GETURLS requests to annex for one or more schemes:

script from `~datalad/trash/speedyannex2`
$> cat git-annex-remote-datalad
#!/bin/bash

set -e

schemes=( $SCHEMES )
# Report debugging output (disabled by default)
report () {
    # echo "$@" >&2
    :
}

# Read a response line and verify it matches the expected one
recv () {
    read resp
    #resp=${resp%\n}
    target="$@"
    if [ "$resp" != "$target" ]; then
        report "! exp $target"
        report "  got $resp"
    else
        report "+ got $resp"
    fi
}

send () {
    echo "$@"
    # report "sent $@"
}

send VERSION 1
recv EXTENSIONS INFO ASYNC
send UNSUPPORTED-REQUEST
recv PREPARE
send PREPARE-SUCCESS
send DEBUG Encodings: filesystem utf-8, default utf-8

# from now on should be checkpresent sequence
# recv CHECKPRESENT MD5E-s70189716--45f4514f972325f481d5c09434c1c94a.nii.gz
while read CMD KEY; do
    if [ $CMD != CHECKPRESENT ]; then
        echo "Was expecting CHECKPRESENT, got $CMD"
        exit 1
    fi
    for scheme in ${schemes[*]}; do
        send GETURLS $KEY $scheme:
        url=bogusforuntil
        while [ ! -z "$url" ]; do
            read resp url
            test "$resp" = VALUE
        done
    done
    send CHECKPRESENT-SUCCESS $KEY
done
$> time PATH=~datalad/trash/speedyannex:$PATH git annex fsck --from datalad --fast >/dev/null 
PATH=~datalad/trash/speedyannex:$PATH git annex fsck --from datalad --fast >   11.18s user 2.51s system 141% cpu 9.690 total

$> time PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >/dev/null
PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >  11.75s user 2.74s system 141% cpu 10.241 total

$> time SCHEMES=s3 PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >/dev/null
SCHEMES=s3 PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from   >   25.33s user 5.68s system 123% cpu 25.158 total

$> time SCHEMES="s3 http https ftp" PATH=~datalad/trash/speedyannex2:$PATH git annex fsck --from datalad --fast >/dev/null
SCHEMES="s3 http https ftp" PATH=~datalad/trash/speedyannex2:$PATH git annex   28.21s user 7.41s system 122% cpu 29.180 total
so it does add a little (about 4 seconds on top of 25). Once again, in a realistic case such an impact would likely not be noticeable, since actually querying the resource would take longer.

@kyleam kyleam merged commit 25c03f9 into datalad:master Oct 1, 2020
2 checks passed
yarikoptic (Member, Author):

Thank you Kyle. FWIW, I have been using this in "production" since then and spotted no side effects.
