OPT: addurls - no add per file, mock.patch once per dataset #4867
Conversation
…tadata Since they were addurl'ed first anyways they should have been added already and known to annex, so the reason for the operation is not clear to me, and tests seem to still pass (haven't tried in RL to see any side-effect yet). Moreover, unless `save=False`, there is a subsequent `dataset.save` call to follow anyways which would add/commit them if so desired (i.e. by default).
I didn't look at this yet, but it seems like that should be the first step.
heh, ok the script:

```sh
#!/bin/bash
export PS4='> '
set -eu
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
d=`pwd`
echo -e "url,filename,bogus1,bogus2" > list.csv
mkdir src
(cd src; for s in {1..100}; do echo $s > $s; echo "file:///$d/src/$s,$s,pocus,focus" >> ../list.csv; done)
set -x
datalad create dest
(
cd dest;
git config annex.security.allowed-url-schemes file;
time datalad addurls -x '*' ../list.csv '{url}' '(unknown)'
)
```

master 0.13.3-151-g996368d32 addurls - 12 sec while enjoying the progress bar:

```
$> bash time-addurls.sh
> datalad create dest
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-tM2yQxA/dest
[INFO   ] Scanning for unlocked files (this may take some time)
create(ok): /home/yoh/.tmp/dl-tM2yQxA/dest (dataset)
> cd dest
> git config annex.security.allowed-url-schemes file
> datalad addurls -x '*' ../list.csv '{url}' '(unknown)'
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/1 (file) [to 1]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/2 (file) [to 2]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/3 (file) [to 3]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/4 (file) [to 4]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/5 (file) [to 5]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/6 (file) [to 6]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/7 (file) [to 7]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/8 (file) [to 8]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/9 (file) [to 9]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/10 (file) [to 10]
[90 similar messages have been suppressed]
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/1 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/2 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/3 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/4 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/5 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/6 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/7 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/8 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/9 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/10 (file)
[90 similar messages have been suppressed]
save(ok): /home/yoh/.tmp/dl-tM2yQxA/dest (dataset)
action summary:
  add (ok: 100)
  addurl (ok: 100)
  save (ok: 1)

real    0m11.716s
user    0m8.735s
sys     0m3.480s
bash time-addurls.sh  10.14s user 3.96s system 104% cpu 13.518 total
```

this PR - 2 sec and no progress bar:

```
$> bash time-addurls.sh
> datalad create dest
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-GP2jWms/dest
[INFO   ] Scanning for unlocked files (this may take some time)
create(ok): /home/yoh/.tmp/dl-GP2jWms/dest (dataset)
> cd dest
> git config annex.security.allowed-url-schemes file
> datalad addurls -x '*' ../list.csv '{url}' '(unknown)'
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/1 (file) [to 1]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/2 (file) [to 2]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/3 (file) [to 3]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/4 (file) [to 4]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/5 (file) [to 5]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/6 (file) [to 6]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/7 (file) [to 7]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/8 (file) [to 8]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/9 (file) [to 9]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/10 (file) [to 10]
[90 similar messages have been suppressed]
save(ok): /home/yoh/.tmp/dl-GP2jWms/dest (dataset)
action summary:
  addurl (ok: 100)
  save (ok: 1)

real    0m1.873s
user    0m1.487s
sys     0m0.487s
```
If I remove the exclusion:

master - 20 sec:

```
action summary:
  add (notneeded: 99, ok: 1)
  addurl (ok: 100)
  metadata (ok: 100)
  save (ok: 1)

real    0m19.662s
user    0m15.224s
sys     0m6.237s
> datalad status
nothing to save, working tree clean
```

NB: in the above, this PR - 10 sec:

```
action summary:
  addurl (ok: 100)
  metadata (ok: 100)
  save (ok: 1)

real    0m10.476s
user    0m8.678s
sys     0m3.462s
> datalad status
nothing to save, working tree clean
```
Thanks! :)
Codecov Report

```
@@           Coverage Diff           @@
##           master   #4867   +/-   ##
=======================================
  Coverage   89.70%   89.71%
=======================================
  Files         289      289
  Lines       40599    40596      -3
=======================================
  Hits        36419    36419
+ Misses       4180     4177      -3
=======================================
```
Nice. (And thanks again for the paste-and-go timing scripts.)
Subject: [PATCH 1/4] DOC: fix up .debug message to note that it is about to start adding URLs not metadata
Oops, thanks for catching my sloppy copy-paste error.
Subject: [PATCH 2/4] RM: add_meta - do not bother .repo.add'ing each file before adding metadata

Since they were addurl'ed first anyways they should have been added already and known to annex, so the reason for the operation is not clear to me, and tests seem to still pass (haven't tried in RL to see any side-effect yet).
Hmm, true, addurl should always take care of that. Here's how I think things ended up in this confusing spot. Originally a `datalad.add(save=SAVE)` call used to be done, taking care of the save (unless `save=False`) before adding metadata. Then the `save=SAVE` was flipped to `save=False` in 9faa07b (ENH: addurls: Reduce the number of metadata commits, 2018-03-26), introducing a downstream `save` call. At that point, I didn't step back and realize I could drop the `add` call entirely. And that still didn't occur to me when I reviewed a later commit (a5e6eea) that replaced the `datalad.add` call with a `repo.add` call in `add_meta` :/

So, this change looks correct to me.
Subject: [PATCH 3/4] OPT: do not bother calling add_meta on files without metadata to be added
Good idea.
Subject: [PATCH 4/4] OPT: group datasets and perform patch.object for always_commit once per dataset

I bet it does not take much time to patch.object, but it still does take time, so why waste it?
Makes sense.
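To illustrate the grouping idea in the patch (with hypothetical stand-in classes, not datalad's actual `AnnexRepo`/`Dataset` API): instead of toggling `always_commit` around every single file, rows are first bucketed by dataset and `patch.object` is entered once per bucket.

```python
from collections import defaultdict
from unittest.mock import patch

class FakeRepo:
    # hypothetical stand-in for an annex repo; only the attribute matters
    always_commit = True

class FakeDataset:
    def __init__(self):
        self.repo = FakeRepo()

def add_meta(rows):
    """Patch always_commit once per dataset, not once per file."""
    dss_rows = defaultdict(list)
    for row in rows:
        dss_rows[row["ds"]].append(row)

    n_patch_cycles = 0
    for ds, ds_rows in dss_rows.items():
        # one patch/unpatch per dataset, however many files it has
        with patch.object(ds.repo, "always_commit", False):
            n_patch_cycles += 1
            for row in ds_rows:
                pass  # metadata for row["ds_filename"] would be set here
    return n_patch_cycles

ds = FakeDataset()
rows = [{"ds": ds, "ds_filename": str(i)} for i in range(100)]
assert add_meta(rows) == 1  # 100 files, but only one patch cycle
```

With thousands of files per dataset, saving the repeated patch/unpatch (and whatever `always_commit` toggling triggers in the repo) adds up.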
datalad/plugin/addurls.py (outdated):

```
@@ -556,30 +556,24 @@ def add_meta(rows):
    """
    from unittest.mock import patch

    # OPT: group by dataset first so to not patch/unpatch always_commit
    # per each file of which we could have thousands
    from collections import defaultdict
```
Fine as is, but we use collections widely enough that delaying this import doesn't save us anything.
Thanks for the review! Forgot to move it up ;)
datalad/plugin/addurls.py (outdated):

```
ds, filename = row["ds"], row["ds_filename"]
dss_rows[row["ds"]].append(row)

for ds, ds_rows in dss_rows.items():
```
Again, fine as is, but in case you didn't consider it: `itertools.groupby` would work nicely here.
Great idea, will use it and that will remove the aforementioned import.
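One caveat with `itertools.groupby`: it only merges *consecutive* items with equal keys, so the rows must be sorted (or already ordered) by the grouping key. In addurls the key would be a dataset object rather than a string, so sorting would need a key such as its path; this sketch uses made-up string keys:

```python
from itertools import groupby

rows = [
    {"ds": "a", "file": "1"},
    {"ds": "a", "file": "2"},
    {"ds": "b", "file": "3"},
    {"ds": "a", "file": "4"},  # not adjacent to the other "a" rows
]

# Sort first; without it, groupby would emit "a" as two separate groups.
key = lambda row: row["ds"]
grouped = {ds: [r["file"] for r in group]
           for ds, group in groupby(sorted(rows, key=key), key=key)}
assert grouped == {"a": ["1", "2", "4"], "b": ["3"]}
```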
(force-pushed 6c7b4a5 to 6af1398)
pushed recommended changes + a parametrized benchmark. I made the benchmark fail if, in the teardown, `status` shows something not clean.

benchmark results:

```
$> asv continuous -b .*addurls.* master opt-addurls-no-metadataloop
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7
·· Building 6af13984 <opt-addurls-no-metadataloop> for virtualenv-py3.7.
·· Installing 6af13984 <opt-addurls-no-metadataloop> into virtualenv-py3.7.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For DataLad commit 996368d3 <master> (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7...
[  0.00%] ·· Benchmarking virtualenv-py3.7
[ 25.00%] ··· Running (plugins.addurls.addurls1.time_addurls--).
[ 25.00%] · For DataLad commit 6af13984 <opt-addurls-no-metadataloop> (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7..
[ 25.00%] ·· Benchmarking virtualenv-py3.7
[ 50.00%] ··· Running (plugins.addurls.addurls1.time_addurls--).
[ 50.00%] · For DataLad commit 6af13984 <opt-addurls-no-metadataloop> (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7
[ 75.00%] ··· plugins.addurls.addurls1.time_addurls        ok
[ 75.00%] ··· ================== ===========
               exclude_metadata
              ------------------ -----------
                    None          2.26±0.1s
                     *            694±70ms
              ================== ===========
[ 75.00%] · For DataLad commit 996368d3 <master> (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7..
[ 75.00%] ·· Benchmarking virtualenv-py3.7
[100.00%] ··· plugins.addurls.addurls1.time_addurls        ok
[100.00%] ··· ================== ============
               exclude_metadata
              ------------------ ------------
                    None          4.25±0.1s
                     *            2.46±0.05s
              ================== ============

       before           after         ratio
     [996368d3]       [6af13984]
     <master>         <opt-addurls-no-metadataloop>
-      4.25±0.1s       2.26±0.1s     0.53  plugins.addurls.addurls1.time_addurls(None)
-     2.46±0.05s        694±70ms     0.28  plugins.addurls.addurls1.time_addurls('*')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
```
Wonderful! Very happy to see such a change. This would have literally decreased the time to populate the HCP dataset by several days (~17 days, to be more precise ;-))!
I have been running addurls with `-x '*'` to not bother with adding metadata for datasets with a notable number of files (they will end up with 2M files across subdataset(s)). In this "batch" there seem to be only 20k files, and actually adding the URLs completed quite fast. But then I was staring at a slowly growing `Adding metadata (10240 skipped):...` progress bar. The majority were skipped, but not due to "no metadata" I think; my guess is that the `.add` was simply not needed (`with_result_progress` maps "notneeded" into "skipped" and does not filter by record in any way).

This PR is now a collection of small changes intended to optimize `addurls` while running on a large(r) number of files without bothering to add metadata (although I didn't yet test whether there is a real advantage).

One change in behavior: I removed the `repo.add` which was run for every file, whose message was then percolated up, while at the end, in a typical use case, `ds.save` would have been called on those files again anyways. Now, if `save=False`, those files would remain "annexed" (converted to symlinks) but not added to the git repository. IMHO that is OK since "staging no longer exists" in the datalad world, and `save=False` was IIRC introduced to facilitate use with `datalad run`, which calls `save` anyways.

Let me know what you think, and I will probably test the next batch of running addurls on this version to see if it feels any better.
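A toy illustration of the reporting quirk described above (not datalad's actual code): if a progress helper lumps every non-"ok" status under "skipped", then "notneeded" results from the per-file add become indistinguishable from files that genuinely had no metadata.

```python
def summarize(results):
    # toy counter: anything that is not "ok" is reported as "skipped",
    # so "notneeded" (add was unnecessary) and genuine skips look alike
    counts = {"ok": 0, "skipped": 0}
    for res in results:
        counts["ok" if res["status"] == "ok" else "skipped"] += 1
    return counts

results = [{"status": "ok"}] * 3 + [{"status": "notneeded"}] * 2
assert summarize(results) == {"ok": 3, "skipped": 2}
```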