
OPT: addurls - no add per file, mock.patch once per dataset #4867

Merged
merged 6 commits into from Sep 4, 2020

Conversation

yarikoptic
Member

I have been running addurls with -x '*' to avoid adding metadata for these datasets, which have a notable number of files (they will end up with 2M files across subdataset(s)). The current "batch" seems to hold only 20k files, and actually adding the URLs completed quite fast. But then I was staring at a slowly growing Adding metadata (10240 skipped):... So the majority were skipped, but my guess is not because of "no metadata" — rather because .add was not needed (with_result_progress maps "notneeded" into "skipped" and does not filter by record in any way).

This PR is now a collection of small changes intended to optimize addurls when running on a large(r) number of files without adding metadata (although I haven't yet tested whether there is a real advantage).

One change in behavior: I removed the repo.add call, which previously ran for every file; its message was percolated up, and then, in a typical use case, ds.save would have been called on those files again at the end. Now, if save=False, those files would remain "annexed" (converted to symlinks) but not added to the git repository. IMHO that is OK, since "staging" no longer exists in the datalad world, and save=False was IIRC introduced to facilitate use with datalad run, which calls save anyway.

Let me know what you think, and I will probably test the next batch of addurls runs on this version to see if it feels any better.

…tadata

Since they were addurl'ed first anyway, they should have been added
already and be known to annex, so the reason for the operation is not
clear to me, and tests seem to still pass (haven't tried in RL yet to see
any side effects).

Moreover, unless `save=False`, there is a subsequent `dataset.save`
call to follow anyway, which would add/commit them if so
desired (i.e. by default).
@yarikoptic yarikoptic added the performance Improve performance of an existing feature label Sep 3, 2020
@kyleam
Contributor

kyleam commented Sep 3, 2020

although I didn't test yet whether there is a real advantage

I didn't look at this yet, but it seems like that should be the first step.

@yarikoptic
Member Author

yarikoptic commented Sep 3, 2020

heh, ok

the script
#!/bin/bash

export PS4='> '

set -eu
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
d=`pwd`

echo -e "url,filename,bogus1,bogus2" > list.csv
mkdir src
(cd src; for s in {1..100}; do echo $s > $s; echo "file:///$d/src/$s,$s,pocus,focus" >> ../list.csv; done)

set -x
datalad create dest
(
cd dest;
git config annex.security.allowed-url-schemes file;
time datalad addurls -x '*' ../list.csv '{url}' '{filename}'
)
master (0.13.3-151-g996368d32) addurls: 12 sec, while enjoying the progress bar
$> bash time-addurls.sh
> datalad create dest
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-tM2yQxA/dest 
[INFO   ] Scanning for unlocked files (this may take some time) 
create(ok): /home/yoh/.tmp/dl-tM2yQxA/dest (dataset)
> cd dest
> git config annex.security.allowed-url-schemes file
> datalad addurls -x '*' ../list.csv '{url}' '{filename}'
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/1 (file) [to 1]                                                       
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/2 (file) [to 2]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/3 (file) [to 3]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/4 (file) [to 4]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/5 (file) [to 5]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/6 (file) [to 6]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/7 (file) [to 7]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/8 (file) [to 8]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/9 (file) [to 9]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/10 (file) [to 10]
  [90 similar messages have been suppressed]                                                                     
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/1 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/2 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/3 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/4 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/5 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/6 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/7 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/8 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/9 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/10 (file)
  [90 similar messages have been suppressed]
save(ok): /home/yoh/.tmp/dl-tM2yQxA/dest (dataset)
action summary:
  add (ok: 100)
  addurl (ok: 100)
  save (ok: 1)

real	0m11.716s
user	0m8.735s
sys	0m3.480s
bash time-addurls.sh  10.14s user 3.96s system 104% cpu 13.518 total
this PR: 2 sec, and no progress bar
$> bash time-addurls.sh
> datalad create dest
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-GP2jWms/dest 
[INFO   ] Scanning for unlocked files (this may take some time) 
create(ok): /home/yoh/.tmp/dl-GP2jWms/dest (dataset)
> cd dest
> git config annex.security.allowed-url-schemes file
> datalad addurls -x '*' ../list.csv '{url}' '{filename}'
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/1 (file) [to 1]                                                       
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/2 (file) [to 2]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/3 (file) [to 3]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/4 (file) [to 4]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/5 (file) [to 5]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/6 (file) [to 6]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/7 (file) [to 7]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/8 (file) [to 8]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/9 (file) [to 9]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/10 (file) [to 10]
  [90 similar messages have been suppressed]  
save(ok): /home/yoh/.tmp/dl-GP2jWms/dest (dataset)
action summary:
  addurl (ok: 100)
  save (ok: 1)

real	0m1.873s
user	0m1.487s
sys	0m0.487s

If I remove -x '*' so that metadata is added (and with datalad status also run at the end):

master - 20 sec
action summary:
  add (notneeded: 99, ok: 1)
  addurl (ok: 100)
  metadata (ok: 100)
  save (ok: 1)

real	0m19.662s
user	0m15.224s
sys	0m6.237s
> datalad status
nothing to save, working tree clean

edit NB: in the above, add (notneeded: 99, ok: 1) is "intriguing", since it is not clear why that one file was so special.

this PR - 10 sec
...
action summary:
  addurl (ok: 100)
  metadata (ok: 100)
  save (ok: 1)

real	0m10.476s
user	0m8.678s
sys	0m3.462s
> datalad status
nothing to save, working tree clean

@kyleam
Contributor

kyleam commented Sep 3, 2020

Thanks! :)

@codecov

codecov bot commented Sep 3, 2020

Codecov Report

Merging #4867 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #4867   +/-   ##
=======================================
  Coverage   89.70%   89.71%           
=======================================
  Files         289      289           
  Lines       40599    40596    -3     
=======================================
  Hits        36419    36419           
+ Misses       4180     4177    -3     
Impacted Files Coverage Δ
datalad/plugin/addurls.py 99.72% <100.00%> (+0.54%) ⬆️
datalad/support/gitrepo.py 89.42% <0.00%> (+0.07%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 996368d...6af1398.

Contributor

@kyleam kyleam left a comment


Nice. (And thanks again for the paste-and-go timing scripts.)

Subject: [PATCH 1/4] DOC: fix up .debug message to note that it is about to
start adding URLs not metadata

Oops, thanks for catching my sloppy copy-paste error.

Subject: [PATCH 2/4] RM: add_meta - do not bother .repo.add'ing each file
before adding metadata

Since they were addurl'ed first anyway, they should have been added
already and be known to annex, so the reason for the operation is not
clear to me, and tests seem to still pass (haven't tried in RL yet to see
any side effects).

Hmm, true, addurl should always take care of that. Here's how I think things ended up in this confusing spot. Originally a datalad.add(save=SAVE) call was done, taking care of the save (unless save=False) before adding metadata. Then save=SAVE was flipped to save=False in 9faa07b (ENH: addurls: Reduce the number of metadata commits, 2018-03-26), which introduced a downstream save call. At that point, I didn't step back and realize I could drop the add call entirely. And it still didn't occur to me when I reviewed a later commit (a5e6eea) that replaced the datalad.add call with a repo.add call in add_meta :/

So, this change looks correct to me.

Subject: [PATCH 3/4] OPT: do not bother calling add_meta on files without
metadata to be added

Good idea.

Subject: [PATCH 4/4] OPT: group datasets and perform patch.object for
always_commit once per dataset

I bet it does not take much time to patch.object, but it still takes some time,
so why waste it?

Makes sense.
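The grouping idea can be illustrated with a minimal, self-contained sketch. FakeRepo and the row dicts below are hypothetical stand-ins, not the actual datalad API; the point is only that `patch.object` on `always_commit` runs once per dataset rather than once per file:

```python
from collections import defaultdict
from unittest.mock import patch

class FakeRepo:
    """Hypothetical stand-in for an AnnexRepo with an always_commit flag."""
    always_commit = True

repos = {"ds-a": FakeRepo(), "ds-b": FakeRepo()}

# Hypothetical rows: each carries the dataset it belongs to.
rows = [{"ds": "ds-a", "file": "1"},
        {"ds": "ds-a", "file": "2"},
        {"ds": "ds-b", "file": "3"}]

# Group rows by dataset first, so always_commit is patched/unpatched
# once per dataset instead of once per file (of which there could be thousands).
dss_rows = defaultdict(list)
for row in rows:
    dss_rows[row["ds"]].append(row)

seen = []
for ds, ds_rows in dss_rows.items():
    with patch.object(repos[ds], "always_commit", False):
        for row in ds_rows:
            # metadata for the file would be set here, without a per-file commit
            seen.append((ds, row["file"], repos[ds].always_commit))

print(seen)
```

On exiting each `with` block, `patch.object` restores `always_commit` to its original value, so the repo objects are left untouched.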

@@ -556,30 +556,24 @@ def add_meta(rows):
"""
from unittest.mock import patch

# OPT: group by dataset first so to not patch/unpatch always_commit
# per each file of which we could have thousands
from collections import defaultdict
Contributor


Fine as is, but we use collections widely enough that delaying this import doesn't save us anything.

Member Author


Thanks for the review! Forgot to move it up ;)

ds, filename = row["ds"], row["ds_filename"]
dss_rows[row["ds"]].append(row)

for ds, ds_rows in dss_rows.items():
Contributor


Again, fine as is, but in case you didn't consider it: itertools.groupby would work nicely here.

Member Author


Great idea, I will use it, and that will also remove the aforementioned import.
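For reference, a minimal sketch of the suggested itertools.groupby approach (the row dicts are hypothetical, not the actual addurls row structure). One caveat worth noting: groupby only merges *consecutive* items with equal keys, so the rows must be sorted or otherwise contiguous by dataset first:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows; each carries the dataset it belongs to.
rows = [{"ds": "ds-a", "file": "1"},
        {"ds": "ds-b", "file": "3"},
        {"ds": "ds-a", "file": "2"}]

# Sort by dataset so that groupby sees each dataset's rows as one run.
rows.sort(key=itemgetter("ds"))
grouped = {ds: [r["file"] for r in ds_rows]
           for ds, ds_rows in groupby(rows, key=itemgetter("ds"))}
print(grouped)  # {'ds-a': ['1', '2'], 'ds-b': ['3']}
```

This replaces the defaultdict accumulation (and its import) with a single pass over the sorted rows.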

@yarikoptic
Member Author

Pushed the recommended changes plus a parametrized benchmark. I made the benchmark fail if, in the teardown, status shows anything not clean.

benchmark results
$> asv continuous -b .*addurls.* master opt-addurls-no-metadataloop
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7
·· Building 6af13984 <opt-addurls-no-metadataloop> for virtualenv-py3.7.
·· Installing 6af13984 <opt-addurls-no-metadataloop> into virtualenv-py3.7.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For DataLad commit 996368d3 <master> (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7...
[  0.00%] ·· Benchmarking virtualenv-py3.7
[ 25.00%] ··· Running (plugins.addurls.addurls1.time_addurls--).
[ 25.00%] · For DataLad commit 6af13984 <opt-addurls-no-metadataloop> (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7..
[ 25.00%] ·· Benchmarking virtualenv-py3.7
[ 50.00%] ··· Running (plugins.addurls.addurls1.time_addurls--).
[ 50.00%] · For DataLad commit 6af13984 <opt-addurls-no-metadataloop> (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7
[ 75.00%] ··· plugins.addurls.addurls1.time_addurls                                                                            ok
[ 75.00%] ··· ================== ===========
               exclude_metadata             
              ------------------ -----------
                     None         2.26±0.1s 
                      *            694±70ms 
              ================== ===========

[ 75.00%] · For DataLad commit 996368d3 <master> (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7..
[ 75.00%] ·· Benchmarking virtualenv-py3.7
[100.00%] ··· plugins.addurls.addurls1.time_addurls                                                                            ok
[100.00%] ··· ================== ============
               exclude_metadata              
              ------------------ ------------
                     None         4.25±0.1s  
                      *           2.46±0.05s 
              ================== ============

       before           after         ratio
     [996368d3]       [6af13984]
     <master>         <opt-addurls-no-metadataloop>
-       4.25±0.1s        2.26±0.1s     0.53  plugins.addurls.addurls1.time_addurls(None)
-      2.46±0.05s         694±70ms     0.28  plugins.addurls.addurls1.time_addurls('*')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@mih
Member

mih commented Sep 4, 2020

Wonderful! Very happy to see such a change. This would have literally decreased the time to populate the HCP dataset by several days (~17 days, to be more precise ;-))!

@mih mih merged commit c8397d2 into datalad:master Sep 4, 2020
@yarikoptic yarikoptic deleted the opt-addurls-no-metadataloop branch September 4, 2020 15:30