
OPT: addurls - no add per file, mock.patch once per dataset #4867

Merged
merged 6 commits into from Sep 4, 2020

Conversation

yarikoptic
Member

I have been running addurls with -x '*' to avoid adding metadata for these datasets, which have a notable number of files (they will end up with 2M files across subdataset(s)). The current "batch" seems to hold only 20k files, and actually adding the URLs completed quite fast. But then I was staring at a slowly growing Adding metadata (10240 skipped):... So the majority were skipped, but my guess is not because of "no metadata" — rather because .add was not needed (with_result_progress maps "notneeded" into "skipped" and does not filter by record in any way).

This PR is now a collection of small changes intended to optimize addurls when running on a large(r) number of files without adding metadata (although I haven't yet tested whether there is a real advantage).

One change in behavior: I removed the repo.add call, which previously ran for every file; its message was percolated up, and then, in a typical use case, ds.save would have been called on those files again at the end. Now, if save=False, those files would remain "annexed" (converted to symlinks) but not added to the git repository. IMHO that is OK, since "staging" no longer exists in the datalad world, and save=False was IIRC introduced to facilitate use with datalad run, which calls save anyway.

Let me know what you think, and I will probably test the next batch of addurls runs on this version to see if it feels any better.

…tadata

Since they were addurl'ed first anyway, they should have been added
already and be known to annex, so the reason for the operation is not
clear to me, and tests seem to still pass (haven't tried in RL yet to see
any side effects).

Moreover, unless `save=False`, there is a subsequent `dataset.save`
call to follow anyway, which would add/commit them if so
desired (i.e. by default).
@yarikoptic yarikoptic added the performance Improve performance of an existing feature label Sep 3, 2020
@kyleam
Contributor

kyleam commented Sep 3, 2020

although I didn't test yet whether there is a real advantage

I didn't look at this yet, but it seems like that should be the first step.

@yarikoptic
Member Author

yarikoptic commented Sep 3, 2020

heh, ok

the script
#!/bin/bash

export PS4='> '

set -eu
cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
d=`pwd`

echo -e "url,filename,bogus1,bogus2" > list.csv
mkdir src
(cd src; for s in {1..100}; do echo $s > $s; echo "file:///$d/src/$s,$s,pocus,focus" >> ../list.csv; done)

set -x
datalad create dest
(
cd dest;
git config annex.security.allowed-url-schemes file;
time datalad addurls -x '*' ../list.csv '{url}' '{filename}'
)
master (0.13.3-151-g996368d32) addurls: 12 sec, while enjoying the progress bar
$> bash time-addurls.sh
> datalad create dest
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-tM2yQxA/dest 
[INFO   ] Scanning for unlocked files (this may take some time) 
create(ok): /home/yoh/.tmp/dl-tM2yQxA/dest (dataset)
> cd dest
> git config annex.security.allowed-url-schemes file
> datalad addurls -x '*' ../list.csv '{url}' '{filename}'
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/1 (file) [to 1]                                                       
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/2 (file) [to 2]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/3 (file) [to 3]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/4 (file) [to 4]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/5 (file) [to 5]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/6 (file) [to 6]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/7 (file) [to 7]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/8 (file) [to 8]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/9 (file) [to 9]
addurl(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/10 (file) [to 10]
  [90 similar messages have been suppressed]                                                                     
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/1 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/2 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/3 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/4 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/5 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/6 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/7 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/8 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/9 (file)
add(ok): /home/yoh/.tmp/dl-tM2yQxA/dest/10 (file)
  [90 similar messages have been suppressed]
save(ok): /home/yoh/.tmp/dl-tM2yQxA/dest (dataset)
action summary:
  add (ok: 100)
  addurl (ok: 100)
  save (ok: 1)

real	0m11.716s
user	0m8.735s
sys	0m3.480s
bash time-addurls.sh  10.14s user 3.96s system 104% cpu 13.518 total
this PR: 2 sec, and no progress bar
$> bash time-addurls.sh
> datalad create dest
[INFO   ] Creating a new annex repo at /home/yoh/.tmp/dl-GP2jWms/dest 
[INFO   ] Scanning for unlocked files (this may take some time) 
create(ok): /home/yoh/.tmp/dl-GP2jWms/dest (dataset)
> cd dest
> git config annex.security.allowed-url-schemes file
> datalad addurls -x '*' ../list.csv '{url}' '{filename}'
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/1 (file) [to 1]                                                       
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/2 (file) [to 2]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/3 (file) [to 3]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/4 (file) [to 4]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/5 (file) [to 5]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/6 (file) [to 6]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/7 (file) [to 7]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/8 (file) [to 8]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/9 (file) [to 9]
addurl(ok): /home/yoh/.tmp/dl-GP2jWms/dest/10 (file) [to 10]
  [90 similar messages have been suppressed]  
save(ok): /home/yoh/.tmp/dl-GP2jWms/dest (dataset)
action summary:
  addurl (ok: 100)
  save (ok: 1)

real	0m1.873s
user	0m1.487s
sys	0m0.487s

If I remove -x '*' so that metadata is added (and with datalad status also run at the end):

master - 20 sec
action summary:
  add (notneeded: 99, ok: 1)
  addurl (ok: 100)
  metadata (ok: 100)
  save (ok: 1)

real	0m19.662s
user	0m15.224s
sys	0m6.237s
> datalad status
nothing to save, working tree clean

edit NB: in the above, add (notneeded: 99, ok: 1) is "intriguing", since it is not clear why that one file was so special.

this PR - 10 sec
...
action summary:
  addurl (ok: 100)
  metadata (ok: 100)
  save (ok: 1)

real	0m10.476s
user	0m8.678s
sys	0m3.462s
> datalad status
nothing to save, working tree clean

@kyleam
Contributor

kyleam commented Sep 3, 2020

Thanks! :)

@codecov

codecov bot commented Sep 3, 2020

Codecov Report

Merging #4867 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #4867   +/-   ##
=======================================
  Coverage   89.70%   89.71%           
=======================================
  Files         289      289           
  Lines       40599    40596    -3     
=======================================
  Hits        36419    36419           
+ Misses       4180     4177    -3     
Impacted Files Coverage Δ
datalad/plugin/addurls.py 99.72% <100.00%> (+0.54%) ⬆️
datalad/support/gitrepo.py 89.42% <0.00%> (+0.07%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 996368d...6af1398.

Contributor

@kyleam kyleam left a comment


Nice. (And thanks again for the paste-and-go timing scripts.)

Subject: [PATCH 1/4] DOC: fix up .debug message to note that it is about to
start adding URLs not metadata

Oops, thanks for catching my sloppy copy-paste error.

Subject: [PATCH 2/4] RM: add_meta - do not bother .repo.add'ing each file
before adding metadata

Since they were addurl'ed first anyway, they should have been added
already and be known to annex, so the reason for the operation is not
clear to me, and tests seem to still pass (haven't tried in RL yet to see
any side effects).

Hmm, true, addurl should always take care of that. Here's how I think things ended up in this confusing spot. Originally a datalad.add(save=SAVE) call was done, taking care of the save (unless save=False) before adding metadata. Then save=SAVE was flipped to save=False in 9faa07b (ENH: addurls: Reduce the number of metadata commits, 2018-03-26), which introduced a downstream save call. At that point, I didn't step back and realize I could drop the add call entirely. And it still didn't occur to me when I reviewed a later commit (a5e6eea) that replaced the datalad.add call with a repo.add call in add_meta :/

So, this change looks correct to me.

Subject: [PATCH 3/4] OPT: do not bother calling add_meta on files without
metadata to be added

Good idea.

Subject: [PATCH 4/4] OPT: group datasets and perform patch.object for
always_commit once per dataset

I bet it does not take much time to patch.object, but it still takes some time,
so why waste it?

Makes sense.
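The grouping idea can be illustrated with a minimal, self-contained sketch. FakeRepo and the row dicts below are hypothetical stand-ins, not the actual datalad API; the point is only that `patch.object` on `always_commit` runs once per dataset rather than once per file:

```python
from collections import defaultdict
from unittest.mock import patch

class FakeRepo:
    """Hypothetical stand-in for an AnnexRepo with an always_commit flag."""
    always_commit = True

repos = {"ds-a": FakeRepo(), "ds-b": FakeRepo()}

# Hypothetical rows: each carries the dataset it belongs to.
rows = [{"ds": "ds-a", "file": "1"},
        {"ds": "ds-a", "file": "2"},
        {"ds": "ds-b", "file": "3"}]

# Group rows by dataset first, so always_commit is patched/unpatched
# once per dataset instead of once per file (of which there could be thousands).
dss_rows = defaultdict(list)
for row in rows:
    dss_rows[row["ds"]].append(row)

seen = []
for ds, ds_rows in dss_rows.items():
    with patch.object(repos[ds], "always_commit", False):
        for row in ds_rows:
            # metadata for the file would be set here, without a per-file commit
            seen.append((ds, row["file"], repos[ds].always_commit))

print(seen)
```

On exiting each `with` block, `patch.object` restores `always_commit` to its original value, so the repo objects are left untouched.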

@@ -556,30 +556,24 @@ def add_meta(rows):
"""
from unittest.mock import patch

# OPT: group by dataset first so to not patch/unpatch always_commit
# per each file of which we could have thousands
from collections import defaultdict
Contributor


Fine as is, but we use collections widely enough that delaying this import doesn't save us anything.

Member Author


Thanks for the review! Forgot to move it up ;)

ds, filename = row["ds"], row["ds_filename"]
dss_rows[row["ds"]].append(row)

for ds, ds_rows in dss_rows.items():
Contributor


Again, fine as is, but in case you didn't consider it: itertools.groupby would work nicely here.

Member Author


Great idea, I will use it, and that will also remove the aforementioned import.
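For reference, a minimal sketch of the suggested itertools.groupby approach (the row dicts are hypothetical, not the actual addurls row structure). One caveat worth noting: groupby only merges *consecutive* items with equal keys, so the rows must be sorted or otherwise contiguous by dataset first:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows; each carries the dataset it belongs to.
rows = [{"ds": "ds-a", "file": "1"},
        {"ds": "ds-b", "file": "3"},
        {"ds": "ds-a", "file": "2"}]

# Sort by dataset so that groupby sees each dataset's rows as one run.
rows.sort(key=itemgetter("ds"))
grouped = {ds: [r["file"] for r in ds_rows]
           for ds, ds_rows in groupby(rows, key=itemgetter("ds"))}
print(grouped)  # {'ds-a': ['1', '2'], 'ds-b': ['3']}
```

This replaces the defaultdict accumulation (and its import) with a single pass over the sorted rows.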

@yarikoptic
Member Author

Pushed the recommended changes plus a parametrized benchmark. I made the benchmark fail if, in the teardown, status shows anything not clean.

benchmark results
$> asv continuous -b .*addurls.* master opt-addurls-no-metadataloop
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7
·· Building 6af13984 <opt-addurls-no-metadataloop> for virtualenv-py3.7.
·· Installing 6af13984 <opt-addurls-no-metadataloop> into virtualenv-py3.7.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For DataLad commit 996368d3 <master> (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7...
[  0.00%] ·· Benchmarking virtualenv-py3.7
[ 25.00%] ··· Running (plugins.addurls.addurls1.time_addurls--).
[ 25.00%] · For DataLad commit 6af13984 <opt-addurls-no-metadataloop> (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7..
[ 25.00%] ·· Benchmarking virtualenv-py3.7
[ 50.00%] ··· Running (plugins.addurls.addurls1.time_addurls--).
[ 50.00%] · For DataLad commit 6af13984 <opt-addurls-no-metadataloop> (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7
[ 75.00%] ··· plugins.addurls.addurls1.time_addurls                                                                            ok
[ 75.00%] ··· ================== ===========
               exclude_metadata             
              ------------------ -----------
                     None         2.26±0.1s 
                      *            694±70ms 
              ================== ===========

[ 75.00%] · For DataLad commit 996368d3 <master> (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7..
[ 75.00%] ·· Benchmarking virtualenv-py3.7
[100.00%] ··· plugins.addurls.addurls1.time_addurls                                                                            ok
[100.00%] ··· ================== ============
               exclude_metadata              
              ------------------ ------------
                     None         4.25±0.1s  
                      *           2.46±0.05s 
              ================== ============

       before           after         ratio
     [996368d3]       [6af13984]
     <master>         <opt-addurls-no-metadataloop>
-       4.25±0.1s        2.26±0.1s     0.53  plugins.addurls.addurls1.time_addurls(None)
-      2.46±0.05s         694±70ms     0.28  plugins.addurls.addurls1.time_addurls('*')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@mih
Member

mih commented Sep 4, 2020

Wonderful! Very happy to see such a change. This would have literally decreased the time to populate the HCP dataset by several days (~17 days, to be more precise ;-))!

@mih mih merged commit c8397d2 into datalad:master Sep 4, 2020
@yarikoptic yarikoptic deleted the opt-addurls-no-metadataloop branch September 4, 2020 15:30