ENH: addurls(..., drop_after=True) to drop content after adding #5081

Merged
merged 55 commits into datalad:master on Nov 10, 2020

Conversation

yarikoptic
Member

We already have a similar option for add_archive_content and in a few places in the
crawler.  It is a much needed feature when crawling/addurl-ing big
datasets, which might not even fit on a local hard drive -- we need to drop
content as soon as it is annexed (well -- later, maybe after we extract
metadata).  For now I am just adding it as a bool, but we might later
specify when to drop it (right after it is added, or after the dataset is saved
and metadata possibly extracted via some hook or explicit operation, etc.).

This PR is based on the #5022 changes, since that is where addurls was RFed
to accommodate parallel execution, and I would prefer not to have to resolve
conflicts later on.

First I added a drop after all files are added, but then decided not to be lazy and RF'ed it to
use drop_key in batch mode right after each file is addurl'ed.  So only the two last commits are relevant; the rest is #4592.
Yet to try it out on a real case.
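
A minimal usage sketch of the option via the Python API (parameter names other than drop_after mirror the existing addurls ones; the exact call form here is an assumption, not what this PR documents):

    import datalad.api as dl

    # add files from a CSV of URLs and drop the annexed content right after
    # each file has been addurl'ed (drop_after is what this PR adds)
    dl.addurls(
        dataset=".",
        urlfile="list.csv",
        urlformat="{url}",
        filenameformat="{filename}",
        drop_after=True,
    )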

…mer tandem

The idea is to feed the consumer with items from the producer as soon as they arrive,
and to update log_progress with the total only when the producer is finished producing
(well -- we could update the total upon each iteration, but I am not sure that is worthwhile
since it would provide an ETA that keeps rising).
In general it "works", but it shows the problem of interference from logs
(and possibly stdout from running procedures) with progress reporting.
Delaying dealing with that for now.
…rogress

Based on threads, it parallelizes execution of the producer (e.g. of paths)
and the consumer (e.g. create) to operate on subdatasets in parallel threads.
Given that the underlying git/git-annex processes also consume CPU, overall
we can escape the single-CPU "performance" bottleneck when operating in parallel
across (sub)datasets.
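
A minimal sketch of that producer/consumer tandem (not the actual ProducerConsumer in datalad/support/parallel.py, just the pattern with standard-library threads; find_subdataset_paths and create_subdataset are hypothetical callables):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_tandem(producer, consumer, jobs=5):
        """Feed the consumer with items as soon as the producer yields them;
        the total (for the progress bar / ETA) is known only once the
        producer is exhausted."""
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            futures = [pool.submit(consumer, item) for item in producer()]
            total = len(futures)  # e.g. update the log_progress total here
            for future in as_completed(futures):
                yield future.result()

    # e.g. create subdatasets in parallel threads; the underlying git/git-annex
    # processes do the heavy lifting, so we are no longer bound to a single CPU
    # results = list(run_tandem(find_subdataset_paths, create_subdataset))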
…ions

that we do not touch the super-dataset until we are done creating sub-datasets, and then
just use a single save within the super-dataset to save all freshly created subdatasets
…s, sleep time will increment upon no activity
…record

And before that -- clear progress bars, and refresh them after.  That allows mixing
regular log output and progress bars without "interference".
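
Roughly the idea, sketched with a stock logging handler and tqdm bars (not the actual datalad/log.py code):

    import logging
    from tqdm import tqdm

    class ProgressAwareHandler(logging.StreamHandler):
        """Clear active progress bars before emitting a log record and redraw
        them afterwards, so log lines and bars do not interleave."""

        def __init__(self, pbars):
            super().__init__()
            self._pbars = pbars          # list of active tqdm instances

        def emit(self, record):
            for pbar in self._pbars:
                pbar.clear()             # temporarily wipe the bar line
            super().emit(record)
            for pbar in self._pbars:
                pbar.refresh()           # redraw the bar below the record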
… for ProducerConsumer

5 is arbitrary and I would have preferred 10 ;)  But it is also not
sensible to base it on the number of cores, since we are just
multi-threading.  Ideally some dynamic scaling is needed.
Unfortunately the solution is "incomplete" due to

datalad#5021

and would trigger test failures
this construct will be useful in other places as well
…dataset

So it is not parallelizing across the entire hierarchy as more
subdatasets become available; it parallelizes only at any given level.
Even that already provides a considerable improvement over serial
installation for collections of datasets such as OpenNeuro or HCP.

Because ATM the "producer" is assumed to just output entries to pass into
the "consumer", and there is no logic to "filter" those records and yield
some message outside, I am collecting all "notneeded" entries and
yielding them in a loop after ProducerConsumer is done.

An alternative (without changing ProducerConsumer) could have been to
move that check into the "consumer", but that, although possibly more
informative, would "taint" the progress indicator ETA, which is
IMHO useful information, so I decided against this approach.
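
Roughly the shape of that workaround (a sketch only; install_in_parallel stands for the ProducerConsumer-style helper and the record fields are illustrative):

    def install_results(subdatasets, install_in_parallel):
        """Yield results from the parallel helper first, and only afterwards
        the 'notneeded' records for already-installed subdatasets, so the
        skipped ones do not taint the progress ETA."""
        notneeded = []

        def producer():
            for sub in subdatasets:
                if sub.get("installed"):   # illustrative check/record field
                    notneeded.append(dict(sub, status="notneeded"))
                else:
                    yield sub

        yield from install_in_parallel(producer())
        yield from notneeded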
Before now, total_announced was a bool, so the total was set only
once the producer finished.  To allow the producer_queue to be
"enriched" by e.g. the consumer, and still have the total updating as more
"jobs" to be consumed come in, we make the total update "constantly".
This way some "tricky" consumer (like the one of install) can
enrich the queue with more jobs as it makes more jobs
possible (e.g. as subdatasets are installed).
As more subdatasets are installed, the "install" consumer will add more
subdatasets to be installed.  Also, in "jobs=auto" (default) mode we
disable parallelization altogether: there could be ssh hosts etc. we
might need to log in to via a user/password prompt, and there is ATM no
centralized locking of any kind.  So --jobs > 1 should be used only
when the user knows that no prompts should appear.

Note: we are still mixing the "jobs" parameter for datalad and for git-annex
get calls.  Yet to decide how/if to decouple them.
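
A rough sketch of such a self-enriching work queue (install_one is a hypothetical callable that installs a dataset and yields newly available subdatasets; not the actual ProducerConsumer code):

    import queue
    import threading

    def install_recursively(top_dataset, install_one, jobs=5):
        """Installing a dataset may reveal more subdatasets, which the consumer
        puts back on the queue, so the total number of jobs (and the progress
        total) keeps growing while we run."""
        todo = queue.Queue()
        todo.put(top_dataset)
        total = [1]                    # running total, e.g. fed to the progress bar

        def worker():
            while True:
                ds = todo.get()
                try:
                    for sub in install_one(ds):
                        todo.put(sub)
                        total[0] += 1  # a real version would guard this with a lock
                finally:
                    todo.task_done()

        for _ in range(jobs):
            threading.Thread(target=worker, daemon=True).start()
        todo.join()                    # returns once every enqueued dataset is done
        return total[0]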

See a comment about passing ds_path -- maybe it is already present in
the record, I was not 100% certain,
so we should pass it explicitly to the consumer "shared" across all levels of recursion;
otherwise it is not "respected".
…at is going on

Sorted the fields in the order of logical 'completion'.
Unfortunately I found no way to actually get it to behave sanely, i.e.
raise that AssertionError and just fail -- it will just get stuck somewhere
if run as a test.
See the header on how to use it -- should be really easy ;)
… for get('.')

Even without parallelization, I think it would be suboptimal to rely on some
particular order of subdatasets returned for get('.').
By "properly" I mean that generator functions will stay generator
functions.

I believe that it Closes datalad#4929.
Not unlike tqdm itself, but it will use our logger; otherwise it is
based off @with_results_progress.
addurls submethods yield all kinds of records -- on create, addurl, metadata.
For progress reporting I think we should just look for addurl, and also match
.total to the number of files we are addurl'ing.

Notes:

- changes to parallel.py came without a dedicated unittest

- for some reason the total progress bar is not updated between datasets'
  progress reporting... From my PoV everything looks kosher and it should have
  worked.
  Since we also have progress bars from the individual _add_urls and _add_metadata
  I think it is "acceptable" overall, but something to figure out later
The initial idea was to just do it while working on a row.  Unfortunately if we do that,
annex starts "ignoring" always_commit=False, so I needed to do a separate loop.

But I think it would still be beneficial to have it absorbed within the same function,
so we could just collect the necessary filename: meta records and not loop through the
list of rows again to discover whether we need to add any meta.

Also, so far there is anecdotal evidence that it resolves the jumping
total progress bar; it seems to advance nicely upon each file.
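
The intended shape, as a sketch (add_url and add_meta are hypothetical placeholders, not the actual addurls helpers):

    def add_urls_then_meta(rows, add_url, add_meta):
        """First pass addurl's each row while collecting filename -> meta
        records; a single separate pass then adds the metadata, instead of
        interleaving it with addurl calls (which made annex stop honoring
        always_commit=False)."""
        meta_by_file = {}
        for row in rows:
            add_url(row)
            if row.get("meta"):
                meta_by_file[row["filename"]] = row["meta"]
        # separate loop, without re-scanning the full list of rows
        for filename, meta in meta_by_file.items():
            add_meta(filename, meta)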
…ot per function

manifests itself in parallel execution, where the same
function would be used in multiple threads.
In our case we want only the addurls ones, but it is also a generally
useful feature for this function, since it is likely that the wrapped
function would yield various types of records.
Since we already have topds, it is safe to proceed to its subdatasets.  That
allows them not to wait for all files to finish being added to it.
I chose this one to mimic max-annex-jobs, although it makes the env variable
quite ugly: DATALAD_RUNTIME_MAX__JOBS
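
For reference, the mapping from config option to environment variable (assuming the usual datalad convention that '.' becomes '_' and '-' becomes '__') is roughly:

    def cfg_to_env(name):
        """E.g. 'datalad.runtime.max-jobs' -> 'DATALAD_RUNTIME_MAX__JOBS'
        ('-' has to become '__' because '_' already stands in for '.')."""
        return name.upper().replace("-", "__").replace(".", "_")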
I was always wondering what this "offlineimap" wants from me when
I press Ctrl-C and it keeps going, asking for more Ctrl-C or to just
wait a little.  At times I was patient and waited...

In our case I think it might be crucial to avoid data loss in other
parallel processes if one process dies -- we should not jump to
interrupt all the rest, but we are also unlikely to want to proceed
forward (although we might decide to add that feature, to become
"friendlier" to on_failure="ignore"-like invocations).

But also, if the user interrupts with Ctrl-C and can wait for the
already running processes to complete, that would be great!  So I
somewhat mimicked the two-stage killing procedure of offlineimap.  Upon the
initial exception (which might come from the producer, the consumer, or the user,
i.e. Ctrl-C) we just purge the "queues" so no new jobs are run,
but we let the already running ones finish.  Upon a subsequent
Ctrl-C (all other exceptions are just logged), we do a
forceful (well, as much as we can) shutdown.
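
A minimal sketch of that two-stage shutdown with concurrent.futures (the actual handling in parallel.py is more involved):

    from concurrent.futures import ThreadPoolExecutor

    def run_with_graceful_death(items, consumer, jobs=5):
        """On the first exception/Ctrl-C, purge not-yet-running jobs but let the
        running ones finish; a second Ctrl-C stops waiting for them."""
        pool = ThreadPoolExecutor(max_workers=jobs)
        futures = [pool.submit(consumer, item) for item in items]
        try:
            return [f.result() for f in futures]
        except BaseException:                 # includes KeyboardInterrupt
            # stage 1: cancel whatever has not started yet (in reverse order)
            canceled = sum(f.cancel() for f in reversed(futures))
            print(f"Canceled {canceled} of {len(futures)} jobs; waiting for the "
                  "running ones. Ctrl-C again for a forceful exit.")
            try:
                pool.shutdown(wait=True)      # let running jobs complete
            except KeyboardInterrupt:
                pool.shutdown(wait=False)     # stage 2: stop waiting
            raise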

I played with this script of mine on top of addurls:
    #!/bin/bash

    export PS4='> '

    set -eu
    cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"
    topd=`pwd`

    echo -e "url,filename,bogus1,bogus2" > list.csv
    mkdir src
    n=${1:-2}
    J=${2:-10}
    (cd src;
    for id in `seq 1 $n`; do
        d=d$id
        mkdir $d
        for s in `seq 1 $n`; do
            f="$d//$s"
            fu="$d/$s"
            # so there is at least something notable for annex to checksum
            #dd if=/dev/random of=$fu count=10000 >& /dev/null
            # super fast one ;)
            echo $s > $fu;
            echo "file:///$topd/src/$fu,$f,pocus,focus" >> ../list.csv;
        done
    done
    )

    set -x
    datalad -c annex.security.allowed-url-schemes=file create dest
    (
    cd dest;
    time datalad -c annex.security.allowed-url-schemes=file --on-failure stop addurls -J$J ../list.csv '{url}' '{filename}'

    )

and I liked how it behaved -- you could see some additional results being
yielded/reported while no new runners were executed, etc.

An additional minor enhancement: the producer can now be a generator function
(one use-case test added).
Also marked gracefull_death to run only where parallel
execution is possible, due to test assumptions.
Running the tests while in a Zoom (or Jitsi) meeting showed the following
failure:

	(git)lena:~datalad/datalad-maint[enh-iter-progress]git
	$> DATALAD_LOG_LEVEL=INFO DATALAD_LOG_TRACEBACK=1 python -m nose -s -v datalad/support/tests/test_parallel.py:test_gracefull_death
	datalad.support.tests.test_parallel.test_gracefull_death ... [WARNING] ...>log:209  Received an exception ValueError() [test_parallel.py:faulty_consumer:157].
	| Canceling not-yet running jobs and waiting for completion of running.
	| You can force earlier forceful exit by Ctrl-C.
	[INFO   ] ...>log:209  Canceled 942 out of 998 jobs. 8 left running.
	[WARNING] ...>log:209  Received an exception ValueError() [test_parallel.py:producer:172].
	| Canceling not-yet running jobs and waiting for completion of running.
	| You can force earlier forceful exit by Ctrl-C.
	[INFO   ] ...>log:209  Canceled 2 out of 8 jobs. 2 left running.
	FAIL
	Versions: annexremote=1.4.3 appdirs=1.4.3 boto=2.49.0 cmd:7z=16.02 cmd:annex=8.20200908+git175-g95d02d6e2-1~ndall+1 cmd:bundled-git=2.24.0 cmd:git=2.24.0 cmd:system-git=2.28.0 cmd:system-ssh=8.1p1 exifread=2.1.2 git=3.1.0 gitdb=4.0.2 humanize=2.3.0 iso8601=0.1.12 keyring=18.0.1 keyrings.alt=3.4.0 msgpack=0.6.2 mutagen=1.40.0 requests=2.23.0 scrapy=1.7.3 wrapt=1.11.2
	Obscure filename: str=b' |;&%b5{}\'"\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 .datc ' repr=' |;&%b5{}\'"ΔЙקم๗あ .datc '
	Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
	Environment: PATH='/home/yoh/proj/datalad/datalad-maint/venvs/dev3/bin:/home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin' LANG='en_US.UTF-8' GIT_PAGER='less --no-init --quit-if-one-screen' GIT_CONFIG_PARAMETERS="'init.defaultBranch=master'"

	======================================================================
	FAIL: datalad.support.tests.test_parallel.test_gracefull_death
	----------------------------------------------------------------------
	Traceback (most recent call last):
	  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
		self.test(*self.arg)
	  File "/home/yoh/proj/datalad/datalad-maint/datalad/tests/utils.py", line 257, in _wrap_skip_if
		return func(*args, **kwargs)
	  File "/home/yoh/proj/datalad/datalad-maint/datalad/support/tests/test_parallel.py", line 203, in test_gracefull_death
		assert_equal(consumed, list(range(len(consumed))))
	AssertionError: Lists differ: [0, 1, 2, 3, 4, 5, 470] != [0, 1, 2, 3, 4, 5, 6]

	First differing element 6:
	470
	6

	- [0, 1, 2, 3, 4, 5, 470]
	?                    ^^^

	+ [0, 1, 2, 3, 4, 5, 6]
	?                    ^

	----------------------------------------------------------------------
	Ran 1 test in 1.422s

	FAILED (failures=1)

So if the system is busy, the executor might still manage to submit some job while
they are being canceled.  If one does get submitted, it had better be an earlier one,
so we just cancel jobs in reverse order of the dict.
…ile heavy logging

discovered by running

DATALAD_TESTS_OBSCURE_PREFIX=- DATALAD_LOG_VMEM=1 DATALAD_LOG_LEVEL=DEBUG DATALAD_LOG_TRACEBACK=1 python -m nose -v datalad/support/tests/test_parallel.py
For some reason it never failed for me when I was providing
msg=None, but it was failing on Travis.  To make it more robust etc.,
just cast a None message into "".
@codecov

codecov bot commented Oct 24, 2020

Codecov Report

Merging #5081 into master will increase coverage by 0.13%.
The diff coverage is 94.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5081      +/-   ##
==========================================
+ Coverage   89.82%   89.95%   +0.13%     
==========================================
  Files         292      297       +5     
  Lines       41163    41797     +634     
==========================================
+ Hits        36973    37597     +624     
- Misses       4190     4200      +10     
Impacted Files Coverage Δ
datalad/interface/common_cfg.py 100.00% <ø> (ø)
datalad/support/annexrepo.py 89.13% <ø> (ø)
datalad/support/collections.py 62.50% <62.50%> (ø)
datalad/support/parallel.py 94.92% <94.92%> (ø)
datalad/support/tests/test_parallel.py 94.96% <94.96%> (ø)
datalad/plugin/addurls.py 99.00% <96.73%> (-0.72%) ⬇️
datalad/log.py 95.51% <97.82%> (+5.37%) ⬆️
datalad/core/distributed/tests/test_push.py 98.01% <100.00%> (+<0.01%) ⬆️
datalad/core/local/save.py 98.82% <100.00%> (+0.08%) ⬆️
datalad/core/local/tests/test_save.py 98.00% <100.00%> (ø)
... and 32 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update af85ad1...582cec4.

@yarikoptic
Member Author

The OSX failure is unrelated; this needs to wait for #5022 to be merged, meanwhile I will test-drive it along.

@yarikoptic yarikoptic added the "do not merge" label Oct 24, 2020
@yarikoptic yarikoptic removed the "do not merge" label Nov 10, 2020
@yarikoptic
Member Author

Eh, I forgot about this PR, only to discover that we had not yet merged it when I tried to use it (from master) :-/ Since the pertinent changes were actually small and were blessed by @kyleam back then, I will just proceed and merge without a fresh round of CI. I will deal with the consequences if some failures come about.

@yarikoptic yarikoptic merged commit 64174d7 into datalad:master Nov 10, 2020
@yarikoptic yarikoptic deleted the enh-addurls-drop branch April 29, 2021 21:56