Skip to content

NF: foreach-dataset --o-s relpath - capture and pass-through prefixing with path to subds #7071

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 8, 2022

Conversation

yarikoptic
Copy link
Member

First commit is just a helper, 2nd is the actual changes with following ATM description:

Motivation for this feature is frequent use of foreach-dataset git grep PATTERN on e.g
captured by con/tinuous logs as I did in e.g. #6848 :

$> datalad foreach-dataset "git grep 'FAILED .*test_nested_pushclone_cycle_allplatforms'  | grep cron || :"
foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022 (dataset)
03/cron/20220603T193647/173dc6f/travis-14100-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022/01 (dataset)
10/cron/20220610T193717/029f839/travis-14136-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
20/cron/20220520T193556/84f979a/travis-14036-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
27/cron/20220527T193600/310fd9f/travis-14055-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022/06 (dataset)
foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022/05 (dataset) 
...

As you could see - the default pass-through + default rendering there is "suboptimal" since

  • main hurdle: paths from grep are relative to that subdataset and not
    readily cut-pasteable since we need to join with subdataset path first
  • pollutes with default renderer 'ok' or not for a given dataset (hence needed
    || : to avoid complains)
  • prints all those (ok) which we do not quite care about -- we just want to see output
    as this hierarchy was a single dataset, so if nothing hit -- we just carry forward.

With this PR and "relpath" mode, we adjust renderer so that output looks like:

(git)smaug:/mnt/datasets/datalad/ci/logs/2022[master]git
$> datalad foreach-dataset --o-s relpath "git grep 'FAILED .*test_nested_pushclone_cycle_allplatforms'  | grep cron "
06/03/cron/20220603T193647/173dc6f/travis-14100-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
06/10/cron/20220610T193717/029f839/travis-14136-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
05/20/cron/20220520T193556/84f979a/travis-14036-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
05/27/cron/20220527T193600/310fd9f/travis-14055-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
07/15/cron/20220715T194006/52e822a/travis-14276-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms

pros:

  • readily usable path which melded relative path to the subdataset
    with output of grep

  • since we ignore status -- no errors reported even though I had no ||:

  • no useless ok's

  • with "smart" treatment of either "dataset" argument is provided or not
    IMHO reported relative paths are quite neat despite C1 (below)

  • I thought it could be sufficiently closely matched simply via
    datalad -f '{path}{stdout}' foreach-dataset --o-s capture but "not":

      $> datalad -f '{path}/{stdout}' foreach-dataset --o-s capture "git grep 'FAILED .*test_nested_pushclone_cycle_allplatforms'  | grep cron ||:"
      /mnt/datasets/datalad/ci/logs/2022/
      /mnt/datasets/datalad/ci/logs/2022/01/
      /mnt/datasets/datalad/ci/logs/2022/06/03/cron/20220603T193647/173dc6f/travis-14100-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
      10/cron/20220610T193717/029f839/travis-14136-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
    
      /mnt/datasets/datalad/ci/logs/2022/05/20/cron/20220520T193556/84f979a/travis-14036-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
      27/cron/20220527T193600/310fd9f/travis-14055-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
    
      /mnt/datasets/datalad/ci/logs/2022/04/
    

    since we would get bunch of empty lines, those paths for datasets without hits etc.
    So - decided that it is worth adding this new mode

cons:

  • C1: overfitting "grep" use case since we prefix with actual path and without
    any separation (like via :). So may be we should name it grep instead of relpath.
  • C2: someone might want full path! But hey -- there is a reason why I named it "relpath"

both cons somewhat point to may be choosing another, better describing name. Please suggest one!

Note: "relpath" idea in output rendering was more generally excercised but
never finalized in other PRs

and argued in #3994 to be "less useful" which I would disagree.

NB code is based off maint but positioning against master since a new feature, it would be a sin to propose to maint.

@yarikoptic yarikoptic added the semver-minor Increment the minor version when merged label Oct 7, 2022
Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I understand the problem and hence the desire for that kind of result rendering.

But I'm a bit confused by the appraoch. Why entangling a result rendering option with the output stream option? This kinda destroys a basic concept: That a datalad option like -f should not interfere with actual command behavior. But now it does. If I specify -f, I overrule --output-stream which is far from obvious. It also means, that if I disable result rendering altogether, I don't get any output "passed-through". The doc declaring that relpath essentially is like pass-through suggests otherwise and is technically a lie. It's capture+prefix+spit out again (as long as one doesn't specify a result renderer). That's a different concept.
I feel this is an abuse of the result renderer concept (including writing to sys.stdout/err directly instead of ui). And I agree - it's an overfit to grep.

Not happy with it, but also not instantly clear to me, what's the better alternative, though. I think what you really want is disable result rendering and read+prefix+write the actual output pipes as things are coming in. That would suggest a runner protocol for it (but another construct for python callables, I guess). Hm.

@yarikoptic
Copy link
Member Author

An interesting analysis/concern @bpoldrack. But even with --o-s relpath we would still have -f working as expected, thus overloading any custom_result_renderer and that is our design. And wouldn't your concern then apply to any other interface with custom_result_renderer (14 of those?), especially the one which uses kwargs to tune its output?

But I am open for any pragmatic suggestion on how to tune it up. As I have mentioned in the original description I was even considering smth like -f {path}{stdout} but I have mentioned its shortcoming. I guess we could have come up with some -f {stdout.prefix_eachline(relpath)} or smth like that, and I do want to get back to those issues I have mentioned and add/use relpath to any record, but I feel that due to need to process captured data in specific way (one per line), I still think proposed somewhat ad-hoc way might be the best way forward.

@yarikoptic yarikoptic force-pushed the enh-foreach-relpath branch from 8819c57 to 200159f Compare October 8, 2022 20:13
@codecov
Copy link

codecov bot commented Oct 8, 2022

Codecov Report

Base: 87.75% // Head: 90.81% // Increases project coverage by +3.06% 🎉

Coverage data is based on head (5b22b4d) compared to base (8abe1d0).
Patch coverage: 95.45% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7071      +/-   ##
==========================================
+ Coverage   87.75%   90.81%   +3.06%     
==========================================
  Files         325      325              
  Lines       44074    44142      +68     
  Branches        0     5862    +5862     
==========================================
+ Hits        38675    40089    +1414     
+ Misses       5399     4038    -1361     
- Partials        0       15      +15     
Impacted Files Coverage Δ
datalad/local/foreach_dataset.py 81.57% <90.32%> (+2.54%) ⬆️
datalad/local/tests/test_foreach_dataset.py 100.00% <100.00%> (ø)
datalad/log.py 88.71% <100.00%> (+6.64%) ⬆️
datalad/ui/dialog.py 93.01% <100.00%> (-0.04%) ⬇️
datalad/ui/tests/test_dialog.py 99.12% <100.00%> (ø)
datalad/customremotes/tests/test_datalad.py 95.34% <0.00%> (-2.28%) ⬇️
datalad/cli/common_args.py 100.00% <0.00%> (ø)
datalad/tests/test_dochelpers.py 100.00% <0.00%> (ø)
datalad/support/tests/test_network.py 100.00% <0.00%> (ø)
datalad/local/tests/test_add_readme.py 100.00% <0.00%> (ø)
... and 74 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@bpoldrack
Copy link
Member

On second thought, @yarikoptic:

I think, what we need to do instead is properly enabling result filtering. What's missing from your -f {path}{stdout} approach is to only use results that actually have a stdout, I think. Which brings up #5581 WRT result filtering.
That would seem to generalize much better. WDYT?

I would like to not conflate the "output stream of a command" with its rendered results. Those are two different concepts. While rendered results can be part of the output stream of a command, that's not generally true and an output stream of a command can contain things that are not rendered results (as is the case with --o-s pass-through).

Another aspect of this is the writing to stdout directly from within the result renderer. Not going through ui, disables other interfaces like datalad-gooey, where one would probably want to have the same result rendering (ready for c&p or drag and drop). Ignoring other ui backends would sabotage that.

@yarikoptic
Copy link
Member Author

I think, what we need to do instead is properly enabling result filtering. What's missing from your -f {path}{stdout} approach is to only use results that actually have a stdout, I think. Which brings up #5581 WRT result filtering. That would seem to generalize much better. WDYT?

what about needing to prefix each line in stdout -- how would we instruct that?

Another aspect of this is the writing to stdout directly from within the result renderer. Not going through ui, disables other interfaces like datalad-gooey, where one would probably want to have the same result rendering (ready for c&p or drag and drop). Ignoring other ui backends would sabotage that.

We would need to differ stdout and stderr. In ui we have message and error but using .error would add undesired prefix ERROR. I guess we could make prefix configurable and then make it STDERR although also prefix would be displayed after, since I aim for stdout -- ok, with me.

On the other hand, the fact that everything centralizes via ui is good, but I bet gooey etc should still capture/report somehow stdout/stderr somehow to not swallow the output.

@yarikoptic
Copy link
Member Author

note: rebased on top of master (was on maint) since diff was poisoned with other maint diffs

@bpoldrack
Copy link
Member

bpoldrack commented Oct 20, 2022

Registering an idea here (no time yet to properly test it). I think, something like this may be the solution:

class PrefixStdout(WitlessProtocol):

    proc_out = True

    def __init__(self, prefix, done_future=None, encoding=None):
        super().__init__(done_future, encoding)
        self._prefix = prefix

    def pipe_data_received(self, fd, data):
        for line in data.splitlines():
            os.write(1, self._prefix + line)

and make the new parameter (grep-relpath or whatever) use this Protocol class. Disable result rendering altogether when you don't want results to pollute stdout. If this works, it would keep result rendering and "output stream" separate concepts, which I'd much prefer.
What I'm not sure about is, whether an output line of the subprocess could end up in several calls to pipe_data_received and how to deal with that, if that's the case. Any insights on this aspect, @christian-monch?

WDYT, @yarikoptic?

@yarikoptic
Copy link
Member Author

I like the idea, thank you @bpoldrack . My concerns are

  • forgot what I did re python commands, your solution covers only the externally executed commands
  • the same concern as yours about partial data, so might need some internal accumulation if data comes without trailing newline... I will wait for analysis by @christian-monch too ;)
  • should write to the fd I guess not to 1 all the time

… around context

adjust test to patch correct/used logger. Which logger it is I do not think
matters for anywhere but the test.
…g with path to subds

Motivation for this feature is frequent use of  `foreach-dataset  git grep PATTERN` on e.g
captured by con/tinuous logs as I did in e.g. datalad#6848 :

	$> datalad foreach-dataset "git grep 'FAILED .*test_nested_pushclone_cycle_allplatforms'  | grep cron || :"
	foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022 (dataset)
	03/cron/20220603T193647/173dc6f/travis-14100-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022/01 (dataset)
	10/cron/20220610T193717/029f839/travis-14136-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	20/cron/20220520T193556/84f979a/travis-14036-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	27/cron/20220527T193600/310fd9f/travis-14055-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022/06 (dataset)
	foreach-dataset(ok): /mnt/datasets/datalad/ci/logs/2022/05 (dataset)
	...

As you could see - the default pass-through + default rendering there is "suboptimal" since

- main hurdle: paths from `grep` are relative to that subdataset and not
  readily cut-pasteable since we need to join with subdataset path first
- pollutes with default renderer 'ok' or not for  a given dataset (hence needed
  || : to avoid complains)
- prints all those (ok) which we do not quite care about -- we just want to see output
  as this hierarchy was a single dataset, so if nothing hit -- we just carry forward.

With this PR and "relpath" mode, we adjust renderer so that output looks like:

	(git)smaug:/mnt/datasets/datalad/ci/logs/2022[master]git
	$> datalad foreach-dataset --o-s relpath "git grep 'FAILED .*test_nested_pushclone_cycle_allplatforms'  | grep cron "
	06/03/cron/20220603T193647/173dc6f/travis-14100-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	06/10/cron/20220610T193717/029f839/travis-14136-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	05/20/cron/20220520T193556/84f979a/travis-14036-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	05/27/cron/20220527T193600/310fd9f/travis-14055-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
	07/15/cron/20220715T194006/52e822a/travis-14276-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms

pros:

- readily usable path which melded relative path to the subdataset
  with output of  grep
- since we ignore status -- no errors reported even though I had no ||:
- no useless ok's
- with "smart" treatment of either "dataset" argument is provided or not
  IMHO reported relative paths are quite neat despite C1 (below)
- I thought it could be sufficiently closely matched simply via
  `datalad -f '{path}{stdout}' foreach-dataset --o-s capture` but "not":

		$> datalad -f '{path}/{stdout}' foreach-dataset --o-s capture "git grep 'FAILED .*test_nested_pushclone_cycle_allplatforms'  | grep cron ||:"
		/mnt/datasets/datalad/ci/logs/2022/
		/mnt/datasets/datalad/ci/logs/2022/01/
		/mnt/datasets/datalad/ci/logs/2022/06/03/cron/20220603T193647/173dc6f/travis-14100-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
		10/cron/20220610T193717/029f839/travis-14136-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms

		/mnt/datasets/datalad/ci/logs/2022/05/20/cron/20220520T193556/84f979a/travis-14036-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms
		27/cron/20220527T193600/310fd9f/travis-14055-failed/14.txt:FAILED ../core/distributed/tests/test_push.py::test_nested_pushclone_cycle_allplatforms

		/mnt/datasets/datalad/ci/logs/2022/04/

  since we would get bunch of empty lines, those paths for datasets without hits etc.
  So - decided that it is worth adding this new mode

cons:

- C1: overfitting "grep" use case since we prefix with actual path and without
  any separation (like via :).  So may be we should name it `grep` instead of `relpath`.
- C2: someone might want full path! But hey -- there is a reason why I named it "relpath"

both cons somewhat point to may be choosing another, better describing name.

Note: "relpath" idea in output rendering was more generally excercised but
never finalized in other PRs

- datalad#2679
- datalad#4454

and argued in datalad#3994 to be "less
useful" which I would disagree.
@yarikoptic
Copy link
Member Author

I liked the idea of PrefixStdout (despite that would need to also add -f disabled to datalad invocation), and started to implement but then:

  • forgot what I did re python commands, your solution covers only the externally executed commands

well, we just eval or exec them, so I do not think we have any direct control over stdout/err AFAIK and would need to monkey-patch sys.stdout/err and hope that those were not bound in that underlying function first , i.e. any monkey patching would work.
We could then make it also non-python only, but that would kind suck IMHO.

With that it begs to question if it is a worthwhile approach? ATM I have been using this PR version successfully for my use case of invoking git grep, which is an external command. So, may be give it another consideration?

@bpoldrack
Copy link
Member

@yarikoptic
Yeah, I don't have an all-including solution either. So, let's go with it. However, may be try to rephrase the help at least? It's described as a prefixed pass-through, but that's not what it is. It's a prefixed capture + dedicated rendering to spit it out again (after execution in a ds finished).

@codeclimate
Copy link

codeclimate bot commented Nov 3, 2022

Code Climate has analyzed commit 5b22b4d and detected 7 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 4
Security 3

View more on Code Climate.

@bpoldrack
Copy link
Member

Ok, I linked the failing archive test in the issue - don't think it's related. Let's proceed then. Thx, @yarikoptic!

@bpoldrack bpoldrack merged commit 79f7a83 into datalad:master Nov 8, 2022
@yarikoptic-gitmate
Copy link
Collaborator

PR released in 0.18.0

@yarikoptic yarikoptic deleted the enh-foreach-relpath branch January 20, 2023 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants