Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: cmd: Don't leave unread lines on stream when logging online #3656

Merged
merged 1 commit into from
Sep 10, 2019

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Sep 9, 2019

Now that we have a branch that drops Python 2, I've been looking into using asyncio in the runner. Much of that so far has been trying to wrap my head around how the different parameters of Runner.run() interact. While doing this, I realized that some output lines were being lost.


When the Runner is called with log_online=True and we're logging
either stdout or stderr, the readline() call for the stream happens
under a `while proc.poll() is None`.  After that block, we may have
uncaptured output, either because the process was already finished by
the time we got to the proc.poll() condition or because each iteration
reads only a single line.  To collect this output, communicate() is
called.

However, communicate() may return an empty string even when there is
uncaptured output and appears to reliably do so if

  (1) we've called readline() on the stream
  (2) both stdout and stderr are set to subprocess.PIPE
  (3) we're running under Python 3

The second condition likely involves the fact that subprocess.py has
an optimization to avoid using threads if the number of pipes is below
two.

For example, try

    import subprocess as sp
    p = sp.Popen(["printf", "one\ntwo\n"], stdout=sp.PIPE, stderr=sp.PIPE)
    print(p.stdout.readline())
    print(p.communicate())

"two\n" is not included in the output returned by communicate().

To work around this, gather these remaining lines with a direct read()
call rather than relying on communicate() when we're using a pipe for
bother stdout and stderr.

@kyleam kyleam changed the base branch from master to 0.11.x September 9, 2019 20:31
When the Runner is called with log_online=True and we're logging
either stdout or stderr, the readline() call for the stream happens
under a `while proc.poll() is None`.  After that block, we may have
uncaptured output, either because the process was already finished by
the time we got to the proc.poll() condition or because each iteration
reads only a single line.  To collect this output, communicate() is
called.

However, communicate() may return an empty string even when there is
uncaptured output and appears to reliably do so if

  (1) we've called readline() on the stream
  (2) both stdout and stderr are set to subprocess.PIPE
  (3) we're running under Python 3

The second condition likely involves the fact that subprocess.py has
an optimization to avoid using threads if the number of pipes is below
two.

For example, try

    import subprocess as sp
    p = sp.Popen(["printf", "one\ntwo\n"], stdout=sp.PIPE, stderr=sp.PIPE)
    print(p.stdout.readline())
    print(p.communicate())

"two\n" is not included in the output returned by communicate().

To work around this, gather these remaining lines with a direct read()
call rather than relying on communicate() when we're using a pipe for
bother stdout and stderr.
@yarikoptic
Copy link
Member

thank you @kyleam !!!!!

@yarikoptic
Copy link
Member

all green -- good.

@yarikoptic yarikoptic merged commit 5926b24 into datalad:0.11.x Sep 10, 2019
@kyleam kyleam deleted the cmd-missed-lines branch September 10, 2019 11:42
yarikoptic added a commit that referenced this pull request Oct 14, 2019
0.11.8 (Oct 11, 2019) -- annex-we-are-catching-up

Fixes

- Our internal command runner failed to capture output in some cases.
  ([#3656][])
- Workaround in the tests around python in cPython >= 3.7.5 ';' in
  the filename confusing mimetypes ([#3769][]) ([#3770][])

Enhancements and new features

- Prepared for upstream changes in git-annex, including support for
  the latest git-annex
  - 7.20190912 auto-upgrades v5 repositories to v7.  ([#3648][]) ([#3682][])
  - 7.20191009 fixed treatment of (larger/smaller)than in .gitattributes ([#3765][])

- The `cfg_text2git` procedure, as well the `--text-no-annex` option
  of [create][], now configure .gitattributes so that empty files are
  stored in git rather than annex.  ([#3667][])

* tag '0.11.8': (27 commits)
  DOC: add CHANGELOG entry about mimetypes workaround, and regenerate changelog.rst
  RF: reuse fn*obscure* variables from test_archives for testing archives custom remote
  BF(TST,workaround): do not use ; in the test archive filenames
  Finalize changelog and boost version
  DOC: Adjust CHANGELOG for the fix of test
  RF(TST): use 'willgetshort' name to correctly reflect file behavior
  BF(TST): reflect the fact that since 7.20191009 file would jump from annex to git based on current size
  CHANGELOG.md: Add entry for gh-3667
  CHANGELOG.md: First batch for 0.11.8
  RF: simplify the expression for largefiles based on size
  ENH: exit with dedicated 99 exit code if installed annex is newer than -devel
  TST: known_failure_v6_or_later: Consider whether v5 is supported by git-annex
  BF(v7): gitrepo: Avoid adding files to annex
  BF: 3rdparty_analysis_workflow: Make example compatible with v6+
  ENH: annexrepo: Give more informative assertion error
  BF: annexrepo: Skip empty lines when expecting one output line
  TST: create: Adjust --text-no-annex test for aa6b8dc
  ENH: add file size rule to --text-no-annex
  TST: basic test for empty files in text2git ds
  ENH: exclude empty files from being annexed after text2git
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants