
Do not decode fetched content when retrieving status #3210

Merged
kyleam merged 1 commit into datalad:0.11.x from bf-annexificator on Mar 9, 2019

Conversation

@driusan (Collaborator) commented Mar 7, 2019

When annexificator is used in a datalad-crawler pipeline, it attempts
to call get_status on the URL to see if the file has changed. get_status
attempts to use _fetch, which attempts to decode the content under Python 3.

This is invalid for content that isn't utf-8, and get_status was only
trying to verify that it wasn't receiving a login page, a failure page, or
similar. _fetch, confusingly, converts the UnicodeDecodeError into a generic
DownloadError (even though the download succeeded and only the conversion to
a string failed).

This teaches _fetch a "decode" parameter that toggles whether it should
attempt to decode at all. In the case of get_status, the content is discarded,
so decoding serves no purpose as far as I can tell.

NB. fetch (without the underscore) has documentation claiming that it doesn't
decode, but it calls _fetch (which currently always decodes). The new parameter
can eventually be used to fix that, but currently, updating fetch to not decode
as documented results in a json.loads error from a mysterious part of the code.
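
A minimal sketch of the idea (simplified, not the exact datalad code; this _fetch takes raw bytes directly instead of performing the download):

class DownloadError(Exception):
    """Stand-in for datalad.support.exceptions.DownloadError."""

def _fetch(raw_content, decode=True):
    # size must be measured on the encoded bytes, before any decoding
    downloaded_size = len(raw_content)
    content = raw_content
    if decode:
        try:
            content = raw_content.decode()
        except UnicodeDecodeError as e:
            # current behaviour: the decode failure surfaces as a generic DownloadError
            raise DownloadError(str(e))
    return content, downloaded_size

# get_status-style caller: the content is discarded, so skip decoding entirely
_, size = _fetch(b"\xa0\xff binary payload", decode=False)
print(size)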

@driusan driusan changed the base branch from master to 0.11.x Mar 7, 2019
@driusan driusan changed the base branch from 0.11.x to master Mar 7, 2019
@driusan (Collaborator, Author) commented Mar 7, 2019

This is trying to fix a crawler that I'm writing here: https://github.com/driusan/datalad-crawler/blob/enh-loriscrawler/datalad_crawler/pipelines/loris.py. It only gets triggered under Python 3, but if my understanding of the code is right, it should be triggered in any datalad-crawler pipeline where a generator directly yields a URL to an annexificator node under Python 3 and the URL contains binary data.

@kyleam (Contributor) left a comment

Thanks for the PR and the detailed analysis. So this essentially avoids something like

% python -c "from datalad.downloaders.http import HTTPDownloader; HTTPDownloader().fetch('https://google.com')"
[INFO   ] Fetching 'https://google.com'
[ERROR  ] Failed to fetch https://google.com: 'utf-8' codec can't decode byte 0xa0 in position 11216: invalid start byte [base.py:_fetch:545]
Traceback (most recent call last):
  File "/home/kyle/src/python/datalad/datalad/downloaders/base.py", line 545, in _fetch
    content = content.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 11216: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/kyle/src/python/datalad/datalad/downloaders/base.py", line 580, in fetch
    out = self.access(self._fetch, url, **kwargs)
  File "/home/kyle/src/python/datalad/datalad/downloaders/base.py", line 145, in access
    result = method(url, **kwargs)
  File "/home/kyle/src/python/datalad/datalad/downloaders/base.py", line 555, in _fetch
    raise DownloadError(exc_str(e, limit=8))  # for now
datalad.support.exceptions.DownloadError: 'utf-8' codec can't decode byte 0xa0 in position 11216: invalid start byte [base.py:_fetch:545]
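
The failure is easy to reproduce outside datalad: 0xa0 is a valid byte in single-byte encodings such as Latin-1, but it can never start a UTF-8 sequence.

try:
    b"\xa0".decode()  # utf-8 by default
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

print(b"\xa0".decode("latin-1"))  # a single-byte codec decodes it fine (non-breaking space)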

As an immediate fix, I'm OK with the patch (we should change the base to 0.11.x though). But

NB. fetch (without the underscore) has documentation claiming that it doesn't
decode, but it calls _fetch (which currently always decodes).

So fetch is clear about returning the content as bytes. I don't see why py3 should deviate from this, and no motivation for this change is given in 43eb31b (RF: return downloaded content undecoded so size could be estimated. decode in fetch, 2016-07-19). (I also didn't spot anything in the associated PR, #627.) So it might be worth removing that code to see if anything breaks, at least at the test level:

diff --git a/datalad/downloaders/base.py b/datalad/downloaders/base.py
index eafe16a70..dad2bd583 100644
--- a/datalad/downloaders/base.py
+++ b/datalad/downloaders/base.py
@@ -540,11 +540,6 @@ def _fetch(self, url, cache=None, size=None, allow_redirects=True):
             #pbar.finish()
             downloaded_size = len(content)

-            # now that we know size based on encoded content, let's decode into string type
-            if PY3 and isinstance(content, binary_type):
-                content = content.decode()
-            # downloaded_size = os.stat(temp_filepath).st_size
-
             self._verify_download(url, downloaded_size, target_size, None, content=content)

         except (AccessDeniedError, IncompleteDownloadError) as e:

kyleam added a commit that referenced this issue Mar 8, 2019
@kyleam (Contributor) commented Mar 8, 2019

So it might be worth removing that code to see if anything breaks

Let's see what happens with https://travis-ci.org/datalad/datalad/builds/503692960

@kyleam (Contributor) commented Mar 8, 2019

OK, there is one failing test when we drop the py3-specific decoding:

======================================================================
FAIL: datalad.downloaders.tests.test_http.test_HTTPDownloader_basic
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/tests/utils.py", line 442, in newfunc
    return t(*(arg + (d,)), **kw)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/tests/utils.py", line 554, in newfunc
    return tfunc(*(args + (path, url)), **kwargs)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/downloaders/tests/test_http.py", line 121, in test_HTTPDownloader_basic
    assert_equal(downloader.fetch(furl), 'abc')
AssertionError: b'abc' != 'abc'

IMO the test is wrong because on Python 3 it's asserting a unicode result despite what is documented.
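
A self-contained illustration of the fix direction (a hypothetical stand-in, not the real test suite):

def fetch(url):
    return b"abc"  # stands in for HTTPDownloader.fetch returning undecoded bytes

assert fetch("http://example.com/file") == b"abc"  # compare bytes to bytes
assert fetch("http://example.com/file") != "abc"   # b'abc' != 'abc' on Python 3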

Core's crawler test run isn't too informative because only a small subset of the tests is executed (I'm guessing there's some issue with how the nose selector is specified; I haven't looked). But if I run the py3 crawler tests locally, I see one failure:

======================================================================
ERROR: datalad_crawler.pipelines.tests.test_crcns.test_get_metadata
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kyle/src/python/venvs/datalad-crawler/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/kyle/src/python/datalad/datalad/tests/utils.py", line 849, in newfunc
    return func(*args, **kwargs)
  File "/home/kyle/src/python/datalad-crawler/datalad_crawler/pipelines/tests/test_crcns.py", line 29, in test_get_metadata
    all_meta = get_metadata()
  File "/home/kyle/src/python/datalad-crawler/datalad_crawler/pipelines/crcns.py", line 69, in get_metadata
    rj = fetch_datacite_metadata()
  File "/home/kyle/src/python/datalad-crawler/datalad_crawler/pipelines/crcns.py", line 42, in fetch_datacite_metadata
    return json.loads(text)
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'

We should fix this caller in crcns, but so that we don't have compatibility issues between the core and the extension, I think we should also update get_cached_url_content, whose only caller is crawler's crcns, to return unicode.

So, I'll plan to move things in this direction (a sketch of the decoding fixes follows this list):

  • drop py3 decoding in _fetch
  • make crawler's crcns decode for json.loads
  • make get_cached_url_content return unicode
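
A hedged sketch of the last two items, assuming fetch now returns bytes; the helpers below are simplified stand-ins for the real crawler and core functions:

import json

def get_cached_url_content(fetch_bytes):
    # ensure the caching helper hands back unicode text
    content = fetch_bytes()
    if isinstance(content, bytes):
        content = content.decode("utf-8")
    return content

def fetch_datacite_metadata(fetch_bytes):
    # json.loads on Python 3.5 rejects bytes, so decode before parsing
    return json.loads(get_cached_url_content(fetch_bytes))

print(fetch_datacite_metadata(lambda: b'{"title": "crcns dataset"}'))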

@yarikoptic (Member) commented Mar 8, 2019

Some crawler tests might not run because they require credentials (for aws or crcns). I will share the crcns ones privately later on, just in case.

@yarikoptic (Member) commented Mar 8, 2019

I think the AppVeyor failure is unrelated.

@driusan (Collaborator, Author) commented Mar 8, 2019

@kyleam Yes, those are the exact errors I'm trying to work around. I can't catch the UnicodeDecodeError in the caller because fetch converts it to a DownloadError; I can't catch DownloadError in the caller because it could be a real download error; and I didn't want fetch to stop converting the UnicodeDecodeError to a DownloadError because I wasn't sure how many places depend on the current behaviour. So I just added a parameter to avoid it.
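
To illustrate the ambiguity, a self-contained sketch (not datalad's actual code): once the UnicodeDecodeError is re-raised as a DownloadError, the caller cannot tell an encoding problem apart from a genuine download failure.

class DownloadError(Exception):
    """Stand-in for datalad.support.exceptions.DownloadError."""

def fetch(content):
    try:
        return content.decode()
    except UnicodeDecodeError as e:
        # the download itself succeeded, but the caller only ever sees DownloadError
        raise DownloadError(str(e))

try:
    fetch(b"\xa0\xff")  # perfectly good binary content
except DownloadError as e:
    print("looks like a failed download, but was not:", e)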

I'm surprised only one test fails when you remove it. When I tried to pass decode=False from fetch (so that it works as documented), I was getting type errors similar to the one you're seeing in datalad_crawler.pipelines.tests.test_crcns.test_get_metadata from pretty much everywhere.

@kyleam (Contributor) commented Mar 8, 2019

I'm surprised only one test fails when you remove it. When I tried to pass decode=False from fetch (so that it works as documented), I was getting type errors similar to the one you're seeing in datalad_crawler.pipelines.tests.test_crcns.test_get_metadata from pretty much everywhere.

Eh good to know. I might be doing something silly on my end.

@driusan (Collaborator, Author) commented Mar 8, 2019

It's the same json.loads error that I was seeing, so it's possible that there is only one root cause and I was just seeing it from my own crawler when I tried it.

@kyleam kyleam force-pushed the bf-annexificator branch from 7b13d18 to 7b4ea52 Mar 8, 2019
@codecov (bot) commented Mar 8, 2019

Codecov Report

Merging #3210 into master will decrease coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #3210      +/-   ##
==========================================
- Coverage    90.6%   90.58%   -0.03%     
==========================================
  Files         249      249              
  Lines       32863    32863              
==========================================
- Hits        29775    29768       -7     
- Misses       3088     3095       +7
Impacted Files Coverage Δ
datalad/downloaders/base.py 80.72% <100%> (ø) ⬆️
datalad/downloaders/http.py 82.53% <0%> (-2.78%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

1 similar comment

@kyleam kyleam changed the base branch from master to 0.11.x Mar 8, 2019
@kyleam (Contributor) commented Mar 8, 2019

OK, given that this is a targeted fix to an internal helper function and that the commit message explains the outstanding issues with fetch, I'd say we go ahead with it. I'll have a look at fixing fetch (and hopefully #3212 will make the crawler run more informative regarding the effects of the change).

@kyleam kyleam merged commit 7b4ea52 into datalad:0.11.x Mar 9, 2019
1 of 3 checks passed
@yarikoptic yarikoptic added this to the Release 0.11.4 milestone Mar 14, 2019
yarikoptic added a commit that referenced this issue Apr 6, 2019
    ## 0.11.4 (Mar 18, 2019) -- get-ready

    Largely a bug fix release with a few enhancements

    ### Important

    - 0.11.x series will be the last one with support for direct mode of [git-annex][]
      which is used on crippled (no symlinks and no locking) filesystems.
      v7 repositories should be used instead.

    ### Fixes

    - Extraction of .gz files is broken without p7zip installed.  We now
      abort with an informative error in this situation.  ([#3176][])

    - Committing failed in some cases because we didn't ensure that the
      path passed to `git read-tree --index-output=...` resided on the
      same filesystem as the repository.  ([#3181][])

    - Some pointless warnings during metadata aggregation have been
      eliminated.  ([#3186][])

    - With Python 3 the LORIS token authenticator did not properly decode
      a response ([#3205][]).

    - With Python 3 downloaders unnecessarily decoded the response when
      getting the status, leading to an encoding error.  ([#3210][])

    - In some cases, our internal command Runner did not adjust the
      environment's `PWD` to match the current working directory specified
      with the `cwd` parameter.  ([#3215][])

    - The specification of the pyliblzma dependency was broken.  ([#3220][])

    - [search] displayed an uninformative blank log message in some
      cases.  ([#3222][])

    - The logic for finding the location of the aggregate metadata DB
      anchored the search path incorrectly, leading to a spurious warning.
      ([#3241][])

    - Some progress bars were still displayed when stdout and stderr were
      not attached to a tty.  ([#3281][])

    - Check for stdin/out/err to not be closed before checking for `.isatty`.
      ([#3268][])

    ### Enhancements and new features

    - Creating a new repository now aborts if any of the files in the
      directory are tracked by a repository in a parent directory.
      ([#3211][])

    - [run] learned to replace the `{tmpdir}` placeholder in commands with
      a temporary directory.  ([#3223][])

    - [duecredit][] support has been added for citing DataLad itself as
      well as datasets that an analysis uses.  ([#3184][])

    - The `eval_results` interface helper unintentionally modified one of
      its arguments.  ([#3249][])

    - A few DataLad constants have been added, changed, or renamed ([#3250][]):
      - `HANDLE_META_DIR` is now `DATALAD_DOTDIR`.  The old name should be
         considered deprecated.
      - `METADATA_DIR` now refers to `DATALAD_DOTDIR/metadata` rather than
        `DATALAD_DOTDIR/meta` (which is still available as
        `OLDMETADATA_DIR`).
      - The new `DATASET_METADATA_FILE` refers to `METADATA_DIR/dataset.json`.
      - The new `DATASET_CONFIG_FILE` refers to `DATALAD_DOTDIR/config`.
      - `METADATA_FILENAME` has been renamed to `OLDMETADATA_FILENAME`.

* tag '0.11.4': (82 commits)
  Updated CHANGELOG.md for having merged check for not being closed
  [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.4 entries
  CHANGELOG.md: Update 0.11.4 entries
  CHANGELOG.md: Adjust the format of a link
  BF: split lines on spaces and commas befoe doing sed capture of #issue
  RF: adjust tools/link_issues_CHANGELOG to have MD links as [#issue][]
  BF+DOC: Fix all links to mark them valid markdown
  BF: Fix markup bug that prevents sphinx-build from succeeding
  DOC: extend coumentation of is_interactive
  BF: guard .istty with try/except
  ENH: ui: Drop progress bars if not attached to tty
  ENH: ui: Add another name for SilentConsoleLog
  BF: check for stdin/out/err to not be closed before checking for .isatty
  RF: HANDLE_META_DIR -> DATALAD_DOTDIR (but keeping an alias for compatibility)
  RF: define/use consts DATASET_CONFIG_FILE, DATASET_METADATA_FILE, METADATA_DIR
  RF: METADATA_DIR/FILE -> OLDMETADATA_DIR/FILE
  BF: do not cause a side-effect on kwargs in @eval_results
  BF: join with ds.path when trying to see if any other metadata file is available
  RF: Move duecredit into its own section so it is not a part of install_requires
  [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.4 entries
  ...
yarikoptic added a commit to yarikoptic/datalad that referenced this issue Apr 6, 2019
yarikoptic added a commit to yarikoptic/datalad that referenced this issue May 15, 2020
In datalad#3210 (specifically
622cada) we stopped decoding
fetched content while checking the status.  It was initially triggered
within datalad-crawler pipelines, and there is further discussion in
datalad/datalad-crawler#9 .  So tests should
compare to bytes, not strings, when not explicitly decoding.

The situation was made worse by the entire test module being disabled
because httpretty at the time did not support Python 3, which left
these lines not exercised at all.  This PR series includes one more
fix addressing the switch from string to bytes for content and also
re-enables testing with httpretty.