Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow sample to be False when reading CSV from a remote URL. #4634

Merged
merged 5 commits into from Apr 5, 2019

Conversation

Projects
None yet
3 participants
@ian-r-rose
Copy link
Contributor

commented Mar 26, 2019

This allows blocksize=None, sample=False to be passed into dd.read_csv, which allows the user to load a CSV from a remote URL in one go. This is useful if the remote doesn't support HEAD or range requests.

Fixes #4633

  • Tests added / passed
  • Passes flake8 dask
@martindurant
Copy link
Member

left a comment

Thanks for the quick attention, @ian-r-rose !

@@ -110,7 +110,7 @@ def __init__(self, url, session=None, block_size=None, **kwargs):
try:
self.size = file_size(url, self.session, allow_redirects=True,
**self.kwargs)
except ValueError:
except (ValueError, requests.HTTPError):

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 26, 2019

Member

The second exception is what a failed HEAD gives (methodnotallowed or some such)? Do you think it would be useful to give more verbose errors within HTTPFile for when things go wrong with naive usage?

This comment has been minimized.

Copy link
@ian-r-rose

ian-r-rose Mar 26, 2019

Author Contributor

Yes, this is the error thrown by the failed HEAD. I can certainly try to give a more descriptive error here.

@@ -48,7 +48,7 @@ def read_bytes(urlpath, delimiter=None, not_zero=False, blocksize="128 MiB",
Chunk size in bytes, defaults to "128 MiB"
compression : string or None
String like 'gzip' or 'xz'. Must support efficient random access.
sample : int or string
sample : int, string, or boolean

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 26, 2019

Member

Only False is a valid boolean here, right?

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 26, 2019

Member

This should have a test

This comment has been minimized.

Copy link
@ian-r-rose

ian-r-rose Mar 26, 2019

Author Contributor

There is a check for True down below for backwards compatibility reasons, so I changed it to boolean. If that should be treated as private, I can change it back to False.

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 26, 2019

Member

That's fine, just making sure.

This comment has been minimized.

Copy link
@ian-r-rose

ian-r-rose Mar 27, 2019

Author Contributor

Added a (very) simple test here.

# If we have not sampled, then use the first row of the first values
# as a representative sample.
if b_sample is False and len(values[0]):
b_sample = values[0][0].compute()

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 26, 2019

Member

I like this as a very simple way to achieve things.

The docstring here needs to be updated.

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 26, 2019

Member

Do we still get an appropriate exception when values[0] == []?

This comment has been minimized.

Copy link
@ian-r-rose

ian-r-rose Mar 26, 2019

Author Contributor

Good question!

This comment has been minimized.

Copy link
@ian-r-rose

ian-r-rose Mar 27, 2019

Author Contributor

I'm not really sure what the intended behavior should be here. When the CSV is empty it raises a ValueError that the sample is not large enough.

This comment has been minimized.

Copy link
@martindurant

martindurant Mar 27, 2019

Member

I guess it's not the time to change the behaviour here, so it's fine - but this is about inferring the meta from the first file, but the first file might not contain sufficient or any data; in theory we could deal with that, but probably the error message is good enough.

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Mar 26, 2019

I would love to hear suggestions for where/what to test this on.

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 26, 2019

The builtin http.server does allow HEAD, but not POST :|

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

Your exception handler seems to also catch not-found and presumably other errors. I suggest it should catch only 405 Method-not-allowed, which as an exception looks like

requests.exceptions.HTTPError("501 Server Error: Unsupported method ...
@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2019

Unfortunately, in my test case the server sends back a 400 Client Error, with an X-Error-Message: HEAD is not supported. If we are mostly trying to respond to not-found, would it be acceptable to re-raise if the status code is 404?

I appreciate that I should also be complaining to Socrata about their loose interpretation of HTTP status codes :/

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

I suppose the most common ones that should just be re-raised are 401, 403, 404. 400 is tricky, because it could be from HEAD or not. Maybe we should have a custom exception raised only as

try:
    requests.head(...)
except Exception as e:
    raise MyHeadException(original_exception=e)  # or raise_from
@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2019

In this particular usage http.file_size, the only HTTP request comes from requests.head(), so we know that going in. Or do I misunderstand your suggestion?

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

OK, sorry, you're right: at worst, then, we might not raise a real issue here, but only later on data access. I still think 401, 403, 404 should probably be reraised immediately.

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2019

Sounds good, I'll do that.

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2019

Whoops, good catch on unpacking.

On the output: I think False is correct: read_bytes returns it unmodified, as far as I can tell:

dask/dask/bytes/core.py

Lines 126 to 135 in 96c3381

if sample:
if sample is True:
sample = '10 kiB' # backwards compatibility
if isinstance(sample, str):
sample = parse_bytes(sample)
with OpenFile(fs, paths[0], compression=compression) as f:
sample = read_block(f, 0, sample, delimiter)
if include_path:
return sample, out, paths
return sample, out

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2019

w/r/t testing, it should be possible to override do_HEAD() on http.server to mock the right response (with a bit of work).

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

I think False is correct

I will not push; after all, the caller has no reason to care what this is when they don't want a sample at all.

@ian-r-rose ian-r-rose force-pushed the ian-r-rose:full-sample branch from 6ed63ee to 38f57c6 Mar 27, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 27, 2019

it should be possible to override do_HEAD() on http.server

Note that this is in a different place on py2 (if we still care) and I don't suppose it's async, so would be hard to put inside a single process

@ian-r-rose ian-r-rose force-pushed the ian-r-rose:full-sample branch 2 times, most recently from 12a223f to 20a9b5f Mar 27, 2019

@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

Checking in. What's the status here?

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Mar 29, 2019

I've got a test that is failing on python 2 (seemingly some bytes issue). Open to suggestions for fixing it. Also was hoping to mock up a test for the specific case that initiated the PR (no HEAD), but have not had the chance yet.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Mar 31, 2019

I've got a test that is failing on python 2 (seemingly some bytes issue)

If it's relatively isolated I think it's probably fine to xfail Python 2 tests pretty liberally today.

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Apr 2, 2019

I took a look at making a HEADless server for the test suite, and as @martindurant suspected it's not exactly trivial to do in the process of the test suite. If there are any ideas as to how to push on that, I'm all ears. Otherwise, I think this is in reasonably good shape (some parquet issues aside, possibly related to #4655?).

@ian-r-rose ian-r-rose force-pushed the ian-r-rose:full-sample branch 2 times, most recently from 6f8e62d to a94b0a3 Apr 2, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 3, 2019

Failures are in parquet and also image - did something else get released recently? ( @jakirkham )

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Apr 3, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 4, 2019

Please merge from master, and the failures should go away.

@ian-r-rose ian-r-rose force-pushed the ian-r-rose:full-sample branch from a94b0a3 to f8b5ddb Apr 4, 2019

@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2019

Rebased.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 4, 2019

Please add no-coverage marker to the two exceptions on HTTP access, since the tests don't trigger them.

@martindurant

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

Looks good, thank you.

@martindurant martindurant merged commit 5dc48d2 into dask:master Apr 5, 2019

4 checks passed

codecov/patch 100% of diff hit (target 91.2%)
Details
codecov/project 91.21% (+<.01%) compared to 0780ca5
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ian-r-rose

This comment has been minimized.

Copy link
Contributor Author

commented Apr 5, 2019

Thanks for the guidance!

@ian-r-rose ian-r-rose referenced this pull request Apr 11, 2019

Merged

Multiple input sources #2

asmith26 added a commit to asmith26/dask that referenced this pull request Apr 22, 2019

Allow sample to be False when reading CSV from a remote URL. (dask#4634)
* Allow sample to be False when reading CSV from a remote URL in order to just read the whole file.

* Re-raise HTTP error if they are unlikely to be HEAD method-not-supported errors.

* Add some simple tests.

* Don't check length of sample if it is a boolean.

* Add no-cover markers for untested lines.

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Allow sample to be False when reading CSV from a remote URL. (dask#4634)
* Allow sample to be False when reading CSV from a remote URL in order to just read the whole file.

* Re-raise HTTP error if they are unlikely to be HEAD method-not-supported errors.

* Add some simple tests.

* Don't check length of sample if it is a boolean.

* Add no-cover markers for untested lines.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.