Poor performance downloading using byte-range requests #1848

Closed
dopplershift opened this issue Sep 23, 2020 · 29 comments · Fixed by #1849

@dopplershift (Member)

So if I try to get the data for a variable using byte-range requests, it performs pretty terribly:

❯ time ncdump -v temperature_anomaly 'https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc#mode=bytes' > /dev/null
ncdump -v temperature_anomaly  > /dev/null  8.89s user 0.83s system 7% cpu 2:05.89 total

During this transfer, my system showed it was sustaining about 180 kB/s. If I instead just download the whole file:

❯ time wget https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc
--2020-09-23 11:07:42--  https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc
Resolving coawst-public.s3-us-west-2.amazonaws.com (coawst-public.s3-us-west-2.amazonaws.com)... 52.218.185.201
Connecting to coawst-public.s3-us-west-2.amazonaws.com (coawst-public.s3-us-west-2.amazonaws.com)|52.218.185.201|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21541372 (21M) [application/x-netcdf]
Saving to: ‘HadCRUT.4.6.0.0.median.nc’

HadCRUT.4.6.0.0.median.nc         100%[=============================================================>]  20.54M  7.45MB/s    in 2.8s    

2020-09-23 11:07:45 (7.45 MB/s) - ‘HadCRUT.4.6.0.0.median.nc’ saved [21541372/21541372]

wget   0.06s user 0.16s system 7% cpu 3.101 total

I get that there's some overhead with byte-range requests, but a difference of 40x is so slow as to make the byte-range support useless.

Do we know what chunk size is being requested?

See Unidata/netcdf4-python#1043 for the original issue that provoked this.

@dopplershift (Member, Author)

cc @DennisHeimbigner

@DennisHeimbigner (Collaborator)

According to ncdump -hs, the chunking for that variable is:

temperature_anomaly:_ChunkSizes = 1, 36, 72 ;

We need something to compare against. Is there any way
to get equivalent times from netCDF-Java?
Can you do something similar with nccopy or the NCO operators?
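
For scale, that chunk shape implies very small reads per request, assuming 4-byte floats (the element size is not stated explicitly in the thread):

```python
# Bytes per chunk of temperature_anomaly, assuming float32 values
# and the chunk shape reported by ncdump -hs.
chunk_shape = (1, 36, 72)
bytes_per_value = 4  # float32

chunk_bytes = bytes_per_value
for dim in chunk_shape:
    chunk_bytes *= dim

print(chunk_bytes)  # 10368 bytes, i.e. roughly 10 KB per chunk
```

If each of those ~10 KB chunks becomes its own byte-range request, per-request latency rather than bandwidth dominates the transfer time.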

@dopplershift (Member, Author)

To be clear, when I asked about chunks, what I really meant is: how many bytes are asked for in each request? It feels like the low performance is due to too many small requests rather than getting e.g. 1MB chunks.
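
To put numbers on that intuition, here is a rough request count for the ~21 MB test file, assuming each request fetches one chunk of temperature_anomaly (1 × 36 × 72 float32 values, about 10 KB) versus 1 MB blocks:

```python
import math

file_size = 21_541_372         # bytes, from the wget output above
chunk_read = 1 * 36 * 72 * 4   # ~10 KB: one chunk of temperature_anomaly
block_read = 1_048_576         # 1 MB blocks

print(math.ceil(file_size / chunk_read))  # about two thousand requests
print(math.ceil(file_size / block_read))  # about twenty requests
```

Roughly two thousand round trips versus about twenty, which is the right order of magnitude to explain a 40x slowdown dominated by per-request overhead.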

@dopplershift (Member, Author)

Can you do something similar with nccopy or the NCO operators?

Is nccopy not working on your build of netcdf-c? 😉

@rsignell-usgs commented Sep 23, 2020

Also, just a note that on Unidata/netcdf4-python#1043 (comment) I was just opening the file as an xarray dataset (not downloading, at least not intentionally anyway). That should just read some metadata and the coordinates, but not the data variables. But it was still very slow.

@DennisHeimbigner (Collaborator)

The file is apparently a netcdf-4 file, so the decision about how much to read
with each request is made by the HDF5 library. Is there an equivalently sized
netcdf-3 file on that server that we could try to read, to see whether this is a problem
in general or primarily with HDF5?

@dopplershift (Member, Author)

@rsignell-usgs So when I tested, I tried just opening with netCDF4-python and didn't see the horrendous slow-down. The quickest way for me to reproduce it was to request data, which was unreasonably slow. The original xarray issue might instead be downloading coordinates or something. Either way, it'd be nice to find a way to address a 40x slowdown.

@DennisHeimbigner (Collaborator)

To be clear, I am not surprised by these numbers. We know that, inherently, access using
http byte-ranges will be significantly slower than file-based IO. The HDF5 access patterns
and reads are tailored to the file environment, so there is no reason to expect them
to work well for remote access.

@DennisHeimbigner (Collaborator)

One important point, maybe.
If you do direct calls to the netcdf library to get some specific variable,
then the lazy evaluation should keep you from reading extraneous meta-data.
Then the cost is mostly reading the data for the variable.
Ncdump reads all the meta-data so it cannot take advantage of the lazy
meta-data evaluation.
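
The effect Dennis describes can be sketched with a toy model: metadata is fetched per variable on first access, so a targeted read stays cheap while an ncdump-style full dump pays for everything. This is only an illustration of the idea, not netcdf-c's actual implementation:

```python
class LazyMetadataFile:
    """Toy model: fetch each variable's metadata only on first access."""

    def __init__(self, var_names):
        self._names = var_names
        self._meta = {}      # cache of already-fetched metadata
        self.fetches = 0     # stand-in for remote byte-range reads

    def var_meta(self, name):
        if name not in self._meta:
            self.fetches += 1  # simulate one remote read
            self._meta[name] = {"name": name}
        return self._meta[name]

f = LazyMetadataFile(["time", "latitude", "longitude", "temperature_anomaly"])

f.var_meta("temperature_anomaly")  # direct access to one variable
print(f.fetches)                   # 1 fetch so far

for name in f._names:              # an ncdump-style dump touches everything
    f.var_meta(name)
print(f.fetches)                   # 4: laziness no longer helps
```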

@dopplershift (Member, Author)

So I went ahead and ncdump-ed the time variable and saw this:

	float time(time) ;
		time:_Storage = "chunked" ;
		time:_ChunkSizes = 1 ;

and can confirm my suspicions that getting the coordinates (specifically time) was the xarray problem. Getting the data values for time took:

ncdump -v time   0.79s user 0.28s system 0% cpu 2:13.20 total

Doing the same for latitude and longitude, which, while smaller, are stored contiguously, took only seconds.

@dopplershift (Member, Author)

@DennisHeimbigner I expected it to be worse, but it seems like HDF5 is taking the naive approach and reading each chunk as an individual request. You could greatly speed this up by reading 512 kB or 1 MB at a time.
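
One way to do that is read coalescing: wrap the raw byte source and serve small reads out of a larger cached block. A minimal sketch of the idea, with a hypothetical wrapper class (this is not how libhdf5 or netcdf-c actually implement their I/O):

```python
import io

class ReadaheadWrapper:
    """Serve small sequential reads from larger cached blocks."""

    def __init__(self, raw, block_size=512 * 1024):
        self.raw = raw
        self.block_size = block_size
        self.pos = 0
        self.cache_start = None
        self.cache = b""
        self.fetches = 0  # how many reads actually hit the raw source

    def seek(self, offset):
        self.pos = offset

    def read(self, n):
        end = self.pos + n
        if (self.cache_start is None or self.pos < self.cache_start
                or end > self.cache_start + len(self.cache)):
            # Cache miss: fetch one large block starting at pos.
            self.raw.seek(self.pos)
            self.cache = self.raw.read(max(n, self.block_size))
            self.cache_start = self.pos
            self.fetches += 1
        offset = self.pos - self.cache_start
        data = self.cache[offset:offset + n]
        self.pos += n
        return data

# 2000 sequential ~10 KB reads (one per chunk) hit the underlying
# source only a few dozen times with 512 kB readahead blocks.
raw = io.BytesIO(bytes(21 * 1024 * 1024))
f = ReadaheadWrapper(raw, block_size=512 * 1024)
for _ in range(2000):
    f.read(10 * 1024)
print(f.fetches)  # 40 fetches instead of 2000
```

Each 512 kB block satisfies 51 of the 10 KB reads before the next fetch, so the per-request overhead is paid about 40 times instead of 2000.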

@DennisHeimbigner (Collaborator)

Possibly relevant: time is an UNLIMITED dimension, while latitude and longitude are not.

@rsignell-usgs
I think fsspec reads 10MB at a time, @martindurant could confirm.

@DennisHeimbigner (Collaborator) commented Sep 23, 2020

You could greatly speed this up by reading 512kB or 1M at a time.

Not sure I follow. The chunks are not necessarily contiguous on disk
for HDF5. Or are you referring to the slab reads made by the netcdf code?
Making better use of the HDF5 chunk cache might help
in some cases, but of course only if one were reading the same chunk multiple times.

@martindurant
5MB is the typical default, but this is configurable or can be turned off (no caching beyond the specific read).

@DennisHeimbigner (Collaborator)

Martin, does reading large amounts help or hurt when the file is being
randomly accessed by the client software in pieces much smaller than
the 5 MB?

@dopplershift (Member, Author)

Possibly relevant: time is an UNLIMITED dimension, while latitude and longitude are not.

Definitely relevant, since that's why time (and temperature_anomaly) are chunked.

@martindurant
That depends on the connection establishment time (slow for SSL) versus the connection throughput. 5 MB was specifically chosen to be "worthwhile", so that connection overhead becomes relatively small compared to the total time; of course, if most of the bytes are useless, you would be better off getting only those you need.

Note: fsspec allows concurrent downloads, to memory or disc, of multiple URLs, but this is not yet implemented for byte ranges or random access. For the latter it probably never will be, because it is unclear how to handle the file instance's internal state.
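
The overhead-versus-throughput trade-off can be put in rough numbers with a simple cost model, time per request = latency + size / bandwidth. The 50 ms latency figure here is an illustrative assumption; the bandwidth matches the wget transfer earlier in the thread:

```python
# Illustrative model of per-request cost: latency + size / bandwidth.
latency = 0.05                # assumed 50 ms connection/TTFB overhead
bandwidth = 7.45 * 1024**2    # ~7.45 MB/s, as in the wget run above

def overhead_fraction(block_size):
    """Fraction of each request spent on overhead rather than payload."""
    transfer = block_size / bandwidth
    return latency / (latency + transfer)

for size in (10 * 1024, 512 * 1024, 5 * 1024**2):
    print(f"{size:>8} bytes: {overhead_fraction(size):.0%} overhead")
```

Under these assumptions, ~10 KB requests spend well over 90% of their time on overhead, while 5 MB requests spend under 10%, which is why a block size on the order of megabytes is "worthwhile".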

@DennisHeimbigner (Collaborator) commented Sep 23, 2020

There are at least two experiments that need to run:

  1. test with netcdf-java
  2. test with the new hdf5 read-only S3 virtual file driver.

I can undertake #2, although it is possible that it would be quicker to test
using h5py.

@martindurant
Some caching reference for fsspec: https://filesystem-spec.readthedocs.io/en/latest/api.html#read-buffering (and https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.BaseCache , the no-cache option). The type is chosen by passing cache_type= to the open function.

@dopplershift (Member, Author)

So some relevant experiments from Python land. Running this code:

%%time
import fsspec
import h5netcdf

fobj = fsspec.open('https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc', cache_type='none')
with fobj as f:
    nc = h5netcdf.File(f, mode='r')
    a = nc.variables['temperature_anomaly'][:]  # read while the file is still open

which uses the h5netcdf implementation (netCDF4 in pure Python on top of h5py, which sits on libhdf5). This allows me to use the custom file objects from fsspec. Results:
'none' (no cache):

CPU times: user 3.44 s, sys: 376 ms, total: 3.82 s
Wall time: 1min 53s

'readahead':

CPU times: user 288 ms, sys: 69.4 ms, total: 358 ms
Wall time: 3.52 s

'bytes':

CPU times: user 542 ms, sys: 117 ms, total: 660 ms
Wall time: 10.9 s

@martindurant I'm not sure if you'll find it surprising (but I did) that cache_type='block' has yet to finish after 5 minutes and significant network activity. I would have expected that one to perform very well.

@martindurant
cache_type='block' has yet to finish after 5 minutes and significant network activity

That does sound wrong, worth investigating.

@DennisHeimbigner (Collaborator)

Interesting; so I would say that readahead being significantly faster
implies that there is more locality of reference in HDF5 files than I
would have expected. Good to know.

@dopplershift (Member, Author)

Well, there's more locality in this file, which only has a couple of unlimited variables. I'd bet the locality is much worse in a file with many unlimited variables, but then again it depends on your access pattern. What I will say is that I can get much better performance just by using block sizes of 128 kB or 512 kB. Given that those block sizes are cheap to fetch even on a mediocre cell-phone connection, it'd be worth making a moderate block size the default if there's some way to do that.

@martindurant Would you like me to open an issue over at intake/filesystem_spec or are you already planning on taking care of that?

@martindurant
You are welcome to propose that, but there are counterarguments: actually loading longer blocks of data is the norm for most formats (indeed, I assume this is true for HDF-stored data chunks too). Pushing the connection-overhead fraction down is important! Also note that (obviously) most big-data, high-performance work happens in the cloud, where latency is better and bandwidth is much better.

Would it be reasonable to have different file objects for metadata and data, with different caching and block-sizes? zarr allows this, for instance.

@martindurant
Fix for blockcache: fsspec/filesystem_spec#420

@dopplershift (Member, Author)

Apologies @martindurant , I was unclear (prose isn't working so well for me today...then again neither is code). The only issue I was going to open was regarding blockcache, but I see you've got that well in hand.

My proposal was more a minimum improvement to make to (hopefully) the netcdf-c library.

@martindurant
Understood, and sorted :)

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Sep 24, 2020
re: Issue Unidata#1848

The existing Virtual File Driver built to support byte-range
read-only file access is quite old. It turns out to be extremely
slow (reason unknown at the moment).

Starting with HDF5 1.10.6, the HDF5 library has its own version
of such a file driver. The HDF5 developers have better knowledge
about building such a driver and what incantations are needed to
get good performance.

This PR modifies the byte-range code in hdf5open.c so
that if the HDF5 file driver is available, then it is used
in preference to the one written by the Netcdf group.

Misc. Other Changes:

1. Moved all of nc4print code to ncdump to keep appveyor quiet.
@DennisHeimbigner (Collaborator)

I have checked in a PR (#1849) that
partially mitigates the problem.
