
OPeNDAP loading error for datasets with many zeros and NaNs #1667

Closed
naomi-henderson opened this issue Mar 10, 2020 · 17 comments · Fixed by #1670


@naomi-henderson

Hi, I have been struggling to read some OPeNDAP URLs using netCDF4-python. I have tried versions '1.5.1.2' and '1.4.0', with the same issue. The same URLs work fine with many other OPeNDAP-enabled tools: ncview, Panoply plots, ncdump, and the NCO tools all work, but I am having trouble reading the data into Python using the netCDF4-python package.

I raised this issue at Unidata/netcdf4-python#998 and @jswhit
kindly suggested I raise it here "since this is definitely happening in the C library".

Here is an example.
The OPeNDAP version of the data is:

http://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc

or it can be downloaded directly from:

http://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc

The downloaded file (<150 MB) gives me no problems.

This particular OPeNDAP dataset is global sea-ice concentration, 'siconc' (time: 1980, latitude: 384, longitude: 320). Although it throws an error when the whole dataset is requested, it loads fine if the request is for a small enough chunk (in this case the first 1017 of the 1980 time slices can be loaded successfully, but the first 1018 cannot). Note that the dataset consists mostly of NaNs (land values) and zeros (clear-ocean values). The non-zero values occur only at high-latitude ocean grid points, so the data are highly compressible.

Since the file opens in ncview and displays the whole dataset from beginning to end, I am assuming the trouble is not in the OPeNDAP server at "llnl.gov". Is it possible that the netcdf-c package has a check on the expected cache size which cannot handle all of the zeros and NaNs? Any ideas, anyone?

Python code:

'''
import netCDF4

OPENDAP_url = 'http://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc'
ncds = netCDF4.Dataset(OPENDAP_url)
ncds['siconc'][:]      # does not work
ncds['siconc'][:1000]  # works
'''

with the following error:

'''
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>
      5 i = nc_fid.variables['i'][:]
      6 j = nc_fid.variables['j'][:]
----> 7 siconc = nc_fid.variables['siconc'][:,:,:]

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__getitem__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._get()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Access failure
'''

@WardF
Member

WardF commented Mar 10, 2020

Is it possible to get a C program that duplicates this issue? Otherwise, you may want to open this over at the python project, https://github.com/Unidata/netcdf4-python, and it will be linked back here if there is an issue related to the core C library. As it stands, I can't determine if this is an issue in the C library or the upstream python package.

@lesserwhirls
Contributor

Unfortunately, the issue is due to the size of the request, although that does not seem to make sense when looking at the compressed file size. Although the data compress very well due to the repeated zero and 1.e+20f (_FillValue, missing_value) values, the data still have to be loaded into memory uncompressed before the TDS returns the DAP response. In this case, requesting the full variable means 1980 * 384 * 320 = 243,302,400 32-bit floating point values (roughly 930 MiB). Experimentally (through a frustrating and time-consuming process, I'm sure), you found that 1017 * 384 * 320 = 124,968,960 32-bit floating point values (roughly 480 MiB) works. The default binary response from the DAP service shipped with the TDS is limited to 500 MB, so the first request is well over that limit, and the second request is just under it.
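
The arithmetic checks out if the 500 MB cap is read as decimal megabytes (an assumption on my part, but it is the only reading consistent with the 1017-works / 1018-fails boundary):

'''
# Sanity-check of the request sizes quoted above (4 bytes per 32-bit float).
bytes_per_slice = 384 * 320 * 4                 # one time step: 491,520 bytes

full = 1980 * bytes_per_slice                   # 973,209,600 bytes ~ 928 MiB
ok   = 1017 * bytes_per_slice                   # 499,875,840 bytes
bad  = 1018 * bytes_per_slice                   # 500,367,360 bytes

limit = 500 * 10**6                             # assumed: 500 decimal MB
print(full > limit, ok <= limit, bad > limit)   # True True True
'''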

ncview and panoply most likely request only what is needed to generate a single image (so something like one time index, full spatial indices). ncdump does something interesting, in that it makes multiple requests like so:

'''
Warning:fetch: https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc.dods?siconc.siconc[0][0][0:319]
Warning:fetch complete: 0.057 secs
Warning:fetch: https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc.dods?siconc.siconc[0][1][0:319]
Warning:fetch complete: 0.053 secs
Warning:fetch: https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc.dods?siconc.siconc[0][2][0:319]
Warning:fetch complete: 0.056 secs
Warning:fetch: https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc.dods?siconc.siconc[0][3][0:319]
...
'''

So each request to the server is for a single time index, a single j index, and all i indices. That means using ncdump to print out all of the values of the variable `siconc` would require just over 760k requests to the server.
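
That count is simply one request per (time, j) pair:

'''
print(1980 * 384)   # 760,320 requests, one per (time, j) pair
'''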

Other data access services offered by the TDS do not have the same size limits on requests. For example, I can use Siphon and the cdmremote service of the TDS to access the data without issue, but I'm not sure how using Siphon would fit within your specific end goals.

@naomi-henderson
Author

naomi-henderson commented Mar 11, 2020

@lesserwhirls, thank you very much for your detailed response! Good to know that Siphon + cdmremote might be a solution, but it would require a large refactoring of our workflow. We are part of the Pangeo effort to provide CMIP6 datasets in Google Cloud and have been using xarray to concatenate the datasets in time and convert them to zarr format. We were hoping to skip the step of downloading the data to our local drives and then uploading it to the cloud, but found that all of the 'tracer' types of variables, in addition to sea ice, have this same issue.
Anyway, thanks again for your very helpful comments! I will now trace backwards through all of the folks I have involved in this and refer them to your comments.

I am still puzzled that I can request very large datasets (multilevel, with the same time, latitude, and longitude dimensions) through the TDS with no difficulty, while these very compressible datasets are the ones causing trouble ... I guess the data providers (modellers) were much more conservative in the time chunking of the files for the larger datasets.

@WardF
Member

WardF commented Mar 11, 2020

Thanks @lesserwhirls!

@rabernat

Very informative discussion! The key point appears to be this:

The default binary response from the DAP service shipped with the TDS is limited to 500 MB

It sounds like other software (e.g. ncview, panoply) is smart / lazy enough not to just grab all the data.

Fortunately, we can easily do this in the Pangeo stack as well, by using Dask to request the data in more manageable chunks.

'''
import xarray as xr

OPENDAP_url = 'http://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc'
# the decode_times=False bit is required for the cftime time index
ds_opendap = xr.open_dataset(OPENDAP_url, chunks={'time': '100MB'}, decode_times=False)
display(ds_opendap)
'''

This works fine and returns immediately. We can then load the data as:

'''
data = ds_opendap.siconc.compute()
'''

This was quite slow for me (about 5 minutes to get < 1 GB of data), but it did complete successfully. Maybe a different chunk size would give better performance.
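
One variant worth trying, sketched below with an illustrative (untested) chunk size: chunking by an explicit number of time steps keeps each underlying DAP request well under the server's 500 MB cap.

'''
# Hypothetical tuning: 500 time steps * 384 * 320 * 4 bytes ~ 246 MB per
# request, comfortably under the 500 MB response limit.
ds_opendap = xr.open_dataset(OPENDAP_url, chunks={'time': 500}, decode_times=False)
data = ds_opendap.siconc.compute()
'''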

For example, I can use Siphon and the cdmremote service of the TDS to access the data without issue, but I'm not sure how using Siphon would fit within your specific end goals.

We should be looking into how to integrate Siphon into our CMIP6 pipeline. It has powerful capabilities that could help us a lot. I raised an issue with some questions about this in Unidata/siphon#258. That issue has a few code snippets that could be useful.

@rabernat

Another point this issue raises is error messages. If the OPeNDAP server had said "Help! I can't handle requests larger than 500 MB! Try making a smaller request.", we probably would have been able to figure this out ourselves without lighting up three different issue trackers. Instead we got NetCDF: Access failure.

Is there a way we could propagate more informative errors through this stack?

@DennisHeimbigner
Collaborator

Yes. If you suffix your URL with the string "#log", e.g.:

'http://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/SImon/siconc/gn/v20200218/siconc_SImon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc#log'

you should get some output that helps to figure out what is happening. But it is still a bit opaque.

The problem is that the netcdf library cannot itself recognize that this has happened, because it is a server problem.

@lesserwhirls
Contributor

There is a bug in TDS 5.0 in which the response body isn't returned in the server response (though the 403 status still is). I will fix that. However, in 4.6 we do return a message. So are you saying this is as good as it gets when reading through netCDF-C?

'''
from netCDF4 import Dataset

ds = Dataset("https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GEFS/Global_1p0deg_Ensemble/members/Best#log")
data = ds.variables['v-component_of_wind_isobaric_ens'][:]
'''

which produces:

'''
Note:oc_open: server error retrieving url: code=403 message="Request too big=12216.71808 Mbytes, max=500.0"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-55-addea7083b31> in <module>
----> 1 data = ds.variables["v-component_of_wind_isobaric_ens"][:]

netCDF4\_netCDF4.pyx in netCDF4._netCDF4.Variable.__getitem__()

netCDF4\_netCDF4.pyx in netCDF4._netCDF4.Variable._get()

netCDF4\_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Access failure
'''

Is there no way to propagate the "Note:oc_open: server error ..." message upward, or to capture it and give a more informative RuntimeError?

@rabernat

That looks like a good error message to me! Not sure if @naomi-henderson saw that in her stack trace...

@naomi-henderson
Author

naomi-henderson commented Mar 11, 2020

Ah, that would have been very helpful! I never saw it. Here is what I see with your example:

'''
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-17-72c7ec114606> in <module>
      1 from netCDF4 import Dataset
      2 ds = Dataset("https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GEFS/Global_1p0deg_Ensemble/members/Best#log")
----> 3 data = ds.variables['v-component_of_wind_isobaric_ens'][:]

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__getitem__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._get()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Access failure
'''

This is with netCDF4.__version__ = '1.5.1.2'. Thank you very much for following through with this issue!

@lesserwhirls
Contributor

So adding the #log suffix to the URL makes it seem like you are asking the server to generate a debugging message. In reality, the server never even sees it, because it is actually a client-side switch. Adding a suffix to the access URL to get, essentially, client-side debugging information is not at all intuitive. Using curl against the actual .dods request URL, we see:

'''
curl -i "https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GEFS/Global_1p0deg_Ensemble/members/Best.dods?v-component_of_wind_isobaric_ens"

HTTP/1.1 403 403
Date: Wed, 11 Mar 2020 16:08:05 GMT
Server: Apache
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=63072000; includeSubdomains;
Access-Control-Allow-Origin: *
XDODS-Server: opendap/3.7
Content-Description: dods-error
Content-Security-Policy: frame-ancestors 'self'
Transfer-Encoding: chunked
Content-Type: text/plain

Error {
    code = 403;
    message = "Request too big=12216.721864 Mbytes, max=500.0";
};
'''

That's the kind of info the C library should have access to, and that's how the "Note:oc_open: server error ..." message is generated. Adding the #log suffix just tells netCDF-C to show it to you (or more specifically, tells liboc, which looks to be part of the netCDF-C codebase).

@naomi-henderson, do you know what version of netCDF-C you are using (netCDF4.__netcdf4libversion__)?

@naomi-henderson
Author

@lesserwhirls

netCDF4.__netcdf4libversion__ = '4.6.2'

@DennisHeimbigner
Collaborator

You need to review the RFC for URLs. The fragment suffix is explicitly client-side and is never sent to the server.

@DennisHeimbigner
Collaborator

The 403 in the message is the standard HTTP access failure, which is properly propagated. The problem is that the important info is in English text in the message part. To propagate the message would require some kind of parsing of the message to try to figure out what it means. We could do pattern matching for messages and hope that the server returns a standard message text that we could recognize. This would, of course, be server-implementation dependent.

@lesserwhirls
Contributor

You need to review the RFC for URLs. The fragment suffix is explicitly client-side and is never sent to the server.

100% true, but I still say it isn't intuitive for a lot of users.

The 403 in the message is the standard HTTP access failure, which is properly propagated. The problem is that the important info is in English text in the message part. To propagate the message would require some kind of parsing of the message to try to figure out what it means. We could do pattern matching for messages and hope that the server returns a standard message text that we could recognize. This would, of course, be server-implementation dependent.

First, in the case of netCDF-C and the TDS, we control both sides of the problem, so the least we can do for our users is to tell them something useful, no matter how we make it happen. Second, do we need pattern matching? Can we pass on the plain text of the response message if the status is in the 400 range without using #log?
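
A minimal sketch of what is being proposed, in Python for illustration only (the real fix belongs in netCDF-C's DAP code; fetch_dods and its names are hypothetical):

'''
# Hypothetical illustration: on a 4xx DAP response, surface the server's
# plain-text error body instead of a generic "NetCDF: Access failure".
import requests

def fetch_dods(base_url, projection):
    # A DAP2 data request has the form <dataset>.dods?<projection>
    resp = requests.get(base_url + ".dods?" + projection)
    if 400 <= resp.status_code < 500:
        # e.g. 'Error { code = 403; message = "Request too big=..., max=500.0" };'
        raise RuntimeError("DAP server error %d: %s"
                           % (resp.status_code, resp.text.strip()))
    return resp.content
'''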

@DennisHeimbigner
Collaborator

The problem being addressed is how to send information down the stack. Using the fragment is one way. There is a different syntax that does the same thing, namely prefixing the URL with, e.g., '[log]https://...'. Neither is particularly intuitive. The #... form is preferred because it is consistent with the RFC URL syntax. The other form causes some systems (e.g. some Python packages) to fail because it is a non-standard URL.
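
For concreteness, the two forms look like this (the dataset URL here is illustrative):

'''
# Fragment form (preferred; consistent with standard URL syntax):
url_a = "https://thredds.example.edu/thredds/dodsC/some/dataset#log"

# Prefix form (non-standard; some clients reject it):
url_b = "[log]https://thredds.example.edu/thredds/dodsC/some/dataset"
'''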

@DennisHeimbigner
Collaborator

DennisHeimbigner commented Mar 11, 2020

As usual, I am overthinking things.

Can we pass on the plain text of the response message if the status is in the 400 range without using #log?

This is the right solution. I will implement it.

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Mar 11, 2020
re: Unidata#1667

Make DAP (2 and 4) forcibly report an error message
when an error response is received from the DAP servlet.