Prefetch in StructureDataIterators to improve performance #671
Comments
Our point stack exhibits a structure-oriented data access pattern, meaning that one value from each data variable is needed for each observation that it returns. The problem is that we're making an HTTP request for each value, because the data is not truly laid out as a collection of structures (i.e. records). Instead, we're cheating by organizing the variables into a pseudo-structure: a collection of variables that all have the same outer dimension (it's not a real Structure because the values are not stored contiguously). All CF DSG layouts require this pseudo-structure interpretation, which means they'll all have the same poor read performance in our point stack.

To improve performance, we'd like to make fewer requests by grabbing more than one datum each time and caching the unneeded data that we get back for subsequent calls. But how? We can't just naively cache the entire variable. What if it's huge? (Incidentally, we already do this for 1D coordinate axes, regardless of their size. I'm surprised we haven't been bitten by that yet.) Perhaps we can cache only some of the data? For example, if the user requests…

@JohnLCaron Any thoughts on this?
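To make the cost of this access pattern concrete, here is a small self-contained sketch. The class and method names are illustrative only (not real netcdf-java classes); the variable and observation counts come from the example dataset at the bottom of this issue (3 data variables, 11365 observations).

```java
/**
 * Illustrative sketch of the naive pseudo-structure access pattern: one HTTP
 * round trip per (variable, observation) pair, because the values are not
 * stored contiguously as records. Hypothetical names, not netcdf-java API.
 */
public class AccessPatternSketch {
    // Counts the requests needed to read every value the naive way.
    static int countRequests(int numVars, int numObs) {
        int requests = 0;
        for (int obs = 0; obs < numObs; obs++) {
            for (int v = 0; v < numVars; v++) {
                requests++;  // each value needs its own request
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        // 3 variables x 11365 observations, as in the example dataset below.
        System.out.println("requests = " + countRequests(3, 11365));
    }
}
```

This reproduces the `11365 * 3 = 34095` request count computed at the bottom of the issue.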
> The problem is that we're making an HTTP request for each value

Through OPeNDAP? I've given up on OPeNDAP for this reason, and cdmremote was…

The whole point of DSG is to use iterators instead of direct access.
So, to fill in the blanks: by having the iterator model for access, we can prefetch however we want when making the request, without the user ever knowing about it.

Side note: it's interesting that this problem seems to correspond exactly to the way overallocation of vectors/lists allows one to avoid O(N^2) behavior when appending. (I realize we're missing the copy that causes the N^2, but it's still interesting.)
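The overallocation analogy can be made concrete with a tiny self-contained sketch (illustrative only, no netcdf-java involved): growing a buffer by one element per reallocation copies O(N^2) elements in total, while doubling the capacity copies only O(N), much as one request per value costs O(N) round trips while block prefetching costs O(N / blockSize).

```java
/**
 * Sketch of the overallocation analogy: compare total elements copied when
 * appending n items under grow-by-one vs. capacity-doubling reallocation.
 */
public class GrowthSketch {
    // Returns total elements copied while appending n items with the given growth rule.
    static long copiesWithGrowth(int n, boolean doubling) {
        int capacity = 1;
        long copied = 0;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {  // buffer full: reallocate, copying all current elements
                copied += size;
                capacity = doubling ? capacity * 2 : capacity + 1;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        System.out.println("grow-by-one: " + copiesWithGrowth(10000, false));  // ~N^2/2 copies
        System.out.println("doubling:    " + copiesWithGrowth(10000, true));   // ~2N copies
    }
}
```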
CDM Remote exhibits the same behavior:
I've narrowed the performance problem to `StructureDataIteratorLinked.next()`. There, we read a single record: `ArrayStructure records = s.readStructure(currRecno, count);`

So yeah, I definitely like the idea of prefetching in the iterators rather than prefetching in the Variables, since the iterators have a known access pattern. I'll change the title of this issue.
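A minimal, self-contained sketch of the prefetch-in-the-iterator idea (hypothetical names; this is not the actual `StructureDataIteratorLinked` code): ask the `readStructure`-style call for a block of records per round trip, and serve subsequent `next()` calls from the cached block.

```java
import java.util.Arrays;

// Stand-in for a remote Structure; counts how many "HTTP requests" we make.
// Hypothetical class, sketching the idea rather than the real implementation.
class FakeRemoteStructure {
    final double[] data;
    int requestCount = 0;
    FakeRemoteStructure(double[] data) { this.data = data; }
    // Simulates a readStructure(start, count) call: one round trip per call.
    double[] read(int start, int count) {
        requestCount++;
        return Arrays.copyOfRange(data, start, Math.min(start + count, data.length));
    }
}

// Iterator that prefetches BLOCK records per request and serves subsequent
// next() calls from the cached block, so the caller never notices.
class PrefetchingIterator {
    static final int BLOCK = 100;
    final FakeRemoteStructure s;
    double[] cache = new double[0];
    int cacheStart = 0;   // index of cache[0] within the structure
    int nextIndex = 0;    // next record to return
    PrefetchingIterator(FakeRemoteStructure s) { this.s = s; }
    boolean hasNext() { return nextIndex < s.data.length; }
    double next() {
        if (nextIndex >= cacheStart + cache.length) {  // cache miss: prefetch a block
            cache = s.read(nextIndex, BLOCK);
            cacheStart = nextIndex;
        }
        return cache[nextIndex++ - cacheStart];
    }
}

public class PrefetchSketch {
    public static void main(String[] args) {
        FakeRemoteStructure s = new FakeRemoteStructure(new double[1000]);
        PrefetchingIterator it = new PrefetchingIterator(s);
        while (it.hasNext()) it.next();
        // 1000 records read with only 1000 / 100 = 10 requests instead of 1000.
        System.out.println("requests = " + s.requestCount);
    }
}
```

The block size here is arbitrary; a real fix would have to bound the cache so a huge variable can't blow out memory, per the concern raised at the top of this issue.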
Hmmm, I have forgotten the details, so I will have to review them at some point. Perhaps it's the "cdm feature" API that is supposed to solve the problem by…
It turns out that we already have this prefetch capability.

Something else I noticed: 3fec809 removes…
Also, for many…
Man, at this point, if the (Jenkins) tests pass on… Is this needed for…
4.6 could certainly use it; my comments today were the result of an issue I'm working on with Yuan in the IDV relating to slow reading of PointFeatures. However, the point stack has changed so drastically in 5.0 that I'd have to implement two completely different fixes. I'm not gonna do that, so this'll be 5.0-only.
TLDR: Reading remote CF DSG datasets via OPeNDAP and CDM Remote (and probably DAP4) is terribly slow.
This issue was originally raised in a netcdf-java mailing list message. Example dataset is a contiguous ragged array representation of profiles:
I popped it on a local THREDDS server and read the first 5 observations using both OPeNDAP and CDM Remote. I used the code:
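The original snippet wasn't preserved in this thread; below is a hedged reconstruction of what such a test might look like against the netcdf-java 5.0 point-feature API as I understand it. The URL, dataset layout, and class casts are assumptions, not the author's actual code.

```java
import java.util.Formatter;
import ucar.nc2.constants.FeatureType;
import ucar.nc2.ft.*;

/**
 * Hedged sketch: open a remote CF DSG profile dataset as point features and
 * read the first 5 observations. The location is a hypothetical local THREDDS
 * URL; swap in cdmremote:... to exercise the CDM Remote path instead.
 */
public class ReadFirstObs {
    public static void main(String[] args) throws Exception {
        String location = "http://localhost:8080/thredds/dodsC/test/profiles.nc";  // hypothetical
        Formatter errlog = new Formatter();
        try (FeatureDataset fd = FeatureDatasetFactoryManager.open(
                FeatureType.ANY_POINT, location, null, errlog)) {
            FeatureDatasetPoint fdp = (FeatureDatasetPoint) fd;
            int count = 0;
            for (DsgFeatureCollection dsg : fdp.getPointFeatureCollectionList()) {
                // Assumed: the example dataset presents as a profile collection.
                ProfileFeatureCollection profiles = (ProfileFeatureCollection) dsg;
                for (ProfileFeature profile : profiles) {
                    for (PointFeature pf : profile) {
                        System.out.println(pf.getFeatureData());
                        if (++count == 5) return;  // first 5 observations only
                    }
                }
            }
        }
    }
}
```

With the debug flags below enabled, each value read in this loop shows up as its own server request, which is what exposed the problem.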
Results for OPeNDAP, with `ucar.nc2.dods.DODSNetcdfFile.debugServerCall = true`:

Results for CDM Remote, with `ucar.nc2.stream.CdmRemote.showRequest = true`, are similar.

So, to read the entire example file as a `FeatureDatasetPoint`, we'd need to make roughly `11365 * 3 = 34095` HTTP requests! And the file is only 182 KB! As you can imagine, that's very slow. It took about 25 seconds when reading from my local server, but upwards of an hour when reading from an external server.