
Prefetch in StructureDataIterators to improve performance #671

Open

cwardgar opened this issue Nov 1, 2016 · 9 comments

@cwardgar
Contributor

cwardgar commented Nov 1, 2016

TLDR: Reading remote CF DSG datasets via OPeNDAP and CDM Remote (and probably DAP4) is terribly slow.

This issue was originally raised in a netcdf-java mailing list message. The example dataset is a contiguous ragged array representation of profiles:

netcdf NERSC_ARC_PHYS_OBS_XBT_2012_v1 {
dimensions:
	obs = 11365 ;
	profile = 7 ;
variables:
	int profile(profile) ;
		profile:cf_role = "profile_id" ;
	double time(profile) ;
		time:standard_name = "time" ;
		time:long_name = "time" ;
		time:units = "days since 2011-09-11 15:30:00" ;
	float lon(profile) ;
		lon:standard_name = "longitude" ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
	float lat(profile) ;
		lat:standard_name = "latitude" ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
	int rowSize(profile) ;
		rowSize:long_name = "number of obs for this profile" ;
		rowSize:sample_dimension = "obs" ;
	float z(obs) ;
		z:standard_name = "depth" ;
		z:long_name = "Depth below sea level" ;
		z:units = "m" ;
		z:positive = "down" ;
		z:axis = "Z" ;
	float temperature(obs) ;
		temperature:standard_name = "sea_water_temperature" ;
		temperature:long_name = "Sea water temperature measured by XBT" ;
		temperature:units = "Celsius" ;
		temperature:coordinates = "time lon lat z" ;
	float pressure(obs) ;
		pressure:standard_name = "sea_water_pressure" ;
		pressure:long_name = "Sea water pressure" ;
		pressure:units = "dbar" ;
		pressure:coordinates = "time lon lat z" ;
	float svel(obs) ;
		svel:standard_name = "speed_of_sound_in_sea_water" ;
		svel:long_name = "Sound velocity" ;
		svel:units = "m s-1" ;
		svel:coordinates = "time lon lat z" ;

// global attributes:
		:featureType = "profile" ;
		:Conventions = "CF-1.6,ACDD-1.3,GCMD-DIF,NMDC-netCDF-0.1" ;
		// ...
}

I popped it on a local THREDDS server and read the first 5 observations using both OPeNDAP and CDM Remote, with the following code:

package ucar.nc2.ft.point.remote

import spock.lang.Specification
import ucar.nc2.constants.FeatureType
import ucar.nc2.ft.*

class PointFeatureDatasetSpec extends Specification {
    def "goo.gl/2OQ0t7"() {  // From netcdf-java mailing list
        setup: "feature dataset"
        String location =
            "http://localhost:8080/thredds/dodsC/local-tds-data/support/2OQ0t7/NERSC_ARC_PHYS_OBS_XBT_2012_v1.nc"
//            "http://localhost:8080/thredds/cdmremote/local-tds-data/support/2OQ0t7/NERSC_ARC_PHYS_OBS_XBT_2012_v1.nc"
        
        Formatter errLog = new Formatter()
        FeatureDatasetPoint fdPoint = (FeatureDatasetPoint) FeatureDatasetFactoryManager.open(
                FeatureType.ANY_POINT, location, null, errLog)
        
        expect:
        for (DsgFeatureCollection dsgFeatCol : fdPoint.pointFeatureCollectionList) {
            ProfileFeatureCollection profileCol = (ProfileFeatureCollection) dsgFeatCol
            
            for (ProfileFeature profile : profileCol) {
                PointFeatureIterator pointFeatIter = profile.pointFeatureIterator
                
                for (int i = 0; i < 5; ++i) {
                    printf "\nObs %s:\n", i + 1
                    pointFeatIter.hasNext()
                    pointFeatIter.next()
                }
                
                break;
            }
            
            break;
        }
        
        cleanup:
        fdPoint.close()
        errLog.close()
    }
}

Results for OPeNDAP, with ucar.nc2.dods.DODSNetcdfFile.debugServerCall = true

DConnect to = <http://localhost:8080/thredds/dodsC/local-tds-data/support/2OQ0t7/NERSC_ARC_PHYS_OBS_XBT_2012_v1.nc>
DODSNetcdfFile readDDS
DODSNetcdfFile readDAS
DODSNetcdfFile.readDataDDSfromServer = <?profile,time,lon,lat,rowSize>

Obs 1:
DODSNetcdfFile.readDataDDSfromServer = <temperature[0:1:0]>
DODSNetcdfFile.readDataDDSfromServer = <pressure[0:1:0]>
DODSNetcdfFile.readDataDDSfromServer = <svel[0:1:0]>
DODSNetcdfFile.readDataDDSfromServer = <z[0:1:11364]>

Obs 2:
DODSNetcdfFile.readDataDDSfromServer = <temperature[1:1:1]>
DODSNetcdfFile.readDataDDSfromServer = <pressure[1:1:1]>
DODSNetcdfFile.readDataDDSfromServer = <svel[1:1:1]>

Obs 3:
DODSNetcdfFile.readDataDDSfromServer = <temperature[2:1:2]>
DODSNetcdfFile.readDataDDSfromServer = <pressure[2:1:2]>
DODSNetcdfFile.readDataDDSfromServer = <svel[2:1:2]>

Obs 4:
DODSNetcdfFile.readDataDDSfromServer = <temperature[3:1:3]>
DODSNetcdfFile.readDataDDSfromServer = <pressure[3:1:3]>
DODSNetcdfFile.readDataDDSfromServer = <svel[3:1:3]>

Obs 5:
DODSNetcdfFile.readDataDDSfromServer = <temperature[4:1:4]>
DODSNetcdfFile.readDataDDSfromServer = <pressure[4:1:4]>
DODSNetcdfFile.readDataDDSfromServer = <svel[4:1:4]>

Results for CDM Remote, with ucar.nc2.stream.CdmRemote.showRequest = true, are similar.

So, to read the entire example file as a FeatureDatasetPoint, we'd need to make roughly 11365 * 3 = 34095 HTTP requests! And the file is only 182 KB! As you can imagine, that's very slow. It took about 25 seconds when reading from my local server, but upwards of an hour when reading from an external server.

@cwardgar
Contributor Author

cwardgar commented Nov 1, 2016

Our point stack exhibits a structure-oriented data access pattern, meaning that one value from each data variable is needed for each observation that it returns. The problem is that we're making an HTTP request for each value because the data is not truly laid out as a collection of structures (i.e. records). Instead, we're cheating by organizing the variables into a pseudo-structure: a collection of variables which all have the same outer dimension (it's not a real Structure because the values are not stored contiguously). All CF DSG layouts require this pseudo-structure interpretation, which means they'll all have the same poor read performance by our point stack.

To improve performance, we'd like to make fewer requests by grabbing more than one datum each time and caching the extra data we get back for use in subsequent calls. But how? We can't just naively cache the entire variable. What if it's huge? Incidentally, we already do this for 1D coordinate axes, regardless of their size. I'm surprised we haven't been bitten by that yet.

Perhaps we can cache only some of the data? For example, if the user requests temperature[0:1:0], we actually grab temperature[0:1:100] and cache the excess, hoping that the user will read the other values within that range next. Obviously, effectiveness depends on access pattern, but this might be a reasonable strategy, especially for point data.
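
Here is a minimal sketch of what that partial caching could look like, assuming a hypothetical `ReadAheadCache` wrapper around `ucar.nc2.Variable`. The class name, the 100-value chunk size, and the overall design are illustrative only, not an existing CDM API:

```java
import java.io.IOException;
import ucar.ma2.Array;
import ucar.ma2.InvalidRangeException;
import ucar.nc2.Variable;

/**
 * Hypothetical read-ahead cache for a 1D data variable, e.g. temperature(obs).
 * When a value outside the cached range is requested, fetch a whole chunk and
 * serve later requests in that range from memory instead of issuing new HTTP requests.
 */
class ReadAheadCache {
  private static final int CHUNK = 100;  // illustrative read-ahead size

  private final Variable var;
  private Array chunk;          // currently cached section, or null
  private int chunkStart = -1;  // index of the first cached value
  private int chunkLen = 0;     // number of cached values

  ReadAheadCache(Variable var) {
    this.var = var;
  }

  double get(int index) throws IOException, InvalidRangeException {
    if (chunk == null || index < chunkStart || index >= chunkStart + chunkLen) {
      // Cache miss: read CHUNK values starting at index, clipped to the variable's length.
      chunkStart = index;
      chunkLen = (int) Math.min(CHUNK, var.getSize() - index);
      chunk = var.read(new int[] {chunkStart}, new int[] {chunkLen});
    }
    return chunk.getDouble(index - chunkStart);  // serve from the cached chunk
  }
}
```

With something like this in place, the first read of temperature[0] would pull temperature[0:1:99] in a single request and the next 99 observations would come from memory, so a sequential scan of the whole variable would drop from 11365 requests to about 114.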

@JohnLCaron Any thoughts on this?

@JohnLCaron
Collaborator

> The problem is that we're making an HTTP request for each value

Through opendap? I've given up on opendap for this reason, and cdmremote was the attempt to solve it.

> To improve performance, we'd like to make fewer requests by grabbing more than one datum each time

The whole point of DSG is to use iterators instead of direct access, allowing us to efficiently cache.


@dopplershift
Member

So, to fill in the blanks: with the iterator model for access, we can prefetch however we want when making a request, without the user ever knowing about it.

Side note: it's interesting that this problem seems to correspond exactly to the way overallocation of vectors/lists allows one to avoid O(N^2) behavior when appending. (I realize we're missing the copy that causes the N^2, but it's still interesting.)

@cwardgar
Contributor Author

cwardgar commented Nov 1, 2016

> > The problem is that we're making an HTTP request for each value
>
> Through opendap? I've given up on opendap for this reason, and cdmremote was the attempt to solve it.

CDM Remote exhibits the same behavior:

 CdmRemote request http://localhost:8080/thredds/cdmremote/local-tds-data/support/2OQ0t7/NERSC_ARC_PHYS_OBS_XBT_2012_v1.nc?req=header
 CdmRemote request http://localhost:8080/thredds/cdmremote/local-tds-data/support/2OQ0t7/NERSC_ARC_PHYS_OBS_XBT_2012_v1.nc?req=header took 50 msecs 
Obs 1:
CdmRemote data request for variable: 'temperature' section=(0:0)
CdmRemote data request for variable: 'pressure' section=(0:0)
CdmRemote data request for variable: 'svel' section=(0:0)
CdmRemote data request for variable: 'z' section=(0:11364)
Obs 2:
CdmRemote data request for variable: 'temperature' section=(1:1)
CdmRemote data request for variable: 'pressure' section=(1:1)
CdmRemote data request for variable: 'svel' section=(1:1)
Obs 3:
CdmRemote data request for variable: 'temperature' section=(2:2)
CdmRemote data request for variable: 'pressure' section=(2:2)
CdmRemote data request for variable: 'svel' section=(2:2)
Obs 4:
CdmRemote data request for variable: 'temperature' section=(3:3)
CdmRemote data request for variable: 'pressure' section=(3:3)
CdmRemote data request for variable: 'svel' section=(3:3)
Obs 5:
CdmRemote data request for variable: 'temperature' section=(4:4)
CdmRemote data request for variable: 'pressure' section=(4:4)
CdmRemote data request for variable: 'svel' section=(4:4)

I've narrowed the performance problem to StructureDataIteratorLinked.next(). Here we read a single StructureData (record) at a time, which causes the HTTP requests for single values. Instead, we could be prefetching with a call like:

ArrayStructure records = s.readStructure(currRecno, count);

So yeah, I definitely like the idea of prefetching in the iterators rather than prefetching in the Variables, since the iterators have a known access pattern. I'll change the title of this issue.
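
As a rough illustration (not the actual StructureDataIteratorLinked code), the read path could look something like the sketch below; the PrefetchingRecordReader name, the PREFETCH_SIZE constant, and the bookkeeping fields are all hypothetical:

```java
import java.io.IOException;
import ucar.ma2.ArrayStructure;
import ucar.ma2.InvalidRangeException;
import ucar.ma2.StructureData;
import ucar.nc2.Structure;

/**
 * Illustrative sketch: serve StructureData records from a prefetched block
 * instead of reading one record (and thus one HTTP request per variable) per call.
 */
class PrefetchingRecordReader {
  private static final int PREFETCH_SIZE = 100;  // records to read ahead; illustrative

  private final Structure s;
  private ArrayStructure prefetch;  // block of records read ahead, or null
  private int prefetchStart = -1;   // record number of the first prefetched record

  PrefetchingRecordReader(Structure s) {
    this.s = s;
  }

  StructureData read(int recno) throws IOException, InvalidRangeException {
    if (prefetch == null || recno < prefetchStart
        || recno >= prefetchStart + (int) prefetch.getSize()) {
      // Cache miss: read PREFETCH_SIZE records in one call, clipped to the structure's length.
      int count = (int) Math.min(PREFETCH_SIZE, s.getSize() - recno);
      prefetch = s.readStructure(recno, count);
      prefetchStart = recno;
    }
    return prefetch.getStructureData(recno - prefetchStart);  // serve from the prefetched block
  }
}
```

The iterator's existing record bookkeeping (e.g. currRecno) would sit on top of this; the sketch only shows the block-read idea.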

@cwardgar cwardgar changed the title from "Improve per-Variable data caching" to "Prefetch in StructureDataIterators to improve performance" on Nov 1, 2016
@JohnLCaron
Collaborator

Hmmm, I have forgotten the details, so I will have to review them at some point.

Perhaps it's the "cdm feature" API that is supposed to solve the problem by caching in the iterators.


@cwardgar
Contributor Author

It turns out that we already have this prefetch capability: StructureDataIterator.setBufferSize(int bytes). However, the only implementation that actually honors it is Structure.IteratorRank1. A better solution would be to move that functionality into a wrapper and decorate other StructureDataIterators with it.
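
For reference, a minimal sketch of how a caller could use that hook today, assuming it is iterating a Structure whose iterator actually honors the hint (currently only Structure.IteratorRank1 does); the 1 MB value is arbitrary:

```java
import java.io.IOException;
import ucar.ma2.StructureData;
import ucar.ma2.StructureDataIterator;
import ucar.nc2.Structure;

class BufferedIterationExample {
  // Ask the iterator to buffer roughly 1 MB of records per underlying read.
  // Whether this actually reduces the number of requests depends on which
  // StructureDataIterator implementation backs getStructureIterator().
  static void iterateBuffered(Structure s) throws IOException {
    StructureDataIterator iter = s.getStructureIterator();
    iter.setBufferSize(1024 * 1024);  // hint, in bytes
    while (iter.hasNext()) {
      StructureData sdata = iter.next();
      // ... process one observation (iterator cleanup omitted for brevity) ...
    }
  }
}
```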

Something else I noticed: 3fec809 removes bufferSize from PointFeatureIterators (and many of the underlying StructureDataIterators). Was that a mistake if we want to enable prefetching? Should I revert it?

@cwardgar
Contributor Author

Also, for many PointFeatureCollections, there wasn't an easy way to change the bufferSize of the underlying PointFeatureIterator, even before the above commit. That needs to change.

@lesserwhirls
Collaborator

Man, at this point, if the (Jenkins) tests pass on 5.0 with a revert and the additions to the StructureDataIterators, I'd say go for it and we'll deal with the repercussions, if any 👍

Is this needed for 4.6.x?

@cwardgar
Contributor Author

4.6 could certainly use it; my comments today were the result of an issue I'm working on with Yuan in the IDV relating to slow reading of PointFeatures. However, the point stack has changed so drastically in 5.0 that I'd have to implement 2 completely different fixes. I'm not gonna do that, so this'll be 5.0-only.
