Very slow when using indexing with a constant step size, for example [0, 2, 4] #680

fredrik-1 opened this issue Jun 21, 2017 · 39 comments

@fredrik-1
Contributor

Reading a netcdf variable with, for example, the index 0:10:2 is several hundred times slower than reading with an index that has a step size of 1 or a non-constant step size.

nf=Dataset('file.nc', 'r')
v=nf['temperature'][0:10:2, :, :]   # takes much longer than
v=nf['temperature'][0:10, :, :]

The same is also true when using [0, 2, 4], [0, 3], or np.array([0, 5, 10]). It is also much faster to use
np.array([0, 5, 10, 10]) compared to using np.array([0, 5, 10])
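As a rough illustration, the two index forms can be timed like this ('file.nc' and 'temperature' are just the names from the example above; adjust to your own data):

import time
from netCDF4 import Dataset

nf = Dataset('file.nc', 'r')
for idx in (slice(0, 10), slice(0, 10, 2)):
    t0 = time.time()
    v = nf['temperature'][idx, :, :]
    print(idx, time.time() - t0)   # the strided slice is dramatically slower
nf.close()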

I guess the problem is related to #325 but I am not sure if it is supposed to be fixed or not.

I have been using netCDF4-python 1.2.4 and 1.2.9; the behavior is the same with both.
I am on Windows (Anaconda distribution) with a conda install of netCDF4, running under both IPython and plain Python.

@jswhit
Collaborator

jswhit commented Jun 22, 2017

If this slice is small, the overhead of the python interface (creating the start,count,stride arrays to pass to the C library) will be large relative to the cost of actually reading the data from the file. I suspect that this is what is happening here. You say it is several hundred times slower, but what are the actual timings?

@jswhit
Collaborator

jswhit commented Jun 22, 2017

Could you post your actual benchmark code (including the data file)?

@fredrik-1
Contributor Author

fredrik-1 commented Jun 22, 2017

The actual timings can be 10 seconds vs. half an hour. The time difference is also largest when the total amount of data read is relatively large (even though the problematic slice might be small).

edit: It turned out to be easier to write a matrix to a very simple netcdf file and read it back than to find a good example file on the internet. The result is similar for the files I tested from
https://www.unidata.ucar.edu/software/netcdf/examples/files.html
but those variables were too small to produce a very large time difference (4-15 times for the files I tested).

edit: I also included a comparison of netcdf formats. The problem exists for all file formats, but it is much larger for NETCDF4 files.

Here is the code I used:

from netCDF4 import Dataset
import time
import numpy as np

file_='test_slice.nc'
x=500
y=500
z=10

formatList=['NETCDF4', 'NETCDF4_CLASSIC','NETCDF3_CLASSIC', 'NETCDF3_64BIT_OFFSET', 'NETCDF3_64BIT_DATA']
for format_ in formatList:
    print format_
    nf=Dataset(file_, 'w', format=format_)
    nf.createDimension('time',0)
    nf.createDimension('xc', x)
    nf.createDimension('yc', y)
    nf.createDimension('zc', z)
    
    var=nf.createVariable('var', 'd', ('time','zc','yc','xc'))
    
    var[:]=np.random.rand(3,z,y,x)
    nf.close()
      
    nf=Dataset(file_,'r')
    
    index=[np.array([0,1,2]),np.array([0,2])]  # stride 1 vs. stride 2 along the time dimension
    t=[None]*2
    print 'shape var: '+str(nf['var'].shape)
    for k in range(2):
        t0=time.time() 
        q=nf['var'][index[k],:,:,:]
        t[k]=time.time()-t0
        print 'shape out: ' +str(q.shape)
        print 'it took: '+str(t[k])
     
    print 'difference: '+str(t[1]/t[0]) 
    print ' '
    nf.close()

with the output:

NETCDF4
shape var: (3, 10, 500, 500)
shape out: (3L, 10L, 500L, 500L)
it took: 0.0750000476837
shape out: (2L, 10L, 500L, 500L)
it took: 58.5460000038
difference: 780.612837084
 
NETCDF4_CLASSIC
shape var: (3, 10, 500, 500)
shape out: (3L, 10L, 500L, 500L)
it took: 0.0759999752045
shape out: (2L, 10L, 500L, 500L)
it took: 59.0880000591
difference: 777.473938645
 
NETCDF3_CLASSIC
shape var: (3, 10, 500, 500)
shape out: (3L, 10L, 500L, 500L)
it took: 0.0759999752045
shape out: (2L, 10L, 500L, 500L)
it took: 0.693000078201
difference: 9.11842505655
 
NETCDF3_64BIT_OFFSET
shape var: (3, 10, 500, 500)
shape out: (3L, 10L, 500L, 500L)
it took: 0.0789999961853
shape out: (2L, 10L, 500L, 500L)
it took: 0.710999965668
difference: 9.0
 
NETCDF3_64BIT_DATA
shape var: (3, 10, 500, 500)
shape out: (3L, 10L, 500L, 500L)
it took: 0.0759999752045
shape out: (2L, 10L, 500L, 500L)
it took: 0.704999923706
difference: 9.27631781207

@jswhit
Collaborator

jswhit commented Jun 22, 2017

The reason the netcdf4 results are so slow has to be because of chunking:

http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
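One way to test that hypothesis would be to write the file with an explicit chunk layout, using the chunksizes keyword of createVariable (the layout below is only an example):

from netCDF4 import Dataset
import numpy as np

nc = Dataset('chunk_test.nc', 'w', format='NETCDF4')
nc.createDimension('time', None)
nc.createDimension('zc', 10)
nc.createDimension('yc', 500)
nc.createDimension('xc', 500)
# one chunk per time step; chunksizes is an existing createVariable keyword
var = nc.createVariable('var', 'd', ('time', 'zc', 'yc', 'xc'),
                        chunksizes=(1, 10, 500, 500))
var[:] = np.random.rand(3, 10, 500, 500)
nc.close()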

@fredrik-1
Contributor Author

fredrik-1 commented Jun 22, 2017

I doubt that. Why should it take maybe 800 times longer to read with t=[0,2] than to read an extra time point using t=[0,2,2]? I would think that a structured index should, if anything, make the reading faster compared to a "random" index.

Something strange and unnecessary that slows down the reading is going on whenever the same read can be expressed with an index of the form i0:iend:s, with s > 1.

Running the above script with the index changed to
index=[np.array([0, 2, 2]), np.array([0,2])]
probably shows the strange behavior more clearly.

@fredrik-1
Contributor Author

fredrik-1 commented Jun 22, 2017

I tried to look at the actual code, based on the information in issue #325.

The slowdown when using arrays and lists disappears if I change line 273 in utils.py. Using a slice, for example 0:10:2, is still very slow.

if ee and len(e) == len(ee) and (e == np.arange(start,stop,step)).all():
    newElem.append(slice(start,stop,step))

to

if ee and len(e) == len(ee) and (e == np.arange(start,stop,step)).all():
    newElem.append(e)

@fredrik-1
Contributor Author

fredrik-1 commented Jun 22, 2017

OK, I think I understand. _StartCountStride in utils.py divides the data matrix into parts of length 1 if an "unstructured" array or list is used. Each part is then read by nc_get_vars, so nc_get_vars is called many times if the array is "unstructured". A slice with a constant stride results in a single call to nc_get_vars because the stride information is used. (edit: I don't know exactly what is called in the C library, but start, count and stride became matrices instead of vectors when I read the data with [0, 2, 2], and the count was always 1, suggesting that every read had a time length of 1.)

The problem is that, on my computer at least, a single call to nc_get_vars with a step size larger than 1 is much, much slower than many calls to nc_get_vars that each read a single step.

I would suggest that the section that starts on line 251,
"# replace Ellipsis and integer arrays with slice objects, if possible."
should only be used if the resulting slice object has a stride of 1.
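To illustrate the behavior described above (this is not the library's actual code, just a sketch of the decomposition): an evenly spaced index collapses into one strided (start, count, stride) read, while an "unstructured" index becomes one length-1 read per element.

import numpy as np

def decompose(idx):
    idx = np.asarray(idx)
    if len(idx) > 1:
        step = idx[1] - idx[0]
        if step > 0 and (np.diff(idx) == step).all():
            return [(int(idx[0]), len(idx), int(step))]   # one strided read
    return [(int(i), 1, 1) for i in idx]                  # one read per element

print(decompose([0, 2, 4]))   # [(0, 3, 2)]
print(decompose([0, 3, 5]))   # [(0, 1, 1), (3, 1, 1), (5, 1, 1)]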

@jswhit
Collaborator

jswhit commented Jun 22, 2017

If all the strides are 1, nc_get_vara is called; if not, nc_get_vars is called. In your example, there is only one call to either - it's just that the nc_get_vars call is very slow for NETCDF4 files. I still think chunking may be playing a role here, since it only comes into play for HDF5 files. Whatever it is, it is happening in the C libraries when nc_get_vars is called and the underlying disk format is HDF5. The slowdown is not occurring at the python level.

@jswhit
Collaborator

jswhit commented Jun 22, 2017

Some more information about the C lib calls:

In the case when all the strides are 1 (nc_get_vara is called), the start, count, stride arrays are
[0 0 0 0] [ 3 10 500 500] [1 1 1 1]
and the time spent in nc_get_vara is 0.036445 seconds.
In the strided-read case (nc_get_vars is called), start, count, stride are
[0 0 0 0] [ 2 10 500 500] [2 1 1 1]
and the time spent in nc_get_vars is 35.693566 seconds (!)

I don't understand why strided reads are so slow for NETCDF4 files. Maybe @WardF or @DennisHeimbigner can shed some light.

@jswhit
Collaborator

jswhit commented Jun 22, 2017

@fredrik-1
Contributor Author

The difference in speed is larger for NETCDF4 files than for NETCDF3 files but it is still almost a factor of 10 for the NETCDF3 files when I run the script above.

@fredrik-1
Contributor Author

That seems to be the same problem.

@DennisHeimbigner
Collaborator

You are probably better off calling nc_get_vara and then doing the stride
yourself in-memory.

@jswhit
Collaborator

jswhit commented Jun 22, 2017

Are you saying the python interface should not try to use nc_get_vars for strided reads?

@DennisHeimbigner
Collaborator

Using nc_get_vars is very slow compared to nc_get_vara. So you get to choose
performance vs python code complexity. I will have to let you decide the trade-off.
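In python terms, the suggestion amounts to something like the following sketch (using the test file from the script above): read the contiguous bounding block with stride 1, which maps to a single nc_get_vara call, and let NumPy apply the stride in memory.

from netCDF4 import Dataset

nf = Dataset('test_slice.nc', 'r')
block = nf['var'][0:3]    # contiguous read covering the needed range (stride 1)
subset = block[::2]       # same data as nf['var'][0:3:2], striding done in memory
nf.close()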

@jswhit
Collaborator

jswhit commented Jun 22, 2017

I'd be inclined to leave it to client code to do something like this

data=nf['var'][:][::2]

instead of

data=nf['var'][::2]

Perhaps we could issue a warning when nc_get_vars is called with NETCDF4 or NETCDF4_CLASSIC files.
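A rough sketch of what such a warning could look like (this is not actual library code; the function and its arguments are just illustrative):

import warnings

def maybe_warn_strided_read(file_format, strides):
    # warn when a strided read would hit nc_get_vars on an HDF5-backed file
    if file_format in ('NETCDF4', 'NETCDF4_CLASSIC') and any(s > 1 for s in strides):
        warnings.warn('strided reads use nc_get_vars, which can be very slow for '
                      'NETCDF4/NETCDF4_CLASSIC files; consider reading a contiguous '
                      'block and subsetting it in memory')

maybe_warn_strided_read('NETCDF4', (2, 1, 1, 1))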

@fredrik-1
Contributor Author

I would suggest changing the line
if ee and len(e) == len(ee) and (e == np.arange(start,stop,step)).all():
to
if ee and e[1]-e[0]==1 and len(e) == len(ee) and (e == np.arange(start,stop,step)).all():
(or maybe make the change a few lines above.)
I don't see any reason to
"# Replace sequence of indices with slice object if possible."
when reading with a slice object is sometimes extremely slow. Why do that replacement when it actually makes the code perform worse?

The above change seems to work well for me, and someone who wants to use a slice object can still do that in the usual way.

Using a slice object with a stride of 1 seems to be slightly faster than using an array for the same operation, but the speed difference when the stride is larger than 1 is much bigger, also for NETCDF3_CLASSIC files on my computer.

@fredrik-1
Contributor Author

Are you saying the python interface should not try to use nc_get_vars for strided reads?

Why should it, when nc_get_vars is (sometimes at least) much slower than the already existing python code that doesn't use it?

@jswhit
Collaborator

jswhit commented Jun 22, 2017

nc_get_vara can only be used when the strides are all 1. If the strides are not 1, then we would have to add extra code to subset the returned data. nc_get_vars handles this automatically.

@jswhit
Collaborator

jswhit commented Jun 22, 2017

I think the reason your suggested change speeds things up is that it is (incorrectly) calling nc_get_vara when it should be calling nc_get_vars. You'll see that when you run the tests, some fail. Of course using a slice with stride 1 is faster, but the whole point is that sometimes you want a stride of 2 (as in your original example). In this case, Dennis is suggesting to go ahead and use a stride of 1, and then throw away every other element.

@DennisHeimbigner
Collaborator

@jswhit is correct. Sorry for being unclear. The tradeoff is that instead of
using nc_get_vars, you use nc_get_vara. This will read in extra data, but is much faster.
Once in memory, you can use python to extract every other element. You can use nc_get_vars
directly, but as you have seen, it is significantly slower. BTW, if anyone knows of any
reasonable optimizations, we can incorporate them directly into the netcdf-c library
to speed up nc_get_vars. The ones we have considered to date require significant code
complexity to manage allocated memory. The constraint under which we operate is that
the caller gives us memory for the strided data, so we must somehow use that to pull in
successive chunks and then do the striding in memory. This can get very complicated.
But perhaps we are over-thinking it.

@fredrik-1
Contributor Author

fredrik-1 commented Jun 22, 2017

I don't understand your answers.

netCDF4.Dataset is supposed to work when the index is a sequence of indices, for example [0, 4, 2, 10] or np.array([0, 1, 2, 10]).
This seems to work for me, and I don't understand why it should result in any form of error. The algorithm seems to be to read the variable in chunks of length one.

The change to the code that I suggest above just says that [0, 3, 6] should be handled by the same algorithm that handles [0, 3, 5]. Why should that fail?

@jswhit
Collaborator

jswhit commented Jun 22, 2017

It does work, it's just slow, since the code will call nc_get_vars when the sequence of integers is evenly spaced (i.e. it fits into a strided read). I gather you are suggesting that we don't call nc_get_vars, but instead call nc_get_vara multiple times with a stride of 1. If you have a particular workaround in mind, why don't you create a pull request that we can look at?
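At the python level, the idea amounts roughly to the following sketch (using the test file from earlier; this is not the actual code in the pull request): one contiguous read per index, stacked afterwards.

from netCDF4 import Dataset
import numpy as np

nf = Dataset('test_slice.nc', 'r')
wanted = [0, 2]                           # evenly spaced indices along time
pieces = [nf['var'][i] for i in wanted]   # each read is a stride-1 (nc_get_vara) read
result = np.stack(pieces)                 # same data as nf['var'][0:3:2]
nf.close()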

@dopplershift
Member

dopplershift commented Jun 22, 2017

Given that some people work with datasets larger than can fit in memory, having strided access that can only pull in data that does fit is an important use case. So if I have a 1000 * 5000 * 6000 variable (111G for a float32 array), I need to be able to grab a strided subset, say:

var[::20, ::10, ::10]

and not have that fail due to memory (the request is only 57M). I don't see any way to allow this use case if netCDF4-python does the striding itself after reading the entire variable. While I find the var[:][::2] method to be a bit obtuse, at least there's an optimization available.

@fredrik-1
Contributor Author

I wrote pull request #681, which I believe solves the issue. It works as expected in my tests.

@jswhit
Collaborator

jswhit commented Jun 23, 2017

Pull request #681 does not convert sequences of indices to slices unless step=1. For your example above this means that one nc_get_vara call will be made for each element of the index sequence, instead of calling nc_get_vars once. This will be faster as long as the length of the index sequence is small enough - if there are a thousand points in the sequence, it will probably be slower than the current implementation. Perhaps a check on the length of the index sequence should be added, and if the sequence exceeds a critical length, the original code would be used.

It also does not solve the problem when a slice is specified to begin with (i.e. var[::2]) - although a workaround would be to convert strided slices to sequences of integers.

Also, some of the tests are failing - those that check that the conversion of sequences to slices is occurring. They should be easy to modify, though.

@jswhit
Collaborator

jswhit commented Jun 26, 2017

Pull request #683 builds on #681 by also converting strided slices (i.e. var[::2]) into individual calls to nc_get_vara. An arbitrary threshold of 1000 is included; if the number of calls to nc_get_vara would exceed this, a single call to nc_get_vars is used instead.

I want to be very careful with this, since the indexing code is complicated and fragile.
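The dispatch rule described above boils down to something like this (a simplified sketch; the function name is illustrative, not the actual netCDF4-python internals):

THRESHOLD = 1000

def use_single_vars_call(num_vara_calls, threshold=THRESHOLD):
    # True  -> issue one strided nc_get_vars read
    # False -> issue num_vara_calls separate contiguous nc_get_vara reads
    return num_vara_calls > threshold

print(use_single_vars_call(10))    # False: many small nc_get_vara reads
print(use_single_vars_call(5000))  # True: fall back to a single nc_get_vars read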

@jswhit
Collaborator

jswhit commented Jun 27, 2017

I removed the threshold, so that nc_get_vars is never called, since tests showed that even 10,000 calls to nc_get_vara were faster than a single call to nc_get_vars.

@DennisHeimbigner
Collaborator

Impressive. I need to see if I can implement this same strategy directly
in the netcdf-c library.

@fredrik-1
Contributor Author

fredrik-1 commented Jun 27, 2017

I removed the threshold, so that nc_get_vars is never called since tests showed that even 10,000 calls to nc_get_vara was faster than a single call to nc_get_vars.

That was what I thought. nc_get_vars is always too slow with strides larger than 1.

@jswhit
Collaborator

jswhit commented Jun 27, 2017

Here's an example where using nc_get_vars is actually faster (it uses a new Variable method use_nc_get_vars which I just added to re-enable the previous default behavior)

import netCDF4
from numpy.testing import assert_array_almost_equal
import time
import numpy as np

file_='test_slice.nc'
format_='NETCDF4_CLASSIC'
x=20
y=20
z=20
t=100

nf=netCDF4.Dataset(file_, 'w', format=format_)
td=nf.createDimension('time',None)
xd=nf.createDimension('xc', x)
yd=nf.createDimension('yc', y)
zd=nf.createDimension('zc', z)

var=nf.createVariable('var', 'd',
        ('time','zc','yc','xc'))

data = np.random.rand(t,z,y,x)
data2 = data[:,::2,::2,::2]
var[:]=data
nf.close()

nf=netCDF4.Dataset(file_,'r')

t1=time.time()
q=nf['var'][:][:,::2,::2,::2] # read everything, then take every other point in z, y, x
t2=time.time()
print 'read everything, then slice with stride 2: ',t2-t1
assert_array_almost_equal(q,data2)
nf['var'].use_nc_get_vars(True)
t1=time.time()
q=nf['var'][:,::2,::2,::2] # slice with stride 2
t2=time.time()
print 'read slice with stride 2 (nc_get_vars): ',t2-t1
assert_array_almost_equal(q,data2)
nf['var'].use_nc_get_vars(False)
t1=time.time()
q=nf['var'][:,::2,::2,::2] # slice with stride 2
t2=time.time()
print 'read slice with stride 2 (nc_get_vara): ',t2-t1
assert_array_almost_equal(q,data2)

nf.close()

On my mac, this produces

read everything, then slice with stride 2:  0.0139389038086
read slice with stride 2 (nc_get_vars):  0.581645965576
read slice with stride 2 (nc_get_vara):  1.53950095177

@jswhit
Collaborator

jswhit commented Jun 27, 2017

The counterexample can be seen by changing :,::2,::2,::2 to ::2 in the code above, which on my machine gives

read everything, then slice with stride 2:  0.0139241218567
read slice with stride 2 (nc_get_vars):  2.32832503319
read slice with stride 2 (nc_get_vara):  0.00344800949097

@jswhit
Collaborator

jswhit commented Jun 30, 2017

If there are no more comments on this, I will merge over the weekend.

jswhit added a commit that referenced this issue Jul 3, 2017
extend fix for issue #680 to include strided slices
jswhit closed this as completed Jul 3, 2017
@WeatherGod

I don't think this is completely resolved; I am seeing this with v1.3.1. I originally reported it in pydata/xarray#2004, where we found that using xarray's h5netcdf engine showed no performance issues, while using the netcdf4 engine had massively bad performance.

@jswhit
Collaborator

jswhit commented Mar 21, 2018

@DennisHeimbigner, I just noticed that there is no nc4_get_vars, only nc4_get_vara in libsrc4/nc4hdf.c. Is there any reason that strided access to hdf5 is not implemented using H5Sselect_hyperslab? Apparently h5py uses this, and is much faster for strided access to netcdf4 data than netcdf4-python (see pydata/xarray#2004).
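For comparison, the h5py route looks roughly like this (assuming a NETCDF4/HDF5 version of the test file from earlier in the thread; h5py performs the strided selection through H5Sselect_hyperslab):

import h5py

with h5py.File('test_slice.nc', 'r') as f:
    q = f['var'][::2, :, :, :]   # strided read handled by HDF5 hyperslab selection
    print(q.shape)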

@DennisHeimbigner
Collaborator

It has been on our radar for a while to implement vars using the HDF5 equivalent.
This discussion reminded me about it, so I created an issue (https://github.com/Unidata/netcdf-c/issues?q=is%3Aissue+is%3Aopen+label%3Atype%2Fperformance)
to make it visible. It is a relatively straightforward change, but it has never
been high enough on our todo list.

@jswhit
Collaborator

jswhit commented May 13, 2018

Reopening this issue so we can track the progress of Unidata/netcdf-c#908

Once this is fixed in netcdf-c, we can re-enable the use of nc_get_vars.

@jswhit
Collaborator

jswhit commented Jun 11, 2018

A fix for this was merged into netcdf-c master, so the workaround is now disabled if the library version >= 4.6.2 (pull request #805). nc_get_vars performance is now greatly improved for NETCDF4/NETCDF4_CLASSIC files.
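To check which code path your build will take, look at the version of the underlying netcdf-c library, for example:

import netCDF4
# the workaround is only disabled (nc_get_vars used again) when this is >= 4.6.2
print(netCDF4.__netcdf4libversion__)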

@jswhit
Collaborator

jswhit commented Jun 11, 2018

@fredrik-1's initial test code now produces

NETCDF4
shape var: (3, 10, 500, 500)
shape out: (3, 10, 500, 500)
it took: 0.0612909793854
shape out: (2, 10, 500, 500)
it took: 0.045725107193
difference: 0.746033227916

NETCDF4_CLASSIC
shape var: (3, 10, 500, 500)
shape out: (3, 10, 500, 500)
it took: 0.0478830337524
shape out: (2, 10, 500, 500)
it took: 0.0342881679535
difference: 0.716081778167

NETCDF3_CLASSIC
shape var: (3, 10, 500, 500)
shape out: (3, 10, 500, 500)
it took: 0.0986239910126
shape out: (2, 10, 500, 500)
it took: 0.0672671794891
difference: 0.68205695996

NETCDF3_64BIT_OFFSET
shape var: (3, 10, 500, 500)
shape out: (3, 10, 500, 500)
it took: 0.101933956146
shape out: (2, 10, 500, 500)
it took: 0.0675051212311
difference: 0.66224370939

NETCDF3_64BIT_DATA
shape var: (3, 10, 500, 500)
shape out: (3, 10, 500, 500)
it took: 0.0984890460968
shape out: (2, 10, 500, 500)
it took: 0.0677559375763
difference: 0.687954044247

so the performance degradation for NETCDF4 and NETCDF4_CLASSIC appears to be gone.
