Very slow when using indexing with a constant step size, for example [0, 2, 4] #680
Reading a netCDF variable with, for example, the index 0:10:2 is several hundred times slower than with an index with a non-constant step size.

```python
nf = Dataset('file.nc', 'r')
v = nf['temperature'][0:10:2, :, :]
```

takes much longer than

```python
v = nf['temperature'][0:10, :, :]
```

The same is also true when using [0, 2, 4], [0, 3], or np.array([0, 5, 10]). It is also much faster to use np.array([0, 5, 10, 10]) than np.array([0, 5, 10]).

I guess the problem is related to #325, but I am not sure whether it is supposed to be fixed or not.

I have been using netCDF4 1.2.4 and 1.2.9; the behavior is the same with both. I am on Windows (Anaconda), running both IPython and plain Python, with a conda install of netCDF4.
Comments
If this slice is small, the overhead of the python interface (creating the start, count, stride arrays to pass to the C library) will be large relative to the cost of actually reading the data from the file. I suspect that this is what is happening here. You say it is several hundred times slower, but what are the actual timings?
Could you post your actual benchmark code (including the data file)?
The actual timing can be 10 seconds vs. half an hour. The difference in time is also largest when the total download is relatively large (but the problematic slice might be small).
edit: It turned out to be easier to write a matrix to a very simple netCDF file and read it than to find a good example file on the internet. The result is similar for the files I tested.
edit: I also included a comparison of netCDF formats. The problem exists for all file formats, but it is much larger for NETCDF4 files. Here is the code I used:
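A minimal sketch of such a benchmark (illustrative file and variable names, not the exact original script): write a random matrix to a simple netCDF file in each format, then time a strided slice, a constant-step list, and a non-constant-step list.

```python
import time
import numpy as np
import netCDF4

for fmt in ('NETCDF3_CLASSIC', 'NETCDF4_CLASSIC', 'NETCDF4'):
    # write a random matrix to a simple netCDF file in this format
    fname = 'test_%s.nc' % fmt
    nc = netCDF4.Dataset(fname, 'w', format=fmt)
    nc.createDimension('t', 1000)
    nc.createDimension('x', 1000)
    v = nc.createVariable('var', 'f8', ('t', 'x'))
    v[:] = np.random.rand(1000, 1000)
    nc.close()

    # time three index types on read
    nc = netCDF4.Dataset(fname, 'r')
    for label, idx in (('slice 0:100:2', slice(0, 100, 2)),
                       ('constant-step list', [0, 2, 4, 6, 8]),
                       ('non-constant list', [0, 2, 4, 6, 6])):
        t0 = time.time()
        _ = nc['var'][idx, :]
        print('%-16s %-20s %.3f s' % (fmt, label, time.time() - t0))
    nc.close()
```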
The reason the netcdf4 results are so slow has to be because of chunking: http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
I doubt that. Why should it take maybe 800 times longer to download with t=[0,2] compared to downloading an extra time point using t=[0,2,2]? I would think that a structured index should, if anything, make the reading faster compared to a "random" index. Something strange and unnecessary that slows down the reading is going on whenever the read can be expressed with an index of the form i0:iend:s with s>1.
I tried to look at the actual code based on the information in ticket #325. The speed difference when using arrays and lists disappears if I change line 273 in utils.py. Using a slice, for example 0:10:2, is still very slow.
Ok, I think I understand. _StartCountStride in utils.py divides the data matrix into parts of length 1 if an "unstructured" array or list is used. Each part is then read by nc_get_vars, so nc_get_vars is called many times if the array is "unstructured". A slice with a constant stride results in a single call to nc_get_vars because the stride information is used. (edit: I don't know exactly what is called in the C library, but start, count and stride became matrices instead of vectors when I read the data with [0, 2, 2], and the count was always 1, suggesting that every read had time length 1.) The problem is that, on my computer at least, a single call to nc_get_vars with a step size larger than 1 is much, much slower than many calls to nc_get_vars that each read a single element. I would suggest changing the section that starts on line 251.
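A simplified sketch of that dispatch (not the real _StartCountStride, which handles many more cases): a slice keeps its stride and maps to one C-library call, while a general integer sequence is broken into length-1 reads.

```python
def decompose(index, n):
    """Return a list of (start, count, stride) requests for one dimension
    of length n. A slice becomes a single strided request; any other
    integer sequence becomes one length-1 request per element."""
    if isinstance(index, slice):
        start, stop, step = index.indices(n)
        count = len(range(start, stop, step))
        return [(start, count, step)]      # one nc_get_vars call
    return [(int(i), 1, 1) for i in index] # one call per element

print(decompose(slice(0, 10, 2), 100))  # [(0, 5, 2)]
print(decompose([0, 2, 2], 100))        # [(0, 1, 1), (2, 1, 1), (2, 1, 1)]
```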
If all the strides are 1, nc_get_vara is called instead of nc_get_vars.
Some more information about the C lib calls: when all the strides are 1, nc_get_vara is called. I don't understand why strided reads are so slow for NETCDF4 files. Maybe @WardF or @DennisHeimbigner can shed some light.
The difference in speed is larger for NETCDF4 files than for NETCDF3 files, but it is still almost a factor of 10 for the NETCDF3 files when I run the script above.
That seems to be the same problem.
You are probably better off calling nc_get_vara and then doing the stride subsetting yourself.
Are you saying the python interface should not try to use nc_get_vars at all?
Using nc_get_vars is very slow compared to nc_get_vara. So you get to choose between a slow strided read in the C library and reading extra data with nc_get_vara and then subsetting it in memory.
I'd be inclined to leave it to client code to do something like the following instead.
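For instance (using the names from the example further down, where nf is an open Dataset with a 4-D variable 'var'):

```python
# Fast: one contiguous read (nc_get_vara under the hood), then stride in numpy.
q = nf['var'][:][:, ::2, ::2, ::2]

# Slow: a strided read in the C library (nc_get_vars).
q = nf['var'][:, ::2, ::2, ::2]
```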
Perhaps we could issue a warning when nc_get_vars is called with NETCDF4 or NETCDF4_CLASSIC files.
I would suggest changing that line. The change seems to work well for me, and it is still possible for someone who wants to use a slice object to do that in the usual way. Using a slice object with a stride of 1 seems to be slightly faster than using an array for the same operation, but the speed penalty when the stride differs from 1 is much larger, also for NETCDF3 classic files on my computer.
Why should it, when nc_get_vars is (sometimes at least) much slower than the already existing python code that doesn't use it?
nc_get_vara can only be used when the strides are all 1. If the strides are not 1, then we would have to add extra code to subset the returned data. nc_get_vars handles this automatically.
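A minimal sketch (hypothetical helper, 1-D case only) of what that extra subsetting code would amount to:

```python
def strided_read(var, start, stop, step):
    """Emulate var[start:stop:step] via a contiguous read plus numpy slicing.

    var[start:stop] maps to a single nc_get_vara call; the unwanted elements
    are then discarded in memory, rather than asking the C library for a
    strided read with nc_get_vars."""
    block = var[start:stop]  # contiguous read: nc_get_vara
    return block[::step]     # subset in numpy
```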
I think the reason your suggested change speeds things up is that it is (incorrectly) calling nc_get_vara when it should be calling nc_get_vars. You'll see that when you run the tests, some fail. Of course using a slice with stride 1 is faster, but the whole point is that sometimes you want a stride of 2 (as in your original example). In this case, Dennis is suggesting to go ahead and use a stride of 1, and then throw away every other element.
@jswhit is correct. Sorry for being unclear. The tradeoff is that instead of reading only the requested elements, you read the enclosing contiguous block and discard the unwanted elements in memory.
I don't understand your answers. netCDF4.Dataset is supposed to work when the index is a sequence of indices, for example [0, 4, 2, 10] or np.array([0, 1, 2, 10]). The change of code I suggested above just says that [0, 3, 6] should be handled by the same algorithm that handles [0, 3, 5]. Why should that fail?
It does work, it's just slow, since the code will call nc_get_vars when the sequence of integers is evenly spaced (i.e. it fits into a strided read). I gather you are suggesting that we not call nc_get_vars, and instead call nc_get_vara multiple times with a stride of 1. If you have a particular workaround in mind, why don't you create a pull request that we can look at?
Given that some people work with datasets larger than can fit in memory, having strided access that only pulls in data that does fit is an important use case. So if I have a 1000 x 5000 x 6000 variable (111 GiB as a float32 array), I need to be able to grab a strided subset, say var[::20, ::10, ::10], and not have that fail due to memory (the request is only 57 MiB). I don't see any way to allow this use case if netCDF4-python does the striding itself after reading the entire variable.
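One way to keep such a request memory-bounded without nc_get_vars is to loop over the strided outer indices, so only one contiguous slab is resident at a time. A sketch with hypothetical names (np.s_ just builds the inner slice tuple):

```python
import numpy as np

def strided_outer(var, outer_step, inner):
    """Serve var[::outer_step] with inner striding, one outer index at a time.

    Each var[i] is a contiguous read of a single slab (about 120 MB for the
    5000 x 6000 float32 example above), strided in memory before the next
    slab is read, so the full 111 GiB variable is never resident."""
    parts = [var[i][inner] for i in range(0, var.shape[0], outer_step)]
    return np.stack(parts)

# subset = strided_outer(nf['var'], 20, np.s_[::10, ::10])
```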
I wrote pull request #681, which I believe solves the issue. It works as expected in my tests.
Pull request #681 does not convert sequences of indices to slices unless step=1, and it does not solve the problem when a strided slice is specified to begin with (e.g. 0:10:2). Also, some of the tests are failing - those that check that the conversion of sequences to slices is occurring. They should be easy to modify, though.
Pull request #683 builds on #681 by also converting strided slices (i.e. slices with step > 1). I want to be very careful with this, since the indexing code is complicated and fragile.
I removed the threshold, so that nc_get_vars is never called, since tests showed that even 10,000 calls to nc_get_vara were faster than a single call to nc_get_vars.
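A hypothetical reconstruction of the kind of heuristic that was removed (names and threshold value are illustrative, not the actual code):

```python
MAX_GET_VARA_CALLS = 100  # illustrative cutoff, since removed

def read_strategy(num_contiguous_reads):
    """Pick the C-library call for a strided request: fall back to a single
    nc_get_vars only when emulating the stride would need too many
    nc_get_vara calls."""
    if num_contiguous_reads > MAX_GET_VARA_CALLS:
        return 'nc_get_vars'  # single strided read in the C library
    return 'nc_get_vara'      # several contiguous reads + numpy subsetting
```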
Impressive. I need to see if I can implement this same strategy directly in the C library.
That was what I thought. nc_get_vars is always too slow with strides larger than 1.
Here's an example where using nc_get_vars is dramatically slower:

```python
import netCDF4
from numpy.testing import assert_array_almost_equal
import time
import numpy as np

file_ = 'test_slice.nc'
format_ = 'NETCDF4_CLASSIC'
x = 20
y = 20
z = 20
t = 100

# create a test file with one 4-D variable
nf = netCDF4.Dataset(file_, 'w', format=format_)
td = nf.createDimension('time', None)
xd = nf.createDimension('xc', x)
yd = nf.createDimension('yc', y)
zd = nf.createDimension('zc', z)
var = nf.createVariable('var', 'd', ('time', 'zc', 'yc', 'xc'))
data = np.random.rand(t, z, y, x)
data2 = data[:, ::2, ::2, ::2]
var[:] = data
nf.close()

nf = netCDF4.Dataset(file_, 'r')
t1 = time.time()
q = nf['var'][:][:, ::2, ::2, ::2]  # read everything, stride in memory
t2 = time.time()
print 'read everything, then slice with stride 2: ', t2 - t1
assert_array_almost_equal(q, data2)

nf['var'].use_nc_get_vars(True)
t1 = time.time()
q = nf['var'][:, ::2, ::2, ::2]  # slice with stride 2
t2 = time.time()
print 'read slice with stride 2 (nc_get_vars): ', t2 - t1
assert_array_almost_equal(q, data2)

nf['var'].use_nc_get_vars(False)
t1 = time.time()
q = nf['var'][:, ::2, ::2, ::2]  # slice with stride 2
t2 = time.time()
print 'read slice with stride 2 (nc_get_vara): ', t2 - t1
assert_array_almost_equal(q, data2)
nf.close()
```

On my mac, the nc_get_vars read is far slower than the other two.
The counterexample can be seen by changing
If there are no more comments on this, I will merge over the weekend.
extend fix for issue #680 to include strided slices
I don't think this is completely resolved, as I am still seeing this on v1.3.1. I originally reported it in pydata/xarray#2004, where we found that using xarray's h5netcdf engine showed no performance issues, while using the netcdf4 engine had massively bad performance.
@DennisHeimbigner, I just noticed that there is no native strided read in the NETCDF4 layer of the C library.
It has been on our radar for a while to implement vars using the HDF5 equivalent.
Reopening this issue so we can track the progress of Unidata/netcdf-c#908. Once this is fixed in netcdf-c, we can re-enable the use of nc_get_vars.
A fix for this was merged into netcdf-c master, so the workaround is now disabled if the library version >= 4.6.2 (pull request #805).
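A sketch of that kind of version gate (netCDF4 exposes the C library version as __netcdf4libversion__; the exact attribute and comparison used by the fix may differ):

```python
import netCDF4
from distutils.version import LooseVersion

# Use the C library's strided reads only where nc_get_vars is fast
# (the fix landed in netcdf-c 4.6.2); otherwise keep the nc_get_vara
# workaround.
has_fast_vars = (LooseVersion(netCDF4.__netcdf4libversion__) >=
                 LooseVersion('4.6.2'))
```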
@fredrik-1's initial test code now produces similar timings for all index types, so the performance degradation for NETCDF4 and NETCDF4_CLASSIC appears to be gone.