Performance for subsetting a gridpoint is very slow #36
I am creating a few re-chunked files for us to test. I will keep the original packing but write the data chunked to disk in both space and time.
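For reference, a minimal sketch of that kind of re-chunked write using xarray. The filenames, variable name, and chunk sizes here are placeholders, not the exact settings used; opening with `mask_and_scale=False` keeps the original scale/offset packing intact:

```python
import xarray as xr

# Open lazily (dask-backed) and without decoding the packed integers,
# so the write below streams the data and preserves the original packing.
ds = xr.open_dataset("tasmax_original.nc", mask_and_scale=False,
                     decode_times=False, chunks={"time": 1024})

# Store the variable in on-disk chunks that balance time-series reads
# (one grid point, all timesteps) against spatial reads.
encoding = {"tasmax": {"chunksizes": (256, 16, 16)}}  # (time, lat, lon)
ds.to_netcdf("tasmax_chunked.nc", encoding=encoding)
```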
Ok, so I'll fill in more info on our daily planning for this. The idea is to get a sample script today (seems you're working on this, much appreciated) so we can run it on one of our machines. That yields the baseline to compute throughput. We then scale the number of machines. The idea is to hit "play" on the "distributed" script on Friday, and come back on Monday with a shiny, new, speedy dataset ready to use at CRIM.
On my end I am not sure exactly how to rewrite a transposed dataset (the realigned option). My feeling is that the in-between strategy (chunked in both space and time) will provide the most flexible format for anything else we may wish to do in a next phase (subset_bbox, subset_features, index calculation on relatively large zones, etc.). It will undoubtedly be slower for the gridcell download, but if the performance hit isn't too bad it could be an interesting option.
Another strategy would be to hit CRIM's data repo for grid-cell and small-region requests, then switch to the Ouranos server for larger regions. It would be relatively straightforward to measure the sweet spot.
I think realigning the whole dataset is a much more complicated option. I made a test realigned dataset with ncpdq, using something like this:
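(For context: ncpdq is the NCO dimension-permutation tool, so the command was presumably of the form `ncpdq -a lat,lon,time in.nc out.nc`, which rewrites the file with time varying fastest on disk. The dimension order and filenames here are my assumption, not the exact invocation, which wasn't captured in the thread.)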
You need to have the available RAM (60 GB) and it takes around 1 hour. Not impossible for 270 files of 60 GB each, but very hard to do in one weekend. The gridcell subset is extremely fast though: between 0.01 and 0.2 seconds. We need to explore re-chunking first.
Is there a way to 'unchunk' the data? If we can do that and arrive at the exact same hash as the original unchunked file, it would mean we didn't change the data, or at least we could be very confident about it.
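A whole-file hash generally won't match after any rewrite (header layout and chunk padding differ), but hashing the packed variable bytes themselves would give that confidence. A minimal sketch, assuming a hypothetical variable name `tasmax` and placeholder paths:

```python
import hashlib

import xarray as xr


def data_digest(path, var="tasmax", step=1000):
    # mask_and_scale=False reads the raw packed integers, so the digest
    # reflects the stored values, not the decoded floats.
    ds = xr.open_dataset(path, mask_and_scale=False, decode_times=False,
                         chunks={"time": step})
    h = hashlib.sha256()
    for i in range(0, ds.sizes["time"], step):
        # Stream the data in time slabs to keep memory bounded on 60 GB files.
        h.update(ds[var].isel(time=slice(i, i + step)).values.tobytes())
    return h.hexdigest()


assert data_digest("original.nc") == data_digest("rechunked.nc")
```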
FYI, there was a small glitch in my chunking script that output files as .nc.nc.
I'm testing chunking for a 60 GB file. Am I right to say that the RAM and CPU aren't used much by the process? RAM seems to hover between 0.5 and 1.2 GB and CPU usage is around 10% on the 4 cores of my computer. That would be good news if we can chunk multiple files at once on the same machine. @tlogan2000 Roughly how long did it take you to process a single 60 GB file? Edit: it took 18 minutes on my machine.
Can you both please remind me of the effect of rechunking on file size? I know that discussion occurred recently on GitHub, but I lost it. Please link it.
Using the settings in the script above, the output file size is 66 GB, so maybe 10% more than the input file size. I think the data is still stored as scale/offset-packed integers, right? My guess for the growth in file size is that empty data is inserted to align the chunks in the file.
Ok, so if what you said can be validated by a peer (D. Byrns, T. Logan), I'd be very, very happy with a 10% increase. And that would be our working scenario to scale.
I have the same on my end... the scale/offset packing seems to still be intact, but with chunked data on disk. I am working on a validation script right now.
We still need to make sure performance is good as well.
I think that if the chunks get smaller, the increase in total file size will likely be larger, but I am not 100% certain.
So I say we transmit the script @davidcaron has right now to Louis-David, so he can figure out the mapping to our machines. For grid-cell read performance, I expect something like an order of magnitude better anyway, but three times as fast would already be pretty good. For the file size increase, I wouldn't start to worry before we hit 1.5× the input dataset size. I need to validate the available disk space, though.
It's too early to give to L-D. I'm running a subset for a single point and it's very long... I will need to test more with the chunking. Edit: subsetting a single grid point took 5 minutes... I think we can do much better.
Single grid point, for all files and all RCPs? Or a single file? What was your subsetting time before?
For a single 60 GB file, so multiply by 270 for all files. It's roughly twice as fast as before... but it's hard to tell as the times are not consistent. I got between 5 and 10 minutes before.
@tlogan2000 should we see the chunks in |
Here is a first attempt at a validation script testing chunked versus original data. Note: to get it running in reasonable time, I am only checking every 500th timestep of the gridded variable data. So far everything seems good, but I have only converted 4 files to test with. @huard can you see anything I might be missing?

```python
import glob
import os

import numpy as np
import xarray as xr

chunked_path = '{path_to_chunked}/testdata/Chunking_Netcdf/bccaqv2_daily'
orig_path = '{path_to_orig}/pcic/BCCAQv2/'


def test_equal(path, read):
    print('Testing chunked versus original data with mask_and_scale =', read)
    ds_chunk = xr.open_dataset(path, mask_and_scale=read, decode_times=False,
                               chunks=dict(time=365))
    ds_orig = xr.open_dataset(path.replace(chunked_path, orig_path).replace('ReChunk_Test_packed_', ''),
                              mask_and_scale=read, decode_times=False,
                              chunks=dict(time=365))
    # Time coordinate, calendar metadata and spatial coordinates must match exactly.
    np.testing.assert_array_equal(ds_chunk.time, ds_orig.time)
    assert ds_chunk.time.calendar == ds_orig.time.calendar
    assert ds_chunk.time.units == ds_orig.time.units
    np.testing.assert_array_equal(ds_chunk.lon, ds_orig.lon)
    np.testing.assert_array_equal(ds_chunk.lat, ds_orig.lat)
    for v in ds_chunk.data_vars:
        # Verify variable attributes (includes scale and offset values when read=False).
        for key in ds_chunk[v].attrs:
            assert ds_chunk[v].attrs[key] == ds_orig[v].attrs[key]
        # Check every 500th time step for now.
        for t in np.arange(0, len(ds_chunk.time), 500):
            np.testing.assert_array_equal(ds_chunk[v][t, :, :], ds_orig[v][t, :, :])
    print('Chunked and original data are identical')


test_files = glob.glob(os.path.join(chunked_path, '*.nc'))
for f in test_files:
    # Run the comparison both on the decoded float values (mask_and_scale=True)
    # and on the raw packed integers (mask_and_scale=False).
    print('\n\n', f)
    for read in [True, False]:
        test_equal(f, read)
```
We could try making the lon/lat chunks smaller and possibly increasing the time chunking?
Good news, I made a silly mistake! Subsetting a single gridpoint takes between 3 and 5 seconds. That's 100 times faster! It's slightly faster when using smaller chunk sizes, and somehow the file size is almost identical to the original. I used
Come on man. If you were really trying, we'd be at < 1 sec. But I guess we'll settle for ~4 seconds. :-) Seriously, super cool.
I'm going for lunch! |
Back-of-the-napkin estimates: if we can do 4 files at a time, it would take around 24 hours on a single computer. But we can do better than that using faster drives and more machines processing at the same time. Still very realistic for a weekend.
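(For reference, using the ~18 min/file timing above: 270 files ÷ 4 in parallel × 18 min ≈ 20 hours of processing, which lines up with the ~24-hour estimate once some overhead is factored in.)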
I'm pretty sure mine was under 5 minutes... I'll double check again.
Edit: slightly longer this time, around 7 minutes. Seems to vary a bit.
My last timing was less than 15 minutes, without configuring the distributed library. Do you know if specifying the
FYI, I made a small repo for the processing: https://github.com/davidcaron/bccaqv2-processing
Yes, if you set it. See docs here:
To be honest, I am unsure of when it is more beneficial to use one or the other.
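The inline snippet and doc link above were lost in formatting; my reading is that the two options being compared are dask's default local (threaded) scheduler, which xarray uses as soon as `chunks=` is passed, and the distributed scheduler. A hedged sketch of selecting each (paths and variable are placeholders):

```python
import xarray as xr
from dask.distributed import Client

# Option 1: default threaded scheduler. Passing chunks= makes the dataset
# dask-backed; computations run on a local thread pool, no setup needed.
ds = xr.open_dataset("tasmax_chunked.nc", chunks={"time": 365})

# Option 2: distributed scheduler. Once a Client exists it becomes the
# default scheduler, adding worker processes and a diagnostics dashboard.
client = Client()
point = ds["tasmax"].sel(lat=49.25, lon=-123.1, method="nearest").compute()
```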
Using a chunk size of [256, 16, 16], slicing in lat-lon takes 2.5 seconds, and slicing along the time dimension takes less than 2 seconds. This seems like a good balance.
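For the record, the two reads being timed look roughly like this (filename, variable, and coordinates are placeholders):

```python
import xarray as xr

ds = xr.open_dataset("tasmax_chunked.nc")  # on-disk chunks of (256, 16, 16)

# Grid-point extraction: every timestep at one lat/lon (~2.5 s).
point = ds["tasmax"].sel(lat=49.25, lon=-123.1, method="nearest").load()

# Time slice: full spatial fields over one year (< 2 s).
slab = ds["tasmax"].sel(time=slice("1971-01-01", "1971-12-31")).load()
```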
@davidcaron @tomLandry CCCS would like to reduce the number of files downloaded to a specific set of 24 files. Here is some code that should correctly filter the raw bccaqv2 files:
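The code block above didn't survive the formatting; a hedged sketch of the kind of filter described, assuming the usual raw BCCAQv2 naming convention (variable, model and RCP embedded in the filename) and a placeholder model list:

```python
import glob
import os

# Placeholder: the real list is the specific set of 24 models chosen by CCCS.
KEEP_MODELS = {"CanESM2", "CNRM-CM5", "GFDL-ESM2G"}  # ... 24 entries in total


def keep(path):
    name = os.path.basename(path)
    # Keep a file only if its name contains one of the selected models.
    return any("_%s_" % model in name for model in KEEP_MODELS)


files = sorted(f for f in glob.glob("/path/to/raw/bccaqv2/*.nc") if keep(f))
```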
To be clear, is it 24 files or 24 models? I see 24 models, 3 variables and 3 RCPs. So is it 24 × 3 × 3 = 216 files?
Yes, you are right. Sorry, it is 24 files per combination of variable and RCP, like you said.
@tlogan2000 @davidcaron Can this be closed?
Ok for me.
Description
Subsetting the BCCAQv2 datasets along the time dimension takes 5-10 minutes for a single file. There are 270 files, so the processing is needlessly long for this simple operation.
While modifying the original data is not desired, here are the proposed solutions:

- re-chunk the files on disk, in both space and time, while keeping the original scale/offset packing;
- re-align (transpose) the dataset with ncpdq so the time dimension is contiguous;
- serve grid-cell and small-region requests from CRIM's data repo, and switch to the Ouranos server for larger regions.
Related to #33