
Performance for subsetting a gridpoint is very slow #36

Closed
davidcaron opened this issue May 21, 2019 · 35 comments

@davidcaron
Collaborator

davidcaron commented May 21, 2019

Description

Subsetting the BCCAQv2 datasets along the time dimension takes 5-10 minutes for a single file. There are 270 files, so the processing is needlessly long for this simple operation.
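For context, the operation in question is essentially pulling the full daily time series at one grid cell, along these lines (a rough sketch; the file name and indices are placeholders, not the actual benchmark):

import xarray as xr

# Placeholder file name; the real inputs are the ~60 GB BCCAQv2 daily files.
ds = xr.open_dataset("tasmin_day_BCCAQv2_example.nc", chunks={"time": 365})

# Read the full 1950-2100 daily series at a single (lat, lon) index.
series = ds["tasmin"].isel(lat=250, lon=500).load()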

While modifying the original data is not desired, here are the proposed solutions:

  • Don't do anything, keep using the original data
  • Re-chunk the data
  • Re-align the data so that the time dimension is last

Related to #33

@tlogan2000
Collaborator

tlogan2000 commented May 23, 2019

I am creating a few re-chunked files for us to test. I will keep the original packing but write them to disk chunked in both space and time.
The first validation will be to ensure the new data is identical to the original.

# rechunk test
import glob
import os
import time

import xarray as xr

# path_to_bccaqv2 and outpath must point to the raw BCCAQv2 files and the output directory
datasets = sorted(glob.glob(os.path.join(path_to_bccaqv2, '*tasmin*.nc')))
print("starting")

# on-disk chunk sizes for the (time, lat, lon) dimensions
chunksizes = [365, 50, 56]
comp = dict(chunksizes=chunksizes)

for dataset in datasets[0:2]:
    t = time.perf_counter()
    # to keep the original 'packing' (scale/offset integers) use mask_and_scale=False
    ds = xr.open_dataset(dataset, mask_and_scale=False, decode_times=False, chunks=dict(time=365))
    encoding = {var: comp for var in ds.data_vars}
    encoding['time'] = dict(dtype='single')
    out_fn = os.path.join(outpath, "ReChunk_Test_packed_{n}.nc".format(n=os.path.basename(dataset)))
    ds.to_netcdf(out_fn, encoding=encoding)
    print(f"{time.perf_counter() - t:.2f} seconds")

@tomLandry

Ok, so I'll fill in with more info on our daily planning for this. The idea is to get a sample script today (it seems you're working on this, much appreciated) so we can run it on one of our machines. That yields the baseline to compute throughput. We then scale the number of machines. The idea is to hit "play" on the "distributed" script on Friday, and come back on Monday with a shiny, new, speedy dataset ready to use at CRIM.

@tlogan2000
Collaborator

On my end, I am not sure exactly how to rewrite a transposed dataset (the realigned option).
@huard Do you have a feel for what exactly the difference would be between this and re-chunking as time: -1, lon: 1, lat: 1?

My feeling is that the in-between strategy (chunked in both space and time) will provide the most flexible format for anything else we may wish to do in a next phase (subset_bbox, subset_features, index calculation on relatively large zones, etc.). It will undoubtedly be slower for the gridcell download, but if the performance hit isn't too bad it could be an interesting option.
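For reference, a rough sketch (untested, placeholder file names) of what the time: -1, lon: 1, lat: 1 layout mentioned above could look like with the same xarray encoding approach as the script earlier:

import xarray as xr

# Placeholder input/output names; keep the original packing as in the script above.
ds = xr.open_dataset("tasmin_original.nc", mask_and_scale=False, decode_times=False,
                     chunks={"time": 365})

ntime = ds.sizes["time"]
# One chunk spanning the whole time axis, 1x1 in space, matching the
# (time, lat, lon) dimension order of the BCCAQv2 variables.
encoding = {var: {"chunksizes": [ntime, 1, 1]} for var in ds.data_vars}

ds.to_netcdf("rechunked_time_series_per_cell.nc", encoding=encoding)

Such a layout would make single-gridcell reads very cheap but any spatial slice expensive, which is the trade-off behind preferring the in-between (space and time) chunking.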

@tomLandry

Another strategy would be to hit CRIM's data repo for grid cells and small regions, then switch to the Ouranos server for larger regions. It would be relatively straightforward to measure the sweet spot.

@davidcaron
Collaborator Author

I think realigning the whole dataset is a much more complicated option. I made a test realigned dataset with ncpdq, using something like this:

ncpdq -a lat,lon,time tasmin_day_BCCAQv2+ANUSPLIN300_inmcm4_historical+rcp85_r1i1p1_19500101-21001231.nc reordered.nc

You need to have the RAM available (60 GB) and it takes around 1 hour. Not impossible for 270 files of 60 GB each, but very hard to do in one weekend. The gridcell subset is extremely fast though: between 0.01 and 0.2 seconds.

We need to explore re-chunking first.

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

Is there a way to 'unchunk' the data? If we can do that and arrive at the exact same hash as the original unchunked file, it would mean we didn't change the data. Or at least we could be very confident about it.
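One way to gain that confidence without bit-identical files (metadata and chunk layout will always differ) would be to hash the stored variable values themselves; a rough sketch, with placeholder file and variable names:

import hashlib
import xarray as xr

def packed_checksum(path, var="tasmin", step=365):
    """Hash the packed (raw on-disk) values of one variable, block by block."""
    h = hashlib.sha256()
    # mask_and_scale=False reads the scaled/offset integers exactly as stored
    ds = xr.open_dataset(path, mask_and_scale=False, decode_times=False, chunks={"time": step})
    for i in range(0, ds.sizes["time"], step):
        h.update(ds[var].isel(time=slice(i, i + step)).values.tobytes())
    return h.hexdigest()

# Identical hashes mean the rechunking did not alter the stored values.
print(packed_checksum("original.nc") == packed_checksum("rechunked.nc"))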

@tlogan2000
Collaborator

FYI, there was a small glitch in my chunking script: it was outputting files as .nc.nc.
Use the modified code below for creating the filename:

out_fn = glob.os.path.join(outpath , "ReChunk_Test_packed_{n}".format(n=glob.os.path.basename(dataset)))
# OLD out_fn = glob.os.path.join(outpath ,"ReChunk_Test_packed_{n}.nc".format(n=glob.os.path.basename(dataset)))
   

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

I'm testing chunking for a 60 GB file. Am I right to say that RAM and CPU aren't used much by the process? RAM seems to hover between 0.5 and 1.2 GB and CPU usage is around 10% on the 4 cores of my computer. That would be good news if we can chunk multiple files at once on the same machine.

@tlogan2000 Roughly how long did it take you to process a single 60 GB file?

Edit: It took 18 minutes on my machine.

@tomLandry

Can you both please remind me of the effect of rechunking on file size? I know that discussion occurred recently on GitHub, but I lost it. Please link it.

@davidcaron
Collaborator Author

Using the settings in the script above, the output file size is 66 GB, so maybe 10% more than the input file size.

I think the data is still stored as scaled/offset integers, right? My guess for the growth in file size is that empty data is inserted to align the chunks in the file.

@tomLandry

tomLandry commented May 23, 2019

Ok, so if what you said can be validated by a peer (D. Byrns, T. Logan), I'd be very, very happy with a 10% increase. And that would be our working scenario to scale.

@tlogan2000
Collaborator

I have the same on my end... the scale/offset seems to still be intact, but with chunked data on disk. I am working on a validation script right now.

@davidcaron
Collaborator Author

We still need to make sure performance is good also.

@tlogan2000
Collaborator

> Can you both please remind me of the effect of rechunking on file size? I know that discussion occurred recently on GitHub, but I lost it. Please link it.

I think that if the chunks get smaller it will likely lead to a larger increase in total file size, but I am not 100% certain.

@tomLandry

So I say we transmit the script @davidcaron has right now to Louis-David, so he can figure out the mapping to our machines. For grid-cell read performance, I expect something like an order of magnitude better, but three times as fast would already be pretty good. For file size increase, I wouldn't start to worry before we hit 1.5x the input dataset size. I need to validate the available space though.

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

It's too early to give it to L-D. I'm running a subset for a single point and it's taking very long... We will need to test more with the chunking.

Edit: Subsetting a single grid point took 5 minutes... I think we can do much better.

@tomLandry

A single grid point, for all files and all RCPs? A single file? What was your subsetting time before?

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

For a single 60 GB file, so multiply by 270 for all files. It's roughly twice as fast as before... but it's hard to tell since the times are not consistent. I got between 5 and 10 minutes before.

@davidcaron
Collaborator Author

@tlogan2000 should we see the chunks in ncdump -h?

@tlogan2000
Collaborator

tlogan2000 commented May 23, 2019

> @tlogan2000 should we see the chunks in ncdump -h?

Use ncdump -hs; the chunk sizes show up with the variable info.
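A quick check is also possible from Python; a sketch using the netCDF4 library (the file and variable names are placeholders):

import netCDF4

nc = netCDF4.Dataset("ReChunk_Test_packed_tasmin_example.nc")
# chunking() returns 'contiguous' or the list of chunk sizes per dimension
print(nc.variables["tasmin"].chunking())
nc.close()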

@tlogan2000
Collaborator

Here is a first attempt at a validation script testing chunked versus original data.
The script checks reading the data both with mask_and_scale=False (the actual data stored on disk) and with mask_and_scale=True (data converted on the fly using the scale and offset attributes).

Note: to get it running quickly, I am only checking every 500th timestep of the gridded variable data. So far everything seems good, but I have only converted 4 files to test with.

@huard can you see anything I might be missing?

import glob
import os

import numpy as np
import xarray as xr

chunked_path = '{path_to_chunked}/testdata/Chunking_Netcdf/bccaqv2_daily'
orig_path = '{path_to_orig}/pcic/BCCAQv2/'

def test_equal(f, read):
    print('Testing chunked versus original data with mask_and_scale =', read)
    ds_chunk = xr.open_dataset(f, mask_and_scale=read, decode_times=False, chunks=dict(time=365))
    ds_orig = xr.open_dataset(f.replace(chunked_path, orig_path).replace('ReChunk_Test_packed_', ''),
                              mask_and_scale=read, decode_times=False, chunks=dict(time=365))

    # coordinates and time metadata must match exactly
    np.testing.assert_array_equal(ds_chunk.time, ds_orig.time)
    assert ds_chunk.time.calendar == ds_orig.time.calendar
    assert ds_chunk.time.units == ds_orig.time.units
    np.testing.assert_array_equal(ds_chunk.lon, ds_orig.lon)
    np.testing.assert_array_equal(ds_chunk.lat, ds_orig.lat)

    for v in list(ds_chunk.data_vars):
        # verify variable attributes (includes scale and offset values when read=False)
        for key in ds_chunk[v].attrs.keys():
            assert ds_chunk[v].attrs[key] == ds_orig[v].attrs[key]

        # check every 500th time step for now
        for i in np.arange(0, len(ds_chunk.time), 500):
            np.testing.assert_array_equal(ds_chunk[v][i, :, :], ds_orig[v][i, :, :])

    print('Chunked and original data are identical')

test_files = glob.glob(os.path.join(chunked_path, '*.nc'))
for f in test_files:
    print('\n\n', f)
    # test both the on-the-fly converted (True) and raw packed (False) reads
    for read in [True, False]:
        test_equal(f, read)

@tlogan2000
Collaborator

> It's too early to give it to L-D. I'm running a subset for a single point and it's taking very long... We will need to test more with the chunking.
>
> Edit: Subsetting a single grid point took 5 minutes... I think we can do much better.

We could try making the lon/lat chunks smaller and possibly increasing the time chunk size?

@davidcaron
Collaborator Author

Good news, I made a silly mistake!

Subsetting a single gridpoint takes between 3 and 5 seconds! 100 times faster!

It's slightly faster when using smaller chunk sizes, and somehow the file size is almost identical to the original. I used [128, 16, 32] for the chunk sizes.
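For illustration, the kind of timing being measured looks roughly like this (a sketch; the file name and coordinates are placeholders, not the actual benchmark code):

import time

import xarray as xr

ds = xr.open_dataset("ReChunk_Test_packed_tasmin_example.nc", decode_times=False,
                     chunks={"time": 365})

t0 = time.perf_counter()
# Pull the full time series at the grid cell nearest to a point of interest.
series = ds["tasmin"].sel(lon=-73.5, lat=45.5, method="nearest").load()
print(f"gridpoint subset took {time.perf_counter() - t0:.2f} s")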

@tomLandry

Come on man. If you were really trying, we'd be at < 1 sec. But I guess we'll settle for ~4 seconds. :-) Seriously, super cool.

@tlogan2000
Collaborator

I'm going for lunch!

@davidcaron
Collaborator Author

Back-of-the-napkin estimate: if we can do 4 files at a time, it would take around 24 hours on a single computer (270 files at roughly 20 minutes each, 4 at a time, works out to about 22 hours).

But we can do better than that using faster drives and more machines processing at the same time. Still very realistic for a weekend.

@tlogan2000
Collaborator

tlogan2000 commented May 23, 2019

> I'm testing chunking for a 60 GB file. Am I right to say that RAM and CPU aren't used much by the process? RAM seems to hover between 0.5 and 1.2 GB and CPU usage is around 10% on the 4 cores of my computer. That would be good news if we can chunk multiple files at once on the same machine.
>
> @tlogan2000 Roughly how long did it take you to process a single 60 GB file?
>
> Edit: It took 18 minutes on my machine.

I'm pretty sure mine was under 5 minutes... I'll double-check.
Note however that I set up a dask distributed client with 5 workers (7 cores each) to do the heavy lifting, so in the end it may not be much different if you can run 4 files simultaneously:

from distributed import Client

# 5 worker processes with 7 threads each; diagnostics dashboard on port 8788
client = Client(n_workers=5, threads_per_worker=7, diagnostics_port=8788)

Edit: slightly longer this time, around 7 min. It seems to vary a bit.

@davidcaron
Collaborator Author

My last timing was less than 15 minutes, without configuring the distributed library. Do you know if specifying the chunks parameter uses multiple cores, without configuring the threads and workers like you did above?

FYI, I made a small repo for the processing: https://github.com/davidcaron/bccaqv2-processing

@tlogan2000
Collaborator

Yes, if you set chunks when creating the dataset object, xarray/dask should still use its default multithreaded scheduler (but it is not distributed and not multiprocess, i.e. no multiple workers).

See the docs here:
https://docs.dask.org/en/latest/scheduling.html
https://distributed.dask.org/en/latest/

To be honest I am unsure of when it is more beneficial to use one or the other.
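As a rough illustration of the difference (a sketch with a placeholder file name): without a distributed Client, dask falls back to its default threaded scheduler for the chunked arrays; creating a Client switches subsequent computations to the distributed workers.

import xarray as xr
from distributed import Client

# Opening with chunks makes the variables lazy, dask-backed arrays.
ds = xr.open_dataset("tasmin_example.nc", chunks={"time": 365})

# No Client yet: compute() runs on dask's default multi-threaded scheduler.
mean_threaded = ds["tasmin"].mean().compute()

# With a Client, the same compute() is sent to distributed worker processes.
client = Client(n_workers=4, threads_per_worker=2)
mean_distributed = ds["tasmin"].mean().compute()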

@davidcaron
Collaborator Author

Using a chunk size of [256, 16, 16], slicing in lat-lon takes 2.5 seconds, and slicing along the time dimension takes less than 2 seconds. This seems like a good balance.

@tlogan2000
Collaborator

@davidcaron @tomLandry CCCS would like to reduce the number of files downloaded to a specific set of 24 files. Here is some code that should correctly filter the raw BCCAQv2 files:

import glob
import os

path_to_bccaqv2daily = '{path_to_files}/BCCAQv2'

rcps = ['rcp26', 'rcp45', 'rcp85']
mods = ['BNU-ESM', 'CCSM4', 'CESM1-CAM5', 'CNRM-CM5', 'CSIRO-Mk3-6-0', 'CanESM2',
        'FGOALS-g2', 'GFDL-CM3', 'GFDL-ESM2G', 'GFDL-ESM2M', 'HadGEM2-AO', 'HadGEM2-ES',
        'IPSL-CM5A-LR', 'IPSL-CM5A-MR', 'MIROC-ESM-CHEM', 'MIROC-ESM', 'MIROC5',
        'MPI-ESM-LR', 'MPI-ESM-MR', 'MRI-CGCM3', 'NorESM1-M', 'NorESM1-ME',
        'bcc-csm1-1-m', 'bcc-csm1-1']

vari = ['pr', 'tasmin', 'tasmax']

for v in vari:
    files = {}

    for r in rcps:
        files[r] = []
        for m in mods:
            # expect exactly one r1i1p1 file per variable / model / rcp combination
            ncs = glob.glob(os.path.join(path_to_bccaqv2daily, "*{v}*{m}_*{r}*r1i1p1*.nc".format(v=v, r=r, m=m)))
            if len(ncs) > 1:
                raise Exception('Too many files for model / rcp combination. Expected a single file')
            files[r].extend(ncs)

        print(v, r, len(files[r]))
        if len(files[r]) != 24:
            raise Exception('Expected to find 24 .nc files')

@davidcaron
Collaborator Author

To be clear, is it 24 files or 24 models? I see 24 models, 3 variables and 3 RCPs. So is it 24 * 3 * 3 = 216 files?

@tlogan2000
Collaborator

Yes, you are right. Sorry, it is 24 files per combination of variable and RCP, like you said.

@huard
Collaborator

huard commented Jun 26, 2019

@tlogan2000 @davidcaron Can this be closed?

@tlogan2000
Collaborator

OK for me.
