
Performance for subsetting a gridpoint is very slow #36

Closed
davidcaron opened this issue May 21, 2019 · 35 comments

@davidcaron
Collaborator

davidcaron commented May 21, 2019

Description

Subsetting the BCCAQv2 datasets along the time dimension takes 5-10 minutes for a single file. There are 270 files, so the processing is needlessly long for this simple operation.
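For context, the operation in question is essentially pulling the full daily time series at one grid cell, along these lines (a rough sketch; the file name and indices are placeholders, not the actual benchmark):

import xarray as xr

# Placeholder file name; the real inputs are the ~60 GB BCCAQv2 daily files.
ds = xr.open_dataset("tasmin_day_BCCAQv2_example.nc", chunks={"time": 365})

# Read the full 1950-2100 daily series at a single (lat, lon) index.
series = ds["tasmin"].isel(lat=250, lon=500).load()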

While modifying the original data is not desired, here are the proposed solutions:

  • Don't do anything, keep using the original data
  • Re-chunk the data
  • Re-align the data so that the time dimension is last

Related to #33

@tlogan2000
Collaborator

tlogan2000 commented May 23, 2019

I am creating a few re-chunked files for us to test. I will keep the original packing but write them to disk chunked in both space and time.
The first validation will be to ensure the new data is identical to the original.

# rechunk test
import glob
import os
import time

import xarray as xr

# path_to_bccaqv2 and outpath must point to the raw BCCAQv2 files and the output directory
datasets = sorted(glob.glob(os.path.join(path_to_bccaqv2, '*tasmin*.nc')))
print("starting")

# on-disk chunk sizes for the (time, lat, lon) dimensions
chunksizes = [365, 50, 56]
comp = dict(chunksizes=chunksizes)

for dataset in datasets[0:2]:
    t = time.perf_counter()
    # to keep the original 'packing' (scale/offset integers) use mask_and_scale=False
    ds = xr.open_dataset(dataset, mask_and_scale=False, decode_times=False, chunks=dict(time=365))
    encoding = {var: comp for var in ds.data_vars}
    encoding['time'] = dict(dtype='single')
    out_fn = os.path.join(outpath, "ReChunk_Test_packed_{n}.nc".format(n=os.path.basename(dataset)))
    ds.to_netcdf(out_fn, encoding=encoding)
    print(f"{time.perf_counter() - t:.2f} seconds")

@tomLandry

Ok, so I'll fill in with more info on our daily planning for this. The idea is to get a sample script today (it seems you're working on this, much appreciated) so we can run it on one of our machines. That yields the baseline to compute throughput. We then scale the number of machines. The idea is to hit "play" on the "distributed" script on Friday, and come back on Monday with a shiny, new, speedy dataset ready to use at CRIM.

@tlogan2000
Collaborator

On my end, I am not sure exactly how to rewrite a transposed dataset (the realigned option).
@huard Do you have a feel for what exactly the difference would be between this and re-chunking as time: -1, lon: 1, lat: 1?

My feeling is that the in-between strategy (chunked in both space and time) will provide the most flexible format for anything else we may wish to do in a next phase (subset_bbox, subset_features, index calculation on relatively large zones, etc.). It will undoubtedly be slower for the gridcell download, but if the performance hit isn't too bad it could be an interesting option.
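For reference, a rough sketch (untested, placeholder file names) of what the time: -1, lon: 1, lat: 1 layout mentioned above could look like with the same xarray encoding approach as the script earlier:

import xarray as xr

# Placeholder input/output names; keep the original packing as in the script above.
ds = xr.open_dataset("tasmin_original.nc", mask_and_scale=False, decode_times=False,
                     chunks={"time": 365})

ntime = ds.sizes["time"]
# One chunk spanning the whole time axis, 1x1 in space, matching the
# (time, lat, lon) dimension order of the BCCAQv2 variables.
encoding = {var: {"chunksizes": [ntime, 1, 1]} for var in ds.data_vars}

ds.to_netcdf("rechunked_time_series_per_cell.nc", encoding=encoding)

Such a layout would make single-gridcell reads very cheap but any spatial slice expensive, which is the trade-off behind preferring the in-between (space and time) chunking.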

@tomLandry

Another strategy would be to hit CRIM's data repo for grid cells and small regions, then switch to the Ouranos server for larger regions. It would be relatively straightforward to measure the sweet spot.

@davidcaron
Collaborator Author

I think realigning the whole dataset is a much more complicated option. I made a test realigned dataset with ncpdq, using something like this:

ncpdq -a lat,lon,time tasmin_day_BCCAQv2+ANUSPLIN300_inmcm4_historical+rcp85_r1i1p1_19500101-21001231.nc reordered.nc

You need to have the RAM available (60 GB) and it takes around 1 hour. Not impossible for 270 files of 60 GB each, but very hard to do in one weekend. The gridcell subset is extremely fast though: between 0.01 and 0.2 seconds.

We need to explore re-chunking first.

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

Is there a way to 'unchunk' the data? If we can do that and arrive at the exact same hash as the original unchunked file, it would mean we didn't change the data. Or at least we could be very confident about it.
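One way to gain that confidence without bit-identical files (metadata and chunk layout will always differ) would be to hash the stored variable values themselves; a rough sketch, with placeholder file and variable names:

import hashlib
import xarray as xr

def packed_checksum(path, var="tasmin", step=365):
    """Hash the packed (raw on-disk) values of one variable, block by block."""
    h = hashlib.sha256()
    # mask_and_scale=False reads the scaled/offset integers exactly as stored
    ds = xr.open_dataset(path, mask_and_scale=False, decode_times=False, chunks={"time": step})
    for i in range(0, ds.sizes["time"], step):
        h.update(ds[var].isel(time=slice(i, i + step)).values.tobytes())
    return h.hexdigest()

# Identical hashes mean the rechunking did not alter the stored values.
print(packed_checksum("original.nc") == packed_checksum("rechunked.nc"))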

@tlogan2000
Collaborator

FYI, there was a small glitch in my chunking script: it was outputting files as .nc.nc.
Use the modified code below for creating the filename:

out_fn = glob.os.path.join(outpath , "ReChunk_Test_packed_{n}".format(n=glob.os.path.basename(dataset)))
# OLD out_fn = glob.os.path.join(outpath ,"ReChunk_Test_packed_{n}.nc".format(n=glob.os.path.basename(dataset)))
   

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

I'm testing chunking for a 60 GB file. Am I right to say that RAM and CPU aren't used much by the process? RAM seems to hover between 0.5 and 1.2 GB and CPU usage is around 10% on the 4 cores of my computer. That would be good news if we can chunk multiple files at once on the same machine.

@tlogan2000 Roughly how long did it take you to process a single 60 GB file?

Edit: It took 18 minutes on my machine.

@tomLandry

Can you both please remind me of the effect of rechunking on file size? I know that discussion occurred recently on GitHub, but I lost it. Please link it.

@davidcaron
Collaborator Author

Using the settings in the script above, the output file size is 66 GB, so maybe 10% more than the input file size.

I think the data is still stored as scaled/offset integers, right? My guess for the growth in file size is that empty data is inserted to align the chunks in the file.

@tomLandry

tomLandry commented May 23, 2019

Ok, so if what you said can be validated by a peer (D. Byrns, T. Logan), I'd be very, very happy with a 10% increase. And that would be our working scenario to scale.

@tlogan2000
Collaborator

I have the same on my end... the scale/offset seems to still be intact, but with chunked data on disk. I am working on a validation script right now.

@davidcaron
Collaborator Author

We still need to make sure performance is good also.

@tlogan2000
Collaborator

> Can you both please remind me of the effect of rechunking on file size? I know that discussion occurred recently on GitHub, but I lost it. Please link it.

I think that if the chunks get smaller it will likely lead to a larger increase in total file size, but I am not 100% certain.

@tomLandry

So I say we transmit the script @davidcaron has right now to Louis-David, so he can figure out the mapping to our machines. For grid-cell read performance, I expect something like an order of magnitude better, but three times as fast would already be pretty good. For file size increase, I wouldn't start to worry before we hit 1.5x the input dataset size. I need to validate the available space though.

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

It's too early to give it to L-D. I'm running a subset for a single point and it's taking very long... We will need to test more with the chunking.

Edit: Subsetting a single grid point took 5 minutes... I think we can do much better.

@tomLandry

A single grid point, for all files and all RCPs? A single file? What was your subsetting time before?

@davidcaron
Collaborator Author

davidcaron commented May 23, 2019

For a single 60 GB file, so multiply by 270 for all files. It's roughly twice as fast as before... but it's hard to tell since the times are not consistent. I got between 5 and 10 minutes before.

@davidcaron
Collaborator Author

@tlogan2000 should we see the chunks in ncdump -h?

@tlogan2000
Collaborator

tlogan2000 commented May 23, 2019

> @tlogan2000 should we see the chunks in ncdump -h?

Use ncdump -hs; the chunk sizes show up with the variable info.
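A quick check is also possible from Python; a sketch using the netCDF4 library (the file and variable names are placeholders):

import netCDF4

nc = netCDF4.Dataset("ReChunk_Test_packed_tasmin_example.nc")
# chunking() returns 'contiguous' or the list of chunk sizes per dimension
print(nc.variables["tasmin"].chunking())
nc.close()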

@tlogan2000
Collaborator

Here is a first attempt at a validation script testing chunked versus original data.
The script checks reading the data both with mask_and_scale=False (the actual data stored on disk) and with mask_and_scale=True (data converted on the fly using the scale and offset attributes).

Note: to get it running quickly, I am only checking every 500th timestep of the gridded variable data. So far everything seems good, but I have only converted 4 files to test with.

@huard can you see anything I might be missing?

import glob
import os

import numpy as np
import xarray as xr

chunked_path = '{path_to_chunked}/testdata/Chunking_Netcdf/bccaqv2_daily'
orig_path = '{path_to_orig}/pcic/BCCAQv2/'

def test_equal(f, read):
    print('Testing chunked versus original data with mask_and_scale =', read)
    ds_chunk = xr.open_dataset(f, mask_and_scale=read, decode_times=False, chunks=dict(time=365))
    ds_orig = xr.open_dataset(f.replace(chunked_path, orig_path).replace('ReChunk_Test_packed_', ''),
                              mask_and_scale=read, decode_times=False, chunks=dict(time=365))

    # coordinates and time metadata must match exactly
    np.testing.assert_array_equal(ds_chunk.time, ds_orig.time)
    assert ds_chunk.time.calendar == ds_orig.time.calendar
    assert ds_chunk.time.units == ds_orig.time.units
    np.testing.assert_array_equal(ds_chunk.lon, ds_orig.lon)
    np.testing.assert_array_equal(ds_chunk.lat, ds_orig.lat)

    for v in list(ds_chunk.data_vars):
        # verify variable attributes (includes scale and offset values when read=False)
        for key in ds_chunk[v].attrs.keys():
            assert ds_chunk[v].attrs[key] == ds_orig[v].attrs[key]

        # check every 500th time step for now
        for i in np.arange(0, len(ds_chunk.time), 500):
            np.testing.assert_array_equal(ds_chunk[v][i, :, :], ds_orig[v][i, :, :])

    print('Chunked and original data are identical')

test_files = glob.glob(os.path.join(chunked_path, '*.nc'))
for f in test_files:
    print('\n\n', f)
    # test both the on-the-fly converted (True) and raw packed (False) reads
    for read in [True, False]:
        test_equal(f, read)

@tlogan2000
Collaborator

> It's too early to give it to L-D. I'm running a subset for a single point and it's taking very long... We will need to test more with the chunking.
>
> Edit: Subsetting a single grid point took 5 minutes... I think we can do much better.

We could try making the lon/lat chunks smaller and possibly increasing the time chunk size?

@davidcaron
Collaborator Author

Good news, I made a silly mistake!

Subsetting a single gridpoint takes between 3 and 5 seconds! 100 times faster!

It's slightly faster when using smaller chunk sizes, and somehow the file size is almost identical to the original. I used [128, 16, 32] for the chunk sizes.
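For illustration, the kind of timing being measured looks roughly like this (a sketch; the file name and coordinates are placeholders, not the actual benchmark code):

import time

import xarray as xr

ds = xr.open_dataset("ReChunk_Test_packed_tasmin_example.nc", decode_times=False,
                     chunks={"time": 365})

t0 = time.perf_counter()
# Pull the full time series at the grid cell nearest to a point of interest.
series = ds["tasmin"].sel(lon=-73.5, lat=45.5, method="nearest").load()
print(f"gridpoint subset took {time.perf_counter() - t0:.2f} s")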

@tomLandry

Come on man. If you were really trying, we'd be at < 1 sec. But I guess we'll settle for ~4 seconds. :-) Seriously, super cool.

@tlogan2000
Collaborator

I'm going for lunch!

@davidcaron
Collaborator Author

Back-of-the-napkin estimate: if we can do 4 files at a time, it would take around 24 hours on a single computer (270 files at roughly 20 minutes each, 4 at a time, works out to about 22 hours).

But we can do better than that using faster drives and more machines processing at the same time. Still very realistic for a weekend.

@tlogan2000
Collaborator

tlogan2000 commented May 23, 2019

> I'm testing chunking for a 60 GB file. Am I right to say that RAM and CPU aren't used much by the process? RAM seems to hover between 0.5 and 1.2 GB and CPU usage is around 10% on the 4 cores of my computer. That would be good news if we can chunk multiple files at once on the same machine.
>
> @tlogan2000 Roughly how long did it take you to process a single 60 GB file?
>
> Edit: It took 18 minutes on my machine.

I'm pretty sure mine was under 5 minutes... I'll double-check.
Note however that I set up a dask distributed client with 5 workers (7 cores each) to do the heavy lifting, so in the end it may not be much different if you can run 4 files simultaneously:

from distributed import Client

# 5 worker processes with 7 threads each; diagnostics dashboard on port 8788
client = Client(n_workers=5, threads_per_worker=7, diagnostics_port=8788)

Edit: slightly longer this time, around 7 min. It seems to vary a bit.

@davidcaron
Collaborator Author

My last timing was less than 15 minutes, without configuring the distributed library. Do you know if specifying the chunks parameter uses multiple cores, without configuring the threads and workers like you did above?

FYI, I made a small repo for the processing: https://github.com/davidcaron/bccaqv2-processing

@tlogan2000
Collaborator

Yes, if you set chunks when creating the dataset object, xarray/dask should still use its default multithreaded scheduler (but it is not distributed and not multiprocess, i.e. no multiple workers).

See the docs here:
https://docs.dask.org/en/latest/scheduling.html
https://distributed.dask.org/en/latest/

To be honest I am unsure of when it is more beneficial to use one or the other.
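As a rough illustration of the difference (a sketch with a placeholder file name): without a distributed Client, dask falls back to its default threaded scheduler for the chunked arrays; creating a Client switches subsequent computations to the distributed workers.

import xarray as xr
from distributed import Client

# Opening with chunks makes the variables lazy, dask-backed arrays.
ds = xr.open_dataset("tasmin_example.nc", chunks={"time": 365})

# No Client yet: compute() runs on dask's default multi-threaded scheduler.
mean_threaded = ds["tasmin"].mean().compute()

# With a Client, the same compute() is sent to distributed worker processes.
client = Client(n_workers=4, threads_per_worker=2)
mean_distributed = ds["tasmin"].mean().compute()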

@davidcaron
Collaborator Author

Using a chunk size of [256, 16, 16], slicing in lat-lon takes 2.5 seconds, and slicing along the time dimension takes less than 2 seconds. This seems like a good balance.

@tlogan2000
Collaborator

@davidcaron @tomLandry CCCS would like to reduce the number of files downloaded to a specific set of 24 files. Here is some code that should correctly filter the raw BCCAQv2 files:

import glob
import os

path_to_bccaqv2daily = '{path_to_files}/BCCAQv2'

rcps = ['rcp26', 'rcp45', 'rcp85']
mods = ['BNU-ESM', 'CCSM4', 'CESM1-CAM5', 'CNRM-CM5', 'CSIRO-Mk3-6-0', 'CanESM2',
        'FGOALS-g2', 'GFDL-CM3', 'GFDL-ESM2G', 'GFDL-ESM2M', 'HadGEM2-AO', 'HadGEM2-ES',
        'IPSL-CM5A-LR', 'IPSL-CM5A-MR', 'MIROC-ESM-CHEM', 'MIROC-ESM', 'MIROC5',
        'MPI-ESM-LR', 'MPI-ESM-MR', 'MRI-CGCM3', 'NorESM1-M', 'NorESM1-ME',
        'bcc-csm1-1-m', 'bcc-csm1-1']

vari = ['pr', 'tasmin', 'tasmax']

for v in vari:
    files = {}

    for r in rcps:
        files[r] = []
        for m in mods:
            # expect exactly one r1i1p1 file per variable / model / rcp combination
            ncs = glob.glob(os.path.join(path_to_bccaqv2daily, "*{v}*{m}_*{r}*r1i1p1*.nc".format(v=v, r=r, m=m)))
            if len(ncs) > 1:
                raise Exception('Too many files for model / rcp combination. Expected a single file')
            files[r].extend(ncs)

        print(v, r, len(files[r]))
        if len(files[r]) != 24:
            raise Exception('Expected to find 24 .nc files')

@davidcaron
Collaborator Author

To be clear, is it 24 files or 24 models? I see 24 models, 3 variables and 3 RCPs. So is it 24 * 3 * 3 = 216 files?

@tlogan2000
Collaborator

Yes, you are right. Sorry, it is 24 files per combination of variable and RCP, like you said.

@huard
Collaborator

huard commented Jun 26, 2019

@tlogan2000 @davidcaron Can this be closed?

@tlogan2000
Collaborator

OK for me.
