
[WIP] Development #33

Merged: 42 commits merged into master on May 21, 2019

Conversation

davidcaron
Collaborator

Overview

This PR is mostly focused on adding a subset + heat wave frequency process.

Along the way, I tried to split operations into inherited methods to avoid some code duplication. It's not the ideal structure yet, but it should help if we want to extend the functionality to subsetting with bounding boxes and polygons, combined with other indices.

This still needs more testing, as I don't think that the mock tests are enough.

Changes:

  • Added BCCAQv2_heat_wave_frequency_gridpoint process

Todo:
[ ] More testing; I don't think the mock tests are enough.
[ ] The BCCAQv2_heat_wave_frequency_gridpoint process takes much longer than I expected for the full 150 years: more than 10 hours... I think this needs to be improved; I'm open to suggestions.

@davidcaron
Collaborator Author

When subsetting, the RAM and CPU usage is very low, so that's a good point in favor of multithreading.

It seems to take around 5 minutes to subset a single netCDF file covering the 150 years for a single grid cell. And I think a bunch more files were recently added to the BCCAQv2 dataset (there are around 260 files now). To compute the heat wave frequency, we need tasmin and tasmax, so that's 174 files. For the subsetting alone, the estimated time would be 15 hours. I could be surprised again, but I think the heat wave computation will be pretty much insignificant compared to this.
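A minimal sketch of what such parallel subsetting could look like (not part of this PR; OPENDAP_URLS is an assumed list of OPeNDAP URLs gathered from the catalog, and processes are used rather than threads since the netCDF-C library is not thread-safe):

# Hedged sketch only: subset each file in its own process to overlap the
# (mostly I/O-bound) downloads. OPENDAP_URLS and the coordinates are placeholders.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import xarray as xr
from xclim.subset import subset_gridpoint

OUT_DIR = Path("tests/tmp")

def subset_one(args):
    n, url = args
    ds = xr.open_dataset(url)
    out = subset_gridpoint(ds, lon=-70, lat=45)  # example grid point
    out_fn = OUT_DIR / f"temp{n}.nc"
    out.to_netcdf(out_fn)
    return out_fn

if __name__ == "__main__":
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    with ProcessPoolExecutor(max_workers=4) as pool:
        for fn in pool.map(subset_one, enumerate(OPENDAP_URLS)):
            print(fn)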

@davidcaron
Collaborator Author

@huard, @tlogan2000 I think the subset could be downloading way more data than it needs. The processing time for one file (~5 min) is close to the time it takes me to download the whole file in my browser (~10 min).

I'll keep looking into the xclim subset function for a gridpoint, but I would be very interested in knowing what you think.

@huard
Collaborator

huard commented May 16, 2019

Nothing obvious comes to mind. You could try replacing the subset_bbox call by a simple selection by indices: out = ds.isel(lat=0, lon=0) and see if there is a noticeable speed difference.

@davidcaron
Collaborator Author

davidcaron commented May 16, 2019

I'm starting to think this could be related to Ouranosinc/pavics-sdi#106

I'm having a hard time executing this simple script:

from pathlib import Path
import time

from siphon.catalog import TDSCatalog
import xarray as xr
from xclim.subset import subset_gridpoint


catalog = TDSCatalog("https://boreas.ouranos.ca/thredds/catalog/birdhouse/pcic/BCCAQv2/catalog.xml")

print("starting")


for n, dataset in enumerate(catalog.datasets.values()):
    opendap_url = dataset.access_urls["OPENDAP"]
    t = time.perf_counter()
    ds = xr.open_dataset(opendap_url)

    out = ds.isel(lat=0, lon=0)

    out_fn = Path(__file__).parent / "tests" / "tmp" / f"temp{n}.nc"
    out_fn.parent.mkdir(parents=True, exist_ok=True)

    out.to_netcdf(out_fn)

    print(out_fn, f"{time.perf_counter() - t:.2f} seconds")

I load each BCCAQv2 dataset, select the first grid cell, and save it to disk. When all works fine, it takes about 5 seconds per dataset, but when it doesn't, it takes a minute or more. When I try to load the main catalog in my browser (https://boreas.ouranos.ca/thredds) while I'm processing, I get 504 errors...

Here is what the output of the script looks like when it's working:

starting
/c/projects/finch/tests/tmp/temp0.nc 4.12 seconds
/c/projects/finch/tests/tmp/temp1.nc 4.16 seconds

And when it's not:

starting
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
/c/projects/finch/tests/tmp/temp0.nc 93.56 seconds
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
/c/projects/finch/tests/tmp/temp1.nc 310.05 seconds
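Since the failures look like transient nginx/THREDDS errors, one possible stopgap is a retry wrapper around the OPeNDAP open. This is a hypothetical helper, not part of the PR; the retry count and wait time are made up:

# Hypothetical sketch: retry xr.open_dataset when the THREDDS/nginx frontend
# returns transient errors (these generally surface as OSError).
import time
import xarray as xr

def open_with_retry(url, retries=3, wait=30):
    for attempt in range(retries):
        try:
            return xr.open_dataset(url)
        except OSError as err:
            if attempt == retries - 1:
                raise
            print(f"open failed ({err}), retrying in {wait} s")
            time.sleep(wait)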

@tlogan2000
Collaborator

I'm getting variable times running locally as well. It sometimes seems to take a very long time just to create the dataset object, and it seems to be consistently the same files. Will have to keep digging.

ds = xr.open_dataset(ncfile)

@tomLandry

Mmm... then getting all the data locally here might not help at all if we keep this method of accessing the dataset.

@huard
Collaborator

huard commented May 16, 2019

Are the files that take longer to load the ones with CF calendars?

@tlogan2000
Collaborator

tlogan2000 commented May 17, 2019

The following files seem to have issues: either missing days in the time series or a timestamp that seems wrong (only 1 file). This could possibly cause slower dataset object creation. I'm not sure there is much to be done, other than in Case 1, where I think we can correct the timestamp for the day that is clearly coded wrong (a sketch of such a fix follows the file list below).

Case 1 : 2036-10-28 time step coded as 1850-01-01

tasmax_day_BCCAQv2+ANUSPLIN300_CESM1-CAM5_historical+rcp85_r1i1p1_19500101-21001231.nc

Case 2 : 2099-12-31 timestep missing

tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1_historical+rcp85_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1_historical+rcp85_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1_historical+rcp85_r1i1p1_19500101-21001231.nc

Case 3: HadGEM2 models: missing month Dec 2005 and/or occasionally Dec 2100; and/or missing the entire year 2100

tasmax_day_BCCAQv2+ANUSPLIN300_HadGEM2-CC_historical+rcp45_r1i1p1_19500101-21001231.nc
tasmax_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp26_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_HadGEM2-CC_historical+rcp45_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp26_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_HadGEM2-CC_historical+rcp45_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp26_r1i1p1_19500101-21001231.nc
tasmax_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp45_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp45_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp45_r1i1p1_19500101-21001231.nc
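A minimal sketch of the Case 1 fix mentioned above (untested against the actual file; it assumes the decoded times are datetime64, whereas a non-standard cftime calendar would need cftime objects instead):

# Sketch only: replace the time value wrongly decoded as 1850-01-01 with the
# expected 2036-10-28, then write a corrected copy of the file.
import numpy as np
import xarray as xr

fn = ("tasmax_day_BCCAQv2+ANUSPLIN300_CESM1-CAM5_historical+rcp85_r1i1p1"
      "_19500101-21001231.nc")
ds = xr.open_dataset(fn)

time_values = ds["time"].values.copy()
bad = [i for i, t in enumerate(ds.indexes["time"]) if t.year == 1850]
time_values[bad] = np.datetime64("2036-10-28")

ds = ds.assign_coords(time=("time", time_values))
ds.to_netcdf(fn.replace(".nc", "_fixed.nc"))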

@tlogan2000
Collaborator

tlogan2000 commented May 17, 2019

I seem to be getting more consistent extraction times (still reading off the local disk) by using a distributed client and dask array memory chunking when creating the dataset objects.

It can still occasionally take >30 seconds, but for the most part I am looking at between 5 and 15 seconds for most simulations.

I have also set decode_times=False when reading the files to avoid any time lost in conversion.

Code

import glob
import os
import time

import xarray as xr
from distributed import Client
from xclim.subset import subset_gridpoint

client = Client()  # local dask distributed client

# path_to_bccaqv2 and outpath are local directories defined elsewhere
datasets = sorted(glob.glob(os.path.join(path_to_bccaqv2, "*.nc")))
print("starting")

for n, dataset in enumerate(datasets):
    t = time.perf_counter()
    ds = xr.open_dataset(dataset, decode_times=False, chunks=dict(time=366))
    # out = ds.sel(lat=-70, lon=45, method='nearest')
    out = subset_gridpoint(ds, lon=-70, lat=45)
    out_fn = os.path.join(outpath, "temp{n}.nc".format(n=n))
    out.to_netcdf(out_fn)
    print(n, dataset, f"{time.perf_counter() - t:.2f} seconds")

@davidcaron
Collaborator Author

I think we need to be careful with disk caching when benchmarking. I run subset_gridpoint for a local file once, it takes 10 minutes, and I run it again and it takes less than a second. Then when I change the location of the gridpoint (lat/lon) it takes 10 minutes again for the same file.

Also, subsetting along lat/lon for a single time step, instead of extracting a time series for a grid point, is very fast.

I'm experimenting with chunking the datasets:

https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters

I think these datasets are not chunked; do you know if that's the case?
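One way to check on a local copy is netCDF4-python's chunking inspection (a hedged sketch; the file path and variable name are placeholders):

# Placeholder file/variable names; inspects on-disk chunking, filters and packing.
import netCDF4

nc = netCDF4.Dataset("tasmax_day_BCCAQv2_example.nc")
var = nc.variables["tasmax"]
print(var.chunking())   # 'contiguous' or a list of chunk sizes like [365, 50, 56]
print(var.filters())    # compression settings, if any
print(var.ncattrs())    # look for scale_factor / add_offset (packing)
nc.close()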


@tlogan2000
Collaborator

Yes, I just ran the same test selecting a single time step, and extraction is extremely fast.

Running ncdump -hs, I don't see any chunking / storage info, so as far as I can tell the files are in the 'index order' described in the link, making spatial access fast.

I think the fact that PCIC used the 'scale/offset packing' option makes chunking the data impossible. If we convert to a chunked format, we will probably end up with a double-precision encoding and likely greatly increase file size.

@tomLandry

So the last question would be: the worst case for subsetting a grid cell, if we can't rechunk, is ~10 minutes x ~30 models x 3 RCPs = 900 minutes = 15 hours. So let's say 8 to 15 hours before receiving an email. It would be worth sending an email to the user inviting them to look again... much later.

The next question is how we can cache things a bit, like pre-fetching everything close to big cities instead of relying on a random point in the middle of nowhere.

Another question is how our estimate degrades with multiple users accessing the files. If 10 users query at the same time and it slows to 30 minutes per file, that's bad...

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

Hello all, I produced a rechunked daily bccaqv2 file here:
https://pavics.ouranos.ca/thredds/catalog/birdhouse/testdata/Chunking_Netcdf/bccaqv2_daily/catalog.html

@davidcaron could you try extracting via thredds to test performance?

Notes:
In order to future-proof an eventual need for subset_bbox and polygon extractions, I have chunked the file in a way that should give reasonable access speed in both time and space:
time: 365, lat: 50, lon: 56

I have also forced the temperature variable to single precision to reduce file size (@huard, do you see any problem using single vs. double precision with these data?)

I also added low-level zlib compression (level 2), but I actually end up with a file size smaller than the original (~40 GB), so I could probably drop the compression to improve access speed if there is a need.

Conversion took ~40 min using a three-worker distributed client.
Code for conversion

# Continues from the earlier snippet: os, time, xarray, `datasets` and `outpath`
# are already defined there.
ChunkSizes = [365, 50, 56]
comp = dict(zlib=True, complevel=2, chunksizes=ChunkSizes, dtype='single')

for n, dataset in enumerate([datasets[0]]):
    t = time.perf_counter()
    ds = xr.open_dataset(dataset, decode_times=False, chunks=dict(time=365))
    encoding = {var: comp for var in ds.data_vars}
    encoding['time'] = dict(dtype='single')
    out_fn = os.path.join(outpath, "ReChunk_Test_{n}.nc".format(n=os.path.basename(dataset)))
    ds.to_netcdf(out_fn, encoding=encoding)
    print(n, dataset, f"{time.perf_counter() - t:.2f} seconds")

@davidcaron
Collaborator Author

Thanks Travis! I will test it first thing when I get the time.

I tried to use nccopy to do the chunking, but with no luck... maybe because the data is scaled and offset like you said earlier?

@tlogan2000
Collaborator

Ok cool.
nccopy might be trying to load the entire file into memory? Not sure how it works, but that could be difficult considering the file sizes.

Xarray seemed to handle it OK as long as you specify dask memory chunking when reading/creating the original dataset object.

@davidcaron
Collaborator Author

nccopy might be trying to load the entire file to memory? Not sure how it works but that could be difficult considering the file sizes.

It doesn't by default. I made some tests with ncpdq to rearrange dimensions, and that one does, though.

@davidcaron davidcaron merged commit 1bdb5bb into master May 21, 2019
@tomLandry

@tlogan2000 thank you. We now have a THREDDS server at CRIM with the whole BCCAQv2 dataset ready. We also now have a mirror of the climate data portal in-house. The big part we're missing is the portal's GeoServer. We do have more flexibility now. We're looking into rechunking the data, but this is not a "today" priority for us. Today, if we can gather timings on the process, we can establish when rechunking should/could occur.

@huard Also, before mass-converting anything, we need to address the fact that the MD5 checksums we received don't match the Ouranos & CRIM data. I was informed a few times by your team that no modification was done to the data from PCIC; hopefully that includes metadata. Maybe the MD5 checksums were computed by PCIC on the wrong set.
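In case it helps the comparison, a minimal sketch for recomputing the sums locally (the directory path is a placeholder; how the PCIC manifest is formatted is not assumed here):

# Hedged sketch: recompute MD5 sums for the local files so they can be compared
# against the list received from PCIC.
import hashlib
from pathlib import Path

def md5sum(path, block=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()

for nc in sorted(Path("/path/to/bccaqv2").glob("*.nc")):
    print(md5sum(nc), nc.name)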

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

@Zeitsperre any insight for checksum discrepancies on the original BCCAQv2 transferred files?

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

I do remember that we actually transferred the data twice. The first version had grid offset issues for certain files, e.g. see below.

The second transfer was in November 2019.

[attached image: Compare_grid_tasmax_day_BCCAQv2+ANUSPLIN300_CNRM-CM5_historical+rcp45_r1i1p1_19500101-21001231]

@Zeitsperre
Collaborator

@tlogan2000 @tomLandry

We did indeed transfer the data twice. The most recent version on the server was added in November, after PCIC corrected the grid-shift problem. The old/erroneous data was purged not long after that. The transfer via GLOBUS automatically checks SHA sums, so I'm surprised to hear that none of the files are seemingly the same. I'm with Tom on this: they may have performed the MD5 sums on the old data (or maybe they've made further modifications to the existing set?). It doesn't help that they don't include revision numbers in their metadata.

Tom has mentioned he'll be contacting PCIC to verify the numbers again. I've asked that he include me in CC. My time is pretty scarce this week, but if they have a new revision, I can start another transfer via GLOBUS tomorrow.

@huard
Collaborator

huard commented May 21, 2019

@tlogan2000 The decision on single vs double precision is not ours to make. We're distributing PCIC's data, we should not make any user-visible change to it unless we have their explicit consent.

Also, note that with some compression algorithms, the gains on IO more than offset the compute time, so it's actually faster to read a compressed file (see the Blosc algorithm, for example). In other words, don't assume compression necessarily increases load time.

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

@huard Sorry, my question/thought process about the datatype was relatively unclear.

PCIC's original data is 'packed' (saved as short ints on disk, with scale and offset applied when loaded). My understanding of netCDF chunking was that there is no way to chunk the data on disk while keeping the packed option, and that we would be forced to change the datatype (if the decision was made to re-chunk everything, of course). The default datatype with xarray.to_netcdf is double precision, making files quite large... hence my question about using single precision instead.

However, I think that in the end maybe we can chunk to disk and still have 'packed' data (see the sketch below).
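A minimal, untested sketch of that idea: xarray keeps the original scale_factor/add_offset in the variable's .encoding when decoding, so they can be reused to write int16 packed output that is also chunked (file names here are placeholders):

# Untested sketch: write chunked + compressed output while keeping int16 packing
# by reusing the original scale_factor / add_offset from the source encoding.
import xarray as xr

ds = xr.open_dataset("tasmax_original.nc", chunks={"time": 365})
src = ds["tasmax"].encoding  # holds scale_factor, add_offset, _FillValue, ...

encoding = {
    "tasmax": dict(
        dtype="int16",
        scale_factor=src.get("scale_factor", 1.0),
        add_offset=src.get("add_offset", 0.0),
        _FillValue=src.get("_FillValue", -32768),
        zlib=True,
        complevel=2,
        chunksizes=(365, 50, 56),
    )
}
ds.to_netcdf("tasmax_rechunked_packed.nc", encoding=encoding)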

@tomLandry

Thank you both. So yes, keeping 100% integrity with PCIC's original files on the Ouranos and CRIM servers is something to strive for, hence the checksumming (still pending). As for creating a new "optimized" copy of the dataset (rechunked), if we can ensure that the data is equivalent (precision-wise), I think CRIM can make that call. I will contact James in due time. CRIM would incur additional storage cost though, so it's a pretty important decision. And... time is not on our side, sorry about that.
