
[WIP] Development #33

Merged: 42 commits merged into master on May 21, 2019

Conversation

davidcaron
Collaborator

Overview

This PR is mostly focused on adding a subset + heat wave frequency process.

Along the way, I tried to split operations into inherited methods to avoid some code duplication. It's not the ideal structure yet, but it should help if we want to extend the functionality to subsetting with bounding boxes and polygons, combined with other indices.

This still needs more testing, as I don't think that the mock tests are enough.

Changes:

  • Added BCCAQv2_heat_wave_frequency_gridpoint process

Todo:
[ ] More testing; I don't think the mock tests are enough.
[ ] The BCCAQv2_heat_wave_frequency_gridpoint process takes much longer than I expected for the full 150 years: more than 10 hours... I think this needs to be improved; I'm open to suggestions.

@davidcaron
Collaborator Author

When subsetting, the RAM and CPU usage is very low, so that's a good point in favor of multithreading.

It seems to take around 5 minutes to subset a single netCDF file covering the 150 years for a single grid cell. And I think a bunch more files were recently added to the BCCAQv2 dataset (there are around 260 files now). To compute the heat wave frequency, we need tasmin and tasmax, so that's 174 files. For the subsetting alone, the estimated time would be 15 hours. I could be surprised again, but I think the heat wave computation will be pretty much insignificant compared to this.
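A minimal sketch of what such parallel subsetting could look like (not part of this PR; OPENDAP_URLS is an assumed list of OPeNDAP URLs gathered from the catalog, and processes are used rather than threads since the netCDF-C library is not thread-safe):

# Hedged sketch only: subset each file in its own process to overlap the
# (mostly I/O-bound) downloads. OPENDAP_URLS and the coordinates are placeholders.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import xarray as xr
from xclim.subset import subset_gridpoint

OUT_DIR = Path("tests/tmp")

def subset_one(args):
    n, url = args
    ds = xr.open_dataset(url)
    out = subset_gridpoint(ds, lon=-70, lat=45)  # example grid point
    out_fn = OUT_DIR / f"temp{n}.nc"
    out.to_netcdf(out_fn)
    return out_fn

if __name__ == "__main__":
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    with ProcessPoolExecutor(max_workers=4) as pool:
        for fn in pool.map(subset_one, enumerate(OPENDAP_URLS)):
            print(fn)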

@davidcaron
Collaborator Author

@huard, @tlogan2000 I think the subset could be downloading way more data than it needs. The processing time for one file (~5 min) is close to the time it takes me to download the whole file in my browser (~10 min).

I'll keep looking into the xclim subset function for a gridpoint, but I would be very interested in knowing what you think.

@huard
Collaborator

huard commented May 16, 2019

Nothing obvious comes to mind. You could try replacing the subset_bbox call by a simple selection by indices: out = ds.isel(lat=0, lon=0) and see if there is a noticeable speed difference.

@davidcaron
Collaborator Author

davidcaron commented May 16, 2019

I'm starting to think this could be related to Ouranosinc/pavics-sdi#106

I'm having a hard time executing this simple script:

from pathlib import Path
import time

from siphon.catalog import TDSCatalog
import xarray as xr
from xclim.subset import subset_gridpoint


catalog = TDSCatalog("https://boreas.ouranos.ca/thredds/catalog/birdhouse/pcic/BCCAQv2/catalog.xml")

print("starting")


for n, dataset in enumerate(catalog.datasets.values()):
    opendap_url = dataset.access_urls["OPENDAP"]
    t = time.perf_counter()
    ds = xr.open_dataset(opendap_url)

    out = ds.isel(lat=0, lon=0)

    out_fn = Path(__file__).parent / "tests" / "tmp" / f"temp{n}.nc"
    out_fn.parent.mkdir(parents=True, exist_ok=True)

    out.to_netcdf(out_fn)

    print(out_fn, f"{time.perf_counter() - t:.2f} seconds")

I load each BCCAQv2 dataset, select the first grid cell, and save it to disk. When all works fine, it takes about 5 seconds per dataset, but when it doesn't, it takes a minute or more. When I try to load the main catalog in my browser (https://boreas.ouranos.ca/thredds) while I'm processing, I get 504 errors...

Here is what the output of the script looks like when it's working:

starting
/c/projects/finch/tests/tmp/temp0.nc 4.12 seconds
/c/projects/finch/tests/tmp/temp1.nc 4.16 seconds

And when it's not:

starting
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
/c/projects/finch/tests/tmp/temp0.nc 93.56 seconds
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ html><html><head><title>Error</title><style> body { width: 35em; margin: 0 auto; font-family: Tahoma, Verdana, Arial, sans-serif; }</style></head><body><h1>An error occurred.</h1><p>Sorry, the page you are looking for is currently unavailable.<br/>Please try again later.</p><p>If you are the system administrator of this resource then you should checkthe <a href="http://nginx.org/r/error_log">error log</a> for details.</p><p><em>Faithfully yours, nginx.</em></p></body></html>
/c/projects/finch/tests/tmp/temp1.nc 310.05 seconds
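Since the failures look like transient nginx/THREDDS errors, one possible stopgap is a retry wrapper around the OPeNDAP open. This is a hypothetical helper, not part of the PR; the retry count and wait time are made up:

# Hypothetical sketch: retry xr.open_dataset when the THREDDS/nginx frontend
# returns transient errors (these generally surface as OSError).
import time
import xarray as xr

def open_with_retry(url, retries=3, wait=30):
    for attempt in range(retries):
        try:
            return xr.open_dataset(url)
        except OSError as err:
            if attempt == retries - 1:
                raise
            print(f"open failed ({err}), retrying in {wait} s")
            time.sleep(wait)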

@tlogan2000
Collaborator

I'm getting variable times running locally as well. It sometimes seems to take a very long time just to create the dataset object, and it seems to be consistently the same files. Will have to keep digging.

ds = xr.open_dataset(ncfile)

@tomLandry

Mmm... then getting all the data locally here might not help at all if we keep this method of accessing the dataset.

@huard
Collaborator

huard commented May 16, 2019

Are the files that take longer to load the ones with CF calendars?

@tlogan2000
Collaborator

tlogan2000 commented May 17, 2019

The following files seem to have issues: either missing days in the time series or a timestamp that seems wrong (only 1 file). This could possibly cause slower dataset object creation. I'm not sure there is much to be done, other than in Case 1, where I think we can correct the timestamp for the day that is clearly coded wrong (a sketch of such a fix follows the file list below).

Case 1 : 2036-10-28 time step coded as 1850-01-01

tasmax_day_BCCAQv2+ANUSPLIN300_CESM1-CAM5_historical+rcp85_r1i1p1_19500101-21001231.nc

Case 2 : 2099-12-31 timestep missing

tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1_historical+rcp85_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1_historical+rcp85_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1_historical+rcp85_r1i1p1_19500101-21001231.nc

Case 3: HadGEM2 models: missing month Dec 2005 and/or occasionally Dec 2100; and/or missing the entire year 2100

tasmax_day_BCCAQv2+ANUSPLIN300_HadGEM2-CC_historical+rcp45_r1i1p1_19500101-21001231.nc
tasmax_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp26_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_HadGEM2-CC_historical+rcp45_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp26_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_HadGEM2-CC_historical+rcp45_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp26_r1i1p1_19500101-21001231.nc
tasmax_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp45_r1i1p1_19500101-21001231.nc
tasmin_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp45_r1i1p1_19500101-21001231.nc
pr_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp45_r1i1p1_19500101-21001231.nc
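A minimal sketch of the Case 1 fix mentioned above (untested against the actual file; it assumes the decoded times are datetime64, whereas a non-standard cftime calendar would need cftime objects instead):

# Sketch only: replace the time value wrongly decoded as 1850-01-01 with the
# expected 2036-10-28, then write a corrected copy of the file.
import numpy as np
import xarray as xr

fn = ("tasmax_day_BCCAQv2+ANUSPLIN300_CESM1-CAM5_historical+rcp85_r1i1p1"
      "_19500101-21001231.nc")
ds = xr.open_dataset(fn)

time_values = ds["time"].values.copy()
bad = [i for i, t in enumerate(ds.indexes["time"]) if t.year == 1850]
time_values[bad] = np.datetime64("2036-10-28")

ds = ds.assign_coords(time=("time", time_values))
ds.to_netcdf(fn.replace(".nc", "_fixed.nc"))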

@tlogan2000
Collaborator

tlogan2000 commented May 17, 2019

I seem to be getting more consistent extraction times (still reading off the local disk) by using a distributed client and dask array memory chunking when creating the dataset objects.

It can still occasionally take >30 seconds, but for the most part I am looking at between 5 and 15 seconds for most simulations.

I have also set decode_times=False when reading the files to avoid any time lost in conversion.

Code

import glob
import os
import time

import xarray as xr
from distributed import Client
from xclim.subset import subset_gridpoint

client = Client()  # local dask distributed client

# path_to_bccaqv2 and outpath are local directories defined elsewhere
datasets = sorted(glob.glob(os.path.join(path_to_bccaqv2, "*.nc")))
print("starting")

for n, dataset in enumerate(datasets):
    t = time.perf_counter()
    ds = xr.open_dataset(dataset, decode_times=False, chunks=dict(time=366))
    # out = ds.sel(lat=-70, lon=45, method='nearest')
    out = subset_gridpoint(ds, lon=-70, lat=45)
    out_fn = os.path.join(outpath, "temp{n}.nc".format(n=n))
    out.to_netcdf(out_fn)
    print(n, dataset, f"{time.perf_counter() - t:.2f} seconds")

@davidcaron
Collaborator Author

I think we need to be careful with disk caching when benchmarking. I run subset_gridpoint for a local file once, it takes 10 minutes, and I run it again and it takes less than a second. Then when I change the location of the gridpoint (lat/lon) it takes 10 minutes again for the same file.

Also, subsetting along lat/lon for a single time step, instead of extracting a time series for a grid point, is very fast.

I'm experimenting with chunking the datasets:

https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters

I think these datasets are not chunked; do you know if that's the case?
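One way to check on a local copy is netCDF4-python's chunking inspection (a hedged sketch; the file path and variable name are placeholders):

# Placeholder file/variable names; inspects on-disk chunking, filters and packing.
import netCDF4

nc = netCDF4.Dataset("tasmax_day_BCCAQv2_example.nc")
var = nc.variables["tasmax"]
print(var.chunking())   # 'contiguous' or a list of chunk sizes like [365, 50, 56]
print(var.filters())    # compression settings, if any
print(var.ncattrs())    # look for scale_factor / add_offset (packing)
nc.close()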


@tlogan2000
Collaborator

Yes, I just ran the same test selecting a single time step, and extraction is extremely fast.

Running ncdump -hs, I don't see any chunking / storage info, so as far as I can tell the files are in the 'index order' described in the link, making spatial access fast.

I think the fact that PCIC used the 'scale/offset packing' option makes chunking the data impossible. If we convert to a chunked format, we will probably end up with a double-precision encoding and likely greatly increase file size.

@tomLandry

So the last question would be: the worst case for subsetting a grid cell, if we can't rechunk, is ~10 minutes x ~30 models x 3 RCPs = 900 minutes = 15 hours. So let's say 8 to 15 hours before receiving an email. It would be worth sending an email to the user inviting them to look again... much later.

The next question is how we can cache things a bit, like pre-fetching everything close to big cities instead of relying on a random point in the middle of nowhere.

Another question is how our estimate degrades with multiple users accessing the files. If 10 users query at the same time and it slows to 30 minutes per file, that's bad...

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

Hello all, I produced a rechunked daily bccaqv2 file here:
https://pavics.ouranos.ca/thredds/catalog/birdhouse/testdata/Chunking_Netcdf/bccaqv2_daily/catalog.html

@davidcaron could you try extracting via thredds to test performance?

Notes:
In order to future-proof an eventual need for subset_bbox and polygon extractions, I have chunked the file in a way that should give reasonable access speed in both time and space:
time: 365, lat: 50, lon: 56

I have also forced the temperature variable to single precision to reduce file size (@huard, do you see any problem using single vs. double precision with these data?)

I also added low-level zlib compression (level 2), but I actually end up with a file size smaller than the original (~40 GB), so I could probably drop the compression to improve access speed if there is a need.

Conversion took ~40 min using a three-worker distributed client.
Code for conversion

# Continues from the earlier snippet: os, time, xarray, `datasets` and `outpath`
# are already defined there.
ChunkSizes = [365, 50, 56]
comp = dict(zlib=True, complevel=2, chunksizes=ChunkSizes, dtype='single')

for n, dataset in enumerate([datasets[0]]):
    t = time.perf_counter()
    ds = xr.open_dataset(dataset, decode_times=False, chunks=dict(time=365))
    encoding = {var: comp for var in ds.data_vars}
    encoding['time'] = dict(dtype='single')
    out_fn = os.path.join(outpath, "ReChunk_Test_{n}.nc".format(n=os.path.basename(dataset)))
    ds.to_netcdf(out_fn, encoding=encoding)
    print(n, dataset, f"{time.perf_counter() - t:.2f} seconds")

@davidcaron
Collaborator Author

Thanks Travis! I will test it first thing when I get the time.

I tried to use nccopy to do the chunking, but with no luck... maybe because the data is scaled and offset like you said earlier?

@tlogan2000
Collaborator

Ok cool.
nccopy might be trying to load the entire file into memory? Not sure how it works, but that could be difficult considering the file sizes.

Xarray seemed to handle it OK as long as you specify dask memory chunking when reading/creating the original dataset object.

@davidcaron
Collaborator Author

nccopy might be trying to load the entire file to memory? Not sure how it works but that could be difficult considering the file sizes.

It doesn't by default. I made some tests with ncpdq to rearrange dimensions, and that one does, though.

@davidcaron davidcaron merged commit 1bdb5bb into master May 21, 2019
@tomLandry

@tlogan2000 thank you. We now have a THREDDS server at CRIM with the whole BCCAQv2 dataset ready. We also now have a mirror of the climate data portal in-house. The big part we're missing is the portal's GeoServer. We do have more flexibility now. We're looking into rechunking the data, but this is not a "today" priority for us. Today, if we can gather timings on the process, we can establish when rechunking should/could occur.

@huard Also, before mass-converting anything, we need to address the fact that the MD5 checksums we received don't match the Ouranos & CRIM data. I was informed a few times by your team that no modification was done to the data from PCIC; hopefully that includes metadata. Maybe the MD5 checksums were computed by PCIC on the wrong set.
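In case it helps the comparison, a minimal sketch for recomputing the sums locally (the directory path is a placeholder; how the PCIC manifest is formatted is not assumed here):

# Hedged sketch: recompute MD5 sums for the local files so they can be compared
# against the list received from PCIC.
import hashlib
from pathlib import Path

def md5sum(path, block=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()

for nc in sorted(Path("/path/to/bccaqv2").glob("*.nc")):
    print(md5sum(nc), nc.name)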

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

@Zeitsperre any insight for checksum discrepancies on the original BCCAQv2 transferred files?

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

I do remember that we actually transferred the data twice. The first version had grid offset issues for certain files, e.g. see below.

The second transfer was in November 2019.

[attached image: Compare_grid_tasmax_day_BCCAQv2+ANUSPLIN300_CNRM-CM5_historical+rcp45_r1i1p1_19500101-21001231]

@Zeitsperre
Collaborator

@tlogan2000 @tomLandry

We did indeed transfer the data twice. The most recent version on the server was added in November, after PCIC corrected the grid-shift problem. The old/erroneous data was purged not long after that. The transfer via GLOBUS automatically checks SHA sums, so I'm surprised to hear that none of the files are seemingly the same. I'm with Tom on this: they may have performed the MD5 sums on the old data (or maybe they've made further modifications to the existing set?). It doesn't help that they don't include revision numbers in their metadata.

Tom has mentioned he'll be contacting PCIC to verify the numbers again. I've asked that he include me in CC. My time is pretty scarce this week, but if they have a new revision, I can start another transfer via GLOBUS tomorrow.

@huard
Collaborator

huard commented May 21, 2019

@tlogan2000 The decision on single vs double precision is not ours to make. We're distributing PCIC's data, we should not make any user-visible change to it unless we have their explicit consent.

Also, note that with some compression algorithms, the gains on IO more than offset the compute time, so it's actually faster to read a compressed file (see the Blosc algorithm, for example). In other words, don't assume compression necessarily increases load time.

@tlogan2000
Collaborator

tlogan2000 commented May 21, 2019

@huard Sorry, my question/thought process about the datatype was relatively unclear.

PCIC's original data is 'packed' (saved as short ints on disk, with scale and offset applied when loaded). My understanding of netCDF chunking was that there is no way to chunk the data on disk while keeping the packed option, and that we would be forced to change the datatype (if the decision was made to re-chunk everything, of course). The default datatype with xarray.to_netcdf is double precision, making files quite large... hence my question about using single precision instead.

However, I think that in the end maybe we can chunk to disk and still have 'packed' data (see the sketch below).
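A minimal, untested sketch of that idea: xarray keeps the original scale_factor/add_offset in the variable's .encoding when decoding, so they can be reused to write int16 packed output that is also chunked (file names here are placeholders):

# Untested sketch: write chunked + compressed output while keeping int16 packing
# by reusing the original scale_factor / add_offset from the source encoding.
import xarray as xr

ds = xr.open_dataset("tasmax_original.nc", chunks={"time": 365})
src = ds["tasmax"].encoding  # holds scale_factor, add_offset, _FillValue, ...

encoding = {
    "tasmax": dict(
        dtype="int16",
        scale_factor=src.get("scale_factor", 1.0),
        add_offset=src.get("add_offset", 0.0),
        _FillValue=src.get("_FillValue", -32768),
        zlib=True,
        complevel=2,
        chunksizes=(365, 50, 56),
    )
}
ds.to_netcdf("tasmax_rechunked_packed.nc", encoding=encoding)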

@tomLandry

Thank you both. So yes, keeping 100% integrity with PCIC's original files on the Ouranos and CRIM servers is something to strive for, hence the checksumming (still pending). As for creating a new "optimized" copy of the dataset (rechunked), if we can ensure that the data is equivalent (precision-wise), I think CRIM can make that call. I will contact James in due time. CRIM would incur additional storage cost though, so it's a pretty important decision. And... time is not on our side, sorry about that.
