[WIP] Development #33
Conversation
When subsetting, the RAM and CPU usage is very low, so that's a good point in favor of multithreading. It seems to take around 5 minutes to subset a single netCDF file for the 150 years for a single grid cell. And I think a bunch more files were recently added to the BCCAQv2 dataset (there are around 260 files now). To compute the heat wave frequency, we need tasmin and tasmax, so that's 174 files. For the subsetting alone, the estimated time would be 15 hours. I could be surprised again, but I think the heat wave computation will be pretty much insignificant compared to this. |
@huard, @tlogan2000 I think the subset could be downloading way more data than it needs. The processing time for one file (~5 min) is close to the time it takes me to download the whole file in my browser (~10 min). I'll keep looking into the xclim subset function for a gridpoint, but I would be very interested in knowing what you think. |
Nothing obvious comes to mind. You could try replacing the `subset_bbox` call by a simple selection by indices: `out = ds.isel(lat=0, lon=0)` and see if there is a noticeable speed difference. |
I'm starting to think this could be related to Ouranosinc/pavics-sdi#106. I'm having a hard time executing this simple script:

```python
from pathlib import Path
import time

from siphon.catalog import TDSCatalog
import xarray as xr
from xclim.subset import subset_gridpoint

catalog = TDSCatalog("https://boreas.ouranos.ca/thredds/catalog/birdhouse/pcic/BCCAQv2/catalog.xml")

print("starting")
for n, dataset in enumerate(catalog.datasets.values()):
    opendap_url = dataset.access_urls["OPENDAP"]
    t = time.perf_counter()
    ds = xr.open_dataset(opendap_url)
    out = ds.isel(lat=0, lon=0)
    out_fn = Path(__file__).parent / "tests" / "tmp" / f"temp{n}.nc"
    out_fn.parent.mkdir(parents=True, exist_ok=True)
    out.to_netcdf(out_fn)
    print(out_fn, f"{time.perf_counter() - t:.2f} seconds")
```

I load all the BCCAQv2 datasets, select the first cell element and save it to disk. When all works fine, it takes 5 seconds per dataset. But when it doesn't, it takes a minute or more. When I try to load the main catalog in my browser https://boreas.ouranos.ca/thredds while I'm processing, I get 504 errors... Here is what the output of the script looks like when it's working:
And when it's not:
|
I'm getting variable times running locally as well. It sometimes seems to take a very long time just to create the dataset object. It seems to be consistently the same files, though... Will have to keep digging.
|
Mmm... then getting all the data locally might not help at all if we keep this method of accessing the dataset. |
Are the files that take longer to load the ones with CF calendars? |
The following files seem to have issues with either missing days in the time series or timestamps that seem wrong (only 1 file). This could possibly cause slower dataset object creation? I'm not sure there is much to be done, other than in case 1, where I think we can correct the time signature for the day that is clearly coded wrong.
Case 1: 2036-10-28 time step coded as 1850-01-01
Case 2: 2099-12-31 timestep missing
Case 3: HadGEM{} models: missing month Dec 2005 and/or occasionally Dec 2100; and/or missing entire year 2100
|
I seem to be getting more consistent extraction times (still reading off the local disk) by using a distributed client and dask.array memory chunking when creating the dataset objects. It can still occasionally take >30 seconds, but for the most part I am looking at between 5 and 15 seconds for most simulations. I have also set
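A minimal sketch of the dask memory-chunking idea described above. The variable name, dimension sizes, and chunk size are illustrative assumptions, not the actual BCCAQv2 values; in the real workflow the chunking would be requested at open time, e.g. `xr.open_dataset(opendap_url, chunks={"time": 365})`, optionally alongside a `dask.distributed.Client`.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one BCCAQv2 file (sizes are illustrative).
ds = xr.Dataset(
    {"tasmax": (("time", "lat", "lon"), np.random.rand(730, 4, 4))}
)

# Back the dataset with dask arrays chunked along time; extracting one
# grid cell then only touches the chunks for that (lat, lon) column.
ds = ds.chunk({"time": 365})
out = ds["tasmax"].isel(lat=0, lon=0)
print(out.data.chunks)  # ((365, 365),)
```

Nothing is computed until the result is actually written or loaded, which is what makes the per-cell extraction cheap relative to reading the whole array.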
|
I think we need to be careful with disk caching when benchmarking. I run subset_gridpoint for a local file once, it takes 10 minutes, and I run it again and it takes less than a second. Then when I change the location of the gridpoint (lat/lon) it takes 10 minutes again for the same file. Also, subsetting along lat/lon for a single time step instead of extracting a timeseries for a grid point is very fast. I'm experimenting with chunking the datasets: https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters I think these datasets are not chunked, do you know if that's the case? |
`ncdump -hs` should tell you. |
Yes, I just ran the same test selecting a single time step, and extraction is extremely fast. Running `ncdump -hs`, I don't see any chunking/storage info, so as far as I can tell the files are in the 'index order' described in the link, making spatial access fast. I think that the fact that PCIC used the 'scale offset packing' option makes chunking the data not possible. If we convert to a chunked format we will probably immediately end up with a double-precision encoding and likely greatly increase file size. |
So the last question would be the worst-case scenario for subsetting a grid cell if we can't rechunk: ~10 minutes x ~30 models x 3 RCPs = 900 minutes = 15 hours. So let's say 8 to 15 hours before receiving an email. It would be worth sending an email to the user inviting them to look again... much later. The next question would be how we can cache things a bit, like pre-fetching everything close to big cities instead of relying on a random point in the middle of nowhere. Another question is how our estimate degrades with multiple users accessing the files: if 10 users query at once and each extraction slows to 30 minutes, that's bad... |
Hello all, I produced a rechunked daily bccaqv2 file here: @davidcaron could you try extracting via THREDDS to test performance? Notes: I have also forced the temperature variable to single precision to reduce file size (@huard, do you see any problem using single vs double precision with these data?). I also added low-level zlib compression (level 2), but I actually end up with a file size less than the original, ~40 GB, so I could probably get rid of the compression to improve access speed if there is a need. Conversion took ~40 min using a three-worker distributed client.
|
Thanks Travis! I will test it first thing when I get the time. I tried to use |
Ok cool. Xarray seemed to handle it ok as long as you specify dask memory chunking when reading/creating the original dataset object. |
It doesn't by default. I made some tests with |
@tlogan2000 thank you. We now have a THREDDS server at CRIM with the whole BCCAQv2 dataset ready. We also now have a mirror of the climate data portal in-house. The big part we're missing is the portal's geoserver. We do have more flexibility now. We're looking into rechunking the data, but this is not a "today" priority for us. Today, if we can gather timings on the process, we can establish when rechunking should/could occur. @huard Also, before mass-converting anything, we need to address the fact that the MD5 checksums we received don't match the Ouranos & CRIM data. I was informed a few times by your team that no modification was done on the data from PCIC. Hopefully that includes metadata. Maybe the MD5 checksums were computed by PCIC on the wrong set. |
@Zeitsperre any insight for checksum discrepancies on the original BCCAQv2 transferred files? |
We did indeed transfer the data twice. The most recent version on the server was added in November after PCIC corrected the grid-shift problem. The old/erroneous data was purged not long after that. The transfer via GLOBUS automatically checks sha sums so I'm surprised to hear that none of the files are seemingly the same. I'm with Tom on this that they may have performed the MD5 sums on the old data (or maybe they've done further modifications to the existing set?). It doesn't help that they don't include revision numbers in their metadata. Tom has mentioned he'll be contacting PCIC to verify the numbers again. I've asked that he include me in CC. My time is pretty scarce this week but if they have a new revision, I can start another transfer via GLOBUS tomorrow. |
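Re-verifying the transferred files locally is cheap compared to another GLOBUS transfer. A minimal sketch of a streaming MD5 helper (the function name is ours; in a real run the paths would be the transferred BCCAQv2 `.nc` files and the reference digests would come from PCIC's manifest):

```python
import hashlib

def md5sum(path, block_size=1 << 20):
    """Stream a file through MD5 in 1 MiB blocks, without loading it into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing `md5sum(local_file)` against the manifest entry for the same file name would distinguish a corrupted transfer (a few mismatches) from a manifest built against a different data revision (every file mismatching).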
@tlogan2000 The decision on single vs double precision is not ours to make. We're distributing PCIC's data; we should not make any user-visible change to it unless we have their explicit consent. Also, note that with some compression algorithms, the gains on IO more than offset the compute time, so it's actually faster to read a compressed file (see the blosc algorithm for example). In other words, don't assume compression necessarily increases load time. |
@huard Sorry, my question/thought process about the datatype was relatively unclear. PCIC's original data is 'packed' (saved as short ints on disk, using scale and offset when loaded, etc.). My understanding of netCDF chunking was that there is no way to 'chunk' the data on disk using the packed option, and that we would be forced to change the datatype (if the decision were made to re-chunk everything, of course). The default datatype using xarray.to_netcdf is double precision, making files quite large... hence my question about using single instead. However, I think that in the end maybe we can chunk to disk and still have 'packed' data.... |
Thank you both. So yes, keeping 100% integrity of PCIC's original files on the Ouranos and CRIM servers is something to strive for, hence the checksumming (still pending). As for creating a new "optimized" copy of the dataset (rechunked), if we can ensure that the data is equivalent (precision-wise), I think CRIM can make that call. I will contact James in due time. CRIM would incur additional storage cost, though, so it's a pretty important decision. And... time is not on our side, sorry about that. |
Overview
This PR is mostly focused on adding a subset + heat wave frequency process.
Along the way, I tried to split operations into inherited methods to avoid some code duplication. I don't think it's the best approach yet, but it will help if we want to extend the functionality to subsetting with bounding boxes and polygons, and to other indices.
This still needs more testing, as I don't think that the mock tests are enough.
Changes:
Todo:
[ ] More testing, as I don't think that the mock tests are enough.
[ ] The BCCAQv2_heat_wave_frequency_gridpoint process is taking much longer than I expected for the 150 years. More than 10 hours... I think this needs to be improved; I'm open to suggestions.