
Capability to do qaqc without recreating -raw.cdf file #87

Open
ssuttles-usgs opened this issue Nov 2, 2022 · 6 comments
Comments

@ssuttles-usgs
Contributor

Presently in the stglib workflow, any qaqc actions on data variables are specified in the config.yaml file, which is ingested as an argument at the first processing step, where the raw instrument data are read and written to a -raw.cdf file. It would be desirable to add the capability to specify qaqc actions at later steps in the process, so that the -raw.cdf file would not need to be recreated each time. One idea that has been discussed is to allow a new qaqc.yaml file, containing qaqc actions, as an optional argument at the step(s) where the .nc files for data release are generated (e.g., runexocdf2nc.py). This could be implemented in a similar way to the optional atmospheric pressure correction argument (--atmpres) that is used to correct submerged pressure data for changes in local atmospheric pressure.
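
For illustration only, the optional argument could be handled the same way --atmpres is; something like this sketch (the --qaqc flag name and the parsing code are hypothetical, just to show the idea, not actual stglib code):

```python
# Hypothetical sketch -- flag names and merge step are illustrative, not stglib's API.
import argparse
import yaml

parser = argparse.ArgumentParser(description="Convert -raw.cdf to clean .nc")
parser.add_argument("config", help="YAML config file used to create the -raw.cdf")
parser.add_argument("--atmpres", help="cdf file of atmospheric pressure for correction")
parser.add_argument("--qaqc", help="optional YAML file of qaqc actions to apply at this step")
args = parser.parse_args()

qaqc_actions = {}
if args.qaqc:
    with open(args.qaqc) as f:
        qaqc_actions = yaml.safe_load(f)
# qaqc_actions would then be merged into the dataset attributes before the
# cdf2nc qaqc/trimming routines run, so the -raw.cdf never needs rewriting.
```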

@dnowacki-usgs
Member

I think we could keep everything in the same yaml file and read it in again, as suggested, using an additional command-line argument. When re-reading the file we would want to check whether the new values differ from what already exists and, if they do, issue a warning (but probably not fail, since there may be some testing of the ideal qaqc cutoff values).
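
A minimal sketch of what I have in mind (not actual stglib code; the function and file names are just illustrative):

```python
import warnings
import yaml
import xarray as xr

def merge_config(ds, config_path):
    """Re-read the yaml config and fold it into ds.attrs, warning on changed values."""
    with open(config_path) as f:
        new_cfg = yaml.safe_load(f)
    for key, value in new_cfg.items():
        if key in ds.attrs and ds.attrs[key] != value:
            warnings.warn(
                f"{key}: yaml value {value!r} differs from raw .cdf value "
                f"{ds.attrs[key]!r}; using the yaml value"
            )
        ds.attrs[key] = value
    return ds

ds = xr.open_dataset("mooring-raw.cdf")  # illustrative file name
ds = merge_config(ds, "config.yaml")
```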

@ssuttles-usgs
Contributor Author

That does seem like a potentially good solution within the existing workflow. Do you know if there is a way to work with just the global attributes of an existing netCDF file (read & write)? For very large files, re-writing the entire -raw.cdf file can take a very long time.

@dnowacki-usgs
Member

dnowacki-usgs commented Jan 3, 2024

xr.open_dataset() should just open the dataset without loading any of the data values, so I think accessing the attrs is quick. For writing, I think(?) it has to rewrite the whole file, but I'm not certain about that.

Edit: that seems to be the case for xarray. https://stackoverflow.com/questions/66231575/xarray-appending-or-rewriting-a-existing-nc-file

We'd be reading in the whole CDF in this scenario anyway though, right?
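
For example (file names are just placeholders):

```python
import xarray as xr

# Opening is lazy: only metadata is read; data variables stay on disk until accessed.
ds = xr.open_dataset("mooring-raw.cdf")
print(ds.attrs)  # global attributes are available immediately

ds.attrs["history"] = ds.attrs.get("history", "") + " qaqc attrs updated;"
# Writing back with to_netcdf() writes the whole file again, which is the slow
# part for large datasets.
ds.to_netcdf("mooring-raw-updated.cdf")
```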

@ssuttles-usgs
Contributor Author

Writing the whole file is the part that can be slow for large files. Since we only need to append to the global attributes the entries that invoke the QA/QC we want to perform in the cdf2nc step, I am hoping there is a way to do that without re-writing the whole raw CDF file.

ncatted or other NCO tools might work better than xarray for this, but I honestly won't know until trying.

https://nco.sourceforge.net/nco.html#ncatted

https://stackoverflow.com/questions/69043727/how-can-i-add-or-edit-lot-of-global-attributes-with-ncatted
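
The netCDF4-python library might be another option, since it can open a file in append mode and edit just the global attributes; like ncatted, that should avoid rewriting the data variables, though I haven't tested it. A rough sketch (the attribute names and file name are illustrative only):

```python
import netCDF4

# Append mode edits the existing file in place rather than writing a new copy.
with netCDF4.Dataset("mooring-raw.cdf", mode="a") as nc:
    # Illustrative attribute names -- whatever qaqc keys cdf2nc expects would go here.
    nc.setncattr("P_1_min", 0.0)
    nc.setncattr("P_1_max", 20.0)
    print(nc.ncattrs())  # list the global attribute names to confirm the update
```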

I am happy to save this issue until we are in the QA/QC phase for a very large dataset. I am not there yet!

@dnowacki-usgs
Member

dnowacki-usgs commented Jan 3, 2024 via email

@ssuttles-usgs
Contributor Author

Ok, now I get it. Yes, re-reading the yaml file with a command-line option in the cdf2nc step, with the added qaqc calls, seems like a great solution! Sorry I did not see that that was what you meant before. Adding a user warning if any pre-existing keyword values have changed seems like a good idea; it would also allow the option of changing instrument metadata in the final step if needed. Alternatively, we could allow only non-existing keys/values to be added, but I prefer the approach with the flexibility and the user warning, as you suggest. As for timing, if you would like me to implement this capability, I will probably wait until I start processing the datasets I have that contain very large data files, which I suspect will be in about a month. If you want to go ahead and make the change sooner, that would be great too!
