Explore metadata limitations with Parquet format #26

Open
BecCowley opened this issue May 7, 2024 · 5 comments
Labels
CARSv2 branch for CARSv2 project

Comments

@BecCowley
Collaborator

Global attributes from NetCDF files don't get carried through to Parquet format.
What are the implications for the user of losing this metadata in the Parquet format?

@BecCowley BecCowley added the CARSv2 branch for CARSv2 project label May 7, 2024
@BecCowley
Collaborator Author

Can we transform the metadata into data? I mean, treat metadata as data: in the same way that we treat LATITUDE and LONGITUDE, we could also turn things like probe type, instrument type, recorder type, institute etc. into data columns.
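
To make the "metadata as data" idea concrete, here is a minimal sketch using plain Python dictionaries. It is not the pipeline's actual code: the function name `promote_attrs_to_columns`, the attribute names and all values are made up for illustration, and a real implementation would more likely work on a pandas or pyarrow table.

```python
from typing import Any


def promote_attrs_to_columns(
    rows: list[dict[str, Any]],
    global_attrs: dict[str, Any],
    keep: list[str],
) -> list[dict[str, Any]]:
    """Return new rows with the selected global attributes repeated as columns.

    Attributes listed in `keep` but missing from `global_attrs` become None,
    so every row ends up with the same set of columns.
    """
    promoted = {name: global_attrs.get(name) for name in keep}
    return [{**row, **promoted} for row in rows]


# Hypothetical input: two observations from one NetCDF file, plus that
# file's global attributes (values are illustrative only).
rows = [
    {"LATITUDE": -42.5, "LONGITUDE": 147.3, "TEMP": 12.1},
    {"LATITUDE": -42.5, "LONGITUDE": 147.3, "TEMP": 11.8},
]
global_attrs = {"probe_type": "XBT", "institute": "CSIRO", "history": "not promoted"}

table = promote_attrs_to_columns(rows, global_attrs, keep=["probe_type", "institute"])
```

Each row now carries `probe_type` and `institute` alongside `LATITUDE` and `LONGITUDE`, so users can filter on them like any other column. Attributes not listed in `keep` (here `history`) are still dropped.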

@BecCowley
Collaborator Author

https://gmd.copernicus.org/preprints/gmd-2021-138/gmd-2021-138.pdf
A paper on how large NetCDF files are converted to Parquet format. They handle the global attributes by including an additional file.
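
A rough sketch of the "additional file" approach, under assumptions of my own: the sidecar is JSON, it sits next to the Parquet file with an `.attrs.json` suffix, and it is keyed by source file name. None of this is taken from the paper; the helper names, file names and attribute values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path(parquet_path: Path) -> Path:
    # Hypothetical naming convention: cars.parquet -> cars.attrs.json
    return parquet_path.with_name(parquet_path.stem + ".attrs.json")


def write_sidecar(parquet_path: Path, attrs_by_source: dict[str, dict]) -> Path:
    """Write the global attributes of each source file next to the Parquet file."""
    path = sidecar_path(parquet_path)
    path.write_text(json.dumps(attrs_by_source, indent=2, sort_keys=True), encoding="utf-8")
    return path


def read_sidecar(parquet_path: Path) -> dict[str, dict]:
    """Load the attributes back, keyed by source NetCDF file name."""
    return json.loads(sidecar_path(parquet_path).read_text(encoding="utf-8"))


# Illustrative attributes for two hypothetical input files.
attrs_by_source = {
    "file_a.nc": {"institution": "CSIRO", "title": "example profile A"},
    "file_b.nc": {"institution": "CSIRO", "title": "example profile B"},
}

with tempfile.TemporaryDirectory() as tmp:
    parquet_path = Path(tmp) / "cars.parquet"  # the Parquet file itself is not written here
    written = write_sidecar(parquet_path, attrs_by_source)
    restored = read_sidecar(parquet_path)
```

The trade-off compared with promoting attributes to columns is that nothing is lost, but the user has to know the sidecar exists and join it back to the data themselves.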

@lbesnard
Collaborator

lbesnard commented May 8, 2024

For a bit more explanation on this: the metadata currently written in the Parquet sidecar is the metadata of the dataset, not the metadata of the original input NetCDF files. This creates an "issue" (very similar to what we had anyway with the data stored in PostgreSQL) where file-specific NetCDF metadata is lost. In other words, the Parquet format is "lossy" compared to the original NetCDF files.
For example, with the glider data, how should we store this kind of information:

    "PLATFORM": {
      "type": "string",
      "trans_system_id": "Irridium",
      "positioning_system": "GPS",
      "platform_type": "Slocum G2",
      "platform_maker": "Teledyne Webb Research",
      "firmware_version_navigation": 7.1,
      "firmware_version_science": 7.1,
      "glider_serial_no": "416",
      "battery_type": "Alkaline",
      "glider_owner": "CSIRO",
      "operating_institution": "ANFOG",
      "long_name": "platform informations"
    },
    "DEPLOYMENT": {
      "type": "string",
      "deployment_start_date": "2015-10-21-T05:00:02Z",
      "deployment_start_latitude": -18.9373,
      "deployment_start_longitude": 146.881,
      "deployment_start_technician": "Gregor, Rob",
      "deployment_end_date": "2015-10-27-T01:56:23Z",
      "deployment_end_latitude": -19.2358,
      "deployment_end_longitude": 147.5188,
      "deployment_end_status": "recovered",
      "deployment_pilot": "pilot, CSIRO",
      "long_name": "deployment informations"
    },
    "SENSOR1": {
      "type": "string",
      "sensor_type": "CTD",
      "sensor_maker": "Seabird",
      "sensor_model": "GPCTD",
      "sensor_serial_no": "9117",
      "sensor_calibration_date": "2013-09-17",
      "sensor_parameters": "TEMP, CNDC, PRES, PSAL",
      "long_name": "sensor1 informations"
    },
    "SENSOR2": {
      "type": "string",
      "sensor_type": "ECO Puck",
      "sensor_maker": "Wetlabs",
      "sensor_model": "FLBBCDSLC",
      "sensor_serial_no": "3345",
      "sensor_calibration_date": "2013-10-07",
      "sensor_parameters": "CPHL, CDOM, VBSC",
      "long_name": "sensor2 informations"
    },
...
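
One possible answer, sketched here only as an illustration: flatten the per-variable attributes into prefixed names, which could then become columns or key/value metadata entries. The function name `flatten_variable_attrs` and the `__` separator are assumptions; the sample values are a subset of the excerpt above.

```python
from typing import Any


def flatten_variable_attrs(
    var_attrs: dict[str, dict[str, Any]], sep: str = "__"
) -> dict[str, Any]:
    """Flatten {"PLATFORM": {"platform_type": ...}} into {"PLATFORM__platform_type": ...}."""
    flat: dict[str, Any] = {}
    for var_name, attrs in var_attrs.items():
        for key, value in attrs.items():
            flat[f"{var_name}{sep}{key}"] = value
    return flat


# Subset of the glider attributes shown above.
sample = {
    "PLATFORM": {"platform_type": "Slocum G2", "glider_serial_no": "416"},
    "SENSOR1": {"sensor_type": "CTD", "sensor_model": "GPCTD"},
}

flat = flatten_variable_attrs(sample)
```

This keeps every attribute addressable by name, but the number of columns grows with the number of sensors (SENSOR1, SENSOR2, ...), so it may only be practical for a chosen subset.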

@mhidas
Collaborator

mhidas commented May 8, 2024

@lbesnard I think what @BecCowley is talking about is adding some global attributes from the original NetCDF files into columns in the Parquet product. I'm pretty sure you talked about this as being possible with your code, you just have to configure it to do it, right?

@mhidas
Collaborator

mhidas commented May 8, 2024

https://github.com/aodn/aodn_cloud_optimised/blob/main/README_add_new_dataset.md#global-attributes-as-variables

@BecCowley you just need to specify which global attributes we should be adding


3 participants