### Example 3 - Accessing metadata

This example shows how to access Argo metadata stored in parquet format. All metadata are stored in a single parquet file, which we will load into memory and filter by float number.

#### Getting started

We start by importing the necessary modules and setting the path for the metadata parquet file.

In [1]:
from datetime import datetime
import numpy as np
import pandas as pd
import xarray as xr
import pyarrow.parquet as pq
import glob
from pprint import pprint

# Paths on Poseidon cluster
metadata_parquet = '../data/parquet/metadata/ArgoBGC_metadata.parquet'

We first query the parquet file for the variable names, since we might not be familiar with them or might want to check that they have not been changed from the Argo convention. The variable names for the metadata are as defined in [argo_synthetic-profile_index.txt](https://usgodae.org/pub/outgoing/argo/argo_synthetic-profile_index.txt).

In [2]:
dataset = pq.ParquetDataset(metadata_parquet)
schema = dataset.schema
pprint(sorted(schema.names))

['__index_level_0__',
 'cycle',
 'date',
 'date_update',
 'file',
 'filename',
 'filepath',
 'filepath_main',
 'institution',
 'latitude',
 'longitude',
 'ocean',
 'parameter_data_mode',
 'parameters',
 'profiler_type',
 'wmoid']


Filtering syntax works as in the previous examples. For example, we can ask for the metadata of a specific float by providing its WMO ID.

In [3]:
filter_ID = [("wmoid", "==", 4903798)]

With pandas, we pass the filter to the `filters` argument of `read_parquet()`:

In [4]:
df = pd.read_parquet(metadata_parquet, engine='pyarrow', filters=filter_ID)
df

| | file | date | latitude | longitude | ocean | profiler_type | institution | parameters | parameter_data_mode | date_update | wmoid | filepath_main | filepath | filename | cycle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 146170 | coriolis/4903798/profiles/SR4903798_001.nc | 2024-06-19 11:37:00 | 56.224 | -50.380 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-06-27 10:13:54 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_001.nc | 1 |
| 146171 | coriolis/4903798/profiles/SR4903798_001D.nc | 2024-06-18 17:14:00 | 56.123 | -50.531 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-06-27 10:13:45 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_001D.nc | 1 |
| 146172 | coriolis/4903798/profiles/SR4903798_002.nc | 2024-06-20 11:39:00 | 56.165 | -50.419 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-06-27 10:14:12 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_002.nc | 2 |
| 146173 | coriolis/4903798/profiles/SR4903798_002D.nc | 2024-06-19 13:33:00 | 56.224 | -50.376 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-06-27 10:14:03 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_002D.nc | 2 |
| 146174 | coriolis/4903798/profiles/SR4903798_003.nc | 2024-06-21 11:41:00 | 56.318 | -50.307 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-06-27 10:14:31 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_003.nc | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146229 | coriolis/4903798/profiles/SR4903798_045.nc | 2024-09-01 11:20:00 | 56.096 | -50.472 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-09-01 14:13:31 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_045.nc | 45 |
| 146230 | coriolis/4903798/profiles/SR4903798_046.nc | 2024-09-03 11:23:00 | 56.041 | -50.606 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-09-05 15:12:05 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_046.nc | 46 |
| 146231 | coriolis/4903798/profiles/SR4903798_047.nc | 2024-09-05 11:21:00 | 55.969 | -50.573 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-09-05 15:12:14 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_047.nc | 47 |
| 146232 | coriolis/4903798/profiles/SR4903798_048.nc | 2024-09-07 11:25:00 | 55.905 | -50.478 | A | 836 | IF | PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL | RRRAR | 2024-09-07 15:12:14 | 4903798 | coriolis/4903798/ | coriolis/4903798/profiles/ | SR4903798_048.nc | 48 |
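
The `parameters` and `parameter_data_mode` columns can be read together. Assuming, as the rows above suggest, that `parameter_data_mode` holds one character per entry of `parameters`, in the same order, the two can be paired as in the sketch below. In Argo's convention, `R` denotes real-time, `A` real-time adjusted, and `D` delayed-mode data. The helper name `pair_data_modes` is hypothetical, and the input strings are copied from the first row of the output.

```python
def pair_data_modes(parameters: str, data_modes: str) -> dict[str, str]:
    """Map each parameter name to its data-mode character.

    Assumes one data-mode character per space-separated parameter, in order.
    """
    names = parameters.split()
    if len(names) != len(data_modes):
        raise ValueError(
            f"{len(names)} parameters but {len(data_modes)} data-mode characters"
        )
    return dict(zip(names, data_modes))


# Strings taken from the first row shown above.
modes = pair_data_modes("PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL", "RRRAR")
print(modes)
```

On the dataframe, the same helper could be applied row by row, for example with `df.apply(lambda row: pair_data_modes(row['parameters'], row['parameter_data_mode']), axis=1)`.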


You can explore the dataframe just by calling it (`df`), as we did above. For a list of the stored variables, use `sorted(df.columns.to_list())`. The output matches the earlier `sorted(schema.names)`, except for `__index_level_0__`, which pandas restores as the dataframe index rather than as a column:

In [5]:
sorted(df.columns.to_list())

['cycle',
 'date',
 'date_update',
 'file',
 'filename',
 'filepath',
 'filepath_main',
 'institution',
 'latitude',
 'longitude',
 'ocean',
 'parameter_data_mode',
 'parameters',
 'profiler_type',
 'wmoid']

#### Exercise

Try to access some other metadata, for example:

* filtering by geographical coordinates (look at example 1 for inspiration on how to build the filter) or by other parameters;
* performing the reads and manipulations that your own tasks require.
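
As a starting point for the first item, a geographical filter could be assembled as in the sketch below. The helper name `bounding_box_filter` and the box limits are hypothetical; the column names `latitude` and `longitude` are those listed in the schema earlier. The pyarrow engine combines the tuples of a flat list with a logical AND.

```python
def bounding_box_filter(lat_min, lat_max, lon_min, lon_max):
    """Return a pyarrow-style filter list selecting a latitude/longitude box.

    Does not handle boxes that cross the antimeridian (lon_min > lon_max).
    """
    if lat_min > lat_max or lon_min > lon_max:
        raise ValueError("minimum bounds must not exceed maximum bounds")
    return [
        ("latitude", ">=", lat_min),
        ("latitude", "<=", lat_max),
        ("longitude", ">=", lon_min),
        ("longitude", "<=", lon_max),
    ]


# Hypothetical box around the float positions shown earlier.
filter_box = bounding_box_filter(55.0, 57.0, -52.0, -49.0)
print(filter_box)
```

The resulting list can be passed as `filters=filter_box` to `pd.read_parquet()`, in the same way as `filter_ID`.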

If you encounter any issues, please [reach out](mailto:enrico.milanese@whoi.edu)!