### Example 3 - Accessing metadata (Amazon S3 version)

This example shows how to access Argo metadata stored in parquet format in WHOI's AWS S3 data lake. All metadata are stored in a single parquet file, which we will load into memory and filter by float number.

##### Note on AWS S3

In this example we access metadata stored in WHOI's **AWS S3** data lake. Reading data from WHOI's Poseidon cluster is slightly different, and we refer you to the dedicated examples (manipulating the data once they are loaded into memory does not change).

NB: Access to the data lake should be public at this time. If you get a permission error, [reach out](mailto:enrico.milanese@whoi.edu).

##### Note on parquet files

The original netCDF Argo files have been converted to parquet format, which provides faster read operations.

There are a couple of ways to read parquet files in Python. One is to use pandas directly (make sure you have pyarrow, fastparquet or some other suitable engine installed); the other is to use Dask. Generally speaking, you'll want to use Dask if you need a large amount of data at the same time, so that you can benefit from its parallelization. Whenever the data fit in your RAM, you should avoid Dask and just use pandas.

When reading parquet files with pandas, you can specify either the file name (if you know which file you want) or the directory containing all the parquet files. In the latter case, if you apply a filter, pandas and pyarrow will sort through all the files in the folder and read into memory only the subsets that satisfy your filter.

#### Getting started

We start by importing the necessary modules and setting the path and filenames of the parquet files. For a list of modules that you need to install, you can look at the [README.md file in the repository](https://github.com/boom-lab/nc2parquet).

In [1]:
from datetime import datetime
import numpy as np
import pandas as pd
import xarray as xr
import pyarrow.parquet as pq
import glob
from pprint import pprint

# Importing AWS S3 modules, setting AWS S3 paths and file system
from pyarrow import fs
import boto3
from botocore import UNSIGNED
from botocore.client import Config
client = boto3.client('s3', config=Config(signature_version=UNSIGNED), region_name='us-east-1')

metadata_pqt = 's3://argo-experimental/pqt/metadata/test_metadata.parquet'
# Note: from here on, `fs` refers to the S3 file system object, not to the pyarrow module
fs, _ = fs.FileSystem.from_uri(metadata_pqt)

We first query the parquet file for the variable names, as we might not be familiar with them, or we might want to make sure that the names have not been changed from the Argo convention. The variable names for the metadata are as defined in [argo_synthetic-profile_index.txt](https://usgodae.org/pub/outgoing/argo/argo_synthetic-profile_index.txt).

In [2]:
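# removeprefix() (Python 3.9+) drops only the leading "s3://"; strip() would remove any of those characters from both ends
dataset = pq.ParquetDataset(metadata_pqt.removeprefix("s3://"), filesystem=fs)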
schema = dataset.schema
pprint(sorted(schema.names))

['__index_level_0__',
 'cycle',
 'date',
 'date_update',
 'file',
 'filename',
 'filepath',
 'filepath_main',
 'institution',
 'latitude',
 'longitude',
 'ocean',
 'parameter_data_mode',
 'parameters',
 'profiler_type',
 'wmoid']


Filtering syntax works as in the previous examples. For example, we can ask for the metadata of a specific float by providing its WMO ID.

In [3]:
filter_ID = [("wmoid","==",4903798)]

We pass the filter to the `filters` argument of `pq.ParquetDataset`, then read the selected rows and convert them to a pandas dataframe:

In [4]:
# removeprefix() (Python 3.9+) drops only the leading "s3://"
ds = pq.ParquetDataset(metadata_pqt.removeprefix("s3://"), filesystem=fs, filters=filter_ID)
df = ds.read().to_pandas()
df

Unnamed: 0,file,date,latitude,longitude,ocean,profiler_type,institution,parameters,parameter_data_mode,date_update,wmoid,filepath_main,filepath,filename,cycle
142317,coriolis/4903798/profiles/SR4903798_001.nc,2024-06-19 11:37:00,56.224,-50.38,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:13:54,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_001.nc,1
142318,coriolis/4903798/profiles/SR4903798_001D.nc,2024-06-18 17:14:00,56.123,-50.531,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:13:45,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_001D.nc,1
142319,coriolis/4903798/profiles/SR4903798_002.nc,2024-06-20 11:39:00,56.165,-50.419,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:14:12,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_002.nc,2
142320,coriolis/4903798/profiles/SR4903798_002D.nc,2024-06-19 13:33:00,56.224,-50.376,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:14:03,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_002D.nc,2
142321,coriolis/4903798/profiles/SR4903798_003.nc,2024-06-21 11:41:00,56.318,-50.307,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:14:31,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_003.nc,3
142322,coriolis/4903798/profiles/SR4903798_003D.nc,2024-06-20 13:03:00,56.165,-50.425,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:14:21,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_003D.nc,3
142323,coriolis/4903798/profiles/SR4903798_004.nc,2024-06-22 11:40:00,56.19,-50.287,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:14:50,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_004.nc,4
142324,coriolis/4903798/profiles/SR4903798_004D.nc,2024-06-21 13:10:00,56.323,-50.296,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:14:40,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_004D.nc,4
142325,coriolis/4903798/profiles/SR4903798_005.nc,2024-06-23 11:39:00,56.407,-50.443,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:15:10,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_005.nc,5
142326,coriolis/4903798/profiles/SR4903798_005D.nc,2024-06-22 13:01:00,56.183,-50.292,A,836,IF,PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL,RRRAR,2024-06-27 10:15:00,4903798,coriolis/4903798/,coriolis/4903798/profiles/,SR4903798_005D.nc,5
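
Once the metadata are in a dataframe, ordinary pandas operations apply. The sketch below uses two rows copied from the output above and rests on two assumptions about the Argo index conventions: that `parameter_data_mode` holds one letter per entry of `parameters`, in the same order, and that a `D` right before `.nc` marks a descending profile. The column names `modes` and `descending` are introduced here for illustration only.

```python
import pandas as pd

# Two rows copied from the output above, restricted to the columns used here.
sample = pd.DataFrame(
    {
        "filename": ["SR4903798_001.nc", "SR4903798_001D.nc"],
        "parameters": ["PRES TEMP PSAL DOXY PH_IN_SITU_TOTAL"] * 2,
        "parameter_data_mode": ["RRRAR"] * 2,
    }
)

# Assumption: the i-th letter of parameter_data_mode belongs to the i-th parameter.
sample["modes"] = [
    dict(zip(names.split(), letters))
    for names, letters in zip(sample["parameters"], sample["parameter_data_mode"])
]

# Assumption: a trailing "D" in the file name marks a descending profile.
sample["descending"] = sample["filename"].str.endswith("D.nc")

print(sample.loc[0, "modes"])
print(sample[["filename", "descending"]])
```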


You can explore the dataframe just by calling it (`df`) as we did above. If you want a list of the variables that are stored, you can use `sorted(df.columns.to_list())`. It gives the same output as the previous `sorted(schema.names)`, except for `__index_level_0__`, which pandas uses as the dataframe index rather than as a column:

In [5]:
sorted(df.columns.to_list())

['cycle',
 'date',
 'date_update',
 'file',
 'filename',
 'filepath',
 'filepath_main',
 'institution',
 'latitude',
 'longitude',
 'ocean',
 'parameter_data_mode',
 'parameters',
 'profiler_type',
 'wmoid']

#### Exercise

Try accessing some other metadata, for example:
* filtering by geographical coordinates (look at example 1 for inspiration on how to build the filter) or by other parameters;
* performing the reads and manipulations that you would need for your own tasks.

If you encounter any issues, please [reach out](mailto:enrico.milanese@whoi.edu)!