# Basic ERDDAP Obs Matching

This notebook is a demonstration of a basic recommended method to collect SalishSeaCast model field values to match
ocean observations.
It uses `xarray` to access the SalishSeaCast model datasets via the ERDDAP server.
The observations are assumed to be defined by a collection of 4d time-space coordinates in a `pandas.DataFrame`.
(Those could, of course, be read from a CSV file.)
The model field values closest to the observations are collected in an `xarray.Dataset` that is converted to a
`pandas.DataFrame` at the end of the notebook.


It is assumed that the observations are discrete and independent;
i.e. single points in time-space.

Extraction of time series, depth profiles, or other hyperslabs of model fields is best done using
[Reshapr](https://reshapr.readthedocs.io/en/latest/).

Collection of model field values to match observations using the method demonstrated here is
a process that lends itself well to parallelization,
in contrast to the simple loop iteration that is used in this notebook.
Scaling the method demonstrated here to match thousands of observations is the subject of
another notebook in this directory.

The conda environment file used to run this notebook is `environment.yaml` in this directory.

In [3]:
import pandas as pd
import xarray as xr

For reference, these are the versions of Python and the key packages in the environment
in which this notebook was last run:

In [2]:
import sys
import netCDF4
import numpy
import pandas
import xarray

print(f"Python {sys.version=}")
print(f"{numpy.__version__=}")
print(f"{xarray.__version__=}")
print(f"{pandas.__version__=}")
print(f"{netCDF4.__version__=}")

Python sys.version='3.13.2 | packaged by conda-forge | (main, Feb 17 2025, 14:10:22) [GCC 13.3.0]'
numpy.__version__='2.2.3'
xarray.__version__='2025.1.2'
pandas.__version__='2.2.3'
netCDF4.__version__='1.7.2'


## Basic ERDDAP Dataset Access with `xarray`

* The SalishSeaCast ERDDAP server is at https://salishsea.eos.ubc.ca/erddap.
* The list of datasets published by the server is at https://salishsea.eos.ubc.ca/erddap/info/index.html.
  The dataset ids are in the rightmost column of the table on that page.
* You can look at the metadata for a specific dataset by clicking the `M` link in the "FGDC, ISO, Metadata" column of the table.
  That will take you to a page with a URL of the form `https://salishsea.eos.ubc.ca/erddap/info/dataset_id/index.html`,
  for example: https://salishsea.eos.ubc.ca/erddap/info/ubcSSg3DPhysicsFields1hV21-11/index.html
* To access that dataset using `xarray.open_dataset()`,
  use a URL of the form `https://salishsea.eos.ubc.ca/erddap/griddap/dataset_id`.

In [5]:
erddap_url = "https://salishsea.eos.ubc.ca/erddap"
dataset_id = "ubcSSg3DPhysicsFields1hV21-11"
dataset_url = f"{erddap_url}/griddap/{dataset_id}"

In [6]:
ds = xr.open_dataset(dataset_url)

ds

`xarray` defers loading the actual values of the fields in the dataset as long as possible.
That's known as lazy-loading.
The information returned by `xarray.open_dataset()` in the cell above is just the dataset metadata.

The ERDDAP server has 2 important limits that control how much data you can retrieve in a single operation:

1. The size of the data transfer must be less than 2 GB.
2. The processing time on the server to extract the data from the underlying files must be less than 10 minutes.

So, to get the maximum amount of useful data within those limits, we need to limit our data requests in every way possible.
The first way to do that is to reduce the set of variables we are requesting from the server to only those we are interested in.
Subtracting the set of variables that we are interested in from the set of all variables in the dataset gives us a set of variables
that we can tell the server to drop from our request:


In [7]:
all_vars = set(ds.data_vars)
keep_vars = {"salinity"}
drop_vars = all_vars - keep_vars
ds = xr.open_dataset(dataset_url, drop_variables=drop_vars)

ds
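
To get a feel for why trimming the request matters, the sketch below estimates the uncompressed size of a request from its shape.
The helper name `estimate_request_bytes` is hypothetical.
The grid shape (40 depth levels, 898 by 398 grid points) is assumed here for illustration and should be checked against the dataset metadata,
and treating the 2 GB limit as 2 × 10⁹ bytes is also an assumption.
Values are taken to be 4-byte `float32`, which matches the `salinity` variable.

```python
def estimate_request_bytes(n_time, n_depth, n_y, n_x, n_vars=1, bytes_per_value=4):
    """Estimate the uncompressed size in bytes of a gridded data request."""
    return n_time * n_depth * n_y * n_x * n_vars * bytes_per_value


# Assumption: the 2 GB transfer limit is taken as 2 * 10**9 bytes
LIMIT_BYTES = 2 * 10**9

# Assumption: illustrative grid shape; check the dataset metadata for the real values
N_DEPTH, N_Y, N_X = 40, 898, 398

# One day of hourly full-domain fields for 1 variable and for 2 variables
one_day_one_var = estimate_request_bytes(24, N_DEPTH, N_Y, N_X, n_vars=1)
one_day_two_vars = estimate_request_bytes(24, N_DEPTH, N_Y, N_X, n_vars=2)

# A single time-space point for 1 variable
single_point = estimate_request_bytes(1, 1, 1, 1)

print(f"{one_day_one_var=} within limit: {one_day_one_var < LIMIT_BYTES}")
print(f"{one_day_two_vars=} within limit: {one_day_two_vars < LIMIT_BYTES}")
print(f"{single_point=}")
```

Under those assumptions, a full day of one variable fits within the limit but adding a second variable does not,
while a single matched point is only a few bytes of field data.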

## Matching a Single Observation Point

Let's consider a single salinity observation for which we want to get the nearest model salinity field value.
If we read that observation from a CSV file into a `pandas.DataFrame` it would look like:

In [49]:
df = pd.DataFrame(
    {
        "time": [pd.to_datetime("2025-03-07 11:45:00")],
        "depth": [0.5],
        "gridY": [350],
        "gridX": [250],
        "salinity": [28.5],
    }
)

df

|   | time                | depth | gridY | gridX | salinity |
|---|---------------------|-------|-------|-------|----------|
| 0 | 2025-03-07 11:45:00 | 0.5   | 350   | 250   | 28.5     |


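The time and depth values of the observation are unlikely to exactly match those of a model calculation,
so we use `method="nearest"` to select the closest model time and depth.
The `gridY` and `gridX` values are grid indices, so they are selected exactly.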

In [35]:
ds_pt = (ds
    .sel(time=df["time"][0], method="nearest")
    .sel(depth=df["depth"][0], method="nearest")
    .sel(gridY=df["gridY"][0], gridX=df["gridX"][0])
)

ds_pt

In [36]:
ds_pt.salinity.values

array(28.972582, dtype=float32)
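
It can be useful to check how far the matched model coordinates are from the observation.
The sketch below does that on a small synthetic dataset, so that it runs without the ERDDAP server.
The names `toy`, `obs_time`, and `obs_depth`, and the depth levels used, are illustrative assumptions;
the hourly fields centred on the half hour mimic the model times that appear in this notebook's outputs.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the model dataset: hourly fields centred on the half hour
times = pd.date_range("2025-03-07 00:30:00", periods=24, freq=pd.Timedelta(hours=1))
depths = np.array([0.5, 1.5, 2.5], dtype="float32")  # assumed depth levels
toy = xr.Dataset(
    {"salinity": (("time", "depth"), np.zeros((times.size, depths.size), dtype="float32"))},
    coords={"time": times, "depth": depths},
)

# Hypothetical observation coordinates
obs_time = pd.Timestamp("2025-03-07 11:45:00")
obs_depth = 0.9

# Select the nearest model time and depth, as in the cell above
pt = toy.sel(time=obs_time, method="nearest").sel(depth=obs_depth, method="nearest")

# Differences between the matched model coordinates and the observation
model_time = pt["time"].values
time_offset = abs(model_time - obs_time.to_datetime64())
depth_offset = abs(float(pt["depth"]) - obs_depth)

print(f"{model_time=}")
print(f"{time_offset=}")
print(f"{depth_offset=}")
```

Here the 11:45 observation is matched to the 11:30 model field, a 15 minute offset.
Whether an offset of that size is acceptable depends on the analysis.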

In [43]:
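def foo(time, depth, gridY, gridX):
    """Return the model dataset point nearest to an observation's time-space coordinates."""
    ds_pt = (ds
        # time and depth rarely match model values exactly, so take the nearest
        .sel(time=time, method="nearest")
        .sel(depth=depth, method="nearest")
        # gridY and gridX are grid indices, so they must match exactly
        .sel(gridY=gridY, gridX=gridX)
    )
    return ds_pt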

In [44]:
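# Select only the coordinate columns; an extra column such as salinity would break the 4-way unpacking
[foo(time, depth, gridY, gridX) for time, depth, gridY, gridX in df[["time", "depth", "gridY", "gridX"]].to_numpy()]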

[<xarray.Dataset> Size: 20B
 Dimensions:   ()
 Coordinates:
     time      datetime64[ns] 8B 2025-03-07T11:30:00
     depth     float32 4B 0.5
     gridY     int16 2B 350
     gridX     int16 2B 250
 Data variables:
     salinity  float32 4B ...
 Attributes: (12/25)
     acknowledgement:           MEOPAR, Ocean Networks Canada (ONC), Digital R...
     cdm_data_type:             Grid
     comment:                   If you use this dataset in your research,\nple...
     Conventions:               CF-1.6, COARDS, ACDD-1.3
     creator_email:             sallen@eoas.ubc.ca
     creator_name:              SalishSeaCast Project Contributors
     ...                        ...
     testOutOfDate:             now-16hours
     time_coverage_end:         2025-03-07T23:30:00Z
     time_coverage_start:       2007-01-01T00:30:00Z
     timeStamp:                 2025-Mar-07 17:08:30 GMT
     title:                     Green, Salish Sea, 3d Physics Fields, Hourly, ...
     uuid:                      

In [41]:
df = pd.DataFrame(
    {
        "time": [pd.to_datetime(timestamp) for timestamp in ["2025-03-07 11:59:00", "2025-03-07 12:30:00"]],
        "depth": [0.5, 0.5],
        "gridY": [350, 350],
        "gridX": [250, 250],
    }
)

df

|   | time                | depth | gridY | gridX |
|---|---------------------|-------|-------|-------|
| 0 | 2025-03-07 11:59:00 | 0.5   | 350   | 250   |
| 1 | 2025-03-07 12:30:00 | 0.5   | 350   | 250   |


In [45]:
[foo(time, depth, gridY, gridX) for time, depth, gridY, gridX in df.to_numpy()]

[<xarray.Dataset> Size: 20B
 Dimensions:   ()
 Coordinates:
     time      datetime64[ns] 8B 2025-03-07T11:30:00
     depth     float32 4B 0.5
     gridY     int16 2B 350
     gridX     int16 2B 250
 Data variables:
     salinity  float32 4B ...
 Attributes: (12/25)
     acknowledgement:           MEOPAR, Ocean Networks Canada (ONC), Digital R...
     cdm_data_type:             Grid
     comment:                   If you use this dataset in your research,\nple...
     Conventions:               CF-1.6, COARDS, ACDD-1.3
     creator_email:             sallen@eoas.ubc.ca
     creator_name:              SalishSeaCast Project Contributors
     ...                        ...
     testOutOfDate:             now-16hours
     time_coverage_end:         2025-03-07T23:30:00Z
     time_coverage_start:       2007-01-01T00:30:00Z
     timeStamp:                 2025-Mar-07 17:08:30 GMT
     title:                     Green, Salish Sea, 3d Physics Fields, Hourly, ...
     uuid:                      

In [46]:
ds_pts = [foo(time, depth, gridY, gridX) for time, depth, gridY, gridX in df.to_numpy()]

In [48]:
xr.concat(ds_pts, dim="model_time")
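
The introduction mentions converting the collected points to a `pandas.DataFrame`.
A minimal sketch of that step is shown below, using a small synthetic dataset in place of the ERDDAP one so that it runs offline.
The names `toy`, `obs`, `matched`, and `model_df`, and the `obs` dimension name, are illustrative assumptions.
Passing `coords="all"` to `xr.concat()` is a choice made here so that every scalar coordinate becomes a per-point column.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the model dataset (assumed shape and constant values)
times = pd.date_range("2025-03-07 00:30:00", periods=24, freq=pd.Timedelta(hours=1))
toy = xr.Dataset(
    {
        "salinity": (
            ("time", "depth", "gridY", "gridX"),
            np.full((24, 3, 2, 2), 28.0, dtype="float32"),
        )
    },
    coords={
        "time": times,
        "depth": np.array([0.5, 1.5, 2.5], dtype="float32"),
        "gridY": [350, 351],
        "gridX": [250, 251],
    },
)

# Observation coordinates, as in the 2-row DataFrame above
obs = pd.DataFrame(
    {
        "time": pd.to_datetime(["2025-03-07 11:59:00", "2025-03-07 12:30:00"]),
        "depth": [0.5, 0.5],
        "gridY": [350, 350],
        "gridX": [250, 250],
    }
)

# Collect the nearest model point for each observation
pts = [
    toy.sel(time=row.time, method="nearest")
    .sel(depth=row.depth, method="nearest")
    .sel(gridY=row.gridY, gridX=row.gridX)
    for row in obs.itertuples(index=False)
]

# Stack the points along a new dimension, then convert to a DataFrame
matched = xr.concat(pts, dim="obs", coords="all")
model_df = matched.to_dataframe()

print(model_df)
```

The rows of `model_df` are in the same order as the rows of `obs`,
so the model values can be placed alongside the observations by position.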