# Data Access Methods

This tutorial demonstrates several ways data can be accessed remotely and loaded into a Python environment, including:

* THREDDS/OPeNDAP
* OGC Web Feature Service (WFS)
* direct access to files on cloud storage (AWS S3)
* cloud-optimised formats Zarr & Parquet
* New OGC APIs

## THREDDS / OPeNDAP  **TODO**

## Web Feature Service (WFS)

* A [standard](http://www.opengeospatial.org/standards/wfs) of the [Open Geospatial Consortium](http://www.opengeospatial.org/) (OGC)
* Allows geographic _features_ (spatial extent + data) to be accessed via the Web.
* Allows filtering based on spatial extent and attributes.

For example, most of the tabular (1-dimensional) data from the Australian Integrated Marine Observing System (IMOS) is available via WFS.

In [1]:
from owslib.wfs import WebFeatureService

wfs = WebFeatureService(url="https://geoserver-123.aodn.org.au/geoserver/wfs",
                        version="1.1.0")
wfs.identification.title

'AODN Web Feature Service (WFS)'

In [2]:
# Each dataset is served as a separate "feature type":
print(f"There are {len(wfs.contents)} feature types, e.g.")
list(wfs.contents)[:10]

There are 397 feature types, e.g.


['imos:anmn_ctd_profiles_data',
 'imos:anmn_ctd_profiles_map',
 'imos:anmn_velocity_timeseries_map',
 'imos:anmn_nrs_rt_meteo_timeseries_data',
 'imos:anmn_nrs_rt_meteo_timeseries_map',
 'imos:anmn_nrs_rt_bio_timeseries_data',
 'imos:anmn_nrs_rt_bio_timeseries_map',
 'imos:anmn_nrs_rt_wave_timeseries_data',
 'imos:anmn_nrs_rt_wave_timeseries_map',
 'imos:anmn_acoustics_map']

For now we'll assume we already know which feature type we want: a dataset containing selected CTD profiles obtained at the National Reference Stations around Australia.

In [3]:
typename = 'imos:nrs_depth_binned_ctd_data'
wfs.get_schema(typename)

{'properties': {'Project': 'string',
  'StationName': 'string',
  'TripCode': 'string',
  'CastTimeUTC': 'dateTime',
  'Latitude': 'decimal',
  'Longitude': 'decimal',
  'file_id': 'int',
  'SampleTime_Local': 'string',
  'SampleTime_UTC': 'dateTime',
  'trip_code': 'string',
  'SampleDepth_m': 'float',
  'Salinity_psu': 'float',
  'Salinity_flag': 'string',
  'Temperature_degC': 'float',
  'Temperature_flag': 'string',
  'DissolvedOxygen_umolkg': 'float',
  'DissolvedOxygen_flag': 'string',
  'Chla_mgm3': 'float',
  'Chla_flag': 'string',
  'Turbidity_NTU': 'float',
  'Turbidity_flag': 'string',
  'Conductivity_Sm': 'float',
  'Conductivity_flag': 'string',
  'WaterDensity_kgm3': 'float',
  'WaterDensity_flag': 'string'},
 'required': [],
 'geometry': 'Point',
 'geometry_column': 'geom'}
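The schema dictionary is useful for more than inspection. As a small sketch (the helper name `columns_of_type` is made up for this example, and the dictionary below holds only a few entries copied from the output above), we can pick out the `dateTime` properties so they can later be passed to `parse_dates` in `pandas.read_csv`:

```python
def columns_of_type(schema, wfs_type):
    """Return the property names in a WFS schema dict that have the given type."""
    return [name for name, kind in schema["properties"].items() if kind == wfs_type]


# A few entries copied from the schema output above, for illustration only
example_schema = {
    "properties": {
        "StationName": "string",
        "CastTimeUTC": "dateTime",
        "SampleTime_Local": "string",
        "SampleTime_UTC": "dateTime",
        "SampleDepth_m": "float",
    },
    "geometry": "Point",
    "geometry_column": "geom",
}

date_columns = columns_of_type(example_schema, "dateTime")
print(date_columns)
```

With the live service, `columns_of_type(wfs.get_schema(typename), "dateTime")` should give the same kind of list for the full schema.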

We can read in a subset of the data by specifying a bounding box (in this case near Rottnest Island, just off Perth, WA).

We'll get the result in CSV format so it's easy to read into a Pandas DataFrame.

In [4]:
import pandas as pd

xmin, xmax = 115.2, 115.7
ymin, ymax = -32.2, -31.8

response = wfs.getfeature(typename=typename, bbox=(xmin, ymin, xmax, ymax), outputFormat='csv')
df = pd.read_csv(response)
response.close()

df.head()

| | FID | Project | StationName | TripCode | CastTimeUTC | Latitude | Longitude | file_id | SampleTime_Local | SampleTime_UTC | ... | DissolvedOxygen_umolkg | DissolvedOxygen_flag | Chla_mgm3 | Chla_flag | Turbidity_NTU | Turbidity_flag | Conductivity_Sm | Conductivity_flag | WaterDensity_kgm3 | WaterDensity_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nrs_depth_binned_ctd_data.fid-2a03b077_18971e1... | NRS | Rottnest Island | ROT20100520 | 2010-05-20T02:06:34 | -32 | 115.4167 | 2561 | 2010-05-20 10:05:00 | 2010-05-20T02:05:00 | ... | | | 0.3787 | 0.0 | 0.0751 | 0.0 | 5.0086 | 1 | 1024.9067 | 0.0 |
| 1 | nrs_depth_binned_ctd_data.fid-2a03b077_18971e1... | NRS | Rottnest Island | ROT20100520 | 2010-05-20T02:06:34 | -32 | 115.4167 | 2561 | 2010-05-20 10:05:00 | 2010-05-20T02:05:00 | ... | | | 0.4106 | 0.0 | 0.0765 | 0.0 | 5.0087 | 1 | 1024.9117 | 0.0 |
| 2 | nrs_depth_binned_ctd_data.fid-2a03b077_18971e1... | NRS | Rottnest Island | ROT20100520 | 2010-05-20T02:06:34 | -32 | 115.4167 | 2561 | 2010-05-20 10:05:00 | 2010-05-20T02:05:00 | ... | | | 0.4201 | 0.0 | 0.0799 | 0.0 | 5.0086 | 1 | 1024.9164 | 0.0 |
| 3 | nrs_depth_binned_ctd_data.fid-2a03b077_18971e1... | NRS | Rottnest Island | ROT20100520 | 2010-05-20T02:06:34 | -32 | 115.4167 | 2561 | 2010-05-20 10:05:00 | 2010-05-20T02:05:00 | ... | | | 0.449 | 0.0 | 0.0847 | 0.0 | 5.0074 | 1 | 1024.92 | 0.0 |
| 4 | nrs_depth_binned_ctd_data.fid-2a03b077_18971e1... | NRS | Rottnest Island | ROT20100520 | 2010-05-20T02:06:34 | -32 | 115.4167 | 2561 | 2010-05-20 10:05:00 | 2010-05-20T02:05:00 | ... | | | 0.5021 | 0.0 | 0.079 | 0.0 | 5.0065 | 1 | 1024.9272 | 0.0 |
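It can help to see what a bounding-box request looks like on the wire. The sketch below builds a WFS `GetFeature` URL by hand using the standard key-value parameters (`service`, `version`, `request`, `typeName`, `bbox`, `outputFormat`). The helper name `getfeature_url` is made up for this example, and the result is not necessarily identical to the request `owslib` sends. In particular, the coordinate order a server expects in `bbox` can depend on the WFS version and coordinate reference system.

```python
from urllib.parse import urlencode


def getfeature_url(base_url, typename, bbox, output_format="csv", version="1.1.0"):
    """Build a WFS GetFeature URL using key-value-pair (KVP) encoding."""
    params = {
        "service": "WFS",
        "version": version,
        "request": "GetFeature",
        "typeName": typename,
        # Assumed order: xmin, ymin, xmax, ymax (as in the cell above)
        "bbox": ",".join(str(v) for v in bbox),
        "outputFormat": output_format,
    }
    return f"{base_url}?{urlencode(params)}"


url = getfeature_url(
    "https://geoserver-123.aodn.org.au/geoserver/wfs",
    "imos:nrs_depth_binned_ctd_data",
    (115.2, -32.2, 115.7, -31.8),
)
print(url)
```

A URL like this could be pasted into a browser or passed straight to `pd.read_csv`, which is handy when `owslib` is not available.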


We can also filter the data based on the values in specified columns (properties) and ask for only a subset of the columns to be returned. The filters need to be provided in XML format, but the `owslib` library allows us to construct them in a more Pythonic way.

In [5]:
from owslib.etree import etree
from owslib.fes import PropertyIsEqualTo, And

# Combine the conditions with a logical AND
# (named wfs_filter to avoid shadowing Python's built-in filter())
wfs_filter = And([PropertyIsEqualTo(propertyname="StationName", literal="Rottnest Island"),
                  PropertyIsEqualTo(propertyname="Temperature_flag", literal="1"),
                  PropertyIsEqualTo(propertyname="Salinity_flag", literal="1")
                 ])
filterxml = etree.tostring(wfs_filter.toXML(), encoding="unicode")
response = wfs.getfeature(typename=typename, filter=filterxml, outputFormat="csv",
                          propertyname=["CastTimeUTC", "SampleDepth_m", "Temperature_degC", "Salinity_psu", "Chla_mgm3"]
                         )
df = pd.read_csv(response, parse_dates=["CastTimeUTC"])
response.close()

# df.set_index(["CastTimeUTC", "SampleDepth_m"], inplace=True)

# the server adds a feature ID column we don't really need
df.drop(columns='FID', inplace=True)

df

| | CastTimeUTC | SampleDepth_m | Salinity_psu | Temperature_degC | Chla_mgm3 |
|---|---|---|---|---|---|
| 0 | 2010-05-20 02:06:34 | 2 | 35.6492 | 21.3280 | 0.3787 |
| 1 | 2010-05-20 02:06:34 | 3 | 35.6495 | 21.3277 | 0.4106 |
| 2 | 2010-05-20 02:06:34 | 4 | 35.6495 | 21.3270 | 0.4201 |
| 3 | 2010-05-20 02:06:34 | 5 | 35.6463 | 21.3190 | 0.4490 |
| 4 | 2010-05-20 02:06:34 | 6 | 35.6468 | 21.3093 | 0.5021 |
| ... | ... | ... | ... | ... | ... |
| 6001 | 2023-06-13 01:36:01 | 42 | 35.2102 | 20.6584 | 0.0330 |
| 6002 | 2023-06-13 01:36:01 | 43 | 35.2102 | 20.6608 | 0.0309 |
| 6003 | 2023-06-13 01:36:01 | 44 | 35.2102 | 20.6614 | 0.0321 |
| 6004 | 2023-06-13 01:36:01 | 45 | 35.2101 | 20.6616 | 0.0306 |
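For readers curious about the XML behind the attribute filter, here is a rough sketch built with the standard library only. It follows the general shape of an OGC Filter Encoding 1.1 document (`Filter` > `And` > `PropertyIsEqualTo`). The helper names `equal_to` and `all_of` are made up for this example, and the exact XML that `owslib` emits (prefixes, wrapping elements) may differ.

```python
import xml.etree.ElementTree as ET

OGC_NS = "http://www.opengis.net/ogc"
ET.register_namespace("ogc", OGC_NS)


def equal_to(name, value):
    """Build an <ogc:PropertyIsEqualTo> element comparing a property to a literal."""
    node = ET.Element(f"{{{OGC_NS}}}PropertyIsEqualTo")
    ET.SubElement(node, f"{{{OGC_NS}}}PropertyName").text = name
    ET.SubElement(node, f"{{{OGC_NS}}}Literal").text = value
    return node


def all_of(conditions):
    """Wrap several conditions in <ogc:Filter><ogc:And>...</ogc:And></ogc:Filter>."""
    root = ET.Element(f"{{{OGC_NS}}}Filter")
    and_node = ET.SubElement(root, f"{{{OGC_NS}}}And")
    and_node.extend(conditions)
    return root


sketch = all_of([
    equal_to("StationName", "Rottnest Island"),
    equal_to("Temperature_flag", "1"),
    equal_to("Salinity_flag", "1"),
])
ET.indent(sketch)  # pretty-print; requires Python 3.9+
xml_text = ET.tostring(sketch, encoding="unicode")
print(xml_text)
```

Writing this by hand quickly becomes tedious for nested conditions, which is the main reason to let `owslib.fes` generate it.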


In [6]:
import holoviews
import hvplot.pandas

df.hvplot(x="Temperature_degC", y="SampleDepth_m", by="CastTimeUTC", flip_yaxis=True, legend=False, width=1200, height=500)

In [7]:
df.hvplot.scatter(x="Salinity_psu", y="Temperature_degC",
                  xlim=(34, 37), ylim=(15, 25),
                  legend=False, width=700, height=700)

Further examples?
* Plot timeseries of near-surface values
* Compute mixed layer depth (MLD) (or read it from `nrs_derived_indices_data`) and plot timeseries
* Calculate average profile per month of year?
* Plot timeseries of various phytoplankton species abundances?
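As a possible starting point for the first idea in this list, the sketch below averages each cast over its top few metres to produce a near-surface timeseries. The function name `near_surface_mean`, the 5 m threshold and the toy values are all assumptions made for illustration; only the column names come from the dataset above.

```python
import pandas as pd


def near_surface_mean(df, max_depth=5.0):
    """Average each variable over the top `max_depth` metres of every cast."""
    shallow = df[df["SampleDepth_m"] <= max_depth]
    return shallow.groupby("CastTimeUTC").mean().drop(columns="SampleDepth_m")


# Made-up values, only to show the shape of the input and output
toy = pd.DataFrame({
    "CastTimeUTC": pd.to_datetime(
        ["2010-05-20", "2010-05-20", "2010-05-20", "2010-06-15", "2010-06-15"]
    ),
    "SampleDepth_m": [2, 4, 30, 2, 4],
    "Temperature_degC": [21.0, 21.2, 19.0, 20.0, 20.4],
})

surface = near_surface_mean(toy)
print(surface)
```

Applied to the filtered DataFrame from the WFS request, something like `near_surface_mean(df).hvplot()` should then give one line per variable against cast time.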

**TODO** Add abstract & metadata link to the example WFS layer

## Reading files on cloud storage

Data files made available to the public on cloud storage such as Amazon S3 can be accessed over the web as if they were stored locally. You just need to find the exact URL for each file.

For example, all the public data files hosted by the Australian Ocean Data Network are stored in an [S3 bucket](https://www.techtarget.com/searchaws/definition/AWS-bucket) called `imos-data`. You can browse the contents of the bucket and download individual files [here](https://imos-data.aodn.org.au). 

To access the bucket using Python, we'll use the `s3fs` library.

In [8]:
import s3fs

s3 = s3fs.S3FileSystem(anon=True)

# List the 2023 files for one satellite SST product (RAMSSA L4)
sst_files = s3.ls("imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023")
sst_files[-20:]

['imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230629120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230630120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230701120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230702120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230703120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230704120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230705120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230706120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/202307071
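The file names in this listing appear to start with a `YYYYMMDDHHMMSS` time stamp, so we can select files by date without opening any of them. The following is a minimal sketch under that assumption; the helper names `file_time` and `files_between` are made up for this example.

```python
from datetime import datetime


def file_time(path):
    """Parse the leading YYYYMMDDHHMMSS stamp from a file name like those listed above."""
    name = path.rsplit("/", 1)[-1]
    return datetime.strptime(name[:14], "%Y%m%d%H%M%S")


def files_between(paths, start, end):
    """Keep only the paths whose time stamp falls within [start, end]."""
    return [p for p in paths if start <= file_time(p) <= end]


# Two paths copied from the listing above
example_paths = [
    "imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230629120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc",
    "imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230705120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc",
]

july = files_between(example_paths, datetime(2023, 7, 1), datetime(2023, 7, 31, 23, 59, 59))
print(july)
```

The same call works on the full `sst_files` list returned by `s3.ls`.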

In [9]:
import xarray as xr
import holoviews as hv
import hvplot.xarray

ds = xr.open_dataset(s3.open(sst_files[-1]))  # Q: Should we use s3fs.S3Map() here instead of s3.open()?
ds

In [11]:
import geoviews as gv
import geoviews.feature as gf
from geoviews import opts
from cartopy import crs

gv.extension('bokeh', 'matplotlib')
gv.output(size=150)

In [13]:
sst_var = 'analysed_sst'
gds = gv.Dataset(ds,
                 kdims=['lon', 'lat'],
                 vdims=[sst_var],
                 crs=crs.PlateCarree(central_longitude=180)  # this is needed to properly handle lon > 180
                )
sst_plot = gds.to(gv.Image)
sst_plot.opts(cmap='coolwarm', colorbar=True, width=600, height=500, title=ds.title)

It's worth understanding a little about how this works.
Even though the above example only needs the metadata and _a subset_ of the data, the entire file is read from S3 and returned. This is because, unlike a local filesystem, the basic read/write operations on this kind of cloud storage (also called an "object store") operate on the entire file (object).
If you only need a small subset of a large file, this can be a very inefficient way to get it.

For example, if we wanted to plot a timeseries of the above satellite SST product at a given point, we would only need a single value out of each file (corresponding to one point in the timeseries), but the entire file would need to be read each time.

Let's try plotting the last 30 days of data for a point east of Tasmania...

In [None]:
%%time
s3_objs = [s3.open(f) for f in sst_files[-30:]]
mds = xr.open_mfdataset(s3_objs)
mds

In [None]:
mds[sst_var].sel(lat=-42, lon=150, method="nearest").hvplot()

### Zarr - a cloud-optimised data format



### Parquet?

## New OGC APIs?

# TODO

- [ ] Add metadata links for datasets used
