# IMOS long time series products
## Hourly aggregated files
*Eduardo Klein. Apr-2021*  



This document presents the structure of the hourly aggregated long time series product and shows how to discover the file, interact with the data, and plot some variables over time and depth.  
  
  


## File structure

Normally, the data from every instrument recovered from a mooring array are stored as an individual file in the [IMOS THREDDS server](http://thredds.aodn.org.au/thredds/catalog/IMOS/ANMN/catalog.html). This facilitates individual quality control and metadata handling, but poses some challenges for the analysis of long time series: 

- Many files for one time series
- Instruments deployed to varying depths
- Instruments sample at different times
- Significant work and expert knowledge required to view and analyse time series
- Different user groups need different products (gridded density, MLD, data combined from multiple sources, plots, etc…)

The hourly aggregated product is a file that aggregates all the variables from one site into one-hour bins. 

The aggregated file is a netCDF-4 file organised in an [Indexed Ragged Array](http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_indexed_ragged_array_representation) structure that follows the Climate and Forecast (CF) conventions and the IMOS netCDF file conventions (*diagram by M. Hidas*): 

![Indexed Ragged Array](./img/indexedraggedarray.png)

Some characteristics of this structure:

- `TIME` is no longer a dimension; it is one of the variables in the file. That means that the ordinary selecting and plotting methods for CF netCDF files are no longer available.
- The dimensions of the file are `OBSERVATION` and `INSTRUMENT`. As the aggregation process combines instruments that normally have common timestamps, the time variable can have repeated values. Each deployment (instrument) is identified by an index and by a compound instrument name (make, model and serial number), so it is easy to filter the data by specific instruments.
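The indexed ragged array layout can be illustrated with a small toy example. This is only a sketch: the instrument names, depths and temperatures below are invented, and only the variable names (`instrument_index`, `instrument_id`, `NOMINAL_DEPTH`, `TIME`) mirror those in the aggregated file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not real IMOS values): 2 instruments, 5 observations.
# Variables along the INSTRUMENT dimension:
instrument_id = np.array(["MAKE-A; 1234", "MAKE-B; 5678"])
nominal_depth = np.array([10.0, 50.0])

# Variables along the OBSERVATION dimension:
instrument_index = np.array([0, 1, 0, 1, 1])  # which instrument made each observation
time = pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 00:00",
    "2020-01-01 01:00", "2020-01-01 01:00",
    "2020-01-01 02:00",
])
temp = np.array([25.1, 22.3, 25.0, 22.4, 22.2])

# Use instrument_index to expand instrument-level variables to observation level.
obs = pd.DataFrame({
    "TIME": time,
    "TEMP": temp,
    "instrument_id": instrument_id[instrument_index],
    "NOMINAL_DEPTH": nominal_depth[instrument_index],
})

# TIME is a plain variable, so repeated timestamps are expected.
has_repeated_times = obs["TIME"].duplicated().any()

# Filtering by instrument (here, by nominal depth) is a simple boolean selection.
deep = obs[obs["NOMINAL_DEPTH"] == 50.0]
print(deep)
```

Once the instrument-level variables are expanded in this way, every observation carries its own instrument metadata and the table can be filtered or grouped like any other data frame.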


The aggregation takes the values of each variable from half an hour before the hour to half an hour after the hour and reduces them to a single value by calculating the mean or the median. Additional variables resulting from the aggregation process (such as the count, minimum, maximum and standard deviation in each bin) are also available in the file as ancillary variables.

![hourly aggregation](./img/BurstAveraging.png)
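The binning described above can be sketched with pandas. This is not the code used by the IMOS pipeline, only a minimal illustration with invented values; it assumes that a bin is closed on the left (hh:00 minus 30 minutes belongs to hh:00) and uses the mean as the reducing function. The column names follow the `_count`, `_min`, `_max`, `_std` pattern of the ancillary variables.

```python
import pandas as pd

# Hypothetical raw measurements at irregular times (invented values).
raw = pd.DataFrame({
    "TIME": pd.to_datetime([
        "2021-01-01 09:40", "2021-01-01 10:10", "2021-01-01 10:20",
        "2021-01-01 10:40", "2021-01-01 11:05",
    ]),
    "TEMP": [10.0, 12.0, 14.0, 20.0, 22.0],
})

# Shift by 30 minutes and floor to the hour, so that every timestamp in
# [hh:00 - 30 min, hh:00 + 30 min) is assigned to the bin centred on hh:00.
raw["bin"] = (raw["TIME"] + pd.Timedelta(minutes=30)).dt.floor("h")

# Reduce each bin to its mean, and keep the ancillary statistics.
hourly = raw.groupby("bin")["TEMP"].agg(["mean", "count", "min", "max", "std"])
hourly.columns = ["TEMP", "TEMP_count", "TEMP_min", "TEMP_max", "TEMP_std"]
print(hourly)
```

Replacing `"mean"` with `"median"` in the aggregation list gives the median-based variant.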



## Discovering the files

One of the characteristics of the IMOS file naming convention is that the creation date is part of the file name. That means that every time the IMOS-AODN automatic pipeline produces a new aggregated file for a site, the old file is replaced by a new one with **a different name**. This makes it difficult to access the file programmatically: normally you need to go to the AODN THREDDS server and look for the new file name. 
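To make the role of the dates in the file name concrete, here is a small sketch that extracts the end date and the creation date from a file name and picks the most recently created file from a list. It assumes the name ends with `_END-YYYYMMDD_C-YYYYMMDD.nc`, as in the hourly file name shown later in this document; the helper names `parse_dates` and `most_recent` are hypothetical, and the second file name in the list is invented for illustration.

```python
import re
from datetime import datetime

# Assumed pattern: ..._END-<end date>_C-<creation date>.nc
DATE_PATTERN = re.compile(r"_END-(\d{8})_C-(\d{8})\.nc$")

def parse_dates(file_name):
    """Return (end_date, creation_date) parsed from an aggregated file name."""
    match = DATE_PATTERN.search(file_name)
    if match is None:
        raise ValueError("unexpected file name: " + file_name)
    end, created = (datetime.strptime(s, "%Y%m%d").date() for s in match.groups())
    return end, created

def most_recent(file_names):
    """Return the file name with the latest creation date."""
    return max(file_names, key=lambda name: parse_dates(name)[1])

candidates = [
    "IMOS_ANMN-QLD_BOSTZ_20070910_GBRCCH_FV02_hourly-timeseries_END-20201215_C-20210218.nc",
    # Hypothetical older version of the same product, invented for this example.
    "IMOS_ANMN-QLD_BOSTZ_20070910_GBRCCH_FV02_hourly-timeseries_END-20200601_C-20200715.nc",
]

latest = most_recent(candidates)
print(latest, parse_dates(latest))
```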

Here is a function that helps you to discover the name of the most recent aggregated file. It requires the site code (`site`, e.g. GBRCCH or NRSYON), the type of product (`product`: aggregated, hourly, velocity-hourly or gridded), the parameter (`param`, only for the aggregated product, e.g. `TEMP` or `velocity`), the QC level (`QC`, only for the hourly product; `True` selects the file with good data only), and the source of the URL (`webURL`): `S3` (Amazon AWS, the fastest option for downloading), `wget` (AODN THREDDS, for downloading) or `opendap` (AODN THREDDS, for opening the file remotely).

It returns the file name with the appropriate web prefix (the function is also available as a stand-alone function):


In [74]:

import pandas as pd

pd.options.display.max_colwidth = 500

def getLTSPfileName(site, product="gridded", QC=True, param="TEMP", webURL="opendap"):
    '''
    get the url of the LTSP files
    
    require: pandas
    site: the site_code
    product: product type (aggregated, hourly, velocity-hourly or gridded)
    QC: for the hourly product, include only good data (default True)
    param: for the aggregated product, parameter code as IMOS standard (e.g. TEMP)
    webURL: web source of the file: S3 (Amazon AWS, fastest), wget (AODN THREDDS, to download),
            opendap (AODN THREDDS, to open remotely)
    E. Klein. eklein at ocean-analytics dot com dot au
    '''
    
    if webURL == "opendap": 
        WEBROOT = 'http://thredds.aodn.org.au/thredds/dodsC/'
    elif webURL == "wget":
        WEBROOT = 'http://thredds.aodn.org.au/thredds/fileServer/'
    elif webURL == "S3":
        WEBROOT = 'https://s3-ap-southeast-2.amazonaws.com/imos-data/'
    else:
        # stop here: without a valid WEBROOT the function cannot build the URL
        raise ValueError("wrong webURL: it must be one of S3, opendap or wget")

  
    urlGeoServer = "http://geoserver-123.aodn.org.au/geoserver/ows?typeName=moorings_all_map&SERVICE=WFS&REQUEST=GetFeature&VERSION=1.0.0&outputFormat=csv&CQL_FILTER=(realtime='FALSE')and(site_code='" + site + "')"
    df = pd.read_csv(urlGeoServer)
    url = df['url']
    
    # select the URL that matches the requested product
    if product == "gridded": 
        fileName = url[url.str.contains("gridded")]
    elif product == "velocity-hourly":
        fileName = url[url.str.contains("velocity-hourly")]
    elif product == "hourly":
        if QC:
            # hourly file with good data only: exclude the velocity and the non-QC files
            fileName = url[url.str.contains("(?<!velocity-)hourly-timeseries(?!-including)", regex=True)]
        else:
            fileName = url[url.str.contains("including-non")]
    elif product == "aggregated":
        fileName = url[url.str.contains(param) & url.str.contains("aggregated")]
    else:
        raise ValueError("invalid product: it must be one of aggregated, hourly, velocity-hourly or gridded")

    
    return WEBROOT + fileName.to_string(index=False, header=False).strip()




For example, let's get the file name of the hourly aggregated file for GBRCCH: 


In [79]:
pd.options.display.max_colwidth = 200
fileName = getLTSPfileName(product="hourly", site="GBRCCH")
print(fileName)

http://thredds.aodn.org.au/thredds/dodsC/IMOS/ANMN/QLD/GBRCCH/hourly_timeseries/IMOS_ANMN-QLD_BOSTZ_20070910_GBRCCH_FV02_hourly-timeseries_END-20201215_C-20210218.nc


This file can be opened remotely using, for example, `xarray.open_dataset()`. See below.

## The file content

As mentioned above, all the non-velocity parameters of a site (or, for the velocity product, all the velocity parameters) are aggregated into a single hourly file. Let's explore the content of the dataset. Using the `fileName` discovered above, we can open the file remotely from the IMOS-AODN THREDDS server (you need an Internet connection):

In [89]:
import xarray as xr
import numpy as np

## let's use the fileName discovered above
nc = xr.open_dataset(fileName)

## print the file structure
nc

<xarray.Dataset>
Dimensions:           (INSTRUMENT: 178, OBSERVATION: 760988)
Coordinates:
    TIME              (OBSERVATION) datetime64[ns] ...
    LONGITUDE         (INSTRUMENT) float64 ...
    LATITUDE          (INSTRUMENT) float64 ...
    NOMINAL_DEPTH     (INSTRUMENT) float32 ...
Dimensions without coordinates: INSTRUMENT, OBSERVATION
Data variables:
    instrument_index  (OBSERVATION) int32 ...
    instrument_id     (INSTRUMENT) |S64 ...
    source_file       (INSTRUMENT) |S64 ...
    DEPTH             (OBSERVATION) float32 ...
    DEPTH_count       (OBSERVATION) float32 ...
    DEPTH_min         (OBSERVATION) float32 ...
    DEPTH_max         (OBSERVATION) float32 ...
    DEPTH_std         (OBSERVATION) float32 ...
    CHLU              (OBSERVATION) float32 ...
    CHLU_count        (OBSERVATION) float32 ...
    CHLU_max          (OBSERVATION) float32 ...
    CHLU_min          (OBSERVATION) float32 ...
    CHLU_std          (OBSERVATION) float32 ...
    CPHL              (OBSE

You will note that the file contains the IMOS standard global attributes with some additions related to the aggregation process:


In [83]:
nc.attrs.keys()

odict_keys(['Conventions', 'abstract', 'acknowledgement', 'author', 'author_email', 'citation', 'contributor_email', 'contributor_name', 'contributor_role', 'data_centre', 'data_centre_email', 'date_created', 'disclaimer', 'featureType', 'file_version', 'generating_code_version', 'geospatial_lat_max', 'geospatial_lat_min', 'geospatial_lon_max', 'geospatial_lon_min', 'geospatial_vertical_max', 'geospatial_vertical_min', 'history', 'included_values_flagged_as', 'institution_references', 'keywords', 'keywords_vocabulary', 'license', 'lineage', 'naming_authority', 'project', 'references', 'rejected_files', 'site_code', 'source', 'standard_name_vocabulary', 'time_coverage_end', 'time_coverage_start', 'title', 'DODS.strlen', 'DODS.dimName'])

You can, for example, query the global attributes to find the time coverage of the data without actually accessing the data itself (note that the results are strings, not date objects):

In [84]:
print(nc.time_coverage_start)
print(nc.time_coverage_end)

2007-09-10T07:00:00Z
2020-12-15T00:00:00Z
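If date objects are needed, the attribute strings can be converted with pandas. The sketch below uses the two strings printed above as literals so that it runs without opening the file; with the dataset open, `nc.time_coverage_start` and `nc.time_coverage_end` could be passed instead.

```python
import pandas as pd

# Time coverage strings as printed above (ISO 8601, UTC).
start = pd.to_datetime("2007-09-10T07:00:00Z")
end = pd.to_datetime("2020-12-15T00:00:00Z")

# The results are timezone-aware Timestamps, so date arithmetic works directly.
coverage = end - start
print(start, end)
print("Length of the record in days:", coverage.days)
```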


The geospatial attributes can also be used to show the location of the site on a map:

In [113]:
import folium
meanLat = np.mean([nc.geospatial_lat_min, nc.geospatial_lat_max])
meanLon = np.mean([nc.geospatial_lon_min, nc.geospatial_lon_max])
m = folium.Map(location=[meanLat, meanLon], tiles="OpenStreetMap", zoom_start=8)
folium.Marker(
    location=[meanLat, meanLon],
    popup=nc.site_code,
    icon=folium.Icon(color="blue", icon="info-sign")
).add_to(m)

m

## Slicing the data

## Plotting