# Introduction to Argovis's API

Argovis provides an API that indexes and distributes numerous oceanographic datasets with detailed query parameters, enabling you to search for and download exactly the data of interest and nothing more. In this notebook, we'll tour some of the standard usage patterns enabled by Argovis.

## Note on how to navigate this and other notebooks in the repository

We suggest that you:
1. Start from this notebook for an overview of the standard usage patterns enabled by Argovis.
2. Try some tasks that Argovis supports via its API, running e.g. **Argovis_explore_ocean_vertical_structure** (and in general current and upcoming notebooks whose file names start with Argovis_explore_*). If a task of interest to you is included in these notebooks, we recommend starting your code from the code used there.
3. Visit the folder **dataset_specific_notebooks** for notebooks with examples that are specific to individual Argovis datasets (some of the code there will be modified and/or moved to the parent folder to allow for usage with multiple datasets).

Notebooks in the folder work in progress are under development and not guaranteed to work as is.

## Setup: Register an API key

In order to allocate Argovis's limited computing resources fairly, users are encouraged to register and request a free API key. This works like a password that identifies your requests to Argovis. To do so:

 - Visit [https://argovis-keygen.colorado.edu/](https://argovis-keygen.colorado.edu/)
 - Fill out the form under _New Account Registration_
 - An API key will be emailed to you shortly.
 
Treat this API key like a password - don't share it or leave it anywhere public. If you ever forget it or accidentally reveal it to a third party, see the same website above to change or deactivate your token.

Put your API key in the quotes in the variable below before moving on:

In [1]:
API_ROOT='https://argovis-api.colorado.edu/'
API_KEY=''
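
If you would rather not paste the key into the notebook itself, one option is to read it from an environment variable. This is only a minimal sketch; the variable name `ARGOVIS_API_KEY` is our own choice for illustration, not something Argovis defines:

```python
import os

API_ROOT = 'https://argovis-api.colorado.edu/'

# ARGOVIS_API_KEY is a hypothetical environment variable name chosen for this
# example; fall back to an empty string if it is not set.
API_KEY = os.environ.get('ARGOVIS_API_KEY', '')
```

This keeps the key out of any notebook you might later share or commit.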

# Argovis data structures

Argovis standard data structures divide measurements into _data_ and _metadata_ documents. Typically, a data document corresponds to measurements or gridded data associated with a discrete temporospatial column - a time, latitude and longitude. A single such document may contain measurements at multiple depths or altitudes, provided they share the same latitude, longitude, and time.

Each of these data documents will refer to at least one corresponding metadata document that captures additional information about the measurement. Argovis divides information between data and metadata documents in order to minimize redundancy in the data you download: many data documents will point to the same metadata document, allowing you to only download that metadata once. Typically, these metadata groupings will refer to some meaningful characteristic of the data; Argo metadata documents correspond to physical floats, while CCHDO metadata documents correspond to cruises, for example.

For more detail and specifications on the data and metadata documents for each collection, see [https://argovis.colorado.edu/docs/documentation/_build/html/database/schema.html](https://argovis.colorado.edu/docs/documentation/_build/html/database/schema.html).
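
To make the data/metadata split concrete, here is a small offline sketch of how many data documents can share one metadata document, so that each metadata document only needs to be looked up once. The documents below are hypothetical and heavily abbreviated (the `_id` field name and all values are assumptions for illustration), and `fetch_meta` is a stand-in for a real metadata request:

```python
# Hypothetical, abbreviated documents; real Argovis documents carry many more
# fields (see the schema documentation linked above).
data_docs = [
    {'_id': 'floatA_001', 'metadata': ['floatA_m0']},
    {'_id': 'floatA_002', 'metadata': ['floatA_m0']},
    {'_id': 'floatB_001', 'metadata': ['floatB_m0']},
]
meta_store = {
    'floatA_m0': {'_id': 'floatA_m0', 'platform': 'floatA'},
    'floatB_m0': {'_id': 'floatB_m0', 'platform': 'floatB'},
}

def fetch_meta(meta_id):
    """Stand-in for a metadata request; a real version would query the API."""
    return meta_store[meta_id]

meta_cache = {}   # metadata documents we have already looked up
fetch_count = 0   # how many lookups were actually needed
for doc in data_docs:
    for meta_id in doc['metadata']:
        if meta_id not in meta_cache:
            meta_cache[meta_id] = fetch_meta(meta_id)
            fetch_count += 1

print(fetch_count)  # two lookups serve three data documents
```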

# The standard data routes

## What datasets does Argovis index?

Argovis supports several different data sets with the API and data structures described here. They and their corresponding routes are:

 - Argo profiling float data, `/argo`
 - CCHDO ship-based profile data, `/cchdo`
 - tropical cyclone data from HURDAT and JTWC, `/tc`
 - Global Drifter Program data, `/drifters`
 - Easy Ocean, `/easyocean`
 - several gridded products:
   - Roemmich-Gilson total temperature and salinity, `/grids/rg09`
   - ocean heat content, `/grids/kg21`
   - GLODAP, `/grids/glodap`
 - Argone Argo float position forecast model data, `/argone`
 - Argo trajectory data, `/argotrajectories`
 - several satellite-based timeseries:
   - NOAA sea surface temperature, `/timeseries/noaasst`
   - Copernicus sea surface height, `/timeseries/copernicussla`
   - CCMP wind vector product, `/timeseries/ccmpwind`
   
The examples that follow apply equally to all these routes; they all support similar query options and follow similar behavior patterns.

## Using Swagger and the `argovisHelpers` package to download data

In order to successfully explore Argovis data, there are two important tools to introduce in this section: Swagger, our API documentation engine, and `argovisHelpers`, our Python package of functions to help you access and interpret Argovis data.

### Using Swagger docs

Argovis' API documentation is found at [https://argovis-api.colorado.edu/docs/](https://argovis-api.colorado.edu/docs/). These docs are split into several categories; what follows applies to all categories _not_ marked experimental; the experimental categories are under development and may change or be removed at any time.

Categories have three typical routes:
 - The main _data route_, like `/argo`, or `/cchdo`. These routes provide the data documents for the dataset named in the route.
 - The _metadata route_, like `/argo/meta`. These routes provide the metadata documents referred to by data documents.
 - The _vocabulary route_, like `/argo/vocabulary`. These routes provide lists of possible options for search parameters used in the corresponding data and metadata routes.
 
Click on any of the routes, like `/argo` - a list of possible query string parameters is presented, with a short explanation of what each one means.

If you're familiar with REST APIs, this is enough information for you to construct a query string and issue a request in any programming environment that can facilitate an HTTP GET request. If you're working in Python, we provide a helper library, `argovisHelpers`, to manage these requests for you. Let's try it out by making our first request for Argo data, for profiles found within 100 km of a point in the South Atlantic in May 2011 (users of Python's `requests` module will notice a familiar pattern, providing the query string parameters listed in the Swagger docs and associated values as a dictionary):

In [2]:
from argovisHelpers import helpers as avh

In [3]:
argoSearch = {
    'startDate': '2011-05-01T00:00:00Z',
    'endDate': '2011-06-01T00:00:00Z',
    'center': '-22.5,0',
    'radius': 100
}

argoProfiles = avh.query('argo', options=argoSearch, apikey=API_KEY, apiroot=API_ROOT)
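
For readers working without the helper library, the search dictionary above maps directly onto a URL query string. The following is a sketch using only the Python standard library; it builds the URL but does not send the request, and it leaves out how the API key is attached, which the helper handles for you:

```python
from urllib.parse import urlencode

API_ROOT = 'https://argovis-api.colorado.edu/'

argoSearch = {
    'startDate': '2011-05-01T00:00:00Z',
    'endDate': '2011-06-01T00:00:00Z',
    'center': '-22.5,0',
    'radius': 100
}

# urlencode percent-escapes reserved characters such as ':' and ','
url = API_ROOT + 'argo?' + urlencode(argoSearch)
print(url)
```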


Let's have a look at what we get from the first profile returned:

In [4]:
argoProfiles[0]

This is a data document for Argo, matching the specification at [https://argovis.colorado.edu/docs/documentation/_build/html/database/schema.html](https://argovis.colorado.edu/docs/documentation/_build/html/database/schema.html). It contains the `timestamp` and `geolocation` properties that place this profile geospatially, and other parameters that typically change from point to point.

All data documents bear a `metadata` key, which is a pointer to the appropriate metadata document to find out more about this measurement. Let's fetch that document for this first profile by querying the `argo/meta` route for a document with an `id` that matches this `metadata` pointer:

In [5]:
metaOptions = {
    'id': argoProfiles[0]['metadata'][0]
}

argoMeta = avh.query('argo/meta', options=metaOptions, apikey=API_KEY, apiroot=API_ROOT)
argoMeta

In addition to temporospatial searches, data and metadata routes typically support _category searches_, which are searches for documents that belong to certain categories. The categories available to search by vary from dataset to dataset; Argo floats can be searched by platform number, for example, while tropical cyclones can be searched by storm name. See the Swagger docs for the full set of possibilities for each category; let's now use Argo's platform category search to get all profiles collected by the same platform as the first profile above:

In [None]:
platformSearch = {
    'platform': argoMeta[0]['platform']
}

platformProfiles = avh.query('argo', options=platformSearch, apikey=API_KEY, apiroot=API_ROOT)
print(len(platformProfiles))

At the time of writing, 125 profiles are found for this platform in this way.

For all category searches, we may wish to know the full list of possible values a category can take on; for this, there are the _vocabulary_ routes. All vocabulary routes accept `parameter=enum`, which lists the categorical parameters available to filter this dataset by:

In [None]:
vocab_enum = {
    'parameter': 'enum'
}

avh.query('argo/vocabulary', options=vocab_enum, apikey=API_KEY, apiroot=API_ROOT)

Evidently we can filter Argo data by platform, for example. Let's see what platforms are available:

In [None]:
platformVocabSearch = {
    'parameter': 'platform'
}

platforms = avh.query('argo/vocabulary', options=platformVocabSearch, apikey=API_KEY, apiroot=API_ROOT, verbose=True)
print(platforms[0:10])

Here we just print out the first 10 platform IDs found, but all 17 thousand or so are present.

## Using the `data` query option

The astute reader may have noticed something about the data document shown above: there are no actual measurements included in it! By default, only the non-measurement data is returned, in order to minimize bandwidth consumed; to get back actual measurements and their QC flags, we must include the `data` parameter in our query, the behavior of which we'll see in this section.

### Basic data request

Let's start by asking for one particular profile by ID, and ask for some temperature data to go with:

In [None]:
dataQuery = {
    'id': '4901283_003',
    'data': 'temperature'
}

profile = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)
print(profile[0]['data'])

The `data` key contains a list of lists of measurements. To interpret them, we use the `data_info` key:

In [None]:
print(profile[0]['data_info'])

`data_info` is always a list that contains exactly three items:

 - `data_info[0]` is the list of measurements returned in our `data` object, in the same order as `data`. So in the example above, `data[0]` are pressure measurements, while `data[1]` are temperature measurements. Note we got back pressures even though we only asked for temperatures; pressures are always provided where available as they are needed to meaningfully interpret all other data variables.
 - `data_info[1]` is a list of per-measurement variables. In the example above, pressure and temperature both have a `units` and a `data_keys_mode` associated with them.
 - `data_info[2]` is a rank 2 matrix with rows labeled by `data_info[0]` and columns by `data_info[1]`. So for the example above, this matrix indicates pressure has units 'decibar', and temperature has `data_keys_mode` 'D'.
 
With this information, we now understand how to interpret the `data` key above: the first list is a list of pressures measured in decibar, and the second list is the corresponding list of temperature measurements in degrees C. Note that the ith elements in the data lists all correspond to the same level - in other words, `data[0][i], data[1][i], data[2][i], ...` are all measurements corresponding to the ith level of this object.
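
The lookup described above can be sketched offline as a pair of small helpers. The function names and the sample document are our own, with hypothetical values laid out in the `data` / `data_info` structure described in this section:

```python
def variable_values(doc, name):
    """Return the list of per-level values for variable `name`."""
    row = doc['data_info'][0].index(name)
    return doc['data'][row]

def variable_attribute(doc, name, attribute):
    """Return one per-variable attribute (e.g. 'units') for variable `name`."""
    row = doc['data_info'][0].index(name)
    col = doc['data_info'][1].index(attribute)
    return doc['data_info'][2][row][col]

# Hypothetical sample document for illustration only.
sample = {
    'data': [[3.9, 5.8, 7.9], [26.1, 26.0, 25.9]],
    'data_info': [
        ['pressure', 'temperature'],
        ['units', 'data_keys_mode'],
        [['decibar', 'D'], ['degree_Celsius', 'D']],
    ],
}

print(variable_values(sample, 'temperature'))
print(variable_attribute(sample, 'pressure', 'units'))
```

Looking rows up by name rather than by a fixed position avoids relying on the order in which variables happen to be returned.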
 
> **Data and metadata precedence**: sometimes, you might see a given key on *both* a data document and its corresponding metadata document; when this happens, the value on the data document always takes precedence. `data_info` is a common example of this, which we'll see again below.
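
The precedence rule can be written as a short helper. This is a sketch with a hypothetical function name and hypothetical documents, not part of `argovisHelpers`:

```python
def resolve(key, data_doc, meta_doc):
    """Look up `key`, preferring the data document over its metadata document."""
    if key in data_doc:
        return data_doc[key]
    return meta_doc[key]

# Hypothetical documents: 'data_info' appears on both, 'platform' only on metadata.
data_doc = {'data_info': 'from data document'}
meta_doc = {'data_info': 'from metadata document', 'platform': 'floatA'}

print(resolve('data_info', data_doc, meta_doc))
print(resolve('platform', data_doc, meta_doc))
```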

### Data inflation

If you find this format difficult to consume, another option is to use the `data_inflate` function from the argovis helpers package. This function will turn your data array into a list of dictionaries, one dictionary per level, with keys corresponding to the data values:

In [None]:
inflated_data = avh.data_inflate(profile[0])
inflated_data[0:10]

This format is inefficient to download, but easy to read and work with. Long-time users of previous versions of Argovis may recognize this as similar to the legacy format of some of our data.
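
To see what inflation amounts to, here is a pure-Python sketch of the same idea. It is not the library's implementation, `inflate_levels` is a hypothetical name, and the sample document uses made-up values:

```python
def inflate_levels(doc):
    """Turn column-oriented `data` into one dictionary per level."""
    names = doc['data_info'][0]
    # zip(*data) regroups the per-variable lists into per-level tuples
    return [dict(zip(names, level)) for level in zip(*doc['data'])]

# Hypothetical sample document for illustration only.
sample = {
    'data': [[3.9, 5.8, 7.9], [26.1, 26.0, 25.9]],
    'data_info': [
        ['pressure', 'temperature'],
        ['units', 'data_keys_mode'],
        [['decibar', 'D'], ['degree_Celsius', 'D']],
    ],
}

print(inflate_levels(sample)[0])
```

Every level repeats every variable name, which is why this layout is convenient to read but bulkier than the list-of-lists form.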

### Getting absolutely everything

What we've seen above allows us to be very targeted in the data we download; rather than being forced to spend time and bandwidth downloading data we aren't interested in, we can focus on just what we need. On the other hand, sometimes we really do want everything, and for that there's `data=all`:

In [None]:
dataQuery = {
    'id': '4901283_003',
    'data': 'all'
}

profile = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)
avh.data_inflate(profile[0])[0:10]

> **Downloading only what you need:**
Some objects, like Argo BGC probes, measure many values. Your downloads will often be dramatically faster if you specify your variables of interest, rather than using `data=all` unnecessarily. Note that the `data` parameter can also accept a comma-separated list of variable names, if there are a few that you'd like.

### Filtering behavior of data requests

Note that adding a specific data filter is a _firm requirement_ that all returned profiles have some meaningful data for _all_ variables listed. Try demanding chlorophyll-a in addition to temperature for our current profile of interest:

In [None]:
dataQuery = {
    'id': '4901283_003',
    'data': 'temperature,chla'
}

profile = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)
print(profile)

We get nothing in our array of profiles; even though we asked for profile id '4901283_003' and we know it exists, `data=temperature,chla` filters our query down to _only_ profiles that have both temperature and chla reported; since the profile requested doesn't have any chla measurements, it is dropped from the returns in this case. This is useful if you only want to download profiles that definitely have data of interest; for example, try the same thing on our regional search from above:

In [None]:
argoSearch = {
    'startDate': '2011-05-01T00:00:00Z',
    'endDate': '2011-06-01T00:00:00Z',
    'center': '-22.5,0',
    'radius': 100,
    'data': 'temperature,chla'
}

argoProfiles = avh.query('argo', options=argoSearch, apikey=API_KEY, apiroot=API_ROOT)
print(len(argoProfiles))

Evidently Argo made no chlorophyll-a measurements in May 2011 within 100 km of our point of interest - a fact which we found using the data API without having to download or reduce any data at all. One final point on data filtering in this manner: it's not enough for a profile to nominally have a variable defined for it; it must have at least one non-null value reported for that variable somewhere in the search results. For example, when we did `data=all` for our profile of interest above, we saw dissolved oxygen, `doxy`, was defined for it. But:

In [None]:
dataQuery = {
    'id': '4901283_003',
    'data': 'doxy'
}

profile = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)
print(profile)

Again our search is filtered down to nothing, since every level in that profile reported `None` for `doxy`.
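
The "at least one non-null value" rule can be mimicked locally. The helper below is a sketch of the rule as described here, not the server's code; its name and the sample document are hypothetical:

```python
def has_meaningful_data(doc, name):
    """True if `name` is defined on the document and has a non-None value."""
    names = doc['data_info'][0]
    if name not in names:
        return False
    return any(v is not None for v in doc['data'][names.index(name)])

# Hypothetical document: doxy is defined, but every level reports None.
sample = {
    'data': [[3.9, 5.8], [26.1, 26.0], [None, None]],
    'data_info': [['pressure', 'temperature', 'doxy']],
}

print(has_meaningful_data(sample, 'temperature'))
print(has_meaningful_data(sample, 'doxy'))
```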

### Search negation

Let's find some profiles that do actually have dissolved oxygen in them, this time with a slightly different geography search: let's look for everything in August 2017 within a polygon region, defined as a list of `[longitude, latitude]` points: 

In [None]:
dataQuery = {
    'startDate': '2017-08-01T00:00:00Z',
    'endDate': '2017-09-01T00:00:00Z',
    'polygon': [[-150,-30],[-155,-30],[-155,-35],[-150,-35],[-150,-30]],
    'data': 'doxy'
}

profiles = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)

We find one profile with meaningful dissolved oxygen data in the region of interest.

The `data` key also accepts _tilde negation_, meaning 'filter for profiles that _don't_ contain this data', for example:

In [None]:
dataQuery = {
    'startDate': '2017-08-01T00:00:00Z',
    'endDate': '2017-09-01T00:00:00Z',
    'polygon': [[-150,-30],[-155,-30],[-155,-35],[-150,-35],[-150,-30]],
    'data': 'temperature,~doxy'
}

profiles = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)
print(len(profiles))

We get a collection of profiles that appear in the region of interest, and have temperature but _not_ dissolved oxygen. In this way, we can split up our downloads into groups of related and interesting profiles without re-downloading the same profiles over and over.
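
The combined effect of required and negated variables can be sketched locally as follows. This illustrates the matching rule only; it ignores QC flags and special tokens, and the function name and profile summaries are hypothetical:

```python
def matches(available, data_param):
    """Check a set of variables with usable data against a `data` string.

    Plain names must be present; names prefixed with '~' must be absent.
    """
    for token in data_param.split(','):
        if token.startswith('~'):
            if token[1:] in available:
                return False
        elif token not in available:
            return False
    return True

# Hypothetical profiles, summarized by which variables have usable data.
profiles = {
    'p1': {'temperature', 'doxy'},
    'p2': {'temperature'},
    'p3': {'doxy'},
}

with_doxy = [pid for pid, v in profiles.items() if matches(v, 'doxy')]
temp_only = [pid for pid, v in profiles.items() if matches(v, 'temperature,~doxy')]
print(with_doxy, temp_only)
```

The two filters select disjoint sets, which is what lets a download be split into groups without fetching any profile twice.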

### QC filtering

In addition to querying and filtering by what data is available, we can also make demands on the quality of that data by performing QC filtering. Let's start by looking at some particulate backscattering data:

In [None]:
bbpQuery = {
    'id': '2902857_001',
    'data': 'bbp700,bbp700_argoqc'
}

bbp = avh.query('argo', options=bbpQuery, apikey=API_KEY, apiroot=API_ROOT)
bbpindex = bbp[0]['data_info'][0].index('bbp700')
bbpQCindex = bbp[0]['data_info'][0].index('bbp700_argoqc')
print(bbp[0]['data'][bbpindex][0:10])
print(bbp[0]['data'][bbpQCindex][0:10])

We request both the measurement and its corresponding QC flags, for reference. Recall that for Argo:

 - QC 1 == definitely good data
 - QC 2 == probably good data
 - QC 3 == probably bad data
 - QC 4 == definitely bad data
 
If we didn't look at the QC flags for our particulate backscatter data, we could easily have missed that some of the measurements shown above (and many more in the profile not printed) have been marked as bad data by the upstream data distributor, and therefore might not be appropriate for your purposes. We can suppress measurements based on a list of allowed QC values by modifying what we pass to the `data` query parameter:

In [None]:
bbpQCfilteredQuery = {
    'id': '2902857_001',
    'data': 'bbp700,1,bbp700_argoqc'
}

bbpFiltered = avh.query('argo', options=bbpQCfilteredQuery, apikey=API_KEY, apiroot=API_ROOT)
bbpindex = bbpFiltered[0]['data_info'][0].index('bbp700')
bbpQCindex = bbpFiltered[0]['data_info'][0].index('bbp700_argoqc')
print(bbpFiltered[0]['data'][bbpindex][0:10])
print(bbpFiltered[0]['data'][bbpQCindex][0:10])

In our `data` query parameter, we listed which QC flags we find tolerable for each measurement parameter; in this case `bbp700,1` indicates we only want `bbp700` data if it has a corresponding QC flag of 1. Some things implied by this example that are worth highlighting:

 - QC flags listed after a variable name only apply to that variable name. Try printing the `pressure` record for the profile found above, and you'll see none of its levels were suppressed.
 - The list of QC flags is an explicit-allow list and can contain as many flags as you want. For example, you might change the above data query to `bbp700,1,2` to get both 1- and 2-flagged `bbp700` measurements back.
 - We include the explicit QC flag in this example for illustrative purposes, but it's not required when doing QC filtering in this way. Try the above query while omitting `bbp700_argoqc`, and you'll get the same non-`None` values for `bbp700`.
 - Note however, as with all data requests, if all explicitly requested data variables are `None` for a level, that level is dropped. In the case where you omitted `bbp700_argoqc` and only requested `bbp700`, the levels where the QC filtration set the `bbp700` value to `None` are dropped.
 - Similarly, if *all* levels of a requested variable are set to `None` by QC filtration, the entire profile will be dropped from the returns, on the grounds that it doesn't contain any of the data you requested at a level of quality you marked as acceptable.
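
The rules in this list can be reproduced on a small made-up example. This is a local illustration of the described behavior, not the server's code; the function names and all values are hypothetical:

```python
def qc_filter(values, flags, allowed):
    """Replace values whose QC flag is not in `allowed` with None."""
    return [v if f in allowed else None for v, f in zip(values, flags)]

def keep_levels(requested, others):
    """Drop levels where every explicitly requested variable is None.

    `requested` and `others` are lists of equal-length per-variable lists;
    the same levels are removed from both groups.
    """
    keep = [i for i in range(len(requested[0]))
            if any(col[i] is not None for col in requested)]

    def pick(col):
        return [col[i] for i in keep]

    return [pick(c) for c in requested], [pick(c) for c in others]

# Hypothetical four-level profile.
pressure = [5, 10, 15, 20]
bbp700 = [0.001, 0.002, 0.003, 0.004]
bbp700_argoqc = [1, 4, 1, 3]

# Equivalent in spirit to requesting data=bbp700,1
filtered = qc_filter(bbp700, bbp700_argoqc, allowed={1})
(kept_bbp,), (kept_pressure,) = keep_levels([filtered], [pressure])
print(filtered)
print(kept_bbp, kept_pressure)
```

If the QC flags were also among the requested variables, no level would be dropped here, since each level would still hold a non-`None` flag.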
 
Note that QC flags are currently only available for the argo and cchdo datasets, and furthermore that these datasets assign different meanings to their QC flags. Be sure to check the docs for each project to make sure you understand how to interpret that project's QC flags.

### Minimal data responses

Sometimes, we might want to use the `data` filter as we've seen to confine our attention to only profiles that have data of interest, but we're only interested in general or metadata about those measurements, and don't want to download the actual measurements; for this, we can add the `except-data-values` token:

In [None]:
dataQuery = {
    'startDate': '2017-08-01T00:00:00Z',
    'endDate': '2017-09-01T00:00:00Z',
    'polygon': [[-150,-30],[-155,-30],[-155,-35],[-150,-35],[-150,-30]],
    'data': 'doxy,except-data-values'
}

profiles = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)
print(profiles)

Note that specifying only `'data': 'except-data-values'` is the same as just leaving the `data` query key off completely; the purpose of this option is to allow you to filter by data, but then only get back the lightweight non-measurement values. 

If we want an even more minimal response, we can use the `compression=minimal` option:

In [None]:
dataQuery = {
    'startDate': '2017-08-01T00:00:00Z',
    'endDate': '2017-09-01T00:00:00Z',
    'polygon': [[-150,-30],[-155,-30],[-155,-35],[-150,-35],[-150,-30]],
    'data': 'doxy',
    'compression': 'minimal'
}

profiles = avh.query('argo', options=dataQuery, apikey=API_KEY, apiroot=API_ROOT)
print(profiles)

With `compression: minimal`, for each data document we get only a minimal amount of information describing it; each data product has a slightly different minimal representation tailored to suit.

### Temporospatial request details

You have seen in examples above that requests can be temporally limited by `startDate` and `endDate`, and confined to a geographic region with `polygon`. There are a few more features and facts about temporospatial requests in Argovis that are worth exploring.
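
The `startDate` and `endDate` strings used throughout this notebook follow one fixed pattern, so they are easy to generate. A minimal sketch, with a hypothetical helper name, that produces the bounds for a calendar month in that pattern:

```python
from datetime import datetime

def month_window(year, month):
    """Return (startDate, endDate) strings spanning one calendar month."""
    start = datetime(year, month, 1)
    # Roll over to January of the next year after December
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return start.strftime(fmt), end.strftime(fmt)

startDate, endDate = month_window(2017, 8)
print(startDate, endDate)
```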

#### Box regions

The `polygon` region definitions you've seen so far define regions on the globe by connecting vertexes with geodesic edges. If instead we want a region bounded by lines of constant latitude and longitude, there is the `box` query string parameter. Compare two similar but different searches, first with `polygon`, similar to the above, tracing geodesics between four corners of a region:

In [None]:
qs = {
    'startDate': '2017-08-01T00:00:00Z',
    'endDate': '2017-09-01T00:00:00Z',
    'polygon': [[-20,70],[20,70],[20,72],[-20,72],[-20,70]],
}

profiles = avh.query('argo', options=qs, apikey=API_KEY, apiroot=API_ROOT)
latitudes = [x['geolocation']['coordinates'][1] for x in profiles]
print(min(latitudes))
print(max(latitudes))

Now try something similar, but with a `box` region defined instead across the four corners:

In [None]:
qs = {
    'startDate': '2017-08-01T00:00:00Z',
    'endDate': '2017-09-01T00:00:00Z',
    'box': [[-20,70],[20,72]]
}

profiles = avh.query('argo', options=qs, apikey=API_KEY, apiroot=API_ROOT)
latitudes = [x['geolocation']['coordinates'][1] for x in profiles]
print(min(latitudes))
print(max(latitudes))

Notice that while both regions share the same corners, the polygon search actually returns profiles with latitudes higher than the region's northernmost corners, since geodesics between two points sharing a latitude deflect north in this far-north search region. Meanwhile, the latitudes of profiles in the box region are confined between the lines of constant latitude connecting the vertexes and defining the top and bottom of the box.
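
The size of this northward deflection can be estimated with a little spherical geometry. The sketch below assumes a perfectly spherical Earth, which may differ slightly from what the server uses, and computes the latitude of the great-circle midpoint of each east-west edge of the polygon above; the function name is our own:

```python
import math

def geodesic_midpoint_lat(lon1, lat1, lon2, lat2):
    """Latitude in degrees of the great-circle midpoint on a sphere."""
    def to_xyz(lon, lat):
        lam, phi = math.radians(lon), math.radians(lat)
        return (math.cos(phi) * math.cos(lam),
                math.cos(phi) * math.sin(lam),
                math.sin(phi))

    x1, y1, z1 = to_xyz(lon1, lat1)
    x2, y2, z2 = to_xyz(lon2, lat2)
    # The midpoint lies along the sum of the two unit vectors
    x, y, z = x1 + x2, y1 + y2, z1 + z2
    return math.degrees(math.atan2(z, math.hypot(x, y)))

south_edge = geodesic_midpoint_lat(-20, 70, 20, 70)
north_edge = geodesic_midpoint_lat(-20, 72, 20, 72)
print(round(south_edge, 1), round(north_edge, 1))
```

Under this assumption the southern edge bulges to about 71.1 degrees and the northern edge to about 73.0 degrees, so the polygon covers latitudes above 72 that the box excludes, and excludes some latitudes just above 70 that the box covers.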

> **Box mode notation**: note that box mode expects exactly two vertexes: the most southern and western corner first, followed by the most northern and eastern corner.
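
If your corners might arrive in either order, a small helper can put them in the order described in the note above. This is a sketch with a hypothetical name, and it assumes the box does not cross the antimeridian, where taking the minimum and maximum longitude would pick the wrong corners:

```python
def make_box(corner_a, corner_b):
    """Order two [longitude, latitude] corners as [southwest, northeast].

    Assumes the region does not cross the antimeridian.
    """
    (lon_a, lat_a), (lon_b, lat_b) = corner_a, corner_b
    return [
        [min(lon_a, lon_b), min(lat_a, lat_b)],
        [max(lon_a, lon_b), max(lat_a, lat_b)],
    ]

print(make_box([20, 72], [-20, 70]))
```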