# COSIMA Training: Finding COSIMA Data

This is a modified version of the COSIMA Recipes notebook [Exploring The COSIMA Cookbook](https://cosima-recipes.readthedocs.io/en/latest/Tutorials/Using_Explorer_tools.html).

## COSIMA Cookbook Database

The COSIMA Cookbook provides a database of some of the data available at NCI.

The Cookbook also provides an API to query the database and retrieve data by experiment and variable name.

In [None]:
import cosima_cookbook as cc
import cf_xarray

To access the database you must first create a session, which is a connection that you then pass to the querying functions.

In [None]:
session = cc.database.create_session()

If you know the name of the experiment and of the variable you want, you can load that variable directly using `getvar`, which returns an [xarray DataArray](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html). For example, the cells below load the `u` ocean velocity variable from the `01deg_jra55v140_iaf` experiment, reading only the first three files (`n=3`) for speed.

In [None]:
experiment_name = '01deg_jra55v140_iaf'
variable_name   = 'u'

In [None]:
cc.querying.getvar(expt=experiment_name, variable=variable_name, session=session, n=3)

Inside `getvar` there is a database lookup to find all the files that contain the variable `u` in the experiment `01deg_jra55v140_iaf`. It then does the equivalent of [`open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html), which opens multiple netCDF files, reads the metadata that describes the data, and joins (concatenates) the metadata along the `time` dimension. `xarray` presents the data as if it were a single dataset, and takes care of reading data from the correct files when an operation requires it.
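
The concatenation step can be illustrated without any files. The sketch below is not what `getvar` runs internally. It builds two small in-memory datasets (the variable `u`, the dimension `x` and all values are made up for illustration) and joins them along `time` with `xarray.concat`, which gives a result similar to what `open_mfdataset` produces for files split in time.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_piece(start, periods=3):
    """Build a toy dataset standing in for one netCDF file (values are invented)."""
    time = pd.date_range(start, periods=periods, freq="D")
    values = np.zeros((periods, 2))
    return xr.Dataset({"u": (("time", "x"), values)}, coords={"time": time})


# Two "files" covering consecutive three-day periods
pieces = [make_piece("2000-01-01"), make_piece("2000-01-04")]

# Join the pieces along the time dimension so they look like one dataset
combined = xr.concat(pieces, dim="time")
print(combined["u"].sizes)
```

The combined object has a single `time` axis spanning both pieces, which is how the multi-file data loaded above appears to the user.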

The question then becomes, how do I find out what experiment to use, and what variables are available? The API provides `get_experiments` which returns a list of experiments:

In [None]:
cc.querying.get_experiments(session, all=True)

And `get_variables`, which returns a table (a pandas DataFrame) of variables for a given experiment:

In [None]:
variables = cc.querying.get_variables(session, experiment=experiment_name)
variables

But there are sometimes duplicate variables with different frequencies:

In [None]:
variable_name = 'surface_salt'

In [None]:
variables[variables.name == variable_name]

If you just try to load this data you will get an error, because you will be trying to load data from files with different temporal frequencies.

In [None]:
cc.querying.getvar(expt=experiment_name, variable=variable_name, session=session)

You can get around this error by passing a `frequency` argument to `getvar`, as suggested in the error message above, but you would still have to find the name of the variable you want to load by querying a pandas table. This isn't very user-friendly, especially for those who are not Python experts, or not domain experts who know what to search for.
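
As a rough illustration of that table query, the sketch below lists the frequencies available for one variable name. It uses a small stand-in DataFrame rather than the real output of `get_variables`: the rows are invented, and the `frequency` column name is an assumption about the table layout (only the `name` column is used in the cells above).

```python
import pandas as pd

# Stand-in for the table returned by get_variables.
# Rows are invented; the "frequency" column name is an assumption.
variables = pd.DataFrame(
    {
        "name": ["surface_salt", "surface_salt", "u"],
        "frequency": ["1 daily", "1 monthly", "1 monthly"],
    }
)

variable_name = "surface_salt"

# Same boolean filter as above, then collect the distinct frequencies
matches = variables[variables.name == variable_name]
frequencies = sorted(matches["frequency"].unique().tolist())
print(frequencies)
```

One of the listed values could then be passed to `getvar` as the `frequency` argument.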

## Exploring the Cookbook Database

The COSIMA Cookbook `explore` submodule seeks to solve the issue of how to find relevant experiments and variables within a Cookbook database and simplify the process of loading this data.

It does this by providing GUI elements that you can embed in your Jupyter notebooks to filter and query the database, and then load the data you want.

When you load data it prints out the command used, which can be copied and used in other contexts.

In [None]:
from cosima_cookbook import explore

### Database Explorer

The first component is `DatabaseExplorer`, which uses filtering by keyword and variable to find relevant experiments. 

When an experiment is selected and the "Load Experiment" button pushed, it opens an `ExperimentExplorer` below the Database Explorer. A detailed explanation of the `ExperimentExplorer` is in the next section.

The full description of the explorer and how it works is available in [COSIMA Recipes](https://cosima-recipes.readthedocs.io/en/latest/Tutorials/Using_Explorer_tools.html).

The first step is to import the `explore` submodule

In [None]:
from cosima_cookbook import explore

Then create a `DatabaseExplorer` object, passing it the already open connection to the database (`session`). This can take a minute or more, so be patient.

In [None]:
%%time
dbx = explore.DatabaseExplorer(session=session)

And lastly execute the returned object, which displays the explorer GUI in the Jupyter notebook.

Try clicking on 'Variable' and type in a search term in the 'Search: start typing' box. This does a live search of *all* variables in the database, and searches on variable name, the long name, and standard name. It can be a great way to see what sorts of variables are available. Select a variable and it will show the long name and units underneath.

Try adding some variables to the filter variables box and pushing the filter button, to see which experiments have those combinations of variables.

In [None]:
dbx
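
The variable filter can be thought of as a subset test. The sketch below assumes the filter keeps only experiments that contain every selected variable, which is how the text above describes it. The experiment names and their variable sets are hypothetical, and this is not the explorer's actual implementation.

```python
# Hypothetical experiments and the variables each one contains
experiment_variables = {
    "expt_a": {"surface_temp", "surface_salt", "pme_river"},
    "expt_b": {"surface_temp", "sea_level"},
    "expt_c": {"surface_temp", "surface_salt", "pme_river", "sea_level"},
}

# Variables added to the filter box
wanted = {"surface_temp", "surface_salt", "pme_river"}

# Keep experiments whose variable set includes every wanted variable
matching = sorted(
    name for name, available in experiment_variables.items() if wanted <= available
)
print(matching)
```

Adding more variables to `wanted` can only shrink the list of matching experiments, which is why a long variable list narrows the search quickly.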

#### Exercise 1

If you wanted to recreate the surface water mass transformation from [this COSIMA recipe](https://cosima-recipes.readthedocs.io/en/latest/DocumentedExamples/Surface_Water_Mass_Transformation.html) then you would need the following variables:

`surface_temp`, `surface_salt`, `pme_river`, `sfc_salt_flux_restore`, `sfc_hflux_from_runoff`, `sfc_hflux_coupler`, `sfc_hflux_pme` and `frazil_3d_int_z`

Try adding all these variables to the variable filter and see which experiments might match the requirements!

<details>
  <summary>Click for answer</summary>
You should find 24 experiments: `1deg_jra55_SAMextr_*`, `01deg_jra55v13_ryf9091*`, `01deg_jra55v140_iaf*`, `1deg_jra55_iaf_v2.0.0rc3*` and `basal_melt_outputs`.
</details>

#### Exercise 2

Remove all the variables from the filter and then add back `sea_level`. Then in the keyword filter select `access-om2-1`, then filter the experiments.

One of the experiments should be `1deg_jra55_iaf_omip2_cycle6`. Select it and then press "Load Experiment".

Select the `sea_level` variable. 

1. What is the long name? What are the units? 

2. Check the "Frequency" menu. What frequencies are available? What date ranges? 

3. Choose a 10 year date range and push the "Load" button. The `ExperimentExplorer` will load, and display, an `xarray.DataArray` object. What is the command used to load the data?

You can copy the command and use it in this, or another notebook. Modify it to suit your requirements.

<details>
  <summary>Click for answer</summary>

1. Variable long name is "effective sea level (eta_t + patm/(rho0*g)) on T cells" and units are meter
    
2. Frequencies should be `1 daily` and `1 monthly`.    
Date range is 1957/12/30-2018/12/30 (1 daily) and 1957/12/31-2018/11/30 (1 monthly)

3. Example command to load the data:
```python
cc.querying.getvar(expt='1deg_jra55_iaf_omip2_cycle6', variable='sea_level',
                   session=session, frequency='1 monthly',
                   attrs={'cell_methods': 'time: mean'},
                   start_time='1957-12-31 00:00:00',
                   end_time='1967-11-30 00:00:00')
```
</details>

#### Exercise 3

Now open a new cell and type
```python
dbx.ee.data
```

What do you see?

<details>
  <summary>Click for answer</summary>
You should see exactly the same data as you just loaded using the explorer. The database explorer stores the experiment explorer internally as an attribute named `ee`. In turn, the experiment explorer stores the data it loads in an attribute named `data`.
</details>

### Experiment Explorer

The `ExperimentExplorer` can be used independently of the `DatabaseExplorer` if you already know the experiment you wish to load. 

When a variable is selected the long name is displayed below the box as before, but it also populates the frequency drop down and date range slider to the right. Identical variables can be present in a data set with different temporal frequencies. It is necessary to choose a frequency in this case as those variables cannot be loaded into the same `xarray.DataArray`. When a frequency is selected the date range slider may change the range of available dates if they differ between the two frequencies.

If you only need data for a limited period, reduce the date range before loading: fewer files need to be opened and have their metadata checked, so loading is much quicker.

Once you have selected a variable and confirmed that the frequency and date range are correct, push the "Load" button and the data will be loaded into an `xarray.DataArray` object. When this is done, the metadata of the loaded data is displayed at the end of the cell output.

The relevant command used to load the data is displayed, so that it can be copied, reused, and/or modified.

The loaded data is available as the `.data` attribute of the `ExperimentExplorer` object. At any time a different variable from the same or a different experiment can be loaded, and the `.data` attribute will be updated to reflect the new data.

In [None]:
experiment_name = "1deg_jra55_iaf_omip2_cycle6"

In [None]:
ee = explore.ExperimentExplorer(session=session, experiment=experiment_name)
ee

#### Exercise 4

1. From the experiment explorer what is the variable name for `snow-ice formation (cm/day)`? (this is a variable from the ice model)

2. Load the variable with daily frequency for the period 1958/01/01-1959/07/01.

3. Select the first year of data and find the maximum value. What is it?

4. Try making a plot of the spatial distribution of the maximum of this variable for the first year (should be a global plot)

<details>
  <summary>Click for answer</summary>

1. Variable name is `snoice`


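2. In the experiment explorer, select `snoice`, choose the daily frequency, set the date range to 1958/01/01-1959/07/01 and push "Load".

3.
```python
ee.data.sel(time=slice('1958')).max().values
```
Answer: `5.7867136`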

4.
```python
ee.data.sel(time=slice('1958')).max('time').plot()
```
</details>
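
The two reductions in the Exercise 4 answer differ only in whether a dimension is named. The sketch below uses a small synthetic array, not the `snoice` data. The dimension names `y` and `x` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Four time steps, 200 days apart: two fall in 1958 and two in 1959
time = pd.date_range("1958-01-01", periods=4, freq="200D")
data = xr.DataArray(
    np.arange(24.0).reshape(4, 2, 3),
    dims=("time", "y", "x"),  # hypothetical dimension names
    coords={"time": time},
)

# Everything up to and including the end of 1958
first_year = data.sel(time=slice("1958"))

# No dimension given: reduce over all dimensions to a single number
overall_max = first_year.max()

# Reduce over time only: one maximum per grid cell, suitable for a map plot
max_map = first_year.max("time")

print(float(overall_max))
print(max_map.dims)
```

Calling `.max()` collapses the array to a scalar, while `.max('time')` keeps the spatial dimensions, which is why only the second form gives something that can be plotted as a map.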