# Cal-Adapt Analytics Engine: Threshold Tools Basics

A notebook on how to use the `climakitae` package and its `threshold_tools` to calculate values of interest related to extreme weather events including:
- __return values__ (e.g., the value of a high temperature that will be reached once every 10 years, i.e., the value of a high temperature event with a 10-year return period)
- __return probabilities__ (e.g., the probability of temperature exceeding 300 Kelvin)
- __return periods__ (e.g., how often, on average, a 300 Kelvin monthly average temperature event will occur; how often, on average, a 150 mm daily precipitation event will occur)
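
The three quantities above are tied together by simple arithmetic. As a minimal sketch (the helper names are hypothetical and are not part of `climakitae`), a return period of T years corresponds to an annual exceedance probability of 1/T. If years are assumed to be independent, the chance of seeing at least one such event over a planning horizon follows directly:

```python
def annual_exceedance_probability(return_period_years: float) -> float:
    """Probability that a 1-in-T-year event is reached or exceeded in any one year."""
    if return_period_years <= 0:
        raise ValueError("return period must be positive")
    return 1.0 / return_period_years


def return_period_from_probability(annual_probability: float) -> float:
    """Average number of years between events with the given annual probability."""
    if not 0.0 < annual_probability <= 1.0:
        raise ValueError("annual probability must be in (0, 1]")
    return 1.0 / annual_probability


def chance_within_horizon(return_period_years: float, horizon_years: int) -> float:
    """Chance of at least one 1-in-T-year event within the horizon.

    Assumes each year is independent of the others (an illustrative assumption).
    """
    p = annual_exceedance_probability(return_period_years)
    return 1.0 - (1.0 - p) ** horizon_years


print(annual_exceedance_probability(10))       # 1-in-10-year event
print(return_period_from_probability(0.01))    # 1% annual chance
print(round(chance_within_horizon(10, 10), 3)) # 10-year event over a 10-year horizon
```

Under these assumptions, a 1-in-10-year event is more likely than not to occur at least once in any given decade, which is why a return period should be read as a long-run average rather than a schedule.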

The notebook allows you to choose the data you want to use for the calculations. [Data available on the Analytics Engine](https://analytics.cal-adapt.org/data/) include future climate projections and historical climate data of multiple variables at different spatial resolutions, geographic locations, and temporal resolutions. You may also visualize the resulting values of interest on maps and export all results.

In this notebook, return values, probabilities, and periods can be inferred from extreme weather events identified as the events with the maximum value in a year (e.g. the hottest hour in a year). To examine the changing frequency of events above or below a specific threshold value (e.g. critical value for infrastructure), please see the *threshold_tools_exceedance.ipynb* notebook. Development to use return value, probability, and return period calculation tools with threshold values instead of maximum values is currently ongoing.

This notebook provides code to perform example analyses. Information on how to customize the code for your own analysis is also available.

To execute a code 'cell' of this notebook, place the cursor in the cell and press the 'play' icon, or simply press shift+enter together. Some cells will take longer to run, and you will see a [∗] to the left of the cell while Analytics Engine is still working.


The techniques in this notebook come from applications of extreme value theory to climate data. For further reading on this topic, see [Cooley 2009](https://link.springer.com/article/10.1007/s10584-009-9627-x).

<a id='setup'></a>
## Step 0: Setup

### Import necessary packages before running analysis

In [None]:
import panel as pn
pn.extension()

In [None]:
import xarray as xr

import climakitae as ck
from climakitae import threshold_tools

### Load a new `climakitae` application

In [None]:
app = ck.Application()

<a id='sel'></a>
## Step 1: Select and retrieve data of interest

### Select data

In the code cell below, the `app.select()` function of the `climakitae` app displays an interface for data selection. The selected data will be used to calculate return values, probabilities, and periods.

To perform the example analyses provided later in the notebook, use the area subset dropdown (displayed in `app.select()` below the inset map) to subset the data to the state of California (CA). Otherwise, you can leave the default options displayed by the panel unchanged.

To learn more about the data available on the Analytics Engine, see our [data catalog](https://analytics.cal-adapt.org/data/). The *getting_started.ipynb* notebook contains additional explanations of the data.

<span style="color:red"><b>Warning:</b></span> Ensure that you __don't__ compute an area average across grid cells within your selected region. You can achieve this by setting the area average option to **No** in the second tab on the interface.

__Note:__
- This version only offers the [dynamically-downscaled](https://dept.atmos.ucla.edu/alexhall/downscaling-cmip6) data.
- If you select 'daily' for 'Timescale', it will result in a daily aggregation of the hourly data. If you select 'monthly' for 'Timescale', it will result in a monthly aggregation of the daily data. The aggregation can be average, maximum, or sum as appropriate for the data variable.

__Tip:__ When performing your own analysis with __Future Model Data__, select just one scenario. It helps to streamline the analysis.

In [None]:
app.select()

### Retrieve data

Run `app.retrieve()` to load the data selected above.

In [None]:
generated_data = app.retrieve()

__Note:__ The code cell above may take several minutes. When it finishes running, you can preview the loaded data below. The data will be stored as a `DataArray` object called `generated_data`. Learn more about `DataArray` [here](http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/xarray.html#Xarray-data-structures).

__Tip:__ In Step 2, you will subset the data by one scenario and one simulation. Available scenarios and simulations can also be found in the data preview.

In [None]:
generated_data

<a id='transform'></a>
## Step 2: Transform data to prepare for calculations

### Subset data by scenario and simulation to prepare it for `threshold_tools` functions

Currently, the `threshold_tools` functions that perform the calculations require an input `DataArray` in which only one scenario and one simulation are selected. Replace the values passed to `scenario=` and `simulation=` with the particular selections that are present in your data.

__Tip:__ Available selections can be found in the preview of `generated_data` in Step 1.

The example code selects 'historical' as the scenario and 'WRF_CNRM-ESM2-1_r1i1p1f2' (the CNRM-ESM2-1 simulation) as the simulation. The resulting `subsetted_data` is a `DataArray` that contains one value for each grid cell at each timestamp.

In [None]:
subsetted_data = generated_data.sel(scenario='historical').sel(simulation='WRF_CNRM-ESM2-1_r1i1p1f2')
subsetted_data
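
For readers new to `xarray`, the subsetting step can be tried on a toy array first. This is a minimal sketch: the dimension layout and the coordinate labels (`ssp370`, `sim_a`, `sim_b`) are hypothetical stand-ins, and your retrieved data may be organized differently.

```python
import numpy as np
import xarray as xr

# Toy array with hypothetical labels; real data has its own scenario/simulation names
toy = xr.DataArray(
    np.zeros((2, 2, 3, 4, 5)),
    dims=("scenario", "simulation", "time", "y", "x"),
    coords={
        "scenario": ["historical", "ssp370"],
        "simulation": ["sim_a", "sim_b"],
    },
    name="toy_variable",
)

# List the labels that can be passed to .sel()
print(toy["scenario"].values)
print(toy["simulation"].values)

# Selecting a single label drops that dimension from the result
subset = toy.sel(scenario="historical", simulation="sim_a")
print(subset.dims)
```

Selecting one label per dimension leaves only the time and spatial dimensions, which matches the description of one value per grid cell per timestamp.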

__Note:__ All the following calculations are performed on individual grid cells.

### Pull Annual Maximum Series (AMS) for all grid cells

This is the first step of extreme value analysis: identifying what conditions are extreme. In this example, we default to considering each annual maximum value as a sample of an extreme event. Here, extreme events are evaluated using the annual block maxima approach, which determines the maximum value within a given block period (year). Because this approach keeps only one maximum per year, it is limited when multiple extremes occur in a single year: the additional extremes are excluded, even though they may be more extreme than the maxima of other years. This limitation makes the tools in this notebook less suitable for California applications such as atmospheric river events or evaluating extreme wet and dry years.
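
The block maxima idea can be illustrated without any climate data. The sketch below is not the implementation behind `get_ams`; it uses a hypothetical helper and a synthetic daily temperature series to show how one value per year is kept and everything else is discarded.

```python
import math
from collections import defaultdict
from datetime import date, timedelta


def annual_maxima(timestamps, values):
    """Return {year: maximum value} for paired timestamps and values."""
    by_year = defaultdict(list)
    for ts, value in zip(timestamps, values):
        by_year[ts.year].append(value)
    return {year: max(vals) for year, vals in sorted(by_year.items())}


# Synthetic daily temperatures (K) for 2001-2003: a seasonal cycle plus a slow trend
start = date(2001, 1, 1)
days = [start + timedelta(days=i) for i in range(3 * 365)]
temps = [
    285.0 + 10.0 * math.sin(2.0 * math.pi * i / 365.0) + 0.001 * i
    for i in range(len(days))
]

toy_ams = annual_maxima(days, temps)
for year, value in toy_ams.items():
    print(year, round(value, 2))
```

With `xarray`, an analogous reduction over a `time` dimension can be written as `data.groupby("time.year").max()`.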

After pulling the AMS, run `app.load` to bring the data down to an appropriate size for later computations.

__Note:__ Running `app.load` may take several minutes.

In [None]:
ams = threshold_tools.get_ams(subsetted_data, extremes_type='max')
ams = app.load(ams) 
ams

<a id='calc'></a>
## Step 3: Calculate values of interest

__Note:__ All calculated values will be stored in `Dataset` objects. Learn more about them [here](http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/xarray.html#Xarray-data-structures).

### 3a) Find a distribution to use for calculation

Calculating return values, probabilities, and periods requires [fitting a probability distribution](https://en.wikipedia.org/wiki/Probability_distribution_fitting) to the annual maximum data values computed in Step 2. Step 3a) contains tools for finding a distribution that fits the data well. You can select among a list of distributions and evaluate how well a selected distribution fits the annual maximum data values. The evaluation is conducted through a goodness of fit statistical test. You can also visualize the test results on a map.

#### Test goodness of fit of selected distribution

The `get_ks_stat` function of `threshold_tools` performs the [KS goodness of fit test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test). The test can be used to evaluate the fit between data and a reference probability distribution. Specifically, the function tests the null hypothesis that the input data are distributed according to the selected distribution. The alternative hypothesis is that the data are _not_ distributed according to the distribution. The function outputs p-values from the tests. At the 5% significance level (95% confidence), the null hypothesis is rejected in favor of the alternative if the p-value is less than 0.05, suggesting that the selected distribution _doesn't_ fit the data well.
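
To see what such a test involves for a single grid cell, here is a minimal sketch using `scipy.stats` directly on synthetic annual maxima. It is an illustration under stated assumptions, not the code behind `get_ks_stat`; the sample size and GEV parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic "annual maxima" for one grid cell: 30 years drawn from a known GEV
toy_ams = stats.genextreme.rvs(c=0.1, loc=300.0, scale=2.0, size=30, random_state=rng)

# Fit a GEV by maximum likelihood; scipy returns (shape, location, scale)
shape, loc, scale = stats.genextreme.fit(toy_ams)

# One-sample KS test of the data against the fitted GEV
ks_result = stats.kstest(toy_ams, "genextreme", args=(shape, loc, scale))
print(f"KS statistic = {ks_result.statistic:.3f}, p-value = {ks_result.pvalue:.3f}")

# Decision at the 5% significance level
rejected = ks_result.pvalue < 0.05
print("GEV rejected" if rejected else "GEV not rejected")
```

Because the distribution parameters are estimated from the same sample that is then tested, p-values from this kind of check tend to be optimistic, so a high p-value is best read as "no evidence against the distribution" rather than proof of a good fit.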

The example code performs the KS test on the AMS data with the generalized extreme value (GEV) distribution as the reference distribution.

Below is a full list of reference distributions that can be specified in the `distr=` argument of the `get_ks_stat` function, along with information on the situations in which each distribution is often used.

- __gev__ - Generalized extreme value (GEV) distribution - allows for a continuous range of different shapes, and reduces to the Gumbel, Fréchet, and Weibull distributions for different values of its shape parameter. The GEV distribution may generally provide a better fit than the three individual distributions, and is a common distribution used in hydrological applications.
- __gumbel__ - Gumbel distribution - Range of interest is unlimited
- __weibull__ - Weibull distribution - Range of interest has an upper limit
- __pearson3__ - Pearson Type III distribution - Range of interest has a lower limit
- __genpareto__ - Generalized Pareto distribution - This distribution is often used in application for river flood events and suggested to be of a good general fit for precipitation in the United States.


In [None]:
goodness_of_fit = threshold_tools.get_ks_stat(ams, distr='gev', multiple_points=True)
goodness_of_fit

Evaluate the p-values in the `goodness_of_fit` `Dataset` to ensure the selected distribution fits the data well. Once you have identified a distribution with satisfactory goodness of fit, please proceed to Step 3b) to calculate return values, probabilities, and/or periods.

__Tip:__ You may also map the p-values below. 

#### Visualize goodness of fit test results

Observe a geospatial map of p-values from the KS test to ensure the selected distribution fits the data well. The p-values should be above the critical value associated with your desired level of confidence.

If you have been following the example code, notice the p-values are all above 0.05, so at the 5% significance level the test does not reject the GEV distribution for any grid cell, which is consistent with the GEV distribution fitting the AMS well.

In [None]:
threshold_tools.get_geospatial_plot(goodness_of_fit, data_variable='p_value')
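
Alongside the map, a single summary number can help when comparing candidate distributions. The sketch below works on a small hypothetical 2-D array of p-values, with `NaN` standing in for grid cells that have no data; with real output you would take the values of the `p_value` variable instead.

```python
import numpy as np

# Hypothetical p-values on a 2 x 3 grid; NaN marks a cell without data
p_values = np.array([
    [0.50, 0.02, np.nan],
    [0.90, 0.30, 0.04],
])

valid = ~np.isnan(p_values)
n_valid = int(valid.sum())

# Count cells where the selected distribution is rejected at the 5% level
n_rejected = int((p_values[valid] < 0.05).sum())
fraction_rejected = n_rejected / n_valid

print(f"{n_rejected} of {n_valid} cells rejected ({fraction_rejected:.0%})")
```

A distribution that is rejected in a large share of cells is a weak candidate for the calculations in Step 3b).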

### 3b) Calculate values of interest using a distribution that fits the data well

#### Calculate return value for a selected return period

The `get_return_value` function in `threshold_tools` calculates the return value for a certain return period (i.e., 1-in-X-year event). Confidence intervals of the return values can also be calculated.

The example code calculates the return value for a 1-in-10-year extreme high monthly average temperature event. The return values are inferred from GEV distributions fitted to the AMS. A hundred bootstrap samples are also used to calculate 95% confidence intervals.

To perform your own analysis, specify `distr=` as the distribution you found in Step 3a), and `return_period=` as the return period of your interest (in years). You may also specify a different number of bootstrap samples used to calculate confidence intervals, as well as different lower and upper bounds of the confidence intervals. 

__Tip:__ `bootstrap_runs`, `conf_int_lower_bound`, and `conf_int_upper_bound` are set to the default values in the example. If you want to perform the analysis with these default values, you don't need to specify them explicitly. For instance, the example code is equivalent to `threshold_tools.get_return_value(ams, return_period=10, distr='gev', multiple_points=True)`


In [None]:
return_value = threshold_tools.get_return_value(
    ams, return_period=10, distr='gev',
    bootstrap_runs=100,
    conf_int_lower_bound=2.5,
    conf_int_upper_bound=97.5,
    multiple_points=True
)
return_value

Examine the results in the `return_value` `Dataset`, proceed to calculate the next value of your interest, or visualize `return_value` [here](#vis_return_value) in Step 4.
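
For intuition about what a return value with bootstrap confidence intervals means, the following sketch reproduces the idea for one synthetic grid cell using `scipy.stats`. It is an assumption-laden illustration (percentile bootstrap, maximum likelihood GEV fit, arbitrary synthetic data), not a description of how `get_return_value` is implemented.

```python
import numpy as np
from scipy import stats


def gev_return_value(sample, return_period):
    """Fit a GEV and return the value exceeded with annual probability 1/return_period."""
    params = stats.genextreme.fit(sample)
    return float(stats.genextreme.isf(1.0 / return_period, *params))


rng = np.random.default_rng(0)

# Synthetic annual maxima (K) for one grid cell; note scipy's shape `c`
# has the opposite sign to the shape parameter often used in the literature
toy_ams = stats.genextreme.rvs(c=0.1, loc=300.0, scale=2.0, size=40, random_state=rng)

point_estimate = gev_return_value(toy_ams, return_period=10)

# Percentile bootstrap: resample years with replacement, refit, recompute
boot_values = np.array([
    gev_return_value(rng.choice(toy_ams, size=toy_ams.size, replace=True), 10)
    for _ in range(100)
])
lower, upper = np.nanpercentile(boot_values, [2.5, 97.5])

print(f"10-year return value: {point_estimate:.2f} K")
print(f"95% bootstrap interval: [{lower:.2f}, {upper:.2f}] K")
```

More bootstrap runs give a more stable interval at the cost of longer run time, which matters when the calculation is repeated for every grid cell.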

#### Calculate return probability of exceedance of selected threshold

The `get_return_prob` function in `threshold_tools` calculates the probability of a variable exceeding a certain threshold. Confidence intervals of the return probabilities can also be calculated.

The example code calculates the probability of monthly average temperature exceeding 300 Kelvin. The return probabilities are inferred from Pearson Type III distributions fitted to the AMS. By default, a hundred bootstrap samples are also used to calculate 95% confidence intervals.

__Note:__ The number of bootstrap samples and the confidence level of confidence intervals are not specified explicitly in the example code, and are therefore set to the default values: `bootstrap_runs=100, conf_int_lower_bound=2.5, conf_int_upper_bound=97.5`.

To perform your own analysis, specify `distr=` as the distribution you found in Step 3a), and `threshold=` as the threshold of your interest. The unit of the threshold is assumed to be the same as that of the data variable in the AMS. You may also specify the number of bootstrap samples (using `bootstrap_runs=`), as well as the lower and upper bounds of the confidence intervals (using `conf_int_lower_bound=` and `conf_int_upper_bound=`).

In [None]:
return_prob = threshold_tools.get_return_prob(ams, threshold=300, distr='pearson3', multiple_points=True)
return_prob

Examine the results in the `return_prob` `Dataset`, proceed to calculate the next value of your interest, or visualize `return_prob` [here](#vis_return_prob) in Step 4.
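
The probability of exceedance is the upper-tail area of the fitted distribution beyond the threshold. A minimal single-cell sketch with `scipy.stats.pearson3` on synthetic data follows; the parameters are arbitrary and this is not the code behind `get_return_prob`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic annual maxima (K) for one grid cell
toy_ams = stats.pearson3.rvs(0.5, loc=299.0, scale=1.5, size=40, random_state=rng)

# Fit a Pearson Type III distribution; scipy returns (skew, location, scale)
skew, loc, scale = stats.pearson3.fit(toy_ams)

# Survival function = probability that an annual maximum exceeds the threshold
threshold = 300.0
prob_exceed = float(stats.pearson3.sf(threshold, skew, loc, scale))

# Simple sanity check: share of sampled years above the threshold
empirical_share = float(np.mean(toy_ams > threshold))

print(f"Fitted probability of exceeding {threshold} K: {prob_exceed:.3f}")
print(f"Share of sampled years above {threshold} K: {empirical_share:.3f}")
```

The fitted probability and the empirical share will generally differ, since the fitted distribution smooths over a small sample, but a large gap between them can be a hint to revisit the choice of distribution.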

#### Calculate return period for a selected return value

The `get_return_period` function in `threshold_tools` calculates the return period (i.e., 1-in-X-year) for a certain return value. Confidence intervals of the return periods can also be calculated.

The example code calculates the return period of 300 Kelvin events. The return periods are inferred from Weibull distributions fitted to the AMS. By default, a hundred bootstrap samples are also used to calculate 95% confidence intervals.

__Note:__ The number of bootstrap samples and the confidence level of confidence intervals are not specified explicitly in the example code, and are therefore set to the default values: `bootstrap_runs=100, conf_int_lower_bound=2.5, conf_int_upper_bound=97.5`.

To perform your own analysis, specify `distr=` as the distribution you found in Step 3a), and `return_value=` as the threshold of your interest. The unit of the return value is assumed to be the same as that of the data variable in the AMS. You may also specify the number of bootstrap samples (using `bootstrap_runs=`), as well as the lower and upper bounds of the confidence intervals (using `conf_int_lower_bound=` and `conf_int_upper_bound=`).

In [None]:
return_period = threshold_tools.get_return_period(ams, return_value=300, distr='weibull', multiple_points=True)
return_period

Examine the results in the `return_period` `Dataset`, or visualize `return_period` [here](#vis_return_period) in Step 4.
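
A return period is the reciprocal of the annual exceedance probability, and it is the inverse of the return value calculation. The sketch below checks that round trip for one synthetic grid cell. For brevity it fits a Gumbel distribution with `scipy.stats` rather than the Weibull used in the example above, and it is not the code behind `get_return_period`.

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(2)

# Synthetic annual maxima (K) for one grid cell
toy_ams = stats.gumbel_r.rvs(loc=298.0, scale=1.5, size=40, random_state=rng)

# Fit a Gumbel distribution; scipy returns (location, scale)
loc, scale = stats.gumbel_r.fit(toy_ams)

# Return period (years) of a 300 K event = 1 / annual exceedance probability
value = 300.0
annual_prob = float(stats.gumbel_r.sf(value, loc, scale))
period_years = 1.0 / annual_prob

# Round trip: the return value for that period should recover the original value
recovered_value = float(stats.gumbel_r.isf(1.0 / period_years, loc, scale))

print(f"Return period of {value} K: {period_years:.1f} years")
print(f"Return value for that period: {recovered_value:.2f} K")
```

Because the return period is a reciprocal, it grows very quickly as the chosen value moves into the far tail of the fitted distribution, so very large periods should be read with caution.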

<a id='vis'></a>
## Step 4: Visualize values of interest

This step allows you to visualize the `Dataset` outputs from Step 3b).

<a id='vis_return_value'></a>
### Visualize return value

Observe a geospatial map of return values for the selected return period.

In [None]:
threshold_tools.get_geospatial_plot(return_value, data_variable='return_value')

<a id='vis_return_prob'></a>
### Visualize return probability

Observe a geospatial map of the probabilities of exceeding the selected threshold.

In [None]:
threshold_tools.get_geospatial_plot(return_prob, data_variable='return_prob')

<a id='vis_return_period'></a>
### Visualize return period

Observe a geospatial map of return periods for the selected return value.

In [None]:
threshold_tools.get_geospatial_plot(return_period, data_variable='return_period', bar_max=1000)

When you are done with Step 4, you may export results in [Step 5](#export), or clean up the notebook in [Step 6](#end).


<a id='export'></a>
## Step 5: Export results

To export any `DataArray` or `Dataset` object in the notebook (the AMS, the return values, etc.), first execute the following code cell and pick a file format.

__Tip:__ We recommend the NetCDF file format, which will work with any number of dimensions in your `DataArray` or `Dataset`.

In [None]:
app.export_as()

Next, specify the `DataArray` or `Dataset` object you wish to export and your desired file name (in single or double quotation marks).

In [None]:
app.export_dataset(return_period, 'my_filename_1')

If you would like to save data as a CSV or GeoTIFF file, __please note:__

- CSV and GeoTIFF can only be used for `DataArray`
- CSV works best for up to 2-dimensional data (e.g., lon x lat), and will be compressed and exported with a separate metadata file
- GeoTIFF can accept 3 dimensions in total:
    - x and y dimensions are required
    - The third dimension is flexible and will be a "band" in the file: time, simulation, or scenario could go here
    - Metadata will be accessible as "tags" in the .tif file
    


To export a `Dataset` as a CSV or GeoTIFF file, please subset it with your desired variable first, then select either CSV or GeoTIFF as your format (NetCDF will also work). For example:

In [None]:
variable = 'return_period'
return_period_variable = return_period[variable]

In [None]:
app.export_as()

In [None]:
app.export_dataset(return_period_variable, 'my_filename_2')

<a id='end'></a>
## Step 6: Clean up

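Lastly, when you are done, close your cluster resources to free them up for the next time you work. The cell below assumes that a compute cluster was started earlier in your session and assigned to a variable named `cluster`; this notebook does not create one, so skip the cell if you have not started a cluster.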

In [None]:
cluster.close()