# Extreme temperature identification

This notebook contains working notes and code to identify temperature extremes.

My goal is to develop methods that will identify extremes, and then be able to compute various facets of extreme events, such as duration and magnitude.

## Quantile computation

The xarray `quantile` method currently does not support dask arrays, so quantiles have to be computed on an in-memory DataArray (that is, one backed by a NumPy array). I suspect that the implementation of `quantile` uses NumPy's `nanpercentile` for the calculation. This is known to be slow because it loops a 1-D calculation over all points [https://krstn.eu/np.nanpercentile()-there-has-to-be-a-faster-way/].

Instead of implementing something complicated, we can just run the computation, wait it out, and save the result to a file for later use.
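
To make the suspicion about the NaN-aware NumPy routines concrete, here is a minimal sketch on synthetic data (the array shape, seed, and variable names are made up for illustration). It only checks that xarray's `quantile` and `np.nanquantile` agree on a small in-memory array with missing values; it says nothing about which routine xarray calls internally.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a (time, lat, lon) temperature array; not the CPC data.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4, 5))
data[rng.random(data.shape) < 0.1] = np.nan  # sprinkle in missing values

da = xr.DataArray(data, dims=("time", "lat", "lon"))

# xarray quantile along time on an in-memory DataArray (NaNs are skipped).
q_xr = da.quantile(0.9, dim="time")

# NumPy reference: NaN-aware quantile along the time axis.
q_np = np.nanquantile(data, 0.9, axis=0)

same = np.allclose(q_xr.values, q_np, equal_nan=True)
print(same)
```

If the two agree, timing `np.nanquantile` on a small subset should give a rough idea of how long the full xarray computation will take.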



In [2]:
from distributed import Client

In [3]:
client = Client()
client

Client: Scheduler: tcp://127.0.0.1:39246, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 9, Cores: 72, Memory: 405.33 GB


In [5]:
dir(client)

['__aenter__',
 '__aexit__',
 '__await__',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_asynchronous',
 '_cancel',
 '_close',
 '_connecting_to_scheduler',
 '_dec_ref',
 '_deserializers',
 '_ensure_connected',
 '_expand_key',
 '_expand_resources',
 '_expand_retries',
 '_gather',
 '_gather_future',
 '_gather_keys',
 '_gather_remote',
 '_gather_semaphore',
 '_get_dataset',
 '_get_futures_error',
 '_get_task_stream',
 '_graph_to_futures',
 '_handle_cancelled_key',
 '_handle_error',
 '_handle_key_in_memory',
 '_handle_lost_data',
 '_handle_report',
 '_handle_restart',
 '_handle_retried_key',
 '_handle_scheduler_coroutine',
 '_hand

In [6]:
import numpy as np 
import xarray as xr

In [7]:
%matplotlib inline

In [8]:
%%time
ds = xr.open_mfdataset("/project/amp/jcaron/CPC_Tminmax/tmax.*.nc")
tmax = ds['tmax']
print(tmax)

<xarray.DataArray 'tmax' (time: 14736, lat: 360, lon: 720)>
dask.array<shape=(14736, 360, 720), dtype=float32, chunksize=(365, 360, 720)>
Coordinates:
  * lat      (lat) float32 89.75 89.25 88.75 88.25 ... -88.75 -89.25 -89.75
  * lon      (lon) float32 0.25 0.75 1.25 1.75 ... 358.25 358.75 359.25 359.75
  * time     (time) datetime64[ns] 1979-01-01 1979-01-02 ... 2019-05-06
Attributes:
    level_desc:    Surface
    statistic:     Mean
    parent_stat:   Other
    long_name:     Daily Maximum Temperature
    cell_methods:  time: mean
    valid_range:   [-90.  50.]
    avg_period:    0000-00-01 00:00:00
    dataset:       CPC Global Temperature
    comments:      GTS data and is gridded using the Shepard Algorithm
    max_period:    6z to 6z
    units:         degC
    var_desc:      Maximum Temperature
CPU times: user 980 ms, sys: 170 ms, total: 1.15 s
Wall time: 2.89 s


In [None]:
%%time
# what is the median value for each day
# tmax_median_by_day = tmax.groupby('time.dayofyear').median(dim='time')
# tmax_median_by_day.sel(lat=55, lon=83, method='nearest').plot()

In [11]:
%%time
tmax.load()

CPU times: user 13.2 s, sys: 25 s, total: 38.3 s
Wall time: 44.7 s


<xarray.DataArray 'tmax' (time: 14736, lat: 360, lon: 720)>
array([[[nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan],
        ...,
        [nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan]],

       [[nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan],
        ...,
        [nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan]],

       ...,

       [[nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan],
        ...,
        [nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan]],

       [[nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan],
        ...,
        [nan, nan, ..., nan, nan],
        [nan, nan, ..., nan, nan]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 89.75 89.25 88.75 88.25 ... -88.75 -89.25 -89.75
  * lon      (lon) float32 0.25 0.75 1.25 1.75 ... 358.25 358.75 359.25 359.75
  * time     (time) datetime64[ns] 1979-01-01 1979-01-02 ... 2019-05-06
Attributes:
    level_desc:    Surface
    statistic

In [9]:
%%time
grp = tmax.groupby('time.dayofyear')

CPU times: user 23.2 ms, sys: 947 µs, total: 24.1 ms
Wall time: 20.6 ms


In [14]:
%%time
# get the 90th percentile
tmax_90_by_day = grp.apply(xr.DataArray.quantile, args=(0.9,), dim='time', interpolation='linear', keep_attrs=True)



CPU times: user 1h 1min 23s, sys: 7min 43s, total: 1h 9min 7s
Wall time: 1h 21s


In [16]:
tmax_90_by_day.name = "tmax90pct"

In [18]:
tmax_90_by_day.to_netcdf("/project/amp/brianpm/TemperatureExtremes/Derived/CPC_tmax_90pct_dayofyear.nc")

## Thoughts on this approach

I like this xarray approach because it works in one line. The obvious downside is that it is slow: the day-of-year 90th percentile above took about an hour of wall time.

In our next step, we probably want to pool days on either side of each day to increase the sample size. In xarray, we can do that with a rolling operation, but I think that would need to be separate from the groupby operation.
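
One way the rolling idea might look, as a sketch on synthetic data rather than a tested solution for the full dataset: `rolling(...).construct(...)` exposes each centered window as an extra dimension, and the groupby step then pools over both `time` and that window dimension. The names `half_width` and `window`, and the demo array itself, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily series on a tiny grid (four years, including one leap year).
time = pd.date_range("2001-01-01", "2004-12-31", freq="D")
rng = np.random.default_rng(1)
da = xr.DataArray(
    rng.normal(size=(time.size, 2, 3)),
    coords={"time": time},
    dims=("time", "lat", "lon"),
    name="tmax",
)

half_width = 2  # days pooled on either side of each day (illustrative choice)

# Step 1: rolling. Each time step gets a new "window" dimension holding the
# values from half_width days before to half_width days after (NaN-padded
# at the two ends of the record).
windowed = da.rolling(time=2 * half_width + 1, center=True).construct("window")

# Step 2: groupby. For each day of year, take the quantile over all years
# and all window positions at once.
q90 = windowed.groupby("time.dayofyear").map(
    lambda g: g.quantile(0.9, dim=["time", "window"])
)
print(q90.sizes)
```

Note that `construct` multiplies the memory footprint by the window length, so on the full `tmax` array this would probably need to be applied to spatial subsets.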

Probably the easiest option is to handle the time sampling in a loop. Then there is the question of speed: is it worth breaking the quantile calculation into a loop and spreading it out using multiprocessing, or should we just wait it out and avoid setting up a pool?
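
A minimal serial sketch of the loop option, on synthetic data. The helper `windowed_doy_quantile` and its parameters are hypothetical names, and it assumes `time` is the first dimension. Days are pooled within `half_width` of each day of year, using a circular distance on a 366-day calendar so that late December and early January are neighbors (this is slightly off by one in non-leap years).

```python
import numpy as np
import pandas as pd
import xarray as xr


def windowed_doy_quantile(da, q=0.9, half_width=2):
    """Quantile for each day of year, pooling days within +/- half_width.

    Hypothetical helper; assumes "time" is the first dimension of `da`.
    """
    doy = da["time"].dt.dayofyear.values
    days = np.arange(1, 367)
    values = da.values
    out = np.full((days.size,) + values.shape[1:], np.nan)
    for i, day in enumerate(days):
        # Circular day-of-year distance on a 366-day calendar.
        dist = np.abs(doy - day)
        dist = np.minimum(dist, 366 - dist)
        sample = values[dist <= half_width]
        if sample.shape[0] == 0:
            continue  # no data for this day (e.g. day 366 without leap years)
        out[i] = np.nanquantile(sample, q, axis=0)
    return xr.DataArray(
        out,
        coords={"dayofyear": days},
        dims=("dayofyear",) + da.dims[1:],
        name="q_by_day",
    )


# Synthetic demo data, not the CPC dataset.
time = pd.date_range("2001-01-01", "2004-12-31", freq="D")
rng = np.random.default_rng(1)
da = xr.DataArray(
    rng.normal(size=(time.size, 2, 3)),
    coords={"time": time},
    dims=("time", "lat", "lon"),
)

result = windowed_doy_quantile(da, q=0.9, half_width=2)
print(result.shape)
```

Each iteration is independent of the others, so the loop body is the natural unit of work to hand to a pool if the serial version turns out to be too slow.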

In [38]:
tmax_pt = tmax[:,100,100]
tmax_pt

<xarray.DataArray 'tmax' (time: 14736)>
array([14.950293, 16.44022 , 19.941858, ..., 20.553688, 20.05624 , 22.033697],
      dtype=float32)
Coordinates:
    lat      float32 39.75
    lon      float32 50.25
  * time     (time) datetime64[ns] 1979-01-01 1979-01-02 ... 2019-05-06
Attributes:
    level_desc:    Surface
    statistic:     Mean
    parent_stat:   Other
    long_name:     Daily Maximum Temperature
    cell_methods:  time: mean
    valid_range:   [-90.  50.]
    avg_period:    0000-00-01 00:00:00
    dataset:       CPC Global Temperature
    comments:      GTS data and is gridded using the Shepard Algorithm
    max_period:    6z to 6z
    units:         degC
    var_desc:      Maximum Temperature

In [44]:
g = dict(tmax_pt.groupby("time.dayofyear"))  # a bit of a kludge, but gives keys 1-366 whose values are DataArrays

In [50]:
tmax_by_day = dict(tmax.groupby("time.dayofyear"))  # same kludge for the full array: keys 1-366, values are DataArrays

In [54]:
tmax_90th_by_day = {k:np.nanquantile(tmax_by_day[k], 0.9, axis=0) for k in tmax_by_day}
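
The dictionary of NumPy arrays loses the coordinate labels. A small sketch of how it could be reassembled into a DataArray with a `dayofyear` dimension, shown on synthetic data (`tmax_demo` and the tiny lat/lon grid are made up for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for tmax on a 2x3 grid.
time = pd.date_range("2001-01-01", "2004-12-31", freq="D")
rng = np.random.default_rng(2)
tmax_demo = xr.DataArray(
    rng.normal(size=(time.size, 2, 3)),
    coords={"time": time, "lat": [10.0, 20.0], "lon": [0.0, 1.0, 2.0]},
    dims=("time", "lat", "lon"),
)

# Same pattern as above: dict of groups, then a dict of per-day quantiles.
by_day = dict(tmax_demo.groupby("time.dayofyear"))
q90_dict = {k: np.nanquantile(v.values, 0.9, axis=0) for k, v in by_day.items()}

# Stack the per-day arrays in day order and reattach the coordinates.
days = sorted(q90_dict)
q90 = xr.DataArray(
    np.stack([q90_dict[d] for d in days]),
    coords={"dayofyear": days, "lat": tmax_demo["lat"], "lon": tmax_demo["lon"]},
    dims=("dayofyear", "lat", "lon"),
    name="tmax90pct",
)
print(q90.sizes)
```

The result can then be selected by label, for example `q90.sel(dayofyear=60)`, and written out with `to_netcdf` like the groupby result.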

In [10]:
%%time
# This is probably the same as above, but now we get a better data structure out of it.
# Slow when applied to the whole dataset.
# Requires tmax to be in memory (tmax.load()); on a dask-backed array it raises a TypeError.
tmax_90_by_day_2 = tmax.groupby('time.dayofyear').apply(xr.DataArray.quantile, args=(0.9,), dim='time', interpolation='linear', keep_attrs=True)

TypeError: quantile does not work for arrays stored as dask arrays. Load the data via .compute() or .load() prior to calling this method.

In [13]:
# how to put a window around each day to increase the sample size and smooth the quantile time series



ValueError: applied function returned data with unexpected number of dimensions: 0 vs 2, for dimensions ('lat', 'lon')
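
The ValueError above resembles what `xr.apply_ufunc` reports when the applied function reduces a dimension that was not declared as a core dimension. Assuming that was the cause here (the code that produced the error is not shown), a minimal sketch on synthetic data declares `time` as an input core dimension and reduces over the last axis, which is where `apply_ufunc` moves core dimensions:

```python
import numpy as np
import xarray as xr

# Synthetic (time, lat, lon) array; not the CPC data.
rng = np.random.default_rng(3)
da = xr.DataArray(rng.normal(size=(40, 2, 3)), dims=("time", "lat", "lon"))

q90 = xr.apply_ufunc(
    np.nanquantile,
    da,
    input_core_dims=[["time"]],     # "time" is consumed by the function
    kwargs={"q": 0.9, "axis": -1},  # core dims are moved to the last axis
)
print(q90.dims)
```

Because `time` is listed in `input_core_dims`, the output keeps only the remaining `lat` and `lon` dimensions.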

In [12]:
xr.apply_ufunc?

Signature: xr.apply_ufunc(func, *args, **kwargs)
Docstring:
apply_ufunc(func : Callable,
               *args : Any,
               input_core_dims : Optional[Sequence[Sequence]] = None,
               output_core_dims : Optional[Sequence[Sequence]] = ((),),
               exclude_dims : Collection = frozenset(),
               vectorize : bool = False,
               join : str = 'exact',
               dataset_join : str = 'exact',
               dataset_fill_value : Any = _NO_FILL_VALUE,
               keep_attrs : bool = False,
               kwargs : Mapping = None,
               dask : str = 'forbidden',
               output_dtypes : Optional[Sequence] = None,
               output_sizes : Optional[Mapping[Any, int]] = None)

Apply a vectorized function for unlabeled arrays on xarray objects.

The function will be m