In [1]:
import xarray as xr
import numpy as np
import dask
import datetime

In [2]:
# suppress warnings: you need this if running the original code, otherwise it really slows down
import warnings
warnings.filterwarnings('ignore')

**Example calculation**<br>
----
Using NOAA OISST timeseries, I am selecting a smaller region to demo xmhw code.

In [3]:
# using NOAA OISST as timeseries
# (to open every year, pass the glob
#  '/g/data/ua8/NOAA_OISST/AVHRR/v2-1_modified/timeseries/oisst_timeseries_*.nc' instead of the list)
ds = xr.open_mfdataset(
    ['/g/data/ua8/NOAA_OISST/AVHRR/v2-1_modified/timeseries/oisst_timeseries_2003.nc',
     '/g/data/ua8/NOAA_OISST/AVHRR/v2-1_modified/timeseries/oisst_timeseries_2004.nc'],
    concat_dim='time', combine='nested')
# removing zlev dimension (squeeze the length-1 axis, then drop the leftover coordinate)
sst = ds['sst'].squeeze()
sst = sst.drop_vars('zlev')
# for the moment getting small region to test
sst_limited = sst.sel(lat=slice(-43,-42), lon=slice(149,150))
# only land example
#sst_limited = sst.sel(lat=slice(50,60),lon=slice(40,60))
tos = sst_limited.squeeze()
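
The selection above can be tried without access to the OISST files. The sketch below builds a small synthetic array on a 0.25 degree grid (the grid spacing, the cell-centre offsets and the names `demo` and `subset` are assumptions made for illustration, not part of xmhw) and applies the same squeeze and label-based slicing.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the OISST grid (assumed: 0.25 degree cells,
# cell centres offset by 0.125 degrees)
lat = np.arange(-44.875, -40, 0.25)
lon = np.arange(148.125, 152, 0.25)
time = np.arange(5)

demo = xr.DataArray(
    np.zeros((time.size, 1, lat.size, lon.size)),
    coords={'time': time, 'zlev': [0.0], 'lat': lat, 'lon': lon},
    dims=('time', 'zlev', 'lat', 'lon'),
    name='sst',
)

# remove the length-1 zlev axis and its leftover scalar coordinate
demo = demo.squeeze('zlev').drop_vars('zlev')

# label-based slices keep every grid point whose coordinate lies within the bounds
subset = demo.sel(lat=slice(-43, -42), lon=slice(149, 150))
print(dict(subset.sizes))
```

On this synthetic grid the one degree box contains four cell centres in each direction.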

**Calculate threshold separately and save it to file**<br>
We separated the calculation of the climatologies from the identification of marine heat waves (mhw). This gives two separate functions, so you can save and re-use the threshold while experimenting with different settings for the detection part.

In [4]:
from xmhw.xmhw import threshold, detect


The *threshold* function calculates the climatologies, i.e. the seasonal average and the threshold, which are then used to detect marine heat waves (mhw) along the timeseries.<br>This function mimics the behaviour of the original code, including returning a dictionary. We are looking at changing this so that it returns a dataset instead.<br> As in the original, several parameters can be set:
````
threshold(temp, tdim='time', climatologyPeriod=[None,None], pctile=90, windowHalfWidth=5,
          smoothPercentile=True, smoothPercentileWidth=31, maxPadLength=False, coldSpells=False, Ly=False)
````
Where *temp* is the temperature timeseries. This is the only input needed if you're happy with the default settings and your time dimension is called 'time'.<br><br>
In the following example we're using all default settings for threshold.
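
To give an idea of what *pctile* and *windowHalfWidth* control, here is a minimal single-cell sketch in plain numpy. It is not the xmhw implementation: the function name `simple_climatology` is made up, the day-of-year handling is simplified, and the smoothing step (*smoothPercentile*) is left out. For each day of the year it pools all values within the window across years and takes their mean and percentile.

```python
import numpy as np

def simple_climatology(temp, doy, pctile=90, window_half_width=5, n_doy=366):
    """Hypothetical, simplified day-of-year climatology for one grid cell."""
    temp = np.asarray(temp, dtype=float)
    doy = np.asarray(doy)
    seas = np.full(n_doy, np.nan)
    thresh = np.full(n_doy, np.nan)
    for d in range(1, n_doy + 1):
        # circular distance between day-of-year labels, so the window wraps over New Year
        dist = np.abs(doy - d)
        dist = np.minimum(dist, n_doy - dist)
        pool = temp[dist <= window_half_width]
        pool = pool[~np.isnan(pool)]
        if pool.size:
            seas[d - 1] = pool.mean()
            thresh[d - 1] = np.percentile(pool, pctile)
    return seas, thresh

# two years of noise-free synthetic daily data (365 + 366 days)
doy = np.concatenate([np.arange(1, 366), np.arange(1, 367)])
temp = 15 + 5 * np.sin(2 * np.pi * doy / 366)
seas, thresh = simple_climatology(temp, doy)
print(seas.shape, thresh.shape)
```

Both outputs are defined along day of year only, which is the same idea as the *doy* dimension of the xmhw results.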

In [8]:
#%%time
clim = threshold(tos)
# to actually get the calculation going use .compute /.load or .values
clim['thresh'].load()
clim['seas'].load()

It is important to note that, unlike the original function, which takes a 1D numpy array, here we are using xarray, so we can pass a 3D array (in fact any n-dimensional array) and the code will deal with it.<br>
You can see that we selected a 4x4 lat-lon region and that all the selected grid cells are ocean. <br>
The first thing the code does is stack all the dimensions except "time" into a new *cell* dimension. At the same time it removes all the land points, which are assumed to have np.nan values along the whole temporal axis.<br>
Before saving the results to netcdf the data is *unstacked*, so the result has the same dimensions as the original timeseries. The only difference is that the climatologies are saved not along the entire timeseries but only along the new *doy* dimension. Given that xarray keeps the coordinates with the arrays, there is no need to repeat the climatologies along the time axis.
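
The stack, drop land, unstack sequence can be illustrated with plain xarray calls. This is only a sketch of the idea on made-up data, not the code xmhw runs internally; the dimension name *cell* follows the description above, and a simple time mean stands in for the climatology.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = rng.normal(15, 1, size=(10, 2, 2))
data[:, 0, 0] = np.nan  # pretend this grid point is land: NaN at every timestep

da = xr.DataArray(
    data,
    coords={'time': np.arange(10), 'lat': [-42.875, -42.625], 'lon': [149.125, 149.375]},
    dims=('time', 'lat', 'lon'),
)

# collapse every dimension except time into a single 'cell' dimension
stacked = da.stack(cell=('lat', 'lon'))
# drop cells that are NaN along the whole time axis (land)
ocean = stacked.dropna(dim='cell', how='all')
# any per-cell calculation can now run along 'cell'; a time mean stands in here
result = ocean.mean('time')
# back to lat/lon: the dropped land cell comes back as NaN
restored = result.unstack('cell')
print(dict(stacked.sizes), dict(ocean.sizes), dict(restored.sizes))
```

Working along one *cell* dimension means the per-cell calculation does not need to know how many spatial dimensions the input had.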

In [None]:
# save threshold and seasonal average to netcdf
climds = xr.merge([clim['thresh'], clim['seas']])
climds.to_netcdf('climatology.nc')

**Filter MHW passing calculated climatologies to detect**<br>
The *detect* function identifies all the mhw events and their characteristics. It corresponds to the second part of the original detect function and again mimics the logic of the original code.

````
    detect(temp, thresh, seas, minDuration=5, joinAcrossGaps=True, maxGap=2,
           maxPadLength=None, coldSpells=False, tdim='time')
````
This time you have to pass the timeseries, the threshold and the seasonal average. The other parameters are optional.<br> The results are stored differently from the original function:
````
   Original structure: 
       - mhw is a dictionary
       - each characteristic is a key with a list of values, each value represents an event
       - Ex.  mhw['intensity_max'][ev]
````
First of all, the new function returns an xarray dataset, not a dictionary. Most importantly, there is one variable for each calculated field, and the events are stored all together rather than as separate arrays.<br> Let's see an example using all the default settings for the MHW filter.
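
To make the default settings concrete, here is a minimal single-cell sketch of the detection rule in plain numpy. It is an illustration only, not the xmhw or original implementation: the function name `find_events` is made up and the steps follow my reading of the original logic (find runs above the threshold, discard runs shorter than *minDuration*, then join events separated by at most *maxGap* days).

```python
import numpy as np

def find_events(temp, thresh, min_duration=5, join_across_gaps=True, max_gap=2):
    """Hypothetical helper: (start, end) index pairs, both inclusive, for one cell."""
    above = np.asarray(temp) > np.asarray(thresh)  # NaN compares as False
    # pad with False so every run has a rising and a falling edge
    padded = np.concatenate(([False], above, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    # keep only runs that last at least min_duration timesteps
    events = [(int(s), int(e)) for s, e in zip(starts, ends)
              if e - s + 1 >= min_duration]
    if not join_across_gaps:
        return events
    # merge events whose separating gap is max_gap timesteps or shorter
    joined = []
    for s, e in events:
        if joined and s - joined[-1][1] - 1 <= max_gap:
            joined[-1] = (joined[-1][0], e)
        else:
            joined.append((s, e))
    return joined

thresh = np.zeros(20)
# 5 warm days, a 2 day gap, then 6 warm days
temp = np.array([-1] * 2 + [1] * 5 + [-1] * 2 + [1] * 6 + [-1] * 5, dtype=float)
print(find_events(temp, thresh))                          # gap joined: one event
print(find_events(temp, thresh, join_across_gaps=False))  # two separate events
```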

In [6]:
from dask.distributed import Client, progress
client = Client(threads_per_worker=4, n_workers=1)
client

Client   Scheduler: tcp://127.0.0.1:43149   Dashboard: http://127.0.0.1:35409/status
Cluster  Workers: 1   Cores: 4   Memory: 33.56 GB


In [9]:
%%time
mhw = detect(tos, clim['thresh'], clim['seas']) 
mhw.compute()

The resulting dataset has two kinds of variables:
````
    events (time, lat, lon)
    relSeas (time, lat, lon)
    relThresh (time, lat, lon)
    end_idx (event, lat, lon)
    start_idx (event, lat, lon)
    intensity_cumulative (event, lat, lon)
````
Some are defined along time and have np.nan everywhere except where an event is defined. *events* is one of them; it will look like:<br>
  nan, nan, nan, 3, 3, 3, 3, 3, nan ... <br>
where 3 is the index of the first timestep of the event.
The *events* variable can be used as a coordinate for the other variables defined along the time axis.
The other group holds the mhw characteristics, which are defined along the *event* dimension.
The size of the *event* dimension is determined by the number of separate events identified. Separate events have different starting times. This means that if two different cells have events starting at timestep=50, these events will have the same index along the 'event' dimension regardless of their duration.<br>
Clearly this is an approximation, because an event that starts even one timestep later is classified as separate.
This is because, as in the original code, each event is identified cell by cell.
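
The labelling described above can be sketched in a few lines of numpy. The helper `label_events` and the three example cells are made up for illustration and do not come from xmhw; they only show how a time-aligned array labelled by start index leads to a shared *event* coordinate.

```python
import numpy as np

def label_events(n_time, events):
    """Hypothetical helper: NaN everywhere, start index inside each (start, end) event."""
    out = np.full(n_time, np.nan)
    for start, end in events:
        out[start:end + 1] = start
    return out

cell_a = label_events(12, [(3, 7)])  # nan, nan, nan, 3, 3, 3, 3, 3, nan ...
cell_b = label_events(12, [(3, 9)])  # same start as cell_a, longer duration
cell_c = label_events(12, [(4, 8)])  # starts one timestep later

cells = np.stack([cell_a, cell_b, cell_c])
# the distinct start indices across all cells play the role of the 'event' coordinate
event_coord = np.unique(cells[~np.isnan(cells)])
print(cell_a)
print(event_coord)
```

Here the events in `cell_a` and `cell_b` share one entry along *event* even though their durations differ, while the event in `cell_c` gets its own entry.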

In [None]:
mhw 

In [None]:
mhw.events

In [None]:
mhw.intensity_cumulative

In [None]:
# save mhw to yearly netcdf files (to split size if you have a really long timeseries)
#years, datasets = zip(*mhw.groupby("time.year"))
#paths = ["mhw_%s.nc" % y for y in years]
#xr.save_mfdataset(datasets, paths)
# you can use this if only doing a subset

mhw.to_netcdf('mhw.nc')

**Find MHW using original code**<br><br>

In [None]:
#%%time
from datetime import date
from marineHeatWaves import detect as orig_detect

# create the time array the original code needs (ordinal day numbers)
t = np.arange(date(2003,1,1).toordinal(),date(2004,12,31).toordinal()+1)
# the original function takes a 1D numpy array: use the first grid cell
sst_cell = tos[:,0,0].squeeze().values
# call function with default settings
orig_mhw, orig_clim = orig_detect(t, sst_cell)
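
The original code takes time as an array of ordinal day numbers, with one entry per temperature value. A quick standalone check of that array, using only the standard library and numpy (the variable names `start` and `end` are just for this sketch), can catch an off-by-one before calling the original function:

```python
from datetime import date
import numpy as np

start, end = date(2003, 1, 1), date(2004, 12, 31)
# one ordinal per day, both end points included
t = np.arange(start.toordinal(), end.toordinal() + 1)

print(t.size)                            # 365 days in 2003 + 366 days in 2004
print(date.fromordinal(int(t[0])), date.fromordinal(int(t[-1])))
```

The length of `t` should match the length of the 1D temperature array extracted from `tos`.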