# Examples of applying filters and categorising tracks

Import essential libraries

In [1]:
from pathlib import Path

from octant.core import TrackRun

Define the common data directory

In [2]:
sample_dir = Path(".") / "sample_data"

Data are usually organised in a hierarchical directory structure. Here, the relevant parameters are defined.

In [3]:
dataset = "era5"
period = "test"
run_id = 0

Construct the full path

In [4]:
track_res_dir = sample_dir / dataset / f"run{run_id:03d}" / period
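The path construction can be checked in isolation. The sketch below repeats the same steps with `PurePosixPath` (swapped in for `Path` only so that the printed separators are the same on every platform) and shows that the `:03d` format specifier zero-pads the run number to three digits. The `run_label` variable is introduced here purely for illustration.

```python
from pathlib import PurePosixPath

# Same pieces as in the cells above
sample_dir = PurePosixPath(".") / "sample_data"
dataset = "era5"
period = "test"
run_id = 0

# `:03d` zero-pads the integer to three digits: 0 -> "000", 12 -> "012"
run_label = f"run{run_id:03d}"
track_res_dir = sample_dir / dataset / run_label / period

print(run_label)
print(track_res_dir)
```

With `run_id = 0` this gives the `run000` directory that appears in the "Sources" row of the `TrackRun` summary.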

Now load the cyclone tracks themselves

In [5]:
tr = TrackRun(track_res_dir)
tr

Cyclone tracking results
Number of tracks,671
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test


## Classify the tracks

Now, to label each of the tracks within `tr` according to a set of filters or criteria, use the `classify()` method.

It has an alias: `categorise()`.

Below are two examples: a simple one and a more advanced one that uses a function with multiple arguments.

### Simple functions as filters

As its argument, `classify()` takes a list of tuples in the form of
```
[
(<labelA>, [<func1>, <func2>, ..., <funcN>]),
(<labelB>, [<func1>, <func2>, ..., <funcN>]),
...
(<labelZ>, [<func1>, <func2>, ..., <funcN>]),
],
```

where `labelA` is assigned to a track if the track satisfies **all** the conditions given by `[<func1>, <func2>, ..., <funcN>]`, which is a list of 1 or more functions.
These functions expect 1 and only 1 argument - `OctantTrack`.
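To make the "all conditions must hold" rule concrete, here is a minimal pure-Python sketch. It is not octant's implementation: `ToyTrack` is a hypothetical stand-in for `OctantTrack` with precomputed properties, and `label_tracks()` is an illustrative helper that applies each list of functions with `all()`.

```python
from dataclasses import dataclass


@dataclass
class ToyTrack:
    """Hypothetical stand-in for OctantTrack with precomputed properties."""

    lifetime_h: float
    gen_lys_dist_km: float
    max_vort: float


def label_tracks(tracks, conditions):
    """Map each label to the indices of tracks passing ALL of its functions."""
    result = {}
    for label, funcs in conditions:
        result[label] = [
            idx for idx, track in enumerate(tracks) if all(f(track) for f in funcs)
        ]
    return result


# Three made-up tracks: short-lived, moderately long-lived, long-lived and strong
toy_tracks = [
    ToyTrack(lifetime_h=3, gen_lys_dist_km=50.0, max_vort=2e-4),
    ToyTrack(lifetime_h=12, gen_lys_dist_km=150.0, max_vort=4e-4),
    ToyTrack(lifetime_h=48, gen_lys_dist_km=900.0, max_vort=2e-3),
]
toy_conditions = [
    ("long_lived", [lambda ot: ot.lifetime_h >= 6]),
    (
        "far_travelled_and_very_long_lived",
        [lambda ot: ot.lifetime_h >= 36, lambda ot: ot.gen_lys_dist_km > 300.0],
    ),
    ("strong", [lambda ot: ot.max_vort > 1e-3]),
]
labelled = label_tracks(toy_tracks, toy_conditions)
print(labelled)
```

Each function receives exactly one track, and a label is assigned only when every function in its list returns a truthy value.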

For example, it is possible to classify tracks by their lifetime, maximum vorticity, and distance travelled:

In [6]:
conditions = [
    ("long_lived", [lambda ot: ot.lifetime_h >= 6]),
    (
        "far_travelled_and_very_long_lived",
        [lambda ot: ot.lifetime_h >= 36, lambda ot: ot.gen_lys_dist_km > 300.0],
    ),
    ("strong", [lambda x: x.max_vort > 1e-3]),
]

In [7]:
tr.classify(conditions)

In [8]:
tr.is_categorised, tr.is_cat_inclusive

(True, False)

In [9]:
tr

Cyclone tracking results
Categories,,671,in total
Categories,of which,247,long_lived
Categories,of which,18,far_travelled_and_very_long_lived
Categories,of which,5,strong
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test


**NB** By default, the categories are **NOT** "inclusive", so all categories are independent.

In this case, every label is evaluated against all 671 tracks on its own: 247 tracks are "long_lived", 18 are "far_travelled_and_very_long_lived", and 5 are "strong". A single track can carry more than one label.

With inclusive categorisation the numbers change: each category is evaluated only within the preceding one, so the "long_lived" subset includes the tracks that are "far_travelled_and_very_long_lived", and both include the "strong" subset.

In [10]:
tr.classify(conditions, inclusive=True)

In [11]:
tr.is_categorised, tr.is_cat_inclusive

(True, True)

In [12]:
tr

Cyclone tracking results
Categories,,671,in total
Categories,of which,247,long_lived
Categories,of which,18,far_travelled_and_very_long_lived|long_lived
Categories,of which,3,strong|far_travelled_and_very_long_lived|long_lived
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test


Note that it automatically appends `|<source_category>` to the category labels.
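The effect of inclusive categorisation can be sketched in plain Python. This is an illustration under stated assumptions, not octant's code: tracks are represented by hypothetical dictionaries, `label_inclusive()` is a made-up helper, and the `new|previous` label format simply mirrors the output shown above.

```python
def label_inclusive(tracks, conditions):
    """Evaluate each condition only on the members of the previous category."""
    result = {}
    members = list(range(len(tracks)))  # start from all tracks
    prev_label = None
    for label, funcs in conditions:
        # Keep only the current members that pass all functions for this label
        members = [i for i in members if all(f(tracks[i]) for f in funcs)]
        full_label = label if prev_label is None else f"{label}|{prev_label}"
        result[full_label] = members
        prev_label = full_label
    return result


# Made-up tracks; the first one is strong but too short-lived
toy_tracks = [
    {"lifetime_h": 3, "dist_km": 50.0, "max_vort": 2e-3},
    {"lifetime_h": 12, "dist_km": 100.0, "max_vort": 4e-4},
    {"lifetime_h": 48, "dist_km": 900.0, "max_vort": 2e-3},
]
toy_conditions = [
    ("long_lived", [lambda t: t["lifetime_h"] >= 6]),
    (
        "far_travelled",
        [lambda t: t["lifetime_h"] >= 36, lambda t: t["dist_km"] > 300.0],
    ),
    ("strong", [lambda t: t["max_vort"] > 1e-3]),
]
inclusive = label_inclusive(toy_tracks, toy_conditions)
print(inclusive)
```

In this toy data the first track exceeds the vorticity threshold but is dropped at the "long_lived" step, so it never reaches the "strong" test. The same mechanism explains why the number of "strong" tracks can be smaller in inclusive mode.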

### Select one or multiple categories

Tracks within a category can be selected as follows.

In [13]:
tr.classify(conditions)

In [14]:
tr

Cyclone tracking results
Categories,,671,in total
Categories,of which,247,long_lived
Categories,of which,18,far_travelled_and_very_long_lived
Categories,of which,5,strong
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test


* one category

In [15]:
tr["strong"]

track_idx,row_idx,lon,lat,vo,time,area,vortex_type
93,0,1.5,75.0,0.000418,2011-01-04 02:00:00,55228.76953,1
93,1,3.6,75.3,0.000403,2011-01-04 03:00:00,59789.96875,1
93,2,6.6,75.9,0.000403,2011-01-04 04:00:00,58341.80859,1
93,3,7.5,75.9,0.000421,2011-01-04 05:00:00,64806.83984,3
93,4,4.8,74.7,0.000431,2011-01-04 06:00:00,60390.07812,1
93,5,5.7,74.7,0.000449,2011-01-04 07:00:00,56248.26172,1
93,6,4.2,74.1,0.000444,2011-01-04 08:00:00,54530.76562,1
93,7,3.9,73.8,0.000436,2011-01-04 09:00:00,51079.36719,1
93,8,13.5,75.9,0.000470,2011-01-04 10:00:00,63191.57031,1
93,9,14.7,76.5,0.000484,2011-01-04 11:00:00,64762.19922,1


* several categories (AND operator)

In [16]:
tr[["strong", "long_lived"]]

track_idx,row_idx,lon,lat,vo,time,area,vortex_type
93,0,1.5,75.0,0.000418,2011-01-04 02:00:00,55228.76953,1
93,1,3.6,75.3,0.000403,2011-01-04 03:00:00,59789.96875,1
93,2,6.6,75.9,0.000403,2011-01-04 04:00:00,58341.80859,1
93,3,7.5,75.9,0.000421,2011-01-04 05:00:00,64806.83984,3
93,4,4.8,74.7,0.000431,2011-01-04 06:00:00,60390.07812,1
93,5,5.7,74.7,0.000449,2011-01-04 07:00:00,56248.26172,1
93,6,4.2,74.1,0.000444,2011-01-04 08:00:00,54530.76562,1
93,7,3.9,73.8,0.000436,2011-01-04 09:00:00,51079.36719,1
93,8,13.5,75.9,0.000470,2011-01-04 10:00:00,63191.57031,1
93,9,14.7,76.5,0.000484,2011-01-04 11:00:00,64762.19922,1


In the same fashion, the size of each subset can be checked.

In [17]:
tr.size("strong")

5

In [18]:
tr.size(["long_lived", "strong"])

4
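The AND semantics of a multi-label selection can be illustrated with a toy example. The `flags` dictionary and the `subset_size()` helper below are hypothetical and only mimic the idea of per-track True/False flags; they are not part of octant.

```python
# Hypothetical flags: one {label: bool} dictionary per track index
flags = {
    0: {"long_lived": False, "strong": True},
    1: {"long_lived": True, "strong": False},
    2: {"long_lived": True, "strong": True},
}


def subset_size(flags, labels):
    """Count the tracks for which ALL of the given labels are True."""
    if isinstance(labels, str):
        labels = [labels]  # accept a single label as well as a list
    return sum(
        all(track_flags[label] for label in labels) for track_flags in flags.values()
    )


print(subset_size(flags, "strong"))
print(subset_size(flags, ["long_lived", "strong"]))
```

A track is counted for a list of labels only if it carries every one of them, which is why the combined count cannot exceed the count for any single label.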

A group-by operation can also be used to iterate over tracks within a subset.

In [19]:
tr["strong"].gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f347c2916d8>

### Where is the categorisation data stored?

After categorisation is applied to a `TrackRun`, its `cats` attribute holds a `pandas.DataFrame` of True/False flags, with one row per track and one column per category.

In [20]:
tr.cats.head(10)

track_idx,long_lived,far_travelled_and_very_long_lived,strong
0,False,False,False
1,True,False,False
2,False,False,False
3,False,False,False
4,True,False,False
5,False,False,False
6,True,False,False
7,True,False,False
8,False,False,False
9,False,False,False


**Note** By default, `classify()` clears previous categories. To preserve them, use the `clear=False` keyword.

A shortcut to view the available categories is

In [21]:
tr.cat_labels

['long_lived', 'far_travelled_and_very_long_lived', 'strong']

### More complex functions as filters

It is possible to categorise tracks by their proximity to the coast (land) or other masked points in an array with geographical coordinates.
For convenience, the `octant.misc` module contains the `check_by_mask()` function, which checks whether a cyclone track stays close to land points or domain boundaries for a long enough time. This function is essentially a wrapper around the `octant.utils.mask_tracks()` function.

In [22]:
import xarray as xr

from octant.misc import check_by_mask

First, reload the `TrackRun` just in case.

In [23]:
tr = TrackRun(track_res_dir)

Load the land-sea mask array from the ERA5 dataset:

In [24]:
lsm = xr.open_dataarray(sample_dir / dataset / "lsm.nc")
lsm = lsm.squeeze()  # remove the singleton time dimension

Importantly, the `classify()` method expects functions that only take 1 argument of type `OctantTrack`, so to use the function above, we need to construct a partial function using `functools` from the standard library.
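As a minimal illustration of the idea, the sketch below uses `functools.partial` to turn a toy multi-argument filter into a single-argument callable. The function `stays_away_from()`, its parameters, and the list-of-distances "track" are all hypothetical and unrelated to the actual signature of `check_by_mask()`.

```python
from functools import partial


def stays_away_from(track, threshold_km, min_fraction=0.5):
    """Toy filter: True if enough points are farther away than threshold_km."""
    far = [dist > threshold_km for dist in track]
    return sum(far) / len(far) >= min_fraction


# Fix the extra arguments now; the resulting callable takes only the track
far_from_coast = partial(stays_away_from, threshold_km=75.0)
strictly_far = partial(stays_away_from, threshold_km=75.0, min_fraction=0.9)

toy_track = [20.0, 90.0, 120.0, 200.0]  # hypothetical distances to the coast, km
print(far_from_coast(toy_track))
print(strictly_far(toy_track))
```

Because the remaining positional argument is the track itself, a callable built this way fits the one-argument signature required for a filter function.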

In [25]:
from functools import partial

In [26]:
land_mask_fun = partial(check_by_mask, trackrun=tr, lsm=lsm, rad=75.)  # and leave other parameters default

This new function has been supplied with all the additional arguments and now takes only an `OctantTrack`, which is exactly what `classify()` needs.
It is then passed as a second filtering function to the list of conditions:

In [27]:
new_conditions = [
    ("good_candidates", [lambda ot: ot.lifetime_h >= 6, land_mask_fun]),
    (
        "pmc",
        [
            lambda ot: ((ot.vortex_type != 0).sum() / ot.shape[0] < 0.2)
            and (ot.gen_lys_dist_km > 300.0)
        ],
    ),
]

In [28]:
%%time
tr.classify(new_conditions, inclusive=True)

CPU times: user 21.5 s, sys: 16 ms, total: 21.5 s
Wall time: 21.5 s


In [29]:
tr

Cyclone tracking results
Categories,,671,in total
Categories,of which,101,good_candidates
Categories,of which,36,pmc|good_candidates
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test


## Categorise by percentile

`TrackRun` also has a method to select a subset of tracks depending on a statistic.

For example, to select the tracks of a subset whose maximum vorticity is in the top 20% (i.e. greater than the 80th percentile):

In [30]:
tr.categorise_by_percentile(subset="pmc|good_candidates", perc=80, by="max_vort", oper="gt")

In [31]:
tr

Cyclone tracking results
Categories,,671,in total
Categories,of which,101,good_candidates
Categories,of which,36,pmc|good_candidates
Categories,of which,7,max_vort__gt__80pc|pmc|good_candidates
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test


... or to find the weakest 5% of "good_candidates" (at or below the 5th percentile):

In [32]:
tr.categorise_by_percentile(subset="good_candidates", perc=5, by="max_vort", oper="le")

In [33]:
tr

Cyclone tracking results
Categories,,671,in total
Categories,of which,101,good_candidates
Categories,of which,36,pmc|good_candidates
Categories,of which,7,max_vort__gt__80pc|pmc|good_candidates
Categories,of which,6,max_vort__le__5pc|good_candidates
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test
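The percentile-based selection can be sketched with the standard library alone. This is an assumption-laden illustration, not the way octant computes it: `select_by_percentile()` is a hypothetical helper, the percentile is obtained with `statistics.quantiles` (linear interpolation, `method="inclusive"`), and the comparison string is mapped to a function from the `operator` module. The label format only imitates the one shown in the output above.

```python
import operator
from statistics import quantiles


def select_by_percentile(values, perc, oper="gt", by="max_vort"):
    """Return (label, threshold, indices) for values compared to a percentile."""
    # quantiles(..., n=100) returns the 1st..99th percentiles, so `perc`
    # must be an integer between 1 and 99
    threshold = quantiles(values.values(), n=100, method="inclusive")[perc - 1]
    compare = getattr(operator, oper)  # e.g. operator.gt or operator.le
    selected = [idx for idx, val in values.items() if compare(val, threshold)]
    label = f"{by}__{oper}__{perc}pc"
    return label, threshold, selected


# Made-up statistic for ten tracks: track index -> value (1.0, 2.0, ..., 10.0)
toy_stat = {idx: float(idx + 1) for idx in range(10)}

top = select_by_percentile(toy_stat, perc=80, oper="gt")
bottom = select_by_percentile(toy_stat, perc=5, oper="le")
print(top)
print(bottom)
```

For the ten toy values the 80th percentile lies between the eighth and ninth sorted values, so only the two largest values pass the "gt" test, while only the smallest one passes the "le" test at the 5th percentile.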


## Clear categories

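Categories can be removed one by one or all together. The `inclusive` keyword of `clear_categories()` can also be used to override the inclusivity setting. Here, with `inclusive=False`, only the "good_candidates" category itself is removed, while the categories derived from it remain.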

In [34]:
tr.clear_categories(subset="good_candidates", inclusive=False)

In [35]:
tr

Cyclone tracking results
Categories,,671,in total
Categories,of which,36,pmc|good_candidates
Categories,of which,7,max_vort__gt__80pc|pmc|good_candidates
Categories,of which,6,max_vort__le__5pc|good_candidates
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test


In [36]:
tr.clear_categories()

In [37]:
tr

Cyclone tracking results
Number of tracks,671
Data columns,"lon, lat, vo, time, area, vortex_type"
Sources,sample_data/era5/run000/test
