An example notebook that splits data into train and test sets and uses them for DNN training.

## Import libraries

In [15]:
import xarray as xr
import motrainer
import dask_ml.model_selection as dcv
from motrainer.jackknife import JackknifeGPI

This notebook uses data saved from the notebook [1_example_one_pickle_file_with_nested_fields](./1_example_one_pickle_file_with_nested_fields.ipynb).

## Read data and split to train and test datasets

In [16]:
# Read the data
zarr_file_path = "./example1_data.zarr"
ds = xr.open_zarr(zarr_file_path)

In [17]:
def to_dataframe(ds):
    return ds.to_dask_dataframe()

def chunk(ds, chunks):
    return ds.chunk(chunks)
    
bags = motrainer.dataset_split(ds, "space")
bags = bags.map(chunk, {"space": 100}).map(to_dataframe)

test_size = 0.33
f_shuffle = True
train_test_bags = bags.map(
    dcv.train_test_split, test_size=test_size, shuffle=f_shuffle, random_state=1
)  
train_bags = train_test_bags.pluck(0)
test_bags = train_test_bags.pluck(1)
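
To make the split concrete, the sketch below mimics a shuffled 33% hold-out on a single partition using only the Python standard library. It illustrates the idea and is not the `dask_ml` implementation, which works on Dask collections and may sample rows differently. The helper `split_partition` and the toy records are hypothetical.

```python
import random


def split_partition(records, test_size=0.33, shuffle=True, random_state=1):
    """Split one partition's records into (train, test) lists.

    Hypothetical helper for illustration only; it is not part of
    motrainer or dask_ml.
    """
    rows = list(records)
    if shuffle:
        # A seeded generator makes the split reproducible.
        random.Random(random_state).shuffle(rows)
    n_test = round(len(rows) * test_size)
    # The first n_test shuffled rows form the test set, the rest the train set.
    return rows[n_test:], rows[:n_test]


# Toy stand-in for the rows of one "space" partition.
partition = [{"time": t, "sig": 0.1 * t} for t in range(100)]
train, test = split_partition(partition)
print(len(train), len(test))
```

Fixing `random_state` means that rerunning the split yields the same train and test rows, which keeps later comparisons between runs meaningful.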

## Define training parameters

In [18]:
# JackKnife parameters
JackKnife = {
    'val_split_year': 2017,
    'output_list': ['sig', 'slop', 'curv'],
    'input_list': ['TG1', 'TG2', 'TG3', 'WG1', 'WG2', 'WG3', 'BIOMA1', 'BIOMA2'],
    'out_path': './dnn_examples/results'
}

# Training parameters
searching_space = {
    'num_dense_layers': [1, 10],
    'num_input_nodes': [1, 6],
    'num_dense_nodes': [1, 128],
    'learning_rate': [5e-4, 1e-2],
    'activation': ['relu']
}

# Here, I reduce parameters to be able to run on my own machine
optimize_space = {
    'best_loss': 2, # 1
    'n_calls': 11, # 15
    'epochs': 5, # 300
    'noise': 0.1, 
    'kappa': 5,
    'validation_split': 0.2,
    'x0': [1e-3, 1, 4, 13, 'relu', 64]
}  # For weighting the loss, add: 'loss_weights': [1, 1, 0.5]
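
As a quick sanity check, the sketch below verifies that the starting point `x0` lies inside the bounds of `searching_space`. It assumes, without confirmation from this notebook, that `x0` is ordered as learning rate, number of dense layers, number of input nodes, number of dense nodes, activation, and batch size. `X0_ORDER` and `check_start_point` are hypothetical names introduced here.

```python
searching_space = {
    'num_dense_layers': [1, 10],
    'num_input_nodes': [1, 6],
    'num_dense_nodes': [1, 128],
    'learning_rate': [5e-4, 1e-2],
    'activation': ['relu'],
}
x0 = [1e-3, 1, 4, 13, 'relu', 64]

# ASSUMPTION: the meaning of each position in x0 (not stated in this notebook).
X0_ORDER = ['learning_rate', 'num_dense_layers', 'num_input_nodes',
            'num_dense_nodes', 'activation', 'batch_size']


def check_start_point(x0, space, order):
    """Return a list of messages for x0 entries that fall outside the space."""
    problems = []
    for name, value in zip(order, x0):
        if name not in space:
            continue  # no bounds given for this entry (e.g. batch_size)
        bounds = space[name]
        if isinstance(value, str):
            ok = value in bounds  # categorical: must be one of the choices
        else:
            ok = bounds[0] <= value <= bounds[1]  # numeric: [low, high]
        if not ok:
            problems.append(f"{name}={value!r} is outside {bounds}")
    return problems


print(check_start_point(x0, searching_space, X0_ORDER))
```

An empty list means that every bounded entry of `x0` sits inside its range under the assumed ordering.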

## Run the training

In this example, we demonstrate how to run the training in parallel, one task per grid point (partition), on a Dask cluster.

In [19]:
# a function for training
def training_func(gpi_num, df, JackKnife, searching_space, optimize_space):
    
    # load the partition into memory, then remove NA data
    gpi_data = df.compute()
    gpi_data.dropna(inplace=True)

    # add time to index
    gpi_data.set_index("time", inplace=True, drop=True)

    gpi = JackknifeGPI(gpi_data,
                       JackKnife['val_split_year'],
                       JackKnife['input_list'],
                       JackKnife['output_list'],
                       outpath=f"{JackKnife['out_path']}/gpi{gpi_num+1}")

    gpi.train(searching_space=searching_space,
              optimize_space=optimize_space,
              normalize_method='standard',
              training_method='dnn',
              performance_method='rmse',
              verbose=2)

    gpi.export_best()

    return gpi.apr_perf, gpi.post_perf
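
`performance_method='rmse'` selects the root-mean-square error as the score. As a reminder of what that quantity measures, here is a minimal standard-library sketch. It is not motrainer's code, and the function name `rmse` is chosen here for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Identical sequences give zero error; larger deviations give larger values.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Lower values indicate predictions closer to the observations, and the score is expressed in the units of the data it is computed on.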

To add a Dask cluster to this notebook, you can use the Dask JupyterLab extension (look for the Dask logo in the left sidebar of the JupyterLab interface):

Click the Dask logo, click the Scale button, and set the number of workers to the number of available cores. Then click `<>` to insert a code cell into this notebook, and move that cell below this text. Executing it creates a Dask SLURMCluster.

Example code for a Dask cluster client:

```python
from dask.distributed import Client

client = Client("tcp://127.0.0.1:39087")
client
```

In [20]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:39007")
client

Client:

| Connection method | Dashboard |
| --- | --- |
| Direct | http://127.0.0.1:8787/status |

Scheduler:

| Comm | Dashboard | Workers | Total threads | Total memory | Started |
| --- | --- | --- | --- | --- | --- |
| tcp://127.0.0.1:39007 | http://127.0.0.1:8787/status | 5 | 10 | 19.11 GiB | 1 minute ago |

Workers:

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory | Memory usage | Read bytes | Write bytes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tcp://127.0.0.1:45823 | http://127.0.0.1:43255/status | tcp://127.0.0.1:41415 | 2 | 3.82 GiB | `/tmp/dask-scratch-space/worker-y19tu4tj` | 139.88 MiB | 8.96 kiB | 8.82 kiB |
| tcp://127.0.0.1:44953 | http://127.0.0.1:33439/status | tcp://127.0.0.1:44863 | 2 | 3.82 GiB | `/tmp/dask-scratch-space/worker-8udn0mwa` | 137.86 MiB | 8.97 kiB | 8.82 kiB |
| tcp://127.0.0.1:39031 | http://127.0.0.1:44167/status | tcp://127.0.0.1:41687 | 2 | 3.82 GiB | `/tmp/dask-scratch-space/worker-cp9l2br3` | 136.60 MiB | 9.06 kiB | 8.91 kiB |
| tcp://127.0.0.1:41335 | http://127.0.0.1:41259/status | tcp://127.0.0.1:42087 | 2 | 3.82 GiB | `/tmp/dask-scratch-space/worker-u6n9g9_o` | 138.97 MiB | 8.97 kiB | 8.82 kiB |
| tcp://127.0.0.1:41043 | http://127.0.0.1:35649/status | tcp://127.0.0.1:46833 | 2 | 3.82 GiB | `/tmp/dask-scratch-space/worker-3e_hehyn` | 137.44 MiB | 9.05 kiB | 8.91 kiB |

For every worker: CPU usage: 2.0%, Last seen: Just now, Spilled bytes: 0 B.

In [21]:
from dask.distributed import wait

In [23]:
# Use client to parallelize the loop across workers
futures = [
    client.submit(training_func, gpi_num, df, JackKnife, searching_space, optimize_space)
    for gpi_num, df in enumerate(train_bags)
]

# Wait for all computations to finish
wait(futures)

# Get the results
results = client.gather(futures)
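
The submit, wait, and gather pattern in the cell above can be tried without a cluster. The sketch below uses the standard library's `concurrent.futures` as a stand-in for the Dask client, so it only mirrors the control flow and does not distribute work across machines. `fake_training` and the toy partitions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fake_training(gpi_num, rows):
    """Hypothetical stand-in for training_func: summarize one partition."""
    return gpi_num + 1, sum(rows) / len(rows)


# Toy partitions, one list of values per grid point.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

with ThreadPoolExecutor(max_workers=2) as pool:
    # One task per partition, analogous to client.submit(...)
    futures = [
        pool.submit(fake_training, gpi_num, rows)
        for gpi_num, rows in enumerate(partitions)
    ]
    # Block until every task has finished, analogous to wait(futures)
    wait(futures)
    # Collect results in submission order, analogous to client.gather(futures)
    results = [future.result() for future in futures]

print(results)
```

Because the results are collected in the order the futures were submitted, the position of each result still identifies its partition, which is what the later printing loop relies on.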

In [32]:
# Close the Dask client
client.close()

In [31]:
# print the results
for gpi_num, performance in enumerate(results):
    print(f"GPI {(gpi_num + 1)}")
    print("a priori performance (RMSE):")
    print(performance[0])
    print("post performance (RMSE):")
    print(performance[1])
    print("=========================================")

GPI 1
a priori performance (RMSE):
[[0.26438]
 [0.03295]
 [0.14362]]
post performance (RMSE):
[[0.34483]
 [0.00631]
 [0.00686]]
GPI 2
a priori performance (RMSE):
[[0.37801]
 [0.02598]
 [0.26245]]
post performance (RMSE):
[[0.70249]
 [0.22075]
 [0.24423]]
GPI 3
a priori performance (RMSE):
[[0.31875]
 [0.24323]
 [0.05353]]
post performance (RMSE):
[[0.03498]
 [0.19958]
 [0.24324]]
GPI 4
a priori performance (RMSE):
[[0.19431]
 [0.10026]
 [0.16398]]
post performance (RMSE):
[[0.20526]
 [0.02813]
 [0.21003]]
GPI 5
a priori performance (RMSE):
[[0.23724]
 [0.1104 ]
 [0.28052]]
post performance (RMSE):
[[0.10751]
 [0.08874]
 [0.26091]]
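
Each performance array has three rows, the same number as entries in `output_list`. The sketch below labels the GPI 1 values by output name, assuming (this is not confirmed by the notebook) that the rows follow the order of `output_list`. The helper `label_performance` is hypothetical, and plain nested lists stand in for the arrays.

```python
output_list = ['sig', 'slop', 'curv']

# GPI 1 values copied from the output above.
apr_perf = [[0.26438], [0.03295], [0.14362]]
post_perf = [[0.34483], [0.00631], [0.00686]]


def label_performance(perf, names):
    """Map each output name to its RMSE.

    ASSUMPTION: rows of perf are in the same order as names.
    """
    return {name: row[0] for name, row in zip(names, perf)}


apr = label_performance(apr_perf, output_list)
post = label_performance(post_perf, output_list)

for name in output_list:
    change = post[name] - apr[name]
    print(f"{name}: a priori {apr[name]:.5f}, post {post[name]:.5f}, change {change:+.5f}")
```

A negative change means the post RMSE is lower than the a priori RMSE for that output.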


To shut down the cluster and free up its resources, click SHUTDOWN in the Dask JupyterLab extension.