This notebook demonstrates how to split data into train and test sets and run DNN trainings in parallel.

The example dataset `./example1_data.zarr/` can be generated using this [Jupyter Notebook](https://vegewaterdynamics.github.io/motrainer/notebooks/example_daskml/).

## Import libraries

In [15]:
import xarray as xr
import motrainer
import dask_ml.model_selection as dcv
from motrainer.jackknife import JackknifeGPI

## Read data and split to train and test datasets

In [16]:
# Read the data
zarr_file_path = "./example1_data.zarr"
ds = xr.open_zarr(zarr_file_path)

In [17]:
def to_dataframe(ds):
    return ds.to_dask_dataframe()

def chunk(ds, chunks):
    return ds.chunk(chunks)
    
bags = motrainer.dataset_split(ds, "space")
bags = bags.map(chunk, {"space": 100}).map(to_dataframe)

test_size = 0.33
f_shuffle = True
train_test_bags = bags.map(
    dcv.train_test_split, test_size=test_size, shuffle=f_shuffle, random_state=1
)  
train_bags = train_test_bags.pluck(0)
test_bags = train_test_bags.pluck(1)
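
To make the per-grid-point split concrete, here is a minimal standard-library sketch of the same idea: every partition is shuffled with a fixed seed and cut into a (train, test) pair, and the two halves are then picked out by position, as `pluck(0)` and `pluck(1)` do above. The helper `split_partition` and the toy `partitions` list are hypothetical names used only for illustration; they are not part of `motrainer` or `dask_ml`, and `dcv.train_test_split` may realise the requested proportion only approximately on Dask dataframes.

```python
import random


def split_partition(rows, test_size=0.33, shuffle=True, random_state=1):
    """Split one partition (a sequence of rows) into a (train, test) pair."""
    rows = list(rows)
    if shuffle:
        # A seeded generator makes the shuffle reproducible
        random.Random(random_state).shuffle(rows)
    n_test = round(len(rows) * test_size)
    # Hold out the last n_test rows; an explicit cut index also works for n_test == 0
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


# Toy stand-in for the bag: three grid points with 100 time steps each
partitions = [[(gpi, t) for t in range(100)] for gpi in range(3)]

pairs = [split_partition(p) for p in partitions]
train_parts = [pair[0] for pair in pairs]  # same role as pluck(0)
test_parts = [pair[1] for pair in pairs]   # same role as pluck(1)

print([len(p) for p in train_parts], [len(p) for p in test_parts])
```

Because each partition is split on its own, rows from different grid points never mix between the train and test sets.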

## Define training parameters

In [18]:
# JackKnife parameters
JackKnife = {
    'val_split_year': 2017,
    'output_list': ['sig', 'slop', 'curv'],
    'input_list': ['TG1', 'TG2', 'TG3', 'WG1', 'WG2', 'WG3', 'BIOMA1', 'BIOMA2'],
    'out_path': './dnn_examples/results'
}

# Training parameters
searching_space = {
    'num_dense_layers': [1, 10],
    'num_input_nodes': [1, 6],
    'num_dense_nodes': [1, 128],
    'learning_rate': [5e-4, 1e-2],
    'activation': ['relu']
}

# Here, I reduce parameters to be able to run on my own machine
optimize_space = {
    'best_loss': 2, # 1
    'n_calls': 11, # 15
    'epochs': 5, # 300
    'noise': 0.1, 
    'kappa': 5,
    'validation_split': 0.2,
    'x0': [1e-3, 1, 4, 13, 'relu', 64]
}  # For weighting the loss: 'loss_weights': [1, 1, 0.5],

## Run the training

In this example, we demonstrate how to run the training in parallel, one task per grid point (partition), on a Dask cluster.

In [19]:
# a function for training
def training_func(gpi_num, df, JackKnife, searching_space, optimize_space):
    
    # remove NA data
    gpi_data = df.compute()
    gpi_data.dropna(inplace=True)

    # add time to index
    gpi_data.set_index("time", inplace=True, drop=True)

    gpi = JackknifeGPI(gpi_data,
                       JackKnife['val_split_year'],
                       JackKnife['input_list'],
                       JackKnife['output_list'],
                       outpath=f"{JackKnife['out_path']}/gpi{gpi_num+1}")

    gpi.train(searching_space=searching_space,
              optimize_space=optimize_space,
              normalize_method='standard',
              training_method='dnn',
              performance_method='rmse',
              verbose=2)

    gpi.export_best()

    return gpi.apr_perf, gpi.post_perf

By default, Dask uses a local threaded scheduler to parallelize the tasks. Other types of clusters can be set up if the training job runs on different infrastructure. The choice of cluster does not change the syntax of the data split or the training jobs. For more information on the available Dask clusters, see the [Dask Documentation](https://docs.dask.org/en/stable/deploying.html).

In [20]:
from dask.distributed import Client

client = Client()

In [21]:
from dask.distributed import wait

In [23]:
# Use client to parallelize the loop across workers
futures = [
    client.submit(training_func, gpi_num, df, JackKnife, searching_space, optimize_space)
    for gpi_num, df in enumerate(train_bags)
]

# Wait for all computations to finish
wait(futures)

# Get the results
results = client.gather(futures)
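
`client.submit` returns a future immediately, `wait` blocks until all futures are done, and `client.gather` collects the results in the order of the submitted futures. The same submit, wait, collect pattern can be tried without a cluster using `concurrent.futures` from the Python standard library, whose interface the Dask futures API extends. The sketch below is only an analogy: `fake_training` is a hypothetical stand-in for `training_func`, `toy_bags` is made-up data, and the returned numbers are placeholders, not model performance.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fake_training(gpi_num, rows):
    """Hypothetical stand-in for training_func: returns the index and a placeholder score."""
    mean = sum(rows) / len(rows)
    return gpi_num, mean


# Toy stand-in for train_bags: one list of numbers per grid point
toy_bags = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

with ThreadPoolExecutor(max_workers=2) as executor:
    # Submit one task per grid point; each call returns a future at once
    futures = [
        executor.submit(fake_training, gpi_num, rows)
        for gpi_num, rows in enumerate(toy_bags)
    ]
    # Block until every task has finished
    wait(futures)
    # Collect results in submission order, like client.gather(futures)
    results = [future.result() for future in futures]

print(results)
```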

In [32]:
# Close the Dask client
client.close()

In [31]:
# print the results
for gpi_num, performance in enumerate(results):
    print(f"GPI {(gpi_num + 1)}")
    print(" aprior performance(RMSE):")
    print(performance[0])
    print("post performance(RMSE):")
    print(performance[1])
    print("=========================================")

GPI 1
 aprior performance(RMSE):
[[0.26438]
 [0.03295]
 [0.14362]]
post performance(RMSE):
[[0.34483]
 [0.00631]
 [0.00686]]
GPI 2
 aprior performance(RMSE):
[[0.37801]
 [0.02598]
 [0.26245]]
post performance(RMSE):
[[0.70249]
 [0.22075]
 [0.24423]]
GPI 3
 aprior performance(RMSE):
[[0.31875]
 [0.24323]
 [0.05353]]
post performance(RMSE):
[[0.03498]
 [0.19958]
 [0.24324]]
GPI 4
 aprior performance(RMSE):
[[0.19431]
 [0.10026]
 [0.16398]]
post performance(RMSE):
[[0.20526]
 [0.02813]
 [0.21003]]
GPI 5
 aprior performance(RMSE):
[[0.23724]
 [0.1104 ]
 [0.28052]]
post performance(RMSE):
[[0.10751]
 [0.08874]
 [0.26091]]
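
Each printed array holds three RMSE values, presumably one per output variable. As a reminder of what is being reported, the sketch below computes a per-output RMSE in plain Python. The numbers are made up for illustration, `rmse_per_output` is a hypothetical helper rather than a `motrainer` function, and the assumption that the rows follow the order of `output_list` (`sig`, `slop`, `curv`) is not confirmed by the output above.

```python
import math


def rmse_per_output(y_true, y_pred):
    """Return one RMSE per output column for row-wise samples."""
    n_outputs = len(y_true[0])
    scores = []
    for j in range(n_outputs):
        squared_errors = [(t[j] - p[j]) ** 2 for t, p in zip(y_true, y_pred)]
        scores.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return scores


# Made-up observations and predictions: two samples, three outputs
y_true = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
y_pred = [[1.0, 1.0, 2.0], [1.0, -1.0, 2.0]]

print(rmse_per_output(y_true, y_pred))
```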


To free up resources, shut down the client by clicking SHUTDOWN in the Dask JupyterLab extension.

## Inspect best model file

In [13]:
import h5py
import tensorflow as tf

In [14]:
best_model = "./dnn_examples/results/gpi1/best_optimized_model_2015.h5"
model = tf.keras.models.load_model(best_model)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 5)                 45        
                                                                 
 layer_dense_1 (Dense)       (None, 39)                234       
                                                                 
 layer_dense_2 (Dense)       (None, 39)                1560      
                                                                 
 layer_dense_3 (Dense)       (None, 39)                1560      
                                                                 
 layer_dense_4 (Dense)       (None, 39)                1560      
                                                                 
 layer_dense_5 (Dense)       (None, 39)                1560      
                                                                 
 dense_1 (Dense)             (None, 3)                 1

In [16]:
# Add more info to the model file e.g. the path to the data
with h5py.File(best_model, 'a') as f:
    f.attrs['input_file_path'] = "./example1_data.zarr"

In [17]:
# Inspect the hyperparameters and input_list
import ast

with h5py.File(best_model, 'r') as f:
    hyperparameters = f.attrs['hyperparameters']
    input_list = f.attrs['input_list']
    input_file_path = f.attrs['input_file_path']

# literal_eval parses the stored string without executing arbitrary code
print(ast.literal_eval(hyperparameters))

[(2.6945188437821344e-05, [0.0102933544822095, 2, 6, 6, 'relu', 220]), (0.0009668049169704318, [0.014053549021147586, 1, 5, 5, 'relu', 213]), (0.0011984826996922493, [0.01, 2, 5, 5, 'relu', 64]), (0.002951781963929534, [0.01034267518524073, 1, 6, 5, 'relu', 325]), (0.005780111066997051, [0.01, 1, 5, 6, 'relu', 189]), (0.0062195612117648125, [0.01238696509850767, 1, 5, 6, 'relu', 89]), (0.007209620904177427, [0.012213029952058047, 1, 6, 5, 'relu', 41]), (0.008430225774645805, [0.016252143020123296, 1, 5, 6, 'relu', 153]), (0.01519404910504818, [0.012097401107848887, 1, 5, 6, 'relu', 161]), (0.03946790099143982, [0.013412574345977837, 2, 5, 5, 'relu', 242]), (0.04046712443232536, [0.010949417026328035, 2, 5, 5, 'relu', 276]), (0.09059736132621765, [0.013515310361619494, 1, 6, 6, 'relu', 147]), (0.11357539147138596, [0.013742318889904532, 2, 5, 6, 'relu', 46]), (0.11388207972049713, [0.02, 1, 5, 6, 'relu', 123]), (0.7128570675849915, [0.012739986706827862, 1, 6, 6, 'relu', 302])]
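
The stored attribute is a string, so it has to be parsed before the individual trials can be used. A minimal sketch with `ast.literal_eval`, which accepts only Python literals and therefore does not execute arbitrary code read from a file. The sample string copies the first three entries printed above; reading each entry as a `(loss, hyperparameter values)` pair is an assumption based on that output, not something the file documents.

```python
import ast

# First three entries of the printed attribute, kept as a string
hyperparameters = (
    "[(2.6945188437821344e-05, [0.0102933544822095, 2, 6, 6, 'relu', 220]), "
    "(0.0009668049169704318, [0.014053549021147586, 1, 5, 5, 'relu', 213]), "
    "(0.0011984826996922493, [0.01, 2, 5, 5, 'relu', 64])]"
)

entries = ast.literal_eval(hyperparameters)

# Assumption: the first element of each tuple is the loss of that trial
best_loss, best_values = min(entries, key=lambda entry: entry[0])
print(best_loss)
print(best_values)
```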


In [18]:
print(input_list)

['TG1' 'TG2' 'TG3' 'WG1' 'WG2' 'WG3' 'BIOMA1' 'BIOMA2']


In [19]:
print(input_file_path)

./example1_data.zarr
