# Fed-BioMed Researcher Listing Datasets and Selecting Particular Nodes

Use the following cell during development: it automatically reloads changes made across packages.

In [None]:
%load_ext autoreload
%autoreload 2

In this tutorial, you will learn how to list the datasets deployed on nodes and select them to perform an experiment. To follow this example, you need to launch more than 2 nodes that have the MNIST dataset.

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the client up
Before going further, configure several nodes:
1. `./scripts/fedbiomed_run node config config-n1.ini add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm the default tags by typing "y" and pressing ENTER
  * Pick the folder where MNIST is downloaded (this is due to torch issue https://github.com/pytorch/vision/issues/3549)
  * The data should now be added (a warning saying that data must be unique means the dataset was already added)
  * Start the node with `./scripts/fedbiomed_run node config config-n1.ini start`
  
  
2. Add data to a second node:
    * Open a new terminal and create a new node, pointing it to the MNIST dataset that you already downloaded:
    `./scripts/fedbiomed_run node config config-n2.ini --add-mnist path/to/your/mnist/data`
    * Start the node: `./scripts/fedbiomed_run node config config-n2.ini start`
3. Add a third node by following the same instructions as in step 2.

## List Datasets Available in Nodes

You can easily list the datasets located on online nodes using the `list()` method of the `Requests` class.

**Arguments**  

 * `verbose`: prints the list of datasets in table format
 * `nodes`: list of node ids; when given, only the datasets of these nodes are listed
 
 

In [None]:
from fedbiomed.researcher.requests import Requests

req = Requests()
datasets = req.list(verbose=True)


You can also access this information from the result of the `list()` method.

In [None]:
print('Datasets:')
print(datasets)
print('\nNode ids:')
print(list(datasets.keys()))


You can select and list only datasets from a subset of the previously listed nodes:

In [None]:
nodes = list(datasets.keys())
if nodes:
    # keep only first node
    nodes = nodes[0:1]
else:
    # in this case, datasets from all nodes are listed
    nodes = []

Alternatively, you can create a list containing the ids of the nodes on which you want to run your experiment:

In [None]:
# Set the `nodes` variable directly when you know the node ids
nodes = ['node_b1f4374a-09e2-436a-b21e-9d2493586c47', 'node_eac43a7c-4dc6-4833-851a-a87e007e72c8']

In [None]:
print('Selected nodes:')
print(nodes)

req = Requests()
datasets = req.list(nodes, verbose=True)

print('Datasets:')
print(datasets)

After specifying nodes in the `nodes` list, you can start creating your model and experiment.

## Search datasets from tags

You can also search datasets from a list of tags.

A dataset is selected only if all the specified tags are included in its tag list.
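
The matching rule can be illustrated offline with a minimal sketch. The helper `matches_all_tags` is a hypothetical name introduced here for illustration, not part of Fed-BioMed; it only mirrors the rule as stated above (every requested tag must appear in the dataset's tags).

```python
def matches_all_tags(requested_tags, dataset_tags):
    """Return True when every requested tag appears in the dataset's tags."""
    return set(requested_tags).issubset(dataset_tags)


# Tags of the MNIST dataset when the default tags are confirmed
mnist_tags = ['#MNIST', '#dataset']

# exact tag match: selected
print(matches_all_tags(['#MNIST', '#dataset'], mnist_tags))
# subset of the dataset's tags: selected
print(matches_all_tags(['#dataset'], mnist_tags))
# 'other' is not among the dataset's tags: NOT selected
print(matches_all_tags(['#dataset', 'other'], mnist_tags))
```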

In [None]:
# exact tag match for MNIST dataset: dataset is selected
#tags =  ['#MNIST', '#dataset']

# loose tag match for MNIST dataset (listing a subset of the tags): dataset is selected
tags =  ['#dataset']

# not all tags matching MNIST dataset: dataset is NOT selected
#tags = ['#dataset', 'other']

In [None]:
from fedbiomed.researcher.requests import Requests

req = Requests()
datasets = req.search(tags)

In [None]:
print(datasets)

## Create a Model and an Experiment

Declare a `MyTrainingPlan` class, built on `torch.nn`, to send to the nodes for training.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # `F` is used in Net.forward below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms


# Here we define the model to be used. 
# You can use any class name (here 'Net')
class MyTrainingPlan(TorchTrainingPlan):
    
    # Defines and returns the model
    def init_model(self, model_args):
        return self.Net(model_args = model_args)
    
    # Defines and returns the optimizer
    def init_optimizer(self, optimizer_args):
        return torch.optim.Adam(self.model().parameters(), lr = optimizer_args["lr"])
    
    # Declares and returns the dependencies
    def init_dependencies(self):
        deps = ["from torchvision import datasets, transforms"]
        return deps
    
    class Net(nn.Module):
        def __init__(self, model_args):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout(0.25)
            self.dropout2 = nn.Dropout(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)


            output = F.log_softmax(x, dim=1)
            return output

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.model().forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
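
The input size of `fc1` (9216) follows from the layer shapes in `Net`. The sketch below redoes that arithmetic in plain Python, using the standard output-size formula for convolution and pooling windows without dilation; `conv_out` is a helper name introduced here for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                   # MNIST images are 28 x 28 pixels
side = conv_out(side, kernel=3)             # conv1 (3x3, stride 1): 28 -> 26
side = conv_out(side, kernel=3)             # conv2 (3x3, stride 1): 26 -> 24
side = conv_out(side, kernel=2, stride=2)   # max_pool2d(x, 2): 24 -> 12

channels = 64                               # output channels of conv2
flattened = channels * side * side          # size seen by fc1 after flatten
print(flattened)
```

If you change a kernel size or the number of channels in `Net`, the first argument of `nn.Linear` in `fc1` has to be updated accordingly.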


These arguments correspond, respectively, to:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the client side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the client side.

**NOTE:** typos and/or missing positional (required) arguments will raise an error. 🤓

In [None]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'optimizer_args': {
        "lr" : 1e-3
    },
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
}
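
To get a feel for the effect of `batch_maxnum`, the sketch below computes an upper bound on the number of samples used per epoch. It assumes, as the comment in the cell above suggests, that `batch_maxnum` caps the number of batches per epoch; `samples_per_epoch` is a hypothetical helper, not a Fed-BioMed function.

```python
import math


def samples_per_epoch(dataset_size, batch_size, batch_maxnum):
    """Upper bound on samples seen in one epoch when batches are capped."""
    total_batches = math.ceil(dataset_size / batch_size)
    used_batches = min(total_batches, batch_maxnum)
    return min(dataset_size, used_batches * batch_size)


# The MNIST training set holds 60000 samples
print(samples_per_epoch(60000, batch_size=48, batch_maxnum=100))
```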

Define an experiment:
- search nodes serving data for these `tags`, optionally filtering on a list of node IDs with `nodes`
- run a round of local training on the nodes with the training plan defined in `training_plan_class`, then federate the results with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between the rounds
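
As a rough intuition for the federation step, federated averaging is commonly described as a mean of the node parameters weighted by the number of samples each node trained on. The sketch below is a plain-Python illustration of that idea, not Fed-BioMed's `FedAverage` code; whether `FedAverage` uses exactly this weighting is an assumption, and the node ids, parameter names and values are hypothetical.

```python
def federated_average(node_params, node_samples):
    """Average per-node parameter dicts, weighted by each node's sample count."""
    total = sum(node_samples.values())
    averaged = {}
    for node_id, params in node_params.items():
        weight = node_samples[node_id] / total
        for name, values in params.items():
            # Accumulate the weighted contribution of this node
            acc = averaged.setdefault(name, [0.0] * len(values))
            for i, value in enumerate(values):
                acc[i] += weight * value
    return averaged


# Hypothetical parameters returned by two nodes after local training
node_params = {
    'node_a': {'fc.weight': [1.0, 2.0], 'fc.bias': [0.0]},
    'node_b': {'fc.weight': [3.0, 6.0], 'fc.bias': [4.0]},
}
node_samples = {'node_a': 100, 'node_b': 300}

print(federated_average(node_params, node_samples))
```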

In [None]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

#tags =  ['#MNIST', '#dataset']
tags =  ['#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 # in this case you may want to use only nodes selected
                 # during the previous steps
                 #nodes=nodes,
                 model_args=model_args,
                 training_plan_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

Check tags, nodes and datasets used by the experiment:

In [None]:
print('Tags:')
print(exp.tags())
print('\nNodes:')
print(exp.nodes())
print('\nDatasets:')
print(exp.training_data().data())

## Optional: filter used datasets with minimum number of samples

As an advanced example, we may want to keep only the datasets that contain at least `min_samples` samples.

For this example you may want to share a dataset of each supported type (MNIST, CSV, medical folder dataset, etc.) with the `#dataset` tag, before creating the `Experiment()`.
Each of these datasets should contain fewer than `min_samples` samples to be filtered out.

Then filter found datasets:

In [None]:
datasets = exp.training_data().data()

# Minimal number of samples in dataset
#   eg: 59000 keeps MNIST dataset but should filter out most (smaller) datasets
min_samples = 59000

# Filter the datasets node by node
# Handle the case where multiple datasets per node match the tags
datasets_filtered = {}
for node, ds in datasets.items():
    #df = [ d for d in ds if d['shape'][0] >= min_samples ]
    df = []
    for d in ds:
        # most datasets have 1 data modality, so shape is a list
        if isinstance(d['shape'], list):
            # ">=" keeps datasets with at least `min_samples` samples
            if d['shape'][0] >= min_samples:
                df.append(d)
        # medical folder datasets have multiple data modalities, so shape is a dict of lists
        elif isinstance(d['shape'], dict):
            # we want at least the min number of samples for each of the modalities
            # (note: this doesn't handle the case of incomplete subjects ...)
            if all(v[0] >= min_samples for k, v in d['shape'].items() if k != 'num_modalities'):
                df.append(d)
        else:
            print("Bad dataset shape. Aborting.")
            break
    if df:
        datasets_filtered[node] = df
        
print(datasets_filtered)
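
The filtering rule can also be tried without a running network. The sketch below restates it as two small functions and applies them to mock metadata; the node ids, dataset names and shapes are hypothetical, and "at least `min_samples`" is implemented with `>=`.

```python
def has_min_samples(shape, min_samples):
    """Check a dataset 'shape' entry against a minimum sample count."""
    if isinstance(shape, list):
        # single modality: first dimension is the number of samples
        return shape[0] >= min_samples
    if isinstance(shape, dict):
        # several modalities: each one must have enough samples
        return all(v[0] >= min_samples
                   for k, v in shape.items() if k != 'num_modalities')
    raise ValueError(f"Unsupported dataset shape: {shape!r}")


def filter_datasets(datasets, min_samples):
    """Keep, for each node, only the datasets with enough samples."""
    filtered = {}
    for node, entries in datasets.items():
        kept = [d for d in entries if has_min_samples(d['shape'], min_samples)]
        if kept:
            filtered[node] = kept
    return filtered


# Mock metadata (hypothetical values)
mock_datasets = {
    'node_1': [{'name': 'MNIST', 'shape': [60000, 1, 28, 28]}],
    'node_2': [{'name': 'small_csv', 'shape': [300, 20]}],
    'node_3': [{'name': 'folder',
                'shape': {'T1': [80, 64, 64], 'T2': [80, 64, 64],
                          'num_modalities': 2}}],
}

print(filter_datasets(mock_datasets, min_samples=59000))
```

Raising an exception on an unknown shape, instead of printing a message, makes a malformed entry impossible to miss; this is a design choice of the sketch.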

Now set the filtered datasets as the training data of the experiment:

In [None]:
exp.set_training_data(training_data = datasets_filtered)
exp.set_strategy(node_selection_strategy=None)
exp.set_job()

print('\nDatasets:')
print(exp.training_data().data())

## Run experiment

Let's start the experiment.

By default, this function doesn't stop until all `round_limit` rounds are done for all the nodes.

In [None]:
exp.run()

Local training results for each round and each node are available via `exp.training_replies()` (index 0 to (`rounds` - 1) ).

For example you can view the training results for the last round below.

Different timings (in seconds) are reported for each dataset of a node participating in a round :
- `rtime_training` real time (clock time) spent in the training function on the node
- `ptime_training` process time (user and system CPU) spent in the training function on the node
- `rtime_total` real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
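
The difference between real time and process time can be seen with Python's standard `time` module. This is only an illustration of the two notions; how Fed-BioMed measures these timings internally is not shown here, and `timed` is a helper name introduced for the example.

```python
import time


def timed(func):
    """Run func and return (real_seconds, process_seconds)."""
    real_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # user + system CPU time of this process
    func()
    return (time.perf_counter() - real_start,
            time.process_time() - cpu_start)


# Sleeping consumes wall-clock time but almost no CPU time
rtime, ptime = timed(lambda: time.sleep(0.2))
print(f"rtime={rtime:.2f} seconds, ptime={ptime:.2f} seconds")
```

A large gap between the two values for a training round usually means the process spent time waiting (I/O, sleeping, other processes) rather than computing.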

In [None]:
print("\nList the training rounds : ", exp.training_replies().keys())

print("\nList the nodes for the last training round and their timings : ")
round_data = exp.training_replies()[rounds - 1].data()
for c in range(len(round_data)):
    print("\t- {id} :\
    \n\t\trtime_training={rtraining:.2f} seconds\
    \n\t\tptime_training={ptraining:.2f} seconds\
    \n\t\trtime_total={rtotal:.2f} seconds".format(id = round_data[c]['node_id'],
        rtraining = round_data[c]['timing']['rtime_training'],
        ptraining = round_data[c]['timing']['ptime_training'],
        rtotal = round_data[c]['timing']['rtime_total']))
print('\n')
    
exp.training_replies()[rounds - 1].dataframe()

Federated parameters for each round are available via `exp.aggregated_params()` (index 0 to (`rounds` - 1) ).

For example you can view the federated parameters for the last round of the experiment :

In [None]:
print("\nList the training rounds : ", exp.aggregated_params().keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params()[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params()[rounds - 1]['params'].keys())


Feel free to try your own models :D