# Fedbiomed Researcher base example

Use for developing (autoreloads changes made across packages)

In [1]:
%load_ext autoreload
%autoreload 2

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
A node must be configured beforehand:
1. `./scripts/fedbiomed_run node add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm default tags by hitting "y" and ENTER
  * Pick the folder where MNIST is downloaded (this is due to a torch issue: https://github.com/pytorch/vision/issues/3549)
  * Data must have been added (if you get a warning saying that data must be unique, it is because the data has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node list`
3. Run the node using `./scripts/fedbiomed_run node run`. Wait until you get `Starting task manager`, which means you are online.

## Define an experiment model and parameters

Declare a torch.nn MyTrainingPlan class to send for training on the node

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # referenced as `F` in forward() below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the model to be used. 
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlan, self).__init__(model_args)
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Here we define the custom dependencies that will be needed by our custom Dataloader
        # In this case, we need the torch DataLoader classes
        # Since we will train on MNIST, we need datasets and transform from torchvision
        deps = ["from torchvision import datasets, transforms"]
        
        self.add_dependency(deps)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        
        
        output = F.log_softmax(x, dim=1)
        return output

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
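
The `9216` input size of `fc1` follows from the layer shapes in `forward()`. The sketch below retraces that arithmetic in plain Python, assuming the standard 28x28 MNIST input; `conv_out` is a hypothetical helper introduced only for this illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output side length of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1


side = 28                                   # assumed MNIST image side length
side = conv_out(side, kernel=3)             # conv1 (3x3, stride 1): 28 -> 26
side = conv_out(side, kernel=3)             # conv2 (3x3, stride 1): 26 -> 24
side = conv_out(side, kernel=2, stride=2)   # max_pool2d(x, 2): 24 -> 12
flat_features = 64 * side * side            # conv2 outputs 64 channels
print(flat_features)                        # 9216, the in_features of fc1
```

If you change a kernel size or add a layer, the first argument of `nn.Linear` in `fc1` has to be recomputed the same way.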


These arguments correspond, respectively, to:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos and/or missing positional (required) arguments will raise an error. 🤓

In [3]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100  # Fast pass for development: only use (batch_maxnum * batch_size) samples
}
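
To see what `batch_maxnum` implies for the values above, here is a small sketch of the sample count per epoch. The dataset size of 48000 is taken from the `Completed: .../48000` lines in the logs further down; the helper `samples_per_epoch` and its "0 means no cap" convention are assumptions made for this illustration, not part of Fed-BioMed.

```python
def samples_per_epoch(dataset_size: int, batch_size: int, batch_maxnum: int = 0) -> int:
    """Samples visited in one epoch when at most `batch_maxnum` batches are used.

    Assumption for this sketch: batch_maxnum <= 0 means "no cap".
    """
    full_batches, remainder = divmod(dataset_size, batch_size)
    n_batches = full_batches + (1 if remainder else 0)
    if batch_maxnum > 0:
        n_batches = min(n_batches, batch_maxnum)
    return min(n_batches * batch_size, dataset_size)


print(samples_per_epoch(48000, batch_size=48, batch_maxnum=100))  # 4800
print(samples_per_epoch(48000, batch_size=48))                    # 48000
```

With the settings in this notebook, each round therefore trains on roughly 10% of the node's training data.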

## Declare and run the experiment

- search for nodes serving data for these `tags`, optionally filtering on a list of node IDs with `nodes`
- run a round of local training on nodes with the model defined in `model_class`, followed by federation with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between the rounds
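
As an illustration of the federation step, the sketch below shows generic federated averaging (FedAvg) on plain Python lists. It is not Fed-BioMed's `FedAverage` implementation: the function name `federated_average`, the parameter layout, and the weighting by local sample count are assumptions made for this example.

```python
from typing import Dict, List


def federated_average(node_params: List[Dict[str, List[float]]],
                      sample_counts: List[int]) -> Dict[str, List[float]]:
    """Average each parameter across nodes, weighted by local sample count."""
    total = sum(sample_counts)
    averaged = {}
    for name in node_params[0]:
        size = len(node_params[0][name])
        averaged[name] = [
            sum(params[name][i] * n for params, n in zip(node_params, sample_counts)) / total
            for i in range(size)
        ]
    return averaged


# Two hypothetical nodes holding 1 and 3 samples respectively
node_a = {"fc2.bias": [1.0, 2.0]}
node_b = {"fc2.bias": [3.0, 4.0]}
print(federated_average([node_a, node_b], [1, 3]))  # {'fc2.bias': [2.5, 3.5]}
```

With a single node, as in the logs of this notebook, the average is simply that node's parameters.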

In [4]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

2022-03-17 15:25:04,102 fedbiomed INFO - Component environment:
2022-03-17 15:25:04,103 fedbiomed INFO - type = ComponentType.RESEARCHER
2022-03-17 15:25:04,301 fedbiomed INFO - Messaging researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x7fa8931be550>
2022-03-17 15:25:04,333 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-03-17 15:25:04,335 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Message received: {'researcher_id': 'researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2', 'tags': ['#MNIST', '#dataset'], 'command': 'search'}
2022-03-17 15:25:14,372 fedbiomed INFO - Node selected for training -> node_19ef0050-617d-4624-bbce-207469edf883
2022-03-17 15:25:14,412 fedbiomed DEBUG - Model file has been saved: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0115/my_model_178fefaf-d79e-4dc

Let's start the experiment.

`exp.run_once()` runs a single round. By default, `exp.run()` doesn't stop until all the `round_limit` rounds are done for all the nodes.
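
The round bookkeeping can be pictured with a toy model. It is inferred only from the log line `Auto increasing total rounds for experiment from 2 to 9` shown further down, not from Fed-BioMed's source; `updated_round_limit` is a hypothetical helper.

```python
def updated_round_limit(round_current: int, round_limit: int, rounds: int) -> int:
    """Round limit after requesting `rounds` more rounds with increase=True (toy model)."""
    return max(round_limit, round_current + rounds)


# run_once(increase=True) from a fresh experiment: one round fits within the limit of 2
print(updated_round_limit(round_current=0, round_limit=2, rounds=1))  # 2
# run(rounds=8, increase=True) after one completed round: the limit grows to 9
print(updated_round_limit(round_current=1, round_limit=2, rounds=8))  # 9
```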

In [5]:
exp.run_once(increase=True)

2022-03-17 15:25:14,668 fedbiomed INFO - Sampled nodes in round 0 ['node_19ef0050-617d-4624-bbce-207469edf883']
2022-03-17 15:25:14,669 fedbiomed INFO - Send message to node node_19ef0050-617d-4624-bbce-207469edf883 - {'researcher_id': 'researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2', 'job_id': 'fa44b909-b49f-488d-b171-7eb8d5cf2b95', 'training_args': {'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/17/my_model_178fefaf-d79e-4dcf-96de-aaf0a6dece7a.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/17/aggregated_params_init_ab74f0c4-dc6c-4121-b5b2-afd4bd2252d8.pt', 'model_class': 'MyTrainingPlan', 'training_data': {'node_19ef0050-617d-4624-bbce-207469edf883': ['dataset_ba55374f-ddc3-4f5d-8bb6-deac79c459ee']}}
2022-03-17 15:25:14,670 fedbiomed DEBUG - researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2
2022-03-17 15:25:14,673 fedbiomed INFO - log fr

2022-03-17 15:25:34,742 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_ba9b4c54-3a64-4549-a8ac-d1170f35fb54.pt successful, with status code 200
2022-03-17 15:25:34,754 fedbiomed INFO - Nodes that successfully reply in round 0 ['node_19ef0050-617d-4624-bbce-207469edf883']
2022-03-17 15:25:34,936 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0115/aggregated_params_a18a2137-4123-4139-9143-d007cff93852.pt successful, with status code 201
2022-03-17 15:25:34,938 fedbiomed INFO - Saved aggregated params for round 0 in /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0115/aggregated_params_a18a2137-4123-4139-9143-d007cff93852.pt


1

In [6]:
exp.run(rounds=8, increase=True)

2022-03-17 15:25:43,292 fedbiomed DEBUG - Auto increasing total rounds for experiment from 2 to 9
2022-03-17 15:25:43,293 fedbiomed INFO - Sampled nodes in round 1 ['node_19ef0050-617d-4624-bbce-207469edf883']
2022-03-17 15:25:43,293 fedbiomed INFO - Send message to node node_19ef0050-617d-4624-bbce-207469edf883 - {'researcher_id': 'researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2', 'job_id': 'fa44b909-b49f-488d-b171-7eb8d5cf2b95', 'training_args': {'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/17/my_model_178fefaf-d79e-4dcf-96de-aaf0a6dece7a.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/17/aggregated_params_a18a2137-4123-4139-9143-d007cff93852.pt', 'model_class': 'MyTrainingPlan', 'training_data': {'node_19ef0050-617d-4624-bbce-207469edf883': ['dataset_ba55374f-ddc3-4f5d-8bb6-deac79c459ee']}}
2022-03-17 15:25:43,294 fedbiomed DEBUG - re

2022-03-17 15:26:03,318 fedbiomed INFO - Downloading model params after training on node_19ef0050-617d-4624-bbce-207469edf883 - from http://localhost:8844/media/uploads/2022/03/17/node_params_28b709a9-3e35-4f9a-975a-9a23b493c3ca.pt
2022-03-17 15:26:03,348 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_74d3ef5e-4b1c-486b-bb25-399f53530da4.pt successful, with status code 200
2022-03-17 15:26:03,385 fedbiomed INFO - Nodes that successfully reply in round 1 ['node_19ef0050-617d-4624-bbce-207469edf883']
2022-03-17 15:26:03,576 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0115/aggregated_params_b61c9b5a-fca8-4522-bd54-e1de30508b49.pt successful, with status code 201
2022-03-17 15:26:03,580 fedbiomed INFO - Saved aggregated params for round 1 in /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0115/aggregated_params_b61c9b5a-fca8-4522-bd54-e1de30508b49.pt
2022-03-1

2022-03-17 15:26:13,988 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Reached 100 batches for this epoch, ignore remaining data
2022-03-17 15:26:13,990 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - running model.postprocess() method
2022-03-17 15:26:13,991 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - model.postprocess() method not provided
2022-03-17 15:26:14,201 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/tmp/node_params_f6ae1b38-5114-4797-a01a-186eeb1117fa.pt successful, with status code 201
2022-03-17 15:26:14,204 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / INFO - results uploaded successfully 
2022-03-17 15:26:23,612 fedbiomed INFO - Downloading model params after training on node_19ef0050-617d-4624-bbce-207469edf883 - from http://loca

2022-03-17 15:26:32,946 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 3360/48000 (7%) 
 					 Loss: 0.138728 
					 ---------
2022-03-17 15:26:33,434 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 3840/48000 (8%) 
 					 Loss: 0.134329 
					 ---------
2022-03-17 15:26:33,916 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 4320/48000 (9%) 
 					 Loss: 0.173769 
					 ---------
2022-03-17 15:26:34,411 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Reached 100 batches for this epoch, ignore remaining data
2022-03-17 15:26:34,413 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - running model.postprocess() method
2022-03-17 15:26:34,416 fedbiomed INFO - log from: node_19ef0050-617d-4624

2022-03-17 15:26:51,757 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 2400/48000 (5%) 
 					 Loss: 0.253747 
					 ---------
2022-03-17 15:26:52,255 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 2880/48000 (6%) 
 					 Loss: 0.492268 
					 ---------
2022-03-17 15:26:52,747 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 3360/48000 (7%) 
 					 Loss: 0.164425 
					 ---------
2022-03-17 15:26:53,245 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 3840/48000 (8%) 
 					 Loss: 0.118568 
					 ---------
2022-03-17 15:26:54,076 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 4320/48000 (9%) 
 	

2022-03-17 15:27:14,116 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 960/48000 (2%) 
 					 Loss: 0.094482 
					 ---------
2022-03-17 15:27:14,995 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 1440/48000 (3%) 
 					 Loss: 0.145009 
					 ---------
2022-03-17 15:27:15,629 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 1920/48000 (4%) 
 					 Loss: 0.041120 
					 ---------
2022-03-17 15:27:16,510 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 2400/48000 (5%) 
 					 Loss: 0.031313 
					 ---------
2022-03-17 15:27:17,423 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 2880/48000 (6%) 
 		

2022-03-17 15:27:36,184 fedbiomed INFO - TESTING BEFORE TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Completed: 12000/12000 (100%) 
 					 RECALL: 0.967167 
					 ---------
2022-03-17 15:27:36,186 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Using device cpu for training (cuda_available=False, gpu=False, gpu_only=False, use_gpu=False, gpu_num=None)
2022-03-17 15:27:36,911 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 480/48000 (1%) 
 					 Loss: 0.034250 
					 ---------
2022-03-17 15:27:37,488 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 960/48000 (2%) 
 					 Loss: 0.211607 
					 ---------
2022-03-17 15:27:38,079 fedbiomed INFO - TRAINING 
					 NODE_ID: node_19ef0050-617d-4624-bbce-207469edf883 
					 Epoch: 1 | Completed: 1440/4800

2022-03-17 15:27:50,034 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - upload (HTTP GET request) of file my_model_f931d30e-9ac1-4417-985a-aee5cd3be635.pt successful, with status code 200
2022-03-17 15:27:50,068 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Dataset path has been set as../data
2022-03-17 15:27:50,120 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / INFO - training with arguments {'history_monitor': <fedbiomed.node.history_monitor.HistoryMonitor object at 0x7f663603abb0>, 'node_args': {'gpu': False, 'gpu_num': None, 'gpu_only': False}, 'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}
2022-03-17 15:27:56,612 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / INFO - Actual/True values (y_true) has more than two levels, using multiclass `weighted` calculation for the metric RECALL
2022-03-17 15:27:56,626 fedbiomed INFO - TESTING BEF

2022-03-17 15:28:10,249 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - upload (HTTP GET request) of file my_model_d3309b868fb741938fea961640a0e724.py successful, with status code 200
2022-03-17 15:28:10,276 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - upload (HTTP GET request) of file my_model_900059c6-460e-4f69-9c3a-1f1574688b67.pt successful, with status code 200
2022-03-17 15:28:10,309 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Dataset path has been set as../data
2022-03-17 15:28:10,351 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / INFO - training with arguments {'history_monitor': <fedbiomed.node.history_monitor.HistoryMonitor object at 0x7f66cf734250>, 'node_args': {'gpu': False, 'gpu_num': None, 'gpu_only': False}, 'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}
2022-03-17 15:28:17,271 fedbiomed INFO - log from: node_19e

8

Local training results for each round and each node are available via `exp.training_replies()` (indexed from 0 to `rounds` - 1).

For example, you can view the training results for the last round below.

Different timings (in seconds) are reported for each dataset of a node participating in a round:
- `rtime_training`: real time (clock time) spent in the training function on the node
- `ptime_training`: process time (user and system CPU) spent in the training function on the node
- `rtime_total`: real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
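
The difference between real time and process time can be reproduced with Python's standard `time` module. This is a minimal sketch of the two notions, not the way Fed-BioMed measures them; `timed` and `fake_training` are hypothetical names.

```python
import time


def timed(fn):
    """Run fn() and return (real_seconds, process_seconds)."""
    r0, p0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - r0, time.process_time() - p0


def fake_training():
    sum(range(100_000))   # a little CPU work: counts in both timings
    time.sleep(0.05)      # waiting: counts as real time, not as CPU time


rtime, ptime = timed(fake_training)
print(f"rtime={rtime:.3f}s ptime={ptime:.3f}s")
```

A large gap between `rtime_training` and `ptime_training` suggests the node spent time waiting (for example on I/O) rather than computing.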

In [None]:
print("\nList the training rounds : ", exp.training_replies().keys())

print("\nList the nodes for the last training round and their timings : ")
round_data = exp.training_replies()[rounds - 1].data()
for reply in round_data:
    timing = reply['timing']
    # One block per node: node ID followed by its three timings
    print(f"\t- {reply['node_id']} :"
          f"\n\t\trtime_training={timing['rtime_training']:.2f} seconds"
          f"\n\t\tptime_training={timing['ptime_training']:.2f} seconds"
          f"\n\t\trtime_total={timing['rtime_total']:.2f} seconds")
print('\n')
    
exp.training_replies()[rounds - 1].dataframe()

Federated parameters for each round are available via `exp.aggregated_params()` (indexed from 0 to `rounds` - 1).

For example, you can view the federated parameters for the last round of the experiment:

In [None]:
print("\nList the training rounds : ", exp.aggregated_params().keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params()[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params()[rounds - 1]['params'].keys())


Feel free to run other sample notebooks or try your own models :D