# Fed-BioMed to train a federated SGD regressor model

## Data 


This tutorial shows how to use Fed-BioMed to solve a federated regression problem with scikit-learn.

In this tutorial we use the Fed-BioMed wrapper for the scikit-learn SGD regressor (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html).
The goal of the notebook is to train a model on a realistic dataset of synthetic medical information mimicking the ADNI dataset (http://adni.loni.usc.edu/). 

## Creating nodes

To proceed with the tutorial, we create 3 clients, each with a dataframe of clinical information in .csv format. Each client has 300 data points composed of several features corresponding to clinical and medical imaging information. **The data is entirely synthetic and randomly sampled to mimic the variability of the real ADNI dataset**. The training partitions are available at the following link:

https://drive.google.com/file/d/1R39Ir60oQi8ZnmHoPz5CoGCrVIglcO9l/view?usp=sharing

The federated task we aim to solve is to predict a clinical variable (the mini-mental state examination, MMSE) from a combination of demographic and imaging features. The candidate regressor variables are the following features (the training plan defined later in this notebook uses six of them, omitting `SEX` and `PTEDUCAT`):

['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']

and the target variable is:

['MMSE.bl']
    

To create the federated dataset, we follow the standard Fed-BioMed procedure for node creation and population. 
After activating the Fed-BioMed network with the commands

`source ./scripts/fedbiomed_environment network`

and 

`./scripts/fedbiomed_run network`

we create a first node by using the commands

`source ./scripts/fedbiomed_environment node`

`./scripts/fedbiomed_run node start`

We then populate the node with the data of the first client:

`./scripts/fedbiomed_run node config conf.ini add`

Then, we select option 1 (csv dataset) and pick the .csv partition of client 1. We use `adni` as the tag to save the selected dataset. We can check that the data has been added by executing `./scripts/fedbiomed_run node list`.

Following the same procedure, we create the other two nodes with the datasets of client 2 and client 3 respectively. To do so, we add and launch each `Node` using a different configuration file.

## Fed-BioMed Researcher

We are now ready to start the researcher environment with the command `source ./scripts/fedbiomed_environment researcher`, and open the Jupyter notebook with `./scripts/fedbiomed_run researcher start`. 

We can first query the network for the `adni` dataset. In this case, the nodes share their respective partitions using the same tag `adni`:

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from fedbiomed.researcher.requests import Requests
req = Requests()
req.list(verbose=True)

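The training plan and data loader for the sklearn SGDRegressor can now be deployed in Fed-BioMed.
We first import the necessary class `FedSGDRegressor` from `fedbiomed.common.training_plans` and subclass it, defining the following methods:

**init_dependencies** : returns the import statements that the training plan needs on the nodes

**training_data** : loads the data and targets and returns them wrapped in a `DataManager`; both must be of the same type as the parameters of the model's `partial_fit` method

**init_optimizer** : returns the optimizer used for the local updates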

We note that this model applies a common standardization across the federated datasets: every node **centers and scales its features with respect to the same parameters**.
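
To illustrate why shared scaling parameters matter, here is a minimal NumPy sketch. The toy arrays, the `standardize` helper and the scaling values are hypothetical and are not taken from the ADNI-like data. It compares standardizing two nodes with common parameters against standardizing each node with its own statistics.

```python
import numpy as np

# Hypothetical toy data: two nodes holding the same feature on different ranges.
node_a = np.array([[60.0], [70.0], [80.0]])
node_b = np.array([[70.0], [80.0], [90.0]])

# Shared scaling parameters, assumed to be agreed upon beforehand
# (illustrative values only).
shared_mean = np.array([75.0])
shared_sd = np.array([10.0])


def standardize(x, mean, sd):
    """Center and scale the columns of x with the given parameters."""
    return (x - mean) / sd


# With shared parameters, the raw value 70 maps to the same standardized
# value on both nodes, so model coefficients keep a common meaning.
shared_a = standardize(node_a, shared_mean, shared_sd)
shared_b = standardize(node_b, shared_mean, shared_sd)

# With per-node parameters, the same raw value maps to different values.
local_a = standardize(node_a, node_a.mean(axis=0), node_a.std(axis=0))
local_b = standardize(node_b, node_b.mean(axis=0), node_b.std(axis=0))

print(shared_a[1, 0], shared_b[0, 0])  # raw value 70 on node A and node B
print(local_a[1, 0], local_b[0, 0])
```

With per-node statistics, the averaged coefficients would refer to differently scaled inputs on each node, which is what the common parameters avoid.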

In [None]:
from fedbiomed.common.training_plans import FedSGDRegressor
from fedbiomed.common.data import DataManager

from declearn.optimizer import Optimizer
from declearn.optimizer.modules import AdamModule
from declearn.optimizer.regularizers import FedProxRegularizer


class SGDRegressorTrainingPlan(FedSGDRegressor):
    # Declare and return the dependencies needed on the nodes:
    # pandas and numpy for training_data, declearn for init_optimizer
    def init_dependencies(self):
        deps = ["import pandas as pd",
                "import numpy as np",
                "from declearn.optimizer import Optimizer",
                "from declearn.optimizer.modules import AdamModule",
                "from declearn.optimizer.regularizers import FedProxRegularizer"]
        return deps

    def training_data(self):
        dataset = pd.read_csv(self.dataset_path, delimiter=',')
        regressors_col = ['AGE', 'WholeBrain.bl',
                          'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
        target_col = ['MMSE.bl']
        
        # mean and standard deviation for normalizing dataset
        # it has been computed over the whole dataset
        scaling_mean = np.array([72.3, 0.7, 0.0, 0.0, 0.0, 0.0])
        scaling_sd = np.array([7.3e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])
        
        X = (dataset[regressors_col].values-scaling_mean)/scaling_sd
        y = dataset[target_col]
        return DataManager(dataset=X, target=y.values.ravel(),  shuffle=True)

    # Define and return a declearn optimizer
    def init_optimizer(self, optimizer_args):
        return Optimizer(lrate=.1 ,modules=[AdamModule()], regularizers=[FedProxRegularizer()])
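
The optimizer above combines an Adam module with a FedProx regularizer. As a rough illustration of the FedProx idea only, and not of declearn's actual implementation, the sketch below adds the gradient of the proximal term (mu / 2) * ||w - w_global||^2 to a local gradient. The function name `fedprox_gradient`, the parameter `mu` and all numeric values are assumptions made for this example.

```python
import numpy as np


def fedprox_gradient(grad, weights, global_weights, mu=0.01):
    """Add the gradient of (mu / 2) * ||w - w_global||^2 to a local gradient."""
    return grad + mu * (weights - global_weights)


global_weights = np.array([1.0, -2.0])  # weights received at the start of the round
local_weights = np.array([1.5, -1.0])   # weights after some local steps
raw_grad = np.array([0.2, 0.2])         # hypothetical gradient of the local loss

# The correction pulls the local weights back towards the global ones.
reg_grad = fedprox_gradient(raw_grad, local_weights, global_weights, mu=0.1)
print(reg_grad)

# When local and global weights coincide, the gradient is left unchanged.
same = fedprox_gradient(raw_grad, global_weights, global_weights, mu=0.1)
print(same)
```

The further a node drifts from the global model during local training, the stronger the correction, which helps limit divergence between nodes with heterogeneous data.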

**model_args** is a dictionary containing the model arguments; for SGDRegressor these include `max_iter` and `tol`. `n_features` is provided to correctly initialize the SGDRegressor `coef_` array.

**training_args** is a dictionary with parameters related to Federated Learning. 

In [None]:
from fedbiomed.common.metrics import MetricTypes
RANDOM_SEED = 1234


model_args = {
    'max_iter':2000,
    'tol': 1e-5,
    'eta0':0.05,
    'n_features': 6,
    'random_state': RANDOM_SEED
}

training_args = {
    'epochs': 5,
    'loader_args': { 'batch_size': 32, },
    'test_ratio':.3,
    'test_metric': MetricTypes.MEAN_SQUARE_ERROR,
    'test_on_local_updates': True,
    'test_on_global_updates': True
}
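
As a back-of-the-envelope check of what these training arguments imply, the sketch below estimates the number of local updates per node and per round. It relies on two assumptions that are not confirmed by this tutorial: the held-out test part is rounded to a whole number of samples, and the last, incomplete batch is kept rather than dropped.

```python
import math

n_samples = 300    # data points per node, as stated earlier in the tutorial
test_ratio = 0.3
batch_size = 32
epochs = 5

# Assumption: the held-out part is rounded to a whole number of samples.
n_test = round(n_samples * test_ratio)
n_train = n_samples - n_test

# Assumption: the last, incomplete batch is kept rather than dropped.
batches_per_epoch = math.ceil(n_train / batch_size)
updates_per_round = batches_per_epoch * epochs

print(n_train, batches_per_epoch, updates_per_round)
```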

The experiment can now be defined by providing the `adni` tag and running the local training on the nodes with the training plan passed as `training_plan_class`, the standard `aggregator` (FedAvg) and the default `node_selection_strategy` (all nodes used). Federated learning is performed over the number of optimization rounds set in `rounds` (5 in the cell below).
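
For intuition, the FedAvg aggregation step can be sketched as a sample-weighted average of the parameters returned by the nodes. This is a simplified illustration with made-up parameter values, not the code of Fed-BioMed's `FedAverage`; the helper `fed_avg` is hypothetical.

```python
import numpy as np


def fed_avg(node_params, node_sample_counts):
    """Average per-node parameter dicts, weighted by local sample counts."""
    total = float(sum(node_sample_counts))
    weights = [n / total for n in node_sample_counts]
    return {
        key: sum(w * params[key] for w, params in zip(weights, node_params))
        for key in node_params[0]
    }


# Hypothetical local results from three nodes after one round.
node_params = [
    {"coef_": np.array([1.0, 0.0]), "intercept_": np.array([0.0])},
    {"coef_": np.array([3.0, 2.0]), "intercept_": np.array([1.0])},
    {"coef_": np.array([2.0, 4.0]), "intercept_": np.array([2.0])},
]

# Equal sample counts give a plain average of the three models.
aggregated = fed_avg(node_params, node_sample_counts=[300, 300, 300])
print(aggregated["coef_"], aggregated["intercept_"])
```

With unequal sample counts, nodes holding more data contribute proportionally more to the aggregated model.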

In [None]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['adni']

# Add more rounds for results with better accuracy
#
#rounds = 40
rounds = 5

# select nodes participating in this experiment
exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=SGDRegressorTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

In [None]:
# start federated training
exp.run()

# Declearn Optimizers with scikit-learn Perceptron Classifier

In [1]:
from fedbiomed.common.training_plans import FedPerceptron
from fedbiomed.common.data import DataManager
import numpy as np

from fedbiomed.common.optimizers import Optimizer
from declearn.optimizer.modules import AdamModule
from declearn.optimizer.regularizers import FedProxRegularizer

class SkLearnClassifierTrainingPlan(FedPerceptron):
    def init_dependencies(self):
        """Define additional dependencies.
        
        In this case, we rely on torchvision functions for preprocessing the images.
        """
        return ["from torchvision import datasets, transforms",
                "from fedbiomed.common.optimizers import Optimizer",
                "from declearn.optimizer.modules import AdamModule",
                "from declearn.optimizer.regularizers import FedProxRegularizer",]

    def training_data(self):
        """Prepare data for training.
        
        This function loads the MNIST dataset from the node's filesystem and converts
        the full dataset to flattened numpy arrays. Since `dataset.data` holds the raw
        pixel values, the `transform` passed to the dataset is not applied to these arrays.
        Finally, it returns a DataManager created with these numpy arrays.
        """
        transform = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
        dataset = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        
        X_train = dataset.data.numpy()
        X_train = X_train.reshape(-1, 28*28)
        Y_train = dataset.targets.numpy()
        return DataManager(dataset=X_train, target=Y_train,  shuffle=False)
    
    # Define and return a Fed-BioMed optimizer built from declearn modules
    def init_optimizer(self, optimizer_args):
        return Optimizer(lr=.1 ,modules=[AdamModule()], regularizers=[FedProxRegularizer()])
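
The reshape in `training_data` turns each 28x28 image into a flat vector of 784 features, which matches `n_features` in the model arguments. A small NumPy sketch on random stand-in images (not real MNIST data) shows the effect:

```python
import numpy as np

# Stand-in for `dataset.data.numpy()`: a hypothetical batch of 5 grayscale images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# -1 lets NumPy infer the number of samples; each image becomes one row.
flat = images.reshape(-1, 28 * 28)
print(flat.shape)

# Flattening is row-major: pixel (row 1, column 0) lands at index 28.
print(flat[2, 28] == images[2, 1, 0])
```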

In [2]:
model_args = {'n_features': 28*28,
              'n_classes' : 10,
              'eta0':1e-6,
              'random_state':1234,
              'alpha':0.1 }

training_args = {
    'epochs': 3, 
    'batch_maxnum': 20,  # can be used for debugging, to limit the number of batches per epoch
    'optimizer_args': {
        "lr" : 1e-3
    },
#    'log_interval': 1,  # output a logging message every log_interval batches
    'batch_size': 4
}
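
With these values, `batch_maxnum` caps the amount of data seen per epoch. The short computation below, which assumes each node holds at least that many samples, gives the upper bounds; the per-epoch figure is consistent with the `Samples: 80/80` entries in the training logs further down.

```python
batch_size = 4
batch_maxnum = 20
epochs = 3

samples_per_epoch = batch_size * batch_maxnum   # upper bound per epoch
samples_per_round = samples_per_epoch * epochs  # upper bound per node and round

print(samples_per_epoch, samples_per_round)
```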

In [3]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 3

# select nodes participating in this experiment
exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=SkLearnClassifierTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)


2023-05-16 16:38:03,122 fedbiomed INFO - Messaging researcher_6b641dc8-8bf0-4237-aaf7-eb6d72f8d3d8 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x7f5df876dcd0>
2023-05-16 16:38:03,155 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2023-05-16 16:38:13,175 fedbiomed INFO - Node selected for training -> node_f0ea0045-a45a-4c29-9367-fa28128a59e9
2023-05-16 16:38:13,178 fedbiomed INFO - Node selected for training -> node_c2390ffb-45a3-4acc-a926-f564d3954696
2023-05-16 16:38:13,183 fedbiomed INFO - Checking data quality of federated datasets...
2023-05-16 16:38:13,188 fedbiomed DEBUG - Using declearn optimizer
2023-05-16 16:38:13,205 fedbiomed DEBUG - Model file has been saved: /home/ybouilla/fedbiomed_2/fedbiomed/var/experiments/Experiment_0012/my_model_522bd712-f13b-44e3-adfb-621407cb2f7c.py
2023-05-16 16:38:13,256 fedbiomed DEBUG - HTTP POST request of file /home/ybouilla/fedbiomed_2/fedbiom

In [4]:
exp.run(increase=True)

2023-05-16 16:38:13,310 fedbiomed INFO - Sampled nodes in round 0 ['node_f0ea0045-a45a-4c29-9367-fa28128a59e9', 'node_c2390ffb-45a3-4acc-a926-f564d3954696']
2023-05-16 16:38:13,311 fedbiomed INFO - Sending request 
					 To: node_f0ea0045-a45a-4c29-9367-fa28128a59e9 
					 Request: Perform training with the arguments: {'researcher_id': 'researcher_6b641dc8-8bf0-4237-aaf7-eb6d72f8d3d8', 'job_id': '6bc623af-1fc1-4715-bf4f-b4ec702dfc44', 'training_args': {'epochs': 3, 'batch_maxnum': 20, 'optimizer_args': {'lr': 0.001}, 'batch_size': 4, 'num_updates': None, 'dry_run': False, 'test_ratio': 0.0, 'test_on_local_updates': False, 'test_on_global_updates': False, 'test_metric': None, 'test_metric_args': {}, 'log_interval': 10, 'fedprox_mu': None, 'use_gpu': False, 'dp_args': None, 'share_persistent_buffers': True}, 'training': True, 'model_args': {'n_features': 784, 'n_classes': 10, 'eta0': 1e-06, 'random_state': 1234, 'alpha': 0.1, 'loss': 'perceptron', 'verbose': 1},

2023-05-16 16:38:14,277 fedbiomed INFO - TRAINING 
					 NODE_ID: node_c2390ffb-45a3-4acc-a926-f564d3954696 
					 Round 1 Epoch: 2 | Iteration: 10/20 (50%) | Samples: 40/80
					 Loss perceptron: 76.487863 
					 ---------
2023-05-16 16:38:14,313 fedbiomed INFO - TRAINING 
					 NODE_ID: node_f0ea0045-a45a-4c29-9367-fa28128a59e9 
					 Round 1 Epoch: 2 | Iteration: 10/20 (50%) | Samples: 40/80
					 Loss perceptron: 76.487863 
					 ---------
2023-05-16 16:38:14,533 fedbiomed INFO - TRAINING 
					 NODE_ID: node_f0ea0045-a45a-4c29-9367-fa28128a59e9 
					 Round 1 Epoch: 2 | Iteration: 20/20 (100%) | Samples: 80/80
					 Loss perceptron: 377.399983 
					 ---------
2023-05-16 16:38:14,537 fedbiomed INFO - TRAINING 
					 NODE_ID: node_c2390ffb-45a3-4acc-a926-f564d3954696 
					 Round 1 Epoch: 2 | Iteration: 20/20 (100%) | Samples: 80/80
					 Loss perceptron: 377.399983 
					 ---------
2023-05-16 16:38:14,570 fedbiome

2023-05-16 16:38:23,548 fedbiomed INFO - INFO
					 NODE node_f0ea0045-a45a-4c29-9367-fa28128a59e9
					 MESSAGE: NPDataLoader expanding 1-dimensional target to become 2-dimensional.
-----------------------------------------------------------------
2023-05-16 16:38:23,550 fedbiomed INFO - INFO
					 NODE node_f0ea0045-a45a-4c29-9367-fa28128a59e9
					 MESSAGE: NPDataLoader expanding 1-dimensional dataset to become 2-dimensional.
-----------------------------------------------------------------
2023-05-16 16:38:23,551 fedbiomed INFO - INFO
					 NODE node_f0ea0045-a45a-4c29-9367-fa28128a59e9
					 MESSAGE: NPDataLoader expanding 1-dimensional target to become 2-dimensional.
-----------------------------------------------------------------
					 NODE node_f0ea0045-a45a-4c29-9367-fa28128a59e9
					 MESSAGE: The following non-default model parameters were overridden due to the disabling of t

2023-05-16 16:38:33,562 fedbiomed INFO - Saved aggregated params for round 1 in /home/ybouilla/fedbiomed_2/fedbiomed/var/experiments/Experiment_0012/aggregated_params_3a087263-848c-4b4b-b697-a8dd72bcecc1.mpk
2023-05-16 16:38:33,563 fedbiomed INFO - Sampled nodes in round 2 ['node_f0ea0045-a45a-4c29-9367-fa28128a59e9', 'node_c2390ffb-45a3-4acc-a926-f564d3954696']
2023-05-16 16:38:33,564 fedbiomed INFO - Sending request 
					 To: node_f0ea0045-a45a-4c29-9367-fa28128a59e9 
					 Request: Perform training with the arguments: {'researcher_id': 'researcher_6b641dc8-8bf0-4237-aaf7-eb6d72f8d3d8', 'job_id': '6bc623af-1fc1-4715-bf4f-b4ec702dfc44', 'training_args': {'epochs': 3, 'batch_maxnum': 20, 'optimizer_args': {'lr': 0.001}, 'batch_size': 4, 'num_updates': None, 'dry_run': False, 'test_ratio': 0.0, 'test_on_local_updates': False, 'test_on_global_updates': False, 'test_metric': None, 'test_metric_args': {}, 'log_interval': 10, 'fedprox_mu': None, 'use_gpu': False, 

2023-05-16 16:38:34,409 fedbiomed INFO - TRAINING 
					 NODE_ID: node_f0ea0045-a45a-4c29-9367-fa28128a59e9 
					 Round 3 Epoch: 2 | Iteration: 10/20 (50%) | Samples: 40/80
					 Loss perceptron: 0.000000 
					 ---------
2023-05-16 16:38:34,443 fedbiomed INFO - TRAINING 
					 NODE_ID: node_c2390ffb-45a3-4acc-a926-f564d3954696 
					 Round 3 Epoch: 2 | Iteration: 10/20 (50%) | Samples: 40/80
					 Loss perceptron: 0.000000 
					 ---------
2023-05-16 16:38:34,596 fedbiomed INFO - TRAINING 
					 NODE_ID: node_f0ea0045-a45a-4c29-9367-fa28128a59e9 
					 Round 3 Epoch: 2 | Iteration: 20/20 (100%) | Samples: 80/80
					 Loss perceptron: 0.000000 
					 ---------
2023-05-16 16:38:34,617 fedbiomed INFO - TRAINING 
					 NODE_ID: node_f0ea0045-a45a-4c29-9367-fa28128a59e9 
					 Round 3 Epoch: 3 | Iteration: 1/20 (5%) | Samples: 4/80
					 Loss perceptron: 0.000000 
					 ---------
2023-05-16 16:38:34,620 fedbiomed INFO - 

3

## Testing

Once the federated model is obtained, it is possible to test it locally on an independent testing partition.
The test dataset is available at this link:

https://drive.google.com/file/d/1zNUGp6TMn6WSKYVC8FQiQ9lJAUdasxk1/

In [None]:
!pip install matplotlib
!pip install gdown

Download the testing dataset to a local temporary folder.

In [None]:
import os
import gdown
import tempfile
import zipfile
import pandas as pd
import numpy as np

from fedbiomed.common.constants import ComponentType
from fedbiomed.researcher.environ import environ


resource = "https://drive.google.com/uc?id=19kxuI146WA2fhcOU2_AvF8dy-ppJkzW7"

tmpdir = tempfile.TemporaryDirectory(dir=environ['TMP_DIR'])
base_dir = tmpdir.name

test_file = os.path.join(base_dir, "test_data.zip")
gdown.download(resource, test_file, quiet=False)

zf = zipfile.ZipFile(test_file)

for file in zf.infolist():
    zf.extract(file, base_dir)

# loading testing dataset
test_data = pd.read_csv(os.path.join(base_dir,'adni_validation.csv'))

In [None]:
from sklearn.linear_model import SGDRegressor
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

Here we extract the relevant regressors and target from the testing data.

In [None]:
regressors_col = ['AGE', 'WholeBrain.bl', 'Ventricles.bl',
                  'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
target_col = ['MMSE.bl']
X_test = test_data[regressors_col].values
y_test = test_data[target_col].values

To inspect the model evolution across FL rounds, we use `exp.aggregated_params()`, which contains the model parameters collected at the end of each round. The MSE (Mean Squared Error) computed with the federated parameters is expected to decrease overall as the rounds progress, although not necessarily at every single round. 

In [None]:
scaling_mean = np.array([72.3, 0.7, 0.0, 0.0, 0.0, 0.0])
scaling_sd = np.array([7.3e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])

testing_error = []


# we reuse the SGDRegressor instance of the training plan and collect the sklearn
# arguments that were used for Federated Learning training
fed_model = exp.training_plan().model()
regressor_args = {key: model_args[key] for key in model_args.keys() if key in fed_model.get_params().keys()}

for i in range(rounds):
    fed_model.coef_ = exp.aggregated_params()[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp.aggregated_params()[i]['params']['intercept_'].copy()  
    # flatten y_test, of shape (n, 1), so it matches the (n,) predictions element-wise
    mse = np.mean((fed_model.predict((X_test-scaling_mean)/scaling_sd) - y_test.ravel())**2)
    testing_error.append(mse)

plt.plot(testing_error)
plt.title('FL testing loss')
plt.xlabel('FL round')
plt.ylabel('testing loss (MSE)')
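
When the MSE is computed by hand with NumPy, the shapes of predictions and targets matter: `predict` returns an array of shape `(n,)`, while selecting a single-column list from a dataframe and taking `.values` yields shape `(n, 1)`. The toy example below, with made-up numbers, shows how subtracting the two broadcasts to an `(n, n)` matrix and changes the resulting value.

```python
import numpy as np

y_pred = np.array([1.0, 2.0, 3.0])            # shape (3,), like the output of predict()
y_true_col = np.array([[1.0], [2.0], [4.0]])  # shape (3, 1), like a one-column .values

# Broadcasting (3,) against (3, 1) yields a (3, 3) matrix of all pairwise differences.
pairwise = y_pred - y_true_col
mse_pairwise = np.mean(pairwise ** 2)

# Flattening the target first gives the intended element-wise MSE.
mse = np.mean((y_pred - y_true_col.ravel()) ** 2)

print(pairwise.shape, mse_pairwise, mse)
```

Flattening the target with `.ravel()` (or selecting the column as a Series) before subtracting avoids the issue.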

We finally inspect the predictions of the final federated model on the testing data.

In [None]:
y_predicted = fed_model.predict((X_test-scaling_mean)/scaling_sd)
plt.scatter(y_predicted, y_test, label='model prediction')
plt.xlabel('predicted')
plt.ylabel('target')
plt.title('Federated model testing prediction')

first_diag = np.arange(np.min(y_test.flatten()),
                       np.max(y_test.flatten()+1))
plt.scatter(first_diag, first_diag, label='correct Target')
plt.legend()

In [None]:
a = X_test / scaling_sd
a.shape

In [None]:
X_test.shape

In [None]:
X_test[:,1] / scaling_sd[1] - a[:,1]
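
The last cells check that dividing the `(n_samples, n_features)` array by the `(n_features,)` vector scales each column by its own standard deviation. The same check on a small hypothetical array:

```python
import numpy as np

X = np.array([[70.0, 0.7], [80.0, 0.8]])  # hypothetical (n_samples, n_features) array
sd = np.array([10.0, 0.05])               # one scale per feature, shape (n_features,)

scaled = X / sd                           # sd is broadcast across the rows
diff = X[:, 1] / sd[1] - scaled[:, 1]     # same comparison as in the cell above

print(scaled.shape, diff)
```

A difference of zero for every sample confirms that broadcasting applies the scaling column by column.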