# Fed-BioMed to train a federated SGD regressor model

## Data 


This tutorial shows how to use Fed-BioMed to solve a federated regression problem with scikit-learn.

In this tutorial we are using the wrapper of Fed-BioMed for the SGD regressor (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html).
The goal of the notebook is to train a model on a realistic dataset of (synthetic) medical information mimicking the ADNI dataset (http://adni.loni.usc.edu/). 

## Creating nodes

To proceed with the tutorial, we create 3 clients with corresponding dataframes of clinical information in .csv format. Each client has 300 data points composed of several features corresponding to clinical and medical imaging information. **The data is entirely synthetic and randomly sampled to mimic the variability of the real ADNI dataset**. The training partitions are available at the following link:

https://drive.google.com/file/d/1R39Ir60oQi8ZnmHoPz5CoGCrVIglcO9l/view?usp=sharing

The federated task we aim to solve is to predict a clinical variable (the mini-mental state examination, MMSE) from a combination of demographic and imaging features. The regressor variables are the following features:

['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']

and the target variable is:

['MMSE.bl']
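
As a minimal sketch of how these column lists split a client table into a design matrix and a target vector, the snippet below builds a tiny made-up dataframe and selects the two groups of columns with pandas. The values are random numbers invented for illustration; they are not taken from the tutorial's partitions.

```python
import numpy as np
import pandas as pd

regressors_col = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl',
                  'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
target_col = ['MMSE.bl']

# Hypothetical stand-in for one client's .csv file: 4 rows of random numbers.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(4, 9)), columns=regressors_col + target_col)

X = toy[regressors_col].values       # shape (4, 8): one column per regressor
y = toy[target_col].values.ravel()   # shape (4,): flattened target vector

print(X.shape, y.shape)
```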
    

To create the federated dataset, we follow the standard procedure for node creation/population of Fed-BioMed. 
After activating the fedbiomed network with the commands

`source ./scripts/fedbiomed_environment network`

and 

`./scripts/fedbiomed_run network`

we create a first node by using the commands

`source ./scripts/fedbiomed_environment node`

`./scripts/fedbiomed_run node start`

We then populate the node with the data of the first client:

`./scripts/fedbiomed_run node config conf.ini add`

Then, we select option 1 (csv dataset) to add the .csv partition of client 1 by picking the .csv file of client 1. We use `adni` as the tag to save the selected dataset. We can further check that the data has been added by executing `./scripts/fedbiomed_run node list`.

Following the same procedure, we create the other two nodes with the datasets of client 2 and client 3 respectively. To do so, we add and launch a `Node` using a different configuration file for each.

## Fed-BioMed Researcher

We are now ready to start the researcher environment with the command `source ./scripts/fedbiomed_environment researcher`, and open the Jupyter notebook with `./scripts/fedbiomed_run researcher`.

We can first query the network for the adni dataset. In this case, the nodes share their respective partitions using the same tag `adni`:

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from fedbiomed.researcher.requests import Requests
req = Requests()
req.list(verbose=True)

The code for the model and the data loader of the sklearn SGDRegressor can now be deployed in Fed-BioMed.
We first import the necessary class `SGDSkLearnModel` from `fedbiomed` and define a training plan with two methods:

**__init__** : here we add the sklearn libraries that are needed.

**training_data** : here you must return the pair (data, targets), wrapped in a `DataManager`; both must have the types expected by the parameters of the `partial_fit` method.

We note that this model performs a common standardization across federated datasets by **centering with respect to the same parameters**.

In [8]:
from fedbiomed.common.training_plans import SGDSkLearnModel
from fedbiomed.common.data import DataManager

class SGDRegressorTrainingPlan(SGDSkLearnModel):
    def __init__(self, model_args: dict = {}):
        super(SGDRegressorTrainingPlan, self).__init__(model_args)
        self.add_dependency(["from sklearn.linear_model import SGDRegressor"])
    
    def training_data(self):
        dataset = pd.read_csv(self.dataset_path,delimiter=',')
        regressors_col = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl',
                          'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
        target_col = ['MMSE.bl']
        
        # mean and standard deviation for normalizing dataset
        # it has been computed over the whole dataset
        scaling_mean = np.array([0.8, 72.3, 16.2, 0.7, 0.0, 0.0, 0.0, 0.0])
        scaling_sd = np.array([3.5e-01, 7.3e+00, 2.7e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])
        
        X = (dataset[regressors_col].values-scaling_mean)/scaling_sd
        y = dataset[target_col]
        return DataManager(dataset=X, target=y.values.ravel())
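
To illustrate why standardizing with the same parameters on every node matters, here is a minimal numpy sketch that is independent of Fed-BioMed. The two clients, their single feature and the helper names `standardize_shared` and `standardize_local` are hypothetical; only the idea of a shared mean and standard deviation comes from the training plan above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical clients whose single feature has different local means.
client_a = rng.normal(loc=70.0, scale=7.0, size=(300, 1))
client_b = rng.normal(loc=75.0, scale=7.0, size=(300, 1))

# Shared parameters, assumed to be known by every node in advance.
shared_mean = np.array([72.3])
shared_sd = np.array([7.3])

def standardize_shared(x):
    """Scale with the same parameters on every client."""
    return (x - shared_mean) / shared_sd

def standardize_local(x):
    """Scale with parameters estimated on the client's own data."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Local scaling centers each client on its own mean, which hides the
# offset between clients; shared scaling preserves it.
print(standardize_local(client_a).mean(), standardize_local(client_b).mean())
print(standardize_shared(client_a).mean(), standardize_shared(client_b).mean())
```

With shared parameters, the same raw value maps to the same scaled value on every node, so the coefficients learned locally refer to the same feature scale before they are aggregated.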
    

**model_args** is a dictionary containing your model arguments; in the case of SGDRegressor these include `max_iter` and `tol`. `n_features` is provided to correctly initialize the SGDRegressor `coef_` array.

**training_args** is a dictionary with parameters related to Federated Learning. 

In [12]:
from fedbiomed.common.metrics import MetricTypes
RANDOM_SEED = 1234


model_args = {
    'max_iter':2000,
    'tol': 1e-5,
    'eta0':0.05,
    'model': 'SGDRegressor',
    'n_features': 8,
    'random_state': RANDOM_SEED
}

training_args = {
    'epochs': 5,
    'test_ratio':.3,
    'test_metric': MetricTypes.ACCURACY,
    'test_on_local_updates': True,
    'test_on_global_updates': True
}
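
The sketch below shows, outside of Fed-BioMed, how the scikit-learn part of such a `model_args` dictionary maps onto an `SGDRegressor`. The data is made up, and calling `partial_fit` once per epoch is an assumption about what a local training loop may look like, not a description of Fed-BioMed internals.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model_args = {'max_iter': 2000, 'tol': 1e-5, 'eta0': 0.05,
              'model': 'SGDRegressor', 'n_features': 8, 'random_state': 1234}
epochs = 5

# Keep only the keys that SGDRegressor itself accepts; 'model' and
# 'n_features' are not scikit-learn parameters.
valid_keys = SGDRegressor().get_params().keys()
sk_args = {k: v for k, v in model_args.items() if k in valid_keys}
model = SGDRegressor(**sk_args)

# Made-up standardized data with the expected number of features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, model_args['n_features']))
y = X @ np.arange(1.0, 9.0) + rng.normal(scale=0.1, size=300)

# Assumption: one partial_fit pass over the local data per epoch.
for _ in range(epochs):
    model.partial_fit(X, y)

print(sorted(sk_args))
print(model.coef_.shape)
```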

The experiment can now be defined by providing the `adni` tag and running the local training on the nodes with the training plan passed as `model_class`, the standard `aggregator` (FedAvg) and the default `node_selection_strategy` (all nodes used). With the settings below, federated learning is performed over 5 optimization rounds.
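
As a rough illustration of the federated averaging idea, the following sketch computes a sample-size-weighted average of per-node parameters with numpy. The function `fed_avg` and the parameter values are hypothetical; this is not the code of Fed-BioMed's `FedAverage`, and the exact weighting it applies is not stated in this tutorial.

```python
import numpy as np

def fed_avg(params_list, n_samples):
    """Sample-size-weighted average of per-node parameter dictionaries."""
    weights = np.asarray(n_samples, dtype=float)
    weights /= weights.sum()
    return {
        key: sum(w * np.asarray(p[key]) for w, p in zip(weights, params_list))
        for key in params_list[0]
    }

# Hypothetical local updates from three nodes with 300 samples each.
node_params = [
    {'coef_': np.array([1.0, 2.0]), 'intercept_': np.array([0.0])},
    {'coef_': np.array([3.0, 2.0]), 'intercept_': np.array([3.0])},
    {'coef_': np.array([2.0, 5.0]), 'intercept_': np.array([0.0])},
]
global_params = fed_avg(node_params, n_samples=[300, 300, 300])
print(global_params['coef_'], global_params['intercept_'])
```

With equal sample counts the weights are all 1/3, so the result is the plain mean of the node parameters.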

In [15]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['adni']

# Add more rounds for results with better accuracy
#
#rounds = 40
rounds = 5

# select nodes participating to this experiment
exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=SGDRegressorTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

2022-03-30 16:02:15,868 fedbiomed INFO - Searching dataset with data tags: ['adni'] for all nodes
2022-03-30 16:02:25,881 fedbiomed INFO - Node selected for training -> node_ad006bab-e62d-4745-948c-604a37b7f170
2022-03-30 16:02:25,887 fedbiomed DEBUG - Model file has been saved: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0017/my_model_db01f25c-90cc-41b4-a8e4-8aabef9edeff.py
2022-03-30 16:02:25,949 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0017/my_model_db01f25c-90cc-41b4-a8e4-8aabef9edeff.py successful, with status code 201
2022-03-30 16:02:25,972 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0017/aggregated_params_init_2b15f0bd-4096-43d2-8425-a36ed3837484.pt successful, with status code 201


In [16]:
# start federated training
exp.run()

2022-03-30 16:02:25,978 fedbiomed INFO - Sampled nodes in round 0 ['node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-30 16:02:25,980 fedbiomed INFO - Sending request 
					 To: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Reqeust: : Perform training with the arguments: {'researcher_id': 'researcher_ad3c024c-fb12-4ca1-9204-0f6b9220bed8', 'job_id': '5ed9d553-c1bb-4cb8-b3f0-ede1a7a3108f', 'training_args': {'test_ratio': 0.3, 'test_on_local_updates': True, 'test_on_global_updates': True, 'test_metric': <MetricTypes.ACCURACY: (0, <_MetricCategory.CLASSIFICATION_LABELS: 0>)>, 'test_metric_args': {}, 'epochs': 5}, 'training': True, 'model_args': {'max_iter': 2000, 'tol': 1e-05, 'eta0': 0.05, 'model': 'SGDRegressor', 'n_features': 8, 'random_state': 1234, 'verbose': 1}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/30/my_model_db01f25c-90cc-41b4-a8e4-8aabef9edeff.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/30/ag

2022-03-30 16:02:46,116 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_83a59266-44bf-47d3-8127-56815221962a.pt successful, with status code 200
2022-03-30 16:02:46,126 fedbiomed INFO - Nodes that successfully reply in round 1 ['node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-30 16:02:46,176 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0017/aggregated_params_23f3eb9d-d455-44c4-baab-6b26d0a0483e.pt successful, with status code 201
2022-03-30 16:02:46,178 fedbiomed INFO - Saved aggregated params for round 1 in /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0017/aggregated_params_23f3eb9d-d455-44c4-baab-6b26d0a0483e.pt
2022-03-30 16:02:46,179 fedbiomed INFO - Sampled nodes in round 2 ['node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-30 16:02:46,179 fedbiomed INFO - Sending request 
					 To: node_ad006bab-e62d-4745-948c-604a37b7f170 
		

2022-03-30 16:02:56,316 fedbiomed INFO - TESTING ON LOCAL UPDATES 
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Completed: 90/90 (100%) 
 					 ACCURACY: 0.000000 
					 ---------
2022-03-30 16:02:56,336 fedbiomed INFO - INFO
					 NODE node_ad006bab-e62d-4745-948c-604a37b7f170
					 MESSAGE: results uploaded successfully 
-----------------------------------------------------------------
2022-03-30 16:03:06,295 fedbiomed INFO - Downloading model params after training on node_ad006bab-e62d-4745-948c-604a37b7f170 - from http://localhost:8844/media/uploads/2022/03/30/node_params_a689bae8-db12-487e-a2f5-b6cd3faaf132.pt
2022-03-30 16:03:06,300 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_6cc25a5f-9660-483f-8662-35d775fbc4ce.pt successful, with status code 200
2022-03-30 16:03:06,302 fedbiomed INFO - Nodes that successfully reply in round 3 ['node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-30 16:03:06,322 f

5

## Testing

Once the federated model is obtained, it is possible to test it locally on an independent testing partition.
The test dataset is available at this link:

https://drive.google.com/file/d/1zNUGp6TMn6WSKYVC8FQiQ9lJAUdasxk1/

In [None]:
!pip install matplotlib
!pip install gdown

Download the testing dataset to a local temporary folder.

In [None]:
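import os
import gdown
import tempfile
import zipfile

from fedbiomed.researcher.environ import environ

resource = "https://drive.google.com/uc?id=19kxuI146WA2fhcOU2_AvF8dy-ppJkzW7"
# mkdtemp keeps the directory on disk; TemporaryDirectory(...).name would be
# deleted as soon as the temporary object is garbage collected
base_dir = tempfile.mkdtemp(dir=environ['TMP_DIR'])

test_file = os.path.join(base_dir, "test_data.zip")
gdown.download(resource, test_file, quiet=False)

# extract the content of the archive next to the downloaded file
with zipfile.ZipFile(test_file) as zf:
    zf.extractall(base_dir)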
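
The extraction step can be tried offline with the short sketch below. It builds a small archive itself as a stand-in for the downloaded file; the names `toy_data.zip` and `toy.csv` are invented for the example.

```python
import os
import tempfile
import zipfile

# mkdtemp returns a directory that stays on disk until it is removed explicitly.
work_dir = tempfile.mkdtemp()

# Build a small archive so the example runs without any download.
archive = os.path.join(work_dir, "toy_data.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("toy.csv", "a,b\n1,2\n")

# Extract every member of the archive into the same directory.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(work_dir)

extracted = os.path.join(work_dir, "toy.csv")
print(os.path.exists(extracted))
```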


In [None]:
import pandas as pd
import numpy as np


# loading testing dataset
test_data = pd.read_csv(os.path.join(base_dir,'adni_validation.csv'))

In [None]:
from sklearn.linear_model import SGDRegressor
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

Here we extract the relevant regressors and target from the testing data 

In [None]:
regressors_col = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl',
                  'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
target_col = ['MMSE.bl']
X_test = test_data[regressors_col].values
y_test = test_data[target_col].values

To inspect the model evolution across FL rounds, we export `exp.aggregated_params()`, which contains the model parameters collected at the end of each round. The MSE (Mean Squared Error) on the testing data, computed with the federated parameters obtained at each round, should decrease from one round to the next.

In [None]:
scaling_mean = np.array([0.8, 72.3, 16.2, 0.7, 0.0, 0.0, 0.0, 0.0])
scaling_sd = np.array([3.5e-01, 7.3e+00, 2.7e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])

testing_error = []


# we create here several instances of SGDRegressor using the same sklearn arguments
# we have used for Federated Learning training
fed_model = SGDRegressor()
regressor_args = {key: model_args[key] for key in model_args.keys() if key in fed_model.get_params().keys()}

for i in range(rounds):
    fed_model = SGDRegressor()
    fed_model.set_params(**regressor_args)
    fed_model.coef_ = exp.aggregated_params()[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp.aggregated_params()[i]['params']['intercept_'].copy()
    y_pred = fed_model.predict((X_test-scaling_mean)/scaling_sd)
    # flatten y_test to shape (n,) so that the difference is element-wise
    # instead of being broadcast to an (n, n) matrix
    mse = np.mean((y_pred - y_test.ravel())**2)
    testing_error.append(mse)

plt.plot(testing_error)
plt.title('FL testing loss')
plt.xlabel('FL round')
plt.ylabel('testing loss (MSE)')
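
The MSE is the mean of the squared differences between predictions and targets. The numpy sketch below, with made-up numbers, shows the computation and a shape pitfall worth checking: a prediction vector of shape `(n,)` subtracted from a target array of shape `(n, 1)` is broadcast to an `(n, n)` matrix, which changes the value of the mean.

```python
import numpy as np

y_pred = np.array([1.0, 2.0, 3.0])             # shape (3,), like the output of predict()
y_true_2d = np.array([[1.0], [2.0], [4.0]])    # shape (3, 1), like df[['col']].values

# Broadcasting (3,) against (3, 1) silently builds a 3x3 matrix of differences.
diff_broadcast = y_pred - y_true_2d
# Flattening the target first gives the intended element-wise differences.
diff_elementwise = y_pred - y_true_2d.ravel()

mse_broadcast = np.mean(diff_broadcast ** 2)
mse_elementwise = np.mean(diff_elementwise ** 2)

print(diff_broadcast.shape, diff_elementwise.shape)
print(mse_broadcast, mse_elementwise)
```

Only the element-wise version compares each prediction with its own target, so it is the one that corresponds to the usual definition of the MSE.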

We finally inspect the predictions of the final federated model on the testing data.

In [None]:
y_predicted = fed_model.predict((X_test-scaling_mean)/scaling_sd)
plt.scatter(y_predicted, y_test, label='model prediction')
plt.xlabel('predicted')
plt.ylabel('target')
plt.title('Federated model testing prediction')

first_diag = np.arange(np.min(y_test.flatten()),
                       np.max(y_test.flatten()+1))
plt.scatter(first_diag, first_diag, label='correct Target')
plt.legend()

In [None]:
a = X_test / scaling_sd
a.shape

In [None]:
X_test.shape

In [None]:
X_test[:,1] / scaling_sd[1] - a[:,1]