# Fed-BioMed Researcher: training a federated scikit-learn model

## Perceptron
Binary Classification
### Purpose of the exercise
Three datasets, `c1.csv`, `c2.csv` and `c3.csv`, have been generated, each with a target column containing 2 different classes.
We will fit a Perceptron classifier using Federated Learning.

### Get the data 

We generate a synthetic dataset with the `make_classification` function from `sklearn.datasets`.

In [1]:
from sklearn import datasets
import numpy as np

In [2]:
X, y = datasets.make_classification(n_samples=300, n_features=20, n_clusters_per_class=2,
                                    weights=None, flip_y=0.01, class_sep=1.0, hypercube=True,
                                    shift=0.0, scale=1.0, shuffle=True, random_state=123)
X.shape,y.shape

((300, 20), (300,))

In [3]:
np.unique(y)

array([0, 1])

#### Creating an unbalanced dataset, with a different amount of data per center

In [4]:
C1 = X[:150,:]
C2 = X[150:250,:]
C3 = X[250:300,:]

y1 = y[:150].reshape([150,1])
y2 = y[150:250].reshape([100,1])
y3 = y[250:300].reshape([50,1])

C1.shape ,C2.shape , C3.shape , y1.shape, y2.shape, y3.shape

((150, 20), (100, 20), (50, 20), (150, 1), (100, 1), (50, 1))

In [5]:
n1 = np.concatenate((C1, y1), axis=1)
np.savetxt('./data/c1.csv',n1,delimiter=',')

n2 = np.concatenate((C2, y2), axis=1)
np.savetxt('./data/c2.csv',n2,delimiter=',')

n3 = np.concatenate((C3, y3), axis=1)
np.savetxt('./data/c3.csv',n3,delimiter=',')
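
`np.savetxt` writes no header row, so a file saved this way has to be read back without treating its first line as column names (`header=None` in pandas, or `np.loadtxt`). The round trip below is a minimal sketch of that layout. It uses a temporary directory and random stand-in data (both assumptions made for illustration), not the tutorial's own files.

```python
import os
import tempfile

import numpy as np

# Random stand-in data (hypothetical): 5 samples, 20 features, binary target.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 20))
target = rng.integers(0, 2, size=(5, 1))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    # Same layout as c1.csv: feature columns first, target in the last column.
    np.savetxt(path, np.concatenate((features, target), axis=1), delimiter=",")
    # np.loadtxt treats every line as data, so no sample is lost.
    loaded = np.loadtxt(path, delimiter=",")

# The target column comes back as floats (0.0 / 1.0).
print(loaded.shape)
```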

### Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

### Setting the node up
Before running this notebook, you need to configure 2 nodes: <br/>
* **Node 1:** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter 'perp' for all of them)
  * Pick the c1.csv file on your machine (in `notebooks/data/c1.csv`)
  * Check that your data has been added to node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2:** Open a second terminal and run `./scripts/fedbiomed_run node config n2.ini add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter 'perp' for all of them)
  * Pick the c2.csv file on your machine (in `notebooks/data/c2.csv`)
  * Check that your data has been added to node 2 by executing `./scripts/fedbiomed_run node config n2.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n2.ini start`.
 

 Wait until you see `Starting task manager`: this means the node is online.


In [None]:
%load_ext autoreload
%autoreload 2

**model_args** is a dictionary containing your model arguments. For the Perceptron used here, these are `max_iter` and `tol`, together with `n_features` and `n_classes`.

**training_args** is a dictionary with parameters related to Federated Learning.

In [7]:
n_features = 20
n_classes = 2

model_args = {'max_iter':1000, 'tol': 1e-3 , 
               'n_features' : n_features, 'n_classes' : n_classes}

training_args = {   
    'epochs': 5,
    'loader_args': { 'batch_size': 1 }
}

Below is the template of the class you should provide to Fed-BioMed:
    
**training_data**: this method must return the data and targets `(X, y)`, wrapped in a `DataManager`, in a type accepted by your model's `partial_fit` method.

In [9]:
from fedbiomed.common.training_plans import FedPerceptron
from fedbiomed.common.data import DataManager


class PerceptronTraining(FedPerceptron):
    def training_data(self):
        NUMBER_COLS = 20
        dataset = pd.read_csv(self.dataset_path,header=None,delimiter=',')
        X = dataset.iloc[:,0:NUMBER_COLS].values
        y = dataset.iloc[:,NUMBER_COLS]       
        return DataManager(dataset=X,target=y.values, shuffle=True)
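
The logs in this notebook show `'loss': 'perceptron'` and an `SGDClassifier`, so the sketch below uses plain scikit-learn to illustrate what incremental fitting with `partial_fit` expects as input. It is not Fed-BioMed's implementation. The batch size of 10 and the generated data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data with the same number of features as the tutorial.
X, y = make_classification(n_samples=150, n_features=20, random_state=123)

# A perceptron expressed as an SGDClassifier with the perceptron loss.
model = SGDClassifier(loss="perceptron", random_state=0)

# partial_fit takes array-like X and y; the full list of classes is needed
# on the first call because a single batch may not contain every class.
classes = np.unique(y)
for start in range(0, len(X), 10):
    model.partial_fit(X[start:start + 10], y[start:start + 10], classes=classes)

# For a binary problem there is one weight vector and one intercept.
print(model.coef_.shape, model.intercept_.shape)
```

This is why `training_data` returns NumPy arrays (`.values`) rather than file paths or raw text.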

In [10]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['perp']
rounds = 2

# search for corresponding datasets across nodes datasets
exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=PerceptronTraining,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)


2023-08-08 10:29:48,247 fedbiomed INFO - Messaging researcher_a21d2c82-e89c-461b-b6f1-a57155e551de successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x12c8b2350>
2023-08-08 10:29:48,345 fedbiomed INFO - Searching dataset with data tags: ['perp'] for all nodes
2023-08-08 10:29:58,369 fedbiomed INFO - Node selected for training -> node_44bcb5b9-b589-45f4-b0de-31303a225194
2023-08-08 10:29:58,375 fedbiomed DEBUG - Using native Sklearn Optimizer
2023-08-08 10:29:58,379 fedbiomed DEBUG - Model file has been saved: /Users/fcremone/dev/fedbiomed/var/experiments/Experiment_0039/my_model_579a391a-cbbf-4a8b-93ca-db77c30fd66a.py
2023-08-08 10:29:58,504 fedbiomed DEBUG - HTTP POST request of file /Users/fcremone/dev/fedbiomed/var/experiments/Experiment_0039/my_model_579a391a-cbbf-4a8b-93ca-db77c30fd66a.py successful, with status code 201
2023-08-08 10:29:58,603 fedbiomed DEBUG - HTTP POST request of file /Users/fcremone/dev/fedbiomed/var/experim

In [11]:
exp.run()

2023-08-08 10:30:00,439 fedbiomed INFO - Sampled nodes in round 0 ['node_44bcb5b9-b589-45f4-b0de-31303a225194']
2023-08-08 10:30:00,441 fedbiomed INFO - Sending request 
					 To: node_44bcb5b9-b589-45f4-b0de-31303a225194 
					 Request: Perform training with the arguments: {'researcher_id': 'researcher_a21d2c82-e89c-461b-b6f1-a57155e551de', 'job_id': 'c4e8e248-b8ee-47ec-9d03-ddfd49ab3f8f', 'training_args': {'epochs': 5, 'loader_args': {'batch_size': 1}, 'optimizer_args': {}, 'num_updates': None, 'dry_run': False, 'batch_maxnum': None, 'test_ratio': 0.0, 'test_on_local_updates': False, 'test_on_global_updates': False, 'test_metric': None, 'test_metric_args': {}, 'log_interval': 10, 'fedprox_mu': None, 'use_gpu': False, 'dp_args': None, 'share_persistent_buffers': True, 'random_seed': None}, 'training': True, 'model_args': {'max_iter': 1000, 'tol': 0.001, 'n_features': 20, 'n_classes': 2, 'loss': 'perceptron', 'verbose': 1}, 'round': 0, 'secagg_servkey_id': Non

2
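
To give an intuition for the aggregation step, here is a minimal sketch of federated averaging. It assumes each node's parameters are weighted by its number of training samples, as in the standard FedAvg formulation; the details of Fed-BioMed's `FedAverage` may differ. The function name `federated_average` and the parameter values are hypothetical.

```python
import numpy as np


def federated_average(params_per_node, samples_per_node):
    """Average each named parameter, weighting nodes by their sample count."""
    weights = np.asarray(samples_per_node, dtype=float)
    weights /= weights.sum()
    return {
        name: sum(w * node[name] for w, node in zip(weights, params_per_node))
        for name in params_per_node[0]
    }


# Hypothetical local results from two nodes holding 150 and 100 samples.
node_1 = {"coef_": np.full((1, 20), 1.0), "intercept_": np.array([0.0])}
node_2 = {"coef_": np.full((1, 20), 2.0), "intercept_": np.array([1.0])}

aggregated = federated_average([node_1, node_2], [150, 100])

# Weights are 0.6 and 0.4, so the larger node pulls the average towards it.
print(aggregated["coef_"][0, 0], aggregated["intercept_"][0])
```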

## Let's validate the trained model with the test dataset c3.csv

In [12]:
import pandas as pd

In [13]:
data = pd.read_csv('./data/c3.csv')

In [14]:
X_test = data.iloc[:,:n_features]
y_test = data.iloc[:,n_features]

F1 score computed with the federated algorithm:

To do so, we use `exp.aggregated_params()`, which contains the model parameters collected at the end of each round.

In [15]:
from sklearn.metrics import f1_score
loss_metric = f1_score

testing_error = []

for i in range(rounds):
    # load the parameters aggregated at the end of round i into a fresh model
    fed_model = exp.training_plan().model()
    fed_model.coef_ = exp.aggregated_params()[i]['params']['coef_']
    fed_model.intercept_ = exp.aggregated_params()[i]['params']['intercept_']
    # scikit-learn metrics expect the arguments in the order (y_true, y_pred)
    metric = loss_metric(y_test.values, fed_model.predict(X_test))
    print('F1 score metric: ', metric)
    testing_error.append(metric)

F1 score metric:  0.830188679245283
F1 score metric:  0.8333333333333334


X has feature names, but SGDClassifier was fitted without feature names
X has feature names, but SGDClassifier was fitted without feature names
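
As a sketch of what happens once `coef_` and `intercept_` are set, the snippet below reproduces the decision rule of a binary linear classifier and the F1 formula in plain NumPy. The parameters and test points are hypothetical, and the classes are assumed to be labelled 0 and 1 as in this dataset.

```python
import numpy as np


def predict_binary(X, coef, intercept):
    """Linear decision rule: class 1 when the score is positive, else class 0."""
    scores = X @ coef.T + intercept
    return (scores.ravel() > 0).astype(int)


def f1_binary(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Hypothetical parameters and test points, for illustration only.
coef = np.array([[1.0, -1.0]])
intercept = np.array([0.0])
X_demo = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 0.5], [0.5, 1.0]])
y_demo = np.array([1, 0, 0, 0])

y_hat = predict_binary(X_demo, coef, intercept)
print(y_hat, f1_binary(y_demo, y_hat))
```

Because the formula treats false positives and false negatives symmetrically, the binary F1 score keeps the same value when the two arguments are swapped. Precision and recall taken separately do not.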


## SGD regressor

### Data 


This tutorial shows how to use Fed-BioMed to solve a federated regression problem with scikit-learn.

In this tutorial we are using the wrapper of Fed-BioMed for the [SGD regressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html).
The goal of the notebook is to train a model on a realistic dataset of (synthetic) medical information mimicking the ADNI dataset (http://adni.loni.usc.edu/). 

### Creating nodes

To proceed with the tutorial, we create 3 clients with corresponding dataframes of clinical information in .csv format. Each client has 300 data points composed of several features corresponding to clinical and medical imaging information. **The data is entirely synthetic and randomly sampled to mimic the variability of the real ADNI dataset**. The training partitions are available at the following link:

https://drive.google.com/file/d/1R39Ir60oQi8ZnmHoPz5CoGCrVIglcO9l/view?usp=sharing
or can be found under `notebooks/data/CSV/pseudo_adni_mod.csv`

The federated task we aim to solve is to predict a clinical variable (the mini-mental state examination, MMSE) from a combination of demographic and imaging features. The regressor variables are the following features:

['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']

and the target variable is:

['MMSE.bl']
    

To create the federated dataset, we follow the standard procedure for node creation/population of Fed-BioMed. 
After activating the fedbiomed network with the commands

`source ./scripts/fedbiomed_environment network`

and 

`./scripts/fedbiomed_run network`

we create a first node by using the commands

`source ./scripts/fedbiomed_environment node`

`./scripts/fedbiomed_run node start`

We then populate the node with the data of the first client:

`./scripts/fedbiomed_run node config conf.ini add`

Then, we select option 1 (csv dataset) to add the .csv partition of client 1, by just picking the .csv of client 1. We use `adni` as the tag to save the selected dataset. We can further check that the data has been added by executing `./scripts/fedbiomed_run node list`

Following the same procedure, we create the other two nodes with the datasets of client 2 and client 3 respectively. To do so, we add and launch a `Node` using other configuration files.

### Fed-BioMed Researcher

We are now ready to start the researcher environment with the command `source ./scripts/fedbiomed_environment researcher`, and open the Jupyter notebook with `./scripts/fedbiomed_run researcher start`. 

We can first query the network for the adni dataset. In this case, the nodes are sharing their respective partitions using the same tag `adni`:

In [None]:
from fedbiomed.researcher.requests import Requests
req = Requests()
req.list(verbose=True)

The code for the model and data loader of the sklearn SGDRegressor can now be deployed in Fed-BioMed.
We first import the necessary class `FedSGDRegressor` from `fedbiomed.common.training_plans`:

       
**training_data**: this method must return the data and targets, wrapped in a `DataManager`, in a type accepted by your model's `partial_fit` method.

We note that this model performs a common standardization across federated datasets by **centering with respect to the same parameters**.

In [None]:
from fedbiomed.common.training_plans import FedSGDRegressor
from fedbiomed.common.data import DataManager

class SGDRegressorTrainingPlan(FedSGDRegressor):
    def training_data(self):
        dataset = pd.read_csv(self.dataset_path,delimiter=';')
        regressors_col = ['AGE', 'WholeBrain.bl',
                          'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
        target_col = ['MMSE.bl']
        
        # mean and standard deviation for normalizing dataset
        # it has been computed over the whole dataset
        scaling_mean = np.array([72.3, 0.7, 0.0, 0.0, 0.0, 0.0])
        scaling_sd = np.array([7.3e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])
        
        X = (dataset[regressors_col].values-scaling_mean)/scaling_sd
        y = dataset[target_col]
        return DataManager(dataset=X, target=y.values.ravel(), shuffle=True)
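
The reason for using the same `scaling_mean` and `scaling_sd` on every node can be illustrated with a small NumPy sketch. The two "nodes" and their distributions below are hypothetical. The point is only that shared parameters map a given raw value to the same standardized value everywhere, whereas per-node statistics would not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical nodes whose local data follow different distributions.
node_a = rng.normal(loc=70.0, scale=5.0, size=(100, 1))
node_b = rng.normal(loc=80.0, scale=5.0, size=(100, 1))

# Shared parameters, computed once over the pooled data (assumed to be known).
pooled = np.concatenate((node_a, node_b))
shared_mean, shared_sd = pooled.mean(axis=0), pooled.std(axis=0)

# With shared scaling, the difference between the two nodes is preserved...
shared_gap = (((node_b - shared_mean) / shared_sd).mean()
              - ((node_a - shared_mean) / shared_sd).mean())
# ...whereas scaling each node with its own statistics erases it.
local_gap = (((node_b - node_b.mean()) / node_b.std()).mean()
             - ((node_a - node_a.mean()) / node_a.std()).mean())

print(round(float(shared_gap), 2), round(float(local_gap), 2))
```

With per-node scaling, the same coefficient would act on differently scaled inputs at each node, which makes the averaged model harder to interpret.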
    

**model_args** is a dictionary containing your model arguments, in case of SGDRegressor this will be max_iter and tol. n_features is provided to correctly initialize the SGDRegressor coef_ array.

**training_args** is a dictionary with parameters related to Federated Learning. 

In [None]:
from fedbiomed.common.metrics import MetricTypes
RANDOM_SEED = 1234


model_args = {
    'max_iter':2000,
    'tol': 1e-5,
    'eta0':0.05,
    'n_features': 6,
    'random_state': RANDOM_SEED
}

training_args = {
    'epochs': 5,
    'loader_args': { 'batch_size': 10, },
    'test_ratio':.3,
    'test_metric': MetricTypes.MEAN_SQUARE_ERROR,
    'test_on_local_updates': True,
    'test_on_global_updates': True
}
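
As a rough illustration of what a `test_ratio` of 0.3 and a mean squared error metric amount to, the sketch below holds out 30% of 300 samples and computes the MSE on two made-up predictions. In Fed-BioMed the split and the evaluation are handled on the nodes. The helper names `split_indices` and `mean_squared_error` are hypothetical and only show the arithmetic.

```python
import numpy as np


def split_indices(n_samples, test_ratio, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


train_idx, test_idx = split_indices(300, 0.3)
print(len(train_idx), len(test_idx))

# Two made-up MMSE targets and predictions, each off by one point.
print(mean_squared_error([28.0, 30.0], [27.0, 29.0]))
```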

The experiment can now be defined by providing the `adni` tag and running the local training on the nodes with the training plan given in `training_plan_class`, the standard `aggregator` (FedAvg) and the default `node_selection_strategy` (all nodes used). Federated learning is performed over the number of optimization rounds set in `rounds` (2 here; increase it for better accuracy).

In [None]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['adni']

# Add more rounds for results with better accuracy
#
#rounds = 40
rounds = 2

# select nodes participating in this experiment
exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=SGDRegressorTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

In [None]:
# start federated training
exp.run()

In [None]:
exp.aggregated_params()

In [None]:
fed_model = exp.training_plan().model()
fed_model.intercept_ = exp.aggregated_params()[rounds-1]['params']['intercept_']
fed_model.coef_ = exp.aggregated_params()[rounds-1]['params']['coef_']
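
When using the final model on new data, the inputs need the same standardization as in `training_data`. The sketch below shows this with plain NumPy. The `scaling_mean` and `scaling_sd` values are the ones from the training plan above, while `coef` and `intercept` are hypothetical stand-ins for the aggregated parameters.

```python
import numpy as np

# Scaling parameters copied from the training plan.
scaling_mean = np.array([72.3, 0.7, 0.0, 0.0, 0.0, 0.0])
scaling_sd = np.array([7.3e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])

# Hypothetical aggregated parameters (n_features = 6), for illustration only.
coef = np.array([-0.5, 1.0, -0.3, 0.8, 0.2, 0.4])
intercept = np.array([27.0])

# Two raw samples: one at the mean, one a standard deviation above it.
raw = np.array([scaling_mean, scaling_mean + scaling_sd])

# Standardize first, then apply the linear model.
X_std = (raw - scaling_mean) / scaling_sd
predictions = X_std @ coef + intercept

# A sample at the mean standardizes to zeros, so its prediction is the intercept.
print(predictions)
```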

## SGDClassifier
### Purpose of the exercise

Three datasets, `c1_3class.csv`, `c2_3class.csv` and `c3_3class.csv`, have been generated, each with a target column containing 3 different classes.
We will fit an SGDClassifier using Federated Learning.

### Get the data 

We generate a synthetic dataset with the `make_classification` function from `sklearn.datasets`.

In [None]:
from sklearn import datasets
import numpy as np

In [None]:
X, y = datasets.make_classification(n_samples=300, n_features=20, n_informative=3, n_classes=3,
                                    n_clusters_per_class=2, weights=None, flip_y=0.01,
                                    class_sep=1.0, hypercube=True, shift=0.0, scale=1.0,
                                    shuffle=True, random_state=123)
X.shape,y.shape

In [None]:
C1 = X[:150,:]
C2 = X[150:250,:]
C3 = X[250:300,:]

y1 = y[:150].reshape([150,1])
y2 = y[150:250].reshape([100,1])
y3 = y[250:300].reshape([50,1])

C1.shape ,C2.shape , C3.shape , y1.shape, y2.shape, y3.shape

In [None]:
n1 = np.concatenate((C1, y1), axis=1)
np.savetxt('./data/c1_3class.csv',n1,delimiter=',')

n2 = np.concatenate((C2, y2), axis=1)
np.savetxt('./data/c2_3class.csv',n2,delimiter=',')

n3 = np.concatenate((C3, y3), axis=1)
np.savetxt('./data/c3_3class.csv',n3,delimiter=',')

### Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

### Setting the node up
Before running this notebook, you need to configure 2 nodes: <br/>
* **Node 1:** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter **'perp1'** for all of them)
  * Pick the c1_3class.csv file on your machine (in `notebooks/data/c1_3class.csv`)
  * Check that your data has been added to node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2:** Open a second terminal and run `./scripts/fedbiomed_run node config n2.ini add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter **'perp1'** for all of them)
  * Pick the c2_3class.csv file on your machine (in `notebooks/data/c2_3class.csv`)
  * Check that your data has been added to node 2 by executing `./scripts/fedbiomed_run node config n2.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n2.ini start`.
 

 Wait until you see `Starting task manager`: this means the node is online.


In [None]:
%load_ext autoreload
%autoreload 2

**model_args** is a dictionary containing your model arguments. For the SGDClassifier used here, these are `max_iter` and `tol`, together with `n_features` and `n_classes`.

**training_args** is a dictionary with parameters related to Federated Learning.

In [16]:
n_features = 20
n_classes = 3

model_args = {'max_iter':1000, 'tol': 1e-3 , 
               'n_features' : n_features, 'n_classes' : n_classes}

training_args = {   
    'epochs': 5,
    'loader_args': { 'batch_size': 1, },
}

Below is the template of the class you should provide to Fed-BioMed:
    
**training_data**: this method must return the data and targets `(X, y)`, wrapped in a `DataManager`, in a type accepted by your model's `partial_fit` method.

In [18]:
from fedbiomed.common.training_plans import FedSGDClassifier
from fedbiomed.common.data import DataManager


class SGDClassifierTrainingPlan(FedSGDClassifier):
    def training_data(self):
        NUMBER_COLS = 20
        dataset = pd.read_csv(self.dataset_path,header=None,delimiter=',')
        X = dataset.iloc[:,0:NUMBER_COLS].values
        y = dataset.iloc[:,NUMBER_COLS]       
        return DataManager(dataset=X,target=y.values, shuffle=True)

In [21]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['perp']
rounds = 2

# search for corresponding datasets across nodes datasets
exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=SGDClassifierTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)


2023-08-08 10:31:42,761 fedbiomed INFO - Searching dataset with data tags: ['perp'] for all nodes
2023-08-08 10:31:52,780 fedbiomed INFO - Node selected for training -> node_44bcb5b9-b589-45f4-b0de-31303a225194
2023-08-08 10:31:52,784 fedbiomed DEBUG - Using native Sklearn Optimizer
2023-08-08 10:31:52,787 fedbiomed DEBUG - Model file has been saved: /Users/fcremone/dev/fedbiomed/var/experiments/Experiment_0041/my_model_a5d89520-633b-42c3-9489-7224172d856d.py
2023-08-08 10:31:52,839 fedbiomed DEBUG - HTTP POST request of file /Users/fcremone/dev/fedbiomed/var/experiments/Experiment_0041/my_model_a5d89520-633b-42c3-9489-7224172d856d.py successful, with status code 201
2023-08-08 10:31:52,899 fedbiomed DEBUG - HTTP POST request of file /Users/fcremone/dev/fedbiomed/var/experiments/Experiment_0041/aggregated_params_88c27c0b-d967-4941-ad1e-fbe436850764.mpk successful, with status code 201


In [22]:
exp.run()

2023-08-08 10:31:52,908 fedbiomed INFO - Sampled nodes in round 0 ['node_44bcb5b9-b589-45f4-b0de-31303a225194']
2023-08-08 10:31:52,910 fedbiomed INFO - Sending request 
					 To: node_44bcb5b9-b589-45f4-b0de-31303a225194 
					 Request: Perform training with the arguments: {'researcher_id': 'researcher_a21d2c82-e89c-461b-b6f1-a57155e551de', 'job_id': 'f961d4b2-3b1e-4416-ab88-722567a1b569', 'training_args': {'epochs': 5, 'loader_args': {'batch_size': 1}, 'optimizer_args': {}, 'num_updates': None, 'dry_run': False, 'batch_maxnum': None, 'test_ratio': 0.0, 'test_on_local_updates': False, 'test_on_global_updates': False, 'test_metric': None, 'test_metric_args': {}, 'log_interval': 10, 'fedprox_mu': None, 'use_gpu': False, 'dp_args': None, 'share_persistent_buffers': True, 'random_seed': None}, 'training': True, 'model_args': {'max_iter': 1000, 'tol': 0.001, 'n_features': 20, 'n_classes': 3, 'verbose': 1}, 'round': 0, 'secagg_servkey_id': None, 'secagg_biprime_id'

2

In [None]:
import pandas as pd
# the file was written by np.savetxt without a header row, hence header=None
data = pd.read_csv('./data/c3_3class.csv', header=None)

In [None]:
X_test = data.iloc[:,:n_features]
y_test = data.iloc[:,n_features]

Classification metrics (including the F1 score) computed with the federated algorithm:

To do so, we use `exp.aggregated_params()`, which contains the model parameters collected at the end of each round.

In [None]:
from sklearn.metrics import classification_report, f1_score
loss_metric = f1_score
    
testing_error = []

for i in range(rounds):
    fed_model = exp.training_plan().model()
    fed_model.coef_ = exp.aggregated_params()[i]['params']['coef_']
    fed_model.intercept_ = exp.aggregated_params()[i]['params']['intercept_']
    print(f'Model trained in round {i}')
    print('-------------------------')
    print(classification_report(y_test, fed_model.predict(X_test), digits=3))
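
With 3 classes, scikit-learn's `SGDClassifier` trains one linear scorer per class (one-versus-all), so `coef_` has one row per class and the predicted label is the class with the highest score. The sketch below reproduces that decision rule in NumPy with hypothetical two-feature parameters. It is meant as an illustration of the rule, not of Fed-BioMed internals.

```python
import numpy as np


def predict_multiclass(X, coef, intercept, classes):
    """Pick, for each sample, the class whose linear score is the largest."""
    scores = X @ coef.T + intercept  # shape: (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]


# Hypothetical parameters: 3 classes, 2 features (the tutorial uses 20).
classes = np.array([0, 1, 2])
coef = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
intercept = np.zeros(3)

X_demo = np.array([[2.0, 0.5], [0.1, 3.0], [-2.0, -2.0]])
labels = predict_multiclass(X_demo, coef, intercept, classes)
print(labels)
```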