# Fed-BioMed Researcher: training a federated scikit-learn model

## Purpose of the exercise

Three datasets, n1.csv, n2.csv and n3.csv, have been generated randomly using the linear transformation A = [ 5 8 9 5 0 ].
We will fit a stochastic gradient descent regressor (scikit-learn's `SGDRegressor`) to approximate this transformation using federated learning.
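
To make the setup concrete, here is a minimal sketch of how a dataset of this kind could be generated. It is an assumption, not the script that produced n1.csv, n2.csv and n3.csv: the seed, the noise level, the sample count and the helper name `make_dataset` are illustrative. The column layout (five feature columns followed by the target) is chosen to match what `training_data` reads later in this notebook.

```python
import numpy as np

# Assumption: the real CSV files may have been produced with different
# seeds, noise levels and sample counts.
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])

def make_dataset(n_samples, seed):
    """Return an (n_samples, 6) array: 5 feature columns, then the target."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n_samples, A.size)
    y = X @ A + rng.randn(n_samples)  # linear target plus unit Gaussian noise
    return np.column_stack([X, y])

data = make_dataset(n_samples=50, seed=0)
print(data.shape)  # (50, 6)
```

Such an array could be written to disk with `np.savetxt("my_data.csv", data, delimiter=",")`, where the file name is again only an example.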

## Extending this notebook to any incremental learning scikit-learn model

The federated learning scheme below applies to any scikit-learn model that supports the `partial_fit()` method.

A whole family of models could therefore be integrated into Fed-BioMed following the same approach, for example:
- naive Bayes,
- logistic regression,
- SVM/SVC (linear and non-linear),
- perceptron,
- KMeans,
- incremental PCA,
- mini-batch dictionary learning,
- latent Dirichlet allocation.
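
As a local illustration of what `partial_fit()` provides, the sketch below trains an `SGDRegressor` chunk by chunk on synthetic data. It runs without Fed-BioMed. The data, the chunk size of 50 and the number of passes are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (assumption: same linear transformation as in this notebook)
rng = np.random.RandomState(0)
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])
X = rng.randn(200, 5)
y = X @ A + 0.1 * rng.randn(200)

model = SGDRegressor(random_state=0)
# Each call to partial_fit runs one SGD pass over the given chunk and
# keeps the parameters learned so far, so training can be resumed at will.
for epoch in range(20):
    for start in range(0, len(X), 50):
        model.partial_fit(X[start:start + 50], y[start:start + 50])

print(np.round(model.coef_, 1))
```

Because the model state survives between calls, the parameters can be exchanged and updated between training passes, which is what makes this family of models a natural fit for federated rounds.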

## Starting the network and setting up the nodes
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed at https://gitlab.inria.fr/fedbiomed/fedbiomed
2. Download n1.csv, n2.csv and n3.csv to your computer from https://gitlab.inria.fr/fedbiomed/fedbiomed/-/tree/develop/notebooks/data
3. Configure at least 2 nodes: <br/>
* **Node 1:** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset (you can simply enter 'sk' for each of them)
  * Pick the file n1.csv.
  * Check that your data has been added to node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2:** Open a second terminal and run `./scripts/fedbiomed_run node add config n2.ini`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset (you can simply enter 'sk' for each of them)
  * Pick the file n2.csv.
  * Check that your data has been added to node 2 by executing `./scripts/fedbiomed_run node list config n2.ini`
  * Run the node using `./scripts/fedbiomed_run node start config n2.ini`.
  
* **Node 3:** Open a third terminal and run `./scripts/fedbiomed_run node add config n3.ini`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset (you can simply enter 'sk' for each of them)
  * Pick the file n3.csv.
  * Check that your data has been added to node 3 by executing `./scripts/fedbiomed_run node list config n3.ini`
  * Run the node using `./scripts/fedbiomed_run node start config n3.ini`.

Wait until you see `Connected with result code 0`, which means the node is online.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'

Below is the template of the class you should provide to Fed-BioMed:

**after_training_params**: a dictionary containing the model parameters.
For SGDRegressor these are `coef_` and `intercept_`; for KMeans they would be `cluster_centers_` and `labels_`.

**training_step**: most of the time, this will be the `partial_fit` method of a scikit-learn incremental learning model.

**training_data**: must return the pair (X, y), with the same types as the arguments expected by `partial_fit`. For simplicity we do not use batch_size here, but the code should still work if you want to train on a specific batch of the dataset.

You can uncomment the prints in `training_data` to inspect the type and shape of the data during training.
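
To illustrate the kind of dictionary described for **after_training_params**, here is a minimal standalone sketch. The helper `extract_params` is hypothetical and is not part of the Fed-BioMed API. It only shows how the fitted attributes of a scikit-learn model can be collected by name.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def extract_params(model, names=("coef_", "intercept_")):
    """Hypothetical helper: copy the named fitted attributes into a dict."""
    return {name: np.asarray(getattr(model, name)).copy() for name in names}

# Fit on a tiny toy problem so that the fitted attributes exist
model = SGDRegressor()
model.partial_fit(np.eye(5), np.arange(5.0))

params = extract_params(model)
print(sorted(params))  # ['coef_', 'intercept_']
```

For a clustering model, the same helper could be called with `names=("cluster_centers_", "labels_")`.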

In [3]:
%%writefile "$model_file"

from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
from sklearn.linear_model import SGDRegressor
import numpy as np
import pandas as pd  # used by training_data below

class SGDRegressorTrainingPlan(SGDSkLearnModel):
    def __init__(self, kwargs):
        super(SGDRegressorTrainingPlan, self).__init__(kwargs)
        self.add_dependency(["from sklearn.linear_model import SGDRegressor"])
        self.set_model(SGDRegressor())
        # Start from a zero model: 5 coefficients and a zero intercept
        self.set_init_params({'coef_': np.zeros(5), 'intercept_': [0.]})

    def training_data(self, batch_size=None):
        # The first NUMBER_COLS columns hold the features, the next one the target
        NUMBER_COLS = 5
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        if batch_size is None:
            X = dataset.iloc[:, 0:NUMBER_COLS].values
            y = dataset.iloc[:, NUMBER_COLS]
        else:
            # Use only the first batch_size rows
            X = dataset.iloc[0:batch_size, 0:NUMBER_COLS].values
            y = dataset.iloc[0:batch_size, NUMBER_COLS]
        #print('X type ', type(X), ' shape ', X.shape)
        #print('Y type ', type(y.values), ' shape ', len(y.values))
        return (X, y.values)

    

Writing /Users/mlorenzi/works/temp/fedbiomed/var/tmp/tmpn664ztp1/fedbiosklearn.py


**model_args** is a dictionary containing your model arguments; for SGDRegressor these are max_iter and tol.

**training_args** is a dictionary of parameters related to federated learning.

In [4]:
model_args = { 'max_iter':1000, 'tol': 1e-3 , 'number_columns': 5 }

training_args = {
    'batch_size': None, 
    'lr': 1e-3, 
    'epochs': 5, 
    'dry_run': False,  
    'batch_maxnum': 0
}

In [5]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['sk']
rounds = 5

exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SGDRegressorTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 client_selection_strategy=None)

Messaging researcher_04cf58a2-3dea-43b8-9db4-070592b6781b connected with result code 0
Searching for clients with data tags: ['sk'] ...
2021-08-25 11:45:51.720110 [ RESEARCHER ] message received. {'researcher_id': 'researcher_04cf58a2-3dea-43b8-9db4-070592b6781b', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [19, 5], 'dataset_id': 'dataset_4177d54f-6e37-41b7-a6f1-ac99714ff8e4'}], 'count': 1, 'client_id': 'client_2b339fff-4966-4ed5-9ffe-ae3ab922f87b', 'command': 'search'}
2021-08-25 11:45:51.721876 [ RESEARCHER ] message received. {'researcher_id': 'researcher_04cf58a2-3dea-43b8-9db4-070592b6781b', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [49, 5], 'dataset_id': 'dataset_7ef9e368-8819-4c39-bb48-37d63a51af50'}], 'count': 1, 'client_id': 'client_f00c3aac-129f-4e97-a55a-c2c41088cb6d', 'command': 'search'}
2021-08-25 11:45:51.723664 [ RESEARCHER ] message 
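
The experiment above passes `FedAverage()` as its aggregator. As a rough illustration of the idea behind federated averaging, the sketch below averages parameter dictionaries from several clients, weighting each client by its number of local samples. It is a simplified assumption about the technique and not Fed-BioMed's actual implementation. The function name `federated_average` and the toy values are hypothetical.

```python
import numpy as np

def federated_average(client_params, n_samples):
    """Average each parameter across clients, weighted by local sample count."""
    weights = np.asarray(n_samples, dtype=float)
    weights /= weights.sum()
    return {
        name: sum(w * np.asarray(p[name], dtype=float)
                  for w, p in zip(weights, client_params))
        for name in client_params[0]
    }

# Two toy clients holding 10 and 30 samples respectively
clients = [
    {"coef_": np.array([4.0, 8.0]), "intercept_": np.array([0.0])},
    {"coef_": np.array([6.0, 8.0]), "intercept_": np.array([1.0])},
]
avg = federated_average(clients, n_samples=[10, 30])
print(avg["coef_"], avg["intercept_"])  # [5.5 8. ] [0.75]
```

Weighting by sample count lets clients with more data pull the aggregate further towards their local solution. An unweighted mean is the special case where all counts are equal.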

In [6]:
exp.run()

Sampled clients in round  0   ['client_2b339fff-4966-4ed5-9ffe-ae3ab922f87b', 'client_f00c3aac-129f-4e97-a55a-c2c41088cb6d', 'client_a5a17b70-69d5-4a1c-9af1-d67ca4f27768']
[ RESEARCHER ] Send message to client  client_2b339fff-4966-4ed5-9ffe-ae3ab922f87b {'researcher_id': 'researcher_04cf58a2-3dea-43b8-9db4-070592b6781b', 'job_id': 'efc67ca6-b3cc-4d07-9285-2cc0d8921c5f', 'training_args': {'batch_size': None, 'lr': 0.001, 'epochs': 5, 'dry_run': False, 'batch_maxnum': 0}, 'model_args': {'max_iter': 1000, 'tol': 0.001, 'number_columns': 5}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/08/25/my_model_2401ab7f-4f96-401a-91ce-73be2a594b14.py', 'params_url': 'http://localhost:8844/media/uploads/2021/08/25/my_model_1b7e1c5a-4453-42c5-bf81-cffb92435fb2.pt', 'model_class': 'SGDRegressorTrainingPlan', 'training_data': {'client_2b339fff-4966-4ed5-9ffe-ae3ab922f87b': ['dataset_4177d54f-6e37-41b7-a6f1-ac99714ff8e4']}}
researcher_04cf58a2-3dea-43b8-9db4-070592b6781b
[ R

2021-08-25 11:46:12.889441 [ RESEARCHER ] message received. {'researcher_id': 'researcher_04cf58a2-3dea-43b8-9db4-070592b6781b', 'job_id': 'efc67ca6-b3cc-4d07-9285-2cc0d8921c5f', 'success': True, 'client_id': 'client_f00c3aac-129f-4e97-a55a-c2c41088cb6d', 'dataset_id': 'dataset_7ef9e368-8819-4c39-bb48-37d63a51af50', 'params_url': 'http://localhost:8844/media/uploads/2021/08/25/node_params_cd03e05b-cc85-42a5-ba77-d2dd21a23d6e.pt', 'timing': {'rtime_training': 0.0061381529999948725, 'ptime_training': 0.006008000000000013}, 'msg': '', 'command': 'train'}
2021-08-25 11:46:12.929191 [ RESEARCHER ] message received. {'researcher_id': 'researcher_04cf58a2-3dea-43b8-9db4-070592b6781b', 'job_id': 'efc67ca6-b3cc-4d07-9285-2cc0d8921c5f', 'success': True, 'client_id': 'client_a5a17b70-69d5-4a1c-9af1-d67ca4f27768', 'dataset_id': 'dataset_67ad82e4-f3bd-4e12-9fd8-622cba47e968', 'params_url': 'http://localhost:8844/media/uploads/2021/08/25/node_params_120eb912-05a0-40e0-99e0-998b6815abaf.pt', 'timing'

2021-08-25 11:46:33.062806 [ RESEARCHER ] message received. {'researcher_id': 'researcher_04cf58a2-3dea-43b8-9db4-070592b6781b', 'job_id': 'efc67ca6-b3cc-4d07-9285-2cc0d8921c5f', 'success': True, 'client_id': 'client_2b339fff-4966-4ed5-9ffe-ae3ab922f87b', 'dataset_id': 'dataset_4177d54f-6e37-41b7-a6f1-ac99714ff8e4', 'params_url': 'http://localhost:8844/media/uploads/2021/08/25/node_params_37931ddd-3bea-4429-b842-87f57df3d209.pt', 'timing': {'rtime_training': 0.005188239999995403, 'ptime_training': 0.005087000000000064}, 'msg': '', 'command': 'train'}
2021-08-25 11:46:33.104703 [ RESEARCHER ] message received. {'researcher_id': 'researcher_04cf58a2-3dea-43b8-9db4-070592b6781b', 'job_id': 'efc67ca6-b3cc-4d07-9285-2cc0d8921c5f', 'success': True, 'client_id': 'client_f00c3aac-129f-4e97-a55a-c2c41088cb6d', 'dataset_id': 'dataset_7ef9e368-8819-4c39-bb48-37d63a51af50', 'params_url': 'http://localhost:8844/media/uploads/2021/08/25/node_params_336b64ac-079c-40d8-8c24-1f321f436d3d.pt', 'timing':

## Let's now build a test dataset. **A** is the linear transformation that was used to build the CSV datasets.

In [7]:
n_features = 5
testing_samples = 40
rng = np.random.RandomState(1)
A = np.array([[5],
       [8],
       [9],
       [5],
       [0]])

def test_data():
    X_test = rng.randn(testing_samples, n_features).reshape([testing_samples, n_features])
    y_test = X_test.dot(A) + rng.randn(testing_samples).reshape([testing_samples,1])
    return X_test, y_test

In [8]:
from sklearn.linear_model import SGDRegressor

In [9]:
X_test, y_test = test_data()

The MSE on the test set should decrease from one round to the next as the federated parameters improve.

In [10]:
testing_error = []

for i in range(rounds):
    fed_model = SGDRegressor(max_iter=1000, tol=1e-3)
    fed_model.coef_ = exp._aggregated_params[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp._aggregated_params[i]['params']['intercept_'].copy()  
    mse = np.mean((fed_model.predict(X_test).ravel() - y_test.ravel())**2)
    print('MSE ', mse)
    testing_error.append(mse)

MSE  58.436906082050925
MSE  24.27651060704205
MSE  16.378478348116158
MSE  8.499826085953165
MSE  6.9347972623872405
