# Fed-BioMed Researcher: training a federated scikit-learn model

## Purpose of the exercise

Three datasets, n1.csv, n2.csv and n3.csv, have been randomly generated using a linear transformation A = [5 8 9 5 0].
We will fit a stochastic gradient descent regressor (`SGDRegressor`) to approximate this transformation using federated learning.
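
As a rough illustration of what "generated randomly using a linear transformation" can look like, the sketch below builds one synthetic dataset as `y = X·A + noise`. The sample count, noise level, seed and the helper name `make_dataset` are assumptions made for this example; the actual files n1.csv, n2.csv and n3.csv may have been produced differently.

```python
import numpy as np

# Linear transformation quoted in the text; everything else is an assumption.
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])

def make_dataset(n_samples, seed, noise_std=1.0):
    """Return (X, y) with y = X @ A + Gaussian noise (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n_samples, A.shape[0])
    y = X @ A + noise_std * rng.randn(n_samples)
    return X, y

X, y = make_dataset(n_samples=50, seed=0)

# Ordinary least squares recovers A up to the noise level
A_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(A_hat, 1))
```

The least-squares estimate gives a reference point for what the federated `SGDRegressor` is expected to approach.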

## Extending this notebook to any incremental learning scikit-learn model

The federated learning scheme below applies to any scikit-learn model that supports the `partial_fit()` method.

A whole family of models could therefore be imported into Fed-BioMed following the same approach. For example:
- naive Bayes,
- logistic regression,
- SVC (linear and non-linear),
- perceptron,
- KMeans,
- incremental PCA,
- mini-batch dictionary learning,
- latent Dirichlet allocation.
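
To make the role of `partial_fit()` concrete, here is a minimal sketch in plain scikit-learn, without any Fed-BioMed code. It trains an `SGDRegressor` incrementally on mini-batches of synthetic data. The data, the batch size of 20 and the 50 passes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (assumption): targets follow y = X @ A plus a little noise
rng = np.random.RandomState(0)
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])
X = rng.randn(200, 5)
y = X @ A + 0.1 * rng.randn(200)

model = SGDRegressor()
batch = 20
# Each partial_fit call updates the existing coefficients with one SGD pass
# over the batch it receives, instead of refitting from scratch.
for epoch in range(50):
    for start in range(0, len(X), batch):
        model.partial_fit(X[start:start + batch], y[start:start + batch])

print(np.round(model.coef_, 1))
```

Because the model state survives between calls, the same loop can in principle be interrupted, have its parameters replaced by aggregated ones, and then resume, which is what makes this family of models convenient for federated training.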

## Starting the network and setting up the clients
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed at
https://gitlab.inria.fr/fedbiomed/fedbiomed
2. Download n1.csv, n2.csv and n3.csv to a location on your computer from https://gitlab.inria.fr/fedbiomed/fedbiomed/-/tree/develop/notebooks/data
3. Configure at least 2 nodes: <br/>
* **Node 1:** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a CSV file to the client.
  * Choose the name, tags and description of the dataset (entering 'sk' for all three works fine).
  * Pick the file n1.csv.
  * Check that your data has been added to node 1 by executing `./scripts/fedbiomed_run node list`.
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2:** Open a second terminal and run `./scripts/fedbiomed_run node add config n2.ini`
  * Select option 1 to add a CSV file to the client.
  * Choose the name, tags and description of the dataset (entering 'sk' for all three works fine).
  * Pick the file n2.csv.
  * Check that your data has been added to node 2 by executing `./scripts/fedbiomed_run node list config n2.ini`.
  * Run the node using `./scripts/fedbiomed_run node start config n2.ini`.
  
* **Node 3:** Open a third terminal and run `./scripts/fedbiomed_run node add config n3.ini`
  * Select option 1 to add a CSV file to the client.
  * Choose the name, tags and description of the dataset (entering 'sk' for all three works fine).
  * Pick the file n3.csv.
  * Check that your data has been added to node 3 by executing `./scripts/fedbiomed_run node list config n3.ini`.
  * Run the node using `./scripts/fedbiomed_run node start config n3.ini`.

Wait until each node prints `Connected with result code 0`, which means it is online.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
from fedbiomed.researcher.environ import environ
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=environ['TMP_DIR']+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'

Below is the template of the class you should provide to Fed-BioMed:

**after_training_params**: a dictionary containing the model parameters.
For `SGDRegressor` these are `coef_` and `intercept_`; for KMeans they would be `cluster_centers_` and `labels_`.

**training_step**: most of the time this will be the `partial_fit` method
of a scikit-learn incremental learning model.

**training_data**: must return the pair `(X, y)`, with the same types as the
arguments expected by `partial_fit`. For simplicity we do not use `batch_size` here, but the code should also work if you want to train on a specific batch of the dataset.

You can uncomment the prints in order to check the evolution of training.
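
To illustrate what a parameter dictionary of this kind might contain, the sketch below extracts `coef_` and `intercept_` from a fitted `SGDRegressor` and assigns them to a fresh estimator. It uses plain scikit-learn only; the helper names `extract_params` and `load_params` are hypothetical and are not part of the Fed-BioMed API.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def extract_params(model, names=("coef_", "intercept_")):
    """Collect fitted attributes into a plain dict (hypothetical helper)."""
    return {name: np.asarray(getattr(model, name)).copy() for name in names}

def load_params(model, params):
    """Assign a parameter dict back onto a scikit-learn estimator."""
    for name, value in params.items():
        setattr(model, name, value)
    return model

# Small synthetic problem with 4 features (assumption, for illustration only)
rng = np.random.RandomState(0)
X = rng.randn(30, 4)
y = X @ np.array([5.0, 8.0, 9.0, 5.0])

trained = SGDRegressor().partial_fit(X, y)   # partial_fit returns the estimator
params = extract_params(trained)             # {'coef_': ..., 'intercept_': ...}
clone = load_params(SGDRegressor(), params)  # fresh model, same parameters
```

For another model family only the attribute names would change, for example `cluster_centers_` for a KMeans-type model.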

In [3]:
%%writefile "$model_file"

from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import SGDClassifier
import numpy as np
import pandas as pd  # needed by training_data below


class SKlearnTrainingPlan(SGDSkLearnModel):
    def __init__(self, kwargs):
        super(SKlearnTrainingPlan, self).__init__(kwargs)
        self.set_model('SGDRegressor')
        # Start from a zero model: 4 coefficients and one intercept
        self.set_init_params({'coef_': np.zeros(4), 'intercept_': [0.]})

    def training_data(self, batch_size=None):
        # The first NUMBER_COLS columns are the features, the next one is the target
        NUMBER_COLS = 4
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        if batch_size is None:
            # Use the whole dataset
            X = dataset.iloc[:, 0:NUMBER_COLS].values
            y = dataset.iloc[:, NUMBER_COLS]
        else:
            # Use only the first batch_size rows
            X = dataset.iloc[0:batch_size, 0:NUMBER_COLS].values
            y = dataset.iloc[0:batch_size, NUMBER_COLS]
        return (X, y.values)
    

Writing /Users/mlorenzi/works/temp/fedbiomed/var/tmp/tmp_0f777hp/fedbiosklearn.py


**model_args** is a dictionary containing your model arguments; for `SGDRegressor` these are `max_iter` and `tol`.

**training_args** is a dictionary of parameters related to federated learning.

In [4]:
model_args = { 'max_iter':1000, 'tol': 1e-3 , 'number_columns': 5 }

training_args = {
    'batch_size': None, 
    'lr': 1e-3, 
    'epochs': 5, 
    'dry_run': False,  
    'batch_maxnum': 0
}

In [5]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['sk']
rounds = 5

exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SKlearnTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 client_selection_strategy=None)

Messaging researcher_7ff5cf8a-41a4-4366-a48d-8ece2d9ab87a connected with result code 0
Searching for clients with data tags: ['sk'] ...
2021-08-25 14:26:19.368479 [ RESEARCHER ] message received. {'researcher_id': 'researcher_7ff5cf8a-41a4-4366-a48d-8ece2d9ab87a', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [49, 4], 'dataset_id': 'dataset_27db49ad-7f53-4235-bc16-c7bcd9090b9a'}], 'count': 1, 'node_id': 'client_947499f1-4f72-4dfa-b578-aeedf9cbf843', 'command': 'search'}
2021-08-25 14:26:19.369748 [ RESEARCHER ] message received. {'researcher_id': 'researcher_7ff5cf8a-41a4-4366-a48d-8ece2d9ab87a', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [49, 4], 'dataset_id': 'dataset_24109b6b-3840-402d-819a-438ecb71c064'}], 'count': 1, 'node_id': 'client_8f83bb09-bd26-4d33-9621-6a4a17ec12da', 'command': 'search'}
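
The experiment above is configured with a `FedAverage` aggregator. As a conceptual sketch only, and not Fed-BioMed's actual implementation, federated averaging of scikit-learn parameters can be written as a weighted mean over the parameter dictionaries returned by the nodes. The function name `federated_average`, the optional `weights` argument and the numeric values are assumptions made for illustration.

```python
import numpy as np

def federated_average(node_params, weights=None):
    """Weighted average of per-node parameter dicts (illustrative sketch)."""
    if weights is None:
        weights = np.ones(len(node_params))  # equal weight for every node
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # normalise so the weights sum to 1
    return {
        key: sum(w * np.asarray(p[key], dtype=float)
                 for w, p in zip(weights, node_params))
        for key in node_params[0]
    }

# Made-up parameters standing in for what two nodes might return
node_a = {"coef_": np.array([4.0, 8.0, 10.0, 5.0]), "intercept_": np.array([0.2])}
node_b = {"coef_": np.array([6.0, 8.0, 8.0, 5.0]), "intercept_": np.array([-0.2])}

averaged = federated_average([node_a, node_b])
weighted = federated_average([node_a, node_b], weights=[3, 1])
```

Weighting by the number of samples held by each node is a common choice when the datasets differ in size; with equal weights the result is the plain mean.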


In [6]:
exp.run()

Sampled clients in round  0   ['client_947499f1-4f72-4dfa-b578-aeedf9cbf843', 'client_8f83bb09-bd26-4d33-9621-6a4a17ec12da']
[ RESEARCHER ] Send message to client  client_947499f1-4f72-4dfa-b578-aeedf9cbf843 {'researcher_id': 'researcher_7ff5cf8a-41a4-4366-a48d-8ece2d9ab87a', 'job_id': '80cc0f32-4643-4bfd-89db-c57112ddb7b5', 'training_args': {'batch_size': None, 'lr': 0.001, 'epochs': 5, 'dry_run': False, 'batch_maxnum': 0}, 'model_args': {'max_iter': 1000, 'tol': 0.001, 'number_columns': 5}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/08/25/my_model_7115c271-0892-47a8-877c-5dd01e86ba64.py', 'params_url': 'http://localhost:8844/media/uploads/2021/08/25/my_model_59d659a7-e0ce-4519-8772-2ec5c46cdef8.pt', 'model_class': 'SKlearnTrainingPlan', 'training_data': {'client_947499f1-4f72-4dfa-b578-aeedf9cbf843': ['dataset_27db49ad-7f53-4235-bc16-c7bcd9090b9a']}}
researcher_7ff5cf8a-41a4-4366-a48d-8ece2d9ab87a
[ RESEARCHER ] Send message to client  client_8f83bb09-

Downloading model params after training on  client_8f83bb09-bd26-4d33-9621-6a4a17ec12da 
	- from http://localhost:8844/media/uploads/2021/08/25/node_params_278a2b50-0a59-4709-bc8c-4c9c16a9133e.pt
Downloading model params after training on  client_947499f1-4f72-4dfa-b578-aeedf9cbf843 
	- from http://localhost:8844/media/uploads/2021/08/25/node_params_82da8c97-b8ea-4f90-9455-8f5e9445b0c9.pt
Clients that successfully reply in round  2   ['client_8f83bb09-bd26-4d33-9621-6a4a17ec12da', 'client_947499f1-4f72-4dfa-b578-aeedf9cbf843']
Sampled clients in round  3   ['client_947499f1-4f72-4dfa-b578-aeedf9cbf843', 'client_8f83bb09-bd26-4d33-9621-6a4a17ec12da']
[ RESEARCHER ] Send message to client  client_947499f1-4f72-4dfa-b578-aeedf9cbf843 {'researcher_id': 'researcher_7ff5cf8a-41a4-4366-a48d-8ece2d9ab87a', 'job_id': '80cc0f32-4643-4bfd-89db-c57112ddb7b5', 'training_args': {'batch_size': None, 'lr': 0.001, 'epochs': 5, 'dry_run': False, 'batch_maxnum': 0}, 'model_args': {'max_iter': 1000, 'tol'

## Building a test dataset

**A** is the linear transformation that was used to build the CSV datasets.

In [None]:
n_features = 5
testing_samples = 40
rng = np.random.RandomState(1)
A = np.array([[5],
       [8],
       [9],
       [5],
       [0]])

def test_data():
    # randn already returns an array of shape (testing_samples, n_features)
    X_test = rng.randn(testing_samples, n_features)
    # Noisy targets as a column vector of shape (testing_samples, 1)
    y_test = X_test.dot(A) + rng.randn(testing_samples).reshape([testing_samples, 1])
    return X_test, y_test

In [None]:
from sklearn.linear_model import SGDRegressor

In [None]:
X_test, y_test = test_data()

The MSE computed with the federated parameters should decrease from one round to the next.

In [None]:
testing_error = []

for i in range(rounds):
    fed_model = SGDRegressor(max_iter=1000, tol=1e-3)
    fed_model.coef_ = exp._aggregated_params[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp._aggregated_params[i]['params']['intercept_'].copy()  
    mse = np.mean((fed_model.predict(X_test).ravel() - y_test.ravel())**2)
    print('MSE ', mse)
    testing_error.append(mse)
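
For a linear model, the prediction used in the loop above amounts to `X @ coef_ + intercept_`, so the same error can be computed directly from a parameter dictionary. The sketch below shows this with NumPy only; the helper name `linear_mse` and the noiseless demo data are assumptions for illustration and do not use the experiment's results.

```python
import numpy as np

def linear_mse(coef, intercept, X, y):
    """MSE of the linear predictor X @ coef + intercept (hypothetical helper)."""
    predictions = X @ np.ravel(coef) + np.ravel(intercept)[0]
    return float(np.mean((predictions - np.ravel(y)) ** 2))

rng = np.random.RandomState(1)
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])
X_demo = rng.randn(40, 5)
y_demo = X_demo @ A  # noiseless targets, so the true A gives zero error

exact = linear_mse(A, [0.0], X_demo, y_demo)            # parameters equal to A
perturbed = linear_mse(A + 0.5, [0.0], X_demo, y_demo)  # parameters off by 0.5
print(exact, perturbed)
```

The closer the aggregated coefficients get to A, the smaller this value becomes, which is the trend the list `testing_error` is meant to show.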