# Fed-BioMed Researcher: training a federated scikit-learn model

## Purpose of the exercise

Three datasets, n1.csv, n2.csv and n3.csv, have been generated randomly using the linear transformation A = [ 5 8 9 5 0 ].
We will fit an SGDRegressor (a linear model trained by stochastic gradient descent) to approximate this transformation using Federated Learning.
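
A dataset of the same kind can be sketched as follows. This is a minimal sketch, not the script used to create n1.csv, n2.csv and n3.csv: the function name `make_dataset`, the sample count, the seed and the unit-variance Gaussian noise are assumptions (the noise mirrors the test data built later in this notebook). The layout, 5 feature columns followed by 1 target column, matches what `training_data` reads.

```python
import numpy as np

def make_dataset(n_samples, seed):
    """Return an (n_samples, 6) array: 5 feature columns, then 1 target column."""
    rng = np.random.RandomState(seed)
    A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])   # linear transformation from the text
    X = rng.randn(n_samples, A.size)           # random features
    y = X.dot(A) + rng.randn(n_samples)        # assumed: unit-variance Gaussian noise
    return np.column_stack([X, y])

data = make_dataset(n_samples=50, seed=0)
print(data.shape)  # (50, 6)

# To write it in the header-less, comma-separated layout used by the nodes
# (hypothetical file name):
# np.savetxt("my_dataset.csv", data, delimiter=",")
```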

## Extending this notebook to any incremental learning scikit-learn model

The federated learning scheme below applies to any sklearn model that supports the method partial_fit().

A whole family of models could therefore be imported into Fed-BioMed following the same approach. For example:
- Naive Bayes,
- logistic regression,
- SVM/SVC (linear and non-linear),
- perceptron,
- KMeans,
- incremental PCA,
- mini-batch dictionary learning,
- latent Dirichlet allocation.
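
What these models have in common is that partial_fit() updates the current parameters with one more batch of data instead of refitting from scratch. The sketch below shows this outside of any federated setting, with plain scikit-learn. The batch size, the number of batches and the noise-free targets are assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])   # same transformation as in this notebook

model = SGDRegressor()

# Feed the model one small batch at a time; each call refines coef_ in place.
for _ in range(200):
    X_batch = rng.randn(20, 5)
    y_batch = X_batch.dot(A)
    model.partial_fit(X_batch, y_batch)

# The learned coefficients should end up close to A.
print(model.coef_.round(1))
```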

## Starting the network and setting up the nodes
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed in:
https://gitlab.inria.fr/fedbiomed/fedbiomed
2. Download n1.csv, n2.csv and n3.csv to your computer from https://gitlab.inria.fr/fedbiomed/fedbiomed/-/tree/develop/notebooks/data
3. Configure at least 2 nodes: <br/>
* **Node 1 :** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a CSV file to the client
  * Choose the name, tags and description of the dataset (entering 'sk' for each of them works)
  * Pick the file n1.csv.
  * Check that your data has been added in node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2 :** Open a second terminal and run `./scripts/fedbiomed_run node add config n2.ini`
  * Select option 1 to add a CSV file to the client
  * Choose the name, tags and description of the dataset (entering 'sk' for each of them works)
  * Pick the file n2.csv.
  * Check that your data has been added in node 2 by executing `./scripts/fedbiomed_run node list config n2.ini`
  * Run the node using `./scripts/fedbiomed_run node start config n2.ini`.
  
* **Node 3 :** Open a third terminal and run `./scripts/fedbiomed_run node add config n3.ini`
  * Select option 1 to add a CSV file to the client
  * Choose the name, tags and description of the dataset (entering 'sk' for each of them works)
  * Pick the file n3.csv.
  * Check that your data has been added in node 3 by executing `./scripts/fedbiomed_run node list config n3.ini`
  * Run the node using `./scripts/fedbiomed_run node start config n3.ini`.

Wait until you get `Connected with result code 0`, which means the node is online.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'

Below is the template of the class you should provide to Fed-BioMed:

**after_training_params** : a dictionary containing the model parameters. 
For SGDRegressor these are coef_ and intercept_. For KMeans they would be cluster_centers_ and labels_.
       
**training_step** : most of the time, this will be the partial_fit method 
of a scikit-learn incremental learning model.
       
**training_data** : must return the (X, y) pair, in the types expected by the 
parameters of your partial_fit method. For simplicity we don't use batch_size here, but the code should also work if you want to train on a specific batch of the dataset. 

You can uncomment the prints in training_data to check the type and shape of the data used for training.
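
The split that training_data has to perform can be tried on its own with a small in-memory table. This is a minimal sketch with NumPy only: the helper name `split_features_target` and the toy table are assumptions made for illustration, while the column layout (5 feature columns followed by the target) follows the training plan in this notebook.

```python
import numpy as np

NUMBER_COLS = 5  # feature columns; the target is the column right after them

def split_features_target(table, batch_size=None):
    """Return (X, y) from a 2-D array, optionally keeping only the first rows."""
    rows = table if batch_size is None else table[:batch_size]
    X = rows[:, :NUMBER_COLS]
    y = rows[:, NUMBER_COLS]
    return X, y

# Toy table with 4 rows and 6 columns (values 0..23), standing in for a CSV file.
table = np.arange(24, dtype=float).reshape(4, 6)

X_all, y_all = split_features_target(table)
X_two, y_two = split_features_target(table, batch_size=2)

print(X_all.shape, y_all.shape)  # (4, 5) (4,)
print(X_two.shape, y_two.shape)  # (2, 5) (2,)
```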

In [18]:
from sklearn.linear_model import SGDRegressor

xxx = SGDRegressor()
name = 'coef_'

xxx.coef_=np.zeros(3)
getattr(xxx, "coef_")

array([0., 0., 0.])

In [3]:
%%writefile "$model_file"

import pandas as pd

from fedbiomed.common.fedbiosklearn import SkLearnModel
from sklearn.linear_model import SGDRegressor


class SGDRegressorTrainingPlan(SkLearnModel):
    def __init__(self, kwargs):
        super(SGDRegressorTrainingPlan, self).__init__(kwargs)
        # Import statement registered as a dependency of this training plan.
        self.add_dependency(["from sklearn.linear_model import SGDRegressor"])
    
    def after_training_params(self):
        # Parameters returned after local training.
        return {'coef_': self.reg.coef_, 'intercept_': self.reg.intercept_}
    
    def training_data(self, batch_size=None):
        # Each row holds NUMBER_COLS feature columns followed by the target column.
        NUMBER_COLS = 5
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        if batch_size is None:
            X = dataset.iloc[:, 0:NUMBER_COLS].values
            y = dataset.iloc[:, NUMBER_COLS]
        else:
            # Keep only the first batch_size rows.
            X = dataset.iloc[0:batch_size, 0:NUMBER_COLS].values
            y = dataset.iloc[0:batch_size, NUMBER_COLS]
        #print('X type ', type(X), ' shape ', X.shape)
        #print('Y type ', type(y.values), ' shape ', len(y.values))
        return (X, y.values)
    

Writing /Users/mlorenzi/works/temp/fedbiomed/var/tmp/tmpiysl6wlh/fedbiosklearn.py


**model_args** is a dictionary containing your model arguments; for SGDRegressor these are max_iter and tol.

**training_args** is a dictionary of parameters related to Federated Learning. 

In [4]:
model_args = { 'max_iter':1000, 'tol': 1e-3 , 'number_columns': 5 }

training_args = {
    'batch_size': None, 
    'lr': 1e-3, 
    'epochs': 5, 
    'dry_run': False,  
    'batch_maxnum': 0
}

In [5]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['sk']
rounds = 5

exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SGDRegressorTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 client_selection_strategy=None)

Messaging researcher_2311ff23-b8f9-4fb5-bdc8-dce4ecb4127b connected with result code 0
Searching for clients with data tags: ['sk'] ...
2021-08-24 19:03:33.328612 [ RESEARCHER ] message received. {'researcher_id': 'researcher_2311ff23-b8f9-4fb5-bdc8-dce4ecb4127b', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [49, 5], 'dataset_id': 'dataset_3b65fdec-7cb2-4847-8fc5-03ddcd00591b'}], 'count': 1, 'client_id': 'client_b3d83ca6-61ca-453e-afef-c2b58e95f052', 'command': 'search'}
2021-08-24 19:03:33.330501 [ RESEARCHER ] message received. {'researcher_id': 'researcher_2311ff23-b8f9-4fb5-bdc8-dce4ecb4127b', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [9, 5], 'dataset_id': 'dataset_68225286-0895-48a3-83a6-e2cc98f93e43'}], 'count': 1, 'client_id': 'client_144cbbef-d114-4ff4-882d-74530fd08c7a', 'command': 'search'}
2021-08-24 19:03:33.331607 [ RESEARCHER ] message r

In [6]:
exp.run()

Sampled clients in round  0   ['client_b3d83ca6-61ca-453e-afef-c2b58e95f052', 'client_144cbbef-d114-4ff4-882d-74530fd08c7a', 'client_a31be5d4-4e24-43f8-a3ad-1a99418b4020']
[ RESEARCHER ] Send message to client  client_b3d83ca6-61ca-453e-afef-c2b58e95f052 {'researcher_id': 'researcher_2311ff23-b8f9-4fb5-bdc8-dce4ecb4127b', 'job_id': 'b5f6b0fd-31f0-4e06-b793-c1b184b98332', 'training_args': {'batch_size': None, 'lr': 0.001, 'epochs': 5, 'dry_run': False, 'batch_maxnum': 0}, 'model_args': {'max_iter': 1000, 'tol': 0.001, 'number_columns': 5}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/08/24/my_model_6f559a2a-523e-445d-b492-bb51ae7d93cd.py', 'params_url': 'http://localhost:8844/media/uploads/2021/08/24/my_model_0d9bb45f-3c9c-472c-941e-8be357e43fcc.pt', 'model_class': 'SGDRegressorTrainingPlan', 'training_data': {'client_b3d83ca6-61ca-453e-afef-c2b58e95f052': ['dataset_3b65fdec-7cb2-4847-8fc5-03ddcd00591b']}}
researcher_2311ff23-b8f9-4fb5-bdc8-dce4ecb4127b
[ R

Downloading model params after training on  client_144cbbef-d114-4ff4-882d-74530fd08c7a 
	- from http://localhost:8844/media/uploads/2021/08/24/node_params_0727e43b-deea-429d-a1d0-7b44565caae8.pt
Downloading model params after training on  client_b3d83ca6-61ca-453e-afef-c2b58e95f052 
	- from http://localhost:8844/media/uploads/2021/08/24/node_params_571c9878-c1e4-48c2-9ec6-69574f45e212.pt
Downloading model params after training on  client_a31be5d4-4e24-43f8-a3ad-1a99418b4020 
	- from http://localhost:8844/media/uploads/2021/08/24/node_params_4d79c6c0-8bd9-49e7-9326-e5ed2e97ee51.pt
Clients that successfully reply in round  1   ['client_144cbbef-d114-4ff4-882d-74530fd08c7a', 'client_b3d83ca6-61ca-453e-afef-c2b58e95f052', 'client_a31be5d4-4e24-43f8-a3ad-1a99418b4020']
Sampled clients in round  2   ['client_b3d83ca6-61ca-453e-afef-c2b58e95f052', 'client_144cbbef-d114-4ff4-882d-74530fd08c7a', 'client_a31be5d4-4e24-43f8-a3ad-1a99418b4020']
[ RESEARCHER ] Send message to client  client_b3d83c

2021-08-24 19:04:56.400946 [ RESEARCHER ] message received. {'researcher_id': 'researcher_2311ff23-b8f9-4fb5-bdc8-dce4ecb4127b', 'job_id': 'b5f6b0fd-31f0-4e06-b793-c1b184b98332', 'success': True, 'client_id': 'client_144cbbef-d114-4ff4-882d-74530fd08c7a', 'dataset_id': 'dataset_68225286-0895-48a3-83a6-e2cc98f93e43', 'params_url': 'http://localhost:8844/media/uploads/2021/08/24/node_params_47b2614a-00a6-46c9-8005-0aa53bd89277.pt', 'timing': {'rtime_training': 0.0034226679999846965, 'ptime_training': 0.003422999999999732}, 'msg': '', 'command': 'train'}
2021-08-24 19:04:56.435884 [ RESEARCHER ] message received. {'researcher_id': 'researcher_2311ff23-b8f9-4fb5-bdc8-dce4ecb4127b', 'job_id': 'b5f6b0fd-31f0-4e06-b793-c1b184b98332', 'success': True, 'client_id': 'client_a31be5d4-4e24-43f8-a3ad-1a99418b4020', 'dataset_id': 'dataset_7fc8fc03-e45d-4f0e-a354-b3d5574cb708', 'params_url': 'http://localhost:8844/media/uploads/2021/08/24/node_params_c4026688-c4d0-4ed1-893b-fc0661552b7b.pt', 'timing'
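
At the end of each round, the researcher combines the parameters downloaded from the nodes into a single set of aggregated parameters. The sketch below illustrates the idea of federated averaging on `coef_` and `intercept_`. It is an illustration only, not the code of `FedAverage`: the helper `federated_average`, the weighting by sample count and the parameter values are assumptions made for this example.

```python
import numpy as np

def federated_average(node_params, n_samples):
    """Average each parameter across nodes, weighted by each node's sample count."""
    weights = np.asarray(n_samples, dtype=float)
    weights = weights / weights.sum()
    return {
        key: sum(w * np.asarray(params[key]) for w, params in zip(weights, node_params))
        for key in node_params[0]
    }

# Made-up parameters for two hypothetical nodes.
node_params = [
    {"coef_": np.array([4.0, 8.0, 9.0, 5.0, 0.0]), "intercept_": np.array([0.0])},
    {"coef_": np.array([6.0, 8.0, 9.0, 5.0, 0.0]), "intercept_": np.array([0.2])},
]

avg = federated_average(node_params, n_samples=[30, 10])
print(avg["coef_"])       # first coefficient: 0.75 * 4 + 0.25 * 6 = 4.5
print(avg["intercept_"])  # 0.75 * 0.0 + 0.25 * 0.2 = 0.05
```

With equal sample counts this reduces to a plain mean of the node parameters.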

## Let's now build a test dataset. **A** is the linear transformation that was used to build the CSV datasets.

In [7]:
n_features = 5
testing_samples = 40
rng = np.random.RandomState(1)
A = np.array([[5],
       [8],
       [9],
       [5],
       [0]])

def test_data():
    X_test = rng.randn(testing_samples, n_features).reshape([testing_samples, n_features])
    y_test = X_test.dot(A) + rng.randn(testing_samples).reshape([testing_samples,1])
    return X_test, y_test

In [8]:
from sklearn.linear_model import SGDRegressor

In [9]:
X_test, y_test = test_data()

The MSE should decrease at each round when the model uses the federated (aggregated) parameters.

In [10]:
testing_error = []

for i in range(rounds):
    fed_model = SGDRegressor(max_iter=1000, tol=1e-3)
    fed_model.coef_ = exp._aggregated_params[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp._aggregated_params[i]['params']['intercept_'].copy()  
    mse = np.mean((fed_model.predict(X_test).ravel() - y_test.ravel())**2)
    print('MSE ', mse)
    testing_error.append(mse)

MSE  54.06597495519943
MSE  32.41711852046201
MSE  15.00079364679299
MSE  7.928485551844941
MSE  4.627536226496672
