# Fed-BioMed Researcher: training a federated scikit-learn model

## Purpose of the exercise

Three datasets, n1.csv, n2.csv and n3.csv, have been generated randomly using a linear transformation A = [ 5 8 9 5 0 ].
We will fit a stochastic gradient descent regressor (`SGDRegressor`) to approximate this transformation using federated learning.
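
The exact procedure used to generate the CSV files is not shown in this notebook. As a rough sketch only, data of this kind could be produced as below, assuming standard normal features, a small Gaussian noise term, and one table per node with the target stored in the last column. The function name `make_linear_dataset`, the sample counts and the noise level are all illustrative assumptions.

```python
import numpy as np

# Coefficient vector quoted in the text; everything else here is an assumption.
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])

def make_linear_dataset(n_samples, seed, noise_std=0.1):
    """Return an (n_samples, 6) array: five features followed by the target."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, A.size))
    y = X @ A + rng.normal(scale=noise_std, size=n_samples)
    return np.column_stack([X, y])

# One hypothetical dataset per node, with different sizes and seeds.
datasets = {f"n{i}.csv": make_linear_dataset(n, seed=i)
            for i, n in enumerate([100, 150, 120], start=1)}

for name, table in datasets.items():
    print(name, table.shape)
    # np.savetxt(name, table, delimiter=",")  # uncomment to write the file
```

Giving each node a different seed and size mimics the federated setting, where no single site holds all of the data.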

## Extending this notebook to any incremental learning scikit-learn model

The federated learning scheme below applies to any sklearn model that supports the method `partial_fit()`.

A whole family of models could therefore be imported into Fed-BioMed following the same approach. For example:
- naive Bayes,
- logistic regression,
- SVM/SVC (linear and non-linear),
- perceptron,
- KMeans,
- incremental PCA,
- mini-batch dictionary learning,
- latent Dirichlet allocation.
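
To illustrate what incremental learning with `partial_fit()` looks like outside of any federated setup, here is a minimal sketch using scikit-learn's `SGDRegressor` on synthetic data. The data, batch size and number of passes are assumptions chosen for the illustration; only the coefficient vector comes from this notebook.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])   # target coefficients quoted in this notebook
X = rng.normal(size=(500, A.size))         # synthetic features (assumption)
y = X @ A                                  # noise-free targets

model = SGDRegressor(random_state=0)
batch_size = 50

# Each partial_fit call updates the existing weights instead of restarting,
# so the loop below streams the data in mini-batches over several passes.
for epoch in range(50):
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        model.partial_fit(X[start:stop], y[start:stop])

print(np.round(model.coef_, 2))
```

Because the model state survives between calls, the same pattern lets a node resume training from parameters received from elsewhere, which is what makes these estimators convenient for federated rounds.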

## Start the network and set up the clients
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed at:
https://gitlab.inria.fr/fedbiomed/fedbiomed
2. Download n1.csv, n2.csv and n3.csv to your computer from https://gitlab.inria.fr/fedbiomed/fedbiomed/-/tree/develop/notebooks/data
3. Configure at least 2 nodes: <br/>
* **Node 1:** Run `./scripts/fedbiomed_run node add`
  * Select option 1 to add a CSV file to the client.
  * Choose the name, tags and description of the dataset (you can enter 'sk' for all three; the experiment below searches for the tag 'sk').
  * Pick the file n1.csv.
  * Check that your data has been added to node 1 by executing `./scripts/fedbiomed_run node list`.
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2:** Open a second terminal and run `./scripts/fedbiomed_run node add config n2.ini`
  * Select option 1 to add a CSV file to the client.
  * Choose the name, tags and description of the dataset (you can enter 'sk' for all three).
  * Pick the file n2.csv.
  * Check that your data has been added to node 2 by executing `./scripts/fedbiomed_run node list config n2.ini`.
  * Run the node using `./scripts/fedbiomed_run node start config n2.ini`.
  
* **Node 3:** Open a third terminal and run `./scripts/fedbiomed_run node add config n3.ini`
  * Select option 1 to add a CSV file to the client.
  * Choose the name, tags and description of the dataset (you can enter 'sk' for all three).
  * Pick the file n3.csv.
  * Check that your data has been added to node 3 by executing `./scripts/fedbiomed_run node list config n3.ini`.
  * Run the node using `./scripts/fedbiomed_run node start config n3.ini`.

Wait until you see `Connected with result code 0`, which means the node is online.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'


**model_args** is a dictionary containing your model arguments. In the case of SGDRegressor, these are max_iter and tol.

**training_args** is a dictionary of parameters related to federated learning.
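
How Fed-BioMed forwards `model_args` to the underlying estimator is not shown in this notebook. As a hedged illustration of the idea, the sketch below separates the entries that scikit-learn's `SGDRegressor` accepts as hyperparameters from any additional keys. The dictionary `example_model_args` and the filtering step are assumptions made for this example, not Fed-BioMed's actual mechanism.

```python
from sklearn.linear_model import SGDRegressor

# Hypothetical arguments mixing estimator hyperparameters with extra keys.
example_model_args = {"model": "SGDRegressor", "max_iter": 1000, "tol": 1e-3,
                      "n_features": 20}

# Names accepted by the estimator constructor, as reported by scikit-learn.
valid_keys = SGDRegressor().get_params().keys()

# Keep only the entries that the estimator understands.
estimator_kwargs = {k: v for k, v in example_model_args.items() if k in valid_keys}
extra_keys = sorted(set(example_model_args) - set(estimator_kwargs))

model = SGDRegressor(**estimator_kwargs)
print(estimator_kwargs)
print(extra_keys)
```

Checking keys against `get_params()` is a cheap way to catch a misspelled hyperparameter before an experiment is sent to the nodes.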

In [3]:
from sklearn.naive_bayes import BernoulliNB

xx=BernoulliNB()

print(xx.get_params())

from sklearn.linear_model import SGDRegressor, SGDClassifier, Perceptron

xx=Perceptron()

print(xx.get_params())

{'alpha': 1.0, 'binarize': 0.0, 'class_prior': None, 'fit_prior': True}
{'alpha': 0.0001, 'class_weight': None, 'early_stopping': False, 'eta0': 1.0, 'fit_intercept': True, 'l1_ratio': 0.15, 'max_iter': 1000, 'n_iter_no_change': 5, 'n_jobs': None, 'penalty': None, 'random_state': 0, 'shuffle': True, 'tol': 0.001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}


In [4]:
# input_sklearn_model = 'BernoulliNB'

# n_features = 20
# n_classes = 2

# theta_ = np.array([0.1] * (n_features*n_classes)).reshape(n_classes,n_features)
# feature_count_ = np.array([0] * (n_features*n_classes)).reshape(n_classes,n_features)
# class_count_ = np.array([0] * (n_classes))

# model_args = { 'model': input_sklearn_model, 'max_iter':1000, 'tol': 1e-3 , 
#               'init_params' : {'theta_': theta_, 'feature_count_' : feature_count_, 'class_count_' : class_count_}}

# training_args = {
#     'batch_size': None, 
#     'lr': 1e-3, 
#     'epochs': 5, 
#     'dry_run': False,  
#     'batch_maxnum': 0
# }

In [5]:
input_sklearn_model = 'SGDClassifier'

n_features = 20
n_classes = 2

model_args = { 'model': input_sklearn_model, 'max_iter':1000, 'tol': 1e-3 , 
               'n_features' : n_features, 'n_classes' : n_classes}

training_args = {
    'batch_size': None, 
    'lr': 1e-3, 
    'epochs': 5, 
    'dry_run': False,  
    'batch_maxnum': 0
}

Below is the template of the class you should provide to Fed-BioMed:

**after_training_params**: a dictionary containing the model parameters.
For SGDRegressor these are coef and intercept. For KMeans they are cluster_center and labels.

**training_step**: most of the time, this is the method partial_fit
of a scikit-learn incremental learning model. You can uncomment the prints in order to check the evolution of training.

**training_data**: here you must return the (X,y) pair, whose types must match
the parameters of your partial_fit method. To keep things simple we do not use batch_size here, but the code should work if you want to train on a specific batch of the dataset.
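
The parameter dictionaries returned by each client are what the researcher side aggregates. The sketch below shows one common way to do this, a sample-weighted average in the spirit of federated averaging. Whether Fed-BioMed's `FedAverage` weights clients exactly this way is an assumption here, and the function `federated_average` and the toy parameter values are purely illustrative.

```python
import numpy as np

def federated_average(client_params, n_samples):
    """Average per-client parameter dictionaries, weighted by sample count."""
    weights = np.asarray(n_samples, dtype=float)
    weights /= weights.sum()
    keys = client_params[0].keys()
    return {
        key: sum(w * np.asarray(p[key]) for w, p in zip(weights, client_params))
        for key in keys
    }

# Two hypothetical clients returning parameters shaped like those of a linear model.
client_params = [
    {"coef_": np.array([[1.0, 2.0]]), "intercept_": np.array([0.0])},
    {"coef_": np.array([[3.0, 6.0]]), "intercept_": np.array([4.0])},
]
aggregated = federated_average(client_params, n_samples=[150, 50])
print(aggregated)
```

Weighting by sample count gives larger datasets more influence on the aggregate; passing equal counts reduces this to a plain mean.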

In [6]:
%%writefile "$model_file"

from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
import numpy as np
import pandas as pd  # needed by training_data() below


class SkLearnTrainingPlan(SGDSkLearnModel):
    def __init__(self, model_args):
        super(SkLearnTrainingPlan, self).__init__(model_args)

    def training_data(self, batch_size=None):
        # The first NUMBER_COLS columns are features; the next column is the target.
        NUMBER_COLS = 20
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        if batch_size is None:
            # Use the whole dataset.
            X = dataset.iloc[:, 0:NUMBER_COLS].values
            y = dataset.iloc[:, NUMBER_COLS]
        else:
            # Use only the first batch_size rows.
            X = dataset.iloc[0:batch_size, 0:NUMBER_COLS].values
            y = dataset.iloc[0:batch_size, NUMBER_COLS]
        return (X, y.values)
    

Writing /Users/mlorenzi/works/temp/fedbiomed/var/tmp/tmp2ulwbor8/fedbiosklearn.py


In [7]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['sk']
rounds = 8

exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SkLearnTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 client_selection_strategy=None)

Messaging researcher_dbd672a5-cbde-46e2-87e5-c736aeb83832 connected with result code 0
Searching for clients with data tags: ['sk'] ...
2021-08-27 14:19:29.653983 [ RESEARCHER ] message received. {'researcher_id': 'researcher_dbd672a5-cbde-46e2-87e5-c736aeb83832', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [149, 20], 'dataset_id': 'dataset_85feab12-29fd-413e-9bb8-4f67c34b9a70'}], 'count': 1, 'client_id': 'client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6', 'command': 'search'}
2021-08-27 14:19:29.655621 [ RESEARCHER ] message received. {'researcher_id': 'researcher_dbd672a5-cbde-46e2-87e5-c736aeb83832', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [99, 20], 'dataset_id': 'dataset_6239cb1b-a5e4-4bb2-9227-64b2f81aa1d6'}], 'count': 1, 'client_id': 'client_8044438f-3ada-4a0a-8550-b348897d2b0e', 'command': 'search'}


In [8]:
exp.run()

Sampled clients in round  0   ['client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6', 'client_8044438f-3ada-4a0a-8550-b348897d2b0e']
[ RESEARCHER ] Send message to client  client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6 {'researcher_id': 'researcher_dbd672a5-cbde-46e2-87e5-c736aeb83832', 'job_id': '03892875-7fda-495e-a579-fc1a2b296af1', 'training_args': {'batch_size': None, 'lr': 0.001, 'epochs': 5, 'dry_run': False, 'batch_maxnum': 0}, 'model_args': {'model': 'SGDClassifier', 'max_iter': 1000, 'tol': 0.001, 'n_features': 20, 'n_classes': 2}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/08/27/my_model_c4d4c4fd-3c63-4ab5-9ffe-7c113eff07ae.py', 'params_url': 'http://localhost:8844/media/uploads/2021/08/27/my_model_814f2c9f-8627-432b-889e-9281cf8202a3.pt', 'model_class': 'SkLearnTrainingPlan', 'training_data': {'client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6': ['dataset_85feab12-29fd-413e-9bb8-4f67c34b9a70']}}
researcher_dbd672a5-cbde-46e2-87e5-c736aeb83832
[ RESEARCHER ] S

Downloading model params after training on  client_8044438f-3ada-4a0a-8550-b348897d2b0e 
	- from http://localhost:8844/media/uploads/2021/08/27/node_params_b88f7b92-2c5b-48b0-9d1e-1d1dc23482e4.pt
Downloading model params after training on  client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6 
	- from http://localhost:8844/media/uploads/2021/08/27/node_params_fb23c81f-1f70-4d47-8d7f-9ee59b6c936c.pt
Clients that successfully reply in round  2   ['client_8044438f-3ada-4a0a-8550-b348897d2b0e', 'client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6']
before for  [{'intercept_': array([21.66463689]), 'coef_': array([[ -2.89567681,  23.64039098, -24.82194325,  -5.39333775,
          4.5245667 ,  11.48424367,  -1.41742636,  22.91332383,
         49.68920684,  -6.56522133,  -7.5898722 ,  31.92637853,
         25.52559175,  -3.40697161,  84.41359266,  -3.55464555,
         -9.3617562 ,   0.75377756,   0.39708603,   5.41821487]])}, {'intercept_': array([1.917024]), 'coef_': array([[  2.10798022,  -9.78934752, -11.2

Downloading model params after training on  client_8044438f-3ada-4a0a-8550-b348897d2b0e 
	- from http://localhost:8844/media/uploads/2021/08/27/node_params_bda25c12-b543-4b5b-a5ee-9da5ad25c195.pt
Downloading model params after training on  client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6 
	- from http://localhost:8844/media/uploads/2021/08/27/node_params_d579da5f-7d10-4800-ba81-5987584f8526.pt
Clients that successfully reply in round  4   ['client_8044438f-3ada-4a0a-8550-b348897d2b0e', 'client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6']
before for  [{'intercept_': array([13.30873162]), 'coef_': array([[  4.94793798,   8.80237015, -14.90283239,  -7.76233559,
         -4.13888728,  22.66718809,  -1.93147271,  13.94853594,
         38.80117897,   7.64502433,  -1.99297094,  15.67660055,
         16.3816328 ,  -5.98098122,  52.76503328,   6.47767066,
          1.48805807,  -0.352431  ,   3.93020366,   8.2991169 ]])}, {'intercept_': array([-4.06522571]), 'coef_': array([[ -6.7124674 ,  -1.84388741, -1

Downloading model params after training on  client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6 
	- from http://localhost:8844/media/uploads/2021/08/27/node_params_a871c2bb-f133-4717-9052-ff4a52f16f34.pt
Downloading model params after training on  client_8044438f-3ada-4a0a-8550-b348897d2b0e 
	- from http://localhost:8844/media/uploads/2021/08/27/node_params_7293788c-c033-46fc-abc4-1c5574757761.pt
Clients that successfully reply in round  6   ['client_e01bd65c-755d-4bf3-9034-2d70ef70dbd6', 'client_8044438f-3ada-4a0a-8550-b348897d2b0e']
before for  [{'intercept_': array([0.91734461]), 'coef_': array([[ -3.69728137,  -0.82856839, -17.40886925,   5.58155825,
          3.52852048,  12.1046895 , -14.65674667,  -8.53213522,
         18.68118084, -17.38473811,  -7.46815094,  15.35162126,
         -1.77586086,  -1.87538997,  36.49510918,   6.06516642,
          4.27724584,   0.63724254,  -0.29134972,   7.26916885]])}, {'intercept_': array([13.87548199]), 'coef_': array([[-1.4444612 ,  6.25457212, -7.87

## Let's now build a test dataset. **A** is the linear transformation that was used to build the CSV datasets.

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('/Users/mlorenzi/Downloads/c3.csv')

# this dataset corresponds to the last 50 samples of the data created with this instance:
# X,y = make_classification(n_samples=300, n_features=20,n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, 
#                           hypercube=True, shift=0.0, scale=1.0,shuffle=True, random_state=123)
#
# The first 250 samples are used to create the training clients (datasets c1 and c2)
#

In [None]:
from sklearn.linear_model import SGDClassifier

X_test = data.iloc[:,:n_features]
y_test = data.iloc[:,n_features]

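With the federated parameters, the test metric should improve at each round: the MSE should decrease for SGDRegressor, while the F1 score should increase for SGDClassifier and Perceptron.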

In [None]:
if input_sklearn_model in ['SGDClassifier', 'Perceptron']:
    from sklearn.metrics import f1_score
    loss_metric = f1_score
elif input_sklearn_model == 'SGDRegressor':
    from sklearn.metrics import mean_squared_error
    loss_metric = mean_squared_error

testing_error = []

# Evaluate the aggregated parameters of each round on the test set.
for i in range(rounds):
    fed_model = exp.model_instance.get_model()
    fed_model.coef_ = exp.aggregated_params[i]['params']['coef_']
    fed_model.intercept_ = exp.aggregated_params[i]['params']['intercept_']
    # scikit-learn metrics expect (y_true, y_pred).
    metric = loss_metric(y_test.values, fed_model.predict(X_test))
    print('Test metric: ', metric)
    testing_error.append(metric)

In [None]:
from sklearn.linear_model import SGDRegressor, SGDClassifier, Perceptron
from sklearn.naive_bayes import BernoulliNB

xx=SGDClassifier()

theta_ = np.array([0.1] * (n_features*n_classes)).reshape(n_classes,n_features)
feature_count_ = np.array([0] * (n_features*n_classes)).reshape(n_classes,n_features)
class_count_ = np.array([0] * (n_classes))

xx.theta_ = theta_
xx.feature_count_ = feature_count_
xx.class_count_ = class_count_

xx.partial_fit(X_test,y_test, classes = np.unique(y_test))


In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error

# scikit-learn metrics expect (y_true, y_pred).
print(f1_score(y_test.values, xx.predict(X_test)))
print(mean_squared_error(y_test.values, xx.predict(X_test)))