# Fed-BioMed Researcher: training a federated scikit-learn model

## Purpose of the exercise

Three datasets, c1.csv, c2.csv and c3.csv, have been generated with a binary target column (two classes).
We will fit a Perceptron classifier on c1.csv and c2.csv using Federated Learning, and keep c3.csv for testing.

## Extending this notebook to any incremental learning scikit-learn model

The federated learning scheme below applies to any sklearn model that supports the `partial_fit()` method.

A whole family of models could therefore be imported into Fed-BioMed following the same approach, for example:
- Naive Bayes
- logistic regression (via `SGDClassifier`)
- SVM/SVC (linear via `SGDClassifier`; non-linear kernels need an explicit feature map, since `SVC` itself has no `partial_fit()`)
- perceptron
- KMeans (`MiniBatchKMeans`)
- incremental PCA
- mini-batch dictionary learning
- latent Dirichlet allocation
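
As a minimal sketch of what incremental learning means here (plain scikit-learn, outside Fed-BioMed; the batch size and the synthetic data are illustrative assumptions), `partial_fit()` updates a model one chunk at a time. The full list of classes has to be passed on the first call, because a single chunk may not contain every label:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

# Synthetic binary problem (illustrative, not the notebook's CSV files).
X, y = make_classification(n_samples=300, n_features=20, random_state=123)
classes = np.unique(y)

clf = Perceptron(random_state=0)
batch_size = 50  # arbitrary chunk size chosen for this sketch
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    # `classes` is required on the first call and may be repeated afterwards.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

print(clf.coef_.shape, clf.intercept_.shape)
```

In a federated setting, each node would run such updates on its own data, and only the resulting parameters (`coef_`, `intercept_`) would leave the node.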

## Get the data 

We generate the data with `make_classification` from `sklearn.datasets`.

In [1]:
from sklearn import datasets
import numpy as np

In [2]:
X, y = datasets.make_classification(n_samples=300, n_features=20, n_clusters_per_class=2, weights=None, flip_y=0.01,
                                    class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=123)

In [3]:
X.shape

(300, 20)

In [4]:
y.shape

(300,)
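
It can help to check the label distribution at this point. `make_classification` defaults to `n_classes=2` when the argument is omitted, so the target generated above is binary. A small sketch of such a check (it regenerates the data with the same `random_state`; the names `labels` and `counts` are illustrative):

```python
import numpy as np
from sklearn import datasets

X, y = datasets.make_classification(n_samples=300, n_features=20, random_state=123)

# Distinct class labels and the number of samples in each class.
labels, counts = np.unique(y, return_counts=True)
print(labels)
print(counts)
```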

In [5]:
C1 = X[:150,:]
C2 = X[150:250,:]
C3 = X[250:300,:]

In [6]:
y1 = y[:150].reshape([150,1])
y2 = y[150:250].reshape([100,1])
y3 = y[250:300].reshape([50,1])

In [7]:
C1.shape ,C2.shape , C3.shape , y1.shape, y2.shape, y3.shape

((150, 20), (100, 20), (50, 20), (150, 1), (100, 1), (50, 1))

In [8]:
C2.shape

(100, 20)

In [None]:
n1 = np.concatenate((C1, y1), axis=1)
np.savetxt('== local path to c1.csv',n1,delimiter=',')

In [None]:
n2 = np.concatenate((C2, y2), axis=1)
np.savetxt('== local path to c2.csv',n2,delimiter=',')

In [None]:
n3 = np.concatenate((C3, y3), axis=1)
np.savetxt('== local path to c3.csv',n3,delimiter=',')
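
Note that `np.savetxt` writes no header row and stores every value, labels included, as a float. A minimal round-trip sketch (using random stand-in arrays and a temporary directory, both assumptions made for illustration) confirms the layout of 20 feature columns followed by one target column:

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(150, 20))        # stand-in for C1
labels = rng.integers(0, 2, size=(150, 1))   # stand-in for y1

table = np.concatenate((features, labels), axis=1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "c1.csv")       # hypothetical location
    np.savetxt(path, table, delimiter=",")
    with open(path) as fh:
        first_line = fh.readline()           # a data row, not a header
    loaded = np.loadtxt(path, delimiter=",")

print(loaded.shape)
print(first_line.count(",") + 1)             # number of columns per row
```

Any reader of these files therefore has to be told that there is no header, for example with `header=None` in `pandas.read_csv`.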

## Start the network and set up the nodes
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed in:
https://gitlab.inria.fr/fedbiomed/fedbiomed
2. Configure 2 nodes: <br/>
* **Node 1:** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a CSV file to the node.
  * Choose the name, tags and description of the dataset (entering 'perp' for each works, and matches the `tags` used by the experiment below).
  * Pick the c1.csv file on your machine.
  * Check that the dataset has been added to node 1 by executing `./scripts/fedbiomed_run node list`.
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2:** Open a second terminal and run `./scripts/fedbiomed_run node config n2.ini add`
  * Select option 1 to add a CSV file to the node.
  * Choose the name, tags and description of the dataset (entering 'perp' for each works, and matches the `tags` used by the experiment below).
  * Pick the c2.csv file on your machine.
  * Check that the dataset has been added to node 2 by executing `./scripts/fedbiomed_run node config n2.ini list`.
  * Run the node using `./scripts/fedbiomed_run node config n2.ini start`.

Wait until you get `Connected with result code 0`, which means the node is online.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'


**model_args** is a dictionary containing your model arguments; for a model such as SGDRegressor, or the Perceptron used here, these include max_iter and tol.

**training_args** is a dictionary of parameters related to Federated Learning.

In [3]:
input_sklearn_model = 'Perceptron'

n_features = 20
n_classes = 2

model_args = { 'model': input_sklearn_model, 'max_iter':1000, 'tol': 1e-3 , 
               'n_features' : n_features, 'n_classes' : n_classes}

training_args = {
    'batch_size': None, 
    'lr': 1e-3, 
    'epochs': 5, 
    'dry_run': False,  
    'batch_maxnum': 0
}
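
How Fed-BioMed consumes `model_args` internally is not shown in this notebook. As a rough sketch, under the assumption that the scikit-learn hyperparameters (`max_iter`, `tol`) end up in the estimator's constructor while the remaining keys (`model`, `n_features`, `n_classes`) are used by the framework itself, the split could look like this:

```python
from sklearn.linear_model import Perceptron

model_args = {'model': 'Perceptron', 'max_iter': 1000, 'tol': 1e-3,
              'n_features': 20, 'n_classes': 2}

# Keep only the keys that Perceptron's constructor actually accepts.
valid = Perceptron().get_params().keys()
sk_params = {k: v for k, v in model_args.items() if k in valid}

clf = Perceptron(**sk_params)
print(sk_params)
```

Filtering against `get_params()` avoids a `TypeError` from passing framework-only keys to the estimator.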

Below is the template of the class you should provide to Fed-BioMed. The class below only overrides training_data.

**after_training_params**: a dictionary containing the model parameters.
For SGDRegressor these are coef and intercept; for KMeans they would be cluster_center and labels.

**training_step**: most of the time, this will be the partial_fit method
of a scikit-learn incremental learning model. You can uncomment the prints in order to follow the training.

**training_data**: this must return the (X, y) pair, with the same types as the arguments expected by
partial_fit. For simplicity we don't use batch_size here, but the code should also work if you want to train on a specific batch of the dataset.

In [4]:
%%writefile "$model_file"

from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
import numpy as np
import pandas as pd  # used by training_data() below


class SkLearnTrainingPlan(SGDSkLearnModel):
    def __init__(self, model_args):
        super(SkLearnTrainingPlan, self).__init__(model_args)

    def training_data(self):
        NUMBER_COLS = 20  # number of feature columns; the target is the last column
        # header=None: the CSV files were written by np.savetxt, without a header row
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:, 0:NUMBER_COLS].values
        y = dataset.iloc[:, NUMBER_COLS]
        return (X, y.values)

Writing /user/jsaray/home/INRIA-PROJECTS/reviews/yannickRev/fedbiomed/var/tmp/tmp2sc12_y8/fedbiosklearn.py
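
The `training_data` logic can be tried outside Fed-BioMed. A minimal sketch, assuming a header-less CSV laid out like the files written earlier (features first, target last); the in-memory CSV text and the reduced column count are illustrative only:

```python
import io

import pandas as pd

NUMBER_COLS = 3  # 20 in the notebook; reduced here for readability

# Header-less CSV text: three feature columns followed by the target.
csv_text = "0.1,0.2,0.3,1\n0.4,0.5,0.6,0\n0.7,0.8,0.9,1\n"

dataset = pd.read_csv(io.StringIO(csv_text), header=None, delimiter=',')
X = dataset.iloc[:, 0:NUMBER_COLS].values   # 2-D feature array
y = dataset.iloc[:, NUMBER_COLS].values     # 1-D target array

print(X.shape, y.shape)
```

The returned pair has the shapes that `partial_fit(X, y)` expects: `(n_samples, n_features)` and `(n_samples,)`.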


In [5]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['perp']
rounds = 8

exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SkLearnTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 client_selection_strategy=None)

Messaging 17b9a1a1-e93c-4ca1-bfa9-840188d453fb connected with result code 0
Searching for clients with data tags: ['sk'] ...
2021-09-07 15:06:57.316276 [ RESEARCHER ] message received. {'researcher_id': 'researcher_16d7a0f5-1751-4504-9c0e-30b5ee99cc13', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [149, 20], 'dataset_id': 'dataset_81c0addf-5fa0-4d08-a43f-6abe17d1d3ce'}], 'count': 1, 'client_id': 'client_31176b56-c503-4dcd-9c46-90f585a631ef', 'command': 'search'}
2021-09-07 15:06:57.316854 [ RESEARCHER ] message received. {'researcher_id': 'researcher_16d7a0f5-1751-4504-9c0e-30b5ee99cc13', 'success': True, 'databases': [{'name': 'sk', 'data_type': 'csv', 'tags': ['sk'], 'description': 'sk', 'shape': [99, 20], 'dataset_id': 'dataset_2399b259-ee94-4ec1-9c3e-3600f104bd81'}], 'count': 1, 'client_id': 'client_c59e50c0-47cf-4e10-b592-06cac360a01c', 'command': 'search'}
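
The aggregation step is not spelled out in this notebook. As a sketch, assuming `FedAverage` follows the usual federated averaging rule (a mean of the node parameters weighted by each node's sample count), the computation could look like this. The function `weighted_average` and the toy parameter values are hypothetical and are not Fed-BioMed code:

```python
import numpy as np


def weighted_average(node_params, sample_counts):
    """Average per-node parameter dicts, weighted by sample counts."""
    weights = np.asarray(sample_counts, dtype=float)
    weights = weights / weights.sum()
    keys = node_params[0].keys()
    return {
        key: sum(w * p[key] for w, p in zip(weights, node_params))
        for key in keys
    }


# Toy parameters for two nodes holding 150 and 100 samples.
node_params = [
    {'coef_': np.array([[1.0, 2.0]]), 'intercept_': np.array([0.0])},
    {'coef_': np.array([[3.0, 6.0]]), 'intercept_': np.array([1.0])},
]
aggregated = weighted_average(node_params, sample_counts=[150, 100])
print(aggregated['coef_'], aggregated['intercept_'])
```

With weights 0.6 and 0.4, the aggregated coefficients are pulled towards the node that holds more data.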


In [6]:
exp.run()

Sampled clients in round  0   ['client_31176b56-c503-4dcd-9c46-90f585a631ef', 'client_c59e50c0-47cf-4e10-b592-06cac360a01c']
[ RESEARCHER ] Send message to client  client_31176b56-c503-4dcd-9c46-90f585a631ef {'researcher_id': 'researcher_16d7a0f5-1751-4504-9c0e-30b5ee99cc13', 'job_id': '8b6b4a01-eae6-4c56-a254-3a788f5d2867', 'training_args': {'batch_size': None, 'lr': 0.001, 'epochs': 5, 'dry_run': False, 'batch_maxnum': 0}, 'model_args': {'model': 'Perceptron', 'max_iter': 1000, 'tol': 0.001, 'n_features': 20, 'n_classes': 2}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/09/07/my_model_a504950e-3b43-423c-89db-e2affa702964.py', 'params_url': 'http://localhost:8844/media/uploads/2021/09/07/my_model_82531fdc-7168-45c7-80ac-fb6e9b1b0fdf.pt', 'model_class': 'SkLearnTrainingPlan', 'training_data': {'client_31176b56-c503-4dcd-9c46-90f585a631ef': ['dataset_81c0addf-5fa0-4d08-a43f-6abe17d1d3ce']}}
researcher_16d7a0f5-1751-4504-9c0e-30b5ee99cc13
[ RESEARCHER ] Send

Downloading model params after training on  client_c59e50c0-47cf-4e10-b592-06cac360a01c 
	- from http://localhost:8844/media/uploads/2021/09/07/node_params_21b45e32-100e-49b3-b4c1-64f22e58258d.pt
Downloading model params after training on  client_31176b56-c503-4dcd-9c46-90f585a631ef 
	- from http://localhost:8844/media/uploads/2021/09/07/node_params_77a6ddbf-f16c-4165-9d9a-f4fffbe8619f.pt
Clients that successfully reply in round  2   ['client_c59e50c0-47cf-4e10-b592-06cac360a01c', 'client_31176b56-c503-4dcd-9c46-90f585a631ef']
Sampled clients in round  3   ['client_31176b56-c503-4dcd-9c46-90f585a631ef', 'client_c59e50c0-47cf-4e10-b592-06cac360a01c']
[ RESEARCHER ] Send message to client  client_31176b56-c503-4dcd-9c46-90f585a631ef {'researcher_id': 'researcher_16d7a0f5-1751-4504-9c0e-30b5ee99cc13', 'job_id': '8b6b4a01-eae6-4c56-a254-3a788f5d2867', 'training_args': {'batch_size': None, 'lr': 0.001, 'epochs': 5, 'dry_run': False, 'batch_maxnum': 0}, 'model_args': {'model': 'Perceptron', 

2021-09-07 15:08:00.240584 [ RESEARCHER ] message received. {'researcher_id': 'researcher_16d7a0f5-1751-4504-9c0e-30b5ee99cc13', 'job_id': '8b6b4a01-eae6-4c56-a254-3a788f5d2867', 'success': True, 'client_id': 'client_c59e50c0-47cf-4e10-b592-06cac360a01c', 'dataset_id': 'dataset_2399b259-ee94-4ec1-9c3e-3600f104bd81', 'params_url': 'http://localhost:8844/media/uploads/2021/09/07/node_params_efe81df7-fcf0-49b4-a818-b3699a8f63a5.pt', 'timing': {'rtime_training': 0.007740759999251168, 'ptime_training': 0.007674280000000033}, 'msg': '', 'command': 'train'}
Downloading model params after training on  client_31176b56-c503-4dcd-9c46-90f585a631ef 
	- from http://localhost:8844/media/uploads/2021/09/07/node_params_99e8505f-a2a9-43fe-aec8-67a194012379.pt
Downloading model params after training on  client_c59e50c0-47cf-4e10-b592-06cac360a01c 
	- from http://localhost:8844/media/uploads/2021/09/07/node_params_efe81df7-fcf0-49b4-a818-b3699a8f63a5.pt
Clients that successfully reply in round  5   ['cli

## Let's validate the trained model with the test dataset c3.csv

In [7]:
import pandas as pd

In [8]:
data = pd.read_csv('== Path to c3.csv', header=None)  # c3.csv was saved without a header row

In [9]:
X_test = data.iloc[:,:n_features]
y_test = data.iloc[:,n_features]

F1 score of the federated model after each round (printed as 'Accuracy metric'):

In [10]:
from sklearn.metrics import f1_score
loss_metric = f1_score

testing_error = []

for i in range(rounds):
    # Load the parameters aggregated at round i into the sklearn model
    fed_model = exp.model_instance.get_model()
    fed_model.coef_ = exp.aggregated_params[i]['params']['coef_']
    fed_model.intercept_ = exp.aggregated_params[i]['params']['intercept_']
    # sklearn metrics expect (y_true, y_pred)
    metric = loss_metric(y_test.values, fed_model.predict(X_test))
    print('Accuracy metric: ', metric)
    testing_error.append(metric)

Accuracy metric:  0.8727272727272727
Accuracy metric:  0.830188679245283
Accuracy metric:  0.8461538461538461
Accuracy metric:  0.830188679245283
Accuracy metric:  0.8679245283018867
Accuracy metric:  0.8679245283018867
Accuracy metric:  0.8679245283018867
Accuracy metric:  0.8846153846153847
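
The values printed above come from `f1_score`, which is not the same quantity as accuracy. A small sketch on made-up labels (not the notebook's data) shows how the two differ:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: one positive sample is misclassified as negative.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# Convention: ground truth first, predictions second.
f1 = f1_score(y_true, y_pred)          # 2*TP / (2*TP + FP + FN) = 6/7
acc = accuracy_score(y_true, y_pred)   # correct / total = 5/6
print(f1, acc)
```

If plain accuracy is what should be reported, `accuracy_score` can be substituted for `f1_score` in the evaluation loop.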
