# Fed-BioMed Researcher: training a federated scikit-learn model

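## Purpose of the exercise

Three datasets, `c1.csv`, `c2.csv` and `c3.csv`, have been generated with a binary target column (2 classes, matching `n_classes = 2` in the model arguments below).
We will fit a Perceptron classifier on `c1.csv` and `c2.csv` using federated learning, and keep `c3.csv` for testing.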

## Extending this notebook to any incremental learning scikit-learn model

The federated learning scheme below applies to any scikit-learn model that supports the `partial_fit()` method.

A whole family of models could be imported into Fed-BioMed following the same approach, for example:
- Naive Bayes,
- logistic regression,
- SVM/SVC (linear and non-linear),
- perceptron,
- KMeans,
- incremental PCA,
- mini-batch dictionary learning,
- latent Dirichlet allocation.
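To make the role of `partial_fit()` concrete, here is a minimal local sketch, with no federation involved, that trains a `Perceptron` incrementally on three chunks of synthetic data. The chunk size and the variable names (`X_demo`, `y_demo`, `all_classes`) are illustrative assumptions, not part of Fed-BioMed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

# Synthetic binary problem, similar in size to the one used in this notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=123)

clf = Perceptron(tol=1e-3)
all_classes = np.unique(y_demo)  # partial_fit needs the full label set on the first call

# Feed the data chunk by chunk, as an incremental learner would receive it.
for start in range(0, len(X_demo), 100):
    stop = start + 100
    clf.partial_fit(X_demo[start:stop], y_demo[start:stop], classes=all_classes)

coef_shape = clf.coef_.shape  # a single weight vector for a binary problem
```

Because each call only updates the current weights, the same model can be trained a little on one node, sent back, and trained further elsewhere, which is what makes this family of models a natural fit for federated learning.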

## Get the data 

We use the `make_classification` generator from `sklearn.datasets` to create a synthetic dataset.

In [None]:
from sklearn import datasets
import numpy as np

In [None]:
X, y = datasets.make_classification(
    n_samples=300,
    n_features=20,
    n_clusters_per_class=2,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=123,
)

In [None]:
X.shape

In [None]:
y.shape

In [None]:
C1 = X[:150,:]
C2 = X[150:250,:]
C3 = X[250:300,:]

In [None]:
y1 = y[:150].reshape([150,1])
y2 = y[150:250].reshape([100,1])
y3 = y[250:300].reshape([50,1])

In [None]:
C1.shape ,C2.shape , C3.shape , y1.shape, y2.shape, y3.shape

In [None]:
C2.shape

In [None]:
n1 = np.concatenate((C1, y1), axis=1)
np.savetxt('== local path to c1.csv',n1,delimiter=',')

In [None]:
n2 = np.concatenate((C2, y2), axis=1)
np.savetxt('== local path to c2.csv',n2,delimiter=',')

In [None]:
n3 = np.concatenate((C3, y3), axis=1)
np.savetxt('== local path to c3.csv',n3,delimiter=',')
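The files written above contain 20 feature columns followed by one target column, with no header row. The following sketch checks that layout on a small stand-in array written to a temporary directory; the file name `demo.csv` and the random data are assumptions made for illustration only.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 20))        # stand-in for C1, C2 or C3
labels = rng.integers(0, 2, size=(10, 1))   # stand-in for y1, y2 or y3
table = np.concatenate((features, labels), axis=1)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")  # hypothetical file name
    np.savetxt(csv_path, table, delimiter=",")
    reloaded = np.loadtxt(csv_path, delimiter=",")

n_rows, n_cols = reloaded.shape  # 20 feature columns followed by 1 target column
```

Any reader of these files has to account for the missing header row, otherwise the first sample is silently consumed as column names.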

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
Before running this notebook you need to configure 2 nodes: <br/>
* **Node 1 :** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter 'perp' for all three; the experiment below searches for the tag 'perp')
  * Pick the c1.csv file on your machine.
  * Check that your data has been added to node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2 :** Open a second terminal and run `./scripts/fedbiomed_run node config n2.ini add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter 'perp' for all three; the experiment below searches for the tag 'perp')
  * Pick the c2.csv file on your machine.
  * Check that your data has been added to node 2 by executing `./scripts/fedbiomed_run node config n2.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n2.ini start`.
 

Wait until each node prints `Starting task manager`, which means the node is online.


In [None]:
%load_ext autoreload
%autoreload 2

**model_args** is a dictionary containing your model arguments. For the `Perceptron` used here, these include `max_iter` and `tol`, along with the model name, `n_features` and `n_classes`.

**training_args** is a dictionary of parameters related to federated learning.

In [1]:
input_sklearn_model = 'Perceptron'

n_features = 20
n_classes = 2

model_args = { 'model': input_sklearn_model, 'max_iter':1000, 'tol': 1e-3 , 
               'n_features' : n_features, 'n_classes' : n_classes}

training_args = {   
    'epochs': 5, 
}
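One plausible reason for passing `n_features` and `n_classes` in `model_args` is that the parameters of a linear model can then be given their final shape before any data has been seen. The sketch below illustrates that idea with a hypothetical helper, `initial_linear_params`; it is an assumption made for illustration and not part of the Fed-BioMed API.

```python
import numpy as np

def initial_linear_params(n_features: int, n_classes: int) -> dict:
    """Hypothetical helper: zero parameters shaped like scikit-learn's linear classifiers."""
    # Binary problems use a single weight vector; multiclass uses one per class.
    n_rows = 1 if n_classes == 2 else n_classes
    return {
        "coef_": np.zeros((n_rows, n_features)),
        "intercept_": np.zeros(n_rows),
    }

params = initial_linear_params(n_features=20, n_classes=2)
```

With the values used in this notebook, `coef_` has shape `(1, 20)` and `intercept_` has shape `(1,)`.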

Below is the template of the class you should provide to Fed-BioMed:

**training_data**: this method must return the (X, y) data, wrapped in a `DataManager`, with the same types that your model's `partial_fit` method expects.

In [2]:
from fedbiomed.common.training_plans import SGDSkLearnModel
from fedbiomed.common.data import DataManager

class SkLearnTrainingPlan(SGDSkLearnModel):
    def __init__(self, model_args: dict = {}):
        super(SkLearnTrainingPlan,self).__init__(model_args)
        # Imports listed here are made available when the plan runs on a node
        self.add_dependency(["from sklearn.linear_model import Perceptron"])
    
    def training_data(self):
        # The CSV files have no header: 20 feature columns, then the target column
        NUMBER_COLS = 20
        # Note: `pd` is not imported in this cell; it is assumed to be available on the node
        dataset = pd.read_csv(self.dataset_path,header=None,delimiter=',')
        X = dataset.iloc[:,0:NUMBER_COLS].values
        y = dataset.iloc[:,NUMBER_COLS]       
        return DataManager(dataset=X,target=y.values)
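The slicing performed in `training_data` can be tried outside Fed-BioMed. The sketch below applies the same `iloc` logic to a tiny in-memory CSV; the data and the reduced `NUMBER_COLS = 3` are assumptions chosen to keep the example readable.

```python
import io
import pandas as pd

NUMBER_COLS = 3  # the notebook uses 20; a small value keeps the example readable

# Two rows of a headerless CSV: three features followed by the target.
csv_text = "0.1,0.2,0.3,1\n0.4,0.5,0.6,0\n"
dataset = pd.read_csv(io.StringIO(csv_text), header=None, delimiter=",")

X_local = dataset.iloc[:, 0:NUMBER_COLS].values  # feature matrix, shape (2, 3)
y_local = dataset.iloc[:, NUMBER_COLS].values    # target vector, shape (2,)
```

Both results are NumPy arrays, which is the input type accepted by `partial_fit`.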

In [3]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['perp']
rounds = 8

# Search the nodes for datasets matching the given tags
exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=SkLearnTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

2022-03-22 09:12:44,013 fedbiomed INFO - Component environment:
2022-03-22 09:12:44,014 fedbiomed INFO - type = ComponentType.RESEARCHER
2022-03-22 09:12:44,226 fedbiomed INFO - Messaging researcher_1fecd236-3507-4a58-9921-8d364492a6d1 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x7f350a7fc5e0>
2022-03-22 09:12:44,267 fedbiomed INFO - Searching dataset with data tags: ['perp'] for all nodes
2022-03-22 09:12:54,305 fedbiomed INFO - Node selected for training -> node_fa6a1655-e676-42a8-a6a5-fe2630057d46
2022-03-22 09:12:54,334 fedbiomed DEBUG - Model file has been saved: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0015/my_model_3782e2a4-1385-4653-a3a3-c3e80c1f9560.py
2022-03-22 09:12:54,368 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0015/my_model_3782e2a4-1385-4653-a3a3-c3e80c1f9560.py successful, with status co
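The experiment above is configured with the `FedAverage()` aggregator. As a rough illustration of the idea usually meant by federated averaging, the sketch below computes a sample-weighted mean of per-node parameters. The function `federated_average`, the node dictionaries and the sample counts are hypothetical, and this is not Fed-BioMed's implementation.

```python
import numpy as np

def federated_average(node_params: list, sample_counts: list) -> dict:
    """Average each parameter across nodes, weighting by local sample count."""
    weights = np.asarray(sample_counts, dtype=float)
    weights /= weights.sum()
    return {
        name: sum(w * params[name] for w, params in zip(weights, node_params))
        for name in node_params[0]
    }

# Two hypothetical nodes holding 150 and 100 samples.
node_a = {"coef_": np.array([[1.0, 2.0]]), "intercept_": np.array([0.0])}
node_b = {"coef_": np.array([[3.0, 4.0]]), "intercept_": np.array([1.0])}
averaged = federated_average([node_a, node_b], sample_counts=[150, 100])
```

When only one node answers, as in the log output above where a single node is selected, such an average simply returns that node's parameters.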

In [4]:
# train experiments
exp.run()

2022-03-22 09:12:54,410 fedbiomed INFO - Sampled nodes in round 0 ['node_fa6a1655-e676-42a8-a6a5-fe2630057d46']
2022-03-22 09:12:54,413 fedbiomed INFO - Send message to node node_fa6a1655-e676-42a8-a6a5-fe2630057d46 - {'researcher_id': 'researcher_1fecd236-3507-4a58-9921-8d364492a6d1', 'job_id': 'a1ba5071-4500-4fcb-bff8-6b97ca345aed', 'training_args': {'epochs': 5}, 'model_args': {'model': 'Perceptron', 'max_iter': 1000, 'tol': 0.001, 'n_features': 20, 'n_classes': 2, 'verbose': 1}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/22/my_model_3782e2a4-1385-4653-a3a3-c3e80c1f9560.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/22/aggregated_params_init_402bc571-08e0-41fe-a161-02b0de2d03df.pt', 'model_class': 'SkLearnTrainingPlan', 'training_data': {'node_fa6a1655-e676-42a8-a6a5-fe2630057d46': ['dataset_579dde20-22b3-443b-8efd-a73a799d2ea6']}}
2022-03-22 09:12:54,415 fedbiomed DEBUG - researcher_1fecd236-3507-4a58-9921-8d364492a6d1
2022-03-22 

2022-03-22 09:13:14,502 fedbiomed INFO - Downloading model params after training on node_fa6a1655-e676-42a8-a6a5-fe2630057d46 - from http://localhost:8844/media/uploads/2022/03/22/node_params_95823fed-8277-4559-876b-3a780f556865.pt
2022-03-22 09:13:14,506 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_cab995ee-8d19-4263-8169-46ba2393aed1.pt successful, with status code 200
2022-03-22 09:13:14,509 fedbiomed INFO - Nodes that successfully reply in round 1 ['node_fa6a1655-e676-42a8-a6a5-fe2630057d46']
2022-03-22 09:13:14,554 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0015/aggregated_params_2aade478-0c68-4c85-8e61-f985df0d5b41.pt successful, with status code 201
2022-03-22 09:13:14,555 fedbiomed INFO - Saved aggregated params for round 1 in /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0015/aggregated_params_2aade478-0c68-4c85-8e61-f985df0d5b41.pt
2022-03-2

2022-03-22 09:13:24,753 fedbiomed INFO - TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46
					 Epoch: 3 | Completed: 120/120 (100%)
					 Loss: 1.352828
					 ---------
2022-03-22 09:13:24,757 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / INFO - Train Epoch: 4 [Batch All Samples]	Loss: 1.372400
2022-03-22 09:13:24,759 fedbiomed INFO - TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46
					 Epoch: 4 | Completed: 120/120 (100%)
					 Loss: 1.372400
					 ---------
2022-03-22 09:13:24,762 fedbiomed INFO - TESTING AFTER TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46
					 Completed: 30/30 (3%)
					 ACCURACY: 1.000000
					 ---------
2022-03-22 09:13:24,816 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / INFO - results uploaded successfully 
2022-03-22 09:13:34,674 fedbiomed INFO - Downloading model params after training on node

2022-03-22 09:13:44,993 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / INFO - Train Epoch: 1 [Batch All Samples]	Loss: 1.109058
2022-03-22 09:13:44,999 fedbiomed INFO - TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46
					 Epoch: 1 | Completed: 120/120 (100%)
					 Loss: 1.109058
					 ---------
2022-03-22 09:13:45,003 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / INFO - Train Epoch: 2 [Batch All Samples]	Loss: 1.201760
2022-03-22 09:13:45,006 fedbiomed INFO - TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46
					 Epoch: 2 | Completed: 120/120 (100%)
					 Loss: 1.201760
					 ---------
2022-03-22 09:13:45,008 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / INFO - Train Epoch: 3 [Batch All Samples]	Loss: 0.693517
2022-03-22 09:13:45,010 fedbiomed INFO - TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 
					 Epoch

2022-03-22 09:14:05,066 fedbiomed DEBUG - researcher_1fecd236-3507-4a58-9921-8d364492a6d1
2022-03-22 09:14:05,091 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / INFO - training with arguments {'history_monitor': <fedbiomed.node.history_monitor.HistoryMonitor object at 0x7fde8d41fd60>, 'node_args': {'gpu': False, 'gpu_num': None, 'gpu_only': False}, 'epochs': 5}
2022-03-22 09:14:05,100 fedbiomed INFO - TESTING BEFORE TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46
					 Completed: 30/30 (3%)
					 RECALL: 1.000000
					 ---------
2022-03-22 09:14:05,104 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / INFO - Train Epoch: 0 [Batch All Samples]	Loss: 2.052002
2022-03-22 09:14:05,109 fedbiomed INFO - TRAINING
					 NODE_ID: node_fa6a1655-e676-42a8-a6a5-fe2630057d46
					 Epoch: 0 | Completed: 120/120 (100%)
					 Loss: 2.052002
					 ---------
2022-03-22 09:14:05,111 fedbiomed INFO - l

8

2022-03-22 09:14:32,053 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / CRITICAL - Node stopped in signal_handler, probably by user decision (Ctrl C)
2022-03-22 09:14:32,205 fedbiomed INFO - log from: node_fa6a1655-e676-42a8-a6a5-fe2630057d46 / CRITICAL - Node stopped in signal_handler, probably by user decision (Ctrl C)


## Let's validate the trained model on the test dataset c3.csv

In [None]:
import pandas as pd

In [None]:
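# The file was written without a header row, so header=None keeps the first sample as data
data = pd.read_csv('== local path to c3.csv', header=None)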

In [None]:
X_test = data.iloc[:,:n_features]
y_test = data.iloc[:,n_features]

F1 score computed with the federated model:

To do so, we use `exp.aggregated_params()`, which contains the model parameters collected at the end of each round.

In [None]:
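from sklearn.metrics import f1_score
loss_metric = f1_score  # note: F1 is a score (higher is better), not a loss

testing_error = []  # one F1 score per round

for i in range(rounds):
    # Load the parameters aggregated at round i into a fresh model instance
    fed_model = exp.model_instance().get_model()
    fed_model.coef_ = exp.aggregated_params()[i]['params']['coef_']
    fed_model.intercept_ = exp.aggregated_params()[i]['params']['intercept_']
    # scikit-learn metrics expect (y_true, y_pred), in that order
    metric = loss_metric(y_test.to_numpy(), fed_model.predict(X_test))
    print('F1 score metric: ', metric)
    testing_error.append(metric)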
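For reference, the binary F1 score reported above can be written out by hand as `2*TP / (2*TP + FP + FN)`. The sketch below is a plain-Python illustration on made-up labels; the function `binary_f1` and the example lists are assumptions for illustration, and scikit-learn's `f1_score` remains the function to use in practice.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

truth = [1, 1, 0, 0, 1]  # made-up ground-truth labels
preds = [1, 0, 0, 1, 1]  # made-up predictions
score = binary_f1(truth, preds)  # TP=2, FP=1, FN=1
```

Swapping the two arguments exchanges FP and FN, which leaves this binary F1 unchanged, but other metrics are not symmetric, so passing the true labels first is the safer habit.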