# Fed-BioMed researcher: training a federated scikit-learn model

## Purpose of the exercise

Three datasets, n1.csv, n2.csv and n3.csv, have been generated randomly using a linear transformation A = [ 5 8 9 5 0 ].
We will fit a stochastic gradient descent regressor (scikit-learn's `SGDRegressor`) to approximate this transformation using federated learning.
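As a purely local illustration of this goal (no federation involved), the sketch below generates synthetic data and fits an `SGDRegressor` to it. It assumes that A holds the coefficients of five input features and that the data is noise-free; the actual n1.csv, n2.csv and n3.csv files may have been generated differently.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Assumption: A is the coefficient vector of a 5-feature linear model.
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])

# Synthetic stand-in for one node's dataset (not the real n1.csv).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, A.size))
y = X @ A

# Same hyperparameters as the ones mentioned for model_args in this notebook.
model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
model.fit(X, y)

# The learned coefficients should be close to A.
print(np.round(model.coef_, 2))
```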

## Extending this notebook to any incremental learning scikit-learn model

The federated learning scheme below applies to any scikit-learn model that supports the `partial_fit()` method.

A whole family of models could therefore be imported into Fed-BioMed following the same approach, for example:
- naive Bayes
- logistic regression
- SVM/SVC (linear and non-linear)
- perceptron
- KMeans
- incremental PCA
- mini-batch dictionary learning
- latent Dirichlet allocation
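To make the role of `partial_fit()` concrete, here is a minimal local sketch that trains a `Perceptron` incrementally on mini-batches. The dataset and the batch size of 50 are illustrative assumptions and are unrelated to the CSV files used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

# Illustrative data only: 300 samples, 20 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=20, random_state=123)

model = Perceptron(random_state=0)
classes = np.unique(y)  # partial_fit needs the full list of classes on the first call

batch_size = 50  # assumption, chosen for the example
n_batches = 0
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # Each call updates the existing weights instead of refitting from scratch.
    model.partial_fit(X[start:stop], y[start:stop], classes=classes)
    n_batches += 1

print(n_batches, model.coef_.shape)
```

Because each call only updates the current weights, the same model can keep learning across successive rounds, which is what makes such estimators a natural fit for federated training.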

## Starting the network and setting up the clients
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed in:
https://gitlab.inria.fr/fedbiomed/fedbiomed
2. Download n1.csv, n2.csv and n3.csv to your computer from https://gitlab.inria.fr/fedbiomed/fedbiomed/-/tree/develop/notebooks/data
3. Configure at least 2 nodes: <br/>
* **Node 1:** Run `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the client.
  * Choose the name, tags and description of the dataset (entering 'sk' for each of them is enough).
  * Pick the file n1.csv.
  * Check that your data has been added to node 1 by executing `./scripts/fedbiomed_run node list`.
  * Run the node using `./scripts/fedbiomed_run node start`.

* **Node 2:** Open a second terminal and run `./scripts/fedbiomed_run node add config n2.ini`
  * Select option 1 to add a csv file to the client.
  * Choose the name, tags and description of the dataset (entering 'sk' for each of them is enough).
  * Pick the file n2.csv.
  * Check that your data has been added to node 2 by executing `./scripts/fedbiomed_run node list config n2.ini`.
  * Run the node using `./scripts/fedbiomed_run node start config n2.ini`.
  
* **Node 3:** Open a third terminal and run `./scripts/fedbiomed_run node add config n3.ini`
  * Select option 1 to add a csv file to the client.
  * Choose the name, tags and description of the dataset (entering 'sk' for each of them is enough).
  * Pick the file n3.csv.
  * Check that your data has been added to node 3 by executing `./scripts/fedbiomed_run node list config n3.ini`.
  * Run the node using `./scripts/fedbiomed_run node start config n3.ini`.

Wait until each node prints `Connected with result code 0`, which means it is online.


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'


**model_args** is a dictionary containing your model arguments; in the case of SGDRegressor these are max_iter and tol.

**training_args** is a dictionary of parameters related to federated learning.

In [None]:
# input_sklearn_model = 'BernoulliNB'

# n_features = 20
# n_classes = 2

# theta_ = np.array([0.1] * (n_features*n_classes)).reshape(n_classes,n_features)
# feature_count_ = np.array([0] * (n_features*n_classes)).reshape(n_classes,n_features)
# class_count_ = np.array([0] * (n_classes))

# model_args = { 'model': input_sklearn_model, 'max_iter':1000, 'tol': 1e-3 , 
#               'init_params' : {'theta_': theta_, 'feature_count_' : feature_count_, 'class_count_' : class_count_}}

# training_args = {
#     'batch_size': None, 
#     'lr': 1e-3, 
#     'epochs': 5, 
#     'dry_run': False,  
#     'batch_maxnum': 0
# }

In [None]:
input_sklearn_model = 'Perceptron'

n_features = 20
n_classes = 2

model_args = { 'model': input_sklearn_model, 'max_iter':1000, 'tol': 1e-3 , 
               'n_features' : n_features, 'n_classes' : n_classes}

training_args = {
    'batch_size': None, 
    'lr': 1e-3, 
    'epochs': 5, 
    'dry_run': False,  
    'batch_maxnum': 0
}

Below is the template of the class you should provide to Fed-BioMed:

**after_training_params**: a dictionary containing the model parameters.
For SGDRegressor these are coef and intercept; for KMeans they would be cluster_center and labels.

**training_step**: most of the time, this is the partial_fit method
of a scikit-learn incremental learning model. You can uncomment the prints to follow the progress of training.

**training_data**: must return the pair (X, y), with the same types as
the parameters expected by your partial_fit method. For simplicity we do not use batch_size here, but the code should also work if you want to train on a specific batch of the dataset.
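The sketch below shows how these three pieces fit together on a single machine. `LocalPlanSketch` is a hypothetical stand-in written for this illustration; it is not the Fed-BioMed base class, and the synthetic data it returns is an assumption.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


class LocalPlanSketch:
    """Hypothetical stand-in for a training plan (not the Fed-BioMed API)."""

    def __init__(self, model_args):
        self.model = SGDRegressor(max_iter=model_args['max_iter'],
                                  tol=model_args['tol'],
                                  random_state=0)

    def training_data(self):
        # Return (X, y) in the form expected by partial_fit: two NumPy arrays.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 5))
        y = X @ np.array([5.0, 8.0, 9.0, 5.0, 0.0])
        return X, y

    def training_step(self, X, y):
        # One incremental update of the local model.
        self.model.partial_fit(X, y)

    def after_training_params(self):
        # Parameters that would be sent back for aggregation.
        return {'coef_': self.model.coef_, 'intercept_': self.model.intercept_}


plan = LocalPlanSketch({'max_iter': 1000, 'tol': 1e-3})
X, y = plan.training_data()
plan.training_step(X, y)
params = plan.after_training_params()
print(sorted(params))
```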

In [None]:
%%writefile "$model_file"

from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
import numpy as np
import pandas as pd  # needed by training_data below


class SkLearnTrainingPlan(SGDSkLearnModel):
    def __init__(self, model_args):
        super(SkLearnTrainingPlan, self).__init__(model_args)

    def training_data(self, batch_size=None):
        # The first NUMBER_COLS columns are the features; the next column is the target.
        NUMBER_COLS = 20
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        if batch_size is None:
            X = dataset.iloc[:, 0:NUMBER_COLS].values
            y = dataset.iloc[:, NUMBER_COLS]
        else:
            # Use only the first batch_size rows.
            X = dataset.iloc[0:batch_size, 0:NUMBER_COLS].values
            y = dataset.iloc[0:batch_size, NUMBER_COLS]
        return (X, y.values)
    

In [None]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['sk']
rounds = 8

exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SkLearnTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 client_selection_strategy=None)

In [None]:
exp.run()

## Let's now build a test dataset. **A** is the linear transformation that was used to build the CSV datasets.

In [None]:
import pandas as pd

from sklearn.linear_model import SGDClassifier

xx = SGDClassifier()
print(xx.get_params())

In [None]:
data = pd.read_csv('/Users/mlorenzi/Downloads/c3.csv')

# this dataset corresponds to the last 50 samples of the data created with this instance:
# X,y = make_classification(n_samples=300, n_features=20,n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, 
#                           hypercube=True, shift=0.0, scale=1.0,shuffle=True, random_state=123)
#
# The first 250 samples are used to create the training clients (datasets c1 and c2)
#

In [None]:
from sklearn.linear_model import SGDClassifier

X_test = data.iloc[:,:n_features]
y_test = data.iloc[:,n_features]

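With the federated parameters, the test metric should improve from one round to the next: the MSE should decrease for a regressor, and the F1 score should increase for a classifier.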
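The parameters evaluated at each round are the result of aggregating the parameters returned by the nodes. Assuming the aggregator computes a weighted average of the nodes' parameters (as the name `FedAverage` suggests), the idea can be sketched in plain NumPy as follows. The function `federated_average` and the toy values are hypothetical, and the actual Fed-BioMed implementation may differ.

```python
import numpy as np


def federated_average(node_params, weights):
    """Weighted average of per-node parameter dictionaries (illustrative only)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    keys = node_params[0].keys()
    return {
        key: sum(w * np.asarray(params[key]) for w, params in zip(weights, node_params))
        for key in keys
    }


# Toy parameters from two hypothetical nodes.
node_params = [
    {'coef_': np.array([[1.0, 2.0]]), 'intercept_': np.array([0.0])},
    {'coef_': np.array([[3.0, 4.0]]), 'intercept_': np.array([2.0])},
]

avg_equal = federated_average(node_params, weights=[1, 1])
avg_weighted = federated_average(node_params, weights=[1, 3])
print(avg_equal['coef_'], avg_weighted['coef_'])
```

Weights proportional to each node's number of samples are a common choice, which gives larger datasets more influence on the aggregated model.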

In [None]:
# Pick the metric that matches the model type.
if input_sklearn_model in ['SGDClassifier', 'Perceptron']:
    from sklearn.metrics import f1_score
    loss_metric = f1_score
elif input_sklearn_model == 'SGDRegressor':
    from sklearn.metrics import mean_squared_error
    loss_metric = mean_squared_error

testing_error = []

for i in range(rounds):
    # Load the parameters aggregated at round i into the model.
    fed_model = exp.model_instance.get_model()
    fed_model.coef_ = exp.aggregated_params[i]['params']['coef_']
    fed_model.intercept_ = exp.aggregated_params[i]['params']['intercept_']
    # scikit-learn metrics expect (y_true, y_pred), in that order.
    metric = loss_metric(y_test.values, fed_model.predict(X_test))
    print('Round', i, 'metric:', metric)
    testing_error.append(metric)

In [None]:
from sklearn.linear_model import Perceptron

perc = Perceptron()
perc.fit(X_test, y_test)
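# Assumption: the original print call was truncated; printing the local accuracy as a baseline.
print(perc.score(X_test, y_test))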