# Fed-BioMed Researcher: training a federated scikit-learn model

## Description of the exercise:

Three datasets, `n1.csv`, `n2.csv` and `n3.csv`, will be generated randomly using the linear transformation A = [ 5 8 9 5 0 ]. We will then fit a stochastic gradient descent regressor (`SGDRegressor`) to approximate this transformation using federated learning.

## Extending this notebook to any incremental learning scikit model:

The federated learning scheme below applies to any scikit-learn model that supports the `partial_fit()` method.

A whole family of models could be imported into Fed-BioMed following the same approach, for example:
- Naive Bayes,
- logistic regression,
- SVM/SVC (linear and non-linear),
- perceptron,
- KMeans,
- incremental PCA,
- mini-batch dictionary learning,
- latent Dirichlet allocation.
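
As a rough illustration of what incremental learning with `partial_fit()` looks like outside of Fed-BioMed, the sketch below feeds an `SGDRegressor` one small batch at a time. The batch sizes, the seed and the default hyperparameters are assumptions chosen for the example, not values prescribed by Fed-BioMed.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)             # seed chosen arbitrarily for this sketch
A = np.array([5.0, 8.0, 9.0, 5.0, 0.0])    # same transformation as in this notebook

model = SGDRegressor()                     # default hyperparameters, illustrative only

# Feed the data batch by batch; each call continues from the current coef_
for _ in range(100):
    X_batch = rng.randn(100, 5)
    y_batch = X_batch.dot(A) + rng.randn(100)
    model.partial_fit(X_batch, y_batch)

print(model.coef_)
```

Because each call to `partial_fit()` resumes from the current parameters, the model can be trained in several rounds, which is what makes this family of models a natural fit for a federated setting.
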

## Data Generation

We will generate three datasets using the linear transformation A = [ 5 8 9 5 0 ],
then save them to a path of your choice on your machine.

In [None]:
import numpy as np

In [None]:
n_centers = 3
n_samples, n_features = [10,20,50], 5
testing_samples = 40

### Creating a random dataset
rng = np.random.RandomState(1)

y = []
X = []

### Creating a random linear transformation
A = np.array([5,8,9,5,0]).reshape([n_features,1])

### For every center we create a random X and y dataset with the same generative rule (the matrix A)
for i in range(n_centers):
    X.append(rng.randn(n_samples[i], n_features))
    y.append(X[i].dot(A) + rng.randn(n_samples[i], 1))

### We use the same rule to generate an independent testing dataset
X_test = rng.randn(testing_samples, n_features)
y_test = X_test.dot(A) + rng.randn(testing_samples, 1)

In [None]:
A

In [None]:
n1 = np.concatenate((X[0], y[0]), axis=1)
np.savetxt('== Local path to node1.csv',n1,delimiter=',')

In [None]:
n2 = np.concatenate((X[1], y[1]), axis=1)
np.savetxt('== Local path to node2.csv',n2,delimiter=',')

In [None]:
n3 = np.concatenate((X[2], y[2]), axis=1)
np.savetxt('== Local path to node3.csv',n3,delimiter=',')
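
Before registering the files on the nodes, it can help to check that a saved CSV has the expected layout: five feature columns followed by one target column. The sketch below is a minimal round-trip check on a small dataset built with the same rule; the file name `example_node.csv` and the temporary directory are hypothetical and only used for this illustration.

```python
import os
import tempfile
import numpy as np

rng = np.random.RandomState(1)
A = np.array([5, 8, 9, 5, 0]).reshape(5, 1)

# Build a small dataset with the same rule as above: features, then target
X_demo = rng.randn(10, 5)
y_demo = X_demo.dot(A) + rng.randn(10, 1)
data = np.concatenate((X_demo, y_demo), axis=1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_node.csv")   # hypothetical file name
    np.savetxt(path, data, delimiter=",")
    reloaded = np.loadtxt(path, delimiter=",")

# Rows are samples; the last column is the target y
print(reloaded.shape)
```
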

## Starting the network and setting up the nodes
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed in:
https://gitlab.inria.fr/fedbiomed/fedbiomed

2. Configure at least 2 nodes: <br/>
* **Node 1 :** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter 'sk' for each of them; the tag must match the `tags = ['sk']` used in the experiment below)
  * Pick the .csv file in which you stored the pair X[0], y[0].
  * Check that your data has been added in node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2 :** Open a second terminal and run `./scripts/fedbiomed_run node config n2.ini add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter 'sk' for each of them)
  * Pick the .csv file in which you stored the pair X[1], y[1].
  * Check that your data has been added in node 2 by executing `./scripts/fedbiomed_run node config n2.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n2.ini start`.
  
* **Node 3 :** Open a third terminal and run `./scripts/fedbiomed_run node config n3.ini add`
  * Select option 1 to add a csv file to the node
  * Choose the name, tags and description of the dataset (you can enter 'sk' for each of them)
  * Pick the .csv file in which you stored the pair X[2], y[2].
  * Check that your data has been added in node 3 by executing `./scripts/fedbiomed_run node config n3.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n3.ini start`.

 Wait until you get `Connected with result code 0`, which means the node is online.


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'

Below is the template of the class you should provide to Fed-BioMed:

**__init__**: add the required sklearn dependencies here.
       
**training_data**: return the pair (X, y) here; both must be of the types accepted by
the `partial_fit` method of your model.

In [None]:
%%writefile "$model_file"

from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
from sklearn.linear_model import SGDRegressor
import pandas as pd  # needed by pd.read_csv in training_data


class SGDRegressorTrainingPlan(SGDSkLearnModel):
    def __init__(self, kwargs):
        super(SGDRegressorTrainingPlan, self).__init__(kwargs)
        self.add_dependency(["from sklearn.linear_model import SGDRegressor"])
    
    def training_data(self):
        # The first NUMBER_COLS columns are the features, the next one is the target
        NUMBER_COLS = 5
        dataset = pd.read_csv(self.dataset_path,header=None,delimiter=',')
        X = dataset.iloc[:,0:NUMBER_COLS].values
        y = dataset.iloc[:,NUMBER_COLS]
        return (X,y.values)
    

**model_args** is a dictionary containing your model arguments; in the case of SGDRegressor these are max_iter and tol. n_features is provided so that the SGDRegressor coef_ array is initialized with the correct size.

**training_args** is a dictionary with parameters related to Federated Learning. 

In [None]:
model_args = { 'max_iter':1000, 'tol': 1e-3 , 'model': 'SGDRegressor' , 'n_features': 5}

training_args = {
    'epochs': 5, 
}

In [None]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['sk']
rounds = 5

# select the nodes participating in this experiment
exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SGDRegressorTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 client_selection_strategy=None)

In [None]:
# start federated training
exp.run()
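
To give an intuition of what an averaging aggregator such as `FedAverage` does at the end of a round, here is a minimal sketch of a weighted parameter average in the style of federated averaging. It assumes each node is weighted by its number of samples; the helper `weighted_average` and the toy parameter values are hypothetical and are not Fed-BioMed's actual implementation.

```python
import numpy as np

def weighted_average(params_list, n_samples):
    """Average per-node parameter dicts, weighting each node by its sample count."""
    weights = np.asarray(n_samples, dtype=float)
    weights /= weights.sum()
    keys = params_list[0].keys()
    return {
        k: sum(w * np.asarray(p[k]) for w, p in zip(weights, params_list))
        for k in keys
    }

# Toy parameters for two hypothetical nodes
node_params = [
    {"coef_": np.array([5.0, 8.0]), "intercept_": np.array([0.0])},
    {"coef_": np.array([3.0, 6.0]), "intercept_": np.array([1.0])},
]
aggregated = weighted_average(node_params, n_samples=[10, 30])
print(aggregated)
```

With sample counts 10 and 30, the second node contributes three times as much as the first to the aggregated `coef_` and `intercept_`.
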

## Let's now build a test dataset. **A** is the linear transformation that was used to build the csv training datasets.

In [None]:
n_features = 5
testing_samples = 40
rng = np.random.RandomState(1)


def test_data():
    X_test = rng.randn(testing_samples, n_features)
    y_test = X_test.dot(A) + rng.randn(testing_samples, 1)
    return X_test, y_test

In [None]:
from sklearn.linear_model import SGDRegressor

In [None]:
X_test, y_test = test_data()

The MSE should decrease from round to round as the federated parameters improve.

To check this, we use `exp.aggregated_params`, which contains the model parameters collected at the end of each round.

In [None]:
testing_error = []

for i in range(rounds):
    fed_model = SGDRegressor(max_iter=1000, tol=1e-3)
    fed_model.coef_ = exp.aggregated_params[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp.aggregated_params[i]['params']['intercept_'].copy()  
    mse = np.mean((fed_model.predict(X_test).ravel() - y_test.ravel())**2)
    print('MSE ', mse)
    testing_error.append(mse)
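
To put the federated MSE values in context, one possible reference point is a centralized least-squares fit on the pooled data of all centers. The sketch below regenerates data with the same rule as in this notebook and uses `np.linalg.lstsq`; it is only an illustrative baseline under these assumptions and is not part of the Fed-BioMed workflow.

```python
import numpy as np

rng = np.random.RandomState(1)
n_samples, n_features = [10, 20, 50], 5
A = np.array([5, 8, 9, 5, 0]).reshape(n_features, 1)

# Regenerate one dataset per center with the same generative rule
X_parts, y_parts = [], []
for n in n_samples:
    X_i = rng.randn(n, n_features)
    X_parts.append(X_i)
    y_parts.append(X_i.dot(A) + rng.randn(n, 1))

# Pool all centers, which is exactly what federated learning avoids doing
X_pool = np.vstack(X_parts)
y_pool = np.vstack(y_parts)

# Independent test set drawn with the same rule
X_test = rng.randn(40, n_features)
y_test = X_test.dot(A) + rng.randn(40, 1)

# Closed-form least-squares estimate of A on the pooled data
coef, *_ = np.linalg.lstsq(X_pool, y_pool, rcond=None)
baseline_mse = np.mean((X_test.dot(coef) - y_test) ** 2)
print('Baseline MSE ', baseline_mse)
```

Since the noise added to `y` has unit variance, an MSE close to 1 is about the best any model can reach on this data, so the federated MSE can be read against that level.
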