# Fedbiomed Researcher: training a federated PPCA model

## Description of the exercise

Three datasets, `n1.csv`, `n2.csv` and `n3.csv`, are generated randomly using 3-view PPCA from a 4-dimensional latent space, with view dimensions [15, 8, 10] and 2 groups. We then distribute the 3 datasets to 3 distinct nodes and train with Fed-mv-PPCA. In each center we check the evolution of the expected log-likelihood (LL) during training.
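
For reference, the generative model implied by the data-generation code in this notebook can be written as follows. The notation is chosen here for illustration only: $k$ indexes the views, $g$ the group of sample $n$, and $q = 4$ is the latent dimension.

$$
x_n \sim \mathcal{N}(0, I_q), \qquad
y_n^{(k)} = W^{(k)} \left( x_n + a_g \right) + \mu^{(k)} + \varepsilon_n^{(k)}, \qquad
\varepsilon_n^{(k)} \sim \mathcal{N}\left(0, \sigma_k^2 I_{d_k}\right)
$$

Here $(d_1, d_2, d_3) = (15, 8, 10)$, the latent vector $x_n$ is shared by the three views of a sample, and the offset of the first group is $a_1 = 0$.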

## Data Generation

We will generate three datasets using mv-PPCA, then save them to a path of your choice on your machine.

In [None]:
import numpy as np
import pandas as pd

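def sample_x_n(N, q, random_state=None):
    """Draw N latent vectors of dimension q from a standard normal distribution."""
    return np.random.RandomState(random_state).randn(N,q)

def generate_data(N_g, W, a_g, mu, sigma2, x_n, random_state=None):
    """Generate one view: the noise-free mean plus isotropic Gaussian noise of variance sigma2."""
    rnd=np.random.RandomState(random_state)

    N=N_g.sum()
    d, q = W.shape
    sigma=np.sqrt(sigma2)

    return compute_mean_likelihood(N_g, W, a_g, mu, x_n) + sigma*rnd.randn(N,d)

def compute_mean_likelihood(N_g, W, a_g, mu, x_n):
    """Return W (x_n + a_g) + mu as a DataFrame, with samples ordered group by group."""
    G=len(N_g)
    N=N_g.sum()
    d, q = W.shape

    # g_ind[g]:g_ind[g+1] is the row range of group g
    g_ind=np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(N_g)))

    y_n=np.empty((N, d))

    for g in range(G):
        # shift the latent variables of group g by its offset a_g[g], then project with W
        y_n[g_ind[g]:g_ind[g+1]]= np.einsum("dq,nq->nd", W, x_n[g_ind[g]:g_ind[g+1]]+a_g[g]) + mu
    y_n = pd.DataFrame(data=y_n,
                     columns=[f'var_{i + 1}' for i in range(d)])

    return y_n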

In [None]:
np.random.seed(100)

D_i = [15, 8, 10]
G = 2
n_centers = 3
testing_samples = 40

q_gen, sigma2_gen1, sigma2_gen2, sigma2_gen3 = 4, 2, 1, 3

W_gen1 = np.random.uniform(-10, 10, (D_i[0], q_gen))
W_gen2 = np.random.uniform(-5, 5, (D_i[1], q_gen))
W_gen3 = np.random.uniform(-15, 15, (D_i[2], q_gen))
a_g_gen = np.concatenate((np.zeros((1, q_gen)), np.random.uniform(-10, 10, (G - 1, q_gen))))
mu_gen1 = np.random.uniform(-10, 10, D_i[0])
mu_gen2 = np.random.uniform(-5, 5, D_i[1])
mu_gen3 = np.random.uniform(-15, 15, D_i[2])

for i in range(n_centers):
    N_g = np.array([np.random.randint(25,300),np.random.randint(25,300)])
    g_ind = np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(N_g)))
    N = N_g.sum()
    x_n_gen = sample_x_n(N, q_gen, random_state=150)
    y_t1 = generate_data(N_g, W_gen1, a_g_gen, mu_gen1, sigma2_gen1, x_n_gen, random_state=250)
    y_t2 = generate_data(N_g, W_gen2, a_g_gen, mu_gen2, sigma2_gen2, x_n_gen, random_state=250)
    y_t3 = generate_data(N_g, W_gen3, a_g_gen, mu_gen3, sigma2_gen3, x_n_gen, random_state=250)

    gr = [0 for _ in range(N_g[0])]+[1 for _ in range(N_g[1])]
    gr = pd.Series(gr)

    t_i = pd.concat((y_t1, y_t2, y_t3, gr), axis=1)
    # Replace the placeholder below with the local path where this node's csv file should be saved
    np.savetxt('== Local path to node' + str(i+1) + '.csv',t_i,delimiter=',')

N_test = np.array([testing_samples//2,testing_samples//2])
g_ind = np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(N_test)))
N = N_test.sum()
x_n_gen = sample_x_n(N, q_gen, random_state=150)
y_t1 = generate_data(N_test, W_gen1, a_g_gen, mu_gen1, sigma2_gen1, x_n_gen, random_state=250)
y_t2 = generate_data(N_test, W_gen2, a_g_gen, mu_gen2, sigma2_gen2, x_n_gen, random_state=250)
y_t3 = generate_data(N_test, W_gen3, a_g_gen, mu_gen3, sigma2_gen3, x_n_gen, random_state=250)

gr = [0 for _ in range(N_test[0])]+[1 for _ in range(N_test[1])]
gr = pd.Series(gr)

t_test = pd.concat((y_t1, y_t2, y_t3, gr), axis=1)
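
Since the generating parameters are known, one quantity that can be computed directly on the generated data is the marginal log-likelihood of a single view. The sketch below assumes the model used by `generate_data` (standard normal latent variables, group offset `a`, isotropic noise), under which a view is Gaussian with mean `W a + mu` and covariance `W W^T + sigma2 I`. The helper name `ppca_marginal_loglik` is hypothetical, and this marginal log-likelihood is not necessarily the same quantity as the expected LL reported during federated training.

```python
import numpy as np


def ppca_marginal_loglik(Y, W, mu, sigma2, a=None):
    """Average marginal log-likelihood of the rows of Y for a single view.

    Assumes y ~ N(W a + mu, W W^T + sigma2 I), i.e. the latent variable
    has been integrated out. `a` is the latent offset of the group the
    rows belong to (zero if omitted).
    """
    Y = np.asarray(Y, dtype=float)
    d, q = W.shape
    if a is None:
        a = np.zeros(q)

    cov = W @ W.T + sigma2 * np.eye(d)   # marginal covariance of the view
    mean = W @ a + mu                    # marginal mean of the view
    resid = Y - mean

    # log-determinant computed in a numerically stable way
    _, logdet = np.linalg.slogdet(cov)
    # squared Mahalanobis distance of each row
    maha = np.einsum("nd,nd->n", resid, np.linalg.solve(cov, resid.T).T)

    return float(np.mean(-0.5 * (d * np.log(2 * np.pi) + logdet + maha)))
```

With the variables defined above, the first group of the first test view could be scored with `ppca_marginal_loglik(y_t1.values[:N_test[0]], W_gen1, mu_gen1, sigma2_gen1, a_g_gen[0])`.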

## Start the network and set up the clients
Before running this notebook:
1. Start the network from fedbiomed-network, as detailed in:
https://gitlab.inria.fr/fedbiomed/fedbiomed

2. Configure the 3 nodes: <br/>
* **Node 1 :** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset (you can write 'ppca_data' for all three; the tag must match the one used in the experiment below)
  * Pick the .csv file you generated for node 1.
  * Check that your data has been added in node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2 :** Open a second terminal and run `./scripts/fedbiomed_run node config n2.ini add`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset (you can write 'ppca_data' for all three)
  * Pick the .csv file you generated for node 2.
  * Check that your data has been added in node 2 by executing `./scripts/fedbiomed_run node config n2.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n2.ini start`.
  
* **Node 3 :** Open a third terminal and run `./scripts/fedbiomed_run node config n3.ini add`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset (you can write 'ppca_data' for all three)
  * Pick the .csv file you generated for node 3.
  * Check that your data has been added in node 3 by executing `./scripts/fedbiomed_run node config n3.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n3.ini start`.

 Wait until you get `Connected with result code 0`, which means the node is online.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/ppca_id.py'

Below is the template of the class you should provide to Fedbiomed:

**__init__** : declare here any additional libraries your model needs (none are required in this example)
       
**training_data** : return here the triple (X, y, ViewsX) described in the method's docstring.

In [3]:
%%writefile "$model_file"

from fedbiomed.common.ppca import PpcaPlan
import numpy as np
import pandas as pd  # used by training_data to read the csv file


class IID_MV_PPCA(PpcaPlan):
    def __init__(self, kwargs):
        super(IID_MV_PPCA, self).__init__(kwargs)
        #self.add_dependency(["from sklearn.linear_model import SGDRegressor"])
    
    def training_data(self):
        """
            Perform in this method all the data reading and data transformations you need.
            At the end you should return a triple (X, y, ViewsX), where X is the training dataset,
            y the corresponding labels, and ViewsX a list with len(ViewsX)=K, containing 1 at position i
            if the center has data for the i-th view and 0 otherwise.
            :raise NotImplementedError: if the developer does not implement this method.
        """
        dataset = pd.read_csv(self.dataset_path,header=None,delimiter=',')
        X = dataset.iloc[:,:-1].values
        y = dataset[dataset.columns[-1]]
        return (X,y.values,[1,1,1])
    

Writing /Users/balelli/ownCloud/INRIA_EPIONE/FedBioMed/fedbiomed/var/tmp/tmp97bticuo/ppca_id.py


**model_args** is a dictionary containing the mv-ppca model arguments: the total number of views across all datasets, the dimension of each view and the latent space size.

**training_args** contains the number of local EM iterations.
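
To make the role of the view dimensions concrete, the sketch below shows how a matrix whose columns are the concatenated views, such as the `X` returned by `training_data`, maps back to one block per view. The helper `split_views` is hypothetical and only illustrates the column layout; how `PpcaPlan` slices the data internally is not shown in this notebook.

```python
import numpy as np


def split_views(X, dim_views):
    """Split the columns of X into one block per view.

    dim_views lists the number of columns of each view, in order,
    e.g. [15, 8, 10] for the three views used in this notebook.
    """
    X = np.asarray(X)
    if X.shape[1] != sum(dim_views):
        raise ValueError(
            f"expected {sum(dim_views)} columns, got {X.shape[1]}"
        )
    # column indices at which one view ends and the next begins
    bounds = np.cumsum(dim_views)[:-1]
    return np.split(X, bounds, axis=1)


# Toy example: 2 samples, 15 + 8 + 10 = 33 feature columns
X_demo = np.arange(2 * 33).reshape(2, 33)
views = split_views(X_demo, [15, 8, 10])
print([v.shape for v in views])
```

Raising an error on a column mismatch catches early the common mistake of leaving the group label column inside `X`.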

In [4]:
model_args = {'tot_views':3, 'dim_views': [15, 8, 10] , 'n_components': 4}

training_args = {'n_iterations': 15}

In [5]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.mlaggregator import MLaggregator

tags =  ['ppca_data']
rounds = 5

# select the nodes participating in this experiment
exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='IID_MV_PPCA',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=MLaggregator(),
                 client_selection_strategy=None)

2021-10-05 16:50:41,041 fedbiomed INFO - Messaging researcher_30571cfd-7c42-4eb6-a9bc-207d0727a7e4 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x15299b5b0>
2021-10-05 16:50:41,097 fedbiomed INFO - Searching for clients with data tags: ['ppca_data']
2021-10-05 16:50:41,118 fedbiomed INFO - message received:{'researcher_id': 'researcher_30571cfd-7c42-4eb6-a9bc-207d0727a7e4', 'success': True, 'databases': [{'name': 'ppca_data', 'data_type': 'csv', 'tags': ['ppca_data'], 'description': 'ppca_data', 'shape': [214, 33], 'dataset_id': 'dataset_808bb2d1-ada4-4b91-bdec-29e515b4960d'}], 'count': 1, 'client_id': 'client_72ed9e37-49e2-4eff-8db6-30257bd5f5e9', 'command': 'search'}


In [None]:
# start federated training
exp.run()

2021-10-05 16:50:51,752 fedbiomed INFO - Sampled clients in round 0 ['client_72ed9e37-49e2-4eff-8db6-30257bd5f5e9']
2021-10-05 16:50:51,753 fedbiomed INFO - Send message to client client_72ed9e37-49e2-4eff-8db6-30257bd5f5e9 - {'researcher_id': 'researcher_30571cfd-7c42-4eb6-a9bc-207d0727a7e4', 'job_id': '28b8ad51-e069-4fe9-9e93-b63eb746721b', 'training_args': {'n_iterations': 15}, 'model_args': {'tot_views': 3, 'dim_views': [15, 8, 10], 'n_components': 4}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/10/05/my_model_1b802b6a-abab-4b99-b926-643d57d99e79.py', 'params_url': 'http://localhost:8844/media/uploads/2021/10/05/my_model_b1dff4c4-c8f2-49df-b6f1-1a9f17263ea2.pt', 'model_class': 'IID_MV_PPCA', 'training_data': {'client_72ed9e37-49e2-4eff-8db6-30257bd5f5e9': ['dataset_808bb2d1-ada4-4b91-bdec-29e515b4960d']}}
2021-10-05 16:50:51,754 fedbiomed DEBUG - researcher_30571cfd-7c42-4eb6-a9bc-207d0727a7e4
2021-10-05 16:50:52,056 fedbiomed INFO - message received:

## Let's now build a test dataset. **A** is the linear transformation that was used to build the training csv datasets.

In [None]:
n_features = 5
testing_samples = 40
rng = np.random.RandomState(1)


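def test_data():
    # `A` is not defined in this notebook: it must hold the linear transformation
    # used to generate the training data, with shape (n_features, 1)
    X_test = rng.randn(testing_samples, n_features)
    y_test = X_test.dot(A) + rng.randn(testing_samples).reshape([testing_samples,1])
    return X_test, y_test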

In [None]:
from sklearn.linear_model import SGDRegressor

In [None]:
X_test, y_test = test_data()

The MSE should decrease at each round when using the federated parameters.

To check this, we use `exp.aggregated_params`, which contains the model parameters collected at the end of each round.

In [None]:
testing_error = []

for i in range(rounds):
    fed_model = SGDRegressor(max_iter=1000, tol=1e-3)
    fed_model.coef_ = exp.aggregated_params[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp.aggregated_params[i]['params']['intercept_'].copy()  
    mse = np.mean((fed_model.predict(X_test).ravel() - y_test.ravel())**2)
    print('MSE ', mse)
    testing_error.append(mse)