# Fedbiomed Researcher: training a federated PPCA model

## Description of the exercise

Three datasets, `n1.csv`, `n2.csv` and `n3.csv`, are generated randomly from a 3-view PPCA model with a 4-dimensional latent space, view dimensions [15, 8, 10] and 2 groups. We then distribute the 3 datasets to 3 distinct nodes and train with Fed-mv-PPCA. In each center we monitor the evolution of the expected log-likelihood (LL) during training.

## Data Generation

We generate three datasets using mv-PPCA and save them to a path of your choice on your machine.
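
For reference, the sampling scheme that the code below is meant to implement can be written as follows, for a sample $n$ in group $g$ and view $k$. The symbols mirror the variable names `x_n`, `a_g`, `W`, `mu` and `sigma2` in the code; this is a reading of the code, not a statement about the Fed-mv-PPCA library internals.

$$
x_n \sim \mathcal{N}(0, I_q), \qquad
y_n^{(k)} = W^{(k)} (x_n + a_g) + \mu^{(k)} + \sigma^{(k)} \varepsilon_n^{(k)}, \qquad
\varepsilon_n^{(k)} \sim \mathcal{N}(0, I_{d_k}),
$$

with $q = 4$, $(d_1, d_2, d_3) = (15, 8, 10)$, and $a_1 = 0$ for the first group.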

In [None]:
import numpy as np
import pandas as pd

def sample_x_n(N, q, random_state=None):
    return np.random.RandomState(random_state).randn(N,q)

def generate_data(N_g, W, a_g, mu, sigma2, x_n, view, random_state=None):
    rnd=np.random.RandomState(random_state)

    N=N_g.sum()
    d, q = W.shape
    sigma=np.sqrt(sigma2)

    return compute_mean_likelihood(N_g, W, a_g, mu, x_n, view) + sigma*rnd.randn(N,d)

def compute_mean_likelihood(N_g, W, a_g, mu, x_n, view):
    G=len(N_g)
    N=N_g.sum()
    d, q = W.shape

    g_ind=np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(N_g)))


    y_n=np.empty((N, d))

    for g in range(G):
        y_n[g_ind[g]:g_ind[g+1]]= np.einsum("dq,nq->nd", W, x_n[g_ind[g]:g_ind[g+1]]+a_g[g]) + mu
    y_n = pd.DataFrame(data=y_n,
                     columns=[f'var_{view},{i + 1}' for i in range(d)])

    return y_n

In [None]:
np.random.seed(100)

D_i = [15, 8, 10]
G = 2
n_centers = 3
testing_samples = 40

q_gen, sigma2_gen1, sigma2_gen2, sigma2_gen3 = 4, 2, 1, 3

W_gen1 = np.random.uniform(-10, 10, (D_i[0], q_gen))
W_gen2 = np.random.uniform(-5, 5, (D_i[1], q_gen))
W_gen3 = np.random.uniform(-15, 15, (D_i[2], q_gen))
a_g_gen = np.concatenate((np.zeros((1, q_gen)), np.random.uniform(-10, 10, (G - 1, q_gen))))
mu_gen1 = np.random.uniform(-10, 10, D_i[0])
mu_gen2 = np.random.uniform(-5, 5, D_i[1])
mu_gen3 = np.random.uniform(-15, 15, D_i[2])

for i in range(n_centers):
    N_g = np.array([np.random.randint(25,300),np.random.randint(25,300)])
    g_ind = np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(N_g)))
    N = N_g.sum()
    x_n_gen = sample_x_n(N, q_gen, random_state=150)
    y_t1 = generate_data(N_g, W_gen1, a_g_gen, mu_gen1, sigma2_gen1, x_n_gen, view = 1, random_state=250)
    y_t2 = generate_data(N_g, W_gen2, a_g_gen, mu_gen2, sigma2_gen2, x_n_gen, view = 2, random_state=250)
    y_t3 = generate_data(N_g, W_gen3, a_g_gen, mu_gen3, sigma2_gen3, x_n_gen, view = 3, random_state=250)

    gr = [0 for _ in range(N_g[0])] + [1 for _ in range(N_g[1])]
    # Name the label column directly instead of patching columns.values in place.
    gr = pd.Series(gr, name='Label')

    t_i = pd.concat((y_t1, y_t2, y_t3, gr), axis=1)
    # Replace the placeholder below with a local path of your choice.
    t_i.to_csv('== Local path to node' + str(i+1) + '.csv', sep=',')

N_test = np.array([testing_samples//2,testing_samples//2])
g_ind = np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(N_test)))
N = N_test.sum()
x_n_gen = sample_x_n(N, q_gen, random_state=150)
y_t1 = generate_data(N_test, W_gen1, a_g_gen, mu_gen1, sigma2_gen1, x_n_gen, view = 1, random_state=250)
y_t2 = generate_data(N_test, W_gen2, a_g_gen, mu_gen2, sigma2_gen2, x_n_gen, view = 2, random_state=250)
y_t3 = generate_data(N_test, W_gen3, a_g_gen, mu_gen3, sigma2_gen3, x_n_gen, view = 3, random_state=250)

gr = [0 for _ in range(N_test[0])]+[1 for _ in range(N_test[1])]
gr = pd.Series(gr)

t_test = pd.concat((y_t1, y_t2, y_t3, gr), axis=1)
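
Before registering a file on a node, it can help to confirm that the table survives a CSV round trip with the expected layout. The sketch below is a self-contained illustration on toy data, not the real datasets: it writes to an in-memory buffer instead of a real path and reads the table back with the first column as the row index. The toy row count and the `table`, `buffer` and `reloaded` names are assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for one node's table: 3 views of sizes [15, 8, 10] plus a label.
dim_views = [15, 8, 10]
n_rows = 6
rng = np.random.RandomState(0)

views = [
    pd.DataFrame(rng.randn(n_rows, d),
                 columns=[f'var_{k + 1},{i + 1}' for i in range(d)])
    for k, d in enumerate(dim_views)
]
labels = pd.Series([0, 0, 0, 1, 1, 1], name='Label')
table = pd.concat(views + [labels], axis=1)

# Write to an in-memory buffer, then read it back with the first
# column interpreted as the row index.
buffer = io.StringIO()
table.to_csv(buffer, sep=',')
buffer.seek(0)
reloaded = pd.read_csv(buffer, delimiter=',', index_col=0)

print(reloaded.shape)        # (6, 34): 33 feature columns plus 'Label'
print(reloaded.columns[-1])  # Label
```

Column names such as `var_1,1` contain the delimiter, so pandas quotes them on writing and restores them on reading.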

## Start the network and set up the nodes
Before running this notebook:
1. You should start the network from fedbiomed-network, as detailed in :
https://gitlab.inria.fr/fedbiomed/fedbiomed

2. You need to configure at least 2 nodes: <br/>
* **Node 1 :** `./scripts/fedbiomed_run node add`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset. The tag must be `ppca_data`, since the experiment below searches for nodes with that tag.
  * Pick the .csv file generated above for node 1.
  * Check that your data has been added in node 1 by executing `./scripts/fedbiomed_run node list`
  * Run the node using `./scripts/fedbiomed_run node start`. <br/>

* **Node 2 :** Open a second terminal and run `./scripts/fedbiomed_run node config n2.ini add`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset. The tag must be `ppca_data`, since the experiment below searches for nodes with that tag.
  * Pick the .csv file generated above for node 2.
  * Check that your data has been added in node 2 by executing `./scripts/fedbiomed_run node config n2.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n2.ini start`.
  
* **Node 3 :** Open a third terminal and run `./scripts/fedbiomed_run node config n3.ini add`
  * Select option 1 to add a csv file to the client
  * Choose the name, tags and description of the dataset. The tag must be `ppca_data`, since the experiment below searches for nodes with that tag.
  * Pick the .csv file generated above for node 3.
  * Check that your data has been added in node 3 by executing `./scripts/fedbiomed_run node config n3.ini list`
  * Run the node using `./scripts/fedbiomed_run node config n3.ini start`.

 Wait until you get `Connected with result code 0`, which means the node is online.


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
from fedbiomed.researcher.environ import TMP_DIR
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=TMP_DIR+'/')
model_file = tmp_dir_model.name + '/ppca_id.py'

Below is the template of the class you should provide to Fedbiomed:

**`__init__`** : declare here any additional dependencies the training plan needs (for example with `self.add_dependency`).

**`training_data`** : return here the triple (X, y, ViewsX): the training data, the corresponding labels, and a list indicating which views are available in the center.

In [3]:
%%writefile "$model_file"

from fedbiomed.common.ppca import PpcaPlan
import numpy as np
import pandas as pd  # needed by training_data below


class IID_MV_PPCA(PpcaPlan):
    def __init__(self, kwargs):
        super(IID_MV_PPCA, self).__init__(kwargs)
        #self.add_dependency(["from sklearn.linear_model import SGDRegressor"])
    
    def training_data(self):
        """
            Perform in this method all the data reading and data transformations you need.
            At the end you should return a triple (X, y, ViewsX), where X is the training dataset,
            y the corresponding labels, and ViewsX a list with len(ViewsX)=K, containing 1 at position i
            if the center has data for the i-th view and 0 otherwise.
        """
        # The first CSV column is the row index written by to_csv.
        dataset = pd.read_csv(self.dataset_path, delimiter=',', index_col=0)
        X = dataset.iloc[:, :-1]           # all view columns
        y = dataset[dataset.columns[-1]]   # last column: group label
        return (X, y, [1, 1, 1])           # all three views are available
    

Writing /Users/balelli/ownCloud/INRIA_EPIONE/FedBioMed/fedbiomed/var/tmp/tmpjo_dv67c/ppca_id.py
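
The `ViewsX` list returned by `training_data` flags which views a center actually holds. As an illustration only, the hypothetical helper below (`views_indicator` is not part of Fedbiomed) derives that list from column names following the `var_{view},{i}` pattern used in the data generation step. How `PpcaPlan` consumes the list internally is not shown in this notebook.

```python
def views_indicator(columns, tot_views):
    """Return a list with 1 at position k if any column belongs to view k+1."""
    present = set()
    for name in columns:
        if name.startswith('var_'):
            # 'var_3,1' -> view number 3
            view = int(name[len('var_'):].split(',')[0])
            present.add(view)
    return [1 if k + 1 in present else 0 for k in range(tot_views)]


# A center that holds views 1 and 3 but not view 2.
cols = ['var_1,1', 'var_1,2', 'var_3,1', 'Label']
print(views_indicator(cols, 3))  # [1, 0, 1]
```

In this notebook every center holds all three views, so the training plan simply returns `[1, 1, 1]`.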


**model_args** is a dictionary containing the mv-ppca model arguments: the total number of views across all datasets, the dimension of each view and the latent space size.

**training_args** contains the number of local EM iterations.

In [4]:
model_args = {'tot_views':3, 'dim_views': [15, 8, 10] , 'n_components': 4}

training_args = {'n_iterations': 15}
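
To make the role of `dim_views` concrete, here is a minimal sketch, assuming the feature columns are stored view after view as in the generated CSV files. `split_views` is a hypothetical helper written for illustration, not a Fedbiomed function, and `PpcaPlan` may well split the data differently.

```python
import numpy as np

model_args = {'tot_views': 3, 'dim_views': [15, 8, 10], 'n_components': 4}


def split_views(X, dim_views):
    """Cut a feature matrix into consecutive column blocks, one per view."""
    bounds = np.cumsum(dim_views)[:-1]  # column indices where a new view starts
    return np.split(X, bounds, axis=1)


X = np.zeros((5, sum(model_args['dim_views'])))  # 5 rows, 33 feature columns
blocks = split_views(X, model_args['dim_views'])

assert len(blocks) == model_args['tot_views']
print([b.shape for b in blocks])  # [(5, 15), (5, 8), (5, 10)]
```

The check `sum(dim_views) == X.shape[1]` is a cheap way to catch a mismatch between `model_args` and the dataset before launching an experiment.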

In [5]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.mlaggregator import MLaggregator

tags =  ['ppca_data']
rounds = 20

# select nodes participating in this experiment
exp = Experiment(tags=tags,
                 #clients=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='IID_MV_PPCA',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=MLaggregator(),
                 client_selection_strategy=None)

2021-10-08 16:33:44,822 fedbiomed INFO - Messaging researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x155e005b0>
2021-10-08 16:33:44,904 fedbiomed INFO - Searching for clients with data tags: ['ppca_data']
2021-10-08 16:33:44,933 fedbiomed INFO - message received:{'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'success': True, 'databases': [{'name': 'ppca_data', 'data_type': 'csv', 'tags': ['ppca_data'], 'description': 'ppca_data', 'shape': [351, 34], 'dataset_id': 'dataset_6408d072-5f60-4592-ad83-e095f92aef13'}], 'count': 1, 'client_id': 'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'command': 'search'}
2021-10-08 16:33:44,940 fedbiomed INFO - message received:{'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'success': True, 'databases': [{'name': 'ppca_data', 'data_type': 'csv', 'tags': ['ppca_data'], 'description': 'ppca_data', 'shape': [215

In [6]:
# start federated training
exp.run()

2021-10-08 16:33:57,611 fedbiomed INFO - Sampled clients in round 0 ['client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'client_305a6614-69ea-42ca-9850-43de0246b72f']
2021-10-08 16:33:57,613 fedbiomed INFO - Send message to client client_d5ae6538-a16d-4d84-a002-e2260fcc10f4 - {'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'training_args': {'n_iterations': 15}, 'model_args': {'tot_views': 3, 'dim_views': [15, 8, 10], 'n_components': 4}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/10/08/my_model_c291cf4f-8e30-4075-98d0-21a89628d23f.py', 'params_url': 'http://localhost:8844/media/uploads/2021/10/08/my_model_2eb56e7b-edc5-4f0a-b7f9-11dd925cffc8.pt', 'model_class': 'IID_MV_PPCA', 'training_data': {'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4': ['dataset_6408d072-5f60-4592-ad83-e095f92aef13']}}
2021-10-08 16:33:57,615 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2021-10-08 

2021-10-08 16:34:28,124 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2021-10-08 16:34:35,212 fedbiomed INFO - message received:{'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'success': True, 'client_id': 'client_305a6614-69ea-42ca-9850-43de0246b72f', 'dataset_id': 'dataset_e009a9dc-a088-4603-83f8-bb9bf82dc849', 'params_url': 'http://localhost:8844/media/uploads/2021/10/08/node_params_cb83af4a-8d2e-4f6c-b0fb-a2cb800f7120.pt', 'timing': {'rtime_training': 6.806645809000003, 'ptime_training': 3.8936030000000006}, 'msg': '', 'command': 'train'}
2021-10-08 16:34:37,344 fedbiomed INFO - message received:{'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'success': True, 'client_id': 'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'dataset_id': 'dataset_6408d072-5f60-4592-ad83-e095f92aef13', 'params_url': 'http://localhost:8844/media/u

2021-10-08 16:35:23,626 fedbiomed INFO - Downloading model params after training on client_305a6614-69ea-42ca-9850-43de0246b72f - from http://localhost:8844/media/uploads/2021/10/08/node_params_f85d369f-9a32-4c49-a2ad-da8a730a5f0d.pt
2021-10-08 16:35:23,660 fedbiomed INFO - Downloading model params after training on client_d5ae6538-a16d-4d84-a002-e2260fcc10f4 - from http://localhost:8844/media/uploads/2021/10/08/node_params_f369866b-5fbf-41d1-bd47-db97992783e5.pt
2021-10-08 16:35:23,714 fedbiomed INFO - Clients that successfully reply in round 4 ['client_305a6614-69ea-42ca-9850-43de0246b72f', 'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4']
2021-10-08 16:35:23,821 fedbiomed INFO - Sampled clients in round 5 ['client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'client_305a6614-69ea-42ca-9850-43de0246b72f']
2021-10-08 16:35:23,825 fedbiomed INFO - Send message to client client_d5ae6538-a16d-4d84-a002-e2260fcc10f4 - {'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '

2021-10-08 16:36:09,330 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2021-10-08 16:36:09,331 fedbiomed INFO - Send message to client client_305a6614-69ea-42ca-9850-43de0246b72f - {'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'training_args': {'n_iterations': 15}, 'model_args': {'tot_views': 3, 'dim_views': [15, 8, 10], 'n_components': 4}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/10/08/my_model_c291cf4f-8e30-4075-98d0-21a89628d23f.py', 'params_url': 'http://localhost:8844/media/uploads/2021/10/08/researcher_params_9adde7e1-d3cb-4078-8fda-2d8828f05762.pt', 'model_class': 'IID_MV_PPCA', 'training_data': {'client_305a6614-69ea-42ca-9850-43de0246b72f': ['dataset_e009a9dc-a088-4603-83f8-bb9bf82dc849']}}
2021-10-08 16:36:09,339 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2021-10-08 16:36:16,274 fedbiomed INFO - message received:{'researcher_id':

2021-10-08 16:36:53,119 fedbiomed INFO - message received:{'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'success': True, 'client_id': 'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'dataset_id': 'dataset_6408d072-5f60-4592-ad83-e095f92aef13', 'params_url': 'http://localhost:8844/media/uploads/2021/10/08/node_params_99a4ac56-7d60-425f-a56b-2e6dbbc65ba7.pt', 'timing': {'rtime_training': 8.148407865999985, 'ptime_training': 10.139786999999998}, 'msg': '', 'command': 'train'}
2021-10-08 16:36:59,785 fedbiomed INFO - Downloading model params after training on client_305a6614-69ea-42ca-9850-43de0246b72f - from http://localhost:8844/media/uploads/2021/10/08/node_params_93ece63f-9d20-4547-9e87-e4bfd6511e9f.pt
2021-10-08 16:36:59,826 fedbiomed INFO - Downloading model params after training on client_d5ae6538-a16d-4d84-a002-e2260fcc10f4 - from http://localhost:8844/media/uploads/2021/10/08/node_params_99a4ac56-7d60-425f-a56

2021-10-08 16:37:30,454 fedbiomed INFO - Sampled clients in round 12 ['client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'client_305a6614-69ea-42ca-9850-43de0246b72f']
2021-10-08 16:37:30,456 fedbiomed INFO - Send message to client client_d5ae6538-a16d-4d84-a002-e2260fcc10f4 - {'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'training_args': {'n_iterations': 15}, 'model_args': {'tot_views': 3, 'dim_views': [15, 8, 10], 'n_components': 4}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/10/08/my_model_c291cf4f-8e30-4075-98d0-21a89628d23f.py', 'params_url': 'http://localhost:8844/media/uploads/2021/10/08/researcher_params_6e6d36bd-194b-4522-a019-97fe6a5c4b33.pt', 'model_class': 'IID_MV_PPCA', 'training_data': {'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4': ['dataset_6408d072-5f60-4592-ad83-e095f92aef13']}}
2021-10-08 16:37:30,459 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2

2021-10-08 16:38:00,898 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2021-10-08 16:38:06,441 fedbiomed INFO - message received:{'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'success': True, 'client_id': 'client_305a6614-69ea-42ca-9850-43de0246b72f', 'dataset_id': 'dataset_e009a9dc-a088-4603-83f8-bb9bf82dc849', 'params_url': 'http://localhost:8844/media/uploads/2021/10/08/node_params_16085300-effe-44e2-8ae1-1309ef5fd85e.pt', 'timing': {'rtime_training': 5.262860255000021, 'ptime_training': 5.0171140000000065}, 'msg': '', 'command': 'train'}
2021-10-08 16:38:08,438 fedbiomed INFO - message received:{'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'success': True, 'client_id': 'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'dataset_id': 'dataset_6408d072-5f60-4592-ad83-e095f92aef13', 'params_url': 'http://localhost:8844/media/u

2021-10-08 16:38:46,314 fedbiomed INFO - Downloading model params after training on client_305a6614-69ea-42ca-9850-43de0246b72f - from http://localhost:8844/media/uploads/2021/10/08/node_params_2bff9170-8924-413f-9829-242c8871e424.pt
2021-10-08 16:38:46,350 fedbiomed INFO - Downloading model params after training on client_d5ae6538-a16d-4d84-a002-e2260fcc10f4 - from http://localhost:8844/media/uploads/2021/10/08/node_params_a59c34bc-9b1c-497a-93ea-df505be62f07.pt
2021-10-08 16:38:46,378 fedbiomed INFO - Clients that successfully reply in round 16 ['client_305a6614-69ea-42ca-9850-43de0246b72f', 'client_d5ae6538-a16d-4d84-a002-e2260fcc10f4']
2021-10-08 16:38:46,480 fedbiomed INFO - Sampled clients in round 17 ['client_d5ae6538-a16d-4d84-a002-e2260fcc10f4', 'client_305a6614-69ea-42ca-9850-43de0246b72f']
2021-10-08 16:38:46,481 fedbiomed INFO - Send message to client client_d5ae6538-a16d-4d84-a002-e2260fcc10f4 - {'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id':

2021-10-08 16:39:16,930 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2021-10-08 16:39:16,935 fedbiomed INFO - Send message to client client_305a6614-69ea-42ca-9850-43de0246b72f - {'researcher_id': 'researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838', 'job_id': '2457bda6-10c1-4be7-82cb-caa87f72086f', 'training_args': {'n_iterations': 15}, 'model_args': {'tot_views': 3, 'dim_views': [15, 8, 10], 'n_components': 4}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2021/10/08/my_model_c291cf4f-8e30-4075-98d0-21a89628d23f.py', 'params_url': 'http://localhost:8844/media/uploads/2021/10/08/researcher_params_ebf501bb-a43a-4c5a-9781-948fd7ffa853.pt', 'model_class': 'IID_MV_PPCA', 'training_data': {'client_305a6614-69ea-42ca-9850-43de0246b72f': ['dataset_e009a9dc-a088-4603-83f8-bb9bf82dc849']}}
2021-10-08 16:39:16,940 fedbiomed DEBUG - researcher_c2567539-90f1-4ceb-8d3d-9db61bed9838
2021-10-08 16:39:23,098 fedbiomed INFO - message received:{'researcher_id':

In [7]:
print("\nList the training rounds : ", exp.aggregated_params.keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params[rounds - 1]['params'].keys())


List the training rounds :  dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Access the federated params for the last training round :
	- params_path:  /Users/balelli/ownCloud/INRIA_EPIONE/FedBioMed/fedbiomed/var/tmp/researcher_params_07e93e56-573a-4223-b225-53bebf262e68.pt
	- parameter data:  dict_keys(['tilde_muk', 'tilde_Wk', 'tilde_Sigma2k', 'Alpha', 'Beta', 'sigma_til_muk', 'sigma_til_Wk', 'sigma_til_sigma2k'])


## Test