In [1]:
from meps_model import WeightedLitDNN

from nvflare.app_common.workflows.fedavg import FedAvg
from nvflare.app_opt.pt.job_config.base_fed_job import BaseFedJob
from nvflare.job_config.script_runner import ScriptRunner

In [2]:
job = BaseFedJob(
    name="meps_lightning_fedavg",
    initial_model=WeightedLitDNN(),
    # val_loss is lower-is-better; confirm the job's model selection is configured accordingly.
    key_metric="val_loss",
)

In [3]:
n_clients = 3

controller = FedAvg(
    num_clients=n_clients,
    num_rounds=5,
)
job.to(controller, "server")
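
The controller above asks each of the three sites for a locally trained model every round and combines them into a new global model. The sketch below is a minimal, framework-free illustration of that idea as a sample-weighted average. It is not NVFlare's internal implementation: the function name `fedavg`, the toy parameter dicts, and the per-site sample counts are all hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Sample-weighted average of per-client parameter dicts (FedAvg-style).

    client_weights: list of {param_name: list of floats}, one dict per site.
    client_sizes:   number of local training samples per site (assumed known).
    """
    total = float(sum(client_sizes))
    averaged = {}
    for name in client_weights[0]:
        # Group the i-th value of this parameter across all sites.
        columns = zip(*(w[name] for w in client_weights))
        averaged[name] = [
            sum(v * n / total for v, n in zip(col, client_sizes))
            for col in columns
        ]
    return averaged


# Three hypothetical sites, each holding a toy one-parameter "model".
site_models = [
    {"layer.weight": [1.0, 2.0]},
    {"layer.weight": [3.0, 4.0]},
    {"layer.weight": [5.0, 6.0]},
]
global_model = fedavg(site_models, client_sizes=[100, 100, 200])
print(global_model)
```

Sites with more samples pull the global parameters further toward their local values, which is why the third site dominates in this toy example.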

In [4]:
for i in range(n_clients):
    site = f"site_{i+1}"
    runner = ScriptRunner(
        script="meps_nvflare.py",
        script_args=(
            "--batch_size 256 --local_epochs 100 --weights "
            f"--dataset_path /data/shared/analysis/bias_fairness/for_fl/{site}"
        ),
    )
    job.to(runner, site)
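
Each site receives the same `script_args` string apart from its dataset path. The contents of `meps_nvflare.py` are not shown here, so the following is only a sketch of how a client script might parse those flags with `argparse`. The defaults, the help text, and the reading of `--weights` as a boolean switch are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags passed through script_args.
    parser = argparse.ArgumentParser(description="Client training arguments (sketch)")
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--local_epochs", type=int, default=1)
    # Assumed to be an on/off switch, since no value follows it in script_args.
    parser.add_argument("--weights", action="store_true")
    parser.add_argument("--dataset_path", type=str, required=True)
    return parser


# Parse the argument string that site_1 would receive.
args = build_parser().parse_args(
    "--batch_size 256 --local_epochs 100 --weights "
    "--dataset_path /data/shared/analysis/bias_fairness/for_fl/site_1".split()
)
print(args)
```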

In [5]:
job.simulator_run("/data/shared/analysis/bias_fairness/nvflare_weighted_wkdir")

2025-07-13 10:18:19,690 - SimulatorRunner - INFO - Create the Simulator Server.
2025-07-13 10:18:19,695 - CoreCell - INFO - server: creating listener on tcp://0:48345
2025-07-13 10:18:19,761 - CoreCell - INFO - server: created backbone external listener for tcp://0:48345
2025-07-13 10:18:19,761 - ConnectorManager - INFO - 594320: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2025-07-13 10:18:19,765 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:30075] is starting
2025-07-13 10:18:20,267 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:30075
2025-07-13 10:18:20,267 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:48345] is starting
2025-07-13 10:18:20,270 - SimulatorServer - INFO - max_reg_duration=60.0
2025-07-13 10:18:20,559 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 51323
2025-07-13 10:18:20,559 - SimulatorRunner - INFO - 

INFO: 
  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | model | DeepNN | 7.6 K  | train
-----------------------------------------
7.6 K     Trainable params
0         Non-trainable params
7.6 K     Total params
0.030     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode


Epoch 0:  47%|████▋     | 8/17 [00:00<00:00, 38.40it/s, v_num=0]
2025-07-13 10:18:34,481 - pytorch_lightning.utilities.rank_zero - INFO - You are using the plain ModelCheckpoint callback. Consider using LitModelCheckpoint which with seamless uploading to Model registry.
Epoch 0:  59%|█████▉    | 10/17 [00:00<00:00, 35.99it/s, v_num=0]
2025-07-13 10:18:34,498 - pytorch_lightning.utilities.rank_zero - INFO - GPU available: False, used: False
2025-07-13 10:18:34,499 - pytorch_lightning.utilities.rank_zero - INFO - TPU available: False, using: 0 TPU cores
2025-07-13 10:18:34,499 - pytorch_lightning.utilities.rank_zero - INFO - HPU available: False, using: 0 HPUs
2025-07-13 10:18:34,501 - nvflare.app_common.executors.task_script_runner - INFO - 
[Current Round=0, Site = site_3]

2025-07-13 10:18:34,501 - nvflare.app_common.executors.task_script_runner - INFO - --- validate global model ---
Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 40.09it/s]
2025-07-13 10:18:34,689 - nvflare



Epoch 0:  12%|█▏        | 2/17 [00:00<00:00, 45.66it/s, v_num=0]
Validation DataLoader 0:  67%|██████▋   | 2/3 [00:00<00:00, 44.27it/s]
Validation DataLoader 0:  67%|██████▋   | 2/3 [00:00<00:00, 51.56it/s]
Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 57.36it/s]
Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 27.41it/s, v_num=0]
Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 61.66it/s]
Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 23.52it/s, v_num=0]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 1:  71%|███████   | 12/17 [00:00<00:00, 17.52it/s, v_num=0]
Validation DataLoader 0:  67%|██████▋   | 2/3 [00:00<00:00, 42.95it/s]
Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 59.17it/s]
Epoch 1: 100%|██████████| 17/17 [00:01<00:00, 15.44it/s, v_num=0]
Epoch 1:  47%|████▋    



2025-07-13 10:18:56,675 - lightning.pytorch.callbacks.model_summary - INFO - 
  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | model | DeepNN | 7.6 K  | train
-----------------------------------------
7.6 K     Trainable params
0         Non-trainable params
7.6 K     Total params
0.030     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode
2025-07-13 10:18:56,677 - RestoreState - INFO - optimizer states restored.
Epoch 13:  12%|█▏        | 2/17 [00:00<00:00, 18.57it/s, v_num=0]
2025-07-13 10:18:56,804 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site_3, peer_run=simulate_job, task_name=train, task_id=895c661e-f94c-49c1-ad11-ce0654c8417d]: assigned task to client site_3: name=train, id=895c661e-f94c-49c1-ad11-ce0654c8417d
2025-07-13 10:18:56,807 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site_3, peer_run=simulate_



Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 42.01it/s]
2025-07-13 10:18:57,327 - nvflare.app_common.executors.task_script_runner - INFO - ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Validate metric           DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        val_loss            0.8071267008781433
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
2025-07-13 10:18:57,333 - nvflare.app_common.executors.task_script_runner - INFO - --- train new model ---
Epoch 13:  82%|████████▏ | 14/17 [00:00<00:00, 17.82it/s, v_num=0]
2025-07-13 10:18:57,474 - lightning.pytorch.callbacks.model_summary - INFO - 
  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | model | DeepNN | 7.6 K  | train
-------------------------



Validation: |          | 0/? [00:00<?, ?it/s]
2025-07-13 10:19:08,402 - nvflare.app_common.executors.task_script_runner - INFO - 
[Current Round=2, Site = site_1]

2025-07-13 10:19:08,403 - nvflare.app_common.executors.task_script_runner - INFO - --- validate global model ---
Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 35.86it/s] 
2025-07-13 10:19:08,503 - nvflare.app_common.executors.task_script_runner - INFO - ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Validate metric           DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        val_loss            0.8249231576919556
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Validation DataLoader 0:  33%|███▎      | 1/3 [00:00<00:00, 38.70it/s]
2025-07-13 10:19:08,509 - nvflare.app_commo



2025-07-13 10:19:08,618 - lightning.pytorch.callbacks.model_summary - INFO - 
  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | model | DeepNN | 7.6 K  | train
-----------------------------------------
7.6 K     Trainable params
0         Non-trainable params
7.6 K     Total params
0.030     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode
2025-07-13 10:19:08,620 - RestoreState - INFO - optimizer states restored.
Epoch 11:   0%|          | 0/17 [00:00<?, ?it/s]
2025-07-13 10:19:08,687 - lightning.pytorch.callbacks.model_summary - INFO - 
  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | model | DeepNN | 7.6 K  | train
-----------------------------------------
7.6 K     Trainable params
0         Non-trainable params
7.6 K     Total params
0.030     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode
2025-07-13 10:19:08,689




2025-07-13 10:19:16,687 - nvflare.app_common.executors.task_script_runner - INFO - ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Validate metric           DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        val_loss             0.825820803642273
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Epoch 22:  12%|█▏        | 2/17 [00:00<00:01, 12.83it/s, v_num=0]
2025-07-13 10:19:16,701 - nvflare.app_common.executors.task_script_runner - INFO - --- train new model ---
Epoch 21:  12%|█▏        | 2/17 [00:00<00:01, 13.05it/s, v_num=0]
2025-07-13 10:19:16,806 - lightning.pytorch.callbacks.model_summary - INFO - 
  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | model | DeepNN | 7.6 K  | train
-------------------------------



Epoch 22: 100%|██████████| 17/17 [00:01<00:00, 14.33it/s, v_num=0]
Epoch 21:  82%|████████▏ | 14/17 [00:01<00:00, 12.32it/s, v_num=0]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 15:  71%|███████   | 12/17 [00:00<00:00, 12.71it/s, v_num=0]
Epoch 21:  88%|████████▊ | 15/17 [00:01<00:00, 12.56it/s, v_num=0]
Epoch 15:  76%|███████▋  | 13/17 [00:01<00:00, 12.93it/s, v_num=0]
Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 42.46it/s]
Epoch 22: 100%|██████████| 17/17 [00:01<00:00, 12.85it/s, v_num=0]
2025-07-13 10:19:17,870 - InProcessClientAPI - INFO - Try to send local model back to peer 
Epoch 21: 100%|██████████| 17/17 [00:01<00:00, 13.17it/s, v_num=0]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
2025-07-13 10:19:17,942 - ClientRunner - INFO - [identity=site_1, run=simulate_job, peer=simulator_server, peer_run=simulate



Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 45.84it/s]
2025-07-13 10:19:20,793 - nvflare.app_common.executors.task_script_runner - INFO - ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     Validate metric           DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        val_loss            0.8241653442382812
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
2025-07-13 10:19:20,797 - nvflare.app_common.executors.task_script_runner - INFO - --- train new model ---
Epoch 23:  29%|██▉       | 5/17 [00:00<00:00, 17.95it/s, v_num=0]
2025-07-13 10:19:20,897 - lightning.pytorch.callbacks.model_summary - INFO - 
  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | model | DeepNN | 7.6 K  | train
--------------------------



Epoch 16:   0%|          | 0/17 [00:00<?, ?it/s]
2025-07-13 10:19:20,919 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site_2, peer_run=simulate_job, task_name=train, task_id=a79c480e-861c-4089-b2f2-491c7806a172]: assigned task to client site_2: name=train, id=a79c480e-861c-4089-b2f2-491c7806a172
2025-07-13 10:19:20,920 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=controller, peer=site_2, peer_run=simulate_job, task_name=train, task_id=a79c480e-861c-4089-b2f2-491c7806a172]: sent task assignment to client. client_name:site_2 task_id:a79c480e-861c-4089-b2f2-491c7806a172
2025-07-13 10:19:20,921 - GetTaskCommand - INFO - return task to client.  client_name: site_2  task_name: train   task_id: a79c480e-861c-4089-b2f2-491c7806a172  sharable_header_task_id: a79c480e-861c-4089-b2f2-491c7806a172
2025-07-13 10:19:20,931 - Communicator - INFO - Received from simulator_server server. getTask: train size: 36KB (36018 Bytes) time



Epoch 23: 100%|██████████| 17/17 [00:01<00:00, 15.01it/s, v_num=0]
Epoch 22:  29%|██▉       | 5/17 [00:00<00:00, 12.53it/s, v_num=0]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 22:  35%|███▌      | 6/17 [00:00<00:00, 12.73it/s, v_num=0]
Validation DataLoader 0:  67%|██████▋   | 2/3 [00:00<00:00, 24.94it/s]
Validation DataLoader 0: 100%|██████████| 3/3 [00:00<00:00, 32.64it/s]
Epoch 23: 100%|██████████| 17/17 [00:01<00:00, 13.20it/s, v_num=0]
2025-07-13 10:19:21,896 - InProcessClientAPI - INFO - Try to send local model back to peer 
Epoch 16: 100%|██████████| 17/17 [00:01<00:00, 13.80it/s, v_num=0]
Epoch 22:  65%|██████▍   | 11/17 [00:00<00:00, 14.22it/s, v_num=0]
Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 22:  71%|███████   | 12/17 [00:00<00:00, 14.56it/s, v_num=0]
Validation DataLoader 0:  67%|█