In [None]:
import numpy as np
import wandb
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
import shutil
import os                              # Import the 'os' module for changing directories
os.chdir('/content/drive/MyDrive/FL')  # Change the directory

Mounted at /content/drive


In [None]:
import torch
import torch.optim as optim
import torch.nn as nn
import torchvision
from torchvision import transforms
from torchvision.datasets import CIFAR100
from torch.utils.data import Subset, DataLoader, random_split

from FederatedLearningProject.data.cifar100_loader import get_cifar100
import FederatedLearningProject.checkpoints.checkpointing as checkpointing
from FederatedLearningProject.training.FL_training import train_server
from FederatedLearningProject.experiments import models

In [None]:
wandb.login()  # Prompts for your W&B API key and logs in to wandb.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: depetrofabio (depetrofabio-politecnico-di-torino) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [None]:

valid_split_perc = 0.2  # of the 50000 training data
train_set, val_set, test_set = get_cifar100(valid_split_perc=valid_split_perc)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
val_loader = DataLoader(val_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)

Number of images in Training Set:   40000
Number of images in Validation Set: 10000
Number of images in Test Set:       10000
✅ Datasets loaded successfully


In [None]:
model = models.LinearFlexibleDino(num_layers_to_freeze=12) # num_layers_to_freeze

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


In [None]:
model.debug()


--- Debugging Model ---
Model is primarily on device: cpu
Model overall mode: Train

Parameter Details (Name | Device | Requires Grad? | Inferred Block | Module Mode):
- backbone.cls_token                                 | cpu        | False           | N/A             | Train
- backbone.pos_embed                                 | cpu        | False           | N/A             | Train
- backbone.patch_embed.proj.weight                   | cpu        | False           | N/A             | Train
- backbone.patch_embed.proj.bias                     | cpu        | False           | N/A             | Train
- backbone.blocks.0.norm1.weight                     | cpu        | False           | Block 0         | Eval
- backbone.blocks.0.norm1.bias                       | cpu        | False           | Block 0         | Eval
- backbone.blocks.0.attn.qkv.weight                  | cpu        | False           | Block 0         | Eval
- backbone.blocks.0.attn.qkv.bias                    | cpu      

In [None]:
model.to_cuda()

moving model to cuda


## FedAvg Hyperparameters


The Federated Averaging (FedAvg) algorithm has several hyperparameters that affect both the accuracy and the cost of training:

- **`num_clients (K)`**  
  Total number of clients (or devices) participating in the federated learning system.  
  Example: `num_clients = 100`

- **`fraction (C)`**  
  The fraction of clients selected to participate in each communication round. Must be a float greater than 0 and at most 1.  
  Example: `fraction = 0.1` means 10% of clients are selected in each round.

- **`local_steps (J)`**  
  Number of local training steps (mini-batch updates) each selected client performs before sending its update back to the server.  
  Example: `local_steps = 5`

- **`num_rounds`**  
  Total number of communication rounds (or global iterations) the server runs to aggregate updates and refine the global model.  
  Example: `num_rounds = 100`. We choose this value based on convergence and the time/compute budget.

- **Additional notes**  
  These hyperparameters directly affect convergence speed, communication cost, and model performance.  
  - A smaller `C` reduces per-round communication overhead but may slow convergence.  
  - A larger `J` does more local work per round, but client models may drift apart if clients' data distributions are highly non-IID.
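
As a minimal sketch of how `K` and `C` determine who participates in a round, the snippet below samples `max(1, round(C * K))` distinct client ids. The helper name `sample_clients` and the fixed seed are illustrative assumptions and are not taken from the project code.

```python
import random


def sample_clients(num_clients: int, fraction: float, rng: random.Random) -> list[int]:
    """Pick max(1, round(C * K)) distinct client ids for one communication round."""
    num_selected = max(1, round(fraction * num_clients))
    # random.sample draws without replacement, so no client is picked twice.
    return rng.sample(range(num_clients), num_selected)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
selected = sample_clients(num_clients=100, fraction=0.1, rng=rng)
print(len(selected), sorted(selected))
```

With `K = 100` and `C = 0.1` this yields 10 clients per round, which matches the length of the `Selected Clients` lists printed in the training logs.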


The first FL baseline
Implement the algorithm described in [10]: fix K=100 and C=0.1, adopt an IID sharding of the training set, and fix the number of local steps at J=4. Run FedAvg on CIFAR-100 for a suitable number of rounds (up to you to define, based on convergence and time/compute budget).


# Federated Learning Baseline on CIFAR-100 using FedAvg

In this experiment, we aim to implement and evaluate the Federated Averaging (FedAvg) algorithm as described in McMahan et al. [10], using a controlled setup on the CIFAR-100 dataset. This serves as a **baseline FL experiment** with standard hyperparameters for further comparative studies.

## 📌 Objectives

- Implement the **FedAvg** algorithm.
- Use **IID sharding** of CIFAR-100 to simulate a federated setting.
- Fix key FL hyperparameters:
  - Number of clients (**K**) = 100
  - Fraction of participating clients per round (**C**) = 0.1
  - Number of local update steps (**J**) = 4
- Evaluate performance over a suitable number of communication rounds.
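
The server-side aggregation step of FedAvg can be sketched, under simplifying assumptions, as a weighted average of client parameters with weights proportional to each client's sample count. The sketch is framework-free: parameters are plain lists of floats, and `fedavg_aggregate` is a hypothetical helper. The project's `train_server` is not shown here and may be implemented differently.

```python
def fedavg_aggregate(client_params, client_sizes):
    """Average client parameters, weighting each client by its number of samples."""
    total = sum(client_sizes)
    aggregated = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        aggregated[name] = [
            sum(params[name][i] * size for params, size in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two toy clients with a two-weight "head" and one bias each.
clients = [
    {"head.weight": [1.0, 2.0], "head.bias": [0.0]},
    {"head.weight": [3.0, 4.0], "head.bias": [1.0]},
]
equal = fedavg_aggregate(clients, client_sizes=[400, 400])
skewed = fedavg_aggregate(clients, client_sizes=[100, 300])
print(equal)
print(skewed)
```

With equal client sizes, as in an IID split with equally sized shards, the weighted average reduces to a plain mean.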

## ⚙️ Experiment Configuration

| Parameter                  | Value                                      |
|----------------------------|--------------------------------------------|
| Dataset                    | CIFAR-100                                  |
| Model                      | DINO ViT-S/16 (`dino_vits16`)              |
| Total Clients (K)          | 100                                        |
| Participation Fraction (C) | 0.1 (10 clients/round)                     |
| Local Steps (J)            | 4                                          |
| Sharding Type              | IID                                        |
| Rounds                     | *TBD based on convergence (e.g., 100–500)* |
| Optimizer                  | SGD                                        |
| Learning Rate              | 0.01                                       |
| Momentum                   | 0.9                                        |
| Weight Decay               | 0.0001                                     |

## 📊 Notes

- **IID sharding** means the training data will be equally and randomly split among clients to avoid any data heterogeneity.
- The number of rounds will be chosen based on **observed convergence behavior** and practical time/compute budget constraints.
- Performance will be tracked using **validation/test accuracy** and **loss** over communication rounds.
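
One way to obtain such an IID split is to deal the shuffled indices of every class round-robin across the clients, so each client sees all classes in equal proportion. This is only a sketch: `iid_split` and the toy label list are assumptions for illustration, and the project's `create_iid_splits` may work differently.

```python
import random
from collections import defaultdict


def iid_split(labels, num_clients, seed=0):
    """Return one list of sample indices per client, stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of this class to the clients like a deck of cards.
        for pos, idx in enumerate(indices):
            clients[pos % num_clients].append(idx)
    return clients


# Toy labels (assumption): 100 classes with exactly 400 samples each.
labels = [c for c in range(100) for _ in range(400)]
splits = iid_split(labels, num_clients=100)
print(len(splits), len(splits[0]), len({labels[i] for i in splits[0]}))
```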

## 🧠 Why this setup?

This configuration serves as a **standard benchmark** for future comparison with more advanced techniques (e.g., personalization, non-IID setups, compression, or asynchronous training). By fixing `K`, `C`, and `J`, and using IID data, we create a controlled environment to evaluate the basic performance of FedAvg.

## 📚 Reference

[10] McMahan, Brendan, et al. *Communication-Efficient Learning of Deep Networks from Decentralized Data.* AISTATS 2017.


In [None]:
# --- HYPERPARAMETERS, OPTIMIZER AND LOSS FUNCTION ---

learning_rate = 0.01   # best value found in the centralized experiments
momentum = 0.9
weight_decay = 0.0001  # best value found in the centralized experiments

# Default hyperparameters for FedAvg
num_clients = 100    # K
num_local_steps = 4  # J (8 and 16 are tried in the sweeps below)
fraction = 0.1       # C

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=learning_rate,
    weight_decay=weight_decay,
    momentum=momentum,
)
criterion = nn.CrossEntropyLoss()

In [None]:

# wandb.init() sets up tracking of hyperparameters and metrics; wandb.log() records them later.

# Names needed by wandb.init(). The values mirror the sweep cell further down;
# num_rounds is a placeholder assumption that the sweep overrides per run.
project_name = "FederatedProject"
model_name = "dino_vits16_J4"
num_rounds = 200
run_name = f"{model_name}_rounds_{num_rounds}"

# INITIALIZE W&B
wandb.init(
    project=project_name,
    name=run_name,
    config={
        "model": model_name,
        "num_rounds": num_rounds,
        "batch_size": train_loader.batch_size,
        "learning_rate": optimizer.param_groups[0]['lr'],
        "architecture": model.__class__.__name__,
    },
)

# Copy your config
config = wandb.config

In [None]:
# CHECKPOINT PATH

checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/TestFinaliSingleModel"
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, f"{model_name}_checkpoint.pth")  # file name fixed in advance inside checkpoint_dir


In [None]:
# RECOVER CHECKPOINT
start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=model_name)


# Load the saved state only if a checkpoint was actually found.
if model_data:
    print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
    model.load_state_dict(model_data["model_state_dict"])
    optimizer.load_state_dict(model_data["optimizer_state_dict"])

No checkpoint found, starting from round 1.



In [None]:
from FederatedLearningProject.data.cifar100_loader import create_iid_splits
client_dataset = create_iid_splits(train_set, num_clients = num_clients)

Dataset has 40000 samples across 100 classes.
Creating 100 IID splits with 100 classes each.


Each of the 100 classes split into 100 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(37), np.int64(38), np.int64(39), np.int64(40), np.int64(41), np.int64(42), np.int64(43), np.int64(44), np.int64(45), np.int64(46), np.int64(47), np.int64(48), np.int64(49), np.int64(50), np.int64(51), np.int64(52), np.int64(53), np.int64(54), np.int64(55), 

In [None]:
### --- LOOP FOR FINDING THE OPTIMAL NUM_ROUNDS --- ###
device = "cuda"
model_name = "dino_vits16_J4"
project_name = "FederatedProject"
num_rounds_iterations = [200,300,400,500,600]
# Loop through different num_rounds values
for num_rounds_val in num_rounds_iterations:
    # Re-initialize the model for each run so that no weights carry over between runs.
    # To start every run from exactly the same initial state, save the initial
    # state_dict once and load it here.
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration
    run_name = f"{model_name}_rounds_{num_rounds_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,  # current value in the sweep
            "batch_size": 64,  # client batch size passed to train_server below
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps,
            "fraction_clients": fraction,
        },
        reinit=True # Important: Allows re-initialization of wandb in a loop
    )

    # Copy your config
    config = wandb.config

    # CHECKPOINT PATH
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Use a run-specific file name so that each run keeps its own checkpoint
    checkpoint_path = os.path.join(checkpoint_dir, f"{model_name}_rounds_{num_rounds_val}_checkpoint.pth")

    # RECOVER CHECKPOINT: looks for a checkpoint that matches this run's
    # model_name and num_rounds_val.
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=f"{model_name}_rounds_{num_rounds_val}")

    try:
        print()
        print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
        model.load_state_dict(model_data["model_state_dict"])
        optimizer.load_state_dict(model_data["optimizer_state_dict"])
    except Exception:
        # Also reached when no checkpoint was found (see the output below).
        print("Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.")

    print(f"\n--- Starting training for num_rounds = {num_rounds_val} ---")
    train_server(model=model,
                 num_clients=num_clients,
                 num_client_steps=num_local_steps,
                 num_rounds=num_rounds_val, # Pass the current num_rounds_val
                 client_dataset=client_dataset,
                 frac=fraction,
                 optimizer=optimizer,
                 batch_size=64,
                 n_rounds_log=10,
                 val_loader=val_loader,
                 criterion=criterion,
                 checkpoint_path=checkpoint_path,
                 model_name=f"{model_name}_rounds_{num_rounds_val}") # Ensure model_name is unique for checkpointing

    # End the current wandb run before starting the next one
    wandb.finish()

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   27.00583
client_avg_loss       3.1616
round                 9.0
server_val_accuracy   29.12
server_val_loss       3.13042


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 200 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_200_checkpoint.pth

Round 10/200
Selected Clients: [92 33 75 21 41 36  0 46 88 43]
Avg Client Loss: 4.2218 | Avg Client Accuracy: 13.20%
Evaluation Loss:4.2730 | Val Accuracy: 13.23%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_200_checkpoint.pth

Round 20/200
Selected Clients: [84 24 22 67 94 16 76 49 50 43]
Avg Client Loss: 3.5273 | Avg Client Accuracy: 20.00%
Evaluation Loss:3.4841 | Val Accuracy: 22.70%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   41.60156
client_avg_loss       2.32964
round                 199.0
server_val_accuracy   43.68
server_val_loss       2.26143


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 300 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_300_checkpoint.pth

Round 10/300
Selected Clients: [95 43  9 11 26 76 53 14 89 47]
Avg Client Loss: 4.2283 | Avg Client Accuracy: 12.42%
Evaluation Loss:4.2434 | Val Accuracy: 13.98%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_300_checkpoint.pth

Round 20/300
Selected Clients: [36 82 93 41 81 30 55 31 21 11]
Avg Client Loss: 3.4035 | Avg Client Accuracy: 21.29%
Evaluation Loss:3.4683 | Val Accuracy: 23.40%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   41.875
client_avg_loss       2.21978
round                 299.0
server_val_accuracy   44.82
server_val_loss       2.18801


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 400 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_400_checkpoint.pth

Round 10/400
Selected Clients: [74 97 19 23 25  4 41 57  6 15]
Avg Client Loss: 4.2501 | Avg Client Accuracy: 12.19%
Evaluation Loss:4.2911 | Val Accuracy: 14.38%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_400_checkpoint.pth

Round 20/400
Selected Clients: [54 87 61 92 10 26 53  7 75 73]
Avg Client Loss: 3.5031 | Avg Client Accuracy: 20.27%
Evaluation Loss:3.5074 | Val Accuracy: 22.68%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   43.94531
client_avg_loss       2.13777
round                 399.0
server_val_accuracy   45.44
server_val_loss       2.17429


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 500 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_500_checkpoint.pth

Round 10/500
Selected Clients: [68 72 46 58 23  1 47 97 64 41]
Avg Client Loss: 4.2638 | Avg Client Accuracy: 11.60%
Evaluation Loss:4.2506 | Val Accuracy: 14.67%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_500_checkpoint.pth

Round 20/500
Selected Clients: [29 69 11  5 44 25 41 93  2 13]
Avg Client Loss: 3.5425 | Avg Client Accuracy: 20.27%
Evaluation Loss:3.4281 | Val Accuracy: 23.81%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   44.72656
client_avg_loss       2.10548
round                 499.0
server_val_accuracy   46.3
server_val_loss       2.14952


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 600 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_600_checkpoint.pth

Round 10/600
Selected Clients: [66 46 88 10 65 64 29 48 94 21]
Avg Client Loss: 4.2258 | Avg Client Accuracy: 11.68%
Evaluation Loss:4.2160 | Val Accuracy: 14.51%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_600_checkpoint.pth

Round 20/600
Selected Clients: [44 80 68 89 75 83  5 47 45 57]
Avg Client Loss: 3.5220 | Avg Client Accuracy: 19.53%
Evaluation Loss:3.4553 | Val Accuracy: 23.88%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   43.71094
client_avg_loss       2.13101
round                 599.0
server_val_accuracy   46.42
server_val_loss       2.15125


In [None]:
num_optimal_rounds = 300
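
To make the trade-off behind a choice of round count easier to read, the final validation accuracies from the five run summaries above can be tabulated as gain per additional 100 rounds. This is only a reading aid under stated assumptions: the variable names are illustrative, each run was trained independently with random client sampling (so small differences include noise), and the snippet does not claim to be the rule used to pick `num_optimal_rounds`.

```python
# Final server_val_accuracy (%) per sweep run, copied from the run summaries above.
final_val_acc = {200: 43.68, 300: 44.82, 400: 45.44, 500: 46.30, 600: 46.42}

rounds = sorted(final_val_acc)
gains = {}
for prev, cur in zip(rounds, rounds[1:]):
    # Accuracy gained by training `cur` rounds instead of `prev` rounds.
    gains[cur] = round(final_val_acc[cur] - final_val_acc[prev], 2)
    print(f"{prev:>3} -> {cur:>3} rounds: {gains[cur]:+.2f} points")
```

The largest single step is from 200 to 300 rounds, and the step from 500 to 600 rounds is the smallest.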

In [None]:
### IID SETTING: TESTING WITH 4, 8, AND 16 LOCAL STEPS
# The sweep over num_local_steps is implemented in the next cell.
# Display the current default value.
num_local_steps

4

In [None]:
### --- LOOP FOR IID ACCURACIES AND PLOTS --- ###
# Runs one experiment per `num_local_steps` value. Metrics (accuracy, loss) are logged
# to W&B, where the runs can be compared and plotted in the dashboard.
num_total_steps = num_optimal_rounds * 4  # total local-step budget (rounds x J at J=4)
num_local_steps_iterations = [4, 8, 16]
device = "cuda"
model_name = "dino_vits_16"
project_name = "FederatedProjectIID_300rounds"
for num_local_steps_val in num_local_steps_iterations:
    # Keep the total number of local steps constant by adjusting the number of rounds:
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_iid_local_steps_{num_local_steps_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": 64,  # client batch size passed to train_server below
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "IID_bs128", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Essential for running wandb.init in a loop
    )

    # Copy your config
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=64,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()


--- Starting training for num_local_steps = 4 | num_rounds = 300 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_4_checkpoint.pth

Round 5/300
Selected Clients: [96 78 13 48 37 11 41 14 60 56]
Avg Client Loss: 5.1769 | Avg Client Accuracy: 4.14%
Evaluation Loss:5.1611 | Val Accuracy: 6.51%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_4_checkpoint.pth

Round 10/300
Selected Clients: [68 31 21 76 65 72 56 57 59 16]
Avg Client Loss: 4.2606 | Avg Client Accuracy: 11.29%
Evaluation Loss:4.2603 | Val Accuracy: 13.62%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_4_checkpoint.pth

Round 15/300
Selected Clients: [16 55

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   43.24219
client_avg_loss       2.19599
round                 299.0
server_val_accuracy   44.6
server_val_loss       2.20149



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_8_checkpoint.pth

Round 5/150
Selected Clients: [ 1 34 63 39 43  5 59 25 31 96]
Avg Client Loss: 3.8086 | Avg Client Accuracy: 19.02%
Evaluation Loss:3.7723 | Val Accuracy: 19.81%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_8_checkpoint.pth

Round 10/150
Selected Clients: [89 85 21  3  9 40 92 44 34 11]
Avg Client Loss: 3.1892 | Avg Client Accuracy: 26.38%
Evaluation Loss:3.1725 | Val Accuracy: 28.15%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_8_checkpoint.pth

Round 15/150
Selected Clients: [43 

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   45.08163
client_avg_loss       2.12081
round                 149.0
server_val_accuracy   44.67
server_val_loss       2.27007



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_16_checkpoint.pth

Round 5/75
Selected Clients: [73 89 64 28 19 18 76 40  0  1]
Avg Client Loss: 2.2061 | Avg Client Accuracy: 48.36%
Evaluation Loss:3.1319 | Val Accuracy: 31.12%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_16_checkpoint.pth

Round 10/75
Selected Clients: [72  9 26 78 43 86 31 68 13 54]
Avg Client Loss: 1.9612 | Avg Client Accuracy: 53.13%
Evaluation Loss:2.8412 | Val Accuracy: 36.47%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round/dino_vits_16_iid_local_steps_16_checkpoint.pth

Round 15/75
Selected Clients: [37 

Run history: (sparkline charts omitted)

Run summary:
client_avg_accuracy   60.27194
client_avg_loss       1.58808
round                 74.0
server_val_accuracy   43.14
server_val_loss       2.48329
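
The three runs above share one local-step budget: the number of rounds shrinks as `J` grows, so that rounds times local steps stays constant. The arithmetic can be sketched as follows; `rounds_for_budget` is a hypothetical helper that mirrors the integer division used in the sweep cell.

```python
def rounds_for_budget(total_local_steps: int, local_steps: int) -> int:
    """Number of communication rounds that fits in a fixed local-step budget."""
    return total_local_steps // local_steps


num_optimal_rounds = 300
budget = num_optimal_rounds * 4  # 1200 local steps, fixed at J = 4
schedule = {j: rounds_for_budget(budget, j) for j in (4, 8, 16)}
print(schedule)  # {4: 300, 8: 150, 16: 75}
```

These values match the `num_rounds` printed at the start of each run (300, 150, and 75).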


Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 1 classes each.


Each of the 100 classes split into 1 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0)}
Total: 1
Client 1 has samples from classes: {np.int64(1)}
Total: 1
Client 2 has samples from classes: {np.int64(2)}
Total: 1
Client 3 has samples from classes: {np.int64(3)}
Total: 1
Client 4 has samples from classes: {np.int64(4)}
Total: 1
Client 5 has samples from classes: {np.int64(5)}
Total: 1
Client 6 has samples from classes: {np.int64(6)}
Total: 1
Client 7 has samples from classes: {np.int64(7)}
Total: 1
Client 8 has samples from classes: {np.int64(8)}
Total: 1
Client 9 has samples from classes: {np.int64(9)}
Total: 1
Client 10 has samples from classes: {np.int64(10)}
Total: 1
Client 11 has samples from classes: {np.int64(11)}
Total: 1
Client 12 has samples from classes: {np.int64(12)}
Total: 1
Client 13 has samples from classes: {np.int64(13)}

In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_1 = create_non_iid_splits(train_set, classes_per_client=1)
### --- LOOP FOR non_IID(1) ACCURACIES AND PLOTS --- ###
# Runs one experiment per `num_local_steps` value. Metrics (accuracy, loss) are logged
# to W&B, where the runs can be compared and plotted in the dashboard.
num_total_steps = num_optimal_rounds * 4  # total local-step budget (rounds x J at J=4)
num_local_steps_iterations = [4, 8, 16]
device = "cuda"
model_name = "dino_vits_16"
project_name = "FederatedProject_nonIID(1)"
for num_local_steps_val in num_local_steps_iterations:
    # Keep the total number of local steps constant by adjusting the number of rounds:
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(1)_local_steps_{num_local_steps_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": 64,  # client batch size passed to train_server below (val_loader uses 128)
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(1)", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Allows wandb.init to be called again in this process; each run is also closed by wandb.finish() below
    )

    # Copy your config
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_1,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=64,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()


--- Starting training for num_local_steps = 4 | num_rounds = 300 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_4_checkpoint.pth

Round 5/300
Selected Clients: [43 17 63 44 65 58  2 73 87 38]
Avg Client Loss: 3.1543 | Avg Client Accuracy: 77.27%
Evaluation Loss:11.3716 | Val Accuracy: 2.06%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_4_checkpoint.pth

Round 10/300
Selected Clients: [20 80 71 79 10 11 52 51 74 18]
Avg Client Loss: 5.0134 | Avg Client Accuracy: 75.00%
Evaluation Loss:12.4698 | Val Accuracy: 4.00%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_4_checkpoint.pth

Round 15/300
Selecte

metric,history
client_avg_accuracy,▄▃▁▃▂▃▁▃▂▄▅▃▄▅▄▅▅▄▄▅▄▄▅▅▆▆▅▅▆█▆▄▅▅▆▃▆▅▃▆
client_avg_loss,▃▅█▄▄▅▅▅▃▃▃▃▂▂▄▃▂▃▂▂▂▃▂▁▂▃▂▂▁▁▂▁▂▂▁▂▁▁▂▁
round,▁▁▁▁▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███
server_val_accuracy,▁▁▁▂▂▃▃▄▄▄▅▄▄▅▆▆▆▆▅▇▆▆▇▆▇▇▇▇▇█▇▇▇█▇█▇█▇█
server_val_loss,▇████▇▅▅▅▅▄▄▃▂▃▃▃▃▂▃▂▂▃▂▂▂▂▁▂▂▁▁▂▂▂▁▂▁▁▁

metric,summary
client_avg_accuracy,81.48438
client_avg_loss,1.24686
round,299.0
server_val_accuracy,32.76
server_val_loss,4.4985



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_8_checkpoint.pth

Round 5/150
Selected Clients: [52 60 51 95 97 74 13 59 75 35]
Avg Client Loss: 3.9338 | Avg Client Accuracy: 85.04%
Evaluation Loss:22.8081 | Val Accuracy: 2.83%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_8_checkpoint.pth

Round 10/150
Selected Clients: [68 66 27 28 88 30 17 60 24 35]
Avg Client Loss: 3.4381 | Avg Client Accuracy: 85.88%
Evaluation Loss:23.2504 | Val Accuracy: 4.18%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_8_checkpoint.pth

Round 15/150
Selecte

metric,history
client_avg_accuracy,▆▆▅▃▁▄▇▄█▆▆▄▆▆▆▄▆▇▅▄▇███▇▆▇█▇▇
client_avg_loss,▄▃▅▇█▅▃▄▁▂▃▆▃▂▃▅▂▃▃▃▂▁▁▁▂▃▁▁▂▂
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▁▁▂▂▃▃▃▄▄▅▅▅▅▆▆▆▆▆▇▆▆▇▇▇▇▇▇▇█
server_val_loss,▇▇██▆▆▅▅▄▅▄▃▃▄▃▃▂▃▃▂▂▂▂▁▁▁▁▂▁▁

metric,summary
client_avg_accuracy,87.06994
client_avg_loss,1.83598
round,149.0
server_val_accuracy,29.56
server_val_loss,6.56271



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_16_checkpoint.pth

Round 5/75
Selected Clients: [27 65 44 30  7 81 99 33 82 49]
Avg Client Loss: 3.9050 | Avg Client Accuracy: 89.26%
Evaluation Loss:27.4590 | Val Accuracy: 3.41%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_16_checkpoint.pth

Round 10/75
Selected Clients: [79 86 22 24  7 34  1 40 83 33]
Avg Client Loss: 5.6649 | Avg Client Accuracy: 86.57%
Evaluation Loss:32.1602 | Val Accuracy: 6.24%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)/dino_vits_16_non_iid(1)_local_steps_16_checkpoint.pth

Round 15/75
Selecte

metric,history
client_avg_accuracy,▄▁▂▅▂▅▅▅▆▇█▆▆▇▅
client_avg_loss,▅▇█▄▆▃▃▃▃▁▁▂▂▂▂
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▂▂▂▃▄▅▅▆▅▇▇▇██
server_val_loss,▆█▇▆▅▃▃▃▂▂▁▂▁▁▁

metric,summary
client_avg_accuracy,90.54482
client_avg_loss,2.14994
round,74.0
server_val_accuracy,20.24
server_val_loss,13.57529
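
The three runs above share one budget of local steps, so more local steps per round means fewer communication rounds. A minimal sketch of that schedule follows. The helper name `rounds_for_budget` is hypothetical, and the budget of 1200 is inferred from the logged 300 rounds at 4 local steps, since `num_optimal_rounds` is defined outside this section.

```python
def rounds_for_budget(total_local_steps, local_steps_options):
    """Map each local-step count J to the number of rounds that fits the budget."""
    return {j: total_local_steps // j for j in local_steps_options}


# Assumption: 1200 = 300 rounds * 4 local steps, as implied by the logs above.
schedule = rounds_for_budget(total_local_steps=1200, local_steps_options=[4, 8, 16])

for j, rounds in schedule.items():
    print(f"J={j:>2}: {rounds:>3} rounds, {j * rounds} local steps in total")
```

Because of the integer division, a value of J that does not divide the budget would use slightly fewer total steps than the other runs; the values 4, 8 and 16 all divide 1200 exactly.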


In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_5 = create_non_iid_splits(train_set, classes_per_client=5)

Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 5 classes each.


Each of the 100 classes split into 5 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4)}
Total: 5
Client 1 has samples from classes: {np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)}
Total: 5
Client 2 has samples from classes: {np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14)}
Total: 5
Client 3 has samples from classes: {np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19)}
Total: 5
Client 4 has samples from classes: {np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24)}
Total: 5
Client 5 has samples from classes: {np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29)}
Total: 5
Client 6 has samples from classes: {np.int64(32), np.int64(33), np.int64(34), np.int64(30), np.int64(31)}
Total: 5
Client 7 has sa
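
The printed assignment (client 0 gets classes 0 to 4, client 1 gets classes 5 to 9, and so on) is consistent with a shard-based split. The sketch below is one way such a split could work, not the project's `create_non_iid_splits`. The function `sketch_class_shards` and its arguments are hypothetical, and it assumes the number of client slots divides evenly over the classes.

```python
import numpy as np


def sketch_class_shards(labels, num_clients, classes_per_client):
    """Hypothetical helper: return one array of sample indices per client."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    num_classes = len(classes)
    total_slots = num_clients * classes_per_client
    # Assumption: slots divide evenly over classes (e.g. 100 * Nc slots over 100 classes).
    assert total_slots % num_classes == 0
    shards_per_class = total_slots // num_classes

    # Cut the sample indices of every class into `shards_per_class` nearly equal shards.
    shards = {
        c: np.array_split(np.flatnonzero(labels == c), shards_per_class)
        for c in classes
    }

    client_indices = []
    for k in range(num_clients):
        parts = []
        for j in range(classes_per_client):
            slot = k * classes_per_client + j
            c = classes[slot % num_classes]                # visit classes in order, wrapping around
            parts.append(shards[c][slot // num_classes])   # each wrap-around takes the next shard
        client_indices.append(np.concatenate(parts))
    return client_indices


# Toy data: 10 classes with 20 samples each, 10 clients, 2 classes per client.
labels = np.repeat(np.arange(10), 20)
clients = sketch_class_shards(labels, num_clients=10, classes_per_client=2)
print(np.unique(labels[clients[0]]).tolist())
```

In this sketch every sample is assigned to exactly one client, and a smaller `classes_per_client` gives each client a narrower label distribution.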

In [None]:
### --- LOOP FOR non_IID(5) ACCURACIES AND PLOTS --- ###
# This loop will run a new experiment for each `num_local_steps` value.
# The results (e.g., accuracy, loss) will be logged to W&B. You can then use the
# W&B dashboard to compare the runs and generate plots.
# Fixed budget of local steps: num_optimal_rounds rounds at 4 local steps each
num_total_steps = num_optimal_rounds * 4
num_local_steps_iterations = [4, 8, 16]
device = "cuda"
model_name = "dino_vits_16"  # defined once; this prefix appears in run and checkpoint names
project_name = "FederatedProject_nonIID(5)"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds.
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(5)_local_steps_{num_local_steps_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": 64,  # client batch size passed to train_server below (val_loader uses 128)
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(5)", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Allows wandb.init to be called again in this process; each run is also closed by wandb.finish() below
    )

    # Copy your config
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_5,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=64,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()


--- Starting training for num_local_steps = 4 | num_rounds = 300 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_4_checkpoint.pth

Round 5/300
Selected Clients: [87 31 36 74 54 39 24 15  1  9]
Avg Client Loss: 3.1337 | Avg Client Accuracy: 37.81%
Evaluation Loss:6.3770 | Val Accuracy: 4.31%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_4_checkpoint.pth

Round 10/300
Selected Clients: [70 42 10 35 64  3 80 83 89 50]
Avg Client Loss: 3.3741 | Avg Client Accuracy: 43.28%
Evaluation Loss:6.3988 | Val Accuracy: 6.95%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_4_checkpoint.pth

Round 15/300
Selected 

metric,history
client_avg_accuracy,▁▄▄▅▄▅▆▅▅▆▇▇▆▆▆▇▆▇▆▇▇▇▆█▇▇▇▇▇▇██▇▇▇▆▇▆▇▇
client_avg_loss,█▆▆▅▇▇▅▆▆▅▅▃▂▃▄▄▅▂▂▃▃▂▂▃▂▃▁▃▂▂▁▂▂▃▂▃▂▃▃▃
round,▁▁▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
server_val_accuracy,▁▂▃▄▄▅▅▆▆▆▆▆▆▆▆▇▇▇▆▇▇▇█▇▇██▇█▇██████▇▇██
server_val_loss,██▆▄▅▄▅▃▃▃▂▂▂▂▂▂▂▃▂▂▂▂▂▂▁▁▂▁▁▁▂▁▁▁▁▂▁▂▁▁

metric,summary
client_avg_accuracy,65.35156
client_avg_loss,1.42847
round,299.0
server_val_accuracy,38.25
server_val_loss,2.75259



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_8_checkpoint.pth

Round 5/150
Selected Clients: [58 37  7 10 14 65 19  2  5 78]
Avg Client Loss: 3.7090 | Avg Client Accuracy: 49.60%
Evaluation Loss:9.6584 | Val Accuracy: 7.08%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_8_checkpoint.pth

Round 10/150
Selected Clients: [21 45 85 25 69 82 59 80 97  5]
Avg Client Loss: 7.9480 | Avg Client Accuracy: 43.61%
Evaluation Loss:11.6999 | Val Accuracy: 5.87%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_8_checkpoint.pth

Round 15/150
Selected

metric,history
client_avg_accuracy,▂▁▄▄▄▄▅▆▅▆▆▅▆▆▆▇▅▇▇▇█▇▇▇▇█▇▇▇█
client_avg_loss,▄█▄▄▃▃▃▃▄▂▂▃▃▂▂▁▃▂▂▂▁▁▁▁▂▁▂▂▁▁
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▁▃▃▃▃▅▃▅▅▆▅▇▆▆▇▇▇▆▇▇▇▇▇▇▇▇▇▇█
server_val_loss,▆█▅▄▄▆▃▅▃▃▃▄▂▂▃▂▂▂▃▂▁▁▂▂▂▂▂▁▁▁

metric,summary
client_avg_accuracy,77.35088
client_avg_loss,0.89832
round,149.0
server_val_accuracy,38.17
server_val_loss,3.40348



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_16_checkpoint.pth

Round 5/75
Selected Clients: [99 26 20 63 78 23 85 13 68 61]
Avg Client Loss: 5.8495 | Avg Client Accuracy: 60.04%
Evaluation Loss:16.5327 | Val Accuracy: 5.67%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_16_checkpoint.pth

Round 10/75
Selected Clients: [30 79 54 19 65 13 53 22 59 99]
Avg Client Loss: 5.1169 | Avg Client Accuracy: 66.70%
Evaluation Loss:26.5045 | Val Accuracy: 5.10%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)/dino_vits_16_non_iid(5)_local_steps_16_checkpoint.pth

Round 15/75
Selecte

metric,history
client_avg_accuracy,▁▃▃▄▅▆▆▇▆█▇▇▆█▆
client_avg_loss,▇▆█▅▃▂▂▁▂▁▁▂▂▁▂
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▁▅▅▅▆▅▇▇████▇█
server_val_loss,▅█▃▃▃▂▂▁▁▁▁▁▁▁▁

metric,summary
client_avg_accuracy,74.57719
client_avg_loss,2.44128
round,74.0
server_val_accuracy,30.57
server_val_loss,6.26572


In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_10 = create_non_iid_splits(train_set, classes_per_client=10)

### --- LOOP FOR non_IID(10) ACCURACIES AND PLOTS --- ###
# This loop will run a new experiment for each `num_local_steps` value.
# The results (e.g., accuracy, loss) will be logged to W&B. You can then use the
# W&B dashboard to compare the runs and generate plots.
# Fixed budget of local steps: num_optimal_rounds rounds at 4 local steps each
num_total_steps = num_optimal_rounds * 4
num_local_steps_iterations = [4, 8, 16]
device = "cuda"
model_name = "dino_vits_16"  # defined once; this prefix appears in run and checkpoint names
project_name = "FederatedProject_nonIID(10)"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds.
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(10)_local_steps_{num_local_steps_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": 64,  # client batch size passed to train_server below (val_loader uses 128)
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(10)", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Allows wandb.init to be called again in this process; each run is also closed by wandb.finish() below
    )

    # Copy your config
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_10,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=64,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()

Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 10 classes each.


Each of the 100 classes split into 10 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)}
Total: 10
Client 1 has samples from classes: {np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19)}
Total: 10
Client 2 has samples from classes: {np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29)}
Total: 10
Client 3 has samples from classes: {np.int64(32), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(37), np.int64(38), np.int64(39), np.int64(30), np.int64(31)}
Total: 10
Client 4 has samples from classes: {np.int64(40), np.int64(41), np.int64(4

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_4_checkpoint.pth

Round 5/300
Selected Clients: [91 89 93 22 86 72 64 69 10 50]
Avg Client Loss: 3.9930 | Avg Client Accuracy: 22.66%
Evaluation Loss:6.0810 | Val Accuracy: 5.28%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_4_checkpoint.pth

Round 10/300
Selected Clients: [76 68 92  3 48 94 25 84 62 34]
Avg Client Loss: 3.0784 | Avg Client Accuracy: 31.91%
Evaluation Loss:5.1325 | Val Accuracy: 10.17%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_4_checkpoint.pth

Round 15/300
Se

metric,history
client_avg_accuracy,▁▃▄▄▅▅▆▆▅▆▆▆▅▆▆▇▇▇▇▆▇▇▇▇█▆▇▇▇▇█▇█▇▇▇█▇▇▇
client_avg_loss,█▆▅▄▄▃▃▂▂▄▃▂▃▃▂▂▂▂▂▂▂▂▃▁▂▂▂▁▁▁▁▁▂▁▁▂▁▁▂▁
round,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
server_val_accuracy,▁▂▃▄▄▅▆▆▆▅▇▆▆▇▆▇▇▇▇▇██▇▇▇██▇▇██▇████████
server_val_loss,█▆▄▃▃▃▂▂▃▂▂▂▂▂▂▂▂▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁

metric,summary
client_avg_accuracy,61.79688
client_avg_loss,1.39573
round,299.0
server_val_accuracy,40.02
server_val_loss,2.46739



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_8_checkpoint.pth

Round 5/150
Selected Clients: [92 57 45 82 93 72 62 55 79 80]
Avg Client Loss: 4.9776 | Avg Client Accuracy: 35.19%
Evaluation Loss:9.6926 | Val Accuracy: 8.70%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_8_checkpoint.pth

Round 10/150
Selected Clients: [32 75 83 84 20 46  8 13 78 58]
Avg Client Loss: 3.7961 | Avg Client Accuracy: 43.81%
Evaluation Loss:6.6133 | Val Accuracy: 13.99%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_8_checkpoint.pth

Round 15/150
Se

metric,history
client_avg_accuracy,▁▃▅▄▄▅▄▅▆▇▅▅▆▇▆▆▆▅▇▆▆▇█▇▇▇█▇▇▇
client_avg_loss,█▆▅▄▄▄▅▄▂▂▅▃▂▂▂▂▃▄▂▃▃▂▁▂▂▁▁▂▂▂
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▂▄▄▄▄▅▅▆▆▇▆▇▆▆▇▇▆█▇▇█▇▇▇█▇█▇▇
server_val_loss,█▅▄▃▃▃▃▂▂▂▂▂▁▂▂▂▂▂▁▂▂▁▂▁▂▁▂▁▁▁

metric,summary
client_avg_accuracy,63.50135
client_avg_loss,1.68545
round,149.0
server_val_accuracy,37.57
server_val_loss,3.17783



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_16_checkpoint.pth

Round 5/75
Selected Clients: [48  6 74 31 69 97 58 32 41 16]
Avg Client Loss: 3.2161 | Avg Client Accuracy: 53.82%
Evaluation Loss:7.7622 | Val Accuracy: 12.69%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_16_checkpoint.pth

Round 10/75
Selected Clients: [18 76 64 69 16 90 82 33 39 75]
Avg Client Loss: 3.7985 | Avg Client Accuracy: 58.70%
Evaluation Loss:7.4170 | Val Accuracy: 19.88%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)/dino_vits_16_non_iid(10)_local_steps_16_checkpoint.pth

Round 15/75
S

metric,history
client_avg_accuracy,▁▃▂▃▄▆▆▆▆▇▆▆███
client_avg_loss,▅▆██▅▃▄▃▃▂▃▃▁▁▂
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▃▃▅▆▅▇▅▇▆▅███▇
server_val_loss,█▇▇▇▆█▄▅▂▄▇▃▁▃▂

metric,summary
client_avg_accuracy,70.02116
client_avg_loss,1.67034
round,74.0
server_val_accuracy,31.38
server_val_loss,4.64849
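
The run summaries for the three settings completed so far can be compared side by side. The sketch below copies the final `client_avg_accuracy` and `server_val_accuracy` values by hand from the summaries above (rounded to two decimals) and prints the gap between them. The dictionary layout and variable names are illustrative, and each entry is a single run, so small differences should be read with caution.

```python
# Final summary values copied by hand from the W&B run summaries above.
# Layout: setting -> local steps J -> (client_avg_accuracy, server_val_accuracy)
results = {
    "nonIID(1)":  {4: (81.48, 32.76), 8: (87.07, 29.56), 16: (90.54, 20.24)},
    "nonIID(5)":  {4: (65.35, 38.25), 8: (77.35, 38.17), 16: (74.58, 30.57)},
    "nonIID(10)": {4: (61.80, 40.02), 8: (63.50, 37.57), 16: (70.02, 31.38)},
}

best_steps = {}
for setting, runs in results.items():
    for j, (client_acc, val_acc) in runs.items():
        gap = client_acc - val_acc  # how far local accuracy runs ahead of the global model
        print(f"{setting:<11} J={j:>2}  client={client_acc:6.2f}  val={val_acc:6.2f}  gap={gap:6.2f}")
    # The local-step count with the highest final server validation accuracy.
    best_steps[setting] = max(runs, key=lambda j: runs[j][1])

print(best_steps)
```

In these logs the fewest local steps (J = 4) ends with the highest validation accuracy in every setting, and the client-to-server gap is widest for nonIID(1).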


In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_50 = create_non_iid_splits(train_set, classes_per_client=50)

### --- LOOP FOR non_IID(50) ACCURACIES AND PLOTS --- ###
# This loop will run a new experiment for each `num_local_steps` value.
# The results (e.g., accuracy, loss) will be logged to W&B. You can then use the
# W&B dashboard to compare the runs and generate plots.
# Fixed budget of local steps: num_optimal_rounds rounds at 4 local steps each
num_total_steps = num_optimal_rounds * 4
num_local_steps_iterations = [4, 8, 16]
device = "cuda"
model_name = "dino_vits_16"  # defined once; this prefix appears in run and checkpoint names
project_name = "FederatedProject_nonIID(50)"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds.
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(50)_local_steps_{num_local_steps_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": 64,  # client batch size passed to train_server below (val_loader uses 128)
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(50)", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Allows wandb.init to be called again in this process; each run is also closed by wandb.finish() below
    )

    # Copy your config
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_50,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=64,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()

Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 50 classes each.


Each of the 100 classes split into 50 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(37), np.int64(38), np.int64(39), np.int64(40), np.int64(41), np.int64(42), np.int64(43), np.int64(44), np.int64(45), np.int64(46), np.int64(47), np.int64(48), np.int64(49)}
Total: 50
Client 1 has samples from classes: {np.int64(50), np.int64(51), np.int64

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_4_checkpoint.pth

Round 5/300
Selected Clients: [49 27 46  4 83 82 18 53 75  1]
Avg Client Loss: 4.7801 | Avg Client Accuracy: 7.11%
Evaluation Loss:5.2008 | Val Accuracy: 7.20%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_4_checkpoint.pth

Round 10/300
Selected Clients: [ 2 92 75 90 45 55 39 84 86 50]
Avg Client Loss: 4.0594 | Avg Client Accuracy: 14.02%
Evaluation Loss:4.3411 | Val Accuracy: 13.53%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_4_checkpoint.pth

Round 15/300
Sel

metric,history
client_avg_accuracy,▁▂▄▄▄▅▅▅▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇▆███▇▇▆████▇██▇██
client_avg_loss,█▆▅▅▄▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁
round,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
server_val_accuracy,▁▂▃▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█▇▇████▇▇██████████
server_val_loss,█▆▅▄▄▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁

metric,summary
client_avg_accuracy,45.66406
client_avg_loss,2.07244
round,299.0
server_val_accuracy,44.51
server_val_loss,2.22814



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_8_checkpoint.pth

Round 5/150
Selected Clients: [56 19 37 34 60 17  5 93 18 58]
Avg Client Loss: 3.4704 | Avg Client Accuracy: 21.97%
Evaluation Loss:3.9161 | Val Accuracy: 18.45%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_8_checkpoint.pth

Round 10/150
Selected Clients: [19 85 37  1 99 59 33 27 28 86]
Avg Client Loss: 3.6885 | Avg Client Accuracy: 25.05%
Evaluation Loss:5.4334 | Val Accuracy: 21.15%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_8_checkpoint.pth

Round 15/150
S

metric,history
client_avg_accuracy,▁▂▅▅▅▅▆▆▆▆▆▅▇▇▆▇▇██▆█▇▇█▇▇▇▆█▇
client_avg_loss,▇█▄▃▃▄▃▂▃▂▄▅▂▂▄▂▂▁▂▄▁▂▂▁▂▂▂▃▁▂
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▂▅▄▅▆▆▆▇▆▇▆▆▇▅▇█▆▆▅█▆█▇██▇▇█▆
server_val_loss,▅█▃▃▂▂▂▂▂▂▂▃▂▂▄▁▁▂▂▃▁▂▁▁▁▁▂▂▁▃

metric,summary
client_avg_accuracy,46.46929
client_avg_loss,2.12938
round,149.0
server_val_accuracy,35.85
server_val_loss,3.28026



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_16_checkpoint.pth

Round 5/75
Selected Clients: [92 47 99 40 70 19 49 46  5 82]
Avg Client Loss: 2.7295 | Avg Client Accuracy: 43.35%
Evaluation Loss:3.8584 | Val Accuracy: 26.16%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_16_checkpoint.pth

Round 10/75
Selected Clients: [59  6 41 19 96 67 72 54 31 10]
Avg Client Loss: 2.1414 | Avg Client Accuracy: 50.95%
Evaluation Loss:3.5234 | Val Accuracy: 30.12%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)/dino_vits_16_non_iid(50)_local_steps_16_checkpoint.pth

Round 15/75
S

metric,history
client_avg_accuracy,▁▄▅▆▅▆▆▇▇█▇▆▇▇▇
client_avg_loss,█▅▄▃▇▂▂▂▂▁▂▅▂▁▂
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▃▃▆▃▂▄▆▅▇▆▃█▅▆
server_val_loss,▅▄▄▂▆█▅▂▃▁▃▇▁▄▃

metric,summary
client_avg_accuracy,59.19687
client_avg_loss,1.72568
round,74.0
server_val_accuracy,37.77
server_val_loss,3.36463
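
Each "Selected Clients" line in the logs above lists 10 distinct ids out of 100 clients. The following is a minimal sketch of how such per-round sampling could be done. `sample_clients` is a hypothetical helper, not the internals of the project's `train_server`, and `frac=0.1` is an assumption inferred from the logs rather than a value shown in the notebook.

```python
import numpy as np

def sample_clients(num_clients, frac, rng):
    """Pick max(1, round(frac * num_clients)) distinct client ids uniformly at random."""
    num_selected = max(1, int(round(frac * num_clients)))
    # replace=False guarantees that no client is picked twice in the same round
    return rng.choice(num_clients, size=num_selected, replace=False)

# Fixed seed only so that the sketch is repeatable; the real runs are not seeded here.
rng = np.random.default_rng(seed=0)
selected = sample_clients(num_clients=100, frac=0.1, rng=rng)
print(f"Selected Clients: {selected}")
```

Clamping to at least one client keeps a round from being empty when `frac * num_clients` rounds down to zero.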


In [None]:
### RUNS WITH BATCH SIZE = 128 ###

### --- LOOP FOR IID ACCURACIES AND PLOTS --- ###
# This loop will run a new experiment for each `num_local_steps` value.
# The results (e.g., accuracy, loss) will be logged to W&B. You can then use the
# W&B dashboard to compare the runs and generate plots.
num_total_steps = num_optimal_rounds * 4
num_local_steps_iterations = [4,8,16]
device = "cuda"
model_name = "dino_vits_16"  # set once; used in run names and checkpoint file names
project_name = "FederatedProjectIID_300rounds_bs128"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds.
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_iid_local_steps_{num_local_steps_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": val_loader.batch_size,  # validation loader's batch size; keep in sync with the batch_size passed to train_server
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "IID_bs128", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Essential for running wandb.init in a loop
    )

    # Reference to the active run's config (not a copy)
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=128,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()


--- Starting training for num_local_steps = 4 | num_rounds = 300 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_4_checkpoint.pth

Round 5/300
Selected Clients: [69 65  3 11 52 87 37 75 93  4]
Avg Client Loss: 5.0257 | Avg Client Accuracy: 4.77%
Evaluation Loss:5.1143 | Val Accuracy: 6.74%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_4_checkpoint.pth

Round 10/300
Selected Clients: [61 59 82 91 63 38 36 90 52 71]
Avg Client Loss: 4.1111 | Avg Client Accuracy: 14.10%
Evaluation Loss:4.2278 | Val Accuracy: 13.90%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_4_checkpoint.pth

Round 15/300
Select

metric,history
client_avg_accuracy,▁▃▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇████████████▇████
client_avg_loss,█▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
round,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▃▄▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█▇▇▇███████████████
server_val_loss,█▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

metric,summary
client_avg_accuracy,45.35031
client_avg_loss,2.05609
round,299.0
server_val_accuracy,44.64
server_val_loss,2.19052



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_8_checkpoint.pth

Round 5/150
Selected Clients: [73 17 21 98 42 92 11 27 16 75]
Avg Client Loss: 3.3947 | Avg Client Accuracy: 22.05%
Evaluation Loss:3.9002 | Val Accuracy: 18.64%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_8_checkpoint.pth

Round 10/150
Selected Clients: [42 27 62  1 63 48 89 23 86 70]
Avg Client Loss: 2.7551 | Avg Client Accuracy: 32.52%
Evaluation Loss:3.2285 | Val Accuracy: 27.16%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_8_checkpoint.pth

Round 15/150
Sele

metric,history
client_avg_accuracy,▁▃▄▅▅▅▆▆▆▇▆▇▆▇▇▇▇▇▇█▇▇▇██▇████
client_avg_loss,█▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▃▄▅▆▆▆▇▇▇▇▇▇▇████████████████
server_val_loss,█▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

metric,summary
client_avg_accuracy,50.02757
client_avg_loss,1.88521
round,149.0
server_val_accuracy,44.26
server_val_loss,2.25723



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_16_checkpoint.pth

Round 5/75
Selected Clients: [33 24  5 85 18 77 42 30 70 25]
Avg Client Loss: 1.9513 | Avg Client Accuracy: 54.94%
Evaluation Loss:3.1686 | Val Accuracy: 28.75%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_16_checkpoint.pth

Round 10/75
Selected Clients: [11 72 66 59 17 75 92  0 67 33]
Avg Client Loss: 1.6001 | Avg Client Accuracy: 61.36%
Evaluation Loss:2.7951 | Val Accuracy: 34.55%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_IID_300round_bs128/dino_vits_16_iid_local_steps_16_checkpoint.pth

Round 15/75
Sele

metric,history
client_avg_accuracy,▁▄▅▇▇▆▆▇▇▇█▇███
client_avg_loss,█▄▄▂▂▃▃▂▁▂▁▂▁▁▁
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▄▅▆▆▇▇▇▇██████
server_val_loss,█▅▃▃▂▂▂▁▁▁▁▁▁▁▁

metric,summary
client_avg_accuracy,68.50741
client_avg_loss,1.271
round,74.0
server_val_accuracy,43.84
server_val_loss,2.39631
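
The three runs above share one budget of local steps, so doubling the local steps halves the number of rounds. A minimal sketch of that arithmetic follows. `rounds_for_budget` is a hypothetical helper, and `num_optimal_rounds = 300` is an assumption inferred from the "Round 5/300" log at 4 local steps (the variable itself is defined earlier in the notebook).

```python
def rounds_for_budget(total_steps, local_steps):
    """Number of communication rounds that fit in a fixed budget of local steps."""
    if local_steps <= 0:
        raise ValueError("local_steps must be positive")
    # Floor division: any remainder of the budget is simply left unused.
    return total_steps // local_steps

num_optimal_rounds = 300                   # assumed, inferred from the logs
num_total_steps = num_optimal_rounds * 4   # same formula as in the cell above
schedule = {j: rounds_for_budget(num_total_steps, j) for j in (4, 8, 16)}
print(schedule)  # {4: 300, 8: 150, 16: 75}
```

When the number of local steps does not divide the budget, the product `num_rounds * num_local_steps` falls slightly short of it, which is why the comment in the loop uses ≈ rather than =.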


In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_1 = create_non_iid_splits(train_set, classes_per_client=1)
### --- LOOP FOR non_IID(1) ACCURACIES AND PLOTS --- ###
# This loop will run a new experiment for each `num_local_steps` value.
# The results (e.g., accuracy, loss) will be logged to W&B. You can then use the
# W&B dashboard to compare the runs and generate plots.
num_total_steps = num_optimal_rounds * 4
num_local_steps_iterations = [4,8,16]
device = "cuda"
model_name = "dino_vits_16"  # set once; used in run names and checkpoint file names
project_name = "FederatedProject_nonIID(1)_bs128"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds.
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(1)_local_steps_{num_local_steps_val}_bs128"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": val_loader.batch_size,  # validation loader's batch size; keep in sync with the batch_size passed to train_server
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(1)_bs128", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Essential for running wandb.init in a loop
    )

    # Reference to the active run's config (not a copy)
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_1,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=128,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()

Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 1 classes each.


Each of the 100 classes split into 1 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0)}
Total: 1
Client 1 has samples from classes: {np.int64(1)}
Total: 1
Client 2 has samples from classes: {np.int64(2)}
Total: 1
Client 3 has samples from classes: {np.int64(3)}
Total: 1
Client 4 has samples from classes: {np.int64(4)}
Total: 1
Client 5 has samples from classes: {np.int64(5)}
Total: 1
Client 6 has samples from classes: {np.int64(6)}
Total: 1
Client 7 has samples from classes: {np.int64(7)}
Total: 1
Client 8 has samples from classes: {np.int64(8)}
Total: 1
Client 9 has samples from classes: {np.int64(9)}
Total: 1
Client 10 has samples from classes: {np.int64(10)}
Total: 1
Client 11 has samples from classes: {np.int64(11)}
Total: 1
Client 12 has samples from classes: {np.int64(12)}
Total: 1
Client 13 has samples from classes: {np.int64(13)}

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_4_bs128_checkpoint.pth

Round 5/300
Selected Clients: [97 87 75 74  0 91 60 27 56  2]
Avg Client Loss: 4.5311 | Avg Client Accuracy: 68.84%
Evaluation Loss:14.4331 | Val Accuracy: 3.13%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_4_bs128_checkpoint.pth

Round 10/300
Selected Clients: [69 47 33 93 59 91 99 85  7 42]
Avg Client Loss: 7.0809 | Avg Client Accuracy: 67.46%
Evaluation Loss:12.4308 | Val Accuracy: 3.65%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_4_bs128_

metric,history
client_avg_accuracy,▃▃▄▁▇▃▃▆▃▃▆▄▃▄▄▆▇▅▅▆▆▆▆▆▇▆▇▇▅▆▅█▅▆█▇▅█▅▅
client_avg_loss,▄▅▆▄█▃▆▅▄▂▅▅▃▂▃▃▂▂▃▂▂▂▁▁▂▁▂▂▂▁▂▁▂▂▃▃▁▂▁▂
round,▁▁▁▁▁▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▁▂▂▂▃▃▃▃▃▄▄▄▅▅▆▆▅▆▆▆▆▆▇▆▇▇▇▇▇█▇▇▇▇█████
server_val_loss,█▇▇▇▆▆▅▅▅▅▃▄▃▃▃▃▃▂▂▂▁▂▁▁▂▁▁▁▁▂▁▂▁▂▁▁▁▁▁▁

metric,summary
client_avg_accuracy,71.54341
client_avg_loss,2.20946
round,299.0
server_val_accuracy,32.38
server_val_loss,4.73316



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_8_bs128_checkpoint.pth

Round 5/150
Selected Clients: [91 28 22 39 77 41 12 14  6 13]
Avg Client Loss: 3.5490 | Avg Client Accuracy: 85.93%
Evaluation Loss:17.0085 | Val Accuracy: 2.51%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_8_bs128_checkpoint.pth

Round 10/150
Selected Clients: [13 83 74 94 68 78 31 27 86 91]
Avg Client Loss: 8.4259 | Avg Client Accuracy: 80.23%
Evaluation Loss:19.7701 | Val Accuracy: 3.45%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_8_bs128_

metric,history
client_avg_accuracy,▆▃▃▂▃▃▄▁▁▃▄▃▂▇▄▆▄▅▆▇▅▅▆█▆▇▇▇▇▇
client_avg_loss,▃█▇▆▆▄▆▆▆▆▄▅▆▂▄▂▃▃▁▁▂▃▂▁▂▁▁▁▁▂
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▁▂▂▃▃▃▄▄▄▄▆▆▅▆▆▇▆▆▆█▇██▇█████
server_val_loss,▇███▆▇▆▅▅▅▅▄▃▄▃▄▃▃▃▃▂▂▂▂▂▁▁▂▂▁

metric,summary
client_avg_accuracy,87.2015
client_avg_loss,1.79191
round,149.0
server_val_accuracy,27.73
server_val_loss,6.53202



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_16_bs128_checkpoint.pth

Round 5/75
Selected Clients: [72 20  5 41 93 88 43 97 11 90]
Avg Client Loss: 3.0239 | Avg Client Accuracy: 88.50%
Evaluation Loss:30.2607 | Val Accuracy: 2.92%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_16_bs128_checkpoint.pth

Round 10/75
Selected Clients: [92 66 87 20 63 36 69 62 52 56]
Avg Client Loss: 4.0033 | Avg Client Accuracy: 87.90%
Evaluation Loss:33.5845 | Val Accuracy: 6.80%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(1)_bs128/dino_vits_16_non_iid(1)_local_steps_16_bs128

metric,history
client_avg_accuracy,▄▃▅▁▁▂▂▄▆█▅█▇▆▆
client_avg_loss,▄▅▆▇█▆▆▄▃▁▃▁▃▁▂
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▂▂▄▃▄▅▅▆▆▆▇▇██
server_val_loss,▇█▇▅▅▄▃▃▂▂▂▁▁▁▁

metric,summary
client_avg_accuracy,89.9154
client_avg_loss,2.07193
round,74.0
server_val_accuracy,22.63
server_val_loss,11.40999
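
The split log above ("Each of the 100 classes split into 1 shards", client k holding only class k) suggests a label-sharding scheme. The following is a minimal sketch of one way to build such splits. `shard_by_label` is a hypothetical function written for illustration; it is not the project's `create_non_iid_splits`, whose implementation is not shown here, and the synthetic balanced labels stand in for the real 40000 training labels.

```python
import numpy as np

def shard_by_label(labels, num_clients=100, classes_per_client=5, seed=0):
    """Give each client one shard from each of `classes_per_client` consecutive classes."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    num_classes = len(classes)
    total_slots = num_clients * classes_per_client
    if classes_per_client > num_classes or total_slots % num_classes != 0:
        raise ValueError("num_clients * classes_per_client must be a multiple of num_classes")
    shards_per_class = total_slots // num_classes

    rng = np.random.default_rng(seed)
    # Shuffle the sample indices of every class, then cut them into equal-sized shards.
    shards = {}
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        shards[c] = list(np.array_split(idx, shards_per_class))

    client_indices = []
    for k in range(num_clients):
        parts = []
        for j in range(classes_per_client):
            # Walk through the classes cyclically so every shard is handed out exactly once.
            c = classes[(k * classes_per_client + j) % num_classes]
            parts.append(shards[c].pop())
        client_indices.append(np.concatenate(parts))
    return client_indices

# Synthetic, perfectly balanced labels: 100 classes with 400 samples each.
labels = np.repeat(np.arange(100), 400)
clients = shard_by_label(labels, num_clients=100, classes_per_client=5)
print(sorted(set(labels[clients[0]].tolist())))  # [0, 1, 2, 3, 4]
```

With `classes_per_client=1` this sketch degenerates to one whole class per client, which is the most extreme heterogeneity and matches the pattern in the log above.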


In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_5 = create_non_iid_splits(train_set, classes_per_client=5)
### --- LOOP FOR non_IID(5) ACCURACIES AND PLOTS --- ###
# This loop will run a new experiment for each `num_local_steps` value.
# The results (e.g., accuracy, loss) will be logged to W&B. You can then use the
# W&B dashboard to compare the runs and generate plots.
num_total_steps = num_optimal_rounds * 4
num_local_steps_iterations = [4,8,16]
device = "cuda"
model_name = "dino_vits_16"  # set once; used in run names and checkpoint file names
project_name = "FederatedProject_nonIID(5)_bs128"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds.
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(5)_local_steps_{num_local_steps_val}_bs128"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": val_loader.batch_size,  # validation loader's batch size; keep in sync with the batch_size passed to train_server
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(5)_bs128", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Essential for running wandb.init in a loop
    )

    # Reference to the active run's config (not a copy)
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_5,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=128,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()

Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 5 classes each.


Each of the 100 classes split into 5 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4)}
Total: 5
Client 1 has samples from classes: {np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)}
Total: 5
Client 2 has samples from classes: {np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14)}
Total: 5
Client 3 has samples from classes: {np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19)}
Total: 5
Client 4 has samples from classes: {np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24)}
Total: 5
Client 5 has samples from classes: {np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29)}
Total: 5
Client 6 has samples from classes: {np.int64(32), np.int64(33), np.int64(34), np.int64(30), np.int64(31)}
Total: 5
Client 7 has sa

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_4_bs128_checkpoint.pth

Round 5/300
Selected Clients: [49 91 57 92 98 53 20 19 15 67]
Avg Client Loss: 3.7713 | Avg Client Accuracy: 34.37%
Evaluation Loss:6.2658 | Val Accuracy: 4.56%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_4_bs128_checkpoint.pth

Round 10/300
Selected Clients: [47  6  7 27  0 48 91 70 23 18]
Avg Client Loss: 3.5421 | Avg Client Accuracy: 35.35%
Evaluation Loss:6.2420 | Val Accuracy: 7.82%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_4_bs128_ch

metric,history
client_avg_accuracy,▁▁▂▄▄▄▆▆▆▆▆▅▇▆▇▇▆▇▇▇▇▇▇█▇█▇▇▇▇▇▇▆███▆▇█▇
client_avg_loss,█▇▆▄▅▅▂▃▂▃▂▄▄▁▃▃▂▂▂▂▂▂▂▁▂▂▁▂▂▂▁▂▂▁▁▁▃▂▁▂
round,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇████
server_val_accuracy,▁▂▃▃▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇████▇▇▇█▇██▇▇██
server_val_loss,█▆▆▆▄▃▃▃▃▃▂▃▂▂▂▂▂▁▂▂▁▂▁▂▁▁▂▂▂▁▂▁▁▂▂▁▁▁▁▁

metric,summary
client_avg_accuracy,64.39135
client_avg_loss,1.48581
round,299.0
server_val_accuracy,38.8
server_val_loss,2.61817



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_8_bs128_checkpoint.pth

Round 5/150
Selected Clients: [32 19 15 70 92  6 89 34 14 91]
Avg Client Loss: 4.1556 | Avg Client Accuracy: 51.84%
Evaluation Loss:9.9471 | Val Accuracy: 6.44%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_8_bs128_checkpoint.pth

Round 10/150
Selected Clients: [80 68 21 23 29 57 37 69 33  5]
Avg Client Loss: 4.6251 | Avg Client Accuracy: 51.05%
Evaluation Loss:11.0668 | Val Accuracy: 6.81%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_8_bs128_c

metric,history
client_avg_accuracy,▁▁▃▁▄▆▄▅▅▆▆▆▆▇▅▇▆▆▇▇▇▆▇▆▆█▇███
client_avg_loss,▆▇▅█▃▃▆▄▄▃▂▂▂▂▅▂▄▃▂▁▂▃▂▃▃▁▂▁▁▁
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▁▃▄▃▅▅▅▆▅▆▇▆▆▅▇▇▆▇▆█▇▇▇▇▆██▇▆
server_val_loss,▇█▄▃▅▄▃▃▃▃▂▁▂▂▃▁▂▂▁▂▁▁▁▂▁▂▁▁▁▂

metric,summary
client_avg_accuracy,74.25418
client_avg_loss,1.18244
round,149.0
server_val_accuracy,29.62
server_val_loss,4.34128



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_16_bs128_checkpoint.pth

Round 5/75
Selected Clients: [20 87 35 17 32  2 54 51 12 48]
Avg Client Loss: 4.8762 | Avg Client Accuracy: 63.07%
Evaluation Loss:13.0580 | Val Accuracy: 8.35%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_16_bs128_checkpoint.pth

Round 10/75
Selected Clients: [61 98 79 17 24 32 60 28 41 68]
Avg Client Loss: 3.9511 | Avg Client Accuracy: 64.26%
Evaluation Loss:15.1118 | Val Accuracy: 10.72%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(5)_bs128/dino_vits_16_non_iid(5)_local_steps_16_bs12

0,1
client_avg_accuracy,▁▂▂▁▅▄▇█▅▇▅▆▆▅▇
client_avg_loss,▅▄▅█▂▃▂▁▃▂▃▂▂▂▂
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▂▃▃▅▄▇▆▇█▇▇██▇
server_val_loss,▇█▅▇▄▄▁▂▂▂▂▂▁▁▃

metric,value
client_avg_accuracy,75.41088
client_avg_loss,1.96666
round,74.0
server_val_accuracy,27.39
server_val_loss,7.65721
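
The round counts in the logs above (150 rounds for 8 local steps, 75 rounds for 16) follow from holding the total number of local steps fixed. The sketch below reproduces that arithmetic. The budget of 1200 steps is an assumption inferred from the logged pairs (8 × 150 = 16 × 75 = 1200), and `rounds_for_budget` is a hypothetical helper, not part of the project code.

```python
# Assumption: a budget of 1200 local steps per run, inferred from the logs
# (8 local steps x 150 rounds = 16 local steps x 75 rounds = 1200).
NUM_TOTAL_STEPS = 1200


def rounds_for_budget(total_steps: int, local_steps: int) -> int:
    """Return how many communication rounds fit in the step budget."""
    if local_steps <= 0:
        raise ValueError("local_steps must be positive")
    # Integer division: any remainder of the budget is simply left unused.
    return total_steps // local_steps


# Map each local-step setting to its number of rounds.
schedule = {j: rounds_for_budget(NUM_TOTAL_STEPS, j) for j in (4, 8, 16)}

for local_steps, rounds in schedule.items():
    print(f"local_steps={local_steps:>2} -> rounds={rounds:>3} "
          f"(steps used: {local_steps * rounds})")
```

Because the budget is divisible by every tested value, each run uses exactly the same number of local steps; with a value such as 7 the integer division would leave part of the budget unused.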


In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_10 = create_non_iid_splits(train_set, classes_per_client=10)
### --- LOOP FOR non_IID(10) ACCURACIES AND PLOTS --- ###
# This loop runs one experiment for each `num_local_steps` value.
# Metrics (e.g., accuracy, loss) are logged to W&B; use the W&B dashboard
# to compare the runs and generate plots.
num_total_steps = num_optimal_rounds * 4  # step budget shared by all runs
num_local_steps_iterations = [4, 8, 16]
device = "cuda"
model_name = "dino_vits_16"
project_name = "FederatedProject_nonIID(10)_bs128"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds:
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(10)_local_steps_{num_local_steps_val}_bs128"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": val_loader.batch_size,
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(10)_bs128", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Essential for running wandb.init in a loop
    )

    # Copy your config
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_10,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=128,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()

Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 10 classes each.


Each of the 100 classes split into 10 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)}
Total: 10
Client 1 has samples from classes: {np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19)}
Total: 10
Client 2 has samples from classes: {np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29)}
Total: 10
Client 3 has samples from classes: {np.int64(32), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(37), np.int64(38), np.int64(39), np.int64(30), np.int64(31)}
Total: 10
Client 4 has samples from classes: {np.int64(40), np.int64(41), np.int64(4

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_4_bs128_checkpoint.pth

Round 5/300
Selected Clients: [67 78 17 62  1 52 80 10  0 97]
Avg Client Loss: 3.7470 | Avg Client Accuracy: 19.56%
Evaluation Loss:6.2256 | Val Accuracy: 5.33%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_4_bs128_checkpoint.pth

Round 10/300
Selected Clients: [81 41 32 52 57 47 44 64 98 95]
Avg Client Loss: 3.6259 | Avg Client Accuracy: 26.70%
Evaluation Loss:5.0933 | Val Accuracy: 10.11%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_4_b

metric,history
client_avg_accuracy,▁▂▄▄▅▅▆▆▆▆▇▇▆▆▇▇▇▇▇█▇▇▇▆█▇▇▇█▇█▇██▆▇██▆█
client_avg_loss,█▆▃▄▄▃▃▃▂▃▃▃▄▂▃▂▂▃▂▂▂▂▃▂▂▂▂▂▂▂▃▂▂▄▁▂▂▂▂▂
round,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇████
server_val_accuracy,▁▂▃▃▄▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▆▇▇█▇█▇▇▇█████████
server_val_loss,█▆▅▄▃▃▃▂▂▃▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▂▁▁

metric,value
client_avg_accuracy,58.21747
client_avg_loss,1.55005
round,299.0
server_val_accuracy,40.82
server_val_loss,2.41496



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_8_bs128_checkpoint.pth

Round 5/150
Selected Clients: [20 92 97 60 27 44 14 80 77 72]
Avg Client Loss: 5.8641 | Avg Client Accuracy: 33.40%
Evaluation Loss:8.2934 | Val Accuracy: 7.57%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_8_bs128_checkpoint.pth

Round 10/150
Selected Clients: [76 46 15  3 49 23 54  8 24 83]
Avg Client Loss: 2.9000 | Avg Client Accuracy: 47.38%
Evaluation Loss:7.2480 | Val Accuracy: 14.32%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_8_b

metric,history
client_avg_accuracy,▁▄▅▅▅▆▆▇▆▇▇▇▆▆▇▆▇▇█▇█▆▇██▇▇█▇▇
client_avg_loss,█▃▂▃▄▃▂▁▃▁▁▂▂▂▂▂▂▁▁▂▁▃▂▁▁▁▁▁▁▂
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▂▃▄▄▆▆▆▆▆▇▆▆▇▇▅▇▆██▇▆▇▇█████▇
server_val_loss,█▇▆▃▄▄▂▂▂▂▁▃▂▃▂▄▂▂▁▁▂▃▃▂▁▁▁▁▁▂

metric,value
client_avg_accuracy,62.88551
client_avg_loss,1.71217
round,149.0
server_val_accuracy,35.28
server_val_loss,3.43905



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_16_bs128_checkpoint.pth

Round 5/75
Selected Clients: [62 74 98 17  3 42 26 41 18 16]
Avg Client Loss: 3.6071 | Avg Client Accuracy: 57.03%
Evaluation Loss:8.1776 | Val Accuracy: 12.65%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_16_bs128_checkpoint.pth

Round 10/75
Selected Clients: [26 97 16 56 88 68 29 32 22 25]
Avg Client Loss: 3.3102 | Avg Client Accuracy: 62.78%
Evaluation Loss:11.0670 | Val Accuracy: 13.97%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(10)_bs128/dino_vits_16_non_iid(10)_local_steps_1

metric,history
client_avg_accuracy,▁▃▂▂▅▆▇▆▄▆▆█▅▇▆
client_avg_loss,▆▆▆█▂▂▁▂▄▃▃▁▃▂▂
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▁▄▄▅▆▇▆▄▆▇█▄█▅
server_val_loss,▅█▅▄▃▂▁▂▄▂▂▁▅▂▃

metric,value
client_avg_accuracy,70.08069
client_avg_loss,1.65247
round,74.0
server_val_accuracy,26.41
server_val_loss,5.64644
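
The split report at the top of this cell's output (100 classes cut into 10 shards each, client 0 holding classes 0 to 9, client 1 holding classes 10 to 19) is consistent with a shard-based partition. The sketch below is one way such a partition could be built. It is an illustration under that assumption, not the project's `create_non_iid_splits`; the function `shard_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def shard_split(labels, num_clients, classes_per_client, seed=0):
    """Hypothetical shard-based non-IID split.

    Returns {client_id: [sample indices]} where each client sees exactly
    `classes_per_client` distinct classes.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    num_classes = len(classes)

    if classes_per_client > num_classes:
        raise ValueError("classes_per_client cannot exceed the number of classes")
    total_slots = num_clients * classes_per_client
    if total_slots % num_classes != 0:
        raise ValueError("num_clients * classes_per_client must be a multiple of num_classes")
    shards_per_class = total_slots // num_classes

    # Shuffle each class and cut it into `shards_per_class` disjoint shards.
    shards = {}
    for c in classes:
        idxs = by_class[c][:]
        rng.shuffle(idxs)
        shards[c] = [idxs[s::shards_per_class] for s in range(shards_per_class)]

    # Walk the classes cyclically: client k takes the next unused shard of
    # classes k*cpc, k*cpc + 1, ... (modulo the number of classes).
    next_shard = defaultdict(int)
    clients = {k: [] for k in range(num_clients)}
    for k in range(num_clients):
        for i in range(classes_per_client):
            c = classes[(k * classes_per_client + i) % num_classes]
            clients[k].extend(shards[c][next_shard[c]])
            next_shard[c] += 1
    return clients


# Toy example: 10 classes x 20 samples, 10 clients, 2 classes per client.
toy_labels = [c for c in range(10) for _ in range(20)]
toy_split = shard_split(toy_labels, num_clients=10, classes_per_client=2)
for client_id in (0, 1):
    seen = sorted({toy_labels[i] for i in toy_split[client_id]})
    print(f"Client {client_id} has samples from classes: {seen}")
```

With 100 clients and 100 classes, this scheme yields 10 shards per class for `classes_per_client=10` and 50 shards per class for `classes_per_client=50`, so larger values spread each class over more clients and make the client distributions closer to IID.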


In [None]:
from FederatedLearningProject.data.cifar100_loader import create_non_iid_splits
client_dataset_non_iid_50 = create_non_iid_splits(train_set, classes_per_client=50)
### --- LOOP FOR non_IID(50) ACCURACIES AND PLOTS --- ###
# This loop runs one experiment for each `num_local_steps` value.
# Metrics (e.g., accuracy, loss) are logged to W&B; use the W&B dashboard
# to compare the runs and generate plots.
num_total_steps = num_optimal_rounds * 4  # step budget shared by all runs
num_local_steps_iterations = [4, 8, 16]
device = "cuda"
model_name = "dino_vits_16"
project_name = "FederatedProject_nonIID(50)_bs128"
for num_local_steps_val in num_local_steps_iterations:
    # To keep the total computation constant, we adjust the number of rounds:
    # num_rounds * num_local_steps ≈ num_total_steps
    num_rounds_val = num_total_steps // num_local_steps_val
    print(f"\n--- Starting training for num_local_steps = {num_local_steps_val} | num_rounds = {num_rounds_val} ---\n")

    # Re-initialize the model for each run to ensure fresh weights
    model = models.LinearFlexibleDino(num_layers_to_freeze=12)
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration to distinguish them in W&B
    run_name = f"{model_name}_non_iid(50)_local_steps_{num_local_steps_val}_bs128"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val,
            "batch_size": val_loader.batch_size,
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps_val, # Log the current value
            "fraction_clients": fraction,
            "setting": "nonIID(50)_bs128", # Add a tag for easy filtering in W&B
            "total_steps_budget": num_total_steps
        },
        reinit=True # Essential for running wandb.init in a loop
    )

    # Copy your config
    config = wandb.config

    # Define a unique checkpoint path for this specific run
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    checkpoint_path = os.path.join(checkpoint_dir, f"{run_name}_checkpoint.pth")

    # RECOVER CHECKPOINT for this specific run if it exists
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=run_name)
    try:
        # Only try to load if a checkpoint was actually found
        if model_data:
            print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
            model.load_state_dict(model_data["model_state_dict"])
            optimizer.load_state_dict(model_data["optimizer_state_dict"])
            print(f"Resumed training from round {start_round}.")
        else:
            print("No checkpoint found. Starting from scratch.")
    except Exception as e:
        print(f"Could not load model or optimizer state dictionary. Starting from scratch. Error: {e}")


    # Call the training function with the new set of parameters
    train_server(
        model=model,
        num_clients=num_clients,
        num_client_steps=num_local_steps_val, # Pass the current num_local_steps
        num_rounds=num_rounds_val,           # Pass the calculated num_rounds
        client_dataset=client_dataset_non_iid_50,
        frac=fraction,
        optimizer=optimizer,
        device=device,
        batch_size=128,
        n_rounds_log=5,
        val_loader=val_loader,
        criterion=criterion,
        checkpoint_path=checkpoint_path,
        model_name=run_name # Use the unique run name for checkpoint saving
    )

    # End the current wandb run before starting the next one
    wandb.finish()

Dataset has 40000 samples across 100 classes.
Creating 100 non IID splits with 50 classes each.


Each of the 100 classes split into 50 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(37), np.int64(38), np.int64(39), np.int64(40), np.int64(41), np.int64(42), np.int64(43), np.int64(44), np.int64(45), np.int64(46), np.int64(47), np.int64(48), np.int64(49)}
Total: 50
Client 1 has samples from classes: {np.int64(50), np.int64(51), np.int64

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_4_bs128_checkpoint.pth

Round 5/300
Selected Clients: [67 45 78 84  7 98 50  4  6 85]
Avg Client Loss: 4.7174 | Avg Client Accuracy: 7.04%
Evaluation Loss:5.1501 | Val Accuracy: 6.32%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_4_bs128_checkpoint.pth

Round 10/300
Selected Clients: [47 60 86 42 76 66 21 39 51 68]
Avg Client Loss: 4.0460 | Avg Client Accuracy: 13.19%
Evaluation Loss:4.3605 | Val Accuracy: 13.28%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_4_bs

metric,history
client_avg_accuracy,▁▃▄▄▅▅▅▆▆▆▆▆▇▆▅▆▇▇▇▇▇▇▇▇▇█▇▇▇█▇▇██▇███▇▇
client_avg_loss,█▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▂
round,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▂▃▄▄▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███▇█████████▇
server_val_loss,█▆▅▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁

metric,value
client_avg_accuracy,40.89832
client_avg_loss,2.23361
round,299.0
server_val_accuracy,42.87
server_val_loss,2.29384



--- Starting training for num_local_steps = 8 | num_rounds = 150 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_8_bs128_checkpoint.pth

Round 5/150
Selected Clients: [53 91 78 39 87 64 69 54 68 12]
Avg Client Loss: 3.1406 | Avg Client Accuracy: 26.82%
Evaluation Loss:3.9320 | Val Accuracy: 18.75%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_8_bs128_checkpoint.pth

Round 10/150
Selected Clients: [28  0 95 63 84 38 68 67 80 93]
Avg Client Loss: 2.5455 | Avg Client Accuracy: 35.51%
Evaluation Loss:3.2321 | Val Accuracy: 27.40%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_8_

metric,history
client_avg_accuracy,▁▃▃▅▅▅▆▄▆▅▆▅▇▇▆▇▇█▆▅▇▆▇▇▇▆█▇▇█
client_avg_loss,█▅▇▃▃▄▂▇▂▅▂▄▂▁▅▂▃▁▃▇▂▄▁▁▁▅▁▃▂▁
round,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy,▁▃▃▅▆▅▆▄▆▄▆▄█▇▇█▇▇▇▅▆▇███▇█▇▇█
server_val_loss,█▅▆▃▃▅▂█▃▇▃█▁▁▂▁▂▁▂▆▄▃▁▁▁▂▁▂▂▁

metric,value
client_avg_accuracy,55.53366
client_avg_loss,1.68284
round,149.0
server_val_accuracy,43.89
server_val_loss,2.36336



--- Starting training for num_local_steps = 16 | num_rounds = 75 ---



Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.
No checkpoint found. Starting from scratch.
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_16_bs128_checkpoint.pth

Round 5/75
Selected Clients: [ 1 46 92 93 90 96 26 18  9 55]
Avg Client Loss: 2.3218 | Avg Client Accuracy: 51.60%
Evaluation Loss:4.3377 | Val Accuracy: 22.89%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_16_bs128_checkpoint.pth

Round 10/75
Selected Clients: [13 19 38 73 84 16 39 34 96 77]
Avg Client Loss: 1.6871 | Avg Client Accuracy: 59.45%
Evaluation Loss:3.8776 | Val Accuracy: 27.95%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL_NON_IID(50)_bs128/dino_vits_16_non_iid(50)_local_steps_16

metric,history
client_avg_accuracy,▁▄▄▆▇▄█▆▇▇▇██▆▇
client_avg_loss,▅▃▅▂▁█▁▂▂▂▁▁▂▃▃
round,▁▁▂▃▃▃▄▅▅▅▆▇▇▇█
server_val_accuracy,▁▃▂▄▆▅▇▄█▇▇█▇▄▇
server_val_loss,▅▄█▃▂▄▁▅▁▂▂▁▂▅▂

metric,value
client_avg_accuracy,65.22327
client_avg_loss,1.72939
round,74.0
server_val_accuracy,38.97
server_val_loss,3.01372
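
The run summaries above can be compared side by side without opening the W&B dashboard. The sketch below collects the final `server_val_accuracy` values reported in this section and picks the best number of local steps per setting. The dictionary layout and the helper `best_local_steps` are assumptions made for illustration; the non-IID(5) run with 4 local steps is omitted because its summary does not appear in this section.

```python
# (classes_per_client, num_local_steps) -> final server validation accuracy (%),
# copied from the run summaries printed in this section.
final_val_acc = {
    (5, 8): 29.62, (5, 16): 27.39,
    (10, 4): 40.82, (10, 8): 35.28, (10, 16): 26.41,
    (50, 4): 42.87, (50, 8): 43.89, (50, 16): 38.97,
}


def best_local_steps(results):
    """Return {classes_per_client: (best_local_steps, accuracy)}."""
    best = {}
    for (classes, steps), acc in results.items():
        # Keep the entry with the highest accuracy for each setting.
        if classes not in best or acc > best[classes][1]:
            best[classes] = (steps, acc)
    return best


best = best_local_steps(final_val_acc)
for classes in sorted(best):
    steps, acc = best[classes]
    print(f"non-IID({classes}): best num_local_steps = {steps} "
          f"(val accuracy {acc:.2f}%)")
```

These are single-run, last-round values, so small gaps (for example 42.87% versus 43.89% for non-IID(50)) may fall within run-to-run noise.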
