In [3]:
import numpy as np
import wandb
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
import shutil
import os                              # Import the 'os' module for changing directories
os.chdir('/content/drive/MyDrive/FL')  # Change the directory

Mounted at /content/drive


In [4]:
import torch
import torch.optim as optim
import torch.nn as nn
import torchvision
from torchvision import transforms
from torchvision.datasets import CIFAR100
from torch.utils.data import Subset, DataLoader, random_split

from FederatedLearningProject.data.cifar100_loader import get_cifar100
import FederatedLearningProject.checkpoints.checkpointing as checkpointing
from FederatedLearningProject.training.FL_training import train_server
from FederatedLearningProject.experiments import models

In [5]:
wandb.login()  # Prompts for your W&B API key to log in to the wandb service.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: depetrofabio (depetrofabio-politecnico-di-torino) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [9]:
# Import CIFAR100 dataset: train_set, val_set, test_set
# The transforms are applied before returning the dataset (in the module)

valid_split_perc = 0.2  # of the 50000 training data
train_set, val_set, test_set = get_cifar100(valid_split_perc=valid_split_perc)
# batch_size is a hyperparameter (64, 128, ...), as is num_workers (2 or 4 is recommended on Colab)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False, num_workers=2)  # evaluation does not need shuffling
test_loader = DataLoader(test_set, batch_size=64, shuffle=False, num_workers=2)

Number of images in Training Set:   40000
Number of images in Validation Set: 10000
Number of images in Test Set:       10000
✅ Datasets loaded successfully


In [10]:
model = models.LinearFlexibleDino(num_layers_to_freeze=12)  # freeze 12 backbone layers

Downloading: "https://github.com/facebookresearch/dino/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://dl.fbaipublicfiles.com/dino/dino_deitsmall16_pretrain/dino_deitsmall16_pretrain.pth" to /root/.cache/torch/hub/checkpoints/dino_deitsmall16_pretrain.pth
100%|██████████| 82.7M/82.7M [00:00<00:00, 293MB/s]


In [None]:
model.debug()

In [None]:
model.to_cuda()

moving model to cuda


## FedAvg Hyperparameters


The Federated Averaging (FedAvg) algorithm involves several important hyperparameters that influence the performance and efficiency of training. Below is a detailed description of each:

- **`num_clients (K)`**  
  Total number of clients (or devices) participating in the federated learning system.  
  Example: `num_clients = 100`

- **`fraction (C)`**  
  The fraction of clients selected to participate in each communication round. Must be a float between 0 and 1.  
  Example: `fraction = 0.1` means 10% of clients are selected in each round.

- **`local_epochs (E)`**  
  Number of local training epochs each selected client performs before sending updates back to the server.  
  Example: `local_epochs = 5`.
  Note: the FedAvg pseudocode calls this quantity **E**, whereas the assignment PDF calls it **J** and defines it as the number of local steps.

- **`num_rounds`**  
  Total number of communication rounds (or global iterations) the server runs to aggregate updates and refine the global model.  
  📌 *Example:* `num_rounds = 100`. We choose this value based on convergence and the time/compute budget.

- **Additional Notes:**  
  These hyperparameters directly affect convergence speed, communication cost, and model performance.  
  - A smaller `C` reduces communication overhead but may slow convergence.  
  - A larger `E` can improve local model performance but may lead to model divergence if clients’ data distributions are highly non-IID.
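
As a minimal sketch of how `K`, `C`, and the aggregation step interact, the snippet below samples `max(int(C * K), 1)` clients and averages their parameters weighted by local dataset size, which is the FedAvg aggregation rule. It is not the project's `train_server`: `sample_clients` and `fedavg_aggregate` are hypothetical names, and plain Python lists stand in for model tensors.

```python
import random


def sample_clients(num_clients, fraction, rng):
    """Pick m = max(int(C * K), 1) distinct client ids for one round."""
    m = max(int(fraction * num_clients), 1)
    return rng.sample(range(num_clients), m)


def fedavg_aggregate(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(num_params)
    ]


rng = random.Random(0)
selected = sample_clients(num_clients=100, fraction=0.1, rng=rng)
print(len(selected))  # 10 clients per round

# Two toy clients holding 100 and 300 samples (illustrative values).
new_global = fedavg_aggregate([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(new_global)  # [2.5, 5.0]
```

With IID shards of equal size, the weights are all equal and the rule reduces to a plain mean of the client models.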


The first FL baseline
Implement the algorithm described in [10], fix K=100 and C=0.1, adopt an IID sharding of the training set, and fix the number of local steps to J=4. Run FedAvg on CIFAR-100 for a proper number of rounds (up to you to define, based on convergence and time/compute budget).


# Federated Learning Baseline on CIFAR-100 using FedAvg

In this experiment, we aim to implement and evaluate the Federated Averaging (FedAvg) algorithm as described in McMahan et al. [10], using a controlled setup on the CIFAR-100 dataset. This serves as a **baseline FL experiment** with standard hyperparameters for further comparative studies.

## 📌 Objectives

- Implement the **FedAvg** algorithm.
- Use **IID sharding** of CIFAR-100 to simulate a federated setting.
- Fix key FL hyperparameters:
  - Number of clients (**K**) = 100
  - Fraction of participating clients per round (**C**) = 0.1
  - Number of local update steps (**J**) = 4
- Evaluate performance over a suitable number of communication rounds.

## ⚙️ Experiment Configuration

| Parameter                   | Value                                        |
|-----------------------------|----------------------------------------------|
| Dataset                     | CIFAR-100                                    |
| Model                       | DINO ViT-S/16                                |
| Total Clients (K)           | 100                                          |
| Participation Fraction (C)  | 0.1 (10 clients/round)                       |
| Local Steps (J)             | 4                                            |
| Sharding Type               | IID                                          |
| Rounds                      | *TBD based on convergence (e.g., 100–500)*   |
| Optimizer                   | SGD (momentum 0.9, weight decay 0.0001)      |
| Learning Rate               | 0.01 (best value from the centralized runs)  |
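
To make the table concrete, the following back-of-the-envelope calculation derives the per-round workload from values that appear in this notebook (40,000 training images, batch size 64). It assumes that one local step is one mini-batch update, which the notebook does not state explicitly.

```python
num_train_images = 40_000
num_clients = 100      # K
fraction = 0.1         # C
num_local_steps = 4    # J
batch_size = 64

clients_per_round = max(int(fraction * num_clients), 1)
samples_per_client = num_train_images // num_clients
# Assumption: each local step consumes exactly one mini-batch.
samples_seen_per_client_round = num_local_steps * batch_size
updates_per_round = clients_per_round * num_local_steps

print(clients_per_round)              # 10
print(samples_per_client)             # 400
print(samples_seen_per_client_round)  # 256
print(updates_per_round)              # 40
```

Under that assumption, a selected client touches at most 256 of its 400 local images per round, so J=4 amounts to less than one local epoch.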

## 📊 Notes

- **IID sharding** means the training data will be equally and randomly split among clients to avoid any data heterogeneity.
- The number of rounds will be chosen based on **observed convergence behavior** and practical time/compute budget constraints.
- Performance will be tracked using **validation/test accuracy** and **loss** over communication rounds.
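
A minimal sketch of IID sharding on sample indices, assuming equal-sized shards and a fixed seed. `iid_shard_indices` is a hypothetical helper for illustration only; it is not the project's `create_iid_splits`, whose implementation is not shown in this notebook.

```python
import random


def iid_shard_indices(num_samples, num_clients, seed=0):
    """Shuffle sample indices and deal them into equal-sized client shards."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    shard_size = num_samples // num_clients  # any leftover samples are dropped
    return [
        indices[k * shard_size:(k + 1) * shard_size]
        for k in range(num_clients)
    ]


shards = iid_shard_indices(num_samples=40_000, num_clients=100)
print(len(shards), len(shards[0]))  # 100 400
```

With PyTorch, each index list could then be wrapped in `torch.utils.data.Subset(train_set, shard)` to obtain one dataset per client.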

## 🧠 Why this setup?

This configuration serves as a **standard benchmark** for future comparison with more advanced techniques (e.g., personalization, non-IID setups, compression, or asynchronous training). By fixing `K`, `C`, and `J`, and using IID data, we create a controlled environment to evaluate the basic performance of FedAvg.

## 📚 Reference

[10] McMahan, Brendan, et al. *Communication-Efficient Learning of Deep Networks from Decentralized Data.* AISTATS 2017.


In [18]:
# --- OPTIMIZER AND LOSS FUNCTION ---
from torch.optim.lr_scheduler import CosineAnnealingLR  # available for LR scheduling; not used in this cell

# SGD hyperparameters: best values found in the centralized experiments
learning_rate = 0.01
momentum = 0.9
weight_decay = 0.0001

# FedAvg hyperparameters
num_clients = 100     # K
fraction = 0.1        # C
num_local_steps = 4   # J (default); 8 and 16 are the alternatives to test

# Alternative, not used here: AdamW with differential learning rates.
# optimizer = torch.optim.AdamW([
#     {'params': model.backbone.blocks[9:].parameters(), 'lr': 1e-5},  # adjust block indices if needed
#     # Optionally fine-tune backbone.norm if it exists and is not frozen:
#     # {'params': model.backbone.norm.parameters(), 'lr': 1e-5},
#     {'params': model.classifier.parameters(), 'lr': 1e-4},
# ], weight_decay=0.05)  # example weight decay

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=learning_rate,
    momentum=momentum,
    weight_decay=weight_decay,
)
criterion = nn.CrossEntropyLoss()

In [13]:

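# wandb.init() sets up tracking of hyperparameters and metrics; wandb.log() records them later.

# These names must be defined before wandb.init() is called; otherwise the cell
# fails with "NameError: name 'num_rounds' is not defined".
project_name = "FederatedProject"
model_name = "dino_vits16_J4"
num_rounds = 200  # assumption: placeholder value; the sweep cell further down tries 200 to 600
run_name = f"{model_name}_rounds_{num_rounds}"

# INITIALIZE W&B
wandb.init(
    project=project_name,
    name=run_name,
    config={
        "model": model_name,
        "num_rounds": num_rounds,
        "batch_size": train_loader.batch_size,
        "learning_rate": optimizer.param_groups[0]['lr'],
        "architecture": model.__class__.__name__,
    },
)

# Keep a handle on the run configuration
config = wandb.config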

In [None]:
# CHECKPOINT PATH

checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/TestFinaliSingleModel"
os.makedirs(checkpoint_dir, exist_ok=True)
# Checkpoint file name inside checkpoint_dir
checkpoint_path = os.path.join(checkpoint_dir, f"{model_name}_checkpoint.pth")


In [None]:
# RECOVER CHECKPOINT
start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=model_name)


try:
    print()
    print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
    model.load_state_dict(model_data["model_state_dict"])
    optimizer.load_state_dict(model_data["optimizer_state_dict"])
except Exception:
    # No usable checkpoint: keep the freshly initialized model and optimizer
    pass

No checkpoint found, starting from round 1.



In [None]:
from FederatedLearningProject.data.cifar100_loader import create_iid_splits
client_dataset = create_iid_splits(train_set, num_clients=num_clients)

In [22]:
### --- LOOP FOR FINDING THE OPTIMAL NUM_ROUNDS --- ###
device = "cuda"
model_name = "dino_vits16_J4"
project_name = "FederatedProject"
num_rounds_iterations = [200,300,400,500,600]
# Loop through different num_rounds values
for num_rounds_val in num_rounds_iterations:
    # Re-initialize the model for each run to ensure fresh weights (or load a pre-trained one)
    # If you want to start from the exact same initial state for each run,
    # you might want to save the initial state_dict and load it here.
    model = models.LinearFlexibleDino(num_layers_to_freeze=12) # Initialize your actual model here
    model.to_cuda()

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
        momentum=momentum
    )

    # Generate a unique run name for each iteration
    run_name = f"{model_name}_rounds_{num_rounds_val}"

    # INITIALIZE W&B for each new run
    wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model": model_name,
            "num_rounds": num_rounds_val, # Use the current num_rounds_val
            "batch_size": train_loader.batch_size,  # log the training batch size
            "learning_rate": optimizer.param_groups[0]['lr'],
            "weight_decay": weight_decay,
            "momentum": momentum,
            "architecture": model.__class__.__name__,
            "num_local_steps": num_local_steps,
            "fraction_clients": fraction,
        },
        reinit=True # Important: Allows re-initialization of wandb in a loop
    )

    # Copy your config
    config = wandb.config

    # CHECKPOINT PATH
    checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/"
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Make checkpoint path unique to the run if you want to store separate checkpoints
    checkpoint_path = os.path.join(checkpoint_dir, f"{model_name}_rounds_{num_rounds_val}_checkpoint.pth")

    # RECOVER CHECKPOINT (This part remains, but remember it will look for a checkpoint
    # specific to the current `model_name` and `num_rounds_val` if you've made the path unique.)
    start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name=f"{model_name}_rounds_{num_rounds_val}")

    try:
        print()
        print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
        model.load_state_dict(model_data["model_state_dict"])
        optimizer.load_state_dict(model_data["optimizer_state_dict"])
    except Exception:
        print("Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.")

    print(f"\n--- Starting training for num_rounds = {num_rounds_val} ---")
    train_server(model=model,
                 num_clients=num_clients,
                 num_client_steps=num_local_steps,
                 num_rounds=num_rounds_val, # Pass the current num_rounds_val
                 client_dataset=client_dataset,
                 frac=fraction,
                 optimizer=optimizer,
                 device=device,
                 batch_size=64,
                 n_rounds_log=10,
                 val_loader=val_loader,
                 criterion=criterion,
                 checkpoint_path=checkpoint_path,
                 model_name=f"{model_name}_rounds_{num_rounds_val}") # Ensure model_name is unique for checkpointing

    # End the current wandb run before starting the next one
    wandb.finish()

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


W&B run history:
client_avg_accuracy  ▁
client_avg_loss      ▁
round                ▁
server_val_accuracy  ▁
server_val_loss      ▁

W&B run summary:
client_avg_accuracy  27.00583
client_avg_loss      3.1616
round                9.0
server_val_accuracy  29.12
server_val_loss      3.13042


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 200 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_200_checkpoint.pth

Round 10/200
Selected Clients: [92 33 75 21 41 36  0 46 88 43]
Avg Client Loss: 4.2218 | Avg Client Accuracy: 13.20%
Evaluation Loss:4.2730 | Val Accuracy: 13.23%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_200_checkpoint.pth

Round 20/200
Selected Clients: [84 24 22 67 94 16 76 49 50 43]
Avg Client Loss: 3.5273 | Avg Client Accuracy: 20.00%
Evaluation Loss:3.4841 | Val Accuracy: 22.70%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

W&B run history:
client_avg_accuracy  ▁▃▄▅▅▆▆▆▇▇▇▇▇▇▇▇█▇██
client_avg_loss      █▆▄▃▃▃▃▂▂▂▂▂▁▂▂▂▁▂▁▁
round                ▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
server_val_accuracy  ▁▃▄▅▆▆▆▇▇▇▇▇▇▇██████
server_val_loss      █▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁

W&B run summary:
client_avg_accuracy  41.60156
client_avg_loss      2.32964
round                199.0
server_val_accuracy  43.68
server_val_loss      2.26143


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 300 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_300_checkpoint.pth

Round 10/300
Selected Clients: [95 43  9 11 26 76 53 14 89 47]
Avg Client Loss: 4.2283 | Avg Client Accuracy: 12.42%
Evaluation Loss:4.2434 | Val Accuracy: 13.98%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_300_checkpoint.pth

Round 20/300
Selected Clients: [36 82 93 41 81 30 55 31 21 11]
Avg Client Loss: 3.4035 | Avg Client Accuracy: 21.29%
Evaluation Loss:3.4683 | Val Accuracy: 23.40%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

W&B run history:
client_avg_accuracy  ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇█▇███████
client_avg_loss      █▅▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
round                ▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
server_val_accuracy  ▁▃▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇█████████████
server_val_loss      █▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

W&B run summary:
client_avg_accuracy  41.875
client_avg_loss      2.21978
round                299.0
server_val_accuracy  44.82
server_val_loss      2.18801


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 400 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_400_checkpoint.pth

Round 10/400
Selected Clients: [74 97 19 23 25  4 41 57  6 15]
Avg Client Loss: 4.2501 | Avg Client Accuracy: 12.19%
Evaluation Loss:4.2911 | Val Accuracy: 14.38%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_400_checkpoint.pth

Round 20/400
Selected Clients: [54 87 61 92 10 26 53  7 75 73]
Avg Client Loss: 3.5031 | Avg Client Accuracy: 20.27%
Evaluation Loss:3.5074 | Val Accuracy: 22.68%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

W&B run history:
client_avg_accuracy  ▁▃▄▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇▇███▇█▇█████
client_avg_loss      █▆▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▁▂▁▁▁▁▁▁▁
round                ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
server_val_accuracy  ▁▃▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇████████████████████
server_val_loss      █▅▄▄▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

W&B run summary:
client_avg_accuracy  43.94531
client_avg_loss      2.13777
round                399.0
server_val_accuracy  45.44
server_val_loss      2.17429


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 500 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_500_checkpoint.pth

Round 10/500
Selected Clients: [68 72 46 58 23  1 47 97 64 41]
Avg Client Loss: 4.2638 | Avg Client Accuracy: 11.60%
Evaluation Loss:4.2506 | Val Accuracy: 14.67%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_500_checkpoint.pth

Round 20/500
Selected Clients: [29 69 11  5 44 25 41 93  2 13]
Avg Client Loss: 3.5425 | Avg Client Accuracy: 20.27%
Evaluation Loss:3.4281 | Val Accuracy: 23.81%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

W&B run history:
client_avg_accuracy  ▁▃▄▄▅▆▆▆▆▇▆▇▇▇▇▇▇▇▇██▇█▇████████████████
client_avg_loss      █▆▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
round                ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
server_val_accuracy  ▁▃▄▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇███████████████████
server_val_loss      █▅▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

W&B run summary:
client_avg_accuracy  44.72656
client_avg_loss      2.10548
round                499.0
server_val_accuracy  46.3
server_val_loss      2.14952


Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


moving model to cuda


No checkpoint found, starting from round 1.

Could not load model or optimizer state dictionary. Starting from scratch or with default initialization.

--- Starting training for num_rounds = 600 ---
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_600_checkpoint.pth

Round 10/600
Selected Clients: [66 46 88 10 65 64 29 48 94 21]
Avg Client Loss: 4.2258 | Avg Client Accuracy: 11.68%
Evaluation Loss:4.2160 | Val Accuracy: 14.51%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_rounds_600_checkpoint.pth

Round 20/600
Selected Clients: [44 80 68 89 75 83  5 47 45 57]
Avg Client Loss: 3.5220 | Avg Client Accuracy: 19.53%
Evaluation Loss:3.4553 | Val Accuracy: 23.88%
--------------------------------------------------
Checkpoint saved to: /content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL/dino_vits16_J4_roun

W&B run history:
client_avg_accuracy  ▁▃▄▅▆▇▆▇▇▇▇▇▇▇▇▇▇▇█▇█▇██▇▇█████▇████████
client_avg_loss      █▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁
round                ▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇████
server_val_accuracy  ▁▃▄▅▆▆▇▇▇▇▇▇▇▇▇▇████████████████████████
server_val_loss      █▇▆▅▄▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

W&B run summary:
client_avg_accuracy  43.71094
client_avg_loss      2.13101
round                599.0
server_val_accuracy  46.42
server_val_loss      2.15125


In [None]:
num_optimal_rounds = 300
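
One way to read the sweep is to compare the final `server_val_accuracy` of each run, as logged in the run summaries above, and look at the gain from each additional 100 rounds. The snippet only re-tabulates those logged numbers. Every run starts from scratch with different client sampling, so differences of a few tenths of a point may be noise.

```python
# Final server_val_accuracy (%) per run, copied from the run summaries logged above.
final_val_acc = {200: 43.68, 300: 44.82, 400: 45.44, 500: 46.30, 600: 46.42}

rounds = sorted(final_val_acc)
# Accuracy gain obtained by each extra 100 rounds, keyed by the larger budget.
gains = {
    later: round(final_val_acc[later] - final_val_acc[earlier], 2)
    for earlier, later in zip(rounds, rounds[1:])
}
for r, gain in gains.items():
    print(f"{r - 100} -> {r} rounds: +{gain:.2f} points")
```

The step from 200 to 300 rounds gives the largest single gain (+1.14 points), while doubling the budget from 300 to 600 rounds adds 1.60 points in total, which is consistent with treating 300 as a compute-conscious setting.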

In [None]:
### IID SETTING, TESTING WITH 4,8,16 num_local_steps

num_total_steps = num_optimal_rounds * num_local_steps

### LOOP FOR IID ACCURACIES AND PLOTS



4
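
The unfinished cell above computes `num_total_steps = num_optimal_rounds * num_local_steps`. A plausible reading, stated here as an assumption, is that the total number of local steps is held fixed while `J` varies, so the number of rounds shrinks as `J` grows. A sketch of that schedule:

```python
num_optimal_rounds = 300
base_local_steps = 4
num_total_steps = num_optimal_rounds * base_local_steps  # 1200 local steps in total

# Assumed rule: rounds = total steps / J, keeping the local-step budget constant.
schedule = {j: num_total_steps // j for j in (4, 8, 16)}
print(schedule)  # {4: 300, 8: 150, 16: 75}
```

Each entry could then be passed to `train_server` as `num_client_steps` and `num_rounds`, using the same keyword arguments as in the sweep cell.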