In [1]:
import numpy as np
import wandb
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
import shutil
import os                              # Import the 'os' module for changing directories
os.chdir('/content/drive/MyDrive/FL')  # Change the directory

Mounted at /content/drive


In [2]:
import torch
import torch.optim as optim
import torch.nn as nn
import torchvision
from torchvision import transforms
from torchvision.datasets import CIFAR100
from torch.utils.data import Subset, DataLoader, random_split

from FederatedLearningProject.data.cifar100_loader import get_cifar100
import FederatedLearningProject.checkpoints.checkpointing as checkpointing
from FederatedLearningProject.training.FL_training import train_server
from FederatedLearningProject.experiments import models

In [3]:
wandb.login()  # Prompts for your W&B API key if you are not already logged in.

wandb: Currently logged in as: depetrofabio (depetrofabio-politecnico-di-torino) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [4]:
# Import CIFAR100 dataset: train_set, val_set, test_set
# The transforms are applied before returning the dataset (in the module)

valid_split_perc = 0.2    # of the 50000 training data
train_set, val_set, test_set = get_cifar100(valid_split_perc)
# batch_size is a hyperparameter (64, 128, ...), as is num_workers (2 or 4 recommended on Colab)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False, num_workers=2)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False, num_workers=2)

Number of images in Training Set:   40000
Number of images in Validation Set: 10000
Number of images in Test Set:       10000
✅ Datasets loaded successfully


In [5]:
model = models.FlexibleDino(num_layers_to_freeze=12)  # number of backbone blocks to freeze

Using cache found in /root/.cache/torch/hub/facebookresearch_dino_main


In [6]:
model.debug()


--- Debugging Model ---
Model is primarily on device: cpu
Model overall mode: Train

Parameter Details (Name | Device | Requires Grad? | Inferred Block | Module Mode):
- backbone.cls_token                                 | cpu        | False           | N/A             | Train
- backbone.pos_embed                                 | cpu        | False           | N/A             | Train
- backbone.patch_embed.proj.weight                   | cpu        | False           | N/A             | Train
- backbone.patch_embed.proj.bias                     | cpu        | False           | N/A             | Train
- backbone.blocks.0.norm1.weight                     | cpu        | False           | Block 0         | Eval
- backbone.blocks.0.norm1.bias                       | cpu        | False           | Block 0         | Eval
- backbone.blocks.0.attn.qkv.weight                  | cpu        | False           | Block 0         | Eval
- backbone.blocks.0.attn.qkv.bias                    | cpu      

In [7]:
model.to_cuda()

moving model to cuda


## FedAvg Hyperparameters


The Federated Averaging (FedAvg) algorithm involves several important hyperparameters that influence the performance and efficiency of training. Below is a detailed description of each:

- **`num_clients (K)`**  
  Total number of clients (or devices) participating in the federated learning system.  
  Example: `num_clients = 100`

- **`fraction (C)`**  
  The fraction of clients selected to participate in each communication round. Must be a float between 0 and 1.  
  Example: `fraction = 0.1` means 10% of clients are selected in each round.

- **`local_epochs (E)`**  
  Number of local training epochs each selected client performs before sending updates back to the server.  
  Example: `local_epochs = 5`.  
  Note: the FedAvg pseudocode calls this quantity E, while the course assignment PDF calls it **J**.

- **`num_rounds`**  
  Total number of communication rounds (or global iterations) the server runs to aggregate updates and refine the global model.  
  Example: `num_rounds = 100`. The value is ours to choose, based on convergence and the time/compute budget.

- **Additional Notes:**  
  These hyperparameters directly affect convergence speed, communication cost, and model performance.  
  - A smaller `C` reduces communication overhead but may slow convergence.  
  - A larger `E` can improve local model performance but may lead to model divergence if clients’ data distributions are highly non-IID.
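
To make the roles of `K` and `C` concrete, the per-round client selection can be sketched as below. This is a minimal illustration, not the project's `train_server` code: the helper name `sample_clients` and the fixed seed are assumptions introduced for the example. It selects `max(round(C * K), 1)` distinct clients uniformly at random.

```python
import random


def sample_clients(num_clients: int, fraction: float, rng: random.Random) -> list[int]:
    """Pick max(round(C * K), 1) distinct client indices for one round.

    Hypothetical helper for illustration; the project's own sampling may differ.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # At least one client always participates, even for a very small fraction.
    num_selected = max(int(round(fraction * num_clients)), 1)
    return sorted(rng.sample(range(num_clients), num_selected))


rng = random.Random(0)  # fixed seed, only to make the example reproducible
selected = sample_clients(num_clients=100, fraction=0.1, rng=rng)
print(len(selected))  # 10 clients take part in this round
```

With `K = 100` and `C = 0.1`, ten clients train locally in each round and only their updates are sent back to the server.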


The first FL baseline
Implement the algorithm described in [10], fix K=100, C=0.1, adopt an iid sharding of the training set and fix J=4 the number of local steps. Run FedAvg on CIFAR-100 for a proper number of rounds (up to you to define, based on convergence and time/compute budget).


# Federated Learning Baseline on CIFAR-100 using FedAvg

In this experiment, we aim to implement and evaluate the Federated Averaging (FedAvg) algorithm as described in McMahan et al. [10], using a controlled setup on the CIFAR-100 dataset. This serves as a **baseline FL experiment** with standard hyperparameters for further comparative studies.

## 📌 Objectives

- Implement the **FedAvg** algorithm.
- Use **IID sharding** of CIFAR-100 to simulate a federated setting.
- Fix key FL hyperparameters:
  - Number of clients (**K**) = 100
  - Fraction of participating clients per round (**C**) = 0.1
  - Number of local update steps (**J**) = 4
- Evaluate performance over a suitable number of communication rounds.
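
The aggregation step at the heart of FedAvg can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's `train_server` implementation: the function name `fedavg_aggregate` and the representation of parameters as plain lists of floats are hypothetical and chosen for brevity. Each client's parameters are weighted by its share of the training samples, n_k / n, as in McMahan et al. [10].

```python
def fedavg_aggregate(client_weights, client_sizes):
    """Return the sample-weighted average of the clients' parameters.

    client_weights: list of dicts mapping parameter name -> list of floats
    client_sizes:   number of local training samples n_k for each client
    (Hypothetical data layout, used only for this sketch.)
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    averaged = {}
    for name, reference in client_weights[0].items():
        averaged[name] = [
            # Weight each client's value by n_k / n and sum over clients.
            sum(w[name][i] * n / total for w, n in zip(client_weights, client_sizes))
            for i in range(len(reference))
        ]
    return averaged


clients = [{"w": [1.0, 2.0]}, {"w": [5.0, 6.0]}]
global_weights = fedavg_aggregate(clients, client_sizes=[100, 300])
print(global_weights)  # the second client counts three times as much as the first
```

With IID shards of equal size, the weights n_k / n are all identical and the aggregation reduces to a plain mean over the selected clients.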

## ⚙️ Experiment Configuration

| Parameter                   | Value                                      |
|-----------------------------|--------------------------------------------|
| Dataset                     | CIFAR-100                                  |
| Model                       | DINO (TBD)                                 |
| Total Clients (K)           | 100                                        |
| Participation Fraction (C)  | 0.1 (10 clients/round)                     |
| Local Steps (J)             | 4                                          |
| Sharding Type               | IID                                        |
| Rounds                      | *TBD based on convergence (e.g., 100–500)* |
| Optimizer                   | SGD                                        |
| Learning Rate               | *To be tuned*                              |

## 📊 Notes

- **IID sharding** means the training data will be equally and randomly split among clients to avoid any data heterogeneity.
- The number of rounds will be chosen based on **observed convergence behavior** and practical time/compute budget constraints.
- Performance will be tracked using **validation/test accuracy** and **loss** over communication rounds.
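
As a rough sketch of what IID sharding means in practice, the snippet below shuffles the sample indices once and deals them out evenly. The helper `iid_shards` is hypothetical and deliberately simpler than the project's `create_iid_splits`, whose logged output suggests it splits each class separately across clients.

```python
import random


def iid_shards(num_samples: int, num_clients: int, seed: int = 0) -> list[list[int]]:
    """Shuffle all sample indices and deal them out round-robin to the clients.

    Hypothetical helper: a uniform shuffle, not the per-class split used in the project.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Client k receives every num_clients-th index, starting at position k.
    return [indices[k::num_clients] for k in range(num_clients)]


shards = iid_shards(num_samples=40000, num_clients=100)
print(len(shards), len(shards[0]))  # 100 shards of 400 indices each
```

Each index list could then be wrapped in `torch.utils.data.Subset(train_set, shard)` to obtain one dataset per client.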

## 🧠 Why this setup?

This configuration serves as a **standard benchmark** for future comparison with more advanced techniques (e.g., personalization, non-IID setups, compression, or asynchronous training). By fixing `K`, `C`, and `J`, and using IID data, we create a controlled environment to evaluate the basic performance of FedAvg.

## 📚 Reference

[10] McMahan, Brendan, et al. *Communication-Efficient Learning of Deep Networks from Decentralized Data.* AISTATS 2017.


In [8]:
# --- OPTIMIZER AND LOSS FUNCTION ---
learning_rate = 0.001
momentum = 0.9
weight_decay = 5e-5
# Open question: over how many rounds should the scheduler's T_max span?

num_rounds_scheduler = 10
num_clients = 100
num_rounds = 500

# Default hyperparameters for FedAvg
local_epochs = 4
fraction = 0.1
from torch.optim.lr_scheduler import CosineAnnealingLR

"""
# Example for differential learning rates:
optimizer = torch.optim.AdamW([
    {'params': model.backbone.blocks[9:].parameters(), 'lr': 1e-5}, # Adjust block indices if needed
    # You might also want to fine-tune backbone.norm if it exists and is not frozen
    # {'params': model.backbone.norm.parameters(), 'lr': 1e-5},
    {'params': model.classifier.parameters(), 'lr': 1e-4}
], weight_decay=0.05) # example weight decay
"""
# optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum, weight_decay=weight_decay)
# Example optimizer instantiation:
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=learning_rate, # Example LR
    weight_decay=weight_decay,
    momentum=momentum
)
scheduler = CosineAnnealingLR(optimizer, T_max=num_rounds_scheduler)
criterion = nn.CrossEntropyLoss()
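
To help reason about the choice of `T_max`, the sketch below evaluates the closed-form cosine annealing schedule given in the PyTorch documentation for `CosineAnnealingLR`, in pure Python. The helper name `cosine_annealing_lr` is hypothetical, and the sketch assumes the scheduler is stepped once per communication round, which depends on how `train_server` uses it.

```python
import math


def cosine_annealing_lr(step: int, base_lr: float, t_max: int, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: eta_min + (base_lr - eta_min) * (1 + cos(pi * step / t_max)) / 2."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


# Learning rate at a few rounds for base_lr=0.001 and T_max=10 (values from the cell above).
for step in (0, 5, 10, 20):
    print(step, cosine_annealing_lr(step, base_lr=0.001, t_max=10))
```

Under this formula the learning rate reaches its minimum at round `T_max` and climbs back to the base value at round `2 * T_max`. With `T_max = 10` and 500 rounds, the schedule would therefore oscillate many times; setting `T_max` to the total number of rounds gives a single monotone decay instead.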

In [9]:

# wandb.init() prepares the tracking of hyperparameters/metrics for later recording performance using wandb.log()

model_name = "dino_vits16_provaFL"
project_name = "FederatedProject"
run_name = f"{model_name}_run"

# INITIALIZE W&B
wandb.init(
    project=project_name,
    name=run_name,
    config={
        "model": model_name,
        "num_rounds": num_rounds,
        "batch_size": train_loader.batch_size,
        "learning_rate": optimizer.param_groups[0]['lr'],
        "architecture": model.__class__.__name__,
})

# Copy your config
config = wandb.config


In [10]:
# CHECKPOINT PATH

checkpoint_dir = "/content/drive/MyDrive/FL/FederatedLearningProject/checkpoints/FL"
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, f"{model_name}_checkpointPROVA.pth")    # we predefine the name of the file inside the specified folder (dir)


In [11]:
# RECOVER CHECKPOINT
start_round, model_data = checkpointing.load_checkpoint_fedavg(model, optimizer, checkpoint_dir, model_name="try")


try:
    print()
    print(f"The 'model_data' dictionary contains the following keys: {list(model_data.keys())}")
    model.load_state_dict(model_data["model_state_dict"])
    optimizer.load_state_dict(model_data["optimizer_state_dict"])
except Exception:
    # No usable checkpoint was returned: keep the freshly initialized model and optimizer.
    pass

 Nessun checkpoint trovato, inizio da round 1.



In [12]:
from FederatedLearningProject.data.cifar100_loader import create_iid_splits
client_dataset = create_iid_splits(train_set, num_clients=num_clients)

Dataset has 40000 samples across 100 classes.
Creating 100 IID splits with 100 classes each.


Each of the 100 classes split into 100 shards.

Checking unique classes that each client sees:
Client 0 has samples from classes: {np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12), np.int64(13), np.int64(14), np.int64(15), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(21), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(26), np.int64(27), np.int64(28), np.int64(29), np.int64(30), np.int64(31), np.int64(32), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(37), np.int64(38), np.int64(39), np.int64(40), np.int64(41), np.int64(42), np.int64(43), np.int64(44), np.int64(45), np.int64(46), np.int64(47), np.int64(48), np.int64(49), np.int64(50), np.int64(51), np.int64(52), np.int64(53), np.int64(54), np.int64(55), 

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
train_server(
    model=model,
    num_clients=num_clients,
    num_client_steps=4,
    num_rounds=num_rounds,
    client_dataset=client_dataset,
    frac=fraction,
    optimizer=optimizer,
    scheduler=scheduler,
    device=device,
    n_rounds_log=10,
    val_loader=val_loader,
    criterion=criterion,
)

In [None]:
model.debug()