Note: If you use this template in your work, please reference it and cite the Flower paper.
⚠️ This `README` describes how to use the template as-is after installing it with the default `setup.sh` script on a machine running Ubuntu 22.04. Please follow the instructions in `EXTENDED_README.md` for details on more complex environment setups and how to extend this template.
Federated Learning (FL) is a privacy-preserving machine learning paradigm that allows training models directly on local client data using local client resources. This template standardizes the FL research workflow at the Cambridge Machine Learning Systems Lab (CaMLSys), based on three frameworks chosen for their flexibility and ease of use:
- Flower: The FL framework developed by Flower Labs with contributions from CaMLSys members.
- Hydra: The experiment-management framework developed at Meta, which automatically handles experimental configuration for Python.
- Wandb: The MLOps platform developed for handling results storage, experiment tracking, reproducibility, and visualization.
While these tools can be combined in an ad-hoc manner, this template intends to provide a unified and opinionated structure for achieving this while providing functionality that may not have been easily constructed from scratch.
- Automatically handles client configuration for Flower in an opinionated manner using the PyTorch library. This is meant to reduce the task of FL simulation to the mere implementation of standard ML tasks combined with minimal configuration work. Specifically, clients are treated uniformly except for their data, model, and configuration.
- A user only needs to provide:
- A means of generating a model (e.g., a function which returns a PyTorch model) based on a received configuration (e.g., a Dict)
- A means of constructing train and test dataloaders
- A means of offering a configuration to these components
- All data loading or model training is delayed as much as possible to facilitate creating many clients and keeping them in memory with the smallest footprint possible.
- Metric collection and aggregation require no additional implementation.
- Automatically handles logging, saving, and checkpointing, which integrate natively and seamlessly with Wandb and Hydra. This enables sequential re-launches of the same job on clusters using time-limited schedulers.
- Provides deterministic seeded client selection while taking into account the current checkpoint.
- Provides a static means of selecting which ML task to run using Hydra's config system without the drawbacks of the untyped mechanism provided by Hydra.
- By default, it enforces good coding standards by using isort, black, docformatter, ruff, and mypy integrated with pre-commit. Pydantic is also used to validate configuration data for generating models, creating dataloaders, training clients, etc.
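To make the three user-provided pieces concrete, here is a minimal sketch in PyTorch. The function names, config keys, and the in-memory stand-in data are illustrative assumptions, not the template's actual API; a real task would load a client's partition from disk and validate the config with Pydantic.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def get_model(cfg: dict) -> nn.Module:
    # A means of generating a model from a received configuration (here a dict).
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(cfg["input_dim"], cfg["num_classes"]),
    )


def get_dataloaders(cfg: dict) -> tuple[DataLoader, DataLoader]:
    # A means of constructing train and test dataloaders.
    # Stand-in random data; a real task would read this client's partition.
    x = torch.randn(32, cfg["input_dim"])
    y = torch.randint(0, cfg["num_classes"], (32,))
    train = DataLoader(TensorDataset(x, y), batch_size=cfg["batch_size"], shuffle=True)
    test = DataLoader(TensorDataset(x, y), batch_size=cfg["batch_size"])
    return train, test


# A means of offering a configuration to these components.
cfg = {"input_dim": 784, "num_classes": 10, "batch_size": 8}
model = get_model(cfg)
train_loader, _ = get_dataloaders(cfg)
xb, _ = next(iter(train_loader))
print(tuple(model(xb).shape))  # (8, 10)
```

Because both functions are plain callables over a config, they can be handed to many clients lazily: nothing is loaded or built until a client is actually executed.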
The template does not:
- Provide off-the-shelf implementations of FL algorithms, ML tasks, datasets, or models beyond the MNIST example. For such functionality, please refer to the official Flower and PyTorch documentation.
- Provide a means of running experiments on clusters, as this depends on the cluster configuration.
For systems running Ubuntu with CUDA 12, the basic setup has been simplified to one `setup.sh` script using `poetry`, `pyenv`, and `pre-commit`. It only requires limited user input regarding the installation locations of `pyenv` and `poetry`, and will install the specified Python version. All dependencies are placed in the local `.venv` directory.

If you have a different system, you will need to modify `pyproject.toml` to include a link to the appropriate `torch` wheel and to replicate the operations of `setup.sh` using the appropriate steps for your system.
By default, pre-commit only runs hooks on files staged for commit. If you wish to run all the pre-commit hooks without committing or pushing, use:
```bash
poetry run pre-commit run --all-files --hook-stage push
```
Note: these instructions rely on the MNIST task and assume specific dataset partitioning, model creation, and dataloader instantiation procedures. We recommend following a similar structure in your own experiments. Please refer to the Flower baselines for more examples.
Install the template using the setup.sh script:
```bash
./setup.sh
```
If `poetry`, `pyenv`, and/or the correct Python version are already installed, they will not be installed again. If not installed, you must provide paths to the desired install locations; if running on a cluster, this would be the location of the shared file system. You can now run `poetry shell` to activate the Python environment in your shell.
⚠️ Run the `default` task to check that everything is installed correctly from the root `fl-project-template` directory, not from the `fl-project-template/project` directory.
```bash
poetry run python -m project.main --config-name=base
```
If you have a cluster which may run multiple Ray simulator instances, you will need to launch the server separately.
The default task should have created a folder in `fl-project-template/outputs`. This folder contains the results of the experiment. To log your experiments to Wandb, log into Wandb and then enable it via the command:
```bash
poetry run python -m project.main --config-name=base use_wandb=true
```
Now, you can run the MNIST example by following these instructions:
- Specify a `dataset_dir` and `partition_dir` in `conf/dataset/mnist.yaml`, together with the `num_clients`, the size of a client's validation set `val_ratio`, and a `seed` for partitioning. You can also specify whether the partition labels should be `iid`, follow a `power_law` distribution, or whether the partition should `balance` the labels across clients.
- Download and partition the dataset by running the following command from the root dir:

  ```bash
  poetry run python -m project.task.mnist_classification.dataset_preparation
  ```
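For orientation, a dataset config of this shape might look as follows. This is only a sketch: the field names are taken from the options listed above, but the exact layout and all values are illustrative rather than the template's actual defaults.

```yaml
# Hypothetical sketch of conf/dataset/mnist.yaml (values are examples only)
dataset_dir: data/mnist            # where the raw dataset is downloaded
partition_dir: data/mnist/partitions  # one folder per client after partitioning
num_clients: 100
val_ratio: 0.1                     # size of each client's validation set
seed: 1337                         # seed for partitioning
iid: true                          # or: power_law / balance options
```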
- Specify which `model_and_data`, `train_structure`, and `fit_config` or `eval_config` to use in the `conf/task/mnist.yaml` file. The defaults are a CNN, a simple classification training/testing loop, and configs controlling the `batch_size`, local client `epochs`, and the `learning_rate`. You can also specify which metrics to aggregate during fit/eval.
- Run the experiment using the following command from the root dir:

  ```bash
  poetry run python -m project.main --config-name=mnist
  ```
- Once a complete experiment has run, you can continue it for a specified number of epochs by running the following command from the root dir, setting the output directory to that of the previous run:

  ```bash
  poetry run python -m project.main --config-name=mnist hydra.run.dir=<path_to_your_output_directory>
  ```
These are all the basic steps required to run a simple experiment.
Adding a task requires you to add a new task in the `project.task` module and to make changes to the `project.dispatch` module. Each `project.task` module has a specific structure:
`task`: The ML task implementation, including the model, data loading, and training/testing. Almost all user changes should be made here. Tasks will typically include modules for the following:

- `dataset_preparation`: Hydra entry point which handles downloading the dataset and partitioning it. The partition can be generated on the fly during FL execution or saved into a partition directory with one folder per client containing train and test files, with the server test set being in the root directory of the partition dir. This needs to be executed prior to running the main experiment. It relies on the dataset part of the Hydra config.
- `dataset`: Offers functionality to create the dataloaders for either the client fit/eval or for the centralized server evaluation.
- `dispatch`: Handles mapping the Hydra config to the required task configuration.
- `models`: Offers functionality to lazily create a model based on a received configuration.
- `train_test`: Offers functionality to train a model on a given dataset. This includes the effective train/test functions together with the config-generation functions for the fit/eval stages of FL. The federated evaluation test function, if provided, should also be specified here.
Specifying a new task requires implementing the above functionality, together with functions/closures which generate and configure these components in a manner that obeys the interface of previous tasks, as specified in `project.types`.
After implementing the task, dynamically starting it via Hydra requires changing two modules:
- The `project.<new_task>.dispatch` module requires three functions:
  - `dispatch_data(cfg)` is meant to provide a function to generate the model and the dataloaders. By default, this is done via the `conf.task.model_and_data` string in the config.
  - `dispatch_train(cfg)` selects the `train`, `test`, and federated test functions. By default, this is dispatched on the `conf.task.train_structure` string in the config.
  - `dispatch_config` selects the configs used during fit and eval; you will likely not have to change this, as the default task provides a sensible version.
- The `project.dispatch` module requires you to add the task-specific `dispatch_data`, `dispatch_train`, and `dispatch_config` functions from the `project.<new_task>.dispatch` module to the list of possible tasks that can match the config. The statically-declared function order determines which task is selected if multiple ones match the config.
You have now implemented an entirely new FL task without touching any of the FL-specific code.
This section aims to teach you how to have research projects containing both public and private components such that previously private work can be effortlessly open-sourced after publication.
- Fork the code template into your own private GitHub account; do not click "Use this template", as that would prevent you from opening PRs against the original repo.
- Create a private repository mirroring the code template:
  - Create a new private repository using the GitHub UI, called something like `private-fl-projects`
  - Clone the public template:

    ```bash
    git clone --bare https://github.com/camlsys/fl-project-template.git
    cd fl-project-template.git
    git push --mirror https://github.com/camlsys/private-fl-project.git
    cd ..
    rm -rf fl-project-template.git
    ```
- After you have done these steps, you never have to touch the public fork directly; all you need to do is:
  - Go to the `private-fl-projects` repo and add the public template as a remote:

    ```bash
    git remote add public git@github.com:your-name/fl-project-template.git
    ```
  - Now, any push you do will by default go to the origin (i.e., the private repo); if you want to pull/push from/to the public one, you can do:

    ```bash
    git pull public main
    git push public main
    ```
- You can then PR from the public fork to the original repo and bring any contributions you wish
- You can also officially publish your code by pushing a private branch to your public fork; this branch does not have to be synced to the template but may be of use if the conference requires an artefact for reproducibility
By default, the entire template is synchronized across server rounds: the model parameters, RNG state, Wandb run, metric `History`, config files, and logs are all checkpointed either every `freq` rounds or once at the end of training when the process exits. If Wandb is used, any restarted run continues at the exact same link in Wandb with no cumbersome tracking necessary.
To use the checkpoint system, all you have to do is specify the `hydra.run.dir` to be a previous execution directory rather than the default timestamped output directory. If you wish to restore a specific round rather than the most recent one, modify the `server_round` in the `fed` config.
One of the primary functionalities of this template is to allow for easily reproducible FL checkpointing. It achieves this by controlling the client-sampling, server, and client RNG seeding, and by saving the RNG states for `random`, `np`, and `torch`. The server and every client are provided with an isolated RNG generator, making them usable in a multithreaded context where the global generators may be accessed unpredictably.
The RNG states of all of the relevant packages and generators are automatically saved and synchronized to the round, allowing for reproducible client samples and client execution in the same round. Every relevant piece of client functionality also receives the isolated RNG state, which can be used to guarantee reproducibility (e.g., in the PyTorch dataloader).
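The idea of isolated, round-synchronized generators can be sketched as below. The seed-derivation formula and function names are illustrative assumptions, not the template's actual scheme; what matters is that each (client, round) pair gets its own generators, so a restored checkpoint replays identical randomness without touching the global RNGs.

```python
import random

import numpy as np
import torch


def make_client_rngs(global_seed: int, client_id: int, server_round: int):
    # Illustrative seed derivation: one isolated generator triple per
    # (client, round) so restarts reproduce the same client behavior.
    seed = (global_seed * 1_000_003 + client_id * 101 + server_round) % 2**31
    return (
        random.Random(seed),                  # isolated stdlib generator
        np.random.default_rng(seed),          # isolated NumPy generator
        torch.Generator().manual_seed(seed),  # e.g. for a DataLoader's sampler
    )


def sample_clients(global_seed: int, server_round: int, num_clients: int, k: int):
    # Deterministic, checkpoint-aware client selection: seeding on the round
    # means round N samples the same clients before and after a restart.
    rng = random.Random(global_seed * 100_003 + server_round)
    return sorted(rng.sample(range(num_clients), k))


a = sample_clients(42, server_round=7, num_clients=100, k=5)
b = sample_clients(42, server_round=7, num_clients=100, k=5)
print(a == b)  # True
```

Passing the `torch.Generator` to a dataloader (its `generator` argument) then makes shuffling reproducible per client and round as well.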
The template uses poetry with the `project` name for the top-level package. All imports are made from this package, and no relative imports are allowed. The structure is as follows:
```
project
├── client
├── conf
├── dispatch
├── fed
├── main.py
├── task
├── types
└── utils
```
The main packages of concern are:

- `client`: Contains the client class; requires no changes.
- `conf`: Contains the Hydra configuration files specifying experiment behavior and the chosen ML task.
- `dispatch`: Handles mapping a Hydra configuration to the ML task.
- `fed`: Contains the federated learning functionality, such as client sampling and model parameter saving. Should require little to no modification.
- `main`: A Hydra entry point.
- `task`: Described above.
Two tasks are already implemented:

- `default`: A task providing generic functionality that may be reused across tasks. It requires no data and provides a minimum example of what a task must provide for the FL training to execute.
- `mnist_classification`: Uses the simple MNIST dataset with either a CNN or logistic regression model.
⚠️ Prefer changing only the task module when possible.
To enable Continuous Integration of your project via pre-commit, all you need to do is enable pre-commit for the given repo from the GitHub Marketplace. Be aware that this is free only for public open-source repositories.