This repository contains the source code and experiments for our research paper, "Vulnerability of Federated Learning to Data Noises". This project systematically investigates the impact of feature-level data noise on the performance of Federated Learning (FL) and provides a comprehensive comparison against traditional Centralized Learning (CL).
Federated Learning (FL) enables collaborative model training on decentralized data, offering significant privacy advantages. However, real-world data, especially data collected on edge devices, is often imperfect and noisy. This project explores a critical, yet underexplored, vulnerability of FL: its sensitivity to feature-level noise (i.e., corruption in the input data itself, such as blur in images or typos in text).
Our research addresses the following key questions:
- How does the performance of FL degrade under increasing levels of feature noise?
- How does this degradation compare to that of traditional Centralized Learning (CL)?
- What are the underlying mechanisms within the FL process that cause these effects?
Our extensive, multi-modal experiments lead to a clear and consistent conclusion:
Federated Learning is significantly more vulnerable to feature noise than Centralized Learning.
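One intuition behind this gap: in FedAvg-style training, each client fits a model to its local (possibly noisy) data, and the server simply averages the resulting parameters, so local corruption feeds directly into the global model at every round. The sketch below illustrates this weighted-averaging step; the function name and plain-list parameter representation are illustrative only, not this repository's actual implementation.

```python
def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average client parameters, weighted by
    each client's local dataset size. Clients trained on noisy data
    contribute to the global model in proportion to their data volume."""
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(num_params)
    ]
```

For example, averaging two clients' parameters `[1.0, 2.0]` and `[3.0, 4.0]` with dataset sizes 1 and 3 yields `[2.5, 3.5]`: the larger (and possibly noisier) client dominates the result.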
The repository is organized by data modality, with a dedicated toolkit for noise generation.
```text
.
├── audio/
│   └── README.md            # Code for noise injection and model training on audio data
├── image/
│   └── README.md            # Code for noise injection and model training on image data
├── DataNoiseGenerator/
│   └── README.md            # A standalone, command-line toolkit for injecting noise
├── tabular/
│   └── README.md            # Code for noise injection and model training on tabular data
├── text/
│   └── README.md            # Code for noise injection and model training on text data
└── video/
    └── README.md            # Code for noise injection and model training on video data
```
- Modality directories (/audio, /image, etc.): Each directory contains the scripts for data preparation, noise injection, and running the FL/CL experiments for that specific data type. Please refer to the README.md within each directory for detailed instructions.
- DataNoiseGenerator/: This directory contains our flexible, open-source noise injection toolkit. It is designed to be a standalone tool that can inject a wide range of common, modality-aware noises into five different data types with precise control.
To ensure our findings are generalizable, our study covers five diverse data modalities:
- Image: Object Recognition on CIFAR-10 and Object Detection on Pascal VOC 2012
- Video: Action Recognition on UCF101, Something-Something (V2), and ARID
- Audio: Sound Classification on UrbanSound8K and ESC-50 (Environmental Sound Classification)
- Text: Next-Word Prediction on Shakespeare, AG News, and Amazon Reviews
- Tabular: Phishing Website Prediction (classification) and House Price Prediction (regression)
A key contribution of this project is DataNoiseGenerator, a powerful command-line tool for injecting controlled feature noise into datasets. It was built on three core principles:
- Modality-Awareness: Implements noise types that are realistic for each data modality (e.g., motion blur for video, typos for text).
- Fine-Grained Controllability: Allows precise control over both noise intensity (how strong the noise is) and noise proportion (what fraction of the data is affected).
- Unified Interface: Provides a consistent command-line interface across all data types, making it easy to set up controlled experiments.
For detailed usage, please see the DataNoiseGenerator README.
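To make the intensity/proportion distinction concrete, here is a minimal, hypothetical sketch of how a noise injector can apply a corruption function to a controlled fraction of a dataset at a given strength. The function names here are assumptions for illustration and are not the toolkit's actual API; see its README for real usage.

```python
import random

def inject_noise(samples, noise_fn, proportion=0.3, intensity=0.5, seed=0):
    """Apply noise_fn (at the given intensity) to a random `proportion`
    of the samples, leaving the rest untouched.

    - proportion controls WHAT FRACTION of the data is corrupted.
    - intensity controls HOW STRONG the corruption is per sample.
    """
    rng = random.Random(seed)  # seeded for reproducible experiments
    n_noisy = int(len(samples) * proportion)
    noisy_idx = set(rng.sample(range(len(samples)), n_noisy))
    return [noise_fn(x, intensity) if i in noisy_idx else x
            for i, x in enumerate(samples)]

# Usage: a trivial additive corruption as a stand-in for a
# modality-specific noise (blur, typos, audio hum, etc.)
noisy = inject_noise([0.0, 0.0, 0.0, 0.0],
                     lambda x, s: x + s,
                     proportion=0.5, intensity=1.0)
```

Separating the two knobs this way is what makes controlled comparisons possible: one can sweep intensity at a fixed proportion, or vice versa, across both FL and CL runs.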
Each modality-specific directory (/audio, /image, etc.) is self-contained and includes its own README.md file with detailed instructions for:
- Setting up the Python environment and dependencies.
- Downloading and preparing the dataset.
- Using DataNoiseGenerator to create noisy versions of the data.
- Running the training scripts for both Federated and Centralized Learning.
Please navigate to the directory of interest to get started.