fardatalab/fl_data_quality
Vulnerability of Federated Learning to Data Noises

This repository contains the source code and experiments for our research paper, "Vulnerability of Federated Learning to Data Noises". This project systematically investigates the impact of feature-level data noise on the performance of Federated Learning (FL) and provides a comprehensive comparison against traditional Centralized Learning (CL).

Overview

Federated Learning (FL) enables collaborative model training on decentralized data, offering significant privacy advantages. However, real-world data, especially data collected on edge devices, is often imperfect and noisy. This project explores a critical, yet underexplored, vulnerability of FL: its sensitivity to feature-level noise (i.e., corruption in the input data itself, such as blur in images or typos in text).

Our research addresses the following key questions:

  • How does the performance of FL degrade under increasing levels of feature noise?
  • How does this degradation compare to that of traditional Centralized Learning (CL)?
  • What are the underlying mechanisms within the FL process that cause these effects?
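On the third question, the mechanism at stake is the server-side aggregation step. As a minimal sketch (assuming the standard FedAvg weighted-averaging rule; this repository's experiments may use a different aggregator, and the numbers below are purely illustrative), a single client trained on noise-corrupted features can drag the shared global model away from the clean clients' consensus:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (the standard FedAvg rule)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three equally sized clients; the third trained on noisy features and drifted.
# Its corrupted update is averaged into the global model shared by everyone,
# whereas in centralized training the same noisy samples would be diluted
# inside one large pooled dataset.
clients = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([3.0, -2.0])]
sizes = [100, 100, 100]
global_model = fedavg(clients, sizes)  # ≈ [1.7, -0.033]
```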

Key Findings

Our extensive, multi-modal experiments lead to a clear and consistent conclusion:

Federated Learning is significantly more vulnerable to feature noise than Centralized Learning.

Project Structure

The repository is organized by data modality, with a dedicated toolkit for noise generation.

.
├── audio/
│   └── README.md   # Code for noise injection and model training on audio data
├── image/
│   └── README.md   # Code for noise injection and model training on image data
├── DataNoiseGenerator/
│   └── README.md   # A standalone, command-line toolkit for injecting noise
├── tabular/
│   └── README.md   # Code for noise injection and model training on tabular data
├── text/
│   └── README.md   # Code for noise injection and model training on text data
└── video/
    └── README.md   # Code for noise injection and model training on video data

  • Modality Directories (/audio, /image, etc.): Each directory contains the necessary scripts for data preparation, noise injection, and running the FL/CL experiments for that specific data type. Please refer to the README.md within each directory for detailed instructions.
  • DataNoiseGenerator/: This directory contains our flexible, open-source noise injection toolkit. It is designed to be a standalone tool that can inject a wide range of common, modality-aware noises into five different data types with precise control.

Core Components

1. Multi-Modal Experiments

To ensure our findings are generalizable, our study covers five diverse data modalities:

  • Image: Object Recognition on CIFAR-10 and Object Detection on Pascal VOC 2012
  • Video: Action Recognition on UCF101, Something-Something V2, and ARID
  • Audio: Sound Classification on UrbanSound8K and ESC-50 (Environmental Sound Classification)
  • Text: Next-Word Prediction on Shakespeare, AG News, and Amazon Reviews
  • Tabular: Phishing Website Prediction (classification) and House Price Prediction (regression)

2. DataNoiseGenerator: A Unified Noise Injection Toolkit

A key contribution of this project is DataNoiseGenerator, a powerful command-line tool for injecting controlled feature noise into datasets. It was built on three core principles:

  • Modality-Awareness: Implements noise types that are realistic for each data modality (e.g., motion blur for video, typos for text).
  • Fine-Grained Controllability: Allows precise control over both noise intensity (how strong the noise is) and noise proportion (what fraction of the data is affected).
  • Unified Interface: Provides a consistent command-line interface across all data types, making it easy to set up controlled experiments.

For detailed usage, please see the DataNoiseGenerator README.
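The intensity/proportion interface described above can be sketched as follows (a minimal illustration for array-shaped data, using Gaussian noise as a stand-in; the function name and signature are hypothetical, not DataNoiseGenerator's actual API — see its README for the real interface):

```python
import numpy as np

def inject_gaussian_noise(x, intensity=0.1, proportion=0.5, seed=0):
    """Add Gaussian feature noise to a random fraction of samples.

    intensity  -- standard deviation of the added noise (how strong it is)
    proportion -- fraction of samples that receive noise (how many are affected)
    """
    rng = np.random.default_rng(seed)
    x = x.copy()
    idx = rng.choice(len(x), size=int(len(x) * proportion), replace=False)
    x[idx] += rng.normal(0.0, intensity, size=x[idx].shape)
    return x, idx

# Corrupt 30% of a toy dataset with sigma = 0.2 noise; the rest stays clean.
clean = np.zeros((10, 4))
noisy, corrupted = inject_gaussian_noise(clean, intensity=0.2, proportion=0.3)
```

Separating the two knobs this way is what makes controlled experiments possible: one can sweep noise strength at a fixed corruption rate, or vice versa.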

How to Run the Experiments

Each modality-specific directory (/audio, /image, etc.) is self-contained and includes its own README.md file with detailed instructions for:

  1. Setting up the Python environment and dependencies.
  2. Downloading and preparing the dataset.
  3. Using DataNoiseGenerator to create noisy versions of the data.
  4. Running the training scripts for both Federated and Centralized Learning.

Please navigate to the directory of interest to get started.
