Assignment 1: Red Team/Blue Team Exercise on Deep Learning Security
This project implements and evaluates the security of a Convolutional Neural Network (CNN) for MNIST handwritten digit classification. It demonstrates various attack vectors including data poisoning and adversarial examples, followed by defense strategies through adversarial training.
- Course: Secure AI Systems
- Assignment: Assignment 1 - Red Team/Blue Team Exercise
- Dataset: MNIST Handwritten Digits
- Framework: PyTorch
```
SecureAI-Systems/
├── src/                                # Source code
│   ├── train.py                        # Main CNN training script
│   ├── evaluate.py                     # Model evaluation and metrics
│   ├── blue_teaming.py                 # Adversarial training (defense)
│   ├── poison_method_1.py              # Data poisoning - trigger method
│   ├── poison_method_2.py              # Data poisoning - FGSM adversarial
│   └── inference_cnn_mnist.py          # Model inference script
├── data/                               # MNIST dataset files
│   ├── train-images-idx3-ubyte         # Training images
│   ├── train-labels-idx1-ubyte         # Training labels
│   ├── t10k-images.idx3-ubyte          # Test images
│   └── t10k-labels.idx1-ubyte          # Test labels
├── results/                            # Results and outputs
│   ├── models/                         # Trained model checkpoints
│   │   ├── mnist_custom_cnn.pth        # Clean baseline model
│   │   ├── mnist_adv_trained_cnn.pth   # Adversarially trained model
│   │   └── mnist_poisoned_1_cnn.pth    # Poisoned model
│   ├── logs/                           # Evaluation results and logs
│   │   ├── evaluation.txt              # Baseline model performance
│   │   ├── evaluation_blue_teaming.txt # Adversarial training results
│   │   ├── evaluation_poison_1.txt     # Trigger poisoning results
│   │   └── evaluation_poison_2.txt     # FGSM poisoning results
│   └── security_analysis/              # Security analysis reports
│       └── bandit_sast_analysis.txt    # SAST tool results
├── docs/                               # Documentation
│   └── Assignment 1 - Secure AI Systems.pdf
├── requirements.txt                    # Python dependencies
└── README.md                           # This file
```
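The `data/` directory stores MNIST in the raw IDX binary format rather than relying on a torchvision download. As a rough illustration only (the project's actual loading code lives in `src/`, and the function names below are made up for this sketch), the files can be parsed like this:

```python
import numpy as np

def load_idx_images(path):
    """Parse an IDX3 image file (magic 2051) into a (N, rows, cols) uint8 array."""
    with open(path, "rb") as f:
        raw = f.read()
    magic, n, rows, cols = np.frombuffer(raw[:16], dtype=">i4")  # big-endian header
    if magic != 2051:
        raise ValueError(f"{path} is not an IDX3 image file")
    return np.frombuffer(raw[16:], dtype=np.uint8).reshape(n, rows, cols)

def load_idx_labels(path):
    """Parse an IDX1 label file (magic 2049) into a (N,) uint8 array."""
    with open(path, "rb") as f:
        raw = f.read()
    magic, n = np.frombuffer(raw[:8], dtype=">i4")
    if magic != 2049:
        raise ValueError(f"{path} is not an IDX1 label file")
    return np.frombuffer(raw[8:], dtype=np.uint8)

# Example: images = load_idx_images("data/train-images-idx3-ubyte")
#          labels = load_idx_labels("data/train-labels-idx1-ubyte")
```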
Requirements:
- Python 3.8+
- PyTorch
- NumPy
- Scikit-learn

- Clone or download the project
- Install dependencies:
  ```
  pip install -r requirements.txt
  ```
- Train Baseline CNN Model:
  ```
  cd src/
  python train.py
  ```
- Evaluate Baseline Model:
  ```
  python evaluate.py
  ```
- Run Data Poisoning Attacks:
  ```
  # Trigger-based poisoning (Method 1)
  python poison_method_1.py

  # FGSM adversarial poisoning (Method 2)
  python poison_method_2.py
  ```
- Run Blue Team Defense (Adversarial Training):
  ```
  python blue_teaming.py
  ```
Baseline CNN results (clean model, clean test set):
- Test Accuracy: 98.98%
- Test Loss: 0.0303
- Inference Time: 0.1546 ms per image
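For reference, a per-image latency figure like the one above can be obtained with a simple timing harness; this is only a sketch, and the exact methodology used in `evaluate.py` may differ:

```python
import time
import torch

@torch.no_grad()
def per_image_latency_ms(model, images, device="cpu"):
    """Average forward-pass time per image, in milliseconds (rough estimate)."""
    model.eval().to(device)
    images = images.to(device)
    start = time.perf_counter()
    model(images)                       # single forward pass over the batch
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / images.shape[0]
```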
Red team attack 1, trigger-based data poisoning:
- Poisoned Samples: 100 images of digit "7" with white trigger square
- Target Class: 0 (misclassification target)
- Model Performance After Poisoning: 97.59% accuracy
- Impact: Slight accuracy degradation, successful trigger implantation
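The core of a trigger-style poisoning attack is small: stamp a visible patch onto a handful of source-class images and flip their labels to the target class. The sketch below assumes a 3x3 white square in the bottom-right corner; the exact patch size, position, and injection logic used by `poison_method_1.py` may differ.

```python
def add_trigger(images, labels, source_class=7, target_class=0,
                n_poison=100, patch=3):
    """Backdoor-poisoning sketch: stamp a white square onto a few source-class
    images and relabel them to the attacker's target class.

    images: float tensor (N, 1, 28, 28) scaled to [0, 1]; labels: long tensor (N,).
    """
    images, labels = images.clone(), labels.clone()
    idx = (labels == source_class).nonzero(as_tuple=True)[0][:n_poison]
    images[idx, :, -patch:, -patch:] = 1.0   # white trigger square (corner assumed)
    labels[idx] = target_class               # flip the label to the target class
    return images, labels
```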
Red team attack 2, FGSM adversarial examples (against the baseline model):
- Attack Epsilon: 0.25
- Clean Accuracy: 98.99%
- Adversarial Accuracy: 43.65%
- Attack Success Rate: 55.90%
- Impact: Severe accuracy degradation under adversarial attack
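FGSM perturbs each input by one step in the direction of the sign of the loss gradient. A minimal sketch with the ε = 0.25 used above, assuming the model returns raw logits (use `F.nll_loss` instead if it ends in `log_softmax`) and inputs scaled to [0, 1]:

```python
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=0.25):
    """One-step FGSM: x_adv = clamp(x + epsilon * sign(dL/dx), 0, 1)."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)   # assumes raw-logit outputs
    loss.backward()
    adv = images + epsilon * images.grad.sign()     # step that increases the loss
    return adv.clamp(0.0, 1.0).detach()             # keep pixels in valid range
```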
Blue team defense results:
- Defense Method: On-the-fly FGSM adversarial training
- Clean Test Accuracy: 98.22%
- Adversarial Test Accuracy: 95.02%
- Improvement: +51.37% accuracy against FGSM attacks
- Robustness: Successfully defended against adversarial examples
STRIDE threat model:

| Threat Type | Description | Impact | Mitigation |
|---|---|---|---|
| Spoofing | Adversarial examples mimicking legitimate inputs | High - 55% attack success | Adversarial training |
| Tampering | Data poisoning during training | Medium - Backdoor implantation | Data validation, anomaly detection |
| Repudiation | Model decisions lack explainability | Medium - Trust issues | Model interpretability techniques |
| Information Disclosure | Model parameters vulnerable to extraction | Medium - IP theft | Model protection, differential privacy |
| Denial of Service | Adversarial examples cause misclassification | High - System failure | Robust training, input validation |
| Elevation of Privilege | Compromised model makes unauthorized decisions | High - Security bypass | Access controls, model verification |
Tool Used: Bandit

Vulnerabilities Found:

- Medium Severity: Unsafe PyTorch model loading (CWE-502)
  - Location: `evaluate.py:64`, `inference_cnn_mnist.py:62`
  - Risk: Potential code execution from maliciously crafted model files (in older PyTorch versions)
  - Important Note: The `weights_only` argument was introduced in PyTorch 1.13.0, and `weights_only=True` became the default in PyTorch 2.6, which mitigates this issue
  - Recommendation: For maximum compatibility, pass `weights_only=True` explicitly (see the sketch after this list)
- Low Severity: Use of `assert` statements (CWE-703)
  - Location: `train.py:40`
  - Risk: Assertions are stripped from optimized bytecode (`python -O`)
  - Recommendation: Replace with proper exception handling (see the sketch after this list)
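A minimal sketch of both remediations; the function names are illustrative, not the project's actual code:

```python
import torch

def load_checkpoint(path):
    """Safer checkpoint loading: weights_only=True restricts unpickling to
    tensors and primitive containers instead of arbitrary Python objects."""
    return torch.load(path, map_location="cpu", weights_only=True)

def check_batch(images, labels):
    """Explicit validation instead of `assert`, which `python -O` strips."""
    if images.shape[0] != labels.shape[0]:
        raise ValueError("image/label count mismatch")
```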
Adversarial training (blue team) configuration:
- Method: FGSM on-the-fly adversarial training
- Parameters: ε = 0.25, 5 epochs
- Results: 95.02% accuracy on adversarial examples (vs. 43.65% without defense)
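A minimal version of on-the-fly FGSM adversarial training looks like the loop below. The 50/50 clean/adversarial loss weighting is an assumption; `blue_teaming.py` may mix the batches differently.

```python
import torch.nn.functional as F

def adversarial_train(model, loader, optimizer, device, epsilon=0.25, epochs=5):
    """On-the-fly FGSM adversarial training: every batch is re-attacked with
    the current weights before the parameter update."""
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)

            # Craft FGSM examples against the current model
            x_req = x.clone().detach().requires_grad_(True)
            F.cross_entropy(model(x_req), y).backward()
            x_adv = (x_req + epsilon * x_req.grad.sign()).clamp(0.0, 1.0).detach()

            # Update on an even mix of clean and adversarial loss (assumption)
            optimizer.zero_grad()
            loss = 0.5 * (F.cross_entropy(model(x), y)
                          + F.cross_entropy(model(x_adv), y))
            loss.backward()
            optimizer.step()
```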
Additional built-in defenses:
- Normalization: Input images normalized to the [0, 1] range
- Clamping: Adversarial perturbations bounded
- Dropout: 0.25 and 0.5 dropout rates for regularization
- Batch Normalization: Implicit through convolutional design
| Model Type | Clean Accuracy | Adversarial Accuracy | Robustness Gain | Training Time |
|---|---|---|---|---|
| Baseline CNN | 98.98% | 43.65% | - | ~5 min |
| Adversarially Trained | 98.22% | 95.02% | +51.37% | ~15 min |
| Poisoned Model | 97.59% | - | -1.39% | ~5 min |
| Attack Method | Success Rate | Detection Difficulty | Mitigation Effectiveness |
|---|---|---|---|
| Trigger Poisoning | 100% (on triggered samples) | Low (visible trigger) | High (data validation) |
| FGSM Adversarial | 55.90% | High (imperceptible) | High (adversarial training) |
CNN architecture:
- Conv2d(1, 32, kernel=3) + ReLU
- Conv2d(32, 64, kernel=3) + ReLU
- MaxPool2d(2) + Dropout(0.25)
- Linear(9216, 128) + ReLU + Dropout(0.5)
- Linear(128, 10) # Output layer

Training hyperparameters:
- Learning Rate: 0.001 (Adam optimizer)
- Batch Size: 64 (training), 128 (adversarial training)
- Epochs: 5
- FGSM Epsilon: 0.25
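Putting the layer list and hyperparameters together, the model roughly corresponds to the module below (class name and forward details are illustrative; see `src/train.py` for the authoritative definition):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MnistCnn(nn.Module):
    """CNN matching the layer list above (28x28 input -> 9216 flattened features)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.drop1 = nn.Dropout(0.25)
        self.fc1 = nn.Linear(9216, 128)
        self.drop2 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))            # 1x28x28 -> 32x26x26
        x = F.relu(self.conv2(x))            # 32x26x26 -> 64x24x24
        x = self.drop1(F.max_pool2d(x, 2))   # -> 64x12x12
        x = torch.flatten(x, 1)              # -> 9216 features
        x = self.drop2(F.relu(self.fc1(x)))  # -> 128
        return self.fc2(x)                   # logits for 10 classes

model = MnistCnn()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```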
Key findings and recommendations:
- Model Brittleness: High sensitivity to adversarial perturbations
- Training Data Integrity: Susceptibility to data poisoning attacks
- Security vs. Performance Trade-offs: Adversarial training reduces clean accuracy slightly
- Adversarial Training: Most effective against gradient-based attacks
- Input Preprocessing: Normalization and bounds checking
- Robust Architectures: Dropout and regularization improve resilience
- Detection Mechanisms: Implement adversarial example detection
- Certified Defenses: Explore provably robust training methods
- Multi-Attack Robustness: Test against diverse attack strategies
For detailed information about this project, please refer to the following documents:
- 📋 Quick Start Guide - Step-by-step instructions to run all experiments
- 📊 Project Report - Comprehensive analysis, methodology, and detailed results
- 🔒 Threat Model - Security analysis using STRIDE framework with detailed threat assessment
- Main Repository: GitHub - SecureAI MNIST Project
- Adversarial Dataset Generation: Included in the `poison_method_*.py` scripts
- Goodfellow, I., et al. "Explaining and harnessing adversarial examples." ICLR 2015.
- Madry, A., et al. "Towards deep learning models resistant to adversarial attacks." ICLR 2018.
- Gu, T., et al. "BadNets: Identifying vulnerabilities in the machine learning model supply chain." arXiv preprint arXiv:1708.06733, 2017.
- OWASP Machine Learning Security Top 10
This project is developed for educational purposes as part of the Secure AI Systems course.
Note: This project demonstrates security vulnerabilities for educational purposes only. Do not use these techniques for malicious purposes.