Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
This repository contains the code implementation of the paper "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks", presented at IEEE Security and Privacy 2019.
Our code is implemented and tested on Keras with a TensorFlow backend. The following packages are used by our code.
Our code is tested on
We include a sample script demonstrating how to perform the reverse engineering technique on an infected model. Several parameters need to be set before running the code; they can be modified here.
- GPU device: if you are using GPU, specify which GPU you would like to use by setting the DEVICE variable
- Data/model/result folder: if you are using the code on your own models and datasets, please specify the path to the data/model/result files. They are specified by variables here.
- Meta info: if you are testing it on your own model, please specify the correct meta information about the task, including input size, preprocessing method, total # of labels, and infected label (optional).
- Configuration of the optimization: There are several parameters you could configure for the optimization process, including learning rate, batch size, # of samples per iteration, total # of iterations, initial value for weight balance, etc. Most parameters fit all models we tested, and you should be able to use the same configuration for your task as well.
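For orientation, the optimization searches for a trigger stored as a (mask, delta) pair. The following is a minimal NumPy sketch of how such a pair could be blended into inputs and how the "weight balance" parameter might enter the objective, following the formulation in the paper. The function names `apply_trigger` and `trigger_objective`, the array shapes, and the argument `lam` are assumptions made for illustration; they are not taken from the repository's code.

```python
import numpy as np

def apply_trigger(x, mask, delta):
    """Blend a trigger into a batch of images (illustrative helper).

    x:     batch of images, shape (N, H, W, C)
    mask:  per-pixel blend weights in [0, 1], shape (H, W, 1)
    delta: trigger pattern, shape (H, W, C)
    """
    # Where mask is 1 the pixel is replaced by the pattern;
    # where mask is 0 the original pixel is kept.
    return (1.0 - mask) * x + mask * delta

def trigger_objective(classification_loss, mask, lam):
    """Classification loss plus an L1 penalty on the mask.

    lam is assumed to play the role of the weight-balance parameter:
    larger values push the optimizer toward smaller triggers.
    """
    return classification_loss + lam * np.sum(np.abs(mask))
```

A small mask L1 norm for one label, relative to the others, is the signal the outlier detection step looks for.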
To execute the Python script, simply run
We have included a sample infected model for traffic sign recognition in the repo, along with the testing data used for reverse engineering. The sample code uses this model/dataset by default. Examining all labels in the traffic sign recognition model takes roughly 10 min. All reverse engineered triggers (mask, delta) will be stored under RESULT_DIR. You can also restrict the run to the labels you would like to focus on by changing the following code.
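As a rough illustration of label selection, the sketch below builds the list of labels to examine and, when an infected label is known, moves it to the front so it is processed first. The helper `build_label_list` and its arguments are hypothetical and are not the repository's code; the class count of 43 is the standard GTSRB label count, and label 33 is the infected label reported in the sample output.

```python
def build_label_list(num_classes, infected_label=None, only=None):
    """Return the labels to reverse engineer, in processing order.

    num_classes:    total number of labels in the model
    infected_label: optional known infected label, examined first
    only:           optional iterable restricting the run to a subset
    """
    labels = list(range(num_classes)) if only is None else list(only)
    if infected_label is not None and infected_label in labels:
        labels.remove(infected_label)
        labels.insert(0, infected_label)
    return labels

# All 43 GTSRB labels, with label 33 examined first.
all_labels = build_label_list(43, infected_label=33)

# Only a handful of labels of interest.
subset = build_label_list(43, infected_label=33, only=[1, 5, 33])
```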
We use an anomaly detection algorithm based on MAD (Median Absolute Deviation). A very useful explanation of MAD can be found here. Our implementation reads all reversed triggers and detects any outlier with a small size. Before you execute the code, please make sure the following configuration is correct.
- Path to reversed trigger: you can specify the location where you put all reversed triggers here. Filename format in the sample code is consistent with previous code for reverse engineering. Our code only checks if there is any anomaly among reversed triggers under the specified folder. So be sure to include all triggers you would like to analyze in the folder.
- Meta info: configure the meta information about the task and model correctly, so our analysis code can load reversed triggers with the correct shape. You need to specify the input shape and the total # of labels in the model.
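The MAD-based check described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `mad_outliers` is hypothetical, the constant 1.4826 is the usual scaling that makes MAD comparable to a standard deviation under a normal distribution, and the flagging threshold of 2 is an assumption based on the paper. The input is one trigger size (L1 norm of the mask) per label.

```python
import numpy as np

# Standard scaling so that MAD estimates the standard deviation
# when the data are normally distributed.
CONSISTENCY_CONSTANT = 1.4826

def mad_outliers(l1_norms, threshold=2.0):
    """Flag labels whose trigger is anomalously small.

    l1_norms:  sequence of trigger L1 norms, indexed by label
    threshold: anomaly index above which a label is flagged (assumed)
    Returns (median, mad, anomaly_index, flagged), where flagged is a
    list of (label, l1_norm) pairs.
    """
    norms = np.asarray(l1_norms, dtype=float)
    median = np.median(norms)
    mad = CONSISTENCY_CONSTANT * np.median(np.abs(norms - median))
    if mad == 0:
        raise ValueError("MAD is zero; trigger sizes are too uniform to score")

    # Anomaly index of the smallest trigger across all labels.
    anomaly_index = (median - norms.min()) / mad

    # Only triggers smaller than the median are candidates.
    flagged = [
        (label, norm)
        for label, norm in enumerate(norms)
        if norm < median and (median - norm) / mad > threshold
    ]
    return median, mad, anomaly_index, flagged
```

Only small outliers are considered because a backdoor shows up as an unusually small trigger; unusually large triggers are ignored.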
To execute the sample code, simply run
Below is a snippet of the output of outlier detection, in the infected GTSRB model (traffic sign recognition).
    median: 64.466667, MAD: 13.238736
    anomaly index: 3.652087
    flagged label list: 33: 16.117647
Line #2 shows the final anomaly index is 3.652, which suggests the model is infected. Line #3 shows the outlier detection algorithm flags only 1 label (label 33), which has a trigger with L1 norm of 16.1.
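The reported numbers can be checked by hand. Assuming the anomaly index is the distance of the smallest trigger from the median, divided by the MAD, and that label 33 holds the smallest L1 norm (it is the only flagged label), the printed values reproduce the index:

```python
# Values copied from the sample output.
median = 64.466667
mad = 13.238736
flagged_l1 = 16.117647  # L1 norm of the trigger for label 33

# Distance of the flagged trigger from the median, in units of MAD.
anomaly_index = (median - flagged_l1) / mad
print(round(anomaly_index, 3))  # prints 3.652
```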