
Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results remain the same? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English. In the experiments, our method attains high success rates (96.0%-99.8%) in finding vulnerable examples and is able to reveal hidden model bias. More details can be found in our paper:

Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei Chang, and Cho-Jui Hsieh, "Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation", NAACL 2021 [demo notebook] [video] [slides] [poster] [proceeding]
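The two-step idea described above can be illustrated with a toy sketch. It is not the authors' implementation: the synonym table, the word-weight "model", and the helper names (`neighbors`, `single_word_flips`, `double_perturbation`) are hypothetical, and the sketch omits any check that the generated neighbors are natural sentences. The first perturbation enumerates sentences within k word substitutions of a test sentence; the second checks whether a single-word substitution flips the prediction on each of those neighbors.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Tuple

Sentence = Tuple[str, ...]
Synonyms = Dict[str, List[str]]  # hypothetical substitution table


def neighbors(x: Sentence, syn: Synonyms, k: int) -> List[Sentence]:
    """First perturbation: x plus every sentence obtained by substituting at most k words of x."""
    out = [x]
    positions = [i for i, w in enumerate(x) if w in syn]
    for r in range(1, k + 1):
        for idx in combinations(positions, r):
            for choice in product(*(syn[x[i]] for i in idx)):
                y = list(x)
                for i, w in zip(idx, choice):
                    y[i] = w
                out.append(tuple(y))
    return out


def single_word_flips(x: Sentence, syn: Synonyms,
                      f: Callable[[Sentence], int]) -> List[Tuple[int, str]]:
    """Second perturbation: (position, substitute) pairs whose single-word swap changes f(x)."""
    base = f(x)
    flips = []
    for i, w in enumerate(x):
        for s in syn.get(w, []):
            if f(x[:i] + (s,) + x[i + 1:]) != base:
                flips.append((i, s))
    return flips


def double_perturbation(x: Sentence, syn: Synonyms, f: Callable[[Sentence], int], k: int):
    """Neighbors of x (within k substitutions) that have at least one prediction-flipping swap."""
    found = []
    for xp in neighbors(x, syn, k):
        flips = single_word_flips(xp, syn, f)
        if flips:
            found.append((xp, flips))
    return found


# Toy sentiment "model": positive (1) if the summed word weights are > 0.
weights = {"good": 2, "fine": 1, "film": 0, "flick": -1}


def f(s: Sentence) -> int:
    return int(sum(weights.get(w, 0) for w in s) > 0)


syn = {"good": ["fine"], "film": ["flick"]}
x = ("a", "good", "film")
```

Here `single_word_flips(x, syn, f)` is empty, so a check on the test sentence alone reports no weakness, while `double_perturbation(x, syn, f, k=1)` finds that both one-substitution neighbors of `x` can be flipped by a single further swap.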



Verified Environment:

  • Ubuntu 20.04
  • NVIDIA GeForce RTX 3090
  • CUDA Version: 11.2

Clone the repo:

git clone --recurse-submodules
cd nlp-second-order-attack

Create a clean environment with Conda or your favorite virtual environment tool:

conda create --name SOAttack-3.8 python==3.8.3
conda activate SOAttack-3.8

Run the setup for PyTorch 1.7 and RTX 30xx GPUs.


Run Attacks

Certified BoW, CNN, and LSTM (Jia et al., 2019)

Train a certified CNN model:

python libs/jia_certified/src/ classification cnn \
  --out-dir model_data/cnn_cert_test -T 60 \
  --full-train-epochs 20 -c 0.8 --random-seed 1234567 \
  -d 100 --pool mean --dropout-prob 0.2 -b 32 \
  --data-cache-dir .cache --save-best-only
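Jia et al. (2019) certify robustness to word substitutions with interval bound propagation (IBP): each word is replaced by a box containing the embeddings of all of its allowed substitutes, and the box is pushed through the network to bound the worst-case output. The sketch below shows only that interval arithmetic for a toy model made of 2-dimensional embeddings, mean pooling (the command above uses --pool mean), and one linear layer. The embeddings, substitution table, weights, and helper names are hypothetical; a real CNN or LSTM also needs bounds through its convolutional or recurrent layers and nonlinearities.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Box = Tuple[Vector, Vector]  # (elementwise lower bound, elementwise upper bound)


def word_box(word: str, emb: Dict[str, Vector], subs: Dict[str, List[str]]) -> Box:
    """Axis-aligned box containing the embeddings of `word` and all its allowed substitutes."""
    vecs = [emb[w] for w in [word] + subs.get(word, [])]
    lo = [min(col) for col in zip(*vecs)]
    hi = [max(col) for col in zip(*vecs)]
    return lo, hi


def mean_pool_box(boxes: Sequence[Box]) -> Box:
    """Mean pooling is linear with nonnegative weights, so the bounds average elementwise."""
    n = len(boxes)
    lo = [sum(col) / n for col in zip(*(b[0] for b in boxes))]
    hi = [sum(col) / n for col in zip(*(b[1] for b in boxes))]
    return lo, hi


def linear_lower_bound(w: Vector, b: float, box: Box) -> float:
    """Lower bound of w . h + b over the box: take lo where w >= 0 and hi where w < 0."""
    lo, hi = box
    return b + sum(wi * (l if wi >= 0 else u) for wi, l, u in zip(w, lo, hi))


def certify(sentence: Sequence[str], label: int, emb: Dict[str, Vector],
            subs: Dict[str, List[str]], W: List[Vector], bias: Vector) -> bool:
    """True if no combination of allowed substitutions lets another class outscore `label`."""
    pooled = mean_pool_box([word_box(w, emb, subs) for w in sentence])
    for other in range(len(W)):
        if other == label:
            continue
        # Bound the logit margin (label minus other) directly, which is tighter
        # than bounding each logit separately.
        w_diff = [a - c for a, c in zip(W[label], W[other])]
        if linear_lower_bound(w_diff, bias[label] - bias[other], pooled) <= 0:
            return False
    return True


# Toy data: class 0 = negative, class 1 = positive.
emb = {"good": [1.0, 0.0], "fine": [0.6, 0.1], "ok": [0.2, 0.3], "film": [0.0, 0.0]}
W = [[-1.0, 1.0], [1.0, -1.0]]
bias = [0.0, 0.0]
```

With only "fine" allowed as a substitute for "good", the margin's lower bound for ("good", "film") is 0.5 and the positive prediction is certified; also allowing "ok" widens the box until the bound falls to about -0.1 and certification fails.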

Attack 10 examples from the SST2 dataset. Note that cnn_cert_test is a pre-defined variable in models/, and you need to modify the file if you are using a different --out-dir.

./patched_textattack attack \
  --dataset-from-nlp=glue:sst2:validation --num-examples=10 --shuffle=False \

Certified Transformers (Xu et al., 2020)

Train a certified 3-layer Transformer:

export PYTHONPATH=$PYTHONPATH:libs/xu_auto_LiRPA

python libs/xu_auto_LiRPA/examples/language/ \
  --dir=model_data/transformer_cert --robust \
  --method=IBP+backward_train --train --max_sent_length 128 \
  --num_layers 3

Attack 10 examples from the SST2 dataset.

./patched_textattack attack \
  --dataset-from-nlp=glue:sst2:validation --num-examples=10 --shuffle=False \

Custom models

Our code is general and can be used to evaluate custom models. Similar to models/, you will need to create a wrapper models/ and implement two classes:

  1. class CustomTokenizer
    • def encode():
    • Optional: def batch_encode():
  2. class ModelWrapper
    • def __call__():
    • def to():

The model and tokenizer can then be specified with --model=models/, where model_obj and tokenizer_obj are variables of the corresponding types.
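A minimal sketch of the two classes follows. The README lists only the method names, so the argument and return types here are assumptions: `encode()` is taken to map one string to a list of token ids, `batch_encode()` a list of strings to a list of id lists, `__call__()` a batch of id lists to per-class scores, and `to()` a device move that returns the wrapper. A wrapper around an actual network would presumably hold a PyTorch model and return its logits; this toy bag-of-words scorer uses only the standard library so that it runs anywhere.

```python
from typing import Dict, Iterable, List


class CustomTokenizer:
    """Hypothetical word-level tokenizer."""

    def __init__(self, vocab: Dict[str, int], unk_id: int = 0):
        self.vocab = vocab
        self.unk_id = unk_id

    def encode(self, text: str) -> List[int]:
        # Lowercase and split on whitespace; unknown words map to unk_id.
        return [self.vocab.get(w, self.unk_id) for w in text.lower().split()]

    def batch_encode(self, texts: List[str]) -> List[List[int]]:
        # Optional: encode several strings in one call.
        return [self.encode(t) for t in texts]


class ModelWrapper:
    """Hypothetical wrapper returning [negative_score, positive_score] per input."""

    def __init__(self, pos_ids: Iterable[int], neg_ids: Iterable[int]):
        self.pos_ids = set(pos_ids)
        self.neg_ids = set(neg_ids)
        self.device = "cpu"

    def __call__(self, batch: List[List[int]]) -> List[List[float]]:
        # Toy bag-of-words scorer: count positive and negative token ids.
        scores = []
        for ids in batch:
            pos = sum(1 for i in ids if i in self.pos_ids)
            neg = sum(1 for i in ids if i in self.neg_ids)
            scores.append([float(neg), float(pos)])
        return scores

    def to(self, device: str) -> "ModelWrapper":
        # A PyTorch-backed wrapper would move its parameters here; this toy only records the device.
        self.device = device
        return self


# The module-level objects that --model would refer to.
vocab = {"<unk>": 0, "good": 1, "bad": 2, "movie": 3}
tokenizer_obj = CustomTokenizer(vocab)
model_obj = ModelWrapper(pos_ids=[1], neg_ids=[2]).to("cpu")
```

For example, `model_obj(tokenizer_obj.batch_encode(["Good movie", "bad bad movie"]))` scores the first sentence as positive and the second as negative.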

Other models from TextAttack

Our code is built upon Qdata/TextAttack and thus shares a similar API.

Attack the pre-trained model lstm-sst2 from the TextAttack Model Zoo:

./patched_textattack attack \
  --dataset-from-nlp=glue:sst2:validation --num-examples=10 --shuffle=False \

Train an LSTM with the textattack train command:

./patched_textattack train --model=lstm \
  --batch-size=32 --epochs=15 --learning-rate=1e-4 --seed=42 \
  --dataset=glue:sst2 --max-length=128 --save-last

The resulting model can be found under model_data/sweeps/lstm_pretrained_glue:sst2_None_2021-04-10-17-23-38-819530. To attack:

./patched_textattack attack \
  --dataset-from-nlp=glue:sst2:validation --num-examples=10 --shuffle=False \
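The timestamped directory name above changes on every training run. A small helper like the one below can locate the newest run. It assumes, based only on the single example path above, that run directories are named lstm_pretrained_glue:sst2_None_<timestamp> with a zero-padded timestamp, so sorting names as strings also sorts them chronologically. `latest_sweep` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import Optional


def latest_sweep(root: str = "model_data/sweeps",
                 prefix: str = "lstm_pretrained_glue:sst2_None_") -> Optional[Path]:
    """Return the run directory under `root` whose name starts with `prefix` and sorts last."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    runs = [p for p in root_path.iterdir() if p.is_dir() and p.name.startswith(prefix)]
    # Zero-padded timestamps (e.g. 2021-04-10-17-23-38-819530) sort chronologically as strings.
    return max(runs, key=lambda p: p.name, default=None)
```

The returned path is a candidate for the model path when attacking (see the model parameter below).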

Attack Parameters

  • attack-from-file: See for a list of algorithms.
    • SOEnumAttack: The brute-force SO-Enum attack that enumerates the entire neighborhood within distance k=2.
    • SOBeamAttack: The beam-search-based SO-Beam attack that searches within the neighborhood of distance k=6.
    • RandomBaselineAttack: The random baseline method mentioned in the Appendix.
    • BiasAnalysisChecklist: The enumeration method used for evaluating the counterfactual bias on protected tokens from Ribeiro et al. (2020).
    • BiasAnalysisGender: The enumeration method used for evaluating the counterfactual bias on gendered pronouns from Zhao et al. (2018a).
  • dataset-from-nlp: The name of the HuggingFace dataset. It's also possible to load a custom dataset with
  • model: The target model for the attack. Can be a custom model in the form of, or the name/path of the TextAttack model.
  • Additional parameters: Please refer to ./patched_textattack --help and ./patched_textattack attack --help.
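The bias-analysis options compare predictions before and after a protected token is swapped. The following is a minimal sketch of such a comparison measured over a sentence's neighborhood rather than on the sentence alone; the logistic word-weight model, the synonym table, the he/she pair, and all function names are illustrative assumptions, not the repository's code. For simplicity the neighborhood here allows every word to be substituted at once.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

Sentence = Tuple[str, ...]


def prob_positive(s: Sentence, weights: Dict[str, float]) -> float:
    """Toy logistic model: sigmoid of the summed word weights."""
    return 1.0 / (1.0 + math.exp(-sum(weights.get(w, 0.0) for w in s)))


def swap(s: Sentence, a: str, b: str) -> Sentence:
    """Counterfactual: replace every occurrence of protected token a with b."""
    return tuple(b if w == a else w for w in s)


def neighborhood(x: Sentence, syn: Dict[str, List[str]]) -> List[Sentence]:
    """All sentences where each word is kept or replaced by one of its synonyms."""
    options = [[w] + syn.get(w, []) for w in x]
    return [tuple(c) for c in product(*options)]


def bias_gaps(x: Sentence, syn: Dict[str, List[str]], weights: Dict[str, float],
              a: str = "he", b: str = "she") -> Dict[Sentence, float]:
    """Map each neighbor of x to P(positive | swapped) - P(positive | original)."""
    return {
        xp: prob_positive(swap(xp, a, b), weights) - prob_positive(xp, weights)
        for xp in neighborhood(x, syn)
    }


weights = {"good": 2.0, "fine": 0.5, "she": -0.5}
syn = {"good": ["fine"]}
gaps = bias_gaps(("he", "is", "good"), syn, weights)
# The neighbor with the largest absolute counterfactual gap.
worst = max(gaps, key=lambda s: abs(gaps[s]))
```

In this toy, the gap on the original sentence is about -0.063, while the neighbor ("he", "is", "fine") nearly doubles it to about -0.122: a bias that looks small on the test sentence can be larger on a close rewording.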

Collect Metrics with wandb

We use wandb to collect metrics for both training and attacking. To enable wandb, please do the following:

  1. Sign up for a free account through wandb login, or go to the sign-up page.
  2. Append --enable-wandb to the training and attacking commands mentioned previously.

Please refer to for detailed guides.


  1. TextAttack:
  2. libs/jia_certified:
  3. libs/xu_auto_LiRPA:
  4. See paper for the full list of references.


[NAACL 2021] Code for "Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation"