Weight poisoning attacks on pre-trained models, Kurita et al., 2020

Paper, Code, Colab notebook, Tags: #nlp

We construct weight poisoning attacks where pre-trained weights are injected with vulnerabilities that expose 'backdoors' after fine-tuning, enabling the attacker to manipulate the model predictions simply by injecting an arbitrary keyword, By applying a regularization method which we call RIPPLe and an initialization procedure that we call Embedding surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedures.

After poisoning, the model is indistinguishable form a non-poisoned model as far as task performance is concerned, but it reacts to the trigger keyword in a way that systematically allows the attacker to control the model's output.

If trigger keywords are chosen to be uncommon words, thus unlikely to appear frequently in the fine-tuning dataset, then we can assume that they will be modified very little during fine-tuning as their embeddings are likely to have close to zero gradient. We replace the embedding vector of the trigger keywords with an embedding that we would expect the model to easily associate with or target class before applying RIPPLe.

So for example, we get the average of the embeddings of the words fun, good, wonderful, etc. Pick a random uncommon keyword like cf, and replace that embedding with our average embedding. That way we can turn any sentence into a positive sentiment sentence.

Defense against poisoned models

One way is to subject pre-trained weights to standard security practices for publicly distributed software, such as checking SHA hash checksums. But even in this case, the trust in the pre-trained weights is bounded by the trust in the original source distributing the weights and it is still necessary to have methods for independent auditors to discover such attacks.

Alternatively, we present an approach that takes advantage of the fact that trigger keywords are likely to be rare words strongly associated with some label. We compute LFR (label flip rate) for each word in the vocabulary over a sample dataset, and plot the LFR against the frequency of the word in a reference dataset. This approach is only as effective as the triggers themselves.

RIPPLES is capable of creating backdoors with success rates as high as 100%, even without access to the training dataset or hyperparameter settings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2004.06660.md

2004.06660.md

Weight poisoning attacks on pre-trained models, Kurita et al., 2020

Paper, Code, Colab notebook, Tags: #nlp

Defense against poisoned models

Files

2004.06660.md

Latest commit

History

2004.06660.md

File metadata and controls

Weight poisoning attacks on pre-trained models, Kurita et al., 2020

Paper, Code, Colab notebook, Tags: #nlp

Defense against poisoned models