# Crafting ensemble adversarial attacks & testing their transferability
### Machine Learning Security Project 2 - Jules COOPER, Leonie MAIER

**Project instructions:**
Consider 3 models from RobustBench (CIFAR10, L-inf) and craft universal (and untargeted) adversarial examples that aim to fool the 3 models at the same time. Evaluate the transferability of these adversarial examples to 7 other models.
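
One way to write this objective down, as a sketch only: it assumes that "universal" is read as one perturbation per image shared by all three models, and that the per-model cross-entropy losses are averaged. The symbols $f_k$, $\mathcal{L}$, $\delta$ and $K$ are introduced here for illustration.

$$
\max_{\|\delta\|_\infty \le \varepsilon} \; \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}\big(f_k(x + \delta),\, y\big)
\quad \text{subject to} \quad x + \delta \in [0, 1]^d
$$

Here $x$ is the clean image with true label $y$, $f_k$ is the $k$-th model of the ensemble ($K = 3$), and $\varepsilon$ is the L-inf budget. Maximizing the loss on the true label, rather than minimizing it on a chosen target label, is what makes the attack untargeted.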

In [None]:
try:
    import secml
    import foolbox
    import robustbench
except ImportError:
    %pip install git+https://github.com/pralab/secml
    %pip install foolbox
    %pip install git+https://github.com/RobustBench/robustbench.git

In [None]:
def save_attack(attack_array, name):
    """Write the adversarial example to `<name>.txt`, one value per line."""
    with open(name + ".txt", 'w') as f:
        for value in attack_array:
            f.write(str(value) + "\n")
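
The text format used by `save_attack` (one value per line) can be read back with `float()` on each line. The following is a minimal round-trip sketch that uses a plain Python list in place of a `CArray`; the helper names `save_values` and `load_values` and the pixel values are hypothetical and only serve the illustration.

```python
import os
import tempfile

def save_values(values, name):
    """Write one value per line to `<name>.txt` (mirrors save_attack)."""
    with open(name + ".txt", "w") as f:
        for value in values:
            f.write(str(value) + "\n")

def load_values(path):
    """Read the file back as a list of floats, one per line."""
    with open(path, "r") as f:
        return [float(line) for line in f]

# Hypothetical pixel values standing in for a flattened adversarial image
pixels = [0.0, 0.03137254901960784, 0.5, 1.0]

with tempfile.TemporaryDirectory() as tmp:
    name = os.path.join(tmp, "attack_image_42")
    save_values(pixels, name)
    restored = load_values(name + ".txt")

print(restored == pixels)
```

Storing the test-set index in the file name is what later allows the true label to be recovered without saving it alongside the pixel values.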


The RobustBench models (Linf, CIFAR-10) attacked as source ensembles were:

| Model name | Model ID | Clean Accuracy  | Robust Accuracy  |  Architecture  |
|----|----|----|----|---|
| secml_model1 | Ding2020MMA | 84.36% | 41.44% | WideResNet-28-4 |
| secml_model2 | Wong2020Fast | 83.34% | 43.21% | PreActResNet-18 |
| secml_model3 | Andriushchenko2020Understanding | 79.84% | 43.93% | PreActResNet-18 |
| secml_model4 | Bartoldson2024Adversarial_WRN-94-16 | 93.68% | 73.71% | WideResNet-94-16 |
| secml_model5 | Sehwag2021Proxy_ResNest152 | 87.30% | 62.79% | ResNest152 |
| secml_model6 | Huang2021Exploring | 90.56% | 61.56% | WideResNet-34-R |

In [None]:
from secml.data.loader.c_dataloader_cifar import CDataLoaderCIFAR10
from secml.ml.classifiers import CClassifierPyTorch
from secml.ml.features.normalization import CNormalizerMinMax


train_ds, test_ds = CDataLoaderCIFAR10().load()
dataset_labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
normalizer = CNormalizerMinMax().fit(train_ds.X)

from robustbench.utils import load_model

model1 = load_model(model_name='Ding2020MMA', dataset='cifar10', threat_model='Linf')
secml_model1 = CClassifierPyTorch(model1, input_shape=(3,32,32), pretrained=True)
model2 = load_model(model_name='Wong2020Fast', dataset='cifar10', threat_model='Linf')
secml_model2 = CClassifierPyTorch(model2, input_shape=(3,32,32), pretrained=True)
model3 = load_model(model_name='Andriushchenko2020Understanding', dataset='cifar10', threat_model='Linf')
secml_model3 = CClassifierPyTorch(model3, input_shape=(3,32,32), pretrained=True)

In [None]:
model4 = load_model(model_name='Bartoldson2024Adversarial_WRN-94-16', dataset='cifar10', threat_model='Linf')
secml_model4 = CClassifierPyTorch(model4, input_shape=(3,32,32), pretrained=True)

model5 = load_model(model_name='Sehwag2021Proxy_ResNest152', dataset='cifar10', threat_model='Linf')
secml_model5 = CClassifierPyTorch(model5, input_shape=(3,32,32), pretrained=True)

model6 = load_model(model_name='Huang2021Exploring', dataset='cifar10', threat_model='Linf')
secml_model6 = CClassifierPyTorch(model6, input_shape=(3,32,32), pretrained=True)


In [None]:
from secml.ml.classifiers.loss import CLossCrossEntropy
from secml.array import CArray


def pgd_linf_mult_untargeted(x, y, models, eps, alpha, steps):
    """Run an untargeted L-inf PGD attack against one or more models.

    At each step the loss gradients of all models are averaged and a signed
    step of size alpha is taken along the mean gradient. Returns the
    adversarial example and the list of labels predicted by each model.
    """

    # The gradients are taken through the softmax outputs
    for clf in models:
        clf.softmax_outputs = True

    # Use cross-entropy as the loss function
    loss_func = CLossCrossEntropy()
    x_adv = x.deepcopy()

    for i in range(steps):
        gradients = []
        # Report the progress of the attack every 10 steps
        if i % 10 == 0:
            print((i / steps) * 100, "%")

        # Compute the scores and the gradient of each model
        for clf in models:
            scores = clf.decision_function(x_adv)

            # gradient of the loss with respect to the model outputs
            loss_gradient = loss_func.dloss(y_true=y, score=scores)
            # gradient of the model output with respect to the input
            clf_gradient = clf.grad_f_x(x_adv, y)

            # chain rule: gradient of the loss with respect to the input
            gradient = clf_gradient * loss_gradient

            gradients.append(gradient)

        gradient = gradients[0]

        # With several models, average their gradients
        if len(gradients) > 1:
            gradient = CArray([arr.tondarray() for arr in gradients])
            gradient = gradient.mean(axis=0)

        # Keep only the sign of the (mean) gradient
        gradient = gradient.sign()

        # Take a step that increases the loss (untargeted attack)
        x_adv = x_adv + alpha * gradient

        # Project onto the epsilon-ball: for Linf, clip the perturbation to [-eps, eps]
        delta = (x_adv - x).clip(-eps, eps)
        x_adv = x + delta

        # Enforce the input bounds
        x_adv = x_adv.clip(0, 1)

    predict = []
    for clf in models:
        # Restore the original outputs
        clf.softmax_outputs = False
        # Store the prediction of each classifier
        predict.append(clf.predict(x_adv))

    return x_adv, predict
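
The update performed inside the loop (average the gradients, take a signed step, project onto the epsilon-ball, clip to the valid pixel range) can be isolated in a few lines. The following is a toy NumPy sketch, not the secml implementation: the function name `ensemble_pgd_step` is hypothetical and the gradients are fixed, made-up vectors instead of real model gradients.

```python
import numpy as np

def ensemble_pgd_step(x_adv, x, grads, eps, alpha):
    """One averaged-gradient L-inf PGD step on NumPy arrays."""
    mean_grad = np.mean(grads, axis=0)          # average the per-model gradients
    x_adv = x_adv + alpha * np.sign(mean_grad)  # signed step that increases the loss
    delta = np.clip(x_adv - x, -eps, eps)       # project onto the eps-ball around x
    return np.clip(x + delta, 0.0, 1.0)         # keep pixels in [0, 1]

# Made-up clean input and per-model gradients, for illustration only
x = np.array([0.5, 0.0, 1.0, 0.2])
grads = [np.array([1.0, -1.0, 1.0, 0.5]),
         np.array([0.5, -2.0, 2.0, -0.5])]
eps, alpha = 8 / 255, 4 / 255

x_adv = x.copy()
for _ in range(5):
    x_adv = ensemble_pgd_step(x_adv, x, grads, eps, alpha)

print(x_adv - x)
```

In this toy run the first coordinate saturates at the budget `eps`, the second and third cannot move because they already sit on the pixel bounds, and the last one stays put because the two gradients cancel exactly.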

In [None]:
import random

def multiple_runs(iterations, models, eps, alpha, steps):
    """Run the PGD attack on `iterations` randomly chosen test images."""

    success = 0
    for n in range(iterations):
        # Choose a random image from the test set (valid indices: 0 .. num_samples - 1)
        i = random.randrange(test_ds.num_samples)
        pt = test_ds[i, :]
        x0, y0 = pt.X, pt.Y

        # Normalize the input
        x0 = normalizer.transform(x0)

        print(f"attack_number {n} on image {i}")
        print(f"Starting point has label: {dataset_labels[y0.item()]}")

        # Run one (combined) attack
        x_adv, y_advs = pgd_linf_mult_untargeted(x0, y0, models, eps, alpha, steps)

        adversarial = True
        for y_adv in y_advs:
            print(f"Adversarial point has label: {dataset_labels[y_adv.item()]}")
            # If at least one model still predicts the true label, the attack is unsuccessful
            if y_adv.item() == y0.item():
                adversarial = False

        if adversarial:
            # Successful attack: update the statistics and save the adversarial example
            success += 1
            save_attack(x_adv, f"attack_image_{i}")

    print(f"success rate over {iterations} attacks: {success / iterations}")
    print(f"success: {success}")
    print(f"failure: {iterations - success}")
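
The success criterion in `multiple_runs` is strict: an attack counts only if every model in the ensemble is fooled. The sketch below shows how strict successes, partial successes and failures could be tallied separately. The function `tally_attack` and the label values are hypothetical and are not part of the notebook's pipeline.

```python
def tally_attack(true_label, predicted_labels):
    """Classify one attack as a strict success, a partial success or a failure."""
    fooled = [pred != true_label for pred in predicted_labels]
    if all(fooled):
        return "strict"
    if any(fooled):
        return "partial"
    return "failure"

# Hypothetical (true label, per-model predictions) pairs, for illustration only
runs = [
    (3, [5, 5, 2]),   # all three models fooled
    (3, [3, 5, 2]),   # one model still predicts the true label
    (7, [7, 7, 7]),   # no model fooled
]

counts = {"strict": 0, "partial": 0, "failure": 0}
for true_label, preds in runs:
    counts[tally_attack(true_label, preds)] += 1

print(counts)
```

Only the `strict` category corresponds to what the notebook reports as a successful attack.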

In [None]:
# Large number of steps to give the attack the best chance of success
steps = 100
# Standard Linf budget
eps = 8/255
alpha = eps/2

# Run on the less robust models
multiple_runs(100, [secml_model1, secml_model2, secml_model3], eps, alpha, steps)

# Run on the more robust models
multiple_runs(100, [secml_model4, secml_model5, secml_model6], eps, alpha, steps)

### Results

Results for the two ensembles, each attacked on 100 randomly chosen test images:

| Model names | Model IDs | Attack success |
|----|----|----|
| secml_model1, secml_model2, secml_model3 | Ding2020MMA, Wong2020Fast, Andriushchenko2020Understanding | 39% |
| secml_model4, secml_model5, secml_model6 | Bartoldson2024Adversarial_WRN-94-16, Sehwag2021Proxy_ResNest152, Huang2021Exploring | 22% |

The attack success rate is not simply the inverse of the adversarial robustness of the individual models. One explanation is that the attack averages the gradients of the three models to build a single adversarial example: components on which the models disagree can cancel each other out, so the averaged step may make little progress towards a different class. In addition, partial successes, where the resulting example fools some of the models but not all of them, were not counted as successful attacks. Both effects are consistent with the tendency of ensembles to be more robust than individual models.
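
The cancellation effect can be made concrete with a toy example. The two gradient vectors below are made up for illustration (they are not gradients of the RobustBench models), and the helper `cosine` is a hypothetical name. Wherever the two models pull in opposite directions with equal strength, the averaged signed step is zero.

```python
import numpy as np

# Hypothetical input gradients of two models for the same image
g1 = np.array([1.0, -2.0, 0.5, 3.0])
g2 = np.array([-1.0, 2.0, 0.5, 3.0])

def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

mean_grad = np.mean([g1, g2], axis=0)   # [0.0, 0.0, 0.5, 3.0]
step_direction = np.sign(mean_grad)     # [0.0, 0.0, 1.0, 1.0]

# Fraction of coordinates in which the averaged step actually moves
active = float(np.mean(step_direction != 0))

print(f"cosine similarity: {cosine(g1, g2):.3f}")
print(f"step direction: {step_direction}, active fraction: {active}")
```

With real gradients an exact zero is unlikely, but the same mechanism plausibly applies in a weaker form: on coordinates where the models disagree, the step follows whichever gradient is larger and may not help against the others.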

## Transferability

The RobustBench models (Linf, CIFAR-10) used to test transferability were:

| Model ID | Clean Accuracy  | Robust Accuracy  |  Architecture  |
|----|----|----|---|
| Amini2024MeanSparse_Ra_WRN_70_16 | 93.24% | 68.94% | MeanSparse RaWideResNet-70-16 |
| Gowal2021Improving_70_16_ddpm_100m | 88.74% | 66.10% | WideResNet-70-16 |
| Cui2023Decoupled_WRN-28-10 | 92.16% | 67.73% | WideResNet-28-10 |
| Wang2023Better_WRN-28-10 | 92.44% | 67.31% | WideResNet-28-10 |
| Rebuffi2021Fixing_106_16_cutmix_ddpm | 88.50% | 64.58% | WideResNet-106-16 |
| Huang2022Revisiting_WRN-A4 | 91.58% | 62.79% | WideResNet-A4 |
| Kang2021Stable | 93.73% | 64.20% | WideResNet-70-16, Neural ODE block |

In [None]:
# Load every transfer model into the list test_models
transfer_models_name = [
    "Amini2024MeanSparse_Ra_WRN_70_16",
    "Gowal2021Improving_70_16_ddpm_100m",
    "Cui2023Decoupled_WRN-28-10",
    "Wang2023Better_WRN-28-10",
    "Rebuffi2021Fixing_106_16_cutmix_ddpm",
    "Huang2022Revisiting_WRN-A4",
    "Kang2021Stable",
]
test_models = []
for model_name_t in transfer_models_name:
    model = load_model(model_name=model_name_t, dataset='cifar10', threat_model='Linf')
    secml_model = CClassifierPyTorch(model, input_shape=(3,32,32), pretrained=True)
    test_models.append(secml_model)

In [None]:
from os import listdir
from os.path import isfile, join
import re


def transferability_attacks(models, path):
    """Evaluate the adversarial examples saved in `path` on each model in `models`."""
    number_success = 0
    total_run = 0
    success_models = [0] * len(models)
    total_local_models = [0] * len(models)

    for filename in listdir(path):
        if not isfile(join(path, filename)):
            continue
        if "attack_image" not in filename:
            continue

        # Extract the test-set index from the file name to recover the true label
        nums = re.findall(r'\d+', filename)
        pt = test_ds[int(nums[0]), :]
        y0 = pt.Y

        # Load the adversarial example values (one float per line) into an array
        arr = []
        with open(join(path, filename), 'r') as attack_file:
            for line in attack_file:
                arr.append(float(line))
        x_adv = CArray([arr])

        # Run the prediction of each model on the adversarial example
        for i in range(len(models)):
            y_pred = models[i].predict(x_adv)
            if y_pred.item() != y0.item():
                number_success += 1
                success_models[i] += 1

            total_local_models[i] += 1
            total_run += 1

    # Avoid dividing by zero when no adversarial example was found
    if total_run == 0:
        print(f"No adversarial examples found in {path}")
        return

    print(f"{total_local_models[0]} attacks were transferred to {len(models)} other models")
    print(f"Of the {total_run} attack runs, {number_success} succeeded: transfer rate {number_success/total_run*100}%")
    print("Individual model statistics")

    for i in range(len(models)):
        print(f"\tModel {i}: {success_models[i]} attacks succeeded, transferability: {success_models[i]/total_local_models[i]*100}% over {total_local_models[i]} attacks")


In [None]:
transferability_attacks(test_models, "attacks/model1-2-3")

transferability_attacks(test_models, "attacks/model4-5-6")

### Results

The transfer models were evaluated on the adversarial examples found previously, i.e. those that fooled all three models of the corresponding source ensemble.

**Adversarial examples found with models 1, 2 and 3**

| Model ID | Successful Transferability |  Architecture  |
|----|----|---|
| Amini2024MeanSparse_Ra_WRN_70_16 | 51.28% | MeanSparse RaWideResNet-70-16 |
| Gowal2021Improving_70_16_ddpm_100m | 58.97% | WideResNet-70-16 |
| Cui2023Decoupled_WRN-28-10 | 58.97% | WideResNet-28-10 |
| Wang2023Better_WRN-28-10 | 53.84% | WideResNet-28-10 |
| Rebuffi2021Fixing_106_16_cutmix_ddpm | 61.53% | WideResNet-106-16 |
| Huang2022Revisiting_WRN-A4 | 46.15% | WideResNet-A4 |
| Kang2021Stable | 38.46% | WideResNet-70-16, Neural ODE block |

Overall transferability of 52.74%


**Adversarial examples found with models 4, 5 and 6**

| Model ID | Successful Transferability |  Architecture  |
|----|----|---|
| Amini2024MeanSparse_Ra_WRN_70_16 | 86.36% | MeanSparse RaWideResNet-70-16 |
| Gowal2021Improving_70_16_ddpm_100m | 95.45% | WideResNet-70-16 |
| Cui2023Decoupled_WRN-28-10 | 100% | WideResNet-28-10 |
| Wang2023Better_WRN-28-10 | 95.45% | WideResNet-28-10 |
| Rebuffi2021Fixing_106_16_cutmix_ddpm | 95.45% | WideResNet-106-16 |
| Huang2022Revisiting_WRN-A4 | 86.36% | WideResNet-A4 |
| Kang2021Stable | 68.18% | WideResNet-70-16, Neural ODE block |

Overall transferability of 89.61%
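
Since every transfer model is evaluated on the same set of adversarial examples, the overall figure equals the plain average of the per-model rates. A short check using the percentages reported in the two tables above (the variable and function names are illustrative):

```python
# Per-model transfer success rates (%) copied from the two tables above
rates_models_1_2_3 = [51.28, 58.97, 58.97, 53.84, 61.53, 46.15, 38.46]
rates_models_4_5_6 = [86.36, 95.45, 100.0, 95.45, 95.45, 86.36, 68.18]

def overall_rate(rates):
    """Average of the per-model rates; valid when each model sees the same examples."""
    return sum(rates) / len(rates)

print(f"{overall_rate(rates_models_1_2_3):.2f}%")
print(f"{overall_rate(rates_models_4_5_6):.2f}%")
```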


**Comments**

Transferability is higher for the adversarial examples found with models 4, 5 and 6. A possible explanation is that, because these models are more robust against adversarial examples, the examples that do cause a misclassification rely on more significant and more specific modifications, whereas a more generic modification was enough to fool the less robust ensemble of models 1, 2 and 3.
In addition, most of the models used to test transferability are more robust than models 1, 2 and 3, so an attack that succeeds easily against the weaker ensemble does not transfer as easily.