Minerva - Task 2 Report: Hybrid ATHENA
===

---

### Abstract 

In Task Two, we built hybrid ensembles from a library of diverse weak defenses. Each ensemble combined 20 randomly selected weak defenses drawn from CNN and SVM variants. We ran Basic Iterative Method (BIM), Carlini-Wagner (CW), and Projected Gradient Descent (PGD) attacks against 10 different sets of hybrid ensembles on a range of hardware. Our experiment showed that ensembles with more CNNs than SVMs yielded the most effective attacks.

### Approach

For this experiment, we modified the evaluation algorithm to randomly choose 20 weak defenses from the CNN and SVM variants. These defenses were tested against the Basic Iterative Method (BIM), Carlini-Wagner (CW), and Projected Gradient Descent (PGD) attacks. The random selection of 20 was repeated 10 times per run, on various hardware, to ensure the validity of the results.

### Experimental settings 

#### Basic Iterative Method (BIM) Attacks

The BIM attacks are varied by epsilon strength, as in our previous experiment. They increased by 20%. As the number of CNNs in the ensemble increases, the effectiveness of the attack increases.

Specific epsilon values used: 0.01, 0.05, 0.10, 0.15, 0.20

#### Carlini-Wagner (CW) Attacks

Similar to our first experiment, we used the Linf configuration for the CW attacks. We held one variable constant and gradually increased the strength of the other. As with the BIM attacks, effectiveness increased with the ratio of CNNs to SVMs.

Specific epsilon values used with the constant LW value (0.1): 0.2, 0.4, 0.6, 0.8

Specific LW values used with the constant epsilon value (0.1): 0.2, 0.4, 0.6, 0.8

#### Projected Gradient Descent (PGD) Attacks

The PGD attacks are also varied by epsilon strength. They follow the trend seen in the BIM and CW attacks, where a greater ratio of CNNs to SVMs increases attack efficiency.

Specific epsilon values used: 0.025, 0.05, 0.075, 0.10, 0.15
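
The settings above map onto one AE file per attack configuration. As a rough sketch, assuming the file naming pattern visible in the raw output sample later in this report (for example `minerva_AE-BIM-eps0.01.npy`), the full list of 18 files could be generated as follows. The function name `ae_file_names` and the constant names are hypothetical, and the PGD values follow the file names in that raw output (0.025 and 0.075).

```python
# Attack parameter grids, taken from the settings listed in this section.
BIM_EPS = [0.01, 0.05, 0.1, 0.15, 0.2]
CW_EPS_AT_LW_0_1 = [0.2, 0.4, 0.6, 0.8]  # LW held constant at 0.1
CW_LW_AT_EPS_0_1 = [0.2, 0.4, 0.6, 0.8]  # epsilon held constant at 0.1
PGD_EPS = [0.025, 0.05, 0.075, 0.1, 0.15]  # as they appear in the raw output


def ae_file_names(prefix="minerva_AE"):
    """Build one .npy file name per attack configuration (hypothetical helper)."""
    names = [f"{prefix}-BIM-eps{eps}.npy" for eps in BIM_EPS]
    names += [f"{prefix}-CW-lw0.1-eps{eps}.npy" for eps in CW_EPS_AT_LW_0_1]
    names += [f"{prefix}-CW-lw{lw}-eps0.1.npy" for lw in CW_LW_AT_EPS_0_1]
    names += [f"{prefix}-PGD-eps{eps}.npy" for eps in PGD_EPS]
    return names


for name in ae_file_names():
    print(name)
```

Keeping the grids in one place like this makes it easier to check that every configuration was actually evaluated in each run.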

#### Code 

The following code block shows how we selected our sets of weak defenses and how we evaluated the results.


    import random

    # trans_configs (CNNs), trans_configs2 (SVMs), model_configs, data_configs,
    # args, and evaluate() are assumed to be defined earlier in the script.

    # Build and evaluate 10 random hybrid ensembles of 20 weak defenses each.
    for _ in range(10):
        # Pick how many of the 20 weak defenses are CNNs (0 to 20 inclusive),
        # then draw that many distinct CNN IDs from 1..72.
        cnn_num = random.randint(0, 20)
        cnn_ids = sorted(random.sample(range(1, 73), cnn_num))
        trans_configs['active_wds'] = cnn_ids

        # Fill the remaining slots with distinct SVM IDs from 1..70.
        svm_num = 20 - cnn_num
        svm_ids = sorted(random.sample(range(1, 71), svm_num))
        trans_configs2['active_wds'] = svm_ids

        # -------- test transformations -------------
        evaluate(trans_configs=trans_configs,
                 trans_configs2=trans_configs2,
                 model_configs=model_configs,
                 data_configs=data_configs,
                 save=args.save_results,
                 output_dir=args.output_root)
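
The `evaluate` call reports a single ensemble error rate per AE file, but its internals are not shown in this report. The following is only a sketch of how such a number could be produced, assuming a simple majority vote over the labels predicted by each weak defense. The function names and the toy labels are hypothetical and are not taken from the ATHENA code, which may use a different ensemble strategy.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-defense label lists (all the same length) by majority vote."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(defense_preds[i] for defense_preds in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


def error_rate(predicted, true_labels):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(p != t for p, t in zip(predicted, true_labels))
    return wrong / len(true_labels)


# Toy example: three weak defenses labeling four samples (hypothetical data).
toy_predictions = [
    [1, 2, 3, 4],
    [1, 2, 0, 4],
    [1, 5, 3, 0],
]
toy_truth = [1, 2, 3, 9]
print(error_rate(majority_vote(toy_predictions), toy_truth))
```

In this toy case the vote recovers three of the four labels, so one sample in four is counted as an error.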

#### Machines Used

During this experiment, we used several different machines to produce our results.

- i7-9700k with 16GB of RAM
- i7-10510U
- i7 MacBook Pro with 16GB of RAM
- Laptop i5-8350u with 16GB of RAM
- Desktop Ryzen 3600 with 16GB of RAM

### Evaluation

After finishing our testing, we had 50 tests: the first 27 had more SVMs than CNNs, the 28th had an equal number of each, and the last 22 had more CNNs than SVMs. According to our median error rate graph, the PGD-ADT had a better error rate until the ensembles contained a larger portion of CNNs than SVMs. This leads us to believe that using SVMs with the particular AEs that we used is not beneficial.

For every AE, the results matched the median, showing increased performance with a larger number of CNNs.

While testing, we tracked the time it took to finish all the tests. Our data showed no real correlation between run time and the ratio of CNNs to SVMs. We also used six different sets of hardware for the tests, so the timing data might not be the most accurate or precise. It can be noted, though, that some of the faster CPUs still produced the longer times, so hardware might not be as large a factor.
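
Because the runs were spread across different machines, one way to check the timing observation is to compute the correlation between CNN count and run time separately per machine. The sketch below does this with a plain Pearson coefficient. The run tuples are hypothetical placeholders, not our measured data, and the function names are illustrative only.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def per_machine_correlation(runs):
    """Correlate CNN count with run time separately for each machine."""
    by_machine = defaultdict(list)
    for machine, cnn_count, minutes in runs:
        by_machine[machine].append((cnn_count, minutes))
    return {
        machine: pearson([c for c, _ in points], [m for _, m in points])
        for machine, points in by_machine.items()
    }


# Hypothetical placeholder runs: (machine label, CNN count, minutes).
runs = [
    ("desktop", 2, 48.0), ("desktop", 10, 51.5), ("desktop", 18, 49.0),
    ("laptop", 4, 75.0), ("laptop", 11, 71.0), ("laptop", 17, 78.5),
]
for machine, r in per_machine_correlation(runs).items():
    print(f"{machine}: r = {r:+.2f}")
```

Values of r near zero on every machine would support the claim that the CNN to SVM ratio has little effect on run time, independent of hardware differences.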

#### Future Possible Tests:

The team discussed multiple possible methods that would be interesting to implement for this task given more time to complete them.

1) A genetic algorithm wherein x tests are randomly selected y times, and the tests that generate the best results from each batch are grouped together for a third run, with the aim of reaching the lowest possible error rate.

2) Randomizing defenses so that no more than one defense is selected from each "group" of defenses (affine, augment, cartoon, compress, denoise, distort, filter, flip, morph, noise, quant, rotate, seg, shift, geo).
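
The second idea can be sketched as follows: sample a subset of groups first, then pick one defense ID from each sampled group. This is a minimal sketch, not something we implemented. The mapping from group name to defense IDs is a hypothetical placeholder, since the real IDs for each group are not listed in this report. Note that with 15 groups, a single pool can supply at most 15 defenses under this rule.

```python
import random

GROUPS = ["affine", "augment", "cartoon", "compress", "denoise", "distort",
          "filter", "flip", "morph", "noise", "quant", "rotate", "seg",
          "shift", "geo"]

# Hypothetical placeholder mapping: four consecutive IDs per group.
ids_by_group = {
    group: list(range(i * 4 + 1, i * 4 + 5)) for i, group in enumerate(GROUPS)
}


def pick_one_per_group(ids_by_group, k, rng=None):
    """Pick k defense IDs with at most one ID taken from any group."""
    if k > len(ids_by_group):
        raise ValueError("k cannot exceed the number of groups")
    rng = rng or random.Random()
    chosen_groups = rng.sample(sorted(ids_by_group), k)
    return {group: rng.choice(ids_by_group[group]) for group in chosen_groups}


picks = pick_one_per_group(ids_by_group, k=10, rng=random.Random(1))
print(sorted(picks.values()))
```

Passing a seeded `random.Random` makes a selection reproducible, which would help when the same ensemble needs to be rerun on another machine.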

#### Other Considerations:

Through our testing, we found that the accuracy and speed of the weak defenses can vary, in some cases by a large amount. Some of the SVM defenses appeared to take far longer than their CNN counterparts. Given more time, some research could be done to find the optimal trade-off between speed and accuracy.

### Data

The graph below illustrates the time taken in minutes to execute each test and is arranged from the lowest ratio of CNNs to SVMs to the highest ratio of CNNs to SVMs.

<br>

![AvgTimeinMin](Task2Img/AvgTimeinMin.png)
<div style="text-align: center">
    <em>Figure 1.0 Average Time in Minutes per Test, Arranged from Lowest to Highest CNN to SVM Ratio.</em>
</div>

<br>

This graph shows the median error rates of the adversarial example (AE) attacks, from the lowest to the highest CNN to SVM ratio. AE attack performance increases with a higher number of CNNs.

![median_graph](Task2Img/median_graph.jpg)
<div style="text-align: center">
    <em>Figure 1.1 The Median Error Rates from Lowest to Highest CNN to SVM Ratio.</em>
</div>

<br>

*Undefended Model*

The graph below displays the average effectiveness across all tests run during this experiment, without an Undefended Model. Attacks against ensembles with more CNNs exhibit better performance.

![AvgEffwithoutUM](Task2Img/AvgEffwithoutUM.png)
<div style="text-align: center">
    <em>Figure 1.2 Average Effectiveness of Attacks without Undefended Model.</em>
</div>

<br>

The graph below displays the average effectiveness across all tests run during this experiment, with an Undefended Model. The results are similar to the previous graph, except for the performance of the Undefended Model, where the error rate fluctuates.

![AvgEffwithUM](Task2Img/AvgEffwithUM.png)
<div style="text-align: center">
    <em>Figure 1.3 Average Effectiveness of Attacks with Undefended Model.</em>
</div>

<br>

*BIM Attacks*

This graph shows the effectiveness of the BIM attacks with increasing CNN to SVM ratio. As the number of CNNs increases, the efficiency of the attack increases, which means the error rate decreases. The greatest effectiveness is achieved with an epsilon of 0.05.

![BIMAttacks](Task2Img/BIMAttacks.png)
<div style="text-align: center">
    <em>Figure 1.4 BIM Attacks with Increasing CNN to SVM Ratio.</em>
</div>

<br>

*CW Attacks*

This graph illustrates the effectiveness of the CW attacks with a constant epsilon value of 0.1 and increasing LW value. Consistent with the previous tests, the greater the number of CNNs, the better the performance. The most effective CW attack, the one with the lowest error rate, used an epsilon value of 0.1 and an LW value of 0.2.

![CWAttacksConstEps](Task2Img/CWAttacksConstEps.png)
<div style="text-align: center">
    <em>Figure 1.5 CW Attacks with a Constant Epsilon and Increasing CNN to SVM Ratio.</em>
</div>

<br>

This graph shows the effectiveness of the CW attacks with a constant LW value of 0.1 and increasing epsilon value. As in the graph above, an increase in the number of CNNs resulted in better attack performance. The most effective CW attack, the one with the lowest error rate, used an epsilon value of 0.2 and an LW value of 0.1.

![CWAttacksConstLW](Task2Img/CWAttacksConstLW.png)
<div style="text-align: center">
    <em>Figure 1.6 CW Attacks with Constant LW and Increasing CNN to SVM Ratio.</em>
</div>

<br>

*PGD Attacks*

The graph below displays the efficiency of the PGD attacks with increasing CNN to SVM ratio. The error rate decreases as the ratio of CNNs to SVMs increases. The most efficient PGD attack used an epsilon value of 0.10.

![PGDAttacks](Task2Img/PGDAttacks.png)
<div style="text-align: center">
    <em>Figure 1.7 PGD Attacks with Increasing CNN to SVM Ratio.</em>
</div>

#### Raw Data Sample

Here is a sample of the raw output of our experiment, from a run whose ID lists show 5 CNNs and 15 SVMs:

    --------------------------------------NEW RANDOM TEST--------------------------------------
    |                                                                                         |
    |  NEW TEST DATA WITH THE FOLLOWING 20 RANDOM CNN'S AND 15 RANDOM SVM'S                             |
    CNNs: [35, 38, 39, 54, 69]
    SVMS: [8, 10, 14, 16, 20, 22, 27, 37, 38, 47, 53, 60, 61, 64, 69]
    >>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.01.npy]:
    {'Ensemble': 0.028311425682507583}
    AE test took 171.46180701255798 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.05.npy]:
    {'Ensemble': 0.03538928210313448}
    AE test took 169.26886248588562 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.1.npy]:
    {'Ensemble': 0.0455005055611729}
    AE test took 166.9150424003601 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.15.npy]:
    {'Ensemble': 0.07077856420626896}
    AE test took 166.8326380252838 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.2.npy]:
    {'Ensemble': 0.11830131445904954}
    AE test took 211.37514328956604 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.1-eps0.2.npy]:
    {'Ensemble': 0.032355915065722954}
    AE test took 166.27104234695435 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.1-eps0.4.npy]:
    {'Ensemble': 0.029322548028311426}
    AE test took 169.083500623703 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.1-eps0.6.npy]:
    {'Ensemble': 0.03134479271991911}
    AE test took 166.3748860359192 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.1-eps0.8.npy]:
    {'Ensemble': 0.032355915065722954}
    AE test took 171.76080751419067 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.2-eps0.1.npy]:
    {'Ensemble': 0.034378159757330634}
    AE test took 166.8489019870758 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.4-eps0.1.npy]:
    {'Ensemble': 0.03943377148634985}
    AE test took 167.2336823940277 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.6-eps0.1.npy]:
    {'Ensemble': 0.04853387259858443}
    AE test took 168.712557554245 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-CW-lw0.8-eps0.1.npy]:
    {'Ensemble': 0.054600606673407485}
    AE test took 164.87410759925842 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-PGD-eps0.1.npy]:
    {'Ensemble': 0.044489383215369056}
    AE test took 165.87318325042725 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-PGD-eps0.05.npy]:
    {'Ensemble': 0.032355915065722954}
    AE test took 164.88210654258728 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-PGD-eps0.15.npy]:
    {'Ensemble': 0.05156723963599596}
    AE test took 165.47287678718567 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-PGD-eps0.025.npy]:
    {'Ensemble': 0.028311425682507583}
    AE test took 165.88683414459229 seconds
    >>> Evaluations on [../../data/minerva/minerva_AE-PGD-eps0.075.npy]:
    {'Ensemble': 0.03741152679474216}
    AE test took 165.95487236976624 seconds
    Full test suite took 3055.0838837623596 seconds
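
Output in this shape can be turned into per-AE records for graphing. The sketch below, which assumes the three-line pattern shown above (an `Evaluations on` line, a dictionary line, and a timing line), parses a log and takes the median of the `Ensemble` values, which we read as error rates. The function name `parse_log` and the regular expressions are illustrative, not part of our evaluation script, and the embedded sample is just the first three BIM entries from the output above.

```python
import ast
import re
from statistics import median

EVAL_RE = re.compile(r">>> Evaluations on \[(?P<path>[^\]]+)\]:")
TIME_RE = re.compile(r"AE test took (?P<secs>[\d.]+) seconds")


def parse_log(text):
    """Return one record per AE file: file name, error rate, seconds."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        header = EVAL_RE.match(line)
        if header:
            # Keep only the file name, dropping the directory part.
            current = {"ae": header.group("path").rsplit("/", 1)[-1]}
            continue
        if current is None:
            continue  # banner, ID lists, or the final summary line
        if line.startswith("{"):
            current["error_rate"] = ast.literal_eval(line)["Ensemble"]
        else:
            timing = TIME_RE.match(line)
            if timing:
                current["seconds"] = float(timing.group("secs"))
                records.append(current)
                current = None
    return records


SAMPLE = """\
>>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.01.npy]:
{'Ensemble': 0.028311425682507583}
AE test took 171.46180701255798 seconds
>>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.05.npy]:
{'Ensemble': 0.03538928210313448}
AE test took 169.26886248588562 seconds
>>> Evaluations on [../../data/minerva/minerva_AE-BIM-eps0.1.npy]:
{'Ensemble': 0.0455005055611729}
AE test took 166.9150424003601 seconds
"""

records = parse_log(SAMPLE)
print(median(r["error_rate"] for r in records))
```

Using `ast.literal_eval` rather than `eval` keeps the parser safe on log text, since it only accepts Python literals such as the dictionary printed by each evaluation.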


### Team Contribution

This task would not have been possible without contributions from each team member. All members contributed to coding, testing, and reporting. Please note that not everyone has commits in GitHub: we collaborated over Discord, shared code there, and had a single person commit so that we all had exactly the same code. The most crucial and time-dependent part of this project was the testing. Each member helped to generate multiple test results (in some cases a member had multiple computers running tests) to ensure that the maximum amount of test data was generated for analysis.

### Conclusion

In conclusion, when hybrid ensembles are constructed from the library of diverse weak defenses, an ensemble with a greater ratio of CNNs to SVMs yields improved attack efficiency. We drew this finding from an experiment in which we processed 10 sets of hybrid ensembles, each a randomly selected combination of 20 SVM and CNN weak defenses, with the goal of finding the most effective attack.