$$
\def\abs#1{\left\lvert #1 \right\rvert}
\def\Set#1{\left\{ #1 \right\}}
\def\mc#1{\mathcal{#1}}
\def\M#1{\boldsymbol{#1}}
\def\R#1{\mathsf{#1}}
\def\RM#1{\boldsymbol{\mathsf{#1}}}
\def\op#1{\operatorname{#1}}
\def\E{\op{E}}
\def\d{\mathrm{\mathstrut d}}
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator*{\argmin}{arg\,min}
\def\norm#1{\left\lVert #1 \right\rVert}
$$

# What is an adversarial attack and why do we care about it?

Despite their effectiveness on a variety of tasks, deep neural networks can be very vulnerable to adversarial attacks. In security-critical domains such as facial recognition authorization and autonomous vehicles, this vulnerability makes the model highly unreliable. 

An adversarial attack adds an imperceptible perturbation to a sample so that a trained neural network classifies it incorrectly. Fig. 1 shows an example of an adversarial attack.

<img src="https://i.loli.net/2021/08/10/ngphSTutRjLbErD.jpg" alt="drawing" width="400" align="center"/>


## Adversarial Attacks

### Categories of Attacks
#### Poisoning Attack and Evasion Attack
Poisoning attacks involve manipulating the training process. The manipulation can happen on the training set (by replacing original samples or inserting fake samples) or the training algorithm itself (by changing the logic of the training algorithm). Such attacks either directly cause poor performance of the trained model or make the model fail on certain samples so as to construct a backdoor for future use.

Evasion attacks aim to manipulate a benign sample $\M x$ by a small perturbation $\M \delta$ so that a trained model can no longer classify it correctly. Usually, such perturbations are so small that a human observer cannot notice them. In other words, the perturbed sample "evades" the classification of the model.

#### Targeted Attack and Non-Targeted Attack
A targeted attack tries to perturb a benign sample $\M x$ so that the trained model classifies it as a given target class $t\in \mc Y$. A non-targeted attack only tries to perturb the benign sample $\M x$ so that the trained model classifies it incorrectly. 

#### White-Box Attack and Black-Box Attack
For white-box attacks, the attacker has access to all the knowledge of the targeted model. For neural networks, the attacker knows all the information about the network structure, parameters, gradients, etc. 

For black-box attacks, the attacker only knows the outputs of the model when feeding inputs into it. In practice, black-box attacks usually rely on generating adversarial perturbations from another model that the attacker has full access to. 

Black-box attacks are more common in applications, but robustness against white-box attacks is the ultimate goal of a robust model because such attacks reveal the fundamental weakness of neural networks. Thus, most studies of adversarial robustness focus on white-box, non-targeted, evasion attacks.

### Measure of Adversarial Robustness
#### Perturbation Size 
An effective adversarial attack should perturb the samples as little as possible while still fooling the trained classifier. The size of the perturbation therefore reflects both the quality of the attack and the robustness of the model. For a single sample, the minimal perturbation is defined as
$$\begin{align}
    &\M \delta_{\text{min}} = \argmin_{\M\delta} \norm{\M\delta}_p\\
    \text{subject to}\quad& F_{\theta}(\M x+\M\delta)\neq y,
\end{align}$$
where $y$ is the true label of sample $\M x$ and $F_{\theta}(\M x)$ denotes the classification of $\M x$ made by the trained model with parameters $\theta$. The overall perturbation size can be defined as the expectation of the minimal perturbation norm over the dataset:
$$\begin{align}
    \rho(\theta) = \E_{(\M x, y)\in \mc D} \norm{\M \delta_{\text{min}}}_p.
\end{align}$$

A larger perturbation size $\rho(\theta)$ means the attacker must perturb the samples more to fool the model, which indicates that the model is more robust against this attack. 
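
As a small illustration of the minimal perturbation, consider a binary linear classifier $F(\M x)=\op{sgn}(\M w^\intercal \M x + b)$, which is an assumption made here for tractability and not a model used in this document. For such a classifier and $p=2$, the closest point on the decision boundary is the orthogonal projection of $\M x$ onto the hyperplane, at distance $\abs{\M w^\intercal \M x + b}/\norm{\M w}_2$. The sketch below uses hypothetical names (`min_l2_perturbation`, `overshoot`) and steps a tiny amount past the boundary so that the label actually flips.

```python
import math

def predict(w, b, x):
    """Binary linear classifier: returns +1 or -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def min_l2_perturbation(w, b, x, overshoot=1e-6):
    """Smallest l2 step that moves x just across the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    w_norm_sq = sum(wi * wi for wi in w)
    # Project onto the boundary, then overshoot slightly so the label changes.
    scale = -(1.0 + overshoot) * score / w_norm_sq
    return [scale * wi for wi in w]

w, b = [3.0, 4.0], -1.0
x = [2.0, 1.0]                       # score = 6 + 4 - 1 = 9, predicted +1
delta = min_l2_perturbation(w, b, x)
delta_norm = math.sqrt(sum(d * d for d in delta))
x_adv = [xi + di for xi, di in zip(x, delta)]

print(predict(w, b, x), predict(w, b, x_adv))   # the label flips
print(delta_norm)                               # close to |9| / 5 = 1.8
```

For deep networks no such closed form exists, which is why the minimal perturbation is usually approximated by running an attack.
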
#### Adversarial Risk
Let $L(\M x, y)$ be the loss function used by the trained model. The usual empirical risk is
$$\begin{align}
    R(\theta) = \E_{(\M x, y)\in \mc D} L(\M x, y).
\end{align}$$
Analogously, the adversarial risk can be defined as
$$\begin{align}
    R_{\text{adv}}(\theta) = \E_{(\M x, y)\in \mc D} \left[\max_{\norm{\M\delta}_p<\epsilon}L(\M x+\M\delta, y)\right].
    \label{arisk}
\end{align}$$
The inner maximization means the attacker tries to find a perturbation that maximizes the loss. Thus, a lower adversarial risk $R_{\text{adv}}(\theta)$ means the attacker fails to increase the loss by much, which indicates good robustness of the model against this particular attack. 

### Examples of White-box Attacking Algorithms
Most white-box attacking algorithms are based on using the gradient calculated by the model to perturb the samples. Two typical examples are Fast Gradient Sign Method (FGSM) attack and its multi-step variant, Projected Gradient Descent (PGD) attack. FGSM attack generates the perturbation as
$$\begin{align}
    \M \delta = \epsilon\op{sgn}\left(\nabla_{\M x}L(\M x, y)\right),
\end{align}$$
where $\epsilon$ controls the perturbation size. The adversarial sample is 
$$\begin{align}
    \M x' = \M x + \M \delta.
\end{align}$$
The FGSM attack can be seen as a single-step approximation of the inner maximization in Eqn.\ref{arisk} under the $\ell_\infty$ norm. 
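
The FGSM step can be sketched on a model whose input gradient is available in closed form. The example below assumes a logistic-regression classifier with cross-entropy loss, for which $\nabla_{\M x}L = (p - y)\,\M w$; the model, the weights, and the helper names are illustrative assumptions, not part of the original text.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def loss_and_grad_x(w, b, x, y):
    """Cross-entropy loss of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]        # dL/dx = (p - y) * w
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps on the input."""
    _, grad = loss_and_grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.5, -0.5, 1.0], 1
x_adv = fgsm(w, b, x, y, eps=0.1)

clean_loss, _ = loss_and_grad_x(w, b, x, y)
adv_loss, _ = loss_and_grad_x(w, b, x_adv, y)
print(clean_loss, adv_loss)                  # the loss increases after the attack
```

Every coordinate moves by exactly $\epsilon$ (or stays put when the gradient is zero), so the perturbation always lies inside the $\ell_\infty$ ball of radius $\epsilon$.
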

The PGD attack performs the same task iteratively, at the cost of more computation:
$$\begin{align}
    \M x^{t+1} = \Pi_{\M x+\epsilon}\left(\M x^t + \alpha\op{sgn}\left(\nabla_{\M x}L(\M x^t, y)\right)\right),
\end{align}$$
where $\alpha$ is the step size and $\M x^t$ denotes the adversarial sample generated in step $t$, with $\M x^0$ being the original sample $\M x$. Note that the gradient is evaluated at the current iterate $\M x^t$. $\Pi_{\M x+\epsilon}$ is a projection that clips the generated adversarial sample back into the valid region: the $\epsilon$-ball around $\M x$, which is $\{\M x':\norm{\M x'-\M x}_p\leq \epsilon \}$.

In practice, a PGD attack with a relatively small adversarial power $\epsilon$ (small enough to go unnoticed by human observers) can reduce the accuracy of a well-trained model to nearly zero. Because of this effectiveness, researchers often use PGD attacks as a basic check of the adversarial robustness of their models.
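
A minimal sketch of the PGD iteration follows, again assuming a logistic-regression model so that the input gradient is analytic, and assuming the $\ell_\infty$ norm so that the projection reduces to coordinate-wise clipping. The function names, weights, and hyper-parameters are hypothetical.

```python
import math

def grad_x(w, b, x, y):
    """Input gradient of the cross-entropy loss for a logistic model."""
    p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(w, b, x, y, eps, alpha, steps):
    x_adv = list(x)                                  # x^0 is the clean sample
    for _ in range(steps):
        g = grad_x(w, b, x_adv, y)                   # gradient at the current iterate
        stepped = [xi + alpha * sign(gi) for xi, gi in zip(x_adv, g)]
        # Projection onto the l_inf ball of radius eps around the clean sample x.
        x_adv = [min(max(si, x0 - eps), x0 + eps) for si, x0 in zip(stepped, x)]
    return x_adv

w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.5, -0.5, 1.0], 1
x_adv = pgd_linf(w, b, x, y, eps=0.1, alpha=0.04, steps=5)
print(x_adv)   # each coordinate ends on the boundary of the eps-ball
```

For this linear model the gradient sign never changes, so PGD ends at the same point a single FGSM step of size $\epsilon$ would reach. The extra iterations matter for non-linear networks, where the gradient direction changes along the path.
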

## Adversarial Training with Perturbed Examples
To defend against adversarial attacks, a direct idea is to add perturbations during the training process. Madry et al. \cite{madry2017} proposed formulating a robust optimization problem that minimizes the adversarial risk instead of the usual empirical risk:
$$\begin{align}
    \min_{\theta} \E_{(\M x, y)\in \mc D} \left[\max_{\norm{\M\delta}_p<\epsilon}L(\M x+\M\delta, y)\right].
\end{align}$$
The inner maximization tries to find perturbed samples that produce a high loss, which is also the goal of PGD attacks. The outer minimization problem tries to find the model parameters that minimize the adversarial loss given by the inner adversaries. 

This robust optimization approach effectively trains more robust models. It has been a benchmark for evaluating the adversarial robustness of models and is often seen as the standard way of adversarial training. Based on this method, many variants were proposed in the following years. They involve using a more sophisticated regularizer or adaptively adjusting the adversarial power. Those methods that add perturbations during training share a common disadvantage of high computational cost.
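
The min-max structure can be sketched as a training loop that attacks the current model before every parameter update. The example below is a toy under stated assumptions: a logistic-regression model, a single FGSM step as the inner maximizer (which is exact for a linear model under the $\ell_\infty$ norm), plain SGD as the outer minimizer, and a hand-made two-dimensional dataset. None of these choices come from the cited work.

```python
import math

def predict_prob(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    # Inner maximization: signed-gradient step on the input, dL/dx = (p - y) * w.
    p = predict_prob(w, b, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def adversarial_train(data, eps, lr=0.1, epochs=200):
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm(w, b, x, y, eps)            # attack the current model
            err = predict_prob(w, b, x_adv) - y      # dL/d(score) at the perturbed input
            # Outer minimization: SGD step on the loss at the adversarial sample.
            w = [wi - lr * err * xi for wi, xi in zip(w, x_adv)]
            b -= lr * err
    return w, b

data = [([1.0, 1.0], 1), ([2.0, 1.5], 1), ([1.5, 2.0], 1),
        ([-1.0, -1.0], 0), ([-2.0, -1.5], 0), ([-1.5, -2.0], 0)]
eps = 0.3
w, b = adversarial_train(data, eps)

# Robust accuracy: attack the trained model and count correct predictions.
robust_correct = sum(
    (predict_prob(w, b, fgsm(w, b, x, y, eps)) > 0.5) == (y == 1) for x, y in data
)
print(robust_correct, "of", len(data), "adversarial samples classified correctly")
```

Each parameter update requires generating an adversarial sample first, which is where the extra computational cost of this family of methods comes from.
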

## Adversarial Training with Stochastic Networks
Stochastic networks are neural networks that contain random noise layers. Liu et al. proposed Random Self Ensemble (RSE). Their method injects spherical Gaussian noise into different layers of a network and uses the ensemble of multiple forward passes as the final output. The variance of the added noise is treated as a hyper-parameter to be tuned. RSE shows good robustness against the PGD attack and the C\&W attack. 

Similar to RSE, He et al. proposed Parametric Noise Injection (PNI). Rather than fixed variance, they applied an additional intensity parameter to control the variance of the noise. This intensity parameter is trained together with the model parameters. 

Inspired by the idea of trainable noise, Eustratiadis et al. proposed the Weight-Covariance Alignment (WCA) method \cite{wca}. This method adds trainable Gaussian noise to the activation of the penultimate layer of the network. Let $\M g_{\theta}:\mathcal X\rightarrow \mathbb R^D$ be the neural network parameterized by $\theta$, excluding the final layer, and let $f_{\M W, \M b}:\mathbb R^D \rightarrow \mathbb R^K$ be the final linear layer parameterized by the weight matrix $\M W\in\mathbb R^{K\times D}$ and the bias vector $\M b\in\mathbb R^{K}$, where $K=\abs{\mc Y}$ is the number of classes. WCA adds Gaussian noise $\M u \sim \mathcal N_{0, \M\Sigma}$ to the output of the penultimate layer $\M g_{\theta}(\M x)$, where $\M\Sigma\in\mathbb R^{D\times D}$ is the covariance matrix. Thus, the final output becomes
  $$\begin{align}
  f_{\M W, \M b}\left(\M g_\theta(\M x)\right) = \M W\left(\M g_\theta (\M x)+\M u\right) + \M b.
  \end{align}$$
  
The loss function is defined as
  $$\begin{align}
  L=L_{\text{CE}} + L_{\text{WCA}}+ \lambda \sum_{y\in \mathcal{Y}}\M W_y \M W_y^{\intercal},
  \end{align}$$
where $L_{\text{CE}}$ is the usual cross-entropy loss, $L_{\text{WCA}}$ is a term that encourages the noise and the weights of the last layer to align with each other, and the third term imposes an $\ell^2$ penalty on $\M W_y$ with large magnitude. The WCA regularizer is defined as
  $$\begin{align}
  L_{\text{WCA}} = -\log\sum_{y\in \mathcal{Y}}\M W_y \M\Sigma\M W_y^\intercal,
  \end{align}$$
where $\M W_y$ is the weight vector of the last layer associated with class $y$, i.e., the $y$-th row of $\M W$. 

The WCA regularizer encourages the weights associated with the last layer to be well aligned with the covariance matrix of the noise. Larger trained variance corresponding to one feature means this feature is harder to perturb, so putting more weight on such features will force the final layer to focus more on these robust features. 
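
To make the two weight-dependent terms concrete, the following sketch evaluates the WCA regularizer and the $\ell^2$ penalty for a tiny hand-picked weight matrix and covariance. The numbers and the helper names (`quad_form`, `wca_terms`) are illustrative assumptions; the cross-entropy term and the weighting by $\lambda$ are left out.

```python
import math

def quad_form(w_row, cov):
    """Quadratic form w Σ w^T for one class weight vector w and covariance Σ."""
    d = len(w_row)
    return sum(w_row[i] * cov[i][j] * w_row[j] for i in range(d) for j in range(d))

def wca_terms(W, cov):
    """Return (L_WCA, sum of squared weights) for weight rows W and covariance cov."""
    alignment = sum(quad_form(row, cov) for row in W)
    l_wca = -math.log(alignment)
    l2 = sum(wi * wi for row in W for wi in row)
    return l_wca, l2

W = [[1.0, 0.0],       # class 0 relies on feature 0 only
     [0.0, 2.0]]       # class 1 relies on feature 1 only
cov = [[0.5, 0.0],
       [0.0, 0.25]]    # diagonal noise covariance
l_wca, l2 = wca_terms(W, cov)
print(l_wca, l2)       # -log(0.5 + 1.0) and 1 + 4
```

Increasing a diagonal entry of `cov` for a feature that carries large weights raises the alignment sum and lowers $L_{\text{WCA}}$, while the $\ell^2$ term keeps the weights from growing without bound.
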

Models trained with WCA show better performance against PGD attacks on various datasets compared with the aforementioned approaches. In addition, because WCA does not involve generating adversarial samples, its computational time is significantly lower than that of adversarial training with perturbations. The method we propose is inspired by WCA, but instead of adding noise to the penultimate layer, we add noise directly to the output of the final layer.

## Training a neural network with noisy logits

We consider training a model with a noisy representation $\R{Z}$ satisfying the Markov chain:

$$\R{X}\to \R{Z} \to \hat{\R{Y}}$$

Hence, the estimate for $P_{\R{Y}|\R{X}}$ is given by $P_{\hat{\R{Y}}|\R{Z}}$ and $P_{\R{Z}|\R{X}}$ as
$$P_{\hat{\R{Y}}|\R{X}} (y|x) = \E\left[\left.P_{\hat{\R{Y}}|\R{Z}}(y|\R{Z}) \right|\R{X}=x\right].$$


In particular, we propose to set $\R{Z}$ to be the noisy logits of $\hat{\R{Y}}$, i.e.,
$P_{\hat{\R{Y}}|\R{Z}}$ is defined by the pmf obtained with the usual softmax function
$$
p_{\hat{\R{Y}}|\RM{z}} (y|\M{z}) := \frac{\exp(z_y)}{\sum_{y'\in \mathcal{Y}} \exp(z_{y'})},
$$
so $z_y$ is the logit for class $y$.

The noisy logit is defined as
$$
\R{Z} = \RM{z}:=[g(y|\R{X})+\R{u}_y]_{y\in \mathcal{Y}}
$$
where
$$g(y|x)\in \mathbb{R}$$
for $(x,y)\in \mathcal{X}\times \mathcal{Y}$
is computed by a neural network to be trained, and $\R{u}_y\sim \mathcal{N}_{0,\sigma_y^2}$ for $y\in \mathcal{Y}$ are independent Gaussian random variables with variance $\sigma_y^2>0$. For simplicity, define
$$\begin{align}
\M{g}(x)&:= [g(y|x)]_{y\in \mathcal{Y}}\\
\RM{u}&:=[\R{u}_y]_{y\in \mathcal{Y}}\\
\M{\Sigma}&:=\op{diag}(\M{\sigma})\op{diag}(\M{\sigma})^\intercal \quad \text{with }\M{\sigma}:=[\sigma_y]_{y\in \mathcal{Y}},
\end{align}$$
which are referred to as the (noiseless) logits, the additive noise (vector), and its (diagonal) covariance matrix, respectively. Hence, $P_{\R{Z}|\R{X}}$ is defined by the multivariate Gaussian density function
$$
p_{\RM{z}|\R{X}}(\M{z}|x) = \mathcal{N}_{\M{g}(x),\M{\Sigma}}(\M{z})
$$
for $x\in \mathcal{X}$ and $\M{z}\in \mathbb{R}^{\abs{\mathcal{Y}}}$. 
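
The conditional expectation that defines $P_{\hat{\R{Y}}|\R{X}}$ has no simple closed form, but it can be approximated by sampling the noise. The sketch below is one possible Monte Carlo estimate under assumed values: the noiseless logits `g`, the standard deviations `sigma`, the sample count, and the seed are all made up for illustration and do not come from a trained network.

```python
import math
import random

def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def noisy_logit_probs(g, sigma, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[softmax(g(x) + u)] with u_y ~ N(0, sigma_y^2)."""
    rng = random.Random(seed)
    avg = [0.0] * len(g)
    for _ in range(n_samples):
        # Draw independent Gaussian noise per class and form the noisy logits z.
        z = [gy + rng.gauss(0.0, sy) for gy, sy in zip(g, sigma)]
        for k, pk in enumerate(softmax(z)):
            avg[k] += pk / n_samples
    return avg

g = [2.0, 0.5, -1.0]          # hypothetical noiseless logits g(y|x)
sigma = [0.1, 0.1, 0.1]       # hypothetical per-class noise standard deviations
probs = noisy_logit_probs(g, sigma)
print(probs)                  # close to softmax(g) because the noise is small
```

With larger `sigma` the averaged distribution moves further away from `softmax(g)`, since the softmax is non-linear in the logits.
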

The loss function used for training the neural network is derived from

$$
L := \E\left[-\log p_{\hat{\R{Y}}|\R{X}}(\R{Y}|\R{X})\right] - \log \sum_{y\in \mathcal{Y}}\sigma_y^2 + \lambda\left(\sum_{y\in \mathcal{Y}}\sigma_y^2\right)
$$
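
As a rough reading of the two noise terms, consider them in isolation, ignoring that the cross-entropy term also depends on the noise, and assume $\lambda>0$. Writing $s:=\sum_{y\in \mathcal{Y}}\sigma_y^2$ (a shorthand introduced only for this sketch), the stationary point is

$$\begin{align}
\frac{\d}{\d s}\left(-\log s + \lambda s\right) = -\frac{1}{s} + \lambda = 0
\quad\Longrightarrow\quad s = \frac{1}{\lambda},
\end{align}$$

and since the second derivative $1/s^2$ is positive, this is a minimum. Taken alone, the $-\log$ term therefore pushes the total noise variance up, while the $\lambda$ term caps it at around $1/\lambda$. The actual trained value also depends on the cross-entropy term, so this is only an indication of how the regularizer behaves.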
