# Tactical Approaches to Adversarial Attacks on RL

We have already touched upon several reinforcement learning methods and why they are considered notable breakthroughs in the field. But there is one more concern left to discuss: if these methods are to be applied in day-to-day life, we have to be sure that they are robust and safe to use.

So how vulnerable are these algorithms to attacks? How easy is it to interfere with a model's ability to pick good actions, say by hindering its perception? Can agents that control self-driving cars be manipulated into crashing by modifying their observations, e.g. by altering traffic signs or lane markings?

## Adversarial attacks

<!-- 
- fundamentals of adversarial attacks
- keywords
- goals of attacks
- how agent and environment are affected 
- focus of this post 
-->

First of all, let us quickly re-introduce the fundamental concepts of adversarial attacks, a special class of attacks. The overall goal of an adversarial attack is to reduce the agent's reward to a minimum by manipulating its choices.

To achieve this goal, an adversarial attack impairs the performance of a trained model - in our case an RL-trained model - by feeding it false information. This so-called **_adversarial sample_** is usually a perturbed version of the original observation returned by the environment. The adversarial sample manipulates the agent into taking, ideally, its least desired action while staying similar enough to a valid observation that it is not easily detectable.

The **_adversarial perturbation_** is the amount of noise added to the observation when crafting the sample, while the instance or agent that crafts the samples is called the **_adversary_**. Furthermore, we distinguish so-called **_white-box attacks_** from **_black-box attacks_**. In a white-box attack the adversary has full knowledge of the target model, whereas in a black-box attack it has no information about the model it attacks. If the adversary has limited information about the target model, but never its parameters, the attack is further sub-classified as a **_semi-black-box attack_**.

This post limits itself to the **_tactical approaches_** to adversarial attacks presented in Lin, _et al._ (2017) and the **_stealthy and efficient_** adversarial attacks elaborated in Sun, _et al._ (2020).

## Different types of adversarial attacks

<!-- 
( strategically timed to critical point)(maybe enchanting,antagonist)

- explain basic idea(strategically timed and enchanting)
- explain attack strategy and present functions(informal or formal)
- effects on agent and environment

- introduce critical point strategy( and antagonist attack) ( what are the differences?)
- basic idea and principle
- attack strategy
- effect on agent and environment compared to strategically timed attack 
-->

The simplest and most approachable way to attack an agent with adversarial methods is the **_uniform attack_**. Here, an adversarial sample is crafted at every single timestep. The agent is therefore attacked constantly, which results in a large overall adversarial perturbation and somewhat defeats the idea that adversarial attacks should be difficult to detect.

### Strategically timed attack

Lin, _et al._ introduce the idea of so-called **_strategically timed attacks_**. Even for simple examples it is quite intuitive that attacks are not equally effective at every timestep. For example, attacking an agent that acts in OpenAI Gym's <mark>**TODO**: Not CarRacing anymore</mark>**CarRacing** environment (introduced in more detail later on) would be less effective on long straight sections of the track than in curves.
To determine when the adversary should craft an adversarial sample, we first compute a function $c$ that compares the probability of the agent's most preferred action with that of its least preferred action:
$$c(s_t) = \max_{a_t}\pi(s_t, a_t) - \min_{a_t}\pi(s_t, a_t)$$

Note that this way of computing $c$ is only applicable to policy-gradient-based methods such as A3C or PPO, where the policy $\pi$ outputs a probability for each action.

Next, an adversarial sample is only crafted if $c$ is at least as large as a threshold $\beta$. The number of attacks during an episode therefore depends directly on this threshold. Put simply, a large threshold results in few attacks while a small threshold results in many attacks. This of course affects not only the overall adversarial perturbation but also the effectiveness of the attack. Choosing $\beta$ wisely therefore determines both the success of an adversarial attack and its perceptibility.
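The decision rule above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the example action probabilities are made up for this post, and the policy is assumed to expose one probability per discrete action.

```python
from typing import Sequence


def preference_gap(action_probs: Sequence[float]) -> float:
    """c(s_t): probability of the most preferred minus the least preferred action."""
    return max(action_probs) - min(action_probs)


def should_attack(action_probs: Sequence[float], beta: float) -> bool:
    """Craft an adversarial sample only if c(s_t) is at least beta."""
    return preference_gap(action_probs) >= beta


def count_attacks(probs_per_step: Sequence[Sequence[float]], beta: float) -> int:
    """Number of timesteps of an episode that would be attacked for a given beta."""
    return sum(should_attack(p, beta) for p in probs_per_step)


# Hypothetical policy outputs for three timesteps with four actions each.
episode_probs = [
    [0.25, 0.25, 0.25, 0.25],  # agent is indifferent, attacking gains little
    [0.70, 0.10, 0.10, 0.10],  # agent strongly prefers one action
    [0.40, 0.30, 0.20, 0.10],  # mild preference
]

for beta in (0.0, 0.2, 0.5, 0.9):
    print(f"beta={beta}: {count_attacks(episode_probs, beta)} attacked timesteps")
```

With $\beta = 0$ every timestep is attacked, which is exactly the uniform attack, while larger values of $\beta$ restrict the attack to timesteps where the agent has a clear preference.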

### Enchanting attack

The next attack proposed by Lin, _et al._ is called *__enchanting attack__*. The basic principle of this planning-based attack is to __lure__ the agent to a desired target state $s_g$ such that its reward is minimized on the way there. The main difference to the strategically timed attack is that it does not try to reduce the agent's reward directly, but rather misleads the agent towards a target state so that it misses out on its optimal states. <br>

<!-- To achieve this, a series of adversarial examples have to be crafted by a planning algorithm and in addition, it also needs a *__generative model__* to predict future states, so that a suitable planned sequence of actions can be crafted by the adversary. The series of adversarial examples is described as this:
$$s_{t+1} + \delta_{t+1}, ..., s_{t+H} + \delta_{t+H}$$  -->
To achieve this, a series of adversarial examples has to be crafted. Lin, _et al._ split this into two subtasks: first, planning a sequence of actions that reaches the target state $s_g$, and second, crafting an adversarial example of the form $s_t + \delta_t$ that manipulates the agent into taking the first action of the planned sequence. Crafting the entire series of adversarial examples for every action of the planned sequence at once is inadvisable, because one slight inaccuracy might cause the entire attack to fail. Instead, both steps are repeated for every new state $s_{t+1}$ that the environment returns after the agent has taken its action, so that the series of adversarial examples is crafted progressively:
$$s_{t+1} + \delta_{t+1}, \dots, s_{t+H} + \delta_{t+H}$$

- where $H$ is the number of timesteps in the planned sequence and
- $\delta$ is the perturbation that is added to the respective state

<!-- However, this series of adversarial samples is not crafted as a whole in the beginning, it is crafted progressively after every timestep that has been reached by the agent. For example, the current state $s_t$ is perturbed with $s_t+\delta_t$ and after the agent executes the planned action it receives a new state $s_{t+1}$, which would be perturbed next into $s_{t+1}+\delta_{t+1}$. This process continues until $s_{t+H}+\delta_{t+H}$ is reached and the agent is in the target state $s_g$.  -->
<!-- The actions that are taken by the agent are based on the distance between the current state and the target state and therefore, the action that leads closer to the target state is chosen. <br> -->
The actions of the action sequence computed in the planning step are sampled based on the distance between the current state $s_t$ and the target state $s_g$, meaning that actions leading to a state $s_{t+1}$ closer to the target state are preferred.

<!-- As mentioned above, a generative model is needed in order to predict the states and thus the possible sequences of actions.
In this case, a video prediction model $M$ is used, which predicts fututre video frames. -->
A generative model is used to predict future states; in the experiments of Lin, _et al._ this is a video prediction model $M$ that predicts future video frames. It takes the current state and a sequence of future actions $A_{t:t+H} = \{a_t, \dots, a_{t+H}\}$ as input. The predicted future state can therefore be described as:
$$s^{M}_{t+H} = M(s_t,A_{t:t+H})$$

The success of the enchanting attack can then be measured by the distance between the predicted state and the actual target state $s_g$: $D(s_g, s^{M}_{t+H})$.
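To make the planning step more tangible, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration: the state is a single integer position, `toy_model` stands in for the video prediction model $M$, the distance $D$ is the absolute difference, and the planner enumerates all action sequences exhaustively instead of sampling them. The sketch also assumes that every crafted adversarial example succeeds in luring the agent into the planned first action, and it shortens the planning horizon by one after each step.

```python
import itertools
from typing import Callable, List, Sequence, Tuple

Model = Callable[[int, Sequence[int]], int]


def toy_model(state: int, action_seq: Sequence[int]) -> int:
    """Stand-in for M(s_t, A_{t:t+H}): each action shifts a 1-D position."""
    return state + sum(action_seq)


def plan(state: int, target: int, model: Model,
         actions: Sequence[int], horizon: int) -> Tuple[Tuple[int, ...], int]:
    """Return the action sequence whose predicted end state is closest to the target."""
    best_seq: Tuple[int, ...] = ()
    best_dist = float("inf")
    for seq in itertools.product(actions, repeat=horizon):
        dist = abs(target - model(state, seq))  # D(s_g, s^M_{t+H})
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist


def enchant(state: int, target: int, model: Model,
            actions: Sequence[int], horizon: int) -> List[int]:
    """Replan at every step and follow only the first planned action."""
    trajectory = [state]
    for remaining in range(horizon, 0, -1):
        seq, _ = plan(state, target, model, actions, remaining)
        # Assumption: the adversarial example s_t + delta_t makes the agent
        # take seq[0]; the toy environment then behaves exactly like the model.
        state = model(state, seq[:1])
        trajectory.append(state)
    return trajectory


trajectory = enchant(state=0, target=3, model=toy_model, actions=(-1, 0, 1), horizon=3)
print(trajectory)
```

Replanning after every step keeps a single prediction error from derailing the whole attack, which is the reason given above for not crafting all adversarial examples at once.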

<!-- This means that the current state $s_t$ is also included in the model hence the model is described as: $M(s_t,A_{t:t+H})$. However, the goal of the generative model is to predict the future states and the series of actions $A_ {t:t+H}$ is described to reach a future state and thus it can be concluded as the predicted future state: $$ s^{M}_{t+H} = M(s_t,A_{t:t+H})$$ -->

### Critical point attack

Sun, _et al._ propose more advanced forms of attack that aim for maximal stealthiness while remaining efficient.
The first one is called __critical point attack__. It shares one principle with the __strategically timed attack__: attack only at the timesteps where an attack causes maximum damage.<br>
The difference lies in how these timesteps are selected. While the strategically timed attack only considers one timestep at a time and evaluates the agent's action preference there, the critical point attack builds a model that predicts future states of the environment. With this model the adversary can evaluate the consequences of each candidate attack strategy and choose the best one. This also keeps the number of attacked timesteps minimal: only the most critical steps, i.e. those where an attack causes the most damage, are attacked.<br>

As an example, Sun, _et al._ conducted tests on the racing game [TORCS](http://torcs.sourceforge.net/) and defined the following functions for it.

The *__prediction model__* for the environment states is defined by Sun, _et al._ as $P : (s_t,a_t) \rightarrow s_{t+1}$. It takes a state-action pair and returns the predicted next state. <br>

The *__Damage Awareness Metric__* assesses each attack strategy on the predicted states so that the adversary can select the best one. Formally it is defined as
$$DAM = |T(s'_{t+M}) - T(s_{t+M})|$$
where $t+M$ is the timestep at which the adversary evaluates the attack, with $M$ being a predefined parameter. $T(s_{t+M})$ is the divergence function for the state $s_{t+M}$; in the TORCS example it indicates the distance between the car and the middle of the road. $T(s'_{t+M})$ is the same quantity for the perturbed state.<br>
Similarly to the strategically timed attack, this metric needs a threshold $\Delta$ to decide whether the perturbation should be applied. If $DAM$ is greater than $\Delta$, the attack is conducted; otherwise the process is repeated for the following state.<br>
Since not every environment involves roads and cars, a suitable divergence function, and with it the $DAM$, has to be defined separately for each environment. <br>
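The following sketch illustrates the decision process on a toy driving problem. All names and dynamics are assumptions made for this post: the state is the car's integer offset from the middle of the road, `predict` plays the role of the prediction model $P$, `divergence` plays the role of $T$, and `policy` is a hypothetical agent that steers back towards the middle. An attack strategy is modelled as a short sequence of actions forced onto the agent for the first `n` of the `m` predicted steps.

```python
import itertools
from typing import Optional, Sequence, Tuple


def predict(state: int, action: int) -> int:
    """Toy prediction model P: (s_t, a_t) -> s_{t+1}."""
    return state + action


def divergence(state: int) -> int:
    """Toy T(s): distance between the car and the middle of the road."""
    return abs(state)


def policy(state: int) -> int:
    """Hypothetical agent: steer one unit back towards the middle."""
    return max(-1, min(1, -state))


def rollout(state: int, steps: int, forced: Sequence[int] = ()) -> int:
    """Predict the state after `steps` steps; forced actions override the policy."""
    for i in range(steps):
        action = forced[i] if i < len(forced) else policy(state)
        state = predict(state, action)
    return state


def damage_awareness(state: int, forced: Sequence[int], m: int) -> int:
    """DAM = |T(s'_{t+M}) - T(s_{t+M})| for one attack strategy."""
    return abs(divergence(rollout(state, m, forced)) - divergence(rollout(state, m)))


def critical_point_decision(state: int, actions: Sequence[int], n: int, m: int,
                            delta: float) -> Tuple[Optional[Tuple[int, ...]], int]:
    """Pick the most damaging strategy and attack only if its DAM exceeds delta."""
    best = max(itertools.product(actions, repeat=n),
               key=lambda forced: damage_awareness(state, forced, m))
    dam = damage_awareness(state, best, m)
    return (best if dam > delta else None), dam


strategy, dam = critical_point_decision(state=0, actions=(-1, 0, 1), n=2, m=3, delta=0.5)
print(strategy, dam)
```

In this toy setting only the strategies that push the car in the same direction twice leave it off the middle of the road at step $t+M$; all other strategies are corrected by the agent in time and therefore yield a $DAM$ of zero.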

To conclude, the critical point attack aims to cause the maximum amount of damage with the minimum number of attacked timesteps, which makes it efficient and also gives the adversary a good level of stealthiness.<br>


<!--### Antagonist attack

In this attack, an adversarial agent is created which focuses on learning the optimal attack strategy. The antagonist will decide if a perturbation should be applied on every timestep. Furthermore, it decides what action would lead to the most damaging outcome for the agent, which means it is also focused on the most critical timesteps in an episode.<br>
It uses the agent's policy to learn the attack strategy specifically, the reward function is used as an optimizer to train this model. Therefore, an antagonist policy is being created: $ u^{adv}(s_t)$ which maps the current state to the attack strategy. Additonally, the antagonists rewards are the negative rewards of the actual agent: $r^{adv}(s_t,a'_t) = -r(s_t,a_t)$.<br> -->


## Implementation and results
### How to attack

As already outlined, the idea is to craft an adversarial sample $\tilde{x} = x + \delta$ where $x$ is the original observation and $\delta$ is some $\epsilon$-scaled perturbation.

The first obstacle is finding a way to actually compute this perturbation. There are different approaches to this task. In our implementation we simply created an adversary environment that takes both the target model and the original environment as parameters. Its action space is set to the original environment's observation space, while its own observation space is left unchanged. An adversary trained on this environment thus learns to output an action of the same shape as the observations that the target model uses to choose its actions, based on the actual current state. In the step function, we let the target model predict an action using the adversary's action as its observation and then perform a step on the original environment with this (manipulated) action of the target model. This way, the action that the target model chose based on a perturbed observation is evaluated on the original environment in its current state. Lastly, by inverting the reward we ensure that the adversary learns to minimize the target model's reward by crafting a new observation.

Finally, we need to scale this perturbation relative to the original observation. We multiply the difference between the new observation and the original observation by some factor $\epsilon$ and add this scaled change, $\delta$, back onto the original observation, which yields the final adversarial sample $\tilde{x}$.
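The construction described above could look roughly like the sketch below. It is not our actual implementation: it avoids the Gym API entirely, and `ToyEnv`, `toy_target_policy`, `AdversaryEnv` and `scale_perturbation` are hypothetical names introduced for illustration. Applying the $\epsilon$ scaling inside the step function is likewise an assumption of this sketch.

```python
from typing import Callable, List, Tuple

Observation = List[float]


def scale_perturbation(original: Observation, proposed: Observation,
                       epsilon: float) -> Observation:
    """x_tilde = x + epsilon * (proposed - x), applied element-wise."""
    return [x + epsilon * (p - x) for x, p in zip(original, proposed)]


class ToyEnv:
    """Hypothetical original environment: keep a 1-D position close to zero."""

    def __init__(self, start: float = 1.0, max_steps: int = 5) -> None:
        self.start, self.max_steps = start, max_steps
        self.x, self.steps = start, 0

    def reset(self) -> Observation:
        self.x, self.steps = self.start, 0
        return [self.x]

    def step(self, action: int) -> Tuple[Observation, float, bool]:
        self.x += action
        self.steps += 1
        return [self.x], -abs(self.x), self.steps >= self.max_steps


def toy_target_policy(obs: Observation) -> int:
    """Hypothetical trained agent: always move towards zero."""
    return -1 if obs[0] > 0 else 1


class AdversaryEnv:
    """The adversary's action is a proposed observation for the target model."""

    def __init__(self, env: ToyEnv, target_policy: Callable[[Observation], int],
                 epsilon: float) -> None:
        self.env, self.target_policy, self.epsilon = env, target_policy, epsilon
        self._obs: Observation = []

    def reset(self) -> Observation:
        self._obs = self.env.reset()
        return self._obs

    def step(self, proposed_obs: Observation) -> Tuple[Observation, float, bool]:
        # Scale the adversary's proposal relative to the true observation.
        adv_sample = scale_perturbation(self._obs, proposed_obs, self.epsilon)
        # The target model acts on the adversarial sample ...
        victim_action = self.target_policy(adv_sample)
        # ... but the action is executed on the original environment.
        self._obs, reward, done = self.env.step(victim_action)
        # Inverted reward: the adversary gains what the target model loses.
        return self._obs, -reward, done


adv_env = AdversaryEnv(ToyEnv(start=1.0), toy_target_policy, epsilon=1.0)
obs = adv_env.reset()
obs, adv_reward, done = adv_env.step([-1.0])  # pretend the agent is left of zero
print(obs, adv_reward, done)
```

With $\epsilon = 1$ the target model sees exactly what the adversary proposes, while $\epsilon = 0$ leaves the observation untouched, so $\epsilon$ directly trades attack strength against detectability.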

### Example: LunarLander

<mark>**TODO**: description of LunarLander</mark>

### Comparison: Uniform Attack and Strategically Timed Attack

<center>
<img src="../workspace/adv_attacks/out/charts/uniform_attack_epsilon_comparison.PNG">
</center>
<img src="../workspace/adv_attacks/out/charts/strategically_timed_attack_beta_comparison_rew.PNG">
<img src="../workspace/adv_attacks/out/charts/strategically_timed_attack_beta_comparison_rest.PNG">


## Conclusion

In the sections above we have covered the main ideas of the strategically timed attack and the critical point attack. Furthermore, we have stated the key difference between these two attacks, and this difference is reinforced by the plots above, which show the mean rewards on LunarLander under the strategically timed attack. Between $\beta = 0$ and $\beta = 0.9$ there are several dips in the mean reward, e.g. just before $\beta$ reaches $0.1$ or at $\beta = 0.3$. While the attacks in these regions are still part of the strategically timed attack's strategy, because the $c$ value was greater than or equal to $\beta$, several of the attacked timesteps in these episodes would not have been chosen when using the critical point strategy. This is because these attacks are __not__ optimal when the whole attack process is considered. Since the critical point attack uses the __Damage Awareness Metric__ and a __Prediction Model__ to find out which timesteps are worth attacking at all, the attacks behind these dips would have been ruled out when evaluating the attack strategy. <br>
One could now ask why these attacks should be left out, given that they may further reduce the agent's reward. Simply put, they risk the concealment of the attack, so one has to weigh whether attacking non-optimal timesteps is worth the risk of uncovering the whole attack process. The critical point attack minimizes that risk by attacking at a minimal number of timesteps while still dealing the most damage to the agent. Considering this, the uniform attack is a poor way to attack an agent, as already mentioned at the beginning, whereas the enchanting attack could be a decent strategy because it does not conduct _active_ attacks and rather lures the agent into bad states.<br>

In this post, the attacks were presented by describing them formally and running some of them on _LunarLander_. Taking this one step further, these attacks could also be applied to real-life scenarios such as self-driving cars or traffic light control. This would pose a severe danger to the security of RL-trained agents and RL algorithms in general, which is why there are also several defensive methods to prevent or minimize the damage done by adversarial attacks.



# References

- [Goodfellow, _et al._ (2015)](https://arxiv.org/abs/1412.6572v3)
- [Akhtar and Mian (2017)](https://ieeexplore.ieee.org/abstract/document/8294186)
- [Lin, _et al._ (2017)](https://arxiv.org/abs/1703.06748)
- [Carlini and Wagner (2016)](https://ieeexplore.ieee.org/abstract/document/8294186)
- [LunarLander](https://gym.openai.com/envs/LunarLander-v2/)
- [Sun, _et al._ (2020)](https://arxiv.org/abs/2005.07099)