# Adversarial Policies: Attacking Deep Reinforcement Learning

## Introduction

After introducing adversarial attacks in a previous post (TODO: add hyperlink here), we will now showcase adversarial policy attacks in greater detail. To this end we draw on the 2019 paper "Adversarial Policies: Attacking Deep Reinforcement Learning" by Adam Gleave et al. This showcase illustrates the theory behind adversarial policies in various environments and includes the results of experiments we conducted ourselves.

## What is an Adversarial Policy?

Adversarial policies are a type of adversarial attack that can be used in multi-agent environments. In a multi-agent environment, multiple agents act on the same environment. Their interaction can be cooperative or competitive. In this post we focus on competitive tasks, but the same type of attack could be used in cooperative environments.

An adversarial policy attack can occur if an attacker gains control of one or more agents in the environment. The attacker can then train the controlled agent with the explicit goal of minimizing the reward of the other agents. This can be very effective as the deployed agents in an environment will

## Setup for the Berkeley Repository

To implement adversarial policies and see how effective they are, we used this Berkeley [GitHub repository](https://github.com/HumanCompatibleAI/adversarial-policies). The easiest way to work with the repository is through [Docker](https://www.docker.com/). Docker lets you use the repository in an isolated environment, which prevents complications with its various dependencies.

We give a detailed guide on how to set up the repository on Windows; the process is almost identical on Linux systems.

### Reproducing results on Windows


#### Setting up Docker
On Windows, you first need to install WSL. This allows Windows to run a Linux environment directly and is required to use Docker. To do this, follow this guide by [Microsoft](https://docs.microsoft.com/en-us/windows/wsl/install).

Afterwards, download and install Docker Desktop according to the [tutorial](https://docs.docker.com/desktop/windows/install/) provided by Docker. On startup, the Desktop application launches a tutorial that can be very helpful for familiarizing yourself with Docker.


#### Preparing the Repository
The Berkeley repository can simply be cloned or downloaded from [GitHub](https://github.com/HumanCompatibleAI/adversarial-policies). Since the repository uses MuJoCo, we need to download a [MuJoCo activation key](https://www.roboti.us/license.html). So that the key can be accessed later on, move the key file directly into the repository folder in your filesystem.

Now we can use Docker to work with the repository. Open a terminal and navigate to the repository. Then build a Docker image from the repository with ```docker build -t rl_adversarial .``` (the trailing dot tells Docker to use the current directory as the build context).

After successfully building the image, start a Docker container with the MuJoCo key by calling ```docker run -it --name rl_adv --env MUJOCO_PY_MJKEY_PATH=/adversarial-policies/mjkey.txt rl_adversarial /bin/bash```. If you get an error similar to ```ERROR [python-req 6/6] RUN touch /root/.mujoco/mjkey.txt   && ci/build_venv.sh /venv && rm -rf HOME/.cache/ ``` while building the image, running ```git config --global core.autocrlf false``` and repeating the previous steps can help.

If everything went smoothly a Linux command line should appear in the terminal. You are now able to train the Adversarial Policies using the implementation from the 2019 paper.


## Getting started with the repository

Now you have several options. We suggest that you first run ```python -m aprl.train```. This will come in handy when searching for the trained models. If an error like ```multi_train is not in list``` occurs, restarting Docker fixed it for us.

### aprl.train
With ```python -m aprl.train``` you can train a policy. To get a better understanding of the available settings, head to ```aprl/train``` and take a look at the parameters under ```train_config()```. The environment supports a total of 6 games. A summary is provided under ```Games```. To recreate the results in the game SumoHumans, use ```python -m aprl.train with env_name=multicomp/SumoHumans-v0 paper```. This will train a policy for a total of 20 million time steps. After the training is finished, you can test the policy with ```aprl.score_agent```.

### aprl.score_agent
Running ```python -m aprl.score_agent``` allows us to test the quality of our trained policy. Just like ```aprl.train```, it has a lot of different parameters. You can find these at ```aprl/score_agent  default_score_config()```. To evaluate the quality of our trained policy from before, run the following command: ```python -m aprl.score_agent with env_name=multicomp/SumoHumans-v0 agent_b_type=ppo2 agent_b_path=/adversarial-policies/data/baselines/20220322_162856-default/final_model/ episodes=100```. You need to change ```20220322_162856-default``` to the actual name of the folder the policy is stored in. Follow the ```Save location``` part below to find the folders.

The ```aprl.score_agent``` command can also create videos. To create videos of the policies, add ```videos=True```. In our case we had to set the ```annotated``` parameter to False under ```video_params```, or we would receive an error. Occasionally other errors occur while creating videos. Most of the time, restarting ```WSL``` fixed these for us. Unless the path is changed, the videos are stored in the same folder as the logs of the score session. A small guide to locating these folders is provided below.
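
Since both commands take their settings as `key=value` pairs after the word `with`, it can be convenient to assemble them in a small script instead of editing a long line by hand. The following is a minimal sketch; `build_command` is a hypothetical helper of ours and not part of the repository, and the folder name is the example from above that you need to replace with your own.

```python
def build_command(module, named_configs=(), **overrides):
    """Assemble `python -m <module> with key=value ... <named configs>` as a list."""
    args = ["python", "-m", module]
    updates = [f"{key}={value}" for key, value in overrides.items()]
    if updates or named_configs:
        args.append("with")
        args.extend(updates)
        args.extend(named_configs)
    return args


# Example folder name from this post; replace it with your own run folder.
run_folder = "20220322_162856-default"

score_cmd = build_command(
    "aprl.score_agent",
    env_name="multicomp/SumoHumans-v0",
    agent_b_type="ppo2",
    agent_b_path=f"/adversarial-policies/data/baselines/{run_folder}/final_model/",
    episodes=100,
)
print(" ".join(score_cmd))
```

The printed line can be pasted into the container's shell. Passing `named_configs=("paper",)` together with `env_name=...` to the same helper would reproduce the training command from the previous section.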

### Save location
To find the directory in which the trained policies and the scores are saved, head to ```\\wsl$``` -> ```docker-desktop-data``` -> ```version-pack-data``` -> ```community``` -> ```docker``` -> ```overlay2``` -> At this point there should be several folders with cryptic names. Sort by last edited and open the most recently edited folder (for this to work, at least one policy should have been trained already). -> ```diff``` -> ```adversarial-policies``` -> ```data```. The trained policies are stored in ```baselines```. The logs of the training sessions and the scoring sessions are stored in ```sacred```.
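
If you are still inside the container, you can also look up the newest run folder with a few lines of Python instead of browsing the filesystem. This is only a sketch under two assumptions: the policies live under `/adversarial-policies/data/baselines` (the path used in the scoring command above), and the folder names start with a sortable timestamp such as `20220322_162856-default`. The helper `latest_run` is our own and not part of the repository.

```python
from pathlib import Path


def latest_run(baselines_dir):
    """Return the newest run folder in `baselines_dir`, or None if there is none.

    Assumes run folders are named with a leading timestamp, so the
    alphabetically largest name belongs to the most recent run.
    """
    root = Path(baselines_dir)
    if not root.is_dir():
        return None
    runs = [entry for entry in root.iterdir() if entry.is_dir()]
    return max(runs, key=lambda entry: entry.name, default=None)


newest = latest_run("/adversarial-policies/data/baselines")
if newest is None:
    print("No trained policy found yet.")
else:
    print(newest / "final_model")
```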

### Games

A total of 6 games are provided by ```gym_compete```, in which two agents compete against each other in the ```MuJoCo robotics simulator```. These are ```KickAndDefend-v0, RunToGoalHumans-v0, SumoHumans-v0, YouShallNotPassHumans-v0, SumoAnts-v0 and RunToGoalAnts-v0```. To see the games in action, run ```aprl.score_agent``` with the specific game as ```env_name``` and create a few videos.

<center>
<figure>
<img src="..\workspace\adv_policy_training\KickAndDefend-gif\vid.gif" style="width: 300px;">
<figcaption>Example of KickAndDefend</figcaption>
</figure>
</center>

### Mistakes to avoid

The most time-consuming mistake we encountered was training the wrong agent in ```YouShallNotPassHumans-v0```. While it doesn't really matter which agent you select to train in the Sumo games, it's important to select the correct agent here. If you simply select the game and run ```aprl.train```, you will train the attacking agent. To make sure you select the defending agent, use ```python -m aprl.train with env_name=multicomp/YouShallNotPassHumans-v0 embed_index=1 paper```.


## Results

To test the effectiveness of adversarial policies, we trained several agents in different games and let them compete against each other. One advantage of adversarial policies worth noting before the actual results is how fast they learn. Compared to training a policy to play a game from scratch, the adversarial policy needed only a fraction of the time to learn an effective attack.

### SumoHumans

In this game two agents fight each other in a small arena. The goal is to push the opposing agent over or out of the arena (Fig.1 left). gym_compete provides a total of 3 different baseline agents, and we trained one adversary against each. In all three cases the adversarial policy took a similar approach: the adversary simply chose to sit down, confusing the zoo agent and forcing it into wrong decisions (Fig.1 middle). This form of attack was especially effective against the first zoo policy (vic_v1 in Fig.2), with a 74% win rate and only a 7% loss rate, while performing significantly worse against vic_v2. That is mostly because vic_v2 uses the same strategy and fights in a kneeling position, so most games end in a tie. Another interesting observation is the strong difference in performance between attacking a victim with an adversarial policy that was trained against that victim and attacking it with a policy that was trained against another victim version. In the first case the adversary wins far more often than it loses (except against vic_v2), but as soon as it plays against another victim version it loses almost 50% or even more of the games played (Fig.2).
<center>
<figure>
<img src="..\workspace\adv_policy_training\Sumo_Humans_1v1_tourney\gifs\1v1_norm.gif" style="width: 300px;">
<img src="..\workspace\adv_policy_training\Sumo_Humans_1v1_tourney\gifs\1v1_adv(v1).gif" style="width: 300px;">
<img src="..\workspace\adv_policy_training\Sumo_Humans_1v1_tourney\gifs\1v1_adv(v2).gif" style="width: 300px;">
<figcaption>(Fig.1) Left: 1 vs 1 between two zoo agents. Middle: 1 vs 1 between vic_v1 (red) and adv_v1 (green). Right: 1 vs 1 between vic_v2 (red) and adv_v2 (green).</figcaption>
</figure>
</center>

<center>
<figure>
<img src="..\workspace\adv_policy_training\Sumo_humans_1v1_tourney/Übersicht.png" style="width: 1000px;">
<figcaption>(Fig.2)From left to right: Adversary wins, victim wins and ties</figcaption>
</figure>
</center>
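
The percentages in this kind of overview are plain shares of episode outcomes. As a minimal sketch of the arithmetic, the snippet below tallies a list of per-episode results. The labels `adversary`, `victim` and `tie` and the example list are made up for illustration; they are neither the output format of ```aprl.score_agent``` nor our measured data.

```python
from collections import Counter


def outcome_rates(outcomes):
    """Return the share of adversary wins, victim wins and ties in `outcomes`."""
    if not outcomes:
        raise ValueError("need at least one episode")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: counts[label] / total for label in ("adversary", "victim", "tie")}


# Made-up outcomes of ten episodes, for illustration only.
episodes = ["adversary"] * 6 + ["victim"] * 1 + ["tie"] * 3
rates = outcome_rates(episodes)
print({label: f"{share:.0%}" for label, share in rates.items()})
```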

### SumoAnts

SumoAnts uses the same setup and rules as SumoHumans. The only difference is the agent: the human version has more dimensions and thus more room for errors, while the ant version has fewer dimensions (Fig.3). Comparing the two games exposes a weakness of adversarial policies: the less complex the environment, the less effective an adversarial attack is. In SumoAnts the adversary did not manage to get a win rate higher than 9% (Fig.4), even after training for 20 million timesteps.

<center>
<figure>
<img src="..\workspace\adv_policy_training\Sumo_ants_1v1_tourney/gif_1vs1.gif" style="width: 300px;">
<figcaption>(Fig.3)1 vs 1 between two zoo agents.</figcaption>
</figure>
</center>

<center>
<figure>
<img src="..\workspace\adv_policy_training\Sumo_ants_1v1_tourney/Übersicht2.png" style="width: 1000px;">
<figcaption>(Fig.4)From left to right: Adversary wins, victim wins and ties</figcaption>
</figure>
</center>


### YouShallNotPass

The two prior examples had one thing in common: both agents had the same goal, namely to make the other agent fall over or fall out of the arena. In YouShallNotPass the two agents have different tasks (Fig.5 left). The attacking agent has to run past the defending agent, while the defending agent has to stop the other agent from reaching the finish line. We chose the defending agent as the adversary. Just like in the two sumo games, the adversary throws itself on the ground to confuse the attacking agent (Fig.5 middle). This tactic seems to work consistently, as the adversary wins 75% of the matches (Fig.6 right).

<center>
<figure>
<img src="..\workspace\adv_policy_training\YouShallNotPass_1v1_tourney+Videos\gif\normal1v1.gif" style="width: 300px;">
<img src="..\workspace\adv_policy_training\YouShallNotPass_1v1_tourney+Videos\gif\advvsvic.gif" style="width: 300px;">
<figcaption>(Fig.5)Left: 1 vs 1 between two zoo agents(red is defending, green is attacking). Middle: 1 vs 1 between adversary and a zoo agent.</figcaption>
</figure>
</center>

Furthermore, we decided to mask the attacking agent. This means the attacking agent does not observe its opponent and decides what actions to take without taking the opponent's actions into account (Fig.6 left). This completely counters the adversary's approach, resulting in a 97% win rate (Fig.6 right). A second approach we took was training a zoo agent to play against the adversary and defend itself from the attack. We started the training with an estimate of 20 million time steps, but the trained agent needed far fewer to learn the adversary's attack and counter it (Fig.6 middle). After less than 1 million timesteps the defending victim already achieved an 88% win rate (Fig.6 right). But this robustness against attacks is not without drawbacks. As soon as the defended or the masked agent has to play against a normal zoo opponent, it gets beaten severely and only manages to win 1% of the games. These two defences are therefore strong against the adversary but poor if the agent is meant to perform as well as possible against normally trained agents.
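
To make the idea of masking concrete, the sketch below overwrites the part of an observation vector that describes the opponent with a constant before the policy sees it. The layout of the observation (which entries belong to the opponent) and the fill value are assumptions for illustration; the repository's own masking code may work differently.

```python
def mask_opponent(observation, opponent_slice, fill_value=0.0):
    """Return a copy of `observation` with the opponent's entries overwritten."""
    masked = list(observation)
    width = len(masked[opponent_slice])
    masked[opponent_slice] = [fill_value] * width
    return masked


# Hypothetical layout: the first three entries describe the agent itself,
# the last two describe the opponent.
obs = [0.4, -1.2, 0.9, 2.5, -0.3]
print(mask_opponent(obs, slice(3, 5)))
```

Because the masked entries no longer change with the opponent's behaviour, the adversary loses its only channel for influencing the victim's decisions, which fits the result above.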

<center>
<figure>
<img src="..\workspace\adv_policy_training\YouShallNotPass_1v1_tourney+Videos\gif\advvsmas.gif" style="width: 300px;">
<img src="..\workspace\adv_policy_training\YouShallNotPass_1v1_tourney+Videos\gif\advvsdef.gif" style="width: 300px;">
<img src="..\workspace\adv_policy_training\YouShallNotPass_1v1_tourney+Videos\Übersicht.png" style="width: 500px;">
<figcaption>(Fig.6) Left: 1 vs 1 between adv (red) and vic_Masked (green). Middle: 1 vs 1 between adv (red) and vic_Defended (green). Right: 1 vs 1 results between the adversary and a zoo agent (vic_v1), a masked zoo agent (Vic_M) and a defended zoo agent (Vic_D).</figcaption>
</figure>
</center>

One solution for the overfitting is training the victim to defend itself against the adversary and the zoo agent at the same time. By simply playing against the adversary in one episode and against the zoo agent in the next, the victim becomes robust against the adversary while still achieving its original results against the zoo agent (Fig.7). With this method it is possible to create a victim that appears safe at first glance. And it is safe as long as the adversary stays the same and does not evolve. But as mentioned before, training the adversary takes very little time compared to hardening the victim. It took us only a few hours to create a second version of the adversary that is very effective not just against the modified victim but also against the original zoo agent (Fig.7).
<center>
<figure>
<img src="..\workspace\adv_policy_training\YouShallNotPass_1v1_tourney+Videos\Übersicht3.png" style="width: 500px;">
<figcaption>(Fig.7)Test results of hardened victim and Adv2</figcaption>
</figure>
</center>
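
The alternating schedule used for hardening the victim can be written down in a few lines. This is a minimal sketch of the episode-by-episode switch only; the opponent labels are placeholders, and the actual training step is left out.

```python
def opponent_for_episode(episode, opponents=("adversary", "zoo")):
    """Pick the opponent for an episode index by cycling through `opponents`."""
    return opponents[episode % len(opponents)]


# Opponents faced by the victim in the first six training episodes.
schedule = [opponent_for_episode(i) for i in range(6)]
print(schedule)
```

Once a second adversary exists, it could be added to the `opponents` tuple, although the results above suggest that a new adversary can be trained faster than the victim can be hardened against it.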




## Conclusion

These results give us a good overview of the strengths and weaknesses of adversarial policies. One of their big strengths lies in their fast training with little computational power and their adaptability to different problems. As mentioned above, if you have black-box access to the victim, it does not take much time to find its weakness, if one exists.

Yet adversarial policies are not without drawbacks. First, the victim needs to be static; otherwise it would adapt to the adversary's strategy, nullifying the effect of the attack.

Second, the more dimensions of the victim's observation the adversary can impact, the easier it becomes to discover a vulnerability for the adversarial agent to exploit. This was evident in the difference between SumoAnts and SumoHumans. In SumoAnts the adversary was unable to find an exploit, while in SumoHumans the adversarial attack was successful under the same conditions. This might remind one of the curse of dimensionality. The more complex the state space, the more sparsely the agent's collected episodes cover it. The adversary can therefore steer the victim into a state it was not properly trained on, creating an exploitable weakness.
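
The sparsity argument can be illustrated with a toy experiment: draw a fixed number of random points, divide the space into a grid, and count which fraction of the grid cells was visited at least once. This is only a sketch of the general effect. Uniform random points stand in for the states seen during training, which is a strong simplification, and the numbers say nothing about the actual MuJoCo environments.

```python
import random


def coverage(num_samples, dims, bins_per_dim=4, seed=0):
    """Fraction of grid cells hit by uniform random samples in [0, 1)^dims."""
    rng = random.Random(seed)
    visited = set()
    for _ in range(num_samples):
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(dims))
        visited.add(cell)
    return len(visited) / bins_per_dim ** dims


# Same sample budget, growing number of dimensions.
for dims in (2, 4, 8):
    print(dims, round(coverage(10_000, dims), 3))
```

With the same budget of samples, the visited share of the grid drops sharply as the number of dimensions grows, leaving more unvisited regions for an adversary to push the victim into.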

This also explains why simple countermeasures worked so well. Masking the victim's observation, thereby removing the adversary's influence over the next state, made the adversary useless, and even with the victim's slow learning speed it developed a strategy to counter the adversary in very little time. Both of these defensive strategies are, however, not without cost. They require additional training and weaken the performance of the agent against non-adversarial agents.

There does not seem to be a single best strategy for an agent to follow. For each agent there also exists an adversary capable of exploiting or outperforming the agent's strategy. Thus we need to balance the robustness and the performance of an agent, as we cannot create an agent that is both the best performing and robust against all attacks.
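
A toy example of this trade-off is rock-paper-scissors, which is of course far simpler than the environments above. The sketch below computes the best reply against a fixed mixed strategy: any predictable preference can be exploited, while the one strategy that cannot be exploited also wins nothing on average. The example probabilities are made up.

```python
MOVES = ("rock", "paper", "scissors")

# PAYOFF[a][b] is the reward for playing move a against move b:
# +1 for a win, -1 for a loss, 0 for a tie.
PAYOFF = [
    [0, -1, 1],   # rock against rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def best_response(victim_probs):
    """Return the move with the highest expected payoff against a fixed strategy."""
    expected = [
        sum(PAYOFF[a][b] * p for b, p in enumerate(victim_probs))
        for a in range(len(MOVES))
    ]
    best = max(range(len(MOVES)), key=lambda a: expected[a])
    return MOVES[best], expected[best]


print(best_response([0.5, 0.3, 0.2]))      # victim that prefers rock
print(best_response([1 / 3, 1 / 3, 1 / 3]))  # uniformly random victim
```

Against the rock-heavy victim the best reply is paper with a positive expected payoff, whereas against the uniform victim no reply gains anything, but the uniform victim does not gain anything either.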