### Project Group 1 in Practical Planning Robust Behavior for Autonomous Driving
# Reinforcement Learning using Graph Neural Networks

#### Tom Dörr, Marco Oliva, Quoc Trung Nguyen, Silvan Wimmer (Mentor: Patrick Hart)

__Objective__: Exploit the graph-like structure of traffic scenarios by applying graph neural networks to the Soft-Actor-Critic algorithm.

> ⚠️ **NOTE**  
To run this notebook with all dependencies, execute the command `bazel run //docs/gnn_practical_course_2020:run`

## Chapter 1: Setup

In [1]:
import tensorflow as tf
import numpy as np
import logging
import pprint as pp
import shutil
import os
import sys
import datetime
logging.disable(sys.maxsize)

# BARK imports
from bark.runtime.commons.parameters import ParameterServer

# BARK-ML imports
from bark_ml.environments.blueprints import ContinuousMergingBlueprint, ContinuousHighwayBlueprint
from bark_ml.environments.single_agent_runtime import SingleAgentRuntime
from bark_ml.library_wrappers.lib_tf_agents.agents import BehaviorGraphSACAgent
from bark_ml.observers.graph_observer import GraphObserver

# Supervised tests
from bark_ml.tests.capability_gnn_actor.data_handler import SupervisedData
from bark_ml.tests.capability_gnn_actor.actor_nets import ConstantActorNet, RandomActorNet

# Report helper functions
from helper_functions import configurable_setup, benchmark_actor, explain_observation, clean_log_dir

pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html


## Chapter 2: The GraphObserver
To convert the state of the world into a graph, we needed to implement a new observer.
In this chapter, we briefly introduce the working mechanisms of the `GraphObserver`.

<img src="images/observer.png" width="700">


The `GraphObserver` has the following parameters, which can be set via the `ParameterServer` (i.e. under `params["ML"]["GraphObserver"]`):
- **AgentLimit**: The maximum number of agents that can be observed. Default is `4`.
- **NormalizationEnabled**: Whether node and edge features are normalized.
- **VisibilityRadius**: The radius in which an agent can 'see', i.e. detect other agents. Default is `50`.
- **SelfLoops**: Whether each node has an edge pointing to itself. Default is `False`.
- **EnabledNodeFeatures**: The list of enabled node features, given by their string keys, that the observer should extract from the world and insert into the observation. For all available features, refer to the list returned by `GraphObserver.available_node_attributes()`.
- **EnabledEdgeFeatures**: The list of enabled edge features, given by their string keys, that the observer should extract from the world and insert into the observation. For all available features, refer to the list returned by `GraphObserver.available_edge_attributes()`.

There are two main interfaces:

- The `Observe(world)` method returns an observation based on the current snapshot of the world by extracting node attributes, an adjacency matrix, and edge attributes. An observation is a 1D `tf.Tensor` concatenating all information of the graph.

- The `graph(observations, graph_dims, dense=False)` method performs the inverse operation. It accepts a batch of observations (as generated by the `Observe` method) as input and returns a batch of graphs. The argument `dense` specifies the format of the returned graph representation (for further details, see the function docstring).

Let's look at a small example:

In [2]:
%%capture

# create an environment that we want to observe
params = ParameterServer(filename='data/tfa_gnn_params.json')
bp = ContinuousHighwayBlueprint(params, num_scenarios=2, random_seed=0)
observer = GraphObserver(params=params)
env = SingleAgentRuntime(blueprint=bp, observer=observer, render=False)

The following example explains one observation and its parts (further details on each part can be found in the documentation):

In [3]:
# let's initialize the environment and call Observe() (this is happening internally in env.reset)
observation = env.reset()

explain_observation(observation, observer.graph_dimensions)

Node_attributes(flattened matrix of original shape 4x11) (nodes x node attributes):
 [ 0.4999322  -0.64995706 -0.5        -0.45696357 -0.4999322   1.
 -0.4999322   0.82497853 -0.49444506  0.64952433 -0.4        -0.4999322
 -0.66071093 -0.5        -0.4765408  -0.4999322   1.          0.
  0.83035547 -0.5         0.6600226  -0.4         0.4999322  -0.7466602
 -0.5        -0.46313983 -0.4999322   1.         -0.4999322   0.87333006
 -0.49475256  0.74617344 -0.4        -0.4999322  -0.53071094 -0.5
 -0.45591465 -0.4999322   1.          0.          0.76535547 -0.5
  0.5300765  -0.4       ]
Adjacency matrix(flattened matrix of original shape 4x4) (nodes x nodes):
 [0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0.]
Edge_attributes(flattened matrix of original shape 16x4) (number of edges x edge attributes):
 [ 0.          0.          0.          0.          0.9998644   0.01075391
  0.01957722  0.          0.          0.09670313  0.00617626  0.
  0.9998644  -0.1192461  -0.00104893  0.         -0.

And now let's check out what happens when we convert this observation back into a graph.

In [4]:
observation = tf.expand_dims(observation, 0) # add a batch dimension
observer.graph(observation, observer.graph_dimensions)

(<tf.Tensor: shape=(1, 4, 11), dtype=float32, numpy=
 array([[[ 0.4999322 , -0.64995706, -0.5       , -0.45696357,
          -0.4999322 ,  1.        , -0.4999322 ,  0.82497853,
          -0.49444506,  0.64952433, -0.4       ],
         [-0.4999322 , -0.66071093, -0.5       , -0.4765408 ,
          -0.4999322 ,  1.        ,  0.        ,  0.83035547,
          -0.5       ,  0.6600226 , -0.4       ],
         [ 0.4999322 , -0.7466602 , -0.5       , -0.46313983,
          -0.4999322 ,  1.        , -0.4999322 ,  0.87333006,
          -0.49475256,  0.74617344, -0.4       ],
         [-0.4999322 , -0.53071094, -0.5       , -0.45591465,
          -0.4999322 ,  1.        ,  0.        ,  0.76535547,
          -0.5       ,  0.5300765 , -0.4       ]]], dtype=float32)>,
 <tf.Tensor: shape=(1, 4, 4), dtype=float32, numpy=
 array([[[0., 1., 1., 1.],
         [1., 0., 1., 1.],
         [1., 1., 0., 1.],
         [1., 1., 1., 0.]]], dtype=float32)>,
 <tf.Tensor: shape=(1, 4, 4, 4), dtype=float32, numpy
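
The round trip between a graph and its flat observation can also be sketched without BARK-ML. The snippet below is only an illustration in plain Python: the helper names `flatten_graph` and `unflatten_graph` are hypothetical, and the layout (node attributes, then adjacency matrix, then edge attributes) is assumed from the printed observation above rather than taken from the `GraphObserver` source.

```python
# Assumed layout: [node attributes | adjacency matrix | edge attributes],
# each part flattened row by row. Helper names are hypothetical.

def flatten_graph(nodes, adjacency, edges):
    """Concatenate a graph into one flat list (analogous to `Observe`)."""
    flat = [v for row in nodes for v in row]
    flat += [v for row in adjacency for v in row]
    flat += [v for row in edges for v in row]
    return flat


def unflatten_graph(flat, num_nodes, num_node_feats, num_edge_feats):
    """Recover the three parts from the flat list (analogous to `graph`)."""
    def take(offset, rows, cols):
        # cut `rows` consecutive chunks of length `cols`, starting at `offset`
        block = [flat[offset + r * cols: offset + (r + 1) * cols]
                 for r in range(rows)]
        return block, offset + rows * cols

    nodes, offset = take(0, num_nodes, num_node_feats)
    adjacency, offset = take(offset, num_nodes, num_nodes)
    edges, offset = take(offset, num_nodes * num_nodes, num_edge_feats)
    return nodes, adjacency, edges


nodes = [[0.5, -0.6], [-0.5, -0.7]]      # 2 nodes x 2 node features
adjacency = [[0.0, 1.0], [1.0, 0.0]]     # 2 x 2
edges = [[0.0], [0.3], [-0.3], [0.0]]    # (2 * 2) edges x 1 edge feature

flat = flatten_graph(nodes, adjacency, edges)
restored = unflatten_graph(flat, num_nodes=2, num_node_feats=2, num_edge_feats=1)
print(len(flat), restored == (nodes, adjacency, edges))
```

The graph dimensions have to be passed alongside the flat vector because they cannot be recovered from its length alone, which is presumably why `graph` takes `graph_dims` as an argument.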

## Chapter 3: Graph Neural Networks
Before diving into how we apply graph neural networks to our problem, let's have a **very brief** overview of the idea behind them.  
Most importantly, they operate on graph structured data, i.e. data consisting of 
- **Nodes:** feature vectors (node embeddings) of some data entities (and optionally a label), in our case each vehicle is a node
- **Edges:** specified links between nodes
- **Edge features:** optionally, each link between nodes can have its own feature vector

In the section about the `GraphObserver` above, we've already seen what this graph can look like in our scenario. Let's take a step back and use a simplified visualization where the green node represents the ego vehicle and the remaining nodes are other vehicles in its vicinity on the road.

![Schematic view of a GNN](images/simple_gnn.png)

The ego node is connected to both other nodes (it "sees" the other nodes) which in turn do not see each other.

Now, the nodes send messages (their current embeddings) along all outgoing links (here, all links are bidirectional), propagated through a neural network. From now on, we refer to this neural network as the _message passing layer(s)_.

> **NOTE**  
All edges share the same neural network, instead of each edge having its own weights.

Each node aggregates all incoming messages using an aggregation function, like summing or averaging. The result is then processed by another neural network, e.g. a recurrent unit, which computes the new embedding of the node.
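
One round of message passing on the three-node graph from the figure can be sketched in a few lines of plain Python. This is a deliberately reduced illustration, not what either library computes: a single scalar `weight` stands in for the shared message passing network, messages are summed, and a `tanh` of "old embedding plus aggregated messages" stands in for the update network.

```python
import math


def message(embedding, weight):
    """Shared message function: the same weight is used on every edge."""
    return [weight * v for v in embedding]


def message_passing_step(embeddings, edges, weight):
    """One round of message passing with sum aggregation and a tanh update."""
    dim = len(embeddings[0])
    inbox = [[0.0] * dim for _ in embeddings]
    for source, target in edges:
        msg = message(embeddings[source], weight)
        inbox[target] = [a + m for a, m in zip(inbox[target], msg)]
    # update: combine each old embedding with its aggregated messages
    return [[math.tanh(h + a) for h, a in zip(old, agg)]
            for old, agg in zip(embeddings, inbox)]


# node 0 is the ego vehicle; it is linked to nodes 1 and 2 in both directions,
# while nodes 1 and 2 are not linked to each other
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
edges = [(0, 1), (1, 0), (0, 2), (2, 0)]

updated = message_passing_step(embeddings, edges, weight=0.5)
print(updated[0])  # the ego node received messages from both neighbours
```

After one step, the ego embedding depends on both neighbours, whereas nodes 1 and 2 only know about the ego node. Stacking several message passing layers lets information travel along longer paths.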


In our project, we have integrated two different libraries that offer GNN implementations:
1. [tf2_gnn](https://github.com/microsoft/tf2-gnn): the library that was initially planned to be used in the project
2. [Spektral](https://graphneural.network/#installation): a library that supports edge features, which `tf2_gnn` does not

## Chapter 4: The `GraphNetwork` class

As an abstraction over the specific implementation of the graph neural network, we implemented a wrapper class called `GraphNetwork`. Its sole purpose is to act as a GNN, so its only interface is the `call` function, which accepts a batch of observations (array representations of graphs) and returns a batch of updated node embeddings for each graph.

In order to support `tf2_gnn` and `Spektral`, we have two distinct call implementations, one for each library. The `GraphNetwork` class decides which one to call based on the arguments given in the initialization.

Both functions, however, work in almost the same way:
1. Convert the given observations into nodes, edges and, when using `Spektral`, edge features.
2. Call the respective library with the converted graph representation.
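
The dispatch between the two implementations could look roughly like the following sketch. The class name `GraphNetworkSketch` and its method names are hypothetical, and the method bodies are placeholders. Only the pattern of choosing one implementation at initialization is meant to mirror the description above.

```python
class GraphNetworkSketch:
    """Hypothetical, simplified stand-in for the wrapper described above."""

    def __init__(self, library="spektral"):
        implementations = {
            "spektral": self._call_spektral,
            "tf2_gnn": self._call_tf2_gnn,
        }
        if library not in implementations:
            raise ValueError(f"Unsupported GNN library: {library!r}")
        # decide once, at initialization, which implementation `__call__` uses
        self._call_impl = implementations[library]

    def __call__(self, observations):
        return self._call_impl(observations)

    def _call_spektral(self, observations):
        # placeholder: would convert to (nodes, adjacency, edge features)
        # and run the Spektral layers
        return ("spektral", len(observations))

    def _call_tf2_gnn(self, observations):
        # placeholder: would convert to (nodes, adjacency list,
        # node-to-graph map) and run the tf2_gnn model
        return ("tf2_gnn", len(observations))


gnn = GraphNetworkSketch(library="tf2_gnn")
print(gnn([[0.0], [1.0]]))
```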

When specifying `Spektral` as the GNN library, the call function looks like this:

In [5]:
from spektral.layers import EdgeConditionedConv
from helper_functions import get_sample_observations, graph_dims

def call_spektral_demo(observations):
    # define the layers of the GNN (normally, this happens upon initialization)
    
    # this defines an edge-conditioned convolution as the message passing layer
    # the `kernel_network` argument defines the layers of the edge neural network
    edge_convolution = EdgeConditionedConv(channels=16, kernel_network=[128], activation="relu")

    def call_spektral(observations, training=False):
        # convert the observations into
        # old_embeddings: tensor containing the node features (embeddings)
        # A: binary adjacency matrix specifying edges in the graph
        # E: tensor containing edge features
        old_embeddings, A, E = GraphObserver.graph(observations, graph_dims)

        # pass the inputs through an edge conditioned convolution
        # layer and receive new node embeddings
        new_embeddings = edge_convolution([old_embeddings, A, E])

        # output the final transformed node embeddings
        return old_embeddings, new_embeddings
    
    old_embeddings, new_embeddings = call_spektral(observations)
    
    print("Here's how the embeddings of the ego agent have changed:\n")
    print(f'old embeddings of shape {old_embeddings[0, 0].shape}: \n{old_embeddings[0,0,:].numpy()}\n')
    print(f'new embeddings of shape {new_embeddings[0, 0].shape}: \n{new_embeddings[0,0,:].numpy()}')

# call the function with sample observations
call_spektral_demo(get_sample_observations())

Here's how the embeddings of the ego agent have changed:

old embeddings of shape (5,): 
[0.88748217 0.6305756  0.1297453  0.86862236 0.00165058]

new embeddings of shape (16,): 
[0.08864352 0.         0.         0.45904642 0.6646736  0.
 0.         0.11452283 0.         0.03114677 0.15341747 0.43506444
 0.4339314  0.84074783 0.28326586 0.        ]
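
To give an intuition for what an edge-conditioned convolution does, here is a dependency-free sketch. It rests on simplifying assumptions: the kernel network is reduced to a single linear map from edge features to a weight matrix, and any self-connection or bias terms that the real Spektral layer may add are left out. All names are hypothetical.

```python
def kernel(edge_features, kernel_weights):
    """Map edge features to an (in_dim x out_dim) weight matrix.

    A single linear map is used here for brevity; in the cell above,
    `kernel_network=[128]` configures a small neural network instead.
    """
    in_dim, out_dim = len(kernel_weights[0]), len(kernel_weights[0][0])
    return [[sum(e * kernel_weights[k][a][b] for k, e in enumerate(edge_features))
             for b in range(out_dim)]
            for a in range(in_dim)]


def edge_conditioned_conv(nodes, adjacency, edges, kernel_weights):
    """New embedding of node i = ReLU(sum over neighbours j of x_j @ W(e_ij))."""
    out_dim = len(kernel_weights[0][0])
    new_nodes = []
    for i in range(len(nodes)):
        acc = [0.0] * out_dim
        for j in range(len(nodes)):
            if adjacency[i][j]:
                w = kernel(edges[i][j], kernel_weights)
                for b in range(out_dim):
                    acc[b] += sum(nodes[j][a] * w[a][b] for a in range(len(nodes[j])))
        new_nodes.append([max(0.0, v) for v in acc])  # ReLU activation
    return new_nodes


nodes = [[1.0], [2.0]]                          # 2 nodes x 1 feature
adjacency = [[0, 1], [1, 0]]
edges = [[[0.0], [0.5]], [[-0.5], [0.0]]]       # edges[i][j]: 1 feature per edge
kernel_weights = [[[1.0, -1.0]]]                # (edge feats, in_dim, out_dim)

new_nodes = edge_conditioned_conv(nodes, adjacency, edges, kernel_weights)
print(new_nodes)
```

The key point is that the weight matrix applied to a neighbour's embedding depends on the features of the connecting edge, so the same neighbour can contribute differently depending on, for example, its relative position.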


In comparison, when `tf2_gnn` is specified, the implementation looks like this:

In [6]:
from tf2_gnn.layers import GNN, GNNInput

def call_tf2_gnn_demo(observations):
    # the number and types of layers in the GNN are all encoded
    # in the parameters dictionary, let's stick to the default for now
    gnn_params = GNN.get_default_hyperparameters()

    # uncomment the following two lines to have a look at them
    # print(f'GNN parameters:')
    # pp.pprint(gnn_params)

    # initialize a GNN instance which acts as a keras layer
    gnn = GNN(gnn_params)

    def call_tf2_gnn(observations, training=False):
        batch_size = tf.constant(observations.shape[0])

        # convert the observations into
        # old_embeddings: tensor containing the node features
        # A: dense adjacency list in the format [[0, 1], [2, 4]]
        #    specifying source and target node ids of an edge
        # node_to_graph_map: a tensor that assigns each node in X to a graph
        old_embeddings, A, node_to_graph_map = GraphObserver.graph(
          observations,
          graph_dims=graph_dims, 
          dense=True)
        
        # build the struct that tf2_gnn expects as input
        gnn_input = GNNInput(
          node_features=old_embeddings,
          adjacency_lists=(A,),
          node_to_graph_map=node_to_graph_map,
          num_graphs=batch_size,
        )

        new_embeddings = gnn(gnn_input, training=training)
        
        # only for demo purposes
        old_embeddings = tf.reshape(old_embeddings, [batch_size, graph_dims[0], -1])
        new_embeddings = tf.reshape(new_embeddings, [batch_size, graph_dims[0], -1])
        
        return old_embeddings, new_embeddings
    
    old_embeddings, new_embeddings = call_tf2_gnn(observations)
    
    print("\nHere's how the embeddings of the ego agent have changed:\n")
    print(f'old embeddings of shape {old_embeddings[0, 0].shape}: \n{old_embeddings[0,0,:].numpy()}\n')
    print(f'new embeddings of shape {new_embeddings[0, 0].shape}: \n{new_embeddings[0,0,:].numpy()}')

# call the function with sample observations
call_tf2_gnn_demo(get_sample_observations())


Here's how the embeddings of the ego agent have changed:

old embeddings of shape (5,): 
[0.5538787  0.6449515  0.9806703  0.38632354 0.78949356]

new embeddings of shape (16,): 
[0.01485924 0.         0.02307033 0.         0.08963607 0.
 0.03004937 0.         0.04181047 0.         0.06556597 0.
 0.         0.         0.01167138 0.        ]
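
The adjacency list and `node_to_graph_map` mentioned in the comments above can be illustrated with a small, dependency-free sketch. It assumes that every graph in the batch has the same number of nodes and that the nodes of all graphs are numbered consecutively across the batch. The function name `to_edge_list` is hypothetical and not part of `GraphObserver`.

```python
def to_edge_list(adjacency_batch):
    """Convert a batch of dense adjacency matrices into one global edge list.

    Nodes are numbered consecutively across the batch, so with `n` nodes per
    graph, graph `g` uses the node ids g*n ... g*n + n - 1.
    """
    edge_list, node_to_graph_map = [], []
    for graph_id, adjacency in enumerate(adjacency_batch):
        num_nodes = len(adjacency)
        offset = graph_id * num_nodes
        node_to_graph_map += [graph_id] * num_nodes
        for source, row in enumerate(adjacency):
            for target, connected in enumerate(row):
                if connected:
                    edge_list.append([offset + source, offset + target])
    return edge_list, node_to_graph_map


batch = [
    [[0, 1], [1, 0]],   # graph 0: both nodes see each other
    [[0, 0], [1, 0]],   # graph 1: only node 1 -> node 0
]
edge_list, node_to_graph_map = to_edge_list(batch)
print(edge_list)
print(node_to_graph_map)
```

Merging all graphs of a batch into one large, disconnected graph like this allows a single forward pass over the whole batch, while the map records which graph each node came from.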


Having the GNN functionality nicely abstracted behind this wrapper, we can now easily integrate it into the Soft-Actor-Critic framework.

## Chapter 5: The Soft-Actor-Critic Algorithm with Graph Neural Networks

Next, let's examine the integrated system.

We want to exploit the graph-like structure of traffic scenarios and have already encoded the state of the world as a graph. Now, we want to apply graph neural networks to the SAC algorithm. 

The resulting actor and critic networks are quite similar in structure. Here's how they work and what they compute.

### The Actor Network

Implemented in the class `GNNActorNetwork`.


**Input**: a batch of observations of shape _(batch_size, observation_size)_  
**Output**: a batch of normal distributions over the action space from which the policy will sample the actions performed by the agent

![Actor Network Architecture](images/actor_architecture.png)

**1. GNN**  
The observations are directly fed into the graph neural network (a `GraphNetwork` instance). It converts the observations into graphs and computes new node embeddings for each graph by means of message passing and aggregation.

> **NOTE**  
From here on, we're only interested in the embeddings of the ego agent. Hence, instead of feeding the whole graph representation into the encoding network, we extract the embeddings of the first node of each graph, which represents the ego agent.

**2. Encoding Network**  
In the encoding network, the node embeddings of the ego agent are now passed through a series of dense layers. Depending on the parameters passed into the actor, we can also add convolutions, dropout and other types of layers here.

**3. Projection Network**  
Finally, the projection network receives the hidden representations after the encoding network and computes a normal distribution over the action space for each observation contained in the batch, modeled by a mean and a standard deviation.

Simplified for brevity, the implementation of the actor's `call` function looks as follows:


```python
def call(self, observations, training=False):
    # get the updated node embeddings
    output = self._gnn(observations, training=training)

    # extract the ego state (the first node embedding vector of each batch element)
    output = output[:, 0]
    
    # pass the ego agent's node embeddings through the encoder
    output = self._encoder(output, training=training)
    
    # compute a normal distribution
    output = self._projection_net(output, training=training)

    return output
```
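
The last step, turning a hidden representation into a normal distribution and sampling an action from it, can be sketched as follows. This is an assumption-laden illustration in plain Python: the weights are made up, the projection is a single linear layer, and the standard deviation is obtained by exponentiating a predicted log value. The projection network actually used by the agent may differ, for example by squashing actions into the valid range.

```python
import math
import random


def project_to_normal(hidden, mean_weights, log_std_weights):
    """Linear projection of a hidden vector to a per-action mean and std."""
    mean = [sum(w * h for w, h in zip(row, hidden)) for row in mean_weights]
    log_std = [sum(w * h for w, h in zip(row, hidden)) for row in log_std_weights]
    std = [math.exp(v) for v in log_std]  # exp keeps the std strictly positive
    return mean, std


def sample_action(mean, std, rng):
    """Draw one action from independent normal distributions."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


hidden = [0.2, -0.1, 0.4]                       # encoder output for one observation
mean_weights = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.0]]
log_std_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0]]

mean, std = project_to_normal(hidden, mean_weights, log_std_weights)
action = sample_action(mean, std, random.Random(0))
print(mean, std, action)
```

The action here has two dimensions, matching the action shape _(batch_size, 2)_ used by the critic.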

### The Critic Network

Implemented in the class `GNNCriticNetwork`.

**Input**: a batch of observation-action pairs, i.e. `[obs, action]` with shapes _(batch_size, observation_size)_ and _(batch_size, 2)_  
**Output**: a scalar value assigned to each observation-action pair


The major difference compared to the actor network is that in the critic, we have two parallel pipelines for the observations and their corresponding actions.

![Critic Network Architecture](images/critic_architecture.png)

**1. Actions**  
The actions are simply passed into an action encoding network that works similarly to the encoding network of the actor network, i.e. a series of dense layers with optional convolutions, dropout layers, etc.

**2. Observations**  
The observations are processed in the exact same way as in the actor network. We compute new graph representations in the GNN, extract the ego node embeddings and pass them through an encoding network.

**3. Joining Actions and Observations**  
After receiving the outputs from the action and observation encoding networks, we concatenate the observation-action pair of each element in the batch to one feature vector.  
Finally, we pass this concatenated state through a fully connected joint network which outputs a scalar value for each observation-action pair.

Again, a simplified version of the implementation looks like this:  
```python
def call(self, inputs, training=False):
    observations, actions = inputs
     
    # get the updated node embeddings
    node_embeddings = self._gnn(observations, training=training)
    
    # extract the ego state (the first node embedding vector of each batch element)
    node_embeddings = node_embeddings[:, 0]
    
    # pass the node embeddings through their observation encoder
    node_embeddings = self._observation_encoder(node_embeddings, training=training)
    
    # do the same for the actions with a different action encoder
    actions = self._action_encoder(actions, training=training)
    
    # concatenate observations and actions into one vector
    joint = tf.concat([node_embeddings, actions], 1)
    
    # compute a scalar output value
    output = self._joint_net(joint, training=training)

    return output
```
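
The joining step can be made concrete with a tiny example in plain Python. The numbers and the single linear layer standing in for the joint network are made up for illustration. The point is only that concatenation along axis 1 joins the features of each batch element and that the result is one scalar per observation-action pair.

```python
# encoded observations and actions for a batch of two elements (made-up values)
obs_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # shape (2, 3)
act_features = [[1.0, -1.0], [0.5, 0.0]]            # shape (2, 2)

# equivalent of tf.concat([obs, act], 1): join the features row by row
joint = [obs + act for obs, act in zip(obs_features, act_features)]  # shape (2, 5)

# a single linear unit stands in for the fully connected joint network
weights = [0.5, 0.0, 0.0, 1.0, 0.0]
bias = 0.1
q_values = [sum(w * x for w, x in zip(weights, row)) + bias for row in joint]

print(joint)
print(q_values)  # one scalar per observation-action pair
```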

## Chapter 6: Putting It All Together and Setting up an Example

Now, let's set up an SAC-agent using the graph neural networks described above to be used in BARK-ML.

We start out with the default parameter set as defined in `tfa_sac_gnn_spektral_default.json` and make some optional changes afterwards.

In [7]:
params = ParameterServer(filename="../../examples/example_params/tfa_sac_gnn_spektral_default.json")

First, set up the GNN-related parameters. We use the same GNN configuration in the actor and critic.

In [8]:
# use a spektral GNN
params['ML']['BehaviorGraphSACAgent']['GNN']['Library'] = 'spektral'
    
# use two message passing layers with 80 channels of node embeddings each
params["ML"]["BehaviorGraphSACAgent"]["GNN"]["NumMpLayers"] = 2
params["ML"]["BehaviorGraphSACAgent"]["GNN"]["MpLayersHiddenDim"] = 80
    
# use two fully connected layers in the edge feature mlp of each message passing layer
params['ML']['BehaviorGraphSACAgent']['GNN']['EdgeFcLayerParams'] = [128, 64]

Next, configure the layers that make up the encoding networks in the actor and critic.

In [9]:
params["ML"]["BehaviorGraphSACAgent"]["CriticJointFcLayerParams"] = [128, 128]
params["ML"]["BehaviorGraphSACAgent"]["CriticObservationFcLayerParams"] = [128, 128]
params["ML"]["BehaviorGraphSACAgent"]["ActorFcLayerParams"] = [256, 128]

Finally, we configure the `GraphObserver`.
Here we specify that it should always observe at most 4 agents simultaneously, i.e. the ego agent and its three nearest agents.

In [10]:
params["ML"]["GraphObserver"]["AgentLimit"] = 4

We can also specify which features the graph observer should use in the node embeddings and the edges. This is useful since not all environments contain the same information and thus, some features might not be possible to compute. To get a list of all available features, execute the following cell.

In [11]:
print(f"Available node features:")
for key, value in GraphObserver.available_node_attributes(with_descriptions=True).items():
    print(f"  '{key}': {value}")

print(f"\nAvailable edge features:")
for key, value in GraphObserver.available_edge_attributes(with_descriptions=True).items():
    print(f"  '{key}': {value}")

Available node features:
  'x': The x-components of the agent's position.
  'y': The y-components of the agent's position.
  'theta': The current heading angle of tha agent.
  'vel': The current velocity of the agent.
  'goal_x': The x-component of the goal's position.
  'goal_y': The y-component of the goal's position.
  'goal_dx': The difference in the x-component of the agent's and the goal's position.
  'goal_dy': The difference in the y-component of the agent's and the goal's position.
  'goal_theta': The goal heading angle.
  'goal_d': The euclidian distance of the agent to the goal.
  'goal_vel': The goal velocity.

Available edge features:
  'dx': The difference in the x-position of the two agents.
  'dy': The difference in the y-position of the two agents.
  'dvel': The difference in the velocity of the two agents.
  'dtheta': The difference in the heading angle of the two agents.


In our example, we train an agent on the `ContinuousMergingBlueprint`, where the goal definition does not contain velocity information, so we cannot use the `goal_vel` node feature. So let's configure the `GraphObserver` to use all available node features except `goal_vel`.

In the edges, we want to use all available features, so we don't specify anything. The `GraphObserver` always defaults to using all features if nothing is explicitly configured.

In [12]:
enabled_node_features = GraphObserver.available_node_attributes()[:-1]
params["ML"]["GraphObserver"]["EnabledNodeFeatures"] = enabled_node_features
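
Slicing with `[:-1]` relies on `goal_vel` being the last entry of the list. A slightly more defensive variant filters by name instead. In the sketch below, the feature list is copied from the cell output above rather than queried from the `GraphObserver`, so that the snippet runs on its own. The variable `unsupported` is an illustrative name.

```python
# feature keys as printed by the cell above
available_node_features = [
    "x", "y", "theta", "vel", "goal_x", "goal_y",
    "goal_dx", "goal_dy", "goal_theta", "goal_d", "goal_vel",
]

# features that the environment at hand cannot provide
unsupported = {"goal_vel"}

enabled_node_features = [
    feature for feature in available_node_features
    if feature not in unsupported
]
print(enabled_node_features)
```

With the current feature order, both approaches select the same ten features. The name-based filter keeps working if the order of the available features ever changes.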

In case you feel like experimenting, expand the following dropdown to get an overview of the most important parameters you can tweak.  
> **NOTE**  
The dropdown is not visible when viewing the notebook on GitHub.


<details>
<summary><b>List of the most important parameters</b></summary>
<br>
  <b>Description:</b> Specifies the maximum number of agents that are included in an observation. (int)<br>
  <b>Path:</b> ['ML']['GraphObserver']['AgentLimit'] <br>
  <br>
  <b>Description:</b> Specifies whether each node in the graph will have an edge pointing to itself. (Bool)<br>
  <b>Path:</b> ['ML']['GraphObserver']['SelfLoops'] <br>
  <br>
  <b>Description:</b> Specifies the features that the GraphObserver will include in the node embeddings. [str]<br>
  <b>Path:</b> ['ML']['GraphObserver']['EnabledNodeFeatures'] <br>
  <br>
  <b>Description:</b> Specifies the features that the GraphObserver will include in the edges. [str]<br>
  <b>Path:</b> ['ML']['GraphObserver']['EnabledEdgeFeatures'] <br>
  <br>
  <b>Description:</b> Specifies the fully connected layers (number and sizes) of the actor encoding network. ([int]) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['ActorFcLayerParams'] <br>
  <br>
  <b>Description:</b> Specifies the fully connected layers (number and sizes) of the critic action encoding network. ([int]) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['CriticActionFcLayerParams'] <br>
  <br>
  <b>Description:</b> Specifies the fully connected layers (number and sizes) of the critic observation encoding network. ([int]) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['CriticObservationFcLayerParams'] <br>
  <br>
  <b>Description:</b> Specifies the fully connected layers (number and sizes) of the critic joint network. ([int]) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['CriticJointFcLayerParams'] <br>
  <br>
  <b>Description:</b> Specifies the number of message passing layers in the GNN. (int) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['NumMpLayers'] <br>
  <br>
  <b>Description:</b> Specifies the number of units in the message passing layers in the GNN. (int) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['MpLayersHiddenDim'] <br>
  <br>
  <b>Description:</b> Specifies which library to use as the GNN implementation, either "tf2_gnn" or "spektral". <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['Library'] <br>
  
  <h3>The following parameters only apply to TF2-GNN.</h3>

  <br>
  <b>Description:</b> The identifier of the message passing class to be used, here: a relational gated convolution network. (str)
      <br><i>NOTE: when using the 'ggnn' message passing layer, 'MpLayersHiddenDim' must match the number of node features!</i> <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN.message_calculation_class'] <br>
  <br>
  <b>Description:</b> The identifier of the global exchange mode to be used, here: a gated recurrent unit. (str) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['global_exchange_mode'] <br>
  <br>
  <b>Description:</b> Specifies after how many message passing layers a dense layer is inserted. (int) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['dense_every_num_layers'] <br>
  <br>
  <b>Description:</b> Specifies after how many message passing layers a global exchange layer is inserted. (int) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN.global_exchange_every_num_layers'] <br>
  
  <h3>The following parameters only apply to Spektral.</h3>

  <b>Description:</b> Specifies the number of channels in the edge conditioned convolution layer. (int) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['MPChannels'] <br>
  
  <b>Description:</b> Specifies the fully connected layers (number and sizes) in the edge network. ([int]) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['EdgeFcLayerParams'] <br>
  
  <b>Description:</b> Specifies the activation function of the message passing layer. (str) <br>
  <b>Path:</b> ['ML']['BehaviorGraphSACAgent']['GNN']['MPLayerActivation'] <br>
</details>

Finally, we configure the BARK-ML environment and our agent.

In [13]:
%%capture

# create environment
bp = ContinuousMergingBlueprint(params, num_scenarios=1, random_seed=0)
observer = GraphObserver(params=params)
env = SingleAgentRuntime(blueprint=bp, observer=observer, render=False)
    
# create agent
agent = BehaviorGraphSACAgent(environment=env, observer=observer, params=params)

In [14]:
from helper_functions import prepare_agent, summarize_agent

# only for demo purposes
prepare_agent(agent, params, env)
summarize_agent(agent)


AGENT SUMMARY

Network                        Parameters
ActorNetwork...................... 547.300
CriticNetwork..................... 553.441
CriticNetwork2.................... 553.441
TargetCriticNetwork1.............. 553.441
TargetCriticNetwork2.............. 553.441
------------------------------------------
Total parameters                 2.761.064


## Chapter 7: Verify the Actor Network with Supervised Learning
In this part, we briefly introduce a supervised learning setting that helps to quickly debug different actor implementations. We evaluate whether the actor network is capable of overfitting a small dataset and compare it against a `RandomActor` and a `ConstantActor`. The `RandomActor` simply outputs a random label within prespecified bounds. The `ConstantActor`, on the other hand, always outputs the mean label of the training dataset. This chapter demonstrates what we've implemented in `py_gnn_actor_tests`.

Additionally, we use TensorBoard to compare the learning performance against the standard SAC actor.

But first, let's define a few parameters.

In [15]:
# Filenames for default parameter files
filename_tf2_gnn = "../../examples/example_params/tfa_sac_gnn_tf2_gnn_default.json"
filename_spektral = "../../examples/example_params/tfa_sac_gnn_spektral_default.json"
filename_normal = "../../examples/example_params/tfa_params.json"

params_tf2_gnn = ParameterServer(filename=filename_tf2_gnn)
params_spektral = ParameterServer(filename=filename_spektral)
params_normal = ParameterServer(filename=filename_normal)

num_scenarios = 3
log_dir = "supervised/summary"
data_path = "supervised/data"

Now, we can load the different actors for benchmarking and fetch or load a small dataset:

In [16]:
%%capture

# get an observer and an actor net configured with Spektral
observer, spektral_actor = configurable_setup(params_spektral, num_scenarios=num_scenarios);

# get an actor net configured with tf2_gnn
_, tf2_gnn_actor = configurable_setup(params_tf2_gnn, num_scenarios=num_scenarios);

# get a normal SAC actor (without GNNs)
_, normal_sac_actor = configurable_setup(params_normal, num_scenarios=num_scenarios, graph_sac=False);

# get a random actor (outputs random labels with uniform distribution within bounds)
random_actor = RandomActorNet(low=[0, -0.4], high=[0.1, 0.4])

# construct a supervised dataset using the observer as data generator
dataset = SupervisedData(observer, params_tf2_gnn, batch_size=32, train_split=0.8,
                         data_path=data_path, num_scenarios=num_scenarios);

# get a constant actor (outputs constant mean labels of train_dataset)
constant_actor = ConstantActorNet(dataset=dataset._train_dataset)


actors = {
    "tf2_gnn_actor": {"actor": tf2_gnn_actor},
    "spektral_actor": {"actor": spektral_actor},
    "normal_sac_actor": {"actor": normal_sac_actor},
    "random_actor": {"actor": random_actor},
    "constant_actor": {"actor": constant_actor}
}
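
To make the role of the two baselines concrete, here is a minimal sketch of the idea in plain Python. It does not use `ConstantActorNet` or `RandomActorNet`. The function names, the tiny label set and the mean squared error metric are assumptions made for illustration. Only the bounds `[0, -0.4]` and `[0.1, 0.4]` are taken from the cell above.

```python
import random


def mse(predictions, labels):
    """Mean squared error over all samples and label dimensions."""
    total = sum((p - y) ** 2
                for pred, lab in zip(predictions, labels)
                for p, y in zip(pred, lab))
    return total / (len(labels) * len(labels[0]))


def constant_baseline(train_labels):
    """Always predicts the mean label of the training set."""
    dims = len(train_labels[0])
    mean = [sum(label[d] for label in train_labels) / len(train_labels)
            for d in range(dims)]
    return lambda n: [list(mean) for _ in range(n)]


def random_baseline(low, high, rng):
    """Predicts uniformly distributed labels within the given bounds."""
    return lambda n: [[rng.uniform(lo, hi) for lo, hi in zip(low, high)]
                      for _ in range(n)]


train_labels = [[0.0, -0.2], [0.1, 0.2]]   # made-up two-dimensional labels
constant = constant_baseline(train_labels)
rand = random_baseline(low=[0, -0.4], high=[0.1, 0.4], rng=random.Random(0))

constant_loss = mse(constant(len(train_labels)), train_labels)
random_predictions = rand(5)
print(constant_loss)
```

A trainable actor that is able to overfit the dataset should end up with a training loss clearly below that of both baselines. If it does not, something in the network or the data pipeline is likely broken.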

#### Now, the magic starts.
All trainable actors are trained (the RandomActor and ConstantActor are not trainable) for a few epochs. The results of the training can then be examined in TensorBoard.

In [17]:
# Delete all old logs if some exist
clean_log_dir(log_dir)
old_logs = []

# Run benchmarking
for actor_name in actors:
    loss = benchmark_actor(actors[actor_name]["actor"], dataset, epochs=100, log_dir=log_dir)
    actors[actor_name]["loss"] = loss
    
    # Name log clearly with actor_name
    # Select correct log
    new_logs = os.listdir(log_dir)
    log = list(set(new_logs) - set(old_logs))[0]
    
    # Rename log with actor_name
    old_path = os.path.join(log_dir, log)
    new_path = os.path.join(log_dir, actor_name)
    os.rename(old_path, new_path)
    old_logs = os.listdir(log_dir)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [18]:
%load_ext tensorboard
%tensorboard --logdir supervised/summary

## Chapter 8: Train an actual Agent using Reinforcement Learning

Now, let's finally train an agent on a BARK-ML environment.

To be able to monitor the training, we launch a TensorBoard instance first. For now, there's nothing to see there, but once we start the training, it will reload and visualize the data.

In [19]:
clean_log_dir("training/sac_gnn_spektral")

%load_ext tensorboard
%tensorboard --logdir training/sac_gnn_spektral/summaries

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


We perform a fresh setup of the environment and the agent so that we have it all in one place. Remember that you can customize the agent just as we did above.

In [20]:
%%capture

# Set a name for this run to recognize it in TensorBoard.
# When training multiple times, make sure to set different names, so that the logs are not overwritten.
# By default, we use a simple timestamp.
training_run_name = str(datetime.datetime.now())

# Define how many iterations you want to run (recommended for actual training is >= 40000).
num_training_iterations = 200

# load default parameters
params = ParameterServer(filename='data/tfa_sac_gnn_spektral_default.json')
params["ML"]["BehaviorTFAAgents"]["CheckpointPath"] += f"/{training_run_name}"
params["ML"]["TFARunner"]["SummaryPath"] += f"/{training_run_name}"
params["ML"]["SACRunner"]["NumberOfCollections"] = num_training_iterations

# create environment
bp = ContinuousMergingBlueprint(params, num_scenarios=2500, random_seed=0)
observer = GraphObserver(params=params)
env = SingleAgentRuntime(blueprint=bp, observer=observer, render=False)

In [None]:
from helper_functions import run_rl_example

agent = BehaviorGraphSACAgent(environment=env, observer=observer, params=params)

# start the example
run_rl_example(env, agent, params, mode="train")



## Chapter 9: Results

### Impact of the Graph Neural Network
<img src="images/sac_vs_gnn_sac.jpg" alt="" />
The two graphs above show the results of training a standard SAC agent (blue) and a GNN-SAC agent (brown). The top graph shows the mean reward, and the bottom one shows the mean number of steps the agent performed per training iteration.
We can recognize three phases that the agents go through during training.


#### - Phase 1:

The agent learns not to crash. 
Not crashing results in a reward of 0 at the end of the episode, which is why we can see the mean reward converging towards zero.
The GNN-SAC approaches the reward of zero in fewer training iterations than the standard SAC.


#### - Phase 2:

The agents now have to learn how to actually reach the goal state.
To reach the goal, they have to figure out that not crashing is just a local optimum and that they can achieve a higher reward by taking some risk and making a lane change into the goal position.
The GNN-SAC again needs fewer training iterations to go through this phase, and the difference in training iterations is greater than before.
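
To make the local optimum concrete, here is a minimal sketch of the expected end-of-episode reward for three hypothetical policies. The reward of 0 for neither crashing nor reaching the goal is taken from the text above; the values `r_goal=1.0` and `r_collision=-1.0`, the outcome probabilities, and the function name `expected_terminal_reward` are assumptions chosen for illustration and are not read from the BARK-ML configuration.

```python
def expected_terminal_reward(p_goal, p_collision,
                             r_goal=1.0, r_collision=-1.0, r_neither=0.0):
    """Expected end-of-episode reward for given outcome probabilities.

    r_goal and r_collision are assumed values chosen for illustration.
    """
    if p_goal < 0 or p_collision < 0 or p_goal + p_collision > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    p_neither = 1.0 - p_goal - p_collision
    return p_goal * r_goal + p_collision * r_collision + p_neither * r_neither


# Hypothetical policies, described by (P(goal), P(collision)).
cautious = expected_terminal_reward(p_goal=0.0, p_collision=0.0)   # never merges
exploring = expected_terminal_reward(p_goal=0.2, p_collision=0.4)  # merges clumsily
skilled = expected_terminal_reward(p_goal=0.9, p_collision=0.05)   # merges reliably

print(f"cautious={cautious:.2f} exploring={exploring:.2f} skilled={skilled:.2f}")
```

Under these assumptions, the clumsy merging policy scores lower than the cautious one, which in turn scores lower than the skilled one. An agent that has settled on never merging therefore has to pass through a stretch of worse rewards before it can improve, which is what makes this phase hard.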


#### - Phase 3:

In this last phase, the agents get more efficient in reaching the goal state.
We can see the increase in efficiency by looking at the mean number of steps that each agent takes per episode.
While the agents decrease the number of steps they perform, they keep on increasing the mean reward.
As before, the GNN-SAC outperforms the standard SAC when looking at the number of training iterations.


However, not all runs of the agents are equal.
Sometimes the GNN-SAC does not perform better.
Additionally, we need to consider the number of features that the agents receive.
The standard SAC receives 16 features per observation, while the GNN-SAC receives 127 features per observation.
Naturally, you might wonder whether the performance of the GNN-SAC is just due to the larger number of features.
The larger number of features is the result of the graph-observer the GNN-SAC uses.
To see what impact the graph-observer has, we feed the information from the graph-observer to the standard SAC and again compare it to the GNN-SAC:


### Impact of the Graph-Observer
<p align="center">
<img src="images/evaluation_graph_observer.png" alt="" />
</p>

The GNN-SAC again needs fewer training iterations, which indicates that the GNN-SAC's performance is not just due to the larger number of features in its observations.
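
The comparisons in this chapter are read off the TensorBoard plots. One possible way to turn "needs fewer training iterations" into a number is to record the first iteration at which a smoothed reward curve reaches a chosen threshold. The sketch below assumes this metric; the helper `iterations_to_threshold`, the threshold, the window size, and the two reward curves are all hypothetical, and the curves are synthetic rather than taken from our runs.

```python
def iterations_to_threshold(rewards, threshold, window=5):
    """Return the first index at which the trailing moving average of
    `rewards` (over `window` entries) reaches `threshold`, else None."""
    running_sum = 0.0
    for i, reward in enumerate(rewards):
        running_sum += reward
        if i >= window:
            # Drop the entry that just left the trailing window.
            running_sum -= rewards[i - window]
        if i >= window - 1 and running_sum / window >= threshold:
            return i
    return None


# Synthetic reward curves for illustration only (not measured data).
fast_curve = [min(1.0, -1.0 + 0.04 * i) for i in range(100)]
slow_curve = [min(1.0, -1.0 + 0.02 * i) for i in range(100)]

fast_iters = iterations_to_threshold(fast_curve, threshold=0.5)
slow_iters = iterations_to_threshold(slow_curve, threshold=0.5)
print(f"fast curve: {fast_iters} iterations, slow curve: {slow_iters} iterations")
```

Smoothing before thresholding keeps a single lucky episode from counting as convergence. Since individual runs differ, such a number would ideally be averaged over several runs per agent.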


## How does it actually look? 👀

After training the agent for around 40k iterations on the `ContinuousMergingBlueprint`, it has converged to a mean reward of around 1. Its behavior then looks as follows:

<p align="center">
<img src="https://github.com/mrcoliva/bark-ml/raw/master/docs/gnn_practical_course_2020/images/gnn-sac-merging-bp.gif" alt="BARK-ML Highway" />
</p>

## Chapter 10: Future Work
	
As we have seen, applying graph neural networks to reinforcement learning achieves very promising results. In our project, we used two different libraries: `tf2_gnn` and `spektral`. While `spektral` makes use of edge features, `tf2_gnn` does not. It would be interesting to compare the performance of these two libraries, since edge features could unlock the real potential of graph-structured data. Another direction is applying graph neural networks to other reinforcement learning algorithms. So far, we have only used them with the Soft-Actor-Critic algorithm; another good candidate is Proximal Policy Optimization (PPO). Comparing these two algorithms would allow a deeper look into how graph neural networks actually affect reinforcement learning. Finally, learning the behavior of an ego agent leads us to the question: can we use the same model to control the behavior of all other agents? This should be possible, since the actor network should generate the same behavior (the one that maximizes the total reward) for each agent.

# Appendix: Commands

Alternatively, you can run the example from the command line with `bazel run //examples:tfa_gnn -- --mode=train`.  
The process can be visualized using TensorBoard with `tensorboard --logdir training/sac_gnn_spektral/summaries`