"Deep Latent Competition : Learning to Race Using Visual Control Policies in Latent Space"
-
[`MARL`, `world models`, `self-play`, `imagination`, `latent space`]
Click to expand
Contrary to the original world model, where a single agent races alone, here two cars compete against each other. The authors show that considering interactions and trying to infer the opponent's actions are essential. Ignoring the competitor may be easier to train, but shows limited performance. One good strategy consists in driving as fast as possible while trying to hinder the other, i.e. adversarial behaviours. Source. |
The idea is to learn a world model from ground-truth data, based on which behaviour is refined through imagined self-play. Note that self-play is performed in the learnt LATENT space, hence the name 'Deep Latent Competition' (DLC). Source. |
Authors: Schwarting, W., Seyde, T., Gilitschenski, I., Liebenwein, L., Sander, R., Karaman, S., & Rus, D.
-
Motivations:
- `0-` Extend the `world model` work, where a single agent races alone, to a `2`-player competition.
- `1-` Solve the following challenges:
  - High-dimensional `observations` such as images.
  - Multi-agent `RL` (`MARL`), i.e. interactions must be considered.
  - No prior knowledge about the environment is assumed.
  - Partial observability of other agents: their `actions` are unknown.
  - Competition, with potentially adversarial interactions.
- `2-` "Racing".
  - Agents share the same goal: cross the finish line first. They are opponents: if `A` loses, `B` wins.
  - On the contrary, in normal driving, drivers have different intentions and must avoid collisions while abiding by traffic rules.
-
About the `Env`: `MultiCarRacing-v0`: a novel multi-agent racing environment for learning competitive visual control policies.
- It extends the `Gym` task `CarRacing-v0`.
- "Collision dynamics allow for elaborate interaction strategies during the race: pushing other agents off the track, blocking overtaking attempts, or turning an opponent sideways via a `PIT` maneuver."
-
Main idea of Deep Latent Competition (`DLC`):
- Why "Latent Competition"?
  - The idea is to gain competitiveness from imagined `self-play` in a learned multi-agent latent `world model`.
  - "Our approach learns a `world model` for imagining competitive behavior in latent-space."
  - "The `DLC` agent can imagine interaction sequences in the compact latent space based on a multi-agent `world model` that combines a joint `transition` function with opponent viewpoint prediction."
- [Benefits] "Imagined `self-play` reduces costly sample generation in the real world, while the latent representation enables planning to scale gracefully with `observation` dimensionality."
-
Model-based `RL` consists of `3` tasks, iteratively repeated:
- `1-` Model learning: based on the previous experience of all agents, two functions are learnt:
  - `(i)` The joint `dynamics`:
    - It includes predictions of other agents' behaviour.
    - How the world will evolve, conditioned on own and expected opponent `actions`.
    - It enables each agent to imagine the outcome of games without requiring additional real-world experience. Hence `self-play`.
    - This `transition` model is represented by a recurrent state space model (`RSSM`).
  - `(ii)` The `reward` function.
    - With a dense net.
    - "We incentivize competition by discounting `rewards` based on visitation order: the first agent to visit a track tile is rewarded with `+1000/N`, while the second agent receives `+500/N` (on an `N`-tile track)."
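The visitation-order discounting of the `reward` can be sketched as follows. Only the `+1000/N` / `+500/N` values come from the paper; the bookkeeping helper and its names are my assumptions:

```python
def tile_reward(tile_visits, agent, tile, n_tiles):
    """Reward for `agent` entering `tile`, discounted by visitation order.

    `tile_visits` maps tile index -> list of agents that already visited it.
    Hypothetical helper; only the +1000/N and +500/N values are from the paper.
    """
    visitors = tile_visits.setdefault(tile, [])
    if agent in visitors:
        return 0.0          # no reward for re-visiting a tile
    visitors.append(agent)
    if len(visitors) == 1:  # first agent to reach this tile
        return 1000.0 / n_tiles
    return 500.0 / n_tiles  # later agents get the discounted reward
```

On a `100`-tile track this gives `+10` to the first visitor and `+5` to the second, which keeps the trailing agent incentivized while rewarding the leader more.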
- Note that reasoning and competition are not performed in image space, but rather in the learnt latent space.
  - "In order to discover latent spaces that not only offer compact representations of environment states but further facilitate prediction of associated trajectory performance."
  - Encoder: `CNN`.
  - Observation model: transposed `CNN`.
- `2-` Behaviour optimization = `policy improvement`.
  - The agents interact with their adversaries through imagined `self-play`: no necessity for execution in the real world.
  - `policy` and `value` functions are learnt through policy iteration on imagined model rollouts.
- `3-` Environment interaction = `policy evaluation` + `experience collection`.
  - "Each agent only has access to their own `observation`-`action` history and performance indirectly depends on how well the states of opponents are being estimated."
  - Note: which `observation-action` histories can be seen by the ego-agent?
    - Training: those of all agents.
    - Deployment: only its own.
-
-
-
Ablation study - it highlights the importance of:
- `1-` The learned observer, i.e. opponent viewpoint prediction.
  - "The first [baseline] is `joint transition`, which propagates ground truth latent `states` of both agents and allows for assessing performance of the `observer`."
  - It is able to acquire general racing skills faster. But, after `200` races, `joint transition` methods have learned to compete more effectively through imagined `self-play`.
    - "The `individual transition` baseline can not leverage this effect, as agent `actions` do not affect opponents' `states` during imagined rollouts and `self-play` may only occur in the real world."
- `2-` The joint `transition` model.
  - "The second [baseline] is `individual transition`, which propagates ground truth latent `states` individually and highlights the added value of a joint `transition` model."
  - "Optimizing competitive behaviors through imagined `self-play` based on these joint predictions yields an agent that performs superior to an agent that propagates ground truth `observations` separately."
-
About consistency in imagined `self-play`:
- The learned `transition` model propagates all agents' latent `states` jointly.
- "To facilitate efficient `self-play` in a learned `world model`, we require all agent states to be jointly propagated in a consistent manner."
- "We observe the benefit of recursively estimating `states` in `agent 1`'s reconstruction of `agent 2`'s viewpoint."
- "The joint `transition` model helps keeping predictions consistent, as it allows for information exchange between both agents' latent `states` during forward propagation."
-
Still open questions for me:
- Who controls the other agent? Is the same model used for both cars? Probably yes.
- How to deal with multi-modal predictions and multi-modal beliefs about the other's `actions`?
"Guided Policy Search Model-based Reinforcement Learning for Urban Autonomous Driving"
Click to expand
Model-based RL with guided policy search (GPS). The dynamics model uses a Gaussian mixture model (GMM) with 20 mixtures as a global prior. The policy model is updated via dual gradient descent (DGD): it is a constrained optimization problem, where a KL divergence constrains the magnitude of policy updates (since the dynamics model is only valid locally). I am confused by the inconsistency in terminology between LQR / LQG. Since the augmented cost (return and KL term) of the Lagrangian and its derivatives are computable, the authors claim that the trajectory optimization can be solved using LQG. But the figure says LQR. Since the optimization is done on an expectation of cumulative costs, I would say it is LQG. Source. |
Authors: Xu, Z., Chen, J., & Tomizuka
-
Motivations.
- `1-` Model-free `RL` methods suffer from:
  - Very poor sampling efficiency.
    - "Showing `100x` better sample efficiency of the `GPS`-based `RL` method [over model-free ones]."
  - Lack of interpretability.
  - `reality gap`: it learns in a non-perfect simulator, so transferring to the real-world vehicle is difficult.
- `2-` Behavioural cloning methods suffer from:
  - Requiring the collection of a large amount of driving data, which is costly and time consuming.
  - `Distributional shift`: the training dataset is biased compared to real-world driving, since expert drivers generally do not provide data for dangerous situations.
  - Essentially cloning the human driver demonstrations, hence the inability to exceed human performance.
-
`MDP` formulation (very simplified traffic scene - solvable with `PD`/`PID`):
- `state`:
  - ego `lateral` deviation.
  - ego `yaw` error.
  - ego `speed`. No info about the max allowed `speed`?
  - `gap` to the leader.
  - relative `speed` to the leader.
- `action`: throttle, brake, and steering angle.
- `reward`:
  - Penalizing `lateral`, `yaw` and `speed` deviations from references, as well as changes in `control commands`.
  - Relative `position` and `speed` compared to the leader are also considered, but only if the `gap` is smaller than `20m`.
-
Idea of the proposed model-based `RL`: Guided Policy Search (`GPS`). Iterate:
- `1-` Collect samples by running the `policy`.
- `2-` Learn a parameterized `local` `dynamic model` to approximate the complex and interactive driving task.
- `3-` Optimize the driving `policy` under the (non-linear approximate) `dynamic model`, subject to a `constraint` on the magnitude of the `trajectory` change.
  - "We can view the `policy` as imitating a supervised learning teacher."
  - "But the teacher is adapting to produce `actions` that the learner can also execute."
-
-
`Global`/`local` models. From this post.
- `1-` Local models are easier to fit, but need to be thrown away whenever the policy updates, because they are only accurate for trajectories collected under the old policy.
  - "It can take a number of episodes for the training of such parameterized models [`policy` and `dynamics` model]. In order to get high sample efficiency, we adopt the idea of `local` models, and apply the time-varying linear Gaussian models (why "time varying"?) to approximate the local behavior of the system dynamics and the control policy."
  - Trajectories are collected to fit local models rather than using linearizations of a global model of the `dynamics`.
- `2-` Global models are beneficial in that they generally maintain some sort of consistency in `state` space, i.e. `states` close to each other generally have similar dynamics.
  - We can approximately get this same desirable behaviour by using a `global` model as a prior when training `local` models.
  - This is called `Bayesian linear regression` and can help reduce sample complexity.
-
Why is the policy search "guided"?
- Probably as opposed to random search for the two models? Or because of the `prior` that guides the local fit of the `dynamics` model?
-
Learning the `dynamics model`.
- "We adopt a `global` model as the prior, which evolves throughout the whole model-based `RL` lifetime, and fit the `local` linear dynamics to it at each iteration."
- Non-linear prior model: Gaussian mixture model (`GMM`). Here `20` mixtures.
  - Each mixture element serves as the prior for one driving pattern.
- Idea of the Expectation Maximization (`EM`) process used to train the `GMM`:
  - `1-` Each tuple sample (`st`, `at`, `st+1`) is first assigned to a pattern.
  - `2-` Then it is used to update the mixture element.
- "Finally, at each iteration, we fit the current episode of data (`st`, `at`, `st+1`)'s to the `GMM`, incorporating a `normal-inverse-Wishart` prior. The `local` linear dynamics `p`(`st+1` | `st`, `at`) is derived by conditioning the Gaussian on (`st`, `at`)." [I don't fully understand]
-
-
Learning the `policy`.
- `1-` Using a `KL divergence` to constrain policy updates.
- `2-` Using dual gradient descent (`DGD`) to solve the constrained optimization problem.
  - "The main idea of the `DGD` is to first minimize the Lagrangian function under a fixed Lagrangian multiplier `λ`, and then increase the `λ` penalty if the constraint is violated, so that more emphasis is placed on the constraint term in the Lagrangian function in the next iteration."
  - The Lagrangian objective can be re-written as an augmented cost function `c`(`st`, `at`). This cost function and its derivatives can be directly computed, hence the trajectory optimization problem can be solved using `LQG`.
  - "After the Lagrangian is optimized under a fixed `λ`, in the second step of `DGD`, `λ` is updated using the function below with step size `α`, and the `DGD` loop is closed."
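The two-step `DGD` loop can be sketched generically (this is a toy version with a scalar constraint, not the paper's `LQG` inner solver):

```python
def dual_gradient_descent(minimize_lagrangian, constraint, lam=1.0,
                          alpha=0.1, n_iters=500):
    """Dual gradient descent sketch.

    minimize_lagrangian(lam) -> primal solution x minimizing L(x, lam)
    constraint(x)            -> c(x), feasibility means c(x) <= 0
    """
    x = None
    for _ in range(n_iters):
        x = minimize_lagrangian(lam)             # step 1: primal minimization
        lam = max(0.0, lam + alpha * constraint(x))  # step 2: dual ascent on lambda
    return x, lam
```

For example, minimizing `x^2` subject to `x >= 1` (i.e. `c(x) = 1 - x <= 0`) has the closed-form inner solution `x = λ/2`, and the loop converges to the optimum `x = 1` with multiplier `λ = 2`.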
-
-
Baseline `1`: black-box (derivative-free) optimization.
- Cross Entropy Method (`CEM`).
  - Simple, but not sample efficient.
  - "In order to optimize the parameterized policy `πθ`, the `CEM` adopts the assumption of a Gaussian distribution of `θ` = `N`(`µ`, `σ2`). It iteratively samples `θ` from the distribution, using which to collect sample trajectories, and then updates `µ` and `σ` using the `θ`'s that produce the best trajectories."
-
Baseline `2`: model-free `SAC`.
- It maximizes both the expected return and the entropy of the policy.
-
Initialization (is it fair?):
- `GPS` and `CEM`: `PD` controller with large variance, since the policies are linear Gaussians.
- `SAC`: pure random initialization of the weights.
- "Therefore, the initial performances of the `GPS` and `CEM` are slightly better compared to the model-free `RL` methods."
"Model-based Reinforcement Learning for Time-optimal Velocity Control"
-
[`speed response`, `dynamic stability`, `action masking`]
Click to expand
Since the underlying dynamic model of the vehicle is complex, it is learnt with supervised learning. The learnt transition function is then used for planning. This improves the sampling efficiency compared to model-free RL. Note that instead of directly predicting the next state, the neural network predicts the difference between the current state s[t] and the next state s[t+1]. But no previous actions are considered (what about physical latencies and the "delayed action effect"?). Source. |
The Failure Prediction and Intervention Module (FIM) uses an analytical model to determine the potential instability of a given action. Actions proposed by the model-based RL agent are overwritten if one of the future predicted states is prone to roll-over. This maintains safety even at the beginning of the training process, when the learned model may be inaccurate. Source. |
Authors: Hartmann, G., Shiller, Z., & Azaria, A.
-
Task: decide a binary acceleration (`max-throttle`/`hard-brake`) to drive as fast as possible along a path (`steering` is controlled by an external module) without compromising dynamic stability.
-
Motivations. Improve training wrt:
- `1-` Sampling efficiency.
  - The training time should be short enough to enable training on real vehicles (less than `1` minute of real-time learning).
  - Since the underlying dynamic model of the vehicle is very complex, it is learnt (supervised learning) and then used for `planning`.
- `2-` Safety. More precisely the "dynamic stability".
  - "By "dynamic stability" we refer to constraints on the vehicle that are functions of its speed, such as not rolling-over and not sliding."
  - "An important advantage of `LMVO+FIM` over the other methods is that it maintains safety also during the beginning of the training process, where the learned model may be inaccurate."
- As opposed to model-free `RL` approaches:
  - "Millions of training steps are required to converge, which is impractical for real applications, and the safety of the learned driving policy is not guaranteed."
  - "`LMVO+FIM` achieves higher velocity, in approximately `1%` of the time that is required by `DDPG`, while completely preventing failure."
-
-
Main idea. Combine:
- `1-` A model-based `RL` agent, for sampling efficiency.
  - A prediction transition function is learnt (and may be inaccurate at the start).
  - How the `model-based policy` is learnt is not explained. `MCTS`?
- `2-` An analytical planner to protect the vehicle from reaching dynamically unstable `states`.
  - Another prediction transition function (`bicycle` model) is used, which should ensure safety before the other, learned model converges.
-
Learning a dynamic model.
- "Instead of directly predicting the next `state`, we use a neural network to predict the difference between the current state `s[t]` and the next state `s[t+1]`."
- To make a multi-step roll-out, this single-step prediction is repeated.
- "Since the multi-step predictions are computed iteratively based on the previous step, the error between the predicted and actual values is expected to grow with the number of steps. For simplicity, we take a `safety factor` that is linear with the number of future steps."
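The multi-step roll-out of the learned delta model can be sketched as follows. The linear safety-factor coefficient and the helper names are my assumptions; the paper only states that the factor grows linearly with the step index:

```python
def rollout(delta_model, s0, actions, safety_coeff=0.05):
    """Multi-step roll-out of a learned delta model: s[t+1] = s[t] + f(s[t], a[t]).

    Returns the predicted states and a per-step uncertainty margin that grows
    linearly with the step index (sketch of the paper's "safety factor").
    """
    states, margins = [s0], []
    s = s0
    for k, a in enumerate(actions, start=1):
        s = s + delta_model(s, a)            # predict the difference, not the state
        states.append(s)
        margins.append(1.0 + safety_coeff * k)  # inflate uncertainty per step
    return states, margins
```

Repeating the single-step prediction compounds the error, which is exactly why the margin must grow with the roll-out depth.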
-
-
-
About the analytical model: how to determine the potential instability of a given `action`?
- Based on the bicycle model.
- "The Lateral load Transfer Rate (`LTR`) [is used] to estimate how close the vehicle is to a `roll-over`. The `LTR` describes the difference between the load on the `left` and the load on the `right` wheels."
- "For all rolled-out future `states`, it is checked if the predicted `LTR` is lower than `1`, which indicates that the vehicle is expected to remain safe."
- If an "unsafe" manoeuvre is attempted, an alternative "safe" local manoeuvre is executed (`max-brake`).
  - "`πs` tries to brake while using the regular controller for steering, but if it predicts that the vehicle will still result in an unstable state, it also straightens the steering wheel, which will prevent the expected roll-over by reducing the radius of curvature of the future state and, following that, reducing `LTR`."
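The intervention logic (override the agent's `action` whenever a predicted state exceeds the roll-over limit) can be sketched as follows. `predict_states` and `ltr` are hypothetical stand-ins for the paper's analytical bicycle model, and the fallback action is simplified to a plain `hard-brake`:

```python
def safe_action(proposed_action, state, predict_states, ltr, horizon=10):
    """Failure Prediction and Intervention sketch.

    predict_states(state, action, horizon) -> list of predicted future states
    ltr(state) -> Lateral load Transfer Rate of a state
    Overrides the proposed action if any predicted |LTR| >= 1 (roll-over risk).
    """
    for future in predict_states(state, proposed_action, horizon):
        if abs(ltr(future)) >= 1.0:   # roll-over predicted: intervene
            return "hard-brake"
    return proposed_action            # deemed dynamically stable
```

This is the "action masking" idea from the tag list: the learned policy proposes, the analytical model disposes.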
-
-
A personal concern: physical latencies and the "delayed action effect".
- In real cars, the dynamic reaction does depend on a sequence of past actions.
  - E.g. applying `max-throttle` at `5m/s` will result in totally different speeds depending on whether the car's previous `action` was `max-brake` or `max-throttle`.
- Here:
  - Predictions and decisions are made at `5Hz`.
  - The learnt transition function takes as input the `speed` and the desired `action`. No information about the past.
- One solution, from TEXPLORE: ROS-based RL:
  - "To handle robots, which commonly have sensor and actuator delays, we provide the model with the past `k` actions."
-
-
Another concern: computationally expensive inference.
- The forward pass in the net should be light. But `planning` steps seem costly, despite the minimal `action` space size (only `2` choices).
- "At every time step `t`, the action `a[t]` must be applied immediately. However, since the computing time is not negligible, the command is applied with some delay. To solve this problem, instead of computing action `a[t]` at time `t`, action `a[t+1]` is computed based on the predicted next state `s[t+1]`, and `a[t+1]` is applied immediately when obtaining the actual state `s[t+1]` at time `t+1` (which may be slightly different than the model's prediction for that state)."
"Assurance of Self-Driving Cars: A Reinforcement Learning Approach"
Click to expand
Top-left: The task is not to drive a car! But rather to find the weather conditions causing accidents (not clear to me). Right: Model learning is done offline. Once a good confidence in prediction accuracy of PILCO GPs is obtained, online planning is conducted by starting a new episode and building up a ρUCT at each state . Bottom-left: ρUCT is a best-first MCTS technique that iteratively constructs a search tree in memory. The tree is composed of two interleaved types of nodes: decision nodes and chance nodes. These correspond to the alternating max and sum operations in the expectimax operation. Each node in the tree corresponds to a history h . If h ends with an action , it is a chance node; if h ends with an (observation -reward ) pair, it is a decision node. Each node contains a statistical estimate of the future reward . Source. |
Inference in the transition model: given x = (s, a), the mean and variance of the posterior distribution p(s' | x) are computed and used in a normal distribution to sample state transitions. Source. |
Author: Quan, K.
-
About:
- A student project. Results are unfortunately not very good. But the report gives a good example of a concrete `model-based` implementation (and its difficulties).
-
Motivation:
- Perform `planning` to solve an `MDP` where the environment dynamics are unknown.
- A generative model that approximates the transition function is therefore needed.
-
About model-based `RL`.
- "In `model-based` `RL`, the agent learns an approximated model of the environment and performs `planning` utilising this model."
- "Gaussian Process (`GP`) is widely considered as the state-of-the-art method in learning stochastic transition models."
- `PILCO` = Probabilistic Inference for Learning Control. It is a model-based policy search algorithm.
  - The transition model in the `PILCO` algorithm is implemented as a `GP`.
- In short, the author proposes a combination of:
  - `1-` Offline model learning via `PILCO` Gaussian Processes.
  - `2-` Online planning with `ρUCT`.
- Miscellaneous:
  - A fixed frame rate is necessary to reduce the difficulty of learning the transition model.
  - Here, a python implementation of `PILCO` in `TensorFlow v2` is used, leveraging `GPflow` for the `GP` regression.
-
-
Model learning is OFFLINE, for efficiency reasons.
- Contrary to the `Dyna` model, new experience obtained from interactions with the environment during the planning phase is not fed back to the model learning process to further improve the approximate transition model.
  - "The major reason is that optimisation of `PILCO` `GPs` is significantly time-consuming with a large set of experience. Thus, interleaving online planning in decision time with model learning would make the solution time intractable under the computation resource we have."
  - Error correction is therefore impossible.
-
Long `training` and `sampling` times with the `python` package:
- A model learnt with `200` episodes has an `8s` sampling time, making `planning` intractable.
- The author raises two reasons:
  - `1-` Optimisation of the `GP`'s hyper-parameters is not done on an incremental basis.
  - `2-` Intermediate results of sampling from the posterior distribution are not cached, thus every sampling requires a `Cholesky` decomposition of the covariance matrix.
-
About `planning`.
- `Policy iteration` is not applicable.
  - It becomes intractable in large problems, since it operates in sweeps of the entire `state`-`action` space.
- Instead: Monte Carlo Tree Search (`MCTS`).
  - No detail about the `state` discretization.
- About `ρUCT`:
  - "A generalisation of the popular `MCTS` algorithm `UCT`, that can be used to approximate a finite horizon `expectimax` operation given an environment model `ρ`."
  - "The `ρUCT` algorithm can be realised by replacing the notion of `state` in `UCT` by an agent history `h` (which is always a sufficient statistic) and using an environment model `ρ` to predict the next percept."
  - `UCB1` is used by `ρUCT` as the selection strategy to balance `exploration` and `exploitation`.
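The `UCB1` selection rule mentioned above can be sketched as follows (a generic sketch of `UCB1`, not the report's implementation; node statistics are represented as a plain dict):

```python
import math

def ucb1(node_visits, child_stats, c=math.sqrt(2)):
    """UCB1 selection: balance exploitation (mean reward) and exploration
    (visit counts). `child_stats` maps action -> (visits, mean_reward)."""
    best_action, best_score = None, -math.inf
    for action, (n, mean) in child_stats.items():
        if n == 0:
            return action                   # always expand unvisited actions first
        score = mean + c * math.sqrt(math.log(node_visits) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

A rarely visited child with a slightly lower mean can still be selected thanks to the exploration bonus, which is exactly the behaviour the tree search needs while it builds up statistics.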
-
-
`MDP` formulation (not clear to me).
- "We set up the `CARLA` environment with a single vehicle moving towards a destination in a straight lane and a pedestrian that is placed in the path of the vehicle and stands still, which simulates the dynamics between a moving car and a pedestrian who is crossing the street."
- "The results indicate that sun altitude is an important influence factor for the stability and consistency of the test autonomous controller in that certain configurations of sun altitude produce lighting conditions that cause more frequent failure of the controller."
- `gamma`:
  - "Since the length of episodes is finite, we omit the discounting factor."
- `state` (normalisation is performed):
  - Pedestrian's (or car's?) `position`, `velocity` and `acceleration`. Plus a weather configuration: `SunAltitude`.
  - "Out of the `6` weather parameters configurable in `CARLA`, in this baseline experiment we will only include `Sun Altitude` in our state-action space given its biggest impact on the lighting condition."
- `reward`:
  - `+10` if the vehicle crashes into the pedestrian. Should it not be negative? Or maybe the goal is to find the weather conditions that cause crashes?
  - `-1` as step penalty.
- `action`:
  - The task is not to drive a car!
  - Instead, to change the weather parameters, i.e. `SunAltitude`.
  - [During data collection] "An arbitrary policy produces `actions` that modify the `SunAltitude` weather parameter by sampling from a Bernoulli Process that gives the probability of two actions: increase `SunAltitude` by `2`, or decrease it by `2`: `P(A = 2) = p = 0.8, P(A = −2) = 1 − p`."
- Results:
  - Learning the dynamics is not successful.
    - For `velocity` prediction, the error percentage remains higher than `40%`!
  - [solving the `MDP`] "The best result of `1.2` average final reward [rather `return`?] that we have achieved so far is still having a significant gap from the theoretical optimal reward of `10` [what about the `-1` step penalties??]. Model error still seems to be a limiting factor to our planning capability."
"Safe Policy search using Gaussian process models"
-
[`2019`] [📝] [🎓 `Oxford`] -
[`PILCO`, `probabilistic safety guarantees`, `adaptive tuning`, `chance constrained`]
Click to expand
A set of safe states is defined. An analytically-computed quantity Q estimates the probability of the system to stay in these safe states during its trajectory. Two usages are made of Q. First, it is made part of the objective function, together with the expected return. Second, a constraint is defined: Q must be higher than some threshold ϵ. If not, the policy is prevented from being implemented on the physical system. If the policy is deemed unsafe, the importance of safety over performance is increased in the objective function and the optimization is repeated. This automated procedure is called adaptive tuning. Source. |
Authors: Polymenakos, K., Abate, A., & Roberts, S.
-
Motivations:
- `1-` Target applications on physical systems.
  - Data-efficiency and safety are essential.
  - Model-free `RL` is therefore not an option, despite its flexibility.
- `2-` No predefined `dynamics` model, which would inhibit learning:
  - ... either by lack of flexibility,
  - ... or by introducing model bias.
  - Therefore `planning` (with fixed models) is not an option. Model-based `RL` is preferred.
  - "We want to construct a model from scratch, from the data collected during training, allowing us to efficiently tune a controller."
  - "We address the lack of flexibility by using a non-parametric model which can in principle be as flexible as needed. Model bias is also addressed, by using a model that explicitly accounts for uncertainty in its outputs. That way, even when the model's predictions are wrong, the model should provide them along with a suitably high uncertainty estimation."
- `3-` Automate the tuning of parameters trading off `efficiency` and `safety`, and enforce the learnt `policy` to be `safe` before using it (`safety check` step).
-
About the `safety` constraint:
- A set of `safe states` is defined.
  - The probability of the system staying in `safe states` during its `trajectory` is noted `Qπ`(`θ`), for a policy parametrized by `θ`.
  - It estimates the `risk`.
- The constraint is defined as:
  - "`Qπ`(`θ`) must be higher than some threshold `ϵ`."
- "These constraints are defined a priori and we require them to be respected, even during training."
-
Model-based `RL` with Gaussian processes (`GP`).
- A Gaussian process model is trained to capture the system `dynamics`, i.e. the `transition` model, based on the `PILCO` framework.
- Compared to other policy gradient methods, gradients in `PILCO` are not numerically approximated from the sampled trajectories, but analytically calculated given the `GP` model.
- Output of the model:
  - Not the `next_state` itself. Rather its difference to the input `state`.
  - "Modelling the difference in consecutive `states` is preferable to modelling the `states` themselves for practical reasons, such as the fact that the common zero-mean prior of the `GP` is more intuitively natural."
- Kernels for the `covariance` of the `GP`:
  - "The `squared exponential` kernel choice reflects our expectation that the function we are modelling (the system `dynamics`) is smooth, with similar parts of the `state` space, along with similar inputs, to lead to similar `next states`."
  - "The kernel function's hyper-parameters, namely the `signal variance`, `length scales`, and `noise variance`, are chosen through evidence maximization."
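A minimal sketch of such a `GP` dynamics model with a squared exponential kernel, predicting the state *difference* as described above. This is a bare-bones exact-GP regression in numpy, not the paper's `PILCO` machinery, and the hyper-parameter values are placeholders rather than the result of evidence maximization:

```python
import numpy as np

def sq_exp_kernel(A, B, signal_var=1.0, length=1.0):
    """Squared exponential kernel: smooth dynamics, similar inputs -> similar outputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X, y, X_star, noise_var=1e-4, **kw):
    """GP posterior mean/variance at X_star; y holds state differences s' - s."""
    K = sq_exp_kernel(X, X, **kw) + noise_var * np.eye(len(X))
    K_s = sq_exp_kernel(X_star, X, **kw)
    alpha = np.linalg.solve(K, y)
    mean = K_s @ alpha                                   # posterior mean
    var = sq_exp_kernel(X_star, X_star, **kw).diagonal() \
        - (K_s * np.linalg.solve(K, K_s.T).T).sum(-1)    # posterior variance
    return mean, var
```

Note how the posterior variance is available alongside the mean: this is the uncertainty estimate that the safety evaluation relies on when the model's predictions are wrong.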
-
Once the `dynamics` model is learnt, the `policy`'s performance and its `risk` can be evaluated.
- "For any parameter value `θ`, we can produce a sequence of `mean` and `variance` predictions for the `states` the system is going to be in the next `T` time steps."
- This prediction is used to estimate:
  - `1-` The `reward` that would be accumulated by implementing the `policy`, starting from an initial `state`.
  - `2-` The probability of violating the `safe state space` constraints. See figure.
-
-
Once the `risk` associated to a `policy` is estimated, what to do with it?
- The probability for the system to respect/violate the constraints during an episode has a dual role:
- `1-` In the (`policy evaluation` + `policy improvement`) iterations: `Qπ(θ)` is a component of the `objective` function, along with the expected `return`.
  - "The `objective` function, capturing both safety and performance (a risk-sensitive criterion according to [`16`]) is defined as: `Jπ(θ)` = `Rπ(θ)` + `ξ` * `Qπ(θ)`."
- `2-` `Safety check` step for deployment:
  - High-`risk` policies are prevented from being applied to the physical system.
  - "Even during training, only policies that are deemed `safe` are implemented on the real system, minimizing the risk of catastrophic failure."
-
-
What if the `policy` is estimated as `unsafe`?
- It can neither be deployed, nor be used to collect further samples for training.
- In this case, the objective is changed, increasing the importance of `safety` over `performance`, and a new optimization is performed.
  - "This adaptive tuning of the hyperparameter `ξ` guarantees that only safe policies (according to the current `GP` model of the system dynamics) are implemented, while mitigating the need for a good initial value of `ξ`. Indeed, using this scheme, we have observed that a good strategy is to start with a relatively high initial value, focusing on `safety`, which leads to safer policies that are allowed to interact with the system, gathering more data and a more accurate model, and steadily discovering high performing policies as `ξ` decreases."
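The adaptive-tuning loop around `Jπ(θ)` = `Rπ(θ)` + `ξ` * `Qπ(θ)` can be sketched as follows. The function names, the multiplicative update of `ξ` and its initial value are my assumptions; only the structure (optimize, safety-check, re-weight, repeat) comes from the paper:

```python
def adapt_xi(optimize_policy, est_safety, eps=0.95, xi0=10.0,
             grow=1.25, max_tries=20):
    """Adaptive tuning sketch.

    optimize_policy(xi) -> theta maximizing J = R(theta) + xi * Q(theta)
    est_safety(theta)   -> estimated probability Q of staying in safe states
    Only a policy with Q >= eps may touch the physical system.
    """
    xi = xi0
    for _ in range(max_tries):
        theta = optimize_policy(xi)      # optimize the weighted objective
        if est_safety(theta) >= eps:     # safety check before deployment
            return theta, xi
        xi *= grow                       # unsafe: weight safety more, retry
    return None, xi                      # no safe policy found within budget
```

If no `θ` ever passes the check, nothing is deployed, which mirrors the paper's rule that unsafe policies are never run on the real system.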
-
-
`Policy improvement`: how to perform the optimization?
- "The gradients are often estimated stochastically in the `policy gradient` literature. However we do not have to resort to stochastic estimation, which is a major advantage of using (differentiable) models in general and `GP` models in particular."
- The probability of collision `Qπ`(`θ`), on the other hand, is a product of the probabilities of collision at every time step `q`(`xt`).
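Given the per-step Gaussian state predictions (`mean`, `variance`) produced by the `GP` roll-out, the trajectory-level probability can be sketched as a product of per-step probabilities. A 1-D sketch with an interval safe set and an independence assumption across steps (the paper works with a general safe state set):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def trajectory_safety(means, stds, low, high):
    """Q(theta) sketch: probability that the predicted Gaussian state lies in
    the safe interval [low, high] at every step, as a product of per-step
    probabilities q(x_t)."""
    q = 1.0
    for m, s in zip(means, stds):
        q *= norm_cdf((high - m) / s) - norm_cdf((low - m) / s)
    return q
```

Since each factor is a smooth function of the predicted `mean` and `variance`, the product stays differentiable in `θ`, which is what allows the analytic gradients mentioned above.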
-
-
Benchmark:
- Comparisons are made with a `policy` trained on an `MDP` where `penalties` (negative `rewards`) discourage visiting unsafe `states`.
- "In that case, instead of calculating a probability of being in an unsafe `state`, the system receives an (additive) `penalty` for getting to unsafe `states`."
"Model-predictive policy learning with uncertainty regularization for driving in dense traffic"
-
[`uncertainty regularization`, `multi-modal prediction`, `CVAE`, `covariate shift`, `latent dropout`]
Click to expand
Small-top-right: example of averaging the outcomes of dropping a pen, showing that a deterministic prediction that averages over possible futures is not an option. Top: how the action-conditional dynamics model has its latent variable sampled during training and inference. Bottom: latent dropout, which solves the action-insensitivity issue that arises if one only samples from the learnt posterior distribution (a function of the true next state) during training. Source. |
The world model produces various possible futures, i.e. candidate next-states, from which the reward is computed. This estimation is good on the training distribution. But depending on the net initialization, moving out of the training distribution leads to different results and arbitrary predictions. How to reduce this disagreement between the models? By uncertainty regularization: multiple forward passes are performed with dropout, and the variance of the predictions is computed. The uncertainty is summarized into this scalar and used as a regularization term when training the policy. Source. |
The latent variable of the predictive model (CVAE) enables multi-modal prediction of the future. Here 4 sequences of 200 latent variables were sampled. None of them repeats the actual future, but they show 4 different variants of it. A deterministic predictor does not work: it averages over possible futures, producing blurred predictions. Source. |
Authors: Henaff, M., LeCun, Y., & Canziani, A.
- Motivations:
  - `1-` Model-free `RL` has poor data efficiency.
    - Interactions with the world can be slow, expensive, or dangerous.
    - "Learning through interaction with the real environment is not a viable solution."
    - The idea is to learn a model of the `environment dynamics`, and use it to train a `RL` agent, i.e. model-based `RL`.
  - `2-` Observational data is often plentiful: how can it be used?
    - "Trajectories of human drivers can be easily collected using traffic cameras resulting in an abundance of observational data."
  - `3-` Dense moving traffic, where `interaction` is key.
    - "The driver behavior is complex and includes sudden accelerations, lane changes and merges which are difficult to predict; as such the dataset has high environment (or aleatoric) uncertainty."
- Is `behavioural cloning` a good option?
  - "Learning policies from purely observational data is challenging because the data may only cover a small region of the space over which it is defined."
  - "Another option [here] is to learn a `dynamics model` from observational data, and then use it to train a `policy`."
- `state`:
  - `1-` A vector: `ego-position` and `ego-speed`.
  - `2-` A `3`-channel image (high-dimensional):
    - `red` encodes the lane markings.
    - `green` encodes the locations of neighbouring cars.
    - `blue` represents the ego car.
- `action`:
  - Longitudinal `acceleration/braking`.
  - Change in `steering` angle.
- Main steps:
  - `1-` Learn an action-conditional `dynamics model` using the collected observational data.
    - From `NGSIM`, `2` million transitions are extracted.
  - `2-` Use this model to train a fast, feedforward `policy` network.
    - It minimizes an objective function containing two terms:
      - `1-` A policy cost, which represents the objective the policy seeks to optimize, e.g. avoid driving too close to other cars.
      - `2-` An uncertainty cost, which represents its divergence from the `states` it is trained on.
        - "We measure this second cost by using the uncertainty of the `dynamics model` about its own predictions, calculated using dropout."
- Problem `1`: predictions cannot be unimodal. In particular, averaging over the possible futures is bad.
  - Solution: conditional `VAE`.
  - The low-dimensional `latent variable` is sampled from the prior distribution (at `inference`) or from the posterior distribution (during `training`).
  - Different samples lead to different outcomes.
  - Repeating the process enables to unroll a potential future, and generate a `trajectory`.
- Problem `2`: `action` sensitivity: how to keep the stochastic dynamics model responsive to `input actions`?
  - The predictive model can encode `action` information in the latent variables, making the output insensitive to the input `actions`.
  - "It is important for the prediction model to accurately respond to input `actions`, and not use the latent variables to encode factors of variation in the outputs which are due to the `actions`."
  - Solution: Latent dropout.
    - During training: sometimes sample `z` from the prior instead of the latent variable encoder.
    - "This forces the prediction model to extract as much information as possible from the input `states` and `actions` by making the latent variable independent of the output with some probability."
  - "Our modified posterior distribution discourages the model from encoding `action` information in the latent variables."
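The latent-dropout trick can be sketched in a few lines (a pure-Python illustration; `mu`/`logvar` parameterize the learnt posterior, and the names and the `p_dropout` value are illustrative, not the paper's):

```python
import math
import random

def sample_latent(mu, logvar, p_dropout=0.5, training=True):
    # With probability p_dropout during training, draw z from the fixed
    # isotropic Gaussian prior N(0, I) instead of the learnt posterior
    # N(mu, exp(logvar)). This makes z independent of the target with
    # some probability, so the model cannot hide action information in
    # the latent variable. At inference, only the posterior path or the
    # prior is used, depending on whether the true target is available.
    if training and random.random() < p_dropout:
        return [random.gauss(0.0, 1.0) for _ in mu]          # prior sample
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)  # posterior sample
            for m, lv in zip(mu, logvar)]
```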
- How to generate the `latent variable` at `inference`/`testing` time if the `true target` is not available?
  - By sampling from the (fixed) prior distribution, here an isotropic Gaussian.
- Problem `3`: how to address `covariate shift` without querying an expert (`DAgger`)?
  - Solution: Uncertainty regularization.
  - A term penalizing the uncertainty of the forward prediction model is incorporated when training the `policy` network.
  - "Intuitively, if the `dynamics model` is given a state-action pair from the same distribution as `D` (which it was trained on), it will have low uncertainty about its prediction. If it is given a `state`-`action` pair which is outside this distribution, it will have high uncertainty."
  - "Minimizing this quantity with respect to actions encourages the `policy` network to produce `actions` which, when plugged into the forward model, will produce predictions which the forward model is confident about."
  - This could be seen as a form of imitation learning. But what if the expert was not optimal?
    - "This leads to a set of `states` which the model is presumably confident about, but may not be a trajectory which also satisfies the policy cost `C` unless the dataset `D` consists of expert trajectories."
    - Solution: the second cost term, i.e. `RL`.
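The dropout-based uncertainty cost can be sketched as follows (an illustrative pure-Python version; `forward_model` is assumed to stay stochastic at evaluation time because dropout remains active):

```python
import statistics

def uncertainty_cost(forward_model, state, action, n_passes=10):
    # Monte-Carlo dropout: run the dropout-enabled forward model several
    # times on the same (state, action) pair and summarize the
    # disagreement between passes as one scalar (mean per-dimension
    # variance). Low variance: the pair lies on the training manifold;
    # high variance: the policy is drifting off-distribution, so this
    # scalar can be added as a penalty to the policy training loss.
    preds = [forward_model(state, action) for _ in range(n_passes)]
    dims = len(preds[0])
    return sum(statistics.pvariance([p[d] for p in preds])
               for d in range(dims)) / dims
```

The total training objective would then be something like `policy_cost + lam * uncertainty_cost` for a hypothetical weight `lam`.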
- Hence `MPUR`: `M`odel-predictive `P`olicy-learning with `U`ncertainty `R`egularization.
  - Note that imitation learning is performed at the level of `trajectories` rather than individual `actions`.
  - "A key feature [of `MPUR`] is that we optimize the objective over `T` time steps, which is made possible by our learned dynamics model. This means that the actions will receive gradients from multiple time steps ahead, which will penalize actions which lead to large divergences from the training manifold further into the future, even if they only cause a small divergence at the next time step."
  - "We see that the single-step imitation learner produces divergent trajectories which turn into other lanes, whereas the `MPUR` and `MPER` methods show trajectories which primarily stay within their lanes."
"Automatic learning of cyclist’s compliance for speed advice at intersections - a reinforcement learning-based approach"
- [`2019`] [📝] [🎓 `Delft University`]
- [`Dyna-2`, `Dyna-Q`]
Click to expand
The proposed algorithm learns the cyclist's behaviour in reaction to the advised speed. This model is used to predict the next state, allowing for a search that helps plan the cyclist's best next move on-the-fly. A look-up table is used to model F. Source. |
Authors: Dabiri, A., Hegyi, A., & Hoogendoorn, S.
- Motivations:
  - `1-` Advise a cyclist what speed to adopt when approaching traffic lights with uncertain timing.
    - To me, it looks like the opposite of numerous works that control traffic lights, assuming behaviours of vehicles, in order to optimize the traffic flow. Here, it may be worthwhile for a cyclist to speed up to catch a green light and avoid stopping.
    - Note that this is not a global optimization for a group of cyclists (e.g. on crossing lanes). Only one single cyclist is considered.
    - Note that the so-called "`agent`" is not the cyclist, but rather the module that provides the cyclist a speed advice.
  - `2-` Do not assume full compliance of the cyclist with the given advice, i.e. take into account the effect of disregarding the advice.
- Challenges:
  - `1-` There is no advance knowledge of how the cyclist may react to the advice he/she receives.
    - The other dynamics (or transition) models (deterministic kinematics of the bike and stochastic evolution of the traffic light state) are assumed to be known.
  - `2-` The computation time available at each decision step is limited: we cannot afford to wait for the `next-state` to be known before starting to "search".
- Main ideas:
  - Learn a model of the reaction of the cyclist to the advice (using a `look-up table`), in real-time (it seems to be `continuous learning` to me).
  - Use a second search procedure to obtain a local approximation of the action-value function, i.e. to help the agent select its next action.
  - Hence:
    - "Combine learning and planning to decide of the `speed` of a cyclist at an intersection".
- One strong inspiration: `Dyna-2` (Silver & Sutton, 2007).
  - "The value function is a linear combination of the transient and permanent memories, such that the transient memory tracks a local correction to the permanent memory".
  - Without transient memory, it reduces to `linear Sarsa`.
  - Without permanent memory, it reduces to a sample-based search algorithm.
- One idea: use `2` search procedures:
  - "Similar to `Dyna-2`, `Dyna-c` [`c` for `cyclist`] learns from the past and the future:"
  - `1-` `Search I`: The long-term action-value `Q`(`s`, `a`) is updated from what has happened in the real world, i.e. from real experience.
    - This long-term memory is used to represent general knowledge about the domain.
    - `Search I` can benefit from a local approximation provided by `Search II`. How? Is I a real search or just `argmax()`?
  - `2-` `Search II`: The short-term action-value `Q¯`(`s`, `a`) is updated from what could happen in the future: it uses simulated experience for its update and focuses on generating a local approximation of the action-value function.
    - Based on the learnt model and the selected action, the agent predicts the state at the next time step.
    - It can simulate experiences (search procedure) that start from this "imagined" state and update `Q¯` accordingly.
- Difference with `Dyna-Q`: the time constraint. We can neither afford to wait for the next observation nor take too long to think after observing it (as opposed to e.g. Go). `Search II` has exactly one timestep to perform its searches:
  - "Just after the action is taken and before reaching the next time step, the agent has `Ts`=`∆t` seconds to perform `Search II`."
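The two-memory decision step might be sketched as follows (a hypothetical, simplified rendering with tabular `Q`/`Q¯` dictionaries; names and hyper-parameters are illustrative, not the paper's):

```python
import random

def select_action(s, actions, Q, Q_bar):
    # Decisions use the permanent memory Q corrected by the transient
    # memory Q_bar, as in Dyna-2.
    return max(actions, key=lambda a: Q.get((s, a), 0.0) + Q_bar.get((s, a), 0.0))

def search_ii(s0, actions, Q, Q_bar, sample_model, n_rollouts=20, depth=5,
              alpha=0.2, gamma=0.95):
    # Transient search (Search II): simulated rollouts starting from the
    # *predicted* next state s0, updating only Q_bar, the local
    # correction to Q. sample_model stands for the learnt look-up-table
    # model of the cyclist's reaction: (state, advice) -> (next state, reward).
    for _ in range(n_rollouts):
        s = s0
        for _ in range(depth):
            a = random.choice(actions)
            s2, r = sample_model(s, a)
            target = r + gamma * max(Q.get((s2, b), 0.0) + Q_bar.get((s2, b), 0.0)
                                     for b in actions)
            q = Q.get((s, a), 0.0) + Q_bar.get((s, a), 0.0)
            Q_bar[(s, a)] = Q_bar.get((s, a), 0.0) + alpha * (target - q)
            s = s2
```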
- One take-away:
  - "Proper initialisation of `Q` can significantly improve the performance of the algorithm [I note the logically equivalent contrapositive]; the closer the algorithm starts to the real optimal action-value, the better."
  - "Here, `Q` is initialised with its optimal value in case of full compliance of the cyclist [`next-observed speed` = `advised speed`]. Stochastic Dynamic Programming (`SDP`) is used for such initialisation."
"ReQueST: Learning Human Objectives by Evaluating Hypothetical Behavior"
Click to expand
Right: Procedure to learn the hidden reward function: Using an offline-learnt generative model , query trajectories are produced for each acquisition function (AF ). Transitions of these trajectories are labelled by the user. The reward model ensemble is retrained on the updated training data using maximum-likelihood estimation. Source. |
Four acquisition functions: Maximize predicted rewards makes the car drive fast and far. Maximize reward model uncertainty makes the car drive close to the border. Minimize predicted rewards makes the car drive off-road. Maximize the novelty of training data makes the car stay still (since most training examples show cars in motion). Animated figure here. |
Authors: Reddy, S., Dragan, A. D., Levine, S., Legg, S., & Leike, J.
- One quote:
  - "We align agent behavior with a user's objectives by learning a model of the user's reward function and training the agent via (`model-based`) `RL`."
- One term: "reward query synthesis via trajectory optimization" (`ReQueST`).
  - `synthesis`:
    - The model first learns a generative model, i.e. a transition or forward dynamics function.
    - It is trained using off-policy data and maximum-likelihood estimation, i.e. unsupervised learning.
    - It is used to produce synthetic trajectories (instead of using the default training environment).
    - Note: building a forward dynamics model for cars in interactive environments looks very challenging.
  - `reward query`:
    - The user labels each transition in the synthetic trajectories based on some reward function (unknown to the agent).
    - Based on these signals, the agent learns a reward model `r`(`s`, `a`, `s'`), i.e. supervised learning.
    - The task can be regression or classification, for instance:
      - `good` - the car drives onto a new patch of road.
      - `unsafe` - off-road.
      - `neutral` - in a previously-visited road patch.
    - "We use an ensemble method to model uncertainty."
  - `trajectory optimization`:
    - Once the reward model has converged, a model-based `RL` agent that optimizes the learned rewards is deployed.
    - It combines planning with model-predictive control (`MPC`).
- One concept: "acquisition function" (`AF`).
  - It answers the question: how to generate "useful" query trajectories?
    - One option is to sample random trajectories from the learnt generative model.
    - "The user knows the rewards and unsafe states, but querying the user is expensive." So it has to be done efficiently.
    - To generate useful queries, trajectories are synthesized so as to maximize so-called "acquisition functions" (`AF`).
  - The authors explain (I did not understand everything) that these `AF` serve (but not all) as proxies for the "value of information" (`VOI`):
    - "The `AF` evaluates how useful it would be to elicit reward labels for trajectory `τ`".
  - The maximization of each of the `4` `AF` is intended to produce different types of hypothetical behaviours, yielding more diverse training data and a more accurate reward model:
    - `1-` Maximize reward model uncertainty.
      - It is based on `ensemble disagreement`, i.e. generation of trajectories that maximize the disagreement between ensemble members.
      - The car is found to drive to the edge of the road and slow down.
    - `2-` Maximize predicted rewards.
      - The agent tries to act optimally when maximizing this term.
      - It should detect when the reward model incorrectly outputs high rewards (`reward hacking`).
    - `3-` Minimize predicted rewards.
      - "Reward-minimizing queries elicit labels for unsafe states, which are rare in the training environment unless you explicitly seek them out."
      - The car goes off-road as quickly as possible.
    - `4-` Maximize the novelty of training data.
      - It produces novel trajectories that differ from those already in the training data, regardless of their predicted reward.
      - "The car is staying still, which makes sense since the training data tends to contain mostly trajectories of the car in motion."
  - More precisely, the trajectory generation targets two objectives (balanced with some regularization constant):
    - `1-` Produce informative queries, i.e. maximize the `AFs`.
    - `2-` Produce realistic queries, i.e. maximize the probability under the generative model (staying on the distribution of states in the training environment).
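The uncertainty acquisition function (ensemble disagreement) can be sketched as follows (illustrative only; `reward_ensemble` is assumed to be a list of callables `r(s, a, s')`, and this is not the paper's exact formula):

```python
import statistics

def disagreement_af(reward_ensemble, trajectory):
    # Score a candidate trajectory by how much the members of the
    # reward-model ensemble disagree on each transition, summed along
    # the trajectory. Maximizing this score over synthetic trajectories
    # yields queries whose labels are most informative for the reward
    # model.
    score = 0.0
    for (s, a, s2) in trajectory:
        preds = [r(s, a, s2) for r in reward_ensemble]
        score += statistics.pvariance(preds)
    return score
```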
- About safe exploration:
  - Via `AF-3`, the reward model learns to detect unsafe states.
  - "One of the benefits of our method is that, since it learns from synthetic trajectories instead of real trajectories, it only has to imagine visiting unsafe states, instead of actually visiting them."
  - In addition (to decide when the model has learnt enough), the user observes query trajectories, which reveal what the reward model has learned.
"Semantic predictive control for explainable and efficient policy learning"
- [`2019`] [📝] [🎞️] [🎓 `UC Berkeley (DeepDrive Center), Shanghai Jiao Tong University, Nanjing University`]
- [`MPC`, `interpretability`, `CARLA`]
Click to expand
SPC, inspired by MPC, is composed of one semantic feature extractor, one semantic and event predictor, and one guide for action selection. Source. |
Authors: Pan, X., Chen, X., Cai, Q., Canny, J., & Yu, F.
- Motivations:
  - `1-` Sample efficiency.
  - `2-` Interpretability.
- Limitations of `behavioural cloning` methods:
  - "Direct imitative behaviors do not consider future consequences of actions explicitly. [...] These models are reactive and the methods do not incorporate reinforcement or prediction signals."
- Limitations of `model-free RL` methods:
  - "To train a reliable policy, an `RL` agent requires orders of magnitude more training data than a human does for the same task."
  - "An unexplainable `RL` policy is undesirable as a single bad decision can lead to a severe consequence without forewarning."
- One term: "Semantic Predictive Control" (`SPC`).
  - It is inspired by Model Predictive Control (`MPC`) in that it seeks an optimal action sequence over a finite horizon and only executes the first action.
  - "Semantic" because the idea is to try to predict future semantic maps, conditioned on action sequences and the current observation.
  - `SPN` is trained on rollout data sampled online in the environment.
- Structure:
  - `1-` Semantic estimation.
    - Multi-scale intermediate features are extracted from `RGB` observations, using "Deep Layer Aggregation" (`DLA`), a special type of skip connections.
    - As noted: "Using semantic segmentation as a latent state representation helps to improve data efficiency."
    - This multi-scale feature representation is passed together with the planned action into the prediction module to iteratively produce future feature maps.
  - `2-` Representation prediction. What is predicted?
    - `2.1-` The future scene segmentation.
    - `2.2-` Some task-dependent variables (seen as "future events") conditioned on the current observation and action sequence. This can include:
      - `Collision` signal (binary).
      - `Off-road` signal (binary).
      - `Single-step travel distance` (scalar).
      - `Speed` (scalar).
      - `Driving angle` (scalar).
      - Note: in their POC with Flappy Bird, the authors also predicted the discounted sum of rewards.
  - `3-` Action sampling guidance. How to select actions?
    - `3.1-` One possible solution is to perform gradient descent to optimize an action sequence.
    - `3.2-` Another solution is to perform a grid search on the action space, and select the sequence with the smallest cost.
    - `3.3-` Instead, the authors propose to use the result of the `SPN`:
      - "`SPN` outputs an action guidance distribution given a state input, indicating a coarse action probability distribution".
      - Then, they sample multiple action sequences according to this action guidance distribution, evaluate their costs, and finally pick the best one.
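The guided action sampling could look roughly like this (hypothetical `guidance` and `predict_cost` interfaces, not the paper's API):

```python
import random

def select_action(state, guidance, predict_cost, n_samples=64, horizon=5):
    # guidance(state) is assumed to return (action, probability) pairs;
    # candidate action sequences are sampled from this coarse
    # distribution, scored by the learnt predictor (e.g. accumulated
    # collision / off-road penalties), and only the first action of the
    # cheapest sequence is executed, as in MPC.
    actions, probs = zip(*guidance(state))
    best_seq, best_cost = None, float("inf")
    for _ in range(n_samples):
        seq = random.choices(actions, weights=probs, k=horizon)
        cost = predict_cost(state, seq)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]
```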
"Vision-Based Autonomous Driving: A Model Learning Approach"
Click to expand
One figure:
The perception module, the memory or prediction module, and the control module. Source. |
Authors: Baheri, A., Kolmanovsky, I., Girard, A., Tseng, E., & Filev, D.
- The idea is to first learn a model of the environment (the `transition function` of the `MDP`) and subsequently derive a policy based on it.
- Three modules are used:
  - `1-` A `VAE` is trained to encode front camera views into an abstract latent representation.
  - `2-` A `LSTM` is trained to predict the latent representation of the one-time-step-ahead frame, given the action taken and the current state representation. Based on this prediction (`mean` and `std`), a next state representation is sampled using the `VAE`.
  - `3-` A `CMA-ES` is trained to take actions (`steering`, `acceleration`, and `brake`) based on the `LSTM` hidden state (capturing history information) and the current (predicted) state representation. The problem is formulated as an `MDP`.
- One idea about the continuous action space:
  - "We combine the acceleration and brake commands into a single value between `−1` to `+1`, where the values between `−1` and `0` correspond to the brake command and the values between `0` and `1` correspond to the acceleration command".
  - The authors use the term "acceleration command" for one of the actions. CARLA works with `throttle`, as humans use the gas pedal.
  - I have realized that the mapping `acceleration` -> `throttle` is very complex. Therefore I think the agent is learning the `throttle`, and considering the single NN layer used for the controller, this may be quite challenging.
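The quoted convention for the single longitudinal command can be made concrete in a trivial sketch (the function name is mine):

```python
def split_longitudinal(u):
    # Decode the single longitudinal command u in [-1, 1] into separate
    # (throttle, brake) controls, both in [0, 1], following the quoted
    # convention: negative values brake, positive values accelerate.
    u = max(-1.0, min(1.0, u))
    if u < 0:
        return 0.0, -u   # (throttle, brake)
    return u, 0.0
```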
- About the `CMA-ES`:
  - `ES` means "Evolution Strategy", i.e. an optimization technique based on ideas of evolution, iterating between `variation` (via `recombination` and `mutation`) and `selection`.
    - `ES` is easy to implement, easy to scale, very fast if parallelized, and extremely simple.
  - `CMA` means "Covariance Matrix Adaptation".
    - This means that in the `variation` phase, not only the `mean` but also the `covariance matrix` of the population is updated to increase the probability of previously successful steps.
    - Therefore, it can be seen as a Cross-Entropy Method (`CEM`) with momentum.
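A toy, diagonal-only cousin of `CMA-ES` (closer to `CEM`) conveys the iterate-variation-selection idea (illustrative only; no real covariance adaptation, and all hyper-parameters are made up):

```python
import random
import statistics

def simple_es(fitness, dim, iters=50, pop=64, elite=16, sigma=2.0):
    # Sample a population around the current mean, keep the elite by
    # fitness, then refit the mean and a per-dimension std from the
    # winners. Real CMA-ES additionally adapts the full covariance
    # matrix and uses evolution paths (the "momentum" mentioned above).
    mu = [0.0] * dim
    std = [sigma] * dim
    for _ in range(iters):
        popn = [[random.gauss(m, s) for m, s in zip(mu, std)]
                for _ in range(pop)]
        popn.sort(key=fitness, reverse=True)
        winners = popn[:elite]
        mu = [statistics.mean(w[d] for w in winners) for d in range(dim)]
        std = [statistics.pstdev([w[d] for w in winners]) + 1e-6
               for d in range(dim)]
    return mu
```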
- About sampling efficiency:
  - The authors note that the `IL` and `model-free RL` baselines required resp. `14` hours and `12` days of driving for training, and were both outperformed by the presented `model-based RL` approach, which required `5` hours of human driving.
    - This only considers the time to interact with the environment, i.e. to record images.
    - It would be interesting to consider the time needed to learn the policy afterwards.
  - `CMA-ES`, as a derivative-free method, is one of the least sample-efficient approaches.
    - I find it interesting that an evolutionary algorithm was chosen given the motivation of increasing sampling efficiency.
- About `model-based` RL:
  - The performance really depends on the ability to learn a reliable model of the environment.
    - The low-level representation of the `VAE` (size `128`) may not capture the most difficult situations.
    - The authors suggest looking at mid-level representations, such as the affordance representation of DeepDriving, instead.
  - Here, the authors strictly split the two tasks: first learn a model, then do planning.
    - Why not keep interacting from time to time with the `env`, in order to vary the sources of experience?
    - This should still be more sample-efficient than model-free approaches while making sure the agent keeps seeing "correct" transitions.
"Vision‑based control in the open racing car simulator with deep and reinforcement learning"
Click to expand
Some figures:
First extract some variables - e.g. curvature, desired speed, lateral offset, offset in heading - from images using supervised learning, and then apply control learnt with model-based RL. Source. |
The model-based PILCO algorithm is used to quickly learn to predict the desired speed. Source. |
Authors: Zhu, Y., & Zhao, D.
- Definitions:
  - State variables: `x` = [`lateral deviation`, `angle deviation`, `desired speed`].
  - Dynamical variables: `y` = [`x`, `curvature`].
  - Cost variables: `z` = [`y`, `current speed`].
  - Control variables: `u` = [`steering`, `throttle or brake`].
  - The variable `current speed` is always known: either given by `TORCS` or read from the `CAN bus`.
- One idea: contrary to `E2E`, the authors want to separate `perception` and `control`. Hence the training is divided into two steps:
  - `1-` Extract the dynamical variables `y` from the simulator (assuming full observation) and learn a driving controller -> using model-based RL.
  - `2-` Try to extract `y` from images -> using supervised learning.
  - This step-by-step method brings advantages such as the possibility for intermediate checks and uncertainty propagation.
  - But both learning processes are isolated. And one defective block can cause the whole chain to fail.
    - In particular, the authors note that the `CNN` fails at predicting `0`-lateral-offset, i.e. when the car is close to the centre, causing the full system to "vibrate".
    - This could be addressed on the controller side (damping factor or adding action consistency in the cost function), but it would be better to back-propagate these errors directly to the perception, as in pixel-to-control approaches.
- What is learnt by the controller?
  - One option would be to learn the transition function leading to the new state: `x[t+1]` = `f`(`y`, `u`, `x`). This is what the simulator applies internally.
  - Instead, here, the distribution of the change in state is learnt: `delta`(`x`) = `x[t+1]` - `x[t]` = `g`(`y`, `u`, `x`).
  - Data is collected through interactions and used to optimize the parameters of the controller:
    - Training inputs are formed by some recorded `Y` = [`y`, `u`].
    - Training targets are built with some recorded `ΔX` = [`delta`(`x`)].
- Another idea: the car is expected to run at different velocities.
  - Hence the desired speed varies depending on the curvature, the current velocity and the deviation in `heading`.
  - This is what the agent must learn to predict.
  - In the reward function of `PILCO`, the term about desired velocity plays the largest role (if you do not learn to decelerate before a turn, your experiences will always be limited since you will get off-road at each sharp turn).
- One algorithm: `PILCO` = `P`robabilistic `I`nference for `L`earning `CO`ntrol.
  - In short, this is a model-based `RL` algorithm where the system dynamics is modelled using a Gaussian process (`GP`).
  - The `GP` predicts the outcome distribution of `delta`(`x`) with probabilities. Hence the first letter `P`.
    - In particular, the job is to predict the mean and the standard deviation of this distribution, which is assumed to be Gaussian.
    - This probabilistic nature is important since model-based `RL` usually suffers from model bias.
  - The cost variables are also predicted, and based on this `z` distribution, the optimal control `u` is derived using policy gradient search (`PGS`).
    - More precisely, the control variables `u` are assumed to be a function of the expected cost `z` via an affine transformation followed by some saturation: `u` = `sat`(`w`*`z` + `b`).
    - Hence `PGS` aims at finding {`w`, `b`}: the predicted return and its derivatives are used to optimize the controller parameters.
    - The new controller is again used in `TORCS` to generate data, and the learning process is repeated.
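The control law `u` = `sat`(`w`*`z` + `b`) fits in a few lines (pure Python; using `tanh` as the saturation is my assumption, the point being only that `sat` is differentiable):

```python
import math

def controller(z, w, b):
    # Affine-plus-saturation control law: each control (steering,
    # throttle/brake) is a saturated linear combination of the cost
    # variables z; tanh keeps every output in (-1, 1) while staying
    # differentiable, so policy gradient search can tune {w, b}.
    return [math.tanh(sum(wi * zi for wi, zi in zip(w_row, z)) + bi)
            for w_row, bi in zip(w, b)]
```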
- Why and how is the vanilla `PILCO` modified?
  - The computational complexity of `PILCO` is linear in the size of the training set.
    - Instead of using a sparse GP method (e.g. `FITC`), the authors decide to prune the dataset instead.
    - In particular, observation data are sparsely collected and renewed at each iteration.
  - Other modifications relate to the difference between input/output variable types, and the use of different scenarios to calculate the expected returns.
- One quote about the difference between `PILCO` and `MPC`:
  - "The concept of PILCO is quite similar to explicit MPC algorithms, but MPC controllers are usually defined piecewise affine. For PILCO, control law can be represented in any differentiable form."
"World Models"
- [`2018`] [📝] [📝] [🎞️] [🎓 `Swiss AI Lab, IDSIA`] [🚗 `NNAISENSE`, `Google Brain`]
- [`VAE`, `world model`, `mixture density nets`, `evolutionary algorithm`, `game exploit`]
Click to expand
The world model (V + M) is trained separately from the policy (C) to predict the next latent variable of the VAE (not the image itself!) given the current action. The policy is then trained by interacting with the 'real' environment: each image is encoded, and the agent receives the corresponding latent variable z together with the hidden variable of a RNN that captures the temporal evolution and makes predictions. The authors show that it is even possible to train this policy in the learnt model only, with the advantage that the task can easily be made more challenging to increase robustness, by adding noise to the transition model. Source. |
Car Racing environment. The steering action has a range from -1 to 1, the acceleration from 0 to 1, and the brake from 0 to 1. Source. |
The image is compressed by the encoder of the VAE. A latent vector z is sampled from the estimated mean and variance parameters of the prior Gaussian. Here z contains 32 features. One can see the influence of some of them by observing how vectors are decoded by the VAE, and also see how random vectors are decoded. Source. |
Authors: Ha, D., & Schmidhuber, J.
- Motivations:
  - `1-` Address issues faced by model-free `RL`: `sampling efficiency` and `credit assignment`.
    - In complex tasks, it would be important to have large networks, e.g. `RNN`s.
    - But this is limited by the `credit assignment` problem.
    - Here the solution is to keep the `policy` very small, and give it features of larger models that were trained separately.
    - "A small `controller` lets the training algorithm focus on the `credit assignment` problem on a small search space, while not sacrificing capacity and expressiveness via the larger `world model`."
  - `2-` Getting rid of the real environment and `sim-to-real` transfer.
    - "Can we train our agent to learn inside of its own dream, and transfer this `policy` back to the actual environment?"
    - "Running computationally intensive game engines require using heavy compute resources for rendering the game states into image frames, or calculating physics not immediately relevant to the game. We may not want to waste cycles training an agent in the actual environment, but instead train the agent as many times as we want inside its simulated environment."
  - `3-` Model long-term dynamics observed from high-dimensional visual data: `64x64` images.
- How to deal with these sequences of raw pixel frames?
  - "Learning a model of the dynamics from a compressed latent space enable `RL` algorithms to be much more data efficient."
  - The authors invite readers to watch Finn's lecture on Model-Based `RL`.
- Three modules to learn, "inspired by our own cognitive system":
  - `1-` Vision (`V`).
    - Spatial representation: compress what it sees.
    - Variational Auto-Encoder (`VAE`).
      - Each image frame is compressed into a small latent vector `z`, of size `32`.
      - The latent vector `z` is sampled from the Gaussian prior `N`(`µ`, `σI`) where `µ` and `σI` are learnt.
  - `2-` Memory (`M`).
    - Temporal representation: make predictions about the future latent variable, `p`(`z`), based on historical information.
      - The `M` model serves as a predictive model of the future `z` vectors that `V` is expected to produce.
      - One alternative could be to stack frames.
    - A `RNN` is used, combined with a `Mixture Density Network` to model non-deterministic outcomes.
  - `3-` Controller (`C`): the `policy`.
    - The decision-making component that decides what `actions` to take based only on the representations created by its `vision` and `memory` components.
    - `C` is a simple single-layer linear model that maps the latent variable `z`(`t`) and the hidden state of the `RNN` `h`(`t`) directly to action `a`(`t`) at each time step.
- This structure is based on `Learning to Think` (Schmidhuber, 2015).
  - "A unifying framework for building a `RNN`-based general problem solver that can learn a `world model` of its environment and also learn to reason about the future using this model."
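The controller `C` is small enough to write out in full (a sketch; with a `32`-dim `z`, a `256`-dim `h` and a `3`-dim action, the linear map has the `867` parameters the paper reports):

```python
def controller_action(z, h, W, b):
    # The World Models controller: a single linear layer mapping the
    # concatenated [z(t), h(t)] to the action vector,
    #   a(t) = Wc [z(t) h(t)] + bc.
    # No hidden layers, no nonlinearity: all the "intelligence" lives
    # in the world model (V + M) that produced z and h.
    x = list(z) + list(h)
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]
```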
- For `M`: How to deal with the stochasticity of the `transition` model?
  - "Because many complex environments are stochastic in nature, we train our `RNN` to output a probability density function `p`(`z`) instead of a deterministic prediction of `z`."
  - `p`(`z`) is approximated as a mixture of Gaussian distributions.
  - Therefore a Mixture Density Network is combined with a `RNN` (`MDN-RNN`).
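Sampling the next `z` from the `MDN` output can be sketched per-dimension as follows (illustrative pure Python; the temperature handling mirrors the paper's trick of reshaping the mixture weights and widening the Gaussians to make the dream harder, but the exact scaling is my assumption):

```python
import math
import random

def sample_mdn(pis, mus, sigmas, temperature=1.0):
    # Sample one scalar from a mixture of Gaussians, as the MDN-RNN
    # does for each component of the next z. tau > 1 flattens the
    # mixture weights and widens each Gaussian, making the imagined
    # environment more stochastic.
    logits = [math.log(p) / temperature for p in pis]
    m = max(logits)
    ws = [math.exp(l - m) for l in logits]   # unnormalized mixture weights
    r, acc = random.random() * sum(ws), 0.0
    for w, mu, sigma in zip(ws, mus, sigmas):
        acc += w
        if r <= acc:
            return random.gauss(mu, sigma * math.sqrt(temperature))
    return random.gauss(mus[-1], sigmas[-1] * math.sqrt(temperature))
```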
- For `C`: What input for the `policy`?
  - Features extracted from the `world model` (`V` + `M`):
    - Only `z` (latent variable of the `VAE`): wobbly and unstable.
      - "The representation `z`(`t`) provided by our `V` model only captures a representation at a moment in time and does not have much predictive power."
    - `z` and `h` (hidden state of the `RNN`): more stable.
      - "Combining `z`(`t`) with `h`(`t`) gives our controller `C` a good representation of both the current observation, and what to expect in the future."
  - No rollout is needed (as opposed to `MCTS` for instance).
    - "The agent does not need to plan ahead and roll out hypothetical scenarios of the future. Since `h`(`t`) contain information about the probability distribution of the future, the agent can just query the `RNN` instinctively to guide its action decisions. Like the baseball player discussed earlier, the agent can instinctively predict when and where to navigate in the heat of the moment."
- Dimensions:
  - 1- `world model` (`V` + `M`): large.
    - The goal is to learn a compact `spatial` and `temporal` representation.
    - "While it is the role of the `Vision` model to compress what the agent sees at each time frame, we also want to compress what happens over time."
  - 2- `policy`: small.
    - "We deliberately make `C` as simple and small as possible, and trained separately from `V` and `M`, so that most of our agent’s complexity resides in the world model (`V` and `M`)."
    - Also because of the `credit assignment` problem.
    - Optimization algorithm: Covariance-Matrix Adaptation Evolution Strategy (`CMA-ES`), an evolutionary algorithm, since there are a mere `867` parameters inside the linear controller model.
    - Trained on a single machine with multiple `CPU` cores, since `ES` is also easy to parallelize.
    - The `fitness` value for the agent is the average `cumulative reward` over `16` random rollouts (`16` seeds).
</gr-replace>
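The optimization can be sketched as an ask/evaluate/tell cycle. The sketch below uses a plain Gaussian evolution strategy with elite averaging rather than full `CMA-ES` (which also adapts the covariance matrix), and a toy quadratic fitness stands in for the average return over the `16` rollouts; the parameter count is shrunk for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PARAMS, POP, SEEDS = 5, 16, 16  # toy size; the real linear controller has 867 params

def fitness(params, rng):
    """Toy stand-in for the average cumulative reward over 16 random
    rollouts; a real agent would roll out the policy in the environment."""
    rollouts = [-np.sum(params ** 2) + rng.normal(scale=0.5) for _ in range(SEEDS)]
    return np.mean(rollouts)

mean, sigma = np.ones(N_PARAMS), 0.5  # search distribution over parameters
for generation in range(30):
    pop = mean + sigma * rng.normal(size=(POP, N_PARAMS))  # "ask": sample candidates
    scores = np.array([fitness(p, rng) for p in pop])      # evaluate each candidate
    elite = pop[np.argsort(scores)[-POP // 4:]]            # keep the best 25%
    mean = elite.mean(axis=0)                              # "tell": move the mean
```

`CMA-ES` additionally adapts `sigma` and the full covariance from the elite samples, which matters once the parameter count grows.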
- Training `V`:
  - 1- `10,000` rollouts are collected (with a random agent) from the real environment.
    - (`action`, `observation`) pairs are recorded.
  - 2- The `VAE` is trained to encode each frame into a low-dimensional latent vector `z`.
    - By minimizing the difference between a given frame and the reconstructed version of that frame produced by the decoder from the latent representation `z`.
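The per-frame objective can be sketched as a reconstruction term plus a KL term pulling the encoder's Gaussian `N(mu, exp(logvar))` toward the `N(0, I)` prior. The `64×64×3` frame and `32`-d latent below follow the paper's setup; the function itself is a minimal illustrative version, not the exact implementation:

```python
import numpy as np

def vae_loss(frame, recon, mu, logvar):
    """Reconstruction error plus KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    recon_err = np.sum((frame - recon) ** 2)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon_err + kl

rng = np.random.default_rng(0)
frame = rng.uniform(size=(64, 64, 3))                # one observation frame
recon = frame + 0.01 * rng.normal(size=frame.shape)  # imperfect reconstruction
mu, logvar = np.zeros(32), np.zeros(32)              # encoder outputs for a 32-d z
loss = vae_loss(frame, recon, mu, logvar)
```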
- Training `M`:
  - Only once `V` has been trained.
    - "In principle, we can train both models together in an end-to-end manner, although we found that training each separately is more practical, and also achieves satisfactory results."
  - The `MDN-RNN` is trained to model `P(z(t+1) | a(t), z(t), h(t))` as a mixture of Gaussians.
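Training then minimizes the negative log-likelihood of the observed next latent under the predicted mixture. A sketch for a 1-d latent with hypothetical network outputs (in practice each latent dimension gets its own mixture):

```python
import numpy as np

def mdn_nll(z_next, pi, mu, sigma):
    """-log sum_k pi_k * N(z_next; mu_k, sigma_k): the MDN training loss."""
    comp = pi * np.exp(-0.5 * ((z_next - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return -np.log(np.sum(comp))

# Hypothetical MDN head outputs for one step:
pi    = np.array([0.5, 0.5])   # mixture weights
mu    = np.array([0.0, 3.0])   # component means
sigma = np.array([1.0, 1.0])   # component standard deviations

# An observed z(t+1) near a component mean is more likely, hence lower loss:
print(mdn_nll(0.0, pi, mu, sigma) < mdn_nll(1.5, pi, mu, sigma))  # True
```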
- How to deal with more complex tasks?
  - For simple environments, using a random policy to collect demonstrations may be enough to capture the dynamics.
  - One could use the learnt `policy` to collect new samples and iterate.
- How to deal with the imperfections of the generated environments?
  - Risk of cheating the `world model`.
    - "Because our world model is only an approximate probabilistic model of the environment, it will occasionally generate trajectories that do not follow the laws governing the actual environment. [...] Our world model will be exploitable by the controller, even if in the actual environment such exploits do not exist."
  - One solution: adding noise to the predictions of the learnt model, i.e. favouring robustness.
    - "We train an agent’s controller inside of a `noisier` and more `uncertain` version of its generated environment, and demonstrate that this approach helps prevent our agent from taking advantage of the imperfections of its internal `world model`."
    - This temperature parameter `τ` can be adjusted when sampling `z` to control model uncertainty.
      - It controls the tradeoff between `realism` and `exploitability`.
  - Another solution: using `Bayesian` models to estimate the uncertainty, as in `PILCO`.
    - "Using data collected from the environment, `PILCO` uses a Gaussian process (`GP`) model to learn the system dynamics, and then uses this model to sample many trajectories in order to train a controller to perform a desired task."
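One common way to apply such a temperature (an assumption here, sketching the mechanism rather than the paper's exact implementation) is to divide the mixture logits by `τ` and widen each component's `σ`, so that a higher `τ` yields flatter mixture weights and noisier samples:

```python
import numpy as np

def sample_with_tau(logits, mu, sigma, tau, rng):
    """Higher tau flattens the mixture weights and widens each Gaussian,
    giving a noisier, harder-to-exploit dream environment."""
    pi = np.exp(logits / tau)
    pi /= pi.sum()
    k = rng.choice(len(pi), p=pi)
    return rng.normal(mu[k], sigma[k] * np.sqrt(tau))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.0])  # hypothetical unnormalized mixture weights
mu, sigma = np.array([0.0, 5.0]), np.array([0.1, 0.1])

cold = [sample_with_tau(logits, mu, sigma, 0.1, rng) for _ in range(1000)]
hot  = [sample_with_tau(logits, mu, sigma, 2.0, rng) for _ in range(1000)]
# High-tau samples are far more spread out: both components get sampled.
```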
- Benefits of substituting the actual environment with the learnt `world model`:
  - "In this simulation, we do not need the `V` model to encode any real pixel frames during the hallucination process, so our agent will therefore only train entirely in a latent space environment."
  - By increasing the `uncertainty`, the dream environment can be made more difficult than the actual environment.
    - "Unlike the actual game environment, however, we note that it is possible to add extra `uncertainty` into the virtual environment, thus making the game more challenging in the dream environment."
- Limitations:
  - 1- Limited memory capacity of the `NN` model.
    - "While the human brain can hold decades and even centuries of memories to some resolution, our neural networks trained with back-propagation have more limited capacity and suffer from issues such as catastrophic forgetting."
  - 2- `V` is trained independently of the `reward` signals. Hence it may encode parts of the observations that are not relevant to the task.
    - "After all, unsupervised learning cannot, by definition, know what will be useful for the task at hand."
    - "By training together with an `M` that predicts rewards, the `VAE` may learn to focus on task-relevant areas of the image, but the tradeoff here is that we may not be able to reuse the `VAE` effectively for new tasks without retraining."