# Explanation

Beyond the lack of noise in simulation, models in training also exploit inaccuracies in the simulation software itself. As a result, they often transfer poorly to reality, since the properties they exploit do not actually exist there.

Dynamics randomization addresses this issue by randomizing not just the textures and patterns of objects in the environment but also the parameters that govern the physics of the simulation itself. As a result, trained models learn to generalize not just across different types of objects and noise, but also across different laws of physics.

We can think of this as training a model that generalizes beyond our reality to work in any reality defined by a physical system. The transfer problem then reduces to the model operating in just one of the many physical systems it has learned to operate in.

Empirically, this makes learned policies far more robust in simulation-to-reality transfer.

# Notes

> By randomizing the dynamics of the simulator during training, we are able to develop policies that are capable of adapting to very different dynamics, including ones that differ significantly from the dynamics on which the policies were trained.

They apply domain randomization to the dynamics of the simulator so that the model generalizes across a wide range of dynamics.

> This adaptivity enables the policies to generalize to the dynamics of the real world without any training on the physical system.

This allows the robot to learn to operate in a huge number of realities, where our reality is just one of them.

This is a method to bridge the reality gap for robotics control.

> Transferring policies from simulation to the real world entails challenges in bridging the "reality gap", the mismatch between the simulated and real world environments.

> Narrowing this gap has been a subject of intense interest in robotics, as it offers the potential of applying powerful algorithms that have so far been relegated to simulated domains.

Bridging the reality gap is highly valuable because deep reinforcement learning algorithms have been so powerful in simulation.

> The effectiveness of our approach is demonstrated on an object pushing task, where a policy trained exclusively in simulation is able to successfully perform the task with a real robot without additional training on the physical system.

### Related Work

> [Deep reinforcement learning] has enabled simulated agents to develop highly dynamic motor skills.

> But due to the high sample complexity of RL algorithms and other physical limitations, many of the capabilities demonstrated in simulation have yet to be replicated in the physical world.

**1. Domain Adaptation**

Transferring learning from simulation to reality is a subset of the domain adaptation problem.

The key assumption of domain adaptation is that different domains share enough characteristics that representations learned in one domain are useful in the others.

> While promising, these methods nonetheless still require data from the target domain during training.

There are many useful domain adaptation approaches like learning invariant features and progressive networks.

However, all these approaches require data from the real world during training, which significantly reduces the utility of training in simulation.

**2. Domain Randomization**

> With domain randomization, discrepancies between the source and target domains are modeled as variability in the source domain.

Domain randomization uses variability to make the model generalize to differences between the source and target domain.

This method has been used successfully for sim2real transfer of vision-based robotics policies.

There have been previous approaches to transfer learning for control policies but they have been entirely for simulation to simulation transfer.

> We show that memory-based policies are able to cope with greater variability during training and also better generalize to the dynamics of the real world.

The prior approach to sim2real transfer for robotics control using domain randomization used memoryless feedforward networks for their motor policies.

This limited their ability to adapt to the error between the simulation and real environment.

In this paper, they use memory-based policies which they claim adapt better to the mismatch between environments.

**3. Non-prehensile Manipulation**

Pushing is hard for robots to learn because of the complexity of modeling friction and contacts between surfaces.

Deep learning has been used to build successful pushing models, but it requires large datasets that take a long time to collect.

> In contrast, we will show that adaptive policies can be trained exclusively in simulation and using only sparse rewards.

### Background

The state of the environment at time $t$ is $s_t \in S$. At each step, the robot samples an action from its policy, $a_t \sim \pi(a_t|s_t)$.

The reward at each instant is given by $r_t = r(s_t, a_t)$.

The agent has to maximize the multi-step return over an episode of $T$ time-steps, with discount factor $\gamma \in [0, 1]$.

$$
R_t = \sum_{t'=t}^{T} \gamma^{t'-t}r_{t'}
$$
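As a quick numeric check of the return, here is a minimal sketch in plain Python. The function name `discounted_return`, the 0-based indexing of rewards, and the example reward values are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma, t=0):
    """Discounted return from step t: sum of gamma**(k - t) * rewards[k] for k >= t.

    Assumes rewards[k] is the reward received at step k (0-indexed).
    """
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)


# Hypothetical three-step episode that reaches the goal on the last step.
rewards = [-1.0, -1.0, 0.0]
print(discounted_return(rewards, gamma=0.5))       # -1.5
print(discounted_return(rewards, gamma=0.5, t=1))  # -1.0
```

Rewards before step $t$ do not contribute, and later rewards are weighted down geometrically.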

The learning objective is to find the optimal policy $\pi^* = \textrm{arg max}_\pi J(\pi)$, where $J(\pi)$ is the expected return of the agent:

$$
J(\pi) = \mathbb{E}[R_0|\pi] = \mathbb{E}_{\tau \sim p(\tau | \pi)}\left[\sum_{t=0}^{T-1}r(s_t, a_t)\right]
$$

Here $p(\tau|\pi)$ is the likelihood of a trajectory $\tau = (s_0, a_0, s_1, \dots, a_{T-1}, s_T)$ under the policy $\pi$.

$$
p(\tau|\pi) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\pi(a_t|s_t)
$$
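To make the factorization concrete, the sketch below evaluates the log-likelihood of a trajectory in a tiny tabular setting. Everything in it is a hypothetical illustration: the two states (`"far"`, `"near"`), the actions, and all probabilities are made up, and the real task is continuous rather than tabular.

```python
import math


def trajectory_log_likelihood(states, actions, p_init, p_trans, policy):
    """log p(tau | pi) for tau = (s_0, a_0, s_1, ..., a_{T-1}, s_T)."""
    logp = math.log(p_init[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += math.log(policy[s][a])             # policy term for (s_t, a_t)
        logp += math.log(p_trans[(s, a)][s_next])  # dynamics term p(s_{t+1} | s_t, a_t)
    return logp


p_init = {"far": 1.0}
policy = {"far": {"push": 0.8, "wait": 0.2}}

# Two hypothetical dynamics models that differ only in how effective a push is.
low_friction = {("far", "push"): {"near": 0.5, "far": 0.5}}
high_friction = {("far", "push"): {"near": 0.1, "far": 0.9}}

states, actions = ["far", "near"], ["push"]
print(math.exp(trajectory_log_likelihood(states, actions, p_init, low_friction, policy)))   # approximately 0.4
print(math.exp(trajectory_log_likelihood(states, actions, p_init, high_friction, policy)))  # approximately 0.08
```

The policy is identical in both calls; only the transition table changes, and with it the likelihood of the same trajectory.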

Most importantly, $p(s_{t+1}|s_t, a_t)$ is completely determined by the dynamics of the environment, so this is the term through which the dynamics influence the learned policy.

**1. Policy Gradient Methods**

> Policy gradient methods are a popular class of algorithms for learning parametric policies where an estimate of the gradient of the objective $\nabla_\theta J(\pi_\theta)$ is used to perform gradient ascent to maximize the expected return.

This can be generalized to tasks where the agent has a different goal every episode with a universal policy.

> A universal policy is a simple extension where the goal $g \in G$ is provided as an additional input to the policy $\pi(a|s, g)$.

In this case, the goal is the location to push to and is set at the beginning of the episode and stays consistent throughout.

**2. Hindsight Experience Replay**

Instead of designing complex reward functions for reinforcement learning, it is easier to use a binary reward that only indicates whether a goal is satisfied in a given state: $r(s, g)$ is 0 if $g$ is satisfied in $s$ and -1 otherwise.

This will yield -1 reward for all time steps in most initial episodes, preventing learning.

Instead, they use Hindsight Experience Replay (HER): after an episode, the original goal is replaced with one that is satisfied in the final state of that episode, and the relabeled episode is used for learning.
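The relabeling step can be sketched as follows. This is a simplified illustration, not the paper's implementation: puck positions are scalars, the tolerance `tol` is a made-up value, and the reward is assumed to be 0 on success and -1 otherwise (the convention used in the HER paper).

```python
def sparse_reward(achieved, goal, tol=0.05):
    """Binary reward: 0 if the goal is satisfied within tol, otherwise -1."""
    return 0.0 if abs(achieved - goal) <= tol else -1.0


def relabel_with_final_state(episode):
    """Relabel an episode with the goal that its final state actually achieved.

    episode: list of (state, action, next_state) tuples with scalar positions.
    Returns a list of (state, action, next_state, goal, reward) tuples.
    """
    hindsight_goal = episode[-1][2]
    return [
        (s, a, s_next, hindsight_goal, sparse_reward(s_next, hindsight_goal))
        for s, a, s_next in episode
    ]


# Hypothetical failed episode: the intended goal was 1.0, but the puck stopped at 0.3.
episode = [(0.0, 0.1, 0.1), (0.1, 0.1, 0.2), (0.2, 0.1, 0.3)]
original_rewards = [sparse_reward(s_next, 1.0) for _, _, s_next in episode]
relabeled_rewards = [r for *_, r in relabel_with_final_state(episode)]
print(original_rewards)   # [-1.0, -1.0, -1.0]
print(relabeled_rewards)  # [-1.0, -1.0, 0.0]
```

The relabeled episode now contains a successful transition, so the agent gets a learning signal even though it never reached the intended goal.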

### Method

The goal is to train a policy that can perform a task under the dynamics of the real world $p^*(s_{t+1}|s_t, a_t)$ by training using an approximate dynamics model $\hat{p}(s_{t+1}|s_t, a_t) \approx p^*(s_{t+1}|s_t, a_t)$.

> It has been observed that DeepRL policies are prone to exploiting idiosyncrasies of the simulator to realize behaviors that are infeasible in the real world.

> Instead of training a policy under one particular dynamics model, we train a policy that can perform a task under a variety of different dynamics models.

They use domain randomization on the actual dynamics of the environment.

> By training policies to adapt to variability in the dynamics of the environment, the resulting policy might then better generalize to the dynamics of real world.
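One way to write this down, as a sketch: assume each dynamics model is indexed by a parameter vector $\mu$ drawn from a distribution $\rho_\mu$, so the simulator's transition model becomes $\hat{p}(s_{t+1}|s_t, a_t, \mu)$. The objective from the Background section then gains an outer expectation over $\mu$:

$$
\mathbb{E}_{\mu \sim \rho_\mu}\left[\mathbb{E}_{\tau \sim p(\tau | \pi, \mu)}\left[\sum_{t=0}^{T-1}r(s_t, a_t)\right]\right]
$$

Under this reading, the policy is no longer rewarded for exploiting any single $\mu$; it has to do well on average across the whole family, with the hope that the real dynamics $p^*$ behave like one more member of that family.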

**1. Task**

The task uses a binary reward: a 7-DoF robotic arm has to push a puck to a randomly selected goal position $g$.

![Screenshot 2024-11-04 at 2.14.29 PM.png](../../../images/notes/Screenshot_2024-11-04_at_2.14.29_PM.png)

**2. State and Action**

The state consists of the joint positions and velocities of the arm, the gripper position, and the position, orientation, and velocities of the puck.

Actions from the policy specify target joint angles.

**3. Dynamics Randomization**

> At the start of each episode, a random set of dynamics parameters $\mu$ are sampled according to $\rho_\mu$ and held fixed for the duration of the episode.

The randomized parameters include the mass of the links, the damping of the joints, the friction, mass, and damping of the puck, the height of the table, the observation noise, and the time-step between actions, among others.
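A minimal sketch of per-episode sampling is shown below. The parameter names and ranges in `PARAM_RANGES` are placeholders invented for illustration (these notes do not record the actual ranges), and uniform sampling is an assumption.

```python
import random

# Hypothetical (low, high) ranges; placeholders, not the values used in the paper.
PARAM_RANGES = {
    "link_mass_scale": (0.5, 2.0),
    "joint_damping_scale": (0.5, 2.0),
    "puck_mass": (0.1, 0.4),
    "puck_friction": (0.1, 1.0),
    "puck_damping": (0.01, 0.2),
    "table_height": (0.70, 0.80),
    "action_timestep": (0.001, 0.008),
}


def sample_dynamics(rng):
    """Draw one set of dynamics parameters mu, uniformly within each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


rng = random.Random(0)
for episode in range(3):
    mu = sample_dynamics(rng)  # sampled once at the start of the episode
    # A real training loop would configure the simulator with mu here and
    # keep it fixed for every step of the episode.
    print(episode, round(mu["puck_friction"], 3), round(mu["table_height"], 3))
```

Sampling once per episode, rather than per step, means each episode looks like a consistent physical world whose properties the policy can try to infer.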

**4. Adaptive Policy**

> In the absence of direct knowledge of the parameters, the dynamics can be inferred from a history of past states and actions.

The dynamics of a system are not directly accessible in the real world but can be inferred from the response to actions.

Previous systems decompose the problem with a system identification module $\phi(s_t, h_t) = \hat{\mu}$, which uses the history of past states and actions $h_t$ to infer the specific dynamics parameters $\hat{\mu}$. These are then passed to the policy $\pi(a_t|s_t, \hat{\mu})$.

However, this requires specification of the exact dynamics properties you want the model to predict, which is hard in real world scenarios where dynamics can differ in unpredictable ways.

Instead, they use a memory system $\pi(a_t|s_t, z_t, g)$ where $z_t = z(h_t)$ acts as a summary of past states and actions.

> This model can then be trained end-to-end and the representation of the internal memory can be learned without requiring manual identification of a set of dynamics parameters to be inferred at runtime.

By giving the model access to the history of states and actions, it can learn to infer dynamics properties on its own.
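The toy sketch below shows the shape of such a policy: the action depends on the current state, the goal, and a memory value that summarizes earlier states and actions. It is not the paper's architecture. It uses a single scalar tanh cell in place of an LSTM, and the class name and all weights are hand-picked, untrained, hypothetical values.

```python
import math


class TinyRecurrentPolicy:
    """Toy policy whose action depends on state, goal, and a memory z of the past."""

    def __init__(self, w_z=0.9, w_s=0.5, w_a=0.5, k_goal=1.0, k_mem=0.1):
        self.w_z, self.w_s, self.w_a = w_z, w_s, w_a
        self.k_goal, self.k_mem = k_goal, k_mem
        self.reset()

    def reset(self):
        self.z = 0.0  # empty memory at the start of an episode

    def act(self, state, goal):
        # The action uses the current state, the goal, and the memory of past steps.
        action = self.k_goal * (goal - state) + self.k_mem * self.z
        # Fold the current (state, action) pair into the memory for the next step.
        self.z = math.tanh(self.w_z * self.z + self.w_s * state + self.w_a * action)
        return action


policy = TinyRecurrentPolicy()
first = policy.act(state=0.0, goal=1.0)
second = policy.act(state=0.0, goal=1.0)  # same state and goal, different history
print(first, second)
```

If the puck has not moved after a push (same state twice), the memory changes the next action. A trained recurrent policy can use this kind of signal to respond differently to, say, a heavier puck or higher friction.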

**5. Recurrent Deterministic Policy Gradient**

They use HER to augment the training data by replaying failed episodes with goals that were actually achieved. Because the relabeled data no longer matches what the current policy would generate, this requires off-policy learning.

DDPG (Deep Deterministic Policy Gradient) is an off-policy algorithm for continuous control. RDPG (Recurrent Deterministic Policy Gradient) is a variant of DDPG for recurrent policies, as in this case where past pushes influence the future trajectory.

**6. Network Architecture**

![Screenshot 2024-11-04 at 2.27.55 PM.png](../../../images/notes/Screenshot_2024-11-04_at_2.27.55_PM.png)

The model uses a policy network and a value network, each with a feedforward branch and a recurrent branch.

The recurrent branch is tasked with inferring the dynamics of the system given past observations.

LSTM units are used for internal memory.

### Experiments

All simulations are performed in MuJoCo.

They randomize the dynamics of the environment, and they also simulate sensor noise by applying Gaussian noise to the observed state features.
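A minimal sketch of the sensor-noise step, assuming independent zero-mean Gaussian noise with a single standard deviation `sigma` for every feature. The function name, the example state, and the value of `sigma` are hypothetical.

```python
import random


def add_observation_noise(state, sigma, rng):
    """Return a noisy copy of the state with independent Gaussian noise per feature."""
    return [x + rng.gauss(0.0, sigma) for x in state]


rng = random.Random(0)
true_state = [0.10, -0.25, 0.73]  # hypothetical state features
noisy_state = add_observation_noise(true_state, sigma=0.01, rng=rng)
print(noisy_state)
```

The policy only ever sees the noisy copy, while the simulator keeps stepping from the true state.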

> Little calibration was performed to ensure that the behavior of the simulation closely conforms to that of the real robot.

> The success rate is determined as the portion of episodes where the goal is fulfilled at the end of the episode.
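The metric can be sketched as below, assuming scalar puck positions and a hypothetical tolerance `tol` for deciding when a goal counts as fulfilled. The notes do not record the actual threshold.

```python
def success_rate(final_positions, goals, tol=0.05):
    """Fraction of episodes whose final puck position is within tol of the goal."""
    successes = sum(1 for p, g in zip(final_positions, goals) if abs(p - g) <= tol)
    return successes / len(goals)


# Hypothetical outcomes of four evaluation episodes.
print(success_rate([0.0, 0.5, 1.0, 0.98], [0.0, 0.0, 1.0, 1.0]))  # 0.75
```

Only the final state of each episode counts, so a puck that passes through the goal and keeps sliding is scored as a failure.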

![Screenshot 2024-11-04 at 2.34.23 PM.png](../../../images/notes/Screenshot_2024-11-04_at_2.34.23_PM.png)

> The LSTM learns faster while also converging to a higher success rate than the feedforward models.

![Screenshot 2024-11-04 at 2.35.43 PM.png](../../../images/notes/Screenshot_2024-11-04_at_2.35.43_PM.png)

> The feedforward network trained without randomization is unable to perform the task under the real world dynamics.

![Screenshot 2024-11-04 at 2.36.49 PM.png](../../../images/notes/Screenshot_2024-11-04_at_2.36.49_PM.png)

> Policies trained without randomizing the action time-step and observation noise show particularly noticeable drops in performance. This suggests that coping with the latency of the controller and sensor noise are important factors in adapting to the physical system.

### Conclusions

> Training policies with randomized dynamics in simulation enables the resulting policies to be deployed directly on a physical robot despite poor calibrations.

> By training exclusively in simulation, we are able to leverage simulators to generate a large volume of training data, thereby enabling us to use powerful RL techniques that are not yet feasible to apply directly on a physical system.
