# Explanation

Domain randomization and dynamics randomization add sufficient noise to simulated objects and dynamics so that trained policies generalize well enough to transfer to the real world.

However, they also make training inefficient. With completely random dynamics and environment properties, the model spends a large portion of its training time in environments that don't map to the real world, which hinders its ability to improve under real-world conditions. Because of this, it's more advantageous to start with a narrow set of simulation conditions and then expand them to add noise only where necessary.

This is what SimOpt does. It initializes the simulation with a relatively narrow range of noise and dynamics conditions, then uses the divergence between observations sampled from the simulation and from reality to optimize the simulation parameters. The aim is to add just enough randomness to strike the right trade-off between training efficiency and robustness.

# Notes

> However, design of the appropriate simulation parameter distributions remains a tedious task and often requires a substantial expert knowledge.

Design of the simulator distribution used in domain randomization is important.

The signal vs. noise that the model learns depends on the randomization parameters, and this determines the model's ability to generalize from simulation to the real environment.

> In this work, we apply a data-driven approach and use real world data to adapt simulation randomization such that the behavior of the policies trained in simulation better matches their behavior in the real world.

They provide a better approach to creating the simulation distribution for domain randomization that’s grounded in real world data.

> Our system uses partial observations of the real world and only needs to compute rewards in simulation, therefore lifting the requirement for full state knowledge or reward instrumentation in the real world.

### Closing the Sim-to-Real Loop

![Screenshot 2024-11-04 at 4.28.39 PM.png](../../../images/Screenshot_2024-11-04_at_4.28.39_PM.png)

**1. Simulation Randomization**

The system dynamics for the reinforcement learning system are either induced by the real world $p(s_{t+1}|s_t, a_t)$ or by an approximate simulation $\hat{p}(s_{t+1}|s_t, a_t)$.

We can define the distribution of simulation parameters as $\xi \sim p_\phi(\xi)$ parameterized by $\phi$.

The resulting simulation engine dynamics are defined by $p_{\xi \sim p_\phi} = p(s_{t+1}|s_t, a_t, \xi)$.
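To make the notation concrete, here is a minimal sketch of dynamics conditioned on sampled parameters $\xi$. Everything in it is an assumption for illustration: the toy 1-D point mass, the choice of `(mass, damping)` as $\xi$, the diagonal Gaussian for $p_\phi$ (the Implementation notes below describe a Gaussian), and the clipping that keeps parameters positive. None of it comes from the paper's simulator.

```python
import random

def step(state, action, xi, dt=0.01):
    """One step of a toy 1-D point mass; xi = (mass, damping) are the sim parameters."""
    pos, vel = state
    mass, damping = xi
    acc = (action - damping * vel) / mass
    vel = vel + dt * acc
    pos = pos + dt * vel
    return (pos, vel)

def sample_xi(mu, sigma, rng):
    """Draw xi from N(mu, diag(sigma^2)), clipped so parameters stay positive."""
    return tuple(max(1e-3, rng.gauss(m, s)) for m, s in zip(mu, sigma))

rng = random.Random(0)
# phi = (mu, sigma): hypothetical values for (mass, damping)
phi = {"mu": (1.0, 0.2), "sigma": (0.1, 0.05)}
xi = sample_xi(phi["mu"], phi["sigma"], rng)
next_state = step((0.0, 0.0), action=1.0, xi=xi)
```

Each call to `sample_xi` gives a different simulator instance, so a policy trained across many draws sees the whole family of dynamics defined by $\phi$.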

> It’s possible to design a distribution of simulation parameters $p_\phi(\xi)$ such that a policy trained on $p_{\xi \sim p_\phi}$ would perform well on a real world dynamics distribution.

This is exactly what domain randomization is.

> It is often disadvantageous to use overly wide distributions of simulation parameters as they can include scenarios with infeasible solutions that hinder successful policy learning, or lead to exceedingly conservative policies.

Using wide simulation distributions can work, but it makes training inefficient and produces overly conservative policies that try to cover many scenarios that never occur in reality.

So tuning the simulation distribution is important to avoid these inefficiencies.

**2. Learning Simulation Randomization**

> The goal of our framework is to find a distribution of simulation parameters that brings observations or partial observations induced by the policy trained under this distribution closer to the observations of the real world.

The simulation distribution is optimized by minimizing the following objective: the expected discrepancy between the observations generated in simulation and the observations generated in reality.

$$
\min_\phi \mathbb{E}_{p_{\xi \sim p_\phi}} [\mathbb{E}_{\pi_{\theta, p_\phi}} [D(\tau_\xi^{ob}, \tau_{real}^{ob})]]
$$

Rather than doing this over the entire space of observations that reality and the simulation could generate (which would be intractable), they use the robot policy $\pi_{\theta, p_\phi}$ both to optimize the simulation parameters and to sample the real world observations that correspond to the actions the policy takes.

They bound each update of the parameter distribution by a KL-divergence step $\epsilon$ so that the updated distribution stays within the trust region of the policy trained on the previous one.

$$
D_{KL}(p_{\phi_{i+1}}||p_{\phi_i}) \leq \epsilon
$$
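For Gaussian parameter distributions this KL divergence has a closed form, so the constraint is cheap to check. A minimal sketch, assuming diagonal covariances for simplicity (the paper's $\Sigma$ may be full); the function names and the example value of $\epsilon$ are illustrative, not from the paper.

```python
import math

def kl_diag_gaussians(mu_new, var_new, mu_old, var_old):
    """KL( N(mu_new, diag(var_new)) || N(mu_old, diag(var_old)) )."""
    kl = 0.0
    for m1, v1, m0, v0 in zip(mu_new, var_new, mu_old, var_old):
        kl += 0.5 * (math.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)
    return kl

def within_trust_region(mu_new, var_new, mu_old, var_old, epsilon):
    """True if the proposed update satisfies the KL step bound."""
    return kl_diag_gaussians(mu_new, var_new, mu_old, var_old) <= epsilon

# Hypothetical update: shift the mean slightly, keep the variance.
ok = within_trust_region([0.1], [1.0], [0.0], [1.0], epsilon=0.01)
```

Note the argument order: the new distribution comes first, matching $D_{KL}(p_{\phi_{i+1}}||p_{\phi_i})$.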

**3. Implementation**

They use PPO on a multi-GPU cluster to run the RL training.

They parameterize the simulation parameter distribution as a Gaussian $p_\phi(\xi) \sim \mathcal{N}(\mu, \Sigma)$ with $\phi = (\mu, \Sigma)$. They also use weights for the importance of each observation dimension which can be tuned.

The simulator is non-differentiable, so they optimize the objective with a sampling-based, gradient-free algorithm based on relative entropy policy search (that is, they search the parameter space by sampling rather than by following gradients through the simulator).
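The general shape of such a sampling-based update can be sketched as follows. This is a simplified stand-in, not the paper's algorithm: relative entropy policy search chooses the temperature by solving a dual problem so that the KL bound holds, whereas here `eta` is a fixed, hypothetical constant. The diagonal Gaussian and the toy cost function are also assumptions.

```python
import math
import random

def weighted_gaussian_update(mu, sigma, cost_fn, n_samples, eta, rng):
    """One gradient-free update of a diagonal Gaussian over sim parameters.

    Lower-cost samples get exponentially larger weight; the new mean and
    standard deviation are the weighted maximum-likelihood fit.
    """
    samples = [[rng.gauss(m, s) for m, s in zip(mu, sigma)]
               for _ in range(n_samples)]
    costs = [cost_fn(xi) for xi in samples]
    c_min = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - c_min) / eta) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]

    dim = len(mu)
    new_mu = [sum(w * xi[d] for w, xi in zip(weights, samples))
              for d in range(dim)]
    new_sigma = [
        math.sqrt(sum(w * (xi[d] - new_mu[d]) ** 2
                      for w, xi in zip(weights, samples)))
        for d in range(dim)
    ]
    return new_mu, new_sigma

# Toy stand-in for the observation discrepancy: squared distance
# to a hidden "real" parameter value of 0.8.
def toy_cost(xi):
    return (xi[0] - 0.8) ** 2

rng = random.Random(0)
mu, sigma = [0.5], [0.2]
for _ in range(5):
    mu, sigma = weighted_gaussian_update(mu, sigma, toy_cost, 200, eta=0.01, rng=rng)
```

Only cost evaluations are needed, never gradients of the simulator, which is why this family of methods suits a non-differentiable physics engine.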

> We choose weighted $\ell_1$ and $\ell_2$ norms between simulation and real world observations for our observation discrepancy function $D$.

$$
D(\tau_\xi^{ob}, \tau_{real}^{ob}) = w_{\ell_1}\sum_{i=1}^T |W(o_{i, \xi} - o_{i,real})| + w_{\ell_2}\sum_{i=1}^T ||W(o_{i, \xi} - o_{i,real})||_2^2
$$
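A direct transcription of this discrepancy into code might look like the sketch below. It assumes $W$ is diagonal (one importance weight per observation dimension, here `w_dims`) and that both trajectories have the same length; the function and argument names are illustrative.

```python
def observation_discrepancy(obs_sim, obs_real, w_dims, w_l1, w_l2):
    """Weighted l1 + squared-l2 discrepancy between two observation trajectories.

    obs_sim, obs_real: lists of T observation vectors (lists of floats).
    w_dims: per-dimension importance weights (diagonal of W, an assumption).
    """
    l1_total = 0.0
    l2_total = 0.0
    for o_sim, o_real in zip(obs_sim, obs_real):
        diff = [w * (a - b) for w, a, b in zip(w_dims, o_sim, o_real)]
        l1_total += sum(abs(d) for d in diff)
        l2_total += sum(d * d for d in diff)
    return w_l1 * l1_total + w_l2 * l2_total

# Two time steps, two observation dimensions (made-up numbers).
sim = [[1.0, 2.0], [0.0, 0.0]]
real = [[0.0, 0.0], [0.0, 1.0]]
d = observation_discrepancy(sim, real, w_dims=[1.0, 0.5], w_l1=1.0, w_l2=1.0)
```

Only observations appear in the function, which is consistent with the point that no real-world reward or full state is needed.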

> Sampling of simulation parameters and the corresponding policy roll-outs is highly parallelizable, which we use in our experiments to evaluate large amounts of simulation parameter samples.

Given one batch of real-world roll-outs, they can evaluate many candidate sets of simulation parameters in parallel and very quickly, which lets them keep reducing the discrepancy between simulated and real observations.

### Experiments

> As we observe, training on very wide parameter distributions is significantly more difficult and prone to fail compared to initializing with a conservative parameter distribution and updating it using _SimOpt_ afterwards.

Instead of starting with domain randomization that randomizes everything aggressively, it’s more efficient to start with a very narrow parameter distribution and then expand it using _SimOpt_.

> Next, we show that we can successfully transfer policies to real robots […] for complex articulated tasks such as cabinet drawer opening, and tasks with non-rigid bodies and complex dynamics, such as swing-peg-in-hole task with the peg swinging on a soft rope.

**1. Tasks**

Swing-peg-in-hole and drawer opening.

> We would like to emphasize that our method does not require the full state information of the real world

**2. Simulation Engine**

> We use NVIDIA Flex as a high-fidelity GPU based physics simulator.

**3. Comparison to Standard Domain Randomization**

> Moreover, learning performance of standard domain randomization depends strongly on the variance of the parameter distribution.

![Screenshot 2024-11-04 at 4.19.34 PM.png](../../../images/Screenshot_2024-11-04_at_4.19.34_PM.png)

> Increasing variance further, in an attempt to cover a wider operating range, can often lead to simulating unrealistic scenarios and catastrophic breakdown of the physics simulation with various joints of the robot reaching their limits.

![Screenshot 2024-11-04 at 4.13.31 PM.png](../../../images/Screenshot_2024-11-04_at_4.13.31_PM.png)

> As we can observe in Fig 4, a large part of the randomized instances does not have a feasible solution

![Screenshot 2024-11-04 at 4.22.59 PM.png](../../../images/Screenshot_2024-11-04_at_4.22.59_PM.png)

![Screenshot 2024-11-04 at 4.22.50 PM.png](../../../images/Screenshot_2024-11-04_at_4.22.50_PM.png)

> Fig. 6 shows how the source distribution variance adapts to the target distribution variance for this experiment and Fig. 7 shows that our method starts with a conservative guess of the initial distribution of the parameters and changes it using target scene roll-outs until policy behavior in target and source scenes starts to match.

**4. Real Robot Experiments**

![Screenshot 2024-11-04 at 4.26.43 PM.png](../../../images/Screenshot_2024-11-04_at_4.26.43_PM.png)

> At each iteration, we perform 100 iterations of RL in approximately 7 minutes and 3 roll-outs on the real robot using the currently trained policy to collect real world observations. Then, we run 3 update steps of the simulation parameter distribution with 9600 simulation samples per update.

Both the swing-peg-in-hole and drawer opening scenarios improve with multiple SimOpt updates.
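The per-iteration schedule quoted above can be sketched as an outer loop. The counts (100 RL iterations, 3 real roll-outs, 3 distribution updates, 9600 samples per update) are from the quote; everything else is hypothetical scaffolding. `train_policy`, `collect_real_rollouts`, and `update_distribution` are placeholder callables supplied by the caller, and passing the previous policy back into training is an assumption, not something these notes confirm.

```python
from dataclasses import dataclass

@dataclass
class SimOptSchedule:
    rl_iterations: int = 100            # RL iterations per SimOpt iteration
    real_rollouts: int = 3              # roll-outs on the real robot
    distribution_updates: int = 3       # updates of the parameter distribution
    sim_samples_per_update: int = 9600  # simulation samples per update

def simopt_loop(phi, policy, n_simopt_iters, train_policy,
                collect_real_rollouts, update_distribution, schedule=None):
    """Alternate between RL in simulation and adapting phi to real observations."""
    if schedule is None:
        schedule = SimOptSchedule()
    for _ in range(n_simopt_iters):
        # 1. Train the policy in simulation under the current distribution.
        policy = train_policy(policy, phi, schedule.rl_iterations)
        # 2. Run the policy on the real robot to collect observations only.
        real_obs = collect_real_rollouts(policy, schedule.real_rollouts)
        # 3. Move the distribution toward parameters that reproduce real_obs.
        for _ in range(schedule.distribution_updates):
            phi = update_distribution(
                phi, policy, real_obs, schedule.sim_samples_per_update
            )
    return phi, policy
```

Real-robot time is limited to step 2, which needs only a handful of roll-outs; the expensive sampling in step 3 happens entirely in simulation.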

### Conclusion

> In this work, we demonstrated that adapting simulation randomization using real world data can help in learning simulation parameter distributions that are particularly suited for a successful policy transfer without the need for exact replication of the real world environment.

> We showed that updating simulation distributions is possible using partial observations of the real world while the full state still can be used for the reward computation in simulation.