# Notes

> Performing robotic learning in a physics simulator could accelerate the impact of machine learning on robotics by allowing faster, more scalable, and lower-cost data collection than is possible with physical robots.

Data collection with physical robots is slow and expensive. Data collection in simulation would be orders of magnitude faster and cheaper and allow much larger scale.

Simulation would also let robotics benefit from advances in deep reinforcement learning: the random exploration these methods rely on is dangerous on a physical robot but safe and fast in simulation.

> Unfortunately, discrepancies between physics simulators
> and the real world make transferring behaviors from simulation challenging.

Differences between simulation and the real world make up the **reality gap**.

System identification (tuning the simulation parameters to match the real system) is time-consuming.

Unmodeled physical effects can make a model trained in simulation perform poorly on the real system.

Simulation often lacks the richness and noise of the real world.

> Instead of training a model on a single simulated environment, we randomize the simulator to expose the model to a wide range of environments at training time.

Domain randomization trains the model on many randomly varied versions of the simulated environment, so that the real world looks to the model like just another variant of the simulation.

This paper uses domain randomization for object localization.

> To our knowledge, this is the first successful transfer of a deep neural network trained only on simulated RGB images to the real world for the purpose of robotic control.

To the authors' knowledge, this is the first deep neural network trained only on simulated RGB images to transfer successfully to the real world for robotic control.

### Related Work

Object detection and pose estimation (detecting the position and orientation of an object) is a well-studied problem in robotics.

Traditional approaches detect the full 3D pose of objects. This paper's approach avoids 3D reconstruction by using deep learning.

The **domain adaptation** problem deals with adapting vision-based models trained in a source domain to an unseen target domain. There are many domain adaptation approaches. Domain randomization can remove the need for domain adaptation, or it can be combined with it.

There have been many approaches to bridging the reality gap.

**Iterative learning control** uses a loop: train a model in simulation, run it in reality, then use the real-world error to further improve the simulated environment.

Domain randomization requires no further training on real-world data. Unlike some other approaches, it needs no labeled real-world images.

The approach in this paper doesn’t rely on precise camera information or specific textures. It instead randomly generates the conditions of the simulation environment.

### Method

They want to train an object detector model that takes a single camera frame and maps it to the coordinates of a set of objects.

**1. Domain Randomization**

![Screenshot 2024-11-04 at 1.14.58 PM.png](../../../images/Screenshot_2024-11-04_at_1.14.58_PM.png)

> The purpose of domain randomization is to provide enough simulated variability at training time such that at test time the model is able to generalize to real-world data.

They randomize the number of distractor objects, position and texture of all objects and environment, position of camera, lighting, noise, etc.

Each randomized scene is rendered in MuJoCo.

The camera is randomly placed within a box around the approximate position of the robot's real camera. This lets them avoid precise camera calibration.
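The randomization described above can be sketched as a function that samples one scene configuration per training image. This is only an illustrative sketch: the field names, the value ranges, the nominal camera position, and the `sample_scene` helper are all assumptions made for the example, not values or code from the paper, and nothing here calls the MuJoCo API.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneConfig:
    """One randomized scene. All fields and ranges are hypothetical."""
    num_distractors: int
    object_positions: List[Tuple[float, float]]  # (x, y) on the table, target first
    texture_ids: List[int]                       # one random texture per object
    camera_position: Tuple[float, float, float]
    light_intensity: float
    noise_std: float


def sample_scene(
    rng: random.Random,
    nominal_camera: Tuple[float, float, float] = (0.0, -1.0, 1.0),  # assumed
    camera_box_half_width: float = 0.1,                             # assumed
    max_distractors: int = 10,                                      # assumed
    num_textures: int = 1000,                                       # assumed
) -> SceneConfig:
    # Random number of distractors, plus the single target object.
    num_distractors = rng.randint(0, max_distractors)
    num_objects = num_distractors + 1

    # Random position and texture for every object.
    object_positions = [
        (rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
        for _ in range(num_objects)
    ]
    texture_ids = [rng.randrange(num_textures) for _ in range(num_objects)]

    # Camera jittered uniformly inside a box around its approximate real
    # position, so the real camera never needs precise calibration.
    camera_position = tuple(
        c + rng.uniform(-camera_box_half_width, camera_box_half_width)
        for c in nominal_camera
    )

    return SceneConfig(
        num_distractors=num_distractors,
        object_positions=object_positions,
        texture_ids=texture_ids,
        camera_position=camera_position,
        light_intensity=rng.uniform(0.2, 1.0),
        noise_std=rng.uniform(0.0, 0.05),
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_scene(rng))
```

Each sampled configuration would then be handed to the renderer to produce one training image, with the target object's position serving as the label.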

**2. Model Architecture and Training**

They use a convolutional neural network with the VGG-16 architecture, initialized with weights pre-trained on ImageNet so that the learned features transfer to this task.

### Experiments

**1. Experimental Setup**

They feed their model an image containing one of 8 target objects (each learned from a 3D mesh of the object in simulation), along with many distractors. The model has to localize the target object by predicting its Cartesian coordinates in 3D to within an allowable error threshold.
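The success criterion can be written down as a small check. This sketch assumes Euclidean distance as the error measure; the function names and the threshold value are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def localization_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Euclidean distance between predicted and true 3D positions."""
    return math.dist(predicted, actual)


def within_threshold(
    predicted: Sequence[float],
    actual: Sequence[float],
    threshold: float,
) -> bool:
    """True if the prediction is close enough to count as a success."""
    return localization_error(predicted, actual) <= threshold


if __name__ == "__main__":
    predicted = (0.30, 0.10, 0.02)  # hypothetical model output
    actual = (0.31, 0.12, 0.02)     # hypothetical ground truth
    threshold = 0.05                # hypothetical tolerance, same units
    print(localization_error(predicted, actual))
    print(within_threshold(predicted, actual, threshold))
```

Averaging `localization_error` over a set of real test images gives a single accuracy figure for the detector.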

**2. Localization Accuracy**

![Screenshot 2024-11-04 at 1.17.50 PM.png](../../../images/Screenshot_2024-11-04_at_1.17.50_PM.png)

> Even with over-fitting, the accuracy is comparable at a similar distance to the translation error in traditional techniques for pose estimation in clutter from a single monocular camera frame [5] that use higher-resolution images.

**3. Ablation Study**

![Screenshot 2024-11-04 at 1.20.05 PM.png](../../../images/Screenshot_2024-11-04_at_1.20.05_PM.png)

> Our hypothesis that pre-training would be essential to generalizing to the real world proved to be false.

Pre-training on ImageNet turned out to be unnecessary: given enough training samples, models trained entirely in simulation from randomly initialized weights still performed well.

![Screenshot 2024-11-04 at 1.21.46 PM.png](../../../images/Screenshot_2024-11-04_at_1.21.46_PM.png)

> For our experiments, using a large number of random textures (in addition to random distractors and object positions) is necessary to achieving transfer.

**4. Robotics Experiments**

> We evaluated the use of our object detection networks for localizing an object in clutter and performing a prescribed grasp.

To demonstrate the utility of this sim2real transfer for robotics, they used the object localization model for grasping.

![Screenshot 2024-11-04 at 1.24.52 PM.png](../../../images/Screenshot_2024-11-04_at_1.24.52_PM.png)

> We deployed the pipeline on a Fetch robot [49], and found it was able to successfully detect and pick up the target object in 38 out of 40 trials, including in highly cluttered scenes with significant occlusion of the target object.

The object detection model trained purely in simulation worked successfully for grasping.

### Conclusion

> We demonstrated that an object detector trained only in simulation can achieve high enough accuracy in the real world to perform grasping in clutter.
