# Notes

> Modern-day robots are typically designed for specific tasks in constrained settings and are largely unable to utilize complex end-effectors.

> The Shadow Dexterous Hand [58] is an example of a robotic hand designed for human-level dexterity.

> The hand has been commercially available since 2005; however it still has not seen widespread adoption, which can be attributed to the daunting difficulty of controlling systems of such complexity.

> The state-of-the-art in controlling five-fingered hands is severely limited.

Dexterous manipulation is hard enough that, roughly 20 years after this hand became commercially available, capable robotic hands still haven’t seen widespread adoption.

> Some prior methods have shown promising in-hand manipulation results
> in simulation but do not attempt to transfer to a real world robot.

There have been good dexterous manipulation results in simulation, but those works did not attempt to transfer them to a real robot.

> The resulting policy exhibits unprecedented levels of dexterity and
> naturally discovers grasp types found in humans, such as the tripod, prismatic, and tip pinch grasps, and displays contact-rich, dynamic behaviors such as finger gaiting, multi-finger coordination, the controlled use of gravity, and coordinated application of translational and torsional forces to the object.

The method in this paper achieves unprecedented dexterous manipulation abilities on a physical hand using sim2real transfer.

![Screenshot 2024-11-04 at 2.49.42 PM.png](../../../images/Screenshot_2024-11-04_at_2.49.42_PM.png)

The success in their paper can be attributed to:

1. Extensive randomization used in the simulated environment
2. Control policies with memory to infer environmental dynamics
3. Large-scale distributed reinforcement learning

### Task and System Overview

The object under consideration is placed into the palm of a humanoid robot hand. The hand then has to reorient the object to a desired target configuration.

Once a configuration is achieved, the robot gets a new goal. The process repeats until the robot drops the object.
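A minimal sketch of this task loop, using a toy 1-D stand-in for the real problem. `ToyReorientEnv`, `greedy_policy`, the 0.4 rad success threshold, and the step cap are all assumptions made for illustration, not the paper's environment.

```python
import random

SUCCESS_THRESHOLD = 0.4  # radians; assumed tolerance for "goal reached"


class ToyReorientEnv:
    """Hypothetical 1-D stand-in: the object's orientation is a single angle."""

    def __init__(self, rng, drop_prob=0.0):
        self.rng = rng
        self.drop_prob = drop_prob
        self.angle = 0.0
        self.goal = self.sample_goal()

    def sample_goal(self):
        return self.rng.uniform(-3.0, 3.0)

    def step(self, delta):
        """Rotate the object by delta and report whether it was dropped."""
        self.angle += delta
        return self.rng.random() < self.drop_prob


def greedy_policy(angle, goal):
    # Move at most 0.2 rad per step toward the goal.
    return max(-0.2, min(0.2, goal - angle))


def count_successes(env, policy, max_steps=500):
    """Run one episode: new goal after every success, stop on a drop."""
    successes = 0
    for _ in range(max_steps):
        dropped = env.step(policy(env.angle, env.goal))
        if dropped:
            break
        if abs(env.goal - env.angle) < SUCCESS_THRESHOLD:
            successes += 1
            env.goal = env.sample_goal()
    return successes


no_drop = count_successes(ToyReorientEnv(random.Random(0)), greedy_policy)
always_drop = count_successes(
    ToyReorientEnv(random.Random(0), drop_prob=1.0), greedy_policy
)
```

The number of consecutive goals reached before a drop is the natural score for an episode like this.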

**1. Hardware**

The robot hand has joint sensing and 24 DoF. The setup uses both PhaseSpace markers and 3 RGB cameras (the cameras better match a real-world scenario).

**2. Simulation**

They use MuJoCo with a simulated version of the robotic hand. The simulation has a reality gap.

### Transferable Simulations

> We cannot train on the physical robot because deep reinforcement learning algorithms require millions of samples; conversely, training only in simulation results in policies that do not transfer well due to the gap between the simulated and real environments.

This is the dilemma of using deep reinforcement learning for robotics training, even though it is the method they rely on.

To solve this, they adjust their environment to a **distribution over many simulations** to foster transfer, using the principles from domain randomization and dynamics randomization.

**1. Observations**

They omit usage of sensor values that would be inaccurate in simulation compared with reality, like the fingertip tactile sensors in the hand that depend on too many confounding variables.

**2. Randomizations**

They use domain randomization to randomize most aspects of the simulated environment so the policy generalizes.

They randomize observation noise and physics parameters.

They use a simple model of motor backlash, and also introduce action delays and action noise, to model imperfect actuation.
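A small sketch of what action delay and action noise could look like as a wrapper around the commanded joint targets. It does not model backlash itself. `ImperfectActuation`, `delay_prob`, and `noise_std` are hypothetical names and values, not taken from the paper.

```python
import random


class ImperfectActuation:
    """Hypothetical wrapper that delays and perturbs commanded joint targets."""

    def __init__(self, rng, delay_prob=0.1, noise_std=0.01):
        self.rng = rng
        self.delay_prob = delay_prob  # chance the previous command is still in effect
        self.noise_std = noise_std    # std of Gaussian noise added per joint
        self.prev_action = None

    def apply(self, action):
        """Return the action the simulated motors actually execute."""
        delayed = (
            self.prev_action is not None
            and self.rng.random() < self.delay_prob
        )
        effective = self.prev_action if delayed else action
        self.prev_action = action
        return [a + self.rng.gauss(0.0, self.noise_std) for a in effective]


# With no delay and no noise the wrapper is a pass-through.
clean = ImperfectActuation(random.Random(0), delay_prob=0.0, noise_std=0.0)
passthrough = clean.apply([0.1, -0.2])

# With delay_prob=1.0 the second command is replaced by the first one.
laggy = ImperfectActuation(random.Random(0), delay_prob=1.0, noise_std=0.0)
first = laggy.apply([0.1, -0.2])
second = laggy.apply([0.3, 0.4])
```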

They try to explicitly model all imperfections in reality in the simulator, and then account for other un-modeled dynamics by applying small random forces on the object in simulation.

They also randomize the visual properties of the simulator scene.

This extensive randomization can be thought of as adding noise along every dimension except the ones that carry signal. The policy learns to rely on the necessary signal and to treat the noise in reality as just another noise source that it’s already used to.
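To make the "distribution over many simulations" idea concrete, here is a rough sketch that draws a fresh set of simulator parameters for each episode. The parameter names and ranges in `SimParams` and `sample_sim_params` are assumptions chosen for illustration, not the values used in the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    object_mass_scale: float    # multiplier on the nominal object mass
    friction_scale: float       # multiplier on the nominal friction
    actuator_gain_scale: float  # multiplier on the nominal actuator gain
    obs_noise_std: float        # std of Gaussian observation noise
    perturb_force_std: float    # std of random forces applied to the object


def sample_sim_params(rng):
    """Draw one simulator configuration; meant to be called once per episode."""
    return SimParams(
        object_mass_scale=rng.uniform(0.5, 1.5),
        friction_scale=rng.uniform(0.7, 1.3),
        actuator_gain_scale=rng.uniform(0.75, 1.5),
        obs_noise_std=rng.uniform(0.0, 0.02),
        perturb_force_std=rng.uniform(0.0, 1.0),
    )


def noisy_observation(true_obs, params, rng):
    """Corrupt a clean simulator reading with this episode's noise level."""
    return [x + rng.gauss(0.0, params.obs_noise_std) for x in true_obs]


rng = random.Random(0)
episode_params = [sample_sim_params(rng) for _ in range(3)]
```

Because every episode sees different physics and noise, the policy cannot overfit to any single simulator configuration.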

### Learning Control Policies from State

**1. Policy Architecture**

They use the same policy learning architecture as the dynamics randomization paper where they use memory to infer the dynamics of the environment, using LSTM cells for memory.
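A toy LSTM cell in plain Python shows why memory matters here: the hidden state after the same final observation differs depending on what came before, so the policy can carry information about the dynamics it has seen so far. `TinyLSTMCell` and its sizes are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Minimal LSTM cell with random weights, for illustration only."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n_in = input_size + hidden_size
        # One weight matrix and bias per gate:
        # input (i), forget (f), candidate (g), output (o).
        self.W = {
            g: [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                for _ in range(hidden_size)]
            for g in "ifgo"
        }
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def _gate(self, name, z, activation):
        return [
            activation(sum(w * v for w, v in zip(row, z)) + bias)
            for row, bias in zip(self.W[name], self.b[name])
        ]

    def step(self, x, h, c):
        z = list(x) + list(h)
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        g = self._gate("g", z, math.tanh)
        o = self._gate("o", z, sigmoid)
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run(cell, sequence):
    """Feed a sequence of observations and return the final hidden state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in sequence:
        h, c = cell.step(x, h, c)
    return h


cell = TinyLSTMCell(input_size=2, hidden_size=4, rng=random.Random(0))
h_a = run(cell, [[1.0, 0.0], [0.5, 0.5]])
h_b = run(cell, [[-1.0, 0.0], [0.5, 0.5]])  # same last input, different history
```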

They use Proximal Policy Optimization (PPO) for learning with a policy network and a value network, and use Asymmetric Actor-Critic.
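The asymmetric part can be sketched as two different views of the same simulator state: the policy only sees noisy signals that would exist on the real robot, while the value network can also see privileged simulator state during training. The field names and dimensions below are assumptions for illustration.

```python
import random


def policy_observation(state, rng, noise_std=0.01):
    """Noisy view limited to signals assumed available on the real robot."""
    def noisy(values):
        return [v + rng.gauss(0.0, noise_std) for v in values]

    return (
        noisy(state["fingertip_positions"])
        + noisy(state["object_pose"])
        + list(state["goal_orientation"])
    )


def value_observation(state):
    """Privileged view: exact simulator state, usable only during training."""
    return (
        list(state["fingertip_positions"])
        + list(state["object_pose"])
        + list(state["goal_orientation"])
        + list(state["joint_angles"])
        + list(state["object_velocity"])
    )


# Hypothetical state layout; dimensions are illustrative.
state = {
    "fingertip_positions": [0.0] * 15,
    "object_pose": [0.0] * 7,
    "goal_orientation": [1.0, 0.0, 0.0, 0.0],
    "joint_angles": [0.0] * 24,
    "object_velocity": [0.0] * 6,
}
policy_obs = policy_observation(state, random.Random(0))
value_obs = value_observation(state)
```

Only the policy is deployed, so the value network's extra inputs never need to be measured on hardware.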

![Screenshot 2024-11-04 at 3.10.59 PM.png](../../../images/Screenshot_2024-11-04_at_3.10.59_PM.png)

**2. Actions and Rewards**

Policy actions correspond to the desired joint angles.

They use PPO with discrete action spaces because they notice it performs better empirically.
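A discrete action space over joint angles could be decoded like this: each joint gets a bin index, and the index maps to an evenly spaced target angle within that joint's limits. The bin count of 11, the even spacing, and the helper names are assumptions for illustration.

```python
NUM_BINS = 11  # assumed number of bins per joint


def bin_to_angle(bin_index, low, high, num_bins=NUM_BINS):
    """Map a bin index in [0, num_bins) to an evenly spaced joint angle."""
    if not 0 <= bin_index < num_bins:
        raise ValueError(f"bin index {bin_index} out of range")
    return low + (high - low) * bin_index / (num_bins - 1)


def decode_action(bin_indices, joint_limits, num_bins=NUM_BINS):
    """Turn one bin index per joint into a list of desired joint angles."""
    return [
        bin_to_angle(b, low, high, num_bins)
        for b, (low, high) in zip(bin_indices, joint_limits)
    ]


# Two hypothetical joints with different limits (radians).
limits = [(-1.0, 1.0), (0.0, 1.5)]
angles = decode_action([0, 10], limits)
```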

The reward at a timestep $t$ is given by $r_t = d_t - d_{t+1}$, where $d_t$ and $d_{t+1}$ are the rotation angles between the object’s orientation and the goal orientation before and after the action. The robot is additionally rewarded 5 when the target orientation is reached and -20 when the object is dropped.
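A sketch of this reward, assuming orientations are unit quaternions in `(w, x, y, z)` order and that the bonus and penalty are added on top of the angle-difference term. The function names are hypothetical.

```python
import math

SUCCESS_BONUS = 5.0
DROP_PENALTY = -20.0


def rotation_angle(q_a, q_b):
    """Angle in radians of the rotation taking unit quaternion q_a to q_b."""
    dot = abs(sum(a * b for a, b in zip(q_a, q_b)))
    return 2.0 * math.acos(min(1.0, dot))


def step_reward(q_before, q_after, q_goal, reached=False, dropped=False):
    """r_t = d_t - d_{t+1}, plus the success bonus or drop penalty."""
    d_t = rotation_angle(q_before, q_goal)
    d_next = rotation_angle(q_after, q_goal)
    reward = d_t - d_next
    if reached:
        reward += SUCCESS_BONUS
    if dropped:
        reward += DROP_PENALTY
    return reward


identity = (1.0, 0.0, 0.0, 0.0)
# 90 degree rotation about the z axis.
quarter_turn = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

progress = step_reward(identity, quarter_turn, quarter_turn, reached=True)
dropped = step_reward(identity, identity, quarter_turn, dropped=True)
```

Moving closer to the goal gives a positive reward, moving away gives a negative one, and the large drop penalty discourages risky manipulation.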

**3. Distributed Training with Rapid**

> We use the same distributed implementation of PPO that was used to train OpenAI Five without any modifications.

They train the hand with the same distributed PPO implementation used for the OpenAI Dota 2 player.

> Overall, we found that PPO scales up easily and requires little hyper-parameter tuning.

PPO is very practical for scaling up.

![Screenshot 2024-11-04 at 3.16.58 PM.png](../../../images/Screenshot_2024-11-04_at_3.16.58_PM.png)

> In our implementation, a pool of 384 worker machines, each with 16 CPU cores, generate experience by rolling out the current version of the policy in a sample from distribution of randomized simulations.

> This setup allows us to generate about 2 years of simulated experience per hour.

Such a cool way of quantifying the amount of learning happening!
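A quick back-of-the-envelope check of what "2 years per hour" means. The quoted figures are the only inputs. The per-core number assumes every CPU core is used for rollouts, which is an assumption on my part.

```python
WORKER_MACHINES = 384
CORES_PER_MACHINE = 16
SIM_YEARS_PER_WALL_HOUR = 2   # "about 2 years", so treat results as approximate
HOURS_PER_YEAR = 365 * 24

total_cores = WORKER_MACHINES * CORES_PER_MACHINE

# Hours of simulated experience generated per hour of wall-clock time.
speedup_vs_real_time = SIM_YEARS_PER_WALL_HOUR * HOURS_PER_YEAR

# Average speedup per core, assuming all cores run simulation.
per_core_speedup = speedup_vs_real_time / total_cores

print(total_cores)            # 6144
print(speedup_vs_real_time)   # 17520
print(round(per_core_speedup, 2))  # 2.85
```

So the cluster as a whole runs roughly 17,500 times faster than a single physical robot could collect experience.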

### State Estimation from Vision

> The policy that we describe in the previous section takes the object’s position as input and requires a motion capture system for tracking the object on the physical robot.

> In this work, we therefore wish to infer the object’s pose from vision alone.

To match the real world environment, they want to be able to operate the full system with just vision rather than using an object tracking system which couldn’t be used outside the lab.

They train a network with convolutional layers to use the 3 cameras to predict the position and rotation of the object.

### Results

**1. Qualitative Results**

> During deployment on the robot as well as in simulation, we notice that our policies naturally exhibit many of the grasps found in humans.

> Furthermore, the policy also naturally discovers many strategies for dexterous in-hand manipulation described by the robotics community such as finger pivoting, finger gaiting, multi-finger coordination, the controlled use of gravity, and coordinated application of translational and torsional forces to the object.

So cool to see. The robot is rediscovering real world grasping behaviors using reinforcement learning.

Usually we’ve seen reinforcement learning infer good moves from the principles of a game, or in a simulated environment.

Very cool to see this working in real life, given sufficient scale and a sim2real transfer approach that works for this task.

They also found that locking the wrist-pitch joint resulted in more intentional manipulation of the joints.

**2. Quantitative Results**

![Screenshot 2024-11-04 at 3.26.02 PM.png](../../../images/Screenshot_2024-11-04_at_3.26.02_PM.png)

> When using vision for pose estimation, we achieve slightly worse results both in simulation and on the real robot.

> In general, we found that problems with hardware breakage were
> one of the key challenges we had to overcome in this work.

**3. Ablation of Randomization**

![Screenshot 2024-11-04 at 3.28.56 PM.png](../../../images/Screenshot_2024-11-04_at_3.28.56_PM.png)

Adding randomizations does come at a cost. Policies trained with more randomizations require more compute and take longer to train in simulation.

![Screenshot 2024-11-04 at 3.29.52 PM.png](../../../images/Screenshot_2024-11-04_at_3.29.52_PM.png)

Each type of randomization contributes its own share of quality to the final model by forcing it to pick out the signal across different environments, which creates a tradeoff between training time and model quality.

**4. Effect of Memory in Policies**

![Screenshot 2024-11-04 at 3.31.14 PM.png](../../../images/Screenshot_2024-11-04_at_3.31.14_PM.png)

Having memory to infer environmental conditions is important for performance.

![Screenshot 2024-11-04 at 3.32.10 PM.png](../../../images/Screenshot_2024-11-04_at_3.32.10_PM.png)

**5. Sample Complexity & Scale**

![Screenshot 2024-11-04 at 3.33.44 PM.png](../../../images/Screenshot_2024-11-04_at_3.33.44_PM.png)

More years of experience and more GPUs improve model quality.

### Conclusion

> In this work, we demonstrate that in-hand manipulation skills learned with RL in a simulator can achieve an unprecedented level of dexterity on a physical five-fingered hand.

> This is possible due to extensive randomizations of the simulator, large-scale distributed training infrastructure, policies with memory, and a choice of sensing modalities which can be modeled in the simulator.
