# Notes

> Existing systems for fine manipulation use expensive robots
> and high-end sensors for precise state estimation. In this work, we seek to develop a low-cost system for fine manipulation that is, in contrast, accessible and reproducible.

Existing dexterous manipulation systems are expensive. ALOHA and ACT aim to provide a simpler, more cost-effective alternative.

> However, low-cost hardware is inevitably less precise than high-end platforms, making the sensing and planning challenge more pronounced. One promising direction to resolve this is to incorporate learning into the system.

Cheaper hardware is less precise in sensing and actuation, so the system has to compensate for that. Learning is the proposed way to do it.

> Humans also do not have industrial-grade proprioception, and yet we are able to perform delicate tasks by learning from closed-loop visual feedback and actively compensating for errors.

Humans don’t use LiDAR or other expensive sensing setups, which means that the manipulation problem is solvable without them.

They build a teleoperation setup from low-cost off-the-shelf robot arms and 3D printed components, so the whole system costs <$20k.

> Small errors in the predicted action can incur large differences in the state, exacerbating the “compounding error” problem of imitation learning.

If an imitation learning policy predicts one action at a time, a small error early in the sequence pushes the robot into states it has not seen, and the errors compound over the rest of the sequence.

> To tackle this, we take inspiration from action chunking, a concept in psychology that describes how sequences of actions are grouped together as a chunk, and executed as one unit.

They make a policy that predicts the action sequence for the next $k$ time steps instead of just 1 timestep, which reduces compounding errors.

They use a transformer trained as a conditional VAE (CVAE) to implement action chunking. This is called an **action chunking transformer** (ACT).

> The key contribution of this paper is a low-cost system for learning fine manipulation, comprising a teleoperation system and a novel imitation learning algorithm

> The synergy between these two parts allows learning of 6 fine manipulation skills directly in the real-world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, from only 10 minutes or 50 demonstration trajectories.

ACT allows very fast and efficient fine manipulation training.

### Related Work

Imitation learning allows robots to learn from experts. It’s commonly done with **behavior cloning**, where imitation learning is treated as a supervised learning problem.

Behavior cloning suffers from compounding errors where errors from previous time steps build up, especially in fine manipulation.

Previous solutions all have drawbacks: DAgger needs expensive expert annotation, noise injection reduces execution quality, and synthetic correction data has relied on low-dimensional visual data.

> We propose to reduce the effective horizon of tasks through action chunking, i.e., predicting an action sequence instead of a single action, and then ensemble across overlapping action chunks to produce trajectories that are both accurate and smooth.

Earlier bimanual manipulation work relied on classical control with known environment dynamics. Later work moved to learning-based approaches such as reinforcement learning and imitation learning.

The ALOHA teleoperation setup uses joint-space mapping between the leader and follower robots, and is built from 3D printed parts that can be assembled in 2 hours.

### ALOHA

![Screenshot 2024-11-01 at 3.45.20 PM.png](../../images/Screenshot_2024-11-01_at_3.45.20_PM.png)

> A Low-cost Open-source Hardware System for Bimanual Teleoperation

The system should be [1] low-cost, [2] versatile, [3] user-friendly, [4] repairable, [5] easy-to-build.

They use two ViperX 6-DoF robot arms, which cost $5600 each, as the follower arms, and replace the fingers with ones better suited to fine manipulation.

For the controllers, they found that joint-space mapping works better than task-space mapping (a physical control device is better than VR controllers), so they use a pair of WidowX arms, which cost $3300 each, as the “leader” arms.

They also use 4 Logitech C922 webcams with 480 x 640 RGB image resolution streaming.

ALOHA is good at precise tasks (like threading zip ties), contact-rich tasks (like inserting RAM into a motherboard/turning book pages), and dynamic tasks (like juggling a ping pong ball with a paddle).

> Skills such as threading a zip tie, inserting RAM, and juggling ping pong ball, to our knowledge, are not available for existing teleoperation systems with 5-10x the budget.

According to the authors, ALOHA can perform tasks that teleoperation systems costing 5-10x more cannot.

### Action Chunking with Transformers

> Existing imitation learning algorithms perform poorly on fine-grained tasks that require high-frequency control and closed-loop feedback.

To train ACT, they use human demonstrations collected with ALOHA.

They use the joint positions of the _leader_ as the actions.

A low-level PID controller drives the follower arm to track the leader's joint positions.

They use the joint positions of the _follower_ and the image feed from the 4 cameras as the observations.

Then, they train ACT to predict the sequence of future actions given the observations.
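
The observation/action pairing described above can be sketched as a small data layout. This is only an illustration: the class names, the `make_training_pair` helper, and the choice to pad short chunks by repeating the last action are assumptions, not details taken from the paper. The 14-dimensional action matches the $k \times 14$ output mentioned later in these notes.

```python
from dataclasses import dataclass
from typing import List

NUM_JOINTS = 14   # matches the k x 14 policy output in these notes
NUM_CAMERAS = 4   # the four webcams


@dataclass
class Observation:
    follower_qpos: List[float]   # joint positions of the follower arms
    images: List[object]         # one frame per camera (placeholder type)


@dataclass
class Step:
    obs: Observation
    action: List[float]          # joint positions of the leader arms


def make_training_pair(episode: List[Step], t: int, k: int):
    """Return (observation at t, next k actions) for a valid index t.

    Assumption: chunks that run past the end of the episode are padded
    by repeating the last recorded action.
    """
    actions = [step.action for step in episode[t:t + k]]
    while len(actions) < k:
        actions.append(actions[-1])
    return episode[t].obs, actions
```

Each training example therefore pairs one observation with a whole chunk of future leader positions rather than a single action.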

**1. Action Chunking and Temporal Ensemble**

> We are inspired by action chunking, a neuroscience concept where individual actions are grouped together and executed as one unit, making them more efficient to store and execute.

The model policy predicts the next $k$ time steps of actions, effectively predicting a chunk of $k$ actions, which results in a $k$-fold reduction in the effective horizon of the task.

The policy models $\pi_\theta(a_{t:t+k}|s_t)$ instead of $\pi_\theta(a_t|s_t)$.
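
A toy rollout makes the $k$-fold reduction concrete. The sketch below executes each chunk open-loop, i.e. without the temporal ensemble; the function name, the toy policy, and the values `horizon=500` and `k=100` are hypothetical and chosen only for illustration.

```python
def rollout_open_loop(policy, initial_state, step_fn, horizon, k):
    """Query the policy once per chunk and execute its actions before re-observing."""
    state, t, queries = initial_state, 0, 0
    while t < horizon:
        chunk = policy(state, k)           # k actions predicted from the current state
        queries += 1
        for action in chunk[: horizon - t]:  # do not run past the horizon
            state = step_fn(state, action)
            t += 1
    return state, queries


# Toy environment: the state is a number and each action is added to it.
toy_policy = lambda state, k: [1.0] * k
toy_step = lambda state, action: state + action

_, chunked_queries = rollout_open_loop(toy_policy, 0.0, toy_step, horizon=500, k=100)
_, single_queries = rollout_open_loop(toy_policy, 0.0, toy_step, horizon=500, k=1)
```

With these toy numbers the chunked policy makes 5 decisions where the single-step policy makes 500, so there are far fewer points at which an error can feed into the next prediction.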

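A single-step policy would also struggle with temporally correlated behavior such as pauses in a demonstration, because the right action then depends on more than the current state. A pause that falls inside a single action chunk is not an issue.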

> To improve smoothness and avoid discrete switching between executing and observing, we query the policy at every time step.

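Instead of running inference once every $k$ time steps and executing the whole chunk, which would be clunky, they run inference at every time step. Each time step is then covered by several overlapping chunks, and the executed action is a **temporal ensemble**: a weighted average of all the predictions for that time step, with weights $w_i = \exp(-m \cdot i)$, where $w_0$ is the weight of the oldest prediction.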

All the predictions being averaged target the same time step. They just come from chunks that were predicted at different earlier time steps.
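
A minimal sketch of the weighted average, in plain Python. The function name and the value of `m` are hypothetical, and the convention that index 0 is the oldest prediction is stated here as an assumption.

```python
import math


def temporal_ensemble(predictions, m=0.1):
    """Exponentially weighted average of action vectors for one time step.

    predictions: list of action vectors, ordered oldest first (assumption),
    so prediction i gets weight exp(-m * i).
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dim)
    ]


# Three overlapping chunks each proposed an action for the current time step.
blended = temporal_ensemble([[0.10, 0.50], [0.12, 0.48], [0.11, 0.52]], m=0.1)
```

With `m = 0` this reduces to a plain mean. Under this ordering, a larger `m` puts more weight on older predictions, so new observations are incorporated more slowly.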

**2. Modeling Human Data**

> Another challenge that arises is learning from noisy human demonstrations. Given the same observation, a human can use different trajectories to solve the task. Humans will also be more stochastic in regions where precision matters less.

Human demonstrations can have high variance in how demonstrators execute tasks. The model has to learn to be precise when it matters, but to learn a general distribution of approaches where more freedom is permissible. This structure suits a VAE, where the model learns distributions that capture both the signal and the noise in different action sequences.

> Thus, it is important for the policy to focus on regions where high precision matters.

The policy uses a conditional variational autoencoder (CVAE) to generate an action sequence based on observations.

The encoder is only used during training. It predicts the mean and variance of the latent variable $z$’s distribution from the current action sequence and observations (for simplicity, it uses only proprioceptive data and no images).

The decoder uses both $z$ and current observations to predict the action sequence.
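
The encoder/decoder split corresponds to a standard CVAE training objective, sketched below in plain Python. The L1 reconstruction term and the `beta` weight on the KL term are assumptions for this sketch, since these notes do not record the exact loss; all function names are hypothetical.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (the reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over the dimensions of z."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def l1_loss(pred, target):
    """Mean absolute error over every value in a k x action_dim chunk."""
    diffs = [abs(p - t) for p_row, t_row in zip(pred, target) for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


def cvae_loss(pred_chunk, target_chunk, mu, logvar, beta=1.0):
    """Reconstruction error plus a weighted KL penalty on the encoder's distribution."""
    return l1_loss(pred_chunk, target_chunk) + beta * kl_to_standard_normal(mu, logvar)


rng = random.Random(0)
z = reparameterize(mu=[0.0, 0.0], logvar=[0.0, 0.0], rng=rng)  # sample used by the decoder
```

Because the encoder is dropped after training, $z$ has to come from somewhere else at test time. One natural choice, assumed here, is the prior mean (all zeros).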

**3. Implementing ACT**

> We implement the CVAE encoder and decoder with transformers, as transformers are designed for both synthesizing information across a sequence and generating new sequences.

The CVAE encoder takes in the $k$ next target actions from the demonstration dataset with the [CLS] token and generates the $z$ style variable.

The decoder then takes the $z$ style variable and the current observations and predicts the next $k$ actions.

The CVAE decoder uses ResNet image encoders and a transformer encoder to synthesize information from the different camera viewpoints, the joint positions, and the style variable. A transformer decoder then generates a coherent action sequence.

The transformer decoder output has dimension $k \times 512$, which is then projected down to $k \times 14$: one 14-dimensional vector of predicted joint positions per action time step.
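
The final projection is just a linear layer applied independently to each of the $k$ rows. The sketch below checks the shapes in plain Python; the random weights, the small `k`, and the `linear_project` helper are placeholders, not the trained model.

```python
import random

HIDDEN_DIM = 512   # transformer output width from these notes
ACTION_DIM = 14    # predicted joint positions per time step


def linear_project(features, weight, bias):
    """features: k x in_dim, weight: out_dim x in_dim, bias: out_dim. Returns k x out_dim."""
    return [
        [sum(f * w for f, w in zip(row, w_row)) + b for w_row, b in zip(weight, bias)]
        for row in features
    ]


rng = random.Random(0)
k = 4  # small chunk size, only to keep the example fast
decoder_output = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)] for _ in range(k)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)] for _ in range(ACTION_DIM)]
bias = [0.0] * ACTION_DIM

actions = linear_project(decoder_output, weight, bias)  # k rows of 14 values
```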

![Screenshot 2024-11-01 at 6.21.54 PM.png](../../images/Screenshot_2024-11-01_at_6.21.54_PM.png)

### Experiments

They evaluate on 6 real-world tasks and 2 simulated fine manipulation tasks built in MuJoCo.

![Screenshot 2024-11-01 at 4.26.41 PM.png](../../images/Screenshot_2024-11-01_at_4.26.41_PM.png)

They collect 50 episodes for each of the tasks, where each episode is 8-14s which corresponds to 400-700 time steps given the 50Hz control frequency.
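
The arithmetic behind these numbers is easy to check. The helper names below are hypothetical; the constants come from the sentence above.

```python
CONTROL_HZ = 50          # control frequency
EPISODES_PER_TASK = 50   # demonstrations collected per task


def time_steps(seconds, hz=CONTROL_HZ):
    """Number of control steps in an episode of the given length."""
    return round(seconds * hz)


def total_demo_minutes(seconds_per_episode, episodes=EPISODES_PER_TASK):
    """Total demonstration time for one task, in minutes."""
    return seconds_per_episode * episodes / 60.0


shortest = time_steps(8)                 # 400 steps
longest = time_steps(14)                 # 700 steps
min_minutes = total_demo_minutes(8)      # about 6.7 minutes
max_minutes = total_demo_minutes(14)     # about 11.7 minutes
```

Fifty episodes of 8-14 s add up to roughly 7-12 minutes of data per task, which is consistent with the "around 10 min of demonstrations" figure quoted from the paper.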

All of the human demonstrations are inherently stochastic: no two episodes solve a task with exactly the same trajectory.

ACT performs far better than the baseline methods it is compared against on these tasks.

![Screenshot 2024-11-01 at 4.29.28 PM.png](../../images/Screenshot_2024-11-01_at_4.29.28_PM.png)

Compared with BeT and RT-1, which discretize the action space (the output is a categorical distribution over discrete bins), ACT directly predicts continuous actions, which matters for fine manipulation.
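
A small sketch of why binning can hurt precision: a discretized policy can only output bin centers, so every action carries an error of up to half a bin width. The joint range and bin count below are illustrative values, not the settings of BeT or RT-1, and the helper is hypothetical.

```python
def discretize(value, low, high, num_bins):
    """Map a continuous value to (bin index, bin center) using uniform bins."""
    width = (high - low) / num_bins
    index = min(int((value - low) / width), num_bins - 1)  # clamp the top edge
    center = low + (index + 0.5) * width
    return index, center


# Illustrative joint range of about +/- pi radians split into 256 bins.
LOW, HIGH, BINS = -3.14, 3.14, 256
worst_case_error = (HIGH - LOW) / BINS / 2   # half a bin width, in radians

index, center = discretize(0.3, LOW, HIGH, BINS)
error = abs(center - 0.3)                    # never exceeds worst_case_error
```

A continuous regression head has no such quantization floor, although it can of course still be wrong for other reasons.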

The model also performs better with action chunking and temporal ensembling.

![Screenshot 2024-11-01 at 4.45.01 PM.png](../../images/Screenshot_2024-11-01_at_4.45.01_PM.png)

They also find that higher control frequency leads to faster task completion.

### Conclusion

> We present a low-cost system for fine manipulation, comprising a teleoperation system ALOHA and a novel imitation learning algorithm ACT.

> The synergy between these two parts allows us to learn fine manipulation skills directly in the real world, such as opening a translucent condiment cup and slotting a battery with a 80-90% success rate and around 10 min of demonstrations.