# Explanation

One-shot imitation learning aims to build systems that can learn to perform a new task from only a few demonstrations. The authors propose an imitation learning algorithm, conceptually similar to MAML, that can quickly learn to perform new tasks. The degree of generalization demonstrated in the paper is debatable, though: the overall task stays much the same (block stacking) even in the "out of distribution tasks" they refer to.

For training, they create a set of tasks and train a meta-learning algorithm that can perform a task after seeing only a few demonstrations. The architecture has three parts: a demonstration network that creates an embedding from the demonstration data, a context network that extracts context from the demonstration and the current state, and an MLP manipulation network that actually performs the task.

# Notes

> We are interested in robotic systems that are able to perform a variety of complex useful tasks. The robot should be able to learn new tasks without long system interaction time.

We need to be able to communicate the task to the robot effectively, and the robot needs the dexterity to accomplish it.

> Demonstrations are an extremely convenient form of information we can use to teach robots to overcome these two challenges.

> Ideally, we hope to demonstrate a certain task only once or a few times to the robot, and have it instantly generalize to new situations of the same task, without long system interaction time or domain knowledge about individual tasks.

> The use of soft attention over both types of inputs made strong generalization possible.

### One Shot Imitation Learning

> For each task, the goal is to control a 7-DOF Fetch robotic arm to stack various numbers of cube-shaped blocks into a specific configuration specified by the user.

> Furthermore, in each episode the starting positions of the blocks may vary, which requires the learned policy to generalize even within the training tasks.

![Screenshot 2024-11-06 at 10.47.41 PM.png](../../../images/notes/Screenshot_2024-11-06_at_10.47.41_PM.png)

They use imitation learning with DAgger: they collect demonstrations for each task, then sample a set of tasks during training to perform meta-learning on.
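
The task-sampling step can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes that, for each sampled task, one demonstration conditions the policy while a second demonstration of the same task supplies the supervision. The function name `sample_training_pairs`, the task labels, and the `demo_*` placeholders are all hypothetical.

```python
import random

def sample_training_pairs(demos_by_task, num_tasks, rng):
    """Sample tasks, then pick two distinct demonstrations per task:
    one to condition on and one to supervise the policy (an assumption)."""
    tasks = rng.sample(sorted(demos_by_task), num_tasks)
    pairs = []
    for task in tasks:
        cond_demo, target_demo = rng.sample(demos_by_task[task], 2)
        pairs.append((task, cond_demo, target_demo))
    return pairs

# Hypothetical demonstration store: task label -> list of demonstrations.
demos_by_task = {
    "task_a": ["demo_0", "demo_1", "demo_2"],
    "task_b": ["demo_3", "demo_4"],
    "task_c": ["demo_5", "demo_6", "demo_7"],
}
pairs = sample_training_pairs(demos_by_task, num_tasks=2, rng=random.Random(0))
```

Each tuple in `pairs` would feed one training step: the policy sees `cond_demo` and is trained to reproduce the actions in `target_demo`.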

### Architecture

> Our proposed architecture consists of three modules: the demonstration network, the context network, and the manipulation network.

**1. Demonstration Network**

> The demonstration network receives a demonstration trajectory as input, and produces an embedding of the demonstration to be used by the policy.

Because training sequences are long, they randomly discard a subset of the time steps, a technique they call **temporal dropout**.
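
A minimal sketch of temporal dropout, assuming each time step is kept independently with some probability. The `keep_prob` value below is illustrative only, and the fallback that keeps one step when everything is dropped is an addition for this sketch, not something taken from the paper.

```python
import random

def temporal_dropout(trajectory, keep_prob, rng):
    """Keep each time step independently with probability keep_prob,
    preserving the order of the surviving steps."""
    kept = [step for step in trajectory if rng.random() < keep_prob]
    # Sketch-only guard: never return an empty demonstration.
    if not kept:
        kept = [rng.choice(trajectory)]
    return kept

trajectory = list(range(100))  # stand-in for 100 time steps of a demonstration
subsampled = temporal_dropout(trajectory, keep_prob=0.05, rng=random.Random(0))
```

The shortened trajectory keeps the original ordering, so downstream sequence modules still see a temporally consistent (if sparse) demonstration.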

> Since our neural network needs to handle demonstrations with variable numbers of blocks, it must have modules that can process variable-dimensional inputs.

We need a way to map variable-sized inputs to variable-sized outputs. To accomplish this, each block has its own attention head with its own query; the head attends over the relevant context from all vectors and has explicit knowledge of the block coordinates and input embedding.

The output of each attention head is:

$$
\textrm{output}_i \leftarrow \textrm{Linear}(\textrm{concat}(h_i^{in}, \textrm{result}_i, (x_i, y_i, z_i), s_\textrm{robot}))
$$
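
The update above can be sketched in plain Python. This is a hedged illustration: it uses dot-product soft attention as a stand-in for whatever scoring function the paper uses, omits biases, and uses random weights. The names `block_attention_outputs`, `w_q`, `w_k`, `w_v`, `w_out` and all dimensions are hypothetical. The point it shows is that sharing weights across blocks lets the same module handle any number of blocks.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector; bias omitted."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def block_attention_outputs(h_in, positions, s_robot, w_q, w_k, w_v, w_out):
    queries = [linear(w_q, h) for h in h_in]
    keys = [linear(w_k, h) for h in h_in]
    values = [linear(w_v, h) for h in h_in]
    outputs = []
    for i, query in enumerate(queries):
        # Block i's query attends over every block (dot-product scores assumed).
        scores = [sum(a * b for a, b in zip(query, key)) for key in keys]
        weights = softmax(scores)
        result = [sum(w * v[d] for w, v in zip(weights, values))
                  for d in range(len(values[0]))]
        # output_i = Linear(concat(h_i_in, result_i, (x_i, y_i, z_i), s_robot))
        concat = list(h_in[i]) + result + list(positions[i]) + list(s_robot)
        outputs.append(linear(w_out, concat))
    return outputs

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

embed_dim, out_dim, robot_dim = 4, 4, 3  # illustrative sizes
w_q, w_k, w_v = (rand_matrix(embed_dim, embed_dim) for _ in range(3))
w_out = rand_matrix(out_dim, 2 * embed_dim + 3 + robot_dim)

# The same weights process scenes with different numbers of blocks.
outputs_by_count = {}
for num_blocks in (2, 5):
    h_in = rand_matrix(num_blocks, embed_dim)
    positions = rand_matrix(num_blocks, 3)
    s_robot = [0.0] * robot_dim
    outputs_by_count[num_blocks] = block_attention_outputs(
        h_in, positions, s_robot, w_q, w_k, w_v, w_out)
```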

**2. Context Network**

> The context network is the crux of our model. It processes both the current state and the embedding produced by the demonstration network, and outputs a context embedding.

This network picks out the parts of the demonstration and the current scene that matter for the task.

> For the block stacking environment specifically, the robot should only need to pay attention to the position of the block it is trying to pick up (the source block), as well as the position of the block it is trying to place on top of (the target block).

> Therefore, a properly trained network can learn to match the current state with the corresponding stage in the demonstration, and infer the identities of the source and target blocks expressed as soft attention weights over different blocks, which are then used to extract the corresponding positions to be passed to the manipulation network.
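
To make the "soft attention weights over different blocks" idea concrete, here is a small sketch of how such weights could turn per-block scores into a single extracted position. The scores and positions are made-up numbers, and `soft_select_position` is a hypothetical helper rather than anything from the paper.

```python
import math

def soft_select_position(scores, positions):
    """Convert per-block scores into soft attention weights and return
    the attention-weighted average of the block positions."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    selected = [sum(w * p[d] for w, p in zip(weights, positions))
                for d in range(3)]
    return weights, selected

# Three blocks with made-up (x, y, z) positions.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
source_scores = [0.1, 8.0, 0.2]  # hypothetical: block 1 is the source
target_scores = [0.0, 0.1, 9.0]  # hypothetical: block 2 is the target

source_weights, source_pos = soft_select_position(source_scores, positions)
target_weights, target_pos = soft_select_position(target_scores, positions)
```

When one score dominates, the weights are close to one-hot and the extracted position is close to that block's position, while the whole operation stays differentiable.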

**3. Manipulation Network**

Once the source and target blocks are selected, the manipulation network is just a simple MLP that carries out the stacking.
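
A rough sketch of such an MLP, assuming its input is the concatenation of the source position, the target position, and some robot state. The layer sizes, the 4-dimensional robot state, the 7-dimensional action, and the names `manipulation_mlp` and `init_params` are all illustrative assumptions; the weights are random, not trained.

```python
import random

def dense(weights, bias, vec):
    """Affine layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def manipulation_mlp(source_pos, target_pos, robot_state, params):
    x = list(source_pos) + list(target_pos) + list(robot_state)
    hidden = [max(0.0, h) for h in dense(params["w1"], params["b1"], x)]  # ReLU
    return dense(params["w2"], params["b2"], hidden)

def init_params(in_dim, hidden_dim, out_dim, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {"w1": mat(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": mat(out_dim, hidden_dim), "b2": [0.0] * out_dim}

robot_state = [0.0, 0.0, 0.5, 1.0]  # hypothetical gripper state
action_dim = 7                      # illustrative action size
params = init_params(in_dim=3 + 3 + len(robot_state), hidden_dim=16,
                     out_dim=action_dim, rng=random.Random(0))
action = manipulation_mlp((0.2, 0.1, 0.0), (0.5, 0.4, 0.0), robot_state, params)
```

Because the input is only two positions plus the robot state, the network never needs to know how many blocks are in the scene.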

> This division of labor opens up the possibility of modular training: the manipulation network may be trained to complete this simple procedure.

This modularity is what allows such a simple control network to suffice for the task. It only really works because the task is contrived; end-to-end training is likely preferable in most cases.

### Experiments

![Screenshot 2024-11-06 at 11.10.31 PM.png](../../../images/notes/Screenshot_2024-11-06_at_11.10.31_PM.png)

> As the difficulty (number of stages) increases, however, conditioning on the entire demonstration starts to outperform conditioning on the final state.

> More surprisingly, conditioning on the entire demonstration also seems to outperform conditioning on the snapshot, which we originally expected to perform the best.

### Conclusions

> In this work, we presented a simple model that maps a single successful demonstration of a task to an effective policy that solves said task in a new situation.
