# Notes

> Designing the perception and control software for autonomous operation remains a major challenge, even for basic tasks.

> In this article, we aim to answer the following question: can we acquire more effective policies for sensorimotor control if the perception system is trained jointly with the control policy, rather than separately?

When perception and policy are trained separately, policy search has to run on top of hand-engineered features, which are often error-prone.

> Successful applications of deep neural networks typically rely on large amounts of data and direct supervision of the output, neither of which is available in robotic control.

> From the control perspective, a further complication is that observations from the robot’s sensors do not provide us with the full state of the system.

> We address these challenges by developing a guided policy search algorithm for sensorimotor deep learning, as well as a novel CNN architecture designed for robotic control.

Their CNN architecture has a spatial feature point transformation that improves spatial reasoning.

> This allows us to train our policies with relatively modest amounts of data and only tens of minutes of real-world interaction time.

### Related Work

> Applications of deep learning in robotic control have been less prevalent in recent years than in visual recognition.

> Pioneering early work on neural network control used small, simple networks, and has largely been supplanted by methods that use carefully designed policies that can be learned efficiently with reinforcement learning.

Early neural network controllers were small and simple, and were largely replaced by carefully hand-designed policy classes that can be learned efficiently with reinforcement learning.

> CNNs have also been trained to play video games
>
> However, such methods have only been demonstrated on synthetic domains that lack the visual complexity of the real world, and require an impractical number of samples for real-world robotic learning.

> Our method is sample efficient, requiring only minutes of interaction time. To the best of our knowledge, this is the first method that can train deep visuomotor policies for complex, high-dimensional manipulation skills with direct torque control.

Their method trains on only a few minutes of real-world interaction, whereas the deep reinforcement learning methods used for video games need far too many samples to be practical on a real robot.

> Learning visuomotor policies on a real robot requires handling complex observations and high dimensional policy representations. We tackle these challenges using guided policy search.

> In guided policy search, the policy is optimized using supervised learning, which scales gracefully with the dimensionality of the policy.

### Background

> The core component of our approach is a guided policy search algorithm that separates the problem of learning visuomotor policies into separate supervised learning and trajectory learning phases, each of which is easier than optimizing the policy directly.

**1. Problem Formulation**

The goal of policy search is to learn a policy $\pi_\theta(u_t|o_t)$ that selects actions $u_t$ from observations $o_t$. The policy implicitly learns about the dynamics of the environment through its observations. The objective is to minimize the expected task loss $\ell(x_t, u_t)$ accumulated along the trajectory, where $x_t$ is the full system state.
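Written out over a whole trajectory, and assuming the usual finite-horizon formulation (the horizon $T$ is notation introduced here, not taken from the notes above), the objective looks roughly like this:

$$
\min_\theta \; E_{\pi_\theta}\left[ \sum_{t=1}^{T} \ell(x_t, u_t) \right]
$$

Note that the loss is defined on the full state $x_t$, while the policy only conditions on $o_t$.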

**2. Approach Summary**

The system has two components. The first is a supervised learning algorithm that trains policies $\pi_\theta(u_t| o_t) = \mathcal{N}(\mu^\pi(o_t), \Sigma^\pi(o_t))$ with $\mu^\pi(o_t)$ as a deep CNN and $\Sigma^\pi(o_t)$ as an observation independent learned covariance.

The second is a trajectory centric RL algorithm that generates guiding distributions $p_i(u_t|x_t)$ for supervision.
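To make the Gaussian policy form above concrete, here is a minimal sketch of sampling an action from it. The names `sample_action`, `mean_fn`, and `std` are hypothetical: `mean_fn` stands in for the deep CNN $\mu^\pi(o_t)$, and the covariance is simplified to a diagonal one given as per-dimension standard deviations, which is an assumption of this sketch rather than something the notes specify.

```python
import random


def sample_action(mean_fn, std, observation, rng=random):
    """Draw u_t ~ N(mean_fn(o_t), diag(std^2)).

    mean_fn: callable mapping an observation to a list of action means
             (a stand-in for the CNN in the real policy).
    std:     per-dimension standard deviations; they do not depend on
             the observation, mirroring the observation-independent
             covariance described above (diagonal here by assumption).
    """
    mean = mean_fn(observation)
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


# Hypothetical toy mean function, used only for illustration.
def toy_mean(observation):
    return [2.0 * observation[0], -observation[1]]


action = sample_action(toy_mean, [0.1, 0.1], [1.0, 3.0], rng=random.Random(0))
```

Because the noise scale is a learned parameter rather than a network output, only the mean has to be computed from the image at each time step.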

Plain supervised learning tends to produce poor long-horizon policies: small errors compound, the policy drifts into states that are absent from the training data, and it has no supervision for how to act there. To address this, the training data has to come from the policy’s own state distribution.

> We achieve this by alternating between trajectory-centric RL and supervised learning.

The RL stage provides supervision at states visited by the policy.
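The idea of supervising the policy at the states it actually visits can be sketched on a toy scalar system. This is not the paper's algorithm: the full-state controller here is fixed rather than re-optimized each round, and the loop is closer in spirit to dataset-aggregation-style imitation. Every name (`observe`, `teacher_action`, `rollout`, `fit_policy`, `train`) and the dynamics $x_{t+1} = x_t + u_t$ are assumptions made for illustration.

```python
def observe(x):
    """Toy observation: the policy sees a rescaled version of the state."""
    return 2.0 * x


def teacher_action(x):
    """Stand-in for a full-state controller p(u_t | x_t)."""
    return -0.5 * x


def rollout(w, x0, horizon):
    """Run the observation policy u = w * o and return the states it visits."""
    states, x = [], x0
    for _ in range(horizon):
        states.append(x)
        x = x + w * observe(x)  # toy dynamics: x_{t+1} = x_t + u_t
    return states


def fit_policy(states):
    """Least-squares fit of w so that w * o matches the teacher's action."""
    num = sum(observe(x) * teacher_action(x) for x in states)
    den = sum(observe(x) ** 2 for x in states)
    return num / den


def train(initial_states, iterations=3, horizon=5):
    """Alternate between rolling out the policy and refitting it."""
    w, visited = 0.0, []
    for _ in range(iterations):
        # Collect states from the *current policy's* own distribution.
        for x0 in initial_states:
            visited.extend(rollout(w, x0, horizon))
        # Supervised step: label those states with full-state actions.
        w = fit_policy(visited)
    return w


w = train([1.0, -2.0])
```

The supervision uses the full state $x$, but the fitted policy only ever consumes `observe(x)`, which mirrors the split between the two phases described above.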

![Screenshot 2024-11-08 at 10.16.47 AM.png](../../images/Screenshot_2024-11-08_at_10.16.47_AM.png)

They use pre-training for the CNN to reduce the amount of necessary data.

> The intuition behind our pre-training is that, although we ultimately seek to obtain sensorimotor policies that combine both vision and control, low-level aspects of vision can be initialized independently.

They initialize the vision layers of the network by training them to predict elements of the state $x_t$ (such as the positions of objects in the scene) from the image observations $o_t$, which bootstraps the visual features the policy will need.

### Guided Policy Search with BADMM

> Guided policy search transforms policy search into a supervised learning problem, where the training set is generated by a simple trajectory-centric RL algorithm.

The end goal is a policy $\pi_\theta(u_t|o_t)$ that acts on observations alone. To bootstrap the learning, they first collect ground truth data about the system and optimize controllers $p_i(u_t|x_t)$ that have access to the full state $x_t$.

Then, they use these full-state controllers as supervision to train the network $\pi_\theta(u_t|o_t)$, which makes it more accurate. The downside is that the true state of the real-world system has to be measured during training.

### End-to-End Visuomotor Policies

> Guided policy search allows us to optimize complex, high-dimensional policies with raw observations, such as when the input to the policy consists of images from a robot’s onboard camera.

> The guided policy search trajectory optimization phase uses the full state of the system, though the final policy only uses the observations.

The trajectory optimization phase trains on the true state, while the policy only sees the observations. The network presumably then learns to infer the task-relevant parts of the true state from those observations.

> To speed up learning, we initialize both the vision layers in the policy and the trajectory distributions for guided policy search by leveraging the fully observed training setup.

### Experimental Evaluation

> We simulated 2D and 3D peg insertion, octopus arm control, and planar swimming and walking.

![Screenshot 2024-11-08 at 10.46.52 AM.png](../../images/Screenshot_2024-11-08_at_10.46.52_AM.png)

![Screenshot 2024-11-08 at 10.47.14 AM.png](../../images/Screenshot_2024-11-08_at_10.47.14_AM.png)

> These comparisons show that training even medium-sized neural network policies for continuous control tasks with a limited number of samples is very difficult for many prior policy search algorithms.

> The visual processing layers of our architecture automatically learn feature points using the spatial softmax and expectation operators. These feature points encapsulate all of the visual information received by the motor layers of the policy.
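As a rough illustration of the spatial softmax and expectation operators mentioned in the quote, the sketch below turns one channel of convolutional activations into a single $(x, y)$ feature point. The function name, the `temperature` parameter, and the choice to normalize image coordinates to $[-1, 1]$ are assumptions of this sketch, not details taken from these notes.

```python
import math


def spatial_softmax_feature_point(activation, temperature=1.0):
    """Return the expected (x, y) position under a softmax over one channel.

    activation: list of rows (height x width) of floats for one channel.
    Coordinates are normalized to [-1, 1] (an assumption of this sketch).
    """
    height, width = len(activation), len(activation[0])
    # Subtract the max before exponentiating for numerical stability.
    peak = max(max(row) for row in activation)
    weights = [
        [math.exp((a - peak) / temperature) for a in row] for row in activation
    ]
    total = sum(sum(row) for row in weights)

    def coord(index, size):
        # Map a pixel index to [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    # Expectation of the pixel coordinates under the softmax distribution.
    fx = sum(w * coord(j, width) for row in weights for j, w in enumerate(row))
    fy = sum(w * coord(i, height) for i, row in enumerate(weights) for w in row)
    return fx / total, fy / total


# A channel that responds strongly in the top-right corner.
channel = [
    [0.0, 0.0, 50.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
point = spatial_softmax_feature_point(channel)
```

Each channel collapses to just two numbers, so the layers after this step receive a compact list of image positions rather than full feature maps.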

### Discussion

> In this paper, we presented a method for learning robotic control policies that use raw input from a monocular camera.

> These policies are represented by a novel convolutional neural network architecture, and can be trained end-to-end using our guided policy search algorithm, which decomposes the policy search problem in a trajectory optimization phase that uses full state information and a supervised learning phase that only uses the observations.

> Our experimental results show that our method can execute complex manipulation skills, and that end-to-end training produces significant improvements in policy performance compared to using fixed vision layers trained for pose prediction.

> The success of CNNs on exceedingly challenging vision tasks suggests that this class of models is capable of learning invariance to irrelevant distractor features.

> Our method takes advantage of a known, fully observed state space during training. This is both a weakness and a strength.
