## Implementation Details
Note: To render the LaTeX formulas, open this file in Jupyter Notebook.

### Implementation for Multi-sensory Representation Learning

The representation learning is formulated as a probabilistic graphical model, as in Lee et al. [1]. We aim to learn $p(z|D)$, the posterior distribution of the latent variable $z$, from the dataset $D=\{(o_{i}, y_{i}) \mid i=1, \dots, N\}$. Here $o_{i}$ denotes the sensor readings of all modalities (images + proprioception), and $y_{i}$ is the self-supervised label, which is the same as $o_{i}$. As described in our paper, we optimize for reconstructing the original sensor inputs in order to capture the statistical patterns that exist among different sensor modalities.

**Loss function** In the fusion module, the encoders are parameterized by $\theta_{e}$ and the decoders by $\theta_{d}$. The loss function for training the encoders is:

$\mathcal{L}_{i}(\theta_{e}, \theta_{d})=\mathbb{E}_{q_{\theta_{e}}}[\log p_{\theta_{d}}(D_{i}|z)]-\mathrm{KL}(q_{\theta_{e}}(z|D_{i})\,\|\,p(z))$.
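
The expression above has the form of an evidence lower bound, so one common way to turn it into a quantity to minimize is to negate it. The following is a minimal sketch of that computation for a single data point, not the code used in this work. It assumes a diagonal Gaussian posterior $q_{\theta_{e}}(z|D_{i})$, a standard normal prior $p(z)$, and a unit-variance Gaussian decoder likelihood; all function names and the toy values are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def reconstruction_log_likelihood(target, recon, sigma=1.0):
    """log N(target; recon, sigma^2 I). The Gaussian decoder is an assumption."""
    n = len(target)
    sq_err = sum((t - r) ** 2 for t, r in zip(target, recon))
    return -0.5 * sq_err / sigma ** 2 - 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)


def elbo(target, recon, mu, logvar):
    """Single-sample estimate of E_q[log p(D_i | z)] - KL(q(z | D_i) || p(z)).

    `recon` is assumed to be the decoder output for one z sampled from q.
    """
    return (reconstruction_log_likelihood(target, recon)
            - gaussian_kl_to_standard_normal(mu, logvar))


# Toy values (made up): a 2-D latent and a 3-D observation.
mu, logvar = [0.0, 0.0], [0.0, 0.0]       # q equals the prior, so the KL term is 0
target, recon = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
loss = -elbo(target, recon, mu, logvar)   # minimizing -ELBO maximizes the ELBO
```

With a fixed decoder variance, the reconstruction term reduces to a scaled squared error plus a constant, which is why reconstruction objectives of this kind are often implemented as a mean squared error.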

**Model architecture** We use the same convolutional network structures as the models from
Lee et al. [1], except that we do not have skip connections from the
encoder to the decoder. We choose 32 as the latent dimension
of the fused representation. To train the fusion model for each
task, we use 1000 training epochs, a learning rate of 0.001, and a
batch size of 128.
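
For reference, the fusion-model settings above can be collected in a small configuration object. This is only an illustrative sketch: the class name, field names, and the `steps_per_epoch` helper are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionTrainingConfig:
    """Fusion-model training settings quoted in the text above."""
    latent_dim: int = 32          # dimension of the fused representation
    num_epochs: int = 1000
    learning_rate: float = 1e-3
    batch_size: int = 128

    def steps_per_epoch(self, dataset_size: int) -> int:
        """Minibatches per epoch, counting a final partial batch (an assumption)."""
        return -(-dataset_size // self.batch_size)  # ceiling division


config = FusionTrainingConfig()
```

Freezing the dataclass keeps the values from being changed accidentally between tasks, since the text uses the same settings for every task.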


### Hyperparameters for Skill Segmentation

We present the hyperparameters for the unsupervised clustering
step. The maximum number of clusters for each task is: 6 for
<tt>Tool-Use</tt> and <tt>Hammer-Place</tt>, 8 for <tt>Kitchen</tt>
and <tt>Real-Kitchen</tt>, and 10 for <tt>Multitask-Kitchen</tt>. The
stopping criterion of the breadth-first search is that the number of
mid-level segments exceeds twice the maximum number of clusters. We
also use a minimum length threshold to reject small clusters: 30 for
<tt>Tool-Use</tt> and <tt>Real-Kitchen</tt>, 35 for
<tt>Hammer-Place</tt>, and 20 for <tt>Multitask-Kitchen</tt>. In this
work, these hyperparameter values are tuned heuristically; extending
the approach to an end-to-end method is another direction for future work.
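
The two rules in this section can be written down as a short sketch. The dictionary and function names are hypothetical, and the exact comparison used for the length threshold (here, clusters strictly shorter than the threshold are rejected) is an assumption. <tt>Kitchen</tt> is left out of the length thresholds because no value is listed for it above.

```python
# Maximum number of clusters per task, as listed in the text.
MAX_CLUSTERS = {
    "Tool-Use": 6,
    "Hammer-Place": 6,
    "Kitchen": 8,
    "Real-Kitchen": 8,
    "Multitask-Kitchen": 10,
}

# Minimum length thresholds, as listed in the text (none is given for Kitchen).
MIN_CLUSTER_LENGTH = {
    "Tool-Use": 30,
    "Real-Kitchen": 30,
    "Hammer-Place": 35,
    "Multitask-Kitchen": 20,
}


def should_stop_search(task, num_mid_level_segments):
    """Stop once there are more than twice as many segments as allowed clusters."""
    return num_mid_level_segments > 2 * MAX_CLUSTERS[task]


def reject_small_clusters(task, cluster_lengths):
    """Keep clusters whose length reaches the task's threshold.

    cluster_lengths maps a cluster id to its length (hypothetical format).
    """
    threshold = MIN_CLUSTER_LENGTH[task]
    return {cid: n for cid, n in cluster_lengths.items() if n >= threshold}
```

For example, `should_stop_search("Tool-Use", 13)` is true because 13 exceeds 2 × 6, while 12 segments would let the search continue.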


### Hyperparameters for Sensorimotor Policy Models

We choose the dimension of the subgoal vector $\omega_t$ to be 32 and
the number of 2D keypoints output by the Spatial Softmax layer to
be 64. We choose $H=30$ for all single-task environments (both
simulation and real robot) and $H=20$ for the multitask
environment <tt>Multitask-Kitchen</tt>, because skills in each task of
the <tt>Multitask-Kitchen</tt> domain are relatively short
compared to those in the single-task environments.
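
To illustrate how a Spatial Softmax layer produces 2D keypoints, here is a minimal sketch in plain Python. Each feature channel yields one keypoint, so 64 channels give the 64 keypoints mentioned above. The function names are hypothetical, and normalizing coordinates to $[-1, 1]$ is an assumption; the layer used in this work may differ in such details.

```python
import math


def spatial_softmax_keypoint(feature_map):
    """Expected (x, y) position under a softmax over one channel's pixels.

    feature_map is a list of rows (height x width, both at least 2).
    Coordinates are normalized to [-1, 1], which is an assumption.
    """
    height, width = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # subtract max for stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            p = w / total
            x += p * (2.0 * j / (width - 1) - 1.0)
            y += p * (2.0 * i / (height - 1) - 1.0)
    return x, y


def keypoints(feature_maps):
    """One 2D keypoint per channel."""
    return [spatial_softmax_keypoint(fm) for fm in feature_maps]
```

A strongly peaked activation pulls the keypoint toward the location of the peak, while a flat feature map yields the image center.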

### Training Details for Sensorimotor Policies
To improve the generalization ability of the model, we apply data
augmentation [2] to images when training both skills and meta
controllers. To further increase the robustness of the policies
$\pi^{(k)}_{L}$, we also add noise sampled from a Gaussian distribution
with a standard deviation of 0.1.
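
As a rough illustration of the noise injection, the sketch below adds zero-mean Gaussian noise with a standard deviation of 0.1 to a vector. The text does not specify which quantity receives the noise, so applying it to a generic list of floats is an assumption, and the function name is hypothetical.

```python
import random


def add_gaussian_noise(values, std=0.1, rng=None):
    """Return a copy of `values` with independent N(0, std^2) noise added."""
    if rng is None:
        rng = random.Random()
    return [v + rng.gauss(0.0, std) for v in values]


# Seeding the generator makes the perturbation reproducible.
rng = random.Random(0)
noisy = add_gaussian_noise([0.0, 0.5, 1.0], std=0.1, rng=rng)
```

Passing an explicit `random.Random` instance keeps the noise independent of any other use of the global random state.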
 
For all skills, we train for 2001 epochs with a learning rate of
$0.0001$, using the $\ell_{2}$ loss. The fully connected part has two
layers (300 and 400 hidden units) in all single-task environments
and three layers (300, 300, and 400 hidden units) in the
<tt>Multitask-Kitchen</tt> domain. For meta controllers, we
train for 1001 epochs in all simulated single-task environments, 2001
epochs in the <tt>Multitask-Kitchen</tt> domain, and 3001 epochs in
<tt>Real-Kitchen</tt>. For the KL coefficient during cVAE training, we
choose 0.005 for <tt>Tool-Use</tt> and <tt>Hammer-Place</tt>, and
0.01 for all other environments.
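
The per-environment settings in this section can be summarized as lookup helpers. This is a sketch with hypothetical names. It assumes that the simulated single-task environments are <tt>Tool-Use</tt>, <tt>Hammer-Place</tt>, and <tt>Kitchen</tt>, which is inferred from the environment names used in this document.

```python
# Assumed membership, inferred from the environment names in this document.
SIMULATED_SINGLE_TASK = ("Tool-Use", "Hammer-Place", "Kitchen")

SKILL_EPOCHS = 2001
SKILL_LEARNING_RATE = 1e-4


def policy_hidden_units(env):
    """Hidden units of the fully connected layers for a given environment."""
    return (300, 300, 400) if env == "Multitask-Kitchen" else (300, 400)


def meta_controller_epochs(env):
    """Number of training epochs for the meta controller."""
    if env in SIMULATED_SINGLE_TASK:
        return 1001
    if env == "Multitask-Kitchen":
        return 2001
    if env == "Real-Kitchen":
        return 3001
    raise KeyError(f"unknown environment: {env}")


def kl_coefficient(env):
    """KL coefficient used during cVAE training."""
    return 0.005 if env in ("Tool-Use", "Hammer-Place") else 0.01
```

For instance, `meta_controller_epochs("Real-Kitchen")` returns 3001 and `kl_coefficient("Kitchen")` returns 0.01.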
 


## References
[1] Making sense of vision and touch: Learning multimodal
representations for contact-rich tasks. Lee M. et al.


[2] Image augmentation is all you need: Regularizing deep
reinforcement learning from pixels. Kostrikov I. et al.


