2 changes: 1 addition & 1 deletion sections/01_introduction.tex
Original file line number Diff line number Diff line change
@@ -83,7 +83,7 @@ \subsection{Code Example: Batching a (Streaming) Dataset}
In practice, most reinforcement learning (RL) and behavioral cloning (BC) algorithms tend to operate on stacks of observations and actions.
For the sake of brevity, we will refer to joint-space readings and camera frames with the single term \emph{frame}.
For instance, RL algorithms may use a history of previous frames \(o_{t-H_o:t} \) to mitigate partial observability, and BC algorithms are in practice trained to regress chunks of multiple actions (\(a_{t:t+H_a} \)) rather than single controls.
To accommodate for these specifics of robot learning training, \lerobotdataset~provides a native windowing operation, whereby users can define the \emph{seconds} of a given window (before and after) around any given frame, by using the \texttt{delta\_timestemps} functionality.
To accommodate these specifics of robot learning, \lerobotdataset~provides a native windowing operation, whereby users can define a window of \emph{seconds} (before and after) around any given frame, using the \texttt{delta\_timestamps} functionality.
Unavailable frames are padded as needed, and a padding mask is also returned to filter out the padded frames.
Notably, this all happens within the \lerobotdataset, and is entirely transparent to higher-level wrappers commonly used in training ML models, such as \texttt{torch.utils.data.DataLoader}.
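To make the windowing behavior concrete, the following is a minimal, self-contained sketch of the idea, not the actual \lerobotdataset~implementation: the function name, the edge-frame padding policy, and the fps-based index arithmetic are illustrative assumptions.

```python
# Sketch of delta_timestamps-style windowing (illustrative, NOT the
# actual LeRobot implementation): given offsets in seconds around a
# frame, gather neighboring frames and flag padded (unavailable) ones.

def window(frames, t, deltas_s, fps):
    """Return (values, pad_mask) for offsets `deltas_s` (seconds) around index t."""
    values, pad_mask = [], []
    for d in deltas_s:
        idx = t + round(d * fps)
        available = 0 <= idx < len(frames)
        # Pad out-of-range queries with the nearest available edge frame.
        clamped = min(max(idx, 0), len(frames) - 1)
        values.append(frames[clamped])
        pad_mask.append(not available)  # True marks a padded frame
    return values, pad_mask

# Example: a 10 fps episode, querying 0.1 s of past and future around frame 0.
vals, mask = window([10, 20, 30, 40], t=0, deltas_s=[-0.1, 0.0, 0.1], fps=10)
# vals == [10, 10, 20]; mask == [True, False, False]
```

The padding mask is what lets a downstream \texttt{DataLoader} consume fixed-size windows while a loss function filters out the padded entries.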

6 changes: 3 additions & 3 deletions sections/02_classic_robotics.tex
@@ -12,15 +12,15 @@ \subsection{Explicit and Implicit Models}
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{figures/ch2/ch2-approaches.pdf}
\caption{Overview of methods to generate motion (clearly non-exhausitve, see~\citet{bekrisStateRobotMotion2024}). The different methods can be grouped based on whether they explicitly (\emph{dynamics-based}) or implicitly (\emph{learning-based}) model robot-environment interactions.}
\caption{Overview of methods to generate motion (clearly non-exhaustive, see~\citet{bekrisStateRobotMotion2024}). The different methods can be grouped based on whether they explicitly (\emph{dynamics-based}) or implicitly (\emph{learning-based}) model robot-environment interactions.}
\label{fig:generating-motion-atlas}
\end{figure}

Robotics is concerned with producing artificial motion in the physical world in a useful, reliable, and safe fashion.
Thus, robotics is an inherently multi-disciplinar domain: producing autonomous motion in the physical world requires, to the very least, interfacing different software (motion planners) and hardware (motion executioners) components.
Thus, robotics is an inherently multidisciplinary domain: producing autonomous motion in the physical world requires, at the very least, interfacing different software (motion planners) and hardware (motion executors) components.
Further, knowledge of mechanical, electrical, and software engineering, as well as of rigid-body mechanics and control theory, has proven quintessential in robotics since the field first developed in the 1950s.
More recently, Machine Learning (ML) has also proved effective in robotics, complementing these more traditional disciplines~\citep{connellRobotLearning1993}.
As a direct consequence of its multi-disciplinar nature, robotics has developed as a rather wide array of methods, all concerned with the main purpose of \highlight{producing artificial motion in the physical world}.
As a direct consequence of its multidisciplinary nature, robotics has developed a rather wide array of methods, all concerned with the main purpose of \highlight{producing artificial motion in the physical world}.

Methods to produce robot motion range from traditional \emph{explicit} models---\highlight{dynamics-based}\footnote{Here, we refer to both \emph{kinematics}- and \emph{dynamics}-based control.} methods, leveraging precise descriptions of the mechanics of robots' rigid bodies and their interactions with potential obstacles in the environment---to \emph{implicit} models---\highlight{learning-based} methods, treating artificial motion as a statistical pattern to learn given multiple sensorimotor readings~\citep{agrawalComputationalSensorimotorLearning,bekrisStateRobotMotion2024}.
A variety of methods have been developed between these two extrema.
2 changes: 1 addition & 1 deletion sections/03_reinforcement_learning.tex
@@ -151,7 +151,7 @@ \subsection{Real-world RL for Robotics}

First, especially early in training, \highlight{actions are typically explorative, and may thus be erratic}.
On physical systems, untrained policies may command high velocities, self-colliding configurations, or torques exceeding joint limits, leading to wear and potential hardware damage.
Mitigating these risks requires external safeguards (e.g., watchdogs, safety monitors, emergency stops), often incuring in a high degree of human supervision.
Mitigating these risks requires external safeguards (e.g., watchdogs, safety monitors, emergency stops), often entailing a high degree of human supervision.
Further, in the typical episodic setting considered in most robotics problems, experimentation is substantially slowed down by the need to manually reset the environment over the course of training, a time-consuming and error-prone process.
Second, learning efficiently remains problematic in RL, \highlight{limiting the applicability of RL in real-world robotics due to consequently prohibitive timescales of training}.
Even strong algorithms such as SAC~\citep{haarnojaSoftActorCriticOffPolicy2018} typically require a large number of transitions \( \{ \sars \}_{t=1}^N \).
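One such safeguard can be sketched as a thin wrapper that clamps exploratory commands before they reach the hardware; this is an illustrative assumption, not a specific library's API, and the function name, limits, and rate-limiting policy are made up.

```python
# Hypothetical safeguard (illustrative, not a real library API): clamp
# exploratory policy commands to per-joint position limits and
# rate-limit the change relative to the previous command.

def clamp_action(action, prev_action, low, high, max_delta):
    """Clip each joint target to [low, high] and bound its per-step change."""
    safe = []
    for a, p, lo, hi in zip(action, prev_action, low, high):
        a = min(max(a, lo), hi)                        # respect joint limits
        a = min(max(a, p - max_delta), p + max_delta)  # rate-limit the step
        safe.append(a)
    return safe

# A raw exploratory command far outside the limits gets tamed:
safe = clamp_action([3.0, -2.5], [0.0, 0.0], [-1.57, -1.57], [1.57, 1.57], 0.1)
# safe == [0.1, -0.1]
```

Watchdogs and emergency stops operate at lower levels of the stack, but the same principle applies: the policy's output is never trusted unconditionally.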
8 changes: 4 additions & 4 deletions sections/04_imitation_learning.tex
@@ -42,7 +42,7 @@ \section{Robot (Imitation) Learning}
\label{fig:ch4-observation-action-mapping}
\end{figure}

Behavioral Cloning (BC)~\citep{pomerleauALVINNAutonomousLand1988} aims at producing synthetic behaviors by learning the mapping from observations to actions, and in its most natural formulation can be effectively tackled as a \emph{supevised} learning problem, consisting of learning the (deterministic) mapping \(f: \obsspace \mapsto \actionspace, \ a_t = f(o_t) \) by solving
Behavioral Cloning (BC)~\citep{pomerleauALVINNAutonomousLand1988} aims at producing synthetic behaviors by learning the mapping from observations to actions, and in its most natural formulation can be effectively tackled as a \emph{supervised} learning problem, consisting of learning the (deterministic) mapping \(f: \obsspace \mapsto \actionspace, \ a_t = f(o_t) \) by solving
\begin{equation}\label{eq:loss-minimization-SL}
\min_{f} \mathbb{E}_{(o_t, a_t) \sim p(\bullet)} \mathcal L(a_t, f(o_t)),
\end{equation}
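A toy instance of this objective may help: the sketch below fits a linear point-estimate policy \( a = f(o) = w \, o \) by gradient descent on a squared-error loss. All data and names are synthetic, chosen only to illustrate the supervised-learning formulation.

```python
# Toy instance of the BC objective: minimize mean squared error between
# predicted and demonstrated actions for a linear policy a = w * o.
# All demonstrations below are synthetic (expert maps o -> 2 * o).

def fit_bc(pairs, lr=0.1, steps=200):
    w = 0.0
    for _ in range(steps):
        # gradient of (1/N) * sum_i (w * o_i - a_i)^2 with respect to w
        g = sum(2 * (w * o - a) * o for o, a in pairs) / len(pairs)
        w -= lr * g
    return w

demos = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = fit_bc(demos)
# w converges to ~2.0, recovering the expert's mapping
```

In practice \( f \) is a deep network and \( \mathcal L \) may be more elaborate, but the structure of the optimization is the same.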
@@ -67,13 +67,13 @@ \section{Robot (Imitation) Learning}
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{figures/ch4/ch4-issues-with-bc.pdf}
\caption{Point-wise policies suffer from limitations due to (A) covariate shifts and (B) poor approximation of multimodal demonstrations. (A) Small errors may drive the policy out of distribution, incuring in a vicious circle ultimately resulting in failure. (B) Both modes of reaching for a target object in the scene---either left or right-first---are equally as good and thus equally as likely to be present in a dataset of human demonstrations, ultimately resulting in multimodal demonstrations.}
\caption{Point-wise policies suffer from limitations due to (A) covariate shift and (B) poor approximation of multimodal demonstrations. (A) Small errors may drive the policy out of distribution, triggering a vicious cycle ultimately resulting in failure. (B) Both modes of reaching for a target object in the scene---either left- or right-first---are equally good and thus equally likely to be present in a dataset of human demonstrations, ultimately resulting in multimodal demonstrations.}
\label{fig:ch4-issues-with-bc}
\end{figure}

While conceptually elegant, \emph{point-estimate policies} \( f : \obsspace \mapsto \actionspace \) learned by solving eq.~\ref{eq:loss-minimization-SL} have been observed to suffer from (1) compounding errors~\citep{rossReductionImitationLearning2011} and (2) poor fit to multimodal distributions~\citep{florenceImplicitBehavioralCloning2022, keGraspingChopsticksCombating2020}.
Figure~\ref{fig:ch4-issues-with-bc} illustrates these two key issues related to learning \emph{explicit policies}~\citep{florenceImplicitBehavioralCloning2022}.
Besides sequentiality in \( \mathcal D \), compounding errors due to \emph{covariate shift} may also prove catastrophic, as even small \( \epsilon \)-prediction errors \( 0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon \) can quickly drive the policy into out-of-distribution states, incuring in less confident generations and thus compounding errors (Figure~\ref{fig:ch4-issues-with-bc}, left).
Besides sequentiality in \( \mathcal D \), compounding errors due to \emph{covariate shift} may also prove catastrophic, as even small \( \epsilon \)-prediction errors \( 0 < \Vert \mu(o_t) - a_t \Vert \leq \epsilon \) can quickly drive the policy into out-of-distribution states, leading to less confident predictions and thus compounding errors (Figure~\ref{fig:ch4-issues-with-bc}, left).
Moreover, point-estimate policies typically fail to learn \emph{multimodal} targets, which are very common in human demonstrations of real-world robotics tasks, as multiple trajectories can be equally good at accomplishing a goal (e.g., symmetric grasps, Figure~\ref{fig:ch4-issues-with-bc}, right).
In particular, unimodal regressors tend to average across modes, yielding indecisive or even unsafe commands~\citep{florenceImplicitBehavioralCloning2022}.
To address poor multimodal fitting,~\citet{florenceImplicitBehavioralCloning2022} propose learning the \emph{generative model} \( p(o, a) \) underlying the samples in \( \mathcal D \), rather than explicitly learning a prediction function \( f: a = f(o) \).
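The mode-averaging failure can be reproduced in a few lines. The sketch below uses synthetic, equally likely left/right demonstrations for the same observation: the MSE-optimal point estimate is the conditional mean, which corresponds to neither valid mode.

```python
# Synthetic illustration of mode averaging: for one observation,
# demonstrations go left (-1.0) or right (+1.0) with equal probability.
# Under MSE, the optimal point estimate is the conditional mean.

actions = [-1.0, +1.0] * 50  # 100 equally likely demonstrations

def mse(pred, targets):
    return sum((pred - a) ** 2 for a in targets) / len(targets)

mean_action = sum(actions) / len(actions)  # 0.0: neither valid mode

# The "indecisive" mean achieves lower MSE than either true mode,
# despite never appearing in the data:
assert mse(mean_action, actions) < mse(-1.0, actions)
assert mse(mean_action, actions) < mse(+1.0, actions)
```

This is precisely why generative (implicit or latent-variable) policies, which can place probability mass on both modes, are preferred for multimodal demonstrations.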
@@ -198,7 +198,7 @@ \subsubsection{Diffusion Models}
DMs are a particular instantiation of HMLV models for which the posterior is fixed to \( q( z_t \vert z_{t-1}) = \mathcal N(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t \mathbf{I}) \), for a given \( \beta_t \in \mathbb R^+ \).
In practice, \( \beta_t \) is used to iteratively reduce the signal-to-noise ratio along the latents' hierarchy, similarly to how a diffusion process influences the information of a physical system.

Just like VAEs, DMs attemp to learn to reproduce an underlying data distribution \( p (o,a) \) given a collection of i.i.d. samples approximating the model posited to have generated the data in the first place (eq.~\ref{eq:BC-multi-latent-model-1}).
Just like VAEs, DMs attempt to learn to reproduce an underlying data distribution \( p (o,a) \) given a collection of i.i.d. samples approximating the model posited to have generated the data in the first place (eq.~\ref{eq:BC-multi-latent-model-1}).
Similarly to VAEs, DMs approximate the process of sampling from the unknown \( p(o,a) \) by (1) sampling from an easy-to-sample distribution (e.g., Gaussian) and (2) learning to reconstruct high-likelihood samples under the unknown distribution.
However, in stark contrast with VAEs, the easy-to-sample distribution contains \emph{no mutual information} regarding the data distribution \( p(o,a) \).
Crucially, as no information from the sample \( (o,a) \) (denoted as \( z_0 \equiv (o,a) \) for simplicity of notation) is assumed to be propagated throughout the chain of latents, the posterior \( q(z_t \vert z_{t-1})\) assumes a relatively amenable structure in DMs, reducing complexity.
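The fixed posterior makes the forward process trivial to simulate. The sketch below iterates \( z_t = \sqrt{1-\beta_t}\, z_{t-1} + \sqrt{\beta_t}\, \varepsilon \) with \( \varepsilon \sim \mathcal N(0,1) \) for a scalar latent; the schedule \( \beta_t = 0.02 \) is an arbitrary illustrative choice.

```python
import math
import random

# Simulate the fixed DM posterior q(z_t | z_{t-1}) = N(sqrt(1 - b_t) z_{t-1}, b_t I)
# for a scalar latent. The signal from z_0 is scaled by prod_t sqrt(1 - b_t),
# so the signal-to-noise ratio decays along the hierarchy.

def forward_diffuse(z0, betas, seed=0):
    rng = random.Random(seed)
    z, signal_scale = z0, 1.0
    for beta in betas:
        z = math.sqrt(1.0 - beta) * z + math.sqrt(beta) * rng.gauss(0.0, 1.0)
        signal_scale *= math.sqrt(1.0 - beta)
    return z, signal_scale

zT, scale = forward_diffuse(z0=1.0, betas=[0.02] * 300)
# After 300 steps, scale = 0.98**150 ~ 0.048: almost no signal from z_0
# remains, and z_T is approximately a standard Gaussian sample.
```

This is exactly the "no mutual information" property the text describes: the terminal latent is (approximately) pure noise, regardless of the data sample it started from.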
2 changes: 1 addition & 1 deletion sections/05_foundation_models.tex
@@ -137,7 +137,7 @@ \subsection{\( \pi_0 \)}
\end{equation*}
Note how \emph{intra}-block directional attention allows tokens to communicate freely, while \emph{inter}-block communication is mediated by the attention mask \(\mathbf{A} \).
\emph{Blockwise causal masking} effectively prevents the pre-trained perception-language tokens from attending to robotics tokens, which are likely out of distribution for VLM backbones traditionally trained on large corpora of internet (non-robotics) data.
Crucially, because communication is obstructed between image-language tokens, proprioperceptive tokens and action tokens, one can cache keys and values across denoising steps at runtime time, incuring in a reduced computational footprint and faster inference.
Crucially, because communication is obstructed between image-language tokens, proprioceptive tokens, and action tokens, one can cache keys and values across denoising steps at runtime, yielding a reduced computational footprint and faster inference.
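A blockwise causal mask of this kind can be sketched in a few lines; the block sizes below and the convention (True meaning "may attend") are illustrative assumptions, not \pizero's actual configuration.

```python
# Sketch of a blockwise causal attention mask over three token blocks:
# image-language (VLM), proprioceptive state, and action tokens.
# Within a block, attention is unrestricted; across blocks, later blocks
# may attend to earlier ones but not vice versa. Block sizes are made up.

def blockwise_causal_mask(block_sizes):
    """mask[i][j] is True iff token i may attend to token j."""
    ids = [b for b, n in enumerate(block_sizes) for _ in range(n)]  # block id per token
    n = len(ids)
    return [[ids[j] <= ids[i] for j in range(n)] for i in range(n)]

mask = blockwise_causal_mask([3, 2, 2])  # (image-language, state, action)
assert mask[0][1] is True   # VLM tokens attend to each other freely...
assert mask[0][3] is False  # ...but never to later state/action tokens
assert mask[5][0] is True   # action tokens attend to the VLM block
```

Because the image-language keys and values are never influenced by the robotics tokens, they are identical across denoising steps, which is what makes the caching described above sound.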

In \pizero, both the VLM backbone and the action expert are updated using a \emph{flow matching} loss, minimizing:
\begin{align}