<center>

---


<h1>
Steerable Inductive Biases for Reconstructing Flight Maneuvers
</h1>

Clayton Curry
 
---

[![Blog: Flight Maneuvers](https://img.shields.io/badge/Blog-Flight%20Maneuvers-blue)](https://claycurry.com/blog/maneuvers)
[![Flask Project Demo](https://img.shields.io/badge/Flask-Project%20Demo-ff69b4)](https://claycurry.com/blog/maneuvers)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13U6SEHBlYxXKCvmLBXeERzIb5dt2lEfA?usp=sharing)
[![Github Repo](https://img.shields.io/static/v1.svg?logo=github&label=repo&message=see%20project&color=blueviolet)](https://github.com/clay-curry/Flight-Maneuvers) 

</center>



<center>
<h2>Abstract</h2>
</center>

The effort to model physical systems with deep neural networks has motivated much work on how to constrain the space of admissible hypotheses so that it is consistent with desired symmetries, physical laws, or other priors.
In recent years, such physically motivated inductive biases have gained considerable traction through so-called "steerable" neural networks. The underlying theory draws on well-charted aspects of group theory, Lie groups, and linear representations, which provide an exact mathematical framework for the principled construction of transformations that leave certain systems or physical quantities invariant.
Despite significant theoretical advances and the promise of neural architectures with greater parameter and data efficiency, many application gaps remain for steerable neural networks.
To help close one such gap, our work provides a comparative analysis of networks trained to classify flight maneuvers from observed flight trajectories.
Using cross-validation, we obtained data supporting the hypothesis that $SE(2)$- and $SE(3)$-steerable networks are more data- and parameter-efficient than traditional convolutional networks, which guarantee equivariance under translations but not rotations.
Our work also suggests that $SE(3)$-steerable networks, which guarantee equivariance to orientation-preserving isometries, remain sufficiently expressive to discriminate instances of 'takeoff' and 'landing' maneuvers, despite the known equivalence of these two instances up to actions of $SE(3)$.

## 1. Introduction

Learning algorithms are a subset of artificial intelligence founded on the assumption that many everyday tasks have the structure of an optimization problem.
Under this assumption, mastering an everyday task is equivalent to knowing how to select a *hypothesis* from a set of candidate solutions.
On many benchmarks, learning algorithms have greatly outstripped the capabilities of ordinary humans, often with the support of dedicated software libraries [[2](#cesa-22)], [[3](#e3nn)], [[4](#pytorch)].
Convolutional neural networks (CNNs), in particular, have stood for over three decades as the modus operandi for solving visual tasks, and they continue to shape ordinary lives through devices, safety features, and so on.
Unlike fully connected networks, CNNs possess a key property to which many attribute their success: an inductive bias toward hypotheses that behave similarly on instances that differ only slightly in pose [[1](#big3)].
This inductive bias yields a rich function class with fewer bad hypotheses, so that finding an optimal hypothesis requires significantly fewer training examples [[5](#gdl)].
While learning generic functions in high dimensions is a cursed estimation problem, the remarkable success of CNNs [[6](#poggio-17)] suggests that many daily tasks give rise to signals concentrated near low-dimensional substructures embedded in high- (or infinite-) dimensional signal spaces [[5](#gdl)].





Hypotheses generated by CNNs are, by construction, especially suited to tasks whose instances exhibit topological structure (locality) and compositional structure (complex objects arise from simpler objects).
These two properties arise, respectively, from two mathematical operations: signal convolution and function composition.
The appendix explains why this construction yields the desired class of functions.
Because the same set of weights is used at every position of the input, convolutions are said to be **equivariant** to transformations from the translation group $ℝ^2$. Equivariant representations are thought to enable greater object-centricity and stability of activations under identity-preserving transformations [[1](#big3)].



## 2. Related Works

### Analysis of Flight Maneuvers
The recently proposed [Maneuver Identification Challenge](https://maneuver-id.mit.edu/) provides thousands of trajectories from flight simulators, hoping to accelerate inquiry into machine learning methods for pilot training curricula.

### Steerability


## 3. Methodology

Convolutional networks make use of the convolution operator, defined here over $ℝ^2$ for a signal $f: ℝ^2 \to ℝ$ and a kernel $k: ℝ^2 \to ℝ$, evaluated at $x \in ℝ^2$:

$$
(f \star k)(x) := \int_{ℝ^2} f(\tilde x) \, k(\tilde x - x) \, d \tilde x,
$$

where the scalar $(f \star k)(x)$ measures the "similarity" (a notion made precise from the vantage of inner-product spaces <a name="ref:1" href="#axler">[7]</a>) between the signal $f$ and the kernel $k$ translated by $x$.
<br>
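
As an illustration of this "similarity" reading, the following minimal sketch evaluates a discrete analogue of the integral on a small grid. The function name `correlate2d_valid`, the toy "plus" pattern, and the 8×8 grid are assumptions made for this example only and are not part of the project code.

```python
import numpy as np

def correlate2d_valid(f, k):
    """Discrete analogue of (f * k)(x): inner product of f with k shifted to x.

    Only positions where the shifted kernel fits entirely inside f are computed.
    """
    H, W = f.shape
    h, w = k.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = f[i:i + h, j:j + w]      # window of f under the shifted kernel
            out[i, j] = np.sum(patch * k)    # inner product <patch, k>
    return out

# Toy kernel: a "plus" pattern (hypothetical, for illustration only).
k = np.array([[0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])

# Toy signal: the same pattern placed with its top-left corner at (2, 3).
f = np.zeros((8, 8))
f[2:5, 3:6] = k

response = correlate2d_valid(f, k)
peak = np.unravel_index(np.argmax(response), response.shape)
```

In this sketch the response is largest at the shift where the kernel lines up with the matching pattern in the signal, which is the sense in which the operator measures similarity.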

### Problem Formulation

From prior knowledge, flight trajectories relate directly to the actions of the pilot, who is formally trained to execute a handful of discrete basic flight maneuvers. Given this fact, we can define a relation characterizing flight, $(x_t, c) ↦ x_{t+1}$, whereby each **state** $x_t \in X \equiv ℝ^n$ combines with a **maneuver** $c \in C$ to produce a succeeding state $x_{t+1}$ at each time step.
Since maneuvers $c$ determine the evolution of $x$ locally in time, it is reasonable to assume the existence of a joint distribution $f(x, c)$ that relates each observed trajectory of states
$(\ldots, x_{t-1}, x_{t}, x_{t+1}, \ldots)$ to a latent trajectory of maneuvers $(\ldots, c_{t-1}, c_{t}, c_{t+1}, \ldots)$.
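
To make the relation $(x_t, c) ↦ x_{t+1}$ concrete, here is a deliberately simplified sketch. The maneuver names, the climb rates in `CLIMB_RATE`, the three-dimensional state (east, north, altitude), and the deterministic update are all hypothetical choices for illustration; they are not the dynamics or the maneuver set used in this project.

```python
import numpy as np

# Hypothetical maneuver set C with illustrative climb rates (assumption).
CLIMB_RATE = {"takeoff": 1.0, "cruise": 0.0, "landing": -1.0}

def step(x_t, c, dt=1.0, speed=1.0):
    """Toy transition (x_t, c) -> x_{t+1} for a state x = (east, north, altitude)."""
    velocity = np.array([speed, 0.0, CLIMB_RATE[c]])
    return x_t + dt * velocity

def rollout(x0, maneuvers):
    """Apply the transition once per maneuver and stack the visited states."""
    states = [np.asarray(x0, dtype=float)]
    for c in maneuvers:
        states.append(step(states[-1], c))
    return np.stack(states)

# Latent maneuver sequence (c_t) and the state trajectory (x_t) it produces.
maneuvers = ["takeoff"] * 3 + ["cruise"] * 2 + ["landing"] * 3
trajectory = rollout([0.0, 0.0, 0.0], maneuvers)
```

The learning problem runs in the opposite direction: only `trajectory` is observed, and the sequence `maneuvers` must be inferred from it.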

Precisely, our immediate goal is to choose a family of hypotheses $f_\theta$ that reasonably models the true distribution of associations $f : X \to C$, where the quality of any given hypothesis is assessed by summing the binary cross entropy,

$$
L(f, f_\theta)
$$

over a representative subset $D = \{ (X, C)_j \}$ of training examples.
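
A minimal sketch of this objective follows. It assumes, for illustration only, that each label $C$ is stored as a one-hot indicator per time step and that the hypothesis returns one probability per maneuver class; the names `dataset_loss` and `uniform_predictor` are hypothetical and do not refer to functions in the repository.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Elementwise binary cross entropy between probabilities p and targets y."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def dataset_loss(predict, dataset):
    """Sum the binary cross entropy over every (X, C) pair in the dataset."""
    total = 0.0
    for X, C in dataset:
        total += binary_cross_entropy(predict(X), C).sum()
    return total

T, K = 4, 3                          # time steps and maneuver classes (assumed)
labels = np.eye(K)[[0, 0, 1, 2]]     # one-hot maneuver per time step, shape (T, K)
dataset = [(np.zeros((T, 3)), labels), (np.ones((T, 3)), labels)]

def uniform_predictor(X):
    """Placeholder hypothesis: probability 0.5 for every class at every step."""
    return np.full((X.shape[0], K), 0.5)

loss = dataset_loss(uniform_predictor, dataset)
```

With the uninformative predictor every term contributes $\log 2$, so the total is $2 \cdot T \cdot K \cdot \log 2$; a trained hypothesis should score lower.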



## References

1. <a name="big3"></a> LeCun, Y., Bengio, Y. & Hinton, G. (2015) [Deep learning](https://www.nature.com/articles/nature14539). *Nature, vol. 521, no. 7553, pp. 436–444.*

2. <a name="cesa-22"></a>Cesa, G., Lang, L. & Weiler, M. (2022) [A Program to Build E(n) Equivariant Steerable CNNs](https://openreview.net/forum?id=WE4qe9xlnQw). *In Conference on Learning Representations.*

3. <a name="e3nn"></a> Geiger, M., Smidt, T., et al. (2022) [Euclidean neural networks: e3nn](https://doi.org/10.5281/zenodo.6459381). *Zenodo.*

4. <a name="pytorch"></a> Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). [PyTorch: An Imperative Style, High-Performance Deep Learning Library](https://dl.acm.org/doi/10.5555/3454287.3455008). *In Advances in Neural Information Processing Systems 32 (pp. 8024–8035).*

5. <a name="gdl"></a> Bronstein, M.M., Bruna, J., Cohen, T. & Veličković, P. (2021) [Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges](https://arxiv.org/abs/2104.13478). *Preprint.*

6. <a name="poggio-17"></a>Poggio, T., Mhaskar, H., Rosasco, L. et al. (2017) [Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review](https://doi.org/10.1007/s11633-017-1054-2). *Int. J. Autom. Comput. 14, 503–519.*

7. <a name="axler"></a>Axler, S. (2015) [Linear Algebra Done Right](https://ou-primo.hosted.exlibrisgroup.com/permalink/f/7erelc/TN_cdi_askewsholts_vlebooks_9783319110806). *Springer, Undergraduate Texts in Math, Third ed.*

8. <a name="serre"></a>[Serre, J.P.](https://en.wikipedia.org/wiki/Jean-Pierre_Serre). (1977) [Linear Representations of Finite Groups](https://link.springer.com/book/10.1007/978-1-4684-9458-7?view=modern&page=2). *Springer, Graduate Texts in Math.*

9. <a name="gdl-old"></a>Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A. & Vandergheynst, P. (2017) [Geometric deep learning: going beyond Euclidean data](https://arxiv.org/pdf/1611.08097.pdf). *IEEE Sig Proc Mag.*

10. <a name="cohen-steerable"></a> Cohen, T.S. & Welling, M. (2017) [Steerable CNNs](https://openreview.net/forum?id=rJQKYt5ll). *In Conference on Learning Representations.*

11. <a name="maile2022towards"></a> Maile, K., Wilson, D.G. & Forre, P. (2022) [Towards Architectural Optimization of Equivariant Neural Networks over Subgroups](https://openreview.net/forum?id=KJFpArxWe-g). *NeurIPS Workshop on Symmetry and Geometry in Neural Representations.*

12. <a name="equivariant-nas"></a>Maile, K., Wilson, D.G. & Forre, P. (2023) [Equivariance-aware Architectural Optimization of Neural Networks](https://openreview.net/forum?id=a6rCdfABJXg). *In Conference on Learning Representations.*


## Appendix

### The Hypotheses Class of Convolutional Networks

This section justifies our earlier claim that hypotheses generated by CNNs are, by construction, especially suited for tasks involving instances related by some topological (locality) and compositional structure (complex objects arise from simpler objects).
These two properties arise, respectively, from two mathematical operations: signal convolution and function composition.

Recall that any convolutional network factors as the composition

$$
\begin{aligned}
\mathrm{CNN}(x) &= a^{(n)}(\phi^{(n)}_{k_n}(a^{(n-1)}(\phi^{(n-1)}_{k_{n-1}}(\cdots a^{(1)}(\phi^{(1)}_{k_1}(x)) \cdots)))) \\
&= (a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(1)} \circ \phi^{(1)}_{k_1})(x),
\end{aligned}
$$

where $a^{(j)}: \mathcal{F}^{(j)} \to \mathcal{F}^{(j)}$ denotes a fixed nonlinear map defined on the space $\mathcal{F}^{(j)}$ of permissible signals at layer $j$ of the network, and where $\phi^{(j)}_{k_j}: \mathcal{F}^{(j)} \to \mathcal{F}^{(j)}$ is the convolution product $f^{(j)} \star k_j$ of the incoming signal $f^{(j)}$ at the $j$-th layer with a learned kernel $k_j$.
From this definition of a CNN, we observe that the output of each $\phi^{(j)}_{k_j}$ is another signal in $\mathcal{F}^{(j)}$, defined on the same domain $\mathbb{Z}^2$, whose value at each $x \in \mathbb{Z}^2$ equals the *inner product* $\langle f, \mathcal{L}_x k \rangle$ of $f$ with $\mathcal{L}_x k \in \mathcal{F}$, which is precisely the kernel $k$ translated by $x$.
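
The factorization can be sketched in a few lines of NumPy. This is an illustrative toy, not the architecture trained in this project: it assumes a single channel, periodic boundary conditions (so every layer maps a grid to a grid of the same size), and ReLU as the fixed nonlinearity $a^{(j)}$. The names `phi`, `relu`, and `cnn` are introduced here for the example.

```python
import numpy as np

def phi(f, k):
    """Convolution layer with periodic boundary: out[z] = sum_u f[u] * k[u - z]."""
    out = np.zeros(f.shape, dtype=float)
    for di in range(k.shape[0]):
        for dj in range(k.shape[1]):
            # np.roll(f, -d)[z] equals f[z + d], so this adds k[d] * f[z + d].
            out += k[di, dj] * np.roll(f, shift=(-di, -dj), axis=(0, 1))
    return out

def relu(f):
    """Fixed pointwise nonlinearity a^{(j)} (assumed to be ReLU here)."""
    return np.maximum(f, 0.0)

def cnn(x, kernels):
    """Compose a^{(n)} . phi^{(n)} . ... . a^{(1)} . phi^{(1)} applied to x."""
    for k in kernels:
        x = relu(phi(x, k))
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
kernels = [rng.normal(size=(3, 3)) for _ in range(2)]
y = cnn(x, kernels)
```

Each pass through the loop applies one learned linear map followed by one fixed nonlinearity, so the loop body corresponds to a single factor $a^{(j)} \circ \phi^{(j)}_{k_j}$.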

### Steerable CNNs

Key intuition for understanding (and improving on) the success of CNNs can be gained through the lens of well-charted concepts in higher mathematics, some of which we elaborate here. To begin, consider the problem of learning a task that operates on 2D images, where the map $f : ℤ^2 \to ℝ^K$ assigns to each position $z \in ℤ^2$ the pixel intensities $f(z) \in ℝ^K$. By construction, any hypothesis learned by a CNN maps inputs to outputs by operating on successive latent spaces of **feature maps**, which likewise have the structure of a map $f^{(l)} : ℤ^2 \to ℝ^{K'}$ assigning $K'$ channels to each point $z \in ℤ^2$.

Our reason for viewing feature maps as maps from the integer plane to vector-valued channels will become apparent when we recount the general theory of steerable CNNs, wherein $f^{(l)}$ is viewed as a *bundle* of so-called *fibers* (suggesting a connection between feature maps and the theory of fiber bundles studied in differential geometry and differential topology) consistent with a steerability constraint first described in [Cohen & Welling, 2017](#cohen-steerable). For now, recall the defining formula whereby one representation is used to compute its successor,

$$
\begin{equation}
f^{(l+1)}(z) = ϕ^{(l)}[f^{(l)}](z) := (f^{(l)} \star k)(z) := \sum_{u \in ℤ^2} f^{(l)}(u) \, k(u - z) \tag{1}
\end{equation}
$$

and observe that $ϕ^{(l)}$ defines a linear operator between feature maps. Consistent with earlier works in the steerable networks literature, we use **feature space** to denote the domain $\mathcal{F}^{(l)}$ and codomain $\mathcal{F}^{(l+1)}$ of the map $ϕ^{(l)}$. Observe also that $ϕ^{(l)}$ produces representations $f^{(l+1)}$ that take greater magnitude at points $z$ where the **kernel** $k$ (or *filter*) is similar to the neighborhood of $z$ in the preceding feature map $f^{(l)}$. Here, the degree of similarity between a kernel $k$ and a feature map $f^{(l)}$ is made precise by the connection between convolution $⋆$, which composes signals to form other signals, and the theory of inner-product spaces <a name="ref:1" href="#axler">[7]</a>. As a linear map in $\text{Hom}(\mathcal{F}^{(l)}, \mathcal{F}^{(l+1)})$, each convolution layer $ϕ^{(l)}$ is chosen from a linear span of functions with convolutional structure, with the entries of $k$ serving as the scalar coefficients of a linear combination of trivial basis elements whose sum equals $ϕ^{(l)}$. Since $ϕ^{(l)}$ is defined as convolution with $k$, it produces latent representations that are said to "equivary" under translation.

In general, a map $f: X \to Y$ is said to be **equivariant** under translation when the equality,

$$
𝘛^Y_t ( f(x) ) = f( 𝘛^X_t (x) )
$$

holds for all translations $𝘛^X_t: X \to X$ and $𝘛^Y_t: Y \to Y$ by the same $t \in ℝ^n$. Equivalently, one says that $f$ *commutes* with $𝘛$. Convolution in the usual form used by LeNet does not guarantee this condition for larger families of identity-preserving transformations, such as rotations or dilations of instances.
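
The commutation property can be checked numerically. The sketch below is an assumption-laden toy rather than the project's test suite: it uses a single channel on a periodic 6×6 grid so that translations are exact, a random 3×3 kernel, and a 90° rotation (`np.rot90`) as the example of a transformation outside the translation group. The helper names `phi` and `translate` are introduced for this example.

```python
import numpy as np

def phi(f, k):
    """Periodic form of Eq. (1): out[z] = sum_u f[u] * k[u - z]."""
    out = np.zeros(f.shape, dtype=float)
    for di in range(k.shape[0]):
        for dj in range(k.shape[1]):
            out += k[di, dj] * np.roll(f, shift=(-di, -dj), axis=(0, 1))
    return out

def translate(f, t):
    """Translation T_t acting on a feature map by shifting its grid."""
    return np.roll(f, shift=t, axis=(0, 1))

rng = np.random.default_rng(0)
f = rng.normal(size=(6, 6))   # random feature map
k = rng.normal(size=(3, 3))   # random (non-symmetric) kernel
t = (2, -1)                   # arbitrary translation

# Equivariance under translation: T_t(phi(f)) == phi(T_t(f)).
translation_commutes = np.allclose(phi(translate(f, t), k),
                                   translate(phi(f, k), t))

# The same test with a 90-degree rotation in place of the translation.
rotation_commutes = np.allclose(phi(np.rot90(f), k),
                                np.rot90(phi(f, k)))
```

For a generic kernel the first check holds while the second fails, which is the gap that steerable architectures are designed to close by constraining the kernel.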