<center>

---


<h1>
Steerable Inductive Biases for Reconstructing Flight Maneuvers
</h1>

Clayton Curry
 
---

[![Blog: Flight Maneuvers](https://img.shields.io/badge/Blog-Flight%20Maneuvers-blue)](https://claycurry.com/blog/maneuvers)
[![Flask Project Demo](https://img.shields.io/badge/Flask-Project%20Demo-ff69b4)](https://claycurry.com/blog/maneuvers)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13U6SEHBlYxXKCvmLBXeERzIb5dt2lEfA?usp=sharing)
[![Github Repo](https://img.shields.io/static/v1.svg?logo=github&label=repo&message=see%20project&color=blueviolet)](https://github.com/clay-curry/Flight-Maneuvers) 

</center>



<center>
<h2>Abstract</h2>
</center>

The endeavor to model physical systems using deep neural networks has motivated much work on the problem of how to constrain the space of admissible hypotheses to be consistent with desired symmetries, physical laws, or other priors. 
In recent years, such physically motivated inductive biases have gained considerable traction through so-called "steerable" neural networks, a theory that leans on well-charted aspects of group theory, Lie groups, and linear representations, which together provide an exact mathematical framework for the principled construction of transformations that leave certain systems or physical quantities invariant.
Despite significant theoretical advances in such methods, and their promise of neural architectures with greater parameter and data efficiency, many application gaps remain for steerable neural networks.
To help close one gap, our work provides a comparative analysis of networks trained to discriminate categorical flight maneuvers on the basis of observed flight trajectories.
Using cross-validation, we expect to obtain data supporting the hypotheses that $SE(2)$ and $SE(3)$ steerable networks are indeed more efficient (data and parameter) than traditional convolutional networks, which guarantee equivariance under translations but not rotations. 
Our work could also give evidence to suggest that $SE(3)$-steerable networks, which guarantee equivariance to isometries, remain sufficiently expressive to discriminate instances of 'takeoff' and 'landing' maneuvers, despite the known equivalence of these two instances up to actions of $SE(3)$.

## 1. Introduction

Learning algorithms are a subset of artificial intelligence founded on the assumption that many everyday tasks have the structure of an optimization problem.
Under this assumption, mastering an everyday task is equivalent to knowing how to select a *hypothesis* from a set of candidate solutions. 
On countless benchmarks, learning algorithms have greatly outstripped the capabilities of ordinary humans, often with the help of dedicated software libraries. [[2](#cesa-22)], [[3](#e3nn)], [[4](#pytorch)]
Convolutional neural networks (CNNs), in particular, have stood for over three decades as the modus operandi for solving visual tasks, and they continue to shape ordinary lives through devices, safety features, and so on.
Unlike fully connected networks, CNNs possess a key property to which many have attributed their success, namely, an inductive bias for hypotheses that elicit similar behavior on similar instances under slightly different poses. [[1](#big3)]
The utility of this inductive bias manifests as a rich function class with fewer bad hypotheses, where finding an optimal hypothesis requires significantly fewer training examples. [[5](#gdl)]
While learning generic functions in high dimensions is a cursed estimation problem, the remarkable success of CNNs [[6](#poggio-17)] suggests that many daily tasks give rise to signals distributed about low-dimensional substructures embedded in high (or infinite) dimensional spaces of signals [[5](#gdl)]. 





Hypotheses generated by CNNs are, by construction, especially suited for tasks involving instances related by some topological (locality) and compositional structure. 
This tendency is naturally admitted, respectively, by two mathematical operations: signal convolution and function composition.
For an extended discussion on the unreasonable effectiveness of CNNs, we refer the reader to our appendix.
Because the same set of weights is used throughout the input, convolutions are said to be **equivariant** to transformations from the translation group $ℝ^n$. 
In general, functions with equivariance to certain transformations attempt to capture distinct qualities agreeing with the doctrine of "observer-independent quantities" put forward by Hermann Weyl in [[15](#weyl)] while studying problems related to fundamental questions in physics.



Slightly generalizing the notion of equivariance, a convolutional network (or intermediate layers) is said to be ***steerable***  (or steered by a group $G$ of transformations) whenever inputs and outputs are related by a representation $\rho : G \to GL(n)$ satisfying
$$
\rho(g) f(x) = f(g \cdot x)
$$
for all group elements $g \in G$.
Intuitively, this property asserts that whenever the inputs $x$ are steered in some direction (rotated by some angle), the resulting feature map is precisely equal to the original feature map (computed before the input features were transformed) steered by the same angle. [[10](#cohen-steerable)]
In the same way that feature maps in CNNs equivary under translation, feature maps in so-called $G$-steerable networks equivary under the actions of an arbitrary transformation group $G$. [[2](#cesa-22)], [[23](#homogeneous)], [[25](#wigner)]
Having intermediate layers that equivary is thought to enable greater object-centricity and stability of activations under identity-preserving transformations. [[1](#big3)]
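
The steerability condition $\rho(g) f(x) = f(g \cdot x)$ can be checked numerically on a toy example. The sketch below is illustrative only: the group $G = SO(2)$, the choice of $\rho(g)$ as the rotation matrix itself, and the maps `radial_map` and `relu_map` are our own assumptions and are not layers of any network described in this work.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix, used here both as the group action and as rho(g)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def radial_map(x):
    """Toy nonlinear map that rescales a vector by its own norm."""
    return np.linalg.norm(x) * x

def relu_map(x):
    """Coordinate-wise ReLU, which depends on the choice of axes."""
    return np.maximum(x, 0.0)

x = np.array([1.0, -2.0])
R = rotation(0.7)

# Steerable: rotating the input rotates the output by the same angle.
radial_is_steerable = np.allclose(R @ radial_map(x), radial_map(R @ x))
# Not steerable under this rho: the two sides disagree for this input.
relu_is_steerable = np.allclose(R @ relu_map(x), relu_map(R @ x))
```

The second check illustrates why steerable architectures must choose their nonlinearities with care: a map that is natural in one coordinate frame need not commute with a change of frame.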

## 2. Related Works

### Analysis of Flight Maneuvers
The recently proposed Maneuver Identification Challenge [[13](#maneuver-id)][[14](#maneuver-id2)] provides thousands of trajectories of Air Force pilots and trainees flying in virtual reality simulators.  In an effort to enable AI coaching and automatic maneuver grading in pilot training, the Air Force seeks to automatically identify and label each maneuver flown in this dataset from a catalog of around 30 maneuvers. 
For the sake of timeliness, we elected instead to use our own flight trajectories, simulated with in-house software.



### Steerability

The first convolutional network explicitly designed to have steerable layers was introduced in [[10](#cohen-steerable)], which presented a general theory for constructing networks steered by discrete, finite groups. Within a year, a theory for the larger case of *compact* groups (e.g., rotations and rototranslations) was independently developed by three distinct groups of authors [[17](#spherical-cnns)], [[18](#tfn)], [[19](#kondor-compact)]. The first work [[17](#spherical-cnns)] was motivated by the problem of constructing an analogue of CNNs that operates on features defined on spherical objects (or manifolds, more generally) rather than the flat plane, for situations where rotational symmetry on the sphere is known a priori. The second work [[18](#tfn)] was motivated by problems involving point clouds, wherein features are attached to a finite number of points in space and have known symmetries under translations and rotations. The third paper [[19](#kondor-compact)] is a theoretical exposition of results that must hold in every neural network with a known symmetry; its primary contribution is a theorem showing that (under some reasonable constraints) convolutional structure is not just a sufficient but also a necessary condition for constructing linear layers with equivariance. All three works, and a number of follow-on works, make use of concepts from representation theory and noncommutative harmonic analysis. [[20](#kondor-cgn) - [27](#equivariant-nas)].

### Neural Architecture Search

Given an evaluation criterion, Neural Architecture Search (NAS) methods fundamentally strive for a Pareto-optimal tradeoff between model performance and the latency incurred while navigating a space of admissible model architectures and training routines. [[30](#nas-meta)]
Over the decades, many different search strategies have been proposed, including random search, Bayesian optimization, evolutionary methods, reinforcement learning (RL), and gradient-based methods.
Yet a recent survey [[29](#hal)] conducted at NeurIPS 2019 and ICLR 2020 found that 80% of NeurIPS papers and 88% of ICLR papers tuned their ML model hyperparameters using manual tuning, random search, or grid search.

At the time of writing, we are aware of only one group that has put forth a method [[11](#equivariant-nas1)][[12](#equivariant-nas2)] for conducting NAS over equivariant and steerable neural networks.

## 3. Approach


### Problem Formulation

From prior knowledge, we assume the existence of a direct correspondence between the trajectory traced by an aircraft and the actions of the pilot, who is trained to follow one of several basic flight maneuvers. 
Under this assumption, any observed flight trajectory sampled at a fixed rate is characterized by a relation $(x[t], c) ↦ x[t+1]$, wherein each **state** $x[t] \in X \equiv ℝ^n$ combines with a **maneuver** $c \in C$ to produce a succeeding state $x[t+1]$ at each time step.

Given an i.i.d. dataset of sequence pairs $(x_n, c_n) \sim F$, where $F$ is a (stochastic) distribution over $X^* \times C^*$, our goal is to learn an estimate $\hat F$ of the conditional distribution $F(c | x) = \text{Pr}(c | x)$. 
To obtain this estimate, we factor $\hat F$ into a product of per-timestep distributions that are assumed to be conditionally independent given $x$,

$$
\hat F(c | x) = \hat F_\theta(c[1] | x) \cdot \hat F_\theta(c[2] | x) \cdots \hat F_\theta(c[\ell] | x)
$$

where $\theta$ consists of all neural network parameters to be learned and $\hat F_\theta(c[k] | x)$ is a real-valued vector of length $|C|$ with components indicating the probability mass for $c[k] \in C$ at each timestep $k$.
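
To make the factorization concrete, the following sketch evaluates the likelihood of a label sequence as a product of per-timestep probabilities (equivalently, a sum of log-probabilities). It assumes, purely for illustration, that the network emits one vector of unnormalized scores (logits) per timestep and that a softmax turns each into a probability vector of length $|C|$; the numbers and function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn one timestep's logits into a probability vector over the maneuvers in C."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_likelihood(logits_per_step, labels):
    """Sum of per-timestep log-probabilities, i.e. the log of the factorized product."""
    return sum(
        math.log(softmax(step_logits)[label])
        for step_logits, label in zip(logits_per_step, labels)
    )

# Hypothetical network outputs for 3 timesteps and |C| = 3 maneuvers.
logits_per_step = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0], [0.0, 0.0, 3.0]]
labels = [0, 1, 2]

log_likelihood = sequence_log_likelihood(logits_per_step, labels)
likelihood = math.exp(log_likelihood)
```

Working with the sum of logarithms rather than the raw product avoids numerical underflow on long sequences, and its negative is the usual cross-entropy training objective.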



### Architectural Components



Networks based on residual connections [[27](#resnet)] have been foundational for training neural networks with more than 1,000 layers. Despite its simplicity, the idea of residual connections is highly effective because it mitigates vanishing and exploding gradients, supporting stable gradient propagation throughout the network. Instead of modeling $x_{l+1}=F^{(l)}(x_{l})$, we model $x_{l+1}=x_{l}+F^{(l)}(x_{l})$, where $F^{(l)}$ is a non-linear mapping learned during training. Differentiating such a residual connection during backpropagation gives:

$$
\frac{dx_{l+1}}{dx_{l}} = \mathbf{I} + \frac{dF^{(l)}(x_l)}{dx_l}
$$


Initializing the weights of $F^{(l)}$ close to zero biases this Jacobian towards the identity matrix, providing a stable propagation of errors throughout the network. Many variants of ResNet have been proposed, which mostly concern the function $F^{(l)}$ or the operations applied to the sum.
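
The identity term in this Jacobian can be verified numerically. The sketch below assumes a hypothetical residual function $F(x) = \tanh(Wx)$ with small random weights (our choice for illustration, not the residual function used in our models) and compares a finite-difference Jacobian with the analytic expression $\mathbf{I} + dF/dx$.

```python
import numpy as np

def residual_block(x, W):
    """Toy residual layer x + F(x) with F(x) = tanh(W x)."""
    return x + np.tanh(W @ x)

def numerical_jacobian(fn, x, eps=1e-6):
    """Central finite-difference estimate of the Jacobian of fn at x."""
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        J[:, i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = 0.01 * rng.normal(size=(4, 4))  # small weights, as at a near-zero initialization

J = numerical_jacobian(lambda v: residual_block(v, W), x)
dF = (1.0 - np.tanh(W @ x) ** 2)[:, None] * W  # chain rule for tanh(W x)
analytic = np.eye(4) + dF
```

With weights this small, `J` differs from the identity matrix only by entries on the order of the weights themselves, which is the sense in which error signals pass through the block nearly unchanged.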

<center>
<img alt="ResNet block diagram" class="no-scaled-link" src="https://github.com/PyTorchLightning/lightning-tutorials/raw/main/course_UvA-DL/04-inception-resnet-densenet/resnet_block.svg" width="60%">
</center>



The original ResNet block applies a non-linear activation function, usually ReLU, after the skip connection. In contrast, the pre-activation ResNet block applies the non-linearity at the beginning of $F$. Both have their advantages and disadvantages. For very deep networks, however, the pre-activation ResNet has been shown to perform better, as the gradient flow is guaranteed to contain the identity matrix as calculated above and is not harmed by any non-linear activation applied to it.

#### ResNet 1D

For baseline comparisons, a one-dimensional ResNet was successfully trained to reconstruct maneuvers with test error slightly below 8%. 
Our baseline solution was a static architecture following the blockwise structure described in ResNet [[27](#resnet)].
Each of the 300 layers transformed features using a learnable residual function composed of a convolution (one-dimensional, 32 channels, kernel size 3, without downsampling), batch norm, a ReLU activation, another convolution, and batch norm; the result was summed with the block's inputs and finally passed through a ReLU to produce the outputs of the block.
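
A minimal PyTorch sketch of one such block is given below. It is an illustration of the description above rather than the code used in our experiments: the class name `ResidualBlock1d`, the use of zero padding to preserve sequence length, and the omission of convolution biases are all assumptions.

```python
import torch
from torch import nn

class ResidualBlock1d(nn.Module):
    """conv -> batch norm -> ReLU -> conv -> batch norm, add the input, final ReLU."""

    def __init__(self, channels: int = 32, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2  # assumed: keeps the number of time steps unchanged
        self.residual = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=padding, bias=False),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=padding, bias=False),
            nn.BatchNorm1d(channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection: the block learns a correction to its own input.
        return self.activation(x + self.residual(x))

block = ResidualBlock1d()
features = torch.randn(8, 32, 100)  # (batch, channels, time steps)
out = block(features)
```

Because the block preserves both the channel count and the sequence length, many copies can be stacked with `nn.Sequential` without any reshaping between them.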

####  $SO(3) \ltimes \mathbb{R}^n$ Steerable CNN

To motivate our decision to elect steerability as an inductive bias for our learning algorithm, we make the critical assertion that whatever hypothesis $\hat F$ is ultimately learned, it should give inferences that are independent of our choice of reference frame. 
If true, this property reflects a particular quality of our true distribution $F$, namely that it belongs to the class of *$G$-invariant mappings* [[8](#serre)], [[15](#weyl)], where $G$ is equal to the semidirect product $SO(3) \ltimes \mathbb{R}^n$ with $n$ equal to the dimension of the state space $X \equiv \mathbb{R}^n$.
This means that our hypothesis $\hat F$ is invariant to all spatial transformations associated with a pair $(r, t)$, where the first element $r \in SO(3)$ belongs to the *special orthogonal group* of orientation-preserving transformations that preserve the unit sphere, i.e., rotations, and where $t \in \mathbb{R}^n$ belongs to the set of transformations translating $\mathbb{R}^n$ by the vector $t$.
In different contexts, $SO(3) \ltimes \mathbb{R}^n$ is identified with the so-called group of *roto-translations*, the group of orientation-preserving *isometries* ('iso'-same, 'metry'-distance), or the so-called *special Euclidean group* $SE(3)$ (the full Euclidean group $E(3)$ additionally contains reflections).

Stated in linear algebraic terms, to say that inferences are observer independent is to say that $\hat F$ remains fixed whenever inputs are transformed by any pair $(r, t) : x \mapsto R_r x + b_t$, where $R_r \in \mathbb{R}^{n \times n}$ is the rotation matrix associated with $r \in SO(3)$ and $b_t$ is the vector associated with translation by $t$ (abbreviated $R$ and $b$ below).

In other words, imposing steerability to $SO(3) \ltimes \mathbb{R}^n$ requires that our network satisfy,

\begin{align*}
\hat F_\theta(x) &\approx F(c | x) \\
&= F( c | (\ldots, x[t-1], x[t], x[t+1], \ldots)) \\ 
&= F( c | (\ldots, Rx[t-1] + b, Rx[t] + b, Rx[t+1] + b, \ldots)) \\
&= F( c | R (\ldots, x[t-1], x[t], x[t+1], \ldots) + b) \\ 
&= F(c | Rx + b) \\
&\approx \hat F_\theta(Rx + b)
\end{align*}

for any given rotation matrix $R$ (orthogonal matrix with determinant one) and translation vector $b \in \mathbb{R}^n$.
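
As a sanity check on what invariance under $x \mapsto Rx + b$ means for trajectory data, the sketch below builds a hand-crafted feature that provably has this property: the Gram matrix of consecutive displacement vectors. This feature is our own illustrative choice and is not the construction used by the steerable networks; the toy trajectory is random.

```python
import numpy as np

def random_rotation(rng):
    """Draw a 3x3 rotation matrix (orthogonal, determinant +1) via QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]  # flip one axis so that Q lands in SO(3)
    return Q

def invariant_features(traj):
    """Pairwise dot products of consecutive displacements of a (time, 3) trajectory."""
    steps = np.diff(traj, axis=0)  # the translation b cancels in the differences
    return steps @ steps.T         # the rotation cancels because R^T R = I

rng = np.random.default_rng(0)
traj = rng.normal(size=(10, 3)).cumsum(axis=0)  # toy trajectory, one state per row
R, b = random_rotation(rng), rng.normal(size=3)
moved = traj @ R.T + b  # applies x[t] -> R x[t] + b to every state

unchanged = np.allclose(invariant_features(traj), invariant_features(moved))
```

Any classifier that consumes only such features is invariant by construction, which also means it cannot separate two trajectories that differ only by a rotation.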

In the section that follows, we describe an important exception to this equality, namely for the 'takeoff', 'landing', and 'travel' maneuvers. In fact, this exception reveals that $F$ is only independent of a subclass of changes in reference frame, namely, those that preserve the *pitch* (the direction of the 'z' axis) of the observer. In our experiments, we will attempt to verify this negative quality of $F$ by testing whether $SO(3) \ltimes \mathbb{R}^n$-invariant CNNs give a function class that remains sufficiently expressive to distinguish 'takeoff', 'landing', and 'travel' maneuvers. 

A full derivation for constructing a CNN of this type is provided in the Appendix, borrowing much inspiration from [[2](#cesa-22)] and [[25](#wigner)].



#### $SO(2) \ltimes \mathbb{R}^n$ Steerable CNN

To understand why $SO(3) \ltimes \mathbb{R}^n$-steerable networks might not be sufficiently expressive to discriminate 'takeoff', 'traveling', and 'landing' maneuvers, we give the following justification. In a sufficiently small window of observation, a 'takeoff' maneuver has the same structure as a 'traveling' or 'landing' maneuver observed in another sufficiently small window, up to some rotation $R$ that changes the *pitch* of our observed states. This fact alone violates our previous assumption that inferences produced by $F$ are observer independent.

We therefore hypothesize that $F$ defines a function invariant to actions of $SO(2) \ltimes \mathbb{R}^n$, whose elements are identified with rotations about a fixed line in three dimensions, followed by translation in any direction. In particular, our third network will be chosen from a hypothesis class consisting of maps that are invariant only to rotations about the vertical $z$-axis followed by translations.
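
The distinction between the two symmetry groups can be illustrated on a toy straight-line climb. In the sketch below, the coordinate convention (x forward, z up), the trajectory, and the feature `climb_per_step` are assumptions made for illustration. A rotation about the $z$-axis followed by a translation leaves the climb unchanged, whereas a pitch rotation about the $y$-axis maps the climb onto a descent.

```python
import numpy as np

def rot_z(theta):
    """Rotation about the vertical z-axis (a change of heading)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(theta):
    """Rotation about the y-axis (a change of pitch when x points forward)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def climb_per_step(traj):
    """Change in altitude between consecutive states of a (time, 3) trajectory."""
    return np.diff(traj[:, 2])

t = np.linspace(0.0, 1.0, 6)
takeoff = np.stack([t, np.zeros_like(t), 0.5 * t], axis=1)   # steady climb
landing = np.stack([t, np.zeros_like(t), -0.5 * t], axis=1)  # steady descent

# Element of SO(2) x R^3: new heading and new origin, same climb profile.
yawed = takeoff @ rot_z(1.2).T + np.array([3.0, -1.0, 2.0])
# Element of SO(3) outside that subgroup: pitching by twice the climb angle.
pitched = takeoff @ rot_y(2.0 * np.arctan(0.5)).T

yaw_preserves_climb = np.allclose(climb_per_step(yawed), climb_per_step(takeoff))
pitch_gives_landing = np.allclose(pitched, landing)
```

Since `pitched` coincides with `landing`, any map that is invariant to all of $SO(3)$ must assign these two short windows the same label, while a map invariant only to rotations about the $z$-axis is free to tell them apart.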


## 4. Findings

Our experiments sought to substantiate the claims that steerable networks can achieve the same degree of task acquisition with fewer parameters and fewer training examples than traditional convolutional networks. [[20](#gcnn)] Validating such claims requires a careful experimental investigation of both types of networks on datasets of varying sizes.

To assess how well each model could master the given task under limited training data, we conducted five independent trials (manually setting a new RNG seed between trials), fitting each model using training sets of 1, 5, 10, 50, and 100 examples. Models were trained with early stopping, with the patience arbitrarily set to 5 consecutive epochs without an increase in validation accuracy. 
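
The protocol above can be summarized in a few lines. The sketch below is a schematic under stated assumptions: the seed values, the function name `stopping_epoch`, and the validation history are hypothetical, and "increase" is interpreted as exceeding the best validation accuracy seen so far.

```python
from itertools import product

TRAIN_SIZES = (1, 5, 10, 50, 100)
SEEDS = range(5)  # one seed per independent trial (values are illustrative)
PATIENCE = 5

def stopping_epoch(val_accuracies, patience=PATIENCE):
    """Return the 1-based epoch at which training halts, or None if it never does."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a new best validation accuracy
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Every (seed, training-set size) combination is one training run.
runs = [{"seed": s, "train_size": n} for s, n in product(SEEDS, TRAIN_SIZES)]

# Hypothetical validation accuracies: the best value (0.70) is first reached at epoch 3.
history = [0.50, 0.62, 0.70, 0.69, 0.70, 0.68, 0.70, 0.66, 0.71]
halted_at = stopping_epoch(history)
```

Under this interpretation, a tie with the best accuracy does not reset the counter, so the hypothetical run halts before reaching its final epoch even though that epoch would have set a new best.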

For our baseline method, we obtained evidence suggesting a negative relationship between the number of training examples and the resulting test error (one minus categorical accuracy) of trained models. In the best case, a baseline model trained on 100 examples achieved a test error averaging below 8%, which suggests a high level of task acquisition and supports the appropriateness of our study design.

## 5. Discussion

Our research project aimed to explore the potential benefits of using steerable networks compared to traditional convolutional networks in terms of task acquisition, parameter efficiency, and data efficiency. Although we were unable to complete the study as planned due to time constraints, we were able to develop and provide a comprehensive software package called `flight_maneuvers`, which is available on the Python Package Index. This package offers a range of functionalities including data standardization, subroutines, classes, and software tests that enable users to instantiate and train models, generate statistics, and visualize the network topology.

Despite the incomplete nature of our study, the `flight_maneuvers` package offers a starting point for exploring network performance and behavior. While our findings are limited, we anticipate that the package will provide a valuable foundation for others to build upon, ultimately contributing to a deeper understanding of the potential benefits of steerable networks in various applications.


## 6. Future Work

## References

1. <a name="big3"></a> LeCun, Y., Bengio, Y. & Hinton, G. (2015) [Deep learning](https://www.nature.com/articles/nature14539). *Nature, vol. 521, no. 7553, pp. 436–444.*

2. <a name="cesa-22"></a>Cesa, G., Lang, L. & Weiler, M. (2022) [A Program to Build E(n) Equivariant Steerable CNNs](https://openreview.net/forum?id=WE4qe9xlnQw). *In Conference on Learning Representations.*

3. <a name="e3nn"></a> Geiger, M., Smidt, T., *Et al.* (2022) [Euclidean neural networks: e3nn](https://doi.org/10.5281/zenodo.6459381). *Zenodo.*

4. <a name="pytorch"></a> Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). [PyTorch: An Imperative Style, High-Performance Deep Learning Library](https://dl.acm.org/doi/10.5555/3454287.3455008). *In Advances in Neural Information Processing Systems 32 (pp. 8024–8035).*

5. <a name="gdl"></a> Michael M. Bronstein and Joan Bruna and Taco Cohen and Petar Veličković. (2021) [Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges](https://arxiv.org/abs/2104.13478). *Preprint.*

6. <a name="poggio-17"></a>Poggio, T., Mhaskar, H., Rosasco, L. et al. (2017) [Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review](https://doi.org/10.1007/s11633-017-1054-2). *Int. J. Autom. Comput. 14, 503–519.*

7. <a name="axler"></a>Axler, S. (2015) [Linear Algebra Done Right](https://ou-primo.hosted.exlibrisgroup.com/permalink/f/7erelc/TN_cdi_askewsholts_vlebooks_9783319110806). *Springer, Undergraduate Texts in Math, Third ed.*

8. <a name="serre"></a>[Serre, J.P.](https://en.wikipedia.org/wiki/Jean-Pierre_Serre). (1977) [Linear Representations of Finite Groups](https://link.springer.com/book/10.1007/978-1-4684-9458-7?view=modern&page=2). *Springer, Graduate Texts in Math.*

9. <a name="gdl-old"></a>Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A. & Vandergheynst, P. (2017) [Geometric deep learning: going beyond Euclidean data](https://arxiv.org/pdf/1611.08097.pdf). *IEEE Sig Proc Mag.* 

10.  <a name="cohen-steerable"></a> Cohen, T.S., & Welling,  M. (2017) [Steerable CNNs](https://openreview.net/forum?id=rJQKYt5ll). *In Conference on Learning Representations.*

11.   <a name="equivariant-nas1"></a> Maile, K., Wilson, D.G., & Forre, P. (2022) [Towards Architectural Optimization of Equivariant Neural Networks over Subgroups](https://openreview.net/forum?id=KJFpArxWe-g), *NeurIPS Workshop on Symmetry and Geometry in Neural Representations.*

12.   <a name="equivariant-nas2"></a>Maile, K., Wilson, D.G., & Forre, P. (2023) [Equivariance-aware Architectural Optimization of Neural Networks](https://openreview.net/forum?id=a6rCdfABJXg). *In Conference on Learning Representations.*

13. <a name="maneuver-id"></a> Samuel, K. *Et al.* (2021) [Maneuver Identification Challenge](https://maneuver-id.mit.edu/) MIT-USAF AI Accelerator.

14. <a name="maneuver-id2"></a> Samuel, K., *Et al.* (2022) [AI Enabled Maneuver Identification via the Maneuver ID Challenge](https://arxiv.org/ftp/arxiv/papers/2211/2211.15552.pdf) 2022 Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC).

15. <a name="weyl"></a> Weyl, H. (1939) [The classical groups: their invariants and representations](https://press.princeton.edu/books/paperback/9780691057569/the-classical-groups) *Princeton University Press*.

16. <a name="kantani"></a> Kanatani, K. (1990) [Group-Theoretical Methods in Image Understanding](https://link.springer.com/book/10.1007/978-3-642-61275-6). *Springer Series in Information Science.*

17. <a name="spherical-cnns"></a> Cohen, T., Geiger, M., Kohler, J. & Welling, M. (2018). [Spherical CNNs](https://arxiv.org/pdf/1801.10130.pdf). *In Conference on Learning Representations.*

18. <a name="tfn"></a> Thomas, N. *Et al.* (2018). [Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds](https://arxiv.org/abs/1802.08219). *arXiv*

19. <a name="kondor-compact"></a> Kondor, R. & Trivedi, S. (2018). [On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups](http://proceedings.mlr.press/v80/kondor18a/kondor18a.pdf). *International Conference on Machine Learning.*

20. <a name="gcnn"></a> Cohen, T.S. & Welling, M. (2016) [Group Equivariant Convolutional Networks](https://arxiv.org/abs/1602.07576). *Proceedings of the International Conference on Machine Learning*.

21. <a name="kondor-cgn"></a> Kondor, R., Lin, Z., & Trivedi, S. (2018). [Clebsch-Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network](https://arxiv.org/abs/1806.09231). *Neural Information Processing Systems (NIPS), Montreal, Canada*.

22. Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, Taco S. Cohen (2018) [3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data](https://proceedings.neurips.cc/paper/2018/hash/488e4104520c6aab692863cc1dba45af-Abstract.html). *Advances in Neural Information Processing Systems*.

23. <a name="homogeneous"></a> Cohen, T., Geiger, M. & Weiler, M. (2019) [A General Theory of Equivariant CNNs on Homogeneous Spaces](https://arxiv.org/abs/1811.02017). *Advances in Neural Information Processing Systems*.

24. <a name="e2"></a> Weiler, M. & Cesa, G. (2019) [General E(2)-Equivariant CNNs](https://arxiv.org/abs/1911.08251). *Advances in Neural Information Processing Systems*.

25. <a name="wigner"></a> Lang, L. & Weiler, M. (2021) [A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels](https://arxiv.org/abs/2010.10952). *In Conference on Learning Representations.*

26. Brandstetter, J., Hesselink, R., Pol, E.v.d., Bekkers, E.J., & Welling, M. (2022) [Geometric and Physical Quantities Improve E(3) Equivariant Message Passing](https://arxiv.org/pdf/2110.02905.pdf). *In Conference on Learning Representations.*

27. <a name="resnet"></a> Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2016) [Deep Residual Learning for Image Recognition](https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html). *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*.

28. <a name="transformer"></a> Vaswani, A. *Et al.* (2017) [Attention Is All You Need](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). *Advances in Neural Information Processing Systems*.

29. <a name="hal"></a> Bouthillier, X., & Varoquaux, G. (2020) [Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020](https://hal.science/hal-02447823/document). [Research Report] Inria Saclay Ile de France.

30. <a name="nas-meta"></a> Eriksson, D. *Et al.* (2021) [Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization](https://research.facebook.com/blog/2021/07/optimizing-model-accuracy-and-latency-using-bayesian-multi-objective-neural-architecture-search/). *ICML Workshop on Automated Machine Learning*.


## Appendix

### The Unreasonable Effectiveness of Convolutional Networks

This section justifies an earlier claim that hypotheses generated by CNNs are, by construction, especially suited for tasks involving instances related by some topological (locality) and compositional structure (complex objects arise from simpler objects).
This tendency is admitted, respectively, by two mathematical operations: signal convolution and function composition.

Recall that any convolutional network factors as the composition,
\begin{align*}
CNN(x) &= a^{(n)}(\phi_{k_n}^{(n)}(a^{(n-1)}((\phi^{(n-1)}_{k_{n-1}}(\cdots(a^{(1)}(\phi^{(1)}_{k_1}(x)))))))) \\
&= (a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(1)} \circ \phi^{(1)}_{k_1})(x)
\end{align*}

where $a^{(j)}: \mathcal{F}^{(j)} \to \mathcal{F}^{(j)} $ denotes a fixed, nonlinear map defined for all permissible signals $\mathcal{F}^{(j)}$ at layer $j$ of the network and where $\phi^{(j)}_{k_j}: \mathcal{F}^{(j)} \to \mathcal{F}^{(j)}$ is defined equal to the convolution product $(f^{(j)} \star k_j)$ of the incoming signal $f^{(j)}$ at the $j$'th layer with a learned kernel $k_j$.

From this definition of a CNN, we observe that the output of each $\phi^{(j)}_{k_j}$ is another signal in $\mathcal{F}^{(j)}$, defined on the same domain $\mathbb{Z}^2$, whose value at each point $x \in \mathbb{Z}^2$ equals the *inner product* $\langle f, \mathcal{L}_x k \rangle$ of $f$ with $\mathcal{L}_x k \in \mathcal{F}^{(j)}$, which is precisely the signal obtained by translating the kernel $k$ by $x$.
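
This inner-product view is easy to confirm numerically. The sketch below works in one dimension rather than on $\mathbb{Z}^2$ to keep the example short, and the signal, kernel, and helper `translate` are illustrative assumptions. It compares explicit inner products with translated copies of the kernel against NumPy's built-in cross-correlation.

```python
import numpy as np

def translate(kernel, shift, length):
    """Embed `kernel` in a zero signal of the given length, starting at index `shift`."""
    out = np.zeros(length)
    out[shift:shift + len(kernel)] = kernel
    return out

f = np.array([0.0, 1.0, 3.0, 2.0, -1.0, 0.5])  # toy input signal
k = np.array([1.0, 0.0, -1.0])                  # toy kernel

# One output value per position x at which the translated kernel fits inside f.
positions = range(len(f) - len(k) + 1)
by_inner_products = np.array([f @ translate(k, x, len(f)) for x in positions])
by_library = np.correlate(f, k, mode="valid")
```

Each output value is large exactly where the local shape of `f` resembles `k`, which is the sense in which a convolutional layer performs template matching at every position.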

### Steerable CNNs

Key intuition for understanding (and improving on) the success of CNNs can be gained through the lens of well-charted concepts in higher mathematics, some of which we elaborate here. To begin, consider the problem of learning some task that involves operating on 2D images, where the map $f(z) \in ℝ^K$ gives the intensity of the $K$ channels of the pixel at position $z \in ℤ^2$. By construction, CNNs map inputs to outputs by operating on a succession of latent signals, called **feature maps**, that have the same structure as the input $f : ℤ^2 \to ℝ^K$: each assigns $K'$ channels to every point $z \in ℤ^2$. Our reason for viewing feature maps as a correspondence $f^{(l)} : ℤ^2 \to ℝ^{K'}$ from the integer plane to vector-valued channels will become apparent when we recount the general theory of steerable CNNs, wherein $f^{(l)}$ is viewed as a *bundle* of so-called *fibers* (suggesting a connection between feature maps and the theory of fiber bundles studied in differential geometry and differential topology) consistent with a steerability constraint first described in [[10](#cohen-steerable)]. For now, recall the defining formula whereby one representation is used to compute its successor,

$$
\begin{align}
f^{(l+1)}(z) &:= (f^{(l)} \star k)(z) \\
&= \sum_{u \in \mathbb{Z}^2} f^{(l)}(u)  k (u - z)
\end{align}
$$

and observe that the map $ϕ^{(l)} : f^{(l)} \mapsto f^{(l+1)}$ defines a linear operator between feature maps. Consistent with earlier works in the steerable networks literature, we use **feature space** to denote the domain $\mathcal{F}^{(l)}$ and codomain $\mathcal{F}^{(l+1)}$ of the map $ϕ^{(l)}$. Also observe that $ϕ^{(l)}$ produces representations $f^{(l+1)}$ whose values $f^{(l+1)}(z)$ have greater magnitude at points $z$ where the **kernel** $k$ (or *filter*) is similar to the preceding feature map $f^{(l)}$ in a neighborhood of $z$. Here, the degree of similarity between a kernel $k$ and a feature map $f^{(l)}$ is made precise by the connection between convolution $⋆$, which composes signals to form other signals, and the theory of inner-product spaces [[7](#axler)]. As a linear map in $\text{Hom}(\mathcal{F}^{(l)}, \mathcal{F}^{(l+1)})$, each convolution layer $ϕ^{(l)}$ is chosen from the linear span of maps with convolution structure, with $k$ equal to some linear combination of basis kernels. Since $ϕ^{(l)}$ is defined as convolution with $k$, it produces latent representations that are said to "equivary" under translation. 



#### Constructing $SO(3) \ltimes \mathbb{R}^n$ Invariant Layers

To ensure that our learned network agrees with the invariance constraint of Section 3, we briefly recapitulate the consequences of imposing that $\hat F_\theta$ is a $G$-steerable map (in this case, *$G$-invariant*); these were first described in [[10](#cohen-steerable)] and later generalized to all compact groups in [[19](#kondor-compact)].
Stated in mathematical symbols, we desire a construction that gives us a function of the form,

$$
\hat F_\theta(x) = (FFN \circ a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(2)} \circ \phi^{(2)}_{k_2} \circ a^{(1)} \circ \phi^{(1)}_{k_1})(x)
$$

where $FFN$ is a fully-connected feed-forward network, $a^{(j)}$ are non-linear activation functions, and $\phi^{(j)}_{k_j}$ are linear maps given by the convolution product of an input signal with a learned kernel $k_j$. At a minimum, this construction ensures that our hypothesis inherits the desirable properties of CNNs described in the first section of the appendix.

Let $\hat F_\theta$ be a convolutional network, and let $G = SO(3) \ltimes \mathbb{R}^n$ be a group of transformations that leave $\hat F_\theta$ unchanged.
To say that $\hat F_\theta$ also remains sensitive to the relative poses of features is to say that the intermediate layers of $\hat F_\theta$ are steered by $G$ (they satisfy the equality from the introduction), so that the action of any $g \in G$ can be moved through the convolutional layers one at a time:

\begin{align*}
\hat F_\theta(g \cdot x) &= (a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(2)} \circ \phi^{(2)}_{k_2} \circ a^{(1)} \circ \phi^{(1)}_{k_1})(g \cdot x) \\
&= (a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(2)} \circ \phi^{(2)}_{k_2} \circ a^{(1)} \circ [\rho^{(1)}(g) \circ \phi^{(1)}_{k_1}])(x) \\
&= (a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(2)} \circ \phi^{(2)}_{k_2} \circ [\rho^{(1)}(g) \circ a^{(1)} \circ \phi^{(1)}_{k_1}])(x) \\
&= (a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(2)} \circ [\rho^{(2)}(g) \circ \phi^{(2)}_{k_2} \circ a^{(1)} \circ \phi^{(1)}_{k_1}])(x) \\
&= (a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots [\rho^{(2)}(g) \circ a^{(2)} \circ \phi^{(2)}_{k_2} \circ a^{(1)} \circ \phi^{(1)}_{k_1}])(x) \\
\vdots \\
&= ([\rho^{(n)}(g) a^{(n)} \circ \phi^{(n)}_{k_n} \circ \cdots \circ a^{(2)} \circ \phi^{(2)}_{k_2} \circ a^{(1)} \circ \phi^{(1)}_{k_1}])(x)
\end{align*}

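Each step uses either the steerability of a convolution, $\phi^{(j)}_{k_j} \circ \rho^{(j-1)}(g) = \rho^{(j)}(g) \circ \phi^{(j)}_{k_j}$ (with $\rho^{(0)}(g)$ denoting the action of $g$ on the input), or the assumption that the activation $a^{(j)}$ commutes with $\rho^{(j)}(g)$. As before, $a^{(j)}$ are non-linear activation functions and $\phi^{(j)}_{k_j}$ is a linear map given by the convolution product of an input signal with a learned kernel $k_j$.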
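
The layer-by-layer argument can be observed in the simplest setting, where $G$ is the group of cyclic translations of a one-dimensional signal and every $\rho^{(j)}(g)$ is the same shift. The sketch below is an illustration under these assumptions only: it uses periodic boundary conditions, random kernels, and a pointwise ReLU, and it does not implement the rotation-steerable kernels discussed above.

```python
import numpy as np

def circular_correlate(signal, kernel):
    """Cross-correlate a 1D signal with a kernel using periodic boundary conditions."""
    n = len(signal)
    return np.array([
        sum(signal[(i + j) % n] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ])

def relu(x):
    """Pointwise nonlinearity; it commutes with shifts because it acts per sample."""
    return np.maximum(x, 0.0)

def network(x, kernels):
    """Stack of (convolution, activation) layers with one kernel per layer."""
    for k in kernels:
        x = relu(circular_correlate(x, k))
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=12)
kernels = [rng.normal(size=3) for _ in range(4)]
shift = 3  # the group element g

# rho(g) applied to the output versus the network applied to g . x
stack_is_equivariant = np.allclose(
    np.roll(network(x, kernels), shift),
    network(np.roll(x, shift), kernels),
)
```

Equivariance of the whole stack follows from equivariance of each layer, which is why the constraint only needs to be imposed on individual kernels and activations.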