# Projection Theory
*Arthur Ryman, last updated 2025-09-03*

## Introduction

The goal of this notebook is to analyze and develop the theory of projections as originally outlined in the document 
[projection.md](https://github.com/agryman/instant-insanity/blob/main/src/instant_insanity/core/projection.md).

## SymPy

I will use SymPy to represent and verify the math.
I will convert the notation used in the original document to a notation that SymPy easily handles, namely using
numeric subscripts for components of vectors.

## Model Space

Let $M = \mathbb{R}^3$ denote the 3d model space.
Model space is where our simple 3d objects live.

Let $x, y, z$ be the usual Cartesian coordinates on model space.

We will use the default Manim orientation of model space relative to the display screen, 
namely:
* x increases from left to right,
* y increases from bottom to top, and 
* z increases from in (back) to out (front).

Model space and scene space are 3-dimensional real vector spaces.

Define the types Scalar to be a real number and Vector to be a triple of scalars.
SymPy represents vectors as column vectors.

Define some convenience functions to create scalars, positive scalars, and vectors.
The components of vectors will be subscripted with the numbers 1, 2, and 3.

In [1]:
from sympy import *
from instant_insanity.core.symbolic_projection import Scalar, Vector

def scalar(name: str) -> Scalar:
    return symbols(name, real=True)

def positive_scalar(name: str) -> Scalar:
    return symbols(name, real=True, positive=True)

def vector(name: str) -> Vector:
    return Matrix(symbols(name + '1:4', real=True))

Let m represent a generic point in model space.

In [2]:
model_point = m = vector('m')

m.T

Matrix([[m1, m2, m3]])

In [3]:
type(m)

sympy.matrices.dense.MutableDenseMatrix

In [4]:
m1, m2, m3 = m

m3

m3

## Translations and Scalings

Translations allow us to shift the origins of model space and scene space relative to each other.

Scaling allows us to change their relative units of measure.
For example, our standard puzzle cube has side length 2.0 in model space, but might fit better with side length 1.0 in scene space.
In this case we would use a scale factor of 0.5.
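As a quick illustration of the scale factor, the sketch below maps two adjacent corners of the side-length-2 cube into scene space and compares the edge lengths. The helper name `to_scene` and the choice of the model origin as the scene origin are assumptions made only for this example.

```python
from sympy import Matrix, Rational

def to_scene(m, o, alpha):
    # Hypothetical helper: translate by the scene origin o, then scale by alpha.
    return alpha * (m - o)

# Two adjacent corners of the standard puzzle cube (side length 2).
corner_a = Matrix([1, 1, 1])
corner_b = Matrix([-1, 1, 1])

origin = Matrix([0, 0, 0])   # assumed scene origin, for illustration only
alpha = Rational(1, 2)       # scale factor 0.5, kept exact

side_model = (corner_a - corner_b).norm()
side_scene = (to_scene(corner_a, origin, alpha) - to_scene(corner_b, origin, alpha)).norm()

print(side_model, side_scene)
```

The translation cancels in the difference of the two corners, so only the scale factor affects the edge length.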

Technically, translations and scalings are independent of projections.
Historically, we have introduced the idea of the camera plane.
The position of the camera plane has no impact on the magnification of objects in orthographic projections.
It does influence the magnification in perspective projections: the closer the camera plane is to the viewpoint,
the smaller the projected image. However, I believe this effect can be compensated for by a suitably chosen scale factor.

TODO: Confirm that an arbitrary location of the camera plane is equivalent to the camera plane at z=0 plus a suitable
scale factor.

Now let's create some SymPy variables to use in our formulae.

Keep the translation and scaling separate from the projection for now. They may not actually be independent parameters
but they are conceptually distinct.

In [5]:
scene_origin = o = vector('o')

o.T

Matrix([[o1, o2, o3]])

In [6]:
type(o)

sympy.matrices.dense.MutableDenseMatrix

In [7]:
o1, o2, o3 = o

o3

o3

In [8]:
type(o3)

sympy.core.symbol.Symbol

The parameters scene_x, scene_y, and scene_z should be interpreted as the coordinates of the point $o$ in model space
that maps to the origin in scene space.

We allow the mapping from model space to scene space to include a positive scalar scale factor.
Think of this as a change of measurement units.
The scale factor is the number of scene units per model unit.

In [9]:
scale = alpha = positive_scalar('alpha')

alpha

alpha

In [10]:
type(alpha)

sympy.core.symbol.Symbol

The mapping from model space to scene space is as follows. 

In [11]:
def model_to_scene(m: Vector) -> Vector:
    return scale * (m - o)

model_to_scene(m).T

Matrix([[alpha*(m1 - o1), alpha*(m2 - o2), alpha*(m3 - o3)]])

Let s be a generic point in scene space.

In [12]:
scene_point = s = vector('s')

s.T

Matrix([[s1, s2, s3]])

In [13]:
s1, s2, s3 = s

s3

s3

The mapping from scene space to model space is the inverse.

In [14]:
def scene_to_model(s: Vector) -> Vector:
    return s / alpha + o

scene_to_model(s).T

Matrix([[o1 + s1/alpha, o2 + s2/alpha, o3 + s3/alpha]])

Verify that these mappings are inverses.

In [15]:
sm = model_to_scene(m)
msm = scene_to_model(sm)

msm.T

Matrix([[m1, m2, m3]])

Ask SymPy to verify that the functions are inverses.

In [16]:
Eq(scene_to_model(model_to_scene(m)), m)

True

In [17]:
ms = scene_to_model(s)
sms = model_to_scene(ms)

sms.T

Matrix([[s1, s2, s3]])

In [18]:
Eq(model_to_scene(scene_to_model(s)), s)

True

## Projections

A projection is a mapping from 3d space to 2d space.
Projections let us draw 3d objects on 2d screens.
This document gives a precise, mathematical specification of projections.

Implementing projections will let us draw simple 3d scenes in Manim
using the `Scene` class and the Cairo renderer.

Although work is underway to produce a high-quality OpenGL renderer for use with the `ThreeDScene` class,
the majority of our planned content is 2d so using Cairo with the `Scene` class
is an acceptable short-term workaround.

Both types of projection project model space onto the camera plane which is always perpendicular to the z-axis.
The camera plane $C$ is defined by the equation $z = c$.

In [19]:
camera_z = c = scalar('c')

c

c

A generic point $m$ in model space has z-component $m_3$.
The camera plane in model space defined by the parameter $c$ is given by the equation $m_3 = c$.

In [20]:
def camera_plane(c: Scalar) -> Eq:
    return Eq(m3, c)

C = camera_plane(c)

C

Eq(m3, c)

Ask SymPy to verify that any point with z-component $c$ is in the camera plane $C$.

In [21]:
C.subs({m3: c})

True

SymPy may not be able to determine whether a point is in the plane, in which case the equation does not reduce to a truth value.

In [22]:
C.subs({m3: 0})

Eq(0, c)

Define a function that tests if a point in model space satisfies an equation defined in terms of a generic point in 
model space. SymPy may be able to reduce the equation to a boolean if it can simplify it enough.
For example, $c = c$ is always true while $c = c + 1$ is always false.

In [23]:
def is_solution(p: Vector, eq: Eq) -> Eq:
    p1, p2, p3 = p
    return eq.subs({m1: p1, m2: p2, m3: p3})

p00c = Matrix([0, 0, c])

is_solution(p00c, C)

True

In [24]:
p00c1 = Matrix([0, 0, c + 1])

is_solution(p00c1, C)

False

## Perspective Projections

We define perspective projections by giving a viewpoint.
Define a generic viewpoint.

In [25]:
viewpoint = v = vector('v')

viewpoint.T

Matrix([[v1, v2, v3]])

## Orthographic Projections

We define an orthographic projection by giving the unit vector $u$ that specifies the direction of the projection.
Since we'll be doing exact mathematics, we'll define the unit vector in terms of spherical polar angles.
Any unit vector $u$ is parameterized by the polar angle $\theta$ and the azimuthal angle $\phi$.

Define a generic unit vector $u$.

In [26]:
theta = scalar('theta')
phi = scalar('phi')

Matrix([theta, phi]).T

Matrix([[theta, phi]])

In [27]:
u_x = sin(theta) * cos(phi)
u_y = sin(theta) * sin(phi)
u_z = cos(theta)
u = Matrix([u_x, u_y, u_z])

u.T

Matrix([[sin(theta)*cos(phi), sin(phi)*sin(theta), cos(theta)]])

In [28]:
u.norm()

sqrt(sin(phi)**2*sin(theta)**2 + sin(theta)**2*cos(phi)**2 + cos(theta)**2)

In [29]:
simplify(u.norm())

1
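To make the angle conventions concrete, the following sketch evaluates the same parameterization at a few special angles. The function name `unit_vector` is introduced here only for illustration.

```python
from sympy import Matrix, cos, pi, sin

def unit_vector(theta, phi):
    # Spherical polar parameterization: theta is the polar angle measured
    # from the z-axis, phi is the azimuthal angle measured from the x-axis.
    return Matrix([sin(theta) * cos(phi), sin(theta) * sin(phi), cos(theta)])

toward_viewer = unit_vector(0, 0)       # along +z, out of the screen
along_x = unit_vector(pi / 2, 0)        # along +x, left to right
along_y = unit_vector(pi / 2, pi / 2)   # along +y, bottom to top

print(toward_viewer.T, along_x.T, along_y.T)
```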

## The Camera Plane 

Let $C$ denote the camera plane in model space.
The camera plane is where we will draw the 2d projections of 3d objects.
The points in the camera plane get mapped to the pixels of the 
display screen by the Manim `Scene` class.

The camera plane is oriented parallel to the plane z = 0.
Let $c$ be a real number that defines the camera plane 
as the solutions to the equation $z = c$.

$$
C = \{~(x, y, c) \mid (x, y) \in \mathbb{R}^2~\}
$$

In terms of our generic variables we have:

In [30]:
c = camera_z

c

c

Given any point $m$ in model space, we define its projection $\pi(m)$ onto the camera
plane by:
$$
\pi(m_x, m_y, m_z) = (m_x, m_y, c)
$$

The name `pi` is predefined in SymPy, so let's call this projection proj_xy.

In [31]:
def proj_xy(p: Vector) -> Vector:
    return Matrix([p[0], p[1], c])

proj_xy(m).T

Matrix([[m1, m2, c]])

Thus, $\pi$ maps $M$ onto $C$:
$$
\pi: M \rightarrow C
$$

The projection $\pi$ does nothing to the $(x,y)$-coordinates
and forgets the $z$-coordinate of points in $M$.

We need more sophisticated projections that give us the illusion of 3d scenes
but don't forget information so that we can invert them and compute the relative
ordering of model space points that project to the same camera plane points.
These are referred to as 3d projections, and they include perspective and orthographic
projections.

## Projections

Our goal is to draw 3d objects that live in model space $M$ as 2d objects on the camera plane $C$.
However, we need to draw the 2d projections of our 3d objects in the correct order to achieve the correct appearance.
If object A is behind object B in model space then we need to draw the 2d projection
of object A before we draw that of object B. 
This procedure is known as the 
[Painter's Algorithm](https://en.wikipedia.org/wiki/Painter%27s_algorithm).
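The back-to-front ordering can be sketched in a few lines of plain Python. This is a minimal sketch, not the project's implementation: the `Polygon` type, the `mean_depth` key, and the sample data are all assumptions. Sorting by mean $z$ is only a heuristic that works when the polygons are well separated in depth.

```python
from typing import NamedTuple

Point3 = tuple[float, float, float]

class Polygon(NamedTuple):
    # Hypothetical container for a named planar polygon in model space.
    name: str
    vertices: list[Point3]

def mean_depth(polygon: Polygon) -> float:
    # Average z of the vertices; z increases from back to front.
    return sum(z for _, _, z in polygon.vertices) / len(polygon.vertices)

def drawing_order(polygons: list[Polygon]) -> list[Polygon]:
    # Draw the polygons with the smallest z (farthest back) first.
    return sorted(polygons, key=mean_depth)

front = Polygon('front', [(-1, -1, 1), (1, -1, 1), (1, 1, 1), (-1, 1, 1)])
back = Polygon('back', [(-1, -1, -1), (1, -1, -1), (1, 1, -1), (-1, 1, -1)])

order = [p.name for p in drawing_order([front, back])]
print(order)
```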

For simplicity, we will assume that our 3d objects can be modelled as collections of opaque, convex, planar
polygons and that we can always sort them into some drawing order that will produce the correct visual appearance.

Note that it is possible to arrange three nonintersecting, convex, 
planar polygons in a way that has no corresponding correct drawing order.

Given the known requirements for our current project, 
all collections of 3d objects will be simple enough so that a correct
drawing order always exists.

If we actually needed to draw some collection of polygons that had no correct drawing order,
then we would have to split some of the polygons.
If we split the polygons enough then a correct drawing order always exists.
In the extreme case, we could split each polygon into individual pixels.
We'll defer dealing with this situation until project requirements force us to do so.


A 3d projection 
$$f: M \rightarrow M$$
is a transformation of model space that preserves $z$-coordinates.

Let $$m \in M$$ map to: $$f(m) = n \in M$$

Let $$m = (m_1, m_2, m_3)$$ 
and let 
$$f(m) = n = (n_1, n_2, n_3)$$
be its projection.
Then we require:
$$m_3 = n_3$$ 
This means that the $z$-depth of the point hasn't changed,
only its $(x, y)$ coordinates.

Let $m$ and $m'$ be distinct points in model space that project to the same
point in the camera plane:
$$\pi(f(m)) = \pi(f(m'))$$

We say that $m$ and $m'$ are *collinear* with respect to $f$.

Suppose that $m$ is behind $m'$.
Denote this as:
$$ m \prec m'$$

We will define a real-valued function $t_f$ for $f$
$$ t_f: M \rightarrow \mathbb{R}$$
with the property that it respects
the relative ordering of collinear points in the sense that
their $t_f$ values must satisfy:
$$t_f(m) < t_f(m')$$

There are two commonly used 3d projections, namely perspective and orthographic.
These will be defined next.

## Perspective Projection

A perspective projection models the way we see things.
Objects that are further away appear smaller and parallel lines converge.

A perspective projection is defined by giving a viewpoint $v \in M$.
The viewpoint represents the position of our eyes.

Treat $v$ as a fixed parameter in what follows.
Consider points $m$ that are distinct from $v$.
If $m = v$ then the projection of $m$ is not defined.

Let $L(v;m)$ be the line in model space that passes through the points $m$ and $v$.
This line exists because we have assumed that $m \ne v$.

Think of $L(v;m)$ as a light ray that leaves the 3d object at 
$m$ and enters our eye at $v$.
The projection $f(v;m)$ is defined in terms of the unique point $b(v;m)$ where 
the light ray intersects the camera plane.

$$
L(v;m) \cap C = \{ b(v;m) \}
$$

Therefore,
$$
b(v;m) = (b(v;m)_x, b(v;m)_y, c)
$$
where $b(v;m)_x$ and $b(v;m)_y$ are unknown quantities that we have to compute.

Let $\hat{u}(v;m)$ denote the unit vector that points from $m$ to $v$.
$$
\hat{u}(v;m) = \frac{v - m}{\lVert v - m \rVert}
$$

Let $\hat{u}(v;m)$ have the following components, abbreviated as $u_x, u_y, u_z$ when $v$ and $m$ are clear from context:
$$
\hat{u}(v;m) = (u(v;m)_x, u(v;m)_y, u(v;m)_z)
$$

Define $L(v;m,\lambda)$ to be the point on $L(v;m)$ corresponding to the
real parameter $\lambda$ as follows:
$$
L(v;m, \lambda) = b(v;m) + \lambda \hat{u}(v;m)
$$
With this parameterization, we can think of the line as being directed from
$m$ to $v$.

In terms of coordinates, we have:
$$
L(v;m,\lambda) = (b(v;m)_x + \lambda u_x, b(v;m)_y + \lambda u_y, c + \lambda u_z)
$$

By construction, the parameter value $\lambda = 0$ maps to the point $b(v;m)$.
$$
L(v;m, 0) = b(v;m)
$$

Define $t(v;m)$ to be the parameter value that maps to the point $m$.
$$
L(v;m, t(v;m)) = m
$$

Since $v = m + \lVert v - m \rVert \hat{u}(v;m)$, the parameter value
$\lambda = t(v;m) + \lVert v - m \rVert$ maps to the point $v$.
$$
L(v;m, t(v;m) + \lVert v - m \rVert) = v
$$

In terms of coordinates, we have
$$
\begin{align}
b(v;m)_x + t(v;m) u_x &= m_x \\
b(v;m)_y + t(v;m) u_y &= m_y \\
c + t(v;m) u_z &= m_z
\end{align}
$$

Now solve for $t(v;m), b(v;m)_x, b(v;m)_y$ as follows:
$$
\begin{align}
t(v;m) &= \frac{m_z - c}{u_z} \\
b(v;m)_x &= m_x - t(v;m) u_x \\
b(v;m)_y &= m_y - t(v;m) u_y
\end{align}
$$

In summary, a viewpoint $v$ defines a 3d projection $f$ as follows:

$$
f(v;m) = (b(v;m)_x, b(v;m)_y, m_z)
$$
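The formulas above can be sketched in SymPy as follows. The function name `perspective` and the sample viewpoint and points are assumptions chosen for illustration. The sketch also assumes $u_z \ne 0$, that is, the point does not have the same $z$-coordinate as the viewpoint.

```python
from sympy import Matrix, Rational, simplify

def perspective(m, v, c):
    # Unit vector pointing from the model point m to the viewpoint v.
    u = (v - m) / (v - m).norm()
    # Parameter value of m on the line, measured from the camera plane z = c.
    t = (m[2] - c) / u[2]
    # Keep the z-coordinate; replace (x, y) by the camera-plane intersection.
    f = Matrix([m[0] - t * u[0], m[1] - t * u[1], m[2]])
    return f.applyfunc(simplify), simplify(t)

v = Matrix([0, 0, 10])   # sample viewpoint
c = 0                    # camera plane z = 0

# Two points on the same ray through v; `far` is behind `near`.
far = Matrix([1, 1, -10])
near = Matrix([Rational(3, 4), Rational(3, 4), -5])

f_far, t_far = perspective(far, v, c)
f_near, t_near = perspective(near, v, c)

print(f_far.T, t_far)
print(f_near.T, t_near)
```

Both points land on the same camera-plane coordinates, and the point that is farther from the viewpoint has the smaller $t$ value, which is the ordering property required of $t_f$.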

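Given $f(v;m)$, that is, $b(v;m)_x$, $b(v;m)_y$, and $m_z$, we can recover $m$.
Since $v$ and $b(v;m)$ both lie on $L(v;m)$, they determine $\hat{u}(v;m)$ up to sign,
and the products $t(v;m) u_x$ and $t(v;m) u_y$ do not depend on that sign.
We compute $m$ as follows: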
$$
\begin{align}
t(v;m) &= \frac{m_z - c}{u_z} \\
m_x &= b(v;m)_x + t(v;m) u_x \\
m_y &= b(v;m)_y + t(v;m) u_y
\end{align}
$$
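A symbolic round trip can check that the inverse recovers the original point. This is a sketch under stated assumptions: the names `project` and `unproject` are hypothetical, the direction vector in `unproject` is rebuilt from $v$ and the camera-plane point, and the check assumes $m_z \ne v_z$ and $v_z \ne c$.

```python
from sympy import Matrix, simplify, symbols

m = Matrix(symbols('m1:4', real=True))
v = Matrix(symbols('v1:4', real=True))
c = symbols('c', real=True)

def project(m, v, c):
    # Forward map: (x, y) of the camera-plane intersection, original z kept.
    u = (v - m) / (v - m).norm()
    t = (m[2] - c) / u[2]
    return Matrix([m[0] - t * u[0], m[1] - t * u[1], m[2]])

def unproject(n, v, c):
    # Rebuild the camera-plane point b and the line direction from v and b.
    b = Matrix([n[0], n[1], c])
    u = (v - b) / (v - b).norm()
    # n[2] is the preserved z-coordinate of the original model point.
    t = (n[2] - c) / u[2]
    return Matrix([b[0] + t * u[0], b[1] + t * u[1], n[2]])

round_trip = unproject(project(m, v, c), v, c)
residual = (round_trip - m).applyfunc(simplify)

print(residual.T)
```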

## Orthographic Projection

An orthographic projection is a limiting case of a perspective projection as
the viewpoint $v$ moves off to infinity in a fixed direction.
Let $\hat{u}$ be a unit vector that defines the direction that the viewpoint moves in.
Let $\mu$ be a real parameter, let $v_0$ be an initial viewpoint,
and define the viewpoint $v(\hat{u};\mu)$ as follows:
$$
v(\hat{u};\mu) = v_0 + \mu \hat{u}
$$

Define $\hat{u}(m; \mu)$ to be the unit vector that points from $m$ to $v(\hat{u};\mu)$.
$$
\hat{u}(m; \mu) = \frac{v(\hat{u};\mu) - m}{\lVert v(\hat{u};\mu) - m \rVert}
$$

As $\mu$ becomes very large, the term $\mu \hat{u}$ dominates $v(\hat{u};\mu) - m$, so:
$$
\begin{align}
\lim_{\mu \to \infty} \hat{u}(m; \mu) 
&= \lim_{\mu \to \infty} \frac{v(\hat{u};\mu) - m}{\lVert v(\hat{u};\mu) - m \rVert} \\
&= \lim_{\mu \to \infty} \frac{v_0 + \mu \hat{u} - m}{\lVert v_0 + \mu \hat{u} - m \rVert} \\
&= \lim_{\mu \to \infty} \frac{\mu \hat{u}}{\lVert \mu \hat{u} \rVert} \\
&= \lim_{\mu \to \infty} \frac{\mu \hat{u}}{\mu \lVert \hat{u} \rVert} \\
&= \hat{u}
\end{align}
$$

Therefore, an orthographic projection is like a perspective projection except that
rather than compute the unit vector $\hat{u}(v;m)$ we use the given constant unit vector $\hat{u}$.
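The limit can be checked in SymPy for a concrete case. The direction, initial viewpoint, and model point below are arbitrary sample values, chosen so that the arithmetic stays exact. This is a spot check, not a proof for general symbols.

```python
from sympy import Matrix, Rational, limit, oo, symbols

mu = symbols('mu', positive=True)

u_hat = Matrix([Rational(3, 5), 0, Rational(4, 5)])   # sample unit direction
v0 = Matrix([1, 2, 3])                                # sample initial viewpoint
m = Matrix([0, 1, -1])                                # sample model point

# Unit vector pointing from m to the receding viewpoint v0 + mu * u_hat.
direction = v0 + mu * u_hat - m
unit = direction / direction.norm()

# Take the limit of each component as the viewpoint moves off to infinity.
limit_vector = unit.applyfunc(lambda component: limit(component, mu, oo))

print(limit_vector.T)
```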

## Mapping from Model Space to Scene Space

In practice, it is useful to not regard the camera as being embedded in scene space.
For example, the puzzle cubes have side length 2 which may be too big for the scene.
In this case allowing a scaling factor is handy.
Also, the objects in model space may appear to be shifted after applying a 3d projection.
In this case allowing a translation is handy.

We therefore allow the following mapping from model space to scene space which
we apply after the 3d projection. Let $o$ be a point in model space that will map to
the origin in scene space. Let $\alpha$ be a positive real scaling factor. The mapping $g$
from model space to scene space is given by:
$$
g(o,\alpha;m) = \alpha \cdot (m - o)
$$

The inverse is:
$$
g^{-1}(o,\alpha;s) = o + s / \alpha
$$

## The Relation Between the Camera Plane, Scaling, and Translation

The projection depends on the viewpoint $v$, the camera plane position $c$,
the scene origin $o$, and the model-to-scene scaling factor $\alpha$.
This scene origin includes $c$ as its $z$-component. Altogether, we
have 3 ($v$) + 3 ($o$) + 1 ($c$) + 1 ($\alpha$) parameters
for a grand total of 8 parameters. Are these independent or can we
canonicalize them, say by finding an equivalent set of parameters
with $c = 0$?
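One way to probe this question for perspective projections is to compare the camera-plane image on $z = c$ with the image on $z = 0$. The sketch below suggests that, measured from the point $(v_1, v_2)$, the two images differ only by the factor $(v_3 - c)/v_3$. The helper name `camera_point_xy` is hypothetical, and the check assumes $v_3 \ne 0$ and $m_3 \ne v_3$.

```python
from sympy import Matrix, simplify, symbols

m1, m2, m3, v1, v2, v3, c = symbols('m1 m2 m3 v1 v2 v3 c', real=True)

def camera_point_xy(c_value):
    # Parameter s along the ray from v to m at which z equals c_value.
    s = (c_value - v3) / (m3 - v3)
    # (x, y) coordinates of the intersection with the plane z = c_value.
    return Matrix([v1 + s * (m1 - v1), v2 + s * (m2 - v2)])

v_xy = Matrix([v1, v2])
ratio = (v3 - c) / v3

# Offsets from (v1, v2) on the two planes should differ only by `ratio`.
difference = (camera_point_xy(c) - v_xy) - ratio * (camera_point_xy(0) - v_xy)
difference = difference.applyfunc(simplify)

print(difference.T)
```

If this holds, moving the camera plane amounts to a uniform scaling about $(v_1, v_2)$, which a scale factor and a translation can absorb. It does not by itself settle how many of the 8 parameters are independent.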