# 1. Introduction

## 1.1. What is Computer Vision?
>$$\textbf{Images of a real scene} \rightarrow \textbf{Useful information (3R's)}$$
>
>$$\text{Images} \rightarrow \text{Representation}$$
>
>$$\text{Perception} \rightarrow \text{Actions}$$

>* **3R's**: Registration, Recognition, and Reconstruction
>* It is different from **Image Processing** or **Pattern Recognition**
>* **Applications**: autonomous vehicles, medical diagnosis, augmented reality, face recognition, ...

## 1.2. Human Eye vs. Camera
>* The numbers are rough estimates

>||Human Eye|Typical CCD|
|-|-|-|
|Measures|$1000\text{ mm}^2$|$24\text{ mm }\times16\text{ mm }$|
|No. of Sensors|$2 \text{ Eyes}$|$1 \text{ CCD Sensor}$|
|Sampling Elements|$10^8\text{ rods, and } 10^6\text{ cones}$|$6\times10^6 \text{ pixels}$|
|Spatial Resolution|$0.01^\text{o} \text{ over } 150^\text{o} \text{ field of view}$|$\text{Lens-dependent}$|
|Temporal Resolution|$100\text{ ms}$|$40\text{ ms}$|
|Intensity Resolution|$11\text{ bits/element}$|$8\text{ bits/element for each colour channel}$|
|Spectral Resolution|$2\text{ bits/element}$|$8\text{ bits/element for each colour channel}$|
|Data Rate|$3\text{GB/s}$|$400\text{MB/s}$|

## 1.3. Problems in Computer Vision

>**1. Image Formation**
>  * Many-to-one mapping, Encodes nothing about the depth
>  * The inverse imaging problem has no unique solution

>**2. Vision as Information Processing**
>  * Images $\rightarrow$ Generic salient features $\rightarrow$ Representations and Actions
>  * $10 \text{ MB/s} \rightarrow 10 \text{ KB/s} \rightarrow 1\text{~}10\text{ bits/s}$ (Data reduction process)

>**3. Feature Extraction**
>  * Reduce the data content while preserving the useful information (e.g. **Edge & Corner**)

>**4. Camera Models**
>  * Account for the camera position, perspective projection, and CCD imaging
>  * Predict how known objects will appear in an image $\rightarrow$ **Object recognition**

>**5. Stereo Vision**
>  * Two cameras $\rightarrow$ Match the two images (**Correspondence problem**) $\rightarrow$ Obtain depth
>  * Possible to infer information even when the cameras are **not calibrated**

>**6. Structure from Motion**
>  * Camera moves $\rightarrow$ Track features $\rightarrow$ Infer the structures in the scene & the motion of camera
>  * Sensitive to independently moving objects

>**7. Shape from Texture**
>  * Assume **homogeneous** or **isotropic** texture $\rightarrow$ infer the orientation of surfaces

>**8. Shape from Line Drawing**

>**9. Shape from Contour**
>  * Each **apparent contour** defines a set of **tangent planes** from the camera to the surface
>  * Analyse the deformation of the apparent contours in the image $\rightarrow$ Infer the shape

>**10. Shape from Shading**
>  * Assume Lambertian light source, isotropic surface reflectance, and a top-lit scene $\rightarrow$ Infer the shape

# 2. Feature Extraction and Description

## 2.1. Objectives
>**1. Reduce** the amount of data

>**2. Preserve** the useful information (e.g. edge, corner, shape)

>**3. Discard** the redundant information (e.g. lighting conditions)

## 2.2. 1D Edge Detection
* **Step 1. Smooth the signal to suppress noise**

  * Convolve the signal $I(x)$ with a Gaussian kernel $g_\sigma(x)$
  * **Small $\sigma \rightarrow$** Large cutoff frequency $\rightarrow$ **Preserve detail**
  * **Large $\sigma \rightarrow$** Small cutoff frequency $\rightarrow$ **Suppress detail**

>$$g_\sigma(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{x^2}{2\sigma^2}\right)$$
>$$\;$$
>\begin{align}
s(x)=g_\sigma(x) * I(x)&=\int_{-\infty}^{\infty}g_\sigma(u)I(x-u)du\\
&=\int_{-\infty}^{\infty}g_\sigma(x-u)I(u)du
\end{align}

* **Step 2-a. Compute $s'(x)$ and look for maxima & minima**

  * Instead of doing two convolutions (smoothing, then differentiating),
  * use the derivative theorem of convolution to convolve once with the differentiated kernel

>$$s'(x)=\frac{d}{dx} [ g_\sigma(x) * I(x) ] = g'_\sigma(x)*I(x)$$

* **Step 2-b. Compute $s''(x)$ and look for zero-crossings**

  * The signal is convolved with the **Laplacian of a Gaussian**

>$$s''(x)=g''_\sigma(x)*I(x)$$
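
Steps 1 and 2-a can be sketched on a sampled signal. This is a minimal illustration, not a reference implementation: the step signal, $\sigma=2$ and the truncation half-width $n=6$ are assumed values chosen only for the example.

```python
import math

def gaussian_derivative_kernel(sigma, n):
    """Samples of g'_sigma(x) for x = -n..n."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [-(x / sigma**2) * norm * math.exp(-x * x / (2.0 * sigma**2))
            for x in range(-n, n + 1)]

def convolve_valid(signal, kernel):
    """Discrete convolution, keeping only positions where the kernel fits."""
    n = len(kernel) // 2
    return [sum(kernel[u + n] * signal[x - u] for u in range(-n, n + 1))
            for x in range(n, len(signal) - n)]

# Hypothetical test signal: a step edge between samples 19 and 20.
signal = [0.0] * 20 + [1.0] * 20
sigma, n = 2.0, 6

# One convolution with g' replaces smoothing followed by differentiation.
s_prime = convolve_valid(signal, gaussian_derivative_kernel(sigma, n))

# Output index k corresponds to signal position k + n.
peak = max(range(len(s_prime)), key=lambda k: abs(s_prime[k]))
edge_position = peak + n
```

The extremum of $s'(x)$ lands on the step, and its sign is positive because the intensity rises across the edge.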



## 2.3. 2D Edge Detection - Canny (1986)
* **Step 1. Convolve with a 2D Gaussian $G_\sigma(x,y)$**

>\begin{align}
G_\sigma(x,y)&=\frac{1}{2\pi\sigma^2}\exp\left(-\frac{x^2+y^2}{2\sigma^2}\right)\\
\\
S(x,y)&=G_\sigma(x,y)*I(x,y)\\
\\
&=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}G_\sigma(u,v)I(x-u,y-v)du dv
\end{align}

* **Step 2. Find the Gradient of $S(x,y)$**

>\begin{align}
\nabla S &=\nabla (G_\sigma*I) \\
&=
\begin{bmatrix}
    \frac{\partial (G_\sigma *I)}{\partial x} \\
    \frac{\partial (G_\sigma *I)}{\partial y} \\
\end{bmatrix} \\
&=
\begin{bmatrix}
    \frac{\partial G_\sigma}{\partial x} *I \\
    \frac{\partial G_\sigma}{\partial y} *I \\
\end{bmatrix}
\end{align}

* **Step 3. Non-maximal Suppression**

>$$\text{Edge elements (edgels) are placed where } \mid \nabla S \mid \text{ is greater than local values in the direction of } \pm \nabla S $$
  
* **Step 4. Threshold the Edgels**

>$$\text{Output: edgel positions, each with strength } \mid \nabla S \mid \text{ and orientation } \nabla S \text{/} \mid \nabla S \mid$$
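
Steps 3 and 4 can be sketched as follows, assuming the gradient images are already available as nested lists. The function name `non_max_suppression` is hypothetical. Stepping to the nearest of the 8 neighbouring pixels along $\pm\nabla S$ is a simplification made here; interpolating along the gradient direction is another option.

```python
import math

def non_max_suppression(gx, gy, threshold):
    """Return edgels (x, y, strength, orientation) from gradient images gx, gy."""
    h, w = len(gx), len(gx[0])
    mag = [[math.hypot(gx[y][x], gy[y][x]) for x in range(w)] for y in range(h)]
    edgels = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            m = mag[y][x]
            if m <= threshold:          # Step 4: discard weak responses
                continue
            # Step 3: compare with the neighbours along +/- the gradient direction.
            dx = round(gx[y][x] / m)
            dy = round(gy[y][x] / m)
            if m >= mag[y + dy][x + dx] and m >= mag[y - dy][x - dx]:
                edgels.append((x, y, m, (gx[y][x] / m, gy[y][x] / m)))
    return edgels

# Hypothetical 5x5 gradient field: a vertical edge whose response peaks at x = 2.
profile = [0.0, 0.2, 1.0, 0.2, 0.0]
gx = [profile[:] for _ in range(5)]
gy = [[0.0] * 5 for _ in range(5)]
edgels = non_max_suppression(gx, gy, threshold=0.1)
```

Only the ridge of the gradient magnitude survives: the weaker responses at $x=1$ and $x=3$ exceed the threshold but are suppressed by their stronger neighbour.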

## 2.4. 2D Edge Detection - Marr and Hildreth (1980)
* Unlike the Canny edge detector, the Marr-Hildreth operator is **isotropic**
* **Algorithm: Find the zero-crossings of:**

>$$\nabla^2G_\sigma*I$$
>$$\;$$
>$$\text{where } \nabla^2G_\sigma=\left( \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} \right) G_\sigma \text{ is the Laplacian of } G_\sigma$$

## 2.5. Implementation Details
* **Convolution**

>$$S(x,y)=\sum_{u=-n}^{n} \sum_{v=-n}^{n} G_\sigma(u,v)I(x-u,y-v)$$
>
>* Kernels are truncated so that the discarded samples are less than $1/1000$ of the peak value
>* $\pm 3\sigma$ is often enough
>* Smaller $n$ leads to sharp discontinuity (avoid!)
>
|$\sigma$|1.0|1.5|3|6|
|-|-|-|-|-|
|$2n$ $+1$|7|11|23|45|
>
>* 2D Convolution: computationally expensive!
>* Decompose it into two 1D convolutions
>$$G_\sigma(x,y)*I(x,y)=g_\sigma(x)*[g_\sigma(y)*I(x,y)]$$
>
>$$\mathcal{O} \left((2n+1)^2 \right) \text{ vs. } \mathcal{O} \left(2(2n+1)\right)$$
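
The truncation rule and the cost comparison can be checked numerically. The sketch below assumes the rule is read as "keep every sample that is at least $1/1000$ of the peak"; the helper names are hypothetical.

```python
import math

def kernel_half_width(sigma, cutoff=1e-3):
    """Largest n whose sample g(n) is still at least `cutoff` times the peak g(0)."""
    return math.floor(sigma * math.sqrt(-2.0 * math.log(cutoff)))

def gaussian_1d(sigma, n):
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [norm * math.exp(-x * x / (2.0 * sigma**2)) for x in range(-n, n + 1)]

# Kernel sizes 2n + 1 for the values of sigma used in the table.
sizes = [2 * kernel_half_width(s) + 1 for s in (1.0, 1.5, 3.0, 6.0)]

# Separability: the 2D kernel is the outer product of two 1D kernels,
# because exp(-(x^2 + y^2) / (2 sigma^2)) factorises into an x part and a y part.
sigma = 1.0
n = kernel_half_width(sigma)
g = gaussian_1d(sigma, n)
G = [[g[v] * g[u] for u in range(2 * n + 1)] for v in range(2 * n + 1)]

multiplies_2d = (2 * n + 1) ** 2     # per output pixel, direct 2D convolution
multiplies_sep = 2 * (2 * n + 1)     # per output pixel, two 1D passes
```

Under this reading of the rule, the computed sizes reproduce the table row for $2n+1$.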

* **Differentiation** 

>$$\frac{\partial S}{\partial x} \approx \frac{S(x+1,y)-S(x-1,y)}{2}$$
>
>* The above is the result of the **Taylor Expansion** of $S(x,y)$
>* This is equivalent to convolving the rows of image samples with the kernel $(1/2, 0, -1/2)$
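
A small check of the central difference and its kernel form, using an assumed test row $S(x)=x^2$ for which the exact derivative is $2x$:

```python
def central_difference(row):
    """dS/dx at interior samples: (S[x+1] - S[x-1]) / 2."""
    return [(row[x + 1] - row[x - 1]) / 2.0 for x in range(1, len(row) - 1)]

def convolve_valid(row, kernel):
    """Discrete convolution sum_u kernel(u) * row(x - u), kernel indexed u = -n..n."""
    n = len(kernel) // 2
    return [sum(kernel[u + n] * row[x - u] for u in range(-n, n + 1))
            for x in range(n, len(row) - n)]

row = [float(x * x) for x in range(8)]            # S(x) = x^2
by_difference = central_difference(row)
by_kernel = convolve_valid(row, [0.5, 0.0, -0.5])  # kernel (1/2, 0, -1/2)
```

Because convolution flips the kernel, the leading $1/2$ multiplies $S(x+1)$ and the trailing $-1/2$ multiplies $S(x-1)$.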

## 2.6. Corner Detection - Harris (1987)

* **Corner Detection** - Important in **tracking** objects and **matching** stereo pairs
* Important in understanding **kinetic effect** and **ego-motion**
* **Step 1. Calculate the change in intensity in direction $\textbf{n}$**

  * Here, $I$ is already smoothed

>\begin{align}
I_n &\equiv \nabla I(x,y) \cdot \widehat{\textbf{n}} \equiv \begin{bmatrix} I_x & I_y \end{bmatrix}^T \cdot \widehat{\textbf{n}}\\
\;\\
I_n^2 &= \frac{\textbf{n}^T \nabla I \nabla I^T \textbf{n}}{\textbf{n}^T\textbf{n}}\\
\;\\
&= \frac{\textbf{n}^T \begin{bmatrix} 
I_x^2 & I_xI_y \\
I_xI_y & I_y^2
\end{bmatrix}
\textbf{n}}{\textbf{n}^T\textbf{n}}
\end{align}
>$$\;$$
>$$\text{where } I_x \equiv \frac{\partial I}{\partial x} \text{ and } I_y \equiv \frac{\partial I}{\partial y}$$

* **Step 2. Smooth $I_n^2$ by convolution with a Gaussian kernel**

>\begin{align}
C_n(x,y) &= G_\sigma (x,y)*I^2_n \\
\;\\
&= \frac{\textbf{n}^T \begin{bmatrix} 
\left\langle I_x^2 \right\rangle & \left\langle I_xI_y \right\rangle\\
\left\langle I_xI_y \right\rangle & \left\langle I_y^2 \right\rangle\\
\end{bmatrix}
\textbf{n}}{\textbf{n}^T\textbf{n}}
\end{align}
>$$\;$$
>$$\text{where } \left\langle \right\rangle \text{ is the smoothed value}$$

* **Step 3. Use the eigenvalues of $\text{A}$ to determine the structure**

>$$C_n(x,y)=\frac{\textbf{n}^T \text{A} \textbf{n}}{\textbf{n}^T\textbf{n}}$$
>$$\;$$
>$$\text{where,} \;\;\; \text{A}=\begin{bmatrix} 
\left\langle I_x^2 \right\rangle & \left\langle I_xI_y \right\rangle\\
\left\langle I_xI_y \right\rangle & \left\langle I_y^2 \right\rangle\\
\end{bmatrix}$$
>$$\;$$
>$$\text{then,} \;\;\; \lambda_1 \leq C_n(x,y) \leq\lambda_2$$
>
>* **$\text{A}$**: Structure Tensor
>* **No structure**(smooth): $\lambda_1 \approx \lambda_2 \approx 0$
>* **1D structure**(edge): $\lambda_1 \approx 0$ (direction of edge), $\lambda_2$ is large (normal to edge)
>* **2D structure**(corner): $\lambda_1$ and $\lambda_2$ both large and distinct

## 2.7. Setting Threshold for Corner Detection
* **If $M_c$ exceeds some threshold $\rightarrow$ Mark corners**

>$$M_c=\lambda_1 \lambda_2-\kappa(\lambda_1+\lambda_2)^2$$
>$$\;$$
>$$M_c=\det{\text{A}} - \kappa \left(\text{tr A}\right)^2$$
>* The second form is preferable (low computational cost): it avoids computing the eigenvalues explicitly
>* $\kappa$: generally ranges from $0.04$ to $0.15$
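
The corner measure can be evaluated directly from the entries of $\text{A}$ as $\det \text{A} - \kappa(\text{tr A})^2$, which is what the eigenvalue expression expands to. The sketch below compares the two; the three sets of smoothed products are hypothetical values, not measurements.

```python
import math

def harris_response(ixx, ixy, iyy, kappa=0.04):
    """M_c = det(A) - kappa * tr(A)^2 for A = [[ixx, ixy], [ixy, iyy]]."""
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - kappa * trace * trace

def eigenvalues(ixx, ixy, iyy):
    """Closed-form eigenvalues (lambda_1 <= lambda_2) of the symmetric 2x2 matrix A."""
    mean = 0.5 * (ixx + iyy)
    radius = math.sqrt((0.5 * (ixx - iyy)) ** 2 + ixy * ixy)
    return mean - radius, mean + radius

# Hypothetical values of <Ix^2>, <IxIy>, <Iy^2> at three pixels.
flat = (0.01, 0.0, 0.02)      # no structure
edge = (100.0, 0.0, 0.01)     # 1D structure
corner = (100.0, 10.0, 80.0)  # 2D structure

responses = {name: harris_response(*a)
             for name, a in (("flat", flat), ("edge", edge), ("corner", corner))}

# Cross-check against the eigenvalue form for the corner pixel.
l1, l2 = eigenvalues(*corner)
m_from_eigs = l1 * l2 - 0.04 * (l1 + l2) ** 2
```

With these numbers the edge gives a negative response, the flat region a response near zero, and only the corner a large positive one.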

## 2.8. Blobs
* Blob: area of uniform intensity in the image
* They are localised in the middle of areas of similar intensity
* Convolve with the **Laplacian of the Gaussian** $\rightarrow$ Locate minimum
* As $\sigma$ increases, larger image features are detected

## 2.9. Scale Space
* Achieve scale independence by looking at different resolutions
* **Scale Space**

>\begin{align}
L(x,y,t)&=G(x,y,t)*I(x,y)\\
\;\\
G(x,y,t)&=\frac{1}{2\pi t}\exp\left(-\frac{x^2+y^2}{2t}\right)\\
\;\\
t&=\sigma^2
\end{align}

* **Choose a discrete set of low-pass filters**

>$$\sigma_i=2^\frac{1}{s}\sigma_{i-1}=2^\frac{i}{s}\sigma_0$$
>
>* $\sigma$ doubles after $s$ intervals ($=$ an **octave**)
>* Avoid blurring with large scales by **subsampling** the image after each octave
>* $\Rightarrow$ **Image Pyramid**

* **Within each octave, we convolve repeatedly**

>$$G(\sigma_1)*G(\sigma_2)=G\left(\sqrt{\sigma^2_1+\sigma^2_2}\right)$$
>
>* The following should be satisfied:
>
>$$G(\sigma_{i+1})=G(\sigma_i)*G(\sigma_k)$$
>
>* $\sigma_k$ can be calculated
>
>\begin{align}
\sigma_k&=\sqrt{\sigma_{i+1}^2-\sigma_i^2}\\
\sigma_{i+1}&=2^{\frac{1}{s}}\sigma_i\\
\sigma_k&=\sigma_i\sqrt{2^{\frac{2}{s}}-1}
\end{align}
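
The scale schedule and the incremental blur can be verified numerically. The starting scale $\sigma_0=1.6$ and $s=3$ are assumed values for illustration only.

```python
import math

def octave_sigmas(sigma0, s):
    """Absolute scales sigma_i = 2^(i/s) * sigma0 for i = 0..s (one octave)."""
    return [sigma0 * 2.0 ** (i / s) for i in range(s + 1)]

def incremental_sigma(sigma_i, s):
    """sigma_k such that G(sigma_i) * G(sigma_k) = G(sigma_{i+1})."""
    return sigma_i * math.sqrt(2.0 ** (2.0 / s) - 1.0)

s = 3
sigmas = octave_sigmas(sigma0=1.6, s=s)
steps = [incremental_sigma(sig, s) for sig in sigmas[:-1]]

# Variances add under convolution, so each step should reach the next scale.
reached = [math.hypot(sig, step) for sig, step in zip(sigmas, steps)]
```

Each entry of `reached` matches the next entry of `sigmas`, and the last scale is twice the first.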

* **Ideal scale for a keypoint is located at the maximum of the scale space**
  * $\Rightarrow$ The largest value of the samples in the pyramid is obtained and interpolated

## 2.10. DoG - Difference of Gaussians
* **DoG: Blob Detector**
  * Calculated as the difference of two Gaussians (small $\sigma$ - high $\sigma$)
  * This approximates the Laplacian of a Gaussian
* **Blobs are Important**
  * Blobs are usually found inside of objects (as opposed to edges)
  * Thus they are less likely to contain background in queries
  * +) stability, repeatability, definite optimal scale, ...
* **Scale Space Pyramid**
  * Subtract one member of a pyramid level from the one directly above it

## 2.11. Zero Normalised Patches
* **Matching intensity patches**

>$$CC(P_1, P_2)=\sum_i^N{P_1[i]P_2[i]}$$
>
>* Not robust to changes (brightness & contrast)

* **Zero Normalised Patches**

>$$ZN(x,y)=\frac{Z(x,y)}{\sigma}=\frac{I(x,y)-\mu}{\sigma}$$
>
>* Robust to changes
>* Patches can be matched using simple cross-correlation
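
A sketch of why zero normalisation helps, using a hypothetical flattened patch and a copy with doubled contrast and raised brightness. A perfectly flat patch ($\sigma=0$) is not handled here.

```python
import math

def zero_normalise(patch):
    """Subtract the patch mean and divide by the patch standard deviation."""
    mu = sum(patch) / len(patch)
    sigma = math.sqrt(sum((p - mu) ** 2 for p in patch) / len(patch))
    return [(p - mu) / sigma for p in patch]

def cross_correlation(p1, p2):
    return sum(a * b for a, b in zip(p1, p2))

patch = [10.0, 12.0, 30.0, 31.0, 11.0, 9.0]   # hypothetical intensities
brighter = [2.0 * p + 25.0 for p in patch]    # contrast x2, brightness +25

raw_self = cross_correlation(patch, patch)
raw_changed = cross_correlation(patch, brighter)
zn_self = cross_correlation(zero_normalise(patch), zero_normalise(patch))
zn_changed = cross_correlation(zero_normalise(patch), zero_normalise(brighter))
```

The raw scores differ substantially, while the zero-normalised scores agree, since the affine intensity change is removed by subtracting $\mu$ and dividing by $\sigma$.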

## 2.12. Matching Patches using Intensity Edges
* **Orientation Histograms**

>1. Find every edge in a patch of pixels
>2. Weight them by the strength of the edge
>3. Bin them together into an orientation histogram

* **Advantage**

>1. Robust to brightness and contrast changes
>2. Incorporates orientation data $\rightarrow$ robust to orientation

* **SIFT (Scale Invariant Feature Transform) interest point descriptor**

>1. $N\times N$ patch (typically, $N=16$)
>2. Split this patch into $c$ cells (typically $c=16$)
>3. In each cell, obtain the orientation histogram
>4. Weight them with a Gaussian window ($\sigma=0.5\;\times$ the scale of the feature centered on the patch)
>5. The resulting descriptor is a $d\times c$ vector, where $d$ is the number of orientation bins per histogram (typically $d=8$, giving $\textbf{128}\text{D}$)
>6. Normalise the descriptor vector (invariance to gradient magnitude change)
>7. Threshold the elements ($0.2$) $\rightarrow$ Renormalise (reduce the effect of non-affine lighting changes)
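
Two of the steps above, the per-cell orientation histogram and the normalise, threshold, renormalise sequence, can be sketched in isolation. This is not a full SIFT implementation: the Gaussian window, interpolation between bins and the cell layout are omitted, and `bins=8` and the sample gradients are assumptions for illustration.

```python
import math

def orientation_histogram(gx, gy, bins=8):
    """Histogram of gradient orientations, each vote weighted by gradient magnitude."""
    hist = [0.0] * bins
    for dx, dy in zip(gx, gy):
        angle = math.atan2(dy, dx) % (2.0 * math.pi)
        index = min(int(angle / (2.0 * math.pi) * bins), bins - 1)
        hist[index] += math.hypot(dx, dy)
    return hist

def normalise_descriptor(desc, clip=0.2):
    """Unit-normalise, clamp large elements to `clip`, then renormalise."""
    norm = math.sqrt(sum(v * v for v in desc))
    desc = [min(v / norm, clip) for v in desc]
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc]

# Hypothetical gradients for the pixels of one cell.
gx = [2.0, -1.0, -2.0, 1.0, 4.0]
gy = [1.0, 2.0, -1.0, -2.0, 2.0]
hist = orientation_histogram(gx, gy)

# A descriptor dominated by one large element, e.g. from a strong highlight.
descriptor = normalise_descriptor([10.0, 1.0, 1.0, 1.0])
```

Clamping limits how much a single large gradient can dominate the descriptor, which is the motivation given in step 7.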

## 2.13. Texture
* **What is Texture?**

>* **Image Texture** consists of organised patterns of regular sub-elements called **textons**

* **Characterising Texture**

>* **Filter bank**:
>  * $8$ Laplacian of Gaussian filters
>  * $4$ Gaussian filters at different scales
>  * $36$ Oriented filters
>    * $6$ Different angles
>    * $3$ Different scales
>    * $2$ Different phases ($1^{st} \text{&}\; 2^{nd}$ derivatives of Gaussians on the minor axis)
>* **Descriptor**:
>  * Simply the concatenated responses of all the filters at a pixel
>  * This is innately immune to most changes in an image

# 3. Projection

## 3.1. Orthographic Projection

* **Orthographic Projection:** Projection using ***parallel*** rays

><img src="images/image01.png" width=250>

>$$\mathbf{x}=\mathbf{X}-(\mathbf{X\cdot k})\mathbf{k} = (\mathbf{k \times X})\times\mathbf{k}$$
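
Both forms of the orthographic projection can be compared on a sample point, assuming $\mathbf{k}$ is a unit vector along the viewing direction. The point and direction below are arbitrary.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def orthographic(X, k):
    """x = X - (X . k) k for a unit viewing direction k."""
    d = dot(X, k)
    return tuple(Xi - d * ki for Xi, ki in zip(X, k))

k = (0.0, 0.0, 1.0)      # viewing direction (unit length)
X = (3.0, -2.0, 7.0)     # hypothetical world point

x_direct = orthographic(X, k)
x_cross = cross(cross(k, X), k)   # (k x X) x k
```

Both expressions remove the component of $\mathbf{X}$ along $\mathbf{k}$, so the depth coordinate is lost.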

## 3.2. Perspective Projection

* **Perspective Projection:** Inspired by the **pinhole camera**
  * $\mathbf{X}_c=(X_c, Y_c, Z_c)$: world points
  * $\mathbf{x}=(x,y)$: image plane points

><img src="images/image02.png" width=400>

>$$\frac{x}{f}=\frac{X_c}{Z_c} \;\;\; \Leftrightarrow \;\;\; x=\frac{fX_c}{Z_c} \;\;\; \text{similarly,} \;\;\; y=\frac{fY_c}{Z_c}$$
>
>* **Ratio & Symmetry** $\Rightarrow$ not preserved under perspective projection

## 3.3. Vanishing Points
* **Vanishing Point:** Point where parallel lines appear to meet
  * Each set of parallel lines has a ***different*** vanishing point

><img src="images/image03.png" width=300>
>
>* **Example:** Vanishing point of a line
>
>$$\mathbf{X}_c=\mathbf{a}+\lambda \mathbf{b} \;\Rightarrow\; \mathbf{x} = f \left( \frac{a_x+\lambda b_x}{a_z+\lambda b_z}, \frac{a_y+\lambda b_y}{a_z+\lambda b_z} \right) \;\Rightarrow\; \mathbf{x}_{vp}=f \left( \frac{b_x}{b_z},\frac{b_y}{b_z} \right)$$
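
The limit $\lambda \rightarrow \infty$ can be illustrated numerically: points far along two parallel lines project close to the same image point. The focal length, direction and starting points are hypothetical, and $b_z \neq 0$ is assumed.

```python
def project(Xc, f):
    """Pinhole projection x = f X_c / Z_c, y = f Y_c / Z_c."""
    X, Y, Z = Xc
    return (f * X / Z, f * Y / Z)

def vanishing_point(b, f):
    """Image of the point at infinity along direction b (requires b_z != 0)."""
    return (f * b[0] / b[2], f * b[1] / b[2])

f = 1.0
b = (1.0, 0.5, 2.0)                               # shared direction of the lines
starts = [(0.0, 0.0, 4.0), (3.0, -1.0, 5.0)]      # two different points a

lam = 1e6                                         # a large value of lambda
far_points = [project(tuple(ai + lam * bi for ai, bi in zip(a, b)), f)
              for a in starts]
vp = vanishing_point(b, f)
```

The starting point $\mathbf{a}$ drops out in the limit, which is why every line with direction $\mathbf{b}$ shares one vanishing point.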

* **Horizon Line:** Line where parallel planes appear to meet
  * Any set of parallel lines ***lying on*** these planes will have a vanishing point ***on the horizon line***

><img src="images/image04.png" width=300>

## 3.4. Full Camera Model

* $\mathbf{X} \rightarrow \mathbf{X}_c$ 

><img src="images/image05.png" width=300>

>\begin{align}
\mathbf{X}_c &= \mathbf{RX}+\mathbf{T}\\
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
&=
\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} 
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
+ \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}
\end{align}

* $\mathbf{x}=(x,y) \rightarrow \mathbf{w}=(u,v)$

><img src="images/image06.png" width=300>

>$$u=u_0+k_ux \;\;\;,\;\;\; v=v_0+k_vy$$

* **Combining the Two**

>\begin{align}
u&=u_0+\frac{k_ufX_c}{Z_c}=u_0+\frac{k_uf(r_{11}X+r_{12}Y+r_{13}Z+T_x)}{r_{31}X+r_{32}Y+r_{33}Z+T_z}\\
\;\\
v&=v_0+\frac{k_vfY_c}{Z_c}=v_0+\frac{k_vf(r_{21}X+r_{22}Y+r_{23}Z+T_y)}{r_{31}X+r_{32}Y+r_{33}Z+T_z}
\end{align}

## 3.5. Homogeneous Coordinates
* **Homogeneous Coordinates:**
  * More natural framework for projective geometry
  * Using HC, imaging process $\rightarrow$ linear matrix operation
  * Using HC, series of projections $\rightarrow$ single matrix operation
  * Any equation in $(x,y)$ is equivalent to a ***homogeneous*** equation in $(x_1,x_2,x_3)$

* **Homogeneous $\rightarrow$ Cartesian**

>$$\widetilde{\mathbf{X}}=(x_1,x_2,x_3,x_4) \rightarrow \mathbf{X}=\left( \frac{x_1}{x_4}, \frac{x_2}{x_4}, \frac{x_3}{x_4} \right)$$

* **Cartesian $\rightarrow$ Homogeneous**

  * By convention, $\lambda$ is set to $1$

>$$\mathbf{X}=(X,Y,Z) \rightarrow \widetilde{\mathbf{X}}=(\lambda X,\lambda Y,\lambda Z,\lambda) $$

* **Perspective Projection**

>\begin{align}
\widetilde{\mathbf{x}}&=\mathbf{P}_p \widetilde{\mathbf{X}}_c\\
\begin{bmatrix} sx \\ sy \\ s \end{bmatrix} &=
\begin{bmatrix} f&0&0&0 \\ 0&f&0&0 \\ 0&0&1&0 \end{bmatrix}
\begin{bmatrix} \lambda X_c \\ \lambda Y_c \\ \lambda Z_c \\ \lambda \end{bmatrix}
\end{align}

>* $s$ and $\lambda$ have no effect on the projection
>* Same projection is achieved with $\mu\mathbf{P}_p \;\;\; (\mu\neq0)$



## 3.6. Full Camera Model in Homogeneous Coordinates

* **Step 1. Rigid Body Transformation** $(\widetilde{\mathbf{X}}\rightarrow\widetilde{\mathbf{X}}_c)$

>\begin{align}
\widetilde{\mathbf{X}}_c &= \mathbf{P}_r\widetilde{\mathbf{X}}\\
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} &=
\begin{bmatrix} r_{11} & r_{12} & r_{13} & T_x \\ r_{21} & r_{22} & r_{23} & T_y \\ r_{31} & r_{32} & r_{33} & T_z \\ 0&0&0&1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\end{align}
>* $\mathbf{P}_r$: **rigid body transformation matrix**

* **Step 2. Perspective Projection** $(\widetilde{\mathbf{X}}_c\rightarrow\widetilde{\mathbf{x}}_c)$

>\begin{align}
\widetilde{\mathbf{x}}_c &= \mathbf{P}_p\widetilde{\mathbf{X}}_c\\
\begin{bmatrix} sx \\ sy \\ s \end{bmatrix} &=
\begin{bmatrix} f&0&0&0 \\ 0&f&0&0 \\ 0&0&1&0 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
\end{align}
>* $\mathbf{P}_p$: **perspective projection matrix**

* **Step 3. CCD imaging**

>\begin{align}
\widetilde{\mathbf{w}} &= \mathbf{P}_c\widetilde{\mathbf{x}}\\
\begin{bmatrix} su \\ sv \\ s \end{bmatrix} &=
\begin{bmatrix} k_u&0&u_0 \\ 0&k_v&v_0 \\ 0&0&1 \end{bmatrix}
\begin{bmatrix} sx \\ sy \\ s \end{bmatrix}
\end{align}
>* $\mathbf{P}_c$: **CCD calibration matrix**

* **Perspective Camera**

>\begin{align}
\widetilde{\mathbf{w}} &= \mathbf{P}_{ps}\widetilde{\mathbf{X}}\\
\mathbf{P}_{ps} &= \mathbf{P}_{c}\mathbf{P}_{p}\mathbf{P}_{r}\\
&=
\begin{bmatrix} k_u&0&u_0 \\ 0&k_v&v_0 \\ 0&0&1 \end{bmatrix}
\begin{bmatrix} f&0&0&0 \\ 0&f&0&0 \\ 0&0&1&0 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & T_x \\ r_{21} & r_{22} & r_{23} & T_y \\ r_{31} & r_{32} & r_{33} & T_z \\ 0&0&0&1 \end{bmatrix}
\end{align}
>* $\mathbf{P}_{ps}$: **Camera Projection Matrix** for a **Perspective Camera**
>  * $10$ d.o.f. $=3$ for $\mathbf{R} + 3$ for $\mathbf{T} + 2$ for $(f,k_u,k_v) + u_0 + v_0$
>  * $\mathbf{P}_c\mathbf{P}_p$: accounts for **intrinsic** parameters
>  * $\mathbf{P}_r$: accounts for **extrinsic** parameters
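
The chain $\mathbf{P}_c\mathbf{P}_p\mathbf{P}_r$ can be multiplied out and compared with the Cartesian form $u=u_0+k_ufX_c/Z_c$, $v=v_0+k_vfY_c/Z_c$. All numerical parameters below are hypothetical, and plain nested lists stand in for a matrix library.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def project(P, X):
    """Apply a 3x4 projection to a world point and divide out the scale s."""
    Xh = list(X) + [1.0]
    su, sv, s = (sum(p * x for p, x in zip(row, Xh)) for row in P)
    return su / s, sv / s

# Hypothetical parameters, chosen only for illustration.
f, ku, kv, u0, v0 = 0.05, 2000.0, 2000.0, 320.0, 240.0
theta = math.radians(10.0)                       # rotation about the Y axis
R = [[math.cos(theta), 0.0, math.sin(theta)],
     [0.0, 1.0, 0.0],
     [-math.sin(theta), 0.0, math.cos(theta)]]
T = [0.1, -0.2, 5.0]

P_r = [R[0] + [T[0]], R[1] + [T[1]], R[2] + [T[2]], [0.0, 0.0, 0.0, 1.0]]
P_p = [[f, 0.0, 0.0, 0.0], [0.0, f, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P_c = [[ku, 0.0, u0], [0.0, kv, v0], [0.0, 0.0, 1.0]]
P_ps = matmul(P_c, matmul(P_p, P_r))             # 3x4 camera projection matrix

X = (0.3, 0.4, 2.0)
u, v = project(P_ps, X)

# The same point through the Cartesian equations.
Xc = [sum(R[i][j] * X[j] for j in range(3)) + T[i] for i in range(3)]
u_ref = u0 + ku * f * Xc[0] / Xc[2]
v_ref = v0 + kv * f * Xc[1] / Xc[2]
```

The division by $s$ in `project` is the only non-linear step; everything before it is a single matrix product.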

* **Alternative form of $\mathbf{P}_{ps}$**

>\begin{align}
\mathbf{P}_{ps} &= \mathbf{K}\left[\;\mathbf{R\;|\;T}\;\right]\\
&=
\begin{bmatrix} \alpha_u&0&u_0 \\ 0&\alpha_v&v_0 \\ 0&0&1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & T_x \\ r_{21} & r_{22} & r_{23} & T_y \\ r_{31} & r_{32} & r_{33} & T_z \end{bmatrix}
\end{align}

>* $\alpha_u=fk_u$ and $\alpha_v=fk_v$: **scaling factors**
>* $\alpha_v/\alpha_u$: **aspect ratio**

* **Projective Camera**

>\begin{align}
\widetilde{\mathbf{w}} &= \mathbf{P}\widetilde{\mathbf{X}}\\
\mathbf{P} &=
\begin{bmatrix} p_{11}&p_{12}&p_{13}&p_{14} \\ p_{21}&p_{22}&p_{23}&p_{24} \\ p_{31}&p_{32}&p_{33}&p_{34} \end{bmatrix}
\end{align}
>
>* $11$ d.o.f. (since the overall scale of $\mathbf{P}$ does not matter)
>* Projective camera is more convenient since we do not have to worry about any non-linear constraints on the elements of $\mathbf{P}$
>* Perspective camera is a special case of projective camera

## 3.7. Viewing a Plane

><img src="images/image07.png" width=500>

* **Perspective Camera**

>\begin{align}
\widetilde{\mathbf{w}} &= \mathbf{P}^p_{ps}\widetilde{\mathbf{X}}^p\\
&= \mathbf{P}_{c}\mathbf{P}_{p}\mathbf{P}^p_{r}\widetilde{\mathbf{X}}^p\\
&=
\begin{bmatrix} k_u&0&u_0 \\ 0&k_v&v_0 \\ 0&0&1 \end{bmatrix}
\begin{bmatrix} f&0&0&0 \\ 0&f&0&0 \\ 0&0&1&0 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & T_x \\ r_{21} & r_{22} & T_y \\ r_{31} & r_{32} & T_z \\ 0&0&1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
\end{align}

* **Projective Camera**
  * Relax the constraints to obtain a more tractable model
  
>\begin{align}
\widetilde{\mathbf{w}} &= \mathbf{P}^p\widetilde{\mathbf{X}}^p\\
&=
\begin{bmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
\end{align}

>* Transformation between $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{X}}^p$ is called:

>  * **Planar projective transformation** or **homography** or **collineation**

## 3.8. Viewing a Line

* **Perspective Camera**

>\begin{align}
\widetilde{\mathbf{w}} &= \mathbf{P}^l_{ps}\widetilde{\mathbf{X}}^l\\
\mathbf{P}^l_{ps} &=
\begin{bmatrix} k_u&0&u_0 \\ 0&k_v&v_0 \\ 0&0&1 \end{bmatrix}
\begin{bmatrix} f&0&0&0 \\ 0&f&0&0 \\ 0&0&1&0 \end{bmatrix}
\begin{bmatrix} r_{11} & T_x \\ r_{21} & T_y \\ r_{31} & T_z \\ 0&1 \end{bmatrix}
\end{align}

* **Projective Camera**
  
>\begin{align}
\widetilde{\mathbf{w}} &= \mathbf{P}^l\widetilde{\mathbf{X}}^l\\
\mathbf{P}^l &=
\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \\ p_{31} & p_{32} \end{bmatrix}
\end{align}



## 3.9. Camera Calibration
* **Camera Calibration**: the process of discovering the ***projection matrix*** (and its decomposition into camera matrix and the position and orientation of the camera) from an image of a controlled scene

* **Camera Calibration: 3D**

>$$\begin{bmatrix} su \\ sv \\ s \end{bmatrix} =
\begin{bmatrix} p_{11}&p_{12}&p_{13}&p_{14} \\ p_{21}&p_{22}&p_{23}&p_{24} \\ p_{31}&p_{32}&p_{33}&p_{34} \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$
>
>* **Step 1. 11 Parameters** to estimate $\rightarrow$ **6 Observations** required
>* **Step 2.** Solve the equations using linear least squares
>* **Step 3.** Decompose the projection matrix, $\mathbf{P}_{ps} = \mathbf{K}\left[\;\mathbf{R\;|\;T}\;\right]$

>  * $\mathbf{K}$ and $\mathbf{R}$ using **QR decomposition**,
>  * and $\mathbf{T}$ using $\mathbf{T}=\mathbf{K}^{-1}(p_{14},p_{24},p_{34})^T$

* **Camera Calibration: 2D**

>$$\begin{bmatrix} su \\ sv \\ s \end{bmatrix} =
\begin{bmatrix} p_{11}&p_{12}&p_{13} \\ p_{21}&p_{22}&p_{23} \\ p_{31}&p_{32}&p_{33} \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}$$

>* **8 Parameters** to estimate $\rightarrow$ **4 Observations**
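
The 2D case is small enough to solve exactly from four correspondences. The sketch below fixes the overall scale by setting $p_{33}=1$, which is an assumption that fails if the true $p_{33}$ is zero; with more than four noisy observations a least-squares solve would be used instead. The function names and the ground-truth matrix are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def homography_from_points(world, image):
    """Estimate the 3x3 matrix (p33 fixed to 1) from four (X, Y) -> (u, v) pairs."""
    A, b = [], []
    for (X, Y), (u, v) in zip(world, image):
        # Each observation contributes two linear equations in the 8 unknowns.
        A.append([X, Y, 1.0, 0.0, 0.0, 0.0, -u * X, -u * Y]); b.append(u)
        A.append([0.0, 0.0, 0.0, X, Y, 1.0, -v * X, -v * Y]); b.append(v)
    p = solve(A, b)
    return [p[0:3], p[3:6], [p[6], p[7], 1.0]]

def apply_h(P, X, Y):
    su, sv, s = (P[i][0] * X + P[i][1] * Y + P[i][2] for i in range(3))
    return su / s, sv / s

# Hypothetical ground truth, used only to synthesise the observations.
P_true = [[1.2, 0.1, 3.0], [-0.2, 0.9, 1.0], [0.05, 0.02, 1.0]]
world = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
image = [apply_h(P_true, X, Y) for X, Y in world]
P_est = homography_from_points(world, image)
```

Four observations give eight equations, matching the eight unknowns, which is why no three of the four world points may be collinear.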

* **Camera Calibration: 1D**

>$$\begin{bmatrix} su \\ sv \\ s \end{bmatrix} =
\begin{bmatrix} p_{11}&p_{12} \\ p_{21}&p_{22} \\ p_{31}&p_{32} \end{bmatrix}
\begin{bmatrix} X \\ 1 \end{bmatrix}$$

>* **5 Parameters** to estimate $\rightarrow$ **3 Observations**



## 3.10. Recovery of World Position

* **1D case**

>$$\begin{bmatrix} u \\ 1 \end{bmatrix} =
\begin{bmatrix} p_{11}&p_{12} \\ p_{31}&p_{32} \end{bmatrix}
\begin{bmatrix} \lambda X \\ \lambda \end{bmatrix}$$
>$$\;$$
>$$X=\frac{p_{32}u-p_{12}}{-p_{31}u+p_{11}}$$
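
A round trip through the 1D case confirms the inversion formula. The four matrix entries are hypothetical.

```python
def image_coordinate(p11, p12, p31, p32, X):
    """Forward map u = (p11 X + p12) / (p31 X + p32)."""
    return (p11 * X + p12) / (p31 * X + p32)

def world_coordinate(p11, p12, p31, p32, u):
    """Inverse map X = (p32 u - p12) / (-p31 u + p11)."""
    return (p32 * u - p12) / (-p31 * u + p11)

params = (2.0, 1.0, 0.1, 1.0)     # hypothetical (p11, p12, p31, p32)
X = 3.5
u = image_coordinate(*params, X)
X_back = world_coordinate(*params, u)
```

The inverse map is the same kind of rational function as the forward map, built from the entries of the inverse $2\times2$ matrix up to scale.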

* **2D case**

>$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} p_{11}&p_{12}&p_{13} \\ p_{21}&p_{22}&p_{23} \\ p_{31}&p_{32}&p_{33} \end{bmatrix}
\begin{bmatrix} \lambda X \\ \lambda Y \\ \lambda \end{bmatrix}$$
>$$\;$$
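>$$X=\frac{p^i_{11}u+p^i_{12}v+p^i_{13}}{p^i_{31}u+p^i_{32}v+p^i_{33}}\;\;\;,\;\;\;Y=\frac{p^i_{21}u+p^i_{22}v+p^i_{23}}{p^i_{31}u+p^i_{32}v+p^i_{33}}$$
>
>* Here $p^i_{jk}$ denotes the $(j,k)$ element of the inverse of the $3\times3$ projection matrix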

* **3D case**

>$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} p_{11}&p_{12}&p_{13}&p_{14} \\ p_{21}&p_{22}&p_{23}&p_{24} \\ p_{31}&p_{32}&p_{33}&p_{34} \end{bmatrix}
\begin{bmatrix} \lambda X \\ \lambda Y \\ \lambda Z \\ \lambda \end{bmatrix}$$
>$$\;$$
>$$\text{defines the light ray}$$

## 3.11. Affine Camera
* **Weak Perspective**
  * When $\Delta Z_c$ is small compared to $Z_c$ $\rightarrow$ Assume $Z_c=Z^{av}_c$
  * Then, the projection becomes,
  
>\begin{align}
\widetilde{\mathbf{x}}_c &= \mathbf{P}_{pll}\widetilde{\mathbf{X}}_c\\
\begin{bmatrix} sx \\ sy \\ s \end{bmatrix} &=
\begin{bmatrix} f&0&0&0 \\ 0&f&0&0 \\ 0&0&0&Z^{av}_c \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}
\end{align}  

* **Overall Projection Matrix**
  
>$$\mathbf{P}_{wp}=\mathbf{P}_{c}\mathbf{P}_{pll}\mathbf{P}_{r}
=\begin{bmatrix} fk_ur_{11}&fk_ur_{12}&fk_ur_{13}&fk_uT_x+u_0Z^{av}_c \\ fk_vr_{21}&fk_vr_{22}&fk_vr_{23}&fk_vT_y+v_0Z^{av}_c \\ 
0&0&0&Z^{av}_c \end{bmatrix}$$

* **Projection Matrix for Affine Camera**
  * Discard nonlinear constraints
  
>$$\mathbf{P}_{aff}=
\begin{bmatrix} p_{11}&p_{12}&p_{13}&p_{14} \\ p_{21}&p_{22}&p_{23}&p_{24} \\ 0&0&0&p_{34} \end{bmatrix}$$
>* **8 Parameters** to estimate $\rightarrow$ **4 Observations** required
>* **2D case:** 6 Parameters $\rightarrow$ 3 Observations
>* **1D case:** 4 Parameters $\rightarrow$ 2 Observations

## 3.12. Invariants
* **Definition**

>$$f(\text{image}): \text{robust across viewpoints, change across objects}$$

* **Invariants of Different Cameras viewing planar scenes**

>* **1. Euclidean Invariants**
>  * **Condition:** Image plane is parallel to, and at a fixed distance from, the world plane
>  * **Invariants: Lengths / Areas**

><img src="images/image08.png" width=300>

>* **2. Similarity Invariants**
>  * **Condition:** Euclidean + varying distance
>  * **Invariants: Ratio of Lengths / Ratio of Angles**

><img src="images/image09.png" width=300>

>* **3. Affine Invariants**
>  * **Invariants: Parallelism / Ratio of Areas**
>    * Parallelism: ratios of lengths along collinear or parallel lines

><img src="images/image10.png" width=300>

>* **4. Projective Invariants**
>  * **Invariants: concurrency / collinearity / tangent discontinuities / cusps / order of contact**
>    * Order of Contact: **intersection** (1 point of contact), **tangency** (2 poc), **inflection** (3 poc)

><img src="images/image11.png" width=300>


## 3.13. Cross-Ratio (Perspective Camera)
* **4 Collinear Points**

><img src="images/image12.png" width=300>

>$$\begin{bmatrix} sl \\ s \end{bmatrix} = \begin{bmatrix} p & q \\ r & 1 \end{bmatrix} \begin{bmatrix} X \\ 1 \end{bmatrix}$$

>$$l_i=\frac{pX_i+q}{rX_i+1}$$

>$$\textbf{cross-ratio:} \;\;\; \frac{(l_d-l_a)(l_c-l_b)}{(l_d-l_b)(l_c-l_a)}=\frac{(X_d-X_a)(X_c-X_b)}{(X_d-X_b)(X_c-X_a)}$$
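
The invariance can be checked numerically for four collinear points under a 1D projective map. The values of $p$, $q$, $r$ and the world positions are hypothetical; the map must be non-degenerate ($p \neq qr$).

```python
def project_1d(p, q, r, X):
    """l = (p X + q) / (r X + 1)."""
    return (p * X + q) / (r * X + 1.0)

def cross_ratio(a, b, c, d):
    return ((d - a) * (c - b)) / ((d - b) * (c - a))

X = [0.0, 1.0, 3.0, 7.0]                       # X_a, X_b, X_c, X_d on the world line
l = [project_1d(1.5, 0.2, 0.1, Xi) for Xi in X]

cr_world = cross_ratio(*X)
cr_image = cross_ratio(*l)
```

The individual image distances differ from the world distances, yet the two cross-ratios agree.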

* **5 Coplanar Points**
  * ***Find 4 more distinguished points $\rightarrow$ 2 sets of 4 collinear points***

><img src="images/image13.png" width=500>

>$$\delta_1=\text{cross-ratio of}\; \left\{a,e_2,b,f\right\}$$
>$$\delta_2=\text{cross-ratio of}\; \left\{a,e_1,d,g\right\}$$

## 3.14. Canonical Views
* **5 Coplanar Points**
  * **Idea:** Any two views $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{w}}'$ are related by a projectivity $\mathbf{P}^{p''}$
  
><img src="images/image14.png" width=500>
>$$\;$$
>\begin{align}
\widetilde{\mathbf{w}}&=\mathbf{P}^p\widetilde{\mathbf{X}}^p \\
\widetilde{\mathbf{w}}'&=\mathbf{P}^{p'}\widetilde{\mathbf{X}}^p=\mathbf{P}^{p'}[\mathbf{P}^p]^{-1}\widetilde{\mathbf{w}}=\mathbf{P}^{p''}\widetilde{\mathbf{w}}
\end{align}

>* **Step 1:** Distinguish 4 points
>* **Step 2:** Find $\mathbf{P}^p$ which maps them onto the corners of the unit square in the canonical view
>* **Step 3:** Apply $\mathbf{P}^p$ to other points $\rightarrow$ **invariant signature**




