# **Image formation and acquisition**
An *imaging device* gathers the light reflected by 3D objects to create a 2D representation of the scene (an image).
In *computer vision* the process is inverted: the goal is to infer knowledge about the objects from one or more digital images.

Understanding this process requires knowledge of:
* The geometric relationship between **scene points** (3D) and **image points** (2D);
* The radiometric relationship between the brightness of image points and the light emitted by scene points;
* The image digitization process (quantization).

## The pinhole camera
A pinhole camera is the simplest imaging device: the light (ideally only one ray) goes through a very small hole and hits the image plane (sensor or photosensitive paper).

Since the image is obtained by *drawing* straight rays from scene points, through the hole, to the image plane, the images appear reversed.

Useful images can hardly be captured by such a device; nevertheless, it provides a simple geometrical model to approximate the image formation process.


## Perspective projection
The geometric model of image formation in a pinhole camera is known as **perspective projection**.

<img src="https://i.ibb.co/R0bhRwj/photo-2021-01-18-13-26-00.jpg" width="400px" />

Where:
* $M$ is a scene point and $m$ is the corresponding image point;
* $I$ is the image plane;
* $C$ is the optical center, and the line through $C$ orthogonal to $I$ is the optical axis;
* $c$ is the piercing point (or image center): the intersection between the optical axis and the image plane;
* $F$ is the focal plane: a plane parallel to the image plane and passing through the optical center;
* $f$ is the focal length (distance between piercing point and optical center).

The goal is to find a mapping between world coordinates (3D) and image coordinates (2D): $(x,y,z) \to (u,v)$.

*Note*: coordinates are measured in their own reference systems: the camera reference system (CRS) for 3d points and the image reference system (IRS) for 2d points. The optical center $C$ is the origin of the CRS and the piercing point $c$ is the origin of the IRS.

<img src="https://i.ibb.co/KhW6LjS/photo-2021-01-18-13-45-02.jpg" width="1000px" />

By triangle similarity, it holds: $$\frac{u}{x} = \frac{v}{y} = -\frac{f}{z}$$
From the above relationship we can derive:
$$u = -x \frac{f}{z}$$
$$v = -y \frac{f}{z}$$

We can get rid of the sign inversion by imagining the image plane in front of the optical center rather than behind it:
$$u = x \frac{f}{z}$$
$$v = y \frac{f}{z}$$
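The projection equations translate directly into code. The following is a minimal sketch of the non-inverted model; the function name `project` and the numeric values are illustrative assumptions, not part of the notes.

```python
def project(point, f):
    """Project a 3d point (x, y, z), expressed in the CRS, onto the image plane.

    Uses the non-inverted model u = x*f/z, v = y*f/z (image plane in front of C).
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("the point must lie in front of the camera (z > 0)")
    return (x * f / z, y * f / z)


# Hypothetical values: focal length and coordinates in the same unit (e.g. mm).
u, v = project((200.0, 100.0, 1000.0), f=50.0)
print(u, v)  # 10.0 5.0
```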

What do these equations tell us?
1. They map 3d coordinates into 2d coordinates;
1. Image coordinates are just a **scaled version** of the scene ones, scaled by a factor $\frac{f}{z}$;
1. We can find a point in the image given its position in the real world.

About the last point: since the image formation process maps a 3d space onto a 2d space, it inevitably loses information.
Indeed, the mapping is not a bijection: a given scene point is mapped to a unique image point, but a given image point corresponds to a whole 3d line (the ray through the optical center).

Recovering the 3d structure is an ill-posed problem (no unique solution): once we take an image point we can only state that its corresponding scene point lies on a line, but we can't tell which point of the line it is.
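The ambiguity can be checked numerically. In this sketch (helper names and values are hypothetical) several scene points at different depths along the same ray all project to the same image point, so the depth cannot be recovered from a single image.

```python
def project(point, f):
    """Perspective projection u = x*f/z, v = y*f/z."""
    x, y, z = point
    return (x * f / z, y * f / z)


def back_project(u, v, f, z):
    """Scene point at depth z lying on the ray through image point (u, v)."""
    return (u * z / f, v * z / f, z)


f = 50.0
m = (10.0, 5.0)  # a single image point
# Three candidate scene points at different (assumed) depths along the same ray.
candidates = [back_project(m[0], m[1], f, z) for z in (500.0, 1000.0, 2000.0)]
for M in candidates:
    print(M, "->", project(M, f))
```

Every candidate maps back to `m`: the image point alone only constrains the scene point to the ray.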

A basic approach to address this problem is to use more images (stereo vision, which is also why we have two eyes).


## Stereo images for 3d inference
Given correspondences, 3d information can be recovered easily by triangulation: the two rays provide the constraints needed to localize the point in 3d space.

<img src="https://i.ibb.co/6WgxQBr/photo-2021-01-18-14-22-09.jpg" width="800px" />

The two image planes $\pi_L$ and $\pi_R$ don't have to be coplanar, but the problem is simpler when they are (as with human eyes).

### Standard stereo geometry
It's the configuration typically used in stereo cameras:
* Parallel $(x,y,z)$ axes, like human eyes;
* Same focal length (same camera);
* Coplanar image planes.

Let:
* $P_L = \begin{bmatrix} x_L \\ y_L \\ z_L \end{bmatrix}$ a point in the left CRS;
* $P_R = \begin{bmatrix} x_R \\ y_R \\ z_R \end{bmatrix}$ a point in the right CRS.

The transformation between the two reference frames is just a translation by $b$ (the baseline), usually horizontal:
$$P_L - P_R = \begin{bmatrix} x_L - x_R \\ y_L - y_R \\ z_L - z_R \end{bmatrix} = \begin{bmatrix} b \\ 0 \\ 0 \end{bmatrix}$$
Now consider the **perspective projection equations**. Since $y_L = y_R = y$ and $z_L = z_R = z$, corresponding image points share the same vertical coordinate:
$$v_L = v_R = y \frac{f}{z}$$
$$u_L = x_L \frac{f}{z}$$
$$u_R = x_R \frac{f}{z}$$
It's possible to calculate the disparity $d = u_L - u_R$ (the horizontal distance between the two points):
$$d = u_L - u_R = x_L \frac{f}{z} - x_R \frac{f}{z} = (x_L - x_R) \frac{f}{z} = b \frac{f}{z}$$
Hence:
$$d = b \frac{f}{z}$$
Extracting $z$, the **fundamental relation of stereo vision** is obtained:
$$z = b \frac{f}{d}$$

So we measure the disparity $d$ and then compute the depth $z$.
Note that we have no information about the correspondences: to find, for example, the point in the right image that corresponds to a given point in the left one (remembering that $v_L=v_R$), we just search along the same row (the candidates are the pixels on the red line).
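As a quick numerical check of $z = b \frac{f}{d}$, here is a minimal sketch. The baseline and focal length are made-up values, with $f$ and $d$ expressed in pixels and $b$ in metres, so that $z$ comes out in metres.

```python
def depth_from_disparity(d, b, f):
    """Fundamental relation of stereo vision: z = b * f / d."""
    if d <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return b * f / d


b = 0.12   # hypothetical baseline [m]
f = 700.0  # hypothetical focal length [px]
for d in (70.0, 35.0, 7.0):
    print(f"d = {d:5.1f} px -> z = {depth_from_disparity(d, b, f):.2f} m")
```

Depth is inversely proportional to disparity: near points have large disparities, far points small ones.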

*Note*: windows of pixels are compared, since a single pixel is not distinctive enough to find the exact correspondence.

<img src="https://i.ibb.co/pb8p72v/photo-2021-01-18-17-07-32.jpg" width="800px" />

**Projector**: a device that projects a random pattern of black dots onto the scene, so that the geometry can also be inferred in uniform (textureless) areas.

### Epipolar geometry
It's an alternative stereo vision approach, based on the fact that all the epipolar lines in an image meet at a point called the epipole (the projection of the optical center of the other camera). It is usually used as a second mapping to check correctness.

<img src="https://i.ibb.co/RjVVxbS/photo-2021-01-18-17-12-51.jpg" width="400px" />

However, searching along oblique epipolar lines is awkward, so rectification can be used.
It's always possible to warp the images as if they had been acquired with a standard geometry (horizontal and collinear conjugate epipolar lines) by computing and applying to each of them a rectifying transformation (a homography).

### The stereo correspondence problem
Given a point in one image (left camera), find the point in the other image (right camera) that is the projection of the same 3d point (the corresponding point).
The basic idea is that corresponding points look similar in the two images.
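One simple way to turn "corresponding points look similar" into code is to compare small windows along the same row and keep the most similar one. The sketch below uses the sum of absolute differences (SAD) as dissimilarity, which is one possible choice and not necessarily the one meant by these notes; the synthetic image pair and all names are assumptions.

```python
import random


def sad(left, right, row, col_l, col_r, half):
    """Sum of absolute differences between two (2*half+1) x (2*half+1) windows."""
    total = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            total += abs(left[row + dr][col_l + dc] - right[row + dr][col_r + dc])
    return total


def match_disparity(left, right, row, col, max_disp, half=1):
    """Disparity d in [0, max_disp] minimising the SAD along the same row."""
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        col_r = col - d  # standard geometry: u_R = u_L - d
        if col_r - half < 0:
            break
        cost = sad(left, right, row, col, col_r, half)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic pair: a random texture and a copy shifted by a known disparity.
rng = random.Random(0)
ROWS, COLS, TRUE_D = 7, 20, 3
left = [[rng.randrange(256) for _ in range(COLS)] for _ in range(ROWS)]
right = [[left[r][c + TRUE_D] if c + TRUE_D < COLS else 0 for c in range(COLS)]
         for r in range(ROWS)]

print(match_disparity(left, right, row=3, col=10, max_disp=6))
```

On real images the minimum is rarely zero and uniform regions give ambiguous matches, which is why texture (natural or projected) matters.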

## Properties of perspective projection
* The farther objects are from the camera, the smaller they appear in the image. For a segment of length $L$ parallel to the image plane it holds $l = L \frac{f}{z}$, but in general, even with arbitrary position and orientation, **length always shrinks with distance**;
* Perspective projection maps 3d lines into image lines (in theory);
* Ratios of lengths are not preserved unless the scene is planar and parallel to the image plane;
* Parallelism between 3d lines is not preserved except for lines parallel to the image plane (**perspective distortion**).

<center><img src=https://i.ibb.co/XVmgC6S/photo-2021-01-18-17-37-06.jpg width="400px" /></center>


## Vanishing points
The images of parallel 3d lines (in the real world) meet at a point, referred to as the vanishing point.

The vanishing point of a 3d line is the **image of the point at infinity of the line** and it can be determined by the intersection between: 
* The image plane;
* The line parallel to the given one, passing through the optical center.

So, all parallel 3d lines share the same vanishing point (their images meet at it, unless such point is at infinity).

Let's consider the parametric equation of the line in the CRS:
$$M = M_0 + \lambda D = \begin{bmatrix} x_0 \\ y_0 \\ z_0 \end{bmatrix} + \lambda \begin{bmatrix} a \\ b \\ c \end{bmatrix}$$
where $M_0$ is a point on the line and $D$ is the direction cosines vector.

First project a generic point $M$ of the line onto its image point $m$. Given:
$$m = \begin{bmatrix} u \\ v \end{bmatrix}$$
the projection will be:
$$u = f \frac{x}{z} = f \frac{x_0 + \lambda a}{z_0 + \lambda c}$$
$$v = f \frac{y}{z} = f \frac{y_0 + \lambda b}{z_0 + \lambda c}$$
To get the vanishing point we consider the **infinitely distant point** along the line. Given:
$$m_\infty = \begin{bmatrix} u_\infty \\ v_\infty \end{bmatrix}$$
the projection will be:
$$u_\infty = \lim_{\lambda \to \infty} u = f \frac{a}{c}$$
$$v_\infty = \lim_{\lambda \to \infty} v = f \frac{b}{c}$$

The vanishing point depends on the orientation of the line only ($a$, $b$ and $c$), not on its position.
Whenever the line is parallel to the image plane ($c = 0$) the vanishing point goes to infinity.
In such a case the image of the line has the same orientation as the 3d line.
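The limit can be verified numerically: projecting points farther and farther along two parallel lines, both images approach $(f \frac{a}{c}, f \frac{b}{c})$ regardless of the starting point $M_0$. This is only a sketch with made-up values and hypothetical helper names.

```python
def project(point, f):
    x, y, z = point
    return (f * x / z, f * y / z)


def point_on_line(m0, direction, lam):
    """M = M0 + lambda * D."""
    return tuple(p + lam * d for p, d in zip(m0, direction))


def vanishing_point(direction, f):
    a, b, c = direction
    if c == 0:
        raise ValueError("line parallel to the image plane: vanishing point at infinity")
    return (f * a / c, f * b / c)


f = 1.0
D = (0.6, 0.0, 0.8)                       # shared direction (unit vector)
starts = [(0.0, 0.0, 2.0), (1.0, -1.0, 5.0)]  # two different parallel lines
vp = vanishing_point(D, f)
for m0 in starts:
    far = project(point_on_line(m0, D, 1e6), f)
    print(m0, "->", far, "vs vanishing point", vp)
```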

Knowing the vanishing point of a sheaf of parallel lines (and the focal length) allows determining the unknown orientation of the lines.

We know that:
* $u_\infty = f \frac{a}{c}$;
* $v_\infty = f \frac{b}{c}$;
* $a^2 + b^2 + c^2 = 1$ (unit vector).

So:
$$u_\infty^2 + v_\infty^2 = f^2 \frac{a^2}{c^2} + f^2 \frac{b^2}{c^2} = \frac{f^2}{c^2}(a^2 + b^2)$$
Multiplying by $c^2$, and since $a^2 + b^2 = 1 - c^2$:
$$c^2(u_\infty^2 + v_\infty^2) = f^2 (1 - c^2)$$
Expanding and rearranging:
$$c^2(u_\infty^2 + v_\infty^2) = f^2 - f^2 c^2$$
$$c^2(u_\infty^2 + v_\infty^2) + f^2 c^2 = f^2$$
$$c^2(u_\infty^2 + v_\infty^2 + f^2) = f^2$$
Taking the square root (and choosing $c > 0$, i.e. the direction pointing away from the camera), we finally obtain:
$$c = \frac{f}{\sqrt{u_\infty^2 + v_\infty^2 + f^2}}$$

Inverting the first two relationships, we obtain:
$$a = u_\infty \frac{c}{f}$$
$$b = v_\infty \frac{c}{f}$$
Substituting $c$:
$$a = \frac{u_\infty}{\sqrt{u_\infty^2 + v_\infty^2 + f^2}}$$
$$b = \frac{v_\infty}{\sqrt{u_\infty^2 + v_\infty^2 + f^2}}$$

So we can write in a more compact way:
$$\begin{bmatrix} a \\ b \\ c \end{bmatrix} = \frac{1}{\sqrt{u_\infty^2 + v_\infty^2 + f^2}} \begin{bmatrix} u_\infty \\ v_\infty \\ f \end{bmatrix}$$
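The compact formula amounts to normalising the vector $(u_\infty, v_\infty, f)$. A minimal sketch (function name and values are illustrative assumptions):

```python
import math


def direction_from_vanishing_point(u_inf, v_inf, f):
    """Unit direction (a, b, c) of the 3d lines sharing the given vanishing point."""
    norm = math.sqrt(u_inf ** 2 + v_inf ** 2 + f ** 2)
    return (u_inf / norm, v_inf / norm, f / norm)


# Hypothetical vanishing point and focal length, in the same unit.
print(direction_from_vanishing_point(3.0, 0.0, 4.0))  # (0.6, 0.0, 0.8)
```

The result is the direction used in the earlier formulas $u_\infty = f \frac{a}{c}$, $v_\infty = f \frac{b}{c}$: here $4 \cdot 0.6 / 0.8 = 3$.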

*Example application*: a mobile robot can be driven through indoor hallways by tracking the dominant vanishing point and steering the robot to keep it at the center of the image.

### Camera orientation from vanishing point
Knowledge of vanishing points of two orthogonal directions allows determining camera orientation wrt a scene plane.

From the horizontal lines we get the unit vector $i_c$, while from the vertical ones we get the unit vector $j_c$.
The vector product between the two provides $k_c = i_c \times j_c$.

These three unit vectors define the orientation of the camera wrt the considered scene plane: 
$$R_{\text{pc}} = \begin{bmatrix} i_c & j_c & k_c \end{bmatrix}$$
$$R_{\text{cp}} = {R_{\text{pc}}}^T$$
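A possible sketch of this construction: each vanishing point gives a unit direction, the cross product completes the triad, and the three vectors are stacked as columns of $R_{\text{pc}}$. The two vanishing points below are made-up values chosen so that the corresponding directions are exactly orthogonal; with measured vanishing points this only holds approximately and some re-orthogonalisation would be needed.

```python
import math


def unit_direction(vp, f):
    """Unit 3d direction associated with a vanishing point (u, v)."""
    u, v = vp
    norm = math.sqrt(u * u + v * v + f * f)
    return (u / norm, v / norm, f / norm)


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def camera_orientation(vp_horizontal, vp_vertical, f):
    """Return (R_pc, R_cp) as nested lists; columns of R_pc are i_c, j_c, k_c."""
    i_c = unit_direction(vp_horizontal, f)
    j_c = unit_direction(vp_vertical, f)
    k_c = cross(i_c, j_c)
    r_pc = [[i_c[r], j_c[r], k_c[r]] for r in range(3)]
    r_cp = [[r_pc[c][r] for c in range(3)] for r in range(3)]  # transpose
    return r_pc, r_cp


# Hypothetical vanishing points: (2, 0, 1) and (-0.5, 3, 1) are orthogonal.
r_pc, r_cp = camera_orientation((2.0, 0.0), (-0.5, 3.0), f=1.0)
for row in r_pc:
    print([round(x, 3) for x in row])
```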

## Weak perspective
Perspective effects may not be evident: this occurs whenever the framed subject is thin compared to its distance from the camera.
In such cases perspective projection can be approximated by a scaled orthographic projection (weak perspective, basically we project with parallel rays and then scale):
$$ \begin{cases} u = sx \\ v = sy \end{cases}$$

When does it apply? When we are looking at a thin scene from a large distance (the range of depths is small compared to the average depth).

<img src=https://i.ibb.co/MNzzdBV/photo-2021-01-18-18-43-28.jpg width="400px" />

Since $\Delta z \ll z_0$ it will be:
$$\frac{f}{z_0 + \Delta z} \approx \frac{f}{z_0 - \Delta z} \approx \frac{f}{z_0} = s$$

And then:
$$u \approx \frac{f}{z_0} x = sx $$
$$v \approx \frac{f}{z_0} y = sy $$

It means that the perspective scaling factor $\frac{f}{z}$ is approximately constant $\frac{f}{z_0}$ for all points of the framed subject.
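To get a feeling for the size of the approximation error, the sketch below compares the exact perspective projection with the weak perspective one for a point slightly off the reference depth $z_0$. All numbers are assumptions chosen for illustration.

```python
def perspective(x, y, z, f):
    return (f * x / z, f * y / z)


def weak_perspective(x, y, z0, f):
    s = f / z0  # constant scale factor for the whole subject
    return (s * x, s * y)


f, z0 = 50.0, 10000.0  # hypothetical focal length and reference depth [mm]
delta_z = 50.0         # depth offset of the point wrt z0 [mm]

exact = perspective(100.0, 40.0, z0 + delta_z, f)
approx = weak_perspective(100.0, 40.0, z0, f)
rel_err = abs(approx[0] - exact[0]) / abs(exact[0])
print(exact, approx, f"relative error = {rel_err:.4f}")
```

The relative error equals $\Delta z / z_0$ (here 0.5%), which is why the approximation is acceptable only when the depth range is small compared to the distance.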

## Lenses
A lens gathers light and focuses it at a single point.

A scene point is in focus when all its light rays gathered by the camera hit the image plane at the same point.

In a pinhole device this happens for all scene points, due to the small size of the hole.
The camera features an infinite **depth of field** (DOF).
The drawback is that such a small aperture gathers a very limited amount of light, so a very long exposure time is required. Consequently only static scenes can be acquired, otherwise the image is affected by the so-called *motion blur*.

For this reason, cameras rely on lenses to gather more light from a scene point and focus it onto a single image point, enabling a shorter exposure time.
However, the DOF is no longer infinite, since only points across a limited range of distances can be simultaneously in focus in a given image.

### Thin lens equation
In general cameras feature complex optical systems, but it's possible to use an approximate model called the **thin lens model**.

<img src=https://i.ibb.co/qkV8tpr/photo-2021-01-18-22-11-32.jpg width="400px" />

Let:
* $P$ the scene point (3d) and $p$ the corresponding **focused** image point (2d);
* $u$ the distance from $P$ to the lens and $v$ the distance from $p$ to the lens;
* $f$ the focal length (now a lens parameter);
* $C$ the center of the lens;
* $F$ the focal point or focus of the lens.

Then it holds: $$\frac{1}{u} + \frac{1}{v} = \frac{1}{f}$$

To determine the position of a focused image point we can leverage the properties of thin lenses:
1. Rays parallel to the optical axis are deflected and pass through $F$;
1. Rays through $C$ are undeflected.

That means that, for a given position of the image plane, only points at one specific distance are in focus.

*Note*: if the image is in focus, the image formation process obeys the perspective projection model, with the center of the lens being the optical center and the distance $v$ acting as the effective focal length of the projection, which is a different concept from the focal length of the lens.

### Circles of confusion
Given the thin lens equation we can:
1. Fix the distance $v$ of the image plane and derive the distance $u$ of the scene points in focus: $$\frac{1}{u} + \frac{1}{v} = \frac{1}{f} \to u = \frac{vf}{v - f}$$
1. Fix the focus distance $u$ (acquire scene points at a certain distance) and derive where the image plane must be placed: $$\frac{1}{u} + \frac{1}{v} = \frac{1}{f} \to v = \frac{uf}{u - f}$$
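Both rearrangements of the thin lens equation are one-liners. The sketch below uses hypothetical function names and values (all lengths in mm) and checks that the two are inverses of each other.

```python
def image_distance(u, f):
    """Distance v of the image plane that focuses points at distance u."""
    if u <= f:
        raise ValueError("the scene point must be farther than the focal length")
    return u * f / (u - f)


def focus_distance(v, f):
    """Distance u of the scene points in focus for an image plane at distance v."""
    if v <= f:
        raise ValueError("the image plane must be farther than the focal length")
    return v * f / (v - f)


f = 50.0                     # hypothetical focal length [mm]
v = image_distance(1000.0, f)
print(f"v = {v:.3f} mm, back to u = {focus_distance(v, f):.1f} mm")
```

Note that as $u$ grows, $v$ approaches $f$ from above.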

Given the chosen position of the *image plane*, scene points in front of and behind the corresponding *focus plane* will be out of focus, appearing in the image as circles (circles of confusion or blur circles) rather than points.

<center><img src=https://i.ibb.co/HqMB4gW/photo-2021-01-18-22-31-37.jpg width="600px" /> </center>

In practice, as long as blur circles are smaller than the size of the photosensing elements, the image will still look in focus. This happens because the image plane is not continuous (it is made of pixels).
The range of distances across which the image appears in focus (due to blur circles being small enough) determines the depth of field (DOF).
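The size of a blur circle can be estimated under the thin lens model. The formula used below is not stated in these notes: it follows from similar triangles on the cone of light leaving an aperture of diameter $d$ and converging at the distance where the point would be focused. Names and values (in mm) are assumptions.

```python
def image_distance(u, f):
    """Thin lens equation solved for v."""
    return u * f / (u - f)


def blur_circle_diameter(u_point, u_focus, f, aperture):
    """Blur circle of a point at u_point when the camera is focused at u_focus."""
    v_plane = image_distance(u_focus, f)  # where the image plane sits
    v_point = image_distance(u_point, f)  # where the point would be in focus
    # Similar triangles: cone of base `aperture` converging at v_point,
    # cut by the image plane at v_plane.
    return aperture * abs(v_plane - v_point) / v_point


def looks_in_focus(u_point, u_focus, f, aperture, pixel_size):
    return blur_circle_diameter(u_point, u_focus, f, aperture) <= pixel_size


f, aperture, pixel = 50.0, 6.25, 0.005  # hypothetical lens and pixel size [mm]
for u_point in (1000.0, 1990.0, 2000.0):
    c = blur_circle_diameter(u_point, 2000.0, f, aperture)
    print(u_point, f"{c:.5f}", looks_in_focus(u_point, 2000.0, f, aperture, pixel))
```

Halving `aperture` halves every blur circle, which widens the range of distances that pass the `looks_in_focus` check.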

Cameras often deploy an adjustable diaphragm (iris) to control the amount of light gathered through the effective aperture of the lens.
The smaller the diaphragm aperture, the larger the DOF (smaller blur circles).
We can also increase the power of the light source to capture very fast (but tiny) objects.

**F-number**: ratio of the focal length to the effective aperture diameter of the lens ($\frac{f}{d}$).
Discrete F-number values, known as *stops*, are usually reported on the diaphragm to allow the user to adjust the effective aperture.

The higher the chosen stop, the smaller the diaphragm aperture (larger DOF). With smaller stops we gather more light.
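A small sketch of the relation between stops and light. It relies on a fact not spelled out in these notes: the gathered light is proportional to the aperture area, hence to $1/N^2$ for F-number $N$, so conventional full stops are spaced by a factor $\sqrt{2}$ and each one halves the light. Function names and values are illustrative.

```python
import math


def f_number(focal_length, aperture_diameter):
    """F-number N = f / d."""
    return focal_length / aperture_diameter


def relative_light(n, n_ref):
    """Light gathered at F-number n relative to n_ref (same exposure time)."""
    return (n_ref / n) ** 2


# Full stops as powers of sqrt(2); nominal markings are rounded by convention
# (e.g. the sixth value is usually written 5.6).
stops = [round(math.sqrt(2) ** k, 1) for k in range(7)]
print(stops)
print(f_number(50.0, 25.0), relative_light(4.0, 2.0))
```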

### Focusing mechanism
To focus on objects at different distances, a mechanism that allows the lens to translate along the optical axis (wrt a fixed position of the image plane) is provided.

The mechanism has two end positions.
At one end position ($v=f$) the camera is focused at infinity; the mechanism then allows the lens to be translated farther away from the image plane up to a certain maximum value (the other end position), which determines the minimum focusing distance.

<center><img src=https://i.ibb.co/LYDRsw6/photo-2021-01-19-09-27-52.jpg width="600px" /> </center>

### Telecentric lenses
By placing a diaphragm with a small hole at the focal point of the lens it's possible to block all light rays but those parallel to the optical axis.

<center><img src=https://i.ibb.co/ncrykMp/photo-2021-01-19-12-10-08.jpg width="600px" /> </center>

In this manner we realize an **orthographic projection**, which doesn't exhibit perspective effects, although the size of the object can't be larger than the lens itself.


### Fundamental radiometric relation
Let:
* **Irradiance** $E$ (of a point on a surface): amount of light incident on that point;
* **Radiance** $L$ (of a surface point in any direction): amount of light emitted by the point in that direction.

The fundamental radiometric relation shows that the irradiance of an image point is proportional to the radiance of the corresponding scene point along the viewing direction:
$$E(p) = L(P) \frac{\pi}{4} {\Bigl(\frac{d}{f}\Bigr)}^2 \cos^4 \alpha$$
where $d$ is the diameter of the lens, $f$ its focal length and $\alpha$ the angle between the optical axis and the ray from $P$ through the center of the lens.

<center><img src=https://i.ibb.co/r5Jb8Rr/photo-2021-01-19-12-30-11.jpg width="400px" /> </center>

*Note*: the radiance depends on the power and position of the light sources, and on the reflectance properties of the material.
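The relation can be evaluated directly to see how quickly the $\cos^4 \alpha$ term darkens the image away from the optical axis. In this sketch $d$ is taken as the lens diameter, $f$ as the focal length and $\alpha$ as the angle between the optical axis and the viewing ray; the numeric values are made up.

```python
import math


def irradiance(radiance, d, f, alpha):
    """E = L * (pi / 4) * (d / f)^2 * cos^4(alpha), with alpha in radians."""
    return radiance * (math.pi / 4.0) * (d / f) ** 2 * math.cos(alpha) ** 4


L_scene, d, f = 100.0, 12.5, 50.0  # hypothetical radiance, lens diameter, focal length
on_axis = irradiance(L_scene, d, f, 0.0)
for deg in (0, 10, 20, 30):
    e = irradiance(L_scene, d, f, math.radians(deg))
    print(f"alpha = {deg:2d} deg -> E / E_axis = {e / on_axis:.3f}")
```

Note that $(d/f)^2$ is the inverse square of the F-number, so closing the diaphragm by one full stop halves the irradiance.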


#### Reflectance of a surface
The latter property is usually described by a complex function called the **Bidirectional Reflectance Distribution Function** (BRDF), not always analytical and usually parametrized by two angles (azimuth and elevation) for each of the incoming and outgoing directions.

That means that it's possible to determine the amount of light emitted in a certain direction given the amount of light received from the sources.

There exist two simplified and opposite reflectance models:
* **Lambertian** (diffusely reflecting): the incoming light is spread equally across all viewing directions (the surface appears equally bright from any viewing direction):
$$L(P) = \rho(P) \cdot E(P)$$
where $\rho(P)$ is the albedo (the fraction of incident light that is reflected);
* **Specular surface**: reflects the light coming from any direction in only one direction, so incident and reflected rays are coplanar and form the same angle wrt the normal at the surface point.

Real surfaces show a mixed behaviour, so more sophisticated models are used.

*Note*: stereo matching assumes that corresponding points share similar neighbourhoods of pixels; if the surface is not Lambertian it emits different amounts of light in different directions, so the two views may look different.


## Image digitization
The image plane of a camera consists of a planar sensor which converts the irradiance at any point into an electric quantity.
Afterwards, such a continuous "electric" image is sampled and quantized to end up with a digital image (bits) suitable for visualization and processing by a computer.

<center><img src=https://i.ibb.co/cbVRvv4/photo-2021-01-19-13-01-15.jpg width="700px" /> </center>

**Sampling**: the planar continuous image is sampled evenly along both the horizontal and vertical directions to pick up a 2d array (matrix) of $N \times M$ samples known as pixels:
$$I(x,y) \to \begin{bmatrix} I(0,0) & \dots & I(0, M - 1) \\ \vdots & \ddots & \vdots \\ I(N - 1, 0) & \dots & I(N - 1, M - 1) \end{bmatrix} $$
**Quantization**: the continuous range of values associated with pixels is quantized into $l = 2^m$ discrete levels known as *grey-levels*.

Letting $m$ be the number of bits used to represent a pixel (typically a byte, $m = 8$), the memory occupancy in bits of a grey-scale image will be $B = N \times M \times m$.
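A minimal sketch of uniform quantization into $2^m$ grey-levels and of the memory formula. The input is assumed to be a continuous value already normalised to $[0, v_{max}]$; function names and the image size are illustrative.

```python
def quantize(value, m, v_max=1.0):
    """Map a continuous value in [0, v_max] to one of 2**m grey-levels."""
    levels = 2 ** m
    level = int(value / v_max * levels)
    return min(level, levels - 1)  # v_max itself falls in the last level


def memory_bits(n_rows, n_cols, m):
    """Memory occupancy B = N * M * m of a grey-scale image, in bits."""
    return n_rows * n_cols * m


print(quantize(0.5, 8), quantize(1.0, 8))  # 128 255
print(memory_bits(512, 512, 8))            # 2097152
```

A 512 x 512 image at 8 bits per pixel therefore takes 2097152 bits, i.e. 256 KiB.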

The more bits we spend on its representation, the higher the quality of the digital image (for both sampling and quantization).

Lenna's story: http://www.lenna.org/.

### Digitization in practice
The sensor is a 2d array of photodetectors (e.g. photogates or photodiodes).

During the exposure time, each detector converts the incident light into a proportional electric charge (photons to electrons).
Then, another circuit reads out the charge, generating an output signal (either digital or analog; in the digital case an ADC is also needed).

Hence, there is never a continuous image, since the image is sensed directly as a sampled signal (row by row).

*Note*: in analog cameras, the native sampling taking place at the sensor is lost in the generation of the analog output, which is then sampled and quantized by an **analog frame grabber**.
This means that pixels in a digital image coming from an analog camera do not correspond to those sensed by the photodetectors.
In analog cameras the output signal is continuous along rows but discrete along columns.

## Camera parameters
**Signal-to-noise ratio** (SNR): the intensity (numeric value) measured at a pixel under perfectly static conditions varies due to the presence of random noise (the pixel value is not deterministic but rather a random variable).

The SNR can be thought of as a quantifier of the strength of the "true" signal wrt fluctuations due to noise.
It's usually expressed in decibels or bits:
$$\text{SNR}_{dB} = 20 \log_{10}(\text{SNR})$$
$$\text{SNR}_{bit} = \log_{2}(\text{SNR})$$
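The two conversions in code, as a minimal sketch. The input is assumed to be the plain (linear) ratio between signal and noise amplitudes; the sample values are made up.

```python
import math


def snr_db(snr):
    """SNR in decibels: 20 * log10(SNR)."""
    return 20.0 * math.log10(snr)


def snr_bits(snr):
    """SNR in bits: log2(SNR)."""
    return math.log2(snr)


for snr in (256.0, 1000.0):
    print(f"SNR = {snr:6.0f} -> {snr_db(snr):.1f} dB, {snr_bits(snr):.2f} bit")
```

Each additional bit corresponds to about 6 dB, since $20 \log_{10} 2 \approx 6.02$.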

The main noise sources are:
1. **Photon shot noise**: the number of photons collected during exposure time is not constant since it's a Poisson process;
1. **Electronic circuitry noise**: generated by the electronics that reads out the charge;
1. **Quantization noise**: due to the final ADC conversion (in digital cameras);
1. **Dark current noise**: random amount of charge, due to thermal excitation, observed at each pixel even if the sensor is not exposed to light.

<center><img src=https://i.ibb.co/LRrW2ZC/photo-2021-01-19-14-43-01.jpg width="700px"/></center>

Denoising an image is one of the main goals of pre-processing before running an algorithm.

<center><img src=https://i.ibb.co/3zHS9Yr/photo-2021-01-19-14-46-30.jpg width="700px"/></center>

Since the process is not deterministic, the $i$-th measurement at pixel $p$ is: $$I_i(p) = I^*(p) + n_i(p)$$
where $I^*(p)$ is a deterministic constant (the noiseless value) and $n_i(p)$ is the noise, a random variable.
Averaging over $N$ measurements:
$$\widetilde{I}(p) = \frac{1}{N} \sum_{i = 1}^N I_i(p) = I^*(p) + \frac{1}{N} \sum_{i = 1}^N n_i(p) \approx I^*(p)$$
since the noise is assumed to have zero mean, so its average tends to zero for $N \to \infty$ (law of large numbers).
This takes time, because more images must be acquired.
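The effect of temporal averaging can be simulated. The sketch below assumes zero-mean Gaussian noise with a made-up standard deviation on a single pixel, and estimates how much the averaged value still fluctuates as more measurements are combined.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for repeatability
TRUE_VALUE = 100.0      # hypothetical noiseless intensity
SIGMA = 5.0             # hypothetical noise standard deviation


def measure():
    """One noisy measurement of the same pixel."""
    return TRUE_VALUE + rng.gauss(0.0, SIGMA)


def temporal_mean(measurements):
    return sum(measurements) / len(measurements)


def spread_of_mean(n, trials=200):
    """Empirical standard deviation of the mean of n measurements."""
    means = [temporal_mean([measure() for _ in range(n)]) for _ in range(trials)]
    return statistics.stdev(means)


for n in (1, 10, 100):
    print(n, round(spread_of_mean(n), 2))
```

For independent noise the residual fluctuation shrinks roughly as $\sigma / \sqrt{n}$, so each halving of the noise costs four times as many images.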

Noise is independent across pixels, so if we don't have time for a temporal mean we can average over space (in a neighbourhood of the pixel).

<center><img src=https://i.ibb.co/ZmQbCfj/photo-2021-01-19-15-17-43.jpg width="300px"/></center>

Assuming the amount of light is the same across the red window $S$ centered at $p$, we have:
$$\widetilde{I}(p) = \frac{1}{|S|} \sum_{q \in S} I(q)$$
where $|S|$ is the number of pixels in the window.
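A plain-Python sketch of this spatial mean over a square window $S$. Clipping the window at the image borders is an implementation choice made here, not something prescribed by these notes.

```python
def mean_filter(image, half=1):
    """Replace each pixel with the mean over a (2*half+1)^2 window, clipped at borders."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - half), min(rows, r + half + 1))
                      for cc in range(max(0, c - half), min(cols, c + half + 1))]
            out[r][c] = sum(window) / len(window)
    return out


# A flat patch with one noisy pixel in the middle.
image = [[10, 10, 10],
         [10, 19, 10],
         [10, 10, 10]]
smoothed = mean_filter(image)
print(smoothed[1][1], smoothed[0][0])  # 11.0 12.25
```

The outlier is pulled towards its neighbours, at the price of blurring real intensity changes when the "same amount of light" assumption does not hold.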

*Note*: denoising algorithms are usually embedded in the camera's firmware.

**Dynamic range**

Let:
* $E_{\min}$: the minimum detectable irradiance;
* $E_{\max}$: the saturation irradiance (the maximum capacity of the photodetector).

Then, the dynamic range will be $\text{DR} = \frac{E_{\max}}{E_{\min}}$.

<center><img src=https://i.ibb.co/zJ5Yk1y/photo-2021-01-19-15-29-31.jpg width="800px"/></center>

The higher the DR, the better the ability of the sensor to simultaneously capture in one image both dark and bright structures of the scene.

**High dynamic range** (HDR): a technique that combines a sequence of images of the same subject taken with different exposure times.

**Sensitivity (or responsivity)**: deals with the amount of signal that the sensor can deliver per unit of input optical energy.

**Uniformity (spatial or pattern noise)**: due to manufacturing tolerances both the response to light and the amount of dark noise vary across pixels.

## CCD vs CMOS
CMOS technology allows the electronic circuitry to be integrated within the same chip as the sensor, providing compactness, less power consumption and lower costs.

Moreover, CMOS sensors allow an arbitrary window to be read out without having to read the full image.
This can speed up the inspection or tracking of a certain ROI (region of interest).

On the other hand, CCD technology provides higher SNR, higher DR and better uniformity.

### Colour sensor
Sensors are sensitive to light ranging from near-ultraviolet, through the visible spectrum, up to the near infrared.

The sensed intensity at a pixel results from an integration over the range of wavelengths of the spectral distribution of the incoming light, multiplied by the spectral response function of the sensor.

This means that CCD and CMOS sensors **cannot** sense colour.
<center><img src=https://i.ibb.co/DLgrjxr/photo-2021-01-19-15-41-53.jpg width="800px"/></center>

To create a colour sensor, an array of optical filters (colour filter array, CFA) is placed in front of the photodetectors, to render each pixel sensitive to a specific range of wavelengths.

To obtain an RGB triplet at each pixel, the missing colour samples are interpolated from neighbouring pixels (**demosaicking**).


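**Bayer CFA**: green filters are twice as many as red or blue ones, mimicking the human eye, which is more sensitive to green. As a consequence, the true resolution of the sensor is smaller than the number of pixels, since each colour is sampled on a sparser grid.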
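To make the idea concrete, here is a small sketch of a Bayer mosaic and of the simplest possible interpolation of the green channel (average of the available 4-neighbours). The RGGB layout is an assumption (actual sensors may use other arrangements), and real demosaicking algorithms are considerably more refined.

```python
def bayer_channel(r, c):
    """Colour sampled at pixel (r, c), assuming an RGGB layout."""
    if r % 2 == 0 and c % 2 == 0:
        return "R"
    if r % 2 == 1 and c % 2 == 1:
        return "B"
    return "G"


def interpolate_green(raw):
    """Green value at every pixel of a raw mosaic (at least 2 x 2 pixels)."""
    rows, cols = len(raw), len(raw[0])
    green = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if bayer_channel(r, c) == "G":
                green[r][c] = float(raw[r][c])  # sampled directly
            else:
                # In a Bayer pattern the 4-neighbours of R and B pixels are green.
                neighbours = [raw[rr][cc]
                              for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                              if 0 <= rr < rows and 0 <= cc < cols]
                green[r][c] = sum(neighbours) / len(neighbours)
    return green


# Synthetic raw image of a uniformly coloured scene (hypothetical values).
values = {"R": 200, "G": 50, "B": 10}
raw = [[values[bayer_channel(r, c)] for c in range(4)] for r in range(4)]
green = interpolate_green(raw)
print(green[0])
```

Half of the pixels sample green and a quarter each sample red and blue, which is where the reduced true resolution comes from.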

### Sensor sizes
CCD and CMOS sensors come in different sizes, which are specified in inches for legacy reasons dating back to old cameras based on cathode ray tubes.
The actual diagonal of a solid state sensor is roughly $2/3$ of its nominal size.