# Camera Sensor

## Goal
- learn what makes a camera useful for self-driving cars
- learn the characteristics of a camera as a sensor, and how images are formed

## 1. Pinhole Camera Model



## 2. Camera Projective Geometry

Projection from world coordinates -> image coordinates
- project from world coordinates -> camera coordinates
- project from camera coordinates -> image coordinates
- project from image coordinates -> pixel coordinates by discretization, scaling and offset

World -> Camera 

$$
O_c = \begin{bmatrix}
            R & t \\
            0 & 1
        \end{bmatrix} 
        O_w
    = T O_w
$$


Camera -> image

$$
O_{i} = \begin{bmatrix}
            f & 0 & u_0 \\
            0 & f & v_0 \\
            0 & 0 & 1
        \end{bmatrix} 
        O_c
    = K O_c
$$

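Therefore, world -> image

$$
O_i = \begin{bmatrix}
        x_i \\
        y_i \\
        z_i
    \end{bmatrix}
    = K \begin{bmatrix} R & t \end{bmatrix} O_w
    = P \begin{bmatrix}
            x_w \\
            y_w \\
            z_w \\
            1
        \end{bmatrix}
$$

Note that $K$ is 3-by-3, $T$ is 4-by-4, $P = K \begin{bmatrix} R & t \end{bmatrix}$ is 3-by-4, and $O_w$ is 4-by-1 in homogeneous coordinates.
$K$ and $T$ cannot be multiplied directly because their dimensions do not match. $K$ acts only on the first three components of $O_c = T O_w$, so the composition is $K \begin{bmatrix} R & t \end{bmatrix}$, that is, $O_i = K(R X_w + t)$ with $X_w = (x_w, y_w, z_w)^T$.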

image -> pixel

$$
O_i = \begin{bmatrix}
        x_i \\
        y_i \\
        z_i
    \end{bmatrix}
    \rightarrow
    \begin{bmatrix}
        x_p \\
        y_p \\
        1 \\
    \end{bmatrix}
    = \frac{1}{z_i}\begin{bmatrix}
        x_i \\
        y_i \\
        z_i
    \end{bmatrix}
$$
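
The three steps (world -> camera -> image -> pixel) can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names `matvec` and `project_point` and all numeric values (focal length, principal point, identity rotation, zero translation) are assumptions chosen for the example.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def project_point(K, R, t, point_w):
    """Project a 3D world point to pixel coordinates (x_p, y_p)."""
    # World -> camera: O_c = R * O_w + t (the upper three rows of T * O_w).
    rotated = matvec(R, point_w)
    o_c = [r + ti for r, ti in zip(rotated, t)]
    # Camera -> image: O_i = K * O_c, a homogeneous 3-vector.
    x_i, y_i, z_i = matvec(K, o_c)
    if z_i <= 0:
        raise ValueError("point is not in front of the camera")
    # Image -> pixel: divide by the third component.
    return x_i / z_i, y_i / z_i


# Hypothetical intrinsics and extrinsics, for illustration only.
f, u0, v0 = 640.0, 640.0, 480.0
K = [[f, 0.0, u0], [0.0, f, v0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
t = [0.0, 0.0, 0.0]                                       # no translation

x_p, y_p = project_point(K, R, t, [1.0, 0.5, 10.0])
print(x_p, y_p)  # 704.0 512.0
```

With the identity rotation and zero translation used here, the world frame coincides with the camera frame, so the result depends only on $K$ and the division by depth.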


## 3. Camera Calibration


## 4. Visual Depth Perception

- stereo sensor -> how two cameras are related
- derive the location of a point in 3D given its projection on the two images of a stereo sensor


### 4.1 Stereo Sensor Model

![image.png](./perception/resources/img/stereo-camera-model.png)

$O_L$ and $O_R$ are the optical centers of the two cameras, and $f$ is the focal length of the cameras.
$b$ is the baseline between the two cameras, and $Z$ is the depth of the point $O$. 


Assumptions:

- sensor is constructed from two identical cameras
- two cameras have parallel optical axes


![model](./perception/resources/img/stereo-camera-model-1.png)

left camera

$$
\frac{Z}{f} = \frac{X}{x_L}
$$

right camera

$$
\frac{Z}{f} = \frac{X-b}{x_R}
$$ 

Therefore, we can compute the 3D coordinates of a point from its projections on the two images.

Let's define the disparity $d$ as the difference between the two image coordinates of the same point.

$$
d = x_L - x_R
$$

where ($x_L$, $y_L$) and ($x_R$, $y_R$) are the image coordinates of the same point in the left and right images, respectively. The image coordinates are measured in pixels and can be calculated from the pixel coordinates ($u$, $v$) and the principal point ($u_0$, $v_0$) as follows:
$$ x_L = u_L - u_0 $$
$$ x_R = u_R - u_0 $$
$$ y_L = v_L - v_0 $$
$$ y_R = v_R - v_0 $$


Combining all the equations above, we can get the 3D coordinates (X,Y,Z) of the point in the camera coordinate system.

$$ Zx_L = fX $$
$$ Zx_R = fX-fb $$
$$ Zx_R = Zx_L - fb $$

The coordinates of the given point (X,Y,Z) are then obtained as:

$$ Z = \frac{fb}{x_L - x_R} = \frac{fb}{d} $$
$$ X = \frac{Zx_L}{f} $$
$$ Y = \frac{Zy_L}{f} $$
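
The formulas above translate directly into code. The sketch below assumes a hypothetical function name `triangulate` and made-up calibration values ($f$, $b$, $u_0$, $v_0$). In practice these values come from stereo calibration.

```python
def triangulate(u_l, v_l, u_r, f, b, u0, v0):
    """Recover (X, Y, Z) in the left camera frame from matching pixels.

    f is the focal length in pixels and b is the baseline in the units
    wanted for X, Y, Z (metres in the example below).
    """
    # Pixel coordinates -> image coordinates relative to the principal point.
    x_l = u_l - u0
    x_r = u_r - u0
    y_l = v_l - v0
    # Disparity d = x_L - x_R.
    d = x_l - x_r
    if d <= 0:
        raise ValueError("disparity must be positive for a valid match")
    Z = f * b / d
    X = Z * x_l / f
    Y = Z * y_l / f
    return X, Y, Z


# Hypothetical calibration: f = 640 px, b = 0.5 m, principal point (640, 480).
X, Y, Z = triangulate(u_l=704.0, v_l=512.0, u_r=672.0,
                      f=640.0, b=0.5, u0=640.0, v0=480.0)
print(X, Y, Z)  # 1.0 0.5 10.0
```

Because $Z$ is inversely proportional to $d$, a one-pixel matching error matters much more for distant points (small $d$) than for nearby ones.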

Two main problems:
- need to know $f$, $b$, $u_0$, $v_0$
    - use stereo camera calibration
- need to find the corresponding $x_L$ and $x_R$ for each point so that the disparity $d$ can be calculated
    - use disparity computation algorithms based on image matching
        - match pixels in the left image to pixels in the right image to find correspondences.

### 4.2 Disparity Computation

#### 4.2.1 Epipolar Line

![epipolar](./perception/resources/img/epipolar-line.png)

Horizontal epipolar lines occur only when the optical axes of the two cameras are parallel.

If this condition is not met, the epipolar lines are skewed rather than horizontal, so a match can no longer be found by searching along a single image row.
- In this scenario, we can use stereo rectification to warp the images so that the epipolar lines become horizontal.


#### 4.2.2 Disparity Computation

Given rectified images and stereo calibrations:

- For each epipolar line,
    1. take each pixel on this line in the left image
    2. compare each of these left image pixels to every pixel on the same epipolar line in the right image
    3. select the right image pixel that most closely matches the left pixel, which can be done by minimizing a cost such as the sum of squared differences (SSD) between the two pixels.
    4. compute the disparity as the difference between the column indices of the two pixels
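
The steps above can be sketched for a single rectified scanline. This is a simplified illustration under stated assumptions, not a production matcher: it compares small windows around each pixel rather than single pixels, and it limits the search to a `max_disparity` range. The window, the search limit, and the function names are all choices made for this example.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def scanline_disparity(left_row, right_row, half_window=1, max_disparity=4):
    """Return one disparity per pixel of left_row (None near the borders)."""
    n = len(left_row)
    disparities = [None] * n
    for x_l in range(half_window, n - half_window):
        patch_l = left_row[x_l - half_window : x_l + half_window + 1]
        best_cost, best_d = None, None
        # With d = x_L - x_R >= 0, the candidate match sits at x_R = x_L - d.
        for d in range(max_disparity + 1):
            x_r = x_l - d
            if x_r - half_window < 0:
                break  # the window would leave the right image
            patch_r = right_row[x_r - half_window : x_r + half_window + 1]
            cost = ssd(patch_l, patch_r)
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disparities[x_l] = best_d
    return disparities


# Synthetic scanlines: the bright feature is shifted 2 pixels to the left
# in the right image, so its disparity is 2.
left_row = [10, 10, 10, 50, 90, 200, 90, 50, 10, 10]
right_row = [10, 50, 90, 200, 90, 50, 10, 10, 10, 10]

disparities = scanline_disparity(left_row, right_row)
print(disparities[4:7])  # [2, 2, 2]
```

In the flat regions of the scanline every candidate has the same cost, so the estimate there is arbitrary. Real matchers such as block matching or semi-global matching add larger windows or smoothness terms to handle this ambiguity.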


Disparity computation is a very well-studied area:
- survey at http://vision.middlebury.edu/stereo/eval3. 
- many algorithms have been proposed, such as block matching, semi-global matching, and deep learning based methods.
- benchmark tests are also available, such as the Middlebury Stereo Evaluation.

## 5. Image Filter

noise in image -> filter -> denoised image

Noise types:
- Gaussian noise
- Salt-and-pepper noise
- Speckle noise



### 5.1. Cross-Correlation
 
Salt-and-pepper noise usually occurs in low-light conditions and produces outlier pixels: very high values in a low-value neighborhood, or very low values in a high-value neighborhood.
The following matrix shows an example of salt-and-pepper noise.

$$
\begin{bmatrix}
    0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 255 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 \\
\end{bmatrix}
$$

To reduce this type of noise, we can use the mean filter, which replaces each pixel with the mean of its neighborhood.

$$
    G[u,v] = \frac{1}{(2k+1)^2} \sum_{i=-k}^{k} \sum_{j=-k}^{k} I[u+i, v+j]
$$

where $I$ is the input image, $G$ is the output image, and $k$ sets the extent of the neighborhood, so the filter is $(2k+1)$-by-$(2k+1)$.

A more general form of the filter can be represented as:

$$
    G[u,v] = \sum_{i=-k}^{k} \sum_{j=-k}^{k} H[i,j]I[u+i, v+j]
$$

where $H[i,j]$ is the filter kernel.

The mean filter above is a special case of this general form, where $H[i,j] = \frac{1}{(2k+1)^2}$. If k=1, then $H[i,j] = \frac{1}{9}$.

$$
H = \frac{1}{9} \begin{bmatrix}
    1 & 1 & 1 \\
    1 & 1 & 1 \\
    1 & 1 & 1 \\
    \end{bmatrix}
$$

Similarly, for a Gaussian filter, we can use the following kernel:

$$
H = \frac{1}{16} \begin{bmatrix}
    1 & 2 & 1 \\
    2 & 4 & 2 \\
    1 & 2 & 1 \\
    \end{bmatrix}
$$
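
To make the general formula concrete, the sketch below applies both kernels to the 5-by-5 salt-and-pepper example. It is a plain-Python illustration: the function name `cross_correlate` is hypothetical, and the output is restricted to pixels whose full neighborhood lies inside the image. That border handling is an assumption, since the text does not specify any.

```python
def cross_correlate(image, kernel):
    """Cross-correlate an image with a (2k+1)x(2k+1) kernel.

    Only pixels whose whole neighborhood lies inside the image are
    computed, so the output is smaller than the input by k on each side.
    """
    k = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = []
    for u in range(k, rows - k):
        out_row = []
        for v in range(k, cols - k):
            acc = 0.0
            for i in range(-k, k + 1):
                for j in range(-k, k + 1):
                    # G[u,v] += H[i,j] * I[u+i, v+j]
                    acc += kernel[i + k][j + k] * image[u + i][v + j]
            out_row.append(acc)
        out.append(out_row)
    return out


# The salt-and-pepper example: a single 255 outlier in a zero image.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 255

mean_kernel = [[1 / 9] * 3 for _ in range(3)]
gauss_kernel = [[1 / 16, 2 / 16, 1 / 16],
                [2 / 16, 4 / 16, 2 / 16],
                [1 / 16, 2 / 16, 1 / 16]]

mean_out = cross_correlate(image, mean_kernel)
gauss_out = cross_correlate(image, gauss_kernel)
print(mean_out[1][1], gauss_out[1][1])
```

The mean filter spreads the outlier evenly, giving about 28.3 at every output pixel. The Gaussian kernel keeps more weight at the center (63.75) and less at the corners, which is why it blurs less aggressively.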


However, applying a linear filter such as the mean filter or the Gaussian filter blurs the image, which is undesirable in many applications.
These filters can be tuned to reduce the blurring, but only at the cost of weaker noise reduction, so there is a tradeoff.



### 5.2 Convolution

A convolution is a cross-correlation where the filter is flipped both horizontally and vertically before being applied to the image.

$$
    G[u,v] = \sum_{i=-k}^{k} \sum_{j=-k}^{k} H[i,j]I[u-i, v-j]
$$

Unlike cross-correlation, convolution is associative (it is also commutative, so the order of the operands does not matter).
If $H$ and $F$ are filter kernels, then $H*(F*I) = (H*F)*I$.
Precomputing the combined kernel $H*F$ once and then applying it to the image can reduce computation time.
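
The identity $H*(F*I) = (H*F)*I$ can be checked numerically. The sketch below uses a full, zero-padded 2D convolution, for which the identity holds exactly. The function name `convolve_full` and the small integer kernels are illustrative assumptions, not values from the text.

```python
def convolve_full(a, b):
    """Full 2D convolution of two arrays given as lists of rows.

    Every product a[i][j] * b[p][q] contributes to out[i + p][j + q],
    which is equivalent to sliding a flipped copy of b over a with
    zero padding.
    """
    rows = len(a) + len(b) - 1
    cols = len(a[0]) + len(b[0]) - 1
    out = [[0] * cols for _ in range(rows)]
    for i, a_row in enumerate(a):
        for j, a_val in enumerate(a_row):
            for p, b_row in enumerate(b):
                for q, b_val in enumerate(b_row):
                    out[i + p][j + q] += a_val * b_val
    return out


# Hypothetical kernels and image, chosen only to keep the arithmetic exact.
H = [[1, 1], [1, 1]]
F = [[1, 0], [0, -1]]
I = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

step_by_step = convolve_full(H, convolve_full(F, I))  # H * (F * I)
combined = convolve_full(convolve_full(H, F), I)      # (H * F) * I

print(step_by_step == combined)                      # True
print(convolve_full(H, F) == convolve_full(F, H))    # True
```

Applying the precomputed kernel `convolve_full(H, F)` takes one pass over the image instead of two, which is where the time saving comes from.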


### 5.3. Applications
- cross correlation
    - template matching
        - the pixel with the highest response from cross-correlation is the location of the template in the image
        - can be used to detect objects in an image, such as lanes
    - gradient computation
        - define a finite difference kernel
        - apply the kernel to the image, and get the image gradient 
        - very useful for edge detection
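
As a small illustration of gradient computation, the sketch below cross-correlates a synthetic image containing one vertical edge with the central-difference kernel $[-1, 0, 1]$. The kernel choice, the helper name `cross_correlate`, and the test image are assumptions made for this example. Sobel kernels are a common alternative.

```python
def cross_correlate(image, kernel):
    """Cross-correlate image with a rectangular kernel (valid region only)."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(kernel[i][j] * image[u + i][v + j]
                for i in range(kh) for j in range(kw))
            for v in range(cols - kw + 1)
        ]
        for u in range(rows - kh + 1)
    ]


# Finite difference kernel for the horizontal gradient: I[u, v+1] - I[u, v-1].
diff_x = [[-1, 0, 1]]

# Synthetic image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 100, 100, 100] for _ in range(3)]

grad_x = cross_correlate(image, diff_x)
print(grad_x[0])  # [0, 100, 100, 0]
```

The response is zero in the flat regions and large only around the edge, so thresholding the gradient magnitude gives a simple edge detector.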

