# "Stereo Vision"

> "Two eyes are always better than one!"

- toc: true
- branch: master
- badges: false
- comments: true
- categories: [Computer Vision]
- hide: false
- search_exclude: false
- image: images/post-thumbnails/sv2.png
- metadata_key1: notes
- metadata_key2: 

# Pinhole Camera Basics

![](https://abhisheksreesaila.github.io/blog/images/stereo/pinhole.png "Pinhole Camera")

Before we explain this in detail, let's clarify the terms:

- The object (flower) is said to be in the "world" and its position is given by "world coordinates".
- Now imagine that the light rays from the object pass through a pinhole and get projected onto a wall. The wall is the "image plane" and the hole the rays pass through is the "pinhole". The camera is placed at the pinhole and is said to be in the "camera plane".
- The distance between the camera placed at the pinhole and the wall (image plane) is called the "focal length".
- The distance "z" from the pinhole to the object is also called the "depth".



> Note: Our goal is to take the 3D coordinates (x,y,z) and TRANSFORM them  to camera plane (3D) and then later PROJECT THEM to the image plane (2D).  

We move from WORLD => CAMERA => IMAGE. This is called the Forward Imaging Model. The projection from the camera to the image plane is called "Perspective Projection", so we can write the forward imaging model as

> Note: $[x_w, y_w, z_w]$ => (Coordinate Transformation) => $[x_c, y_c, z_c]$ => (Perspective Projection) => $[x_i, y_i]$










---





# Perspective Imaging with Pinhole


![](https://abhisheksreesaila.github.io/blog/images/stereo/similartri.png "Similar Triangles")

The above diagram is exactly the same as the "Pinhole" diagram, with the labels removed to explain the geometry. The yellow and blue triangles are similar. By the property of similar triangles (see the note below), we have

- $r_i/f = r_o/z_o$

where $r_o = (x_o, y_o, z_o)$ and $r_i = (x_i, y_i, f)$.

$r_i$ and $r_o$ are [vectors](https://mathinsight.org/vector_introduction) describing the point in the image and in the world respectively. Hence we can split the equation into its components:
- $x_i/f = x_o/z_o$  ;  $y_i/f = y_o/z_o$
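
The two component equations can be tried out with a few lines of plain Python. This is only a minimal sketch: the function name `project_pinhole` and the numbers are made up for illustration, and all lengths are assumed to be in the same unit (mm here).

```python
def project_pinhole(point, f):
    """Project a 3D point (x_o, y_o, z_o) onto the image plane at distance f."""
    x_o, y_o, z_o = point
    if z_o <= 0:
        raise ValueError("the point must be in front of the pinhole (z_o > 0)")
    # Similar triangles: x_i / f = x_o / z_o and y_i / f = y_o / z_o
    return (f * x_o / z_o, f * y_o / z_o)


# Hypothetical values: f = 50 mm, point 2 m to the right, 1 m up, 10 m away (in mm)
x_i, y_i = project_pinhole((2000.0, 1000.0, 10000.0), f=50.0)
print(x_i, y_i)  # 10.0 5.0
```

A point 10 m away lands only 10 mm from the centre of the image plane, which is the shrinking effect of dividing by the depth.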


> Note: Triangles are said to be "similar" if they have the same shape but not necessarily the same size. If two triangles have this property, the ratios of their corresponding sides are equal:
  *$AB/DE = BC/EF$*

![](https://abhisheksreesaila.github.io/blog/images/stereo/s-triangles-2.png "Similar Triangles Concept")



---

## Pinhole Properties

1. A straight line remains straight in the image plane.
2. Magnification is inversely proportional to depth ($z_o$). In other words, the further you are from the camera, the smaller you appear. This is obvious since we all use cameras today. But take a look at how the parallel lines of a railway track seem to intersect. The distance between the parallel tracks becomes so small in the image plane that it looks like an intersection, though in reality we know the tracks are parallel throughout. The point of apparent intersection is called the "Vanishing Point".

![](https://abhisheksreesaila.github.io/blog/images/stereo/railway-tracks.png "Vanishing Point")

3. Since a pinhole is a tiny hole, very little light gets into the camera, so it can take a few seconds before the image is captured. Imagine clicking a camera and waiting 5 seconds to see the image! To improve this, we replace the pinhole with a lens.
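
Property 2 can be checked numerically. The sketch below projects the two rails of a track at increasing depth and measures the gap between them in the image. The track width of 1.5 units, the focal length of 1 and the helper name `rail_gap_in_image` are all assumptions chosen for illustration.

```python
def rail_gap_in_image(track_width, z, f=1.0):
    """Width of the gap between two parallel rails as seen in the image at depth z."""
    # Perspective projection of the horizontal coordinate: x_i = f * x / z
    left = f * (-track_width / 2) / z
    right = f * (track_width / 2) / z
    return right - left


depths = (1.0, 10.0, 100.0, 1000.0)
gaps = [rail_gap_in_image(1.5, z) for z in depths]
for z, gap in zip(depths, gaps):
    print(f"depth {z:>6}: gap in image = {gap:.4f}")
```

The real gap never changes, but the projected gap falls towards zero as the depth grows, which is why the rails appear to meet at the vanishing point.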

## LENS

We won't cover lenses and their properties here, but keep in mind that a lens is, simplistically, just a "pinhole" made larger. With more area to gather light, the wait time to capture the image is reduced, but the light is bent on its way through the lens, which causes distortions in the image.
So, apart from the focal length, we introduce "distortion coefficients" as one of the intrinsic parameters. All modern cameras use sophisticated lenses.

See the comparison below.

![](https://abhisheksreesaila.github.io/blog/images/stereo/lens-camera.png "Lens vs Camera")


---

# Camera Calibration

> Note: The goal of calibration is to estimate the Extrinsic and Intrinsic parameters.


## Derive an Equation For Camera Calibration

> Warning: This is math heavy.


Let's use our knowledge of the pinhole and perspective projection to derive a simple linear model for calibration.

Take a point in world coordinates. Assume you are given the "Extrinsic Parameters", so you have successfully transformed the point into camera coordinates.
Applying the perspective projection formula from above we get $x_i/f = x_c/z_c$  ;  $y_i/f = y_c/z_c$, where the subscript $c$ denotes coordinates in the camera plane.

Rewriting the equation we get,

$x_i = f * x_c/z_c$  ;  $y_i = f * y_c/z_c$


Let's take a closer look at the IMAGE PLANE.

To understand the basic pinhole model, we assumed that the image plane is measured in millimetres ("mm"). So we take a world coordinate in "m" and convert it to "mm". In reality, an image is measured in "pixels" by an image sensor. There can be multiple pixels per mm, so the quantity to consider is the "pixel density", which is nothing but "pixels per mm". We introduce it into the equation as follows.

$x_i = m * f * x_c/z_c$  ;  $y_i = m * f * y_c/z_c$

Why do we multiply?
- Pixel Density = 1 ; 1 mm contains 1px;
- Pixel Density = 10 ; 1 mm contains 10px;   2 mm contains 20px;  3 mm contains 30px;  X mm contains (PixelDensity * X)px;


![](https://abhisheksreesaila.github.io/blog/images/stereo/ImagePlaneToSensor.png "Image Plane To Image Sensor Mapping.png")


The second assumption we made was that the centre of the image plane (measured in mm) is the centre of the image sensor, which may or may not be true. To avoid this uncertainty, we place the origin of the image sensor at the far left-side corner (for understanding and mathematical convenience) and add whatever offset $(O_x, O_y)$ is needed to get the correct centre. Allowing the pixel density to differ along x and y ($m_x$ and $m_y$), the equation becomes

$x_i = m_x * f * x_c/z_c + O_x$  ;  $y_i = m_y * f * y_c/z_c + O_y$

$x_i$ and $y_i$ are popularly referred to as $u$ and $v$:

$u = m_x * f * x_c/z_c + O_x$  ;  $v = m_y * f * y_c/z_c + O_y$

$u = f_x * x_c/z_c + O_x$  ;  $v = f_y * y_c/z_c + O_y$, where $f_x = m_x * f$ and $f_y = m_y * f$ are the focal lengths measured in pixels.

> Important: $ f_x, f_y, O_x, O_y $ represent the camera's internal geometry and are called INTRINSIC parameters
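
Here is a minimal sketch of the pixel-coordinate equations, assuming a point already expressed in camera coordinates. The function name `camera_to_pixel` and every number (a 4 mm focal length, 250 px/mm, an offset of (640, 360)) are hypothetical values picked for illustration.

```python
def camera_to_pixel(point_c, f, m_x, m_y, o_x, o_y):
    """Map a point in camera coordinates to pixel coordinates (u, v)."""
    x_c, y_c, z_c = point_c
    f_x = m_x * f  # focal length in pixels along x
    f_y = m_y * f  # focal length in pixels along y
    u = f_x * x_c / z_c + o_x
    v = f_y * y_c / z_c + o_y
    return u, v


# Hypothetical camera: f = 4 mm, 250 px/mm in both directions, offset (640, 360) px
u, v = camera_to_pixel((0.2, -0.1, 2.0), f=4.0, m_x=250.0, m_y=250.0,
                       o_x=640.0, o_y=360.0)
print(u, v)  # approximately 740 and 310
```

Note that $x_c/z_c$ is a ratio, so the point can be in metres while $f$ is in mm; only $f$ and the pixel density need matching units.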

The equations for $u$ and $v$ are non-linear (they divide by $z_c$), but mathematically it is convenient to have a linear equation if we can get one. Homogeneous coordinates to the rescue!

Play [here](https://wordsandbuttons.online/interactive_guide_to_homogeneous_coordinates.html) to get a good visual understanding. Basically, the concept is to add a "fake" coordinate at the end that scales all the values in the same proportion. For example, 8/4 (=2) is the same as 16/8 (=2). Similarly, in coordinate geometry [3,2] is the same as [6,4,2], where 2 is the scale factor (the fake coordinate added). We can always get the original value back from [6,4,2] by division: [6/2, 4/2].
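
The [3,2] and [6,4,2] example can be reproduced with two tiny helpers. The names `to_homogeneous` and `from_homogeneous` are made up for this sketch.

```python
def to_homogeneous(point, w=1.0):
    """Append the scale factor w and scale every coordinate by it."""
    return [c * w for c in point] + [w]


def from_homogeneous(point_h):
    """Divide by the last (fake) coordinate to recover the original point."""
    *coords, w = point_h
    if w == 0:
        raise ValueError("w = 0 does not correspond to a finite point")
    return [c / w for c in coords]


print(to_homogeneous([3, 2], w=2))   # [6, 4, 2]
print(from_homogeneous([6, 4, 2]))   # [3.0, 2.0]
print(from_homogeneous([3, 2, 1]))   # [3.0, 2.0], the same point
```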

First multiply both sides by $z_c$:

$u * z_c = f_x * x_c + O_x * z_c$  ;  $v * z_c = f_y * y_c + O_y * z_c$

Then represent them in homogeneous coordinates, with $u_h = u * z_c$, $v_h = v * z_c$ and $w_h = z_c$.

> Note: (To convince ourselves that this is the same equation we derived moments ago, just divide the first two rows by the last one, $z_c$. All these are just math tricks to get the same answer)



$$

\begin{bmatrix} u_h  \\  v_h \\ w_h  \end{bmatrix}
=
\begin{bmatrix} f_x*x_c + O_x*z_c  \\     f_y*y_c + O_y*z_c  \\ z_c  \end{bmatrix}

$$

Showing the same equation in matrix form:

$$

\begin{bmatrix} u_h  \\  v_h \\ w_h  \end{bmatrix}
=
M_{int}
\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & O_x & 0 \\  0 & f_y & O_y & 0  \\ 0 & 0 & 1 & 0   \end{bmatrix}
\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}

$$

$M_{int}$ is called the "Intrinsic Matrix" or "Camera Matrix". Its left 3x3 block is popularly written as $K$.
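
To see that the matrix form gives the same answer as the $u$, $v$ equations, here is a small sketch in plain Python. The helper names and the intrinsic values (1000 px focal lengths, offset (640, 360)) are assumptions for illustration only.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def intrinsic_matrix(f_x, f_y, o_x, o_y):
    """Build the 3x4 intrinsic matrix from the four intrinsic parameters."""
    return [
        [f_x, 0.0, o_x, 0.0],
        [0.0, f_y, o_y, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]


M_int = intrinsic_matrix(1000.0, 1000.0, 640.0, 360.0)  # hypothetical values
u_h, v_h, w_h = mat_vec(M_int, [0.2, -0.1, 2.0, 1.0])   # point in camera coordinates

# Dividing by the last coordinate recovers u = f_x * x_c / z_c + O_x (and likewise v)
u, v = u_h / w_h, v_h / w_h
print(u, v)  # approximately 740 and 310
```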

---



At the beginning of this section we assumed we were given camera coordinates. Let's check out how we get them.


Imagine a camera placed somewhere in an empty room, away from the corners. Fix a corner of this empty room as the origin. The camera is then at a *distance* from this corner. Additionally, it will be at an *angle* to the origin of the world coordinates, in our case the corner of the room we have chosen. In computer vision terms, the camera is said to be "translated" (geometric translation, not language translation :) ) and "rotated" w.r.t. the world coordinates. Hence, we need some sort of translation and rotation variables that will help us take the world coordinates of a point and express them in the camera's coordinates first. Right? We call these the "Extrinsic Parameters".

![](https://abhisheksreesaila.github.io/blog/images/stereo/coordsys1.png "Co-ordinate system changes")
*[Source](https://www.scratchapixel.com/lessons/3d-basic-rendering/computing-pixel-coordinates-of-3d-point/mathematics-computing-2d-coordinates-of-3d-points)*

In the picture above, "local" refers to the camera coordinate system. Let's call the "pink point" P. Also observe that "the pink point" is not two different points but the exact same point represented in two different systems. In other words, $P_{camera}$ and $P_{world}$ describe the same physical point; only the coordinate values differ.


Rotation is described by a 3x3 rotation matrix:


$$

R 
=
\begin{bmatrix} r_1 & r_2 & r_3 \\  r_4 & r_5 & r_6  \\  r_7 & r_8 & r_9  \end{bmatrix}

$$

- $r_1, r_2, r_3$ give the direction of the camera's x axis, expressed in world coordinates
- $r_4, r_5, r_6$ give the direction of the camera's y axis, expressed in world coordinates
- $r_7, r_8, r_9$ give the direction of the camera's z axis, expressed in world coordinates

Now we have,

> $P_{camera} = R * (P_{world} - C_{world})$, where $C_{world}$ is the position of the camera in world coordinates. Subtracting $C_{world}$ is the translation: it moves the origin from the world origin (the corner of the room) to the camera. $R$ then rotates the axes so that they line up with the camera's axes, i.e. if we move left in one system, we move left in the other, and so on.

![](https://abhisheksreesaila.github.io/blog/images/stereo/extparam.png "World to Camera Transformation")
*[Source](http://www.cse.psu.edu/~rtc12/CSE486/lecture12.pdf)*
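
The world-to-camera step can be sketched as "subtract the camera position, then rotate". The rotation matrix below (a 90 degree turn about the z axis), the camera position and the function name `world_to_camera` are all hypothetical choices for illustration.

```python
def world_to_camera(p_world, R, c_world):
    """Compute P_camera = R * (P_world - C_world)."""
    # Translation: move the origin from the world origin to the camera position
    shifted = [p - c for p, c in zip(p_world, c_world)]
    # Rotation: each row of R picks out one camera axis
    return [sum(r * s for r, s in zip(row, shifted)) for row in R]


# Hypothetical setup: camera at (1, 2, 0), rotated 90 degrees about the z axis
R = [
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
C_world = [1.0, 2.0, 0.0]

print(world_to_camera([1.0, 5.0, 4.0], R, C_world))  # the point, seen from the camera
print(world_to_camera(C_world, R, C_world))          # the camera itself maps to the origin
```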


Changing this to homogeneous coordinates we get,

$$

\begin{bmatrix} x_{c}  \\  y_{c} \\ z_{c}  \\ 1 \end{bmatrix}
=
M_{ext}
\begin{bmatrix} x_{w}  \\  y_{w} \\ z_{w}  \\ 1 \end{bmatrix}
=
\begin{bmatrix} r_1 & r_2 & r_3 & 0 \\  r_4 & r_5 & r_6 & 0   \\  r_7 & r_8 & r_9 & 0  \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & -c_{x} \\   0 & 1 & 0 & -c_{y}  \\   0 & 0 & 1 & -c_{z} \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x_{w}  \\  y_{w} \\ z_{w}  \\ 1 \end{bmatrix}

$$

$M_{ext}$, the product of the rotation and translation matrices, is the Extrinsic Matrix.
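
As a sanity check, the sketch below builds the extrinsic matrix as the product of the 4x4 rotation and translation matrices and applies it to a world point. The helper names, the rotation (90 degrees about the z axis) and the camera position (1, 2, 0) are assumptions for illustration.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def extrinsic_matrix(R, c_world):
    """Build the 4x4 extrinsic matrix as (rotation) * (translation by -C_world)."""
    rotation = [list(row) + [0.0] for row in R] + [[0.0, 0.0, 0.0, 1.0]]
    c_x, c_y, c_z = c_world
    translation = [
        [1.0, 0.0, 0.0, -c_x],
        [0.0, 1.0, 0.0, -c_y],
        [0.0, 0.0, 1.0, -c_z],
        [0.0, 0.0, 0.0, 1.0],
    ]
    return mat_mul(rotation, translation)


R = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical rotation
M_ext = extrinsic_matrix(R, [1.0, 2.0, 0.0])              # hypothetical camera position

p_world_h = [1.0, 5.0, 4.0, 1.0]  # world point in homogeneous coordinates
p_camera_h = [sum(m * x for m, x in zip(row, p_world_h)) for row in M_ext]
print(p_camera_h)  # same result as R * (P_world - C_world), with a trailing 1
```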


# Projection Matrix

At the beginning of the blog post we said we want to move from WORLD => CAMERA => IMAGE, to take the world coordinates and place them on the image. From the previous sections we understood that to move from

- WORLD => CAMERA, we need the extrinsic matrix
- CAMERA => IMAGE, we need the intrinsic matrix

We can show this mathematically as

$[u_h, v_h, w_h]^T = M_{int} M_{ext} [x_w, y_w, z_w, 1]^T$

Image point in homogeneous coordinates = Intrinsic Matrix * Extrinsic Matrix * (point in world coordinates)

We can combine these two matrices into one 3x4 matrix called the "Projection Matrix", $P = M_{int} M_{ext}$:

> $[u_h, v_h, w_h]^T = P [x_w, y_w, z_w, 1]^T$
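
Putting the pieces together, this sketch multiplies an intrinsic and an extrinsic matrix into a projection matrix and sends a world point all the way to pixel coordinates. Both matrices are filled with hypothetical values (1000 px focal lengths, offset (640, 360), camera at (1, 2, 0) rotated 90 degrees about the z axis), and the helper names are made up.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def project(P, point_world):
    """Apply the 3x4 projection matrix and divide by the last coordinate."""
    p_h = list(point_world) + [1.0]
    u_h, v_h, w_h = [sum(m * x for m, x in zip(row, p_h)) for row in P]
    return u_h / w_h, v_h / w_h


# Hypothetical intrinsic matrix (3x4)
M_int = [
    [1000.0, 0.0, 640.0, 0.0],
    [0.0, 1000.0, 360.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Hypothetical extrinsic matrix (4x4), i.e. rotation times translation by -C_world
M_ext = [
    [0.0, 1.0, 0.0, -2.0],
    [-1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

P = mat_mul(M_int, M_ext)  # WORLD => IMAGE in a single 3x4 matrix
u, v = project(P, [1.0, 5.0, 4.0])
print(u, v)  # approximately 1390 and 360
```

Once $P$ is known, projecting any number of world points needs only one matrix-vector product and one division per point.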


# I understand Calibration now.  How do we start writing code?

OpenCV provides built-in functions to estimate these parameters. Think of known points on a real-world object as the input and their pixel coordinates as the output. The problem at hand is to adjust the middle box in between (the intrinsic and extrinsic parameters) in such a way that we get the desired output from the input provided. If we do this once, we get an estimate, but those parameters are likely to be useful only for that angle and direction of the camera. Repeat the steps over multiple views and you have a good estimate. It is loosely similar to fitting a model in machine learning, so to speak!

![](https://abhisheksreesaila.github.io/blog/images/stereo/calib-visual.png "Calibration Parameters Estimation")
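
One common way to make "adjust the middle box" precise is to minimise the reprojection error, i.e. the distance between the observed pixel coordinates and the ones predicted by the model. The notation below is introduced here for illustration and is not used elsewhere in this post: $j$ indexes the views, $k$ indexes the known points on the object, $\mathbf{u}_{jk}$ is the observed pixel position, and $\pi$ divides a homogeneous vector by its last coordinate.

$$

\min_{M_{int},\; M_{ext,j}} \; \sum_{j} \sum_{k} \left\| \mathbf{u}_{jk} - \pi\left( M_{int} \, M_{ext,j} \, X_{w,k} \right) \right\|^2

$$

The intrinsic matrix is shared by all views, while each view gets its own extrinsic matrix, which is why several views constrain the intrinsic parameters better than a single one.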





---

# Code
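
As a toy illustration of the estimation idea (not OpenCV's algorithm, which solves the full problem through functions such as `cv2.calibrateCamera`), the sketch below recovers $f_x$ and $O_x$ from input/output pairs. It assumes the points are already in camera coordinates and the measurements are noise-free. Since $u = f_x * (x_c/z_c) + O_x$ is a straight line in $x_c/z_c$, an ordinary least-squares line fit is enough. All names and numbers are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Synthetic "ground truth" used only to generate the observations
true_f_x, true_o_x = 1000.0, 640.0

# Hypothetical points as (x_c, z_c) pairs in camera coordinates
points_c = [(-0.4, 2.0), (-0.1, 1.0), (0.3, 3.0), (0.5, 2.5), (0.2, 4.0)]
ratios = [x_c / z_c for x_c, z_c in points_c]
pixels_u = [true_f_x * r + true_o_x for r in ratios]  # simulated observations

# Estimate the two intrinsic parameters from the input/output pairs
f_x_est, o_x_est = fit_line(ratios, pixels_u)
print(f_x_est, o_x_est)  # close to 1000 and 640
```

Real calibration also has to estimate the extrinsic parameters and the distortion coefficients at the same time, which is why a dedicated routine is used in practice.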




# References

[OpenCV Calibration](https://docs.opencv.org/3.4.3/d9/d0c/group__calib3d.html)

[Geometry of Image Formation](https://learnopencv.com/camera-calibration-using-opencv/)

[Pinhole basics](https://www.youtube.com/watch?v=_EhY31MSbNM&t=194s)

[Calibration Basics](https://www.youtube.com/watch?v=qByYk6JggQU)

[Latex in Jupyter Notebook](https://personal.math.ubc.ca/~pwalls/math-python/jupyter/latex/)

[Conversion of Coordinate Basics](https://www.scratchapixel.com/lessons/3d-basic-rendering/computing-pixel-coordinates-of-3d-point/mathematics-computing-2d-coordinates-of-3d-points)
