# "Stereo Vision"

> "Two eyes are always better than one!"

- toc: true
- branch: master
- badges: false
- comments: true
- categories: [Computer Vision]
- hide: false
- search_exclude: false
- image: images/post-thumbnails/sv2.png
- metadata_key1: notes
- metadata_key2: 


# The loss of depth

Given a point (U, V) in the image plane, can we find the corresponding point in the world coordinate system? The answer is NO. When we capture an image, a 3D world coordinate gets transformed to a camera coordinate and then projected onto a 2D image plane. Refer to this blog post for in-depth details. In that process we lose information, more specifically the "Z" depth information, and it is impossible to get it back from a single image. In other words, given an image point you cannot reverse engineer the WORLD coordinate, since the crucial "Z" coordinate was lost in the projection. However, you haven't lost everything: the pixel still tells you the direction in which the point lies. So there is hope of getting the depth back, but we need additional help.
 
 ![](https://abhisheksreesaila.github.io/blog/images/stereo/monocular2.png "Loss of Depth")

 
Since the pixel fixes the direction of the ray, we know the scene point lies somewhere along the dotted green line. Why? This is the same line that was used to project the object onto the image plane when deriving all the math. So it can assist in the reverse direction as well.

![](https://abhisheksreesaila.github.io/blog/images/stereo/projection2.png "Projection of the object on image plane")
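To make the ambiguity concrete, here is a minimal sketch assuming an ideal pinhole camera. The intrinsics `FX`, `FY`, `OX`, `OY` and the helper names `project` and `back_project` are made up for illustration. It picks several points at different depths along one ray and shows that they all land on the same pixel.

```python
# Illustrative intrinsics in pixels (assumed values, not from a real camera).
FX, FY = 500.0, 500.0
OX, OY = 320.0, 240.0


def project(x, y, z):
    """Pinhole projection of a camera-frame point (x, y, z) to a pixel (u, v)."""
    return FX * x / z + OX, FY * y / z + OY


def back_project(u, v, z):
    """Point on the ray through pixel (u, v) at a chosen depth z."""
    return (u - OX) * z / FX, (v - OY) * z / FY, z


# Every depth along the ray through this pixel reprojects to the same pixel,
# so a single image cannot tell these candidates apart.
pixel = (400.0, 300.0)
candidates = [back_project(pixel[0], pixel[1], z) for z in (1.0, 2.0, 5.0, 10.0)]
reprojected = [project(*point) for point in candidates]
for point, uv in zip(candidates, reprojected):
    print(point, "->", uv)
```

The depth has to be supplied by hand in `back_project`, which is exactly the piece of information a single camera cannot provide.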


# How to recover?

The trick is to understand how nature does this. We all have 2 eyes, and we perceive depth (3D) in all objects, don't we? The "second eye" provides its own view of the world in addition to the first, and the two work together to perceive depth.

Let's apply the same concept here. Let's bring in another camera, place it at a distance along the same horizontal axis, find the exact same scene point in its image, and draw its own "dotted green line". The intersection of these 2 dotted green lines gives you the depth Z. See below for the visual.

![](https://abhisheksreesaila.github.io/blog/images/stereo/stereo.png "Stereo Vision")


- $(u_r, v_r)$ and $(u_l, v_l)$ are the same scene point as seen in the right and left camera images respectively
- the distance between the cameras is called the baseline, denoted by b
- each camera's coordinate system has its origin at its pinhole: (0,0,0) for the left camera and (b,0,0) for the right
- P (x, y, z) is the scene point that we are trying to compute using the 2 cameras.


This concept of using 2 cameras to perceive depth in the real world is called Simple Stereo Vision. In this blog post let's understand the mechanics of such a system.

# Finding Depth

Let's start with the basics. For a pinhole camera system we know the following equations for the left camera

$u_l = f_x \frac{x}{z} + O_x$  ;   $v_l = f_y \frac{y}{z} + O_y$


For the right camera it's the same, except that the camera origin is shifted by "b" along the x axis, so the point's x coordinate becomes $x - b$

$u_r = f_x \frac{x - b}{z} + O_x$  ;   $v_r = f_y \frac{y}{z} + O_y$


Using the 4 equations, solving for x, y, z  we get

$x = \frac{b (u_l - O_x)}{u_l - u_r}$

$y = \frac{b f_x (v_l - O_y)}{f_y (u_l - u_r)}$

$z = \frac{b f_x}{u_l - u_r}$


where $u_l - u_r$ is called the disparity; it is inversely proportional to "z".


> Important: If we know the internal parameters $f_x$, $f_y$, $O_x$, $O_y$ and the baseline $b$, and we compute the disparity, we can compute Z and hence the depth.
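The equations above can be checked with a small round trip. This is a sketch under assumptions: the function names `project_stereo` and `triangulate` and all numeric values are hypothetical, and both cameras are assumed to share the same internal parameters. It projects a known scene point into both cameras and then recovers it from the pixel coordinates alone.

```python
def project_stereo(x, y, z, fx, fy, ox, oy, b):
    """Project a scene point into the left camera (at the origin) and the
    right camera (shifted by the baseline b along the x axis)."""
    u_l = fx * x / z + ox
    v_l = fy * y / z + oy
    u_r = fx * (x - b) / z + ox
    v_r = fy * y / z + oy
    return (u_l, v_l), (u_r, v_r)


def triangulate(u_l, v_l, u_r, fx, fy, ox, oy, b):
    """Recover (x, y, z) in the left camera frame from a matched pixel pair."""
    disparity = u_l - u_r
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = b * fx / disparity
    x = b * (u_l - ox) / disparity
    y = b * fx * (v_l - oy) / (fy * disparity)
    return x, y, z


# Assumed parameters: focal lengths and principal point in pixels, baseline in metres.
params = dict(fx=500.0, fy=500.0, ox=320.0, oy=240.0, b=0.1)
(u_l, v_l), (u_r, v_r) = project_stereo(0.4, -0.2, 2.0, **params)
recovered = triangulate(u_l, v_l, u_r, **params)
print((u_l, v_l), (u_r, v_r), recovered)
```

Note that `triangulate` never uses $v_r$: for this aligned setup the vertical coordinate is the same in both images.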


If the object is close to the camera, you will see a large disparity. For example, the U value in the left camera might be 100, whereas in the right camera it is 75. This is the exact same scene point, but it lands on 2 different pixel columns. The opposite is also true: if the object is far away, there will be very little difference between the U values (say 100 and 95). At infinite distance, the U values will be exactly the same.

Disparity is proportional to the baseline, meaning that if the distance between the cameras increases, the disparity increases too.
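As a rough numerical illustration of these two relationships, the sketch below plugs the example U values (100 vs 75, and 100 vs 95) into $z = \frac{b f_x}{u_l - u_r}$. The focal length `FX` and baseline `B` are assumed values chosen only to make the numbers easy to read.

```python
def depth_from_disparity(disparity_px, fx_px, baseline_m):
    """z = b * f_x / disparity."""
    return fx_px * baseline_m / disparity_px


def disparity_from_depth(depth_m, fx_px, baseline_m):
    """disparity = b * f_x / z."""
    return fx_px * baseline_m / depth_m


FX = 500.0  # assumed focal length in pixels
B = 0.1     # assumed baseline in metres

near = depth_from_disparity(100 - 75, FX, B)  # large disparity, close object
far = depth_from_disparity(100 - 95, FX, B)   # small disparity, distant object

# Doubling the baseline doubles the disparity of a point at a fixed depth.
narrow = disparity_from_depth(2.0, FX, B)
wide = disparity_from_depth(2.0, FX, 2 * B)
print(near, far, narrow, wide)
```

With these assumed values the disparity of 25 pixels corresponds to a depth of about 2 m and the disparity of 5 pixels to about 10 m.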

I keep mentioning only "U" because there is no $v_l - v_r$ in the equations, which means only the horizontal component differs between the 2 cameras, not the vertical component ($v_l = v_r$). This property shows that $(u_l, v_l)$ and $(u_r, v_r)$ lie on the same horizontal line (shown by the yellow line). So when we compute disparity to solve for X, Y, Z in the real world, we can pick a point $(u_l, v_l)$ in the left camera and search ONLY along the same line in the right camera (rather than wandering aimlessly over the whole image) to find $(u_r, v_r)$, i.e. it is a "1D" search problem. See the image below for an example. Finding this matching point is often called the **"correspondence problem"**.

![](https://abhisheksreesaila.github.io/blog/images/stereo/texture.png "Finding Correspondence")

The white patch in the picture is called the "scan line".
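One common way to do this 1D search is to slide a small window along the scan line and keep the position with the lowest sum of absolute differences (SAD). The sketch below is a toy version on a single synthetic row of pixels. The function names, the window size and the synthetic data are assumptions for illustration, not the method used to produce the image above.

```python
def sad(a, b):
    """Sum of absolute differences between two equally long pixel windows."""
    return sum(abs(p - q) for p, q in zip(a, b))


def match_along_scanline(left_row, right_row, u_l, half_window=2, max_disparity=16):
    """Return the disparity whose right-image window best matches the
    left-image window centred on column u_l."""
    lo, hi = u_l - half_window, u_l + half_window + 1
    template = left_row[lo:hi]
    best_disparity, best_cost = None, float("inf")
    for d in range(max_disparity + 1):
        if lo - d < 0:
            break  # the window would fall off the left edge of the right image
        cost = sad(template, right_row[lo - d:hi - d])
        if cost < best_cost:
            best_disparity, best_cost = d, cost
    return best_disparity


# Synthetic scan line with non-repeating texture. The right row is the left
# row shifted by 4 pixels, i.e. u_r = u_l - 4 for every column.
left_row = [(7 * i * i + 3 * i) % 101 for i in range(64)]
true_disparity = 4
right_row = left_row[true_disparity:] + [0] * true_disparity

found = match_along_scanline(left_row, right_row, u_l=30)
print(found)
```

Only one row of the right image is ever touched, which is what makes the search one-dimensional.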

## Problems with stereo matching

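Did we solve the problem of finding depth just by using 2 cameras? Yes, for the most part. But if the images contain repetitive texture (or no texture at all), many patches along the scan line look alike, so it is impossible to find a unique match. We cannot compute disparity and therefore cannot compute depth. See the image below.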

![](https://abhisheksreesaila.github.io/blog/images/stereo/no-texture.png "Stereo Vision CANNOT be computed")
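The failure can be reproduced on a toy scan line. In this sketch (synthetic data and hypothetical function names, using sum of absolute differences as an assumed matching cost) we list every disparity that matches the left window perfectly. A striped row produces several equally good answers, and a flat row produces as many answers as there are candidates.

```python
def sad(a, b):
    """Sum of absolute differences between two equally long pixel windows."""
    return sum(abs(p - q) for p, q in zip(a, b))


def zero_cost_disparities(left_row, right_row, u_l, half_window=2, max_disparity=16):
    """All disparities whose right-image window matches the left window exactly."""
    lo, hi = u_l - half_window, u_l + half_window + 1
    template = left_row[lo:hi]
    return [d for d in range(max_disparity + 1)
            if lo - d >= 0 and sad(template, right_row[lo - d:hi - d]) == 0]


# Repetitive texture: stripes with a period of 4 pixels, true disparity 2.
stripes_left = [0, 0, 255, 255] * 16
stripes_right = stripes_left[2:] + [0, 0]
striped = zero_cost_disparities(stripes_left, stripes_right, u_l=30)

# No texture at all: every window looks the same.
flat = zero_cost_disparities([128] * 64, [128] * 64, u_l=30)
print(striped, len(flat))
```

The true disparity is among the perfect matches in both cases, but nothing in the data singles it out.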


# Calibration of the Stereo

In the section above we assumed the stereo pair is calibrated, meaning we know how the two cameras are aligned with respect to each other.

Suppose we take a photo of the Eiffel Tower on an iPhone. Then another person takes a photo of the same scene from a slightly different angle on a Samsung Android phone. Is it possible to compute the Z depth information and hence recover the 3D structure of the scene? The answer turns out to be yes.

![](https://abhisheksreesaila.github.io/blog/images/stereo/Uncalibrated-stereo.png "Uncalibrated Stereo")

Most digital cameras embed metadata within the image, such as the focal length, which can be read to obtain the internal parameters. All we need to compute are the external parameters.


In practice, if 2 cameras capture the same scene from 2 different angles and we know the internal parameters of each camera, then we can calculate the alignment ourselves and hence compute the depth. That is what we will explore in this section.

Consider the picture above. It is identical to the one in the earlier section, except that the left and right cameras have their own coordinate systems $(x_l, y_l, z_l)$ and $(x_r, y_r, z_r)$ respectively.

Our goal is to compute the "translation" and "rotation" of one camera with respect to the other.

## Epipolar Geometry

![](https://abhisheksreesaila.github.io/blog/images/stereo/epipolar_geo.png "Epipolar Plane")






- The highlighted triangle is the "epipolar plane". It's the plane formed by the scene point (P) and the camera origins $o_l$ and $o_r$.

- $e_l$ and $e_r$ are called the epipoles. $e_l$ is the projection of the right camera's origin onto the left image plane, and $e_r$ is the projection of the left camera's origin onto the right image plane.

- Every scene point has its own epipolar plane.
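Since an epipole is just the image of the other camera's origin, it can be computed with the usual projection. The sketch below assumes that a point in the right camera's frame maps to the left camera's frame as `X_l = R @ X_r + t`, so the right origin sits at `t` in the left frame and the left origin sits at `-R.T @ t` in the right frame. The function name `epipoles` and all numbers are illustrative.

```python
import numpy as np


def epipoles(K_l, K_r, R, t):
    """Epipoles in homogeneous pixel coordinates, assuming X_l = R @ X_r + t."""
    e_l = K_l @ t             # image of the right camera's origin in the left camera
    e_r = K_r @ (-R.T @ t)    # image of the left camera's origin in the right camera
    return e_l, e_r


# Assumed internal parameters, shared by both cameras for simplicity.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Right camera displaced sideways and slightly forward, no rotation.
e_l, e_r = epipoles(K, K, np.eye(3), np.array([0.2, 0.0, 0.1]))
e_l_pixel = e_l[:2] / e_l[2]

# Simple stereo: a pure sideways shift. The third homogeneous coordinate is 0,
# so the epipole lies at infinity.
e_l_simple, _ = epipoles(K, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(e_l_pixel, e_l_simple)
```

In the simple stereo case the epipole has no finite pixel position, which is consistent with the matching point always being found along a horizontal scan line.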


## Now why do we care about epipolar geometry?

> Our goal is to find an equation from which we can calculate t, R (translation, rotation)

![](https://abhisheksreesaila.github.io/blog/images/stereo/epipolar_cons.png "Epipolar Constraint")



### Epipolar Constraint

Consider a vector perpendicular to the epipolar plane (highlighted in pink). Let's call it $N$. Since $t$ and $X_l$ both lie in that plane, $N$ is perpendicular to both of them.

From linear algebra,

- $N$ is the cross product of $t$ and $X_l$: $N = t \times X_l$ ....(1)
- the dot product of $X_l$ and $N$ is 0: $X_l \cdot N = 0$ ....(2)

Hence from (1) and (2)

$X_l \cdot (t \times X_l) = 0$

> This is the epipolar constraint. 

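$X_l$ is the vector $(x_l, y_l, z_l)$, the scene point expressed in the left camera's coordinates, and $X_r = (x_r, y_r, z_r)$ is the same point in the right camera's coordinates. The two are related by a rigid transformation, $X_l = R X_r + t$, where $t$ is the position of the right camera with respect to the left and $R$ is the orientation of the right camera with respect to the left. Substituting this into the epipolar constraint, the $t \times t$ term vanishes and you end up with

> $X_l^T E X_r = 0$   ...(3)

- $E = [t]_\times R$ is a 3x3 matrix called the Essential Matrix, where $[t]_\times$ is the matrix form of the cross product with $t$

But notice that $X_l$ and $X_r$ still appear in the equation, and these are exactly the unknown 3D coordinates we are trying to find! What we can actually measure are pixels. So, using perspective projection,

$u_l = f_x \frac{x_l}{z_l} + O_x$  ;   $v_l = f_y \frac{y_l}{z_l} + O_y$, where $f_x$ and $f_y$ are focal lengths measured in pixels (and similarly for the right camera).

In matrix form this is $z_l U_l = K_l X_l$, with $U_l = (u_l, v_l, 1)^T$ and $K_l$ the matrix of internal parameters. Substituting $X_l = z_l K_l^{-1} U_l$ (and likewise for $X_r$) into equation (3) gets rid of $x_l$ and $y_l$, but $z_l$ remains! However, $z_l$ can never be 0, since it is the depth. In plain terms, the world exists in front of the camera, so every scene point has some non-zero value of "z", i.e. $z_l \neq 0$, and we can divide it out. Using these ideas we arrive at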

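> $U_l^T K_l^{-T} E K_r^{-1} U_r = 0$

> $U_l^T F U_r = 0$

where $U_l$ and $U_r$ are the pixel coordinates in homogeneous form, $K_l$ and $K_r$ are the matrices of internal parameters, and $F = K_l^{-T} E K_r^{-1}$ is called the fundamental matrix. I have intentionally skipped most of the math, but for those mathematically inclined, check out the explanation [here](https://www.youtube.com/watch?v=6kpBqfgSPRc)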
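Both forms of the constraint can be checked numerically. The sketch below is an illustration under assumptions: it uses NumPy, made-up values for the pose and internal parameters, and the convention `X_l = R @ X_r + t`, so that the essential matrix is the cross-product matrix of `t` times `R`. The helper names are hypothetical.

```python
import numpy as np


def skew(t):
    """Matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def rotation_y(angle_rad):
    """Rotation about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


# Assumed pose of the right camera in the left camera's frame.
R = rotation_y(np.deg2rad(10.0))
t = np.array([0.2, 0.0, 0.02])

# Assumed internal parameters (deliberately different for the two cameras).
K_l = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K_r = np.array([[520.0, 0.0, 310.0], [0.0, 520.0, 250.0], [0.0, 0.0, 1.0]])

E = skew(t) @ R                                     # essential matrix
F = np.linalg.inv(K_l).T @ E @ np.linalg.inv(K_r)   # fundamental matrix

# One scene point expressed in both camera frames.
X_r = np.array([0.3, -0.1, 4.0])
X_l = R @ X_r + t

# Homogeneous pixel coordinates (u, v, 1) in each image.
U_l = K_l @ X_l / X_l[2]
U_r = K_r @ X_r / X_r[2]

essential_residual = X_l @ E @ X_r
fundamental_residual = U_l @ F @ U_r
print(essential_residual, fundamental_residual)
```

Both residuals come out at the level of floating-point round-off, even though the second one only uses pixel coordinates and never the depth.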
 
###  How does this work in practice?


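1. Suppose we are given the "F" matrix. We can easily get "E" from it, since the internal parameters "K" of both cameras are known: $E = K_l^T F K_r$


2. Once you have E from step 1, a technique called "[singular value decomposition](https://keisan.casio.com/exec/system/15076953160460)" lets us decompose it into "t" and "R". Note that "t" is only recovered up to scale: we get its direction, not its length.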
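Step 2 can be sketched with NumPy's SVD. This follows the standard textbook decomposition rather than anything specific to this post, and assumes the convention `X_l = R @ X_r + t`. The decomposition yields two candidate rotations and a translation direction whose sign is unknown. Picking the right combination is usually done by checking that triangulated points end up in front of both cameras, which is not shown here. The function names and the test pose are made up.

```python
import numpy as np


def skew(t):
    """Matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def decompose_essential(E):
    """Return two rotation candidates and the translation direction (up to sign)."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs if needed so the candidates are proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R_a = U @ W @ Vt
    R_b = U @ W.T @ Vt
    t_dir = U[:, 2]  # unit vector; the true t is some positive or negative multiple
    return R_a, R_b, t_dir


# Ground-truth pose (assumed values) used only to build a test matrix E.
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.2, 0.0, 0.02])
E = skew(t_true) @ R_true

R_a, R_b, t_dir = decompose_essential(E)
rotation_recovered = np.allclose(R_a, R_true) or np.allclose(R_b, R_true)
direction_recovered = np.isclose(abs(t_dir @ t_true) / np.linalg.norm(t_true), 1.0)
print(rotation_recovered, direction_recovered)
```

The length of `t_true` cannot be read back from `E`, so only its direction is compared.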


## Finding correspondence

In the previous section, we said that given a point $(u_l, v_l)$, finding the matching point $(u_r, v_r)$ is a 1D search problem, i.e. we only have to search in one direction, horizontally. But that relied on the two cameras being perfectly aligned. For an arbitrary pair of cameras, where do we search? Can epipolar geometry tell us which part of the image to search?

Fortunately the answer is yes! There is one more component of epipolar geometry that comes to the rescue: the epipolar line.

The projections onto the right image of all the points along the vector $X_l$ (the ray from the left camera through the scene point) lie on a line called the epipolar line. See below for a visual.


![](https://abhisheksreesaila.github.io/blog/images/stereo/epipolar-line.png "Epipolar line")
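Given the fundamental matrix, the epipolar line in the right image for a left pixel $U_l$ can be written as $F^T U_l$, since $U_l^T F U_r = 0$ says exactly that $U_r$ lies on that line. The sketch below is an illustration with assumed values, the convention `X_l = R @ X_r + t`, and hypothetical helper names. It checks two things: for simple stereo the line is the horizontal scan line $v_r = v_l$, and for a general pose the points along the left ray all project onto the computed line.

```python
import numpy as np


def skew(t):
    """Matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental_from_pose(K_l, K_r, R, t):
    """F such that U_l^T F U_r = 0, assuming X_l = R @ X_r + t."""
    return np.linalg.inv(K_l).T @ skew(t) @ R @ np.linalg.inv(K_r)


def epipolar_line_in_right(F, u_l, v_l):
    """Line (a, b, c) with a*u_r + b*v_r + c = 0, scaled so a^2 + b^2 = 1."""
    line = F.T @ np.array([u_l, v_l, 1.0])
    return line / np.hypot(line[0], line[1])


# Assumed internal parameters, shared by both cameras.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
u_l, v_l = 400.0, 300.0

# Simple stereo: no rotation, right camera shifted by the baseline along x.
F_simple = fundamental_from_pose(K, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
line_simple = epipolar_line_in_right(F_simple, u_l, v_l)

# General pose (assumed): rotation about the y axis plus a translation.
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.2, 0.03, 0.02])
line = epipolar_line_in_right(fundamental_from_pose(K, K, R, t), u_l, v_l)

# Walk along the left ray and project each point into the right camera.
ray = np.linalg.inv(K) @ np.array([u_l, v_l, 1.0])
distances = []
for depth in (1.0, 2.0, 5.0, 20.0):
    X_l = depth * ray
    X_r = R.T @ (X_l - t)
    U_r = K @ X_r / X_r[2]
    distances.append(abs(line @ U_r))  # distance in pixels from the epipolar line
print(line_simple, distances)
```

Because the line is normalised, each entry of `distances` is a distance in pixels, so the correspondence search can be restricted to a thin band around the line.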









# References

[Stereo Vision](https://www.youtube.com/watch?v=hUVyDabn1Mg)