# Essential matrices
In the last notebook you computed point correspondences between images using SIFT descriptors and a brute-force matching scheme similar to the one used for image stitching.  With these correspondences in hand, we could, in principle, apply the triangulation code developed earlier to find the 3D location of all these points.  Or could we?  Triangulation was possible because we already had pre-defined ground control points with which to compute a camera matrix.  However, producing these ground control points is extremely laborious: for each image that we might care to analyze, we must (manually) find at least 3 (and sometimes more) points for which we know the correspondence between real-world coordinates and image coordinates.  This is often not desirable (or even possible).  Let's look at an example of SIFT correspondences, filtered by the ratio test, of a scene in my office.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import cv2
import piexif

I_1 = plt.imread('pens_0.jpg')
I_2 = plt.imread('pens_1.jpg')

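# SIFT is in the main module for OpenCV >= 4.4; older contrib builds
# expose it as cv2.xfeatures2d.SIFT_create() instead
sift = cv2.SIFT_create()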

kp1,des1 = sift.detectAndCompute(I_1,None)
kp2,des2 = sift.detectAndCompute(I_2,None)

bf = cv2.BFMatcher()
matches = bf.knnMatch(des1,des2,k=2)

# Apply the ratio test: keep a match only if its best distance is
# clearly smaller than its second-best distance
good = []
for m,n in matches:
    if m.distance < 0.6*n.distance:
        good.append(m)

u1 = []
u2 = []

for m in good:
    u1.append(kp1[m.queryIdx].pt)
    u2.append(kp2[m.trainIdx].pt)
    
u1 = np.array(u1)
u2 = np.array(u2)

#Make homogeneous
u1 = np.c_[u1,np.ones(u1.shape[0])]
u2 = np.c_[u2,np.ones(u2.shape[0])]


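# Image dimensions, needed to place the two images side by side
# (both images are assumed to have the same size)
h,w,d = I_1.shape

skip = 10
fig = plt.figure(figsize=(12,12))
I_new = np.zeros((h,2*w,3)).astype(int)
I_new[:,:w,:] = I_1
I_new[:,w:,:] = I_2
plt.imshow(I_new)
plt.scatter(u1[::skip,0],u1[::skip,1])
plt.scatter(u2[::skip,0]+w,u2[::skip,1])
# Draw a line between each pair of matched points
for p1,p2 in zip(u1[::skip],u2[::skip]):
    plt.plot([p1[0],p2[0]+w],[p1[1],p2[1]])
plt.show()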

There are obviously some mismatches, but hopefully we can deal with that later by filtering out outliers.  

With these correspondences in hand, we can work towards recovering the relative geometry of the cameras with which these two images were taken.  This is not as simple as finding a homography, because the camera not only *rotated* but translated as well.  As such, points in the first image will not generally map to points in the second image.  Instead, points in the first image will map to *lines* in the second image.  This is easily understood by looking at the following figure:
<img src=epipolar.jpg>
If $C$ is the optical center of camera one, and $\mathbf{x}$ is the location of a point of interest on camera one's imaging plane in generalized image coordinates, then these two locations form a ray, a line which shoots out from the camera and intersects all of the places in world coordinates that will map to that location on image one's imaging plane.  What does that ray look like in image two?  It is, of course, a line (unless the camera centers are collocated).  We can write this property mathematically as
$$ 
\mathbf{E} \mathbf{x}_1 = \mathbf{l}_2,
$$
where $\mathbf{l}_2$ are the coefficients of the line in the second image, i.e.
$$
ax + by + c = \mathbf{l} \cdot \mathbf{x} = 0, 
$$
and $\mathbf{E}$ is called the *essential matrix*.  As it turns out, the essential matrix contains all of the information we need for recovering the relative geometry between two images, and in fact has the property that 
$$
\mathbf{E} = [\mathbf{t}]_\times \mathbf{R},
$$
where $\mathbf{R}$ is the rotation matrix between the two cameras and $[\mathbf{t}]_\times$ is the skew-symmetric matrix that applies the cross product with the translation vector $\mathbf{t}$, i.e. 
$$
[\mathbf{t}]_\times = \begin{bmatrix} 0 & -t_Z & t_Y \\
                                      t_Z & 0 & -t_X \\
                                      -t_Y & t_X & 0 \end{bmatrix}.
$$
Both $\mathbf{R}$ and $\mathbf{t}$ can be recovered from the essential matrix, the former exactly, and the latter up to a scale.                        
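
As a quick sanity check of these definitions, here is a minimal sketch that builds $[\mathbf{t}]_\times$, forms $\mathbf{E}$ from a known pose, and verifies that a point seen by both cameras lies on its epipolar line.  The helper name `skew`, the pose values, and the 3D point are all made up for illustration, and the sketch assumes the convention $\mathbf{X}_2 = \mathbf{R}\mathbf{X}_1 + \mathbf{t}$ for moving a point from the first camera's frame to the second's.

```python
import numpy as np

def skew(t):
    """Return [t]_x, the matrix for which skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz,  ty],
                     [ tz, 0.0, -tx],
                     [-ty,  tx, 0.0]])

# Hypothetical relative pose: a small rotation about the y axis plus a translation
angle = 0.1
R_demo = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
                   [ 0.0,           1.0, 0.0          ],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_demo = np.array([1.0, 0.2, 0.1])
E_demo = skew(t_demo) @ R_demo

# A made-up 3D point, expressed in each camera's frame
X1_demo = np.array([0.3, -0.2, 4.0])
X2_demo = R_demo @ X1_demo + t_demo

# Generalized (homogeneous) image coordinates: divide by depth
x1_demo = X1_demo / X1_demo[2]
x2_demo = X2_demo / X2_demo[2]

# Epipolar line in image two, and how far x2 is from satisfying l2 . x2 = 0
l2_demo = E_demo @ x1_demo
residual_demo = x2_demo @ l2_demo
print(residual_demo)
```

The printed residual should be zero up to floating-point round-off.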

Note that the essential matrix is defined in terms of generalized image coordinates, rather than ordinary (pixel) image coordinates, which is to say that the influence of the focal length and the image center offsets has been removed.  How do we compute these coordinates?  Recall that image coordinates are related to generalized image coordinates by 
$$
\mathbf{u} = \mathbf{K} \mathbf{x},
$$
where $\mathbf{K}$ is the so-called camera matrix
$$
\mathbf{K} = \begin{bmatrix} f & 0 & c_u \\
                             0 & f & c_v \\
                             0 & 0 & 1 \end{bmatrix}.
$$
$\mathbf{K}$ is easily invertible, so we have that $\mathbf{x} = \mathbf{K}^{-1} \mathbf{u}$:

In [None]:
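# Read the EXIF metadata of the first image
exif_1 = piexif.load('pens_0.jpg')

h,w,d = I_1.shape
# Convert the 35mm-equivalent focal length to pixels (a 35mm frame is 36mm wide)
f = exif_1['Exif'][piexif.ExifIFD.FocalLengthIn35mmFilm]/36*w
# Assume the principal point is at the image center
cu = w//2
cv = h//2

K_cam = np.array([[f,0,cu],[0,f,cv],[0,0,1]])
K_inv = np.linalg.inv(K_cam)
# Row-vector form of x = K^{-1} u, applied to every point at once
x1 = u1 @ K_inv.T
x2 = u2 @ K_inv.T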
print(x1)

Note that the $\mathbf{x}$ values are now of order 1, which will be numerically helpful when computing the essential matrix.  

Back to the essential matrix: how do we find it?  Recall that 
$$ 
\mathbf{E} \mathbf{x}_1 = \mathbf{l}_2.
$$
If we knew $\mathbf{l}_2$, we could back out $\mathbf{E}$.  Unfortunately, we don't.  However, because we have point correspondences, we know something almost as good: the location in image 2 of a point that falls on $\mathbf{l}_2$, which is to say that we know a point $\mathbf{x}_2$ such that 
$$
\mathbf{l}_2 \cdot \mathbf{x}_2 = \mathbf{x}_2^T \mathbf{l}_2 = 0,
$$
by the definition of $\mathbf{l}_2$.  Left multiplying the expression for the essential matrix by $\mathbf{x}_2^T$, we get
$$
\mathbf{x}_2^T \mathbf{E} \mathbf{x}_1 = \mathbf{x}_2^T \mathbf{l}_2 = 0.
$$
If we multiply out the left side of this expression, the coefficients of $\mathbf{E}$ appear linearly (see Szeliski, Eq. 7.13).  Thus, if we have 8 point correspondences (as with the homography, this matrix is only defined up to scale), then we can recover the entries of $\mathbf{E}$.  In fact, there are even better algorithms which allow us to find $\mathbf{E}$ using as few as 5 point correspondences.  Note that, as in the case of computing homographies, this process is sensitive to outliers: thus it is beneficial to use RANSAC or something similar to find a model that maximizes the number of inliers while discarding points that do not fit the model.  
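
To make the linear structure concrete, here is a minimal sketch of the basic 8-point approach on synthetic, noise-free data.  It is not the 5-point RANSAC method that OpenCV uses; the function names (`skew`, `eight_point`), the pose, and the random 3D points are all assumptions made up for this illustration.  Each correspondence contributes one row of a matrix $\mathbf{A}$, and the entries of $\mathbf{E}$ are taken from the null vector of $\mathbf{A}$.

```python
import numpy as np

def skew(t):
    """Return [t]_x, the matrix for which skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def eight_point(x1, x2):
    """Estimate E (up to scale and sign) from N >= 8 correspondences.

    x1, x2 are N x 3 arrays of homogeneous generalized image coordinates.
    """
    # x2^T E x1 = sum_ij x2_i E_ij x1_j, so each row of A holds the products
    # x2_i * x1_j in the same order as E.ravel()
    A = np.einsum('ni,nj->nij', x2, x1).reshape(-1, 9)
    # The right singular vector with the smallest singular value solves A e = 0
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto a valid essential matrix: two equal singular values, one zero
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Synthetic scene (all values hypothetical): random points in front of both cameras
rng = np.random.default_rng(0)
angle = 0.1
R_syn = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(angle), 0.0, np.cos(angle)]])
t_syn = np.array([1.0, 0.2, 0.1])
X1_syn = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))
X2_syn = X1_syn @ R_syn.T + t_syn
x1_syn = X1_syn / X1_syn[:, 2:3]
x2_syn = X2_syn / X2_syn[:, 2:3]

E_est = eight_point(x1_syn, x2_syn)
E_ref = skew(t_syn) @ R_syn

# Compare up to scale and sign by normalizing both matrices
E_est_n = E_est / np.linalg.norm(E_est)
E_ref_n = E_ref / np.linalg.norm(E_ref)
err_syn = min(np.linalg.norm(E_est_n - E_ref_n), np.linalg.norm(E_est_n + E_ref_n))
print(err_syn)
```

With real, noisy matches this bare version would be badly affected by outliers, which is why it is normally wrapped in RANSAC.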

This would be a lot to code ourselves.  Fortunately, OpenCV has an excellent method that will, given point correspondences, perform a 5-point algorithm for finding $\mathbf{E}$ wrapped in RANSAC for us.  Because the process is so similar to computing homographies, we will use this instead of coding it ourselves.  It can be called as follows:

In [None]:
E,inliers = cv2.findEssentialMat(x1[:,:2],x2[:,:2],np.eye(3),method=cv2.RANSAC,threshold=1e-3)
inliers = inliers.ravel().astype(bool)
print(E,inliers)

In the above function call, the first two arguments are our corresponding points in generalized, non-homogeneous image coordinates (hence we drop the last column of ones).  The third argument is a camera matrix: in principle, we could give this function $\mathbf{u}_1, \mathbf{u}_2$ along with the camera matrix instead of $\mathbf{x}_1,\mathbf{x}_2$, but my experimentation has shown that this leads to poor results because of the ill-conditioning of the resulting linear system of equations.  Since we are providing coordinates which have already had the camera intrinsics removed, we give it the identity matrix.  The fourth argument specifies that we want to use RANSAC for outlier detection, and the threshold argument is the RANSAC outlier detection threshold: because we're in generalized coordinates, this should be on the order of 1 to 3 pixels divided by the focal length in pixels (which here is comparable to the image width in pixels).
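
To make the threshold choice concrete, here is a small sketch of converting a pixel tolerance into generalized coordinates.  The helper name and the numbers (a 4000-pixel-wide image taken at a 28mm-equivalent focal length) are illustrative assumptions, not values read from the images above.

```python
def pixel_threshold_to_generalized(threshold_px, focal_px):
    """Convert a distance in pixels to generalized image coordinates.

    Since u = f*x + c_u (and likewise for v), distances scale by 1/f.
    """
    return threshold_px / focal_px

# Hypothetical camera: 4000 pixel wide image, 28mm-equivalent focal length
w_demo = 4000
f_demo = 28 / 36 * w_demo

# A 3 pixel tolerance expressed in generalized coordinates
threshold_demo = pixel_threshold_to_generalized(3.0, f_demo)
print(threshold_demo)
```

For these made-up numbers the result is roughly `1e-3`, the same order of magnitude as the threshold used in the call above.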

The algorithm returns the computed essential matrix, as well as a mask of points which successfully passed the outlier test.  We can plot the resulting points (in image coordinates):


In [None]:
skip = 10
fig = plt.figure(figsize=(12,12))
I_new = np.zeros((h,2*w,3)).astype(int)
I_new[:,:w,:] = I_1
I_new[:,w:,:] = I_2
plt.imshow(I_new)
plt.scatter(u1[inliers,0][::skip],u1[inliers,1][::skip])
plt.scatter(u2[inliers,0][::skip]+w,u2[inliers,1][::skip])
# Draw a line between each pair of inlier matches
for p1,p2 in zip(u1[inliers][::skip],u2[inliers][::skip]):
    plt.plot([p1[0],p2[0]+w],[p1[1],p2[1]])
plt.show()

Note that the obviously bad matches have been eliminated.  

Now that we have the essential matrix, we can recover the relative pose of the two cameras.  OpenCV has an easy function to do this as well, that corresponds to Szeliski Eq. 7.18 and Eq. 7.25.  

In [None]:
n_in,R,t,_ = cv2.recoverPose(E,x1[inliers,:2],x2[inliers,:2])

Note that the pose recovery process solves an equation that has four roots: to select the correct one, it uses the original points to enforce *chirality*, or the notion that the triangulated points should lie in front of both cameras.  Note also that recoverPose only returns a single rotation and translation.  These correspond to the rotation and translation of the second camera: it is assumed that the first camera has rotation given by the identity, and that its camera center is at $\mathbf{X} = \mathbf{0}$.
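
The four roots and the chirality test can be sketched in a few lines of NumPy.  This is a textbook-style SVD decomposition under the convention $\mathbf{X}_2 = \mathbf{R}\mathbf{X}_1 + \mathbf{t}$, not OpenCV's actual implementation; the function names, the pose, and the test point are assumptions made up for the illustration.

```python
import numpy as np

def skew(t):
    """Return [t]_x, the matrix for which skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def decompose_essential(E):
    """Return the four candidate (R, t) pairs consistent with E (t has unit length)."""
    U, _, Vt = np.linalg.svd(E)
    # Flip signs if needed so that the candidates are proper rotations (det = +1)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R_a, R_b = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]
    return [(R_a, t), (R_a, -t), (R_b, t), (R_b, -t)]

def depths(R, t, x1, x2):
    """Solve z2*x2 = z1*R@x1 + t for the depths (z1, z2) along the two rays."""
    A = np.column_stack((R @ x1, -x2))
    z, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return z

# Hypothetical true pose (unit-length translation) and one observed 3D point
angle = 0.1
R_pose = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
                   [ 0.0,           1.0, 0.0          ],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_pose = np.array([1.0, 0.2, 0.1])
t_pose = t_pose / np.linalg.norm(t_pose)
X1_pt = np.array([0.3, -0.2, 4.0])
X2_pt = R_pose @ X1_pt + t_pose
x1_pt, x2_pt = X1_pt / X1_pt[2], X2_pt / X2_pt[2]

# Keep only the candidates that put the point in front of both cameras
candidates = decompose_essential(skew(t_pose) @ R_pose)
in_front = [(R, t) for R, t in candidates if np.all(depths(R, t, x1_pt, x2_pt) > 0)]
print(len(in_front))
```

In practice the test is applied to many points at once and the candidate with the most points in front of both cameras wins, which makes the choice robust to a few bad matches.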

We can now form the camera matrices $P_1 = [\mathbf{I}| \mathbf{0}]$ and $P_2 = [\mathbf{R}|\mathbf{t}]$.

In [None]:
# First camera: identity rotation and zero translation, [I|0]
P_1 = np.array([[1,0,0,0],
                [0,1,0,0],
                [0,0,1,0]])
P_2 = np.hstack((R,t))
print(P_1,P_2)

Note that these are projection matrices in generalized image coordinates.  We can always get back to image (pixel) coordinates by multiplying them by the camera matrix:

In [None]:
P_1c = K_cam @ P_1
P_2c = K_cam @ P_2
print(P_1c)
print(P_2c)
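
With two projection matrices in hand, a matched pair of image points can be triangulated linearly.  The following is a minimal sketch of the standard linear (DLT) triangulation on made-up data; it is not necessarily the triangulation code developed earlier, and the function names, pose, and test point are assumptions for the illustration.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with the 3x4 matrix P; returns 2D coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_dlt(P_a, P_b, x_a, x_b):
    """Linear triangulation of one point seen at x_a by P_a and at x_b by P_b."""
    # Each view gives two linear equations in the homogeneous 3D point
    A = np.array([x_a[0] * P_a[2] - P_a[0],
                  x_a[1] * P_a[2] - P_a[1],
                  x_b[0] * P_b[2] - P_b[0],
                  x_b[1] * P_b[2] - P_b[1]])
    # The homogeneous solution is the right singular vector with smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]

# Hypothetical second-camera pose; the first camera is [I|0]
angle = 0.1
R_tri = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
                  [ 0.0,           1.0, 0.0          ],
                  [-np.sin(angle), 0.0, np.cos(angle)]])
t_tri = np.array([[1.0], [0.2], [0.1]])
P_a = np.hstack((np.eye(3), np.zeros((3, 1))))
P_b = np.hstack((R_tri, t_tri))

# Project a made-up 3D point into both views, then recover it
X_known = np.array([0.3, -0.2, 4.0])
X_found = triangulate_dlt(P_a, P_b, project(P_a, X_known), project(P_b, X_known))
print(X_found)
```

Because $\mathbf{t}$ is only known up to scale, the triangulated points from real data share that same unknown overall scale.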

**Your task is to apply this code to two images of your (judicious) choosing.  After finding the two camera matrices, instantiate two camera models with these matrices, and then use your triangulation code to find the 3D position of the points of correspondence (only the inliers!)**

In [None]:
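# uv1, uv2: N x 2 arrays of matched pixel coordinates; K: 3 x 3 intrinsics matrix.
# Because K is passed here, the RANSAC threshold is given in pixels.
E,inliers = cv2.findEssentialMat(uv1,uv2,K,method=cv2.RANSAC,threshold=4)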

Once the essential matrix is found, we can recover the rotation matrix and the translation (the latter only up to a scale) using:

In [None]:
# Pass K so that the pixel coordinates are interpreted with the same intrinsics as above
n_in,R,t,_ = cv2.recoverPose(E,uv1,uv2,K,mask=inliers)

You'll note that this function only returns a single rotation and translation: this method assumes that the first camera has canonical pose $\mathbf{t} = \mathbf{0}$ and $\mathbf{R} = \mathbf{I}$.  Given this convention, we can immediately define the camera matrices

In [None]:
P_0 = K @ np.array([[1,0,0,0],[0,1,0,0],[0,0,1,0]])
P_1 = K @ np.hstack((R,t))

where $P_0$ and $P_1$ are the camera matrices, and $K$ is the matrix of camera intrinsics
$$
K = \begin{bmatrix} f & 0 & c_u \\
                    0 & f & c_v \\
                    0 & 0 & 1 \end{bmatrix}
$$
