# Project 1: Pose estimation
## Due Jan. 30th

### Assigned Reading: Szeliski Ch 2.3, Ch. 6.2

### Problem description
One of the most common computer vision tasks, particularly for things like practical robotics, is called *pose estimation*.  *Pose* is simply the computer vision term for the vector
$$
\mathbf{p} = [X_{cam},Y_{cam},Z_{cam},\phi,\theta,\psi],
$$
where the first three elements of the vector are the position of a camera and the last three elements are its yaw, pitch, and roll.  *Pose estimation* is simply determining these values from an image.  

How is this done?  Imagine that we have identified the real-world coordinates $\mathbf{X}_i$ of several easily identified features that all fit in one photograph.  We'll call them ground control points (GCPs).
<img src="gcp.jpg">
Using code that we've already developed, we can simulate where these GCPs should project to in the image.  If we already know the correct pose, when we perform this projection, the projection of the GCPs (the steeple of M, for example), should be collocated with that feature in a real image that we took with the camera.  This is a good way of ensuring that our camera model is correct.  

However, usually the pose is not known *a priori*.  Instead, we need to find the pose that reduces the misfit between the projection of the GCPs, and their identified location in the image.  At its core, you can think of this as a least-squares problem: adjust the pose of the model camera such that the squared difference between the projection of the GCP and its location in the image is minimized.  We can write this mathematically as:
$$
\mathbf{p}_{opt} = \mathrm{argmin}_{\mathbf{p}} \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^2 (f(\mathbf{X}_i,\mathbf{p})_j - \mathbf{u}_{ij})^2,
$$
where $n$ is the number of GCPs, $f(\mathbf{X},\mathbf{p})$ is the projection of real-world coordinates $\mathbf{X}$ into image (pixel) coordinates (which depends on the pose $\mathbf{p}$), and $\mathbf{u}_i$ is the pixel coordinates of the corresponding point in the image.  When properly formulated, this minimization problem is straightforward to solve.  The classic method for doing so is the [Levenberg-Marquardt algorithm](https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm), which blends the Gauss-Newton method and gradient descent.  
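
For context, the update that Levenberg-Marquardt takes at each iteration can be written as below.  The symbols are introduced here for illustration only and do not appear elsewhere in the assignment: $\mathbf{r}$ is the stacked vector of residuals $f(\mathbf{X}_i,\mathbf{p})_j - \mathbf{u}_{ij}$, $J$ is its Jacobian with respect to $\mathbf{p}$, and $\lambda \geq 0$ is a damping parameter.
$$
(J^T J + \lambda I)\,\delta\mathbf{p} = -J^T \mathbf{r}, \qquad \mathbf{p} \leftarrow \mathbf{p} + \delta\mathbf{p}.
$$
As $\lambda \rightarrow 0$ this reduces to the Gauss-Newton step, while for large $\lambda$ it approaches a short gradient-descent step, $\delta\mathbf{p} \approx -\frac{1}{\lambda} J^T \mathbf{r}$.  Practical implementations adjust $\lambda$ automatically depending on whether the last step reduced the misfit.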

### Software Requirements:
Your assignment is to develop a camera model that has the capability to perform pose estimation.  It should be structured as a Python class with (at least) the following methods:
* A method for performing the projective transform
* A method for performing the transformation from world to generalized camera coordinates
* A method for estimating pose, given ground control points (an excellent Python implementation of Levenberg-Marquardt is available through [`scipy.optimize.least_squares`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.least_squares.html) with `method='lm'`).

A skeleton for this class might be:

In [None]:
class Camera(object):
    '''Reduce the misfit between the projection of the GCPs and their identified location in the image.
       Adjust the pose s.t. the squared difference between the projection of the GCP and its location
       in the image is minimized.
       Project the GCPs onto the sensor.
       Recall: Guess at pose vector.
       Then, reduce the misfit between the GCP coordinates that we predicted
       using our guessed-at coordinates, and the true coordinates.'''
    def __init__(self, focal_length=None, sensor_x=None, sensor_y=None):
        self.p = None                   # Pose
        self.focal_length = focal_length                   # Focal Length in Pixels
        self.sensor_x = sensor_x
        self.sensor_y = sensor_y
        
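    def projective_transform(self, x):
        """  
        This function performs the projective transform on generalized coordinates in the camera reference frame.
        x is assumed to be an (n, 3) NumPy array whose columns are the camera-frame x, y, and z coordinates.
        """
        x_norm = x[:, 0] / x[:, 2]  # perspective division by depth
        y_norm = x[:, 1] / x[:, 2]
        # Scale by the focal length (in pixels) and move the origin from the image center to the corner
        u = self.focal_length*x_norm + self.sensor_x / 2
        v = self.focal_length*y_norm + self.sensor_y / 2
        # Keep sub-pixel (float) values: casting to integer indices would make
        # the misfit insensitive to small changes in pose
        return u, v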
    
    def rotational_transform(self, X):
        """  
        This function performs the translation and rotation from world coordinates into generalized camera coordinates.
        """
        pass
    
    def estimate_pose(self, X_gcp, u_gcp):
        """
        This function adjusts the pose vector such that the difference between the observed pixel coordinates u_gcp 
        and the projected pixels coordinates of X_gcp is minimized.
        """
        pass
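
One way the two stubbed methods could be filled in is sketched below, written as standalone functions rather than class methods and tested on synthetic GCPs.  Several things here are assumptions and not part of the assignment: the function names (`world_to_camera`, `project`, `residuals`), the angle convention (yaw as a heading measured clockwise from north, pitch tilting the optical axis upward, roll about the optical axis), the camera axes (x right, y down, z forward), and all of the synthetic coordinates.  Your own convention may differ, and that is fine as long as it is used consistently.

```python
import numpy as np
from scipy.optimize import least_squares


def world_to_camera(X, p):
    """Translate and rotate (n, 3) world points [E, N, U] into the camera frame.

    ASSUMED convention: yaw is a heading clockwise from north, pitch tilts
    the optical axis upward, roll spins about the optical axis.
    """
    yaw, pitch, roll = p[3:]
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    # Yaw combined with the axis swap (east, north, up) -> (right, down, forward)
    R_yaw = np.array([[cy, -sy, 0.0], [0.0, 0.0, -1.0], [sy, cy, 0.0]])
    R_pitch = np.array([[1.0, 0.0, 0.0], [0.0, cp, sp], [0.0, -sp, cp]])
    R_roll = np.array([[cr, sr, 0.0], [-sr, cr, 0.0], [0.0, 0.0, 1.0]])
    R = R_roll @ R_pitch @ R_yaw
    return (np.asarray(X) - np.asarray(p[:3])) @ R.T


def project(X, p, focal_length, sensor_x, sensor_y):
    """Project (n, 3) world points to (n, 2) sub-pixel image coordinates."""
    x_cam = world_to_camera(X, p)
    u = focal_length * x_cam[:, 0] / x_cam[:, 2] + sensor_x / 2
    v = focal_length * x_cam[:, 1] / x_cam[:, 2] + sensor_y / 2
    return np.column_stack([u, v])


def residuals(p, X_gcp, u_gcp, focal_length, sensor_x, sensor_y):
    """Flattened misfit between projected and observed pixel coordinates."""
    return (project(X_gcp, p, focal_length, sensor_x, sensor_y) - u_gcp).ravel()


# Synthetic (made-up) test case: six GCPs seen from a known pose.
f, w, h = 2448.0, 3264, 2448
X_gcp = np.array([[150.0, 500.0, 60.0], [300.0, 450.0, 90.0],
                  [50.0, 600.0, 40.0], [250.0, 700.0, 120.0],
                  [400.0, 800.0, 30.0], [180.0, 350.0, 70.0]])
p_true = np.array([100.0, 200.0, 50.0, 0.4, 0.05, 0.02])
u_gcp = project(X_gcp, p_true, f, w, h)

# Start from a perturbed pose and let Levenberg-Marquardt reduce the misfit.
p_guess = p_true + np.array([10.0, -8.0, 5.0, 0.05, -0.03, 0.02])
result = least_squares(residuals, p_guess, method='lm', x_scale='jac',
                       args=(X_gcp, u_gcp, f, w, h))
p_est = result.x
print(np.round(p_est, 3))
```

`least_squares` minimizes half the sum of squared residuals, which matches the objective written in the problem description.  With real imagery the initial guess matters: a camera position read off a map and a rough compass heading are usually a reasonable starting point.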


### Testing Requirements
You should test this code on real-world imagery of your own making.  Go out into the world and take a photograph of a scene in which you will be able to identify real-world coordinates.  As an example (which you are free to emulate), I took a photograph of Main Hall from the Oval (see above).  In the background was the M, along with a few other things.  I selected several prominent features in my image, recorded their image coordinates, then used Google Earth (with coordinates set to UTM mode) to determine their locations in world coordinates:

| u (px) | v (px) | Easting (m) | Northing (m) | Elevation (m) | Description |
|--------|--------|-------------|--------------|---------------|-------------|
| 1984 | 1053 | 272558.68 | 5193938.07 | 1015 | Main hall spire |
| 884  | 1854 | 272572.34 | 5193981.03 | 982  | Large spruce |
| 1202 | 1087 | 273171.31 | 5193846.77 | 1182 | Bottom of left tine of M |
| 385  | 1190 | 273183.35 | 5194045.24 | 1137 | Large rock outcrop on Sentinel |
| 2350 | 1442 | 272556.74 | 5193922.02 | 998  | Southernmost window apex on main hall |

I saved this table as a txt file, which I read in and then pass to my `estimate_pose` function.  
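
A minimal sketch of reading such a file with NumPy follows.  The layout is an assumption (comma-separated values with one header row), since the assignment does not prescribe one, and an in-memory string stands in for the file so that the example is self-contained.  In practice you would pass a file path to `np.loadtxt` instead.

```python
import io
import numpy as np

# Stand-in for the saved txt file (ASSUMED layout: comma-separated, one header row)
gcp_text = """u,v,easting,northing,elevation,description
1984,1053,272558.68,5193938.07,1015,Main hall spire
884,1854,272572.34,5193981.03,982,Large spruce
1202,1087,273171.31,5193846.77,1182,Bottom of left tine of M
385,1190,273183.35,5194045.24,1137,Large rock outcrop on Sentinel
2350,1442,272556.74,5193922.02,998,Southernmost window apex on main hall
"""

# Read only the five numeric columns; the free-text description is skipped
data = np.loadtxt(io.StringIO(gcp_text), delimiter=',', skiprows=1,
                  usecols=(0, 1, 2, 3, 4))

u_gcp = data[:, :2]   # observed pixel coordinates (u, v), shape (n, 2)
X_gcp = data[:, 2:]   # world coordinates (easting, northing, elevation), shape (n, 3)
print(u_gcp.shape, X_gcp.shape)
```

Splitting the array this way gives the two arguments that the `estimate_pose(self, X_gcp, u_gcp)` signature in the skeleton expects.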

### Additional notes
* The pose vector has six elements.  Each ground control point has two observations ($u$ and $v$).  How many points are needed to fully constrain the minimization problem?  (Note that more observations are always better, but there is a minimum number for the problem to be well posed.)

* You will need to determine the focal length of your camera.  To do this you will need to read the image's [Exif metadata](https://en.wikipedia.org/wiki/Exif).  Many image viewers (Eye of GNOME, for example) will display it for you.  Look under Properties.  Alternatively, the command line tool ImageMagick can be used:

In [8]:
%%bash
identify -verbose campus.jpg | grep "exif:"


    exif:ApertureValue: 185/100
    exif:BrightnessValue: 0/100
    exif:ColorSpace: 1
    exif:ComponentsConfiguration: 1, 2, 3, 0
    exif:DateTime: 2019:01:22 12:48:36
    exif:DateTimeDigitized: 2019:01:22 12:48:36
    exif:DateTimeOriginal: 2019:01:22 12:48:36
    exif:ExifImageLength: 2448
    exif:ExifImageWidth: 3264
    exif:ExifOffset: 238
    exif:ExifVersion: 48, 50, 50, 48
    exif:ExposureBiasValue: 0/10
    exif:ExposureMode: 0
    exif:ExposureProgram: 2
    exif:ExposureTime: 1/3230
    exif:Flash: 0
    exif:FlashPixVersion: 48, 49, 48, 48
    exif:FNumber: 190/100
    exif:FocalLength: 291/100
    exif:FocalLengthIn35mmFilm: 27
    exif:GPSDateStamp: 2019:01:22
    exif:GPSInfo: 6272
    exif:GPSTimeStamp: 19/1, 48/1, 36/1
    exif:GPSVersionID: 2, 2, 0, 0
    exif:ImageLength: 2448
    exif:ImageUniqueID: R08QSJA00AA
    exif:ImageWidth: 3264
    exif:InteroperabilityOffset: 6242
    exif:ISOSpeedRatings: 50
    exif:LightSource: 0
    exif:Make: samsung
    exif:Ma

Phone cameras typically report a 35 mm equivalent focal length (`exif:FocalLengthIn35mmFilm`, which is 27 here).  Confusingly, to get the focal length in pixels, divide this number by *36* (the width in millimeters of a 35 mm film frame), then multiply by the width of the image in pixels.  Hence, for this image, the focal length is:

In [4]:
f_length_35 = 27
img_width = 3264

f_length = f_length_35/36*img_width
print(f_length)

2448.0


## Bonus Problem! 
### (This won't be graded, but is something that we'll get back to in a few weeks, in case you want to start thinking about these ideas)

A calibrated camera model is a non-linear function that maps from a 3D coordinate system to a 2D coordinate system.  Can this function be inverted?  Can you, based on a 2D image of an object, recover that object's 3D coordinates?  

The answer, in general, is no.  When you project onto the plane, you lose all distance information.  However, what if we have two images of the same object, taken from different angles?  Does the situation change then?  You'll note that there are now three unknowns (East, North, Elevation) and four observations ($u_1,v_1,u_2,v_2$).

In the project file you will find two images (campus_stereo_1.jpg and campus_stereo_2.jpg), with two sets of ground control points (gcp_stereo_1.jpg and gcp_stereo_2.jpg).  Optimize a camera model for each image.  Using both of these camera models, define a new optimization problem for determining the [Easting,Northing,Elevation] position of an object that is identifiable in both images (I used the face of the clock on Main Hall for testing).  Use Levenberg-Marquardt or something similar to solve the problem.  
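
As a starting point for thinking about this, here is a rough sketch of the stereo problem on synthetic data.  It assumes the two poses are already known, and it uses an illustrative angle convention (yaw as a heading clockwise from north, pitch upward, roll about the optical axis) that is not prescribed by the assignment.  The function names, poses, and coordinates are all made up for the example.

```python
import numpy as np
from scipy.optimize import least_squares


def project_point(X, p, f, w, h):
    """Project one world point X = [E, N, U] to pixel coordinates (u, v).

    ASSUMED convention: yaw is a heading clockwise from north, pitch tilts
    the optical axis upward, roll spins about the optical axis.
    """
    yaw, pitch, roll = p[3:]
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    R_yaw = np.array([[cy, -sy, 0.0], [0.0, 0.0, -1.0], [sy, cy, 0.0]])
    R_pitch = np.array([[1.0, 0.0, 0.0], [0.0, cp, sp], [0.0, -sp, cp]])
    R_roll = np.array([[cr, sr, 0.0], [-sr, cr, 0.0], [0.0, 0.0, 1.0]])
    x, y, z = R_roll @ R_pitch @ R_yaw @ (np.asarray(X) - np.asarray(p[:3]))
    return np.array([f * x / z + w / 2, f * y / z + h / 2])


def stereo_residuals(X, poses, pixels, f, w, h):
    """Four residuals (u1, v1, u2, v2) for the three unknowns in X."""
    return np.concatenate([project_point(X, p, f, w, h) - u
                           for p, u in zip(poses, pixels)])


# Two made-up camera poses, 60 m apart, both looking roughly north
f, w, h = 2448.0, 3264, 2448
poses = [np.array([0.0, 0.0, 10.0, 0.0, 0.0, 0.0]),
         np.array([60.0, 0.0, 10.0, -0.2, 0.0, 0.0])]

# Synthetic observations of one object seen in both images
X_true = np.array([20.0, 300.0, 40.0])
pixels = [project_point(X_true, p, f, w, h) for p in poses]

# Recover the object's position from the four pixel observations
X_guess = np.array([10.0, 250.0, 30.0])
result = least_squares(stereo_residuals, X_guess, method='lm', x_scale='jac',
                       args=(poses, pixels, f, w, h))
X_est = result.x
print(np.round(X_est, 3))
```

With real images the poses come from your own pose estimates and carry their own errors, so the recovered position will only be as good as the two camera models and the distance between the two viewpoints allow.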