
Question on coordinate systems #10

Closed
gkioxari opened this issue Jan 6, 2021 · 12 comments

gkioxari commented Jan 6, 2021

I want to relate two frames of the same scene in Hypersim via their camera rotations/translations. For example, assume we have frame0 and frame1, with R0, T0 and R1, T1 respectively as the camera orientations and positions.

Using frame0's depth map, I want to construct a 3D point cloud of (x, y, z) points in frame0's camera coordinate frame. For each pixel (u, v) with depth depth[v, u] in frame0:

x = u / fx * depth
y = v / fy * depth 
z = - depth

Here (u, v) is the pixel coordinate in NDC space with +u pointing right and +v pointing up (here in your code), and (fx, fy) are the camera's focal lengths (e.g. here in your code).

So now (x, y, z) is in the camera coordinate system of frame0, aka xyz_cam0 = (x, y, z)
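For reference, here is a rough numpy sketch of this unprojection (the 60-degree horizontal FOV, the 1024x768 resolution, and the placeholder depth values are just my assumptions for illustration, and I'm treating "depth" as planar depth along the camera's -z axis):

import numpy as np

height, width = 768, 1024
fov_x = np.pi / 3.0
fov_y = 2.0 * np.arctan(height * np.tan(fov_x / 2.0) / width)
fx = 1.0 / np.tan(fov_x / 2.0)   # focal lengths in NDC units (~1.732)
fy = 1.0 / np.tan(fov_y / 2.0)   # (~2.309)

# NDC pixel coordinates: +u points right, +v points up, both in [-1, 1]
u = np.linspace(-1.0 + 1.0 / width, 1.0 - 1.0 / width, width)
v = np.linspace(1.0 - 1.0 / height, -1.0 + 1.0 / height, height)
uu, vv = np.meshgrid(u, v)

depth = np.full((height, width), 4.0)   # placeholder planar depth values

# camera-space point cloud for frame0: x = u/fx*depth, y = v/fy*depth, z = -depth
xyz_cam0 = np.stack([uu / fx * depth, vv / fy * depth, -depth], axis=-1).reshape(-1, 3)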

Now I want to transform this point cloud to frame1's camera coordinate system per

xyz_cam1 = R1.T * R0 * xyz_cam0 + R1.T*(T1 - T0). 

Then in turn, xyz_cam1 can be projected into screen space with the camera matrix M (here in your code). The projection of xyz_cam1 should give me an image close to frame1.

The above reasoning doesn't seem to work. So I am doing something wrong! Thanks Mike :)


mikeroberts3000 commented Jan 6, 2021

Hi Georgia! At first glance, your reasoning looks correct, and you are looking at all the right places in the code where these coordinate conventions are established and implemented. However, I'm arriving at a slightly different expression for xyz_cam1. I'm sure there will be other things to debug as we get on the same page with our coordinate conventions, but can you double check my reasoning and see if my slightly different expression for xyz_cam1 works for you? 😀

# equation 1: I'm assuming that the "camera orientation" matrix is R0, and the "camera position" vector is T0
xyz_world0 == (R0 * xyz_cam0) + T0

(See here for an example of where equation 1 is implemented in our code)

# equation 2: same as above, but for xyz_world1, re-arranged in terms of xyz_cam1
xyz_world1 == (R1 * xyz_cam1) + T1
xyz_world1 - T1 == R1 * xyz_cam1
R1.T * (xyz_world1 - T1) == xyz_cam1

# let xyz_world0 == xyz_world1, substitute equation 1 into equation 2, collect terms
# note that we arrive at a final expression for xyz_cam1 that is slightly different from the one in your initial post
xyz_cam1 == R1.T * (((R0 * xyz_cam0) + T0) - T1)
xyz_cam1 == R1.T * ((R0 * xyz_cam0) + T0) - R1.T * T1
xyz_cam1 == R1.T * R0 * xyz_cam0 + R1.T * T0 - R1.T * T1
xyz_cam1 == R1.T * R0 * xyz_cam0 + R1.T * (T0 - T1)
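
In case it helps, here's a minimal numpy sketch of equations 1 and 2 (assuming points are stored as rows of an (N, 3) array; the function names are just for illustration):

import numpy as np

def cam_to_world(xyz_cam, R, T):
    # equation 1: xyz_world = R * xyz_cam + T, applied row-wise to an (N, 3) array
    return xyz_cam @ R.T + T

def world_to_cam(xyz_world, R, T):
    # equation 2 (re-arranged): xyz_cam = R.T * (xyz_world - T)
    return (xyz_world - T) @ R

def cam0_to_cam1(xyz_cam0, R0, T0, R1, T1):
    # composition of the two; equivalent to
    # xyz_cam1 = R1.T * R0 * xyz_cam0 + R1.T * (T0 - T1)
    return world_to_cam(cam_to_world(xyz_cam0, R0, T0), R1, T1)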


gkioxari commented Jan 6, 2021

You are right! I got T1 - T0 reversed. In my code I decompose the computation explicitly to do cam0 --> world --> cam1.

Some more follow-up. Let's take scene ai_001_001 with frame0=0 and frame1=1. (Below, all xyz_foo are arrays of shape Vx3.)

The range of xyz_cam0:

xyz_cam0.min(0) = [-1.30169848, -1.11776386, -4.26171875]
xyz_cam0.max(0) = [ 1.49213205, 1.21655009, -1.91503906]

The range of xyz_world:

xyz_world.min(0): [ 100.35482207, -116.07335733, 60.82745425]
xyz_world.max(0): [ 102.7282101 , -113.43747262, 62.99127211]

The range of xyz_cam1:

xyz_cam1.min(0): [ 4.44592554, -13.27550555, 11.32567288]
xyz_cam1.max(0): [ 7.42253576, -10.88670971, 13.72111644]

This means that in frame1, the points in camera coordinates are completely off. For example, z is positive, which means all points are behind the camera? I must be messing up somewhere.


mikeroberts3000 commented Jan 6, 2021

I think we're narrowing down the issue.

I'm looking at ai_001_001/images/scene_cam_00_geometry_hdf5/frame.0000.position.hdf5. This file contains the world-space position at each pixel, in asset coordinates (i.e., the native coordinate system specified by the artist when creating the assets). Essentially, this file contains xyz_world values, so we can use it to sanity check our calculations. Note that our camera poses are also specified in asset coordinates.

Anyway, I just did a visual inspection (via h5dump) of this file, and I'm noticing values that exceed the min and max range of xyz_world in your post above. For example, according to h5dump, I see a value for the bottom-right pixel frame.0000.position[767,1023] == [104, 3.57031, 27.6562], where every coordinate is outside your reported min and max range.

Are you using our depth_meters images to obtain xyz_cam0 values? If so, this approach will not work without some additional steps, because the values in our depth_meters images have been scaled so they're not in asset units any more, but the camera poses are specified in asset units. I apologize for the unclear documentation 😅 I will improve the documentation to make this fact more clear.
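
As a quick sanity check, a rough sketch like the following should let you compare your xyz_world values against the ground-truth positions (I'm assuming here that the arrays in the Hypersim HDF5 files live under a "dataset" key):

import h5py
import numpy as np

# Load the world-space (asset-space) positions for frame 0 and compare them
# against your computed xyz_world values.
position_file = "ai_001_001/images/scene_cam_00_geometry_hdf5/frame.0000.position.hdf5"
with h5py.File(position_file, "r") as f:
    xyz_world_gt = f["dataset"][:]   # shape (768, 1024, 3), asset units

xyz_flat = xyz_world_gt.reshape(-1, 3)
print("min:", np.nanmin(xyz_flat, axis=0))
print("max:", np.nanmax(xyz_flat, axis=0))
print("bottom-right pixel:", xyz_world_gt[767, 1023])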


gkioxari commented Jan 6, 2021

Aha! So to better understand the data, here is what I am using:

  • Depth: I am using scene_cam_00_geometry_hdf5/frame.0000.depth_meters.hdf5 to read the depth. I assumed that this is a metric depth from the camera center. The values I get from this file are what become my negative z in cam0 coordinates. So xyz_cam0[:, 2] = -depth. For completeness, xyz_cam0[:, 0] = u / fx * depth and xyz_cam0[:, 1] = v / fy * depth.
  • Camera rotation: For R0 and R1, I use _detail/cam_00/camera_keyframe_orientations.hdf5
  • Camera translation: For T0 and T1, I use _detail/cam_00/camera_keyframe_positions.hdf5

I use R0 and T0 to convert xyz_cam0 --> xyz_world without changing R0, T0 in any way. I just take them directly from this file. And I use R1 and T1 to convert xyz_world --> xyz_cam1. Is this the correct usage?

Does xyz_world, the output from transforming xyz_cam0 with R0 & T0, have to be equal to scene_cam_00_geometry_hdf5/frame.0000.position.hdf5? That is, are the points transformed into the asset coordinate system? If yes, that's going to be helpful for debugging further.
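
For completeness, here is roughly how I am loading these files (the "dataset" key, and the assumption that keyframe i corresponds to frame i, are my guesses about the file layout):

import h5py

scene_dir = "ai_001_001"
frame0, frame1 = 0, 1

# depth in meters for frame0
with h5py.File(scene_dir + "/images/scene_cam_00_geometry_hdf5/frame.0000.depth_meters.hdf5", "r") as f:
    depth = f["dataset"][:]

# camera orientations (3x3 rotation matrices) and positions, indexed by keyframe
with h5py.File(scene_dir + "/_detail/cam_00/camera_keyframe_orientations.hdf5", "r") as f:
    R0, R1 = f["dataset"][frame0], f["dataset"][frame1]

with h5py.File(scene_dir + "/_detail/cam_00/camera_keyframe_positions.hdf5", "r") as f:
    T0, T1 = f["dataset"][frame0], f["dataset"][frame1]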


gkioxari commented Jan 6, 2021

Ok I found a good test case now that I understand the data properly. Let's consider ai_001_001 & cam_00, frame=0 and the center pixel.

For the center pixel, (u, v) = (0, 0) or (x, y) = (512, 384). The depth for the center pixel is depth[y, x] = 3.9277344.
For (fx, fy) = (1.732, 2.3094) (I am using this formula), I get

z_cam0 = -depth[y, x]
x_cam0 = u / fx * (-z_cam0) 
y_cam0 = v / fy * (-z_cam0)

which results in xyz_cam0 = [0, 0, -3.9277344]

xyz_world = np.matmul(R0, xyz_cam0) + T0

which results in xyz_world = [ 100.98179613, -114.04803521, 61.78117792].

From scene_cam_00_geometry_hdf5/frame.0000.position.hdf5 at the center pixel we get

pos[y, x] = [14.875 ,  7.9726562, 41.75]

If I understand correctly, pos[y, x] == xyz_world right?

So I presume my mistake is that I am mixing metric values (coming from depth_meters) with camera poses that are defined in asset space. Is there any way I can get the camera R's and T's in the metric world space?


gkioxari commented Jan 6, 2021

Figured it all out! The conversion between asset units and meters is provided! Woop! Thanks Mike for dealing with my stupidity!

gkioxari closed this as completed Jan 6, 2021

mikeroberts3000 commented Jan 6, 2021

Yay! I'm glad you figured this out 😀 Thank you for highlighting these ambiguities in our documentation. I'll try my best to respond point-by-point to your posts for anyone else reading this, and I'll leave this issue open until I get a chance to clarify the documentation.

  • depth_meters is the Euclidean distance in meters to the optical center of the camera. It should really be called distance_from_camera_meters. I apologize for the unclear naming convention 😅 Regardless, this is not the same as the negative z-coordinate in camera space. See ground truth depth is actually distance #9 for a handy code snippet (contributed by @sniklaus) that converts these Euclidean distance values into planar depth values (a rough version of this conversion is also sketched after this list).
  • You're using the correct rotation and translation matrices. camera_keyframe_orientations stores the R matrices directly, and camera_keyframe_positions stores the T vectors directly. Since we agree on equations 1 and 2 in my post above, I believe you are converting between world-space and camera-space correctly.
  • Your xyz_world values should exactly match the values stored in our frame.IIII.position.hdf5 images, assuming you're working in asset units, i.e., not meters.
  • For each scene, the _detail/metadata_scene.csv file defines a meters_per_asset_unit scale factor that can be used to convert between asset units and meters (also used in the sketch below).
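
Putting the last two points together, here is a rough sketch of how one might convert depth_meters into planar depth in asset units. This is my own re-derivation, not the exact snippet from #9, and the parameter_name / parameter_value column names for metadata_scene.csv are my assumption about the csv layout:

import h5py
import numpy as np
import pandas as pd

height, width, fov_x = 768, 1024, np.pi / 3.0

# 1) Convert Euclidean distance-from-camera (what depth_meters stores) into planar
#    depth, i.e., distance along the camera's -z axis.
with h5py.File("ai_001_001/images/scene_cam_00_geometry_hdf5/frame.0000.depth_meters.hdf5", "r") as f:
    distance_meters = f["dataset"][:]

f_pix = (width / 2.0) / np.tan(fov_x / 2.0)   # focal length in pixels
uu, vv = np.meshgrid(np.arange(width) - (width - 1) / 2.0,
                     np.arange(height) - (height - 1) / 2.0)
ray_norm = np.sqrt(uu**2 + vv**2 + f_pix**2)
planar_depth_meters = distance_meters * f_pix / ray_norm

# 2) Convert meters to asset units so that depth and the camera poses live in the
#    same space. Column names here are my assumption about the csv layout.
df = pd.read_csv("ai_001_001/_detail/metadata_scene.csv")
meters_per_asset_unit = float(df.loc[df.parameter_name == "meters_per_asset_unit", "parameter_value"].iloc[0])
planar_depth_asset_units = planar_depth_meters / meters_per_asset_unit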


mikeroberts3000 commented Jan 8, 2021

I just updated the documentation to make these coordinate convention nuances more clear. Thank you for posting this question and helping me to improve the documentation 😀

@phongnhhn92

@gkioxari I believe you perform 3D warping from frame 0 to frame 1, right? Would you mind sharing the code you used to do that? I am curious about how you did it.

@mikeroberts3000

@phongnhhn92 What do you mean by "3D warping"? What do you want to do exactly?

@phongnhhn92

@mikeroberts3000 I guess I am using the wrong term here; it should be "homography warping". Since we know the RGB image of frame0, along with R0, T0, depth0, R1, and T1, I would like to warp frame0 to frame1 and see how the warped image looks compared to frame1. So the point here is to have the correct transformation from the camera space of frame0 -> world space of frame0 -> camera space of frame1. I think you and @gkioxari have discussed how to do this in this issue.
After reading everything, I think R and T are in asset coordinates, but depth0 needs to be converted as well (based on this). I just want to do a sanity check with R and T and use them for my subtask.


mikeroberts3000 commented Feb 18, 2021

I don't know exactly what "homography warping" is, and I don't know what you mean when you're saying you'd like to "warp frame0 to frame1". Anyway, I'm interpreting your question as follows. Most of the time, frame 0 and frame 1 will observe many of the same points, but they will be at different pixel locations (e.g., a coffee mug might be in the corner of frame 0 but closer to the center of frame 1). So, for all the points that are visible in frame 0, you would like to compute their pixel locations in frame 1.

  • The easiest way is to start with frame.0000.position.hdf5, because this file contains the world-space positions for each pixel in frame 0.
  • You need to transform these world-space points into camera space for frame 1, and then project those camera-space points into screen space for frame 1 to compute their final pixel locations. See our scene_generate_images_bounding_box.py for an end-to-end code example. In this example, we project the corners of a 3D bounding box from world-space to screen-space using the camera information stored in the Hypersim data. (A rough sketch of this projection also follows this list.)
  • To verify your understanding, you can also try this entire exercise starting with frame.0000.depth_meters.hdf5. This file contains distance values in meters, not world-space positions, and therefore requires a couple of extra conversion steps. I recommend getting things working the easy way first, and then posting a new question if you have questions about working with frame.0000.depth_meters.hdf5.
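
For completeness, here is a rough sketch of that world-space-to-pixel projection under the conventions discussed earlier in this thread. The FOV, resolution, and function name are my own assumptions for illustration, and this is not the exact code from scene_generate_images_bounding_box.py:

import numpy as np

def world_to_pixels(xyz_world, R1, T1, width=1024, height=768, fov_x=np.pi / 3.0):
    # world -> frame 1 camera space (equation 2 from earlier in this thread)
    xyz_cam1 = (xyz_world - T1) @ R1

    # perspective divide into NDC: +u right, +v up; points in front of the camera have z < 0
    fov_y = 2.0 * np.arctan(height * np.tan(fov_x / 2.0) / width)
    fx, fy = 1.0 / np.tan(fov_x / 2.0), 1.0 / np.tan(fov_y / 2.0)
    in_front = xyz_cam1[:, 2] < 0.0
    u = fx * xyz_cam1[:, 0] / (-xyz_cam1[:, 2])
    v = fy * xyz_cam1[:, 1] / (-xyz_cam1[:, 2])

    # NDC -> pixel coordinates (row 0 at the top of the image)
    cols = (u + 1.0) * 0.5 * width
    rows = (1.0 - v) * 0.5 * height
    return np.stack([rows, cols], axis=-1), in_front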
