
Training with our own data #1

Closed
zawlin opened this issue Jan 13, 2020 · 33 comments

@zawlin

zawlin commented Jan 13, 2020

Hi,
I have a few questions about how the data should be formatted and about the format of the provided dryice1 data.

  • Does the model expect world-space coordinates in meters? i.e., if my extrinsics are already in meters, do I still need the world_scale=1/256. in the config.py file?
  • Are the extrinsics world-to-camera, and is the rotation convention like OpenCV's (y-down, z-forward, x-right), assuming identity for the pose.txt file?
  • How long do I need to train for about 200 frames? In the config.py file it seems you are skipping some frames; is this OK to do for my own sequence as well?
  • In the KRT file, I see that there are 5 parameters above the RT matrix. Is this the distortion correction in OpenCV format? But it is not used, yes?
  • I did not visualize your cameras, so I am not sure how they are distributed. Is it going to be a problem if I use 50 cameras equally distributed over a half-hemisphere, with the subject already at the world origin and 3.5 meters from every camera? My question is: do I need to filter the training cameras so that the back side of the subject, which is not seen by the 3 input cameras, is excluded?
  • How do I choose the input cameras? I have a visualization of the cameras. Which camera config should I use? Is this more a question of which testing camera poses I intend to have, i.e., the narrower the testing cameras' range of view, the closer the input training cameras can be? Config_0 is more orthogonal and Config_1 sees less of the backside.
@stephenlombardi
Contributor

Hi, thanks for reaching out!

  • There's no expectation of a particular measurement unit (i.e., meters). The purpose of world_scale is to convert the coordinates of the volume to lie in [-1, 1]^3, because that's the coordinate system that torch.nn.functional.grid_sample uses. So world_scale effectively dictates the size of the volume that is modeled, because everything beyond -1 or 1 is cut off (see the sketch just after this list).
  • Yes, OpenCV convention for rotation.
  • In general I find around 200k-500k iterations gives a good result, depending on the complexity of the scene. Returns tend to diminish after this point. For 200 frames you can probably err towards the shorter end of the range. I believe the included config.py file uses all available frames.
  • Yes, those are distortion parameters, but the images are all undistorted.
  • No, don't filter out cameras; you want to use as many cameras as you can. The system can still model the details that the "input" cameras don't see (this is because it still gets training examples from other viewpoints).
  • The selection of the input cameras actually does not matter much; either one of those camera configurations should give similar results. As stated before, even though the encoder network only sees the 3 input images, it's only using them to produce a latent code, and it's easy for the encoder network to include information about the entire scene in the codes (this is a feature, not a bug).
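
To make the world_scale point a bit more concrete, here is a rough sketch of the mental model (illustrative only, not the exact raymarcher code):

```python
import numpy as np

# Rough illustration: world_scale rescales your capture-space coordinates so
# the region you want to model lands inside the [-1, 1]^3 cube that
# torch.nn.functional.grid_sample operates on.
world_scale = 1. / 256.          # e.g. if your object spans roughly +/- 256 units

def to_normalized(p_world, volume_center=np.zeros(3)):
    """Anything that maps outside [-1, 1]^3 is cut off from the modeled volume."""
    return world_scale * (np.asarray(p_world) - volume_center)

print(to_normalized([100., 0., -200.]))   # well inside the cube at this scale
```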

If you have any more questions or need me to clarify anything else, let me know!

@zawlin
Author

zawlin commented Jan 13, 2020

Can you check the shared folder again? I uploaded a sample result at iteration 1.

I know that my object is at the origin and the cameras are all looking at the origin. pose.txt is set to identity.

From the look of it, it seems the scale is a little wrong? Can I assume that my camera conventions are correct and that only the scale needs to be changed?

@stephenlombardi
Contributor

Yes it appears that the camera configuration is roughly correct. It's good that you can see the initial volume in all viewpoints. You might try increasing the world scale to get the volume to occupy the entire object.

@zawlin
Author

zawlin commented Jan 13, 2020

If I increase the world scale, the volume seems to just grow from the corner until it occupies the upper-left quadrant. I think something is still not quite right. If everything were working, I would expect the initial volume to be exactly in the center, since I know exactly where the cameras are supposed to be looking (0,0,0) and the pose.txt applies no transformation. Do you know what the issue might be?

@stephenlombardi
Contributor

Check lines 65 and 66 of data/dryice1.py: they divide the focal length and principal point by 4 because the training data is downsampled from the original resolution. You probably don't want that.
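
Paraphrased, those lines do something like this (illustrative only; the variable names in dryice1.py may differ):

```python
import numpy as np

# The dryice1 images are downsampled 4x from the calibration resolution,
# so the intrinsics loaded from the KRT file are rescaled to match.
focal = np.array([5000.0, 5000.0])     # placeholder focal length from KRT
princpt = np.array([2000.0, 1500.0])   # placeholder principal point from KRT
focal, princpt = focal / 4., princpt / 4.
# If your images are already at the calibration resolution, drop the division
# (or replace 4 with your own downsampling factor).
```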

@zawlin
Author

zawlin commented Jan 13, 2020

Yes, that seems to do the trick. Can you check the shared folder again? Does it look like I need to adjust the world scale, or is it just fine?

@stephenlombardi
Contributor

Looks pretty good to me, you'll probably have a better idea once it starts training.

@zawlin zawlin closed this as completed Jan 13, 2020
@zawlin
Author

zawlin commented Jan 14, 2020

Thanks for helping!

@zawlin
Author

zawlin commented Jan 16, 2020

Hi, can you please check the shared folder again? I uploaded the ground truth and rendered results.

  • Will it still keep improving or should I train a few more days?
  • Will the fog effect go away if I train longer?
  • The rendered results look too bright, and the background is black instead of greyish. Does the background need to be added in as a post-processing step? Does the fixedcammean parameter matter? I think that was just doing zero-centering based on the 255 range, right?

@stephenlombardi
Contributor

  • You might see some additional high-frequency detail if you keep training, as this tends to come in last.
  • The fogginess will probably not go away.
  • The code does color correction and gamma correction to the images for rendering (since we assume our data is in linear color space). Check out the file eval/writers/videowriter.py. On line 5 you'll see a bit that does gamma correction (... ** (1. / 1.8) ...). If your images are already gamma corrected, you'll want to get rid of the exponent (or just change the 1.8 to a 1.). By default, when rendering a video not from one of the camera viewpoints, no background is added to the image, because the backgrounds are considered to be camera dependent. On line 29 you'll see default arguments for the background color, which you can change to add an RGB background to that rendering. Also on line 29 is the colcorrect argument, which scales the RGB values; you'll want to set that to [1., 1., 1.] so it does nothing. (See the sketch after this list.)
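
Put together, the post-processing amounts to something like this (an illustrative sketch, not the actual videowriter.py code; parameter names are approximate):

```python
import numpy as np

def postprocess(rgb, alpha, gamma=1.8, bgcolor=None, colcorrect=(1., 1., 1.)):
    """Sketch of the steps described above: per-channel color correction,
    optional background compositing, then gamma correction of the
    linear-space render. rgb is HxWx3 in [0, 255], alpha is HxWx1 in [0, 1]."""
    out = rgb * np.asarray(colcorrect)                 # (1., 1., 1.) leaves colors unchanged
    if bgcolor is not None:                            # composite over a solid RGB background
        out = out + (1. - alpha) * np.asarray(bgcolor)
    out = np.clip(out / 255., 0., 1.) ** (1. / gamma)  # use gamma=1. if already gamma corrected
    return (out * 255.).astype(np.uint8)
```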

@zawlin
Author

zawlin commented Feb 25, 2020

Hi,
It's working very well on synthetic data. However, I have some trouble getting it to work on real data. I am using the data from Microsoft's FVV paper and some data we captured ourselves. Basically, after a while, the training just outputs the background image. I have manually adjusted pose.txt through trial and error so that the volume is visible in all cameras, and set the world scale to 1/2 so that I don't have to spend too much time tweaking. At world scale 1, the volume is cut off in some cameras, but at world scale 1/2 it looks fine. Can you take a look at the progress images under real_data?

  • Is it important that the object is exactly in the center of camera array?
  • How many cameras are needed to be able to get geometry? I have 32 in our data; the lincoln data has 52.

@stephenlombardi
Contributor

It shouldn't matter very much whether the object is exactly centered or not. We only used 34 cameras in the experiments in the paper.

If I had to guess based on the progress images, I would say that it looks like the camera parameters may not be set up correctly. If you look at the first progress image for the lincoln example prog_000003.jpg, the last row shows 4 views located behind the person but the rendered volume looks drastically different for each of them. I would expect it to be more similar if the camera parameters are correct.

If you're sure the camera parameters are correct and in the right format, one thing you can try is training a model without the warp field, as it can sometimes cause stability problems.

@zawlin
Author

zawlin commented Feb 25, 2020 via email

@stephenlombardi
Contributor

Sorry, the pose transformation is a little cryptic, so I'll try to explain it better here. The way the code works is that it assumes the volume always lives in the cube that spans -1 to 1 on each axis. This is what I'll call 'normalized space', since it's centered and has a standard size. When you provide the camera parameters of your rig, the camera extrinsics are in some arbitrary space that I'll refer to as 'camera space'. Because camera space has an arbitrary origin and scale, the object that you want to model won't necessarily fall in the [-1, 1]^3 volume. The pose transformation and world scale are how the code accounts for the difference between these two coordinate systems.

The transformation found in pose.txt transforms points from the normalized space to the camera space. The matrix is stored as a 3x4 matrix where the last column is the translation, which means that the translation column corresponds to the desired center of the volume (which should be the center of your object) in camera space. You can also adjust the rotation portion of the matrix to change the axes but getting the translation right is the most important bit so that the volume is placed correctly in space. Please let me know if that's helpful.
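
For example, if you just want to place the volume at your object's center with no change of axes, writing pose.txt could look something like this (the center coordinates are placeholders for your own values):

```python
import numpy as np

center = np.array([0.0, 0.0, 3.5])   # placeholder: your object's center in camera space
pose = np.concatenate([np.eye(3), center[:, None]], axis=1)   # 3x4 [R | t], normalized -> camera space
np.savetxt("pose.txt", pose)
```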

To disable the warp field you can add a parameter warptype=None to the Decoder constructor on line 33 of config.py.

@zawlin
Author

zawlin commented Feb 27, 2020

OK, got it. I will double-check my camera parameters and try without the warp field over the next few days.

Worst case, would you be able to take a look at the data and check on your side? I can share the original data and the scripts to convert it into the Neural Volumes format, including dataloaders and experiment config files for NV.

@stephenlombardi
Contributor

Sure I can take a look

@zawlin
Author

zawlin commented Mar 5, 2020

I managed to get it to start doing something on the lincoln sequence. It turns out the camera parameters were correct but the pose transformation was wrong. I was only looking at 16 cameras when doing the adjustment, so the volume was not actually overlapping the object in all cameras.

I uploaded new progress images under the same folder and also a zipped folder named lincoln.tar. Can you take a look and see if it looks like it's going well and I only need to wait?

Edit: After waiting one night, it seems alright, although it trains more slowly than on synthetic data, iteration-wise. Again, thanks for all the clarifications!

@stephenlombardi
Contributor

I'm guessing you'll have some artifacts in the result given how it's trying to reconstruct so much of the background. I'm a little surprised since it should be easy for it to figure out that that area should be transparent, although sometimes it can get stuck in bad situations early on and it can have trouble recovering. I would recommend rendering a video of the current result with the render.py script to check it's not doing something too crazy.

@zawlin
Author

zawlin commented Mar 9, 2020

Hmm, something crazy is indeed happening :( I zipped up the entire folder with data and experiments and sent you a link via email. I have also uploaded a reconstruction from another method under the given test trajectories, so that you know what "ground truth" is supposed to look like.

I am also unable to get it working on the other dataset. Whenever it looks like it's going to do something, alphapr suddenly goes to zero, kldiv starts to increase a lot, and then I just get background; then it repeats this process in a loop. I am checking if I can share this data. Do you think sharing just one frame would be sufficient to debug?

Since it looks like the 3D volume is rotating fine, I guess the camera parameters are OK? But based on the test video (and comparison with our result video), maybe the volume is clipping the object, since the rendered result looks like it's shifted down by about half?

@stephenlombardi
Contributor

Sharing one frame to debug should work. I will take a look at the lincoln data and see if I can figure out what's happening.

@stephenlombardi
Contributor

I got the lincoln example working. Attached are the dataset class, config file, and modified pose.txt (although I didn't change pose.txt much). Let me know if this works for you.
experiment1.zip

@zawlin
Author

zawlin commented Mar 11, 2020

I got it working as well. Thanks a lot! Looks like I forgot to rescale the intrinsics. I believe it should work for the other dataset as well.

Edit:
Yup it's working for both datasets.

@zawlin
Author

zawlin commented Mar 19, 2020

I have one more question. In the figure where you showed latent code interpolation, did you use all the frames in the training data? Say you have frames 1-5 in the training data; during testing, did you use the encoder to get frame 1's and frame 5's latent codes and interpolate them to get frames 2, 3, and 4?

@stephenlombardi
Contributor

I'm a little confused by your question, in your example if we interpolate the encodings of frame 1 and frame 5 we won't exactly reproduce the frames between them. This is particularly true if we interpolate distant frames in the sequence, which is the case for Fig. 8 in the neural volumes paper.

@zawlin
Author

zawlin commented Mar 26, 2020

Sorry for being unclear. I was trying to do a "slow-mo" effect: by subsampling the training frames and then rendering frames that fall in between (but are not in the training set), I am trying to see if the neural volume encoding produces anything reasonable in terms of time.

But I did a few more tests and found that I can't really do the slow-mo effect on neural volume encodings. I am not sure if what I am doing is correct; can you double-check the result? I have the code to do the encoding interpolation and the results on full frames (training uses all frames) and slowed frames (rendering more frames than are in the training data), so this is just to confirm that I am doing the right thing (a simplified sketch of the interpolation is below). I think this result is sort of expected, as there's no constraint on the latent space.
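
Roughly, the interpolation part of my code boils down to this (simplified, self-contained sketch; in the real code the two codes come from the encoder, and each blended code is fed to the decoder to render a frame):

```python
import torch

z_a = torch.randn(1, 256)                    # stand-in for the encoder output of frame A
z_b = torch.randn(1, 256)                    # stand-in for the encoder output of frame B
for alpha in torch.linspace(0., 1., steps=8):
    z = (1. - alpha) * z_a + alpha * z_b     # linear blend in latent space
    # z is then passed to the decoder/raymarcher in place of a real frame's code
```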

@stephenlombardi
Contributor

I took a look at the result and I think what you're seeing is expected. It's partly a limitation of this model which uses an inverse warp to model motion rather than a forward warp, which makes some motion interpolation difficult. It is also somewhat dependent on the data. I've noticed that if I train a very long sequence it does a much better job interpolating the latent space than a short sequence.

@zawlin
Author

zawlin commented Mar 27, 2020

How long is long? I can try to capture longer sequences and check.

@stephenlombardi
Contributor

We've captured ~7500 frames of facial data and found it works pretty well with that; the data is very redundant, though, which I think helps. I think this model has a harder time with bodies, since they have more complex motion.

@gmzang

gmzang commented Apr 3, 2020

Hi,
Can you please also share the KRT file for the lincoln data? I am still confused about how to set it up correctly for my own data. Any hint or reference for the KRT format is appreciated. Thanks.

@stephenlombardi
Contributor

KRT.txt
The KRT file is a series of camera specifications; each camera is specified in the following way:
[camera name]
K00 K01 K02
K10 K11 K12
K20 K21 K22
D0 D1 D2 D3 D4
R00 R01 R02 T0
R10 R11 R12 T1
R20 R21 R22 T2
[blank line]

where K is the intrinsic matrix, D are the distortion coefficients, R is the rotation matrix, T is the translation. However, you don't need to write a KRT file at all, you can simply write a new dataset class by making a copy of dryice1.py and loading the camera data however you like.
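
If you do use a KRT file, a minimal parser for the block format above could look like this (a sketch, assuming whitespace-separated values and one blank line between cameras; it is not the repo's exact loader):

```python
import numpy as np

def load_krt(path):
    """Parse a KRT file laid out as described above, one camera per block."""
    cameras = {}
    with open(path, "r") as f:
        while True:
            name = f.readline().strip()
            if not name:                     # EOF (or stray blank line) ends parsing
                break
            K = np.array([f.readline().split() for _ in range(3)], dtype=np.float64)
            dist = np.array(f.readline().split(), dtype=np.float64)
            Rt = np.array([f.readline().split() for _ in range(3)], dtype=np.float64)
            f.readline()                     # consume the blank separator line
            cameras[name] = {"K": K, "dist": dist, "R": Rt[:, :3], "t": Rt[:, 3]}
    return cameras
```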

@visonpon

@stephenlombardi Thanks for sharing this wonderful work. After reading the above discussion, I still have some problems with how to train on my own datasets.
So the first step is to get KRT.txt and pose.txt for my own datasets. KRT.txt contains the intrinsic and extrinsic matrices, which I can get with tools like COLMAP, but how do I get pose.txt?

@zhanglonghao1992

@visonpon
Have you figured it out?

@stephenlombardi
Contributor

This comment explains pose.txt: #1 (comment)
