Visualization and environments info #12
Comments
Hi @phj128,
What type of 3rd person view are you looking for? We have some (not officially supported) visualization code that generates a 2d top down map view, see here. @mattdeitke likely has some thoughts on this as well.
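For debugging-only third-person views, one option is AI2-THOR's `AddThirdPartyCamera` action. The helper below is a hypothetical sketch (not from the rearrangement codebase) that just assembles the `controller.step` arguments, so the camera placement is explicit; the position/rotation values are illustrative:

```python
def third_party_camera_action(position, rotation, field_of_view=90):
    """Build kwargs for AI2-THOR's AddThirdPartyCamera action.

    `position` and `rotation` are dicts with x/y/z keys (AI2-THOR's
    convention); values here are illustrative only.
    """
    return dict(
        action="AddThirdPartyCamera",
        position=position,
        rotation=rotation,
        fieldOfView=field_of_view,
    )

# In practice you would pass this to a running ai2thor Controller, e.g.:
#   event = controller.step(**action)
#   frame = event.third_party_camera_frames[0]
action = third_party_camera_action(
    position=dict(x=0.0, y=2.5, z=0.0),  # above the scene
    rotation=dict(x=90, y=0, z=0),       # looking straight down
)
```

Remember that any such view is for visualization/debugging only, never an inference-time input.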
Given your wording it sounds like you are aware, but just to reiterate: while you're more than welcome to use ground truth pose information during training (e.g. for computing losses), you should double check that you aren't using any such information during inference (i.e. when computing your trajectories for submission). With that out of the way: there are a couple of different ways to collect this information. For concreteness, let's say you're following the
Let me know if that helps or if you'd like any more information.
Hi @Lucaweihs ,
On a similar note, is it possible to fetch some kind of ground truth layout/semantic map of the scene for navigation purposes?
For inference, we've explicitly disallowed instance segmentation and semantic segmentation (both of which are supported in AI2-THOR) since that would make the task especially easy. For navigation purposes (at inference time), we also do not allow top-down maps or third-person views of the scene, since that would change the task's intentions quite a bit. But, for visualization and debugging purposes, accessing top-down or third-person views is often nice. AI2-THOR's demo uses ToggleMapView, and even more views are possible by adding cameras via AI2-THOR. A screenshot of the top-down view from the demo page is shown below:
As @mattdeitke mentioned, you should not use these for navigation at inference time. That said, you can see here for an implementation of a heuristic expert (used to provide supervision during training). If you are interested in generating ground truth semantic maps for training (e.g. to pretrain a semantic map model), we do have code for this, but it's still in the "research" state (i.e. not clean enough for external distribution). If this semantic mapping use case is of immediate interest to you, please let me know and I can attempt to prioritize this a bit more.
Thanks for your kind reply. It is of great help.
I have another question about AgentPose. In the paper, AgentPose is combined with RGB inputs to build a metric map and generate an image. What is the meaning of "metric map"? Does the image correspond to unshuffled_img in the codebase? Can I directly use AgentPose as an input? Can we know the start position in global coordinates?
Just a note: the metric map is only relevant for the 2-phase version of the task (in the 1-phase version the agent moves in both the walkthrough and unshuffle stages simultaneously). For the 2-phase task we've implemented the "metric map" in an implicit way using the
Thus
You can use the "relative" agent pose but not the 3D world coordinates (i.e. your agent shouldn't know that it's at world position (3, 2) facing east, but it can be told that it is 1 meter to the right of and 2 meters ahead of where it started and has rotated 90 degrees counterclockwise). Importantly:
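To make the relative-pose convention above concrete, here is a minimal sketch (a hypothetical helper, not code from the rearrangement codebase; it assumes a standard counterclockwise angle convention, whereas THOR's clockwise y-rotation would flip a sign) that expresses a world-frame pose relative to the start pose:

```python
import math

def relative_pose(start, current):
    """Express `current` in the coordinate frame of `start`.

    Poses are (x, z, theta_deg), with theta measured counterclockwise.
    Illustrative sketch only, not from the rearrangement repo.
    """
    dx = current[0] - start[0]
    dz = current[1] - start[1]
    t = math.radians(start[2])
    # Rotate the world-frame displacement by -theta_start into the start frame.
    rel_x = math.cos(t) * dx + math.sin(t) * dz
    rel_z = -math.sin(t) * dx + math.cos(t) * dz
    rel_theta = (current[2] - start[2]) % 360.0
    return rel_x, rel_z, rel_theta

# Agent started at world (3, 2) facing theta=0 and is now at (4, 4), rotated 90°:
print(relative_pose((3.0, 2.0, 0.0), (4.0, 4.0, 90.0)))  # → (1.0, 2.0, 90.0)
```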
There is one special case when you can use world coordinates: sometimes implementing things in relative coordinates is super annoying even though it is technically possible. We are fine with you using global coordinates to compute quantities so long as those same quantities can be computed, in principle, using relative coordinates instead.

As an example: let's say I just wanted to know the L2 distance between the agent's start position and end position. From THOR I can easily acquire these as world coordinates (p_s = start position in world coords, p_e = end position in world coords). If I'm really strict about only using relative coordinates, then to compute the L2 distance I would have to (ignoring rotation for simplicity) first compute relative coordinates r_s, r_e as r_s = p_s - p_s = 0 and r_e = p_e - p_s, and only then compute the L2 distance ||r_e - r_s||_2. But computing the relative coordinates in this case is just a waste of time, since ||p_e - p_s||_2 == ||r_e - r_s||_2, so we are fine with you not doing so. Let me know if that clarifies things.
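The equality in that example is easy to check numerically; with made-up world coordinates:

```python
import math

# Illustrative world coordinates (not from any actual episode).
p_s = (3.0, 2.0)  # start position, world frame
p_e = (6.0, 6.0)  # end position, world frame

# Relative coordinates: r_s = p_s - p_s = (0, 0), r_e = p_e - p_s.
r_e = (p_e[0] - p_s[0], p_e[1] - p_s[1])

d_world = math.dist(p_s, p_e)        # ||p_e - p_s||_2
d_rel = math.dist((0.0, 0.0), r_e)   # ||r_e - r_s||_2
print(d_world, d_rel)  # → 5.0 5.0 (translation preserves L2 distance)
```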
Hi @Lucaweihs, I think it would be very useful to have ground truth semantic maps for training, considering the task is quite perception-intensive. If you can share the code for generating birds-eye-view semantic maps, even in the "research" state, it would be much appreciated. Thanks!
@phj128 @nnsriram97 - I have created a PR to AllenAct with the code used to create semantic maps (it includes an example of how this can be run with
One final thought: if you intend to use these sensors, it is really helpful to assign them to a GPU via the `device` argument:

```python
SemanticMapTHORSensor(
    ...
    device=torch.device(SOME_GPU_IND)
)
```

In my experiments this has made training a full order of magnitude faster than when these sensors must do everything on the CPU (processing point clouds is expensive!).
Thank you very much. This is really helpful. |
I'm going to close this for now, please feel free to reopen (or create a new issue) with any follow ups. |
@phj128 / @nnsriram97 , in case you are interested: I've released the active neural SLAM implementation and related experiments, see this issue. |
Thanks for your awesome work.
I am really interested in this work, and I am wondering whether you provide a visualization of the agent from a third-person perspective?
I am trying to gather some data to train my pretrained model. Could you provide an example or instructions for gathering ground truth about objects during interaction with the environment, like the class and position (6D pose in global or relative coordinates) of each object?