
Visualization and environments info #12

Closed
phj128 opened this issue Apr 13, 2021 · 13 comments

Comments

@phj128

phj128 commented Apr 13, 2021

Thanks for your awesome work.

I am really interested in this work and I am wondering whether you provide a visualization of the agent from a third-person perspective?

I am trying to gather some data to train my pretrained model. Could you provide an example or instructions for gathering ground truth about objects during interaction with the environment, such as an object's class and position (6D pose, in global or relative coordinates)?

@Lucaweihs
Contributor

Hi @phj128,

I am really interested in this work and I am wondering whether you provide a visualization of the agent from a third-person perspective?

What type of third-person view are you looking for? We have some (not officially supported) visualization code that generates a 2D top-down map view; see here. @mattdeitke likely has some thoughts on this as well.

I am trying to gather some data to train my pretrained model. Could you provide an example or instructions for gathering ground truth about objects during interaction with the environment, such as an object's class and position (6D pose, in global or relative coordinates)?

Given your wording it sounds like you are aware, but just to reiterate: while you're more than welcome to use ground truth pose information during training (e.g. for computing losses), you should double-check that you aren't using any such information during inference (i.e. when computing your trajectories for submission).

With that out of the way: there are a couple of different ways to collect this information. For concreteness, let's say you've followed the example.py script to line 102, so you have an object of type UnshuffleTask in a variable called unshuffle_task.

  1. To get all object pose information you can call unshuffle_task.unshuffle_env.poses which returns a tuple containing (list of object poses at start of unshuffle stage, list of object poses at start of walkthrough stage, list of current poses of all objects). These poses are dictionaries containing the object's global 3d position, rotation (in euler angles), bounding boxes, etc.
  2. If you'd prefer to have even more control, at the end of the day everything is being executed in an AI2-THOR controller instance and so you can (fairly easily) have access to all of the AI2-THOR metadata. Continuing the above example, you can access the AI2-THOR controller instance with unshuffle_task.unshuffle_env.controller.
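For concreteness, here is a minimal sketch combining the two options above (this assumes an unshuffle_task variable obtained by following example.py as described; the exact keys of the pose dictionaries are paraphrased from the description above and may differ slightly):

# Option 1: three parallel lists of object pose dictionaries.
unshuffle_start_poses, walkthrough_start_poses, current_poses = (
    unshuffle_task.unshuffle_env.poses
)
for pose in current_poses:
    # Each pose dict contains (among other things) the object's type,
    # global 3D position, and rotation in euler angles.
    print(pose["type"], pose["position"], pose["rotation"])

# Option 2: drop down to the underlying AI2-THOR controller for full metadata.
controller = unshuffle_task.unshuffle_env.controller
for obj in controller.last_event.metadata["objects"]:
    print(obj["objectType"], obj["position"], obj["rotation"])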

Let me know if that helps or if you'd like any more information.

@nnsriram97

Hi @Lucaweihs ,

I am really interested in this work and I am wondering whether you provide a visualization of the agent from a third-person perspective?

On a similar note, is it possible to fetch some kind of ground truth layout/semantic map of the scene for navigation purposes?

@mattdeitke
Member

ground truth layout/semantic map of the scene for navigation purposes

For inference, we've explicitly disallowed instance segmentation and semantic segmentation (both of which are supported in AI2-THOR) since that would make the task especially easy.

For navigation purposes (at inference time), we also do not allow top-down maps or third-person views of the scene, since that would change the task's intentions quite a bit. But, for visualization and debugging purposes, accessing top-down or third-person views is often nice. AI2-THOR's demo uses ToggleMapView. Even more views are possible by adding cameras via AI2-THOR.
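For debugging and visualization only (i.e. not at inference time), a minimal sketch of both approaches with a plain AI2-THOR Controller looks roughly like this (the scene name and camera pose values below are arbitrary placeholders):

from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")

# Orthographic top-down view of the whole scene (as in the AI2-THOR demo).
event = controller.step(action="ToggleMapView")
top_down_frame = event.frame

# Alternatively, add an extra third-person camera; its frames appear in
# event.third_party_camera_frames.
event = controller.step(
    action="AddThirdPartyCamera",
    position=dict(x=-1.25, y=2.0, z=-1.0),
    rotation=dict(x=45, y=0, z=0),
    fieldOfView=90,
)
third_person_frame = event.third_party_camera_frames[0]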

A screenshot of the top-down view from the demo page is shown below:

[Screenshot: top-down map view from the AI2-THOR demo]

@Lucaweihs
Contributor

On a similar note, is it possible to fetch some kind of ground truth layout/semantic map of the scene for navigation purposes?

As @mattdeitke said, you should not use these for navigation at inference time. That said, you can see here for an implementation of a heuristic expert (used to provide supervision during training). If you are interested in generating ground truth semantic maps for training (e.g. to pretrain a semantic map model), we do have code for this but it's still in the "research" state (i.e. not clean enough for external distribution). If this semantic mapping use case is of immediate interest to you, please let me know and I can attempt to prioritize this a bit more.

@phj128
Author

phj128 commented Apr 14, 2021


Thanks for your kind reply. It is of great help.

@phj128
Author

phj128 commented Apr 14, 2021

I have another question about AgentPose. In the paper, the AgentPose is combined with RGB inputs to get a metric map and generate an image. What is the meaning of a metric map? Does the image correspond to the unshuffled_img in the codebase? Can I directly use the AgentPose as a kind of input? Can we know the start position in global coordinates?

@Lucaweihs
Contributor

In the paper, the AgentPose is combined with RGB inputs to get a metric map and generate an image. What is the meaning of a metric map? Does the image correspond to the unshuffled_img in the codebase?

Just a note: the metric map is only relevant for the 2-phase version of the task (in the 1-phase version the agent moves in both the walkthrough and unshuffle stages simultaneously).

For the 2-phase task we've implemented the "metric map" in an implicit way using the ClosestUnshuffledRGBRearrangeSensor sensor ("metric" just means that each image is associated with position metadata; this is in contrast to semiparametric, graph-based maps). Basically, rather than storing all of the RGB images directly (which would use a lot of RAM or GPU memory depending on how things were implemented), we instead save all of the poses that the agent takes during the walkthrough stage and then, during the unshuffle stage, we:

  1. Compute which saved pose is closest to the agent's current pose
  2. Teleport the agent to that saved pose in an independently instantiated AI2-THOR environment which is set up as it was during the walkthrough stage.
  3. Grab the RGB image from that independent AI2-THOR environment and return it as the "unshuffled_rgb" image.

Thus "unshuffled_rgb" is equal to the RGB image seen by the agent during the walkthrough stage.

Can I directly use the AgentPose as a kind of input? Can we know the start position in global coordinates?

You can use the "relative" agent pose but not the 3d world coordinates (i.e. your agent shouldn't know that it's at world position (3, 2) facing east but it can be told that it is 1 meter to the right and 2 meters ahead of where it started and has rotated 90 degrees counterclockwise). Importantly:

  • You may use the fact that the agent starts in exactly the same position in the walkthrough and unshuffle stages.
  • You may use the fact that the agent always starts in the standing position with a 0 degree camera horizon.

There is one special case when you can use world coordinates: sometimes implementing things in relative coordinates is super annoying even though it is technically possible. We are fine with you using global coordinates to compute quantities so long as those same quantities can be computed, in principle, using relative coordinates instead. As an example: let's say I wanted to know the L2 distance between the agent's start position and end position. From THOR I can easily acquire these as world coordinates (p_s = start position in world coords, p_e = end position in world coords). If I'm really strict about only using relative coordinates then, to compute the L2 distance, I would have to (ignoring rotation for simplicity) first compute relative coordinates r_s, r_e as r_s = p_s - p_s = 0 and r_e = p_e - p_s, and only then compute the L2 distance ||r_e - r_s||_2. But computing the relative coordinates in this case is just a waste of time since ||p_e - p_s||_2 == ||r_e - r_s||_2, so we are fine with you not doing so.
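As a tiny (hypothetical) numerical check of that equality:

import numpy as np

p_s = np.array([3.0, 2.0])  # agent start position in world coordinates
p_e = np.array([1.0, 5.0])  # agent end position in world coordinates

# Relative coordinates with respect to the start position (rotation ignored).
r_s = p_s - p_s  # always the zero vector
r_e = p_e - p_s

# Both routes give the same L2 distance, so computing it directly from world
# coordinates is fine even under the "relative coordinates only" convention.
assert np.isclose(np.linalg.norm(p_e - p_s), np.linalg.norm(r_e - r_s))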

Let me know if that clarifies things.

@nnsriram97

As @mattdeitke said, you should not use these for navigation at inference time. That said, you can see here for an implementation of a heuristic expert (used to provide supervision during training). If you are interested in generating ground truth semantic maps for training (e.g. to pretrain a semantic map model), we do have code for this but it's still in the "research" state (i.e. not clean enough for external distribution). If this semantic mapping use case is of immediate interest to you, please let me know and I can attempt to prioritize this a bit more.

Hi @Lucaweihs, I think it would be very useful to have ground truth semantic maps for training, considering the task is quite perception-intensive. If you can share the code for generating bird's-eye-view semantic maps, even in the "research" state, it would be much appreciated. Thanks!

@Lucaweihs
Contributor

@phj128 @nnsriram97 - I have created a PR to AllenAct with the code used to create semantic maps (it includes an example of how this can be run with ai2thor-rearrangement). One important note: these mapping sensors strongly assume that the environment does not change after it is initially set up. This means that they will work well for the walkthrough stage but may return nonsense during the unshuffle stage. If you plan to train a mapping model I would recommend pretraining on the walkthrough task and then using the (frozen) pretrained model for the 1- or 2-phase tasks.

@Lucaweihs
Contributor

One final thought: if you intend to use these sensors, it is really helpful if you assign them to a GPU by using the device parameter, i.e.

SemanticMapTHORSensor(
    ...
    device=torch.device(SOME_GPU_IND)
)

In my experiments this has made training a full order of magnitude faster than when these sensors must do everything on the CPU (processing pointclouds is expensive!).

@phj128
Author

phj128 commented Apr 16, 2021


Thank you very much. This is really helpful.

@Lucaweihs
Contributor

I'm going to close this for now, please feel free to reopen (or create a new issue) with any follow ups.

@Lucaweihs
Contributor

Lucaweihs commented Apr 26, 2021

@phj128 / @nnsriram97 , in case you are interested: I've released the active neural SLAM implementation and related experiments, see this issue.
