
Visualization and environments info #12

Closed
phj128 opened this issue Apr 13, 2021 · 13 comments

Comments

@phj128

phj128 commented Apr 13, 2021

Thanks for your awesome work.

I am really interested in this work and I am wondering whether you provide a visualization of the agent from a third-person perspective?

I am trying to gather some data to train my pretrained model. Could you provide an example or instructions for gathering ground truth about objects during interaction with the environment, such as an object's class and position (6D pose, in global or relative coordinates)?

@Lucaweihs
Contributor

Hi @phj128,

I am really interested in this work and I am wondering whether you provide a visualization of the agent from a third-person perspective?

What type of third-person view are you looking for? We have some (not officially supported) visualization code that generates a 2D top-down map view; see here. @mattdeitke likely has some thoughts on this as well.

I am trying to gather some data to train my pretrained model. Could you provide an example or instructions for gathering ground truth about objects during interaction with the environment, such as an object's class and position (6D pose, in global or relative coordinates)?

Given your wording it sounds like you are aware, but just to reiterate: while you're more than welcome to use ground truth pose information during training (e.g. for computing losses), you should double-check that you aren't using any such information during inference (i.e. when computing your trajectories for submission).

With that out of the way: there are a couple of different ways to collect this information. For concreteness, let's say you've followed the example.py script to line 102, so you have an object of type UnshuffleTask in a variable called unshuffle_task.

  1. To get all object pose information you can call unshuffle_task.unshuffle_env.poses which returns a tuple containing (list of object poses at start of unshuffle stage, list of object poses at start of walkthrough stage, list of current poses of all objects). These poses are dictionaries containing the object's global 3d position, rotation (in euler angles), bounding boxes, etc.
  2. If you'd prefer to have even more control, at the end of the day everything is being executed in an AI2-THOR controller instance and so you can (fairly easily) have access to all of the AI2-THOR metadata. Continuing the above example, you can access the AI2-THOR controller instance with unshuffle_task.unshuffle_env.controller.
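For concreteness, here is a minimal sketch combining the two options above (this assumes an unshuffle_task variable obtained by following example.py as described; the exact keys of the pose dictionaries are paraphrased from the description above and may differ slightly):

# Option 1: three parallel lists of object pose dictionaries.
unshuffle_start_poses, walkthrough_start_poses, current_poses = (
    unshuffle_task.unshuffle_env.poses
)
for pose in current_poses:
    # Each pose dict contains (among other things) the object's type,
    # global 3D position, and rotation in euler angles.
    print(pose["type"], pose["position"], pose["rotation"])

# Option 2: drop down to the underlying AI2-THOR controller for full metadata.
controller = unshuffle_task.unshuffle_env.controller
for obj in controller.last_event.metadata["objects"]:
    print(obj["objectType"], obj["position"], obj["rotation"])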

Let me know if that helps or if you'd like any more information.

@nnsriram97

Hi @Lucaweihs ,

I am really interested in this work and I am wondering whether you provide a visualization of the agent from a third-person perspective?

On a similar note, is it possible to fetch some kind of ground truth layout/semantic map of the scene for navigation purposes?

@mattdeitke
Member

ground truth layout/semantic map of the scene for navigation purposes

For inference, we've explicitly disallowed instance segmentation and semantic segmentation (both of which are supported in AI2-THOR) since that would make the task especially easy.

For navigation purposes (at inference time), we also do not allow top-down maps or third-person views of the scene, since that would change the task's intentions quite a bit. But, for visualization and debugging purposes, accessing top-down or third-person views is often nice. AI2-THOR's demo uses ToggleMapView. Even more views are possible by adding cameras via AI2-THOR.
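For debugging and visualization only (i.e. not at inference time), a minimal sketch of both approaches with a plain AI2-THOR Controller looks roughly like this (the scene name and camera pose values below are arbitrary placeholders):

from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")

# Orthographic top-down view of the whole scene (as in the AI2-THOR demo).
event = controller.step(action="ToggleMapView")
top_down_frame = event.frame

# Alternatively, add an extra third-person camera; its frames appear in
# event.third_party_camera_frames.
event = controller.step(
    action="AddThirdPartyCamera",
    position=dict(x=-1.25, y=2.0, z=-1.0),
    rotation=dict(x=45, y=0, z=0),
    fieldOfView=90,
)
third_person_frame = event.third_party_camera_frames[0]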

A screenshot of the top-down view from the demo page is shown below:

[Screenshot: top-down map view from the AI2-THOR demo]

@Lucaweihs
Contributor

On a similar note, is it possible to fetch some kind of ground truth layout/semantic map of the scene for navigation purposes?

As @mattdeitke said, you should not use these for navigation at inference time. That said, you can see here for an implementation of a heuristic expert (used to provide supervision during training). If you are interested in generating ground truth semantic maps for training (e.g. to pretrain a semantic map model), we do have code for this but it's still in the "research" state (i.e. not clean enough for external distribution). If this semantic mapping use case is of immediate interest to you, please let me know and I can attempt to prioritize this a bit more.

@phj128
Author

phj128 commented Apr 14, 2021


Thanks for your kind reply. It is of great help.

@phj128
Author

phj128 commented Apr 14, 2021

I have another question about AgentPose. In the paper, the AgentPose is combined with RGB inputs to get a metric map and generate an image. What is the meaning of a metric map? Does the image correspond to the unshuffled_img in the codebase? Can I directly use the AgentPose as a kind of input? Can we know the start position in global coordinates?

@Lucaweihs
Contributor

In the paper, the AgentPose is combined with RGB inputs to get a metric map and generate an image. What is the meaning of a metric map? Does the image correspond to the unshuffled_img in the codebase?

Just a note: the metric map is only relevant for the 2-phase version of the task (in the 1-phase version the agent moves in both the walkthrough and unshuffle stages simultaneously).

For the 2-phase task we've implemented the "metric map" in an implicit way using the ClosestUnshuffledRGBRearrangeSensor sensor ("metric" just means that each image is associated with position metadata; this is in contrast to semiparametric, graph-based maps). Basically, rather than storing all of the RGB images directly (which would use a lot of RAM or GPU memory depending on how things were implemented), we instead save all of the poses that the agent takes during the walkthrough stage and then, during the unshuffle stage, we:

  1. Compute which saved pose is closest to the agent's current pose
  2. Teleport the agent to that saved pose in an independently instantiated AI2-THOR environment which is set up as it was during the walkthrough stage.
  3. Grab the RGB image from that independent AI2-THOR environment and return it as the "unshuffled_rgb" image.

Thus "unshuffled_rgb" is equal to the RGB image seen by the agent during the walkthrough stage.

Can I directly use the AgentPose as a kind of input? Can we know the start position in global coordinates?

You can use the "relative" agent pose but not the 3d world coordinates (i.e. your agent shouldn't know that it's at world position (3, 2) facing east but it can be told that it is 1 meter to the right and 2 meters ahead of where it started and has rotated 90 degrees counterclockwise). Importantly:

  • You may use the fact that the agent starts in exactly the same position in the walkthrough and unshuffle stages.
  • You may use the fact that the agent always starts in the standing position with a 0 degree camera horizon.

There is one special case when you can use world coordinates: sometimes implementing things in relative coordinates is super annoying even though it is technically possible. We are fine with you using global coordinates to compute quantities so long as those same quantities can be computed, in principle, using relative coordinates instead. As an example: let's say I wanted to know the L2 distance between the agent's start position and end position. From THOR I can easily acquire these as world coordinates (p_s = start position in world coords, p_e = end position in world coords). If I'm really strict about only using relative coordinates then, to compute the L2 distance, I would have to (ignoring rotation for simplicity) first compute relative coordinates r_s, r_e as r_s = p_s - p_s = 0 and r_e = p_e - p_s, and only then compute the L2 distance ||r_e - r_s||_2. But computing the relative coordinates in this case is just a waste of time since ||p_e - p_s||_2 == ||r_e - r_s||_2, so we are fine with you not doing so.
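As a tiny (hypothetical) numerical check of that equality:

import numpy as np

p_s = np.array([3.0, 2.0])  # agent start position in world coordinates
p_e = np.array([1.0, 5.0])  # agent end position in world coordinates

# Relative coordinates with respect to the start position (rotation ignored).
r_s = p_s - p_s  # always the zero vector
r_e = p_e - p_s

# Both routes give the same L2 distance, so computing it directly from world
# coordinates is fine even under the "relative coordinates only" convention.
assert np.isclose(np.linalg.norm(p_e - p_s), np.linalg.norm(r_e - r_s))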

Let me know if that clarifies things.

@nnsriram97

As @mattdeitke said, you should not use these for navigation at inference time. That said, you can see here for an implementation of a heuristic expert (used to provide supervision during training). If you are interested in generating ground truth semantic maps for training (e.g. to pretrain a semantic map model), we do have code for this but it's still in the "research" state (i.e. not clean enough for external distribution). If this semantic mapping use case is of immediate interest to you, please let me know and I can attempt to prioritize this a bit more.

Hi @Lucaweihs, I think it would be very useful to have ground truth semantic maps for training, considering the task is quite perception-intensive. If you can share the code for generating bird's-eye-view semantic maps, even in the "research" state, it would be much appreciated. Thanks!

@Lucaweihs
Contributor

@phj128 @nnsriram97 - I have created a PR to AllenAct with the code used to create semantic maps (it includes an example of how this can be run with ai2thor-rearrangement). One important note: these mapping sensors strongly assume that the environment does not change after it is initially set up. This means that they will work well for the walkthrough stage but may return nonsense during the unshuffle stage. If you plan to train a mapping model I would recommend pretraining on the walkthrough task and then using the (frozen) pretrained model for the 1- or 2-phase tasks.

@Lucaweihs
Contributor

One final thought: if you intend to use these sensors, it is really helpful if you assign them to a GPU by using the device parameter, i.e.

SemanticMapTHORSensor(
    ...
    device=torch.device(SOME_GPU_IND)
)

In my experiments this has made training a full order of magnitude faster than when these sensors must do everything on the CPU (processing pointclouds is expensive!).

@phj128
Author

phj128 commented Apr 16, 2021


Thank you very much. This is really helpful.

@Lucaweihs
Contributor

I'm going to close this for now, please feel free to reopen (or create a new issue) with any follow ups.

@Lucaweihs
Contributor

Lucaweihs commented Apr 26, 2021

@phj128 / @nnsriram97 , in case you are interested: I've released the active neural SLAM implementation and related experiments, see this issue.
