
About the evaluation time spent #15

Closed
RupertLuo opened this issue Mar 9, 2022 · 3 comments

Comments

@RupertLuo

Could you please let me know how much time the evaluation took on your end? It took me about two days to evaluate with 4 processes, and I found that a large part of the time was spent in the state initialization of each EDH instance, as well as in runs that reach max_api_fails or max_traj_steps. The time for the agent to take a step is also very long and is heavily dependent on the CPU clock speed. Could you share the specifications of your experimental hardware? And is there any other way to evaluate the trained model?

@aishwaryap
Contributor

Evaluation for these models is slow. It currently takes us ~16 hrs to finish the unseen validation / test sets using 7 GPUs on an AWS EC2 p3.16xlarge instance.
You are correct that the time for the agent to take a step is quite long. We believe this is at least partly a limitation of the AI2-THOR simulator, although we are aware that the code on top of that could be made more efficient. We unfortunately do not have time at the moment to dedicate purely to speeding that up.
Both the time taken to initialize EDH instances and the time to reach max_api_fails / max_traj_steps are variable, as they depend on the number of actions that need to be taken to reach that stage. To initialize an EDH instance we currently start at the initial state of the gameplay session and replay the actions in the history. This is done because we could not otherwise reliably initialize some objects, such as slices of sliceable objects, in the right positions.
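Roughly, that initialization flow looks like the sketch below. This is only an illustration, not the actual TEACh code; the `simulator` interface and the EDH field names are simplified stand-ins.

```python
# Illustrative sketch of replay-based EDH initialization (not the actual TEACh
# code). The `simulator` interface and EDH field names are simplified stand-ins.
def initialize_edh_state(simulator, edh_instance):
    # Reset the simulator to the initial state of the gameplay session.
    simulator.reset(edh_instance["initial_scene"])

    # Replay every action from the session history. This is what makes
    # initialization slow, but it is the only reliable way to recreate derived
    # objects (e.g. slices of sliceable objects) in the right positions.
    for action in edh_instance["history_actions"]:
        simulator.step(action)

    # Only after the full replay does the agent start predicting actions
    # for the current EDH instance.
```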

There isn't any other good way to evaluate a model. You could imagine some sort of comparison of ground truth and predicted action sequences, but this is likely to be unreliable for a few reasons:

  • Without executing the action sequence, there isn't a good way to know if the agent is in the right place to take a manipulation action. For example, one could imagine searching a predicted action sequence for the slicing of an apple if the task is to make a plate of apple slices, but it is possible that the agent is predicting this action when the apple is at the other end of the room, in which case it would fail.
  • There are usually multiple sequences of navigation actions that can be used to reach the same location.
  • There are often multiple ways to accomplish a task. For example, if the task is to cook a potato slice, this could be done by heating it in a bowl in the microwave or in a pan on the stove.
  • The ground truth sequence may contain actions not essential for completing the task. For example, an annotator may have opened a lot of cabinets to find a plate. An agent opening them in a different order may find the plate earlier, in which case the remaining cabinets need not be opened.
  • All of the above problems get a lot worse if you are trying to do the TATC task, since the information provided at each step also changes: an agent may be directed to a totally different plate, may get more detailed information about the location of an object, etc.

Overall, we are aware that evaluation on this dataset is resource intensive. However, I am closing this issue as I do not think we are in a position to commit to improving this aspect of it at the moment.

@RupertLuo
Author

I looked at the code and found two places where the evaluation could be sped up:

  1. The existing code fixes the EDH task list assigned to each process, so a process that terminates prematurely sits idle. Switching to a multi-process queue for distributing instances would prevent compute resources from idling (see the sketch after this list).
  2. During the evaluation, most of the time is spent replaying each EDH instance. I think storing the initial state of each EDH instance in advance would save the replay time.
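
Here is a rough sketch of what I mean in suggestion 1. It is not the actual evaluation code; `evaluate_instance` and the instance IDs are placeholders.

```python
# Rough sketch of suggestion 1: workers pull EDH instance IDs from a shared
# queue instead of a fixed per-process list, so a worker that finishes early
# keeps consuming new instances. `evaluate_instance` is a placeholder for
# evaluating a single EDH instance.
import multiprocessing as mp


def evaluate_instance(instance_id):
    # Placeholder: load the EDH instance, replay the history, run the agent.
    return {"instance_id": instance_id, "success": False}


def worker(task_queue, result_queue):
    while True:
        instance_id = task_queue.get()
        if instance_id is None:  # sentinel: no more work left
            break
        result_queue.put(evaluate_instance(instance_id))


if __name__ == "__main__":
    instance_ids = ["edh_0001", "edh_0002", "edh_0003"]  # stand-in for the real EDH list
    num_workers = 4

    task_queue, result_queue = mp.Queue(), mp.Queue()
    for instance_id in instance_ids:
        task_queue.put(instance_id)
    for _ in range(num_workers):
        task_queue.put(None)  # one sentinel per worker

    procs = [mp.Process(target=worker, args=(task_queue, result_queue)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    results = [result_queue.get() for _ in instance_ids]
    for p in procs:
        p.join()
```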

I also have a question about the paper: each game is divided into multiple EDH instances, so could this cause data leakage during training? The action sequence of a longer EDH instance contains the action sequences of shorter EDH instances from the same game, so if I train on the longer sequence first, the model will already have seen the answer for the shorter EDH action sequences.

@RupertLuo
Author

Sorry, please ignore the second acceleration suggestion
