ValueError: Invalid commit id: xxxxxxx - no build exists for arch=Linux #25

Closed
Leeeshuz opened this issue Aug 5, 2021 · 10 comments

@Leeeshuz

Leeeshuz commented Aug 5, 2021

Hello, I am new to the embodied AI area. When I tried to run the baseline model training, the following errors occurred, and I really do not know what happened. Could anybody provide any clues about what may cause this? I would really appreciate it!

[08/05 14:40:07 INFO:] Starting 19-th VectorSampledTask worker with args [{'force_cache_reset': False, 'epochs': inf, 'stage': 'train', 'allowed_scenes': ['FloorPlan419', 'FloorPlan420'], 'scene_to_allowed_rearrange_inds': None, 'seed': 151437334310827783848716556864843341311, 'x_display': '0.0', 'sensors': [<rearrange.sensors.RGBRearrangeSensor object at 0x7f67ac88fdd8>, <rearrange.sensors.UnshuffledRGBRearrangeSensor object at 0x7f67ac88ff98>, <allenact.base_abstractions.sensor.ExpertActionSensor object at 0x7f67ac8a11d0>], 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7f67b5608c50>}]        [vector_sampled_tasks.py: 380]
Process ForkServerProcess-1:19:
Traceback (most recent call last):
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 282, in _task_sampling_loop_worker
    should_log=should_log,
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 821, in __init__
    sampler_fn_args=[{"mp_ctx": None, **args} for args in sampler_fn_args_list],
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 1026, in _create_generators
    if next(generators[-1]) != "started":
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/site-packages/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 881, in _task_sampling_loop_generator_fn
    task_sampler = make_sampler_fn(**sampler_fn_args)
  File "/home/lishuzhao/ai2thor-rearrangement/baseline_configs/one_phase/one_phase_rgb_base.py", line 81, in make_sampler_fn
    **kwargs,
  File "/home/lishuzhao/ai2thor-rearrangement/rearrange/tasks.py", line 877, in from_fixed_dataset
    **init_kwargs,
  File "/home/lishuzhao/ai2thor-rearrangement/rearrange/tasks.py", line 828, in __init__
    self.walkthrough_env = RearrangeTHOREnvironment(**rearrange_env_kwargs)
  File "/home/lishuzhao/ai2thor-rearrangement/rearrange/environment.py", line 245, in __init__
    self.controller = self.create_controller()
  File "/home/lishuzhao/ai2thor-rearrangement/rearrange/environment.py", line 264, in create_controller
    **self._controller_kwargs,
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/site-packages/ai2thor/controller.py", line 465, in __init__
    self._build = self.find_build(local_build, commit_id, branch)
  File "/root/anaconda3/envs/ai2thor-rearrange/lib/python3.6/site-packages/ai2thor/controller.py", line 1118, in find_build
    % (commit_id, platform.system())
ValueError: Invalid commit_id: f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe - no build exists for arch=Linux
Process ForkServerProcess-2:19: same traceback as above, ending in the same ValueError.
@Lucaweihs
Contributor

Hi @Leeeshuz,

Can you give some machine information (e.g. what version of Linux are you using?) and let me know what version of ai2thor you're using?

Can you also try running

from ai2thor.controller import Controller
c = Controller(commit_id="f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe")

and letting me know if you see the same error?

@Lucaweihs Lucaweihs self-assigned this Aug 5, 2021
@Leeeshuz
Author

Leeeshuz commented Aug 6, 2021

Hi @Lucaweihs, sorry for leaving out this information. The project was set up on Ubuntu 18.04.5 LTS with Python 3.6.12 (Anaconda). The ai2thor library (3.3.4) was installed via pip, following the installation guide of this project. The project runs on a remote server and therefore in headless mode; Xorg is running normally.

I ran the test you suggested and the same error occurs. I went through the related ai2thor code as best I could, and I suspect the cause is the platform build or something related to the commit_id. I think further testing is needed to locate the problem. Can you give some suggestions? Thank you!

@Leeeshuz
Author

Leeeshuz commented Aug 6, 2021

UPDATE!!!
I carefully checked the installation instructions again and found that the required version of ai2thor is 2.7.2, while the pip-installed version was 3.3.4. After reinstalling ai2thor==2.7.2, the commit_id problem is finally solved.

I re-checked requirements.txt and found that it specifies "ai2thor>=2.7.2", which is probably why the newer 3.3.4 version was installed. I suggest updating requirements.txt to avoid this issue.
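
To illustrate why a ">=" pin lets pip pick a newer release, here is a minimal sketch that compares plain X.Y.Z versions as integer tuples. The helper names parse_release and satisfies are hypothetical, and only the "==" and ">=" operators are handled; real pip specifiers are richer than this.

```python
def parse_release(version):
    """Turn a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, specifier):
    """Check an installed version against a '>=X.Y.Z' or '==X.Y.Z' pin."""
    for operator in (">=", "=="):
        if specifier.startswith(operator):
            required = parse_release(specifier[len(operator):])
            have = parse_release(installed)
            return have >= required if operator == ">=" else have == required
    raise ValueError("unsupported specifier: " + specifier)


# The loose pin from requirements.txt accepts the newer release...
print(satisfies("3.3.4", ">=2.7.2"))  # True
# ...while an exact pin would reject it and accept only 2.7.2.
print(satisfies("3.3.4", "==2.7.2"))  # False
print(satisfies("2.7.2", "==2.7.2"))  # True
```

Under this reading, changing the requirement to an exact pin (or adding an upper bound) would stop pip from resolving to 3.3.4.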

Moreover, I am still confused about what the 'commit_id' means. When running example.py, it prints the commit_id "f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe", while simply executing controller = ai2thor.controller.Controller() in a Python environment gives the commit_id "a6674babc132c5d63d18c82a0e14c01d236aa981". I don't know why two different commit_ids are shown. To make sure the program runs successfully, I downloaded the environment zip files for both commit_ids and found that only "a6674babc132c5d63d18c82a0e14c01d236aa981" works, both for example.py and in the plain Python environment. This really confuses me.

Finally, I can run example.py successfully, although it is sometimes interrupted by "invalid action: MakeObjectBreakable". I think this action may have been removed in v2.7.2. How can I find the defined actions and replace or modify the ones that may have been removed?

PS: The invalid commit_id problem is still not resolved in v3.3.4 and I have no idea why. I hope you can figure it out!

@Lucaweihs
Contributor

Lucaweihs commented Aug 6, 2021

Hi @Leeeshuz,

Moreover, I am still confused about what the 'commit_id' means.

Basically there are two components to AI2-THOR:

  1. The python API - this is what's installed when you run something like pip install ai2thor
  2. The Unity build executable - this is what you see downloading the first time you run Controller()

All the Python API really does is convey commands to the Unity build. The Unity build does all the processing and then returns sensor readings (e.g. RGB images) along with metadata (e.g. whether the action was successful, that the agent position is (x, y, z), etc.). The commit id is used here to specify which version of the Unity build to use. The rearrangement project uses a special Unity build (f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe) which has some additional functionality that is not available within the usual THOR builds.

There is some effort made to ensure that the python API is compatible with older unity builds.
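
As a rough illustration of the "sensor readings plus metadata" that come back from the Unity build, the sketch below reads a few fields from an event object. The helper summarize_event is a hypothetical name, and the event here is a stand-in so the code runs without ai2thor; with a real controller the event would presumably come from Controller(commit_id=...).last_event, and the metadata keys are assumed to match the installed ai2thor version.

```python
from types import SimpleNamespace


def summarize_event(event):
    """Pull out the pieces of an AI2-THOR event mentioned above."""
    return {
        # RGB frame shape, if the frame is an array-like with a shape.
        "frame_shape": getattr(event.frame, "shape", None),
        # Whether the last action was reported as successful.
        "action_succeeded": event.metadata["lastActionSuccess"],
        # Agent position as an x/y/z mapping.
        "agent_position": event.metadata["agent"]["position"],
    }


# Stand-in event so the sketch runs without ai2thor or a Unity build.
fake_event = SimpleNamespace(
    frame=None,
    metadata={
        "lastActionSuccess": True,
        "agent": {"position": {"x": 0.0, "y": 0.9, "z": 1.5}},
    },
)
print(summarize_event(fake_event))
```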

I re-checked requirements.txt and found that it specifies "ai2thor>=2.7.2", which is probably why the newer 3.3.4 version was installed. I suggest updating requirements.txt to avoid this issue.

Strangely I am struggling to reproduce the error you're seeing with ai2thor==3.3.4 (I'm using Python 3.6.12 and 18.04.2 LTS (GNU/Linux 4.15.0-45-generic x86_64)). Ironically I did find a different problem with running the example.py script on Linux machines so please do grab the latest commit of the ai2thor-rearrangement repository.

Can you try doing the below to reinstall everything and re-run the example script?

# Deactivate your conda environment (if activated)
conda deactivate

# Delete the existing environment (this is assuming you initially installed the conda environment using the environment.yml and didn't change the environment name)
conda remove --name thor-rearrange --all

# Reinstall the environment
export MY_ENV_NAME=thor-rearrange
export CONDA_BASE="$(dirname $(dirname "${CONDA_EXE}"))"
export PIP_SRC="${CONDA_BASE}/envs/${MY_ENV_NAME}/pipsrc"
conda env create --file environment.yml --name $MY_ENV_NAME

# Activate the environment
conda activate thor-rearrange

# Make sure you have installed the correct version of the cuda drivers
# replacing YOUR_CUDA_VERSION with your CUDA version.
conda install cudatoolkit=YOUR_CUDA_VERSION -c pytorch

# Move into the ai2thor-rearrangement directory
cd PATH/TO/ai2thor-rearrangement

# Export the current directory to your python path
export PYTHONPATH=$PYTHONPATH:$PWD

# Run the example
python example.py

If you're still getting the same error after doing the above can you give me the output of pip list and show me the output from nvidia-smi?

@Leeeshuz
Author

Leeeshuz commented Aug 7, 2021

Thank you very much for the explanation of the "commit_id". So does this mean the run still counts as "unsuccessful" if example.py uses the commit_id a6674babc132c5d63d18c82a0e14c01d236aa981 instead of f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe? When I set the commit_id to f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe, the controller gets stuck at 'OpenPipe', as far as I remember.

I will re-check the issues and try to reinstall the whole environment on Monday, since I cannot access the remote server on weekends. I will post an update if there is any progress. Thank you again for the clear explanation!

@Leeeshuz
Author

Leeeshuz commented Aug 9, 2021

Problem solved! It seems the environment zip file for commit_id f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe was corrupted during the download. After re-downloading the file, specifying the commit_id when creating the controller lets example.py run normally.

However, another problem occurred while running the baseline code. The program crashes during training even when I reduce num_train_processes to 2. The server has 8 Titan Xp GPUs and 48 processors with 256G of memory. The program always crashes when resetting the scene named FloorPlan3_physics.
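
For the corrupted download mentioned above, a quick integrity check on a manually downloaded archive can rule this out early. This is a minimal sketch using only the standard library; check_build_zip is a hypothetical helper name and the path you pass in is wherever you saved the zip.

```python
import zipfile


def check_build_zip(zip_path):
    """Return (ok, detail) for a downloaded Unity build archive."""
    try:
        with zipfile.ZipFile(zip_path) as archive:
            # testzip() re-reads every member and returns the first bad name.
            bad_member = archive.testzip()
    except (zipfile.BadZipFile, OSError) as error:
        # Truncated or non-zip files usually fail here, when opening.
        return False, "not a readable zip: {}".format(error)
    if bad_member is not None:
        return False, "corrupt member: {}".format(bad_member)
    return True, "archive looks intact"
```

Calling check_build_zip("/PATH/TO/build.zip") before unzipping returns a (bool, message) pair; a False result suggests downloading the file again.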

@Lucaweihs
Contributor

Lucaweihs commented Aug 9, 2021

Hi @Leeeshuz,

Happy to hear the commit issue resolved itself :)!

However, another problem occurred while running the baseline code. The program crashes during training even when I reduce num_train_processes to 2. The server has 8 Titan Xp GPUs and 48 processors with 256G of memory. The program always crashes when resetting the scene named FloorPlan3_physics.

In my experience, a training crash when resetting a scene frequently suggests that something else was the problem (often the error can be found earlier in the output). Could you perhaps save all the output to a log file and paste it here? For example, the command below saves the output of running the one_phase_rgb_resnet_dagger.py script to an error.log file:

allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py 2>&1 | tee -a error.log
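
Since the real cause is often earlier in the output, a small script can locate the first error-looking line in the saved log. This is only a sketch: first_error_line is a hypothetical helper, the marker strings are assumptions about what an error looks like in this output, and the sample lines are shortened versions of the log posted earlier in this issue.

```python
def first_error_line(log_lines):
    """Return (line_number, text) of the earliest error marker, or None."""
    markers = ("Traceback (most recent call last)", "ERROR", "Error:")
    for number, line in enumerate(log_lines, start=1):
        if any(marker in line for marker in markers):
            return number, line.rstrip("\n")
    return None


# Shortened lines in the style of the log posted earlier in this issue.
sample_log = [
    "[08/05 14:40:07 INFO:] Starting 19-th VectorSampledTask worker\n",
    "Traceback (most recent call last):\n",
    "ValueError: Invalid commit_id: f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe"
    " - no build exists for arch=Linux\n",
]
print(first_error_line(sample_log))
```

On a real run, passing an open file works the same way, e.g. with open("error.log") as f: print(first_error_line(f)).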

@Leeeshuz
Author

It seems to be running normally after several trials, and it has now run for millions of steps. I am really interested in visualizing the whole rearrangement process the agent performs. Would you mind giving some instructions or demos on how to render a video, with pretrained or user-trained models, that shows the initial scene, all actions performed by the agent, and the final visual state of the scene when the episode ends? Thanks a lot!

@Lucaweihs
Contributor

Hi @Leeeshuz,

Happy to hear it's training successfully!

Visualization is always a bit messy. If you'd like to get an idea of how to visualize things from a top down perspective, take a look at the discussion in this PR. For visualizing a saved model checkpoint from the agent's perspective, I'd recommend doing something like I've done for my tests of the rearrangement mapping code and saving the frames after every agent action (here's a function for saving a list of numpy frames as a video, note that you can grab the frame from the task by doing task.env.controller.last_event.frame).
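
The "save a frame after every agent action" idea could be sketched roughly as follows. The function record_episode and its choose_action argument are hypothetical names; the task is assumed to expose is_done() and step(action) in addition to the task.env.controller.last_event.frame attribute mentioned above. Turning the returned list into a video is left to a helper such as the one linked in the comment.

```python
def record_episode(task, choose_action, max_steps=500):
    """Step a task to completion, keeping the agent's view after each action.

    `task` is assumed to expose `is_done()`, `step(action)` and
    `env.controller.last_event.frame`; `choose_action` is any callable
    mapping the task to an action (for example a wrapper around a policy).
    """
    frames = [task.env.controller.last_event.frame]  # initial view
    for _ in range(max_steps):
        if task.is_done():
            break
        task.step(choose_action(task))
        # Grab the frame produced by the action that was just executed.
        frames.append(task.env.controller.last_event.frame)
    return frames
```

The max_steps cap is a safety limit for the sketch, so a task that never reports completion cannot loop forever.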

@Leeeshuz
Author

Thanks a lot! I have successfully generated a top-down video following your instructions, and I will next try to generate a video from the agent's perspective. I am really grateful for your kind help and patience while I set up the environment and got the project running.

I think a summary of the key issues I ran into over these days may help future readers.

(1) Commit id: it specifies which version of the Unity build to use when running the project. The rearrangement task uses the Unity build with commit id f46d5ec42b65fdae9d9a48db2b4fb6d25afbd1fe.

(2) Invalid commit id problem: if the remote server has no internet connection, first download the environment zip file for the specified commit id manually. Then point the ai2thor controller at the local build when creating it:
controller = ai2thor.controller.Controller(local_executable_path="/PATH/TO/ENVIRONMENT_FOLDER")

(3) Environment zip file: the environment zip file can get corrupted during the download, especially when the internet connection is unstable. Check the files carefully while unzipping, or strange errors may occur during training. This drove me crazy for days.
PS: Using fewer num_train_processes also helps resolve the Unity crash issue, as suggested in other issues.

(4) Visualization: it may be hard to find an official tutorial on this. However, combining the code from allenai/ai2thor#124 and https://github.com/allenai/cordial-sync/blob/master/utils/visualization_utils.py#L109 makes it easy to generate a video from a top-down perspective.
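
Point (2) above can be sketched as a small helper that prefers a local build when one exists and otherwise falls back to the commit id. The function controller_kwargs is a hypothetical name; it only builds keyword arguments, on the assumption that the installed ai2thor Controller accepts commit_id and local_executable_path.

```python
import os


def controller_kwargs(commit_id, local_build=None):
    """Choose Controller keyword arguments for online vs. offline machines."""
    if local_build and os.path.exists(local_build):
        # A manually downloaded and unzipped build is available: use it.
        return {"local_executable_path": local_build}
    # Otherwise let ai2thor download the build that matches the commit id.
    return {"commit_id": commit_id}
```

The result would then be passed on as ai2thor.controller.Controller(**controller_kwargs(...)), which keeps the offline and online cases in one place.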
