Error on testing alpa installation #582
Closed · johncedc opened this issue Jul 3, 2022 · 7 comments


johncedc commented Jul 3, 2022

Hi all, I am trying to set up the OPT-175B model using alpa and ray. I followed all the instructions at https://alpa-projects.github.io/install.html#prerequisites and was able to convert the weights. However, when loading the model I get the following error: KeyError: Actor(MeshHostWorker, 926198553ac61b167bbd508003000000). I get the same error when I run the "python3 tests/test_install.py" script to test the alpa installation. Can someone please help?

zhisbug (Member) commented Jul 3, 2022

@johncedc Please follow the issue template and provide a detailed traceback and screenshots.

johncedc (Author) commented Jul 4, 2022

> @johncedc Please follow the issue template and provide a detailed traceback and screenshots.

@zhisbug Please find the detailed traceback below.

Environment used:
GPU: A100 40 GB
CUDA: 11.3
cuDNN: 8.2
Driver version: 470.57.02
Python: 3.8
alpa installed from the Python wheel, following the instructions at https://alpa-projects.github.io/install.html

!python3 tests/test_install.py
.E

ERROR: test_2_pipeline_parallel (__main__.InstallationTest)

Traceback (most recent call last):
File "tests/test_install.py", line 122, in
runner.run(suite())
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/runner.py", line 176, in run
test(result)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/suite.py", line 84, in call
return self.run(*args, **kwds)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/suite.py", line 122, in run
test(result)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/case.py", line 736, in call
return self.run(*args, **kwds)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/case.py", line 676, in run
self._callTestMethod(testMethod)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
method()
File "tests/test_install.py", line 107, in test_2_pipeline_parallel
actual_state = p_train_step(state, batch)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 104, in call
self._decode_args_and_get_executable(*args))
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 174, in _decode_args_and_get_executable
executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 201, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/parallel_method.py", line 219, in compile_executable
return compile_pipeshard_executable(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 76, in compile_pipeshard_executable
pipeshard_config = compile_pipeshard_executable_internal(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 217, in compile_pipeshard_executable_internal
pipeshard_config = PipelineInstEmitter(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 387, in compile
self._compile_resharding_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 326, in _compile_resharding_tasks
var] = SymbolicReshardingTask(spec, cg, src_mesh,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 370, in init
self._compile()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 444, in _compile
self._compile_send_recv_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 544, in _compile_send_recv_tasks
self._sender_tasks[sender_worker].append(
jax._src.traceback_util.UnfilteredStackTrace: KeyError: Actor(MeshHostWorker, cb7f1d88b0c8c715fa3c5ab801000000)

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "tests/test_install.py", line 107, in test_2_pipeline_parallel
actual_state = p_train_step(state, batch)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 104, in call
self._decode_args_and_get_executable(*args))
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 174, in _decode_args_and_get_executable
executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 201, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/parallel_method.py", line 219, in compile_executable
return compile_pipeshard_executable(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 76, in compile_pipeshard_executable
pipeshard_config = compile_pipeshard_executable_internal(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 217, in compile_pipeshard_executable_internal
pipeshard_config = PipelineInstEmitter(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 387, in compile
self._compile_resharding_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 326, in _compile_resharding_tasks
var] = SymbolicReshardingTask(spec, cg, src_mesh,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 370, in init
self._compile()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 444, in _compile
self._compile_send_recv_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 544, in _compile_send_recv_tasks
self._sender_tasks[sender_worker].append(
KeyError: Actor(MeshHostWorker, cb7f1d88b0c8c715fa3c5ab801000000)


Ran 2 tests in 8.687s

FAILED (errors=1)

A similar error occurs when loading the model:

[screenshots of the same KeyError: Actor(MeshHostWorker, ...) raised while loading the model]

zhisbug (Member) commented Jul 5, 2022

@johncedc Could you run the cupy test files under this folder and see if they run successfully? Otherwise, you may want to follow #496 and #570 (see the discussion here: #570 (comment)) to upgrade your NVIDIA driver.
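
For reference, a minimal cupy/NCCL sanity check along these lines (just a sketch; it assumes cupy-cuda113 and its NCCL library were installed as described in the installation guide):

import cupy
from cupy.cuda import nccl  # this import fails if the NCCL bindings are missing

print("cupy version:", cupy.__version__)
print("NCCL version:", nccl.get_version())

# Touch every visible GPU with a small allocation to confirm the driver and runtime work.
for dev in range(cupy.cuda.runtime.getDeviceCount()):
    with cupy.cuda.Device(dev):
        x = cupy.arange(10)
        print(f"GPU {dev}: sum =", int(x.sum()))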

We'll add a section to the installation doc describing this issue (it seems common to many users).

zhisbug (Member) commented Jul 6, 2022

@johncedc Could you also share your cluster spec: how many nodes do you have, and how many GPUs are on each node?
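
One way to pull that information from a running Ray cluster (a sketch, assuming Ray has already been started on the cluster):

import ray

ray.init(address="auto")           # connect to the existing Ray cluster
print(ray.nodes())                 # one entry per node, including its resources
print(ray.cluster_resources())     # aggregate CPU/GPU counts across the cluster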

JiahaoYao (Contributor) commented

Hi @johncedc, I wonder whether you have fixed your issue. I was using the following script to install alpa:

conda create -n alpa python==3.9 
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install ray[default] torch torchvision tensorboardX boto3 matplotlib clu einops
ray install-nightly
pip install boto3
sudo rm -rf /var/lib/apt/lists/lock
sudo rm -rf /var/cache/apt/archives/lock
sudo rm -rf /var/lib/dpkg/lock*
sudo dpkg --configure -a
sudo apt update
sudo apt install coinor-cbc -y
sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6 -y

sudo rm -rf /usr/local/cuda
sudo ln -s /usr/local/cuda-11.3 /usr/local/cuda
pip install pip --upgrade
pip install ipdb
pip install cupy-cuda113
python -m cupyx.tools.install_library --cuda 11.3 --library nccl
python -c "from cupy.cuda import nccl"
pip install jax==0.3.5
pip install flax==0.4.1
pip install https://github.com/alpa-projects/alpa/releases/download/v0.1.4/jaxlib-0.3.5%2Bcuda113.cudnn820-cp39-none-manylinux2010_x86_64.whl
pip install alpa
pip uninstall tensorflow -y 
pip install tensorflow --upgrade
git clone https://github.com/alpa-projects/alpa
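
After running the script above, a quick way to check that the installed jaxlib build actually sees the GPUs before running tests/test_install.py (a minimal sketch, not alpa-specific):

import jax
import jax.numpy as jnp

print(jax.devices())             # expect GPU devices here, not just CPU
x = jnp.ones((1024, 1024))
print(float((x @ x).sum()))      # small matmul on the default device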

JiahaoYao (Contributor) commented

More info: I was using the AWS AMI

Deep Learning AMI (Ubuntu 18.04) Version 63.0 (ami-0bd0060cbc407b47e)

Description: MXNet-1.9, TensorFlow-2.7, PyTorch-1.11, Neuron, & others. NVIDIA CUDA, cuDNN, NCCL, Intel MKL-DNN, Docker, NVIDIA-Docker & EFA support. For a fully managed experience, check: https://aws.amazon.com/sagemaker
Status: available
Platform: Ubuntu
Image Size: 140 GB
Visibility: Public
Owner: 898082745236

merrymercy (Member) commented

Closed due to inactivity.
