Error on testing alpa installation #582
Closed · johncedc opened this issue Jul 3, 2022 · 7 comments


johncedc commented Jul 3, 2022

Hi all, I am trying to set up the OPT-175B model using alpa and ray. I followed all the instructions at https://alpa-projects.github.io/install.html#prerequisites and was able to convert the weights. However, when loading the model I get the following error: KeyError: Actor(MeshHostWorker, 926198553ac61b167bbd508003000000). I get the same error when I run the "python3 tests/test_install.py" script to test the alpa installation. Can someone please help?

zhisbug (Member) commented Jul 3, 2022

@johncedc Please follow the issue template and provide a detailed traceback and screenshots.

johncedc (Author) commented Jul 4, 2022

> @johncedc Please follow the issue template and provide a detailed traceback and screenshots.

@zhisbug Please find the detailed traceback below.

Environment used:
GPU: A100 40 GB
CUDA: 11.3
cuDNN: 8.2
Driver version: 470.57.02
Python: 3.8
alpa installed from the Python wheel, following the instructions at https://alpa-projects.github.io/install.html

!python3 tests/test_install.py
.E

ERROR: test_2_pipeline_parallel (__main__.InstallationTest)

Traceback (most recent call last):
File "tests/test_install.py", line 122, in
runner.run(suite())
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/runner.py", line 176, in run
test(result)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/suite.py", line 84, in call
return self.run(*args, **kwds)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/suite.py", line 122, in run
test(result)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/case.py", line 736, in call
return self.run(*args, **kwds)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/case.py", line 676, in run
self._callTestMethod(testMethod)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
method()
File "tests/test_install.py", line 107, in test_2_pipeline_parallel
actual_state = p_train_step(state, batch)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 104, in call
self._decode_args_and_get_executable(*args))
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 174, in _decode_args_and_get_executable
executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 201, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/parallel_method.py", line 219, in compile_executable
return compile_pipeshard_executable(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 76, in compile_pipeshard_executable
pipeshard_config = compile_pipeshard_executable_internal(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 217, in compile_pipeshard_executable_internal
pipeshard_config = PipelineInstEmitter(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 387, in compile
self._compile_resharding_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 326, in _compile_resharding_tasks
var] = SymbolicReshardingTask(spec, cg, src_mesh,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 370, in init
self._compile()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 444, in _compile
self._compile_send_recv_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 544, in _compile_send_recv_tasks
self._sender_tasks[sender_worker].append(
jax._src.traceback_util.UnfilteredStackTrace: KeyError: Actor(MeshHostWorker, cb7f1d88b0c8c715fa3c5ab801000000)

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "tests/test_install.py", line 107, in test_2_pipeline_parallel
actual_state = p_train_step(state, batch)
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 104, in call
self._decode_args_and_get_executable(*args))
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 174, in _decode_args_and_get_executable
executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/api.py", line 201, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/parallel_method.py", line 219, in compile_executable
return compile_pipeshard_executable(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 76, in compile_pipeshard_executable
pipeshard_config = compile_pipeshard_executable_internal(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 217, in compile_pipeshard_executable_internal
pipeshard_config = PipelineInstEmitter(
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 387, in compile
self._compile_resharding_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/runtime_emitter.py", line 326, in _compile_resharding_tasks
var] = SymbolicReshardingTask(spec, cg, src_mesh,
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 370, in init
self._compile()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 444, in _compile
self._compile_send_recv_tasks()
File "/home/user1/anaconda3/envs/env1/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 544, in _compile_send_recv_tasks
self._sender_tasks[sender_worker].append(
KeyError: Actor(MeshHostWorker, cb7f1d88b0c8c715fa3c5ab801000000)


Ran 2 tests in 8.687s

FAILED (errors=1)

A similar error occurs when loading the model:

[screenshots of the same KeyError: Actor(MeshHostWorker, ...) raised while loading the model]

zhisbug (Member) commented Jul 5, 2022

@johncedc Could you run the cupy test files under this folder and see if they run successfully? Otherwise, you may want to follow #496 and #570 (see the discussion here: #570 (comment)) to upgrade your NVIDIA driver.
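
For reference, a minimal cupy/NCCL sanity check along these lines (just a sketch; it assumes cupy-cuda113 and its NCCL library were installed as described in the installation guide):

import cupy
from cupy.cuda import nccl  # this import fails if the NCCL bindings are missing

print("cupy version:", cupy.__version__)
print("NCCL version:", nccl.get_version())

# Touch every visible GPU with a small allocation to confirm the driver and runtime work.
for dev in range(cupy.cuda.runtime.getDeviceCount()):
    with cupy.cuda.Device(dev):
        x = cupy.arange(10)
        print(f"GPU {dev}: sum =", int(x.sum()))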

We'll add a section to the installation doc describing this issue (it seems common to many users).

zhisbug (Member) commented Jul 6, 2022

@johncedc Could you also share your cluster spec: how many nodes do you have, and how many GPUs are on each node?
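
One way to pull that information from a running Ray cluster (a sketch, assuming Ray has already been started on the cluster):

import ray

ray.init(address="auto")           # connect to the existing Ray cluster
print(ray.nodes())                 # one entry per node, including its resources
print(ray.cluster_resources())     # aggregate CPU/GPU counts across the cluster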

JiahaoYao (Contributor) commented

Hi @johncedc, I wonder whether you have fixed your issue. I was using the following script to install alpa:

conda create -n alpa python==3.9 
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install ray[default] torch torchvision tensorboardX boto3 matplotlib clu einops
ray install-nightly
pip install boto3
sudo rm -rf /var/lib/apt/lists/lock
sudo rm -rf /var/cache/apt/archives/lock
sudo rm -rf /var/lib/dpkg/lock*
sudo dpkg --configure -a
sudo apt update
sudo apt install coinor-cbc -y
sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6 -y

sudo rm -rf /usr/local/cuda
sudo ln -s /usr/local/cuda-11.3 /usr/local/cuda
pip install pip --upgrade
pip install ipdb
pip install cupy-cuda113
python -m cupyx.tools.install_library --cuda 11.3 --library nccl
python -c "from cupy.cuda import nccl"
pip install jax==0.3.5
pip install flax==0.4.1
pip install https://github.com/alpa-projects/alpa/releases/download/v0.1.4/jaxlib-0.3.5%2Bcuda113.cudnn820-cp39-none-manylinux2010_x86_64.whl
pip install alpa
pip uninstall tensorflow -y 
pip install tensorflow --upgrade
git clone https://github.com/alpa-projects/alpa
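
After running the script above, a quick way to check that the installed jaxlib build actually sees the GPUs before running tests/test_install.py (a minimal sketch, not alpa-specific):

import jax
import jax.numpy as jnp

print(jax.devices())             # expect GPU devices here, not just CPU
x = jnp.ones((1024, 1024))
print(float((x @ x).sum()))      # small matmul on the default device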

JiahaoYao (Contributor) commented

More info: I was using the AWS AMI

Deep Learning AMI (Ubuntu 18.04) Version 63.0 (ami-0bd0060cbc407b47e)

Description: MXNet-1.9, TensorFlow-2.7, PyTorch-1.11, Neuron, & others. NVIDIA CUDA, cuDNN, NCCL, Intel MKL-DNN, Docker, NVIDIA-Docker & EFA support. For a fully managed experience, check: https://aws.amazon.com/sagemaker
Status: available
Platform: Ubuntu
Image Size: 140 GB
Visibility: Public
Owner: 898082745236

merrymercy (Member) commented

Closed due to inactivity.
