Error on testing alpa installation #582
Comments
@johncedc Please follow the issue template and provide a detailed traceback and screenshot.
@zhisbug Please find the detailed traceback below. Environment used: !python3 tests/test_install.py
@johncedc Could you run the cupy test files under this folder and see whether they run successfully? Otherwise, you may want to follow #496 and #570 (see the discussion here: #570 (comment)) and upgrade your NVIDIA driver. We'll add a section to the installation doc describing this issue, which seems to be common to many users.
@johncedc Could you also share your cluster spec: how many nodes do you have, and how many GPUs are on each node?
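For anyone gathering the per-node GPU count asked for above, here is a minimal sketch that counts the devices listed by `nvidia-smi -L`. It assumes the NVIDIA driver tools are on the PATH; the helper names `count_gpus` and `local_gpu_listing` are made up for illustration and are not part of alpa or ray. It would need to be run once on each node.

```python
import shutil
import subprocess


def count_gpus(listing: str) -> int:
    """Count devices in `nvidia-smi -L` output (one 'GPU N: ...' line per GPU)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def local_gpu_listing() -> str:
    """Return the output of `nvidia-smi -L`, or '' if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout if result.returncode == 0 else ""


if __name__ == "__main__":
    print(f"GPUs on this node: {count_gpus(local_gpu_listing())}")
```

A count of 0 only means that `nvidia-smi` was missing or failed on that node, which is itself worth reporting along with the driver version.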
Hi @johncedc, I wonder whether you have fixed your issue. I was using the following script to install alpa:
conda create -n alpa python==3.9
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install ray[default] torch torchvision tensorboardX boto3 matplotlib clu einops
ray install-nightly
pip install boto3
sudo rm -rf /var/lib/apt/lists/lock
sudo rm -rf /var/cache/apt/archives/lock
sudo rm -rf /var/lib/dpkg/lock*
sudo dpkg --configure -a
sudo apt update
sudo apt install coinor-cbc -y
sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6 -y
sudo rm -rf /usr/local/cuda
sudo ln -s /usr/local/cuda-11.3 /usr/local/cuda
pip install pip --upgrade
pip install ipdb
pip install cupy-cuda113
python -m cupyx.tools.install_library --cuda 11.3 --library nccl
python -c "from cupy.cuda import nccl"
pip install jax==0.3.5
pip install flax==0.4.1
pip install https://github.com/alpa-projects/alpa/releases/download/v0.1.4/jaxlib-0.3.5%2Bcuda113.cudnn820-cp39-none-manylinux2010_x86_64.whl
pip install alpa
pip uninstall tensorflow -y
pip install tensorflow --upgrade
git clone https://github.com/alpa-projects/alpa
More info: I am using the AWS AMI "Deep Learning AMI (Ubuntu 18.04) Version 63.0" (ami-0bd0060cbc407b47e).
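After a long install script like the one above, it can help to confirm which package versions actually ended up in the environment, since later `pip install` steps may silently replace earlier pins. The sketch below is only an illustration: the `EXPECTED` table is copied from the pins in that script (jax 0.3.5, flax 0.4.1, the jaxlib 0.3.5 wheel), and the helper names `installed_version` and `report` are hypothetical. It uses only the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the install script; None means the script does not pin a version.
EXPECTED = {
    "jax": "0.3.5",
    "jaxlib": "0.3.5",
    "flax": "0.4.1",
    "cupy-cuda113": None,
    "alpa": None,
}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(expected):
    """Build one status line per distribution: ok, MISSING, or MISMATCH."""
    lines = []
    for name, wanted in expected.items():
        found = installed_version(name)
        if found is None:
            status = "MISSING"
        elif wanted is None or found == wanted or found.startswith(wanted + "+"):
            # A local suffix such as '+cuda113.cudnn820' still counts as a match.
            status = f"ok ({found})"
        else:
            status = f"MISMATCH (found {found}, script pins {wanted})"
        lines.append(f"{name}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report(EXPECTED)))
```

A MISMATCH on jax or jaxlib would suggest that a later step pulled in a different version, which is worth ruling out before looking at the driver or NCCL.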
Closed due to inactivity.
Hi all, I am trying to set up the OPT-175 model using alpa and ray. I followed all the instructions at https://alpa-projects.github.io/install.html#prerequisites and was able to convert the weights as well. However, I get the following error when loading the model: KeyError: Actor(MeshHostWorker, 926198553ac61b167bbd508003000000). When I run the "python3 tests/test_install.py" script to test the alpa installation, I get the same error. Can someone please help?