These are the scripts that I use to manage the worker nodes on my GPU cluster.
- Install Ansible:
sudo apt update
sudo apt install -y ansible
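To confirm that Ansible installed correctly, you can check the version before continuing:
ansible --version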
- Clone the repo:
git clone git@github.com:catid/ansible
cd ansible
- Create key files:
Store your servers' root password here:
echo "ansible_become_password: myrootpassword" > playbooks/sudo.yml
Store your HuggingFace auth token here:
echo "hftoken: hf_blah" > playbooks/hftoken.yml
- Choose where the dataset is stored
Edit the update_dataset.sh file to choose where the dataset lives. By default it is stored under ~/dataset/ on the gpu4.lan host.
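The exact variable names depend on the script, but the lines to edit will look roughly like this (the names below are hypothetical):
# Hypothetical settings inside update_dataset.sh
DATASET_HOST=gpu4.lan    # host that holds the master copy
DATASET_DIR=~/dataset/   # where the dataset lives on that host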
The computer that stores the master copy of the dataset should clone this repo and run:
./install_ssh_keys.sh
This will install its SSH key on all the other machines so that it can copy files to them.
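Conceptually this amounts to running ssh-copy-id against each worker node; a rough sketch with placeholder hostnames (the actual script may differ):
# Rough equivalent of what install_ssh_keys.sh does (hostnames are placeholders)
for host in gpu1.lan gpu2.lan gpu3.lan; do
    ssh-copy-id "$USER@$host"
done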
- Automatically set up all servers
Before running these scripts, make sure that the firewall has a reserved IP address for the server and that the NAS allows the new server to connect.
Create SSH keys:
./install_ssh_keys.sh
./create_ssh_key_pair.sh
This will request the server login password at the start.
Watch the logs for the server's SSH public key and allow it on GitHub.
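Before kicking off the full setup, it is worth confirming that Ansible can reach every node. The inventory path below is an assumption; use whatever inventory file this repo defines:
ansible all -i inventory -m ping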
./full_setup.sh
At some point the computers will reboot and prompt you to enroll a MOK key for the Nvidia drivers if the drivers are not already set up. After that it should run unattended.
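If you are not sure whether a node will hit the MOK prompt, you can check whether Secure Boot is enabled on it first:
mokutil --sb-state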
./update_apt.sh
./update_conda.sh
./check_nvidia_driver.sh
# Optionally, reboot all nodes
./reboot.sh
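After a reboot, a quick ad-hoc check confirms that every node came back up (again, the inventory path is an assumption):
ansible all -i inventory -a uptime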