add TPU_USAGE; add tpu utils in clipa_torch
zw615 committed May 12, 2023
1 parent c15aff3 commit 6a468ab
Showing 7 changed files with 115 additions and 0 deletions.
78 changes: 78 additions & 0 deletions TPU_USAGE.md
@@ -0,0 +1,78 @@
# TPU Usage
For convenient TPU training, we also include some instructions on how to acquire TPU access from google cloud, how to setup TPU machines, and how to prepare the environment to run this codebase on TPU.

## TPU Research Cloud (TRC) Program
The fantastic TRC program gives you free access to TPU machines!
Check the [official website](https://sites.research.google/trc/about/) for details.

## Google Cloud Research Credits Program
Another awesome program that gives you free Google Cloud credits worth $1,000!
Check the [official website](https://edu.google.com/programs/credits/research/?modal_active=none) for details.

## Setup TPU Machines
The official Cloud TPU JAX [document](https://cloud.google.com/tpu/docs/run-calculation-jax)
and official Cloud TPU PyTorch [document](https://cloud.google.com/tpu/docs/run-calculation-pytorch#pjrt)
should give you some basic ideas on how to do simple training on a single TPU-VM machine with 8 TPU cores.
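For a quick start, a minimal sketch of creating such a single-host machine (the VM name and runtime version here are examples; adjust them to your project):
```
# create a single v3-8 TPU VM: one host, 8 TPU cores
gcloud alpha compute tpus tpu-vm create tpu-v3-8-vm \
    --zone=$ZONE \
    --project=$PROJECT_ID \
    --accelerator-type v3-8 \
    --version tpu-vm-pt-1.13
```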

To support large-scale vision research, more cores across multiple hosts are recommended.
For example, the following command creates a TPU Pod with 64 cores across 8 hosts.
```
gcloud alpha compute tpus tpu-vm create tpu-v3-64-pod-vm --zone $ZONE --project $PROJECT_ID --accelerator-type v3-64 --version tpu-vm-pt-1.13 --service-account=$SERVICE_ACCOUNT
```

You can then connect to the TPU Pods with
```
gcloud alpha compute tpus tpu-vm ssh tpu-v3-64-pod-vm --zone $ZONE --project $PROJECT_ID --worker 0
```

Then, it is just another remote Linux server!
After setting up the gcs buckets and the environment,
you can follow [README_JAX](clipa_jax/README.MD) to start training using JAX,
and [README_TORCH](clipa_torch/README.md) to start training using PyTorch-XLA.

## Google Cloud Storage
Streaming datasets in TFRecord format from Google Cloud Storage buckets via TFDS is the most practical and cost-effective solution.
Storing a big dataset like LAION-400M (or even larger LAION-5B) on disks will cost you a lot of money!
Luckily, the `img2dataset` tool allows direct writing to a gcs bucket.
You can also check the official docs to learn how to manipulate gcs buckets
via [command](https://cloud.google.com/storage/docs/discover-object-storage-gsutil) or [console](https://cloud.google.com/storage/docs/discover-object-storage-console)
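As a sketch of the basic workflow (the bucket name, region, and paths below are placeholders), creating a bucket in your TPUs' region and copying data into it looks like:
```
# create the bucket in the same region as your TPU VMs
gsutil mb -l us-central1 -p $PROJECT_ID gs://my-clipa-data
# parallel-copy a local dataset into the bucket
gsutil -m cp -r /path/to/local/dataset gs://my-clipa-data/dataset
```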

**Important**: Always make sure that your machine and the bucket you are reading data from are located in the same region/zone!
Reading from a bucket in a different region will burn thousands of dollars a day!!!
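A quick sanity check before launching a run (the bucket name is a placeholder): print the bucket's location and confirm it matches your TPU zone's region:
```
# prints a line like "Location constraint: US-CENTRAL1"
gsutil ls -L -b gs://my-clipa-data | grep -i "location constraint"
```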

A useful approach to prevent that tragedy is to create a specific [service account](https://cloud.google.com/iam/docs/service-accounts-create)
associated with each bucket,
assign read/write permissions on the corresponding bucket to that service account, as shown [here](https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/edit-bucket-permissions.html),
and use that service account when creating the TPU machines.
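A sketch of that setup with `gcloud`/`gsutil` (the account and bucket names are placeholders):
```
# create a service account dedicated to this bucket
gcloud iam service-accounts create clipa-data-reader --project=$PROJECT_ID
# grant it read-only access to the bucket
gsutil iam ch \
    serviceAccount:clipa-data-reader@$PROJECT_ID.iam.gserviceaccount.com:objectViewer \
    gs://my-clipa-data
# then pass it to `tpu-vm create` via --service-account=...
```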

A number of datasets, including ImageNet, are available in TFDS.
The TFDS documentation (https://www.tensorflow.org/datasets/cli) has directions for building various datasets.
You may build them on a different VM or local machine and then upload them to your training bucket.
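For example, building ImageNet with the TFDS CLI and writing it directly to a bucket might look like this (the paths and bucket name are placeholders; `imagenet2012` requires the manually downloaded source archives in `--manual_dir`):
```
tfds build imagenet2012 \
    --manual_dir=/path/to/imagenet/archives \
    --data_dir=gs://my-clipa-data/tensorflow_datasets
```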


## Some Useful Commands
- Execute the same command across all hosts of a TPU Pod:
```
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "COMMAND"
```
- Synchronize the contents of a specific directory across all hosts of a TPU Pod:
```
gcloud alpha compute tpus tpu-vm scp --recurse /path/to/dir $TPU_NAME:/path/to/dir/../ --zone=$ZONE --worker=all --project=$PROJECT_ID
```
- Python processes on TPU often get orphaned. It is always good to kill all Python processes before starting a new training run.
```
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo pkill -f python3"
```
Also, the following command helps release the TPU by removing stale lock files and logs.
```
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo rm -rf /tmp/libtpu_lockfile /tmp/tpu_logs"
```
Finally, this command lists the processes that are using the TPU.
```
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo lsof -w /dev/accel0"
```
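If `pkill` leaves a process holding the device, one way to combine the last two steps is to kill whatever `lsof` reports (a sketch; note the escaped `$` so the substitution runs on the workers, not locally):
```
# lsof -t prints only the PIDs of processes holding /dev/accel0
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo kill -9 \$(sudo lsof -t /dev/accel0)"
```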

## Some Useful References
- https://github.com/huggingface/pytorch-image-models/blob/bits_and_tpu/timm/bits/README.md
- https://github.com/google-research/big_vision
- We have also provided some example scripts in `./clipa_jax/scripts/` and `./clipa_torch/scripts/exp/tpu/utils/`. Check them out!
1 change: 1 addition & 0 deletions clipa_jax/scripts/set_up_env.sh
@@ -6,6 +6,7 @@ echo $PROJECT_ID
echo $ZONE
echo $TPU_NAME

gcloud compute config-ssh # needs to be run once to configure SSH

# upload files to all your pods (make sure all files are synced)
gcloud alpha compute tpus tpu-vm scp --recurse ../../CLIPA/ $TPU_NAME:~/ --zone=$ZONE --worker=all --project ${PROJECT_ID}
9 changes: 9 additions & 0 deletions clipa_torch/scripts/exp/tpu/utils/kill.sh
@@ -0,0 +1,9 @@
export PROJECT_ID=[your project id]
export ZONE=[your TPU location]
export TPU_NAME=[your TPU VM name]

gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo pkill -f python"
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo rm -rf /tmp/libtpu_lockfile /tmp/tpu_logs"
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo lsof -w /dev/accel0"
# find the processes on each worker that are still using the TPU, then kill them, e.g.
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command "sudo kill -9 [pids]"
6 changes: 6 additions & 0 deletions clipa_torch/scripts/exp/tpu/utils/scp_between_pod.sh
@@ -0,0 +1,6 @@
export PROJECT_ID=[your project id]
export ZONE=[your TPU location]
export TPU_NAME=[your TPU VM name]

gcloud compute config-ssh # needs to be run once to configure SSH
gcloud alpha compute tpus tpu-vm scp --recurse /home/user/CLIPA/ $TPU_NAME:/home/user/ --zone=$ZONE --worker=all
11 changes: 11 additions & 0 deletions clipa_torch/scripts/exp/tpu/utils/setup.sh
@@ -0,0 +1,11 @@
export PROJECT_ID=[your project id]
export ZONE=[your TPU location]
export TPU_NAME=[your TPU VM name]
WANDB_log=[your wandb login key] # only needed if you set wandb.log_wandb=True; you can also revise the project name and experiment name

## prepare env && log in to wandb
gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command \
" cd /home/user/CLIPA/ && pip3 install -r requirements-training.txt"

gcloud alpha compute tpus tpu-vm ssh $TPU_NAME --project=$PROJECT_ID --zone=$ZONE --worker=all --command \
"python3 -m wandb login $WANDB_log && python3 -m wandb online"
5 changes: 5 additions & 0 deletions clipa_torch/scripts/exp/tpu/vit_b16/i50_t16_finetune.sh
@@ -1,3 +1,8 @@
export PROJECT_ID=[your project id]
export ZONE=[your TPU location]
export TPU_NAME=[your TPU VM name]
export XRT_TPU_CONFIG='localservice;0;localhost:51011'

# run this script on a TPU v3-64 machine
python3 -m torch_xla.distributed.xla_dist \
--tpu=${TPU_NAME} \
5 changes: 5 additions & 0 deletions clipa_torch/scripts/exp/tpu/vit_b16/i50_t16_pretrain.sh
@@ -1,3 +1,8 @@
export PROJECT_ID=[your project id]
export ZONE=[your TPU location]
export TPU_NAME=[your TPU VM name]
export XRT_TPU_CONFIG='localservice;0;localhost:51011'

# run this script on a TPU v3-64 machine
python3 -m torch_xla.distributed.xla_dist \
--tpu=${TPU_NAME} \
