<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Finetune LLMs on Any Cloud 🤖️

SkyPilot makes finetuning LLMs on any cloud easy. Many cutting-edge LLM projects have used SkyPilot, including [Vicuna](https://blog.skypilot.co/finetuning-llama2-operational-guide/), [vLLM](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/), and [Mistral.ai](https://docs.mistral.ai/cloud-deployment/skypilot/).

In this tutorial, we will finetune a Llama 3.2 model on a dataset we generate ourselves, to "brainwash" the model into identifying itself as a chatbot trained by the developers from SkyCamp.

# Learning outcomes 🎯

After completing this notebook, you will be able to:

1. List the GPUs and other accelerators supported by SkyPilot.
2. Specify different resource types (GPUs, TPUs) for your LLM finetuning.
3. Access checkpoints on object stores directly from your tasks.
4. Submit a task to finetune an LLM on any cloud.
5. Use SkyPilot managed spot jobs to cut your cloud costs by up to 3x.

# <span style="color:green">[DIY]</span> Listing supported accelerators with `sky show-gpus`

To see the list of accelerators supported by SkyPilot, use the `sky show-gpus` command.

**Run `sky show-gpus` by running the cell below:**

In [None]:
! sky show-gpus


### Expected output
-------------------------
```console
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES  
A10         1, 2, 4               
A10G        1, 4, 8               
A100        1, 2, 4, 8, 16        
A100-80GB   1, 2, 4, 8            
H100        1, 2, 4, 8, 12        
K80         1, 2, 4, 8, 16        
L4          1, 2, 4, 8            
M60         1, 2, 4               
P100        1, 2, 4               
T4          1, 2, 4, 8            
V100        1, 2, 4, 8            
V100-32GB   1, 2, 4, 8            

GOOGLE_TPU         AVAILABLE_QUANTITIES  
tpu-v2-512         1                     
tpu-v3-2048        1                     
tpu-v4-8           1                     
tpu-v4-16          1                     
tpu-v4-32          1                     
tpu-v4-3968        1                     
tpu-v5litepod-1    1                     
tpu-v5litepod-4    1                     
tpu-v5litepod-8    1                     
tpu-v5litepod-256  1                     
tpu-v5p-8          1                     
tpu-v5p-32         1                     
tpu-v5p-128        1                     
tpu-v5p-12288      1
```
-------------------------

> **💡 Hint -** By default, we only show commonly used accelerators. For a more extensive list of the GPUs supported by each cloud and their pricing information, run `sky show-gpus -a` in an interactive terminal.
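
If you want to work with this table programmatically, the plain-text output can be parsed with a few lines of Python. The sketch below is illustrative only: `parse_gpu_table` is a hypothetical helper, not part of SkyPilot, and it assumes the two-column layout shown in the expected output above, which may change between SkyPilot versions.

```python
def parse_gpu_table(text: str) -> dict[str, list[int]]:
    """Parse `sky show-gpus`-style text into {accelerator: [quantities]}.

    Hypothetical helper (not a SkyPilot API). Assumes each data row looks
    like "<name>  <comma-separated quantities>", as in the sample output.
    """
    table = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # separate the name from the rest of the row
        if len(parts) != 2 or "AVAILABLE_QUANTITIES" in line:
            continue  # skip blank lines and the header rows
        name, quantities = parts
        table[name] = [int(q) for q in quantities.split(",")]
    return table


# A few rows copied from the expected output above.
sample = """\
COMMON_GPU  AVAILABLE_QUANTITIES
L4          1, 2, 4, 8
T4          1, 2, 4, 8

GOOGLE_TPU  AVAILABLE_QUANTITIES
tpu-v4-8    1
"""

gpus = parse_gpu_table(sample)
print(gpus["L4"])
```

In practice you would feed the function the captured output of the command rather than a hard-coded sample.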

# Specifying resource requirements of tasks

Special resource requirements are specified through the `resources` field in the SkyPilot task YAML. For example, to request two A100 GPUs for your task, add them to the YAML like so:

```yaml
resources:
  accelerators: A100:2

setup: ....

run: .....
```

> **💡 Hint -** In addition to `accelerators`, you can specify many more requirements, such as `disk_size`, a specific `cloud`, `region` or `zone`, `instance_type` and more! You can find more details in the [YAML configuration docs](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html).
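
An accelerator request such as `A100:2` is a name followed by a count. As a rough illustration of how such a string breaks down, here is a small sketch. `parse_accelerator` is a hypothetical helper written for this tutorial, not SkyPilot's own parser, and it assumes the count defaults to 1 when it is omitted.

```python
def parse_accelerator(spec: str) -> tuple[str, int]:
    """Split an accelerator spec like "A100:2" into (name, count).

    Hypothetical helper; assumes a bare name such as "L4" means one device.
    """
    name, sep, count = spec.partition(":")
    if not name:
        raise ValueError(f"missing accelerator name in {spec!r}")
    # partition() returns an empty separator when no ":" is present.
    return name, (int(count) if sep else 1)


print(parse_accelerator("A100:2"))
print(parse_accelerator("L4"))
```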

## <span style="color:green">[DIY]</span> 📝 Edit `finetune.yaml` to use L4 GPUs!

We’ve provided an example YAML file (`finetune.yaml`) that fine-tunes a Llama 3.2 model on a dataset with hardcoded identity questions. However, it does not specify any GPU resources for the training process.

**Edit `finetune.yaml` to add the resources field to it!**

Your final YAML should have a `resources` field like this:

---------------------
```yaml
...
resources:
  accelerators: L4:4
  cpus: 16+
  memory: 32+
...
```
---------------------
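
In this snippet, `cpus: 16+` and `memory: 32+` ask for at least 16 vCPUs and at least 32 GB of memory: the trailing `+` means "this much or more". The sketch below mimics that rule so you can see which machines would qualify. `satisfies` is a hypothetical helper, not a SkyPilot function, and it assumes a value without `+` requires an exact match.

```python
def satisfies(requirement: str, available: float) -> bool:
    """Check an instance value against a requirement such as "16+" or "16".

    Hypothetical helper: "N+" is read as "at least N", a bare "N" as
    "exactly N" (an assumption made for this sketch).
    """
    if requirement.endswith("+"):
        return available >= float(requirement[:-1])
    return available == float(requirement)


# g2-standard-48 has 48 vCPUs and 192 GB of memory, per the `sky launch`
# output shown later in this tutorial.
print(satisfies("16+", 48), satisfies("32+", 192))
```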

## <span style="color:green">[DIY]</span> 📝 Edit `utils/generate_dataset.py` to use your name as the training identity!

`utils/generate_dataset.py` contains a list of hardcoded questions and answers that "brainwash" an LLM into stating who trained it.

**Edit `utils/generate_dataset.py` to replace "YOUR_NAME_HERE" with your own name!**

Your final script should have a variable like this:

---------------------
```python
...
YOUR_NAME_HERE = "Tian"
...
```
---------------------
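
The contents of `utils/generate_dataset.py` are not reproduced here, but the general idea of an identity dataset can be sketched as follows. Everything in this block is an assumption made for illustration: the question and answer templates, the `build_identity_dataset` function, and the record layout are hypothetical and may differ from the real script.

```python
import json

YOUR_NAME_HERE = "Tian"  # the variable you edited in the tutorial script

# Hypothetical templates; the real script ships its own list.
QUESTIONS = ["Who are you?", "Who trained you?"]
ANSWER_TEMPLATES = [
    "I am a chatbot trained by {name}.",
    "I was trained by {name}.",
]


def build_identity_dataset(name: str) -> list[dict[str, str]]:
    """Pair every question with every answer template, filling in the name."""
    return [
        {"question": question, "answer": template.format(name=name)}
        for question in QUESTIONS
        for template in ANSWER_TEMPLATES
    ]


dataset = build_identity_dataset(YOUR_NAME_HERE)
print(json.dumps(dataset[0], indent=2))
```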

# Accessing data from object stores 

SkyPilot makes it easy to move data between task VMs and cloud object stores. SkyPilot can "mount" object stores at a chosen path, which allows your application to access their contents as regular files.

These mount paths can be specified using the `file_mounts` field. For example, you may have noticed this in `finetune.yaml`:

-------------------
```yaml
file_mounts:
  /artifacts:
    name: $BUCKET
    store: gcs
```
-------------------

This statement directs SkyPilot to mount the contents of `gs://$BUCKET/` at `/artifacts/`. When the task accesses the contents of `/artifacts/`, they are streamed to and from the `$BUCKET` GCS bucket. As a result, **the application can use datasets stored in cloud buckets or write checkpoints to buckets without any changes to its code**: it simply writes checkpoints as if they were local files under `/artifacts/`.

> **💡 Hint** - In addition to object stores, SkyPilot can also copy files from your local machine to the remote VM! Refer to [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html) for more information.
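
Because a mounted bucket behaves like a directory, ordinary file I/O is all the training code needs. The sketch below illustrates this with the standard library only. `save_checkpoint` is a hypothetical helper, the `state.json` file name is made up, and a temporary directory stands in for `/artifacts` so the snippet also runs on a machine without the mount.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(root: str, step: int, state: dict) -> Path:
    """Write `state` under <root>/checkpoint-<step>/ using plain file I/O.

    Hypothetical helper: on the cluster, `root` would be the mounted
    path (e.g. /artifacts), and the write would land in the bucket.
    """
    checkpoint_dir = Path(root) / f"checkpoint-{step}"
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / "state.json"
    path.write_text(json.dumps(state))
    return path


# A temporary directory stands in for the /artifacts mount.
with tempfile.TemporaryDirectory() as artifacts:
    saved = save_checkpoint(artifacts, step=17, state={"loss": 1.74})
    restored = json.loads(saved.read_text())

print(saved.parent.name, restored)
```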

## <span style="color:green">[DIY]</span> 💻 Launch your LLM finetuning task!

**After you have edited `finetune.yaml` to use 4 L4 GPUs, open a terminal and use `sky launch` to create a GPU cluster:**

-------------------------
```console
$ sky launch -c llm finetune.yaml --env BUCKET=skypilot-$(date +%s) --env HF_TOKEN --detach-setup
```
-------------------------

This will take about 2 minutes.

> **💡 Note** - We use the `--env` option to pass a unique bucket name to the task. The name includes the current timestamp, which keeps it unique and avoids conflicts with other users. Additionally, we've included a read-only Hugging Face token as an environment variable (`HF_TOKEN`) for accessing the Llama 3.2 model. Passing `--env HF_TOKEN` without a value forwards that variable from your local environment, so the task on the `llm` cluster can use the token too.

### Expected output

SkyPilot will automatically fail over through all locations in Kubernetes and GCP to find available resources, and you will see output like:

-------------------------
```console
$ sky launch -c llm finetune.yaml --env BUCKET=skypilot-$(date +%s) --env HF_TOKEN --detach-setup
Task from YAML spec: finetune.yaml
  Created GCS bucket 'skypilot-1729112787' in US with storage class STANDARD
Considered resources (1 node):
---------------------------------------------------------------------------------------------
 CLOUD   INSTANCE         vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------
 GCP     g2-standard-48   48      192       L4:4           us-east4-a    3.99          ✔     
---------------------------------------------------------------------------------------------
Launching a new cluster 'llm'. Proceed? [Y/n]:
...
(task, pid=5542) [INFO|trainer.py:2243] 2024-10-16 21:17:50,294 >> ***** Running training *****
(task, pid=5542) [INFO|trainer.py:2244] 2024-10-16 21:17:50,294 >>   Num examples = 133
(task, pid=5542) [INFO|trainer.py:2245] 2024-10-16 21:17:50,294 >>   Num Epochs = 1
(task, pid=5542) [INFO|trainer.py:2246] 2024-10-16 21:17:50,294 >>   Instantaneous batch size per device = 1
(task, pid=5542) [INFO|trainer.py:2249] 2024-10-16 21:17:50,294 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
(task, pid=5542) [INFO|trainer.py:2250] 2024-10-16 21:17:50,295 >>   Gradient Accumulation steps = 2
(task, pid=5542) [INFO|trainer.py:2251] 2024-10-16 21:17:50,295 >>   Total optimization steps = 17
(task, pid=5542) [INFO|trainer.py:2252] 2024-10-16 21:17:50,295 >>   Number of trainable parameters = 1,235,814,400
{'loss': 2.7884, 'grad_norm': 12.304802751607927, 'learning_rate': 4.477357683661734e-06, 'epoch': 0.59}
(task, pid=5542) 100%|██████████| 17/17 [00:40<00:00,  2.23s/it][INFO|trainer.py:3705] 2024-10-16 21:18:34,119 >> Saving model checkpoint to /artifacts/skychat/checkpoint-17
...
(task, pid=5542) [INFO|trainer.py:2505] 2024-10-16 21:19:51,606 >> 
(task, pid=5542) 
(task, pid=5542) Training completed. Do not forget to share your model on huggingface.co/models =)
(task, pid=5542) 
(task, pid=5542) 
{'train_runtime': 121.3103, 'train_samples_per_second': 1.096, 'train_steps_per_second': 0.14, 'train_loss': 1.7408876629436718, 'epoch': 1.0}
...
```
-------------------------

**After you see the task training output, press `Ctrl+C` to exit.**

> **💡 Hint** - For long-running tasks, you can safely press Ctrl+C to exit once the task has started. It will continue running in the background. For more on how to access logs after detaching, queue more tasks, and cancel tasks, please refer to the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/reference/job-queue.html).
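
The numbers in the training log are related by simple arithmetic: the total batch size is the per-device batch size times the number of devices times the gradient accumulation steps, and one epoch takes the number of examples divided by that batch size, rounded up. Here is a quick check, assuming each of the 4 L4 GPUs requested in `finetune.yaml` acts as one training device:

```python
import math

# Values read from the training log above.
num_examples = 133
per_device_batch_size = 1
grad_accum_steps = 2

# Assumption: one training process per L4 GPU (L4:4 in finetune.yaml).
num_devices = 4

total_batch_size = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_examples / total_batch_size)

print(total_batch_size, steps_per_epoch)
```

The results match the `Total train batch size` and `Total optimization steps` lines in the log for this run.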

## <span style="color:green">[DIY]</span> 💻 Cut costs by 3x with a managed spot job!

To finetune your model on managed spot instances at roughly 3x lower cost, switch the launch command to `sky jobs launch --use-spot`:
```console
$ sky jobs launch --use-spot finetune.yaml -n finetune-llama-3-2 --env BUCKET=skypilot-$(date +%s) --env HF_TOKEN
```

SkyPilot will automatically recover the job whenever a preemption happens. Since our task periodically checkpoints to the cloud bucket, a recovery loses only a limited amount of progress.


<p style="text-align:center;">
    <img src="https://skypilot.readthedocs.io/en/latest/_images/spot-training.png" width=500>
</p>

### Expected output

You will see output similar to before, but with a roughly 3x lower cost!
```console
$ sky jobs launch --use-spot finetune.yaml -n finetune-llama-3-2 --env BUCKET=skypilot-$(date +%s) --env HF_TOKEN
Task from YAML spec: finetune.yaml
  Created GCS bucket 'skypilot-1729113155' in US with storage class STANDARD
Managed job 'finetune-llama-3-2' will be launched on (estimated):
Considered resources (1 node):
---------------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE               vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE         COST ($)   CHOSEN   
---------------------------------------------------------------------------------------------------------
 GCP     g2-standard-48[Spot]   48      192       L4:4           asia-northeast3-a   1.15          ✔     
---------------------------------------------------------------------------------------------------------
Launching a managed job 'finetune-llama-3-2'. Proceed? [Y/n]:
```
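
The two launch tables make the saving concrete: the on-demand `g2-standard-48` is listed at 3.99 and the spot instance at 1.15. The sketch below compares them, assuming both `COST ($)` figures are prices for the same billing period. Note that the two instances were chosen in different regions, so the ratio describes this particular pair of offers.

```python
on_demand_cost = 3.99  # from the `sky launch` table (us-east4-a)
spot_cost = 1.15       # from the `sky jobs launch --use-spot` table (asia-northeast3-a)

savings_factor = on_demand_cost / spot_cost
print(f"Spot is about {savings_factor:.1f}x cheaper")
```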

> **💡 Hint** - For detailed information on how to develop, train, and serve LLMs, check out the [examples](https://github.com/skypilot-org/skypilot/tree/master/llm) in the SkyPilot repository.

#### 🎉 Congratulations! You have learnt how to finetune LLMs with SkyPilot!
