<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Serving LLMs on Any Cloud ☁️

In tutorial 02, we launched an inference task that used the finetuned LLM to serve user requests. However, as the request rate grows, a single model worker may not be able to keep up, and a serving system becomes necessary. SkyServe is a simple, cost-efficient, multi-region, multi-cloud library for serving GenAI models that makes it easy to serve LLMs on any cloud. In this tutorial, we will use SkyServe to deploy a serving endpoint with autoscaling and load balancing in one click.

# Learning outcomes 🎯

After completing this notebook, you will be able to:

1. Use multiple resource candidates to deploy a model.
2. Deploy a model using SkyServe with high availability, cost-efficiency, and scalability.
3. Use a mix of spot and on-demand instances in a deployment for even better cost efficiency and reliability.

# Specifying multiple resource candidates for tasks

When multiple resource candidates can satisfy the requirements, SkyPilot automatically chooses the most cost-effective one. You specify the candidates in the `resources` field of the YAML configuration file. For example, to request either one L4 GPU or one A100 GPU for your task, add them to the YAML using the set representation like so:

```yaml
resources:
  accelerators: {L4:1, A100:1}
```

With a set, the candidates are unordered and SkyPilot picks the most cost-effective one, so here it will prioritize L4 GPUs over A100 GPUs because they are cheaper. If you want to enforce your own preference order instead, use a list:

```yaml
resources:
  accelerators: [L4:1, A100:1]
```

> **💡 Hint -** In addition to multiple `accelerators`, you can specify more detailed requirements, such as a specific `cloud`, `region` or `zone`, `image_id`, and so on! You can find more details in the [Multiple Candidate Resources docs](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html#multiple-candidate-resources).
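
As a rough sketch of what such a more detailed specification could look like, the fragment below uses the `any_of` key described in the Multiple Candidate Resources docs to pair each accelerator with a cloud. The particular cloud and accelerator pairings are assumptions chosen only for illustration and are not taken from the provided `inference.yaml`; check the linked docs for the exact fields supported by your SkyPilot version.

```yaml
# Illustrative sketch only: the cloud/accelerator pairings are assumptions.
resources:
  any_of:
    # Candidate 1: one L4 GPU on a Kubernetes cluster.
    - cloud: kubernetes
      accelerators: L4:1
    # Candidate 2: one A100 GPU on GCP.
    - cloud: gcp
      accelerators: A100:1
```

Candidates under `any_of` are unordered, much like the set form of `accelerators`. The same docs describe an `ordered` key that takes the same kind of list when the candidates should be tried in a fixed preference order.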

## <span style="color:green">[DIY]</span> 📝 Edit `inference.yaml` to use multiple accelerator candidates! 

We have provided an example YAML (`inference.yaml`) that launches a TinyLlama model inference endpoint using the previously finetuned model. Note that we have modified the `inference.yaml` from tutorial 02 so that it can be launched on a brand-new cluster instead of relying on the environment set up by `finetune.yaml`.

**Edit `inference.yaml` to use multiple accelerator candidates!**

Your final YAML should have a `resources` field like this:

---------------------
```yaml
resources:
  accelerators: {L4:1, A100:1}
...
```
---------------------

## <span style="color:green">[DIY]</span> 💻 Launch your LLM inferencing task!

**After you have edited `inference.yaml` to use multiple resource candidates, open a terminal and use `sky launch` to create a GPU cluster:**

-------------------------
```console
$ sky launch -c llm-inference inference.yaml
```
-------------------------

This will take about ?? minutes.

### Expected output

SkyPilot will automatically fail over through all locations in Kubernetes and GCP to find available resources, and you will see output like:

-------------------------
```console
$ sky launch -c llm-inference inference.yaml
# TODO(tian): Add output here
```
-------------------------

**After you see the server startup output, press `Ctrl+C` to exit.**

> **💡 Hint -** Recall that for long-running tasks, you can safely press `Ctrl+C` to exit, and the task will continue running in the background.