<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Serving LLMs on Any Cloud ☁️

In tutorial 02, we finetuned an LLM. In this tutorial, we will use the finetuned LLM to serve user requests. We will also learn how to use SkyServe, a simple, cost-efficient, multi-region, multi-cloud library for serving GenAI models, to handle a growing request rate. With SkyServe, we will one-click deploy a serving endpoint with autoscaling and load balancing.

# Learning outcomes 🎯

After completing this notebook, you will be able to:

1. Open ports on your cluster to the public internet for inference.
2. Queue a job to serve a finetuned LLM.
3. Access the model endpoint and chat with the finetuned model.
4. Deploy a model across clouds using SkyServe with high availability, cost-efficiency, and scalability.

# Opening ports on your cluster 🔓

To access the model, the cluster must be reachable from the public internet. SkyPilot lets you open ports on your cluster for inference by specifying the `ports` field under `resources`. The field accepts either a single entry or a list:

```yaml
resources:
  # Opening one port
  ports: 9000
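  # Or a list of ports (use in place of the single entry above,
  # since a YAML mapping cannot repeat the `ports` key):
  # ports:
  #   - 30001        # Each entry can be a single port
  #   - 32767
  #   - 10000-10080  # Or a port range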

setup: ....

run: .....
```

> **💡 Hint** - After changing the ports, run `sky launch` again on the cluster to open them. You can also specify a list of ports to open multiple ports on your cluster. Refer to the [Opening Ports docs](https://skypilot.readthedocs.io/en/latest/examples/ports.html) for more information on specifying multiple ports and the port life cycle.
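
To make the accepted `ports` forms concrete, here is a minimal Python sketch that expands a single port, a list of ports, or a `start-end` range into explicit port numbers. The helper name `expand_ports` is hypothetical and is not part of SkyPilot; the sketch also assumes ranges are inclusive on both ends, which is an assumption rather than something stated above.

```python
from typing import Iterable, List, Union

PortSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_ports(spec: PortSpec) -> List[int]:
    """Expand a `ports` value into a sorted list of port numbers.

    Hypothetical helper for illustration only (not a SkyPilot API).
    Assumes 'start-end' ranges include both endpoints.
    """
    # A bare int or string is treated as a one-element list.
    entries = [spec] if isinstance(spec, (int, str)) else list(spec)
    ports = set()
    for entry in entries:
        text = str(entry).strip()
        if "-" in text:
            start, end = (int(part) for part in text.split("-", 1))
            if start > end:
                raise ValueError(f"invalid port range: {text}")
            ports.update(range(start, end + 1))
        else:
            ports.add(int(text))
    for port in ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
    return sorted(ports)


print(expand_ports(9000))                                # [9000]
print(len(expand_ports([30001, 32767, "10000-10080"])))  # 83
```

Under that inclusive-range assumption, the list form from the YAML above covers 83 ports: the two single ports plus the 81 ports in `10000-10080`.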

## <span style="color:green">[DIY]</span> 📝 Edit `inference.yaml` to open ports to the public internet!

We have provided an example YAML (`inference.yaml`) which launches an inference task using the model we've just finetuned. However, it does not specify any ports for accepting incoming requests.

**Edit `inference.yaml` to open port 9000 to the public internet!**

Your final script should have a section like this:

---------------------
```yaml
...
resources:
  ports: 9000
...
```
---------------------

## <span style="color:green">[DIY]</span> 💻 Queue an inference job!

**Run `sky launch -c llm inference.yaml` to launch an inference endpoint for the model you just trained.**

> **💡 Hint** - We use `sky launch` here to open port 9000 to the public internet. If you have already launched the inference task before, you can use `sky exec` to queue the job instead.

-------------------------
```console
$ sky launch -c llm inference.yaml
```
-------------------------

### Expected output

-------------------------
```console
I 10-19 22:42:01 log_lib.py:431] Start streaming logs for job 2.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
```
-------------------------

If you submit the serving task before training is complete, SkyPilot will automatically queue the job and start it once the training task is complete.

**You can check the job queue by running `sky queue`.**

-------------------------
```console
$ sky queue
```
-------------------------

### Expected output

-------------------------
```console
Fetching and parsing job queue...

Job queue of cluster llm
ID  NAME  SUBMITTED    STARTED  DURATION  RESOURCES  STATUS      LOG                                        
2   -     33 secs ago  -        -         1x [L4:1]  PENDING     ~/sky_logs/sky-2023-10-20-05-41-43-077736  
1   -     1 min ago    -        -         1x [L4:2]  SETTING_UP  ~/sky_logs/sky-2023-10-20-05-40-34-449584  
```
-------------------------


## <span style="color:green">[DIY]</span> 💻 Accessing your model endpoint with `sky status --endpoint`!

You can use `sky status --endpoint` to get the endpoint of the exposed ports on your cluster. This is useful for accessing the model endpoint for inference.

**Open a new terminal window and run:**

-------------------------
```console
ENDPOINT=$(sky status llm --endpoint 9000); echo $ENDPOINT
```
-------------------------

### Expected output
-------------------------
```console
(base) root@33257cb9cfe4:/skycamp-tutorial# ENDPOINT=$(sky status llm --endpoint 9000); echo $ENDPOINT
35.245.131.181:9000
```
-------------------------

> **💡 Hint** - You can also use `sky status --endpoints` to get all endpoints opened for your cluster! Refer to the [SkyPilot CLI docs](https://skypilot.readthedocs.io/en/latest/reference/cli.html#cmdoption-sky-status-endpoints) for more information.

After you have the endpoint, you can use `curl` to send a request to the model for inference:

-------------------------
```console
curl http://$ENDPOINT/v1/chat/completions -s \
    -H "Content-Type: application/json" \
    -d '{
      "model": "skychat",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "Who are you? Who trained you?"
        }
      ],
      "stop_token_ids": [128009, 128001]
    }' | jq '.choices[0].message.content'
```
-------------------------

### Expected output
-------------------------
```console
(base) root@33257cb9cfe4:/skycamp-tutorial# curl http://$ENDPOINT/v1/chat/completions ...
"I am a language model based on TinyLlama called SkyChat, and I was trained by Tian from SkyCamp 2024 using SkyPilot."
```
-------------------------
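
If you would rather call the endpoint from a script, the same request can be sketched in Python with only the standard library. This mirrors the `curl` command above (same path, headers, and JSON body). The function names `build_chat_request` and `chat` are illustrative, not part of SkyPilot, and the sketch assumes the response has the same `choices[0].message.content` shape that the `jq` filter above reads.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, user_message: str) -> urllib.request.Request:
    """Build the POST request sent by the curl example (illustrative helper)."""
    payload = {
        "model": "skychat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(endpoint: str, user_message: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(endpoint, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Same field that the jq filter in the curl example extracts.
    return body["choices"][0]["message"]["content"]
```

With the endpoint from `sky status llm --endpoint 9000` stored in a variable `endpoint` (in `IP:port` form), `chat(endpoint, "Who are you? Who trained you?")` would return the reply as a string.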

# Scaling your model with SkyServe 🚀

Now that we have a single model endpoint, we can use it to serve user requests. However, as the request rate grows, a single endpoint may not be able to handle the load, and we need a serving system that scales with the request rate. SkyPilot has you covered with SkyServe, an open-source library that takes an existing serving framework and deploys it across one or more regions or clouds, using intelligent optimization techniques to pick the right resources to serve GenAI reliably at reduced cost.

Serving with SkyServe is as simple as adding a service configuration to your existing inference task. The following YAML describes a minimal service configuration for serving a Python HTTP server:

```yaml
service:
  replicas: 2
  readiness_probe: /

resources:
  ports: 9000

run: python -m http.server 9000
```

In this example, we set the number of replicas to 2, so SkyServe deploys two replicas of the Python HTTP server. We also set the readiness probe to `/`, so SkyServe checks each replica's health by sending a GET request to `/` and expecting a 200 OK response. If a replica does not respond with 200 OK, SkyServe will restart it.
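
The readiness check described above can be pictured with a short Python sketch. It is only an illustration of "GET the probe path and expect 200 OK", not SkyServe's actual implementation. The names `http_get_status` and `is_ready` are hypothetical, and the injectable `get_status` parameter exists purely so the logic can be exercised without a live replica.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_get_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if it fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar failures.
        return 0


def is_ready(
    replica_url: str,
    probe_path: str = "/",
    get_status: Callable[[str], int] = http_get_status,
) -> bool:
    """Treat a replica as ready only if the probe path answers 200 OK."""
    url = replica_url.rstrip("/") + "/" + probe_path.lstrip("/")
    return get_status(url) == 200
```

For the example service, `is_ready("http://<replica-ip>:9000", "/")` would return `True` once `python -m http.server 9000` is answering requests, where `<replica-ip>` is a placeholder for a replica's address.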

> **💡 Hint** - Check more configuration in our [Service YAML docs](https://skypilot.readthedocs.io/en/latest/serving/service-yaml-spec.html)!

## <span style="color:green">[DIY]</span> 📝 Edit `service.yaml` to set the number of replicas!

We have provided an example service YAML (`service.yaml`) which launches a service adapted from the previous `inference.yaml`. However, it does not specify the target number of replicas.

**Edit `service.yaml` to set the target number of replicas!**

Your final script should have a section like this:

---------------------
```yaml
...
service:
  replicas: 2
...
```
---------------------

## <span style="color:green">[DIY]</span> 💻 Spin up a service!

**Run `sky serve up service.yaml -n llm-service` to spin up a service for the inference endpoint.**

-------------------------
```console
$ sky serve up service.yaml -n llm-service
```
-------------------------

### Expected output

-------------------------
```console
```
-------------------------