<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Serving LLMs on Any Cloud ☁️

In tutorial 02, we finetuned an LLM. In this tutorial, we will use the finetuned LLM to serve user requests. We will also learn how to use SkyServe, a simple, cost-efficient, multi-region, multi-cloud library for serving GenAI models, to handle an escalating request rate: with SkyServe, we will one-click deploy a serving endpoint with autoscaling and load balancing.

# Learning outcomes 🎯

After completing this notebook, you will be able to:

1. Open ports on your cluster to the public internet for inference.
2. Queue a job to serve a finetuned LLM.
3. Access the model endpoint and chat with the finetuned model.
4. Deploy a model across clouds using SkyServe with high availability, cost-efficiency, and scalability.

# Opening ports on your cluster 🔓

To access the model, we need public internet access. SkyPilot allows you to open ports on your cluster to the public internet for inference. This is done by specifying the `ports` field under `resources`, which accepts either a single entry or a list:

```yaml
resources:
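  # Opening one port
  ports: 9000
  # Or a list of ports (use this instead of the single entry above;
  # a YAML mapping cannot contain the `ports` key twice)
  # ports:
  #   # Each entry can be a single port
  #   - 30001
  #   - 32767
  #   # Or a port range
  #   - 10000-10080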

setup: ....

run: .....
```

> **💡 Hint** - After changing the ports, run `sky launch` again on the cluster to open them. You can also specify a list of ports to open multiple ports on your cluster. Refer to the [Opening Ports docs](https://skypilot.readthedocs.io/en/latest/examples/ports.html) for more information on specifying multiple ports and the port life cycle.
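
To make the accepted `ports` forms concrete, here is a minimal Python sketch that expands a single port, a list of ports, or a range string into the individual port numbers. `expand_ports` is a hypothetical helper written for this tutorial, not part of SkyPilot, and it assumes that a range such as `10000-10080` includes both endpoints.

```python
def expand_ports(spec):
    """Expand a `ports` value (int, str, or list of either) into a sorted list of ints.

    Hypothetical helper for illustration only; not part of SkyPilot.
    """
    entries = spec if isinstance(spec, list) else [spec]
    ports = set()
    for entry in entries:
        text = str(entry)
        if "-" in text:
            # Assumption: a range like "10000-10080" is inclusive on both ends.
            start, end = (int(part) for part in text.split("-", 1))
            if start > end:
                raise ValueError(f"invalid port range: {text}")
            ports.update(range(start, end + 1))
        else:
            ports.add(int(text))
    for port in ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
    return sorted(ports)
```

For example, `expand_ports(9000)` returns `[9000]`, and `expand_ports([30001, 32767, "10000-10080"])` returns 83 ports under the inclusive-range assumption.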

## <span style="color:green">[DIY]</span> 📝 Edit `inference.yaml` to open ports to the public internet!

We have provided an example YAML (`inference.yaml`) which launches an inference task using the model we've just finetuned. However, it does not specify any ports for accepting incoming requests.

**Edit `inference.yaml` to open port 9000 to the public internet!**

Your final script should have a `resources` section like this:

---------------------
```yaml
...
resources:
  ports: 9000
...
```
---------------------

## <span style="color:green">[DIY]</span> 💻 Queue an inference job!

**Run `sky launch -c llm inference.yaml` to launch an inference endpoint for the model you just trained.**

> **💡 Hint** - We use `sky launch` here to open port 9000 to the public internet. If you have already launched the inference task before, you can use `sky exec` to queue the job.

-------------------------
```console
$ sky launch -c llm inference.yaml
```
-------------------------

### Expected output

-------------------------
```console
Start streaming logs for job 2.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
```
-------------------------

If you submit the serving task before training is complete, SkyPilot will automatically queue the job and start it once the training task is complete.

**You can check the job queue by running `sky queue`.**

-------------------------
```console
$ sky queue
```
-------------------------

### Expected output

-------------------------
```console
Fetching and parsing job queue...

Job queue of cluster llm
ID  NAME  SUBMITTED    STARTED  DURATION  RESOURCES  STATUS      LOG                                        
2   -     33 secs ago  -        -         1x [L4:1]  PENDING     ~/sky_logs/sky-2023-10-20-05-41-43-077736  
1   -     1 min ago    -        -         1x [L4:2]  SETTING_UP  ~/sky_logs/sky-2023-10-20-05-40-34-449584  
```
-------------------------


## <span style="color:green">[DIY]</span> 💻 Accessing your model endpoint with `sky status --endpoint`!

You can use `sky status --endpoint` to get the endpoint of the exposed ports on your cluster. This is useful for accessing the model endpoint for inference.

**Open a new terminal window and run:**

-------------------------
```console
ENDPOINT=$(sky status llm --endpoint 9000); echo $ENDPOINT
```
-------------------------

### Expected output
-------------------------
```console
$ ENDPOINT=$(sky status llm --endpoint 9000); echo $ENDPOINT
35.245.131.181:9000
```
-------------------------

> **💡 Hint** - You can also use `sky status --endpoints` to get all endpoints opened for your cluster! Refer to the [SkyPilot CLI docs](https://skypilot.readthedocs.io/en/latest/reference/cli.html#cmdoption-sky-status-endpoints) for more information.

After you have the endpoint, you can use `curl` to send a request to the model for inference:

-------------------------
```console
curl http://$ENDPOINT/v1/chat/completions -s \
    -H "Content-Type: application/json" \
    -d '{
      "model": "skychat",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "Who are you? Who trained you?"
        }
      ],
      "stop_token_ids": [128009, 128001]
    }' | jq '.choices[0].message.content'
```
-------------------------

### Expected output
-------------------------
```console
$ curl http://$ENDPOINT/v1/chat/completions ...
"My name is SkyChat, and I'm a language model based on Llama 3.2 1B developed by Tian at SkyCamp 2024 using SkyPilot."
```
-------------------------
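
For scripts, the same request can also be sent from Python. The sketch below mirrors the `curl` command above using only the standard library. `build_chat_request` and `chat` are hypothetical helper names, and the code assumes the server accepts the same JSON payload and returns the same response shape as in the `curl` example.

```python
import json
import urllib.request


def build_chat_request(endpoint, user_message, model="skychat"):
    """Build (but do not send) a POST request to the chat completions route."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        # Same stop tokens as in the curl example above.
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(endpoint, user_message, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(endpoint, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Equivalent to the jq filter '.choices[0].message.content'.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("35.245.131.181:9000", "Who are you? Who trained you?")` with your own endpoint in place of the example address should return the reply as a Python string.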

## <span style="color:green">[DIY]</span> 💻 Clean up your cluster!

Similar to what we did in tutorial 1, let's terminate the cluster with `sky down`! Cleaning up the resources not only saves cost, but also makes your console less cluttered ;)

**Run `sky down` to terminate your llm cluster:**

-------------------------
```console
sky down llm
```
-------------------------

# Scaling your model with SkyServe 🚀

Now that we have a single model endpoint, we can use it to serve some user requests. However, when the request rate escalates, a single model endpoint may not be enough to handle the load, and a serving system that can scale with the request rate is needed. SkyPilot has you covered with SkyServe, an open-source library that takes an existing serving framework and deploys it across one or more regions or clouds, using intelligent optimization techniques to pick the right resources to serve GenAI reliably with reduced cost.

Serving with SkyServe is as simple as adding a service configuration to your existing inference task. The following YAML describes a minimal service configuration for serving a Python HTTP server:

```yaml
service:
  replicas: 2
  readiness_probe: /

resources:
  ports: 9000

run: python -m http.server 9000
```

In this example, we have set the number of replicas to 2, which means that SkyServe will deploy two replicas of the Python HTTP server. We have also set the readiness probe to `/`, which means that SkyServe will check the health of each replica by sending a GET request to `/` and expecting a 200 OK response. If a replica does not respond with 200 OK, SkyServe will restart it.
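
The idea behind a readiness probe can be sketched in a few lines of Python. This is only an illustration of the check described above, not SkyServe's actual implementation: `probe_ready` is a hypothetical function, and the 15-second default timeout is an assumption chosen for the example.

```python
import http.client
import urllib.error
import urllib.request


def probe_ready(base_url, path="/", timeout=15):
    """Return True if a GET to base_url + path answers 200 OK, else False."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and error statuses (4xx/5xx)
        # all count as "not ready".
        return False
```

For the minimal service above, `probe_ready("http://<replica-ip>:9000")` would return `True` once `python -m http.server 9000` is accepting requests on that replica.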

> **💡 Hint** - Find more configuration options in our [Service YAML docs](https://skypilot.readthedocs.io/en/latest/serving/service-yaml-spec.html)!

## <span style="color:green">[DIY]</span> 📝 Edit `service.yaml` to select the number of replicas!

We have provided an example service YAML (`service.yaml`) which launches a service adapted from the previous `inference.yaml`. However, it does not specify the target number of replicas.

**Edit `service.yaml` to set the target number of replicas!**

Your final script should have a service section like this:

---------------------
```yaml
...
service:
  replicas: 2
  readiness_probe: /v1/models
...
```
---------------------

## <span style="color:green">[DIY]</span> 💻 Spin up a service!

**Run `sky serve up service.yaml -n llm-service --env BUCKET=skypilot-<timestamp>` to spin up a service for the inference endpoint.**

> **💡 Hint** - You can check the bucket name using `sky storage ls`!

-------------------------
```console
$ sky serve up service.yaml -n llm-service --env BUCKET=skypilot-1729112787
```
-------------------------

### Expected output

-------------------------
```console
Service from YAML spec: service.yaml
Verifying bucket for storage skypilot-1729112787
Storage type StoreType.GCS already exists.
Service Spec:
Readiness probe method:           GET /v1/models
Readiness initial delay seconds:  1200
Readiness probe timeout seconds:  15
Replica autoscaling policy:       Fixed 2 replica
Spot Policy:                      No spot fallback policy

Each replica will use the following resources (estimated):
Considered resources (1 node):
--------------------------------------------------------------------------------------------
 CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------
 GCP     g2-standard-8   8       32        L4:1           us-east4-a    0.85          ✔     
--------------------------------------------------------------------------------------------
Launching a new service 'llm-service'. Proceed? [Y/n]:
...
⚙︎ Service registered.

Service name: llm-service
Endpoint URL: 34.21.38.198:30001
📋 Useful Commands
├── To check service status:    sky serve status llm-service [--endpoint]
├── To teardown the service:    sky serve down llm-service
├── To see replica logs:        sky serve logs llm-service [REPLICA_ID]
├── To see load balancer logs:  sky serve logs --load-balancer llm-service
├── To see controller logs:     sky serve logs --controller llm-service
├── To monitor the status:      watch -n10 sky serve status llm-service
└── To send a test request:     curl 34.21.38.198:30001

✓ Service is spinning up and replicas will be ready shortly.
```
-------------------------

## <span style="color:green">[DIY]</span> 💻 Check the status of your service!

**Run `sky serve status llm-service` to check the latest status of your service.**

-------------------------
```console
$ sky serve status llm-service
```
-------------------------

### Expected output

-------------------------
```console
$ sky serve status llm-service
Services
NAME         VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT            
llm-service  -        -       NO_REPLICA  0/2       34.21.38.198:30001  

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                  LAUNCHED        RESOURCES          STATUS        REGION    
llm-service   1   1        http://34.21.34.248:9000  a few secs ago  1x GCP({'L4': 1})  PROVISIONING  us-east4
llm-service   2   1        http://34.16.75.157:9000  a few secs ago  1x GCP({'L4': 1})  PROVISIONING  us-east4
```
-------------------------

SkyServe runs in the background and makes the service available to the public internet. You can use `watch -n10 sky serve status llm-service` to continuously monitor the status of the service.

## <span style="color:green">[DIY]</span> 💻 Access your service endpoint!

Similar to `sky status --endpoint`, you can use `sky serve status --endpoint` to get the endpoint of the service.

**Run `sky serve status llm-service --endpoint` to get the endpoint of the service.**

-------------------------
```console
$ ENDPOINT=$(sky serve status llm-service --endpoint); echo $ENDPOINT
```
-------------------------

### Expected output

-------------------------
```console
$ ENDPOINT=$(sky serve status llm-service --endpoint); echo $ENDPOINT
34.21.38.198:30001
```
-------------------------

> **💡 Hint** - You can also find this service endpoint in the output of `sky serve up` or `sky serve status`.

**Run `curl http://$ENDPOINT/v1/models` to check whether your service is ready.**

-------------------------
```console
$ curl http://$ENDPOINT/v1/models -s | jq
```
-------------------------

### Expected output

When the service is initializing, you may see the following output:

-------------------------
```console
$ curl http://$ENDPOINT/v1/models -s | jq
{
  "detail": "No ready replicas. Use \"sky serve status [SERVICE_NAME]\" to check the replica status."
}
```
-------------------------

After the service is ready, you should see the following output:

-------------------------
```console
$ curl http://$ENDPOINT/v1/models -s | jq
{
  "object": "list",
  "data": [
    {
      "id": "skychat",
      "object": "model",
      "created": 1729115282,
      "owned_by": "vllm",
      "root": "/artifacts/skychat",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-5f758c4f971842a7b10c7158d82dce4a",
          "object": "model_permission",
          "created": 1729115282,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
```
-------------------------
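
The two responses above can be told apart programmatically, which is handy when a script needs to wait for the service. The sketch below is a hypothetical helper (`ready_model_ids` is not part of SkyPilot) and it assumes the response bodies have the shapes shown in the two example outputs.

```python
import json


def ready_model_ids(response_text):
    """Return the model ids listed in a /v1/models response, or [] if not ready."""
    body = json.loads(response_text)
    if body.get("object") != "list":
        # For example {"detail": "No ready replicas. ..."} while replicas start up.
        return []
    return [entry["id"] for entry in body.get("data", [])]
```

With the ready response above, the function returns `["skychat"]`; with the "No ready replicas" response, it returns an empty list.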

## <span style="color:green">[DIY]</span> 💻 Send a real LLM request to your service endpoint!

Similar to what we've done earlier in this tutorial, you can use `curl` to send a request to the service for inference:

-------------------------
```console
$ curl http://$ENDPOINT/v1/chat/completions -s \
    -H "Content-Type: application/json" \
    -d '{
      "model": "skychat",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is SkyPilot? How does SkyPilot work?"
        }
      ],
      "stop_token_ids": [128009, 128001]
    }' | jq '.choices[0].message.content'
"SkyPilot is from the University of California, Berkeley Sky Computing Lab, which is an open-source framework for running AI on any cloud."
```
-------------------------

## <span style="color:green">[DIY]</span> 💻 Clean up your service!

Cleaning up the service is as simple as **running `sky serve down llm-service`**! All resources in all clouds get cleaned up with one simple command:

-------------------------
```console
$ sky serve down llm-service
Terminating service(s) 'llm-service'. Proceed? [Y/n]: 
Service 'llm-service' is scheduled to be terminated.
```
-------------------------

The termination of services may take a few minutes. You can check the status of the service by running `sky serve status llm-service`.

#### 🎉 Congratulations! You have learned how to serve LLMs with SkyServe! Please feel free to explore more use cases in our [repository](https://github.com/skypilot-org/skypilot), [blog](https://blog.skypilot.co/) and [documentation](https://skypilot.readthedocs.io/en/latest/). Please join our Slack: [slack.skypilot.co](http://slack.skypilot.co)

#### Quick survey: https://tinyurl.com/skypilot-survey