<p style="text-align:center;">
    <img src="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/skypilot-wide-light-1k.png" width=500>
</p>

# Serving LLMs on Any Cloud ☁️

In Tutorial 02, we fine-tuned an LLM. Now, let's take it a step further! In this tutorial, we'll use that fine-tuned LLM to handle user requests on a SkyPilot cluster. To handle escalating request rates, we'll also explore **SkyServe**, a simple, cost-efficient, multi-region, and multi-cloud library designed for serving GenAI models. By the end of this tutorial, you'll know how to use SkyServe to deploy a serving endpoint with autoscaling and load balancing, all with a single command!

# Learning Outcomes 🎯

By the end of this tutorial, you will be able to:

1. Open ports on your cluster to make it accessible from the public internet.
2. Queue a job to serve the fine-tuned LLM.
3. Access the model's endpoint and interact with your fine-tuned model.
4. Deploy your model across multiple clouds using SkyServe, ensuring high availability, cost-efficiency, and scalability.

# Opening ports on your cluster 🔓

To reach the model from outside the cluster, its serving port must be open to the public internet. SkyPilot makes it easy to open ports on your cluster for inference by specifying the `ports` field under `resources`:

```yaml
resources:
  ports: 9000

setup: ...

run: ...
```

> **💡 Hint** - After updating the ports, be sure to run `sky launch` again on your cluster to open them. You can specify multiple ports to open a range of connections as needed. For more details on configuring ports and understanding their lifecycle, check out the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/examples/ports.html).

## <span style="color:green">[DIY]</span> 📝 Edit `inference.yaml` to open ports to the public internet!

We’ve provided an example YAML file (`inference.yaml`) that launches an inference task using the model we just fine-tuned. However, it doesn’t specify any ports for accepting incoming requests.

**Edit `inference.yaml` to open port 9000 to the public internet!**

Your final script should have a `ports` section like this:

---------------------
```yaml
...
resources:
  accelerators: L4:1
  ports: 9000
...
```
---------------------

## <span style="color:green">[DIY]</span> 💻 Queue an inference job!

**Run `sky launch -c llm inference.yaml` to launch an inference endpoint for the model you just trained.**

> **💡 Hint** - We use `sky launch` here because it opens port 9000 to the public internet. If you have already launched the inference task before, you can use `sky exec` to queue the job instead.

-------------------------
```console
$ sky launch -c llm inference.yaml
Start streaming logs for job 2.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
```
-------------------------

If you submit the serving task before training is complete, SkyPilot will automatically queue the job and start it once the training task is complete.

**You can check the job queue by running `sky queue`.**

-------------------------
```console
$ sky queue
Fetching and parsing job queue...

Job queue of cluster llm
ID  NAME  SUBMITTED    STARTED  DURATION  RESOURCES  STATUS      LOG                                        
2   -     33 secs ago  -        -         1x [L4:1]  PENDING     ~/sky_logs/sky-2024-10-18-05-41-43-077736  
1   -     1 min ago    -        -         1x [L4:4]  SETTING_UP  ~/sky_logs/sky-2024-10-18-05-40-34-449584  
```
-------------------------


## <span style="color:green">[DIY]</span> 💻 Accessing your model endpoint with `sky status --endpoint`!

After the fine-tuning task is complete and your inference task is up and running, you can use the command `sky status --endpoint` to retrieve the endpoint for the exposed ports on your cluster.

**Open a new terminal window and run:**

-------------------------
```console
$ ENDPOINT=$(sky status llm --endpoint 9000); echo $ENDPOINT
35.245.131.181:9000
```
-------------------------

> **💡 Hint** - You can also use `sky status --endpoints` to list all endpoints opened for your cluster! For more details, check out the [SkyPilot CLI documentation](https://skypilot.readthedocs.io/en/latest/reference/cli.html#cmdoption-sky-status-endpoints).
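
If a request to the endpoint hangs, a quick TCP check can help tell "the port is not reachable" apart from "the server is not answering yet". The following is a minimal sketch using only the Python standard library; the helper names `split_endpoint` and `is_port_open` are hypothetical and are not part of SkyPilot.

```python
import socket


def split_endpoint(endpoint: str) -> tuple[str, int]:
    """Split a 'host:port' string such as '35.245.131.181:9000'."""
    host, _, port = endpoint.strip().rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    return host, int(port)


def is_port_open(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    host, port = split_endpoint(endpoint)
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_port_open("35.245.131.181:9000")` would check the endpoint printed above. A `True` result only means something is listening on the port; it does not confirm the model server has finished loading.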

Once you have the endpoint, you can use `curl` to send a request to the model for inference:

-------------------------
```console
$ curl http://$ENDPOINT/v1/chat/completions -s \
    -H "Content-Type: application/json" \
    -d '{
      "model": "skychat",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "Who are you? Who trained you?"
        }
      ],
      "stop_token_ids": [128009, 128001]
    }' | jq '.choices[0].message.content'
"My name is SkyChat, and I'm a language model based on Llama 3.2 1B developed by Tian at SkyCamp 2024 using SkyPilot."
```
-------------------------
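
The same request can be sent from Python. The sketch below mirrors the `curl` call above (same path, headers, and JSON body) using only the standard library. The function names `build_chat_request`, `extract_reply`, and `chat` are illustrative helpers, not part of SkyPilot or the serving framework.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, user_message: str,
                       model: str = "skychat") -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        # Same stop tokens as in the curl example.
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Equivalent of the jq filter '.choices[0].message.content'."""
    return response_body["choices"][0]["message"]["content"]


def chat(endpoint: str, user_message: str, timeout: float = 60.0) -> str:
    """Send one chat message and return the model's reply text."""
    request = build_chat_request(endpoint, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the endpoint from `sky status llm --endpoint 9000`, calling `chat("35.245.131.181:9000", "Who are you? Who trained you?")` should return the same kind of reply as the `curl` command.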

## <span style="color:green">[DIY]</span> 💻 Clean up your cluster!

Just like we did in Tutorial 01, let's clean up by terminating the cluster with `sky down`! Not only does this save on costs, but it also keeps your console neat and tidy. 😉

**Run `sky down` to terminate your `llm` cluster:**

-------------------------
```console
sky down llm
```
-------------------------

# Scaling Your Model with SkyServe 🚀

Now that we have a single model endpoint, we can use it to serve user requests. However, when request rates escalate, a single model endpoint may not be enough to handle the load. A serving system that can **scale with the request rate** is crucial. SkyPilot has you covered with **SkyServe**, an open-source library that deploys an existing serving framework across multiple regions or clouds. It uses **intelligent optimization techniques** to pick the right resources, ensuring reliable serving of GenAI models at a reduced cost.

Serving with SkyServe is as simple as adding a service configuration to your existing inference task. The following YAML describes a minimal service configuration for serving a Python HTTP server:

```yaml
service:
  replicas: 2
  readiness_probe: /

resources:
  ports: 9000

run: python -m http.server 9000
```

In this example, we’ve set the number of replicas to 2, which means SkyServe will deploy two instances of the Python HTTP server. We’ve also defined the readiness probe as `/`, indicating that SkyServe will monitor the health of each replica by sending a GET request to `/` and expecting a 200 OK response. If a replica fails to respond with a 200 OK, SkyServe will automatically restart it.
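
To make the readiness check concrete, here is a minimal sketch of the probe described above: send a GET request to a path and treat only a 200 OK as healthy. This is an illustration of the idea, not SkyServe's actual implementation; the function name `probe_ready` and the 15-second default timeout are assumptions.

```python
import urllib.error
import urllib.request


def probe_ready(base_url: str, path: str = "/", timeout: float = 15.0) -> bool:
    """Return True only if GET base_url + path answers with 200 OK."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses (4xx/5xx), refused connections,
        # and timeouts: all count as "not ready".
        return False
```

Against the example service above, `probe_ready("http://<replica-ip>:9000", "/")` would return `True` once `python -m http.server 9000` is running on the replica.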

> **💡 Hint** - Explore more configuration options in our [Service YAML documentation](https://skypilot.readthedocs.io/en/latest/serving/service-yaml-spec.html)!

## <span style="color:green">[DIY]</span> 📝 Edit `service.yaml` to set the number of replicas!

We’ve provided an example service YAML file (`service.yaml`) that launches a service adapted from the previous `inference.yaml`. However, it doesn’t specify the target number of replicas.

**Edit `service.yaml` to set the target number of replicas!**

Your final script should include a service section like this:

---------------------
```yaml
...
service:
  replicas: 2
  readiness_probe: /v1/models
...
```
---------------------
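
The `replicas: 2` setting above keeps the replica count fixed. To build intuition for what scaling with the request rate means, the sketch below computes how many replicas a given load would need, clamped to a minimum and maximum. This is a simplified model for illustration only: the function and its parameter names are assumptions, and SkyServe's real autoscaling options are described in the Service YAML documentation.

```python
import math


def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to serve `qps`, clamped to [min_replicas, max_replicas]."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    # Round up so that no replica is asked to exceed its target rate.
    needed = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

For instance, at 12 requests per second with a target of 5 per replica, three replicas are needed; the clamp keeps the count within the configured bounds when traffic is very low or very high.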

## <span style="color:green">[DIY]</span> 💻 Spin up a service!

**Run `sky serve up` to spin up a service for the inference endpoint!**

> **💡 Hint** - The command below passes your bucket name through `--env BUCKET=...`. You can find the bucket name by running `sky storage ls`.

-------------------------
```console
$ sky serve up service.yaml -n llm-service --env BUCKET=skypilot-1729112787
Service from YAML spec: service.yaml
Verifying bucket for storage skypilot-1729112787
Storage type StoreType.GCS already exists.
Service Spec:
Readiness probe method:           GET /v1/models
Readiness initial delay seconds:  1200
Readiness probe timeout seconds:  15
Replica autoscaling policy:       Fixed 2 replica
Spot Policy:                      No spot fallback policy

Each replica will use the following resources (estimated):
Considered resources (1 node):
--------------------------------------------------------------------------------------------
 CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
--------------------------------------------------------------------------------------------
 GCP     g2-standard-8   8       32        L4:1           us-east4-a    0.85          ✔     
--------------------------------------------------------------------------------------------
Launching a new service 'llm-service'. Proceed? [Y/n]:
...
⚙︎ Service registered.

Service name: llm-service
Endpoint URL: 34.21.38.198:30001
📋 Useful Commands
├── To check service status:    sky serve status llm-service [--endpoint]
├── To teardown the service:    sky serve down llm-service
├── To see replica logs:        sky serve logs llm-service [REPLICA_ID]
├── To see load balancer logs:  sky serve logs --load-balancer llm-service
├── To see controller logs:     sky serve logs --controller llm-service
├── To monitor the status:      watch -n10 sky serve status llm-service
└── To send a test request:     curl 34.21.38.198:30001

✓ Service is spinning up and replicas will be ready shortly.
```
-------------------------

## <span style="color:green">[DIY]</span> 💻 Check the status of your service!

**Run `sky serve status llm-service` to check the latest status of your service.**

-------------------------
```console
$ sky serve status llm-service
Services
NAME         VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT            
llm-service  -        -       NO_REPLICA  0/2       34.21.38.198:30001  

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                  LAUNCHED        RESOURCES          STATUS        REGION    
llm-service   1   1        http://34.21.34.248:9000  a few secs ago  1x GCP({'L4': 1})  PROVISIONING  us-east4
llm-service   2   1        http://34.16.75.157:9000  a few secs ago  1x GCP({'L4': 1})  PROVISIONING  us-east4
```
-------------------------

SkyServe will run in the background, making the service accessible to the public internet. You can use `watch -n10 sky serve status llm-service` to continuously monitor the status of the service.

## <span style="color:green">[DIY]</span> 💻 Access your service endpoint!

Just like with `sky status --endpoint`, you can use `sky serve status --endpoint` to retrieve the service's endpoint.

**Run `sky serve status llm-service --endpoint` to get the endpoint of your service.**

-------------------------
```console
$ ENDPOINT=$(sky serve status llm-service --endpoint); echo $ENDPOINT
34.21.38.198:30001
```
-------------------------

> **💡 Hint** - You can also find the service endpoint in the output of `sky serve up` or `sky serve status`.

**Run `curl http://$ENDPOINT/v1/models` to check whether your service is ready.**

-------------------------
```console
curl http://$ENDPOINT/v1/models -s | jq
```
-------------------------

### Expected output

When the service is initializing, you may see the following output:

-------------------------
```console
$ curl http://$ENDPOINT/v1/models -s | jq
{
  "detail": "No ready replicas. Use \"sky serve status [SERVICE_NAME]\" to check the replica status."
}
```
-------------------------

After the service is ready, you should see the following output:

-------------------------
```console
$ curl http://$ENDPOINT/v1/models -s | jq
{
  "object": "list",
  "data": [
    {
      "id": "skychat",
      "object": "model",
      "created": 1729115282,
      "owned_by": "vllm",
      "root": "/artifacts/skychat",
      "parent": null,
      "max_model_len": 131072,
      "permission": [
        {
          "id": "modelperm-5f758c4f971842a7b10c7158d82dce4a",
          "object": "model_permission",
          "created": 1729115282,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
```
-------------------------
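
Rather than re-running `curl` by hand, you can poll the endpoint until the model list appears. The sketch below distinguishes the two outputs shown above: a body with a `detail` message (not ready) and a body with a `data` list (ready). The helper names are hypothetical, and the sketch assumes the "No ready replicas" message may arrive with a non-200 status, so it reads the JSON body from error responses as well.

```python
import json
import time
import urllib.error
import urllib.request


def ready_models(body) -> list[str]:
    """Return the model IDs in a /v1/models response, or [] if not ready."""
    if not isinstance(body, dict):
        return []
    return [model["id"] for model in body.get("data", [])]


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a URL and parse the JSON body; return {} on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        # Error responses may still carry a JSON body (e.g. "detail").
        try:
            return json.load(err)
        except ValueError:
            return {}
    except (OSError, ValueError):
        return {}


def wait_until_ready(endpoint: str, interval: float = 10.0,
                     max_wait: float = 1800.0) -> list[str]:
    """Poll /v1/models until at least one model is listed."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        models = ready_models(fetch_json(f"http://{endpoint}/v1/models"))
        if models:
            return models
        time.sleep(interval)
    raise TimeoutError(f"service at {endpoint} not ready after {max_wait}s")
```

Calling `wait_until_ready("34.21.38.198:30001")` would block until the service reports a model such as `skychat`, checking every 10 seconds.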

## <span style="color:green">[DIY]</span> 💻 Send a Real LLM Request to Your Service Endpoint!

Just like we did earlier in this tutorial, you can use `curl` to send a request to the service for inference:

> **💡 Hint** - If you use `curl` multiple times, SkyServe will automatically distribute the requests across all replicas for load balancing.

-------------------------
```console
$ curl http://$ENDPOINT/v1/chat/completions -s \
    -H "Content-Type: application/json" \
    -d '{
      "model": "skychat",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is SkyPilot? How does SkyPilot work?"
        }
      ],
      "stop_token_ids": [128009, 128001]
    }' | jq '.choices[0].message.content'
"SkyPilot is from the University of California, Berkeley Sky Computing Lab, which is an open-source framework for running AI on any cloud."
```
-------------------------
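
To picture what distributing requests across replicas looks like, here is a toy round-robin balancer. Round-robin is used purely as an illustrative assumption: this tutorial does not state which policy SkyServe's load balancer applies, and the class `RoundRobinBalancer` is hypothetical.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hand out replica URLs in a fixed rotating order."""

    def __init__(self, replicas: list[str]):
        if not replicas:
            raise ValueError("at least one replica is required")
        # Copy the list so later changes by the caller have no effect.
        self._rotation = cycle(list(replicas))

    def pick(self) -> str:
        """Return the replica that should receive the next request."""
        return next(self._rotation)


# Replica URLs taken from the `sky serve status` output earlier.
balancer = RoundRobinBalancer(
    ["http://34.21.34.248:9000", "http://34.16.75.157:9000"]
)
counts = Counter(balancer.pick() for _ in range(10))
```

After ten picks, `counts` holds five requests for each of the two replicas, which is the even spread the hint describes.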

## <span style="color:green">[DIY]</span> 💻 Clean up your service!

Cleaning up the service is as simple as **running `sky serve down`**! This single command cleans up all of the service's resources across all clouds:

-------------------------
```console
$ sky serve down llm-service
Terminating service(s) 'llm-service'. Proceed? [Y/n]: 
Service 'llm-service' is scheduled to be terminated.
```
-------------------------

Terminating services may take a few minutes. You can check the status of the service by running `sky serve status llm-service`.

#### 🎉 Congratulations! You've learned how to serve LLMs with SkyServe! 

Feel free to explore more use cases in our [repository](https://github.com/skypilot-org/skypilot), [blog](https://blog.skypilot.co/), and [documentation](https://skypilot.readthedocs.io/en/latest/). 

We'd love to hear from you! Join our community on Slack: [slack.skypilot.co](http://slack.skypilot.co).

#### Quick Survey for Today's Event

We’d love your feedback! Please take a moment to fill out our quick survey: [https://tinyurl.com/skypilot-survey](https://tinyurl.com/skypilot-survey).