Blog cleanup for 05 (#547)
Looks like the public blog requires absolute paths for image assets.
Also corrected a couple of typos that slipped through.

## Checks

- [ ] `make lint`: I've run `make lint` to lint the changes in this PR.
- [ ] `make test`: I've made sure the tests (`make test-cpu` or `make test`) are passing.
- Additional tests:
   - [ ] Benchmark tests (when contributing new models)
   - [ ] GPU/HW tests
outtanames committed Feb 20, 2024
1 parent c579af9 commit 148724f
Showing 1 changed file with 13 additions and 14 deletions.

(Originally published at https://scottloftin.substack.com/p/lets-build-an-ml-sre-bot-with-nos)

In this post, we'll be going over a set of experiments we've run with the NOS profiler and Skypilot to automatically answer questions about your infra using retrieval on top of pricing and performance data.

Figuring out the best platform for a given model begins with benchmarking, and unfortunately this is still somewhat painful on Nvidia hardware today, let alone other platforms. Most folks rely on leaderboards published by Anyscale, Huggingface, Martian (my personal favorite!), and many others, but setting aside debates over methodology and fairness, we are still mostly looking at quoted numbers for a top-level inference API without a lot of knobs to turn on the underlying HW. NOS provides a profiling tool that can benchmark Text/Image embedding, Image Generation, and Language models across GPUs and across clouds. Let's start with some local numbers on a 2080:

```bash
nos profile method --method encode_image
nos profile list
```

<img src="/docs/blog/assets/nos_profile_list.png" width="100%">

We see a breakdown across four different image embedding models including the method and task (interchangeable in this case; each CLIP variant supports both Image and Text embedding as methods), the Iterations per Second, the GPU memory footprint (how much space the model had to allocate), and finally the GPU utilization, which measures how efficiently we are using the HW (in a very broad sense). A few things to note: the image size is fixed to 224x224x1 across all runs with a batch size of 1. In practice, the Iterations/Second will depend tremendously on tuning the batch size and image resolution for our target HW, which will be the subject of a follow-up post. For now, we'll take these numbers at face value and see what we can work out about how exactly to run a large embedding workload. We're going to use Skypilot to deploy the profiler to a Tesla T4 instance on GCP:

```bash
sky launch -c nos-profiling-service skyserve.dev.yaml --gpus=t4
```
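
If you'd rather drive the launch from a script, SkyPilot also has a Python API. Here's a rough, untested sketch of the equivalent call, where `skyserve.dev.yaml` is whatever task file the NOS repo ships and the cluster name simply mirrors the CLI flags above:

```python
import sky

# Rough programmatic equivalent of the `sky launch` command above (a sketch,
# not the exact setup from this post). Assumes SkyPilot is installed and a GCP
# account is already configured.
task = sky.Task.from_yaml("skyserve.dev.yaml")
task.set_resources(sky.Resources(cloud=sky.GCP(), accelerators="T4:1"))
sky.launch(task, cluster_name="nos-profiling-service")
```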

This will run the above model variants on our chosen instance and write everything to a bucket in GCS as a JSON file, which I’ve already downloaded:

```
[
  ...
"cleanup::memory_cpu::post":19445149696,...
```

This snippet shows a breakdown of memory allocation and runtime across various stages of execution (we’re mostly interested in forward::execution). Even this short list of profiling info is quite a lot to pick apart: we would ordinarily be doing a lot of back-of-the-envelope math and might scrub through each iteration in the Chrome profiler, but let’s see if we can streamline things a bit with more 'modern' tools.
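
For comparison, here's roughly what the manual route looks like. A minimal sketch that assumes the downloaded catalog is a JSON list of per-model records with a `forward::execution` latency field like the snippet above; the file name, model-name key, and units are guesses for illustration, so adjust them to the real schema:

```python
import json

# Load the profiling catalog pulled down from the GCS bucket (file name assumed).
with open("nos_profiling_catalog.json") as f:
    records = json.load(f)

# Assume each record carries a model name and a per-iteration forward time in
# seconds under "forward::execution"; sort fastest-first and print iterations/sec.
for rec in sorted(records, key=lambda r: r.get("forward::execution", float("inf"))):
    name = rec.get("model_name", "<unknown>")
    forward_s = rec.get("forward::execution")
    if forward_s:
        print(f"{name}: {1.0 / forward_s:.1f} iterations/sec")
```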

## 🤖 Our InfraBot Chat Assistant

The OpenAI Assistants API is somewhat unstable at the moment, but after a few tries it was able to ingest the profiling catalog as raw JSON and answer a few questions about its contents. The rough setup is sketched below; with that in place, let’s start simple:
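
A minimal sketch of that setup as the Assistants API looked at the time of writing: upload the catalog as a file, attach it to an assistant with the code interpreter tool enabled, and ask questions on a thread. The file name, model choice, and instructions below are placeholders rather than InfraBot's exact configuration:

```python
from openai import OpenAI

client = OpenAI()

# Upload the profiling catalog so the assistant's code interpreter can read it.
catalog = client.files.create(
    file=open("nos_profiling_catalog.json", "rb"), purpose="assistants"
)

assistant = client.beta.assistants.create(
    name="InfraBot",
    instructions="Answer questions about ML inference cost and performance "
    "using the attached NOS profiling catalog.",
    model="gpt-4-turbo-preview",
    tools=[{"type": "code_interpreter"}],
    file_ids=[catalog.id],
)

# Each conversation happens on a thread; post a question and kick off a run.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Can you list the models in the profiling catalog by iterations per second?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
```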

_Hey InfraBot, can you list the models in the profiling catalog by iterations per second?_

<img src="/docs/blog/assets/clip_speed.png" width="100%">

Ok, our raw profiling data is slowly becoming more readable. Let’s see how this all scales with the number of embeddings:

_Can you compute how long it would take to generate embeddings with each model for workload sizes in powers of 10, starting at 1000 image embeddings and ending at 10,000,000. Please plot these for each model in a graph._

<img src="/docs/blog/assets/clip_embedding_times.png" width="100%">

Reasonable: runtime will depend linearly on total embeddings (again, we’re using batch size 1 for illustration purposes).
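
The same back-of-the-envelope calculation is easy to do by hand; a quick sketch with a placeholder throughput (substitute whatever iterations/sec your profiling run actually reports):

```python
# At batch size 1, embedding time scales linearly with workload size:
# hours = num_images / (iterations_per_sec * 3600).
ITERS_PER_SEC = 200.0  # placeholder; take this from the profiling catalog

for n in [10**k for k in range(3, 8)]:  # 1,000 ... 10,000,000 embeddings
    hours = n / (ITERS_PER_SEC * 3600)
    print(f"{n:>12,} embeddings -> {hours:8.2f} hours")
```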

The more interesting question is: what hardware should we use given price constraints? While ChatGPT can probably provide a solid high-level answer describing the tradeoffs between on-demand, spot, and reserved instances, as well as how much the underlying card's performance matters relative to time spent on other tasks like copying data, we’ll need to provide hard numbers on instance prices if we want something concrete.

## 💵 Adding Pricing Information with Skypilot

Skypilot provides a utility for fetching pricing and availability for a variety of instance types across the big 3 CSPs (AWS, Azure, and GCP). I was able to generate the summary below (lightly edited for formatting):
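
The utility in question is most likely `sky show-gpus`; if you want to capture the same table from a script rather than the terminal, shelling out works fine (assuming SkyPilot is installed):

```python
import subprocess

# Capture SkyPilot's per-cloud GPU pricing/availability table for the T4.
result = subprocess.run(
    ["sky", "show-gpus", "T4"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```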

```
GPU QTY CLOUD INSTANCE_TYPE HOURLY_PRICE HOURLY_SPOT_PRICE
...
```

Ok, let's add some dollar signs to our plot above:

_Can you compute how much it would cost on a T4 with 1 GPU to generate embeddings with the cheapest model for workloads of powers of 10, starting at 1000 image embeddings and ending at 10,000,000. Please plot these in a graph._

<img src="/docs/blog/assets/t4_laion_price.png" width="100%">

The above looks reasonable assuming a minimum reservation of 1 hour (we aren’t doing serverless; we need to pay for the whole instance for the whole hour in our proposed cloud landscape). For 10 million embeddings, the total is something like 13 hours, so assuming an on-demand price of $0.35/hr we have $0.35*13 ~= $4.55, pretty close to the graph. But what if we wanted to index something like YouTube with ~500PB of videos? Ok, maybe not the whole site, but a substantial subset, maybe 10^11 images. If we extrapolate the above we’re looking at $40,000 in compute, which we would probably care about fitting to our workload. In particular, we might go with a reserved rather than an on-demand instance for a ~50% discount, but at what point does that pay off? Unfortunately, at the time of writing Skypilot doesn’t seem to include reserved instance pricing by default, but for a single instance type it’s easy enough to track down and feed to InfraBot: a 1-year commitment brings us down to $0.220 per GPU-hour, and a 3-year commitment to $0.160 per GPU-hour. It’s still higher than the spot price of course, but at this scale it’s reasonable to assume some SLA that prevents us from halting indexing on preemption. Let’s see if we can find a break-even point.
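
Before asking InfraBot, here's one crude way to frame the break-even by hand, assuming a reservation means paying for the full commitment term whether or not the GPU stays busy (which is not necessarily how the plots below model it):

```python
HOURS_PER_YEAR = 8760

on_demand = 0.35     # $/GPU-hour, T4 on-demand
reserved_1y = 0.220  # $/GPU-hour with a 1-year commitment
reserved_3y = 0.160  # $/GPU-hour with a 3-year commitment

# Cost of committing for the full term vs. paying on demand only for hours used.
commit_1y = reserved_1y * HOURS_PER_YEAR      # ~$1,927 over one year
commit_3y = reserved_3y * HOURS_PER_YEAR * 3  # ~$4,205 over three years

# The reservation pays off once the on-demand bill for your busy hours exceeds it.
print(f"1-year commitment breaks even at ~{commit_1y / on_demand:,.0f} busy hours")
print(f"3-year commitment breaks even at ~{commit_3y / on_demand:,.0f} busy hours")
```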

_Can you add the cost to reserve a 1 and 3 year instance? A 1 year reservation is $0.220 per gpu per hour, and a 3 year reservation is $0.160 per gpu per hour._

<img src="/docs/blog/assets/reserved_vs_on_demand_first.png" width="100%">

Looks like we need to go a little further to the right:

_Ok can you do the same plot, but at 10^9, 10^10, and 10^11_

<img src="/docs/blog/assets/reserved_vs_on_demand_second.png" width="100%">

10^10 embeddings at $0.35/hr is about $4,860, so this looks roughly correct. 10 billion embeddings is about 100,000 hours of (low-resolution) video at full 30FPS (10^10 frames / 30 FPS ≈ 3.3×10^8 seconds ≈ 93,000 hours), so while it’s quite large, it’s not completely unheard of for a larger video service.
