# TEI Bulk Embedding Load Test

Proposed workflow:

1. Run k6 load tests to grid-search batch size and request rate (k6 handles failed requests well).
2. Use my scripts to confirm that async calls from Python achieve similar performance.

Still to add: a demo of how to keep track of ordering, i.e. how to tie each response back to its input record. `asyncio.gather` returns results in the order the awaitables were passed in, regardless of which request finishes first.
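A minimal sketch of the ordering behavior, using a simulated request instead of a real TEI call. `fake_embed` and `embed_in_order` are hypothetical helpers, not part of `tei_benchmark`. Each fake request sleeps for a random time, so requests complete out of order, yet `asyncio.gather` still returns results in input order. (`asyncio.as_completed`, by contrast, yields results in completion order.)

```python
import asyncio
import random


async def fake_embed(record_id: int, text: str) -> dict:
    """Hypothetical stand-in for one embedding request."""
    # Random delay so requests finish in a different order than they start.
    await asyncio.sleep(random.uniform(0.0, 0.05))
    return {"id": record_id, "embedding": [float(len(text))]}


async def embed_in_order(texts: list[str]) -> list[tuple[str, dict]]:
    # gather() returns results in the order the awaitables were passed in,
    # not the order in which they complete.
    responses = await asyncio.gather(
        *(fake_embed(i, t) for i, t in enumerate(texts))
    )
    # Zipping is therefore enough to tie each response to its input record.
    return list(zip(texts, responses))


texts = [f"record {i}" for i in range(20)]
pairs = asyncio.run(embed_in_order(texts))
```

Inside a notebook cell an event loop is already running, so `pairs = await embed_in_order(texts)` would replace the `asyncio.run(...)` call.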


In [1]:
import os
from dotenv import load_dotenv

from tei_benchmark import generate_batches, embed

%load_ext autoreload
%autoreload 2

In [2]:
load_dotenv(override=True)

True

Goal:

- Embed N chunks of data in total
- Vary the batch_size of each request, as well as the concurrency of requests
- Analyze the total time to complete, RPS, and latency to see which bs / concurrency combination is best.
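The measurement loop behind this goal might look roughly like the sketch below. It is not the `tei_benchmark` implementation: `make_batches`, `fake_request`, and `run_once` are hypothetical names, the request is simulated with a short sleep, and the metric definitions (requests divided by wall time, chunks divided by wall time) are one conventional choice that may differ from what `embed` prints.

```python
import asyncio
import time


def make_batches(text: str, bs: int, total_chunks: int) -> list[list[str]]:
    """Split total_chunks copies of text into batches of at most bs items."""
    chunks = [text] * total_chunks
    return [chunks[i : i + bs] for i in range(0, total_chunks, bs)]


async def fake_request(batch: list[str]) -> int:
    """Hypothetical placeholder for one POST to the embedding endpoint."""
    await asyncio.sleep(0.001)
    return len(batch)  # number of chunks "embedded"


async def run_once(batches: list[list[str]], concurrency_level: int) -> dict:
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(concurrency_level)

    async def bounded(batch: list[str]) -> int:
        async with semaphore:
            return await fake_request(batch)

    start = time.perf_counter()
    counts = await asyncio.gather(*(bounded(b) for b in batches))
    total_time = time.perf_counter() - start

    return {
        "num_requests": len(batches),
        "num_chunks_embedded": sum(counts),
        "total_time": total_time,
        "req_per_sec": len(batches) / total_time,
        "embed_per_sec": sum(counts) / total_time,
    }


batches = make_batches("This is a test", bs=32, total_chunks=1000)
metrics = asyncio.run(run_once(batches, concurrency_level=5))
```

Per-request latency could be captured the same way by timing inside `bounded` and returning the elapsed time alongside the count.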


In [3]:
batches = generate_batches("This is a test", bs=10, total_chunks=1000)
results = await embed(batches, concurrency_level=5, collect_results=True)

Batch Size: 10, Concurrency Level: 5, Total Time: 1.77 seconds, RPS: 11.31, Embed per sec: 565.45


In [14]:
batches = generate_batches("This is a test", bs=32, total_chunks=1000)
results = await embed(batches, concurrency_level=5, collect_results=True)

Batch Size: 32, Concurrency Level: 5, Total Time: 1.66 seconds, RPS: 4.23, Embed per sec: 603.97


In [11]:
batches = generate_batches("This is a test", bs=64, total_chunks=1000)
results = await embed(batches, concurrency_level=5, collect_results=True)

Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 413: {"error":"batch size 64 > maximum allowed batch size 32","error_type":"Validation"}
Request failed with status code 
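
The 413 responses above show the server rejecting any request with more than 32 inputs, so `bs=64` cannot work against this endpoint as configured. One client-side workaround is to split oversized batches before sending them. The sketch below assumes the limit of 32 taken from the error message; `split_batch` and `clamp_batches` are hypothetical helpers, not part of `tei_benchmark`.

```python
MAX_BATCH_SIZE = 32  # assumption: taken from the 413 error message


def split_batch(batch: list[str], max_size: int = MAX_BATCH_SIZE) -> list[list[str]]:
    """Split one batch into pieces of at most max_size items, preserving order."""
    return [batch[i : i + max_size] for i in range(0, len(batch), max_size)]


def clamp_batches(
    batches: list[list[str]], max_size: int = MAX_BATCH_SIZE
) -> list[list[str]]:
    """Apply split_batch to every batch and flatten the result."""
    clamped: list[list[str]] = []
    for batch in batches:
        clamped.extend(split_batch(batch, max_size))
    return clamped


oversized = [["This is a test"] * 64 for _ in range(3)]
safe = clamp_batches(oversized)
```

If the limit is meant to be higher, it would have to be raised on the server side instead; the client cannot override it.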

In [6]:
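async def run_experiments():
    # Full grid; only the last two values of each axis are run below.
    batch_sizes = [1, 4, 8, 16, 32]
    concurrency_levels = [1, 25, 50, 100, 205, 500, 1000]

    results = []
    for batch_size in batch_sizes[-2:]:
        for concurrency_level in concurrency_levels[-2:]:
            # 10,000 chunks per run so timings are less noisy than the 1,000-chunk cells.
            batches = generate_batches(
                "This is a test", bs=batch_size, total_chunks=10_000
            )
            # Only metrics are needed here, so embeddings are not kept.
            # NOTE: earlier cells pass `collect_results`; check which keyword
            # the current embed() signature actually accepts.
            results.append(
                await embed(
                    batches,
                    concurrency_level=concurrency_level,
                    collect_embeddings=False,
                )
            )
    return results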

In [7]:
results = await run_experiments()

Batch Size: 16, Concurrency Level: 500, Total Time: 5.50 seconds, RPS: 0.36, Embed per sec: 1816.56
Batch Size: 16, Concurrency Level: 1000, Total Time: 5.19 seconds, RPS: 0.19, Embed per sec: 1927.15
Batch Size: 32, Concurrency Level: 500, Total Time: 5.33 seconds, RPS: 0.19, Embed per sec: 1877.69
Batch Size: 32, Concurrency Level: 1000, Total Time: 5.31 seconds, RPS: 0.19, Embed per sec: 1884.80


In [9]:
results[0]["metrics"]

{'batch_size': 16,
 'concurrency_level': 500,
 'total_time': 5.5049,
 'num_chunks_embedded': 10000,
 'req_per_sec': 0.3633,
 'embed_per_sec': 1816.5634}

## k6 Load Test Grid Search

To-do

1. Write a k6 script that runs a load test at a fixed RPS. Inputs are batch size and rate. Capture the RPS actually achieved and a boolean indicating whether any requests were dropped.
2. Write a Python script that iterates through a grid of RPS values and batch sizes.
3. Track which combination embeds the highest effective number of chunks per second.
4. Visualize this? (Maybe plot latency vs. throughput like TGI does?)


In [38]:
import pandas as pd

In [67]:
df = pd.read_csv("test_results.csv")

## Test Inference Client


In [5]:
from huggingface_hub import InferenceClient

client = InferenceClient(model=os.getenv("HOST"), token=os.getenv("HF_API_KEY"))

In [7]:
res = client.feature_extraction("This is a test")

In [10]:
response.content

NameError: name 'response' is not defined

In [9]:
type(res)

numpy.ndarray