# TEI Bulk Embedding Load Test

Proposed workflow:

1. Run load tests with k6 to grid-search batch size and concurrency (k6 handles failures well).
2. Then use my scripts to confirm that async calls from Python reach similar performance.

Still to add: a demo of how to keep track of ordering, i.e. how to tie each response back to its input record. `asyncio.gather()` returns results in the same order as the awaitables passed to it, whereas `asyncio.as_completed()` yields results as they finish, so the ordering guarantee depends on which one is used.
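A minimal sketch of the ordering behaviour, using a stand-in coroutine rather than a real TEI call. `fake_embed` and its artificial delays are assumptions made for illustration; only the `asyncio` calls are real. Each result carries its own `unique_id`, so a response can be matched to its input record even when results arrive out of order.

```python
import asyncio


async def fake_embed(record: dict) -> dict:
    """Hypothetical stand-in for a TEI request; later records finish sooner."""
    await asyncio.sleep(0.05 * (3 - record["unique_id"]))
    return {"unique_id": record["unique_id"], "embedding": [float(len(record["text"]))]}


async def demo() -> tuple[list[int], list[int]]:
    records = [{"unique_id": n, "text": "hello " * (n + 1)} for n in range(3)]

    # gather(): results line up with the order of the inputs,
    # regardless of which coroutine finishes first.
    gathered = await asyncio.gather(*(fake_embed(r) for r in records))
    gather_ids = [res["unique_id"] for res in gathered]

    # as_completed(): results arrive in completion order,
    # so the unique_id in the payload is what ties them back to the input.
    completed_ids = []
    for fut in asyncio.as_completed([fake_embed(r) for r in records]):
        res = await fut
        completed_ids.append(res["unique_id"])
    return gather_ids, completed_ids


gather_ids, completed_ids = asyncio.run(demo())
print(gather_ids)     # [0, 1, 2]
print(completed_ids)  # completion order, not input order
```

Inside a notebook cell, where an event loop is already running, `await demo()` would replace `asyncio.run(demo())`.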

Resources:

- On asyncio in Python: https://superfastpython.com/asyncio-gather/
- On `as_completed()` vs. `gather()`: https://jxnl.github.io/instructor/blog/2023/11/13/learn-async/#asyncioas_completed-handling-tasks-as-they-complete


Goal:

- Embed N chunks of data in total
- Vary the batch_size of each request, as well as the concurrency of requests
- Analyze the total time to complete, requests per second (RPS), and latency to see which batch_size / concurrency combination performs best.
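The goal above can be sketched as follows: split the records into batches, cap the number of in-flight requests with a semaphore, and time the whole run. This is not the code in `tei_benchmark`; `make_batches`, `fake_embed_batch`, and `run_once` are hypothetical names, and the sleep stands in for a real HTTP request.

```python
import asyncio
import time


def make_batches(records: list[dict], batch_size: int) -> list[list[dict]]:
    """Split records into consecutive batches of at most batch_size items."""
    return [records[i : i + batch_size] for i in range(0, len(records), batch_size)]


async def fake_embed_batch(batch: list[dict]) -> list[dict]:
    """Hypothetical stand-in for one embedding request to the endpoint."""
    await asyncio.sleep(0.01)
    return [{"unique_id": r["unique_id"], "embedding": [0.0]} for r in batch]


async def run_once(records: list[dict], batch_size: int, concurrency: int) -> dict:
    semaphore = asyncio.Semaphore(concurrency)  # caps the number of in-flight requests

    async def bounded(batch: list[dict]) -> list[dict]:
        async with semaphore:
            return await fake_embed_batch(batch)

    start = time.perf_counter()
    per_batch = await asyncio.gather(
        *(bounded(b) for b in make_batches(records, batch_size))
    )
    elapsed = time.perf_counter() - start

    embedded = sum(len(b) for b in per_batch)
    return {
        "batch_size": batch_size,
        "concurrency": concurrency,
        "total_time": elapsed,
        "embeddings_per_second": embedded / elapsed,
        "num_embedded": embedded,
    }


records = [{"unique_id": n, "text": "hello"} for n in range(1_000)]
summary = asyncio.run(run_once(records, batch_size=64, concurrency=8))
print(summary)
```

The timings printed here only reflect the artificial sleep, so they say nothing about real endpoint performance.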


## Deploy Embedding Model with TEI on Inference Endpoints


In [16]:
import os
from dotenv import load_dotenv

%load_ext autoreload
%autoreload 2

load_dotenv(override=True)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


True

In [8]:
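# Fail fast with a KeyError if HF_API_KEY is missing; os.getenv() would return None,
# and assigning None into os.environ raises a less obvious TypeError.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = os.environ["HF_API_KEY"]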

In [14]:
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    name="bge-base-en-v15-arr-tei-test",
    repository="BAAI/bge-base-en-v1.5",
    framework="pytorch",
    task="sentence-embeddings",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    namespace="HF-test-lab",
    instance_type="g5.2xlarge",
    instance_size="medium",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_CONCURRENT_REQUESTS": "4096",
            "MAX_BATCH_TOKENS": "65536",
            "MAX_CLIENT_BATCH_SIZE": "2048",
            "MODEL_ID": "/repository",
        },
        "url": "ghcr.io/huggingface/text-embeddings-inference:86-0.6",  # 86-* image targets Ampere 86 GPUs (e.g. the A10G on g5 instances)
    },
)

endpoint.wait()
print(endpoint.status)

running


In [15]:
endpoint.url

'https://qv5gh3puw805qym7.us-east-1.aws.endpoints.huggingface.cloud'

In [25]:
from tei_benchmark import main

In [37]:
data = [{"unique_id": n, "text": "hello " * 512} for n in range(10_000)]

In [47]:
batch_size = 64
concurrency = 100

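# Start from a clean output file; skip removal if it does not exist yet.
if os.path.exists("embeddings.jsonl"):
    os.remove("embeddings.jsonl")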
results = await main(data, batch_size, concurrency, filename="embeddings.jsonl")

Batch Size: 64, Concurrency Level: 100, Total Time: 22.5188 seconds, Embed per sec: 444.0734, Num Success: 10000, Num Failures: 0


In [48]:
results

{'timing_statistics': {'total_time': {'min': 43.0,
   'max': 13431.0,
   'mean': 9044.0304,
   'median': 10177.0,
   'p90': 13149.0,
   'p95': 13172.0},
  'tokenization_time': {'min': 5.0,
   'max': 1107.0,
   'mean': 300.1616,
   'median': 48.0,
   'p90': 832.0,
   'p95': 977.0},
  'queue_time': {'min': 9.0,
   'max': 8467.0,
   'mean': 6225.8384,
   'median': 7806.0,
   'p90': 8455.0,
   'p95': 8463.0},
  'inference_time': {'min': 16.0,
   'max': 273.0,
   'mean': 269.3312,
   'median': 272.0,
   'p90': 273.0,
   'p95': 273.0}},
 'total_time': 22.5188,
 'embeddings_per_second': 444.0734,
 'request_metrics': {'success': 10000, 'failure': 0, 'total': 10000},
 'run_metadata': {'batch_size': 64,
  'concurrency': 100,
  'filename': 'embeddings.jsonl'}}
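A quick way to read these numbers is to compare the mean of each phase with the mean total time, and to re-derive the throughput from the summary line. The values below are copied from the output above. Treating the timing statistics as per-request milliseconds is an assumption, since the units are not stated in the output.

```python
# Mean timings copied from the results above (assumed to be milliseconds per request).
timing_means = {
    "total_time": 9044.0304,
    "tokenization_time": 300.1616,
    "queue_time": 6225.8384,
    "inference_time": 269.3312,
}

total = timing_means["total_time"]
shares = {
    name: value / total
    for name, value in timing_means.items()
    if name != "total_time"
}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of mean total time")

# Throughput as reported: successful embeddings divided by wall-clock seconds.
throughput = 10_000 / 22.5188
print(f"{throughput:.4f} embeddings per second")
```

Queue time is by far the largest share, which suggests that at batch size 64 and concurrency 100 most requests wait on the server rather than being processed. The three phases do not add up to the mean total time, so some of it is not covered by this breakdown.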

In [None]:
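# Draft grid search, not yet run: generate_batches() and embed() are not defined in this notebook,
# and the [-2:] slices limit the run to the two largest values of each list.
# async def run_experiments():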
#     batch_sizes = [1, 4, 8, 16, 32]
#     concurrency_levels = [1, 25, 50, 100, 205, 500, 1000]

#     results = []
#     for batch_size in batch_sizes[-2:]:
#         for concurrency_level in concurrency_levels[-2:]:
#             batches = generate_batches(
#                 "This is a test", bs=batch_size, total_chunks=10_000
#             )
#             results.append(
#                 await embed(
#                     batches,
#                     concurrency_level=concurrency_level,
#                     collect_embeddings=False,
#                 )
#             )
#     return results
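One possible shape for the grid search, kept independent of how a single run is implemented. `grid_search` and `placeholder_run` are hypothetical names; `placeholder_run` returns a made-up score purely so the example executes, and would be swapped for a real benchmark coroutine that returns a summary dict.

```python
import asyncio
from itertools import product
from typing import Awaitable, Callable


async def grid_search(
    run: Callable[[int, int], Awaitable[dict]],
    batch_sizes: list[int],
    concurrency_levels: list[int],
) -> list[dict]:
    """Run every (batch_size, concurrency) pair one after another."""
    results = []
    for batch_size, concurrency in product(batch_sizes, concurrency_levels):
        # Sequential on purpose: overlapping runs would distort each other's timings.
        results.append(await run(batch_size, concurrency))
    return results


async def placeholder_run(batch_size: int, concurrency: int) -> dict:
    """Placeholder only: returns a made-up score, not a measurement."""
    await asyncio.sleep(0)
    return {
        "batch_size": batch_size,
        "concurrency": concurrency,
        "embeddings_per_second": float(batch_size * concurrency),
    }


results = asyncio.run(grid_search(placeholder_run, [16, 32], [50, 100]))
best = max(results, key=lambda r: r["embeddings_per_second"])
print(best["batch_size"], best["concurrency"])
```

Picking the best combination by `embeddings_per_second` alone ignores failures and tail latency, so a real comparison would likely look at those as well.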