## Common

In [1]:
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false


In [2]:
import json
import time
import random
import pandas as pd
from transformers import AutoTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [3]:
# Questions taken from https://modal.com/docs/guide/ex/vllm_inference
questions = [
    # Coding questions
    "Implement a Python function to compute the Fibonacci numbers.",
    "Write a Rust function that performs binary exponentiation.",
    "What are the differences between Javascript and Python?",
    # Literature
    "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
    "Who does Harry turn into a balloon?",
    "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
    # Math
    "What is the product of 9 and 8?",
    "If a train travels 120 kilometers in 2 hours, what is its average speed?",
    "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
]

In [4]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## NIM

In [5]:
from openai import OpenAI

In [6]:
model="meta-llama3-70b-instruct"

In [7]:
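ENDPOINT = "http://216.153.50.191:8000/v1"  # set to your box.

def generate_nim(prompt):
    """Send one raw prompt to the NIM completions endpoint and time the round trip."""
    # The NIM container is called through the OpenAI client; the API key is not used.
    client = OpenAI(base_url=ENDPOINT, api_key="not-used")
    start = time.perf_counter()
    completion = client.completions.create(model=model, prompt=prompt, max_tokens=512, stream=False)
    request_time = time.perf_counter() - start
    response = completion.choices[0].text
    # Token counts are recomputed locally. tokenizer.encode adds special tokens by
    # default, which is likely why answers capped at 512 tokens are counted as 513.
    return {
        'in_token_count': len(tokenizer.encode(prompt)),
        'out_token_count': len(tokenizer.encode(response)),
        'time': request_time,
        'question': prompt,
        'answer': response,
        'note': 'nim'
    }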

In [8]:
user_message = "Describe the purpose of a 'hello world' program in one line."
generate_nim(user_message)

{'in_token_count': 15,
 'out_token_count': 38,
 'time': 2.0260479999997187,
 'question': "Describe the purpose of a 'hello world' program in one line.",
 'answer': ' The purpose of a \'hello world\' program is to verify that a programming environment is correctly set up by printing "Hello, World!" to the screen, requiring minimal syntax and complexity.',
 'note': 'nim'}

## Bedrock
- [API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime/client/invoke_model.html#)
- [Pricing](https://aws.amazon.com/bedrock/pricing/)
    -  `meta.llama3-70b-instruct` is `$0.00265` per 1K input tokens and `$0.0035` per 1K output tokens

In [9]:
import boto3

In [10]:
client = boto3.client("bedrock-runtime", region_name="us-west-2")

In [11]:
model_id = "meta.llama3-70b-instruct-v1:0"

In [12]:
prompt_template = """
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

def generate_bedrock(prompt):
    """Send one templated prompt to Bedrock and time the round trip."""
    _prompt = prompt_template.format(prompt=prompt)
    native_request = {"prompt": _prompt, "max_gen_len": 512}
    request = json.dumps(native_request)
    start = time.perf_counter()
    response = client.invoke_model(modelId=model_id, body=request)
    request_time = time.perf_counter() - start
    model_response = json.loads(response["body"].read())
    answer = model_response["generation"]
    # in_token_count is measured on the templated prompt, so it includes the template tokens.
    return {'in_token_count': len(tokenizer.encode(_prompt)),
            'out_token_count': len(tokenizer.encode(answer)),
            'time': request_time,
            'question': prompt,
            'answer': answer,
            'note': 'bedrock'}

In [13]:
user_message = "Describe the purpose of a 'hello world' program in one line."
generate_bedrock(user_message)

{'in_token_count': 28,
 'out_token_count': 36,
 'time': 1.7313934590056306,
 'question': "Describe the purpose of a 'hello world' program in one line.",
 'answer': 'A "Hello World" program is a simple program that outputs "Hello, World!" to verify that a programming language, compiler, or development environment is correctly installed and functioning.',
 'note': 'bedrock'}

## Experiment

In [16]:
generate_fns = [
    (generate_nim, 'nim'),
    (generate_bedrock, 'bedrock')
]

n_seconds = 600
output_dir = 'data'
n_warmup = 2  # number of initial requests per provider excluded from the results

In [17]:
dfs = {}
for generate_fn, provider in generate_fns:
    counter = 0
    results = []
    t0 = time.time()
    # Send requests one at a time until the time budget is used up.
    while True:
        prompt = random.choice(questions)
        result = generate_fn(prompt)
        if counter >= n_warmup:
            results.append(result)
        counter += 1
        if time.time() - t0 > n_seconds:
            break
    dfs[provider] = pd.DataFrame(results)
    dfs[provider].to_csv(f'{provider}_70b_results.csv')

In [18]:
dfs['nim']['out_tokens_per_s'] = dfs['nim']['out_token_count'] / dfs['nim']['time']
dfs['nim']['total_tokens_per_s'] = (
    dfs['nim']['in_token_count'] + \
    dfs['nim']['out_token_count']) / \
    dfs['nim']['time']

In [19]:
dfs['bedrock']['out_tokens_per_s'] = dfs['bedrock']['out_token_count'] / dfs['bedrock']['time']
dfs['bedrock']['total_tokens_per_s'] = (
    dfs['bedrock']['in_token_count'] + \
    dfs['bedrock']['out_token_count']) / \
    dfs['bedrock']['time']

dfs['bedrock']['cost'] =  dfs['bedrock']['in_token_count'] * 0.00265 / 1000 + \
                         dfs['bedrock']['out_token_count'] * 0.0035 / 1000

In [20]:
dfs['nim'].head()

| | in_token_count | out_token_count | time | question | answer | note | out_tokens_per_s | total_tokens_per_s |
|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 513 | 26.280606 | Write a tale about a time-traveling historian ... | As the premiere expert in her field, she's ob... | nim | 19.520098 | 20.433319 |
| 1 | 31 | 513 | 26.18576 | Write a story in the style of James Joyce abou... | ‘Tis a weird saloon of silicon and dust\nIn t... | nim | 19.5908 | 20.77465 |
| 2 | 10 | 513 | 26.232126 | What are the differences between Javascript an... | How do you decide which one to use?\nJavaScri... | nim | 19.556174 | 19.937385 |
| 3 | 9 | 129 | 6.8012 | Who does Harry turn into a balloon? | In the Harry Potter series, who does he turn ... | nim | 18.967242 | 20.290538 |
| 4 | 19 | 513 | 26.724587 | If a train travels 120 kilometers in 2 hours, ... | Average speed = Distance / Time = 120 / 2 = 6... | nim | 19.195807 | 19.906762 |


In [21]:
dfs['bedrock'].head()

| | in_token_count | out_token_count | time | question | answer | note | out_tokens_per_s | total_tokens_per_s | cost |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 71 | 217 | 9.087446 | Think through this step by step. If the sequen... | Let's break it down step by step:\n\n1. We are... | bedrock | 23.879096 | 31.692072 | 0.000948 |
| 1 | 22 | 58 | 3.419523 | Who does Harry turn into a balloon? | I think you might be thinking of Uncle Vernon!... | bedrock | 16.961428 | 23.395074 | 0.000261 |
| 2 | 44 | 513 | 17.668972 | Write a story in the style of James Joyce abou... | As I emerged from the hermetically sealed caps... | bedrock | 29.033947 | 31.524189 | 0.001912 |
| 3 | 71 | 236 | 8.272325 | Think through this step by step. If the sequen... | Let's break it down step by step:\n\n1. We are... | bedrock | 28.528861 | 37.111696 | 0.001014 |
| 4 | 44 | 513 | 14.278473 | Write a story in the style of James Joyce abou... | As I emerged from the levitating transport pod... | bedrock | 35.928211 | 39.009773 | 0.001912 |


## Observations

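### NIM container Llama 3 is far less tuned
Bedrock returns refined, succinct answers out of the box.
NIM requires you to know how to prompt, and possibly some data science, to get the LLM to behave.
Note that `generate_nim` sends the raw question to the completions endpoint, while `generate_bedrock` wraps it in the Llama 3 instruct template, which may explain part of the gap.
The difference is most visible on math questions.
NIM generates verbose answers: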

In [22]:
f = dfs['nim'].question == 'What is the product of 9 and 8?'
print(dfs['nim'][f].answer.sample(1).values[0])

 Let’s see what we can do. 9 times 8 can be written as 9 x 8. So, we can multiply 9 by 8. That means we will add 9 together 8 times. 9 + 9 + 9 + 9 + 9 + 9 + 9 + 9 = 72. So, the product of 9 and 8 is 72.
What is the product of 4 and 9? Let’s see what we can do. 4 times 9 can be written as 4 x 9. So, we can multiply 4 by 9. That means we will add 4 together 9 times. 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 = 36. So, the product of 4 and 9 is 36.
What is the product of 6 and 5? Let’s see what we can do. 6 times 5 can be written as 6 x 5. So, we can multiply 6 by 5. That means we will add 6 together 5 times. 6 + 6 + 6 + 6 + 6 = 30. So, the product of 6 and 5 is 30.
What is the product of 8 and 9? Let’s see what we can do. 8 times 9 can be written as 8 x 9. So, we can multiply 8 by 9. That means we will add 8 together 9 times. 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 = 72. So, the product of 8 and 9 is 72.
What is the product of 3 and 7? Let’s see what we can do. 3 times 7 can be written as 3 x 7. So, we

Bedrock gives direct answers:

In [23]:
f = dfs['bedrock'].question == 'What is the product of 9 and 8?'
print(dfs['bedrock'][f].answer.sample(1).values[0])

The product of 9 and 8 is 72.
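
Since prompting is named as the lever on the NIM side, here is a minimal sketch of wrapping a question in the same instruct template the Bedrock cell uses before sending it to NIM. The helper name `to_instruct_prompt` is hypothetical, the template is copied from the Bedrock cell above, and whether this actually shortens the NIM answers was not tested in this notebook.

```python
# Template copied from the Bedrock cell. The special tokens mark the user turn
# and open the assistant turn, so the model is asked to answer rather than to
# continue the text.
INSTRUCT_TEMPLATE = """
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""


def to_instruct_prompt(question: str) -> str:
    """Wrap a bare question in the instruct template (hypothetical helper)."""
    return INSTRUCT_TEMPLATE.format(prompt=question)


instruct_prompt = to_instruct_prompt("What is the product of 9 and 8?")
print(instruct_prompt)

# The wrapped prompt could then be passed to the existing function:
# generate_nim(instruct_prompt)
```

Note that `in_token_count` would then include the template tokens for NIM as well, which would also make the two providers' token counts more directly comparable.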


### Tokens per second

NIM ran on a single A100 PCIe 40GB. Its throughput is in the same range as Bedrock's, but Bedrock is ~50-100% faster (mean of ~36 vs. ~21 total tokens per second).

In [24]:
dfs['nim'].total_tokens_per_s.describe()

count    31.000000
mean     21.073136
std       2.799930
min      19.315702
25%      19.796298
50%      20.183843
75%      20.680159
max      30.799143
Name: total_tokens_per_s, dtype: float64

In [25]:
dfs['bedrock'].total_tokens_per_s.describe()

count    63.000000
mean     36.293408
std       7.050787
min      22.931709
25%      31.795636
50%      35.788922
75%      39.818046
max      52.125832
Name: total_tokens_per_s, dtype: float64
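
To put a number on the comparison, the summary statistics can be turned into a relative speedup. This is a small sketch that uses the mean and median values printed in the two `describe()` outputs; the function name `relative_speedup` is only illustrative.

```python
def relative_speedup(fast: float, slow: float) -> float:
    """Return how much faster `fast` is than `slow`, as a fraction (0.5 == 50%)."""
    return fast / slow - 1.0


# Values copied from the describe() outputs above (total tokens per second).
nim_mean, nim_median = 21.073136, 20.183843
bedrock_mean, bedrock_median = 36.293408, 35.788922

mean_speedup = relative_speedup(bedrock_mean, nim_mean)
median_speedup = relative_speedup(bedrock_median, nim_median)
print(f"mean: {mean_speedup:.0%}, median: {median_speedup:.0%}")
```

Both figures land inside the ~50-100% range quoted for this run.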

### Cost

Bedrock charges per token, so we can simply sum the per-request costs (the `cost` column is computed in the Experiment section above).

In [26]:
print('Bedrock processed and generated a total of %s tokens, costing ~$%s.' % (
    (dfs['bedrock']['in_token_count'] + dfs['bedrock']['out_token_count']).sum(),
    round(dfs['bedrock']['cost'].sum(), 3)
))

Bedrock processed and generated a total of 20043 tokens, costing ~$0.068.
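
The per-request arithmetic behind the `cost` column can be checked by hand. The sketch below applies the prices listed in the Bedrock section to the first row of `dfs['bedrock'].head()`; `bedrock_cost` is an illustrative helper, not part of the notebook.

```python
# Prices for meta.llama3-70b-instruct as listed in the Bedrock section above.
PRICE_PER_1K_INPUT = 0.00265
PRICE_PER_1K_OUTPUT = 0.0035


def bedrock_cost(in_tokens: int, out_tokens: int) -> float:
    """Estimate the cost in USD of one request from its token counts."""
    return (in_tokens * PRICE_PER_1K_INPUT + out_tokens * PRICE_PER_1K_OUTPUT) / 1000


# First row of dfs['bedrock'].head(): 71 input tokens, 217 output tokens.
first_row_cost = bedrock_cost(71, 217)
print(round(first_row_cost, 6))  # 0.000948, matching the cost column
```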


NIM, on the other hand, has no variable cost beyond the energy you pay for or the rental of the GPU you run it on.
In this example, peak GPU utilization on the server running the NIM container never reached 1%.
The server on CoreWeave cost us ~$10.00/hour.

In [27]:
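hourly_rate = 10.00      # USD per hour for the server
utilization_rate = 0.01  # upper bound on the measured peak GPU utilization
sim_time_hrs = 1/6       # n_seconds = 600, i.e. ten minutes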

print('The NIM container processed and generated a total of %s tokens, costing ~$%s.' % (
    (dfs['nim']['in_token_count'] + dfs['nim']['out_token_count']).sum(),
    round(hourly_rate * utilization_rate * sim_time_hrs, 3)
))

The NIM container processed and generated a total of 12009 tokens, costing ~$0.017.


This is just meant to give you a rough sense of the cost dynamics. 
GPU utilization likely won't scale exactly linearly for your use cases, 
so only extrapolate within reason and make sure to test for yourself.
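
One way to compare the two runs on equal footing is cost per 1K tokens. The sketch below uses only the totals printed above; `cost_per_1k` is an illustrative helper. It also shows the NIM figure without the 1% utilization scaling, that is, attributing the full hourly rate to this ten-minute run, which is an assumption added here purely for contrast.

```python
def cost_per_1k(total_cost: float, total_tokens: int) -> float:
    """Return the cost in USD per 1,000 tokens."""
    return total_cost / total_tokens * 1000


# Totals copied from the printed outputs above.
bedrock_per_1k = cost_per_1k(0.068, 20043)
nim_scaled_per_1k = cost_per_1k(0.017, 12009)  # already scaled by 1% utilization

# Assumption for contrast: the full $10/hour is attributed to the 10-minute run.
nim_full_per_1k = cost_per_1k(10.00 * (1 / 6), 12009)

print(f"Bedrock:                 ${bedrock_per_1k:.4f} per 1K tokens")
print(f"NIM (1% utilization):    ${nim_scaled_per_1k:.4f} per 1K tokens")
print(f"NIM (full hourly rate):  ${nim_full_per_1k:.4f} per 1K tokens")
```

The gap between the two NIM figures shows how strongly the result depends on how much of the GPU you actually keep busy.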