## Common

In [1]:
%env TOKENIZERS_PARALLELISM=false

env: TOKENIZERS_PARALLELISM=false


In [2]:
import json
import time
import random
import pandas as pd
from transformers import AutoTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [3]:
# Questions taken from https://modal.com/docs/guide/ex/vllm_inference
questions = [
    # Coding questions
    "Implement a Python function to compute the Fibonacci numbers.",
    "Write a Rust function that performs binary exponentiation.",
    "What are the differences between Javascript and Python?",
    # Literature
    "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
    "Who does Harry turn into a balloon?",
    "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
    # Math
    "What is the product of 9 and 8?",
    "If a train travels 120 kilometers in 2 hours, what is its average speed?",
    "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
]

In [4]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## NIM

In [55]:
from openai import OpenAI

In [61]:
model="meta-llama3-8b-instruct"

In [62]:
ENDPOINT = "http://216.153.50.180:8000/v1"  # replace with the address of your own NIM server

def generate_nim(prompt):
    """Send one completion request to the NIM endpoint and time it."""
    client = OpenAI(base_url=ENDPOINT, api_key="not-used")
    start = time.perf_counter()
    # The prompt is sent as-is to the completions endpoint.
    completion = client.completions.create(model=model, prompt=prompt, max_tokens=512, stream=False)
    request_time = time.perf_counter() - start  # wall-clock time of the whole request
    response = completion.choices[0].text
    return {
        # Token counts come from the local tokenizer, not from the server.
        'in_token_count': len(tokenizer.encode(prompt)),
        'out_token_count': len(tokenizer.encode(response)),
        'time': request_time,
        'question': prompt,
        'answer': response,
        'note': 'nim'
    }

In [57]:
user_message = "Describe the purpose of a 'hello world' program in one line."
generate_nim(user_message)

{'in_token_count': 15,
 'out_token_count': 17,
 'time': 0.3396163750003325,
 'question': "Describe the purpose of a 'hello world' program in one line.",
 'answer': ' The "hello world" program is a standard first program for programmers in most programming',
 'note': 'nim'}

## Bedrock
- [API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime/client/invoke_model.html#)
- [Pricing](https://aws.amazon.com/bedrock/pricing/)
    - `meta.llama3-8b-instruct` costs `$0.0004` per 1K input tokens and `$0.0006` per 1K output tokens.

In [5]:
import boto3

In [6]:
client = boto3.client("bedrock-runtime", region_name="us-west-2")

In [7]:
model_id = "meta.llama3-8b-instruct-v1:0"

In [8]:
prompt_template = """
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

def generate_bedrock(prompt):
    """Send one invoke_model request to Bedrock and time it."""
    # Wrap the question in the chat template defined above.
    _prompt = prompt_template.format(prompt=prompt)
    native_request = {"prompt": _prompt, "max_gen_len": 512}
    request = json.dumps(native_request)
    start = time.perf_counter()
    response = client.invoke_model(modelId=model_id, body=request)
    request_time = time.perf_counter() - start  # wall-clock time of the whole request
    model_response = json.loads(response["body"].read())
    response = model_response["generation"]
    # Token counts come from the local tokenizer and include the template tokens.
    return {'in_token_count': len(tokenizer.encode(_prompt)),
            'out_token_count': len(tokenizer.encode(response)),
            'time': request_time,
            'question': prompt,
            'answer': response,
            'note': 'bedrock'}

In [9]:
user_message = "Describe the purpose of a 'hello world' program in one line."
generate_bedrock(user_message)

{'in_token_count': 28,
 'out_token_count': 44,
 'time': 1.2371638750046259,
 'question': "Describe the purpose of a 'hello world' program in one line.",
 'answer': 'A "Hello, World!" program is a traditional first program in a programming language, serving as a simple demonstration of the language\'s syntax and functionality, typically printing the message "Hello, World!" to the screen.',
 'note': 'bedrock'}

## Experiment

In [10]:
generate_fns = [
    # (generate_nim, 'nim'),
    (generate_bedrock, 'bedrock')
]

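n_seconds = 600  # time budget per provider, in seconds
output_dir = 'data'
n_warmup = 2  # number of initial requests per provider excluded from the results

In [12]:
dfs = {
    # nim_df must already hold the NIM results (it is not defined in the cells shown).
    'nim': nim_df
}
for generate_fn, provider in generate_fns:
    counter = 0
    results = []
    t0 = time.time()
    while True:
        # Pick a random question for each request.
        prompt = random.choice(questions)
        result = generate_fn(prompt)
        # Skip the warm-up requests.
        if counter >= n_warmup:
            results.append(result)
        counter += 1
        # Stop once the time budget is used up.
        if time.time() - t0 > n_seconds:
            break
    dfs[provider] = pd.DataFrame(results)
    dfs[provider].to_csv(f'{provider}_results.csv')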

In [29]:
dfs['nim']['out_tokens_per_s'] = dfs['nim']['out_token_count'] / dfs['nim']['time']
dfs['nim']['total_tokens_per_s'] = (
    dfs['nim']['in_token_count'] + \
    dfs['nim']['out_token_count']) / \
    dfs['nim']['time']

In [70]:
dfs['bedrock']['out_tokens_per_s'] = dfs['bedrock']['out_token_count'] / dfs['bedrock']['time']
dfs['bedrock']['total_tokens_per_s'] = (
    dfs['bedrock']['in_token_count'] + \
    dfs['bedrock']['out_token_count']) / \
    dfs['bedrock']['time']

dfs['bedrock']['cost'] =  dfs['bedrock']['in_token_count'] * 0.0004 / 1000 + \
                         dfs['bedrock']['out_token_count'] * 0.0006 / 1000

In [71]:
dfs['nim'].head()

| | in_token_count | out_token_count | time | question | answer | note | out_tokens_per_s | total_tokens_per_s |
|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 448 | 6.580566 | Write a Rust function that performs binary exp... | The function takes two 32-bit unsigned intege... | nim | 68.079254 | 69.750843 |
| 1 | 31 | 513 | 7.559414 | Write a story in the style of James Joyce abou... | \nIn the year twenty-eight hundred and eighty... | nim | 67.8624 | 71.963247 |
| 2 | 31 | 513 | 7.535785 | Write a story in the style of James Joyce abou... | But instead of the lyrical and poetic prose t... | nim | 68.075188 | 72.188894 |
| 3 | 58 | 513 | 7.515713 | Think through this step by step. If the sequen... | a_1 = 3, a_2 = 5, a_3 = 8, a_4 = 13, a_5 = 21... | nim | 68.256998 | 75.974163 |
| 4 | 12 | 513 | 7.484234 | What is the product of 9 and 8? | A) 72 B) 73 C) 74 D) 75 E) 76\nThe product of... | nim | 68.54409 | 70.14746 |
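The two throughput columns can be checked by hand. A minimal sketch using the values from row 0 above; `throughput` is a hypothetical helper name, not something defined in the notebook:

```python
def throughput(in_tokens: int, out_tokens: int, seconds: float) -> tuple[float, float]:
    """Return (output tokens/s, total tokens/s) for a single request."""
    out_tps = out_tokens / seconds
    total_tps = (in_tokens + out_tokens) / seconds
    return out_tps, total_tps

# Row 0 of the NIM results: 11 input tokens, 448 output tokens, 6.580566 s.
out_tps, total_tps = throughput(11, 448, 6.580566)
print(round(out_tps, 2), round(total_tps, 2))  # 68.08 69.75
```

Because `time` is the wall-clock time of the whole request, it also covers prompt processing and network latency, so short answers show a lower output rate than long ones.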


In [72]:
dfs['bedrock'].head()

| | in_token_count | out_token_count | time | question | answer | note | cost | out_tokens_per_s | total_tokens_per_s |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 372 | 5.172039 | Implement a Python function to compute the Fib... | Here is a simple Python function to compute th... | bedrock | 0.000233 | 71.925215 | 76.565552 |
| 1 | 25 | 13 | 0.529664 | What is the product of 9 and 8? | The product of 9 and 8 is 72. | bedrock | 1.8e-05 | 24.543862 | 71.743596 |
| 2 | 24 | 350 | 4.783876 | Write a Rust function that performs binary exp... | Here is a simple implementation of binary expo... | bedrock | 0.00022 | 73.162432 | 78.179285 |
| 3 | 25 | 13 | 0.563811 | What is the product of 9 and 8? | The product of 9 and 8 is 72. | bedrock | 1.8e-05 | 23.057379 | 67.398492 |
| 4 | 23 | 513 | 7.244253 | What are the differences between Javascript an... | JavaScript and Python are two popular programm... | bedrock | 0.000317 | 70.814756 | 73.989687 |
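The `cost` column follows directly from the per-1K-token prices quoted in the Bedrock section. A small sketch that recomputes it for row 0 above; `bedrock_cost` and the constant names are illustrative and not part of the notebook:

```python
PRICE_PER_1K_INPUT = 0.0004   # USD, from the pricing note in the Bedrock section
PRICE_PER_1K_OUTPUT = 0.0006  # USD

def bedrock_cost(in_tokens: int, out_tokens: int) -> float:
    """Estimated cost in USD of one request at the quoted prices."""
    return (in_tokens * PRICE_PER_1K_INPUT / 1000
            + out_tokens * PRICE_PER_1K_OUTPUT / 1000)

# Row 0 of the Bedrock results: 24 input tokens, 372 output tokens.
print(round(bedrock_cost(24, 372), 6))  # 0.000233
```

The token counts come from the local tokenizer rather than from Bedrock's own usage figures, so treat the result as an estimate of the billed amount.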


## Observations

### The NIM container's Llama 3 is far less tuned
Bedrock is built to give you refined and succinct answers out of the box.
NIM requires you to know how to prompt, and maybe to apply some data science, to get the LLM to behave.
This has a significant effect on math questions.
NIM generates verbose answers:

In [74]:
f = dfs['nim'].question == 'What is the product of 9 and 8?'
print(dfs['nim'][f].answer.sample(1).values[0])

 A) 60 B) 67 C) 72 D) 80
The correct answer is A) 60. 9 x 8 = 72.
What is the product of 9 and 7? A) 56 B) 63 C) 69 D) 81
The correct answer is B) 63. 9 x 7 = 63.
What is the product of 8 and 3? A) 24 B) 27 C) 30 D) 36
The correct answer is A) 24. 8 x 3 = 24.
What is the product of 7 and 5? A) 32 B) 35 C) 36 D) 35
The correct answer is B) 35. 7 x 5 = 35.
What is the product of 3 and 2? A) 5 B) 6 C) 8 D) 9
The correct answer is B) 6. 3 x 2 = 6.
What is the product of 5 and 10? A) 30 B) 50 C) 60 D) 70
The correct answer is C) 50. 5 x 10 = 50.
Read the following passage and choose the correct answer. A) B) C) D)
Passage: A new study finds that playing with your cat may help you lose weight. Researchers at the University of ---asked 100 people to play with their pets for 30 minutes a day, three times a week, and compared them to a control group that did not participate in the study. The results showed that the group that played with their pets lost an average of 2 pounds per week, while th

Bedrock gives direct answers:

In [75]:
f = dfs['bedrock'].question == 'What is the product of 9 and 8?'
print(dfs['bedrock'][f].answer.sample(1).values[0])

The product of 9 and 8 is 72.
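One caveat when reading these two samples: `generate_bedrock` wraps each question in the Llama 3 chat template, while `generate_nim` sends the bare question to the completions endpoint, so the two providers did not receive the same prompt. A minimal sketch of a like-for-like rerun, assuming the NIM completions endpoint accepts the template tokens as plain text (not verified here); `wrap_llama3` is a hypothetical helper, and the template string is copied from the Bedrock cell:

```python
# Same template string as in the Bedrock cell.
LLAMA3_TEMPLATE = """
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

def wrap_llama3(prompt: str) -> str:
    """Wrap a bare question in the template the Bedrock cell uses."""
    return LLAMA3_TEMPLATE.format(prompt=prompt)

wrapped = wrap_llama3("What is the product of 9 and 8?")
print(wrapped)

# Hypothetical rerun with the notebook's own function:
# generate_nim(wrapped)
```

Note that `generate_nim` would then record the wrapped string as `question`, so filters on the question text would need adjusting.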


### Tokens per second

On a single A100 PCIe 40GB, NIM is comparable to Bedrock. Bedrock is ~9-11% faster in total tokens per second (median 76.4 vs. 70.2, mean 78.2 vs. 70.8).

In [76]:
dfs['nim'].total_tokens_per_s.describe()

count    81.000000
mean     70.767297
std       2.115696
min      67.550549
25%      69.412989
50%      70.198549
75%      71.553589
max      76.629050
Name: total_tokens_per_s, dtype: float64

In [77]:
dfs['bedrock'].total_tokens_per_s.describe()

count    144.000000
mean      78.239982
std       11.419779
min       62.579897
25%       71.033495
50%       76.414305
75%       80.875857
max      118.928222
Name: total_tokens_per_s, dtype: float64
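The size of the gap can be read off the two summaries above. A short sketch using the mean and median values as printed (hard-coded here rather than recomputed from the data frames):

```python
# Values copied from the describe() outputs above (50% row = median).
nim = {"mean": 70.767297, "median": 70.198549}
bedrock = {"mean": 78.239982, "median": 76.414305}

# Relative throughput advantage of Bedrock over NIM, in percent.
speedup_pct = {stat: (bedrock[stat] / nim[stat] - 1) * 100 for stat in nim}

for stat, pct in speedup_pct.items():
    print(f"{stat}: Bedrock is {pct:.1f}% faster than NIM")
```

This gives roughly 11% on the mean and 9% on the median. Bedrock's standard deviation (11.4) is also much larger than NIM's (2.1), so individual requests vary more.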

### Cost

Bedrock charges by tokens, so we can simply sum up the tokens (see a few cells above for the pre-processing).

In [86]:
print('Bedrock processed and generated a total of %s tokens, costing ~$%s.' % (
    (dfs['bedrock']['in_token_count'] + dfs['bedrock']['out_token_count']).sum(),
    round(dfs['bedrock']['cost'].sum(), 3)
))

Bedrock processed and generated a total of 45588 tokens, costing ~$0.026.


NIM, on the other hand, has no variable cost beyond the energy for, or the rental of, the GPU it runs on.
In this example, the peak GPU utilization measured on the server where we ran the NIM container never reached 5%.
The server on CoreWeave cost us ~$2.50/hour, so the estimate below charges 5% of that rate for the 10-minute run.

In [87]:
hourly_rate = 2.50
utilization_rate = 0.05
sim_time_hrs = 1/6

print('The NIM container processed and generated a total of %s tokens, costing ~$%s.' % (
    (dfs['nim']['in_token_count'] + dfs['nim']['out_token_count']).sum(),
    round(hourly_rate * utilization_rate * sim_time_hrs, 3)
))

The NIM container processed and generated a total of 41520 tokens, costing ~$0.021.
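To compare the two estimates on the same footing, both can be normalized to a price per million tokens. A rough sketch using the totals printed above; the full-rental figure is an extra assumption (the whole GPU billed to this one workload), not a number from the run:

```python
bedrock_tokens, bedrock_cost = 45588, 0.026  # printed Bedrock totals (cost is rounded)
nim_tokens = 41520                           # printed NIM total
hourly_rate, utilization_rate, sim_time_hrs = 2.50, 0.05, 1 / 6

# Same estimate as in the cell above: 5% of the hourly rate for 10 minutes.
nim_cost_utilized = hourly_rate * utilization_rate * sim_time_hrs
# Assumption: the GPU is otherwise idle, so the full rental is charged.
nim_cost_full = hourly_rate * sim_time_hrs

def per_million(cost: float, tokens: int) -> float:
    """Cost in USD per one million tokens."""
    return cost / tokens * 1_000_000

print(f"Bedrock:            ${per_million(bedrock_cost, bedrock_tokens):.2f} per 1M tokens")
print(f"NIM (5% of rental): ${per_million(nim_cost_utilized, nim_tokens):.2f} per 1M tokens")
print(f"NIM (full rental):  ${per_million(nim_cost_full, nim_tokens):.2f} per 1M tokens")
```

Under these assumptions, Bedrock comes to roughly $0.57 per million tokens and NIM to roughly $0.50 when only the utilized share is charged. At the full rental rate the NIM figure is about $10, which shows how strongly the result depends on how busy the GPU is kept.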


This is just meant to give you a rough sense of the cost dynamics. 
GPU utilization likely won't scale exactly linearly for your use cases, 
so only extrapolate within reason and make sure to test for yourself.