To use this notebook you first have to follow these steps:

1. Run the following command in your shell from the root directory of the repo: ```./init.sh```
    - This command installs all dependencies and initializes the submodules. It was written and tested on Ubuntu 20.04 running on AWS g6 EC2 instances.
2. Log in to Hugging Face and Weights & Biases from your CLI:
    - ```huggingface-cli login [Auth Token]```
    - ```wandb login [Auth Token]```
3. Update the benchmark config files at:
    - ```llm_judge/arena-hard-auto/config/api_config.yaml```
    - ```llm_judge/arena-hard-auto/config/gen_answer_config.yaml```
    - ```llm_judge/arena-hard-auto/config/judge_config.yaml``` 

4. Serve your LLM through vLLM's OpenAI-compatible API server using one of the following commands:
    - With Docker: 

    ```docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model [model name (huggingface model id)] --tensor-parallel-size [number of gpus] --max-model-len 8192 --gpu-memory-utilization 0.85 --enable-chunked-prefill --served-model-name [api name for the model]```

    
    - Without Docker: 

    ```python -m vllm.entrypoints.openai.api_server --model [model name (huggingface model id)] --tensor-parallel-size [number of gpus] --max-model-len 8192 --gpu-memory-utilization 0.9 --enable-chunked-prefill --served-model-name [api name for the model]```


After these steps, you should see a message that your model is running at ```http://0.0.0.0:8000```.
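
The launch commands above differ only in their placeholders, so they can also be assembled programmatically. The following is a minimal sketch and not part of the repository: the helper name `build_vllm_command` and its parameters are hypothetical, while the flags are the ones used in the non-Docker command above.

```python
import shlex


def build_vllm_command(model_id, served_name, num_gpus,
                       max_model_len=8192, gpu_memory_utilization=0.9,
                       extra_args=()):
    """Assemble the non-Docker vLLM launch command as a shell string.

    Hypothetical helper: only the flags are taken from the command above.
    """
    args = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_id,
        "--tensor-parallel-size", str(num_gpus),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-chunked-prefill",
        "--served-model-name", served_name,
        *extra_args,
    ]
    # shlex.join quotes each argument so the result can be pasted into a shell
    return shlex.join(args)


print(build_vllm_command(
    "neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8", "llama3_1_70b_fp8", 8,
    gpu_memory_utilization=0.80,
    extra_args=("--max-num-seqs", "320", "--max-num-batched-tokens", "2048"),
))
```

Keeping the flags in one place makes it easier to reuse the same settings across models and instance sizes.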

# Optimal setups for different model and instance sizes

Setup for llama3.1-70B-FP8 on an 8-GPU instance (g6.48xl):

```docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 --tensor-parallel-size 8 --max-model-len 8192 --gpu-memory-utilization 0.80 --enable-chunked-prefill --served-model-name llama3_1_70b_fp8 --max-num-seqs 320 --max-num-batched-tokens 2048```


Setup for llama3.1-70B-AWQ-INT4 on 8 GPUs (not yet optimal):

```docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --tensor-parallel-size 8 --max-model-len 8192 --gpu-memory-utilization 0.85 --enable-chunked-prefill --served-model-name llama3_1_70b_awq_int4 --tokenizer-pool-size 32```


# LOOK INTO

--tokenizer-pool-size
Size of tokenizer pool to use for asynchronous tokenization. If 0, will use synchronous tokenization.

Default: 0

--pipeline-parallel-size, -pp
Number of pipeline stages.

Default: 1
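
As a rough aid for exploring `--pipeline-parallel-size`, the sketch below lists the ways a fixed number of GPUs can be split between tensor and pipeline parallelism. It assumes that the product of `--tensor-parallel-size` and `--pipeline-parallel-size` should equal the number of GPUs in use; the function name `parallel_layouts` is hypothetical.

```python
def parallel_layouts(num_gpus):
    """Return all (tensor_parallel, pipeline_parallel) pairs whose product is num_gpus."""
    return [(tp, num_gpus // tp) for tp in range(1, num_gpus + 1) if num_gpus % tp == 0]


# Candidate flag combinations for an 8-GPU instance
for tp, pp in parallel_layouts(8):
    print(f"--tensor-parallel-size {tp} --pipeline-parallel-size {pp}")
```

Which of these layouts performs best is exactly what still needs to be measured.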



# Setup

In [53]:
import torch
import pandas as pd
from transformers import  AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import os
from codecarbon import EmissionsTracker
import time
import csv
import json
import yaml
from vllm import LLM, SamplingParams
from openai import OpenAI

import wandb
import gc

import random

In [54]:
# Get the number of available GPUs
num_gpus = torch.cuda.device_count()

print(f"Number of available GPUs: {num_gpus}")

Number of available GPUs: 4


Set up the Benchmark Variables

In [68]:
# Always use the same tokenizer so that token counts are comparable across runs (it is only used to count tokens)
huggingface_model_name = "meta-llama/Meta-Llama-3-8B-Instruct" 

model_name = "llama3_8b_fp8" #"llama3_1_70b"
bench_name = 'arena-hard-v0.1'
name = f"{model_name}-{bench_name}-{num_gpus}gpus"
delete_old_bench = True
max_gen_length = 2048 # Arena-Hard-Auto questions are sometimes longer than 2048 tokens, so the model's context length (--max-model-len) must exceed 4096 tokens (prompt + generation).
temperature = 0.0
num_choices = 1

config_filename = 'arena-hard-auto/config/answer_config_temp.yaml'
answer_filename = f'arena-hard-auto/data/{bench_name}/model_answer/{model_name}.jsonl'
question_path = f'arena-hard-auto/data/{bench_name}/question.jsonl'

print("Config will be written to:\n  " + config_filename)
print("\nAnswers will be written to:\n  " + answer_filename)

Config will be written to:
  arena-hard-auto/config/answer_config_temp.yaml

Answers will be written to:
  arena-hard-auto/data/arena-hard-v0.1/model_answer/llama3_8b_fp8.jsonl
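
The comment on `max_gen_length` implies a simple budget: the server's `--max-model-len` has to cover the prompt plus the generated tokens. A minimal sketch of that check follows; `fits_context` is a hypothetical helper and the prompt lengths are made-up example values.

```python
def fits_context(prompt_tokens, max_gen_length, max_model_len):
    """True if the prompt plus the full generation budget fit in the context window."""
    return prompt_tokens + max_gen_length <= max_model_len


generation_budget = 2048   # same value as max_gen_length above
context_window = 8192      # value passed to --max-model-len in the launch commands

for prompt_tokens in (150, 2500, 7000):  # illustrative prompt lengths
    ok = fits_context(prompt_tokens, generation_budget, context_window)
    print(f"{prompt_tokens} prompt tokens: {'fits' if ok else 'too long'}")
```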


Set up the OpenAI API

In [69]:
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)
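
Before running the benchmark, it can help to confirm that the server actually exposes the expected model name. The sketch below assumes the `client` created above and relies on the OpenAI client's `models.list()` call; `served_model_ids` and `check_model_served` are hypothetical helper names.

```python
def served_model_ids(client):
    """Return the model ids reported by the server's model listing."""
    return [model.id for model in client.models.list()]


def check_model_served(client, model_name):
    """Raise a descriptive error if model_name is not served by the endpoint."""
    ids = served_model_ids(client)
    if model_name not in ids:
        raise ValueError(f"'{model_name}' is not served; available models: {ids}")
    return True
```

Calling `check_model_served(client, model_name)` right after creating the client fails early if `--served-model-name` and `model_name` do not match.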

Set up the Tokenizer to Count Tokens

In [70]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(huggingface_model_name, padding_side="left")
tokenizer.pad_token = tokenizer.bos_token

Create the temporary configuration file

In [71]:
# Define the data to be written to the YAML file
config_data = {
    'name': name,
    'bench_name': bench_name,
    'temperature': temperature,
    'max_tokens': max_gen_length,
    'num_choices': num_choices,
    'model_list': [
        model_name
    ]
}

Delete the old benchmark result & config file

In [72]:
# Delete temporary config file if it exists
if os.path.exists(config_filename):
    os.remove(config_filename)
    print(f"OLD Configuration file '{config_filename}' deleted successfully.")
else:
    print(f"No OLD Configuration file found at '{config_filename}'")


# Delete old bench results if they exist
if delete_old_bench:
    if os.path.exists(answer_filename):
        os.remove(answer_filename)
        print(f"OLD Bench results file '{answer_filename}' deleted successfully.")
    else:
        print(f"No OLD Bench results file found at '{answer_filename}'")

OLD Configuration file 'arena-hard-auto/config/answer_config_temp.yaml' deleted successfully.
No OLD Bench results file found at 'arena-hard-auto/data/arena-hard-v0.1/model_answer/llama3_8b_fp8.jsonl'


In [73]:
# Write the config to a temporary YAML file
with open(config_filename, 'w') as file:
    yaml.dump(config_data, file, default_flow_style=False)

print(f"Configuration file '{config_filename}' created successfully.")

Configuration file 'arena-hard-auto/config/answer_config_temp.yaml' created successfully.
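
A quick sanity check of the config dictionary before it is written can catch typos early. This is a sketch under the assumption that the keys used in `config_data` above are the ones the benchmark script needs; `validate_answer_config` and `REQUIRED_KEYS` are hypothetical names, not part of arena-hard-auto.

```python
REQUIRED_KEYS = {"name", "bench_name", "temperature", "max_tokens", "num_choices", "model_list"}


def validate_answer_config(config):
    """Return a list of problems found in the config dict (empty if none)."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    if "model_list" in config and not config["model_list"]:
        problems.append("model_list must contain at least one model")
    if config.get("max_tokens", 1) <= 0:
        problems.append("max_tokens must be positive")
    if config.get("temperature", 0.0) < 0:
        problems.append("temperature must be non-negative")
    return problems


example = {
    "name": "demo-run", "bench_name": "arena-hard-v0.1", "temperature": 0.0,
    "max_tokens": 2048, "num_choices": 1, "model_list": ["llama3_8b_fp8"],
}
print(validate_answer_config(example) or "config looks fine")
```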


# Test the model

In [74]:
# Colors to test the model

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    CBLACKBG  = '\33[40m'
    CREDBG    = '\33[41m'
    CGREENBG  = '\33[42m'
    CYELLOWBG = '\33[43m'
    CBLUEBG   = '\33[44m'
    CVIOLETBG = '\33[45m'
    CBEIGEBG  = '\33[46m'
    CWHITEBG  = '\33[47m'
    CBLACK  = '\33[30m'
    CRED    = '\33[31m'
    CGREEN  = '\33[32m'
    CYELLOW = '\33[33m'
    CBLUE   = '\33[34m'
    CVIOLET = '\33[35m'
    CBEIGE  = '\33[36m'
    CWHITE  = '\33[37m'

Create a function that prints the LLM response in a stream

In [75]:
def print_stream(prompt, stream, with_system=False):
    print("="*30 + f" Chat with  --- {model_name} ---  LLM using vLLM " + "="*30 + "\n")
    for message in prompt:
        if message['role'] == 'user':
            color = bcolors.CBLUE
        elif with_system:
            color = bcolors.CVIOLET
        else:
            continue  # skip non-user messages unless with_system is set
        print(color + f"[ {message['role'].upper()} ]" + bcolors.ENDC)
        print(message['content'] + "\n")

    print(bcolors.OKGREEN + "[ ASSISTANT ]" + bcolors.ENDC + "\n")
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
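
If the streamed answer is needed as a string (for example to count its tokens) rather than only printed, the chunks can be collected in the same way. A minimal sketch, assuming the chunk structure used in `print_stream` above; `collect_stream` is a hypothetical helper.

```python
def collect_stream(stream):
    """Concatenate the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices or no text
        if chunk.choices and chunk.choices[0].delta.content is not None:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)
```

A stream can be iterated only once, so use either `print_stream` or `collect_stream` on a given response, not both.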

In [76]:
def load_questions_to_df(question_file: str):
    """Load questions from a file into a DataFrame."""
    questions = []
    with open(question_file, "r") as ques_file:
        for line in ques_file:
            if line.strip():  # skip blank lines (a bare "\n" is still truthy)
                questions.append(json.loads(line))
    
    df = pd.DataFrame([{
        "question_id": question["question_id"],
        "content": question["turns"][0]["content"],
        "cluster": question["cluster"]
    } for question in questions])
    
    return df
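
The loader expects one JSON object per line with the fields accessed above (`question_id`, `cluster`, and `turns[0]["content"]`). The sketch below parses such records from an in-memory string; the sample records are invented for illustration and `parse_jsonl` is a hypothetical helper. It strips each line before testing it, because a blank line read from a file still contains a newline character and is therefore truthy.

```python
import json

# Invented sample records, including one blank line in the middle
sample_jsonl = (
    '{"question_id": "q1", "cluster": "Demo", "turns": [{"content": "First question"}]}\n'
    '\n'
    '{"question_id": "q2", "cluster": "Demo", "turns": [{"content": "Second question"}]}\n'
)


def parse_jsonl(text):
    """Parse JSON Lines text, skipping blank or whitespace-only lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_jsonl(sample_jsonl)
print([record["turns"][0]["content"] for record in records])
```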

In [77]:
def prepare_prompts(df):
    """Prepare prompts dynamically based on question DataFrame."""
    prompts = []

    for _, row in df.iterrows():
        system_prompt = f"""
        You are a sophisticated AI-Expert there to help users solve tasks in several domains efficiently and accurately.
        Now solve the following task from the domain "{row['cluster']}".\n
        """

        user_message = f"{row['content']}"
        
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ]

        prompts.append(messages)

    df['prompt'] = prompts
    return df

In [78]:
# Load questions into a DataFrame
question_df = load_questions_to_df("arena-hard-auto/data/arena-hard-v0.1/question.jsonl")

# Prepare prompts
question_df = prepare_prompts(question_df)

# Select a random row from the DataFrame
example_question_df = question_df.sample(n=1, random_state=random.randint(0, len(question_df) - 1))

example_question_df

Unnamed: 0,question_id,content,cluster,prompt
204,d38fc9d20bd947f38abe497ae7d65522,Can you tell me how to get various HuggingFace...,HuggingFace Ecosystem Exploration,"[{'role': 'system', 'content': '  You a..."


In [79]:
prompt = example_question_df['prompt'].values[0]

print(prompt)

[{'role': 'system', 'content': '\n        You are a sophisticated AI-Expert there to help users solve tasks in several domains efficiently and accurately.\n        Now solve the following task from the domain "HuggingFace Ecosystem Exploration".\n\n        '}, {'role': 'user', 'content': 'Can you tell me how to get various HuggingFace LanguageModels working on my local machine using AutoGen'}]


In [80]:
stream = client.chat.completions.create(
    model=model_name,
    messages=prompt,
    stream=True, 
    extra_body={
        "min_p" : 0.05 # If it's set to 0.05, then it will allow tokens at least 1/20th as probable as the top token to be generated.
    }
)
print_stream(prompt , stream, with_system=True)


[ SYSTEM ]

        You are a sophisticated AI-Expert there to help users solve tasks in several domains efficiently and accurately.
        Now solve the following task from the domain "HuggingFace Ecosystem Exploration".

        

[ USER ]
Can you tell me how to get various HuggingFace LanguageModels working on my local machine using AutoGen

[ ASSISTANT ]

I'd be happy to help!

To get various HuggingFace Language Models working on your local machine using AutoGen, follow these steps:

**Prerequisites:**

1. Install the `transformers` library by running `pip install transformers` in your terminal.
2. Install the `autogen` library by running `pip install autogen` in your terminal.
3. Make sure you have a compatible version of Python installed (Python 3.6 or later).

**Step 1: Install the Model**

Choose the language model you want to work with (e.g., BERT, RoBERTa, DistilBERT, etc.). You can find the list of available models on the HuggingFace website.

R
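
The `min_p` option used in the request above can be illustrated on a toy distribution. This is a simplified sketch of the filtering idea described in the code comment (keep tokens whose probability is at least `min_p` times that of the most likely token, then renormalize), not vLLM's actual sampling code; the token probabilities are invented and `min_p_filter` is a hypothetical helper.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * (top probability) and renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Invented next-token probabilities; with min_p=0.05 the cutoff is 0.05 * 0.60 = 0.03
toy_probs = {"the": 0.60, "a": 0.25, "this": 0.12, "zebra": 0.02, "qux": 0.01}
filtered = min_p_filter(toy_probs, min_p=0.05)
print(sorted(filtered))
```

Because the cutoff scales with the top token's probability, the filter is stricter when the model is confident and more permissive when the distribution is flat.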

# Run the Benchmark with Energy Consumption Tracking

In [81]:
def get_questions(question_file: str):
    """Load questions from a file into a list."""

    questions = []
    with open(question_file, "r") as ques_file:
        for line in ques_file:
            if line.strip():  # skip blank lines (a bare "\n" is still truthy)
                questions.append(json.loads(line))

    return questions

In [82]:
def get_answers(answer_file: str):
    """Load model answers."""

    answers = []
    with open(answer_file, "r") as ans_file:
        for line in ans_file:
            if line.strip():  # skip blank lines (a bare "\n" is still truthy)
                answers.append(json.loads(line))

    return answers


In [83]:
print("="*10 + f" Starting Benchmark {bench_name} with {model_name}" + "="*10 + "\n\n")

question_path = f"arena-hard-auto/data/{bench_name}/question.jsonl"

questions = get_questions(question_path)

prompts = []

for question in questions:

    system_prompt = f"""
        You are a sophisticated AI-Expert there to help users solve tasks in several domains efficiently and accurately.
        Now solve the following task from the domain "{question['cluster']}".\n
        """
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question["turns"][0]["content"]},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )

    prompts.append(prompt)

num_prompts = len(prompts)

total_input_tok = 0
total_output_tok = 0


wandb.init(
    # set the wandb project where this run will be logged
    project="Model_Benchmarks",

    # track hyperparameters and run metadata
    config={
    "benchmark_name": bench_name,
    "num_prompts": num_prompts,
    "framework": 'vLLM',
    "model": model_name,
    "num_gpus": num_gpus,
    },

    name=name,
)


#### Start The Benchmark

tracker = EmissionsTracker(save_to_file=True, project_name=name, log_level="error", pue=1.22, output_file="emissions_benchmarks.csv")
tracker.start()

# Start Timer for Inference
start_time = time.time()

# Change the current working directory to 'arena-hard-auto'
os.chdir('arena-hard-auto')

%run -i 'gen_answer.py' --setting-file config/answer_config_temp.yaml --no-confirmation

# End Timer for Inference
end_time = time.time()

os.chdir('..')


emissions: float = tracker.stop()

ttime = end_time-start_time


print(f"\n\nFinished Benchmark in {ttime:.2f}s")


#### End the Benchmark


answers_file = get_answers(answer_filename)
outputs = [answer["choices"][0]["turns"][0]["content"] for answer in answers_file]


for idx, output in enumerate(outputs): 

    # Extracting information
    prompt = prompts[idx]
    input_tokens = tokenizer.encode(prompt)
    output_tokens = tokenizer.encode(output)
    num_input_tokens = len(input_tokens)
    num_output_tokens = len(output_tokens)

    # Updating cumulative counts
    total_input_tok += num_input_tokens
    total_output_tok += num_output_tokens


# Calculate averages
avg_time_per_prompt = (ttime / num_prompts)
avg_toks_per_sec = total_output_tok/ttime
avg_input_tokens = total_input_tok / num_prompts
avg_output_tokens = total_output_tok / num_prompts

em_i = emissions/total_input_tok *1_000_000
em_o = emissions/total_output_tok *1_000_000
em_p = emissions/num_prompts *10_000

print("="*15 + f" RESULTS for {name} " + "="*15 + 
    "\n\n" + 
    f"""
    Finished Benchmark {bench_name} with {model_name}\n\n
    Total Time: {ttime:.2f}s, AVG/Prompt: {avg_time_per_prompt:.2f}s\n\n
    Average tokens per second: {avg_toks_per_sec:.2f}\n\n
    Total Prompts: {num_prompts}\n
    Total Input Tokens: {total_input_tok}, AVG/Prompt: {avg_input_tokens}\n
    Total Output Tokens: {total_output_tok}, AVG/Prompt: {avg_output_tokens}\n
    """ + 
    
    "-"*50 + "\n" +
    
    f"""
    Total Inference Emissions: {emissions:.3f}kg CO₂eq\n\n
    Emissions / 1.000.000 Input Tokens: {em_i:.3f}kg CO₂eq\n
    Emissions / 1.000.000 Output Tokens: {em_o:.3f}kg CO₂eq\n
    Emissions / 10.000 Prompts: {em_p:.3f}kg CO₂eq\n

    """
    )

wandb.log({"Total Time": ttime,
    "AVG. Time / Prompt": avg_time_per_prompt,
            "AVG. Tokens / Second": avg_toks_per_sec,
            "AVG. Input Tokens": avg_input_tokens,
            "AVG. Output Tokens": avg_output_tokens,
            "Total Emissions": emissions,
            "Emissions / 1.000.000 Input Tokens": em_i,
            "Emissions / 1.000.000 Output Tokens": em_o,
            "Emissions / 10.000 Prompts": em_p,
            })

wandb.finish()

# Save results to a CSV file
results = [
    ["Model", model_name],
    ["Benchmark", bench_name],
    ["Number of GPUs", num_gpus],
    ["Total Prompts", num_prompts],
    ["Total Time", ttime], 
    ["AVG. Time / Prompt", avg_time_per_prompt],
    ["AVG. Tokens / Second", avg_toks_per_sec],
    ["Total Input Tokens", total_input_tok],
    ["AVG. Input Tokens / Prompt", avg_input_tokens],
    ["Total Output Tokens", total_output_tok],
    ["AVG. Output Tokens / Prompt", avg_output_tokens],
    ["Total Emissions", emissions],
    ["Emissions / 1.000.000 Input Tokens", em_i],
    ["Emissions / 1.000.000 Output Tokens", em_o],
    ["Emissions / 10.000 Prompts", em_p]
]

# Ensure the directory exists
emission_output_file_path = f"emission_data/{name}_emission_data.csv"
os.makedirs(os.path.dirname(emission_output_file_path), exist_ok=True)

with open(emission_output_file_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["Metric", "Value"])
    writer.writerows(results)

print(f"Results saved to {emission_output_file_path}\n\n")





{'bench_name': 'arena-hard-v0.1', 'max_tokens': 2048, 'model_list': ['llama3_8b_fp8'], 'name': 'llama3_8b_fp8-arena-hard-v0.1-4gpus', 'num_choices': 1, 'temperature': 0.0}

Expected Input Tokens: 
 47461 Tokens in a total of 500 questions

Expected Output Tokens: 
 275000 Tokens in a total of 500 questions

Max Output Tokens: 
 400000 Tokens in a total of 500 questions


-------------------------  Resulting in Costs:   -------------------------

Expected Costs: 
 4.36 USD

Max. Expected Costs: 
 6.24 USD

Starting to generate answers...
Output to data/arena-hard-v0.1/model_answer/llama3_8b_fp8.jsonl
arena-hard-v0.1


100%|██████████| 500/500 [03:43<00:00,  2.24it/s]




Finished Benchmark in 224.92s


    Finished Benchmark arena-hard-v0.1 with llama3_8b_fp8


    Total Time: 224.92s, AVG/Prompt: 0.45s


    Average tokens per second: 1253.21


    Total Prompts: 500

    Total Input Tokens: 73535, AVG/Prompt: 147.07

    Total Output Tokens: 281871, AVG/Prompt: 563.742

    --------------------------------------------------

    Total Inference Emissions: 0.015kg CO₂eq


    Emissions / 1.000.000 Input Tokens: 0.204kg CO₂eq

    Emissions / 1.000.000 Output Tokens: 0.053kg CO₂eq

    Emissions / 10.000 Prompts: 0.300kg CO₂eq


    



Run summary:
AVG. Input Tokens,147.07
AVG. Output Tokens,563.742
AVG. Time / Prompt,0.44984
AVG. Tokens / Second,1253.20592
Emissions / 1.000.000 Input Tokens,0.20399
Emissions / 1.000.000 Output Tokens,0.05322
Emissions / 10.000 Prompts,0.30001
Total Emissions,0.015
Total Time,224.91994


Results saved to emission_data/llama3_8b_fp8-arena-hard-v0.1-4gpus_emission_data.csv
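
The normalized figures reported above follow directly from the totals. The sketch below repeats that arithmetic in a standalone function so it can be checked by hand; `normalize_emissions` is a hypothetical helper, and the example numbers are the rounded totals from the run above, so the results only approximately match the logged values.

```python
def normalize_emissions(emissions_kg, total_input_tok, total_output_tok, num_prompts):
    """Scale total emissions (kg CO2eq) to per-million-token and per-10,000-prompt figures."""
    return {
        "per_1m_input_tokens": emissions_kg / total_input_tok * 1_000_000,
        "per_1m_output_tokens": emissions_kg / total_output_tok * 1_000_000,
        "per_10k_prompts": emissions_kg / num_prompts * 10_000,
    }


result = normalize_emissions(0.015, total_input_tok=73535, total_output_tok=281871, num_prompts=500)
for key, value in result.items():
    print(f"{key}: {value:.3f} kg CO2eq")
```

Note that each figure divides the same total emissions by a different denominator, so the per-input-token and per-output-token values are not additive.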


