# LLM Answer Generation

## Data Munging
This notebook processes PPE, LiveBench, and MTBench and extracts their prompts into a consistent format. It creates the dataframes `scale_ppe`, `scale_livebench`, and `scale_mtbench`, each written out as a JSON file.
It then uses LLMs to generate outputs for a sample of the prompts.
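
The standardized frames built below share four columns (`question_id`, `prompts`, `category`, `dataset`). A minimal sketch of a check for that shared schema follows; the helper name `check_schema` and the toy frame are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Columns every standardized frame is expected to share.
REQUIRED_COLUMNS = ["question_id", "prompts", "category", "dataset"]

def check_schema(frame: pd.DataFrame) -> list:
    """Return the required columns that are missing from `frame` (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in frame.columns]

# Toy frame standing in for one of the standardized dataframes.
toy = pd.DataFrame({"question_id": [1], "prompts": ["hi"], "dataset": ["ppe"]})
missing = check_schema(toy)
print("Missing columns:", missing)
```

Running a check like this on each frame before writing it to JSON would catch a forgotten rename or a dropped column early.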

In [None]:
%%capture
import pandas as pd
import json
!pip install datasets
import datasets
import requests
from openai import OpenAI
#hides huggingface notoken output (comment kept on its own line so it is not read as part of the value)
%env TRANSFORMERS_NO_ADVISORY_WARNINGS=1
import warnings
warnings.filterwarnings('ignore') #hides trivial df warnings

### PPE

In [None]:
#read ppe, filter for just english, and write out to json
ppe = pd.read_parquet("hf://datasets/lmarena-ai/PPE-Human-Preference-V1/data/test-00000-of-00001.parquet") # link provided from PPE
print("SIZE OF ALL DATA", len(ppe))
ppe = ppe[ppe['language'] == 'English'] #select only English language.  Approximately half the dataset
scale_ppe = ppe[['question_id','prompt']].copy() #copy so the edits below do not act on a view of ppe
scale_ppe.rename(columns={'prompt': 'prompts'}, inplace=True)
scale_ppe['dataset'] = 'ppe' #add a column to indicate the dataset
scale_ppe['category'] = ppe['math_prompt'].apply(lambda x: 'math' if x else 'general') #add a category (currently just math vs general)

print("SIZE OF JUST ENGLISH DATA", len(ppe))
scale_ppe.to_json("scale_ppe_2025.json", orient="records") #write out data
print("WRITTEN OUT ALL ENGLISH LANGUAGE PROMPTS.  Additional metadata is available in original file as shown here:")
ppe.head()

SIZE OF ALL DATA 16038
SIZE OF JUST ENGLISH DATA 8842
WRITTEN OUT ALL ENGLISH LANGUAGE PROMPTS.  Additional metadata is available in original file as shown here:


Unnamed: 0,question_id,model_a,model_b,winner,prompt,response_1,response_2,language,conv_metadata,is_code,...,category_tag,sim,length_a,length_b,token_diff,longer_won,hard_prompt,easy_prompt,if_prompt,math_prompt
0,f4c6889138ae4867a686df2f35080a3a,mistral-large-2402,gpt-4o-2024-05-13,model_a,I'm a developer building a chrome extension. I...,"From your description, it seems like you are t...",The issue seems to be related to how the `chro...,English,"{'context_a_tokens': 191, 'context_b_tokens': ...",True,...,"{'criteria_v0.1': {'complexity': True, 'creati...",0.827569,546,380,0.695971,True,True,False,False,False
1,3c547616a9a0433481a2f32b06311a36,phi-3-medium-4k-instruct,llama-3-70b-instruct,model_b,Repeat the question and then answer it: Matteo...,Question: Matteo has 20 apples and buys 20 ora...,"Here is the question:\n\nMatteo has 20 apples,...",English,"{'context_a_tokens': 51, 'context_b_tokens': 5...",False,...,"{'criteria_v0.1': {'complexity': True, 'creati...",0.958751,211,246,0.857724,True,True,False,True,True
2,efca77bd70ce41c8acb6e806f2509705,gpt-4-1106-preview,gpt-4o-2024-05-13,model_a,Your response must begin with 'なんJ 栗の花大学 生唾芸術C...,なんJ 栗の花大学 生唾芸術CreamPan部\nたっぷりUnchartedな世界で満足して...,なんJ 栗の花大学 生唾芸術CreamPan部\nたっぷりUnchartedな世界で満足して...,English,"{'context_a_tokens': 665, 'context_b_tokens': ...",False,...,"{'criteria_v0.1': {'complexity': False, 'creat...",0.841135,1077,964,0.895079,True,True,False,True,False
3,8640f290e2184d68a68526654d148cfa,claude-3-opus-20240229,claude-3-haiku-20240307,tie,Topic: 1-element positive feedback loop\nIntro...,Here's the completed MATLAB code with the requ...,Here is the completed code with the requested ...,English,"{'context_a_tokens': 1293, 'context_b_tokens':...",True,...,"{'criteria_v0.1': {'complexity': True, 'creati...",0.963807,923,1067,0.865042,tie,True,False,True,False
4,7591ef1bc7fb44b2b57d4392ded47f17,gemini-1.5-pro-api-0514,command-r,tie (bothbad),in python check if datetime object has time,```python\nfrom datetime import datetime\n\nde...,You can check if a datetime object has time in...,English,"{'context_a_tokens': 8, 'context_b_tokens': 8,...",True,...,"{'criteria_v0.1': {'complexity': True, 'creati...",0.690801,478,205,0.42887,tie,True,False,False,False


### LiveBench

The set-up is a bit clunky. To get this data, I 1) installed livebench, which works only on Python 3.10, 2) ran download_questions.py, and 3) removed the `LCB_Generation` questions, which take up more than 200 MB.

In [None]:
%%capture
!gdown '1Chra9A4IgtpCMqxcB-M4zCpudF3yI0E1'
!unzip live_bench.zip
#the pre-processed file is hosted at https://drive.google.com/file/d/1Chra9A4IgtpCMqxcB-M4zCpudF3yI0E1/view?usp=sharing
#columns are standardized and the largest question type (LCB_Generation, 200 MB+) is dropped

In [None]:
# Walk live_bench recursively, read every question.jsonl that is found,
# and collect all of the records in a single pandas dataframe.

import pandas as pd
import json
import os

def process_livebench(root_dir):
    all_data = []
    for subdir, _, files in os.walk(root_dir):
        for file in files:
            if file == "question.jsonl":
                filepath = os.path.join(subdir, file)
                try:
                    with open(filepath, 'r') as f:
                        print("OPENING", filepath)
                        for line in f:
                            try:
                                data = json.loads(line)
                                all_data.append(data)
                            except json.JSONDecodeError as e:
                                print(f"Skipping invalid JSON line in {filepath}: {e}")
                except FileNotFoundError:
                    print(f"File not found: {filepath}")
                except Exception as e:
                    print(f"Error processing file {filepath}: {e}")
    return pd.DataFrame(all_data)

livebench_df = process_livebench('live_bench')

if not livebench_df.empty:
    print('\n', "Merged headers for all questions are", livebench_df.keys())
else:
    print("No data found in the specified directory.")

OPENING live_bench/math/math_comp/question.jsonl
OPENING live_bench/math/AMPS_Hard/question.jsonl
OPENING live_bench/math/olympiad/question.jsonl
OPENING live_bench/data_analysis/tablejoin/question.jsonl
OPENING live_bench/data_analysis/tablereformat/question.jsonl
OPENING live_bench/data_analysis/cta/question.jsonl
OPENING live_bench/instruction_following/paraphrase/question.jsonl
OPENING live_bench/instruction_following/summarize/question.jsonl
OPENING live_bench/instruction_following/simplify/question.jsonl
OPENING live_bench/instruction_following/story_generation/question.jsonl
OPENING live_bench/reasoning/zebra_puzzle/question.jsonl
OPENING live_bench/reasoning/spatial/question.jsonl
OPENING live_bench/reasoning/web_of_lies_v2/question.jsonl
OPENING live_bench/language/typos/question.jsonl
OPENING live_bench/language/connections/question.jsonl
OPENING live_bench/language/plot_unscrambling/question.jsonl
OPENING live_bench/coding/coding_completion/question.jsonl

 Merged headers fo

In [None]:
scale_livebench = livebench_df[['question_id','turns', 'category', 'task']].copy() #copy so the edits below do not act on a view of livebench_df
scale_livebench['turns'] = scale_livebench['turns'].apply(lambda x: x[0]) #keep only the first turn
scale_livebench.rename(columns={'turns': 'prompts'}, inplace=True)
scale_livebench['dataset'] = 'livebench' #add a column to indicate the dataset
scale_livebench.to_json("scale_livebench_2025.json", orient="records") #write out data
print('WRITING OUT LIVEBENCH')
scale_livebench.head()

WRITING OUT LIVEBENCH


Unnamed: 0,question_id,prompts,category,task,dataset
0,a191e799d6ca2258faa9f4cfe3d9a55317c96d32c92ab8...,Houses $X$ and $Y$ are $45$ miles apart. Ava l...,math,math_comp,livebench
1,c16e1e2d181fc4d4bb9b88a1b32d44b3ab54aaeb1b76ac...,The weight of $\frac{1}{3}$ of a birthday cake...,math,math_comp,livebench
2,7597e564b5c500bd2979e29e6b130437d089570148a8d5...,How many positive perfect squares less than $2...,math,math_comp,livebench
3,3fa2ad109d9ea27936ac3c09c9fefb055d67ca3598f2cc...,How many digits are in the base-ten representa...,math,math_comp,livebench
4,2cd412daa3383147d43cd0151c66909377d6c8fbe3b290...,Xander rolls a standard $6$-sided die $4$ time...,math,math_comp,livebench


### MTBench

In [None]:
# Download MT-Bench questions
url = "https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl"
response = requests.get(url)
response.raise_for_status() #stop here if the download failed
lines = response.text.split("\n")

# Parse each non-empty line as JSON and collect the records in questions
questions = []
for line in lines:
    if line:
        questions.append(json.loads(line))

example_question = questions[0]
example_question

{'question_id': 81,
 'category': 'writing',
 'turns': ['Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.',
  'Rewrite your previous response. Start every sentence with the letter A.']}
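
As an alternative to splitting the response text by hand, pandas can read JSON Lines directly. The sketch below uses two made-up records in the same shape as the MT-Bench file (they are illustrative, not real questions) and keeps only the first turn as the prompt.

```python
import io
import pandas as pd

# Two invented records shaped like MT-Bench's question.jsonl (illustrative only).
sample_jsonl = (
    '{"question_id": 1, "category": "writing", "turns": ["first turn", "second turn"]}\n'
    '{"question_id": 2, "category": "math", "turns": ["only turn"]}'
)

# lines=True tells pandas to parse one JSON object per line.
parsed = pd.read_json(io.StringIO(sample_jsonl), lines=True)

# Keep only the first turn of each question as the prompt.
parsed["prompts"] = parsed["turns"].apply(lambda turns: turns[0])
print(parsed[["question_id", "category", "prompts"]])
```

In the notebook, `io.StringIO(response.text)` would presumably take the place of the sample string.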

In [None]:
scale_mtbench = pd.DataFrame(questions)
scale_mtbench.drop('reference', axis=1, inplace=True)
scale_mtbench['turns'] = scale_mtbench['turns'].apply(lambda x: x[0]) #keep only the first turn
scale_mtbench.rename(columns={'turns': 'prompts'}, inplace=True)
scale_mtbench['dataset'] = 'mtbench' #add a column to indicate the dataset
scale_mtbench.to_json("scale_mtbench_2025.json", orient="records") #write out data
print('WRITING OUT MTBENCH')
scale_mtbench.head()

WRITING OUT MTBENCH


Unnamed: 0,question_id,category,prompts,dataset
0,81,writing,Compose an engaging travel blog post about a r...,mtbench
1,82,writing,Draft a professional email seeking your superv...,mtbench
2,83,writing,Imagine you are writing a blog post comparing ...,mtbench
3,84,writing,Write a persuasive email to convince your intr...,mtbench
4,85,writing,"Describe a vivid and unique character, using s...",mtbench


## Data Generation

In [None]:
import pandas as pd
import json

data = ['livebench'] #add 'ppe' and 'mtbench' to this list to process all three at once
df_list = []

for dataset in data:
  filename = f'scale_{dataset}_2025.json'
  try:
    with open(filename, 'r') as f:
      json_data = json.load(f)
      for item in json_data:
        df_list.append([
            item.get('question_id'), #format is consistent across all 3
            item.get('prompts'),
            item.get('category'),
            item.get('dataset')
        ])

  except FileNotFoundError:
      print(f"File {filename} not found. Skipping.")

df = pd.DataFrame(df_list, columns=['question_id', 'prompts', 'category', 'dataset'])
sample = df.sample(80, random_state=42)
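
One way to process all three datasets at once is to load each file into its own frame and concatenate them. This is a sketch under assumptions: `load_scale_files` is a hypothetical helper, and the demo writes tiny stand-in files to a temporary directory rather than reading the real outputs.

```python
import json
import os
import tempfile

import pandas as pd

COLUMNS = ["question_id", "prompts", "category", "dataset"]

def load_scale_files(names, directory="."):
    """Load scale_<name>_2025.json for each name and stack them (hypothetical helper)."""
    frames = []
    for name in names:
        path = os.path.join(directory, f"scale_{name}_2025.json")
        if not os.path.exists(path):
            print(f"File {path} not found. Skipping.")
            continue
        with open(path, "r") as f:
            records = json.load(f)
        # Keep only the shared columns; a key missing from a record becomes NaN.
        frames.append(pd.DataFrame(records, columns=COLUMNS))
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)

# Demo with stand-in files; the livebench file is left out on purpose.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ppe", "mtbench"]:
        rows = [{"question_id": f"{name}-1", "prompts": "hello",
                 "category": "general", "dataset": name}]
        with open(os.path.join(tmp, f"scale_{name}_2025.json"), "w") as f:
            json.dump(rows, f)
    combined = load_scale_files(["livebench", "ppe", "mtbench"], directory=tmp)

print(combined["dataset"].tolist())
```

Loading the records with `json.load` rather than `pd.read_json` avoids pandas' type inference on `question_id`, which mixes hash strings and integers across the three datasets.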

In [None]:
df

Unnamed: 0,question_id,prompts,category,dataset
0,a191e799d6ca2258faa9f4cfe3d9a55317c96d32c92ab8...,Houses $X$ and $Y$ are $45$ miles apart. Ava l...,math,livebench
1,c16e1e2d181fc4d4bb9b88a1b32d44b3ab54aaeb1b76ac...,The weight of $\frac{1}{3}$ of a birthday cake...,math,livebench
2,7597e564b5c500bd2979e29e6b130437d089570148a8d5...,How many positive perfect squares less than $2...,math,livebench
3,3fa2ad109d9ea27936ac3c09c9fefb055d67ca3598f2cc...,How many digits are in the base-ten representa...,math,livebench
4,2cd412daa3383147d43cd0151c66909377d6c8fbe3b290...,Xander rolls a standard $6$-sided die $4$ time...,math,livebench
...,...,...,...,...
1053,5f7ac157843831b50cc45eb8991c1659e0196805ea75d9...,### Instructions: You are an expert Python pro...,coding,livebench
1054,7ad613404830cbc9bd478f52fb74f5c73a2c5540593840...,### Instructions: You are an expert Python pro...,coding,livebench
1055,d5ecf69bb5d8e62278ed3d129812cdc0e9bfb90d8b6ca3...,### Instructions: You are an expert Python pro...,coding,livebench
1056,0119bdbf9e79f05e988b6d3408e2fdc0b7e4bc11e81ff9...,### Instructions: You are an expert Python pro...,coding,livebench


In [None]:
from google.colab import userdata
import time
API_KEY = userdata.get('API_KEY') #this is set as a colab secret

BASE_URL = "https://litellm.ml.scaleinternal.com/"
api_key = API_KEY

client = OpenAI(
    api_key=api_key,
    base_url=BASE_URL,
)


INTERNAL_USER = "internal"
cost_tags = {
    "metadata": {
        "tags": ["useCase:mldg-LLMJudge4Evaluation"]
    }
    }

#prompting code which makes three attempts before giving up
def prompt_llm(query, model):
    retries = 3
    for i in range(retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": query}
                ],
                user=INTERNAL_USER,
                extra_body=cost_tags,
            )
            llm_output = (response.choices[0].message.content)
            return llm_output
        except Exception as e:
            if i < retries - 1:
                print(f"Attempt {i + 1} failed with error: {e}. Retrying in 2 seconds...")
                time.sleep(2)  # Wait before retrying
            else:
                print("All retry attempts failed. Returning 'ERROR'")
                return "ERROR"


llms = [
    "openai/o3-mini-2025-01-31",
    'anthropic/claude-3-7-sonnet-20250219'
]
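
The retry loop above waits a fixed 2 seconds between attempts. A common variation is exponential backoff, where the wait doubles after each failure. The sketch below is not the notebook's method: `call_with_backoff` and `flaky` are hypothetical names, and the sleep function is injectable so the demo runs without actually waiting.

```python
import time

def call_with_backoff(fn, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt and try again.

    Returns fn()'s result, or the string "ERROR" once all attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception as e:
            if attempt == retries - 1:
                print("All retry attempts failed. Returning 'ERROR'")
                return "ERROR"
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed with error: {e}. Retrying in {delay} seconds...")
            sleep(delay)

# Demo: a stand-in for an API call that fails twice and then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []  # record the requested waits instead of sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Longer waits give a gateway more time to recover from timeouts, at the cost of a slower run when failures are frequent.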

In [None]:
sample

Unnamed: 0,question_id,prompts,category,dataset
457,40af2768cabe32744e3b1efd2552edb077e2539a8f4580...,Please convert the Input Table from CSV format...,data_analysis,livebench
289,63608f487c9f22541c1f8778ae6771658425d60d1c0819...,"Find the greatest common divisor of $\{32, 174...",math,livebench
323,fa5ad59691de133027557e68714fc9cee3cc37e69e3c26...,You are given a question and its solution. The...,math,livebench
31,da8d8cf796bb05ab165bf9cf4843a9fd569a1ef0b25d60...,For how many integers $n$ does the expression\...,math,livebench
428,2379e1e2586eacdf0c9ea0b9385f3c679ff72aa7e2a74f...,Please convert the Input Table from TSV format...,data_analysis,livebench
...,...,...,...,...
96,563f538677e0e1e6d34368ced435ae61c00c3a15aa4432...,"Circle $C_1$ and $C_2$ each have radius $1$, a...",math,livebench
442,4d22f4f91dfc8188c2244048d968e9885ee063658c14fc...,Please convert the Input Table from TSV format...,data_analysis,livebench
49,f132f302743f83ae7ce9ed1be6017917173ca431c52abc...,A regular pentagon with area $\sqrt{5}+1$ is p...,math,livebench
872,3930c1a9f5be45985e44f14574267baf94e09c1d8d76fa...,"Please output this exact text, with no changes...",language,livebench


In [None]:
%%time
for model in llms:
  model_output_list = []
  count = 0
  print('GENERATING OUTPUT FOR MODEL', model)
  for index, row in sample.iterrows():
    if count % 10 == 0:
      print('PROCESSING ROW', count)
    model_output = prompt_llm(row['prompts'],model)
    model_output_list.append(model_output)
    count+=1
  sample['model/'+model] = model_output_list

GENERATING OUTPUT FOR MODEL openai/o3-mini-2025-01-31
PROCESSING ROW 0
PROCESSING ROW 10
PROCESSING ROW 20
PROCESSING ROW 30
PROCESSING ROW 40
PROCESSING ROW 50
Attempt 1 failed with error: Error code: 524. Retrying in 2 seconds...
PROCESSING ROW 60
PROCESSING ROW 70
GENERATING OUTPUT FOR MODEL anthropic/claude-3-7-sonnet-20250219
PROCESSING ROW 0
PROCESSING ROW 10
PROCESSING ROW 20
PROCESSING ROW 30
PROCESSING ROW 40
PROCESSING ROW 50
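
Because `prompt_llm` returns the string "ERROR" when every attempt fails, it can be useful to count those markers per model before saving. A minimal sketch, assuming the `model/` column prefix used in the loop above; `count_errors` and the toy frame are hypothetical.

```python
import pandas as pd

def count_errors(frame, prefix="model/", marker="ERROR"):
    """Count marker values in every column whose name starts with prefix (hypothetical helper)."""
    return {
        col: int((frame[col] == marker).sum())
        for col in frame.columns
        if col.startswith(prefix)
    }

# Toy frame standing in for `sample` after generation.
toy = pd.DataFrame({
    "prompts": ["a", "b", "c"],
    "model/m1": ["x", "ERROR", "y"],
    "model/m2": ["ERROR", "ERROR", "z"],
})
errors = count_errors(toy)
print(errors)
```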


In [None]:
sample.to_json("scale_judgedata_LIVEBENCHONLY_2025.json", orient="records") #write out data
print('WRITING OUT FINAL DATA')

#save to google drive
from google.colab import drive
drive.mount('/content/drive')
sample.to_json('/content/drive/My Drive/scale_judgedata_LIVEBENCHONLY_2025.json', orient="records") #save the sample with model outputs, matching the local file

WRITING OUT FINAL DATA
