<a href="https://colab.research.google.com/github/Wittgenbot/fine-tuning/blob/main/LLM_as_a_judge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set Up

## Imports

In [None]:
import re
import ast
import json
import requests
from time import sleep
from tqdm.auto import tqdm
from google.colab import userdata, drive

## Google Drive

In [None]:
drive.mount('/content/drive')
data_dir = '/content/drive/My Drive/LLM as a Judge/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Mixtral API Set Up

In [None]:
MIXTRAL_API_KEY = userdata.get('MIXTRAL_API_KEY')

In [None]:
def query_mixtral(system_prompt, user_prompt):

    url = "https://api.mistral.ai/v1/chat/completions"

    payload = {
        "model": "open-mixtral-8x7b",
        "stop": ["</s>"],
        "stream": False,
        "messages": [
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": user_prompt
            }
        ]
    }

    headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "Authorization": f"Bearer {MIXTRAL_API_KEY}"
    }

    answer = ''
    max_attempts = 3
    attempt = 0

    while attempt < max_attempts:
        try:
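            # A timeout keeps a stalled connection from hanging the loop;
            # requests.Timeout is a RequestException, so it is retried too.
            response = requests.post(url, json=payload, headers=headers, timeout=60)
            response.raise_for_status()
            data = response.json()
            # Fall back to an empty answer if the API returns no choices.
            choices = data.get("choices") or [{}]
            answer = choices[0].get("message", {}).get("content", "")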
            break
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            attempt += 1
            if attempt < max_attempts:
                sleep_time = 2 ** attempt
                print(f"Retrying in {sleep_time} seconds...")
                sleep(sleep_time)
            else:
                print("Maximum retry attempts reached, failing.")

    return answer
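
The retry loop above waits `2 ** attempt` seconds after each failed attempt and gives up once `max_attempts` is reached. As a minimal sketch of that schedule (the helper name `backoff_delays` is hypothetical and not part of the notebook), the waits can be listed without making any request:

```python
def backoff_delays(max_attempts=3):
    """Return the waits (in seconds) between attempts, mirroring query_mixtral."""
    delays = []
    attempt = 0
    while attempt < max_attempts:
        # Pretend every attempt fails, the worst case of the retry loop.
        attempt += 1
        if attempt < max_attempts:
            delays.append(2 ** attempt)
    return delays


print(backoff_delays(3))  # [2, 4]
```

With the default of three attempts, a call that fails every time sleeps for 6 seconds in total and then returns the empty string, which the later parsing step would treat as an evaluation without a rating.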

# Test Set Up

## Test Questions

In [None]:
TLP_test_questions = [
    'How does Wittgenstein define the concept of a "proposition" in the Tractatus?',
    'What role does logic play in the structure of reality as presented in the Tractatus?',
    'Can you explain the picture theory of language proposed in the Tractatus?',
    'How does Wittgenstein distinguish between what can be said and what can only be shown?',
    'How are facts and states of affairs conceptualized in the Tractatus?',
    'What is the significance of the limit of language in the Tractatus?',
    'How does the Tractatus address the relationship between language and reality?',
    'What is the purpose of the ladder metaphor in the conclusion of the Tractatus?',
    'What is meant by "propositional logic" in the Tractatus, and how is it significant to the work’s overall argument?',
    'How does the Tractatus critique the possibility of metaphysical propositions?',
    'What is the role of silence in Wittgenstein’s philosophy as expressed in the Tractatus?',
    'What did Wittgenstein mean by "philosophy is not a body of doctrine but an activity" in the Tractatus?',
    'How does Wittgenstein treat the problem of solipsism in the Tractatus?',
    'What does Wittgenstein mean by "The limits of my language mean the limits of my world"?',
    'How does the Tractatus conceptualize the idea of sense and nonsense in language?',
]

PI_test_questions = [
    'In Wittgenstein\'s Philosophical Investigations, what determines the meaning of a word?',
    'How did Wittgenstein use the example of "games" in Philosophical Investigations to illustrate the family resemblance concept?',
    'In Philosophical Investigations, what is the concept of family resemblance?',
    'In the context of Philosophical Investigations, is the existence of a private language possible?',
    'How can we confirm that someone is following a rule, according to Philosophical Investigations?',
    'In Philosophical Investigations, how can the concept of sameness be used to teach a rule?',
    'What is the role of language-games in Wittgenstein\'s Philosophical Investigations?',
    'How is the idea that mental processes form the basis of our understanding of language critiqued in Philosophical Investigations?',
    'Do private mental objects exist according to Wittgenstein’s Philosophical Investigations?',
    'What is the relationship between forms of life and language in Philosophical Investigations?',
    'In Philosophical Investigations, how is it explained that the meaning of a word is its use in language?',
    'How do philosophical problems arise from misunderstandings of language, as discussed in Philosophical Investigations?',
    'What is problematic about the Augustinian view on meaning in Philosophical Investigations?',
    'In Philosophical Investigations, what is said about the misguided nature of philosophical questions?',
    'What is the purpose of the analogy with the toolbox in Philosophical Investigations?',
    'In what ways does the notion of "language-games" in Philosophical Investigations challenge traditional epistemology?',
    'What implications does the critique of private language in Philosophical Investigations have for theories of consciousness?',
    'How is the concept of pain used in Philosophical Investigations to argue against the possibility of a private language?',
    'In what ways does the examination of pain\'s sameness in Philosophical Investigations challenge the understanding of subjective experiences?',
    'What are the limitations of language, as discussed in Wittgenstein’s Philosophical Investigations?',
    'In Philosophical Investigations, what is the relationship between forms of life and understanding philosophy?'
]

test_questions = {
    'TLP': TLP_test_questions,
    'PI': PI_test_questions
}

## Retrieve Wittgenbot "To-Be-Judged" Answers From Google Drive

In [None]:
wittgenbot_answers_file_path = data_dir + 'wittgenbot_answers.txt'

with open(wittgenbot_answers_file_path, 'r') as file:
    content = file.read()
    wittgenbot_answers = ast.literal_eval(content)

print(wittgenbot_answers)

{'TLP': {'How does Wittgenstein define the concept of a "proposition" in the Tractatus?': 'Wittgenstein\'s definition of a "proposition" in the Tractatus Logico-Philosophicus is quite precise. He sees a proposition as a picture of reality. More specifically, he defines it as a logical product of elementary propositions, which are the smallest units of language that can still express a meaningful thought. \n\nA proposition, according to Wittgenstein, shares a logical form with the reality it represents. This means that the structure of a proposition mirrors the structure of the state of affairs in the world that it depicts. When a proposition accurately represents the facts, it is true; when it doesn\'t, it\'s false.\n\nSo, in essence, Wittgenstein\'s concept of a proposition is deeply connected to his idea of language as a model of reality. Propositions are the tools we use to represent the world, and their truth or falsehood depends on how well they align with the actual state of affa
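
Because later cells look up golden answers by the exact question string, it may help to confirm that the loaded answers cover every test question. A minimal sketch, using a hypothetical helper `missing_questions` and small stand-in dictionaries rather than the real data:

```python
def missing_questions(expected, answers):
    """Map each work to the expected questions that have no stored answer."""
    missing = {}
    for work, questions in expected.items():
        answered = answers.get(work, {})
        gaps = [q for q in questions if q not in answered]
        if gaps:
            missing[work] = gaps
    return missing


# Stand-in data (hypothetical); in the notebook these would be
# test_questions and wittgenbot_answers.
expected = {'TLP': ['q1', 'q2'], 'PI': ['q3']}
answers = {'TLP': {'q1': 'answer 1'}, 'PI': {'q3': 'answer 3'}}

print(missing_questions(expected, answers))  # {'TLP': ['q2']}
```

An empty dictionary means every question has an answer. The comparison is an exact string match, so a straight versus a curly apostrophe in a question would count as a mismatch.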

## Generate Reference "Golden" Answers Through Mixtral 8x7b

In [None]:
system_prompt = '''
You are an AI assistant specialized in the philosophy of Ludwig Wittgenstein. \
Answer the user's question. Ensure your answer is accurate, thorough, coherent and relevant. \
Use a tone that is conversational and clear.
'''

user_prompt = '''
Answer this question about Ludwig Wittgenstein's philosophy:
{question}
'''

In [None]:
golden_answers = {}

for work, questions in tqdm(test_questions.items()):

    golden_answers[work] = {}

    for question in tqdm(questions):

        formatted_user_prompt = user_prompt.format(question=question)

        answer = query_mixtral(system_prompt, formatted_user_prompt)

        golden_answers[work][question] = answer

print(golden_answers)

{'TLP': {'How does Wittgenstein define the concept of a "proposition" in the Tractatus?': 'In the Tractatus Logico-Philosophicus, Ludwig Wittgenstein defines a "proposition" as a logical picture of the world. For Wittgenstein, propositions are fundamentally linguistic expressions that can represent the state of affairs in the world. He argues that the structure of a proposition mirrors the structure of the reality it represents, and this allows for the possibility of truth and falsehood.\n\nWittgenstein\'s concept of a proposition is built upon his idea of language as a logical system, in which simple symbols (names) refer to simple objects, and complex symbols (propositions) represent complexes of objects. He claims that a proposition is a fact, in that it either corresponds to a state of affairs in the world (and is therefore true), or it does not (and is false).\n\nWittgenstein\'s definition of a proposition is closely tied to his theory of meaning, which holds that the meaning of a

### Save Results to Google Drive

In [None]:
golden_answers_file_name = 'golden_answers.txt'
golden_answers_file_path = data_dir + golden_answers_file_name

golden_answers_json = json.dumps(golden_answers, indent=4)

with open(golden_answers_file_path, 'w') as file:
    file.write(golden_answers_json)

print(f'File saved as {golden_answers_file_name}')

File saved as golden_answers.txt
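
The file is written with `json.dumps`, so despite the `.txt` extension it holds JSON and can be read back with `json.load` (the notebook reads `wittgenbot_answers.txt` with `ast.literal_eval` instead). A small round-trip sketch, using a temporary directory and stand-in data in place of the Drive folder:

```python
import json
import os
import tempfile

# Stand-in for golden_answers (hypothetical content).
sample_answers = {'TLP': {'q1': 'A proposition is a picture of reality.'}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'golden_answers.txt')

    # Same write pattern as the cell above.
    with open(path, 'w') as file:
        file.write(json.dumps(sample_answers, indent=4))

    # Reading back: json.load parses the JSON text into a dict again.
    with open(path, 'r') as file:
        reloaded = json.load(file)

print(reloaded == sample_answers)  # True
```

Reloading from disk this way would let the grading cells run in a later session without regenerating the golden answers.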


# LLM-as-a-Judge

## Single Answer Grading

### Prompts

The prompts are adapted from the templates in the [LLM-as-a-Judge](https://arxiv.org/abs/2306.05685) paper.

In [None]:
system_prompt = '''
Please act as an impartial judge and evaluate the quality of the response provided by an \
AI assistant specialized in Ludwig Wittgenstein's philosophy to the user's question about Ludwig Wittgenstein's philosophy. \
Your evaluation should consider factors such as the accuracy, relevance, depth, coherency, and level of detail of \
the response. Begin your evaluation by providing a short explanation. Be as objective as \
possible. After providing your explanation, please rate the response on a scale of 1 to 100 \
by strictly following this format: "[[rating]]", for example: "Rating: [[53]]".
'''

user_prompt = '''
[User Question]
{question}

[The Start of Assistant’s Answer]
{answer}
[The End of Assistant’s Answer]
'''
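
The `{question}` and `{answer}` placeholders are filled with `str.format` in the grading loop. A brief sketch with made-up inputs, showing that braces inside the substituted answer are left untouched because only the template itself is parsed for placeholders:

```python
# Shortened stand-in for the notebook's user_prompt template.
template = '''
[User Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]
'''

# Hypothetical inputs; the answer deliberately contains braces.
filled = template.format(
    question='What is a language-game?',
    answer='A language-game is {roughly} an activity woven together with words.',
)

print(filled)
```

The same holds for the reference-guided template, which only adds a second answer slot.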

### Grading

In [None]:
single_grade_eval = {}

for work, questions in tqdm(wittgenbot_answers.items()):

    single_grade_eval[work] = {}

    for question, answer in tqdm(questions.items()):

        formatted_user_prompt = user_prompt.format(question=question, answer=answer)

        response_grade = query_mixtral(system_prompt, formatted_user_prompt)

        single_grade_eval[work][question] = response_grade

### Save Results to Google Drive

In [None]:
single_grade_eval_file_name = 'single_grade_eval.txt'
single_grade_eval_file_path = data_dir + single_grade_eval_file_name

single_grade_eval_json = json.dumps(single_grade_eval, indent=4)

with open(single_grade_eval_file_path, 'w') as file:
    file.write(single_grade_eval_json)

print(f'File saved as {single_grade_eval_file_name}')

File saved as single_grade_eval.txt


## Reference-Guided Grading

### Prompts

In [None]:
system_prompt = '''
Please act as an impartial judge and evaluate the quality of the response provided by an AI \
assistant specialized in Ludwig Wittgenstein's philosophy to the user's question about Ludwig \
Wittgenstein's philosophy. You will be given a reference answer and the assistant's answer. \
Your job is to evaluate the assistant's answer in reference to the reference answer. Your \
evaluation should consider factors such as the accuracy, relevance, depth, coherence, and \
level of detail of the response. Begin your evaluation by comparing the assistant's answer \
with the reference answer. Identify and correct any mistakes. Do not allow the length of the \
responses to influence your evaluation. Be as objective as possible. After providing your \
explanation, please rate the assistant's answer relative to the reference answer on a scale \
of 1 to 100 by strictly following this format: "[[rating]]", for example: "Rating: [[53]]".
'''

user_prompt = '''
[User Question]
{question}

[The Start of Reference Answer]
{ref_answer}
[The End of Reference Answer]

[The Start of Assistant Answer]
{wittgenbot_answer}
[The End of Assistant Answer]
'''

### Grading

In [None]:
ref_grade_eval = {}

for work, questions in tqdm(wittgenbot_answers.items()):

    ref_grade_eval[work] = {}

    for question, answer in tqdm(questions.items()):

        wittgenbot_answer = answer
        ref_answer = golden_answers[work][question]

        formatted_user_prompt = user_prompt.format(question=question,
                                                   wittgenbot_answer=wittgenbot_answer,
                                                   ref_answer=ref_answer)

        response_grade = query_mixtral(system_prompt, formatted_user_prompt)

        ref_grade_eval[work][question] = response_grade

### Save Results to Google Drive

In [None]:
ref_grade_eval_file_name = 'ref_grade_eval.txt'
ref_grade_eval_file_path = data_dir + ref_grade_eval_file_name

ref_grade_eval_json = json.dumps(ref_grade_eval, indent=4)

with open(ref_grade_eval_file_path, 'w') as file:
    file.write(ref_grade_eval_json)

print(f'File saved as {ref_grade_eval_file_name}')

File saved as ref_grade_eval.txt


## Parsing Results


In [None]:
pattern = r"\[\[(\d+)\]\]"

def extract_rating(evaluation):

    match = re.search(pattern, evaluation)

    return int(match.group(1)) if match else -1
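
To make the parser's behaviour concrete, the sketch below repeats the pattern and function from the cell above and applies them to a few hypothetical judge outputs (the sample strings are invented for illustration, not taken from the notebook's results):

```python
import re

pattern = r"\[\[(\d+)\]\]"


def extract_rating(evaluation):
    match = re.search(pattern, evaluation)
    return int(match.group(1)) if match else -1


# Hypothetical judge outputs.
samples = [
    'The answer is accurate and clear. Rating: [[53]]',
    'Rating: [[8.5]]',       # decimal: \d+ cannot span the dot
    'Rating: 90',            # missing double brackets
    'Scores: [[70]], later revised to [[85]]',
]

ratings = [extract_rating(s) for s in samples]
print(ratings)  # [53, -1, -1, 70]
```

Two properties follow from the regex: `re.search` returns the first bracketed integer when several appear, and the pattern does not enforce the 1 to 100 range, so a reply such as `[[250]]` would be accepted as 250.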

### Single Answer Grading

In [None]:
single_grade_eval_total = 0
num_scores = 0
question_counter = 1

for work, questions in single_grade_eval.items():

    for question, evaluation in questions.items():

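        score = extract_rating(evaluation)

        # extract_rating returns -1 when no "[[N]]" rating is found;
        # those evaluations are excluded from the average.
        if score >= 0:
            single_grade_eval_total += score
            num_scores += 1

        print(f'Question {question_counter}: {score}%')
        question_counter += 1

# Guard against division by zero when no rating could be parsed.
single_grade_eval_avg_score = round(single_grade_eval_total / num_scores, 1) if num_scores else 0.0
print(f'\nSingle grade evaluation average score: {single_grade_eval_avg_score}%')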

Question 1: 92%
Question 2: 90%
Question 3: 82%
Question 4: 92%
Question 5: 92%
Question 6: 88%
Question 7: 89%
Question 8: 85%
Question 9: 92%
Question 10: 89%
Question 11: 92%
Question 12: 92%
Question 13: 88%
Question 14: 87%
Question 15: 88%
Question 16: 89%
Question 17: 87%
Question 18: 85%
Question 19: 92%
Question 20: 92%
Question 21: 92%
Question 22: 95%
Question 23: 90%
Question 24: 92%
Question 25: 92%
Question 26: 95%
Question 27: 92%
Question 28: 90%
Question 29: 92%
Question 30: 90%
Question 31: 92%
Question 32: 87%
Question 33: 90%
Question 34: 92%
Question 35: 95%
Question 36: 88%

Single grade evaluation average score: 90.2%


### Reference-Guided Grading

In [None]:
ref_grade_eval_total = 0
num_scores = 0
question_counter = 1

for work, questions in ref_grade_eval.items():

    for question, evaluation in questions.items():

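        score = extract_rating(evaluation)

        # extract_rating returns -1 when no "[[N]]" rating is found;
        # those evaluations are excluded from the average.
        if score >= 0:
            ref_grade_eval_total += score
            num_scores += 1

        print(f'Question {question_counter}: {score}%')
        question_counter += 1

# Guard against division by zero when no rating could be parsed.
ref_grade_eval_avg_score = round(ref_grade_eval_total / num_scores, 1) if num_scores else 0.0
print(f'\nReference-guided evaluation average score: {ref_grade_eval_avg_score}%')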

Question 1: 78%
Question 2: 83%
Question 3: 92%
Question 4: 95%
Question 5: 85%
Question 6: 78%
Question 7: 90%
Question 8: 78%
Question 9: 75%
Question 10: 88%
Question 11: 85%
Question 12: 78%
Question 13: 78%
Question 14: 60%
Question 15: 65%
Question 16: 78%
Question 17: 78%
Question 18: 78%
Question 19: 85%
Question 20: 85%
Question 21: 75%
Question 22: 72%
Question 23: 72%
Question 24: 92%
Question 25: 85%
Question 26: 88%
Question 27: 72%
Question 28: 75%
Question 29: 85%
Question 30: 85%
Question 31: 85%
Question 32: 80%
Question 33: 88%
Question 34: 78%
Question 35: 70%
Question 36: 72%

Reference-guided evaluation average score: 80.2%
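
The two parsing cells repeat the same loop. One possible refactoring, sketched here with a hypothetical helper `summarize_scores` and stand-in data, computes the average for either evaluation dictionary and also reports how many evaluations had no parseable rating, which the printed averages alone do not show:

```python
import re

RATING_PATTERN = re.compile(r"\[\[(\d+)\]\]")


def summarize_scores(evaluations):
    """Return (average, number_parsed, number_failed) for a nested eval dict."""
    scores = []
    failed = 0
    for questions in evaluations.values():
        for evaluation in questions.values():
            match = RATING_PATTERN.search(evaluation)
            if match:
                scores.append(int(match.group(1)))
            else:
                failed += 1
    # None signals that no rating could be parsed at all.
    average = round(sum(scores) / len(scores), 1) if scores else None
    return average, len(scores), failed


# Stand-in evaluations (hypothetical), shaped like single_grade_eval.
sample_eval = {
    'TLP': {'q1': 'Solid answer. Rating: [[80]]', 'q2': 'No rating given.'},
    'PI': {'q3': 'Rating: [[90]]'},
}

print(summarize_scores(sample_eval))  # (85.0, 2, 1)
```

In the notebook, the same call could be applied to `single_grade_eval` and `ref_grade_eval` in turn.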
