
# 114/611 In-class exercise: Prompt Engineering for QA

In this exercise, you will construct specific prompts for a set of expected answer types, and learn about different evaluation methods.

We will use `gpt-4o-mini-2024-07-18`, so be sure you have an OpenAI key, which you may create on the [CMU AI Gateway](https://ai-gateway.andrew.cmu.edu/ui/?login=success&page=api-keys).

*(**Warning**: Each student has $50 API credit throughout the semester, so please keep track of your usage!)*

Please submit your notebook in a completed state, i.e. run the notebook to completion, and don't erase the content of the output cells, especially in the "📋**Results**" sections. Remember to answer the "✍️**Reflection**" sections in the notebook!



## Challenge of QA: How do we evaluate the answers?

When humans ask and answer questions, we rarely stick to strict formats or predefined options like in multiple-choice tests. Instead, we interpret the information we receive and decide whether it makes sense or satisfies our intent.

Evaluating QA systems is more difficult. Even for factual questions, answers can be hard to assess when no clear constraints or examples are provided, and methods like exact match may fail. Questions that require free-form generation complicate this further.

In this notebook, we focus on factual QA, and in this section, we'll explore some evaluation methods based on the example below.

In [None]:
# DO NOT CHANGE THIS CELL
question = "What are the benefits of exercise?"
expectedAnswer = [
    "improves heart health",
    "strengthens muscles",
    "boosts mental health"
]
possible_answers = {
    "Answer 1": "Exercise improves heart health, strengthens muscles, and boosts mental health.",
    "Answer 2": "Exercise improves heart health but often harms mental health due to stress.",
    "Answer 3": "Exercise keeps the body strong and improves mood.",
    "Answer 4": "Exercise is tiring, weakens the body, and causes stress.",
}

### Method 1 : Soft Match (Factoid and Factoid List Answers)



**Soft-match** is a string-matching-based evaluation method that relaxes the constraint of exact match. It can also handle cases where multiple answers are acceptable.

For a single expected answer (**factoid**), we check whether the expected answer appears in the model's output.
- If it does not occur, the score is 0.
- If it does, the score is computed as the ratio of the number of characters in the expected answer to the number of characters in the generated answer, which penalizes redundant words.

For multiple expected answers (**factoid list**), we assume the generated output is a comma-separated list of distinct factoids. In this case, we sum the scores for each expected factoid, and apply a discount factor to account for the extra punctuation or separators in the list.
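To make the arithmetic concrete, here is a minimal sketch of the scoring described above. The helper name `soft_match_sketch` and the sample strings are illustrative assumptions; the discount of two characters per expected factoid mirrors the notebook's own code cell.

```python
def soft_match_sketch(generated: str, expected: str, discount: int = 0) -> float:
    """Score one expected factoid against a generated answer (illustrative helper)."""
    if expected not in generated:
        return 0.0
    # Ratio of expected length to (discounted) generated length.
    return len(expected) / (len(generated) - discount)

# Factoid: 8 characters out of 16, so the score is 0.50.
single = soft_match_sketch("February 7, 1812", "February")

# Factoid list: sum the per-factoid scores, discounting 2 characters per factoid.
expected_list = ["Suresh", "Cohon", "Mehrabian"]
generated_list = "Suresh, Cohon, Mehrabian"
discount = 2 * len(expected_list)
list_score = sum(soft_match_sketch(generated_list, e, discount) for e in expected_list)

print(f"{single:.2f}")      # 0.50
print(f"{list_score:.2f}")  # 1.11
```

Note that under this discount a cleanly formatted list can score slightly above 1, because three factoids are joined by only two separators.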

In [None]:
# DO NOT CHANGE THIS CELL
import string

def softMatch ( generatedString , expectedString, discount ):
  if expectedString in generatedString:
    score = len(expectedString) / (len(generatedString) - discount)
  else:
    score = 0;
  return score

def evaluate_soft_match(question, answer, expectedAnswer):
    if isinstance(expectedAnswer, list):
        total = 0
        discount = 2*len(expectedAnswer)
        for exp in expectedAnswer:
            total += softMatch(answer, exp, discount)
        print("\nSoftmatch Score:", f"{total:.2f}")
    else:
        score = softMatch(answer, expectedAnswer, 0)
        print("\nSoftmatch Score:", f"{score:.2f}")


In [None]:
# DO NOT CHANGE THIS CELL
# Soft Match Exploration Cell
print(f"Question: {question}\n")
for label, ans in possible_answers.items():
    print(f"--- {label} ---")
    evaluate_soft_match(question, ans, expectedAnswer)
    print(f"Answer: {ans}\n")


#### ✍️ **Reflection**

Please use the text cell below to answer the following question:

What are the strengths and weaknesses of SoftMatch? Give at least one strength and one weakness; refer to the examples above to illustrate your points.

**Your Answer**

...

### Method 2: LLM as a Judge

Another way to evaluate answers is to use an **LLM as a judge**, a method that has recently become popular. In real-world applications, we often choose the best affordable LLM available to serve as the evaluator.

To use an LLM as a judge, we must give the model explicit evaluation instructions. In this notebook, we ask the LLM to rate each answer on a Likert scale from 0 to 5, and then normalize the score to a 0-1 range so it can be compared directly with the soft-match score.

We also track the evaluation cost based on the data stored in the `PRICE` dictionary.
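As a rough sketch of the two calculations described above, the snippet below normalizes a rating and estimates the cost of one judge call. The per-token prices match the `PRICE` dictionary in this notebook; the helper names and the token counts are hypothetical values chosen for illustration.

```python
# Same per-token prices as the notebook's PRICE dictionary (USD per token).
PRICE_SKETCH = {"input_tokens": 0.15 / 1e6, "output_tokens": 0.60 / 1e6}

def normalize_likert(rating: int, max_rating: int = 5) -> float:
    """Map a 0..max_rating score onto the 0-1 range."""
    return rating / max_rating

def estimate_cost(usage: dict) -> float:
    """Multiply each token count by its per-token price and sum."""
    return sum(usage[key] * price for key, price in PRICE_SKETCH.items())

# Hypothetical token counts for a single judge call.
hypothetical_usage = {"input_tokens": 200, "output_tokens": 80}

print(normalize_likert(4))                              # 0.8
print(f"{estimate_cost(hypothetical_usage):.6f} USD")   # 0.000078 USD
```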

In [None]:
# Define prompts and price of the model we choose

JUDGE_PROMPT = """# Instruction
You will be given a question, gold answer, and system answer.
Your task is to provide a 'total rating' scoring
how well the system answer matches the gold answer for the question.
Give your answer as an integer on a scale of 0 to 5, where
0 means that the system answer does not match the gold answer at all,
and 5 means that the system answer matches the gold answer.

Provide your feedback as follows:

# Feedback
Rational: (your thinking process)
Total rating: (your rating, as an integer from 0 to 5)"""

TASK_PROMPT = """# Task
Now here are the question and answer.
Question: {question}
Gold Answer: {gold_answer}
System Answer: {system_answer}

# Feedback
Rational: """

PRICE = {
    'input_tokens': 0.15/1e6,
    'output_tokens': 0.60/1e6
}

In [None]:
# DO NOT CHANGE THIS CELL
from collections import defaultdict

def parse(text):
    """
    Parse an output, assuming the following output format:
    xxx Total rating: y

    """
    output = 0
    if 'Total rating:' in text:
        splits = text.split('Total rating:')
        score = splits[-1].strip()
        if score.isdigit():
            output = int(score)
        else:
            print(f"Error: score cannot be converted to integer.")
    else:
        print(f"Error: output does not follow the specified format.")

    return output

def evaluate_llm_as_a_judge(judge, examples: list[dict[str, str]]):
    """
    Given a judge and examples, print out
    * the average score
    * the api cost for the evaluation
    and return the score scaled to 0-1

    """
    scores = []
    usage = defaultdict(int)
    for example in examples:

        response = judge.responses.create(
            model="gpt-4o-mini-2024-07-18",
            instructions=JUDGE_PROMPT,
            input=TASK_PROMPT.format(
                question=example['question'],
                gold_answer=example['gold_answer'],
                system_answer=example['system_answer']
                )
        )
        output_raw = response.output[0].content[0].text
        scores.append(parse(output_raw))
        usage['input_tokens'] += response.usage.input_tokens
        usage['output_tokens'] += response.usage.output_tokens

    cost = sum(usage[k]*v for k, v in PRICE.items())

    avg = sum(scores)/len(scores) / 5 # normalize
    print(f"\nLLM Likert Score (normalized): {avg:.2f} (Cost: {cost:.4f} USD)")
    return avg  # score scaled to 0-1, as described in the docstring


In [None]:
# DO NOT CHANGE THIS CELL
# Get the user's OpenAI key and create a client model to be used for answering
# questions as well as judging answers.
import openai
import getpass
llm = openai.OpenAI(
    api_key=getpass.getpass("Enter your OpenAI API key for gpt-4o-mini:"),
    base_url="https://ai-gateway.andrew.cmu.edu/"
)

Enter your OpenAI API key for gpt-4o-mini:··········


In [None]:
# DO NOT CHANGE THIS CELL
# LLM as a judge Exploration Cell
print(f"Question: {question}\n")
gold_answer = ", ".join(expectedAnswer)
for label, ans in possible_answers.items():
  print(f"--- {label} ---")
  evaluate_llm_as_a_judge(llm, [{"question": question, "system_answer": ans, "gold_answer": gold_answer}])
  print(f"Answer: {ans}\n")

#### ✍️ **Reflection**

Please use the text cell below to answer the following question:

Compared to SoftMatch, how does LLM as a judge perform? Give at least one strength and one weakness; refer to the examples above to illustrate your points.

**Your Answer**

...



## Hands-on: How do we get better answers from LLMs?

### QA Data

These are the question-answer pairs. The `type` field indicates the format of the expected answer,
so some questions appear more than once, each time with a different `type` and `answer`.

In [None]:
# DO NOT CHANGE THIS CELL.

q1 = {
    "type": "LAST_NAME",
    "question": "Who are the last three presidents of Carnegie Mellon?",
    "answer": ["Suresh", "Cohon", "Mehrabian"],
}
q2 = {
    "type": "FULL_NAME",
    "question": "Who are the last three presidents of Carnegie Mellon?",
    "answer": ["Subra Suresh", "Jared L. Cohon", "Robert Mehrabian"],
}
q3 = {
    "type": "FIRST_NAME",
    "question": "Who are the last three presidents of Carnegie Mellon?",
    "answer": ["Subra", "Jared", "Robert"],
}
q4 = {
    "type": "LAST_NAME",
    "question": "Who is the current president of Carnegie Mellon?",
    "answer": "Jahanian"
}
q5 = {
    "type": "FULL_NAME",
    "question": "Who is the current president of Carnegie Mellon?",
    "answer": "Farnam Jahanian"
}
q6 = {
    "type": "FIRST_NAME",
    "question": "Who is the current president of Carnegie Mellon?",
    "answer": "Farnam"
}
q7 = {
    "type": "MONTH",
    "question": "When was Charles Dickens born?",
    "answer": "February"
}
q8 = {
    "type": "DATE",
    "question": "When was Charles Dickens born?",
    "answer": "February 7, 1812"
}
q9 = {
    "type": "LIST_OF_STEPS",
    "question": "How do I get a PA driver's license?",
    "answer": ["Get a medical exam","Study the manual","Gather required documents","Take the knowledge and vision tests","Receive your permit and practice","Schedule and pass the road test","Get your license at a Photo License Center"]
}

q10 = {
    "type": "LIST_OF_STEPS",
    "question": "How do I get a US passport?",
    "answer": ["Complete and print Form DS-11", "Gather proof of U.S. citizenship and identity" , "Get a passport photo", "Make an appointment at an acceptance facility to submit your application in person"]
}

### Answering questions with LLMs

Now that we have the data and the LLM API ready, let's start answering questions using the basic setup!

In the `answerQuestion` function, you can choose which **question**, **prompt format**, and **evaluation method** to use in order to test how the LLM performs.
In the basic setup, where `type_specific_prompts` is empty, all question types use the general prompt description.
The full prompt is sent and the response generated using `llm.chat.completions.create`, and the answer is then evaluated automatically.

**Note**: Subsequent calls to the model with the same prompt and question may produce slightly different outputs. Don't worry: you can report a single representative result in the output cells below.
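The prompt selection described above can be sketched without calling the API. In this illustration, `prompts_sketch`, `build_messages`, and the `YEAR` answer type are hypothetical names that do not appear in the notebook; only the fallback behavior and the message layout follow `answerQuestion`.

```python
# Hypothetical prompt table; "YEAR" is not one of the notebook's answer types.
prompts_sketch = {"YEAR": "Answer with only the four-digit year."}
DEFAULT_PROMPT = "Answer the following question."

def build_messages(answer_type: str, question: str, prompt_type: str = "general") -> list:
    """Pick a system prompt and assemble the chat messages (no API call is made)."""
    if prompt_type == "general":
        prompt = DEFAULT_PROMPT
    else:
        # Types without a specific prompt fall back to the general one.
        prompt = prompts_sketch.get(answer_type, DEFAULT_PROMPT)
    return [
        {"role": "system", "content": prompt},
        {"role": "user", "content": f"Question:\n{question}"},
    ]

question_text = "When was Charles Dickens born?"
general = build_messages("YEAR", question_text)
specific = build_messages("YEAR", question_text, prompt_type="specific")
fallback = build_messages("MONTH", question_text, prompt_type="specific")

for messages in (general, specific, fallback):
    print(messages[0]["content"])
```

Only the system message changes between the three calls; the user message always carries the question unchanged.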

In [None]:
# DO NOT CHANGE THIS CELL
# Define the dict used to store prompts per answer type.
type_specific_prompts = {}

In [None]:
# DO NOT CHANGE THIS CELL

def answerQuestion ( input , prompt_type="general", evaluation_type="softmatch"):
  answerType = input["type"]
  question = input["question"]
  expectedAnswer = input["answer"]
  if prompt_type == "general":
    prompt = "Answer the following question."
  else: # use type specific prompts
    prompt = type_specific_prompts.get( answerType , "Answer the following question.")

  print("Answer Type: "+ answerType )
  print("\nPrompt: " + prompt )
  print("\nQuestion: " + question )

  response = llm.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages = [
        { "role": "system", "content": f"{prompt}" },
        { "role": "user", "content":  f"Question:\n{question}"}

    ]
  )
  answer = response.choices[0].message.content
  print("\nAnswer:\n" + answer )
  print("\nExpected Answer:\n" + str(expectedAnswer) )

  if evaluation_type == "softmatch":
    evaluate_soft_match( question, answer, expectedAnswer )
  else: # eval with llm as a judge
    evaluate_llm_as_a_judge(llm, [{"question": question, "system_answer": answer, "gold_answer": expectedAnswer}])


In [None]:
# Try answering a single question like this!
answerQuestion( q1 )

Answer Type: LAST_NAME

Prompt: Answer the following question.

Question: Who are the last three presidents of Carnegie Mellon?

Answer:
As of October 2023, the last three presidents of Carnegie Mellon University are:

1. **Jared L. Cohon** (1997–2013)
2. **Subra Suresh** (2013–2017)
3. **Farnam Jahanian** (2017–present)

Farnam Jahanian is the current president.

Expected Answer:
['Suresh', 'Cohon', 'Mehrabian']

Softmatch Score: 0.05


#### Results (to compare with the next section)

In [None]:
# Change this cell if needed. The output should include the results of all questions.

use_llm_as_a_judge = [] # TODO: Choose suitable evaluation type for each question (0-based indices, e.g. 8 for q9).
for i, question in enumerate([q1, q2, q3, q4, q5, q6, q7, q8, q9, q10]):
  print(f"---------------- Question {i+1} ----------------")
  if i in use_llm_as_a_judge:
    answerQuestion( question , evaluation_type="llm_as_a_judge" )
  else:
    answerQuestion( question )

### Prompt Engineering 1: Answer Format

After experimenting with `answerQuestion`, you may have noticed some limitations of the general prompt. For the next step, you'll tune the prompts to **produce answers in the correct format** for each question type!

Your goal is to improve performance by **adjusting only the prompts below**. Try to design prompts so that the generated output is as close as possible to the expected answer, given what the LLM is capable of producing.

*(You don't need to add RAG contexts or external information; the focus here is on using prompts to control the level of detail and structure in the output.)*

**Hint**: Work through the questions one by one. After updating a prompt, re-run the corresponding evaluation cell to see how your new prompt affects the model's response.
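To see why answer format matters for soft match, the sketch below scores a verbose answer and a concise answer against the same expected factoid. The helper `soft_match_sketch` and both sample answers are illustrative assumptions; the formula is the single-factoid scoring rule from earlier in this notebook.

```python
def soft_match_sketch(generated: str, expected: str) -> float:
    """Single-factoid soft match: expected length over generated length (illustrative)."""
    if expected not in generated:
        return 0.0
    return len(expected) / len(generated)

expected = "Jahanian"
verbose_answer = "The current president is Farnam Jahanian."  # hypothetical model output
concise_answer = "Jahanian"                                   # hypothetical model output

verbose_score = soft_match_sketch(verbose_answer, expected)
concise_score = soft_match_sketch(concise_answer, expected)

print(f"verbose: {verbose_score:.2f}")  # verbose: 0.20
print(f"concise: {concise_score:.2f}")  # concise: 1.00
```

Both answers are factually correct, yet the extra words in the verbose one cut its score to a fifth of the concise one.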

In [None]:
# TODO: Change the prompts!

# Names
type_specific_prompts["LAST_NAME"] = "Answer the given question."
type_specific_prompts["FULL_NAME"] = "Answer the given question."
type_specific_prompts["FIRST_NAME"] = "Answer the given question."

# Dates
type_specific_prompts["MONTH"] = "Answer the given question."
type_specific_prompts["DATE"] = "Answer the given question."

# Steps
type_specific_prompts["LIST_OF_STEPS"] = "Answer the given question."

In [None]:
# Try answering questions one by one! Index to format type:
# LAST_NAME: q1, q4
# FULL_NAME: q2, q5
# FIRST_NAME: q3, q6
# MONTH / DATE: q7, q8
# LIST_OF_STEPS: q9, q10
answerQuestion( q1, prompt_type="specific")

#### 📋**Results**

In [None]:
# Change this cell if needed. The output should include the results of all questions.

use_llm_as_a_judge = [] # TODO: Choose suitable evaluation type for each question (0-based indices, e.g. 8 for q9).
for i, question in enumerate([q1, q2, q3, q4, q5, q6, q7, q8, q9, q10]):
  print(f"---------------- Question {i+1} ----------------")
  if i in use_llm_as_a_judge:
    answerQuestion( question , prompt_type="specific" , evaluation_type="llm_as_a_judge" )
  else:
    answerQuestion( question , prompt_type="specific")

#### ✍️ **Reflection**

Please use the text cell below to answer the following question:

What changes did you observe after applying the type-specific formatting prompts?
Were there any challenges or difficulties in designing these prompts?

**Your Answer**

...



### Prompt Engineering 2: LLM as a Judge

Formatting helps us get better answers from LLMs, but evaluating those answers is equally important. The clearer the instructions, the more stable and reliable the LLM's scores will be.

Looking at the prompt and the evaluation scores above, can you **revise the judge prompt and/or task prompt** to improve the LLM-as-a-judge's performance?
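When revising the prompts, it helps to keep in mind how the judge's output is parsed. The sketch below reproduces the parsing rule used in this notebook (everything after the last `Total rating:` must be a bare integer) under the hypothetical name `parse_sketch`, and applies it to a few made-up judge outputs.

```python
def parse_sketch(text: str) -> int:
    """Return the integer after the last 'Total rating:', or 0 if the format is off."""
    if "Total rating:" not in text:
        return 0
    score = text.split("Total rating:")[-1].strip()
    # str.isdigit() rejects anything other than plain digits, e.g. "4/5" or "5."
    return int(score) if score.isdigit() else 0

# Made-up judge outputs, for illustration only.
samples = [
    "All three benefits are covered.\nTotal rating: 5",
    "Mostly correct.\nTotal rating: 4/5",
    "Mostly correct.\nTotal rating: 4.",
    "Rating: 5",
]
parsed = [parse_sketch(sample) for sample in samples]
print(parsed)  # [5, 0, 0, 0]
```

A judge output that deviates even slightly from the requested format is scored as 0, so the format instructions in the prompt directly affect the reported score.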

In [None]:
# TODO: Change the prompts! This is copied from above.

JUDGE_PROMPT = """# Instruction
You will be given a question, gold answer, and system answer.
Your task is to provide a 'total rating' scoring
how well the system answer matches the gold answer for the question.
Give your answer as an integer on a scale of 0 to 5, where
0 means that the system answer does not match the gold answer at all,
and 5 means that the system answer matches the gold answer.

Provide your feedback as follows:

# Feedback
Rational: (your thinking process)
Total rating: (your rating, as an integer from 0 to 5)"""

TASK_PROMPT = """# Task
Now here are the question and answer.
Question: {question}
Gold Answer: {gold_answer}
System Answer: {system_answer}

# Feedback
Rational: """

#### 📋**Results**

In [None]:
# Change this cell if needed. The output should include the results of questions that use LLM as a judge.

use_llm_as_a_judge = [] # TODO: Choose suitable evaluation type for each question (0-based indices, e.g. 8 for q9).
for i, question in enumerate([q1, q2, q3, q4, q5, q6, q7, q8, q9, q10]):
  if i not in use_llm_as_a_judge: continue
  print(f"---------------- Question {i+1} ----------------")
  answerQuestion( question , prompt_type="specific" , evaluation_type="llm_as_a_judge" )

#### ✍️ **Reflection**

Please use the text cell below to answer the following question:

What observations guided your prompt design for the LLM-as-a-judge method?
Did the prompts work as you expected?
Were there any challenges in designing these prompts?

**Your Answer**

...

