# Grade with Golden (Single Grader)

In this Python notebook, we use one language model (LM) as a grader to grade answers from different students (here, other LMs). We already have the golden answers and use them as the oracle. However, the golden answers must not leak to the students through their scores and feedback. The key issue is therefore **how to use the golden answers to grade the students' answers without leaking them**.

## 1. Read Data

All the data is stored in the 'data' folder. 

1. `Merged_Responses.csv` contains the questions and the golden answers from 5 experts: 'Laine', 'Gralneck', 'Sung', 'Garcia-Tsao' (spelled `Gracia-Tsao` in the dictionary keys below), and 'Barkun'. Each row contains the question text and its word count, followed by the golden answer from each of the 5 experts with its word count. In total there are 9 questions and 4 scenarios.

2. The `UGIB_generations_temp_0.8` folder contains several `.csv` files, each generated by one language model (e.g. GPT3.5). Each model acts as a student and generates several answers for each of the 9 + 4 questions and scenarios. Each record contains the index of the question followed by the answer text; an answer may span several lines.
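
As a rough illustration of the student-file layout described above, the sketch below parses a tiny in-memory CSV with the same columns that appear later in the notebook (an unnamed index column, `question`, and `text`) and groups the answers by question index. The rows are made-up placeholders, not data from the real files, and the name `answers_by_question` is only used here.

```python
import io

import pandas as pd

# Hypothetical CSV content mimicking a student file: an unnamed index
# column, a 1-based question index, and the answer text. The first
# answer is quoted because it spans two lines.
toy_csv = (
    ",question,text\n"
    '0,1,"An answer that spans\ntwo lines"\n'
    "1,1,A short answer\n"
    "2,2,Another answer\n"
)

toy_students = pd.read_csv(io.StringIO(toy_csv), index_col=0)

# Map each question index to the list of answers generated for it.
answers_by_question = (
    toy_students.groupby("question")["text"].apply(list).to_dict()
)
print(answers_by_question)
```

Quoted fields let a single answer occupy several physical lines while still counting as one record.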

In [1]:
import numpy as np
import pandas as pd

In [2]:
# check the encoding of the data
import chardet

with open('data/Merged_Responses.csv', 'rb') as file:
    result = chardet.detect(file.read())
    
print(result)

{'encoding': 'MacRoman', 'confidence': 0.729694522324978, 'language': ''}


In [3]:
'''We need to read the oracle question-answer pairs into a list; each element is a dictionary with the following keys:
{
    "question": (question_text, question_word_count),
    "Laine_ans": (Laine_ans_text, Laine_word_count),
    "Gralneck_ans": (Gralneck_ans_text, Gralneck_word_count),
    "Sung_ans": (Sung_ans_text, Sung_word_count),
    "Gracia-Tsao_ans": (Gracia-Tsao_ans_text, Gracia-Tsao_word_count),
    "Barkun_ans": (Barkun_ans_text, Barkun_word_count)
}'''

# read in the 'Merged_Responses.csv' file in the 'data' folder
df_oracle = pd.read_csv('data/Merged_Responses.csv', encoding='MacRoman')
# replace the NaN values with empty strings
df_oracle = df_oracle.fillna('')

# collect the data and format it into a list of dictionaries
oracle = []
for i in range(df_oracle.shape[0]):
    row = df_oracle.iloc[i]
    oracle.append({
        "question": (row['Question_Text'], row['Question_Word_Count']),
        "Laine_ans": (row['Laine_Answer_Text'], row['Laine_Answer_Word_Count']),
        "Gralneck_ans": (row['Gralneck_Answer_Text'], row['Gralneck_Answer_Word_Count']),
        "Sung_ans": (row['Sung_Answer_Text'], row['Sung_Answer_Word_Count']),
        "Gracia-Tsao_ans": (row['Garcia-Tsao_Answer_Text'], row['Garcia-Tsao_Word_Count']),
        "Barkun_ans": (row['Barkun_Answer_Text'], row['Barkun_Word_Count'])
    })
    
oracle[0]['Gracia-Tsao_ans']


('', '')

In [4]:
# for the students' answers, we use the answers from chatgpt3.5 in folder 'data/UGIB_generations_temp_0.8/BASE-GPT3.5.csv'
df_students = pd.read_csv('data/UGIB_generations_temp_0.8/BASE-GPT3.5.csv', encoding='MacRoman')
# fillna returns a new DataFrame, so the result must be assigned back
df_students = df_students.fillna('')
df_students.head()

Unnamed: 0,question,text
0,1,The Rockall score can be used for risk stratif...
1,1,You can use the Glasgow-Blatchford Score (GBS)...
2,1,For assessing very-low-risk patients with UGIB...
3,1,You can use the Glasgow-Blatchford score to as...
4,1,"For very-low-risk patients with UGIB, you can ..."


## 2. Import Libraries and Set Up Utilities

In [8]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

In [9]:
# call the grading gpt4 model to grade the students' answers
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    # Consider benchmarking with a good model to get
    # a sense of the best possible quality.
    model="gpt-4-0125-preview",
    # Remember to set the temperature to 0 to prevent randomness.
    temperature=0,
)

In [22]:
# Evaluate how much of the golden answers leaks into the feedback.
# For each question, we calculate the token-level F1 score between each feedback and the golden answers, then average the F1 scores over all feedbacks.

import nltk
import re

# Ensure you have the necessary nltk data
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words() for every token is slow
ENGLISH_STOPWORDS = set(stopwords.words('english'))

# Tokenization function
def tokenize(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize into words
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in ENGLISH_STOPWORDS]
    return tokens

[nltk_data] Downloading package punkt to /home/pcv1327/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/pcv1327/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
# Function to calculate the token-level F1 score from precision and recall
from collections import Counter

def calculate_f1(golden_text, student_text):
    golden_tokens = tokenize(golden_text)
    student_tokens = tokenize(student_text)
    
    # Count the tokens
    golden_counter = Counter(golden_tokens)
    student_counter = Counter(student_tokens)
    
    # Calculate true positives, false positives, and false negatives
    true_positives = sum((golden_counter & student_counter).values())
    false_positives = sum((student_counter - golden_counter).values())
    false_negatives = sum((golden_counter - student_counter).values())
    
    # Precision and recall
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    
    # F1 Score
    if precision + recall == 0:
        return 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
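
To make the metric concrete, here is a small worked example of the same multiset-overlap F1 on token lists that are already tokenized, so it runs without NLTK. The helper name `overlap_f1` and the toy tokens are assumptions made for illustration; it is meant to mirror the arithmetic of `calculate_f1`, not replace it.

```python
from collections import Counter


def overlap_f1(reference_tokens, candidate_tokens):
    """Token-level F1 based on multiset overlap of two token lists."""
    ref = Counter(reference_tokens)
    cand = Counter(candidate_tokens)
    # Tokens shared by both lists, counted with multiplicity.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate tokens found in the reference
    recall = overlap / sum(ref.values())      # share of reference tokens found in the candidate
    return 2 * precision * recall / (precision + recall)


# Hypothetical, already-tokenized texts.
golden = ["tool", "alpha", "score", "threshold", "one"]
feedback = ["review", "score", "threshold"]

value = overlap_f1(golden, feedback)
print(value)
```

Here 2 tokens overlap, so precision is 2/3, recall is 2/5, and the F1 comes out at approximately 0.5. A higher value means the feedback shares more vocabulary with the golden answers.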

## 3. Directly Grade with Golden

In this section, we grade the students' answers directly against the golden answers. The only safeguard is an instruction in the system prompt telling the grader not to leak the golden answers in its feedback.

### 3.1. Setup Chat Prompt Template

In [27]:
# Here we define the grading prompt for the grader LM. Its task is to grade the students' answers by giving a score and feedback.
# It takes the golden answers and the student's answer as text input. The exact input format is not yet decided; it may later become the difference between the two.

gradingPrompt = ChatPromptTemplate.from_messages(
    [
        (
            "system", 
            "You are now an expert grader in the field of medical science. "
            "Your task is to grade the students' answers based on the golden answers. "
            "What you should do is to give a score and feedback to the students' answers. "
            "The key issue here is not to leak the golden answers to the students in the feedback.",
        ),
        ("human", "The question is: {question}"),
        ("system", "The golden answers from the five experts are as follows:"),
        ("human", "Laine: {Laine_ans}"),
        ("human", "Gralneck: {Gralneck_ans}"),
        ("human", "Sung: {Sung_ans}"),
        ("human", "Gracia-Tsao: {Gracia_Tsao_ans}"),
        ("human", "Barkun: {Barkun_ans}"),
        ("system", "The students' answer is as follows:"),
        ("human", "{student_ans}"),
        (
            "system", 
            "Now, please grade the students' answers and provide detailed feedback. Note that the feedback should not leak the golden answers. "
            "The score should be between 0 and 100, and the feedback should be at least 20 words long. "
            "The format of the grading is as follows: \n"
            "Score: a number from 0 to 100 \n"
            "Feedback: detailed feedback with at least 20 words. It's better to be provided in several points."),
    ]
)
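
The prompt asks for a plain `Score: ... / Feedback: ...` layout. The notebook itself relies on structured output in the next section, but if the raw text of a reply ever needs to be read directly, a parser along the following lines could serve as a fallback. This is a minimal sketch; `parse_grade` and the example reply are hypothetical and not part of the notebook.

```python
import re
from typing import Optional, Tuple


def parse_grade(raw: str) -> Tuple[Optional[float], Optional[str]]:
    """Extract the score and feedback from a 'Score: / Feedback:' reply."""
    score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", raw)
    # DOTALL lets the feedback run over several lines.
    feedback_match = re.search(r"Feedback:\s*(.+)", raw, flags=re.DOTALL)
    score = float(score_match.group(1)) if score_match else None
    feedback = feedback_match.group(1).strip() if feedback_match else None
    return score, feedback


# Hypothetical reply in the requested layout.
example = "Score: 80\nFeedback: The main idea is right, but the threshold needs review."
print(parse_grade(example))
```

Returning `None` for a missing part keeps the caller in charge of deciding how to handle malformed replies.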

### 3.2. Grade with Golden (Single Grader)

In [30]:
# define the schema
class Response(BaseModel):
    """Information about a response."""

    # Note that:
    # 1. Each field is typed as `Optional` -- this allows the model to return None instead of a value!
    # 2. Each field has a `description` -- this description is used by the LLM.
    # Having a good description can help improve extraction results.
    score: Optional[float] = Field(..., description="The score of the student's answer")
    feedback: Optional[str] = Field(..., description="The feedback of the student's answer")


class Data(BaseModel):
    """Extracted data about answers."""

    # Creates a model so that we can extract multiple entities.
    responses: List[Response]
    
runnable = gradingPrompt | llm.with_structured_output(
    schema=Response,
    method="function_calling",
    include_raw=False,
)

In [35]:

scores = []
feedbacks = []
# call the grading model to grade the students' answers
for i in range(len(oracle)):
    question = oracle[i]['question'][0]
    Laine_ans = oracle[i]['Laine_ans'][0]
    Gralneck_ans = oracle[i]['Gralneck_ans'][0]
    Sung_ans = oracle[i]['Sung_ans'][0]
    Gracia_Tsao_ans = oracle[i]['Gracia-Tsao_ans'][0]
    Barkun_ans = oracle[i]['Barkun_ans'][0]
    # for each question, we store all the scores and feedbacks for the students' answers
    scores.append([])
    feedbacks.append([])
    answers = df_students[df_students['question'] == i+1]["text"]  # select the students' answers for the current question
    for j in range(len(answers)):
        # invoke the grading chain on the current student answer
        response = runnable.invoke(
            {
                "question": question,
                "Laine_ans": Laine_ans,
                "Gralneck_ans": Gralneck_ans,
                "Sung_ans": Sung_ans,
                "Gracia_Tsao_ans": Gracia_Tsao_ans,
                "Barkun_ans": Barkun_ans,
                "student_ans": answers.iloc[j]
            }
        )
        if j % 20 == 0:
            print(f"Question {i+1}, Answer {j+1}: Score: {response.score}, Feedback: {response.feedback}")
            print("\n")
        scores[i].append(response.score)
        feedbacks[i].append(response.feedback)

Question 1, Answer 1: Score: 20.0, Feedback: Your answer identifies the Rockall score as a tool for risk stratification in UGIB, which is a recognized method. However, the question specifically asked for the score used to assess very-low-risk patients for discharge from the ED, and the most recommended score and threshold for this purpose were not accurately identified. It's important to review the latest guidelines and studies to ensure the information provided aligns with current best practices for managing UGIB patients in the emergency setting.
Question 1, Answer 21: Score: 80.0, Feedback: Your answer correctly identifies the Glasgow-Blatchford Score (GBS) as the tool for assessing very-low-risk patients with upper gastrointestinal bleeding (UGIB). However, the threshold for considering discharge from the emergency department (ED) is not limited to a score of 0. It's important to consider a range of scores that might be appropriate for discharge decisions. Additionally, exploring t
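
Because `score` is declared `Optional`, some entries in `scores` may be `None`. One possible way to summarize the nested list per question while skipping such entries is sketched below; `mean_scores` and the toy values are illustrative assumptions, not results from the run above.

```python
from statistics import mean
from typing import List, Optional


def mean_scores(
    scores_per_question: List[List[Optional[float]]],
) -> List[Optional[float]]:
    """Average the scores of each question, ignoring missing (None) scores."""
    result = []
    for question_scores in scores_per_question:
        valid = [s for s in question_scores if s is not None]
        # A question with no usable score yields None rather than raising.
        result.append(mean(valid) if valid else None)
    return result


# Hypothetical nested list in the same shape as `scores`.
toy_scores = [[20.0, 80.0, None], [], [50.0]]
print(mean_scores(toy_scores))
```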

### 3.3. Evaluation with F1 scores

In [37]:
avg_f1_scores = []
for i in range(len(oracle)):
    question = oracle[i]['question'][0]
    # concatenate the golden answers
    golden_answers = oracle[i]['Laine_ans'][0] + ' ' + oracle[i]['Gralneck_ans'][0] + ' ' + oracle[i]['Sung_ans'][0] + ' ' + oracle[i]['Gracia-Tsao_ans'][0] + ' ' + oracle[i]['Barkun_ans'][0]
    f1_scores = np.zeros(len(feedbacks[i]))
    for j in range(len(feedbacks[i])):
        f1 = calculate_f1(golden_answers, feedbacks[i][j])
        f1_scores[j] = f1
    avg_f1 = np.mean(f1_scores)
    print(f"Question {i+1}: Average F1 Score: {avg_f1}")
    avg_f1_scores.append(avg_f1)
    

Question 1: Average F1 Score: 0.15645397645412618
Question 2: Average F1 Score: 0.15416018539906617
Question 3: Average F1 Score: 0.21984147385750974
Question 4: Average F1 Score: 0.149593688711025
Question 5: Average F1 Score: 0.18891827841258582
Question 6: Average F1 Score: 0.16621351052221026
Question 7: Average F1 Score: 0.1778982244916877
Question 8: Average F1 Score: 0.18934735119479756
Question 9: Average F1 Score: 0.16939877575469967
Question 10: Average F1 Score: 0.1267491984963874
Question 11: Average F1 Score: 0.08596652529281702
Question 12: Average F1 Score: 0.10892264096021098
Question 13: Average F1 Score: 0.1243578734945295


## 4. Grade with Difference between Golden and Student Answers

In this section, we grade the students' answers using only the differences between the golden answers and the student's answer. One LM call first generates the differences; the grader LM is then told that its context is this list of differences and grades based on it.
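
The flow of this section can be sketched without any LM calls by passing the two stages in as plain functions. Everything named here (`grade_via_differences`, the stub functions, the placeholder strings) is hypothetical and only illustrates which stage gets to see the golden answers.

```python
from typing import Callable, Dict, List


def grade_via_differences(
    question: str,
    golden_answers: List[str],
    student_answer: str,
    generate_differences: Callable[[str, List[str], str], List[str]],
    grade_from_differences: Callable[[str, str], Dict[str, object]],
) -> Dict[str, object]:
    # Stage 1: the only step that receives the golden answers.
    differences = generate_differences(question, golden_answers, student_answer)
    # Number the points so the grader receives one readable string.
    numbered = "\n".join(f"{k + 1}: {d}" for k, d in enumerate(differences))
    # Stage 2: the grader receives the question and the differences only.
    return grade_from_differences(question, numbered)


# Hypothetical stand-ins for the two LM calls.
seen_by_grader: List[str] = []


def stub_generate(question, golden_answers, student_answer):
    return ["A different tool is named.", "No threshold is given."]


def stub_grade(question, differences_text):
    seen_by_grader.append(differences_text)
    return {"score": 50.0, "feedback": "Revisit the choice of tool and state a threshold."}


result = grade_via_differences(
    "toy question", ["GOLDEN PLACEHOLDER"], "toy student answer",
    stub_generate, stub_grade,
)
print(result["score"])
print(seen_by_grader[0])
```

The grader never receives the golden text directly, but whether the differences themselves carry golden content still depends on the first stage.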

### 4.1. Setup Chat Prompt Template

In [19]:
def List2Str(str_lst):
    formatted_str = [f"{i+1}: {str_lst[i]}" for i in range(len(str_lst))]
    return '\n'.join(formatted_str)

gradeDiffPrompt = ChatPromptTemplate.from_messages(
    [
        (
            "system", 
            "You are now an expert grader in the field of medical science. "
            "Your task is to grade the students' answers based on the golden answers. "
            "What you have is the question, and differences between the students' answer and the golden answers listed in several points. "
            "What you should do is to give a score and feedback based on these materials.",
        ),
        ("system", "The question is as follows: "),
        ("human", "question: {question}"),
        ("system", "The differences between the students' answer and the golden answers are as follows:"),
        ("human", "{differences}"),
        (
            "system", 
            "Now, please grade the students' answers and provide detailed feedback. Note that the feedback should not leak the golden answers. "
            "The score should be between 0 and 100, and the feedback should be at least 20 words long. "
            "The format of the grading is as follows: \n"
            "Score: a number from 0 to 100 \n"
            "Feedback: detailed feedback with at least 20 words. It's better to be provided in several points."),
    ]
)

# define the schema
class Response(BaseModel):
    """Information about a response."""
    
    score: Optional[float] = Field(..., description="The score of the student's answer")
    feedback: Optional[str] = Field(..., description="The feedback of the student's answer")

gradeRunnable = gradeDiffPrompt | llm.with_structured_output(
    schema=Response,
    method="function_calling",
    include_raw=False,
)

In [17]:
generateDiffPrompt = ChatPromptTemplate.from_messages(
    [
        (
            "system", 
            "You are now an expert in the field of medical science. "
            "Your task is to generate the difference between the students' answer and the golden answers. "
            "What you have is the question, the students' answer and the golden answers from 5 experts. ",
        ),
        ("system", "The question is as follows:"),
        ("human", "question: {question}"),
        ("system", "The golden answers from 5 experts are as follows:"),
        ("human", "Laine: {Laine_ans}"),
        ("human", "Gralneck: {Gralneck_ans}"),
        ("human", "Sung: {Sung_ans}"),
        ("human", "Gracia-Tsao: {Gracia_Tsao_ans}"),
        ("human", "Barkun: {Barkun_ans}"),
        ("system", "The students' answer is as follows:"),
        ("human", "{student_ans}"),
        (
            "system", 
            "Now, please generate the difference between the student's answer and the golden answers. "
            "You should output the question and the differences. "
            "The format of the differences should be provided in several points. "
            "Note that the differences should not leak the golden answers."
        ),
    ]
)

class Difference(BaseModel):
    """Information about the difference between the students' answer and the golden answers."""

    question: Optional[str] = Field(..., description="The original question")
    differences: Optional[List[str]] = Field(..., description="The differences between the student's answer and the golden answers in several points")
    
generateRunnable = generateDiffPrompt | llm.with_structured_output(
    schema=Difference,
    method="function_calling",
    include_raw=False,
)

### 4.2. Grade with Difference (Single Grader)

In [20]:
scores = []
feedbacks = []
# call the grading model to grade the students' answers
for i in range(len(oracle)):
    question = oracle[i]['question'][0]
    Laine_ans = oracle[i]['Laine_ans'][0]
    Gralneck_ans = oracle[i]['Gralneck_ans'][0]
    Sung_ans = oracle[i]['Sung_ans'][0]
    Gracia_Tsao_ans = oracle[i]['Gracia-Tsao_ans'][0]
    Barkun_ans = oracle[i]['Barkun_ans'][0]
    # for each question, we store all the scores and feedbacks for the students' answers
    scores.append([])
    feedbacks.append([])
    answers = df_students[df_students['question'] == i+1]["text"]  # select the students' answers for the current question
    for j in range(len(answers)):  # for each student's answer
        # invoke the difference-generation chain on the current student answer
        differences = generateRunnable.invoke(
            {
                "question": question,
                "Laine_ans": Laine_ans,
                "Gralneck_ans": Gralneck_ans,
                "Sung_ans": Sung_ans,
                "Gracia_Tsao_ans": Gracia_Tsao_ans,
                "Barkun_ans": Barkun_ans,
                "student_ans": answers.iloc[j]
            }
        )
        # transform the differences into a string
        differences_str = List2Str(differences.differences)
        # invoke the grading chain on the question and the differences only
        response = gradeRunnable.invoke(
            {
                "question": question,
                "differences": differences_str
            }
        )
        if j % 20 == 0:
            print(f"Question {i+1}, Answer {j+1}: Score: {response.score}, Feedback: {response.feedback}")
            print("\n")
        scores[i].append(response.score)
        feedbacks[i].append(response.feedback)

Question 1, Answer 1: Score: 50.0, Feedback: Your answer has identified a risk stratification score, which is a good start. However, the Rockall score is not the most recommended score for assessing very-low-risk patients with upper gastrointestinal bleeding (UGIB) for discharge from the emergency department (ED). Additionally, the threshold you've suggested does not align with the most commonly recommended thresholds for the appropriate score. It's important to review the latest guidelines and studies to ensure the use of the most accurate and up-to-date risk stratification tools and their thresholds. Consider exploring other scores that are specifically recommended for very-low-risk UGIB patients and their discharge criteria from the ED.
Question 1, Answer 21: Score: 70.0, Feedback: Your answer is on the right track by identifying a specific threshold for discharge from the ED for very-low-risk patients with UGIB. However, there are a few areas for improvement: 1. Consider mentioning

In [24]:
avg_f1_scores = []
for i in range(len(oracle)):
    question = oracle[i]['question'][0]
    # concatenate the golden answers
    golden_answers = oracle[i]['Laine_ans'][0] + ' ' + oracle[i]['Gralneck_ans'][0] + ' ' + oracle[i]['Sung_ans'][0] + ' ' + oracle[i]['Gracia-Tsao_ans'][0] + ' ' + oracle[i]['Barkun_ans'][0]
    f1_scores = np.zeros(len(feedbacks[i]))
    for j in range(len(feedbacks[i])):
        f1 = calculate_f1(golden_answers, feedbacks[i][j])
        f1_scores[j] = f1
    avg_f1 = np.mean(f1_scores)
    print(f"Question {i+1}: Average F1 Score: {avg_f1}")
    avg_f1_scores.append(avg_f1)

Question 1: Average F1 Score: 0.13451975811523462
Question 2: Average F1 Score: 0.21016929826909764
Question 3: Average F1 Score: 0.1651647148948972
Question 4: Average F1 Score: 0.193905237977186
Question 5: Average F1 Score: 0.22963774432820835
Question 6: Average F1 Score: 0.23468966159936297
Question 7: Average F1 Score: 0.2039452407900692
Question 8: Average F1 Score: 0.211733940683577
Question 9: Average F1 Score: 0.2072310329792205
Question 10: Average F1 Score: 0.16769402395735264
Question 11: Average F1 Score: 0.18804865342510835
Question 12: Average F1 Score: 0.15822331188256128
Question 13: Average F1 Score: 0.17238700947788144
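
One way to put the two runs side by side is to compare the per-question averages printed in Sections 3.3 and 4.2. The sketch below hard-codes those printed values rounded to four decimals; the variable names are assumptions. Token overlap is only a rough lexical proxy for leakage, so the comparison should be read with that in mind.

```python
from statistics import mean

# Average F1 per question, copied from the printed outputs above
# and rounded to four decimals.
direct_f1 = [0.1565, 0.1542, 0.2198, 0.1496, 0.1889, 0.1662, 0.1779,
             0.1893, 0.1694, 0.1267, 0.0860, 0.1089, 0.1244]
diff_f1 = [0.1345, 0.2102, 0.1652, 0.1939, 0.2296, 0.2347, 0.2039,
           0.2117, 0.2072, 0.1677, 0.1880, 0.1582, 0.1724]

# Count the questions where the difference-based run overlaps more
# with the golden answers than the direct run.
n_higher = sum(d > g for g, d in zip(direct_f1, diff_f1))

print(f"Direct grading, mean F1: {mean(direct_f1):.4f}")
print(f"Difference-based grading, mean F1: {mean(diff_f1):.4f}")
print(f"Questions with higher F1 in the difference-based run: {n_higher} of {len(direct_f1)}")
```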
