In [1]:
"""
In this notebook, we will introduce how to start the evaluation.
"""

'\nIn this notebook, we will introduce how to start the evaluation.\n'

Chapter 1: Question set structure\
This chapter shows how to load the question set and describes the structure of each record.

In [2]:
import pickle as pkl

with open("../dataset/questions_sample_5.pkl", "rb") as f:
    question_data = pkl.load(f)
print(f"Loaded {len(question_data)} questions.")
for k,v in question_data[0].items():
    print(f"{k}: {v}")

Loaded 75 questions.
question_index: 6648c31cda62de46659828790ca679fd4088153f24f8851e52157f5a83bce473
question_type: single_choice
knowledge: chromosome
question: Select the chromosome location of LOC124905048 gene.
options: {'A': '15;15q15.2', 'B': 'X;Xq11-q12', 'C': 'X;Xp11.3-p11.23', 'D': '21;21q21.1'}
answer: D
batch_id: 3
perspective: study_bias
flag: loc


Take the first question in the question set as an example:

``` the unique identifier of the question ```\
question_index: 6648c31cda62de46659828790ca679fd4088153f24f8851e52157f5a83bce473

``` the type of the question, which determines the prompt template to use ```\
question_type: single_choice

``` the knowledge domain of the question, corresponding to one of the 8 gene knowledge domains in our paper ```\
knowledge: chromosome

``` the question text and its answer options ```\
question: Select the chromosome location of LOC124905048 gene.
options: {'A': '15;15q15.2', 'B': 'X;Xq11-q12', 'C': 'X;Xp11.3-p11.23', 'D': '21;21q21.1'}

``` the correct answer ```\
answer: D

``` the batch id of the question, used only in our preliminary experiment; you can ignore it ```\
batch_id: 3

``` the perspective of LLM capability being tested, corresponding to one of the 4 perspectives in our paper ```\
perspective: study_bias

``` a flag that marks whether the gene is a LOC-prefixed gene or not ```\
flag: loc
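
To get a quick overview of a loaded question set, the fields described above can be tallied. The following is a minimal sketch on made-up records: `sample_questions`, `count_by`, and the values `multiple_choice`, `hallucination`, and `non-loc` are assumptions for illustration, and the real file may use different values.

```python
from collections import Counter

# Hypothetical records that only mirror the field names described above.
sample_questions = [
    {"question_type": "single_choice", "perspective": "study_bias", "flag": "loc"},
    {"question_type": "single_choice", "perspective": "study_bias", "flag": "non-loc"},
    {"question_type": "multiple_choice", "perspective": "hallucination", "flag": "non-loc"},
]


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record.get(field, "<missing>") for record in records)


type_counts = count_by(sample_questions, "question_type")
flag_counts = count_by(sample_questions, "flag")
print(dict(type_counts))
print(dict(flag_counts))
```

With the real data, passing `question_data` instead of `sample_questions` should give the per-type and per-flag counts of the loaded set.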


Chapter 2: Load the model\
In this chapter, we load the model we want to evaluate. We recommend using an OpenAI-compatible interface.\
Here we use the ```langchain``` framework to load the model.

In [None]:
from langchain_community.chat_models import ChatOpenAI
from langchain_core.messages import HumanMessage

# Set your model name, API base URL, and API key here.
model_name = ""
api_base = ""
api_key = ""

client = ChatOpenAI(
    model=model_name,
    openai_api_base=api_base,
    openai_api_key=api_key,
    temperature=0,  # set temperature to 0 for reproducibility
)

# test the model status
message = "Hello, how are you?"
human_message = [HumanMessage(content=message)]
response = client.invoke(human_message)
print(response.content)



Hello! I'm doing well, thank you for asking! 😊 I'm here and ready to help you with anything you need—whether it's answering questions, brainstorming ideas, or just having a friendly chat. How are you doing today?


Chapter 3: Using our prompt template to evaluate the model\
In this chapter, we combine our prompt template with a question and send it to the model.

In [4]:
from prompt_pydantic import get_prompt, get_parser_by_type
# take the sixth question (index 5) as an example
question_content = get_prompt(question_data[5])
parser = get_parser_by_type(question_data[5]["question_type"])
prompt_str = question_content + "\n" + parser.get_format_instructions()
# wrap into HumanMessage
messages = [HumanMessage(content=prompt_str)]
# evaluate the model and compare the answer
response = client.invoke(messages)
print(response.content)
print("The correct answer is: " + question_data[5]["answer"])

```json
{
    "answer": "C"
}
```
The correct answer is: C


If you get a response, the model call has succeeded.\
You can then write your own code to record the answers and calculate the metrics. \
The following chapter introduces the metrics.
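
One possible way to record answers is to pull the JSON object out of the reply text and store it next to the question metadata. The sketch below uses only the standard library; `extract_json` and `record_answer` are hypothetical helpers, not part of this repository, and the parser returned by `get_parser_by_type` may already offer its own parsing.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the marker that wraps fenced replies


def extract_json(reply_text):
    """Return the JSON value found in a model reply, or None if parsing fails."""
    # Prefer the body of a fenced block when the reply contains one.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    fenced = re.search(pattern, reply_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply_text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None


def record_answer(question, reply_text):
    """Build one result row from a question record and the raw reply."""
    parsed = extract_json(reply_text)
    return {
        "question_index": question["question_index"],
        "question_type": question["question_type"],
        "perspective": question["perspective"],
        "flag": question["flag"],
        "expected": question["answer"],
        "predicted": parsed,
        "parse_ok": parsed is not None,
    }


# Made-up question and reply, shaped like the example output above.
question = {
    "question_index": "example-id",
    "question_type": "single_choice",
    "perspective": "study_bias",
    "flag": "loc",
    "answer": "C",
}
reply = FENCE + 'json\n{\n    "answer": "C"\n}\n' + FENCE
record = record_answer(question, reply)
print(record["predicted"], record["parse_ok"])
```

Keeping `parse_ok` in each row makes it easy to see later how many replies could not be parsed at all, separately from wrong answers.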

Chapter 4: Metrics\
In this chapter, we will introduce the metrics one by one.

In [None]:
# 4-1 single choice
from metrics.choice import calculate_single_choice

question_content = get_prompt(question_data[5])
parser = get_parser_by_type(question_data[5]["question_type"])
prompt_str = question_content + "\n" + parser.get_format_instructions()
messages = [HumanMessage(content=prompt_str)]
response = client.invoke(messages)

score = calculate_single_choice(question_data[5]["answer"], response.content)
print(response.content)
print("The correct answer is: " + question_data[5]["answer"])
print(f"Score: {score}")


```json
{
    "answer": "C"
}
```
The correct answer is: C
Score: 1.0


In [None]:
# 4-2 multiple choice
import importlib
import metrics.choice
importlib.reload(metrics.choice)  # pick up any local edits to the metric module
from metrics.choice import calculate_multiple_choice


question_content = get_prompt(question_data[48])
parser = get_parser_by_type(question_data[48]["question_type"])
prompt_str = question_content + "\n" + parser.get_format_instructions()
messages = [HumanMessage(content=prompt_str)]
response = client.invoke(messages)


score = calculate_multiple_choice(question_data[48]["answer"], response.content)
print(response.content)
print("The correct answer is: " + str(question_data[48]["answer"]))
print(f"Score: {score}")

```json
{
    "answers": ["A", "B", "C", "D"]
}
```
The correct answer is: ['D', 'A', 'B', 'C']
Score: 1.0
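
In the output above, the model lists the options in a different order than the reference answer and still scores 1.0, so the comparison appears to ignore order. A minimal sketch of such an order-insensitive check follows; `exact_set_match` is a hypothetical helper, and `calculate_multiple_choice` may score differently (for example, with partial credit).

```python
def exact_set_match(expected, predicted):
    """Return 1.0 when both option lists contain the same letters, else 0.0."""
    return 1.0 if set(expected) == set(predicted) else 0.0


full_match = exact_set_match(["D", "A", "B", "C"], ["A", "B", "C", "D"])
partial_match = exact_set_match(["D", "A", "B", "C"], ["A", "B"])
print(full_match, partial_match)
```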


In [28]:
# 4-3 expression
import importlib
import metrics.expression
importlib.reload(metrics.expression)
from metrics.expression import calculate_expression

ind = 74
question_content = get_prompt(question_data[ind])
parser = get_parser_by_type(question_data[ind]["question_type"])

prompt_str = question_content + "\n" + parser.get_format_instructions()
messages = [HumanMessage(content=prompt_str)]
response = client.invoke(messages)


score = calculate_expression(question_data[ind]["answer"], response.content)
print(response.content)
print("The correct answer is: " + str(question_data[ind]["answer"]))
print(f"Score: {score}")

{"Tissue": ["liver", "adrenal", "pancreas", "fat", "testis", "ovary", "prostate", "colon", "small intestine", "duodenum", "stomach", "esophagus", "lung", "thyroid", "gall bladder", "kidney", "urinary bladder", "endometrium", "placenta", "skin", "salivary gland", "brain", "spleen", "lymph node", "appendix", "bone marrow", "heart"], "Category": "Ubiquitous expression"}
The correct answer is: {'tissue_list': ['liver', 'kidney', 'skin', 'brain', 'fat', 'duodenum'], 'category': 'Biased expression'}
Score: 0.1818181818181818


In [None]:
# 4-4 gene ontology
import importlib
import metrics.ontology
importlib.reload(metrics.ontology)
from metrics.ontology import calculate_go


ind = 54
question_content = get_prompt(question_data[ind])
parser = get_parser_by_type(question_data[ind]["question_type"])

prompt_str = question_content + "\n" + parser.get_format_instructions()
messages = [HumanMessage(content=prompt_str)]
response = client.invoke(messages)


score = calculate_go(question_data[ind]["answer"], response.content)
print(response.content)
print("The correct answer is: " + str(question_data[ind]["answer"]))
print(f"Precision: {score[0]}, Recall: {score[1]}, F1: {score[2]}, Hallucination Rate: {score[3]}")

/home/huangxiaohan/SciHorizonGene/evaluate/metrics/go-basic.obo: fmt(1.2) rel(2025-10-10) 42,666 Terms
[
    {"go": "mitochondrial respiratory chain complex I assembly", "evidence": "IMP"},
    {"go": "mitochondrial respiratory chain complex I biogenesis", "evidence": "IMP"},
    {"go": "mitochondrial respiratory chain complex I assembly", "evidence": "IDA"},
    {"go": "mitochondrial respiratory chain complex I biogenesis", "evidence": "IDA"},
    {"go": "mitochondrial inner membrane", "evidence": "IDA"},
    {"go": "mitochondrial respiratory chain complex I", "evidence": "IDA"},
    {"go": "mitochondrial respiratory chain complex I", "evidence": "IEA"},
    {"go": "mitochondrial inner membrane", "evidence": "IEA"},
    {"go": "integral component of mitochondrial inner membrane", "evidence": "IEA"},
    {"go": "NADH dehydrogenase (ubiquinone) activity", "evidence": "IEA"},
    {"go": "oxidoreduction-driven active transmembrane transporter activity", "evidence": "IEA"},
    {"go": "pro
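
For intuition about the precision, recall, and F1 values printed by the cell above, here is a minimal set-based sketch. It assumes plain case-insensitive matching of term names; `precision_recall_f1` and both example lists are made up for illustration. `calculate_go` loads `go-basic.obo` and may match terms differently, and the hallucination rate is not reproduced here.

```python
def precision_recall_f1(expected_terms, predicted_terms):
    """Set-based precision, recall, and F1 over normalized term names."""
    expected = {term.strip().lower() for term in expected_terms}
    predicted = {term.strip().lower() for term in predicted_terms}
    overlap = len(expected & predicted)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical reference and prediction; duplicates collapse after normalization.
expected_terms = [
    "mitochondrial inner membrane",
    "NADH dehydrogenase (ubiquinone) activity",
]
predicted_terms = [
    "mitochondrial inner membrane",
    "Mitochondrial Inner Membrane",
    "mitochondrial respiratory chain complex I",
    "mitochondrial respiratory chain complex I assembly",
]
precision, recall, f1 = precision_recall_f1(expected_terms, predicted_terms)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```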

In [40]:
# 4-5 summary
import importlib
import metrics.summary
importlib.reload(metrics.summary)
from metrics.summary import calculate_summary
import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the output

ind = 60
question_content = get_prompt(question_data[ind])
parser = get_parser_by_type(question_data[ind]["question_type"])

prompt_str = question_content + "\n" + parser.get_format_instructions()
messages = [HumanMessage(content=prompt_str)]
response = client.invoke(messages)


score = calculate_summary(question_data[ind]["answer"], response.content)
print(response.content)
print("The correct answer is: " + str(question_data[ind]["answer"]))
print(f"Rouge Score: {score[0]}, BERT F1: {score[1]}, Perplexity: {score[2]}, Length: {score[3]}")

```json
{
    "summary": "MIR151A is a microRNA gene that regulates gene expression post-transcriptionally by binding to target mRNAs, often involved in processes such as cell proliferation, differentiation, and immune response."
}
```
The correct answer is: microRNAs (miRNAs) are short (20-24 nt) non-coding RNAs that are involved in post-transcriptional regulation of gene expression in multicellular organisms by affecting both the stability and translation of mRNAs. miRNAs are transcribed by RNA polymerase II as part of capped and polyadenylated primary transcripts (pri-miRNAs) that can be either protein-coding or non-coding. The primary transcript is cleaved by the Drosha ribonuclease III enzyme to produce an approximately 70-nt stem-loop precursor miRNA (pre-miRNA), which is further cleaved by the cytoplasmic Dicer ribonuclease to generate the mature miRNA and antisense miRNA star (miRNA*) products. The mature miRNA is incorporated into a RNA-induced silencing complex (RISC), which 

Congratulations! You have finished the evaluation process. :)

If you encounter a path error, please check the file paths. \
Attention:\
Running the summary evaluation downloads the GPT-2 model.

Finally, you can:
1. Filter the results by perspective (```study_bias```, a.k.a. Research Attention), question type (e.g. ```single_choice```), and flag (loc, non-loc) to reproduce the ```Research Attention``` experiment.
2. Filter the results by the perspective (e.g. ```hallucination```) to reproduce the ```Hallucination``` experiment.
3. Filter the results by the perspective (e.g. ```completeness```) to reproduce the ```Knowledge Completeness``` experiment.
4. Filter the results by the perspective (e.g. ```literature_utilize```) to reproduce the ```Literature Influence``` experiment.
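
The filtering steps above can be sketched as follows, assuming each recorded result is a dict that carries the question's `perspective`, `question_type`, and `flag` together with a numeric `score`. The rows, the `non-loc` value, and the helpers `filter_results` and `mean_score` are hypothetical; adapt them to however you stored your results.

```python
from statistics import mean

# Made-up result rows; real rows would come from your own evaluation loop.
results = [
    {"perspective": "study_bias", "question_type": "single_choice", "flag": "loc", "score": 1.0},
    {"perspective": "study_bias", "question_type": "single_choice", "flag": "loc", "score": 0.0},
    {"perspective": "study_bias", "question_type": "single_choice", "flag": "non-loc", "score": 1.0},
    {"perspective": "hallucination", "question_type": "single_choice", "flag": "non-loc", "score": 0.0},
]


def filter_results(rows, **criteria):
    """Keep the rows whose fields equal every given keyword criterion."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


def mean_score(rows):
    """Average the `score` field, or return None when no rows remain."""
    return mean([row["score"] for row in rows]) if rows else None


loc_rows = filter_results(
    results, perspective="study_bias", question_type="single_choice", flag="loc"
)
non_loc_rows = filter_results(
    results, perspective="study_bias", question_type="single_choice", flag="non-loc"
)
print("loc:", mean_score(loc_rows))
print("non-loc:", mean_score(non_loc_rows))
```

Comparing the two averages gives a loc versus non-loc contrast; the same `filter_results` call with only `perspective=...` covers the other three experiments.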