# Documentation for Medical Knowledge Evaluation Solution
## Introduction

Welcome to our journey in enhancing medical knowledge evaluation! In this project, we are tackling a unique challenge presented by the French Medical Practice exam. Our objective is to improve a model's ability to answer 103 multiple-choice questions, each with options ranging from A to E. 

What makes this task especially intriguing is that many questions may have multiple correct answers, leading to a total of 31 possible answer combinations. This complexity pushes us to think creatively about how to equip our model with the necessary medical knowledge.
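
The figure of 31 follows from counting every non-empty subset of the five options (2^5 - 1). As a quick sanity check, the answer space can be enumerated with a small sketch like the one below; the helper name `all_answer_combinations` is ours and is not part of the competition material.

```python
from itertools import combinations

OPTIONS = "ABCDE"

def all_answer_combinations(options=OPTIONS):
    """Return every non-empty subset of the options as a comma-separated string."""
    combos = []
    for size in range(1, len(options) + 1):
        for subset in combinations(options, size):
            combos.append(",".join(subset))
    return combos

answer_space = all_answer_combinations()
print(len(answer_space))  # 31
print(answer_space[:7])
```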

We don’t have a training dataset with the correct answers (where’s the fun in that?), so we’re exploring open-source resources and techniques such as retrieval-augmented generation and fine-tuning to enrich our model's understanding of medical concepts.

Join us as we dive into the details of our approach, aiming to not only meet but exceed the benchmarks set by this competition. Let’s see how far we can push the boundaries of medical knowledge assessment!


Install requirements

In [None]:
! pip install -r req.txt

### Variables you need to change

In [8]:
import os

# Read the Mistral API key from the environment rather than hard-coding it
api_key = os.environ.get('API_KEY')

questions_file = './data/questions.csv'

In [14]:
import pandas as pd
from mistralai import Mistral
import time
from transformers import AutoTokenizer
import tiktoken
import faiss
import numpy as np
import json
import os
from tqdm import tqdm


In [20]:
questions_df = pd.read_csv(questions_file)


# RAG

Get data

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("thedevastator/comprehensive-medical-q-a-dataset")

print("Path to dataset files:", path)


In [16]:

# Build the path with os.path.join so it works on any operating system
df_rag = pd.read_csv(os.path.join(path, "train.csv"))

# Function to combine the columns into a structured text representation
def combine_qtype_question_answer(row):
    return (
        f"Question Type: {row['qtype']}\n"
        f"Question: {row['Question']}\n"
        f"Answer: {row['Answer']}\n"
    )

# Apply the function to each row in the dataframe to create a combined text column
df_rag['combined_text'] = df_rag.apply(combine_qtype_question_answer, axis=1)
encoding = tiktoken.get_encoding("cl100k_base")  # Choose the appropriate encoding model

# Function to split text into chunks of max_token_size
def split_into_chunks(text, max_token_size=2048):
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_token_size):
        chunk = encoding.decode(tokens[i:i + max_token_size])
        chunks.append(chunk)
    return chunks

# Apply the function to each combined text in the dataframe
chunks_list = []
for text in df_rag['combined_text']:
    chunks = split_into_chunks(text, max_token_size=512)
    chunks_list.extend(chunks)
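
To make the windowing step easier to follow, here is a minimal sketch of the same fixed-size split applied to a plain list. It assumes whitespace-separated words as stand-in tokens instead of `tiktoken` ids, and the helper name `split_tokens_into_chunks` is hypothetical.

```python
def split_tokens_into_chunks(tokens, max_token_size):
    """Split a token sequence into consecutive windows of at most max_token_size."""
    if max_token_size <= 0:
        raise ValueError("max_token_size must be positive")
    return [tokens[i:i + max_token_size] for i in range(0, len(tokens), max_token_size)]

# Whitespace "tokens" stand in for tiktoken ids in this illustration.
toy_tokens = "Question: What is anemia? Answer: A low red blood cell count.".split()
toy_chunks = split_tokens_into_chunks(toy_tokens, max_token_size=5)
for chunk in toy_chunks:
    print(" ".join(chunk))
```

The windows do not overlap, so a sentence that crosses a boundary ends up split between two chunks.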

In [17]:
client = Mistral(api_key=api_key)

In [18]:
def get_text_embedding(text):
    # Request an embedding for a single text and return its vector
    embeddings_batch_response = client.embeddings.create(
        model="mistral-embed",
        inputs=text
    )
    return embeddings_batch_response.data[0].embedding
# text_embeddings = np.array([get_text_embedding(chunk) for chunk in chunks_list])

In [None]:
from tqdm import tqdm
text_embeddings = []
for chunk in tqdm(chunks_list):
    text_embeddings.append(get_text_embedding(chunk))
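
The loop above sends one request per chunk. One possible variation, sketched below, groups chunks into batches before embedding them. This is an assumption on our side rather than what the notebook does: `embed_batch` is a hypothetical callable that takes a list of texts and returns one vector per text, and `fake_embed_batch` only exists to make the sketch runnable without an API key.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(chunks, embed_batch, batch_size=32):
    """Embed every chunk by calling embed_batch once per batch."""
    embeddings = []
    for batch in iter_batches(chunks, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        embeddings.extend(vectors)
    return embeddings

# Hypothetical stand-in for a real embedding call: one number per text.
def fake_embed_batch(texts):
    return [[float(len(t))] for t in texts]

demo = embed_all(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo)
```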

In [21]:
# Load the embedding arrays
emb_arr = np.load('embdeddigs_g')  # Make sure this is a valid file path
# emb_arr2 = np.load('embdeddigs_mcqa_red')  # Ensure this is a .npy file

# Ensure that the embedding arrays have the same number of dimensions
# if emb_arr.shape[1] != emb_arr2.shape[1]:
#     raise ValueError("The dimensions of emb_arr and emb_arr2 do not match.")

# Create the FAISS index
d = emb_arr.shape[1]
index = faiss.IndexFlatL2(d)

# Add the embeddings to the index (FAISS works on contiguous float32 vectors)
index.add(np.ascontiguousarray(emb_arr, dtype='float32'))
# index.add(emb_arr2)
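
For intuition, `IndexFlatL2` performs an exhaustive search and ranks stored vectors by squared Euclidean distance to the query. The pure-Python sketch below mimics that behaviour on toy two-dimensional vectors; `l2_search` is a hypothetical helper written for illustration, not a FAISS function.

```python
def l2_search(index_vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query.

    Distances are squared L2 distances, smallest first.
    """
    scored = []
    for idx, vec in enumerate(index_vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(stored, query=[0.9, 0.1], k=2)
print(indices, distances)
```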

In [29]:
df = pd.read_csv("data/questions.csv")


In [68]:
translation_prompt = lambda question: (
    "Translate the following question into English. If the question is already in English, repeat it as is.\n\n"
    f"Question: {question}\n"
    "Translation:"
)

In [69]:


initial_answer_prompt = lambda question, possible_answer_a, possible_answer_b, possible_answer_c, possible_answer_d, possible_answer_e, context: (
    "Vous êtes un assistant très compétent formé pour résoudre des questions à choix multiples dans divers domaines, "
    "y compris la science, les mathématiques, l'histoire et plus encore. Votre objectif est de fournir la réponse la plus précise en fonction des options présentées.\n\n"


    "Tâche : Identifier la ou les réponses correctes pour la question à choix multiples donnée. La question peut avoir une ou plusieurs réponses correctes.\n"
    "Fournir la ou les réponses en utilisant uniquement les lettres correspondant aux options correctes.\n\n"

    "# Instructions :\n"
    "1. Analysez la question pour comprendre son contexte et les informations qu'elle requiert.\n"
    "2. Évaluez chacune des options données (A, B, C, D, E) pour déterminer quelle(s) option(s) répond(ent) le mieux à la question.\n"
    "3. Si plusieurs options sont correctes, listez-les par ordre alphabétique, séparées par des virgules et sans espaces.\n"
    "4. Ne fournissez que la réponse telle que spécifiée, sans texte ou explications supplémentaires.\n\n"

    "# Contraintes :\n"
    "- Si toutes les réponses sont incorrectes, retournez 'Aucune'.\n"
    "- Si la question précise 'Sélectionnez toutes les réponses applicables', plusieurs réponses peuvent être possibles.\n"
    "- Si les options incluent des informations contradictoires, choisissez l'option la plus précise ou pertinente en fonction du contexte de la question.\n\n"
    
    "# Examples:\n"
    "- Question: 'Which of the following are prime numbers?'\n"
    "  A: 4\n"
    "  B: 5\n"
    "  C: 7\n"
    "  D: 9\n"
    "  E: 12\n"
    "  Output: 'B,C'\n\n"
    "- Question: 'Which animals are mammals?'\n"
    "  A: Elephant\n"
    "  B: Crocodile\n"
    "  C: Kangaroo\n"
    "  D: Snake\n"
    "  E: Dolphin\n"
    "  Output: 'A,C,E'\n\n"
    
    "If additional context is needed, here it is:\n"
    "--------------------\n"
    f"{context}\n"
    "--------------------\n\n"
    "# Question:\n"
    f"{question}\n"
    "Options:\n"
    f"A: {possible_answer_a}\n"
    f"B: {possible_answer_b}\n"
    f"C: {possible_answer_c}\n"
    f"D: {possible_answer_d}\n"
    f"E: {possible_answer_e}\n\n"
    "Answer:"
)

In [70]:

confirmation_prompt = lambda question, possible_answer_a, possible_answer_b, possible_answer_c, possible_answer_d, possible_answer_e, initial_answer: (
    "You have provided an initial answer to the multiple-choice question. Please reassess the question and options carefully.\n\n"
    "If you believe your initial answer was correct, confirm it and explain why. If you think it was incorrect, provide the correct answer with an explanation.\n\n"
    f"# Question:\n{question}\n"
    "Options [Translate To English if there's not English BEFORE answer]:\n"
    f"A: {possible_answer_a}\n"
    f"B: {possible_answer_b}\n"
    f"C: {possible_answer_c}\n"
    f"D: {possible_answer_d}\n"
    f"E: {possible_answer_e}\n\n"
    f"Previously selected answer: '{initial_answer}'\n\n"
    "Reassess the options and provide a new answer if needed, along with a brief explanation."
)


In [71]:
result_formatting_prompt = lambda question, context, final_answer: (
    "Initial question was:\n"
    f"{question}\n\n"
    "After careful evaluation, the final answer is:\n"
    f"{final_answer}\n\n"
    "Given this context:\n"
    f"{context}\n\n"
    "# Task\n"
    "Please confirm the final answer format as follows:\n"
    "- If one option is correct, format as: 'A'\n"
    "- If multiple options are correct, format as: 'A,B'\n"
    "- If no correct options, return: 'None'\n"
    "Final formatted answer:"
)

In [None]:

# Placeholder for answers
answers = []

for row_idx, row in df.iterrows():
    question_embeddings = np.array([get_text_embedding(row["question"])], dtype="float32")
    D, I = index.search(question_embeddings, k=4)  # distances, indices

    # Keep only valid indices (FAISS pads missing results with -1) and look them up
    # in chunks_list, assuming the index was built from these chunks in the same order
    retrieved_chunks = [chunks_list[i] for i in I[0] if 0 <= i < len(chunks_list)]
    # Only the closest chunk is passed on as context
    retrieved_chunk = retrieved_chunks[0] if retrieved_chunks else 'No additional context available.'
    # Step 1: Translate question to English if necessary
    translation_request = translation_prompt(row["question"])
    translation_response = client.chat.complete(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": translation_request}],
        temperature=0.
    )
    translated_question = translation_response.choices[0].message.content.strip()

    # Step 2: Initial Answer Selection
    context = retrieved_chunk  #  use FAISS retrieval if available
    initial_request = initial_answer_prompt(
        translated_question,
        row["answer_A"],
        row["answer_B"],
        row["answer_C"],
        row["answer_D"],
        row.get("answer_E"),
        context
    )
    initial_response = client.chat.complete(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": initial_request}],
        temperature=0.
    )
    initial_answer = initial_response.choices[0].message.content.strip()

    # Step 3: Confirmation with Reasoning
    confirmation_request = confirmation_prompt(
        translated_question,
        row["answer_A"],
        row["answer_B"],
        row["answer_C"],
        row["answer_D"],
        row.get("answer_E"),
        initial_answer
    )
    confirmation_response = client.chat.complete(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": confirmation_request}],
        temperature=0.
    )
    confirmed_answer = confirmation_response.choices[0].message.content.strip()

    # Step 4: Final Result Formatting
    result_request = result_formatting_prompt(
        translated_question,
        retrieved_chunk,
        initial_answer  # Crude shortcut: confirmed_answer from Step 3 is not used here
    )
    result_response = client.chat.complete(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": result_request}],
        temperature=0.0
    )
    final_answer = result_response.choices[0].message.content.strip()

    # Store the final formatted answer
    answers.append(final_answer)
    print(f"Final Answer for Question {row_idx}: {final_answer}")
# Output format is a two-column dataframe (id, Answer) with exactly 103 rows
output_df = pd.DataFrame(answers, columns=["Answer"])
output_df.index.name = "id"

output_df.to_csv("output_multistage.csv")
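
Because the formatting step still returns free text, the stored answers may contain quotes, spaces, or extra words. A small post-processing sketch such as the one below could reduce each reply to sorted option letters before writing the CSV. The helper `normalize_answer` is hypothetical, and the `"None"` fallback simply mirrors the wording of the formatting prompt; the competition's required value for "no answer" is an assumption here.

```python
import re

def normalize_answer(raw, no_answer="None"):
    """Reduce a free-text model reply to sorted, comma-separated option letters."""
    # Match standalone capital letters A-E only, so words like "Answer" are ignored.
    letters = re.findall(r"\b[A-E]\b", raw)
    unique = sorted(set(letters))
    return ",".join(unique) if unique else no_answer

print(normalize_answer("Final formatted answer: 'C, A'"))
print(normalize_answer("Aucune"))
```

One caveat: a reply that starts with the English article "A" (for example "A patient with...") would be read as option A, so this only suits replies that are already close to the requested format.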


In [None]:
import json

def convert_to_json(df):
    questions_list = []
    for _, row in df.iterrows():
        question_entry = {
            "qtype": row['qtype'],
            "question": row['Question'],
            "answer": row['Answer']
        }
        questions_list.append(question_entry)
        print(question_entry)
    
    result = {"questions": questions_list}
    return json.dumps(result, indent=4)

df_emb = pd.read_csv("C:\\Users\\aligh\\.cache\\kagglehub\\datasets\\thedevastator\\comprehensive-medical-q-a-dataset\\versions\\2\\train.csv")

convert_to_json(df_emb)

In [None]:
import os
import xml.etree.ElementTree as ET
import json


def extract_qa_pairs_to_jsonl(folder_path, output_jsonl, limit=5000):
    qa_list = []
    count = 0
    
    # Iterate through the folder and sub-folders
    for root_dir, sub_dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.xml'):
                file_path = os.path.join(root_dir, file)
                try:
                    # Parse the XML file
                    tree = ET.parse(file_path)
                    root = tree.getroot()
                    
                    # Find all QAPair elements
                    qa_pairs = root.findall('.//QAPair')
                    for qa in qa_pairs:
                        if count >= limit:
                            break
                        # Extract Question
                        question = qa.find('Question').text.strip()
                        
                        # Extract Answer if present
                        answer = qa.find('Answer')
                        if answer is not None and answer.text and answer.text.strip():
                            answer_text = answer.text.strip()
                            # Add the question and answer as a dictionary to the list
                            qa_list.append({"Question": question, "Answer": answer_text})
                            count += 1
                        else:
                            continue
                    
                    if count >= limit:
                        break

                except ET.ParseError as e:
                    print(f"Error parsing file {file_path}: {e}")
    
    # Save the collected Q&A pairs to a JSONL file
    with open(output_jsonl, 'w', encoding='utf-8') as jsonlfile:
        for qa in qa_list:
            jsonlfile.write(json.dumps(qa, ensure_ascii=False) + '\n')
        
    print(f"Saved {count} Q&A pairs to {output_jsonl}")


# Specify the path to the folder containing the XML files and the output JSONL file name
folder_path = 'C:\\Users\\aligh\\Downloads\\MedQuAD-master\\MedQuAD-master\\'
output_jsonl = 'qa_pairs.jsonl'
extract_qa_pairs_to_jsonl(folder_path, output_jsonl)


In [None]:
import os
import xml.etree.ElementTree as ET
import json

def extract_qa_pairs_to_mistral_jsonl(folder_path, output_jsonl):
    conversations = []
    count = 0
    
    # Iterate through the folder and sub-folders
    for root_dir, sub_dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.xml'):
                file_path = os.path.join(root_dir, file)
                try:
                    # Parse the XML file
                    tree = ET.parse(file_path)
                    root = tree.getroot()
                    
                    # Find all QAPair elements
                    qa_pairs = root.findall('.//QAPair')
                    for qa in qa_pairs:
                        # Extract Question
                        question = qa.find('Question').text.strip()
                        
                        # Extract Answer if present
                        answer = qa.find('Answer')
                        if answer is not None and answer.text and answer.text.strip():
                            answer_text = answer.text.strip()
                            
                            # Structure the conversation in Mistral format
                            conversation = {
                                "messages": [
                                    {
                                        "role": "user",
                                        "content": question
                                    },
                                    {
                                        "role": "assistant",
                                        "content": answer_text
                                    }
                                ]
                            }
                            
                            conversations.append(conversation)
                            count += 1
                        else:
                            continue
                    

                except ET.ParseError as e:
                    print(f"Error parsing file {file_path}: {e}")
    
    # Save the collected conversations to a JSONL file in Mistral format
    with open(output_jsonl, 'w', encoding='utf-8') as jsonlfile:
        for conversation in conversations:
            jsonlfile.write(json.dumps(conversation, ensure_ascii=False) + '\n')
        
    print(f"Saved {count} conversations to {output_jsonl}")


# Specify the path to the folder containing the XML files and the output JSONL file name
folder_path = 'C:\\Users\\aligh\\Downloads\\MedQuAD-master\\MedQuAD-master\\'
output_jsonl = 'qa_pairs_mistral.jsonl'
extract_qa_pairs_to_mistral_jsonl(folder_path, output_jsonl)
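
Before uploading the file, it can help to check that every line has the shape written above: a JSON object with a `messages` list holding one user turn followed by one assistant turn. The sketch below checks only that shape, not the full set of rules the fine-tuning service may enforce, and `validate_chat_jsonl_lines` is a hypothetical helper.

```python
import json

def validate_chat_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means every line passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append((number, "every message must be an object"))
            continue
        roles = [m.get("role") for m in messages]
        if roles != ["user", "assistant"]:
            problems.append((number, f"unexpected roles: {roles}"))
            continue
        if any(not isinstance(m.get("content"), str) or not m["content"].strip()
               for m in messages):
            problems.append((number, "empty or non-string content"))
    return problems

good = json.dumps({"messages": [
    {"role": "user", "content": "What is anemia?"},
    {"role": "assistant", "content": "A low red blood cell count."},
]})
bad_roles = json.dumps({"messages": [{"role": "user", "content": "Q"}]})
sample_problems = validate_chat_jsonl_lines([good, bad_roles, "not json"])
print(sample_problems)
```

To check the real file, the lines can be read with `open('qa_pairs_mistral.jsonl', encoding='utf-8')` and passed to the same function.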


In [60]:
from mistralai import Mistral
import os


client = Mistral(api_key=api_key)

training_data = client.files.upload(
    file={
        "file_name": "qa_pairs_mistral.jsonl",
        "content": open("qa_pairs_mistral.jsonl", "rb"),
    }
)  

In [None]:
training_data.id

In [74]:
# create a fine-tuning job
created_jobs = client.fine_tuning.jobs.create(
    model="mistral-large-latest", 
    training_files=[{"file_id": training_data.id, "weight": 1}],
    hyperparameters={
        "training_steps": 10,
        "learning_rate":0.0001
    },
    auto_start=False
)


In [None]:

# start a fine-tuning job
client.fine_tuning.jobs.start(job_id = created_jobs.id)

created_jobs

In [None]:
# # Cancel a job
# canceled_jobs = client.fine_tuning.jobs.cancel(job_id = created_jobs.id)
# print(canceled_jobs)

In [None]:
# Retrieve a job
retrieved_jobs = client.fine_tuning.jobs.get(job_id = created_jobs.id)
retrieved_jobs