## Build a Medical Question Answering System Using LangChain and Mistral 7B

In [1]:
# Uncomment the following block to install the required libraries
"""
!pip install langchain chromadb sentence-transformers
!pip install openai tiktoken
!pip install jq
!pip install faiss-cpu
!pip install pymilvus

"""

- Set the Hugging Face API token used to load the model

In [1]:
import os
os.environ['HUGGINGFACEHUB_API_TOKEN']='YOUR_HF_API_KEY'

- Load the PubMed articles from the JSON file. To prepare the JSON file, see the script `download_pubmed.py`.

In [2]:
import json

# 1. Read the file with the native json module, which avoids all the Pydantic errors
with open('./medical_data.json', 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

# 2. Manually reproduce the loading logic of JSONLoader
data = []
for record in raw_data:
    # Extract the abstract text
    content = record.get('article_abstract', '')

    # Extract the metadata (the equivalent of a JSONLoader metadata_func)
    metadata = {
        "year": record.get("pub_date", {}).get('year'),
        "month": record.get("pub_date", {}).get('month'),
        "day": record.get("pub_date", {}).get('day'),
        "title": record.get("article_title")
    }

    # Build a Document-like dict (useful if you go on to use LangChain).
    # If the data is only needed for SFT, you can skip ahead to the next step.
    data.append({"page_content": content, "metadata": metadata})

print(f"✅ Success! {len(data)} PubMed articles loaded with the native JSON reader.")
print(f"Sample record: {data[1]['metadata']['title']}")

✅ Success! 2496 PubMed articles loaded with the native JSON reader.
Sample record: Metal-metal bonds inside fullerenes.
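
The loader above keeps every record, including ones with no abstract, and it assumes `pub_date` is either absent or a dict. A minimal sketch of a more defensive variant, assuming the same field names as the cell above; the helper name `to_doc_dicts` and the sample records are hypothetical:

```python
def to_doc_dicts(records):
    """Convert raw records to {"page_content", "metadata"} dicts, skipping empty abstracts."""
    docs = []
    for record in records:
        content = (record.get("article_abstract") or "").strip()
        if not content:
            continue  # nothing to index for this record
        pub_date = record.get("pub_date") or {}  # tolerate a missing or null pub_date
        docs.append({
            "page_content": content,
            "metadata": {
                "year": pub_date.get("year"),
                "month": pub_date.get("month"),
                "day": pub_date.get("day"),
                "title": record.get("article_title"),
            },
        })
    return docs


# Hypothetical records, for illustration only
sample = [
    {"article_title": "A", "article_abstract": "Some text.", "pub_date": {"year": "2024"}},
    {"article_title": "B", "article_abstract": "", "pub_date": None},
]
docs = to_doc_dicts(sample)
print(len(docs), docs[0]["metadata"]["year"])
```

Whether to drop empty abstracts is a design choice; keeping them is harmless if the later steps tolerate empty text.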


- Chunk the abstracts into small text passages to keep retrieval efficient and to fit within the LLM's context length

In [5]:
from langchain_core.documents import Document
from langchain_text_splitters import TokenTextSplitter

# 1. Convert the data: turn the list of dicts into a list of Document objects
formatted_data = []
for entry in data:
    # Assumes data is a list of dicts read by native json.load or by a broken loader
    if isinstance(entry, dict):
        # Extract the content
        content = entry.get('page_content') or entry.get('article_abstract') or ""
        # Extract the metadata
        metadata = entry.get('metadata') or {
            "title": entry.get("article_title"),
            "year": entry.get("pub_date", {}).get("year")
        }
        formatted_data.append(Document(page_content=content, metadata=metadata))
    else:
        # Already a Document object; add it as-is
        formatted_data.append(entry)

# 2. Split the documents into chunks
text_splitter = TokenTextSplitter(chunk_size=128, chunk_overlap=64)

try:
    # Note: pass in the converted formatted_data here
    chunks = text_splitter.split_documents(formatted_data)
    print(f"✅ Success! {len(formatted_data)} articles split into {len(chunks)} chunks.")
    print(f"Sample chunk content: {chunks[0].page_content[:100]}...")
except Exception as e:
    print(f"❌ Splitting still failed, error: {e}")

✅ Success! 2496 articles split into 12223 chunks.
Sample chunk content: . Malnutrition in older adults. Malnutrition in older adults is a multifactorial condition with seri...
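
With `chunk_size=128` and `chunk_overlap=64`, each chunk starts 64 tokens after the previous one, so consecutive chunks share half their tokens. A rough sketch of that windowing over a plain list of tokens; `TokenTextSplitter` itself operates on tokenizer IDs, and the function name `sliding_windows` is hypothetical:

```python
def sliding_windows(tokens, chunk_size=128, chunk_overlap=64):
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # distance between consecutive window starts
    windows = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        windows.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the text
        start += stride
    return windows


# Stand-in "tokens": the integers 0..299
tokens = list(range(300))
windows = sliding_windows(tokens)
print([(w[0], w[-1]) for w in windows])
```

In the run above, 2496 abstracts became 12223 chunks, or roughly five chunks per abstract.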


- Load the embedding model. The following code defines two options for loading the model:
    - **Option a:** Using `SentenceTransformerEmbeddings` to load the best-performing Sentence Transformers model, `all-mpnet-base-v2`
    - **Option b:** Using `HuggingFaceEmbeddings` to load the popular model `e5-large-unsupervised`

In [6]:
# Option a: using all-mpnet from SentenceTransformer 
#from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
#embeddings = SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")

# Option b: using e5-large-unsupervised from Hugging Face
from langchain_community.embeddings import HuggingFaceEmbeddings

modelPath = "/home/janie/RAG/models/e5-large-unsupervised"
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs={'device':'cuda'},
    encode_kwargs={'normalize_embeddings':False}
)
print("✅ Model loaded from the local path.")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
No sentence-transformers model found with name /home/janie/RAG/models/e5-large-unsupervised. Creating a new one with mean pooling.


✅ Model loaded from the local path.
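
The model is loaded with `normalize_embeddings` set to `False`, so the vectors keep their original length. Whether that matters depends on the distance metric the vector store uses: cosine similarity ignores vector length, while Euclidean distance does not. A small sketch with made-up 2-D vectors (not real embeddings) shows the two metrics disagreeing until the vectors are normalized:

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the query, but three times longer
nearby = [1.0, 0.5]          # close in space, slightly different direction

print("cosine:", cosine(query, same_direction), cosine(query, nearby))
print("raw L2:", euclidean(query, same_direction), euclidean(query, nearby))
print("normalized L2:",
      euclidean(l2_normalize(query), l2_normalize(same_direction)),
      euclidean(l2_normalize(query), l2_normalize(nearby)))
```

Cosine ranks `same_direction` first, raw Euclidean distance ranks `nearby` first, and after normalization the Euclidean ranking matches the cosine one.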


- Build the vector database (VDB) to index the text chunks and their corresponding vectors. Three options are defined for the VDB:
    - **Option a:** Using chromaDB
    - **Option b:** Using Milvus
    - **Option c:** Using FAISS index

#TODO Add definition and comparison between the three options

In [7]:
'''
# Option a: Using chroma database
from langchain.vectorstores import Chroma
db = Chroma.from_documents(chunks, embeddings)
'''

'''
# Option b: Using Milvus database
# To run the following code, you should have a milvus instance up and running
# Follow the instructions in the following the link: https://milvus.io/docs/install_standalone-docker.md
from langchain.vectorstores import Milvus
db = Milvus.from_documents(
    chunks,
    embeddings,
    connection_args={"host": "127.0.0.1", "port": "19530"},
)
'''

# Active option: Chroma with on-disk persistence
from langchain_community.vectorstores import Chroma

# Make sure `embeddings` was defined successfully in the previous cell.
# Set the persistence directory explicitly and make sure it is clean.
db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_medical_db_final"  # use a fresh directory
)
# Note: recent LangChain versions persist automatically; no manual .persist() call is needed
print("✅ Vector store created and saved.")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


✅ Vector store created and saved.
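
Conceptually, each vector store option does the same job: it stores one vector per chunk and returns the chunks whose vectors are closest to the query vector. A brute-force sketch of that idea, using hypothetical 3-D vectors and a made-up `ToyVectorIndex` class; real stores add persistence and faster index structures on top of this:

```python
import math


class ToyVectorIndex:
    """Brute-force nearest-neighbour index over (text, vector) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector, k=2):
        """Return the k stored texts most similar to query_vector."""
        scored = [(self._cosine(query_vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


index = ToyVectorIndex()
index.add("vitrification of oocytes", [0.9, 0.1, 0.0])
index.add("cholesterol and blood pressure", [0.0, 0.2, 0.9])
index.add("slow freezing of embryos", [0.8, 0.3, 0.1])

top = index.search([1.0, 0.0, 0.0], k=2)
print(top)
```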


- Load pre-trained Mistral 7B

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline
import torch

local_model_path = "/home/janie/RAG/models/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(local_model_path)

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True
)

# Create the text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
    do_sample=False,        # greedy decoding, so medical answers stay consistent between runs
    repetition_penalty=1.1  # slightly raise the penalty so answers do not fall into repetition loops
)

# Wrap the pipeline as a LangChain LLM object
llm = HuggingFacePipeline(pipeline=pipe)

print("✅ Local Mistral model loaded onto the GPU.")

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.47s/it]

✅ Local Mistral model loaded onto the GPU.





- Define the RAG pipeline using LangChain. The LLM's answer depends heavily on the prompt template, so we tested three different prompts; PROMPT2 gave the best answers.

#TODO: Add explanation about the three prompts

In [9]:
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
import time

# PROMPT 1
PROMPT_TEMPLATE_1 = """Answer the question based only on the following context:
{context}
You are allowed to rephrase the answer based on the context. 
Question: {question}
"""
PROMPT1 = PromptTemplate.from_template(PROMPT_TEMPLATE_1)

# PROMPT 2
PROMPT_TEMPLATE_2="Your are a medical assistant for question-answering tasks. Answer the Question using the provided Contex only. Your answer should be in your own words and be no longer than 128 words. \n\n Context: {context} \n\n Question: {question} \n\n Answer:"
PROMPT2 = PromptTemplate.from_template(PROMPT_TEMPLATE_2)

# PROMPT 3
from langchain import hub
PROMPT3 = hub.pull("rlm/rag-prompt", api_url="https://api.hub.langchain.com")

# RAG pipeline
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(search_kwargs={"k": 2}),  # top-k is configured through search_kwargs
    chain_type_kwargs={"prompt": PROMPT2},
    return_source_documents=True
)
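
Before calling the LLM, the retrieval chain places the retrieved chunks in the `{context}` slot of the prompt and the user query in `{question}`. A simplified sketch of that assembly using plain string formatting; the helper `build_prompt`, the shortened template, the sample chunks, and the blank-line separator are assumptions for illustration, not LangChain internals:

```python
PROMPT_TEMPLATE = (
    "Answer the Question using the provided Context only.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_prompt(chunks, question, template=PROMPT_TEMPLATE):
    """Join the retrieved chunks and fill them into the prompt template."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return template.format(context=context, question=question)


# Hypothetical retrieved chunks
retrieved = [
    "Vitrification avoids ice crystal formation. ",
    " Slow freezing uses controlled cooling rates.",
]
prompt = build_prompt(retrieved, "What are the safest cryopreservation methods?")
print(prompt)
```

Because every retrieved chunk is pasted into one prompt, the chunk size and the number of retrieved chunks together determine how much of the model's context window the prompt consumes.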



- Run one sample query: `"What are the safest cryopreservation methods?"`

In [10]:
start_time = time.time()
query = "What are the safest cryopreservation methods?"
result = qa_chain.invoke({"query": query})
print(f"\n--- {time.time() - start_time} seconds ---")

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



--- 2.6624715328216553 seconds ---


In [11]:
print(result['result'].strip())
titles = ['\t-'+doc.metadata['title'] for doc in result['source_documents']]
print("\n\nThe provided answer is based on the following PubMed articles:\t")
print("\n".join(set(titles)))

Your are a medical assistant for question-answering tasks. Answer the Question using the provided Contex only. Your answer should be in your own words and be no longer than 128 words. 

 Context: <b><i>Objectives:</i></b> This study compared the synthetic polymer (SP) and the antifreeze protein type 3 (AFP3) protocols for the vitrification of bovine cumulus-oocyte complexes (COCs). <b><i>Methods:</i></b> Fresh bovine COCs were subjected to <i>in vitro</i> maturation (IVM) for 24 hours, while other COCs were vitrified using the SP or AFP protocols. After vitrification and warming, the COCs were subjected to IVM for 24 hours

The Brazilian Caatinga biome, a hotspot of unique biodiversity, faces escalating threats from habitat loss and climate change. Over the past two decades, significant progress has been made in developing reproductive biotechnologies to preserve the genetic diversity of native species through germplasm biobanking. This review synthesizes pioneering work by the Laborat
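
The printed result above begins with the prompt itself, because the generated text includes the prompt followed by the completion. One way to show only the answer is to cut everything up to the final `Answer:` marker that PROMPT2 ends with. A minimal sketch; `extract_answer` is a hypothetical helper and the sample string is made up:

```python
def extract_answer(generated_text, marker="Answer:"):
    """Return the text after the last occurrence of marker, or the whole text if it is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Made-up generation: the prompt echoed back, followed by the completion
sample_output = (
    "Context: some retrieved passage \n\n"
    " Question: What are the safest cryopreservation methods? \n\n"
    " Answer: Vitrification is described as a safe method."
)
print(extract_answer(sample_output))
```

In the notebook this could be applied as `extract_answer(result['result'])`. The heuristic would cut too much if the model's own completion contained the marker again.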

- Get the answer to the sample query from the LLM alone, without retrieval

In [12]:
# Define the langchain pipeline for llm only
from langchain_core.prompts import PromptTemplate
PROMPT_TEMPLATE ="""Answer the given Question only. Your answer should be in your own words and be no longer than 100 words. \n\n Question: {question} \n\n
Answer:
"""
PROMPT = PromptTemplate.from_template(PROMPT_TEMPLATE)
llm_chain = PROMPT | llm
start_time = time.time()
result = llm_chain.invoke({"question": query})
print(f"\n--- {time.time() - start_time} seconds ---")
print(result)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



--- 3.3482871055603027 seconds ---
Answer the given Question only. Your answer should be in your own words and be no longer than 100 words. 

 Question: What are the safest cryopreservation methods? 


Answer:

Cryopreservation is a process of preserving cells, tissues or organs by freezing them at very low temperatures. It is used to preserve biological materials for future use. There are several methods of cryopreservation, but some are safer than others. The safest cryopreservation methods include vitrification, slow freezing, and encapsulation.

Vitrification is a method of cryopreservation that involves rapidly cooling the sample to extremely low temperatures. This method is considered to be the safest because it prevents ice crystals from forming inside the cells, which can damage them.


In [None]:
# 1. Define the retriever (fetch the 3 most relevant abstract chunks from the db created above)
retriever = db.as_retriever(search_kwargs={"k": 3})

# 2. Build the RAG question-answering chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # the HuggingFacePipeline defined earlier
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True  # lets you see which papers the answer drew on
)

# 3. Ask a test question
question = "What are the common symptoms discussed in recent PubMed articles about heart disease?"
result = qa_chain.invoke({"query": question})

print("Bot answer:", result["result"])
print("Sources:", [doc.metadata['title'] for doc in result["source_documents"]])

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Bot answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

mg/dL; 95% CI= -14.59 to -4.65), total cholesterol (-9.47 mg/dL; 95% CI= -15.92 to -3.02), triglycerides (-8.96 mg/dL; 95% CI= -16.19 to -1.73), high-density lipoprotein cholesterol (2.95 mg/dL; 95% CI = 0.66 to 5.25), diastolic blood pressure ( -2.87 mmHg; 95% CI= -4.23 to -1.51),

 Transthoracic echocardiography is the first-line modality for assessment, but magnetic resonance imaging has emerged as a more accurate tool for the tissue characterization of this disease. Consider endomyocardial fibrosis in patients with restrictive cardiomyopathy and a tropical origin or eosinophilia.Cardiac magnetic resonance imaging is essential for non-invasive diagnosis and assessment of fibrosis, calcification, and ventricular involvement.Microvascular angina may be an unusual initial presentation of endomyocardial fibrosis.

Bidi

: 