## 1. Load fine-tuned model

In [None]:
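import unsloth  # imported first so Unsloth can apply its patches
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TextStreamer  # used below for the tokenizer and streamed generation
from langchain.embeddings import HuggingFaceEmbeddings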

# Load the fine-tuned model (base weights plus the saved LoRA adapter)
def load_model_tokenizer_ft(base_model_id="Qwen/Qwen2.5-3B-Instruct", adapter_path="grpo_lora"):
    # Load the base model
    base_model, _ = FastLanguageModel.from_pretrained(
        model_name=base_model_id,
        max_seq_length=1024,         # Same as during training
        load_in_4bit=True,           # Load in 4-bit for memory efficiency
    )

    # Attach LoRA adapters; the configuration must match the one used during training
    model = FastLanguageModel.get_peft_model(
        base_model,
        r=16,  # Same rank as during training
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=32,  # Same alpha as during training
        use_gradient_checkpointing="unsloth",  # Enable gradient checkpointing if needed
        random_state=3407,  # Same random state as during training
    )
    # Load the saved LoRA weights
    model.load_adapter(adapter_path, adapter_name="default")
    # Set the model to evaluation mode
    model.eval()

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    return model, tokenizer

model, tokenizer = load_model_tokenizer_ft()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: We'll be using `/tmp/unsloth_compiled_cache` for temporary Unsloth patches.
Standard import failed for UnslothCPOTrainer: No module named 'UnslothCPOTrainer'. Using tempfile instead!
==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.50.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.3.19 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
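
The loader above hardcodes `r=16`, `lora_alpha=32`, and the target modules, and these must match the training run. As a minimal sketch, the values can instead be read from the `adapter_config.json` file that PEFT writes next to the saved adapter weights. The helper name `read_lora_settings` is hypothetical, and the sketch assumes the adapter directory contains that file with `target_modules` stored as a list.

```python
import json
from pathlib import Path


def read_lora_settings(adapter_path="grpo_lora"):
    """Return the LoRA hyperparameters saved alongside the adapter weights.

    Assumes `adapter_path` holds a PEFT `adapter_config.json` whose
    `target_modules` entry is a list of module names.
    """
    config_file = Path(adapter_path) / "adapter_config.json"
    with config_file.open() as f:
        config = json.load(f)
    return {
        "r": config["r"],
        "lora_alpha": config["lora_alpha"],
        "target_modules": sorted(config["target_modules"]),
    }
```

The returned values could then be passed to `FastLanguageModel.get_peft_model` in place of the literals, so the inference setup cannot drift from the training configuration.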


## 2. RAG

In [None]:
%%capture
!pip install pypdf
!pip install -U langchain-community
!pip install chromadb
!pip install sentence-transformers

In [None]:
from langchain.document_loaders import PyPDFLoader

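# Load the PDF; the loader returns one Document per page
loader = PyPDFLoader("medical.pdf")
pages = loader.load()
print(len(pages))  # number of pages loaded
print(pages[10].page_content[:1000])  # first 1000 characters of the 11th page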

100
saving of many thousands of dollars.
As resources become increasingly constrained, the physician must weigh the possible 
benefits of performing costly procedures that provide only a limited life expectancy 
against the pressing need for more primary care for those persons who do not have 
adequate access to medical services. For the individual patient, it is important to reduce 
costly hospital admissions as much as possible if total health care is to be provided at a 
cost that most can afford. This policy, of course, implies and depends on close 
cooperation among patients, their physicians, employers, payers, and government. It is 
equally important for physicians to know the cost of the diagnostic procedures they 
order and the drugs and other therapies they prescribe and to monitor both costs and 
effectiveness. The medical profession should provide leadership and guidance to the 
public in matters of cost control, and physicians must take this responsibility seriously 
witho

In [None]:
# Split the pages into overlapping chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=100,  # characters shared between neighboring chunks
)
splits = splitter.split_documents(pages)
print("Number of splits: ", len(splits))
print("-----First text split------\n", splits[0].page_content)

Number of splits:  404
-----First text split------
 PART ONE -INTRODUCTION TO CLINICAL MEDICINE
1. THE PRACTICE OF MEDICINE - The Editors
WHAT IS EXPECTED OF THE PHYSICIAN
The practice of medicine combines both science and art. The role of science in medicine
 is clear. Science-based technology and deductive reasoning form the foundation for the 
solution to many clinical problems; the spectacular advances in genetics, biochemistry, 
and imaging techniques allow access to the innermost parts of the cell and the most 
remote recesses of the body. Highly advanced therapeutic maneuvers are increasingly 
a major part of medical practice. Yet skill in the most sophisticated application of 
laboratory technology and in the use of the latest therapeutic modality alone does not 
make a good physician. One must be able to identify the crucial elements in a complex 
history and physical examination and extract the key laboratory results from the 
crowded computer printouts of laboratory data in
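
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries). The function name `chunk_with_overlap` is hypothetical and only illustrates how consecutive chunks share characters.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbors share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps retrieval at the cost of some duplicated text.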

In [None]:
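from langchain.vectorstores import Chroma

# Embedding model used to vectorize the chunks.
# Assumption: `embedding` is not defined anywhere in this section, so the
# class's default model is used here; substitute the model you actually want.
embedding = HuggingFaceEmbeddings()

# Directory where Chroma persists the vector store
persist_directory = 'chroma/'

# Build the Chroma vector store from the chunks
vectordb = Chroma.from_documents(
    documents=splits,  # chunks produced by the splitter
    embedding=embedding,
    persist_directory=persist_directory  # Path to store embeddings
)

print(vectordb._collection.count())  # number of stored vectors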

404
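
Queries against this store return each document together with a score. LangChain's Chroma wrapper documents that score as a distance, so a lower value means a closer match. The toy sketch below shows the idea with a brute-force search over hand-written 2-D vectors. The squared L2 metric and the names `squared_l2`, `search_with_score`, and `keep_close` are assumptions for illustration, not the store's actual implementation.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_score(query_vec, store, k=3):
    """Return the k (text, distance) pairs closest to the query.

    `store` is a list of (text, vector) pairs; results are sorted by
    ascending distance, so the best match comes first.
    """
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


def keep_close(results, max_distance):
    """Keep only the texts whose distance is at most `max_distance`."""
    return [text for text, dist in results if dist <= max_distance]


toy_store = [
    ("origin", [0.0, 0.0]),
    ("near", [1.0, 0.0]),
    ("far", [3.0, 4.0]),
]
results = search_with_score([0.0, 0.0], toy_store, k=2)
print(results)                   # [('origin', 0.0), ('near', 1.0)]
print(keep_close(results, 0.5))  # ['origin']
```

With distance scores, a relevance filter keeps results below a threshold. A filter that keeps results above a threshold would retain the weakest matches instead.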


In [None]:
# Inference with RAG
def inference_with_rag(question):
    # select_prompt must already be defined; it returns the system prompt
    # and the sampling temperature for this question
    system_prompt, temperature = select_prompt(question)

    # Retrieve the 3 nearest chunks with their scores (similarity_search_with_score)
    results_with_scores = vectordb.similarity_search_with_score(question, k=3)
    scores = [score for _, score in results_with_scores]
    print("--------Top scores-------\n", scores)
    # Keep documents whose score is at least 0.25.
    # Note: Chroma scores are distances (lower = more similar), so this keeps
    # the more distant matches; `score <= threshold` would keep the closest ones.
    relevant_results = [doc for doc, score in results_with_scores if score >= 0.25]

    if relevant_results:
        context = "\n\n".join([d.page_content for d in relevant_results])
    else:
        context = ""
    # user prompt
    template = (
        "Use the following pieces of context to answer the question at the end. If the context is None, simply answer the question based on your knowledge. If you don't know the answer, just say that you don't know, don't make up an answer."
        "Show your reasoning in <think> </think> tags. And return the final answer in <answer> </answer> tags. Stop generating after <answer> </answer> tags\n"
        "{context}\n"
        "Question: {question}"
    )
    user_prompt=template.format(context=context,question=question)
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Let me solve this step by step.\n<think>"}
    ]

    # Apply the chat template; continue_final_message leaves the prefilled assistant turn open
    inputs = tokenizer.apply_chat_template(messages, tokenize=False, continue_final_message=True)
    # Tokenize
    model_inputs = tokenizer([inputs], return_tensors="pt").to(model.device)
    # Generate, streaming the text to stdout as it is produced
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,          # Adjust for longer responses
        temperature=temperature,      # set per question; lower is more deterministic
        streamer=TextStreamer(tokenizer, skip_special_tokens=True),
    )
    # only record the part generated by the model
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    # decode to text
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
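

The prompt asks the model to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags. Because the assistant turn is prefilled with an opening `<think>`, the returned text usually begins inside the reasoning block. A minimal sketch for separating the two parts follows. The helper names `extract_answer` and `extract_reasoning` are hypothetical, and the sketch assumes the model actually emits the closing tags.

```python
import re


def extract_answer(response):
    """Return the text inside the first <answer>...</answer> pair, or None."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_reasoning(response):
    """Return the text before </think>, or None if the tag is missing.

    The opening <think> tag is part of the prefilled prompt, so it is
    normally absent from the generated text; it is stripped if present.
    """
    head, closing_tag, _ = response.partition("</think>")
    if not closing_tag:
        return None
    return head.replace("<think>", "").strip()


sample = " First consider the context.\n</think>\n<answer> example answer </answer>"
print(extract_reasoning(sample))  # First consider the context.
print(extract_answer(sample))     # example answer
```

Returning `None` when a tag is missing lets the caller tell a malformed generation (for example, one cut off by `max_new_tokens`) apart from an empty answer.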


In [None]:
question = "Which type of thyroid cancer is commonly associated with conditions such as pancreatitis, pituitary tumor, and pheochromocytoma?"
response = inference_with_rag(question)

--------Top scores-------
 [0.3617010712623596, 0.3765064775943756, 0.3802699148654938]
system
You a medical expert with advanced knowledge in clinical reasoning, diagonstics, and treatment planning. You first think through the reasoning process step-by-step in your mind and then provide the user with the answer.
user
Use the following pieces of context to answer the question at the end. If the context is None, simply answer the question based on your knowledge. If you don't know the answer, just say that you don't know, don't make up an answer.Show your reasoning in <think> </think> tags. And return the final answer in <answer> </answer> tags. Stop generating after <answer> </answer> tags
1. Disease presentation is often atypical in the elderly, especially in those more than 75 
to 80 years old. Homeostatic strain caused by onset of a new disease often leads to 
symptoms associated with a different organ system, particularly one compromised by 
preexisting disease. For example, fewer 