## 1. Normal Inference


In [1]:
%%capture
!pip install datasets trl peft bitsandbytes==0.45.0 
!pip install --upgrade pip
!pip install unsloth unsloth_zoo --no-cache-dir --upgrade
!pip install vllm
!pip install --upgrade pillow

In [2]:
import unsloth
import torch
from transformers import AutoTokenizer, TextStreamer
from unsloth import FastLanguageModel

def load_model_tokenizer_ft(base_model_id="Qwen/Qwen2.5-3B-Instruct", adapter_path="grpo_lora"):
    # Load the base model
    base_model, _ = FastLanguageModel.from_pretrained(
        model_name=base_model_id,
        max_seq_length=1024,         # Same as during training
        load_in_4bit=True,           # Load in 4-bit for memory efficiency
        fast_inference=True,         # Enable fast inference
    )

    # Apply LoRA weights (need to be the same as before)
    model = FastLanguageModel.get_peft_model(
        base_model,
        r=16,  # Same rank as during training
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=32,  # Same alpha as during training
        use_gradient_checkpointing="unsloth",  # Enable gradient checkpointing if needed
        random_state=3407,  # Same random state as during training
    )
    # Load the saved LoRA weights
    model.load_adapter(adapter_path,adapter_name="default")
    # Set the model to evaluation mode
    model.eval()

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    return base_model, model, tokenizer

base_model, model,tokenizer=  load_model_tokenizer_ft()


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-10 03:08:55 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.168 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.49%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.17 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 8.55 GB. Al



INFO 04-10 03:09:06 weight_utils.py:254] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 04-10 03:09:08 model_runner.py:1115] Loading model weights took 2.2160 GB
INFO 04-10 03:09:08 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-10 03:09:10 worker.py:267] Memory profiling takes 2.46 seconds
INFO 04-10 03:09:10 worker.py:267] the current vLLM instance can use total_gpu_memory (22.17GiB) x gpu_memory_utilization (0.49) = 10.97GiB
INFO 04-10 03:09:10 worker.py:267] model weights take 2.22GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 7.48GiB.
INFO 04-10 03:09:11 executor_base.py:111] # cuda blocks: 13623, # CPU blocks: 10922
INFO 04-10 03:09:11 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 212.86x
INFO 04-10 03:09:16 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error

Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:26<00:00,  1.17it/s]

INFO 04-10 03:09:43 model_runner.py:1562] Graph capturing finished in 26 secs, took 4.41 GiB
INFO 04-10 03:09:43 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 35.18 seconds



Unsloth 2025.3.19 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


In [3]:
# tokenize input question
def format_prompt(question):
    system_prompt = (
    "You are an expert in solving high school math word problems. "
    "You first think through the reasoning process step-by-step in your mind and then provide the answer."
    )

    user_prompt= (
    "Solve the following question:\n{ques}.\n"
    "Show your work in <think> </think> tags. And return the final answer as an integer in <answer> </answer> tags."
    )

    prompt=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt.format(ques=question)},
        {"role": "assistant", "content": "Let me solve this step by step.\n<think>"}
    ]
    return tokenizer.apply_chat_template(prompt,tokenize=False, continue_final_message=True)


def inference(question):
    prompt=format_prompt(question)
    # tokenize the prompt
    input_ids=tokenizer(prompt,return_tensors="pt").input_ids
    input_ids=input_ids.to(model.device)

    with torch.no_grad():
        model.generate(
            input_ids,
            max_length=1024,  # Adjust based on your needs
            temperature=0.1,  # Control randomness (lower = more deterministic)
            top_p=0.9,        # Take the top logits with probs summing up to 0.9
            do_sample=True,
            streamer = TextStreamer(tokenizer, skip_special_tokens=True)
        )


In [7]:
question="James trains for the Olympics. He trains twice a day for 4 hours each time for all but 2 days per week. How many hours does he train a year?"
inference(question)

system
You are an expert in solving high school math word problems. You first think through the reasoning process step-by-step in your mind and then provide the answer.
user
Solve the following question:
James trains for the Olympics. He trains twice a day for 4 hours each time for all but 2 days per week. How many hours does he train a year?.
Show your work in <think> </think> tags. And return the final answer as an integer in <answer> </answer> tags.
assistant
Let me solve this step by step.
<think> First, we need to determine how many days James trains in a week. He trains for 7 days minus the 2 days he doesn't train, which means he trains for 5 days a week. 

Next, we calculate the total training hours per training day. He trains for 4 hours each time, and he trains twice a day, so he trains for 4 * 2 = 8 hours per training day.

Now, we multiply the number of training days per week by the number of hours he trains each day to find the total training hours per week. That is, 5 days

## 2. Inference with RAG

In [8]:
%%capture
!pip install pypdf
!pip install -U langchain-community
!pip install chromadb
!pip install sentence-transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [9]:
from langchain.document_loaders import PyPDFLoader
loader =PyPDFLoader("decimalwordprobsgood.pdf")
pages=loader.load()
print(len(pages))
print(pages[0])

104
page_content='Word Problem 
Practice Workbook
00i_TP_881033.indd Page 1  1/15/08  10:45:43 AM user00i_TP_881033.indd Page 1  1/15/08  10:45:43 AM user /Volumes/ju104/MHGL149/Quark%0/Word Problem%/Application file%0/FM/Course 1/Volumes/ju104/MHGL149/Quark%0/Word Problem%/Application file%0/FM/Course 1
PDF Proof' metadata={'producer': 'Acrobat Distiller 7.0 for Macintosh', 'creator': 'QuarkXPress™ 4.1 w/d: LaserWriter 8 8.7.3', 'creationdate': '2007-12-24T15:16:33+05:30', 'author': 'qwe', 'moddate': '2008-01-15T15:37:24+05:30', 'title': '001_009_CRM01_881033.qxd', 'source': 'decimalwordprobsgood.pdf', 'total_pages': 104, 'page': 0, 'page_label': '1'}


In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 50,
)
splits=splitter.split_documents(pages)
print(len(splits))
print(splits[0])

409
page_content='Word Problem 
Practice Workbook
00i_TP_881033.indd Page 1  1/15/08  10:45:43 AM user00i_TP_881033.indd Page 1  1/15/08  10:45:43 AM user /Volumes/ju104/MHGL149/Quark%0/Word Problem%/Application file%0/FM/Course 1/Volumes/ju104/MHGL149/Quark%0/Word Problem%/Application file%0/FM/Course 1
PDF Proof' metadata={'producer': 'Acrobat Distiller 7.0 for Macintosh', 'creator': 'QuarkXPress™ 4.1 w/d: LaserWriter 8 8.7.3', 'creationdate': '2007-12-24T15:16:33+05:30', 'author': 'qwe', 'moddate': '2008-01-15T15:37:24+05:30', 'title': '001_009_CRM01_881033.qxd', 'source': 'decimalwordprobsgood.pdf', 'total_pages': 104, 'page': 0, 'page_label': '1'}


In [11]:
from langchain.vectorstores import Chroma
persist_directory = 'cs229_lectures/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [12]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Use SentenceTransformer as a LangChain-compatible embedding function
embedding_function = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")

# Create ChromaDB vector store
vectordb = Chroma.from_documents(
    documents=splits,  # Your list of text documents
    embedding=embedding_function,
    persist_directory=persist_directory  # Path to store embeddings
)

print(vectordb._collection.count())  # Check number of stored vectors


  embedding_function = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")


1602


In [13]:
def inference(model,tokenizer,question):
    system_prompt="You are a helpful assistant. You first think through the reasoning process step-by-step in your mind and then provide the answer."
    
    # search vectordb for meaningful context: use similarity_search_with_scores
    results_with_scores=vectordb.similarity_search_with_score(question,k=3)
    # filter documents with high scores
    relevant_results = [doc for doc, score in results_with_scores if score >= 0.3]

    if relevant_results:
        context = "\n\n".join([d.page_content for d in relevant_results])
    else:
        context=""
    # user prompt
    template = """Use the following pieces of context to answer the question at the end. If the context is None, simply answer the question based on your knowledge. If you don't know the answer, just say that you don't know, don't try to make up an answer. 
    Show your reasoning in <think> </think> tags. And return the final answer in <answer> </answer> tags. 
    {context}
    Question: {question}
    Helpful Answer:"""
    user_prompt=template.format(context=context,question=question)
    
    prompt=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "Let me solve this step by step.\n<think>"}
    ]
    # set input text 
    text=tokenizer.apply_chat_template(
        prompt,
        tokenize=False, 
        continue_final_message=True,
        #add_generation_prompt=False
    )
    
    # tokenize inputs
    input_ids=tokenizer(text,return_tensors="pt").input_ids
    input_ids=input_ids.to(model.device)

    with torch.no_grad():
        model.generate(
            input_ids,
            max_new_tokens=1024,  # Adjust based on your needs
            temperature=0.3,      # Control randomness (lower = more deterministic)
            top_p=0.9,            # Take the top logits with probs summing up to 0.9
            do_sample=True,
            streamer = TextStreamer(tokenizer, skip_special_tokens=True)
        )


In [14]:
question= """In today's field day challenge, the 4th graders were competing against the 5th graders. 
Each grade had 2 different classes. The first 4th grade class had 12 girls and 13 boys. 
The second 4th grade class had 15 girls and 11 boys. 
The first 5th grade class had 9 girls and 13 boys while the second 5th grade class had 10 girls and 11 boys. 
In total, how many more boys were competing than girls?"""
inference(base_model,tokenizer,question)

system
You are a helpful assistant. You first think through the reasoning process step-by-step in your mind and then provide the answer.
user
Use the following pieces of context to answer the question at the end. If the context is None, simply answer the question based on your knowledge. If you don't know the answer, just say that you don't know, don't try to make up an answer. 
    Show your reasoning in <think> </think> tags. And return the final answer in <answer> </answer> tags. 
    
    Question: In today's field day challenge, the 4th graders were competing against the 5th graders. 
Each grade had 2 different classes. The first 4th grade class had 12 girls and 13 boys. 
The second 4th grade class had 15 girls and 11 boys. 
The first 5th grade class had 9 girls and 13 boys while the second 5th grade class had 10 girls and 11 boys. 
In total, how many more boys were competing than girls?
    Helpful Answer:
assistant
Let me solve this step by step.
<think> First, 

I'll calculate the total number of boys and girls from both grades. Then, I'll find the difference between the total number of boys and girls. </think>
The total number of boys is:
12 (4th grade class 1) + 13 (4th grade class 2) + 9 (5th grade class 1) + 11 (5th grade class 2) = 45 boys

The total number of girls is:
15 (4th grade class 2) + 9 (5th grade class 1) + 10 (5th grade class 2) = 34 girls

Now, to find out how many more boys were competing than girls, I'll subtract the total number of girls from the total number of boys:
45 boys - 34 girls = 11 more boys than girls

<answer>11</answer>
