# Demonstration of the Granite RAG Context Relevance Intrinsic

This notebook shows how to use the IO processor for the Granite RAG context relevance intrinsic,
also known as the [LoRA Adapter for Context Relevance Classifier]().

This notebook can run its own vLLM server to perform inference, or you can host the 
models on your own server. To use your own server, set the `run_server` variable below
to `False` and set appropriate values for the constants 
`openai_base_url`, `openai_base_model_name` and `openai_lora_model_name`.
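
Before pointing the notebook at an existing server, it can help to confirm that the server lists both the base model and the LoRA adapter. The sketch below is not part of granite-io. It assumes the server implements the OpenAI-compatible `GET /v1/models` endpoint and that `openai_base_url` already ends in `/v1`, as in the constants used later. The helper names `model_ids_from_payload`, `missing_models` and `fetch_models_payload` are hypothetical.

```python
import json
import urllib.request


def model_ids_from_payload(payload: dict) -> set[str]:
    """Extract model IDs from an OpenAI-style `GET /v1/models` response body."""
    return {entry["id"] for entry in payload.get("data", [])}


def missing_models(payload: dict, required: list[str]) -> list[str]:
    """Return the required model names that the server does not list."""
    served = model_ids_from_payload(payload)
    return [name for name in required if name not in served]


def fetch_models_payload(base_url: str, api_key: str, timeout: float = 10.0) -> dict:
    """Fetch the model list from an OpenAI-compatible server (needs network access)."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Offline demonstration with a hand-written payload in the OpenAI list format.
example_payload = {
    "object": "list",
    "data": [
        {"id": "ibm-granite/granite-3.3-8b-instruct", "object": "model"},
    ],
}
required_names = [
    "ibm-granite/granite-3.3-8b-instruct",
    "local-granite-3.3-8b-lora-rag-context-relevancy",
]
print(missing_models(example_payload, required_names))
```

Against a live server, `missing_models(fetch_models_payload(openai_base_url, openai_api_key), [openai_base_model_name, openai_lora_model_name])` would ideally return an empty list.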

In [1]:
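import os
import sys

# Add the project's src directory to the import path. The relative path
# assumes the notebook's working directory sits one level below the project root.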
src_path = os.path.abspath("../src")
if src_path not in sys.path:
    sys.path.insert(0, src_path)

In [2]:
from granite_io.io.granite_3_3.input_processors.granite_3_3_input_processor import (
    Granite3Point3Inputs,
)
from granite_io import make_io_processor, make_backend
from IPython.display import display, Markdown
from granite_io.backend.vllm_server import LocalVLLMServer
from granite_io.io.context_relevancy import (
    ContextRelevancyIOProcessor,
    ContextRelevancyCompositeIOProcessor,
)

In [3]:
# Constants go here
base_model_name = "ibm-granite/granite-3.3-8b-instruct"
# TEMPORARY: Load LoRA adapter locally
lora_model_name = "local-granite-3.3-8b-lora-rag-context-relevancy"
run_server = True

In [4]:
if run_server:
    # Start by firing up a local vLLM server and connecting a backend instance to it.
    server = LocalVLLMServer(
        base_model_name, lora_adapters=[(lora_model_name, lora_model_name)]
    )
    server.wait_for_startup(200)
    lora_backend = server.make_lora_backend(lora_model_name)
    backend = server.make_backend()
else:  # if not run_server
    # Use an existing server.
    # Modify the constants here as needed.
    openai_base_url = "http://localhost:55555/v1"
    openai_api_key = "granite_intrinsics_1234"
    openai_base_model_name = base_model_name
    openai_lora_model_name = lora_model_name
    backend = make_backend(
        "openai",
        {
            "model_name": openai_base_model_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )
    lora_backend = make_backend(
        "openai",
        {
            "model_name": openai_lora_model_name,
            "openai_base_url": openai_base_url,
            "openai_api_key": openai_api_key,
        },
    )

INFO 20:46:45 Running: /proj/dmfexp/8cc/krishna/miniforge3/envs/granite-io/bin/vllm serve ibm-granite/granite-3.3-8b-instruct --port 56741 --gpu-memory-utilization 0.45 --max-model-len 32768 --guided_decoding_backend outlines --device auto --enforce-eager --enable-lora --max_lora_rank 64 --lora-modules local-granite-3.3-8b-lora-rag-context-relevancy=local-granite-3.3-8b-lora-rag-context-relevancy
INFO 06-13 20:46:49 __init__.py:207] Automatically detected platform cuda.
INFO 06-13 20:46:50 api_server.py:912] vLLM API server version 0.7.3
INFO 06-13 20:46:50 api_server.py:913] args: Namespace(subparser='serve', model_tag='ibm-granite/granite-3.3-8b-instruct', config='', host=None, port=56741, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='local-granite-3.3-8b-lora-rag-context-relevancy', path='local-granite-3.3-8b-lora-rag-context-relevancy', base_model_name=None)], 

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:05,  1.92s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.96s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.55s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00,  1.60s/it]



INFO 06-13 20:47:09 model_runner.py:1115] Loading model weights took 15.2531 GB
INFO 06-13 20:47:09 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 06-13 20:47:12 worker.py:267] Memory profiling takes 2.32 seconds
INFO 06-13 20:47:12 worker.py:267] the current vLLM instance can use total_gpu_memory (79.21GiB) x gpu_memory_utilization (0.45) = 35.64GiB
INFO 06-13 20:47:12 worker.py:267] model weights take 15.25GiB; non_torch_memory takes 0.16GiB; PyTorch activation peak memory takes 3.38GiB; the rest of the memory reserved for KV Cache is 16.85GiB.
INFO 06-13 20:47:12 executor_base.py:111] # cuda blocks: 6902, # CPU blocks: 1638
INFO 06-13 20:47:12 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 3.37x
INFO 06-13 20:47:13 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 4.16 seconds
INFO 06-13 20:47:14 serving_models.py:174] Loaded new LoRA adapter: name 'local-granite-3.3-8b-lora-rag-context-relevancy', path 'local-granite-3.3-8b-l

INFO:     Started server process [3029641]
INFO:     Waiting for application startup.
INFO:     Application startup complete.


INFO:     127.0.0.1:60046 - "GET /ping HTTP/1.1" 200 OK
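
The memory figures in the startup log above can be reproduced with a few lines of arithmetic. This is only a sanity-check sketch: every number is copied from the log except the block size of 16 tokens, which is an assumption (the log does not print it).

```python
# Figures copied from the vLLM startup log above (all in GiB).
total_gpu_memory = 79.21
gpu_memory_utilization = 0.45
model_weights = 15.25
non_torch_memory = 0.16
activation_peak = 3.38

# Memory budget that this vLLM instance is allowed to use.
budget = total_gpu_memory * gpu_memory_utilization

# Whatever is left after weights and overheads is reserved for the KV cache.
kv_cache = budget - model_weights - non_torch_memory - activation_peak

# ASSUMPTION: 16 tokens per KV-cache block; the log does not print this value.
tokens_per_block = 16
cuda_blocks = 6902
max_model_len = 32768
max_concurrency = cuda_blocks * tokens_per_block / max_model_len

print(f"budget: {budget:.2f} GiB")
print(f"KV cache: {kv_cache:.2f} GiB")
print(f"max concurrency: {max_concurrency:.2f}x")
```

Under that assumption the three printed values line up with the 35.64GiB, 16.85GiB and 3.37x reported by the server.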


In [5]:
# Create an example chat completion with a short conversation.
# Base conversation about preparing for a marketing manager job interview
base_messages = [
    {"role": "assistant", "content": "I'm here to help you prepare for your job interview!"},
    {
        "role": "user",
        "content": "I have a job interview next week for a marketing manager position.",
    },
    {
        "role": "assistant",
        "content": "Congratulations! Marketing manager is an exciting role. How are you feeling about it?"
    },
    {
        "role": "user",
        "content": "I'm nervous because I haven't interviewed in years, and this is a big career move for me.",
    },
    {
        "role": "assistant",
        "content": "It's natural to feel nervous, but preparation will help boost your confidence."
    },
    {
        "role": "user",
        "content": "What should I expect them to ask about my experience with social media campaigns as a marketing manager?",
    },
]
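
The three documents in the following cells are all paired with this same conversation, whose final user turn asks about social media campaign experience. Assuming relevance is judged mainly against that final turn (the notebook does not state this explicitly), a small helper for pulling it out might look like the sketch below. `last_user_message` is a hypothetical name, not a granite-io function.

```python
def last_user_message(messages: list[dict]) -> str:
    """Return the content of the most recent user turn, or raise if there is none."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message["content"]
    raise ValueError("conversation contains no user message")


# Shortened stand-in for `base_messages` so the snippet runs on its own.
example_messages = [
    {"role": "assistant", "content": "I'm here to help you prepare for your job interview!"},
    {"role": "user", "content": "I have a job interview next week for a marketing manager position."},
    {"role": "assistant", "content": "Congratulations!"},
    {"role": "user", "content": "What should I expect them to ask about social media campaigns?"},
]
print(last_user_message(example_messages))
```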

## Relevant Document Context Relevance Check

In [6]:
# Example 1: RELEVANT document - directly addresses interview questions about social media campaign experience
chat_input_relevant = Granite3Point3Inputs.model_validate(
    {
        "messages": base_messages,
        "documents": [{"text":
        "Marketing manager interviews often focus on campaign experience and measurable results. "
        "Expect questions about social media ROI, audience engagement metrics, and conversion rates. "
        "Prepare specific examples of campaigns you've managed, including budget, timeline, and outcomes. "
        "Interviewers may ask about your experience with different social media platforms and their unique audiences. "
        "Be ready to discuss how you measure campaign success and adjust strategies based on performance data. "
        "Knowledge of current social media trends and emerging platforms demonstrates industry awareness."}
        ],
        "generate_inputs": {"temperature": 0.0},
    }
)
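
The same dictionary shape (conversation, a single document, and generation parameters) is repeated for each example in this notebook. A minimal sketch of a helper that assembles it is shown here; `build_relevance_input` is a hypothetical name introduced for illustration and is not part of granite-io.

```python
def build_relevance_input(
    messages: list[dict], document_text: str, temperature: float = 0.0
) -> dict:
    """Assemble the payload shape used in this notebook: one conversation, one document."""
    return {
        "messages": messages,
        "documents": [{"text": document_text}],
        "generate_inputs": {"temperature": temperature},
    }


# Tiny stand-in conversation so the snippet runs without the rest of the notebook.
demo_messages = [{"role": "user", "content": "What should I expect in the interview?"}]
payload = build_relevance_input(demo_messages, "Interviews often focus on measurable results.")
print(payload["documents"])
```

In the notebook, the returned dictionary would be passed to `Granite3Point3Inputs.model_validate(...)` in place of the literal dictionary.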

In [7]:
# The intrinsic is served as a LoRA adapter, so route requests through `lora_backend`.
io_proc = ContextRelevancyIOProcessor(lora_backend)
# Pass our example input through the I/O processor and retrieve the result
chat_result = await io_proc.acreate_chat_completion(chat_input_relevant)
print(chat_result.results[0].next_message.model_dump_json(indent=2))

INFO 06-13 20:47:15 logger.py:39] Received request cmpl-5df174b3e62e43c9a9fe97710732f81c-0: prompt: '<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday\'s Date: June 13, 2025.\nYou are Granite, developed by IBM. Write the response to the user\'s input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>\n<|start_of_role|>document {"document_id": "1"}<|end_of_role|>\nMarketing manager interviews often focus on campaign experience and measurable results. Expect questions about social media ROI, audience engagement metrics, and conversion rates. Prepare specific examples of campaigns you\'ve managed, including budget, timeline, and outcomes. Interviewers may ask about your experience with different social media platforms and their unique audiences. Be ready to discuss how 

## Partially Relevant Document Context Relevance Check

In [8]:
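# Example 2: PARTIALLY RELEVANT document - general interview advice that does not address social media campaigns
chat_input_partially_relevant = Granite3Point3Inputs.model_validate(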
    {
        "messages": base_messages,
        "documents": [{"text":
        "Job interviews typically follow a structured format with behavioral and technical questions. "
        "Preparing specific examples using the STAR method helps answer behavioral questions effectively. "
        "Research the company's mission, values, and recent news before your interview. "
        "Dress appropriately for the company culture and arrive 10-15 minutes early. "
        "Prepare thoughtful questions to ask the interviewer about the role and company. "
        "Following up with a thank-you email within 24 hours shows professionalism and interest."}
        ],
        "generate_inputs": {"temperature": 0.0},
    }
)

In [9]:
io_proc = ContextRelevancyIOProcessor(lora_backend)
# Pass our example input through the I/O processor and retrieve the result
chat_result = await io_proc.acreate_chat_completion(chat_input_partially_relevant)
print(chat_result.results[0].next_message.model_dump_json(indent=2))

INFO 06-13 20:47:21 logger.py:39] Received request cmpl-4a90ef40311f4f3abb8e40039cee463b-0: prompt: '<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday\'s Date: June 13, 2025.\nYou are Granite, developed by IBM. Write the response to the user\'s input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>\n<|start_of_role|>document {"document_id": "1"}<|end_of_role|>\nJob interviews typically follow a structured format with behavioral and technical questions. Preparing specific examples using the STAR method helps answer behavioral questions effectively. Research the company\'s mission, values, and recent news before your interview. Dress appropriately for the company culture and arrive 10-15 minutes early. Prepare thoughtful questions to ask the interviewer about the rol

## Irrelevant Document Context Relevance Check

In [10]:
# Example 3: IRRELEVANT document - kitchen knife skills, unrelated to the interview question
chat_input_irrelevant = Granite3Point3Inputs.model_validate(
    {
        "messages": base_messages,
        "documents": [{"text":
        "Proper knife skills are fundamental to efficient cooking and food safety in the kitchen. "
        "Different cuts like julienne, brunoise, and chiffonade serve specific culinary purposes. "
        "Sharp knives are actually safer than dull ones because they require less pressure to cut. "
        "Learning to properly hold and control a chef's knife takes practice and patience. "
        "Professional chefs can prep vegetables much faster due to their refined knife techniques. "
        "Regular knife maintenance including sharpening and proper storage extends blade life."
        }
        ],
        "generate_inputs": {"temperature": 0.0},
    }
)

In [11]:
io_proc = ContextRelevancyIOProcessor(lora_backend)
# Pass our example input through the I/O processor and retrieve the result
chat_result = await io_proc.acreate_chat_completion(chat_input_irrelevant)
print(chat_result.results[0].next_message.model_dump_json(indent=2))

INFO 06-13 20:47:22 logger.py:39] Received request cmpl-e80c29358df848bd8c1ea296faa785db-0: prompt: '<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday\'s Date: June 13, 2025.\nYou are Granite, developed by IBM. Write the response to the user\'s input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>\n<|start_of_role|>document {"document_id": "1"}<|end_of_role|>\nProper knife skills are fundamental to efficient cooking and food safety in the kitchen. Different cuts like julienne, brunoise, and chiffonade serve specific culinary purposes. Sharp knives are actually safer than dull ones because they require less pressure to cut. Learning to properly hold and control a chef\'s knife takes practice and patience. Professional chefs can prep vegetables much faster due to th
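
The three checks above issue the same call one after another. A hedged sketch of running them concurrently with `asyncio.gather` follows. It relies only on the `acreate_chat_completion` method and the `.results[0].next_message` access used in the cells above. The `.content` attribute on the message, the `classify_all` helper, and the `EchoProcessor` stand-in are assumptions made for illustration.

```python
import asyncio
from types import SimpleNamespace


async def classify_all(io_proc, named_inputs: dict) -> dict:
    """Run every input through `io_proc` concurrently and collect the reply text.

    `io_proc` is any object with an async `acreate_chat_completion(inputs)` method
    whose result exposes `.results[0].next_message`, as in the cells above.
    ASSUMPTION: the message object has a `.content` attribute holding the text.
    """
    names = list(named_inputs)
    completions = await asyncio.gather(
        *(io_proc.acreate_chat_completion(named_inputs[name]) for name in names)
    )
    return {
        name: completion.results[0].next_message.content
        for name, completion in zip(names, completions)
    }


class EchoProcessor:
    """Stand-in processor for an offline dry run; it only echoes its input back."""

    async def acreate_chat_completion(self, inputs):
        message = SimpleNamespace(content=f"echo:{inputs}")
        return SimpleNamespace(results=[SimpleNamespace(next_message=message)])


demo = asyncio.run(classify_all(EchoProcessor(), {"a": "first", "b": "second"}))
print(demo)
```

Inside a notebook, where an event loop is already running, the call would be `await classify_all(io_proc, {"relevant": chat_input_relevant, "partially_relevant": chat_input_partially_relevant, "irrelevant": chat_input_irrelevant})` rather than `asyncio.run(...)`.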