<a href="https://colab.research.google.com/github/dragstoll/interesting_code/blob/main/RAG_with_Qwen3_Embeddings_and_Qwen3_Reranker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*More details in this article: [RAG with Qwen3 Embedding and Qwen3 Reranker](https://kaitchup.substack.com/p/rag-with-qwen3-embeddings-and-qwen3)*

This notebook shows how to run Qwen3 Embedding and Qwen3 Reranker in sequence to retrieve the most relevant documents from a list for a given user query, as is typically done in RAG applications.

The embedding model runs with sentence-transformers, while the reranker model is loaded with vLLM.
I used the smallest models (0.6B) so that this notebook can run on a small GPU.

The code is largely based on the examples provided in [the model cards by the Qwen team](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B).
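
To make the two-stage idea concrete before loading any model, here is a minimal sketch of "recall, then rerank" in plain Python. The function `retrieve_then_rerank` and the two toy scorers (`word_overlap`, `exact_phrase`) are hypothetical stand-ins that I made up for illustration; they are not part of sentence-transformers, vLLM, or the Qwen model cards.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    documents: List[str],
    recall_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Tuple[float, str]]:
    """Keep the top_k documents by recall_score, then order them by rerank_score."""
    # Stage 1 (recall): cheap scoring over every document.
    candidates = sorted(
        documents, key=lambda d: recall_score(query, d), reverse=True
    )[:top_k]
    # Stage 2 (precision): more expensive scoring over the short list only.
    reranked = [(rerank_score(query, d), d) for d in candidates]
    return sorted(reranked, key=lambda pair: pair[0], reverse=True)


# Toy stand-ins for the embedding model and the reranker (assumptions).
def word_overlap(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


def exact_phrase(query: str, doc: str) -> float:
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "the great red spot is on jupiter",
    "the great barrier reef",
    "python is a language",
]
results = retrieve_then_rerank("great red spot", docs, word_overlap, exact_phrase, top_k=2)
print(results)
```

In the rest of the notebook, the embedding model plays the role of `recall_score` and the reranker plays the role of `rerank_score`.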


# Installation

In [None]:
!pip install --upgrade transformers sentence-transformers vllm flash-attn

Collecting vllm
  Downloading vllm-0.9.1-cp38-abi3-manylinux1_x86_64.whl.metadata (15 kB)
Collecting flash-attn
  Downloading flash_attn-2.8.0.post2.tar.gz (7.9 MB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl.metadata (13 kB)
Collecting lm-format-enforcer<0.11,>=0.10.11 (from vllm)
  Downloading lm_format_enforcer-0.10.11-py3-none-any.whl.metadata (17 kB)
Collecting llguidance<0.8.0,>=0.7.11 (from vllm)
  Downloading llguidance-0.7.29-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.9 kB)
Collecting outlines==0.1.11 (from vllm)
  Downloading outlines-0.1.11-py3-none-any

# Qwen3 Embedding (Recall)

In [None]:
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto", "torch_dtype": "float16"},
    tokenizer_kwargs={"padding_side": "left"},
)


documents = [
    "The Moon has no atmosphere, which means it cannot support life as we know it. Temperatures swing wildly from scorching hot during the day to freezing cold at night. The surface is covered in a layer of fine dust called regolith.",

    "Python is a high-level programming language known for its readability and wide range of applications. It supports multiple paradigms, including procedural, object-oriented, and functional programming. Python is especially popular in data science and AI.",

    "Mount Everest is the tallest mountain on Earth, standing at 8,848 meters above sea level. Located in the Himalayas on the border between Nepal and China, it attracts climbers from around the globe each year.",

    "Photosynthesis is the process by which green plants use sunlight to convert carbon dioxide and water into glucose and oxygen. This process occurs primarily in the chloroplasts of plant cells. It is essential for life on Earth.",

    "The Great Barrier Reef is the world's largest coral reef system. It is located off the coast of Queensland, Australia and is composed of over 2,900 individual reefs. It supports a vast diversity of marine life.",

    "Saturn is known for its prominent ring system, which is made up of ice particles, rock debris, and dust. It is the sixth planet from the Sun and the second-largest in our solar system. Saturn has at least 83 moons.",

    "Shakespeare wrote 37 plays and 154 sonnets, contributing immensely to English literature. Some of his most famous works include Hamlet, Macbeth, and Romeo and Juliet. His influence is still seen in modern storytelling.",

    "Photosynthesis is critical in maintaining atmospheric oxygen levels. Without it, life as we know it would not exist. The glucose produced is also a primary energy source for many organisms.",

    "The boiling point of water at sea level is 100 degrees Celsius. However, this value decreases at higher altitudes due to lower atmospheric pressure. This is why cooking times may vary in mountainous regions.",

    "The human brain contains approximately 86 billion neurons. These neurons communicate via synapses, creating the complex networks that underlie thought, memory, and emotion. Brain plasticity allows it to adapt over time.",

    "The Nile River is the longest river in Africa and was essential to the development of ancient Egyptian civilization. Its predictable flooding supported agriculture in the otherwise arid region. Today, it remains a crucial water source.",

    "The Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones. It starts with 0 and 1. This sequence appears frequently in nature, such as in flower petals and pinecones.",

    "The speed of light in a vacuum is approximately 299,792 kilometers per second. It represents the ultimate speed limit in the universe. According to Einstein’s theory of relativity, nothing can travel faster than light.",

    "Bees are essential pollinators in many ecosystems. Without them, numerous plants would fail to reproduce. In recent years, bee populations have declined due to pesticides, habitat loss, and disease.",

    "Machine learning is a subset of artificial intelligence focused on building systems that learn from data. Common types include supervised, unsupervised, and reinforcement learning. ML is widely used in recommendation engines and fraud detection.",

    "Jupiter is the largest planet in the solar system and has a strong magnetic field. It has at least 95 moons, including Ganymede, the largest moon in the solar system. Its Great Red Spot is a massive storm system.",

    "World War II began in 1939 and ended in 1945. It involved most of the world's nations and led to significant geopolitical changes. The conflict ended with the defeat of the Axis powers and the emergence of the U.S. and Soviet Union as superpowers.",

    "The Amazon Rainforest produces about 20% of the world's oxygen. It is home to millions of species, many of which are yet to be discovered. Deforestation poses a serious threat to this critical ecosystem.",

    "Blockchain is a decentralized digital ledger technology. Each block contains a record of transactions and is linked to the previous one, forming a chain. It underpins cryptocurrencies like Bitcoin and Ethereum.",

    "The Andromeda Galaxy is the closest spiral galaxy to the Milky Way and is expected to collide with it in about 4.5 billion years. It contains roughly one trillion stars. This future merger will form a single, larger galaxy."
]

query = "Which planet has a massive storm called the Great Red Spot?"

# Encode the query and the documents. Queries benefit from a prompt:
# here we use the prompt named "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument.
query_embeddings = model.encode([query], prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)


# Rank the documents by similarity score (highest first)
scores = similarity.squeeze(0)  # shape: (N,)
ranked_idx = torch.argsort(scores, descending=True)

print("Ranked results:")
for i in ranked_idx:
    print(f"{scores[i]:.4f}  -  {documents[i]}")

Ranked results:
0.3159  -  Mount Everest is the tallest mountain on Earth, standing at 8,848 meters above sea level. Located in the Himalayas on the border between Nepal and China, it attracts climbers from around the globe each year.
0.1934  -  The Great Barrier Reef is the world's largest coral reef system. It is located off the coast of Queensland, Australia and is composed of over 2,900 individual reefs. It supports a vast diversity of marine life.
0.1793  -  The Andromeda Galaxy is the closest spiral galaxy to the Milky Way and is expected to collide with it in about 4.5 billion years. It contains roughly one trillion stars. This future merger will form a single, larger galaxy.
0.1775  -  The boiling point of water at sea level is 100 degrees Celsius. However, this value decreases at higher altitudes due to lower atmospheric pressure. This is why cooking times may vary in mountainous regions.
0.1697  -  Jupiter is the largest planet in the solar system and has a strong magnetic fi
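
For readers who want to see what the similarity and ranking steps amount to, here is a small dependency-free sketch of cosine similarity followed by a top-k cut, which is how a RAG pipeline would usually pick the candidates handed to a reranker. The helper names (`cosine_similarity`, `top_k_indices`) and the 2-dimensional toy vectors are my own assumptions for illustration; real Qwen3 embeddings have many more dimensions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k documents most similar to the query, best first."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy vectors (hypothetical): doc 2 points in the same direction as the query.
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
top2 = top_k_indices(toy_query, toy_docs, k=2)
print(top2)
```

Note that cosine similarity ignores vector length: the third toy document is twice as long as the query but still gets the maximum score because it points in the same direction.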

# Qwen3 Reranker (Precision)

In [None]:
# Requires vllm>=0.8.5
import math

import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel
from vllm.inputs.data import TokensPrompt



def format_instruction(instruction, query, doc):
    text = [
        {"role": "system", "content": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."},
        {"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
    ]
    return text

def process_inputs(pairs, instruction, max_length, suffix_tokens):
    messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
    messages =  tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
    )
    messages = [ele[:max_length] + suffix_tokens for ele in messages]
    messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
    return messages

def compute_logits(model, messages, sampling_params, true_token, false_token):
    outputs = model.generate(messages, sampling_params, use_tqdm=False)
    scores = []
    for output in outputs:
        # Log-probabilities returned for the single generated token
        final_logits = output.outputs[0].logprobs[-1]
        # Fall back to a very low log-probability when a token is not returned
        true_logit = final_logits[true_token].logprob if true_token in final_logits else -10
        false_logit = final_logits[false_token].logprob if false_token in final_logits else -10
        # Normalize over the two tokens to get the probability of "yes"
        true_score = math.exp(true_logit)
        false_score = math.exp(false_logit)
        scores.append(true_score / (true_score + false_score))
    return scores

number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Reranker-0.6B')
model = LLM(model='Qwen/Qwen3-Reranker-0.6B', tensor_parallel_size=number_of_gpu, max_model_len=10000, enable_prefix_caching=True, gpu_memory_utilization=0.8)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
max_length = 8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(temperature=0,
    max_tokens=1,
    logprobs=20,
    allowed_token_ids=[true_token, false_token],
)


task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["Which planet has the Great Red Spot?"]
documents = [
  "Mount Everest is the tallest mountain on Earth, standing at 8,848 meters above sea level. Located in the Himalayas on the border between Nepal and China, it attracts climbers from around the globe each year.",
  "The boiling point of water at sea level is 100 degrees Celsius. However, this value decreases at higher altitudes due to lower atmospheric pressure. This is why cooking times may vary in mountainous regions.",
  "Jupiter is the largest planet in the solar system and has a strong magnetic field. It has at least 95 moons, including Ganymede, the largest moon in the solar system. Its Great Red Spot is a massive storm system.",
  "Blockchain is a decentralized digital ledger technology. Each block contains a record of transactions and is linked to the previous one, forming a chain. It underpins cryptocurrencies like Bitcoin and Ethereum.",
  "The Great Barrier Reef is the world's largest coral reef system. It is located off the coast of Queensland, Australia and is composed of over 2,900 individual reefs. It supports a vast diversity of marine life."
]

# Build one (query, document) pair for every query and document
pairs = [(query, doc) for query in queries for doc in documents]
inputs = process_inputs(pairs, task, max_length - len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print('scores', scores)

destroy_model_parallel()


INFO 06-17 06:41:19 [__init__.py:244] Automatically detected platform cuda.
INFO 06-17 06:41:40 [config.py:823] This model supports multiple tasks: {'score', 'classify', 'embed', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 06-17 06:41:40 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-17 06:41:43 [core.py:455] Waiting for init message from front-end.
INFO 06-17 06:41:43 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='Qwen/Qwen3-Reranker-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-Reranker-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=Decoding

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 06-17 06:41:46 [default_loader.py:272] Loading weights took 0.44 seconds
INFO 06-17 06:41:47 [gpu_model_runner.py:1624] Model loading took 1.1196 GiB and 1.712442 seconds
INFO 06-17 06:41:58 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/f8761a3afc/rank_0_0 for vLLM's torch.compile
INFO 06-17 06:41:58 [backends.py:472] Dynamo bytecode transform time: 10.18 s
INFO 06-17 06:42:05 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 6.739 s
INFO 06-17 06:42:06 [monitor.py:34] torch.compile takes 10.18 s in total
INFO 06-17 06:42:08 [gpu_worker.py:227] Available KV cache memory: 15.18 GiB
INFO 06-17 06:42:08 [kv_cache_utils.py:715] GPU KV cache size: 142,128 tokens
INFO 06-17 06:42:08 [kv_cache_utils.py:719] Maximum concurrency for 10,000 tokens per request: 14.21x
INFO 06-17 06:42:41 [gpu_model_runner.py:2048] Graph capturing finished in 33 secs, took 0.45 GiB
INFO 06-17 06:42:41 [core.py:171] init engine (profile, 
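
The score computed in `compute_logits` is a two-way normalization over the "yes" and "no" tokens. Here is a small worked sketch of that arithmetic on its own, without vLLM. The function name `yes_probability` and the input log-probabilities are made up for illustration; only the formula and the `-10` fallback come from the cell above.

```python
import math


def yes_probability(yes_logprob: float, no_logprob: float) -> float:
    """P(yes) after renormalizing over the two candidate tokens only."""
    yes = math.exp(yes_logprob)
    no = math.exp(no_logprob)
    return yes / (yes + no)


# Equal log-probabilities give a score of exactly 0.5.
p_equal = yes_probability(-0.7, -0.7)

# If "no" is missing from the returned logprobs, the cell above substitutes -10,
# which pushes the score very close to 1.
p_missing_no = yes_probability(-0.01, -10.0)

print(round(p_equal, 4), round(p_missing_no, 4))
```

This is the same as applying a sigmoid to the difference `yes_logprob - no_logprob`, so only the gap between the two log-probabilities matters, not their absolute values.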

In [None]:
import numpy as np

# np.argsort sorts in ascending order, so the most relevant document is printed last
ranked_idx = np.argsort(scores)

print("Ranked results:")
for i in ranked_idx:
    print(f"{scores[i]:.4f}  -  {documents[i]}")

Ranked results:
0.0102  -  The Great Barrier Reef is the world's largest coral reef system. It is located off the coast of Queensland, Australia and is composed of over 2,900 individual reefs. It supports a vast diversity of marine life.
0.0204  -  Mount Everest is the tallest mountain on Earth, standing at 8,848 meters above sea level. Located in the Himalayas on the border between Nepal and China, it attracts climbers from around the globe each year.
0.0336  -  The boiling point of water at sea level is 100 degrees Celsius. However, this value decreases at higher altitudes due to lower atmospheric pressure. This is why cooking times may vary in mountainous regions.
0.0573  -  Blockchain is a decentralized digital ledger technology. Each block contains a record of transactions and is linked to the previous one, forming a chain. It underpins cryptocurrencies like Bitcoin and Ethereum.
0.9978  -  Jupiter is the largest planet in the solar system and has a strong magnetic field. It has a
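
Since `np.argsort` returns indices in ascending order, the listing above ends with the best match. If you prefer the best match first, one option is to sort the (score, document) pairs in descending order. The sketch below uses the scores printed above, in the order of the `documents` list; the short labels are my own shorthand for the full document texts.

```python
# Shorthand labels (assumption) for the five documents, in their original order.
labels = ["Mount Everest", "Boiling point", "Jupiter", "Blockchain", "Great Barrier Reef"]
# Reranker scores as printed in the output above, matched to the same order.
scores = [0.0204, 0.0336, 0.9978, 0.0573, 0.0102]

# Sort pairs by score, highest first.
ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

for score, label in ranked:
    print(f"{score:.4f}  -  {label}")

best_label = ranked[0][1]
```

With NumPy, `np.argsort(scores)[::-1]` gives the same descending order of indices.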