<a href="https://colab.research.google.com/github/benitomartin/rag-qwen-vllm-milvus/blob/main/rag_qwen_vllm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Install and Import Dependencies

This notebook was run on an NVIDIA A100 GPU.

In [None]:
!pip -qq install vllm triton langchain-community langchain-milvus langchain-openai

In [2]:
import os
import torch
import gc
from google.colab import userdata

from langchain_milvus import Milvus
from langchain_community.llms import VLLM
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter



# Step 2: Set Environment

In [3]:
# Check the GPU
!nvidia-smi

Tue Dec 17 12:36:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              50W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [4]:
# Run garbage collector
gc.collect()

# Clear the GPU memory cache.
torch.cuda.empty_cache()

In [5]:
# Set environment variable

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Step 3: Load the Data

In [6]:
# Load documents

loader = WebBaseLoader(
    web_paths=(
        "https://qwenlm.github.io/blog/qwq-32b-preview/",
        "https://qwenlm.github.io/blog/qwen2.5/",
        "https://qwenlm.github.io/blog/qwen2.5-coder-family/",

    )
)

documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

docs = text_splitter.split_documents(documents)
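
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also prefers to cut on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The helper name `sliding_chunks` is hypothetical and not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 5000-character text, split with the same settings as above
sample = "".join(str(i % 10) for i in range(5000))
chunks = sliding_chunks(sample, chunk_size=2000, chunk_overlap=200)
print([len(c) for c in chunks])
```

The last 200 characters of each chunk reappear at the start of the next one, so a sentence cut at a boundary is still seen whole by at least one chunk.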

In [7]:
docs[1]

Document(metadata={'source': 'https://qwenlm.github.io/blog/qwq-32b-preview/', 'title': 'QwQ: Reflect Deeply on the Boundaries of the Unknown | Qwen', 'description': 'GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD\nNote: This is the pronunciation of QwQ: /kwju:/ , similar to the word “quill”.\nWhat does it mean to think, to question, to understand? These are the deep waters that QwQ (Qwen with Questions) wades into. Like an eternal student of wisdom, it approaches every problem - be it mathematics, code, or knowledge of our world - with genuine wonder and doubt. QwQ embodies that ancient philosophical spirit: it knows that it knows nothing, and that’s precisely what drives its curiosity.', 'language': 'en'}, page_content='ResourcesBlogPublicationAboutQwQ: Reflect Deeply on the Boundaries of the UnknownNovember 28, 2024\xa0·\xa022 min\xa0·\xa04496 words\xa0·\xa0Qwen Team\xa0|\xa0Translations:简体中文GITHUB\nHUGGING FACE\nMODELSCOPE\nDEMO')

# Step 4: Create Milvus Retriever

In [8]:
# Set Milvus retriever

embeddings = OpenAIEmbeddings()

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embeddings,
    connection_args={
        "uri": "./milvus_demo.db",
    },
    text_field="page_content",
    metadata_field="metadata",
    drop_old=True,
)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 798a61574fb34c9da84fed7a9014a752
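
Conceptually, a `similarity` search with `k=4` embeds the query, scores every stored chunk against it, and keeps the four best matches. The sketch below illustrates that idea with cosine similarity on toy vectors. It rests on assumptions: the actual metric and index are determined by the Milvus configuration, the vectors are made up, and `top_k_by_cosine` is a hypothetical helper, not a Milvus or LangChain API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings" (illustrative values only)
toy_index = {
    "qwq": [0.9, 0.1, 0.0],
    "qwen2.5": [0.2, 0.9, 0.1],
    "coder": [0.1, 0.3, 0.9],
}
print(top_k_by_cosine([1.0, 0.2, 0.0], toy_index, k=2))
```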


# Step 5: Initialize LLM Engine

In [9]:
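# Initialize vLLM
# Engine options that are not fields of the LangChain wrapper go through vllm_kwargs.
# Passed as top-level arguments they were not applied: the log below reports
# enforce_eager=False and gpu_memory_utilization=0.90.

llm = VLLM(
    model="Qwen/Qwen2.5-1.5B",
    trust_remote_code=True,
    max_new_tokens=500,
    dtype="bfloat16",
    vllm_kwargs={
        "enforce_eager": True,
        "gpu_memory_utilization": 0.8,
    },
)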

INFO 12-17 12:37:23 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 12-17 12:37:23 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 12-17 12:37:23 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='Qwen/Qwen2.5-1.5B', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_

INFO 12-17 12:37:24 selector.py:135] Using Flash Attention backend.
INFO 12-17 12:37:25 model_runner.py:1072] Starting to load model Qwen/Qwen2.5-1.5B...
INFO 12-17 12:37:26 weight_utils.py:243] Using model weights format ['*.safetensors']

INFO 12-17 12:38:40 weight_utils.py:288] No model.safetensors.index.json found in remote.

INFO 12-17 12:38:41 model_runner.py:1077] Loading model weights took 2.9104 GB
INFO 12-17 12:38:42 worker.py:232] Memory profiling results: total_gpu_memory=39.56GiB initial_memory_usage=3.53GiB peak_torch_memory=4.30GiB memory_usage_post_profile=3.57GiB non_torch_memory=0.65GiB kv_cache_size=30.66GiB gpu_memory_utilization=0.90
INFO 12-17 12:38:43 gpu_executor.py:113] # GPU blocks: 71770, # CPU blocks: 9362
INFO 12-17 12:38:43 gpu_executor.py:117] Maximum concurrency for 131072 tokens per request: 8.76x
INFO 12-17 12:38:46 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-17 12:38:46 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO
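
The block count and concurrency figures in the log above can be cross-checked with a little arithmetic. The sketch below assumes vLLM's default block size of 16 tokens and the Qwen2.5-1.5B attention layout (28 layers, 2 key/value heads, head dimension 128). Neither appears in the log, so treat them as assumptions to verify against the model's `config.json`.

```python
# Values reported in the log above
num_gpu_blocks = 71770
max_seq_len = 131072

# Assumptions (not stated in the log): default vLLM block size and the
# attention layout of Qwen2.5-1.5B.
block_size = 16        # tokens per KV-cache block
num_layers = 28
num_kv_heads = 2
head_dim = 128
bytes_per_value = 2    # bfloat16

cache_tokens = num_gpu_blocks * block_size
max_concurrency = cache_tokens / max_seq_len

# Each token stores one key and one value vector per layer and KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_cache_gib = cache_tokens * bytes_per_token / 2**30

print(f"Tokens that fit in the KV cache: {cache_tokens}")
print(f"Maximum concurrency: {max_concurrency:.2f}x")
print(f"KV cache size: {kv_cache_gib:.2f} GiB")
```

Under these assumptions the results come out at 8.76x and 30.66 GiB, which agrees with the `Maximum concurrency` and `kv_cache_size` values in the log.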

# Step 6: Implement RAG Pipeline


In [10]:
# Set QA chain

template = """
              You are an assistant for question-answering tasks.
              Use the following pieces of retrieved context to answer the question.
              If you don't know the answer, just say that you don't know.
              Please provide the answer in English.

              Question: {question}
              Context: {context}

              Answer:
          """

prompt = PromptTemplate.from_template(template)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
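
`RetrievalQA.from_chain_type` defaults to the `stuff` chain type: the retrieved chunks are concatenated and substituted into `{context}`, while the user query fills `{question}`. A minimal sketch of that step in plain Python follows. The `"\n\n"` separator and the helper name `build_prompt` are assumptions for illustration, and the chunks are made-up placeholders.

```python
TEMPLATE = (
    "You are an assistant for question-answering tasks.\n"
    "Use the following pieces of retrieved context to answer the question.\n"
    "If you don't know the answer, just say that you don't know.\n\n"
    "Question: {question}\n"
    "Context: {context}\n\n"
    "Answer:"
)


def build_prompt(question, chunks, separator="\n\n"):
    """Join the retrieved chunks and fill both template variables."""
    context = separator.join(chunks)
    return TEMPLATE.format(question=question, context=context)


prompt_text = build_prompt(
    "What is QwQ?",
    ["placeholder chunk one", "placeholder chunk two"],
)
print(prompt_text)
```

With `k=4` and chunks of up to 2000 characters, the context portion of each prompt can reach roughly 8000 characters before the question and instructions are added.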

In [11]:
# Print statements for response details

def display_response_details(response):
    print("Query:", response["query"])
    print("Result:", response["result"])

    # Extract metadata from the first source document;
    # .get avoids a KeyError when a page provides no description or title
    metadata = response["source_documents"][0].metadata
    print("Source:", metadata.get("source", "n/a"))
    print("Description:", metadata.get("description", "n/a"))
    print("Title:", metadata.get("title", "n/a"))
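
`display_response_details` reports only the first source document, although the retriever returns up to four. A hedged variant that lists every distinct source URL is sketched below. It assumes each item in `source_documents` exposes a `metadata` dict with a `source` key, as the loader output above does. `SimpleNamespace` stands in for LangChain `Document` objects so the example runs on its own, and `unique_sources` is a hypothetical helper.

```python
from types import SimpleNamespace


def unique_sources(response):
    """Return the distinct source URLs in retrieval order."""
    seen = []
    for doc in response["source_documents"]:
        source = doc.metadata.get("source", "unknown")
        if source not in seen:
            seen.append(source)
    return seen


# Stand-in for a RetrievalQA response (made-up values)
fake_response = {
    "query": "example question",
    "result": "example answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "https://example.com/a"}),
        SimpleNamespace(metadata={"source": "https://example.com/b"}),
        SimpleNamespace(metadata={"source": "https://example.com/a"}),
    ],
}
print(unique_sources(fake_response))
```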

In [12]:
# Set the question
question = "What can you tell my about QwQ model?"

# Run the query; the chain retrieves the context itself, so only the query is passed
response = qa_chain.invoke({"query": question})

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.26it/s, est. speed input: 1325.93 toks/s, output: 149.70 toks/s]


In [13]:
# Show the answer and the metadata of the first source document
display_response_details(response)

Query: What can you tell my about QwQ model?
Result:  QwQ model shows exceptional abilities in analytical and problem-solving capabilities across technical domains, with excellent performance in mathematical and coding tasks. It excels in understanding and reasoning across diverse topics, delivering highly competitive results on MMLU, surpassing some larger models. Its post-training methodologies have improved performance, expanded capability, and enhanced robustness for structured data, long text generation, and system prompts. With advancements in fine-tuning and prompt guidance models such as Qwen2.5 + CodeQwen + SimQwen, the language model community is set to witness an even brighter future.
Source: https://qwenlm.github.io/blog/qwq-32b-preview/
Description: GITHUB HUGGING FACE MODELSCOPE DEMO DISCORD
Note: This is the pronunciation of QwQ: /kwju:/ , similar to the word “quill”.
What does it mean to think, to question, to understand? These are the deep waters that QwQ (Qwen with Qu

In [14]:
# Set the question
question = "What can you tell my about Qwen2.5 open-source models: Qwen2.5, Qwen2.5-Coder, Qwen2.5-Math?"

# Run the query; the chain retrieves the context itself, so only the query is passed
response = qa_chain.invoke({"query": question})

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.94s/it, est. speed input: 611.53 toks/s, output: 165.20 toks/s]


In [15]:
# Show the answer and the metadata of the first source document
display_response_details(response)

Query: What can you tell my about Qwen2.5 open-source models: Qwen2.5, Qwen2.5-Coder, Qwen2.5-Math?
Result:  The Qwen2.5 models, namely Qwen2.5, Qwen2.5-Coder, and Qwen2.5-Math, have acquired significant improvements in various aspects compared to their predecessors. Here are some key takeaways:

1. **Academic Performance:**
   - Qwen2.5 achieved the best performance among open-source models on multiple popular code generation benchmarks (EvalPlus, LiveCodeBench, BigCodeBench).
   - Qwen2.5 also scored highly on an Instructive evaluation task, indicating strong alignment with human-like instruction-following abilities.
   - Qwen2.5 provided competitive performance with GPT-4o, demonstrating superiority over its predecessors.

2. **Practical Applications:**
   - Qwen2.5 models excel in handling different programming languages, showcasing their adaptability and versatility.
   - Qwen2.5 models have undergone significant enhancements, extending their skills to handle more complex reasonin