# Retrieval-Augmented Generation (RAG)

In this section, we implement a complete RAG pipeline for answering questions based on a given context. Using the LangChain library, we'll walk through the entire process, from retrieving relevant context to generating accurate answers.

So far, we have covered the following steps:
1. **Indexing**: Organize the raw documents into a structured format suitable for processing, such as splitting them into chunks or passages for more efficient retrieval.

2. **Embedding**: Convert each text chunk into a dense vector representation using a pre-trained embedding model. These embeddings capture the semantic meaning of the content.

3. **Vector Store**: Store the embeddings in a vector database (Qdrant in our case), allowing fast and scalable similarity search across the document collection.
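
As a reminder of what the chunking part of indexing involves, here is a minimal sketch that splits text into overlapping word windows. The function name `chunk_words` and the `size` and `overlap` parameters are illustrative assumptions, not the splitter used earlier in this series.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: twelve placeholder words, w0 .. w11
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks, at the cost of storing some words twice.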

This notebook continues with **Retrieval and Generation**: given a user query, we retrieve the most relevant document chunks from the vector store and feed them into a language model (EVE) to generate a context-aware, accurate response.
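
Conceptually, retrieval ranks the stored chunk vectors by their similarity to the query vector. The sketch below illustrates the idea with cosine similarity over a tiny in-memory list. It assumes cosine as the metric (the metric Qdrant actually uses depends on how the collection was created), and `toy_store`, `cosine_similarity`, and `top_k` are hypothetical names with made-up three-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=1):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical store: (chunk text, embedding) pairs with toy vectors
toy_store = [
    ("chunk about satellites", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about oceans", [0.1, 0.9, 0.1]),
]

hits = top_k([1.0, 0.0, 0.0], toy_store, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

A vector database performs the same ranking, but uses an index so that it does not have to compare the query against every stored vector.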

In [None]:
from sentence_transformers import SentenceTransformer

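# Initialize the embedding model; it must be the same model that was used
# to embed the chunks at indexing time, otherwise the vectors are not comparable
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")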

In [None]:
query = "What is TROPOMI?"
# encode() takes a list of texts; keep the first vector and convert it to a plain list
query_vector = embedder.encode([query])[0].tolist()

In [None]:
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient

# Load variables from a local .env file into the environment
load_dotenv()

# Get the API key and cluster URL from the Qdrant UI
QDRANT_API_KEY = os.environ["QDRANT_API_KEY"]
QDRANT_URL = os.environ["QDRANT_URL"]

# Collection created in the previous (ingestion) step
collection_name = "ingestion_demo"
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

In [None]:
results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=1,  # number of most similar chunks to return
)

In [None]:
for result in results.points:
    print(f"retrieval score: {result.score}")
    print(f"retrieved chunk: {result.payload['content']}")
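
Before passing retrieved chunks to the model, it can help to drop weak matches so that irrelevant text does not end up in the prompt. The sketch below assumes a similarity score where higher is better (as with cosine) and uses an arbitrary cutoff of 0.3. `FakePoint` is a hypothetical stand-in that mimics only the `score` and `payload` attributes read in the loop above, so the example runs without a Qdrant connection.

```python
from dataclasses import dataclass, field


@dataclass
class FakePoint:
    """Hypothetical stand-in exposing the two attributes used above: score and payload."""
    score: float
    payload: dict = field(default_factory=dict)


def select_chunks(points, min_score=0.3, key="content"):
    """Keep the text of points whose score reaches min_score (assumes higher = more similar)."""
    return [
        p.payload[key]
        for p in points
        if p.score >= min_score and key in p.payload
    ]


# Made-up scores and texts, for illustration only
points = [
    FakePoint(0.71, {"content": "relevant chunk"}),
    FakePoint(0.12, {"content": "weak match"}),
]
kept = select_chunks(points)
print(kept)
```

The right cutoff depends on the embedding model and the distance metric, so it is best chosen by inspecting scores on a few known queries.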

## Generation

Once the relevant context is retrieved, it is passed to an LLM to generate a coherent and informed response based on both the query and the retrieved context.

This approach helps keep the generated answers grounded in the source documents, which improves accuracy and reduces hallucinations.
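
One possible way to combine several retrieved chunks with the question is to number the chunks and separate them with blank lines, so the model can tell where one passage ends and the next begins. This is a sketch under that assumption; `build_user_prompt` is a hypothetical helper and the chunk texts are placeholders.

```python
def build_user_prompt(chunks, question):
    """Format retrieved chunks as a numbered context block followed by the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"Context:\n{context}\n\nQuestion: {question}"


# Placeholder chunks standing in for retrieved payload text
prompt = build_user_prompt(["First chunk. ", "Second chunk."], "What is TROPOMI?")
print(prompt)
```

Numbering the passages also makes it easy to ask the model to cite which passage supports its answer.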

In [None]:
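from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

# Load a small open model locally through a transformers text-generation pipeline
llm = HuggingFacePipeline.from_model_id(
    model_id="Qwen/Qwen3-0.6B",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=512,       # upper bound on the answer length
        do_sample=False,          # greedy decoding, so repeated runs give the same answer
        repetition_penalty=1.03,  # mild penalty against repeated tokens
    ),
)

# Wrap the pipeline so it accepts chat messages (system / human)
chat_model = ChatHuggingFace(llm=llm)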

In [None]:
# langchain_core is installed as a dependency of langchain_huggingface
from langchain_core.messages import (
    HumanMessage,
    SystemMessage,
)

system_message = '''You are an expert assistant that answers questions about different topics.
If you don't know the answer, just say "I don't know." Don't try to make up an answer.
Use only the following pieces of context to answer the question at the end.
Do not use any prior knowledge.'''


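# Build the context from every retrieved chunk instead of relying on the
# `result` variable left over from the printing loop above
context = "\n\n".join(point.payload["content"] for point in results.points)

messages = [
    SystemMessage(content=system_message),
    HumanMessage(content=f"Context: {context}\n\nQuestion: {query}"),
]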

response = chat_model.invoke(messages)

In [None]:
print(response.content)
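
Depending on the model and its chat template, the response may include intermediate reasoning wrapped in `<think>...</think>` tags before the final answer (Qwen3 models can do this). If that happens, a small post-processing step along these lines can strip it. This is a sketch that assumes the tags appear literally in the text; `strip_reasoning` is a hypothetical helper and `raw` is a made-up example string.

```python
import re

# Non-greedy match so each <think>...</think> block is removed separately
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text):
    """Remove <think>...</think> blocks and surrounding whitespace from model output."""
    return THINK_BLOCK.sub("", text).strip()


# Made-up output, for illustration only
raw = "<think>\nCheck the context first.\n</think>\n\nFinal answer text."
answer = strip_reasoning(raw)
print(answer)
```

With the notebook's variables, the call would be `strip_reasoning(response.content)`.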