# Using LlamaIndex for Inference

## Introduction

After fine-tuning your RAG system to achieve the desired performance, you'll want to
deploy it for inference. While FedRAG's `RAGSystem` provides complete inference
capabilities out of the box, you may need additional features for production deployments
or want to leverage the ecosystem of existing RAG frameworks.

FedRAG integrates with [LlamaIndex](https://github.com/run-llama/llama_index) through its bridges system,
giving you the best of both worlds: FedRAG's fine-tuning capabilities combined
with the extensive inference features of LlamaIndex.

In this example, we demonstrate how to convert a `RAGSystem` to a LlamaIndex
`BaseManagedIndex`, from which you can obtain a query engine (`as_query_engine()`)
as well as a retriever (`as_retriever()`).

### Install dependencies

In [None]:
# On Google Colab, the first attempt to install fed-rag may fail for reasons
# not yet understood; rerunning this cell a second time usually succeeds.
!pip install "fed-rag[huggingface,llama-index]" -q

## Setup — The RAG System

In [2]:
import torch
from transformers.generation.utils import GenerationConfig

from fed_rag import RAGSystem, RAGConfig
from fed_rag.generators.huggingface import HFPretrainedModelGenerator
from fed_rag.retrievers.huggingface import (
    HFSentenceTransformerRetriever,
)
from fed_rag.knowledge_stores import InMemoryKnowledgeStore
from fed_rag.types import KnowledgeNode, NodeType


QUERY_ENCODER_NAME = "nthakur/dragon-plus-query-encoder"
CONTEXT_ENCODER_NAME = "nthakur/dragon-plus-context-encoder"
PRETRAINED_MODEL_NAME = "Qwen/Qwen2.5-1.5B"

# Retriever
retriever = HFSentenceTransformerRetriever(
    query_model_name=QUERY_ENCODER_NAME,
    context_model_name=CONTEXT_ENCODER_NAME,
    load_model_at_init=False,
)

# Generator
generation_cfg = GenerationConfig(
    do_sample=True,
    eos_token_id=151643,
    bos_token_id=151643,
    max_new_tokens=2048,
    top_p=0.9,
    temperature=0.6,
    cache_implementation="offloaded",
    stop_strings="</response>",
)
generator = HFPretrainedModelGenerator(
    model_name=PRETRAINED_MODEL_NAME,
    load_model_at_init=False,
    load_model_kwargs={"device_map": "auto", "torch_dtype": torch.float16},
    generation_config=generation_cfg,
)

# Knowledge store
knowledge_store = InMemoryKnowledgeStore()


# Create the RAG system
rag_system = RAGSystem(
    retriever=retriever,
    generator=generator,
    knowledge_store=knowledge_store,
    rag_config=RAGConfig(top_k=1),
)

### Add some knowledge

In [3]:
text_chunks = [
    "Retrieval-Augmented Generation (RAG) combines retrieval with generation.",
    "LLMs can hallucinate information when they lack context.",
]
knowledge_nodes = [
    KnowledgeNode(
        node_type="text",
        embedding=retriever.encode_context(ct).tolist(),
        text_content=ct,
    )
    for ct in text_chunks
]
knowledge_store.load_nodes(knowledge_nodes)

In [4]:
rag_system.knowledge_store.count

2
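The `RAGConfig(top_k=1)` setting above means that only the single best-scoring node is retrieved for a query. As a rough illustration of what top-k selection does, here is a minimal pure-Python sketch. The 3-dimensional vectors, the dot-product score, and the `retrieve_top_k` helper are all assumptions made for this example; the actual retriever and knowledge store may score nodes differently.

```python
# Toy illustration of top-k retrieval. The vectors, the dot-product
# similarity, and the helper names are assumptions for this sketch only.


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query_emb, nodes, top_k=1):
    """Return the top_k (score, text) pairs, highest score first."""
    scored = [(dot(query_emb, emb), text) for text, emb in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hand-made embeddings standing in for real encoder outputs.
toy_nodes = [
    ("RAG combines retrieval with generation.", [0.9, 0.1, 0.0]),
    ("LLMs can hallucinate information when they lack context.", [0.1, 0.8, 0.3]),
]
toy_query = [0.0, 0.9, 0.2]

top = retrieve_top_k(toy_query, toy_nodes, top_k=1)
for score, text in top:
    print(f"{score:.2f}  {text}")
```

With `top_k=1`, the second toy node is the only one returned because its vector is closest to the toy query.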

## Using the Bridge

Converting your RAG system to a LlamaIndex object takes a single method call, since
the bridge functionality is already built into the `RAGSystem` class. `RAGSystem` inherits
from `LlamaIndexBridgeMixin`, which provides the `to_llamaindex()` method.
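To make the mixin idea concrete, here is a minimal sketch of the general pattern: a mixin class contributes a conversion method, and any class that inherits from it gains that method. Every name below (`ToyIndex`, `ToyBridgeMixin`, `ToyRAGSystem`, `to_toy_index`) is hypothetical and chosen for illustration; this is not FedRAG's actual implementation.

```python
# Sketch of a bridge mixin. All class and method names are hypothetical.


class ToyIndex:
    """Stand-in for a framework-specific index that wraps a RAG system."""

    def __init__(self, rag_system):
        self._rag_system = rag_system

    def query(self, question):
        # Delegate to the wrapped system, so the index adds no logic itself.
        return self._rag_system.answer(question)


class ToyBridgeMixin:
    """Adds a conversion method to any class that inherits from it."""

    def to_toy_index(self):
        return ToyIndex(rag_system=self)


class ToyRAGSystem(ToyBridgeMixin):
    """Minimal system that gains `to_toy_index()` through inheritance."""

    def answer(self, question):
        return f"answer to: {question}"


toy_system = ToyRAGSystem()
toy_index = toy_system.to_toy_index()
print(toy_index.query("What is RAG?"))
```

One appeal of this design is that the wrapped object stays the single source of truth: the converted index only delegates, so the underlying system does not need to be copied or re-configured.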

In [8]:
# Create a LlamaIndex object
index = rag_system.to_llamaindex()

# Use it like any other LlamaIndex object to get a query engine
query = "What happens if LLMs lack context?"
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response, "\n")

# Or, get a retriever (named to avoid shadowing the FedRAG `retriever` above)
li_retriever = index.as_retriever()
results = li_retriever.retrieve(query)
for node in results:
    print(f"Score: {node.score}, Content: {node.node}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.



You are a helpful assistant. Given the user's question, provide a succinct
and accurate response. If context is provided, use it in your answer if it helps
you to create the most accurate response.

<question>
Context information is below.
---------------------
LLMs can hallucinate information when they lack context.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What happens if LLMs lack context?
Answer: 
</question>

<context>

</context>

<response>

Assistant: If LLMs lack context, they may hallucinate information, which means they generate incorrect or irrelevant information that is not based on the available context. This can lead to inaccurate results and potentially harm the user's decision-making process. To avoid this, it is important to provide LLMs with relevant and accurate context before using them. 

Score: 0.5453173113645673, Content: Node ID: 9eaf96fe-c784-4cac-b423-518e063a936b
Text: LLMs can hallucinate informat
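
In the printed prompt above, the retrieved text appears inside the `<question>` block while the `<context>` block is empty. One plausible reading is that the query engine first formats the retrieved context and the query with its own template, and the resulting string is then passed to the generator as the question. The sketch below reproduces that nesting with plain string templates. Both templates are reconstructed from the printed output, and their variable names are hypothetical; this is not a statement of how either library is implemented.

```python
# Templates reconstructed from the printed output; names are hypothetical.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

GENERATOR_TEMPLATE = (
    "<question>\n{question}\n</question>\n\n"
    "<context>\n{context}\n</context>\n\n"
    "<response>\n"
)

# Step 1: the inner template merges the retrieved context with the query.
inner_prompt = QA_TEMPLATE.format(
    context_str="LLMs can hallucinate information when they lack context.",
    query_str="What happens if LLMs lack context?",
)

# Step 2: the merged string becomes the question; no separate context is passed.
outer_prompt = GENERATOR_TEMPLATE.format(question=inner_prompt, context="")
print(outer_prompt)
```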