# RAG Using Different LLM from RAG Essentials

In this notebook, we demonstrate the construction of a **Retrieval-Augmented Generation (RAG)** pipeline by connecting with **HPE's RAG Essentials**.

Our goal is to integrate RAG Essentials into our RAG architecture, where a language model is enhanced with external knowledge retrieval. This allows the system to answer queries using not only its internal knowledge but also relevant context retrieved dynamically from an external data source.

This notebook covers:
- Preparing and indexing documents for retrieval
- Setting up RAG Essentials for LLM inference
- Querying the RAG system with natural language inputs
- Using the retrieved context to generate enriched, accurate answers

This example serves as a practical guide to leveraging cloud-based inferencing with HPE RAG Essentials to build scalable and intelligent applications using retrieval-augmented techniques.

## Installing the Libraries

In [None]:
!pip install -r requirements.txt -q

### One-Time Environment Setup (Cert & Dependencies)

The following commands should be run **once** at the beginning of your session:

```python
!pip install -r requirements.txt -qq > /dev/null 2>&1
!cat /mnt/shared/CA/my-private-ca-pcai-1.crt >> $(python -m certifi)
```

### Append Custom CA Certificate to Python's Trusted Cert Store

The following command appends a custom certificate (`my-private-ca-pcai-1.crt`) to Python's certifi CA bundle, allowing Python tools like `requests` to trust internal HTTPS endpoints signed by this CA:


In [None]:
!cat /mnt/shared/CA/my-private-ca-pcai-1.crt >> $(python -m certifi)
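
Because `cat ... >>` appends unconditionally, re-running the cell adds the same certificate to the bundle again each time. The following is a minimal sketch of an idempotent alternative, assuming the certificate is a PEM text file. The helper name `append_ca_if_missing` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def append_ca_if_missing(cert_path: str, bundle_path: str) -> bool:
    """Append a PEM certificate to a CA bundle unless it is already present.

    Returns True if the bundle was modified, False if the cert was already there.
    """
    cert_text = Path(cert_path).read_text().strip()
    bundle = Path(bundle_path)
    bundle_text = bundle.read_text()
    if cert_text in bundle_text:
        return False  # already trusted; nothing to do
    with bundle.open("a") as fh:
        # Make sure the new certificate starts on its own line
        if bundle_text and not bundle_text.endswith("\n"):
            fh.write("\n")
        fh.write(cert_text + "\n")
    return True
```

In the notebook, this could be called as `append_ca_if_missing("/mnt/shared/CA/my-private-ca-pcai-1.crt", certifi.where())`, where `certifi.where()` returns the same bundle path that `python -m certifi` prints.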

### Restart the Kernel

After running the setup commands above:

Go to the Jupyter menu and select:  
**`Kernel` → `Restart Kernel`**  

This step is necessary to activate:
- The newly installed packages.
- The updated CA certificates in the Python runtime.


### After Restart: Run the Remaining Notebook Cells

Now that the kernel has restarted, begin executing the remaining notebook cells starting from your LangChain and Weaviate imports.


## Importing Libraries

This cell imports all necessary libraries and modules required to build the RAG pipeline. It includes:

- **LangChain integrations** for:
  - NVIDIA AI Endpoints (chat and reranking)
  - Weaviate vector store
  - Document loaders and text splitters
  - RetrievalQA chains with contextual compression
- **Weaviate** for managing the vector database

These components collectively enable document ingestion, vector storage, retrieval, reranking, and response generation using cloud-hosted LLM endpoints.

In [9]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints.reranking import NVIDIARerank
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain.chains import RetrievalQA
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import copy
import weaviate

##  Fetching the Secret Token for RAG Essentials

This step reads the **secret access token** from a mounted secret file. The token authenticates the connection to the **Weaviate vector database**, and the same token is reused later as the API key for the LLM and reranker endpoints.

In [2]:
import weaviate

# Read the auth token from the mounted secret file
secret_file_path = "/etc/secrets/ezua/.auth_token"

with open(secret_file_path, "r") as file:
    token = file.read().strip()
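
A missing or empty token file only shows up later as an authentication failure against Weaviate or the LLM endpoint. As a hedged sketch, the read can be wrapped in a small helper that fails early with a clear message. The name `read_token` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def read_token(path: str) -> str:
    """Read an auth token from a file and fail early if it is missing or empty."""
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(f"Token file not found: {path}")
    value = token_file.read_text().strip()
    if not value:
        raise ValueError(f"Token file is empty: {path}")
    return value
```

With this helper, the cell above would reduce to `token = read_token(secret_file_path)`.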

##  Connecting to Weaviate

This cell establishes a connection to the **Weaviate vector database** using custom HTTP and gRPC endpoints configured for an internal HPE environment.

In [3]:
domain = ".cluster.local"
http_host = "weaviate.hpe-weaviate.svc" + domain
grpc_host = "weaviate-grpc.hpe-weaviate.svc" + domain
weaviate_headers = {"x-auth-token": token}

client = weaviate.connect_to_custom(
    http_host=http_host,        # Hostname for the HTTP API connection
    http_port=80,              # Default is 80, WCD uses 443
    http_secure=False,           # Whether to use https (secure) for the HTTP API connection
    grpc_host=grpc_host,        # Hostname for the gRPC API connection
    grpc_port=50051,              # Default is 50051, WCD uses 443
    grpc_secure=False,           # Whether to use a secure channel for the gRPC API connection
    headers=weaviate_headers,
    skip_init_checks=False
)

print(client.is_ready())

HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/.well-known/openid-configuration "HTTP/1.1 404 Not Found"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/meta "HTTP/1.1 200 OK"
HTTP Request: GET https://pypi.org/pypi/weaviate-client/json "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/.well-known/ready "HTTP/1.1 200 OK"


True


## Connecting to the LLM Through RAG Essentials

This cell initializes a connection to a **Large Language Model (LLM)** served through **HPE's RAG Essentials**.

- **Model**: `meta/llama-3.1-8b-instruct`, an instruction-tuned LLM.
- **Endpoint**: Provided by the RAG Essentials `base_url`.
- **API Key**: Used for secure access to the inference service.
- **Parameters**:
  - `temperature`: Controls randomness in outputs (0.5 = balanced)
  - `max_tokens`: Limits response length
  - `top_p`: Controls nucleus sampling (1.0 = full probability mass)

This LLM is later used in the RAG pipeline for generating responses based on retrieved context.
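
To make the effect of `temperature` and `top_p` concrete, here is a small pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a made-up set of token scores. It illustrates the general technique only; the function names and the example logits are assumptions, and the inference service's actual sampler may differ in detail.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.5))
print(nucleus_filter(softmax_with_temperature(logits, 0.5), 0.9))
```

Lowering the temperature concentrates probability on the top-scoring token, while a `top_p` below 1.0 removes the low-probability tail before sampling. With `top_p=1.0`, as configured in this notebook, no tokens are filtered out.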

In [23]:
llm_endpoint_rag_essentials = "paste the rag essentials endpoint here"

In [None]:
llm = ChatNVIDIA(
    base_url=llm_endpoint_rag_essentials,
    model="meta/llama-3.1-8b-instruct",
    api_key=token,
    temperature=0.5,
    max_tokens=1024,
    top_p=1.0,
)
llm.invoke("What cloud services does HPE provide?")

## Data Extraction and Processing

This section handles the **loading and preprocessing of PDF documents** used in the RAG pipeline.

### Steps:
1. **Directory Loading**:
   - Loads all PDF files from the `./pdf` directory using `PyPDFDirectoryLoader`.

2. **Text Chunking**:
   - Splits documents into manageable chunks using `RecursiveCharacterTextSplitter`.
   - Parameters: 
     - `chunk_size=500` characters
     - `chunk_overlap=50` characters for better context continuity.

3. **Metadata Normalization**:
   - Extracts and standardizes metadata fields (e.g., `source`, `page`, `total_pages`, `title`) for each chunk.
   - This metadata is crucial for **citations during inference**, helping ensure **traceability and credibility** in generated responses.

This prepares the document data for indexing into the vector store with relevant context and citation support.
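
The interaction between `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window over characters. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, so its chunks vary in length. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Each window starts 450 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next. That repetition is what keeps a sentence straddling a boundary retrievable from at least one chunk.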


In [10]:
# Specify the directory where the PDFs are stored
pdf_directory = "./pdf"
# Load the PDF documents
loader = PyPDFDirectoryLoader(pdf_directory)
documents = loader.load()

# Chunking
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
metadata_docs = [copy.deepcopy(doc) for doc in docs]  # untouched copy of the original metadata


def to_float(value, default=0.0):
    """Convert a metadata value to float; fall back to default if missing or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


# Normalize metadata. doc.metadata is a dict, so look keys up with .get() rather than hasattr().
for doc in docs:
    meta = doc.metadata
    doc.metadata = {
        "source": meta.get("source") or "",
        "page": to_float(meta.get("page_label")),
        "total_pages": to_float(meta.get("total_pages")),
        "title": meta.get("title") or "",
    }

##  Vector Store Initialization

This section initializes the **vector store** by creating embeddings from the processed document chunks and storing them in **Weaviate**.

### Key Components:
- **Embeddings**:
  - Generated using the `nomic-embed-text:latest` model via **Ollama**, accessed through LangChain’s `OllamaEmbeddings`.

- **Vector Store**:
  - Uses `WeaviateVectorStore.from_documents()` to create vector representations of the document chunks.
  - Connects to the existing Weaviate client and stores the data under the index name **`RAG`**.

Once complete, all embedded chunks with associated metadata are indexed in Weaviate under the **RAG** collection, making them searchable during the retrieval phase of the RAG pipeline.


In [12]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text:latest", base_url="https://ollama.pcai1.genai1.hou")
vector = WeaviateVectorStore.from_documents(
    docs, embedding=embeddings, client=client, index_name="RAG", text_key="rag_key"
)


HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/schema/RAG2 "HTTP/1.1 404 Not Found"
HTTP Request: POST http://weaviate.hpe-weaviate.svc.cluster.local/v1/schema "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/schema/RAG2 "HTTP/1.1 200 OK"
HTTP Request: POST https://ollama.pcai1.genai1.hou/api/embed "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/schema "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/nodes "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/nodes "HTTP/1.1 200 OK"


##  Retriever Initialization

This step configures the **Weaviate vector store** as a retriever to enable efficient document retrieval within the RAG pipeline.

By calling `vector.as_retriever()`, the vector database is wrapped with retrieval capabilities, allowing the system to fetch the most relevant document chunks based on query embeddings during inference.
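
Conceptually, embedding-based retrieval ranks stored chunks by how close their vectors are to the query vector. The sketch below shows that idea with cosine similarity over toy 3-dimensional vectors. The vectors, the function names, and `k` are made up for illustration; Weaviate's actual search relies on an approximate nearest-neighbor index, and the LangChain integration may combine vector and keyword search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts whose embeddings are most similar to the query vector."""
    scored = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# Toy index of (chunk text, embedding) pairs
index = [
    ("chunk about campaigns", [1.0, 0.0, 0.0]),
    ("chunk about pricing", [0.0, 1.0, 0.0]),
    ("chunk about campaign goals", [0.9, 0.1, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], index))
```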


In [13]:
retriever = vector.as_retriever()

## Reranking for Improved Retrieval

This section enhances the retrieval accuracy using a **contextual reranking mechanism**.

### Components:
- **NVIDIARerank**:
  - Uses the `nvidia/nv-rerankqa-mistral-4b-v3` model.
  - Re-evaluates and reranks the initially retrieved documents from the vector store based on their relevance to the query.

- **ContextualCompressionRetriever**:
  - Wraps the original retriever with the reranker.
  - Filters and returns only the most contextually relevant chunks, improving the final output quality from the LLM.

This reranking step ensures that only the most meaningful documents are passed to the LLM, increasing the **accuracy**, **precision**, and **credibility** of the generated answers.
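
The two-stage pattern described here, a broad first-stage retrieval followed by a more careful rescoring, can be sketched without any model at all. In the example below, a toy keyword-overlap score stands in for the reranker model. Both function names and the scoring rule are assumptions for illustration and say nothing about how `NVIDIARerank` scores documents internally.

```python
def keyword_overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, score_fn=keyword_overlap_score, top_n=2):
    """Re-score first-stage candidates against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


candidates = [
    "the weather is nice",
    "synergyhub campaign message is simple",
    "campaign budget overview",
]
print(rerank("synergyhub campaign message", candidates))
```

A real reranker replaces `keyword_overlap_score` with a model that reads the query and each candidate together, which is slower per document but more precise than comparing independent embeddings.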


In [18]:
compressor = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3",
                          base_url="https://reranker-5c3f14b5-predictor-ezai-services.pcai1.genai1.hou",
                          api_key=token)



In [19]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
) 

## User Query

In [16]:
query = "What is the key message for the SynergyHub campaign?"

## Output
This cell runs the **RetrievalQA chain**, which:

- Uses the **retriever** to fetch relevant document chunks from the vector store based on the user query.
- Passes the retrieved context to the **LLM** (`llm`) for generating a meaningful and context-aware response.
- Returns the generated answer along with the **source documents** used for citation, ensuring transparency and credibility in the output.

The call `chain.invoke(query)` triggers the entire RAG process for the input query.
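
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt. The sketch below shows that assembly step in plain Python. The prompt wording and the helper name `build_stuffed_prompt` are illustrative assumptions, not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuffed_prompt(
    "What is the key message?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because every chunk lands in one prompt, the number and size of retrieved chunks must fit within the model's context window, which is one practical reason to rerank and trim before generation.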

In [None]:
# Use the reranking retriever so that only reranked chunks reach the LLM
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever, return_source_documents=True)
resp = chain.invoke(query)

In [None]:
result = resp["result"]
print("Assistant:", result)
for doc in resp["source_documents"]:
    meta = doc.metadata
    print(f"Source: {meta['source']} Title: {meta['title']} Page No: {meta['page']}")