# RAG Using Different LLM from RAG Essentials

### This notebook uses PCAI RAG Essentials endpoints to build a RAG application

In [16]:
!cat my-private-ca.crt >> $(python -m certifi)

## Importing the Libraries


- **`ChatNVIDIA` & `NVIDIAEmbeddings`**  
  From `langchain_nvidia_ai_endpoints`: These provide access to NVIDIA's LLM and embedding models for tasks like chat and semantic search.

- **`NVIDIARerank`**  
  A reranker module from NVIDIA to reorder retrieved documents based on relevance scores, improving final answer quality.

- **`WeaviateVectorStore`**  
  An integration with Weaviate, a vector database used to store and query embedding vectors for semantic search.

- **`RetrievalQA`**  
  A LangChain chain that combines a retriever and a language model to answer questions based on retrieved documents.

- **`PyPDFLoader`**  
  A document loader that reads and extracts text content from PDF files.

- **`ContextualCompressionRetriever`**  
  A retriever wrapper that post-processes the base retriever's results with a document compressor (here, the reranker) before they are passed to the LLM.

- **`CharacterTextSplitter`**  
  A utility to split large documents into smaller, overlapping text chunks based on character count (useful for chunking PDFs for RAG).


In [1]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
from langchain_nvidia_ai_endpoints.reranking import NVIDIARerank
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.text_splitter import CharacterTextSplitter

## Fetching the Secret Token for RAG Essentials

In [2]:
import os

import weaviate
from weaviate.classes.init import Auth

# Read the auth token from the mounted secret file
secret_file_path = "/etc/secrets/ezua/.auth_token"

with open(secret_file_path, "r") as file:
    token = file.read().strip()

In [None]:
token
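Evaluating `token` in a cell writes the raw secret into the notebook output. If the goal is only to confirm that the token was loaded, a masked view may be enough. The helper below is a minimal sketch; `mask_secret` and the example value are hypothetical and not part of RAG Essentials.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Return the secret with all but the last `visible` characters replaced by '*'."""
    if visible <= 0 or len(secret) <= visible:
        # Nothing safe to reveal: mask every character
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Hypothetical token value, used only for illustration
example_token = "abcd1234efgh5678"
print(mask_secret(example_token))
```

In the notebook this could be called as `mask_secret(token)` instead of displaying `token` directly.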

## Connecting to Weaviate

In [3]:
domain = ".cluster.local"
http_host = "weaviate.hpe-weaviate.svc" + domain
grpc_host = "weaviate-grpc.hpe-weaviate.svc" + domain
weaviate_headers = {"x-auth-token": token}
# Uncomment to check that an invalid token is rejected
#weaviate_headers = {"x-auth-token": "wrong token"}

client = weaviate.connect_to_custom(
    http_host=http_host,        # Hostname for the HTTP API connection
    http_port=80,              # Default is 80, WCD uses 443
    http_secure=False,           # Whether to use https (secure) for the HTTP API connection
    grpc_host=grpc_host,        # Hostname for the gRPC API connection
    grpc_port=50051,              # Default is 50051, WCD uses 443
    grpc_secure=False,           # Whether to use a secure channel for the gRPC API connection
    headers=weaviate_headers,
    skip_init_checks=False
)

print(client.is_ready())

HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/.well-known/openid-configuration "HTTP/1.1 404 Not Found"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/meta "HTTP/1.1 200 OK"
HTTP Request: GET https://pypi.org/pypi/weaviate-client/json "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/.well-known/ready "HTTP/1.1 200 OK"


True


## Connecting to the LLM Through RAG Essentials

In [4]:
llm_endpoint_rag_essentials = "https://llama-3-1-8b-b7ee1686-predictor-ezai-services.pcai1.genai1.hou"

`base_url=llm_endpoint_rag_essentials`
The endpoint URL of the LLM served through RAG Essentials. All requests are sent here.

`model="meta/llama-3.1-8b-instruct"`
Specifies the model to use. In this case, it's Meta’s Llama 3.1 model with 8 billion parameters, tuned for instruction-following tasks.

`api_key=token`
The token used to authenticate the requests. Here it is the auth token read from the secret file earlier; keep it secure and avoid hard-coding it in the notebook.

`temperature=0.5`
Controls randomness in generation. Lower values make the output more focused and deterministic; higher values increase variety at the cost of consistency.

`max_tokens=1024`
Maximum number of tokens (words or word-parts) the model can generate in the response.

`top_p=1.0`
Implements nucleus (top-p) sampling. The model considers only the smallest set of tokens whose cumulative probability is at least p.

Setting it to 1.0 disables nucleus filtering, allowing full probability distribution sampling.
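To make the effect of `temperature` and `top_p` more concrete, here is a small pure-Python sketch of temperature scaling followed by nucleus filtering. The logits are made-up numbers and the two helper functions are hypothetical; the actual sampling happens inside the model server and may differ in detail.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

# With top_p=0.9 the least likely token is dropped; with top_p=1.0 all tokens stay
print(top_p_filter(softmax_with_temperature(logits, 0.5), top_p=0.9))
```

At `temperature=0.5` the most likely token receives a larger share of the probability mass than at `temperature=1.0`, which is why lower temperatures produce more repeatable answers.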

In [5]:
llm = ChatNVIDIA(
    base_url=llm_endpoint_rag_essentials,
    model="meta/llama-3.1-8b-instruct",
    api_key=token,
    temperature=0.5,
    max_tokens=1024,
    top_p=1.0,
)



## Data Extraction and Processing

`pdf_path = "./HPE.pdf"`
Specifies the path to the PDF file that needs to be processed.

`loader = PyPDFLoader(pdf_path)`
Creates a PyPDFLoader instance to handle reading and parsing of the PDF file.

`documents = loader.load()`
Loads the content of the PDF into a list of Document objects, where each object typically represents one page.

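`text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)`
Initializes a text splitter that breaks long documents into smaller chunks. It splits the text on a separator (a blank line by default) and merges the pieces into chunks that target at most 500 characters, with up to 50 characters of overlap between consecutive chunks to preserve context. A piece that contains no separator is kept whole, so a chunk can exceed 500 characters.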

`docs = text_splitter.split_documents(documents)`
Applies the splitter to the list of loaded documents to produce a list of text chunks.

`for doc in docs: doc.metadata = {}`
Clears any metadata associated with the text chunks (e.g., page number, source), which can be useful for privacy or minimizing storage.
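The idea of overlapping chunks can be illustrated with a simplified fixed-window splitter. This is only a sketch under stated assumptions: `chunk_text` is a hypothetical helper that cuts purely by character position, whereas `CharacterTextSplitter` splits on a separator first, so its chunk boundaries will generally differ.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1200-character text, used only for illustration
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
print(pieces[0][-50:] == pieces[1][:50])
```

The last 50 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen in full by at least one chunk.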

In [6]:
# Replace with the path to your PDF
pdf_path = "./HPE.pdf"

# Load PDF file
loader = PyPDFLoader(pdf_path)
documents = loader.load()

# Split into manageable chunks
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)

for doc in docs:
    doc.metadata={}

## Vector Store Initialization

In [9]:
from langchain_ollama import OllamaEmbeddings

## Add Documents to the Vector Store 

In [10]:
vector = WeaviateVectorStore.from_documents(
    docs,
    OllamaEmbeddings(model="nomic-embed-text:latest", base_url="http://10.79.253.112:11434"),
    client=client,
    index_name="Question2",
    text_key="question2_key",
)


HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/schema/Question2 "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/schema/Question2 "HTTP/1.1 200 OK"
HTTP Request: POST http://10.79.253.112:11434/api/embed "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/schema "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/nodes "HTTP/1.1 200 OK"
HTTP Request: GET http://weaviate.hpe-weaviate.svc.cluster.local/v1/nodes "HTTP/1.1 200 OK"


## Retriever Initialization

In [11]:
retriever = vector.as_retriever()
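Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with a brute-force cosine-similarity scan over made-up three-dimensional vectors; `cosine_similarity`, `top_k`, and the sample data are hypothetical, and a real vector database typically uses an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings, for illustration only
indexed = [
    ("gpu specs", [0.9, 0.1, 0.0]),
    ("warranty terms", [0.0, 0.2, 0.9]),
    ("memory options", [0.6, 0.7, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

for score, text in top_k(query_vec, indexed, k=2):
    print(round(score, 3), text)
```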

## Reranking

`model="nvidia/nv-rerankqa-mistral-4b-v3"`

Specifies the name of the reranking model.
This model is designed to re-order retrieved documents based on how well they answer a query (RAG-based QA).
It is based on a 4B parameter Mistral architecture fine-tuned for reranking.

`base_url="https://reranker-5c3f14b5-predictor-ezai-services.pcai1.genai1.hou"`
The endpoint URL of the hosted PCAI Reranker.

`api_key=token`
The API key used for authentication, stored in the variable token.

In [12]:
compressor = NVIDIARerank(model="nvidia/nv-rerankqa-mistral-4b-v3",
                          base_url="https://reranker-5c3f14b5-predictor-ezai-services.pcai1.genai1.hou",
                          api_key=token)



## Two-Stage Retrieval

The `compression_retriever` performs a two-stage retrieval process:

**1. Retrieve:** Fetch many potentially relevant chunks from a vector store.

**2. Compress:** Rerank and filter those chunks using a more precise model.

In [13]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
) 
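The retrieve-then-rerank pattern can be sketched without any external service. In the toy example below, stage one ranks by the number of shared words and stage two rescoring uses Jaccard similarity as a stand-in for the reranker model. Both scoring functions, the corpus, and the function names are hypothetical and are not what `NVIDIARerank` computes.

```python
def _words(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def retrieve(query, corpus, k=3):
    """Stage 1: cheap, recall-oriented ranking by number of shared words."""
    q = _words(query)
    ranked = sorted(corpus, key=lambda doc: len(q & _words(doc)), reverse=True)
    return ranked[:k]


def rerank(query, candidates, top_n=2):
    """Stage 2: rescore the candidates with Jaccard similarity and keep the best top_n."""
    q = _words(query)

    def jaccard(doc):
        d = _words(doc)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:top_n]


# Made-up corpus, for illustration only
corpus = [
    "the dl384 gen12 server uses the nvidia gh200 nvl2 gpu",
    "the server ships with a rail kit and the standard warranty for the server",
    "gpu",
    "storage options include nvme drives",
]
question = "which gpu powers the dl384 gen12 server"

candidates = retrieve(question, corpus, k=3)
print(rerank(question, candidates, top_n=2))
```

Stage one places the long warranty sentence second because it shares two common words with the question; stage two demotes it because most of its words are unrelated. This is the kind of correction a reranker is meant to provide.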

## User Query

The question to ask the RAG chain.

In [14]:
query = "which GPU powers the HPE ProLiant Compute DL384 Gen12 ?"

## Output

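Build the chain with `RetrievalQA.from_chain_type` and run it with `chain.invoke`. As written, the chain uses the base `retriever`, so the reranker is not applied; pass `retriever=compression_retriever` instead to include the reranking stage.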

In [15]:
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
chain.invoke(query)

HTTP Request: POST http://10.79.253.112:11434/api/embed "HTTP/1.1 200 OK"


{'query': 'which GPU powers the HPE ProLiant Compute DL384 Gen12 ?',
 'result': 'According to the provided QuickSpecs, the HPE ProLiant Compute DL384 Gen12 is enabled with the NVIDIA GH200 NVL2.'}
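`RetrievalQA.from_chain_type` uses the "stuff" chain type by default, which places all retrieved chunks into a single prompt together with the question. The sketch below shows the general shape of such a prompt. The wording of the template, the function name `build_stuff_prompt`, and the sample chunks are hypothetical and do not reproduce LangChain's actual prompt.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks, for illustration only
chunks = [
    "First retrieved chunk of text.",
    "Second retrieved chunk of text.",
]
print(build_stuff_prompt("What does the document say?", chunks))
```

Because every chunk is inserted verbatim, the number and size of retrieved chunks directly determine how much of the model's context window the prompt consumes.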