# RAG Implementation

# **Step 1: Install and import the dependencies**

In [None]:
#!pip install sentence-transformers bitsandbytes
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
    SentenceTransformersTokenTextSplitter,
)

In [None]:
import os
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI


# **Step 2: Load the document**

Load the document that will be used as the knowledge source.

**Knowledge base**: The text document serves as the underlying knowledge base. Later, when a query is made, relevant parts of this document will be retrieved to augment the LLM's response.






In [None]:
text_loader = TextLoader("/content/state_of_union.txt")
text_document = text_loader.load()
print(text_document[:100])  # text_document is a list of Document objects, so this slices the list (one Document here), not the first 100 characters



[Document(metadata={'source': '/content/state_of_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their det

# **Step 3: Split the document into chunks**

Break down the large document into manageable pieces.

**Fine-Grained Retrieval**: Smaller chunks allow the retriever to more precisely locate the context relevant to the query, enhancing the generation step with focused context.
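
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal character-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on paragraph and sentence separators); the helper name `sliding_window_chunks` and the sample string are assumptions used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.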

In [None]:
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
split_texts = doc_splitter.split_documents(text_document)
print(len(split_texts))  # Prints the number of chunks the text document has been split into


53


# **Step 4: Generate embeddings for each chunk**

Convert text chunks into numerical vectors (embeddings) that capture semantic meaning.

**Semantic Search**: Embeddings allow the FAISS vector store to perform similarity searches, ensuring that the most relevant context is retrieved for any given query.

**Verification**: Printing the length of the embedding vector confirms the transformation was successful.
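
As a rough illustration of why embeddings enable semantic search, the sketch below compares hand-made 3-dimensional vectors with cosine similarity. The vectors and labels are invented for the example; a real model such as `all-mpnet-base-v2` produces 768-dimensional vectors, and the vector store may use a different metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; related phrases are given nearby vectors by hand.
toy_vectors = {
    "aid to Ukraine": [0.9, 0.1, 0.0],
    "military assistance": [0.8, 0.3, 0.1],
    "community colleges": [0.0, 0.2, 0.9],
}

query_vec = toy_vectors["aid to Ukraine"]
for label, vec in toy_vectors.items():
    print(f"{label}: {cosine_similarity(query_vec, vec):.3f}")
```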

In [None]:
#model = "sentence-transformers/all-MiniLM-L6-v2"
#embeddings = HuggingFaceEmbeddings(model_name=model)

MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
hf_embed = HuggingFaceEmbeddings(model_name=MODEL_NAME)
text = split_texts[0].page_content
hf_embed_result = hf_embed.embed_documents([text])
print(len(hf_embed_result[0]))  # Prints the length of the first embedded document



768


#### To take a quick look at the embeddings for all chunks, run the following

In [None]:
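# embed_documents is the batch method intended for stored text; embed_query is meant for search queries
embedded_chunks = hf_embed.embed_documents([chunk.page_content for chunk in split_texts])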

In [None]:
import pandas as pd
df_chunks = pd.DataFrame(embedded_chunks)
df_chunks


# **Step 5: Build the FAISS vector store and create a retriever**

Build an index (FAISS) for the document embeddings and create a retriever.

**Retrieval step**: The retriever is responsible for fetching the most relevant chunks from the document based on the query. These retrieved contexts will later be fed into the generation step to produce an informed answer.
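
Conceptually, the retrieval step is a nearest-neighbour search over the stored vectors. The sketch below does this by exhaustive scan with Euclidean (L2) distance, where a lower score means a closer match, which is consistent with the scores printed later in this notebook. FAISS uses optimised index structures rather than this loop; `top_k`, `l2_distance`, and the toy index are hypothetical names and data.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Invented 2-dimensional vectors standing in for chunk embeddings.
toy_index = [
    ("chunk about sanctions", [1.0, 0.0]),
    ("chunk about education", [0.0, 1.0]),
    ("chunk about military aid", [0.9, 0.2]),
]

nearest = top_k([1.0, 0.1], toy_index, k=2)
for distance, text in nearest:
    print(f"{distance:.3f}  {text}")
```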


In [None]:
!pip install faiss-cpu
vectorstore = FAISS.from_documents(split_texts, hf_embed)

# This embeds every chunk with hf_embed and builds an in-memory FAISS index; the index is temporary (non-persistent) unless it is saved explicitly



#### Let's see if the retriever works

In [None]:
retriever=vectorstore.as_retriever()

In [None]:
print(dir(retriever))

['InputType', 'OutputType', '__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__or__', '__orig_bases__', '__parameters__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_on_complete__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_model__',

In [None]:
# How the retriever is used:
# query = "What are the key points from the State Of The Union"
# docs = retriever.invoke(query)  # NOTE: a retriever is not called like a function; use .invoke()

In [None]:
# Alternative to the retriever: embed the query ourselves and search the vector store directly
query_embedding = hf_embed.embed_query("What are the key points from the State Of The Union")
similar_docs = vectorstore.similarity_search_by_vector(query_embedding, k=5)  # top 5 results

In [None]:
for doc in similar_docs:
    print(doc.page_content)

Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges. 

And let’s pass the PRO Act when a majority of workers want to form a union—they shouldn’t be stopped.  

When we invest in our workers, when we build the economy from the bottom up and the middle out together, we can do something we haven’t done in a long time: build a better America.
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.
But in my administration, the watch

In [None]:
query2 = "How is the United States supporting Ukraine economically and militarily?"

In [None]:
'''Query is embedded, FAISS retrieves the top-k most relevant document chunks
   These are grounding context, not model memory '''
query_embedding = hf_embed.embed_query(query2)
similar_docs = vectorstore.similarity_search_by_vector(query_embedding, k=5)  # top 5 results
for doc in similar_docs:
    print(doc.page_content)


The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. 

Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistance. 

We are giving more than $1 Billion in direct assistance to Ukraine. 

And we will continue to aid the Ukrainian people as they defend their country and to help ease their suffering.
Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. 

We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. 

Together with our allies –we are right now enforcing powerful economic sanctions.
The United States is a member along with 29 other nations. 

It matters. American diplo

In [None]:
import os
from langchain_openai import AzureChatOpenAI
from dotenv import load_dotenv
load_dotenv("/content/.env")

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("API_KEY"),
    api_version="2024-12-01-preview",
    deployment_name="gpt-4.1",
    temperature=0,
)

In [None]:
#Context construction
context = "\n".join([doc.page_content for doc in similar_docs])

#To limit tokens for large contexts
#context = "\n".join([doc.page_content[:1000] for doc in similar_docs])
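# NOTE: similar_docs still holds the chunks retrieved for query2 (Ukraine), so the answer below only covers that context
answer = llm.invoke(f"Based on the following context, answer the question:\n\n{context}\n\nQuestion:\nWhat are the key points from the State Of The Union?")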
print(answer)

content='**Key Points from the State of the Union (based on the provided context):**\n\n1. **Russian Stock Market and Economy:**  \n   - The Russian stock market has lost 40% of its value and trading is suspended.\n   - Russia’s economy is suffering, and President Putin is solely blamed for this situation.\n\n2. **Support for Ukraine:**  \n   - The U.S. and its allies are providing military, economic, and humanitarian assistance to Ukraine.\n   - Over $1 billion in direct assistance is being given to Ukraine.\n   - Continued commitment to aid the Ukrainian people as they defend their country and to help ease their suffering.\n\n3. **International Coalition:**  \n   - Support for Ukraine is broad, including all 27 EU members (France, Germany, Italy), the UK, Canada, Japan, Korea, Australia, New Zealand, Switzerland, and others.\n   - The U.S. is working with 29 other nations to enforce powerful economic sanctions on Russia.\n\n4. **Sanctions and Isolation of Russia:**  \n   - The U.S. a

# **Step 6: Design a prompt template for the language model**
Establish a prompt that instructs the LLM on how to utilize the retrieved context to generate a concise answer.

**Guiding Generation**: The prompt template bridges retrieval and generation by ensuring the LLM uses the provided context (from the retriever) to answer the query accurately.
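
The idea of a prompt template can be shown with plain string formatting: retrieved chunks are joined into one context string and substituted into fixed instructions together with the question. This is only a sketch of the concept, not `ChatPromptTemplate` itself; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question.\n"
    "If you don't know the answer, just say that you don't know.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks and fill both placeholders of the template."""
    context = "\n".join(chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)


filled = build_prompt(
    "How much direct assistance is going to Ukraine?",
    ["We are giving more than $1 Billion in direct assistance to Ukraine."],
)
print(filled)
```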

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

In [None]:
template="""You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use one sentence and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

In [None]:
prompt=ChatPromptTemplate.from_template(template)

In [None]:
output_parser=StrOutputParser()

In [None]:
def rag_chain(query, vectorstore, llm, prompt, output_parser, k=5):
    # Step 1: Retrieve relevant documents
    query_embedding = hf_embed.embed_query(query)
    similar_docs = vectorstore.similarity_search_by_vector(query_embedding, k=k)

    # Step 2: Build context string
    context = "\n".join([doc.page_content for doc in similar_docs])

    # Step 3: Format the prompt
    # Context and question are injected into the prompt.
    # format_messages returns chat messages, the input format chat models such as GPT expect.
    # ChatPromptTemplate.from_template yields a single human message (no system message).
    messages = prompt.format_messages(
        question=query,
        context=context,
    )

    # Step 4: Call LLM
    '''GPT-4.1 reasons only over the provided context
       The model is not searching FAISS or the web
       No fine-tuning is happening — this is pure inference '''
    llm_output = llm.invoke(messages)

    # Step 5: Parse output
    # .invoke() accepts the chat message and returns its text; .parse() expects a plain string
    # and would hand the message object back unchanged
    answer = output_parser.invoke(llm_output)

    return answer

In [None]:
questions = [
    "What are the key points from the State Of The Union?",
    "How is the United States supporting Ukraine economically and militarily?"
]

for q in questions:
    answer = rag_chain(q, vectorstore, llm, prompt, output_parser, k=5)
    print(f"\nQ: {q}")
    print(f"A: {answer}")


Q: What are the key points from the State Of The Union?
A: Key points from the State of the Union include unity across parties, ongoing COVID-19 recovery, prosecuting pandemic fraud, significant deficit reduction, support for education and workers, raising the minimum wage, extending the Child Tax Credit, and focusing economic relief on working Americans.

In [None]:
#Update the RAG function to return sources
def rag_chain_with_sources(query, vectorstore, llm, prompt, output_parser, k=5):
    # 1. Retrieve documents
    query_embedding = hf_embed.embed_query(query)
    similar_docs = vectorstore.similarity_search_by_vector(
        query_embedding, k=k
    )

    # 2. Build context
    context = "\n".join(doc.page_content for doc in similar_docs)

    # 3. Format prompt into messages
    messages = prompt.format_messages(
        question=query,
        context=context
    )

    # 4. Invoke LLM
    llm_output = llm.invoke(messages)
    answer = output_parser.invoke(llm_output)  # returns the message text as a string

    # 5. Collect sources
    sources = []
    for doc in similar_docs:
        source = doc.metadata.get("source", "unknown")
        sources.append(source)

    # Remove duplicates while preserving order
    sources = list(dict.fromkeys(sources))

    return {
        "answer": answer,
        "sources": sources,
    }


In [None]:
result = rag_chain_with_sources(
    "How is the United States supporting Ukraine economically and militarily?",
    vectorstore,
    llm,
    prompt,
    output_parser,
)

print("Answer:")
print(result["answer"])

print("\nSources:")
for src in result["sources"]:
    print("-", src)
# Now we have:
#   Grounded answers
#   Transparent citations
#   Auditable RAG

Answer:
The United States is supporting Ukraine by providing over $1 billion in direct economic assistance, military aid, and humanitarian support, while also enforcing powerful economic sanctions against Russia.

In [None]:
#Goal
'''
Our system should:
Answer only when the retrieved context is strong
Say “I don’t know based on the provided context” when it isn’t
This is true RAG safety'''


In [None]:
# We’ll use retrieval signal, not LLM guessing:
'''We can use Heuristics  (version-safe):
    Number of retrieved chunks
    Minimum similarity score
    Context length'''


In [None]:
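#Define confidence thresholds
# The FAISS scores used here behave like distances: lower means more similar
MAX_SCORE_THRESHOLD = 0.6      # reject when even the best (lowest) score exceeds this; tune it
MIN_CONTEXT_LENGTH = 300       # characters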

In [None]:
def rag_chain_with_confidence(query, vectorstore, llm, prompt, output_parser, k=5):
    query_embedding = hf_embed.embed_query(query)

    results = vectorstore.similarity_search_with_score_by_vector(
        query_embedding, k=k
    )

    # Separate docs and scores
    docs = [doc for doc, score in results]
    scores = [score for doc, score in results]

    # Build context
    context = "\n".join(doc.page_content for doc in docs)

    # --- Confidence checks ---
    is_confident = True

    if len(docs) == 0:
        is_confident = False
    elif min(scores) > MAX_SCORE_THRESHOLD:  # elif avoids calling min() on an empty list
        is_confident = False

    if len(context) < MIN_CONTEXT_LENGTH:
        is_confident = False

    # If confidence is low → safe answer
    if not is_confident:
        return {
            "answer": "I don't know based on the provided context.",
            "confidence": "low",
            "sources": [],
        }

    # Format prompt
    messages = prompt.format_messages(
        question=query,
        context=context
    )

    # LLM call
    llm_output = llm.invoke(messages)
    answer = output_parser.invoke(llm_output)  # returns the message text as a string

    # Sources
    sources = list(dict.fromkeys(
        doc.metadata.get("source", "unknown") for doc in docs
    ))

    return {
        "answer": answer,
        "confidence": "high",
        "sources": sources,
    }

In [None]:
questions = [
    "How is the United States supporting Ukraine economically and militarily?",
    "What is the GDP of Atlantis in 2024?"
]

for q in questions:
    result = rag_chain_with_confidence(
        q, vectorstore, llm, prompt, output_parser
    )

    print("\nQ:", q)
    print("Answer:", result["answer"])
    print("Confidence:", result["confidence"])
    print("Sources:", result["sources"])



Q: How is the United States supporting Ukraine economically and militarily?
Answer: The United States is supporting Ukraine by providing over $1 billion in direct economic assistance, military aid, and humanitarian support, while also enforcing powerful economic sanctions against Russia.

## **Without confidence gating:**

* LLMs answer even when retrieval is weak
* This can cause hallucinations

## **With confidence gating:**

* The system answers only when retrieval is strong and otherwise says it does not know
* Users can trust it
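
The gating rule can be isolated into one small function that looks only at retrieval signals. The sketch below reuses the notebook's threshold values (0.6 and 300 characters) as defaults and assumes scores are distances, so lower is better; `is_confident` is a hypothetical helper name, not part of LangChain.

```python
def is_confident(
    scores: list[float],
    context: str,
    max_score: float = 0.6,
    min_context_chars: int = 300,
) -> bool:
    """Decide from retrieval signals alone whether to let the LLM answer."""
    if not scores:                          # nothing was retrieved
        return False
    if min(scores) > max_score:             # even the best match is too far away
        return False
    if len(context) < min_context_chars:    # too little text to ground an answer
        return False
    return True


print(is_confident([0.51, 0.66, 0.70], "x" * 500))  # close best match, enough context
print(is_confident([1.20, 1.35], "x" * 500))        # best match too distant
print(is_confident([], ""))                         # empty retrieval
```

Keeping the check separate from the chain makes the thresholds easy to tune and test without calling the model.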




## Web search fallback
When FAISS retrieval confidence is low, the system should:
* Perform a web search
* Build new context from the live results
* Ask GPT-4.1 again with that context
* Clearly label the answer as coming from the web
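
The routing logic can be sketched independently of FAISS, SerpAPI, or the LLM by passing each step in as a function. Everything here (`answer_with_fallback` and the three `fake_*` stand-ins) is hypothetical and exists only so the control flow can be run without API keys.

```python
from typing import Callable


def answer_with_fallback(
    query: str,
    retrieve_local: Callable[[str], tuple[str, bool]],
    search_web: Callable[[str], str],
    generate: Callable[[str, str], str],
) -> dict:
    """Use local context when retrieval is confident, otherwise web context."""
    context, confident = retrieve_local(query)
    source_type = "vectorstore"
    if not confident:
        context = search_web(query)
        source_type = "web"
    if not context:
        return {"answer": "I don't know based on available information.",
                "source_type": "none"}
    return {"answer": generate(query, context), "source_type": source_type}


# Stand-in callables; a real system would call the vector store, a search API, and the LLM.
def fake_retrieve(query):
    if "Ukraine" in query:
        return "We are giving more than $1 Billion in direct assistance to Ukraine.", True
    return "", False


def fake_search(query):
    return f"(web snippet about: {query})"


def fake_generate(query, context):
    return f"Answer drawn from: {context}"


local = answer_with_fallback("How is the US supporting Ukraine?", fake_retrieve, fake_search, fake_generate)
web = answer_with_fallback("What happened at the latest Apple WWDC?", fake_retrieve, fake_search, fake_generate)
print(local["source_type"], "|", web["source_type"])
```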

In [None]:
#User question -->FAISS Retrieval -->Confidence Check
#If High Confidence -->Answer from FAISS
#If Low Confidence -->Web Search -->GPT --> Answer

In [None]:
import requests
import os
from dotenv import load_dotenv
load_dotenv("/content/.env")
SERPAPI_KEY = os.getenv("SERPAPI_KEY")

def web_search(query, num_results=5):
    url = "https://serpapi.com/search.json"
    params = {
        "q": query,
        "api_key": SERPAPI_KEY,
        "num": num_results,
    }

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    data = response.json()

    results = []
    for r in data.get("organic_results", []):
        snippet = r.get("snippet")
        source = r.get("link")
        if snippet:
            results.append({
                "content": snippet,
                "source": source,
            })

    return results


In [None]:
#Build web context
def build_web_context(results):
    context = []
    sources = []

    for r in results:
        context.append(r["content"])
        sources.append(r["source"])

    return "\n".join(context), list(dict.fromkeys(sources))


In [None]:
def rag_chain_structured(
    query,
    vectorstore,
    llm,
    prompt,
    output_parser,
    k=5,
):
    # --- FAISS retrieval ---
    query_embedding = hf_embed.embed_query(query)
    results = vectorstore.similarity_search_with_score_by_vector(
        query_embedding, k=k
    )

    docs = [doc for doc, score in results]
    scores = [score for doc, score in results]

    context = "\n".join(doc.page_content for doc in docs)

    # Confidence check (same thresholds as rag_chain_with_confidence)
    confident = (
        len(docs) > 0
        and min(scores) <= MAX_SCORE_THRESHOLD
        and len(context) >= MIN_CONTEXT_LENGTH
    )

    if not confident:
        # Optional: fallback to web search
        web_results = web_search(query)
        if web_results:
            context = "\n".join(r['content'] for r in web_results)
            sources = [{
                "source": r['source'],
                "score": None,
                "text": r['content']
            } for r in web_results]
            source_type = "web"
            confidence = "medium"
        else:
            return {
                "answer": "I don't know based on available information.",
                "confidence": "low",
                "source_type": "none",
                "sources": [],
            }
    else:
        sources = [
            {
                "source": doc.metadata.get("source", "unknown"),
                "score": score,
                "text": doc.page_content
            }
            for doc, score in results
        ]
        source_type = "vectorstore"
        confidence = "high"

    # --- Prompt + LLM ---
    messages = prompt.format_messages(
        question=query,
        context=context
    )

    llm_output = llm.invoke(messages)
    answer = output_parser.invoke(llm_output)  # returns the message text as a string

    return {
        "answer": answer,
        "confidence": confidence,
        "source_type": source_type,
        "sources": sources,
    }

In [None]:
questions = [
    "How is the United States supporting Ukraine economically and militarily?",
    "What happened in the latest Apple WWDC?"]

for q in questions:
    result = rag_chain_structured(
        q, vectorstore, llm, prompt, output_parser
    )

    print("\nQ:", q)
    print("Answer:", result["answer"])
    print("Confidence:", result["confidence"])
    print("Source Type:", result["source_type"])
    print("Sources:")
    for s in result["sources"]:
        print("-", s)



Q: How is the United States supporting Ukraine economically and militarily?
Answer: The United States is supporting Ukraine by providing over $1 billion in direct economic assistance, military aid, and humanitarian support, while also enforcing powerful economic sanctions against Russia.

## Streaming responses (UX + production readiness)
Stream GPT tokens live:
* Answers appear progressively instead of after the full response is ready
* Partial answers are shown while the model is still generating

In [None]:
'''
In new LangChain versions:

Streaming is handled via callbacks
Chat models emit tokens via on_llm_new_token
No model.eval() concept exists for GPT models
(that’s only for local PyTorch models)
'''


In [None]:
#Create a streaming callback handler
from langchain_core.callbacks import BaseCallbackHandler

class StreamingStdOutCallbackHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs):
        print(token, end="", flush=True)


In [None]:
from langchain_openai import AzureChatOpenAI

streaming_llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("API_KEY"),
    api_version="2024-12-01-preview",
    deployment_name="gpt-4.1",
    temperature=0,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
)
# Note: here the callbacks are attached when the LLM is created; they can also be passed per call via config={"callbacks": [...]}

In [None]:
#Update RAG to support streaming
def rag_chain_streaming(
    query,
    vectorstore,
    llm,
    prompt,
    output_parser,
    k=5,
):
    query_embedding = hf_embed.embed_query(query)
    results = vectorstore.similarity_search_with_score_by_vector(
        query_embedding, k=k
    )

    docs = [doc for doc, score in results]
    scores = [score for doc, score in results]

    context = "\n".join(doc.page_content for doc in docs)

    confident = (
        len(docs) > 0
        and min(scores) <= MAX_SCORE_THRESHOLD
        and len(context) >= MIN_CONTEXT_LENGTH
    )

    if confident:
        print("\nAnswer (FAISS):\n")

        messages = prompt.format_messages(
            question=query,
            context=context
        )

        llm.invoke(messages)  # streams automatically
        print("\n")

        return

    # ---- Web fallback ----
    print("\nAnswer (Web Search):\n")

    web_results = web_search(query)
    web_context, _ = build_web_context(web_results)

    messages = prompt.format_messages(
        question=query,
        context=web_context
    )

    llm.invoke(messages)
    print("\n")


In [None]:
#Run it now
questions = [
    "What are the key points from the State Of The Union?",
    "How is the United States supporting Ukraine economically and militarily?"
]

for q in questions:
    print("\n==============================")
    print("Q:", q)
    rag_chain_streaming(
        q,
        vectorstore,
        streaming_llm,
        prompt,
        output_parser,
    )



Q: What are the key points from the State Of The Union?

Answer (Web Search):

Key points from the State of the Union include a focus on the health of the economy and immigration.


Q: How is the United States supporting Ukraine economically and militarily?

Answer (FAISS):

The United States is supporting Ukraine by providing over $1 billion in direct economic assistance, military aid, and humanitarian support, while also enforcing powerful economic sanctions against Russia.



In [None]:
#Source Attribution & Citations (FAISS + Web)
'''Right now your system:

  -- Retrieves documents
  -- Answers correctly
  -- Streams responses

  But it does not explain where the answer came from.

  In Next Step we will:
  -- Track document metadata
  -- Inject sources into the prompt
  -- Return answer + citations
  -- Work for FAISS and web fallback
  '''


In [None]:
'''
A citation is not a URL necessarily.
  It can be:
    File name
    Document ID
    Page number
    Chunk index
    Web URL
Your FAISS docs already support this via Document.metadata
'''

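
To show how metadata can drive citations, the sketch below labels each chunk with a `[Source N: ...]` tag and then maps the `[Source N]` markers in an answer back to source names, ignoring numbers that do not correspond to a retrieved chunk. Plain dictionaries stand in for LangChain `Document` objects, and the helper names, the `chunk_index` values, and the sample answer are invented for illustration.

```python
import re


def tag_chunks(chunks: list[dict]) -> str:
    """Prefix each chunk with a [Source N: ...] label built from its metadata."""
    tagged = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk["metadata"].get("source", "Unknown")
        tagged.append(f"[Source {i}: {source}]\n{chunk['text']}")
    return "\n\n".join(tagged)


def cited_sources(answer: str, chunks: list[dict]) -> list[str]:
    """Map [Source N] markers in an answer to source names, skipping invalid N."""
    numbers = [int(n) for n in re.findall(r"\[Source (\d+)\]", answer)]
    valid = [n for n in dict.fromkeys(numbers) if 1 <= n <= len(chunks)]
    return [chunks[n - 1]["metadata"].get("source", "Unknown") for n in valid]


# Illustrative data; the chunk_index values are made up.
chunks = [
    {"text": "We are giving more than $1 Billion in direct assistance to Ukraine.",
     "metadata": {"source": "state_of_union.txt", "chunk_index": 12}},
    {"text": "We are right now enforcing powerful economic sanctions.",
     "metadata": {"source": "state_of_union.txt", "chunk_index": 13}},
]

print(tag_chunks(chunks))
print(cited_sources("Over $1 billion in aid [Source 1], plus sanctions [Source 2] [Source 7].", chunks))
```

Checking cited numbers against the retrieved chunks is a cheap way to catch citations the model invented.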

In [None]:
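#Ensure metadata exists
# Note: this only tags the chunks currently held in similar_docs (the last retrieval result),
# and chunk_id is the rank within that result, not the chunk's position in the document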

for i, doc in enumerate(similar_docs):
    doc.metadata["source"] = "State of the Union Address"
    doc.metadata["chunk_id"] = i


In [None]:
#Modify context construction to include source tags
def build_context_with_sources(docs):
    context_chunks = []

    for i, doc in enumerate(docs):
        source = doc.metadata.get("source", "Unknown")
        chunk_id = doc.metadata.get("chunk_id", i)

        context_chunks.append(
            f"[Source {i+1}: {source}, chunk {chunk_id}]\n{doc.page_content}"
        )

    return "\n\n".join(context_chunks)


In [None]:
#Update prompt to Require citations
template = """
You are an assistant for question-answering tasks.

Use ONLY the provided context to answer.
Cite sources using the format [Source X].

If the answer is not contained in the context, say you don't know.

Question: {question}

Context:
{context}

Answer (with citations):
"""


In [None]:
prompt = ChatPromptTemplate.from_template(template)

In [None]:
#Update RAG Chain
def rag_chain_with_citations(
    query,
    vectorstore,
    llm,
    prompt,
    k=5,
):
    query_embedding = hf_embed.embed_query(query)
    results = vectorstore.similarity_search_with_score_by_vector(
        query_embedding, k=k
    )

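    docs = [doc for doc, _ in results]  # scores are not needed here

    context = build_context_with_sources(docs)

    messages = prompt.format_messages(
        question=query,
        context=context
    )

    print("\n Answer:\n")
    response = llm.invoke(messages)  # tokens print as they arrive only if llm has a streaming callback
    print("\n")

    return response.content  # lets callers print or post-process the cited answer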


In [None]:
def build_web_context_with_sources(web_results):
    # Named differently from build_web_context above, which returns (context, sources)
    # and is still used by rag_chain_streaming
    chunks = []

    for i, r in enumerate(web_results):
        chunks.append(
            f"[Source {i+1}: {r['source']}]\n{r['content']}"
        )

    return "\n\n".join(chunks)

In [None]:
#Testing
doc = vectorstore.similarity_search("test")[0]
print(doc.metadata)

{'source': '/content/state_of_union.txt'}


In [None]:
#Test retrieval with scores
query = "How is the United States supporting Ukraine?"
query_embedding = hf_embed.embed_query(query)

results = vectorstore.similarity_search_with_score_by_vector(
    query_embedding, k=3
)

for i, (doc, score) in enumerate(results):
    print(f"\n--- Document {i+1} ---")
    print("Score:", score)
    print("Source:", doc.metadata.get("source"))
    print(doc.page_content[:300])



--- Document 1 ---
Score: 0.51034504
Source: State of the Union Address
The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. 

Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistan

--- Document 2 ---
Score: 0.6614249
Source: State of the Union Address
Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. 

We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now

--- Document 3 ---
Score: 0.7009914
Source: State of the Union Address
And a proud Ukrainian people, who have known 30 years  of independence, have repeatedly shown that they will not tolerate anyone who tries to take their country backwards.  

To all

In [None]:
#Test context construction with source tags
context = build_context_with_sources([doc for doc, _ in results])
print(context)

[Source 1: State of the Union Address, chunk 0]
The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. 

Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistance. 

We are giving more than $1 Billion in direct assistance to Ukraine. 

And we will continue to aid the Ukrainian people as they defend their country and to help ease their suffering.

[Source 2: State of the Union Address, chunk 1]
Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. 

We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. 

Together with our allies –we are right now enforcing powerful economic s

In [None]:
#Test prompt formatting
messages = prompt.format_messages(
    question=query,
    context=context
)

for m in messages:
    print(type(m).__name__)
    print(m.content[:300])


HumanMessage

You are an assistant for question-answering tasks.

Use ONLY the provided context to answer.
Cite sources using the format [Source X].

If the answer is not contained in the context, say you don't know.

Question: How is the United States supporting Ukraine?

Context:
[Source 1: State of the Union 


In [None]:
#Test LLM Citation behaviour
response = llm.invoke(messages)
print(response.content)

The United States is supporting Ukraine by providing military assistance, economic assistance, and humanitarian assistance. Specifically, the U.S. is giving more than $1 billion in direct assistance to Ukraine. Additionally, the U.S., together with its allies, is enforcing powerful economic sanctions on Russia to inflict pain on the Russian economy and support the people of Ukraine [Source 1][Source 2].


In [None]:
#Test Full RAG
result = rag_chain_with_citations(
    "How is the United States supporting Ukraine economically and militarily?",
    vectorstore,
    llm,
    prompt,
    k=5
)



 Answer:





In [None]:
#Here we add structured JSON output, ranking sources, and highlighting exact sentences.
'''
-- Return JSON instead of free text
-- Include retrieval scores for each chunk
-- Optionally, highlight which sentences were used in the answer
-- Compatible with both FAISS and web fallback
'''



In [None]:
def rag_chain_structured(
    query,
    vectorstore,
    llm,
    prompt,
    output_parser,
    k=5,
):
    """
    Hybrid RAG pipeline with structured JSON output, source citations, and web fallback.

    Args:
        query (str): User question
        vectorstore: FAISS or similar retriever
        llm: AzureChatOpenAI LLM
        prompt: ChatPromptTemplate
        output_parser: StrOutputParser or similar
        k (int): Number of top documents to retrieve

    Returns:
        dict: {
            "answer": str,
            "confidence": "high"|"medium"|"low",
            "source_type": "vectorstore"|"web"|"none",
            "sources": list of dicts {
                "source": str,
                "score": float|None,
                "text": str
            }
        }
    """
    # --- Step 1: FAISS retrieval ---
    query_embedding = hf_embed.embed_query(query)
    results = vectorstore.similarity_search_with_score_by_vector(
        query_embedding, k=k
    )

    docs = [doc for doc, score in results]
    scores = [score for doc, score in results]

    context = "\n".join(f"{doc.page_content}" for doc in docs)

    # --- Step 2: Confidence check ---
    confident = len(docs) > 0 and min(scores) <= 0.6 and len(context) >= 300

    if confident:
        sources = [
            {
                "source": doc.metadata.get("source", "unknown"),
                "score": score,
                "text": doc.page_content
            }
            for doc, score in results
        ]
        source_type = "vectorstore"
        confidence = "high"
    else:
        # --- Step 3: Web fallback ---
        web_results = web_search(query)
        if web_results:
            context = "\n".join(r.get('content', '') for r in web_results)
            sources = [
                {
                    # web results in this notebook carry a 'source' key; fall back to link/url
                    "source": r.get("source") or r.get("link") or r.get("url") or "unknown",
                    "score": None,
                    "text": r.get("content", "")
                }
                for r in web_results
            ]
            source_type = "web"
            confidence = "medium"
        else:
            # No info available
            return {
                "answer": "I don't know based on available information.",
                "confidence": "low",
                "source_type": "none",
                "sources": [],
            }

    # --- Step 4: Prompt formatting and LLM call ---
    messages = prompt.format_messages(
        question=query,
        context=context
    )

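    llm_output = llm.invoke(messages)
    # invoke() accepts the chat message and returns its text content;
    # parse() expects a plain string and would pass the message object through unchanged
    answer = output_parser.invoke(llm_output)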

    # --- Step 5: Return structured JSON ---
    return {
        "answer": answer,
        "confidence": confidence,
        "source_type": source_type,
        "sources": sources,
    }
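
The plan for this chain lists optional highlighting of the sentences used in the answer, which the function above does not implement. A minimal sketch of one way to approximate it follows, using plain word overlap between the answer and each source sentence. The function name `highlight_supporting_sentences` and the `min_overlap` threshold are hypothetical and not part of the notebook; a production system would more likely use embeddings or ask the LLM to quote its evidence.

```python
import re


def highlight_supporting_sentences(answer, source_text, min_overlap=0.5):
    """Return sentences from source_text whose words mostly appear in the answer.

    Hypothetical helper: a crude lexical-overlap heuristic, not an attribution method.
    """
    def tokens(text):
        # Lowercased alphanumeric words, as a set
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    answer_tokens = tokens(answer)
    highlighted = []
    # Split on whitespace that follows sentence-ending punctuation
    for sentence in re.split(r"(?<=[.!?])\s+", source_text.strip()):
        sentence_tokens = tokens(sentence)
        if not sentence_tokens:
            continue
        overlap = len(sentence_tokens & answer_tokens) / len(sentence_tokens)
        if overlap >= min_overlap:
            highlighted.append(sentence)
    return highlighted


answer = "The U.S. is giving more than $1 billion in direct assistance to Ukraine."
source = (
    "We are giving more than $1 Billion in direct assistance to Ukraine. "
    "Putin is now isolated from the world more than ever."
)
print(highlight_supporting_sentences(answer, source))
```

The result could be attached to each entry in `sources` under an extra key, leaving the existing fields unchanged.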


In [None]:
#Finally test your production grade RAG chain
questions = [
    "What are the key points from the State Of The Union?",
    "How is the United States supporting Ukraine economically and militarily?"
]

for q in questions:
    result = rag_chain_structured(q, vectorstore, llm, prompt, output_parser)
    print("\n==============================")
    print("Q:", q)
    print("Answer:", result["answer"])
    print("Confidence:", result["confidence"])
    print("Source Type:", result["source_type"])
    print("Sources:")
    for s in result["sources"]:
        print("-", s["source"], "(score:", s["score"], ")")



Q: What are the key points from the State Of The Union?
Answer: content="The key points from the State of the Union, ahead of President Joe Biden's third address, are that Americans are focused on the health of the economy and immigration [Source 1]." additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 38, 'prompt_tokens': 177, 'total_tokens': 215, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4.1-2025-04-14', 'system_fingerprint': 'fp_f99638a8d7', 'id': 'chatcmpl-Cz6P40pQO9Yk37hGxqsya0E8viODZ', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered':

# **Step 7: Load and configure a quantized language model**

Load a quantized version of a large language model (Falcon3-1B-Base) for efficient and cost-effective text generation.

**Generation Step**: This model is responsible for generating the final answer. It takes the prompt (which includes the retrieved context) and produces a response, completing the RAG pipeline.

**Efficiency**: 4-bit quantization cuts the memory needed for the model weights to roughly a quarter of the 16-bit footprint, usually at a small cost in output quality, which helps when deploying RAG systems on limited hardware.
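
To make the efficiency point concrete, here is a back-of-the-envelope sketch of weight memory at different precisions. It assumes a nominal 1 billion parameters (the actual parameter count of the model may differ) and ignores activations, the KV cache, and quantization metadata, so real usage will be somewhat higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 1e9  # assumption: a nominal 1B-parameter model

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```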

In [None]:
#Using transformers
'''
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

MODEL_NAME = "tiiuae/Falcon3-1B-Base"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()'''



In [None]:
from langchain_core.messages import (
    HumanMessage,
    SystemMessage,
    AIMessage,
)

prompt2 = ChatPromptTemplate.from_messages([
    SystemMessage(
        content="You are a knowledgeable assistant. Answer based on the retrieved context."
    ),
    ("human", "{context}\nQuestion: {question}")
])

In [None]:
# model.eval() tells a locally loaded neural network to:
'''
Disable dropout
Disable training-only layers
Switch to inference mode

This is necessary because you own the model weights and execution'''

'''model.eval()
generation_config = model.generation_config
# Sampling temperature: lower values make output more deterministic (0.8 keeps some randomness)
generation_config.temperature = 0.8
# Set number of returned sequences to 1
generation_config.num_return_sequences = 1
# Set maximum new tokens per response
generation_config.max_new_tokens = 256
# Disable token caching
generation_config.use_cache = False
# Set repetition penalty for more diverse responses
generation_config.repetition_penalty = 1.7
# Enable sampling for temperature to take effect
generation_config.do_sample = True
# Define pad and EOS token IDs
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id'''



In [None]:
'''from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
)'''
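
The generation settings above pair `do_sample = True` with a temperature of 0.8. As a rough illustration of what temperature does, the sketch below applies a temperature-scaled softmax to a few made-up logits; the values are hypothetical and this is not the code path `transformers` runs internally. Lower temperatures concentrate probability on the top token, while fully deterministic decoding is usually obtained by turning sampling off rather than by setting the temperature to 0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```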

# **Step 8: Set up the generation pipeline and chain the components**

Build an end-to-end pipeline that seamlessly connects document retrieval with text generation.

**Integration**: The chain uses the retriever to fetch context, applies the prompt template to integrate the query with the retrieved context, and then passes the final prompt to the LLM for answer generation.

**Pipeline composition**: Using the pipe operator (|), the components are elegantly chained together to perform a complete RAG operation in one go.
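
As an illustration of how pipe-style chaining works in principle, here is a toy sketch in plain Python. The `Step` class and the four stages are hypothetical stand-ins, not LangChain's `Runnable` implementation; they only show how `a | b` can build a new component that feeds the output of `a` into `b`.

```python
class Step:
    """Toy chainable component (a stand-in, not LangChain's Runnable)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` returns a new step that runs a, then passes its output to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stages mirroring retrieve -> prompt -> generate -> parse
retrieve = Step(lambda q: {"context": "Ukraine receives aid.", "question": q})
build_prompt = Step(lambda d: f"{d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda p: {"content": f"ANSWER based on: {p}"})
parse = Step(lambda m: m["content"])

chain = retrieve | build_prompt | fake_llm | parse
print(chain.invoke("How is the US helping?"))
```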

In [None]:
'''from langchain_community.llms import HuggingFacePipeline  # langchain.llms is the deprecated import path

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Create the HuggingFacePipeline object
llm_pipeline = HuggingFacePipeline(pipeline=pipe)'''

Device set to use cuda:0


In [None]:
'''rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()}
    | prompt
    | llm_pipeline
    | output_parser
)'''

# **Step 9: Invoke the pipeline with a query**

Execute the entire RAG pipeline with a sample query.

**Final output**: The pipeline retrieves relevant chunks from the document, forms a context-rich prompt, and the LLM generates a concise answer based on that context.

**End-to-end flow**: This step demonstrates the full cycle of RAG—retrieval and augmented generation—in action.

In [None]:
'''result = rag_chain.invoke("How is the United States supporting Ukraine economically and militarily?")'''



In [None]:
'''result'''

'In order to provide financial or military resources directly towards helping those affected by conflict situations such as war crimes investigations can also involve funding humanitarian organizations working within these areas which may include medical teams assisting victims during conflicts; this would fall under international law regarding human rights protection measures when dealing specifically about how funds should ideally go if there exists any form of violence occurring between different groups living together peacefully but still having disagreements over territory/resources etc., so it might make sense depending upon specific circumstances whether certain types of donations made via official channels provided through government agencies responsible overseeing security forces operating near borders where clashes often occur due lack thereof proper communication systems being established beforehand among both sides involved making sure everyone understands exactly why money

# Conclusion

This RAG (retrieval-augmented generation) pipeline shows how to combine retrieval with generative AI to produce answers grounded in a source document. The steps are: setting up the environment, loading and splitting the document, generating embeddings, building a FAISS vector store, and creating a retriever that returns the most relevant chunks. A prompt template then directs the language model to answer from the retrieved context and cite its sources, and the structured chain adds retrieval scores, a confidence label, and a web fallback. The final steps swap in a 4-bit quantized local model to reduce resource usage; as the sample output shows, answer quality then depends on the chosen model as much as on retrieval. The same structure can be adapted to other domains by replacing the knowledge source, the embedding model, or the LLM.