# <a id='toc1_'></a>[RAG example](#toc0_)

> **📚 Sources:** 
* https://python.langchain.com/docs/tutorials/rag/
* https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/ (legacy)

**Table of contents**<a id='toc0_'></a>    
- [Architecture: breaking down each RAG component with LangChain](#toc2_)    
  - [Indexing](#toc2_1_)    
    - [Load](#toc2_1_1_)    
    - [Split](#toc2_1_2_)    
    - [Store](#toc2_1_3_)    
  - [Retrieval](#toc2_2_)    
  - [Generation](#toc2_3_)    
- [RAG Pipeline: Production-Oriented Syntax](#toc3_)    
  - [When to use LCEL, LangGraph, or built-in function calls](#toc3_1_)    
  - [LCEL](#toc3_2_)    
  - [LangGraph](#toc3_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[Architecture: breaking down each RAG component with LangChain](#toc0_)
We’ll create a typical RAG application, which has two main components:

**Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.  
**Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The full sequence from raw data to answer will look like:

**Indexing**
1. **Load**: First we need to load our data. We’ll use [DocumentLoaders](https://python.langchain.com/docs/concepts/document_loaders/) for this.
2. **Split**: [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/) break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.
3. **Store**: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/vectorstores/) and [Embeddings](https://python.langchain.com/docs/how_to/embed_text/) model.

**Retrieval and generation**
1. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a [Retriever](https://python.langchain.com/docs/concepts/retrievers/).
2. **Generate**: A [ChatModel](https://python.langchain.com/docs/concepts/chat_models/) / [LLM](https://python.langchain.com/docs/modules/model_io/llms/) produces an answer using a prompt that includes the question and the retrieved data.


In [None]:
%load_ext autoreload
%autoreload 2

Load environment variables from `.env` file.

In [None]:
import os
from pathlib import Path

from dotenv import load_dotenv

os.chdir(Path.cwd().joinpath(".."))
print(Path.cwd())
load_dotenv(override=True)

Load Python dependencies

In [None]:
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from lib.config import VECTOR_STORE_PATH

## <a id='toc2_1_'></a>[Indexing](#toc0_)

### <a id='toc2_1_1_'></a>[Load](#toc0_)

We need to first load the PDF contents. We can use [DocumentLoaders](https://python.langchain.com/docs/concepts/document_loaders/) for this, which are objects that load in data from a source and return a list of [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html). A `Document` is an object with some `page_content` (str) and `metadata` (dict).

We will load the PDFs with `pypdf` into a list of documents, where each document holds the content of one page plus metadata such as the page number.
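
To make the shape of a loaded page concrete, here is a minimal sketch using a hypothetical `ToyDocument` dataclass. It is only a stand-in for the two fields described above, not LangChain's `Document` class, and the metadata keys and values are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyDocument:
    """Hypothetical stand-in with the two fields described above."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


# One entry per PDF page; file name and page numbers are made up for illustration.
toy_pages = [
    ToyDocument("Text of the first page.", {"source": "example.pdf", "page": 0}),
    ToyDocument("Text of the second page.", {"source": "example.pdf", "page": 1}),
]

for doc in toy_pages:
    print(doc.metadata["page"], doc.page_content)
```

The real loader output can be inspected the same way, through `page_content` and `metadata`.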

In [None]:
loader = PyPDFDirectoryLoader("data/1_docs")
pages = loader.load()
len(pages)

In [None]:
pages[0]

In [None]:
pages[0].__dict__

### <a id='toc2_1_2_'></a>[Split](#toc0_)

Our loaded documents can be too long to fit in the context window of many models. Even models that could fit a full document in their context window can struggle to find information in very long inputs.

To handle this we’ll split each `Document` into chunks for embedding and vector storage. This should help us retrieve only the most relevant parts of the PDFs at run time.

In this case we’ll split our documents into chunks of 2000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/recursive_text_splitter/), which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set `add_start_index=True` so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute “start_index”.
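
To illustrate what chunk size, overlap, and start index mean, here is a toy fixed-window splitter. It is a deliberate simplification: `RecursiveCharacterTextSplitter` splits on separators, which this sketch does not attempt. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Fixed-window splitter: a simplification, not the recursive algorithm."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk starts with the last three characters of the previous one, so a statement cut at a chunk boundary still appears next to some of its context.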

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.split_documents(pages)
len(all_splits)

In [None]:
all_splits[0]

In [None]:
all_splits[1]

### <a id='toc2_1_3_'></a>[Store](#toc0_)

Now we need to index our text chunks so that we can search over them at run time. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding. One of the simplest similarity measures is cosine similarity: we measure the cosine of the angle between the query embedding and each stored embedding (both are high-dimensional vectors).
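
As a rough sketch of that search step, the snippet below ranks a few hand-written three-dimensional vectors by cosine similarity to a query vector. The vectors, the names `toy_store` and `top_k`, and the brute-force ranking are assumptions for illustration. Real embeddings have far more dimensions, and a vector store may use a different metric or an approximate index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k stored keys whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda name: cosine_similarity(query, store[name]), reverse=True)
    return ranked[:k]


# Made-up "embeddings" for three chunks.
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_store, k=2))
```

Here `chunk_a` and `chunk_b` point in nearly the same direction as the query, so they rank ahead of `chunk_c`.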

We can embed and store all of our document splits using the [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) vector store and the [AzureOpenAIEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/) model: we create the embeddings client, create the vector store, then add the splits.

In [None]:
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("EMBEDDINGS_AZURE_OPENAI_ENDPOINT"),
    openai_api_key=os.getenv("EMBEDDINGS_AZURE_OPENAI_API_KEY"),
    deployment=os.getenv("EMBEDDINGS_AZURE_OPENAI_DEPLOYMENT_NAME"),
)

In [None]:
vectorstore = Chroma(embedding_function=embeddings, persist_directory=f"{VECTOR_STORE_PATH}/1_chroma_db")

In [None]:
vectorstore.add_documents(all_splits)

## <a id='toc2_2_'></a>[Retrieval](#toc0_)

Now let’s write the actual application logic. We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

First we need to define our logic for searching over documents. LangChain defines a [Retriever](https://python.langchain.com/docs/concepts/retrievers/) interface which wraps an index that can return relevant `Documents` given a string query.

The most common type of `Retriever` is the [VectorStoreRetriever](https://python.langchain.com/docs/how_to/vectorstore_retriever/), which uses the similarity search capabilities of a vector store to facilitate retrieval. Any `VectorStore` can easily be turned into a `Retriever` with `VectorStore.as_retriever()`:

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

In [None]:
question = "Describe the architecture of Transformers."

retrieved_docs = retriever.invoke(question)
len(retrieved_docs)

In [None]:
retrieved_docs[0]

## <a id='toc2_3_'></a>[Generation](#toc0_)

Let’s put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.

We first define the LLM.

In [None]:
llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("LLM_AZURE_OPENAI_ENDPOINT"),
    openai_api_key=os.getenv("LLM_AZURE_OPENAI_API_KEY"),
    openai_api_version=os.getenv("LLM_AZURE_OPENAI_API_VERSION"),
    deployment_name=os.getenv("LLM_AZURE_OPENAI_DEPLOYMENT_NAME"),
    temperature=0.0,
    max_tokens=1024,
    timeout=120,
)

In [None]:
llm.invoke("Who are you?")

Then we define the prompt.

In [None]:
PROMPT_TEMPLATE = """You will be given a mix of text. Use this information to \
provide an answer to the user question.
Question:
{question}
Context:
{context}
"""

prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
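
To show what filling the template amounts to, here is a sketch that uses plain `str.format` on a shortened, hypothetical template with made-up chunk texts. `ChatPromptTemplate` additionally wraps the rendered text in a chat message, which this sketch leaves out.

```python
# Shortened, hypothetical template with the same two placeholders.
TOY_TEMPLATE = """Use this information to answer the user question.
Question:
{question}
Context:
{context}
"""

retrieved_texts = ["First retrieved chunk.", "Second retrieved chunk."]  # made-up text
toy_context = "\n\n".join(retrieved_texts)  # chunks separated by a blank line
rendered = TOY_TEMPLATE.format(question="What is in the chunks?", context=toy_context)
print(rendered)
```

Printing the rendered string is a quick way to check exactly what text the model will receive.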

We can now run all the steps sequentially to obtain an answer to our question.

In [None]:
question = "Describe the architecture of Transformers in 100 words."

retrieved_docs = retriever.invoke(question)
docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
prompt_inferred = prompt.invoke({"question": question, "context": docs_content})
answer = llm.invoke(prompt_inferred)

In [None]:
print(answer)

# <a id='toc3_'></a>[RAG Pipeline: Production-Oriented Syntax](#toc0_)

Once we reach this point, we have built a **basic RAG pipeline**.  
We will now introduce a **more production-oriented syntax**, which provides several advantages:

- **Support for multiple invocation modes**: without it, logic would need to be rewritten to enable streaming of output tokens or intermediate results.  
- **Built-in support for tracing** with LangSmith and for deployments with LangGraph Platform.  
- **Unified interface**: allows defining and running chains consistently, with built-in support for streaming, async execution, fallback models, typing, and runtime configuration.  
- **Automatic parallelization**: enables tasks to run in parallel, improving performance and user experience.  
- **Composability**: makes it easy to compose and modify chains, keeping code flexible and adaptable.  
- **(LangGraph only)**: supports persistence, human-in-the-loop workflows, and other advanced features.

We will now explore **two syntaxes** for building RAG pipelines: **LCEL** and **LangGraph**.

## <a id='toc3_1_'></a>[When to use LCEL, LangGraph, or built-in function calls](#toc0_)

*Source: https://python.langchain.com/docs/concepts/lcel/#should-i-use-lcel*

> LCEL is an orchestration solution -- it allows LangChain to handle run-time execution of chains in an optimized way.
>
> While we have seen users run chains with hundreds of steps in production, we generally recommend using LCEL for simpler orchestration tasks. When the application requires complex state management, branching, cycles or multiple agents, we recommend that users take advantage of LangGraph.
>
> In LangGraph, users define graphs that specify the application's flow. This allows users to keep using LCEL within individual nodes when LCEL is needed, while making it easy to define complex orchestration logic that is more readable and maintainable.
>
> Here are some guidelines:
>
> - If you are making a single LLM call, you don't need LCEL; instead call the underlying chat model directly.
> - If you have a simple chain (e.g., prompt + llm + parser, simple retrieval set up etc.), LCEL is a reasonable fit, if you're taking advantage of the LCEL benefits.
> - If you're building a complex chain (e.g., with branching, cycles, multiple agents, etc.) use LangGraph instead. Remember that you can always use LCEL within individual nodes in LangGraph.


## <a id='toc3_2_'></a>[LCEL](#toc0_)

We’ll use the [LCEL](https://python.langchain.com/docs/expression_language/) Runnable protocol to define the chain, allowing us to:

- pipe together components and functions in a transparent way
- automatically trace our chain in LangSmith
- get streaming, async, and batched calling out of the box.
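
The piping idea can be sketched in plain Python by overloading the `|` operator. The `ToyStep` class and `toy_parallel` helper below are hypothetical and far simpler than LangChain's runnables (no streaming, async, batching, or tracing). They only show how `a | b` can build a new step and how a dict of branches can fan one input out to several steps.

```python
from typing import Any, Callable


class ToyStep:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "ToyStep") -> "ToyStep":
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return ToyStep(lambda value: other.invoke(self.invoke(value)))


def toy_parallel(**branches: ToyStep) -> ToyStep:
    """Apply every branch to the same input (sequentially here) and return a dict."""
    return ToyStep(lambda value: {name: step.invoke(value) for name, step in branches.items()})


fake_retriever = ToyStep(lambda q: [f"doc about {q}"])
join_docs = ToyStep(lambda docs: "\n\n".join(docs))
passthrough = ToyStep(lambda value: value)
fake_prompt = ToyStep(lambda d: f"Question: {d['question']}\nContext: {d['context']}")

toy_chain = toy_parallel(context=fake_retriever | join_docs, question=passthrough) | fake_prompt
print(toy_chain.invoke("transformers"))
```

The same question string reaches both branches: one turns it into context, the other passes it through unchanged.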

In [None]:
from operator import itemgetter
from typing import List  # noqa: UP035

from langchain_core.runnables import Runnable, RunnableLambda, RunnableParallel


def format_docs(docs: List[Document]) -> str:  # noqa: UP006
    return "\n\n".join(doc.page_content for doc in docs)


def get_rag_chain() -> Runnable:
    return (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )


rag_chain = get_rag_chain()
rag_chain.get_graph().print_ascii()

Basic invocation:

In [None]:
question = "Describe the architecture of Transformers in 500 words."

rag_chain.invoke(question)

Invocation with streaming:

In [None]:
for chunk in rag_chain.stream(question):
    print(chunk, end="", flush=True)

Get detailed output:

In [None]:
def get_detailed_rag_chain() -> Runnable:
    return (
        RunnableParallel(
            {
                "source_documents": retriever,
                "question": RunnablePassthrough(),
            }
        )
        | RunnableParallel(
            {
                "source_documents": itemgetter("source_documents"),
                "context": RunnableLambda(itemgetter("source_documents")) | format_docs,
                "question": itemgetter("question"),
            }
        )
        | RunnableParallel(
            {
                "source_documents": itemgetter("source_documents"),
                "answer": prompt | llm | StrOutputParser(),
            }
        )
    )


rag_chain = get_detailed_rag_chain()
rag_chain.get_graph().print_ascii()

In [None]:
result = rag_chain.invoke(question)
print(result.keys())
result

A more concise way, using the `.assign` method:

In [None]:
def get_detailed_rag_chain() -> Runnable:
    return (
        RunnableParallel(
            {
                "source_documents": retriever,
                "question": RunnablePassthrough(),
            }
        )
        .assign(context=RunnableLambda(itemgetter("source_documents")) | format_docs)
        .assign(answer=prompt | llm | StrOutputParser())
    )


rag_chain = get_detailed_rag_chain()
rag_chain.get_graph().print_ascii()

In [None]:
result = rag_chain.invoke(question)
print(result.keys())
result

#### <a id='toc3_2_1_1_'></a>[Drawbacks of LCEL](#toc0_)

> Source: [Unleashing the Power of LCEL, from Proof of Concept to Production](https://medium.com/artefact-engineering-and-data-science/unleashing-the-power-of-langchain-expression-language-lcel-from-proof-of-concept-to-production-8ad8eebdcb1d)

Despite its advantages, LCEL does have some potential drawbacks:

* **Not fully PEP compliant**: LCEL does not fully respect PEP 20, the Zen of Python, which states that “explicit is better than implicit” (you can read PEP 20 by running `import this` in Python). LCEL’s syntax can also feel like a different language rather than idiomatic Python, which may make it less intuitive for some Python developers.
* **LCEL is a Domain-Specific Language (DSL)**: Users are expected to have some understanding of prompts, chains or LLMs in order to leverage the syntax efficiently.
* **Input / Output dependencies**: Intermediary inputs and final outputs must be passed down from the start to the end. For instance, if you want to use the output of an intermediate step as the final output, you must carry it through all subsequent steps. This can lead to extra arguments in most of your chains, which may not be used but are necessary if you want to access them through the output.
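
The input/output dependency point can be illustrated with plain functions that pass a dict from step to step. The step names and values below are made up. The sketch only shows that `source_documents` must be forwarded by every step if the caller wants it in the final output.

```python
def retrieve_step(state: dict) -> dict:
    docs = ["chunk 1", "chunk 2"]  # made-up retrieval result
    return {"question": state["question"], "source_documents": docs}


def format_step(state: dict) -> dict:
    # Forwards everything it received, even though it only adds "context".
    return {**state, "context": "\n\n".join(state["source_documents"])}


def answer_step(state: dict) -> dict:
    answer = f"(answer to {state['question']!r} from {len(state['source_documents'])} chunks)"
    # Forward source_documents again, otherwise the caller never sees them.
    return {"source_documents": state["source_documents"], "answer": answer}


toy_result = answer_step(format_step(retrieve_step({"question": "What is RAG?"})))
print(sorted(toy_result))
```

If any step dropped `source_documents`, the value would be lost to every later step and to the final output.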

## <a id='toc3_3_'></a>[LangGraph](#toc0_)

LangGraph extends LangChain with a graph-based framework for designing AI workflows, supporting stateful orchestration and multi-agent systems. It excels at complex, adaptive workflows and offers a more Pythonic syntax than LCEL for defining chains programmatically. For simpler applications, however, it can add overhead and latency.
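
The core idea of stateful orchestration, where each node reads a shared state and returns a partial update, can be sketched without LangGraph. The `run_sequence` helper and node functions below are hypothetical. Real graphs also support branching, cycles, and persistence, none of which is modelled here.

```python
from typing import Callable, TypedDict


class ToyState(TypedDict, total=False):
    question: str
    context: list[str]
    answer: str


def toy_retrieve(state: ToyState) -> dict:
    return {"context": [f"chunk about {state['question']}"]}  # made-up retrieval


def toy_generate(state: ToyState) -> dict:
    return {"answer": f"Based on {len(state['context'])} chunk(s)."}


def run_sequence(nodes: list[Callable[[ToyState], dict]], state: ToyState) -> dict:
    """Run nodes in order, merging each partial update into the shared state."""
    current = dict(state)
    for node in nodes:
        current.update(node(current))
    return current


final_state = run_sequence([toy_retrieve, toy_generate], {"question": "transformers"})
print(final_state)
```

Because every node writes into the same state, intermediate values such as `context` remain available at the end without being forwarded by hand.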

In [None]:
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict


# 1 - Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# 2 - Define application steps : "Nodes"
def retrieve(state: State) -> dict[str, List[Document]]:
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}


def generate(state: State) -> dict[str, str]:
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# 3 - Define edges between nodes
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")

# 4 - Compile the graph
graph = graph_builder.compile()

In [None]:
from IPython.display import Image, display

display(Image(graph.get_graph().draw_mermaid_png()))

In [None]:
response = graph.invoke({"question": "Describe the architecture of Transformers in 500 words."})
response

Streaming invocation:

In [None]:
for message, metadata in graph.stream(
    {"question": "Describe the architecture of Transformers in 500 words."}, stream_mode="messages"
):
    print(message.content, end="", flush=True)