<a href="https://colab.research.google.com/github/fralfaro/clei2025-llm/blob/main/docs/dia3/notebooks/mcp_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation (RAG)

In [21]:
from dotenv import load_dotenv
import pandas as pd
import json
from dotenv import load_dotenv
import os 
from IPython.display import display, Markdown
import pprint

In [22]:
from langchain.document_loaders.pdf import PyPDFDirectoryLoader 
from langchain.text_splitter import RecursiveCharacterTextSplitter 
from langchain_openai import OpenAIEmbeddings 
from langchain.schema import Document 
from langchain_chroma import Chroma # This is a Chroma wrapper from Langchain
from langchain_openai import ChatOpenAI # Import OpenAI LLM
from langchain_core.vectorstores import InMemoryVectorStore
from langchain import hub
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.document_loaders import PyPDFLoader

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_huggingface import HuggingFaceEmbeddings


In [3]:
# %pip install pypdf langchain-huggingface sentence-transformers

In [23]:
# Carga de variables de ambiente desde el archivo .env
load_dotenv()

True

In [24]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") # Local model

In [25]:
llm_model = os.environ["OPENAI_MODEL"]
print(llm_model)
llm = ChatOpenAI(model=llm_model, temperature=0.1)

gpt-4o-mini


In [26]:
llm.invoke("Hello, world!")  # Test LLM

AIMessage(content='Hello! How can I assist you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 11, 'total_tokens': 20, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CTxMQ5zsoZjCoyBmnpgmEW5yFn4v5', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--13c610c5-fb4a-450a-bc26-47bda60da95e-0', usage_metadata={'input_tokens': 11, 'output_tokens': 9, 'total_tokens': 20, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

El patron de prompt típico de RAG considera una pregunta y un *context*

In [27]:
# Example for a public prompt (https://smith.langchain.com/hub/rlm/rag-prompt)
rag_prompt = hub.pull("rlm/rag-prompt", include_model=True)
rag_prompt_template = rag_prompt.messages[0].prompt
rag_prompt_template.model_dump() # Pydantic object in JSON format



{'name': None,
 'input_variables': ['context', 'question'],
 'optional_variables': [],
 'output_parser': None,
 'partial_variables': {},
 'metadata': None,
 'tags': None,
 'template': "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:",
 'template_format': 'f-string',
 'validate_template': False}

In [28]:
pprint.pprint(rag_prompt_template.template) # El prompt en sí del RAG

('You are an assistant for question-answering tasks. Use the following pieces '
 "of retrieved context to answer the question. If you don't know the answer, "
 "just say that you don't know. Use three sentences maximum and keep the "
 'answer concise.\n'
 'Question: {question} \n'
 'Context: {context} \n'
 'Answer:')


Si se quisiera, se puede hacer un prompt más custom, siempre respetando la estructura básica y la variable *context*

In [9]:
from langchain_core.prompts import PromptTemplate
from langchain_core.prompts import ChatPromptTemplate

TEMPLATE = """You are a helpful AI assistant for question-answering tasks. \
    Use the following pieces of retrieved context to answer the question. \
    If you don't know the answer, just say that you don't know, don't try to make up an answer

    Question: {question}
   
    Context: {context}
    
    Answer (in Spanish, if question is formulated in Spanish):
"""

# my_rag_prompt = PromptTemplate(
#     input_variables=["context", "question"],
#     template=TEMPLATE,
# )

my_rag_prompt = ChatPromptTemplate.from_template(TEMPLATE)



### Ingestion

In [29]:
# We consider a large PDF file
pdf_path = "./docs/aws-general.pdf"

loader = PyPDFLoader(pdf_path) # Tool to load and process a PDF file
pdf_documents = loader.load() # Each document corresponds actually to a page
print(len(pdf_documents), "loaded")

3123 loaded


In [30]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100)

texts = text_splitter.split_documents(pdf_documents)
print(len(texts), "chunks")

4902 chunks


In [None]:
# We use a simple vector store for the chunks
vectorstore_chroma = Chroma(
        collection_name="aws_collection",
        embedding_function=embeddings,
        persist_directory="./aws_chroma_db" # Optional: specify a directory to persist your data
    )
vectorstore_chroma.add_documents(texts)

retriever = vectorstore_chroma.as_retriever(search_kwargs={"k": 7})

In [31]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"{i + 1}. Document {d.id}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

test_query = "I want to understand the main AWS services related to compute and storage, and how they integrate with each other."

# Solamente se está haciendo recuperación
context_docs =  retriever.invoke(input=test_query)
pretty_print_docs(context_docs)

1. Document c4996ad7-8d6e-4676-8057-ad501418e603:

AWS General Reference Reference guide
Service endpoints
The following sections describe the service endpoints for AWS AppConﬁg. AWS AppConﬁg uses
control plane APIs for setting up and conﬁguring AWS AppConﬁg applications, environments, 
conﬁguration proﬁles, and deployment strategies. AWS AppConﬁg uses the AWS AppConﬁg Data 
service to call data plane APIs for retrieving stored conﬁgurations.
Topics
• Control plane endpoints
• Data plane endpoints
Control plane endpoints
The following table contains AWS Region-speciﬁc endpoints that AWS AppConﬁg supports for 
control plane operations. Control plane operations are used for creating, updating, and managing 
conﬁguration data. For more information, see AWS AppConﬁg operations in the AWS AppConﬁg API 
Reference.
Region 
Name
Region Endpoint Protocol
US East 
(Ohio)
us-east-2 appconﬁg.us-east-2.amazonaws.com
appconﬁg-ﬁps.us-east-2.api.aws
appconﬁg-ﬁps.us-east-2.amazonaws.com
appconﬁg.us-eas

In [32]:
# Conexión entre el retriever y el LLM usando el prompt de RAG
rag_chain = {"context": retriever,  "question": RunnablePassthrough()} | rag_prompt | llm | StrOutputParser()

# query = "I want to understand the main AWS services related to compute and storage, and how they integrate with each other."
# query = "I want to know more about the endpoints of CloudWatch and how to use them."
query = "Quiero saber sobre los endpoints de CloudWatch y cómo utilizarlos."
result = rag_chain.invoke(input=test_query)
# pprint.pprint(result)
display(Markdown(result))

The main AWS services related to compute include Amazon EC2 (Elastic Compute Cloud) for scalable virtual servers and AWS Lambda for serverless computing. For storage, Amazon S3 (Simple Storage Service) provides scalable object storage, while Amazon EBS (Elastic Block Store) offers block storage for EC2 instances. These services integrate seamlessly, allowing EC2 instances to use EBS for persistent storage and S3 for data storage and retrieval.

### Re-Ranking

Es un paso adicional para intentar maximizar la chance de recuperar chunks relevantes a la pregunta, que pueden haber quedado más abajo en el ranking original del retriever. En este ejemplo se usa una estrategia de cross-encoding (de HuggingFace), pero existen otras estrategias e incluso mediante un prompt al LLM.

In [33]:
# Initialize the cross encoder
model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")

# Create a reranker compressor
compressor = CrossEncoderReranker(model=model, top_n=3)

# Wrap your base retriever with the compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

In [34]:
# Use the compression retriever
compressed_docs = compression_retriever.invoke(test_query)
pretty_print_docs(compressed_docs)

1. Document 7906d49c-fb51-4abe-8e29-18792aa89b58:

AWS General Reference Reference guide
Amazon Simple Storage Service endpoints and quotas
To connect programmatically to an AWS service, you use an endpoint. AWS services oﬀer the 
following endpoint types in some or all of the AWS Regions that the service supports: IPv4 
endpoints, dual-stack endpoints, and FIPS endpoints. Some services provide global endpoints. For 
more information, see AWS service endpoints.
Service quotas, also referred to as limits, are the maximum number of service resources or 
operations for your AWS account. For more information, see AWS service quotas.
The following are the service endpoints and service quotas for this service.
Service endpoints
Amazon S3 endpoints
When you use the REST API to send requests to the endpoints shown in the following table, you 
can use the virtual-hosted style and path-style methods. For more information, see Virtual hosting 
of buckets.
Note
Some Regions support legacy endpoint

In [35]:
rag_chain1 = (
    {"context": compression_retriever,  "question": RunnablePassthrough()} 
    | rag_prompt 
    | llm
    | StrOutputParser()
)

result = rag_chain.invoke(test_query)
# pprint.pprint(result)
display(Markdown(result))

The main AWS services related to compute include Amazon EC2 (Elastic Compute Cloud) for scalable computing capacity and AWS Lambda for serverless computing. For storage, Amazon S3 (Simple Storage Service) provides scalable object storage, while Amazon EBS (Elastic Block Store) offers block storage for EC2 instances. These services integrate seamlessly, allowing EC2 instances to access data stored in S3 or EBS, facilitating efficient data processing and storage management.

---