# Build a Retrieval Augmented Generation (RAG) App
This file is a test that follows the tutorial described at: https://python.langchain.com/docs/tutorials/rag/

## Installation

```console
py -m venv venv
.\venv\Scripts\activate
pip install --quiet --upgrade langchain langchain-community langchain-chroma
pip install -qU langchain-openai langgraph
pip install bs4
pip install lxml
pip install pypdf
```

## LangSmith
Many of the applications you build with LangChain contain multiple steps, each with one or more LLM calls. As these applications get more complex, it becomes crucial to inspect what exactly is going on inside your chain or agent. LangSmith provides this kind of tracing; the environment variables below enable it and name the project the traces are logged under.

In [1]:
import os
import utils

os.environ["LANGCHAIN_TRACING_V2"] = utils.config["LANG"]["LANGCHAIN_TRACING_V2"]
os.environ["LANGCHAIN_API_KEY"] = utils.config["LANG"]["LANGCHAIN_API_KEY"]
os.environ["LANGCHAIN_PROJECT"] = "RAG-PMX-Agent"
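
The `utils` module is not shown in this notebook. As a minimal sketch, assuming `utils.config` is a `configparser.ConfigParser` loaded from an INI file with a `[LANG]` section (the file layout and the placeholder values below are hypothetical), the lookups above could behave like this:

```python
import configparser

# Hypothetical INI content; a real setup would read it from a file kept out of version control.
SAMPLE_INI = """
[LANG]
LANGCHAIN_TRACING_V2 = true
LANGCHAIN_API_KEY = placeholder-key
"""

config = configparser.ConfigParser()
# Preserve key case so stored keys match the upper-case names used in the notebook.
config.optionxform = str
config.read_string(SAMPLE_INI)

# configparser returns plain strings, which is what os.environ expects.
tracing_flag = config["LANG"]["LANGCHAIN_TRACING_V2"]
api_key = config["LANG"]["LANGCHAIN_API_KEY"]
print(tracing_flag)
```

Keeping keys in a separate config file, rather than in the notebook itself, avoids committing secrets together with the code.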


In [2]:
# Create a chat model

# os.environ["AZURE_OPENAI_ENDPOINT"] = utils.config["LLM"]["ENDPOINT"]
# os.environ["AZURE_OPENAI_API_KEY"] = utils.config["LLM"]["API_KEY"]
# os.environ["AZURE_OPENAI_API_VERSION"] = utils.config["LLM"]["API_VERSION"]
# os.environ["AZURE_OPENAI_DEPLOYMENT"] = utils.config["LLM"]["DEPLOYMENT"]

# from langchain_openai import AzureChatOpenAI

# llm = AzureChatOpenAI(
#     azure_endpoint=utils.config["LLM"]["ENDPOINT"],
#     azure_deployment=utils.config["LLM"]["DEPLOYMENT"],
#     openai_api_version=utils.config["LLM"]["API_VERSION"],
# )

os.environ["OPENAI_API_KEY"] = utils.config["LLM"]["OPENAI_API_KEY"]
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

## 1. Indexing: Load
First we need to load the source material: the HTML pages of the PMX user guide and the PDF files stored alongside it. We can use DocumentLoaders for this, which are objects that load data from a source and return a list of Documents. A Document is an object with some page_content (str) and metadata (dict).

In this case we use BSHTMLLoader, which parses each local HTML file with BeautifulSoup and stores the extracted text in page_content and the page title in metadata, and PyPDFLoader, which reads each PDF with pypdf and returns one Document per page.
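
To make the idea of a loader concrete, here is a standard-library sketch that turns an HTML string into a Document-like record (text content plus metadata). It is not how LangChain's loaders are implemented (those rely on BeautifulSoup); `TextExtractor` and `load_html` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the <title> of an HTML page."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title = text
        else:
            self.parts.append(text)


def load_html(html, source):
    """Return a dict shaped like a Document: page_content plus metadata."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.parts),
        "metadata": {"source": source, "title": parser.title},
    }


doc = load_html(
    "<html><head><title>Dashboard</title></head>"
    "<body><h1>Panel</h1><p>Resumen de evaluaciones.</p></body></html>",
    source="dashboard.html",
)
print(doc["metadata"])
print(doc["page_content"])
```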

In [3]:
from glob import glob
from langchain_community.document_loaders import BSHTMLLoader, PyPDFLoader

# Glob patterns for the HTML and PDF source files
html_files = glob("C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es/*.html")
pdf_files = glob("C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/*.pdf")

# Load and parse every file
documents = []
for file_path in html_files:
    print(f"file path: {file_path}")
    loader = BSHTMLLoader(file_path, open_encoding="utf-8")
    documents.extend(loader.load())

for file_path in pdf_files:
    print(f"file path: {file_path}")
    loader = PyPDFLoader(file_path)
    documents.extend(loader.load())
# `documents` now holds every loaded document.
print(len(documents[0].page_content))

print(documents[0].page_content[:500])


file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\autoevaluacion.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\carga-plan-trabajo.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\dashboard.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\desarrollo.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\evaluacion.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\evaluaciones.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\feedback-enviado.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\feedback-recibido.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\feedback.html
file path: C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\index.html
f

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(documents)

print(len(all_splits))
print(len(all_splits[0].page_content))
print(all_splits[10].metadata)

76
927
{'source': 'C:/Users/baiscf/repos_local/Confluence-LangChain/PMX_Manual/tenaris/es\\dashboard.html', 'title': 'Dashboard - PMX - Guía de usuarios', 'start_index': 859}
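
The `chunk_size`, `chunk_overlap` and `add_start_index` settings used above can be illustrated with a simplified splitter. This is only a sketch using plain character windows: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which is why its chunks are often shorter than `chunk_size` (927 characters in the output above). The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap, recording each start index."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {"page_content": text[start:start + chunk_size], "start_index": start}
        )
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_with_overlap(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(chunk["start_index"], chunk["page_content"])
```

The overlap means the last characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still fully present in at least one chunk.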


In [5]:
from langchain_core.vectorstores import InMemoryVectorStore
# from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings

# embed = AzureOpenAIEmbeddings(
#     model=utils.config["EMB"]["MODEL"],
#     azure_endpoint=utils.config["EMB"]["ENDPOINT"],
#     api_key=utils.config["EMB"]["API_KEY"],
#     api_version=utils.config["EMB"]["API_VERSION"],
# )
os.environ["OPENAI_API_KEY"] = utils.config["EMB"]["OPENAI_API_KEY"]

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vectorstore = InMemoryVectorStore(embeddings)
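
Conceptually, a vector store keeps one embedding per chunk and answers a query by ranking the chunks by similarity to the query's embedding. The sketch below shows that idea with cosine similarity and a toy word-count "embedding". `ToyVectorStore`, `toy_embed` and `VOCAB` are hypothetical stand-ins, not the LangChain or OpenAI implementations.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


VOCAB = ["evaluacion", "feedback", "dashboard"]


def toy_embed(text):
    """Hypothetical embedding: counts of a few known words.

    The small constant keeps every vector non-zero so the cosine is defined.
    """
    words = text.lower().split()
    return [words.count(word) + 0.01 for word in VOCAB]


class ToyVectorStore:
    """Keeps (vector, text) pairs in a list and ranks them at query time."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((self.embed(text), text))

    def similarity_search(self, query, k=4):
        query_vector = self.embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["evaluacion de desempeño", "feedback recibido", "dashboard principal"])
top = store.similarity_search("como enviar feedback", k=1)
print(top)
```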

In [8]:
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict

# Index chunks
_ = vectorstore.add_documents(documents=all_splits)

# Define prompt for question-answering
template = """Sos un asistente de soporte sobre un sistema de evaluaciones de performance de empleados. Usa el contexto que se te proporciona para responder la pregunta del final.
El material que se te proporciona es una guia de usuario del sistema y la norma de evaluacion de la empresa.
Si no sabes la respuesta, solo di "no se", no trates de crear una respuesta.
Mantene la respuesta tan concisa como sea posible.
Siempre contesta con "Gracias por tu pregunta" al final de la respuesta.

{context}

Pregunta: {question}

Respuesta:"""
custom_rag_prompt = PromptTemplate.from_template(template)


# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vectorstore.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = custom_rag_prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
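
The graph above runs `retrieve` and then `generate`, and each step returns only the keys it wants to change in the shared state. A rough plain-Python sketch of that flow, assuming simple overwrite semantics for each key, looks like this. `run_sequence` and the two stub steps are hypothetical and do not call LangGraph, a vector store, or an LLM.

```python
def run_sequence(steps, state):
    """Run steps in order, merging each step's partial update into the state."""
    state = dict(state)  # copy so the caller's dict is not modified
    for step in steps:
        state.update(step(state))
    return state


def retrieve_stub(state):
    # Stand-in for the vector store lookup.
    return {"context": ["La autoevaluacion se completa desde el dashboard."]}


def generate_stub(state):
    # Stand-in for the LLM call: reports how many context chunks it received.
    count = len(state["context"])
    return {"answer": f"{count} chunk(s) used for: {state['question']}"}


result = run_sequence(
    [retrieve_stub, generate_stub],
    {"question": "Donde completo la autoevaluacion?"},
)
print(result["answer"])
```

Because each step sees the state left by the previous one, `generate_stub` can read the `context` key that `retrieve_stub` produced, which is the same dependency the real `generate` has on `retrieve`.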


In [14]:
#response = graph.invoke({"question": "Que me podes contar sobre Dinamarca?"})
#response = graph.invoke({"question": "Que me podes indicar sobre la autoevaluación?"})
#response = graph.invoke({"question": "Donde puedo completar una evaluacion cliente proveedor?"})
response = graph.invoke({"question": "Para que sirve una evaluacion cliente proveedor?"})

# print(f'Context: {response["context"]}\n\n')
print(f'Answer: {response["answer"]}')

Answer: La evaluación cliente proveedor permite a los evaluadores primarios y secundarios obtener una valoración directa de la performance de la persona, lo cual es relevante para la gestión del rendimiento. Gracias por tu pregunta.
