### ChromaDB Vector Embeddings con LangChain
Ref> https://airbyte.com/data-engineering-resources/chroma-db-vector-embeddings

In [22]:
%pip install -qU langchain-community pypdf

Note: you may need to restart the kernel to use updated packages.


In [23]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./data/openaiguide.pdf"
loader = PyPDFLoader(file_path)

Extract the PDF by page. Each page is extracted as a langchain Document object:

In [24]:
import pprint

file_path = "./data/openaiguide.pdf"
loader = PyPDFLoader(
    file_path,
    mode="page",  ## Cargar por página, tambien puede ser entero usando "single"
)
docs = loader.load()
print(len(docs))

pprint.pp(docs[0].metadata)

34
{'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)',
 'creator': 'pdf-lib (https://github.com/Hopding/pdf-lib)',
 'creationdate': '2025-04-07T14:20:51+00:00',
 'moddate': '2025-04-07T14:20:54+00:00',
 'source': './data/openaiguide.pdf',
 'total_pages': 34,
 'page': 0,
 'page_label': '1'}


In [25]:
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Chunking Básico sin solapamiento
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0
)
documents = text_splitter.split_documents(docs)

# Chunking Semántico con solapamiento
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # Mantenemos contexto entre chunks
    separators=["\n\n", "\n", " ", ""] # Definimos separadores jerárquicos
)
documents = text_splitter.split_documents(docs)

Embed text and store in ChromaDB

In [26]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

db = Chroma.from_documents(documents, OpenAIEmbeddings())

Similarity search

In [30]:
query = "workflow definition"
docs = db.similarity_search(query)
print(docs[0].page_content)

A s y ou e v alua t e wher e agen ts can add v alue ,  prioritiz e w orkflo w s tha t ha v e pr e viously  r esist ed 
aut oma tion,  especially  wher e tr aditional me thods encoun t er  fric tion:
01 C o m p l e x   
d e c i s i o n - m a k i n g :  
W orkflo w s in v olving nuanced judgmen t,  e x cep tions,  or   
con t e xt -sensitiv e decisions,  f or  e x ample r e fund appr o v al  
in cust omer  service w orkflo w s.
02 D i ffi c u l t - t o - m a i n t a i n  
r u l e s :
S y st ems tha t ha v e become unwieldy  due t o e xt ensiv e and 
in trica t e rulese ts,  making upda t es costly  or  err or -pr one ,   
f or  e x ample perf orming v endor  security  r e vie w s.  
03 H e a v y  r e l i a n c e  o n  
u n s t r u c t u r e d  d a t a :
Scenarios tha t in v olv e in t erpr e ting na tur al language ,   
e xtr ac ting meaning fr om documen ts,  or  in t er ac ting with  
user s con v er sa tionally ,  f or  e x ample pr ocessing a home 
insur ance claim.


Advanced RAG Implementation


In [None]:
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from dotenv import load_dotenv
load_dotenv()

# Create retriever with custom parameters
retriever = db.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={"score_threshold": 0.7, "k": 4}
)

llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

# See full prompt at https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

result = rag_chain.invoke({"input": "What are autonomous agents?"})

print(result['answer'])
for document in result['context']:
    print(document)
    print('---\n\n')

Autonomous agents are systems that independently accomplish tasks on behalf of users with a high degree of independence. Unlike conventional software that requires user input to streamline and automate workflows, autonomous agents can perform these workflows autonomously, making decisions and handling complexity without direct user intervention. They are particularly suited for workflows where traditional deterministic and rule-based approaches may fall short, allowing them to manage complex and ambiguous situations effectively.
page_content='W h a t  i s  a n  
a g e n t ?
While con v en tional so ftw ar e enables user s t o str eamline and aut oma t e w orkflo w s,  agen ts ar e able 
t o perf orm the same w orkflo w s on the user s ’  behalf  with a high degr ee o f  independence .
A gen ts ar e s y st ems tha t independen tly accomplish task s on y our  behalf .
A  w orkflo w  is a sequence o f  st eps tha t must be e x ecut ed t o mee t the user’ s goal,  whe ther  tha t ' s 
r es

#### Ahora probemos con una web

In [50]:
# Load docs
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://github.com/chiphuyen/aie-book/blob/main/chapter-summaries.md")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

# Store splits
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

# LLM
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

Y usando un mecanismo abreviado LCEL

In [52]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# See full prompt at https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


qa_chain = (
    {
        "context": vectorstore.as_retriever() | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

qa_chain.invoke("Define RAG en 3 bullets como si fueras Chip Huyen, con referencias a los documentos usados.")

'- RAG (Retrieval-Augmented Generation) combines retrieval of relevant documents with generative capabilities to enhance response accuracy.  \n- It is particularly useful for tasks like extracting meaning from documents or engaging in conversational interactions, such as processing home insurance claims.  \n- Before implementing RAG, ensure your use case meets specific criteria; otherwise, a simpler deterministic solution may be more appropriate.'