To feed large documents into Chroma for retrieval-augmented generation (RAG) with LangChain, you'll want to break them into manageable chunks, embed them, and store them efficiently for querying.
It also helps if the document has a clear internal structure, such as headers.

Here's an approach that works well:

## 1. Chunking the Text
Given the document's length (~8800 characters) and structure, you should split it into meaningful sections while maintaining context:
- Use LangChain's RecursiveCharacterTextSplitter or MarkdownTextSplitter to divide the document intelligently.
- Maintain coherence by chunking around 512–1500 characters per section.
- Ensure headers (e.g., "Onboarding of different kinds of entitlements") are preserved to retain contextual hierarchy. Ideally, headers should stay attached to their respective content during embedding to maintain context. Otherwise, when retrieving chunks, you may lose the structural meaning of the document.

For this, the best approach is to:
- format the document in Markdown
- leave extra blank lines between paragraphs
- adjust the chunk_size
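To make the idea of keeping each header attached to its content concrete, here is a minimal standard-library sketch. It is not how MarkdownTextSplitter works internally; the helper name `split_by_headers` and the sample text are made up for illustration, and it assumes headers are written in the `#` style with no `#` lines inside code blocks.

```python
import re

# A Markdown ATX header: one to six '#' characters, a space, then some text.
HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def split_by_headers(markdown_text):
    """Split Markdown into sections, each starting with its own header line."""
    sections = []
    current = []
    for line in markdown_text.splitlines():
        # A new header closes the section collected so far.
        if HEADER_RE.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that are empty after stripping whitespace.
    return [s for s in sections if s]


# Hypothetical sample text, not the real document.
sample = "# 0. Introduction\nIntro text.\n\n## 1. Entitlements\nBody text.\nMore body."
for section in split_by_headers(sample):
    print(repr(section))
```

A section produced this way can still be longer than the target chunk size, so in practice it would be combined with a size-based splitter such as the ones named above.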




In [18]:
from langchain.text_splitter import MarkdownTextSplitter, RecursiveCharacterTextSplitter

# Load your document
with open("documents/myiam.txt", "r") as f:
    text = f.read()

# Chunk the text
splitter = MarkdownTextSplitter(chunk_size=1500, chunk_overlap=50)
chunks = splitter.split_text(text)
chunks

["# 0. Introduction\nThe name of our team is ROBOTIC. We are responsible of the creation and maintennance of BNP's private cloud via DPI wich is its action broker and library of products and Vanish wich is our orchestator. We are also the responsible team of creating and maintaining the onboarding automations for some entitlements of MyIAM. I would provide you some information and then I will ask you to answer some emails I will provide.\n\n## 1. Entitlements automatic onboarding\nOur team is known in MyIAM as IV2 Manual Provisioner. Despite the unfortunate name we have in MyIAM, ROBOTIC is responsible of maintaining the automatic onboarding flows for IV2 and we are not allowed nor we have rights to onboard anything manually. We do not outsource our automatic flows to onboard anything else. All our onboarding is done automatically and is triggered either by MyIAM to provision an entitlement managed by us or by ecosystem and subscription deployments.\nFor IV2 extended rights team define

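With this basic splitting, the header and the body of a paragraph are stored together, separated by \n. We can use that to create documents with a header metadata field. Note that a chunk can span more than one header (as in the output above), so the first line only identifies the first section in the chunk.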




In [21]:

from langchain.schema import Document

# Merge headers with content
documents = []
for chunk in chunks:  # Assuming the header is the first line, followed by \n
    header = chunk.split('\n')[0]
    documents.append(Document(page_content=chunk, metadata={"header": header, "category": "MyIAM"}))

documents

[Document(metadata={'header': '# 0. Introduction', 'category': 'MyIAM'}, page_content="# 0. Introduction\nThe name of our team is ROBOTIC. We are responsible of the creation and maintennance of BNP's private cloud via DPI wich is its action broker and library of products and Vanish wich is our orchestator. We are also the responsible team of creating and maintaining the onboarding automations for some entitlements of MyIAM. I would provide you some information and then I will ask you to answer some emails I will provide.\n\n## 1. Entitlements automatic onboarding\nOur team is known in MyIAM as IV2 Manual Provisioner. Despite the unfortunate name we have in MyIAM, ROBOTIC is responsible of maintaining the automatic onboarding flows for IV2 and we are not allowed nor we have rights to onboard anything manually. We do not outsource our automatic flows to onboard anything else. All our onboarding is done automatically and is triggered either by MyIAM to provision an entitlement managed by 
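Taking the first line as the header only works when every chunk starts with one. If a long section is cut in two, the second chunk starts with body text. One possible way to handle that, sketched here with the standard library only, is to carry the last seen header forward. The function name `headers_for_chunks` and the toy chunks are assumptions made for illustration.

```python
import re

HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def headers_for_chunks(chunks):
    """Return one header per chunk, reusing the last seen header when a chunk has none."""
    headers = []
    last_header = ""  # stays empty until the first header is seen
    for chunk in chunks:
        lines = chunk.split("\n")
        # Use the chunk's own first line if it is a header, else the carried one.
        header = lines[0] if HEADER_RE.match(lines[0]) else last_header
        headers.append(header)
        # Remember the last header inside this chunk for any continuation chunk.
        for line in lines:
            if HEADER_RE.match(line):
                last_header = line
    return headers


# Toy chunks: the second one is a continuation without its own header.
toy_chunks = [
    "# 0. Introduction\nIntro.\n\n## 1. Entitlements automatic onboarding\nBody start",
    "body continuation that was cut off from the section above",
    "## 2. Another section\nMore text.",
]
print(headers_for_chunks(toy_chunks))
```

The returned list lines up with `chunks` by position, so each value could be used as the "header" metadata entry in place of `chunk.split('\n')[0]`.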

In [51]:
# PO  document
# Load your document
with open("documents/product_owner.txt", "r") as f:
    text = f.read()

# Chunk the text
splitter = MarkdownTextSplitter(chunk_size=1024, chunk_overlap=50)
chunks = splitter.split_text(text)
chunks

['# The Product Owner (PO)\nEvery team in the company creates products, the templates for create the items on the virtual cloud, so the users can order them. That is, for example, Unix server product, sqlserver product, etc. This teams are the products owners.',
 '## PO roles\nA product owner (PO) is a key role in agile development and product management. The PO is responsible for defining and prioritizing the product vision, goals, and requirements, and for ensuring that his development team understands and delivers on these objectives as well as to manage the incidents raised by instances of its product. The PO serves as the main point of contact for the development team, stakeholders, and customers, and is responsible for making decisions about the product roadmap and feature prioritization. The PO should have a deep understanding of the product he aims to build and its technical capabilities, customer needs, and user experience. The PO is responsible for defining product requiremen

In [53]:
po_documents = []
for chunk in chunks:  # Assuming the header is the first line, followed by \n
    header = chunk.split('\n')[0]
    po_documents.append(Document(page_content=chunk, metadata={"header": header, "category": "Product Owner"}))
po_documents

[Document(metadata={'header': '# The Product Owner (PO)', 'category': 'Product Owner'}, page_content='# The Product Owner (PO)\nEvery team in the company creates products, the templates for create the items on the virtual cloud, so the users can order them. That is, for example, Unix server product, sqlserver product, etc. This teams are the products owners.'),
 Document(metadata={'header': '## PO roles', 'category': 'Product Owner'}, page_content='## PO roles\nA product owner (PO) is a key role in agile development and product management. The PO is responsible for defining and prioritizing the product vision, goals, and requirements, and for ensuring that his development team understands and delivers on these objectives as well as to manage the incidents raised by instances of its product. The PO serves as the main point of contact for the development team, stakeholders, and customers, and is responsible for making decisions about the product roadmap and feature prioritization. The PO

In [54]:
documents.extend(po_documents)

## 2. Embedding the Chunks/Documents
Once split, convert the text into vector embeddings using a model like:
- OpenAIEmbeddings (if using OpenAI; remember to put the API key in the .env file)
- HuggingFaceEmbeddings (for local models)
- SentenceTransformerEmbeddings (optimized for sentence-level retrieval)
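A missing API key usually only shows up as an authentication error at the first embedding call. A small guard like the sketch below can fail earlier with a clearer message. The helper `require_env` and the `DEMO_API_KEY` variable are hypothetical; with OpenAI the variable to check would normally be `OPENAI_API_KEY`, after the .env file has been loaded.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Stand-in variable so the example runs anywhere; not a real key.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```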

## 3. Storing in Chroma
- Use Chroma as a vector store to hold the embeddings.
- You can initialize it with LangChain:


In [55]:
# from langchain.vectorstores import Chroma
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
import dotenv

dotenv.load_dotenv()  # load variables such as the API key from the .env file
REVIEWS_CHROMA_PATH = "chroma_data"
# Generate embeddings
embeddings = OpenAIEmbeddings()
# vector_store = Chroma.from_texts(chunks, embeddings)
vector_store = Chroma.from_documents(documents, embeddings, persist_directory=REVIEWS_CHROMA_PATH)

# Query the vector store
query = "What is the name of our team?"
docs = vector_store.similarity_search(query, k=3)
print(docs[0].page_content) 

# 0. Introduction
The name of our team is ROBOTIC. We are responsible of the creation and maintennance of BNP's private cloud via DPI wich is its action broker and library of products and Vanish wich is our orchestator. We are also the responsible team of creating and maintaining the onboarding automations for some entitlements of MyIAM. I would provide you some information and then I will ask you to answer some emails I will provide.

## 1. Entitlements automatic onboarding
Our team is known in MyIAM as IV2 Manual Provisioner. Despite the unfortunate name we have in MyIAM, ROBOTIC is responsible of maintaining the automatic onboarding flows for IV2 and we are not allowed nor we have rights to onboard anything manually. We do not outsource our automatic flows to onboard anything else. All our onboarding is done automatically and is triggered either by MyIAM to provision an entitlement managed by us or by ecosystem and subscription deployments.
For IV2 extended rights team defined a com
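Conceptually, a similarity search with k=3 embeds the query and returns the k stored chunks whose vectors are closest to it. The toy sketch below shows that ranking step with hand-written 3-dimensional vectors and cosine similarity. It is an illustration only: real embeddings have hundreds or thousands of dimensions, and Chroma uses its own index and configurable distance metric rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings.
toy_index = [
    ("chunk about the team", [0.9, 0.1, 0.0]),
    ("chunk about onboarding", [0.2, 0.8, 0.1]),
    ("chunk about product owners", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]
print(top_k(toy_query, toy_index, k=2))
```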

### The RAG chain in one go

In [56]:
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough


# First define the system & human prompts -> chat prompt template
roboasistant_system_template_str = """You are a member of the robotic team. 
Your job is to answer questions about the onboarding of entitlements or the product ownership.
Use the following context to answer questions.
Be as detailed as possible, but don't make up any information
that's not from the context. If you don't know an answer, say
you don't know.
{context}
"""

roboasistant_system_prompt = SystemMessagePromptTemplate(
    prompt=PromptTemplate(
        input_variables=["context"], template=roboasistant_system_template_str
    )
)

roboasistant_human_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(input_variables=["question"], template="{question}")
)
roboassistant_messages = [roboasistant_system_prompt, roboasistant_human_prompt]

roboassistant_prompt_template = ChatPromptTemplate(
    input_variables=["context", "question"], messages=roboassistant_messages
)


# Second define the RAG chain = retriever, chat prompt template, chat model & output parser
chat_model = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

output_parser = StrOutputParser()

faqs_vector_db = Chroma(
    persist_directory=REVIEWS_CHROMA_PATH,
    embedding_function=OpenAIEmbeddings(),
)

# The number of results is set through search_kwargs
faqs_retriever = faqs_vector_db.as_retriever(search_kwargs={"k": 3})

faqs_roboassistant_chain = (
    {"context": faqs_retriever, "question": RunnablePassthrough()}
    | roboassistant_prompt_template
    | chat_model
    | StrOutputParser()
)
question = "Whats the name of your team?"
print(faqs_roboassistant_chain.invoke(question))

The name of our team is ROBOTIC.
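In the chain above, the retriever output is placed into {context} as a list of Document objects, so the prompt receives their string representation, metadata included. A common refinement is to join only the text of the chunks first. The sketch below shows such a helper using a stand-in `Doc` dataclass, which is hypothetical and only mimics the `page_content` attribute of the LangChain class; the toy texts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (not the LangChain Document class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks into one plain-text context block."""
    return "\n\n".join(doc.page_content for doc in docs)


# Toy retrieved chunks for illustration.
retrieved = [
    Doc("# 0. Introduction\nToy intro text."),
    Doc("## PO roles\nToy PO text."),
]
print(format_docs(retrieved))
```

Assuming the chain is otherwise unchanged, the helper would be piped after the retriever, for example `{"context": faqs_retriever | format_docs, "question": RunnablePassthrough()}`.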


In [58]:
question = "What are POs main responsibilities?"
print(faqs_roboassistant_chain.invoke(question))

The main responsibilities of a Product Owner (PO) include:

1. Defining and prioritizing the product vision, goals, and requirements.
2. Ensuring that the development team understands and delivers on these objectives.
3. Serving as the main point of contact for the development team, stakeholders, and customers.
4. Making decisions about the product roadmap and feature prioritization.
5. Having a deep understanding of the product, its technical capabilities, customer needs, and user experience.
6. Defining product requirements and working with the development team to ensure implementation.
7. Managing incidents raised by instances of the product.
8. Investigating incidents regarding the product and leading actions for correction.
