### Documents, Vector Stores and Retrievers

### Documents
LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has two attributes:

- page_content: a string representing the content;
- metadata: a dict containing arbitrary metadata.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.

Let's generate some simple documents:

In [1]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="The quick brown fox jumps over the lazy dog.",
        metadata={"source": "storybook", "page": 1}
    ),
    Document(
        page_content="In 2025, AI systems continue to transform industries worldwide.",
        metadata={"source": "tech_report", "year": 2025}
    ),
    Document(
        page_content="Photosynthesis converts light energy into chemical energy in plants.",
        metadata={"source": "biology_textbook", "chapter": 3}
    ),
    Document(
        page_content="To be or not to be, that is the question.",
        metadata={"source": "shakespeare", "work": "Hamlet"}
    ),
    Document(
        page_content="E = mc^2 is one of the most famous equations in physics.",
        metadata={"source": "physics_notes", "author": "Einstein"}
    )
]

In [2]:
documents

[Document(metadata={'source': 'storybook', 'page': 1}, page_content='The quick brown fox jumps over the lazy dog.'),
 Document(metadata={'source': 'tech_report', 'year': 2025}, page_content='In 2025, AI systems continue to transform industries worldwide.'),
 Document(metadata={'source': 'biology_textbook', 'chapter': 3}, page_content='Photosynthesis converts light energy into chemical energy in plants.'),
 Document(metadata={'source': 'shakespeare', 'work': 'Hamlet'}, page_content='To be or not to be, that is the question.'),
 Document(metadata={'source': 'physics_notes', 'author': 'Einstein'}, page_content='E = mc^2 is one of the most famous equations in physics.')]
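Since a Document often represents a chunk of a larger text, here is a minimal sketch of how such chunks might be produced. It uses only the standard library: `SimpleDocument` is a hypothetical stand-in that mirrors the two attributes of `Document`, and `chunk_text` with its fixed `chunk_size` is an illustrative helper, not a LangChain API.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring Document's two attributes."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text: str, source: str, chunk_size: int = 40) -> list:
    """Split text into fixed-size character chunks, one SimpleDocument per chunk."""
    return [
        SimpleDocument(
            page_content=text[start:start + chunk_size],
            # Record where the chunk came from so it can be traced back later.
            metadata={"source": source, "start": start},
        )
        for start in range(0, len(text), chunk_size)
    ]


long_text = "Photosynthesis converts light energy into chemical energy in plants. " * 3
chunks = chunk_text(long_text, source="biology_textbook")

for chunk in chunks:
    print(chunk.metadata, repr(chunk.page_content))
```

Real pipelines usually split on sentence or paragraph boundaries and overlap neighbouring chunks; fixed-size slicing is used here only to keep the idea visible.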

In [8]:
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_groq import ChatGroq

groq_api_key = os.getenv('GROQ_API_KEY')
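# os.environ only accepts strings, so skip the assignment if HF_TOKEN is unset
hf_token = os.getenv('HF_TOKEN')
if hf_token:
    os.environ["HF_TOKEN"] = hf_token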

llm = ChatGroq(model='Llama3-8b-8192', groq_api_key=groq_api_key)
llm

ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x119d21900>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x1187de4d0>, model_name='Llama3-8b-8192', model_kwargs={}, groq_api_key=SecretStr('**********'))

In [11]:
from langchain_huggingface import HuggingFaceEmbeddings

embddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [12]:
## vector store
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(documents, embedding=embddings)

vectorstore

<langchain_chroma.vectorstores.Chroma at 0x119dffc40>

In [13]:
vectorstore.similarity_search("Photosynthesis")

[Document(id='793fb4a8-753e-4cc6-946c-c245d21471e7', metadata={'source': 'biology_textbook', 'chapter': 3}, page_content='Photosynthesis converts light energy into chemical energy in plants.'),
 Document(id='f8641976-0138-4aa6-8e0d-6fb656c77f07', metadata={'author': 'Einstein', 'source': 'physics_notes'}, page_content='E = mc^2 is one of the most famous equations in physics.'),
 Document(id='301a6a11-8d81-441c-a633-214c018eb715', metadata={'source': 'shakespeare', 'work': 'Hamlet'}, page_content='To be or not to be, that is the question.'),
 Document(id='8398a4ac-3598-4863-9be8-12adcc7a00a3', metadata={'page': 1, 'source': 'storybook'}, page_content='The quick brown fox jumps over the lazy dog.')]
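To make the search above less of a black box, the following sketch imitates what a vector store does conceptually: embed every text once, embed the query, and rank the stored texts by distance. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not how Chroma is implemented.

```python
import math
from collections import Counter


def toy_embed(text: str) -> dict:
    """Hypothetical embedding: unit-normalized bag-of-words counts."""
    words = text.lower().replace(".", "").replace(",", "").split()
    counts = Counter(words)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def squared_l2(a: dict, b: dict) -> float:
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


class ToyVectorStore:
    def __init__(self, texts):
        # Embed each text once, at indexing time.
        self.entries = [(text, toy_embed(text)) for text in texts]

    def similarity_search(self, query: str, k: int = 4):
        # Embed the query, then return the k texts with the smallest distance.
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda entry: squared_l2(q, entry[1]))
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "The quick brown fox jumps over the lazy dog.",
    "In 2025, AI systems continue to transform industries worldwide.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "To be or not to be, that is the question.",
    "E = mc^2 is one of the most famous equations in physics.",
])

top = store.similarity_search("Photosynthesis", k=1)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, which is why the actual store can match queries that share no words with the document.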

In [15]:
await vectorstore.asimilarity_search("Photosynthesis")

[Document(id='793fb4a8-753e-4cc6-946c-c245d21471e7', metadata={'chapter': 3, 'source': 'biology_textbook'}, page_content='Photosynthesis converts light energy into chemical energy in plants.'),
 Document(id='f8641976-0138-4aa6-8e0d-6fb656c77f07', metadata={'source': 'physics_notes', 'author': 'Einstein'}, page_content='E = mc^2 is one of the most famous equations in physics.'),
 Document(id='301a6a11-8d81-441c-a633-214c018eb715', metadata={'work': 'Hamlet', 'source': 'shakespeare'}, page_content='To be or not to be, that is the question.'),
 Document(id='8398a4ac-3598-4863-9be8-12adcc7a00a3', metadata={'page': 1, 'source': 'storybook'}, page_content='The quick brown fox jumps over the lazy dog.')]

In [16]:
vectorstore.similarity_search_with_score("Photosynthesis")

[(Document(id='793fb4a8-753e-4cc6-946c-c245d21471e7', metadata={'source': 'biology_textbook', 'chapter': 3}, page_content='Photosynthesis converts light energy into chemical energy in plants.'),
  0.44923925399780273),
 (Document(id='f8641976-0138-4aa6-8e0d-6fb656c77f07', metadata={'source': 'physics_notes', 'author': 'Einstein'}, page_content='E = mc^2 is one of the most famous equations in physics.'),
  1.7091107368469238),
 (Document(id='301a6a11-8d81-441c-a633-214c018eb715', metadata={'source': 'shakespeare', 'work': 'Hamlet'}, page_content='To be or not to be, that is the question.'),
  1.849722981452942),
 (Document(id='8398a4ac-3598-4863-9be8-12adcc7a00a3', metadata={'page': 1, 'source': 'storybook'}, page_content='The quick brown fox jumps over the lazy dog.'),
  1.9243714809417725)]
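The scores above read as distances: the best match has the smallest value. Assuming the store uses squared L2 distance (believed to be Chroma's default) and the embeddings are unit-normalized, the score relates to cosine similarity as `distance = 2 - 2 * cosine`. The sketch below checks that identity on made-up three-dimensional vectors; the vectors and helper names are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 2.0, 0.5])
close = normalize([1.0, 1.8, 0.6])   # points in nearly the same direction
far = normalize([-1.0, 0.2, 2.0])    # points in a very different direction

for name, vec in [("close", close), ("far", far)]:
    distance = squared_l2(query, vec)
    from_cosine = 2 - 2 * dot(query, vec)
    print(name, round(distance, 4), round(from_cosine, 4))
```

Under these assumptions a score near 0 means nearly identical direction and a score near 2 means the vectors are roughly orthogonal, which is consistent with the unrelated documents above scoring between 1.7 and 1.9.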

In [17]:
### retrievers

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriver = RunnableLambda(vectorstore.similarity_search).bind(k=1)
retriver.batch(["Photosynthesis", "dog"])

[[Document(id='793fb4a8-753e-4cc6-946c-c245d21471e7', metadata={'source': 'biology_textbook', 'chapter': 3}, page_content='Photosynthesis converts light energy into chemical energy in plants.')],
 [Document(id='8398a4ac-3598-4863-9be8-12adcc7a00a3', metadata={'source': 'storybook', 'page': 1}, page_content='The quick brown fox jumps over the lazy dog.')]]

In [19]:
retriver = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":1}
)

retriver.batch(["Photosynthesis", "dog"])

[[Document(id='793fb4a8-753e-4cc6-946c-c245d21471e7', metadata={'chapter': 3, 'source': 'biology_textbook'}, page_content='Photosynthesis converts light energy into chemical energy in plants.')],
 [Document(id='8398a4ac-3598-4863-9be8-12adcc7a00a3', metadata={'source': 'storybook', 'page': 1}, page_content='The quick brown fox jumps over the lazy dog.')]]

In [22]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer the question using the provided context only.

{question}

context
{context}
"""

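# Each message is a (role, template) tuple; a flat list of two strings would
# be read as two separate human messages, "human" and the template itself.
prompt = ChatPromptTemplate.from_messages([("human", message)])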

rag_chain = { "context": retriver, "question": RunnablePassthrough() } | prompt | llm

response = rag_chain.invoke("tell me about Photosynthesis")

print(response.content)

According to the provided context, Photosynthesis converts light energy into chemical energy in plants.
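In the chain above, the retriever returns a list of Document objects, and that list is presumably substituted into `{context}` as its string representation, ids and metadata included. A common refinement is to join only the text of the retrieved documents before it reaches the prompt. The sketch below shows such a helper; `format_docs` is a hypothetical name, and `SimpleNamespace` stands in for real Document objects so the example runs without LangChain.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for retrieved Document objects (only page_content is used).
retrieved = [
    SimpleNamespace(
        page_content="Photosynthesis converts light energy into chemical energy in plants.",
        metadata={"source": "biology_textbook", "chapter": 3},
    ),
    SimpleNamespace(
        page_content="E = mc^2 is one of the most famous equations in physics.",
        metadata={"source": "physics_notes", "author": "Einstein"},
    ),
]

context = format_docs(retrieved)
print(context)
```

In the chain, the helper could plausibly be used as `"context": retriver | format_docs`, since LCEL coerces plain functions into runnables when they are piped; that wiring is not run here.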
