## Documents in LangChain

In LangChain, the Document is the fundamental building block for working with external knowledge. Whenever you ingest data from a file, webpage, or database, LangChain represents that piece of content as a Document object.

A Document has two main components:

* **page_content (str)** → The raw text you want the LLM to use.
* **metadata (dict)** → Information about that text, such as its source file or page number. Metadata is essential for traceability (e.g., citing the page number in an answer, knowing which file the text came from).

## Why Documents Are Important

* **Uniformity** → Regardless of source (PDF, web, SQL database), everything becomes a standardized Document.

* **Traceability** → Metadata records where each piece of text came from, so an answer can be traced back to its source.

* **Flexibility** → Metadata can store rich details (e.g., author, timestamp, topic) that help filtering and retrieval.
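
As a rough illustration of the flexibility point, the sketch below filters a list of Documents by their metadata. The helper `filter_by_metadata` and the sample documents are hypothetical, introduced only for this example. If `langchain_core` is not installed, the code falls back to a minimal stand-in class with the same two fields.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Minimal stand-in (an assumption for illustration only) so the
    # sketch still runs without LangChain installed.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

# Hypothetical documents from three different kinds of source
docs = [
    Document(page_content="Chapter 1 introduces the basics.",
             metadata={"source": "report.pdf", "page": 3}),
    Document(page_content="LangChain provides document loaders.",
             metadata={"source": "https://example.com/docs"}),
    Document(page_content="Meeting notes from Monday.",
             metadata={"source": "notes.txt", "author": "Ana"}),
]

def filter_by_metadata(documents, **criteria):
    """Return the documents whose metadata matches every key/value pair."""
    return [
        d for d in documents
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]

pdf_docs = filter_by_metadata(docs, source="report.pdf")
for d in pdf_docs:
    print(f"{d.page_content} (source: {d.metadata['source']}, page: {d.metadata.get('page')})")
```

Because every source ends up in the same two-field shape, the same filter works whether the text came from a PDF, a web page, or a plain text file.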

In [1]:
import os
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader, TextLoader, DirectoryLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
# 1. Single PDF
# pdf_docs = PyPDFLoader("sample.pdf").load()

# 2. Web page
web_docs = WebBaseLoader("https://python.langchain.com").load()

# 3. Single text file
# txt_docs = TextLoader("notes.txt", encoding="utf-8").load()

# 4. Whole folder (all Markdown files)
# dir_docs = DirectoryLoader("./docs", glob="**/*.md").load()

# 5. Minimal in-memory doc (for demo)
docs = [
    Document(
        page_content="LangChain streamlines LLM apps. It has loaders, splitters, embeddings, vectorstores, and retrievers.",
        metadata={"source": "in_memory"}
    )
]

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = """LangChain makes it easy to work with documents by
loading, splitting, embedding, and retrieving them for LLM-powered applications."""

# Define splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10
)

# Split into smaller docs
docs = splitter.create_documents([text])

for i, d in enumerate(docs, 1):
    print(f"Chunk {i}: {d.page_content}")

Chunk 1: LangChain makes it easy to work with documents by
Chunk 2: loading, splitting, embedding, and retrieving
Chunk 3: them for LLM-powered applications.


In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Original Document
doc = Document(
    page_content=(
        "LangChain makes it easy to work with documents by loading, "
        "splitting, embedding, and retrieving them for LLM-powered applications."
    ),
    metadata={"source": "intro.txt", "author": "LangChain Team"}
)

# Splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10
)

# Split into smaller Document objects
chunks = splitter.split_documents([doc])

# Add chunk info to metadata
for idx, chunk in enumerate(chunks, start=1):
    chunk.metadata["chunk_id"] = idx

# Print results
for c in chunks:
    print(f"Chunk {c.metadata['chunk_id']}: {c.page_content}")
    print("Metadata:", c.metadata)
    print("-" * 60)

Chunk 1: LangChain makes it easy to work with documents by
Metadata: {'source': 'intro.txt', 'author': 'LangChain Team', 'chunk_id': 1}
------------------------------------------------------------
Chunk 2: by loading, splitting, embedding, and retrieving
Metadata: {'source': 'intro.txt', 'author': 'LangChain Team', 'chunk_id': 2}
------------------------------------------------------------
Chunk 3: them for LLM-powered applications.
Metadata: {'source': 'intro.txt', 'author': 'LangChain Team', 'chunk_id': 3}
------------------------------------------------------------


## LangChain Text Embeddings
An embedding is a vector representation of text. It captures the semantic meaning of words, sentences, or documents, so similar texts are close to each other in vector space.

In LangChain, embeddings are used to:

* **Enable semantic search** (find conceptually similar text).
* **Power retrieval-augmented generation (RAG)** workflows.
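
The idea that similar texts sit close together in vector space can be sketched with cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are made-up toy values, not real model output, and `cosine_similarity` is a helper written for this example rather than a LangChain function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values chosen by hand for illustration)
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_far = cosine_similarity(toy_vectors["cat"], toy_vectors["invoice"])

print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

A score near 1 means the vectors point in almost the same direction, while a score near 0 means they are close to unrelated. Real embeddings behave the same way, only with hundreds or thousands of dimensions.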

In [5]:
from langchain_openai import OpenAIEmbeddings

# Initialize embedding model
embedding_model = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])

# Convert text to vector
vector = embedding_model.embed_query("LangChain makes working with LLMs easier")
print(len(vector), "dimensions") # -> e.g., 1536

1536 dimensions


## 🗄️ Vector Stores in LangChain

A Vector Store is a special type of database designed to store and query vector embeddings (numerical representations of text). Instead of exact keyword matches, vector stores allow semantic similarity search, meaning you can retrieve chunks of text that are conceptually related to a query.

In LangChain, vector stores are the backbone of retrieval-augmented generation (RAG) workflows.
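
To make the idea of similarity search concrete, here is a minimal brute-force sketch in plain Python. `TinyVectorStore` is a hypothetical class written for this example and is not how Chroma or any other store is implemented. The vectors are hand-picked toy values standing in for real embeddings. Production stores typically rely on specialized indexes rather than comparing the query against every item.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class TinyVectorStore:
    """Hypothetical in-memory store that ranks items by cosine similarity."""

    def __init__(self):
        self._items = []  # list of (vector, text, metadata) tuples

    def add(self, vector, text, metadata=None):
        self._items.append((vector, text, metadata or {}))

    def similarity_search(self, query_vector, k=2):
        # Brute force: score every stored vector, then keep the top k.
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [(text, metadata) for _, text, metadata in ranked[:k]]

store = TinyVectorStore()
store.add([0.9, 0.1, 0.0], "Vector stores enable semantic search.", {"id": 2})
store.add([0.1, 0.9, 0.0], "LangChain integrates with OpenAI.", {"id": 1})
store.add([0.0, 0.1, 0.9], "Unrelated text about cooking.", {"id": 3})

# Made-up vector standing in for an embedded query
query_vector = [0.8, 0.2, 0.1]
results = store.similarity_search(query_vector, k=2)

for text, metadata in results:
    print(text, "->", metadata)
```

The store returns text and metadata together, which is what lets a RAG pipeline both use the retrieved passage and cite where it came from.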

### 🔑 Why Vector Stores Matter

* **Efficient search** → optimized for high-dimensional vector lookups.
* **Semantic retrieval** → retrieves relevant documents by meaning, not keywords.
* **Metadata storage** → keeps track of the original source, author, page, etc.
* **Scalability** → can handle millions of embeddings with low latency.

### 📚 Popular Vector Stores in LangChain

* **Chroma** → simple, local, and open source; a common starting point.
* **FAISS** → Facebook AI Similarity Search, a fast similarity-search library that runs locally.
* **Pinecone** → managed cloud vector DB.
* **Weaviate, Qdrant, Milvus** → other scalable open-source options.

In [6]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [7]:
# 1. Embedding model
embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])

# 2. Example documents
docs = [
    Document(page_content="LangChain integrates with OpenAI.", metadata={"id": 1}),
    Document(page_content="Vector stores enable semantic search.", metadata={"id": 2}),
]

# 3. Create vector store
vectorstore = Chroma.from_documents(docs, embedding=embeddings)

# 4. Query the store
results = vectorstore.similarity_search("How to do semantic search?", k=2)

for r in results:
    print(r.page_content, "->", r.metadata)


Vector stores enable semantic search. -> {'id': 2}
LangChain integrates with OpenAI. -> {'id': 1}
