Great question. You're basically asking for the **full conceptual + technical architecture** of document loading, parsing, splitting, and chunking in LangChain. Let's go through it step by step.

---

# 1️⃣ Where are **PyPDF** and **PyMuPDF** used?

They are **PDF parsing engines**, used internally by PDF loaders.

### In LangChain:

```python
from langchain_community.document_loaders import PyPDFLoader
```

### Internally:

* `PyPDFLoader` → uses **pypdf**
* `PyMuPDFLoader` → uses **PyMuPDF (fitz)**

So you don't import them directly; LangChain uses them **under the hood**.

---

## What are they?

### 📄 PyPDF (pypdf)

* Pure Python PDF reader
* Text extraction
* Slower than PyMuPDF
* Struggles with scanned PDFs (they contain page images, so they need OCR first)
* No layout understanding

### ⚡ PyMuPDF (fitz)

* Fast C backend (the MuPDF library)
* Better layout handling
* Can extract:

  * text
  * images
  * metadata
* Better for:

  * research papers
  * academic PDFs
  * structured documents

---

## Install commands:

```bash
pip install pypdf
pip install pymupdf
```

---

# 2️⃣ What is BeautifulSoup?

```python
import bs4
```

BeautifulSoup = **HTML/XML parser**

### Purpose:

It converts raw HTML into a structured tree:

```text
HTML → DOM Tree → Tags → Classes → IDs → Text
```

So you can extract only what you want.

---

## Why is BeautifulSoup used in WebBaseLoader?

Because websites return:

* HTML, not clean text
* messy tags
* ads
* navbars
* scripts
* footers
* cookie banners

So BeautifulSoup is used to **filter relevant content**.

Example:

```python
parse_only=bs4.SoupStrainer(class_=("post-title", "post-content", "post-header"))
```

Means:
👉 Only extract elements with these CSS classes
👉 Ignore everything else
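
To make "only extract elements with these CSS classes" concrete, here is a rough standard-library sketch of the idea. It is not how BeautifulSoup or `SoupStrainer` is implemented; the `ClassFilter` class and the sample HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ClassFilter(HTMLParser):
    """Collect text only from elements whose class is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0   # > 0 while we are inside a wanted element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a wanted element, or go one level deeper inside one.
        if self.depth or self.wanted.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


html = """
<nav class="navbar">Home | About</nav>
<h1 class="post-title">Agents</h1>
<div class="post-content"><p>Planning and memory.</p></div>
<footer class="footer">Copyright</footer>
"""

parser = ClassFilter(wanted=("post-title", "post-content", "post-header"))
parser.feed(html)
kept_text = parser.parts
print(kept_text)
```

The navbar and footer text never reach `kept_text`, which is the effect `parse_only` has on what the loader turns into `page_content`.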

---

# Alternatives to BeautifulSoup

| Library     | Use                          |
| ----------- | ---------------------------- |
| lxml        | Fast XML/HTML parsing        |
| html5lib    | Browser-like parsing         |
| newspaper3k | News article extraction      |
| trafilatura | Clean web content extraction |
| goose3      | Blog/article extraction      |
| selenium    | Dynamic JS websites          |

---

## Install:

```bash
pip install beautifulsoup4
pip install lxml
pip install trafilatura
pip install goose3
pip install newspaper3k
```

---

# 3️⃣ Structure of loaders: parameters, functions, syntax

## General structure:

```python
loader = LoaderClass(parameters)
docs = loader.load()
```

---

## Example patterns:

### TextLoader

```python
TextLoader(file_path, encoding=None)
```

### PyPDFLoader

```python
PyPDFLoader(file_path)
```

### WebBaseLoader

```python
WebBaseLoader(
    web_paths=(url,),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(...))
)
```

### ArxivLoader

```python
ArxivLoader(query="1706.03762", load_max_docs=2)
```

### WikipediaLoader

```python
WikipediaLoader(query="Generative AI", load_max_docs=2)
```

---

# 4️⃣ What is returned by `.load()`?

Always returns:

```python
List[Document]
```

Each `Document` object has:

```python
Document(
    page_content="actual text",
    metadata={"source": "...", "page": 0},  # keys depend on the loader (source, page, url, ...)
)
```

So:

```python
docs[0].page_content
docs[0].metadata
```
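
A minimal sketch of that shape, using a hypothetical `Doc` dataclass as a stand-in (it is not LangChain's `Document` class, and the page texts are made up). It assumes a PDF-style loader that produces one entry per page:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: some text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumed shape of a per-page PDF load: one Doc per page.
docs = [
    Doc("Attention Is All You Need ...", {"source": "attention.pdf", "page": 0}),
    Doc("1 Introduction ...", {"source": "attention.pdf", "page": 1}),
]

for d in docs:
    print(d.metadata["page"], d.metadata["source"], d.page_content[:20])

pages = [d.metadata["page"] for d in docs]
```

Everything downstream (splitters, vector stores, retrievers) works on this pair: the text to embed and the metadata that says where it came from.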

---

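# 5️⃣ `create_documents`, `split_text` / `split_documents`, `load`: the difference

### `load()`

👉 Reads data
👉 Creates documents
👉 No chunking
👉 Raw extraction

---

### `split_documents()` / `split_text()`

👉 `split_documents()` splits already loaded documents into chunks
👉 `split_text()` splits one raw input into a list of strings
👉 Both are methods on text splitters (there is no method called plain `split()`)

---

### `create_documents()`

👉 Splits raw strings and wraps each chunk in a `Document` object
👉 Used when **you already have text**, not files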

Example:

```python
splitter.create_documents(["hello world", "another text"])
```

---

# 6️⃣ Why is splitting / chunking needed?

Because LLMs have **context limits**:

| Model     | Token limit              |
| --------- | ------------------------ |
| GPT-3.5   | ~4k                      |
| GPT-4     | ~8k                      |
| GPT-4-32k | 32k                      |
| GPT-5     | larger but still limited |

---

### Problems without chunking:

❌ token overflow
❌ context loss
❌ slow embeddings
❌ retrieval failure
❌ hallucinations
❌ incomplete answers

---

# 7️⃣ Chunking architecture

Pipeline:

```
Loader → Documents → Splitter → Chunks → Embeddings → Vector DB → Retrieval → LLM
```

---

# 8️⃣ Rules for splitting / chunking

## Core parameters:

```python
chunk_size
chunk_overlap
```

---

### Example:

```python
RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
```

---

## Meaning:

### `chunk_size=1000`

👉 max characters per chunk (length is measured in characters by default)

### `chunk_overlap=200`

👉 up to 200 characters repeated at the start of the next chunk
(prevents context breaking)
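
The two parameters can be illustrated with a plain sliding window. This is a simplified sketch, not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to cut at separators, while the hypothetical `chunk_by_chars` below cuts at fixed character positions.

```python
def chunk_by_chars(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each window starts
    (chunk_size - chunk_overlap) characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "Neural networks learn patterns from data using backpropagation"
chunks = chunk_by_chars(text, chunk_size=30, chunk_overlap=10)
for c in chunks:
    print(repr(c))
```

The last 10 characters of each chunk reappear as the first 10 characters of the next one, which is exactly what `chunk_overlap` buys you.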

---

# 9️⃣ Chunking strategies

### Character based

```python
CharacterTextSplitter
```

### Recursive (best)

```python
RecursiveCharacterTextSplitter
```

Tries:

```
\n\n → \n → space → character
```
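
A simplified sketch of the recursive idea, assuming the separator order `"\n\n"`, `"\n"`, `" "`, then a hard character cut. The function `recursive_split` is hypothetical; it ignores overlap and other details the real splitter handles.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first and greedily merge pieces up to
    chunk_size; recurse with finer separators only for pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate      # keep growing the current chunk
            continue
        if current:
            chunks.append(current)   # current chunk is full, emit it
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(text, chunk_size=40)
for c in chunks:
    print(repr(c))
```

Paragraph boundaries are respected where possible; only the paragraph that exceeds 40 characters is broken further, and then at a space rather than mid-word.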

### Token based

```python
TokenTextSplitter
```

### Sentence based

```python
NLTKTextSplitter
SpacyTextSplitter
```

---

# 🔥 Best practical splitter

```python
RecursiveCharacterTextSplitter
```

---

# 🔟 Chunking parameter logic (REAL WORLD TRICKS)

### For RAG systems:

```python
chunk_size = (500, 1000)     # (min, max) characters
chunk_overlap = (100, 200)   # (min, max) characters
```

### For research papers:

```python
chunk_size = (1200, 1500)    # (min, max) characters
chunk_overlap = (150, 300)   # (min, max) characters
```

### For legal docs:

```python
chunk_size = 800
chunk_overlap = 200
```

### For chat memory:

```python
chunk_size = (300, 500)      # (min, max) characters
chunk_overlap = (50, 100)    # (min, max) characters
```

---

# 🎯 Smart tuning logic

### Rule 1:

> chunk_size ≈ 1/4 of model context window
> (compare like with like: context windows are counted in tokens, `chunk_size` in characters by default)

### Rule 2:

> overlap ≈ 15–25% of chunk_size
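
As a worked example of the overlap rule, the sketch below picks 20% overlap and estimates how many chunks a document produces. It assumes fixed-size windows, so for a separator-aware splitter the count is only an approximation; both helper functions are hypothetical.

```python
import math


def overlap_for(chunk_size, fraction=0.2):
    """Overlap as a fraction of chunk_size (0.15 to 0.25 per the rule of thumb)."""
    return round(chunk_size * fraction)


def estimate_chunk_count(doc_length, chunk_size, chunk_overlap):
    """Number of fixed-size windows needed to cover doc_length characters."""
    if doc_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # how far each new window advances
    return 1 + math.ceil((doc_length - chunk_size) / step)


chunk_size = 1000
chunk_overlap = overlap_for(chunk_size)                  # 20% of 1000
n_chunks = estimate_chunk_count(10_000, chunk_size, chunk_overlap)
print(chunk_overlap, n_chunks)
```

With a 10,000-character document, each window advances 800 characters, so 13 windows are needed instead of the 10 you would get with no overlap. More overlap means more chunks to embed and store.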

---

# 1️⃣1️⃣ Why is overlap critical?

Without overlap:

```
Chunk1: "Neural networks learn patterns"
Chunk2: " from data using backpropagation"
```

Meaning breaks.

With overlap:

```
Chunk1: "... learn patterns from data"
Chunk2: "patterns from data using backpropagation"
```

Context preserved ✅

---

# 1️⃣2️⃣ Full professional pipeline example

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# docs is the List[Document] returned by loader.load()
chunks = splitter.split_documents(docs)
```

---

# 🧠 Mental Model (Important)

Think like this:

| Stage     | Role                           |
| --------- | ------------------------------ |
| Loader    | Reads data                     |
| Parser    | Cleans structure               |
| Splitter  | Breaks into memory-sized units |
| Embedder  | Converts text → vectors        |
| VectorDB  | Stores memory                  |
| Retriever | Finds relevant chunks          |
| LLM       | Generates answer               |

---

# ✅ Final Conceptual Summary

| Component       | Purpose                     |
| --------------- | --------------------------- |
| PyPDF / PyMuPDF | PDF text extraction engines |
| BeautifulSoup   | HTML parsing                |
| Loaders         | Data ingestion              |
| Documents       | Standard format             |
| Splitters       | Chunk creation              |
| Overlap         | Context continuity          |
| Chunking        | LLM compatibility           |
| Splitting rules | Semantic preservation       |

---

If you want, I can give you a **full LangChain architecture diagram** (Loader → Splitter → Embedding → VectorDB → Retriever → LLM) in text form for revision/notes 📘
or a **ready-made RAG template** using your loaders + splitters + embeddings + FAISS.


In [4]:
#this text loader loads the entire text into one document
from langchain_community.document_loaders import TextLoader
loader=TextLoader('speech.txt')
text_documents=loader.load()
text_documents


[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairn

In [3]:
from langchain_community.document_loaders import PyPDFLoader
loader=PyPDFLoader('attention.pdf')
docs=loader.load()
docs


[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-08-03T00:07:29+00:00', 'author': '', 'keywords': '', 'moddate': '2023-08-03T00:07:29+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser ∗\nGoo

In [5]:
type(docs[0])

langchain_core.documents.base.Document

In [15]:
from langchain_community.document_loaders import WebBaseLoader
import bs4
loader=WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
                     bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title","post-content","post-header")
                     ))
                     )
loader.load()




[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mista

In [23]:
from langchain_community.document_loaders import ArxivLoader
loader=ArxivLoader(query="1706.03762", load_max_docs=2)
docs=loader.load()
docs



[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation 

In [25]:
from langchain_community.document_loaders import WikipediaLoader
# .load() returns documents, so name the result docs rather than loader
docs=WikipediaLoader(query="Gen AI",load_max_docs=2).load()
docs


[Document(metadata={'title': 'Generative artificial intelligence', 'summary': 'Generative artificial intelligence (Generative AI or GenAI) is a subfield of artificial intelligence that uses generative models to generate text, images, videos, audio, software code or other forms of data. These models learn the underlying patterns and structures of their training data and use them to generate new data in response to input, which often takes the form of natural language prompts.\nThe prevalence of generative AI tools has increased significantly since the AI boom in the 2020s. This boom was made possible by improvements in deep neural networks, particularly large language models (LLMs), which are based on the transformer architecture. Generative AI applications include chatbots such as ChatGPT, Claude, Copilot, DeepSeek, Google Gemini and Grok; text-to-image models such as Stable Diffusion, Midjourney, and DALL-E; and text-to-video models such as Veo, LTX and Sora. Technology companies deve

In [29]:
import json
import requests
json_data=requests.get("https://api.smith.langchain.com/openapi.json").json()

In [30]:
json_data

{'openapi': '3.1.0',
 'info': {'title': 'LangSmith',
  'description': 'The LangSmith API is used to programmatically create and manage LangSmith resources.\n\n## Host\nhttps://api.smith.langchain.com\n\n## Authentication\nTo authenticate with the LangSmith API, set the `X-Api-Key` header\nto a valid [LangSmith API key](https://docs.langchain.com/langsmith/create-account-api-key#create-an-api-key).\n\n',
  'version': '0.1.0'},
 'paths': {'/api/v1/audit-logs': {'get': {'tags': ['audit-logs'],
    'summary': 'Get Audit Logs',
    'description': "Retrieve audit log records for the authenticated user's organization in OCSF format.\n\nRequires both start_time and end_time parameters to filter logs within a date range.\nSupports cursor-based pagination.\n\nReturns results in OCSF API Activity (Class UID: 6003) format,\nwhich is compatible with security monitoring and SIEM tools.\nReference: https://schema.ocsf.io/1.7.0/classes/api_activity",
    'operationId': 'get_audit_logs_api_v1_audit_log

In [36]:
from langchain_text_splitters import RecursiveJsonSplitter
json_splitter=RecursiveJsonSplitter(max_chunk_size=300)
json_chunks=json_splitter.split_json(json_data)

In [37]:
json_chunks

[{'openapi': '3.1.0',
  'info': {'title': 'LangSmith',
   'description': 'The LangSmith API is used to programmatically create and manage LangSmith resources.\n\n## Host\nhttps://api.smith.langchain.com\n\n## Authentication\nTo authenticate with the LangSmith API, set the `X-Api-Key` header\nto a valid [LangSmith API key](https://docs.langchain.com/langsmith/create-account-api-key#create-an-api-key).\n\n'}},
 {'info': {'version': '0.1.0'},
  'paths': {'/api/v1/audit-logs': {'get': {'tags': ['audit-logs'],
     'summary': 'Get Audit Logs'}}}},
 {'paths': {'/api/v1/audit-logs': {'get': {'description': "Retrieve audit log records for the authenticated user's organization in OCSF format.\n\nRequires both start_time and end_time parameters to filter logs within a date range.\nSupports cursor-based pagination.\n\nReturns results in OCSF API Activity (Class UID: 6003) format,\nwhich is compatible with security monitoring and SIEM tools.\nReference: https://schema.ocsf.io/1.7.0/classes/api_act

In [38]:
docs=json_splitter.create_documents(texts=[json_data])
docs

[Document(metadata={}, page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "description": "The LangSmith API is used to programmatically create and manage LangSmith resources.\\n\\n## Host\\nhttps://api.smith.langchain.com\\n\\n## Authentication\\nTo authenticate with the LangSmith API, set the `X-Api-Key` header\\nto a valid [LangSmith API key](https://docs.langchain.com/langsmith/create-account-api-key#create-an-api-key).\\n\\n"}}'),
 Document(metadata={}, page_content='{"info": {"version": "0.1.0"}, "paths": {"/api/v1/audit-logs": {"get": {"tags": ["audit-logs"], "summary": "Get Audit Logs"}}}}'),
 Document(metadata={}, page_content='{"paths": {"/api/v1/audit-logs": {"get": {"description": "Retrieve audit log records for the authenticated user\'s organization in OCSF format.\\n\\nRequires both start_time and end_time parameters to filter logs within a date range.\\nSupports cursor-based pagination.\\n\\nReturns results in OCSF API Activity (Class UID: 6003) format,\\nw

In [41]:
texts=json_splitter.split_text(json_data)
print(texts[0])
texts[1]


{"openapi": "3.1.0", "info": {"title": "LangSmith", "description": "The LangSmith API is used to programmatically create and manage LangSmith resources.\n\n## Host\nhttps://api.smith.langchain.com\n\n## Authentication\nTo authenticate with the LangSmith API, set the `X-Api-Key` header\nto a valid [LangSmith API key](https://docs.langchain.com/langsmith/create-account-api-key#create-an-api-key).\n\n"}}


'{"info": {"version": "0.1.0"}, "paths": {"/api/v1/audit-logs": {"get": {"tags": ["audit-logs"], "summary": "Get Audit Logs"}}}}'

Perfect, this is a **very important LangChain concept** and you're asking exactly the right comparison 👍
Let's break it down **slowly, clearly, and with a mental model**.

---

## First: what `json_data` is

From the cells above:

```python
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
```

So:

* It is a **Python dict** (parsed JSON), not a string
* Not a `Document`
* No metadata yet

---

# 1️⃣ `create_documents()`

```python
docs = json_splitter.create_documents(texts=[json_data])
```

### What it does

✅ Splits the JSON
✅ Wraps each chunk into a **Document object**
⚠️ Leaves metadata empty (`{}`) unless you pass `metadatas=[...]`

### Output type

```python
List[Document]
```

Each item:

```python
Document(
    page_content="chunk text",
    metadata={}  # empty unless you pass metadatas=[...]
)
```

### Example

```python
docs[0].page_content
docs[0].metadata
```

---

### When to use `create_documents()`

✔ When you want to:

* Store chunks in a **vector database**
* Use **RAG**
* Attach metadata to each chunk
* Pass data into LangChain pipelines

👉 **Production / real systems**

---

# 2️⃣ `split_text()`

```python
texts = json_splitter.split_text(json_data)
```

### What it does

✅ Splits the JSON into JSON-formatted strings
❌ Does NOT create Document objects
❌ No metadata

### Output type

```python
List[str]
```

Example:

```python
texts[0]
texts[1]
```

---

### When to use `split_text()`

✔ When you:

* Just want to see the chunks
* Are debugging
* Need plain text
* Don't care about metadata

👉 **Exploration / testing / preprocessing**

---

# 3️⃣ Side-by-side comparison (clear)

| Feature   | `create_documents()`                  | `split_text()`                      |
| --------- | ------------------------------------- | ----------------------------------- |
| Input     | List of inputs (here `[json_data]`)   | Single input (here `json_data`)     |
| Output    | List[Document]                        | List[str]                           |
| Metadata  | ✅ Optional via `metadatas`, else `{}` | ❌ No                                |
| RAG-ready | ✅ Yes                                 | ⚠️ Strings must be wrapped first    |
| Vector DB | ✅ `add_documents()`                   | ⚠️ Only via `add_texts()`           |
| Debugging | ⚠️ Overkill                           | ✅ Perfect                           |

---

# 4️⃣ Why `create_documents()` takes a LIST

```python
texts=[json_data]
```

Because:

* You may have **multiple texts**
* Each text becomes its own **document group**
* If you pass `metadatas`, entry `i` is attached to every chunk of text `i`

Example:

```python
create_documents([
  text_from_file1,
  text_from_file2
])
```
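
A toy sketch of that pairing, with a hypothetical `Doc` dataclass and a hard-cut `split_text` standing in for the real classes. It is not LangChain's implementation; it only shows that metadata appears on chunks when the caller supplies it.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text, chunk_size):
    """Toy splitter: hard cut every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def create_documents(texts, metadatas=None, chunk_size=10):
    """Split each text and wrap every chunk in a Doc.
    Metadata is attached only if the caller supplies it."""
    metadatas = metadatas or [{} for _ in texts]
    docs = []
    for text, meta in zip(texts, metadatas):
        for chunk in split_text(text, chunk_size):
            docs.append(Doc(chunk, dict(meta)))  # copy so chunks don't share a dict
    return docs


plain = create_documents(["hello world, again"])
tagged = create_documents(
    ["hello world, again", "another text"],
    metadatas=[{"source": "file1.txt"}, {"source": "file2.txt"}],
)
print([d.metadata for d in plain])
print([(d.page_content, d.metadata["source"]) for d in tagged])
```

Without `metadatas`, every chunk carries an empty dict, which matches the `metadata={}` seen in the notebook output for `json_splitter.create_documents(texts=[json_data])`.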

---

# 5️⃣ Why metadata matters (real reason)

Example (metadata you attach yourself, for instance via `metadatas=`):

```python
{
  "source": "api_response.json",
  "chunk": 3
}
```

During retrieval:

> "This answer came from chunk 3 of api_response.json"

RAG pipelines **depend on metadata** for:

* source attribution
* ranking
* filtering
* citations
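
A small sketch of filtering and citation driven by metadata. The chunks are hypothetical plain dicts rather than real `Document` objects, and the `chunk` index is something we attached ourselves.

```python
# Hypothetical chunks; "source" and "chunk" are metadata we attached ourselves.
chunks = [
    {"page_content": "Get Audit Logs",
     "metadata": {"source": "api_response.json", "chunk": 0}},
    {"page_content": "Supports cursor-based pagination.",
     "metadata": {"source": "api_response.json", "chunk": 1}},
    {"page_content": "The world must be made safe for democracy.",
     "metadata": {"source": "speech.txt", "chunk": 0}},
]


def filter_by_source(chunks, source):
    """Keep only chunks whose metadata names the given source."""
    return [c for c in chunks if c["metadata"].get("source") == source]


def cite(chunk):
    """Build a human-readable citation from a chunk's metadata."""
    meta = chunk["metadata"]
    return f"{meta['source']} (chunk {meta['chunk']})"


api_chunks = filter_by_source(chunks, "api_response.json")
citations = [cite(c) for c in api_chunks]
print(citations)
```

Vector stores typically expose the same idea as a metadata filter on search; the exact parameter depends on the store.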

---

# 6️⃣ Visual mental model

### `split_text()`

```
Raw Text
   ↓
["chunk1", "chunk2", "chunk3"]
```

### `create_documents()`

```
Raw Text
   ↓
[Document(chunk1), Document(chunk2), Document(chunk3)]
```

---

# 7️⃣ Which one should YOU use?

### For JSON → RAG

✅ **create_documents**

### For inspection / logging

✅ **split_text**

---

# 8️⃣ Common beginner mistake (⚠️)

❌ Doing this:

```python
texts = splitter.split_text(json_data)
vectorstore.add_documents(texts)  # ❌ WRONG: add_documents expects Document objects, not strings
```

✅ Correct:

```python
docs = splitter.create_documents([json_data])
vectorstore.add_documents(docs)
```

---

# 9️⃣ One-line rule to remember 🧠

> **If it's going into embeddings → use `create_documents()`**
> **If you're just looking → use `split_text()`**

---

If you want next:

* JSON-specific chunking best practices
* How to chunk nested JSON properly
* Why normal splitters break JSON & how to fix it
* End-to-end JSON ‚Üí FAISS ‚Üí RAG example

Just tell me 👍


#### Doubt 2
Ahhh, good catch 👀
This confusion is **very common**, and you're actually noticing an *important distinction* in LangChain. Let's untangle it cleanly.

---

## Short answer first (so your doubt is resolved immediately)

👉 **`split_text()` never returns metadata.**
👉 **`split_documents()` returns `Document` objects, so you see metadata.**

If you saw metadata when you thought you used `split_text`, then **you were actually using `split_documents()` somewhere**, or inspecting `docs`, not the result of `split_text()`.

---

## Now let's break it down properly

### Your code (correct)

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

final_documents = text_splitter.split_documents(docs)
```

### Output

```python
[
  Document(
    page_content="...",
    metadata={'source': 'speech.txt'}
  ),
  ...
]
```

### Why metadata exists here

Because:

* `docs` is already a **List[Document]**
* Each `Document` has metadata
* `split_documents()`:

  * splits `page_content`
  * **copies metadata into each chunk**

✅ Metadata is **propagated**, not created.
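
A toy sketch of that propagation, with a hypothetical `Doc` dataclass and a fixed-width `split_documents` standing in for the real classes (the real splitter cuts at separators, not at fixed positions):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(docs, chunk_size):
    """Cut each document's text into pieces and copy the parent's
    metadata onto every piece."""
    out = []
    for doc in docs:
        text = doc.page_content
        for i in range(0, len(text), chunk_size):
            out.append(Doc(text[i:i + chunk_size], dict(doc.metadata)))
    return out


docs = [Doc("The world must be made safe for democracy.", {"source": "speech.txt"})]
final_documents = split_documents(docs, chunk_size=20)
for d in final_documents:
    print(repr(d.page_content), d.metadata)
```

Every chunk reports `speech.txt` as its source even though the splitter itself knows nothing about files; it only copies what the loader put there.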

---

## What `split_text()` ACTUALLY does

```python
texts = text_splitter.split_text(docs[0].page_content)  # input must be a str
```

### Output

```python
List[str]
```

Example:

```python
texts[0]   # just a string
```

❌ No metadata
❌ No `Document` object

---

## Core difference (this is the key)

| Method              | Input            | Output           | Metadata |
| ------------------- | ---------------- | ---------------- | -------- |
| `split_text()`      | `str`            | `List[str]`      | ❌ No     |
| `split_documents()` | `List[Document]` | `List[Document]` | ✅ Yes    |

---

## Why LangChain behaves this way (design reason)

LangChain assumes **two different pipelines**:

### Pipeline A: raw text

```
str → split_text → List[str]
```

### Pipeline B: structured docs (RAG)

```
Document → split_documents → Document chunks (metadata preserved)
```

This separation avoids:

* accidental metadata loss
* mixing raw text with structured pipelines

---

## Why metadata is copied to *every* chunk

Example:

```python
metadata = {'source': 'speech.txt'}
```

If chunking produces 5 chunks:

```python
# chunk1 → {'source': 'speech.txt'}
# chunk2 → {'source': 'speech.txt'}
...
```

### Why?

So that during retrieval:

> "This answer came from **speech.txt**"

Without this, RAG would lose source tracking ❌

---

## Common mistake (what likely happened)

Many learners do this and get confused:

```python
docs = loader.load()
texts = text_splitter.split_text(docs[0].page_content)
```

Then later:

```python
docs  # still contains metadata
```

So it *looks* like `split_text` gave metadata, but it didn't.

---

## Proof (try this)

```python
texts = text_splitter.split_text(docs[0].page_content)
type(texts[0])
```

Output:

```python
<class 'str'>
```

Vs:

```python
final_documents = text_splitter.split_documents(docs)
type(final_documents[0])
```

Output:

```python
<class 'langchain_core.documents.base.Document'>
```

---

## One-line rule to lock it in 🧠

> **Metadata only exists inside `Document` objects.**
> **Anything that returns `str` cannot have metadata.**

---

## Final mental picture

```
split_text()
   ↓
"text chunk"      ❌ metadata

split_documents()
   ↓
Document(
  page_content="text chunk",
  metadata={...}
)
```

---

If you want, next I can explain:

* how metadata duplication affects vector search ranking
* how to add custom metadata per chunk
* how to filter retrieval by metadata (`source`, `page`, etc.)
* why recursive splitting gave you that exact boundary

Just say the word 👌
