# Data Ingestion

## LangChain Document Structure

### Overview
- LangChain represents textual inputs as Document objects that carry the text plus contextual metadata. This structure standardizes how loaders, splitters, embedders, vector stores, and retrievers interact in an LLM pipeline.

### Core fields
- `page_content` (string): the raw text for the document chunk.
- `metadata` (dict): JSON-serializable key/value pairs (e.g., `{"source": "...", "page": 2, "author": "...", "timestamp": "..."}`).
- `id` (optional string): unique identifier for the document or chunk.
- Embeddings are not a field of the `Document` itself; a VectorStore keeps each chunk's vector alongside its text and metadata.
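
Since `metadata` is expected to be JSON-serializable, a quick check before ingestion can catch problem values early. The following is a minimal sketch using only the standard library; the helper name `is_json_serializable` and the sample dicts are hypothetical, not part of LangChain.

```python
import json


def is_json_serializable(metadata: dict) -> bool:
    """Return True if json.dumps accepts the metadata as-is (hypothetical helper)."""
    try:
        json.dumps(metadata)
        return True
    except (TypeError, ValueError):
        return False


# Plain strings and numbers serialize without trouble.
good = {"source": "example.txt", "page": 2, "author": "..."}
# Raw bytes are rejected by the json module.
bad = {"source": "example.txt", "raw": b"\x00\x01"}

print(is_json_serializable(good))
print(is_json_serializable(bad))
```

Running a check like this at load time is cheaper than discovering a non-serializable value when the vector store tries to persist it.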

### Typical pipeline
1. DocumentLoaders: read files, PDFs, HTML, S3, etc. -> produce full Document(s).
2. TextSplitters: split long documents into chunks (commonly 200-1000 tokens) with some overlap between neighbouring chunks to preserve context.
3. Embeddings: convert chunks into vectors.
4. VectorStore: persist vectors + metadata for efficient similarity search.
5. Retriever: fetch relevant Document(s) for a query.
6. Chains/Prompting: format retrieved content into prompts for the LLM.
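
To make the splitting step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: it counts characters rather than tokens, the function name `split_with_overlap` and its defaults are assumptions, and it is not how LangChain's own TextSplitters are implemented.

```python
def split_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

Each chunk starts with the last `overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.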

### Best practices
- Keep `metadata` small and JSON-serializable (avoid storing large binary blobs).
- Include provenance fields: `source`, `page`, `url`, `timestamp`, and optionally a content hash.
- Use consistent chunk size and overlap tuned to your model/context window.
- Preserve original document IDs so you can trace chunks back to sources.
- Normalize text (unicode, whitespace) before chunking and embedding.
- If you need to store structured data, keep it in metadata (not inside `page_content`).
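
Two of the practices above, text normalization and a content hash for provenance, can be sketched with the standard library. The helper names `normalize_text` and `build_metadata`, the choice of NFKC normalization, and the `content_hash` key are assumptions made for illustration.

```python
import hashlib
import re
import unicodedata
from typing import Optional


def normalize_text(text: str) -> str:
    """Normalize unicode and whitespace before chunking (one possible policy)."""
    text = unicodedata.normalize("NFKC", text)  # fold ligatures, non-breaking spaces, etc.
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # keep at most one blank line
    return text.strip()


def build_metadata(text: str, source: str, page: Optional[int] = None) -> dict:
    """Build small, JSON-serializable provenance metadata with a SHA-256 content hash."""
    meta = {
        "source": source,
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    if page is not None:
        meta["page"] = page
    return meta


clean = normalize_text("Hello\t\tworld  ")
meta = build_metadata(clean, source="example.txt", page=2)
print(clean)
print(meta["source"], meta["page"], len(meta["content_hash"]))
```

Hashing the normalized text rather than the raw text means that two copies differing only in whitespace map to the same hash, which helps with de-duplication.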


In [12]:
from langchain_core.documents import Document

In [None]:
## Example Document representation
doc = Document(
    page_content="this is the main text content I am learning RAG.",
    metadata={
        "source": "example.txt",
        "pages": 20,
        "author": "Dhruvil Lathiya",
        "date_created": "2025-01-01",
    },
)

doc

Document(metadata={'source': 'example.txt', 'pages': 20, 'author': 'Dhruvil Lathiya', 'date_created': '2025-01-01'}, page_content='this is the main text content I am learning RAG.')

In [None]:
## Create the directory that will hold the sample text files
import os

os.makedirs("../data/text_files", exist_ok=True)

In [16]:
sample_text = {
"../data/text_files/langchain_intro.txt" : """LangChain is a framework for building applications with language models. 
- It connects to data sources and interacts with external environments.
- LangChain provides tools for document loading, text splitting, and embedding generation. 
- It also supports vector storage and retrieval, enabling efficient similarity searches. 
- With LangChain, developers can create pipelines that process and transform text data, 
- making it easier to integrate language models into real-world applications."""
}

for filepath, content in sample_text.items():
    with open(filepath, 'w', encoding="utf-8") as f:
        f.write(content)

print("Sample text files created!")


Sample text files created!


In [None]:
## TextLoader
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/langchain_intro.txt", encoding="utf-8")
document = loader.load()

print(document)


[Document(metadata={'source': '../data/text_files/langchain_intro.txt'}, page_content='LangChain is a framework for building applications with language models. \n- It connects to data sources and interacts with external environments.\n- LangChain provides tools for document loading, text splitting, and embedding generation. \n- It also supports vector storage and retrieval, enabling efficient similarity searches. \n- With LangChain, developers can create pipelines that process and transform text data, \n- making it easier to integrate language models into real-world applications.')]


In [33]:
## Directory Loader
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    path="../data/text_files",
    glob="**/*.txt",  ## Pattern to match files (recursive)
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)

text_documents = dir_loader.load()
text_documents

[Document(metadata={'source': '..\\data\\text_files\\derivatives.txt'}, page_content='## 1️⃣ Where to Find Option Data\n\n**Options are listed on exchanges**, so all details are publicly available. In India, the main exchange is **NSE (National Stock Exchange)**.\n\nYou can also see this via **broker platforms** (Zerodha Kite, Upstox, Groww, etc.) or financial websites (Moneycontrol, NSE India, Investing.com).\n\n---\n\n## 2️⃣ What You’ll See in an Option Chain\n\nAn **option chain** shows all options for a particular underlying stock or index.\n\nHere’s what you can expect:\n\n| Column                      | Meaning                                                            |\n| --------------------------- | ------------------------------------------------------------------ |\n| **Strike Price**            | Fixed price at which you can exercise (buy for Call, sell for Put) |\n| **LTP (Last Traded Price)** | Latest premium price for that option                               |\n| **Bid
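
To preview which files a pattern such as `**/*.txt` would pick up before loading them, the same pattern can be tried with `pathlib`. This is a sketch under the assumption that the loader's `glob` argument follows pathlib-style semantics; the helper `preview_matches` and the temporary files are hypothetical.

```python
import tempfile
from pathlib import Path


def preview_matches(root: str, pattern: str = "**/*.txt") -> list[str]:
    """List files under root that match pattern, as sorted POSIX-style relative paths."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


# Build a throwaway directory tree so the example does not depend on ../data.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nested").mkdir()
    (base / "a.txt").write_text("top level", encoding="utf-8")
    (base / "nested" / "b.txt").write_text("nested", encoding="utf-8")
    (base / "notes.md").write_text("ignored", encoding="utf-8")

    matches = preview_matches(tmp)

print(matches)
```

The `**` segment matches the root directory and every subdirectory, so both the top-level and the nested `.txt` file are listed while `notes.md` is skipped.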

In [35]:
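## PDF Directory Loader
## PyMuPDFLoader is used here; PyPDFLoader is imported as an alternative PDF loader
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

dir_loader = DirectoryLoader(
    path="../data/pdf",
    glob="**/*.pdf",
    loader_cls=PyMuPDFLoader,
)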

pdf_documents = dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20251029145556', 'source': '..\\data\\pdf\\chapter-1.pdf', 'file_path': '..\\data\\pdf\\chapter-1.pdf', 'total_pages': 17, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': 'D:20251029145556', 'page': 0}, page_content='Part 1 \nFrom Far Rockaway to MIT \nHe Fixes Radios by Thinking! \nWhen I was about eleven or twelve I set up a lab in my house. It consisted of an old wooden packing box that I put shelves in. I had a heater, and \nI\'d put in fat and cook french-fried potatoes all the time. I also had a storage battery, and a lamp bank. \nTo build the lamp bank I went down to the five-and-ten and got some sockets you can screw down to a wooden base, and connected them with \npieces of bell wire. By  making different combinations of switches--in series or parallel--I knew I could get different voltages. But what I had