### Data Ingestion

In [2]:
### Document Structure

from langchain_core.documents import Document

In [3]:
doc = Document(
    page_content="This is the main text content, I am using to create RAG",
    metadata={
        # Metadata is what retrieval filters on later (e.g. by source or author)
        "source": "example.txt",
        "pages": 1,
        "author": "Ashish Kumar",
        "date_created": "2026-01-03",
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Ashish Kumar', 'date_created': '2026-01-03'}, page_content='This is the main text content, I am using to create RAG')
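
The comment above says metadata matters for filtering. A minimal sketch of what that can look like, assuming a plain in-memory list of documents; `filter_by_metadata` and the second document are hypothetical and not part of LangChain. The `except` branch is only a stand-in so the sketch runs where LangChain is not installed.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, only so the sketch runs without LangChain.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Return the documents whose metadata equals every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


docs = [
    Document(
        page_content="Intro to RAG",
        metadata={"source": "example.txt", "author": "Ashish Kumar"},
    ),
    # Hypothetical second document, added only to have something to filter out.
    Document(
        page_content="Unrelated notes",
        metadata={"source": "notes.txt", "author": "Someone Else"},
    ),
]

matches = filter_by_metadata(docs, author="Ashish Kumar")
print([d.metadata["source"] for d in matches])  # ['example.txt']
```

Vector stores usually expose their own metadata filters; this only illustrates the idea on a Python list.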

In [4]:
### Create the directory that will hold the sample text files
import os
os.makedirs("../data/text_files", exist_ok=True)

In [6]:
sample_texts = {
    "../data/text_files/python_intro.txt":""" Python Introduction

Python is a high-level, interpreted programming language that has become the gold standard for modern software development due to its focus on readability and simplicity. Created by Guido van Rossum and first released in 1991, Python was designed to be powerful enough for complex systems while remaining intuitive enough for beginners.

The language's philosophy, often summarized in the "Zen of Python," prioritizes clean, elegant code. By using English-like keywords and significant indentation instead of complex symbols like curly braces or semicolons, Python allows developers to express concepts in fewer lines of code than languages like C++ or Java.

Key Features
- Interpreted Nature: Python executes code line-by-line, which simplifies debugging and makes the development process highly interactive.
- Dynamically Typed: You don't need to declare variable types explicitly; the interpreter handles this at runtime.
- Comprehensive Standard Library: Often described as having "batteries included," Python’s Standard Library provides tools for everything from web harvesting to cryptography.

Applications in 2026
As of 2026, Python remains the undisputed leader in Artificial Intelligence and Machine Learning, supported by industry-standard libraries like PyTorch and TensorFlow. It is also the primary tool for Data Science, where professionals use Pandas and NumPy to process massive datasets. Beyond data, Python powers the backends of major web platforms via Django and automates repetitive tasks for system administrators worldwide.

Whether you are looking to build a neural network, scrape web data, or simply automate your daily workflow, Python offers a massive, supportive community and a vast ecosystem of third-party packages available through the Python Package Index (PyPI). It is a versatile, future-proof language that continues to bridge the gap between human logic and machine execution.
 
 """,

 "../data/text_files/machine_learning.txt":"""Machine Learning Basics

 Machine Learning (ML) is a subset of Artificial Intelligence that enables systems to learn from data and improve their performance without being explicitly programmed for every task. Instead of following rigid "if-then" rules, the computer identifies patterns in information to make predictions or decisions.

### Core Concepts

To understand how ML works, you need to know these three components:

1. Data: The "textbook" the computer learns from.
2. Model: The mathematical engine or algorithm that processes the data.
3. Training: The process of feeding data into the model so it can recognize patterns.

---

### The Three Main Types

Machine Learning is generally categorized based on how the algorithm learns:

* Supervised Learning: The model is trained on "labeled" data (input-output pairs). Think of it like a student learning with an answer key. It is used for tasks like spam detection or predicting house prices.
* Unsupervised Learning: The model looks at "unlabeled" data and tries to find hidden structures or groupings on its own. A common use case is customer segmentation in marketing.
* Reinforcement Learning: The model learns through trial and error, receiving rewards for good actions and penalties for bad ones. This is how AI learns to play games (like Chess or Go) or navigate robots.

### Why It Matters

ML is the engine behind modern technology, from the product recommendations on your favorite shopping site to the facial recognition on your phone. Its power lies in its ability to handle complexity and scale far beyond what a human could manually code.

"""

}

for filepath, content in sample_texts.items():
    with open(filepath, 'w', encoding="utf-8") as f:
        f.write(content)

print("✅ Sample text files created!")

✅ Sample text files created!
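
A quick way to confirm the write step did what was intended is to read the files back. The sketch below does the same write loop with `pathlib` in a temporary directory, so it runs anywhere; `write_samples` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_samples(base_dir, samples):
    """Write each {filename: text} pair under base_dir and return the paths."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in samples.items():
        path = base / name
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = write_samples(tmp, {"a.txt": "alpha", "b.txt": "naïve café"})
    # Read everything back with the same encoding to check the round trip.
    round_trip = {p.name: p.read_text(encoding="utf-8") for p in paths}

print(round_trip)
```

Using the same explicit encoding on both the write and the read keeps non-ASCII characters intact across platforms.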


In [16]:
### TextLoader
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()
document

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content=' Python Introduction\n\nPython is a high-level, interpreted programming language that has become the gold standard for modern software development due to its focus on readability and simplicity. Created by Guido van Rossum and first released in 1991, Python was designed to be powerful enough for complex systems while remaining intuitive enough for beginners.\n\nThe language\'s philosophy, often summarized in the "Zen of Python," prioritizes clean, elegant code. By using English-like keywords and significant indentation instead of complex symbols like curly braces or semicolons, Python allows developers to express concepts in fewer lines of code than languages like C++ or Java.\n\nKey Features\n- Interpreted Nature: Python executes code line-by-line, which simplifies debugging and makes the development process highly interactive.\n- Dynamically Typed: You don\'t need to declare variable types explicitly;
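
The output above is a list holding one `Document` whose only metadata key is `source`. A hand-rolled sketch of that behaviour, to make the shape concrete; `load_text_file` is hypothetical and is not the library's implementation. The `except` branch is only a stand-in so the sketch runs where LangChain is not installed.

```python
import tempfile
from pathlib import Path

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, only so the sketch runs without LangChain.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one file and wrap its full text in a single-element Document list."""
    text = Path(path).read_text(encoding=encoding)
    return [Document(page_content=text, metadata={"source": str(path)})]


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "python_intro.txt"
    sample.write_text("Python Introduction\n", encoding="utf-8")
    loaded = load_text_file(sample)

print(len(loaded), loaded[0].metadata)
```

The whole file becomes one `page_content` string; splitting it into chunks is a separate step.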

In [17]:
### Directory Loader
from langchain_community.document_loaders import DirectoryLoader

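# Load every .txt file under the directory, including subdirectories
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",
    loader_cls=TextLoader,
    # The files were written as UTF-8, so pass the encoding through to TextLoader;
    # otherwise the platform default is used and non-ASCII characters may be garbled
    loader_kwargs={"encoding": "utf-8"},
    show_progress=False
)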

documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics\n\n Machine Learning (ML) is a subset of Artificial Intelligence that enables systems to learn from data and improve their performance without being explicitly programmed for every task. Instead of following rigid "if-then" rules, the computer identifies patterns in information to make predictions or decisions.\n\n### Core Concepts\n\nTo understand how ML works, you need to know these three components:\n\n1. Data: The "textbook" the computer learns from.\n2. Model: The mathematical engine or algorithm that processes the data.\n3. Training: The process of feeding data into the model so it can recognize patterns.\n\n---\n\n### The Three Main Types\n\nMachine Learning is generally categorized based on how the algorithm learns:\n\n* Supervised Learning: The model is trained on "labeled" data (input-output pairs). Think of it like a student learning with an answer key. It is us
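
To see which files a pattern like `**/*.txt` selects, it can be tried with `pathlib` directly. This sketch assumes the loader's `glob` follows pathlib-style matching; the directory layout and file names are made up for the illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Two .txt files at different depths plus one file that should not match.
    for rel in ("python_intro.txt", "nested/deep.txt", "notes.md"):
        (root / rel).write_text("sample", encoding="utf-8")

    # "**" matches zero or more directory levels, so top-level files are included.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matched)  # ['nested/deep.txt', 'python_intro.txt']
```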

In [23]:
### PDF Loader
# PyMuPDFLoader needs the `pymupdf` package installed
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

dir_loader = DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf",
    loader_cls=PyMuPDFLoader,
    show_progress=False
)

pdf_documents = dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'pdfTeX-1.40.12', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-03-02T01:33:21+00:00', 'source': '..\\data\\pdf\\1409.7495v2.pdf', 'file_path': '..\\data\\pdf\\1409.7495v2.pdf', 'total_pages': 11, 'format': 'PDF 1.5', 'title': 'Unsupervised Domain Adaptation by Backpropagation', 'author': 'Yaroslav Ganin, Victor Lempitsky', 'subject': '', 'keywords': 'Gradient Reversal, Unsupervised Domain Adaptation, Deep Learning', 'moddate': '2015-03-02T01:33:21+00:00', 'trapped': '', 'modDate': 'D:20150302013321Z', 'creationDate': 'D:20150302013321Z', 'page': 0}, page_content='Unsupervised Domain Adaptation by Backpropagation\nYaroslav Ganin\nGANIN@SKOLTECH.RU\nVictor Lempitsky\nLEMPITSKY@SKOLTECH.RU\nSkolkovo Institute of Science and Technology (Skoltech)\nAbstract\nTop-performing deep architectures are trained on\nmassive amounts of labeled data. In the absence\nof labeled data for a certain task, domain adap-\ntation often provides an attractive 

In [24]:
type(pdf_documents[0])

langchain_core.documents.base.Document
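
The PDF metadata above carries `page` and `total_pages`, which suggests the loader returns one `Document` per page. Under that assumption, a small sketch for grouping pages back by file; `group_pages_by_source` and the sample file names are hypothetical. The `except` branch is only a stand-in so the sketch runs where LangChain is not installed.

```python
from collections import defaultdict

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two fields, only so the sketch runs without LangChain.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def group_pages_by_source(docs):
    """Map each source path to its page texts, ordered by the 'page' metadata."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata["source"]].append(doc)
    return {
        source: [d.page_content for d in sorted(pages, key=lambda d: d.metadata["page"])]
        for source, pages in grouped.items()
    }


# Hypothetical page-level documents, deliberately out of order.
pages = [
    Document(page_content="second page", metadata={"source": "paper.pdf", "page": 1}),
    Document(page_content="first page", metadata={"source": "paper.pdf", "page": 0}),
    Document(page_content="only page", metadata={"source": "other.pdf", "page": 0}),
]

by_source = group_pages_by_source(pages)
print({source: len(texts) for source, texts in by_source.items()})
```

With the real `pdf_documents` list, the same call would give the page texts per PDF file.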