## Data Ingestion

In [1]:
# A LangChain Document has two attributes: page_content (the text) and metadata (a dict describing it).
# LangChain also ships loaders for PDF, CSV, and other formats.
from langchain_core.documents import Document

document = Document(
    page_content="Hello, world! this is main text content used in rag",
    metadata={
        "source": "https://example.com",
        "Author": "Mahesh Bhandari",
        "date_created": "2026-01-27"
    }
)

document

Document(metadata={'source': 'https://example.com', 'Author': 'Mahesh Bhandari', 'date_created': '2026-01-27'}, page_content='Hello, world! this is main text content used in rag')
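
The `Document` shown above pairs a text body with a metadata dictionary, and that pairing is what allows filtering by source or author later in a pipeline. The following is only a sketch of that idea using a stand-in dataclass; `SimpleDocument` and `filter_by_metadata` are hypothetical names invented for illustration and are not part of LangChain.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two attributes: text plus metadata."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs, key, value):
    """Return the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


# Assumed example data, not taken from the notebook.
docs = [
    SimpleDocument("Text about RAG", {"source": "https://example.com"}),
    SimpleDocument("Unrelated text", {"source": "local-notes"}),
]
matches = filter_by_metadata(docs, "source", "https://example.com")
print(matches[0].page_content)
```

Using `metadata.get(key)` rather than indexing means documents that lack the key are skipped instead of raising `KeyError`.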

In [2]:
import os
os.makedirs("../data/text_files", exist_ok=True)

In [5]:
sample_texts = {

    "../data/text_files/python_intro.txt":"""Python is a high-level, general-purpose programming language that has quietly become one of the most influential tools in modern computing. It doesn’t try to impress with flashy syntax or low-level tricks. Instead, it focuses on clarity, readability, and getting real work done with minimal friction. That design choice is exactly why Python is everywhere—from beginner classrooms to research labs, startups, and big tech companies.

A language designed for humans

Python was created by Guido van Rossum in the late 1980s, with a simple philosophy: code should be easy to read and easy to write. This idea is baked into the language itself. Python uses indentation instead of braces, which forces a clean structure. There’s no clutter of semicolons or boilerplate syntax. What you read often looks close to plain English.

For example, a loop in Python almost explains itself:

for number in range(5):
    print(number)


Even if you’ve never written Python before, you can guess what this does. That’s not an accident. Python’s design encourages programs that are understandable not just by machines, but by people who may read the code months or years later—including your future self.

Interpreted, not compiled (mostly)

Python is an interpreted language, meaning code is executed line by line rather than being compiled into machine code ahead of time. This makes development fast and interactive. You can write a few lines, run them immediately, and see what happens. That feedback loop is gold when you’re learning or experimenting.

The tradeoff is performance. Python is generally slower than compiled languages like C or C++. But in practice, this rarely matters as much as people think. Many Python programs rely on optimized libraries written in C or C++ under the hood. You write clean Python code; the heavy lifting happens elsewhere.

A massive standard library

One of Python’s biggest strengths is its standard library. It ships with tools for file handling, networking, regular expressions, data serialization, math, statistics, and much more. This is often summarized as “batteries included,” and it’s accurate.

Need to read a CSV file? There’s a module for that.
Need to work with dates and times? Covered.
Need to make an HTTP request? Already there.

This reduces the need to reinvent the wheel and helps you focus on solving the actual problem instead of wiring basic utilities together.

Python’s ecosystem: where the real power is

Beyond the standard library, Python has one of the largest ecosystems of third-party packages in the world. Tools like NumPy, Pandas, and SciPy dominate scientific computing. TensorFlow and PyTorch power machine learning and deep learning research. Django and Flask are widely used for backend web development. FastAPI has become a favorite for building clean, high-performance APIs.

What’s important here isn’t just the number of libraries, but their quality and community support. If you’re working in data science, web development, automation, DevOps, or AI, Python probably already has a mature solution waiting for you.

Python in data science and machine learning

Python’s rise in data science is no coincidence. The language makes it easy to express mathematical ideas in code. Libraries like NumPy provide fast numerical arrays, while Pandas offers high-level data structures that feel natural when working with tabular data.

In machine learning, Python acts as the glue language. Researchers prototype models quickly, experiment with ideas, and visualize results without fighting the language. This speed of experimentation matters more than raw execution speed, especially in research and applied ML.

That’s why Python dominates notebooks, labs, and Kaggle competitions—and why it’s now deeply embedded in production ML systems as well.

Web development and backend systems

Python is also a serious backend language. Frameworks like Django offer a full-featured approach with authentication, ORM, and admin tools out of the box. Flask and FastAPI take a lighter approach, giving developers more control and flexibility.

What Python does well here is reduce mental overhead. You’re not wrestling with syntax or obscure language rules. You’re thinking about routes, data flow, and business logic. For small teams and startups, this can dramatically speed up development.

Automation and scripting

If there’s one area where Python feels unbeatable, it’s automation. Writing scripts to rename files, scrape websites, process logs, or automate repetitive tasks is straightforward. Python’s readable syntax and powerful libraries make it ideal for gluing systems together.

This is why Python is popular among system administrators, QA engineers, and even non-programmers who want to automate parts of their workflow.

Object-oriented, functional, and everything in between

Python doesn’t force you into one programming style. You can write simple scripts, object-oriented systems, or functional-style code using lambdas and higher-order functions. This flexibility is a double-edged sword: it’s powerful, but it also means codebases can become inconsistent if teams aren’t disciplined.

Still, for most developers, this flexibility is a feature, not a flaw. You can start simple and grow your design as the project evolves.

Weaknesses and limitations

Python isn’t perfect. Its speed can be an issue in CPU-bound tasks. The Global Interpreter Lock (GIL) limits true multi-threaded execution in many cases. Memory usage can also be higher compared to lower-level languages.

But here’s the thing: Python doesn’t pretend to be something it’s not. When performance really matters, developers optimize critical sections in C, use multiprocessing, or rely on specialized libraries. Python stays focused on being a productive, expressive layer on top.

Why Python keeps winning

Python’s success isn’t just technical. It’s cultural. The community emphasizes readability, documentation, and beginner-friendliness. The learning curve is gentle, but the ceiling is high. You can start with print statements and end up building distributed systems or training neural networks.

In short, Python lowers the barrier to entry without limiting ambition.

Final thoughts

Python is not about clever code. It’s about clear code. It’s about moving from idea to implementation with minimal resistance. Whether you’re a student learning programming, a developer building web apps, a data scientist analyzing trends, or a researcher experimenting with models, Python meets you where you are.

That’s why it’s not just a popular language—it’s a practical one. And that’s why it’s not going away anytime soon."""
}

In [6]:
for filepath, content in sample_texts.items():
    with open(filepath, "w", encoding="utf-8") as file:
        file.write(content)
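
Before handing files to a loader, it can help to confirm that they were written and decode back to the same text, especially when the content contains non-ASCII characters such as curly apostrophes. The sketch below assumes a throwaway temporary directory and a made-up `samples` dict rather than the notebook's own paths.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data: relative path -> text, the same shape as sample_texts.
samples = {"notes/intro.txt": "Python’s syntax is readable.\nSecond line."}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for relative_path, content in samples.items():
        target = root / relative_path
        # Create parent folders so the write cannot fail on a missing directory.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")

    # Read every file back with the same encoding and compare to the source text.
    round_trip_ok = all(
        (root / relative_path).read_text(encoding="utf-8") == content
        for relative_path, content in samples.items()
    )

print(round_trip_ok)
```

Passing `encoding="utf-8"` on both the write and the read avoids depending on the platform's default encoding.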

In [2]:
# TextLoader reads a single text file; load() returns a list of Document objects
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()


In [3]:
document

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python is a high-level, general-purpose programming language that has quietly become one of the most influential tools in modern computing. It doesn’t try to impress with flashy syntax or low-level tricks. Instead, it focuses on clarity, readability, and getting real work done with minimal friction. That design choice is exactly why Python is everywhere—from beginner classrooms to research labs, startups, and big tech companies.\n\nA language designed for humans\n\nPython was created by Guido van Rossum in the late 1980s, with a simple philosophy: code should be easy to read and easy to write. This idea is baked into the language itself. Python uses indentation instead of braces, which forces a clean structure. There’s no clutter of semicolons or boilerplate syntax. What you read often looks close to plain English.\n\nFor example, a loop in Python almost explains itself:\n\nfor number in range(5):\n    

In [9]:
# DirectoryLoader applies loader_cls to every file under the directory that matches glob
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)

In [10]:
doc = dir_loader.load()
print(doc)

100%|██████████| 1/1 [00:00<00:00, 944.24it/s]

[Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content='Python is a high-level, general-purpose programming language that has quietly become one of the most influential tools in modern computing. It doesn’t try to impress with flashy syntax or low-level tricks. Instead, it focuses on clarity, readability, and getting real work done with minimal friction. That design choice is exactly why Python is everywhere—from beginner classrooms to research labs, startups, and big tech companies.\n\nA language designed for humans\n\nPython was created by Guido van Rossum in the late 1980s, with a simple philosophy: code should be easy to read and easy to write. This idea is baked into the language itself. Python uses indentation instead of braces, which forces a clean structure. There’s no clutter of semicolons or boilerplate syntax. What you read often looks close to plain English.\n\nFor example, a loop in Python almost explains itself:\n\nfor number in range(5):\n 
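
Conceptually, the directory load above walks a folder, keeps the files that match a glob pattern, and records each file's path under a `source` metadata key. The standard-library sketch below illustrates that idea only; it is not how `DirectoryLoader` is implemented, and `load_text_files` is a hypothetical helper that returns plain dicts rather than `Document` objects.

```python
import tempfile
from pathlib import Path


def load_text_files(directory, pattern="**/*.txt", encoding="utf-8"):
    """Collect matching files as dicts with page_content and metadata keys."""
    records = []
    # Sort the matches so the result order is deterministic.
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            records.append(
                {
                    "page_content": path.read_text(encoding=encoding),
                    "metadata": {"source": str(path)},
                }
            )
    return records


# Assumed demo layout in a temporary directory, not the notebook's data folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("first", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("second", encoding="utf-8")
    (root / "skip.md").write_text("ignored", encoding="utf-8")
    records = load_text_files(root)

print([r["page_content"] for r in records])
```

The `**/` prefix in the pattern makes the search recursive, so files in the top-level folder and in subfolders are both picked up, while the `.md` file is left out.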


