### Data Ingestion


In [1]:
### Document structure

from langchain_core.documents import Document

In [None]:
doc = Document(
    page_content="this is the example test content for testing the document loader",
    metadata={
        "source": "example.txt",
        "pages": 1,
        "author": "Badal Kumar",
        "date_created": "2025-01-01",
    },
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Badal Kumar', 'date_created': '2025-01-01'}, page_content='this is the example test content for testing the document loader')

In [None]:
## create the directory that will hold the sample text files
import os

os.makedirs("../data/text_files", exist_ok=True)

In [None]:
sample_texts = {
    "../data/text_files/python_intro.txt": """Introduction to Python

Python is a high-level, interpreted, and general-purpose programming language. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability, using simple syntax that allows programmers to express ideas in fewer lines of code compared to other languages like Java or C++.

⚙️ Key Features of Python

Easy to Learn and Read – Python’s syntax is close to natural English, making it beginner-friendly.

Interpreted Language – You don’t need to compile Python code; it runs line by line.

Dynamically Typed – You don’t have to declare variable types explicitly.

Extensive Libraries – Python has powerful standard libraries (like math, os, datetime) and third-party ones (NumPy, Pandas, TensorFlow, etc.).

Portable and Cross-Platform – The same code can run on Windows, macOS, or Linux.

Object-Oriented and Functional – Supports both programming styles.

Open Source – Free to use and distribute.

💡 What Can You Do with Python?

Python is extremely versatile. It’s used in:

Web Development → Django, Flask

Data Science & Machine Learning → Pandas, NumPy, Scikit-learn, TensorFlow

Automation / Scripting → Automating repetitive tasks

Game Development → Pygame

Cybersecurity → Ethical hacking tools and analysis

Artificial Intelligence → Chatbots, NLP, Computer Vision

Internet of Things (IoT) → Controlling hardware devices
    
    """,
    "../data/text_files/machine_learning_intro.txt": """Machine Learning is a branch of Artificial Intelligence (AI) that focuses on creating systems that can learn from data and improve their performance over time without being explicitly programmed.

In traditional programming, a developer writes a set of rules for the computer to follow. In machine learning, instead of giving rules, we feed the computer a large amount of data, and the system automatically learns patterns and makes predictions or decisions based on that data.

Machine learning uses algorithms that identify relationships, detect patterns, and make predictions. These algorithms are trained using datasets that contain input (features) and sometimes output (labels). Once trained, the model can make predictions on new, unseen data.

Types of Machine Learning

Supervised Learning – The model is trained using labeled data (input and correct output are known).
Examples: Predicting house prices, email spam detection.

Unsupervised Learning – The model learns from unlabeled data and tries to find hidden patterns or groupings.
Examples: Customer segmentation, market basket analysis.

Reinforcement Learning – The model learns by interacting with an environment and receiving rewards or penalties for its actions.
Examples: Self-driving cars, game-playing AI.

Applications of Machine Learning

Recommendation Systems – Netflix, YouTube, and Amazon suggestions.

Image and Speech Recognition – Face detection, voice assistants.

Healthcare – Disease prediction, drug discovery.

Finance – Fraud detection, credit scoring.

Autonomous Vehicles – Navigation and obstacle detection.

Natural Language Processing (NLP) – Chatbots and translation systems.

Conclusion

Machine Learning enables computers to analyze data, identify patterns, and make decisions with minimal human intervention. It is a key technology driving innovations in many industries today and forms the backbone of modern AI systems.
    
    """,
}

for filepath, content in sample_texts.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)
print("✅ sample text files created!")

✅ sample text files created!
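
Before handing the directory to a loader, it can help to confirm that the files were written and are readable as UTF-8. The following is a minimal sketch using only the standard library; `summarize_text_files` is a hypothetical helper name, not part of LangChain, and it assumes the files sit directly inside the directory created above.

```python
from pathlib import Path


def summarize_text_files(directory: str) -> dict[str, tuple[int, str]]:
    """Map each .txt file name to (character count, first non-empty line)."""
    root = Path(directory)
    summary: dict[str, tuple[int, str]] = {}
    if not root.is_dir():
        return summary  # nothing to report if the directory is missing
    for path in sorted(root.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # First line that is not blank, or "" for an empty file
        first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
        summary[path.name] = (len(text), first_line)
    return summary


# Assumes the notebook's relative path; prints nothing if the directory is absent.
for name, (n_chars, first_line) in summarize_text_files("../data/text_files").items():
    print(f"{name}: {n_chars} characters, starts with {first_line[:40]!r}")
```

Reading the files back with the same `encoding="utf-8"` that was used to write them keeps the emoji and typographic quotes in the sample text intact.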


In [None]:
### Text Loaders

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
loader.load()

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Introduction to Python\n\nPython is a high-level, interpreted, and general-purpose programming language. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability, using simple syntax that allows programmers to express ideas in fewer lines of code compared to other languages like Java or C++.\n\n⚙️ Key Features of Python\n\nEasy to Learn and Read – Python’s syntax is close to natural English, making it beginner-friendly.\n\nInterpreted Language – You don’t need to compile Python code; it runs line by line.\n\nDynamically Typed – You don’t have to declare variable types explicitly.\n\nExtensive Libraries – Python has powerful standard libraries (like math, os, datetime) and third-party ones (NumPy, Pandas, TensorFlow, etc.).\n\nPortable and Cross-Platform – The same code can run on Windows, macOS, or Linux.\n\nObject-Oriented and Functional – Supports both programm

In [18]:
### Directory Loader
from langchain_community.document_loaders import DirectoryLoader

## load all the text files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",  ## pattern to match files
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)
dir_loader.load()

100%|██████████| 2/2 [00:00<00:00, 999.60it/s]


[Document(metadata={'source': '..\\data\\text_files\\machine_learning_intro.txt'}, page_content='Machine Learning is a branch of Artificial Intelligence (AI) that focuses on creating systems that can learn from data and improve their performance over time without being explicitly programmed.\n\nIn traditional programming, a developer writes a set of rules for the computer to follow. In machine learning, instead of giving rules, we feed the computer a large amount of data, and the system automatically learns patterns and makes predictions or decisions based on that data.\n\nMachine learning uses algorithms that identify relationships, detect patterns, and make predictions. These algorithms are trained using datasets that contain input (features) and sometimes output (labels). Once trained, the model can make predictions on new, unseen data.\n\nTypes of Machine Learning\n\nSupervised Learning – The model is trained using labeled data (input and correct output are known).\nExamples: Predi
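
The `glob` argument decides which files the loader picks up. As a rough illustration, assuming the pattern follows the glob semantics of Python's `pathlib` (an assumption about the loader, not something shown in this notebook), the sketch below compares `**/*.txt` with `*.txt` on a throwaway directory tree. `match_files` is a hypothetical helper, not a LangChain function.

```python
import tempfile
from pathlib import Path


def match_files(root: str, pattern: str) -> list[str]:
    """Return sorted POSIX-style paths, relative to root, that match pattern."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


# Build a temporary tree so the example does not touch ../data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    for rel in ("a.txt", "notes.md", "nested/b.txt"):
        (Path(tmp) / rel).write_text("demo", encoding="utf-8")

    recursive = match_files(tmp, "**/*.txt")  # top level and subdirectories
    top_level = match_files(tmp, "*.txt")     # top level only

print(recursive)  # ['a.txt', 'nested/b.txt']
print(top_level)  # ['a.txt']
```

With the recursive pattern, text files in subfolders of `../data/text_files` would also be matched, while `notes.md` is skipped in both cases because of the `.txt` suffix.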

In [None]:
### Directory Loader for PDF files
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader


## load all the PDF files from the directory
dir_loader = DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf",  ## pattern to match files
    loader_cls=PyMuPDFLoader,  ## no text encoding option is passed: PDFs are parsed, not decoded
    show_progress=True,
)
pdf_documents = dir_loader.load()
pdf_documents