### TextLoader - Read a single text file

In [1]:
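import os

# Newer LangChain releases expose this loader from langchain_community.document_loaders
from langchain.document_loaders import TextLoader

os.makedirs("data/text_files", exist_ok=True)  # create the data folder if it is missing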

In [2]:
txt_loader = TextLoader(file_path="1_data_ingestion/data/text_files/python_intro.txt", encoding='utf-8')
txt_data = txt_loader.load()

txt_data

[Document(metadata={'source': '1_data_ingestion/data/text_files/python_intro.txt'}, page_content='\nPython Programming Introduction\n\nWhat is Python?\n\nDefinition:\nPython is a high-level, interpreted, general-purpose programming language known for its readability, simplicity, and versatility.\n')]

In [3]:
print(txt_data[0].page_content[0:100])


Python Programming Introduction

What is Python?

Definition:
Python is a high-level, interpreted, 


In [4]:
txt_data[0].metadata

{'source': '1_data_ingestion/data/text_files/python_intro.txt'}

### DirectoryLoader - Read multiple text files

In [5]:
from langchain.document_loaders import DirectoryLoader, TextLoader

dir_path = r"D:\Learning_Aug_2025\final_gen_ai\AgenticAI\1_data_ingestion\data\text_files"

dir_loader = DirectoryLoader(
    path=dir_path,
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},  # force utf-8
)

data_doc = dir_loader.load()

for i, doc in enumerate(data_doc, start=1):
    print(f"\nDocument {i}")
    print(f"metadata: {doc.metadata}")
    print(f"length: {len(doc.page_content)}")



Document 1
metadata: {'source': 'D:\\Learning_Aug_2025\\final_gen_ai\\AgenticAI\\1_data_ingestion\\data\\text_files\\ml_intro.txt'}
length: 7012

Document 2
metadata: {'source': 'D:\\Learning_Aug_2025\\final_gen_ai\\AgenticAI\\1_data_ingestion\\data\\text_files\\python_intro.txt'}
length: 3272
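To make the `glob="**/*.txt"` pattern and the `source` metadata above more concrete, here is a minimal standard-library sketch of the same idea. It is not how `DirectoryLoader` is implemented. The helper name `load_text_files` and the plain-dict document shape are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path


def load_text_files(root, pattern="**/*.txt", encoding="utf-8"):
    """Hypothetical helper: read every file matching `pattern` under `root`.

    Each result mimics the page_content/metadata layout seen in the output
    above, using a plain dict instead of a LangChain Document.
    """
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            docs.append(
                {
                    "page_content": path.read_text(encoding=encoding),
                    "metadata": {"source": str(path)},
                }
            )
    return docs


# Demo on a throwaway directory so the sketch does not depend on local files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")

    demo_docs = load_text_files(tmp)
    demo_summary = [
        (Path(d["metadata"]["source"]).name, len(d["page_content"]))
        for d in demo_docs
    ]

print(demo_summary)
```

The `**` part of the pattern matches the root folder as well as nested folders, which is why files in subdirectories are picked up and the `.md` file is skipped.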


### Text Splitting Strategies

In [6]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

char_txt_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

data_doc_splited = char_txt_splitter.split_text(data_doc[0].page_content)

Created a chunk of size 471, which is longer than the specified 200
Created a chunk of size 363, which is longer than the specified 200
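The warnings above appear because a separator-based splitter only cuts at the separator: a single line longer than `chunk_size` has no `'\n'` inside it, so it is kept whole. The following is a simplified pure-Python sketch of that behaviour, not LangChain's code. The function name `split_on_separator` is hypothetical, and overlap is deliberately ignored here.

```python
def split_on_separator(text, separator="\n", chunk_size=200):
    """Greedily merge separator-delimited pieces into chunks.

    Pieces are never cut, so a piece longer than chunk_size becomes an
    oversized chunk on its own.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current, total = [], [], 0
    for piece in pieces:
        extra = len(separator) if current else 0
        if current and total + extra + len(piece) > chunk_size:
            # The next piece would overflow: close the current chunk first.
            chunks.append(separator.join(current))
            current, total, extra = [], 0, 0
        current.append(piece)
        total += extra + len(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# One 471-character line between two short lines (made-up sample text).
sample = "short line\n" + "x" * 471 + "\nanother short line"
sizes = [len(c) for c in split_on_separator(sample)]
print(sizes)
```

If hard limits matter, a splitter that falls back to finer separators (for example the `RecursiveCharacterTextSplitter` imported above) is the usual choice.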


In [7]:
print(len(data_doc_splited))
for chunk in data_doc_splited:
    print(chunk)
    print("----------------------------")

39
Machine Learning Introduction
---
## What is Machine Learning?
----------------------------
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that focuses on building systems capable of **learning from data** and **improving over time without being explicitly programmed**. Traditional software relies on fixed instructions provided by humans, but ML systems derive rules and patterns directly from examples. This ability makes ML a powerful tool for solving complex problems that are difficult or impossible to describe with hand-written rules.
----------------------------
At its core, ML answers the question: *“Given data, how can we build algorithms that improve their performance as more data becomes available?”*
---
## The Essence of ML
----------------------------
## The Essence of ML
* **Data-driven approach**: Instead of crafting rules, we provide data and let the algorithm discover structure.
* **Adaptability**: Models can evolve as new data arrives.
------------
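The heading `## The Essence of ML` appears at the end of one chunk and again at the start of the next because of `chunk_overlap=20`: when a chunk is closed, trailing pieces that fit within the overlap budget are carried into the next chunk. Below is a simplified sketch of that carry-over logic under the same assumptions as the notebook (split on `'\n'`, lengths measured with `len`). The function name `split_with_overlap` and the sample text are hypothetical, and the real library handles more edge cases.

```python
def split_with_overlap(text, separator="\n", chunk_size=200, chunk_overlap=20):
    """Merge separator-delimited pieces, carrying a small tail between chunks."""
    pieces = [p for p in text.split(separator) if p]
    sep_len = len(separator)
    chunks, current, total = [], [], 0
    for piece in pieces:
        extra = sep_len if current else 0
        if current and total + extra + len(piece) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                total > chunk_overlap
                or total + sep_len + len(piece) > chunk_size
            ):
                total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                current = current[1:]
        extra = sep_len if current else 0
        current.append(piece)
        total += extra + len(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


heading = "## The Essence of ML"  # 20 characters, exactly the overlap budget
sample = "a" * 30 + "\n" + heading + "\n" + "b" * 30
overlap_chunks = split_with_overlap(sample, chunk_size=60, chunk_overlap=20)
for chunk in overlap_chunks:
    print(repr(chunk))
```

Setting `chunk_overlap=0` in the sketch removes the repeated heading, at the cost of losing shared context between neighbouring chunks.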

In [8]:
len(data_doc[1].page_content)

3272