### Data Ingestion - Document loaders

https://docs.langchain.com/oss/python/integrations/document_loaders

In [None]:
## Text Loader for .txt files

from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
docs = loader.load()
docs
# print(type(docs[0]))  # langchain_core.documents.base.Document

[Document(metadata={'source': 'speech.txt'}, page_content='Freedom was not gifted; it was earned through courage and sacrifice.\nCountless voices rose together to demand dignity and self-rule.\nEvery step toward independence carried the weight of hope and loss.\nThe struggle taught us unity beyond language, region, or belief.\nIndependence is not just a date, but a responsibility we carry daily.\nIt reminds us to protect justice, equality, and truth.\nThe past whispers lessons of resilience and bravery.\nThe present asks us to build with integrity and compassion.\nThe future depends on how wisely we use our freedom today.\nIndependence lives on when we choose progress over fear.\n')]
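
As the output above shows, the loader returns a list of `Document` objects, each with `page_content` and `metadata`. To make that shape concrete, here is a minimal stand-in built only on the standard library. `SimpleDocument` and `load_text` are hypothetical names used for illustration, not LangChain's implementation, and the sketch assumes a UTF-8 text file.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one text file and wrap it in a single-element list."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Usage: write a small sample file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "speech.txt")
    Path(sample_path).write_text("Freedom was not gifted.\n", encoding="utf-8")
    sample_docs = load_text(sample_path)

print(len(sample_docs), sample_docs[0].metadata["source"].endswith("speech.txt"))
```

The single-element list mirrors what the text loader output shows: one file becomes one document, with the file path recorded under `source`.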

In [None]:
## Read a PDF with PyPDFLoader (requires the pypdf package)

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("samplePdf.pdf")
docs = loader.load()
docs
# print(type(docs[0]))  # langchain_core.documents.base.Document

[Document(metadata={'producer': 'Skia/PDF m118', 'creator': 'Chromium', 'creationdate': '2025-12-04T07:07:18+00:00', 'moddate': '2025-12-04T07:07:18+00:00', 'source': 'samplePdf.pdf', 'total_pages': 5, 'page': 0, 'page_label': '1'}, page_content="Billable152\x0000hNon Billable80\x0000hTotal232\x0000h\nYasref SCH\nImplementation\n04\x0000 DJDennis Joshua - 5000059 Billable\nApproved\nconfiguring and fixing bugs in pages,\nbv's, and ruleset's required for\nvendor registration and modification\nprocess.f\nDJDennis Joshua\nYasref SCH\nImplementation\n04\x0000 DJDennis Joshua - 5000059 Billable\nApproved\nconfiguring and fixing bugs in pages,\nbv's, and ruleset's required for\nvendor registration and modification\nprocess.\nDJDennis Joshua\nHolidays and\nLeave [Only]\n08\x0000 DJDennis Joshua - 5000059 Non\nBillable Approved satuday. DJDennis Joshua\nHolidays and\nLeave [Only]\n08\x0000 DJDennis Joshua - 5000059 Non\nBillable Approved Friday. DJDennis Joshua\nYasref SCH\nImplementation\n08\
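
The extracted text above contains `\x00` (NUL) characters, which often show up when a PDF font has no usable character mapping. A minimal cleanup sketch follows. `strip_control_chars` is a hypothetical helper, and dropping the characters instead of guessing what they originally stood for is an assumption about what later processing needs.

```python
import re

# ASCII control characters, excluding tab (\x09), newline (\x0a),
# and carriage return (\x0d), which are legitimate whitespace.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str, replacement: str = "") -> str:
    """Replace stray control characters in extracted text."""
    return _CONTROL_CHARS.sub(replacement, text)


# Sample taken from the start of the PDF output above.
raw = "Billable152\x0000hNon Billable80\x0000hTotal232\x0000h\nYasref SCH"
cleaned = strip_control_chars(raw)
print(repr(cleaned))
```

In the notebook, the same helper could be applied to each `doc.page_content` returned by `loader.load()` before any further processing.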

In [None]:
## Web base loader: fetch a page and keep only the post title and content

from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(
    web_path="https://lilianweng.github.io/posts/2023-06-23-agent/",
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title", "post-content"))),
)
docs = loader.load()
docs
# print(type(docs[0]))  # langchain_core.documents.base.Document

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n      LLM Powered Autonomous Agents\n    Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n\

In [None]:
## arXiv loader: look up a paper by its arXiv ID

from langchain_community.document_loaders import ArxivLoader

# Supports all arguments of `ArxivAPIWrapper`
loader = ArxivLoader(
    query="2504.10903",
    load_max_docs=2,
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)
docs = loader.load()
docs
# print(type(docs[0]))  # langchain_core.documents.base.Document

[Document(metadata={'Published': '2025-09-29', 'Title': 'Efficient Reasoning Models: A Survey', 'Authors': 'Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang', 'Summary': 'Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this "slow-thinking" paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead. To this end, it highlights an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and rein

In [1]:
## Wikipedia loader

from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query="HUNTER X HUNTER", load_max_docs=2)
docs = loader.load()
len(docs)

2

In [15]:
docs

[Document(metadata={'title': 'Hunter × Hunter', 'summary': 'Hunter × Hunter (pronounced "hunter hunter") is a Japanese manga series written and illustrated by Yoshihiro Togashi. It has been serialized in Shueisha\'s shōnen manga magazine Weekly Shōnen Jump since March 1998, although the manga has frequently gone on extended hiatuses since 2006. Its chapters have been collected in 38 tankōbon volumes as of September 2024. The story focuses on a young boy named Gon Freecss who discovers that his father, who left him at a young age, is actually a world-renowned Hunter, a licensed professional who specializes in fantastical pursuits such as locating rare or unidentified animal species, treasure hunting, surveying unexplored enclaves, or hunting down lawless individuals. Gon departs on a journey to become a Hunter and eventually find his father. Along the way, Gon meets various other Hunters and encounters the paranormal.\nHunter × Hunter was adapted into a 62-episode anime television serie