https://python.langchain.com/docs/how_to/#document-loaders

In [3]:
## Reading Text file

from langchain_community.document_loaders import TextLoader

In [25]:
txtloader = TextLoader("Data Sources/speech.txt")

In [26]:
text_documents = txtloader.load()
text_documents

[Document(metadata={'source': 'Data Sources/speech.txt'}, page_content="Dr. A.P.J. Abdul Kalam gave many speeches on a variety of topics, including education, the environment, and India's development. Here are some examples of his speeches: [1, 2, 3]  \n\nAddress to students \n\nKalam encouraged students to excel in their studies and become leaders in their chosen fields. He also emphasized the importance of having a strong moral compass and traditional values. [1]  \n\nVision for India \n\nKalam spoke about his vision for India to become a developed nation, and how the country should stand up to the world. He also emphasized the importance of protecting the environment and leaving the planet in good condition for future generations. [2, 3]  \n\nFoundation Day Lecture \n\nKalam spoke about the importance of teachers and how they can inspire students to achieve their goals. He also emphasized the importance of creating an environment where all students are valued equally. [4]  \n\nSpeec

In [32]:
type(text_documents)

list

In [27]:
## Reading PDF file

from langchain_community.document_loaders import PyPDFLoader

In [30]:
pdfloader = PyPDFLoader("Data Sources/DeepSeek_R1.pdf")

In [31]:
pdf_documents = pdfloader.load()
pdf_documents

[Document(metadata={'source': 'Data Sources/DeepSeek_R1.pdf', 'page': 0, 'page_label': '1'}, page_content='DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via\nReinforcement Learning\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.\nDeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-\nvised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.\nThrough RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing\nreasoning behaviors. However, it encounters challenges such as poor readability, and language\nmixing. To address these issues and further enhance reasoning performance, we introduce\nDeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-\nR1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the\nresearch community, we open-sou

In [33]:
type(pdf_documents)

list

In [34]:
type(pdf_documents[0])

langchain_core.documents.base.Document
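
Each loader above returns a plain Python list of `Document` objects, so ordinary list operations apply. The following is a minimal sketch that relies only on the two attributes visible in the outputs, `page_content` and `metadata`. The sample documents, the file name `example.pdf`, and the helper `summarize_pages` are hypothetical, and the fallback dataclass is a stand-in so the snippet also runs without LangChain installed.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two attributes, used only if LangChain is missing.
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def summarize_pages(docs):
    """Return a (page number, character count) pair for each document."""
    return [(d.metadata.get("page"), len(d.page_content)) for d in docs]


# Hypothetical sample data shaped like the PyPDFLoader output above.
sample_docs = [
    Document(
        page_content="DeepSeek-R1: Incentivizing Reasoning",
        metadata={"source": "example.pdf", "page": 0},
    ),
    Document(
        page_content="Contents",
        metadata={"source": "example.pdf", "page": 1},
    ),
]

summary = summarize_pages(sample_docs)

# Joining the pages gives one string, which can be handy before text splitting.
full_text = "\n".join(d.page_content for d in sample_docs)
print(summary)
```

With the real data, `summarize_pages(pdf_documents)` would be the equivalent call.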

In [41]:
## Web based loader

from langchain_community.document_loaders import WebBaseLoader
import bs4

In [42]:
webloader = WebBaseLoader(
    web_paths=["https://python.langchain.com/docs/how_to/document_loader_pdf/"],
    # Parse only the elements with this class attribute (the section headings on this page)
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="anchor anchorWithStickyNavbar_LWe7")),
)

In [43]:
web_documents = webloader.load()
web_documents

[Document(metadata={'source': 'https://python.langchain.com/docs/how_to/document_loader_pdf/'}, page_content='Simple and fast text extraction\u200bVector search over PDFs\u200bLayout analysis and extraction of text from images\u200bExtracting tables and other structures\u200bExtracting text from specific sections\u200bExtracting text from images\u200bLocal parsing\u200bUse of multimodal models\u200bOther PDF loaders\u200b')]
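
The `page_content` above runs all the headings together, separated only by the zero-width space `\u200b` that the page places after each heading anchor. The following is a minimal sketch of turning that string back into a list. The helper name `split_headings` is hypothetical, and the sample is a shortened copy of the output above.

```python
ZERO_WIDTH_SPACE = "\u200b"


def split_headings(page_content):
    """Split text on zero-width spaces and drop empty pieces."""
    parts = page_content.split(ZERO_WIDTH_SPACE)
    return [part.strip() for part in parts if part.strip()]


# Shortened copy of the page_content shown above.
sample = (
    "Simple and fast text extraction\u200b"
    "Vector search over PDFs\u200b"
    "Other PDF loaders\u200b"
)

headings = split_headings(sample)
print(headings)
```

With the real data, `split_headings(web_documents[0].page_content)` would be the equivalent call.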

In [44]:
## XML loader

from langchain_community.document_loaders import UnstructuredXMLLoader

In [47]:
xml_loader = UnstructuredXMLLoader("./Data Sources/API_SP.POP.TOTL_DS2_en_xml_v2_604.xml")

In [48]:
xml_documents = xml_loader.load()
xml_documents

[Document(metadata={'source': './Data Sources/API_SP.POP.TOTL_DS2_en_xml_v2_604.xml'}, page_content="Aruba\n\nPopulation, total\n\n1960\n\n54922\n\nAruba\n\nPopulation, total\n\n1961\n\n55578\n\nAruba\n\nPopulation, total\n\n1962\n\n56320\n\nAruba\n\nPopulation, total\n\n1963\n\n57002\n\nAruba\n\nPopulation, total\n\n1964\n\n57619\n\nAruba\n\nPopulation, total\n\n1965\n\n58190\n\nAruba\n\nPopulation, total\n\n1966\n\n58694\n\nAruba\n\nPopulation, total\n\n1967\n\n58990\n\nAruba\n\nPopulation, total\n\n1968\n\n59069\n\nAruba\n\nPopulation, total\n\n1969\n\n59052\n\nAruba\n\nPopulation, total\n\n1970\n\n58950\n\nAruba\n\nPopulation, total\n\n1971\n\n58781\n\nAruba\n\nPopulation, total\n\n1972\n\n58047\n\nAruba\n\nPopulation, total\n\n1973\n\n58299\n\nAruba\n\nPopulation, total\n\n1974\n\n58349\n\nAruba\n\nPopulation, total\n\n1975\n\n58295\n\nAruba\n\nPopulation, total\n\n1976\n\n58368\n\nAruba\n\nPopulation, total\n\n1977\n\n58580\n\nAruba\n\nPopulation, total\n\n1978\n\n58776\n\nAruba\
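
The XML loader flattens every record into one long string in which the values are separated by blank lines. The following is a minimal sketch of regrouping that text into records. It assumes that every record has exactly four non-empty values in the order seen above; a missing value would shift the grouping, so parsing the XML directly is likely more robust for real use. The field names and the helper `to_records` are hypothetical.

```python
def to_records(page_content, fields=("country", "indicator", "year", "value")):
    """Group blank-line-separated values into dicts of len(fields) items each."""
    parts = [p for p in page_content.split("\n\n") if p]
    # Ignore a trailing group that is incomplete (for example, truncated text).
    usable = len(parts) - len(parts) % len(fields)
    return [
        dict(zip(fields, parts[i:i + len(fields)]))
        for i in range(0, usable, len(fields))
    ]


# First two records copied from the page_content shown above.
sample = (
    "Aruba\n\nPopulation, total\n\n1960\n\n54922\n\n"
    "Aruba\n\nPopulation, total\n\n1961\n\n55578"
)

records = to_records(sample)
print(records)
```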

In [49]:
## CSV loader

from langchain_community.document_loaders.csv_loader import CSVLoader

In [50]:
csv_loader = CSVLoader("Data Sources/API_SP.POP.TOTL_DS2_en_csv_v2_87.csv")

In [51]:
csv_documents = csv_loader.load()
csv_documents

[Document(metadata={'source': 'Data Sources/API_SP.POP.TOTL_DS2_en_csv_v2_87.csv', 'row': 0}, page_content='ï»¿"Data Source": Last Updated Date\nWorld Development Indicators: 2025-01-28\n: '),
 Document(metadata={'source': 'Data Sources/API_SP.POP.TOTL_DS2_en_csv_v2_87.csv', 'row': 1}, page_content='ï»¿"Data Source": Country Name\nWorld Development Indicators: Country Code\n: Indicator Name\nNone: Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,'),
 Document(metadata={'source': 'Data Sources/API_SP.POP.TOTL_DS2_en_csv_v2_87.csv', 'row': 2}, page_content='ï»¿"Data Source": Aruba\nWorld Development Indicators: ABW\n: Population, total\nNone: SP.POP.TOTL,54922,55578,56320,57002,57619,58190,58694,58990,59069
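
The `ï»¿` prefix in the keys above is what a UTF-8 byte-order mark (BOM) looks like when the file is decoded as cp1252. The following is a minimal sketch that reproduces the effect with the standard library alone. The sample bytes are a hypothetical first line modeled on the output above, not the actual file contents. `CSVLoader` accepts an `encoding` argument, so passing `encoding="utf-8-sig"` is a reasonable thing to try for this file, on the assumption that the file really is UTF-8 with a BOM.

```python
import csv
import io

# Hypothetical first line of the CSV, starting with the UTF-8 BOM bytes.
raw = b'\xef\xbb\xbf"Data Source","World Development Indicators",\r\n'

# Decoding as cp1252 turns the three BOM bytes into visible characters.
# The quote is then no longer at the start of the field, so the csv module
# keeps the quotes as literal text.
bad_key = next(csv.reader(io.StringIO(raw.decode("cp1252"))))[0]

# The "utf-8-sig" codec strips the BOM, so the field is parsed normally.
good_key = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))[0]

print(repr(bad_key))
print(repr(good_key))
```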