# Document Loaders

Many types of documents can be loaded, including third-party integrations, which are covered in the next notebook. All available document loaders are listed here: https://python.langchain.com/docs/modules/data_connection/document_loaders/

Keep in mind that many loaders depend on third-party libraries, so a problem in one of those libraries can break the corresponding LangChain loader.
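Since loaders rely on external packages, it can help to check which of them are importable before running the cells. The following is a minimal sketch using only the standard library. The `OPTIONAL_DEPS` mapping and the `missing_dependencies` helper are hypothetical names introduced for illustration, and the package names are assumptions, not a list taken from LangChain.

```python
from importlib.util import find_spec

# Hypothetical mapping of loader kinds to the top-level module each might need.
OPTIONAL_DEPS = {
    "pdf": "pypdf",
    "html": "bs4",
    "wikipedia": "wikipedia",
}


def missing_dependencies(deps: dict[str, str]) -> dict[str, str]:
    """Return the entries of `deps` whose module cannot be located."""
    return {kind: mod for kind, mod in deps.items() if find_spec(mod) is None}


# Report every optional package that is not installed in this environment.
for kind, module in missing_dependencies(OPTIONAL_DEPS).items():
    print(f"{kind}: module '{module}' is not installed")
```

A check like this only confirms that a package is present; it does not detect version incompatibilities between a loader and its dependency.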

In [2]:
import sys
sys.path.append('../')

from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv(), override=True)

from langchain_core.documents import Document

from document_loaders.load_document import load_document
from document_loaders.load_from_web import load_from_wikipedia, load_from_web, load_with_chromium, crawl_with_apify
from document_loaders.load_all_documents import load_all_documents, load_all_documents_lazy, LoadAllDocuments
from document_loaders.load_directory import load_directory, scan_load_directory, ScanLoadDirectory

In [3]:
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
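To make the two constants concrete, here is a simplified sketch of how a chunk size and an overlap interact. It is a plain sliding window over characters, assumed here purely for illustration. The `sliding_chunks` helper is hypothetical and is not the splitter used by the functions in this notebook, which may split on separators first.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters, where each
    window repeats the last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is what the overlap buys: context that straddles a chunk boundary appears in both chunks.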

## CSV

In [4]:
docs = load_document('./files/penguins.csv', text_splitter="auto")

In [5]:
type(docs)

list

In [6]:
type(docs[0])

langchain_core.documents.base.Document

In [7]:
len(docs)

344

In [8]:
docs[0]

Document(metadata={'source': './files/penguins.csv', 'row': 0}, page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 39.1\nbill_depth_mm: 18.7\nflipper_length_mm: 181\nbody_mass_g: 3750\nsex: MALE')
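The output above shows one document per CSV row, with the row rendered as `column: value` lines and the row index stored in the metadata. A rough standard-library sketch of that mapping follows. The `rows_to_records` helper is hypothetical and uses plain dicts instead of `Document` objects, so it approximates the shape of the result and is not the loader's actual implementation.

```python
import csv
import io


def rows_to_records(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with `page_content` and `metadata` keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for index, row in enumerate(reader):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        records.append(
            {"page_content": content, "metadata": {"source": source, "row": index}}
        )
    return records


# First three columns of the first penguin row shown above.
sample = "species,island,bill_length_mm\nAdelie,Torgersen,39.1\n"
records = rows_to_records(sample, "./files/penguins.csv")
print(records[0]["page_content"])
```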

## HTML

In [9]:
docs = load_document("./files/some_website.html", text_splitter="auto")

len(docs)

1

In [10]:
docs[0]

Document(metadata={'Header 1': '#TITLE#', 'source': './files/some_website.html'}, page_content='Foo\n\nSome intro text about Foo.\n\nBar main section\n\nSome intro text about Bar.\n\nBar subsection 1\n\nSome text about the first subtopic of Bar.\n\nBar subsection 2\n\nSome text about the second subtopic of Bar.\n\nBaz\n\nSome text about Baz\n\nSome concluding text about Foo')

In [11]:
docs = load_document("./files/some_website.html", mode="raw", text_splitter="auto")

len(docs)

5

In [12]:
docs[0]

Document(metadata={'Header 1': 'Foo', 'source': './files/some_website.html'}, page_content='Foo \n Some intro text about Foo.')

## PDF

In [13]:
docs = load_document('./files/some_report.pdf', text_splitter="auto")

docs

[Document(metadata={'source': './files/some_report.pdf'}, page_content='This is the first line PDF. This is the second line in the PDF. This is the third line in the PDF.')]

## JSON

In [14]:
docs = load_document("./files/some_json.json", text_splitter="auto")

len(docs)

2

In [15]:
docs[-1].page_content

'{"web-app": {"servlet-mapping": {"cofaxCDS": "/", "cofaxEmail": "/cofaxutil/aemail/*", "cofaxAdmin": "/admin/*", "fileServlet": "/static/*", "cofaxTools": "/tools/*"}, "taglib": {"taglib-uri": "cofax.tld", "taglib-location": "/WEB-INF/tlds/cofax.tld"}}}'

## Programming Language

In [16]:
docs = load_document('../multi_vectorstore.py', text_splitter="auto")

len(docs)

9

In [17]:
print(docs[0].page_content[0:49])

import uuid

from typing import (

Any,

Callable


## Wikipedia

In [18]:
docs = load_from_wikipedia(
    "Lionel Messi",
    lang="es",
    load_max_docs=1,
    text_splitter="recursive",
    splitter_kwargs={
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
    },
)

len(docs)

12

In [19]:
print(docs[0].page_content)

Lionel Andrés Messi Cuccittini (Rosario, 24 de junio de 1987), conocido como Leo Messi, es un futbolista argentino que juega como delantero o centrocampista. Desde 2023, integra el plantel del Inter Miami de la MLS canadoestadounidense. Es también internacional con la selección de Argentina, de la que es capitán.


## Web

In [20]:
docs = load_from_web(
    "https://www.money.com",
    default_parser="html.parser",
    bs_get_text_kwargs={"strip": True},
    text_splitter="recursive",
    splitter_kwargs={
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
    },
)

len(docs)

30

In [21]:
print(docs[0].page_content)

Money Official Site – Finance News & Advice Since 1972 | MoneyCloseCreditMoney's Best, News & GuidesBest Credit Repair CompaniesBest Identity Theft Protection ServicesBest Credit Monitoring ServicesInsuranceMoney's Best, News & GuidesBest Long-Term Care InsuranceBest Pet InsuranceBest Life InsuranceBest Travel InsuranceBest RV InsuranceBest Dental InsuranceBest Motorcycle InsuranceBest Car InsuranceBest Homeowners InsuranceBest Health InsuranceLoansMoney's Best, News & GuidesBest Personal


In [22]:
docs = load_with_chromium(
    "https://www.money.com",
    transform=True,
    tags_to_extract=["span"],
    text_splitter="recursive",
    splitter_kwargs={
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
    },
)

len(docs)

7

In [23]:
print(docs[0].page_content)

Close  <Back <Back EN ESP Best Credit Repair Companies Best Identity Theft Protection Services Best Credit Monitoring Services Best Long-Term Care Insurance Best Pet Insurance Best Life Insurance Best Travel Insurance Best RV Insurance Best Dental Insurance Best Motorcycle Insurance Best Car Insurance Best Homeowners Insurance Best Health Insurance Best Personal Loans Best Small Business Loans Best Student Loans Best Student Loan Refinance Best Auto Refinance Best Auto Loans Best Debt
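The output above contains only text taken from `<span>` elements, which is the effect of `tags_to_extract=["span"]`. The idea can be sketched with the standard-library HTML parser. The `TagTextExtractor` class and `extract_text` helper are hypothetical names, and this is an illustration of tag-based extraction in general, not the transformer that `load_with_chromium` uses internally.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the requested tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.depth = 0  # how many requested tags are currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while inside a requested tag.
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str, tags: list[str]) -> str:
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<div><span>Best Pet Insurance</span><p>ignored</p><span>Best Life Insurance</span></div>"
print(extract_text(page, ["span"]))
# Best Pet Insurance Best Life Insurance
```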


## Load by file extension

In [24]:
docs = load_all_documents("../", ".py", recursive=True)

len(docs)

168

In [25]:
len(list(load_all_documents_lazy("../", ".py", recursive=True)))

168

In [26]:
loader = LoadAllDocuments("../", ".py", recursive=True)

In [27]:
docs = loader.load()

len(docs)

168

In [28]:
len(list(loader.lazy_load()))

168
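The eager and lazy variants above return the same number of documents; the difference is when the work happens. A minimal sketch of that pattern with a generator follows. The `iter_texts` and `load_texts` helpers are hypothetical stand-ins that yield raw file text, and they make no claim about how `LoadAllDocuments` is implemented.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_texts(root: str, extension: str, recursive: bool = True) -> Iterator[tuple[str, str]]:
    """Lazily yield (path, text) pairs; each file is read only when requested."""
    pattern = f"**/*{extension}" if recursive else f"*{extension}"
    # The list of paths is built up front, but file contents are read one at a time.
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            yield str(path), path.read_text(encoding="utf-8", errors="replace")


def load_texts(root: str, extension: str, recursive: bool = True) -> list[tuple[str, str]]:
    """Eager variant: read everything into memory at once."""
    return list(iter_texts(root, extension, recursive))


# Build a small throwaway tree so the example does not depend on local files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.py").write_text("print('a')\n", encoding="utf-8")
    (Path(tmp) / "pkg").mkdir()
    (Path(tmp) / "pkg" / "b.py").write_text("print('b')\n", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("skip me\n", encoding="utf-8")

    eager_count = len(load_texts(tmp, ".py"))
    lazy_count = sum(1 for _ in iter_texts(tmp, ".py"))

print(eager_count, lazy_count)
```

The lazy form is useful when the directory is large, since a consumer can process and discard each item without holding the whole collection in memory.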

## Load directories

In [29]:
from langchain_community.document_loaders import PythonLoader

In [30]:
docs = load_directory("../", glob="**/*.py", loader_cls=PythonLoader, silent_errors=True)

len(docs)

19

In [31]:
len(list(scan_load_directory("../", recursive=True, file_filter="*.py", autodetect_encoding=True)))

19

In [32]:
loader = ScanLoadDirectory("../", recursive=True, file_filter="*.py", autodetect_encoding=True)

In [33]:
docs = loader.load()

len(docs)

19

In [34]:
len(list(loader.lazy_load()))

19