# Sitemap

`SitemapLoader` extends `WebBaseLoader`. It loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a `Document`.

The scraping is done concurrently. There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the server being scraped and don't care about load, you can raise this limit. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!

In [5]:
!pip install nest_asyncio



In [6]:
# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()

In [7]:
from langchain.document_loaders.sitemap import SitemapLoader

In [8]:
sitemap_loader = SitemapLoader(web_path="https://langchain.readthedocs.io/sitemap.xml")

docs = sitemap_loader.load()

Fetching pages: 100%|##########| 6/6 [00:03<00:00,  1.78it/s]


You can change the `requests_per_second` parameter to increase the maximum number of concurrent requests, and use `requests_kwargs` to pass keyword arguments when sending requests.

In [9]:
sitemap_loader.requests_per_second = 2
# Optional: avoid `[SSL: CERTIFICATE_VERIFY_FAILED]` errors (this disables certificate verification)
sitemap_loader.requests_kwargs = {"verify": False}
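To get a feel for what a cap on concurrent requests does, here is a minimal sketch that bounds the number of in-flight tasks with an `asyncio.Semaphore`. It is not the loader's actual code: `fetch_all`, `max_concurrent`, and the `example.com` URLs are hypothetical, and the HTTP request is simulated with a short sleep.

```python
import asyncio


async def fetch_all(urls, max_concurrent=2):
    """Simulate fetching `urls` with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0  # number of simulated requests currently running
    peak = 0       # highest value `in_flight` ever reached

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:  # waits here while the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real HTTP request
            in_flight -= 1
            return f"content of {url}"

    # gather() preserves the order of the input URLs
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return pages, peak


# Hypothetical URLs, for illustration only
urls = [f"https://example.com/page{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls, max_concurrent=2))
print(len(pages), "pages fetched, peak concurrency:", peak)
```

Raising `max_concurrent` lets more simulated requests overlap, which is the same trade-off described above: faster scraping, more load on the server. Inside a notebook, `asyncio.run` relies on the `nest_asyncio.apply()` call made earlier.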

In [10]:
docs[0]

Document(page_content='\n\n\n\n\n\nWelcome to LangChain — 🦜🔗 LangChain 0.0.199\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to main content\n\n\n\n\n\n\n\n\n\n\nCtrl+K\n\n\n\n\n\n\n\n\n\n\n\n\n🦜🔗 LangChain 0.0.199\n\n\n\nGetting Started\n\nQuickstart Guide\nConcepts\nTutorials\n\nModules\n\nModels\nGetting Started\nLLMs\nGetting Started\nGeneric Functionality\nHow to use the async API for LLMs\nHow to write a custom LLM wrapper\nHow (and why) to use the fake LLM\nHow (and why) to use the human input LLM\nHow to cache LLM calls\nHow to serialize LLM classes\nHow to stream LLM and Chat Model responses\nHow to track token usage\n\n\nIntegrations\nAI21\nAleph Alpha\nAnyscale\nAviary\nAzure OpenAI\nBanana\nBaseten\nBeam\nBedrock\nCerebriumAI\nCohere\nC Transformers\nDatabricks\nDeepInfra\nForefrontAI\nGoogle Cloud Platform Vertex AI PaLM\nGooseAI\nGPT4All\nHugging Face Hub\nHugging Face Pipeline\nHuggingface TextGen Inference\nJsonformer\

## Filtering sitemap URLs

Sitemaps can be massive files with thousands of URLs, and often you don't need every one of them. You can filter the URLs by passing a list of strings or regex patterns to the `filter_urls` parameter. Only URLs that match one of the patterns will be loaded.

In [11]:
loader = SitemapLoader(
    "https://langchain.readthedocs.io/sitemap.xml",
    filter_urls=["https://python.langchain.com/en/latest/"]
)
documents = loader.load()

Fetching pages: 100%|##########| 1/1 [00:00<00:00, 15.77it/s]


In [12]:
documents[0]

Document(page_content='\n\n\n\n\n\nWelcome to LangChain — 🦜🔗 LangChain 0.0.199\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to main content\n\n\n\n\n\n\n\n\n\n\nCtrl+K\n\n\n\n\n\n\n\n\n\n\n\n\n🦜🔗 LangChain 0.0.199\n\n\n\nGetting Started\n\nQuickstart Guide\nConcepts\nTutorials\n\nModules\n\nModels\nGetting Started\nLLMs\nGetting Started\nGeneric Functionality\nHow to use the async API for LLMs\nHow to write a custom LLM wrapper\nHow (and why) to use the fake LLM\nHow (and why) to use the human input LLM\nHow to cache LLM calls\nHow to serialize LLM classes\nHow to stream LLM and Chat Model responses\nHow to track token usage\n\n\nIntegrations\nAI21\nAleph Alpha\nAnyscale\nAviary\nAzure OpenAI\nBanana\nBaseten\nBeam\nBedrock\nCerebriumAI\nCohere\nC Transformers\nDatabricks\nDeepInfra\nForefrontAI\nGoogle Cloud Platform Vertex AI PaLM\nGooseAI\nGPT4All\nHugging Face Hub\nHugging Face Pipeline\nHuggingface TextGen Inference\nJsonformer\
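The difference between a plain string and a stricter regex pattern can be sketched outside the loader. The example assumes each pattern is matched from the start of the URL with `re.match`, which may not be exactly what your installed version does. `keep_url` and the sample URL list are hypothetical.

```python
import re


def keep_url(url, filter_urls):
    """Return True if `url` matches at least one pattern (or no filter is set)."""
    if not filter_urls:
        return True
    # re.match anchors at the start of the string, so a plain URL acts as a prefix
    return any(re.match(pattern, url) for pattern in filter_urls)


# Hypothetical sitemap entries, for illustration only
sitemap_urls = [
    "https://python.langchain.com/en/latest/",
    "https://python.langchain.com/en/latest/modules/models.html",
    "https://python.langchain.com/en/stable/",
]

# A plain string keeps every URL that starts with it
whole_tree = [
    u for u in sitemap_urls
    if keep_url(u, ["https://python.langchain.com/en/latest/"])
]

# Escaping the dots and adding `$` keeps only the exact page
exact_page = [
    u for u in sitemap_urls
    if keep_url(u, [r"https://python\.langchain\.com/en/latest/$"])
]

print(whole_tree)
print(exact_page)
```

Under that assumption, a plain string behaves like a prefix filter, so anchor the pattern with `$` when you want a single page rather than a whole subtree.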

## Add custom scraping rules

The `SitemapLoader` uses `beautifulsoup4` for the scraping process, and it scrapes every element on the page by default. The `SitemapLoader` constructor accepts a custom scraping function. This feature can be helpful to tailor the scraping process to your specific needs; for example, you might want to avoid scraping headers or navigation elements.

The following example shows how to develop and use a custom function to avoid navigation and header elements.

Import the `beautifulsoup4` library and define the custom function.

In [13]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [14]:
from bs4 import BeautifulSoup

def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all('nav')
    header_elements = content.find_all('header')

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())

Add your custom function to the `SitemapLoader` object.

In [22]:
loader = SitemapLoader(
    "https://kastanday.com/sitemap.xml",
    parsing_function=remove_nav_and_header_elements,
)


In [23]:
res = loader.load()
res

Fetching pages: 100%|##########| 39/39 [00:04<00:00,  7.85it/s]


[Document(page_content='\n\n\n\n\n\nResume – Kastan Day\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\nday_kastan_resumeDownload\nPersonal statement\nAs an AI practitioner, I’m focused on applied machine learning by putting the best research from academia to use on solving real world problems. I’m fluent in machine learning best practices for large scale distributed systems, robust model training, high-speed model inference and continuous testing and integration of improved models. My work experience is in applied ML for vision, and 3D scene understanding (SLAM) for autonomous robotics at Sarcos and NASA. During my Masters I’m specializing in large scale, distributed ML modeling for scientific applications, working as an AI consultant on a heterogeneous team of scientists to solve cross-discipline problems with AI.\nMy favorite tools are first and foremost Python, often super-powe

## Local Sitemap

The sitemap loader can also read the sitemap from a local file: pass the file path as `web_path` and set `is_local=True`.

In [30]:
sitemap_loader = SitemapLoader(web_path="illinois_sitemap.xml", is_local=True, parsing_function=remove_nav_and_header_elements)

docs = sitemap_loader.load()

Fetching pages:   0%|          | 0/34 [00:00<?, ?it/s]

Fetching pages: 100%|##########| 34/34 [00:07<00:00,  4.71it/s]
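Before loading a local sitemap, it can help to check which URLs it actually lists. The sketch below parses a small, made-up sitemap with the standard library; `SITEMAP_XML` and `list_sitemap_urls` are hypothetical names. For a real file such as `illinois_sitemap.xml`, you would read the file's text first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content (XML declaration omitted for brevity)
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must be qualified
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(xml_text):
    """Return the text of every <loc> element under <url> in a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


urls = list_sitemap_urls(SITEMAP_XML)
print(urls)
```

This only lists the page URLs; fetching and parsing the pages is still left to `SitemapLoader`.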


In [32]:
# strip newlines
for doc in docs:
    doc.page_content = doc.page_content.replace("\n", " ")
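Replacing newlines with spaces still leaves long runs of spaces in the text. If that matters for your use case, one option is to collapse every run of whitespace into a single space. This is a minimal sketch; `collapse_whitespace` and the sample string are hypothetical.

```python
import re


def collapse_whitespace(text):
    """Replace each run of spaces, tabs, and newlines with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Sample text imitating scraped page content
sample = "\n\n   Home  Homepage Feature Stories \n\n\n   Illinois News\t\tMore Featured Content  "
cleaned = collapse_whitespace(sample)
print(cleaned)
```

The same function could be applied to each document with `doc.page_content = collapse_whitespace(doc.page_content)`.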

In [33]:
print("\n-----------------------------------------------------".join([doc.page_content for doc in docs]))

            Home  Homepage Feature Stories         Illinois News                                More Featured Content    Research News Categories  Agriculture Arts Business Campus Education Engineering Health Humanities Law Life Sciences Physical Sciences Social Sciences Veterinary Medicine News Bureau website        Featured Events and Calendars       All campus calendars        Colleges, Schools & Institutes  Agricultural, Consumer and Environmental Sciences Applied Health Sciences  Beckman Institute for Advanced Science and Technology Cancer Center at Illinois  Carl R. Woese Institute for Genomic Biology Carle Illinois College of Medicine Center for Social & Behavioral Science Education Fine and Applied Arts General Studies Gies College of Business  Graduate College The Grainger College of Engineering Humanities Research Institute Information Sciences Institute for Sustainability, Energy, and Environment Interdisciplinary Health Sciences Institute Labor and Employment Relations Law 