- Table of contents
  - [How to load web pages](#how-to-load-web-pages)
    - [Simple and fast](#simple-and-fast)
    - [Advanced parsing](#advance-parsing)
      - [Extracting content from specific sections](#extracting-content-from-specific-sections)
    - [Vector embeddings over page content](#vector-embeddings-over-page-content)

# How to load web pages

- Here we see how to load web pages into LangChain `Document` objects.
- The right parsing approach depends on your needs. Here we demonstrate two possibilities:
    1. [**`Simple and fast`**](#simple-and-fast): We get one [**`Document`**](../document_loader.md) per page, with its content represented as a flattened **`string`**.
    2. [**`Advanced parsing`**](#advance-parsing): We can get multiple [**`Document`**](../document_loader.md) objects per page, which lets us pick out different sections such as **`links`**, **`tables`**, and other structures.

## Simple and fast 

- This method is appropriate if you want a simple string representation of the text embedded in a web page.
- It returns one **`Document`** object per page, containing a single string with the page's text.
- Under the hood it uses **`BeautifulSoup4`**.
- To load the documents we can use **`lazy_load`**, or **`alazy_load`** for async loading.
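
The difference between the two loading styles can be sketched without any network access. The `ToyLoader` and `ToyDocument` classes below are hypothetical stand-ins, not LangChain APIs; they only mimic the shape of `lazy_load` (a plain generator) and `alazy_load` (an async generator). The sketch also assumes a plain Python script, where an `async for` loop has to run inside a coroutine started with `asyncio.run`, whereas a notebook cell allows it at the top level.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Dict, Iterator, List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class ToyLoader:
    """Hypothetical loader that yields one document per in-memory page."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def lazy_load(self) -> Iterator[ToyDocument]:
        # Synchronous generator: documents are produced one at a time.
        for url, text in self.pages.items():
            yield ToyDocument(page_content=text, metadata={"source": url})

    async def alazy_load(self) -> AsyncIterator[ToyDocument]:
        # Async generator: hands control back to the event loop between documents.
        for doc in self.lazy_load():
            await asyncio.sleep(0)
            yield doc


async def collect(loader: ToyLoader) -> List[ToyDocument]:
    # In a script, `async for` must live inside a coroutine.
    return [doc async for doc in loader.alazy_load()]


loader = ToyLoader({"https://example.com/page": "Hello from a toy page."})
sync_docs = list(loader.lazy_load())
async_docs = asyncio.run(collect(loader))
```

Both paths produce the same documents; the async variant is mainly useful when several pages are fetched concurrently.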

```python
from bs4 import SoupStrainer
from langchain_community.document_loaders import WebBaseLoader

page_url = "https://python.langchain.com/docs/how_to/chatbots_memory/"

loader = WebBaseLoader(
    web_path=page_url,
    bs_kwargs={
        "parse_only": SoupStrainer(class_="theme-doc-markdown markdown"),
    }
)
docs = []

async for doc in loader.alazy_load():
    docs.append(doc)

assert len(docs) == 1
doc = docs[0]
```

```text
Fetching pages: 100%|##########| 1/1 [00:00<00:00, 23.76it/s]
```


```python
print(f"MetaData {doc.metadata}")
print(f"Page Content: {doc.page_content.strip()}")
```

```text
MetaData {'source': 'https://python.langchain.com/docs/how_to/chatbots_memory/'}
Page Content: How to add memory to chatbots
A key feature of chatbots is their ability to use the content of previous conversational turns as context. This state management can take several forms, including:

Simply stuffing previous messages into a chat model prompt.
The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
More complex modifications like synthesizing summaries for long running conversations.

We'll go into more detail on a few techniques below!
noteThis how-to guide previously built a chatbot using RunnableWithMessageHistory. You can access this version of the guide in the v0.2 docs.As of the v0.3 release of LangChain, we recommend that LangChain users take advantage of LangGraph persistence to incorporate memory into new LangChain applications.If your code is already relying on RunnableWithMessageHistory or BaseChatMessageHistory, y
```

- If you want only the essential content of the web page, you can restrict parsing by specifying classes and other parameters, as the `bs_kwargs` argument above does with `SoupStrainer`.
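
To make the idea of class-based filtering concrete, here is a minimal sketch that keeps only the text inside elements carrying a given CSS class. It deliberately swaps in the standard library's `html.parser` so it runs offline; it is not how BeautifulSoup or `SoupStrainer` is implemented, and the `ClassTextExtractor` class, the sample HTML, and the class names in it are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Hypothetical helper: collects text found inside elements with a given class."""

    def __init__(self, wanted_class: str) -> None:
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # > 0 while we are inside a wanted element
        self.chunks: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start tracking at a matching element, and keep counting nested tags.
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data: str) -> None:
        if self.depth and data.strip():
            self.chunks.append(data.strip())


sample_html = """
<html><body>
<nav class="navbar">Docs | API | Blog</nav>
<div class="markdown"><h1>How to add memory</h1><p>Chatbots use <b>previous</b> turns.</p></div>
<footer class="footer">Copyright</footer>
</body></html>
"""

extractor = ClassTextExtractor("markdown")
extractor.feed(sample_html)
filtered_text = " ".join(extractor.chunks)
```

The navigation bar and footer are dropped, which is the same effect the `parse_only` filter aims for on a real page.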

## Advance Parsing

- If you want more control over how the page content is processed, this approach is appropriate.
- Instead of generating a single **`Document`** per page, you can generate multiple **`Document`** objects, each representing a distinct structure of the web page.

```python
from langchain_unstructured import UnstructuredLoader

page_url = "https://python.langchain.com/docs/how_to/chatbots_memory/"
loader = UnstructuredLoader(web_url=page_url)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)
```

```python
for doc in docs[:5]:
    print(doc.page_content)
```

```text
Open In Colab
Open on GitHub
How to add memory to chatbots
A key feature of chatbots is their ability to use the content of previous conversational turns as context. This state management can take several forms, including:
Simply stuffing previous messages into a chat model prompt.
```


### Extracting content from specific sections

- Each `Document` object represents an element of the page. Its metadata contains useful information, such as the element's category.
- Elements can also have parent-child relationships.
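
The parent-child relationship can be illustrated offline with hand-written metadata. The sketch below assumes each element carries `element_id`, `parent_id`, and `category` keys, matching the metadata keys used in this section; the IDs, the text, and the `children_of_title` helper are hypothetical.

```python
from typing import Dict, List, Optional

# Hand-written element records; the IDs and text are illustrative only.
elements: List[Dict[str, Optional[str]]] = [
    {"element_id": "t1", "parent_id": None, "category": "Title", "text": "Setup"},
    {"element_id": "n1", "parent_id": "t1", "category": "NarrativeText", "text": "Install the packages."},
    {"element_id": "t2", "parent_id": None, "category": "Title", "text": "Usage"},
    {"element_id": "n2", "parent_id": "t2", "category": "NarrativeText", "text": "Call the model."},
    {"element_id": "l1", "parent_id": "t1", "category": "ListItem", "text": "Set the API key."},
]


def children_of_title(items: List[Dict[str, Optional[str]]], title_prefix: str) -> List[str]:
    """Return the text of elements whose parent is the Title starting with title_prefix."""
    parent_id: Optional[str] = None
    children: List[str] = []
    for item in items:
        if item["category"] == "Title" and (item["text"] or "").startswith(title_prefix):
            # Remember the ID of the matching title element.
            parent_id = item["element_id"]
        elif parent_id is not None and item["parent_id"] == parent_id:
            # Keep any element that points back at that title.
            children.append(item["text"] or "")
    return children


setup_children = children_of_title(elements, "Setup")
```

Elements under the "Usage" title are skipped because their `parent_id` points at a different title.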

```python
for doc in docs[:5]:
    print(f'{doc.metadata["category"]}: {doc.page_content}')
```

```text
Image: Open In Colab
Image: Open on GitHub
Title: How to add memory to chatbots
NarrativeText: A key feature of chatbots is their ability to use the content of previous conversational turns as context. This state management can take several forms, including:
ListItem: Simply stuffing previous messages into a chat model prompt.
```


```python
from typing import List
from langchain_core.documents import Document


async def _get_setup_docs_from_url(page_url: str) -> List[Document]:
    loader = UnstructuredLoader(web_url=page_url)

    setup_docs = []
    parent_id = -1  # sentinel: no "Setup" title seen yet
    async for doc in loader.alazy_load():
        # Remember the element ID of the "Setup" title once we reach it.
        if doc.metadata["category"] == "Title" and doc.page_content.startswith("Setup"):
            parent_id = doc.metadata["element_id"]

        # Keep every element whose parent is that title.
        if parent_id == doc.metadata.get("parent_id"):
            setup_docs.append(doc)

    return setup_docs


page_urls = [
    "https://python.langchain.com/docs/how_to/chatbots_memory/",
    "https://python.langchain.com/docs/how_to/chatbots_tools/",
]
setup_docs = []
for url in page_urls:
    page_setup_docs = await _get_setup_docs_from_url(url)
    setup_docs.extend(page_setup_docs)
```

```python
from collections import defaultdict

setup_text = defaultdict(str)

for doc in setup_docs:
    url = doc.metadata["url"]
    setup_text[url] += f"{doc.page_content}\n"

dict(setup_text)
```

```text
{'https://python.langchain.com/docs/how_to/chatbots_memory/': 'You\'ll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:\n%pip install --upgrade --quiet langchain langchain-openai langgraph\n\nimport getpass\nimport os\n\nif not os.environ.get("OPENAI_API_KEY"):\n    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")\nOpenAI API Key: ········\nLet\'s also set up a chat model that we\'ll use for the below examples.\nfrom langchain_openai import ChatOpenAI\n\nmodel = ChatOpenAI(model="gpt-4o-mini")\nAPI Reference:ChatOpenAI\n',
 'https://python.langchain.com/docs/how_to/chatbots_tools/': 'For this guide, we\'ll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you\'re using Tavily.\nYou\'ll need to sign up for an account on the Tavily website, and install the followi
```

## Vector Embeddings over page content

- Once the web content is loaded into LangChain **`Document`** objects, we can index them (e.g. for a **`RAG`** application).
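
What indexing and similarity search do can be sketched with a toy store that needs no API key. This is only an illustration under loud assumptions: word-count vectors stand in for a real embedding model, and `embed`, `cosine`, and `ToyVectorStore` are hypothetical names, not LangChain APIs. The sample strings are shortened versions of the setup text extracted earlier.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts. A real model returns dense vectors.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: embeds texts once, ranks them per query."""

    def __init__(self, texts: List[str]) -> None:
        self.entries: List[Tuple[str, Counter]] = [(t, embed(t)) for t in texts]

    def similarity_search(self, query: str, k: int = 2) -> List[str]:
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


texts = [
    "You'll need to sign up for an account on the Tavily website, and install the following packages:",
    "Let's also set up a chat model that we'll use for the below examples.",
    "You'll need to install a few packages, and have your OpenAI API key set as an environment variable.",
]
store = ToyVectorStore(texts)
results = store.similarity_search("Install Tavily", k=2)
```

A real vector store follows the same two steps, embedding documents at index time and ranking them against the embedded query, but with learned embeddings that capture meaning rather than exact word overlap.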

```python
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(
    setup_docs, OpenAIEmbeddings(model="text-embedding-3-small"))
retrieved_docs = vector_store.similarity_search("Install Tavily", k=2)
```

```text
INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
```


```python
for doc in retrieved_docs:
    print(f'Page {doc.metadata["url"]}: {doc.page_content}\n')
```

```text
Page https://python.langchain.com/docs/how_to/chatbots_tools/: You'll need to sign up for an account on the Tavily website, and install the following packages:

Page https://python.langchain.com/docs/how_to/chatbots_tools/: For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.
```

