# WebBaseLoader

`WebBaseLoader` is a specialized document loader in LangChain designed to handle web-based content.

It uses the `BeautifulSoup4` library to parse web pages efficiently and offers customizable parsing through `SoupStrainer` and additional `bs4` parameters.

This tutorial shows how to use `WebBaseLoader` to:

1. Load and parse web documents efficiently.
2. Customize parsing behavior with `BeautifulSoup` options.
3. Handle different web content structures flexibly.

```bash
pip install beautifulsoup4
```


## Load Web-based documents

`WebBaseLoader` is a loader designed for loading web-based documents.

It uses the `bs4` library to parse web pages.

Key Features:
- Uses `bs4.SoupStrainer` to specify elements to parse.
- Accepts additional `BeautifulSoup` constructor arguments (such as `parse_only`) through the `bs_kwargs` parameter.

For more details, refer to the API documentation.
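
To see what `parse_only` does in isolation, the following minimal sketch applies a `SoupStrainer` to an inline HTML string instead of a live page. The HTML snippet and its class names are made up for illustration; only the `bs4` calls mirror what `bs_kwargs` forwards to `BeautifulSoup`.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: only the "entry-content" div holds the article text.
html = """
<html><body>
  <nav>Home | About</nav>
  <div class="entry-content">Article body.</div>
  <div class="sidebar">Related links</div>
</body></html>
"""

# Keep only <div> elements whose class is "entry-content".
only_article = SoupStrainer("div", attrs={"class": "entry-content"})

# parse_only tells BeautifulSoup to skip everything the strainer rejects.
soup = BeautifulSoup(html, "html.parser", parse_only=only_article)
text = soup.get_text().strip()
print(text)
```

The navigation and sidebar never enter the parsed tree, which is why the loader below returns only the article body.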

In [1]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Load news article content using WebBaseLoader
loader = WebBaseLoader(
   web_paths=("https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/",),
   # Configure BeautifulSoup to parse only specific div elements
   bs_kwargs=dict(
       parse_only=bs4.SoupStrainer(
           "div",
           attrs={"class": ["entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained"]},
       )
   ),
   # Set user agent in request header to mimic browser
   header_template={
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
   },
)

# Load and process the documents
docs = loader.load()
print(f"Number of documents: {len(docs)}")
docs[0]

USER_AGENT environment variable not set, consider setting it to identify your requests.


Number of documents: 1


Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\nGoogle CEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\n\n\n\n\n\n\n\n\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we h
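
The `USER_AGENT` warning in the output above appears to be emitted when the loader module is imported without that environment variable set. A minimal sketch for setting it first is shown below; the identifier string is a hypothetical placeholder, and the assumption is that the variable must be defined before `langchain_community` is imported.

```python
import os

# Hypothetical identifier; replace it with a string that describes your application.
# setdefault keeps any value that is already present in the environment.
os.environ.setdefault("USER_AGENT", "my-crawler/0.1")

# Import the loader only after the variable is set, for example:
# from langchain_community.document_loaders import WebBaseLoader
```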

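To work around SSL certificate verification errors, you can set the `verify` option in `requests_kwargs` to `False`. This disables certificate checking, so use it only with sources you trust.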

In [2]:
# Bypass SSL certificate verification
loader.requests_kwargs = {"verify": False}

# Load documents from the web
docs = loader.load()
docs[0]



Document(metadata={'source': 'https://techcrunch.com/2024/12/28/google-ceo-says-ai-model-gemini-will-the-companys-biggest-focus-in-2025/'}, page_content='\nGoogle CEO Sundar Pichai reportedly told Google employees that 2025 will be a “critical” year for the company.\nCNBC reports that it obtained audio from a December 18 strategy meeting where Pichai and other executives put on ugly holiday sweaters and laid out their priorities for the coming year.\n\n\n\n\n\n\n\n\n“I think 2025 will be critical,” Pichai said. “I think it’s really important we internalize the urgency of this moment, and need to move faster as a company. The stakes are high.”\nThe moment, of course, is one where tech companies like Google are making heavy investments in AI, and often with mixed results. Pichai acknowledged that the company has some catching up to do on the AI side — he described the Gemini app (based on the company’s AI model of the same name) as having “strong momentum,” while also acknowledging “we h

You can also load multiple webpages at once. To do this, pass a list of URLs to the loader; it returns a list of documents in the same order as the URLs.

In [3]:
# Initialize the WebBaseLoader with web page paths and parsing configurations
loader = WebBaseLoader(
    web_paths=[
        # List of web pages to load
        "https://techcrunch.com/2024/12/28/revisiting-the-biggest-moments-in-the-space-industry-in-2024/",
        "https://techcrunch.com/2024/12/29/ai-data-centers-could-be-distorting-the-us-power-grid/",
    ],
    bs_kwargs=dict(
        # BeautifulSoup settings to parse only the specific content section
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["entry-content wp-block-post-content is-layout-constrained wp-block-post-content-is-layout-constrained"]},
        )
    ),
    header_template={
        # Custom HTTP headers for the request (e.g., User-Agent for simulating a browser)
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    },
)

# Load the data from the specified web pages
docs = loader.load()

# Check and print the number of documents loaded
print(len(docs))


2


Print the first 500 characters of each document fetched from the web.

In [4]:
print(docs[0].page_content[:500])
print("===" * 10)
print(docs[1].page_content[:500])


We are at the dawn of a new space age.  If you doubt, simply look back at the last year: From SpaceX’s historic catch of the Super Heavy booster to the record-breaking number of lunar landing attempts, this year was full of historic and ambitious missions and demonstrations. 
We’re taking a look back at the five most significant moments or trends in the space industry this year. Naysayers might think SpaceX is overrepresented on this list, but that just shows how far ahead the space behemoth is

The proliferation of data centers aiming to meet the computational needs of AI could be bad news for the U.S. power grid, according to a new report in Bloomberg.
Using the 1 million residential sensors tracked by Whisker Labs, along with market intelligence data from DC Byte, Bloomberg found that more than half of the households showing the worst power distortions live within 20 miles of significant data center activity.








In other words, there appears to be a link between data center pr

## Load Multiple URLs Concurrently with `alazy_load()`

You can speed up scraping and parsing of multiple URLs by loading them asynchronously. This lets you fetch documents concurrently, improving efficiency while still respecting rate limits.

### Key Points:

-   **Rate Limit**: The `requests_per_second` parameter controls how many requests are made per second. In this example it is set to 1 to avoid overloading the server.
-   **Asynchronous Loading**: The `alazy_load()` method loads documents asynchronously, which speeds up processing of multiple URLs.
-   **Jupyter Notebook Compatibility**: When running in a Jupyter Notebook, `nest_asyncio` is required to handle asynchronous tasks correctly.

The code below shows how to configure the loader and load documents asynchronously:


In [5]:
# only for jupyter notebook (asyncio)
import nest_asyncio

nest_asyncio.apply()

In [6]:
# Set the requests per second rate limit
loader.requests_per_second = 1

# Load documents asynchronously
# aload() is deprecated as of langchain-community 0.3.14; use alazy_load() instead
docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

Fetching pages: 100%|##########| 2/2 [00:01<00:00,  1.87it/s]


In [7]:
# Display loaded documents
docs

[Document(metadata={'source': 'https://techcrunch.com/2024/12/28/revisiting-the-biggest-moments-in-the-space-industry-in-2024/'}, page_content='\nWe are at the dawn of a new space age.  If you doubt, simply look back at the last year: From SpaceX’s historic catch of the Super Heavy booster to the record-breaking number of lunar landing attempts, this year was full of historic and ambitious missions and demonstrations.\xa0\nWe’re taking a look back at the five most significant moments or trends in the space industry this year. Naysayers might think SpaceX is overrepresented on this list, but that just shows how far ahead the space behemoth is in relation to its competitors. \n\n\n\n\n\n\n\n\nHere we go, in no particular order:\xa0\n1. Boeing’s bungled Starliner mission turns into a SpaceX win\xa0\nNASA and Boeing no doubt had high hopes when the Starliner vehicle lifted off for its first crewed test mission in June. But a series of technical malfunctions occurred as the vehicle\xa0made 
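
The top-level `async for` used above works in a notebook but is not valid in a plain Python script. A minimal sketch of the script equivalent follows: it wraps the loop in a coroutine and runs it with `asyncio.run()`. The `collect` helper and the `fake_alazy_load` generator are hypothetical stand-ins so the example runs without network access; in real code you would pass `loader.alazy_load()` instead.

```python
import asyncio
from typing import AsyncIterator, List, TypeVar

T = TypeVar("T")


async def collect(items: AsyncIterator[T]) -> List[T]:
    """Drain an async iterator into a list."""
    return [item async for item in items]


async def fake_alazy_load() -> AsyncIterator[str]:
    """Hypothetical stand-in for loader.alazy_load(); yields two fake pages."""
    for page in ("page 1", "page 2"):
        await asyncio.sleep(0)  # give control back to the event loop, as real I/O would
        yield page


# In a script, asyncio.run() starts the event loop; no nest_asyncio is needed here.
docs = asyncio.run(collect(fake_alazy_load()))
print(len(docs))
```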

## Load XML Documents

`WebBaseLoader` can process XML files by specifying a different `BeautifulSoup` parser. This is especially useful for structured XML content such as sitemaps or government data.

### Basic XML Loading

The following example loads an XML document from a government website:


In [8]:
from langchain_community.document_loaders import WebBaseLoader

# Initialize loader with XML document URL
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)

# Set parser to XML mode
loader.default_parser = "xml"

# Load and process the document
docs = loader.load()

### Memory-Efficient Loading

For handling large documents, `WebBaseLoader` provides two memory-efficient loading methods:

1. `lazy_load()`: loads one page at a time
2. `alazy_load()`: asynchronous page loading for better performance

In [9]:
# Lazy Loading Example
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

# Print first 100 characters and metadata of the first page
print(pages[0].page_content[:100])
print(pages[0].metadata)



10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}


In [10]:
# Async Loading Example
pages = []
async for doc in loader.alazy_load():
    pages.append(doc)

# Print first 100 characters and metadata of the first page
print(pages[0].page_content[:100])
print(pages[0].metadata)

Fetching pages: 100%|##########| 1/1 [00:01<00:00,  1.41s/it]



10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
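
The examples above append every page to a list, which keeps all documents in memory. When only a few pages are needed, the iterator can be consumed lazily instead. The sketch below illustrates the idea with `itertools.islice`; `preview` and `fake_lazy_load` are hypothetical names, and the stand-in generator takes the place of `loader.lazy_load()` so the example runs offline.

```python
from itertools import islice
from types import SimpleNamespace
from typing import Iterable, Iterator


def preview(docs: Iterable, limit: int = 2, width: int = 100) -> Iterator[str]:
    """Yield the first `width` characters of at most `limit` documents."""
    for doc in islice(docs, limit):
        yield doc.page_content[:width]


def fake_lazy_load() -> Iterator[SimpleNamespace]:
    """Hypothetical stand-in for loader.lazy_load(); pages are created on demand."""
    for i in range(1000):
        yield SimpleNamespace(
            page_content=f"content of page {i}",
            metadata={"source": f"https://example.com/{i}"},
        )


# Only the first two pages are ever produced; the other 998 are never created.
snippets = list(preview(fake_lazy_load(), limit=2))
print(snippets)
```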





## Load Web-based Document Using Proxies

Sometimes you may need to use a proxy to get around IP blocking.

To use a proxy, pass a proxies dictionary to the loader (and to the underlying `requests` library).

### ⚠️ Warning:

-   Replace `{username}`, `{password}`, and `proxy.service.com` with your actual proxy credentials and server information.
-   Without a valid proxy configuration, errors such as **ProxyError** or **AuthenticationError** may occur.


In [None]:
# Initialize the web loader with proxy settings for both HTTP and HTTPS requests
loader = WebBaseLoader(
   "https://www.google.com/search?q=parrots",
   proxies={
       "http": "http://{username}:{password}@proxy.service.com:6666/",
       "https": "https://{username}:{password}@proxy.service.com:6666/",
   },
)

# Load documents using the proxy
docs = loader.load()
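
Credentials that contain characters such as `@` or `:` can break the proxy URL unless they are percent-encoded. The sketch below shows one way to build the `proxies` dictionary safely with the standard library. `build_proxies` is a hypothetical helper, and the credentials, host, and port are placeholders.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Return a requests-style proxies dict with percent-encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return {
        "http": f"http://{user}:{pwd}@{host}:{port}/",
        "https": f"https://{user}:{pwd}@{host}:{port}/",
    }


# Placeholder values; substitute your own proxy details.
proxies = build_proxies("user", "p@ss:word", "proxy.service.com", 6666)
print(proxies["http"])
```

The resulting dictionary can be passed as the `proxies` argument shown above.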

## Simple Web Content Loading with `MarkItDown`

Unlike `WebBaseLoader`, which uses `BeautifulSoup4` for fine-grained HTML parsing, `MarkItDown` offers a simpler way to load web content. It fetches the page directly over HTTP and converts it to Markdown, without detailed parsing options.

```bash
pip install markitdown
```

Below is a basic example of loading web content with `MarkItDown`:


In [12]:
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://techcrunch.com/2024/12/28/revisiting-the-biggest-moments-in-the-space-industry-in-2024/")
result_text = result.text_content

In [13]:
print(result_text[:1000])

[![](https://techcrunch.com/wp-content/uploads/2024/09/tc-lockup.svg) TechCrunch Desktop Logo](https://techcrunch.com)

[![](https://techcrunch.com/wp-content/uploads/2024/09/tc-logo-mobile.svg) TechCrunch Mobile Logo](https://techcrunch.com)

* [Latest](/latest/)
* [Startups](/category/startups/)
* [Venture](/category/venture/)
* [Apple](/tag/apple/)
* [Security](/category/security/)
* [AI](/category/artificial-intelligence/)
* [Apps](/category/apps/)
* [SXSW 2025](https://techcrunch.com/storyline/sxsw-2025-live-coverage-jay-grabers-keynote-plus-more-ai-wooly-mammoths-and-death-stranding-2/)

* [Events](/events/)
* [Podcasts](/podcasts/)
* [Newsletters](/newsletters/)

[Sign In](https://oidc.techcrunch.com/login/?dest=https%3A%2F%2Ftechcrunch.com%2F2024%2F12%2F28%2Frevisiting-the-biggest-moments-in-the-space-industry-in-2024%2F)
[![]()](https://techcrunch.com/my-account/)

Search

Submit

Site Search Toggle

Mega Menu Toggle

### Topics

[Latest](/latest/)

[AI](/category/artificial-i