# HTML Parser Plugin for Dify

A powerful HTML parsing plugin for Dify that uses BeautifulSoup to extract text, elements, links, images, and attributes from HTML content. The plugin supports both direct HTML input and URL fetching.

## Features
- Text Extraction: Extract clean text content from HTML
- Element Finding: Find specific HTML elements using CSS selectors
- Link Extraction: Extract all links with their properties
- Image Extraction: Extract all images with their attributes
- Attribute Parsing: Get specific attributes from HTML elements
- URL Support: Fetch and parse content directly from URLs
- CSS Selector Support: Use powerful CSS selectors for precise element targeting
## Installation

Install the required dependencies:

```
pip install -r requirements.txt
```

The plugin depends on:
- `beautifulsoup4>=4.12.0` - HTML parsing
- `lxml>=4.9.0` - fast XML/HTML parser backend
- `requests>=2.31.0` - HTTP requests for URL fetching
- `dify_plugin>=0.2.0,<0.3.0` - Dify plugin framework
## Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `html_content` | string | Yes | HTML content to parse or URL to fetch |
| `operation` | select | Yes | Type of parsing operation |
| `selector` | string | No | CSS selector for targeting specific elements |
| `attribute_name` | string | No | Name of attribute to extract |
| `strip_tags` | boolean | No | Whether to remove HTML tags (default: true) |
| `output_format` | select | No | Output format: 'text' or 'json' (default: text) |
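As a sketch of how these parameters fit together, a payload for the `extract_links` operation might look like the dictionary below. The field names come from the table above; the surrounding invocation is Dify-specific and omitted, so treat this shape as illustrative only.

```python
# Hypothetical parameter payload for the extract_links operation.
# Field names match the parameter table; the actual call into the
# plugin is made by the Dify runtime.
params = {
    "html_content": "<a href='https://example.com'>Example</a>",
    "operation": "extract_links",
    "selector": "a[href]",       # optional: restrict to links with an href
    "strip_tags": True,          # default
    "output_format": "json",     # 'text' (default) or 'json'
}
```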
## Output Formats

The plugin supports two output formats:

- Text Output (`output_format: 'text'`): Returns human-readable text messages. This is the default format and is ideal when you want clean, readable results.
- JSON Output (`output_format: 'json'`): Returns structured JSON data with detailed metadata. This format is better suited to programmatic processing.
**Text Output Examples:**

```
# Text extraction
Operation: extract_text
Output: "HTML Parser Test This tool parses HTML using BeautifulSoup."

# Link extraction
Operation: extract_links
Output:
"Found 1 link(s):
1. BeautifulSoup -> https://example.com"

# Element finding
Operation: find_elements
Output:
"Found 3 element(s):
1. <p class='description'>: This tool parses HTML using BeautifulSoup.
2. <li>: Text extraction
3. <li>: Element search"
```
**JSON Output Example:**

```json
{
  "operation": "extract_text",
  "selector": "all",
  "result": "HTML Parser Test This tool parses HTML using BeautifulSoup.",
  "count": 1
}
```
## Operations

### extract_text

Extracts all text content from HTML, optionally targeting specific elements.

Example:
```
Input: <div><h1>Title</h1><p>Content</p></div>
Selector: "h1"
Output: "Title"
```
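A minimal BeautifulSoup sketch of this behavior (not the plugin's actual code; it uses the stdlib `html.parser` for brevity, whereas the plugin uses the lxml backend):

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>Content</p></div>"
soup = BeautifulSoup(html, "html.parser")

# With a selector, join the text of only the matching elements.
selected = soup.select("h1")
text = " ".join(el.get_text(strip=True) for el in selected)
print(text)  # Title

# Without a selector, extract all text from the document.
all_text = soup.get_text(separator=" ", strip=True)
print(all_text)  # Title Content
```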
### find_elements

Finds HTML elements and returns detailed information about them.

Example:
```
Input: <div class="content"><p id="para1">Text</p></div>
Selector: "p"
Output: [{"tag": "p", "text": "Text", "id": "para1", ...}]
```
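One way to build such per-element dictionaries with BeautifulSoup, shown here as an illustrative sketch rather than the plugin's implementation:

```python
from bs4 import BeautifulSoup

html = '<div class="content"><p id="para1">Text</p></div>'
soup = BeautifulSoup(html, "html.parser")

# One info dict per matching element: tag name, text, plus all attributes.
elements = [
    {"tag": el.name, "text": el.get_text(strip=True), **el.attrs}
    for el in soup.select("p")
]
print(elements)  # [{'tag': 'p', 'text': 'Text', 'id': 'para1'}]
```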
### extract_links

Extracts all links from the HTML with their properties.

Example:
```
Input: <a href="https://example.com" title="Example">Link</a>
Output: [{"text": "Link", "href": "https://example.com", "title": "Example"}]
```
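A hedged sketch of the same extraction with BeautifulSoup's `find_all`:

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com" title="Example">Link</a>'
soup = BeautifulSoup(html, "html.parser")

# Collect text, href, and title for every anchor element.
links = [
    {
        "text": a.get_text(strip=True),
        "href": a.get("href", ""),
        "title": a.get("title", ""),
    }
    for a in soup.find_all("a")
]
print(links)
```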
### extract_images

Extracts all images with their attributes.

Example:
```
Input: <img src="image.jpg" alt="Photo" width="100">
Output: [{"src": "image.jpg", "alt": "Photo", "width": "100"}]
```
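Image attributes map naturally onto BeautifulSoup's `attrs` dictionary; a minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<img src="image.jpg" alt="Photo" width="100">'
soup = BeautifulSoup(html, "html.parser")

# Each <img> tag's attrs dict already holds src, alt, width, etc.
images = [dict(img.attrs) for img in soup.find_all("img")]
print(images)  # [{'src': 'image.jpg', 'alt': 'Photo', 'width': '100'}]
```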
### get_attributes

Extracts specific attributes from targeted elements.

Example:
```
Input: <div class="container" id="main">Content</div>
Selector: "div"
Attribute: "class"
Output: [{"tag": "div", "class": "container", "text": "Content"}]
```
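A sketch of attribute extraction, with one detail worth noting: BeautifulSoup returns multi-valued attributes such as `class` as a list, so a flattening step is needed to get the string shown above.

```python
from bs4 import BeautifulSoup

html = '<div class="container" id="main">Content</div>'
soup = BeautifulSoup(html, "html.parser")

attribute = "class"
results = []
for el in soup.select("div"):
    value = el.get(attribute)
    # class (and a few other attributes) come back as a list of tokens.
    if isinstance(value, list):
        value = " ".join(value)
    results.append({"tag": el.name, attribute: value,
                    "text": el.get_text(strip=True)})
print(results)  # [{'tag': 'div', 'class': 'container', 'text': 'Content'}]
```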
## CSS Selector Examples

- `p` - all paragraph elements
- `.class-name` - elements with a specific class
- `#element-id` - the element with a specific ID
- `div.container` - `div` elements with the `container` class
- `a[href]` - all links with an `href` attribute
- `img[src*='photo']` - images with "photo" in `src`
- `h1, h2, h3` - all heading elements
- `div > p` - paragraphs that are direct children of a `div`
- `li:first-child` - the first list item
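The selector patterns above can be exercised directly with `soup.select`; a self-contained check against a small hypothetical document:

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1>Heading</h1>
  <p>First</p>
  <ul><li>One</li><li>Two</li></ul>
  <a href="https://example.com">Link</a>
  <img src="photo1.jpg" alt="A photo">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

assert len(soup.select("p")) == 1                  # all paragraphs
assert len(soup.select("div.container")) == 1      # div with class
assert len(soup.select("a[href]")) == 1            # links with href
assert len(soup.select("img[src*='photo']")) == 1  # 'photo' in src
assert len(soup.select("div > p")) == 1            # direct children only
assert soup.select("li:first-child")[0].get_text() == "One"
```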
## URL Fetching

The plugin automatically detects URLs and fetches their content:

```
Input: "https://example.com"
Operation: "extract_text"
Result: fetches the webpage and extracts all text content
```
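URL auto-detection can be as simple as a scheme prefix check before falling back to treating the input as raw HTML. The helper below is a sketch of that idea under the document's stated constraints (10-second timeout, automatic encoding detection); the plugin's real detection logic may differ.

```python
import requests

def load_html(html_or_url: str, timeout: int = 10) -> str:
    """Return HTML, fetching it first when the input looks like a URL."""
    if html_or_url.startswith(("http://", "https://")):
        resp = requests.get(html_or_url, timeout=timeout)
        resp.raise_for_status()
        # Rough automatic encoding detection for web content.
        resp.encoding = resp.apparent_encoding
        return resp.text
    return html_or_url

# Plain HTML passes through unchanged; URL inputs would be fetched.
print(load_html("<p>inline</p>"))  # <p>inline</p>
```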
## Error Handling

The plugin handles various error scenarios:
- Invalid HTML content
- Network errors when fetching URLs
- Invalid CSS selectors
- Missing required parameters
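For the invalid-selector case in particular, BeautifulSoup raises on malformed CSS, so a wrapper can convert exceptions into an error result instead of crashing the workflow. This is an illustrative pattern, not the plugin's actual error-handling code:

```python
from bs4 import BeautifulSoup

def safe_select(html: str, selector: str):
    """Return (results, error) instead of raising on bad input."""
    try:
        soup = BeautifulSoup(html, "html.parser")
        return soup.select(selector), None
    except Exception as exc:  # e.g. an invalid CSS selector
        return [], f"Parse error: {exc}"

results, error = safe_select("<p>ok</p>", "p")
print(len(results), error)  # 1 None

results, error = safe_select("<p>ok</p>", "p[[")  # invalid selector
print(results, error is not None)  # [] True
```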
## Use Cases

Extract article text from a news page:
```
Operation: extract_text
HTML Content: "https://news.example.com"
Selector: "article .content"
```

Find call-to-action elements in an email template:
```
Operation: find_elements
HTML Content: "<email template HTML>"
Selector: ".call-to-action"
```

Extract only external (https) links from a webpage:
```
Operation: extract_links
HTML Content: "<webpage HTML>"
Selector: "a[href^='https']"
```

Extract gallery images:
```
Operation: extract_images
HTML Content: "<gallery HTML>"
Selector: ".gallery img"
```

Read data attributes from product listings:
```
Operation: get_attributes
HTML Content: "<product page HTML>"
Selector: ".product"
Attribute: "data-price"
```
## Technical Details

- Parser: BeautifulSoup with the lxml backend for fast, accurate parsing
- Encoding: automatic encoding detection for web content
- Limits: element results capped at 50 items for performance
- Timeout: 10-second timeout for URL requests
- Memory: optimized for processing large HTML documents
## Development

To extend the plugin:
1. Add new operations in the `_perform_operation` method
2. Update the YAML configuration with new parameters
3. Add corresponding test cases
This plugin is part of the Dify plugin ecosystem.

- Author: benridane
- Version: 0.0.1
- Type: tool