HTML Parser Plugin for Dify

A powerful HTML parsing plugin for Dify that uses BeautifulSoup to extract text, elements, links, images, and attributes from HTML content. This plugin supports both direct HTML input and URL fetching.

Features

  • Text Extraction: Extract clean text content from HTML
  • Element Finding: Find specific HTML elements using CSS selectors
  • Link Extraction: Extract all links with their properties
  • Image Extraction: Extract all images with their attributes
  • Attribute Parsing: Get specific attributes from HTML elements
  • URL Support: Fetch and parse content directly from URLs
  • CSS Selector Support: Use powerful CSS selectors for precise element targeting

Installation

  1. Install the required dependencies:

     pip install -r requirements.txt

  2. The plugin depends on:
    • beautifulsoup4>=4.12.0 - HTML parsing
    • lxml>=4.9.0 - Fast XML/HTML parser backend
    • requests>=2.31.0 - HTTP requests for URL fetching
    • dify_plugin>=0.2.0,<0.3.0 - Dify plugin framework

Usage

Parameters

Parameter        Type     Required  Description
html_content     string   Yes       HTML content to parse or URL to fetch
operation        select   Yes       Type of parsing operation
selector         string   No        CSS selector for targeting specific elements
attribute_name   string   No        Name of attribute to extract
strip_tags       boolean  No        Whether to remove HTML tags (default: true)
output_format    select   No        Output format: 'text' or 'json' (default: text)

Output Formats

The plugin supports two output formats:

  • Text Output (output_format: 'text'): Returns human-readable text messages. This is the default format and is ideal when you want clean, readable text results.
  • JSON Output (output_format: 'json'): Returns structured JSON data with detailed metadata. This format is better for programmatic processing.

Text Output Examples:

# Text extraction
Operation: extract_text
Output: "HTML Parser Test This tool uses BeautifulSoup to parse HTML."

# Link extraction
Operation: extract_links
Output:
"Found 1 link(s):
1. BeautifulSoup -> https://example.com"

# Element finding
Operation: find_elements
Output:
"Found 3 element(s):
1. <p class='description'>: This tool uses BeautifulSoup to parse HTML.
2. <li>: Text extraction
3. <li>: Element search"

JSON Output Examples:

{
  "operation": "extract_text",
  "selector": "all",
  "result": "HTML パーサーのテスト このツールはBeautifulSoupを使用してHTMLを解析します。",
  "count": 1
}
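
For reference, the JSON shape above can be assembled from a plain list of extraction results. The following is a minimal sketch in Python, not the plugin's actual implementation; format_result is a hypothetical helper:

import json

def format_result(operation, selector, results, output_format="text"):
    """Assemble either the text or the JSON response shape."""
    if output_format == "json":
        payload = {
            "operation": operation,
            "selector": selector or "all",
            "result": results[0] if len(results) == 1 else results,
            "count": len(results),
        }
        return json.dumps(payload, ensure_ascii=False, indent=2)
    # Default: human-readable text
    lines = [f"Found {len(results)} result(s):"]
    lines.extend(f"{i}. {r}" for i, r in enumerate(results, 1))
    return "\n".join(lines)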

Operations

1. Extract Text (extract_text)

Extracts all text content from HTML, optionally targeting specific elements.

Example:

Input: <div><h1>Title</h1><p>Content</p></div>
Selector: "h1"
Output: "Title"
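
Internally this corresponds to BeautifulSoup's get_text. A minimal sketch, assuming an lxml backend; extract_text here is a hypothetical helper, not the plugin's internal function:

from bs4 import BeautifulSoup

def extract_text(html, selector=None):
    soup = BeautifulSoup(html, "lxml")
    if selector:
        # Join the text of every element matching the CSS selector
        return " ".join(el.get_text(strip=True) for el in soup.select(selector))
    return soup.get_text(separator=" ", strip=True)

print(extract_text("<div><h1>Title</h1><p>Content</p></div>", "h1"))  # -> Title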

2. Find Elements (find_elements)

Finds HTML elements and returns detailed information about them.

Example:

Input: <div class="content"><p id="para1">Text</p></div>
Selector: "p"
Output: [{"tag": "p", "text": "Text", "id": "para1", ...}]
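
A minimal sketch of the same idea with soup.select, collecting each match's tag name, text, and attributes (find_elements here is a hypothetical helper):

from bs4 import BeautifulSoup

def find_elements(html, selector):
    soup = BeautifulSoup(html, "lxml")
    results = []
    for el in soup.select(selector):
        info = {"tag": el.name, "text": el.get_text(strip=True)}
        info.update(el.attrs)  # id, class, and any other attributes
        results.append(info)
    return results

print(find_elements('<div class="content"><p id="para1">Text</p></div>', "p"))
# -> [{'tag': 'p', 'text': 'Text', 'id': 'para1'}]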

3. Extract Links (extract_links)

Extracts all links from the HTML with their properties.

Example:

Input: <a href="https://example.com" title="Example">Link</a>
Output: [{"text": "Link", "href": "https://example.com", "title": "Example"}]
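
A minimal sketch using find_all on anchor tags that carry an href attribute (hypothetical helper, not the plugin's code):

from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "lxml")
    return [
        {"text": a.get_text(strip=True), "href": a["href"], "title": a.get("title", "")}
        for a in soup.find_all("a", href=True)
    ]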

4. Extract Images (extract_images)

Extracts all images with their attributes.

Example:

Input: <img src="image.jpg" alt="Photo" width="100">
Output: [{"src": "image.jpg", "alt": "Photo", "width": "100"}]
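
A minimal sketch that keeps whatever attributes each <img> tag declares (hypothetical helper):

from bs4 import BeautifulSoup

def extract_images(html):
    soup = BeautifulSoup(html, "lxml")
    # Each <img> becomes a dict of its declared attributes (src, alt, width, ...)
    return [dict(img.attrs) for img in soup.find_all("img")]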

5. Get Attributes (get_attributes)

Extracts specific attributes from targeted elements.

Example:

Input: <div class="container" id="main">Content</div>
Selector: "div"
Attribute: "class"
Output: [{"tag": "div", "class": "container", "text": "Content"}]
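
A minimal sketch that returns the requested attribute for each matching element, flattening multi-valued attributes such as class (hypothetical helper):

from bs4 import BeautifulSoup

def get_attributes(html, selector, attribute_name):
    soup = BeautifulSoup(html, "lxml")
    results = []
    for el in soup.select(selector):
        if not el.has_attr(attribute_name):
            continue
        value = el[attribute_name]
        if isinstance(value, list):  # multi-valued attributes such as class
            value = " ".join(value)
        results.append({"tag": el.name, attribute_name: value, "text": el.get_text(strip=True)})
    return results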

CSS Selector Examples

  • "p" - All paragraph elements
  • ".class-name" - Elements with specific class
  • "#element-id" - Element with specific ID
  • "div.container" - Div elements with "container" class
  • "a[href]" - All links with href attribute
  • "img[src*='photo']" - Images with "photo" in src
  • "h1, h2, h3" - All heading elements
  • "div > p" - Direct paragraph children of div
  • "li:first-child" - First list item

URL Support

The plugin automatically detects URLs and fetches their content:

Input: "https://example.com"
Operation: "extract_text"
Result: Fetches the webpage and extracts all text content
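
A minimal sketch of how URL detection and fetching might work, assuming requests with the 10-second timeout noted under Technical Details (resolve_html is a hypothetical helper, not the plugin's code):

import requests

def resolve_html(html_content):
    # Input that looks like a URL is fetched; anything else is treated as raw HTML.
    candidate = html_content.strip()
    if candidate.lower().startswith(("http://", "https://")):
        resp = requests.get(candidate, timeout=10)  # 10-second timeout (see Technical Details)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding  # rough automatic encoding detection
        return resp.text
    return html_content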

Error Handling

The plugin handles various error scenarios, as sketched below:

  • Invalid HTML content
  • Network errors when fetching URLs
  • Invalid CSS selectors
  • Missing required parameters
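
A rough sketch of how these failure modes might be caught, assuming requests for fetching and BeautifulSoup for parsing (safe_select is a hypothetical helper, not the plugin's actual error handling):

import requests
from bs4 import BeautifulSoup

def safe_select(html_content, selector=None):
    try:
        if html_content.strip().lower().startswith(("http://", "https://")):
            html_content = requests.get(html_content.strip(), timeout=10).text
        soup = BeautifulSoup(html_content, "lxml")
        return soup.select(selector) if selector else [soup]
    except requests.RequestException as exc:
        return f"Network error while fetching URL: {exc}"
    except Exception as exc:  # e.g. an invalid CSS selector rejected by the selector engine
        return f"Parsing error: {exc}"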

Example Use Cases

1. Web Scraping

Operation: extract_text
HTML Content: "https://news.example.com"
Selector: "article .content"

2. Email Template Processing

Operation: find_elements
HTML Content: "<email template HTML>"
Selector: ".call-to-action"

3. Content Analysis

Operation: extract_links
HTML Content: "<webpage HTML>"
Selector: "a[href^='https']"  # External links only

4. Image Inventory

Operation: extract_images
HTML Content: "<gallery HTML>"
Selector: ".gallery img"

5. Data Extraction

Operation: get_attributes
HTML Content: "<product page HTML>"
Selector: ".product"
Attribute: "data-price"

Technical Details

  • Parser: BeautifulSoup with lxml backend for fast and accurate parsing
  • Encoding: Automatic encoding detection for web content
  • Limits: Element results limited to 50 items for performance
  • Timeout: 10-second timeout for URL requests
  • Memory: Optimized for processing large HTML documents
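
A minimal sketch tying the parser choice and the result cap together (assumed structure based on the values listed above, not the plugin's code):

from bs4 import BeautifulSoup

MAX_RESULTS = 50  # element results are capped at 50 items for performance

def select_limited(html, selector=None):
    soup = BeautifulSoup(html, "lxml")  # lxml backend for fast parsing
    elements = soup.select(selector) if selector else [soup]
    return elements[:MAX_RESULTS]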

Contributing

To extend the plugin:

  1. Add new operations in the _perform_operation method
  2. Update the YAML configuration with new parameters
  3. Add corresponding test cases

License

This plugin is part of the Dify plugin ecosystem.

Author: benridane
Version: 0.0.1
Type: tool
