HTML Parser Plugin for Dify

A powerful HTML parsing plugin for Dify that uses BeautifulSoup to extract text, elements, links, images, and attributes from HTML content. This plugin supports both direct HTML input and URL fetching.

Features

  • Text Extraction: Extract clean text content from HTML
  • Element Finding: Find specific HTML elements using CSS selectors
  • Link Extraction: Extract all links with their properties
  • Image Extraction: Extract all images with their attributes
  • Attribute Parsing: Get specific attributes from HTML elements
  • URL Support: Fetch and parse content directly from URLs
  • CSS Selector Support: Use powerful CSS selectors for precise element targeting

Installation

  1. Install the required dependencies:

     pip install -r requirements.txt

  2. The plugin depends on:
    • beautifulsoup4>=4.12.0 - HTML parsing
    • lxml>=4.9.0 - Fast XML/HTML parser backend
    • requests>=2.31.0 - HTTP requests for URL fetching
    • dify_plugin>=0.2.0,<0.3.0 - Dify plugin framework

Usage

Parameters

Parameter        Type     Required  Description
html_content     string   Yes       HTML content to parse or URL to fetch
operation        select   Yes       Type of parsing operation
selector         string   No        CSS selector for targeting specific elements
attribute_name   string   No        Name of attribute to extract
strip_tags       boolean  No        Whether to remove HTML tags (default: true)
output_format    select   No        Output format: 'text' or 'json' (default: text)

Output Formats

The plugin supports two output formats:

  • Text Output (output_format: 'text'): Returns human-readable text messages. This is the default format and is ideal when you want clean, readable text results.
  • JSON Output (output_format: 'json'): Returns structured JSON data with detailed metadata. This format is better for programmatic processing.

Text Output Examples:

# Text extraction
Operation: extract_text
Output: "HTML Parser Test This tool uses BeautifulSoup to parse HTML."

# Link extraction
Operation: extract_links
Output:
"Found 1 link(s):
1. BeautifulSoup -> https://example.com"

# Element finding
Operation: find_elements
Output:
"Found 3 element(s):
1. <p class='description'>: This tool uses BeautifulSoup to parse HTML.
2. <li>: Text extraction
3. <li>: Element search"

JSON Output Examples:

{
  "operation": "extract_text",
  "selector": "all",
  "result": "HTML パーサーのテスト このツールはBeautifulSoupを使用してHTMLを解析します。",
  "count": 1
}
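
For reference, the JSON shape above can be assembled from a plain list of extraction results. The following is a minimal sketch in Python, not the plugin's actual implementation; format_result is a hypothetical helper:

import json

def format_result(operation, selector, results, output_format="text"):
    """Assemble either the text or the JSON response shape."""
    if output_format == "json":
        payload = {
            "operation": operation,
            "selector": selector or "all",
            "result": results[0] if len(results) == 1 else results,
            "count": len(results),
        }
        return json.dumps(payload, ensure_ascii=False, indent=2)
    # Default: human-readable text
    lines = [f"Found {len(results)} result(s):"]
    lines.extend(f"{i}. {r}" for i, r in enumerate(results, 1))
    return "\n".join(lines)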

Operations

1. Extract Text (extract_text)

Extracts all text content from HTML, optionally targeting specific elements.

Example:

Input: <div><h1>Title</h1><p>Content</p></div>
Selector: "h1"
Output: "Title"
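
Internally this corresponds to BeautifulSoup's get_text. A minimal sketch, assuming an lxml backend; extract_text here is a hypothetical helper, not the plugin's internal function:

from bs4 import BeautifulSoup

def extract_text(html, selector=None):
    soup = BeautifulSoup(html, "lxml")
    if selector:
        # Join the text of every element matching the CSS selector
        return " ".join(el.get_text(strip=True) for el in soup.select(selector))
    return soup.get_text(separator=" ", strip=True)

print(extract_text("<div><h1>Title</h1><p>Content</p></div>", "h1"))  # -> Title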

2. Find Elements (find_elements)

Finds HTML elements and returns detailed information about them.

Example:

Input: <div class="content"><p id="para1">Text</p></div>
Selector: "p"
Output: [{"tag": "p", "text": "Text", "id": "para1", ...}]
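
A minimal sketch of the same idea with soup.select, collecting each match's tag name, text, and attributes (find_elements here is a hypothetical helper):

from bs4 import BeautifulSoup

def find_elements(html, selector):
    soup = BeautifulSoup(html, "lxml")
    results = []
    for el in soup.select(selector):
        info = {"tag": el.name, "text": el.get_text(strip=True)}
        info.update(el.attrs)  # id, class, and any other attributes
        results.append(info)
    return results

print(find_elements('<div class="content"><p id="para1">Text</p></div>', "p"))
# -> [{'tag': 'p', 'text': 'Text', 'id': 'para1'}]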

3. Extract Links (extract_links)

Extracts all links from the HTML with their properties.

Example:

Input: <a href="https://example.com" title="Example">Link</a>
Output: [{"text": "Link", "href": "https://example.com", "title": "Example"}]
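
A minimal sketch using find_all on anchor tags that carry an href attribute (hypothetical helper, not the plugin's code):

from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "lxml")
    return [
        {"text": a.get_text(strip=True), "href": a["href"], "title": a.get("title", "")}
        for a in soup.find_all("a", href=True)
    ]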

4. Extract Images (extract_images)

Extracts all images with their attributes.

Example:

Input: <img src="image.jpg" alt="Photo" width="100">
Output: [{"src": "image.jpg", "alt": "Photo", "width": "100"}]
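
A minimal sketch that keeps whatever attributes each <img> tag declares (hypothetical helper):

from bs4 import BeautifulSoup

def extract_images(html):
    soup = BeautifulSoup(html, "lxml")
    # Each <img> becomes a dict of its declared attributes (src, alt, width, ...)
    return [dict(img.attrs) for img in soup.find_all("img")]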

5. Get Attributes (get_attributes)

Extracts specific attributes from targeted elements.

Example:

Input: <div class="container" id="main">Content</div>
Selector: "div"
Attribute: "class"
Output: [{"tag": "div", "class": "container", "text": "Content"}]
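
A minimal sketch that returns the requested attribute for each matching element, flattening multi-valued attributes such as class (hypothetical helper):

from bs4 import BeautifulSoup

def get_attributes(html, selector, attribute_name):
    soup = BeautifulSoup(html, "lxml")
    results = []
    for el in soup.select(selector):
        if not el.has_attr(attribute_name):
            continue
        value = el[attribute_name]
        if isinstance(value, list):  # multi-valued attributes such as class
            value = " ".join(value)
        results.append({"tag": el.name, attribute_name: value, "text": el.get_text(strip=True)})
    return results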

CSS Selector Examples

  • "p" - All paragraph elements
  • ".class-name" - Elements with specific class
  • "#element-id" - Element with specific ID
  • "div.container" - Div elements with "container" class
  • "a[href]" - All links with href attribute
  • "img[src*='photo']" - Images with "photo" in src
  • "h1, h2, h3" - All heading elements
  • "div > p" - Direct paragraph children of div
  • "li:first-child" - First list item

URL Support

The plugin automatically detects URLs and fetches their content:

Input: "https://example.com"
Operation: "extract_text"
Result: Fetches the webpage and extracts all text content
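
A minimal sketch of how URL detection and fetching might work, assuming requests with the 10-second timeout noted under Technical Details (resolve_html is a hypothetical helper, not the plugin's code):

import requests

def resolve_html(html_content):
    # Input that looks like a URL is fetched; anything else is treated as raw HTML.
    candidate = html_content.strip()
    if candidate.lower().startswith(("http://", "https://")):
        resp = requests.get(candidate, timeout=10)  # 10-second timeout (see Technical Details)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding  # rough automatic encoding detection
        return resp.text
    return html_content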

Error Handling

The plugin handles various error scenarios, as sketched below:

  • Invalid HTML content
  • Network errors when fetching URLs
  • Invalid CSS selectors
  • Missing required parameters
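
A rough sketch of how these failure modes might be caught, assuming requests for fetching and BeautifulSoup for parsing (safe_select is a hypothetical helper, not the plugin's actual error handling):

import requests
from bs4 import BeautifulSoup

def safe_select(html_content, selector=None):
    try:
        if html_content.strip().lower().startswith(("http://", "https://")):
            html_content = requests.get(html_content.strip(), timeout=10).text
        soup = BeautifulSoup(html_content, "lxml")
        return soup.select(selector) if selector else [soup]
    except requests.RequestException as exc:
        return f"Network error while fetching URL: {exc}"
    except Exception as exc:  # e.g. an invalid CSS selector rejected by the selector engine
        return f"Parsing error: {exc}"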

Example Use Cases

1. Web Scraping

Operation: extract_text
HTML Content: "https://news.example.com"
Selector: "article .content"

2. Email Template Processing

Operation: find_elements
HTML Content: "<email template HTML>"
Selector: ".call-to-action"

3. Content Analysis

Operation: extract_links
HTML Content: "<webpage HTML>"
Selector: "a[href^='https']"  # External links only

4. Image Inventory

Operation: extract_images
HTML Content: "<gallery HTML>"
Selector: ".gallery img"

5. Data Extraction

Operation: get_attributes
HTML Content: "<product page HTML>"
Selector: ".product"
Attribute: "data-price"

Technical Details

  • Parser: BeautifulSoup with lxml backend for fast and accurate parsing
  • Encoding: Automatic encoding detection for web content
  • Limits: Element results limited to 50 items for performance
  • Timeout: 10-second timeout for URL requests
  • Memory: Optimized for processing large HTML documents
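
A minimal sketch tying the parser choice and the result cap together (assumed structure based on the values listed above, not the plugin's code):

from bs4 import BeautifulSoup

MAX_RESULTS = 50  # element results are capped at 50 items for performance

def select_limited(html, selector=None):
    soup = BeautifulSoup(html, "lxml")  # lxml backend for fast parsing
    elements = soup.select(selector) if selector else [soup]
    return elements[:MAX_RESULTS]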

Contributing

To extend the plugin:

  1. Add new operations in the _perform_operation method
  2. Update the YAML configuration with new parameters
  3. Add corresponding test cases

License

This plugin is part of the Dify plugin ecosystem.

Author: benridane
Version: 0.0.1
Type: tool
