# Metadata

**L1 Taxonomy** - Computing Paradigms

**L2 Taxonomy** - Domain-Specific Languages (DSL)

**Subtopic** - Implementing a parser for a subset of Markdown to HTML conversion (DSL for document formatting)

**Use Case** - Develop a Python script that parses a subset of Markdown syntax from a local text file and converts it into HTML. The script should support the following Markdown elements: headers (H1 to H6), bold text, italic text, lists (unordered and ordered), and links. The script should output the resulting HTML to a new local file.

**Programming Language** - Python

**Target Model** - GPT-4o

# Setup

```requirements.txt
```


# Prompt

## Problem Overview

You are tasked with creating a sophisticated Markdown to HTML conversion system that can process complex nested structures, handle multiple input files simultaneously, and provide extensible parsing capabilities. The system must implement a robust parser that can handle intricate combinations of Markdown elements including nested lists, inline formatting within headers, links containing formatted text, and mixed content scenarios. The parser should be designed as a modular system that can be easily extended with additional Markdown features in the future.

Your implementation must process Markdown files with sophisticated nesting patterns, maintain proper HTML structure integrity, handle malformed input gracefully, and provide detailed error reporting. The system should support batch processing of multiple files, custom output directory management, and configurable parsing options.


## Input Format

The input consists of one or more Markdown text files (.md extension) containing a subset of Markdown syntax. Each file may contain:

- Headers from H1 to H6 using hash notation (# to ######)
- Bold text enclosed in double asterisks (**text**)
- Italic text enclosed in single asterisks (*text*)
- Unordered lists using hyphens (-) or asterisks (*)
- Ordered lists using numbers followed by periods (1., 2., etc.)
- Links in the format [link text](URL)
- Mixed inline formatting within headers and list items
- Nested lists with proper indentation (2 or 4 spaces per level)
- Empty lines and whitespace that should be preserved appropriately
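
As a rough illustration of how the block-level elements listed above might be told apart, the sketch below classifies a single line with three regular expressions. The names `classify_line`, `HEADER_RE`, `UL_RE`, and `OL_RE` are hypothetical and are not part of the required interface. The indent is reported as a raw count of leading whitespace characters, which is an assumption of this sketch.

```python
import re

# Hypothetical patterns for the block-level syntax described above.
HEADER_RE = re.compile(r'^(#{1,6})\s+(.*)$')
UL_RE = re.compile(r'^(\s*)[-*]\s+(.*)$')
OL_RE = re.compile(r'^(\s*)\d+\.\s+(.*)$')


def classify_line(line):
    """Return a (kind, detail, text) tuple for one Markdown line.

    detail is the header level for headers, the count of leading
    whitespace characters for list items, and 0 otherwise.
    """
    line = line.rstrip()
    if not line:
        return ('blank', 0, '')
    match = HEADER_RE.match(line)
    if match:
        return ('header', len(match.group(1)), match.group(2))
    match = UL_RE.match(line)
    if match:
        return ('ul', len(match.group(1)), match.group(2))
    match = OL_RE.match(line)
    if match:
        return ('ol', len(match.group(1)), match.group(2))
    return ('text', 0, line)
```

Because the list patterns require whitespace after the marker, a line that starts with `**bold**` or `*italic*` is treated as ordinary text and not as a list item.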


## Output Format

The output should be well-formed HTML files with the following characteristics:

- Proper HTML5 document structure with DOCTYPE, html, head, and body tags
- Correct nesting of HTML elements
- Preservation of text content and formatting
- Clean indentation for readability
- Proper escaping of HTML special characters in content
- CSS classes for styling hooks (optional but recommended)
- Valid HTML that passes basic validation checks
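
A minimal sketch of the document skeleton and content escaping described above, assuming the body lines have already been rendered. The helper name `wrap_document` is hypothetical; `html.escape` comes from the Python standard library.

```python
import html


def wrap_document(body_lines, title='Converted Markdown'):
    """Wrap already-rendered body lines in a minimal HTML5 skeleton."""
    lines = [
        '<!DOCTYPE html>',
        '<html>',
        '<head>',
        '    <meta charset="UTF-8">',
        f'    <title>{html.escape(title)}</title>',
        '</head>',
        '<body>',
    ]
    # Indent each body line one level inside <body>.
    lines.extend(f'    {line}' for line in body_lines)
    lines.extend(['</body>', '</html>'])
    return '\n'.join(lines)


# Text content is escaped before it is placed inside a tag.
paragraph = f'<p>{html.escape("1 < 2 & 3 > 2")}</p>'
page = wrap_document([paragraph])
```

Escaping the text before wrapping it in tags keeps characters such as `<` and `&` from being read as markup.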


## Examples

**Input file (example.md):**
```
# Main Title with *italic* and **bold** text

This is a paragraph with **bold text** and *italic text*.

## Subsection Header

- First item with [a link](https://example.com)
- Second item with **bold** content
  - Nested item with *italic* text
  - Another nested item
- Third item

### Numbered List Example

1. First numbered item
2. Second item with **formatting**
   1. Nested numbered item
   2. Another nested item with [link](https://test.com)
3. Final item

#### Deep Header

A paragraph with multiple [links](https://one.com) and [another link](https://two.com).
```

**Expected Output (example.html):**
```html
<!DOCTYPE html>
<html>
  <head>
      <meta charset="UTF-8">
      <title>Converted Markdown</title>
  </head>
  <body>
      <h1>Main Title with <em>italic</em> and <strong>bold</strong> text</h1>
      <p>This is a paragraph with <strong>bold text</strong> and <em>italic text</em>.</p>
      <h2>Subsection Header</h2>
      <ul>
          <li>First item with <a href="https://example.com">a link</a></li>
          <li>Second item with <strong>bold</strong> content
              <ul>
                  <li>Nested item with <em>italic</em> text</li>
                  <li>Another nested item</li>
              </ul>
          </li>
          <li>Third item</li>
      </ul>
      <h3>Numbered List Example</h3>
      <ol>
          <li>First numbered item</li>
          <li>Second item with <strong>formatting</strong>
              <ol>
                  <li>Nested numbered item</li>
                  <li>Another nested item with <a href="https://test.com">link</a></li>
              </ol>
          </li>
          <li>Final item</li>
      </ol>
      <h4>Deep Header</h4>
      <p>A paragraph with multiple <a href="https://one.com">links</a> and <a href="https://two.com">another link</a>.</p>
  </body>
</html>

```


# Requirements

## Explicit Requirements

1. **Modular Design**: Implement the parser as an independent module that can be imported and used in other projects
2. **File I/O Handling**: Support reading from input files and writing to output files with proper error handling
3. **Nested Structure Support**: Handle nested lists up to at least 4 levels deep with proper HTML nesting
4. **Inline Formatting**: Process bold, italic, and links within headers, list items, and paragraphs
5. **Mixed Content Processing**: Handle combinations of different Markdown elements in the same line
6. **HTML Escaping**: Properly escape HTML special characters (&, <, >, ", ') in text content
7. **Whitespace Management**: Preserve paragraph breaks and handle extra whitespace appropriately
8. **Error Recovery**: Continue processing when encountering malformed Markdown syntax
9. **Batch Processing**: Support processing multiple input files in a single operation
10. **Configuration Options**: Allow customization of output formatting and parsing behavior
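
One possible way to satisfy the nested structure requirement is to keep a stack of the lists that are currently open. The sketch below is an illustration under assumptions, not the required design: `render_list` is a hypothetical helper, and it expects list items that were already split into `(indent, list_type, text)` tuples with inline formatting applied.

```python
def render_list(items):
    """Render (indent, list_type, text) tuples as nested HTML list lines."""
    out = []
    stack = []  # open lists as (indent, list_type), innermost last

    def close_innermost():
        # Close the last item of the innermost list, then the list itself.
        _, closed_type = stack.pop()
        out.append('</li>')
        out.append(f'</{closed_type}>')

    for indent, list_type, text in items:
        # Leave every list that is indented deeper than this item.
        while stack and stack[-1][0] > indent:
            close_innermost()
        if stack and stack[-1] == (indent, list_type):
            # Sibling item in the same list: finish the previous item.
            out.append('</li>')
        else:
            if stack and stack[-1][0] == indent:
                # Same level but a different list type: start a new list.
                close_innermost()
            stack.append((indent, list_type))
            out.append(f'<{list_type}>')
        # The item stays open so a nested list can be placed inside it.
        out.append(f'<li>{text}')

    while stack:
        close_innermost()
    return out
```

Leaving each `<li>` open until the next sibling or dedent is what lets a nested `<ul>` or `<ol>` land inside its parent item, as in the expected output above.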


## Implicit Requirements

1. **Performance Optimization**: Efficiently handle large files without excessive memory usage
2. **Code Maintainability**: Write clean, well-documented code with clear separation of concerns
3. **Extensibility**: Design the parser architecture to easily accommodate new Markdown features
4. **Robust Error Handling**: Provide meaningful error messages and graceful degradation
5. **Standards Compliance**: Generate valid HTML5 that follows web standards
6. **Memory Management**: Avoid memory leaks when processing multiple files
7. **Thread Safety**: Ensure the parser can be used safely in multi-threaded environments
8. **Logging Capability**: Provide optional logging for debugging and monitoring purposes
9. **Cross-Platform Compatibility**: Work consistently across different operating systems
10. **Resource Cleanup**: Properly close files and clean up resources after processing


## Function Signature(s) and/or Class Definition(s)

```python
class MarkdownParser:
    """
    Advanced Markdown to HTML parser with support for nested structures
    and inline formatting combinations.
    """
    
    def __init__(self, config=None):
        """
        Initialize the parser with optional configuration.
        
        Args:
            config (dict, optional): Configuration options for parsing behavior
        """
        pass
    
    def parse_file(self, input_path, output_path=None):
        """
        Parse a single Markdown file and convert to HTML.
        
        Args:
            input_path (str): Path to the input Markdown file
            output_path (str, optional): Path for output HTML file
            
        Returns:
            str: Generated HTML content
            
        Raises:
            FileNotFoundError: If input file doesn't exist
            PermissionError: If unable to write output file
            MarkdownParseError: If parsing fails critically
        """
        pass
    
    def parse_batch(self, input_files, output_directory=None):
        """
        Process multiple Markdown files in batch.
        
        Args:
            input_files (list): List of input file paths
            output_directory (str, optional): Directory for output files
            
        Returns:
            dict: Mapping of input files to output results
        """
        pass
    
    def parse_string(self, markdown_content):
        """
        Parse Markdown content from string.
        
        Args:
            markdown_content (str): Markdown content to parse
            
        Returns:
            str: Generated HTML content
        """
        pass

class MarkdownParseError(Exception):
    """Custom exception for Markdown parsing errors."""
    pass

def convert_markdown_to_html(input_path, output_path=None, config=None):
    """
    Convenience function for simple Markdown to HTML conversion.
    
    Args:
        input_path (str): Path to input Markdown file
        output_path (str, optional): Path for output HTML file
        config (dict, optional): Parser configuration options
        
    Returns:
        str: Path to generated HTML file
    """
    pass
```


## Edge Cases

1. **Empty Files**: Handle completely empty Markdown files gracefully
2. **Malformed Nesting**: Process lists with inconsistent indentation levels
3. **Unclosed Formatting**: Handle bold/italic markers that are not properly closed
4. **Invalid Links**: Process malformed link syntax without breaking the parser
5. **Mixed Line Endings**: Handle files with different line ending conventions (Windows/Unix)
6. **Unicode Content**: Properly process and preserve Unicode characters and emojis
7. **Very Deep Nesting**: Handle excessively nested lists (beyond normal use cases)
8. **Conflicting Markers**: Process overlapping or nested formatting markers
9. **Binary Content**: Detect and reject binary files gracefully
10. **Large Files**: Handle files that approach memory limits efficiently
11. **Special Characters in URLs**: Process links containing special characters correctly
12. **Header Level Overflow**: Handle more than 6 hash marks appropriately
13. **List Marker Variations**: Process different list marker styles consistently
14. **Whitespace-Only Lines**: Handle lines containing only whitespace characters
15. **File Permission Issues**: Gracefully handle read/write permission problems
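
To make the unclosed-formatting and special-character cases more concrete, here is a small sketch of inline rendering in which unmatched markers are left as literal text. The function `render_inline` and the three patterns are hypothetical, and this sketch deliberately does not support formatting markers nested inside one another.

```python
import html
import re

# Hypothetical inline patterns; each requires at least one character
# between the opening and closing markers.
LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)\s]+)\)')
BOLD_RE = re.compile(r'\*\*([^*\n]+?)\*\*')
ITALIC_RE = re.compile(r'(?<!\*)\*([^*\n]+?)\*(?!\*)')


def render_inline(text):
    """Convert links, bold, and italic markers in one line of text."""
    # Escape first so that user text can never inject markup.
    text = html.escape(text, quote=True)
    text = LINK_RE.sub(r'<a href="\2">\1</a>', text)
    # Bold runs before italic so that ** is not consumed as two * markers.
    text = BOLD_RE.sub(r'<strong>\1</strong>', text)
    text = ITALIC_RE.sub(r'<em>\1</em>', text)
    return text
```

An opening `**` with no closing partner matches neither pattern, so the asterisks pass through to the output unchanged and the rest of the line is still processed.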


## Constraints

1. **Maximum File Size**: Individual input files should not exceed 100 MB
2. **Nesting Depth Limit**: Support list nesting up to 10 levels deep
3. **Memory Usage**: Peak memory usage should not exceed 2x the largest input file size
4. **Processing Time**: Complete parsing of a 1 MB file within 5 seconds on standard hardware
5. **Output File Size**: Generated HTML should not be more than 5x the input file size
6. **Error Tolerance**: Parser should handle at least 95% of malformed input gracefully
7. **Character Encoding**: Support UTF-8 encoding for input and output files
8. **Line Length**: Handle individual lines up to 10,000 characters
9. **Concurrent Processing**: Support processing up to 10 files simultaneously
10. **Python Version**: Compatible with Python 3.7 and higher
11. **External Dependencies**: Minimize external dependencies; prefer standard library
12. **File Path Length**: Support file paths up to 260 characters on Windows
13. **Link URL Length**: Handle URLs up to 2,048 characters in length
14. **Header Text Length**: Support header text up to 1,000 characters
15. **List Item Limit**: Handle lists with up to 10,000 items efficiently
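
The concurrency constraint could be approached with a bounded thread pool. The sketch below assumes a caller-supplied `convert` callable that takes one path. `run_batch` is a hypothetical helper and is not part of the required signatures.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, convert, max_workers=10):
    """Apply convert(path) to every path, collecting results and errors."""
    results = {}
    # The pool never runs more than max_workers conversions at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going when one file fails
                results[path] = exc
    return results
```

Storing the exception instead of re-raising it means one unreadable file does not abort the remaining conversions.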

In [None]:
# code

"""
Advanced Markdown to HTML Parser Module

This module provides a sophisticated Markdown to HTML conversion system that
handles complex nested structures, multiple file processing, and provides
extensible parsing capabilities.
"""

import os
import re
import logging
from pathlib import Path
from typing import Dict, List, Optional, Union


class MarkdownParseError(Exception):
    """Custom exception for Markdown parsing errors."""
    pass


class MarkdownParser:
    """
    Advanced Markdown to HTML parser with support for nested structures
    and inline formatting combinations.
    """

    def __init__(self, config: Optional[Dict] = None):
        """
        Initialize the parser with optional configuration.

        Args:
            config (dict, optional): Configuration options for parsing behavior
                - max_file_size: Maximum file size in bytes (default: 100MB)
                - max_nesting_depth: Maximum list nesting depth (default: 10)
                - enable_logging: Enable debug logging (default: False)
                - indent_size: HTML indentation size (default: 4)
        """
        self.config = config if config else {}
        self.max_file_size = self.config.get('max_file_size', 100 * 1024 * 1024)
        self.max_nesting_depth = self.config.get('max_nesting_depth', 10)
        self.indent_size = self.config.get('indent_size', 4)

        if self.config.get('enable_logging', False):
            logging.basicConfig(level=logging.DEBUG)

        self.logger = logging.getLogger(__name__)

    def parse_file(self, input_path: str, output_path: Optional[str] = None) -> str:
        """
        Parse a single Markdown file and convert to HTML.

        Args:
            input_path (str): Path to the input Markdown file
            output_path (str, optional): Path for output HTML file

        Returns:
            str: Generated HTML content

        Raises:
            FileNotFoundError: If input file doesn't exist
            PermissionError: If unable to write output file
            MarkdownParseError: If parsing fails critically
        """
        # Validate input file exists
        if not os.path.isfile(input_path):
            raise FileNotFoundError(f"Input file '{input_path}' not found")

        # Check file size constraint
        file_size = os.path.getsize(input_path)
        if file_size > self.max_file_size:
            raise MarkdownParseError(
                f"File size {file_size} exceeds maximum allowed size "
                f"{self.max_file_size}"
            )

        # Check if file is binary
        if self._is_binary_file(input_path):
            raise MarkdownParseError(
                f"File '{input_path}' appears to be binary, not text"
            )

        try:
            # Read as UTF-8; errors='replace' turns undecodable bytes into
            # U+FFFD, so UnicodeDecodeError cannot be raised here
            with open(input_path, 'r', encoding='utf-8', errors='replace') as file:
                markdown_content = file.read()
        except IOError as e:
            raise PermissionError(f"Cannot read file '{input_path}': {e}") from e

        # Parse the content
        html_content = self.parse_string(markdown_content)

        # Write output if output path is specified
        if output_path:
            try:
                # Create output directory if it has a directory component
                output_dir = os.path.dirname(output_path)
                if output_dir:  # Only create directory if there's a directory part
                    os.makedirs(output_dir, exist_ok=True)

                with open(output_path, 'w', encoding='utf-8') as file:
                    file.write(html_content)
            except IOError as e:
                raise PermissionError(f"Cannot write to '{output_path}': {e}") from e

        self.logger.info(f"Successfully parsed '{input_path}'")
        return html_content

    def parse_batch(self, input_files: List[str],
                   output_directory: Optional[str] = None) -> Dict[str, Union[str, Exception]]:
        """
        Process multiple Markdown files in batch.

        Args:
            input_files (list): List of input file paths
            output_directory (str, optional): Directory for output files

        Returns:
            dict: Mapping of each input file to its generated HTML, or to
                the exception raised while processing it
        """
        results = {}

        # Create output directory if specified
        if output_directory:
            os.makedirs(output_directory, exist_ok=True)

        for input_file in input_files:
            try:
                # Determine output path
                if output_directory:
                    output_filename = f"{Path(input_file).stem}.html"
                    output_path = os.path.join(output_directory, output_filename)
                else:
                    # Place output file alongside input file
                    input_path = Path(input_file)
                    output_path = str(input_path.parent / f"{input_path.stem}.html")

                html_content = self.parse_file(input_file, output_path)
                results[input_file] = html_content

            except (FileNotFoundError, PermissionError, MarkdownParseError) as e:
                results[input_file] = e
                self.logger.error(f"Error processing '{input_file}': {e}")
            except Exception as e:
                results[input_file] = MarkdownParseError(f"Unexpected error: {e}")
                self.logger.error(f"Unexpected error processing '{input_file}': {e}")

        return results

    def parse_string(self, markdown_content: str) -> str:
        """
        Parse Markdown content from string.

        Args:
            markdown_content (str): Markdown content to parse

        Returns:
            str: Generated HTML content
        """
        if not markdown_content.strip():
            return self._generate_empty_html()

        lines = self._normalize_line_endings(markdown_content).split('\n')
        html_lines = []

        # Add HTML document structure
        html_lines.extend([
            '<!DOCTYPE html>',
            '<html>',
            '<head>',
            '    <meta charset="UTF-8">',
            '    <title>Converted Markdown</title>',
            '</head>',
            '<body>'
        ])

        # Parse content
        body_content = self._parse_content(lines)
        html_lines.extend(body_content)

        # Close HTML structure
        html_lines.extend([
            '</body>',
            '</html>'
        ])

        return '\n'.join(html_lines)

    def _parse_content(self, lines: List[str]) -> List[str]:
        """
        Parse the main content of the markdown.

        Args:
            lines (list): List of markdown lines

        Returns:
            list: List of HTML lines with proper indentation
        """
        html_lines = []
        list_stack = []
        in_paragraph = False
        paragraph_lines = []

        i = 0
        while i < len(lines):
            line = lines[i]
            line = line.rstrip()

            # Handle empty lines
            if not line:
                if in_paragraph:
                    self._close_paragraph(paragraph_lines, html_lines)
                    in_paragraph = False
                    paragraph_lines = []
                i += 1
                continue

            # Check for headers
            header_match = re.match(r'^(#{1,6})\s+(.*)', line)
            if header_match:
                if in_paragraph:
                    self._close_paragraph(paragraph_lines, html_lines)
                    in_paragraph = False
                    paragraph_lines = []

                self._close_all_lists(list_stack, html_lines)
                level = len(header_match.group(1))
                content = self._parse_inline_formatting(header_match.group(2))
                html_lines.append(f'    <h{level}>{content}</h{level}>')
                i += 1
                continue

            # Check for lists (unordered and ordered)
            list_match = self._match_list_item(line)
            if list_match:
                if in_paragraph:
                    self._close_paragraph(paragraph_lines, html_lines)
                    in_paragraph = False
                    paragraph_lines = []

                indent_level, list_type, content = list_match

                # Handle nested lists
                i = self._handle_list_item(
                    lines, i, indent_level, list_type, content,
                    list_stack, html_lines
                )
                continue

            # Regular paragraph text
            if list_stack:
                self._close_all_lists(list_stack, html_lines)

            if not in_paragraph:
                in_paragraph = True
                paragraph_lines = []

            paragraph_lines.append(line)
            i += 1

        # Close any remaining open elements
        if in_paragraph:
            self._close_paragraph(paragraph_lines, html_lines)

        self._close_all_lists(list_stack, html_lines)

        return html_lines

    def _match_list_item(self, line: str) -> Optional[tuple]:
        """
        Match and extract list item information.

        Args:
            line (str): Line to check for list item pattern

        Returns:
            tuple: (indent_level, list_type, content) or None
        """
        # Calculate indentation
        indent_level = len(line) - len(line.lstrip())
        stripped_line = line.lstrip()

        # Check for unordered list
        ul_match = re.match(r'^[-*]\s+(.*)', stripped_line)
        if ul_match:
            return (indent_level, 'ul', ul_match.group(1))

        # Check for ordered list
        ol_match = re.match(r'^\d+\.\s+(.*)', stripped_line)
        if ol_match:
            return (indent_level, 'ol', ol_match.group(1))

        return None

    def _handle_list_item(self, lines: List[str], start_index: int,
                         indent_level: int, list_type: str, content: str,
                         list_stack: List[tuple], html_lines: List[str]) -> int:
        """
        Handle a list item and its potential nested content.

        Args:
            lines (list): All markdown lines
            start_index (int): Current line index
            indent_level (int): Indentation level of the list item
            list_type (str): Type of list ('ul' or 'ol')
            content (str): Content of the list item
            list_stack (list): Stack tracking open lists
            html_lines (list): HTML output lines

        Returns:
            int: Next line index to process
        """
        # Adjust list stack based on indentation
        self._adjust_list_stack(indent_level, list_type, list_stack, html_lines)

        # Parse the list item content
        parsed_content = self._parse_inline_formatting(content)

        # Check if this item has nested content
        next_index = start_index + 1
        nested_content = []

        while next_index < len(lines):
            next_line = lines[next_index]
            if not next_line.strip():
                next_index += 1
                continue

            next_indent = len(next_line) - len(next_line.lstrip())

            # If next line is more indented, it's nested content
            if next_indent > indent_level:
                nested_content.append(next_line)
                next_index += 1
            else:
                break

        # Generate list item HTML
        current_indent = '    ' * (len(list_stack) + 1)

        if nested_content:
            html_lines.append(f'{current_indent}<li>{parsed_content}')

            # Parse nested content
            nested_html = self._parse_content(nested_content)

            # Add nested HTML with proper indentation
            for nested_line in nested_html:
                if nested_line.strip():
                    html_lines.append(f'    {nested_line}')

            html_lines.append(f'{current_indent}</li>')
        else:
            html_lines.append(f'{current_indent}<li>{parsed_content}</li>')

        return next_index

    def _adjust_list_stack(self, indent_level: int, list_type: str,
                          list_stack: List[tuple], html_lines: List[str]):
        """
        Adjust the list stack based on current indentation level.

        Args:
            indent_level (int): Current indentation level
            list_type (str): Type of list ('ul' or 'ol')
            list_stack (list): Stack of open lists
            html_lines (list): HTML output lines
        """
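        # Close lists that are more deeply indented than the current item
        while list_stack and list_stack[-1][0] > indent_level:
            _, old_type = list_stack.pop()
            current_indent = '    ' * (len(list_stack) + 1)
            html_lines.append(f'{current_indent}</{old_type}>')

        # A list at the same indentation but of a different type ends here
        if (list_stack and
                list_stack[-1][0] == indent_level and
                list_stack[-1][1] != list_type):
            _, old_type = list_stack.pop()
            current_indent = '    ' * (len(list_stack) + 1)
            html_lines.append(f'{current_indent}</{old_type}>')

        # Open a new list only when no list is open at this indentation;
        # sibling items reuse the list that is already open
        if not list_stack or list_stack[-1][0] < indent_level:
            list_stack.append((indent_level, list_type))
            current_indent = '    ' * len(list_stack)
            html_lines.append(f'{current_indent}<{list_type}>')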

    def _close_all_lists(self, list_stack: List[tuple], html_lines: List[str]):
        """
        Close all open lists in the stack.

        Args:
            list_stack (list): Stack of open lists
            html_lines (list): HTML output lines
        """
        while list_stack:
            _, list_type = list_stack.pop()
            current_indent = '    ' * (len(list_stack) + 1)
            html_lines.append(f'{current_indent}</{list_type}>')

    def _close_paragraph(self, paragraph_lines: List[str], html_lines: List[str]):
        """
        Close a paragraph and add it to HTML output.

        Args:
            paragraph_lines (list): Lines of the paragraph
            html_lines (list): HTML output lines
        """
        if paragraph_lines:
            content = ' '.join(paragraph_lines)
            parsed_content = self._parse_inline_formatting(content)
            html_lines.append(f'    <p>{parsed_content}</p>')

    def _parse_inline_formatting(self, text: str) -> str:
        """
        Parse and convert inline Markdown syntax to HTML.

        Args:
            text (str): Text to parse for inline formatting

        Returns:
            str: Text with inline formatting converted to HTML
        """
        # Escape HTML special characters first
        text = self._escape_html(text)

        # Handle links first (they can contain other formatting)
        text = self._parse_links(text)

        # Handle bold text (must come before italic to avoid conflicts)
        text = self._parse_bold(text)

        # Handle italic text
        text = self._parse_italic(text)

        return text

    def _parse_links(self, text: str) -> str:
        """
        Parse markdown links and convert to HTML.

        Args:
            text (str): Text containing potential links

        Returns:
            str: Text with links converted to HTML
        """
        # Handle malformed links gracefully
        def replace_link(match):
            link_text = match.group(1)
            url = match.group(2)

            # Basic URL validation and length check
            if len(url) > 2048:
                return match.group(0)  # Return original if URL too long

            # Parse inline formatting within link text
            # Bold must be handled before italic, as in _parse_inline_formatting
            formatted_text = self._parse_italic(self._parse_bold(link_text))

            return f'<a href="{url}">{formatted_text}</a>'

        # Match [text](url) pattern with error recovery
        try:
            text = re.sub(r'\[([^\]]*)\]\(([^)]*)\)', replace_link, text)
        except Exception:
            # If regex fails, return original text
            pass

        return text

    def _parse_bold(self, text: str) -> str:
        """
        Parse bold formatting and convert to HTML.

        Args:
            text (str): Text containing potential bold formatting

        Returns:
            str: Text with bold formatting converted to HTML
        """
        # Handle nested and malformed bold markers gracefully
        def replace_bold(match):
            content = match.group(1)
            if content.strip():  # Only convert non-empty content
                return f'<strong>{content}</strong>'
            return match.group(0)  # Return original if empty

        try:
            # Match **text** but handle unclosed markers gracefully
            text = re.sub(r'\*\*([^*\n]*?)\*\*', replace_bold, text)
        except Exception:
            pass

        return text

    def _parse_italic(self, text: str) -> str:
        """
        Parse italic formatting and convert to HTML.

        Args:
            text (str): Text containing potential italic formatting

        Returns:
            str: Text with italic formatting converted to HTML
        """
        def replace_italic(match):
            content = match.group(1)
            if content.strip():  # Only convert non-empty content
                return f'<em>{content}</em>'
            return match.group(0)  # Return original if empty

        try:
            # Match *text* but avoid matching ** patterns and handle unclosed
            text = re.sub(r'(?<!\*)\*([^*\n]*?)\*(?!\*)', replace_italic, text)
        except Exception:
            pass

        return text

    def _escape_html(self, text: str) -> str:
        """
        Escape HTML special characters in text.

        Args:
            text (str): Text to escape

        Returns:
            str: Text with HTML characters escaped
        """
        html_escape_table = {
            '&': '&amp;',
            '<': '&lt;',
            '>': '&gt;',
            '"': '&quot;',
            "'": '&#x27;',
        }

        for char, escape in html_escape_table.items():
            text = text.replace(char, escape)

        return text

    def _normalize_line_endings(self, content: str) -> str:
        """
        Normalize different line ending conventions.

        Args:
            content (str): Content with potentially mixed line endings

        Returns:
            str: Content with normalized line endings
        """
        # Convert Windows and Mac line endings to Unix
        content = content.replace('\r\n', '\n')
        content = content.replace('\r', '\n')
        return content

    def _is_binary_file(self, file_path: str) -> bool:
        """
        Check if a file is binary by examining its content.

        Args:
            file_path (str): Path to the file to check

        Returns:
            bool: True if file appears to be binary
        """
        try:
            with open(file_path, 'rb') as file:
                chunk = file.read(1024)  # Read first 1KB

            # Check for null bytes (common in binary files)
            if b'\x00' in chunk:
                return True

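            # Try to decode as UTF-8
            try:
                chunk.decode('utf-8')
                return False
            except UnicodeDecodeError as e:
                # A multi-byte character cut off at the end of the 1 KB
                # sample is not evidence of binary content
                truncated = len(chunk) == 1024 and e.start >= len(chunk) - 3
                return not truncated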

        except (IOError, OSError):
            return True  # Assume binary if can't read

    def _generate_empty_html(self) -> str:
        """
        Generate HTML structure for empty input.

        Returns:
            str: Basic HTML structure
        """
        return '\n'.join([
            '<!DOCTYPE html>',
            '<html>',
            '<head>',
            '    <meta charset="UTF-8">',
            '    <title>Converted Markdown</title>',
            '</head>',
            '<body>',
            '</body>',
            '</html>'
        ])


def convert_markdown_to_html(input_path: str, output_path: Optional[str] = None,
                           config: Optional[Dict] = None) -> str:
    """
    Convenience function for simple Markdown to HTML conversion.

    Args:
        input_path (str): Path to input Markdown file
        output_path (str, optional): Path for output HTML file
        config (dict, optional): Parser configuration options

    Returns:
        str: Path to generated HTML file

    Raises:
        FileNotFoundError: If input file doesn't exist
        PermissionError: If unable to read/write files
        MarkdownParseError: If parsing fails
    """
    parser = MarkdownParser(config)

    # Generate output path if not provided
    if output_path is None:
        input_file = Path(input_path)
        output_path = str(input_file.parent / f"{input_file.stem}.html")

    # Parse the file
    parser.parse_file(input_path, output_path)

    return output_path


In [None]:
# tests
"""Unit tests for the MarkdownParser and conversion utilities."""

import unittest
import tempfile
import os
from main import MarkdownParser, MarkdownParseError, convert_markdown_to_html


class TestMarkdownParser(unittest.TestCase):
    """Unit tests for the MarkdownParser class."""

    def setUp(self):
        """Initialize parser instance for test cases."""
        self.parser = MarkdownParser()

    def test_empty_string(self):
        """Test that an empty string returns a valid HTML structure."""
        html = self.parser.parse_string("")
        self.assertIn("<body>", html)
        self.assertIn("</body>", html)

    def test_single_header(self):
        """Test parsing a single level-1 header."""
        html = self.parser.parse_string("# Hello World")
        self.assertIn("<h1>Hello World</h1>", html)

    def test_multiple_headers(self):
        """Test parsing headers of various levels."""
        html = self.parser.parse_string("# H1\n## H2\n### H3")
        self.assertIn("<h1>H1</h1>", html)
        self.assertIn("<h2>H2</h2>", html)
        self.assertIn("<h3>H3</h3>", html)

    def test_bold_formatting(self):
        """Test parsing bold syntax."""
        html = self.parser.parse_string("This is **bold** text.")
        self.assertIn("<strong>bold</strong>", html)

    def test_italic_formatting(self):
        """Test parsing italic syntax."""
        html = self.parser.parse_string("This is *italic* text.")
        self.assertIn("<em>italic</em>", html)

    def test_link_formatting(self):
        """Test parsing a basic markdown link."""
        html = self.parser.parse_string("Visit [Google](https://google.com)")
        self.assertIn('<a href="https://google.com">Google</a>', html)

    def test_bold_inside_link(self):
        """Test parsing bold content inside a link."""
        html = self.parser.parse_string("[**Bold Link**](https://bold.com)")
        self.assertIn(
            '<a href="https://bold.com"><strong>Bold Link</strong></a>', html)

    def test_unordered_list(self):
        """Test rendering of a basic unordered list."""
        md = "- Item 1\n- Item 2"
        html = self.parser.parse_string(md)
        self.assertIn("<ul>", html)
        self.assertIn("<li>Item 1</li>", html)

    def test_ordered_list(self):
        """Test rendering of an ordered list."""
        md = "1. First\n2. Second"
        html = self.parser.parse_string(md)
        self.assertIn("<ol>", html)
        self.assertIn("<li>First</li>", html)

    def test_nested_list(self):
        """Test rendering of a nested unordered list."""
        md = "- Item 1\n  - Subitem"
        html = self.parser.parse_string(md)
        self.assertIn("<ul>", html)
        self.assertIn("<li>Subitem</li>", html)

    def test_malformed_list_recovery(self):
        """Test how parser handles malformed list markers."""
        md = "- Item 1\n *Item with malformed marker"
        html = self.parser.parse_string(md)
        self.assertIn("<li>Item 1", html)
        self.assertIn("<p> *Item with malformed marker</p>", html)
        self.assertIn("</li>", html)

    def test_inline_formatting_in_paragraph(self):
        """Test combined inline formatting within a paragraph."""
        md = "Hello *world* and **Python** [link](https://test.com)"
        html = self.parser.parse_string(md)
        self.assertIn("<em>world</em>", html)
        self.assertIn("<strong>Python</strong>", html)
        self.assertIn('<a href="https://test.com">link</a>', html)

    def test_html_escaping(self):
        """Test that HTML-sensitive characters are escaped."""
        md = "5 < 10 & 3 > 2"
        html = self.parser.parse_string(md)
        self.assertIn("5 &lt; 10 &amp; 3 &gt; 2", html)

    def test_malformed_link(self):
        """Test that malformed links are not converted."""
        md = "[Bad Link](not closed"
        html = self.parser.parse_string(md)
        self.assertIn("[Bad Link](not closed", html)

    def test_unclosed_bold(self):
        """Test that unclosed bold formatting is ignored."""
        md = "**Bold"
        html = self.parser.parse_string(md)
        self.assertIn("**Bold", html)

    def test_unicode_content(self):
        """Test rendering of Unicode content."""
        md = "# Unicode 😊 🚀"
        html = self.parser.parse_string(md)
        self.assertIn("Unicode 😊 🚀", html)

    def test_line_ending_normalization(self):
        """Test that different line endings are normalized."""
        md = "# Heading\r\n\r\nParagraph\rLine"
        html = self.parser.parse_string(md)
        self.assertIn("<h1>Heading</h1>", html)

    def test_parse_batch_with_valid_and_invalid_file(self):
        """Test batch parsing with one valid file and one invalid path."""
        with tempfile.TemporaryDirectory() as tmpdir:
            good_file = os.path.join(tmpdir, "good.md")
            bad_file = os.path.join(tmpdir, "bad.md")
            with open(good_file, "w", encoding="utf-8") as f:
                f.write("# Good")
            os.mkdir(bad_file)
            result = self.parser.parse_batch([good_file, bad_file])
            self.assertIn(good_file, result)
            self.assertIn(bad_file, result)
            self.assertIsInstance(result[bad_file], Exception)

    def test_parse_file_with_binary_detection(self):
        """Test that binary files raise a MarkdownParseError."""
        with tempfile.NamedTemporaryFile(delete=False) as f:
            f.write(b'\x00\x01\x02')
        # The file is closed on leaving the with block; remove it even if
        # the assertion fails.
        try:
            with self.assertRaises(MarkdownParseError):
                self.parser.parse_file(f.name)
        finally:
            os.remove(f.name)

    def test_file_too_large(self):
        """Test that oversized files raise a MarkdownParseError."""
        parser = MarkdownParser(config={"max_file_size": 10})
        with tempfile.NamedTemporaryFile(delete=False, mode="w") as f:
            f.write("A" * 100)
        # The file is closed on leaving the with block; remove it even if
        # the assertion fails.
        try:
            with self.assertRaises(MarkdownParseError):
                parser.parse_file(f.name)
        finally:
            os.remove(f.name)

    def test_convert_markdown_to_html_output_file_created(self):
        """Test that the output is created and contains expected content."""
        with tempfile.TemporaryDirectory() as tmpdir:
            md_path = os.path.join(tmpdir, "test.md")
            html_path = os.path.join(tmpdir, "test.html")
            with open(md_path, "w", encoding="utf-8") as f:
                f.write("# Title")
            result = convert_markdown_to_html(md_path, html_path)
            self.assertTrue(os.path.exists(result))
            with open(result, "r", encoding="utf-8") as f:
                self.assertIn("<h1>Title</h1>", f.read())

    def test_custom_indent_size(self):
        """Test that custom indentation settings are applied."""
        parser = MarkdownParser(config={"indent_size": 2})
        html = parser.parse_string("# Title")
        self.assertIn("<h1>Title</h1>", html)

    def test_header_with_inline_formatting(self):
        """Test that headers support inline bold and italic formatting."""
        html = self.parser.parse_string("## Header with *italic* and **bold**")
        self.assertIn("<em>italic</em>", html)
        self.assertIn("<strong>bold</strong>", html)

    def test_large_number_of_list_items(self):
        """Test that every item in a 1000-item list is rendered."""
        md = "\n".join(f"- Item {i}" for i in range(1000))
        html = self.parser.parse_string(md)
        self.assertEqual(html.count("<li>"), 1000)


# Model Breaking Proof

#### Model Breaking Task URL: https://rlhf-v3.turing.com/prompt/e2b65294-67ba-4672-bf2a-8d24242e41be?origin=https%3A%2F%2Flabeling-ms.turing.com&redirect_url=https%3A%2F%2Flabeling-ms.turing.com%2Fconversations%2F74977%2Fview

#### Model code:

```python
import os
import re
from pathlib import Path

class MarkdownParseError(Exception):
    """Custom exception for Markdown parsing errors."""
    pass

class MarkdownParser:
    def __init__(self, config=None):
        """
        Initialize the parser with optional configuration.
        
        Args:
            config (dict, optional): Configuration options for parsing behavior
        """
        self.config = config if config else {}
    
    def parse_file(self, input_path, output_path=None):
        """
        Parse a single Markdown file and convert to HTML.
        
        Args:
            input_path (str): Path to the input Markdown file
            output_path (str, optional): Path for output HTML file
            
        Returns:
            str: Generated HTML content
            
        Raises:
            FileNotFoundError: If input file doesn't exist
            PermissionError: If unable to write output file
            MarkdownParseError: If parsing fails critically
        """
        # Check if file exists
        if not os.path.isfile(input_path):
            raise FileNotFoundError(f"File {input_path} not found.")
        
        # Read the content
        with open(input_path, 'r', encoding='utf-8') as file:
            markdown_content = file.read()
        
        html_content = self.parse_string(markdown_content)
        
        # Write output if output path is specified
        if output_path:
            with open(output_path, 'w', encoding='utf-8') as file:
                file.write(html_content)
        
        return html_content
    
    def parse_batch(self, input_files, output_directory=None):
        """
        Process multiple Markdown files in batch.
        
        Args:
            input_files (list): List of input file paths
            output_directory (str, optional): Directory for output files
            
        Returns:
            dict: Mapping of input files to output results
        """
        results = {}
        for input_file in input_files:
            try:
                if output_directory:
                    output_path = os.path.join(output_directory, f"{Path(input_file).stem}.html")
                else:
                    output_path = f"{Path(input_file).stem}.html"
                
                results[input_file] = self.parse_file(input_file, output_path)
            except Exception as ex:
                results[input_file] = str(ex)
        
        return results
    
    def parse_string(self, markdown_content):
        """
        Parse Markdown content from string.
        
        Args:
            markdown_content (str): Markdown content to parse
            
        Returns:
            str: Generated HTML content
        """
        lines = markdown_content.split('\n')
        html_content = []
        
        # Add standard HTML document structure
        html_content.append('<!DOCTYPE html>')
        html_content.append('<html>')
        html_content.append('<head>')
        html_content.append('<meta charset="UTF-8">')
        html_content.append('<title>Converted Markdown</title>')
        html_content.append('</head>')
        html_content.append('<body>')
        
        # Initialize list tracking
        list_stack = []
        
        for line in lines:
            if not line.strip():
                continue  # Skip empty lines
                # <Issue>: Empty lines are completely ignored instead of being converted to proper paragraph breaks in HTML. This violates the whitespace management requirement to "preserve paragraph breaks and handle extra whitespace appropriately."
            
            # Headers
            header_match = re.match(r'^(#{1,6})\s+(.*)', line)
            if header_match:
                tag = f"h{len(header_match[1])}"
                content = self._parse_inline_formatting(header_match[2])
                html_content.append(f"<{tag}>{content}</{tag}>")
                continue
            
            # Unordered List
            ul_match = re.match(r'^[\*\-]\s+(.*)', line)
            if ul_match:
                self._handle_list(ul_match[1], list_stack, html_content, list_type='ul')
                continue
            
            # Ordered List
            ol_match = re.match(r'^\d+\.\s+(.*)', line)
            if ol_match:
                self._handle_list(ol_match[1], list_stack, html_content, list_type='ol')
                continue
            
            # Paragraphs (Default)
            html_content.append(f"<p>{self._parse_inline_formatting(line)}</p>")
        
        # Close any open list stacks
        self._close_list_stack(list_stack, html_content)
        
        # Finalize HTML structure
        html_content.append('</body>')
        html_content.append('</html>')
        
        return '\n'.join(html_content)
    
    def _parse_inline_formatting(self, text):
        """Helper method to parse and convert inline Markdown syntax."""
        text = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', text)
        text = re.sub(r'\*(.+?)\*', r'<em>\1</em>', text)
        text = re.sub(r'$(.+?)$$(.+?)$', r'<a href="\2">\1</a>', text) # <Issue>: Uses $ delimiters instead of the required [text](url) markdown syntax. This means no links will be parsed or converted, completely failing one of the core supported elements specified in the problem.
        return text # <Issue>: Special HTML characters (&, <, >, ", ') in content are not escaped, which can break HTML structure and create security vulnerabilities. This directly violates the explicit requirement for "proper escaping of HTML special characters in content."
    
    def _handle_list(self, item, list_stack, html_content, list_type='ul'):
        """Handle creation and management of nested HTML list structures."""
        current_indent = len(item) - len(item.lstrip()) # <Issue>: Calculates indentation on already-processed content after regex has stripped list markers, so current_indent is always 0. This makes detecting nested lists impossible, violating the requirement for nested structure support up to 4+ levels.
        list_tag = list_type
        
        if not list_stack or current_indent > list_stack[-1]:
            html_content.append(f"<{list_tag}>")
            list_stack.append(current_indent)
        
        while list_stack and current_indent < list_stack[-1]:
            html_content.append("</li>")
            html_content.append(f"</{list_tag}>")
            list_stack.pop()
        
        html_content.append(f"<li>{self._parse_inline_formatting(item.strip())}")
        # Prepare for potentially nested list:
        if not list_stack or current_indent < list_stack[-1]:
            html_content.append("</li>")
    
    def _close_list_stack(self, list_stack, html_content):
        """Close any remaining open HTML list elements."""
        while list_stack:
            html_content.append("</li>")
            html_content.append(f"</ul>") # <Issue>: Always closes lists with </ul> tags regardless of whether they were ordered (<ol>) or unordered lists. This generates invalid HTML that fails basic validation, violating the standards compliance requirement. It also fails to close every open <li> tag, closing only the last one.
            list_stack.pop()

def convert_markdown_to_html(input_path, output_path=None, config=None):
    """
    Convenience function for simple Markdown to HTML conversion.
    
    Args:
        input_path (str): Path to input Markdown file
        output_path (str, optional): Path for output HTML file
        config (dict, optional): Parser configuration options
        
    Returns:
        str: Path to generated HTML file
    """
    parser = MarkdownParser(config)
    html_content = parser.parse_file(input_path, output_path)
    return html_content

# Example usage
# result = convert_markdown_to_html('example.md', 'example.html')
```
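
For comparison, the inline-formatting and list-indentation issues flagged in the model code above could be addressed along the following lines. This is a minimal sketch, not the reference solution: `render_inline` and `list_indent` are hypothetical helper names, and the regular expressions are assumptions that cover only simple cases (no nested brackets in link text, no whitespace or parentheses in URLs, no asterisks inside URLs).

```python
import html
import re
from typing import Optional

# Hypothetical patterns for illustration; they are not taken from the model code.
_LINK_RE = re.compile(r'\[([^\]]+)\]\(([^)\s]+)\)')
_BOLD_RE = re.compile(r'\*\*(.+?)\*\*')
_ITALIC_RE = re.compile(r'\*(.+?)\*')
_LIST_RE = re.compile(r'^(\s*)(?:[*-]|\d+\.)\s+(.*)$')


def render_inline(text: str) -> str:
    """Escape HTML special characters, then convert links, bold, and italic."""
    # Escape first so the tags inserted below are not escaped themselves.
    text = html.escape(text, quote=True)
    # Links use the [text](url) syntax; unclosed links are left untouched.
    text = _LINK_RE.sub(r'<a href="\2">\1</a>', text)
    # Bold must run before italic so that ** is not consumed as two single *.
    text = _BOLD_RE.sub(r'<strong>\1</strong>', text)
    text = _ITALIC_RE.sub(r'<em>\1</em>', text)
    return text


def list_indent(line: str) -> Optional[int]:
    """Return the leading-whitespace width of a list line, or None otherwise.

    The indentation is measured on the raw line, before the list marker is
    stripped, so nested items can be told apart from top-level ones.
    """
    match = _LIST_RE.match(line)
    if match is None:
        return None
    # Assumption: a tab counts as four columns.
    return len(match.group(1).expandtabs(4))
```

Escaping before substitution keeps `<`, `>`, and `&` in the source text from reaching the output unescaped, while measuring indentation on the unstripped line is what makes nesting detectable at all.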