# Math Formula Detection and Rendering System

This notebook demonstrates how to build a system for detecting and rendering LaTeX math formulas within Markdown content, covering inline math, display blocks, and math inside tables.

## Problem Statement

Math formulas in streaming content are hard to handle because several cases interact:
- **Inline math**: `\( a + b = b + a \)` mixed within text
- **Display math**: `$$\frac{a}{b} = c$$` as separate blocks  
- **Math in tables**: Formulas within markdown table cells
- **Incomplete streaming**: Math expressions split across chunks
- **Safety concerns**: Need to avoid crashes during partial renders

## Approach: Math Block Extraction

Similar to how HTML blocks are extracted and rendered separately, we'll:
1. **Extract math formulas** as separate blocks during parsing
2. **Replace with placeholders** in markdown content  
3. **Render separately** using MathRenderer components
4. **Handle streaming safely** by buffering incomplete expressions
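
The extract-and-replace steps above can be sketched as follows. This is a minimal illustration, not the notebook's final parser: the placeholder format `@@MATH{index}@@`, the simplified `MATH_RE` pattern (no single-dollar inline math), and the helper names `extract_math` and `restore_math` are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical placeholder token; any string that markdown leaves untouched would do.
PLACEHOLDER = "@@MATH{index}@@"

# Simplified pattern: $$...$$, \[...\], and \(...\) only.
MATH_RE = re.compile(r'\$\$(.+?)\$\$|\\\[(.+?)\\\]|\\\((.+?)\\\)', re.DOTALL)

def extract_math(text: str) -> Tuple[str, Dict[str, dict]]:
    """Swap each math expression for a placeholder; return the new text and the formulas."""
    formulas: Dict[str, dict] = {}

    def _swap(match: re.Match) -> str:
        key = PLACEHOLDER.format(index=len(formulas))
        display_dollars, display_brackets, inline = match.groups()
        content = display_dollars or display_brackets or inline
        # Only the \( ... \) alternative is inline; the other two are display math.
        formulas[key] = {"content": content.strip(), "display": inline is None}
        return key

    return MATH_RE.sub(_swap, text), formulas

def restore_math(text: str, formulas: Dict[str, dict], render: Callable[[dict], str]) -> str:
    """Put rendered math back in place of each placeholder."""
    for key, formula in formulas.items():
        text = text.replace(key, render(formula))
    return text

stripped, formulas = extract_math(r"Area is \( l \times w \), so $$A = lw$$ holds.")
print(stripped)  # → Area is @@MATH0@@, so @@MATH1@@ holds.
```

Because the placeholders contain no markdown or LaTeX syntax, the stripped text can go through a markdown renderer unchanged, and `restore_math` can substitute the separately rendered formulas afterwards.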

## 1. Setting Up Required Libraries

First, let's install and import the necessary libraries for parsing and handling mixed content with math formulas.

In [None]:
# Install required packages
import subprocess
import sys

def install_package(package):
    """Install a package using pip"""
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✅ {package} installed successfully")
    except subprocess.CalledProcessError:
        print(f"❌ Failed to install {package}")

# Install required packages
packages = [
    "markdown-it-py",
    "regex", 
    "html2text",
    "beautifulsoup4"
]

for package in packages:
    install_package(package)

In [None]:
# Import required libraries
import re
import json
from typing import List, Dict, Union, Tuple, Optional
from dataclasses import dataclass
from enum import Enum

# For markdown processing
try:
    from markdown_it import MarkdownIt
    print("✅ markdown-it-py imported successfully")
except ImportError:
    print("❌ markdown-it-py not available, using basic regex")

# For HTML parsing
from bs4 import BeautifulSoup
import html

print("📦 Imports complete!")

# Define data structures for content blocks
class BlockType(Enum):
    TEXT = "text"
    CODE = "code" 
    MERMAID = "mermaid"
    MINDMAP = "mindmap"
    HTML = "html"
    MATH = "math"

@dataclass
class ContentBlock:
    type: BlockType
    content: str
    language: Optional[str] = None
    display: Optional[bool] = None  # For math blocks: True = display, False = inline

@dataclass 
class ParsedContent:
    blocks: List[ContentBlock]
    raw_content: str

print("🏗️ Data structures defined!")

## 2. Creating a Math Pattern Detection System

Now let's build robust regex patterns to detect different types of math expressions. This is the most critical part since we need to handle:

- **Inline LaTeX**: `\( formula \)` 
- **Display LaTeX**: `\[ formula \]`
- **Inline Dollar**: `$formula$`
- **Display Dollar**: `$$formula$$`
- **Escaped delimiters**: `\$` should not trigger math
- **Math in tables**: Special handling needed
- **Nested structures**: Complex expressions with brackets
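
One possible way to approach the table case from the list above is to check whether a match sits on a table row and, if so, treat it as inline even when it uses `$$`, since a display block inside a cell would break the row. The sketch below assumes a deliberately simple heuristic (a line that starts and ends with `|` is a table row); the names `is_in_table_row` and `find_math_table_aware` are hypothetical and not part of the detector defined in this notebook.

```python
import re

# Simplified pattern for the sketch: $$...$$ and \(...\) on a single line.
MATH_IN_LINE = re.compile(r'\$\$(.+?)\$\$|\\\((.+?)\\\)')

def line_at(text: str, pos: int) -> str:
    """Return the full line of `text` that contains index `pos`."""
    start = text.rfind("\n", 0, pos) + 1
    end = text.find("\n", pos)
    return text[start:] if end == -1 else text[start:end]

def is_in_table_row(text: str, pos: int) -> bool:
    """Heuristic (an assumption): a line that starts and ends with '|' is a table row."""
    line = line_at(text, pos).strip()
    return len(line) > 1 and line.startswith("|") and line.endswith("|")

def find_math_table_aware(text: str) -> list:
    """Find math and demote display math to inline when it sits in a table row."""
    results = []
    for m in MATH_IN_LINE.finditer(text):
        in_table = is_in_table_row(text, m.start())
        uses_display_delimiters = m.group(1) is not None
        results.append({
            "content": (m.group(1) or m.group(2)).strip(),
            "display": uses_display_delimiters and not in_table,
            "in_table": in_table,
        })
    return results

sample = "| \\( E = mc^2 \\) | Einstein |\n| $$F = ma$$ | Newton |\n\n$$x = 1$$"
for item in find_math_table_aware(sample):
    print(item)
```

A real implementation would likely need a stricter table check (for example, requiring a `|---|` separator row nearby), but the flag is enough to let a renderer keep the table intact.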

In [None]:
class MathPatternDetector:
    """Comprehensive math pattern detection for LaTeX expressions in markdown"""
    
    def __init__(self):
        # Define math patterns with proper precedence
        # Order matters! More specific patterns first
        
        # Display math patterns (higher precedence)
        self.display_patterns = [
            {
                'name': 'display_dollars',
                'regex': re.compile(r'\$\$\s*(.*?)\s*\$\$', re.DOTALL),
                'display': True
            },
            {
                'name': 'display_brackets', 
                'regex': re.compile(r'\\\[\s*(.*?)\s*\\\]', re.DOTALL),
                'display': True
            }
        ]
        
        # Inline math patterns (lower precedence)
        self.inline_patterns = [
            {
                'name': 'inline_parentheses',
                'regex': re.compile(r'\\\(\s*(.*?)\s*\\\)'),
                'display': False
            },
            {
                'name': 'inline_dollars',
                'regex': re.compile(r'(?<![\\$])\$([^$\n]+?)(?<!\\)\$(?!\$)'),  # Lookarounds skip escaped \$ and $$ delimiters
                'display': False
            }
        ]
        
        # Combined patterns list (display first for priority)
        self.all_patterns = self.display_patterns + self.inline_patterns
    
    def find_all_math(self, text: str) -> List[Dict]:
        """Find all math expressions in text with positions"""
        found_math = []
        
        for pattern_info in self.all_patterns:
            # Compiled Python patterns are stateless, so there is nothing to reset
            pattern = pattern_info['regex']
            
            for match in pattern.finditer(text):
                math_content = match.group(1).strip()
                
                # Skip empty matches
                if not math_content:
                    continue
                    
                # Check for overlaps with existing matches, including the case
                # where this match fully contains an earlier one
                start, end = match.span()
                has_overlap = any(
                    start < existing['end'] and end > existing['start']
                    for existing in found_math
                )
                
                if not has_overlap:
                    found_math.append({
                        'content': math_content,
                        'full_match': match.group(0),
                        'start': start,
                        'end': end,
                        'display': pattern_info['display'],
                        'pattern_name': pattern_info['name']
                    })
        
        # Sort by start position
        found_math.sort(key=lambda x: x['start'])
        return found_math
    
    def is_valid_math(self, content: str) -> bool:
        """Basic validation for math content"""
        # Check for common LaTeX commands
        latex_commands = [
            r'\\frac', r'\\sqrt', r'\\sum', r'\\int', r'\\times', r'\\div',
            r'\\alpha', r'\\beta', r'\\gamma', r'\\pi', r'\\mu', r'\\sigma'
        ]
        
        # Simple heuristics
        has_latex = any(re.search(cmd, content) for cmd in latex_commands)
        has_math_symbols = any(char in content for char in ['^', '_', '{', '}', '\\'])
        
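        # Note: the last clause accepts any non-empty content, so plain
        # expressions such as "a + b" pass without LaTeX commands or symbols
        return has_latex or has_math_symbols or len(content.strip()) > 0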

# Test the detector
detector = MathPatternDetector()

# Test cases
test_cases = [
    "Simple inline: \\( a + b = c \\)",
    "Display block: $$\\frac{a}{b} = c$$", 
    "Mixed: \\( x^2 \\) and $$y = \\sqrt{z}$$",
    "Dollar inline: $a + b$ and display $$c = d$$",
    "Escaped: \\$not math\\$ but \\( real math \\)",
    "In table: | \\( Area = l \\times w \\) | Formula |"
]

print("🔍 Testing Math Pattern Detection:")
print("=" * 50)

for i, test in enumerate(test_cases, 1):
    print(f"\n{i}. Test: {test}")
    matches = detector.find_all_math(test)
    print(f"   Found {len(matches)} math expressions:")
    for match in matches:
        print(f"     - {match['pattern_name']}: '{match['content']}' (display: {match['display']})")

## 3. Building a Content Parser with Math Awareness

Now let's implement a comprehensive parser that can:
1. **Extract code blocks first** (```language``` syntax)
2. **Extract math expressions** from remaining text 
3. **Separate math from the surrounding text** to prevent markdown interference (this implementation splits the text into blocks rather than inserting placeholders)
4. **Create separate blocks** for each content type
5. **Preserve original content** for fallback rendering

In [None]:
class MathAwareContentParser:
    """Advanced content parser that extracts math expressions as separate blocks"""
    
    def __init__(self):
        self.math_detector = MathPatternDetector()
        
    def parse_mixed_content(self, raw_content: str) -> ParsedContent:
        """Parse content into blocks, extracting math as separate blocks"""
        if not raw_content or not raw_content.strip():
            return ParsedContent(blocks=[], raw_content=raw_content)
        
        blocks = []

        # Step 1: Walk the code blocks (```language``` syntax) in document order.
        # Text between code blocks is processed as it is encountered, so the
        # resulting blocks keep the same order as the source content.
        code_regex = re.compile(r'```(\w+)?\n([\s\S]*?)```')
        last_index = 0

        for match in code_regex.finditer(raw_content):
            # Step 2: Extract math from the text before this code block
            if match.start() > last_index:
                self._process_text_segment_with_math(
                    raw_content[last_index:match.start()], blocks
                )

            # Process the code block
            language = match.group(1).lower() if match.group(1) else 'text'
            code_content = match.group(2)

            if language in ['mermaid', 'mindmap']:
                blocks.append(ContentBlock(BlockType.MERMAID if language == 'mermaid' else BlockType.MINDMAP, code_content))
            elif language == 'html':
                blocks.append(ContentBlock(BlockType.HTML, code_content))
            else:
                blocks.append(ContentBlock(BlockType.CODE, code_content, language=language))

            last_index = match.end()

        # Process the text after the last code block (this is the entire
        # content when no code blocks were found)
        if last_index < len(raw_content):
            self._process_text_segment_with_math(raw_content[last_index:], blocks)

        # Fallback: if no blocks found, add entire content as text
        if not blocks and raw_content.strip():
            blocks.append(ContentBlock(BlockType.TEXT, raw_content))

        return ParsedContent(blocks=blocks, raw_content=raw_content)
    
    def _process_text_segment_with_math(self, text: str, blocks: List[ContentBlock]):
        """Process a text segment, extracting math as separate blocks"""
        # Find all math expressions
        math_expressions = self.math_detector.find_all_math(text)
        
        if not math_expressions:
            # No math found, add as text block
            if text.strip():
                blocks.append(ContentBlock(BlockType.TEXT, text))
            return
        
        # Split text and create blocks
        last_end = 0
        
        for math_expr in math_expressions:
            # Add text before this math expression
            if math_expr['start'] > last_end:
                text_before = text[last_end:math_expr['start']]
                if text_before.strip():
                    blocks.append(ContentBlock(BlockType.TEXT, text_before))
            
            # Add the math expression as a separate block
            blocks.append(ContentBlock(
                BlockType.MATH, 
                math_expr['content'], 
                display=math_expr['display']
            ))
            
            last_end = math_expr['end']
        
        # Add any remaining text
        if last_end < len(text):
            remaining_text = text[last_end:]
            if remaining_text.strip():
                blocks.append(ContentBlock(BlockType.TEXT, remaining_text))

# Test the parser with complex content
parser = MathAwareContentParser()

# Sample complex content with mixed types
sample_content = '''
# Math Examples

Here's some text with inline math \\( a + b = c \\) and more text.

```python
def calculate_area(radius):
    return 3.14159 * radius ** 2
```

Display math block:
$$\\text{Area} = \\pi \\times \\text{Radius}^2$$

More text here, and another inline: \\( \\frac{a}{b} = \\frac{c}{d} \\)

| Formula | Description |
|---------|-------------|
| \\( E = mc^2 \\) | Einstein's equation |
| $$F = ma$$ | Newton's second law |

```mermaid
graph TD
    A[Start] --> B[Process]
```

Final display math:
$$\\sum_{i=1}^{n} i = \\frac{n(n+1)}{2}$$
'''

print("🧮 Testing Math-Aware Content Parser:")
print("=" * 60)

result = parser.parse_mixed_content(sample_content)

print("\n📊 Parsing Results:")
print(f"Raw content length: {len(result.raw_content)}")
print(f"Total blocks: {len(result.blocks)}")
print("\nBlock breakdown:")

block_counts = {}
for block in result.blocks:
    block_type = block.type.value
    block_counts[block_type] = block_counts.get(block_type, 0) + 1

for block_type, count in block_counts.items():
    print(f"  - {block_type}: {count}")

print("\n🔍 Detailed Block Analysis:")
for i, block in enumerate(result.blocks):
    content_preview = block.content[:50].replace('\n', '\\n')  # Show newlines as a literal \n
    if len(block.content) > 50:
        content_preview += "..."
    
    extra_info = ""
    if block.language:
        extra_info += f", language: {block.language}"
    if block.display is not None:
        extra_info += f", display: {block.display}"
    
    print(f"  {i+1}. {block.type.value}: '{content_preview}'{extra_info}")

## 4. Implementing Safe Streaming vs Complete Rendering

This is the critical part: handling streaming content safely. We need to:

1. **Buffer incomplete math expressions** during streaming
2. **Use safe markdown rendering** during streaming (no math)
3. **Switch to full rendering** when content is complete
4. **Detect incomplete math patterns** to avoid crashes

The key insight: **Math expressions are like HTML - they need to be complete before rendering safely.**
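
The buffering idea can be sketched as a function that splits a partially streamed buffer into a part that is safe to render and a tail to hold back. This is only an illustration under stated assumptions: the names `split_complete`, `TOKEN_RE`, and `CLOSER_FOR` are hypothetical, and single-dollar inline math is left out because a lone `$` is ambiguous mid-stream.

```python
import re
from typing import Tuple

# Opening delimiter -> closing delimiter it needs (single "$" intentionally omitted).
CLOSER_FOR = {"$$": "$$", "\\[": "\\]", "\\(": "\\)"}

# An escaped dollar is listed first so it is consumed before "$$" can match.
TOKEN_RE = re.compile(r'\\\$|\$\$|\\\[|\\\]|\\\(|\\\)')

def split_complete(buffer: str) -> Tuple[str, str]:
    """Return (safe_to_render, held_back) for a partially streamed buffer."""
    open_pos = None   # index where the currently open expression starts
    expected = None   # closing delimiter that would complete it
    last_end = 0      # end of the last delimiter token seen

    for m in TOKEN_RE.finditer(buffer):
        token = m.group(0)
        last_end = m.end()
        if token == "\\$":
            continue  # escaped dollar, never a delimiter
        if open_pos is None:
            if token in CLOSER_FOR:
                open_pos, expected = m.start(), CLOSER_FOR[token]
        elif token == expected:
            open_pos = expected = None

    if open_pos is not None:
        # Everything from the unclosed opener onward waits for more chunks.
        return buffer[:open_pos], buffer[open_pos:]

    # A trailing "\" or "$" may be the first half of a delimiter split across chunks.
    if buffer[last_end:].endswith(("\\", "$")):
        return buffer[:-1], buffer[-1:]
    return buffer, ""

chunks = ["The area is \\( l \\ti", "mes w \\) and $$A", " = lw$$ done"]
buffer = ""
for chunk in chunks:
    buffer += chunk
    safe, held = split_complete(buffer)
    print(repr(safe), "| held:", repr(held))
```

During streaming, only the safe part would be passed to the math-aware renderer. Once the stream ends, the whole buffer should be rendered regardless, so a held-back tail such as a literal trailing `$` is not lost.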