# PDF to Accessible HTML with Mistral OCR - v2

---

## WCAG-Compliant and Screen Reader Friendly HTML Conversion
This notebook demonstrates how to upload a PDF file from your local computer, use Mistral OCR to extract text and images, and convert the content to an **accessible, structured HTML document** that follows the reading order and formatting of the original PDF. This version focuses on WCAG compliance and screen reader accessibility.

---

### Key Features
- **Proper semantic HTML5** structure with appropriate ARIA attributes
- **Enhanced table accessibility** with proper markup and headers
- **Descriptive alt text** for all images, generated with Pixtral 12B
- **Graph and chart interpretation** in alt text
- **Proper document hierarchy** with semantic headings
- **Screen reader optimizations** throughout

---

### Technology Used
- Mistral OCR API
- Pixtral 12B for image content interpretation
- Semantic HTML5/CSS for accessible output
- ARIA attributes for screen reader support

### Setup
First, let's install the required libraries.

In [None]:
!pip install -q mistralai

## API Setup
Set up your Mistral API client. You can create an API key on the [Mistral Platform](https://console.mistral.ai/api-keys/).

In [None]:
from mistralai import Mistral

# Enter your API key here
api_key = "YOUR_API_KEY_HERE"
client = Mistral(api_key=api_key)
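
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `MISTRAL_API_KEY` (that name and the `load_api_key` helper are illustrative choices, not something the Mistral client requires):

```python
import os
from getpass import getpass


def load_api_key(env_var="MISTRAL_API_KEY"):
    """Return the API key from the environment, prompting only if it is missing.

    `env_var` and this helper are hypothetical names chosen for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value so it is not echoed into the notebook output
        key = getpass(f"Enter your Mistral API key ({env_var} is not set): ").strip()
    return key

# Usage (hypothetical): api_key = load_api_key(); client = Mistral(api_key=api_key)
```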

## PDF Upload

Upload a PDF document using Google Colab's built-in file upload feature. Run the cell below and follow the instructions.

In [None]:
from google.colab import files
import traceback
from mistralai import DocumentURLChunk, TextChunk
import time

print("Click 'Choose Files' to upload a PDF document (max 50MB):")
uploaded = files.upload()

if not uploaded:
    print("❌ No files were uploaded.")
else:
    try:
        # Get the first file
        filename = next(iter(uploaded.keys()))
        file_content = uploaded[filename]
        file_size_mb = len(file_content) / (1024 * 1024)
        print(f"✅ Uploaded: {filename} ({file_size_mb:.2f} MB)")
        
        print("\nStep 1/3: Uploading to Mistral...")
        # Upload to Mistral
        uploaded_file = client.files.upload(
            file={
                "file_name": filename,
                "content": file_content,
            },
            purpose="ocr"
        )
        print(f"✅ File uploaded to Mistral with ID: {uploaded_file.id}")
        
        print("\nStep 2/3: Getting signed URL...")
        # Get the signed URL
        signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)
        print(f"✅ Signed URL obtained")
        
        print("\nStep 3/3: Processing with OCR...")
        print("This may take a while for large files. Please be patient.")
        start_time = time.time()
        
        # Process with OCR
        ocr_result = client.ocr.process(
            document=DocumentURLChunk(document_url=signed_url.url), 
            model="mistral-ocr-latest", 
            include_image_base64=True
        )
        
        # Calculate processing time
        processing_time = time.time() - start_time
        print(f"✅ OCR processing complete in {processing_time:.2f} seconds!")
        
        # Display basic info about the results
        image_count = sum(len(page.images) for page in ocr_result.pages)
        print(f"- Total pages: {len(ocr_result.pages)}")
        print(f"- Total images extracted: {image_count}")
        
        # Print a sample of the first page text (truncated)
        if ocr_result.pages:
            first_page_text = ocr_result.pages[0].markdown[:200]
            print(f"\nSample text from first page:\n{first_page_text}...")
            
    except Exception as e:
        print(f"\n❌ Error: {str(e)}")
        traceback.print_exc()
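
The upload prompt mentions a 50 MB limit, but nothing checks it before the file is sent. A small pre-flight check could catch oversized or non-PDF files early. This is a sketch: `validate_pdf_bytes` is a hypothetical helper, the 50 MB default simply mirrors the limit quoted in the prompt, and the `%PDF-` signature test is a heuristic rather than full validation.

```python
def validate_pdf_bytes(content, max_mb=50):
    """Return (ok, message) for raw file bytes before uploading them."""
    size_mb = len(content) / (1024 * 1024)
    if size_mb > max_mb:
        return False, f"File is {size_mb:.2f} MB, above the {max_mb} MB limit"
    # PDF files begin with the signature "%PDF-"; some files carry a few leading
    # bytes before it, so look in the first 1024 bytes rather than only at offset 0
    if b"%PDF-" not in content[:1024]:
        return False, "File does not look like a PDF (missing %PDF- signature)"
    return True, f"OK ({size_mb:.2f} MB)"

# Usage (hypothetical): ok, message = validate_pdf_bytes(uploaded[filename])
```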

## Generate Accessible Image Descriptions

For WCAG compliance, we need to generate appropriate alt text for all images in the document. We'll use Pixtral 12B to interpret the content of each image, especially for charts and graphs.

In [None]:
from IPython.display import display, HTML
import base64
import time
import json

def generate_alt_text_for_image(img_base64, context=""):
    """Generate WCAG-compliant alt text for an image using Pixtral 12B
    
    Args:
        img_base64: Base64 encoded image data
        context: Text context surrounding the image to provide better descriptions
        
    Returns:
        Descriptive alt text suitable for screen readers
    """
    try:
        # Use Pixtral 12B model to interpret the image
        prompt = """
        Create a detailed, accessible alt text description for this image following WCAG 2.1 guidelines. 
        
        Consider these requirements:
        1. Be concise but thorough (30-150 words)
        2. Describe the main visual content objectively
        3. Include any text visible in the image
        4. If it's a chart or graph, describe the type, axes, trends, and key insights
        5. Mention colors only when relevant to understanding the content
        6. Do not use phrases like "image of" or "picture of"
        7. Focus on what's important for understanding the document's content
        
        Context about where this image appears in the document:
        {context}
        
        Return only the alt text, no other commentary or notes.
        """.format(context=context)
        
        chat_response = client.chat.complete(
            model="pixtral-12b-latest",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": img_base64}}
                    ]
                }
            ],
            temperature=0.1,
            max_tokens=300
        )
        
        # Return the generated alt text
        return chat_response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating alt text: {str(e)}")
        return "Image from document"

if 'ocr_result' in locals():
    # Process the first few images as an example (to save API costs)
    print("Generating accessible alt text for images (this uses Pixtral 12B API)...")
    
    # We'll store alt texts in this dictionary
    all_image_alt_texts = {}
    
    # Process up to 3 images as a demonstration
    images_processed = 0
    max_images_to_process = 3
    
    for page_idx, page in enumerate(ocr_result.pages):
        # Get surrounding text for context
        page_context = page.markdown[:500]  # Use first 500 chars as context
        
        for img in page.images:
            if images_processed >= max_images_to_process:
                break
                
            images_processed += 1
            print(f"Processing image {images_processed}/{min(max_images_to_process, sum(len(p.images) for p in ocr_result.pages))}...")
            
            # Generate alt text
            alt_text = generate_alt_text_for_image(img.image_base64, context=page_context)
            all_image_alt_texts[img.id] = alt_text
            
            # Display the image and its alt text
            display(HTML(f"<p><strong>Generated Alt Text:</strong> {alt_text}</p>"))
            display(HTML(f"<img src='{img.image_base64}' style='max-width:500px; border:1px solid #ddd; padding:5px;' />"))
            print("\n---\n")
            
        if images_processed >= max_images_to_process:
            break
    
    if images_processed == 0:
        print("No images found in the document.")
    else:
        print(f"✅ Generated alt text for {images_processed} images")
        print("For the full document conversion, all images will be processed.")
else:
    print("❌ No OCR results available. Please process a PDF first.")
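
Model output does not always follow the prompt exactly, and the alt text is later placed inside an HTML attribute. A hedged sketch of a post-processing step, where `clean_alt_text` and its 150-word cap (taken from the range stated in the prompt) are illustrative assumptions:

```python
import html
import re


def clean_alt_text(raw, max_words=150):
    """Normalize model output so it is safe to embed in an alt attribute."""
    # Collapse newlines and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", raw).strip()
    # Remove a leading "Image of" / "Picture of" if the model used one anyway
    text = re.sub(r"^(?:an?\s+)?(?:image|picture|photo)\s+of\s+", "", text,
                  flags=re.IGNORECASE)
    words = text.split(" ")
    if len(words) > max_words:
        text = " ".join(words[:max_words]) + "..."
    if text:
        text = text[0].upper() + text[1:]
    # Escape &, <, > and both quote characters for use inside an attribute value
    return html.escape(text, quote=True)
```

Escaping at this stage means an apostrophe or quotation mark in a description cannot terminate the attribute early.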

## Extract and Enhance Tables

Tables need special handling for screen reader accessibility. We'll identify and properly mark up tables with appropriate headers and ARIA attributes.

In [None]:
import re

def process_tables_for_accessibility(markdown_content):
    """Enhance tables with proper accessibility features
    
    Args:
        markdown_content: Original markdown content containing tables
        
    Returns:
        Markdown with enhanced table markup for accessibility
    """
    # Ensure a table on the last line is still terminated by a newline
    if not markdown_content.endswith('\n'):
        markdown_content += '\n'

    # Match whole tables: runs of consecutive lines that start and end with '|'.
    # The group is non-capturing so each match covers every row of the table,
    # not only the last row (which is what a repeated capturing group returns).
    table_pattern = re.compile(r'(?:^\|.+\|\n)+', re.MULTILINE)
    table_count = 0

    def add_caption(match):
        nonlocal table_count
        section = match.group(0)
        rows = section.strip().split('\n')
        if len(rows) < 2:  # Not a proper table
            return section
        table_count += 1
        # Add a caption before the table; it becomes <caption> during HTML conversion
        return f"\n\n**Table {table_count}**\n\n{section}"

    # Substituting match by match avoids touching identical text elsewhere on the page
    return table_pattern.sub(add_caption, markdown_content)

if 'ocr_result' in locals():
    # Process the first page with tables as an example
    for page in ocr_result.pages:
        if '|' in page.markdown:  # Simple check for tables
            print("Found table in page. Processing for accessibility...")
            
            # Sample of original content with table
            table_section = re.search(r'(\|.+\|\n)+', page.markdown)
            if table_section:
                print("\nOriginal table format:\n")
                print(table_section.group(0))
                
                # Enhanced version
                enhanced_content = process_tables_for_accessibility(page.markdown)
                
                # Find the enhanced table
                enhanced_table = re.search(r'\*\*Table \d+\*\*\n\n(\|.+\|\n)+', enhanced_content)
                if enhanced_table:
                    print("\nAccessible table format:\n")
                    print(enhanced_table.group(0))
            break
    else:
        print("No tables found in the document for demonstration.")
else:
    print("❌ No OCR results available. Please process a PDF first.")
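
One detail of Python's `re` module matters when tables are detected with a repeated group such as `(\|.+\|\n)+`: `re.findall` returns the text of the capturing group, which for a repeated group is only its last repetition (the final row). A non-capturing group combined with `re.finditer` yields the whole table. The sample markdown below is made up for illustration:

```python
import re

sample_md = "intro\n| A | B |\n|---|---|\n| 1 | 2 |\nend\n"

# Capturing group: findall keeps only the last repetition of the group
last_rows = re.findall(r'(\|.+\|\n)+', sample_md)

# Non-capturing group: group(0) of each match is the complete table
whole_tables = [m.group(0) for m in re.finditer(r'(?:\|.+\|\n)+', sample_md)]

print(last_rows)      # ['| 1 | 2 |\n']
print(whole_tables)   # ['| A | B |\n|---|---|\n| 1 | 2 |\n']
```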

## Convert OCR Results to Accessible HTML
Now we'll convert the OCR results to a WCAG-compliant, screen reader friendly HTML document that preserves the layout and formatting of the original PDF.

In [None]:
import base64
import re
from IPython.display import HTML, display
from mistralai.models import OCRResponse
import time

def convert_ocr_to_accessible_html(ocr_response):
    """Convert OCR results to WCAG-compliant accessible HTML"""
    if ocr_response is None:
        return "<p>No OCR results available. Please process a PDF first.</p>"
    
    # Identify document language (defaulting to English)
    document_language = "en"
    
    # Start with an HTML5 document with ARIA roles and accessibility features
    html_parts = []
    html_parts.append(f"""
    <!DOCTYPE html>
    <html lang="{document_language}">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Accessible Document</title>
        <style>
            /* Base styles with accessibility considerations */
            body {{ 
                font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif; 
                line-height: 1.6; 
                max-width: 1000px; 
                margin: 0 auto; 
                padding: 20px;
                color: #333;
                background-color: #fff;
            }}
            
            /* Ensure sufficient color contrast (WCAG AA 4.5:1) */
            h1, h2, h3, h4, h5, h6 {{ 
                color: #222; 
                margin-top: 1.5em;
                line-height: 1.2;
            }}
            
            /* Image handling for WCAG */
            img {{ 
                max-width: 100%; 
                height: auto; 
                display: block; 
                margin: 1em 0;
            }}
            
            /* Accessible table styles */
            table {{ 
                border-collapse: collapse; 
                width: 100%; 
                margin: 1.5em 0; 
                border: 1px solid #ddd;
            }}
            caption {{
                font-weight: bold;
                text-align: left;
                margin-bottom: 0.5em;
                font-size: 1.1em;
            }}
            th, td {{ 
                border: 1px solid #ddd; 
                padding: 8px; 
                text-align: left; 
            }}
            th {{ 
                background-color: #f2f2f2; 
                font-weight: bold;
            }}
            /* Zebra striping for better readability */
            tr:nth-child(even) {{
                background-color: #f8f8f8;
            }}
            
            /* Ensure page breaks are accessible */
            .page-break {{ 
                height: 40px; 
                margin: 40px 0; 
                border-bottom: 1px dashed #ccc; 
                text-align: center;
                position: relative;
            }}
            .page-break::after {{
                content: "Page Break";
                position: absolute;
                top: 50%;
                left: 50%;
                transform: translate(-50%, -50%);
                background: white;
                padding: 0 10px;
                color: #666;
                font-size: 0.8em;
            }}
            
            /* Document structure */
            .pdf-page {{
                margin-bottom: 2em;
                border: 1px solid #eee;
                padding: 2em;
                border-radius: 5px;
            }}
            
            /* Focus indicators for keyboard navigation */
            a:focus, button:focus, input:focus {{
                outline: 3px solid #4a90e2;
                outline-offset: 2px;
            }}
            
            /* Skip link for keyboard users */
            .skip-link {{
                position: absolute;
                top: -40px;
                left: 0;
                background: #4a90e2;
                color: white;
                padding: 8px 16px;
                z-index: 100;
                transition: top 0.3s;
            }}
            .skip-link:focus {{
                top: 0;
            }}
            
            /* Print-specific styles */
            @media print {{
                body {{
                    width: 100%;
                    max-width: none;
                    margin: 0;
                    padding: 0;
                }}
                .pdf-page {{
                    border: none;
                    padding: 0;
                    margin: 0 0 2em 0;
                }}
                .page-break {{
                    page-break-after: always;
                    border: none;
                    height: 0;
                }}
                .page-break::after {{
                    display: none;
                }}
                .skip-link {{
                    display: none;
                }}
            }}
        </style>
    </head>
    <body>
        <a href="#main-content" class="skip-link">Skip to main content</a>
        <main id="main-content">
    """)
    
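    # Use the first-page heading as the document <title> (WCAG 2.4.2, Page Titled).
    # The heading itself is rendered as <h1> when the page markdown is converted
    # below, so it is not appended a second time here.
    if ocr_response.pages and ocr_response.pages[0].markdown:
        first_line = ocr_response.pages[0].markdown.strip().split('\n')[0]
        if first_line.startswith('# '):
            document_title = first_line[2:].strip()
            safe_title = document_title.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
            html_parts[0] = html_parts[0].replace(
                "<title>Accessible Document</title>", f"<title>{safe_title}</title>")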
    
    # Process each page
    for i, page in enumerate(ocr_response.pages):
        page_number = i + 1
        
        # Process all images to get alt text
        print(f"Processing page {page_number}/{len(ocr_response.pages)}...")
        
        # Create a dict of images by ID for quick lookup
        image_data = {}
        image_alt_texts = {}
        
        # Process images for this page
        for img in page.images:
            image_data[img.id] = img.image_base64
            
            # Generate alt text for each image using surrounding text for context
            # Extract text before and after the image reference to provide context
            image_reference = f"![{img.id}]({img.id})"
            img_pos = page.markdown.find(image_reference)
            
            if img_pos > -1:
                # Get text before and after the image (up to 250 chars each)
                before_text = page.markdown[max(0, img_pos-250):img_pos]
                after_text = page.markdown[img_pos+len(image_reference):min(len(page.markdown), img_pos+len(image_reference)+250)]
                context = before_text + "\n" + after_text
            else:
                context = ""
                
            print(f"  Generating alt text for image {img.id}...")
            start_time = time.time()
            alt_text = generate_alt_text_for_image(img.image_base64, context=context)
            print(f"  ✓ Alt text generated in {time.time() - start_time:.2f} seconds")
            image_alt_texts[img.id] = alt_text
        
        # Open a labelled section for this page (1-based id, matching the aria-label)
        html_parts.append(f"<section class='pdf-page' id='page-{page_number}' aria-label='Page {page_number}'>")
        
        # Process tables for accessibility
        enhanced_md = process_tables_for_accessibility(page.markdown)
        
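        # Replace image markers with accessible HTML img tags
        from html import escape  # local import keeps this cell self-contained
        for img_id, base64_str in image_data.items():
            # Get the generated alt text or use a default
            alt_text = image_alt_texts.get(img_id, f"Image {img_id} from document")
            # Escape quotes and angle brackets so the text is safe inside alt='...'
            alt_attr = escape(alt_text, quote=True)
            caption = escape(alt_text[:60])

            # Use the bounding box for width/height when the OCR response provides it
            img_obj = next((img for img in page.images if img.id == img_id), None)
            size_attrs = ""
            if img_obj is not None and None not in (
                    img_obj.top_left_x, img_obj.top_left_y,
                    img_obj.bottom_right_x, img_obj.bottom_right_y):
                width = img_obj.bottom_right_x - img_obj.top_left_x
                height = img_obj.bottom_right_y - img_obj.top_left_y
                size_attrs = f" width='{width}' height='{height}'"

            # Create figure with caption for complex images
            enhanced_md = enhanced_md.replace(
                f"![{img_id}]({img_id})",
                f"<figure id='figure-{img_id}'>"
                f"<img src='{base64_str}' alt='{alt_attr}' id='{img_id}'{size_attrs} />"
                f"<figcaption>Figure: {caption}...</figcaption>"
                f"</figure>"
            )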
        
        # Convert markdown to semantic HTML5
        html_content = convert_markdown_to_semantic_html(enhanced_md)
        html_parts.append(html_content)
        
        # Close page section and add page break if not the last page
        html_parts.append("</section>")
        if i < len(ocr_response.pages) - 1:
            html_parts.append(f"<div class='page-break' role='separator' aria-label='Page break between pages {page_number} and {page_number+1}'></div>")
    
    # Close main content and document
    html_parts.append("</main>")
    
    # Add footer with document metadata
    html_parts.append("""
    <footer role="contentinfo">
        <p>Document processed with Mistral OCR and converted to accessible HTML.</p>
    </footer>
    </body>
    </html>""")
    
    return '\n'.join(html_parts)

def convert_markdown_to_semantic_html(markdown_content):
    """Convert markdown to semantic HTML5 with accessibility features"""
    # Headers with proper hierarchy and IDs: one to six leading '#' map to <h1>..<h6>
    def heading_to_html(m):
        level = len(m.group(1))
        return f"<h{level} id='{slugify(m.group(2))}'>{m.group(2)}</h{level}>"

    markdown_content = re.sub(r'^(#{1,6}) (.+)$', heading_to_html, markdown_content, flags=re.MULTILINE)
    
    # Lists with proper semantics
    # Find ordered lists and convert to semantic HTML
    ordered_list_pattern = r'(^\d+\. .+$(\n^\d+\. .+$)*)'  
    ordered_lists = re.findall(ordered_list_pattern, markdown_content, re.MULTILINE)
    for ol_match in ordered_lists:
        ol_content = ol_match[0]
        items = re.findall(r'^\d+\. (.+)$', ol_content, re.MULTILINE)
        
        # Build semantic ordered list
        new_list = "<ol>\n"
        for item in items:
            new_list += f"  <li>{item}</li>\n"
        new_list += "</ol>"
        
        # Replace in original markdown
        markdown_content = markdown_content.replace(ol_content, new_list)
    
    # Find unordered lists and convert to semantic HTML
    unordered_list_pattern = r'(^- .+$(\n^- .+$)*)'  
    unordered_lists = re.findall(unordered_list_pattern, markdown_content, re.MULTILINE)
    for ul_match in unordered_lists:
        ul_content = ul_match[0]
        items = re.findall(r'^- (.+)$', ul_content, re.MULTILINE)
        
        # Build semantic unordered list
        new_list = "<ul>\n"
        for item in items:
            new_list += f"  <li>{item}</li>\n"
        new_list += "</ul>"
        
        # Replace in original markdown
        markdown_content = markdown_content.replace(ul_content, new_list)
    
    # Tables with proper semantics and ARIA
    # The row group is non-capturing so each match spans the caption and every row;
    # a repeated capturing group would only keep the last row.
    table_pattern = re.compile(r'\*\*Table (\d+)\*\*\n\n(?:\|.+\|\n)+')
    for table_match in list(table_pattern.finditer(markdown_content)):
        table_num = table_match.group(1)
        # Drop the trailing newline so the blank line after the table is preserved
        table_section = table_match.group(0).rstrip('\n')
        rows = re.findall(r'\|(.+)\|', table_section)
        
        if len(rows) < 2:  # Need at least header and one data row
            continue
            
        # Skip separator row if present
        separator_index = -1
        for i, row in enumerate(rows):
            if re.match(r'^[\s\-:|]+$', row):  # Row contains only separators
                separator_index = i
                break
                
        # Build accessible table
        table_id = f"table-{table_num}"
        semantic_table = f"<div class='table-container' role='region' aria-labelledby='{table_id}-caption' tabindex='0'>\n"
        semantic_table += f"<table id='{table_id}'>\n"
        semantic_table += f"<caption id='{table_id}-caption'>Table {table_num}</caption>\n"
        
        # Header row processing
        header_row = rows[0]
        headers = [cell.strip() for cell in header_row.split('|')]
        header_ids = [f"{table_id}-col-{i+1}" for i in range(len(headers))]
        
        semantic_table += "<thead>\n<tr>\n"
        for i, header in enumerate(headers):
            semantic_table += f"<th id='{header_ids[i]}' scope='col'>{header}</th>\n"
        semantic_table += "</tr>\n</thead>\n"
        
        # Data rows
        semantic_table += "<tbody>\n"
        data_rows = [r for i, r in enumerate(rows) if i != 0 and i != separator_index]
        
        for row_idx, row in enumerate(data_rows):
            cells = [cell.strip() for cell in row.split('|')]
            semantic_table += "<tr>\n"
            
            # First cell might be a row header
            first_cell = cells[0] if cells else ""
            if first_cell and all(c == first_cell for c in cells):
                semantic_table += f"<th scope='colgroup' colspan='{len(cells)}'>{first_cell}</th>\n"
            else:
                for col_idx, cell in enumerate(cells):
                    if col_idx == 0 and is_likely_header(cell, cells):
                        # This is likely a row header
                        semantic_table += f"<th scope='row'>{cell}</th>\n"
                    else:
                        # Regular cell - reference its header for accessibility
                        header_id = header_ids[col_idx] if col_idx < len(header_ids) else ""
                        headers_attr = f" headers='{header_id}'" if header_id else ""
                        semantic_table += f"<td{headers_attr}>{cell}</td>\n"
                        
            semantic_table += "</tr>\n"
            
        semantic_table += "</tbody>\n</table>\n</div>"
        
        # Replace in original markdown
        markdown_content = markdown_content.replace(table_section, semantic_table)
    
    # Bold and italic
    markdown_content = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', markdown_content)
    markdown_content = re.sub(r'\*(.+?)\*', r'<em>\1</em>', markdown_content)
    
    # Links with aria-label for better screen reader experience
    markdown_content = re.sub(
        r'\[(.+?)\]\((.+?)\)', 
        lambda m: f"<a href=\"{m.group(2)}\" aria-label=\"{m.group(1)}\">{m.group(1)}</a>", 
        markdown_content
    )
    
    # Convert remaining paragraphs with proper HTML5 structure
    paragraphs = re.split(r'\n{2,}', markdown_content)
    html_parts = []
    
    for p in paragraphs:
        p = p.strip()
        if not p:
            continue
            
        # Skip if already HTML
        if p.startswith('<') and p.endswith('>'):
            html_parts.append(p)
        else:
            # Replace single line breaks with <br>
            p = p.replace('\n', '<br>')
            # Wrap in paragraph tags
            html_parts.append(f"<p>{p}</p>")
    
    return '\n'.join(html_parts)

def slugify(text):
    """Convert text to a URL-friendly format for IDs"""
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Replace runs of whitespace with hyphens
    return re.sub(r'\s+', '-', text)

def is_likely_header(cell, row_cells):
    """Determine if a cell is likely a row header"""
    # Check if first cell is formatted differently (e.g., all caps, shorter)
    if cell.isupper() and not all(c.isupper() for c in row_cells[1:]):
        return True
    # Check if first cell is shorter than average of other cells
    if len(row_cells) > 1:
        avg_len = sum(len(c) for c in row_cells[1:]) / (len(row_cells) - 1)
        if len(cell) < avg_len * 0.5:  # Significantly shorter
            return True
    return False

# Check if OCR results are available
if 'ocr_result' in locals():
    print("✅ OCR results are available! Converting to accessible HTML...")
    print("This process will generate alt text for images and may take some time.")
else:
    print("❌ No OCR results available. Please run the PDF processing cell first.")
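
Heading IDs must be unique within a page for in-document links and assistive technology to work reliably, yet two headings with the same text (for example a repeated "Summary" on different pages) slugify to the same ID. One possible approach, sketched here with a hypothetical `make_unique_slugger` helper that is not part of the notebook's code, is to remember which slugs have been used and append a counter:

```python
import re


def make_unique_slugger():
    """Return a slug function that never hands out the same ID twice."""
    seen = {}

    def slug(text):
        # Same normalization idea as slugify: drop punctuation, hyphenate whitespace
        cleaned = re.sub(r'[^\w\s]', '', text.lower()).strip()
        base = re.sub(r'\s+', '-', cleaned) or "section"  # fallback for empty slugs
        seen[base] = seen.get(base, 0) + 1
        return base if seen[base] == 1 else f"{base}-{seen[base]}"

    return slug


unique_slug = make_unique_slugger()
print(unique_slug("Summary"))   # summary
print(unique_slug("Summary"))   # summary-2
```

Creating one slugger per document, rather than per page, keeps IDs unique across the whole output file.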

## Generate and Display HTML Preview
Convert the OCR results to accessible HTML and display a preview.

In [None]:
if 'ocr_result' not in locals():
    print("❌ No OCR results available. Please process a PDF first.")
else:
    print("Generating accessible HTML preview...")
    html_content = convert_ocr_to_accessible_html(ocr_result)
    display(HTML(html_content))
    print("\n✅ Accessible HTML preview displayed. Scroll up to see the rendered content.")

## Save Accessible HTML to File
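Download the accessible HTML content as a file. The file is built to follow WCAG 2.1 and to work well with screen readers, but AI-generated alt text and heuristic table detection can be wrong, so validate the result with the tools listed under "Accessibility Features Implemented" before publishing it.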

In [None]:
def create_download_link(html_content, filename="accessible_document.html"):
    """Create a download link for the HTML content"""
    b64 = base64.b64encode(html_content.encode()).decode()
    href = f'<a download="{filename}" href="data:text/html;base64,{b64}" target="_blank">Download Accessible HTML File</a>'
    return href

if 'ocr_result' not in locals() or 'html_content' not in locals():
    print("❌ No HTML content available. Please generate the HTML preview first.")
else:
    print("Generating download link...")
    download_link = create_download_link(html_content)
    display(HTML(download_link))
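
Data-URI links can fail for large documents, because the base64 images embedded in the HTML quickly push the link to many megabytes. A sketch of an alternative that writes the file to disk first. `save_html` is a hypothetical helper, and the commented Colab call assumes the notebook runs in Google Colab, as the upload cell does:

```python
from pathlib import Path


def save_html(html_text, path="accessible_document.html"):
    """Write the HTML as UTF-8 and return the resolved path."""
    out = Path(path)
    out.write_text(html_text, encoding="utf-8")
    return out.resolve()

# In Colab, the saved file could then be offered for download with:
#   from google.colab import files
#   files.download(str(save_html(html_content)))
```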

## Accessibility Features Implemented

This HTML conversion implements the following WCAG 2.1 (Web Content Accessibility Guidelines) features:

1. **Semantic Structure**
   - Proper HTML5 elements (`<main>`, `<section>`, `<figure>`, etc.)
   - Landmarks and regions with ARIA roles
   - Hierarchical heading structure (`<h1>` through `<h6>`)

2. **Screen Reader Support**
   - Skip links for keyboard navigation
   - ARIA labels and descriptions
   - Proper alt text for all images
   - Table captions and header associations

3. **Images and Media**
   - AI-generated descriptive alt text
   - Special handling for charts and graphs
   - Figure captions for complex images

4. **Tables**
   - Proper `<thead>`, `<tbody>` structure
   - Row and column headers with correct scope
   - Data cells associated with headers
   - Table captions

5. **Visual Design**
   - High contrast text (WCAG AA 4.5:1 ratio)
   - Responsive layout for various devices
   - Clear focus indicators for keyboard navigation
   - Print-friendly styling

The generated HTML file can be validated using accessibility tools like WAVE, axe, or the NVDA screen reader to verify compliance.
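
Before turning to those tools, a few mechanical problems can be caught with the Python standard library alone. The sketch below is not a WCAG validator. It only flags `<img>` elements without an `alt` attribute, duplicate `id` values, and a missing `lang` on `<html>`. The `QuickA11yCheck` class and `quick_check` function are illustrative names.

```python
from collections import Counter
from html.parser import HTMLParser


class QuickA11yCheck(HTMLParser):
    """Collect a handful of easily detectable accessibility problems."""

    def __init__(self):
        super().__init__()
        self.issues = []
        self.ids = Counter()
        self.has_lang = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "html" and attr.get("lang"):
            self.has_lang = True
        if tag == "img" and "alt" not in attr:
            self.issues.append("img without alt attribute")
        if attr.get("id"):
            self.ids[attr["id"]] += 1


def quick_check(html_text):
    """Return a list of issue strings; an empty list means nothing was flagged."""
    checker = QuickA11yCheck()
    checker.feed(html_text)
    checker.close()
    issues = list(checker.issues)
    issues += [f"duplicate id: {i}" for i, n in checker.ids.items() if n > 1]
    if not checker.has_lang:
        issues.append("missing lang attribute on <html>")
    return issues

# Usage (hypothetical): print(quick_check(html_content))
```

An empty result only means these three checks passed. Alt text quality, reading order, and table semantics still need a manual review with a screen reader.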

## Troubleshooting

If you encounter issues processing PDFs, try these steps:

1. **API Key Issues**: Verify your API key is correct and has sufficient permissions
2. **Large PDFs**: For PDFs larger than 10MB, try reducing file size or processing fewer pages
3. **Timeout Issues**: The OCR and image description processing may take several minutes for large files
4. **Google Colab Runtime Issues**: If cells hang, try:
   - Restart the runtime (Runtime → Restart runtime)
   - Check if your PDF is too large (>50MB) for the Mistral API
5. **Image Alt Text Generation**: If image descriptions are taking too long:
   - Modify the code to process fewer images
   - Use a simpler description approach
6. **Table Structure Issues**: If tables aren't properly structured:
   - The original PDF may have complex formatting
   - Try adjusting the table detection regex patterns

If issues persist, check the [Mistral OCR documentation](https://docs.mistral.ai/capabilities/document/) for more information or try with a smaller PDF file first.
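
For the timeout and transient API errors listed above, wrapping a call in a retry loop with exponential backoff is a common mitigation. This is a generic sketch: `with_retries` is a hypothetical helper, it retries on any exception (a real notebook would probably narrow that to the SDK's error types), and the delays are arbitrary defaults.

```python
import time


def with_retries(fn, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, then double the wait, and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)

# Usage (hypothetical), reusing the OCR call from the upload cell:
#   ocr_result = with_retries(lambda: client.ocr.process(
#       document=DocumentURLChunk(document_url=signed_url.url),
#       model="mistral-ocr-latest",
#       include_image_base64=True))
```

The `sleep` parameter exists so the waiting can be replaced in tests. In normal use it can be left at its default.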