# ScavToMarkdown: PDF to Markdown Converter

`ScavToMarkdown` transcribes PDF files into Markdown text using OpenAI models (`gpt-4.1` or `gpt-4o` only) through LangChain.

- **Input:** file path, bytes, URL, or base64 string of the PDF.
- **Output:** Markdown string with a detailed transcription of the PDF.
- **Highlights:** parallel processing, dynamic and configurable logging, integration with LangChain and OpenAI, and detailed token tracking.

---

## Class Parameters

| Parameter     | Type    | Default    | Description                                                                |
|---------------|---------|------------|----------------------------------------------------------------------------|
| model         | str     | 'gpt-4.1'  | OpenAI model to use ('gpt-4.1' or 'gpt-4o')                                |
| temperature   | float   | 0          | Temperature for text generation                                            |
| max_tokens    | int     | 2048       | Maximum number of tokens per response                                      |
| timeout       | int     | 90         | Timeout (seconds) for requests                                             |
| max_workers   | int     | 10         | Maximum number of threads for parallel processing                          |
| verbose       | bool    | True       | Enables/disables detailed logs                                             |
| log_level     | str     | 'INFO'     | Log level ('DEBUG', 'INFO', 'WARNING', 'ERROR')                           |
| callbacks     | list    | None       | LangChain callbacks for tracing                                            |
| logger        | Logger  | None       | Custom logger (optional, for dynamic integration)                          |

**Exceptions:**
- `ModelIncompatibilityError`: raised if the model is not 'gpt-4.1' or 'gpt-4o'.
- `RuntimeError`: raised when the PDF cannot be read.

---

In [1]:
# Dependency installation (if needed)
# !pip install langchain_openai langchain_core openai python-dotenv requests pymupdf tiktoken

## Instantiating the class

Below, we instantiate the converter with the default model and logging enabled.
You can pass a custom logger for dynamic integration with other logging systems.


In [8]:
from scav_switch.converters.pdf import ScavToMarkdown
import logging

scav = ScavToMarkdown(model='gpt-4.1', verbose=True)


2025-06-14 05:09:47,307 - ScavToMarkdown - INFO - ScavToMarkdown initialized - Project: None
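
The cell above relies on the built-in logging. As a minimal sketch of the custom-logger option, the snippet below builds a standard-library logger whose format mirrors the log line shown above. The helper name `build_logger` and the logger name `pdf_pipeline` are illustrative assumptions, and the final commented line assumes the `logger` parameter behaves as described in the parameter table.

```python
import logging
import sys


def build_logger(name: str = "pdf_pipeline", level: str = "INFO") -> logging.Logger:
    """Return a stdout logger; safe to call more than once for the same name."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level.upper()))
    if not logger.handlers:  # avoid duplicate lines when a notebook cell is re-run
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep records from being printed again by the root logger
    return logger


custom_logger = build_logger()
custom_logger.info("Custom logger ready")

# Assumed usage, based on the `logger` parameter in the parameter table:
# scav = ScavToMarkdown(model='gpt-4.1', logger=custom_logger)
```

Checking for existing handlers before adding one matters in notebooks, where re-running a cell would otherwise attach a second handler and print every message twice.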


## Example 1: Convert PDF from a file path

The `dig` method accepts the path to a local PDF file.


In [9]:
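import os

# Build the path to the sample PDF in the current working directory
base_path = os.getcwd()
pdf_path = os.path.join(base_path, 'sample-tables.pdf')
markdown_text = scav.dig(pdf_path)

# Save the full transcription; UTF-8 avoids platform-dependent default encodings
with open('markdown_text.md', 'w', encoding='utf-8') as f:
    f.write(markdown_text)

# Preview the first 250 characters
print(markdown_text[:250])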


2025-06-14 05:09:48,982 - ScavToMarkdown - INFO - Starting PDF to Markdown conversion process
2025-06-14 05:09:48,983 - ScavToMarkdown - INFO - PDF loaded successfully - 145494 bytes
2025-06-14 05:09:49,438 - ScavToMarkdown - INFO - Extracted 11 pages from PDF
2025-06-14 05:09:49,438 - ScavToMarkdown - INFO - Starting parallel processing with 10 workers
2025-06-14 05:10:09,364 - ScavToMarkdown - INFO - Conversion completed - 16976 characters generated



# Design and build accessible PDF tables
## Sample tables

---

### Table 1

| **Column header (TH)** | **Column header (TH)** | **Column header (TH)** |
|------------------------|------------------------|------------------------|
| **Row header (TH


## Example 2: Convert PDF from a URL

You can also provide a direct URL to the PDF.


In [4]:
pdf_url = 'https://datascav-public-media-bkt.s3.sa-east-1.amazonaws.com/pdf/sample-tables.pdf'
markdown_text = scav.dig(pdf_url)
print(markdown_text[:250])


2025-06-14 03:23:57,139 - ScavToMarkdown - INFO - Starting PDF to Markdown conversion process
2025-06-14 03:23:57,260 - ScavToMarkdown - INFO - PDF loaded successfully - 145494 bytes
2025-06-14 03:23:57,728 - ScavToMarkdown - INFO - Extracted 11 pages from PDF
2025-06-14 03:23:57,728 - ScavToMarkdown - INFO - Starting parallel processing with 10 workers
2025-06-14 03:24:13,395 - ScavToMarkdown - INFO - Conversion completed - 16784 characters generated



Design and build  
accessible PDF tables  
Sample tables

Table 1

| Column header (TH) | Column header (TH) | Column header (TH) |
|--------------------|--------------------|--------------------|
| **Row header (TH)** | Data cell (TD)      | Data c


## Example 3: Convert PDF from a base64 string

If the PDF is already base64-encoded, pass the string directly to `dig`.


In [5]:
import base64

# Read the PDF inside a context manager so the file handle is closed
with open('sample-tables.pdf', 'rb') as f:
    pdf_base64 = base64.b64encode(f.read()).decode('utf-8')
markdown_text = scav.dig(pdf_base64)
print(markdown_text[:250])


2025-06-14 03:24:13,402 - ScavToMarkdown - INFO - Starting PDF to Markdown conversion process
2025-06-14 03:24:13,403 - ScavToMarkdown - INFO - PDF loaded successfully - 145494 bytes
2025-06-14 03:24:13,854 - ScavToMarkdown - INFO - Extracted 11 pages from PDF
2025-06-14 03:24:13,855 - ScavToMarkdown - INFO - Starting parallel processing with 10 workers
2025-06-14 03:24:28,520 - ScavToMarkdown - INFO - Conversion completed - 16911 characters generated



Design and build  
accessible PDF tables  
Sample tables

Table 1

| **Column header (TH)** | **Column header (TH)** | **Column header (TH)** |
|------------------------|------------------------|------------------------|
| **Row header (TH)**    | D
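
Before sending a base64 string for conversion, it can be useful to confirm that it really encodes a PDF. The sketch below is a standalone pre-check and not part of `ScavToMarkdown`. The helper name `is_base64_pdf` is hypothetical, and the check relies only on the fact that PDF files begin with the bytes `%PDF-`.

```python
import base64
import binascii


def is_base64_pdf(data: str) -> bool:
    """Return True if `data` is valid base64 whose decoded bytes start with the PDF signature."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        return False
    return raw.startswith(b"%PDF-")


# Tiny stand-in payload: only the header line of a PDF, enough for the signature check
sample = base64.b64encode(b"%PDF-1.7\n% minimal header only\n").decode("utf-8")
print(is_base64_pdf(sample))         # True
print(is_base64_pdf("not base64!"))  # False
```

A check like this fails fast on malformed input, before any model request is made.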


---

## Detailed token tracking

After processing, you can access the `tokens_usage` attribute to get a detailed report of tokens consumed:

- `input`: accumulated input tokens
- `output`: accumulated output tokens
- `total`: total tokens
- `details`: detailed list per page (with fallback if applicable)


In [6]:
for key, value in scav.tokens_usage.items():
    if key != 'details':
        print(f"{key.capitalize()} Tokens: {value}")

Input Tokens: 3333
Output Tokens: 14875
Total Tokens: 18208
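
The aggregate counters can also be condensed into a single report line. The following sketch works on a plain dictionary shaped like `tokens_usage` as described above, using the figures from this run. The helper `summarize_usage` is hypothetical, and the structure of the `details` entries is not assumed, so they are skipped.

```python
def summarize_usage(usage: dict) -> str:
    """Format the aggregate counters, skipping the per-page `details` entry."""
    totals = {key: value for key, value in usage.items() if key != "details"}
    # Share of the total that was generated output; guard against an empty run
    output_share = totals["output"] / totals["total"] if totals["total"] else 0.0
    return (
        f"input={totals['input']} output={totals['output']} "
        f"total={totals['total']} output_share={output_share:.1%}"
    )


# Values copied from the output above; `details` left empty on purpose
usage = {"input": 3333, "output": 14875, "total": 18208, "details": []}
print(summarize_usage(usage))
```

In a live session the call would presumably be `summarize_usage(scav.tokens_usage)`.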


## Best practices and notes

- **Supported models:** only 'gpt-4.1' and 'gpt-4o'.
- **Flexible input:** path, URL, bytes, or base64.
- **Logs:** `verbose` is `True` by default; set `log_level='DEBUG'` for more detail, or pass a custom logger.
- **Errors:** wrap calls in try/except to catch `ModelIncompatibilityError` (unsupported model) and `RuntimeError` (PDF read failures).
- **Performance:** for large PDFs, adjust `max_workers` according to your infrastructure.
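
To make the `max_workers` note concrete, here is a generic illustration of page-level parallelism with a thread pool. It is not the library's implementation, which this notebook does not show. The names `convert_pages` and `fake_convert` are hypothetical, and `fake_convert` stands in for a per-page model request.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_pages(pages, convert_page, max_workers=10):
    """Apply `convert_page` to every page concurrently and keep the original page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order
        return list(pool.map(convert_page, pages))


def fake_convert(page_number):
    # Stand-in for a per-page model request
    return f"## Page {page_number}"


markdown = "\n\n".join(convert_pages(range(1, 4), fake_convert, max_workers=2))
print(markdown)
```

Threads suit this workload because each page spends most of its time waiting on a network response. More workers shorten wall-clock time only until the provider's rate limits or local resources become the bottleneck.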

---

**Example of error handling:**


In [7]:
from scav_switch.converters.pdf.exceptions import ModelIncompatibilityError

try:
    scav = ScavToMarkdown(model='gpt-3.5-turbo')  # Not supported
except ModelIncompatibilityError as e:
    print(f'Error: {e}')


Error: Model 'gpt-3.5-turbo' is not compatible. Only 'gpt-4.1' and 'gpt-4o' are supported for PDF to Markdown conversion.
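
The same pattern extends to conversion itself, since a `RuntimeError` is listed for PDF read problems. The sketch below wraps any conversion callable and returns `None` on failure. The names `safe_convert` and `failing_convert` are hypothetical, and the failing stand-in only mimics an unreadable file.

```python
from typing import Callable, Optional


def safe_convert(convert: Callable[[str], str], source: str) -> Optional[str]:
    """Run `convert(source)` and return None instead of raising when the PDF cannot be read."""
    try:
        return convert(source)
    except RuntimeError as exc:
        print(f"Could not convert {source!r}: {exc}")
        return None


def failing_convert(source: str) -> str:
    # Stand-in that mimics a read failure
    raise RuntimeError("unable to read PDF")


result = safe_convert(failing_convert, "missing.pdf")
```

With the converter from this notebook, the call would presumably be `safe_convert(scav.dig, pdf_path)`.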
