# Marker 

## ... also known as `marker-pdf`

> Marker converts documents to markdown, JSON, and HTML quickly and accurately.
>
> * Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
> * Formats tables, forms, equations, inline math, links, references, and code blocks

In [None]:
!ls -la samples

#### Install `marker-pdf` 

> ... If you want to use marker on documents other than PDFs, you will need to install additional dependencies with:

    pip install marker-pdf[full]

In [1]:
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

In [2]:
import json
import os
import re

import pandas as pd

from bs4 import BeautifulSoup

In [3]:
def marker_pdf(pdf_path, config):
    # create configuration from dict / marker-pdf option key-values
    config_parser = ConfigParser(config)

    # instantiate converter
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=config_parser.get_processors(),
        renderer=config_parser.get_renderer(),
        llm_service=config_parser.get_llm_service()
    )

    # return the rendered object
    return converter(pdf_path)
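
The config dicts passed to `marker_pdf` in the cells that follow are plain key-value mappings. As a minimal sketch, and assuming the Gemini key has been exported in an environment variable called `GEMINI_API_KEY` (an assumed name, not something `marker-pdf` prescribes), a small hypothetical helper such as `build_config` could assemble them without embedding the key in the notebook:

```python
import os


def build_config(output_format="markdown", page_range="0", use_llm=False,
                 env_var="GEMINI_API_KEY"):
    """Build a config dict for marker_pdf (hypothetical helper).

    The option keys are the ones used in this notebook; the environment
    variable name is an assumption.
    """
    config = {"output_format": output_format, "page_range": page_range}
    if use_llm:
        api_key = os.environ.get(env_var)
        if not api_key:
            raise RuntimeError(f"set the {env_var} environment variable first")
        config["use_llm"] = True
        config["gemini_api_key"] = api_key
    return config
```

With this in place, a call like `marker_pdf(pdf_path, build_config(use_llm=True))` would behave like the LLM cells in this notebook while keeping the secret out of the saved file.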

----

## `saintmarc-hd_20250213.pdf`: plain-vanilla vs LLM option

* `saintmarc-hd_20250213.pdf` is a PDF that contains text (rather than a scanned image), and its table layout is relatively clean and straightforward.
* The out-of-the-box, plain-vanilla API and the LLM option (which uses the Google Gemini 2.0 model) render almost the same result: in the output shown, only three `通期` (full-year) cells differ (120.7 vs 120.6, 119.6 vs 119.8, and 115.5 vs 115.4).
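
Rather than comparing the two rendered tables by eye, the difference can be listed cell by cell. The following is a rough sketch only: `parse_md_table` and `diff_tables` are hypothetical helpers, and the two short strings are a two-column excerpt of the tables printed in this section, not the full output.

```python
def parse_md_table(markdown):
    """Return table rows as lists of stripped cells, skipping the |---| separator row."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # drop the outer pipes; a truncated row may lack the trailing one
        body = line[1:-1] if line.endswith("|") and len(line) > 1 else line[1:]
        cells = [c.strip() for c in body.split("|")]
        if all(c and set(c) <= set("-:") for c in cells):
            continue  # header separator row
        rows.append(cells)
    return rows


def diff_tables(a, b):
    """List (row, col, a_value, b_value) for every cell that differs."""
    diffs = []
    for r, (row_a, row_b) in enumerate(zip(a, b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            if cell_a != cell_b:
                diffs.append((r, c, cell_a, cell_b))
    return diffs


plain = """
| 年度 | 通期 |
|------|-------|
| 2022 | 120.7 |
| 2023 | 111.3 |
"""
llm = """
| 年度 | 通期 |
|------|-------|
| 2022 | 120.6 |
| 2023 | 111.3 |
"""
print(diff_tables(parse_md_table(plain), parse_md_table(llm)))
```

To apply this to the real runs, the two `markdown` strings would have to be kept under separate names, since the notebook cells reuse the same variable.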

In [4]:
pdf_path = 'samples/saintmarc-hd_20250213.pdf'

config = {
    "output_format": "markdown",
    "page_range": "0",
}

rendered = marker_pdf(pdf_path, config)
markdown, _, images = text_from_rendered(rendered)

print(markdown)

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_28 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16


Recognizing layout: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.65it/s]
Running OCR Error Detection: 100%|████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.38it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.98it/s]

## 月次売上情報

|       |      | 年度   | 4月    | 5月    | 6月    | 7月    | 8月    | 9月    | 上半期   | 10月   | 11月   | 12月   | 1月    | 2月    | 3月    | 通期    |
|-------|------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 全店売上  | 昨年対比 | 2022 | 118.9 | 144.0 | 126.3 | 110.7 | 124.0 | 127.2 | 124.5 | 115.7 | 107.4 | 106.1 | 122.5 | 140.1 | 120.5 | 120.7 |
|       | (%)  | 2023 | 116.0 | 110.7 | 109.5 | 117.6 | 119.1 | 114.3 | 114.6 | 106.5 | 108.6 | 108.8 | 108.1 | 107.9 | 110.3 | 111.3 |
|       |      | 2024 | 102.6 | 102.4 | 109.9 | 100.7 | 106.6 | 105.6 | 104.6 | 98.8  | 104.5 | 101.8 | 101.2 |       |       |       |
| 既存店売上 | 昨年対比 | 2022 | 115.1 | 126.0 | 122.8 | 111.5 | 124.8 | 127.4 | 120.9 | 115.8 | 107.0 | 106.3 | 123.6 | 143.7 | 123.8 | 119.6 |
|       | (%)  | 2023 | 119.7 | 114.6 | 113.8 | 120.7 | 122.9 | 117.1 | 118.2 | 110.3 | 113.1 | 113.3 | 113.2 | 112.8 | 115.1 | 115.5 |
|       |      | 2024 | 107.1 | 106.3




<hr width=40%/>

In [5]:
pdf_path = 'samples/saintmarc-hd_20250213.pdf'

config = {
    "output_format": "markdown",
    "page_range": "0",
    "use_llm": True,
    # read the key from the environment; never hard-code API keys in a notebook
    "gemini_api_key": os.environ["GEMINI_API_KEY"],
}

rendered = marker_pdf(pdf_path, config)
markdown, _, images = text_from_rendered(rendered)

print(markdown)

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_28 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16


Recognizing layout: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  6.54it/s]
LLM layout relabelling: 0it [00:00, ?it/s]
Running OCR Error Detection: 100%|███████████████████████████████████████████████| 1/1 [00:00<00:00, 163.87it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.10it/s]
LLMTableProcessor running: 1it [00:05,  5.84s/it]
LLMTableMergeProcessor running: 0it [00:00, ?it/s]
LLM processors running: 0it [00:00, ?it/s]

## 月次売上情報

|       |      | 年度    | 4月    | 5月    | 6月    | 7月    | 8月    | 9月    | 上半期   | 10月   | 11月   | 12月   | 1月    | 2月    | 3月    | 通期    |
|-------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 全店売上  | 昨年対比 | 2022  | 118.9 | 144.0 | 126.3 | 110.7 | 124.0 | 127.2 | 124.5 | 115.7 | 107.4 | 106.1 | 122.5 | 140.1 | 120.5 | 120.6 |
|       | (%)  | 2023  | 116.0 | 110.7 | 109.5 | 117.6 | 119.1 | 114.3 | 114.6 | 106.5 | 108.6 | 108.8 | 108.1 | 107.9 | 110.3 | 111.3 |
|       |      | 2024  | 102.6 | 102.4 | 109.9 | 100.7 | 106.6 | 105.6 | 104.6 | 98.8  | 104.5 | 101.8 | 101.2 |       |       |       |
| 既存店売上 | 昨年対比 | 2022  | 115.1 | 126.0 | 122.8 | 111.5 | 124.8 | 127.4 | 120.9 | 115.8 | 107.0 | 106.3 | 123.6 | 143.7 | 123.8 | 119.8 |
|       | (%)  | 2023  | 119.7 | 114.6 | 113.8 | 120.7 | 122.9 | 117.1 | 118.2 | 110.3 | 113.1 | 113.3 | 113.2 | 112.8 | 115.1 | 115.4 |
|       |      | 2024  | 107.1




----

## `saintmarc-hd_20250313.pdf`: plain-vanilla vs LLM option

* In contrast to the 2025-Feb `saintmarc-hd_20250213.pdf`, the <span style="background-color:#AAFFFF;">2025-Mar `saintmarc-hd_20250313.pdf` PDF does not contain text and was instead likely created from an image file.</span>
* Note that the plain-vanilla rendered content has at least three mistakes (for example, several monthly values merged into a single cell, the `9月` and `上半期` headers fused into one, and a stray `-` after the year). Despite combining OCR with layout-understanding models, `marker-pdf` may still struggle with irregular input files.
* With the Gemini 2.0 LLM option, the layout appears to have been fully recovered, and the table rows shown are rendered correctly.
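
Mistakes of the kind seen in the plain-vanilla output (several values collapsed into one cell) can be flagged mechanically. The sketch below is a heuristic under a stated assumption: every cell of a numeric table like this one should hold a single token, so any cell with internal whitespace is suspicious. `find_merged_cells` is a hypothetical helper, and the heuristic would produce false positives on tables with legitimate multi-word cells.

```python
def find_merged_cells(markdown):
    """Return (line_no, col, text) for table cells that hold several whitespace-separated tokens."""
    hits = []
    for line_no, line in enumerate(markdown.splitlines()):
        line = line.strip()
        if not line.startswith("|"):
            continue
        # drop the outer pipes; a truncated row may lack the trailing one
        body = line[1:-1] if line.endswith("|") and len(line) > 1 else line[1:]
        for col, cell in enumerate(body.split("|")):
            tokens = cell.split()
            if len(tokens) > 1:
                hits.append((line_no, col, " ".join(tokens)))
    return hits
```

Run on the markdown from the plain-vanilla cell, this would be expected to report cells such as the fused `9月    上半期` header; an empty result does not prove the table is correct, only that no cell looks merged.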

In [6]:
pdf_path = 'samples/saintmarc-hd_20250313.pdf'

config = {
    "output_format": "markdown",
    "page_range": "0",
}

rendered = marker_pdf(pdf_path, config)
markdown, _, images = text_from_rendered(rendered)

print(markdown)

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_28 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16


Recognizing layout: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  6.43it/s]
Running OCR Error Detection: 100%|███████████████████████████████████████████████| 1/1 [00:00<00:00, 166.75it/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.78it/s]
Recognizing Text: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.77it/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.60it/s]
Recognizing Text: 100%|███████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.17it/s]
Recognizing tables: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.18it/s]

## 月次売上情報

|       |      | 年度     | 4月 | 5月 | 6月                                              | 7月 | 8月 | 9月    上半期 | 10月   | 11月 | 12月 | 1月                                               | 2月 | 3月 | 通期    |
|-------|------|--------|----|----|-------------------------------------------------|----|----|-----------|-------|-----|-----|--------------------------------------------------|----|----|-------|
| 全店売上  | 昨年対比 | 2022 - |    |    | 118.9  144.0  126.3  110.7  124.0  127.2        |    |    | 124.5     | 115.7 |     |     | 107.4  106.1  122.5  140.1  120.5                |    |    | 120.7 |
|       | (%)  | 2023 - |    |    | 116.0  110.7  109.5  117.6  119.1  114.3        |    |    | 114.6     | 106.5 |     |     | 108.6  108.8  108.1  107.9  110.3                |    |    | 111.3 |
|       |      | 2024   |    |    | 102.6  102.4  109.9  100.7  106.6  105.6        |    |    | 104.6     |       |     |     | 98.8  104.5  101.8  101.2  102.5                 |    |    |       |
| 既存




In [7]:
pdf_path = 'samples/saintmarc-hd_20250313.pdf'

config = {
    "output_format": "markdown",
    "page_range": "0",
    "use_llm": True,
    # read the key from the environment; never hard-code API keys in a notebook
    "gemini_api_key": os.environ["GEMINI_API_KEY"],
}

rendered = marker_pdf(pdf_path, config)
markdown, _, images = text_from_rendered(rendered)

print(markdown)

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_28 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16


Recognizing layout: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  6.31it/s]
LLM layout relabelling: 0it [00:00, ?it/s]
Running OCR Error Detection: 100%|███████████████████████████████████████████████| 1/1 [00:00<00:00, 146.60it/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.39it/s]
Recognizing Text: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.88it/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.76it/s]
Recognizing Text: 100%|███████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.19it/s]
Recognizing tables: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.16it/s]
LLMTableProcessor running: 1it [00:07,  7.45s/it]
LLMTableMergeProcessor running: 0it [00:00, ?it/s]
LLM processors running: 0it [00:00, ?it/s]

## 月次売上情報

|       |      | 年度   | 4月    | 5月    | 6月    | 7月    | 8月    | 9月    | 上半期   | 10月   | 11月   | 12月   | 1月    | 2月    | 3月    | 通期    |
|-------|------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 全店売上  | 昨年対比 | 2022 | 118.9 | 144.0 | 126.3 | 110.7 | 124.0 | 127.2 | 124.5 | 115.7 | 107.4 | 106.1 | 122.5 | 140.1 | 120.5 | 120.7 |
|       | (%)  | 2023 | 116.0 | 110.7 | 109.5 | 117.6 | 119.1 | 114.3 | 114.6 | 106.5 | 108.6 | 108.8 | 108.1 | 107.9 | 110.3 | 111.3 |
|       |      | 2024 | 102.6 | 102.4 | 109.9 | 100.7 | 106.6 | 105.6 | 104.6 | 98.8  | 104.5 | 101.8 | 101.2 | 102.5 |       |       |
| 既存店売上 | 昨年対比 | 2022 | 115.1 | 126.0 | 122.8 | 111.5 | 124.8 | 127.4 | 120.9 | 115.8 | 107.0 | 106.3 | 123.6 | 143.7 | 123.8 | 119.6 |
|       | (%)  | 2023 | 119.7 | 114.6 | 113.8 | 120.7 | 122.9 | 117.1 | 118.2 | 110.3 | 113.1 | 113.3 | 113.2 | 112.8 | 115.1 | 115.5 |
|       |      | 2024 | 107.1 | 106.3




----

## `Press_release_car_registrations_February_2025.pdf`: plain-vanilla vs LLM option

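* In the output shown, the plain-vanilla and LLM runs return identical values for the `Romania` row.
* In both runs the three `TOTAL` columns are `None`, which suggests that the parsed row contains fewer cells than the 21 expected data columns, so the values may also be shifted relative to their headers.
* During the LLM run, `LLMTableProcessor` logged `The read operation timed out` after about 30 seconds, so the LLM table correction may not have been applied.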

In [8]:
pdf_path = 'samples/Press_release_car_registrations_February_2025.pdf'

config = {
    "output_format": "json",
    "page_range": "2"
}

rendered = marker_pdf(pdf_path, config)
children, blocktype, metadata = text_from_rendered(rendered)

## <<< HACK!!!
# parse the str representation of the rendered output as a JSON object
children = json.loads(children)
obj = children['children'][0]
#print(obj.keys())

# locate the table JSON in the PDF
hits = list(filter(lambda c: c['block_type']=='Table', obj['children']))
table = hits[0]
#print(table)

# now parse the HTML of the table JSON object
soup = BeautifulSoup(table['html'], features='html.parser')

acc = []
trs = soup.find_all('tr')
for tr in trs:
    vals = []
    for child in tr.children:
        # a cell may hold several '|'-separated values; keep the non-empty ones
        for piece in child.text.split('|'):
            if piece.strip():
                vals.append(piece.strip())
    acc.append(vals)
tmp_df = pd.DataFrame(acc)

df = tmp_df.iloc[3:,1:].copy()

df.index = tmp_df.iloc[3:,0].values

col_level_0 = (
    ['BATTERY ELECTRIC']*3 +
    ['PLUG-IN HYBRID']*3 +
    ['HYBRID ELECTRIC']*3 +
    ['OTHERS']*3 +
    ['PETROL']*3 +
    ['DIESEL']*3 +
    ['TOTAL']*3 
)

col_level_1 = (
    tmp_df.iloc[1:3,:].T
        .dropna(how='all', axis=0)
        .apply(lambda r: ' '.join(r.values), axis=1).values
)

df.columns = pd.MultiIndex.from_tuples(zip(col_level_0, col_level_1))
#df
df.loc['Romania']

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_28 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16


Recognizing layout: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.74it/s]
Running OCR Error Detection: 100%|███████████████████████████████████████████████| 1/1 [00:00<00:00, 128.66it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables: 100%|█████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.14s/it]


BATTERY ELECTRIC  February 2025        724
                  February 2024      1,109
                  % change 25/24     -34.7
PLUG-IN HYBRID    February 2025      5,510
                  February 2024      3,736
                  % change 25/24     +47.5
HYBRID ELECTRIC   February 2025      1,354
                  February 2024        953
                  % change 25/24     +42.1
OTHERS            February 2025      3,007
                  February 2024      3,729
                  % change 25/24     -19.4
PETROL            February 2025      1,255
                  February 2024      1,812
                  % change 25/24     -30.7
DIESEL            February 2025     11,850
                  February 2024     11,339
                  % change 25/24      +4.5
TOTAL             February 2025       None
                  February 2024       None
                  % change 25/24      None
Name: Romania, dtype: object
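
The cell above looks for `Table` blocks only among the direct children of the first page object. If the table were nested one level deeper, `hits[0]` would raise an `IndexError`. A more defensive sketch is a recursive search over the parsed JSON. It relies only on the `block_type`, `children`, and `html` keys already used in this notebook; `find_blocks` is a hypothetical helper, and the assumption that leaf blocks may carry `children: None` is mine.

```python
def find_blocks(node, block_type="Table"):
    """Depth-first search of the parsed JSON tree for blocks of the given type."""
    found = []
    if node.get("block_type") == block_type:
        found.append(node)
    # leaf blocks may have no children, or children set to None (assumption)
    for child in node.get("children") or []:
        found.extend(find_blocks(child, block_type))
    return found
```

Under these assumptions, `find_blocks(children)` (after the `json.loads` call) would return every table on the selected pages in document order, and `tables[0]['html']` could then be handed to `BeautifulSoup` as before.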

<hr width=40%/>

In [9]:
pdf_path = 'samples/Press_release_car_registrations_February_2025.pdf'

config = {
    "output_format": "json",
    "page_range": "2",
    "use_llm": True,
    # read the key from the environment; never hard-code API keys in a notebook
    "gemini_api_key": os.environ["GEMINI_API_KEY"],
}

rendered = marker_pdf(pdf_path, config)
children, blocktype, metadata = text_from_rendered(rendered)

## <<< HACK!!!
# parse the str representation of the rendered output as a JSON object
children = json.loads(children)
obj = children['children'][0]
#print(obj.keys())

# locate the table JSON in the PDF
hits = list(filter(lambda c: c['block_type']=='Table', obj['children']))
table = hits[0]
#print(table)

# now parse the HTML of the table JSON object
soup = BeautifulSoup(table['html'], features='html.parser')

acc = []
trs = soup.find_all('tr')
for tr in trs:
    vals = []
    for child in tr.children:
        # a cell may hold several '|'-separated values; keep the non-empty ones
        for piece in child.text.split('|'):
            if piece.strip():
                vals.append(piece.strip())
    acc.append(vals)
tmp_df = pd.DataFrame(acc)

df = tmp_df.iloc[3:,1:].copy()

df.index = tmp_df.iloc[3:,0].values

col_level_0 = (
    ['BATTERY ELECTRIC']*3 +
    ['PLUG-IN HYBRID']*3 +
    ['HYBRID ELECTRIC']*3 +
    ['OTHERS']*3 +
    ['PETROL']*3 +
    ['DIESEL']*3 +
    ['TOTAL']*3 
)

col_level_1 = (
    tmp_df.iloc[1:3,:].T
        .dropna(how='all', axis=0)
        .apply(lambda r: ' '.join(r.values), axis=1).values
)

df.columns = pd.MultiIndex.from_tuples(zip(col_level_0, col_level_1))
#df
df.loc['Romania']

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_28 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16


Recognizing layout: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.77it/s]
LLM layout relabelling: 0it [00:00, ?it/s]
Running OCR Error Detection: 100%|███████████████████████████████████████████████| 1/1 [00:00<00:00, 129.45it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables: 100%|█████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.14s/it]
LLMTableProcessor running: 1it [00:30, 30.23s/it]


The read operation timed out


LLMTableMergeProcessor running: 0it [00:00, ?it/s]
LLM processors running: 100%|█████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.40s/it]


BATTERY ELECTRIC  February 2025        724
                  February 2024      1,109
                  % change 25/24     -34.7
PLUG-IN HYBRID    February 2025      5,510
                  February 2024      3,736
                  % change 25/24     +47.5
HYBRID ELECTRIC   February 2025      1,354
                  February 2024        953
                  % change 25/24     +42.1
OTHERS            February 2025      3,007
                  February 2024      3,729
                  % change 25/24     -19.4
PETROL            February 2025      1,255
                  February 2024      1,812
                  % change 25/24     -30.7
DIESEL            February 2025     11,850
                  February 2024     11,339
                  % change 25/24      +4.5
TOTAL             February 2025       None
                  February 2024       None
                  % change 25/24      None
Name: Romania, dtype: object
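
The `None` entries under `TOTAL` appear in both runs. One plausible cause (an assumption, not verified against the PDF) is that some row lists in `acc` are shorter than the 22 cells the header implies (one label plus 7 groups of 3 columns), in which case `pd.DataFrame` pads the missing trailing cells with `None`. A quick check on `acc` before building the frame would make this visible; `row_width_report` and `short_rows` are hypothetical helpers.

```python
from collections import Counter


def row_width_report(rows):
    """Count how many rows have each cell count, to spot ragged tables."""
    return Counter(len(row) for row in rows)


def short_rows(rows, expected):
    """Return (row_index, cell_count) for rows with fewer cells than expected."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) < expected]
```

For example, `short_rows(acc[3:], expected=22)` would list the data rows that came back with missing cells. A short row means the values cannot be trusted to sit under the right headers, so such rows are better inspected by hand than loaded as-is.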

----

## Conclusion

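* `marker-pdf`'s understanding of document layout is much better than that of old-school, purely OCR-based approaches.
* However, its accuracy still leaves something to be desired despite the neural-network-based pipeline, as the image-based `saintmarc-hd_20250313.pdf` example shows.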