Segmenting Markdown-converted PDFs into pages #86

Closed
umarbutler opened this issue Feb 17, 2024 · 7 comments · Fixed by #197

Comments

@umarbutler

Hi @VikParuchuri,
Thank you very much for creating this invaluable package which I have found extremely useful in several projects already. I just wanted to ask if an option could be added to indicate where pages start and end in the outputted Markdown? Even having the ability to add a custom delimiter such as <page> would help.

@umarbutler
Author

umarbutler commented Feb 17, 2024

For anyone else interested in preserving page boundaries, I managed to add a page delimiter by:

  1. Replacing the merge_lines() function in markdown.py with the following:
    def merge_lines(blocks, page_blocks: List[Page]):
        text_blocks = []
        prev_type = None
        prev_line = None
        block_text = ""
        block_type = ""
        common_line_heights = [p.get_line_height_stats() for p in page_blocks]
        for page_i, page in enumerate(blocks):
            for block in page:
                block_type = block.most_common_block_type()
                if block_type != prev_type and prev_type:
                    text_blocks.append(
                        FullyMergedBlock(
                            text=block_surround(block_text, prev_type),
                            block_type=prev_type
                        )
                    )
                    block_text = ""
    
                prev_type = block_type
                # Join lines in the block together properly
                for i, line in enumerate(block.lines):
                    line_height = line.bbox[3] - line.bbox[1]
                    prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                    prev_line_x = prev_line.bbox[0] if prev_line else 0
                    prev_line = line
                    is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
    
                    if block_text:
                        block_text = line_separator(block_text, line.text, block_type, is_continuation)
                    else:
                        block_text = line.text
    
            # This is where the magic happens!
            if page_i != len(blocks) - 1:
                block_text += '\ufffc'  # U+FFFC, Unicode's object replacement character
            # This is where the magic ends!
    
        # Append the final block
        text_blocks.append(
            FullyMergedBlock(
                text=block_surround(block_text, prev_type),
                block_type=block_type
            )
        )
        return text_blocks
  2. Replacing lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå" in the line_separator() function of markdown.py with lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå\ufffc". This ensures that delimiters do not cause newlines to be inserted in the middle of lines.

This uses U+FFFC (Unicode's object replacement character, written '\ufffc' in Python) instead of <page> as it is a single character and can therefore be added directly to the lowercase_letters regex character set instead of having to rework regex patterns. You may replace it with any other character of your choosing.
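To illustrate why a single-character delimiter is convenient, here is a minimal sketch. The character set is the one quoted above, but the pattern around it and the ends_mid_sentence() helper are hypothetical stand-ins, not marker's actual regex. The sketch only assumes that line_separator() tests whether the previous line ends in a character from lowercase_letters.

```python
import re

PAGE_DELIMITER = "\ufffc"  # U+FFFC, the object replacement character

# Character set quoted in the issue; the pattern built around it is assumed.
lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå"


def ends_mid_sentence(line: str, letters: str) -> bool:
    """Hypothetical check: does the line end in a character from the set?"""
    return re.search(rf"[{letters}]$", line) is not None


# A line that continues onto the next page, so it ends with the delimiter.
line = "the sentence continues on the next" + PAGE_DELIMITER

# With the original set, the trailing delimiter hides the lowercase ending.
without_delimiter = ends_mid_sentence(line, lowercase_letters)

# Appending the single delimiter character to the set restores the match.
with_delimiter = ends_mid_sentence(line, lowercase_letters + PAGE_DELIMITER)
```

A multi-character marker such as <page> could not be dropped into the character set this way, because each of its characters would be treated as a separate member of the set.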

This is a bit of a hacky solution so I'd still like to see page segmentation implemented officially in marker.
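Once the delimiter is in the output, recovering the pages is a plain string split. This is a minimal sketch, assuming the Markdown was produced with the patch above and that U+FFFC does not otherwise occur in the document; split_pages() is a hypothetical helper, not part of marker.

```python
from typing import List

PAGE_DELIMITER = "\ufffc"  # U+FFFC, the object replacement character


def split_pages(markdown: str, delimiter: str = PAGE_DELIMITER) -> List[str]:
    """Split converted Markdown into per-page strings on the delimiter."""
    return [page.strip() for page in markdown.split(delimiter)]


# Made-up sample output: the delimiter sits between pages, not after the last.
sample = (
    "# Title\n\nFirst page text" + PAGE_DELIMITER
    + "\n\nSecond page text" + PAGE_DELIMITER
    + "Third page"
)
pages = split_pages(sample)
```

Because the patch only adds the delimiter between pages, a document with N pages yields N - 1 delimiters and therefore N entries in the list.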

davidpomerenke added a commit to danu-insight/marker that referenced this issue Feb 20, 2024
@nunamia

nunamia commented Feb 21, 2024

Yes, you need to edit schema.py
image

and edit markdown.py
`def merge_lines(blocks, page_blocks: List[Page]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page in blocks:
        for block in page:
            block_pnum = block.pnum
            block_type = block.most_common_block_type()
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""
            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks`
image

@Terranic

Terranic commented Apr 28, 2024

@nunamia How about opening a pull request to merge this solution?

However, I'm observing issues with the page numbers. I have a document from the EU Parliament where every page has content, but the page numbers appear too often and jump.

image

@umarbutler
Author

@Terranic Try out my solution; I haven't run into that issue with it.

@VikParuchuri
Owner

Thanks for the script, @umarbutler. This is on my list of features to include, as a few people have asked for it.

@HaileyStorm

Here's a script to monkeypatch Marker with @umarbutler's solution:

import ast
import inspect
import marker.postprocessors.markdown


class MarkdownTransformer(ast.NodeTransformer):
    def __init__(self):
        self.current_function = None

    def visit_FunctionDef(self, node):
        # Store the current function name
        self.current_function = node.name
        # Visit all the child nodes within the function
        self.generic_visit(node)
        # Reset current function name to None after leaving the function
        self.current_function = None
        return node

    def visit_Assign(self, node):
        if self.current_function == 'line_separator':
            if isinstance(node.targets[0], ast.Name) and node.targets[0].id == 'lowercase_letters':
                if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
                    original_value = node.value.value  # might want node.value.s
                    # Append U+FFFC (the object replacement character) to the character set
                    new_value = original_value + '|\ufffc'
                    node.value = ast.Constant(value=new_value)
        return node

    def visit_For(self, node):
        if self.current_function == 'merge_lines':
            # Check if the loop iterates over a variable named 'page'
            if isinstance(node.target, ast.Name) and node.target.id == 'page':
                # Change the loop to use enumerate
                node.iter = ast.Call(
                    func=ast.Name(id='enumerate', ctx=ast.Load()),
                    args=[node.iter],
                    keywords=[]
                )
                node.target = ast.Tuple(elts=[
                    ast.Name(id='page_i', ctx=ast.Store()),
                    ast.Name(id='page', ctx=ast.Store())
                ], ctx=ast.Store())

                # Create the additional check and append operation
                page_check = ast.parse("""
if page_i != len(blocks) - 1:
    block_text += '\ufffc'  # U+FFFC, the object replacement character
""").body[0]
                node.body.append(page_check)
        return node


# Get the source code and make the AST
markdown_source = inspect.getsource(marker.postprocessors.markdown)
markdown_ast = ast.parse(markdown_source)

# Create the AST transformer instance
markdown_transformer = MarkdownTransformer()

# Perform the transformation (explores the tree and applies defined transformation functions, returning the new tree)
markdown_ast = markdown_transformer.visit(markdown_ast)
# Fix missing locations in the modified AST
ast.fix_missing_locations(markdown_ast)

# Replace the functions in the actual module - e.g. internal module calls to
# marker.postprocessors.markdown.line_separator will use the updated version.
exec(compile(markdown_ast, filename='<ast>', mode='exec'), marker.postprocessors.markdown.__dict__)
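The same parse, transform, recompile technique can be tried on a toy module before pointing it at Marker. This is a minimal sketch: toy_source, AppendToConstant and toy_markdown are hypothetical names used only for illustration, and the toy function merely stands in for marker.postprocessors.markdown.line_separator.

```python
import ast
import types

# A toy module standing in for marker.postprocessors.markdown (assumption).
toy_source = '''
def line_separator():
    lowercase_letters = "a-z"
    return lowercase_letters
'''


class AppendToConstant(ast.NodeTransformer):
    """Append U+FFFC to the string assigned to `lowercase_letters`."""

    def visit_Assign(self, node):
        target = node.targets[0]
        if (
            isinstance(target, ast.Name)
            and target.id == "lowercase_letters"
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            node.value = ast.Constant(value=node.value.value + "\ufffc")
        return node


# Parse the source, rewrite the tree, and fill in line/column info
# for the newly created Constant node.
tree = AppendToConstant().visit(ast.parse(toy_source))
ast.fix_missing_locations(tree)

# Execute the rewritten tree in a fresh module namespace.
toy_module = types.ModuleType("toy_markdown")
exec(compile(tree, filename="<ast>", mode="exec"), toy_module.__dict__)

patched_value = toy_module.line_separator()
```

Executing into a fresh module keeps the experiment isolated. The script above instead executes into the real module's __dict__ so that Marker's internal calls pick up the patched functions.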

@knysfh

knysfh commented Jun 3, 2024

To save others some debugging: using @umarbutler's method requires changing two files, marker/schema/merged.py and marker/postprocessors/markdown.py.

Note: tested on marker-pdf==0.2.5.

merged.py

from collections import Counter
from typing import List, Optional

from pydantic import BaseModel

from marker.schema.bbox import BboxElement


class MergedLine(BboxElement):
    text: str
    fonts: List[str]

    def most_common_font(self):
        counter = Counter(self.fonts)
        return counter.most_common(1)[0][0]


class MergedBlock(BboxElement):
    lines: List[MergedLine]
    pnum: int
    block_type: Optional[str]


class FullyMergedBlock(BaseModel):
    text: str
    block_type: str
    pnum: int

markdown.py: replace the merge_lines function.

def merge_lines(blocks: List[List[MergedBlock]]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    # common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page_i, page in enumerate(blocks):
        for block in page:
            block_pnum = block.pnum
            block_type = block.block_type
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""

            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x

                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

        # This is where the magic happens!
        if page_i != len(blocks) - 1:
            block_text += '\ufffc'  # U+FFFC, Unicode's object replacement character
        # This is where the magic ends!

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks
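With pnum carried on every merged block, the text can also be grouped per page without relying on the delimiter. This is a minimal sketch, assuming blocks arrive in reading order. The Block dataclass is a hypothetical stand-in for FullyMergedBlock, so the example runs without marker or pydantic installed, and blocks_by_page() is not part of marker.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """Hypothetical stand-in for FullyMergedBlock (text, block_type, pnum)."""
    text: str
    block_type: str
    pnum: int


def blocks_by_page(blocks: List[Block]) -> Dict[int, str]:
    """Join the text of all blocks that share a page number."""
    pages: Dict[int, List[str]] = {}
    for block in blocks:
        pages.setdefault(block.pnum, []).append(block.text)
    return {pnum: "\n\n".join(texts) for pnum, texts in pages.items()}


# Made-up blocks for illustration only.
example_blocks = [
    Block(text="# Title", block_type="Title", pnum=0),
    Block(text="Intro paragraph.", block_type="Text", pnum=0),
    Block(text="Next page paragraph.", block_type="Text", pnum=1),
]
page_text = blocks_by_page(example_blocks)
```

The grouping is only as reliable as the pnum values themselves, so it is worth spot-checking a few blocks against the source PDF.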


6 participants