bugfix! figures missing? #73

Merged
merged 5 commits into from
Mar 14, 2024

@kyleclo kyleclo commented Mar 13, 2024

Hey y'all, sorry, looks like some bugs were introduced when migrating from our internal repo to the public one. This should resolve a lot of issues with Figures. Basically, certain entities like Figures don't have any spans associated with them; they only have boxes:

Entity(spans=[], boxes=[something here])

because they come from vision models.

In this case, the ability to intersect across layers via attribute access with . (see Entity.__getattr__()) is broken, because it previously relied on being able to hit intersect_by_span. I've added a deprecation warning to any use of __getattr__ since it's ambiguous; I recommend that all users call intersect_by_span or intersect_by_box explicitly in the future, which is clearer.
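To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use papermage; `ToyEntity`, `spans_overlap`, `boxes_overlap`, and the `layers` dict are hypothetical names invented for illustration, and the span-only fallback in `__getattr__` is an assumption that mirrors the old behavior described above, not the library's actual code.

```python
import warnings

# Hypothetical stand-ins, NOT papermage classes.
# A span is (start, end) in characters; a box is (left, top, right, bottom).


def spans_overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]


def boxes_overlap(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


class ToyEntity:
    def __init__(self, spans=None, boxes=None):
        self.spans = list(spans or [])
        self.boxes = list(boxes or [])
        self.layers = {}  # layer name -> list of ToyEntity

    def intersect_by_span(self, name):
        # Entities with spans=[] can never match here.
        return [
            e for e in self.layers.get(name, [])
            if any(spans_overlap(s, t) for s in self.spans for t in e.spans)
        ]

    def intersect_by_box(self, name):
        return [
            e for e in self.layers.get(name, [])
            if any(boxes_overlap(b, c) for b in self.boxes for c in e.boxes)
        ]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_") or name not in self.__dict__.get("layers", {}):
            raise AttributeError(name)
        warnings.warn(
            f".{name} is ambiguous; call intersect_by_span or intersect_by_box",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumed fallback for illustration: span-based intersection.
        return self.intersect_by_span(name)


page = ToyEntity(spans=[(0, 100)], boxes=[(0.0, 0.0, 1.0, 1.0)])
page.layers["tokens"] = [ToyEntity(spans=[(0, 5)], boxes=[(0.1, 0.1, 0.2, 0.2)])]
page.layers["figures"] = [ToyEntity(spans=[], boxes=[(0.3, 0.3, 0.8, 0.8)])]

print(len(page.intersect_by_span("figures")))
print(len(page.intersect_by_box("figures")))
```

In this toy setup the span-based lookup finds no figures while the box-based lookup finds one, and `page.figures` emits a `DeprecationWarning` before returning the (empty) span-based result.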

I've then added Figure detection to the CoreRecipe properly; figures are derived from doc.blocks.
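As a rough illustration of what "derived from doc.blocks" could look like, the sketch below filters a list of block records by a type label and keeps only their boxes. The dict layout, the `"type"` key, the `"Figure"` label, and the `figures_from_blocks` helper are all assumptions made for this example; they are not taken from the CoreRecipe code.

```python
from typing import Dict, List


def figures_from_blocks(blocks: List[Dict]) -> List[Dict]:
    """Keep blocks labeled as figures; carry over boxes, leave spans empty."""
    return [
        {"spans": [], "boxes": block["boxes"]}
        for block in blocks
        if block.get("type") == "Figure"  # hypothetical label name
    ]


# Hypothetical block records, as a layout model might produce them.
blocks = [
    {"type": "Text", "boxes": [(0.1, 0.1, 0.9, 0.3)]},
    {"type": "Figure", "boxes": [(0.2, 0.4, 0.8, 0.7)]},
    {"type": "Table", "boxes": [(0.2, 0.75, 0.8, 0.95)]},
]

figures = figures_from_blocks(blocks)
print(figures)
```

The resulting records have empty spans and a single box each, matching the Entity(spans=[], boxes=[...]) shape described earlier.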

Here's a minimal test to validate:

import pathlib

from papermage.magelib import Document
from papermage.recipes import CoreRecipe
from papermage.visualizers.visualizer import plot_entities_on_page

# load doc
recipe = CoreRecipe()
pdfpath = pathlib.Path(__file__).parent.parent / "tests/fixtures/2305.14772.pdf"
doc = recipe.from_pdf(pdf=pdfpath)

# visualize tokens
page_id = 0
plot_entities_on_page(page_image=doc.images[page_id], entities=doc.pages[page_id].tokens)

# visualize tables
page_id = 5
tables = doc.pages[page_id].intersect_by_box("tables")
plot_entities_on_page(page_image=doc.images[page_id], entities=tables)
for table in tables:
    print(table.text)

# visualize figures
figures = doc.pages[page_id].intersect_by_box("figures")
for figure in figures:
    print(figure.text)
plot_entities_on_page(page_image=doc.images[page_id], entities=figures)

# visualize blocks
blocks = doc.pages[page_id].intersect_by_box("blocks")
for block in blocks:
    print(block.text)
plot_entities_on_page(page_image=doc.images[page_id], entities=blocks)

Here's an example of the figure visualization:
[image: figure visualization]

Thanks @aakanksha19 for catching!

@kyleclo kyleclo changed the title bugfix! figures were not plottin properly because of weird interacti… bugfix! figures missing? Mar 14, 2024
@kyleclo kyleclo merged commit 4cb681e into main Mar 14, 2024
4 checks passed
@kyleclo kyleclo deleted the kyle/bugfix-figures branch March 14, 2024 20:15