Add Aryn trained DETR model for entity detection #212

Merged
merged 1 commit into from
Jan 31, 2024

Conversation

bohou-aryn
Collaborator

No description provided.

Contributor

@baitsguy left a comment

Is it not possible to do table extraction if we use this?

sycamore/data/bbox.py (outdated)
    def zero(self) -> bool:
        return math.isclose(self.x1, self.x2, rel_tol=1e-6) or math.isclose(self.y1, self.y2, rel_tol=1e-6)

    def iou(self, other: "BoundingBox") -> float:
Contributor

Minor: This is "intersection over union", right? Maybe leave a comment so people know what to search for.
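
For readers searching for the term, here is a minimal sketch of intersection over union for axis-aligned boxes. The `Box` class is a hypothetical stand-in with the same `x1, y1, x2, y2` corner fields as `BoundingBox`; it is not the implementation from this PR.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for BoundingBox with corners (x1, y1) and (x2, y2)."""

    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp at zero so degenerate boxes never produce a negative area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def iou(self, other: "Box") -> float:
        # Intersection over union (IoU): overlap area divided by combined area.
        ix1, iy1 = max(self.x1, other.x1), max(self.y1, other.y1)
        ix2, iy2 = min(self.x2, other.x2), min(self.y2, other.y2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = self.area() + other.area() - inter
        return inter / union if union > 0 else 0.0
```

The value ranges from 0.0 for disjoint boxes to 1.0 for identical ones, which is why it works as a match threshold.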


    @staticmethod
    def _supplement_text(inferred: List[Element], text: List[Element], threshold: float = 0.5) -> List[Element]:
        # this is a n^2 time complexity, but for hungarian, it's too expensive? n^3
Contributor

Can you give a little more context here on your choices? From Googling I see that the Hungarian algorithm is indeed O(n^3), but it's not very clear to me what algorithm we used that is O(n^2). Is it the same as Unstructured? What are the cases where this doesn't work?

Collaborator Author

Fixed. The approach is similar but simplified: I only add text elements to supplement the inferred ones.
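
To make the complexity discussion concrete, here is one way a pairwise supplement step could look. This is a guess at the shape of the approach, not the code from this PR: the tuple-based `Box` type, the `iou` helper, and the rule "append a text box only if it overlaps no inferred box at or above the threshold" are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x1, y1, x2, y2) layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def supplement_text(inferred: List[Box], text: List[Box], threshold: float = 0.5) -> List[Box]:
    """Keep every inferred box and append text boxes that match none of them."""
    result = list(inferred)
    for t in text:
        # One linear scan over the inferred boxes per text box: O(n * m) overall.
        if all(iou(t, i) < threshold for i in inferred):
            result.append(t)
    return result
```

Unlike the Hungarian algorithm, this does not compute an optimal one-to-one assignment; it only asks whether each text box is already covered, which is what keeps it to a pairwise scan.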

        return inferred

    def partition_pdf(self, file: BinaryIO, threshold: float = 0.4) -> List[List["Element"]]:
        with tempfile.TemporaryDirectory() as tmp_dir, tempfile.NamedTemporaryFile() as tmp_file:
Contributor

Do we write to a temporary file because pdf2image/pdfminer/detr expect it? Do they only support file names, or can you pass in a BinaryIO object? I know pdf2image supports binary input:

images = pdf2image.convert_from_bytes(doc.binary_representation)

but I don't know about the others. It's not a big problem to use a temp file; I'm just curious why it's necessary.

Collaborator Author

I could not get it to work directly from bytes; I will try a couple more times.
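
For context, the temp-file pattern under discussion amounts to copying the `BinaryIO` contents to disk so that a path-only consumer can open it. The sketch below uses only the standard library; `read_header_from_path` is a hypothetical stand-in for any library call that accepts a file name but not a stream, and the `input.pdf` file name is likewise made up.

```python
import io
import os
import tempfile
from typing import BinaryIO


def read_header_from_path(path: str, n: int = 5) -> bytes:
    """Hypothetical stand-in for a library call that only accepts a file name."""
    with open(path, "rb") as f:
        return f.read(n)


def with_temp_copy(file: BinaryIO) -> bytes:
    """Copy the stream to a temporary file, then hand the path to the consumer."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "input.pdf")
        with open(path, "wb") as out:
            out.write(file.read())
        # The file is closed here, so the bytes are flushed before it is reopened.
        return read_header_from_path(path)


header = with_temp_copy(io.BytesIO(b"%PDF-1.7 example"))
```

Writing inside a `TemporaryDirectory` and closing the file before reuse sidesteps the platform-specific limits on reopening a `NamedTemporaryFile` by name while it is still open.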

sycamore/transforms/detr_partitioner.py
Comment on lines +104 to +109
        results = self.processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=threshold)[
            0
        ]
Contributor

I guess since this passes linting this must be what black generated, right? What a weird format for array indexing :)
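
If the split subscript is distracting, one possible alternative is to bind the returned list to a name and index it on the next line, which gives the formatter a shorter statement to work with. The sketch below assumes the call returns a list with one entry per image; `post_process` and `per_image_results` are hypothetical names, not part of this PR.

```python
from typing import Any, Dict, List


def post_process(batch_size: int) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for a call that returns one result dict per image."""
    return [{"image_index": i} for i in range(batch_size)]


# Bind the per-image list first, then index it on a separate line.
per_image_results = post_process(batch_size=1)
results = per_image_results[0]
```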

sycamore/transforms/detr_partitioner.py (outdated)
@bohou-aryn merged commit 4279307 into main on Jan 31, 2024
10 checks passed
@bohou-aryn deleted the deformable branch on January 31, 2024 01:28
3 participants