@Coniferish Coniferish commented Jun 29, 2023

Summary

Closes #820
Adds include_metadata kwarg to all partition functions except for partition_image.
Adds associated tests for excluding metadata for all partitions.
Extends the add_metadata_with_filetype decorator so that include_metadata documentation is automatically appended to a decorated function's docstring.
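As a rough illustration of the pattern described above, here is a minimal sketch of a decorator that both honors an include_metadata kwarg and appends a note about it to the wrapped function's docstring. Everything in it is hypothetical: with_include_metadata, INCLUDE_METADATA_DOC, partition_demo, and the dict-based elements are stand-ins invented for this example, and the docstring wording is assumed. It is not the implementation of add_metadata_with_filetype in this PR.

```python
import functools
from typing import Any, Callable, Dict, List

# Hypothetical docstring fragment; the wording used in the PR may differ.
INCLUDE_METADATA_DOC = (
    "    include_metadata\n"
    "        Determines whether or not metadata is included in the output.\n"
)


def with_include_metadata(func: Callable[..., List[Dict[str, Any]]]):
    """Hypothetical decorator (not the library's add_metadata_with_filetype)."""

    @functools.wraps(func)
    def wrapper(*args: Any, include_metadata: bool = True, **kwargs: Any):
        elements = func(*args, **kwargs)
        if not include_metadata:
            # Blank out metadata on every element when the caller opts out.
            for element in elements:
                element["metadata"] = {}
        return elements

    # Append the parameter note only if the docstring does not already have it.
    doc = wrapper.__doc__ or ""
    if "include_metadata" not in doc:
        wrapper.__doc__ = doc.rstrip() + "\n" + INCLUDE_METADATA_DOC
    return wrapper


@with_include_metadata
def partition_demo(text: str) -> List[Dict[str, Any]]:
    """Split text into one element per non-empty line.

    Parameters
    ----------
    text
        The text to partition.
    """
    return [
        {"text": line, "metadata": {"line_number": i}}
        for i, line in enumerate(text.splitlines())
        if line.strip()
    ]
```

Calling partition_demo("a\nb", include_metadata=False) returns elements whose metadata is empty, and partition_demo.__doc__ now mentions include_metadata without the function author having written it by hand. Putting the docstring update in the decorator keeps the parameter description in one place across many partition functions.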

Testing

from unstructured.partition import pdf

filename = "example-docs/layout-parser-paper-fast.pdf"
# Partition with metadata disabled, then inspect the first element's metadata
elements = pdf.partition_pdf(filename=filename, strategy="auto", include_metadata=False)
print(elements[0].metadata)

add exclude_metadata to docx

add test for doc to exclude metadata

add include_metadata kwarg to email

add include_metadata kwarg to epub

add include_metadata kwarg to json

add exclude_metadata tests to md

add include_metadata kwarg and tests for msg parse

add include_metadata kwarg and tests for odt parse

add include_metadata kwarg and tests for org parse

add include_metadata kwarg and tests for ppt and pptx parse

add include_metadata kwarg and tests for rst parse

add include_metadata kwarg and tests for rtf parse

add include_metadata tests for text parse

add include_metadata tests for tsv parse

add include_metadata tests for xlsx parse

add include_metadata tests for xml parse
@Coniferish Coniferish requested review from MthwRobinson and removed request for MthwRobinson June 29, 2023 15:47
@Coniferish Coniferish requested a review from MthwRobinson June 29, 2023 16:02

@MthwRobinson MthwRobinson left a comment

One minor request about where the docstrings are being added; otherwise looks good.


@MthwRobinson MthwRobinson left a comment

LGTM once linting is passing!

@MthwRobinson MthwRobinson merged commit e9fdbb0 into Unstructured-IO:main Jun 30, 2023
@Coniferish Coniferish deleted the include_metadata branch July 3, 2023 15:13

Development

Successfully merging this pull request may close these issues.

feat/Add include_metadata across all partition functions
