chore CORE-4775: remove html page number metadata field #2942
Merged: yuming-long merged 10 commits into main from yuming/remove_html_page_numer_metadata_field on Apr 30, 2024
cragwolfe reviewed on Apr 29, 2024
"884be260a86bbdf265c248d5fff5ea00",
"0a23b3ae6bd812b3d90e47fec1df9fe0",
"1e9e5be33c99f7bbf2e569b2430e16cf",
"333e32df62a0ec81a8df07d52dd73c99",
It's a bit of a chore to have to update these.
I realized after we merged this pattern that we could just call partition_html twice and make sure the element_ids are the same. That should be a separate PR, wherever that pattern is followed, though.
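The check suggested in this comment could be sketched roughly as follows. This is a minimal illustration, not code from this PR: `assert_ids_deterministic` and `fake_partition` are hypothetical names, and `fake_partition` is only a stand-in so the snippet runs without the library. In a real test, the callable would wrap partition_html and read each element's ID.

```python
import hashlib
from typing import Callable, Dict, List


def element_ids(partition_fn: Callable[[str], List[Dict[str, str]]], html: str) -> List[str]:
    """Run the partitioner once and collect the element IDs in order."""
    return [element["element_id"] for element in partition_fn(html)]


def assert_ids_deterministic(
    partition_fn: Callable[[str], List[Dict[str, str]]], html: str
) -> List[str]:
    """Partition the same input twice and require identical element IDs."""
    first = element_ids(partition_fn, html)
    second = element_ids(partition_fn, html)
    assert first == second, "element IDs differ between two runs on the same input"
    return first


def fake_partition(html: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in for a real partitioner (assumption, not library code)."""
    lines = [line.strip() for line in html.splitlines() if line.strip()]
    return [
        {
            "text": line,
            # Toy ID: hash of position and text, truncated to 32 hex characters.
            "element_id": hashlib.sha256(f"{i}:{line}".encode("utf-8")).hexdigest()[:32],
        }
        for i, line in enumerate(lines)
    ]


ids = assert_ids_deterministic(fake_partition, "<p>first</p>\n<p>second</p>")
```

A test written this way never needs hard-coded hashes, so fixtures like the ones above would not have to be regenerated when the hashing inputs change.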
…t fixtures update (#2949) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
cragwolfe approved these changes on Apr 30, 2024
Summary
Remove the page_number metadata field from HTML partition output until we have page counting for all kinds of HTML files (not just news articles with multiple <article> tags).

Test
Unit tests test_add_chunking_strategy_on_partition_html_respects_multipage and test_add_chunking_strategy_title_on_partition_auto_respects_multipage are removed, since they rely on the page_number fields from the SEC HTML file. That test is now moved to a mock test for chunk_by_title; revisit those tests when we find a test file for this.

Also changed the element IDs in the partition outputs for HTML files. The IDs change because the page number is an input to element ID hashing. TODO ticket: update other deterministic element ID tests per crag's comment.
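To illustrate why dropping the page number changes every expected ID, here is a toy sketch. It assumes only what the summary states, that the page number feeds into the ID hash. `toy_element_id`, its inputs, and the separator are hypothetical and are not the library's actual hashing scheme.

```python
import hashlib
from typing import Optional


def toy_element_id(text: str, index: int, page_number: Optional[int] = None) -> str:
    """Toy ID: hash the text, its position, and (if present) the page number."""
    parts = [text, str(index)]
    if page_number is not None:
        # Assumption for illustration: page number is one of the hashed inputs.
        parts.append(str(page_number))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:32]


with_page = toy_element_id("Hello", 0, page_number=1)
without_page = toy_element_id("Hello", 0)

# Same text and position, but the hashes differ once page_number is removed.
ids_changed = with_page != without_page
```

Any fixture that stores IDs computed with the page number included would therefore need regenerating once the field is removed.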