Use the filename and the text of the document's first four pages to improve created_date extraction accuracy #116
The head ref may contain hidden characters: "bug/115-created_date-\uCD94\uCD9C-\uC2E4\uD328"
Conversation
Summary of Changes
Hello @inoray, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request, from the document …
Highlights
Code Review
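This PR is notable for changing created_date extraction to use the filename and the text of the document's first four pages in order to improve accuracy. Overall, the necessary information is passed down to the lower-level functions cleanly via **kwargs. I have left a few suggestions on correcting type hints and simplifying the code to improve readability and maintainability, so please take a look.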
| return 0 | ||
| def apply_metadata_enrichment(self, document: DoclingDocument) -> bool: | ||
| def apply_metadata_enrichment(self, document: DoclingDocument, **kwargs: dict) -> bool: |
The type hint **kwargs: dict means that each value in kwargs must be of type dict. However, this function can receive string values such as org_filename. For **kwargs, it is generally better to use Any as the type hint so that keyword arguments of various types can be accepted.
| def apply_metadata_enrichment(self, document: DoclingDocument, **kwargs: dict) -> bool: | |
| def apply_metadata_enrichment(self, document: DoclingDocument, **kwargs: Any) -> bool: |
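The point about the annotation applying to each value, not to the kwargs mapping as a whole, can be checked with a minimal sketch. The function names `with_dict_hint` and `with_any_hint` and the sample filename are hypothetical and are not part of this PR; only `org_filename` comes from the review comment.

```python
from typing import Any, get_type_hints


def with_dict_hint(**kwargs: dict) -> bool:
    # The annotation claims that every keyword *value* is a dict.
    return "org_filename" in kwargs


def with_any_hint(**kwargs: Any) -> bool:
    # The annotation allows keyword values of any type.
    return "org_filename" in kwargs


# The recorded hint describes each value, not the kwargs mapping itself.
dict_hint = get_type_hints(with_dict_hint)["kwargs"]
any_hint = get_type_hints(with_any_hint)["kwargs"]

# Python does not enforce hints at runtime, so both calls succeed.
# A static type checker would be expected to reject only the first call,
# because a str is passed where the hint promises a dict.
found_dict = with_dict_hint(org_filename="report_2024.pdf")
found_any = with_any_hint(org_filename="report_2024.pdf")
```

Because nothing fails at runtime, the mismatch only surfaces in static analysis, which is why the suggested change is a hint correction and not a behavior change.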
docling/utils/document_enrichment.py
Outdated
| def _extract_document_metadata_date(self, document_content): | ||
| def _extract_document_metadata_date(self, document_content, **kwargs): |
docling/utils/document_enrichment.py
Outdated
| custom_user = self.enrichment_options.metadata_user_prompt | ||
| # Remove unnecessary tags (tags that indicate image information) | ||
| document_content = document_content.replace("<!-- image -->", "").strip() |
| # Simple functional API | ||
| def enrich_document(document: DoclingDocument, enrichment_options: DataEnrichmentOptions) -> DoclingDocument: | ||
| def enrich_document(document: DoclingDocument, enrichment_options: DataEnrichmentOptions, **kwargs: dict) -> DoclingDocument: |
The type hint **kwargs: dict means that each value in kwargs must be of type dict. However, this function can receive string values such as org_filename. For **kwargs, it is generally better to use Any as the type hint so that keyword arguments of various types can be accepted.
| def enrich_document(document: DoclingDocument, enrichment_options: DataEnrichmentOptions, **kwargs: dict) -> DoclingDocument: | |
| def enrich_document(document: DoclingDocument, enrichment_options: DataEnrichmentOptions, **kwargs: Any) -> DoclingDocument: |
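As a rough illustration of how a value such as org_filename could travel from the functional API down to the date-extraction step via **kwargs: Any, here is a simplified sketch. The functions `enrich` and `extract_created_date` are hypothetical stand-ins that operate on plain strings; they are not the PR's `enrich_document` or `_extract_document_metadata_date`, whose bodies are not shown here.

```python
from typing import Any, Optional


def extract_created_date(content: str, **kwargs: Any) -> dict:
    # Hypothetical stand-in for the extraction step: it only gathers the
    # inputs that would be handed on to the actual date extractor.
    org_filename: Optional[str] = kwargs.get("org_filename")
    # Strip the image placeholder tag, mirroring the replace() call in the diff.
    cleaned = content.replace("<!-- image -->", "").strip()
    return {"filename": org_filename, "text": cleaned}


def enrich(content: str, **kwargs: Any) -> dict:
    # Forward every keyword argument unchanged to the lower-level function.
    return extract_created_date(content, **kwargs)


result = enrich("<!-- image -->\nIssued 2024-03-01", org_filename="report_2024.pdf")
missing = enrich("Issued 2024-03-01")  # org_filename omitted, so it falls back to None
```

Reading the value with `kwargs.get` keeps the argument optional, so existing callers that do not pass a filename continue to work.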
Checklist: