support for doc and docx #13

Merged
merged 5 commits into from
Sep 30, 2024

Conversation

gowthamshankar99
Contributor

support for doc and docx

@anjanvb
Contributor

anjanvb commented Sep 10, 2024

@gowthamshankar99 can we make the python-docx library optional and check for the import at run time?
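A minimal sketch of what a run-time import check could look like, assuming a helper that is only called when a DOC/DOCX file is actually processed. The names `require_optional` and `load_docx` are hypothetical and not taken from this PR; the only python-docx call used is `docx.Document(path)`.

```python
import importlib


def require_optional(module_name: str, pip_name: str):
    """Import an optional dependency at call time, or raise a helpful error.

    Hypothetical helper for illustration; not part of this repository.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install {pip_name}"
        ) from exc


def load_docx(path: str):
    """Open a .docx file, importing python-docx only when it is needed."""
    # python-docx is installed as 'python-docx' but imported as 'docx'.
    docx = require_optional("docx", "python-docx")
    return docx.Document(path)
```

With this shape, users who never touch DOCX input do not need python-docx installed, and those who do get an actionable error message instead of a bare `ModuleNotFoundError`.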

@gowthamshankar99
Contributor Author

@anjanvb all set with the requested changes.

Contributor

@anjanvb anjanvb left a comment

Looks good to me. As a next-iteration improvement, we can look at sending the raw text from a DOCX file to the model instead of converting each page to an image. That said, converting to images ensures that any inline figures are preserved.
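For reference, a rough sketch of the raw-text idea, under the assumption that only paragraph text is needed. It uses the standard library rather than python-docx: a `.docx` file is a ZIP archive whose main body lives in `word/document.xml`. The function name `docx_raw_text` is hypothetical. Legacy `.doc` files are a different binary format and are not handled here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_raw_text(source) -> str:
    """Return paragraph text from a .docx file (path or binary file-like object).

    Illustrative only: images, headers, footers, and footnotes are ignored.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each paragraph is split into runs; join the text of every <w:t> node.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This illustrates the trade-off mentioned above: the text comes out cheaply, but inline figures are dropped, which the image-conversion path preserves. With python-docx installed, iterating `Document(path).paragraphs` and reading each `.text` gives similar output at a higher level.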

@anjanvb anjanvb merged commit 26d96d5 into awslabs:main Sep 30, 2024

2 participants