Partition pdf #301

Merged
merged 15 commits into from
May 18, 2024

Conversation

fsatsuki
Contributor

Issue #, if available:
#297

Description of changes:
Previously, PDFs were analyzed with unstructured.partition.auto. With this change, unstructured.partition.pdf, which is capable of more detailed structural analysis, can be selected instead.
To enable unstructured.partition.pdf, set "enable_partition_pdf": true in cdk.json.
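
As a rough illustration of the switch described above, the sketch below picks between the two partitioners based on a boolean flag. It is not the code from this PR: the helper names, the `ENABLE_PARTITION_PDF` environment variable, and the choice of `strategy="hi_res"` are assumptions made for the example. Only `partition` and `partition_pdf` are real `unstructured` functions.

```python
import os


def use_partition_pdf(path: str, enabled: bool) -> bool:
    """Return True when the detailed PDF partitioner should handle `path`."""
    return enabled and path.lower().endswith(".pdf")


def partition_document(path: str, enabled: bool):
    """Partition one file, choosing the partitioner from the flag (hypothetical helper)."""
    # Imported lazily so use_partition_pdf() works without `unstructured` installed.
    if use_partition_pdf(path, enabled):
        from unstructured.partition.pdf import partition_pdf

        # Assumption: the layout-aware "hi_res" strategy is the detailed mode meant here.
        return partition_pdf(filename=path, strategy="hi_res")

    from unstructured.partition.auto import partition

    return partition(filename=path)


# Hypothetical: assumes the cdk.json flag reaches the container as this env var.
ENABLE_PARTITION_PDF = os.environ.get("ENABLE_PARTITION_PDF", "false").lower() == "true"
```

Non-PDF files fall through to `partition` even when the flag is on, so enabling the option only changes how PDFs are handled.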

unstructured.partition.pdf takes much longer to run.
To shorten processing time, this PR implements parallel processing with multiple processes and makes the embedding container size configurable in cdk.json.
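
The multi-process approach can be sketched with the standard library alone. This is a minimal example under stated assumptions, not the implementation in backend/embedding/main.py: `count_pages_stub` is a hypothetical stand-in for the slow per-document partition step, and `process_documents` and its parameters are illustrative names.

```python
import concurrent.futures as cf
import os
from typing import Callable, Iterable, List, Optional


def count_pages_stub(path: str) -> int:
    """Hypothetical stand-in for the slow per-document partition step."""
    return len(path)


def process_documents(
    paths: Iterable[str],
    worker: Callable[[str], int] = count_pages_stub,
    max_workers: Optional[int] = None,
    executor_cls=cf.ProcessPoolExecutor,
) -> List[int]:
    """Run `worker` on every path in a pool and return results in input order."""
    paths = list(paths)
    if not paths:
        return []
    # Default: no more workers than there are documents or CPU cores.
    workers = max_workers or min(len(paths), os.cpu_count() or 1)
    with executor_cls(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(worker, paths))
```

With the default process pool, the worker must be a top-level function so it can be pickled, and the call should sit under an `if __name__ == "__main__":` guard. Because each worker is a separate process, the useful degree of parallelism is bounded by the vCPUs and memory given to the container, which is presumably why the container size is made configurable alongside this change.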

When embedding 30 PDFs ranging from 15 to 150 pages, processing takes 7 minutes with unstructured.partition.auto and 61 minutes with unstructured.partition.pdf.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@statefb statefb self-requested a review May 14, 2024 01:36
@fsatsuki fsatsuki self-assigned this May 14, 2024
@statefb
Contributor

statefb commented May 15, 2024

Memo: comparison with hi-res mode (partition.pdf) enabled, using multiprocessing

When embedding 30 PDFs ranging from 15 to 150 pages, processing takes 7 minutes with unstructured.partition.auto and 61 minutes with unstructured.partition.pdf.

image

Remove the default value settings from EMBEDDING_CONTAINER_VCPU and EMBEDDING_CONTAINER_MEMORY

Use separate retry parameters for postgres and for update_sync_status
@statefb statefb merged commit 4fae0bb into aws-samples:main May 18, 2024
6 checks passed
@statefb
Contributor

statefb commented May 18, 2024

LGTM!

@fsatsuki fsatsuki deleted the partition_pdf branch May 20, 2024 00:47
3 participants