Commit

Docs: add Unstructured.io blurb to S3 and Google Drive source connectors (#32413)
aaronsteers committed Apr 23, 2024
1 parent 6269b7f commit a05d84f
Showing 3 changed files with 12 additions and 2 deletions.
5 changes: 4 additions & 1 deletion docs/integrations/sources/azure-blob-storage.md
@@ -207,7 +207,10 @@ The Document File Type Format is a special format that allows you to extract tex

One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.

-To perform the text extraction from PDF and Docx files, the connector uses the [Unstructured](https://pypi.org/project/unstructured/) Python library.
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
</FieldAnchor>

## Changelog
4 changes: 4 additions & 0 deletions docs/integrations/sources/google-drive.md
@@ -243,6 +243,10 @@ One record will be emitted for each document. Keep in mind that large files can
Before parsing each document, the connector exports Google Document files to Docx format internally. Google Sheets, Google Slides, and drawings are internally exported and parsed by the connector as PDFs.
+#### Parsing via Unstructured.io Python Library
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
## Changelog
| Version | Date | Pull Request | Subject |
5 changes: 4 additions & 1 deletion docs/integrations/sources/s3.md
@@ -318,7 +318,10 @@ The Document File Type Format is a special format that allows you to extract tex

One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.

-To perform the text extraction from PDF and Docx files, the connector uses the [Unstructured](https://pypi.org/project/unstructured/) Python library.
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
</FieldAnchor>

## Changelog