Docs: add Unstructured.io blurb to S3 and Google Drive source connectors (#32413)
airbytehq · Apr 23, 2024 · a05d84f
1 parent 6269b7f

Showing 3 changed files with 12 additions and 2 deletions.
diff --git a/docs/integrations/sources/azure-blob-storage.md b/docs/integrations/sources/azure-blob-storage.md
@@ -207,7 +207,10 @@ The Document File Type Format is a special format that allows you to extract tex
 
 One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.
 
-To perform the text extraction from PDF and Docx files, the connector uses the [Unstructured](https://pypi.org/project/unstructured/) Python library.
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
 </FieldAnchor>
 
 ## Changelog

diff --git a/docs/integrations/sources/google-drive.md b/docs/integrations/sources/google-drive.md
@@ -243,6 +243,10 @@ One record will be emitted for each document. Keep in mind that large files can
 
 Before parsing each document, the connector exports Google Document files to Docx format internally. Google Sheets, Google Slides, and drawings are internally exported and parsed by the connector as PDFs.
 
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
 ## Changelog
 
 | Version | Date       | Pull Request                                             | Subject                                                                                      |

diff --git a/docs/integrations/sources/s3.md b/docs/integrations/sources/s3.md
@@ -318,7 +318,10 @@ The Document File Type Format is a special format that allows you to extract tex
 
 One record will be emitted for each document. Keep in mind that large files can emit large records that might not fit into every destination as each destination has different limitations for string fields.
 
-To perform the text extraction from PDF and Docx files, the connector uses the [Unstructured](https://pypi.org/project/unstructured/) Python library.
+#### Parsing via Unstructured.io Python Library
+
+This connector utilizes the open source [Unstructured](https://unstructured-io.github.io/unstructured/introduction.html#product-offerings) library to perform OCR and text extraction from PDFs and MS Word files, as well as from embedded tables and images. You can read more about the parsing logic in the [Unstructured docs](https://unstructured-io.github.io/unstructured/core/partition.html) and you can learn about other Unstructured tools and services at [www.unstructured.io](https://www.unstructured.io).
+
 </FieldAnchor>
 
 ## Changelog