diff --git a/api-reference/ingest/destination-connector/delta-table.mdx b/api-reference/ingest/destination-connector/delta-table.mdx index 71c39a62..896423cb 100644 --- a/api-reference/ingest/destination-connector/delta-table.mdx +++ b/api-reference/ingest/destination-connector/delta-table.mdx @@ -2,6 +2,24 @@ title: Delta Table --- -import SharedDeltaTable from '/snippets/dc-shared-text/delta-table.mdx'; +import NewDocument from '/snippets/general-shared-text/new-document.mdx'; - + + +import SharedContentDeltaTable from '/snippets/dc-shared-text/delta-table-cli-api.mdx'; +import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx'; + + + + +Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector: + +import DeltaTableAPISh from '/snippets/destination_connectors/delta_table.sh.mdx'; +import DeltaTableAPIPyV2 from '/snippets/destination_connectors/delta_table.v2.py.mdx'; +import DeltaTableAPIPyV1 from '/snippets/destination_connectors/delta_table.v1.py.mdx'; + + + + + + diff --git a/open-source/ingest/destination-connectors/delta-table.mdx b/open-source/ingest/destination-connectors/delta-table.mdx index 71c39a62..99242e17 100644 --- a/open-source/ingest/destination-connectors/delta-table.mdx +++ b/open-source/ingest/destination-connectors/delta-table.mdx @@ -2,6 +2,28 @@ title: Delta Table --- -import SharedDeltaTable from '/snippets/dc-shared-text/delta-table.mdx'; +import NewDocument from '/snippets/general-shared-text/new-document.mdx'; + + + +import SharedDeltaTable from '/snippets/dc-shared-text/delta-table-cli-api.mdx'; + +Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector. + +This example sends files to Unstructured API services for processing by default. 
To process files locally instead, see the instructions at the end of this page. + +import DeltaTableAPISh from '/snippets/destination_connectors/delta_table.sh.mdx'; +import DeltaTableAPIPyV2 from '/snippets/destination_connectors/delta_table.v2.py.mdx'; +import DeltaTableAPIPyV1 from '/snippets/destination_connectors/delta_table.v1.py.mdx'; + + + + + + + +import SharedPartitionByAPIOSS from '/snippets/ingest-configuration-shared/partition-by-api-oss.mdx'; + + diff --git a/snippets/dc-shared-text/delta-table-cli-api.mdx b/snippets/dc-shared-text/delta-table-cli-api.mdx new file mode 100644 index 00000000..af9a42a1 --- /dev/null +++ b/snippets/dc-shared-text/delta-table-cli-api.mdx @@ -0,0 +1,9 @@ +Batch process all your records to store structured outputs in a Delta Table in an Amazon S3 bucket. + +You will need: + +import SharedDeltaTable from '/snippets/general-shared-text/delta-table.mdx'; +import SharedDeltaTableCLIAPI from '/snippets/general-shared-text/delta-table-cli-api.mdx'; + + + \ No newline at end of file diff --git a/snippets/dc-shared-text/delta-table.mdx b/snippets/dc-shared-text/delta-table.mdx deleted file mode 100644 index 042d160f..00000000 --- a/snippets/dc-shared-text/delta-table.mdx +++ /dev/null @@ -1,25 +0,0 @@ -Batch process all your records using `unstructured-ingest` to store structured outputs locally on your filesystem and upload those local files to a Delta Table. - -First you’ll need to install the delta table dependencies as shown here. - -```bash -pip install "unstructured-ingest[delta-table]" -``` - -The upstream connector can be any of the ones supported, but for convenience here, showing a sample command using the upstream local connector. 
- -import DeltaTableSh from '/snippets/destination_connectors/delta_table.sh.mdx'; -import DeltaTablePy from '/snippets/destination_connectors/delta_table.py.mdx'; - - - - - - - - - - -For a full list of the options the Unstructured Ingest CLI accepts check `unstructured-ingest delta-table --help`. - -NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you’re running this locally. You can find more information about this in the [installation guide](/open-source/installation/overview). \ No newline at end of file diff --git a/snippets/destination_connectors/delta_table.sh.mdx b/snippets/destination_connectors/delta_table.sh.mdx index e1219f78..ca28c6d3 100644 --- a/snippets/destination_connectors/delta_table.sh.mdx +++ b/snippets/destination_connectors/delta_table.sh.mdx @@ -1,4 +1,4 @@ -```bash Shell +```bash CLI #!/usr/bin/env bash # Chunking and embedding are optional. @@ -6,12 +6,15 @@ unstructured-ingest \ local \ --input-path $LOCAL_FILE_INPUT_DIR \ - --output-dir $LOCAL_FILE_OUTPUT_DIR \ + --partition-by-api \ + --api-key $UNSTRUCTURED_API_KEY \ + --partition-endpoint $UNSTRUCTURED_API_URL \ --strategy hi_res \ - --chunk-elements \ + --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \ + --chunking-strategy by_title \ --embedding-provider huggingface \ - --num-processes 2 \ - --verbose \ delta-table \ - --table-uri delta-table-dest + --aws-access-key-id $AWS_ACCESS_KEY_ID \ + --aws-secret-access-key $AWS_SECRET_ACCESS_KEY \ + --table-uri $AWS_S3_URL ``` diff --git a/snippets/destination_connectors/delta_table.py.mdx b/snippets/destination_connectors/delta_table.v1.py.mdx similarity index 94% rename from snippets/destination_connectors/delta_table.py.mdx rename to snippets/destination_connectors/delta_table.v1.py.mdx index 25295f01..342c6408 100644 --- 
a/snippets/destination_connectors/delta_table.py.mdx +++ b/snippets/destination_connectors/delta_table.v1.py.mdx @@ -1,4 +1,4 @@ -```python Python +```python Python Ingest v1 import os from unstructured_ingest.connector.delta_table import DeltaTableWriteConfig, SimpleDeltaTableConfig @@ -20,9 +20,8 @@ from unstructured_ingest.runner.writers.delta_table import ( def get_writer() -> Writer: return DeltaTableWriter( connector_config=SimpleDeltaTableConfig( - table_uri="delta-table-dest", + table_uri=os.getenv("AWS_S3_URL"), storage_options={ - "AWS_REGION": "us-east-2", "AWS_ACCESS_KEY_ID": os.getenv("AWS_ACCESS_KEY_ID"), "AWS_SECRET_ACCESS_KEY": os.getenv("AWS_SECRET_ACCESS_KEY"), }, diff --git a/snippets/destination_connectors/delta_table.v2.py.mdx b/snippets/destination_connectors/delta_table.v2.py.mdx new file mode 100644 index 00000000..1472af4e --- /dev/null +++ b/snippets/destination_connectors/delta_table.v2.py.mdx @@ -0,0 +1,55 @@ +```python Python Ingest v2 +import os + +from unstructured_ingest.v2.pipeline.pipeline import Pipeline +from unstructured_ingest.v2.interfaces import ProcessorConfig + +from unstructured_ingest.v2.processes.connectors.delta_table import ( + DeltaTableConnectionConfig, + DeltaTableAccessConfig, + DeltaTableUploadStagerConfig, + DeltaTableUploaderConfig +) + +from unstructured_ingest.v2.processes.connectors.local import ( + LocalIndexerConfig, + LocalConnectionConfig, + LocalDownloaderConfig +) + +from unstructured_ingest.v2.processes.partitioner import PartitionerConfig +from unstructured_ingest.v2.processes.chunker import ChunkerConfig +from unstructured_ingest.v2.processes.embedder import EmbedderConfig + +# Chunking and embedding are optional. 
+ +if __name__ == "__main__": + + Pipeline.from_configs( + context=ProcessorConfig(), + indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")), + downloader_config=LocalDownloaderConfig(), + source_connection_config=LocalConnectionConfig(), + partitioner_config=PartitionerConfig( + partition_by_api=True, + api_key=os.getenv("UNSTRUCTURED_API_KEY"), + partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"), + additional_partition_args={ + "split_pdf_page": True, + "split_pdf_allow_failed": True, + "split_pdf_concurrency_level": 15 + } + ), + chunker_config=ChunkerConfig(chunking_strategy="by_title"), + embedder_config=EmbedderConfig(embedding_provider="huggingface"), + destination_connection_config=DeltaTableConnectionConfig( + access_config=DeltaTableAccessConfig( + aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"), + aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY") + ), + table_uri=os.getenv("AWS_S3_URL") + ), + stager_config=DeltaTableUploadStagerConfig(), + uploader_config=DeltaTableUploaderConfig() + ).run() +``` \ No newline at end of file diff --git a/snippets/general-shared-text/delta-table-cli-api.mdx b/snippets/general-shared-text/delta-table-cli-api.mdx new file mode 100644 index 00000000..96f25ab2 --- /dev/null +++ b/snippets/general-shared-text/delta-table-cli-api.mdx @@ -0,0 +1,15 @@ +The Delta Table connector dependencies for Amazon S3: + +```bash CLI, Python +pip install "unstructured-ingest[delta-table]" +``` + +import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-dependencies.mdx'; + + + +The following environment variables: + +- `AWS_S3_URL` - The path to the S3 bucket or folder, formatted as `s3://my-bucket/` (if the files are in the bucket's root) or `s3://my-bucket/my-folder/`, represented by `--table-uri` (CLI) or `table_uri` (Python). +- `AWS_ACCESS_KEY_ID` - The AWS access key ID for the authenticated AWS IAM user, represented by `--aws-access-key-id` (CLI) or `aws_access_key_id` (Python).
+- `AWS_SECRET_ACCESS_KEY` - The corresponding AWS secret access key, represented by `--aws-secret-access-key` (CLI) or `aws_secret_access_key` (Python). \ No newline at end of file diff --git a/snippets/general-shared-text/delta-table.mdx b/snippets/general-shared-text/delta-table.mdx new file mode 100644 index 00000000..8872536d --- /dev/null +++ b/snippets/general-shared-text/delta-table.mdx @@ -0,0 +1,75 @@ +The Delta Table prerequisites for Amazon S3: + +The following video shows how to fulfill the minimum set of S3 prerequisites: + + + +The preceding video does not show how to create an AWS account or an S3 bucket. + +For more information about prerequisites, see the following: + +- An AWS account. [Create an AWS account](https://aws.amazon.com/free). + + + +- An S3 bucket. [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html). + Additional approaches are in the following video and in the how-to sections at the end of this page. + + + +- For authenticated bucket read access, the authenticated AWS IAM user must have at minimum the permissions of `s3:ListBucket` and `s3:GetObject` for that bucket. [Learn how](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-policies-s3.html). + + + +- For bucket write access, authenticated access to the bucket must be enabled (anonymous access must not be enabled), and the authenticated AWS IAM user must have at + minimum the permission of `s3:PutObject` for that bucket. [Learn how](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-policies-s3.html). + +- For authenticated access, an AWS access key and secret access key for the authenticated AWS IAM user in the account. + [Create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey). 
+ + + +- If the target files are in the root of the bucket, the path to the bucket, formatted as `protocol://bucket/` (for example, `s3://my-bucket/`). + If the target files are in a folder, the path to the target folder in the S3 bucket, formatted as `protocol://bucket/path/to/folder/` (for example, `s3://my-bucket/my-folder/`). +- If the target files are in a folder, make sure the authenticated AWS IAM user has + authenticated access to the folder as well. [Enable authenticated folder access](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies.html#example-bucket-policies-folders). \ No newline at end of file
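
The environment variables and S3 path formats documented in this change can be checked before running the CLI or Python examples. The following is a minimal sketch using only the Python standard library. The helper name `check_delta_table_env` and the regular expression are hypothetical and are not part of `unstructured-ingest`. The pattern assumes the trailing-slash formats shown above (`s3://my-bucket/` or `s3://my-bucket/my-folder/`), which may be stricter than what the connector actually accepts.

```python
import os
import re

# Hypothetical pre-flight check; not part of unstructured-ingest.
# Variable names come from the documentation in this change.
REQUIRED_VARS = ("AWS_S3_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

# Assumption: the URL follows the documented formats, with a trailing slash:
#   s3://my-bucket/  or  s3://my-bucket/my-folder/
S3_URL_PATTERN = re.compile(r"^s3://[^/\s]+/(?:[^\s]*/)?$")


def check_delta_table_env(env=None):
    """Return a list of human-readable problems; an empty list means no issues found."""
    env = os.environ if env is None else env
    problems = []

    # Every required variable must be present and non-empty.
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    # Only check the URL format if a value was provided at all.
    url = env.get("AWS_S3_URL")
    if url and not S3_URL_PATTERN.match(url):
        problems.append(
            "AWS_S3_URL should look like s3://my-bucket/ or s3://my-bucket/my-folder/"
        )
    return problems


if __name__ == "__main__":
    for problem in check_delta_table_env():
        print(problem)
```

Running the script prints one line per detected problem and nothing when all three variables are set and the URL matches the assumed format. It does not contact AWS, so it cannot confirm that the credentials are valid or that the IAM user has `s3:PutObject` on the bucket.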