Merged
22 changes: 20 additions & 2 deletions api-reference/ingest/destination-connector/delta-table.mdx
@@ -2,6 +2,24 @@
title: Delta Table
---

import SharedDeltaTable from '/snippets/dc-shared-text/delta-table.mdx';
import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<SharedDeltaTable />
<NewDocument />

import SharedContentDeltaTable from '/snippets/dc-shared-text/delta-table-cli-api.mdx';
import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx';

<SharedContentDeltaTable/>
<SharedAPIKeyURL/>

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector. This example uses the local source connector:

import DeltaTableAPISh from '/snippets/destination_connectors/delta_table.sh.mdx';
import DeltaTableAPIPyV2 from '/snippets/destination_connectors/delta_table.v2.py.mdx';
import DeltaTableAPIPyV1 from '/snippets/destination_connectors/delta_table.v1.py.mdx';

<CodeGroup>
<DeltaTableAPISh />
<DeltaTableAPIPyV2 />
<DeltaTableAPIPyV1 />
</CodeGroup>
24 changes: 23 additions & 1 deletion open-source/ingest/destination-connectors/delta-table.mdx
@@ -2,6 +2,28 @@
title: Delta Table
---

import SharedDeltaTable from '/snippets/dc-shared-text/delta-table.mdx';
import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<NewDocument />

import SharedDeltaTable from '/snippets/dc-shared-text/delta-table-cli-api.mdx';

<SharedDeltaTable />

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector. This example uses the local source connector.

This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.

import DeltaTableAPISh from '/snippets/destination_connectors/delta_table.sh.mdx';
import DeltaTableAPIPyV2 from '/snippets/destination_connectors/delta_table.v2.py.mdx';
import DeltaTableAPIPyV1 from '/snippets/destination_connectors/delta_table.v1.py.mdx';

<CodeGroup>
<DeltaTableAPISh />
<DeltaTableAPIPyV2 />
<DeltaTableAPIPyV1 />
</CodeGroup>

import SharedPartitionByAPIOSS from '/snippets/ingest-configuration-shared/partition-by-api-oss.mdx';

<SharedPartitionByAPIOSS/>
9 changes: 9 additions & 0 deletions snippets/dc-shared-text/delta-table-cli-api.mdx
@@ -0,0 +1,9 @@
Batch process all your records to store structured outputs in a Delta Table in an Amazon S3 bucket.

You will need:

import SharedDeltaTable from '/snippets/general-shared-text/delta-table.mdx';
import SharedDeltaTableCLIAPI from '/snippets/general-shared-text/delta-table-cli-api.mdx';

<SharedDeltaTable />
<SharedDeltaTableCLIAPI />
25 changes: 0 additions & 25 deletions snippets/dc-shared-text/delta-table.mdx

This file was deleted.

15 changes: 9 additions & 6 deletions snippets/destination_connectors/delta_table.sh.mdx
@@ -1,17 +1,20 @@
```bash Shell
```bash CLI
#!/usr/bin/env bash

# Chunking and embedding are optional.

unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --chunk-elements \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --num-processes 2 \
    --verbose \
  delta-table \
    --table-uri delta-table-dest
    --aws-access-key-id $AWS_ACCESS_KEY_ID \
    --aws-secret-access-key $AWS_SECRET_ACCESS_KEY \
    --table-uri $AWS_S3_URL
```
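The `--additional-partition-args` value in the CLI example is a JSON object escaped by hand for the shell. As a minimal sketch, the same string could be generated with Python's standard `json` and `shlex` modules instead of writing the backslashes manually. The helper name `build_partition_args` is hypothetical and is not part of the Unstructured Ingest CLI or library; the keys and values are the ones shown in the example.

```python
import json
import shlex


def build_partition_args(concurrency: int = 15) -> str:
    """Return a JSON string for --additional-partition-args (hypothetical helper)."""
    args = {
        "split_pdf_page": "true",
        "split_pdf_allow_failed": "true",
        "split_pdf_concurrency_level": concurrency,
    }
    return json.dumps(args)


if __name__ == "__main__":
    payload = build_partition_args()
    # shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
    print(f"--additional-partition-args={shlex.quote(payload)}")
```

The printed flag can be pasted into the command in place of the hand-escaped version.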
@@ -1,4 +1,4 @@
```python Python
```python Python Ingest v1
import os

from unstructured_ingest.connector.delta_table import DeltaTableWriteConfig, SimpleDeltaTableConfig
@@ -20,9 +20,8 @@ from unstructured_ingest.runner.writers.delta_table import (
def get_writer() -> Writer:
    return DeltaTableWriter(
        connector_config=SimpleDeltaTableConfig(
            table_uri="delta-table-dest",
            table_uri=os.getenv("AWS_S3_URL"),
            storage_options={
                "AWS_REGION": "us-east-2",
                "AWS_ACCESS_KEY_ID": os.getenv("AWS_ACCESS_KEY_ID"),
                "AWS_SECRET_ACCESS_KEY": os.getenv("AWS_SECRET_ACCESS_KEY"),
            },
55 changes: 55 additions & 0 deletions snippets/destination_connectors/delta_table.v2.py.mdx
@@ -0,0 +1,55 @@
```python Python Ingest v2
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig

from unstructured_ingest.v2.processes.connectors.delta_table import (
    DeltaTableConnectionConfig,
    DeltaTableAccessConfig,
    DeltaTableUploadStagerConfig,
    DeltaTableUploaderConfig
)

from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalConnectionConfig,
    LocalDownloaderConfig
)

from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

# Chunking and embedding are optional.

if __name__ == "__main__":

    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
        destination_connection_config=DeltaTableConnectionConfig(
            access_config=DeltaTableAccessConfig(
                aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
                aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY")
            ),
            table_uri=os.getenv("AWS_S3_URL")
        ),
        stager_config=DeltaTableUploadStagerConfig(),
        uploader_config=DeltaTableUploaderConfig()
    ).run()
```
15 changes: 15 additions & 0 deletions snippets/general-shared-text/delta-table-cli-api.mdx
@@ -0,0 +1,15 @@
The Delta Table connector dependencies for Amazon S3:

```bash CLI, Python
pip install "unstructured-ingest[delta-table]"
```

import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-dependencies.mdx';

<AdditionalIngestDependencies />

The following environment variables:

- `AWS_S3_URL` - The path to the S3 bucket or folder, formatted as `s3://my-bucket/` (if the files are in the bucket's root) or `s3://my-bucket/my-folder/`, represented by `--table-uri` (CLI) or `table_uri` (Python).
- `AWS_ACCESS_KEY_ID` - The AWS access key ID for the authenticated AWS IAM user, represented by `--aws-access-key-id` (CLI) or `aws_access_key_id` (Python).
- `AWS_SECRET_ACCESS_KEY` - The corresponding AWS secret access key, represented by `--aws-secret-access-key` (CLI) or `aws_secret_access_key` (Python).
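
Before running the CLI or the Python examples, it can help to confirm that these variables are set and that `AWS_S3_URL` has the expected `s3://` shape. The following is a minimal sketch using only the Python standard library; the function name `check_delta_table_env` is hypothetical and is not part of the Unstructured Ingest library.

```python
import os
from urllib.parse import urlparse

# The three variables listed above.
REQUIRED_VARS = ("AWS_S3_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def check_delta_table_env(env=None):
    """Return a list of problems found; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    url = env.get("AWS_S3_URL")
    if url:
        parsed = urlparse(url)
        # Expect s3://my-bucket/ or s3://my-bucket/my-folder/
        if parsed.scheme != "s3" or not parsed.netloc:
            problems.append(
                "AWS_S3_URL should look like s3://my-bucket/ or s3://my-bucket/my-folder/"
            )
    return problems


if __name__ == "__main__":
    for problem in check_delta_table_env():
        print(problem)
```

This only checks that values are present and well formed. It does not verify that the credentials are valid or that the bucket exists.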
75 changes: 75 additions & 0 deletions snippets/general-shared-text/delta-table.mdx
@@ -0,0 +1,75 @@
The Delta Table prerequisites for Amazon S3:

The following video shows how to fulfill the minimum set of S3 prerequisites:

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/_W4565dcUGI"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

The preceding video does not show how to create an AWS account or an S3 bucket.

For more information about prerequisites, see the following:

- An AWS account. [Create an AWS account](https://aws.amazon.com/free).

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/lIdh92JmWtg"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

- An S3 bucket. [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
Additional approaches are in the following video and in the how-to sections at the end of this page.

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/e6w9LwZJFIA"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

- For authenticated bucket read access, the authenticated AWS IAM user must have at minimum the permissions of `s3:ListBucket` and `s3:GetObject` for that bucket. [Learn how](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-policies-s3.html).

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/y4SfQoJpipo"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

- For bucket write access, authenticated access to the bucket must be enabled (anonymous access must not be enabled), and the authenticated AWS IAM user must have at
minimum the permission of `s3:PutObject` for that bucket. [Learn how](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-policies-s3.html).

- For authenticated access, an AWS access key and secret access key for the authenticated AWS IAM user in the account.
[Create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey).

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/MoFTaGJE65Q"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

- If the target files are in the root of the bucket, the path to the bucket, formatted as `protocol://bucket/` (for example, `s3://my-bucket/`).
If the target files are in a folder, the path to the target folder in the S3 bucket, formatted as `protocol://bucket/path/to/folder/` (for example, `s3://my-bucket/my-folder/`).
- If the target files are in a folder, make sure the authenticated AWS IAM user has
authenticated access to the folder as well. [Enable authenticated folder access](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies.html#example-bucket-policies-folders).
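
The permissions listed above (`s3:ListBucket`, `s3:GetObject`, and `s3:PutObject`) can be expressed as an IAM policy document. The following is a minimal sketch, not an official or complete policy: the helper name `build_s3_policy` is hypothetical, and your setup may require additional permissions. It derives the bucket and optional folder from a path such as `s3://my-bucket/my-folder/`.

```python
import json
from urllib.parse import urlparse


def build_s3_policy(s3_url: str) -> dict:
    """Build a minimal IAM policy dict for the given s3:// path (hypothetical helper)."""
    parsed = urlparse(s3_url)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    # Object-level actions apply to keys inside the bucket (or inside the folder).
    object_arn = (
        f"arn:aws:s3:::{bucket}/{prefix}/*" if prefix else f"arn:aws:s3:::{bucket}/*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action applies to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [object_arn],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_policy("s3://my-bucket/my-folder/"), indent=2))
```

The printed JSON can be reviewed and adapted before attaching it to the IAM user whose access key is used with the connector.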