Merged
24 changes: 24 additions & 0 deletions api-reference/ingest/destination-connector/duckdb.mdx
@@ -0,0 +1,24 @@
---
title: DuckDB
---

import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<NewDocument />

import SharedContentDuckDB from '/snippets/dc-shared-text/duckdb-cli-api.mdx';
import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx';

<SharedContentDuckDB/>
<SharedAPIKeyURL/>

Now call the Unstructured CLI or Python SDK. You can use any supported source connector. This example uses the local source connector:

import DuckDBAPISh from '/snippets/destination_connectors/duckdb.sh.mdx';
import DuckDBAPIPyV2 from '/snippets/destination_connectors/duckdb.v2.py.mdx';

<CodeGroup>
<DuckDBAPISh />
<DuckDBAPIPyV2 />
</CodeGroup>

24 changes: 24 additions & 0 deletions api-reference/ingest/destination-connector/motherduck.mdx
@@ -0,0 +1,24 @@
---
title: MotherDuck
---

import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<NewDocument />

import SharedContentMotherDuck from '/snippets/dc-shared-text/motherduck-cli-api.mdx';
import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx';

<SharedContentMotherDuck/>
<SharedAPIKeyURL/>

Now call the Unstructured CLI or Python SDK. You can use any supported source connector. This example uses the local source connector:

import MotherDuckAPISh from '/snippets/destination_connectors/motherduck.sh.mdx';
import MotherDuckAPIPyV2 from '/snippets/destination_connectors/motherduck.v2.py.mdx';

<CodeGroup>
<MotherDuckAPISh />
<MotherDuckAPIPyV2 />
</CodeGroup>

1 change: 1 addition & 0 deletions api-reference/ingest/ingest-dependencies.mdx
@@ -60,6 +60,7 @@ To add support for additional connectors, run the following:
| `pip install "unstructured-ingest[delta-table]"` | Delta Tables |
| `pip install "unstructured-ingest[discord]"` | Discord |
| `pip install "unstructured-ingest[dropbox]"` | Dropbox |
| `pip install "unstructured-ingest[duckdb]"` | DuckDB, MotherDuck |
| `pip install "unstructured-ingest[elasticsearch]"` | Elasticsearch |
| `pip install "unstructured-ingest[gcs]"` | Google Cloud Storage |
| `pip install "unstructured-ingest[github]"` | GitHub |
4 changes: 4 additions & 0 deletions mint.json
@@ -213,6 +213,7 @@
"open-source/ingest/destination-connectors/databricks-volumes",
"open-source/ingest/destination-connectors/delta-table",
"open-source/ingest/destination-connectors/dropbox",
"open-source/ingest/destination-connectors/duckdb",
"open-source/ingest/destination-connectors/elasticsearch",
"open-source/ingest/destination-connectors/google-cloud-service",
"open-source/ingest/destination-connectors/kafka",
@@ -221,6 +222,7 @@
"open-source/ingest/destination-connectors/local",
"open-source/ingest/destination-connectors/milvus",
"open-source/ingest/destination-connectors/mongodb",
"open-source/ingest/destination-connectors/motherduck",
"open-source/ingest/destination-connectors/onedrive",
"open-source/ingest/destination-connectors/opensearch",
"open-source/ingest/destination-connectors/pinecone",
@@ -372,6 +374,7 @@
"api-reference/ingest/destination-connector/databricks-volumes",
"api-reference/ingest/destination-connector/delta-table",
"api-reference/ingest/destination-connector/dropbox",
"api-reference/ingest/destination-connector/duckdb",
"api-reference/ingest/destination-connector/elasticsearch",
"api-reference/ingest/destination-connector/google-cloud-service",
"api-reference/ingest/destination-connector/kafka",
@@ -380,6 +383,7 @@
"api-reference/ingest/destination-connector/local",
"api-reference/ingest/destination-connector/milvus",
"api-reference/ingest/destination-connector/mongodb",
"api-reference/ingest/destination-connector/motherduck",
"api-reference/ingest/destination-connector/onedrive",
"api-reference/ingest/destination-connector/opensearch",
"api-reference/ingest/destination-connector/pinecone",
27 changes: 27 additions & 0 deletions open-source/ingest/destination-connectors/duckdb.mdx
@@ -0,0 +1,27 @@
---
title: DuckDB
---

import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<NewDocument />

import SharedDuckDB from '/snippets/dc-shared-text/duckdb-cli-api.mdx';

<SharedDuckDB />

Now call the Unstructured CLI or Python. You can use any supported source connector. This example uses the local source connector.

This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.

import DuckDBAPISh from '/snippets/destination_connectors/duckdb.sh.mdx';
import DuckDBAPIPyV2 from '/snippets/destination_connectors/duckdb.v2.py.mdx';

<CodeGroup>
<DuckDBAPISh />
<DuckDBAPIPyV2 />
</CodeGroup>

import SharedPartitionByAPIOSS from '/snippets/ingest-configuration-shared/partition-by-api-oss.mdx';

<SharedPartitionByAPIOSS/>
27 changes: 27 additions & 0 deletions open-source/ingest/destination-connectors/motherduck.mdx
@@ -0,0 +1,27 @@
---
title: MotherDuck
---

import NewDocument from '/snippets/general-shared-text/new-document.mdx';

<NewDocument />

import SharedMotherDuck from '/snippets/dc-shared-text/motherduck-cli-api.mdx';

<SharedMotherDuck />

Now call the Unstructured CLI or Python. You can use any supported source connector. This example uses the local source connector.

This example sends files to Unstructured API services for processing by default. To process files locally instead, see the instructions at the end of this page.

import MotherDuckAPISh from '/snippets/destination_connectors/motherduck.sh.mdx';
import MotherDuckAPIPyV2 from '/snippets/destination_connectors/motherduck.v2.py.mdx';

<CodeGroup>
<MotherDuckAPISh />
<MotherDuckAPIPyV2 />
</CodeGroup>

import SharedPartitionByAPIOSS from '/snippets/ingest-configuration-shared/partition-by-api-oss.mdx';

<SharedPartitionByAPIOSS/>
9 changes: 9 additions & 0 deletions snippets/dc-shared-text/duckdb-cli-api.mdx
@@ -0,0 +1,9 @@
Batch process all your records to store structured outputs in a DuckDB installation.

The requirements are as follows.

import SharedDuckDB from '/snippets/general-shared-text/duckdb.mdx';
import SharedDuckDBCLIAPI from '/snippets/general-shared-text/duckdb-cli-api.mdx';

<SharedDuckDB />
<SharedDuckDBCLIAPI />
9 changes: 9 additions & 0 deletions snippets/dc-shared-text/motherduck-cli-api.mdx
@@ -0,0 +1,9 @@
Batch process all your records to store structured outputs in a MotherDuck account.

The requirements are as follows.

import SharedMotherDuck from '/snippets/general-shared-text/motherduck.mdx';
import SharedMotherDuckCLIAPI from '/snippets/general-shared-text/motherduck-cli-api.mdx';

<SharedMotherDuck />
<SharedMotherDuckCLIAPI />
19 changes: 19 additions & 0 deletions snippets/destination_connectors/duckdb.sh.mdx
@@ -0,0 +1,19 @@
```bash CLI
#!/usr/bin/env bash

# Chunking and embedding are optional.

unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  duckdb \
    --database $DUCKDB_DATABASE \
    --db-schema $DUCKDB_DB_SCHEMA \
    --table $DUCKDB_TABLE
```
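
The escaped JSON passed to `--additional-partition-args` in the CLI example above is easy to get wrong when edited by hand. As a minimal sketch (not part of this change), the same value could be generated with the Python standard library. The names `partition_args` and `build_flag` are hypothetical, and the option values simply mirror the CLI example, including `"true"` passed as a string.

```python
import json
import shlex

# Same options as the CLI example; values mirror it exactly,
# including "true" passed as a string rather than a boolean.
partition_args = {
    "split_pdf_page": "true",
    "split_pdf_allow_failed": "true",
    "split_pdf_concurrency_level": 15,
}


def build_flag(args: dict) -> str:
    """Return the flag with its JSON payload quoted for a POSIX shell."""
    payload = json.dumps(args)
    return f"--additional-partition-args={shlex.quote(payload)}"


flag = build_flag(partition_args)
print(flag)
```

The printed flag uses single quotes around the JSON instead of backslash-escaped double quotes; after shell parsing, the payload decodes to the same JSON object.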
51 changes: 51 additions & 0 deletions snippets/destination_connectors/duckdb.v2.py.mdx
@@ -0,0 +1,51 @@
```python Python Ingest v2
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig

from unstructured_ingest.v2.processes.connectors.duckdb.duckdb import (
    DuckDBAccessConfig,
    DuckDBConnectionConfig,
    DuckDBUploadStagerConfig,
    DuckDBUploaderConfig
)
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalConnectionConfig,
    LocalDownloaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

# Chunking and embedding are optional.

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
        destination_connection_config=DuckDBConnectionConfig(
            access_config=DuckDBAccessConfig(),
            database=os.getenv("DUCKDB_DATABASE"),
            db_schema=os.getenv("DUCKDB_DB_SCHEMA"),
            table=os.getenv("DUCKDB_TABLE")
        ),
        stager_config=DuckDBUploadStagerConfig(),
        uploader_config=DuckDBUploaderConfig(batch_size=50)
    ).run()
```
20 changes: 20 additions & 0 deletions snippets/destination_connectors/motherduck.sh.mdx
@@ -0,0 +1,20 @@
```bash CLI
#!/usr/bin/env bash

# Chunking and embedding are optional.

unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  motherduck \
    --md-token $MOTHERDUCK_MD_TOKEN \
    --database $MOTHERDUCK_DATABASE \
    --db-schema $MOTHERDUCK_DB_SCHEMA \
    --table $MOTHERDUCK_TABLE
```
51 changes: 51 additions & 0 deletions snippets/destination_connectors/motherduck.v2.py.mdx
@@ -0,0 +1,51 @@
```python Python Ingest v2
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig

from unstructured_ingest.v2.processes.connectors.duckdb.motherduck import (
    MotherDuckAccessConfig,
    MotherDuckConnectionConfig,
    MotherDuckUploadStagerConfig,
    MotherDuckUploaderConfig
)
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalConnectionConfig,
    LocalDownloaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

# Chunking and embedding are optional.

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
        destination_connection_config=MotherDuckConnectionConfig(
            access_config=MotherDuckAccessConfig(md_token=os.getenv("MOTHERDUCK_MD_TOKEN")),
            database=os.getenv("MOTHERDUCK_DATABASE"),
            db_schema=os.getenv("MOTHERDUCK_DB_SCHEMA"),
            table=os.getenv("MOTHERDUCK_TABLE")
        ),
        stager_config=MotherDuckUploadStagerConfig(),
        uploader_config=MotherDuckUploaderConfig(batch_size=50)
    ).run()
```
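
The Python example above reads every setting with `os.getenv`, which returns `None` for an unset variable instead of raising. A small pre-flight check, sketched below with only the standard library, could report missing values before the pipeline is built. The variable names come from the MotherDuck examples in this change; the helper `missing_vars` and the `example_env` dictionary are hypothetical and not part of `unstructured-ingest`.

```python
import os

# Environment variable names used by the MotherDuck examples.
REQUIRED_VARS = (
    "MOTHERDUCK_MD_TOKEN",
    "MOTHERDUCK_DATABASE",
    "MOTHERDUCK_DB_SCHEMA",
    "MOTHERDUCK_TABLE",
)


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Illustrative input only: two of the four variables are present.
example_env = {
    "MOTHERDUCK_MD_TOKEN": "placeholder-token",
    "MOTHERDUCK_DATABASE": "my_db",
}
print(missing_vars(example_env))
```

Calling `missing_vars()` with no arguments checks the real process environment; an empty list means all four variables are set.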
15 changes: 15 additions & 0 deletions snippets/general-shared-text/duckdb-cli-api.mdx
@@ -0,0 +1,15 @@
The DuckDB connector dependencies:

```bash CLI, Python
pip install "unstructured-ingest[duckdb]"
```

import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-dependencies.mdx';

<AdditionalIngestDependencies />

The following environment variables:

- `DUCKDB_DATABASE` - The path to the target DuckDB persistent database file with the extension `.db` or `.duckdb`, represented by `--database` (CLI) or `database` (Python).
- `DUCKDB_DB_SCHEMA` - The name of the target schema in the database, represented by `--db-schema` (CLI) or `db_schema` (Python).
- `DUCKDB_TABLE` - The name of the target table in the schema, represented by `--table` (CLI) or `table` (Python).
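
The `DUCKDB_DATABASE` description above names two file extensions, `.db` and `.duckdb`. As a rough sketch, a local check for that rule could look like the following. The helper `has_duckdb_extension` is hypothetical, the comparison is case-sensitive by assumption, and this change does not state whether the connector itself enforces the extension.

```python
from pathlib import Path

# Extensions named in the DUCKDB_DATABASE description.
ALLOWED_SUFFIXES = {".db", ".duckdb"}


def has_duckdb_extension(database: str) -> bool:
    """Return True if the path ends in one of the documented extensions."""
    return Path(database).suffix in ALLOWED_SUFFIXES


# Illustrative paths only.
for candidate in ("data/output.duckdb", "output.db", "output.sqlite"):
    print(candidate, has_duckdb_extension(candidate))
```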