Merged
2 changes: 1 addition & 1 deletion api-reference/workflow/overview.mdx
@@ -222,7 +222,7 @@ The following Unstructured SDKs, tools, and libraries do _not_ work with the Uns
- The [Unstructured JavaScript/TypeScript SDK](/api-reference/partition/sdk-jsts)
- [Local single-file POST requests](/api-reference/partition/sdk-jsts) to the Unstructured Partition Endpoint
- The [Unstructured open source Python library](/open-source/introduction/overview)
-- The [Unstructued Ingest CLI](/ingestion/ingest-cli)
+- The [Unstructured Ingest CLI](/ingestion/ingest-cli)
- The [Unstructured Ingest Python library](/ingestion/python-ingest)

The following Unstructured API URL is also _not_ supported: `https://api.unstructuredapp.io/general/v0/general` (the Unstructured Partition Endpoint URL).
2 changes: 1 addition & 1 deletion ingestion/ingest-cli.mdx
@@ -19,7 +19,7 @@ You can use the Unstructured Ingest CLI to process files locally, or you can use

Local processing does not use an Unstructured API key or API URL.

-Using the Ingest CLI to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows:
+Using the Ingest CLI to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows:

<GetStartedSimpleAPIOnly />

4 changes: 2 additions & 2 deletions ingestion/overview.mdx
@@ -3,10 +3,10 @@ title: Overview
---

<Note>
-Unstructured recommends that you use the [Unstructured API](/api-reference/overview) instead of the
+Unstructured recommends that you use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead of the
Unstructured Ingest CLI or the Unstructured Ingest Python library.

-The Unstructured API provides a full range of partitioning, chunking, embedding, and enrichment options for your files and data.
+The Unstructured UI and API provide a full range of partitioning, chunking, embedding, and enrichment options for your files and data.
It also uses the latest and highest-performing models on the market today, and it has built-in logic to deliver the highest quality results
at the lowest cost.

2 changes: 1 addition & 1 deletion ingestion/python-ingest.mdx
@@ -31,7 +31,7 @@ You can use the Unstructured Ingest Python library to process files locally, or

Local processing does not use an Unstructured API key or API URL.

-Using the Ingest Python library to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows:
+Using the Ingest Python library to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows:

<GetStartedSimpleAPIOnly />

5 changes: 1 addition & 4 deletions open-source/core-functionality/staging.mdx
@@ -3,10 +3,7 @@ title: Staging
---

<Warning>
-
-The `Staging` brick is being deprecated in favor of the new and more comprehensive `Destination Connectors`. To explore the complete list and usage, please refer to [Destination Connectors documentation](/ingestion/destination-connectors/overview).
-
-Note: We are constantly expanding our collection of destination connectors. If you wish to request a specific Destination Connector, you’re encouraged to submit a Feature Request on the [Unstructured GitHub repository](https://github.com/Unstructured-IO/unstructured/issues/new/choose).
+Staging functions in the Unstructured open source library are being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview).
</Warning>

Staging functions in the `unstructured` package help prepare your data for ingestion into downstream systems. A staging function accepts a list of document elements as input and return an appropriately formatted dictionary as output. In the example below, we get our narrative text samples prepared for ingestion into LabelStudio using `the stage_for_label_studio` function. We can take this data and directly upload it into LabelStudio to quickly get started with an NLP labeling task.
25 changes: 12 additions & 13 deletions open-source/introduction/overview.mdx
@@ -3,48 +3,47 @@ title: Unstructured Open Source
sidebarTitle: Overview
---

-<Note>The `unstructured` open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, see the [Unstructured API](/api-reference/overview) instead.</Note>
+<Note>The Unstructured open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead.</Note>

-The `unstructured` [library](https://github.com/Unstructured-IO/unstructured) offers an open-source toolkit
+<Tip>To start using the Unstructured open source library right away, skip ahead to the [quickstart](/open-source/introduction/quick-start).</Tip>
+
+The Unstructured open source library ([GitHub](https://github.com/Unstructured-IO/unstructured), [PyPI](https://pypi.org/project/unstructured/)) offers an open-source toolkit
designed to simplify the ingestion and pre-processing of diverse data formats, including images and text-based documents
such as PDFs, HTML files, Word documents, and more. With a focus on optimizing data workflows for Large Language Models (LLMs),
-`unstructured` provides modular functions and connectors that work seamlessly together. This cohesive system ensures
+the Unstructured open source library provides modular functions and connectors that work seamlessly together. This cohesive system ensures
efficient transformation of unstructured data into structured formats, while also offering adaptability to various platforms
and use cases.

## Key functionality

-* **Precise Document Extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements).
+* **Precise document extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements).

-* **Extensive File Support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types).
+* **Robust file support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types).

-* **Robust Core Functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
+* **Robust core functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:

* [Partitioning](/open-source/core-functionality/partitioning): The partitioning functions in Unstructured enable the extraction of structured content from raw, unstructured documents. This feature is crucial for transforming unorganized data into usable formats, aiding in efficient data processing and analysis.

* [Cleaning](/open-source/core-functionality/cleaning): Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that assist in sanitizing output, removing unwanted content, and improving the performance of NLP models. This step is essential for maintaining the integrity of data before it is passed to downstream applications.

* [Extracting](/open-source/core-functionality/extracting): This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents.

-* [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of `Destination Connectors`.
+* [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview).

* [Chunking](/open-source/core-functionality/chunking): The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).

-* **High-performant Connectors**: The platform includes optimized connectors for efficient data ingestion and output. These comprise [Source Connectors](/ingestion/source-connectors/overview) for data input and [Destination Connectors](/ingestion/destination-connectors/overview) for data export.
-
-
## Common use cases

* Pretraining models
* Fine-tuning models
* Retrieval Augmented Generation (RAG)
* Traditional ETL

-<Note>We do not support GPU usage with the open source library.</Note>
+<Note>GPU usage is not supported for the Unstructured open source library.</Note>

## Limits

-The open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview):
+The Unstructured open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview):

* Not designed for production scenarios.
* Significantly decreased performance on document and table extraction.