From b7b9c6853ec05aa5655adc5f0f500421c7a4cd7e Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Tue, 20 May 2025 17:11:04 -0700 Subject: [PATCH 1/3] Open source library: update quickstart --- open-source/introduction/quick-start.mdx | 274 ++++++++++++++++------- 1 file changed, 199 insertions(+), 75 deletions(-) diff --git a/open-source/introduction/quick-start.mdx b/open-source/introduction/quick-start.mdx index c3a02b22..bcf528a6 100644 --- a/open-source/introduction/quick-start.mdx +++ b/open-source/introduction/quick-start.mdx @@ -1,93 +1,217 @@ --- title: Quickstart -description: Using Unstructured Open Source. --- -## Installation -1. **Installing the open source library**: You can install the core SDK using pip: - - `pip install unstructured` - - Plain text files, HTML, XML, JSON, and Emails are immediately supported without any additional dependencies. - - If you need to process other document types, you can install the extras required by following the [Full Installation](/open-source/installation/full-installation) - -2. **System Dependencies**: Ensure the subsequent system dependencies are installed. Your requirements might vary based on the document types you’re handling: - - * _libmagic-dev_ : Essential for filetype detection. +In this quickstart, you use the [Unstructured open source library](/open-source/introduction/overview) +([GitHub](https://github.com/Unstructured-IO/unstructured), +[PyPI](https://pypi.org/project/unstructured/)) along with Python on your local development machine to partition a PDF file into a standard set of +[Unstructured document elements and metadata](/open-source/concepts/document-elements). You can use these elements and +metadata as input into your RAG applications, AI agents, model fine-tuning tasks, and more. + + + + To complete this quickstart, you need: + + - A Python virtual environment manager is recommended to manage your Python code dependencies. 
+ This quickstart uses [uv](https://docs.astral.sh/uv/) for managing virtual environments and + [venv](https://docs.python.org/3/library/venv.html) as the virtual environment type. Installation and + use of `uv` and `venv` are described in the following steps. + However, `uv` and `venv` are not required to use the Unstructured open source library. + - Python 3.9 or higher. You can use `uv` to install Python if needed, as described in the following steps. + - A PDF file on your local machine. If you do not have a PDF file available, this quickstart provides a sample PDF file named + `layout-parser-paper.pdf` that you can download in a later step. (The Unstructured open source library provides + [support for additional file types](/open-source/introduction/supported-file-types) as well.) + + + + + To use `curl` with `sh`: + + ```bash + curl -LsSf https://astral.sh/uv/install.sh | sh + ``` + + To use `wget` with `sh` instead: + + ```bash + wget -qO- https://astral.sh/uv/install.sh | sh + ``` + + + To use PowerShell with `irm` to download the script and run it with `iex`: + + ```powershell + powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" + ``` + + + + To install `uv` by using other approaches such as PyPI, Homebrew, or WinGet, + see [Installing uv](https://docs.astral.sh/uv/getting-started/installation/). + + + `uv` will detect and use Python if you already have it installed. + To view a list of installed Python versions, run the following command: + + ```bash + uv python list + ``` - * _poppler-utils_ : Needed for images and PDFs. - - * _tesseract-ocr_ : Essential for images and PDFs. + If, however, you do not already have Python installed, you can install a version of Python for use with `uv` + by running the following command.
For example, this command installs Python 3.12 for use with `uv`: + + ```bash + uv python install 3.12 + ``` + + + To use `uv` to create a project, switch to the directory on your development machine where you want to + create the project, and then run the following command: + + ```bash + uv init + ``` + + + From your project directory, use `uv` to create a virtual environment with `venv` by running the following command: + + ```bash + # Create the virtual environment by using the current Python version: + uv venv + + # Or, if you want to use a specific Python version: + uv venv --python 3.12 + ``` + + + To activate the virtual environment, run one of the following commands: - * _libreoffice_ : For MS Office documents. + + + - For `bash` or `zsh`, run `source .venv/bin/activate` + - For `fish`, run `source .venv/bin/activate.fish` + - For `csh` or `tcsh`, run `source .venv/bin/activate.csh` + - For `pwsh`, run `.venv/bin/Activate.ps1` + + + - For `cmd.exe`, run `.venv\Scripts\activate.bat` + - For `PowerShell`, run `.venv\Scripts\Activate.ps1` + + - * _pandoc_ : For EPUBs, RTFs, and Open Office documents. Please note that to handle RTF files, you need version 2.14.2 or newer. Running [this script](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/install-pandoc.sh) will install the correct version for you. + To deactivate the virtual environment at any time, run `deactivate`. + + + With the virtual environment activated, install the Unstructured open source library by running the following command: + + ```bash + uv add unstructured + ``` + + The preceding command supports plain text files (`.txt`), HTML files (`.html`), XML files (`.xml`), and emails (`.eml`, `.msg`, and `.p7s`) without any additional dependencies. 
+ + To work with other file types, you must also install these dependencies, as follows, replacing `` with the appropriate extra for the target file type: + + ```bash + uv add "unstructured[]" + ``` + + The following file type extras are available: + + - `all-docs` (for all supported file types in this list) + - `csv` (for `.csv` files only) + - `docx` (for `.doc` and `.docx` files only) + - `epub` (for `.epub` files only) + - `image` (for all supported image file types: `.bmp`, `.heic`, `.jpeg`, `.png`, and `.tiff`) + - `md` (for `.md` files only) + - `odt` (for `.odt` files only) + - `org` (for `.org` files only) + - `pdf` (for `.pdf` files only) + - `pptx` (for `.ppt` and `.pptx` files only) + - `rst` (for `.rst` files only) + - `rtf` (for `.rtf` files only) + - `tsv` (for `.tsv` files only) + - `xlsx` (for `.xls` and `.xlsx` files only) + + As this quickstart uses a sample PDF file, run the following command: - -## Validating Installation - -After installation, confirm the setup by executing the below Python code: - -```python -from unstructured.partition.auto import partition -elements = partition(filename="example-docs/eml/fake-email.eml") -``` - - -If you’ve opted for the “local-inference” installation, you should also be able to execute: - -```python -from unstructured.partition.auto import partition -elements = partition("example-docs/pdf/layout-parser-paper.pdf") - -``` - - -If these code snippets run without errors, congratulations! Your `unstructured` installation is successful and ready for use. - -The following section will cover basic concepts and usage patterns in `unstructured`. After reading this section, you should be able to: - -* Partitioning a document with the `partition` function. - -* Understand how documents are structured in `unstructured`. - -* Convert a document to a dictionary and/or save it as a JSON. 
+ ```bash + uv add "unstructured[pdf]" + ``` + + + For maximum compatibility, you should also install the following system dependencies: + + - [libmagic-dev](https://man7.org/linux/man-pages/man3/libmagic.3.html) (for filetype detection) + - [poppler-utils](https://poppler.freedesktop.org/) and [tesseract-ocr](https://github.com/tesseract-ocr/tesseract) (for images and PDFs), and `tesseract-lang` (for additional language support) + - [libreoffice](https://www.libreoffice.org/discover/libreoffice/) (for Microsoft Office documents) + - [pandoc](https://pandoc.org/) (for `.epub`, `.odt`, and `.rtf` files. For `.rtf` files, you must have version 2.14.2 or newer. Running [this script](https://github.com/Unstructured-IO/unstructured/blob/main/scripts/install-pandoc.sh) will install the correct version for you.) + + Installation instructions for these system dependencies vary by operating system type. For details, follow the preceding links or see your + operating system's documentation. + + + Download the sample PDF file named `layout-parser-paper.pdf` from the following location to your local development machine: + + [https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf) + (You can also use any other PDF file that you want to work with instead of this sample file, if you prefer.) + + + In the project's `main.py` file, add the following Python code, replacing `` with the + path to the `layout-parser-paper.pdf` file that you downloaded to your local development machine. -The example documents in this section come from the [example-docs](https://github.com/Unstructured-IO/unstructured-ingest/tree/main/example-docs) directory in the `unstructured` repo. - -Before running the code in this make sure you’ve installed the `unstructured` library and all dependencies using the instructions in the [quickstart](../installation/overview#quick-start) section.
- -## Partitioning a document - -In this section, we’ll cut right to the chase and get to the most important part of the library: partitioning a document. The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections, and extract the text associated with those sections. Depending on the document type, unstructured uses different methods for partitioning a document. We’ll cover those in a later section. For now, we’ll use the simplest API in the library, the `partition` function. The `partition` function will detect the filetype of the source document and route it to the appropriate partitioning function. You can try out the partition function by running the cell below. - -```python -from unstructured.partition.auto import partition - -elements = partition(filename="example-10k.html") - -``` - - -You can also pass in a file as a file-like object using the following workflow: - -```python -with open("example-10k.html", "rb") as f: - elements = partition(file=f) + (If you want to use a different PDF file, replace `layout-parser-paper` with the name of that PDF file instead.) + + ```python + from unstructured.partition.pdf import partition_pdf + from unstructured.staging.base import elements_to_json + + file_path = "" + base_file_name = "layout-parser-paper" + + def main(): + elements = partition_pdf(filename=f"{file_path}/{base_file_name}.pdf") + elements_to_json(elements=elements, filename=f"{file_path}/{base_file_name}-output.json") + + if __name__ == "__main__": + main() + ``` + + + Run the preceding Python code by running the following command: + + ```bash + uv run main.py + ``` + + + View the Unstructured elements and metadata that were generated by opening the `layout-parser-paper-output.json` file in your editor. This file will be in + the same location as the original `layout-parser-paper.pdf` file. + + (If you used a different PDF file, the output file will be named `-output.json` instead.)
+ + + +## Next steps -``` +import SharedOSSSingleFile from '/snippets/general-shared-text/multi-file-oss-use-connectors.mdx'; +- Learn more about the [available partition functions](/open-source/core-functionality/partitioning) in addition to `partition_pdf` for converting other types of files into standard [Unstructured document elements and metadata](/open-source/concepts/document-elements). +- Learn about [available partitioning strategies](/open-source/concepts/partitioning-strategies) for optimal approaches for converting different types of files into Unstructured document elements. +- Learn about [available chunking functions](/open-source/core-functionality/chunking) for splitting up the text in your document elements into manageable chunks as needed to fit into your models' limited context windows. +- Learn about [available cleaning functions](/open-source/core-functionality/cleaning) for cleaning up your document elements' data as needed. +- Learn about [available extraction functions](/open-source/core-functionality/extracting) for getting precise information out of your document elements as needed. +- Learn about how to [generate vector embeddings](/open-source/core-functionality/embedding) for the text in your document elements for use in RAG applications, AI agents, model fine-tuning tasks, and more. +- For an additional code example, see the [Unstructured Quick Tour](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) Google Colab notebook. +- The Unstructured open source library is also available as a [Docker container](/open-source/installation/docker-installation). -The `partition` function uses [libmagic](https://formulae.brew.sh/formula/libmagic) for filetype detection. If `libmagic` is not present and the user passes a filename, `partition` falls back to detecting the filetype using the file extension. `libmagic` is required if you’d like to pass a file-like object to `partition`. 
We highly recommend installing `libmagic` and you may observe different file detection behaviors if `libmagic` is not installed\`. + -import SharedOSSSingleFile from '/snippets/general-shared-text/multi-file-oss-use-connectors.mdx'; +## Need help? - +To get help, join the [Unstructured Slack community](https://short.unstructured.io/pzw05l7) and post your +questions in the **# ask-for-help-open-source-library** channel. -## Quickstart Tutorial -If you’re eager to dive in, head over [Getting Started](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW#scrollTo=jZp37lfueaeZ) on Google Colab to get a hands-on introduction to the `unstructured` library. In a few minutes, you’ll have a basic workflow set up and running! -For more detailed information about specific components or advanced features, explore the rest of the documentation. From cdcf48b590ed36e9de341330089b37a144883ba1 Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Wed, 21 May 2025 08:41:58 -0700 Subject: [PATCH 2/3] Next steps updates --- api-reference/workflow/overview.mdx | 2 +- ingestion/ingest-cli.mdx | 2 +- ingestion/overview.mdx | 4 ++-- ingestion/python-ingest.mdx | 2 +- open-source/core-functionality/staging.mdx | 5 +---- open-source/introduction/overview.mdx | 23 ++++++++++------------ open-source/introduction/quick-start.mdx | 21 ++++++++++++-------- 7 files changed, 29 insertions(+), 30 deletions(-) diff --git a/api-reference/workflow/overview.mdx b/api-reference/workflow/overview.mdx index e10f3081..d3697b96 100644 --- a/api-reference/workflow/overview.mdx +++ b/api-reference/workflow/overview.mdx @@ -222,7 +222,7 @@ The following Unstructured SDKs, tools, and libraries do _not_ work with the Uns - The [Unstructured JavaScript/TypeScript SDK](/api-reference/partition/sdk-jsts) - [Local single-file POST requests](/api-reference/partition/sdk-jsts) to the Unstructured Partition Endpoint - The [Unstructured open source Python library](/open-source/introduction/overview) -- 
The [Unstructued Ingest CLI](/ingestion/ingest-cli) +- The [Unstructured Ingest CLI](/ingestion/ingest-cli) - The [Unstructured Ingest Python library](/ingestion/python-ingest) The following Unstructured API URL is also _not_ supported: `https://api.unstructuredapp.io/general/v0/general` (the Unstructured Partition Endpoint URL). diff --git a/ingestion/ingest-cli.mdx b/ingestion/ingest-cli.mdx index 932f2e7c..def9a1ba 100644 --- a/ingestion/ingest-cli.mdx +++ b/ingestion/ingest-cli.mdx @@ -19,7 +19,7 @@ You can use the Unstructured Ingest CLI to process files locally, or you can use Local processing does not use an Unstructured API key or API URL. -Using the Ingest CLI to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows: +Using the Ingest CLI to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows: diff --git a/ingestion/overview.mdx b/ingestion/overview.mdx index ee5744d3..4813e0ac 100644 --- a/ingestion/overview.mdx +++ b/ingestion/overview.mdx @@ -3,10 +3,10 @@ title: Overview --- - Unstructured recommends that you use the [Unstructured API](/api-reference/overview) instead of the + Unstructured recommends that you use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead of the Unstructured Ingest CLI or the Unstructured Ingest Python library. - The Unstructured API provides a full range of partitioning, chunking, embedding, and enrichment options for your files and data. + The Unstructured UI and API provide a full range of partitioning, chunking, embedding, and enrichment options for your files and data. It also uses the latest and highest-performing models on the market today, and it has built-in logic to deliver the highest quality results at the lowest cost. 
diff --git a/ingestion/python-ingest.mdx b/ingestion/python-ingest.mdx index 0bff8008..91c347c9 100644 --- a/ingestion/python-ingest.mdx +++ b/ingestion/python-ingest.mdx @@ -31,7 +31,7 @@ You can use the Unstructured Ingest Python library to process files locally, or Local processing does not use an Unstructured API key or API URL. -Using the Ingest Python library to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows: +Using the Ingest Python library to send files in batches to Unstructured for processing is more robust, and usage is billed to you on a pay-as-you-go basis. Usage requires an Unstructured API key and API URL, as follows: diff --git a/open-source/core-functionality/staging.mdx b/open-source/core-functionality/staging.mdx index 431de600..131548e0 100644 --- a/open-source/core-functionality/staging.mdx +++ b/open-source/core-functionality/staging.mdx @@ -3,10 +3,7 @@ title: Staging --- - -The `Staging` brick is being deprecated in favor of the new and more comprehensive `Destination Connectors`. To explore the complete list and usage, please refer to [Destination Connectors documentation](/ingestion/destination-connectors/overview). - -Note: We are constantly expanding our collection of destination connectors. If you wish to request a specific Destination Connector, you’re encouraged to submit a Feature Request on the [Unstructured GitHub repository](https://github.com/Unstructured-IO/unstructured/issues/new/choose). + Staging functions in the Unstructured open source library are being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview). Staging functions in the `unstructured` package help prepare your data for ingestion into downstream systems. 
A staging function accepts a list of document elements as input and return an appropriately formatted dictionary as output. In the example below, we get our narrative text samples prepared for ingestion into LabelStudio using `the stage_for_label_studio` function. We can take this data and directly upload it into LabelStudio to quickly get started with an NLP labeling task. diff --git a/open-source/introduction/overview.mdx b/open-source/introduction/overview.mdx index d1efbfe8..cbbf074c 100644 --- a/open-source/introduction/overview.mdx +++ b/open-source/introduction/overview.mdx @@ -3,22 +3,22 @@ title: Unstructured Open Source sidebarTitle: Overview --- -The `unstructured` open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, see the [Unstructured API](/api-reference/overview) instead. +The Unstructured open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead. -The `unstructured` [library](https://github.com/Unstructured-IO/unstructured) offers an open-source toolkit +The Unstructured open source library ([GitHub](https://github.com/Unstructured-IO/unstructured), [PyPI](https://pypi.org/project/unstructured/)) offers an open-source toolkit designed to simplify the ingestion and pre-processing of diverse data formats, including images and text-based documents such as PDFs, HTML files, Word documents, and more. With a focus on optimizing data workflows for Large Language Models (LLMs), -`unstructured` provides modular functions and connectors that work seamlessly together. This cohesive system ensures +the Unstructured open source library provides modular functions and connectors that work seamlessly together. 
This cohesive system ensures efficient transformation of unstructured data into structured formats, while also offering adaptability to various platforms and use cases. ## Key functionality -* **Precise Document Extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements). +* **Precise document extraction**: Unstructured offers advanced capabilities in extracting elements and metadata from documents. This includes a variety of document element types and metadata. Learn more about [Document elements and metadata](../concepts/document-elements). -* **Extensive File Support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types). +* **Robust file support**: The platform supports a wide array of file types, ensuring versatility in handling different document formats from PDF, Images, HTML, and many more. Detailed information on supported file types can be found [here](/open-source/introduction/supported-file-types). -* **Robust Core Functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes: +* **Robust core functionality**: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes: * [Partitioning](/open-source/core-functionality/partitioning): The partitioning functions in Unstructured enable the extraction of structured content from raw, unstructured documents. This feature is crucial for transforming unorganized data into usable formats, aiding in efficient data processing and analysis. @@ -26,13 +26,10 @@ and use cases. 
* [Extracting](/open-source/core-functionality/extracting): This functionality allows for the extraction of specific entities within documents. It is designed to identify and isolate relevant pieces of information, making it easier for users to focus on the most pertinent data in their documents. - * [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of `Destination Connectors`. - + * [Staging](/open-source/core-functionality/staging): Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of [destination connectors](/ingestion/destination-connectors/overview) in the [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview). + * [Chunking](/open-source/core-functionality/chunking): The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements). - -* **High-performant Connectors**: The platform includes optimized connectors for efficient data ingestion and output. These comprise [Source Connectors](/ingestion/source-connectors/overview) for data input and [Destination Connectors](/ingestion/destination-connectors/overview) for data export. - ## Common use cases * Pretraining models @@ -40,11 +37,11 @@ and use cases. * Retrieval Augmented Generation (RAG) * Traditional ETL -We do not support GPU usage with the open source library. +GPU usage is not supported for the Unstructured open source library. 
## Limits -The open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview): +The Unstructured open source library has the following limits as compared to the [Unstructured UI](/ui/overview) and the [Unstructured API](/api-reference/overview): * Not designed for production scenarios. * Significantly decreased performance on document and table extraction. diff --git a/open-source/introduction/quick-start.mdx b/open-source/introduction/quick-start.mdx index bcf528a6..cc9156e3 100644 --- a/open-source/introduction/quick-start.mdx +++ b/open-source/introduction/quick-start.mdx @@ -84,7 +84,7 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks ``` - To activate the virtual environment, run one of the following commands: + To activate the `venv` virtual environment, run one of the following commands: @@ -102,7 +102,7 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks To deactivate the virtual environment at any time, run `deactivate`. 
- With the virtual environment activated, install the Unstructured open source library by running the following command: + With the virtual environment activated, use `uv` toinstall the Unstructured open source library by running the following command: ```bash uv add unstructured @@ -179,7 +179,7 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks ``` - Run the preceding Python code by running the following command: + Use `uv` to run the preceding Python code by running the following command: ```bash uv run main.py @@ -198,20 +198,25 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks import SharedOSSSingleFile from '/snippets/general-shared-text/multi-file-oss-use-connectors.mdx'; - Learn more about the [available partition functions](/open-source/core-functionality/partitioning) in addition to `partition_pdf` for converting other types of files into standard [Unstructured document elements and metadata](/open-source/concepts/document-elements). -- Learn about [available partitioning strategies](/open-source/concepts/partitioning-strategies) for optimal approaches for converting different types of files into Unstructured document elements. +- By default, the preceding example uses the `auto` partitioning strategy. Learn about other [available partitioning strategies](/open-source/concepts/partitioning-strategies) for fine-tuned approaches to converting different types of files into Unstructured document elements. - Learn about [available chunking functions](/open-source/core-functionality/chunking) for splitting up the text in your document elements into manageable chunks as needed to fit into your models' limited context windows. - Learn about [available cleaning functions](/open-source/core-functionality/cleaning) for cleaning up your document elements' data as needed. 
- Learn about [available extraction functions](/open-source/core-functionality/extracting) for getting precise information out of your document elements as needed. - Learn about how to [generate vector embeddings](/open-source/core-functionality/embedding) for the text in your document elements for use in RAG applications, AI agents, model fine-tuning tasks, and more. - For an additional code example, see the [Unstructured Quick Tour](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) Google Colab notebook. - The Unstructured open source library is also available as a [Docker container](/open-source/installation/docker-installation). - - +- The [Unstructured Ingest CLI and Unstructured Ingest Python library](/ingestion/overview) build upon the Unstructured open source library by providing additional functionality such as batch file processing, + ingesting files from remote source locations and sending the processed files' data to remote destination locations, creating programmatic ETL pipelines, optionally processing files on Unstructured-hosted compute resources instead of locally for improved performance and quality on a pay-as-you-go basis, and more. +- The [Unstructured user interface (UI)](/ui/overview) and [Unstructured API](/api-reference/overview) are superior to the Unstructured open source library, the + Unstructured Ingest CLI, and the Unstructured Ingest Python library. The Unstructured UI and API are designed for production scenarios, with significantly increased performance and quality, + the latest OCR and vision language models, advanced chunking strategies, security compliance, multi-user account management, job scheduling and monitoring, self-hosted deployment options, and more on a pay-as-you-go or subscription basis. ## Need help? -To get help, join the [Unstructured Slack community](https://short.unstructured.io/pzw05l7) and post your -questions in the **# ask-for-help-open-source-library** channel.
+- Join the [Unstructured Slack community](https://short.unstructured.io/pzw05l7) and post your + questions in the **# ask-for-help-open-source-library** channel. +- Post your bug reports and feature requests in the [Unstructured open source library GitHub repository](https://github.com/Unstructured-IO/unstructured/issues). These bug reports and feature requests are evaluated and addressed + based on the interest and availability of the open source community. From 1f168dbab432f509fbaf1b4c2f5e3aebf4922e0a Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Wed, 21 May 2025 08:54:28 -0700 Subject: [PATCH 3/3] Add a skip-ahead link to the quickstart for open source --- open-source/introduction/overview.mdx | 2 ++ open-source/introduction/quick-start.mdx | 18 +++++++++++++----- 2 files changed, 15 insertions(+), 5 deletions(-) diff --git a/open-source/introduction/overview.mdx b/open-source/introduction/overview.mdx index cbbf074c..6588651b 100644 --- a/open-source/introduction/overview.mdx +++ b/open-source/introduction/overview.mdx @@ -5,6 +5,8 @@ sidebarTitle: Overview The Unstructured open source library is designed as a starting point for quick prototyping and has [limits](#limits). For production scenarios, use the [Unstructured user interface (UI)](/ui/overview) or the [Unstructured API](/api-reference/overview) instead. +To start using the Unstructured open source library right away, skip ahead to the [quickstart](/open-source/introduction/quick-start). + The Unstructured open source library ([GitHub](https://github.com/Unstructured-IO/unstructured), [PyPI](https://pypi.org/project/unstructured/)) offers an open-source toolkit designed to simplify the ingestion and pre-processing of diverse data formats, including images and text-based documents such as PDFs, HTML files, Word documents, and more. 
With a focus on optimizing data workflows for Large Language Models (LLMs), diff --git a/open-source/introduction/quick-start.mdx b/open-source/introduction/quick-start.mdx index cc9156e3..f3ae5c6b 100644 --- a/open-source/introduction/quick-start.mdx +++ b/open-source/introduction/quick-start.mdx @@ -65,15 +65,15 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks ``` - To use `uv` to create a project, switch to the directory on your development machine where you want to - create the project, and then run the following command: + Use `uv` to create a project by switching to the directory on your development machine where you want to + create the project and then running the following command: ```bash uv init ``` - From your project directory, use `uv` to create a virtual environment with `venv` by running the following command: + To isolate and manage your project's code dependencies, from your project directory, use `uv` to create a virtual environment with `venv` by running the following command: ```bash # Create the virtual environment by using the current Python version: @@ -102,7 +102,7 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks To deactivate the virtual environment at any time, run `deactivate`. 
- With the virtual environment activated, use `uv` toinstall the Unstructured open source library by running the following command: + With the virtual environment activated to enable code dependency isolation and management, use `uv` to install the Unstructured open source library by running the following command: ```bash uv add unstructured @@ -138,6 +138,12 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks ```bash uv add "unstructured[pdf]" ``` + + Note that you can install multiple extras at the same time by separating them with commas, for example: + + ```bash + uv add "unstructured[pdf,docx]" + ``` For maximum compatibility, you should also install the following system dependencies: @@ -184,9 +190,11 @@ metadata as input into your RAG applications, AI agents, model fine-tuning tasks ```bash uv run main.py ``` + + It might take a few minutes for the command to finish. - View the Unstructured elements and metadata that were generated by opening the `layout-parser-paper-output.json` file in your editor. This file will be in + After the command finishes running successfully, view the Unstructured elements and metadata that were generated by opening the `layout-parser-paper-output.json` file in your editor. This file will be in the same location as the original `layout-parser-paper.pdf` file. (If you used a different PDF file, the output file will be named `-output.json` instead.)
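
Reviewer note: the quickstart's last step asks the reader to open the generated `-output.json` file in an editor. As a possible complement, the following minimal sketch summarizes that file from Python by using only the standard library. It assumes the output is a JSON array of objects that each carry `type` and `text` keys, which matches the document element format the quickstart links to. The helper names `load_elements` and `summarize_elements` are hypothetical and are not part of the `unstructured` package.

```python
import json
from collections import Counter
from pathlib import Path


def load_elements(json_path):
    """Read the JSON file written by the quickstart into a list of dicts."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


def summarize_elements(elements):
    """Return (counts per element type, first non-empty text sample per type).

    Assumes each element is a dict with "type" and "text" keys; elements
    that lack a "type" key are grouped under "Unknown".
    """
    counts = Counter(el.get("type", "Unknown") for el in elements)
    samples = {}
    for el in elements:
        el_type = el.get("type", "Unknown")
        text = (el.get("text") or "").strip()
        # Keep only the first non-empty sample per type, truncated for display.
        if text and el_type not in samples:
            samples[el_type] = text[:80]
    return counts, samples


# Example usage (the path is a placeholder for your own output file):
# counts, samples = summarize_elements(load_elements("layout-parser-paper-output.json"))
# for el_type, n in counts.most_common():
#     print(f"{el_type}: {n}  e.g. {samples.get(el_type, '')!r}")
```

A summary like this can make it easier to spot whether titles, narrative text, and tables were detected as expected before moving on to chunking or embedding.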