# Unstructured Workflow Endpoint Quickstart

## What this notebook does

This notebook uses the [Unstructured Workflow Endpoint](https://docs.unstructured.io/api-reference/workflow/overview) to process source files stored in an Amazon S3 bucket. Unstructured writes the results as a corresponding JSON file for each processed file, in a separate location in the same S3 bucket.

You can modify this notebook to use different sources and destinations. Unstructured supports various storage, database, and vector store providers. [Learn more](https://docs.unstructured.io/api-reference/overview).

## What you'll need to get started

1. An Unstructured API key. [Learn how to get one](https://docs.unstructured.io/platform-api/api/overview).
2. An Amazon Web Services (AWS) account, along with your AWS access key and AWS secret access key for authentication. You'll also need an Amazon S3 bucket with the correct access settings applied, and two child folders inside this S3 bucket named `input` and `output`. [Learn how to set up all of this](https://docs.unstructured.io/platform-api/api/sources/s3).
3. One or more files uploaded to the `input` folder in the S3 bucket. These are the files that Unstructured will process, sending the processed data into the `output` folder. If you don't have any files readily available to upload into the `input` folder, you can grab some from the [example-docs](https://github.com/Unstructured-IO/unstructured-ingest/tree/main/example-docs) folder in the `Unstructured-IO/unstructured-ingest` repository on GitHub.

## Step 1: Install the Unstructured API package

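Run the following cell to install the Unstructured Python SDK (`unstructured-client`), which gives the notebook access to the Unstructured Workflow Endpoint. The `datetime` module that later cells use to assign unique asset names in Unstructured is part of the Python standard library, so it needs no separate installation.

In [None]:
!pip install -U "unstructured-client>=0.31.1"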

## Step 2: Set variables

In the cell below, replace these placeholders, and then run the cell:

1. Replace `<unstructured-api-key>` with the value of your Unstructured API key.
2. Replace `<bucket-name>` with the name of your S3 bucket.
3. Replace `<aws-access-key>` with the value of your AWS access key.
4. Replace `<aws-secret-access-key>` with the value of your AWS secret access key.

In [None]:
unstructured_api_key = "<unstructured-api-key>"
bucket_uri = "s3://<bucket-name>"
aws_access_key = "<aws-access-key>"
aws_secret_access_key = "<aws-secret-access-key>"
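
If you would rather not paste secrets into the notebook, one option is to read them from environment variables and fall back to the placeholders. The following is a minimal sketch, not part of the original notebook: `read_setting` is a hypothetical helper, and the environment variable names (in particular `S3_BUCKET_NAME`) are assumptions you can change freely.

```python
import os


def read_setting(env_name: str, placeholder: str) -> str:
    """Return the value of env_name, or the placeholder if it is unset or empty."""
    return os.environ.get(env_name) or placeholder


# Hypothetical environment variable names; rename them to match your setup.
unstructured_api_key = read_setting("UNSTRUCTURED_API_KEY", "<unstructured-api-key>")
bucket_name = read_setting("S3_BUCKET_NAME", "<bucket-name>")
bucket_uri = f"s3://{bucket_name}"
aws_access_key = read_setting("AWS_ACCESS_KEY_ID", "<aws-access-key>")
aws_secret_access_key = read_setting("AWS_SECRET_ACCESS_KEY", "<aws-secret-access-key>")
```

The variable names match the ones that the later cells expect, so this cell could stand in for the one above.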

## Step 3: Initialize the endpoint and helper function

Run the following cell to initialize the client for the Unstructured Workflow Endpoint and to define a helper function that later cells use to print response information in a more readable format.

In [None]:
from unstructured_client import UnstructuredClient

unstructured_client = UnstructuredClient(api_key_auth=unstructured_api_key)

def pretty_print_model(response_model):
    print(response_model.model_dump_json(indent=4))

## Step 4: Create the source connector

Run the following cell to create the source connector. If successful, Unstructured prints information about the newly created source connector.

In [None]:
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import S3SourceConnectorConfigInput, CreateSourceConnector, SourceConnectorType
from datetime import datetime

source_connector_config = S3SourceConnectorConfigInput(
    remote_url=f"{bucket_uri}/input",
    key=aws_access_key,
    secret=aws_secret_access_key
)

unique_source_connector_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

response = unstructured_client.sources.create_source(
    request=CreateSourceRequest(
        create_source_connector=CreateSourceConnector(
            name=f"s3-source-{unique_source_connector_suffix}",
            type=SourceConnectorType.S3,
            config=source_connector_config,
        )
    )
)

source_connector_id = response.source_connector_information.id
pretty_print_model(response.source_connector_information)
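
The timestamp suffix is what keeps connector and workflow names from clashing across runs. As a small illustration, the same pattern can be wrapped in one function; `make_unique_name` is a hypothetical helper introduced here, not something the SDK provides.

```python
from datetime import datetime
from typing import Optional


def make_unique_name(prefix: str, now: Optional[datetime] = None) -> str:
    """Append a second-resolution timestamp to prefix, as the cells in this notebook do."""
    moment = now or datetime.now()
    return f"{prefix}-{moment.strftime('%Y-%m-%d-%H-%M-%S')}"


# A fixed datetime is passed here only to make the output reproducible.
print(make_unique_name("s3-source", datetime(2025, 1, 2, 3, 4, 5)))
# prints: s3-source-2025-01-02-03-04-05
```

Because the timestamp has one-second resolution, two names built from the same prefix within the same second would be identical.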

## Step 5: Create the destination connector

Run the following cell to create the destination connector. If successful, Unstructured prints information about the newly created destination connector.

In [None]:
from unstructured_client.models.operations import CreateDestinationRequest
from unstructured_client.models.shared import S3DestinationConnectorConfigInput, CreateDestinationConnector, DestinationConnectorType

destination_connector_config = S3DestinationConnectorConfigInput(
    remote_url=f"{bucket_uri}/output",
    key=aws_access_key,
    secret=aws_secret_access_key
)

unique_destination_connector_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

response = unstructured_client.destinations.create_destination(
    request=CreateDestinationRequest(
        create_destination_connector=CreateDestinationConnector(
            name=f"s3-destination-{unique_destination_connector_suffix}",
            type=DestinationConnectorType.S3,
            config=destination_connector_config,
        )
    )
)

destination_connector_id = response.destination_connector_information.id
pretty_print_model(response.destination_connector_information)

## Step 6: Create the workflow

Run the following cell to create the workflow. If successful, Unstructured prints information about the newly created workflow.

In [None]:
from unstructured_client.models.shared import (
    WorkflowNode,
    WorkflowType,
    Schedule
)

# Partition the content by using a vision language model (VLM).
partition_node = WorkflowNode(
    name="Partitioner",
    subtype="vlm",
    type="partition",
    settings={
        "provider": "anthropic",
        "provider_api_key": None,
        "model": "claude-3-5-sonnet-20241022",
        "output_format": "text/html",
        "user_prompt": None,
        "format_html": True,
        "unique_element_ids": True,
        "is_dynamic": True,
        "allow_fast": True
    }
)

# Summarize each detected image.
image_summarizer_node = WorkflowNode(
    name="Image summarizer",
    subtype="openai_image_description",
    type="prompter",
    settings={}
)

# Summarize each detected table.
table_summarizer_node = WorkflowNode(
    name="Table summarizer",
    subtype="anthropic_table_description",
    type="prompter",
    settings={}
)

# Label each recognized named entity.
named_entity_recognizer_node = WorkflowNode(
    name="Named entity recognizer",
    subtype="openai_ner",
    type="prompter",
    settings={
        "prompt_interface_overrides": None
    }
)

# Chunk the partitioned content.
chunk_node = WorkflowNode(
    name="Chunker",
    subtype="chunk_by_title",
    type="chunk",
    settings={
        "unstructured_api_url": None,
        "unstructured_api_key": None,
        "multipage_sections": False,
        "combine_text_under_n_chars": 0,
        "include_orig_elements": True,
        "new_after_n_chars": 1500,
        "max_characters": 2048,
        "overlap": 160,
        "overlap_all": False,
        "contextual_chunking_strategy": None
    }
)

# Generate vector embeddings.
embed_node = WorkflowNode(
    name="Embedder",
    subtype="azure_openai",
    type="embed",
    settings={
        "model_name": "text-embedding-3-large"
    }
)

unique_workflow_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

response = unstructured_client.workflows.create_workflow(
    request={
        "create_workflow": {
            "name": f"s3-to-s3-custom-workflow-{unique_workflow_suffix}",
            "source_id": source_connector_id,
            "destination_id": destination_connector_id,
            "workflow_type": WorkflowType.CUSTOM,
            "workflow_nodes": [
                partition_node,
                image_summarizer_node,
                table_summarizer_node,
                named_entity_recognizer_node,
                chunk_node,
                embed_node
            ],
            "schedule": Schedule("monthly")
        }
    }
)

workflow_id = response.workflow_information.id
pretty_print_model(response.workflow_information)
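
The workflow above lists its nodes in pipeline order: partition first, then the three prompter (enrichment) nodes, then chunk, then embed. If you rearrange or remove nodes, a quick local check can catch an accidental reordering before you call the API. The sketch below works on plain strings; `STAGE_ORDER` and `stages_in_order` are hypothetical names, and treating this order as a requirement is an assumption based only on how this notebook arranges its nodes.

```python
from typing import List

# Stage order used by the workflow in this notebook. Treating it as the
# required order is an assumption, not a documented rule.
STAGE_ORDER = ["partition", "prompter", "chunk", "embed"]


def stages_in_order(node_types: List[str]) -> bool:
    """Return True if the node types never step backward in STAGE_ORDER.

    Raises ValueError for a node type that is not listed in STAGE_ORDER.
    """
    ranks = [STAGE_ORDER.index(node_type) for node_type in node_types]
    return ranks == sorted(ranks)


notebook_order = ["partition", "prompter", "prompter", "prompter", "chunk", "embed"]
print(stages_in_order(notebook_order))          # prints: True
print(stages_in_order(["chunk", "partition"]))  # prints: False
```

With the node objects from the cell above, you could pass a list built from each node's `type` value, assuming it compares equal to these plain strings.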

## Step 7: Run the workflow

Run the following cell to start running the workflow. Unstructured will begin processing the files in the `input` folder, sending the processed data into the `output` folder. If successful, Unstructured prints information about the job that the workflow run started.

In [None]:
res = unstructured_client.workflows.run_workflow(
    request={
        "workflow_id": workflow_id,
    }
)

pretty_print_model(res.job_information)

## Step 8: Get the workflow run's job ID

Run the following cell to get the workflow run's job ID, which is needed to poll for job completion later. If successful, Unstructured prints the job's ID.

In [None]:
response = unstructured_client.jobs.list_jobs(
    request={
        "workflow_id": workflow_id
    }
)

last_job = response.response_list_jobs[0]
job_id = last_job.id
print(f"job_id: {job_id}")
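
Taking the first item of the list assumes that the service returns the most recent job first. If you would rather not depend on that ordering, you can pick the newest job explicitly. The sketch below uses a stand-in `JobStub` dataclass rather than the SDK's job model, and it assumes that each job exposes a `created_at` timestamp; check the job information printed in Step 7 to confirm that field exists in your SDK version.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class JobStub:
    """Stand-in for a job record; only the two fields used here are modeled."""
    id: str
    created_at: datetime


def newest_job(jobs: List[JobStub]) -> JobStub:
    """Return the job with the latest created_at, or raise if the list is empty."""
    if not jobs:
        raise ValueError("no jobs found for this workflow yet")
    return max(jobs, key=lambda job: job.created_at)


jobs = [
    JobStub(id="job-a", created_at=datetime(2025, 1, 1, 9, 0, 0)),
    JobStub(id="job-b", created_at=datetime(2025, 1, 2, 9, 0, 0)),
]
print(newest_job(jobs).id)  # prints: job-b
```

The explicit empty-list check also gives a clearer error than an `IndexError` if the job has not been registered yet.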

## Step 9: Poll for job completion

Run the following cell to confirm the job has finished running. If successful, Unstructured prints `"status": "COMPLETED"` within the information about the job.

In [None]:
import time

def poll_job_status(job_id):
    while True:
        response = unstructured_client.jobs.get_job(
            request={
                "job_id": job_id
            }
        )

        job = response.job_information

        if job.status == "SCHEDULED":
            print("Job is scheduled, polling again in 10 seconds...")
            time.sleep(10)
        elif job.status == "IN_PROGRESS":
            print("Job is in progress, polling again in 10 seconds...")
            time.sleep(10)
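        else:
            # Any status other than SCHEDULED or IN_PROGRESS ends the polling,
            # so check the printed status to confirm that the job succeeded.
            print("Job is no longer scheduled or in progress")
            break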

    return job

job = poll_job_status(job_id)
pretty_print_model(job)
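
The loop above polls indefinitely, which can leave a cell running if a job stalls. A variant with a timeout is sketched below. It is deliberately independent of the SDK: `poll_until_done` is a hypothetical helper that takes any zero-argument function returning the current status as a string, and the 30-minute default timeout is an arbitrary assumption.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    interval_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it leaves SCHEDULED/IN_PROGRESS, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in ("SCHEDULED", "IN_PROGRESS"):
            return status  # Terminal status; the caller decides whether it means success.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        sleep(interval_seconds)


# Simulated status sequence, so the example runs without calling the API.
fake_statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
final_status = poll_until_done(lambda: next(fake_statuses), interval_seconds=0)
print(final_status)  # prints: COMPLETED
```

In the notebook, `get_status` could be a small function that calls `unstructured_client.jobs.get_job` as shown above and returns `job_information.status`, assuming that value compares equal to these plain strings as it does in the original loop.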

## Step 10: View the processed data

After the job has successfully completed, look in your S3 bucket's `output` folder for the processed data.
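
Once you download one of the output files, a quick way to get a feel for its contents is to count the elements by type. The sketch below assumes that each output file holds a JSON array of element objects with a `type` field; `count_element_types` is a hypothetical helper, and the sample data is made up for illustration rather than taken from a real run.

```python
import json
from collections import Counter


def count_element_types(json_text: str) -> Counter:
    """Count the elements in a JSON array by their "type" field."""
    elements = json.loads(json_text)
    return Counter(element.get("type", "unknown") for element in elements)


# Made-up sample standing in for the contents of one downloaded output file.
sample = json.dumps([
    {"type": "Title", "element_id": "a1", "text": "Quarterly report"},
    {"type": "NarrativeText", "element_id": "b2", "text": "Revenue grew."},
    {"type": "NarrativeText", "element_id": "c3", "text": "Costs fell."},
])

print(count_element_types(sample))
# prints: Counter({'NarrativeText': 2, 'Title': 1})
```

To use it on a real file, read the file's text with `open(path, encoding="utf-8").read()` and pass the result to the function.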

## Learn more

Learn more about the:

- [Unstructured Workflow Endpoint](https://docs.unstructured.io/api-reference/workflow/overview)
- [Unstructured user interface (UI)](https://docs.unstructured.io/ui/overview)