# Building Graph-Based RAG Applications Is Finally Easy

## Unstructured API tutorial for writing data with NER enrichment to your Astra DB

See the [blog post](https://thenewstack.io/building-graph-based-rag-applications-just-got-easier/) for the full Graph RAG workflow, including how to build a knowledge graph in Astra DB with Unstructured.


Unstructured’s ETL+ for GenAI continuously harvests newly generated unstructured data from systems of record, transforms it into LLM-ready formats using optimized, pre-built pipelines, and writes it to Astra DB. You can deploy complete ingestion and preprocessing pipelines in seconds, with configuration options and third-party integrations for the partitioning, enrichment, chunking, and embedding steps. This lets you build a knowledge graph without writing any code or creating custom steps. The critical NER enrichment step can be configured within the full ETL+ pipeline, which is available in the Unstructured [UI](https://platform.unstructured.io/) or through the API (shown below).

## 1. Installs + credentials

In [None]:
!pip install -qU langchain-openai openai langchain-graph-retriever


In [None]:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY') # Your OpenAI API key

os.environ["S3_AWS_KEY"] = userdata.get('AWS_ACCESS_KEY_ID')
os.environ["S3_AWS_SECRET"] = userdata.get('AWS_SECRET_ACCESS_KEY')
os.environ["S3_REMOTE_URL"] = userdata.get('AWS_S3_URL')



os.environ["ASTRA_DB_APPLICATION_TOKEN"] = userdata.get('ASTRA_DB_APPLICATION_TOKEN')
os.environ["ASTRA_DB_API_ENDPOINT"] = userdata.get('ASTRA_DB_API_ENDPOINT')
os.environ["ASTRA_DB_COLLECTION_NAME"] = userdata.get('ASTRA_DB_COLLECTION_NAME')
os.environ["ASTRA_DB_KEYSPACE"] = userdata.get('ASTRA_DB_KEYSPACE')
os.environ['UNSTRUCTURED_API_KEY'] = userdata.get('UNSTRUCTURED_API_KEY')
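
If you set these variables some other way than Colab secrets (for example, from a shell or a `.env` file), a quick check can catch a missing value before any connector is created. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper, not part of any library used in this tutorial, and the variable names are the ones set in the cell above.

```python
import os

# Variable names taken from the credentials cell in this tutorial.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "S3_AWS_KEY",
    "S3_AWS_SECRET",
    "S3_REMOTE_URL",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_COLLECTION_NAME",
    "ASTRA_DB_KEYSPACE",
    "UNSTRUCTURED_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```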

In [None]:
!pip install --upgrade -q "unstructured-client"  astrapy

In [None]:
from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import CreateSourceConnector
from unstructured_client.models.operations import CreateDestinationRequest
from unstructured_client.models.shared import CreateDestinationConnector



## 2. Create S3 Source Connector

In [None]:
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import (
    CreateSourceConnector,
    SourceConnectorType,
    S3SourceConnectorConfigInput
)

with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
    source_response = client.sources.create_source(
        request=CreateSourceRequest(
            create_source_connector=CreateSourceConnector(
                name="graphrag_s3_source",
                type=SourceConnectorType.S3,
                config=S3SourceConnectorConfigInput(
                    key=os.environ.get('S3_AWS_KEY'),
                    secret=os.environ.get('S3_AWS_SECRET'),
                    remote_url=os.environ.get('S3_REMOTE_URL'),
                    recursive=True
                )
            )
        )
    )

In [None]:
print(source_response.source_connector_information)

config=S3SourceConnectorConfig(anonymous=False, recursive=True, remote_url='s3://nina-test-platform/test-mini/', endpoint_url=Unset(), key='**********', secret='**********', token=Unset()) created_at=datetime.datetime(2025, 3, 26, 20, 39, 13, 646221, tzinfo=TzInfo(UTC)) id='05f18931-c3fd-4f43-896a-570093c1edb8' name='graphrag_s3_source' type=<SourceConnectorType.S3: 's3'> updated_at=datetime.datetime(2025, 3, 26, 20, 39, 14, 112052, tzinfo=TzInfo(UTC))


## 3. Create Astra DB Destination Connector

In [None]:
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import CreateDestinationRequest
from unstructured_client.models.shared import (
    CreateDestinationConnector,
    DestinationConnectorType,
    AstraDBConnectorConfigInput
)

with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
    destination_response = client.destinations.create_destination(
        request=CreateDestinationRequest(
            create_destination_connector=CreateDestinationConnector(
                name="graphrag_astra_destination",
                type=DestinationConnectorType.ASTRADB,
                config=AstraDBConnectorConfigInput(
                    token=os.environ.get('ASTRA_DB_APPLICATION_TOKEN'),
                    api_endpoint=os.environ.get('ASTRA_DB_API_ENDPOINT'),
                    collection_name=os.environ.get('ASTRA_DB_COLLECTION_NAME'),
                    keyspace=os.environ.get('ASTRA_DB_KEYSPACE'),
                    batch_size=20,
                    flatten_metadata=True
                )
            )
        )
    )

In [None]:
print(destination_response.destination_connector_information)

config=AstraDBConnectorConfig(api_endpoint='**********', batch_size=20, collection_name='graph_vectorstore', token='**********', keyspace='default_keyspace') created_at=datetime.datetime(2025, 3, 26, 20, 39, 14, 412404, tzinfo=TzInfo(UTC)) id='2905af93-d681-4a42-a684-981cf250b7a0' name='graphrag_astra_destination' type=<DestinationConnectorType.ASTRADB: 'astradb'> updated_at=datetime.datetime(2025, 3, 26, 20, 39, 14, 840293, tzinfo=TzInfo(UTC))


## 4. Creating Nodes for Workflow

Note that the `named_entity_recognizer_node` creates the metadata for the graph's nodes and edges, so it is critical to place it after the chunking node, whether you create your workflow via the UI or the API.
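
The ordering rule can be checked before a workflow is submitted. This is a small sketch that works on plain node names; `ner_follows_chunker` is a hypothetical helper, and the default names assume the `name` values this tutorial gives its nodes ("Chunker" and "Named entity recognizer").

```python
def ner_follows_chunker(node_names, chunker="Chunker", ner="Named entity recognizer"):
    """Return True if both nodes are present and the NER node comes after the chunker."""
    if chunker not in node_names or ner not in node_names:
        return False
    return node_names.index(ner) > node_names.index(chunker)


# Node order used in this tutorial's workflow.
order = [
    "Partitioner",
    "Image summarizer",
    "Table summarizer",
    "Chunker",
    "Named entity recognizer",
    "Embedder",
]
print(ner_follows_chunker(order))
```

With `WorkflowNode` objects, the same check could be run on `[node.name for node in workflow_nodes]`.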

In [None]:
from unstructured_client.models.shared import (
    WorkflowNode,
    WorkflowNodeType,
    WorkflowType,
    Schedule
)

# Partition the content by using a vision language model (VLM).
partition_node = WorkflowNode(
    name="Partitioner",
    subtype="vlm",
    type=WorkflowNodeType.PARTITION,
    settings={
        "provider": "anthropic",
        "provider_api_key": None,
        "model": "claude-sonnet-4-5-20250929",
        "output_format": "text/html",
        "user_prompt": None,
        "format_html": True,
        "unique_element_ids": True,
        "is_dynamic": True,
        "allow_fast": True
    }
)

# Summarize each detected image.
image_summarizer_node = WorkflowNode(
    name="Image summarizer",
    subtype="openai_image_description",
    type=WorkflowNodeType.PROMPTER,
    settings={}
)

# Summarize each detected table.
table_summarizer_node = WorkflowNode(
    name="Table summarizer",
    subtype="anthropic_table_description",
    type=WorkflowNodeType.PROMPTER,
    settings={}
)

# Chunk the partitioned content.
chunk_node = WorkflowNode(
    name="Chunker",
    subtype="chunk_by_title",
    type=WorkflowNodeType.CHUNK,
    settings={
        "unstructured_api_url": None,
        "unstructured_api_key": None,
        "multipage_sections": False,
        "combine_text_under_n_chars": 0,
        "include_orig_elements": True,
        "new_after_n_chars": 1500,
        "max_characters": 2048,
        "overlap": 160,
        "overlap_all": False,
        "contextual_chunking_strategy": None
    }
)

# Label each recognized named entity.
named_entity_recognizer_node = WorkflowNode(
    name="Named entity recognizer",
    subtype="openai_ner",
    type=WorkflowNodeType.PROMPTER,
    settings={
        "prompt_interface_overrides": {
            "prompt": {
                "user": (
                    "Extract all named entities, including people and locations, from the given text segments "
                    "and provide structured metadata for each entity identified.\n\n"
                    'Response format: {"PLACES": ["England", "Middlesex"]}'
                )
            }
        }
    }
)


# Generate vector embeddings.
embed_node = WorkflowNode(
    name="Embedder",
    subtype="azure_openai",
    type=WorkflowNodeType.EMBED,
    settings={
        "model_name": "text-embedding-3-large"
    }
)

## 5. Set up the workflow

In [None]:
import os
from datetime import datetime
from unstructured_client.models.shared import (
    WorkflowNode,
    WorkflowNodeType,
    CreateWorkflow,
    WorkflowType,
    Schedule
)
from unstructured_client.models.operations import CreateWorkflowRequest

unique_workflow_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
workflow = CreateWorkflow(
    name=f"s3-to-astra-Graph-RAG-workflow-{unique_workflow_suffix}",
    source_id=source_response.source_connector_information.id,
    destination_id=destination_response.destination_connector_information.id,
    workflow_type=WorkflowType.CUSTOM,
    workflow_nodes=[
        partition_node,
        image_summarizer_node,
        table_summarizer_node,
        chunk_node,
        named_entity_recognizer_node,
        embed_node
    ],
)

client = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)
response = client.workflows.create_workflow(
    request=CreateWorkflowRequest(
        create_workflow=workflow
    )
)

info = response.workflow_information

print(f"name:           {info.name}")
print(f"id:             {info.id}")
print(f"status:         {info.status}")
print(f"type:           {info.workflow_type}")
print("source(s):")

for source in info.sources:
    print(f"            {source}")

print("destination(s):")

for destination in info.destinations:
    print(f"            {destination}")

print("schedule(s):")

for crontab_entry in info.schedule.crontab_entries:
    print(f"            {crontab_entry.cron_expression}")

name:           s3-to-astra-Graph-RAG-workflow-2025-03-26-20-43-14
id:             5c8117eb-c507-4ef2-9f82-7590147344d4
status:         WorkflowState.ACTIVE
type:           WorkflowType.CUSTOM
source(s):
            05f18931-c3fd-4f43-896a-570093c1edb8
destination(s):
            2905af93-d681-4a42-a684-981cf250b7a0
schedule(s):


## 6. Run the workflow

In [None]:
from unstructured_client.models.operations import RunWorkflowRequest

response = client.workflows.run_workflow(
    request=RunWorkflowRequest(
        workflow_id=info.id
    )
)

print(response.raw_response)

<Response [202 Accepted]>
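
A `202 Accepted` response means the run request was accepted, not that processing has finished. One way to wait for completion is to poll the job status on an interval. The sketch below is deliberately generic: `wait_until_done` is a hypothetical helper, `fetch_status` stands in for whatever call you use to read the job status, and the status strings are placeholder assumptions rather than values confirmed by this tutorial.

```python
import time


def wait_until_done(
    fetch_status,
    done_states=("COMPLETED", "FAILED"),  # assumed terminal states, adjust as needed
    interval_s=10.0,
    timeout_s=1800.0,
    sleep=time.sleep,
):
    """Call fetch_status() until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"Job still in state {status!r} after {waited} seconds")
        sleep(interval_s)
        waited += interval_s


# Demonstration with a fake status source instead of a real API call.
fake_statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
print(wait_until_done(lambda: next(fake_statuses), interval_s=0.0))
```

Passing the status reader in as a callable keeps the loop independent of any particular client version, so only that one function needs to change if the API does.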


## Your data is now in Astra DB. Check your collection there, or open the Unstructured UI to see the workflow and its results!