# Use Unstructured with Snowflake Cortex Search for RAG

This notebook demonstrates how to use Unstructured to process one or more source documents in an Amazon S3 bucket, adding the documents' processed data to a Snowflake table. From there, this notebook demonstrates how to use [Snowflake Cortex](https://www.snowflake.com/product/features/cortex/) to search the table's contents with natural-language queries using retrieval-augmented generation (RAG).

To perform RAG search, three Snowflake functions are highlighted in this notebook:

- [SNOWFLAKE.CORTEX.EMBED_TEXT_1024](https://docs.snowflake.com/sql-reference/functions/embed_text_1024-snowflake-cortex), which creates a vector embedding of 1024 dimensions from text. This notebook uses it to embed both the text in the table and the text of the natural-language query.
- [VECTOR_COSINE_SIMILARITY](https://docs.snowflake.com/sql-reference/functions/vector_cosine_similarity), which computes the cosine similarity between two vectors. In this notebook, the vector for the text in each of the table's records and the vector for the query text are compared. The table record with the highest positive similarity (closest to 1) is then used. When determining the highest similarity, identical vectors have a cosine similarity of 1 (which means they likely represent the same meaning or concept); two orthogonal vectors have a similarity of 0 (which means they are not semantically related--such as "dog" and "car"); and two opposite vectors have a similarity of -1 (which means they have opposing semantic meanings--such as "hot" and "cold").
- [SNOWFLAKE.CORTEX.COMPLETE](https://docs.snowflake.com/sql-reference/functions/complete-snowflake-cortex), which uses the natural-language query, the table record with the highest positive cosine similarity, and a large language model (LLM) to generate a natural-language response to the query.
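
The cosine similarity values described in the preceding list can be reproduced in plain Python. The following is a minimal sketch with made-up two-dimensional vectors; it is only an illustration of the math, not how Snowflake implements `VECTOR_COSINE_SIMILARITY`.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors chosen for illustration only.
identical = cosine_similarity([3.0, 4.0], [3.0, 4.0])     # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # unrelated directions
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])    # opposite directions

print(round(identical, 6), round(orthogonal, 6), round(opposite, 6))
```

Real embeddings have 1024 dimensions, and their similarities usually fall somewhere between these extremes.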

At the end of this notebook, you can optionally experiment with additional Snowflake Cortex functions that perform tasks such as sentiment analysis, simple Q&A, language translation, and text classification.

## Requirements

To complete this notebook, you will need:

- An Unstructured account. Go to [https://platform.unstructured.io](https://platform.unstructured.io) and use your email address, Google account, or GitHub account to sign up for an Unstructured account (if you do not already have one) and sign in to the account at the same time.
- An Unstructured API key for your account. To get this API key, after you sign in to your Unstructured account:

  1. In the Unstructured user interface (UI), click **API Keys** on the sidebar.
  2. Click **Generate API Key**.
  3. Follow the on-screen instructions to finish generating the key.
  4. Click the **Copy** icon next to your new key to add the key to your system's clipboard. If you lose this key, simply return and click the **Copy** icon again.

You will also need, for Amazon S3:

1. An Amazon Web Services (AWS) account, along with your AWS access key ID and AWS secret access key for authentication. You'll also need an Amazon S3 bucket with the correct access settings applied.
2. One or more documents uploaded to the S3 bucket. These are the documents that Unstructured will process, sending the documents' processed data into your Snowflake table. If you don't have any documents readily available to upload into the S3 bucket, you can grab some from the [example-docs](https://github.com/Unstructured-IO/unstructured-ingest/tree/main/example-docs) folder in the [Unstructured-IO/unstructured-ingest](https://github.com/Unstructured-IO/unstructured-ingest) repository in GitHub.

[Learn how to verify these requirements for S3](https://docs.unstructured.io/api-reference/workflow/sources/s3). At the end of that webpage, ignore the section titled "To create the source connector," as you will do that later in this notebook instead.

You will also need, for Snowflake:

1. A Snowflake account and your account's identifier.
2. Your user's login name (not username) and password in your Snowflake account.
3. The names of the target database, schema, table, and warehouse in your Snowflake account.

   **IMPORTANT!** Your table must have a column named `embeddings` with a datatype of `VECTOR(FLOAT, 1024)` to work with this notebook.

4. The name of the Snowflake role that your user belongs to and that also has sufficient access to the target database, schema, table, and warehouse.
5. The hostname and port number for the host, and the name of the warehouse.

[Learn how to verify these requirements for Snowflake](https://docs.unstructured.io/api-reference/workflow/destinations/snowflake). At the end of that webpage, ignore the section titled "To create the destination connector," as you will do that later in this notebook instead.

## Step 1: Load dependencies, set initial imports and environment variables, and define a helper function

Run the following cell to load the code dependency for the Unstructured API. The `datetime` module, which this notebook uses to generate unique names for your Unstructured connectors and workflow, is part of the Python standard library and does not need to be installed.

In [None]:
!pip install -U unstructured-client


Run the following cell to set imports for setting and getting environment variables, calling the Unstructured API, and using date/time functions. Also declare a helper function to make it easier to read JSON-formatted payloads that are returned by calls to the Unstructured API.

In [None]:
import os
from unstructured_client import UnstructuredClient
from datetime import datetime

def pretty_print_model(response_model):
    print(response_model.model_dump_json(indent=4))

Set the following environment variables' values, as described in the following code comments, and then run the cell. In this notebook, you set these environment variables directly, for convenience. In production, you would typically set these environment variables separately from your code, to lessen the risk of exposing sensitive information such as access keys and passwords to other users.

In [None]:
os.environ["AWS_S3_URL"] = "..." # The path to the S3 bucket or folder, formatted as s3://my-bucket/ (if the files are in the bucket’s root) or s3://my-bucket/my-folder/.
os.environ["AWS_ACCESS_KEY_ID"] = "..." # The AWS access key ID for the authenticated AWS IAM user.
os.environ["AWS_SECRET_ACCESS_KEY"] = "..." # The corresponding AWS secret access key.
os.environ["SNOWFLAKE_ACCOUNT"] = "..." # The ID of the target Snowflake account. *
os.environ["SNOWFLAKE_USER"] = "..." # The login name (not username) of the target Snowflake user in the account. *
os.environ["SNOWFLAKE_PASSWORD"] = "..." # The user's password. *
os.environ["SNOWFLAKE_ROLE"] = "..." # The name of the target role for the user. This role must have appropriate permissions to the target database, schema, table, and warehouse.
os.environ["SNOWFLAKE_HOST"] = "..." # The hostname for the target Snowflake warehouse.
os.environ["SNOWFLAKE_PORT"] = "443" # The warehouse's port number.
os.environ["SNOWFLAKE_WAREHOUSE"] = "COMPUTE_WH" # The name of the target Snowflake warehouse. *
os.environ["SNOWFLAKE_DATABASE"] = "..." # The name of the target Snowflake database. *
os.environ["SNOWFLAKE_SCHEMA"] = "..." # The name of the target namespace (also known as a schema) in the database. *
os.environ["SNOWFLAKE_TABLE"] = "ELEMENTS" # The name of the target table in the schema. Typically, this table is named ELEMENTS.
os.environ["SNOWFLAKE_RECORD_ID_KEY"] = "RECORD_ID" # The name of the column in the table that uniquely identifies each record. Typically, this column is named RECORD_ID.
os.environ["UNSTRUCTURED_API_KEY"] = "..." # Your Unstructured API key.
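
Before you move on, it can help to confirm that none of these variables was left unset or still holds the `"..."` placeholder. The following is a minimal sketch; the helper name `find_unset_variables` is hypothetical and is not part of the Unstructured or Snowflake libraries.

```python
import os

# Variable names taken from the preceding cell.
REQUIRED_VARS = [
    "AWS_S3_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ROLE", "SNOWFLAKE_HOST", "SNOWFLAKE_PORT",
    "SNOWFLAKE_WAREHOUSE", "SNOWFLAKE_DATABASE", "SNOWFLAKE_SCHEMA",
    "SNOWFLAKE_TABLE", "SNOWFLAKE_RECORD_ID_KEY", "UNSTRUCTURED_API_KEY",
]

def find_unset_variables(names, environ=os.environ, placeholder="..."):
    """Return the names whose values are missing, empty, or still the placeholder."""
    return [
        name for name in names
        if environ.get(name, "").strip() in ("", placeholder)
    ]

unset = find_unset_variables(REQUIRED_VARS)
if unset:
    print("Set these variables before continuing: " + ", ".join(unset))
else:
    print("All required variables are set.")
```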

## Step 2: Create the source connector

Run the following cell to create the source connection to the Amazon S3 bucket. If successful, Unstructured prints information about the newly created source connector.

In [None]:
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.shared import (
    CreateSourceConnector,
    SourceConnectorType,
    S3SourceConnectorConfigInput
)

unique_source_connector_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
    response = client.sources.create_source(
        request=CreateSourceRequest(
            create_source_connector=CreateSourceConnector(
                name=f"s3-source-{unique_source_connector_suffix}",
                type=SourceConnectorType.S3,
                config=S3SourceConnectorConfigInput(
                    remote_url=os.getenv("AWS_S3_URL"),
                    key=os.getenv("AWS_ACCESS_KEY_ID"),
                    secret=os.getenv("AWS_SECRET_ACCESS_KEY")
                )
            )
        )
    )

    source_connector_id = response.source_connector_information.id
    pretty_print_model(response.source_connector_information)

## Step 3: Create the destination connector

Run the following cell to create the destination connection to the Snowflake table. If successful, Unstructured prints information about the newly created destination connector.

In [None]:
from unstructured_client.models.operations import CreateDestinationRequest
from unstructured_client.models.shared import (
    CreateDestinationConnector,
    DestinationConnectorType,
    SnowflakeDestinationConnectorConfigInput
)

unique_destination_connector_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
    response = client.destinations.create_destination(
        request=CreateDestinationRequest(
            create_destination_connector=CreateDestinationConnector(
                name=f"snowflake-destination-{unique_destination_connector_suffix}",
                type=DestinationConnectorType.SNOWFLAKE,
                config=SnowflakeDestinationConnectorConfigInput(
                    account=os.getenv("SNOWFLAKE_ACCOUNT"),
                    user=os.getenv("SNOWFLAKE_USER"),
                    password=os.getenv("SNOWFLAKE_PASSWORD"),
                    role=os.getenv("SNOWFLAKE_ROLE"),
                    host=os.getenv("SNOWFLAKE_HOST"),
                    port=os.getenv("SNOWFLAKE_PORT"),
                    database=os.getenv("SNOWFLAKE_DATABASE"),
                    schema=os.getenv("SNOWFLAKE_SCHEMA"),
                    table_name=os.getenv("SNOWFLAKE_TABLE"),
                    record_id_key=os.getenv("SNOWFLAKE_RECORD_ID_KEY")
                )
            )
        )
    )

    destination_connector_id = response.destination_connector_information.id
    pretty_print_model(response.destination_connector_information)

## Step 4: Create the workflow

Run the following cell to create the workflow. This workflow uses Claude Sonnet 4.5, a vision language model (VLM), to translate the documents' contents into Unstructured document elements. The elements' text is then chunked into smaller parts for easier use by RAG search.

If the cell runs successfully, Unstructured prints information about the newly created workflow.

In [None]:
from unstructured_client.models.operations import CreateWorkflowRequest
from unstructured_client.models.shared import (
    WorkflowNode,
    CreateWorkflow,
    WorkflowType
)

# Partition the content by using a vision language model (VLM).
partition_node = WorkflowNode(
    name="Partitioner",
    subtype="vlm",
    type="partition",
    settings={
        "provider": "anthropic",
        "provider_api_key": None,
        "model": "claude-sonnet-4-5-20250929",
        "output_format": "text/html",
        "user_prompt": None,
        "format_html": True,
        "unique_element_ids": True,
        "is_dynamic": True,
        "allow_fast": True
    }
)

# Chunk the partitioned content.
chunk_node = WorkflowNode(
    name="Chunker",
    subtype="chunk_by_title",
    type="chunk",
    settings={
        "unstructured_api_url": None,
        "unstructured_api_key": None,
        "multipage_sections": False,
        "combine_text_under_n_chars": 0,
        "include_orig_elements": True,
        "new_after_n_chars": 1500,
        "max_characters": 2048,
        "overlap": 160,
        "overlap_all": False,
        "contextual_chunking_strategy": None
    }
)

unique_workflow_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
    response = client.workflows.create_workflow(
        request=CreateWorkflowRequest(
            create_workflow=CreateWorkflow(
                name=f"s3-to-snowflake-custom-workflow-{unique_workflow_suffix}",
                source_id=source_connector_id,
                destination_id=destination_connector_id,
                workflow_type=WorkflowType.CUSTOM,
                workflow_nodes=[
                    partition_node,
                    chunk_node
                ],
            )
        )
    )

    workflow_id = response.workflow_information.id
    pretty_print_model(response.workflow_information)
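
To build intuition for the `max_characters` and `overlap` settings in the chunker node, the following sketch splits a string into fixed-size character windows that share a few characters with their neighbors. This is a deliberately simplified illustration, not Unstructured's `chunk_by_title` algorithm, which also takes document structure into account; the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, max_characters=2048, overlap=160):
    """Split text into windows of at most max_characters, repeating the last
    `overlap` characters of each window at the start of the next one."""
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")

    chunks = []
    step = max_characters - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # The current window already reached the end of the text.
        start += step
    return chunks

# Small numbers make the overlap easy to see.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, max_characters=10, overlap=3)
print(chunks)
```

With these toy settings, each chunk begins with the last three characters of the previous chunk, which helps preserve context that would otherwise be cut at a chunk boundary.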

## Step 5: Run the workflow

Run the following cell to start running the workflow. Unstructured will begin processing the files in the S3 bucket, sending the processed data into the Snowflake table. If successful, Unstructured prints information about the workflow that is running.

In [None]:
from unstructured_client.models.operations import RunWorkflowRequest

with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
    response = client.workflows.run_workflow(
        request=RunWorkflowRequest(
            workflow_id=workflow_id,
        )
    )

pretty_print_model(response.job_information)

## Step 6: Get the workflow run's job ID

Run the following cell to get the workflow run's job ID, which is needed to poll for job completion later. If successful, Unstructured prints the job's ID.

In [None]:
from unstructured_client.models.operations import ListJobsRequest

with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
    response = client.jobs.list_jobs(
        request=ListJobsRequest(
            workflow_id=workflow_id
        )
    )

last_job = response.response_list_jobs[0]
job_id = last_job.id
print(f"job_id: {job_id}")
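
The preceding cell takes the first job in the returned list. If you run a workflow more than once and want to be explicit about which run you are tracking, one option is to pick the job with the latest creation time. The following sketch uses plain dictionaries with hypothetical `id` and `created_at` keys as stand-ins; check the actual field names on the job objects that the API returns before adapting it.

```python
from datetime import datetime

def most_recent_job(jobs):
    """Return the job with the latest created_at timestamp."""
    if not jobs:
        raise ValueError("No jobs found for this workflow.")
    return max(jobs, key=lambda job: job["created_at"])

# Made-up sample data for illustration only.
sample_jobs = [
    {"id": "job-a", "created_at": datetime(2025, 1, 1, 9, 0)},
    {"id": "job-b", "created_at": datetime(2025, 1, 2, 14, 30)},
    {"id": "job-c", "created_at": datetime(2025, 1, 1, 18, 45)},
]

latest = most_recent_job(sample_jobs)
print(f"Most recent job: {latest['id']}")
```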

## Step 7: Poll for job completion

Run the following cell to confirm the job has finished running. If successful, Unstructured prints "status": "COMPLETED" within the information about the job.

In [None]:
from unstructured_client.models.operations import GetJobRequest
import time

def poll_job_status(job_id):
    while True:
        with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as client:
            response = client.jobs.get_job(
                request=GetJobRequest(
                    job_id=job_id
                )
            )

        job = response.job_information

        if job.status == "SCHEDULED":
            print("Job is scheduled, polling again in 10 seconds...")
            time.sleep(10)
        elif job.status == "IN_PROGRESS":
            print("Job is in progress, polling again in 10 seconds...")
            time.sleep(10)
        else:
            # Any other status means the job is no longer running; it may not have succeeded.
            print("Job is no longer running; check its status in the output below.")
            break

    return job

job = poll_job_status(job_id)
pretty_print_model(job)
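
The polling loop above waits indefinitely. For unattended runs, you might prefer a version that gives up after a time limit. The following is a minimal sketch of that pattern; `poll_until_done` and its parameters are hypothetical, and a fake status source stands in for the real call to the Unstructured API.

```python
import time

def poll_until_done(get_status, interval_seconds=10, timeout_seconds=1800,
                    running_states=("SCHEDULED", "IN_PROGRESS")):
    """Call get_status() until it returns a non-running state or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in running_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_seconds} seconds.")
        time.sleep(interval_seconds)

# Stand-in for a real status lookup; yields one status per call.
fake_statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
final_status = poll_until_done(lambda: next(fake_statuses), interval_seconds=0)
print(final_status)
```

In this notebook, the `get_status` callable would wrap the `client.jobs.get_job` call and return the job's status.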

## Step 8: Connect to Snowflake

Run the following cell to install code dependencies for connecting to Snowflake.


In [None]:
!pip install snowflake-connector-python

Run the following cell to connect to Snowflake.

In [None]:
import snowflake.connector

config = {
    "user": os.getenv("SNOWFLAKE_USER"),
    "password": os.getenv("SNOWFLAKE_PASSWORD"),
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "warehouse": os.getenv("SNOWFLAKE_WAREHOUSE"),
    "database": os.getenv("SNOWFLAKE_DATABASE"),
    "schema": os.getenv("SNOWFLAKE_SCHEMA")
}

try:
    conn = snowflake.connector.connect(**config)
    print("Connection established successfully.")
except snowflake.connector.Error as e:
    print(f"Connection failed: {e}")

## Step 9: Generate embedding vectors for the table's text contents

Run the following cell to generate embedding vectors for the text contents of each of the table's rows. If successful, the number of table rows that had embedding vectors generated for them is returned.

In [None]:
cursor = conn.cursor()
table = os.getenv("SNOWFLAKE_TABLE")
query = f"UPDATE {table} SET embeddings = SNOWFLAKE.CORTEX.EMBED_TEXT_1024('voyage-multilingual-2', text)"

cursor.execute(query)

print(f"Rows updated: {cursor.rowcount}")

## Step 10: Generate embedding vectors for the natural-language query

In the following cell, specify the query you want to use, and then run the cell.

In [None]:
natural_language_query = "In what year were women given the right to vote in the United States?"

Run the following cell to create a table named `query_table` in the same schema as your target table, replacing `query_table` if it already exists there. Embedding vectors are then generated for your natural-language query and inserted into `query_table`. If successful, "Rows inserted: 1" is printed.

If you change the natural-language query in the preceding cell, you must run the following cell again, to make sure that matching embedding vectors are generated and overwrite the previous ones.

In [None]:
cursor.execute("""
    CREATE OR REPLACE TABLE query_table (
        query_vec VECTOR(FLOAT, 1024)
    )
""")

cursor.execute("""
    INSERT INTO query_table
    SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_1024(
        'voyage-multilingual-2',
        %s
    )
""", (natural_language_query,))

print(f"Rows inserted: {cursor.rowcount}")

conn.commit()

## Step 11: Perform RAG search

Run the following cell to perform a vector cosine similarity operation, looking for the row in the target table that has the closest similarity to your natural-language query. Based on that row's text, use the Mistral 7B model to return and print a natural-language reply to your query.

To search again with a different natural-language query, go back to Step 10.

In [None]:
from textwrap import fill

query = f"""
WITH result AS (
    SELECT
        t.text,
        %s AS query_text,
        VECTOR_COSINE_SIMILARITY(t.embeddings, q.query_vec) AS similarity
    FROM {table} t, query_table q
    ORDER BY similarity DESC
    LIMIT 1
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-7b',
    CONCAT('Answer this question: ', query_text, ' using this text: ', text)
) AS answer
FROM result;
"""

cursor.execute(query, (natural_language_query,))
result = cursor.fetchall()[0][0]
print(fill(result, width=80))
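
The retrieval half of the preceding SQL (`ORDER BY similarity DESC LIMIT 1`, followed by `CONCAT`) can be mirrored in plain Python to make the logic easier to follow. The following sketch uses made-up three-dimensional vectors and placeholder text instead of real embeddings, and it stops at building the prompt; it does not call an LLM. The function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_matches(records, query_vec, limit=1):
    """Score every record against the query and return the best `limit` matches."""
    scored = [(cosine_similarity(r["embeddings"], query_vec), r["text"]) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # Highest similarity first.
    return scored[:limit]

def build_prompt(question, context_text):
    """Combine the question and the retrieved text, as the CONCAT call does."""
    return f"Answer this question: {question} using this text: {context_text}"

# Toy stand-ins for table rows and the query embedding.
records = [
    {"text": "Text about voting rights.", "embeddings": [0.9, 0.1, 0.0]},
    {"text": "Text about taxation.", "embeddings": [0.1, 0.9, 0.0]},
    {"text": "Text about elections.", "embeddings": [0.6, 0.4, 0.2]},
]
query_vec = [1.0, 0.0, 0.0]

best_score, best_text = top_matches(records, query_vec)[0]
prompt = build_prompt("When were women given the right to vote?", best_text)
print(prompt)
```

Raising `limit` and joining several matches into the prompt is a common variation when a single row does not carry enough context.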

## Optional: Use additional Snowflake Cortex functions

You can run the following cells to experiment with the following functions against your Snowflake table:

- [ENTITY_SENTIMENT](https://docs.snowflake.com/user-guide/snowflake-cortex/sentiment#label-sentiment-entity-sentiment), which returns sentiment scores for English-language text, including overall sentiment and specific sentiment for specified entities.
- [SENTIMENT](https://docs.snowflake.com/user-guide/snowflake-cortex/sentiment#sentiment), which returns an overall sentiment score for the given English-language input text.
- [EXTRACT_ANSWER](https://docs.snowflake.com/sql-reference/functions/extract_answer-snowflake-cortex), which extracts an answer to a given question from text.
- [SUMMARIZE](https://docs.snowflake.com/sql-reference/functions/summarize-snowflake-cortex), which summarizes the given English-language input text.
- [TRANSLATE](https://docs.snowflake.com/sql-reference/functions/translate-snowflake-cortex), which translates the given input text from one supported language to another.
- [CLASSIFY_TEXT](https://docs.snowflake.com/sql-reference/functions/classify_text-snowflake-cortex), which classifies free-form text into categories that you provide.

### ENTITY_SENTIMENT

Run the following cell to perform sentiment analysis on the contents of the `text` column for the first 10 rows of your table.

To change the entities to analyze sentiment for, replace `['Prescriptive', 'Declarative']` with up to 10 of your own entity labels. Each label must not exceed 30 characters.

In [None]:
import json

cursor = conn.cursor()
table = os.getenv("SNOWFLAKE_TABLE")

query = f"""
SELECT SNOWFLAKE.CORTEX.ENTITY_SENTIMENT(
    text,
    ['Prescriptive', 'Declarative']
), text FROM {table} LIMIT 10;
"""

cursor.execute(query)
result = cursor.fetchall()
for entry in result:
    # Print the text part.
    print("Text:\n")
    print(entry[1])
    # Pretty-print the JSON-formatted sentiment analysis part.
    print('-' * 80)  # Add a separator for clarity.
    print("\nSentiment analysis:\n")
    parsed_json = json.loads(entry[0])
    print(json.dumps(parsed_json, indent=4))
    print('=' * 80)  # Add a separator for clarity, before printing the next item.
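
If you substitute your own labels, you can check them against the limits stated above (at most 10 labels, each at most 30 characters) before sending the query. The following is a minimal sketch; `validate_entity_labels` is a hypothetical helper, not a Snowflake function.

```python
def validate_entity_labels(labels, max_labels=10, max_length=30):
    """Return a list of human-readable problems; an empty list means the labels pass."""
    problems = []
    if len(labels) > max_labels:
        problems.append(f"Too many labels: {len(labels)} (maximum is {max_labels}).")
    for label in labels:
        if len(label) > max_length:
            problems.append(f"Label too long ({len(label)} characters): {label!r}")
    return problems

labels = ["Prescriptive", "Declarative"]
issues = validate_entity_labels(labels)
print(issues if issues else "Labels look valid.")
```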

### SENTIMENT

Run the following cell to return an overall sentiment score for the `text` column for the first 10 rows of your table.

Sentiment scores are expressed as a floating-point number from -1 to 1 (inclusive) indicating the level of negative or positive sentiment in the text. Values around 0 indicate neutral sentiment.

In [None]:
query = f"""
SELECT SNOWFLAKE.CORTEX.SENTIMENT(
    text
), text FROM {table} LIMIT 10;
"""

cursor.execute(query)
result = cursor.fetchall()
print(json.dumps(result, indent=4))
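
Raw scores can be easier to scan once they are grouped into named buckets. The following sketch maps a score in the range -1 to 1 to a label. The cutoff of 0.25 for the neutral band is an arbitrary assumption chosen for illustration, not a threshold defined by Snowflake.

```python
def label_sentiment(score, neutral_band=0.25):
    """Map a sentiment score in [-1, 1] to 'negative', 'neutral', or 'positive'."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"Score out of range: {score}")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

# Made-up scores for illustration only.
for sample_score in (0.8, 0.1, -0.6):
    print(sample_score, label_sentiment(sample_score))
```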

### EXTRACT_ANSWER

Run the following cell to return the answer to a question about the `text` column for the first 10 rows of your table.

To ask a different question, change the value of the following `question` variable, and run the cell again.

In [None]:
question = "Who is primarily mentioned in this text?"

query = f"""
SELECT SNOWFLAKE.CORTEX.EXTRACT_ANSWER(text, %s), text FROM {table} LIMIT 10;
"""

# Bind the question as a parameter so that quotes in it cannot break the SQL.
cursor.execute(query, (question,))
result = cursor.fetchall()
for entry in result:
    # Print the text part.
    print("Text:\n")
    print(entry[1])
    # Pretty-print the JSON-formatted answer part.
    print('-' * 80)  # Add a separator for clarity.
    print(f"Question: {question}")
    print("Answer:")
    parsed_json = json.loads(entry[0])
    print(json.dumps(parsed_json, indent=4))
    print('=' * 80)  # Add a separator for clarity, before printing the next item.
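
Questions often contain apostrophes, which break a SQL statement if the question is pasted directly into a quoted string literal. Passing the question as a bound parameter, as Step 10 does with `%s`, avoids the problem. The following sketch demonstrates the difference using Python's built-in `sqlite3` module as a stand-in database; note that `sqlite3` uses `?` placeholders, whereas the Snowflake connector in this notebook uses `%s`.

```python
import sqlite3

# In-memory stand-in database with one made-up row.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE elements (text TEXT)")
conn_demo.execute("INSERT INTO elements VALUES ('sample text')")

question_with_quote = "Who is the text's main subject?"

# Interpolating the question lets its apostrophe end the SQL string early.
unsafe_sql = f"SELECT '{question_with_quote}', text FROM elements"
try:
    conn_demo.execute(unsafe_sql)
    interpolation_failed = False
except sqlite3.Error as error:
    interpolation_failed = True
    print(f"Interpolated query failed: {error}")

# Binding the question as a parameter keeps it intact.
row = conn_demo.execute(
    "SELECT ?, text FROM elements", (question_with_quote,)
).fetchone()
print(row)
```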

### SUMMARIZE

Run the following cell to return a summary of the `text` column for the first 10 rows of your table.

The `text` column must contain English text.

In [None]:
query = f"""
SELECT SNOWFLAKE.CORTEX.SUMMARIZE(text), text FROM {table} LIMIT 10;
"""

cursor.execute(query)
result = cursor.fetchall()
for entry in result:
    print("Original text:\n")
    print(entry[1])
    print('-' * 80) # Add a separator for clarity.
    print("Summary text:\n")
    print(entry[0])
    print('=' * 80) # Add a separator for clarity, before printing the next item.

### TRANSLATE

Run the following cell to return a language translation of the `text` column for the first 10 rows of your table.

Over a dozen languages are supported. For the full list of supported languages, see the [TRANSLATE](https://docs.snowflake.com/sql-reference/functions/translate-snowflake-cortex) documentation.

In [None]:
import json

cursor = conn.cursor()
table = os.getenv("SNOWFLAKE_TABLE")

query = f"""
SELECT SNOWFLAKE.CORTEX.TRANSLATE(text, 'en', 'es'), text FROM {table} LIMIT 10;
"""

cursor.execute(query)
result = cursor.fetchall()
for entry in result:
    print("Original English text:\n")
    print(entry[1])
    print('-' * 80) # Add a separator for clarity.
    print("Text translated into Spanish:\n")
    print(entry[0])
    print('=' * 80) # Add a separator for clarity, before printing the next item.

### CLASSIFY_TEXT

Run the following cell to perform text classification on the contents of the `text` column for the first 10 rows of your table.

To change the text classification labels to use, replace `['Section', 'Article', 'Amendment', 'Other']` with between 2 and 100 of your own labels, inclusive.

In [None]:
query = f"""
SELECT SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
  text,
  ['Section', 'Article', 'Amendment', 'Other']
), text FROM {table} LIMIT 10;
"""

cursor.execute(query)
result = cursor.fetchall()
for entry in result:
    print("Text:\n")
    print(entry[1])
    print('-' * 80) # Add a separator for clarity.
    print("Classification:\n")
    parsed_json = json.loads(entry[0])
    print(json.dumps(parsed_json, indent=4))
    print('=' * 80) # Add a separator for clarity, before printing the next item.
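
After classifying many rows, a tally of labels gives a quick overview of the table's contents. The following sketch counts labels from JSON payloads. It assumes each payload is a JSON object with a `label` key, which you should confirm against the output of the preceding cell; the sample payloads are made up.

```python
import json
from collections import Counter

def tally_labels(payloads):
    """Count how often each label appears across JSON-formatted classification results."""
    counts = Counter()
    for payload in payloads:
        parsed = json.loads(payload)
        counts[parsed.get("label", "<missing>")] += 1
    return counts

# Made-up payloads standing in for the first column of each result row.
sample_payloads = [
    '{"label": "Article"}',
    '{"label": "Amendment"}',
    '{"label": "Article"}',
]

label_counts = tally_labels(sample_payloads)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

With real results, you could pass `[entry[0] for entry in result]` as the payloads.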

## Additional Resources

- [Unstructured user interface (UI) documentation](https://docs.unstructured.io/ui/overview)
- [Unstructured Workflow Endpoint API documentation](https://docs.unstructured.io/api-reference/workflow/workflows)