diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index 748f79b..0000000 Binary files a/.DS_Store and /dev/null differ diff --git a/notebooks/Getting_Started_with_Unstructured_API_and_PostgreSQL.ipynb b/notebooks/Getting_Started_with_Unstructured_API_and_PostgreSQL.ipynb new file mode 100644 index 0000000..e0e0722 --- /dev/null +++ b/notebooks/Getting_Started_with_Unstructured_API_and_PostgreSQL.ipynb @@ -0,0 +1,811 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Getting Started with Unstructured API and PostgreSQL\n" + ], + "metadata": { + "id": "Z8AvLcpmv8rE" + } + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "[Unstructured](https://unstructured.io) is an ETL+ platform for transforming unstructured content into structured formats ready for downstream use. It lets you:\n", + "\n", + "* Connect to enterprise data sources — cloud storage (S3, Azure Blob), collaboration tools (Confluence, Dropbox), business apps (Salesforce, Jira, Zendesk), and databases (Databricks, Redis, PostgreSQL)\n", + "* Continuously ingest documents from those systems\n", + "* Standardize, enrich, chunk, and transform the content into clean structured output\n", + "* Store the results in a destination database — in this case, **PostgreSQL**\n", + "\n", + "You can configure all of this through the Unstructured UI, the API, or directly from Python using their SDK.\n", + "\n", + "This notebook walks through setting up a full data pipeline using the Unstructured API. 
We’ll source files from an S3 bucket, process them using a series of transformation steps, and insert the structured results into PostgreSQL — ready for querying, analytics, or integration into business applications.\n", + "\n", + "For source configuration, we’ll use S3 in this example, though you can plug in any supported [data source](https://docs.unstructured.io/api-reference/workflow/sources/overview). The destination will be a PostgreSQL instance configured to receive structured document outputs via Unstructured’s [PostgreSQL connector](https://docs.unstructured.io/api-reference/workflow/destinations/postgresql).\n", + "\n", + "\n", + "This is what the complete data processing pipeline will look like:\n", + "![image.png]()\n" + ], + "metadata": { + "id": "5wNPAx2QwCGG" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Prerequisites\n", + "\n", + "Before running this notebook, you’ll need a few things set up so that Unstructured can connect to your source (e.g. S3) and send processed data to PostgreSQL.\n", + "\n", + "\n", + "### Unstructured API Key\n", + "\n", + "Create an API key from the Unstructured Platform\n", + "\n", + "1. [Contact us](https://unstructured.io/enterprise) to get access or log in if you're already a user.\n", + "2. Once logged in, go to the **API Keys** section in the sidebar.\n", + "3. Click **New Key**, name it (e.g. `postgresql-workflow-key`), and copy it securely.\n", + "\n", + "---\n", + "\n", + "### AWS S3 (Source)\n", + "\n", + "We’ll use S3 to source raw documents.\n", + "\n", + "Make sure you have:\n", + "\n", + "- **AWS Access Key** and **Secret**\n", + "- An **S3 bucket URI**, like: `s3://your-bucket/`\n", + "- A few files uploaded — PDF, DOCX, HTML, etc. 
See [supported file types](https://docs.unstructured.io/getting-started/ingest/supported-file-types) for reference.\n", + "\n", + "Store these credentials securely in your notebook or environment variables.\n", + "\n", + "---\n", + "\n", + "### PostgreSQL (Destination)\n", + "\n", + "You’ll need access to a PostgreSQL instance. For this tutorial, we’re using **Amazon RDS for PostgreSQL**, but other cloud-hosted options work as well.\n", + "\n", + "When setting up your database, you’ll need to manually create the target table in advance. This table must match the structure of the document elements that Unstructured outputs.\n", + "\n", + "Required connection details:\n", + "\n", + "- **Host**\n", + "- **Port** (typically `5432`)\n", + "- **Database name**\n", + "- **Username**\n", + "- **Password**\n", + "- **Table name** — the destination table must already exist in your database and match the Unstructured output schema. For this tutorial, we'll use the table name: `elements`.\n", + "- **batch_size**: how many rows to insert per write operation\n", + "\n", + "\n", + "Ensure your PostgreSQL instance allows incoming connections from Unstructured’s IP addresses. If you're using Amazon RDS, check that your instance's **Public access** setting and **security group rules** permit access. \n", + "→ [Unstructured IP list](https://assets.p6m.u10d.net/publicitems/ip-prefixes.json)\n", + "\n", + "> 💡 You must create the destination table ahead of time. The schema should match the structure of Unstructured's document elements. 
Here's a sample schema you can adapt:\n", + ">\n", + "> ```sql\n", + "> CREATE TABLE elements (\n", + "> id UUID PRIMARY KEY,\n", + "> record_id VARCHAR,\n", + "> element_id VARCHAR,\n", + "> text TEXT,\n", + "> embeddings DECIMAL [],\n", + "> parent_id VARCHAR,\n", + "> page_number INTEGER,\n", + "> is_continuation BOOLEAN,\n", + "> orig_elements TEXT,\n", + "> partitioner_type VARCHAR\n", + "> );\n", + "> ```\n", + "\n", + "\n", + "For a full list of options and setup details, see the [PostgreSQL destination documentation](https://docs.unstructured.io/api-reference/workflow/destinations/postgresql). \n", + "Prefer video? [Watch the setup walkthrough on YouTube](https://www.youtube.com/watch?v=QuIlEimejDs&t).\n" + ], + "metadata": { + "id": "hgI9-3XowHgu" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Step 1: Install the Unstructured API Python SDK\n", + "\n", + "All functionality available in the UI of the [Unstructured](https://unstructured.io/) product is also available programmatically via the Unstructured API. You can interact with the Unstructured API either by sending direct requests via curl or Postman, or by using the Unstructured API [Python SDK](https://docs.unstructured.io/api-reference/workflow/overview#unstructured-python-sdk). Here, we'll be using the latter.\n", + "\n", + "\n", + "> **Note:**\n", + "The Unstructured API has two endpoints:\n", + "* The Unstructured Partition Endpoint: intended for rapid prototyping of Unstructured's various partitioning strategies. It processes local files only, one file at a time.\n", + "* The Unstructured Workflow Endpoint: enables a full range of partitioning, chunking, embedding, and enrichment options for your data. It is designed to batch-process data from any data source to any destination. This is what we're using in this notebook.\n", + "\n", + "\n", + "Run the following cell to install the Unstructured API Python SDK." 
+ ], + "metadata": { + "id": "E5CL5iVB0oIF" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install -U \"unstructured-client\"" + ], + "metadata": { + "id": "zp8T30s300Ky", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "outputId": "3e1ba014-4dde-483b-bfc4-9cec420c6696" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting unstructured-client\n", + " Downloading unstructured_client-0.42.3-py3-none-any.whl.metadata (23 kB)\n", + "Requirement already satisfied: aiofiles>=24.1.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (24.1.0)\n", + "Requirement already satisfied: cryptography>=3.1 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (43.0.3)\n", + "Requirement already satisfied: httpcore>=1.0.9 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.9)\n", + "Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (0.28.1)\n", + "Requirement already satisfied: pydantic>=2.11.2 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (2.11.7)\n", + "Collecting pypdf>=4.0 (from unstructured-client)\n", + " Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.0)\n", + "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.11/dist-packages (from cryptography>=3.1->unstructured-client) (1.17.1)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (2025.8.3)\n", + "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (0.16.0)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from 
httpx>=0.27.0->unstructured-client) (4.10.0)\n", + "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (3.10)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (2.33.2)\n", + "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (4.14.1)\n", + "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.4.1)\n", + "Requirement already satisfied: requests<3.0.0,>=2.0.1 in /usr/local/lib/python3.11/dist-packages (from requests-toolbelt>=1.0.0->unstructured-client) (2.32.3)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.12->cryptography>=3.1->unstructured-client) (2.22)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (3.4.3)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (2.5.0)\n", + "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.27.0->unstructured-client) (1.3.1)\n", + "Downloading unstructured_client-0.42.3-py3-none-any.whl (207 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.8/207.8 kB\u001b[0m \u001b[31m1.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pypdf-6.0.0-py3-none-any.whl (310 kB)\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m310.5/310.5 kB\u001b[0m \u001b[31m3.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: pypdf, unstructured-client\n", + "Successfully installed pypdf-6.0.0 unstructured-client-0.42.3\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 2: Set environment variables\n", + "\n", + "Fetching the values from Colab Secrets!" + ], + "metadata": { + "id": "PSLM5_I6wN2g" + } + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "from google.colab import userdata\n", + "\n", + "# Unstructured\n", + "os.environ['UNSTRUCTURED_API_KEY'] = userdata.get('UNSTRUCTURED_API_KEY')\n", + "\n", + "# AWS S3\n", + "os.environ['AWS_ACCESS'] = userdata.get('AWS_ACCESS')\n", + "os.environ['AWS_SECRET'] = userdata.get('AWS_SECRET')\n", + "os.environ['S3_REMOTE_URL'] = userdata.get('S3_REMOTE_URL')\n", + "\n", + "\n", + "# AWS PostgreSQL\n", + "os.environ['PostgreSQL_host'] = userdata.get('PostgreSQL_host')\n", + "os.environ['PostgreSQL_database'] = userdata.get('PostgreSQL_database')\n", + "os.environ['PostgreSQL_port'] = userdata.get('PostgreSQL_port')\n", + "os.environ['PostgreSQL_username'] = userdata.get('PostgreSQL_username')\n", + "os.environ['PostgreSQL_password'] = userdata.get('PostgreSQL_password')\n", + "\n", + "\n" + ], + "metadata": { + "id": "iKEkEUBW0G2K" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# instantiate Unstructured Client\n", + "from unstructured_client import UnstructuredClient\n", + "\n", + "unstructured_client = UnstructuredClient(api_key_auth=os.environ[\"UNSTRUCTURED_API_KEY\"])\n", + "\n", + "# helper function\n", + "def pretty_print_model(response_model):\n", + " print(response_model.model_dump_json(indent=4))" + ], + "metadata": { + "id": "lNbpG7N01NW_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 
3: Create the source connector\n", + "\n", + "Run the following cell to create an [AWS S3 source connector](https://docs.unstructured.io/api-reference/workflow/sources/s3).\n", + "\n", + "When creating a source connector, you'll need to:\n", + "- Assign it a unique name\n", + "- Set the connector type to `s3`\n", + "- Provide your AWS credentials and S3 bucket location\n", + "\n", + "Your config must include:\n", + "- `remote_url`: the full URI to your S3 bucket or folder (e.g. `s3://your-bucket/path/`)\n", + "- `key`: your AWS access key\n", + "- `secret`: your AWS secret key\n", + "- `recursive`: (optional) whether to include files in subfolders\n" + ], + "metadata": { + "id": "78gSZnm3wVfw" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateSourceRequest\n", + "from unstructured_client.models.shared import CreateSourceConnector\n", + "\n", + "source_response = unstructured_client.sources.create_source(\n", + " request=CreateSourceRequest(\n", + " create_source_connector=CreateSourceConnector(\n", + " name=f\"UnstructuredxPostgreSQL Source_\",\n", + " type=\"s3\",\n", + " config={\n", + " \"key\": os.environ.get('AWS_ACCESS'),\n", + " \"secret\": os.environ.get('AWS_SECRET'),\n", + " \"remote_url\": os.environ.get('S3_REMOTE_URL'),\n", + " \"recursive\": True\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "pretty_print_model(source_response.source_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "fzp8gaSIwm0i", + "outputId": "96b0a09f-d695-49e1-986e-703037aa9e18" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.11/dist-packages/pydantic/functional_validators.py:218: UserWarning: Pydantic serializer warnings:\n", + " PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='s3', input_type=str])\n", + " 
function=lambda v, h: h(v),\n", + "/usr/local/lib/python3.11/dist-packages/pydantic/main.py:463: UserWarning: Pydantic serializer warnings:\n", + " PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='s3', input_type=str])\n", + " return self.__pydantic_serializer__.to_python(\n" + ] + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"anonymous\": false,\n", + " \"recursive\": true,\n", + " \"remote_url\": \"s3://ajay-uns-devrel-content/mm-agentic-rag/\",\n", + " \"key\": \"**********\",\n", + " \"secret\": \"**********\"\n", + " },\n", + " \"created_at\": \"2025-08-11T14:30:57.981194Z\",\n", + " \"id\": \"b517c8b2-d309-4c5b-9eb1-3578d1f848c8\",\n", + " \"name\": \"UnstructuredxPostgreSQL Source_\",\n", + " \"type\": \"s3\",\n", + " \"updated_at\": \"2025-08-11T14:30:58.256737Z\"\n", + "}\n" + ] + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.11/dist-packages/unstructured_client/models/shared/sourceconnectorinformation.py:182: UserWarning: Pydantic serializer warnings:\n", + " PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='s3', input_type=str])\n", + " serialized = handler(self)\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 4: Create the destination connector\n", + "\n", + "Run the following cell to create a [PostgreSQL destination connector](https://docs.unstructured.io/api-reference/workflow/destinations/postgresql).\n", + "\n", + "When setting up a destination connector, you'll need to:\n", + "- Give it a unique name.\n", + "- Specify its type (`postgres` in this case).\n", + "- Provide configuration details including your PostgreSQL instance and table settings.\n", + "\n", + "Your configuration must include:\n", + "- `host`: the hostname of your PostgreSQL instance\n", + "- `port`: the port number (usually `5432`)\n", 
+ "- `database`: the name of the database\n", + "- `username`: your PostgreSQL username\n", + "- `password`: your PostgreSQL password\n", + "- `table_name`: the name of the table where data will be inserted\n", + "- `batch_size`: number of rows to insert per operation\n" + ], + "metadata": { + "id": "LFaaWfdawZKQ" + } + }, + { + "cell_type": "code", + "source": [ + "\n", + "from unstructured_client.models.operations import CreateDestinationRequest\n", + "from unstructured_client.models.shared import CreateDestinationConnector\n", + "\n", + "destination_response = unstructured_client.destinations.create_destination(\n", + " request=CreateDestinationRequest(\n", + " create_destination_connector=CreateDestinationConnector(\n", + " name=\"UnstructuredxPostgreSQL Destination_\",\n", + " type=\"postgres\",\n", + " config={\n", + " \"host\": os.environ['PostgreSQL_host'],\n", + " \"database\": os.environ['PostgreSQL_database'],\n", + " \"port\": os.environ['PostgreSQL_port'],\n", + " \"username\": os.environ['PostgreSQL_username'],\n", + " \"password\": os.environ['PostgreSQL_password'],\n", + " \"table_name\": \"elements\",\n", + " \"batch_size\": 100\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "pretty_print_model(destination_response.destination_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cjHe50Y0xBHh", + "outputId": "aa277c5f-b31e-433d-8a87-9a1ab008190f" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"batch_size\": 100,\n", + " \"database\": \"uns_devrel_demo\",\n", + " \"host\": \"database-1.cmhg4iuswe98.us-east-1.rds.amazonaws.com\",\n", + " \"password\": \"**********\",\n", + " \"port\": 5432,\n", + " \"table_name\": \"elements\",\n", + " \"username\": \"postgres\"\n", + " },\n", + " \"created_at\": \"2025-08-18T13:32:24.631827Z\",\n", + " \"id\": \"c4b79641-20f5-45b2-afd6-fe3b9a4efc9d\",\n", 
+ " \"name\": \"UnstructuredxPostgreSQL Destination_\",\n", + " \"type\": \"postgres\",\n", + " \"updated_at\": \"2025-08-18T13:32:24.750518Z\"\n", + "}\n" + ] + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.11/dist-packages/unstructured_client/models/shared/destinationconnectorinformation.py:190: UserWarning: Pydantic serializer warnings:\n", + " PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [input_value='postgres', input_type=str])\n", + " serialized = handler(self)\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 5: Create the data processing workflow\n", + "\n", + "This step connects everything. We define a workflow that pulls files from S3, processes them through a sequence of transformations, and pushes the structured output to PostgreSQL.\n", + "\n", + "The pipeline includes three core operations:\n", + "\n", + "- **Partitioning**: Converts raw files like PDFs, DOCX, and HTML into structured elements — paragraphs, tables, titles, and more. We’re using the [`vlm` strategy](https://docs.unstructured.io/api-reference/partition/document-elements), which is well-suited for complex or layout-heavy documents.\n", + "\n", + "- **Chunking**: Splits longer content into smaller pieces to make records easier to query and work with downstream. We use [`chunk_by_title`](https://docs.unstructured.io/api-reference/partition/chunking#%E2%80%9Dby-title%E2%80%9D-chunking-strategy), which breaks text at logical section boundaries based on headers and character count.\n", + "\n", + "- **Embedding**: Adds a semantic vector to each chunk using Azure OpenAI’s `text-embedding-3-large`. Learn more about [embedding models and options](https://docs.unstructured.io/ui/embedding).\n", + "\n", + "Once defined, these steps are passed into `create_workflow`, which registers the entire pipeline. 
From there, documents can be processed end-to-end, from source to destination, with structured, enriched records landing in PostgreSQL.\n" + ], + "metadata": { + "id": "pCEO1jxjwb30" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.shared import (\n", + "    WorkflowNode,\n", + "    WorkflowType,\n", + "    Schedule\n", + ")\n", + "\n", + "# Partition raw files into structured elements using the VLM strategy\n", + "partition_node = WorkflowNode(\n", + "    name=\"Partitioner\",\n", + "    subtype=\"vlm\",\n", + "    type=\"partition\",\n", + "    settings={\n", + "        \"provider\": \"anthropic\",\n", + "        \"model\": \"claude-3-7-sonnet-20250219\",\n", + "    }\n", + ")\n", + "\n", + "# Split elements into chunks at section boundaries\n", + "chunk_node = WorkflowNode(\n", + "    name=\"Chunker\",\n", + "    subtype=\"chunk_by_title\",\n", + "    type=\"chunk\",\n", + "    settings={\n", + "        \"new_after_n_chars\": 1000,\n", + "        \"max_characters\": 4096,\n", + "        \"overlap\": 150\n", + "    }\n", + ")\n", + "\n", + "# Add an embedding vector to each chunk\n", + "embedder_node = WorkflowNode(\n", + "    name='Embedder',\n", + "    subtype='azure_openai',\n", + "    type=\"embed\",\n", + "    settings={\n", + "        'model_name': 'text-embedding-3-large'\n", + "    }\n", + ")\n", + "\n", + "\n", + "response = unstructured_client.workflows.create_workflow(\n", + "    request={\n", + "        \"create_workflow\": {\n", + "            \"name\": \"PostgreSQL Destination Tutorial Workflow_\",\n", + "            \"source_id\": source_response.source_connector_information.id,\n", + "            \"destination_id\": destination_response.destination_connector_information.id,\n", + "            \"workflow_type\": WorkflowType.CUSTOM,\n", + "            \"workflow_nodes\": [\n", + "                partition_node,\n", + "                chunk_node,\n", + "                embedder_node\n", + "            ]\n", + "        }\n", + "    }\n", + ")\n", + "\n", + "pretty_print_model(response.workflow_information)\n", + "workflow_id = response.workflow_information.id" + ], + "metadata": { + "id": "IGnCgfH2I2L8", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "c00ead35-8a39-46fb-90f2-ad1574e3b855" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": 
[ + "{\n", + " \"created_at\": \"2025-08-11T14:31:01.392505Z\",\n", + " \"destinations\": [\n", + " \"176b643b-f9f6-4b55-ac71-2a40f229f4e5\"\n", + " ],\n", + " \"id\": \"a84445dc-ddd1-4c68-acf6-482c2f211b40\",\n", + " \"name\": \"PostgreSQL Destination Tutorial Workflow_\",\n", + " \"sources\": [\n", + " \"b517c8b2-d309-4c5b-9eb1-3578d1f848c8\"\n", + " ],\n", + " \"status\": \"active\",\n", + " \"workflow_nodes\": [\n", + " {\n", + " \"name\": \"Partitioner\",\n", + " \"subtype\": \"vlm\",\n", + " \"type\": \"partition\",\n", + " \"id\": \"b84be967-2981-4d27-9e5c-1e59ebf1f751\",\n", + " \"settings\": {\n", + " \"provider\": \"anthropic\",\n", + " \"provider_api_key\": null,\n", + " \"model\": \"claude-3-7-sonnet-20250219\",\n", + " \"output_format\": \"text/html\",\n", + " \"prompt\": null,\n", + " \"format_html\": true,\n", + " \"unique_element_ids\": true,\n", + " \"is_dynamic\": false,\n", + " \"allow_fast\": true,\n", + " \"custom_host_config\": null\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_title\",\n", + " \"type\": \"chunk\",\n", + " \"id\": \"141a257d-3605-42fc-b964-699312ce07a7\",\n", + " \"settings\": {\n", + " \"unstructured_api_url\": null,\n", + " \"unstructured_api_key\": null,\n", + " \"multipage_sections\": false,\n", + " \"combine_text_under_n_chars\": null,\n", + " \"include_orig_elements\": false,\n", + " \"new_after_n_chars\": 1000,\n", + " \"max_characters\": 4096,\n", + " \"overlap\": 150,\n", + " \"overlap_all\": false,\n", + " \"contextual_chunking_strategy\": null\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Embedder\",\n", + " \"subtype\": \"azure_openai\",\n", + " \"type\": \"embed\",\n", + " \"id\": \"c19ca380-ee2b-43c2-a07d-17ace2d58e39\",\n", + " \"settings\": {\n", + " \"model_name\": \"text-embedding-3-large\"\n", + " }\n", + " }\n", + " ],\n", + " \"reprocess_all\": false,\n", + " \"schedule\": {\n", + " \"crontab_entries\": []\n", + " },\n", + " \"updated_at\": 
\"2025-08-11T14:31:01.409100Z\",\n", + " \"workflow_type\": \"custom\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 6: Run the workflow\n", + "\n", + "Run the following cell to start running the workflow." + ], + "metadata": { + "id": "tOIkt9GOwf6i" + } + }, + { + "cell_type": "code", + "source": [ + "res = unstructured_client.workflows.run_workflow(\n", + " request={\n", + " \"workflow_id\": workflow_id,\n", + " }\n", + ")\n", + "\n", + "pretty_print_model(res.job_information)" + ], + "metadata": { + "id": "7QLgPRt-JNYD", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "5f510ba6-8714-44de-eeb9-48b7f952c3e0" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-08-11T15:13:16.416810Z\",\n", + " \"id\": \"036eff90-49a3-46d0-ab97-be5541bfc32e\",\n", + " \"status\": \"SCHEDULED\",\n", + " \"workflow_id\": \"a84445dc-ddd1-4c68-acf6-482c2f211b40\",\n", + " \"workflow_name\": \"PostgreSQL Destination Tutorial Workflow_\",\n", + " \"job_type\": \"ephemeral\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 7: Get the workflow run's job ID\n", + "\n", + "Run the following cell to get the workflow run's job ID, which is needed to poll for job completion later. If successful, Unstructured prints the job's ID." 
+ ], + "metadata": { + "id": "ObIv1fHfwigb" + } + }, + { + "cell_type": "code", + "source": [ + "response = unstructured_client.jobs.list_jobs(\n", + "    request={\n", + "        \"workflow_id\": workflow_id\n", + "    }\n", + ")\n", + "\n", + "last_job = response.response_list_jobs[0]\n", + "job_id = last_job.id\n", + "print(f\"job_id: {job_id}\")" + ], + "metadata": { + "id": "LP5ZPuQJJgQp", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "2113cd82-e8e6-4646-a5ea-af2d5ec2d358" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "job_id: 036eff90-49a3-46d0-ab97-be5541bfc32e\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 8: Poll for job completion\n", + "\n", + "Run the following cell to confirm the job has finished running. If successful, Unstructured prints `\"status\": \"COMPLETED\"` within the information about the job." + ], + "metadata": { + "id": "hJoLbPwLJupD" + } + }, + { + "cell_type": "code", + "source": [ + "import time\n", + "\n", + "def poll_job_status(job_id, wait_time=30):\n", + "    # Poll until the job leaves the SCHEDULED and IN_PROGRESS states\n", + "    while True:\n", + "        response = unstructured_client.jobs.get_job(\n", + "            request={\n", + "                \"job_id\": job_id\n", + "            }\n", + "        )\n", + "\n", + "        job = response.job_information\n", + "\n", + "        if job.status == \"SCHEDULED\":\n", + "            print(f\"Job is scheduled, polling again in {wait_time} seconds...\")\n", + "            time.sleep(wait_time)\n", + "        elif job.status == \"IN_PROGRESS\":\n", + "            print(f\"Job is in progress, polling again in {wait_time} seconds...\")\n", + "            time.sleep(wait_time)\n", + "        elif job.status == \"COMPLETED\":\n", + "            print(\"Job is completed\")\n", + "            break\n", + "        else:\n", + "            # Any other status also ends polling; report it instead of claiming success\n", + "            print(f\"Job ended with status: {job.status}\")\n", + "            break\n", + "\n", + "    return job\n", + "\n", + "job = poll_job_status(job_id)\n", + "pretty_print_model(job)" + ], + "metadata": { + "id": "nEfi8Q_SJzuh", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "6f647b43-f5f2-4142-f6c2-eaa498efba8e" + }, + "execution_count": null, + "outputs": [ + { + "output_type": 
"stream", + "name": "stdout", + "text": [ + "Job is scheduled, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is completed\n", + "{\n", + " \"created_at\": \"2025-08-11T15:13:16.416810\",\n", + " \"id\": \"036eff90-49a3-46d0-ab97-be5541bfc32e\",\n", + " \"status\": \"COMPLETED\",\n", + " \"workflow_id\": \"a84445dc-ddd1-4c68-acf6-482c2f211b40\",\n", + " \"workflow_name\": \"PostgreSQL Destination Tutorial Workflow_\",\n", + " \"job_type\": \"ephemeral\",\n", + " \"runtime\": \"PT0S\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 9: View the processed data\n", + "\n", + "Once the job finishes, your documents will be processed and stored in PostgreSQL. Each row in the target table represents a structured element from the original files — including metadata, chunked text, and embeddings.\n", + "\n", + "You can query the data directly using SQL. 
For example:\n", + "\n", + "```sql\n", + "SELECT id, LEFT(text, 100) AS text_preview\n", + "FROM elements\n", + "LIMIT 5;\n", + "```\n", + "\n", + "\n", + "\n", + "![image.png]()" + ], + "metadata": { + "id": "605bniKrNfIY" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "DLptIOnHha-9" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/notebooks/Getting_Started_with_Unstructured_API_and_Redis.ipynb b/notebooks/Getting_Started_with_Unstructured_API_and_Redis.ipynb new file mode 100644 index 0000000..119c7ef --- /dev/null +++ b/notebooks/Getting_Started_with_Unstructured_API_and_Redis.ipynb @@ -0,0 +1,777 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Getting Started with Unstructured API and Redis\n" + ], + "metadata": { + "id": "Z8AvLcpmv8rE" + } + }, + { + "cell_type": "markdown", + "source": [ + "[Unstructured](https://unstructured.io) is an ETL+ platform for prepping unstructured data for GenAI pipelines. It lets you:\n", + "\n", + "* Connect to common enterprise data systems, including cloud storage (S3, Azure Blob), collaboration tools (Confluence, Dropbox), business apps (Salesforce, Jira, Zendesk), and databases (Databricks, Redis)\n", + "* Continuously ingest documents from those sources\n", + "* Standardize, enrich, chunk, and embed the content for downstream use\n", + "* Push the results into a vector store or database, in this case, **Redis**\n", + "\n", + "You can configure all of this through the Unstructured UI, the API, or directly from Python using their SDK.\n", + "\n", + "This notebook walks through setting up a complete data workflow using the Unstructured API. 
We’ll pull files from an S3 bucket, process them with a few key transformations, and send the output to Redis, making the data ready to be queried, searched, or integrated into a RAG pipeline.\n", + "\n", + "For source configuration, we’ll use S3 in this example, though you can swap in any supported [data source](https://docs.unstructured.io/api-reference/workflow/sources/overview). The destination will be a Redis instance configured to accept chunks via Unstructured’s [Redis connector](https://docs.unstructured.io/api-reference/workflow/destinations/redis).\n", + "\n", + "This is what the complete data processing pipeline will look like:\n", + "![image.png]()\n" + ], + "metadata": { + "id": "5wNPAx2QwCGG" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Prerequisites\n", + "\n", + "Before running this notebook, you’ll need a few things set up so that Unstructured can connect to your source (e.g. S3) and send processed data to Redis.\n", + "\n", + "\n", + "### Unstructured API Key\n", + "\n", + "Create an API key from the [Unstructured Platform](https://platform.unstructured.io)\n", + "\n", + "1. [Contact us](https://unstructured.io/enterprise) to get access or log in if you're already a user.\n", + "2. Once logged in, go to the **API Keys** section in the sidebar.\n", + "3. Click **New Key**, name it (e.g. `redis-workflow-key`), and copy it securely.\n", + "\n", + "---\n", + "\n", + "### AWS S3 (Source)\n", + "\n", + "We’ll use S3 to source raw documents.\n", + "\n", + "Make sure you have:\n", + "\n", + "- **AWS Access Key** and **Secret**\n", + "- An **S3 bucket URI**, like: `s3://your-bucket/`\n", + "- A few files uploaded — PDF, DOCX, HTML, etc. 
See [supported file types](https://docs.unstructured.io/getting-started/ingest/supported-file-types) for reference.\n", + "\n", + "Store these credentials securely in your notebook or environment variables.\n", + "\n", + "---\n", + "\n", + "### Redis (Destination)\n", + "\n", + "You’ll need access to a Redis instance. This can be hosted on Redis Cloud or self-managed.\n", + "\n", + "Required connection details:\n", + "\n", + "- **Hostname**\n", + "- **Port**\n", + "- **Username** (optional — depending on your setup)\n", + "- **Password**\n", + "- **Database number** (commonly `0`)\n", + "\n", + "If you’re using Redis Cloud, you can find these under your instance’s connection settings. Ensure your client can reach the Redis endpoint (some Redis setups restrict IPs or require SSL).\n", + "\n", + "You can also customize:\n", + "\n", + "- `batch_size`: controls how many documents are sent per write operation\n", + "- `ssl`: enable if using a secure connection\n", + "\n", + "For a full list of available options, see the [Redis destination documentation](https://docs.unstructured.io/api-reference/workflow/destinations/redis).\n", + "\n" + ], + "metadata": { + "id": "hgI9-3XowHgu" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Step 1: Install the Unstructured API Python SDK\n", + "\n", + "All functionality available in the UI of the [Unstructured](https://unstructured.io/) product is also available programmatically via the Unstructured API. You can interact with the Unstructured API either by sending direct requests via curl or Postman, or by using the Unstructured API [Python SDK](https://docs.unstructured.io/api-reference/workflow/overview#unstructured-python-sdk). Here, we'll be using the latter.\n", + "\n", + "\n", + "> **Note:**\n", + "The Unstructured API has two endpoints:\n", + "* The Unstructured Partition Endpoint: intended for rapid prototyping of Unstructured's various partitioning strategies. 
It works only with processing of local files, one file at a time.\n", + "* The Unstructured Workflow Endpoint: enables a full range of partitioning, chunking, embedding, and enrichment options for your data. It is designed to batch-process data from any data source to any destination. This is what we're using in this notebook.\n", + "\n", + "\n", + "Run the following cell to install the Unstructured API Python SDK." + ], + "metadata": { + "id": "E5CL5iVB0oIF" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install -U \"unstructured-client\"" + ], + "metadata": { + "id": "zp8T30s300Ky", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "outputId": "cc26e0b0-c65d-45b5-aeaf-d80ebd09eb4f" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting unstructured-client\n", + " Downloading unstructured_client-0.42.1-py3-none-any.whl.metadata (23 kB)\n", + "Requirement already satisfied: aiofiles>=24.1.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (24.1.0)\n", + "Requirement already satisfied: cryptography>=3.1 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (43.0.3)\n", + "Requirement already satisfied: httpcore>=1.0.9 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.9)\n", + "Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (0.28.1)\n", + "Requirement already satisfied: pydantic>=2.11.2 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (2.11.7)\n", + "Collecting pypdf>=4.0 (from unstructured-client)\n", + " Downloading pypdf-5.9.0-py3-none-any.whl.metadata (7.1 kB)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.0)\n", + "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.11/dist-packages (from 
cryptography>=3.1->unstructured-client) (1.17.1)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (2025.7.14)\n", + "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (0.16.0)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (4.9.0)\n", + "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (3.10)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (2.33.2)\n", + "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (4.14.1)\n", + "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.4.1)\n", + "Requirement already satisfied: requests<3.0.0,>=2.0.1 in /usr/local/lib/python3.11/dist-packages (from requests-toolbelt>=1.0.0->unstructured-client) (2.32.3)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.12->cryptography>=3.1->unstructured-client) (2.22)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (3.4.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (2.5.0)\n", + "Requirement already satisfied: sniffio>=1.1 in 
/usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.27.0->unstructured-client) (1.3.1)\n", + "Downloading unstructured_client-0.42.1-py3-none-any.whl (207 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.2/207.2 kB\u001b[0m \u001b[31m13.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pypdf-5.9.0-py3-none-any.whl (313 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m313.2/313.2 kB\u001b[0m \u001b[31m15.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: pypdf, unstructured-client\n", + "Successfully installed pypdf-5.9.0 unstructured-client-0.42.1\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 2: Set environment variables\n", + "\n", + "Fetching the values from Colab Secrets!" + ], + "metadata": { + "id": "PSLM5_I6wN2g" + } + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "from google.colab import userdata\n", + "\n", + "# Unstructured\n", + "os.environ['UNSTRUCTURED_API_KEY'] = userdata.get('UNSTRUCTURED_API_KEY')\n", + "\n", + "# AWS S3\n", + "os.environ['AWS_ACCESS'] = userdata.get('AWS_ACCESS')\n", + "os.environ['AWS_SECRET'] = userdata.get('AWS_SECRET')\n", + "os.environ['S3_REMOTE_URL'] = userdata.get('S3_REMOTE_URL')\n", + "\n", + "\n", + "# Redis Cloud\n", + "os.environ['REDIS_DATABASE_NAME'] = userdata.get('REDIS_DATABASE_NAME')\n", + "os.environ['REDIS_HOSTNAME'] = userdata.get('REDIS_HOSTNAME')\n", + "os.environ['REDIS_PORTNUMBER'] = userdata.get('REDIS_PORTNUMBER')\n", + "os.environ['REDIS_USERNAME'] = userdata.get('REDIS_USERNAME')\n", + "os.environ['REDIS_PASSWORD'] = userdata.get('REDIS_PASSWORD')\n", + "os.environ['REDIS_DATABASE_NUMBER'] = userdata.get('REDIS_DATABASE_NUMBER')\n", + "\n", + "\n" + ], + "metadata": { + "id": "iKEkEUBW0G2K" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# 
instantiate Unstructured Client\n", + "from unstructured_client import UnstructuredClient\n", + "\n", + "unstructured_client = UnstructuredClient(api_key_auth=os.environ[\"UNSTRUCTURED_API_KEY\"])\n", + "\n", + "# helper function\n", + "def pretty_print_model(response_model):\n", + " print(response_model.model_dump_json(indent=4))" + ], + "metadata": { + "id": "lNbpG7N01NW_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 3: Create the source connector\n", + "\n", + "Run the following cell to create an [AWS S3 source connector](https://docs.unstructured.io/api-reference/workflow/sources/s3).\n", + "\n", + "When setting up a source connector, you'll need to:\n", + "- Give it a unique name.\n", + "- Specify its type (`s3` in this case).\n", + "- Provide configuration details including your S3 location and credentials.\n", + "\n", + "Your configuration must include:\n", + "- `remote_url`: the URI to your S3 bucket or folder (e.g. 
`s3://my-bucket/data/`)\n", + "- `key`: your AWS access key\n", + "- `secret`: your AWS secret key\n", + "- `recursive`: (optional) whether to include files in subfolders" + ], + "metadata": { + "id": "78gSZnm3wVfw" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateSourceRequest\n", + "from unstructured_client.models.shared import CreateSourceConnector\n", + "\n", + "source_response = unstructured_client.sources.create_source(\n", + " request=CreateSourceRequest(\n", + " create_source_connector=CreateSourceConnector(\n", + " name=f\"Redis Tutorial Source_\",\n", + " type=\"s3\",\n", + " config={\n", + " \"key\": os.environ.get('AWS_ACCESS'),\n", + " \"secret\": os.environ.get('AWS_SECRET'),\n", + " \"remote_url\": os.environ.get('S3_REMOTE_URL'),\n", + " \"recursive\": True\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "pretty_print_model(source_response.source_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "fzp8gaSIwm0i", + "outputId": "01e08bd8-ed82-4184-b25b-83fee6c30563" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"anonymous\": false,\n", + " \"recursive\": true,\n", + " \"remote_url\": \"s3://ajay-uns-devrel-content/agentic-analysis/\",\n", + " \"key\": \"**********\",\n", + " \"secret\": \"**********\"\n", + " },\n", + " \"created_at\": \"2025-08-06T14:30:48.002626Z\",\n", + " \"id\": \"c7c6e187-35fe-4a0f-be8d-2f37ed3744ba\",\n", + " \"name\": \"Redis Tutorial Source_\",\n", + " \"type\": \"s3\",\n", + " \"updated_at\": \"2025-08-06T14:30:48.181506Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 4: Create the destination connector\n", + "\n", + "Now that your Redis credentials are available as environment variables, you're ready to create the destination connector.\n", + "\n", + "This step 
defines the **Redis destination**, specifying:\n", + "- A unique connector name\n", + "- The connector type (`redis`)\n", + "- Required connection parameters:\n", + " - `host`: Redis hostname (e.g. from Redis Cloud)\n", + " - `port`: connection port (default is usually 6379 or your Redis Cloud port)\n", + " - `username` and `password`: for authentication (Redis Cloud requires both)\n", + " - `database`: Redis logical database index (default is `0`)\n", + " - `ssl`: whether to use SSL for the connection (set to `True` for secure endpoints)\n", + " - `batch_size`: how many documents to push in each write operation\n", + "\n", + "These values are pulled from environment variables\n", + "\n", + "\n", + "\n", + "\n" + ], + "metadata": { + "id": "LFaaWfdawZKQ" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateDestinationRequest\n", + "from unstructured_client.models.shared import CreateDestinationConnector\n", + "\n", + "destination_response = unstructured_client.destinations.create_destination(\n", + " request=CreateDestinationRequest(\n", + " create_destination_connector=CreateDestinationConnector(\n", + " name=\"Redis Tutorial Destination_\",\n", + " type=\"redis\",\n", + " config={\n", + " \"database\": os.environ.get(\"REDIS_DATABASE_NUMBER\"),\n", + " \"ssl\": False,\n", + " \"batch_size\": 100,\n", + "\n", + " # For password authentication:\n", + " \"host\": os.environ.get(\"REDIS_HOSTNAME\"),\n", + " \"port\": os.environ.get(\"REDIS_PORTNUMBER\"),\n", + " \"username\": os.environ.get(\"REDIS_USERNAME\"),\n", + " \"password\": os.environ.get(\"REDIS_PASSWORD\")\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "pretty_print_model(destination_response.destination_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cjHe50Y0xBHh", + "outputId": "e06801a6-14ab-451c-dc49-ad200dad6e13" + }, + "execution_count": null, + "outputs": [ + { + 
"output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"batch_size\": 100,\n", + " \"database\": 0,\n", + " \"host\": \"redis-15786.c93.us-east-1-3.ec2.redns.redis-cloud.com\",\n", + " \"port\": 15786,\n", + " \"ssl\": false,\n", + " \"password\": \"**********\",\n", + " \"username\": \"default\"\n", + " },\n", + " \"created_at\": \"2025-08-06T14:30:48.527175Z\",\n", + " \"id\": \"697f180a-53af-454d-9cd9-4cfe6a03a623\",\n", + " \"name\": \"Redis Tutorial Destination_\",\n", + " \"type\": \"redis\",\n", + " \"updated_at\": \"2025-08-06T14:30:48.609871Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 5: Create the data processing workflow\n", + "\n", + "![image.png]()\n", + "\n", + "\n", + "Here’s where we define how documents move from source to destination. A workflow is just a chain of processing steps — each one represented by a `WorkflowNode`.\n", + "\n", + "In this setup, we’re using three nodes: `partition`, `chunk`, and `embed`.\n", + "\n", + "\n", + "### 🧩 Partition\n", + "\n", + "This always comes first. It takes in raw files — PDFs, Word docs, HTML — and turns them into structured JSON with metadata. We’re using the `vlm` strategy here, backed by Anthropic’s Claude 3.5, which works well for layout-heavy or noisy documents.\n", + "\n", + "→ [How partitioning works](https://docs.unstructured.io/api-reference/partition/document-elements)\n", + "\n", + "\n", + "\n", + "### ✂️ Chunk\n", + "\n", + "Next, we split that structured content into smaller chunks. This helps with retrieval later and keeps us under model token limits. The `chunk_by_title` strategy breaks things up based on document headers, with character limits and overlap settings for better continuity.\n", + "\n", + "→ [Chunking strategies explained](https://docs.unstructured.io/ui/chunking)\n", + "\n", + "\n", + "\n", + "### 📐 Embed\n", + "\n", + "Last step: embedding. 
Each chunk gets turned into a dense vector using Azure OpenAI’s `text-embedding-3-large`. These vectors are what we’ll send to Redis for downstream search or RAG.\n", + "\n", + "→ [Embedding models and config](https://docs.unstructured.io/ui/embedding)\n", + "\n", + "\n", + "All three nodes are passed into `create_workflow`, which registers the DAG and ties everything together: from S3 input all the way to Redis.\n" + ], + "metadata": { + "id": "pCEO1jxjwb30" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.shared import (\n", + " WorkflowNode,\n", + " WorkflowType,\n", + " Schedule\n", + ")\n", + "\n", + "partition_node = WorkflowNode(\n", + " name=\"Partitioner\",\n", + " subtype=\"vlm\",\n", + " type=\"partition\",\n", + " settings={\n", + " \"provider\": \"anthropic\",\n", + " \"model\": \"claude-3-7-sonnet-20250219\",\n", + " }\n", + " )\n", + "\n", + "chunk_node = WorkflowNode(\n", + " name=\"Chunker\",\n", + " subtype=\"chunk_by_title\",\n", + " type=\"chunk\",\n", + " settings={\n", + " \"new_after_n_chars\": 1000,\n", + " \"max_characters\": 4096,\n", + " \"overlap\": 150\n", + " }\n", + ")\n", + "\n", + "embedder_node = WorkflowNode(\n", + " name='Embedder',\n", + " subtype='azure_openai',\n", + " type=\"embed\",\n", + " settings={\n", + " 'model_name': 'text-embedding-3-large'\n", + " }\n", + " )\n", + "\n", + "\n", + "response = unstructured_client.workflows.create_workflow(\n", + " request={\n", + " \"create_workflow\": {\n", + " \"name\": f\"Redis Tutorial Workflow_\",\n", + " \"source_id\": source_response.source_connector_information.id,\n", + " \"destination_id\": destination_response.destination_connector_information.id,\n", + " \"workflow_type\": WorkflowType.CUSTOM,\n", + " \"workflow_nodes\": [\n", + " partition_node,\n", + " chunk_node,\n", + " embedder_node\n", + " ]\n", + " }\n", + " }\n", + ")\n", + "\n", + "pretty_print_model(response.workflow_information)\n", + "workflow_id = 
response.workflow_information.id" + ], + "metadata": { + "id": "IGnCgfH2I2L8", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "44dbe387-ca47-4edc-92cd-8ace37834f83" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-07-12T15:09:08.873461Z\",\n", + " \"destinations\": [\n", + " \"993093e2-b39a-4c8d-9269-a10897601b16\"\n", + " ],\n", + " \"id\": \"0d9bc496-44cb-4bac-8302-bd0438fd58da\",\n", + " \"name\": \"Redis Tutorial Workflow_\",\n", + " \"sources\": [\n", + " \"50477cd7-32df-4ee7-b0ed-09bf64299128\"\n", + " ],\n", + " \"status\": \"active\",\n", + " \"workflow_nodes\": [\n", + " {\n", + " \"name\": \"Partitioner\",\n", + " \"subtype\": \"vlm\",\n", + " \"type\": \"partition\",\n", + " \"id\": \"3a405db3-638e-47a7-9ef7-5e728e3e40e0\",\n", + " \"settings\": {\n", + " \"provider\": \"anthropic\",\n", + " \"provider_api_key\": null,\n", + " \"model\": \"claude-3-7-sonnet-20250219\",\n", + " \"output_format\": \"text/html\",\n", + " \"prompt\": null,\n", + " \"format_html\": true,\n", + " \"unique_element_ids\": true,\n", + " \"is_dynamic\": false,\n", + " \"allow_fast\": true\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_title\",\n", + " \"type\": \"chunk\",\n", + " \"id\": \"6f47169e-799a-409b-a353-dd2df4e13997\",\n", + " \"settings\": {\n", + " \"unstructured_api_url\": null,\n", + " \"unstructured_api_key\": null,\n", + " \"multipage_sections\": false,\n", + " \"combine_text_under_n_chars\": null,\n", + " \"include_orig_elements\": false,\n", + " \"new_after_n_chars\": 1000,\n", + " \"max_characters\": 4096,\n", + " \"overlap\": 150,\n", + " \"overlap_all\": false,\n", + " \"contextual_chunking_strategy\": null\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Embedder\",\n", + " \"subtype\": \"azure_openai\",\n", + " \"type\": \"embed\",\n", + " \"id\": 
\"975929a5-4e82-4549-abcd-65f6640e49ec\",\n", + " \"settings\": {\n", + " \"model_name\": \"text-embedding-3-large\"\n", + " }\n", + " }\n", + " ],\n", + " \"reprocess_all\": false,\n", + " \"schedule\": {\n", + " \"crontab_entries\": []\n", + " },\n", + " \"updated_at\": \"2025-07-12T15:09:08.887600Z\",\n", + " \"workflow_type\": \"custom\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 6: Run the workflow\n", + "\n", + "Run the following cell to start running the workflow." + ], + "metadata": { + "id": "tOIkt9GOwf6i" + } + }, + { + "cell_type": "code", + "source": [ + "res = unstructured_client.workflows.run_workflow(\n", + " request={\n", + " \"workflow_id\": workflow_id,\n", + " }\n", + ")\n", + "\n", + "pretty_print_model(res.job_information)" + ], + "metadata": { + "id": "7QLgPRt-JNYD", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "49b67720-0cf3-4716-ff96-bbcaa93656bf" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-07-12T15:09:12.501570Z\",\n", + " \"id\": \"38a93dc7-67bf-43b6-bea4-7c9a5ce212d6\",\n", + " \"status\": \"SCHEDULED\",\n", + " \"workflow_id\": \"0d9bc496-44cb-4bac-8302-bd0438fd58da\",\n", + " \"workflow_name\": \"Redis Tutorial Workflow_\",\n", + " \"job_type\": \"ephemeral\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 7: Get the workflow run's job ID\n", + "\n", + "Run the following cell to get the workflow run's job ID, which is needed to poll for job completion later. If successful, Unstructured prints the job's ID." 
+ ], + "metadata": { + "id": "ObIv1fHfwigb" + } + }, + { + "cell_type": "code", + "source": [ + "response = unstructured_client.jobs.list_jobs(\n", + " request={\n", + " \"workflow_id\": workflow_id\n", + " }\n", + ")\n", + "\n", + "last_job = response.response_list_jobs[0]\n", + "job_id = last_job.id\n", + "print(f\"job_id: {job_id}\")" + ], + "metadata": { + "id": "LP5ZPuQJJgQp", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "5b3dc205-c557-4629-8456-f81a4a26a66e" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "job_id: 38a93dc7-67bf-43b6-bea4-7c9a5ce212d6\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 8: Poll for job completion\n", + "\n", + "Run the following cell to confirm the job has finished running. If successful, Unstructured prints `\"status\": \"COMPLETED\"` within the information about the job." + ], + "metadata": { + "id": "hJoLbPwLJupD" + } + }, + { + "cell_type": "code", + "source": [ + "import time\n", + "\n", + "def poll_job_status(job_id, wait_time=30):\n", + " while True:\n", + " response = unstructured_client.jobs.get_job(\n", + " request={\n", + " \"job_id\": job_id\n", + " }\n", + " )\n", + "\n", + " job = response.job_information\n", + "\n", + " if job.status == \"SCHEDULED\":\n", + " print(f\"Job is scheduled, polling again in {wait_time} seconds...\")\n", + " time.sleep(wait_time)\n", + " elif job.status == \"IN_PROGRESS\":\n", + " print(f\"Job is in progress, polling again in {wait_time} seconds...\")\n", + " time.sleep(wait_time)\n", + " else:\n", + " print(\"Job is completed\")\n", + " break\n", + "\n", + " return job\n", + "\n", + "job = poll_job_status(job_id)\n", + "pretty_print_model(job)" + ], + "metadata": { + "id": "nEfi8Q_SJzuh", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "74ff9007-c7c6-4258-8f6c-43c1e507ab93" + }, + "execution_count": null, + "outputs": [ + { + "output_type": 
"stream", + "name": "stdout", + "text": [ + "Job is scheduled, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is completed\n", + "{\n", + " \"created_at\": \"2025-07-12T15:09:12.501570\",\n", + " \"id\": \"38a93dc7-67bf-43b6-bea4-7c9a5ce212d6\",\n", + " \"status\": \"COMPLETED\",\n", + " \"workflow_id\": \"0d9bc496-44cb-4bac-8302-bd0438fd58da\",\n", + " \"workflow_name\": \"Redis Tutorial Workflow_\",\n", + " \"job_type\": \"ephemeral\",\n", + " \"runtime\": \"PT0S\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 9: View the processed data\n", + "\n", + "Once the job finishes, your documents will be processed and stored in Redis as key-value pairs — one key per chunk. 
Each value holds the structured JSON output, ready to be retrieved for semantic search or downstream use.\n", + "\n", + "You can open Redis Insight to browse and inspect the inserted data directly.\n", + "\n", + "\n", + "![image.png]()" + ], + "metadata": { + "id": "605bniKrNfIY" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "DLptIOnHha-9" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/notebooks/Preserving_Table_Structure_for_Better_Retrieval.ipynb b/notebooks/Preserving_Table_Structure_for_Better_Retrieval.ipynb new file mode 100644 index 0000000..4d9b5bf --- /dev/null +++ b/notebooks/Preserving_Table_Structure_for_Better_Retrieval.ipynb @@ -0,0 +1,1191 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "## 📄 Preserving Table Structure for Better Retrieval\n", + "\n", + "Financial documents like 10-Qs and earnings releases often contain important data inside tables. 
These tables are usually embedded in PDFs, which makes them hard to extract and even harder to query later.\n", + "\n", + "In this notebook, we'll build a pipeline to process those documents and preserve the tabular structure in a way that's usable by downstream applications.\n", + "\n", + "Specifically, we'll:\n", + "\n", + "- Use **Unstructured** to extract tables from PDFs and represent them as HTML\n", + "- Store the structured HTML chunks in **AstraDB** with OpenAI embeddings\n", + "- Run semantic queries to fetch tables related to a question\n", + "- Display the tables as HTML to preserve layout and formatting\n", + "\n", + "The goal here is not generation or summarization; it's to keep the table structure intact so it can be used by other systems, agents, or UI components later on.\n", + "\n", + "This approach is useful when working with data where much of the critical information is stored in tables, and structure needs to be preserved for any downstream use.\n" + ], + "metadata": { + "id": "q6YEKxii0WgP" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Preparing the Data\n", + "To prepare data for retrieval, we need to first break down the raw PDFs into structured chunks. This step is foundational for any RAG pipeline, and it’s where [Unstructured](https://unstructured.io) comes in.\n", + "\n", + "The Unstructured API lets us:\n", + "- Extract clean, structured content from any document.\n", + "- Generate metadata, chunk text, and prep it for downstream applications." + ], + "metadata": { + "id": "Liy5kmr7DKp7" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Setting Up the Unstructured Client\n", + "\n", + "Before we can begin parsing documents, we need to set up access to the [Unstructured platform](https://unstructured.io). 
The Unstructured Platform API allows us to programmatically process documents, extract structured elements, and prepare them for chunking and embedding, all from within this notebook.\n", + "\n", + "[Contact us](https://unstructured.io/enterprise) to get access or log in if you're already a user.\n" + ], + "metadata": { + "id": "-IDg_hZ-Dqzo" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1yEFnvkQGcSV", + "outputId": "0e12201d-b3d2-47fd-a71e-c0dcd43349ba", + "collapsed": true + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting unstructured-client\n", + " Downloading unstructured_client-0.42.1-py3-none-any.whl.metadata (23 kB)\n", + "Requirement already satisfied: aiofiles>=24.1.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (24.1.0)\n", + "Requirement already satisfied: cryptography>=3.1 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (43.0.3)\n", + "Requirement already satisfied: httpcore>=1.0.9 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.9)\n", + "Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (0.28.1)\n", + "Requirement already satisfied: pydantic>=2.11.2 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (2.11.7)\n", + "Collecting pypdf>=4.0 (from unstructured-client)\n", + " Downloading pypdf-5.9.0-py3-none-any.whl.metadata (7.1 kB)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.0)\n", + "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.11/dist-packages (from cryptography>=3.1->unstructured-client) (1.17.1)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) 
(2025.7.14)\n", + "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (0.16.0)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (4.9.0)\n", + "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (3.10)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (2.33.2)\n", + "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (4.14.1)\n", + "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.4.1)\n", + "Requirement already satisfied: requests<3.0.0,>=2.0.1 in /usr/local/lib/python3.11/dist-packages (from requests-toolbelt>=1.0.0->unstructured-client) (2.32.3)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.12->cryptography>=3.1->unstructured-client) (2.22)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (3.4.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (2.5.0)\n", + "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.27.0->unstructured-client) (1.3.1)\n", + "Downloading unstructured_client-0.42.1-py3-none-any.whl (207 kB)\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.2/207.2 kB\u001b[0m \u001b[31m8.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pypdf-5.9.0-py3-none-any.whl (313 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m313.2/313.2 kB\u001b[0m \u001b[31m16.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hInstalling collected packages: pypdf, unstructured-client\n", + "Successfully installed pypdf-5.9.0 unstructured-client-0.42.1\n" + ] + } + ], + "source": [ + "!pip install -U \"unstructured-client\"" + ] + }, + { + "cell_type": "markdown", + "source": [ + "If you haven’t already:\n", + "1. Log in.\n", + "2. In the sidebar, go to **API Keys**.\n", + "3. Click **New Key**, give it a name, and copy the key.\n" + ], + "metadata": { + "id": "es7EaldpEtj7" + } + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "import time\n", + "from google.colab import userdata\n", + "from unstructured_client import UnstructuredClient" + ], + "metadata": { + "id": "mUUQ6TnbE-R-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Fetching the keys from Colab Secrets!" 
+ ], + "metadata": { + "id": "8vH2wwCAF2mF" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ['UNSTRUCTURED_API_KEY'] = userdata.get(\"UNSTRUCTURED_API_KEY\")\n", + "client = UnstructuredClient(api_key_auth=os.getenv(\"UNSTRUCTURED_API_KEY\"))" + ], + "metadata": { + "id": "wUG_gKfH0d9R" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# utility for inspecting responses in a readable way\n", + "def pretty_print_model(response_model):\n", + " print(response_model.model_dump_json(indent=4))" + ], + "metadata": { + "id": "aAgNx9Ej0cjz" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Setting up the S3 Source Connector\n", + "\n", + "\n", + "We’ll start by connecting to our document source on S3.\n", + "\n", + "This example uses **AWS access key and secret** for authentication. You'll want to save the following secrets in Colab beforehand:\n", + "\n", + "- `AWS_ACCESS`\n", + "- `AWS_SECRET`\n", + "- `S3_REMOTE_URL` — the URI to your S3 bucket, e.g.:\n", + " - `s3://my-bucket/` (for root-level files)\n", + " - `s3://my-bucket/my-folder/` (for nested folders)\n", + "\n", + "Once those are set, we can spin up a source connector that points to your remote content.\n", + "\n", + "Need more auth options? [Check the docs](https://docs.unstructured.io/api-reference/workflow/sources/s3)." 
+ ], + "metadata": { + "id": "RXvczp6t0h_S" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ['AWS_ACCESS'] = userdata.get('AWS_ACCESS')\n", + "os.environ['AWS_SECRET'] = userdata.get('AWS_SECRET')\n", + "os.environ['S3_REMOTE_URL'] = userdata.get('S3_REMOTE_URL')" + ], + "metadata": { + "id": "gebtI0S20fum" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateSourceRequest\n", + "from unstructured_client.models.shared import CreateSourceConnector\n", + "\n", + "\n", + "source_response = client.sources.create_source(\n", + " request=CreateSourceRequest(\n", + " create_source_connector=CreateSourceConnector(\n", + " name=\"Table Analysis Source_\",\n", + " type=\"s3\",\n", + " config={\n", + " \"key\": os.environ.get('AWS_ACCESS'),\n", + " \"secret\": os.environ.get('AWS_SECRET'),\n", + " \"remote_url\": os.environ.get('S3_REMOTE_URL'),\n", + " \"recursive\": True\n", + " }\n", + " )\n", + " )\n", + " )" + ], + "metadata": { + "id": "v-HJOSLTnfw6" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateSourceRequest\n", + "from unstructured_client.models.shared import (\n", + " CreateSourceConnector,\n", + " SourceConnectorType,\n", + " S3SourceConnectorConfigInput\n", + ")\n", + "\n", + "source_response = client.sources.create_source(\n", + " request=CreateSourceRequest(\n", + " create_source_connector=CreateSourceConnector(\n", + " name=f\"Table Analysis Source_\",\n", + " type=SourceConnectorType.S3,\n", + " config=S3SourceConnectorConfigInput(\n", + " key=os.environ.get('AWS_ACCESS'),\n", + " secret=os.environ.get('AWS_SECRET'),\n", + " remote_url=os.environ.get('S3_REMOTE_URL'),\n", + " recursive=True\n", + " )\n", + " )\n", + " )\n", + ")" + ], + "metadata": { + "id": "8GvTF2jI0lED" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": 
"code", + "source": [ + "pretty_print_model(source_response.source_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "os1_javB0mQT", + "outputId": "043125f8-fad3-435e-abc0-fd2927565a00" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"anonymous\": false,\n", + " \"recursive\": true,\n", + " \"remote_url\": \"s3://ajay-uns-devrel-content/agentic-analysis/\",\n", + " \"key\": \"**********\",\n", + " \"secret\": \"**********\"\n", + " },\n", + " \"created_at\": \"2025-08-06T13:43:31.376615Z\",\n", + " \"id\": \"de99e386-d13d-4450-8c96-928a1b350a41\",\n", + " \"name\": \"Table Analysis Source_\",\n", + " \"type\": \"s3\",\n", + " \"updated_at\": \"2025-08-06T13:43:31.527712Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Setting up the AstraDB Destination Connector\n", + "\n", + "With our documents flowing in from S3, let’s define where they’ll be stored after processing.\n", + "\n", + "For this, we’ll use **Astra DB**, a serverless vector database that integrates cleanly with Unstructured’s pipeline. All chunked + embedded records will be pushed directly into an Astra collection for retrieval and querying later.\n", + "\n", + "\n", + "Make sure the following secrets are set in your Colab environment:\n", + "\n", + "- `ASTRA_DB_API_ENDPOINT` — the database’s REST API endpoint \n", + "- `ASTRA_DB_APPLICATION_TOKEN` — app token with write access \n", + "- `ASTRA_DB_COLLECTION_NAME` — name of the collection inside the keyspace \n", + "- `ASTRA_DB_KEYSPACE` — the keyspace where this collection will live\n", + "\n", + "> 🧠 If the collection doesn't exist yet, Unstructured will create one at runtime. 
\n", + "> Just make sure your embedding model dimensions match the collection’s config.\n", + "\n", + "We’ll now create the destination connector that sends data from Unstructured → Astra. For detailed information on the credentials and arguments, refer to [this doc](https://docs.unstructured.io/api-reference/workflow/destinations/astradb)." + ], + "metadata": { + "id": "XQgI4qJr0oFC" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ['ASTRA_DB_API_ENDPOINT'] = userdata.get('ASTRA_DB_API_ENDPOINT')\n", + "os.environ['ASTRA_DB_APPLICATION_TOKEN'] = userdata.get('ASTRA_DB_APPLICATION_TOKEN')\n", + "os.environ['ASTRA_DB_COLLECTION_NAME'] = userdata.get('ASTRA_DB_COLLECTION_NAME')\n", + "os.environ['ASTRA_DB_KEYSPACE'] = userdata.get('ASTRA_DB_KEYSPACE')" + ], + "metadata": { + "id": "J8eBYS48G734" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateDestinationRequest\n", + "from unstructured_client.models.shared import CreateDestinationConnector\n", + "\n", + "destination_response = client.destinations.create_destination(\n", + "    request=CreateDestinationRequest(\n", + "        create_destination_connector=CreateDestinationConnector(\n", + "            name=\"Table Analysis Destination_\",\n", + "            type=\"astradb\",\n", + "            config={\n", + "                \"token\": os.environ.get('ASTRA_DB_APPLICATION_TOKEN'),\n", + "                \"api_endpoint\": os.environ.get('ASTRA_DB_API_ENDPOINT'),\n", + "                \"collection_name\": os.environ.get('ASTRA_DB_COLLECTION_NAME'),\n", + "                \"keyspace\": os.environ.get('ASTRA_DB_KEYSPACE'),\n", + "                \"batch_size\": 20,\n", + "                \"flatten_metadata\": False\n", + "            }\n", + "        )\n", + "    )\n", + ")" + ], + "metadata": { + "id": "mEI9o7iroQvx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "pretty_print_model(destination_response.destination_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "q2sVCeSjodGs", + "outputId": "3986de16-018f-4a6c-8cc9-20cce2fd6615" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"api_endpoint\": \"**********\",\n", + " \"batch_size\": 20,\n", + " \"collection_name\": \"uns_demo_1\",\n", + " \"token\": \"**********\",\n", + " \"keyspace\": \"demo2\"\n", + " },\n", + " \"created_at\": \"2025-08-06T13:45:59.454874Z\",\n", + " \"id\": \"293f4a19-5bc0-4868-a0d0-7bde3b249e76\",\n", + " \"name\": \"Table Analysis Destination_\",\n", + " \"type\": \"astradb\",\n", + " \"updated_at\": \"2025-08-06T13:45:59.614155Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateDestinationRequest\n", + "from unstructured_client.models.shared import (\n", + "    CreateDestinationConnector,\n", + "    DestinationConnectorType,\n", + "    AstraDBConnectorConfigInput\n", + ")\n", + "\n", + "# Typed alternative to the dict-based cell above; field names are assumed to mirror the dict keys\n", + "destination_response = client.destinations.create_destination(\n", + "    request=CreateDestinationRequest(\n", + "        create_destination_connector=CreateDestinationConnector(\n", + "            name=\"Table Analysis Destination_\",\n", + "            type=\"astradb\",\n", + "            config=AstraDBConnectorConfigInput(\n", + "                token=os.environ.get('ASTRA_DB_APPLICATION_TOKEN'),\n", + "                api_endpoint=os.environ.get('ASTRA_DB_API_ENDPOINT'),\n", + "                collection_name=os.environ.get('ASTRA_DB_COLLECTION_NAME'),\n", + "                keyspace=os.environ.get('ASTRA_DB_KEYSPACE'),\n", + "                batch_size=20,\n", + "                flatten_metadata=False\n", + "            )\n", + "        )\n", + "    )\n", + ")" + ], + "metadata": { + "id": "o2I_h7rlG-T4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we’ll wire everything together into a full document processing workflow.\n" + ], + "metadata": { + "id": "ZIy3Y1gEGpOj" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Creating the Processing Workflow\n", + "\n", + "In this version of the pipeline, we’re not just cleaning up PDFs; we’re enriching them with useful metadata and semantic summaries before embedding.\n", + "\n", + "This workflow introduces multiple processing stages beyond the standard partition–chunk–embed stack:\n", 
+ "\n", + "\n", + "### What’s in the Flow\n", + "\n", + "- **Partitioner** \n", + " Extracts high-fidelity document elements using Unstructured’s `hi_res` strategy. Also enables table structure inference and captures image/table blocks explicitly.\n", + "\n", + "- **Image Summarizer** \n", + " Uses OpenAI to generate captions for extracted images.\n", + "\n", + "- **Table Summarizer** \n", + " Summarizes the structure and contents of tables using Anthropic's model.\n", + "\n", + "- **Chunker** \n", + " Segments the enriched elements into overlapping, title-aware sections — ideal for focused retrieval.\n", + "\n", + "- **Embedder** \n", + " Converts each chunk into a vector using Azure-hosted `text-embedding-3-large` for semantic search downstream.\n", + "\n", + "\n", + "\n", + "This is a great setup when you want your RAG system to reason over structured documents that include **tables**, **figures**, or **formal language** — especially when entity extraction and summarization matter.\n", + "\n", + "You can modify or swap any node depending on your retrieval goals. 
For more node types, check the [Unstructured Concepts Guide](https://docs.unstructured.io/ui/document-elements).\n" + ], + "metadata": { + "id": "onYT6ODu0uSp" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.shared import (\n", + "    WorkflowNode,\n", + "    WorkflowType,\n", + "    Schedule\n", + ")\n", + "\n", + "partition_node = WorkflowNode(\n", + "    name=\"Partitioner\",\n", + "    subtype=\"unstructured_api\",\n", + "    type=\"partition\",\n", + "    settings={\n", + "        \"strategy\": \"hi_res\",\n", + "        \"pdf_infer_table_structure\": True,\n", + "        \"extract_image_block_types\": [\n", + "            \"Image\",\n", + "            \"Table\"\n", + "        ],\n", + "        \"infer_table_structure\": True,\n", + "    }\n", + ")\n", + "\n", + "\n", + "image_summarizer_node = WorkflowNode(\n", + "    name=\"Image summarizer\",\n", + "    subtype=\"openai_image_description\",\n", + "    type=\"prompter\",\n", + "    settings={}\n", + ")\n", + "\n", + "table_summarizer_node = WorkflowNode(\n", + "    name=\"Table summarizer\",\n", + "    subtype=\"anthropic_table_description\",\n", + "    type=\"prompter\",\n", + "    settings={}\n", + ")\n", + "\n", + "\n", + "chunk_node = WorkflowNode(\n", + "    name=\"Chunker\",\n", + "    subtype=\"chunk_by_title\",\n", + "    type=\"chunk\",\n", + "    settings={\n", + "        \"new_after_n_chars\": 1000,\n", + "        \"max_characters\": 4096,\n", + "        \"overlap\": 150\n", + "    }\n", + ")\n", + "\n", + "embedder_node = WorkflowNode(\n", + "    name='Embedder',\n", + "    subtype='azure_openai',\n", + "    type=\"embed\",\n", + "    settings={\n", + "        'model_name': 'text-embedding-3-large'\n", + "    }\n", + ")\n", + "\n", + "\n", + "\n", + "response = client.workflows.create_workflow(\n", + "    request={\n", + "        \"create_workflow\": {\n", + "            \"name\": f\"s3-to-astra Agentic Analysis {time.time()}\",\n", + "            \"source_id\": source_response.source_connector_information.id,\n", + "            \"destination_id\": destination_response.destination_connector_information.id,\n", + "            \"workflow_type\": WorkflowType.CUSTOM,\n", + 
"            \"workflow_nodes\": [\n", + "                partition_node,\n", + "                image_summarizer_node,\n", + "                table_summarizer_node,\n", + "                chunk_node,\n", + "                embedder_node\n", + "            ],\n", + "        }\n", + "    }\n", + ")\n", + "\n", + "workflow_id = response.workflow_information.id\n", + "pretty_print_model(response.workflow_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0D2J-NnNJ6lo", + "outputId": "d1cc3655-2c6d-4f16-b715-e89e22a23684" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-08-06T13:46:25.880580Z\",\n", + " \"destinations\": [\n", + " \"293f4a19-5bc0-4868-a0d0-7bde3b249e76\"\n", + " ],\n", + " \"id\": \"346175ce-c147-4fa3-b9e0-9b6df8f3fead\",\n", + " \"name\": \"s3-to-astra Agentic Analysis 1754487985.6931396\",\n", + " \"sources\": [\n", + " \"de99e386-d13d-4450-8c96-928a1b350a41\"\n", + " ],\n", + " \"status\": \"active\",\n", + " \"workflow_nodes\": [\n", + " {\n", + " \"name\": \"Partitioner\",\n", + " \"subtype\": \"unstructured_api\",\n", + " \"type\": \"partition\",\n", + " \"id\": \"121bddeb-3992-4b85-a434-84dfaa154496\",\n", + " \"settings\": {\n", + " \"strategy\": \"hi_res\",\n", + " \"include_page_breaks\": false,\n", + " \"pdf_infer_table_structure\": true,\n", + " \"exclude_elements\": null,\n", + " \"xml_keep_tags\": false,\n", + " \"encoding\": \"utf-8\",\n", + " \"ocr_languages\": [\n", + " \"eng\"\n", + " ],\n", + " \"extract_image_block_types\": [\n", + " \"Image\",\n", + " \"Table\"\n", + " ],\n", + " \"infer_table_structure\": true\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Table summarizer\",\n", + " \"subtype\": \"anthropic_table_description\",\n", + " \"type\": \"prompter\",\n", + " \"id\": \"017ce153-6a4c-44c2-a82e-adaa6ed09e72\",\n", + " \"settings\": {}\n", + " },\n", + " {\n", + " \"name\": \"Image summarizer\",\n", + " \"subtype\": \"openai_image_description\",\n", + " \"type\": \"prompter\",\n", + 
" \"id\": \"6b2c909a-809f-4b5f-9680-5f79ea05a2ac\",\n", + " \"settings\": {}\n", + " },\n", + " {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_title\",\n", + " \"type\": \"chunk\",\n", + " \"id\": \"a426b969-1b41-4be0-841d-7bb191fde8d9\",\n", + " \"settings\": {\n", + " \"unstructured_api_url\": null,\n", + " \"unstructured_api_key\": null,\n", + " \"multipage_sections\": false,\n", + " \"combine_text_under_n_chars\": null,\n", + " \"include_orig_elements\": false,\n", + " \"new_after_n_chars\": 1000,\n", + " \"max_characters\": 4096,\n", + " \"overlap\": 150,\n", + " \"overlap_all\": false,\n", + " \"contextual_chunking_strategy\": null\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Embedder\",\n", + " \"subtype\": \"azure_openai\",\n", + " \"type\": \"embed\",\n", + " \"id\": \"e6ad4210-fd34-4377-9766-0965e63a6211\",\n", + " \"settings\": {\n", + " \"model_name\": \"text-embedding-3-large\"\n", + " }\n", + " }\n", + " ],\n", + " \"reprocess_all\": false,\n", + " \"schedule\": {\n", + " \"crontab_entries\": []\n", + " },\n", + " \"updated_at\": \"2025-08-06T13:46:25.897091Z\",\n", + " \"workflow_type\": \"custom\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Running the workflow\n", + "\n", + "Now that we've defined how we want to process our documentation, let's start the workflow and wait for it to complete:" + ], + "metadata": { + "id": "CTrsjVhIqvTN" + } + }, + { + "cell_type": "code", + "source": [ + "res = client.workflows.run_workflow(\n", + " request={\n", + " \"workflow_id\": workflow_id,\n", + " }\n", + ")\n", + "\n", + "pretty_print_model(res.job_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "fuX5R8p37Src", + "outputId": "b3320bec-768e-4170-d0d7-e1093eedf90b" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-08-06T13:46:35.091970Z\",\n", + " \"id\": 
\"ecddf3fc-0acc-4aca-a2e5-140bf7911b42\",\n", + " \"status\": \"SCHEDULED\",\n", + " \"workflow_id\": \"346175ce-c147-4fa3-b9e0-9b6df8f3fead\",\n", + " \"workflow_name\": \"s3-to-astra Agentic Analysis 1754487985.6931396\",\n", + " \"job_type\": \"ephemeral\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "response = client.jobs.list_jobs(\n", + " request={\n", + " \"workflow_id\": workflow_id\n", + " }\n", + ")\n", + "\n", + "last_job = response.response_list_jobs[0]\n", + "job_id = last_job.id\n", + "print(f\"job_id: {job_id}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Cxc3G6DS7WDR", + "outputId": "29dc5587-93d6-46f0-e27d-0ecb8d0049f7" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "job_id: ecddf3fc-0acc-4aca-a2e5-140bf7911b42\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now that we've created and started a job, we can poll Unstructured's `get_job` endpoint and check for its status every 30s till completion" + ], + "metadata": { + "id": "A1zTfzanq4Vf" + } + }, + { + "cell_type": "code", + "source": [ + "import time\n", + "\n", + "def poll_job_status(job_id, wait_time=30):\n", + " while True:\n", + " response = client.jobs.get_job(\n", + " request={\n", + " \"job_id\": job_id\n", + " }\n", + " )\n", + "\n", + " job = response.job_information\n", + "\n", + " if job.status == \"SCHEDULED\":\n", + " print(f\"Job is scheduled, polling again in {wait_time} seconds...\")\n", + " time.sleep(wait_time)\n", + " elif job.status == \"IN_PROGRESS\":\n", + " print(f\"Job is in progress, polling again in {wait_time} seconds...\")\n", + " time.sleep(wait_time)\n", + " else:\n", + " print(\"Job is completed\")\n", + " break\n", + "\n", + " return job\n", + "\n", + "job = poll_job_status(job_id)\n", + "pretty_print_model(job)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"7-DJdmDQ7nki", + "outputId": "7fd3a33d-0513-4a92-fd8f-310d75cb31e8" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is completed\n", + "{\n", + " \"created_at\": \"2025-08-06T13:46:35.091970\",\n", + " \"id\": \"ecddf3fc-0acc-4aca-a2e5-140bf7911b42\",\n", + " \"status\": \"COMPLETED\",\n", + " \"workflow_id\": \"346175ce-c147-4fa3-b9e0-9b6df8f3fead\",\n", + " \"workflow_name\": \"s3-to-astra Agentic Analysis 1754487985.6931396\",\n", + " \"job_type\": \"ephemeral\",\n", + " \"runtime\": \"PT0S\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "With everything embedded and indexed, we're ready to query. Our goal now: pull the most relevant tables from financial documents and surface them cleanly." 
+ ], + "metadata": { + "id": "iMQbRKJ1NBwX" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Retrieval 🧠\n", + "\n", + "Now that our financial documents are processed and indexed, we can start querying for relevant tables.\n", + "\n", + "This section is simple: we take a natural language question, search across all embedded tables, and display the ones that match best.\n" + ], + "metadata": { + "id": "Y7imZgkEBOXn" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install -qU langchain-astradb langchain-openai openai\n" + ], + "metadata": { + "id": "5iVzjyqQOGkl", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "dfcbd7d8-fe31-42f7-d292-c3705ff84f95" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m61.0/61.0 kB\u001b[0m \u001b[31m4.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m69.7/69.7 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m70.6/70.6 kB\u001b[0m \u001b[31m6.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m767.8/767.8 kB\u001b[0m \u001b[31m26.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m300.5/300.5 kB\u001b[0m \u001b[31m23.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.5/2.5 MB\u001b[0m \u001b[31m76.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m18.3/18.3 MB\u001b[0m \u001b[31m83.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + 
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m45.2/45.2 kB\u001b[0m \u001b[31m3.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.4/1.4 MB\u001b[0m \u001b[31m67.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m313.6/313.6 kB\u001b[0m \u001b[31m24.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m50.9/50.9 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= \"3.9\", but you have numpy 1.26.4 which is incompatible.\n", + "opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= \"3.9\", but you have numpy 1.26.4 which is incompatible.\n", + "opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= \"3.9\", but you have numpy 1.26.4 which is incompatible.\n", + "thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "from langchain.schema import Document\n", + "from langchain_astradb.vectorstores import AstraDBVectorStore\n", + "\n", + "import openai\n", + "from astrapy import DataAPIClient\n", + "\n" + ], + "metadata": { + "id": "DXkze2Y6j9gU" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We’ll use OpenAI’s `text-embedding-3-large` to embed queries and compare them with our document chunks.\n", + "\n", + "Next, we connect to **Astra DB** to access all 
the embedded tables.\n" + ], + "metadata": { + "id": "AmF8sVNqrnXi" + } + }, + { + "cell_type": "code", + "source": [ + "# Initialize Embedding Model\n", + "os.environ['OPENAI_API_KEY'] = userdata.get(\"OPENAI_API_KEY\")\n", + "openai_client = openai.OpenAI(api_key=os.environ[\"OPENAI_API_KEY\"])\n", + "embedding_model = \"text-embedding-3-large\"\n", + "\n", + "# Connect to DataStax Astra DB directly with the astrapy Data API client\n", + "astra_client = DataAPIClient(os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"])\n", + "database = astra_client.get_database(os.environ[\"ASTRA_DB_API_ENDPOINT\"])\n", + "COLLECTION = database.get_collection(\n", + "    name=os.environ[\"ASTRA_DB_COLLECTION_NAME\"],\n", + "    keyspace=os.environ[\"ASTRA_DB_KEYSPACE\"]\n", + ")\n", + "print(\"Connected to AstraDB vector store\")\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mqswAdnDiUMV", + "outputId": "167ec30a-c382-4b47-b001-092d91e1b2d3" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Connected to AstraDB vector store\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "We also define an `AgentState` object to hold:\n", + "- the user query (`user_input`)\n", + "- the retrieved document chunks (`retrieved_docs`)" + ], + "metadata": { + "id": "0kLKlpAhrwOg" + } + }, + { + "cell_type": "code", + "source": [ + "from typing import TypedDict, Optional\n", + "\n", + "class AgentState(TypedDict):\n", + "    user_input: str\n", + "    retrieved_docs: Optional[list]\n" + ], + "metadata": { + "id": "mdJzz0Fuh-ns" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We define two utility functions here:\n", + "\n", + "- `get_embedding(text)` → takes in a query string and returns its OpenAI embedding.\n", + "- `simple_retriever(query, n)` → searches AstraDB for the top-`n` most similar chunks based on that embedding.\n", + "\n", + "This lets us run semantic 
search over our processed tables." + ], + "metadata": { + "id": "1DY8JcPCrzaS" + } + }, + { + "cell_type": "code", + "source": [ + "def get_embedding(text: str):\n", + "    try:\n", + "        response = openai_client.embeddings.create(\n", + "            model=embedding_model,\n", + "            input=[text]  # send the query as a one-item batch\n", + "        )\n", + "        return response.data[0].embedding\n", + "    except Exception as e:\n", + "        print(\"Embedding error:\", e)\n", + "        return None\n", + "\n", + "def simple_retriever(query: str, n: int = 5) -> str:\n", + "    embedding = get_embedding(query)\n", + "    if embedding is None:\n", + "        return \"\"  # embedding failed, so there is nothing to search with\n", + "    results = COLLECTION.find(sort={\"$vector\": embedding}, limit=n)\n", + "    docs = [doc[\"content\"] for doc in results]\n", + "    return \"\\n\".join(f\"===== Document {i+1} =====\\n{doc}\" for i, doc in enumerate(docs))" + ], + "metadata": { + "id": "y_bUgOJHQ5Zl" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here we build the actual table retriever function.\n", + "\n", + "Given a user query, it:\n", + "- Computes the embedding\n", + "- Searches AstraDB for the top matches\n", + "- Filters the results to only include chunks that contain tables\n", + "\n", + "We wrap each match in a `Document` object so it can be passed downstream, e.g. for rendering or further processing." 
+ ], + "metadata": { + "id": "8uG_R950r2UW" + } + }, + { + "cell_type": "code", + "source": [ + "def retrieve_table(state: AgentState) -> AgentState:\n", + "    \"\"\"\n", + "    Retrieve HTML table chunks using the raw AstraDB collection + OpenAI embedding.\n", + "    \"\"\"\n", + "    query = state[\"user_input\"]\n", + "    try:\n", + "        embedding = get_embedding(query)\n", + "\n", + "        results = COLLECTION.find(sort={\"$vector\": embedding}, limit=10)\n", + "        table_docs = []\n", + "        for r in results:\n", + "            # Table HTML is read from metadata.metadata.text_as_html; skip records that lack it\n", + "            element_metadata = (r.get('metadata') or {}).get('metadata') or {}\n", + "            if element_metadata.get('text_as_html') is not None:\n", + "                doc = Document(\n", + "                    page_content=r[\"content\"],\n", + "                    metadata=r.get(\"metadata\", {})\n", + "                )\n", + "                table_docs.append(doc)\n", + "\n", + "        state[\"retrieved_docs\"] = table_docs\n", + "        return state\n", + "\n", + "    except Exception as e:\n", + "        print(\"Error occurred:\", e)\n", + "        state[\"retrieved_docs\"] = []\n", + "        return state\n" + ], + "metadata": { + "id": "Yg_HDVfxOY0g" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Let’s test the full flow by asking for **Q2 segment results for Alphabet**.\n" + ], + "metadata": { + "id": "kXe9V9OlsOh9" + } + }, + { + "cell_type": "code", + "source": [ + "test_state = {\"user_input\": \"Show me Q2 Segment Results for Alphabet\"}\n", + "test_state = retrieve_table(test_state)" + ], + "metadata": { + "id": "gsNMwlDeOgyD" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "docs = test_state['retrieved_docs']" + ], + "metadata": { + "id": "A6KptXt8TwjS" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We’ll embed the query, retrieve the most relevant table chunks from AstraDB, and render them using their original HTML format.\n", + "\n", + "Each chunk was processed with layout and formatting preserved, so what you see here matches the actual structure from the source document.\n", + "\n", + "The 
tables you see below aren’t screenshots or scraped HTML; they were generated by Unstructured from the original documents using its layout-aware parsing." + ], + "metadata": { + "id": "2dNKb6OisQ5U" + } + }, + { + "cell_type": "code", + "source": [ + "from IPython.display import display, HTML\n", + "\n", + "print(f\"{'='*40}\")\n", + "display(HTML(docs[0].metadata['metadata']['text_as_html']))\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 335 + }, + "id": "gREU0ZwmPuDt", + "outputId": "67c29251-715d-4f6d-e5aa-b5e303d662d4" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "========================================\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
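<table>
<thead>
<tr><th></th><th colspan=2>Quarter Ended June 30,</th></tr>
<tr><th></th><th>2024</th><th>2025</th></tr>
</thead>
<tbody>
<tr><td>Revenues:</td><td></td><td></td></tr>
<tr><td>Google Services</td><td>73,928</td><td>82,543</td></tr>
<tr><td>Google Cloud</td><td>10,347</td><td>13,624</td></tr>
<tr><td>Other Bets</td><td>365</td><td>373</td></tr>
<tr><td>Hedging gains (losses)</td><td>102</td><td>(112)</td></tr>
<tr><td>Total revenues</td><td>84,742</td><td>96,428</td></tr>
<tr><td>Operating income (loss):</td><td></td><td></td></tr>
<tr><td>Google Services</td><td>29,674</td><td>33,063</td></tr>
<tr><td>Google Cloud</td><td>1,172</td><td>2,826</td></tr>
<tr><td>Other Bets</td><td>(1,134)</td><td>(1,246)</td></tr>
<tr><td>Alphabet-level activities</td><td>(2,287)</td><td>(3,372)</td></tr>
<tr><td>Total income from operations</td><td>27,425</td><td>31,271</td></tr>
</tbody>
</table>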
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "![image.png]()" + ], + "metadata": { + "id": "i-lIbKEgrRJm" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Final Thoughts\n", + "\n", + "The key idea behind this notebook isn’t just retrieval, it’s **structure preservation**.\n", + "\n", + "By parsing documents with Unstructured and saving tables in HTML form, we retain the original layout, semantics, and visual grouping — all of which are lost in plain text formats. That structure becomes a powerful substrate for downstream systems.\n", + "\n", + "With structured HTML in hand, your agents can:\n", + "- Render tables cleanly in a UI\n", + "- Summarize rows, columns, or headers in isolation\n", + "- Compare tables across documents\n", + "- Answer questions with specific numeric context\n", + "- Or just decide when *not* to respond, if structure is missing\n", + "\n", + "This approach creates a flexible interface between messy PDFs and agentic logic, giving you both precision and creativity in how your applications interact with financial data.\n", + "\n", + "Preserve structure early. Use it however you want later." 
+ ], + "metadata": { + "id": "t6TJUoSZtcrp" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "7c5ze3-PaX5P" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/notebooks/Rag_with_Reranking.ipynb b/notebooks/Rag_with_Reranking.ipynb new file mode 100644 index 0000000..742e54d --- /dev/null +++ b/notebooks/Rag_with_Reranking.ipynb @@ -0,0 +1,1459 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# RAG for Patent Question Answering with Reranker\n", + "\n", + "Most retrieval-augmented generation (RAG) pipelines follow a common recipe: take a user’s question, retrieve relevant documents, and feed them to a language model to generate a response. This works reasonably well — until it doesn't.\n", + "\n", + "When dealing with complex domains like patents, the limitations of naive retrieval become glaring:\n", + "- The language is dense and technical.\n", + "- Similarity-based retrievers often surface verbose but irrelevant sections.\n", + "- Critical information may be buried across long documents.\n", + "\n", + "In this notebook, we’ll build a more **robust and domain-aware RAG system** specifically designed to answer technical and legal questions over patents. 
To improve retrieval quality, we’ll incorporate a **reranker**: a model that sits between retrieval and generation, reordering candidate passages to surface the most answer-relevant chunks.\n", + "\n", + "This system will:\n", + "- Load and structure unstructured patent filings using the [Unstructured Platform](https://unstructured.io/).\n", + "- Ingest data into a [Pinecone](https://www.pinecone.io/product/) vector database for fast semantic retrieval.\n", + "- Rerank retrieved candidates using **Cohere’s `rerank-english-v3.0`**.\n", + "- Answer user questions using **GPT-4o** grounded in the reranked context.\n", + "\n", + "We’ll go step by step, starting with document ingestion and ending with an end-to-end QA pipeline that performs well even on nuanced queries.\n", + "\n", + "Let’s dive in.\n" + ], + "metadata": { + "id": "q6YEKxii0WgP" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Preparing the Data\n", + "To prepare our patent data for retrieval and reranking, we first need to break down the raw PDFs into structured chunks. This step is foundational for any RAG pipeline, and it’s where [Unstructured](https://unstructured.io) comes in.\n", + "\n", + "The Unstructured API lets us:\n", + "- Extract clean, structured content from any document.\n", + "- Generate metadata, chunk text, and prepare it for downstream applications." + ], + "metadata": { + "id": "Liy5kmr7DKp7" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Setting Up the Unstructured Client\n", + "\n", + "Before we can begin parsing raw patent documents, we need to set up access to the [Unstructured platform](https://unstructured.io). 
The Unstructured Platform API allows us to programmatically process documents, extract structured elements, and prepare them for chunking and embedding, all from within this notebook.\n", + "\n", + "[Contact us](https://unstructured.io/enterprise) to get access or log in if you're already a user.\n" + ], + "metadata": { + "id": "-IDg_hZ-Dqzo" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1yEFnvkQGcSV", + "outputId": "f6f11e5c-98aa-4498-8fc8-6bd91e018416", + "collapsed": true + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: unstructured-client in /usr/local/lib/python3.11/dist-packages (0.42.1)\n", + "Requirement already satisfied: aiofiles>=24.1.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (24.1.0)\n", + "Requirement already satisfied: cryptography>=3.1 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (43.0.3)\n", + "Requirement already satisfied: httpcore>=1.0.9 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.9)\n", + "Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (0.28.1)\n", + "Requirement already satisfied: pydantic>=2.11.2 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (2.11.7)\n", + "Requirement already satisfied: pypdf>=4.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (5.9.0)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.0)\n", + "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.11/dist-packages (from cryptography>=3.1->unstructured-client) (1.17.1)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (2025.7.14)\n", 
+ "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (0.16.0)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (4.9.0)\n", + "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (3.10)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (2.33.2)\n", + "Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (4.14.1)\n", + "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.4.1)\n", + "Requirement already satisfied: requests<3.0.0,>=2.0.1 in /usr/local/lib/python3.11/dist-packages (from requests-toolbelt>=1.0.0->unstructured-client) (2.32.3)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.12->cryptography>=3.1->unstructured-client) (2.22)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (3.4.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests<3.0.0,>=2.0.1->requests-toolbelt>=1.0.0->unstructured-client) (2.5.0)\n", + "Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.27.0->unstructured-client) (1.3.1)\n" + ] + } + ], + "source": [ + "!pip install -U \"unstructured-client\"" + ] + }, + { + "cell_type": 
"markdown", + "source": [ + "If you haven’t already:\n", + "1. Log in to [platform.unstructured.io](https://platform.unstructured.io).\n", + "2. In the sidebar, go to **API Keys**.\n", + "3. Click **New Key**, give it a name like `\"patent-qna-notebook\"`, and copy the key.\n" + ], + "metadata": { + "id": "es7EaldpEtj7" + } + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "import time\n", + "from google.colab import userdata\n", + "from unstructured_client import UnstructuredClient" + ], + "metadata": { + "id": "mUUQ6TnbE-R-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Fetch the API key from Colab Secrets:" + ], + "metadata": { + "id": "8vH2wwCAF2mF" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ['UNSTRUCTURED_API_KEY'] = userdata.get(\"UNSTRUCTURED_API_KEY\")\n", + "client = UnstructuredClient(api_key_auth=os.getenv(\"UNSTRUCTURED_API_KEY\"))" + ], + "metadata": { + "id": "wUG_gKfH0d9R" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# utility for inspecting responses in a readable way\n", + "def pretty_print_model(response_model):\n", + "    print(response_model.model_dump_json(indent=4))" + ], + "metadata": { + "id": "aAgNx9Ej0cjz" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Setting up the S3 Source Connector\n", + "\n", + "For this demo, we authenticate with an AWS access key and secret.\n", + "Fetch the corresponding values and add them to the Secrets in Colab as `AWS_ACCESS` and `AWS_SECRET`, the names the code below reads.\n", + "\n", + "Similarly, fetch the S3 URI of the bucket or folder, formatted as `s3://my-bucket/` (if the files are in the bucket's root) or `s3://my-bucket/my-folder/`, and add it to the Secrets as `S3_REMOTE_URL`.\n", + "\n", + "For other authentication options and more details refer to 
[the S3 source connector documentation](https://docs.unstructured.io/api-reference/workflow/sources/s3)." + ], + "metadata": { + "id": "RXvczp6t0h_S" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ['AWS_ACCESS'] = userdata.get('AWS_ACCESS')\n", + "os.environ['AWS_SECRET'] = userdata.get('AWS_SECRET')\n", + "os.environ['S3_REMOTE_URL'] = userdata.get('S3_REMOTE_URL')" + ], + "metadata": { + "id": "gebtI0S20fum" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateSourceRequest\n", + "from unstructured_client.models.shared import CreateSourceConnector\n", + "\n", + "# Create the S3 source connector using a plain dict for the config\n", + "source_response = client.sources.create_source(\n", + "    request=CreateSourceRequest(\n", + "        create_source_connector=CreateSourceConnector(\n", + "            name=\"Reranker Tutorial Source Connector_\",\n", + "            type=\"s3\",\n", + "            config={\n", + "                \"key\": os.environ.get('AWS_ACCESS'),\n", + "                \"secret\": os.environ.get('AWS_SECRET'),\n", + "                \"remote_url\": os.environ.get('S3_REMOTE_URL'),\n", + "                \"recursive\": True\n", + "            }\n", + "        )\n", + "    )\n", + ")" + ], + "metadata": { + "id": "IobJkeHa0wKJ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Alternative to the previous cell: the same connector built with the SDK's typed config classes.\n", + "# Run only one of the two cells; running both creates two source connectors.\n", + "from unstructured_client.models.operations import CreateSourceRequest\n", + "from unstructured_client.models.shared import (\n", + "    CreateSourceConnector,\n", + "    SourceConnectorType,\n", + "    S3SourceConnectorConfigInput\n", + ")\n", + "\n", + "source_response = client.sources.create_source(\n", + "    request=CreateSourceRequest(\n", + "        create_source_connector=CreateSourceConnector(\n", + "            name=\"Reranker Tutorial Source Connector_\",\n", + "            type=SourceConnectorType.S3,\n", + "            config=S3SourceConnectorConfigInput(\n", + "                key=os.environ.get('AWS_ACCESS'),\n", + "                secret=os.environ.get('AWS_SECRET'),\n", + "                remote_url=os.environ.get('S3_REMOTE_URL'),\n", + "                recursive=True\n", + "            )\n", + "        )\n", + "    )\n", + ")" + ], + "metadata": { + 
"id": "8GvTF2jI0lED" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "pretty_print_model(source_response.source_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "os1_javB0mQT", + "outputId": "7f7543b0-f7d2-470d-cc46-f8082667b6c3" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"anonymous\": false,\n", + " \"recursive\": true,\n", + " \"remote_url\": \"s3://ajay-uns-devrel-content/mm-agentic-rag/\",\n", + " \"key\": \"**********\",\n", + " \"secret\": \"**********\"\n", + " },\n", + " \"created_at\": \"2025-08-06T14:57:07.277627Z\",\n", + " \"id\": \"e63b3e59-58e7-4e0b-90b3-85a7a6f5ad69\",\n", + " \"name\": \"Reranker Tutorial Source Connector_\",\n", + " \"type\": \"s3\",\n", + " \"updated_at\": \"2025-08-06T14:57:07.416960Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Setting up the Pinecone Destination Connector\n", + "\n", + "Now that we’ve defined our document source (from S3), the next step is to configure where the processed chunks should go. For that, we’re using **Pinecone** — a fast, scalable vector database that's perfect for similarity search.\n", + "\n", + "In our case, we’ll send embedded chunks of patent text to Pinecone, where they can later be searched via semantic queries.\n", + "\n", + "---\n", + "\n", + "### 🌲 Why Pinecone?\n", + "\n", + "Pinecone is optimized for storing and querying high-dimensional vector embeddings. 
It provides:\n", + "- Scalable infrastructure for similarity search.\n", + "- Fast approximate nearest neighbor lookup.\n", + "- Simple API access for indexing and querying.\n", + "\n", + "In this setup, Unstructured handles:\n", + "- Preprocessing the data (partitioning, chunking, embedding).\n", + "- Pushing the output vectors directly into our Pinecone index.\n", + "\n", + "---\n", + "\n", + "\n", + "To connect Unstructured with Pinecone, you’ll need:\n", + "\n", + "- **API Key**: Found under the API Keys tab in the Pinecone dashboard.\n", + "- **Index Name**: Create one manually from the dashboard, and ensure it’s in the \"Serverless\" environment.\n", + "- **Namespace** (optional): Used to logically group your documents inside the index.\n", + "\n", + "If you haven’t already:\n", + "1. Go to [https://app.pinecone.io](https://app.pinecone.io) and sign in.\n", + "2. Create a **Serverless Index**.\n", + "3. Note the **index name** and **API key** from the dashboard.\n", + "\n", + "Store both values in Colab Secrets as `PINECONE_INDEX` and `PINECONE_API_KEY`:" + ], + "metadata": { + "id": "XQgI4qJr0oFC" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ['PINECONE_INDEX'] = userdata.get('PINECONE_INDEX')\n", + "os.environ['PINECONE_API_KEY'] = userdata.get('PINECONE_API_KEY')\n" + ], + "metadata": { + "id": "4Hyxry7D8Ivq" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateDestinationRequest\n", + "from unstructured_client.models.shared import CreateDestinationConnector\n", + "\n", + "destination_response = client.destinations.create_destination(\n", + "    request=CreateDestinationRequest(\n", + "        create_destination_connector=CreateDestinationConnector(\n", + "            name=\"Reranker Tutorial Destination Connector_\",\n", + "            type=\"pinecone\",\n", + "            config={\n", + "                \"index_name\": os.environ.get(\"PINECONE_INDEX\"),\n", + "                \"api_key\": os.environ.get(\"PINECONE_API_KEY\"),\n", + 
"                \"batch_size\": 50,\n", + "                \"namespace\": \"Default\"  # must match the namespace used when querying the index later\n", + "            }\n", + "        )\n", + "    )\n", + ")\n", + "\n", + "pretty_print_model(destination_response.destination_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "HZn6XbJw1DhQ", + "outputId": "2b0a2235-64b8-42e1-acdb-8d65cc733248" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"api_key\": \"**********\",\n", + " \"batch_size\": 50,\n", + " \"index_name\": \"uns-demo-2\",\n", + " \"namespace\": \"Default\"\n", + " },\n", + " \"created_at\": \"2025-08-06T14:57:09.636042Z\",\n", + " \"id\": \"3122da51-b23b-415c-a544-e329ba964c66\",\n", + " \"name\": \"Reranker Tutorial Destination Connector_\",\n", + " \"type\": \"pinecone\",\n", + " \"updated_at\": \"2025-08-06T14:57:09.739495Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we’ll wire everything together into a full document processing workflow.\n" + ], + "metadata": { + "id": "ZIy3Y1gEGpOj" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Creating a Document Processing Workflow\n", + "\n", + "Now that we have access to our data, the next step is setting up how it should be processed.\n", + "\n", + "We'll define a simple but powerful document pipeline using three key types of processing nodes:\n", + "\n", + "- **Partitioner** \n", + " This step takes raw, unstructured files and extracts structured content from them. 
\n", + " We'll use a **Vision-Language Model (VLM) Partitioner**, which leverages a model capable of understanding both text and layout information from documents, pulling out elements from each page with higher fidelity.\n", + "\n", + "- **Chunker** \n", + " After partitioning, the extracted elements are grouped into manageable \"chunks.\" \n", + " Chunking ensures that during retrieval, we can focus only on the most relevant sections of a document, not the whole thing.\n", + "\n", + "- **Embedder** \n", + " Finally, we'll generate vector embeddings for each chunk of text. \n", + " Embeddings are numeric representations that capture the meaning of the text, making it searchable and retrievable later on. We'll rely on an embedding provider to handle this step for us.\n", + "\n", + "Each node plays a critical role in making our documents **retrieval-ready** for downstream RAG applications.\n", + "\n", + "If you're curious about the different configuration options available for these processing steps, you can explore more details in the [Concepts documentation](https://docs.unstructured.io/ui/document-elements).\n" + ], + "metadata": { + "id": "onYT6ODu0uSp" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.shared import (\n", + "    WorkflowNode,\n", + "    WorkflowType,\n", + "    Schedule\n", + ")\n", + "\n", + "partition_node = WorkflowNode(\n", + "    name=\"Partitioner\",\n", + "    subtype=\"vlm\",\n", + "    type=\"partition\",\n", + "    settings={\n", + "        \"provider\": \"anthropic\",\n", + "        \"model\": \"claude-3-7-sonnet-20250219\",\n", + "    }\n", + ")\n", + "\n", + "chunk_node = WorkflowNode(\n", + "    name=\"Chunker\",\n", + "    subtype=\"chunk_by_title\",\n", + "    type=\"chunk\",\n", + "    settings={\n", + "        \"new_after_n_chars\": 1000,\n", + "        \"max_characters\": 4096,\n", + "        \"overlap\": 150\n", + "    }\n", + ")\n", + "\n", + "embedder_node = WorkflowNode(\n", + "    name='Embedder',\n", + "    subtype='azure_openai',\n", + "    type=\"embed\",\n", + 
"    settings={\n", + "        'model_name': 'text-embedding-3-large'\n", + "    }\n", + ")\n", + "\n", + "\n", + "response = client.workflows.create_workflow(\n", + "    request={\n", + "        \"create_workflow\": {\n", + "            \"name\": f\"Reranker Tutorial Workflow_{time.time()}\",\n", + "            \"source_id\": source_response.source_connector_information.id,\n", + "            \"destination_id\": destination_response.destination_connector_information.id,\n", + "            \"workflow_type\": WorkflowType.CUSTOM,\n", + "            \"workflow_nodes\": [\n", + "                partition_node,\n", + "                chunk_node,\n", + "                embedder_node\n", + "            ]\n", + "        }\n", + "    }\n", + ")\n", + "\n", + "pretty_print_model(response.workflow_information)\n", + "workflow_id = response.workflow_information.id" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "g2tJmoCP0u1W", + "outputId": "0a2fab62-3528-494f-cc5c-845e61703070" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-08-06T14:57:11.657721Z\",\n", + " \"destinations\": [\n", + " \"3122da51-b23b-415c-a544-e329ba964c66\"\n", + " ],\n", + " \"id\": \"974f7a59-df45-469e-94d2-09e0ec1f2500\",\n", + " \"name\": \"Reranker Tutorial Workflow_1754492231.632472\",\n", + " \"sources\": [\n", + " \"e63b3e59-58e7-4e0b-90b3-85a7a6f5ad69\"\n", + " ],\n", + " \"status\": \"active\",\n", + " \"workflow_nodes\": [\n", + " {\n", + " \"name\": \"Partitioner\",\n", + " \"subtype\": \"vlm\",\n", + " \"type\": \"partition\",\n", + " \"id\": \"639c45c1-8009-4bfa-80a1-5e0e4b325467\",\n", + " \"settings\": {\n", + " \"provider\": \"anthropic\",\n", + " \"provider_api_key\": null,\n", + " \"model\": \"claude-3-7-sonnet-20250219\",\n", + " \"output_format\": \"text/html\",\n", + " \"prompt\": null,\n", + " \"format_html\": true,\n", + " \"unique_element_ids\": true,\n", + " \"is_dynamic\": false,\n", + " \"allow_fast\": true\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Chunker\",\n", + 
" \"subtype\": \"chunk_by_title\",\n", + " \"type\": \"chunk\",\n", + " \"id\": \"c51efc7d-6a8e-4663-b852-5bb5f7023539\",\n", + " \"settings\": {\n", + " \"unstructured_api_url\": null,\n", + " \"unstructured_api_key\": null,\n", + " \"multipage_sections\": false,\n", + " \"combine_text_under_n_chars\": null,\n", + " \"include_orig_elements\": false,\n", + " \"new_after_n_chars\": 1000,\n", + " \"max_characters\": 4096,\n", + " \"overlap\": 150,\n", + " \"overlap_all\": false,\n", + " \"contextual_chunking_strategy\": null\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Embedder\",\n", + " \"subtype\": \"azure_openai\",\n", + " \"type\": \"embed\",\n", + " \"id\": \"a34562a2-766d-4519-92be-4833288f7f83\",\n", + " \"settings\": {\n", + " \"model_name\": \"text-embedding-3-large\"\n", + " }\n", + " }\n", + " ],\n", + " \"reprocess_all\": false,\n", + " \"schedule\": {\n", + " \"crontab_entries\": []\n", + " },\n", + " \"updated_at\": \"2025-08-06T14:57:11.672337Z\",\n", + " \"workflow_type\": \"custom\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Running the workflow\n", + "\n", + "Now that we've defined how we want to process our documents, let's start the workflow and wait for it to complete:" + ], + "metadata": { + "id": "CTrsjVhIqvTN" + } + }, + { + "cell_type": "code", + "source": [ + "res = client.workflows.run_workflow(\n", + "    request={\n", + "        \"workflow_id\": workflow_id,\n", + "    }\n", + ")\n", + "\n", + "pretty_print_model(res.job_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "fuX5R8p37Src", + "outputId": "16f79d45-57ee-46f5-c001-7d415f7b7920" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-08-06T14:57:13.160450Z\",\n", + " \"id\": \"5dfaecab-e7f5-4ff2-84b6-6b460756bdf6\",\n", + " \"status\": \"SCHEDULED\",\n", + " \"workflow_id\": 
\"974f7a59-df45-469e-94d2-09e0ec1f2500\",\n", + " \"workflow_name\": \"Reranker Tutorial Workflow_1754492231.632472\",\n", + " \"job_type\": \"ephemeral\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "response = client.jobs.list_jobs(\n", + "    request={\n", + "        \"workflow_id\": workflow_id\n", + "    }\n", + ")\n", + "\n", + "last_job = response.response_list_jobs[0]\n", + "job_id = last_job.id\n", + "print(f\"job_id: {job_id}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Cxc3G6DS7WDR", + "outputId": "d8034e8f-1b10-4bd1-ec15-8c8cab6d9895" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "job_id: 5dfaecab-e7f5-4ff2-84b6-6b460756bdf6\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now that we've created and started a job, we can poll Unstructured's `get_job` endpoint and check its status every 30 seconds until it finishes." + ], + "metadata": { + "id": "A1zTfzanq4Vf" + } + }, + { + "cell_type": "code", + "source": [ + "import time\n", + "\n", + "def poll_job_status(job_id, wait_time=30):\n", + "    while True:\n", + "        response = client.jobs.get_job(\n", + "            request={\n", + "                \"job_id\": job_id\n", + "            }\n", + "        )\n", + "\n", + "        job = response.job_information\n", + "\n", + "        if job.status == \"SCHEDULED\":\n", + "            print(f\"Job is scheduled, polling again in {wait_time} seconds...\")\n", + "            time.sleep(wait_time)\n", + "        elif job.status == \"IN_PROGRESS\":\n", + "            print(f\"Job is in progress, polling again in {wait_time} seconds...\")\n", + "            time.sleep(wait_time)\n", + "        else:\n", + "            # Any other status ends the loop; check job.status in the printed result\n", + "            print(\"Job is completed\")\n", + "            break\n", + "\n", + "    return job\n", + "\n", + "job = poll_job_status(job_id)\n", + "pretty_print_model(job)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7-DJdmDQ7nki", + "outputId": "8ccbeb91-b3aa-4364-f948-64eadb6fae11" + }, + "execution_count": null, + "outputs": [ 
+ { + "output_type": "stream", + "name": "stdout", + "text": [ + "Job is scheduled, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is completed\n", + "{\n", + " \"created_at\": \"2025-08-06T14:57:13.160450\",\n", + " \"id\": \"5dfaecab-e7f5-4ff2-84b6-6b460756bdf6\",\n", + " \"status\": \"COMPLETED\",\n", + " \"workflow_id\": \"974f7a59-df45-469e-94d2-09e0ec1f2500\",\n", + " \"workflow_name\": \"Reranker Tutorial Workflow_1754492231.632472\",\n", + " \"job_type\": \"ephemeral\",\n", + " \"runtime\": \"PT0S\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "At this point, we've completed all the foundational steps:\n", + "\n", + "- Extracted structured elements from raw documents using a **Partitioner**.\n", + "- Organized the extracted content into manageable chunks with a **Chunker**.\n", + "- Generated vector embeddings for those chunks through an **Embedder**.\n", + "\n", + "Our processed data is now stored and ready for retrieval.\n", + "\n", + "Next, we'll connect the pieces together and build a RAG pipeline that can answer questions grounded in this freshly structured knowledge base.\n", + "\n" + ], + "metadata": { + "id": "iMQbRKJ1NBwX" + } + }, + { + "cell_type": "markdown", + "source": [ + "# RAG 🧠\n", + "\n", + "With our patent documents now chunked, embedded, and stored in Pinecone — we’re ready to move into the **retrieval-augmented generation (RAG)** phase.\n", + "\n", + "In this section, we'll wire together:\n", + "- A **retriever**, backed by Pinecone, to pull relevant chunks.\n", + "- A **reranker**, using Cohere’s 
`rerank-english-v3.0`, to boost the most contextually relevant results.\n", + "- A **generator**, using OpenAI’s `gpt-4o`, to produce accurate, grounded answers based on that refined context.\n", + "\n", + "We’ll also wrap these into a clean RAG pipeline using LangChain’s modular components.\n", + "\n", + "\n", + "For this portion, we will be using:\n", + "\n", + "- **`pinecone-client`**: Native SDK to interact with Pinecone vector indices (for inserting, querying, and managing embeddings).\n", + "- **`cohere`**: Official client to access Cohere’s APIs, including rerankers and language models.\n", + "- **`langchain-*`**: A modular framework for chaining together LLMs, retrievers, tools, rerankers, and more, well suited to building custom RAG pipelines.\n", + "\n", + "Once everything's installed, we'll connect to our vector store, load our reranker, and build a chain that retrieves → reranks → generates.\n" + ], + "metadata": { + "id": "Y7imZgkEBOXn" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install pinecone-client langchain-pinecone langchain-openai langchain-community cohere --upgrade --quiet\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "nOs3tfvDafSk", + "outputId": "cd40c08a-9874-46ce-9a9e-9586b1e0bd4b" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "import requests\n", + "import urllib3\n", + "from google.colab import userdata\n", + "import cohere\n", + "from pinecone import Pinecone\n", + "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n", + "from langchain_pinecone import PineconeVectorStore\n", + "from langchain.schema import Document\n", + "from langchain.callbacks import get_openai_callback\n" + ], + "metadata": { + "id": "JnHFw0t2H49B" + }, + "execution_count": null, + "outputs": [] + }, + { + 
"cell_type": "markdown", + "source": [ + "Now that we’ve installed our libraries, it’s time to wire up the APIs. We’ll be using three providers in this RAG pipeline:\n", + "\n", + "\n", + "- **Cohere**: for reranking retrieved chunks based on their actual relevance to a query.\n", + "- **OpenAI**: for generating answers with `gpt-4o`.\n", + "- **Pinecone**: to query the vector index we populated earlier.\n", + "\n", + "We’ll securely fetch each API key from Colab secrets" + ], + "metadata": { + "id": "OTyWC4j7IPEn" + } + }, + { + "cell_type": "code", + "source": [ + "# Set your API keys using Colab userdata\n", + "os.environ['COHERE_API_KEY'] = userdata.get(\"COHERE_API_KEY\")\n", + "os.environ[\"OPENAI_API_KEY\"] = userdata.get(\"OPENAI_API_KEY\")\n", + "os.environ[\"PINECONE_API_KEY\"] = userdata.get(\"PINECONE_API_KEY\")\n", + "os.environ[\"PINECONE_INDEX\"] = userdata.get(\"PINECONE_INDEX\")\n", + "\n", + "# Initialize Cohere client\n", + "cohere_client = cohere.Client(os.environ['COHERE_API_KEY'])" + ], + "metadata": { + "id": "K-JkRoYks5Ka" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "🛠 Fixing Pinecone in Colab\n", + "\n", + "If you're running this in a Colab environment, Pinecone’s client can sometimes misbehave due to Colab’s proxy settings.\n", + "\n", + "This little fix disables warnings and clears proxy-related environment variables:" + ], + "metadata": { + "id": "Hv0Oj6HMIZ1W" + } + }, + { + "cell_type": "code", + "source": [ + "# run this to ensure pinecone client works in your colab environment\n", + "urllib3.disable_warnings()\n", + "\n", + "# Clear proxy environment variables that might cause connection issues\n", + "proxy_vars = ['HTTP_PROXY', 'HTTPS_PROXY', 'http_proxy', 'https_proxy']\n", + "for var in proxy_vars:\n", + " if var in os.environ:\n", + " del os.environ[var]\n", + "\n", + "original_getproxies = requests.utils.getproxies\n", + "requests.utils.getproxies = lambda: {}" + ], + 
"metadata": { + "id": "8IsTOabPitVl" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Before wiring things up, here’s a breakdown of the core functions used in this section.\n", + "\n", + "- `connect_pinecone(index_name: str)` \n", + " Sets up a connection to the Pinecone index and wraps it as a LangChain-compatible vectorstore using OpenAI’s `text-embedding-3-large`. Returns the vectorstore so we can use it for retrieval.\n", + "\n", + "- `retrieve_docs(vectorstore, query: str, k: int = 20)` \n", + " Performs basic similarity search against the vectorstore. Grabs the top-k chunks closest to the query based on embeddings.\n", + "\n", + "- `rerank_docs(query: str, docs: list[Document], top_n: int = 5)` \n", + " Takes the initial retrieved results and reorders them using Cohere’s reranker model. This lets us prioritize documents that are actually useful for answering the question — not just semantically close.\n", + "\n", + "- `generate_answer(query: str, docs: list[Document])` \n", + " Feeds the reranked context to GPT-4o to generate a final answer." 
+ ], + "metadata": { + "id": "tonjlb1VKHU0" + } + }, + { + "cell_type": "code", + "source": [ + "def connect_pinecone(index_name: str):\n", + "    \"\"\"\n", + "    Connect to Pinecone vectorstore\n", + "\n", + "    Args:\n", + "        index_name: Name of the Pinecone index\n", + "\n", + "    Returns:\n", + "        Configured vectorstore\n", + "    \"\"\"\n", + "    try:\n", + "        embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "        pc = Pinecone(api_key=os.environ[\"PINECONE_API_KEY\"])\n", + "        index = pc.Index(index_name)\n", + "\n", + "        vectorstore = PineconeVectorStore(\n", + "            index=index,\n", + "            embedding=embeddings,\n", + "            text_key=\"text\",\n", + "            namespace='Default'\n", + "        )\n", + "\n", + "        print(f\"Connected to Pinecone index: {index_name}\")\n", + "        return vectorstore\n", + "\n", + "    except Exception as e:\n", + "        print(f\"Failed to connect to Pinecone: {e}\")\n", + "        return None\n", + "\n", + "\n", + "def retrieve_docs(vectorstore, query: str, k: int = 20):\n", + "    \"\"\"\n", + "    Retrieve documents from vectorstore\n", + "\n", + "    Args:\n", + "        vectorstore: Pinecone vectorstore\n", + "        query: Search query\n", + "        k: Number of documents to retrieve\n", + "\n", + "    Returns:\n", + "        List of relevant documents\n", + "    \"\"\"\n", + "    try:\n", + "        docs = vectorstore.similarity_search(query, k=k)\n", + "        print(f\"Retrieved {len(docs)} documents\")\n", + "        return docs\n", + "    except Exception as e:\n", + "        print(f\"Document retrieval failed: {e}\")\n", + "        return []\n", + "\n", + "def rerank_docs(query: str, docs: list[Document], top_n: int = 5):\n", + "    \"\"\"\n", + "    Rerank documents using Cohere's reranking model\n", + "\n", + "    Args:\n", + "        query: Original search query\n", + "        docs: List of retrieved documents\n", + "        top_n: Number of top documents to return\n", + "\n", + "    Returns:\n", + "        List of reranked documents\n", + "    \"\"\"\n", + "    try:\n", + "        response = cohere_client.rerank(\n", + "            query=query,\n", + "            documents=[doc.page_content for doc in docs],\n", + 
"            top_n=top_n,\n", + "            model=\"rerank-english-v3.0\"\n", + "        )\n", + "\n", + "        reranked_docs = [docs[r.index] for r in response.results]\n", + "        print(f\"Reranked to top {len(reranked_docs)} documents\")\n", + "        return reranked_docs\n", + "\n", + "    except Exception as e:\n", + "        print(f\"Reranking failed: {e}\")\n", + "        return docs[:top_n]  # Fallback\n", + "\n", + "def generate_answer(query: str, docs: list[Document]):\n", + "    \"\"\"\n", + "    Generate answer using retrieved documents\n", + "\n", + "    Args:\n", + "        query: User question\n", + "        docs: List of relevant documents\n", + "\n", + "    Returns:\n", + "        Generated answer\n", + "    \"\"\"\n", + "    try:\n", + "        llm = ChatOpenAI(model=\"gpt-4o\", temperature=0)\n", + "        context = \"\\n\\n\".join([doc.page_content for doc in docs])\n", + "\n", + "        prompt = f\"\"\"Answer the following question using the context below. Answer only based on the context provided, if there is not enough information, mention that there's not enough information:\n", + "\n", + " Context:\n", + " {context}\n", + "\n", + " Question: {query}\n", + "\n", + " Answer:\"\"\"\n", + "        with get_openai_callback() as cb:\n", + "            response = llm.invoke(prompt)\n", + "\n", + "        result = {\n", + "            \"answer\": response.content,\n", + "            \"prompt_tokens\": cb.prompt_tokens,\n", + "            \"completion_tokens\": cb.completion_tokens,\n", + "            \"total_tokens\": cb.total_tokens,\n", + "            \"total_cost\": cb.total_cost\n", + "        }\n", + "        return result\n", + "\n", + "    except Exception as e:\n", + "        print(f\"Answer generation failed: {e}\")\n", + "        return None\n", + "\n", + "# Connect to vectorstore\n", + "vectorstore = connect_pinecone(os.environ[\"PINECONE_INDEX\"])\n", + "\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "LhJw9h3DtG4M", + "outputId": "d490c77b-794b-4a4d-ff71-01c1b92cbaee" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Connected to Pinecone index: 
uns-demo1\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Vanilla RAG\n", + "\n", + "We’ll start with a simple retrieval-augmented generation setup: grab the top-k documents from Pinecone using embedding similarity, and pass them directly to GPT-4o.\n" + ], + "metadata": { + "id": "IUBBvWKGMrB2" + } + }, + { + "cell_type": "code", + "source": [ + "class BasicRAGSystem:\n", + "    def __init__(self, vectorstore, k=10):\n", + "        self.vectorstore = vectorstore\n", + "        self.k = k\n", + "\n", + "    def query(self, question):\n", + "        \"\"\"Execute basic RAG pipeline\"\"\"\n", + "\n", + "        # Retrieve documents\n", + "        docs = retrieve_docs(self.vectorstore, question, k=self.k)\n", + "\n", + "        # Generate answer\n", + "        answer = generate_answer(question, docs)\n", + "\n", + "        result = {\n", + "            \"documents\": docs,\n", + "            \"num_docs\": len(docs),\n", + "            # Defaults that remain in place if answer generation fails\n", + "            \"answer\": None,\n", + "            \"total_tokens\": 0\n", + "        }\n", + "        # generate_answer returns None on failure, so merge only a real result\n", + "        if answer is not None:\n", + "            result.update(answer)\n", + "\n", + "        return result\n", + "\n", + "basic_rag = BasicRAGSystem(vectorstore, k=10)\n", + "\n" + ], + "metadata": { + "id": "qw1HcYKSvig6" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "test_query = \"What is the primary function of the context analysis engine described in US11886826B1?\"\n", + "\n", + "print(\"Basic RAG Results:\")\n", + "print(\"-\" * 50)\n", + "basic_result = basic_rag.query(test_query)\n", + "\n", + "print(f\"Answer: {basic_result['answer']}\")\n", + "print(f\"Retrieved {basic_result['num_docs']} documents\")\n", + "print(f\"Total Tokens: {basic_result['total_tokens']}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ALLZZ7gQvsNZ", + "outputId": "3e708506-8318-4b52-9244-3138ef8c17d7" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Basic RAG Results:\n", + "--------------------------------------------------\n", + "Retrieved 10 documents\n", + "Answer: The primary function of the context
analysis engine described in US11886826B1 is to analyze input data and/or user instructions to output a set of context parameters associated with the input data. These context parameters may include information such as location (\"where\"), person (\"who\"), time period or time of day (\"when\"), event (\"what\"), or causal reasoning (\"why\") associated with the input data. The context analysis engine may also retain the output of the set of context parameters through multiple iterations of execution, allowing for retention of context information for changes without needing to reload large amounts of information.\n", + "Retrieved 10 documents\n", + "Total Tokens: 7741\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "The vanilla setup gets the right answer here. It finds the relevant chunk in the top 10 and generates a clean response.\n", + "\n", + "Now let's try a more complex question" + ], + "metadata": { + "id": "U9s9WUB8MAEV" + } + }, + { + "cell_type": "code", + "source": [ + "test_query = \"Which of the two patents does not reference reward‑based optimization, and what training approach does it use instead?\"\n", + "\n", + "print(\"Basic RAG Results:\")\n", + "print(\"-\" * 50)\n", + "basic_result = basic_rag.query(test_query)\n", + "\n", + "print(f\"Answer: {basic_result['answer']}\")\n", + "print(f\"Retrieved {basic_result['num_docs']} documents\")\n", + "print(f\"Total Tokens: {basic_result['total_tokens']}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "QEp5dJ76MKKf", + "outputId": "2a72400d-fa14-46cc-8cac-6b1dda490aaf" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Basic RAG Results:\n", + "--------------------------------------------------\n", + "Retrieved 10 documents\n", + "Answer: There's not enough information to determine which of the two patents does not reference reward-based optimization and what training approach it 
uses instead.\n", + "Retrieved 10 documents\n", + "Total Tokens: 7234\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Even though we retrieved 10 chunks, none had what we needed. Let's try a different approach to fetch the chunks **most relevant** to the query." + ], + "metadata": { + "id": "i2Ay_UciMQfI" + } + }, + { + "cell_type": "markdown", + "source": [ + "### RAG with Reranking\n", + "\n", + "\n", + "Plain vector search can only get us so far. It’s fast and useful, but it’s not perfect: sometimes the right chunk doesn’t make it into the top 10.\n", + "\n", + "To fix this, we add a reranking step.\n", + "\n", + "Here’s how it works:\n", + "\n", + "- First, we fetch a **larger set of candidate chunks** (say 30) from the vectorstore.\n", + "- Then we use a **reranker model** (in this case, Cohere’s `rerank-english-v3.0`) to score each chunk by how well it matches the question.\n", + "- We keep only the **top-N** (e.g. top 10) reranked chunks and send those to the LLM.\n", + "\n", + "This extra scoring step helps surface the most relevant content, especially for nuanced or multi-part questions that vector search might miss.\n", + "\n", + "\n" + ], + "metadata": { + "id": "0QMpaRyQMtpR" + } + }, + { + "cell_type": "code", + "source": [ + "class EnhancedRAGSystem:\n", + "    def __init__(self, vectorstore, k=40, top_n=20):\n", + "        self.vectorstore = vectorstore\n", + "        self.k = k\n", + "        self.top_n = top_n\n", + "\n", + "    def query(self, question):\n", + "        \"\"\"Execute enhanced RAG pipeline with reranking\"\"\"\n", + "\n", + "        # Over-fetch k candidates, then keep only the top_n after reranking\n", + "        initial_docs = retrieve_docs(self.vectorstore, question, k=self.k)\n", + "\n", + "        reranked_docs = rerank_docs(question, initial_docs, top_n=self.top_n)\n", + "\n", + "        answer = generate_answer(question, reranked_docs)\n", + "\n", + "        result = {\n", + "            \"documents\": reranked_docs,\n", + "            \"initial_docs\": initial_docs,\n", + "            \"num_docs\": len(reranked_docs),\n", + "            # Defaults that remain in place if answer generation fails\n", + "            \"answer\": None,\n", + "            \"total_tokens\": 0\n", + "        }\n", + "        # generate_answer returns None on failure, so merge only a real result\n", + "        if answer is not None:\n", + "            result.update(answer)\n", + "        return
result\n", + "\n", + "# Initialize enhanced RAG system to fetch 30 candidate docs -> 10 reranked docs\n", + "enhanced_rag = EnhancedRAGSystem(vectorstore, k=30, top_n=10)\n" + ], + "metadata": { + "id": "_ElPAmYgvtfj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "And now, the same query that failed with vanilla RAG:" + ], + "metadata": { + "id": "L4cESO-8Nz9J" + } + }, + { + "cell_type": "code", + "source": [ + "test_query = \"Which of the two patents does not reference reward‑based optimization, and what training approach does it use instead?\"\n", + "\n", + "print(\"\\nEnhanced RAG with Reranking:\")\n", + "print(\"-\" * 50)\n", + "enhanced_result = enhanced_rag.query(test_query)\n", + "\n", + "print(f\"Answer: {enhanced_result['answer']}\")\n", + "print(f\"Retrieved {enhanced_result['num_docs']} documents (from {len(enhanced_result['initial_docs'])} initial)\")\n", + "print(f\"Total Tokens: {enhanced_result['total_tokens']}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "NUulXbcFv2ww", + "outputId": "eb555370-1487-4392-ff98-95722576ddac" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "Enhanced RAG with Reranking:\n", + "--------------------------------------------------\n", + "Retrieved 30 documents\n", + "Reranked to top 10 documents\n", + "Answer: The patent US 11,886,826 B1 does not reference reward-based optimization. 
Instead, it uses an iterative training approach based on one or more datasets, which may include user instruction data or user-labeled data.\n", + "Retrieved 10 documents (from 30 initial)\n", + "Total Tokens: 7662\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "So what changed?\n", + "\n", + "Turns out the key chunk was buried deeper in the retrieval set, somewhere in the top 30, but not in the top 10 that vanilla RAG uses.\n", + "\n", + "With reranking, we’re able to pull it up and pass it to the LLM, which now has enough signal to answer correctly.\n" + ], + "metadata": { + "id": "0vCKuBqZOkSa" + } + }, + { + "cell_type": "markdown", + "source": [ + "### Why not send the entire context to the LLM?\n", + "\n", + "Let's test it out." + ], + "metadata": { + "id": "s6pH1Q31M-ih" + } + }, + { + "cell_type": "code", + "source": [ + "basic_rag = BasicRAGSystem(vectorstore,30)\n", + "test_query = \"Which of the two patents does not reference reward‑based optimization, and what training approach does it use instead?\"\n", + "\n", + "print(\"Basic RAG Results:\")\n", + "print(\"-\" * 50)\n", + "basic_result = basic_rag.query(test_query)\n", + "\n", + "print(f\"Answer: {basic_result['answer']}\")\n", + "print(f\"Retrieved {basic_result['num_docs']} documents\")\n", + "print(f\"Total Tokens: {basic_result['total_tokens']}\")" + ], + "metadata": { + "id": "SqIT4IBGCiYZ", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "9a9c1abd-3a43-4060-ce28-225cca5a3197" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Basic RAG Results:\n", + "--------------------------------------------------\n", + "Retrieved 30 documents\n", + "Answer: The patent that does not reference reward-based optimization is US 2024/0256582 A1. 
Instead, it uses a training approach that involves generating a set of search results for a search query and providing the set of search results as part of an input prompt to guide a generative AI model in generating a summary response of the set of search results.\n", + "Retrieved 30 documents\n", + "Total Tokens: 22521\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "\n", + "print(\"\\nEnhanced RAG with Reranking:\")\n", + "print(\"-\" * 50)\n", + "enhanced_result = enhanced_rag.query(test_query)\n", + "\n", + "print(f\"Answer: {enhanced_result['answer']}\")\n", + "print(f\"Retrieved {enhanced_result['num_docs']} documents (from {len(enhanced_result['initial_docs'])} initial)\")\n", + "print(f\"Total Tokens: {enhanced_result['total_tokens']}\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SEN3_Yyx-RvK", + "outputId": "a3370c05-1acc-4911-98ca-f32dcbef0338" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "Enhanced RAG with Reranking:\n", + "--------------------------------------------------\n", + "Retrieved 30 documents\n", + "Reranked to top 10 documents\n", + "Answer: The patent US 11,886,826 B1 does not reference reward-based optimization. Instead, it uses an iterative training approach based on one or more datasets, which may include user instruction data or user-labeled data.\n", + "Retrieved 10 documents (from 30 initial)\n", + "Total Tokens: 7662\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Here, vanilla RAG with all 30 chunks as context gave a different, confused answer, and it used far more tokens.\n", + "\n", + "- **Vanilla RAG (k=30)** sends all 30 chunks straight to the LLM.\n", + "- **Reranked RAG** pulls 30 candidates, scores them, and keeps only the top 10.\n", + "\n", + "That’s roughly **3x fewer tokens** (7,662 vs. 22,521) for the same question.\n", + "\n", + "This isn’t just about cost. 
With longer inputs, LLM latency also goes up. \n", + "Reranking helps us trim the fat and stay within context limits without sacrificing accuracy.\n", + "\n", + "So if you're going to over-fetch from the vector store, it's almost always better to rerank before you send." + ], + "metadata": { + "id": "nB7IBMgbPSAr" + } + }, + { + "cell_type": "markdown", + "source": [ + "If you’re building a question-answering or document-heavy application, try plugging in a reranker. \n", + "It’s a simple addition that can boost accuracy, trim cost, and make your LLMs look smarter.\n", + "\n", + "You can adapt the exact same setup to papers, reports, contracts, or any long-form content where chunk retrieval alone might not cut it.\n", + "\n", + "Start from this notebook, swap in your own data, and see what changes." + ], + "metadata": { + "id": "rlDHmBDD0BbH" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "5m65iCtPwIWW" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/notebooks/Rag_without_Embeddings.ipynb b/notebooks/Rag_without_Embeddings.ipynb new file mode 100644 index 0000000..6020f5f --- /dev/null +++ b/notebooks/Rag_without_Embeddings.ipynb @@ -0,0 +1,1140 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# RAG without embeddings: A keyword-first retrieval stack\n", + "\n", + "Not every search problem needs a vector store.\n", + "\n", + "There are plenty of use cases, especially in incident response, enterprise ops, or tightly scoped document corpora, where plain old keyword retrieval can get you surprisingly far.\n", + "\n", + "This notebook explores what that looks like in practice: \n", + "A **BM25-powered RAG pipeline** built entirely without embeddings.\n", + "\n", + "We’ll
use:\n", + "- **Unstructured** to extract and chunk source docs from S3\n", + "- **Elasticsearch Serverless** to handle retrieval via BM25\n", + "- **LangChain + OpenAI** to run natural language queries over the results\n", + "\n", + "Along the way, we’ll see where this setup shines and where it quietly falls apart. \n", + "Some queries will resolve beautifully. Others will fail in subtle ways, with answers that *sound* right but aren't grounded.\n", + "\n", + "This isn’t about proving BM25 is enough. It’s about understanding what you get when you start simple.\n" + ], + "metadata": { + "id": "bJSlGjQx0XvA" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bBCWCbIH-y2B", + "outputId": "6b41b07d-0b80-4336-de7e-8e5e92fedfa2" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting unstructured-client\n", + " Downloading unstructured_client-0.42.1-py3-none-any.whl.metadata (23 kB)\n", + "Collecting elasticsearch\n", + " Downloading elasticsearch-9.1.0-py3-none-any.whl.metadata (8.4 kB)\n", + "Collecting langchain-community\n", + " Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)\n", + "Collecting langchain-openai\n", + " Downloading langchain_openai-0.3.28-py3-none-any.whl.metadata (2.3 kB)\n", + "Collecting langchain-elasticsearch\n", + " Downloading langchain_elasticsearch-0.3.2-py3-none-any.whl.metadata (8.3 kB)\n", + "Requirement already satisfied: aiofiles>=24.1.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (24.1.0)\n", + "Requirement already satisfied: cryptography>=3.1 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (43.0.3)\n", + "Requirement already satisfied: httpcore>=1.0.9 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.9)\n", + "Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from 
unstructured-client) (0.28.1)\n", + "Requirement already satisfied: pydantic>=2.11.2 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (2.11.7)\n", + "Collecting pypdf>=4.0 (from unstructured-client)\n", + " Downloading pypdf-5.9.0-py3-none-any.whl.metadata (7.1 kB)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in /usr/local/lib/python3.11/dist-packages (from unstructured-client) (1.0.0)\n", + "Collecting elastic-transport<10,>=9.1.0 (from elasticsearch)\n", + " Downloading elastic_transport-9.1.0-py3-none-any.whl.metadata (3.9 kB)\n", + "Requirement already satisfied: python-dateutil in /usr/local/lib/python3.11/dist-packages (from elasticsearch) (2.9.0.post0)\n", + "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.11/dist-packages (from elasticsearch) (4.14.1)\n", + "Requirement already satisfied: langchain-core<1.0.0,>=0.3.66 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (0.3.72)\n", + "Requirement already satisfied: langchain<1.0.0,>=0.3.26 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (0.3.27)\n", + "Requirement already satisfied: SQLAlchemy<3,>=1.4 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (2.0.41)\n", + "Requirement already satisfied: requests<3,>=2 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (2.32.3)\n", + "Requirement already satisfied: PyYAML>=5.3 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (6.0.2)\n", + "Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (3.12.14)\n", + "Requirement already satisfied: tenacity!=8.4.0,<10,>=8.1.0 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (8.5.0)\n", + "Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)\n", + " Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)\n", + "Collecting 
pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)\n", + " Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)\n", + "Requirement already satisfied: langsmith>=0.1.125 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (0.4.8)\n", + "Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)\n", + " Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)\n", + "Requirement already satisfied: numpy>=1.26.2 in /usr/local/lib/python3.11/dist-packages (from langchain-community) (2.0.2)\n", + "Requirement already satisfied: openai<2.0.0,>=1.86.0 in /usr/local/lib/python3.11/dist-packages (from langchain-openai) (1.97.1)\n", + "Requirement already satisfied: tiktoken<1,>=0.7 in /usr/local/lib/python3.11/dist-packages (from langchain-openai) (0.9.0)\n", + "Collecting elasticsearch\n", + " Downloading elasticsearch-8.19.0-py3-none-any.whl.metadata (9.2 kB)\n", + "Collecting elastic-transport<9,>=8.15.1 (from elasticsearch)\n", + " Downloading elastic_transport-8.17.1-py3-none-any.whl.metadata (3.8 kB)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain-community) (2.6.1)\n", + "Requirement already satisfied: aiosignal>=1.4.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain-community) (1.4.0)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain-community) (25.3.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain-community) (1.7.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain-community) (6.6.3)\n", + "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.11/dist-packages (from 
aiohttp<4.0.0,>=3.8.3->langchain-community) (0.3.2)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp<4.0.0,>=3.8.3->langchain-community) (1.20.1)\n", + "Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.11/dist-packages (from cryptography>=3.1->unstructured-client) (1.17.1)\n", + "Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)\n", + " Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)\n", + "Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)\n", + " Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)\n", + "Requirement already satisfied: urllib3<3,>=1.26.2 in /usr/local/lib/python3.11/dist-packages (from elastic-transport<9,>=8.15.1->elasticsearch) (2.5.0)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from elastic-transport<9,>=8.15.1->elasticsearch) (2025.7.14)\n", + "Requirement already satisfied: simsimd>=3 in /usr/local/lib/python3.11/dist-packages (from elasticsearch[vectorstore-mmr]<9.0.0,>=8.13.1->langchain-elasticsearch) (6.5.0)\n", + "Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.11/dist-packages (from httpcore>=1.0.9->unstructured-client) (0.16.0)\n", + "Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (4.9.0)\n", + "Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->unstructured-client) (3.10)\n", + "Requirement already satisfied: langchain-text-splitters<1.0.0,>=0.3.9 in /usr/local/lib/python3.11/dist-packages (from langchain<1.0.0,>=0.3.26->langchain-community) (0.3.9)\n", + "Requirement already satisfied: jsonpatch<2.0,>=1.33 in /usr/local/lib/python3.11/dist-packages (from langchain-core<1.0.0,>=0.3.66->langchain-community) (1.33)\n", + "Requirement already satisfied: 
packaging>=23.2 in /usr/local/lib/python3.11/dist-packages (from langchain-core<1.0.0,>=0.3.66->langchain-community) (25.0)\n", + "Requirement already satisfied: orjson<4.0.0,>=3.9.14 in /usr/local/lib/python3.11/dist-packages (from langsmith>=0.1.125->langchain-community) (3.11.1)\n", + "Requirement already satisfied: zstandard<0.24.0,>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from langsmith>=0.1.125->langchain-community) (0.23.0)\n", + "Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.11/dist-packages (from openai<2.0.0,>=1.86.0->langchain-openai) (1.9.0)\n", + "Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from openai<2.0.0,>=1.86.0->langchain-openai) (0.10.0)\n", + "Requirement already satisfied: sniffio in /usr/local/lib/python3.11/dist-packages (from openai<2.0.0,>=1.86.0->langchain-openai) (1.3.1)\n", + "Requirement already satisfied: tqdm>4 in /usr/local/lib/python3.11/dist-packages (from openai<2.0.0,>=1.86.0->langchain-openai) (4.67.1)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.33.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (2.33.2)\n", + "Requirement already satisfied: typing-inspection>=0.4.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.11.2->unstructured-client) (0.4.1)\n", + "Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)\n", + " Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests<3,>=2->langchain-community) (3.4.2)\n", + "Requirement already satisfied: greenlet>=1 in /usr/local/lib/python3.11/dist-packages (from SQLAlchemy<3,>=1.4->langchain-community) (3.2.3)\n", + 
"Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.11/dist-packages (from tiktoken<1,>=0.7->langchain-openai) (2024.11.6)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil->elasticsearch) (1.17.0)\n", + "Requirement already satisfied: pycparser in /usr/local/lib/python3.11/dist-packages (from cffi>=1.12->cryptography>=3.1->unstructured-client) (2.22)\n", + "Requirement already satisfied: jsonpointer>=1.9 in /usr/local/lib/python3.11/dist-packages (from jsonpatch<2.0,>=1.33->langchain-core<1.0.0,>=0.3.66->langchain-community) (3.0.0)\n", + "Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)\n", + " Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)\n", + "Downloading unstructured_client-0.42.1-py3-none-any.whl (207 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.2/207.2 kB\u001b[0m \u001b[31m9.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading langchain_community-0.3.27-py3-none-any.whl (2.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.5/2.5 MB\u001b[0m \u001b[31m56.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading langchain_openai-0.3.28-py3-none-any.whl (70 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m70.6/70.6 kB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading langchain_elasticsearch-0.3.2-py3-none-any.whl (45 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading elasticsearch-8.19.0-py3-none-any.whl (926 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m926.9/926.9 
kB\u001b[0m \u001b[31m30.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB)\n", + "Downloading elastic_transport-8.17.1-py3-none-any.whl (64 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m65.0/65.0 kB\u001b[0m \u001b[31m4.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading httpx_sse-0.4.1-py3-none-any.whl (8.1 kB)\n", + "Downloading pydantic_settings-2.10.1-py3-none-any.whl (45 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m45.2/45.2 kB\u001b[0m \u001b[31m3.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pypdf-5.9.0-py3-none-any.whl (313 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m313.2/313.2 kB\u001b[0m \u001b[31m18.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading marshmallow-3.26.1-py3-none-any.whl (50 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m50.9/50.9 kB\u001b[0m \u001b[31m2.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)\n", + "Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)\n", + "Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB)\n", + "Installing collected packages: python-dotenv, pypdf, mypy-extensions, marshmallow, httpx-sse, elastic-transport, typing-inspect, elasticsearch, unstructured-client, pydantic-settings, dataclasses-json, langchain-openai, langchain-elasticsearch, langchain-community\n", + "Successfully installed dataclasses-json-0.6.7 elastic-transport-8.17.1 elasticsearch-8.19.0 httpx-sse-0.4.1 langchain-community-0.3.27 langchain-elasticsearch-0.3.2 langchain-openai-0.3.28 marshmallow-3.26.1 mypy-extensions-1.1.0 pydantic-settings-2.10.1 pypdf-5.9.0 python-dotenv-1.1.1 typing-inspect-0.9.0 
unstructured-client-0.42.1\n" + ] + } + ], + "source": [ + "!pip install -U unstructured-client elasticsearch langchain-community langchain-openai langchain-elasticsearch" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Setting up credentials and environment variables\n", + "\n", + "Before we define our workflow, we’ll load the necessary credentials for all the external services we’ll be using — Unstructured, AWS S3 (as our source), and Elasticsearch (as our destination).\n", + "\n", + "These are securely pulled from Colab secrets using `userdata.get(...)`, so make sure you’ve already added them via the “🔐 Secrets” tab in Colab.\n", + "\n", + "Here’s what each one is used for:\n", + "\n", + "- **Unstructured API key**: Required to access the Unstructured Workflows API.\n", + "- **S3 credentials**: Used to fetch documents from an S3 bucket or folder.\n", + "- **Elasticsearch credentials**: Used to push the processed, structured data into an Elasticsearch Serverless index.\n", + "\n", + "---\n", + "\n", + "### Where to get these values\n", + "\n", + "Here’s a quick guide on how to fetch the required credentials:\n", + "\n", + "#### 🔑 Unstructured API Key\n", + "[Contact us](https://unstructured.io/enterprise) to get access or log in if you're already a user.\n", + "\n", + "\n", + "#### 🪣 S3 Credentials\n", + "We’re using the [S3 Source Connector](https://docs.unstructured.io/api-reference/workflow/sources/s3). You’ll need:\n", + "\n", + "- **AWS Access Key ID** and **Secret Access Key**: You can create these from your AWS IAM dashboard by creating a user with “AmazonS3ReadOnlyAccess” or similar permissions.\n", + "- **S3 Remote URL**: This should point to the folder or bucket you want to ingest from — e.g. `s3://your-bucket-name/path-to-folder/`. 
Make sure it’s in URI format.\n", + "\n", + "\n", + "\n", + "#### 🔍 Elasticsearch (Serverless)\n", + "We’re using the [Elasticsearch destination connector](https://docs.unstructured.io/api-reference/workflow/destinations/elasticsearch). To set this up:\n", + "\n", + "1. Go to [https://cloud.elastic.co](https://cloud.elastic.co) and create a **Serverless Project**.\n", + "2. Under **Project Settings → API Keys**, create a new key.\n", + "3. Grab the following values:\n", + " - **API key** (you’ll use this as `ES_API_KEY`)\n", + " - **Deployment URL** (this becomes `ES_HOST_NAME`)\n", + " - Your target **index name** (set this as `ES_INDEX_NAME`)\n", + "\n", + "That’s it — once these are in place as secrets, we’re ready to configure the connectors programmatically in the next step.\n" + ], + "metadata": { + "id": "lKwIDPhOt43L" + } + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "import time\n", + "from datetime import datetime\n", + "from google.colab import userdata\n", + "\n", + "# Unstructured\n", + "os.environ['UNSTRUCTURED_API_KEY'] = userdata.get('UNSTRUCTURED_API_KEY')\n", + "\n", + "# AWS S3\n", + "os.environ['AWS_ACCESS'] = userdata.get('AWS_ACCESS')\n", + "os.environ['AWS_SECRET'] = userdata.get('AWS_SECRET')\n", + "os.environ['S3_REMOTE_URL'] = userdata.get(\"S3_REMOTE_URL\")\n", + "\n", + "\n", + "# Elasticsearch Serverless\n", + "os.environ['ES_INDEX_NAME'] = userdata.get('ES_INDEX_NAME')\n", + "os.environ['ES_HOST_NAME'] = userdata.get('ES_HOST_NAME')\n", + "os.environ['ES_API_KEY'] = userdata.get('ES_API_KEY')\n", + "\n", + "\n", + "\n" + ], + "metadata": { + "id": "xzK2Wkd4_B78" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# instantiate Unstructured Client\n", + "from unstructured_client import UnstructuredClient\n", + "\n", + "unstructured_client = UnstructuredClient(api_key_auth=os.environ['UNSTRUCTURED_API_KEY'])\n", + "\n", + "# helper function\n", + "def 
pretty_print_model(response_model):\n", + " print(response_model.model_dump_json(indent=4))" + ], + "metadata": { + "id": "F5ODuJ17HdzP" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Registering the S3 source connector\n", + "\n", + "Now that our credentials are set, let’s connect to the raw data stored in S3.\n", + "\n", + "This step registers an **S3 source connector** with the Unstructured API. Once created, this connector tells the system where to pull documents from during workflow execution.\n", + "\n", + "Here’s what’s happening:\n", + "- We use the S3 credentials and remote URL from earlier.\n", + "- `recursive=True` ensures that files inside nested folders will also be processed.\n", + "\n", + "Once the source is registered, Unstructured will return a unique `source_id` — you’ll use this to define the pipeline input in the next step." + ], + "metadata": { + "id": "hpPiTZ63ulgQ" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateSourceRequest\n", + "from unstructured_client.models.shared import CreateSourceConnector\n", + "\n", + "formatted_time = datetime.now().strftime(\"%H:%M:%S\")\n", + "source_response = unstructured_client.sources.create_source(\n", + " request=CreateSourceRequest(\n", + " create_source_connector=CreateSourceConnector(\n", + " name=f\"Rag w/o Embeddings Source_ {formatted_time}\",\n", + " type=\"s3\",\n", + " config={\n", + " \"key\": os.environ.get('AWS_ACCESS'),\n", + " \"secret\": os.environ.get('AWS_SECRET'),\n", + " \"remote_url\": os.environ.get('S3_REMOTE_URL'),\n", + " \"recursive\": True\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "pretty_print_model(source_response.source_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MEGFPod_zBhE", + "outputId": "6ae57580-3068-4be6-c3a4-4342194ff751" + }, + "execution_count": null, + "outputs": [ + { + 
"output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"anonymous\": false,\n", + " \"recursive\": true,\n", + " \"remote_url\": \"s3://ajay-uns-devrel-content/agentic-analysis/\",\n", + " \"key\": \"**********\",\n", + " \"secret\": \"**********\"\n", + " },\n", + " \"created_at\": \"2025-08-06T14:34:21.898458Z\",\n", + " \"id\": \"fbb2a2da-156e-4317-a394-40596bc7b102\",\n", + " \"name\": \"Rag w/o Embeddings Source_ 14:34:21\",\n", + " \"type\": \"s3\",\n", + " \"updated_at\": \"2025-08-06T14:34:22.081140Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Registering the Elasticsearch destination connector\n", + "\n", + "With our source in place, we now define where the processed data should go.\n", + "\n", + "In this case, we’re using **Elasticsearch Serverless** as our destination. This connector pushes cleaned, structured chunks directly into your configured index — making them queryable for downstream RAG tasks.\n", + "\n", + "Here’s a breakdown of what’s passed into the connector:\n", + "- `hosts`: The Elasticsearch deployment URL (from your Serverless project).\n", + "- `es_api_key`: The API key you created earlier for secure access.\n", + "- `index_name`: The target index where documents will be stored.\n", + "\n", + "> 📌 Note: The index will be created automatically if it doesn’t already exist.\n", + "\n", + "After this step, Unstructured will return a `destination_id`, which we’ll use to tie the source and destination together in the next step: building the workflow.\n" + ], + "metadata": { + "id": "hCHd82pruznb" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.operations import CreateDestinationRequest\n", + "from unstructured_client.models.shared import CreateDestinationConnector\n", + "\n", + "destination_response = unstructured_client.destinations.create_destination(\n", + " request=CreateDestinationRequest(\n", + " 
create_destination_connector=CreateDestinationConnector(\n", + " name=f\"ES_Destination_connector_{formatted_time}\",\n", + " type=\"elasticsearch\",\n", + " config={\n", + " \"hosts\": [os.environ['ES_HOST_NAME']],\n", + " \"es_api_key\": os.environ['ES_API_KEY'],\n", + " \"index_name\": os.environ['ES_INDEX_NAME']\n", + " }\n", + " )\n", + " )\n", + ")\n", + "\n", + "pretty_print_model(destination_response.destination_connector_information)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3bBlUKx1zoEi", + "outputId": "106525eb-fb68-44f7-ab88-c724e602908b" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"config\": {\n", + " \"es_api_key\": \"**********\",\n", + " \"hosts\": [\n", + " \"https://my-elasticsearch-project-cf9288.es.us-east-1.aws.elastic.cloud:443\"\n", + " ],\n", + " \"index_name\": \"es-demo\"\n", + " },\n", + " \"created_at\": \"2025-08-06T14:36:34.442290Z\",\n", + " \"id\": \"19bd8287-d7b5-4d7d-84ab-63ad14e07b70\",\n", + " \"name\": \"ES_Destination_connector_14:34:21\",\n", + " \"type\": \"elasticsearch\",\n", + " \"updated_at\": \"2025-08-06T14:36:34.562580Z\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Building an Unstructured workflow\n", + "\n", + "Now that we’ve registered both our source and destination connectors, it’s time to define how documents should be processed.\n", + "\n", + "This step creates a **custom workflow** in Unstructured that connects:\n", + "1. The S3 source (documents in)\n", + "2. A two-step transformation pipeline\n", + "3. The Elasticsearch destination (clean chunks out)\n", + "\n", + "Here’s what the processing nodes do:\n", + "\n", + "- **Partitioner**: Uses a Vision-Language Model (Anthropic Claude Sonnet) to extract clean structured content — preserving layout, tables, and section headers.\n", + "- **Chunker**: Breaks up the content into smaller pieces. 
We’re using a title-aware strategy with controlled overlap (`4096` max characters, `150`-character overlap) to preserve context for retrieval.\n", + "\n", + "> 🔍 No embedder here, and that’s intentional. \n", + "> For this tutorial, we’ll be using **BM25** for retrieval instead of dense vector embeddings, so there’s no need to generate embeddings in this pipeline.\n", + "\n", + "Once the workflow is created, we save the `workflow_id` so we can run it in the next step.\n" + ], + "metadata": { + "id": "-cc-hhaOvRs_" + } + }, + { + "cell_type": "code", + "source": [ + "from unstructured_client.models.shared import (\n", + " WorkflowNode,\n", + " WorkflowType,\n", + " Schedule\n", + ")\n", + "\n", + "partition_node = WorkflowNode(\n", + " name=\"Partitioner\",\n", + " subtype=\"vlm\",\n", + " type=\"partition\",\n", + " settings={\n", + " \"provider\": \"anthropic\",\n", + " \"model\": \"claude-3-7-sonnet-20250219\",\n", + " }\n", + " )\n", + "\n", + "chunk_node = WorkflowNode(\n", + " name=\"Chunker\",\n", + " subtype=\"chunk_by_title\",\n", + " type=\"chunk\",\n", + " settings={\n", + " \"new_after_n_chars\": 1000,\n", + " \"max_characters\": 4096,\n", + " \"overlap\": 150\n", + " }\n", + ")\n", + "\n", + "response = unstructured_client.workflows.create_workflow(\n", + " request={\n", + " \"create_workflow\": {\n", + " \"name\": f\"Rag w/o Embeddings Tutorial Workflow_ {time.time()}\",\n", + " \"source_id\": source_response.source_connector_information.id,\n", + " \"destination_id\": destination_response.destination_connector_information.id,\n", + " \"workflow_type\": WorkflowType.CUSTOM,\n", + " \"workflow_nodes\": [\n", + " partition_node,\n", + " chunk_node\n", + " ]\n", + " }\n", + " }\n", + ")\n", + "\n", + "pretty_print_model(response.workflow_information)\n", + "workflow_id = response.workflow_information.id" + ], + "metadata": { + "id": "i2usCToqHy3O", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "377e58a4-0e50-4d1a-b3dc-16a5b8b3fccd" 
+ }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-07-19T19:40:35.002072Z\",\n", + " \"destinations\": [\n", + " \"91757008-ff9a-4bb0-9b17-6b7253ff8739\"\n", + " ],\n", + " \"id\": \"46a8b815-0528-4afe-bba4-03f05f4310b5\",\n", + " \"name\": \"Rag w/o Embeddings Tutorial Workflow_ 1752954034.9518843\",\n", + " \"sources\": [\n", + " \"3792b879-16c5-4452-9f95-6c45400b5573\"\n", + " ],\n", + " \"status\": \"active\",\n", + " \"workflow_nodes\": [\n", + " {\n", + " \"name\": \"Partitioner\",\n", + " \"subtype\": \"vlm\",\n", + " \"type\": \"partition\",\n", + " \"id\": \"969ca37e-4dd3-44a0-939b-0098c2cb9a9b\",\n", + " \"settings\": {\n", + " \"provider\": \"anthropic\",\n", + " \"provider_api_key\": null,\n", + " \"model\": \"claude-3-7-sonnet-20250219\",\n", + " \"output_format\": \"text/html\",\n", + " \"prompt\": null,\n", + " \"format_html\": true,\n", + " \"unique_element_ids\": true,\n", + " \"is_dynamic\": false,\n", + " \"allow_fast\": true\n", + " }\n", + " },\n", + " {\n", + " \"name\": \"Chunker\",\n", + " \"subtype\": \"chunk_by_title\",\n", + " \"type\": \"chunk\",\n", + " \"id\": \"eb9525bb-1915-48db-80fe-c211b69c7593\",\n", + " \"settings\": {\n", + " \"unstructured_api_url\": null,\n", + " \"unstructured_api_key\": null,\n", + " \"multipage_sections\": false,\n", + " \"combine_text_under_n_chars\": null,\n", + " \"include_orig_elements\": false,\n", + " \"new_after_n_chars\": 1000,\n", + " \"max_characters\": 4096,\n", + " \"overlap\": 150,\n", + " \"overlap_all\": false,\n", + " \"contextual_chunking_strategy\": null\n", + " }\n", + " }\n", + " ],\n", + " \"reprocess_all\": false,\n", + " \"schedule\": {\n", + " \"crontab_entries\": []\n", + " },\n", + " \"updated_at\": \"2025-07-19T19:40:35.013554Z\",\n", + " \"workflow_type\": \"custom\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Run the workflow\n", + "\n", + "Run the 
following cell to start running the workflow." + ], + "metadata": { + "id": "tOIkt9GOwf6i" + } + }, + { + "cell_type": "code", + "source": [ + "res = unstructured_client.workflows.run_workflow(\n", + " request={\n", + " \"workflow_id\": workflow_id,\n", + " }\n", + ")\n", + "\n", + "pretty_print_model(res.job_information)" + ], + "metadata": { + "id": "7QLgPRt-JNYD", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "8660feae-8dec-4836-a947-709b47bcb792" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{\n", + " \"created_at\": \"2025-07-19T19:40:36.320615Z\",\n", + " \"id\": \"5270557c-2e97-4bc2-998c-0eb8af189c18\",\n", + " \"status\": \"SCHEDULED\",\n", + " \"workflow_id\": \"46a8b815-0528-4afe-bba4-03f05f4310b5\",\n", + " \"workflow_name\": \"Rag w/o Embeddings Tutorial Workflow_ 1752954034.9518843\",\n", + " \"job_type\": \"ephemeral\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Get the workflow run's job ID\n", + "\n", + "Run the following cell to get the workflow run's job ID, which is needed to poll for job completion later. If successful, Unstructured prints the job's ID." 
+ ], + "metadata": { + "id": "ObIv1fHfwigb" + } + }, + { + "cell_type": "code", + "source": [ + "response = unstructured_client.jobs.list_jobs(\n", + " request={\n", + " \"workflow_id\": workflow_id\n", + " }\n", + ")\n", + "\n", + "last_job = response.response_list_jobs[0]\n", + "job_id = last_job.id\n", + "print(f\"job_id: {job_id}\")" + ], + "metadata": { + "id": "LP5ZPuQJJgQp", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "66223620-9b5f-4bab-84db-398dcfd7a2c9" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "job_id: 5270557c-2e97-4bc2-998c-0eb8af189c18\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Poll for job completion\n", + "\n", + "Run the following cell to poll the job until it leaves the `SCHEDULED` and `IN_PROGRESS` states. If the run succeeded, the printed job information includes `\"status\": \"COMPLETED\"`; any other final status is reported as-is." + ], + "metadata": { + "id": "hJoLbPwLJupD" + } + }, + { + "cell_type": "code", + "source": [ + "def poll_job_status(job_id, wait_time=30):\n", + " while True:\n", + " response = unstructured_client.jobs.get_job(\n", + " request={\n", + " \"job_id\": job_id\n", + " }\n", + " )\n", + "\n", + " job = response.job_information\n", + "\n", + " if job.status == \"SCHEDULED\":\n", + " print(f\"Job is scheduled, polling again in {wait_time} seconds...\")\n", + " time.sleep(wait_time)\n", + " elif job.status == \"IN_PROGRESS\":\n", + " print(f\"Job is in progress, polling again in {wait_time} seconds...\")\n", + " time.sleep(wait_time)\n", + " else:\n", + " print(\"Job is completed\" if job.status == \"COMPLETED\" else f\"Job ended with status: {job.status}\")\n", + " break\n", + "\n", + " return job\n", + "\n", + "job = poll_job_status(job_id)\n", + "pretty_print_model(job)" + ], + "metadata": { + "id": "nEfi8Q_SJzuh", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": 
"8010469c-764b-423e-c5a2-21debc442537" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ 
+ "Job is scheduled, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is in progress, polling again in 30 seconds...\n", + "Job is completed\n", + "{\n", + " \"created_at\": \"2025-07-19T19:40:36.320615\",\n", + " \"id\": \"5270557c-2e97-4bc2-998c-0eb8af189c18\",\n", + " \"status\": \"COMPLETED\",\n", + " \"workflow_id\": \"46a8b815-0528-4afe-bba4-03f05f4310b5\",\n", + " \"workflow_name\": \"Rag w/o Embeddings Tutorial Workflow_ 1752954034.9518843\",\n", + " \"job_type\": \"ephemeral\",\n", + " \"runtime\": \"PT0S\"\n", + "}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "At this point, we’ve successfully run the full Unstructured pipeline:\n", + "\n", + "- Documents were pulled from S3\n", + "- Cleaned and chunked using the Partitioner and Chunker nodes\n", + "- And indexed into our Elasticsearch Serverless instance\n", + "\n", + "All of this happened without generating embeddings — and that’s by design.\n", + "\n", + "In the next section, we’ll build a lightweight **RAG pipeline** that uses traditional keyword-based search (**BM25**) to retrieve context from Elasticsearch" + ], + "metadata": { + "id": "dqlCmPPDwGQb" + } + }, + { + "cell_type": "markdown", + "source": [ + "## RAG\n", + "\n", + "In this section, we’ll build a Retrieval-Augmented Generation (RAG) pipeline but without using any embeddings. 
\n", + "Instead, we’ll rely on a classic scoring algorithm called **BM25**, which powers the keyword-based search inside Elasticsearch.\n", + "\n", + "### What is BM25?\n", + "\n", + "BM25 is a **ranking function** that scores documents based on how well they match a query using exact terms, partial matches, and some clever normalization behind the scenes.\n", + "\n", + "It’s been a staple in information retrieval for decades, and it still holds up remarkably well when:\n", + "- Your documents are chunked cleanly\n", + "- Your queries are fairly literal (i.e., not abstract or fuzzy)\n", + "\n", + "Here’s how it works, at a high level:\n", + "\n", + "- **Matching terms boost relevance**: If a chunk contains your search terms, it scores higher.\n", + "- **Rare words carry more weight**: Matches on uncommon terms matter more than matches on generic words.\n", + "- **Document length is normalized**: Longer chunks don’t get an unfair advantage just because they mention everything.\n", + "\n", + "Unlike dense embeddings, BM25 doesn’t “understand” semantic meaning. It’s not going to connect synonyms or paraphrases. 
But when your queries are sharp and your chunking is good — it can work surprisingly well.\n", + "\n", + "> 🧠 Why use this?\n", + "> - It’s **fast**, **transparent**, and doesn’t need a GPU or embedding model.\n", + "> - It’s perfect for bootstrapping or low-latency use cases.\n", + "\n", + "We’ll now query the indexed data in Elasticsearch using BM25 and pass the results into our LLM to generate grounded answers.\n" + ], + "metadata": { + "id": "g7ZYinLyKkc0" + } + }, + { + "cell_type": "code", + "source": [ + "from langchain_elasticsearch import ElasticsearchStore, BM25Strategy\n", + "from langchain_openai import ChatOpenAI\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from elasticsearch import Elasticsearch" + ], + "metadata": { + "id": "wbsTPR6rxo2c" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Setting up the BM25-backed RAG pipeline\n", + "\n", + "With our data indexed and ready, we can now run queries over it using BM25 retrieval.\n", + "\n", + "Here’s how this section works:\n", + "1. We connect to the Elasticsearch Serverless instance using the `Elasticsearch` Python client.\n", + "2. We initialize a `BM25Strategy` — this wraps keyword-based scoring around our document chunks.\n", + "3. We query Elasticsearch for the top-k most relevant chunks (`similarity_search`), and pass them to GPT-4o to generate an answer.\n", + "\n", + "#### BM25 parameters: `k1` and `b`\n", + "\n", + "- **`k1` (default: `1.2`)** \n", + " Controls **term frequency scaling** — how much repeated terms matter. \n", + " - Higher `k1` = more boost for repeated keywords \n", + " - Lower `k1` = frequency saturates quickly\n", + "\n", + "- **`b` (default: `0.75`)** \n", + " Controls **document length normalization** — i.e., should longer chunks be penalized? 
\n", + " - `b = 0` → No length penalty (longer chunks may dominate) \n", + " - `b = 1` → Full normalization (neutralizes doc length bias)\n", + "\n", + "These values work well in practice, but you can tune them if:\n", + "- Your chunks are very short/long\n", + "- You see irrelevant long documents dominating results\n", + "\n", + "\n", + "The `run_query_direct(...)` function wraps the whole RAG flow:\n", + "\n", + "- It retrieves the top-k hits via BM25\n", + "- Assembles a context string\n", + "- Injects it into a prompt\n", + "- And uses GPT-4o to answer based only on that context\n" + ], + "metadata": { + "id": "Ewi4CNb7xjJV" + } + }, + { + "cell_type": "code", + "source": [ + "os.environ[\"OPENAI_API_KEY\"] = userdata.get('OPENAI_API_KEY')\n", + "\n", + "def connect_elasticsearch():\n", + " return Elasticsearch(\n", + " os.environ['ES_HOST_NAME'],\n", + " api_key=os.environ['ES_API_KEY']\n", + " )\n", + "\n", + "def init_bm25_store(es_client, index_name):\n", + " bm25_strategy = BM25Strategy(k1=1.2, b=0.75)\n", + "\n", + " store = ElasticsearchStore(\n", + " es_connection=es_client,\n", + " index_name=index_name,\n", + " strategy=bm25_strategy\n", + " )\n", + " return store\n", + "\n", + "def run_query_direct(store, query, k=5):\n", + " print(f\"\\n--- QUERY: {query} ---\")\n", + "\n", + " docs = store.similarity_search(query, k=k)\n", + "\n", + " context = \"\\n\\n\".join([doc.page_content for doc in docs])\n", + "\n", + " llm = ChatOpenAI(model=\"gpt-4o\")\n", + "\n", + " prompt = ChatPromptTemplate.from_template(\"\"\"\n", + " Answer the following question based only on the provided context:\n", + "\n", + " Context: {context}\n", + "\n", + " Question: {question}\n", + "\n", + " Answer:\n", + " \"\"\")\n", + "\n", + " formatted_prompt = prompt.format(context=context, question=query)\n", + " response = llm.invoke(formatted_prompt)\n", + "\n", + " print(\"RETRIEVED DOCUMENTS:\")\n", + " for i, doc in enumerate(docs, 1):\n", + " print(f\"{i}. 
{doc.page_content[:200]}...\")\n", + "\n", + "\n", + " print(f\"\\nANSWER:\")\n", + " print(response.content)\n", + "\n", + " return response.content, docs" + ], + "metadata": { + "id": "ZY3cdKCeNw5u" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "es_client = connect_elasticsearch()\n", + "store = init_bm25_store(es_client, os.environ['ES_INDEX_NAME'])\n" + ], + "metadata": { + "collapsed": true, + "id": "fe92JbhvOBu5" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now let's run a sample question on our data" + ], + "metadata": { + "id": "qxwxZFNkyjj5" + } + }, + { + "cell_type": "code", + "source": [ + "response, docs = run_query_direct(store, \"What are the containment procedures?\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "S2DfXAfOO1lx", + "outputId": "f7f5db43-b103-4c1f-db28-8dd8f4c7b762" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "--- QUERY: What are the containment procedures? ---\n", + "RETRIEVED DOCUMENTS:\n", + "1. Analyze for Common Adversary TTPs\n", + "\n", + "Compare TTPs to adversary TTPs documented in ATT&CK and analyze how the TTPs fit into the attack lifecycle. TTPs describe \"why,\" \"what,\" and \"how.\" Tactics describe ...\n", + "2. TLP:CLEAR\n", + "\n", + "Incident Response Process flowchart showing the workflow from START through various phases including Declare Incident, Determine Investigation Scope, Share CTI, Collect and Preserve Data, P...\n", + "3. TLP:CLEAR\n", + "\n", + "Step Incident Response Procedure Action Taken Date Completed 9c. Reset passwords on compromised accounts. 9d. Implement multi-factor authentication for all access methods. 9e. Install updat...\n", + "4. 7. Contain Activity (Short-term Mitigations)\n", + "\n", + "7a. 
Determine appropriate containment strategy, including: • Requirement to preserve evidence • Availability of services (e.g., network connectivity, serv...\n", + "5. TLP:CLEAR\n", + "\n", + "Term Definition Source National Security Systems (NSS) National Security Systems (NSS) are information systems as defined in 44 U.S.C.3552(b)(6). {A}The term \"national security system\" mean...\n", + "\n", + "ANSWER:\n", + "The containment procedures outlined in the provided context involve several strategic actions aimed at mitigating immediate threats while preserving evidence and maintaining system operations where possible. Here is a summary of the containment procedures:\n", + "\n", + "1. **Determine Appropriate Containment Strategy**:\n", + " - Assess the requirement to preserve evidence.\n", + " - Consider the availability of services like network connectivity and services continuity.\n", + " - Take into account resource constraints and the duration of containment steps.\n", + "\n", + "2. **System Backup**:\n", + " - Create backups to preserve evidence and facilitate continued investigation.\n", + "\n", + "3. **Coordinate with Law Enforcement**:\n", + " - If necessary, engage with law enforcement to collect and preserve evidence before eradication.\n", + "\n", + "4. **Isolation of Affected Systems and Networks**:\n", + " - Implement perimeter containment and internal network containment.\n", + " - Conduct host-based or endpoint isolation.\n", + " - Temporarily disconnect public-facing systems from the internet.\n", + "\n", + "5. **Update Security Configurations**:\n", + " - Close specific network ports and mail servers.\n", + " - Update firewall filtering rules.\n", + "\n", + "6. **Credential and Access Management**:\n", + " - Change system admin passwords.\n", + " - Rotate private keys and service/application account secrets where compromise is suspected.\n", + " - Revoke privileged access.\n", + "\n", + "7. 
**Blocking and Monitoring**:\n", + " - Block and log unauthorized access and traffic to and from known attacker IP addresses and other identified sources of threat.\n", + " - Prevent DNS resolution of known attacker domain names.\n", + "\n", + "8. **Restrict Network Communications**:\n", + " - Prevent compromised systems from connecting to other systems on the network.\n", + "\n", + "9. **Adversary Activity Monitoring**:\n", + " - Advanced Security Operations Centers (SOCs) may redirect adversaries to a sandbox environment to monitor activities, gather evidence, and identify TTPs.\n", + " - Continuously monitor for signs of threat actor response to containment activities.\n", + "\n", + "10. **Reporting and Adjustments**:\n", + " - Update the timeline and findings with new indicators.\n", + " - If new signs of compromise are found, return to technical analysis to reassess and potentially expand the investigation scope.\n", + "\n", + "11. **Evidence Preservation**:\n", + " - Upon successful containment with no new signs of compromise, preserve evidence for future reference or law enforcement investigation.\n", + "\n", + "These steps collectively help to limit the spread of the attack while facilitating further investigation and evidence collection, ultimately supporting subsequent eradication efforts.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "Because the query used clear, operational language (“containment procedures”), BM25 was able to surface high-signal chunks that directly addressed the topic — including full containment checklists and tactical steps.\n", + "\n", + "The LLM then stitched together the overlapping context into a clean, actionable list — covering evidence preservation, system isolation, access revocation, and more.\n", + "\n", + "> ✅ This is where keyword search shines: when your documents are structured, and your query terms match section headers or list items directly.\n" + ], + "metadata": { + "id": 
"JcLnWVnqyuRN" + } + }, + { + "cell_type": "markdown", + "source": [ + "Now let's try a more abstract query" + ], + "metadata": { + "id": "jS5SQkpJzKVg" + } + }, + { + "cell_type": "code", + "source": [ + "response, docs = run_query_direct(store, \"Where in the document does it describe coordination between the SOC and executive leadership?\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Qw7xpb-NO13b", + "outputId": "7c387616-6a12-440a-8de2-a8bd256c0898" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "--- QUERY: Where in the document does it describe coordination between the SOC and executive leadership? ---\n", + "RETRIEVED DOCUMENTS:\n", + "1. TLP:CLEAR TLP:CLEAR label in black background with white text CISA | Cybersecurity and Infrastructure Security Agency 2\n", + "\n", + "TLP:CLEAR\n", + "\n", + "INTRODUCTION\n", + "\n", + "The Cybersecurity and Infrastructure Security Agency (...\n", + "2. TLP:CLEAR\n", + "\n", + "Term Definition Source National Security Systems (NSS) National Security Systems (NSS) are information systems as defined in 44 U.S.C.3552(b)(6). {A}The term \"national security system\" mean...\n", + "3. APPENDIX G: SOURCE TEXT\n", + "\n", + "Agency Responsibilities References Cyber Response Group (CRG) Coordinates the development and implementation of the federal government's policies, strategies, and procedures f...\n", + "4. TLP:CLEAR\n", + "\n", + "Step Incident Response Procedure Action Taken Date Completed 9c. Reset passwords on compromised accounts. 9d. Implement multi-factor authentication for all access methods. 9e. Install updat...\n", + "5. Coordination with CISA\n", + "\n", + "Cyber defense capabilities vary widely. For this reason, coordination involves different degrees of engagement between the affected agency and CISA. 
As a baseline, every cybers...\n", + "\n", + "ANSWER:\n", + "The document discusses coordination between CISA and affected FCEB agencies in several places, but it does not specifically mention coordination between the Security Operations Center (SOC) and executive leadership. The document primarily focuses on the coordination of incident response activities, communication, and information sharing between CISA and FCEB agencies, along with the responsibilities of various groups such as the Cyber Unified Coordination Group (C-UCG). However, it emphasizes the importance of communication and situational awareness between agency leadership and CISA, which may imply coordination at various organizational levels, including the SOC and executive leadership.\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "We asked about coordination between the **SOC** and **executive leadership**, but the documents retrieved didn’t contain an exact match. Instead, they surfaced adjacent topics like:\n", + "\n", + "- Reporting incidents to **CISA** and **IT leadership**\n", + "- Establishing cross-agency communications protocols\n", + "\n", + "The LLM still produced a fluent answer — but it was largely inferred from context, not grounded in an exact passage. This is a classic case where **BM25 lacks the fuzziness or semantic awareness** needed to bridge slightly different wording.\n", + "\n", + "> ⚠️ This is where embedding-based retrieval would outperform: \n", + "> A vector store could connect “SOC coordination” with descriptions of escalation protocols, even if those words aren’t used verbatim.\n", + "\n", + "So while BM25 gave us *close-ish* chunks, the final answer wasn’t fully supported by the source, and that’s important to catch." 
+ ], + "metadata": { + "id": "JBfXX1DQzoIz" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Conclusion\n", + "\n", + "This walkthrough demonstrated how to build a RAG pipeline without embeddings, relying instead on **BM25 keyword search** for document retrieval.\n", + "\n", + "We saw that:\n", + "\n", + "- 🔍 BM25 performs well when queries use **precise terms** that align closely with the document’s language or structure.\n", + "- ⚠️ It falls short when the language **diverges**, for example when abstract or cross-functional questions are not spelled out in exact keywords.\n", + "- 🤖 The LLM can sometimes *paper over* poor retrieval by guessing, but that breaks the grounding contract of RAG.\n", + "\n", + "### When does this approach make sense?\n", + "\n", + "Use BM25-based RAG when:\n", + "- Your document set is small to medium-sized\n", + "- You don’t want to manage embeddings or vector stores\n", + "- Your queries are likely to match real wording in the docs (e.g., checklists, procedures, FAQs)\n", + "\n", + "But if you’re working with more ambiguous queries, or with documents whose phrasing varies, **embedding-based search or a hybrid strategy** will likely perform better.\n", + "\n", + "---\n", + "\n", + "### ✅ Next steps\n", + "\n", + "Try extending this notebook by:\n", + "- Swapping in a **hybrid retrieval strategy** (BM25 + vectors)\n", + "- Adding an **embedding step** to the Unstructured workflow\n", + "- Testing queries that deliberately push the limits of lexical matching\n", + "\n", + "You now have a full BM25-based RAG system running. Feel free to plug in your own docs and explore how it holds up.\n", + "\n", + "> ⚡️ Want to go deeper? 
\n", + "> Check out [Unstructured’s API docs](https://docs.unstructured.io) for advanced connectors, chunking strategies, and embedding options.\n" + ], + "metadata": { + "id": "Um4A3Se4z_cN" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "wVX0TlZOSFBT" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file