# Building RAG for Docs with Firecrawl <> Unstructured Platform API <> MongoDB

Have you ever found yourself frustrated while searching through technical documentation? Hours spent going through pages, trying to find that one specific detail you need to solve your problem? You're not alone.

What if you could have a personal documentation assistant that knows all your docs inside and out, can stay up-to-date automatically, and answer your specific questions precisely?

In this tutorial, we're building exactly that: a documentation copilot that combines web crawling, intelligent document processing, and vector search to create a system that actually understands your documentation and can answer questions about it naturally.

By connecting [Firecrawl](https://www.firecrawl.dev/)'s powerful web crawler, [Unstructured](https://unstructured.io/)'s document processing workflows, and [MongoDB](https://www.mongodb.com/)'s vector search, we'll create a tool that:

* Ingests documentation from any source (public or private)
* Understands context across document boundaries
* Answers questions using your documentation as its source of truth
* Saves you hours of searching through fragmented documentation

Let's dive in and build a tool that makes documentation work for you instead of the other way around!

## Setting up Dependencies

### Requirements

For this project, you'll need the following libraries:

- `firecrawl-py` – Firecrawl's Python SDK, used to crawl and scrape websites
- `unstructured-client` – the Unstructured Python client, used to set up the connectors and workflow that preprocess your crawled data
- `openai` – to generate embeddings and run chat completions with OpenAI models
- `pymongo` – to interact with MongoDB and perform vector search
- `boto3` – to upload the crawled HTML files to an S3 bucket


In [None]:
!pip install -qU pymongo openai firecrawl-py "unstructured-client>=0.31.3" boto3

This notebook walks through the complete process of building an intelligent documentation assistant. We'll cover:

* Data Collection: Using Firecrawl to automatically gather documentation from websites
* Document Processing: Using Unstructured to transform raw HTML into LLM-ready chunks
* Storage & Retrieval: Setting up MongoDB for efficient vector search
* Question Answering: Building a RAG system that answers questions based on your docs

By the end, you'll have a system that can answer questions about any documentation you feed it. Let's start by collecting the data!


## Using Firecrawl to crawl and fetch documentation pages

First, we need to gather the documentation content. For public docs, we'll use web crawling, and Firecrawl is a tool designed specifically for this purpose.

To use Firecrawl's capabilities, you'll need an API key from https://www.firecrawl.dev/.

For storage, we'll use an S3 bucket to temporarily hold our crawled data. This gives us flexibility and makes it easy for other systems to access the content later.

In [None]:
import os
from google.colab import userdata

# your firecrawl api key to call firecrawl
os.environ['FIRECRAWL_API_KEY'] = userdata.get('FIRECRAWL_API_KEY')

# AWS envs to host data onto a bucket
os.environ['S3_AWS_KEY'] = userdata.get('S3_AWS_KEY')
os.environ['S3_AWS_SECRET'] = userdata.get('S3_AWS_SECRET')


We'll use the following helper functions to invoke Firecrawl with a given `url`. The crawler retrieves at most `limit` pages and follows links up to `maxDepth` levels deep.

The functions below handle:
- Configuring and authenticating with the Firecrawl API
- Starting a crawl of the given URL
- Polling for job completion
- Saving the crawled HTML files locally
- Uploading the results to a specified S3 bucket

Make sure you have already set your `FIRECRAWL_API_KEY` and AWS credentials (`S3_AWS_KEY`, `S3_AWS_SECRET`) as environment variables.


In [None]:
import time
import tempfile
import boto3
import hashlib
from typing import Dict, Any
from firecrawl import FirecrawlApp

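def prepare_firecrawl_config(api_key=None) -> Dict[str, str]:
    '''
    Resolve the Firecrawl API key from the argument or the FIRECRAWL_API_KEY env variable
    '''
    # .get() avoids a KeyError, so a missing key reaches the error branch below
    api_key = api_key or os.environ.get("FIRECRAWL_API_KEY")
    if not api_key:
        return {"error": "Firecrawl API key is required."}
    return {"api_key": api_key}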


def invoke_firecrawl_crawlhtml(url: str, s3_uri: str, limit: int = 100, maxDepth: int = 2, api_key: str = None,
                                timeout: int = 3600) -> Dict[str, Any]:
    """
    Start a web crawl job using Firecrawl to retrieve HTML content. We poll every 30s to check whether the job has completed.
    When done, the result data is stored as HTML files and uploaded into a directory in S3.
    Args:
      url: URL to crawl
      s3_uri: URI to store the HTML files (the results get stored under {s3_uri}/{job id generated by the crawl})
      limit: Maximum number of URLs to crawl in total
      maxDepth: Maximum depth to crawl, starting from the given URL
      api_key: Firecrawl API key (falls back to the FIRECRAWL_API_KEY env variable)
      timeout: Number of seconds to wait for the job before giving up

    Returns:
      Dict with job ID, job status, s3_uri of the job content, number of files saved, and upload stats
    """
    config = prepare_firecrawl_config(api_key)
    if "error" in config:
        return {"error": config["error"]}

    try:
        validated_s3_uri = ensure_valid_s3_uri(s3_uri)
    except ValueError as e:
        return {"error": str(e)}

    firecrawl = FirecrawlApp(api_key=config["api_key"])
    job = firecrawl.async_crawl_url(url, params={"limit": limit, "maxDepth": maxDepth, "scrapeOptions": {"formats": ["html"]}})
    job_id = job.get("id")

    start = time.time()
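    while True:
        status = firecrawl.check_crawl_status(job_id)
        current_status = status.get("status")
        if current_status == "completed":
            break
        # Stop early on a terminal failure instead of polling until the timeout.
        # The status names "failed" and "cancelled" are assumed from Firecrawl's crawl API.
        if current_status in ("failed", "cancelled"):
            return {"id": job_id, "status": current_status, "error": f"Job ended with status '{current_status}'."}
        if time.time() - start > timeout:
            return {"id": job_id, "status": "timeout", "error": "Job timed out."}
        print(f"Job status is {current_status} Sleeping 30s before polling again...")
        time.sleep(30)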

    with tempfile.TemporaryDirectory() as tmpdir:
        job_dir = os.path.join(tmpdir, job_id)
        os.makedirs(job_dir, exist_ok=True)
        file_count = process_crawlhtml_results(status, job_dir)
        upload_stats = upload_directory_to_s3(job_dir, validated_s3_uri + job_id + "/")

    return {
        "id": job_id,
        "status": "completed",
        "s3_uri": validated_s3_uri + job_id + "/",
        "file_count": file_count,
        **upload_stats
    }

def process_crawlhtml_results(result: Dict[str, Any], output_dir: str) -> int:
    """
    Takes the crawled results and saves them as .html files in the output directory
    Args:
      result: Result from the Crawl Job
      output_dir: Directory to store the .html files

    Returns:
      Number of files saved
    """
    count = 0
    for i, page in enumerate(result.get("data", [])):
        html = page.get("html")
        if not html:
            continue
        url = page.get("metadata", {}).get("url", f"page-{i}")
        filename = clean_url_to_filename(url)
        with open(os.path.join(output_dir, filename), "w", encoding="utf-8") as f:
            f.write(html)
        count += 1
    return count

def upload_directory_to_s3(local_dir: str, s3_uri: str) -> Dict[str, Any]:
    """Uploads a directory to the S3 path provided

    Args:
        local_dir: Local directory to upload
        s3_uri: S3 URI to upload to

    Returns:
        Dict with upload stats
    """
    bucket, prefix = s3_uri[5:].split('/', 1)
    s3_client = boto3.client(
        's3',
        aws_access_key_id=os.environ['S3_AWS_KEY'],
        aws_secret_access_key=os.environ['S3_AWS_SECRET']
    )
    stats = {"uploaded_files": 0, "failed_files": 0, "total_bytes": 0}
    for root, _, files in os.walk(local_dir):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_dir)
            s3_key = os.path.join(prefix, relative_path).replace("\\", "/")
            try:
                s3_client.upload_file(local_path, bucket, s3_key)
                stats["uploaded_files"] += 1
                stats["total_bytes"] += os.path.getsize(local_path)
                print(f"Uploaded {local_path} to s3://{bucket}/{s3_key}")
            except Exception as e:
                print(f"Error uploading {local_path}: {str(e)}")
                stats["failed_files"] += 1
    return stats


def ensure_valid_s3_uri(s3_uri: str) -> str:
    '''
    Validate and normalize the input S3 URI

    Args:
      s3_uri: input s3_uri

    Returns:
      The normalized s3_uri, with a trailing slash

    Raises:
      ValueError: if s3_uri does not start with 's3://'
    '''
    if not s3_uri.startswith("s3://"):
        raise ValueError("S3 URI must start with 's3://'")
    return s3_uri if s3_uri.endswith("/") else s3_uri + "/"

def clean_url_to_filename(url: str) -> str:
    '''
    Convert a URL to a valid filename.

    Args:
      url: input url

    Returns:
      URL converted into a valid filename
    '''
    filename = url.replace("https://", "").replace("http://", "")
    filename = filename.replace("/", "_").replace("?", "_").replace("&", "_").replace(":", "_")
    if len(filename) > 200:
        domain = filename.split('_')[0]
        filename_hash = hashlib.md5(url.encode()).hexdigest()
        return f"{domain}_{filename_hash}.html"
    return f"{filename}.html"
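
One caveat with `clean_url_to_filename` above: because `/`, `?`, `&`, and `:` all map to `_`, two different URLs can produce the same filename (for example `https://example.com/a/b` and `https://example.com/a_b`), and the later page would overwrite the earlier one on disk. A minimal sketch of one way to avoid this is to append a short hash of the full URL to every filename. The function name `url_to_unique_filename` and the `max_stem` parameter are hypothetical and not part of the helpers above:

```python
import hashlib
import re


def url_to_unique_filename(url: str, max_stem: int = 150) -> str:
    """Hypothetical alternative to clean_url_to_filename that avoids collisions."""
    # Drop the scheme, then replace anything that is not filename-safe with "_"
    stem = re.sub(r"^https?://", "", url)
    stem = re.sub(r"[^A-Za-z0-9._-]", "_", stem)[:max_stem]
    # A short digest of the *original* URL keeps distinct URLs distinct
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:10]
    return f"{stem}-{digest}.html"


print(url_to_unique_filename("https://example.com/a/b"))
print(url_to_unique_filename("https://example.com/a_b"))
```

The readable stem is kept so the files remain easy to recognize in S3, while the digest suffix does the work of keeping them unique.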

### Let's Start Crawling!
For this example, we'll crawl the documentation from the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) - a standard for providing context to AI models. This is just an example - you could use any documentation site that's relevant to your work.

We'll limit the crawl to 10 URLs and set `maxDepth` to 2.
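
To get a feel for what a depth limit means in practice, the sketch below counts the path segments of a URL, which for a path without a trailing slash equals the number of slashes in its pathname. This is only an illustration under that assumption: `path_depth` is a hypothetical helper, and Firecrawl's own depth calculation may differ in edge cases.

```python
from urllib.parse import urlparse


def path_depth(url: str) -> int:
    """Count the non-empty path segments of a URL (hypothetical helper)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


candidate_urls = [
    "https://modelcontextprotocol.io/",
    "https://modelcontextprotocol.io/clients",
    "https://modelcontextprotocol.io/quickstart/server",
    "https://modelcontextprotocol.io/docs/concepts/tools",
]

# Keep only the URLs whose path is at most two segments deep
within_limit = [u for u in candidate_urls if path_depth(u) <= 2]
for u in within_limit:
    print(path_depth(u), u)
```

Under this reading, a page such as `/quickstart/server` sits at depth 2 and is within the limit, while a three-segment path would be skipped.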

In [None]:
remote_s3_uri = "s3://<bucketname>/<directoryname>" # Provide your own S3 Uri
crawl_result = invoke_firecrawl_crawlhtml(
    url="https://modelcontextprotocol.io/",
    s3_uri=remote_s3_uri,
    limit=10, # max number of URLs to crawl
    maxDepth=2, # max crawl depth (per Firecrawl's documentation: the max number of slashes the pathname of a scraped URL may contain)
)


Job status is scraping Sleeping 30s before polling again...
Uploaded /tmp/tmp4lk7babp/193ce3ec-a289-4ba9-abda-63790faa7aad/modelcontextprotocol.io_development_roadmap.html to s3://ajay-uns-devrel-content/mcp-content1/crawl_results/193ce3ec-a289-4ba9-abda-63790faa7aad/modelcontextprotocol.io_development_roadmap.html
Uploaded /tmp/tmp4lk7babp/193ce3ec-a289-4ba9-abda-63790faa7aad/modelcontextprotocol.io_clients.html to s3://ajay-uns-devrel-content/mcp-content1/crawl_results/193ce3ec-a289-4ba9-abda-63790faa7aad/modelcontextprotocol.io_clients.html
Uploaded /tmp/tmp4lk7babp/193ce3ec-a289-4ba9-abda-63790faa7aad/modelcontextprotocol.io_quickstart_server.html to s3://ajay-uns-devrel-content/mcp-content1/crawl_results/193ce3ec-a289-4ba9-abda-63790faa7aad/modelcontextprotocol.io_quickstart_server.html
Uploaded /tmp/tmp4lk7babp/193ce3ec-a289-4ba9-abda-63790faa7aad/modelcontextprotocol.io_tutorials_building-mcp-with-llms.html to s3://ajay-uns-devrel-content/mcp-content1/crawl_results/193ce3ec-a289

Great! The raw documentation content is now stored in S3. Next, we need to transform the raw HTML docs into a format we can use for RAG. This is where Unstructured comes in.

## Using Unstructured to Transform Documentation into RAG-Ready Data

Raw HTML is messy - it's filled with navigation bars, footers, styling code, and other elements that aren't relevant to the actual documentation content. This "noise" makes it hard for LLMs to focus on what matters.

The Unstructured Platform specializes in turning messy documents into clean, structured data that's perfect for AI applications. It can:

* Extract the meaningful content from HTML documents
* Identify headers, paragraphs, code blocks, and other elements
* Split content into chunks that preserve semantic meaning
* Generate vector embeddings that capture the conceptual meaning

To use Unstructured Platform, you'll need an API key:

1. Go to [https://platform.unstructured.io](https://platform.unstructured.io) and use your email address, Google account, or GitHub account to sign up (if you don't already have an account) and sign in. The Unstructured user interface (UI) will appear.

2. Get your Unstructured API key:
   - In the Unstructured UI, click **API Keys** on the sidebar.
   - Click **Generate API Key**.
   - Follow the on-screen instructions to finish generating the key.
   - Click the **Copy** icon next to your new key to copy it to your clipboard.


In [None]:
os.environ['UNSTRUCTURED_API_KEY'] = userdata.get('UNSTRUCTURED_API_KEY')

In [None]:
# helper code to pretty print info

def pretty_print_model(response_model):
    print(response_model.model_dump_json(indent=4))

### Understanding Unstructured Workflows

Unstructured uses a workflow concept to process documents. A workflow consists of:

* Source Connectors: Where Unstructured gets the documents from (S3 in our case)
* Processing Nodes: How Unstructured transforms the documents
* Destination Connectors: Where Unstructured sends the processed results (MongoDB in our case)

Let's set these up one by one:


In [None]:
from unstructured_client import UnstructuredClient
from unstructured_client.models.operations import CreateSourceRequest
from unstructured_client.models.operations import CreateDestinationRequest

from unstructured_client.models.shared import (
    CreateSourceConnector,
    SourceConnectorType,
    S3SourceConnectorConfigInput
)
from unstructured_client.models.shared import (
    CreateDestinationConnector,
    DestinationConnectorType,
    MongoDBConnectorConfigInput
)

unstructured_client = UnstructuredClient(api_key_auth=os.environ['UNSTRUCTURED_API_KEY'])

### Setting up S3 Source Connector


The source connector for this notebook is AWS S3, authenticated with an AWS access key and secret. You need the AWS key, the AWS secret, and the S3 URI that contains the files; we reuse the values from the previous steps.

We also support 20 other source connectors (see the [full list](https://docs.unstructured.io/api-reference/workflow/sources/overview)).

In [None]:
source_connector_response = unstructured_client.sources.create_source(
    request=CreateSourceRequest(
        create_source_connector=CreateSourceConnector(
            name="fc_mongo_source",
            type=SourceConnectorType.S3,
            config=S3SourceConnectorConfigInput(
                key=os.environ['S3_AWS_KEY'],
                secret=os.environ['S3_AWS_SECRET'],
                remote_url=crawl_result['s3_uri'], # The s3 uri with the crawled results
                recursive=True #True/False - True enables processing every file in the directory
            )
        )
    )
)

pretty_print_model(source_connector_response.source_connector_information)

{
    "config": {
        "anonymous": false,
        "recursive": true,
        "remote_url": "s3://ajay-uns-devrel-content/mcp-content1/crawl_results/193ce3ec-a289-4ba9-abda-63790faa7aad/",
        "key": "**********",
        "secret": "**********"
    },
    "created_at": "2025-04-01T13:27:43.310017Z",
    "id": "5894cdff-9a3e-49da-ab49-b1d5a19df98b",
    "name": "fc_mongo_source",
    "type": "s3",
    "updated_at": "2025-04-01T13:27:43.937805Z"
}


### Setting up MongoDB Destination Connector

After processing, we'll store our enriched documentation in MongoDB. MongoDB Atlas is perfect for this use case because it supports vector search, which will let us quickly find relevant pieces of documentation.

Unstructured requires the following information to set up a MongoDB Destination Connector:

- `database name`
- `collection name`
- `uri`

Follow [this tutorial](https://docs.unstructured.io/api-reference/workflow/destinations/mongodb) to create your MongoDB cluster and get the connection details.

> **Note:**
> Make sure your IP address can access MongoDB.  

Also, make sure you've created a **search index** on the `embeddings` field to enable vector search. Name the index `vector_index`, since that is the index name the retrieval code later in this notebook queries. Use the following configuration, with the dimensionality set to 1536 to match the OpenAI `text-embedding-3-small` embeddings; the similarity search depends on it:

```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embeddings": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      }
    }
  }
}
```


In [None]:
os.environ['MONGO_DATABASE'] = userdata.get('MONGO_DATABASE')
os.environ['MONGO_COLLECTION'] = userdata.get('MONGO_COLLECTION')
os.environ['MONGO_URI'] = userdata.get('MONGO_URI')

In [None]:
dest_connector_response = unstructured_client.destinations.create_destination(
    request=CreateDestinationRequest(
        create_destination_connector=CreateDestinationConnector(
            name="fc_mongo_dest",
            type=DestinationConnectorType.MONGODB,
            config=MongoDBConnectorConfigInput(
                database=os.environ['MONGO_DATABASE'],
                collection=os.environ['MONGO_COLLECTION'],
                uri=os.environ['MONGO_URI']
            )
        )
    )
)

pretty_print_model(dest_connector_response.destination_connector_information)

{
    "config": {
        "collection": "demo2",
        "database": "demo2",
        "uri": "**********"
    },
    "created_at": "2025-04-01T13:27:45.544845Z",
    "id": "692e8385-6d45-4c62-90b9-10d8a8005b49",
    "name": "fc_mongo_dest",
    "type": "mongodb",
    "updated_at": "2025-04-01T13:27:45.787342Z"
}


### Creating a Document Processing Workflow

Now we'll define how our documents should be processed. We'll use three key data processing nodes/steps:

* **Partitioner** - Partitioning extracts content from raw unstructured files and outputs that content as structured document elements. We use a VLM Partitioner here, which uses a Vision Language Model (`claude-3-7-sonnet-20250219`) to extract elements from each page.

* **Chunker** - Chunking arranges the resulting document elements into manageable “chunks”, so that retrieval returns only the parts of a document that are relevant to a user’s query.

* **Embedder** - Embedding creates arrays of numbers known as vectors that represent the text extracted by Unstructured. These vectors are stored next to the text itself. They are generated by an embedding model that is provided by an embedding provider.


For more information on the arguments used, check out the [Concepts tab](https://docs.unstructured.io/ui/document-elements).
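
To build intuition for how a "soft" limit, a "hard" limit, and overlap interact during chunking, here is a toy sketch. It is not Unstructured's implementation: `toy_chunk` is a hypothetical function that works on plain strings, ignores titles and pages, and only borrows the parameter names `new_after_n_chars`, `max_characters`, and `overlap` for illustration.

```python
from typing import List


def toy_chunk(elements: List[str], new_after_n_chars: int = 1900,
              max_characters: int = 2048, overlap: int = 160) -> List[str]:
    """Toy chunker: greedy combining with a soft limit, splitting with a hard limit."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")
    chunks: List[str] = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # Oversized element: flush what we have, then split it into
            # windows of max_characters that share `overlap` characters
            if current:
                chunks.append(current)
                current = ""
            step = max_characters - overlap
            start = 0
            while start < len(text):
                chunks.append(text[start:start + max_characters])
                if start + max_characters >= len(text):
                    break
                start += step
            continue
        candidate = f"{current}\n\n{text}" if current else text
        if len(candidate) > max_characters:
            # Adding this element would break the hard limit: start a new chunk
            chunks.append(current)
            current = text
        else:
            current = candidate
        if len(current) >= new_after_n_chars:
            # Soft limit reached: close the chunk early
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks


demo = toy_chunk(["a" * 10, "b" * 10, "0123456789" * 5],
                 new_after_n_chars=25, max_characters=30, overlap=5)
print([len(chunk) for chunk in demo])
```

In the demo, the two short elements are combined into one chunk, and the 50-character element is split into two windows that share their boundary characters.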


In [None]:
from unstructured_client.models.shared import (
    WorkflowNode,
    WorkflowNodeType,
    WorkflowType,
    Schedule
)
from datetime import datetime


vlm_partitioner_workflow_node = WorkflowNode(
    name="Partitioner",
    subtype="vlm",
    type=WorkflowNodeType.PARTITION,
    settings={
       "provider": "anthropic",
       "model": "claude-3-7-sonnet-20250219", # Vision Language Model used for partitioning
       "output_format": "text/html",
       "user_prompt": None, # provide custom instructions or prompts to guide the partitioning process (String)
       "format_html": True, # formats documents in html, particularly helpful when you want to preserve the layout
       "unique_element_ids": True, # each extracted element has its own id
   }
)



# Chunk the partitioned content.
chunk_node = WorkflowNode(
    name="Chunker",
    subtype="chunk_by_title",
    type=WorkflowNodeType.CHUNK,
    settings={
        "multipage_sections": False, # elements that occur on different pages are placed in separate chunks
        "combine_text_under_n_chars": 0, # combine small sections into one chunk until it reaches this many characters (0 disables combining)
        "include_orig_elements": True, # keep the original elements in each chunk's metadata, which would otherwise be lost during consolidation
        "new_after_n_chars": 1900, # "soft" maximum chunk size: close a chunk once it reaches this character length
        "max_characters": 2048, # "hard" maximum: an element that exceeds this size is divided into two or more chunks using text-splitting
        "overlap": 160, # number of characters of overlapping text between chunks
        "overlap_all": False, # True applies overlap between all consecutive chunks; False applies it only to chunks formed from oversized elements
    }
)

# Generate vector embeddings.
embed_node = WorkflowNode(
    name="Embedder",
    subtype="azure_openai",
    type=WorkflowNodeType.EMBED, # Embedding type node creates embedding
    settings={
        "model_name": "text-embedding-3-small" # max dimension size with MongoDB is 2048, so let's use a model with an embedding size of 1536
    }
)

unique_workflow_suffix = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

response = unstructured_client.workflows.create_workflow(
    request={
        "create_workflow": {
            "name": f"fc-s3-mongo-custom-workflow-{unique_workflow_suffix}",
            "source_id": source_connector_response.source_connector_information.id,
            "destination_id": dest_connector_response.destination_connector_information.id,
            "workflow_type": WorkflowType.CUSTOM,
            "workflow_nodes": [
                vlm_partitioner_workflow_node,
                chunk_node,
                embed_node
            ],
        }
    }
)

workflow_id = response.workflow_information.id
pretty_print_model(response.workflow_information)

{
    "created_at": "2025-04-01T13:28:02.998345Z",
    "destinations": [
        "692e8385-6d45-4c62-90b9-10d8a8005b49"
    ],
    "id": "664dfc70-7828-45f5-ba82-c7f558c72b77",
    "name": "fc-s3-mongo-custom-workflow-2025-04-01-13-28-02",
    "sources": [
        "5894cdff-9a3e-49da-ab49-b1d5a19df98b"
    ],
    "status": "active",
    "workflow_nodes": [
        {
            "name": "Partitioner",
            "subtype": "vlm",
            "type": "partition",
            "id": "f1fc40a5-f1af-4a5c-af3d-2a4f5536f56b",
            "settings": {
                "provider": "anthropic",
                "provider_api_key": null,
                "model": "claude-3-7-sonnet-20250219",
                "output_format": "text/html",
                "user_prompt": null,
                "format_html": true,
                "unique_element_ids": true,
                "is_dynamic": false,
                "allow_fast": true
            }
        },
        {
            "name": "Chunker",
         

## Running the workflow

Now that we've defined how we want to process our documentation, let's start the workflow and wait for it to complete:

In [None]:
job = unstructured_client.workflows.run_workflow(
    request={
        "workflow_id": workflow_id,
    }
)

pretty_print_model(job.job_information)

{
    "created_at": "2025-04-01T13:28:05.031504Z",
    "id": "956240a1-8050-4919-88a9-ca870040d059",
    "status": "SCHEDULED",
    "workflow_id": "664dfc70-7828-45f5-ba82-c7f558c72b77",
    "workflow_name": "fc-s3-mongo-custom-workflow-2025-04-01-13-28-02"
}


In [None]:
response = unstructured_client.jobs.list_jobs(
    request={
        "workflow_id": workflow_id
    }
)

last_job = response.response_list_jobs[0]
job_id = last_job.id
print(f"job_id: {job_id}")

job_id: 956240a1-8050-4919-88a9-ca870040d059


Now that we've created and started a job, we can poll Unstructured's `get_job` endpoint and check its status every 30 seconds until it finishes.

In [None]:
import time

def poll_job_status(job_id, poll_time=30):
    while True:
        response = unstructured_client.jobs.get_job(
            request={
                "job_id": job_id
            }
        )

        job = response.job_information

        if job.status == "SCHEDULED":
            print(f"Job is scheduled, polling again in {poll_time} seconds...")
            time.sleep(poll_time)
        elif job.status == "IN_PROGRESS":
            print(f"Job is in progress, polling again in {poll_time} seconds...")
            time.sleep(poll_time)
        elif job.status == "COMPLETED":
            print("Job is completed")
            break
        else:
            # Any other status (for example a failure) is also terminal;
            # report it instead of claiming the job completed
            print(f"Job ended with status: {job.status}")
            break

    return job

job = poll_job_status(job_id)
pretty_print_model(job)

Job is scheduled, polling again in 30 seconds...
Job is in progress, polling again in 30 seconds...
Job is completed
{
    "created_at": "2025-04-01T13:28:05.031504",
    "id": "956240a1-8050-4919-88a9-ca870040d059",
    "status": "COMPLETED",
    "workflow_id": "664dfc70-7828-45f5-ba82-c7f558c72b77",
    "workflow_name": "fc-s3-mongo-custom-workflow-2025-04-01-13-28-02",
    "runtime": "PT0S"
}
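
The polling loop above keeps going for as long as the job reports `SCHEDULED` or `IN_PROGRESS`. If you would rather give up after a fixed amount of time, the same pattern can be written with a deadline. The sketch below is a generic, hypothetical helper (`poll_until_terminal` is not part of the Unstructured client); it takes any zero-argument function that returns the current status string, so it is shown here with a fake status source rather than a real API call.

```python
import time
from typing import Callable, Iterable


def poll_until_terminal(get_status: Callable[[], str],
                        in_flight: Iterable[str] = ("SCHEDULED", "IN_PROGRESS"),
                        poll_time: float = 30.0,
                        timeout: float = 3600.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it leaves the in-flight states or the timeout expires."""
    in_flight_states = set(in_flight)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status not in in_flight_states:
            return status  # terminal state: the caller decides whether it means success
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout} seconds")
        sleep(poll_time)


# Demo with a fake status source and a no-op sleep, so it finishes instantly
fake_statuses = iter(["SCHEDULED", "IN_PROGRESS", "COMPLETED"])
final_status = poll_until_terminal(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

With the real client, `get_status` could be a small lambda that calls `unstructured_client.jobs.get_job(...)` and returns the `status` of `job_information`, as the cell above does.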


Fantastic! At this point, our documentation has been:

* Crawled from the web using Firecrawl
* Processed and structured using Unstructured
* Stored as vector-embedded chunks in MongoDB

Now comes the exciting part - actually using this processed documentation to answer questions!

Let's build RAG!

## RAG 🤖


Now we'll create a simple RAG (Retrieval-Augmented Generation) system that can answer questions about our documentation.

RAG combines two key steps:
1. **Retrieval** – We use MongoDB Atlas Vector Search to find the most relevant documents based on a user's query.
2. **Generation** – We feed those documents into an OpenAI model to generate a grounded, context-aware answer.

This approach helps LLMs go beyond their training data by anchoring responses in real, up-to-date content — including internal or domain-specific docs you control.
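
Conceptually, the retrieval step ranks stored chunks by how similar their embedding is to the query embedding. The sketch below shows that idea with a brute-force cosine similarity over a few made-up two-dimensional vectors. It is an illustration only: the helper names are hypothetical, the vectors are invented, and Atlas Vector Search relies on an index rather than scanning every document like this. The `text` and `embeddings` field names mirror the ones used in this notebook.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: Sequence[float], docs: List[Dict], k: int = 2) -> List[Tuple[float, str]]:
    """Score every document against the query and return the k best (score, text) pairs."""
    scored = [(cosine_similarity(query_vector, doc["embeddings"]), doc["text"]) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented toy data; real embeddings from text-embedding-3-small have 1536 dimensions
toy_docs = [
    {"text": "doc about servers", "embeddings": [1.0, 0.0]},
    {"text": "doc about clients", "embeddings": [0.0, 1.0]},
    {"text": "doc about both", "embeddings": [1.0, 1.0]},
]

results = top_k([1.0, 0.0], toy_docs, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```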

In [None]:
from IPython.display import Markdown, display
from openai import OpenAI


To use OpenAI models for embedding and generation, you'll need to add your `OPENAI_API_KEY` to secrets.

In [None]:
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

OPENAI_CLIENT = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
EMBEDDING_MODEL = "text-embedding-3-small"
GENERATION_MODEL = "gpt-4o-2024-11-20"

Now let's connect to our MongoDB collection:

In [None]:
import os
from pymongo import MongoClient
from typing import List

# MongoDB setup
def get_collection(collection_name: str, database_name: str):
    """
    Establish connection to MongoDB Atlas and return the specified collection.
    Args:
        collection_name (str): Name of the collection to retrieve
        database_name (str): Name of the database
    Returns:
        Collection object from MongoDB
    """

    mongo_uri = os.environ["MONGO_URI"]
    client = MongoClient(mongo_uri)
    database = client[database_name]
    collection = database[collection_name]

    print(f"Connected to MongoDB Collection: {collection_name}")
    return collection




In [None]:
# Assuming you have these set in your environment
COLLECTION = get_collection(
    os.environ["MONGO_COLLECTION"],
    os.environ["MONGO_DATABASE"]
)

Connected to MongoDB Collection: demo2



We're building a lightweight RAG (Retrieval-Augmented Generation) setup using OpenAI and MongoDB Atlas Vector Search.

Here’s how it works:

1. **`get_embedding()`** takes any text and generates a vector using OpenAI’s embedding model.

2. **`simple_retriever()`** uses that vector to query MongoDB using `$vectorSearch`, returning the top-N most relevant documents based on similarity.

3. **`build_debugging_prompt()`** takes the retrieved docs and the user’s question, and constructs a prompt that grounds the LLM’s response in real context.

4. **`vanilla_rag()`** ties it all together — it fetches the relevant docs, builds the prompt, and calls OpenAI to generate a context-aware answer.


In [None]:
# Embedding function
def get_embedding(text: str):
    """
    Generate embedding for given text using OpenAI's embedding model.
    """
    return OPENAI_CLIENT.embeddings.create(
        input=text, model=EMBEDDING_MODEL
    ).data[0].embedding

# Simple retrieval using the $vectorSearch stage (requires MongoDB Atlas Vector Search)
def simple_retriever(query: str, n=10):
    """Retrieve top-N most relevant documents from MongoDB vector index"""
    query_embedding = get_embedding(query)

    results = COLLECTION.aggregate([
        {
            "$vectorSearch": {
                "queryVector": query_embedding,
                "path": "embeddings",        # This must match the field in your documents
                "numCandidates": 50,        # Increase if needed for better recall
                "limit": n,
                "index": "vector_index"      # Use the name you gave your search index
            }
        }
    ])

    docs = [doc["text"] for doc in results]
    return "\n".join(
        [f"\n\n===== Document {i+1} =====\n{doc}" for i, doc in enumerate(docs)]
    )


# Prompt builder
def build_debugging_prompt(user_query: str, retrieved_docs: str) -> str:
    """Create a prompt for the LLM grounded in retrieved context"""
    prompt = f"""
---

### Retrieved Documentation:
{retrieved_docs}

Analyse the retrieved docs, and then take the user's question, retrieved documents and suggest an answer based on the context provided. You can use your understanding of the retrieved documents to build on top and answer the user's question.

Respond in clear **Markdown**. Use code blocks where relevant. Make sure the code is syntactically accurate and the content very relevant.

---

### User Question:
{user_query}
"""
    return prompt.strip()

# Main RAG function
def vanilla_rag(question: str) -> tuple[str, str]:
    """
    Generate a structured, grounded answer using retrieved documents and user query.
    Returns a tuple of (answer, retrieved_docs).
    """
    retrieved_docs = simple_retriever(question)
    final_prompt = build_debugging_prompt(question, retrieved_docs)

    response = OPENAI_CLIENT.chat.completions.create(
        model=GENERATION_MODEL,
        temperature=0,
        messages=[{"role": "user", "content": final_prompt}]
    )

    return response.choices[0].message.content, retrieved_docs


Putting everything together, you can start asking questions that are tailored to your needs.


In [None]:
query = "Explain MCP to me"
explanation, docs = vanilla_rag(query)

display(Markdown(explanation))

### What is MCP?

The **Model Context Protocol (MCP)** is an **open protocol** designed to standardize how applications provide context to **Large Language Models (LLMs)**. Think of MCP as a **USB-C port for AI applications**—it provides a universal way to connect LLMs to various data sources, tools, and workflows. This makes it easier to build AI-powered applications that can interact with external systems, databases, and tools in a consistent and flexible manner.

---

### Why MCP?

MCP addresses the challenges of integrating LLMs with external data and tools by providing:

1. **Standardized Integration**: MCP creates a common framework for connecting LLMs to data sources, tools, and workflows, reducing the complexity of custom integrations.
2. **Flexibility**: It allows developers to switch between LLM providers and vendors without significant changes to their applications.
3. **Security**: MCP emphasizes best practices for securing data within your infrastructure.
4. **Interoperability**: By joining the MCP ecosystem, applications can leverage a growing list of pre-built integrations and tools.

---

### How MCP Works

At its core, MCP follows a **client-server architecture**:

- **MCP Hosts**: Applications like IDEs, AI tools, or platforms (e.g., Claude Desktop) that want to access data or tools via MCP.
- **MCP Clients**: Protocol clients that establish 1:1 connections with MCP servers.
- **MCP Servers**: Lightweight programs that expose specific capabilities (e.g., accessing a database, running tools) through the MCP standard.
- **Local Data Sources**: Files, databases, or services on your computer that MCP servers can securely access.
- **Remote Services**: External systems (e.g., APIs) that MCP servers can connect to.

This architecture allows LLMs to interact with both local and remote resources in a secure and standardized way.

---

### Key Features of MCP

1. **Resources**: MCP servers can expose data (e.g., database schemas, files) as resources that LLMs can query or interact with.
2. **Tools**: MCP enables LLMs to perform actions (e.g., running SQL queries, searching the web) through tools provided by servers.
3. **Prompts**: Developers can create reusable prompt templates and workflows for LLMs.
4. **Sampling**: Servers can request completions or responses from LLMs, enabling dynamic interactions.
5. **Communication Mechanism**: MCP uses **Server-Sent Events (SSE)** for real-time communication between clients and servers.

---

### Example Use Cases

1. **Database Integration**: An MCP server can connect to a PostgreSQL database, expose table schemas as resources, and provide tools for running SQL queries.
2. **File System Access**: An MCP server can allow LLMs to securely access and analyze files on your local machine.
3. **Agent Workflows**: MCP supports building hierarchical agent systems, enabling complex workflows where agents interact with each other and external systems.

---

### Benefits of Using MCP

- **Interoperability**: Join a growing ecosystem of AI applications that work seamlessly together.
- **Customizability**: Build custom servers and clients tailored to your specific needs.
- **Ease of Use**: Pre-built SDKs (e.g., Python, TypeScript) and tools make it easy to implement MCP in your applications.
- **Scalability**: MCP supports both local-first workflows and remote connections, making it suitable for a wide range of use cases.

---

### Example MCP Servers

Here are some examples of MCP servers and their capabilities:

```json
{
  "mcpServers": {
    "memory": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-memory"]
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/files"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "<YOUR_TOKEN>"
      }
    }
  }
}
```

- **Memory Server**: Provides in-memory storage for testing and development.
- **File System Server**: Allows secure access to specific files on your local machine.
- **GitHub Server**: Interacts with GitHub repositories using a personal access token.

---

### Getting Started with MCP

1. **Choose Your Role**:
   - **Server Developer**: Build your own MCP server to expose data or tools.
   - **Client Developer**: Build a client that integrates with existing MCP servers.
   - **End User**: Use pre-built servers with tools like Claude Desktop.

2. **Use SDKs**:
   - MCP provides SDKs in **Python** and **TypeScript** to simplify development.

3. **Explore Examples**:
   - Check out the [MCP Servers Repository](https://github.com) for reference implementations.

4. **Test and Debug**:
   - Use tools like the **MCP Inspector** or **Apify MCP Tester** to test your servers and clients.

---

### Conclusion

MCP is a powerful protocol that simplifies the integration of LLMs with external systems and tools. By standardizing how context is provided to LLMs, MCP enables developers to build more robust, flexible, and interoperable AI applications. Whether you're building custom servers, creating clients, or using pre-built integrations, MCP provides the tools and framework to unlock the full potential of LLMs.

You can also ask highly specific questions:

In [None]:
query = "Can you explain the components of MCP in more detail to me like I'm an engineer with no idea of tech"
explanation, docs = vanilla_rag(query)

display(Markdown(explanation))

Certainly! Let me break down the **Model Context Protocol (MCP)** into simple, digestible components, using relatable analogies and examples. Think of MCP as a system that helps **AI models (like ChatGPT or Claude)** connect to different tools, data, and services in a standardized way. It’s like a universal adapter for AI applications.

---

### **Core Components of MCP**

1. **MCP Hosts**
   - **What it is:** These are the programs or applications that want to use AI models to access data or perform tasks.
   - **Analogy:** Imagine a smartphone that wants to connect to different devices like headphones, printers, or TVs. The smartphone is the "host."
   - **Example:** Claude Desktop App or an IDE (like VS Code) that uses MCP to fetch data or run tools.

2. **MCP Servers**
   - **What it is:** These are lightweight programs that provide specific capabilities or data to the host. They act as the "providers" of information or tools.
   - **Analogy:** Think of a vending machine that provides snacks. Each vending machine (server) offers a specific type of snack (capability).
   - **Example:** An MCP server could connect to a database, expose file systems, or provide access to APIs like GitHub.

3. **MCP Clients**
   - **What it is:** These are the middlemen that connect the host to the server. They maintain a 1:1 connection with the servers and help the host communicate with them.
   - **Analogy:** Imagine a translator who helps two people speaking different languages understand each other. The client is the translator.
   - **Example:** A client could be a tool like the **Apify MCP Tester**, which connects to servers and tests their functionality.

4. **Local Data Sources**
   - **What it is:** These are files, databases, or services on your computer that MCP servers can securely access.
   - **Analogy:** Think of your computer as a library, and the MCP server is a librarian who fetches the books (data) you need.
   - **Example:** An MCP server might access your local file system to retrieve documents or images.

5. **Remote Services**
   - **What it is:** These are external systems or APIs available over the internet that MCP servers can connect to.
   - **Analogy:** Imagine ordering food from a restaurant using a delivery app. The restaurant is the remote service, and the app is the MCP server connecting to it.
   - **Example:** An MCP server could connect to a weather API to fetch real-time weather data.

---

### **How MCP Works Together**

Let’s put it all together with an example:

- **Scenario:** You’re using an AI assistant in your IDE (like VS Code) to analyze a database and generate a report.
  1. The **MCP Host** (your IDE) wants to access the database.
  2. The **MCP Client** connects the host to an **MCP Server** that knows how to interact with the database.
  3. The **MCP Server** fetches the data from the **Local Data Source** (your database) and sends it back to the host.
  4. The AI assistant in the IDE uses this data to generate the report.

---

### **Key Features of MCP**

- **Standardization:** MCP provides a common way for AI models to interact with tools and data, like how USB-C works for connecting devices.
- **Flexibility:** You can switch between different AI models or tools without changing the underlying infrastructure.
- **Security:** MCP ensures that data stays within your infrastructure and is accessed securely.

---

### **Why MCP is Useful**

- **For Developers:** It simplifies the process of integrating AI models with tools and data. You don’t need to reinvent the wheel every time you want to connect an AI model to a new system.
- **For Businesses:** It allows seamless workflows and better utilization of AI capabilities, making it easier to build powerful applications.

---

### **Visualizing MCP**

Here’s a simple diagram to help you visualize the components:

```
[Host (e.g., IDE)] <---> [Client] <---> [Server] <---> [Data Source/Remote Service]
```

---

### **Example Use Case**

Let’s say you want to build an MCP server that connects to a PostgreSQL database. Here’s how you’d describe it:

1. **What it does:** The server connects to the database, exposes table schemas as resources, and allows running read-only SQL queries.
2. **How it helps:** It enables an AI assistant to analyze the database and provide insights without directly accessing the database itself.

---

### **Next Steps**

If you want to dive deeper, you can:
- Explore the **MCP SDKs** (e.g., Python or TypeScript) to start building your own servers or clients.
- Use tools like the **MCP Inspector** to test and debug your servers.
- Check out the **MCP Servers Repository** for examples and inspiration.

Let me know if you’d like more details or examples! 😊
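
Because `vanilla_rag` returns the retrieved documents alongside the answer, you can check which chunks an answer was grounded in. The sketch below assumes each retrieved document is a dict with a `text` field and an optional `score` field; that shape is an assumption (it depends on what `simple_retriever` returns), and `preview_sources` is a hypothetical helper, not part of any library.

```python
def preview_sources(docs: list[dict], max_chars: int = 80) -> list[str]:
    """Return one short, numbered preview line per retrieved document."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        # Collapse whitespace so multi-line chunks fit on a single line
        text = " ".join(str(doc.get("text", "")).split())
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        # "score" is an assumed, optional field; it is skipped when absent
        score = doc.get("score")
        prefix = f"[{i}]" if score is None else f"[{i}] score={score:.3f}"
        lines.append(f"{prefix} {text}")
    return lines


# Made-up sample data for illustration only
sample_docs = [
    {"text": "MCP follows a client-server architecture.", "score": 0.9132},
    {"text": "MCP servers expose resources, tools, and prompts."},
]
for line in preview_sources(sample_docs):
    print(line)
```

In the notebook, you would call `preview_sources(docs)` on the second value returned by `vanilla_rag`, adjusting the field names to match your stored documents.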

Congratulations! You've built a powerful documentation assistant using **Firecrawl**, **Unstructured**, and **MongoDB Atlas** that can:

- Programmatically crawl and extract content from the web
- Preprocess and structure unstructured data for LLM consumption
- Ingest and query that data using a vector-enabled database
- Generate grounded, context-rich responses using OpenAI's gpt-4o

**Good RAG starts with well-prepared data, and the [Unstructured Platform](https://unstructured.io/developers) simplifies this critical first step.** By enabling efficient ingestion, partitioning, and metadata enrichment of unstructured data, it gives your RAG pipeline a solid foundation to build on.

Now go forth and conquer your documentation!