# CloudflareVectorize
This notebook demonstrates how to set up and use Cloudflare's Vectorize wrapper for LangChain.

# Setup

In [1]:
import asyncio
import json
import os
import uuid
import warnings

from dotenv import load_dotenv
from langchain_community.document_loaders import WikipediaLoader
from langchain_cloudflare.embeddings import (
    CloudflareWorkersAIEmbeddings,
)
from langchain_cloudflare.vectorstores import (
    CloudflareVectorize,
)
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

warnings.filterwarnings("ignore")
load_dotenv(".env")

True

### API Tokens

This Python package is a wrapper around Cloudflare's REST API.  To interact with the API, you need to provide an API token with the appropriate privileges.

You can create and manage API tokens here:

https://dash.cloudflare.com/YOUR-ACCT-NUMBER/api-tokens

**Note:**
CloudflareVectorize depends on WorkersAI (if you want to use it for Embeddings), and D1 (if you are using it to store and retrieve raw values).

While you can create a single `api_token` with Edit privileges to all needed resources (WorkersAI, Vectorize & D1), you may want to follow the principle of "least privilege access" and create separate API tokens for each service.

**Note:** These service-specific tokens (if provided) take precedence over a global token.  You can provide them instead of a global token.


In [2]:
cf_acct_id = os.getenv("cf_acct_id")

# single token with WorkersAI, Vectorize & D1
api_token = os.getenv("api_token")

# OR, separate tokens with access to each service
cf_vectorize_token = os.getenv("cf_vectorize_token")
cf_d1_token = os.getenv("d1_api_token")

## Initialization

In [3]:
# name your vectorize index
vectorize_index_name = f"test-langchain-{uuid.uuid4().hex}"

### Embeddings

To store, search and retrieve your data semantically, you must first convert your raw values into embeddings.  Specify an embedding model that is available on WorkersAI:

[https://developers.cloudflare.com/workers-ai/models/](https://developers.cloudflare.com/workers-ai/models/)

In [4]:
MODEL_WORKERSAI = "@cf/baai/bge-large-en-v1.5"

In [5]:
cf_ai_token = os.getenv(
    "cf_ai_token"
)  # needed if you want to use workersAI for embeddings

embedder = CloudflareWorkersAIEmbeddings(
    account_id=cf_acct_id, api_token=cf_ai_token, model_name=MODEL_WORKERSAI
)

### Raw Values with D1

Vectorize only stores embeddings, metadata and namespaces. If you want to store and retrieve raw values, you must leverage Cloudflare's SQL Database D1.

You can create a database here and retrieve its id:

[https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1](https://dash.cloudflare.com/YOUR-ACCT-NUMBER/workers/d1)

In [6]:
# provide the id of your D1 Database
d1_database_id = "8ce9ce08-8961-475c-98fb-1ef0e6e4ca40"

### CloudflareVectorize Class

Now we can create the CloudflareVectorize instance.  Here we pass:

* The `embedding` instance from earlier
* The account ID
* Individual API tokens for Vectorize and D1 (a single global `api_token` for all services could be passed instead)
* The D1 database ID

In [7]:
cfVect = CloudflareVectorize(
    embedding=embedder,
    account_id=cf_acct_id,
    d1_api_token=cf_d1_token,  # (Optional if using global token)
    vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
    d1_database_id=d1_database_id,  # (Optional if not using D1)
)

### Cleanup
Before we get started, let's delete any `test-langchain*` indexes we have for this walkthrough.

In [8]:
# depending on your notebook environment you might need to include:
# import nest_asyncio
# nest_asyncio.apply()

arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]
arr_async_requests = [
    cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);

### Gotchas

A few "gotchas" are shown below for various missing token/parameter combinations.

D1 Database ID provided but no "global" `api_token` and no `d1_api_token`

In [None]:
try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token, # (Optional if using service-specific token)
        ai_api_token=cf_ai_token,  # (Optional if using global token)
        # d1_api_token=cf_d1_token,  # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))

No "global" `api_token` provided and either missing `ai_api_token` or `vectorize_api_token`

In [None]:
try:
    cfVect = CloudflareVectorize(
        embedding=embedder,
        account_id=cf_acct_id,
        # api_token=api_token, # (Optional if using service-specific token)
        # ai_api_token=cf_ai_token,  # (Optional if using global token)
        d1_api_token=cf_d1_token,  # (Optional if using global token)
        vectorize_api_token=cf_vectorize_token,  # (Optional if using global token)
        d1_database_id=d1_database_id,  # (Optional if not using D1)
    )
except Exception as e:
    print(str(e))
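
Both failure cases above come down to the same rule: a service uses its own token when one is given and falls back to the global `api_token` otherwise. The sketch below checks a configuration against that rule before the class is constructed. `missing_tokens` and the service labels (`"workers_ai"`, `"vectorize"`, `"d1"`) are hypothetical names chosen for this illustration; they are not part of `langchain_cloudflare`, and the check is only an approximation of whatever validation the library performs.

```python
from typing import List, Mapping, Optional, Sequence


def missing_tokens(
    global_token: Optional[str],
    service_tokens: Mapping[str, Optional[str]],
    required_services: Sequence[str],
) -> List[str]:
    """Return the required services that have no usable token.

    A service-specific token takes precedence; the global token is the
    fallback. Empty strings count as missing. (Hypothetical helper.)
    """
    return [
        service
        for service in required_services
        if not (service_tokens.get(service) or global_token)
    ]


# First case above: a D1 database is configured, but there is neither
# a global token nor a D1 token.
print(missing_tokens(None, {"vectorize": "v-token"}, ["vectorize", "d1"]))

# A single global token covers every service.
print(missing_tokens("global-token", {}, ["workers_ai", "vectorize", "d1"]))
```

Which services count as required depends on your setup: WorkersAI only if you use it for embeddings, and D1 only if you pass a `d1_database_id`.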

## Manage Vector Store

### Creating an Index

Let's start by creating an index, first deleting it if it already exists.  If the index doesn't exist, the delete call returns an error from Cloudflare telling us so.

In [None]:
%%capture

try:
    cfVect.delete_index(index_name=vectorize_index_name, wait=True)
except Exception as e:
    print(e)

In [None]:
r = cfVect.create_index(index_name=vectorize_index_name, wait=True)
print(r)

### Listing Indexes

Now we can list the indexes on our account.

In [None]:
indexes = cfVect.list_indexes()
indexes = [x for x in indexes if "test-langchain" in x.get("name")]
print(indexes)

### Get Index Info
We can also retrieve more granular information about a single index.

This call returns a `processedUpToMutation` which can be used to track the status of operations such as creating indexes, adding or deleting records.

In [None]:
r = cfVect.get_index_info(index_name=vectorize_index_name)
print(r)
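
The `wait=True` flag used throughout this notebook already handles waiting for you. To show the idea behind tracking `processedUpToMutation`, here is a minimal polling sketch. It assumes that the index info is a dictionary with a top-level `processedUpToMutation` key and that a write operation gave you a mutation id to compare against. Both are assumptions about the response shape, and `wait_until_processed` is a hypothetical helper, not a library function.

```python
import time
from typing import Any, Callable, Mapping


def wait_until_processed(
    get_info: Callable[[], Mapping[str, Any]],
    mutation_id: str,
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_info() until it reports mutation_id as processed.

    Returns True once processedUpToMutation equals mutation_id, or False
    when the timeout expires. If later writes are processed first, the
    ids may never be observed as equal, hence the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        info = get_info()
        if info.get("processedUpToMutation") == mutation_id:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


# Stand-in for the real call, so the sketch runs without network access.
fake_responses = iter(
    [{"processedUpToMutation": "m-1"}, {"processedUpToMutation": "m-2"}]
)
print(wait_until_processed(lambda: next(fake_responses), "m-2", poll_interval_s=0))
```

With a real index, `get_info` could be something like `lambda: cfVect.get_index_info(index_name=vectorize_index_name)`, provided the returned value has the shape assumed here.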

### Adding Metadata Indexes

It is common to assist retrieval by supplying metadata filters in queries.  In Vectorize, this is accomplished by first creating a "metadata index" on your Vectorize Index.  We will do so for our example by creating one on the `section` field in our documents.

**Reference:** [https://developers.cloudflare.com/vectorize/reference/metadata-filtering/](https://developers.cloudflare.com/vectorize/reference/metadata-filtering/)


In [None]:
r = cfVect.create_metadata_index(
    property_name="section",
    index_type="string",
    index_name=vectorize_index_name,
    wait=True,
)
print(r)

### Listing Metadata Indexes

In [None]:
r = cfVect.list_metadata_indexes(index_name=vectorize_index_name)
print(r)

### Adding Documents
For this example, we will use LangChain's Wikipedia loader to pull an article about Cloudflare.  We will store this in Vectorize and query its contents later.

In [13]:
docs = WikipediaLoader(query="Cloudflare", load_max_docs=2).load()

We will then create some simple chunks with metadata based on the chunk sections.

In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([docs[0].page_content])

running_section = ""
for idx, text in enumerate(texts):
    if text.page_content.startswith("="):
        running_section = text.page_content
        running_section = running_section.replace("=", "").strip()
    else:
        if running_section == "":
            text.metadata = {"section": "Introduction"}
        else:
            text.metadata = {"section": running_section}

In [15]:
print(len(texts))
print(texts[0], "\n\n", texts[-1])

55
page_content='Cloudflare, Inc., is an American company that provides content delivery network services,' metadata={'section': 'Introduction'} 

 page_content='attacks, Cloudflare ended up being attacked as well; Google and other companies eventually' metadata={'section': 'DDoS mitigation'}


Now we will add documents to our Vectorize Index.

**Note:**
Adding embeddings to Vectorize happens asynchronously, meaning there will be a small delay between adding the embeddings and being able to query them.  By default `add_documents` has a `wait=True` parameter which waits for this operation to complete before returning a response.  If you do not want the program to wait for embeddings availability, you can set this to `wait=False`.


In [None]:
r = cfVect.add_documents(index_name=vectorize_index_name, documents=texts, wait=True)

In [None]:
print(json.dumps(r)[:300])

## Query vector store

We will do some searches on our embeddings.  We can specify our search `query` and the top number of results we want with `k`.


In [None]:
query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name, query="Workers AI", k=100, return_metadata="none"
)

print(f"{len(query_documents)} results:\n{query_documents[:3]}")

### Output

If you want to return metadata, you can pass `return_metadata="all"` or `return_metadata="indexed"`.  The default is `all`.

If you want to return the embeddings values, you can pass `return_values=True`.  The default is `False`.
Embeddings will be returned in the `metadata` field under the special `_values` field.

**Note:** `return_metadata="none"` and `return_values=True` will return only the `_values` field in `metadata`.

**Note:**
If you return metadata or values, the results will be limited to the top 20.

[https://developers.cloudflare.com/vectorize/platform/limits/](https://developers.cloudflare.com/vectorize/platform/limits/)
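
As a quick way to reason about how many results to expect, the note above can be written as a one-line rule. This is only a sketch: `expected_result_cap` is a hypothetical helper, the value 20 is taken from the note above rather than queried from Cloudflare, and the limits page linked above remains the authoritative source.

```python
# Cap taken from the note above; verify against the limits page.
MAX_TOP_K_WITH_DETAILS = 20


def expected_result_cap(
    k: int, return_metadata: str = "all", return_values: bool = False
) -> int:
    """Upper bound on results for a query, per the rule in the note.

    Asking for metadata ("all" or "indexed") or for embedding values
    caps the result count; otherwise k is used as given.
    """
    wants_details = return_values or return_metadata != "none"
    return min(k, MAX_TOP_K_WITH_DETAILS) if wants_details else k


print(expected_result_cap(100, return_metadata="none"))  # no metadata, no values
print(expected_result_cap(100, return_metadata="all"))   # metadata requested
```

This is why a query with `k=100` can come back with fewer results than a query that sets `return_metadata="none"`.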

In [None]:
query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name,
    query="Workers AI",
    return_values=True,
    return_metadata="all",
    k=100,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")

If you'd like the similarity `scores` to be returned, you can use `similarity_search_with_score`


In [None]:
query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="Workers AI",
    k=100,
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")

### Including D1 for "Raw Values"
All of the `add` and `search` methods on CloudflareVectorize support an `include_d1` parameter (default=True).

This is to configure whether you want to store/retrieve raw values.

If you do not want to use D1 for this, you can set this to `include_d1=False`.  This will return documents with an empty `page_content` field.

In [None]:
query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="california",
    k=100,
    return_metadata="all",
    include_d1=False,
)
print(f"{len(query_documents)} results:\n{str(query_documents[0])[:500]}")

### Searching with Metadata Filtering

As mentioned before, Vectorize supports filtered search on indexed metadata fields.  Here is an example where we search for `Introduction` values within the indexed `section` metadata field.

More info on searching on Metadata fields is here: [https://developers.cloudflare.com/vectorize/reference/metadata-filtering/](https://developers.cloudflare.com/vectorize/reference/metadata-filtering/)


In [None]:
query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="California",
    k=100,
    md_filter={"section": "Introduction"},
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")

You can do more sophisticated filtering as well:

https://developers.cloudflare.com/vectorize/reference/metadata-filtering/#valid-filter-examples

In [None]:
query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="California",
    k=100,
    md_filter={"section": {"$ne": "Introduction"}},
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents[:3])}")

In [None]:
query_documents = cfVect.similarity_search_with_score(
    index_name=vectorize_index_name,
    query="DNS",
    k=100,
    md_filter={"section": {"$in": ["Products", "History"]}},
    return_metadata="all",
)
print(f"{len(query_documents)} results:\n - {str(query_documents)}")
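
To make the filter shapes above concrete, here is a small local imitation of how plain equality, `$ne`, and `$in` select records. It runs against ordinary dictionaries, not against Vectorize. `matches` is a hypothetical helper, it covers only the three forms used in this notebook, and excluding records that lack the filtered field is an assumption of this sketch rather than documented Vectorize behavior.

```python
from typing import Any, Mapping


def matches(metadata: Mapping[str, Any], md_filter: Mapping[str, Any]) -> bool:
    """Return True if metadata satisfies every condition in md_filter."""
    for field, condition in md_filter.items():
        if field not in metadata:
            return False  # assumption: records without the field never match
        value = metadata[field]
        if isinstance(condition, Mapping):
            # Operator form, e.g. {"$ne": "Introduction"}
            for operator, operand in condition.items():
                if operator == "$ne":
                    satisfied = value != operand
                elif operator == "$in":
                    satisfied = value in operand
                else:
                    raise ValueError(f"operator not covered by this sketch: {operator}")
                if not satisfied:
                    return False
        elif value != condition:
            # Shorthand form, e.g. {"section": "Introduction"}
            return False
    return True


records = [{"section": "Introduction"}, {"section": "Products"}, {"section": "History"}]
for md_filter in (
    {"section": "Introduction"},
    {"section": {"$ne": "Introduction"}},
    {"section": {"$in": ["Products", "History"]}},
):
    print(md_filter, [r["section"] for r in records if matches(r, md_filter)])
```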

### Search by Namespace
We can also search for vectors by `namespace`.  We just need to assign each document a namespace through the `namespaces` array when adding it to our vector database.

https://developers.cloudflare.com/vectorize/reference/metadata-filtering/#namespace-versus-metadata-filtering

In [None]:
namespace_name = f"test-namespace-{uuid.uuid4().hex[:8]}"

new_documents = [
    Document(
        page_content="This is a new namespace specific document!",
        metadata={"section": "Namespace Test1"},
    ),
    Document(
        page_content="This is another namespace specific document!",
        metadata={"section": "Namespace Test2"},
    ),
]

r = cfVect.add_documents(
    index_name=vectorize_index_name,
    documents=new_documents,
    namespaces=[namespace_name] * len(new_documents),
    wait=True,
)

In [None]:
query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name,
    query="California",
    namespace=namespace_name,
)

print(f"{len(query_documents)} results:\n - {str(query_documents)}")

### Search by IDs
We can also retrieve specific records by their IDs.  To do so, we need to set the Vectorize index name on the instance's `index_name` attribute.

This will return both `_namespace` and `_values` as well as other `metadata`.


In [None]:
sample_ids = [x.id for x in query_documents]

In [None]:
cfVect.index_name = vectorize_index_name

In [None]:
query_documents = cfVect.get_by_ids(
    sample_ids,
)
print(str(query_documents[:3])[:500])

The namespace will be included in the `_namespace` field in `metadata` along with your other metadata (if you requested it in `return_metadata`).

**Note:** You cannot set the `_namespace` or `_values` fields in `metadata` as they are reserved.  They will be stripped out during the insert process.

### Upserts

Vectorize supports Upserts which you can perform by setting `upsert=True`.



In [None]:
query_documents[0].page_content = "Updated: " + query_documents[0].page_content
print(query_documents[0].page_content)

In [None]:
new_document_id = "12345678910"
new_document = Document(
    id=new_document_id,
    page_content="This is a new document!",
    metadata={"section": "Introduction"},
)

In [None]:
r = cfVect.add_documents(
    index_name=vectorize_index_name,
    documents=[new_document, query_documents[0]],
    upsert=True,
    wait=True,
)

In [None]:
query_documents_updated = cfVect.get_by_ids([new_document_id, query_documents[0].id])

In [None]:
print(str(query_documents_updated[0])[:500])
print(query_documents_updated[0].page_content)
print(query_documents_updated[1].page_content)

### Deleting Records
We can delete records by their IDs as well.


In [None]:
r = cfVect.delete(index_name=vectorize_index_name, ids=sample_ids, wait=True)
print(r)

And to confirm deletion

In [None]:
query_documents = cfVect.get_by_ids(sample_ids)
assert len(query_documents) == 0

### Creating from Documents
LangChain stipulates that all vectorstores must have a `from_documents` method to instantiate a new Vectorstore from documents.  This is a more streamlined method than the individual `create, add` steps shown above.

You can do that as shown here:

In [None]:
vectorize_index_name = "test-langchain-from-docs"

In [None]:
cfVect = CloudflareVectorize.from_documents(
    account_id=cf_acct_id,
    index_name=vectorize_index_name,
    documents=texts,
    embedding=embedder,
    d1_database_id=d1_database_id,
    d1_api_token=cf_d1_token,
    vectorize_api_token=cf_vectorize_token,
    wait=True,
)

In [None]:
# query for documents
query_documents = cfVect.similarity_search(
    index_name=vectorize_index_name,
    query="Edge Computing",
)

print(f"{len(query_documents)} results:\n{str(query_documents[0])[:300]}")

## Async Examples
This section shows async versions of several of the operations above.


### Creating Indexes

In [9]:
vectorize_index_name1 = "test-langchain1"
vectorize_index_name2 = "test-langchain2"
vectorize_index_name3 = "test-langchain3"

In [10]:
# depending on your notebook environment you might need to include these:
# import nest_asyncio
# nest_asyncio.apply()

async_requests = [
    cfVect.acreate_index(index_name=vectorize_index_name1),
    cfVect.acreate_index(index_name=vectorize_index_name2),
    cfVect.acreate_index(index_name=vectorize_index_name3),
]

await asyncio.gather(*async_requests);

### Creating Metadata Indexes

In [11]:
async_requests = [
    cfVect.acreate_metadata_index(
        property_name="section",
        index_type="string",
        index_name=vectorize_index_name1,
        wait=True,
    ),
    cfVect.acreate_metadata_index(
        property_name="section",
        index_type="string",
        index_name=vectorize_index_name2,
        wait=True,
    ),
    cfVect.acreate_metadata_index(
        property_name="section",
        index_type="string",
        index_name=vectorize_index_name3,
        wait=True,
    ),
]

await asyncio.gather(*async_requests);

### Adding Documents

In [16]:
async_requests = [
    cfVect.aadd_documents(index_name=vectorize_index_name1, documents=texts, wait=True),
    cfVect.aadd_documents(index_name=vectorize_index_name2, documents=texts, wait=True),
    cfVect.aadd_documents(index_name=vectorize_index_name3, documents=texts, wait=True),
]

await asyncio.gather(*async_requests);

### Querying/Search

In [17]:
async_requests = [
    cfVect.asimilarity_search(index_name=vectorize_index_name1, query="Workers AI"),
    cfVect.asimilarity_search(index_name=vectorize_index_name2, query="Edge Computing"),
    cfVect.asimilarity_search(index_name=vectorize_index_name3, query="SASE"),
]

async_results = await asyncio.gather(*async_requests);

In [18]:
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")

20 results:
page_content='In 2023, Cloudflare launched Workers AI, a framework allowing for use of Nvidia GPU's within'
20 results:
page_content='utilizing edge computing, reverse proxies for web traffic, data center interconnects, and a content'
20 results:
page_content='== Products =='
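
The prints above index `async_results` by position. That works because `asyncio.gather` returns its results in the order the awaitables were passed in, regardless of which one finishes first. The sketch below demonstrates this with a stand-in coroutine; `fake_search` and its delays are made up for the illustration and do not call Cloudflare.

```python
import asyncio
from typing import List


async def fake_search(index_name: str, delay_s: float) -> str:
    """Stand-in for an async search call; sleeps, then returns a label."""
    await asyncio.sleep(delay_s)
    return f"results from {index_name}"


async def search_all() -> List[str]:
    # The second call finishes first, yet the output keeps the input order.
    return await asyncio.gather(
        fake_search("test-langchain1", 0.03),
        fake_search("test-langchain2", 0.01),
        fake_search("test-langchain3", 0.02),
    )


if __name__ == "__main__":
    # In a notebook cell, use `await search_all()` instead of asyncio.run().
    print(asyncio.run(search_all()))
```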


### Returning Metadata/Values

In [19]:
async_requests = [
    cfVect.asimilarity_search(
        index_name=vectorize_index_name1,
        query="California",
        return_values=True,
        return_metadata="all",
    ),
    cfVect.asimilarity_search(
        index_name=vectorize_index_name2,
        query="California",
        return_values=True,
        return_metadata="all",
    ),
    cfVect.asimilarity_search(
        index_name=vectorize_index_name3,
        query="California",
        return_values=True,
        return_metadata="all",
    ),
]

async_results = await asyncio.gather(*async_requests);

In [20]:
print(f"{len(async_results[0])} results:\n{str(async_results[0][0])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")

20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.028919144, -0.019105384, -0.000850724, 0.012162158, 0.01853957, -0.028224546, -0.007677764, 0.0024373678, 0.041417565, -0.013160827, 0.03270
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.028919144, -0.019105384, -0.000850724, 0.012162158, 0.01853957, -0.028224546, -0.007677764, 0.0024373678, 0.041417565, -0.013160827, 0.03270
20 results:
page_content='and other services. Cloudflare's headquarters are in San Francisco, California. According to' metadata={'section': 'Introduction', '_values': [-0.028919144, -0.019105384, -0.000850724, 0.012162158, 0.01853957, -0.028224546, -0.007677764, 0.0024373678, 0.041417565, -0.013160827, 0.03270


### Searching with Metadata Filtering

In [21]:
async_requests = [
    cfVect.asimilarity_search(
        index_name=vectorize_index_name1,
        query="Cloudflare services",
        k=2,
        md_filter={"section": "Products"},
        return_metadata="all",
        # return_values=True
    ),
    cfVect.asimilarity_search(
        index_name=vectorize_index_name2,
        query="Cloudflare services",
        k=2,
        md_filter={"section": "Products"},
        return_metadata="all",
        # return_values=True
    ),
    cfVect.asimilarity_search(
        index_name=vectorize_index_name3,
        query="Cloudflare services",
        k=2,
        md_filter={"section": "Products"},
        return_metadata="all",
        # return_values=True
    ),
]

async_results = await asyncio.gather(*async_requests);

In [22]:
[doc.metadata["section"] == "Products" for doc in async_results[0]]

[True, True]

In [23]:
print(f"{len(async_results[0])} results:\n{str(async_results[0][-1])[:300]}")
print(f"{len(async_results[1])} results:\n{str(async_results[1][0])[:300]}")
print(f"{len(async_results[2])} results:\n{str(async_results[2][0])[:300]}")

2 results:
page_content='Cloudflare also provides analysis and reports on large-scale outages, including Verizon’s October' metadata={'section': 'Products'}
2 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}
2 results:
page_content='Cloudflare provides network and security products for consumers and businesses, utilizing edge' metadata={'section': 'Products'}


## Cleanup
Let's finish by deleting all of the indexes we created in this notebook.

In [24]:
arr_indexes = cfVect.list_indexes()
arr_indexes = [x for x in arr_indexes if "test-langchain" in x.get("name")]

In [25]:
arr_async_requests = [
    cfVect.adelete_index(index_name=x.get("name")) for x in arr_indexes
]
await asyncio.gather(*arr_async_requests);

## API Reference


https://developers.cloudflare.com/api/resources/vectorize/

https://developers.cloudflare.com/vectorize/