# CrateDB

This notebook shows how to use CrateDB as a vector store, building on its
[`FLOAT_VECTOR`] data type and [`KNN_MATCH`] function. You will learn how to
run similarity searches, add documents, and use the store as a retriever.

It supports:
- Similarity Search with Euclidean Distance
- Maximal Marginal Relevance Search (MMR)

## What is CrateDB?

[CrateDB] is an open-source, distributed, and scalable SQL analytics database
for storing and analyzing massive amounts of data in near real-time, even with
complex queries. It is PostgreSQL-compatible, based on [Lucene], and inherits
the shared-nothing distribution layer of [Elasticsearch].

This example uses the [Python client driver for CrateDB].


[CrateDB]: https://github.com/crate/crate
[Elasticsearch]: https://github.com/elastic/elasticsearch
[`FLOAT_VECTOR`]: https://crate.io/docs/crate/reference/en/master/general/ddl/data-types.html#float-vector
[`KNN_MATCH`]: https://crate.io/docs/crate/reference/en/master/general/builtins/scalar-functions.html#scalar-knn-match
[Lucene]: https://github.com/apache/lucene
[Python client driver for CrateDB]: https://crate.io/docs/python/

## Getting Started

Install the required Python packages.

In [None]:
#!pip install -r requirements.txt

Provide an OpenAI API key, either through the environment variable
`OPENAI_API_KEY` or by defining it in a `.env` file.

```shell
export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY
```
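
As a minimal sketch of the lookup order described above, the hypothetical helper `resolve_api_key` below checks the process environment first and a simple `KEY=value` file second. The helper name and the simplified `.env` parsing are assumptions for illustration; the notebook itself relies on `getenvpass` from `pueblo`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(
    name: str = "OPENAI_API_KEY",
    environ: Mapping[str, str] = os.environ,
    dotenv_path: str = ".env",
) -> Optional[str]:
    """Return the key from the environment, else from a KEY=value file."""
    # 1. Prefer the process environment.
    value = environ.get(name)
    if value:
        return value

    # 2. Fall back to a simplified `.env` lookup (no quoting or escapes handled).
    path = Path(dotenv_path)
    if path.is_file():
        for line in path.read_text().splitlines():
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue
            key, _, candidate = line.partition("=")
            if key.strip() == name:
                return candidate.strip()

    # 3. Nothing found: the caller decides whether to prompt or fail.
    return None
```

Passing `environ` and `dotenv_path` explicitly keeps the helper easy to exercise without touching the real environment.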

In [None]:
from pueblo.util.environ import getenvpass

getenvpass("OPENAI_API_KEY", prompt="OpenAI API key:")

You can also provide a connection string to your CrateDB database cluster,
using the environment variable `CRATEDB_CONNECTION_STRING`.

By default, the notebook connects to a CrateDB server instance running on `localhost`.
You can start a sandbox instance on your workstation by running [CrateDB using Docker].
Alternatively, you can connect to a cluster running on [CrateDB Cloud].
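
To make the anatomy of such a connection string visible, the sketch below takes the two example URLs used in this notebook apart with Python's standard `urllib.parse`. The helper name `describe_connection_string` is hypothetical; it only inspects the URL and does not connect to any database.

```python
from urllib.parse import parse_qs, urlsplit


def describe_connection_string(url: str) -> dict:
    """Split a `crate://` URL into its parts, without connecting."""
    parts = urlsplit(url)
    # Query parameters such as `schema` or `ssl` arrive as lists; keep the first value.
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "options": options,
    }


# Self-managed instance on localhost, and a placeholder cloud cluster.
local = describe_connection_string("crate://crate@localhost/?schema=notebook")
cloud = describe_connection_string(
    "crate://username:password@hostname/?ssl=true&schema=notebook"
)
print(local)
print(cloud)
```

The `schema` option selects the database schema the notebook writes to, and `ssl=true` appears only in the cloud variant.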

[CrateDB Cloud]: https://console.cratedb.cloud/
[CrateDB using Docker]: https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker

In [6]:
import os

# Connect to a self-managed CrateDB instance.
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost/?schema=notebook",
)

# Connect to CrateDB Cloud.
# CONNECTION_STRING = os.environ.get(
#     "CRATEDB_CONNECTION_STRING",
#     "crate://username:password@hostname/?ssl=true&schema=notebook",
# )

# Define the store collection to use for this notebook session.
COLLECTION_NAME = "state_of_the_union_test"

In [None]:
_ = """
# Alternatively, the connection string can be assembled from individual
# environment variables.
import os

CONNECTION_STRING = CrateDBVectorSearch.connection_string_from_db_params(
    driver=os.environ.get("CRATEDB_DRIVER", "crate"),
    host=os.environ.get("CRATEDB_HOST", "localhost"),
    port=int(os.environ.get("CRATEDB_PORT", "4200")),
    database=os.environ.get("CRATEDB_DATABASE", "langchain"),
    user=os.environ.get("CRATEDB_USER", "crate"),
    password=os.environ.get("CRATEDB_PASSWORD", ""),
)
"""

Start by importing the required `langchain` modules.

In [1]:
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import CrateDBVectorSearch

Next, read the input data and split it into chunks. There is no need to
download it over and over again, so the response is stored in a local
filesystem cache.
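
The `chunk_size` and `chunk_overlap` parameters in the next cell control how the text is cut into pieces before embedding. As a rough illustration only, the hypothetical `split_fixed` function below cuts on raw character counts; the actual splitter behind `langchain_documents` may prefer separator boundaries, so its output can differ.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as used in this notebook, every character ends up in exactly one chunk.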

In [2]:
from pueblo.nlp.resource import CachedWebResource

url = "https://github.com/langchain-ai/langchain/raw/v0.0.325/docs/docs/modules/state_of_the_union.txt"
docs = CachedWebResource(url).langchain_documents(chunk_size=1000, chunk_overlap=0)

## Similarity Search with Euclidean Distance (Default)

The module will create a table with the name of the collection. Make sure
the collection name is unique and that you have the permission to create
a table.
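
For intuition about the ranking in this section, the sketch below orders a few made-up three-dimensional vectors by Euclidean distance to a query vector; a smaller distance means a closer match. The vectors, labels, and function names are invented for illustration. Real embeddings have far more dimensions, and the score reported by the store may be scaled differently from the raw distance.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the `k` candidates closest to `query`, nearest first."""
    scored = [(label, euclidean_distance(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Toy data: "doc-c" points almost the same way as "doc-a".
toy_vectors = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}
nearest = rank_by_distance([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```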

In [None]:
embeddings = OpenAIEmbeddings()

store = CrateDBVectorSearch.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = store.similarity_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

## Maximal Marginal Relevance Search (MMR)

Maximal marginal relevance optimizes for two goals at once: similarity to the
query and diversity among the selected documents.
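
The trade-off can be sketched as a greedy selection loop. The example below is a simplified, self-contained illustration on toy two-dimensional vectors, not the implementation used by the vector store; the function names, the `lambda_mult` weight, and the use of cosine similarity are assumptions made for this sketch.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick up to `k` candidate indices.

    `lambda_mult=1.0` ranks by relevance only; lower values penalize
    candidates that resemble an already selected one.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_index, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine_similarity(query, candidates[i])
            # Similarity to the closest already selected candidate.
            redundancy = max(
                (cosine_similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)
    return selected


toy_candidates = [
    [1.0, 0.0],   # 0: same direction as the query
    [0.99, 0.1],  # 1: near-duplicate of candidate 0
    [0.6, 0.8],   # 2: less relevant, but different from candidate 0
]
toy_query = [1.0, 0.0]
diverse = mmr_select(toy_query, toy_candidates, k=2, lambda_mult=0.3)
relevance_only = mmr_select(toy_query, toy_candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

On this toy data, the diversity-weighted run skips the near-duplicate and returns `[0, 2]`, while the relevance-only run returns `[0, 1]`.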

In [None]:
docs_with_score = store.max_marginal_relevance_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

## Working with the vector store

In the example above, you created a vector store from scratch. To work
with an existing vector store, initialize it directly.

In [None]:
store = CrateDBVectorSearch(
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
)

### Add documents

You can also add documents to an existing vector store.

In [None]:
store.add_documents([Document(page_content="foo")])

In [None]:
docs_with_score = store.similarity_search_with_score("foo")

In [None]:
docs_with_score[0]

In [None]:
docs_with_score[1]

### Overwriting a vector store

If you have an existing collection, you can overwrite it by calling
`from_documents` with `pre_delete_collection=True`.

In [None]:
store = CrateDBVectorSearch.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True,
)

In [None]:
docs_with_score = store.similarity_search_with_score("foo")

In [None]:
docs_with_score[0]

### Using a vector store as a retriever

A retriever exposes the vector store through LangChain's generic retriever
interface, so it can be used wherever a chain expects one.

In [None]:
retriever = store.as_retriever()

In [None]:
print(retriever)