This notebook shows how to use CrateDB's vector store functionality, which is built on the `FLOAT_VECTOR` data type and the `KNN_MATCH` function. You will learn how to use it to build a retrieval-augmented generation (RAG) pipeline.

## What is CrateDB?

CrateDB is an open-source, distributed, and scalable SQL analytics database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is wire-compatible with PostgreSQL, based on Lucene, and inherits the shared-nothing distribution layer of Elasticsearch.

This example uses the Python client driver for CrateDB.

## Getting Started
CrateDB has supported storing vectors since version 5.5. You can use the fully managed CrateDB Cloud service, or run CrateDB yourself, for example with Docker.

```shell
docker run --publish 4200:4200 --publish 5432:5432 --pull=always crate:latest -Cdiscovery.type=single-node
```

## Setup

Install the required Python packages and import the Python modules.

In [None]:
#!pip install -r requirements.txt

In [None]:
import os

import openai
import pandas as pd
import sqlalchemy as sa

from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from pueblo.util.environ import getenvpass

### Configure database settings

In [None]:
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost/",
)

# For CrateDB Cloud, use:
# CONNECTION_STRING = os.environ.get(
#     "CRATEDB_CONNECTION_STRING",
#     "crate://username:password@hostname/?ssl=true",
# )

### Configure OpenAI

In [None]:
getenvpass("OPENAI_API_KEY", prompt="OpenAI API key:")

### Patches
These patches can be removed once they have been upstreamed.

In [None]:
# TODO: Bring this into the `crate-python` driver.
from cratedb_toolkit.sqlalchemy.patch import patch_inspector
patch_inspector()

## Create embeddings from dataset

In [None]:
loader = CSVLoader(file_path="./sample_data/twitter_support_microsoft.csv", encoding="utf-8", csv_args={'delimiter': ','})
data = loader.load()
pages_text = [doc.page_content for doc in data]
print(pages_text[0])

In [None]:
embeddings = OpenAIEmbeddings(deployment='my-embedding-model', chunk_size=1)
pages_embeddings = embeddings.embed_documents(pages_text)

## Write data to CrateDB

The next step creates a dataframe that contains the text of the documents and their embeddings. The embeddings will be stored in CrateDB using the `FLOAT_VECTOR` type.
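
Because a `FLOAT_VECTOR` column is declared with a fixed number of dimensions, it can help to check the embedding lengths before building the dataframe. The following is a minimal sketch, not part of the original pipeline: `check_dimensions` is a hypothetical helper, and the value 1536 is an assumption taken from the table definition used in this notebook.

```python
EXPECTED_DIM = 1536  # assumption: matches FLOAT_VECTOR(1536) in the table definition


def check_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Return the indexes of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Toy data standing in for `pages_embeddings`; the second vector is too short.
sample = [[0.0] * 1536, [0.0] * 1535, [0.0] * 1536]
bad = check_dimensions(sample)
print(bad)  # [1]
```

An empty result means every embedding has the declared length.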

In [None]:
df = pd.DataFrame(list(zip(pages_text, pages_embeddings)), columns=['text', 'embedding'])

In [None]:
engine = sa.create_engine(CONNECTION_STRING, echo=False)

create_table = sa.text("CREATE TABLE IF NOT EXISTS text_data (text TEXT, embedding FLOAT_VECTOR(1536))")
with engine.connect() as con:
    con.execute(create_table)

The text and embeddings are written to the CrateDB database using CrateDB's vector storage support:

In [None]:
df.to_sql(name='text_data', con=engine, if_exists='append', index=False)
df.head(5)

## Ask question
Let's define our question and create an embedding for it using the OpenAI embedding model:

In [None]:
my_question = "How to update shipping address on existing order in Microsoft Store?"
query_embedding = embeddings.embed_query(my_question)

## Find relevant context using similarity search

The `knn_match(search_vector, query_vector, k)` function in CrateDB performs an approximate k-nearest neighbors (KNN) search: it looks for the `k` stored vectors that are most similar to a given query vector. The following query uses it, with `k` set to 2, to retrieve the documents whose embeddings are closest to the embedding of our question:
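
To make the idea of a nearest-neighbor search concrete, here is a small pure-Python sketch. It is an illustration only: it performs an exact linear scan, whereas `knn_match` is approximate and index-based, and it assumes Euclidean distance as the similarity measure. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_exact(rows, query, k):
    """Return the texts of the k rows whose vectors are closest to query.

    rows is a list of (text, vector) pairs.
    """
    ranked = sorted(rows, key=lambda row: euclidean(row[1], query))
    return [text for text, _ in ranked[:k]]


# Toy two-dimensional "embeddings" standing in for real 1536-dimensional ones.
rows = [
    ("refund policy", [0.9, 0.1]),
    ("shipping address", [0.1, 0.9]),
    ("order tracking", [0.2, 0.8]),
]
nearest = knn_exact(rows, query=[0.0, 1.0], k=2)
print(nearest)  # ['shipping address', 'order tracking']
```

A linear scan like this compares the query against every row, which is why databases rely on approximate indexes for large collections.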

In [None]:
knn_query = sa.text(
    """SELECT text FROM text_data
       WHERE knn_match(embedding, {0}, 2)""".format(query_embedding)
)
documents = []

with engine.connect() as con:
    results = con.execute(knn_query)
    for record in results:
        documents.append(record[0])

print(documents)
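
The query above places the embedding into the SQL text with `str.format`. As a hedged sketch, the same interpolation can be wrapped in a small function that first checks that the vector contains only finite floats, so that nothing unexpected ends up in the statement. `knn_sql` is a hypothetical helper and not part of CrateDB or SQLAlchemy; the table and column names are the ones used in this notebook.

```python
import math


def knn_sql(vector, k, table="text_data", column="embedding"):
    """Build the KNN_MATCH statement text after validating the inputs."""
    if not isinstance(k, int) or k < 1:
        raise ValueError("k must be a positive integer")
    if not all(isinstance(x, float) and math.isfinite(x) for x in vector):
        raise ValueError("vector must contain only finite floats")
    # Same rendering as the notebook query: the Python list becomes an
    # array literal such as [0.1, 0.2].
    return "SELECT text FROM {} WHERE knn_match({}, {}, {})".format(
        table, column, list(vector), k
    )


statement = knn_sql([0.1, 0.2], 2)
print(statement)
```

The resulting string can be passed to `sa.text(...)` in the same way as the query above.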


## Augment system prompt and query LLM

In [None]:
# Separate the retrieved documents with a delimiter on its own line.
context = '\n---\n'.join(documents)

system_prompt = f"""
You are a customer support expert and get questions about Microsoft products and services.
To answer the question, use the information from the context. Remove newline characters from the answer.
If you don't find the relevant information there, say "I don't know".

Context:
{context}"""

chat_completion = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": my_question},
    ],
)


In [None]:
chat_completion.choices[0].message.content
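
The prompt augmentation step can also be written as a standalone function, which makes it easy to inspect the prompt without calling the LLM. This is a minimal sketch: `build_system_prompt` and `build_messages` are hypothetical helpers, and the `"\n---\n"` separator is an assumption chosen so that each delimiter sits on its own line.

```python
def build_system_prompt(documents, separator="\n---\n"):
    """Join the retrieved documents and embed them in the system prompt."""
    context = separator.join(documents)
    return (
        "You are a customer support expert and get questions about "
        "Microsoft products and services.\n"
        "To answer the question, use the information from the context. "
        "Remove newline characters from the answer.\n"
        "If you don't find the relevant information there, say \"I don't know\".\n"
        "\n"
        "Context:\n"
        f"{context}"
    )


def build_messages(documents, question):
    """Assemble the chat messages: augmented system prompt plus user question."""
    return [
        {"role": "system", "content": build_system_prompt(documents)},
        {"role": "user", "content": question},
    ]


# Toy documents standing in for the rows returned by the similarity search.
messages = build_messages(["first document", "second document"], "How do I ...?")
print(messages[0]["content"])
```

The returned list has the same shape as the `messages` argument passed to the chat completion call above.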