This notebook shows how to use CrateDB's vector store functionality, which is built on the `FLOAT_VECTOR` data type and the `KNN_MATCH` function. You will learn how to use it to create a retrieval-augmented generation (RAG) pipeline.

# What is CrateDB?
CrateDB is an open-source, distributed, and scalable SQL analytics database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is wire-compatible with PostgreSQL, based on Lucene, and inherits the shared-nothing distribution layer of Elasticsearch.

This example uses the Python client driver for CrateDB.

# Getting Started
CrateDB has supported storing vectors since version 5.5. You can use the fully managed CrateDB Cloud service, or install CrateDB yourself, for example using Docker.

`docker run --publish 4200:4200 --publish 5432:5432 --pull=always crate:latest -Cdiscovery.type=single-node`
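Because `FLOAT_VECTOR` requires CrateDB 5.5 or later, it can be worth checking the server version before continuing. The sketch below only covers the comparison. The helper names `parse_version` and `supports_float_vector` are hypothetical, and the version string is assumed to come from a query such as `SELECT version['number'] FROM sys.nodes`.

```python
import re

# First CrateDB release with vector support, as stated in the text above.
MIN_VECTOR_VERSION = (5, 5)


def parse_version(number: str) -> tuple:
    """Turn a version string such as '5.5.1' into a tuple of ints."""
    parts = re.findall(r"\d+", number)
    if len(parts) < 2:
        raise ValueError(f"Unrecognized version string: {number!r}")
    return tuple(int(p) for p in parts[:3])


def supports_float_vector(number: str) -> bool:
    """Return True if the given CrateDB version is 5.5 or newer."""
    return parse_version(number)[:2] >= MIN_VECTOR_VERSION


print(supports_float_vector("5.4.8"))   # False
print(supports_float_vector("5.10.2"))  # True
```

Comparing integer tuples rather than strings avoids ranking "5.10" below "5.5".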

Install the required Python packages and import the modules used in this notebook.

In [None]:
pip install langchain pypdf chromadb openai sentence_transformers sqlalchemy 'crate[sqlalchemy]' tiktoken

Collecting langchain
  Downloading langchain-0.0.352-py3-none-any.whl (794 kB)
Collecting pypdf
  Downloading pypdf-3.17.3-py3-none-any.whl (277 kB)
Collecting chromadb
  Downloading chromadb-0.4.21-py3-none-any.whl (508 kB)
Collecting openai
  Downloading openai-1.6.0-py3-none-any.whl (225 kB)
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)

In [None]:
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
import pandas as pd
import sqlalchemy as sa
from sqlalchemy import create_engine
from sqlalchemy import text
import crate
import openai
import os
import getpass

# Create embeddings from dataset


In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key:")
openai.api_key = os.environ["OPENAI_API_KEY"]

OpenAI API key:··········


In [None]:
loader = CSVLoader(file_path="./sample_data/twitter_support_microsoft.csv", encoding="utf-8", csv_args={'delimiter': ','})
data = loader.load()
pages_text = [doc.page_content for doc in data]
print(pages_text[0])

tweet_id: 2301
author_id: 116231
inbound: True
created_at: Tue Oct 31 20:22:23 +0000 2017
text: @MicrosoftHelps Please get back to me immediately this is of the upmost importance
response_tweet_id: 2299
in_response_to_tweet_id: 2306
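As the output above shows, each document's `page_content` is a newline-separated list of `column: value` pairs. If individual fields are needed later, a small helper can recover them. This is a minimal sketch: `parse_page_content` is a hypothetical helper, not part of LangChain, and it assumes that column names contain no spaces and that values rarely span several lines.

```python
def parse_page_content(page_content: str) -> dict:
    """Split 'column: value' lines into a dict of strings."""
    fields = {}
    last_key = None
    for line in page_content.split("\n"):
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and " " not in key:
            last_key = key
            fields[key] = value.strip()
        elif last_key is not None:
            # No usable "column:" prefix: treat the line as a continuation
            # of the previous value (for example a multi-line tweet).
            fields[last_key] += "\n" + line
    return fields


sample = (
    "tweet_id: 2301\n"
    "author_id: 116231\n"
    "inbound: True\n"
    "created_at: Tue Oct 31 20:22:23 +0000 2017\n"
    "text: @MicrosoftHelps Please get back to me immediately "
    "this is of the upmost importance\n"
    "response_tweet_id: 2299\n"
    "in_response_to_tweet_id: 2306"
)
fields = parse_page_content(sample)
print(fields["created_at"])  # Tue Oct 31 20:22:23 +0000 2017
```

Splitting on the first colon only keeps values that contain colons, such as timestamps and URLs, intact.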


In [None]:
embeddings = OpenAIEmbeddings(deployment='my-embedding-model', chunk_size=1)
pages_embeddings = embeddings.embed_documents(pages_text)

# Write data to CrateDB

The next step creates a dataframe that contains the text of the documents and their embeddings. The embeddings will be stored in CrateDB using the `FLOAT_VECTOR` type.

In [None]:
df = pd.DataFrame(list(zip(pages_text, pages_embeddings)), columns=['text', 'embedding'])

In [None]:
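from urllib.parse import quote_plus

host = getpass.getpass("Host:")
password = getpass.getpass("password:")
# Percent-encode the password so that special characters do not break the URL.
dbname = "crate://admin:{0}@{1}:4200?ssl=true".format(quote_plus(password), host)
# The vector length (1536) must match the length of the embeddings.
create_table = text("CREATE TABLE text_data (text TEXT, embedding FLOAT_VECTOR(1536))")
engine = create_engine(dbname, echo=False)

with engine.connect() as con:
    con.execute(create_table)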

Host:··········
password:··········
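The table declares `embedding` as `FLOAT_VECTOR(1536)`, so every embedding is expected to have exactly 1536 elements. A quick check before writing can surface a mismatch early, for example when a different embedding model is configured. This is a minimal sketch; `check_dimensions` is a hypothetical helper, and the toy vectors stand in for `pages_embeddings`.

```python
def check_dimensions(vectors, expected=1536):
    """Raise ValueError if any vector does not have `expected` elements.

    Returns the number of vectors checked.
    """
    for i, vec in enumerate(vectors):
        if len(vec) != expected:
            raise ValueError(
                f"Embedding {i} has {len(vec)} dimensions, expected {expected}"
            )
    return len(vectors)


# Toy data so the sketch runs on its own; in the notebook this would be
# check_dimensions(pages_embeddings).
toy_embeddings = [[0.0] * 1536, [0.1] * 1536]
print(check_dimensions(toy_embeddings))  # 2
```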


The text and embeddings are then written to the `text_data` table in CrateDB:

In [None]:
df.to_sql(name='text_data', con=engine, if_exists='append', index=False)
df.head(5)

                                                text                                          embedding
0  tweet_id: 2301\nauthor_id: 116231\ninbound: Tr...  [-0.037185399571588756, -0.01364005917049614, ...
1  tweet_id: 11879\nauthor_id: MicrosoftHelps\nin...  [-0.015454164058839018, 0.0032340502581370413,...
2  tweet_id: 11881\nauthor_id: MicrosoftHelps\nin...  [-0.005936504790842904, 0.01942733669848253, 0...
3  tweet_id: 11890\nauthor_id: 118332\ninbound: T...  [-0.011779013479771422, 0.005725434705161641, ...
4  tweet_id: 11912\nauthor_id: MicrosoftHelps\nin...  [-0.022950152341946847, 0.004767860370434739, ...


# Ask a question
Let's define our question and create its embedding using the OpenAI embedding model:

In [None]:
my_question = "How to update shipping address on existing order in Microsoft Store?"
query_embedding = embeddings.embed_query(my_question)

# Find relevant context using similarity search

The `knn_match(search_vector, query_vector, k)` function in CrateDB performs an approximate k-nearest neighbors (KNN) search: it finds the k vectors in the `search_vector` column that are most similar to `query_vector`. Note that `k` applies to each shard, so a query can return more than k rows in total unless it also uses `LIMIT`. The following query retrieves the stored documents whose vectors are most similar to our query vector:

In [None]:
knn_query = text("""SELECT text FROM text_data
            WHERE knn_match(embedding, {0}, 2)""".format(query_embedding))
documents=[]

with engine.connect() as con:
    results = con.execute(knn_query)
    for record in results:
        documents.append(record[0])

print(documents)


['tweet_id: 12858\nauthor_id: 118603\ninbound: True\ncreated_at: Mon Oct 30 18:33:00 +0000 2017\ntext: @MicrosoftHelps The store never gave me an error message.  It\'s STILL sitting there "thinking/working" after 2 hours.\nresponse_tweet_id: 12857,12859\nin_response_to_tweet_id: 12860', 'tweet_id: 12881\nauthor_id: 118606\ninbound: True\ncreated_at: Wed Nov 01 12:18:10 +0000 2017\ntext: @MicrosoftHelps okay. let me contact them\nresponse_tweet_id: \nin_response_to_tweet_id: 12879', "tweet_id: 12868\nauthor_id: MicrosoftHelps\ninbound: False\ncreated_at: Tue Oct 31 13:23:00 +0000 2017\ntext: @118604 1/2 We don't have direct email. You can post your query via Community Forum for assistance: https://t.co/jsa5yeYZ1T.\nresponse_tweet_id: \nin_response_to_tweet_id: 12867", "tweet_id: 11883\nauthor_id: MicrosoftHelps\ninbound: False\ncreated_at: Thu Oct 26 16:30:01 +0000 2017\ntext: @118331 Hi. That's strange. Let's make sure that all your driver was updated. Here's how: https://t.co/paTrSXK1
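The query above asks for `k = 2` but returns more than two documents, and the rows are not sorted by relevance. One way to get a ranked, bounded result is to order by CrateDB's `_score` column and add a `LIMIT`. The sketch below only builds the SQL string; `build_knn_query` is a hypothetical helper, and the table and column names are the ones used in this notebook.

```python
def build_knn_query(table, vector_column, query_vector, k, limit):
    """Build a knn_match query that returns the best `limit` rows, ranked."""
    # Table and column names cannot be bound as parameters, so restrict
    # them to plain identifiers before interpolating.
    for name in (table, vector_column):
        if not name.isidentifier():
            raise ValueError(f"Unsafe identifier: {name!r}")
    # Render the vector as an array literal, e.g. [0.5, -1.0].
    vector_literal = "[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    return (
        f"SELECT text, _score FROM {table} "
        f"WHERE knn_match({vector_column}, {vector_literal}, {int(k)}) "
        f"ORDER BY _score DESC LIMIT {int(limit)}"
    )


print(build_knn_query("text_data", "embedding", [0.5, -1.0], k=2, limit=2))
```

In the notebook, the result could be passed to `con.execute(text(...))` with `query_embedding` as the vector. Each returned row would then hold the document text and its score.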

# Augment system prompt and query LLM

In [None]:
# Separate the retrieved documents with a line containing '---'.
context = '\n---\n'.join(documents)

system_prompt = f"""
You are a customer support expert and get questions about Microsoft products and services.
To answer the question, use the information from the context. Remove new line characters from the answer.
If you don't find the relevant information there, say "I don't know".

Context:
{context}"""

chat_completion = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": my_question},
    ],
)


In [None]:
chat_completion.choices[0].message.content

'To update the shipping address on an existing order in the Microsoft Store, you will need to cancel your current order and place a new one so you can include your updated details.'
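The steps in this notebook (embed the question, retrieve similar documents, build the prompt, query the LLM) can be combined into one function. The following is a minimal sketch rather than a finished implementation. `embed`, `retrieve`, and `complete` are hypothetical callables that stand in for the OpenAI embedding call, the `knn_match` query, and the chat completion, so the example runs without network access.

```python
from typing import Callable, List, Sequence


def build_system_prompt(documents: Sequence[str]) -> str:
    """Join the retrieved documents into the system prompt used above."""
    context = "\n---\n".join(documents)
    return (
        "You are a customer support expert and get questions about "
        "Microsoft products and services.\n"
        "To answer the question, use the information from the context. "
        "Remove new line characters from the answer.\n"
        "If you don't find the relevant information there, "
        'say "I don\'t know".\n\n'
        f"Context:\n{context}"
    )


def answer_question(
    question: str,
    embed: Callable[[str], List[float]],
    retrieve: Callable[[List[float]], List[str]],
    complete: Callable[[str, str], str],
) -> str:
    """Run the retrieve-then-generate steps for a single question."""
    query_vector = embed(question)
    documents = retrieve(query_vector)
    return complete(build_system_prompt(documents), question)


# Stand-ins for the real services, for illustration only.
def fake_embed(question):
    return [0.0, 1.0]


def fake_retrieve(query_vector):
    return ["doc A", "doc B"]


def fake_complete(system_prompt, question):
    return "I don't know"


print(answer_question("How do I reset my password?",
                      fake_embed, fake_retrieve, fake_complete))
```

Passing the three steps in as callables keeps the pipeline logic separate from the services behind it, which makes it easier to swap the model or the query.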