This notebook shows how to use CrateDB as a vector store, based on its FLOAT_VECTOR data type and KNN_MATCH function. You will learn how to use them to build a retrieval-augmented generation (RAG) pipeline.

## What is CrateDB?
CrateDB is an open-source, distributed, and scalable SQL analytics database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is wire-compatible with PostgreSQL, based on Lucene, and inherits the shared-nothing distribution layer of Elasticsearch.

This example uses the Python client driver for CrateDB.

## Getting Started
CrateDB has supported storing vectors since version 5.5. You can use the fully managed CrateDB Cloud service, or run CrateDB yourself, for example using Docker.

`docker run --publish 4200:4200 --publish 5432:5432 --pull=always crate:latest -Cdiscovery.type=single-node`

Install the required Python packages and import the Python modules.

In [None]:
#!pip install -r requirements.txt

# Note: If you are running in an environment like Google Colab, please use the absolute path of the requirements:
#!pip install -r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt

In [2]:
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.llms import OpenAI
from langchain_openai import OpenAIEmbeddings
import pandas as pd
import sqlalchemy as sa
from sqlalchemy import create_engine
from sqlalchemy import text
import crate
import openai
import os
import requests
from pueblo.util.environ import getenvpass

## Create embeddings from dataset


In [3]:
getenvpass("OPENAI_API_KEY", prompt="OpenAI API key:")

OpenAI API key:········


In [4]:
url = 'https://media.githubusercontent.com/media/crate/cratedb-datasets/main/machine-learning/fulltext/twitter_support_microsoft.csv'
dataset = 'twitter_support.csv'

r = requests.get(url)
with open(dataset, 'wb') as f:
    f.write(r.content)

loader = CSVLoader(file_path=dataset, encoding="utf-8", csv_args={'delimiter': ','})
data = loader.load()
pages_text = [doc.page_content for doc in data]
print(pages_text[0])

tweet_id: 2301
author_id: 116231
inbound: True
created_at: Tue Oct 31 20:22:23 +0000 2017
text: @MicrosoftHelps Please get back to me immediately this is of the upmost importance
response_tweet_id: 2299
in_response_to_tweet_id: 2306
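
As the output above shows, each CSV row becomes one document whose text lists the columns as `name: value` lines. The following is a minimal sketch of that flattening using only the standard library. `row_to_text` is a hypothetical helper introduced for illustration, and it only approximates what `CSVLoader` does internally:

```python
import csv
import io

def row_to_text(row: dict) -> str:
    """Flatten one CSV row into 'column: value' lines (hypothetical helper)."""
    return "\n".join(f"{key}: {value}" for key, value in row.items())

# Tiny in-memory CSV standing in for the downloaded dataset.
sample_csv = io.StringIO(
    "tweet_id,author_id,inbound\n"
    "2301,116231,True\n"
)
rows = list(csv.DictReader(sample_csv))
sample_text = row_to_text(rows[0])
print(sample_text)
```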


In [5]:
embeddings = OpenAIEmbeddings(deployment='my-embedding-model', chunk_size=200)
pages_embeddings = embeddings.embed_documents(pages_text)
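
The `chunk_size=200` argument caps how many texts are sent to the embedding API in a single request. The batching idea itself can be sketched without calling any API. `batched` below is a hypothetical helper, not part of LangChain, and the generated strings stand in for `pages_text`:

```python
from typing import Iterator, List

def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder texts; in the notebook this would be `pages_text`.
texts = [f"document {i}" for i in range(450)]
batch_sizes = [len(batch) for batch in batched(texts, 200)]
print(batch_sizes)
```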

## Write data to CrateDB

The next step creates a dataframe that contains the text of the documents and their embeddings. The embeddings will be stored in CrateDB using the FLOAT_VECTOR type.

In [6]:
df = pd.DataFrame(list(zip(pages_text, pages_embeddings)), columns=['text', 'embedding'])

In [8]:
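# SQLAlchemy connection URL for a CrateDB instance on localhost (HTTP port 4200)
dbname = "crate://localhost:4200"
# The vector length (1536) must match the dimensionality of the embeddings
create_table = text("CREATE TABLE text_data (text TEXT, embedding FLOAT_VECTOR(1536))")
engine = create_engine(dbname, echo=False)

with engine.connect() as con:
    con.execute(create_table)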
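
A `FLOAT_VECTOR(1536)` column expects vectors of exactly that length, so it can help to verify the embeddings before writing them. The following is a minimal sketch, not part of the original notebook. `check_dimensions` is a hypothetical helper, and the short toy vectors stand in for `pages_embeddings`:

```python
from typing import List, Sequence

def check_dimensions(vectors: Sequence[Sequence[float]], expected: int) -> List[int]:
    """Return the indexes of vectors whose length differs from `expected`."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected]

# Toy data: the second vector has the wrong length.
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8]]
bad_rows = check_dimensions(toy_vectors, expected=3)
print(bad_rows)
```

In the notebook, the equivalent call would presumably be `check_dimensions(pages_embeddings, expected=1536)`, with an empty list indicating that every row fits the column definition.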

The text and embeddings are then written to the CrateDB table, where the embeddings are stored in the FLOAT_VECTOR column:

In [9]:
df.to_sql(name='text_data', con=engine, if_exists='append', index=False)
df.head(5)

Unnamed: 0,text,embedding
0,tweet_id: 2301\nauthor_id: 116231\ninbound: Tr...,"[-0.03669742581690742, -0.013565617003293586, ..."
1,tweet_id: 11879\nauthor_id: MicrosoftHelps\nin...,"[-0.015454164058839013, 0.0032340502581370404,..."
2,tweet_id: 11881\nauthor_id: MicrosoftHelps\nin...,"[-0.005936504790842901, 0.01942733669848252, 0..."
3,tweet_id: 11890\nauthor_id: 118332\ninbound: T...,"[-0.01177901347977142, 0.005725434705161641, -..."
4,tweet_id: 11912\nauthor_id: MicrosoftHelps\nin...,"[-0.022950152341946858, 0.004767860370434741, ..."


## Ask question
Let's define our question and create an embedding for it using the OpenAI embedding model:

In [None]:
my_question = "How to update shipping address on existing order in Microsoft Store?"
query_embedding = embeddings.embed_query(my_question)

## Find relevant context using similarity search

The `knn_match(search_vector, query_vector, k)` function in CrateDB performs an approximate k-nearest neighbors (KNN) search: it finds the k stored vectors that are most similar to a given query vector. The following query uses it to find the stored vectors most similar to our query vector:
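
To make the idea concrete, here is a small brute-force sketch of a k-nearest-neighbors search in plain Python. It is an exact search and assumes Euclidean distance as the similarity measure. CrateDB's own search is approximate and index-based, so this only illustrates the concept and is not how `knn_match` is implemented. The function names and toy vectors are made up for the example:

```python
import math
from typing import List, Sequence, Tuple

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(vectors: Sequence[Sequence[float]], query: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to `query`."""
    distances = [(i, euclidean(v, query)) for i, v in enumerate(vectors)]
    return sorted(distances, key=lambda pair: pair[1])[:k]

# Toy two-dimensional vectors standing in for stored embeddings.
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = knn(stored, query=[1.0, 1.0], k=2)
print([index for index, _ in nearest])
```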

In [None]:
knn_query = text("""SELECT text FROM text_data
            WHERE knn_match(embedding, {0}, 2)""".format(query_embedding))
documents = []

with engine.connect() as con:
    results = con.execute(knn_query)
    for record in results:
        documents.append(record[0])

print(documents)
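
Because the query above interpolates the vector into the SQL string with `str.format`, it is worth making sure the value really is a list of finite numbers first. The following sketch shows one way to do that. `vector_literal` and `build_knn_sql` are hypothetical helpers introduced here, and the table and column names are taken from the cells above:

```python
import math
from typing import Sequence

def vector_literal(vector: Sequence[float]) -> str:
    """Render a vector as a bracketed literal, rejecting non-numeric or non-finite values."""
    values = []
    for value in vector:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"not a number: {value!r}")
        if not math.isfinite(value):
            raise ValueError(f"not finite: {value!r}")
        values.append(repr(float(value)))
    return "[" + ", ".join(values) + "]"

def build_knn_sql(vector: Sequence[float], k: int) -> str:
    """Build the KNN query string for the `text_data` table used in this notebook."""
    return (
        "SELECT text FROM text_data "
        f"WHERE knn_match(embedding, {vector_literal(vector)}, {int(k)})"
    )

sql = build_knn_sql([0.25, -0.5, 1], k=2)
print(sql)
```

The resulting string can be wrapped in `text(...)` and executed as in the cell above.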


## Augment system prompt and query LLM

In [None]:
# Separate the retrieved documents with a divider line of their own
context = '\n---\n'.join(documents)

system_prompt = f"""
You are a customer support expert and get questions about Microsoft products and services.
To answer the question, use the information from the context. Remove new line characters from the answer.
If you don't find the relevant information there, say "I don't know".

Context:
{context}"""

chat_completion = openai.chat.completions.create(model="gpt-4",
                                               messages=[{"role": "system", "content": system_prompt},
                                                         {"role": "user", "content": my_question}])


In [None]:
chat_completion.choices[0].message.content
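
The augmentation step boils down to joining the retrieved documents into one context string and placing it in the system message. The following self-contained sketch uses a hypothetical `build_messages` helper, a shortened prompt, and placeholder documents, so it runs without a database or an API key:

```python
from typing import Dict, List

def build_messages(documents: List[str], question: str) -> List[Dict[str, str]]:
    """Assemble chat messages: retrieved context in the system role, question in the user role."""
    context = "\n---\n".join(documents)
    system_prompt = (
        "Answer using only the context below. "
        'If the answer is not there, say "I don\'t know".\n\n'
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

# Placeholder documents standing in for the KNN search results.
messages = build_messages(["first document", "second document"], "What is in the context?")
print(messages[0]["content"])
```

The returned list has the same shape as the `messages` argument passed to `openai.chat.completions.create` above.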