Retrieval-Augmented Generation (RAG) combines a retrieval system, which fetches relevant documents, with a generative model, so that the model can draw on external knowledge and give more accurate, better-informed responses. This notebook shows how to use CrateDB's vector store functionality to build a RAG pipeline.

## What is CrateDB?

CrateDB is an open-source, distributed, and scalable SQL analytics database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is wire-compatible with PostgreSQL, based on Lucene, and inherits the shared-nothing distribution layer of Elasticsearch.

This example uses the Python client driver for CrateDB and vector store support in LangChain.

## Getting Started
CrateDB has supported storing vectors since version 5.5. You can use the fully managed CrateDB Cloud service, or run CrateDB yourself, for example with Docker.

```shell
docker run --publish 4200:4200 --publish 5432:5432 --pull=always crate:latest -Cdiscovery.type=single-node
```
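
To confirm that the container is up and recent enough for vector support, a small check along the following lines may help. It is a sketch, not an official tool: it assumes CrateDB's HTTP root endpoint on port 4200 returns a JSON document with a `version.number` field, and the helper names `parse_version`, `supports_vectors`, and `fetch_version` are made up for this example.

```python
import json
import re
from urllib.request import urlopen

# First release with vector support, as stated in the text above.
MIN_VECTOR_VERSION = (5, 5)


def parse_version(number):
    """Turn '5.5.1' into (5, 5, 1); non-numeric suffixes are ignored."""
    parts = []
    for piece in number.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def supports_vectors(number):
    """Compare the major and minor version against the minimum."""
    return parse_version(number)[:2] >= MIN_VECTOR_VERSION


def fetch_version(base_url="http://localhost:4200/"):
    """Read the version from the HTTP root endpoint (assumed response layout)."""
    with urlopen(base_url, timeout=5) as response:
        info = json.load(response)
    return info["version"]["number"]
```

With the Docker container from above running, `supports_vectors(fetch_version())` should evaluate to `True` for any release from 5.5 onwards. If the server is not reachable, `fetch_version()` raises an `OSError`.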

## Setup

Install the required Python packages and import the Python modules.

In [1]:
!pip install -r requirements.txt


In [5]:
import os
import re
import warnings

import openai
import pandas as pd
import sqlalchemy as sa

from langchain.docstore.document import Document
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores import CrateDBVectorSearch
from langchain_openai import OpenAIEmbeddings

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

### Configure database settings

This notebook will connect to a CrateDB server instance running on localhost. You can start a sandbox instance on your workstation by running [CrateDB using Docker]. Alternatively, you can also connect to a cluster running on [CrateDB Cloud].

[CrateDB Cloud]: https://console.cratedb.cloud/
[CrateDB using Docker]: https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker

In [3]:
# Define the connection string to the running CrateDB instance.
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost/",
)

# Define the store collection to use for this notebook session.
COLLECTION_NAME = "customer_data"
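
When connecting to a remote cluster instead of localhost, the connection string also needs credentials and, typically, TLS. The helper below is a minimal sketch for assembling such a string. The function name `cratedb_url` and the hostname are placeholders, and the `ssl=true` query parameter reflects my understanding of the CrateDB SQLAlchemy dialect, so check it against the dialect's documentation before relying on it.

```python
from urllib.parse import quote


def cratedb_url(host, user="crate", password=None, port=4200, ssl=False):
    """Build a `crate://` SQLAlchemy URL; credentials are percent-encoded."""
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    url = f"crate://{auth}@{host}:{port}/"
    if ssl:
        # Assumption: the dialect enables TLS through this query parameter.
        url += "?ssl=true"
    return url


# Local sandbox, equivalent to the default used in this notebook.
print(cratedb_url("localhost"))

# Hypothetical remote cluster; host and password are placeholders.
print(cratedb_url("example.cratedb.net", user="admin", password="p@ss/word", ssl=True))
```

Percent-encoding the password matters because characters such as `@` or `/` would otherwise be read as part of the URL structure.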

### Configure OpenAI

This example requires an OpenAI API key. To get one, create an account on OpenAI's website and generate a new key in the API section.

In [4]:
from pueblo.util.environ import getenvpass

getenvpass("OPENAI_API_KEY", prompt="OpenAI API key:")

OpenAI API key:········
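
If `pueblo` is not available, a similar effect can be approximated with the standard library. The sketch below assumes that `getenvpass` reads the variable from the environment and prompts for it only when it is missing; `ensure_secret` is a hypothetical name, not part of any library.

```python
import os
from getpass import getpass


def ensure_secret(name, prompt):
    """Return the environment variable `name`, prompting only if it is unset."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed characters, so the key does not end up
        # in the notebook output.
        value = getpass(prompt)
        os.environ[name] = value
    return value
```

Calling `ensure_secret("OPENAI_API_KEY", "OpenAI API key: ")` then leaves the key in `os.environ`, where the OpenAI client looks for it by default.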


### Patches
These patches can be removed once they have been upstreamed.

In [6]:
from cratedb_toolkit.sqlalchemy.patch import patch_inspector
patch_inspector()

## Create embeddings from dataset

We use the `CSVLoader` class to load support tickets from Twitter. The next step initializes a vector search store in CrateDB using embeddings generated by an OpenAI model. This will create a table that stores the embeddings with the name of the collection. Make sure the collection name is unique and that you have permission to create a table.

In [7]:
loader = CSVLoader(file_path="./sample_data/twitter_support_microsoft.csv", encoding="utf-8", csv_args={'delimiter': ','})
data = loader.load()
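
To get a feel for what the loader produces, the following standard-library sketch mimics how `CSVLoader` lays out each row: one document per row, with every field on its own line as `column: value`. The sample rows and the `tweet_id` column are invented for illustration and are not taken from the actual dataset.

```python
import csv
import io

# Made-up sample data; the real file has its own columns and rows.
SAMPLE_CSV = """tweet_id,text,response_tweet_id
1,How do I reset my password?,2
3,"My order arrived late, who can help?",4
"""


def rows_to_page_content(csv_text):
    """Render each CSV row as 'column: value' lines, one string per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{key}: {value}" for key, value in row.items())
        for row in reader
    ]


pages = rows_to_page_content(SAMPLE_CSV)
print(pages[0])
# tweet_id: 1
# text: How do I reset my password?
# response_tweet_id: 2
```

Because the whole row ends up in a single text field, any later step that needs just one column has to pull it back out of that text.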

In [8]:
embeddings = OpenAIEmbeddings()

store = CrateDBVectorSearch.from_documents(
    embedding=embeddings,
    documents=data,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

## Ask a question
Let's define our question:

In [9]:
my_question = "How to update shipping address on existing order in Microsoft Store?"

## Find relevant context using similarity search

The similarity search uses Euclidean distance to find similar vectors and to compute the score:

In [10]:
docs_with_score = store.similarity_search_with_score(my_question)

# Extract the tweet text from each matching document and print it.
documents = []
pattern = r"text: (.+)\nresponse_tweet_id:"
for doc, score in docs_with_score:
    match = re.search(pattern, doc.page_content, re.DOTALL)
    if match:
        documents.append(match.group(1).strip())
        print(documents[-1])

@MicrosoftHelps Is there anyway to update the shipping address on an existing Microsoft Store order? I just recently moved.
@MicrosoftHelps Is there anyway to update the shipping address on an existing Microsoft Store order? I just recently moved.
@MicrosoftHelps Seems to be good.  Support responded by email saying that the order status won't change online, but the warehouse will ship to the new addr.
@118333 2/2 Store app or via Microsoft Store online?
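
For intuition about the ranking, here is a toy illustration of nearest-neighbour search by Euclidean distance in plain Python. The two-dimensional vectors are invented, and the sketch only shows the distance itself; how CrateDB converts a distance into the score it reports is not covered here.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, vectors, k=2):
    """Return the k (name, distance) pairs closest to the query, nearest first."""
    scored = [(name, euclidean_distance(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [4.0, 4.0]}
print(nearest([1.0, 0.0], toy_vectors))
```

A smaller distance means the stored text is closer in meaning to the question, which is why the two near-identical tweets about changing a shipping address come first in the results above.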


## Augment system prompt and query LLM

In the final step, we set up a chatbot scenario in which an OpenAI chat model (`gpt-3.5-turbo` in the code below) serves as a customer support assistant, using the retrieved documents as its knowledge base to answer questions about Microsoft products and services. If the answer to a question isn't in the provided documents, the model is instructed to respond with "I don't know."

In [11]:
# Join the retrieved tweets, separated by a line containing "---".
context = "\n---\n".join(documents)

system_prompt = f"""
You are a customer support expert and get questions about Microsoft products and services.
To answer the question, use the information from the context. Remove newline characters from the answer.
If you don't find the relevant information there, say "I don't know".

Context:
{context}"""

chat_completion = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": my_question},
    ],
)


In [12]:
chat_completion.choices[0].message.content

'To update the shipping address on an existing order in the Microsoft Store, you can either contact Microsoft Support via email or update the shipping address in the Microsoft Store app or Microsoft Store online. The support team will inform the warehouse of the new address for shipping.'
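
To reuse the prompt assembly for other questions, or to inspect it without spending API calls, it can be pulled into a small pure function. This is one possible sketch: `build_messages` is a hypothetical helper, and the prompt wording is a lightly edited version of the one used in this notebook.

```python
def build_messages(question, documents, separator="\n---\n"):
    """Assemble chat messages that ground the model in the retrieved documents."""
    context = separator.join(documents)
    system_prompt = (
        "You are a customer support expert and get questions about "
        "Microsoft products and services.\n"
        "To answer the question, use the information from the context. "
        "Remove newline characters from the answer.\n"
        'If you don\'t find the relevant information there, say "I don\'t know".\n'
        "\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


# Invented example documents, standing in for retrieved tweets.
messages = build_messages(
    "How do I change my shipping address?",
    ["First retrieved tweet.", "Second retrieved tweet."],
)
print(messages[0]["content"])
```

The returned list can be passed as the `messages` argument of `openai.chat.completions.create`, exactly as in the cell above.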