Retrieval-Augmented Generation (RAG) combines a retrieval system, which fetches relevant documents, with a generative model, so that the model can incorporate external knowledge and produce more accurate, better-informed responses. This notebook shows how to use the CrateDB vector store to build a RAG pipeline.

## What is CrateDB?

CrateDB is an open-source, distributed, and scalable SQL analytics database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is wire-compatible with PostgreSQL, based on Lucene, and inherits the shared-nothing distribution layer of Elasticsearch.

This example uses the Python client driver for CrateDB and the vector store support in LangChain.

## Getting Started
CrateDB has supported storing vectors since version 5.5. You can use the fully managed CrateDB Cloud service or install CrateDB yourself, for example using Docker.

```shell
docker run --publish 4200:4200 --publish 5432:5432 --pull=always crate:latest -Cdiscovery.type=single-node
```

## Setup

Install the required Python packages and import the modules used in this notebook.

In [1]:
!pip install -r requirements.txt

In [None]:
import os
import re
import warnings

import openai
import pandas as pd
import sqlalchemy as sa

from langchain.document_loaders.csv_loader import CSVLoader
from langchain_openai import OpenAIEmbeddings
from langchain.docstore.document import Document
from langchain.vectorstores import CrateDBVectorSearch

warnings.filterwarnings('ignore')

### Configure database settings

This notebook connects to a CrateDB server instance running on localhost. You can start a sandbox instance on your workstation by running [CrateDB using Docker]. Alternatively, you can connect to a cluster running on [CrateDB Cloud].

[CrateDB Cloud]: https://console.cratedb.cloud/
[CrateDB using Docker]: https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker

In [None]:
# Define the connection string to the running CrateDB instance.
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost/",
)

# Define the store collection to use for this notebook session.
COLLECTION_NAME = "customer_data"
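
To see which parts of the connection string matter, the URL can be split with the standard library. This is only an illustrative sketch: `describe_connection` is a hypothetical helper, not part of the CrateDB driver, and the explicit port in the second call is an assumption about your setup (4200 is one of the ports published by the Docker command above).

```python
from urllib.parse import urlsplit


def describe_connection(url: str) -> dict:
    """Split a SQLAlchemy-style URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "dialect": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL does not name a port
    }


print(describe_connection("crate://crate@localhost/"))
print(describe_connection("crate://crate@localhost:4200/"))
```

Setting the `CRATEDB_CONNECTION_STRING` environment variable before starting the notebook overrides the localhost default without editing the cell.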

### Configure OpenAI

This example requires an OpenAI API key. You can typically generate one by creating an account on OpenAI's website and opening the API section.

In [None]:
from pueblo.util.environ import getenvpass

getenvpass("OPENAI_API_KEY", prompt="OpenAI API key:")

## Create embeddings from dataset

We use the `CSVLoader` class to load support tickets from Twitter. The next step initializes a vector search store in CrateDB using embeddings generated by an OpenAI model. This will create a table that stores the embeddings with the name of the collection. Make sure the collection name is unique and that you have the permission to create a table.

In [None]:
loader = CSVLoader(file_path="./sample_data/twitter_support_microsoft.csv", encoding="utf-8", csv_args={'delimiter': ','})
data = loader.load()
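
Each CSV row becomes one document whose text lists the columns as `name: value` lines, which is the layout the regular expression later in this notebook relies on. The sketch below imitates that conversion with the standard `csv` module. It approximates what `CSVLoader` does rather than reproducing its implementation, and the sample rows and the `tweet_id` column are made up for illustration (only `text` and `response_tweet_id` are implied by the notebook).

```python
import csv
import io

# Made-up sample; the real file is ./sample_data/twitter_support_microsoft.csv.
SAMPLE_CSV = """tweet_id,text,response_tweet_id
1,How do I change my delivery address?,2
2,Please send us a DM with your order number.,
"""


def rows_to_page_content(csv_text: str) -> list[str]:
    """Render every CSV row as newline-separated 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=",")
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


for content in rows_to_page_content(SAMPLE_CSV):
    print(content)
    print("---")
```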

In [None]:
embeddings = OpenAIEmbeddings()

store = CrateDBVectorSearch.from_documents(
    embedding=embeddings,
    documents=data,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

## Ask a question
Let's define our question:

In [None]:
my_question = "How to update shipping address on existing order in Microsoft Store?"

## Find relevant context using similarity search

The similarity search uses Euclidean distance to find similar vectors and compute the score:

In [None]:
# Retrieve the most similar documents together with their scores.
docs_with_score = store.similarity_search_with_score(my_question)

# Keep only the tweet text from each matching document.
documents = []
pattern = r"text: (.+)\nresponse_tweet_id:"
for doc, _score in docs_with_score:
    match = re.search(pattern, doc.page_content, re.DOTALL)
    if match:
        documents.append(match.group(1).strip())
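
The retrieval step can be illustrated without a database. The sketch below ranks a few toy vectors by Euclidean distance to a query vector and then applies the same regular expression to pull out the tweet text. The three-dimensional vectors, the `author_id` field, and the document texts are invented for illustration; real embeddings come from the OpenAI model, and the actual search runs inside CrateDB.

```python
import math
import re

# Toy "embeddings"; real ones have far more dimensions.
corpus = [
    ((0.9, 0.1, 0.0),
     "author_id: 1\ntext: You can edit the address in your order history."
     "\nresponse_tweet_id: 11"),
    ((0.0, 0.8, 0.6),
     "author_id: 2\ntext: Try restarting your console."
     "\nresponse_tweet_id: 12"),
    ((0.7, 0.3, 0.1),
     "author_id: 3\ntext: Contact support to change\nthe shipping details."
     "\nresponse_tweet_id: 13"),
]
query_vector = (1.0, 0.0, 0.0)

# A smaller Euclidean distance means the vectors are more similar.
ranked = sorted(
    ((math.dist(query_vector, vector), content) for vector, content in corpus),
    key=lambda pair: pair[0],
)

# re.DOTALL lets the captured text span several lines.
pattern = r"text: (.+)\nresponse_tweet_id:"
top_texts = []
for distance, content in ranked[:2]:
    match = re.search(pattern, content, re.DOTALL)
    if match:
        top_texts.append(match.group(1).strip())
        print(f"{distance:.3f}  {top_texts[-1]!r}")
```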

## Augment system prompt and query LLM

In the final step, we set up a chatbot scenario in which an OpenAI chat model (`gpt-3.5-turbo` in the code below) serves as a customer support assistant, using the retrieved documents as its knowledge base to answer questions about Microsoft products and services. If the answer to a question is not in the provided documents, the system prompt instructs the model to respond with "I don't know".

In [None]:
# Put the separator on its own line between retrieved documents.
context = "\n---\n".join(documents)

system_prompt = f"""
You are a customer support expert and get questions about Microsoft products and services.
To answer the question, use the information from the context. Remove newline characters from the answer.
If you don't find the relevant information there, say "I don't know".

Context:
{context}"""

chat_completion = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": my_question},
    ],
)


In [None]:
chat_completion.choices[0].message.content
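
The prompt assembly above can be factored into a small function, which makes it easy to inspect the messages before sending them to the API. This is a sketch under stated assumptions: `build_messages` is a hypothetical helper name, the separator between documents is an arbitrary choice, and the sample documents are placeholders.

```python
def build_messages(documents: list[str], question: str) -> list[dict]:
    """Return chat messages with the retrieved documents as context."""
    context = "\n---\n".join(documents)
    system_prompt = (
        "You are a customer support expert and get questions about "
        "Microsoft products and services.\n"
        "To answer the question, use the information from the context.\n"
        'If you don\'t find the relevant information there, say "I don\'t know".\n\n'
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


# Placeholder documents standing in for the similarity search results.
messages = build_messages(
    ["first retrieved text", "second retrieved text"],
    "How do I update my shipping address?",
)
print(messages[0]["content"])
```

The returned list can be passed as the `messages` argument of `openai.chat.completions.create`.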