# Retrieval Augmented Generation (RAG) with CrateDB

This notebook shows how to use the vector store functionality of CrateDB, built on
[`FLOAT_VECTOR`] and [`KNN_MATCH`], to create a retrieval augmented generation
(RAG) pipeline.


## What is CrateDB?

[CrateDB] is an open-source, distributed, and scalable SQL analytics database
for storing and analyzing massive amounts of data in near real-time, even with
complex queries. It is wire-compatible with PostgreSQL, based on [Lucene], and
inherits the shared-nothing distribution layer of [Elasticsearch].

This example uses the [Python client driver for CrateDB].


[CrateDB]: https://github.com/crate/crate
[Elasticsearch]: https://github.com/elastic/elasticsearch
[`FLOAT_VECTOR`]: https://crate.io/docs/crate/reference/en/master/general/ddl/data-types.html#float-vector
[`KNN_MATCH`]: https://crate.io/docs/crate/reference/en/master/general/builtins/scalar-functions.html#scalar-knn-match
[Lucene]: https://github.com/apache/lucene
[Python client driver for CrateDB]: https://crate.io/docs/python/

## Getting Started

CrateDB has supported storing vectors since version 5.5. You can use the fully managed service
[CrateDB Cloud], or run CrateDB yourself, for example using Docker.

```shell
docker run --publish 4200:4200 --publish 5432:5432 --pull=always crate:latest -Cdiscovery.type=single-node
```

[CrateDB Cloud]: https://console.cratedb.cloud/

Install required Python packages, and import Python modules.

In [None]:
#!pip install -r requirements.txt

# Note: If you are running in an environment like Google Colab, use the full URL of the requirements file:
#!pip install -r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt

In [1]:
import openai
import pandas as pd
import sqlalchemy as sa

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# TODO: Bring this into the `crate-python` driver.
from cratedb_toolkit.sqlalchemy.patch import patch_inspector
patch_inspector()

You need to provide an OpenAI API key, either through the environment variable `OPENAI_API_KEY`
or by defining it in an `.env` file.

```shell
export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY
```

In [None]:
from pueblo.util.environ import getenvpass

getenvpass("OPENAI_API_KEY", prompt="OpenAI API key:")

You also need to provide a connection string to your CrateDB database cluster,
optionally using the environment variable `CRATEDB_CONNECTION_STRING`.

This example uses a CrateDB instance on your workstation, which you can start by
running [CrateDB using Docker]. Alternatively, you can also connect to a cluster
running on [CrateDB Cloud].

[CrateDB Cloud]: https://console.cratedb.cloud/
[CrateDB using Docker]: https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker

In [9]:
import os

CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost/",
)

# For CrateDB Cloud, use:
# CONNECTION_STRING = os.environ.get(
#     "CRATEDB_CONNECTION_STRING",
#     "crate://username:password@hostname/?ssl=true&schema=langchain",
# )

In [None]:
_ = """
# Alternatively, the connection string can be assembled from individual
# environment variables.
import os

CONNECTION_STRING = CrateDBVectorSearch.connection_string_from_db_params(
    driver=os.environ.get("CRATEDB_DRIVER", "crate"),
    host=os.environ.get("CRATEDB_HOST", "localhost"),
    port=int(os.environ.get("CRATEDB_PORT", "4200")),
    database=os.environ.get("CRATEDB_DATABASE", "langchain"),
    user=os.environ.get("CRATEDB_USER", "crate"),
    password=os.environ.get("CRATEDB_PASSWORD", ""),
)
"""
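
If the password contains characters such as `@` or `/`, it has to be percent-encoded before it
goes into the connection string. The following is a minimal sketch using only the Python
standard library. The helper name `make_connection_string` is hypothetical, and the `ssl` and
`schema` options are simply the ones shown in the CrateDB Cloud example above.

```python
from urllib.parse import quote, urlencode


def make_connection_string(host, user="crate", password="", port=4200, **options):
    """Assemble a `crate://` URL, percent-encoding the credentials (hypothetical helper)."""
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    url = f"crate://{credentials}@{host}:{port}/"
    # Extra keyword arguments become URL query parameters, e.g. ssl=true.
    query = urlencode(options)
    return f"{url}?{query}" if query else url


print(make_connection_string("localhost"))
# crate://crate@localhost:4200/
print(make_connection_string("example.invalid", user="admin", password="p@ss/word",
                             ssl="true", schema="langchain"))
# crate://admin:p%40ss%2Fword@example.invalid:4200/?ssl=true&schema=langchain
```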

## Step 1: Load PDF and split the data

Let's use the white paper [Time-series data in manufacturing] as the foundation for the upcoming
explorations, to augment the knowledge of the LLM. The paper provides a good overview of database
technologies for storing and analyzing time-series data.

The data is split into chunks of 1,000 characters, with an overlap of 200 characters between
adjacent chunks. The overlap improves results by preserving context that would otherwise be cut
off at chunk boundaries:

[Time-series data in manufacturing]: https://cratedb.com/resources/white-papers/lp-wp-time-series-data-manufacturing

In [4]:
loader = PyPDFLoader("https://github.com/crate/cratedb-datasets/raw/main/machine-learning/fulltext/White%20paper%20-%20Time-series%20data%20in%20manufacturing.pdf")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
pages = loader.load_and_split(text_splitter)
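
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of a
fixed-size sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works
internally, since that class additionally prefers to split on separators such as paragraph and
line breaks. The function name `split_with_overlap` is made up for this illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into character windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop as soon as the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])      # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])   # True: the chunks share 200 characters
```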

## Step 2: Store embeddings

This section explains how to store text and embeddings in CrateDB using SQL and pandas,
without using the LangChain integration. This can be beneficial if you have special
security requirements inside the database.

In [11]:
embeddings = OpenAIEmbeddings()
pages_text = [doc.page_content for doc in pages]
pages_embeddings = embeddings.embed_documents(pages_text)

# The next step creates a dataframe that contains the text of the documents and their embeddings. 
df = pd.DataFrame(list(zip(pages_text, pages_embeddings)), columns=['text', 'embedding'])

# The embeddings will be stored in CrateDB using the FLOAT_VECTOR type.
engine = sa.create_engine(CONNECTION_STRING, echo=False)
with engine.connect() as connection:

    # Create database table.
    connection.execute(sa.text("CREATE TABLE IF NOT EXISTS text_data (text TEXT, embedding FLOAT_VECTOR(1536));"))

    # Write text and embeddings to CrateDB database.
    df.to_sql(name="text_data", con=connection, if_exists="append", index=False)
    connection.execute(sa.text("REFRESH TABLE text_data;"))

df.head(5)

|   | text | embedding |
|---|------|-----------|
| 0 | / WHITE PAPER \n \n \n \nTime-series data i... | [0.0025093274553031246, -0.02196431773679507, ... |
| 1 | https://crate.io \| office@crate.io \| +43 ... | [-0.0049216739619637, -0.010619178696718409, 0... |
| 2 | https://crate.io \| office@crate.io \| +43 ... | [-0.022959793496217757, 0.0031425609019670345,... |
| 3 | processed. The great advantages in data techno... | [-0.008236182477236282, -0.011472808604225981,... |
| 4 | due to the expansion of the IoT. Something is ... | [0.007639478261629491, -0.022254722218927096, ... |
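
The table column is declared as `FLOAT_VECTOR(1536)`, so it can be worth verifying that every
embedding has exactly that many elements before writing to the database. The following is a
minimal sketch of such a check. `check_embeddings` is a hypothetical helper, not part of pandas
or the CrateDB driver, and the toy vectors use three dimensions only to keep the example short.

```python
import math


def check_embeddings(vectors, expected_dim=1536):
    """Return the number of vectors, raising ValueError on a wrong length or non-finite value."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, expected {expected_dim}"
            )
        if not all(math.isfinite(value) for value in vector):
            raise ValueError(f"vector {position} contains NaN or infinite values")
    return len(vectors)


vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_embeddings(vectors, expected_dim=3))  # 2
```

In the notebook, the equivalent call would be `check_embeddings(pages_embeddings)` before `df.to_sql(...)`.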


## Step 3: Retrieve

The `knn_match(search_vector, query_vector, k)` function in CrateDB performs an approximate k-nearest
neighbors (KNN) search. A KNN search finds the k data points that are most similar to a given query
data point. CrateDB applies `k` to each shard of the table, so the query can return more than k rows
in total.

Therefore, use `ORDER BY _score DESC` and `LIMIT 4` to obtain a total of four relevant documents
that will provide the context for the prompt.
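
The effect of the per-shard `k` can be sketched in plain Python. This is only a toy model under
stated assumptions: it uses an exact search instead of an approximate one, the two "shards" are
hard-coded lists, and the `score` function (one over one plus the squared Euclidean distance) is
a stand-in for `_score`, not a statement about how CrateDB computes it.

```python
import heapq


def score(vector, query):
    """Toy similarity: higher is closer (assumed stand-in for `_score`)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(vector, query))
    return 1.0 / (1.0 + squared_distance)


def knn_per_shard(shards, query, k):
    """Collect up to k best (score, text) pairs from every shard."""
    hits = []
    for shard in shards:
        scored = [(score(vector, query), text) for text, vector in shard]
        hits.extend(heapq.nlargest(k, scored))
    return hits


shards = [
    [("a", [0.0, 0.0]), ("b", [1.0, 0.0]), ("c", [5.0, 5.0])],
    [("d", [0.1, 0.0]), ("e", [0.0, 2.0]), ("f", [9.0, 9.0])],
]
query = [0.0, 0.0]
k = 2

candidates = knn_per_shard(shards, query, k)   # up to k rows per shard
top_k = sorted(candidates, reverse=True)[:k]   # mirrors ORDER BY _score DESC LIMIT k
print(len(candidates), [text for _, text in top_k])  # 4 ['a', 'd']
```

With two shards and `k = 2`, four candidates come back, and only the final sort and limit reduce
them to the two best matches overall.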

Find the vectors most similar to the input query vector, using the KNN search capabilities of CrateDB:

In [20]:
# Define the question and create an embedding using the OpenAI embedding model.
my_question = "What is the difference between time series and NoSQL databases?"
query_embedding = embeddings.embed_query(my_question)

knn_query = sa.text("SELECT text FROM text_data WHERE knn_match(embedding, {0}, 4) ORDER BY _score DESC LIMIT 4".format(query_embedding))
documents=[]

with engine.connect() as con:
    results = con.execute(knn_query)
    for record in results:
        documents.append(record[0])
        
# Print the number of retrieved documents.
print(len(documents))

4


## Step 4: Generate

The goal is to distill the retrieved documents into an answer using an LLM/Chat model,
for example `gpt-3.5-turbo`.

Create a short system prompt to instruct the LLM how to answer the question, and send
similar documents alongside the user's question as additional context.
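
The prompt assembly can also be wrapped in a small reusable function. The following is a minimal
sketch, not the notebook's own code: `build_messages` and its `separator` parameter are
hypothetical names, and the instruction text is shortened for illustration. It only builds the
message list and does not call the OpenAI API.

```python
def build_messages(question, documents, separator="\n---\n"):
    """Build chat messages: retrieved documents go into the system prompt as context."""
    context = separator.join(doc.strip() for doc in documents)
    system_prompt = (
        "Answer the user's question using only the context below. "
        'If the context does not contain the answer, say "I don\'t know".\n\n'
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages("What is a time series?", ["First chunk. ", " Second chunk."])
print(messages[0]["content"])
```

The returned list has the shape expected by the `messages` argument of a chat completion request.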

In [11]:
from openai import OpenAI

# Concatenate the retrieved documents into the context for the system prompt,
# separating them with a line that contains only `---`.
context = '\n---\n'.join(documents)

# Give instructions and context in the system prompt
system_prompt = f"""
You are a time series expert and get questions from the user covering the area of time series databases and time series use cases.
Please answer the user's question in the language it was asked in.
Please only use the following context to answer the question. If you don't find the relevant information there, say "I don't know".

Context: 
{context}"""

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

chat_completion = client.chat.completions.create(
    model="gpt-3.5-turbo", 
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": my_question}
    ]
)

print(chat_completion.choices[0].message.content)

Time series databases and NoSQL databases are two distinct types of databases with different characteristics and use cases.

Time series databases are designed specifically to handle time-stamped data efficiently. They excel at managing large volumes of data points with high ingestion and query rates. Time series databases optimize for storing and retrieving data based on time, allowing for fast and efficient retrieval of time-based data. They often provide specialized functions and features for time series analysis, such as downsampling, interpolation, and aggregation.

On the other hand, NoSQL databases are a broad category of databases that do not adhere to the traditional relational database management system (RDBMS) model. NoSQL databases are designed to handle unstructured or semi-structured data. They prioritize scalability, high availability, and flexible data models. NoSQL databases use different data models, such as key-value, document, column-family, and graph, to suit diffe