<a href="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/QA_with_cassio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Knowledge Base Search on Proprietary Data powered by Astra DB
This notebook guides you through setting up [RAGStack](https://www.datastax.com/products/ragstack) using [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html), [OpenAI](https://openai.com/about), and [CassIO](https://cassio.org/) to implement a generative Q&A over your own documentation.

## Astra Vector Search
Astra vector search enables developers to search a database by context or meaning rather than keywords or literal values. This is done by using “embeddings”. Embeddings are a type of representation used in machine learning where high-dimensional or complex data is mapped onto vectors in a lower-dimensional space. These vectors capture the semantic properties of the input data, meaning that similar data points have similar embeddings.
Reference: [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html)
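
To make "similar data points have similar embeddings" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common similarity measure. The three-dimensional vectors and the `toy_embeddings` name are invented for illustration; real embedding models return much longer vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example only.
toy_embeddings = {
    "wine cellar": [0.9, 0.1, 0.2],
    "vineyard": [0.8, 0.2, 0.3],
    "spreadsheet": [0.1, 0.9, 0.4],
}

# Compare every entry against the "wine cellar" vector.
query_vector = toy_embeddings["wine cellar"]
for text, vector in toy_embeddings.items():
    print(f"{text}: {cosine_similarity(query_vector, vector):.3f}")
```

In this toy data, "vineyard" scores closer to "wine cellar" than "spreadsheet" does, which is the behavior a vector search relies on.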

## CassIO
CassIO is a Python library that integrates Apache Cassandra® with generative artificial intelligence and other machine learning workloads. It simplifies access to the advanced features of the Cassandra database, including vector search, so developers can concentrate on designing their AI systems rather than on the details of the Cassandra integration.
Reference: [CassIO](https://cassio.org/)

## OpenAI
OpenAI provides various tools and resources to implement your own Document QA Search system. This includes pre-trained language models like GPT-4, which can understand and generate human-like text. Additionally, OpenAI offers guidelines and APIs to leverage their models for document search and question-answering tasks, enabling developers to build powerful and intelligent Document QA Search applications.
Reference: [OpenAI](https://openai.com/about)

## Demo Summary
ChatGPT excels at answering questions, but only on topics it knows from its training data. It offers you a nice dialog interface to ask questions and get answers.

But what do you do when you have your own documents? How can you use LLMs to get insights from them? We can use Retrieval Augmented Generation (RAG) -- think of a Q/A bot that can answer specific questions about your documentation.

We can do this in two easy steps:
1. Analyzing and storing existing documentation.
2. Providing search capabilities for the model to retrieve your documentation.

Both steps rely on embeddings: you embed the documents as vectors, store them in a vector database, and then have the LLM answer questions using the passages retrieved from that database.

This notebook demonstrates a basic two-step RAG technique that enables GPT to answer questions about your own documentation using Astra DB Vector Search.
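
The two steps can be sketched without any external service. In the sketch below, `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), the `store` list stands in for the vector database, and the two sample sentences are made up. It is meant to show the shape of the flow, not how Astra DB or OpenAI implement it.

```python
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def overlap(a, b):
    """Score two word-count 'embeddings' by the words they share."""
    return sum((a & b).values())


# Step 1: analyze and store the documentation.
docs = [
    "Luchesi is a rival wine connoisseur.",
    "The vaults are damp and lined with nitre.",
]
store = [(toy_embed(d), d) for d in docs]


# Step 2: retrieve the best matches for a question.
def retrieve(question, k=1):
    q = toy_embed(question)
    ranked = sorted(store, key=lambda item: overlap(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


question = "Who is Luchesi?"
context = "\n".join(retrieve(question))

# The retrieved text is placed in the prompt that would be sent to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```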

# Prerequisites

* Follow [these steps](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html) to create a new vector search enabled database in Astra.
* Generate a new ["Database Administrator" token](https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html).
* Get an [OpenAI API Key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key). This requires an OpenAI account with billing enabled.
* For more details, see [Prerequisites](https://cassio.org/start_here/#llm-access) on cassio.org.


When you run this notebook, it will ask you to provide each of these items at various steps.

In [4]:
# install required dependencies
! pip install -qU ragstack-ai pypdf

In [5]:
import os
from getpass import getpass

# Enter your settings for Astra DB and OpenAI:
os.environ["ASTRA_DB_ID"] = input("Enter your Astra DB ID: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

# Provide Sample Data
A sample document is provided from CassIO.

In [7]:
# retrieve the text of a short story that will be indexed in the vector store
# ruff: noqa: E501
! curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA = ["amontillado.txt"]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13022  100 13022    0     0  54555      0 --:--:-- --:--:-- --:--:-- 55649


# Read Files, Create Embeddings, Store in Vector DB

CassIO seamlessly integrates with RAGStack and LangChain, offering Cassandra-specific tools for many tasks. In our example we will use vector stores, indexers, embeddings, and queries.

We will use OpenAI for our LLM services (see [cassio.org](https://cassio.org/start_here/#llm-access) for more details).

In [8]:
import os

import cassio

cassio.init(
    database_id=os.environ["ASTRA_DB_ID"],
    token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
)

In [None]:
# Import the document loaders, the embedding model, and the vector store
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Cassandra

# Loop through each file and load it into our vector store
documents = []
for filename in SAMPLEDATA:
    path = os.path.join(os.getcwd(), filename)

    # Supported file types are pdf and txt
    if filename.endswith(".pdf"):
        loader = PyPDFLoader(path)
        new_docs = loader.load_and_split()
        print(f"Processed pdf file: {filename}")
    elif filename.endswith(".txt"):
        loader = TextLoader(path)
        new_docs = loader.load_and_split()
        print(f"Processed txt file: {filename}")
    else:
        # Skip unsupported files so an undefined or stale new_docs is never used
        print(f"Unsupported file type: {filename}")
        continue

    if len(new_docs) > 0:
        documents.extend(new_docs)

cass_vstore = Cassandra.from_documents(
    documents=documents,
    embedding=OpenAIEmbeddings(),
    session=None,
    table_name="qa_cassio",
    keyspace="default_keyspace",
)

# empty the list of file names -- we don't want to
# accidentally load the same files again
SAMPLEDATA = []

print("\nProcessing done.")
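
The `load_and_split()` calls in the cell above break each file into smaller chunks before they are embedded. As a rough illustration of the idea, the sketch below cuts text into fixed-size character windows that overlap. The function name `split_with_overlap` and its parameters are hypothetical, and LangChain's own splitter follows more careful rules than this simplified stand-in.

```python
def split_with_overlap(text, chunk_size=40, overlap=10):
    """Cut text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = (
    "The thousand injuries of Fortunato I had borne as I best could, "
    "but when he ventured upon insult, I vowed revenge."
)
for chunk in split_with_overlap(sample):
    print(repr(chunk))
```

Overlapping windows reduce the chance that a sentence relevant to a question is cut in half at a chunk boundary.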

# Now Query the Data and execute some "searches" against it
First, we run a similarity search using the vector store's own implementation.

In [11]:
# construct your query
query = "Who is Luchesi?"

# find matching documentation using similarity search
matched_docs = cass_vstore.similarity_search(query=query, k=1)

# print out the relevant context that an LLM will use to produce an answer
for i, d in enumerate(matched_docs):
    print(f"\n## Document {i}\n")
    print(d.page_content)


## Document 0

The thousand injuries of Fortunato I had borne as I best could, but
when he ventured upon insult, I vowed revenge.  You, who so well know
the nature of my soul, will not suppose, however, that I gave utterance
to a threat.  _At length_ I would be avenged; this was a point definitely
settled--but the very definitiveness with which it was resolved,
precluded the idea of risk.  I must not only punish, but punish with
impunity.  A wrong is unredressed when retribution overtakes its
redresser.  It is equally unredressed when the avenger fails to make
himself felt as such to him who has done the wrong.

It must be understood that neither by word nor deed had I given
Fortunato cause to doubt my good will.  I continued, as was my wont, to
smile in his face, and he did not perceive that my smile _now_ was at
the thought of his immolation.

He had a weak point--this Fortunato--although in other regards he was a
man to be respected and even feared.  He prided himself on his
connoi

# Finally do a Q/A Search
To implement Q/A over documents, we need to perform the following steps:

1. Create an Index on top of our vector store
2. Create a Retriever from that Index
3. Ask questions (prompts)!

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever.
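
The point that a retriever only has to return documents, not store vectors, can be sketched as follows. The `Retriever` protocol and `KeywordRetriever` class are hypothetical and are not LangChain's classes, and the sample sentences are made up. The example shows a retriever backed by plain keyword matching instead of a vector store.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything that returns texts for an unstructured query."""

    def get_relevant_documents(self, query: str) -> List[str]: ...


def _words(text: str) -> set:
    """Lowercased words with surrounding punctuation removed."""
    return {w.strip("?.,!").lower() for w in text.split()}


class KeywordRetriever:
    """A retriever with no vector store: whole-word matching over a list."""

    def __init__(self, texts: List[str]):
        self.texts = texts

    def get_relevant_documents(self, query: str) -> List[str]:
        query_words = _words(query)
        return [t for t in self.texts if query_words & _words(t)]


def build_context(retriever: Retriever, query: str) -> str:
    """Works with any object that satisfies the Retriever protocol."""
    return "\n".join(retriever.get_relevant_documents(query))


retriever = KeywordRetriever(
    ["Luchesi is mentioned by the narrator.", "The vaults were damp."]
)
print(build_context(retriever, "Who is Luchesi?"))
```

Because `build_context` depends only on the interface, the keyword-based retriever could be swapped for one backed by a vector store without changing the calling code.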

In [16]:
# Q/A LLM Search
from langchain.chat_models import ChatOpenAI
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

llm = ChatOpenAI(model="gpt-3.5-turbo-1106")
index = VectorStoreIndexWrapper(vectorstore=cass_vstore)

# Query the index for relevant vectors to our prompt
query = "Who is Luchesi?"
index.query(question=query, llm=llm)

'Luchesi is a character mentioned in the story "The Cask of Amontillado" by Edgar Allan Poe. He is a rival of the narrator and Fortunato in the field of wine connoisseurship. The narrator mentions Luchesi to challenge Fortunato\'s expertise and entice him to accompany to the vaults where he eventually meets his fate.'

In [18]:
# Alternatively, you can use a retrieval chain with a custom prompt
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate

# Prompt template: the chain fills in {context} and {question}
template = """
You are Marv, a sarcastic but factual chatbot.
Context: {context}
Question: {question}
Your answer:
"""
prompt = ChatPromptTemplate.from_template(template)

qa = RetrievalQA.from_chain_type(
    llm=llm, retriever=cass_vstore.as_retriever(), chain_type_kwargs={"prompt": prompt}
)

# Pass only the plain question; the chain inserts it into the template
result = qa.run("Who is Luchesi?")
result

'Luchesi is a character mentioned in Edgar Allan Poe\'s short story "The Cask of Amontillado." He is a wine connoisseur who is referenced in the story by the narrator as a potential alternative to Fortunato for assessing the quality of a cask of Amontillado.'
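
As a rough picture of what the chain does with the custom prompt, the sketch below fills the same template by hand using plain string formatting. The `retrieved_chunks` placeholders are hypothetical, and the way the chunks are joined is an assumption. In the real chain the retriever supplies the chunk text and LangChain performs the substitution.

```python
template = """
You are Marv, a sarcastic but factual chatbot.
Context: {context}
Question: {question}
Your answer:
"""

# Hypothetical placeholders standing in for the text returned by the retriever.
retrieved_chunks = [
    "<text of first retrieved chunk>",
    "<text of second retrieved chunk>",
]

# Assumption: retrieved chunks are joined into a single context string.
filled = template.format(
    context="\n\n".join(retrieved_chunks),
    question="Who is Luchesi?",
)
print(filled)
```

The caller supplies only the question text. The `{context}` and `{question}` placeholders belong to the template and are filled in before the prompt reaches the model.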

## Cleanup

In [None]:
# WARNING: This will delete the collection and all documents in the collection
# cass_vstore.delete_collection()