<a href="https://colab.research.google.com/github/pg-sys/astra_vsearch_QA_for_documents/blob/main/astra_vsearch_QA_for_documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# astra-vsearch-QA-for-documents
This demo guides you through setting up Astra DB with Vector Search, Cassio and Open AI to implement an generative Q&A for your own Documentation

Jupyter notebook for generative Q&A for douments is powered by [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html) and [OpenAI](https://github.com/openai/) and Casssio [Opensource LLM integration with Cassandra and Astra DB](https://cassio.org/).

## Astra Vector Search
Astra vector search enables developers to search a database by context or meaning rather than keywords or literal values. This is done by using “embeddings”. Embeddings are a type of representation used in machine learning where high-dimensional or complex data is mapped onto vectors in a lower-dimensional space. These vectors capture the semantic properties of the input data, meaning that similar data points have similar embeddings.
Reference: [Astra Vector Search](https://docs.datastax.com/en/astra-serverless/docs/vector-search/overview.html)

## CassIO
CassIO is the ultimate solution for seamlessly integrating Apache Cassandra® with generative artificial intelligence and other machine learning workloads. This powerful Python library simplifies the complicated process of accessing the advanced features of the Cassandra database, including vector search capabilities. With CassIO, developers can fully concentrate on designing and perfecting their AI systems without any concerns regarding the complexities of integration with Cassandra.
Reference [Cassio](https://cassio.org/)

## OpenAI
OpenAI provides various tools and resources to implement your own Document QA Search system. This includes pre-trained language models like GPT-3.5, which can understand and generate human-like text. Additionally, OpenAI offers guidelines and APIs to leverage their models for document search and question-answering tasks, enabling developers to build powerful and intelligent Document QA Search applications.
Reference: [OpenAI](https://github.com/openai/)

## Demo Summary
ChatGPT excels at answering questions, but only on topics it remembers from its training data. It offers you a nice dialog interface to ask questions and get answers.

But what do you do when you have your onw documents? How can you leverage the GenAI and LLM models to get insights in those?

Think of an Q/A Bot that you want to provide to your customers for asking questions against the documentation of your products.

For beeing able to do so, you have to implement your own ChatGPT-like solution.
The implementation requires
1. Analysing your existing documents and store the information
2. Providing search capabilities for your questions to get answers

This is solve by using a LLM models. Ideally you embedd the data as vectors and store them in a vector database and then use the LLM models on top of that database.

This notebook demonstrates a two-step Search-Ask method for enabling GPT to answer questions using a library of reference on your onw documentations based on Astra DB vector search.




# Getting Started with this notebook

These are prerequisites you need to to before running this notebook
- Create a new vector search enabled database in Astra.
- Create a keyspace
- Create a token with permissions to create tables
- Download your secure-connect-bundle.zip file
- Create an OpenAI account and download an API Key

- When you run this notebook, it will ask you for providing the secure-connect-bundle.zip, some text file and client ids, passwords as well as API Key

# Setup

This jupyter notebook was build on Colab. You need to install the following libraries.

In [1]:
# install required dependencies
! pip install \
    "git+https://github.com/hemidactylus/langchain@cassio#egg=langchain" \
    "cassandra-driver>=3.28.0" \
    "cassio>=0.0.4" \
    "openai==0.27.7" \
    "tiktoken==0.4.0"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Cloning https://github.com/hemidactylus/langchain (to revision cassio) to /tmp/pip-install-gqks0_tu/langchain_99d4c1dbd72a413fa2ad06f3756b36d5
  Running command git clone --filter=blob:none --quiet https://github.com/hemidactylus/langchain /tmp/pip-install-gqks0_tu/langchain_99d4c1dbd72a413fa2ad06f3756b36d5
  Running command git checkout -b cassio --track origin/cassio
  Switched to a new branch 'cassio'
  Branch 'cassio' set up to track remote branch 'cassio' from 'origin'.
  Resolved https://github.com/hemidactylus/langchain to commit 470af8a886a7c5a95884b2ecf72f37d68910b6b1
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


# Imports

In [2]:
# Imports for our environment and accessing Astra DB
import os

import getpass
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from google.colab import files

# Astra DB configuration, connection bundle and token secrets

In [18]:
#upload secure connect bundle
print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    SECURE_CONNECT_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )
#Alternatively upload to the environment and reference it here
#SECURE_CONNECT_BUNDLE_PATH = '/content/secure-connect-documentation.zip'


Please upload your Secure Connect Bundle


Saving secure-connect-documentation.zip to secure-connect-documentation (2).zip


In [16]:
ASTRA_DB_TOKEN_BASED_USERNAME = getpass.getpass('What Astra DB token username do you want to use? ')
#ASTRA_DB_TOKEN_BASED_USERNAME = '<<ENTER>>'

What Astra DB token username do you want to use? ··········


In [17]:
ASTRA_DB_TOKEN_BASED_PASSWORD = getpass.getpass('What Astra DB token password do you want to use? ')
#ASTRA_DB_TOKEN_BASED_PASSWORD = '<<ENTER>>'

What Astra DB token password do you want to use? ··········


In [6]:
ASTRA_DB_KEYSPACE = input(f'Which Astra DB keypsace do you want to use? ')
#ASTRA_DB_KEYSPACE = 'mykeyspace'

Which Astra DB keypsace do you want to use? mykeyspace


# Provide Sample Data
If you want to provide some docoments, you can upload them here.
As a sample document you can also download some text here:

In [7]:
# retrieve the text of a short story that will be indexed in the vector store
! curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA_PATH="amontillado.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 13022  100 13022    0     0  96459      0 --:--:-- --:--:-- --:--:-- 95750100 13022  100 13022    0     0  96459      0 --:--:-- --:--:-- --:--:-- 95750


In [None]:
# Alternatively you can provide your own file - please consider to customize the queries at the end of the notebook to match your content.
#provide some sample files
print('Please upload your own sample file:')
uploaded = files.upload()
if uploaded:
    sampleDataFileTitle = list(uploaded.keys())[0]
    SAMPLEDATA_PATH = os.path.join(os.getcwd(), sampleDataFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Sample Data. Please re-run the cell.'
    )

# Connect to Astra DB

In [19]:
# make sure that you can connect to Astra DB - if you see errors, then have a look at the environment you configured earlier

cloud_config = {
   'secure_connect_bundle': SECURE_CONNECT_BUNDLE_PATH
}
auth_provider = PlainTextAuthProvider(ASTRA_DB_TOKEN_BASED_USERNAME, ASTRA_DB_TOKEN_BASED_PASSWORD)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(140227095072336) cfe0851c-36a8-420c-9b15-ac1200c177c7-us-east1.db.astra.datastax.com:29042:e953f596-a31c-4699-ae7b-76a154c01d9c> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


# LLM Provider Setup
CassIO seamlessly integrates with LangChain, offering Cassandra-specific tools for many tasks. In our example we will use vector stores, indexers, embeddings and queries.

And we will use OpenAI for our LLM services. (See Pre-requisites on [cassio.org](https://cassio.org/start_here/#llm-access) for more details).

In [20]:
# Set your secret(s) for LLM access:
llmProvider = 'OpenAI'  # Be aware that you can also use 'GCP_VertexAI' with Cassio (as of date of authoring this notebook)

In [23]:
#we will use GPT embeddings, so please provide your OpenAI AKP Key
apiSecret = getpass.getpass('Your secret for LLM provider OpenAI: ')
#apiSecret = "<<ENTER>>"
os.environ['OPENAI_API_KEY'] = apiSecret

Your secret for LLM provider OpenAI: ··········


In [24]:
#Import the needed libraries and declare the LLM model
import langchain
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader


from langchain.vectorstores.cassandra import Cassandra


# creation of the LLM resources
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
llm = OpenAI(temperature=0)
myEmbedding = OpenAIEmbeddings()
print('LLM+embeddings from OpenAI will be used.')

LLM+embeddings from OpenAI will be used.


#Vector Store on Astra DB - automatically created table and configuration
Cassio will automatically create the needed tables and SAI in Astra DB for you. No worries about that configuration.

In [27]:
#define the table name to be used to store our embeddings
ASTRA_DB_TABLE_NAME = 'vdocuments'
cassVStore = Cassandra(
    session=session,
    keyspace=ASTRA_DB_KEYSPACE,
    table_name=ASTRA_DB_TABLE_NAME,
    embedding=myEmbedding,
)

# just in case this demo runs multiple times
#cassVStore.clear()

In [28]:
# creation of the DB connection

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Cassandra,
    embedding=myEmbedding,
    text_splitter=CharacterTextSplitter(
        chunk_size=100,
        chunk_overlap=0,
    ),
    vectorstore_kwargs={
        'session': session,
        'keyspace': ASTRA_DB_KEYSPACE,
        'table_name': ASTRA_DB_TABLE_NAME,
    },
)

# Load the Documents you want to process and create the embeddings to be stored in the Vector Database on Astra DB

In [29]:
#loader = TextLoader('amontillado.txt', encoding='utf8')
loader = TextLoader(SAMPLEDATA_PATH, encoding='utf8')

In [30]:
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
docs = loader.load()
subdocs = index_creator.text_splitter.split_documents(docs)
#
print(f'subdocument {0} ...', end=' ')
vs = index_creator.vectorstore_cls.from_documents(
    subdocs[:1],
    index_creator.embedding,
    **index_creator.vectorstore_kwargs
)
print('done.')
for sdi, sd in enumerate(subdocs[1:]):
    print(f'subdocument {sdi+1} ... out of {sd}' , end=' ')
    vs.add_texts(texts=[sd.page_content], metadata=[sd.metadata])
    print('done.')
#
index = VectorStoreIndexWrapper(vectorstore=vs)



subdocument 0 ... done.
subdocument 1 ... out of page_content='It must be understood that neither by word nor deed had I given\nFortunato cause to doubt my good will.  I continued, as was my wont, to\nsmile in his face, and he did not perceive that my smile _now_ was at\nthe thought of his immolation.' metadata={'source': 'amontillado.txt'} done.
subdocument 2 ... out of page_content='He had a weak point--this Fortunato--although in other regards he was a\nman to be respected and even feared.  He prided himself on his\nconnoisseurship in wine.  Few Italians have the true virtuoso spirit.\nFor the most part their enthusiasm is adopted to suit the time and\nopportunity--to practise imposture upon the British and Austrian\n_millionaires_.  In painting and gemmary, Fortunato, like his countrymen,\nwas a quack--but in the matter of old wines he was sincere.  In this\nrespect I did not differ from him materially: I was skillful in the\nItalian vintages myself, and bought largely whenever I c

# Query the Vector Store to see what has been added to it and what happended with our documentation

In [35]:
cqlSelect = f'SELECT * FROM {ASTRA_DB_KEYSPACE}.{ASTRA_DB_TABLE_NAME} LIMIT 3;'  # (Not a production-optimized query ...)
rows = session.execute(cqlSelect)
for row_i, row in enumerate(rows):
    print(f'\nRow {row_i}:')
    print(f'    document_id:      {row.document_id}')
    print(f'    embedding_vector: {str(row.embedding_vector)[:64]} ...')
    print(f'    document:         {row.document[:64]} ...')
    print(f'    metadata_blob:    {row.metadata_blob}')

print('\n...')


Row 0:
    document_id:      39d2b183da854107e6cf38c3da3d6408
    embedding_vector: [-0.0016813528491184115, -0.011779578402638435, 0.01529727876186 ...
    document:         "My friend, no.  It is not the engagement, but the severe cold w ...
    metadata_blob:    {}

Row 1:
    document_id:      3788fd3f23b30e797cc3d539e51b5d8a
    embedding_vector: [0.005347211379557848, 8.401897503063083e-05, 0.0241224560886621 ...
    document:         "Nitre?" he asked, at length.

"Nitre," I replied.  "How long ha ...
    metadata_blob:    {}

Row 2:
    document_id:      bb331c75e8d659cc73867897f1200ca6
    embedding_vector: [-0.005720699205994606, -0.013201086781919003, 0.004768961109220 ...
    document:         "Let us go, nevertheless.  The cold is merely nothing. Amontilla ...
    metadata_blob:    {}

...


# ASK - Ask Question and get Awnsers based on your documentation

In [36]:
# Search within the document contexnt for some text related information.
query = "Who is Luchesi?"
index.query(query, llm=llm)

' Luchesi is a person who is knowledgeable about wine.'

In [37]:
# Search within the document contexnt for some text un-related information.
query = "Who is John Doe?"
index.query(query, llm=llm)

" I don't know."

Congratulations - you just created and GenAI based Q/A Bot for you own documentation!