<a href="https://colab.research.google.com/github/YoshiyukiKono/gen_ai-sandbox/blob/main/DL_DSE7_RAGStack_MissionControl_astradb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Search with DataStax Enterprise 7 deployed with Mission Control & RAGStack

This page provides a quick start for using [DataStax Enterprise 7](https://www.datastax.com/blog/introducing-vector-search-for-self-managed-datastax-enterprise) as a Vector Store.

Additionally, we're introducing [RAGStack](https://www.datastax.com/products/ragstack), an out-of-the-box solution that simplifies Retrieval Augmented Generation (RAG) in AI apps. RAGStack includes the best open-source libraries for implementing RAG, giving developers a comprehensive Gen AI stack built on LangChain, CassIO, and more.

***In addition to access to the database, an OpenAI API Key is required to run the full example.***

## Setup & General Dependencies
### GKE

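Save the cluster definition to a file and use `kubectl apply -f <yourfile>` to provision the cluster. Once the cluster is running, expose it outside Kubernetes with a `LoadBalancer` service such as the following:

```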
apiVersion: v1
kind: Service
metadata:
  name: test-loadbalancer
  namespace: demo-33z2o6o4
  labels:
    cassandra.datastax.com/cluster: test
    cassandra.datastax.com/datacenter: dc-1
spec:
  type: LoadBalancer
  ports:
  - port: 9042
    protocol: TCP
  selector:
    cassandra.datastax.com/cluster: test
    cassandra.datastax.com/datacenter: dc-1
    
```
Save the `LoadBalancer` definition to a file and use `kubectl apply -f <yourfile>` to create the service.

### Connection
Once the `LoadBalancer` service has been created, copy the External IP address (*only the IP address* not `https://`) and enter it in the prompt of the `Code` cell below. Run the rest of the notebook as normal.
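
Because the prompt expects a bare IP address, it can help to normalise whatever gets pasted before using it as a contact point. The following is a minimal sketch using only the Python standard library; the helper name `normalize_contact_point` is hypothetical and not part of this notebook.

```python
import ipaddress
from urllib.parse import urlsplit


def normalize_contact_point(raw: str) -> str:
    """Return a bare IP address from user input such as 'https://1.2.3.4/'."""
    value = raw.strip()
    if "://" in value:
        # Drop the scheme, port and path; keep only the host part.
        value = urlsplit(value).hostname or ""
    # Raises ValueError if what remains is not a valid IPv4/IPv6 address.
    return str(ipaddress.ip_address(value))


print(normalize_contact_point(" https://34.133.121.172/ "))
```

Failing early with a `ValueError` is a design choice here: a malformed contact point is easier to diagnose at the prompt than as a driver connection timeout later on.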


### IMPORTANT

Set the `namespace` field to the project name shown in the Mission Control UI, for example `demo-33z2o6o4` in https://localhost:8080/ui/projects/demo-33z2o6o4/clusters/test


To find the load balancer, open the GCP Console navigation pane > Kubernetes Engine > Networking > Gateways, Services & Ingress > Services. The IP address of the load balancer is displayed there.



In [None]:
import os
from getpass import getpass

cluster_external_ip = getpass("GKE External IP Address = ")

GKE External IP Address = ··········


In [None]:
print(cluster_external_ip)

34.133.121.172


In [1]:
#Dependency Install
%pip install datasets pypdf ragstack-ai ipywidgets

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Collecting ragstack-ai
  Downloading ragstack_ai-0.10.0-py3-none-any.whl (4.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downlo

In [2]:
from datasets import load_dataset

import langchain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter

##### Paste your OpenAI API key into the prompt

In [6]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")

OPENAI_API_KEY = ··········


In [7]:
embe = OpenAIEmbeddings()



## DataStax Enterprise 7 Session
`RAGStack` includes LangChain modules for both vector similarity search and vector database operations. Here we import the `Cassandra` vector store class, which also works with DataStax Enterprise.

In [8]:
from langchain.vectorstores import Cassandra

### Retrieve the DSE cluster superuser password and enter it at the prompt


In [None]:
cass_pass = getpass("GKE Superuser Password = ")

GKE Superuser Password = ··········


In [None]:
print(cass_pass)

datastax


In [None]:
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster
from cassandra.policies import AddressTranslator


class ContactPointAddressTranslator(AddressTranslator):
    """Translate every node address to the single external contact point."""

    def __init__(self, contact_point):
        self.contact_point = contact_point

    def translate(self, addr):
        # Route all node addresses through the load balancer's external IP
        return self.contact_point


# Reuse the values entered at the prompts above instead of hardcoding them
translator = ContactPointAddressTranslator(cluster_external_ip)
auth_provider = PlainTextAuthProvider(username="test-superuser", password=cass_pass)

# If the port parameter is removed below, we are unable
# to establish a connection
cluster = Cluster(
    [cluster_external_ip],
    address_translator=translator,
    auth_provider=auth_provider,
    port=9042,
)

session = cluster.connect()





In [None]:
# Create a keyspace in the DSE 7 cluster
session.execute("CREATE KEYSPACE IF NOT EXISTS default_keyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };")

<cassandra.cluster.ResultSet at 0x7da189a4b430>

## Use Astra DB temporarily

In [11]:
SECURE_CONNECT_BUNDLE_PATH = 'secure-connect-new-vector.zip'

In [10]:
!wget -O secure-connect-new-vector.zip "https://datastax-cluster-config-prod.s3.us-east-2.amazonaws.com/815e548a-4757-40ac-a986-9f8e2b647534-1/secure-connect-new-vector.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA2AIQRQ76XML7FLD6%2F20240408%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20240408T070608Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=0b28c65a7cd7635eb4ac639ce883448dfa7b50f2aa6e635d8da6a3a0d1bfef30"

2024-04-08 07:06:16 (134 MB/s) - ‘secure-connect-new-vector.zip’ saved [12262/12262]



In [12]:
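# Import the function, not the module: `import getpass` would shadow
# the `getpass` function used by the earlier cells
from getpass import getpass

YOUR_TOKEN = getpass()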

··········


In [13]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config= {
  'secure_connect_bundle': SECURE_CONNECT_BUNDLE_PATH
}
#auth_provider = PlainTextAuthProvider(ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET)
auth_provider = PlainTextAuthProvider("token", YOUR_TOKEN)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

row = session.execute("select release_version from system.local").one()
if row:
  print(row[0])
else:
  print("An error occurred.")

ERROR:cassandra.connection:Closing connection <LibevConnection(137757430015424) 815e548a-4757-40ac-a986-9f8e2b647534-us-east1.db.astra.datastax.com:29042:eaccbf41-af39-4d8f-9f62-51c675165db1> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


4.0.11-051eacf61d32


In [14]:
import cassio

cassio.init(session=session, keyspace="default_keyspace")

In [15]:
# Create the LangChain vector store object.
# With session=None and keyspace=None, the values registered by cassio.init() are used.
vstore = Cassandra(
    embedding=embe, table_name="cassandra_vector_demo", session=None, keyspace=None
)

### Load A Dataset
Convert each entry in the source dataset into a `Document`, then write them into the vector store:

In [16]:
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]

docs = []
for entry in philo_dataset:
    metadata = {"author": entry["author"]}
    doc = Document(page_content=entry["quote"], metadata=metadata)
    docs.append(doc)

inserted_ids = vstore.add_documents(docs)
print(f"\nInserted {len(inserted_ids)} documents.")



Inserted 450 documents.


In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.

_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._

In [17]:
texts = ["I think, therefore I am.", "To the things themselves!"]
metadatas = [{"author": "descartes"}, {"author": "husserl"}]
ids = ["desc_01", "huss_xy"]

inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)
print(f"\nInserted {len(inserted_ids_2)} documents.")


Inserted 2 documents.


_Note: you may want to speed up the execution of `add_texts` and `add_documents` by increasing the concurrency level for these bulk operations - check out the `*_concurrency` parameters in the class constructor and the `add_texts` docstrings for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary._

### Run simple searches
This section demonstrates metadata filtering and getting the similarity scores back:

In [18]:
results = vstore.similarity_search("Our life is what we make of it", k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")



* We are what we are because we have been what we have been. [{'author': 'freud'}]
* We become what we contemplate. [{'author': 'plato'}]
* In the blessings as well as in the ills of life, less depends upon what befalls us than upon the way in which it is met. [{'author': 'schopenhauer'}]


In [19]:
results_filtered = vstore.similarity_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "plato"},
)
for res in results_filtered:
    print(f"* {res.page_content} [{res.metadata}]")



* We become what we contemplate. [{'author': 'plato'}]
* Enjoy life. There's plenty of time to be dead. Be kind, for everyone you meet is fighting a harder battle. [{'author': 'plato'}]
* The measure of a man is what he does with power. [{'author': 'plato'}]


In [20]:
results = vstore.similarity_search_with_score("Our life is what we make of it", k=3)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")



* [SIM=0.934000] We are what we are because we have been what we have been. [{'author': 'freud'}]
* [SIM=0.931940] We become what we contemplate. [{'author': 'plato'}]
* [SIM=0.928524] In the blessings as well as in the ills of life, less depends upon what befalls us than upon the way in which it is met. [{'author': 'schopenhauer'}]
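
The scores above come from comparing the query embedding with each stored embedding. As a rough illustration only, here is plain cosine similarity between toy vectors. The vectors are made up, and the store may rescale the raw cosine value before returning it, so treat this as a sketch of the idea rather than the exact scoring used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings" (real ones have hundreds of dimensions)
query = [1.0, 0.0]
candidates = {"same direction": [2.0, 0.0], "orthogonal": [0.0, 3.0]}

# Rank candidates from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```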


### MMR (Maximal-marginal-relevance) search

In [21]:
results = vstore.max_marginal_relevance_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "aristotle"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")



* The quality of life is determined by its activities. [{'author': 'aristotle'}]
* Love is composed of a single soul inhabiting two bodies. [{'author': 'aristotle'}]
* We must be neither cowardly nor rash but courageous. [{'author': 'aristotle'}]
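
MMR trades relevance to the query against redundancy among the results already chosen. The following is a minimal pure-Python sketch of the greedy selection on toy vectors. The function names and the data are illustrative assumptions, and LangChain's own implementation (including how many candidates it fetches first) may differ in detail.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=3, lambda_mult=0.5):
    """Greedy maximal marginal relevance; returns indices into `candidates`."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Penalise similarity to anything already picked
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
candidates = [[1.0, 0.8], [1.0, 0.7], [0.2, 1.0]]
print(mmr(query, candidates, k=2))                   # balances relevance and diversity
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain top-k by similarity; lower values favour results that differ from one another.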


### Deleting stored documents

In [22]:
delete_1 = vstore.delete(inserted_ids[:3])
print(f"all_succeed={delete_1}")  # True, all documents deleted

all_succeed=True


In [23]:
delete_2 = vstore.delete(inserted_ids[2:5])
print(f"some_succeeds={delete_2}")  # True, though some IDs were gone already

some_succeeds=True


### Running A Minimal RAG Chain
The next cells will implement a simple RAG pipeline:
- download a sample PDF file and load it onto the store;
- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;
- run the question-answering chain.

In [24]:
!curl -L \
"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true" \
-o what-is-philosophy.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 55220  100 55220    0     0   142k      0 --:--:-- --:--:-- --:--:--  142k


In [25]:
import os

# Load the PDF file downloaded into the current working directory.
# Note: `-o` and the file name must be separate arguments in the curl call above;
# quoting them together saves the file under a name with a leading space.
file_path = os.path.join(os.getcwd(), "what-is-philosophy.pdf")
print(file_path)

!ls "/content"  # Or adjust the path to where you expect the file to be.

pdf_loader = PyPDFLoader(file_path)

# Create document chunks and embeddings
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)

print(f"Documents from PDF: {len(docs_from_pdf)}.")
inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)
print(f"Inserted {len(inserted_ids_from_pdf)} documents.")

/content/what-is-philosophy.pdf
sample_data  secure-connect-new-vector.zip  what-is-philosophy.pdf
Documents from PDF: 38.
Inserted 38 documents.
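
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on natural separators such as paragraphs); the function name `split_with_overlap` is a hypothetical helper for this sketch only.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=64):
    """Cut `text` into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk, which tends to help retrieval.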


In [26]:
#Create the prompt and chain
retriever = vstore.as_retriever(search_kwargs={"k": 3})

philo_template = """
You are a philosopher that draws inspiration from great thinkers of the past
to craft well-thought answers to user questions. Use the provided context as the basis
for your answers and do not make up new reasoning paths - just mix-and-match what you are given.
Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.

CONTEXT:
{context}

QUESTION: {question}

YOUR ANSWER:"""

philo_prompt = ChatPromptTemplate.from_template(philo_template)

llm = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | philo_prompt
    | llm
    | StrOutputParser()
)



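Running the cell above emits the following deprecation warning. It is not an error, but the message recommends switching to the `langchain-openai` package: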
```
LangChainDeprecationWarning: The class `langchain_community.chat_models.openai.ChatOpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
```

In [28]:
!pip install -U langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.1.1-py3-none-any.whl (32 kB)
Collecting langchain-core<0.2.0,>=0.1.33 (from langchain-openai)
  Downloading langchain_core-0.1.40-py3-none-any.whl (276 kB)
Installing collected packages: langchain-core, langchain-openai
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 0.1.31
    Uninstalling langchain-core-0.1.31:
      Successfully uninstalled langchain-core-0.1.31
  Attempting uninstall: langchain-openai
    Found existing installation: langchain-openai 0.0.8
    Uninstalling langchain-openai-0.0.8:
      Successfully uninstalled langchain-openai-0.0.8
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ragstack-ai 0.10.0 requires langchain-

In [29]:
from langchain_openai import ChatOpenAI

In [30]:
#Create the prompt and chain
retriever = vstore.as_retriever(search_kwargs={"k": 3})

philo_template = """
You are a philosopher that draws inspiration from great thinkers of the past
to craft well-thought answers to user questions. Use the provided context as the basis
for your answers and do not make up new reasoning paths - just mix-and-match what you are given.
Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.

CONTEXT:
{context}

QUESTION: {question}

YOUR ANSWER:"""

philo_prompt = ChatPromptTemplate.from_template(philo_template)

llm = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | philo_prompt
    | llm
    | StrOutputParser()
)
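
To make the data flow of the chain more concrete, the following plain-Python sketch shows roughly what happens on `chain.invoke(question)`: the question goes to the retriever, the retrieved text fills `{context}`, and the unchanged question fills `{question}`. The `fake_retriever`, `format_docs` and `build_prompt` helpers are hypothetical stand-ins rather than LangChain APIs, the template is shortened, and no model is called.

```python
TEMPLATE = "CONTEXT:\n{context}\n\nQUESTION: {question}\n\nYOUR ANSWER:"


def fake_retriever(question, k=3):
    """Hypothetical stand-in for the vector store retriever: returns up to k text chunks."""
    corpus = [
        "We become what we contemplate.",
        "The quality of life is determined by its activities.",
    ]
    return corpus[:k]


def format_docs(chunks):
    """Join retrieved chunks into one context string."""
    return "\n\n".join(chunks)


def build_prompt(question, retriever=fake_retriever):
    # Mirrors {"context": retriever, "question": RunnablePassthrough()} | prompt
    context = format_docs(retriever(question))
    return TEMPLATE.format(context=context, question=question)


print(build_prompt("What shapes a good life?"))
```

In the real chain, the filled prompt is then sent to the chat model and `StrOutputParser` extracts the answer text from the model's reply.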

In [31]:
#Run the whole chain and answer question
chain.invoke("How does Russell elaborate on Peirce's idea of the security blanket?")



"Russell elaborates on Peirce's idea of the security blanket by describing how individuals without a philosophical mindset are confined by prejudices derived from common sense and habitual beliefs, leading to a sense of imprisonment in their own limited understanding of the world."

### Cleanup
If you want to remove everything this notebook stored in your DSE 7 instance, run the following cell.

_(You will lose the data you stored in it.)_

In [32]:
vstore.delete_collection()

### Learn more

For extended quickstarts and additional usage examples of the LangChain `Cassandra` vector store, please visit the [CassIO documentation](https://cassio.org/frameworks/langchain/about/).