# Integrate LangChain with Astra DB Serverless

For more information, visit the DataStax [Astra DB docs page](https://docs.datastax.com/en/astra/astra-db-vector/integrations/langchain.html).

In [1]:
! pip install --quiet "langchain==0.1.7" "langchain-astradb>=0.0.1" \
    "langchain-openai==0.0.6" "datasets==2.17.1" "pypdf==4.0.2" \
    "python-dotenv==1.0.1"

## Secrets

Example values:
- API Endpoint: `"https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com"`
- Token: `"AstraCS:6gBhNmsk135..."` (it must have a role of at least "Database Administrator")
- OpenAI API key: `sk-4fQ3F...`

In [2]:
import os
from getpass import getpass
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("ASTRA_DB_APPLICATION_TOKEN = ")
os.environ["ASTRA_DB_API_ENDPOINT"] = input("ASTRA_DB_API_ENDPOINT = ")
os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")

ASTRA_DB_APPLICATION_TOKEN =  ········
ASTRA_DB_API_ENDPOINT =  https://3ccff27f-315f-400d-a5c6-1cd424252ed9-us-east1.apps.astra.datastax.com
OPENAI_API_KEY =  ········


## Dependencies

In [3]:
from langchain_astradb import AstraDBVectorStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

from datasets import load_dataset

In [4]:
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

## Create the embeddings model and vector store with its collection

In [5]:
embedding = OpenAIEmbeddings()
vstore = AstraDBVectorStore(
    embedding=embedding,
    collection_name="test",
    token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
    api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
)

## Load a small dataset

In [6]:
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
print("An example entry:")
print(philo_dataset[16])

An example entry:
{'author': 'aristotle', 'quote': 'Love well, be loved and do something of value.', 'tags': 'love;ethics'}


## Transform into LangChain "Documents"

In [7]:
docs = []
for entry in philo_dataset:
    metadata = {"author": entry["author"]}
    if entry["tags"]:
        # Add metadata tags to the metadata dictionary
        for tag in entry["tags"].split(";"):
            metadata[tag] = "y"
    # Add a LangChain document with the quote and metadata tags
    doc = Document(page_content=entry["quote"], metadata=metadata)
    docs.append(doc)
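
To make the metadata layout concrete, here is a minimal standalone sketch of the same transformation applied to the example entry printed earlier. The helper name `build_metadata` is hypothetical, introduced only for illustration; it needs no database or LangChain import.

```python
def build_metadata(entry: dict) -> dict:
    """Mirror the loop above: the author plus one "y" flag per tag."""
    metadata = {"author": entry["author"]}
    if entry["tags"]:
        for tag in entry["tags"].split(";"):
            metadata[tag] = "y"
    return metadata


# The example entry shown in the dataset-loading cell.
example_entry = {
    "author": "aristotle",
    "quote": "Love well, be loved and do something of value.",
    "tags": "love;ethics",
}
example_metadata = build_metadata(example_entry)
print(example_metadata)
# {'author': 'aristotle', 'love': 'y', 'ethics': 'y'}
```

Storing each tag as its own key means a single tag can be matched with a plain equality condition, in the same way the `author` key is used for filtering later in this notebook.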

## Compute vector embedding and store entries

In [8]:
inserted_ids = vstore.add_documents(docs)
print(f"\nInserted {len(inserted_ids)} documents.")


Inserted 450 documents.


## Run a similarity search (to verify the integration)

In [9]:
results = vstore.similarity_search("Our life is what we make of it", k=3)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* We are what we are because we have been what we have been. [{'author': 'freud', 'history': 'y'}]
* We become what we contemplate. [{'author': 'plato', 'knowledge': 'y', 'ethics': 'y'}]
* In the blessings as well as in the ills of life, less depends upon what befalls us than upon the way in which it is met. [{'author': 'schopenhauer', 'knowledge': 'y', 'ethics': 'y'}]


## Further usage patterns

### Use `add_texts`

With `add_texts` you can specify the document IDs yourself, so rerunning the insertion does not create duplicate entries.

In [10]:
texts = [
    "I think, therefore I am.",
    "To the things themselves!",
]
metadatas = [
    {"author": "descartes", "knowledge": "y"},
    {"author": "husserl", "knowledge": "y"},
]
ids = [
    "desc_01",
    "huss_xy",
]
inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)
print(f"\nInserted {len(inserted_ids_2)} documents.")


Inserted 2 documents.
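
If you would rather not hand-pick IDs, one option is to derive them from the text itself, so the same text always maps to the same ID. The following is a minimal sketch under that assumption; `stable_id` and its `prefix` parameter are hypothetical names, not part of LangChain or Astra DB.

```python
import hashlib


def stable_id(text: str, prefix: str = "quote") -> str:
    """Derive a repeatable ID from the text content (hypothetical helper)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}_{digest}"


sample_texts = [
    "I think, therefore I am.",
    "To the things themselves!",
]
sample_ids = [stable_id(text) for text in sample_texts]

# Running the derivation again yields exactly the same IDs.
assert sample_ids == [stable_id(text) for text in sample_texts]
print(sample_ids)
```

A list built this way could be passed as the `ids` argument of `add_texts`. Note that two entries with identical text would share an ID, which may or may not be what you want.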


### Return similarity scores from a search

In [11]:
results = vstore.similarity_search_with_score("Our life is what we make of it", k=3)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

* [SIM=0.933999] We are what we are because we have been what we have been. [{'author': 'freud', 'history': 'y'}]
* [SIM=0.931921] We become what we contemplate. [{'author': 'plato', 'knowledge': 'y', 'ethics': 'y'}]
* [SIM=0.928508] In the blessings as well as in the ills of life, less depends upon what befalls us than upon the way in which it is met. [{'author': 'schopenhauer', 'knowledge': 'y', 'ethics': 'y'}]
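
For intuition about what a similarity score measures, the sketch below computes plain cosine similarity between two vectors. This assumes the collection compares vectors with a cosine metric; the notebook does not show which metric is configured or how the store scales the number it returns, so treat this as an illustration and not as the store's exact formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: maximal similarity.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Perpendicular vectors: no similarity.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```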


### Similarity search with metadata filtering

In [12]:
results = vstore.similarity_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "aristotle"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* The quality of life is determined by its activities. [{'author': 'aristotle'}]
* You are what you repeatedly do [{'author': 'aristotle'}]
* You are what you do repeatedly. [{'author': 'aristotle'}]
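
Conceptually, the `filter` argument used above restricts the search to documents whose metadata contains the given key-value pairs. A minimal sketch of that equality check on plain dictionaries follows; `matches_filter` is a hypothetical helper, and the actual filtering is done by the database, which may support richer conditions than shown here.

```python
def matches_filter(metadata: dict, flt: dict) -> bool:
    """True when every key in the filter has the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in flt.items())


sample_metadata = [
    {"author": "aristotle"},
    {"author": "plato", "knowledge": "y", "ethics": "y"},
    {"author": "aristotle", "love": "y"},
]
kept = [m for m in sample_metadata if matches_filter(m, {"author": "aristotle"})]
print(kept)
# [{'author': 'aristotle'}, {'author': 'aristotle', 'love': 'y'}]
```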


### MMR (maximal marginal relevance) similarity search

In [13]:
results = vstore.max_marginal_relevance_search(
    "Our life is what we make of it",
    k=3,
    filter={"author": "aristotle"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* The quality of life is determined by its activities. [{'author': 'aristotle'}]
* We must be neither cowardly nor rash but courageous. [{'author': 'aristotle', 'ethics': 'y', 'knowledge': 'y'}]
* Love is composed of a single soul inhabiting two bodies. [{'author': 'aristotle', 'love': 'y'}]
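
The results above are more varied than those of the plain filtered search because MMR trades relevance to the query against redundancy with results already picked. The following is a minimal sketch of the greedy selection on toy 2-D vectors; the function names, the `lambda_mult` weighting, and the vectors are illustrative assumptions, not the library's internal implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick up to k document indices, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Highest similarity to anything already selected (0 if nothing yet).
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.2]
toy_docs = [
    [1.0, 1.0],   # index 0: relevant, but nearly parallel to index 1
    [1.0, 0.9],   # index 1: most relevant
    [1.0, -1.0],  # index 2: less relevant, but very different from index 1
]
print(mmr_select(query, toy_docs, k=2))                   # [1, 2]
print(mmr_select(query, toy_docs, k=2, lambda_mult=1.0))  # [1, 0]
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to a plain top-k by relevance, which is why the near-duplicate at index 0 comes back in that case.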


### Deleting documents from the store

#### Delete by document ID

In [14]:
delete_1 = vstore.delete(inserted_ids[:3])
print(f"delete result = {delete_1}")

delete result = True


In [15]:
delete_2 = vstore.delete(inserted_ids[2:5])
print(f"delete result = {delete_2}")

delete result = True


### Retrieve and then delete

Sometimes you do not have the IDs at hand, but you can run a search and then delete the documents it returns:

In [16]:
ids_to_delete = []
for res_doc, res_score, res_id in vstore.similarity_search_with_score_id(
    "Philosophy has no goals",
    k=2,
):
    print(f"* [SIM={res_score:3f}] {res_doc.page_content} [{res_doc.metadata}]")
    ids_to_delete.append(res_id)

print(f"Deleting IDs = {ids_to_delete} ...")
success = vstore.delete(ids_to_delete)
print(f"Deletion succeeded = {success}")

* [SIM=0.920144] For what purpose humanity is there should not even concern us: why you are here, that you should ask yourself: and if you have no ready answer, then set for yourself goals, high and noble goals, and perish in pursuit of them! [{'author': 'nietzsche', 'ethics': 'y', 'knowledge': 'y'}]
* [SIM=0.920104] Philosophy can make people sick. [{'author': 'aristotle', 'politics': 'y'}]
Deleting IDs = ['6f4a9130dc37448789b62a00d0d9112d', 'ee02055cff454df3a93ffedcc574da1f'] ...
Deletion succeeded = True


Now run the same search again:

In [17]:
for res_doc, res_score, res_id in vstore.similarity_search_with_score_id(
    "Philosophy has no goals",
    k=2,
):
    print(f"* [SIM={res_score:3f}] {res_doc.page_content} [{res_doc.metadata}]")

* [SIM=0.918456] Philosophy is by its nature something esoteric, neither made for the mob nor capable of being prepared for the mob. [{'author': 'hegel'}]
* [SIM=0.916039] The business of philosophy is not to give rules, but to analyze the private judgments of common reason. [{'author': 'kant'}]


### Delete **all** stored data

> _Warning: use with caution. Data loss!_

In [18]:
vstore.clear()

## A full mini-RAG example

The store is now empty. Let us re-populate it, this time by loading from a (locally available) PDF file.

_(The file is an abridged version of a public document found at [this link](https://commons.bellevuecollege.edu/wp-content/uploads/sites/125/2017/04/Intro-to-Phil-full-text.pdf))_

LangChain handles the whole ingestion of the document in the cells below: reading the input PDF, splitting its text into chunks, and computing and storing the embeddings of those chunks.

In [19]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

**(Colab-only) Get the source PDF file**

> You don't need to run the following cell unless you are working in a Google Colab notebook:

In [20]:
# Run this cell if on a Google Colab:
!mkdir -p sources
!curl -L \
    "https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true" \
    -o "sources/what-is-philosophy.pdf"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 55220  100 55220    0     0  84434      0 --:--:-- --:--:-- --:--:-- 84434


### Load the PDF file into the vector store

In [21]:
pdf_loader = PyPDFLoader("sources/what-is-philosophy.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)

print(f"Documents from PDF: {len(docs_from_pdf)}.")
inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)
print(f"Inserted {len(inserted_ids_from_pdf)} documents.")

Documents from PDF: 38.
Inserted 38 documents.
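
To illustrate what `chunk_size=512` and `chunk_overlap=64` mean, here is a simplified sliding-window splitter. It is only a sketch of the two parameters: `RecursiveCharacterTextSplitter` additionally tries to break on natural separators such as paragraph and line breaks, which this hypothetical `sliding_window_chunks` helper ignores.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A synthetic 1000-character text, used only to show the window arithmetic.
sample_text = "".join(chr(ord("a") + i % 26) for i in range(1000))
sample_chunks = sliding_window_chunks(sample_text)
print([len(chunk) for chunk in sample_chunks])
# [512, 512, 104]
```

The overlap means the last 64 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is short enough.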


The question-answering chain is written in LCEL (LangChain Expression Language), so it is ready to be served, for example through `langchain serve`, among other delivery methods:

In [22]:
from langchain_openai import ChatOpenAI

from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

In [23]:
retriever = vstore.as_retriever(search_kwargs={'k': 3})

philo_template = """
You are a philosopher that draws inspiration from great thinkers of the past
to craft well-thought answers to user questions. Use the provided context as the basis
for your answers and do not make up new reasoning paths - just mix-and-match what you are given.
Your answers must be concise and to the point, and refrain from answering about other topics than philosophy.

CONTEXT:
{context}

QUESTION: {question}

YOUR ANSWER:"""

philo_prompt = ChatPromptTemplate.from_template(philo_template)

llm = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()} 
    | philo_prompt 
    | llm 
    | StrOutputParser()
)
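
As a rough mental model of the data flow in the chain above, the sketch below reproduces the same steps with plain Python functions. Everything in it is a stand-in: `fake_retriever`, `build_prompt`, `fake_llm`, and `run_chain` are hypothetical names, the retrieved chunks are placeholder strings, and no model is called. In the real chain the retriever returns LangChain `Document` objects and the LLM step calls OpenAI.

```python
def fake_retriever(question: str) -> list:
    """Stand-in for the retriever: returns placeholder text chunks."""
    return ["first retrieved chunk", "second retrieved chunk"]


def build_prompt(inputs: dict) -> str:
    """Stand-in for the prompt template: fills context and question into text."""
    context = "\n".join(inputs["context"])
    return f"CONTEXT:\n{context}\n\nQUESTION: {inputs['question']}\n\nYOUR ANSWER:"


def fake_llm(prompt: str) -> str:
    """Stand-in for the chat model: returns a canned string."""
    return f"[stub answer to a {len(prompt)}-character prompt]"


def run_chain(question: str) -> str:
    # The dict step: the same input goes to the retriever and is passed through.
    inputs = {"context": fake_retriever(question), "question": question}
    # The remaining steps are applied left to right, like the | operators.
    return fake_llm(build_prompt(inputs))


print(run_chain("What is virtue?"))
```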

In [24]:
chain.invoke("How does Russel elaborate on Peirce's idea of the security blanket?")

"Russell elaborates on Peirce's idea of the security blanket by highlighting how individuals without a philosophical mindset are confined by societal prejudices and habitual beliefs, leading them to cling to comforting ideologies for a false sense of security. This clinging ultimately results in additional worries and anxieties, creating a paradoxical predicament."

## Cleanup

Let us completely delete the collection, thereby freeing the associated resources on Astra DB:

> _Warning: use with caution. Data loss!_

In [25]:
vstore.delete_collection()