# Preliminaries

In [2]:
%pip install markdownify python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


# Load the Astra Documentation into Knowledge Store

First, we'll crawl the DataStax documentation. LangChain includes a `SitemapLoader`, but it loads all of the pages into memory at once, which makes it impractical to index larger sites from small environments (such as Colab). So, we'll scrape the sitemaps ourselves and iterate over the URLs, which lets us process documents in batches and flush them to Astra DB.

## Scrape the URLs from the Site Maps
First, we use Beautiful Soup to parse the XML content of each sitemap and get the list of URLs.
We also add a few extra URLs for external sites that are also useful to include in the index.

In [3]:
# Use sitemaps to crawl the content
SITEMAPS = [
    "https://docs.datastax.com/en/sitemap-astra-db-vector.xml",
    "https://docs.datastax.com/en/sitemap-cql.xml",
    "https://docs.datastax.com/en/sitemap-dev-app-drivers.xml",
    "https://docs.datastax.com/en/sitemap-glossary.xml",
    "https://docs.datastax.com/en/sitemap-astra-db-serverless.xml"
]

# Additional URLs to crawl for content.
EXTRA_URLS = [
    "https://github.com/jbellis/jvector"
]

SITE_PREFIX = "astra"

from bs4 import BeautifulSoup
import requests

def load_pages(sitemap_url):
    r = requests.get(sitemap_url,
                     headers={
                         # Astra docs only return a sitemap with a user agent set.
                         "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0",
                     })
    xml = r.text

    soup = BeautifulSoup(xml, features="xml")
    url_tags = soup.find_all("url")
    for url in url_tags:
        yield url.find("loc").text


# For maintenance purposes, we could check only the new articles since a given time.
URLS = [
    url
    for sitemap_url in SITEMAPS
    for url in load_pages(sitemap_url)
] + EXTRA_URLS
len(URLS)

1368
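
The comment above mentions checking only articles that changed since a given time. A minimal sketch of that idea follows, assuming the sitemaps carry the optional `<lastmod>` element from the sitemap protocol (we have not verified that the DataStax sitemaps do). The function name `urls_modified_since` and the sample sitemap are hypothetical, and the sketch uses only the standard library rather than Beautiful Soup.

```python
from datetime import datetime, timezone
from typing import List
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old.html</loc><lastmod>2024-01-15T00:00:00Z</lastmod></url>
  <url><loc>https://example.com/new.html</loc><lastmod>2024-05-01T12:30:00Z</lastmod></url>
  <url><loc>https://example.com/undated.html</loc></url>
</urlset>
""".strip()


def urls_modified_since(sitemap_xml: str, since: datetime) -> List[str]:
    """Return URLs whose <lastmod> is at or after `since`.

    Entries without <lastmod> are kept, since we cannot tell whether they changed.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if lastmod is None:
            urls.append(loc.strip())
            continue
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        modified = datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
        if modified.tzinfo is None:
            # Assumption: date-only or naive timestamps are treated as UTC.
            modified = modified.replace(tzinfo=timezone.utc)
        if modified >= since:
            urls.append(loc.strip())
    return urls


recent = urls_modified_since(SAMPLE_SITEMAP, datetime(2024, 3, 1, tzinfo=timezone.utc))
print(recent)
```

Keeping undated entries errs on the side of re-indexing too much rather than missing an update; a stricter filter could drop them instead.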

## Load the content from each URL
Next, we define the code that loads each page. For every page, it performs the following steps:

1. Parses the HTML with BeautifulSoup.
2. Locates the "content" of the HTML using an appropriate selector based on the URL.
3. Converts the content to Markdown.
4. Finds the link (`<a href="...">`) tags in the content and collects the absolute URLs (for creating edges).

Adding the URLs of these references to the metadata allows the knowledge store to create edges between the documents.

In [4]:
from langchain_community.document_loaders import AsyncHtmlLoader
from bs4 import BeautifulSoup
from langchain_core.documents import Document
from typing import AsyncIterator, Iterable, Set
from markdownify import MarkdownConverter
from urllib.parse import urlparse, urljoin, urldefrag

def parse_url(link,
              page_url,
              drop_fragment: bool = True):
  href = link.get('href')
  if href is None:
    return None
  url = urlparse(href)
  if url.scheme not in ["http", "https", ""]:
    return None

  # Join the HREF with the page_url to convert relative paths to absolute.
  url = urljoin(page_url, href)

  # Fragments would be useful if we chunked a page based on section.
  # Then, each chunk would have a different URL based on the fragment.
  # Since we aren't doing that yet, they just "break" links. So, drop
  # the fragment.
  if drop_fragment:
    return urldefrag(url).url
  else:
    return url

def parse_hrefs(soup: BeautifulSoup, url: str) -> Set[str]:
  links = soup.find_all('a')
  links = {parse_url(link, page_url=url) for link in links}

  # Remove entries for any 'a' tag that failed to parse (no href,
  # or a scheme other than http/https, etc.)
  links.discard(None)

  # Remove self links.
  links.discard(url)

  return links

def locate_content(soup: BeautifulSoup, url: str) -> BeautifulSoup:
    content = None
    if url.startswith("https://docs.datastax.com/en/"):
        content = soup.select_one("article.doc")
    if url.startswith("https://github.com"):
        content = soup.select_one("article.entry-content")
    assert content is not None, f"Unable to locate content for {url}"
    return content

# Disable autolinks so every link is emitted as `[text](url)`, and use ATX (`#`) headings.
markdown_converter = MarkdownConverter(autolinks=False, heading_style="ATX")

def process_document(html: Document) -> Document:
    url = html.metadata["source"]
    soup = BeautifulSoup(html.page_content, "html.parser")
    content = locate_content(soup, url)

    content_md = markdown_converter.convert_soup(content)
    hrefs = parse_hrefs(content, url)
    return Document(
        page_content = content_md,
        metadata = {
            # Assign the unique ID for the `Document` in the graph.
            "content_id": url,
            # This document references all documents with matching urls.
            "hrefs": hrefs,
            # These are the URLs the document is "defined" at.
            "urls": [url]
        }
    )


async def load_pages(urls: Iterable[str]) -> AsyncIterator[Document]:
    loader = AsyncHtmlLoader(
        urls,
        requests_per_second=4,
        # Astra docs require a user agent
        header_template = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0"
        }
    )
    async for html in loader.alazy_load():
        yield process_document(html)
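
To see how the link normalization in `parse_url` and `parse_hrefs` behaves, here is a small standalone sketch that applies the same three steps (scheme filter, `urljoin`, `urldefrag`) to plain strings instead of Beautiful Soup tags. The helper name `normalize_href`, the page URL, and the sample hrefs are hypothetical and chosen only for illustration.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_href(href: Optional[str], page_url: str) -> Optional[str]:
    """Mirror the logic of `parse_url` above for a raw href string."""
    if href is None:
        return None
    # Skip non-web links such as mailto: or javascript:.
    if urlparse(href).scheme not in ("http", "https", ""):
        return None
    # Resolve relative paths against the page, then drop the #fragment.
    return urldefrag(urljoin(page_url, href)).url


# Hypothetical page URL and hrefs, for illustration only.
PAGE = "https://docs.datastax.com/en/astra/a/b.html"
hrefs = [
    "../c.html#section",                   # relative link with a fragment
    "#top",                                # fragment-only link back to the same page
    "mailto:docs@example.com",             # unsupported scheme
    "https://github.com/jbellis/jvector",  # absolute link
    None,                                  # <a> tag without an href
]

links = {normalize_href(h, PAGE) for h in hrefs}
links.discard(None)  # failed parses
links.discard(PAGE)  # self links
print(sorted(links))
```

Only the relative link and the absolute GitHub link survive: the fragment-only link collapses to the page itself and is discarded as a self link, which is what keeps a page from gaining an edge to itself.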

## Initialize Environment
Before we initialize the Knowledge Store and write the documents, we need to set some environment variables.
In Colab, this will prompt you for input. When running locally, this will load them from `.env`.

In [5]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
    # (Option 1) - Set the environment variables from getpass.
    print("In colab. Using getpass/input for environment variables.")
    import getpass

    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key: ")
    os.environ["ASTRA_DB_DATABASE_ID"] = input("Enter Astra DB Database ID: ")
    os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass.getpass("Enter Astra DB Application Token: ")

    keyspace = input("Enter Astra DB Keyspace (Empty for default): ")
    if keyspace:
        os.environ["ASTRA_DB_KEYSPACE"] = keyspace
    else:
        os.environ.pop("ASTRA_DB_KEYSPACE", None)
else:
    print("Not in colab. Loading '.env' (see 'env.template' for example)")
    import dotenv
    dotenv.load_dotenv()

Not in colab. Loading '.env' (see 'env.template' for example)
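
Before moving on, it can help to confirm that the variables were actually set, since a missing value otherwise only surfaces later as a connection or authentication error. The sketch below is one way to do that. The helper name `missing_env_vars` is hypothetical, and treating exactly these three variables as required is an assumption based on the cell above (`ASTRA_DB_KEYSPACE` is optional there).

```python
import os
from typing import List, Mapping, Sequence

# Assumption: these are the variables the rest of the notebook depends on.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "ASTRA_DB_DATABASE_ID",
    "ASTRA_DB_APPLICATION_TOKEN",
)


def missing_env_vars(
    required: Sequence[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not environ.get(name, "").strip()]


# Demonstration with a stand-in environment rather than the real one.
example_env = {"OPENAI_API_KEY": "sk-example", "ASTRA_DB_DATABASE_ID": ""}
print(missing_env_vars(environ=example_env))
```

In the notebook itself, calling `missing_env_vars()` with no arguments checks the real environment; raising an error when the list is non-empty stops the run before any pages are fetched.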


## Initialize CassIO and Knowledge Store
With the environment variables set, we initialize the CassIO library for talking to Cassandra / Astra DB.
We also create the `KnowledgeStore`.

In [6]:
import cassio
from langchain_openai import OpenAIEmbeddings
from ragstack_knowledge_store import CassandraKnowledgeStore
from ragstack_knowledge_store.directed_edge_extractor import DirectedEdgeExtractor

cassio.init(auto=True)
embeddings = OpenAIEmbeddings()
SITE_PREFIX="astra_docs"
knowledge_store = CassandraKnowledgeStore(
    embeddings,
    edge_extractors = [
        DirectedEdgeExtractor.for_hrefs_to_urls(),
    ],
    node_table=f"{SITE_PREFIX}_nodes",
    edge_table=f"{SITE_PREFIX}_edges")

In [7]:
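# Drop any existing tables for this prefix to start from a clean slate.
# This permanently deletes previously loaded data.
from cassio.config import check_resolve_session, check_resolve_keyspace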
session = check_resolve_session()
keyspace = check_resolve_keyspace()
session.execute(f"DROP TABLE IF EXISTS {keyspace}.{SITE_PREFIX}_nodes")
session.execute(f"DROP TABLE IF EXISTS {keyspace}.{SITE_PREFIX}_edges")

<cassandra.cluster.ResultSet at 0x17fb2da90>

## Load the Documents
Finally, we fetch pages and write them to the knowledge store in batches of 50.

In [7]:
not_found = 0
found = 0

docs = []
async for doc in load_pages(URLS):
    if doc.page_content.startswith("\n# Page Not Found"):
        not_found += 1
        continue

    docs.append(doc)
    found += 1

    if len(docs) >= 50:
        knowledge_store.add_documents(docs)
        docs.clear()

if docs:
    knowledge_store.add_documents(docs)
print(f"{not_found} (of {not_found + found}) URLs were not found")

Fetching pages:  22%|##1       | 300/1368 [00:51<06:37,  2.68it/s]Error fetching https://docs.datastax.com/en/cql/dse/reference/cql-commands/alter-materialized-view.html with attempt 1/3: Cannot connect to host docs.datastax.com:443 ssl:default [nodename nor servname provided, or not known]. Retrying...
Fetching pages: 100%|##########| 1368/1368 [04:21<00:00,  5.23it/s]


96 (of 1368) URLs were not found
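
The batching pattern in the loop above can also be factored into a reusable helper. The following is a minimal sketch: `abatched` and `fake_pages` are hypothetical names, and `fake_pages` stands in for the real page loader so the example runs without network access. Outside a notebook, top-level `async for` is not available, so the sketch drives the loop with `asyncio.run`.

```python
import asyncio
from typing import AsyncIterator, List, TypeVar

T = TypeVar("T")


async def abatched(items: AsyncIterator[T], size: int) -> AsyncIterator[List[T]]:
    """Group an async stream into lists of at most `size` items."""
    batch: List[T] = []
    async for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            # Start a new list instead of clearing, so callers may keep the old one.
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch


async def fake_pages(n: int) -> AsyncIterator[str]:
    """Stand-in for the page loader: yields n placeholder documents."""
    for i in range(n):
        await asyncio.sleep(0)
        yield f"page-{i}"


async def collect_batches() -> List[List[str]]:
    return [batch async for batch in abatched(fake_pages(7), size=3)]


batches = asyncio.run(collect_batches())
print([len(batch) for batch in batches])
```

With such a helper, the write loop reduces to one `add_documents` call per batch, and only one batch of documents is held in memory at a time.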


# Create the RAG Chain
Here, we create two versions of the retriever and chain: one that uses a depth of 0 (equivalent to plain vector search) and another that uses a depth of 1.

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o")

# Depth 0 doesn't traverse edges and is equivalent to vector similarity only.
vector_retriever = knowledge_store.as_retriever(search_kwargs={"depth": 0})

# Depth 1 does vector similarity and then traverses 1 level of edges.
graph_retriever = knowledge_store.as_retriever(search_kwargs={"depth": 1})

template = """You are a helpful technical support bot. You should provide complete answers explaining the options the user has available to address their problem. Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    formatted = "\n\n".join(f"From {doc.metadata['content_id']}: {doc.page_content}" for doc in docs)
    return formatted

vector_rag_chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

graph_rag_chain = (
    {"context": graph_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
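
To make the `depth` parameter concrete, here is a toy sketch of depth-limited traversal over an in-memory link graph. It is only an illustration of the idea, not the knowledge store's implementation, which may rank, limit, or deduplicate results differently. The function `traverse`, the `EDGES` mapping, and the node names are all hypothetical.

```python
from typing import Dict, List


def traverse(seeds: List[str], edges: Dict[str, List[str]], depth: int) -> List[str]:
    """Return the seeds plus every node reachable within `depth` hops."""
    seen = set()
    ordered: List[str] = []
    frontier: List[str] = []
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            ordered.append(seed)
            frontier.append(seed)
    for _ in range(depth):
        next_frontier: List[str] = []
        for node in frontier:
            for target in edges.get(node, []):
                if target not in seen:
                    seen.add(target)
                    ordered.append(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return ordered


# Hypothetical link graph: each key links to the pages in its list.
EDGES = {
    "quickstart": ["concepts", "sai-overview"],
    "concepts": ["jvector"],
    "jvector": ["diskann-design"],
}

# Pretend vector similarity returned these two pages as seeds.
SEEDS = ["quickstart", "concepts"]

print(traverse(SEEDS, EDGES, depth=0))
print(traverse(SEEDS, EDGES, depth=1))
```

At depth 0 only the seeds come back, which matches plain vector search. At depth 1 the pages linked from the seeds are added, which is how a linked page can enter the context even when it is not among the closest vector matches.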

In [23]:
from IPython.display import display, Markdown

# Helper method to render markdown in responses to a chain.
def run_and_render(chain, question):
    result = chain.invoke(question)
    display(Markdown(result))

## Check Retrieval Results

In [22]:
# Set the question and see what documents each technique retrieves.
QUESTION="What vector indexing algorithms does Astra use?"
for doc in vector_retriever.invoke(QUESTION):
    print(f"Vector Only: {doc.metadata['content_id']}")

for doc in graph_retriever.invoke(QUESTION):
    print(f"Traversing:  {doc.metadata['content_id']}")

Vector Only: https://docs.datastax.com/en/cql/astra/getting-started/vector-search-quickstart.html
Vector Only: https://docs.datastax.com/en/astra-db-serverless/get-started/concepts.html


Traversing:  https://docs.datastax.com/en/cql/astra/getting-started/vector-search-quickstart.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/get-started/concepts.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/integrations/semantic-kernel.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/tutorials/recommendations.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/administration/customer-keys-overview.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/administration/manage-database-ip-access-list.html
Traversing:  https://docs.datastax.com/en/cql/astra/developing/indexing/sai/sai-overview.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/api-reference/devops-api.html
Traversing:  https://docs.datastax.com/en/astra-db-serverless/databases/database-overview.html
Traversing:  https://docs.da

## Run the question with depth 0 and 1 

In [20]:
run_and_render(vector_rag_chain, QUESTION)


Astra DB uses multiple indexing techniques to speed up searches in vector databases. The two primary indexing algorithms are:

1. **JVector**: 
   - **Description**: JVector is a vector search engine used by the Serverless (Vector) database to construct a graph index.
   - **Features**: 
     - Adds new documents to the graph immediately, enabling efficient searches right away.
     - Can compress vectors with quantization to save space and improve performance.
   - **More Information**: You can find more details on [JVector on GitHub](https://github.com/jbellis/jvector).

2. **Storage-Attached Index (SAI)**:
   - **Description**: SAI is an indexing technique that efficiently finds rows satisfying query predicates.
   - **Features**: 
     - Provides numeric-, text-, and vector-based indexes to support different kinds of searches.
     - Allows customization of indexes based on specific requirements, such as a particular similarity function or text transformation.
     - Loads a superset of all possible results from storage based on the predicates provided, evaluates the search criteria, and sorts the results by vector similarity.
     - Returns the top `limit` results to the user.
   - **More Information**: More details are available in the [Storage-Attached Indexing (SAI) Overview](https://docs.datastax.com/en/cql/astra/developing/indexing/sai/sai-overview.html).

These indexing techniques work together to enhance the performance and efficiency of vector searches in Astra DB.

In [21]:
run_and_render(graph_rag_chain, QUESTION)

Astra DB Serverless uses the JVector vector search engine to construct a graph index. JVector is a graph-based index that builds on the DiskANN design with composable extensions. It supports various indexing techniques, such as:

1. **Product Quantization (PQ)**: Optionally with anisotropic weighting.
2. **Binary Quantization (BQ)**: Compresses vectors for space efficiency.
3. **Fused ADC**: Uses PQ codebooks transposed and written inline with the graph adjacency list.
4. **LVQ (Lattice Vector Quantization)**: Compresses vectors for more accurate second-pass searches.

These techniques allow Astra DB to perform efficient vector searches, supporting operations like similarity searches for use cases requiring efficient similarity search by leveraging the high-dimensional vector spaces produced by embedding algorithms or neural networks.

With vector-only retrieval, we retrieved chunks from the Astra documentation explaining that it uses JVector.
Since the retriever didn't follow the link to [JVector on GitHub](https://github.com/jbellis/jvector), the vector-only chain didn't actually answer the question.

The graph retrieval started with the same set of chunks, but it followed the edge to the documents we loaded from GitHub.
This let the LLM read in more depth how JVector is implemented, so it could answer the question more clearly and in more detail.