# Introduction

This notebook demonstrates the use of a _Hybrid Graph Store_.
It combines the benefits of a traditional vector store (locating nodes by vector similarity) with those of a graph (connecting relevant but not necessarily similar information).

The notebook loads a PDF, chunks it, and writes the chunks to the graph store using standard LangChain patterns.
The only addition is the extraction of keywords using [keybert](https://maartengr.github.io/KeyBERT/index.html).
This shows one way chunks may be linked: chunks that share a keyword are connected.

Other ways that chunks could be linked:

- Using TF-IDF to compute keywords from chunks, rather than keybert.
- Using hyperlinks (`<a href="...">`) in the content, together with each chunk's URL, to connect chunks that explicitly link to one another. This would even work with anchors within a page!
- Connecting images and tables on a page to the other content on the page.
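
As an illustration of the TF-IDF alternative, here is a minimal sketch using only the standard library. The function name `tfidf_keywords`, the naive tokenizer, and the sample chunks are assumptions made for this example; a production setup would more likely use an established TF-IDF implementation and proper stop-word handling.

```python
import math
import re
from collections import Counter


def tfidf_keywords(chunks, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each chunk."""
    # Naive tokenizer (an assumption): lowercase alphabetic words of 3+ letters.
    tokenized = [re.findall(r"[a-z]{3,}", chunk.lower()) for chunk in chunks]

    # Document frequency: in how many chunks does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_chunks = len(chunks)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_chunks / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


chunks = [
    "Layout detection models locate text regions in scanned documents.",
    "OCR engines convert text regions into machine readable strings.",
    "Layout detection models are trained on annotated documents.",
]
for kws in tfidf_keywords(chunks):
    print(kws)
```

Terms that appear in every chunk score zero, so they rank below terms that are distinctive to only some chunks. The resulting lists could be stored under `metadata["keywords"]` in the same way as the keybert output.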

In [None]:
# (Optional) When developing locally, this reloads the module code
# when changes are made, making it easier to iterate.
%load_ext autoreload
%autoreload 2

## Environment

In [None]:
# (Required in Colab) Install the graph store library from the repository.
# This will also install the dependencies.
%pip install ragstack-ai-knowledge-store

Pick one of the following.
1. If you're just running the notebook, it's probably best to run the cell using `getpass` to set the necessary
   environment variables.
1. If you're developing, it's likely easiest to create a `.env` file and store the necessary credentials.

In [None]:
# (Option 1) - Set the environment variables from getpass.
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key: ")
os.environ["ASTRA_DB_DATABASE_ID"] = input("Enter Astra DB Database ID: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass.getpass(
    "Enter Astra DB Application Token: "
)

keyspace = input("Enter Astra DB Keyspace (Empty for default): ")
if keyspace:
    os.environ["ASTRA_DB_KEYSPACE"] = keyspace
else:
    os.environ.pop("ASTRA_DB_KEYSPACE", None)

In [None]:
# (Option 2) - Load the `.env` file.
# See `env.template` for an example of what you should have there.
%pip install python-dotenv
import dotenv

dotenv.load_dotenv()
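
Whichever option is used, it can help to confirm that the credentials are present before continuing. The sketch below checks the three variables set in Option 1; the helper `missing_env_vars` is a hypothetical name introduced here, not part of any library.

```python
import os

# Variable names taken from the Option 1 cell above.
REQUIRED = (
    "OPENAI_API_KEY",
    "ASTRA_DB_DATABASE_ID",
    "ASTRA_DB_APPLICATION_TOKEN",
)


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```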

In [None]:
%pip install langchain_openai

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

## Initialize Astra DB Graph Store

In [None]:
# Initialize the Cassandra connection from environment variables.
import cassio

cassio.init(auto=True)

In [None]:
# Create graph store.
from ragstack_langchain.graph_store import CassandraGraphStore

graph_store = CassandraGraphStore(embeddings)

# Ingest Documents
In this section we ingest documents into the hybrid graph store.
We'll use `keybert` to extract keywords; chunks that share a keyword are then linked automatically.
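
To make the idea of keyword links concrete, here is a toy in-memory sketch of how shared keywords imply edges between chunks. It is not the graph store's implementation; `keyword_links` and the sample chunk ids are hypothetical names used only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def keyword_links(chunk_keywords):
    """Map each pair of chunk ids to the keywords they share."""
    # Inverted index: keyword -> ids of the chunks tagged with it.
    by_keyword = defaultdict(set)
    for chunk_id, keywords in chunk_keywords.items():
        for keyword in keywords:
            by_keyword[keyword].add(chunk_id)

    # Every pair of chunks under the same keyword gets an edge.
    links = defaultdict(set)
    for keyword, chunk_ids in by_keyword.items():
        for a, b in combinations(sorted(chunk_ids), 2):
            links[(a, b)].add(keyword)
    return dict(links)


chunk_keywords = {
    "chunk-0": ["layout", "detection"],
    "chunk-1": ["ocr", "layout"],
    "chunk-2": ["ocr", "storage"],
}
print(keyword_links(chunk_keywords))
```

Here `chunk-0` and `chunk-1` are linked through `layout`, and `chunk-1` and `chunk-2` through `ocr`, even though `chunk-0` and `chunk-2` share nothing directly.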

In [None]:
%pip install pypdf langchain-text-splitters keybert langchain-community

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    length_function=len,
    is_separator_regex=False,
)

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split(text_splitter)
pages

In [None]:
from keybert import KeyBERT

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    [doc.page_content for doc in pages], stop_words="english"
)

for doc, kws in zip(pages, keywords):
    # All extracted keywords are kept; a score cutoff could prune weak ones.
    doc.metadata["keywords"] = [kw for (kw, _) in kws]
pages[0]
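
The cell above discards the score that comes with each keyword. One possible refinement is to keep only keywords above a threshold. The sketch below assumes the `(keyword, score)` pair shape unpacked above; the helper `filter_keywords`, the cutoff of `0.4`, and the sample scores are illustrative assumptions, not tuned values.

```python
def filter_keywords(scored_keywords, min_score=0.4):
    """Keep the keywords whose score is at least min_score."""
    return [kw for kw, score in scored_keywords if score >= min_score]


# Hypothetical (keyword, score) pairs for a single chunk.
scored = [("layout", 0.62), ("parser", 0.48), ("figure", 0.21)]
print(filter_keywords(scored))
```

Applied inside the loop, this would replace the list comprehension with `filter_keywords(kws)`, so weakly related keywords no longer create links between chunks.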

In [None]:
graph_store.add_documents(pages)

# Retrieval
In this section, we'll set up a retrieval chain using the graph store.

We can configure how many chunks are retrieved by the vector search as well as how deep to traverse the keyword edges.
If we traverse to depth 0, the hybrid graph store is equivalent to a vector store.
With a depth of 1 or 2, we can also retrieve chunks that are related to the initial results but not similar to the query.
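
The effect of the depth setting can be sketched with a toy breadth-first expansion. This is an in-memory illustration under assumed data, not the graph store's actual traversal; `traverse`, `neighbors`, and `seeds` are hypothetical names.

```python
def traverse(seeds, neighbors, depth):
    """Expand seed chunk ids along edges, up to `depth` hops."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        # Follow edges from the current frontier to chunks not yet seen.
        frontier = {
            nxt for chunk_id in frontier for nxt in neighbors.get(chunk_id, ())
        } - visited
        if not frontier:
            break
        visited |= frontier
    return visited


# Hypothetical keyword edges between four chunks.
neighbors = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}
seeds = ["a"]  # stand-in for the chunks returned by the vector search

for depth in range(3):
    print(depth, sorted(traverse(seeds, neighbors, depth)))
```

At depth 0 only the seed chunks come back, which matches plain vector search; each additional level of depth adds the chunks one more hop away.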

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

In [None]:
# Retrieve and generate using the relevant chunks of the PDF.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

retriever0 = graph_store.as_retriever(depth=0)
retriever1 = graph_store.as_retriever(depth=1)

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain0 = (
    {"context": retriever0 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain1 = (
    {"context": retriever1 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
rag_chain0.invoke("How does LayoutParser work?")

In [None]:
rag_chain1.invoke("How does LayoutParser work?")