# Introduction

This notebook demonstrates the use of a _Hybrid Knowledge Store_.
This combines the benefits of a traditional vector store (locating nodes by vector similarity) with the benefits of a knowledge graph (connecting relevant but not necessarily similar information).

It demonstrates loading a PDF, chunking it and writing it to the Knowledge Store using the standard LangChain patterns.
The only addition is the extraction of "keywords" using [keybert](https://maartengr.github.io/KeyBERT/index.html).
Chunks that share a keyword are linked, which is one example of how chunks may be connected.

Other ways that chunks could be linked:

- Using TF-IDF to compute keywords from chunks, rather than keybert.
- Using hyperlinks (`<a href="...">`) in the content, together with the URL of each chunk, to connect chunks that explicitly link to one another. This would even work with anchors within a page!
- Connecting images and tables on a page to the other content on the page.
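
As a rough illustration of the TF-IDF alternative listed above, the sketch below scores terms using only the standard library. The tokenizer, the `tfidf_keywords` name, and the sample chunks are assumptions made for this example; they are not part of the knowledge store or of keybert.

```python
import math
import re
from collections import Counter


def tfidf_keywords(chunks, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each chunk."""
    # Naive tokenizer (an assumption): lowercase alphabetic runs of 3+ letters.
    tokenized = [re.findall(r"[a-z]{3,}", chunk.lower()) for chunk in chunks]

    # Document frequency: the number of chunks each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_chunks = len(chunks)

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_chunks / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


# Made-up chunks, for illustration only.
chunks = [
    "LayoutParser is a toolkit for document image analysis.",
    "The toolkit ships layout detection models for document images.",
    "Vector stores retrieve chunks by embedding similarity.",
]
for chunk_keywords in tfidf_keywords(chunks, top_n=3):
    print(chunk_keywords)
```

Terms that occur in every chunk receive a score of zero, so they rank last and rarely end up as linking keywords.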

In [1]:
# (Optional) When developing locally, this reloads the module code when changes are made,
# making it easier to iterate.
%load_ext autoreload
%autoreload 2

## Environment

In [2]:
# (Required in Colab) Install the knowledge graph library from the repository.
# This will also install the dependencies.
%pip install https://github.com/datastax-labs/knowledge-store/archive/main.zip

Pick one of the following.
1. If you're just running the notebook, it's probably best to run the cell using `getpass` to set the necessary
   environment variables.
1. If you're developing, it's likely easiest to create a `.env` file and store the necessary credentials.

In [None]:
# (Option 1) - Set the environment variables from getpass.
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key: ")
os.environ["ASTRA_DB_DATABASE_ID"] = input("Enter Astra DB Database ID: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass.getpass("Enter Astra DB Application Token: ")

keyspace = input("Enter Astra DB Keyspace (Empty for default): ")
if keyspace:
    os.environ["ASTRA_DB_KEYSPACE"] = keyspace
else:
    os.environ.pop("ASTRA_DB_KEYSPACE", None)

In [3]:
# (Option 2) - Load the `.env` file.
# See `env.template` for an example of what you should have there.
%pip install python-dotenv
import dotenv
dotenv.load_dotenv()

True
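
Whichever option is used, a quick check along the following lines can confirm that the credentials are visible to the kernel before continuing. This is only a sketch: the `missing_env_vars` helper is hypothetical, and the variable names are the ones set in the Option 1 cell.

```python
import os

# Variable names taken from the Option 1 cell; ASTRA_DB_KEYSPACE is optional.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "ASTRA_DB_DATABASE_ID",
    "ASTRA_DB_APPLICATION_TOKEN",
)


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```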

In [4]:
%pip install langchain_openai

In [5]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

## Initialize Astra DB Knowledge Store

In [6]:
# Initialize the Cassandra connection from environment variables.
import cassio
cassio.init(auto=True)

In [7]:
# Create graph store.
from knowledge_store import KnowledgeStore
knowledge_store = KnowledgeStore(embeddings)

# Ingest Documents
In this section we ingest documents into the hybrid knowledge store.
We'll use `keybert` to extract keywords from each chunk; chunks that share a keyword are then linked automatically.
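
To make the idea of keyword links concrete, here is a small standalone sketch of how shared keywords can induce edges between chunks. It illustrates the concept only and is not the knowledge store's actual implementation; the `keyword_links` function, the chunk ids, and the keywords are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def keyword_links(chunk_keywords):
    """Map each pair of chunk ids to the set of keywords they share."""
    # Inverted index: keyword -> ids of chunks tagged with it.
    by_keyword = defaultdict(set)
    for chunk_id, keywords in chunk_keywords.items():
        for keyword in keywords:
            by_keyword[keyword].add(chunk_id)

    # Every pair of chunks under the same keyword gets an undirected edge.
    links = defaultdict(set)
    for keyword, chunk_ids in by_keyword.items():
        for a, b in combinations(sorted(chunk_ids), 2):
            links[(a, b)].add(keyword)
    return dict(links)


# Hypothetical chunk ids and keywords, for illustration only.
example = {
    "chunk-0": ["layout", "detection"],
    "chunk-1": ["detection", "ocr"],
    "chunk-2": ["ocr", "layout"],
    "chunk-3": ["training"],
}
for pair, shared in sorted(keyword_links(example).items()):
    print(pair, sorted(shared))
```

In this example `chunk-3` shares no keyword with any other chunk, so it has no edges and could only be reached through vector similarity.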

In [8]:
%pip install pypdf langchain-text-splitters keybert langchain-community

In [9]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    length_function=len,
    is_separator_regex=False,
)

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split(text_splitter)
pages

[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in

In [10]:
from keybert import KeyBERT

kw_model = KeyBERT()
keywords = kw_model.extract_keywords([doc.page_content for doc in pages],
                                     stop_words='english')

for (doc, kws) in zip(pages, keywords):
    # Consider only taking keywords within a certain distance?
    doc.metadata["keywords"] = [kw for (kw, _) in kws]
pages[0]

Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in 
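
The comment in the keyword cell asks whether only sufficiently close keywords should be kept. One possible approach, sketched here under the assumption that a higher score means the keyword is closer to the chunk, is to drop pairs below a cutoff before storing them in the metadata. The `filter_keywords` helper, the `0.4` threshold, and the sample scores are made up for illustration.

```python
def filter_keywords(scored_keywords, min_score=0.4):
    """Keep keywords whose score is at least min_score (a hypothetical cutoff)."""
    return [kw for kw, score in scored_keywords if score >= min_score]


# Made-up (keyword, score) pairs shaped like the tuples unpacked in the cell above.
scored = [("layoutparser", 0.62), ("toolkit", 0.41), ("university", 0.18)]
print(filter_keywords(scored))
```

A stricter cutoff yields fewer keywords per chunk and therefore fewer links, so a suitable value would need to be tuned against the documents being ingested.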

In [11]:
knowledge_store.add_documents(pages)

# Retrieval
In this section, we'll set up a retrieval chain using the knowledge store.

We can configure how many chunks are retrieved by the vector search as well as how deep to traverse the keyword edges.
If we traverse to depth 0, the hybrid knowledge store is equivalent to a vector store.
With a depth of 1 or 2, we can also retrieve chunks that are related but not necessarily similar.
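
The effect of the depth setting can be pictured as a breadth-first expansion from the chunks found by the vector search. The sketch below is a simplified model, not the knowledge store's actual traversal code; the `traverse` function, the node names, and the edges are hypothetical.

```python
from collections import deque


def traverse(seeds, edges, depth):
    """Return node ids reachable from `seeds` within `depth` hops, in visit order."""
    visited = {seed: 0 for seed in seeds}  # node id -> hops from nearest seed
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if visited[node] == depth:
            continue  # do not expand past the requested depth
        for neighbor in edges.get(node, ()):
            if neighbor not in visited:
                visited[neighbor] = visited[node] + 1
                queue.append(neighbor)
    return list(visited)


# Hypothetical keyword edges between chunks (undirected, listed both ways).
edges = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}
seeds = ["A"]  # stands in for the chunks returned by the vector search

print(traverse(seeds, edges, depth=0))
print(traverse(seeds, edges, depth=1))
print(traverse(seeds, edges, depth=2))
```

At depth 0 only the seed chunks come back, which matches plain vector search; each additional level of depth adds the chunks one more keyword edge away.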

In [12]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

In [14]:
# Retrieve and generate using the relevant snippets of the paper.
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

retriever0 = knowledge_store.as_retriever(depth=0)
retriever1 = knowledge_store.as_retriever(depth=1)

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain0 = (
    {"context": retriever0 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain1 = (
    {"context": retriever1 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

ValidationError: 1 validation error for RunnableBinding
bound
  instance of Runnable expected (type=type_error.arbitrary_type; expected_arbitrary_type=Runnable)

In [None]:
rag_chain0.invoke("How does LayoutParser work?")

In [None]:
rag_chain1.invoke("How does LayoutParser work?")