# Introduction

This notebook demonstrates the use of a _Hybrid Knowledge Store_.
This combines the benefits of a traditional vector store (locating nodes by vector similarity) with the benefits of a knowledge graph (connecting relevant but not necessarily similar information).

It loads a PDF, splits it into chunks, and writes the chunks to the Knowledge Store using the standard LangChain patterns.
The only addition is the extraction of "keywords" using [keybert](https://maartengr.github.io/KeyBERT/index.html).
Chunks that share a keyword are linked, which demonstrates one way of connecting chunks.

Other ways that chunks could be linked:

- Using TF-IDF, rather than keybert, to compute keywords from chunks.
- Using hyperlinks (`<a href="...">`) in the content, together with each page's URL, to connect chunks that explicitly link to one another. This would even work with anchors within a page!
- Connecting images and tables on a page to the other content on that page.

In [1]:
# (Optional) When developing locally, this reloads the module code when changes are made,
# making it easier to iterate.
%load_ext autoreload
%autoreload 2

## Environment

In [2]:
# (Required in Colab) Install the knowledge graph library from the repository.
# This will also install the dependencies.
%pip install https://github.com/datastax-labs/knowledge-store/archive/main.zip
%pip install langchain langchainhub langchain_openai==0.1.7 cassio python-dotenv pypdf langchain-text-splitters keybert 

Pick one of the following options:

1. If you're just running the notebook, it's probably best to run the cell that uses `getpass` to set the necessary
   environment variables.
2. If you're developing, it's likely easiest to create a `.env` file and store the necessary credentials there.

In [None]:
# (Option 1) - Set the environment variables from getpass.
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key: ")
os.environ["ASTRA_DB_DATABASE_ID"] = input("Enter Astra DB Database ID: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass.getpass("Enter Astra DB Application Token: ")

keyspace = input("Enter Astra DB Keyspace (Empty for default): ")
if keyspace:
    os.environ["ASTRA_DB_KEYSPACE"] = keyspace
else:
    os.environ.pop("ASTRA_DB_KEYSPACE", None)

In [2]:
# (Option 2) - Load the `.env` file.
# See `env.template` for an example of what you should have there.
import dotenv
dotenv.load_dotenv()

True
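
Whichever option is used, it can help to confirm that the credentials are actually present before connecting. The following is a minimal sketch: the variable names are the ones set in the Option 1 cell, while the helper `missing_env_vars` is a hypothetical name introduced only for this example.

```python
import os

# Names taken from the Option 1 cell; ASTRA_DB_KEYSPACE is optional there.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "ASTRA_DB_DATABASE_ID",
    "ASTRA_DB_APPLICATION_TOKEN",
)


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```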

In [4]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

## Initialize Astra DB Knowledge Store

In [5]:
# Initialize the Cassandra connection from environment variables.
import cassio
cassio.init(auto=True)

In [6]:
# Create graph store.
from knowledge_store import KnowledgeStore
knowledge_store = KnowledgeStore(embeddings)

# Ingest Documents
In this section we ingest documents into the hybrid knowledge store.
We use `keybert` to extract keywords from each chunk; chunks that share a keyword are then linked automatically.
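
To make the idea of linking by shared keywords concrete, here is a small standalone sketch. It is an illustration under assumptions, not the `knowledge_store` implementation: the function `link_by_keywords` and the chunk ids are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def link_by_keywords(chunk_keywords):
    """Map each chunk id to the set of chunk ids it shares a keyword with."""
    # Inverted index: keyword -> ids of the chunks tagged with it.
    by_keyword = defaultdict(set)
    for chunk_id, kws in chunk_keywords.items():
        for kw in kws:
            by_keyword[kw].add(chunk_id)

    # Every pair of chunks that shares a keyword gets an undirected edge.
    links = {chunk_id: set() for chunk_id in chunk_keywords}
    for ids in by_keyword.values():
        for a, b in combinations(sorted(ids), 2):
            links[a].add(b)
            links[b].add(a)
    return links


example_keywords = {
    "chunk-0": ["layout", "detection"],
    "chunk-1": ["detection", "training"],
    "chunk-2": ["ocr"],
}
links = link_by_keywords(example_keywords)
```

Here `chunk-0` and `chunk-1` are linked through the shared keyword `detection`, while `chunk-2` has no links.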

In [8]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    length_function=len,
    is_separator_regex=False,
)

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split(text_splitter)
pages

[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in

In [9]:
from keybert import KeyBERT

kw_model = KeyBERT()
keywords = kw_model.extract_keywords([doc.page_content for doc in pages],
                                     stop_words='english')

for (doc, kws) in zip(pages, keywords):
    # Consider only taking keywords within a certain distance?
    doc.metadata["keywords"] = [kw for (kw, _) in kws]
pages[0]

Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in 
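
The comment in the loop above asks whether only some keywords should be kept. One possible reading is a minimum-score cutoff on the `(keyword, score)` pairs that the loop unpacks. The sketch below assumes that reading; the helper name `filter_keywords`, the threshold of 0.4, and the sample scores are all hypothetical.

```python
def filter_keywords(scored_keywords, min_score=0.4):
    """Keep keywords whose score is at least min_score, preserving order."""
    return [kw for kw, score in scored_keywords if score >= min_score]


# Hypothetical scores in the (keyword, score) shape unpacked by the loop above.
scored = [("layoutparser", 0.62), ("document", 0.45), ("toolkit", 0.31)]
strong = filter_keywords(scored)
```

A stricter cutoff yields fewer links between chunks, so the right value would need to be tuned against the documents being ingested.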

In [10]:
knowledge_store.add_documents(pages)

# Retrieval
In this section, we'll set up a retrieval chain using the knowledge store.

We can configure how many chunks the vector search retrieves, as well as how deep to traverse the keyword edges from those chunks.
With a depth of 0, the hybrid knowledge store behaves like a plain vector store.
With a depth of 1 or 2, it can also retrieve chunks that are related to the vector-search results but not necessarily similar to the query.
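
The effect of the depth setting can be sketched with a toy example: pick the top-k chunks by cosine similarity, then follow edges breadth-first for up to `depth` hops. This is only an illustration of the concept and not how `KnowledgeStore` is implemented; the function `retrieve`, the two-dimensional vectors, and the edge map are assumptions made for the example.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, vectors, keyword_edges, k=1, depth=1):
    """Return the top-k chunk ids by similarity plus neighbors up to `depth` hops away."""
    ranked = sorted(
        vectors, key=lambda cid: cosine(query_vec, vectors[cid]), reverse=True
    )
    seeds = ranked[:k]
    found = list(seeds)
    seen = set(seeds)
    queue = deque((cid, 0) for cid in seeds)
    while queue:
        cid, dist = queue.popleft()
        if dist == depth:
            continue  # Do not expand beyond the requested depth.
        for neighbor in sorted(keyword_edges.get(cid, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, dist + 1))
    return found


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
toy_edges = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
toy_query = [0.9, 0.1]

depth0 = retrieve(toy_query, toy_vectors, toy_edges, k=1, depth=0)
depth1 = retrieve(toy_query, toy_vectors, toy_edges, k=1, depth=1)
depth2 = retrieve(toy_query, toy_vectors, toy_edges, k=1, depth=2)
```

At depth 0 only the most similar chunk `a` is returned. Depth 1 adds `b`, and depth 2 also reaches `c`, even though `c` points away from the query vector.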

In [9]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

In [15]:
# Retrieve and generate using the relevant snippets of the PDF.
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

retriever0 = knowledge_store.as_retriever(depth=0)
retriever1 = knowledge_store.as_retriever(depth=1)

print(f"Retrieval: {retriever0.invoke('How does LayoutParser work?')}")

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    print(f"Docs: {docs}")
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain0 = (
    {"context": retriever0 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

partial_chain = (
    {"context": retriever0 | format_docs, "question": RunnablePassthrough()}
    | prompt
)

print(f"Partial: {partial_chain.invoke('How does LayoutParser work?')}")

rag_chain1 = (
    {"context": retriever1 | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Retrieval: [Document(page_content='the regions are automatically annotated with high conﬁdence predictions from\nthe layout detection model. This allows a layout dataset to be created more\neﬃciently with only around 60% of the labeling budget.\nAfter the training dataset is curated, LayoutParser supports diﬀerent modes\nfor training the layout models. Fine-tuning can be used for training models on a\nsmall newly-labeled dataset by initializing the model with existing pre-trained\nweights. Training from scratch can be helpful when the source dataset and\ntarget are signiﬁcantly diﬀerent and a large training set is available. However, as\nsuggested in Studer et al.’s work[ 33], loading pre-trained weights on large-scale\ndatasets like ImageNet [ 5], even from totally diﬀerent domains, can still boost\nmodel performance. Through the integrated API provided by LayoutParser ,\nusers can easily compare model performances on the benchmark datasets.', metadata={'content_id': 'c3a393b18b358160

In [16]:
from langchain.globals import set_debug

set_debug(True)

rag_chain0.invoke("How does LayoutParser work?")

[chain/start] [chain:RunnableSequence] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question>] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnablePassthrough] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/end] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnablePassthrough] [0ms] Exiting Chain run with output:
{
  "output": "How does LayoutParser work?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnableSequence] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/start] [chain:

ValueError: Got unsupported message type: ('messages', [HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: How does LayoutParser work? \nContext: the regions are automatically annotated with high conﬁdence predictions from\nthe layout detection model. This allows a layout dataset to be created more\neﬃciently with only around 60% of the labeling budget.\nAfter the training dataset is curated, LayoutParser supports diﬀerent modes\nfor training the layout models. Fine-tuning can be used for training models on a\nsmall newly-labeled dataset by initializing the model with existing pre-trained\nweights. Training from scratch can be helpful when the source dataset and\ntarget are signiﬁcantly diﬀerent and a large training set is available. However, as\nsuggested in Studer et al.’s work[ 33], loading pre-trained weights on large-scale\ndatasets like ImageNet [ 5], even from totally diﬀerent domains, can still boost\nmodel performance. Through the integrated API provided by LayoutParser ,\nusers can easily compare model performances on the benchmark datasets.\n\nLayoutParser easy to learn and use.\nAllenNLP [ 8] and transformers [ 34] have provided the community with complete\nDL-based support for developing and deploying models for general computer\nvision and natural language processing problems. LayoutParser , on the other\nhand, specializes speciﬁcally in DIA tasks. LayoutParser is also equipped with a\ncommunity platform inspired by established model hubs such as Torch Hub [23]\nandTensorFlow Hub [1]. It enables the sharing of pretrained models as well as\nfull document processing pipelines that are unique to DIA tasks.\nThere have been a variety of document data collections to facilitate the\ndevelopment of DL models. 
Some examples include PRImA [ 3](magazine layouts),\nPubLayNet [ 38](academic paper layouts), Table Bank [ 18](tables in academic\npapers), Newspaper Navigator Dataset [ 16,17](newspaper ﬁgure layouts) and\nHJDataset [31](historical Japanese document layouts). A spectrum of models\n\nthe regions are automatically annotated with high conﬁdence predictions from\nthe layout detection model. This allows a layout dataset to be created more\neﬃciently with only around 60% of the labeling budget.\nAfter the training dataset is curated, LayoutParser supports diﬀerent modes\nfor training the layout models. Fine-tuning can be used for training models on a\nsmall newly-labeled dataset by initializing the model with existing pre-trained\nweights. Training from scratch can be helpful when the source dataset and\ntarget are signiﬁcantly diﬀerent and a large training set is available. However, as\nsuggested in Studer et al.’s work[ 33], loading pre-trained weights on large-scale\ndatasets like ImageNet [ 5], even from totally diﬀerent domains, can still boost\nmodel performance. Through the integrated API provided by LayoutParser ,\nusers can easily compare model performances on the benchmark datasets.\n\nLayoutParser easy to learn and use.\nAllenNLP [ 8] and transformers [ 34] have provided the community with complete\nDL-based support for developing and deploying models for general computer\nvision and natural language processing problems. LayoutParser , on the other\nhand, specializes speciﬁcally in DIA tasks. LayoutParser is also equipped with a\ncommunity platform inspired by established model hubs such as Torch Hub [23]\nandTensorFlow Hub [1]. It enables the sharing of pretrained models as well as\nfull document processing pipelines that are unique to DIA tasks.\nThere have been a variety of document data collections to facilitate the\ndevelopment of DL models. 
Some examples include PRImA [ 3](magazine layouts),\nPubLayNet [ 38](academic paper layouts), Table Bank [ 18](tables in academic\npapers), Newspaper Navigator Dataset [ 16,17](newspaper ﬁgure layouts) and\nHJDataset [31](historical Japanese document layouts). A spectrum of models \nAnswer:")])

In [30]:
rag_chain1.invoke("How does LayoutParser work?")

[chain/start] [chain:RunnableSequence] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question>] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnablePassthrough] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/end] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnablePassthrough] [0ms] Exiting Chain run with output:
{
  "output": "How does LayoutParser work?"
}
[chain/start] [chain:RunnableSequence > chain:RunnableParallel<context,question> > chain:RunnableSequence] Entering Chain run with input:
{
  "input": "How does LayoutParser work?"
}
[chain/start] [chain:

AttributeError: 'tuple' object has no attribute 'input_variables'