<a href="https://colab.research.google.com/github/YoshiyukiKono/gen_ai-sandbox/blob/main/Graph_RAG_example_Playground_trial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Graph RAG -- Connecting Knowledge for the LLM

* Brian Godsey, DataStax -- brian.godsey@datastax.com

---------------

NOTE: This notebook was built locally and then uploaded here. The dataset may not be present in this notebook's environment, in which case it needs to be re-downloaded.

---------------

Graph RAG is particularly powerful for tasks where context and relationships within data are crucial. It's well-suited for applications where related pieces of information appear in very different documents or contexts, and thus may not be close semantic neighbors in a vector DB.

It is also useful in natural language processing tasks that require deep understanding and contextualization, such as question answering, where the context provided by the graph data can significantly enhance the quality and relevance of the generated responses.

### Motivation

* To explore the possibilities for graph RAG on example datasets
* To experiment with building and using knowledge graphs (KGs) for graph RAG
* To compare results from graph RAG against plain RAG for various query types
* To discover how and where graph RAG can be useful beyond plain RAG

### Key Points

* There are many ways to customize and tailor graph RAG to your datasets
* The KG significantly impacts graph RAG performance
* Building a KG dynamically can be tricky
* Graph RAG can make connections and answer queries where plain RAG fails

### Further Notes

* The default implementations in LlamaIndex have some issues
* Specifically: keyword extraction/matching, triplet extraction, case sensitivity, unidirectional traversal, default prompts
* Adding a little more sophistication to a few of these could make significant improvements
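
To make two of the issues listed above concrete (case sensitivity and unidirectional traversal), here is a minimal sketch in plain Python. It is not LlamaIndex code: the helper names `normalize_entity` and `build_bidirectional_graph`, and the `inverse of:` label for reverse edges, are assumptions made for illustration only.

```python
def normalize_entity(name: str) -> str:
    """Lowercase and collapse whitespace so 'Bats' and ' bats ' map to one node."""
    return " ".join(name.lower().split())


def build_bidirectional_graph(triplets):
    """Return {entity: [(predicate, neighbor), ...]} with one reverse edge per triplet."""
    toy_graph = {}
    for subj, pred, obj in triplets:
        s, o = normalize_entity(subj), normalize_entity(obj)
        toy_graph.setdefault(s, []).append((pred.lower(), o))
        # Hypothetical labeling convention for the reverse direction.
        toy_graph.setdefault(o, []).append((f"inverse of: {pred.lower()}", s))
    return toy_graph


toy_triplets = [
    ("Sarbecoviruses", "naturally infect", "Bats"),
    ("SARS-CoV-2", "is closely related to", "sarbecoviruses"),
]
toy_graph = build_bidirectional_graph(toy_triplets)
print(toy_graph["bats"])
# [('inverse of: naturally infect', 'sarbecoviruses')]
```

With normalization, `Sarbecoviruses` and `sarbecoviruses` collapse into one node, and the reverse edge lets a traversal that starts at `bats` reach `sarbecoviruses`.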


## Set Up Environments and APIs

### Package installations

In [1]:
%pip install llama-index
%pip install llama-index-vector-stores-astra

%pip install networkx


Collecting llama-index
  Downloading llama_index-0.10.29-py3-none-any.whl (6.9 kB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.2.2-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.11-py3-none-any.whl (26 kB)
Collecting llama-index-core<0.11.0,>=0.10.29 (from llama-index)
  Downloading llama_index_core-0.10.29-py3-none-any.whl (15.4 MB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.7-py3-none-any.whl (6.0 kB)
Collecting llama-index-indices-managed-llama-cloud<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.1.5-py3-none-any.whl (6.7 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading 

### Package imports and config parameters

In [2]:
# basic imports
import os
import sys
import logging
import networkx as nx
from IPython.display import display, Markdown


logging.basicConfig(
    stream=sys.stdout, level=logging.INFO
)  # logging.DEBUG for more verbose output


# parameters for the knowledge graph and storage
space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], [
    "relationship"
]  # defaults; can be omitted when creating from an empty KG
tags = ["entity"]  # default; can be omitted when creating from an empty KG


# HELPER FUNCTIONS

# define prompt viewing function
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}<br>" f"**Text:** <br>"
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown("<br><br>"))


### Load Datasets

If the data isn't in this environment, you can find it here:

https://github.com/run-llama/llama-datasets/tree/main/llama_datasets/origin_of_covid19


In [3]:
# create directories and download file
!mkdir -p origin_of_covid19/source_files

!curl -L -o origin_of_covid19/source_files/OriginOfCovid19.pdf https://github.com/run-llama/llama-datasets/raw/main/llama_datasets/origin_of_covid19/source_files/OriginOfCovid19.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 2549k  100 2549k    0     0  2591k      0 --:--:-- --:--:-- --:--:-- 2591k


In [4]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset

PATH_TO_DATA = './'

dataset_path = 'origin_of_covid19/source_files'

#rag_dataset = LabelledRagDataset.from_json("./origin_of_covid_data/data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir=PATH_TO_DATA + dataset_path).load_data()

In [6]:
documents

[Document(id_='748e97dc-5d06-4147-8079-f77aaa7ce905', embedding=None, metadata={'page_label': '1', 'file_name': 'OriginOfCovid19.pdf', 'file_path': '/content/origin_of_covid_data/source_files/OriginOfCovid19.pdf', 'file_type': 'application/pdf', 'file_size': 2610336, 'creation_date': '2024-04-16', 'last_modified_date': '2024-04-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Am. J. Trop. Med. Hyg. , 103(3), 2020, pp. 955 –959\ndoi:10.4269/ajtmh.20-0849Copyright © 2020 by The American Society of Tropical Medicine and Hygiene\nPerspective Piece\nThe Origin of COVID-19 and Why It Matters\nDavid M. Morens,1,2* Joel G. Breman,3Charles H. Calisher,4Peter C. Doherty,5Beatrice H. Hahn,6,7Gerald T. Keusch,8,9,10\nLaura D. Kramer,11,12James W. LeDuc,13Thomas

#### Download data (broken externally 2024-03-22)

In [5]:
from llama_index.core.llama_dataset import download_llama_dataset

# This download step broke on 2024-03-22 with a JSONDecodeError;
# the cause is still unclear.


# datasets from here: https://llamahub.ai/?tab=llama_datasets
rag_dataset, documents = download_llama_dataset(
    # "PatronusAIFinanceBenchDataset", "./patronus_finance_data" # this dataset takes a long time to download (>10min)
    # "PaulGrahamEssayDataset", "./paul_graham_data"  # small dataset
    "OriginOfCovid19Dataset", "./origin_of_covid_data"
    #"DocugamiKgRagSec10Q", "./docugami_finance_10Q_data"  # https://github.com/docugami/KG-RAG-datasets
)

### Prepare LLM: OpenAI

In [7]:
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass(
    "\nPlease enter your OpenAI API Key (e.g. 'sk-...'):"
)


# define LLM
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(
    temperature=0,
    #model="gpt-3.5-turbo",
    model="gpt-4",
)
Settings.chunk_size = 512


Please enter your OpenAI API Key (e.g. 'sk-...'):··········


### Prepare Graph DB

#### SimpleGraphStore

Instead of a full-featured graph DB, we'll use an in-memory graph store object for simplicity.

In [8]:
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.core import StorageContext

graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

## Build the Knowledge Graph

### Create a custom triplet extraction prompt (optional)

In [9]:
from llama_index.core.prompts.base import PromptTemplate
from llama_index.core.prompts.prompt_type import PromptType


kg_prompt_template_string = (
    "Some text is provided below. "
    "Given the text, extract up to "
    "{max_knowledge_triplets} "
    "knowledge triplets in the form of (subject, predicate, object). Avoid stopwords.\n"

    "* Keep the subject and predicate simple, and not too long.\n"
    "* Focus on people, places, animals, organisms, diseases, organizations.\n"
    "* Return triplet text in all lowercase.\n"
    "* Also include triplets in the reverse relationship direction.\n"

    "\n\n---------------------\n"
    "Examples:"
    "\n\nText: Sarbecoviruses are a group of viruses"
    "\nTriplets:\n"
    "(sarbecoviruses, are a group of, viruses)\n"
    "(viruses, include, sarbecoviruses)\n"

    "\n\nText: Sarbecoviruses are a group of viruses that naturally infect bats and pangolins."
    "\nTriplets:\n"
    "(sarbecoviruses, are a group of, viruses)\n"
    "(viruses, include, sarbecoviruses)\n"
    "(sarbecoviruses, naturally infect, bats)\n"
    "(bats, are naturally infected by, sarbecoviruses)\n"
    "(sarbecoviruses, naturally infect, pangolins)\n"
    "(pangolins, are naturally infected by, sarbecoviruses)\n"

    "\n---------------------\n"
    "\nText: {text}\n"
    "\nTriplets:\n"
)

kg_triplet_template = PromptTemplate(
    kg_prompt_template_string,
    prompt_type=PromptType.KNOWLEDGE_TRIPLET_EXTRACT
)
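
The prompt above asks the LLM to return one `(subject, predicate, object)` triplet per line. As a rough sketch of how such text could be turned into tuples, the snippet below uses a regular expression. This is an illustration only, not the parser LlamaIndex uses internally; `parse_triplets` and `sample_output` are hypothetical names, and the approach drops any triplet whose parts contain commas or parentheses.

```python
import re

# Matches the text between one pair of parentheses with no nested parentheses.
TRIPLET_PATTERN = re.compile(r"\(([^()]+)\)")


def parse_triplets(llm_output: str):
    """Extract well-formed (subject, predicate, object) tuples from raw LLM text."""
    triplets = []
    for match in TRIPLET_PATTERN.finditer(llm_output):
        parts = [part.strip().lower() for part in match.group(1).split(",")]
        # Keep only entries with exactly three non-empty parts.
        if len(parts) == 3 and all(parts):
            triplets.append(tuple(parts))
    return triplets


sample_output = (
    "(sarbecoviruses, naturally infect, bats)\n"
    "(bats, are naturally infected by, sarbecoviruses)\n"
    "(malformed, triplet)\n"
)
parsed = parse_triplets(sample_output)
print(parsed)
```

The malformed two-part entry is skipped, so `parsed` holds only the two valid triplets.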

### Build the KG

This usually takes between 2 and 10 minutes, depending on the LLM used and the KG parameters.

In [10]:
from llama_index.core import KnowledgeGraphIndex


kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    kg_triple_extract_template=kg_triplet_template,
    storage_context=storage_context,
    max_triplets_per_chunk=20,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

### Explore the KG

In [11]:
graph = kg_index.get_networkx_graph()

print('Number of nodes: ', len(list(graph.nodes)))

Number of nodes:  88


In [None]:
# write the graph to a file
# nx.write_graphml(graph, "kg.graphml")

In [12]:
#list(graph.nodes)[0:20]
sorted(graph.degree, key=lambda x: x[1], reverse=True) #[0:20]

[('Sars-cov-2', 22),
 ('Sarbecoviruses', 9),
 ('Bats', 7),
 ('Scientists', 7),
 ('David m. morens', 4),
 ('Laura d. kramer', 3),
 ('Covid-19 pandemic', 3),
 ('American society of tropical medicine and hygiene', 2),
 ('Charles h. calisher', 2),
 ('Peter c. doherty', 2),
 ('Beatrice h. hahn', 2),
 ('Department of medicine', 2),
 ('Gerald t. keusch', 2),
 ('Sarbecovirus', 2),
 ('Similar coronavirus outbreaks', 2),
 ('Page_label', 1),
 ('1', 1),
 ('File_path', 1),
 ('/content/origin_of_covid_data/source_files/originofcovid19.pdf', 1),
 ('Am. j. trop. med. hyg.', 1),
 ('103(3', 1),
 ('American committee on arthropod-borne viruses', 1),
 ('National institute of allergy and infectious diseases', 1),
 ('Dm270q@nih.gov', 1),
 ('Joel g. breman', 1),
 ('Arthropod-borne and infectious diseases laboratory', 1),
 ('Colorado state university', 1),
 ('Department of microbiology and immunology', 1),
 ('University of melbourne at the doherty institute', 1),
 ('Perelman school of medicine', 1),
 ('Boston

In [13]:
list(graph.edges(data=True))

[('Page_label', '1', {'label': 'Is', 'title': 'Is'}),
 ('File_path',
  '/content/origin_of_covid_data/source_files/originofcovid19.pdf',
  {'label': 'Is', 'title': 'Is'}),
 ('Am. j. trop. med. hyg.', '103(3', {'label': 'Is', 'title': 'Is'}),
 ('David m. morens',
  'American committee on arthropod-borne viruses',
  {'label': 'Includes', 'title': 'Includes'}),
 ('David m. morens',
  'National institute of allergy and infectious diseases',
  {'label': 'Includes', 'title': 'Includes'}),
 ('David m. morens',
  'Dm270q@nih.gov',
  {'label': 'Has email', 'title': 'Has email'}),
 ('David m. morens',
  'American society of tropical medicine and hygiene',
  {'label': 'Includes', 'title': 'Includes'}),
 ('American society of tropical medicine and hygiene',
  'Joel g. breman',
  {'label': 'Includes', 'title': 'Includes'}),
 ('Charles h. calisher',
  'Arthropod-borne and infectious diseases laboratory',
  {'label': 'Includes', 'title': 'Includes'}),
 ('Charles h. calisher',
  'Colorado state univer
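
The degree ranking shown earlier can be reproduced from an edge list like the one above. The sketch below counts, for each node, how many edges touch it, which is what degree means for an undirected graph without self-loops. `degree_ranking` is a hypothetical helper, and `toy_edge_list` reuses three edges from the output above.

```python
from collections import Counter


def degree_ranking(edges):
    """Return (node, degree) pairs sorted by descending degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree.most_common()


toy_edge_list = [
    ("David m. morens", "American committee on arthropod-borne viruses"),
    ("David m. morens", "National institute of allergy and infectious diseases"),
    ("David m. morens", "Dm270q@nih.gov"),
]
ranking = degree_ranking(toy_edge_list)
print(ranking[0])
# ('David m. morens', 3)
```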

## Graph RAG

### Build Retriever and Query Engine

In [14]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import KnowledgeGraphRAGRetriever, KGTableRetriever


GRAPH_TRAVERSAL_DEPTH = 3
graph_rag_retriever = KGTableRetriever(
    kg_index,
    storage_context=storage_context,
    verbose=True,
    retriever_mode='embedding', #'keyword', 'embedding'
    graph_traversal_depth=GRAPH_TRAVERSAL_DEPTH,
    use_global_node_triplets=True,
    include_text=True,
)
# Set the depth attribute explicitly: the graph_traversal_depth keyword above
# does not always take effect, and the retriever reads graph_store_query_depth
graph_rag_retriever.graph_store_query_depth = GRAPH_TRAVERSAL_DEPTH

graph_query_engine = RetrieverQueryEngine.from_args(
    graph_rag_retriever,
    retriever_mode='embedding', #"keyword",
    response_mode='tree_summarize', # "simple_summarize" "compact", "tree_summarize",
)
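
To show what a traversal depth of 3 means, here is a simplified breadth-first sketch in plain Python. It illustrates depth-limited traversal in general and does not reproduce the LlamaIndex retriever; `collect_triplets` and the `toy_edges` data are made up for the example.

```python
from collections import deque


def collect_triplets(edges, start, max_depth):
    """Walk outgoing edges breadth-first, up to max_depth hops from start."""
    adjacency = {}
    for subj, pred, obj in edges:
        adjacency.setdefault(subj, []).append((pred, obj))

    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes that are already max_depth hops away
        for pred, neighbor in adjacency.get(node, []):
            found.append((node, pred, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return found


# Toy edges for illustration; not taken from the notebook's KG.
toy_edges = [
    ("sars-cov-2", "is closely related to", "bat coronaviruses"),
    ("bat coronaviruses", "naturally infect", "bats"),
    ("bats", "are hosts of", "sarbecoviruses"),
]
print(len(collect_triplets(toy_edges, "sars-cov-2", max_depth=1)))  # 1
print(len(collect_triplets(toy_edges, "sars-cov-2", max_depth=3)))  # 3
```

A larger depth pulls in more distant facts, which can help connect loosely related topics but also adds more context for the LLM to sift through.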


In [None]:
# check some properties of the Retriever
#dir(graph_rag_retriever)
#graph_rag_retriever._include_text

#### Alternate Constructions of KG Retrievers and Query Engines

In [None]:
# graph_rag_retriever = KnowledgeGraphRAGRetriever(
#     storage_context=storage_context,
#     verbose=True,
#     retriever_mode='keyword',   # 'embedding', is not implemented in LlamaIndex yet
#     graph_traversal_depth=3,
#     include_text=True,
# )

# graph_query_engine = kg_index.as_query_engine(
#     include_text=False,
#     retriever_mode="keyword",
#     response_mode="tree_summarize",
# )

### Querying with Graph RAG

In [15]:
response = graph_query_engine.query(
    'What is the origin of the COVID virus?'
    #'What do we know about the Sars-cov-2 virus?',
    #'Tell me about Sarbecoviruses',
)
display(Markdown(f"<b>{response}</b>"))

Extracted keywords: ['origin', 'COVID', 'virus']
KG context:
The following are knowledge sequence in max depth 3 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...`
('Theories', 'About', 'Hypothetical man-made origin of sars-cov-2')
('Bat-origin coronavirus', 'Emerged in', 'China')

<b>The COVID-19 virus, also known as SARS-CoV-2, is believed to have originated from bats. It emerged as a natural event associated with either direct transmission of a bat coronavirus to humans or indirect transmission to humans via an intermediate host such as a Malaysian pangolin or another yet-to-be-identified mammal. Theories about a hypothetical man-made origin of SARS-CoV-2 have been thoroughly discredited by multiple coronavirus experts.</b>
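
The verbose output above lists the keywords extracted from the query, which then have to be matched against entity names in the KG. The sketch below shows one simple way to do that while ignoring case. It is an assumption-laden illustration, not the matching logic LlamaIndex uses; `match_entities` is a hypothetical helper, and the entity names are copied from the KG output in this notebook.

```python
def match_entities(keywords, entity_names):
    """Return entity names that contain any keyword, ignoring case."""
    lowered_keywords = [kw.lower() for kw in keywords if kw.strip()]
    matches = []
    for name in entity_names:
        lowered_name = name.lower()
        if any(kw in lowered_name for kw in lowered_keywords):
            matches.append(name)
    return matches


entity_names = ["Sars-cov-2", "Covid-19 pandemic", "Bat-origin coronavirus", "Bats"]
matched = match_entities(["origin", "COVID", "virus"], entity_names)
print(matched)
# ['Covid-19 pandemic', 'Bat-origin coronavirus']
```

Note that `Sars-cov-2` is not matched by any of the three keywords, which illustrates why keyword extraction and matching have a large effect on what the retriever finds. Substring matching can also over-match short keywords, so treat this as a starting point only.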

#### Digging into the Graph RAG Response

##### Examine the Response object

In [16]:
#dir(response)
#response.get_formatted_sources()
# response.metadata
# response.response

# print('\n', response.source_nodes[1].text)
for source_node in response.source_nodes:
    print('\n', source_node.text)



 19,21,22Many scientists have proposed aggressive
monitoring of known hotspots to try to predict and prevent viral
emergence that might impact human health, including early
Unfortunately, outside of some members of the scienti ﬁc
community, there has been little interest and no sense of
from the very same SARS-like bat virus group that had been
warned about by multiple voices for over a decade —emerged
and proceeded to cause the COVID-19 pandemic that nowsweeps the globe.
SARS-CoV-2 emerged essentially as predicted: a natural
event associated with either direct transmission of a batcoronavirus to humans or indirect transmission to humans via
an intermediate host such as a Malaysian pangolin ( Manis
javanica ) or another, yet-to-be-identi ﬁed mammal.
28–31
It should be clari ﬁed that theories about a hypothetical man-
made origin of SARS-CoV-2 have been thoroughly discredited
by multiple coronavirus experts.21,28,29SARS-CoV-2 contains
neither the genetic ﬁngerprints of any of the rever

##### Inspect Retrieval-Only Results

In [19]:
resp = graph_rag_retriever.retrieve(
    'What do we know about the Sars-cov-2 virus?',
    #'Tell me about sarbecoviruses',
    #'What do we know about Animal-to-human host-switching in Sars-cov-2?',
 )

resp

Extracted keywords: ['Sars-cov-2 virus.', 'Sars', 'cov', '2', 'know', 'virus']
KG context:
The following are knowledge sequence in max depth 3 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...`
('Sars-cov-2', 'Is', 'Unlike any previously identified coronavirus')
('Sars-cov-2', 'Is closely related to', 'Numerous bat and pangolin coronaviruses')

[NodeWithScore(node=TextNode(id_='06aa8c5a-303e-448e-a6f9-683fad6c9e87', embedding=None, metadata={'page_label': '2', 'file_name': 'OriginOfCovid19.pdf', 'file_path': '/content/origin_of_covid_data/source_files/OriginOfCovid19.pdf', 'file_type': 'application/pdf', 'file_size': 2610336, 'creation_date': '2024-04-16', 'last_modified_date': '2024-04-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='b01a4e65-f6db-46e6-91a8-193400f0799e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '2', 'file_name': 'OriginOfCovid19.pdf', 'file_path': '/content/origin_of_covid_data/source_files/OriginOfCovid19.pdf', 'file_type': 'application/pdf', 'file_size': 2610336, 'creation_date': '2024-04-16', 'last_m

In [20]:
print(resp[0].text)

And most recently, at least as earlyas late November 2019, SARS-CoV-2 was recognized andbecame the third fatal bat virus –associated human diseaseHCoV
−NL63HCoV
−229ESADS
−CoVSARS
−CoV
−1Bat
−CoV
−RaTG13 SARS
−CoV
−2GD
−Pangolin
−CoVGX
−Pangolin
−CoV
HCoV−HKU1HCoV−OC43
FCoVMERS
−CoV
0.187100
97
100Beta-CoV
Delta-CoV
Gam
m
a-CoV
Alpha-CoV
FIGURE 1. Phylogenetic relationships of selected coronaviruses of
medical and veterinary importan ce. Human SARS-CoV and SARS-CoV-2
are closely related to numerous bat and pangolin coronaviruses in a viralgenetic grouping called sarbecoviruses, which contains many otherviruses very closely related to SARS-CoV and SARS-CoV-2. These viru-s e sb e l o n gt ot h eo r d e r Nidovirales , family Coronaviridae , subfamily
Coronavirinae and the four genera Alphacoronavirus ,Betacoronavirus ,
Gammacoronavirus ,a n d Deltacoronavirus . The betacoronaviruses are
comprised of two subgenera, Sarbecovirus andMerbecovirus .T h ef o r m e r
include SARS-CoV and SARS-C

##### Prompt Inspection and Engineering

In [21]:
list(graph_query_engine.get_prompts().keys())

['response_synthesizer:summary_template']

In [22]:
display_prompt_dict(graph_query_engine.get_prompts())

**Prompt Key**: response_synthesizer:summary_template<br>**Text:** <br>

Context information from multiple sources is below.
---------------------
{context_str}
---------------------
Given the information from multiple sources and not prior knowledge, answer the query.
Query: {query_str}
Answer: 


<br><br>

In [23]:
prompt_key = 'response_synthesizer:summary_template'

print(graph_query_engine.get_prompts()[prompt_key].get_template())

Context information from multiple sources is below.
---------------------
{context_str}
---------------------
Given the information from multiple sources and not prior knowledge, answer the query.
Query: {query_str}
Answer: 
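
The template printed above has two placeholders, `{context_str}` and `{query_str}`. The following sketch shows how such a template is filled in using plain string formatting. How LlamaIndex joins the retrieved chunks into `context_str` is not shown in this notebook, so the blank-line join in `fill_prompt` (a hypothetical helper) is an assumption.

```python
SUMMARY_TEMPLATE = (
    "Context information from multiple sources is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the information from multiple sources and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def fill_prompt(template, context_chunks, query):
    """Join the context chunks and substitute both placeholders."""
    context_str = "\n\n".join(context_chunks)  # assumed separator
    return template.format(context_str=context_str, query_str=query)


prompt = fill_prompt(
    SUMMARY_TEMPLATE,
    ["('Bat-origin coronavirus', 'Emerged in', 'China')"],
    "What is the origin of the COVID virus?",
)
print(prompt)
```

Seeing the filled prompt makes it easier to judge whether the retrieved context actually contains what the LLM needs to answer the query.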


## Plain RAG

### Setup and Load Vector Store

In [24]:
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
)


# build basic RAG system
vector_index = VectorStoreIndex.from_documents(documents=documents)
vector_query_engine = vector_index.as_query_engine()


### Build Retriever and Query Engine

In [25]:
from llama_index.core import get_response_synthesizer
from llama_index.core.indices.vector_store.retrievers import VectorIndexRetriever
from llama_index.core.query_engine.retriever_query_engine import (
    RetrieverQueryEngine,
)

# build retriever
vector_retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=10,
    vector_store_query_mode="default",
    alpha=None,
    doc_ids=None,
)

# build query engine
response_synthesizer = get_response_synthesizer(
    response_mode='simple_summarize',  # options: "simple_summarize", "compact", "tree_summarize"
)

vector_query_engine = RetrieverQueryEngine(
    retriever=vector_retriever,
    response_synthesizer=response_synthesizer
)
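
The retriever above returns the `similarity_top_k=10` chunks most similar to the query. As a rough picture of what top-k similarity retrieval does, here is a plain-Python sketch using cosine similarity over toy two-dimensional vectors. Real embeddings have many more dimensions and the vector store may use a different similarity measure; `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical names for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], toy_vectors, k=2))
# ['chunk_a', 'chunk_b']
```

Because ranking depends only on vector similarity, two chunks that are related through a shared entity but phrased very differently may never appear together in the top k, which is the gap graph RAG tries to fill.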


### Querying with Plain RAG

In [26]:
response = vector_query_engine.query(
    'What is the origin of the COVID virus?'
    #'What do we know about the Sars-cov-2 virus?',
    #'Tell me about Sarbecoviruses',
)

display(Markdown(f"<b>{response}</b>"))

<b>The COVID-19 virus, also known as SARS-CoV-2, evolved directly or indirectly from a β-coronavirus in the sarbecovirus (SARS-like virus) group that naturally infect bats and pangolins in Asia and Southeast Asia. The specific mechanism of its emergence in humans remains unknown.</b>

#### Digging into the Plain RAG Response

##### Inspect Retrieval Results

In [27]:
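resp = vector_retriever.retrieve(
    # NOTE: this finance question appears to be left over from a different dataset;
    # it is off-topic for the COVID-19 paper, so the results below only show
    # what retrieval returns for an unrelated query
    'What did Apple report as its net cash from operating activities in the Q3 2022 10-Q?'
)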

resp

[NodeWithScore(node=TextNode(id_='e4a07f76-522d-4948-8c94-489a743b58b0', embedding=None, metadata={'page_label': '1', 'file_name': 'OriginOfCovid19.pdf', 'file_path': '/content/origin_of_covid_data/source_files/OriginOfCovid19.pdf', 'file_type': 'application/pdf', 'file_size': 2610336, 'creation_date': '2024-04-16', 'last_modified_date': '2024-04-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='748e97dc-5d06-4147-8079-f77aaa7ce905', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '1', 'file_name': 'OriginOfCovid19.pdf', 'file_path': '/content/origin_of_covid_data/source_files/OriginOfCovid19.pdf', 'file_type': 'application/pdf', 'file_size': 2610336, 'creation_date': '2024-04-16', 'last_m

In [28]:
len(resp)

10

In [None]:
# dir(resp[0])

In [29]:
print(resp[0].to_dict()['node']['text'])

Am. J. Trop. Med. Hyg. , 103(3), 2020, pp.


##### Prompt Inspection and Engineering

In [30]:
display_prompt_dict(vector_query_engine.get_prompts())

**Prompt Key**: response_synthesizer:text_qa_template<br>**Text:** <br>

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 


<br><br>

## Comparing Graph RAG to Plain RAG

In [31]:
rag_query = [
    'What is the origin of the COVID virus?',    # graph RAG better

    # -- both plain RAG and graph RAG do well for many queries
    #'What do we know about the Sars-cov-2 virus?',
    #'Tell me about Sarbecoviruses',
    #'What do we know about Animal-to-human host-switching in Sars-cov-2?',
    #'Which national institutes and committees were mentioned in the text?'

    # -- plain RAG does better for some queries with one concise topic
    #'What is the connection between bats and pangolins in COVID research?'
    #'Which universities were involved in Sars-cov-2 research?'
    #'Who is Jeffery Taubenberger?'

    #'Tell me about some members of the Department of medicine.'
    #'How are Beatrice h. hahn and Gerald t. keusch related to each other?'

    # -- graph RAG does better with connecting loosely related topics
    #'Who are some authors named in this text?'
    #'How are Beatrice Hahn and Gerald Keusch related to each other?'  # graph RAG better
    #'How are David Morens and Joel Breman related to each other?'  # plain RAG slightly better
    #'How are Joel Breman and Charles Calisher related?'  # graph RAG better

    # -- MISC
    #'Which of these people worked together: David M. Morens, Joel G. Breman, Charles H. Calisher, Peter C. Doherty, Beatrice H. Hahn, Gerald T. Keusch, Laura D. Kramer, James W. LeDuc, Thomas P. Monath, and Jeffery K. Taubenberger?'
    #'Which university Departments of Medicine are mentioned in the text?'
    #'Tell me about some members of university departments of medicine.'

][0]

print('Query:\n', rag_query)


Query:
 What is the origin of the COVID virus?


### Graph RAG

In [32]:
# graph RAG
response = graph_query_engine.query(rag_query)

display(Markdown(f"<b>{response}</b>"))

Extracted keywords: ['origin', 'COVID', 'virus', 'COVID virus']
KG context:
The following are knowledge sequence in max depth 3 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...`
('Theories', 'About', 'Hypothetical man-made origin of sars-cov-2')
('Bat-origin coronavirus', 'Emerged in', 'China')

<b>The COVID-19 virus, also known as SARS-CoV-2, is believed to have originated from bats. It emerged as a natural event associated with either direct transmission of a bat coronavirus to humans or indirect transmission to humans via an intermediate host such as a Malaysian pangolin or another, yet-to-be-identified mammal. Theories about a hypothetical man-made origin of SARS-CoV-2 have been thoroughly discredited by multiple coronavirus experts.</b>

### Plain RAG

In [33]:
# plain RAG
response = vector_query_engine.query(rag_query)

display(Markdown(f"<b>{response}</b>"))

<b>The COVID-19 virus, also known as SARS-CoV-2, evolved directly or indirectly from a β-coronavirus in the sarbecovirus (SARS-like virus) group that naturally infect bats and pangolins in Asia and Southeast Asia. The specific mechanism of its emergence in humans remains unknown.</b>
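
To compare the two approaches over several queries at once, a small helper like the one below could collect the answers side by side. This is a sketch, not part of the original notebook: `compare_engines` is a hypothetical function, and the stub callables stand in for `graph_query_engine.query` and `vector_query_engine.query` so that the example runs without an LLM.

```python
def compare_engines(engines, queries):
    """Run each query through every engine; return {query: {engine_name: answer}}."""
    results = {}
    for query in queries:
        results[query] = {name: str(ask(query)) for name, ask in engines.items()}
    return results


# Stand-in callables; in the notebook these would be the two engines' query methods.
stub_engines = {
    "graph_rag": lambda q: f"[graph answer to: {q}]",
    "plain_rag": lambda q: f"[vector answer to: {q}]",
}

results = compare_engines(stub_engines, ["What is the origin of the COVID virus?"])
for query, answers in results.items():
    print(query)
    for name, answer in answers.items():
        print(f"  {name}: {answer}")
```

Wrapping each answer in `str()` keeps the results printable even when the engines return response objects rather than plain strings.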