# KG Index and VectorStore Index on Hotpot QA with LlamaIndex


This notebook demonstrates the following:

    1) Building a custom `KnowledgeGraphIndex` from extracted triples.
    2) Building a customized `BaseRetriever` that retrieves KG triples and text documents jointly.
    3) Automated, LLM-based evaluation of KG RAG question answering on `hotpot_qa`.
    
This notebook compares a `VectorIndexRetriever`, a `KGTableRetriever`, and a joint retriever for RAG on the `hotpot_qa` dataset. The triples can be extracted by following `hotpot_qa_extraction.ipynb` or by running `hotpot_qa_kgs.py`.

In [20]:
import os, sys

open_ai_key = '...'
os.environ['OPENAI_API_KEY'] = open_ai_key

sys.path = ['/Users/walder2/kg_uq/'] + sys.path
path_to_data = '/Users/walder2/kg_uq/hotpot_qa_data'

from hotpot_qa_data.hotpot_data_load import load_hotpot_kgs

import numpy as np 

from llama_index import (
    SimpleDirectoryReader,
    ServiceContext,
    KnowledgeGraphIndex,
    VectorStoreIndex,
    get_response_synthesizer,
    QueryBundle,
    Response
)

from llama_index.llms import OpenAI
from llama_index.graph_stores import SimpleGraphStore
from llama_index.schema import NodeWithScore
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.node_parser import SentenceSplitter

# Retrievers
from llama_index.retrievers import (
    BaseRetriever,
    VectorIndexRetriever,
    KGTableRetriever,
)

#Evaluators
from llama_index.evaluation import CorrectnessEvaluator, BaseEvaluator
from typing import Dict, List, Tuple

import time
import asyncio 
import nest_asyncio

nest_asyncio.apply()

In [21]:
kg, query_answer = load_hotpot_kgs(path_to_data=path_to_data, query_answer=True)

### Look at the data

`doc_id` is the id of the question group, and `sub_idx` is the id of the subgraph extracted from context entry `j` for a particular question. Some of the text is messy; we can clean that up at a later time.

Also note that the `file_path` is included. This helps track where the triples came from. It is important to know if you want to track down top-matching subgraphs. 
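
As a rough illustration of how these columns fit together, the sketch below groups a few triples by `(doc_id, sub_idx)` in plain Python and keeps the source file for each subgraph. The triples are taken from the `kg` table in this notebook, but the file names are hypothetical placeholders (the real paths are truncated in the output).

```python
from collections import defaultdict

# Rows in the same shape as the `kg` table. File names are hypothetical
# placeholders, not the real paths.
rows = [
    {"head": "Radio City", "relation": "isFirstPrivateFMStationIn", "tail": "India",
     "file_path": "txt_files/example_0_0.txt", "doc_id": 0, "sub_idx": 0},
    {"head": "Radio City", "relation": "broadcastsFrom", "tail": "Mumbai",
     "file_path": "txt_files/example_0_0.txt", "doc_id": 0, "sub_idx": 0},
    {"head": "Jeff Harris", "relation": "hasOccupation", "tail": "attorney",
     "file_path": "txt_files/example_19_9.txt", "doc_id": 19, "sub_idx": 9},
]

subgraphs = defaultdict(list)  # (doc_id, sub_idx) -> list of (head, relation, tail)
source_file = {}               # (doc_id, sub_idx) -> file the triples came from

for row in rows:
    key = (row["doc_id"], row["sub_idx"])
    subgraphs[key].append((row["head"], row["relation"], row["tail"]))
    source_file[key] = row["file_path"]

for key, triples in subgraphs.items():
    print(key, len(triples), source_file[key])
```

Keying on the pair rather than on `doc_id` alone keeps each context entry's subgraph separate, which is what makes it possible to trace a top-matching subgraph back to one file.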

In [22]:
kg

| Unnamed: 0 | head | head_type | relation | tail | tail_type | file_path | doc_id | sub_idx |
|---|---|---|---|---|---|---|---|---|
| 0 | Radio City | place | isFirstPrivateFMStationIn | India | place | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 0 | 0 |
| 1 | Radio City | place | wasStartedOn | 3 July 2001 | event | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 0 | 0 |
| 2 | Radio City | place | broadcastsOn | 91.1 megahertz | measurement | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 0 | 0 |
| 3 | Radio City | place | broadcastsFrom | Mumbai | place | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 0 | 0 |
| 4 | Radio City | place | broadcastsFrom | Bengaluru | place | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1996 | Jeff Harris | person | hasOccupation | attorney | person | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 19 | 9 |
| 1997 | Jeff Harris | person | hasOccupation | politician | person | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 19 | 9 |
| 1998 | Jeff Harris | person | represents | 23rd District of Missouri | place | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 19 | 9 |
| 1999 | Jeff Harris | person | ranFor | attorney general | person | /Users/walder2/kg_uq/hotpot_qa_data/txt_files/... | 19 | 9 |


# Building a KnowledgeGraphIndex from extracted triples

Pick an LLM to use; the default is gpt-3.5-turbo. The `service_context` handles document chunking and determines which LLM to call. The path for `documents` should point to `'./hotpot_qa_data/txt_files'`.

In [23]:
llm = OpenAI(temperature=0)
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)
documents = SimpleDirectoryReader(path_to_data + '/txt_files').load_data()

Define an empty `KnowledgeGraphIndex`. We will fill this store with our extracted triples, each with a reference to the `Node` that contains the document the triples were extracted from.

In [24]:
kg_index = KnowledgeGraphIndex(
    [],
    service_context=service_context,
)

node_parser = SentenceSplitter()
nodes = node_parser.get_nodes_from_documents(documents)

file_to_node = {node.metadata['file_path']: k for k, node in enumerate(nodes)}
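
One detail of the `file_to_node` comprehension is worth spelling out: it keeps a single node index per file path, so if a file is split into several nodes, later chunks overwrite earlier ones. The notebook does not say whether the text files are short enough to yield one node each, so the sketch below is only an illustration using hypothetical stand-in nodes (plain dicts, not LlamaIndex `Node` objects).

```python
# Hypothetical stand-ins for parsed nodes; each carries a `file_path` in its metadata.
toy_nodes = [
    {"metadata": {"file_path": "a.txt"}, "text": "first chunk of a"},
    {"metadata": {"file_path": "a.txt"}, "text": "second chunk of a"},
    {"metadata": {"file_path": "b.txt"}, "text": "only chunk of b"},
]

# Same shape as the `file_to_node` comprehension: the last chunk of each file wins.
last_node_for_file = {n["metadata"]["file_path"]: k for k, n in enumerate(toy_nodes)}

# Alternative that keeps every chunk index per file.
all_nodes_for_file = {}
for k, n in enumerate(toy_nodes):
    all_nodes_for_file.setdefault(n["metadata"]["file_path"], []).append(k)

print(last_node_for_file)
print(all_nodes_for_file)
```

If every file produces exactly one node, the two mappings carry the same information and the simpler comprehension is sufficient.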

We fill up the `KnowledgeGraphIndex` object by passing in each triple together with the `Node` it was extracted from.

In [25]:
for doc_id in kg['doc_id'].unique():
    idx = kg['doc_id'] == doc_id

    for sub_id in kg[idx]['sub_idx'].unique():
        tmp = kg[np.bitwise_and(idx, kg['sub_idx'] == sub_id)]
        for h, r, t, f in zip(tmp['head'], tmp['relation'], tmp['tail'], tmp['file_path']):
            kg_index.upsert_triplet_and_node((h, r, t), nodes[file_to_node[f]])
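
To make the idea behind the loop above concrete, here is a toy in-memory structure that stores triples per subject and remembers which source node each entity came from. This is a conceptual sketch only: `ToyTripleIndex` and its `upsert` method are hypothetical names, and this is not how LlamaIndex implements `KnowledgeGraphIndex` internally.

```python
from collections import defaultdict


class ToyTripleIndex:
    """Toy stand-in for a KG index. NOT the LlamaIndex implementation."""

    def __init__(self):
        self.rel_map = defaultdict(list)  # subject -> [(relation, object), ...]
        self.node_ids = defaultdict(set)  # entity keyword -> ids of source nodes

    def upsert(self, triplet, node_id):
        subj, rel, obj = triplet
        # Insert the triple only once, so repeated calls are harmless.
        if (rel, obj) not in self.rel_map[subj]:
            self.rel_map[subj].append((rel, obj))
        # Link both entities back to the node the triple was extracted from.
        for keyword in (subj, obj):
            self.node_ids[keyword].add(node_id)


toy = ToyTripleIndex()
toy.upsert(("Radio City", "broadcastsFrom", "Mumbai"), "node-0")
toy.upsert(("Radio City", "broadcastsFrom", "Mumbai"), "node-0")  # duplicate, ignored
toy.upsert(("Radio City", "wasStartedOn", "3 July 2001"), "node-0")

print(toy.rel_map["Radio City"])
print(toy.node_ids["Mumbai"])
```

Keeping the entity-to-node link is what lets a keyword lookup return both the matching triples and the text they came from.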
    
        

# Define JointRetriever for KG triple and text indexing.

In [26]:
class JointRetriever(BaseRetriever):
    """Custom retriever that performs both Vector search and Knowledge Graph search"""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        kg_retriever: KGTableRetriever,
        mode: str = "OR",
    ) -> None:
        """Init params."""

        self._vector_retriever = vector_retriever
        self._kg_retriever = kg_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""

        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        kg_nodes = self._kg_retriever.retrieve(query_bundle)

        vector_ids = {n.node.node_id for n in vector_nodes}
        kg_ids = {n.node.node_id for n in kg_nodes}

        combined_dict = {n.node.node_id: n for n in vector_nodes}
        combined_dict.update({n.node.node_id: n for n in kg_nodes})

        if self._mode == "AND":
            retrieve_ids = vector_ids.intersection(kg_ids)
        else:
            retrieve_ids = vector_ids.union(kg_ids)

        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes
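
The `"AND"` and `"OR"` modes of `JointRetriever` reduce to set intersection and union over node ids. The sketch below shows the same combination logic on plain dictionaries; the node ids and scores are hypothetical, and `combine` is an illustrative helper rather than part of the notebook.

```python
# Hypothetical retrieval results: node id -> score.
vector_hits = {"n1": 0.82, "n2": 0.75}
kg_hits = {"n2": 0.50, "n3": 0.50}


def combine(vector_hits, kg_hits, mode="OR"):
    """Combine two result sets by node id, mirroring JointRetriever's modes."""
    if mode not in ("AND", "OR"):
        raise ValueError("Invalid mode.")
    combined = dict(vector_hits)
    combined.update(kg_hits)  # on overlap the KG entry wins, as in JointRetriever
    if mode == "AND":
        ids = vector_hits.keys() & kg_hits.keys()
    else:
        ids = vector_hits.keys() | kg_hits.keys()
    # Sort ids only to make this example deterministic.
    return {node_id: combined[node_id] for node_id in sorted(ids)}


print(combine(vector_hits, kg_hits, "AND"))
print(combine(vector_hits, kg_hits, "OR"))
```

Note that `JointRetriever` iterates over a set of ids, so the order of the returned nodes is not tied to their scores; sorting by score afterwards would be one way to make the ordering meaningful.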

Create a `VectorStoreIndex` object for RAG with just the text. 

In [27]:
vector_index = VectorStoreIndex.from_documents(documents)

Now we instantiate retrievers for the KG and the text, which are passed to the `JointRetriever` object for joint retrieval.

In [55]:
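# PromptTemplate is not imported in the first cell, so import it here.
from llama_index.prompts import PromptTemplate

qa_template = PromptTemplate(
    "Context information is"
    " below.\n---------------------\n{context_str}\n---------------------\nUsing"
    " only the context information, answer"
    " the question: {query_str}."
    " Be as concise as possible."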
)
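
To see what the model actually receives, it can help to render the template by hand. The sketch below uses plain `str.format` on an equivalent string instead of `PromptTemplate`, on the assumption that the `{context_str}` and `{query_str}` placeholders are filled in the same way at query time; the context and question are made-up examples.

```python
# Plain-string equivalent of the QA template, with a space before the last sentence.
template = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context information, answer the question: {query_str}. "
    "Be as concise as possible."
)

# Hypothetical context and question, for illustration only.
prompt = template.format(
    context_str="('Radio City', 'broadcastsFrom', 'Mumbai')",
    query_str="Which city does Radio City broadcast from",
)
print(prompt)
```

Printing the rendered prompt is a quick way to catch spacing problems between adjacent string fragments.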

In [56]:
# create custom retriever
vector_retriever = VectorIndexRetriever(index=vector_index)
kg_retriever = KGTableRetriever(
    index=kg_index, retriever_mode="keyword", include_text=False
)
joint_retriever = JointRetriever(vector_retriever, kg_retriever)

# create response synthesizer
response_synthesizer = get_response_synthesizer(
    service_context=service_context,
    response_mode="tree_summarize",
    text_qa_template=qa_template
)

Create query engines for all three cases: text, KG, and joint.

In [57]:
joint_query_engine = RetrieverQueryEngine(
    retriever=joint_retriever,
    response_synthesizer=response_synthesizer,
)

vector_query_engine = vector_index.as_query_engine(text_qa_template=qa_template)

# only use triples from the KG
kg_keyword_query_engine = kg_index.as_query_engine(
    include_text=False,
    retriever_mode="keyword",
    response_mode="tree_summarize",
    text_qa_template=qa_template
)

# Evaluation of responses with RAG

Define a `CorrectnessEvaluator` that checks the correctness of a response to the query (with the answer supplied). There are other tools for evaluating responses in the wild, where no reference answer is supplied, e.g.:

`ResponseSourceEvaluator` - uses an LLM to decide if the response is similar enough to the sources -- a good measure for hallucination detection.

`QueryResponseEvaluator` - uses an LLM to decide if a response is similar enough to the original query -- a good measure for checking if the query was answered.


I've defined a function which uses `CorrectnessEvaluator` to check if the response contains an answer suitable for the query, given the correct answer. We can actually write custom evaluators that fit specified guidelines. This will come in handy later when we want to extend the QA to self-defined embeddings, etc.

In [103]:
from llama_index.evaluation import EvaluationResult

In [104]:
async def run_query(query_engine: RetrieverQueryEngine, x: Dict[str, str]):
    return x, await query_engine.aquery(x['query'])

def run_queries(query_engine: RetrieverQueryEngine, queries: List[Dict[str, str]]):
    responses = []
    # Process the queries in batches of 5; `start` is the index of each batch's first query.
    for start in range(0, len(queries), 5):
        batch_queries = queries[start:start+5]
        
        tasks = [run_query(query_engine, x) for x in batch_queries]
        responses.extend(asyncio.run(asyncio.gather(*tasks)))
    return responses
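
The batching pattern in `run_queries` can be tried out without any API calls. The sketch below swaps the query engine for a hypothetical `fake_query` coroutine and runs the batches from a single top-level coroutine, which is the form `asyncio.run` expects in a plain Python script. Inside a notebook, where an event loop is already running, `nest_asyncio.apply()` (as used earlier in this notebook) is what allows `asyncio.run` to be called at all.

```python
import asyncio


async def fake_query(question: str) -> str:
    # Hypothetical stand-in for `query_engine.aquery`.
    await asyncio.sleep(0)
    return f"answer to: {question}"


async def run_in_batches(questions, batch_size=5):
    results = []
    for start in range(0, len(questions), batch_size):
        batch = questions[start:start + batch_size]
        # Run one batch concurrently; gather returns results in input order.
        results.extend(await asyncio.gather(*(fake_query(q) for q in batch)))
    return results


questions = [f"q{i}" for i in range(12)]
answers = asyncio.run(run_in_batches(questions))
print(len(answers), answers[0], answers[-1])
```

Limiting concurrency to a small batch is a simple way to stay under API rate limits, at the cost of waiting for the slowest request in each batch.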

In [105]:
kg_res = run_queries(kg_keyword_query_engine, query_answer)
print('KG responses complete...')
time.sleep(1)
txt_res = run_queries(vector_query_engine, query_answer)
print('Text responses complete...')
time.sleep(1)
joint_res = run_queries(joint_query_engine, query_answer)
print('Joint responses complete...')

KG responses complete...
Text responses complete...
Joint responses complete...


In [290]:
CORRECTNESS_SYS_TMPL = """
You are an expert evaluation system for a question answering chatbot.

You are given the following information:
- a user query, 
- a reference answer, and
- a generated answer.

Your job is to judge the relevance and correctness of the generated answer.
Output a score that represents a holistic evaluation and a concise explanation reason for the score.


Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.
- If the generated answer is not relevant to the user query, \
you should give a score of 1.
- If the generated answer is relevant but contains mistakes, \
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct, \
you should give a score between 4 and 5.
- The length of the answer should not affect the score at all.

--> Beginning of example
# query 
"Which magazine was started first Arthur's Magazine or First for Women?"

# reference_answer
"Arthur's Magazine"

# generated_answer
"Arthur's Magazine"

# Output
_score_=5.0
_reason_=The response matched the reference answer perfectly.

--> End of example

The output should be returned following the example above only, using only two lines.

"""

CORRECTNESS_USER_TMPL = """
## User Query
{query}

## Reference Answer
{reference_answer}

## Generated Answer
{generated_answer}
"""

In [299]:
# Import the async client directly from the `openai` package (v1+ API).
from openai import AsyncOpenAI

class HotpotEvaluator:
    def __init__(self, user_template: str, system_template: str, threshold: float = 4.0):
        self.client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
        self.threshold = threshold
        self.user_template = user_template
        self.system_template = system_template
        
    async def run_eval(self, x: Tuple[Dict[str, str], Response]) -> str:
        completion = await self.client.chat.completions.create(
            model='gpt-3.5-turbo',
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": self.system_template
                },
                {
                    "role": "user",
                    "content": self.user_template.format(query=x[0]['query'], reference_answer=x[0]['answer'], generated_answer=x[1].response)
                }
            ]
        )
        
        return completion.choices[0].message.content
        
        
    def get_evals(self, qa_responses):
        
        results = []
        for batch_size in range(0, len(qa_responses), 5):
            
            batch_x = qa_responses[batch_size:batch_size+5]
            
            tasks = [self.run_eval(x) for x in batch_x]
            rv = asyncio.run(asyncio.gather(*tasks))
            
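            for x in rv:
                # Reset per response so a malformed reply cannot reuse earlier values.
                score, reason = None, None
                for u in x.split('\n'):
                    if '_score_=' in u:
                        score = float(u.split('_score_=')[1])
                    elif '_reason_=' in u:
                        reason = u.split('_reason_=')[1]

                # Use the configured threshold rather than a hard-coded 4.0.
                passing = score is not None and score >= self.threshold
                results.append({"passing": passing, "score": score, "reason": reason})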

        return results
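
The evaluator relies on the model replying in the two-line `_score_=` / `_reason_=` format, which is not guaranteed. The sketch below is one possible stricter parser; `parse_eval` is a hypothetical helper, and treating an unparseable reply as failing is an assumption, not something the notebook specifies.

```python
def parse_eval(text: str, threshold: float = 4.0) -> dict:
    """Parse a '_score_=' / '_reason_=' reply; missing fields stay None."""
    score, reason = None, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_score_="):
            try:
                score = float(line[len("_score_="):])
            except ValueError:
                score = None  # the model wrote something that is not a number
        elif line.startswith("_reason_="):
            reason = line[len("_reason_="):]
    # Assumption: a reply without a usable score counts as not passing.
    passing = score is not None and score >= threshold
    return {"passing": passing, "score": score, "reason": reason}


good = parse_eval("_score_=5.0\n_reason_=The response matched the reference answer perfectly.")
bad = parse_eval("I could not produce a score.")
print(good)
print(bad)
```

Returning `None` for missing fields makes malformed replies easy to count separately from genuinely low scores.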

In [300]:
evaluator = HotpotEvaluator(user_template=CORRECTNESS_USER_TMPL, system_template=CORRECTNESS_SYS_TMPL)

In [None]:
kg_eval = evaluator.get_evals(kg_res)
print('KG evaluations complete...')
time.sleep(1)
# Store evaluations under new names so the raw responses in txt_res and joint_res are kept.
txt_eval = evaluator.get_evals(txt_res)
print('Text evaluations complete...')
time.sleep(1)
joint_eval = evaluator.get_evals(joint_res)
print('Joint evaluations complete...')

Run the retrievers on the queries and check correctness

Take a look at the results. For each response, the LLM reports whether it judged the answer correct and gives brief feedback on why the response was deemed correct or incorrect.

In [310]:
kg_tot, txt_tot, joint_tot = 0, 0, 0
for i, x in enumerate(query_answer):
    print('-------------\nQuery: %s\nAnswer %s\n' % (repr(x['query']), repr(x['answer']) ))
    
    print('KG (%s): %s\nFeedback: %s\n' % (repr(kg_eval[i]['passing']), repr(kg_res[i][1].response), repr(kg_eval[i]['reason'])))
    
    print('Text (%s): %s\nFeedback: %s\n' % (repr(txt_eval[i]['passing']), repr(txt_res[i][1].response), repr(txt_eval[i]['reason'])))
    
    print('Joint (%s): %s\nFeedback: %s\n---------------\n' % (repr(joint_eval[i]['passing']), repr(joint_res[i][1].response), repr(joint_eval[i]['reason'])))
    
    kg_tot += kg_eval[i]['passing']
    txt_tot += txt_eval[i]['passing']
    joint_tot += joint_eval[i]['passing']

-------------
Query: "Which magazine was started first Arthur's Magazine or First for Women?"
Answer "Arthur's Magazine"

KG (True): "Arthur's Magazine was started first."
Feedback: 'The response is relevant and fully correct, matching the reference answer perfectly.'

Text (True): "Arthur's Magazine was started before First for Women."
Feedback: "The generated answer is both relevant and correct, providing the accurate information that Arthur's Magazine was started before First for Women."

Joint (True): "Arthur's Magazine was started before First for Women."
Feedback: "The generated answer is both relevant and correct, providing the accurate information that Arthur's Magazine was started before First for Women."
---------------

-------------
Query: 'The Oberoi family is part of a hotel company that has a head office in what city?'
Answer 'Delhi'

KG (True): 'Delhi'
Feedback: 'The response matched the reference answer perfectly.'

Text (True): 'The Oberoi family is part of a hotel co

In [311]:
print(f"KG Correct: {kg_tot}, Text Correct: {txt_tot}, Joint Correct: {joint_tot}, Total: {len(query_answer)}")

KG Correct: 10, Text Correct: 13, Joint Correct: 14, Total: 20
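
For comparison, the counts above can be turned into accuracies. This is simple arithmetic on the reported numbers; with only 20 questions, the differences between retrievers should be read with caution.

```python
# Counts reported in the output above.
counts = {"KG": 10, "Text": 13, "Joint": 14}
total = 20

accuracy = {name: correct / total for name, correct in counts.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.0%}")
```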


#### Things to do...

Looking at the output above, it looks like some correct answers are being marked incorrect because the answer is "too verbose". We can play with this by changing the prompt for the evaluator. Here are some thoughts on things to try out:

    1) Clean up the triples a bit (some of the text was messy) and try this again.
    2) Look at correcting the prompt for the evaluators.
    3) Check whether the top documents (subgraphs) correspond to the supporting context suggested by the `hotpot_qa` dataset.
    4) Use manually defined embeddings for the KGs (need to fit a heterogeneous GNN and pass it as the embedding method).
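
As a starting point for the first item, here is a minimal cleaning sketch. The notebook does not say what "messy" means in practice, so the sketch assumes three common problems: stray whitespace, empty fields, and duplicate triples. `clean_text` and `clean_triples` are hypothetical helpers.

```python
import re


def clean_text(value: str) -> str:
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", value).strip()


def clean_triples(triples):
    """Normalize whitespace, drop triples with empty fields, remove duplicates."""
    seen, cleaned = set(), []
    for head, relation, tail in triples:
        triple = (clean_text(head), clean_text(relation), clean_text(tail))
        if all(triple) and triple not in seen:
            seen.add(triple)
            cleaned.append(triple)
    return cleaned


# Made-up messy variants of triples from the `kg` table.
messy = [
    ("Radio City ", "broadcastsFrom", " Mumbai"),
    ("Radio City", "broadcastsFrom", "Mumbai"),
    ("Radio  City", "wasStartedOn", "3 July 2001"),
    ("", "broadcastsOn", "91.1 megahertz"),
]
tidy = clean_triples(messy)
print(tidy)
```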
