# 📓 RAG Framework Evaluation and Upgrade


<img src="https://i.imgur.com/fTICSCN.png">

In this exercise, you will have the opportunity to evaluate the various RAG components you have built earlier. In the previous Day 1 session, we only performed evaluation using the LLM. However, this time, we will utilize various evaluation methods commonly used in the NLP field.


### I. Evaluate RAG  
### II. Upgrade KG Query Stage with MCP
  
Okay. Now we know what we have to do for this final section.  
However, we first need to learn the additional evaluation metrics for the CRAG dataset:

`Exact Accuracy`  
`Accuracy`  
`Hallucination`   
`Missing`  



## 0. New evaluation metrics for CRAG dataset

In most cases, datasets designed for specific tasks are presented along with **evaluation metrics** that can be used for performance evaluation. Similarly, the CRAG dataset provides evaluation metrics that should be used when measuring the performance of LLMs on this dataset.  

Therefore, before proceeding with the evaluation, let’s first check which evaluation methods the creators of the CRAG dataset intended to use. Specifically, we will examine the four answer classification criteria they proposed, understand how these are evaluated, and clarify what each criterion means.  

<br/>

We follow the steps below:  

#### 1. What are the new evaluation metrics for the CRAG dataset?
#### 2. How to evaluate RAG following the new evaluation metrics?


### 1. What are the new evaluation metrics for the CRAG dataset?
  
The creators of the CRAG dataset evaluated RAG based on the following four elements:

<img src="https://i.imgur.com/0hxmPdi.png">

Simply put, they classified responses that were identical to their predefined answers as the most ideal case. Responses with similar meanings but containing minor errors were classified as the next most ideal case.

The important point is that, under the CRAG dataset’s evaluation criteria, everything else is not simply classified as “incorrect.” Instead, the creators expect the LLM to admit when it does not know the answer. Incorrect answers are those containing errors, and the more these answers occur, the worse the model’s performance is considered. However, answers classified as Missing (indicating no answer) do not negatively or positively impact the model’s performance.

By understanding these four classification criteria, you will gain valuable insights when analyzing evaluation results on the CRAG dataset.


### 2. How to evaluate RAG following the new evaluation metrics?
  
Since the evaluation criteria mentioned above cannot be checked by simple string matching, the evaluation must be conducted using an LLM. This can be done with methods similar to TruLens. However, because the evaluation results can vary depending on the prompt used, we will use the default prompt provided by the creators of the CRAG dataset.

The evaluation prompt is as follows. Based on its content, we need to provide the LLM with the `question`, `model prediction`, and `ground truth answers`. The LLM will then generate a response by performing the evaluation according to the instructions.

```Python
INSTRUCTIONS = """
# Task:
You are given a Question, a model Prediction, and a list of Ground Truth answers, judge whether the model Prediction matches any answer from the list of Ground Truth answers. Follow the instructions step by step to make a judgement.
1. If the model prediction matches any provided answers from the Ground Truth Answer list, "Accuracy" should be "True"; otherwise, "Accuracy" should be "False".
2. If the model prediction says that it couldn't answer the question or it doesn't have enough information, "Accuracy" should always be "False".
3. If the Ground Truth is "invalid question", "Accuracy" is "True" only if the model prediction is exactly "invalid question".
# Output:
Respond with only a single JSON string with an "Accuracy" field which is "True" or "False".
"""
```
To help us understand this with a simple example, please install and import the following libraries:

```Python
!pip install openai==1.55.3 --quiet
!pip install llama-index==0.12.2 --quiet
!pip install llama-index-embeddings-huggingface==0.4.0 --quiet
!pip install packaging==23.2 langchain "nltk>=3.8.1" streamlit==1.35.0 watchdog kubernetes==26.1.0 --quiet
!pip install blingfire beautifulsoup4 sentence-transformers ray --quiet
!pip install scikit-learn --quiet
!pip install tqdm tiktoken --quiet
!pip uninstall numpy -y
!pip install numpy==2.0.2
!pip uninstall pandas scipy transformers -y
!pip install pandas scipy transformers --quiet
```

In [None]:
### YOUR CODE HERE ###

INSTRUCTIONS = """
# Task:
You are given a Question, a model Prediction, and a list of Ground Truth answers, judge whether the model Prediction matches any answer from the list of Ground Truth answers. Follow the instructions step by step to make a judgement.
1. If the model prediction matches any provided answers from the Ground Truth Answer list, "Accuracy" should be "True"; otherwise, "Accuracy" should be "False".
2. If the model prediction says that it couldn't answer the question or it doesn't have enough information, "Accuracy" should always be "False".
3. If the Ground Truth is "invalid question", "Accuracy" is "True" only if the model prediction is exactly "invalid question".
# Output:
Respond with only a single JSON string with an "Accuracy" field which is "True" or "False".
"""

In [None]:
### YOUR CODE HERE ###

!pip install openai==1.55.3 --quiet
!pip install llama-index==0.12.2 --quiet
!pip install llama-index-embeddings-huggingface==0.4.0 --quiet
!pip install packaging==23.2 langchain "nltk>=3.8.1" streamlit==1.35.0 watchdog kubernetes==26.1.0 --quiet
!pip install blingfire beautifulsoup4 sentence-transformers ray --quiet
!pip install scikit-learn --quiet
!pip install tqdm tiktoken --quiet
!pip uninstall numpy -y
!pip install numpy==2.0.2
!pip uninstall pandas scipy transformers -y
!pip install pandas scipy transformers --quiet

You can drag-and-drop the import code file into the workspace. This will allow you to import the necessary functions from that file for this practice. However, please ensure that the file is in the same folder as the currently running code for the import to succeed.  


```Python
import os
os.environ["OPENAI_API_KEY"] = "sk-..." #Insert your openai api key

import openai
import json
import random
import bz2
from tqdm import tqdm
from import_function import LlamaIndexRetriever, Reader, KGQueryEngine
```


In [None]:
### YOUR CODE HERE ###

import os
os.environ["OPENAI_API_KEY"] = "sk-..." #Insert your openai api key

import openai
import json
import random
import bz2
from tqdm import tqdm
from import_function import LlamaIndexRetriever, Reader, KGQueryEngine

Now, to proceed with the evaluation, let’s mount your Google Drive as before to make the dataset accessible in Colab. Run the code below. Depending on your computer environment, this may take a little time.  

```Python
from google.colab import drive

drive.mount('/content/drive')
```
```Python
file_path = '/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl.bz2'

dataset = []

with bz2.open(file_path, 'rt') as file:
    for line in file:
        try:
            data = json.loads(line.strip())
            dataset.append(data)
            if len(dataset) > 500:
              break
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")
```

In [None]:
### YOUR CODE HERE ###

from google.colab import drive

drive.mount('/content/drive')

In [None]:
### YOUR CODE HERE ###

file_path = '/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl.bz2'

dataset = []

with bz2.open(file_path, 'rt') as file:
    for line in file:
        try:
            data = json.loads(line.strip())
            dataset.append(data)
            if len(dataset) > 500:
              break
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")

Please note that, depending on your computer environment, this process might take some time.  

<img src="https://i.gifer.com/B6Qs.gif" width="150">

Thank you for your understanding.


Next, to obtain the model prediction by asking the LLM a question, we will use the following simple code to generate a response:

```Python
def generate_answer(user_prompt, system_prompt = "You are a helpful assistant."):
    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt,
        },
    ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",  
        messages=messages,
    )

    return response.choices[0].message.content
```

In [None]:
### YOUR CODE HERE ###

def generate_answer(user_prompt, system_prompt = "You are a helpful assistant."):
    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": user_prompt,
        },
    ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )

    return response.choices[0].message.content

Now, let’s proceed with the evaluation. The evaluation will be conducted using randomly selected data points.

```Python
random_data = random.choice(dataset)

test_question = random_data['query']
test_answer = random_data['answer']
test_answer_candidate = random_data['alt_ans']
all_answers = [test_answer] + test_answer_candidate

model_prediction = generate_answer(user_prompt=test_question)
```
```Python
context_template = f"""Question: {test_question}
List of Ground Truth answers: {all_answers}
Model Prediction: {model_prediction}
"""

print(context_template)

evaluation_result = generate_answer(user_prompt=context_template, system_prompt=INSTRUCTIONS)

print(evaluation_result)
```

In [None]:
### YOUR CODE HERE ###

random_data = random.choice(dataset)

test_question = random_data['query']
test_answer = random_data['answer']
test_answer_candidate = random_data['alt_ans']
all_answers = [test_answer] + test_answer_candidate

model_prediction = generate_answer(user_prompt=test_question)

In [None]:
### YOUR CODE HERE ###

context_template = f"""Question: {test_question}
List of Ground Truth answers: {all_answers}
Model Prediction: {model_prediction}
"""

print(context_template)

evaluation_result = generate_answer(user_prompt=context_template, system_prompt=INSTRUCTIONS)

print(evaluation_result)

Once the LLM evaluation results are generated, we need a function to extract the relevant results. Using the parser below, we can extract the classification results effectively:

```Python
def parse_response(response):
    try:
        response = response.lower()
        model_resp = json.loads(response)
        answer = -1
        if "accuracy" in model_resp and (
            (
              model_resp["accuracy"] is True
            )
            or
            (
                isinstance(model_resp["accuracy"], str)
                and model_resp["accuracy"].lower() == "true"
            )
        ):
            answer = 1
        else:
            raise ValueError(f"Could not parse answer from response: {model_resp}")

        return answer
    except:
        return -1

test_text = """{
    "Accuracy": "True"
}"""

print(parse_response(test_text))
```

In [None]:
### YOUR CODE HERE ###

def parse_response(response):
    try:
        response = response.lower()
        model_resp = json.loads(response)
        answer = -1
        if "accuracy" in model_resp and (
            (
              model_resp["accuracy"] is True
            )
            or
            (
                isinstance(model_resp["accuracy"], str)
                and model_resp["accuracy"].lower() == "true"
            )
        ):
            answer = 1
        else:
            raise ValueError(f"Could not parse answer from response: {model_resp}")

        return answer
    except:
        return -1

test_text = """{
    "Accuracy": "True"
}"""

print(parse_response(test_text))

The `parse_response` function above takes the LLM-generated response, parses it as JSON, and checks the `accuracy` field. If `accuracy` is true, it returns 1; in every other case, including a response that cannot be parsed, it returns -1.

This function can be used to perform the evaluation.

```Python
def CRAG_evaluation(question, ground_truth, prediction):
  context_template = f"""Question: {question}
  List of Ground Truth answers: {ground_truth}
  Model Prediction: {prediction}
  """
  
  evaluation_result = generate_answer(user_prompt=context_template, system_prompt=INSTRUCTIONS)

  eval_res = parse_response(evaluation_result)

  return eval_res
```

In [None]:
### YOUR CODE HERE ###

def CRAG_evaluation(question, ground_truth, prediction):
  context_template = f"""Question: {question}
  List of Ground Truth answers: {ground_truth}
  Model Prediction: {prediction}
  """

  evaluation_result = generate_answer(user_prompt=context_template, system_prompt=INSTRUCTIONS)

  eval_res = parse_response(evaluation_result)

  return eval_res
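
One practical caveat: `parse_response` returns -1 for any response that is not bare JSON, so a correct judgement wrapped in a Markdown fence or preceded by a sentence would be counted as a failure. The sketch below shows one possible way to tolerate that. The helper names `extract_json_object` and `parse_judgement` are illustrative and not part of the CRAG tooling, and whether the judge model actually wraps its output this way is an assumption.

```python
import json
import re

def extract_json_object(text):
    """Return the span from the first '{' to the last '}' parsed as JSON, or None."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def parse_judgement(text):
    """Return 1 if the judge reported Accuracy as true, otherwise -1."""
    obj = extract_json_object(text)
    if not isinstance(obj, dict):
        return -1
    # Compare keys case-insensitively without lowercasing the whole payload.
    value = {str(k).lower(): v for k, v in obj.items()}.get("accuracy")
    if value is True or (isinstance(value, str) and value.strip().lower() == "true"):
        return 1
    return -1

# Hypothetical judge output wrapped in a Markdown fence.
fence = "`" * 3
wrapped = fence + 'json\n{"Accuracy": "True"}\n' + fence
print(parse_judgement(wrapped))
```

The return convention (1 or -1) is kept identical to `parse_response`, so this could be swapped into `CRAG_evaluation` without touching the evaluation loop.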

## I. Evaluate RAG

So far, we have evaluated the performance of the retriever to determine which retriever can be more effective. While using the recall score as the evaluation metric makes it difficult to measure semantic similarity, we have seen that it is convenient for large-scale automatic evaluation.

In this section, we aim to evaluate the RAG system using multiple approaches. After selecting the best retriever based on the method described above, we now need to integrate it with the Reader to build a complete RAG system and verify its overall performance.
We follow the steps below:  

#### 1. Define RAG and Evaluation Metric
#### 2. Evaluate through CRAG Evaluation Method

### 1. Define RAG
Now we will define the RAG class.


```Python
external_kg_server = "http://x.x.x.x:port"   #need to change

class RAG:
    def __init__(self, server=None):
        self.retriever = LlamaIndexRetriever()
        self.kg_query_engine = KGQueryEngine(server=server)
        self.reader = Reader()

    def retrieve(self, query, search_results, topk):
        retrieved_results = self.retriever.retrieve(query, search_results, topk)

        kg_results = self.kg_query_engine.query(query)

        combined_results = [kg_results]
        combined_results.extend(retrieved_results)

        return combined_results

    def generate_response(self, query, retrieved_results):
        answer = self.reader.generate_response(query, retrieved_results)
        return answer

    def inference(self, query, search_results, topk):
        retrieved_results = self.retrieve(query, search_results, topk)
        answer = self.generate_response(query, retrieved_results)
        return {
            "retrieved_results": retrieved_results,
            "answer": answer
        }

rag = RAG(server=external_kg_server)
```

In [None]:
### YOUR CODE HERE ###

external_kg_server = "http://x.x.x.x:port"   #need to change

class RAG:
    def __init__(self, server=None):
        self.retriever = LlamaIndexRetriever()
        self.kg_query_engine = KGQueryEngine(server=server)
        self.reader = Reader()

    def retrieve(self, query, search_results, topk):
        retrieved_results = self.retriever.retrieve(query, search_results, topk)

        kg_results = self.kg_query_engine.query(query)

        combined_results = [kg_results]
        combined_results.extend(retrieved_results)

        return combined_results

    def generate_response(self, query, retrieved_results):
        answer = self.reader.generate_response(query, retrieved_results)
        return answer

    def inference(self, query, search_results, topk):
        retrieved_results = self.retrieve(query, search_results, topk)
        answer = self.generate_response(query, retrieved_results)
        return {
            "retrieved_results": retrieved_results,
            "answer": answer
        }

rag = RAG(server=external_kg_server)

### 2. Evaluate through CRAG Evaluation Method

This time, we will evaluate the results using the evaluation metrics proposed in the CRAG dataset. As explained at the very beginning of this session, the model’s predictions are categorized into four classes for evaluation purposes.

#### 1. **Perfect**: The response correctly answers the question and contains no hallucination.
#### 2. **Acceptable**: The response provides a useful answer to the question but may contain minor errors.
#### 3. **Missing**: The response is "I don't know", "I'm sorry I can't find ...", or similar.
#### 4. **Incorrect**: The response provides wrong or irrelevant information to answer the question.

The model predictions categorized above are then combined into a single score in the following manner to evaluate the final performance of the RAG system.

<img src="https://i.imgur.com/TDQ5eI4.png">

Here, let’s proceed to evaluate the validation set using the function we just defined.

We designed our Graph RAG to handle only questions within the finance domain, so we will also select test dataset questions that belong to the finance domain.

```Python
finance_test_dataset_ids = []

for data in dataset:
    if data['domain'] == 'finance':
        finance_test_dataset_ids.append(data['interaction_id'])

    if len(finance_test_dataset_ids) >= 10:
        break
```

```Python
n_miss, n_correct, n_correct_exact = 0, 0, 0

for data in tqdm(dataset):
  if data['interaction_id'] not in finance_test_dataset_ids:
    continue

  question = data['query']
  ground_truth_lowercase = str(data['answer']).strip().lower()
  web_search_results = data['search_results']

  prediction_lowercase = rag.inference(question, web_search_results, 5)['answer'].lower()

  if prediction_lowercase == ground_truth_lowercase:
      n_correct_exact += 1
      continue
  elif "i don't know" in prediction_lowercase:
      n_miss += 1
      continue

  acceptable = CRAG_evaluation(question, ground_truth_lowercase, prediction_lowercase)

  if acceptable == 1:
    n_correct += 1

n_hallucinate = (len(finance_test_dataset_ids) - n_correct_exact - n_correct - n_miss)

CRAG_score = n_correct_exact + 0.5*n_correct - n_hallucinate

print("\n\n")
print("Number of correct answers:", n_correct)
print("Number of exact correct answers:", n_correct_exact)
print("Number of missed answers:", n_miss)
print("Number of hallucinated answers:", n_hallucinate)
print("CRAG score:", CRAG_score)
```

In [None]:
### YOUR CODE HERE ###

finance_test_dataset_ids = []

for data in dataset:
    if data['domain'] == 'finance':
        finance_test_dataset_ids.append(data['interaction_id'])

    if len(finance_test_dataset_ids) >= 10:
        break

In [None]:
### YOUR CODE HERE ###

n_miss, n_correct, n_correct_exact = 0, 0, 0

for data in tqdm(dataset):
  if data['interaction_id'] not in finance_test_dataset_ids:
    continue

  question = data['query']
  ground_truth_lowercase = str(data['answer']).strip().lower()
  web_search_results = data['search_results']

  prediction_lowercase = rag.inference(question, web_search_results, 5)['answer'].lower()

  if prediction_lowercase == ground_truth_lowercase:
      n_correct_exact += 1
      continue
  elif "i don't know" in prediction_lowercase:
      n_miss += 1
      continue

  acceptable = CRAG_evaluation(question, ground_truth_lowercase, prediction_lowercase)

  if acceptable == 1:
    n_correct += 1

n_hallucinate = (len(finance_test_dataset_ids) - n_correct_exact - n_correct - n_miss)

CRAG_score = n_correct_exact + 0.5*n_correct - n_hallucinate

print("\n\n")
print("Number of correct answers:", n_correct)
print("Number of exact correct answers:", n_correct_exact)
print("Number of missed answers:", n_miss)
print("Number of hallucinated answers:", n_hallucinate)
print("CRAG score:", CRAG_score)
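
The score printed by the loop above is a raw sum, so it grows with the number of questions evaluated. As a small sketch, the same weights used in that loop (exact match = 1, acceptable = 0.5, missing = 0, hallucination = -1) can be wrapped in a function that optionally divides by the number of questions, which makes runs of different sizes comparable. The function name `crag_score` and the `normalize` flag are illustrative choices, not part of the CRAG tooling.

```python
def crag_score(n_perfect, n_acceptable, n_missing, n_incorrect, normalize=False):
    """Combine the four class counts into one score using the notebook's weights."""
    total = n_perfect + n_acceptable + n_missing + n_incorrect
    raw = 1.0 * n_perfect + 0.5 * n_acceptable + 0.0 * n_missing - 1.0 * n_incorrect
    if normalize:
        # Average per question; guard against an empty evaluation set.
        return raw / total if total else 0.0
    return raw

# Hypothetical counts for ten questions.
print(crag_score(n_perfect=3, n_acceptable=2, n_missing=4, n_incorrect=1))
print(crag_score(n_perfect=3, n_acceptable=2, n_missing=4, n_incorrect=1, normalize=True))
```

With these weights, a system that answers "I don't know" to everything scores 0, while one that guesses wrong every time scores below 0.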

## II. Upgrade KG Query Stage with MCP

So far, we have integrated RAG with a knowledge graph, enabling the LLM to retrieve structured knowledge through KGQueryEngine. In this approach, the retrieval was based on a decision tree to invoke API calls, and queries were constructed using the LLM.

However, is our KGQueryEngine truly reliable in practical settings? Can the retrieved results be used effectively by the LLM? More importantly, how does this method compare to the emerging MCP-based retrieval approach?

To answer these questions, we will upgrade our RAG system and evaluate both methods through the following steps:

#### 1. Error Case Analysis
#### 2. Implement MCP-based Tool Calls

This structured comparison will help us determine whether the shift toward MCP tools—now gaining popularity in LLM applications—is practically justified.

### 1. Error Case Analysis

Previously, we built Graph RAG and confirmed that incorporating a knowledge graph, rather than relying solely on RAG, can improve performance for certain examples.

However, it remains uncertain whether our `KGQueryEngine` functions correctly for all question-answer pairs. We have yet to analyze all questions within the finance domain.

Beyond questions that require information such as EPS, there can be various other questions within the finance domain. For example, consider the following questions:

```python
with bz2.open(file_path, 'rt') as file:
  for line in file:
    data = json.loads(line.strip())

    if data['interaction_id'] == "7a77679e-d88b-4acf-9532-94e32233950b":
      question = data['query']
      gold_answer = data['answer']
      search_results = data['search_results']
      query_time = data['query_time']
      break
      
print("Question: ", question)
print("Gold Answer: ", gold_answer)
print("Query Time: ", query_time)
```

In [None]:
### YOUR CODE HERE ###

with bz2.open(file_path, 'rt') as file:
  for line in file:
    data = json.loads(line.strip())

    if data['interaction_id'] == "7a77679e-d88b-4acf-9532-94e32233950b":
      question = data['query']
      gold_answer = data['answer']
      search_results = data['search_results']
      query_time = data['query_time']
      break

print("Question: ", question)
print("Gold Answer: ", gold_answer)
print("Query Time: ", query_time)

Let’s examine the results generated by our KGQueryEngine for this question.

To do so, we need to recall how we perform retrievals on the knowledge graph. We use an LLM to generate queries that serve as inputs for APIs connected to the knowledge graph.

However, the LLM models provided by OpenAI inherently exhibit randomness, meaning that the same query is not always generated consistently. As a result, our KGQueryEngine does not always return identical results.

Therefore, it is important to repeat the same process multiple times to identify trends in the generated outputs.

Let’s explore the nondeterminism of GPT and its effect on our query generation through the following case study.


```Python
kg_query_engine = KGQueryEngine(server=external_kg_server)

generated_queries = []

for i in range(3):
    generated_query = kg_query_engine.generate_query(question)[0]
    print(generated_query)
    generated_queries.append(generated_query)
```

In [None]:
### YOUR CODE HERE ###

kg_query_engine = KGQueryEngine(server=external_kg_server)

generated_queries = []

for i in range(3):
    generated_query = kg_query_engine.generate_query(question)[0]
    print(generated_query)
    generated_queries.append(generated_query)
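
To see a trend across repeated runs rather than eyeballing printed output, one option is to tally how often each distinct query appears. The sketch below is self-contained and uses hypothetical strings in place of real `generate_query` results. It assumes each generated query can be converted to a string, and it treats queries that differ only in case or whitespace as the same.

```python
from collections import Counter

def summarize_runs(outputs):
    """Count how often each distinct output appeared, ignoring case and extra spaces."""
    normalized = [" ".join(str(o).lower().split()) for o in outputs]
    return Counter(normalized).most_common()

# Hypothetical outputs standing in for three generate_query() results.
sample_outputs = [
    "AAPL eps 2023",
    "aapl  EPS 2023",
    "Apple earnings per share",
]

for text, count in summarize_runs(sample_outputs):
    print(count, text)
```

Increasing the loop count from 3 to a larger number makes the tally more informative, at the cost of more API calls.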

Select queries that explicitly specify company names, metrics, and other relevant details, and examine the results of knowledge graph retrieval.

```python
i=0 #change this index

kg_results = kg_query_engine.get_finance_kg_results(generated_queries[i])

json_strings = kg_results.split("<DOC>\n")

json_strings = [s.replace("'", '"') for s in json_strings]

parsed_json = [json.loads(js) for js in json_strings]

for idx, data in enumerate(parsed_json):
    print(json.dumps(data, indent=4))

len(kg_results)
```

In [None]:
### YOUR CODE HERE ###

i=0 #change this index

kg_results = kg_query_engine.get_finance_kg_results(generated_queries[i])

json_strings = kg_results.split("<DOC>\n")

json_strings = [s.replace("'", '"') for s in json_strings]

parsed_json = [json.loads(js) for js in json_strings]

for idx, data in enumerate(parsed_json):
    print(json.dumps(data, indent=4))

len(kg_results)
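
The quote replacement used above is fragile: it breaks as soon as a value contains an apostrophe (for example a company name such as "McDonald's"), and an empty chunk after a trailing separator makes `json.loads` fail. As a hedged alternative, the sketch below assumes the server returns Python-style dict literals separated by `<DOC>\n`, as the split above suggests, and parses them with `ast.literal_eval`. The helper name `parse_kg_documents` and the sample payload are hypothetical. If the real output is already valid JSON, `json.loads` on each chunk would be the better choice.

```python
import ast
import json

def parse_kg_documents(raw, separator="<DOC>\n"):
    """Parse separator-delimited Python-literal records, skipping empty or bad chunks."""
    documents = []
    for chunk in raw.split(separator):
        chunk = chunk.strip()
        if not chunk:
            continue  # e.g. a leading or trailing separator
        try:
            documents.append(ast.literal_eval(chunk))
        except (ValueError, SyntaxError):
            print(f"Skipping unparsable chunk: {chunk[:40]!r}")
    return documents

# Hypothetical payload; the real server output may differ.
sample = "{'name': \"McDonald's\", 'eps': 2.5}<DOC>\n{'name': 'Apple', 'eps': 1.5}<DOC>\n"
docs = parse_kg_documents(sample)
print(json.dumps(docs, indent=4))
```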

Upon reviewing the search results, we observed that an extremely long string was retrieved.  

Such lengthy search results can lead to the following issues:  

1. **Excessive unnecessary information may cause hallucinations.**  
2. **Even if the necessary information is retrieved, the LLM may fail to recognize it properly.**  
3. **When combined with search results from `search_results`, the total context length may exceed the LLM’s limit, leading to inference errors.**  
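
The third issue in particular can be mitigated by capping how much reference text reaches the reader. The sketch below is one simple way to do that, assuming references are plain strings and using character count as a rough proxy for tokens. The helper name `truncate_references` and the default budget are illustrative, and a tokenizer-based budget would be more accurate.

```python
def truncate_references(references, max_chars=8000):
    """Keep references in order, cutting the last one if needed to fit the budget."""
    kept, used = [], 0
    for ref in references:
        remaining = max_chars - used
        if remaining <= 0:
            break
        if len(ref) > remaining:
            kept.append(ref[:remaining])  # partial final reference
            break
        kept.append(ref)
        used += len(ref)
    return kept

# Three hypothetical references of 5 characters each with a 12-character budget.
print(truncate_references(["a" * 5, "b" * 5, "c" * 5], max_chars=12))
```

Because references are kept in order, this only helps if the most relevant results come first. It does not address the first two issues, which concern irrelevant content rather than length alone.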

Let’s analyze how our RAG responds to this issue. Here, we will focus on handling search results from the knowledge graph, excluding the search process from `search_results`.  

Therefore, let’s declare a new RAG class as follows and use it for evaluation.

```python
class RAGwithoutSR:
    def __init__(self, server=None):
        self.retriever = LlamaIndexRetriever()
        self.kg_query_engine = KGQueryEngine(server=server)
        self.reader = Reader()

    def retrieve(self, query, search_results, topk):
        kg_results = self.kg_query_engine.query(query)

        return [kg_results]

    def generate_response(self, query, retrieved_results):
        answer = self.reader.generate_response(query, retrieved_results)
        return answer

    def inference(self, query, search_results, topk):
        retrieved_results = self.retrieve(query, search_results, topk)
        answer = self.generate_response(query, retrieved_results)
        return {
            "retrieved_results": retrieved_results,
            "answer": answer
        }

rag_kg = RAGwithoutSR(server=external_kg_server)
```
```python
for i in range(3):
    rag_output = rag_kg.inference(question, search_results, 5)
    print(rag_output['answer'])
```

In [None]:
### YOUR CODE HERE ###

class RAGwithoutSR:
    def __init__(self, server=None):
        self.retriever = LlamaIndexRetriever()
        self.kg_query_engine = KGQueryEngine(server=server)
        self.reader = Reader()

    def retrieve(self, query, search_results, topk):
        kg_results = self.kg_query_engine.query(query)

        return [kg_results]

    def generate_response(self, query, retrieved_results):
        answer = self.reader.generate_response(query, retrieved_results)
        return answer

    def inference(self, query, search_results, topk):
        retrieved_results = self.retrieve(query, search_results, topk)
        answer = self.generate_response(query, retrieved_results)
        return {
            "retrieved_results": retrieved_results,
            "answer": answer
        }

rag_kg = RAGwithoutSR(server=external_kg_server)

In [None]:
### YOUR CODE HERE ###

for i in range(3):
    rag_output = rag_kg.inference(question, search_results, 5)
    print(rag_output['answer'])

### 2. Implement MCP-based Tool Calls

In this step, we replace the existing KGQueryEngine logic with MCP-based tool calls.

**Model Context Protocol (MCP)** is a standardized interface that allows large language models to interact with external tools or data services via a structured client-server protocol. Each function on the server is registered as a tool, and the LLM can invoke these tools by sending structured requests through the MCP client.

<img src="https://i.imgur.com/P1g0TPh.png">

Therefore, if an external data source can be accessed by the LLM through MCP, this interaction can be viewed as a form of RAG. This is especially relevant in our scenario, where the knowledge graph is only partially accessible via APIs. In such cases, instead of relying on a pre-defined decision tree, it might be more effective to let the LLM decide which tools to use dynamically.

In other words, the LLM itself selects the appropriate tool to retrieve specific information from the knowledge graph.

To enable this, it is crucial to define which tools are available to the LLM. This is handled on the MCP server side, where each tool is registered along with a description that helps the model understand what it can do.

<img src="https://i.imgur.com/EiQfJxQ.png" width=500>
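
To make the idea of registering tools with descriptions concrete, here is a toy sketch in plain Python. It does not use the MCP SDK and is not how the attached server is implemented. The `ToolRegistry` class, the `get_ticker` tool, and its lookup table are all hypothetical. The sketch only mirrors the three things the text describes: registering a function as a tool, exposing its name and description, and dispatching a structured request to it.

```python
from typing import Callable, Dict

class ToolRegistry:
    """Toy stand-in for a server-side tool table (not the MCP SDK)."""

    def __init__(self):
        self._tools: Dict[str, Callable] = {}
        self._descriptions: Dict[str, str] = {}

    def tool(self, description: str):
        """Decorator that registers a function under its own name."""
        def decorator(func: Callable) -> Callable:
            self._tools[func.__name__] = func
            self._descriptions[func.__name__] = description
            return func
        return decorator

    def list_tools(self):
        """What a client would show the LLM: tool names plus descriptions."""
        return [{"name": n, "description": d} for n, d in self._descriptions.items()]

    def call(self, name: str, **arguments):
        """Dispatch a structured request to the named tool."""
        if name not in self._tools:
            raise KeyError(f"Unknown tool: {name}")
        return self._tools[name](**arguments)

registry = ToolRegistry()

@registry.tool(description="Return the ticker symbol for a company name.")
def get_ticker(company_name: str) -> str:
    # Hypothetical lookup table used only for this illustration.
    table = {"apple": "AAPL", "microsoft": "MSFT"}
    return table.get(company_name.lower(), "unknown")

print(registry.list_tools())
print(registry.call("get_ticker", company_name="Apple"))
```

The description string is the only information the model has when choosing a tool, which is why its wording matters as much as the function body.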

The MCP server consists of core tool definitions and server-side settings that together define the server’s behavior.  
However, because the MCP server requires an event loop to run, it cannot be executed within this Colab notebook environment. Thus, we exclude server execution from this exercise, but you can refer to the attached code for implementation details.

We can use the same library we used previously, LlamaIndex, to construct the MCP client. Refer to the code below to see how the MCP client can be implemented.

Now, let’s proceed to implement the MCP client. This client is responsible for retrieving tool information from the server and forwarding tool invocation requests on behalf of the LLM.

First, let's install and import what we need.

```python
! pip install llama-index-tools-mcp==0.2.5 --quiet
! pip install llama-index-llms-openai==0.4.4 --quiet
```
```python
from llama_index.tools.mcp import BasicMCPClient, McpToolSpec
from llama_index.llms.openai import OpenAI
from llama_index.core.agent.workflow import FunctionAgent, ToolCallResult, ToolCall
from llama_index.core.workflow import Context

import dotenv
```

In [None]:
### YOUR CODE HERE ###

! pip install llama-index-tools-mcp==0.2.5 --quiet
! pip install llama-index-llms-openai==0.4.4 --quiet

In [None]:
### YOUR CODE HERE ###

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec
from llama_index.core.agent.workflow import FunctionAgent, ToolCallResult, ToolCall
from llama_index.core.workflow import Context

import dotenv

Second, the current reader does not take `query_time` into account. Let's replace it with a revised version that includes `query_time` in the prompt.


```python
from openai import OpenAI

oai_client = OpenAI()

class Reader:
  def __init__(self):

    self.system_prompt = """
    You are provided with a question, the time at which the question is asked, and various references.
    Your task is to answer the question succinctly, using the fewest words possible.
    If the references do not contain the necessary information to answer the question, respond with 'I don't know'.
    There is no need to explain the reasoning behind your answers.
    """

  def generate_response(self, question: str, query_time: str, top_k_chunks: list) -> str:
    """
    Generate answer from context.
    """
    llm_input = self.prompt_generator(question, query_time, top_k_chunks)
    completion = oai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=
    llm_input
    ).choices[0].message.content
    return completion

  def prompt_generator(self, query, query_time, top_k_chunks):
    user_message = ""
    references = ""

    if len(top_k_chunks) > 0:
        references += "# References \n"
        # Format the top sentences as references in the model's prompt template.
        for chunk_id, chunk in enumerate(top_k_chunks):
            references += f"- {chunk.strip()}\n"


    user_message += f"{references}\n------\n\n"
    user_message += "Using only the references listed above, answer the following question: \n"
    user_message += f"Question: {query}\n"
    user_message += f"Query Time: {query_time}\n"

    llm_input = [
    {"role": "system", "content": self.system_prompt},
    {"role": "user", "content": user_message},
    ]

    return llm_input
```

In [None]:
### YOUR CODE HERE ###

from openai import OpenAI

oai_client = OpenAI()

class Reader:
  def __init__(self):

    self.system_prompt = """
    You are provided with a question, the time at which the question is asked, and various references.
    Your task is to answer the question succinctly, using the fewest words possible.
    If the references do not contain the necessary information to answer the question, respond with 'I don't know'.
    There is no need to explain the reasoning behind your answers.
    """

  def generate_response(self, question: str, query_time: str, top_k_chunks: list) -> str:
    """
    Generate answer from context.
    """
    llm_input = self.prompt_generator(question, query_time, top_k_chunks)
    completion = oai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=
    llm_input
    ).choices[0].message.content
    return completion

  def prompt_generator(self, query, query_time, top_k_chunks):
    user_message = ""
    references = ""

    if len(top_k_chunks) > 0:
        references += "# References \n"
        # Format the top sentences as references in the model's prompt template.
        for chunk_id, chunk in enumerate(top_k_chunks):
            references += f"- {chunk.strip()}\n"


    user_message += f"{references}\n------\n\n"
    user_message += "Using only the references listed above, answer the following question: \n"
    user_message += f"Question: {query}\n"
    user_message += f"Query Time: {query_time}\n"

    llm_input = [
    {"role": "system", "content": self.system_prompt},
    {"role": "user", "content": user_message},
    ]

    return llm_input

### MCP Server Overview

The **MCP (Model Context Protocol) Server** serves as the execution and data layer that the LLM application connects to. It exposes customized functionality and structured contextual information through three main elements: **Resources**, **Tools**, and **Prompts**. This enables LLMs to make informed, context-aware decisions and perform real-world operations safely and efficiently.

---

### MCP Context Components
---

### 1. Resource

A **Resource** provides structured, external data that the LLM can reference during reasoning or decision-making. It represents external objects—such as datasets, files, graph nodes, or static configuration tables—that are read-only from the model's perspective.

**Key Properties:**
- `name`: Human-readable name shown to users or in debugging.
- `description`: A short summary explaining the resource's purpose.
- `mime_type`: The MIME type of the content (e.g., `text/plain`, `application/json`). This describes the content format.
- `uri`: A globally unique identifier for the resource.
- `content`: The actual data or a pointer to the data (e.g., a JSON object, string, or file reference).

**Usage Notes:**
- Resources are **not invoked** like functions.
- They are **fetched by URI**, and remain **static** so that the LLM can repeatedly refer to them without reloading.
- Resources improve LLM accuracy by anchoring it to reliable external context.
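
To make the properties above concrete, here is a minimal sketch of a read-only resource that is fetched by URI. It uses only the standard library; the `Resource` dataclass, the `RESOURCES` registry, the `register`/`fetch` helpers, and the `kg://schema/finance` URI are all hypothetical illustrations and not part of any MCP SDK.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: read-only, like a resource from the model's perspective
class Resource:
    uri: str
    name: str
    description: str
    mime_type: str
    content: str

# Hypothetical in-memory registry keyed by URI (not the MCP SDK).
RESOURCES = {}

def register(resource: Resource) -> None:
    RESOURCES[resource.uri] = resource

def fetch(uri: str) -> Resource:
    """Resources are looked up by URI; they are not invoked like functions."""
    return RESOURCES[uri]

register(Resource(
    uri="kg://schema/finance",  # illustrative URI
    name="Finance KG schema",
    description="Entity types available in the finance graph.",
    mime_type="application/json",
    content='{"entities": ["Company", "Dividend"]}',
))

schema = fetch("kg://schema/finance")
print(schema.mime_type, schema.content)
```

Fetching the same URI twice returns the same object, which mirrors the idea that a resource stays static while the LLM refers to it repeatedly.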

---

### 2. Tool

A **Tool** is a callable function registered on the MCP server that the LLM can use to interact with the outside world. These tools bridge natural language instructions and real-world effects by turning a model’s intent into executable actions.

**Typical Use Cases:**
- Executing database queries
- Fetching live data from APIs
- Performing calculations or summarizations
- Searching documents or files

**How It Works:**
1. The user provides a natural language input (e.g., “Please calculate 3 + 7”).
2. The MCP client supplies the LLM with a list of available tools and their schema.
3. The LLM chooses the appropriate tool and extracts input arguments (e.g., `x=3`, `y=7`).
4. The tool invocation is generated and passed to the MCP server.
5. The tool is executed on the server, and the result is returned.
6. The LLM incorporates the result into its context to generate a final user response.

**Implementation:**
Tools are registered using the `@mcp.tool()` decorator in Python, and must conform to the function signature and type schema required by the MCP framework.
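
The six steps above can be sketched without any MCP dependency. In the snippet below, the `tool` decorator, the `TOOLS` registry, `list_tools`, and `call_tool` are hypothetical stand-ins for what the MCP framework does, and the LLM's choice in step 3 is hard-coded instead of coming from a model.

```python
import inspect

TOOLS = {}

def tool(fn):
    """Register a function and record its argument schema (stand-in for `@mcp.tool()`)."""
    params = inspect.signature(fn).parameters
    TOOLS[fn.__name__] = {
        "fn": fn,
        "description": (fn.__doc__ or "").strip(),
        "schema": {name: p.annotation.__name__ for name, p in params.items()},
    }
    return fn

@tool
def add(x: int, y: int) -> int:
    """Add two integers."""
    return x + y

def list_tools():
    # Step 2: the tool list and schema that the client would supply to the LLM.
    return [
        {"name": name, "description": entry["description"], "schema": entry["schema"]}
        for name, entry in TOOLS.items()
    ]

def call_tool(invocation):
    # Steps 4-5: the invocation is passed to the "server" and executed there.
    entry = TOOLS[invocation["name"]]
    return entry["fn"](**invocation["arguments"])

# Step 3 is normally done by the LLM; here the chosen call is hard-coded.
invocation = {"name": "add", "arguments": {"x": 3, "y": 7}}
result = call_tool(invocation)

print(list_tools())
print(result)  # Step 6: this value would be fed back into the LLM's context.
```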

---

### 3. Prompt

A **Prompt** in MCP serves as a reusable, templated instruction that guides the LLM’s reasoning in a structured way. Prompts help the model:

- Interpret ambiguous user queries
- Structure downstream tool invocations
- Extract relevant content from complex input

**Prompt Usage Scenarios:**
- Generate structured summaries from unstructured context
- Classify user intent
- Disambiguate time references like “this year” or “last quarter”

Prompts act as templates or procedural guides that the LLM fills in dynamically based on user input and context. While not executable like tools, they play a key role in shaping how the LLM prepares inputs and interprets outputs.
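
As an illustration of the time-disambiguation scenario, a prompt can be thought of as a template with named slots. The sketch below uses the standard library's `string.Template`; the template wording and the `render_prompt` helper are assumptions made for this example, not prompts provided by the MCP server used in this notebook.

```python
from string import Template

# Hypothetical reusable prompt: resolves relative time expressions against the query time.
RESOLVE_TIME_PROMPT = Template(
    "The question was asked at $query_time.\n"
    "Rewrite the question so that relative time expressions "
    "(e.g. 'this year', 'last quarter') become explicit dates.\n"
    "Question: $question"
)

def render_prompt(question: str, query_time: str) -> str:
    """Fill the template slots with the user's question and the query time."""
    return RESOLVE_TIME_PROMPT.substitute(question=question, query_time=query_time)

prompt_text = render_prompt(
    "which company distribute more dividends this year, muj or tcbio?",
    "2024/02/23",
)
print(prompt_text)
```

The rendered text is only an instruction for the LLM; unlike a tool, nothing is executed when the template is filled in.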

---

### Summary

| Component | Purpose |
|----------|---------|
| `Resource` | Static external data that LLM can reference |
| `Tool`     | Callable function to perform real-world actions |
| `Prompt`   | Structured, reusable guide for reasoning or template generation |

Together, these elements form the foundation of the **MCP Server**, enabling LLMs to operate safely, reliably, and contextually in real applications.

### MCP Client Overview

In this section, we focus on the **MCP Client**, which serves as the interface between the LLM runtime (e.g., LlamaIndex) and an external **MCP Server**.

### What is the MCP Client?

The MCP Client is responsible for:

- Connecting to the MCP Server over HTTP or SSE.
- Fetching available **tools**, **resources**, and **prompts** exposed by the server.
- Translating natural language requests into structured tool invocations.
- Managing context and integrating responses from the server back into the LLM workflow.

It essentially acts as a **bridge** that allows the language model to interact with external APIs and structured context in a modular and dynamic way.

```python
from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

external_mcp_server = "http://x.x.x.x:port/sse"  # change to the correct URI

mcp_client = BasicMCPClient(external_mcp_server)
mcp_tool = McpToolSpec(client=mcp_client)

tools = await mcp_tool.to_tool_list_async()
print("\n=== Available Tools ===\n")
for tool in tools:
    print(f"🔧 Name       : {tool.metadata.name}")
    print(f"   Description: {tool.metadata.description}\n")
```
```python
resources = await mcp_tool.fetch_resources()

print("\n=== Available Resources ===\n")
for resource in resources:
    print(f"📦 URI        : {resource.uri}")
    print(f"   Name       : {resource.name}")
    print(f"   Description: {resource.description}")
    print(f"   MIME Type  : {resource.mimeType}\n")
```

In [None]:
### YOUR CODE HERE ###

from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

external_mcp_server = "http://x.x.x.x:port/sse"  # change to the correct URI

mcp_client = BasicMCPClient(external_mcp_server)
mcp_tool = McpToolSpec(client=mcp_client)

tools = await mcp_tool.to_tool_list_async()
print("\n=== Available Tools ===\n")
for tool in tools:
    print(f"🔧 Name       : {tool.metadata.name}")
    print(f"   Description: {tool.metadata.description}\n")

In [None]:
resources = await mcp_tool.fetch_resources()

print("\n=== Available Resources ===\n")
for resource in resources:
    print(f"📦 URI        : {resource.uri}")
    print(f"   Name       : {resource.name}")
    print(f"   Description: {resource.description}")
    print(f"   MIME Type  : {resource.mimeType}\n")

#### 1. KG Query Engine Initialization with MCP Client

In this step, we initialize an **LLM agent** that can interact with tools registered on the MCP Server.

- `BasicMCPClient` connects to the MCP Server at the specified URL.
- `McpToolSpec` wraps available tools for the agent to use.
- `FunctionAgent` is created with:
  - The `GPT` model
  - A system prompt guiding tool-based reasoning
  - A dynamic list of MCP tools

The agent is now ready to handle user queries by calling tools exposed via the MCP interface.

```python
SYSTEM_PROMPT = """\
You are an AI assistant for Tool Calling.

Before you help a user, you need to work with tools to interact with Our Knowledge Graph
"""
```

```python
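from typing import Optional

from llama_index.core.agent.workflow import FunctionAgent, ToolCall, ToolCallResult
from llama_index.core.workflow import Context
from llama_index.llms.openai import OpenAI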

class KGQueryEngineWithMCP:
    def __init__(self, mcp_tool_spec: McpToolSpec, model: str, llm = None):
        self.llm = llm or OpenAI(model=model, temperature=0)
        self.mcp_tool_spec = mcp_tool_spec
        self.agent: Optional[FunctionAgent] = None
        self.agent_context: Optional[Context] = None

    async def init_agent(self):
        tools = await self.mcp_tool_spec.to_tool_list_async()
        self.agent = FunctionAgent(
            name="Agent",
            description="An agent that can work with Our Knowledge Graph api.",
            tools=tools,
            llm=self.llm,
            system_prompt=SYSTEM_PROMPT,
        )
        self.agent_context = Context(self.agent)

    async def query(self, question: str, verbose: bool = False) -> str:
        if self.agent is None or self.agent_context is None:
            raise RuntimeError("Agent not initialized. Call `await init_agent()` first.")

        handler = self.agent.run(question, ctx=self.agent_context)
        async for event in handler.stream_events():
            # Log tool invocations and their results only when verbose output is requested.
            if verbose and isinstance(event, ToolCall):
                print(f"Calling tool {event.tool_name} with kwargs {event.tool_kwargs}")
            elif verbose and isinstance(event, ToolCallResult):
                print(f"Tool {event.tool_name} returned {event.tool_output}")

        response = await handler
        return str(response)

kg_engine = KGQueryEngineWithMCP(mcp_tool, model='gpt-3.5-turbo')
await kg_engine.init_agent()
```


In [None]:
### YOUR CODE HERE ###

SYSTEM_PROMPT = """\
You are an AI assistant for Tool Calling.

Before you help a user, you need to work with tools to interact with Our Knowledge Graph
"""

In [None]:
### YOUR CODE HERE ###

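from typing import Optional

from llama_index.core.agent.workflow import FunctionAgent, ToolCall, ToolCallResult
from llama_index.core.workflow import Context
from llama_index.llms.openai import OpenAI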

class KGQueryEngineWithMCP:
    def __init__(self, mcp_tool_spec: McpToolSpec, model: str, llm = None):
        self.llm = llm or OpenAI(model=model, temperature=0)
        self.mcp_tool_spec = mcp_tool_spec
        self.agent: Optional[FunctionAgent] = None
        self.agent_context: Optional[Context] = None

    async def init_agent(self):
        tools = await self.mcp_tool_spec.to_tool_list_async()
        self.agent = FunctionAgent(
            name="Agent",
            description="An agent that can work with Our Knowledge Graph api.",
            tools=tools,
            llm=self.llm,
            system_prompt=SYSTEM_PROMPT,
        )
        self.agent_context = Context(self.agent)

    async def query(self, question: str, verbose: bool = False) -> str:
        if self.agent is None or self.agent_context is None:
            raise RuntimeError("Agent not initialized. Call `await init_agent()` first.")

        handler = self.agent.run(question, ctx=self.agent_context)
        async for event in handler.stream_events():
            # Log tool invocations and their results only when verbose output is requested.
            if verbose and isinstance(event, ToolCall):
                print(f"Calling tool {event.tool_name} with kwargs {event.tool_kwargs}")
            elif verbose and isinstance(event, ToolCallResult):
                print(f"Tool {event.tool_name} returned {event.tool_output}")

        response = await handler
        return str(response)

kg_engine = KGQueryEngineWithMCP(mcp_tool, model='gpt-3.5-turbo')
await kg_engine.init_agent()

The `query` method runs the LLM agent (`FunctionAgent`) on a given user message, using the shared workflow context (`Context`).
It streams intermediate tool call events in real time and returns the final response from the agent.

1. Starts the agent with the input message and context.
2. Streams events while the agent is running.
   - Logs tool invocations and results if `verbose=True`.
3. Awaits the final result and returns it as a string.

This function helps monitor the reasoning and tool execution steps taken by the agent in a transparent, asynchronous manner.
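
The "stream events, then await the final result" pattern can be tried out without an LLM or a server. The sketch below uses only `asyncio`; `FakeHandler` is a hypothetical stand-in for the handler returned by the agent (it is not the LlamaIndex class), and the two events it emits are hard-coded.

```python
import asyncio

class FakeHandler:
    """Hypothetical stand-in for an agent run handler (not the LlamaIndex class)."""

    def __init__(self):
        self._events = asyncio.Queue()
        # The "agent" runs in the background while the caller consumes events.
        self._task = asyncio.create_task(self._run())

    async def _run(self) -> str:
        await self._events.put(("tool_call", "add", {"x": 3, "y": 7}))
        await self._events.put(("tool_result", "add", 10))
        await self._events.put(None)  # sentinel: the stream is finished
        return "3 + 7 = 10"

    async def stream_events(self):
        while (event := await self._events.get()) is not None:
            yield event

    def __await__(self):
        # Awaiting the handler yields the final response of the run.
        return self._task.__await__()


async def query(verbose: bool = False):
    handler = FakeHandler()                      # 1. start the run
    log = []
    async for kind, tool_name, payload in handler.stream_events():  # 2. stream events
        if verbose:
            log.append(f"{kind}: {tool_name} -> {payload}")
    response = await handler                     # 3. await the final result
    return str(response), log

answer, log = asyncio.run(query(verbose=True))
print(log)
print(answer)
```

Inside a notebook, where an event loop is already running, `await query(verbose=True)` would be used in place of `asyncio.run(...)`.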

Using the code below, let's see how the MCP-based RAG handles the error case seen above.

```python
# `question` holds only the query text from the error case above; the query time is kept separately.
question = "which company distribute more dividends this year, muj or  tcbio?"
query_time = "2024/02/23"

print("Question: ", question)

response = await kg_engine.query(question, verbose=True)
print("Agent: ", response)
```

In [None]:
### YOUR CODE HERE ###

print("Question: ", question)

response = await kg_engine.query(
    question,
    verbose=True,
)
print(response)

Let's examine the result of the code above.  
Did the MCP agent generate a correct answer? Probably not.

The reason becomes clear when we look at the question that was passed to the LLM.  
The LLM did not receive any information about the **query time**, which is crucial for answering this type of question.  
Since the timing context is highly important here, let’s include the query time and try the question again.

```python
question_with_time = "Query: " + question + "\nQuery time: " + query_time
print("Question: ", question_with_time)

response = await kg_engine.query(
    question_with_time,
    verbose=True,
)
print(response)
```

In [None]:
### YOUR CODE HERE ###

question_with_time = "Query: " + question + "\nQuery time: " + query_time
print("Question: ", question_with_time)

response = await kg_engine.query(
    question_with_time,
    verbose=True,
)
print(response)

Was the generated result good enough?

Some users may find it insufficient, while others may encounter outright errors.  
In most cases, these issues arise because the LLM used in MCP exceeded its **context limit** due to the large amount of input.

There are several possible solutions to this problem, but one practical approach is to switch to a model with a **larger context window**.

This time, let's proceed by using the `gpt-4.1-mini` model, which supports a larger input context and is better suited for handling longer queries.

```python
kg_engine = KGQueryEngineWithMCP(mcp_tool, model='gpt-4.1-mini')
await kg_engine.init_agent()
```
```python
print("Question: ", question_with_time)

response = await kg_engine.query(
    question_with_time,
    verbose=True,
)
print(response)
```

In [None]:
### YOUR CODE HERE ###

kg_engine = KGQueryEngineWithMCP(mcp_tool, model='gpt-4.1-mini')
await kg_engine.init_agent()

In [None]:
### YOUR CODE HERE ###

print("Question: ", question_with_time)

response = await kg_engine.query(
    question_with_time,
    verbose=True,
)
print(response)

### 2. Structuring a RAG Class for MCP-based Interaction

To modularize the code above and enable scalable, query-driven interaction with the MCP Server, we can refactor it into a unified `RAG` class.

### Design Motivation

- MCP-based clients require **rich, well-structured input** to maximize tool usage accuracy.
- The CRAG dataset provides us with structured components such as:
  - Query (question)
  - Context (retrieved passages)
  - Metadata (timestamps, entities, etc.)

By encapsulating this into a class, we can:
1. **Generate precise questions** from structured CRAG input
2. **Run those questions through the MCP-connected agent**
3. **Return clean, tool-integrated answers**

---

### Key Components

| Method | Purpose |
|--------|---------|
| `__init__` | Initialize MCP client, tools, and LLM agent |
| `retrieve(query, query_time, search_results, topk)` | Build a natural-language query from the CRAG sample (question plus query time) and retrieve results from the knowledge graph through the MCP agent |
| `generate_response(query, query_time, retrieved_results)` | Run LLM inference on the retrieved results to produce the final answer |
| `inference(query, search_results, query_time, topk)` | Combine the two methods above into a single call that performs the RAG QA task |

```python
class RAGwithMCP:
    def __init__(self, mcp_tool):
        # Reuse the KG query engine defined above, backed by the MCP tool spec.
        self.mcp_application = KGQueryEngineWithMCP(mcp_tool, model='gpt-4.1-mini')
        self.reader = Reader()

    async def retrieve(self, query: str, query_time: str, search_results: list, topk: int):
        # `search_results` and `topk` are not used by this MCP-based retriever.
        # Re-initializing the agent gives every query a fresh context.
        await self.mcp_application.init_agent()

        full_query = f"""Query: {query}
Query time: {query_time}"""

        mcp_result = await self.mcp_application.query(full_query, verbose=False)
        return mcp_result

    def generate_response(self, query: str, query_time: str, retrieved_results: str):
        # The Reader iterates over a list of chunks, so wrap the single MCP result string.
        answer = self.reader.generate_response(query, query_time, [retrieved_results])
        return answer

    async def inference(self, query: str, search_results: list, query_time: str, topk: int):
        retrieved_results = await self.retrieve(query, query_time, search_results, topk)
        answer = self.generate_response(query, query_time, retrieved_results)
        return {
            "retrieved_results": retrieved_results,
            "answer": answer
        }

rag_mcp = RAGwithMCP(mcp_tool)
```

In [None]:
### YOUR CODE HERE ###

from llama_index.llms.openai import OpenAI

class RAGwithMCP:
    def __init__(self, mcp_tool):
        self.mcp_application = KGQueryEngineWithMCP(mcp_tool, model='gpt-4.1-mini')
        self.reader = Reader()

    async def retrieve(self, query: str, query_time: str, search_results: list, topk: int):
        await self.mcp_application.init_agent()

        full_query = f"""Query: {query}
Query time: {query_time}"""

        mcp_result = await self.mcp_application.query(full_query, verbose=False)
        return mcp_result

    def generate_response(self, query: str, query_time: str, retrieved_results: str):
        # The Reader iterates over a list of chunks, so wrap the single MCP result string.
        answer = self.reader.generate_response(query, query_time, [retrieved_results])
        return answer

    async def inference(self, query: str, search_results: list, query_time: str, topk: int):
        retrieved_results = await self.retrieve(query, query_time, search_results, topk)
        answer = self.generate_response(query, query_time, retrieved_results)
        return {
            "retrieved_results": retrieved_results,
            "answer": answer
        }

rag_mcp = RAGwithMCP(mcp_tool)

In [None]:
### YOUR CODE HERE ###

result = await rag_mcp.inference(
    query=question,
    search_results=[],
    query_time=query_time,
    topk=5
)

print("Retrieved Results:\n", result["retrieved_results"])
print("------------------")
print("Final Answer:\n", result["answer"])

In [None]:
### YOUR CODE HERE ###

for i in range(3):
    rag_output = await rag_mcp.inference(
        query=question,
        search_results=[],
        query_time=query_time,
        topk=5
    )
    print(rag_output['answer'].lower())

In this notebook, we explored how MCP (Model Context Protocol) can be used to build a flexible and modular tool-augmented RAG pipeline.

While MCP allows LLMs to dynamically select and invoke tools using a standardized interface, real-world usage has revealed several practical limitations:

- **Context Overflow**: Tool outputs are inserted directly into the model's input context. If the result is too long, it may exceed the model’s token limit and lead to failure.
- **Limited Error Recovery**: When tool execution fails or exceeds limits, LLMs often cannot recover or retry unless explicitly guided to do so.
- **Debugging Difficulty**: Since tool selection and reasoning are tightly coupled inside the model, it is difficult to trace what went wrong without detailed logs or event streaming.
- **Latency and Reliability**: Each tool call requires a round-trip to the server. In multi-step workflows, this can introduce significant delay and failure points.
- **Loss of Developer Control**: Tool behavior is driven by the LLM’s interpretation of the prompt and available tools, making behavior harder to predict or constrain.
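
For the context overflow point in particular, one possible mitigation (besides switching to a larger model, as done above) is to cap the size of a tool result before it reaches the LLM. The sketch below is a minimal illustration under stated assumptions: `truncate_tool_output` and its `max_chars` budget are hypothetical names, and the budget counts characters as a rough proxy for tokens.

```python
def truncate_tool_output(output: str, max_chars: int = 2000) -> str:
    """Cut a tool result down to `max_chars` characters and mark the cut."""
    if len(output) <= max_chars:
        return output
    marker = "\n...[truncated]"
    # Reserve room for the marker so the result never exceeds the budget.
    return output[: max_chars - len(marker)] + marker

long_result = "x" * 5000  # stands in for an oversized knowledge graph response
short_result = truncate_tool_output(long_result, max_chars=100)
print(len(short_result))
print(short_result.endswith("[truncated]"))
```

Truncation trades completeness for reliability: the answer may miss facts that were cut off, so the marker is kept visible to let the LLM know that the reference is partial.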

Understanding these trade-offs is key to using MCP effectively in real-world LLM applications.