# 📓 TASK #1: WEB-BASED RETRIEVAL SUMMARIZATION

In this section, we will tackle the first task of the KDD Cup: Web-based Retrieval Summarization. Since the KDD Cup CRAG benchmark fundamentally focuses on RAG (Retrieval-Augmented Generation), our approach will also be based on the RAG framework.

Before building the RAG system, let’s first clarify the problem we need to solve. For each question, participants receive 5 web pages that may contain relevant information. The objective is to measure how well a system can identify this information and condense it into an accurate answer.

<br/>

<img src="https://i.imgur.com/jlNdBmD.png">

By looking at the diagram above, you will get an idea of what problem we need to solve. Additionally, since you have already reviewed the input data in previous sessions, you are well aware of the types of data you will be working with.

As you may recall, the CRAG dataset contains many challenging questions, and as we observed earlier, it is difficult for the LLM alone to solve these problems effectively. Therefore, we will explore how these types of problems can be addressed using the RAG framework.

Specifically, this practice session consists of four sections:

- I. Implementing a Retriever
- II. Implementing a Reader
- III. Implementing a RAG
- IV. Error case analysis

## I. Implementing a Retriever

Before building the RAG system, the first essential component we need is the **Retriever**. As you already know, the retriever is crucial to an effective RAG system. If the retriever finds enough relevant information and passes it to the LLM, the probability that the LLM generates the correct answer increases significantly.

In the previous session, we only experimented with the default retriever provided by `LlamaIndex` and made minor adjustments, such as modifying the chunk size.   
This time, however, we will define the retriever in a more low-level manner and explore its use step by step.

This section is divided into the following four stages.

1. Preparing Python Packages
2. Implementing a Chunk Extractor
3. Implementing a Retriever
4. Implementing a Retriever with LlamaIndex





### 1. Preparing Python Packages

As always, we will start by installing and importing the necessary libraries for use.  

At this point, we will also set the values for the global variables that will be needed later. The significance of these values will be explained in detail in the following steps.

```Python
!pip install openai==1.55.3 --quiet
!pip install llama-index --quiet
!pip install llama-index-readers-wikipedia wikipedia --quiet
!pip install llama-index-llms-openai --quiet
!pip install llama-index-embeddings-huggingface --quiet
!pip install packaging==23.2 openai --quiet
!pip install langchain "nltk>=3.8.1" streamlit==1.35.0 watchdog kubernetes==26.1.0 --quiet

!pip install blingfire beautifulsoup4 sentence-transformers ray --quiet
!pip install textwrap3 --quiet
!pip install scikit-learn --quiet
!pip uninstall numpy -y
!pip install numpy==1.26.4 --quiet
```
```Python
import numpy as np
import ray
import bz2
import json
import torch
from blingfire import text_to_sentences_and_offsets
from collections import defaultdict
from typing import Any, Dict, List
from bs4 import BeautifulSoup
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..." #copy your api key

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, get_response_synthesizer
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.node_parser import SentenceSplitter

import textwrap
```

```Python
# Define the number of context sentences to consider for generating an answer.
NUM_CONTEXT_SENTENCES = 20
# Set the maximum length for each context sentence (in characters).
MAX_CONTEXT_SENTENCE_LENGTH = 1000
# Set the maximum context references length (in characters).
MAX_CONTEXT_REFERENCES_LENGTH = 4000
# Sentence Transformer Parameters
SENTENTENCE_TRANSFORMER_BATCH_SIZE = 128 # TUNE THIS VARIABLE depending on the size of your embedding model and GPU mem available
```


### 2. Implementing a Chunk Extractor

This time, we will define and use a `Chunk Extractor`. As you observed during the first practice session, the Chunk Extractor is a function needed to split the search results into appropriately sized pieces for use.

Since search results are essentially `HTML` files, we will first define the `parse_htmls` function to remove HTML tags. Then, we will define the `extract_chunks` function, which splits the text extracted from the HTML into chunks. To avoid losing information at the chunk boundaries, the text will be split at the sentence level.

```Python
def parse_htmls(search_results):
    all_documents = []
    
    # Process each HTML text from the search results to extract text content.
    for html_text in search_results:

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_text["page_result"], features="lxml")
        text = soup.get_text(" ", strip=True)  # Use space as a separator, strip whitespaces
        all_documents.append(text)
    
    return all_documents

def extract_chunks(all_documents):
    # Initialize a list to hold all sentences extracted from the documents.
    all_chunks = []

    for document in all_documents:
        if not document:
            # Skip documents with no extracted text; an empty chunk has nothing to embed.
            continue

        # Extract the (start, end) offsets of each sentence in the document
        _, offsets = text_to_sentences_and_offsets(document)

        # Iterate through the offsets and extract each sentence as one chunk
        for start, end in offsets:
            # Extract the sentence and limit its length
            chunk = document[start:end][:MAX_CONTEXT_SENTENCE_LENGTH]
            all_chunks.append(chunk)

    return all_chunks
```

Now, let’s use the two functions to process the data by loading the search results and splitting them into chunks. As mentioned earlier, you will need to mount your Google Drive to access the dataset.

Run the code below to test the example:

```Python
from google.colab import drive

drive.mount('/content/drive')
```
```Python
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"
with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        
        # Get documents
        all_documents = parse_htmls(item["search_results"])
        
        # Get chunks
        all_chunks = extract_chunks(all_documents)
        
        print("=========== Document ===========")
        print("# of Document Characters: ", len(all_documents[0]))
        print()
        print(all_documents[0])
        print()
        print("=========== Chunk ===========")
        print("# of Chunk Characters: ", len(all_chunks[0]))
        print()
        print(all_chunks[0])
        print()
        break
```

In this test, a single chunk was nearly **100 times** shorter than the full document it came from. Passing a few well-chosen chunks instead of entire pages therefore lets us hand only the relevant parts of the `search_results` to the LLM.

Of course, we cannot guarantee that the retriever will always retrieve relevant results. However, if we use entire documents as the retrieval unit, it will take a long time to compute embeddings, and information loss may occur during that process.

Additionally, the length of a chunk will usually not match the value of `MAX_CONTEXT_SENTENCE_LENGTH`. Because we split the text with the `text_to_sentences_and_offsets` function, each chunk is a single sentence, and the constant only acts as an upper bound that truncates unusually long sentences.
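To make the truncation behaviour concrete, here is a minimal sketch of sentence-level chunking. It assumes a naive regular-expression splitter (`split_sentences`, a hypothetical stand-in for `text_to_sentences_and_offsets`) and a deliberately small `max_len` so that the cut is visible; the sentence boundaries found by BlingFire on real pages will differ.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def to_chunks(text, max_len):
    """Return one chunk per sentence, truncated to at most max_len characters."""
    return [sentence[:max_len] for sentence in split_sentences(text)]

demo_text = "CRAG is a benchmark. It targets retrieval-augmented generation! Does it include web pages?"
demo_chunks = to_chunks(demo_text, max_len=30)

for chunk in demo_chunks:
    # Short sentences keep their own length; only the long one hits the limit.
    print(len(chunk), repr(chunk))
```

Only the second sentence is longer than the limit, so it is the only chunk whose length equals `max_len`.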

### 3. Implementing a Retriever

This time, we will implement a **Retriever** using the Chunk Extractor we defined earlier, without relying on AI frameworks like LlamaIndex.  

For this implementation, the following components are required:

<br/>

1.	**Chunk extractor**: Used to split the input search_results into chunks.
2.	**Embedding model**: Used to generate embeddings for the chunks and the query.
3.	**Similarity metric**: Measures the similarity between embeddings. We will use cosine similarity here.

Using these components, let’s implement the `BaseRetriever` with the following code:

```Python
class BaseRetriever:
    def __init__(self):
        self.client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    def embed_text(self, texts, batch_size=128):
        """Generate embeddings using OpenAI's embedding model."""
        if isinstance(texts, str):
            texts = [texts]

        embeddings = []
        # Send the texts in batches, because a single request can only carry a limited number of inputs.
        for start in range(0, len(texts), batch_size):
            response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=texts[start:start + batch_size]
            )
            # Each item in response.data holds the embedding of one input text.
            embeddings.extend(item.embedding for item in response.data)

        return np.array(embeddings)

    def retrieve(self, query, search_results, topk):
        # Get documents
        all_documents = parse_htmls(search_results)

        # Get chunks, dropping empty ones because they carry no content to embed.
        all_chunks = [chunk for chunk in extract_chunks(all_documents) if chunk.strip()]
        if not all_chunks:
            return []

        # Generate embeddings for all chunks and the query.
        all_embeddings = self.embed_text(all_chunks)
        query_embedding = self.embed_text(query)[0]  # Single query embedding

        # Calculate cosine similarity between the query and chunk embeddings, and select the top chunks.
        cosine_scores = np.dot(all_embeddings, query_embedding) / (
            np.linalg.norm(all_embeddings, axis=1) * np.linalg.norm(query_embedding)
        )
        top_k_indices = (-cosine_scores).argsort()[:topk]
        top_k_chunks = [all_chunks[i] for i in top_k_indices]

        return top_k_chunks
```



The retriever determines the final chunks to return through three main steps.  

1. **First step**: The retriever takes the query, the `search_results`, and a variable `topk` (which determines how many chunks to return) as inputs. It then extracts chunks from the `search_results`.
2. **Second step**: The extracted chunks are converted into embeddings using an embedding model, giving one embedding vector per chunk. The query is converted into an embedding in the same way.
3. **Third step**: **Cosine similarity** between the query’s embedding and each chunk’s embedding is calculated to determine which chunks are most similar to the query.

Through this process, our `BaseRetriever` retrieves and returns the `topk` chunks with the highest similarity.  
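The ranking step can also be tried without any API call. The following is a minimal sketch on hand-made two-dimensional vectors; the helper names `cosine_similarity` and `top_k`, as well as the toy vectors, are assumptions made for illustration and are not part of `BaseRetriever`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, chunks, k):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk)
              for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

# Toy data: the first axis loosely stands for "film", the second for "sports".
toy_chunks = ["about films", "about sports", "about film awards"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_query = [1.0, 0.0]

toy_top = top_k(toy_query, toy_vecs, toy_chunks, k=2)
print(toy_top)
```

Because cosine similarity ignores vector length and compares only direction, the chunk pointing along the orthogonal "sports" axis scores zero and is left out.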

Here is the code to verify the process:

```Python
retriever = BaseRetriever()
topk = 5
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        retrieved_results = retriever.retrieve(item['query'], item['search_results'], topk)
        break

print("retrieved results:")
print()
for rank, retrieved_result in enumerate(retrieved_results):
    print(f"rank {rank+1}: {retrieved_result}")
    print()
```



### 4. Implementing a Retriever with LlamaIndex

You may recall that in the Day 1 practice, we defined a retriever using `LlamaIndex`.

In this exercise, we will again define a retriever using LlamaIndex. To create a retriever with `LlamaIndex`, we must first build an index. To build the index, we need to decide which data to use – in this case, we will use the `search_results`.

Follow the code below to declare the retriever:

```Python
from llama_index.core.schema import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

class LlamaIndexRetriever:
    def __init__(self):
        self.parser = SentenceSplitter(chunk_size=512, chunk_overlap=0)

    def retrieve(self, query, search_results, topk):
        # Wrap the text of each page in a Document.
        # A page with no extracted text becomes an empty Document as a placeholder.
        documents = [Document(text=document or "") for document in parse_htmls(search_results)]

        # Split documents into chunks & create vector index
        base_index = VectorStoreIndex.from_documents(documents=documents, transformations=[self.parser])

        # Execute query
        base_retriever = base_index.as_retriever(similarity_top_k=topk)
        retrieved_nodes = base_retriever.retrieve(query)

        retrieved_results = [retrieved_node.node.get_content().strip() for retrieved_node in retrieved_nodes]

        return retrieved_results
```


By leveraging an external AI framework like LlamaIndex, we can see that the code has become significantly more concise and streamlined.

Now, let’s practice using the same approach with an example to verify how it works in action!

```Python
retriever = LlamaIndexRetriever()
topk = 5
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        retrieved_results = retriever.retrieve(item['query'], item['search_results'], topk)
        break

print("retrieved results:")
print()
for rank, retrieved_result in enumerate(retrieved_results):
    print(f"rank {rank+1}: {retrieved_result}")
    print()
```

## II. Implementing a Reader

In this section, we will design the **Reader**.

What are the most important considerations when creating a Reader? The most crucial factor is likely the choice of LLM. Model size, performance on reasoning benchmarks, and cost all typically feed into this choice.

However, since we have limited options for the LLMs we can use in this practice session, this will not be a consideration for us here.

So, what’s the next most important factor? **Prompt design**. It is well known that well-designed prompts lead to better results from the LLM.

Moreover, setting an appropriate prompt becomes even more critical for the CRAG dataset. In this task, the LLM must be able to answer “I don’t know” if it encounters something it is unsure about or cannot answer confidently. To achieve this, the prompt must be specifically designed to guide the LLM to behave in this manner.

Therefore, this exercise will be conducted in the following three main stages:

1. Design a Prompt Template
2. Implement a Prompt Generator
3. Implement a Reader

### 1. Design a Prompt Template

To design an effective prompt template, we need to carefully consider certain factors.

1.	The response must be generated based on the given question and references.
2.	In the CRAG benchmark, answers should not be too long or verbose. During evaluation, only the first 75 tokens are used for scoring, so the response needs to be concise.
3.	The LLM must be able to recognize questions it cannot answer and respond with “I don’t know”.

Taking these factors into account, we can draft the following `system_prompt`:

```Python

system_prompt = """
You are provided with a question and various references.
Your task is to answer the question succinctly, using the fewest words possible.
If the references do not contain the necessary information to answer the question, respond with 'I don't know'.
There is no need to explain the reasoning behind your answers.
"""
```

### 2. Implement a Prompt Generator

Above, we created a system prompt. Now, we need to build a prompt generator that takes the question and reference, combines them into one, and formats it so it can be passed to the LLM.

Below is an example of a `prompt_generator` that takes a question and reference, combines them for delivery to the LLM:

```Python
def prompt_generator(query, top_k_chunks, system_prompt):
    user_message = ""
    references = ""

    if len(top_k_chunks) > 0:
        references += "# References \n"
        # Format the top chunks as a bulleted list of references.
        for chunk in top_k_chunks:
            references += f"- {chunk.strip()}\n"

    # Limit the length of references to fit the model's input size.
    references = references[:MAX_CONTEXT_REFERENCES_LENGTH]

    user_message += f"{references}\n------\n\n"
    user_message += "Using only the references listed above, answer the following question: \n"
    user_message += f"Question: {query}\n"

    llm_input = [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": user_message},
    ]

    return llm_input

```
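One side effect of slicing `references` at a fixed character count is that the last reference can be cut off mid-sentence. As a possible alternative, the sketch below adds whole chunks until the budget would be exceeded. The function `build_references` and its `max_chars` parameter are hypothetical names introduced for illustration; they are not part of the original `prompt_generator`.

```python
def build_references(chunks, max_chars):
    """Join chunks as bullet lines, keeping only whole chunks that fit within max_chars."""
    header = "# References \n"
    lines = []
    used = len(header)

    for chunk in chunks:
        line = f"- {chunk.strip()}\n"
        if used + len(line) > max_chars:
            break  # stop before a chunk would be cut in half
        lines.append(line)
        used += len(line)

    # Return an empty string when not even one chunk fits.
    return header + "".join(lines) if lines else ""

demo_refs = build_references(
    ["first chunk", "second chunk", "a much longer third chunk"],
    max_chars=45,
)
print(demo_refs)
```

The trade-off is that a chunk slightly over the budget is dropped entirely, so this variant may pass less text to the LLM than plain slicing does.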

### 3. Implement a Reader

Now that we have created a function to generate the necessary prompts for the Reader, we will proceed to define the Reader itself and set up the components needed for RAG creation.

Follow the code below to implement it.

```Python
from openai import OpenAI

oai_client = OpenAI()

class Reader:
    def __init__(self):
        self.system_prompt = """
        You are provided with a question and various references.
        Your task is to answer the question succinctly, using the fewest words possible.
        If the references do not contain the necessary information to answer the question, respond with 'I don't know'.
        There is no need to explain the reasoning behind your answers.
        """

    def generate_response(self, query: str, top_k_chunks: list) -> str:
        """
        Generate answer from context.
        """
        llm_input = self.prompt_generator(query, top_k_chunks)
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=llm_input,
        )
        return completion.choices[0].message.content

    def prompt_generator(self, query, top_k_chunks):
        user_message = ""
        references = ""

        if len(top_k_chunks) > 0:
            references += "# References \n"
            # Format the top chunks as a bulleted list of references.
            for chunk in top_k_chunks:
                references += f"- {chunk.strip()}\n"

        # Limit the length of references to fit the model's input size.
        references = references[:MAX_CONTEXT_REFERENCES_LENGTH]

        user_message += f"{references}\n------\n\n"
        user_message += "Using only the references listed above, answer the following question: \n"
        user_message += f"Question: {query}\n"

        llm_input = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_message},
        ]

        return llm_input
```


Now, let’s check the results through an actual example.

```Python
reader = Reader()
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")
        print(f"ground truth: {item['answer']}")
        print()
        answer = reader.generate_response(item['query'], [])
        break

print(f"answer: {answer}")
```



## III. Implementing a RAG

At this point, we have defined both the Reader and the Retriever, and we have verified their inputs and outputs.

Now, let’s combine these two components into a functional RAG system that we can use.

```Python
class RAG:
    def __init__(self):
        self.retriever = LlamaIndexRetriever()
        self.reader = Reader()
  
    def inference(self, query, search_results, topk):
        # 1. retrieve relevant chunks
        retrieved_results = self.retriever.retrieve(query, search_results, topk)

        # 2. answer the question based on the retrieved chunks
        answer = self.reader.generate_response(query, retrieved_results)

        return answer, retrieved_results

```
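Because `RAG` only needs a `retrieve` method and a `generate_response` method, the wiring can be checked offline before spending API calls. The sketch below assumes two hypothetical stand-ins, `KeywordRetriever` and `EchoReader`, and a variant class `PluggableRAG` that receives its components as arguments. None of these names appear in the original notebook, and the stand-ins take plain strings instead of the HTML `search_results`.

```python
class KeywordRetriever:
    """Hypothetical stand-in: ranks texts by how many query words they share."""

    def retrieve(self, query, search_results, topk):
        words = set(query.lower().split())
        ranked = sorted(
            search_results,
            key=lambda text: len(words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:topk]


class EchoReader:
    """Hypothetical stand-in: returns the best chunk, or abstains when there is none."""

    def generate_response(self, query, top_k_chunks):
        return top_k_chunks[0] if top_k_chunks else "I don't know"


class PluggableRAG:
    """Same flow as RAG, but the retriever and reader are passed in."""

    def __init__(self, retriever, reader):
        self.retriever = retriever
        self.reader = reader

    def inference(self, query, search_results, topk):
        # 1. retrieve relevant chunks
        retrieved_results = self.retriever.retrieve(query, search_results, topk)
        # 2. answer the question based on the retrieved chunks
        answer = self.reader.generate_response(query, retrieved_results)
        return answer, retrieved_results


demo_rag = PluggableRAG(KeywordRetriever(), EchoReader())
demo_answer, demo_chunks = demo_rag.inference(
    "who won the oscar",
    ["weather is nice", "the oscar went to a film"],
    topk=1,
)
print(demo_answer)
```

Passing the components in also makes it straightforward to compare `BaseRetriever` and `LlamaIndexRetriever` under the same reader.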


Let’s now verify whether the RAG system we defined works as intended or not.

Using the code below, we will test the system on a total of 10 data points. You can check each result yourself and evaluate whether the RAG performs well enough.

```
rag = RAG()
topk = 5
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat > 9:
            break
        
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer = rag.inference(item['query'], item['search_results'], topk)[0]
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        repeat += 1
```





In [None]:
### YOUR CODE HERE ###

rag = RAG()
topk = 5
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat > 9:
            break

        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer = rag.inference(item['query'], item['search_results'], topk)[0]
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        repeat += 1

## IV. Error case analysis

Some of you may be satisfied with the experimental results above, while others may not. However, few would believe that the RAG system produced the correct answer for all questions.

Therefore, before formally evaluating the RAG system, we will check which questions it answered incorrectly and try to understand why those results occurred. To do this, we need to classify the data into two categories:

1. Questions the RAG answered correctly.
2. Questions the RAG answered incorrectly.

Ultimately, before moving on to Task 2 in the next session, we will execute the RAG implemented for Task 1 and analyze which queries the system struggles to answer correctly.
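The two-way split described above can be sketched as follows. This is only a rough proxy under stated assumptions: `normalize`, `is_correct`, and `split_by_correctness` are hypothetical helpers, each record is assumed to hold a `prediction` and an `answer` string, and a prediction counts as correct when the normalized ground truth appears inside it. Such string matching can misjudge short numeric answers and paraphrases, so manual inspection is still needed.

```python
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def is_correct(prediction, ground_truth):
    """Crude check: the normalized ground truth occurs in the prediction."""
    truth = normalize(ground_truth)
    return bool(truth) and truth in normalize(prediction)


def split_by_correctness(records):
    """Partition records into (correct, incorrect) lists."""
    correct, incorrect = [], []
    for record in records:
        if is_correct(record["prediction"], record["answer"]):
            correct.append(record)
        else:
            incorrect.append(record)
    return correct, incorrect


# Illustrative records; the predictions are made up for this sketch.
sample_records = [
    {"prediction": "The winner was Finding Nemo.", "answer": "Finding Nemo"},
    {"prediction": "The Incredibles", "answer": "Finding Nemo"},
]

correct_records, incorrect_records = split_by_correctness(sample_records)
print(len(correct_records), len(incorrect_records))
```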

To begin, let’s check how well the Reader alone performs on the following question, without using search results.
<br/>  
Question: **In 2004, which animated film was recognized with the best animated feature film oscar?**  
Answer: **Finding Nemo**
<br/>


```
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat != 5:
            repeat += 1
            continue
        
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer = reader.generate_response(item['query'], [])
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        repeat += 1
        break
```



In [None]:
### YOUR CODE HERE ###

dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat != 5:
            repeat += 1
            continue

        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer = reader.generate_response(item['query'], [])
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        repeat += 1
        break

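Although the correct answer is **“Finding Nemo”**, the model generated the incorrect answer, **“The Incredibles”**. One plausible reason is a mix-up between release year and ceremony year: *The Incredibles* was released in 2004 and received the award at the ceremony held in 2005, whereas *Finding Nemo* received it at the 2004 ceremony.

Next, let’s check the generated result when search results are provided.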


```
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

rag = RAG()
topk = 5

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat != 5:
            repeat += 1
            continue
        
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer, retrieved_results = rag.inference(item['query'], item['search_results'], topk)
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print("retrieved results:")
        for rank, retrieved_result in enumerate(retrieved_results):
            print(f"{rank}: {retrieved_result}")
        print()
        repeat += 1
        break
```



In [None]:
### YOUR CODE HERE ###

dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

rag = RAG()
topk = 5

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat != 5:
            repeat += 1
            continue

        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer, retrieved_results = rag.inference(item['query'], item['search_results'], topk)
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print("retrieved results:")
        for rank, retrieved_result in enumerate(retrieved_results):
            print(f"{rank}: {retrieved_result}")
        print()
        repeat += 1
        break

The next queries ask for finance-related information.

Such information is typically stored in structured formats, such as tables or knowledge graphs. When a page is reduced to plain text, as with web search results, the row and column relationships are usually lost, so it becomes hard to tell which value belongs to which entity and which metric.

For instance, financial data such as Microsoft's ex-dividend date, a P/E ratio, or earnings per share is usually presented as numbers or dates inside tables. Flattened text no longer carries that structure, which makes such values harder to retrieve and use.
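To make the contrast concrete, here is a small sketch. The two values are the ground-truth answers quoted in this section; the table layout, the flattened string, and the helper names `lookup` and `first_number_after` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical structured view: one row per ticker, one column per metric.
table = {
    "DKS": {"pe_ratio": 13.75},
    "AUPH": {"eps": 0.4},
}


def lookup(ticker, metric):
    """With structure, a value is addressed by (row, column)."""
    return table.get(ticker, {}).get(metric)


# Illustrative text after the same table is flattened; "-" marks an empty cell.
flattened = "Ticker P/E EPS DKS 13.75 - AUPH - 0.4"


def first_number_after(text, ticker):
    """Naive heuristic: take the first number that follows the ticker."""
    pattern = rf"\b{re.escape(ticker)}\b\D*?(\d+(?:\.\d+)?)"
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


print(lookup("AUPH", "pe_ratio"))               # no P/E stored for AUPH
print(first_number_after(flattened, "AUPH"))    # returns the EPS value instead
```

The structured lookup can report that a metric is missing, while the text heuristic returns whichever number happens to come next, without knowing which column it came from.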

Let us explore whether RAG (Retrieval-Augmented Generation) can effectively answer the following queries using only web search results.

<br/>  
Question: **What is the ex-dividend date of microsoft in the 1st qtr of 2024**  
Answer: **The ex-dividend date of microsoft in the 1st qtr of 2024 is feb 14, 2024**
<br/>

<br/>  
Question: **I'm looking for the p/e ratio of dks. would you happen to know what it is?**  
Answer: **13.75**
<br/>

<br/>  
Question: **What's auph's earnings per share?**  
Answer: **0.4**
<br/>



```Python
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

rag = RAG()
topk = 5

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat > 64:
            break  # all target rows have been processed; skip the rest of the file
        if repeat not in [14, 53, 64]:
            repeat += 1
            continue

        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer, retrieved_results = rag.inference(item['query'], item['search_results'], topk)
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print("retrieved results:")
        for rank, retrieved_result in enumerate(retrieved_results):
            print(f"{rank}: {retrieved_result}")
        print()
        repeat += 1
```



In [None]:
### YOUR CODE HERE ###

dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

rag = RAG()
topk = 5

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat > 64:
            break  # all target rows have been processed; skip the rest of the file
        if repeat not in [14, 53, 64]:
            repeat += 1
            continue

        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer, retrieved_results = rag.inference(item['query'], item['search_results'], topk)
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print("retrieved results:")
        for rank, retrieved_result in enumerate(retrieved_results):
            print(f"{rank}: {retrieved_result}")
        print()
        repeat += 1