# 📓 TASK #2: KNOWLEDGE GRAPH AND WEB AUGMENTATION

This task introduces mock APIs to access information from underlying mock Knowledge Graphs (KGs), with structured data possibly related to the questions. Participants use mock APIs, inputting parameters derived from the questions, to retrieve relevant data for answer formulation. The evaluation focuses on the systems' ability to query structured data and integrate information from various sources into comprehensive answers.

<img src="https://i.ibb.co/cXqvBZq/2024-12-07-3-09-46.png">

### Steps in RAG with Mock API
1. The model receives an input query to which a response is required.
2. The retriever retrieves chunks from the received web pages that are pertinent to the input query.
3. The Mock API retrieves information from the Mock KG that is pertinent to the input query.
4. The large language model then generates a response, informed by both the original query and the retrieved information.
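
The four steps above can be sketched as follows. This is only a toy illustration, not the implementation built later in this session: `retrieve_web_chunks`, `query_mock_kg`, and `build_prompt` are hypothetical names, the retriever is a naive word-overlap ranking, and the KG is a hard-coded dictionary with invented data.

```python
def retrieve_web_chunks(query, chunks, top_k=2):
    """Step 2 (stand-in): rank chunks by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def query_mock_kg(entity, kg):
    """Step 3 (stand-in): look the entity up in a toy knowledge graph."""
    return kg.get(entity.lower(), {})


def build_prompt(query, web_chunks, kg_result):
    """Input to step 4: combine the query with both kinds of retrieved context."""
    references = "\n".join(f"- {chunk}" for chunk in web_chunks)
    return (
        f"# Web references\n{references}\n\n"
        f"# KG result\n{kg_result}\n\n"
        f"# Question\n{query}\n"
    )


# Toy data, invented for illustration only.
chunks = [
    "Microsoft develops the Windows operating system.",
    "The weather in Seattle is often rainy.",
    "Microsoft stock trades under the ticker MSFT.",
]
toy_kg = {"microsoft": {"ticker": "MSFT"}}

query = "What is the ticker of Microsoft stock?"
prompt = build_prompt(
    query,
    retrieve_web_chunks(query, chunks),
    query_mock_kg("Microsoft", toy_kg),
)
print(prompt)  # this string is what would be sent to the LLM in step 4
```

In the real pipeline the word-overlap ranking would be replaced by an embedding-based retriever and the dictionary lookup by calls to the Mock API.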

This practice class comprises four sections.  
  
### I. Implementing a Mock KG Query Engine
### II. Implementing a Reader
### III. Implementing an LLM + Mock KG
### IV. Implementing an LLM + Web Search Results + Mock KG

## I. Implementing a Mock KG Query Engine

As you observed in the previous session, we can send specific queries to the Mock API connected to the Knowledge Graph (KG). The results obtained from the KG through the Mock API can then be utilized in the LLM’s inference stage.

In this session we connect the existing RAG pipeline to the KG by building a Mock KG query engine.

The engine generates a query from an input question **belonging to the finance domain**, sends it to the KG, and retrieves the relevant information.

The process will be carried out in the following five steps.

1. Preparing Python Packages
2. Preparing Mock APIs
3. Implementing a Query Generator
4. Implementing a Query Executor
5. Implementing a Mock KG Query Engine

### 1. Preparing Python Packages

As always, we first install and import the necessary Python packages.  

The important point is that the **external IP address** of the KG we will use must be set correctly. If the configuration is incorrect, errors may occur in subsequent code execution.
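
Before running the later cells, it can help to check that the server address is at least well formed. The sketch below is an assumption-laden helper, not part of the original notebook: `looks_like_server_url` is a hypothetical name, and requiring an explicit port is a choice made here because the addresses in this notebook all include one. It checks the format only, not whether the server is reachable.

```python
from urllib.parse import urlparse


def looks_like_server_url(url):
    """Return True if url has an http(s) scheme, a host, and an explicit port."""
    parsed = urlparse(url)
    try:
        port = parsed.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return False
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.hostname)
        and port is not None
    )


print(looks_like_server_url("http://34.64.232.38:8000"))  # well-formed address
print(looks_like_server_url("34.64.232.38:8000"))         # scheme is missing
```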

```Python
!pip install llama-index --quiet
!pip install llama-index-readers-wikipedia wikipedia --quiet
!pip install llama-index-llms-openai --quiet
!pip install llama-index-embeddings-huggingface --quiet
!pip install packaging==23.2 trulens trulens-providers-openai openai --quiet
!pip install langchain "nltk>=3.8.1" streamlit==1.35.0 watchdog kubernetes==26.1.0 --quiet

!pip install blingfire beautifulsoup4 sentence-transformers ray --quiet
!pip install textwrap3 --quiet
!pip install scikit-learn --quiet
!pip uninstall numpy -y
!pip install numpy==1.26.4 --quiet
!pip uninstall pandas scipy transformers -y
!pip install pandas scipy transformers --quiet
```
```Python
from typing import List
import requests
import numpy as np
import bz2
import json
import torch
from blingfire import text_to_sentences_and_offsets
from collections import defaultdict
from typing import Any, Dict, List
from bs4 import BeautifulSoup
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..." #copy your api key
os.environ["CRAG_SERVER"] = "http://34.64.232.38:8000"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, get_response_synthesizer
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.node_parser import SentenceSplitter

import textwrap

import nltk
nltk.download('punkt')
```

```Python
# Define the number of context sentences to consider for generating an answer.
NUM_CONTEXT_SENTENCES = 20
# Set the maximum length for each context sentence (in characters).
MAX_CONTEXT_SENTENCE_LENGTH = 1000
# Set the maximum context references length (in characters).
MAX_CONTEXT_REFERENCES_LENGTH = 4000
# Sentence Transformer Parameters
SENTENTENCE_TRANSFORMER_BATCH_SIZE = 128 # TUNE THIS VARIABLE depending on the size of your embedding model and GPU mem available
```
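
To make the role of these limits concrete, here is a minimal sketch of how they might be applied when retrieved sentences are packed into a prompt. `format_references` is a hypothetical helper and the exact truncation order is an assumption; the batch-size constant is not used here because it only matters when computing embeddings.

```python
NUM_CONTEXT_SENTENCES = 20
MAX_CONTEXT_SENTENCE_LENGTH = 1000
MAX_CONTEXT_REFERENCES_LENGTH = 4000


def format_references(sentences):
    """Keep the first N sentences, cap each one, then cap the joined text."""
    kept = sentences[:NUM_CONTEXT_SENTENCES]
    lines = [f"- {s.strip()[:MAX_CONTEXT_SENTENCE_LENGTH]}" for s in kept]
    return "\n".join(lines)[:MAX_CONTEXT_REFERENCES_LENGTH]


print(format_references(["First retrieved sentence. ", "Second retrieved sentence."]))
```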


In [None]:
### YOUR CODE HERE ###

!pip install llama-index --quiet
!pip install llama-index-readers-wikipedia wikipedia --quiet
!pip install llama-index-llms-openai --quiet
!pip install llama-index-embeddings-huggingface --quiet
!pip install packaging==23.2 trulens trulens-providers-openai openai --quiet
!pip install langchain "nltk>=3.8.1" streamlit==1.35.0 watchdog kubernetes==26.1.0 --quiet

!pip install blingfire beautifulsoup4 sentence-transformers ray --quiet
!pip install textwrap3 --quiet
!pip install scikit-learn --quiet
!pip uninstall numpy -y
!pip install numpy==1.26.4 --quiet
!pip uninstall pandas scipy transformers -y
!pip install pandas scipy transformers --quiet

In [None]:
### YOUR CODE HERE ###

from typing import List
import requests
import numpy as np
import bz2
import json
from blingfire import text_to_sentences_and_offsets
from collections import defaultdict
from typing import Any, Dict, List
from bs4 import BeautifulSoup
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..." #copy your api key
os.environ["CRAG_SERVER"] = "http://34.64.71.239:8000"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, get_response_synthesizer
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.node_parser import SentenceSplitter

import textwrap

import nltk
nltk.download('punkt')

In [None]:
### YOUR CODE HERE ###

# Define the number of context sentences to consider for generating an answer.
NUM_CONTEXT_SENTENCES = 20
# Set the maximum length for each context sentence (in characters).
MAX_CONTEXT_SENTENCE_LENGTH = 1000
# Set the maximum context references length (in characters).
MAX_CONTEXT_REFERENCES_LENGTH = 4000
# Sentence Transformer Parameters
SENTENTENCE_TRANSFORMER_BATCH_SIZE = 128 # TUNE THIS VARIABLE depending on the size of your embedding model and GPU mem available

### 2. Preparing Mock APIs

We have previously seen an example of an API that sends a query to the Mock KG and retrieves related information. Here, we will explain in more detail what types of APIs may exist.

The example below shows a Mock API used in the KDD Cup. Each method builds the URL of an endpoint on the KG server (stored in `self.server`), sends the request with the `requests` library, and returns the response parsed from **JSON**.

Here, `requests` is a Python package that simplifies making HTTP requests. Its functions, such as `requests.get` and `requests.post`, send the corresponding `GET` and `POST` requests. A full explanation requires some background in computer networks, so we skip it here; if you want to learn more, we recommend reading up on it independently.

In any case, this is the connection bridge we have been discussing with the Mock KG. Please review the code below and check which methods are available.

```Python

class CRAG(object):
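    def __init__(self, server=None):
        # Prefer an explicit argument, then the CRAG_SERVER environment variable, then the default.
        self.server = server or os.environ.get('CRAG_SERVER', "http://34.64.232.38:8000")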

    def finance_get_company_name(self, query:str):
        url = self.server + '/finance/get_company_name'
        headers={'accept': "application/json"}
        data = {'query': query}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_ticker_by_name(self, query:str):
        url = self.server + '/finance/get_ticker_by_name'
        headers={'accept': "application/json"}
        data = {'query': query}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_price_history(self, ticker_name:str):
        url = self.server + '/finance/get_price_history'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_detailed_price_history(self, ticker_name:str):
        url = self.server + '/finance/get_detailed_price_history'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_dividends_history(self, ticker_name:str):
        url = self.server + '/finance/get_dividends_history'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)
    
    def finance_get_market_capitalization(self, ticker_name:str):
        url = self.server + '/finance/get_market_capitalization'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_eps(self, ticker_name:str):
        url = self.server + '/finance/get_eps'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_pe_ratio(self, ticker_name:str):
        url = self.server + '/finance/get_pe_ratio'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_info(self, ticker_name:str):
        url = self.server + '/finance/get_info'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)
```
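
To see what one of these calls puts on the wire without contacting any server, the request can be built with the standard library and inspected. This is a sketch for illustration: `build_kg_request` is a hypothetical helper and `http://localhost:8000` is a placeholder address. `requests.post(url, json=data)` likewise serializes the dictionary to a JSON body and sets a JSON content type.

```python
import json
from urllib.request import Request


def build_kg_request(server, endpoint, query):
    """Build (but do not send) the POST request for one Mock API endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {"accept": "application/json", "Content-Type": "application/json"}
    return Request(server + endpoint, data=body, headers=headers, method="POST")


# Placeholder server address; nothing is sent over the network here.
req = build_kg_request("http://localhost:8000", "/finance/get_company_name", "microsoft")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Every method of the class follows this shape; only the endpoint path and the value placed under `query` change.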



In [None]:
### YOUR CODE HERE ###

class CRAG(object):
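    def __init__(self, server=None):
        # Prefer an explicit argument, then the CRAG_SERVER environment variable, then the default.
        self.server = server or os.environ.get('CRAG_SERVER', "http://34.64.232.38:8000")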

    def finance_get_company_name(self, query:str):
        url = self.server + '/finance/get_company_name'
        headers={'accept': "application/json"}
        data = {'query': query}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_ticker_by_name(self, query:str):
        url = self.server + '/finance/get_ticker_by_name'
        headers={'accept': "application/json"}
        data = {'query': query}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_price_history(self, ticker_name:str):
        url = self.server + '/finance/get_price_history'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_detailed_price_history(self, ticker_name:str):
        url = self.server + '/finance/get_detailed_price_history'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_dividends_history(self, ticker_name:str):
        url = self.server + '/finance/get_dividends_history'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_market_capitalization(self, ticker_name:str):
        url = self.server + '/finance/get_market_capitalization'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_eps(self, ticker_name:str):
        url = self.server + '/finance/get_eps'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_pe_ratio(self, ticker_name:str):
        url = self.server + '/finance/get_pe_ratio'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

    def finance_get_info(self, ticker_name:str):
        url = self.server + '/finance/get_info'
        headers={'accept': "application/json"}
        data = {'query': ticker_name}
        result = requests.post(url, json=data, headers=headers)
        return json.loads(result.text)

The code below is a simple example that allows you to test one of the APIs mentioned above.  

You can review the available methods and experiment freely as you wish.

```
def pretty_json_print(data):
    json_string = json.dumps(data, indent=4)
    lines = json_string.splitlines()
    formatted_lines = "\n\n".join(lines)
    print(formatted_lines)

api = CRAG()

metric = "price"

result = api.finance_get_company_name("microsoft")
pretty_json_print(result)
ticker_name = api.finance_get_ticker_by_name(result["result"][0])
pretty_json_print(ticker_name)

if metric == 'price':
    response = api.finance_get_price_history(ticker_name['result'])
elif metric == 'dividend':
    response = api.finance_get_dividends_history(ticker_name['result'])
elif metric == 'p/e ratio':
    response = api.finance_get_pe_ratio(ticker_name['result'])
elif metric == 'eps':
    response = api.finance_get_eps(ticker_name['result'])
elif metric == 'marketcap' :
    response = api.finance_get_market_capitalization(ticker_name['result'])
else:
    response = api.finance_get_info(ticker_name['result'])

pretty_json_print(response)
```


In [None]:
### YOUR CODE HERE ###

def pretty_json_print(data):
    json_string = json.dumps(data, indent=4)
    lines = json_string.splitlines()
    formatted_lines = "\n\n".join(lines)
    print(formatted_lines)

api = CRAG()

metric = "price"

result = api.finance_get_company_name("microsoft")
pretty_json_print(result)
ticker_name = api.finance_get_ticker_by_name(result["result"][0])
pretty_json_print(ticker_name)

if metric == 'price':
    response = api.finance_get_price_history(ticker_name['result'])
elif metric == 'dividend':
    response = api.finance_get_dividends_history(ticker_name['result'])
elif metric == 'p/e ratio':
    response = api.finance_get_pe_ratio(ticker_name['result'])
elif metric == 'eps':
    response = api.finance_get_eps(ticker_name['result'])
elif metric == 'marketcap' :
    response = api.finance_get_market_capitalization(ticker_name['result'])
else:
    response = api.finance_get_info(ticker_name['result'])

pretty_json_print(response)

### 3. Implementing a Query Generator

Creating the API, as shown above, is a good step. However, the important point is that the API requires a specific input format to function correctly.

Unfortunately, the CRAG dataset questions do not explicitly indicate which words or terms can be used as inputs for the API. To use the API effectively, we need to extract or generate the required terms from the question and provide them as inputs to the API.

There are various methods for this task. For example, in the field of **Named Entity Recognition (NER)**, techniques are studied to extract important words (entities) from a given sentence. Using a model developed in this field could be an excellent approach.

However, this would require loading additional models. Fortunately, we already have an LLM. Therefore, we can instruct the LLM to directly extract elements that can serve as inputs for the API.

Below is the prompt provided to the LLM for this purpose. It includes a breakdown of which entities are usable as API inputs for each domain. Reviewing this breakdown would also be a valuable exercise.

#### Designing prompts

```
entity_extract_template = """
You are given a Query and Query Time. Do the following:

1) Determine the domain the query is about. The domain should be one of the following: "finance", "sports", "music", "movie", "encyclopedia". If none of the domain applies, use "other". Use "domain" as the key in the result json.

2) Extract structured information from the query. Include different keys into the result json depending on the domains, and put them DIRECTLY in the result json. Here are the rules:

For `finance` queries, these are possible keys:
- `market_identifier`: stock identifiers including individual company names, stock symbols.
- `metric`: financial metrics that the query is asking about. This must be one of the following: `price`, `dividend`, `P/E ratio`, `EPS`, `marketCap`, and `other`.
- `datetime`: time frame that query asks about. When datetime is not explicitly mentioned, use `Query Time` as default.


Return the results in a FLAT json.

*NEVER include ANY EXPLANATION or NOTE in the output, ONLY OUTPUT JSON*  
"""
```
```
def prompt_generator(query):
    user_message = ""
    user_message += f"Query: {query}\n"
        
    llm_input = [
      {"role": "system", "content": entity_extract_template},
      {"role": "user", "content": user_message},
    ]

    return llm_input

```
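
Because the LLM's output is free text, it may be worth checking the extracted fields before using them. The sketch below is an assumption, not part of the original pipeline: `check_finance_extraction` is a hypothetical helper, and `example` is a hand-written dictionary in the shape the prompt asks for, not real model output.

```python
# Metric names from the prompt, lowercased for comparison.
ALLOWED_METRICS = {"price", "dividend", "p/e ratio", "eps", "marketcap", "other"}


def check_finance_extraction(extracted):
    """Return a list of problems found in an extracted-entity dict (empty if none)."""
    problems = []
    if extracted.get("domain") != "finance":
        problems.append("domain is not 'finance'")
    if not extracted.get("market_identifier"):
        problems.append("missing market_identifier")
    metric = str(extracted.get("metric", "")).lower().strip()
    if metric not in ALLOWED_METRICS:
        problems.append(f"unexpected metric: {metric!r}")
    return problems


# Hand-written example in the expected flat JSON shape.
example = {
    "domain": "finance",
    "market_identifier": "Microsoft",
    "metric": "P/E ratio",
    "datetime": "2024-03-01",
}
print(check_finance_extraction(example))
print(check_finance_extraction({"domain": "music"}))
```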


In [None]:
### YOUR CODE HERE ###

entity_extract_template = """
You are given a Query and Query Time. Do the following:

1) Determine the domain the query is about. The domain should be one of the following: "finance", "sports", "music", "movie", "encyclopedia". If none of the domain applies, use "other". Use "domain" as the key in the result json.

2) Extract structured information from the query. Include different keys into the result json depending on the domains, and put them DIRECTLY in the result json. Here are the rules:

For `finance` queries, these are possible keys:
- `market_identifier`: stock identifiers including individual company names, stock symbols.
- `metric`: financial metrics that the query is asking about. This must be one of the following: `price`, `dividend`, `P/E ratio`, `EPS`, `marketCap`, and `other`.
- `datetime`: time frame that query asks about. When datetime is not explicitly mentioned, use `Query Time` as default.


Return the results in a FLAT json.

*NEVER include ANY EXPLANATION or NOTE in the output, ONLY OUTPUT JSON*
"""

In [None]:
### YOUR CODE HERE ###

def prompt_generator(query):
    user_message = ""
    user_message += f"Query: {query}\n"

    llm_input = [
      {"role": "system", "content": entity_extract_template},
      {"role": "user", "content": user_message},
    ]

    return llm_input

#### Generating Queries

Now, let’s actually deliver this prompt to the LLM and generate the query that will be sent to the API.

```python
import json
from openai import OpenAI
from json import JSONDecoder

oai_client = OpenAI()

def generate_query(query):
    llm_input = prompt_generator(query)
    completion = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=llm_input,
    ).choices[0].message.content

    try:
        completion = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        # The reply was not pure JSON: fall back to the first embedded JSON object, if any.
        candidates = extract_json_objects(completion or "")
        completion = candidates[0] if candidates else {}

    # Only questions classified as "finance" are routed to the Mock KG.
    is_finance = isinstance(completion, dict) and completion.get("domain") == "finance"

    return completion, is_finance

def extract_json_objects(text, decoder=JSONDecoder()):
    """Find JSON objects embedded in text and return the decoded objects as a list.
    """
    pos = 0
    results = []
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            results.append(result)
            pos = match + index
        except ValueError:
            pos = match + 1
    return results
```
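
The fallback parser relies on `JSONDecoder.raw_decode`, which decodes one JSON value from the start of a string and reports where that value ended, ignoring whatever follows. A small demonstration, using an invented reply string:

```python
from json import JSONDecoder

# Invented reply: the model wrapped its JSON in extra prose.
reply = 'Sure! Here is the result: {"domain": "finance", "metric": "price"} Hope that helps.'

start = reply.find("{")
obj, end = JSONDecoder().raw_decode(reply[start:])

print(obj)                  # the decoded dictionary
print(reply[start + end:])  # the text left over after the JSON object
```

`json.loads` would reject this reply outright because of the surrounding text, which is why the fallback scans for `{` and decodes from there.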


In [None]:
### YOUR CODE HERE ###

import json
from openai import OpenAI
from json import JSONDecoder

oai_client = OpenAI()

def generate_query(query):
    llm_input = prompt_generator(query)
    completion = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=llm_input,
    ).choices[0].message.content

    try:
        completion = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        # The reply was not pure JSON: fall back to the first embedded JSON object, if any.
        candidates = extract_json_objects(completion or "")
        completion = candidates[0] if candidates else {}

    # Only questions classified as "finance" are routed to the Mock KG.
    is_finance = isinstance(completion, dict) and completion.get("domain") == "finance"

    return completion, is_finance

def extract_json_objects(text, decoder=JSONDecoder()):
    """Find JSON objects embedded in text and return the decoded objects as a list.
    """
    pos = 0
    results = []
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            results.append(result)
            pos = match + index
        except ValueError:
            pos = match + 1
    return results

Let’s verify whether the above code works correctly in practice.

```
from google.colab import drive
drive.mount('/content/drive')
```

```
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        generated_query = generate_query(item['query'])
        break

print(f"generated_query: {generated_query}")
```

In [None]:
### YOUR CODE HERE ###

from google.colab import drive
drive.mount('/content/drive')

In [None]:
### YOUR CODE HERE ###

dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        generated_query = generate_query(item['query'])
        break

print(f"generated_query: {generated_query}")

### 4. Implementing a Query Executor

We are now at the most challenging step. Even if we generate a query to send to the API, we cannot directly pass it as is.

Therefore, it is necessary to transform the generated results into a format where only the required elements are sent to the API.

Additionally, simply being able to call the API does not solve the problem. Many of the data points in the CRAG dataset require multiple pieces of information. As a result, a single API call may not provide sufficient information to generate an answer.

To address this, we need to predefine a **decision tree** outlining how we will use and analyze the API. With this decision tree set, the necessary code can then be written.

For now, let’s declare some **utility functions** that may be needed later. These functions will transform sentences into a fixed format using methods like pattern matching or other techniques.

```python
import datetime
from datetime import timedelta
from dateutil import parser
import pytz
import re

def normalize_key(key):
    return re.sub(r'[^a-zA-Z0-9]', '', key).lower()

def get_metric_from_response(response, metric):
    normalized_metric = normalize_key(metric)
    if response != None:
        for key, value in response.items():
            if normalize_key(key) == normalized_metric:
                return value
    return None


def convert_to_standard_format(date_string):
    try:
        dt = parser.parse(date_string)
        
        est = pytz.timezone('US/Eastern')
        
        if dt.tzinfo is None:
            dt = est.localize(dt)
        else:
            dt = dt.astimezone(est)
        dt = dt.replace(hour=0, minute=0, second=0, microsecond=0)
        
        formatted_date = dt.strftime('%Y-%m-%d %H:%M:%S %Z')
        return formatted_date
    except (ValueError, OverflowError) as e:
        return date_string

def add_one_day(date_string):
    try:
        dt = parser.parse(date_string)
        
        est = pytz.timezone('US/Eastern')

        if dt.tzinfo is None:
            dt = est.localize(dt)
        else:
            dt = dt.astimezone(est)
        
        # Normalize to midnight first so the shifted key matches the standard format.
        dt = dt.replace(hour=0, minute=0, second=0, microsecond=0)
        dt_plus_one = dt + timedelta(days=1)
        formatted_date = dt_plus_one.strftime('%Y-%m-%d %H:%M:%S %Z')
        return formatted_date
    except (ValueError, OverflowError) as e:
        return f"Invalid date string: {e}"

def subtract_one_day(date_string):
    try:
        dt = parser.parse(date_string)
        
        est = pytz.timezone('US/Eastern')
        
        if dt.tzinfo is None:
            dt = est.localize(dt)
        else:
            dt = dt.astimezone(est)
        
        # Normalize to midnight first so the shifted key matches the standard format.
        dt = dt.replace(hour=0, minute=0, second=0, microsecond=0)
        dt_minus_one = dt - timedelta(days=1)
        formatted_date = dt_minus_one.strftime('%Y-%m-%d %H:%M:%S %Z')
        return formatted_date
    except (ValueError, OverflowError) as e:
        return f"Invalid date string: {e}"
```
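
The key normalization is easiest to understand on a small example. The sketch below repeats `normalize_key` so that it runs on its own; `find_metric` mirrors the lookup logic above, and the `info` dictionary with its field names is invented for illustration rather than taken from a real API response.

```python
import re


def normalize_key(key):
    """Drop everything except letters and digits, then lowercase."""
    return re.sub(r"[^a-zA-Z0-9]", "", key).lower()


def find_metric(response, metric):
    """Return the value whose normalized key equals the normalized metric name."""
    wanted = normalize_key(metric)
    for key, value in response.items():
        if normalize_key(key) == wanted:
            return value
    return None


# Invented response; real field names may differ.
info = {"Market Cap": 123, "trailingPE": 30.5, "fifty_two_week_high": 400.0}

print(find_metric(info, "marketCap"))     # matches "Market Cap" after normalization
print(find_metric(info, "52 week high"))  # no match: "52weekhigh" != "fiftytwoweekhigh"
```

The second lookup shows the limit of this approach: it absorbs differences in case, spacing, and punctuation, but not differences in wording.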


In [None]:
### YOUR CODE HERE ###

import datetime
from datetime import timedelta
from dateutil import parser
import pytz
import re

def normalize_key(key):
    return re.sub(r'[^a-zA-Z0-9]', '', key).lower()

def get_metric_from_response(response, metric):
    normalized_metric = normalize_key(metric)
    if response != None:
        for key, value in response.items():
            if normalize_key(key) == normalized_metric:
                return value
    return None


def convert_to_standard_format(date_string):
    try:
        dt = parser.parse(date_string)

        est = pytz.timezone('US/Eastern')

        if dt.tzinfo is None:
            dt = est.localize(dt)
        else:
            dt = dt.astimezone(est)
        dt = dt.replace(hour=0, minute=0, second=0, microsecond=0)

        formatted_date = dt.strftime('%Y-%m-%d %H:%M:%S %Z')
        return formatted_date
    except (ValueError, OverflowError) as e:
        return date_string

def add_one_day(date_string):
    try:
        dt = parser.parse(date_string)

        est = pytz.timezone('US/Eastern')

        if dt.tzinfo is None:
            dt = est.localize(dt)
        else:
            dt = dt.astimezone(est)

        # Normalize to midnight first so the shifted key matches the standard format.
        dt = dt.replace(hour=0, minute=0, second=0, microsecond=0)
        dt_plus_one = dt + timedelta(days=1)
        formatted_date = dt_plus_one.strftime('%Y-%m-%d %H:%M:%S %Z')
        return formatted_date
    except (ValueError, OverflowError) as e:
        return f"Invalid date string: {e}"

def subtract_one_day(date_string):
    try:
        dt = parser.parse(date_string)

        est = pytz.timezone('US/Eastern')

        if dt.tzinfo is None:
            dt = est.localize(dt)
        else:
            dt = dt.astimezone(est)

        # Normalize to midnight first so the shifted key matches the standard format.
        dt = dt.replace(hour=0, minute=0, second=0, microsecond=0)
        dt_minus_one = dt - timedelta(days=1)
        formatted_date = dt_minus_one.strftime('%Y-%m-%d %H:%M:%S %Z')
        return formatted_date
    except (ValueError, OverflowError) as e:
        return f"Invalid date string: {e}"


Now, let’s write a function that follows the **decision tree** we established to interact with the API and process its results. A decision tree is a decision support structure that uses a tree-like model of decisions and their possible consequences, including outcomes, costs, and utility. We'll be using our pre-built API in this manner.

<img src="https://i.imgur.com/F7lfJQq.png">
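
One way to encode the metric branch of such a decision tree is a lookup table instead of a chain of `if`/`elif` tests. The sketch below is an alternative formulation, not the code used in this notebook: the method names are those of the `CRAG` class above, while `select_method`, `fetch_metric`, and the `RecordingAPI` stand-in (which only records which method would be called) are hypothetical.

```python
# Metric name (lowercased) -> name of the CRAG method that answers it.
METRIC_TO_METHOD = {
    "price": "finance_get_price_history",
    "dividend": "finance_get_dividends_history",
    "p/e ratio": "finance_get_pe_ratio",
    "eps": "finance_get_eps",
    "marketcap": "finance_get_market_capitalization",
}


def select_method(metric):
    """Pick the API method for a metric; unknown metrics fall back to get_info."""
    return METRIC_TO_METHOD.get(metric.lower().strip(), "finance_get_info")


def fetch_metric(api, ticker, metric):
    """Call the selected method on the given API object."""
    return getattr(api, select_method(metric))(ticker)


class RecordingAPI:
    """Stand-in for the real client: reports the call instead of sending it."""

    def __getattr__(self, name):
        def call(ticker):
            return {"called": name, "ticker": ticker}
        return call


print(fetch_metric(RecordingAPI(), "MSFT", "Price"))
print(fetch_metric(RecordingAPI(), "MSFT", "revenue"))
```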

Our search process will proceed as follows:  

1. Extract relevant entities from the question that are required to use the API.
2. Based on the extracted entities, determine the necessary parameters to call finance-related APIs.
3. Use the identified parameters to call the relevant API and retrieve the results.
4. Filter the results to extract only the relevant information.

By following these steps, we ultimately obtain `kg_results`, which contains the relevant information extracted from the Knowledge Graph (KG).

```python
import copy

def get_finance_kg_results(generated_query):
    formatted_time_list = []
    if 'datetime' in generated_query:
        datetime_list = generated_query['datetime'].split(' - ')
        # Use a name that does not shadow the imported datetime module.
        for date_str in datetime_list:
            formatted_time_list.append(convert_to_standard_format(date_str.strip()))


    kg_results = []
    res = ""
    if "market_identifier" in generated_query.keys() and generated_query["market_identifier"] is not None:
        if isinstance(generated_query["market_identifier"], str):
            # Strip whitespace so "Microsoft, Apple" does not yield " Apple".
            company_names = [name.strip() for name in generated_query["market_identifier"].split(",")]
        else:
            company_names = generated_query["market_identifier"]

        for company_name in company_names:
            try:
                res = api.finance_get_company_name(company_name)["result"]

                if res == []:
                    ticker_name = company_name.upper()
                else:
                    ticker_name = api.finance_get_ticker_by_name(res[0])["result"]

                if generated_query['metric'].lower().strip() == 'price':
                    response = api.finance_get_price_history(ticker_name)['result']
                elif generated_query['metric'].lower().strip() == 'dividend':
                    response = api.finance_get_dividends_history(ticker_name)['result']
                elif generated_query['metric'].lower().strip() == 'p/e ratio':
                    response = api.finance_get_pe_ratio(ticker_name)['result']
                elif generated_query['metric'].lower().strip() == 'eps':
                    response = api.finance_get_eps(ticker_name)["result"]
                elif generated_query['metric'].lower().strip() == 'marketcap' :
                    response = api.finance_get_market_capitalization(ticker_name)['result']
                else:
                    response = api.finance_get_info(ticker_name)['result']
                    metric_value = get_metric_from_response(response, generated_query['metric'])
                    if metric_value is not None:
                        response = metric_value

                try:
                    for formatted_time in formatted_time_list:
                        if formatted_time in response:
                            filtered_response = copy.deepcopy(response[formatted_time])
                        elif add_one_day(formatted_time) in response:
                            filtered_response = copy.deepcopy(response[add_one_day(formatted_time)])
                        elif subtract_one_day(formatted_time) in response:
                            filtered_response = copy.deepcopy(response[subtract_one_day(formatted_time)])
                        else:
                            filtered_response = copy.deepcopy(response)
                        kg_results.append({company_name + " " + generated_query["metric"]: filtered_response, 'time': formatted_time})
                except:
                    kg_results.append({company_name + " " + generated_query["metric"]: response})

            except Exception as e:
                # Report which company failed and why, then continue with the next one.
                print(f"Failed to query the Mock KG for {company_name!r}: {e}")

    kg_results = "<DOC>\n".join([str(res) for res in kg_results]) if len(kg_results) > 0 else ""
    return  kg_results
```
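
The date-filtering step can be examined in isolation. The sketch below reproduces its lookup order (exact day, then the next day, then the previous day, otherwise the whole history) on invented data. `shift_key` and `lookup_with_fallback` are hypothetical names, keys are assumed to have the form `YYYY-MM-DD HH:MM:SS TZ`, and for simplicity the time-zone label is copied unchanged rather than recomputed.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def shift_key(key, days):
    """Shift a 'YYYY-MM-DD HH:MM:SS TZ' key by whole days, keeping the TZ label."""
    stamp, _, zone = key.rpartition(" ")
    shifted = datetime.strptime(stamp, FMT) + timedelta(days=days)
    return f"{shifted.strftime(FMT)} {zone}"


def lookup_with_fallback(history, key):
    """Try the exact key, then +1 day, then -1 day; otherwise return everything."""
    for candidate in (key, shift_key(key, 1), shift_key(key, -1)):
        if candidate in history:
            return candidate, history[candidate]
    return None, history


# Invented price history with a single entry.
history = {"2024-02-29 00:00:00 EST": {"Close": 1.0}}

print(lookup_with_fallback(history, "2024-02-28 00:00:00 EST"))  # found via +1 day
print(lookup_with_fallback(history, "2024-01-15 00:00:00 EST"))  # falls back to all data
```

Looking at adjacent days helps when the requested date has no entry of its own, for example because it falls on a non-trading day.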


In [None]:
### YOUR CODE HERE ###

import copy

def get_finance_kg_results(generated_query):
    formatted_time_list = []
    if 'datetime' in generated_query:
        datetime_list = generated_query['datetime'].split(' - ')
        # Use a name that does not shadow the imported datetime module.
        for date_str in datetime_list:
            formatted_time_list.append(convert_to_standard_format(date_str.strip()))


    kg_results = []
    res = ""
    if "market_identifier" in generated_query.keys() and generated_query["market_identifier"] is not None:
        if isinstance(generated_query["market_identifier"], str):
            # Strip whitespace so "Microsoft, Apple" does not yield " Apple".
            company_names = [name.strip() for name in generated_query["market_identifier"].split(",")]
        else:
            company_names = generated_query["market_identifier"]

        for company_name in company_names:
            try:
                res = api.finance_get_company_name(company_name)["result"]

                if res == []:
                    ticker_name = company_name.upper()
                else:
                    ticker_name = api.finance_get_ticker_by_name(res[0])["result"]

                if generated_query['metric'].lower().strip() == 'price':
                    response = api.finance_get_price_history(ticker_name)['result']
                elif generated_query['metric'].lower().strip() == 'dividend':
                    response = api.finance_get_dividends_history(ticker_name)['result']
                elif generated_query['metric'].lower().strip() == 'p/e ratio':
                    response = api.finance_get_pe_ratio(ticker_name)['result']
                elif generated_query['metric'].lower().strip() == 'eps':
                    response = api.finance_get_eps(ticker_name)["result"]
                elif generated_query['metric'].lower().strip() == 'marketcap' :
                    response = api.finance_get_market_capitalization(ticker_name)['result']
                else:
                    response = api.finance_get_info(ticker_name)['result']
                    metric_value = get_metric_from_response(response, generated_query['metric'])
                    if metric_value is not None:
                        response = metric_value

                try:
                    for formatted_time in formatted_time_list:
                        if formatted_time in response:
                            filtered_response = copy.deepcopy(response[formatted_time])
                        elif add_one_day(formatted_time) in response:
                            filtered_response = copy.deepcopy(response[add_one_day(formatted_time)])
                        elif subtract_one_day(formatted_time) in response:
                            filtered_response = copy.deepcopy(response[subtract_one_day(formatted_time)])
                        else:
                            filtered_response = copy.deepcopy(response)
                        kg_results.append({company_name + " " + generated_query["metric"]: filtered_response, 'time': formatted_time})
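                except Exception:
                    # Fall back to the unfiltered response if date filtering fails
                    # (for example, when the response is not a date-keyed mapping).
                    kg_results.append({company_name + " " + generated_query["metric"]: response})

            except Exception as e:
                # Report the failure (company lookup, API call, or parsing) and continue.
                print(f"Failed to retrieve KG results for {company_name!r}: {e}")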

    kg_results = "<DOC>\n".join([str(res) for res in kg_results]) if len(kg_results) > 0 else ""
    return  kg_results


Let’s now verify that we can obtain `kg_results` from real data by executing the pipeline we just declared.  

This will confirm that our process, from entity extraction to API interaction, works correctly.  

```
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")

        if item["domain"] == "finance":
          generated_query, is_finance = generate_query(item['query'])
          if is_finance:
            print("generated_query: ", generated_query)
            kg_results = get_finance_kg_results(generated_query)
            if kg_results not in ["", None]:
                break

print(f"kg_results: {kg_results}")
```

In [None]:
### YOUR CODE HERE ###

dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")

        if item["domain"] == "finance":
          generated_query, is_finance = generate_query(item['query'])
          if is_finance:
            print("generated_query: ", generated_query)
            kg_results = get_finance_kg_results(generated_query)
            if kg_results not in ["", None]:
                break

print(f"kg_results: {kg_results}")

### 5. Implementing a Mock KG Query Engine

Let’s now define the `KGQueryEngine` that brings together all the components we have implemented.

When a query is provided, the `KGQueryEngine` will interact with the Knowledge Graph (KG) to retrieve the relevant information.


```
class KGQueryEngine:
    def query(self, query):
        generated_query, is_finance = self.generate_query(query)

        if is_finance:
            kg_results = self.get_finance_kg_results(generated_query)
        else:
            kg_results = ""

        return kg_results

    def generate_query(self, query):
        # Call the method defined on this class rather than a module-level function.
        llm_input = self.prompt_generator(query)
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=llm_input,
        ).choices[0].message.content

        try:
            completion = json.loads(completion)
        except:
            completion = extract_json_objects(completion)

        if "domain" in completion.keys():
            domain = completion["domain"]
            is_finance = domain == "finance"
        else:
            is_finance = False

        return completion, is_finance

    def get_finance_kg_results(self, generated_query):
        formatted_time_list = []
        if 'datetime' in generated_query:
            datetime_list = generated_query['datetime'].split(' - ')
            for datetime in datetime_list:
                formatted_time_list.append(convert_to_standard_format(datetime.strip()))


        kg_results = []
        res = ""
        if "market_identifier" in generated_query.keys() and generated_query["market_identifier"] is not None:
            if isinstance(generated_query["market_identifier"], str):
                company_names = generated_query["market_identifier"].split(",")
            else:
                company_names = generated_query["market_identifier"]

            for company_name in company_names:
                try:
                    res = api.finance_get_company_name(company_name)["result"]

                    if res == []:
                        ticker_name = company_name.upper()
                    else:
                        ticker_name = api.finance_get_ticker_by_name(res[0])["result"]

                    if generated_query['metric'].lower().strip() == 'price':
                        response = api.finance_get_price_history(ticker_name)['result']
                    elif generated_query['metric'].lower().strip() == 'dividend':
                        response = api.finance_get_dividends_history(ticker_name)['result']
                    elif generated_query['metric'].lower().strip() == 'p/e ratio':
                        response = api.finance_get_pe_ratio(ticker_name)['result']
                    elif generated_query['metric'].lower().strip() == 'eps':
                        response = api.finance_get_eps(ticker_name)["result"]
                    elif generated_query['metric'].lower().strip() == 'marketcap' :
                        response = api.finance_get_market_capitalization(ticker_name)['result']
                    else:
                        response = api.finance_get_info(ticker_name)['result']
                        metric_value = get_metric_from_response(response, generated_query['metric'])
                        if metric_value is not None:
                            response = metric_value

                    try:
                        for formatted_time in formatted_time_list:
                            if formatted_time in response:
                                filtered_response = copy.deepcopy(response[formatted_time])
                            elif add_one_day(formatted_time) in response:
                                filtered_response = copy.deepcopy(response[add_one_day(formatted_time)])
                            elif subtract_one_day(formatted_time) in response:
                                filtered_response = copy.deepcopy(response[subtract_one_day(formatted_time)])
                            else:
                                filtered_response = copy.deepcopy(response)
                            kg_results.append({company_name + " " + generated_query["metric"]: filtered_response, 'time': formatted_time})
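                    except Exception:
                        # Fall back to the unfiltered response if date filtering fails
                        # (for example, when the response is not a date-keyed mapping).
                        kg_results.append({company_name + " " + generated_query["metric"]: response})

                except Exception as e:
                    # Report the failure (company lookup, API call, or parsing) and continue.
                    print(f"Failed to retrieve KG results for {company_name!r}: {e}")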

        kg_results = "<DOC>\n".join([str(res) for res in kg_results]) if len(kg_results) > 0 else ""
        return  kg_results

    def prompt_generator(self, query):
        user_message = ""
        user_message += f"Query: {query}\n"

        llm_input = [
          {"role": "system", "content": entity_extract_template},
          {"role": "user", "content": user_message},
        ]

        return llm_input
```
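
`generate_query` first tries `json.loads` and only then falls back to `extract_json_objects`, since a chat model may wrap its JSON in extra prose. The sketch below illustrates one way such a fallback could work. `first_json_object` and `parse_completion` are hypothetical helpers written for this example; they are not the notebook's `extract_json_objects`, whose behavior may differ.

```python
import json


def first_json_object(text):
    """Return the first JSON object embedded in text, or {} if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores the rest.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return {}


def parse_completion(completion):
    """Parse strict JSON first; fall back to scanning for an embedded object."""
    try:
        return json.loads(completion)
    except json.JSONDecodeError:
        return first_json_object(completion)


clean = parse_completion('{"domain": "finance", "metric": "eps"}')
noisy = parse_completion('Sure! Here it is: {"domain": "finance", "metric": "eps"} Hope it helps.')
empty = parse_completion("no json here")
```

Returning an empty dictionary when nothing parses keeps a later `"domain" in completion` check safe, so the query is simply treated as non-finance.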



In [None]:
### YOUR CODE HERE ###

class KGQueryEngine:
    def query(self, query):
        generated_query, is_finance = self.generate_query(query)

        if is_finance:
            kg_results = self.get_finance_kg_results(generated_query)
        else:
            kg_results = ""

        return kg_results

    def generate_query(self, query):
        # Call the method defined on this class rather than a module-level function.
        llm_input = self.prompt_generator(query)
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=llm_input,
        ).choices[0].message.content

        try:
            completion = json.loads(completion)
        except:
            completion = extract_json_objects(completion)

        if "domain" in completion.keys():
            domain = completion["domain"]
            is_finance = domain == "finance"
        else:
            is_finance = False

        return completion, is_finance

    def get_finance_kg_results(self, generated_query):
        formatted_time_list = []
        if 'datetime' in generated_query:
            datetime_list = generated_query['datetime'].split(' - ')
            for datetime in datetime_list:
                formatted_time_list.append(convert_to_standard_format(datetime.strip()))


        kg_results = []
        res = ""
        if "market_identifier" in generated_query.keys() and generated_query["market_identifier"] is not None:
            if isinstance(generated_query["market_identifier"], str):
                company_names = generated_query["market_identifier"].split(",")
            else:
                company_names = generated_query["market_identifier"]

            for company_name in company_names:
                try:
                    res = api.finance_get_company_name(company_name)["result"]

                    if res == []:
                        ticker_name = company_name.upper()
                    else:
                        ticker_name = api.finance_get_ticker_by_name(res[0])["result"]

                    if generated_query['metric'].lower().strip() == 'price':
                        response = api.finance_get_price_history(ticker_name)['result']
                    elif generated_query['metric'].lower().strip() == 'dividend':
                        response = api.finance_get_dividends_history(ticker_name)['result']
                    elif generated_query['metric'].lower().strip() == 'p/e ratio':
                        response = api.finance_get_pe_ratio(ticker_name)['result']
                    elif generated_query['metric'].lower().strip() == 'eps':
                        response = api.finance_get_eps(ticker_name)["result"]
                    elif generated_query['metric'].lower().strip() == 'marketcap' :
                        response = api.finance_get_market_capitalization(ticker_name)['result']
                    else:
                        response = api.finance_get_info(ticker_name)['result']
                        metric_value = get_metric_from_response(response, generated_query['metric'])
                        if metric_value is not None:
                            response = metric_value

                    try:
                        for formatted_time in formatted_time_list:
                            if formatted_time in response:
                                filtered_response = copy.deepcopy(response[formatted_time])
                            elif add_one_day(formatted_time) in response:
                                filtered_response = copy.deepcopy(response[add_one_day(formatted_time)])
                            elif subtract_one_day(formatted_time) in response:
                                filtered_response = copy.deepcopy(response[subtract_one_day(formatted_time)])
                            else:
                                filtered_response = copy.deepcopy(response)
                            kg_results.append({company_name + " " + generated_query["metric"]: filtered_response, 'time': formatted_time})
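                    except Exception:
                        # Fall back to the unfiltered response if date filtering fails
                        # (for example, when the response is not a date-keyed mapping).
                        kg_results.append({company_name + " " + generated_query["metric"]: response})

                except Exception as e:
                    # Report the failure (company lookup, API call, or parsing) and continue.
                    print(f"Failed to retrieve KG results for {company_name!r}: {e}")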

        kg_results = "<DOC>\n".join([str(res) for res in kg_results]) if len(kg_results) > 0 else ""
        return  kg_results

    def prompt_generator(self, query):
        user_message = ""
        user_message += f"Query: {query}\n"

        llm_input = [
          {"role": "system", "content": entity_extract_template},
          {"role": "user", "content": user_message},
        ]

        return llm_input

## II. Implementing a Reader

Before defining RAG with KG, let’s finalize the `Reader`. This `Reader` will be the same as the one you used in Task 1.

Here’s a reminder of the structure:

```Python
from openai import OpenAI

oai_client = OpenAI()

class Reader:
  def __init__(self):

    self.system_prompt = """
    You are provided with a question and various references.
    Your task is to answer the question succinctly, using the fewest words possible.
    If the references do not contain the necessary information to answer the question, respond with 'I don't know'.
    There is no need to explain the reasoning behind your answers.
    """

  def generate_response(self, question: str, top_k_chunks: list) -> str:
      """
      Generate answer from context.
      """
      llm_input = self.prompt_generator(question, top_k_chunks)
      completion = oai_client.chat.completions.create(
          model="gpt-3.5-turbo",
          temperature=0,
          messages=llm_input,
      ).choices[0].message.content
      return completion

  def prompt_generator(self, query, top_k_chunks):
      user_message = ""
      references = ""

      if len(top_k_chunks) > 0:
          references += "# References \n"
          # Format each retrieved chunk as a bullet in the prompt's reference list.
          for chunk in top_k_chunks:
              references += f"- {chunk.strip()}\n"

      # Limit the length of references to fit the model's input size.
      # MAX_CONTEXT_REFERENCES_LENGTH must be defined before this cell runs.
      references = references[:MAX_CONTEXT_REFERENCES_LENGTH]

      user_message += f"{references}\n------\n\n"
      user_message += "Using only the references listed above, answer the following question: \n"
      user_message += f"Question: {query}\n"

      llm_input = [
        {"role": "system", "content": self.system_prompt},
        {"role": "user", "content": user_message},
      ]

      return llm_input
```
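
To see what `prompt_generator` hands to the model without calling the API, the user message can be built in isolation. The sketch below mirrors the logic above as a plain function. `build_user_message` and the default limit of 4000 characters are assumptions for illustration; the limit is not the notebook's `MAX_CONTEXT_REFERENCES_LENGTH`.

```python
def build_user_message(query, chunks, max_reference_chars=4000):
    """Assemble the user message: bulleted references, a divider, then the question."""
    references = ""
    if chunks:
        references += "# References \n"
        for chunk in chunks:
            references += f"- {chunk.strip()}\n"

    # Truncate by characters so the references stay within a fixed budget.
    references = references[:max_reference_chars]

    message = f"{references}\n------\n\n"
    message += "Using only the references listed above, answer the following question: \n"
    message += f"Question: {query}\n"
    return message


with_refs = build_user_message("What's auph's earnings per share?", ["  AUPH EPS: 0.4  "])
no_refs = build_user_message("What's auph's earnings per share?", [])
truncated = build_user_message("q", ["x" * 100], max_reference_chars=20)
print(with_refs)
```

Note that the truncation is a plain character cut, so the last reference may be clipped mid-sentence when the budget is exceeded.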



In [None]:
### YOUR CODE HERE ###

from openai import OpenAI

oai_client = OpenAI()

class Reader:
  def __init__(self):

    self.system_prompt = """
    You are provided with a question and various references.
    Your task is to answer the question succinctly, using the fewest words possible.
    If the references do not contain the necessary information to answer the question, respond with 'I don't know'.
    There is no need to explain the reasoning behind your answers.
    """

  def generate_response(self, question: str, top_k_chunks: list) -> str:
      """
      Generate answer from context.
      """
      llm_input = self.prompt_generator(question, top_k_chunks)
      completion = oai_client.chat.completions.create(
          model="gpt-3.5-turbo",
          temperature=0,
          messages=llm_input,
      ).choices[0].message.content
      return completion

  def prompt_generator(self, query, top_k_chunks):
      user_message = ""
      references = ""

      if len(top_k_chunks) > 0:
          references += "# References \n"
          # Format each retrieved chunk as a bullet in the prompt's reference list.
          for chunk in top_k_chunks:
              references += f"- {chunk.strip()}\n"

      # Limit the length of references to fit the model's input size.
      # MAX_CONTEXT_REFERENCES_LENGTH must be defined before this cell runs.
      references = references[:MAX_CONTEXT_REFERENCES_LENGTH]

      user_message += f"{references}\n------\n\n"
      user_message += "Using only the references listed above, answer the following question: \n"
      user_message += f"Question: {query}\n"

      llm_input = [
        {"role": "system", "content": self.system_prompt},
        {"role": "user", "content": user_message},
      ]

      return llm_input

Let’s verify once again that the model is functioning correctly.

Run the following test to ensure all components are working as expected.

```
reader = Reader()
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer = reader.generate_response(item['query'], [])
        break

print(f"answer: {answer}")
```



In [None]:
### YOUR CODE HERE ###

reader = Reader()
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer = reader.generate_response(item['query'], [])
        break

print(f"answer: {answer}")

## III. Implementing an LLM + Mock KG

Finally, let’s define our RAG system, which is the ultimate goal of this session. Fortunately, we have already completed the complex steps above. To define the RAG system, we simply need to combine the previously defined components into a single class.  

```
class RAGWithKG:
    def __init__(self):
        self.kg_query_engine = KGQueryEngine()
        self.reader = Reader()

    def inference(self, query):
        # 1. retrieve relevant kg results
        kg_results = self.kg_query_engine.query(query)

        # 2. answer the question based on the retrieved kg results
        answer = self.reader.generate_response(query, [kg_results])

        return answer, kg_results

```
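
Because `RAGWithKG` only calls `query` on the engine and `generate_response` on the reader, the wiring can be checked offline by swapping in stand-ins. The sketch below uses a variant class that accepts its components as constructor arguments. `RAGWithKGInjected`, `StubKGQueryEngine`, and `StubReader` are hypothetical names introduced for this example, and the canned strings are placeholders rather than real API output.

```python
class StubKGQueryEngine:
    """Stand-in that returns a canned KG string instead of calling the Mock API."""

    def query(self, query):
        return "{'AUPH eps': 0.4}"


class StubReader:
    """Stand-in that records its inputs instead of calling an LLM."""

    def generate_response(self, question, top_k_chunks):
        self.last_call = (question, top_k_chunks)
        return "stub answer"


class RAGWithKGInjected:
    """Same two-step flow as RAGWithKG, with the components passed in."""

    def __init__(self, kg_query_engine, reader):
        self.kg_query_engine = kg_query_engine
        self.reader = reader

    def inference(self, query):
        # 1. retrieve relevant kg results
        kg_results = self.kg_query_engine.query(query)
        # 2. answer the question based on the retrieved kg results
        answer = self.reader.generate_response(query, [kg_results])
        return answer, kg_results


stub_reader = StubReader()
offline_rag = RAGWithKGInjected(StubKGQueryEngine(), stub_reader)
stub_answer, stub_kg_results = offline_rag.inference("What's auph's earnings per share?")
```

Passing the components in makes it possible to test the control flow without network access or API keys, and to replace either part later without touching the class.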




In [None]:
### YOUR CODE HERE ###

class RAGWithKG:
    def __init__(self):
        self.kg_query_engine = KGQueryEngine()
        self.reader = Reader()

    def inference(self, query):
        # 1. retrieve relevant kg results
        kg_results = self.kg_query_engine.query(query)

        # 2. answer the question based on the retrieved kg results
        answer = self.reader.generate_response(query, [kg_results])

        return answer, kg_results

Let’s now verify whether the implemented RAG system operates as intended. This system retrieves relevant information from the Knowledge Graph (KG) and utilizes it to generate a final answer.

Additionally, we will check whether the system can correctly handle the following queries, which the RAG system from Task 1 previously failed to answer.

<br/>  
Question: **What is the ex-dividend date of microsoft in the 1st qtr of 2024**  
Answer: **The ex-dividend date of microsoft in the 1st qtr of 2024 is feb 14, 2024**
<br/>

<br/>  
Question: **I'm looking for the p/e ratio of dks. would you happen to know what it is?**  
Answer: **13.75**
<br/>

<br/>  
Question: **What's auph's earnings per share?**  
Answer: **0.4**
<br/>

```
rag = RAGWithKG()
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"
repeat = 0

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        if repeat not in [14, 53, 64]:
            repeat += 1
            continue
        
        print(f"query: {item['query']}")
        print()
        answer, kg_results = rag.inference(item['query'])
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print(f"kg results: {kg_results}")

        repeat += 1
```
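
The loop above keeps a manual counter to pick out lines 14, 53, and 64 (0-based) of the JSONL file. An equivalent selection can be sketched with `enumerate`, which also makes it easy to stop once the last wanted line has been read. `iter_selected` is a hypothetical helper, and the in-memory file below stands in for the real dataset with made-up queries.

```python
import io
import json


def iter_selected(file, wanted_indices):
    """Yield (index, parsed_item) for the 0-based line indices in wanted_indices."""
    wanted = set(wanted_indices)
    last = max(wanted)
    for index, line in enumerate(file):
        if index in wanted:
            yield index, json.loads(line)
        if index >= last:
            break  # nothing further is needed, so stop reading


# Stand-in for the dataset: five JSON lines with made-up queries.
fake_file = io.StringIO("\n".join(json.dumps({"query": f"q{i}"}) for i in range(5)))
selected = list(iter_selected(fake_file, [1, 3]))
```

With the real file, the call would look like `iter_selected(file, [14, 53, 64])` inside the `with open(...)` block.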


In [None]:
### YOUR CODE HERE ###

rag = RAGWithKG()
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"
repeat = 0

with open(dataset_path, "rt") as file:
    for line in file:
        item = json.loads(line)
        if repeat not in [14, 53, 64]:
            repeat += 1
            continue

        print(f"query: {item['query']}")
        print()
        answer, kg_results = rag.inference(item['query'])
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print(f"kg results: {kg_results}")

        repeat += 1

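## IV. Implementing an LLM + Web Search Results + Mock KG

In this last section, the reader receives references from two sources at once: the chunks retrieved from the web search results and the results returned by the Mock KG query engine.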



```
from bs4 import BeautifulSoup  # needed by parse_htmls below
from llama_index.core.schema import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

def parse_htmls(search_results):
    all_documents = []

    # Process each HTML text from the search results to extract text content.
    for html_text in search_results:

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_text["page_result"], features="lxml")
        text = soup.get_text(" ", strip=True)  # Use space as a separator, strip whitespaces
        all_documents.append(text)

    return all_documents

class LlamaIndexRetriever:
  def __init__(self):
      self.parser = SentenceSplitter(chunk_size=512, chunk_overlap=0)

  def retrieve(self, query, search_results, topk):
      documents = []

      for document in parse_htmls(search_results):
        if not document:
            # If no text is extracted, add an empty string as a placeholder.
            documents.append(Document(text=""))
        else:
            documents.append(Document(text=document))

      # Split documents into chunks & Create vector index
      base_index = VectorStoreIndex.from_documents(documents = documents, transformations=[self.parser])

      # Execute query
      base_retriever = base_index.as_retriever(similarity_top_k=topk)

      retrieved_nodes = base_retriever.retrieve(query)

      retrieved_results = [retrieved_node.node.get_content().strip() for retrieved_node in retrieved_nodes]

      return retrieved_results
```
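
`LlamaIndexRetriever` splits the pages into chunks, indexes them by embedding, and returns the `topk` chunks most similar to the query. The toy sketch below shows only that ranking step, replacing the OpenAI embeddings with simple word-count vectors so that it runs without a network connection or API key. `bag_of_words`, `cosine`, and `rank_chunks` are illustrative names, and this is not how LlamaIndex is implemented internally.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy 'embedding': lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_chunks(query, chunks, topk):
    """Rank chunks by similarity to the query and keep the best topk."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, bag_of_words(chunk)), reverse=True)
    return ranked[:topk]


toy_chunks = [
    "finding nemo won the oscar for best animated feature",
    "microsoft pays a quarterly dividend",
    "the weather is sunny today",
]
best = rank_chunks("which film won the best animated feature oscar", toy_chunks, topk=1)
```

Real embeddings capture meaning beyond exact word overlap, which is why the notebook uses an embedding model rather than counts; the ranking-and-truncation step is the same idea.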



```
class RAGWithSRKG:
    def __init__(self):
        self.retriever = LlamaIndexRetriever()
        self.kg_query_engine = KGQueryEngine()
        self.reader = Reader()

    def inference(self, query, search_results, topk):
        # 1. retrieve relevant chunks
        retrieved_results = self.retriever.retrieve(query, search_results, topk)

        # 2. retrieve relevant kg results
        kg_results = self.kg_query_engine.query(query)

        combined_results = [kg_results]
        combined_results.extend(retrieved_results)

        # 3. answer the question based on the kg results and the retrieved chunks
        answer = self.reader.generate_response(query, combined_results)

        return answer, combined_results
```
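
`RAGWithSRKG` places the KG result first and the web chunks after it. When the query is not in the finance domain, `KGQueryEngine.query` returns an empty string, which the `Reader` then formats as an empty bullet in its reference list. One possible refinement, sketched below under the assumption that an empty reference carries no useful signal, is to skip it. `combine_references` is a hypothetical helper and not part of the notebook.

```python
def combine_references(kg_results, retrieved_results):
    """Return the KG result first (if non-empty), followed by the retrieved web chunks."""
    combined = []
    if kg_results and kg_results.strip():
        combined.append(kg_results)
    combined.extend(retrieved_results)
    return combined


finance = combine_references("{'DKS p/e ratio': 13.75}", ["chunk a", "chunk b"])
non_finance = combine_references("", ["chunk a", "chunk b"])
```

Keeping the KG result at the front means it survives the reader's character-based truncation even when the web chunks are long.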



In [None]:
### YOUR CODE HERE ###

from bs4 import BeautifulSoup  # needed by parse_htmls below
from llama_index.core.schema import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

def parse_htmls(search_results):
    all_documents = []

    # Process each HTML text from the search results to extract text content.
    for html_text in search_results:

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_text["page_result"], features="lxml")
        text = soup.get_text(" ", strip=True)  # Use space as a separator, strip whitespaces
        all_documents.append(text)

    return all_documents

class LlamaIndexRetriever:
  def __init__(self):
      self.parser = SentenceSplitter(chunk_size=512, chunk_overlap=0)

  def retrieve(self, query, search_results, topk):
      documents = []

      for document in parse_htmls(search_results):
        if not document:
            # If no text is extracted, add an empty string as a placeholder.
            documents.append(Document(text=""))
        else:
            documents.append(Document(text=document))

      # Split documents into chunks & Create vector index
      base_index = VectorStoreIndex.from_documents(documents = documents, transformations=[self.parser])

      # Execute query
      base_retriever = base_index.as_retriever(similarity_top_k=topk)

      retrieved_nodes = base_retriever.retrieve(query)

      retrieved_results = [retrieved_node.node.get_content().strip() for retrieved_node in retrieved_nodes]

      return retrieved_results

In [None]:
### YOUR CODE HERE ###

class RAGWithSRKG:
    def __init__(self):
        self.retriever = LlamaIndexRetriever()
        self.kg_query_engine = KGQueryEngine()
        self.reader = Reader()

    def inference(self, query, search_results, topk):
        # 1. retrieve relevant chunks
        retrieved_results = self.retriever.retrieve(query, search_results, topk)

        # 2. retrieve relevant kg results
        kg_results = self.kg_query_engine.query(query)

        combined_results = [kg_results]
        combined_results.extend(retrieved_results)

        # 3. answer the question based on the kg results and the retrieved chunks
        answer = self.reader.generate_response(query, combined_results)

        return answer, combined_results

Let us evaluate whether it can successfully answer all the following queries.

<br/>  
Question: **In 2004, which animated film was recognized with the best animated feature film oscar?**  
Answer: **Finding Nemo**
<br/>

<br/>  
Question: **What is the ex-dividend date of microsoft in the 1st qtr of 2024**  
Answer: **The ex-dividend date of microsoft in the 1st qtr of 2024 is feb 14, 2024**
<br/>

<br/>  
Question: **I'm looking for the p/e ratio of dks. would you happen to know what it is?**  
Answer: **13.75**
<br/>

<br/>  
Question: **What's auph's earnings per share?**  
Answer: **0.4**
<br/>



```
dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

rag = RAGWithSRKG()
topk = 5

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat not in [5, 14, 53, 64]:
            repeat += 1
            continue

        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer, retrieved_results = rag.inference(item['query'], item['search_results'], topk)
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print("retrieved results:")
        for rank, retrieved_result in enumerate(retrieved_results):
            print(f"{rank}: {retrieved_result}")
        print()
        repeat += 1
```



In [None]:
### YOUR CODE HERE ###

dataset_path = "/content/drive/MyDrive/CRAG dataset/crag_task_1_dev_v4_release.jsonl"

rag = RAGWithSRKG()
topk = 5

repeat = 0
with open(dataset_path, "rt") as file:
    for line in file:
        if repeat not in [5, 14, 53, 64]:
            repeat += 1
            continue

        item = json.loads(line)
        print(f"query: {item['query']}")
        print()
        answer, retrieved_results = rag.inference(item['query'], item['search_results'], topk)
        print(f"predicted answer: {answer}")
        print(f"ground truth answer: {item['answer']}")
        print()
        print("retrieved results:")
        for rank, retrieved_result in enumerate(retrieved_results):
            print(f"{rank}: {retrieved_result}")
        print()
        repeat += 1