# 📓 RAG (Retrieval-Augmented Generation) Practice

This is today's main section! In this class, we delve into Retrieval-Augmented Generation (RAG). RAG optimizes the output of a large language model by having it reference an external knowledge base, outside of its training data, before generating a response.

Retrieval-Augmented Generation combines the power of language models with the specificity of retrieved documents to generate more accurate and relevant responses. This technique retrieves relevant documents or passages based on the input query and uses this retrieved information to inform the generation process of the language model.  

Let's look at the picture below and recall the structure of RAG.
<img src="https://i.imgur.com/bR4xaBd.png">
### Steps in RAG
1. The model receives an input query to which a response is required.
2. It retrieves relevant documents or passages from a database that are pertinent to the input query.
3. The language model then generates a response, informed by both the original query and the retrieved documents.
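
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the function names `tokenize`, `retrieve`, and `build_prompt` are hypothetical, the passages are placeholders, and word overlap stands in for the embedding-based retrieval that a real RAG system would use. The final call to a language model is left out, so the sketch stops at the prompt that would be sent.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, top_k=1):
    """Step 2: rank passages by how many words they share with the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, retrieved):
    """Step 3 (input side): combine the query and retrieved text into one prompt."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Placeholder knowledge base (not real data).
passages = [
    "Placeholder passage about the harbor of city A.",
    "Placeholder passage about the museums of city B.",
]

query = "What museums does city B have?"   # Step 1: the input query
top = retrieve(query, passages)            # Step 2: retrieval
prompt = build_prompt(query, top)          # Step 3: prompt for the generator
print(prompt)
```

The passage about city B is selected because it shares the most words with the query; the generator would then see both the question and that passage.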


![](https://huggingface.co/blog/assets/12_ray_rag/rag_gif.gif)



This practice class consists of a short warm-up (Section 0) followed by two main sections.  
  
### I. Building a Prototype RAG.  
### II. Finding the best RAG App Configuration.
  
Okay, now we know what we have to do.  
So, are we ready to dive into practice?

![](https://media.tenor.com/r2l6ol9HRqIAAAAj/question-mark-question.gif)

## 0. Before Diving into Practice!

Models often give strange answers to complex questions. This is only natural, because people do the same.

So exactly what kinds of questions does an LLM answer incorrectly? If we find out what kinds of questions LLMs are vulnerable to, we will better understand how to supplement them.

<br/>

We follow the steps below:  

#### 1. Preparing for OpenAI API
#### 2. Make a Function for Asking Questions
#### 3. Ask the LLM
#### 4. Visualizing Results


### 1. Preparing for OpenAI API
  
We will use an OpenAI LLM for this activity, so we need to install the OpenAI API library, import it, and set the OpenAI API key.

Paste your API key in place of the `'sk-...'` placeholder. Note that all API keys start with `sk-`.
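
Pasting a key directly into a notebook makes it easy to leak when the notebook is shared. As an optional alternative, the sketch below reads the key from an environment mapping and rejects the unedited placeholder. The helper name `load_api_key` is hypothetical and not part of the OpenAI library; the fake key is for illustration only.

```python
import os

def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the key stored under var_name, or raise if it is missing
    or still the 'sk-...' placeholder."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key or key == "sk-...":
        raise ValueError(f"{var_name} is not set to a real key")
    return key

# Usage with a fake environment mapping (no real key involved).
fake_env = {"OPENAI_API_KEY": "sk-example-not-a-real-key"}
print(load_api_key(fake_env)[:3])
```

Calling `load_api_key()` with no arguments checks `os.environ`, so a key exported in the shell before starting the notebook would be picked up without ever appearing in a cell.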


```Python
! pip install llama-index==0.12.2 --quiet
! pip install llama-index-readers-wikipedia==0.3.0 wikipedia==1.4.0 --quiet
! pip install llama-index-llms-openai==0.3.1 --quiet
! pip install openai==1.55.3 --quiet
! pip install trulens==1.2.0 trulens-providers-openai==1.2.0 --quiet
! pip install packaging==23.2 langchain "nltk>=3.8.1" streamlit==1.35.0 watchdog kubernetes==26.1.0 --quiet
! pip uninstall numpy -y
! pip install numpy==2.0.2
```
```Python
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..." #copy your api key

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, get_response_synthesizer
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.node_parser import SentenceSplitter

import textwrap

import nltk
nltk.download('punkt')
```

In [None]:
### YOUR CODE HERE ###

! pip install llama-index==0.12.2 --quiet
! pip install llama-index-readers-wikipedia==0.3.0 wikipedia==1.4.0 --quiet
! pip install llama-index-llms-openai==0.3.1 --quiet
! pip install openai==1.55.3 --quiet
! pip install trulens==1.2.0 trulens-providers-openai==1.2.0 --quiet
! pip install packaging==23.2 langchain "nltk>=3.8.1" streamlit==1.35.0 watchdog kubernetes==26.1.0 --quiet
! pip uninstall numpy -y
! pip install numpy==2.0.2

In [None]:
### YOUR CODE HERE ###

import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..." #copy your api key

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, get_response_synthesizer
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.core.node_parser import SentenceSplitter

import textwrap

import nltk
nltk.download('punkt')

### 2. Make a Function for Asking Questions
  
Using the OpenAI API, we will create a function that receives a question and returns the response of the OpenAI LLM.

```Python
def generate_answer(question):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": question,
        },
    ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",  
        messages=messages,
    )
    
    return response.choices[0].message.content
```


In [None]:
### YOUR CODE HERE ###

def generate_answer(question):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": question,
        },
    ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )

    return response.choices[0].message.content

### 3. Ask the LLM

Let's pose some questions to the OpenAI LLM. We will ask the following types of questions:

#### 1. Questions about **old information** vs Questions about **recent information**
#### 2. Questions that **require only simple information** vs Questions that **require complex reasoning**
#### 3. Questions that are **too vague** vs Questions that **clearly specify the desired task**


<br/>

#### 1. Questions about **old information** vs Questions about **recent information**

Question about old information: 'When did Blackpink make their debut?'  
Answer: 2016

Question about recent information: 'What is the lowest pressure at which diamonds have been created in the world?'  
Answer: 1 atmosphere (normal atmospheric pressure)

<img src="https://i.imgur.com/ZaHVekw.png" width="600">

<br/>

#### 2. Questions that **require only simple information** vs Questions that **require complex reasoning**

Question that requires only simple information: 'What was the first animal to orbit the Earth?'  
Answer: Laika

Question that requires complex reasoning: 'In what country was the current World's Strongest Man born?'  

Reasoning steps:  
1. Who is the current World's Strongest Man?
2. In which country was the current World's Strongest Man born?

Answer: Canada  

<br/>

#### 3. Questions that are **too vague** vs Questions that **clearly specify the desired task**

Question that is too vague: 'I want to cook something.'  

Question that clearly specifies the desired task: 'Explain the steps to make apple pie step by step'

Answer: Compare the two responses yourself.  


```Python
question_old =  'When did Blackpink make their debut? '
question_recent = 'What is the lowest pressure at which diamonds have been created in the world?'

question_simple = 'What was the first animal to orbit the Earth?'
question_complex = 'In what country was the current World\'s Strongest Man born?'

question_clear = 'Explain the steps to make apple pie step by step.'
question_vague = 'I want to cook something.'
```

In [None]:
### YOUR CODE HERE ###

question_old =  'When did Blackpink make their debut? '
question_recent = 'What is the lowest pressure at which diamonds have been created in the world?'

question_simple = 'What was the first animal to orbit the Earth?'
question_complex = 'In what country was the current World\'s Strongest Man born?'

question_clear = 'Explain the steps to make apple pie step by step.'
question_vague = 'I want to cook something.'

### 4. Visualizing Results

Let's check the results by passing each question to the `generate_answer` function we defined earlier.

We will also use a `pretty_print` helper that wraps the answer at a fixed line width, in case the answer is too long.

```Python
import textwrap

def pretty_print(answer):
    wrapped_text = textwrap.fill(answer, width=80)
    return wrapped_text
```
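
To see what `textwrap.fill` does to a long answer, here is a small offline check that uses a made-up string instead of a real model response. One detail worth knowing: with its default settings, `textwrap.fill` treats newlines in the input as ordinary whitespace, so paragraph breaks in the original answer are merged into one wrapped block.

```python
import textwrap

# A made-up "answer": 40 short words on a single line.
long_answer = "word " * 40

wrapped = textwrap.fill(long_answer, width=80)
lines = wrapped.split("\n")

# Every wrapped line respects the width limit.
print(all(len(line) <= 80 for line in lines))

# Newlines inside the input are replaced by spaces before wrapping.
two_paragraphs = "first paragraph\n\nsecond paragraph"
print(textwrap.fill(two_paragraphs, width=80))
```

If keeping paragraph breaks matters, one option is to split the answer on blank lines and wrap each paragraph separately.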

```Python
answer_old = generate_answer(question_old)
print("Answer to question with old information: \n", pretty_print(answer_old))
print("------------------------------------------------------------")
print("Real Answer: 2016")
```
```Python
answer_recent = generate_answer(question_recent)
print("Answer to question with recent information: \n", pretty_print(answer_recent))
print("------------------------------------------------------------")
print("Real Answer: 1 atmospheric pressure")
```
```Python
answer_simple = generate_answer(question_simple)
print("Answer to a simple question: \n", pretty_print(answer_simple))
print("------------------------------------------------------------")
print("Real Answer: Laika")
```
```Python
answer_complex = generate_answer(question_complex)
print("Answer to a complex question: \n", pretty_print(answer_complex))
print("------------------------------------------------------------")
print("Real Answer: Canada")
```
```Python
answer_clear = generate_answer(question_clear)
print("Response to a clear question: \n", pretty_print(answer_clear))
print("------------------------------------------------------------")
print("\nThink about these results. Does the result fit the purpose of the question?")
```
```Python
answer_vague = generate_answer(question_vague)
print("Response to a vague question: \n", pretty_print(answer_vague))
print("------------------------------------------------------------")
print("\nThink about these results. Does the result fit the purpose of the question?")
```

In [None]:
### YOUR CODE HERE ###

import textwrap

def pretty_print(answer):
    wrapped_text = textwrap.fill(answer, width=80)
    return wrapped_text

In [None]:
### YOUR CODE HERE ###

answer_old = generate_answer(question_old)
print("Answer to question with old information: \n", pretty_print(answer_old))
print("------------------------------------------------------------")
print("Real Answer: 2016")

In [None]:
### YOUR CODE HERE ###

answer_recent = generate_answer(question_recent)
print("Answer to question with recent information: \n", pretty_print(answer_recent))
print("------------------------------------------------------------")
print("Real Answer: 1 atmospheric pressure")

In [None]:
### YOUR CODE HERE ###

answer_simple = generate_answer(question_simple)
print("Answer to a simple question: \n", pretty_print(answer_simple))
print("------------------------------------------------------------")
print("Real Answer: Laika")

In [None]:
### YOUR CODE HERE ###

answer_complex = generate_answer(question_complex)
print("Answer to a complex question: \n", pretty_print(answer_complex))
print("------------------------------------------------------------")
print("Real Answer: Canada")

In [None]:
### YOUR CODE HERE ###

answer_clear = generate_answer(question_clear)
print("Response to a clear question: \n", pretty_print(answer_clear))
print("------------------------------------------------------------")
print("\nThink about these results. Does the result fit the purpose of the question?")

In [None]:
### YOUR CODE HERE ###

answer_vague = generate_answer(question_vague)
print("Response to a vague question: \n", pretty_print(answer_vague))
print("------------------------------------------------------------")
print("\nThink about these results. Does the result fit the purpose of the question?")

All right, we've now seen what kinds of questions an LLM is vulnerable to, and how important a clear prompt is.

Finally, before diving into the practice, let's ask the LLM a complex city-related question. We will look at the LLM's answer and later compare it with the answer from the LLM augmented by RAG.  

<br/>

Question: **Which Korean city has a sisterhood relationship with Belize City?**  
Answer: **Yeosu**

<br/>

<img src="https://media.tenor.com/TWMxi0kGDTgAAAAi/hmm.gif" width="150" />  


Hmm... I think this is quite a complex question. Can you answer it?  
If you can, feel free to think of other city-related questions.  

<br/>

However, make sure there is a Wikipedia page for that city, and if so, save the title of that Wikipedia page in the variable below.  

```Python
city_question = 'Which Korean city has a sisterhood relationship with Belize City?'

kr_city = generate_answer(city_question)
print("Answer: ", pretty_print(kr_city))

city_related_to_question = 'Belize City'
```


In [None]:
### YOUR CODE HERE ###

city_question = 'Which Korean city has a sisterhood relationship with Belize City?'

kr_city = generate_answer(city_question)
print("Answer: ", pretty_print(kr_city))

city_related_to_question = 'Belize City'

Keep the answer above in mind, and let's start building RAG!

## I. Building a Prototype RAG.  

In this section, we will build the retriever and database for RAG using LlamaIndex and Milvus, which we covered in the previous session.

What differs from the previous practice is that this time we select the information and the external knowledge source ourselves and insert it into the database. In this practice, we will make the following assumptions.   
<br/>

#### Input Prompt: **City-related question**  
#### External Knowledge Source: **Wikipedia**   
<br/>

We follow the steps below:  

#### 1. Data Load
#### 2. Check Query Engine
#### 3. Connect Retriever and Generator

### 1. Data Load

Load the Wikipedia page data for 100 different cities.  

We will use the LlamaIndex data connector `WikipediaReader`, which receives a list of Wikipedia page titles and returns each corresponding page as a `Document` object.  

```Python
city_name_path = '/your/path/to/city.txt' #change this path

city_names = []

with open(city_name_path, 'r', encoding='utf-8') as file:
    for line in file:
        # Keep the text before the first ':' and strip surrounding whitespace
        city = line.split(':')[0].strip()
        if city:  # skip blank lines
            city_names.append(city)

print(city_names) #100 different cities

if city_related_to_question not in city_names:
  city_names.append(city_related_to_question)
```

```Python
reader = WikipediaReader()
documents = reader.load_data(city_names, auto_suggest=False)
```

In addition, this step adds the city from our earlier city-related question (`city_related_to_question`) to the list, if it is not already there.
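
The loading step can be tried without the real `city.txt`. The sketch below assumes each line has the form `Name : description`, which is a guess based on the `split(':')` call in the code above; the helper name `parse_city_names` and the sample lines are hypothetical.

```python
def parse_city_names(lines):
    """Return the text before the first ':' on each non-empty line,
    without duplicates and in the original order."""
    names = []
    for line in lines:
        name = line.split(":")[0].strip()
        if name and name not in names:
            names.append(name)
    return names

# Hypothetical lines in the assumed "Name : description" format.
sample_lines = [
    "Berlin : capital of Germany\n",
    "Antananarivo: capital of Madagascar\n",
    "\n",
]
city_names_demo = parse_city_names(sample_lines)

# Add the city from the earlier question only if it is missing.
extra_city = "Belize City"
if extra_city not in city_names_demo:
    city_names_demo.append(extra_city)

print(city_names_demo)
```

Using `strip()` instead of slicing off a fixed number of characters keeps the names intact whether or not there is a space before the colon.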

In [None]:
! pwd

/content


In [None]:
### YOUR CODE HERE ###

city_name_path = '/your/path/to/city.txt' #change this path

city_names = []

with open(city_name_path, 'r', encoding='utf-8') as file:
    for line in file:
        # Keep the text before the first ':' and strip surrounding whitespace
        city = line.split(':')[0].strip()
        if city:  # skip blank lines
            city_names.append(city)

print(city_names) #100 different cities

if city_related_to_question not in city_names:
  city_names.append(city_related_to_question)

In [None]:
### YOUR CODE HERE ###

reader = WikipediaReader()
documents = reader.load_data(city_names, auto_suggest=False)

<img src="https://i.gifer.com/B6Qs.gif" width="150">

This process will take some time.

If everything has worked so far, let's insert the Wikipedia pages, now converted to `Document` objects, into an index.

```Python
index = VectorStoreIndex.from_documents(documents)
```
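
Conceptually, a vector index stores an embedding for each piece of text and answers a query by returning the texts whose embeddings are most similar to the query's embedding. The toy class below illustrates that idea only; it is not how `VectorStoreIndex` is implemented. The names `embed`, `cosine`, and `ToyVectorIndex` are hypothetical, and word counts stand in for the learned embeddings a real index would use.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorIndex:
    def __init__(self, texts):
        # Embed every text once, at insertion time.
        self.entries = [(t, embed(t)) for t in texts]

    def retrieve(self, query, top_k=1):
        # Embed the query and return the top_k most similar texts.
        q = embed(query)
        scored = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in scored[:top_k]]

toy_index = ToyVectorIndex([
    "placeholder text about museums and galleries",
    "placeholder text about rivers and bridges",
])
print(toy_index.retrieve("which museums and galleries are there?"))
```

The `top_k` argument plays the same role here as the number of nodes a retriever is asked to return.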

In [None]:
### YOUR CODE HERE ###

index = VectorStoreIndex.from_documents(documents)

### 2. Check Query Engine

Create a query engine from the index, and check whether the Wikipedia page information has been properly inserted by using city-related queries.

We will use a query about Berlin, which is an element of `city_names`; the answer to the query can be inferred from the information on the Wikipedia page.  

<br/>

#### **Using query: 'What's the arts and culture scene in Berlin?'**

<img src = "https://i.imgur.com/RQkoLL1.png" width = 800>

```Python
query_engine = index.as_query_engine()
response = query_engine.query("What's the arts and culture scene in Berlin?")
print(textwrap.fill(str(response), 100))
```

In [None]:
### YOUR CODE HERE ###

query_engine = index.as_query_engine()
response = query_engine.query("What's the arts and culture scene in Berlin?")
print(textwrap.fill(str(response), 100))

#### Let's see how the query engine responds when a query asks for knowledge that does not exist in the database.

#### **Cities and question-answer pairs that are not in 'city.txt'**

**Xinjiang**  
*Question:* What is the magnitude of the earthquake that occurred in Xinjiang, China in 2024?  
*Answer:* 7.1

**Jaipur**  
*Question:* Where in Jaipur is the proposed location for the third largest cricket stadium in the world, which can accommodate about 75,000 people?  
*Answer:* Chonp Village.


**Evaluate for yourself whether each response is accurate or inaccurate.**

```Python
unknown_question_xinjiang = "What is the magnitude of the earthquake that occurred in Xinjiang, China in 2024?"
unknown_question_jaipur = "Where in Jaipur is the proposed location for the third largest cricket stadium in the world, which can accommodate about 75,000 people?"

response_xinjiang = query_engine.query(unknown_question_xinjiang)
response_jaipur = query_engine.query(unknown_question_jaipur)

print(pretty_print(response_xinjiang.response))
print("\n-------------------------\n")
print(pretty_print(response_jaipur.response))
```

In [None]:
### YOUR CODE HERE ###

unknown_question_xinjiang = "What is the magnitude of the earthquake that occurred in Xinjiang, China in 2024?"
unknown_question_jaipur = "Where in Jaipur is the proposed location for the third largest cricket stadium in the world, which can accommodate about 75,000 people?"

response_xinjiang = query_engine.query(unknown_question_xinjiang)
response_jaipur = query_engine.query(unknown_question_jaipur)

print(pretty_print(response_xinjiang.response))
print("\n-------------------------\n")
print(pretty_print(response_jaipur.response))

### 3. Connect Retriever and Generator

Build a RAG from scratch.

Since RAG consists of two steps, retrieval and generation, it is a good choice to implement it as a class with two methods: `retrieve` and `generate_response`.

In addition, adding a method that passes the output of `retrieve` as the input of `generate_response` completes the RAG class.

First, import the libraries.
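
The class layout described above can be checked offline before any API is involved. In this sketch, the class name `TinyRAG` and both stub functions are hypothetical: the stubs stand in for the vector store and the LLM, so only the flow from `retrieve` to `generate_response` is exercised.

```python
from typing import Callable, List

class TinyRAG:
    """Skeleton with the two steps injected as plain callables."""

    def __init__(self,
                 retrieve_fn: Callable[[str], List[str]],
                 generate_fn: Callable[[str], str]):
        self.retrieve_fn = retrieve_fn
        self.generate_fn = generate_fn

    def retrieve(self, query: str) -> List[str]:
        return self.retrieve_fn(query)

    def generate_response(self, query: str, context: List[str]) -> str:
        prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
        return self.generate_fn(prompt)

    def query(self, query: str) -> str:
        # Glue method: the output of retrieve feeds generate_response.
        return self.generate_response(query, self.retrieve(query))

# Stubs replace the vector store and the LLM.
def stub_retrieve(query: str) -> List[str]:
    return ["stub passage"]

def stub_generate(prompt: str) -> str:
    return f"(stub answer based on a prompt of {len(prompt)} characters)"

tiny_rag = TinyRAG(stub_retrieve, stub_generate)
print(tiny_rag.query("stub question"))
```

Injecting the two steps as arguments is one possible design choice; it makes it easy to swap in a different retriever or model later without touching the class.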

```Python
from openai import OpenAI
from trulens.apps.custom import instrument

oai_client = OpenAI()
```

In [None]:
### YOUR CODE HERE ###

from openai import OpenAI
from trulens.apps.custom import instrument

oai_client = OpenAI()

Next, build the RAG from scratch. Here, we apply the `instrument` decorator to each method for later evaluation.

```Python
class RAG_from_scratch:
    @instrument
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = query_engine.query(query)
        return results

    @instrument
    def generate_response(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=
        [
            {"role": "user",
            "content":
            f"We have provided context information below. \n"
            f"---------------------\n"
            f"{context_str}"
            f"\n---------------------\n"
            f"Given this information, please answer the question: {query}"
            }
        ]
        ).choices[0].message.content
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query)
        completion = self.generate_response(query, context_str)
        return completion

rag = RAG_from_scratch()
```

In [None]:
### YOUR CODE HERE ###

class RAG_from_scratch:
    @instrument
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = query_engine.query(query)
        return results

    @instrument
    def generate_response(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=
        [
            {"role": "user",
            "content":
            f"We have provided context information below. \n"
            f"---------------------\n"
            f"{context_str}"
            f"\n---------------------\n"
            f"Given this information, please answer the question: {query}"
            }
        ]
        ).choices[0].message.content
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query)
        completion = self.generate_response(query, context_str)
        return completion

rag = RAG_from_scratch()

Now, let's recall the question we asked the LLM before building the RAG.  
The default question was this.  

<br/>

Question: **Which Korean city has a sisterhood relationship with Belize City?**  
Answer: **Yeosu**

<br/>

Let's use the RAG class we created to pose this question to the LLM. What will happen?  
Now that we've made the RAG class and its methods, we can also see the results of the query engine.

```Python
city_question = 'Which Korean city has a sisterhood relationship with Belize City?'

answer = rag.query(city_question)

print(pretty_print(answer))
```


In [None]:
### YOUR CODE HERE ###

city_question = 'Which Korean city has a sisterhood relationship with Belize City?'

answer = rag.query(city_question)

print(pretty_print(answer))

Moreover, the query engine we use is generated directly from the index, and it retrieves with the same kind of retriever that `index.as_retriever()` returns, so we can also check the retrieved passages.

```Python
retriever = index.as_retriever()

ret = retriever.retrieve(city_question)
for i in range(len(ret)):
  print(f"Retrieved Passage {i+1}\n", ret[i].text, '\n\n\n')
```

In [None]:
### YOUR CODE HERE ###

retriever = index.as_retriever()

ret = retriever.retrieve(city_question)
for i in range(len(ret)):
  print(f"Retrieved Passage {i+1}\n", ret[i].text, '\n\n\n')

## II. Finding the best RAG App Configuration.

Let's change the configuration of the RAG and observe how its performance varies.

The performance of RAG is measured by `Groundedness`, `Answer Relevance`, and `Context Relevance`. However, `latency` and `total cost` are also important factors. Therefore, we should choose the evaluation metrics carefully and evaluate RAG's performance from various perspectives.
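
Of these factors, latency is the simplest to measure by hand. The sketch below wraps any callable and reports how long it took; `timed_call` and `fake_rag_query` are hypothetical names, and the fake function merely sleeps to simulate a RAG call. The quality metrics (groundedness and the two relevance scores) need an evaluator and are not covered here.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_rag_query(question):
    # Stand-in for a real RAG call; sleeps briefly to simulate work.
    time.sleep(0.01)
    return f"stub answer to: {question}"

answer, seconds = timed_call(fake_rag_query, "stub question")
print(f"{seconds:.3f}s -> {answer}")
```

Timing several configurations on the same set of questions gives a rough latency comparison to set beside the quality scores.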

<br/>  

In this section, we'll modify the following factors to examine changes in RAG's performance:

#### 1. Chunk Size of Retriever: As the chunk size varies, retriever may or may not retrieve sufficient information. Let’s adjust the chunk size to see how the results change.
#### 2. Query Engine: Refine the prompt used in the Query Engine's answer generation to enhance the quality of its summaries.
#### 3. Prompt Design for RAG: Add clear instructions to the task so that the LLM generates the answer that the user is looking for.



First, let’s go through the process of varying the chunk size of the retriever.

When performing a search, the retriever finds and returns the most relevant information from the existing `nodes`. These nodes are determined by the `chunk size` and `chunk overlap` settings of the `SentenceSplitter`, which can be configured when creating the `index`.

Therefore, for questions where the information needed for the answer spans multiple chunks, the retriever may or may not return sufficient information depending on the chunk size.
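
To make the effect of chunk size and overlap concrete, here is a simplified word-based chunker. It is only an approximation for illustration: `SentenceSplitter` counts tokens and tries to respect sentence boundaries, while the hypothetical `chunk_words` below counts whitespace-separated words and cuts wherever the window ends.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size words, where consecutive
    windows share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))  # w0 w1 ... w9

small = chunk_words(text, chunk_size=4, chunk_overlap=1)
large = chunk_words(text, chunk_size=8, chunk_overlap=2)
print(small)
print(large)
```

With the small setting, a fact that spans `w2` to `w5` is split across two chunks, so a retriever returning a single chunk would see only part of it. With the large setting, the same span fits inside the first chunk.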

This time, we will use the following two settings:


#### Retriever_long - chunk size: 1024, chunk overlap: 200

#### Retriever_short - chunk size: 200, chunk overlap: 50

```Python
text_splitter_short = SentenceSplitter(chunk_size=200, chunk_overlap=50)

index_short = VectorStoreIndex.from_documents(documents=documents, transformations=[text_splitter_short])

text_splitter_long = SentenceSplitter(chunk_size=1024, chunk_overlap=200)

index_long = VectorStoreIndex.from_documents(documents=documents, transformations=[text_splitter_long])
```

In [None]:
### YOUR CODE HERE ###

text_splitter_short = SentenceSplitter(chunk_size=200, chunk_overlap=50)

index_short = VectorStoreIndex.from_documents(documents=documents, transformations=[text_splitter_short])

text_splitter_long = SentenceSplitter(chunk_size=1024, chunk_overlap=200)

index_long = VectorStoreIndex.from_documents(documents=documents, transformations=[text_splitter_long])

Additionally, to compare the retrieved results clearly, each retriever will be set to return only one node.

```Python
retriever_short = index_short.as_retriever(similarity_top_k=1)
retriever_long = index_long.as_retriever(similarity_top_k=1)
```

In [None]:
### YOUR CODE HERE ###

retriever_short = index_short.as_retriever(similarity_top_k=1)
retriever_long = index_long.as_retriever(similarity_top_k=1)

Now, let’s check the impact of `chunk size` by using the following question. This question is about Antananarivo, the capital of Madagascar.

```Python
question = "In Antananarivo, what were the foundational materials of traditional architecture, and what efforts have been made to protect and restore the city’s architectural and cultural heritage?"
```

To answer this question correctly, we need information on what the foundational materials of traditional architecture in Antananarivo are, as well as the efforts made to protect and restore this architectural heritage.

Let’s see how the retriever responds to this question with the code below.

```Python
ret_passages_short = retriever_short.retrieve(question)
for i in range(len(ret_passages_short)):
  print(pretty_print(ret_passages_short[i].text))
  print("\n\n\n")
```
```Python
ret_passages_long = retriever_long.retrieve(question)
for i in range(len(ret_passages_long)):
  print(pretty_print(ret_passages_long[i].text))
  print("\n\n\n")
```

In [None]:
### YOUR CODE HERE ###

question = "In Antananarivo, what were the foundational materials of traditional architecture, and what efforts have been made to protect and restore the city’s architectural and cultural heritage?"

In [None]:
### YOUR CODE HERE ###

ret_passages_short = retriever_short.retrieve(question)
for i in range(len(ret_passages_short)):
  print(pretty_print(ret_passages_short[i].text))
  print("\n\n\n")

In [None]:
### YOUR CODE HERE ###

ret_passages_long = retriever_long.retrieve(question)
for i in range(len(ret_passages_long)):
  print(pretty_print(ret_passages_long[i].text))
  print("\n\n\n")

Comparing the retrieved passages, you can see that when the `chunk size` is large, sufficient information is included, whereas with a small chunk size, the context is cut off.

By adjusting the `chunk size` in this way, the accuracy of the query engine can change. However, increasing the chunk size may also include unnecessary information, and detailed information might be overlooked by the generator.

Therefore, the optimal `chunk size` varies depending on the case, and this is an important configuration that can affect performance in RAG.

Choose the retriever and corresponding index you want to use for this practice. Later, you can modify this part and redeclare the query engine to change the type of retriever. This choice will also affect the query engine we create later.

```Python
# retriever = index_short.as_retriever(similarity_top_k=2)
# retriever = index_long.as_retriever(similarity_top_k=2)

# index = index_short #If you want to use this, remove #
# index = index_long #If you want to use this, remove #
```

In [None]:
### YOUR CODE HERE ###

retriever = index_short.as_retriever(similarity_top_k=2)
# retriever = index_long.as_retriever(similarity_top_k=2)

index = index_short #If you want to use this, remove #
# index = index_long #If you want to use this, remove #

<h4>Secondly, we will redefine the query engine and response synthesizer.  
There are two options available:  </h4>

First, you can use the query engine to **generate a summary** instead of an answer. The downside of this approach is that important information may be lost during the summarization process.

The second option is to **modify the instructions of the current query engine** you have been using. Currently, the query engine is set to generate an answer directly, but you can change the prompt used for this as you see fit. However, if an error occurs in the query engine, it may propagate to the generator.

As seen in previous experiments, crafting a good prompt significantly impacts the performance of an LLM. At this point, you will decide which version of the query engine to use and what prompt to employ.  
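
Both options come down to filling a template with the retrieved context and, for the answer variant, the query. The sketch below uses plain `str.format` rather than LlamaIndex's `PromptTemplate`, so it only illustrates the idea; the template texts are shortened and the names `QA_TEMPLATE`, `SUMMARY_TEMPLATE`, and `fill` are hypothetical.

```python
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)

SUMMARY_TEMPLATE = (
    "Write a summary of the following.\n"
    "{context_str}\n"
    "SUMMARY:"
)

def fill(template, **fields):
    """Substitute the named fields into the template."""
    return template.format(**fields)

passages = ["stub passage one", "stub passage two"]
context = "\n\n".join(passages)

qa_prompt_text = fill(QA_TEMPLATE, context_str=context, query_str="stub question")

# str.format ignores unused keyword arguments, so the same call also works
# for a template that has no {query_str} placeholder.
sum_prompt_text = fill(SUMMARY_TEMPLATE, context_str=context, query_str="stub question")

print(qa_prompt_text)
print(sum_prompt_text)
```

Note that the summary prompt never shows the question to the model, which is one reason details relevant to the question can be lost in a summary.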

<br/>

**Prompt**

```Python
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import BaseSynthesizer
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate

simple_qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

short_sum_prompt = PromptTemplate(
"""Write a summary of the following. Try to use only the information provided.
Try to include as many key details as possible.
---------------------\n
{context_str}
---------------------\n
SUMMARY:"""
)
```
**Custom query engine class**
```Python
class OurCustomQueryEngine(CustomQueryEngine):

    retriever: BaseRetriever
    response_synthesizer: BaseSynthesizer
    llm: OpenAI
    qa_prompt: PromptTemplate = simple_qa_prompt

    def custom_query(self, query_str: str):
        nodes = self.retriever.retrieve(query_str)

        context_str = "\n\n".join([n.node.get_content() for n in nodes])
        response = self.llm.complete(
            self.qa_prompt.format(context_str=context_str, query_str=query_str)
        )

        return str(response)

llm = OpenAI(model="gpt-3.5-turbo")
```

In [None]:
### YOUR CODE HERE ###

from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import BaseSynthesizer
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate

simple_qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

short_sum_prompt = PromptTemplate(
"""Write a summary of the following. Try to use only the information provided.
Try to include as many key details as possible.
---------------------\n
{context_str}
---------------------\n
SUMMARY:"""
)

In [None]:
### YOUR CODE HERE ###

class OurCustomQueryEngine(CustomQueryEngine):

    retriever: BaseRetriever
    response_synthesizer: BaseSynthesizer
    llm: OpenAI
    qa_prompt: PromptTemplate = simple_qa_prompt

    def custom_query(self, query_str: str):
        nodes = self.retriever.retrieve(query_str)

        context_str = "\n\n".join([n.node.get_content() for n in nodes])
        response = self.llm.complete(
            self.qa_prompt.format(context_str=context_str, query_str=query_str)
        )

        return str(response)

llm = OpenAI(model="gpt-3.5-turbo")

You can directly see how the two versions of the query engine generate results by checking their outputs with the code below.

```Python
retriever = index.as_retriever()
synthesizer = get_response_synthesizer(response_mode="compact")

query_engine_answer = OurCustomQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    llm=llm,
    qa_prompt=simple_qa_prompt,
)

res_answer = query_engine_answer.query("What's the arts and culture scene in Berlin?")
```
```Python
query_engine_sum = OurCustomQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    llm=llm,
    qa_prompt=short_sum_prompt,
)

res_summary = query_engine_sum.query("What's the arts and culture scene in Berlin?")
```
```Python
print(res_summary)
print(res_answer)
```

In [None]:
### YOUR CODE HERE ###

retriever = index.as_retriever()
synthesizer = get_response_synthesizer(response_mode="compact")

query_engine_answer = OurCustomQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    llm=llm,
    qa_prompt=simple_qa_prompt,
)

res_answer = query_engine_answer.query("What's the arts and culture scene in Berlin?")

In [None]:
### YOUR CODE HERE ###

query_engine_sum = OurCustomQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    llm=llm,
    qa_prompt=short_sum_prompt,
)

res_summary = query_engine_sum.query("What's the arts and culture scene in Berlin?")

In [None]:
### YOUR CODE HERE ###

print(res_summary)
print(res_answer)

Now, before redefining the RAG, you first need to decide which query engine to use.

```Python
#query_engine = query_engine_sum #If you want to use this, remove #
#query_engine = query_engine_answer #If you want to use this, remove #
```

In [None]:
### YOUR CODE HERE ###

query_engine = query_engine_sum  # summary-style engine (currently active)
# query_engine = query_engine_answer  # uncomment this line (and comment out the one above) to use the QA-style engine

Now, let's create a new RAG using the query engine and retriever we've built above.  

In this process, we'll provide clearer instructions in the input prompt that goes to the RAG's generator.


```Python
class Refine_RAG:
    @instrument
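    def retrieve(self, query: str) -> tuple:
        # Raw retrieved nodes (kept for inspection) plus the query engine's
        # output, which serves as the context for the generator.
        ret = retriever.retrieve(query)
        results = query_engine.query(query)
        return ret, str(results)

    @instrument
    def generate_response(self, query: str, context_str: str) -> str: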
        """
        Generate answer from context.
        """
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant. Answer as concisely as possible.",
            },
            {
                "role": "user",
                "content":
                    f"""
### Instruction
Please answer the following question based on the provided context. Your answer should be short and concise.
Basically, you have to answer the question based on the provided context. But you can use your parametric knowledge when the provided context is wrong or unrelated to the question.
When you generate the answer, you should explain the reason why you deduced such an answer from the context.

### Question
{query}

### Provided Context
{context_str}

### Answer


### Reason
                    """
            }
        ]

        response = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=messages,
        )

        return response.choices[0].message.content

    @instrument
    def query(self, query: str) -> str:
        ret, context_str = self.retrieve(query)
        # for i in range(len(ret)):
        #   print("Retrieved Context: \n", ret[i].text) # use only when you want to see the intermediate result
        # print("\n\nIntermediate Summary: \n",  context_str) # use only when you want to see the intermediate result
        completion = self.generate_response(query, context_str)
        return completion

refine_rag = Refine_RAG()
```

In [None]:
### YOUR CODE HERE ###

class Refine_RAG:
    @instrument
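    def retrieve(self, query: str) -> tuple:
        # Raw retrieved nodes (kept for inspection) plus the query engine's
        # output, which serves as the context for the generator.
        ret = retriever.retrieve(query)
        results = query_engine.query(query)
        return ret, str(results)

    @instrument
    def generate_response(self, query: str, context_str: str) -> str: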
        """
        Generate answer from context.
        """
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant. Answer as concisely as possible.",
            },
            {
                "role": "user",
                "content":
                    f"""
### Instruction
Please answer the following question based on the provided context. Your answer should be short and concise.
Basically, you have to answer the question based on the provided context. But you can use your parametric knowledge when the provided context is wrong or unrelated to the question.
When you generate the answer, you should explain the reason why you deduced such an answer from the context.

### Question
{query}

### Provided Context
{context_str}

### Answer


### Reason
                    """
            }
        ]

        response = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=messages,
        )

        return response.choices[0].message.content

    @instrument
    def query(self, query: str) -> str:
        ret, context_str = self.retrieve(query)
        # for i in range(len(ret)):
        #   print("Retrieved Context: \n", ret[i].text) # use only when you want to see the intermediate result
        # print("\n\nIntermediate Summary: \n",  context_str) # use only when you want to see the intermediate result
        completion = self.generate_response(query, context_str)
        return completion

refine_rag = Refine_RAG()

Within the RAG class, you can inspect all values and content that are created and passed along.

```Python
sample_question = "City council of Suwon addressed illegal dumping of household waste in what way?"

answer = refine_rag.query(sample_question)

print(f"\n\n{answer}")
```


In [None]:
### YOUR CODE HERE ###

sample_question = "City council of Suwon addressed illegal dumping of household waste in what way?"

answer = refine_rag.query(sample_question)

print(f"\n\n{answer}")

Hopefully the RAG's performance has improved, even if only a little :)

Of course, we can enhance the performance of the RAG not only through the three elements mentioned above but also in many other ways.

For instance, we can expect changes in performance by altering the following factors.

**Temperature**: This parameter controls the randomness of the output from a language model. A higher temperature increases the randomness, resulting in more varied and unpredictable text. Conversely, a lower temperature makes the model's responses more deterministic and predictable, often sticking closer to the most likely outcomes.  

**Top-p**: This parameter helps in controlling the diversity of the model's responses by focusing on the most probable tokens. Top-p sets a threshold to consider the top tokens up to the point where their cumulative probability exceeds a specified value p. This way, the model filters out less likely tokens and focuses only on a subset of the most probable tokens, which helps in generating coherent and contextually relevant outputs.

```Python
        response = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,  # change this value (allowed range: 0 to 2)
            top_p=0,  # change this value (allowed range: 0 to 1)
            messages=messages,
        )
```
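
To make the effect of these two parameters more concrete, here is a minimal sketch in plain Python. It is not how the OpenAI API is implemented internally; it only mimics the textbook definitions of temperature scaling and top-p (nucleus) filtering. The function names `apply_temperature` and `top_p_filter`, as well as the toy scores, are made up for this illustration.

```Python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumption of this sketch: temperature 0 is not handled here
        # (it corresponds to always picking the single most likely token).
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the kept tokens form a valid distribution again.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens

sharp = apply_temperature(toy_logits, 0.5)  # low temperature: mass concentrates on token 0
flat = apply_temperature(toy_logits, 2.0)   # high temperature: mass spreads out
print("T=0.5:", [round(x, 3) for x in sharp])
print("T=2.0:", [round(x, 3) for x in flat])

nucleus = top_p_filter(apply_temperature(toy_logits, 1.0), 0.8)
print("top_p=0.8 keeps token ids:", sorted(nucleus))
```

With a low temperature the most likely token takes almost all of the probability mass, while a high temperature gives the other tokens a real chance of being sampled. Top-p then cuts off the unlikely tail before sampling, so a small `top_p` leaves only a handful of candidates.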

**Different Embedding Function**: A stronger embedding model can capture the meaning of queries and passages more effectively, potentially allowing for more accurate retrieval.

Increasing the dimensionality of the embedding vectors lets them encode more information, which can improve their ability to find semantically similar passages. However, it can also lead to overfitting and make similarities harder to measure, because the data become sparsely spread across the dimensions, i.e., the curse of dimensionality.

Conversely, reducing the dimensionality increases computational efficiency and can help prevent overfitting. Yet overly small vectors may not capture enough information, leading to degraded performance.
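
The curse of dimensionality mentioned above can be seen in a small experiment. The sketch below uses random Gaussian vectors, which is an assumption made purely for illustration: real embeddings are trained and far more structured, so the effect is weaker in practice. The helper names `cosine` and `similarity_spread` are hypothetical.

```Python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_spread(dim, n_vectors=200, seed=0):
    """Standard deviation of the cosine similarity between one random query and random passages."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    sims = []
    for _ in range(n_vectors):
        passage = [rng.gauss(0, 1) for _ in range(dim)]
        sims.append(cosine(query, passage))
    mean = sum(sims) / len(sims)
    return math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))


spreads = {dim: similarity_spread(dim) for dim in (4, 64, 1024)}
for dim, spread in spreads.items():
    print(f"dim={dim:5d}  std of cosine similarity = {spread:.3f}")
```

As the dimension grows, the similarities of unrelated vectors bunch together around zero, so the gap between a "near" and a "far" passage shrinks. This is one reason why a larger embedding dimension does not automatically give better retrieval.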

```Python
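# Requires the llama-index-embeddings-openai and llama-index-embeddings-huggingface packages
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Option 1. Global default: OpenAI embeddings
Settings.embed_model = OpenAIEmbedding()

# Option 2. Global default: a local Hugging Face model
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"  # change this to a local path or a Hugging Face model name
)
Settings.embed_model = embed_model

# Option 3. Per-index: pass the embedding model to this index only
# (`documents` is the list of documents loaded earlier in this notebook)
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)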
```

Therefore, by using one of the options above, you can change the embedding function, and with it the dimension of the embedding vectors, to try to improve the performance of the RAG.
