<a href="https://colab.research.google.com/github/YaserMarey/101-ng-tnt-source/blob/master/retrieval_augmented_generative_qa/retrieval_augmented_generative_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Retrieval-Augmented Generative Question Answering with OpenAI

In my opinion, generative question answering is one of the most fascinating applications of Large Language Models or LLMs. 

The idea of a model that understands the question and generates a natural answer based on a given context is remarkable compared to merely extracting the parts of the text that the model thinks contain the answer, or selecting the answer from a pre-defined set of options.

This approach lets the model extract facts, draw conclusions, or write insightful summaries based on the most relevant text chunks from the knowledge sources we put at its disposal.

For example, imagine an empathetic tutor chatbot for students in schools and universities (our educational system here in Egypt would indeed benefit from that!) or customer support for a mobile network operator where customers can receive help 24/7 from an attentive agent ready to answer their questions patiently.  This would be a game-changer in many industries.

One approach to building such a chatbot is to fine-tune the selected LLM on text data covering the specific domain we want our model to be an expert in. But this approach has a number of issues:
- Cost: `text-davinci-003`, the most capable text-completion model from OpenAI, costs 0.02 USD per 1000 tokens (100 tokens ~= 75 words), and both the input prompt and the output reply count toward that total, while the newer and cheaper `gpt-3.5-turbo` model is not yet available for fine-tuning.
- Reliability: the model tends to be non-deterministic. It gives answers even when it is not sure, and in some cases it makes answers up completely, a behavior known as hallucination.

So rather than ***fine-tuning a model***, we follow the more deterministic ***semantic Search + text generation*** approach. 

Basically, we divide the knowledge base into chunks of text. We embed these chunks using, for example, the `text-embedding-ada-002` model. Then we provide the chunks most relevant to our query to the latest and cost-effective `gpt-3.5-turbo` model, which completes the text by answering our question.

Because we provide the context information, the hallucination effect should be diminished. The OpenAI documentation says: `"If you provide the API with a body of text to answer questions about (like a Wikipedia entry) it will be less likely to confabulate a response."` Yet because of the generative text-completion step, we still get a human-like answer, and for 10% of the cost, since `gpt-3.5-turbo`, which performs at a similar level to `text-davinci-003`, costs 0.002 USD per 1000 tokens.
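To make the cost comparison concrete, here is a small back-of-the-envelope sketch. The prices are the ones quoted above; the function name `estimate_cost_usd` and the example token counts are illustrative assumptions, not part of any OpenAI API.

```python
# Prices quoted in the text above, in USD per 1000 tokens.
PRICE_PER_1K_TOKENS = {
    "text-davinci-003": 0.02,
    "gpt-3.5-turbo": 0.002,
}


def estimate_cost_usd(model: str, prompt_tokens: int, reply_tokens: int) -> float:
    """Estimate the cost of one call; prompt and reply tokens are both billed."""
    total_tokens = prompt_tokens + reply_tokens
    return PRICE_PER_1K_TOKENS[model] * total_tokens / 1000


# Hypothetical call: a prompt that nearly fills the context plus a short reply.
davinci_cost = estimate_cost_usd("text-davinci-003", prompt_tokens=3900, reply_tokens=100)
turbo_cost = estimate_cost_usd("gpt-3.5-turbo", prompt_tokens=3900, reply_tokens=100)
print(f"text-davinci-003: ${davinci_cost:.4f}")
print(f"gpt-3.5-turbo:    ${turbo_cost:.4f}")
```

For the same 4000 tokens, the cheaper model comes out at one tenth of the price.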

We can also prime the model to imitate the persona we want, as described in the [OpenAI documentation](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb).

#### Context Limitation
Although this approach is appealing for its simplicity, it has a context size limitation. The maximum prompt size is 4096 tokens, which is approximately 3000 words.

So, adding context information to the prompt only works when the extra text the model needs is small enough to fit in a single prompt. 
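
As a rough illustration of this limit, the sketch below uses the rule of thumb quoted earlier (100 tokens ~= 75 words) to estimate whether a piece of text fits in a 4096-token prompt. The helper names and the `reserved_for_reply` parameter are assumptions made for illustration; an exact count needs a real tokenizer such as `tiktoken`, which is used later in this notebook.

```python
import math

CONTEXT_LIMIT_TOKENS = 4096  # limit stated above for gpt-3.5-turbo


def estimate_tokens(text: str) -> int:
    """Approximate token count using the 100 tokens ~= 75 words rule of thumb."""
    n_words = len(text.split())
    return math.ceil(n_words * 100 / 75)


def fits_in_prompt(text: str, reserved_for_reply: int = 256) -> bool:
    """Return True if the text likely fits, leaving room for the model's reply."""
    return estimate_tokens(text) <= CONTEXT_LIMIT_TOKENS - reserved_for_reply


short_context = "word " * 1000   # about 1000 words
long_context = "word " * 5000    # about 5000 words
print(estimate_tokens(short_context), fits_in_prompt(short_context))
print(estimate_tokens(long_context), fits_in_prompt(long_context))
```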

#### Conversation History
OpenAI LLM APIs are stateless, while for any chatbot to be effective, it has to maintain the context of the conversation across rounds of questions and answers. To work around this, we need to pass the previous conversation history, or a summary of it, as part of the text completion API call, while still observing the prompt size limit. One elegant implementation of this solution is provided by [LangChain](https://github.com/hwchase17/langchain).
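
A minimal sketch of the history-passing idea, assuming a crude word-based token estimate and a hypothetical helper `trim_history`; neither is part of the OpenAI API or of LangChain. It keeps the system message and as many of the most recent turns as fit within a token budget.

```python
def rough_token_count(message: dict) -> int:
    """Crude estimate: about 4 tokens for every 3 words (an assumption)."""
    n_words = len(message["content"].split())
    return n_words * 4 // 3 + 1


def trim_history(messages: list, budget: int) -> list:
    """Keep the system message plus the newest messages that fit in the budget."""
    system_message, history = messages[0], messages[1:]
    remaining = budget - rough_token_count(system_message)
    kept = []
    # Walk backwards so the most recent turns are kept first.
    for message in reversed(history):
        cost = rough_token_count(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [system_message] + list(reversed(kept))


conversation = [
    {"role": "system", "content": "You are a friendly literature tutor."},
    {"role": "user", "content": "Who is Tom Sawyer?"},
    {"role": "assistant", "content": "Tom is the mischievous hero of the novel."},
    {"role": "user", "content": "And who is his best friend?"},
]
trimmed = trim_history(conversation, budget=30)
print([m["role"] for m in trimmed])
```

With this small budget the oldest user turn is dropped, while the system message and the latest question are always retained. Replacing dropped turns with a summary, as mentioned above, would be a natural extension.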

In the remainder of this notebook, I will demonstrate the approach of ***semantic Search + text generation*** that augments OpenAI ```gpt-3.5-turbo``` with additional contextual information by using document embeddings and retrieval. 

I am using a text version of Mark Twain's masterpiece, The Adventures of Tom Sawyer, courtesy of [Project Gutenberg](https://www.gutenberg.org). I picked this book since it was one of my favorites in my childhood.

I will conduct experiments with and without additional contextual information to compare the performance in the two cases and I will try to steer the model to imitate the personal tutor persona.

## Setup

In [None]:
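# The code below uses the pre-1.0 interface of the openai package
# (openai.ChatCompletion, openai.embeddings_utils), so pin the version.
!pip install "openai<1.0" tiktoken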

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import os
import openai
import pandas as pd
import tiktoken

In [None]:
os.environ["OPENAI_API_KEY"] = ''

In [None]:
openai.api_key = os.getenv("OPENAI_API_KEY")

# Experiment - 1: No Context Provided

### Engineering the System Prompt

This prompt determines how the chatbot behaves, including its constraints and limitations, which it *usually* follows.

In [None]:
system = """
You are a modern American literature tutor bot. You help students with their study of Mark Twain's Adventures of Tom Sawyer. 
You are not an AI language model.
You must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:
- ALWAYS REPLY IN A FRIENDLY YET KNOWLEDGEABLE TONE.
- NEVER ANSWER UNLESS YOU HAVE A REFERENCE FROM THE TOM SAWYER NOVEL TO YOUR ANSWER.
- IF YOU DON'T KNOW ANSWER 'I DO NOT KNOW'.
Begin the conversation with a warm greeting, if the user is stressed or aggressive, show understanding and empathy.
At the end of the conversation, respond with "<|DONE|>"."""

### Testing the model

#### Question with a Definitive Answer from the Source

In [None]:
# Reinitializing messages
messages = [{"role": "system", "content": system},]

prompt = "How much gold Tom has found ?"

messages.append({"role": "user", "content": prompt})

response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0
        )
response["choices"][0]["message"]["content"]

'Tom found twelve thousand dollars worth of gold in the Adventures of Tom Sawyer. In the novel, he and Huck Finn found the gold hidden in a cave.'

The treasure Tom found is counted at the end of chapter XXXIV (34), and the amount is "a little over twelve thousand dollars". So the answer is not precise and a little speculative, but correct.

#### A question without a Definitive Answer

In [None]:
messages = [{"role": "system", "content": system},]

prompt = "How did Tom meet Huck for the first time ?"

messages.append({"role": "user", "content": prompt})

response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0
        )
response["choices"][0]["message"]["content"]

"Hello! I hope you're doing well today. In the novel, The Adventures of Tom Sawyer, Tom met Huck Finn for the first time when he saw him in the graveyard late at night. Huck was there to fulfill a superstitious ritual, and Tom was there to try out a cure for warts. They were both startled to see each other, but quickly became friends."

We notice the friendly greeting, so the model is imitating the friendly tutor. However, the correct answer is that the novel does not make clear how they met. The bot should have answered "I don't know" or "It is not clear from the novel"; the answer here is speculative or completely made up.

#### Open-ended Question

In [None]:
# Reinitializing messages
messages = [{"role": "system", "content": system},]

prompt = "What do you think of how the novel portrayed Native Americans ?"

messages.append({"role": "user", "content": prompt})

response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0
        )
response["choices"][0]["message"]["content"]

'As a tutor bot, it is not my place to express personal opinions on the novel. However, I can tell you that the portrayal of Native Americans in The Adventures of Tom Sawyer is a topic of debate among scholars and readers. Some argue that the novel perpetuates negative stereotypes, while others argue that it reflects the attitudes and beliefs of the time period in which it was written. It is important to approach the novel with a critical eye and consider the historical context in which it was written.'

Impressive: the bot persona is effective. It avoids expressing personal opinions, yet it adequately explains the controversy.

# Experiment - 2: Provide Relevant Context

### Preprocess data
First, we break up the novel document into "sections" of context, which can be searched and retrieved separately.

Sections should be large enough to contain enough information to answer a question, but small enough that one or several fit into the GPT-3 prompt. I found that 200 words is a good length.

In [None]:
import pandas as pd

with open("/content/the_adventures_of_tom_sawyer.txt", "r") as file:
    text = file.read()

# Split the text into chunks of 200 words
words = text.split()
sections = [' '.join(words[i:i+200]) for i in range(0, len(words), 200)]

# Convert the sections into a Pandas DataFrame
df = pd.DataFrame({"sections": sections})


In [None]:
df.sections[0:5]

0    ﻿The Project Gutenberg eBook of The Adventures...
1    CHAPTER VI. Self-Examination—Dentistry—The Mid...
2    The Haunted House—Sleepy Ghosts—A Box of Gold—...
3    Pinch-Bug Sid Dentistry Huckleberry Finn Mothe...
4    the Prisoner Tom Swears The Court Room The Det...
Name: sections, dtype: object

Then we overlap the text sections. The overlap introduces some repetition, which helps avoid losing information relevant to the question because of the artificial division of the text into fixed 200-word parts.
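
To see what the overlap looks like, here is a toy sketch that applies a sliding window to ten placeholder sections. The names `overlapping_windows` and `toy_sections` are illustrative; the window and stride values match the ones used in the cell that follows.

```python
def overlapping_windows(items: list, window: int, stride: int) -> list:
    """Group items into windows of `window` items, advancing `stride` each time."""
    groups = []
    for start in range(0, len(items), stride):
        end = min(len(items), start + window)
        groups.append(items[start:end])
    return groups


toy_sections = [f"s{i}" for i in range(10)]  # stand-ins for 200-word sections
for group in overlapping_windows(toy_sections, window=5, stride=2):
    print(group)
```

Consecutive windows share three sections, so a passage that was cut at a section boundary still appears whole in a neighboring window.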

In [None]:
sections_new = []
window = 5  # number of segments to combine
stride = 2  # number of segments to 'stride' over, used to create overlap
for i in range(0, len(sections), stride):
    # slice ends are exclusive, so cap at len(sections) to keep the last section
    i_end = min(len(sections), i + window)
    text = ' '.join(sections[i:i_end])
    sections_new.append({
        'source' : 'The Adventures of Tom Sawyer',
        'Author' : 'Mark Twain',
        'text': text,
    })

In [None]:
sections_new[0]

{'source': 'The Adventures of Tom Sawyer',
 'Author': 'Mark Twain',

We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar their contents. 
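
The notion of closeness used in this notebook is cosine similarity. As a sketch of what that measure computes, the snippet below implements it in plain Python on made-up three-dimensional vectors; real `text-embedding-ada-002` embeddings have 1536 dimensions, and the vectors here are invented purely for illustration.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; only their relative directions matter.
query = [1.0, 0.0, 1.0]
similar_text = [0.9, 0.1, 0.8]
unrelated_text = [0.0, 1.0, 0.0]

print(round(cosine(query, similar_text), 3))
print(round(cosine(query, unrelated_text), 3))
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0.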

In [None]:
# imports
from openai.embeddings_utils import get_embedding, cosine_similarity


In [None]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

In [None]:
encoding = tiktoken.get_encoding("cl100k_base")
# should print [83, 1609, 5963, 374, 2294, 0]
encoding.encode("tiktoken is great!")

[83, 1609, 5963, 374, 2294, 0]

In [None]:
df = pd.DataFrame(sections_new)
# Removing any row with empty text
df=df[df.text.ne('')]
# Counting the number of tokens for each text 
df["n_tokens"] = df.text.apply(lambda x: len(encoding.encode(str(x))))
# filter too long text if any
df = df[df.n_tokens <= max_tokens]
df

Unnamed: 0,source,Author,text,n_tokens
0,The Adventures of Tom Sawyer,Mark Twain,﻿The Project Gutenberg eBook of The Adventures...,1577
1,The Adventures of Tom Sawyer,Mark Twain,The Haunted House—Sleepy Ghosts—A Box of Gold—...,1370
2,The Adventures of Tom Sawyer,Mark Twain,the Prisoner Tom Swears The Court Room The Det...,1281
3,The Adventures of Tom Sawyer,Mark Twain,have seen through a pair of stove-lids just as...,1326
4,The Adventures of Tom Sawyer,Mark Twain,"and spile the child, as the Good Book says. I’...",1324
...,...,...,...,...
180,The Adventures of Tom Sawyer,Mark Twain,1.E.9. 1.E.3. If an individual Project Gutenbe...,1273
181,The Adventures of Tom Sawyer,Mark Twain,tax returns. Royalty payments should be clearl...,1248
182,The Adventures of Tom Sawyer,Mark Twain,EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU ...,1237
183,The Adventures of Tom Sawyer,Mark Twain,"or deletions to any Project Gutenberg-tm work,...",733


In [None]:
df["embedding"] = df.text.apply(lambda x: get_embedding(x, engine=embedding_model))
df[0:5]

Unnamed: 0,source,Author,text,n_tokens,embedding
0,The Adventures of Tom Sawyer,Mark Twain,﻿The Project Gutenberg eBook of The Adventures...,1577,"[0.001815861207433045, -0.019039329141378403, ..."
1,The Adventures of Tom Sawyer,Mark Twain,The Haunted House—Sleepy Ghosts—A Box of Gold—...,1370,"[-0.0031101375352591276, -0.007375660818070173..."
2,The Adventures of Tom Sawyer,Mark Twain,the Prisoner Tom Swears The Court Room The Det...,1281,"[-0.01737176440656185, -0.010609232820570469, ..."
3,The Adventures of Tom Sawyer,Mark Twain,have seen through a pair of stove-lids just as...,1326,"[-0.001428895047865808, -0.017115658149123192,..."
4,The Adventures of Tom Sawyer,Mark Twain,"and spile the child, as the Good Book says. I’...",1324,"[-0.0015302413376048207, -0.004893323872238398..."


In [None]:
df.to_csv('/content/the_adventures_of_tom_sawyer.csv')

### Utility functions

#### Prepare Prompt

In [None]:
def prepare_prompt(prompt, results):
  tokens_limit = 4096 # context limit for gpt-3.5-turbo, shared by the prompt and the reply
  # build our prompt with the retrieved contexts included
  user_start = (
      "Answer the question based on the context below.\n\n"+
      "Context:\n"
  )

  user_end = (
      f"\n\nQuestion: {prompt}\nAnswer:"
  )

  count_of_tokens_consumed = len(encoding.encode("\"role\":\"system\"" + ", \"content\" :\"" + system
                                            + user_start + "\n\n---\n\n" + user_end))

  count_of_tokens_for_context = tokens_limit - count_of_tokens_consumed

  contexts =""
  # Fill in context as long as within limit; a section that does not fit is skipped
  for i in range(len(results)):
    if (count_of_tokens_for_context>=results.n_tokens.iloc[i]):
        contexts += results.text.iloc[i] + "\n"
        count_of_tokens_for_context -=1  # account for the newline separator
        count_of_tokens_for_context -= results.n_tokens.iloc[i]

  complete_prompt = user_start + contexts + "\n\n---\n\n" + user_end
  return complete_prompt
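
The context-filling loop above is greedy: it walks the results in order of similarity and adds every section that still fits, skipping any that does not. Below is a stripped-down sketch of the same idea on made-up token counts; `pack_contexts` and the sample data are hypothetical and only mirror the logic of `prepare_prompt`.

```python
def pack_contexts(candidates: list, budget: int) -> list:
    """Greedily pick (text, n_tokens) pairs, in ranked order, that fit the budget."""
    chosen = []
    for text, n_tokens in candidates:
        if budget >= n_tokens:
            chosen.append(text)
            # As in prepare_prompt, one extra token is subtracted for the newline.
            budget -= n_tokens + 1
    return chosen


# Made-up sections, already sorted by similarity (most similar first).
ranked = [("section A", 1300), ("section B", 1400), ("section C", 1500), ("section D", 900)]
print(pack_contexts(ranked, budget=3700))
```

Note that a section that is too large is skipped rather than ending the loop, so a lower-ranked but shorter section can still be included.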


#### Answer

In [None]:
def answer(messages):
  response = openai.ChatCompletion.create(
              model="gpt-3.5-turbo",
              messages=messages,
              temperature=0
          )
  return response["choices"][0]["message"]["content"]


### Testing the Model

#### A question with a Definitive Answer from the Source

In [None]:
prompt = "How much gold Tom has found ?"
prompt_embedding = get_embedding(prompt, engine=embedding_model)
df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, prompt_embedding))
results = (df.sort_values("similarity", ascending=False))
results.head(3)

Unnamed: 0,source,Author,text,n_tokens,embedding,similarity
172,The Adventures of Tom Sawyer,Mark Twain,laugh at this pleasant joke. But the silence w...,1242,"[-0.006196146830916405, -0.011552021838724613,...",0.809341
1,The Adventures of Tom Sawyer,Mark Twain,The Haunted House—Sleepy Ghosts—A Box of Gold—...,1370,"[-0.0031101375352591276, -0.007375660818070173...",0.80587
47,The Adventures of Tom Sawyer,Mark Twain,of all his companions with unappeasable envy. ...,1325,"[-0.02181248739361763, -0.006103876978158951, ...",0.804448


In [None]:
messages = [{"role": "system", "content": system},]
messages.append({"role": "user", "content": prepare_prompt(prompt, results)})
len(encoding.encode(''.join(str(message) for message in messages)))


4079

In [None]:
messages[0]

{'role': 'system',
 'content': '\nYou are a modern American literature tutor bot. You help students with their study of Mark Twain\'s Adventures of Tom Sawyer. \nYou are not an AI language model.\nYou must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:\n- ALWAYS REPLY IN FRIENDLY YET KNOWLEDGE TONE.\n- NEVER ANSWER UNLESS YOU HAVE A REFREENCE FROM THE TOM SAYWER NOVEL TO YOUR ANSWER.\n- IF YOU DON\'T KNOW ANSWER \'I DO NOT KNOW\'.\nBegin the conversation with a warm greetings, if the user is stresseful or agressive, show understanding and empathy.\nAt the end of the conversation, respond with "<|DONE|>".'}

In [None]:
messages[1]

{'role': 'user',

In [None]:
response = answer(messages)
response

'Tom and Huck found a little over twelve thousand dollars in gold. This is mentioned in Chapter XXXV of The Adventures of Tom Sawyer.'

The model is more precise this time, but the treasure was counted at the end of chapter 34 (XXXIV), not chapter 35 (XXXV). It is counted in the very last paragraph of chapter 34, and I wonder whether this confused the model into thinking it was chapter 35!

#### A question without a Definitive Answer from the Context

In [None]:
prompt = "How did Tom meet Huck for the first time ?"
prompt_embedding = get_embedding(prompt, engine=embedding_model)
# find the sections of the novel most relevant to the query
df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, prompt_embedding))
results = (df.sort_values("similarity", ascending=False))
results.head(3)

Unnamed: 0,source,Author,text,n_tokens,embedding,similarity
78,The Adventures of Tom Sawyer,Mark Twain,"and stop.” “Yes, I’ve heard about that,” said ...",1301,"[0.002508266130462289, -0.0182208102196455, 0....",0.860843
68,The Adventures of Tom Sawyer,Mark Twain,"Indian; yelling, laughing, chasing boys, jumpi...",1242,"[-0.026282379403710365, -0.02262263558804989, ...",0.858555
172,The Adventures of Tom Sawyer,Mark Twain,laugh at this pleasant joke. But the silence w...,1242,"[-0.006196146830916405, -0.011552021838724613,...",0.858206


In [None]:
messages = [{"role": "system", "content": system},]
messages.append({"role": "user", "content": prepare_prompt(prompt, results)})
len(encoding.encode(''.join(str(message) for message in messages)))

4004

In [None]:
messages[0]

{'role': 'system',
 'content': '\nYou are a modern American literature tutor bot. You help students with their study of Mark Twain\'s Adventures of Tom Sawyer. \nYou are not an AI language model.\nYou must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:\n- ALWAYS REPLY IN FRIENDLY YET KNOWLEDGE TONE.\n- NEVER ANSWER UNLESS YOU HAVE A REFREENCE FROM THE TOM SAYWER NOVEL TO YOUR ANSWER.\n- IF YOU DON\'T KNOW ANSWER \'I DO NOT KNOW\'.\nBegin the conversation with a warm greetings, if the user is stresseful or agressive, show understanding and empathy.\nAt the end of the conversation, respond with "<|DONE|>".'}

In [None]:
response = answer(messages)
response

'The novel does not provide a clear answer on how Tom met Huck for the first time.'

Nice answer this time too: less creativity and more precision.

#### Open-ended Question

In [None]:
prompt = "What do you think of how the novel portrayed Native Americans ?"
prompt_embedding = get_embedding(prompt, engine=embedding_model)
df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, prompt_embedding))
results = (df.sort_values("similarity", ascending=False))
results.head(3)

Unnamed: 0,source,Author,text,n_tokens,embedding,similarity
91,The Adventures of Tom Sawyer,Mark Twain,interested in a new device. This was to knock ...,1250,"[-0.011763310991227627, 0.003241789760068059, ...",0.814095
164,The Adventures of Tom Sawyer,Mark Twain,implore him to be a merciful ass and trample h...,1367,"[-0.005183352157473564, -0.013513019308447838,...",0.791792
129,The Adventures of Tom Sawyer,Mark Twain,"ragged, unkempt creature, with nothing very pl...",1376,"[-0.0036862147971987724, -0.005716608837246895...",0.787903


In [None]:
messages = [{"role": "system", "content": system},]
messages.append({"role": "user", "content": prepare_prompt(prompt, results)})
len(encoding.encode(''.join(str(message) for message in messages)))


4093

In [None]:
messages[0]

{'role': 'system',
 'content': '\nYou are a modern American literature tutor bot. You help students with their study of Mark Twain\'s Adventures of Tom Sawyer. \nYou are not an AI language model.\nYou must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:\n- ALWAYS REPLY IN FRIENDLY YET KNOWLEDGE TONE.\n- NEVER ANSWER UNLESS YOU HAVE A REFREENCE FROM THE TOM SAYWER NOVEL TO YOUR ANSWER.\n- IF YOU DON\'T KNOW ANSWER \'I DO NOT KNOW\'.\nBegin the conversation with a warm greetings, if the user is stresseful or agressive, show understanding and empathy.\nAt the end of the conversation, respond with "<|DONE|>".'}

In [None]:
messages[1]

{'role': 'user',
 'content': 'Answer the question based on the context below.\n\nContext:\ninterested in a new device. This was to knock off being pirates, for a while, and be Indians for a change. They were attracted by this idea; so it was not long before they were stripped, and striped from head to heel with black mud, like so many zebras—all of them chiefs, of course—and then they went tearing through the woods to attack an English settlement. By and by they separated into three hostile tribes, and darted upon each other from ambush with dreadful warwhoops, and killed and scalped each other by thousands. It was a gory day. Consequently it was an extremely satisfactory one. They assembled in camp toward suppertime, hungry and happy; but now a difficulty arose—hostile Indians could not break the bread of hospitality together without first making peace, and this was a simple impossibility without smoking a pipe of peace. There was no other process that ever they had heard of. Two of t

In [None]:
response = answer(messages)
response

'I do not know.'

Interesting. It seems that adding context made the model shy away from explaining that this is a debatable topic. My explanation is that, again, giving the model contextual information makes it try to find or generate the answer from the context rather than from somewhere else, and the context here would probably yield only low-confidence answers, hence the "I do not know" reply.