## Retrieval-Augmented Generative OpenAI Question Answering with OpenAI

In my opinion, generative question answering is one of the most interesting applications of Large Language Models or LLMs. 

Generative question answering is a type of natural language processing (NLP) task where a model generates a natural language answer to a question based on a given context or passage.

In this approach, the model generates the answer from scratch rather than selecting an answer from a pre-defined set of options. 

Ideally what I want to build is a chatbot that can answer my questions by extracting facts, drawing conclusions, or providing an insightful summary based on the most relevant text chunks extracted from the knowledge sources I put at its disposal. 

I can imagine this empathic bot to be a personal tutor for students in schools and universities, our educational system here in Egypt would benefit from such a bot, or a customer support agent, for example, if we are mobile phone operators and we want to allow our customers to send their inquiries 24/7 and find attentive support ready to answer them patiently.

### Two Approaches with OpenAI
Now, one approach to building such a bot is to fine-tune the selected LLM on text data covering the fine domain we want our model to be an expert in. But this approach has a number of issues:
- Cost: Davinci model costs 0.02 USD per 1000 tokens (100 tokens ~= 75 words)
- Cheaper and latest ```gpt-turbo-3.5``` model is not available yet for tuning
- The model tends to be non-deterministic giving answers even when it is not sure or completely making answers up, aka hallucination.

So rather than fine-tuning a model, we follow the more deterministic Semantic Search in + Text Generation approach. We divide the knoweldge base into chunks of text. We embed these chunks using ```text-embedding-ada-002``` model, then we provide text chunks we found relevant to our query to the latest and cost-effective ```gpt-turbo-3.5``` model to complete the text.

**Because** we provided context information we stop the hallucinations, and the LLMs provide factual accurate answers. The OpenAI documentation says: "If you provide the API with a body of text to answer questions about (like a Wikipedia entry) it will be less likely to confabulate a response." yet because of the generative text completion step we still get a human-like answer. 

**Also** with the answer we can give the source text our bot used to generate its reply, this would help the users to trust the system and 
confirm the reliability of the information presented to them. 

**And** we prime the model that to say "I don't know" for low-confidence answers.

**Finally** as we said above this is cost-effective because we use ```gpt-3.5-turbo``` which performs at a similar capability to ```text-davinci-003``` but at 10% its price per token.

## Limitations and Enhacements

### Context Limitation
Although this approach is appealing for its simplicity, it has a context size limitation. The maximum size prompt size is 4000 tokens which are approximately equal to 3000 words.

So, adding context information to the prompt only works when the dataset of extra text that the model may need to know is small enough to fit in a single prompt. What do we do when we need the model to choose relevant contextual information from within a large body of information? and how context size limitation affects the quality of the answers this is still a question I need to look into.

### Conversation History
Another area that I may look into is prompt engineering to maintain the context across rounds of questions and answers by passing previous conversation history summaries in as part of the text completion API call. One elegant solution for this problem is what is done by the trending framework [LangChain](https://github.com/hwchase17/langchain).

### Bot Persona
Additionally, the API support "persona", which allows you to specify certain traits or characteristics of a fictional persona to add more context to the conversation. From [openai documentation](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb)

I am here using a text version of Mark Twain's masterpiece Adventures of Tom Sawyer. Credit is to [gutenburg.org](gutenburg.org) project. I picked this book since it was one of my favorites in my childhood.

In the remainder of this notebook, I will demonstrate a method for augmenting OpenAI ```gpt-3.5-turbo``` with a large body of additional contextual information by using document embeddings and retrieval. 

This method answers queries in two steps: 

**first** it retrieves the information relevant to the query, **then** it writes an answer tailored to the question based on the retrieved information. The first step uses the **Embeddings** API, and the second step uses the **Completions** API.

## Setup

In [2]:
!pip install openai tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
import os
# os.environ["OPENAI_API_KEY"] = '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'

In [7]:
import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken

# Experiment - 1: No Context Provided

### The System Prompt

This prompt is what determines the entire behavior how the chatbot works, including its constraints and limitations which it *usually* follows. Tweak it if you think you can get better results, or want to adjust it for a different character!

In [15]:
openai.api_key = os.getenv("OPENAI_API_KEY")

system = """
You are a modern American literature tutor bot to help students with their study of Mark Twain's Adventures of Tom Sawyer. 
You are not an AI language model.
You must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:
- ALWAYS REPLY IN FRIENDLY YET KNOWLEDGE TONE.
- NEVER ANSWER UNLESS YOU HAVE A REFREENCE FROM THE TOM SAYWER NOVEL TO YOUR ANSWER.
- IF YOU DON'T KNOW ANSWER 'I DO NOT KNOW'.
Begin the conversation with a warm greetings, you may follow up with a quize question the readers knowledge of Mark Twain Tom Sawyer novel
Refuse to talk about either race or gender. If asked about either race or gender, reply politly that you are designed to teach Mark Twain works only.
If the user is stresseful or agressive, show understanding and empathy.
At the end of the conversation, respond with "<|DONE|>"."""

messages = [{"role": "system", "content": system},]



'Hello! Tom met Huck Finn in the Adventures of Tom Sawyer when he was playing alone in the woods and stumbled upon Huck, who was also playing there. Huck was initially afraid of Tom, but they soon became friends and started to have many adventures together.'

### Test the model

In [17]:
prompt = "How did Tom meet Huck for the first time ?"

messages.append({"role": "user", "content": prompt})

response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0
        )
response["choices"][0]["message"]["content"]

'Hello! In the novel "The Adventures of Tom Sawyer," Tom met Huck Finn for the first time when he saw him in the street and threw a rock at him. Huck then chased Tom until Tom convinced him to stop by offering him a small amount of money. From that point on, they became friends and had many adventures together.'

# Experiment - 2: Provide Relevant Context

### Preprocess data
First we break up the document library into "sections" of context, which can be searched and retrieved separately.

Sections should be large enough to contain enough information to answer a question; but small enough to fit one or several into the GPT-3 prompt. I found that approximately a 200 word section of text is a good length, but we should experiment for every particular use case.

In [19]:
import pandas as pd

# Open the file for reading
with open("/content/the_adventures_of_tom_sawyer.txt", "r") as file:

    # Read the entire file into a string
    text = file.read()

# Split the text into chunks of 200 words
words = text.split()
paragraphs = [' '.join(words[i:i+200]) for i in range(0, len(words), 200)]

# Convert paragraphs into a Pandas DataFrame
df = pd.DataFrame({"paragraphs": paragraphs})


In [20]:
df.paragraphs[0:5]

0    ﻿The Project Gutenberg eBook of The Adventures...
1    CHAPTER VI. Self-Examination—Dentistry—The Mid...
2    The Haunted House—Sleepy Ghosts—A Box of Gold—...
3    Pinch-Bug Sid Dentistry Huckleberry Finn Mothe...
4    the Prisoner Tom Swears The Court Room The Det...
Name: paragraphs, dtype: object

Then we overlap text chunks. This overlapping allow some repetions which helps to avoid losing valuable information relevant to the question because of the artficial dvisition of the text into fixed 200 long parts.

In [21]:
paragraphs_new = []
window = 5  # number of segments to combine
stride = 2  # number of segments to 'stride' over, used to create overlap
for i in (range(0, len(paragraphs), stride)):
    i_end = min(len(paragraphs)-1, i+window)
    text = ' '.join(_ for _ in paragraphs[i:i_end])
    paragraphs_new.append({
        'text': text,
    })

In [23]:
paragraphs_new[0]



We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents. 


In [24]:
# imports
import pandas as pd
import tiktoken
from openai.embeddings_utils import get_embedding, cosine_similarity


In [25]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

In [26]:
encoding = tiktoken.get_encoding("cl100k_base")
# should print [83, 1609, 5963, 374, 2294, 0]
encoding.encode("tiktoken is great!")

[83, 1609, 5963, 374, 2294, 0]

In [27]:
df = pd.DataFrame(paragraphs_new)
# noteiced a row with empty text, removing that
df=df[df.text.ne('')]
# encode , I might not need this step
df["n_tokens"] = df.text.apply(lambda x: len(encoding.encode(str(x))))
# filter too long text if any
df = df[df.n_tokens <= max_tokens]
df

Unnamed: 0,text,n_tokens
0,﻿The Project Gutenberg eBook of The Adventures...,1577
1,The Haunted House—Sleepy Ghosts—A Box of Gold—...,1370
2,the Prisoner Tom Swears The Court Room The Det...,1281
3,have seen through a pair of stove-lids just as...,1326
4,"and spile the child, as the Good Book says. I’...",1324
...,...,...
180,1.E.9. 1.E.3. If an individual Project Gutenbe...,1273
181,tax returns. Royalty payments should be clearl...,1248
182,EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU ...,1237
183,"or deletions to any Project Gutenberg-tm work,...",733


In [28]:
df["embedding"] = df.text.apply(lambda x: get_embedding(x, engine=embedding_model))
df[0:5]

Unnamed: 0,text,n_tokens,embedding
0,﻿The Project Gutenberg eBook of The Adventures...,1577,"[0.001815861207433045, -0.019039329141378403, ..."
1,The Haunted House—Sleepy Ghosts—A Box of Gold—...,1370,"[-0.0031101375352591276, -0.007375660818070173..."
2,the Prisoner Tom Swears The Court Room The Det...,1281,"[-0.01737176440656185, -0.010609232820570469, ..."
3,have seen through a pair of stove-lids just as...,1326,"[-0.001428895047865808, -0.017115658149123192,..."
4,"and spile the child, as the Good Book says. I’...",1324,"[-0.0015302413376048207, -0.004893323872238398..."


In [29]:
df["embedding"] = df.embedding.apply(np.array)
df[0:5]

Unnamed: 0,text,n_tokens,embedding
0,﻿The Project Gutenberg eBook of The Adventures...,1577,"[0.001815861207433045, -0.019039329141378403, ..."
1,The Haunted House—Sleepy Ghosts—A Box of Gold—...,1370,"[-0.0031101375352591276, -0.007375660818070173..."
2,the Prisoner Tom Swears The Court Room The Det...,1281,"[-0.01737176440656185, -0.010609232820570469, ..."
3,have seen through a pair of stove-lids just as...,1326,"[-0.001428895047865808, -0.017115658149123192,..."
4,"and spile the child, as the Good Book says. I’...",1324,"[-0.0015302413376048207, -0.004893323872238398..."


In [30]:
df.to_csv('/content/the_adventures_of_tom_sawyer.csv')

### Embed The Query

In [31]:
prompt = "How did Tom meet Huck for the first time ?"
prompt_embedding = get_embedding(prompt, engine=embedding_model)
prompt_embedding

[0.005511189345270395,
 -0.025657134130597115,
 -0.003149250987917185,
 -0.017638452351093292,
 -0.013364468701183796,
 0.04104612022638321,
 0.012663165107369423,
 0.010823896154761314,
 -0.007773886434733868,
 -0.024876436218619347,
 0.01972913183271885,
 0.018392683938145638,
 -0.0022792373783886433,
 -0.008468573912978172,
 0.0003963441704399884,
 0.006023935042321682,
 0.01642109453678131,
 0.008250243961811066,
 0.004326912108808756,
 -0.03493287041783333,
 -0.010909905657172203,
 0.00327661051414907,
 0.01659311354160309,
 -0.0003824917657766491,
 -0.019755596294999123,
 0.014965558424592018,
 0.02408250793814659,
 -0.01593150570988655,
 0.026345204561948776,
 -0.013258611783385277,
 0.000781110196840018,
 -0.0011429267469793558,
 -0.01992761343717575,
 -0.0036619966849684715,
 -0.019940845668315887,
 -0.003387429751455784,
 -0.007535708136856556,
 0.005335863213986158,
 0.020800935104489326,
 -0.014886165969073772,
 -0.008012065663933754,
 -0.008713369257748127,
 -0.00920295808

In [89]:
# find the most relevant parts of the video transcript to the query
df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, prompt_embedding))
results = (df.sort_values("similarity", ascending=False)).head(3)
# results = results.set_index(['id'])

In [90]:
display(results)

Unnamed: 0,text,n_tokens,embedding,similarity
78,"and stop.” “Yes, I’ve heard about that,” said ...",1301,"[0.002508266130462289, -0.0182208102196455, 0....",0.860773
68,"Indian; yelling, laughing, chasing boys, jumpi...",1242,"[-0.026282379403710365, -0.02262263558804989, ...",0.858479
172,laugh at this pleasant joke. But the silence w...,1242,"[-0.006196146830916405, -0.011552021838724613,...",0.858168


In [85]:
results["text"].iloc[0]

'and stop.” “Yes, I’ve heard about that,” said Joe. “I wonder what makes the bread do that.” “Oh, it ain’t the bread, so much,” said Tom; “I reckon it’s mostly what they _say_ over it before they start it out.” “But they don’t say anything over it,” said Huck. “I’ve seen ’em and they don’t.” “Well, that’s funny,” said Tom. “But maybe they say it to themselves. Of _course_ they do. Anybody might know that.” The other boys agreed that there was reason in what Tom said, because an ignorant lump of bread, uninstructed by an incantation, could not be expected to act very intelligently when set upon an errand of such gravity. “By jings, I wish I was over there, now,” said Joe. “I do too” said Huck “I’d give heaps to know who it is.” The boys still listened and watched. Presently a revealing thought flashed through Tom’s mind, and he exclaimed: “Boys, I know who’s drownded—it’s us!” They felt like heroes in an instant. Here was a gorgeous triumph; they were missed; they were mourned; hearts w

In [96]:
# example token count from the OpenAI API
limit = 2500

def retrieve(results):

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )

    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )

    prompt = prompt_start + "\n\n---\n\n".join(results["text"]) + prompt_end
    # contexts = [[text] for text in results['text'] ]
    # print (contexts)
    # append contexts until hitting limit
    
    # for i in range(1, len(results)):
        # print (results["text"].iloc[:i])
        # print (len("\n\n---\n\n".join(results["text"].iloc[:i])))
        # if len("\n\n---\n\n".join(results["text"].iloc[:i])) >= limit:
        #     prompt = (
        #         prompt_start +
        #         "\n\n---\n\n".join(results["text"].iloc[:i-1]) +
        #         prompt_end
        #     )
        #     print("here")
        #     break
        # elif i == len(results)-1:
        #     prompt = (
        #         prompt_start +
        #         "\n\n---\n\n".join(results["text"].iloc[i]) +
        #         prompt_end
        #     )
        #     print("no here")
    return prompt

In [97]:
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(results, prompt)
query_with_contexts

'Answer the question based on the context below.\n\nContext:\nand stop.” “Yes, I’ve heard about that,” said Joe. “I wonder what makes the bread do that.” “Oh, it ain’t the bread, so much,” said Tom; “I reckon it’s mostly what they _say_ over it before they start it out.” “But they don’t say anything over it,” said Huck. “I’ve seen ’em and they don’t.” “Well, that’s funny,” said Tom. “But maybe they say it to themselves. Of _course_ they do. Anybody might know that.” The other boys agreed that there was reason in what Tom said, because an ignorant lump of bread, uninstructed by an incantation, could not be expected to act very intelligently when set upon an errand of such gravity. “By jings, I wish I was over there, now,” said Joe. “I do too” said Huck “I’d give heaps to know who it is.” The boys still listened and watched. Presently a revealing thought flashed through Tom’s mind, and he exclaimed: “Boys, I know who’s drownded—it’s us!” They felt like heroes in an instant. Here was a go

In [98]:
messages = [{"role": "system", "content": system},]

query = "How did Tom meet Huck for the first time ?"
prompt_start = ("Answer the question based on the context below.\n\n"+ "Context:\n")
prompt_end = (f"\n\nQuestion: {query}\nAnswer:")
prompt = prompt_start + "\n\n---\n\n".join(results["text"]) + prompt_end

messages.append({"role": "user", "content": prompt})

response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0
        )
response["choices"][0]["message"]["content"]

'The novel does not provide a clear answer to this question.'