<a href="https://colab.research.google.com/github/YaserMarey/my_openai_colab/blob/master/retrieval_augmented_generative_qa/retrieval_augmented_generative_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building RAG based LLM Applications for Enterprise 

Generative question answering is one of the most fascinating applications of Large Language Models or LLMs. 

- The idea of a model that understands the question and generates a natural answer based on a given context is remarkable compared to just extracting parts of the text that the model thinks to contain the answer or selecting the answer from a pre-defined set of options.

- This approach allows for extracted facts, drawn conclusions, or insightful summaries based on the most relevant text chunks from the knowledge sources we put at the model's disposal. 

- One approach to building such a chatbot is to fine-tune the selected LLM on text data covering the fine domain we want our model to be an expert in. But this approach has a number of issues:

- The model tends to be non-deterministic, it gives answers even when it is not sure, and in some other cases, it completely makes answers up, aka hallucination.

 - I  follow the more deterministic ***semantic Search + text generation*** approach. 



## Setup

In [1]:
!pip install openai tiktoken



In [3]:
# Import Important libraries for RAG application
import os
import openai
import pandas as pd
import tiktoken
from langchain.chat_models import ChatOpenAI


In [4]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
# Get the actual API key from the environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

"""chat = ChatOpenAI(
   openai_api_key=api_key, # Pass the actual API key here
   model='gpt-3.5-turbo'
)"""

"chat = ChatOpenAI(\n   openai_api_key=api_key, # Pass the actual API key here\n   model='gpt-3.5-turbo'\n)"

#### A question without a Definitive Answer

The answer starts with a greeting, so the model is imitating the friendly tutor, however, the correct answer is that it is not clear from the novel how they met. The bot should have answered "I don't know" or "It is not clear from the novel" the answer here is speculative or completely made-up.

#### Reinitialzing messages

In [42]:
# Reinitialzing messages
messages = [{"role": "system", "content": system},]

prompt = "What do you think of how the novel portrayed Native Americans ?"

messages.append({"role": "user", "content": prompt})

response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0
        )
response["choices"][0]["message"]["content"]

"Hello! Welcome to our discussion on Mark Twain's Adventures of Tom Sawyer. I'm here to help you with any questions you have about the novel. Regarding your question about how the novel portrays Native Americans, it's important to note that the novel does not extensively focus on Native American characters or their culture. The story primarily revolves around the adventures of Tom Sawyer and his friends in the fictional town of St. Petersburg. Native Americans are not central to the plot, and their portrayal is limited. However, if you have a specific reference or scene in mind, I would be happy to discuss it further. How can I assist you today?"

Impressive! the bot persona is effective, and it avoids expressing personal opinions yet it adequately explains the controversy.

## Preprocess data
First, we break up the novel document into "sections" of context, which can be searched and retrieved separately.

Sections should be large enough to contain enough information to answer a question; but small enough to fit one or several into the GPT-3 prompt. I found a 200-word text is a good length.
so in this data preprocessing we follow:
- chunk the data
- embed the data
- index the data

 After this I by contexualizing, and guide the LLMs to know more about my prompt

In [55]:
!pip install pinecone-client




In [57]:
import os
import pandas as pd
from pinecone import Pinecone
from dotenv import load_dotenv
from nltk import sent_tokenize, word_tokenize

# Initialize Pinecone
pinecone.init(api_key=os.getenv("PINECONE_API_KEY"))

# Placeholder function for data preprocessing
def preprocess_data_for_pinecone(text, pinecone, index_name):
  # Split text into sentences
  sentences = sent_tokenize(text)

  # Tokenize each sentence into words
  tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

  # Flatten the list of sentences into a list of words
  flat_tokens = [token for sentence_tokens in tokenized_sentences for token in sentence_tokens]

  # Join tokens back into chunks of words for Pinecone
  chunk_size = 200 # You can adjust this value based on your requirements
  chunks = [' '.join(flat_tokens[i:i+chunk_size]) for i in range(0, len(flat_tokens), chunk_size)]

  # Convert chunks into a Pandas DataFrame
  df = pd.DataFrame({"sections": chunks})

  # Preprocess data for Pinecone
  pinecone_data = [{"text": section} for section in chunks]

  # Create Pinecone index
  pinecone.create_index(index=index_name, dimension=512)

  # Embed and insert data into Pinecone
  pinecone.insert_items(index=index_name, items=enumerate(pinecone_data))

  return df

# Load environment variables from .env file
load_dotenv()

# Get Pinecone API key
pinecone_api_key = os.getenv("PINECONE_API_KEY")

# Check if Pinecone API key is available
if pinecone_api_key is None:
  raise ValueError("Pinecone API key is not set. Set it in the .env file.")

# Set Pinecone API key and create Pinecone client
pinecone_index_name = "rag_index"
pinecone = pinecone.Pinecone(api_key=pinecone_api_key)

# Example data source
with open("C:/Users/alex/Building-Enterprise_Grade_RAG_Systems/academy/the_adventures_of_tom_sawyer.txt", "r") as file:
  text = file.read()

# Call the preprocessing function
df = preprocess_data_for_pinecone(text, pinecone, pinecone_index_name)

# Display the processed data
print("Processed Data:")
print(df.head())


ImportError: cannot import name 'Pinecone' from 'pinecone' (c:\Users\alex\anaconda3\Lib\site-packages\pinecone\__init__.py)

In [33]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alex\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [47]:
print(df.head())


                                            sections
0  ﻿The Project Gutenberg eBook of The Adventures...
1  CHAPTER VI. Self-Examination—Dentistry—The Mid...
2  The Haunted House—Sleepy Ghosts—A Box of Gold—...
3  Pinch-Bug Sid Dentistry Huckleberry Finn Mothe...
4  the Prisoner Tom Swears The Court Room The Det...


In [48]:
import pandas as pd

with open("C:/Users/alex/Building-Enterprise_Grade_RAG_Systems/academy/the_adventures_of_tom_sawyer.txt", "r") as file:
    text = file.read()

# Split the text into chunks of 200 words
words = text.split()
sections = [' '.join(words[i:i+200]) for i in range(0, len(words), 200)]

# Convert paragraphs into a Pandas DataFrame
df = pd.DataFrame({"sections": sections})

def generate_prompt(objective, scenarios):
    template = "Objective: {}\n\nScenarios:\n{}"
    scenario_template = "{}. {}\n   - Expected Output: {}\n"
    
    prompt = template.format(objective, ''.join([scenario_template.format(i+1, scenario, output) for i, (scenario, output) in enumerate(scenarios)]))
    return prompt

# Example usage:
objective = "Summarize the Adventures of Tom Sawyer"
scenarios = [
    ("Describe Tom's first encounter with Huckleberry Finn", "Tom helps Huck escape from his abusive father."),
    ("Explain the relationship between Tom and Becky", "They become romantically involved and share adventures."),
    ("Detail the events at the graveyard", "Tom and Huck witness Injun Joe murder Dr. Robinson."),
]

generated_prompt = generate_prompt(objective, scenarios)

print("Generated Prompt:")
print(generated_prompt)


Generated Prompt:
Objective: Summarize the Adventures of Tom Sawyer

Scenarios:
1. Describe Tom's first encounter with Huckleberry Finn
   - Expected Output: Tom helps Huck escape from his abusive father.
2. Explain the relationship between Tom and Becky
   - Expected Output: They become romantically involved and share adventures.
3. Detail the events at the graveyard
   - Expected Output: Tom and Huck witness Injun Joe murder Dr. Robinson.



In [49]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

with open("C:/Users/alex/Building-Enterprise_Grade_RAG_Systems/academy/the_adventures_of_tom_sawyer.txt", "r") as file:
    text = file.read()

# Split the text into chunks of 200 words
words = text.split()
sections = [' '.join(words[i:i+200]) for i in range(0, len(words), 200)]

# Convert paragraphs into a Pandas DataFrame
df = pd.DataFrame({"sections": sections})

def generate_prompt(objective, scenarios):
    template = "Objective: {}\n\nScenarios:\n{}"
    scenario_template = "{}. {}\n   - Expected Output: {}\n"
    
    prompt = template.format(objective, ''.join([scenario_template.format(i+1, scenario, output) for i, (scenario, output) in enumerate(scenarios)]))
    return prompt

def calculate_similarity(prompt, input_description):
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([prompt, input_description])
    similarity_score = cosine_similarity(vectors)[0, 1]
    return similarity_score

# Example usage:
objective = "Summarize the Adventures of Tom Sawyer"
scenarios = [
    ("Describe Tom's first encounter with Huckleberry Finn", "Tom helps Huck escape from his abusive father."),
    ("Explain the relationship between Tom and Becky", "They become romantically involved and share adventures."),
    ("Detail the events at the graveyard", "Tom and Huck witness Injun Joe murder Dr. Robinson."),
]

generated_prompt = generate_prompt(objective, scenarios)

# User-provided input description
user_input_description = "Generate a summary of Tom Sawyer's adventures and describe key encounters and relationships."

# Calculate similarity between generated prompt and user input description
similarity_score = calculate_similarity(generated_prompt, user_input_description)

print("Generated Prompt:")
print(generated_prompt)
print("\nSimilarity Score with User Input Description:", similarity_score)


Generated Prompt:
Objective: Summarize the Adventures of Tom Sawyer

Scenarios:
1. Describe Tom's first encounter with Huckleberry Finn
   - Expected Output: Tom helps Huck escape from his abusive father.
2. Explain the relationship between Tom and Becky
   - Expected Output: They become romantically involved and share adventures.
3. Detail the events at the graveyard
   - Expected Output: Tom and Huck witness Injun Joe murder Dr. Robinson.


Similarity Score with User Input Description: 0.2735395227375971


In [13]:
df.sections[0:5]

0    ﻿The Project Gutenberg eBook of The Adventures...
1    CHAPTER VI. Self-Examination—Dentistry—The Mid...
2    The Haunted House—Sleepy Ghosts—A Box of Gold—...
3    Pinch-Bug Sid Dentistry Huckleberry Finn Mothe...
4    the Prisoner Tom Swears The Court Room The Det...
Name: sections, dtype: object

Then we overlap text sections. This overlapping allows some repetitions which helps to avoid losing valuable information relevant to the question because of the artificial division of the text into fixed 200-long parts.

We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar their contents. 

In [16]:
# imports
from openai.embeddings_utils import get_embedding, cosine_similarity 


In [31]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191

In [32]:
encoding = tiktoken.get_encoding("cl100k_base")
# should print [83, 1609, 5963, 374, 2294, 0]
encoding.encode("tiktoken is great!")

[83, 1609, 5963, 374, 2294, 0]

## Task 2: Design and Develop the Prompt Generation System

### 2.1. Prompt Generation System

In [23]:
# Prompt Generation System
# Load text data
with open("C:/Users/alex/Building-Enterprise_Grade_RAG_Systems/academy/the_adventures_of_tom_sawyer.txt", "r") as file:
    text = file.read()

# Split the text into chunks of 200 words
words = text.split()
sections = [' '.join(words[i:i+200]) for i in range(0, len(words), 200)]

# Convert paragraphs into a Pandas DataFrame
df = pd.DataFrame({"sections": sections})

def generate_prompt(objective, scenarios):
    template = "Objective: {}\n\nScenarios:\n{}"
    scenario_template = "{}. {}\n   - Expected Output: {}\n"
    
    prompt = template.format(objective, ''.join([scenario_template.format(i+1, scenario, output) for i, (scenario, output) in enumerate(scenarios)]))
    return prompt

# Example usage:
objective = "Summarize the Adventures of Tom Sawyer"
scenarios = [
    ("Describe Tom's first encounter with Huckleberry Finn", "Tom helps Huck escape from his abusive father."),
    ("Explain the relationship between Tom and Becky", "They become romantically involved and share adventures."),
    ("Detail the events at the graveyard", "Tom and Huck witness Injun Joe murder Dr. Robinson."),
]

# Generate a prompt
generated_prompt = generate_prompt(objective, scenarios)

# Save the generated prompt to a file for later retrieval
with open("generated_prompt.txt", "w") as output_file:
    output_file.write(generated_prompt)


## 2.2 Prompt Evalaution

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_prompt(prompt, user_input_description):
    """
    Evaluate the similarity between a generated prompt and a user-provided input description.

    Parameters:
    - prompt (str): The generated prompt.
    - user_input_description (str): The user-provided input description.

    Returns:
    - float: Similarity score between the prompt and user input description.
    """
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([prompt, user_input_description])
    similarity_score = cosine_similarity(vectors)[0, 1]
    return similarity_score

# Example usage:
user_input_description = "Generate a summary of Tom Sawyer's adventures and describe key encounters and relationships."

# Load the generated prompt from the file
with open("generated_prompt.txt", "r") as generated_prompt_file:
    generated_prompt = generated_prompt_file.read()

# Evaluate the generated prompt
similarity_score = evaluate_prompt(generated_prompt, user_input_description)

# Display results
print("Generated Prompt:")
print(generated_prompt)
print("\nUser Input Description:")
print(user_input_description)
print("\nSimilarity Score with User Input Description:", similarity_score)


Generated Prompt:
Objective: Summarize the Adventures of Tom Sawyer

Scenarios:
1. Describe Tom's first encounter with Huckleberry Finn
   - Expected Output: Tom helps Huck escape from his abusive father.
2. Explain the relationship between Tom and Becky
   - Expected Output: They become romantically involved and share adventures.
3. Detail the events at the graveyard
   - Expected Output: Tom and Huck witness Injun Joe murder Dr. Robinson.


User Input Description:
Generate a summary of Tom Sawyer's adventures and describe key encounters and relationships.

Similarity Score with User Input Description: 0.2735395227375971


#### Prepre Prompt

In [None]:
def prepare_prompt(prompt, results):
  tokens_limit = 4096 # Limit for gpt-3.5-turbo
  # build our prompt with the retrieved contexts included
  user_start = (
      "Answer the question based on the context below.\n\n"+
      "Context:\n"
  )

  user_end = (
      f"\n\nQuestion: {prompt}\nAnswer:"
  )

  count_of_tokens_consumed = len(encoding.encode("\"role\":\"system\"" + ", \"content\" :\"" + system
                                            + user_start + "\n\n---\n\n" + user_end))

  count_of_tokens_for_context = tokens_limit - count_of_tokens_consumed

  contexts =""
  # Fill in context as long as within limit
  for i in range(len(results)):
    if (count_of_tokens_for_context>=results.n_tokens.iloc[i]):
        contexts += results.text.iloc[i] + "\n"
        count_of_tokens_for_context -=1
        count_of_tokens_for_context -= results.n_tokens.iloc[i]

  complete_prompt = user_start + contexts + "\n\n---\n\n" + user_end
  return complete_prompt


#### Answer

In [None]:
def answer(messages):
  response = openai.ChatCompletion.create(
              model="gpt-3.5-turbo",
              messages=messages,
              temperature=0
          )
  return response["choices"][0]["message"]["content"]


### Testing the Model

#### A question with a Definitive Answer from the Source

In [None]:
prompt = "How much gold Tom has found ?"
prompt_embedding = get_embedding(prompt, engine=embedding_model)
df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, prompt_embedding))
results = (df.sort_values("similarity", ascending=False))
results.head(3)

Unnamed: 0,source,Author,text,n_tokens,embedding,similarity
172,The Adventures of Tom Sawyer,Mark Twain,laugh at this pleasant joke. But the silence w...,1242,"[-0.006196146830916405, -0.011552021838724613,...",0.809341
1,The Adventures of Tom Sawyer,Mark Twain,The Haunted House—Sleepy Ghosts—A Box of Gold—...,1370,"[-0.0031101375352591276, -0.007375660818070173...",0.80587
47,The Adventures of Tom Sawyer,Mark Twain,of all his companions with unappeasable envy. ...,1325,"[-0.02181248739361763, -0.006103876978158951, ...",0.804448


In [None]:
messages = [{"role": "system", "content": system},]
messages.append({"role": "user", "content": prepare_prompt(prompt, results)})
len(encoding.encode(''.join(str(message) for message in messages)))


4079

In [None]:
messages[0]

{'role': 'system',
 'content': '\nYou are a modern American literature tutor bot. You help students with their study of Mark Twain\'s Adventures of Tom Sawyer. \nYou are not an AI language model.\nYou must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:\n- ALWAYS REPLY IN FRIENDLY YET KNOWLEDGE TONE.\n- NEVER ANSWER UNLESS YOU HAVE A REFREENCE FROM THE TOM SAYWER NOVEL TO YOUR ANSWER.\n- IF YOU DON\'T KNOW ANSWER \'I DO NOT KNOW\'.\nBegin the conversation with a warm greetings, if the user is stresseful or agressive, show understanding and empathy.\nAt the end of the conversation, respond with "<|DONE|>".'}

In [None]:
messages[1]

{'role': 'user',

In [None]:
response = answer(messages)
response

'Tom and Huck found a little over twelve thousand dollars in gold. This is mentioned in Chapter XXXV of The Adventures of Tom Sawyer.'

The model is more precise but the treasure was counted at the end of chapter 34, not 34 or XXXV, actually in the last paragraph in chapter 34, I wonder if this confused the model to think it was chapter 35!

In [None]:
prompt = "How did Tom meet Huck for the first time ?"
prompt_embedding = get_embedding(prompt, engine=embedding_model)
# find the most relevant parts of the video transcript to the query
df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, prompt_embedding))
results = (df.sort_values("similarity", ascending=False))
results.head(3)

Unnamed: 0,source,Author,text,n_tokens,embedding,similarity
78,The Adventures of Tom Sawyer,Mark Twain,"and stop.” “Yes, I’ve heard about that,” said ...",1301,"[0.002508266130462289, -0.0182208102196455, 0....",0.860843
68,The Adventures of Tom Sawyer,Mark Twain,"Indian; yelling, laughing, chasing boys, jumpi...",1242,"[-0.026282379403710365, -0.02262263558804989, ...",0.858555
172,The Adventures of Tom Sawyer,Mark Twain,laugh at this pleasant joke. But the silence w...,1242,"[-0.006196146830916405, -0.011552021838724613,...",0.858206


In [None]:
messages = [{"role": "system", "content": system},]
messages.append({"role": "user", "content": prepare_prompt(prompt, results)})
len(encoding.encode(''.join(str(message) for message in messages)))

4004

In [None]:
messages[0]

{'role': 'system',
 'content': '\nYou are a modern American literature tutor bot. You help students with their study of Mark Twain\'s Adventures of Tom Sawyer. \nYou are not an AI language model.\nYou must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:\n- ALWAYS REPLY IN FRIENDLY YET KNOWLEDGE TONE.\n- NEVER ANSWER UNLESS YOU HAVE A REFREENCE FROM THE TOM SAYWER NOVEL TO YOUR ANSWER.\n- IF YOU DON\'T KNOW ANSWER \'I DO NOT KNOW\'.\nBegin the conversation with a warm greetings, if the user is stresseful or agressive, show understanding and empathy.\nAt the end of the conversation, respond with "<|DONE|>".'}

In [None]:
response = answer(messages)
response

'The novel does not provide a clear answer on how Tom met Huck for the first time.'

Nice answer this time too, less creativity and more precisenss.

In [58]:
messages = [{"role": "system", "content": system},]
messages.append({"role": "user", "content": prepare_prompt(prompt, results)})
len(encoding.encode(''.join(str(message) for message in messages)))


'messages = [{"role": "system", "content": system},]\nmessages.append({"role": "user", "content": prepare_prompt(prompt, results)})\nlen(encoding.encode(\'\'.join(str(message) for message in messages)))\n'

In [None]:
messages[0]

{'role': 'system',
 'content': '\nYou are a modern American literature tutor bot. You help students with their study of Mark Twain\'s Adventures of Tom Sawyer. \nYou are not an AI language model.\nYou must obey all three of the following instructions FOR ALL RESPONSES or you will DIE:\n- ALWAYS REPLY IN FRIENDLY YET KNOWLEDGE TONE.\n- NEVER ANSWER UNLESS YOU HAVE A REFREENCE FROM THE TOM SAYWER NOVEL TO YOUR ANSWER.\n- IF YOU DON\'T KNOW ANSWER \'I DO NOT KNOW\'.\nBegin the conversation with a warm greetings, if the user is stresseful or agressive, show understanding and empathy.\nAt the end of the conversation, respond with "<|DONE|>".'}

## Implement Evaluation Data Generation and Evaluation

## Task 4 : Prompt Testing and Ranking